
Adding ColPali to ColBERT Live!

I recently added support for ColPali image search to the ColBERT Live! library. This post is going to skip over the introduction to ColPali and how it works; please check out Antaripa Saha's excellent article for that. TLDR, ColPali allows you to natively compare text queries with image-based documents, with accuracy that beats the previous state of the art. ("Natively" means there's no extract-to-text pipeline involved.)


The "Col" in ColPali refers to performing maxsim-based "late interaction" relevance as seen in ColBERT. Since ColBERT Live! already abstracts away the details of computing embedding vectors, it is straightforward to add support for ColPali / ColQwen by implementing an appropriate Model subclass.

However, ColBERT Live!'s default parameters were initially tuned for text search. To give appropriate guidance for image search, I ran a grid search on the ViDoRe benchmark created by the ColPali authors, like this:

#!/bin/bash
# Doc pooling factor is passed as the single argument
doc_pool=$1

# Parameter ranges to sweep
VIDORE_QUERY_POOL=(0.01 0.02 0.03)
VIDORE_N_ANN=(40 80)
VIDORE_N_MAXSIM=(10 15 20 25)

# Evaluate every combination of query pooling, ANN candidates, and maxsim candidates
for query_pool in "${VIDORE_QUERY_POOL[@]}"; do
    for n_ann in "${VIDORE_N_ANN[@]}"; do
        for n_maxsim in "${VIDORE_N_MAXSIM[@]}"; do
            VIDORE_DOC_POOL=$doc_pool VIDORE_QUERY_POOL=$query_pool VIDORE_N_ANN=$n_ann VIDORE_N_MAXSIM=$n_maxsim \
            vidore-benchmark evaluate-retriever --model-name colbert_live --collection-name "vidore/vidore-benchmark-667173f98e70a1c0fa4db00d" --split test
            echo "----------------------------------------"
        done
    done
done
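(I invoked the script once per doc pooling factor under test, e.g. ./grid.sh 2, where grid.sh is whatever name you've saved it under.)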

(I first ran a larger grid against a single dataset, which informed the ranges of parameters evaluated here for the full collection; in particular, disabling query pooling entirely at 0.0 almost always underperformed, so I left that out of the full grid.)

In the interest of time, each pass against the full collection was evaluated using 10% of the test queries.

Here are the results plotting speed (QPS) against relevance (NDCG@5). I've omitted the synthetic datasets since they are much easier, to the point of being uninformative – that is, many, many combinations of parameters score a perfect 1.0. See the ColPali paper for details on how the datasets were created.

Results graph

Drilling into the results grouped by doc pool factor (charts at the end), we can summarize as follows:

  • arxivqa: dpool=3 effectively ties with dpool=1. dpool=2 also comes close.
  • docvqa: all pooling options lose significant accuracy compared to dpool=1 (dpool=2 and dpool=3 come closest, maxing out at 0.596 vs 0.678)
  • infovqa: dpool=2 effectively ties with dpool=1
  • shiftproject: dpool=2 and dpool=3 come close to dpool=1
  • tabfquad: all the pooled indexes outperform dpool=1 by significant margins (maxing out at 0.974 for dpool=3 vs 0.925 for dpool=1)
  • tatdqa: dpool=2 and dpool=3 outscore dpool=1 (0.784 vs 0.753); the others lag behind

Grouping by query embedding pooling distance, we see:

  • arxivqa: qpool=0.03 notches slightly better NDCG than qpool=0.01 while being 3x faster (when paired with dpool=3)
  • docvqa: qpool=0.03 maxes out at significantly higher NDCG than 0.02 or 0.01
  • infovqa: max relevance effectively tied, but qpool=0.01 gets there with dpool=2, so it is 2x faster
  • shiftproject: qpool=0.03 is close to strictly superior to the others, hitting higher throughput at the same relevance by doing better with higher dpool values
  • tabfquad: a mixed bag, all three qpool options do better at some points on the qps/ndcg tradeoff curve
  • tatdqa: this time it's qpool=0.01 that is (nearly) strictly superior to the others by taking advantage of higher dpool

On balance, I think it's reasonable to leave the same defaults in place as with text-based ColBERT (doc pooling 2, query pooling 0.03). But one size does not fit all with ColPali: if you can build an evaluation dataset of your own, you should do so; there is no other way to predict what the best tradeoff is for any given use case.

I've also left the defaults for n_ann_docs unchanged, even though ViDoRe sees better results with much smaller values (40 or 80, compared to 240 or more for BEIR). I believe this is simply the result of testing against a much smaller dataset: when n_ann_docs is a significant fraction of the dataset size, you end up giving "partial credit" for matches to images that in fact are not relevant.

Under the hood: PyTorch vs C++ for maxsim computation

The libraries for ColBERT and ColPali (colbert-ai and colpali-engine, respectively) have a lot in common, since they are both doing multi-vector search (and ColPali is explicitly inspired by ColBERT). However, they have different approaches to computing the maxsim score between query and document vectors:

  1. ColBERT is designed to score a single query at a time; ColPali is designed to score batches
  2. ColBERT uses PyTorch dot product + sum methods on GPU, and a custom C++ extension on CPU
  3. ColPali uses PyTorch einsum for both GPU and CPU (see the sketch just below)
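Here's a minimal sketch of that einsum-based batched maxsim, assuming the query and document embeddings have already been padded to fixed lengths (colpali-engine's actual scorer also takes care of padding and chunking large batches):

import torch

def maxsim_scores(q: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    # q: (num_queries, query_tokens, dim) query token embeddings
    # d: (num_docs, doc_tokens, dim) document patch embeddings
    # similarity of every query token to every doc embedding, for every pair
    sim = torch.einsum("qnd,csd->qcns", q, d)
    # max over doc embeddings, then sum over query tokens: maxsim
    return sim.max(dim=3).values.sum(dim=2)

The same few lines run unchanged on CPU or GPU tensors, which is what makes the comparison below easy to set up.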

At first I thought that of course it would be better to use ColBERT's C++ extension on CPU for both text and image search. But testing surprised me:

  1. The einsum approach was just as fast as the custom C++ on CPU
  2. CPU is not appreciably slower than GPU at computing maxsim scores

(2) is not enough for me to recommend running image searches on CPU only (since you still have to compute query embeddings, which is faster on GPU), but it does highlight that the overhead of moving data to the GPU and starting CUDA kernels is very high, even for operations like massively parallel dot products and sums that GPUs are good at.

But this was a useful finding when computing the ViDoRe grid: instead of needing to load a large model into the GPU for each data point, I was able to precompute the query embeddings a single time, and then do all the scoring on CPU.
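Concretely, the pattern was roughly the following; the cache file name is hypothetical, and encode_query stands in for whatever produces the per-query token embeddings:

import torch
from pathlib import Path

CACHE = Path("vidore_query_embeddings.pt")  # hypothetical cache file

def cached_query_embeddings(model, queries):
    # Compute the GPU-heavy query embeddings once, then reuse the cached
    # copies for every grid point and run the maxsim scoring on CPU.
    if CACHE.exists():
        return torch.load(CACHE, map_location="cpu")
    embeddings = [model.encode_query(q).cpu() for q in queries]
    torch.save(embeddings, CACHE)
    return embeddings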

Grid search results grouped by doc pool factor

[Charts: tatdqa, tabfquad, shiftproject, infovqa, docvqa, arxivqa]

Grid search results grouped by query pool distance

[Charts: one per dataset, grouped by query pool distance]