I recently added support for ColPali image search to the ColBERT Live! library. This post is going to skip over the introduction to ColPali and how it works; please check out Antaripa Saha's excellent article for that. TL;DR, ColPali allows you to natively compare text queries with image-based documents, with accuracy that beats the previous state of the art. ("Natively" means there's no extract-to-text pipeline involved.)
Adding ColPali to ColBERT Live!
The "Col" in ColPali refers to performing maxsim-based "late interaction" relevance as seen in ColBERT. Since ColBERT Live! already abstracts away the details of computing embedding vectors, it is straightforward to add support for ColPali / ColQwen by implementing an appropriate Model subclass.
However, ColBERT Live!'s default parameters were initially tuned for text search. To be able to give appropriate guidance for image search, I ran a grid search on the ViDoRe benchmark that was created by the ColPali authors, like this:
# Use the provided doc_pool
doc_pool=$1

# Arrays for other parameters
VIDORE_QUERY_POOL=(0.01 0.02 0.03)
VIDORE_N_ANN=(40 80)
VIDORE_N_MAXSIM=(10 15 20 25)

# Loop through all combinations
for query_pool in "${VIDORE_QUERY_POOL[@]}"; do
  for n_ann in "${VIDORE_N_ANN[@]}"; do
    for n_maxsim in "${VIDORE_N_MAXSIM[@]}"; do
      VIDORE_DOC_POOL=$doc_pool VIDORE_QUERY_POOL=$query_pool VIDORE_N_ANN=$n_ann VIDORE_N_MAXSIM=$n_maxsim \
        vidore-benchmark evaluate-retriever --model-name colbert_live \
          --collection-name "vidore/vidore-benchmark-667173f98e70a1c0fa4db00d" --split test
      echo "----------------------------------------"
    done
  done
done
(I first ran a larger grid against a single dataset, which informed the ranges of parameters evaluated here for the full collection; in particular, disabling query pooling entirely at 0.0 almost always underperformed, so I left that out of the full grid.)
In the interest of time, each pass against the full collection was evaluated using 10% of the test queries.
Here are the results for speed (QPS) vs relevance (NDCG@5). I've omitted the synthetic datasets since they are much easier, to the point of being uninformative – that is, many combinations of parameters score a perfect 1.0. See the ColPali paper for details on how the datasets were created.
Drilling into the results grouped by doc pool factor (charts at the end), we can summarize as follows:
- arxivqa: dpool=3 effectively ties with dpool=1. dpool=2 also comes close.
- docvqa: every pooling option loses significant accuracy compared to dpool=1 (dpool=2 and dpool=3 come closest, maxing out at 0.596 vs 0.678)
- infovqa: dpool=2 effectively ties with dpool=1
- shiftproject: dpool=2 and dpool=3 come close to dpool=1
- tabfquad: all the pooled indexes outperform dpool=1 by significant margins (maxing out at 0.974 for dpool=3 vs 0.925 for dpool=1)
- tatdqa: dpool=2 and dpool=3 outscore dpool=1 (0.784 vs 0.753); the others lag behind
Grouping by query embedding pooling, we see:
- arxivqa: qpool=0.03 notches slightly better NDCG than qpool=0.01 while being 3x faster (when paired with dpool=3)
- docvqa: qpool=0.03 maxes out at significantly higher NDCG than 0.02 or 0.01
- infovqa: max relevance effectively tied, but qpool=0.01 gets there with dpool=2, so it is 2x faster
- shiftproject: qpool=0.03 is close to strictly superior to the others, hitting higher throughput at the same relevance by doing better with higher dpool values
- tabfquad: a mixed bag; each of the three qpool options does best at some points on the QPS/NDCG tradeoff curve
- tatdqa: this time it's qpool=0.01 that is (nearly) strictly superior to the others by taking advantage of higher dpool
On balance, I think it's reasonable to leave the same defaults in place as with text-based ColBERT (doc pooling 2, query pooling 0.03). But one size does not fit all with ColPali: if you can build an evaluation dataset of your own, you should, because there is no other way to predict what the best tradeoff is for any given use case.
I've also left the defaults for n_ann_docs unchanged, even though ViDoRe sees better results with much smaller values (40 or 80, compared to 240 or more for BEIR). I believe this is simply the result of testing against a much smaller dataset: when n_ann_docs is a significant fraction of the dataset size, you end up giving "partial credit" for matches to images that in fact are not relevant.
Under the hood: PyTorch vs C++ for maxsim computation
The libraries for ColBERT and ColPali (colbert-ai and colpali-engine, respectively) have a lot in common, since they are both doing multi-vector search (and ColPali is explicitly inspired by ColBERT). However, they have different approaches to computing the maxsim score between query and document vectors:
- ColBERT is designed to score a single query at a time; ColPali is designed to score batches
- ColBERT uses PyTorch dot product + sum methods on GPU, and a custom C++ extension on CPU
- ColPali uses PyTorch einsum for both GPU and CPU (both formulations are sketched below)
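To make the difference concrete, here is a minimal sketch of both formulations on the same random inputs. The shapes and the exact einsum subscripts are my own illustration rather than code lifted from either library, and both versions assume queries and documents have already been padded to uniform lengths.

import torch

# Toy shapes: 8 queries of 32 token vectors, 20 documents of 256 patch
# vectors, 128-dimensional embeddings (assumed already normalized).
q = torch.randn(8, 32, 128)
d = torch.randn(20, 256, 128)

# einsum formulation (the ColPali style): similarity of every query token to
# every document token, then max over document tokens, sum over query tokens.
scores_einsum = torch.einsum("qnd,csd->qcns", q, d).max(dim=3).values.sum(dim=2)

# dot-product + sum formulation (the ColBERT style), written here with a
# broadcasted matmul so that it also scores the whole batch at once.
sim = q.unsqueeze(1) @ d.unsqueeze(0).transpose(-1, -2)  # (8, 20, 32, 256)
scores_matmul = sim.max(dim=-1).values.sum(dim=-1)

assert torch.allclose(scores_einsum, scores_matmul, atol=1e-5)

Both produce the same (num_queries, num_docs) score matrix; the difference is purely in how the computation is expressed.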
At first I thought that of course it would be better to use ColBERT's C++ extension on CPU for both text and image search. But testing surprised me:
- The einsum approach was just as fast as the custom C++ on CPU
- CPU is not appreciably slower than GPU at computing maxsim scores
The second finding is not enough for me to recommend running image searches on CPU only (since you still have to compute query embeddings, which is faster on GPU), but it does highlight that the overhead of moving data to the GPU and launching CUDA kernels is very high, even for operations like massively parallel dot products and sums that GPUs are good at.
But this was a useful finding when computing the ViDoRe grid: instead of needing to load a large model into the GPU for each data point, I was able to precompute the query embeddings a single time, and then do all the scoring on CPU.
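Here is a minimal sketch of that pattern, with a random stand-in for the real query encoder and for the pooled document embeddings; only the encoding step touches the GPU, and everything after it stays on CPU.

import torch

# Hypothetical stand-in for the query encoder; in the real grid runs this is
# the ColPali/ColQwen model, and it is the only step that benefits from a GPU.
def encode_queries(queries: list[str]) -> list[torch.Tensor]:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return [torch.randn(32, 128, device=device) for _ in queries]

def maxsim_cpu(q: torch.Tensor, docs: torch.Tensor) -> torch.Tensor:
    # q: (q_len, dim), docs: (num_docs, d_len, dim) -> (num_docs,) scores
    return torch.einsum("nd,csd->cns", q, docs).max(dim=2).values.sum(dim=1)

queries = ["how is revenue recognized?", "what does the chart on page 3 show?"]
doc_embeddings = torch.randn(1000, 256, 128)  # toy pooled document embeddings

# 1. Heavy step, done exactly once: encode the queries, then move them off the GPU.
query_embeddings = [q.cpu().float() for q in encode_queries(queries)]

# 2. Cheap step, repeated for every grid point: rescore candidates on CPU.
for q in query_embeddings:
    top5 = maxsim_cpu(q, doc_embeddings).topk(5).indices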