tag:blogger.com,1999:blog-116837132024-03-09T18:45:53.755-08:00SpycedJonathan Ellishttp://www.blogger.com/profile/11003648392946638242noreply@blogger.comBlogger218125tag:blogger.com,1999:blog-11683713.post-70221189777971648812021-09-06T07:13:00.031-07:002022-05-13T07:11:49.638-07:00A review of Lambda School from the father of a recent graduate<h1 style="text-align: left;"><span style="background-color: transparent; color: black;">Background</span></h1><p></p><p><span style="background-color: transparent; color: black;">I’ve been a professional developer for twenty years. I exposed my son N to programming a couple times while he was growing up -- Scratch when he was around 8, Khan Academy javascript when he was 12. He learned it easily enough but it didn’t grab him.</span></p><p></p><p><span style="background-color: transparent; color: black;">But his junior year in high school he had a hole in his schedule and I convinced him to try AP CS to fill it. And this time, he got hooked. He started programming for fun in the evenings. You know how it goes.</span></p><p></p><p><span style="background-color: transparent; color: black;">Then in March 2020, Covid hit and his high school went virtual. It was a terrible experience, to the point that instead of going back for more his senior year, he took the last classes he needed to graduate over the summer, and decided to apply to programming boot camps in the fall. 
I think </span><a href="https://www.amazon.com/Case-against-Education-System-Waste/dp/0691174652" style="background-color: transparent; color: #1155cc;" target="_blank">the American college system is broken</a><span style="background-color: transparent; color: black;">, so I was happy to help evaluate his options for something different.</span></p><p></p><h1><span style="background-color: transparent; color: black;">Evaluating boot camps</span></h1><p></p><p><span style="background-color: transparent; color: black;">N and I came up with three criteria for evaluating boot camps. If they didn’t meet these three, we weren’t interested:</span></p><ol><li><span style="background-color: transparent; color: black;">Income-sharing agreement, or similar. The incentives for the school to, in effect, take your money and run are very very strong in for-profit education. ISA means they only get paid if you get a job. This creates a simple but powerful alignment.</span></li><li><span style="background-color: transparent; color: black;">Modern curriculum. If they’re still teaching Ruby on Rails, we’ll pass.</span></li><li><span style="background-color: transparent; color: black;">Pre-work. If a school will admit anyone who applies with a shotgun-style approach to student success, that’s not a good model. It’s a much better sign if they have a rigorous set of pre-admission work to demonstrate some level of interest and aptitude. I spent a little over a year teaching college level CS classes, so I know that for whatever reason programming just doesn’t fit everyone's brain.</span></li></ol><p></p><p><span style="background-color: transparent; color: black;">Our shortlist was App Academy, Hack Reactor, Rithm, and Lambda School. App Academy, Rithm, and Hack Reactor hit all three of the above criteria. 
Lambda School did not have pre-work as </span><span style="background-color: transparent; color: black;"><span style="background-color: transparent; color: black;">rigorous as the others</span>, but they offered the longest curriculum so we thought that could make up for it. (However, Lambda changed from a 9 month course to 6 months just before N applied.)</span></p><p></p><p><span style="background-color: transparent; color: black;">While applying, we found that App Academy had some fine print that they would not do an ISA for students under 20, so we took them off the list. </span></p><p></p><p><span style="background-color: transparent; color: black;">N was accepted to Rithm, Hack Reactor, and Lambda. He decided on Lamba primarily because they have been online-only from the beginning so we thought they probably had an edge over Hack Reactor and Rithm, which had been primarily (HR) or entirely (Rithm) in-person before the pandemic.</span></p><p></p><h1><span style="background-color: transparent; color: black;">Lamba School</span></h1><p><span style="background-color: transparent; color: black;">Lambda (now renamed to Bloom Technology) did a pretty competent job preparing N to be an entry level web developer. Enough HTML, CSS, Node.js, React, and PostgreSQL to be dangerous, plus the basics of Git and Bash. A good foundation that he can build on.</span></p><p></p><p><span style="background-color: transparent; color: black;">Lambda School’s curriculum is a series of six month-long units. After the first unit, most of the coursework was set up to prepare the students to tackle fairly meaty projects done in teams of five or six, divided into presentation layer, front end code (Node), and back end code (SQL). These divisions are by experience level at Lambda. So students X, Y, and Z would work on a project, then next month Z would graduate, X and Y would rotate positions, and W would join as the new guy. 
Multiply this by two to get the six person teams.</span></p><p></p><p><span style="background-color: transparent; color: black;">The quality of instruction was overall solid, but if something went wrong like a version incompatibility with node, getting help troubleshooting depended on who you asked.</span></p><p></p><p><span style="background-color: transparent; color: black;">After the first four units came a month of CS subjects like binary trees and recursion, and then for the final month, they got back into teams for a capstone project. N’s team revamped a web app for a tiny nonprofit. It was good experience, he learned a lot about understanding what an existing system did and how to rebuild it while keeping the good parts.</span></p><p></p><p><span style="background-color: transparent; color: black;">So on balance: while I totally get the standpoint that there is high quality instructional and reference material across the Internet for free or a much lower cost than Lambda School, I think that between the actual instruction, the accountability from a formal curriculum, the project work, and the real-world capstone with the nonprofit, Lambda delivered value for what N is paying.</span></p><p></p><h1><span style="background-color: transparent; color: black;">Post graduation</span></h1><p><span style="background-color: transparent; color: black;">So the coursework and instruction at Lambda was well done. Unfortunately, Lambda came up short in several areas of helping N find a job.</span></p><ol><li><span style="background-color: transparent; color: black;">Before finishing the coursework, there was little communication on how the job application process would go, what to expect, or how to prioritize your time. </span></li><li><span style="background-color: transparent; color: black;">Resume building was a mixed bag. 
They did help N create a plaintext resume, and they coached the students on how to present their prior, non-programming experience, but they did not help create a rich text version for human reviewers.</span></li><li><span style="background-color: transparent; color: black;">N doesn’t know how you get picked for one of the sexy new programs like <a href="https://twitter.com/austen/status/1329539767969615873?lang=en">Lambda Fellows</a>, nobody talked about that and nobody he knows was included.<br /></span></li><li><span style="background-color: transparent; color: black;">N graduated right in the middle of traditional summer intern season, which Lambda ignored completely. Seems like a missed opportunity at the very least -- surely most of the Lambda grads would prefer a paid internship to months of applying to developer positions while working another job.</span></li><li><span style="background-color: transparent; color: black;">Most importantly: after graduating, Lambda emailed N just five job openings over a period of two months to say, contact them here if you’re interested. That’s it, that’s the extent of their post-graduation job search support. (None of the five replied to N’s application.)</span></li></ol><p></p><p><span style="background-color: transparent; color: black;">On the positive side, N thinks Lambda had a great program for interview coaching. They did multiple rounds of one-on-one mock interviews to help the students get used to the kinds of questions they could expect in the interview process.</span></p><p></p><p><span style="background-color: transparent; color: black;">N ended up finding a job through my network, several of whom were willing to interview a new boot camp grad. (Thank you!) 
He just finished his first week working full time as a software developer.</span></p><p></p><h1><span style="background-color: transparent; color: black;">Commentary and educated guesses</span></h1><p><span style="background-color: transparent; color: black;">Given that N thought (and I think, and the people who interviewed him thought) that Lambda’s actual instruction was good, why are there so many reviews online complaining about it? I think there are two big factors:</span></p><p></p><p><span style="background-color: transparent; color: black;">First, Lambda is doing something new and consciously not following traditional instructional design because "that’s how we’ve always done it." Remember, Lambda was </span><i style="background-color: transparent; color: black;">online-only before the pandemic.</i><span style="background-color: transparent; color: black;"> That alone means things are going to be different from other schools. And they’re not trying to help you get a well rounded classical liberal arts education or even necessarily to “learn to learn” -- their goal is to teach you enough practical programming to get a job as an entry level developer. This means, for instance, that they do a lot more project work in teams than your nearest college CS department would. I think that’s a good thing.</span></p><p></p><p><span style="background-color: transparent; color: black;">It also means that nothing is sacred and things can change quickly. So they changed it from 9 months to 6 (which I believe did not affect already-enrolled students) and eliminated paid team leads from their project work (which did). If you try new things, some of them aren’t going to work out. I understand how this would suck as a student, but running a school is expensive, running a school that does something nobody has done before (successfully, and at scale) is even more expensive, so the faster they can iterate on what works and stop what doesn’t, the better. 
I don’t fault Lambda for this.</span></p><p></p><p><span style="background-color: transparent; color: black;">The other factor is students who didn’t have the necessary background to be successful. It makes me sad to see Lambda students writing about “flexing” (repeating) a unit </span><i style="background-color: transparent; color: black;">for a second time.</i><span style="background-color: transparent; color: black;"> I think there’s a high likelihood that they weren’t ready to be admitted. This is something Lambda could fix by increasing the rigor of their relatively short precourse work.</span></p><p></p><p><span style="background-color: transparent; color: black;">It’s a balance -- you don’t want to only admit students who are 100% guaranteed to succeed, but on the other hand it’s not really doing people a favor to admit them if they only have a 10% chance. I’m not sure exactly where the balance is, but it seems likely (based on what I see other bootcamps doing that create high-quality outcomes) that Lambda’s filter should be a bit tighter. (On the other hand, I see other people criticizing Lambda for making money off the students who are so well prepared that they would be able to get a job programming no matter what they did. This is definitely not the case for the typical Lambda student, but if people are saying that then maybe that’s an indicator that Lambda has about the right balance after all.)</span></p><p></p><p><span style="background-color: transparent; color: black;">Based on Lambda’s relative lack of help sourcing job opportunities for N, I also wonder if they’ve scaled too fast, too quickly. Lambda advertises two things: relevant skills, and help finding a job. It seems to me that it’s a lot easier to scale the instructional part, than your pipeline of companies who want to hire graduates from a new and relatively unproven school. 
This would explain the relative lack of referrals that N saw, and it would also explain why Lambda hasn’t released student success metrics since 1H 2019 over a year ago. (And for that, I do fault Lambda.)</span></p><p></p><h1><span style="background-color: transparent; color: black;">TLDR</span></h1><p><span style="background-color: transparent; color: black;">Lambda did a good job with curriculum and instructional design, maybe even a great job. But their job-search program was significantly weaker, or perhaps it just hasn’t been able to scale to meet an increased volume of admissions. I am cheering for Lambda and I hope they can fix it.<br /></span></p>Jonathan Ellishttp://www.blogger.com/profile/11003648392946638242noreply@blogger.comtag:blogger.com,1999:blog-11683713.post-2197009325961408642012-03-18T16:01:00.001-07:002016-06-08T09:16:48.613-07:00Speaking to a technical conferenceI just got back from PyCon, and as with all conferences where the talks are delivered by engineers instead of professional speakers, we had a mixed bag. Some talks were great; others <a href="https://twitter.com/#%21/spyced/status/166893273850454018">made me get my laptop out</a>.<br />
<br />
The most important axiom is: a talk is not just an essay without random access. It's a different medium. Respect its strengths instead of wishing it were something it's not.<br />
<br />
Here are some concrete principles that can help:
<br />
<h3>
Don't read your slides</h3>
Advice often repeated, too seldom followed. This is sometimes phrased as "make eye contact with your audience," but I've seen that version interpreted to mean, "make eye contact while reading your slides, so your head pops up and down like a gopher poking out of its hole." So just don't read your slides, no matter what else you're doing.<br />
<br />
Some good presenters go to <a href="http://en.wikipedia.org/wiki/Takahashi_method">extremes</a> with this, with just one or two words per slide. This is fine as a stylistic embellishment, but not necessary for a good talk. You don't need to be <i>that</i> minimalistic. Just remember that with every transition, your audience will read the new slide before returning its attention to whatever you are saying. (Watch for this at the next talk you attend; you will absolutely catch yourself doing it.)<br />
<br />
Other presenters use "builds" to combat this. This can be useful in moderation, but it's more often used as a crutch, especially when presenting a list of related material. Personally, if I have an information-dense topic, like <a href="http://www.slideshare.net/jbellis/apache-cassandra-nosql-in-the-enterprise/8">this one</a> from my Strata talk, I'll put the whole list up at once but I'll leave the details off the slide and speak them instead.<br />
<br />
I'm also not a fan of "presenter notes" displayed on a secondary monitor. Too often this leads to the gopher effect or to underpracticing, or both.<br />
<br />
The one time you <i>do</i> want to explicitly direct attention to your slides is to explain part of one. For example, on <a href="http://www.slideshare.net/jbellis/pycon-2012-what-java-can-learn-from-python/6">this slide</a> I explained that the upper right was an example of DataStax's Opscenter product interfacing with Cassandra over JMX; the upper left was jvisualvm, and so forth. Since it was a large room, I really did say things like, "in the upper right, ..." In a smaller room I like to stand close enough to the screen to just point.
<br />
<h3>
Use visual aids</h3>
One of the best uses of builds is to explain a complicated diagram or sequence a piece at a time. This is difficult-to-impossible to do as effectively in prose alone. Sylvain's talk on <a href="http://mirror.aarnet.edu.au/pub/fosdem/2012/maintracks/janson/cassandra.webm">the Cassandra storage engine at FOSDEM 2012</a> is a good example. Starting at about 22:00, he explains how Cassandra uses log-structured merge trees to turn random writes into sequential I/O. Compare that with the treatment in the <a href="http://research.google.com/archive/bigtable-osdi06.pdf">Bigtable paper</a>, or the original <a href="http://staff.ustc.edu.cn/%7Ejpq/paper/flash/1996-The%20Log-Structured%20Merge-Tree%20%28LSM-Tree%29.pdf">LSM-tree paper</a>. Sylvain's explanation is much clearer by virtue of how it's presented.<br />
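The core idea behind that part of Sylvain's talk fits in a few lines of code: absorb random writes in a sorted in-memory buffer, then flush it sequentially to an immutable on-disk segment. This is a toy sketch of the general log-structured merge technique, not Cassandra's actual implementation:

```python
# Toy LSM write path: random writes land in a memtable; when it fills,
# it is flushed as one sequential write of sorted, immutable data.
# Reads check the memtable first, then segments newest-to-oldest.

class ToyLSM:
    def __init__(self, flush_threshold=4):
        self.memtable = {}             # in-memory writes; arrival order doesn't matter
        self.segments = []             # sorted, immutable "SSTable"-like segments
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value     # a random write touches memory only
        if len(self.memtable) >= self.flush_threshold:
            # Flush: sort once, append as an immutable segment (sequential I/O)
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:       # newest data shadows older segments
            return self.memtable[key]
        for segment in reversed(self.segments):
            for k, v in segment:
                if k == key:
                    return v
        return None

db = ToyLSM()
for i in range(6):
    db.put(f"k{i}", i)
assert db.get("k2") == 2               # already flushed to a segment
assert db.get("k5") == 5               # still in the memtable
```

A real engine would add a commit log for durability and compaction to merge segments, but the random-writes-become-sequential-I/O point is already visible here.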
<br />
I avoid audio or video during my presentations since using it effectively is a skill I don't yet have, but I've seen it done well by others. I can't imagine <a href="http://pyvideo.org/video/669/militarizing-your-backyard-with-python-computer">my favorite PyCon talk</a> being as effective without the recorded demonstration at the end.<br />
<br />
Finally, pictures can also be more effective than the spoken word at communicating humor. I'm <a href="http://www.google.com/search?q=javascript+good+parts&um=1&ie=UTF-8&hl=en&tbm=isch&source=og&sa=N&tab=wi">not sure</a> who came up with this first, but the juxtaposition <a href="http://www.slideshare.net/jbellis/pycon-2012-what-java-can-learn-from-python/2">here</a> is worth well over 1000 words.
<br />
<h3>
Leave them wanting more</h3>
Your goal in most public speaking is to get people interested enough to learn more on their own, <i>not</i> to make them experts.<br />
<br />
One thing I struggled with early on was, how do you explain code without reading your slides? I realized that the answer was, if you're trying to explain code, you're getting too deep into the weeds. Sometimes I'll use a snippet of code to give the "flavor" of an API, but wall-of-text slides mean you're Doing It Wrong.<br />
<br />
Another common mistake is to start your talk with an outline. (Worse: outline "progress reports" during the talk that tell the audience how far along you are.)<br />
<br />
A much better way to get the audience engaged is to tell a story: How did you come across the problem you are solving? What makes it challenging? What promising approaches didn't actually work out, and why? This is a classic story arc that will get people interested much more than if you dive into the nuts and bolts of your solution.
<br />
<h3>
Practice</h3>
Paul Graham <a href="http://paulgraham.com/speak.html">gets this one wrong</a>: while ad-libbing is indeed the polar opposite of reading your slides, it's also sub-optimal. You need practice to get timings right, to try out different phrasings of your thoughts, and to make transitions smooth. Don't fall for the false dichotomy that either you ad-lib or you practice all the spontaneity out; there's a happy medium in between.
<br />
<h3>
Mechanics</h3>
Finally (last <i>and</i> least?), a brief note on mechanics. Stand where you can gesture freely and naturally; ideally (in a small room) next to the screen. Don't stand behind a podium. Don't speak sitting down. Pacing a little bit is good.<br />
<br />
All these things mean: you need a slide remote. Even if you are right next to your laptop, reaching down to hit the spacebar or arrow key is distracting. But if you are doing it right you are probably not right next to your laptop. The remote included with Macs is unfortunately not enough, since it relies on infrared line-of-sight. If the conference doesn't provide one, borrow one from another presenter. If you speak frequently, it's worth the approximately $40 cost to get your own so you don't have to wrestle with unfamiliar hardware when you go live.<br />
<br />
Good luck!<br />
<br />
<br />Jonathan Ellishttp://www.blogger.com/profile/11003648392946638242noreply@blogger.com3tag:blogger.com,1999:blog-11683713.post-15586965807928329782012-02-02T15:23:00.000-08:002013-07-25T11:53:40.858-07:00Thinkpad 420s review<a href="http://ecx.images-amazon.com/images/I/415xXcJ40UL._SL500_AA300_.jpg"><img alt="" border="0" src="http://ecx.images-amazon.com/images/I/415xXcJ40UL._SL500_AA300_.jpg" style="cursor: hand; cursor: pointer; float: right; height: 300px; margin: 0 0 10px 10px; width: 300px;" /></a>In the last three years my primary machines have run OS X, Linux, Windows, OS X, and now Windows again, in that order. The observant reader may note, "That's a lot of machines in three years." It is, but I also changed jobs twice in that time frame, so that's part of it. Another part is that I'm a bit rough with laptops; the two mac machines broke badly enough that AppleCare told me they weren't going to help. The Dell and Lenovo machines, though, outlasted my use of them.
For this most recent machine, I had several requirements and several nice-to-haves, some of which were in tension.
Requirements:
<br />
<ul>
<li>Able to drive a 30" external monitor</li>
<li>At least 8GB of RAM</li>
<li>At least 1440x900 native resolution
</li>
</ul>
Nice to have:<br />
<ul>
<li>Smaller than my 15" macbook pro, which is too large to use comfortably in coach on an airplane</li>
<li>Larger screen than my wife's 13.3" mbp</li>
<li>A "real" cpu, not the underclocked ones in the Macbook Airs</li>
<li>A graphics card that can do justice to Starcraft II</li>
</ul>
I wasn't picky about my operating system. Linux is by far the best experience for software development, but support for multiple monitors is still dicey, which is bad when you're relying on it to give presentations on unfamiliar projectors. OS X is superficially unix-y, but the lack of package management means in practice it's not really any better than Windows. Windows is ... Windows, although I'm pretty fond of the <a href="http://www.softwarepro.com/articles/win_snap_shake.htm">new-in-Windows-7 window management keyboard shortcuts</a>. But fundamentally I spend 99% of my time in <a href="http://www.jetbrains.org/">an IDE</a>, a web browser, <a href="https://riptano.hipchat.com/">hipchat</a>, and IRC, all of which are cross-platform. <a href="http://www.mingw.org/wiki/MSYS">MSYS</a> gives me about as much of the unix experience on Windows as I got on OS X. (And <a href="http://ninite.com/">ninite</a> gives me more of a package manager than I had on OS X--granted, that isn't saying much.)
<br />
I ended up buying a 14" ThinkPad T420s. I think the S stands for "slim," and it is. My 15" mbp looks and weighs like a ton of bricks next to it, even after I swapped out the ThinkPad's DVD drive for the supplementary battery module, which weighs a little more. The ThinkPad's legendary keyboard lives up to its reputation, and I'm a huge fan of the trackpoint living right there on the home row of the keyboard, for when keyboard shortcuts aren't easily available. The cooling is excellent without the fans ever getting loud.
<br />
For the most part, I'm extremely happy with the hardware. There are two exceptions:<br />
<ul>
<li>The built-in microphone is terrible. Almost without exception, people have trouble hearing me over Skype. Adding insult to injury, there is no mic input. I thought at first the headphone jack was a phone-style out-plus-in jack, but no. I'll have to get a USB mic.</li>
<li>Optimus doesn't work in one important respect: in Optimus mode it won't drive my 30" monitor at full resolution; it picks something weird like 2048x1560 instead. Lenovo said they were going to fix this but haven't yet. To drive this monitor correctly I have to lock it to discrete graphics in the BIOS. In discrete mode it gets about 2h 45m of battery life even with the CPU downclocked and the display dim. So when I travel, I reboot to integrated graphics.</li>
</ul>
The main alternative I considered was the Sony Vaio Z. Ironically, I ended up going with the Thinkpad mostly because Vaio reviewers consistently called out how terrible its built-in speakers were... so I ended up with a system with a terrible mic instead.<br />
On the software side, I'm more than happy with Windows, especially after the Steam holiday sale. I hadn't realized how many fantastic indie games are available these days. (Most recently, I highly recommend <a href="http://supergiantgames.com/?page_id=242">Bastion</a>.)<br />
The one fly in my soup is that I'd anticipated being able to run OS X in a VM for the sake of Keynote. None of Google Docs presentations, Open/LibreOffice Impress, or PowerPoint is an adequate replacement. Unfortunately, the Core Image (?) APIs used by Keynote don't work under virtualization, so for now I'm still using my old mbp to create presentations, and taking them on the road with me as PDFs.
Jonathan Ellishttp://www.blogger.com/profile/11003648392946638242noreply@blogger.com3tag:blogger.com,1999:blog-11683713.post-48423173835981002222011-11-19T09:24:00.001-08:002011-11-19T09:27:39.134-08:00On applying for jobsA friend <a href="http://yamanin.livejournal.com/239406.html">asks</a>,
<blockquote>If [I see a] job I could do, even though I don't meet the stated requirements, should I apply anyway? </blockquote>Short answer: yes.
<p>
Longer answer: companies are all over the map here, although in general the fewer layers of bureaucracy there are between the team the candidate will work with and the hiring process, the more likely the stated requirements are to be actual requirements.
</p><p>
How can you tell?
</p><p>
HR paper pushers like to think in terms of checklists because that lets them go through hundreds of resumes without any real understanding of the position, so they write ads like <a href="http://jobview.monster.com/Lead-Software-Engineer-Java-J2EE-Oracle-MQ-Job-MetroWest-MA-US-103927164.aspx">this one</a> -- lots of really specific "5+ years of X," not much about what the position actually involves.
</p><p>
But if it's the team lead himself writing the description, which you will see at smaller companies, then you get much <a href="http://www.linkedin.com/jobs?viewJob=&jobId=2066287">more about what the position involves</a> and fewer checklist items, because the lead is comfortable determining competence based on skill instead of pattern matching. For a software development position, I don't care if you have a degree in CS if you can code. (Open-source contributions are a better signal for ability and passion than a degree, anyway.) My team ranges from people with no degree to people with PhDs.
</p><p>
Even when dealing with large companies, you have to factor in that people are terrible at distinguishing "want" from "need." A lot of "requirements" are really "nice-to-haves." It can be tough to tell the difference, but the better idea you have of what the job actually involves, the better you can tell which are hard requirements.
</p><p>
For instance: without knowing anything else about a position, my guess is that "native French speaker" really would be a hard requirement. That's not the sort of thing people tend to put down on a whim. But even then, there are shades of grey. For instance, if I were looking for a job and found a "distributed databases developer position, must know Java, be familiar with open source and be a native French speaker" then I might see if they'd give me a pass on the last part because I'm a <span style="font-style: italic;">really</span> good fit for the rest -- and I know they're unlikely to find a lot of candidates with an <span style="font-style: italic;">exact</span> match.
</p><p>
In short, you have little to lose by trying, but don't just shotgun out resumes; include a cover letter that highlights the best matches from your experience to what they are looking for. Follow up with the hiring manager if possible to ask (a) "I sent in my resume a few days ago, and I wanted to see where you are in the hiring process for this position," and if they reply that they got it but you're not a good fit, ask (b) what specifically they were looking for, so you can flesh out your intuition that much more for next time.
</p><p>
Good luck!</p>Jonathan Ellishttp://www.blogger.com/profile/11003648392946638242noreply@blogger.com1tag:blogger.com,1999:blog-11683713.post-61439272868702236082011-01-04T11:53:00.000-08:002011-10-13T15:30:01.428-07:00Apache Cassandra: 2010 in review<p>
In 2010, Apache Cassandra increased its momentum as the leading scalable database. Here is a summary of the notable activity in three areas: code, community and controversy. As always, comments are welcome.
</p><p>
</p><h3>Code</h3>
<p>
2010 started with the release of <a href="http://spyced.blogspot.com/2010/01/cassandra-05.html">Cassandra 0.5</a>, followed by <a href="http://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces3">0.6 and graduation from the ASF incubator</a> a few months later. Seven more stable releases of 0.6 followed, adding <a href="http://www.riptano.com/docs/0.6/appendix/appendix_a_whats_new">many features</a> to improve operations in response to feedback from production users.
</p><p>
0.7 adds highly anticipated features like <a href="http://www.riptano.com/blog/whats-new-cassandra-07-secondary-indexes">column value indexes</a>, <a href="http://www.riptano.com/blog/whats-new-cassandra-07-live-schema-updates">live schema updates</a>, more efficient cluster expansion, and more control over replication, but didn't quite make it into 2010, with rc4 <a href="http://twitter.com/#%21/cassandra/status/21268489612296192">released on New Year's 2011</a>.
</p><p>
We also committed the <a href="https://issues.apache.org/jira/browse/CASSANDRA-1072">distributed counters</a> patchset, begun at Digg and enhanced by Twitter for their <a href="http://mashable.com/2010/09/23/twitter-real-time-analytics/">real-time analytics product</a>. Notable as the most-involved feature discussion to date, distributed counters started with a <a href="https://issues.apache.org/jira/browse/CASSANDRA-580">vector clock approach</a>, but switched to a <a href="https://issues.apache.org/jira/secure/attachment/12459754/Partitionedcountersdesigndoc.pdf">new design</a> by <a href="http://twitter.com/#%21/kelvin">Kelvin Kakugawa</a> after we realized vector clocks were a <a href="http://pl.atyp.us/wordpress/?p=2601">dead end</a> for anything but the trivial case of monotonic-increments-by-one.
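To see why vector clocks dead-end here: a clock-based resolver can only pick one of two concurrent values, so one of two concurrent increments gets silently dropped. Counter designs that do work keep a separate count per replica and merge by taking the per-replica maximum. A G-counter-style sketch of that general idea (an illustration, not the partitioned-counter design in the linked document):

```python
# G-counter sketch: each replica increments only its own slot, and a
# merge takes the element-wise max. Concurrent increments are never
# lost, unlike pick-one-winner conflict resolution.

class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}              # replica_id -> that replica's local count

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas can exchange state in any order and converge.
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment()                         # two concurrent increments
b.increment()                         # on two different replicas
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 2    # neither increment is lost
```

With a vector-clock approach, the same scenario ends with one replica's value chosen as the "winner" and the other increment discarded, which is exactly the dead end described above.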
</p><p>
One of the biggest trends was increasing activity <i>around</i> Cassandra as well as in the core database itself. 2010 saw <a href="http://wiki.apache.org/cassandra/HadoopSupport">Hadoop map/reduce integration</a>, as well as Pig support and a <a href="https://issues.apache.org/jira/browse/HIVE-1434">patch for Hive</a>.
</p><p>
We also saw <a href="http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/">Lucandra</a>, which implements a Cassandra back end for Lucene and is used in several high volume production sites, grow up into <a href="https://github.com/tjake/Lucandra">Solandra</a>, embedding Solr and Cassandra in the same JVM for even more performance.
</p><p>
</p><h3>Community</h3>
<p>
Cassandra hit its stride in 2010, starting with <a href="http://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces3">graduation from the ASF incubator</a> in April. 2010 saw 1025 <a href="https://issues.apache.org/jira/browse/CASSANDRA">tickets</a> resolved, nearly twice as many as in 2009 (565).
</p><p>
Like many Apache projects, Cassandra has a relatively small set of <a href="http://wiki.apache.org/cassandra/Committers">committers</a>, but a much larger group of contributors. In 2010 Cassandra passed <a href="http://pastebin.com/DjXhVn2g">over 100 people</a> who have contributed at least one patch. Release manager <a href="http://blog.sym-link.com/">Eric Evans</a> put together a great way to visualize this with a <a href="http://www.youtube.com/watch?v=FWSyoXnWsTQ">Code Swarm video of Cassandra development</a>.
</p><p>
I <a href="http://spyced.blogspot.com/2010/04/and-now-for-something-completely.html">started Riptano</a> with Matt Pfeil in April to provide professional products and services around Cassandra. In October, we announced <a href="http://www.riptano.com/blog/cassandra-investment-lightspeed-sequoia">funding from Lightspeed and Sequoia</a>. From May to December, we conducted eleven public <a href="http://www.eventbrite.com/org/474011012">Cassandra training</a> events, and twice that many private classes on-site with customers.
</p><p>
Riptano is now up to 25 employees, with offices in the San Francisco bay area, Austin, and New York, and engineers working remotely in San Antonio, France, and Belarus.
</p><p>
In August, Riptano and Rackspace organized a very successful inaugural <a href="http://www.riptano.com/blog/cassandra-summit-recap">Cassandra Summit</a>, with about 200 attendees (<a href="http://www.riptano.com/blog/slides-and-videos-cassandra-summit-2010">videos available</a>), followed by <a href="http://us.apachecon.com/c/acna2010/schedule/grid">almost a full track at ApacheCon</a> in November. Cassandra was also represented at many other conferences on <a href="http://en.oreilly.com/rails2010/public/schedule/detail/14740">multiple</a> <a href="http://www.inf.unibz.it/krdb/school/2010/program.html">subjects</a>, <a href="http://my.javaonedevelop.com/events/a2z/JAVAONE">for</a> <a href="http://www.slideshare.net/aaronmorton/b-5857745">several</a> <a href="http://www.slideshare.net/supertom/using-cassandra-with-your-web-application">languages</a>, <a href="http://www.devoxx.com/display/Devoxx2K10/Introduction+to+Cassandra">and</a> <a href="http://www.gemini-bigdata.com/2010/11/brief-reviews-of-nosql-afternoon-in.html">continents</a>.
</p><p>
</p><h3>Controversy</h3>
<p>
Cassandra got a lot of negative publicity when Kevin Rose <a href="http://gilhildebrand.com/afterthought/2010/09/kevin-rose-spreads-fud-blames-cassandra-for-digg-v4-woes/">blamed</a> Cassandra for Digg v4's teething problems. However, there was no deluge of <a href="https://issues.apache.org/jira/browse/CASSANDRA">bug reports</a> coming out of Digg's Cassandra team, and Digg engineers Arin Sarkissian and Chris Goffinet (now working on Cassandra for Twitter) got on Quora <a href="http://www.quora.com/Is-Cassandra-to-blame-for-Digg-v4s-technical-failures">to refute the idea that Cassandra was at fault</a>:
</p><blockquote>
The whole "Cassandra to blame" thing is 100% a result of folks clinging on to the NoSQL vs SQL thing. It's a red herring.
<p>
The new version of Digg has a whole new architecture with a bunch of technologies involved. Problem is, over the last few months or so the only technological change we mentioned (blogged about etc) was Cassandra. That made it pretty easy for folks to cling on to it as the "problem".
</p></blockquote>
<p>
Meanwhile, Digg competitor Reddit <a href="http://twitter.com/#%21/ketralnis/status/658776965255168">has</a> <a href="http://twitter.com/#%21/ketralnis/status/10098518563758080">continued</a> <a href="http://twitter.com/#%21/jedberg/status/10401811135463424">migrating</a> to Cassandra, crediting it with <a href="http://www.reddit.com/r/blog/comments/evmek/2010_we_hardly_knew_ye/c1bbmrq">enabling their 3x traffic growth in 2010</a>.
</p><p>
More importantly, 2010 saw dozens of new Cassandra deployments, including a new contender for the largest-cluster crown when Digital Reasoning announced a <a href="http://www.businesswire.com/news/home/20101006005485/en/Digital-Reasoning-Riptano-Advance-Cassandra-Based-Analytic-Solutions">400-node cluster for the US government.</a>
</p><p>
We look forward to another great year in 2011!</p>Jonathan Ellishttp://www.blogger.com/profile/11003648392946638242noreply@blogger.com3tag:blogger.com,1999:blog-11683713.post-49802796965960312432010-04-26T12:29:00.000-07:002010-05-23T18:51:15.223-07:00And now for something completely different<p>
A month ago I left Rackspace to start <a href="http://riptano.com/">Riptano</a>, a <a href="http://riptano.com/services.php">Cassandra support and services</a> company.
</p><p>
I was in the unusual position of being a technical person looking for a business-savvy co-founder. For whatever reason, the converse seems a lot <a href="http://answers.onstartups.com/questions/35/how-do-i-find-a-technical-co-founder">more</a> <a href="http://www.mattcollins.net/2008/04/how-to-find-a-technical-co-founder">common</a>. Maybe technical people tend to stereotype softer skills as being easy.
</p><p>
But despite some examples to the contrary (notably for me, <a href="http://www.jcoates.org/">Josh Coates</a> at <a href="http://spyced.blogspot.com/2005/09/python-at-mozycom.html">Mozy</a>), I found that <a href="http://www.paulgraham.com/startupmistakes.html">starting a company is too hard for just one person</a>. Unfortunately, everyone in my fairly slim portfolio of business guys I'd want to co-found with was unavailable. So progress was slow, until <a href="http://www.linkedin.com/pub/matt-pfeil/19/71/201">Matt Pfeil</a> heard that I was leaving Rackspace and drove to San Antonio from Austin to talk me out of it. Not only did he fail to talk me out of leaving, he ended up co-founding Riptano. And here we are, with a Riptano mini-faq.
</p><p>
<b>Isn't Cassandra mostly just a web 2.0 thing for ex-mysql shops?</b>
</p><p>
Although most of the <a href="http://spyced.blogspot.com/2010/03/cassandra-in-action.html">early adopters</a> fit this stereotype, we're seeing interest from a lot of Oracle users and a lot of industries. Unlike many "NoSQL" databases, Cassandra <a href="http://spyced.blogspot.com/2010/04/cassandra-fact-vs-fiction.html">doesn't drop durability</a> (the D in <a href="http://en.wikipedia.org/wiki/ACID">ACID</a>), and besides scalability, enterprises are very interested in our support for multiple data centers and <a href="http://wiki.apache.org/cassandra/HadoopSupport">Hadoop analytics</a>.
</p><p>
<b>Are you going to fork Cassandra?</b>
</p><p>
No. Although the ASF license allows doing basically anything with the code, including creating proprietary forks, we think the track record of this strategy in the open source database world is <a href="http://jcole.us/blog/archives/2007/08/09/mysql-community-split-officially-a-failure/">mixed</a> at best.
</p><p>
We might create a (still open-source) Cassandra distribution similar to <a href="http://www.cloudera.com/hadoop/">Cloudera's Distribution for Hadoop</a>, but the mainline Cassandra development is responsive enough that there isn't as much need for a third party to do this as there is with Hadoop.
</p><p>
<b>What does Rackspace think?</b>
</p><p>
<a href="http://rackspace.com/">Rackspace</a> has been the primary driver of Cassandra development recently, employing (until I left) the three most active <a href="http://wiki.apache.org/cassandra/Committers">committers on the project</a>. For the same reasons <a href="http://www.rackspacecloud.com/blog/2009/09/23/the-cassandra-project/">Rackspace supported Cassandra</a> to begin with, Rackspace is excited to see Riptano help take the Cassandra ecosystem to the next level. <a href="http://www.informationweek.com/news/hardware/virtual/showArticle.jhtml?articleID=224600336">Rackspace has invested in Riptano</a> and has been completely supportive in every way.
</p><p>
<b>Where did you get the name "Riptano?" Does it mean anything?</b>
</p><p>
We took a sophisticated, augmented AI approach. By which I mean, we took a <a href="http://www.multicians.org/thvv/gpw.html">program that generated random, pronounceable strings</a>, and put together a couple of fragments that sounded good together. (This is basically the same approach we took at Mozy, only there Josh insisted on a four-letter domain name, which narrowed it down a <i>lot</i>.)
</p><p>
I hope it doesn't mean "your dog has bad breath" somewhere.
</p><p>
And yes, <a href="http://twitter.com/riptano">Riptano is on twitter</a>.
</p><p>
<b>Are you hiring?</b>
</p><p>
Yes. We'll have a jobs page on the site soon. In the meantime you can email me a resume if you can't wait. Prior participation in the Apache Cassandra project is of course a huge plus.</p>Jonathan Ellishttp://www.blogger.com/profile/11003648392946638242noreply@blogger.com2tag:blogger.com,1999:blog-11683713.post-74442810616920942612010-04-07T09:37:00.000-07:002010-04-20T06:17:45.056-07:00Cassandra: Fact vs fiction<p>
<a href="http://cassandra.apache.org/">Cassandra</a> has seen some <a href="http://spyced.blogspot.com/2010/03/cassandra-in-action.html">impressive adoption success</a> over the past months, leading some to conclude that <a href="http://blog.tonybain.com/tony_bain/2009/12/is-cassandra-winning-the-nosql-race.html">Cassandra is the frontrunner</a> in the highly scalable databases space (a subset of the hot <a href="http://www.rackspacecloud.com/blog/2009/11/09/nosql-ecosystem/">NoSQL category</a>). Among all the attention, some misunderstandings have been propagated, which I'd like to clear up.
</p><p>
<span style="font-weight: bold;">Fiction</span>: "Cassandra relies on high-speed fiber between datacenters" and can't reliably replicate between datacenters with more than a few ms of latency between them.
</p><p>
<span style="font-weight: bold;">Fact</span>: Cassandra's multi-datacenter replication is one of its earliest features and is by far the most battle-tested in the NoSQL space. Facebook had Cassandra deployed on east and west coast datacenters since before open sourcing it. SimpleGeo's Cassandra cluster <a href="http://permalink.gmane.org/gmane.comp.db.cassandra.user/3462">spans 3 EC2 availability zones</a>, and Digg is also deployed on both coasts. Claims that this can't possibly work are an excellent sign that you're reading an article by someone who doesn't know what he's talking about.
</p><p>
<span style="font-weight: bold;">Fiction</span>: "It’s impossible to tell when [Cassandra] replicas will be up-to-date."
</p><p>
<span style="font-weight: bold;">Fact</span>: Cassandra provides consistency when R + W > N (read replica count + write replica count > replication factor), to use the <a href="http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html">Dynamo vocabulary</a>. If you do writes and reads both with QUORUM, for one example, you can expect data consistency as soon as there are enough reachable nodes for a quorum. Cassandra also provides <a href="http://wiki.apache.org/cassandra/ReadRepair">read repair</a> and <a href="http://wiki.apache.org/cassandra/AntiEntropy">anti-entropy</a>, so that even reads at <a href="http://wiki.apache.org/cassandra/API">ConsistencyLevel.ONE</a> will be consistent after either of these events.
</p><p>
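To see why R + W &gt; N gives consistency: any write set of W replicas must then share at least one replica with any read set of R replicas, so a read always touches a replica holding the latest acknowledged write. A brute-force check in Python (not Cassandra code) confirms the rule:

```python
from itertools import combinations

def quorum_overlap(r, w, n):
    """True if every possible write set of w replicas intersects every
    possible read set of r replicas, out of n total replicas."""
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))

# The brute-force check agrees with the R + W > N arithmetic:
for n in range(1, 6):
    for r in range(1, n + 1):
        for w in range(1, n + 1):
            assert quorum_overlap(r, w, n) == (r + w > n)

# QUORUM on both reads and writes with replication factor 3:
assert quorum_overlap(2, 2, 3)       # 2 + 2 > 3: reads see the latest write
assert not quorum_overlap(1, 1, 3)   # ONE + ONE: eventually consistent only
```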
<span style="font-weight: bold;">Fiction</span>: Cassandra has a small community
</p><p>
<span style="font-weight: bold;">Fact</span>: Although popularity has never been a good metric for determining correctness, it's true that when using bleeding edge technology, it's good to have company. As I write this late at night (in the USA), there are 175 people in the Cassandra irc channel, 60 in the HBase one, 32 in Riak's, and 15 in Voldemort's. (Six months ago, the numbers were 90, 45, and 12 for Cassandra, HBase, and Voldemort; I did not yet hang out in #riak.) Mailing list participation tells a similar story.
<p>
It's also interesting that the creators of <a href="http://code.google.com/p/thrudb/">Thrudb</a> and <a href="http://github.com/cliffmoon/dynomite">dynomite</a> are both using Cassandra now, indicating that the predicted NoSQL consolidation is beginning.
</p><p>
<span style="font-weight: bold;">Fiction</span>: "Cassandra only supports one [keyspace] per install."
</p><p>
<span style="font-weight: bold;">Fact</span>: This has not been true for almost a year (<a href="https://issues.apache.org/jira/browse/CASSANDRA-79">June of 2009</a>).
</p><p>
<span style="font-weight: bold;">Fiction</span>: Cassandra cannot support Hadoop, or supporting tools such as Pig.
</p><p>
<span style="font-weight: bold;">Fact</span>: It has always been straightforward to send the output of Hadoop jobs to Cassandra, and Facebook, Digg, and others have been using Hadoop like this as a Cassandra bulk-loader for over a year. For 0.6, I contributed a Hadoop InputFormat and related code to let Hadoop jobs <a href="http://wiki.apache.org/cassandra/HadoopSupport">process data <span style="font-style: italic;">from</span> Cassandra</a> as well, while cooperating with Hadoop to keep processing on the nodes that actually hold the data. Stu Hood then contributed a Pig LoadFunc, also in 0.6.
</p><p>
<span style="font-weight: bold;">Fiction</span>: Cassandra achieves its high performance by sacrificing reliability (alternately phrased: Cassandra is only good for data you can afford to lose)
</p><p>
<span style="font-weight: bold;">Fact</span>: unlike some NoSQL databases (notably <a href="http://blog.mongodb.org/post/381927266/what-about-durability">MongoDB</a> and <a href="http://www.slideshare.net/cloudera/hbase-user-group-9-hbase-and-hdfs">HBase</a>), Cassandra offers <a href="http://wiki.apache.org/cassandra/Durability">full single-server durability</a>. Relying on replication is not sufficient for can't-afford-to-lose-data scenarios; if your data center loses power, you are highly likely to lose data if you are not syncing to disk no matter how many replicas you have, and if you run large systems in production long enough, you will realize that power outages through some combination of equipment failure and human error are not occurrences you can ignore. But with its <a href="http://linux.die.net/man/2/fsync">fsync</a>'d <a href="http://wiki.apache.org/cassandra/ArchitectureCommitLog">commitlog</a> design, Cassandra can protect you against that scenario too.
</p><p>
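The fsync-before-acknowledge idea behind the commitlog can be sketched in a few lines of Python (the class, file format, and names here are illustrative only, not Cassandra's actual implementation):

```python
import os
import tempfile

class CommitLog:
    """Append-only log; a mutation is acknowledged only after fsync,
    so an acknowledged write survives power loss on this node."""
    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

    def append(self, mutation: bytes) -> None:
        # Length-prefixed record, forced to disk before we ack the client.
        record = len(mutation).to_bytes(4, "big") + mutation
        os.write(self.fd, record)
        os.fsync(self.fd)

    def close(self):
        os.close(self.fd)

path = os.path.join(tempfile.mkdtemp(), "commitlog.bin")
log = CommitLog(path)
log.append(b"row-key:column=value")   # returns only once the data is synced
log.close()
```

The point is the `os.fsync` call: relying on the OS page cache alone means a power outage can discard writes the database already acknowledged.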
What to do after your data is saved, e.g. backups and snapshots, is outside of my scope here but covered in the <a href="http://wiki.apache.org/cassandra/Operations">operations wiki page</a>.</p>Jonathan Ellishttp://www.blogger.com/profile/11003648392946638242noreply@blogger.com16tag:blogger.com,1999:blog-11683713.post-21351072286060485212010-03-30T07:48:00.001-07:002011-11-01T09:19:19.137-07:00Cassandra in Google Summer of Code 2010<p>
Cassandra is participating in the Google Summer of Code, which <a href="http://google-opensource.blogspot.com/2010/03/students-apply-now-for-google-summer-of.html">opened for proposal submission today</a>. Cassandra is part of the Apache Software Foundation, which has <a href="http://community.apache.org/gsoc.html">its own page of guidelines up</a> for students and mentors.
</p><p>
We have a good mix of project ideas involving both core and non-core areas, from straightforward code bashing to some pretty tricky stuff, depending on your appetite. Core tickets aren't necessarily harder than non-core, but they will require reading and understanding more existing code.
</p><p>
<span style="font-weight: bold;font-size:130%;" >Non-core</span>
</p><ul><li><a href="https://issues.apache.org/jira/browse/CASSANDRA-918">Create a web ui for cassandra</a>: we have a (fairly minimal) command line interface, but a web gui is more user-friendly. There are the beginnings of such a beast in the Cassandra source tree at contrib/cassandra_browser [pretty ugly Python code] and a gtk-based one at http://github.com/driftx/chiton [also Python, less ugly].</li><li><a href="https://issues.apache.org/jira/browse/CASSANDRA-912">First-class commandline interface</a>: if you prefer to kick things old-school, improving the cli itself would also be welcome.</li><li><a href="https://issues.apache.org/jira/browse/CASSANDRA-873">Create a Cassandra demo application</a>: we have <a href="http://twissandra.com/">Twissandra</a>, but we can always use more examples to introduce people to "thinking in Cassandra," which is the hardest part of using it. This one seems to be the most popular with students so far. (So stand out from the crowd, and submit something else too. :)
</li></ul>
<p>
<span style="font-weight: bold;font-size:130%;" >Almost-core
</span></p><ul><li><a href="https://issues.apache.org/jira/browse/CASSANDRA-875">Performance regression tests</a>: pretty self-explanatory?</li><li><a href="https://issues.apache.org/jira/browse/CASSANDRA-874">System tests against multiple nodes</a>: If GSOC were a wish-granting fairy I would probably choose this with my first wish. There's a couple different ways you can approach this; scripting VMs is one, or you could explore the <a href="https://issues.apache.org/jira/browse/CASSANDRA-561">Cassandra simulator</a> that was contributed a while ago (some TLC required).</li><li><a href="https://issues.apache.org/jira/browse/CASSANDRA-913">Hive support</a>: <a href="http://hadoop.apache.org/hive/">Hive</a> is a project that runs SQL queries against Hadoop map/reduce clusters. (For analytics; it is too high-latency to run applications against Hive directly). <a href="https://issues.apache.org/jira/browse/HIVE-705" title="Let Hive can analyse hbase's tables"><strike>HIVE-705</strike></a> added support for backends other than HDFS, with HBase as the first. Cassandra support should be doable too now. The Hive storage backends are described in <a href="http://wiki.apache.org/hadoop/Hive/StorageHandlers">http://wiki.apache.org/hadoop/Hive/StorageHandlers</a> and the HBase backend specifically in <a href="http://wiki.apache.org/hadoop/Hive/HBaseIntegration">http://wiki.apache.org/hadoop/Hive/HBaseIntegration</a>.
</li></ul>
<p>
<span style="font-weight: bold;font-size:130%;" >Core</span>
</p><ul><li><a href="https://issues.apache.org/jira/browse/CASSANDRA-926">Avro RPC support</a>: currently Cassandra's client layer is the Thrift RPC framework, which sucks for reasons outside our scope here. We're moving to Avro, the new hotness from Doug Cutting (creator of Lucene and Hadoop, you may have heard of those). Basically this means porting org.apache.cassandra.thrift.CassandraServer to org.apache.cassandra.avro.CassandraServer; some examples are already done by Eric Evans.</li><li><a href="https://issues.apache.org/jira/browse/CASSANDRA-876">Session-level consistency</a>: In <a href="http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html">one</a> and <a href="http://www.allthingsdistributed.com/2008/12/eventually_consistent.html">two</a> Amazon discusses the concept of "eventual consistency." Cassandra uses eventual consistency in a design similar to Dynamo. Supporting session consistency would be useful and relatively easy to add: we already have the concept of a <a href="http://wiki.apache.org/cassandra/MemtableSSTable">Memtable</a> to "stage" updates in before flushing to disk; if we applied mutations to a session-level memtable on the coordinator machine (that is, the machine the client is connected to), and then did a final merge from that table against query results before handing them to the client, we'd get it almost for free.</li><li><a href="https://issues.apache.org/jira/browse/CASSANDRA-622">Optimize commitlog performance</a>: this is about as low-level as you'll find in Cassandra's code base. fsync, CAS, it's all here. <a href="http://wiki.apache.org/cassandra/ArchitectureCommitLog">http://wiki.apache.org/cassandra/ArchitectureCommitLog</a> describes the current CommitLog design. </li></ul>
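The session-level consistency idea in the list above (stage a session's own writes in a coordinator-side memtable, then merge that memtable into query results before returning them) can be sketched in Python. All names here are hypothetical, not Cassandra code:

```python
class Session:
    """Read-your-writes sketch: stage this session's mutations locally
    and overlay them on whatever the replicas return."""
    def __init__(self, replicas):
        self.replicas = replicas     # a dict standing in for the cluster
        self.memtable = {}           # session-level staging area

    def write(self, key, value):
        self.memtable[key] = value   # replication to the cluster may lag

    def read(self, key):
        # Final merge: the session's own writes win over replica results.
        if key in self.memtable:
            return self.memtable[key]
        return self.replicas.get(key)

replicas = {"k1": "old"}
session = Session(replicas)
assert session.read("k1") == "old"   # nothing staged yet
session.write("k1", "new")           # replicas not yet updated...
assert session.read("k1") == "new"   # ...but this session sees its write
```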
<p>
You can comment directly on the JIRA tickets after creating an account (it's open to the public) if you're interested or have other questions. And of course feel free to propose other ideas!</p>Jonathan Ellishttp://www.blogger.com/profile/11003648392946638242noreply@blogger.com3tag:blogger.com,1999:blog-11683713.post-64342895440162636402010-03-24T08:41:00.001-07:002011-03-14T06:25:21.361-07:00Cassandra in action<p>
There's been a lot of new articles about <a href="http://cassandra.apache.org/">Cassandra </a>deployments in the past month, enough that I thought it would be useful to summarize in a post.
</p>
<p>
Ryan King explained in <a href="http://nosql.mypopescu.com/post/407159447/cassandra-twitter-an-interview-with-ryan-king">an interview with Alex Popescu</a> why Twitter is moving to Cassandra for tweet storage, and why they selected Cassandra over the alternatives. My experience is that the more someone understands large systems and the problems you can run into with them from an operational standpoint, the more likely they are to choose Cassandra when doing this kind of evaluation. Ryan's list of criteria is worth checking out.
</p>
<p>
Digg followed up their <a href="http://about.digg.com/blog/looking-future-cassandra">earlier announcement</a> that they had taken part of their site live on Cassandra with <a href="http://about.digg.com/node/564">another</a> saying that they've now "reimplemented most of Digg's functionality using Cassandra as our primary datastore." Digg engineer Ian Eure also gave <a href="http://news.ycombinator.com/item?id=1184603">some more details on Digg's cassandra data model</a> in a Hacker News thread.</p>
<p>Om Malik <a href="http://gigaom.com/2010/03/11/digg-cassandara/">quoted extensively</a> from the Digg announcement and from Rackspace engineer Stu Hood, who explained Cassandra's appeal: "Over the Bigtable clones, Cassandra has huge high-availability advantages, and no single point of failure. When compared to the Dynamo adherents, Cassandra has the advantage of a more advanced datamodel, allowing for a single row to contain billions of column/value pairs: enough to fill a machine. You also get efficient range queries for the top level key, and even within your values."</p>
<p>
The Twitter and Digg news kicked off <a href="http://blogsearch.google.com/blogsearch?q=twitter+cassandra">a lot of publicity</a>, including a lot of "me too" articles but some interesting ones, including a highscalability post wondering if this was <a href="http://highscalability.com/blog/2010/2/26/mysql-and-memcached-end-of-an-era.html">the end of the mysql + memcached era</a>. If not quite yet the end, then the beginning of it. As Ian Eure from Digg <a href="http://www.rackspacecloud.com/blog/2010/02/25/should-you-switch-to-nosql-too/">said</a>, "If you're deploying memcache on top of your database, you're inventing your own ad-hoc, difficult to maintain NoSQL system." Possibly the best commentary on this idea is <a href="http://www.25hoursaday.com/weblog/2010/03/10/BuildingScalableDatabasesAreRelationalDatabasesCompatibleWithLargeScaleWebsites.aspx">Dare Obasanjo's</a>, who explained "Digg's usage of Cassandra actually serves as a rebuttal to [an article claiming SQL scales just fine] since they couldn't feasibly get what they want with either horizontal or vertical scaling of their relational database-based solution."
</p>
<p>
<a href="http://blog.reddit.com/2010/03/she-who-entangles-men.html">Reddit also migrated to Cassandra</a> from memcachedb, in only 10 days, the fastest migration to Cassandra I've seen. More comments from the engineer doing the migration, ketralnis, in the <a href="http://www.reddit.com/r/programming/comments/bcqhi/reddits_now_running_on_cassandra/">reddit discussion thread</a>.
</p>
<p>
CloudKick <a href="https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/">blogged about how they use Cassandra for time series data</a>, including a sketch of their data model. CloudKick migrated from PostgreSQL, skewering the theory you will sometimes see proffered that "only MySQL users are migrating to NoSQL, not people who use [my favorite vendor's relational database]."
</p>
<p>
Jake Luciani wrote about <a href="http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/">how Lucandra, the Cassandra Lucene back-end works</a>, and how he's using it to power <a href="http://sparse.ly">the Twitter search app sparse.ly</a>. IMO, <a href="http://github.com/tjake/Lucandra">Lucandra</a> is one of Cassandra's killer apps.
</p>
<p>
The FightMyMonster team <a href="http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/">switched from HBase to Cassandra</a> after concluding that "HBase is more suitable for data warehousing, and large scale data processing and analysis... and Cassandra is more suitable for real time transaction processing and the serving of interactive data." Dominic covers CAP, architecture considerations, benchmarks, map/reduce, and durability in explaining his conclusion.
</p>
<p>
<a href="http://www.startupmonkeys.com/2010/03/cassandra-frugal-mechanic/">Eric Peters gave a talk on Cassandra</a> use at his company, Frugal Mechanic, at the Seattle Tech Startups Meetup. This was interesting not because Frugal Mechanic is a big name but because it's not. I haven't seen Eric's name on the Cassandra mailing lists at all, but there he was deploying it and giving a talk on it, showing that Cassandra is starting to move beyond early adopters. (And, just maybe, that our documentation is improving. :)</p>
<p>
Finally, <a href="http://www.eflorenzano.com/">Eric Florenzano</a> has a live demo up now of Cassandra running a Twitter clone at <a href="http://twissandra.com/">twissandra.com</a>, with <a href="http://github.com/ericflo/twissandra">source</a> at github, as an example of how to use Cassandra's data model. If you're interested in the nuts and bolts of how to build an app on Cassandra, you should check it out.</p>Jonathan Ellishttp://www.blogger.com/profile/11003648392946638242noreply@blogger.com10tag:blogger.com,1999:blog-11683713.post-38443091628946708112010-03-15T15:19:00.000-07:002010-03-17T08:43:49.079-07:00Why your data may not belong in the cloud<p>
<a href="http://pl.atyp.us/wordpress/?p=2742">Several</a> of the <a href="http://groups.csail.mit.edu/haystack/blog/2010/03/12/notes-from-nosql-live-boston/">reports</a> of the recently-concluded NoSQL Live event mentioned that I took a contrarian position on the "NoSQL in the Cloud" panel, arguing that traditional, bare metal servers usually make more sense. Here's why.
<p>
There are two reasons to use cloud infrastructure (and by cloud I mean here "commodity VMs" such as those provided by Rackspace Cloud Servers or Amazon EC2):
<ol><li>You only need a fraction of the capacity of a single machine</li><li>Your demand is highly elastic; you want to be able to quickly spin up many new instances, then drop them when you are done</li></ol>Most people looking at NoSQL solutions are doing it because their data is larger than a traditional solution can handle, or will be, so (1) is not a very strong motivation. But what about (2)? At first glance, cloud is a great fit for adding capacity to a database cluster painlessly. But there's an important difference between load like web traffic that bounces up and down frequently, and databases: with few exceptions, databases only get larger with time. You won't have 20 TB of data this week, and 2 next.
<p>
When capacity only grows in one direction it makes less sense to pay a premium for the flexibility of being able to reduce your capacity nearly instantly, especially when you also get reduced I/O performance (the most common bottleneck for databases) in the bargain because of the virtualization layer. That's why, despite working for a <a href="http://rackspacecloud.com/">cloud provider</a>, I don't think it's always a good fit for databases. (It doesn't hurt that Rackspace also offers classic bare metal hosting in the same data centers, so you can have the best of both worlds.)Jonathan Ellishttp://www.blogger.com/profile/11003648392946638242noreply@blogger.com8tag:blogger.com,1999:blog-11683713.post-19289828148437567372010-02-08T10:06:00.000-08:002010-02-25T20:39:33.258-08:00Distributed deletes in the Cassandra database<p>
Handling deletes in a distributed, <a href="http://www.allthingsdistributed.com/2008/12/eventually_consistent.html">eventually consistent</a> system is a little tricky, as demonstrated by the fairly frequent recurrence of the question, "<a href="http://wiki.apache.org/cassandra/FAQ#i_deleted_what_gives">Why doesn't disk usage immediately decrease when I remove data in Cassandra</a>?"
<p>
As background, recall that a <a href="http://incubator.apache.org/cassandra/">Cassandra </a>cluster defines a ReplicationFactor that determines how many nodes each key and associated columns are written to. In Cassandra (as in <a href="http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html">Dynamo</a>), the client controls how many replicas to block for on writes, which includes deletions. In particular, the client may (and typically will) specify a ConsistencyLevel of less than the cluster's ReplicationFactor, that is, the coordinating server node should report the write successful even if some replicas are down or otherwise not responsive to the write.
<p>
(Thus, the "eventual" in eventual consistency: if a client reads from a replica that did not get the update with a low enough ConsistencyLevel, it will potentially see old data. Cassandra uses <a href="http://wiki.apache.org/cassandra/HintedHandoff">Hinted Handoff</a>, <a href="http://wiki.apache.org/cassandra/ReadRepair">Read Repair</a>, and <a href="http://wiki.apache.org/cassandra/AntiEntropy">Anti Entropy</a> to reduce the inconsistency window, as well as offering higher consistency levels such as ConsistencyLevel.QUORUM, but it's still something we have to be aware of.)
<p>
Thus, a delete operation can't just wipe out all traces of the data being removed immediately: if we did, and a replica did not receive the delete operation, when it becomes available again it will treat the replicas that <span style="font-style: italic;">did</span> receive the delete as having missed a write update, and repair them! So, instead of wiping out data on delete, Cassandra replaces it with a special value called a tombstone. The tombstone can then be propagated to replicas that missed the initial remove request.
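The tombstone mechanics can be sketched with a toy last-write-wins replica in Python (illustrative only, not Cassandra's actual classes):

```python
TOMBSTONE = object()   # special marker written in place of a deleted value

class Replica:
    """Toy last-write-wins replica: newer timestamps overwrite older ones."""
    def __init__(self):
        self.store = {}              # key -> (timestamp, value or TOMBSTONE)

    def write(self, key, ts, value):
        # Only apply mutations newer than what this replica already has.
        if key not in self.store or ts > self.store[key][0]:
            self.store[key] = (ts, value)

    def delete(self, key, ts):
        # A delete writes a tombstone rather than erasing the entry, so a
        # replica that missed it can't "repair" the old data back later.
        self.write(key, ts, TOMBSTONE)

a, b = Replica(), Replica()
a.write("row", 1, "v1")
b.write("row", 1, "v1")
a.delete("row", 2)                   # b is down and misses the delete
# When b recovers, repair streams a's newest entry -- the tombstone --
# instead of b resurrecting the deleted row onto a:
ts, value = a.store["row"]
b.write("row", ts, value)
assert b.store["row"][1] is TOMBSTONE
```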
<p>
There's one more piece to the problem: how do we know when it's safe to remove tombstones? In a fully distributed system, we can't. We could add a coordinator like <a href="http://hadoop.apache.org/zookeeper/">ZooKeeper</a>, but that would pollute the simplicity of the design, as well as complicating ops -- then you'd essentially have two systems to monitor, instead of one. (This is not to say ZK is bad software -- I believe it is best in class at what it does -- only that it solves a problem that we do not wish to add to our system.)
<p>
So, Cassandra does what distributed systems designers frequently do when confronted with a problem we don't know how to solve: define some additional constraints that turn it into one that we do. Here, we defined a constant, <em>GCGraceSeconds</em>, and had each node track tombstone age locally. Once it has aged past the constant, it can be GC'd. This means that if you have a node down for longer than <em>GCGraceSeconds</em>, you should treat it as a failed node and replace it as described in <a href="http://wiki.apache.org/cassandra/Operations">Cassandra Operations</a>. The default setting is very conservative, at 10 days; you can reduce that once you have Anti Entropy configured to your satisfaction. And of course if you are only running a single Cassandra node, you can reduce it to zero, and tombstones will be GC'd at the first <a href="http://wiki.apache.org/cassandra/MemtableSSTable">compaction</a>.Jonathan Ellishttp://www.blogger.com/profile/11003648392946638242noreply@blogger.com2tag:blogger.com,1999:blog-11683713.post-5408134620441385432010-01-25T14:23:00.000-08:002010-10-13T06:38:26.199-07:00Cassandra 0.5.0 released<p>
<a href="http://cassandra.apache.org/">Apache Cassandra</a> 0.5.0 was released over the weekend, four months after 0.4. (<a href="https://svn.apache.org/repos/asf/cassandra/tags/cassandra-0.5.0/NEWS.txt">Upgrade notes</a>; <a href="https://svn.apache.org/repos/asf/cassandra/tags/cassandra-0.5.0/CHANGES.txt">full changelog</a>.) We're excited about releasing 0.5 because it makes life even better for people using Cassandra as their primary data source -- as opposed to a replica, possibly denormalized, of data that exists somewhere else.
</p><p>
The Cassandra distributed database has always had a commitlog to provide durable writes, and in 0.4 we added an option to wait for commitlog sync before acknowledging writes, for cases where even a few seconds of potential data loss was not an option. But what if a node goes down temporarily? 0.5 adds proactive repair (what Dynamo calls "anti-entropy") to synchronize any updates <a href="http://wiki.apache.org/cassandra/HintedHandoff">Hinted Handoff</a> or read repair didn't catch across all replicas for a given piece of data.
</p><p>
0.5 also adds load balancing and significantly improves bootstrap (adding nodes to a running cluster). We've also been busy adding documentation on <a href="http://wiki.apache.org/cassandra/Operations">operations in production</a> and <a href="http://wiki.apache.org/cassandra/ArchitectureInternals">system internals</a>.
</p><p>
Finally, in 0.5 we've improved concurrency across the board, boosting insert speed by over 50% on the stress.py benchmark (from contrib/) on a relatively modest 4-core system with 2GB of RAM. We've also added a [row] key cache, enabling similar relative improvements in reads:
</p><p>
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOUPwO26PMd1LChphTiKlOpx9-z8PfWVvBwNUpMhIgRglk3bfiV1j4mtkpAezpHZDNOnd6nZ2wL_p2wmNFMcPuVVGCp-dV-gut6b2VAc_6uXnNjCxRrR9FbfvptaIHL2fFqEL5Lw/s1600-h/cassandra+04+vs+05+single+machine.png"><img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 160px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOUPwO26PMd1LChphTiKlOpx9-z8PfWVvBwNUpMhIgRglk3bfiV1j4mtkpAezpHZDNOnd6nZ2wL_p2wmNFMcPuVVGCp-dV-gut6b2VAc_6uXnNjCxRrR9FbfvptaIHL2fFqEL5Lw/s400/cassandra+04+vs+05+single+machine.png" alt="" id="BLOGGER_PHOTO_ID_5430823145830768562" border="0" /></a>(You will note that unlike most systems, Cassandra <a href="http://wiki.apache.org/cassandra/FAQ#reads_slower_writes">reads are usually slower than writes</a>. 0.6 will narrow this gap with full row caching and mmap'd I/O, but fundamentally we think optimizing for writes is the right thing to do since writes have always been harder to scale.)
</p><p>
Log replay, flush, compaction, and range queries are also faster.
</p><p>
0.5 also brings new tools, including JSON-based data export and import, an improved command-line interface, and new JMX metrics.
</p><p>
One final note: like all distributed systems, Cassandra is designed to maximize throughput when under load from many clients. Benchmarking with a single thread or a small handful will not give you numbers representative of production (unless you only ever have four or five users at a time in production, I suppose). Please don't ask "why is Cassandra so slow" and offer up a single-threaded benchmark as evidence; that makes me sad inside. Here's 1000 words:
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiE2o8rsTYSUWwaD9OcBMi4Smv6YVsJ3_zceIPT0sq2yA8jTz057l3W-GLZv06kSHonoyCloyPQvY3M4GTCp2zrJ5NQSCbW4HHmE7sPbtcvPc1tkEUluDK9kL6fPUmVAoXpRchzTA/s1600-h/cassandra-inserts-vs-threads.png"><img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 160px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiE2o8rsTYSUWwaD9OcBMi4Smv6YVsJ3_zceIPT0sq2yA8jTz057l3W-GLZv06kSHonoyCloyPQvY3M4GTCp2zrJ5NQSCbW4HHmE7sPbtcvPc1tkEUluDK9kL6fPUmVAoXpRchzTA/s400/cassandra-inserts-vs-threads.png" alt="" id="BLOGGER_PHOTO_ID_5430823381437231426" border="0" /></a>
</p><p>
(Thanks to <a href="http://twitter.com/faltering">Brandon Williams</a> for the graphs.)</p>Jonathan Ellishttp://www.blogger.com/profile/11003648392946638242noreply@blogger.com13tag:blogger.com,1999:blog-11683713.post-39520484965016137992010-01-21T08:22:00.000-08:002010-05-17T05:17:31.953-07:00Linux performance basics<p>
I want to write about <a href="http://incubator.apache.org/cassandra/">Cassandra</a> performance tuning, but first I need to cover some basics: how to use vmstat, iostat, and top to understand what part of your system is the bottleneck -- not just for Cassandra but for any system.
</p><p>
</p><div style="font-size: 130%;"><b>vmstat</b></div>
You will typically run vmstat with "vmstat sampling-period", e.g., "vmstat 5". The output looks like this:
<p></p><pre class="code">
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
20  0 195540  32772   6952 576752    0    0    11    12   38    43  1  0 99  0
22  2 195536  35988   6680 575132    6    0  2952    14  959 16375 72 21  4  3
</pre>
The first line is your total system average since boot; typically this will not be very useful, since you are interested in what is causing problems NOW. Then you will get one line per sample period; most of the output is self-explanatory. The reason to start with vmstat is the "swap" section: si and so are swap in (memory read from disk) and swap out (memory written to disk). Remember that a little swapping is normal, particularly during application startup: by default, Linux will swap infrequently used pages of application memory to disk to free up more room for disk caching, <a href="http://lwn.net/Articles/83588/">even if there is enough ram</a> to accommodate all applications.
<p>
</p><div style="font-size: 130%;"><b>iostat</b></div>
To get more detail on I/O, use iostat -x. Again, you want to give it a sampling interval, and ignore the first set of output. iostat also gives you some cpu information but top does that better; let's focus on the Device section:
<p></p><pre class="code">
Device:  rrqm/s  wrqm/s    r/s    w/s   rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        9.80    0.20  36.60   0.40  5326.40     4.80   144.09     0.06   1.62   1.41   5.20
</pre>
There are 3 easy ways to tell if a disk is a probable bottleneck here, and none of them show up without the -x flag, so get in the habit of using that. "avgqu-sz" is the size of the io request queue; if it is large, there are lots of requests waiting in line. "await" is how long (in ms) the average request took to be satisfied (including time enqueued); recall that on non-SSDs, a single seek is between 5 and 10ms. Finally, "%util" is Linux's guess at how fully saturated the device is.
<p>
</p><div style="font-size: 130%;"><b>top</b></div>
To learn more about per-process CPU and memory usage, use "top." I won't paste top output here because everyone is so familiar with it, but I will mention a few useful things to know:
<p></p><ul><li>"P" and "M" toggle between sorting by cpu usage and sorting by memory usage</li><li>"1" toggles breaking down the CPU summary by CPU core</li><li>SHR (shared memory) is included in RES (resident memory)</li><li>Amount of memory belonging to a process that has been swapped out is VIRT - RES</li><li>a state (S column) of D means the process (or thread, see below) is waiting for disk or network i/o
</li><li>"steal" is how much CPU the hypervisor is giving to another VM in a virtual environment; as virtual provisioning becomes more common, <a href="http://alan.blog-city.com/has_amazon_ec2_become_over_subscribed.htm">avoiding noisy neighbors</a> is increasingly important</li></ul>
<p>
"top -H" will split out individual threads into their own lines; both per-process and per-thread views are useful. The per-thread view is particularly useful when dealing with Java applications since you can easily <a href="http://publib.boulder.ibm.com/infocenter/javasdk/tools/index.jsp?topic=/com.ibm.java.doc.igaa/_1vg0001475cb4a-1190e2e0f74-8000_1007.html">correlate them with thread names from the JVM</a> to see which threads are consuming your CPU. Briefly, you take the PID (thread ID) from top, convert it to hex -- e.g., "python -c 'print hex(12345)'" -- and match it with the corresponding thread ID from jstack.</p><p>Now you can troubleshoot with a process like: "Am I swapping? If so, what processes are using all the memory? If my application makes a lot of disk read requests, are my reads being cached or are they actually hitting the disk? If I am hitting the disk, is it saturated? How much 'hot data' can I have before I run out of cache room? Are any/all of my cpu cores maxed? Which threads are actually using the CPU? Which threads spend most of their time waiting for i/o?" Then if you go to ask for help tuning something, you can <a href="http://www.catb.org/%7Eesr/faqs/smart-questions.html">show that you've done your homework</a>.
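</p><p>Here is that TID-to-jstack correlation as a quick shell sketch (the TID 12345 and the $JPID variable are just illustrative):</p><pre class="code">
# The PID column in "top -H" is really the thread ID (TID).
TID=12345
# jstack labels each thread with its TID in hex, as nid=0x...
printf 'nid=0x%x\n' "$TID"    # prints nid=0x3039
# So, for a Java process with pid $JPID, find the matching thread:
# jstack "$JPID" | grep "$(printf 'nid=0x%x' "$TID")"
</pre><p>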
</p>Jonathan Ellishttp://www.blogger.com/profile/11003648392946638242noreply@blogger.com11tag:blogger.com,1999:blog-11683713.post-31556959438360054972009-12-15T14:01:00.001-08:002010-01-19T05:49:59.849-08:00Cassandra reading listI put together this list for a co-worker who wants to learn more about Cassandra: (<a href="https://svn.apache.org/repos/asf/incubator/cassandra/branches/cassandra-0.5/CHANGES.txt">0.5</a> beta 2 out now!)<ul><li><a href="http://wiki.apache.org/cassandra/GettingStarted">Getting Started</a>: Cassandra is surprisingly easy to try out. This walks you through both single-node and clustered setup.
</li><li><a href="http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html">The Dynamo paper</a> and <a href="http://www.allthingsdistributed.com/2008/12/eventually_consistent.html">Amazon's related article on eventual consistency</a>: Cassandra's replication model is strongly influenced by Dynamo's. Almost everything you read here also applies to Cassandra. (The major exceptions are vector clocks, though even that <a href="https://issues.apache.org/jira/browse/CASSANDRA-580">may change</a>, and Cassandra's <a href="http://spyced.blogspot.com/2009/05/consistent-hashing-vs-order-preserving.html">support for order-preserving partitioning</a> with active load balancing.)</li><li><a href="http://arin.me/code/wtf-is-a-supercolumn-cassandra-data-model">WTF is a SuperColumn</a>? Arin Sarkissian from Digg explains the Cassandra data model.</li><li><a href="http://wiki.apache.org/cassandra/Operations">Operations</a>: stuff you will want to know when you run Cassandra in production</li><li><a href="http://n2.nabble.com/Cassandra-users-survey-td4040068.html">Cassandra users survey from Nov 09</a>: What Twitter, Mahalo, Ooyala, SimpleGeo, and others are using Cassandra for
</li><li><a href="http://wiki.apache.org/cassandra/ArticlesAndPresentations">More articles here</a> (Cassandra on OS X seems to be a particularly popular topic)
</li></ul>If you want to know more about the internals, also see these:
<ul><li><a href="http://wiki.apache.org/cassandra/ArchitectureInternals">Internals documentation</a></li><li><a href="http://www.facebook.com/video/video.php?v=540974400803">Facebook presentation</a> and <a href="http://vimeo.com/5185526">NoSQL SF presentation</a>, by Avinash Lakshman (the second picks up almost where the first leaves off)</li><li><a href="http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf">LADIS 2009 paper</a> by Avinash Lakshman and Prashant Malik
</li></ul>Jonathan Ellishttp://www.blogger.com/profile/11003648392946638242noreply@blogger.com2tag:blogger.com,1999:blog-11683713.post-23423210154274284202009-07-29T10:34:00.000-07:002009-11-02T05:36:16.641-08:00Cassandra hackfest and OSCON reportThe best part of OSCON for me wasn't actually part of OSCON. The guys at Twitter put together a <a href="http://incubator.apache.org/cassandra/">Cassandra</a> <a href="http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cb6f68fc60907161612t5469a76ds6175f846ce29a05a@mail.gmail.com%3E">hackfest</a> on Wednesday night, with <a href="http://www.flickr.com/photos/_evan/3751880113/in/set-72157621750026309/">much awesomeness</a> resulting. Thanks to <a href="http://twitter.com/evan">Evan</a> for organizing!
<p>
<a href="http://twitter.com/stuhood">Stu Hood</a> flew up from Rackspace's Virginia offices just for the night, which normally probably wouldn't have been worth it, but <a href="http://twitter.com/moonpolysoft">Cliff Moon</a>, author of <a href="http://github.com/cliffmoon/dynomite/tree/master">dynomite</a>, showed up (thanks, Cliff!) and was able to give Stu a lot of pointers on <a href="https://issues.apache.org/jira/browse/CASSANDRA-193">implementing merkle trees</a>. Cliff and I also had a good discussion with Jun Rao about hinted handoff--Cliff and Jun are not fans, and I tend to agree with them--and <a href="http://www.allthingsdistributed.com/2008/12/eventually_consistent.html">eventual consistency</a>.
<p>
I also met <a href="http://blog.lostlake.org/">David Pollack</a> and got to talk a little about persistence for <a href="http://blog.lostlake.org/index.php?/archives/94-Lift,-Goat-Rodeo-and-Such.html">Goat Rodeo</a>, and talked to a ton of people from Twitter and Digg. I think those two, with Rackspace and IBM Research, constituted the companies with more than one engineer attending. The rest was "long tail."
<p>
Back at OSCON, my Cassandra talk was standing room only. Slides:
<div style="width: 425px; text-align: left;" id="__ss_1786870"><a style="margin: 12px 0pt 3px; font-family: Helvetica,Arial,Sans-serif; font-style: normal; font-variant: normal; font-weight: normal; font-size: 14px; line-height: normal; font-size-adjust: none; font-stretch: normal; display: block; text-decoration: underline;" href="http://www.slideshare.net/jbellis/cassandra-open-source-bigtable-dynamo" title="Cassandra: Open Source Bigtable + Dynamo">Cassandra: Open Source Bigtable + Dynamo</a><object style="margin: 0px;" width="425" height="355"><param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=cassandraopensourcebigtabledynamopresentation-090729134121-phpapp01&stripped_title=cassandra-open-source-bigtable-dynamo"><param name="allowFullScreen" value="true"><param name="allowScriptAccess" value="always"><embed src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=cassandraopensourcebigtabledynamopresentation-090729134121-phpapp01&stripped_title=cassandra-open-source-bigtable-dynamo" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"></embed></object><div style="font-size: 11px; font-family: tahoma,arial; height: 26px; padding-top: 2px;">View more <a style="text-decoration: underline;" href="http://www.slideshare.net/">documents</a> from <a style="text-decoration: underline;" href="http://www.slideshare.net/jbellis">jbellis</a>.</div></div>
My second talk is the one I would have preferred to give first, on "What Every Developer Should Know About Database Scalability". (I would have preferred to give it first so that I could have just said "come to my Cassandra talk for more details" instead of trying to cram that in at the end. But, it was in my proposal outline!) Slides: <div style="width: 425px; text-align: left;" id="__ss_1786869"><a style="margin: 12px 0pt 3px; font-family: Helvetica,Arial,Sans-serif; font-style: normal; font-variant: normal; font-weight: normal; font-size: 14px; line-height: normal; font-size-adjust: none; font-stretch: normal; display: block; text-decoration: underline;" href="http://www.slideshare.net/jbellis/what-every-developer-should-know-about-database-scalability" title="What Every Developer Should Know About Database Scalability">What Every Developer Should Know About Database Scalability</a><object style="margin: 0px;" width="425" height="355"><param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=scalingdatabases-090729134110-phpapp01&stripped_title=what-every-developer-should-know-about-database-scalability"><param name="allowFullScreen" value="true"><param name="allowScriptAccess" value="always"><embed src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=scalingdatabases-090729134110-phpapp01&stripped_title=what-every-developer-should-know-about-database-scalability" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"></embed></object><div style="font-size: 11px; font-family: tahoma,arial; height: 26px; padding-top: 2px;">View more <a style="text-decoration: underline;" href="http://www.slideshare.net/">documents</a> from <a style="text-decoration: underline;" href="http://www.slideshare.net/jbellis">jbellis</a>.</div></div>
Other OSCON talks I liked (that have slides available):
<ul><li><a href="http://en.oreilly.com/oscon2009/public/schedule/detail/8198">Gearman: Bringing the Power of Map/Reduce to Everyday Applications</a></li><li><a href="http://en.oreilly.com/oscon2009/public/schedule/detail/8230">High Performance SQL with PostgreSQL [8.4]
</a></li><li><a href="http://en.oreilly.com/oscon2009/public/schedule/detail/8432">Linux Filesystem Performance for Databases</a> (reiserfs blows everyone away for random writes, by a factor of > 2!?)</li><li><a href="http://en.oreilly.com/oscon2009/public/schedule/detail/8364">Neo4j - The Benefits of Graph Databases</a></li><li><a href="http://en.oreilly.com/oscon2009/public/schedule/detail/7823">Release Mismanagement: How to Alienate Users and Frustrate Developers</a></li></ul>Jonathan Ellishttp://www.blogger.com/profile/11003648392946638242noreply@blogger.com0tag:blogger.com,1999:blog-11683713.post-20871584129996909762009-07-06T10:56:00.000-07:002009-07-06T11:20:33.833-07:00Cassandra 0.3 update<p>
Two months after <a href="http://spyced.blogspot.com/2009/05/cassandra-03-release-candidate-and.html">the first release candidate</a>, <a href="http://incubator.apache.org/cassandra/">Cassandra</a> 0.3 is still not out. But, we're close!
<p>
We had two more bug-fix release candidates, and it's virtually certain that 0.3-final will be the same exact code as <a href="http://people.apache.org/%7Ejbellis/cassandra/cassandra-0.3.0-rc3.tar.gz">0.3-rc3</a>. (If you're using rc1, you do want to upgrade; see <a href="https://svn.apache.org/repos/asf/incubator/cassandra/tags/cassandra-0.3.0-rc3/CHANGES.txt">CHANGES.txt</a>.) But, we got stuck in the <a href="http://twitter.com/spyced/status/2497990811">ASF bureaucracy</a> and it's going to take at least <a href="http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Ce06563880907060804p731964d5k6cb6d7ab73d92767@mail.gmail.com%3E">one more round-trip</a> before the crack Release Prevention Team grudgingly lets us call it official.
<p>
In the meantime, <a href="https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&pid=12310865&status=5">work continues apace</a> on trunk for 0.4.Jonathan Ellishttp://www.blogger.com/profile/11003648392946638242noreply@blogger.com2tag:blogger.com,1999:blog-11683713.post-21148707550954676712009-06-23T09:40:00.000-07:002009-06-23T11:07:36.548-07:00Patch-oriented development made sane with git-svnOne of the drawbacks to working on <a href="http://incubator.apache.org/cassandra/">Cassandra</a> is that unlike every other OSS project I have ever worked on, we are using a patch-oriented development process rather than post-commit review. It's really quite painfully slow. Somehow this became sort of the default for ASF projects, but <a href="http://www.mail-archive.com/cassandra-dev@incubator.apache.org/msg00244.html">there is precedent</a> for switching to post-commit review eventually.
<p>
In the meantime, there is git-svn.
</p><p>
(The ASF does have <a href="http://wiki.apache.org/general/GitAtApache">a git mirror set up</a>, but I'm going to ignore that because (a) its reliability has been questionable and (b) sticking with git-svn hopefully makes this more useful for non-ASF projects.)
</p>
<p>
Disclaimer: <a href="http://spyced.blogspot.com/2008/12/frustrated-with-git.html">
I am not a git expert</a>, and probably some of this will make you cringe if you are. Still, I hope it will be useful for some others fumbling their way towards enlightenment. As background, I suggest the <a href="http://git.or.cz/course/svn.html">git crash course for svn users</a>. Just the parts up to the Remote section.
</p>
<p>
Checkout:
</p><ol>
<li>git-svn init https://svn.apache.org/repos/asf/cassandra/trunk cassandra
</li><li>cd cassandra; git-svn fetch <i>(init only writes the config; fetch actually pulls the history)</i>
</li></ol>
Once that's done, the only git-svn commands you need to know about are dcommit, to push the changes <i>in the current git branch</i> back to svn, and rebase, to pull changes from svn and re-apply your uncommitted patches on top of that (much like svn up).
<p>
Creating new code:
</p><ol>
<li>git checkout -b [ticket number]
</li><li>[edit stuff, maybe git add or git rm new or obsolete files]
</li><li>git commit -a -m 'commit'
</li><li>repeat 2-3 as necessary
</li><li><a href="http://github.com/dreiss/git-jira-attacher/tree/master">git-jira-attacher</a> [revision] (usually some variant of HEAD^^^)
</ol>
[after review]
<ol>
<li>git log <i>(just to make sure I'm about to commit what I think I'm about to commit)</i>
</li><li>git-svn dcommit
</li><li>git checkout master
</li><li>git-svn rebase -l <i>(this will put the changes you just committed into master)</i>
</li><li>git branch -d [ticket number]
</li></ol>
When I'm reviewing code it looks similar:
<ol>
<li>git checkout -b [ticket number]
</li><li>wget patches and git-apply, or <a href="http://github.com/eevans/git-jira-attacher/blob/b002ab0a0cd4d7c9f3801df4e8664e9fc3711053/jira-apply">jira-apply</a> CASSANDRA-[ticket-number]
</li><li>review in gitk/qgit and/or IDE (the intellij git plugin is quite decent)
</li><li>commit .. branch -d as above
</li></ol>
The last operation is "see who I need to bug to get reviews moving." This is just a list of the branches I haven't merged into master and deleted yet:
<ol>
<li>git branch
</li></ol>
Git-svn takes a lot of the pain out of the ASF's patch-and-jira workflow. In particular, you can easily break changes for a ticket up into multiple patches that are easily reviewed, and the latency of waiting for patch review doesn't kill your throughput so badly since you can just leave that branch alone and start a new one for your next piece of functionality. And of course you get git commit --amend and git rebase -i for massaging patches during the review process.
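<p>Here's a self-contained sketch of that squash step, using GIT_SEQUENCE_EDITOR as a non-interactive stand-in for the editor you'd normally get from git rebase -i (the throwaway repo and the "wip" commits are purely illustrative):</p><pre class="code">
# Build a throwaway repo with three exploratory commits on top of a base.
cd "$(mktemp -d)" && git init -q .
git config user.email you@example.com && git config user.name you
echo base > f && git add f && git commit -qm 'base'
for n in 1 2 3; do echo "wip $n" >> f && git commit -qam "wip $n"; done

# Interactively you'd run "git rebase -i HEAD~3" and mark the later
# commits as "fixup"; here a sed one-liner edits the todo list for us.
GIT_SEQUENCE_EDITOR='sed -i "2,\$s/^pick/fixup/"' git rebase -i HEAD~3

git log --oneline    # now just "base" plus a single combined "wip 1"
</pre>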
<p>
One fairly common complication is if you finish a ticket A, then start on ticket B (that depends on A) while waiting for A to be reviewed. So you checkout -b from your branch A rather than master and build some patches on that. As sometimes happens, the reviewer finds something you need to improve in your patch set for A, so you make those changes. Now you need to rebase your B patches on top of the changes you made to A. The best way to do this is to branch A to B-2, then git cherry-pick your commits from B onto it, resolving conflicts as necessary.
</p><p>
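That dance, end to end (branch names A and B as above; the repo here is a throwaway one so the example stands alone):</p><pre class="code">
# Setup: a base commit, ticket A, and ticket B built on top of A.
cd "$(mktemp -d)" && git init -q .
git config user.email you@example.com && git config user.name you
echo 0 > file && git add file && git commit -qm 'base'
git checkout -qb A && echo A1 >> file && git commit -qam 'ticket A'
git checkout -qb B && echo B1 > other && git add other && git commit -qm 'ticket B'

# Review feedback arrives: revise A.
git checkout -q A && echo A2 >> file && git commit -qam 'ticket A, revised'

# Branch A to B-2, then cherry-pick B's work onto it.
git checkout -qb B-2 A
git cherry-pick B    # resolve conflicts here if there are any
</pre><p>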
Final note: I often like to create lots of small commits as I am exploring a solution and combine them into larger units with git rebase -i for patch submission. (It's easier to combine small patches than to pull apart large ones.) So my early commit messages are often terse and need editing. You can change commit messages with edit mode in rebase, then using commit --amend and rebase --continue, but that is tedious. I complained about this to my friend <a href="http://twitter.com/zwily">Zach Wily</a> and he made this git amend-message command (place in [alias] in your .gitconfig):
</p><pre>
amend-message = "!bash -c ' \
c=$0; \
if [ $c == \"bash\" ]; then echo \"Usage: git amend-message <commit>\"; exit 1; fi; \
saved_head=$(git rev-parse HEAD); \
commit=$(git rev-parse $c); \
commits=$(git log --reverse --pretty=format:%H $commit..HEAD); \
echo \"Rewinding to $commit...\"; \
git reset --hard $commit; \
git commit --amend; \
for X in $commits; do \
echo \"Applying $X...\"; \
git cherry-pick $X >> /dev/null; \
if [ $? -ne 0 ]; then \
echo \" apply failed (is this a merge?), rolling back all changes\"; \
git reset --hard $saved_head; \
echo \" ** AMEND-MESSAGE FAILED, sorry\"; \
exit 1; \
fi; \
done; \
echo \"Done\"'"
</pre>
(Zach would like the record to show that he knows this is pretty hacky. "For instance, it won't work if one of the commits after the one you're changing is a merge, since cherry-pick can't handle those." But it's quite useful, all the same.)
<p>
For what it's worth, the rest of my aliases are
</p><pre>
st = status
ci = commit
co = checkout
br = branch
cp = cherry-pick
</pre>Jonathan Ellishttp://www.blogger.com/profile/11003648392946638242noreply@blogger.com7tag:blogger.com,1999:blog-11683713.post-72566994474147242062009-05-27T06:59:00.000-07:002009-05-27T08:21:14.068-07:00Why you won't be building your killer app on a distributed hash tableI ran across <a href="http://seattleweb.intel-research.net/people/lamarca/pubs/paper-ChaRam.pdf">A case study in building layered DHT applications</a> while doing some research on <a href="https://issues.apache.org/jira/browse/CASSANDRA-192">implementing load-balancing in Cassandra</a>. The question addressed is, "Are DHTs a general-purpose tool that you can build more sophisticated services on?"
<p>
Short version: no. A few specialized applications can and have been built on a plain DHT, but most applications built on DHTs have ended up having to customize the DHT's internals to achieve their functional or performance goals.
<p>
This paper describes the results of attempting to build a relatively complex datastructure (prefix hash trees, for range queries) on top of OpenDHT. The result was mostly failure:
<blockquote style="font-style: italic;">A simple put-get interface was not quite enough. In particular, OpenDHT relies on timeouts to invalidate entries and has no support for atomicity primitives... In return for ease of implementation and deployment, we sacrificed performance. With the OpenDHT implementation, a PHT query operation took a median of 2–4 seconds. This is due to the fact that layering entirely on top of a DHT service inherently implies that applications must perform a sequence of put-get operations to implement higher level semantics with limited opportunity for optimization within the DHT.</blockquote>In other words, there are two primary problems with the DHT approach:
<ul><li>Most DHTs will require a second locking layer to achieve correctness when implementing a more complex data structure on top of the DHT semantics. In particular, this will certainly apply to eventually-consistent systems in the Dynamo mold.</li><li>Advanced functionality like range queries needs to be supported natively to be at all efficient.
</li></ul>While they spin this in a positive manner -- "hey, at least it didn't take much code" -- the reality is that for most of us, query latency of two to four seconds is several orders of magnitude away from acceptable.
<p>
This is one reason why I think <a href="http://incubator.apache.org/cassandra/">Cassandra</a> is the most promising of the open-source distributed databases -- you get a relatively rich data model and a distribution model that supports efficient range queries. These are not things that can be grafted on top of a simpler DHT foundation, so Cassandra will be useful for a wider variety of applications.Jonathan Ellishttp://www.blogger.com/profile/11003648392946638242noreply@blogger.com14tag:blogger.com,1999:blog-11683713.post-50497729543194795992009-05-18T15:19:00.002-07:002012-04-16T07:14:06.230-07:00Belated 2009 Introduction to SQLAlchemy slidesI was asked to put my slides up again -- sorry it took so long. The slides and code samples are now up <a href="http://people.apache.org/%7Ejbellis/sqla2009/">here</a>. Video of the tutorial <a href="http://blip.tv/file/1998818">is also up</a>. (3 parts, first is linked). There's definitely audio problems in parts but at least some is watchable.Jonathan Ellishttp://www.blogger.com/profile/11003648392946638242noreply@blogger.com3tag:blogger.com,1999:blog-11683713.post-62196649890872209162009-05-13T20:18:00.000-07:002010-07-29T21:34:16.724-07:00Cassandra 0.3 release candidate and progressWe have a release candidate out for <a href="http://incubator.apache.org/cassandra/">Cassandra</a> 0.3. Grab the <a href="http://people.apache.org/%7Ejbellis/cassandra/cassandra-0.3-rc.tgz">download</a> and check out <a href="http://wiki.apache.org/cassandra/GettingStarted">how to get started</a>. The <a href="http://www.facebook.com/video/video.php?v=540974400803">facebook presentation</a> from almost a year ago now is also still a good intro to some of the features and data model.
<p>
<span style="font-weight: bold;">Cassandra in a nutshell</span>:
</p><ul><li>Scales writes very, very well: just add more nodes!</li><li>Has a much richer data model than vanilla key/value stores -- closer to what you'd be used to in a relational db.</li><li>Is pretty bleeding edge -- to my knowledge, Facebook is the only group running Cassandra in production. (Their largest cluster is <a href="http://groups.google.com/group/cassandra-user/msg/85a83621d07ff165">120 machines and 40TB of data</a>.) At Rackspace we are working on a Cassandra-based app now that 0.3 has the extra features we need.</li><li>Moved to the Apache Incubator about 40 days ago, at which point development greatly accelerated.
</li></ul><span style="font-weight: bold;">Changes in 0.3 include</span>
<ul><li>Range queries on keys, including user-defined key collation.</li><li>Support for removes (deletes), which is nontrivial in an eventually consistent world.
</li><li>Workaround for a weird bug in JDK select/register that seems particularly common on VM environments. Cassandra should deploy fine on EC2 now. (Oddly, it never had problems on Slicehost / Cloud Servers, which is also Xen-based.)</li><li>Much improved infrastructure: the beginnings of a decent test suite ("ant test" for unit tests; "nosetests" for system tests), code coverage reporting, etc.</li><li>Expanded node status reporting via JMX
</li><li>Improved error reporting/logging on both server and client
</li><li>Reduced memory footprint in default configuration</li><li>and plenty of bug fixes.</li></ul>For those of you just joining us, Cassandra already had
<ul><li>An advanced on-disk storage engine that never does random writes</li><li>Transaction log-based data integrity</li><li>P2P gossip failure detection
</li><li>Read repair</li><li>Hinted handoff</li><li>Bootstrap (adding new nodes to a running cluster)</li></ul>(Read repair and hinted handoff are discussed in more detail in the <a href="http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf">Dynamo paper</a>.)
<p>
The cassandra development and user community is also growing at an exciting pace. Besides the original two developers from Facebook, we now have five developers regularly contributing improvements and fixes, and many others on a more ad-hoc basis.
</p><p>
<span style="font-weight: bold;">How fast is it?</span>
</p><p>
In a nutshell, Cassandra is much faster than relational databases, and much slower than memory-only systems or systems that don't sync each update to disk. Actual benchmarks are <a href="http://blog.oskarsson.nu/2009/05/vpork.html">in the works</a>. We plan to start performance tuning with the next release, but if you want to benchmark it, here are some suggestions to get numbers closer to what you'll see in the wild (and about 10x more throughput than if you don't do these):
</p><ul><li>Do enough runs of your benchmark first that each operation tested by your suite runs 20k times before timing it for real. This will allow the JVM jit to compile down to machine code; otherwise you'll just be getting the interpreted version.</li><li>Change the root logger level in conf/log4j.properties from DEBUG to INFO; we do a LOT of logging for debuggability and for small column values the logging has more overhead than the actual workload. (It would be even faster if we were to <a href="http://surguy.net/articles/removing-log-messages.xml">remove them entirely</a> but that didn't make this release.)</li></ul>Jonathan Ellishttp://www.blogger.com/profile/11003648392946638242noreply@blogger.com5tag:blogger.com,1999:blog-11683713.post-67320101630995722502009-05-01T13:54:00.000-07:002009-05-04T17:00:46.230-07:00A better analysis of Cassandra than mostVladimir Sedach wrote a three-part dive into <a href="http://incubator.apache.org/cassandra/">Cassandra</a>. (Almost two months ago now. Guess I need to set up Google Alerts. Trouble is there's a surprising amount of noise around the word `cassandra.`)
<ul><li><a href="http://carcaddar.blogspot.com/2009/03/cassandra-of-facebook-or-tale-of.html">Part 0
</a></li><li><a href="http://carcaddar.blogspot.com/2009/03/cassandra-of-facebook-or-tale-of_10.html">Part 1</a>
</li><li><a href="http://carcaddar.blogspot.com/2009/03/cassandra-of-facebook-or-tale-of_1895.html">Part 2</a></li></ul>A few notes:
<ul><li>We now have an <a href="http://spyced.blogspot.com/2009/05/consistent-hashing-vs-order-preserving.html">order-preserving partitioner</a> as well as the hash-based one</li><li>Yes, if you tell Cassandra to wait for all replicas to be ack'd before calling a write a success, then you would have traditional consistency (as opposed to "eventual"), but you'd also have no tolerance for hardware failures, and tolerating those is a main point of this kind of system.</li><li>Zookeeper is not currently used by Cassandra, although we have plans to use it in the future.
</li><li>Load balancing is not implemented yet.
</li><li>The move to Apache is <a href="http://incubator.apache.org/cassandra/">finished</a> and development is active there now.
</li></ul>Jonathan Ellishttp://www.blogger.com/profile/11003648392946638242noreply@blogger.com2tag:blogger.com,1999:blog-11683713.post-73359938426811499752009-05-01T07:44:00.000-07:002010-10-11T20:54:44.189-07:00Consistent hashing vs order-preserving partitioning in distributed databases<p>
The <a href="http://incubator.apache.org/cassandra/">Cassandra</a> distributed database supports two partitioning schemes now: the traditional <a href="http://en.wikipedia.org/wiki/Consistent_hashing">consistent hashing</a> scheme, and an order-preserving partitioner.
</p><p>
The reason that almost all similar systems use consistent hashing (the <a href="http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html">Dynamo paper</a> has the best description; see sections 4.1-4.3) is that it provides a kind of brain-dead load balancing from the hash algorithm spreading keys across the ring. But the Dynamo authors go into some detail about how this by itself doesn't actually give good results in practice; their solution was to assign multiple tokens to each node in the cluster, and they describe several approaches to that. But Cassandra's original designer considers this <a href="http://groups.google.com/group/cassandra-user/msg/b0e9eed9116f0337">a hack</a> and prefers <a href="http://groups.google.com/group/cassandra-dev/msg/b3d67acf35801c41">real load balancing</a>.
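The hash-based scheme can be sketched in a few lines of Python — one token per node, MD5 as the hash, and none of the multiple-tokens-per-node machinery the Dynamo authors describe (a toy model, not Cassandra's actual partitioner):

```python
import hashlib
from bisect import bisect

class HashRing:
    """Toy consistent-hash ring: each node gets one token, and a key
    belongs to the first node whose token is >= the key's hash,
    wrapping around at the top of the ring."""
    def __init__(self, nodes):
        self.tokens = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # bisect finds the first token >= the key's hash; the modulo
        # wraps keys past the last token back to the first node.
        i = bisect(self.tokens, (self._hash(key),)) % len(self.tokens)
        return self.tokens[i][1]
```

Because keys land wherever the hash sends them, load spreads out for free — but any notion of key ordering is destroyed, which is exactly what the order-preserving partitioner trades back.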
</p><p>
An order-preserving partitioner, where keys are distributed to nodes in their natural order, has huge advantages over consistent hashing, particularly the ability to do range queries across the keys in the system (which has also been committed to Cassandra now). This is important, because the corollary of "the partitioner uses the key to determine what node the data is on" is, "each key should only have a relatively small amount of data associated with it, compared to a node's capacity (see the <a href="http://cwiki.apache.org/confluence/display/CSDR/Data+Model">data model explanation</a>)." Cassandra column families will often have more columns in them than you'd see in a traditional database, but "millions" is pushing it (depending on column size) and "billions" is a bad idea. So you'll want to model things such that you spread data across multiple keys, and if you then pick an appropriate key naming convention, range queries will let you slice and dice that data as needed.
</p><p>
Cassandra is still in the process of implementing load balancing, but in the meantime order-preserving partitioning is still useful without it <span style="font-style: italic;">if</span> you know what your key distribution will look like in advance and can pick your node tokens accordingly. Otherwise, there's always the old-school hash-based partitioner until we get that done (for the release after the one we'll have in the next week or so).
</p><p>
See the <a href="http://cwiki.apache.org/confluence/display/CSDR/Index">introduction</a> and <a href="http://cwiki.apache.org/confluence/display/CSDR/GettingStarted">getting started</a> pages of the Cassandra wiki for more on Cassandra, and drop us a line on the mailing list or in IRC if you have questions; we're actively trying to improve our docs.</p>Jonathan Ellishttp://www.blogger.com/profile/11003648392946638242noreply@blogger.com3tag:blogger.com,1999:blog-11683713.post-54484877372875979312009-04-30T14:58:00.000-07:002009-11-20T05:29:06.176-08:00Automatic project structure inference<p>
David MacIver has an interesting blog entry up about <a href="http://www.drmaciver.com/2009/04/determining-logical-project-structure-from-commit-logs">determining logical project structure via commit logs</a>. I was very interested because <a href="https://issues.apache.org/jira/browse/CASSANDRA-27">one of Cassandra's oldest issues</a> is creating categories for our JIRA instance. (I've never been a big fan of JIRA, but you work with the tools you have. Or the ones the ASF inflicts on you, in this case.)
<p>
The desire to add extra work to issue reporting for a young project like Cassandra strikes me as slightly misguided in the first place. I have what may be an excessive aversion to overengineering, and I like to see a very clear benefit before adding complexity to anything, even an issue tracker. Still, I was curious to see what David's clustering algorithm made of things. And after pestering him to show me how to run his code, I figure I owe it to him to <a href="http://people.apache.org/%7Ejbellis/maciver-clusters.txt">show my results</a>.
<p>
In general it did a pretty good job, particularly with the mid-sized groups of files. The large groups are just noise; the small groups, well, it's not exactly a revelation that Filter and FilterTest go together. I'd be tempted to play with it more but with only about two months and 250 commits in the apache repo there's not really all that much data there. (Cassandra's first two years were in an internal Facebook repository.) Working with data that exists as a side effect of natural activity is fascinating.Jonathan Ellishttp://www.blogger.com/profile/11003648392946638242noreply@blogger.com1tag:blogger.com,1999:blog-11683713.post-27206993009175168402009-04-11T06:52:00.000-07:002011-11-04T20:26:33.846-07:00The best PyCon talk you didn't seeThere were a lot of good talks at PyCon, but I humbly submit that the best one you haven't seen yet is Robert Brewer's <a href="http://us.pycon.org/2009/conference/schedule/event/70/">talk on DejaVu</a>. Robert describes how his <a href="http://www.aminus.net/geniusql/chrome/common/doc/trunk/">Geniusql</a> layer <span style="font-style: italic;">disassembles and parses Python bytecode</span> to let his ORM turn Python lambdas into SQL. Microsoft got a lot of press for doing something similar for .NET with <a href="http://msdn.microsoft.com/en-us/netframework/aa904594.aspx">LINQ</a>, but Bob was there first.
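You can see the raw material Geniusql works from with the standard library's dis module (a toy illustration of the idea, assuming nothing about Geniusql's internals; Comic, Title, and Views mirror the example below):

```python
import dis

# The kind of lambda a bytecode-walking ORM would translate to SQL:
f = lambda c: 'Hob' in c.Title or c.Views > 0

# Every attribute access and comparison in the expression shows up as
# an opcode (LOAD_ATTR, COMPARE_OP, ...), which is what lets the ORM
# rebuild the expression as a WHERE clause instead of executing it:
ops = [ins.opname for ins in dis.get_instructions(f)]
```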
<pre>box = store.new_sandbox()
print [c.Title for c in box.recall(
    Comic, lambda c: 'Hob' in c.Title or c.Views > 0)]
</pre>This is cool as hell. The Geniusql part starts about 15 minutes in.Jonathan Ellishttp://www.blogger.com/profile/11003648392946638242noreply@blogger.com4tag:blogger.com,1999:blog-11683713.post-63584651332078642672009-04-07T07:57:00.000-07:002009-04-07T08:03:02.030-07:00Credit where credit is dueI'm starting to conclude that <a href="http://spyced.blogspot.com/2008/12/frustrated-with-git.html">git just doesn't fit my brain</a>. Several months in, I'm still confused when things don't work the way they "should." My co-worker says I should start a wiki for weird-ass things to do with git: "You keep coming up with use cases that would never occur to me."
But I have to give the git community credit: I've never gone into #git on freenode and gotten less than fantastic help. Even with git-svn.Jonathan Ellishttp://www.blogger.com/profile/11003648392946638242noreply@blogger.com2