Notes on LLMap

Lessons learned writing LLMap

I wrote LLMap to solve code search in the Apache Cassandra repo. Cassandra is too large (~200kloc across ~2500 files, about 4.5M tokens) to throw at even the largest LLM context window. And of course there are many codebases larger still.

The idea is simple: ask the LLM to do the work. But getting it to work consistently was harder than I expected. Here are a few of the hiccups I ran into and how I worked around them.

DeepSeek V3 can't classify things without thinking first

Recall that LLMap optimizes the problem by using a multi-stage analysis to avoid spending more time than necessary analyzing obviously irrelevant files (a sketch in code follows the list):

  1. Coarse analysis using code skeletons
  2. Full source analysis of potentially relevant files from (1)
  3. Refine the output of (2) to only the most relevant snippets
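
Here's a rough sketch of how those stages fit together. Everything below is illustrative: the helper functions and prompts are placeholders, not LLMap's actual API.

```python
def llm(prompt: str) -> str:
    """Placeholder for a chat-completion call, e.g. to DeepSeek V3."""
    raise NotImplementedError

def skeleton(path: str) -> str:
    """Placeholder: extract a skeleton (signatures, not bodies) from a file."""
    raise NotImplementedError

def is_relevant(reply: str) -> bool:
    """Placeholder: parse the model's verdict (see the discussion below)."""
    raise NotImplementedError

def search(question: str, files: list[str]) -> str:
    # Stage 1: coarse relevance filter over cheap code skeletons.
    survivors = [f for f in files
                 if is_relevant(llm(f"{question}\n\n{skeleton(f)}"))]
    # Stage 2: full-source analysis of the survivors only.
    analyses = [llm(f"{question}\n\n{open(f).read()}") for f in survivors]
    # Stage 3: refine down to only the most relevant snippets.
    return llm(question + "\n\n" + "\n\n".join(analyses))
```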

It turns out that if you just ask DeepSeek V3 to classify a skeleton as relevant/irrelevant, you get garbage results. Sometimes it calls everything relevant, sometimes nothing.

While LLMap works just as well if you skip the first stage, I really wanted to get the skeleton pass working since analyzing full source is ~5x as slow and expensive as analyzing a skeleton.

My first solution was to do multiple passes: first ask it to think about the file's potential relevance, then ask it to classify. This worked, but having to perform two passes severely cut into the advantage of working with skeletons.

Asking it to conclude its initial thinking with a relevance classification sort of worked, but parsing out the conclusion did not: the model would say things like "not relevant" in the middle of its reasoning, so simply searching the response for the word "relevant" produced false positives. And I didn't want to resort to formally structured responses, because LLMs are consistently dumber when they have to spend effort adhering to JSON.

What ended up working was asking it to classify using the special identifiers LLMAP_RELEVANT and LLMAP_IRRELEVANT. It was able to do that almost reliably -- but still, sometimes (under 1% of the time) it would forget to follow instructions.
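
Parsing the verdict out of the response is then straightforward. Here's a minimal sketch; the tail-window trick and its size are my choices for illustration, not necessarily what LLMap does:

```python
def classify(response: str) -> bool | None:
    # Only look at the end of the response, where the model was told to
    # put its conclusion, so a "not relevant" aside earlier in its
    # reasoning can't trigger a false positive.
    tail = response.strip()[-200:]
    if "LLMAP_IRRELEVANT" in tail:
        return False
    if "LLMAP_RELEVANT" in tail:
        return True
    return None  # the model forgot the sentinel; see the next section
```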

Don't be afraid to continue the conversation

It turns out that when it forgets to conclude with LLMAP_RELEVANT or LLMAP_IRRELEVANT, just asking again in the same conversation is enough to solve the problem.

I still want to avoid multiple passes for everything, but adding a second round trip for a handful of files is fine. (And it's more successful than repeating the same query from scratch.)
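
Here's a sketch of that retry, assuming DeepSeek's OpenAI-compatible chat endpoint and reusing the classify helper above; the prompt wording and retry limit are illustrative:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="...")

def classify_with_retry(messages: list[dict]) -> bool | None:
    for _ in range(2):  # first attempt, plus at most one follow-up nudge
        reply = client.chat.completions.create(
            model="deepseek-chat", messages=messages,
        ).choices[0].message.content
        verdict = classify(reply)
        if verdict is not None:
            return verdict
        # Continue the same conversation rather than starting over.
        messages = messages + [
            {"role": "assistant", "content": reply},
            {"role": "user",
             "content": "Conclude with LLMAP_RELEVANT or LLMAP_IRRELEVANT."},
        ]
    return None
```

The same pattern drives the double-check in the refining stage below: append the assistant's answer plus a follow-up request to the message list and go one more round.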

I also take advantage of this in the third, refining stage. Since only a few queries are made there (almost always three or fewer for the Cassandra codebase), I actually do make two passes on each one, just to ask R1: "Take one more look and make sure you didn't miss anything important."

Structuring requests this way takes advantage of DeepSeek's caching, so a continued conversation is cheaper than two equivalent uncached requests.

It's easier to refactor than to write something new

I really didn't want to become an expert in tree-sitter parsing to generate the code skeletons, so I asked Claude to write it for me, giving it Aider's similar repo-map code as an example.

Claude crashed and burned. Repeatedly.

Eventually I bit the bullet and hand-wrote a recursive parser. But when I later needed to support very large Java files, I revisited the parser to add semantic chunking, and it turned out that Claude was able to transform the recursive parser into one built on tree-sitter's declarative queries with minimal hand-holding.
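
For flavor, here's roughly what the declarative style looks like: a query pattern does the tree-walking for you. This sketch assumes py-tree-sitter 0.23 and the tree-sitter-java grammar package; the query API has shifted between versions, so treat the details as approximate:

```python
import tree_sitter_java as tsjava
from tree_sitter import Language, Parser

JAVA = Language(tsjava.language())
parser = Parser(JAVA)

# Declaratively capture every method name instead of hand-writing a
# recursive descent over the syntax tree.
METHODS = JAVA.query("(method_declaration name: (identifier) @name)")

def method_names(source: bytes) -> list[str]:
    tree = parser.parse(source)
    captures = METHODS.captures(tree.root_node)  # {capture_name: [nodes]}
    return [node.text.decode() for node in captures.get("name", [])]

print(method_names(b"class A { void foo() {} int bar() { return 1; } }"))
# ['foo', 'bar']
```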

Usually I leverage this the other way around by having the LLM do a rough draft and then improve it. Interesting that it can work both ways.

On a related note: after I spent a few hours having Claude extend the parsing to support Python, the code was getting pretty messy. So I threw it at o1-pro and asked it to simplify things. It delivered.

I still don't have a great intuition for when o1-pro can outperform Claude at coding, but this was one of its wins.

Claude is my ORM

I switched from file-based to sqlite-based caching late in development. Claude nailed the conversion on the first try. I glanced at the code it wrote and didn't see anything obviously crufty, so I left it alone. It's the only file in the repo that is 100% AI-generated.
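
The whole idea fits in a page. A minimal sqlite-backed cache might look like this; the schema and names are illustrative, not the actual generated file:

```python
import sqlite3

class Cache:
    """Key-value cache for LLM responses, backed by a sqlite file."""

    def __init__(self, path: str = "llmap_cache.db"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)")

    def get(self, key: str) -> str | None:
        row = self.db.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None

    def put(self, key: str, value: str) -> None:
        self.db.execute(
            "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
            (key, value))
        self.db.commit()
```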

Conclusion

My biggest takeaway is that even though the core idea ("ask the LLM to do the work") is straightforward, getting robust, repeatable results at scale requires iterative design and careful handling of edge cases. DeepSeek V3's classification quirks forced a hybrid approach: special identifiers for classification, a second round trip when needed, and gentle nudges when the model forgot its instructions.

(I found Flash to be a lot better at following instructions without a ton of gymnastics, but Flash is too rate-limited to be useful, so here we are.)

Give LLMap a try! Extending it with Aider should be straightforward; PRs are welcome.
