Lessons learned writing LLMap
I wrote LLMap to solve code search in the Apache Cassandra repo. Cassandra is too large (~200kloc across ~2500 files, about 4.5M tokens) to throw at even the largest LLM context window. And of course there are many codebases larger still.
The idea is simple: ask the LLM to do the work. But getting it to work consistently was harder than I expected. Here are a few of the hiccups I ran into and how I worked around them.
DeepSeek V3 can't classify things without thinking first
Recall that LLMap keeps costs down by using a multi-stage analysis, so it avoids spending more time than necessary on obviously irrelevant files (a rough sketch follows the list):
1. Coarse analysis using code skeletons
2. Full source analysis of potentially relevant files from (1)
3. Refine the output of (2) to only the most relevant snippets
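In code, the flow looks roughly like the sketch below. The stage functions are placeholders supplied by the caller, not LLMap's actual API:

```python
from typing import Callable

# Rough sketch of the three-stage flow; the stage functions are placeholders,
# not LLMap's real implementation.
def run_pipeline(
    question: str,
    files: list[str],
    skeleton_of: Callable[[str], str],          # file -> code skeleton
    is_relevant: Callable[[str, str], bool],    # (question, skeleton) -> keep?
    analyze: Callable[[str, str], str],         # (question, file) -> analysis
    refine: Callable[[str, list[str]], str],    # (question, analyses) -> snippets
) -> str:
    # Stage 1: coarse, cheap pass over skeletons to drop obviously irrelevant files
    candidates = [f for f in files if is_relevant(question, skeleton_of(f))]
    # Stage 2: full-source analysis of whatever survived stage 1
    analyses = [analyze(question, f) for f in candidates]
    # Stage 3: boil the stage 2 output down to the most relevant snippets
    return refine(question, analyses)
```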
It turns out that if you just ask DeepSeek V3 to classify a skeleton as relevant or irrelevant, you get garbage results. Sometimes it calls everything relevant, sometimes nothing.
While LLMap works just as well if you skip the first stage, I really wanted to get the skeleton pass working since analyzing full source is ~5x as slow and expensive as analyzing a skeleton.
My first solution was to do multiple passes: first ask the model to think about the file's potential relevance, then ask it to classify. This worked, but performing two passes severely cut into the advantage of working with skeletons.
Asking it to conclude its initial thinking with a relevance classification sort of worked, but parsing out that conclusion was unreliable: the model would say things like "not relevant" in the middle of its thinking, so just searching for the word "relevant" produced false positives. And I didn't want to resort to formally structured responses, because LLMs are consistently dumber when they have to spend effort adhering to JSON.
What ended up working was asking it to classify using the special identifiers LLMAP_RELEVANT and LLMAP_IRRELEVANT. It was able to do that almost reliably -- but still, sometimes (under 1% of the time) it would forget to follow instructions.
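With sentinel identifiers, the check only has to look at how the reply ends, which avoids tripping over the word "relevant" earlier in the model's thinking. A minimal sketch, not LLMap's exact parsing code:

```python
# Sketch: classify based on the sentinel the model was asked to end with.
# Returning None means the model forgot the sentinel and should be re-asked.
def parse_classification(response: str) -> bool | None:
    tail = response.strip()[-50:]       # only trust the very end of the reply
    if "LLMAP_IRRELEVANT" in tail:
        return False
    if "LLMAP_RELEVANT" in tail:
        return True
    return None
```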
Don't be afraid to continue the conversation
It turns out that when it forgets to conclude with LLMAP_RELEVANT or LLMAP_IRRELEVANT, just asking a second time is enough to solve the problem.
I still want to avoid multiple passes for everything, but adding a second round trip for a handful of files is fine. (It's also more successful than repeating the same query from scratch.)

I also take advantage of this in the third, refining stage. Since only a few queries are made there (almost always 3 or fewer for the Cassandra codebase), I actually do make two passes on each one, just to ask R1: "Take one more look and make sure you didn't miss anything important."
Structuring requests this way takes advantage of DeepSeek's caching, so a continued conversation is cheaper than two equivalent uncached requests.
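Here's a sketch of what the retry looks like against DeepSeek's OpenAI-compatible endpoint. The prompts and the retry wording are placeholders, not LLMap's actual ones; the point is that the follow-up reuses the same messages list, so the shared prefix can hit the cache:

```python
from openai import OpenAI

# DeepSeek exposes an OpenAI-compatible API; prompts below are illustrative only.
client = OpenAI(base_url="https://api.deepseek.com", api_key="sk-...")

def ask(messages: list[dict]) -> str:
    reply = client.chat.completions.create(model="deepseek-chat", messages=messages)
    return reply.choices[0].message.content

def classify_with_retry(skeleton: str, question: str) -> bool:
    messages = [{
        "role": "user",
        "content": f"{skeleton}\n\nIs this file relevant to: {question}?\n"
                   "Think it through, then end with LLMAP_RELEVANT or LLMAP_IRRELEVANT.",
    }]
    answer = ask(messages)

    if "LLMAP_RELEVANT" not in answer and "LLMAP_IRRELEVANT" not in answer:
        # The model forgot the sentinel; continue the same conversation rather
        # than starting over. The shared prefix also benefits from caching.
        messages.append({"role": "assistant", "content": answer})
        messages.append({"role": "user",
                         "content": "Please end with LLMAP_RELEVANT or LLMAP_IRRELEVANT."})
        answer = ask(messages)

    return "LLMAP_RELEVANT" in answer
```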
It's easier to refactor than to write something new
I really didn't want to become an expert in tree-sitter parsing to generate the code skeletons, so I asked Claude to write it for me, giving it Aider's similar repo-map code as an example.
Claude crashed and burned. Repeatedly.
Eventually I bit the bullet and hand-wrote a recursive parser. But when I later needed to support very large Java files, I revisited the parser to add semantic chunking, and it turned out that Claude was able to transform the recursive parser into one using tree-sitter's declarative queries with minimal hand-holding.
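For a flavor of what skeleton extraction involves, here is a heavily simplified recursive walk over a py-tree-sitter parse tree: it keeps class and method signatures and drops method bodies. LLMap's real parser does quite a bit more, and the exact Parser/Language setup varies slightly across py-tree-sitter versions:

```python
import tree_sitter_java
from tree_sitter import Language, Parser

# Simplified skeleton extraction: keep class/method signatures, drop bodies.
# A real implementation also needs fields, comments, error handling, etc.
JAVA = Language(tree_sitter_java.language())
parser = Parser(JAVA)  # older py-tree-sitter: Parser(), then set_language(JAVA)

def skeleton(node, src: bytes, depth: int = 0) -> list[str]:
    indent = "    " * depth
    lines = []
    for child in node.children:
        if child.type in ("class_declaration", "interface_declaration", "enum_declaration"):
            body = child.child_by_field_name("body")
            if body is None:
                continue
            header = src[child.start_byte:body.start_byte].decode().strip()
            lines.append(f"{indent}{header} {{")
            lines += skeleton(body, src, depth + 1)
            lines.append(f"{indent}}}")
        elif child.type in ("method_declaration", "constructor_declaration"):
            body = child.child_by_field_name("body")
            end = body.start_byte if body else child.end_byte
            lines.append(f"{indent}{src[child.start_byte:end].decode().strip()} {{ ... }}")
        else:
            lines += skeleton(child, src, depth)
    return lines

src = open("Example.java", "rb").read()
print("\n".join(skeleton(parser.parse(src).root_node, src)))
```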
Usually I leverage this the other way around, having the LLM write a rough draft that I then improve. Interesting that it can work in both directions.
On a related note, after spending a few hours having Claude extend the parser to support Python, the code was getting pretty messy. So I threw it at o1-pro and asked it to simplify it. It delivered.
I still don't have a great intuition for when o1-pro can outperform Claude at coding, but this was one of its wins.
Claude is my ORM
I switched from file-based to sqlite-based caching late in development. Claude nailed the conversion on the first try. I glanced at the code it wrote but didn't see anything obviously crufty, so I left it alone. That's the only file in the repo that is 100% AI-generated.
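I haven't reproduced Claude's file here; the sketch below is just my own guess at the general shape of such a cache, keyed on a hash of the question plus the file contents so that editing a file automatically invalidates its entry. The table and class names are mine, not LLMap's:

```python
import hashlib
import sqlite3

# Sketch of an analysis cache: results are keyed on a hash of the question
# plus the file contents, so stale entries are never returned.
class AnalysisCache:
    def __init__(self, path: str = "llmap-cache.db"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, answer TEXT)")

    def _key(self, question: str, source: str) -> str:
        return hashlib.sha256((question + "\0" + source).encode()).hexdigest()

    def get(self, question: str, source: str) -> str | None:
        row = self.db.execute("SELECT answer FROM cache WHERE key = ?",
                              (self._key(question, source),)).fetchone()
        return row[0] if row else None

    def put(self, question: str, source: str, answer: str) -> None:
        self.db.execute("INSERT OR REPLACE INTO cache VALUES (?, ?)",
                        (self._key(question, source), answer))
        self.db.commit()
```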
Conclusion
My biggest takeaway is that even though the core idea—"ask the LLM to do the work"—is straightforward, getting robust, repeatable results at scale requires iterative design and careful handling of edge cases. DeepSeek V3's classification quirks forced a hybrid approach: multiple passes where necessary, and gentle nudges when the model forgot its instructions.
(I found Flash to be a lot better at following instructions without a ton of gymnastics, but Flash is too rate-limited to be useful, so here we are.)
Give LLMap a try! Extending it with Aider should be straightforward; PRs are welcome.