I recently switched tasks from writing the ColBERT Live! library and related benchmarking tools to authoring BM25 search for Cassandra. I was able to implement the former almost entirely with "coding in English" via Aider. That is: I gave the LLM tasks, in English, and it generated diffs for me that Aider applied to my source files. This made me easily 5x more productive vs writing code by hand, even with AI autocomplete like Copilot. It felt amazing!
(Take a minute to check out this short thread on a real-life session with Aider, if you've never tried it.)
Coming back to Cassandra, by contrast, felt like swimming through molasses. Doing everything by hand is tedious when you know that an LLM could do it faster if you could just structure the problem correctly for it. It felt like writing assembly without a compiler -- a useful skill in narrow situations, but mostly not a good use of human intelligence today.
The key difference in these two scenarios is that ColBERT Live! is a dozen source files with simple relationships. Cassandra is thousands of files, poorly modularized, with call stacks dozens deep.
I want to live in a world where we can tackle Cassandra-sized projects with the productivity improvements that you can only get today in much smaller code bases. The good news is that we can get to that world with only a few key improvements to our tools, every component of which exists today -- we just need to assemble them.
Context collection for code
Context is king for getting useful results out of AI, and code is no exception. Too little context and the AI will hallucinate APIs that don't exist. Too much context and it will get overwhelmed and even forget the problem statement.
In 2023, when GPT-4 was new, I assembled the context for my vector search work by hand, pasting snippets into my prompt. The state of the art in late 2024 is the "@" references pioneered by Cursor, which let you manually mark files or methods as context for AI chat and code generation. This saves a lot of copy-and-paste back-and-forth, but it still leaves it up to the human to decide what context to add. That means it still becomes prohibitively difficult when you're refactoring intertwined, deeply nested inheritance hierarchies in a project with thousands of source files.
What we need are tools that can build context automatically, or at least produce a good first draft.
There are two related pieces to this problem. Both can be automated with the right approach:
- Figure out what files we need to change for a given task
- Add relevant context from those files' dependencies
Aider author Paul Gauthier has invented an elegant solution to the second. I'll describe it in detail because everyone else needs to adopt it, and I can only conclude that the reason they haven't is that they don't know about it. Then we'll come back to the first problem.
Code Skeletons and the Aider repo map
Aider parses your project's source files and generates a "skeleton" of classes and functions per file that looks like this:
```
src/vidore_benchmark/evaluation/eval_manager.py:
⋮...
│class EvalManager:
│    """
│    Stores evaluation results for various datasets and metrics. Supports the
│    JSON output of the `VisionRetriever.compute_metric`.
│
│    The data is stored in a pandas DataFrame with a MultiIndex for columns.
│    The first level of the MultiIndex is the dataset name and the second
│    level is the metric name.
│
│    Usage:
│    >>> eval_manager = EvalManager.from_dirpath("data/evaluation_results/")
│    >>> print(eval_manager.data)
⋮...
│    @classmethod
│    def from_dict(cls, data: Dict[Any, Any]):
⋮...
│    @staticmethod
│    def melt(df: pd.DataFrame) -> pd.DataFrame:
⋮...
```
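Aider builds these skeletons with tree-sitter so they work across many languages. To make the idea concrete, here is a rough single-language sketch using Python's ast module; it illustrates the concept but is not Aider's implementation:

```python
# Simplified sketch of skeleton extraction for Python files only.
# Aider itself uses tree-sitter to cover many languages; this is just
# an illustration of the idea, not Aider's code.
import ast

def skeleton(path: str) -> str:
    """Return a rough class/function outline of a Python source file."""
    with open(path) as f:
        tree = ast.parse(f.read())
    lines = [f"{path}:"]
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}:")
            doc = ast.get_docstring(node)
            if doc:
                lines.append(f'    """{doc.splitlines()[0]} ..."""')
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    args = ", ".join(a.arg for a in item.args.args)
                    lines.append(f"    def {item.name}({args}): ...")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"def {node.name}({args}): ...")
    return "\n".join(lines)
```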
Aider uses these skeletons to create a graph where identifiers are nodes in the graph and (directed) edges are the references in your code. That is, if function a calls function b, there will be an edge from a to b.
When you /add files to your Aider session (roughly the equivalent of an @ reference in Cursor), Aider starts with the identifiers from those files as "hot" nodes in the graph and runs PageRank to determine what the most-relevant neighbors are. The skeletons of those neighbors are then automatically added to the context for your request.
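Here's a minimal sketch of that ranking step using networkx's personalized PageRank. Aider's real repo map does more bookkeeping (symbol weighting, token budgeting, and so on); the helper below just shows the shape of the idea:

```python
# Sketch: rank repo files by relevance to the files already in the chat.
# `references` is an iterable of (src_file, dst_file) edges, meaning src_file
# mentions an identifier that dst_file defines (extracted from the skeletons).
import networkx as nx

def rank_context(references, chat_files, top_n=10):
    g = nx.DiGraph()
    for src, dst in references:
        if g.has_edge(src, dst):
            g[src][dst]["weight"] += 1.0
        else:
            g.add_edge(src, dst, weight=1.0)
    # Seed ("personalize") PageRank with the files the user has /add-ed.
    seeds = {f: 1.0 for f in chat_files if f in g}
    ranks = nx.pagerank(g, personalization=seeds or None, weight="weight")
    ordered = sorted(ranks, key=ranks.get, reverse=True)
    return [f for f in ordered if f not in chat_files][:top_n]
```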
This works shockingly well, to the point that I consider this aspect of context gathering to be effectively a solved problem, modulo some implementation details like disambiguating common symbols and extending the code graph into library dependencies.
Figuring out which files to change
The harder part is figuring out what files to change in the first place, given a problem description. Nobody knows how to one-shot this, and in fact I don't think it's possible for sufficiently complex codebases. But this is the kind of problem that LLMs are good at solving, if we can narrow down the scope without leaving out critical information.
The Aider code skeletons point us in the right direction. Past a fairly low threshold, it's not feasible to throw your entire codebase at an LLM, not even in skeleton form. But if we can get a foothold in the right general area, we can iteratively refine it using the skeletons, PageRank, and the LLM. Here's what I mean:
First, generate an initial set of candidate files.
Everyone doing code RAG is doing this with more or less success. Here are the approaches I know about:
- Cody does some kind of keyword + graph search. I'm not a huge fan of this approach since their results got noticeably worse when they switched to this from vector search.
- Cursor does vector search with custom caching based on merkle trees and, I think, a custom embeddings model. They get good results, a solid B+. I do worry a little that as your problem description grows, its embedding gets diluted, which would eventually force you to chunk the problem itself and combine multiple searches. Probably not something that needs to be solved for a 1.0 product, though.
- Augment Code is both fast and effective, but AFAIK they haven't said anything about how it works. You should probably decompile their plugin to find out.
- Not in a shipping product, but Gemini Flash and Claude Haiku both give reasonable starting points for adding new features when I send them the almost 5000 source filenames (with paths) in the Cassandra source tree and ask them to guess which ones are most relevant. Just throwing things at Flash is a surprisingly powerful (and inexpensive) tool.
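As a rough sketch of that last approach: feed the file listing and the task to a cheap model and keep whatever valid paths it returns. `llm_complete` below is a hypothetical stand-in for whichever client you use (Gemini Flash, Claude Haiku, or anything else), not a real API:

```python
# Rough sketch: ask a cheap model to guess relevant files from the file listing.
# llm_complete(prompt) -> str is a hypothetical stand-in for your actual client call.
from pathlib import Path

def candidate_files(repo_root: str, task: str, llm_complete) -> list[str]:
    # Cassandra is Java; adjust the glob for other languages.
    names = sorted(str(p.relative_to(repo_root))
                   for p in Path(repo_root).rglob("*.java"))
    prompt = (
        f"I want to implement the following in this codebase:\n{task}\n\n"
        "Here are all the source files. Reply with the 20 files most likely "
        "to need changes, one path per line, and nothing else.\n\n"
        + "\n".join(names)
    )
    valid = set(names)
    return [line.strip() for line in llm_complete(prompt).splitlines()
            if line.strip() in valid]
```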
Second, expand the candidate pool by analyzing skeletonized content and graph edges.
This is the part that nobody does yet and I think it will result in massive accuracy improvements.
- Take the candidate files from the first step, and use them as seeds for PageRank as described above. Send the skeletons of the original candidates along with the skeletons of the top PageRank results, and ask the LLM to refine it. "I'm implementing feature XYZ. Here are the skeletons of files that may or may not be relevant. Give me the names of only the relevant ones."
- Repeat until stable.
This will give you the starting files with which to anchor your context.
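Here's a sketch of that loop, reusing the illustrative `skeleton`, `rank_context`, and `llm_complete` helpers from the earlier sketches (none of which is a shipping API):

```python
# Sketch of the expand-and-refine loop. skeleton(), rank_context(), and
# llm_complete() are the illustrative helpers sketched above, not a real library.
def refine_candidates(task, initial, references, llm_complete, max_rounds=5):
    candidates = set(initial)
    for _ in range(max_rounds):
        # Expand: pull in the top PageRank neighbors of the current set.
        pool = candidates | set(rank_context(references, candidates, top_n=20))
        skeletons = "\n\n".join(skeleton(f) for f in sorted(pool))
        prompt = (
            f"I'm implementing this feature:\n{task}\n\n"
            "Here are the skeletons of files that may or may not be relevant. "
            "Reply with the paths of only the relevant ones, one per line.\n\n"
            + skeletons
        )
        kept = {line.strip() for line in llm_complete(prompt).splitlines()
                if line.strip() in pool}
        if kept == candidates:   # stable: the LLM kept exactly the same set
            break
        candidates = kept
    return sorted(candidates)
```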
I'll add a few thoughts on implementation. First, it's unrealistic to expect this to be perfect every time. Assuming that you're building this in an IDE, you should show the context visually so that the human operator can add or remove entries as needed.
Second, this should be a separate step from starting to write code, partly to reduce latency and partly so that the human supervisor can review it first.
Third, both of the above compose with manually specified context when that's what you want instead. ("Change method X to run asynchronously using the same technique in source file A, with point-of-call usage that looks like the code in source file B." I thank Stephan Deibel of Wingware for this reminder.)
Other useful sources of context
While finding relevant files from a problem description is the most common starting point, it's not the only way to guide the AI. In practice, I've found several other powerful entry points that build context through a different lens on the codebase. I have used all of these manually, and I think they are common enough scenarios to be worth automating:
- "I am implementing feature X. Here is my diff so far. I want you to implement Y next." Doing this in a single uncommitted change is living dangerously, so a diff of your branch against main is usually preferable.
- "Here is the current source of class X and the diff of the last commit to touch it. What was the author trying to accomplish?"
- "Here is a stack trace for an UnsupportedOperationException, and the source code for each method in the stack. Why did we end up here unexpectedly?" (This is also a good special case of finding relevant files from a problem description.)
Flow in AI-first coding
Finally, I have some thoughts about the low-level details of how best to perform this kind of AI supervision in an IDE.
AI code generation is most successful when given small units of work at a time. If you try to do too much then you may end up spending more time debugging it than it would have taken you to write it yourself in the first place. Don't be greedy: stick to one change at a time, and review each change before adding more.
To this end, I don't believe that a codegen tool should autocommit for me. That's because the pre-commit dialog named Commit is the easiest way to review changes in IntelliJ. I can easily see all the changes, double-click on any file to review it, and, critically, I can tweak it directly from the diff window using the full set of semantic tooling (rename, extract method, etc.). And if the AI really got lost, I can undo everything trivially with git stash. (Prefer stash to reset in case you end up wishing you had it back later after all!)
Revising already-committed code is more work: I have to find the commit in the log, right-click -> Compare with Local, and then I can double-click on files again and amend the commit with any changes. If I want to start over, I have to soft-reset before I can stash.
Thus, my ideal integration would suggest a commit message for me (i.e., populate the Commit message window), but let me push the button to actually commit after review and revision.
Preparing for the next generation of frontier models
We already live in a world where it's no longer necessary to write Java or Python or Go by hand for small projects. We can bring large projects into that world by solving the context problem with a combination of graph techniques and giving summarized files to the LLM for analysis.
This will unlock massive productivity gains with frontier models like Sonnet 3.5-new and o1, which are already at or beyond human-level competence in syntax generation. Even more interesting is the emergence of specialized open-source coding models like DeepSeek Coder and Qwen 2.5-coder that can compete with closed-source frontier models while being small enough to run locally. And whatever tooling we build for today's models will be that much more useful with the next generation.
It's an exciting time to write code!