
Why JVector 3 is the most advanced vector index on the planet

Transcript of airhacks.fm episode 316

Adam Bien: Hey, Jonathan, how is JVector 4 doing?

Jonathan Ellis: JVector 4?

AB: Yeah, because JVector 3 is complete now, I think.

JE: JVector 4. Well, shoot, man. If you want a sneak preview, we may have some news to talk about with GPU acceleration for JVector 4, but that's super, super early and I can't promise any specifics yet.

0:00:28 JVector 3 features and improvements

AB: It was a joke actually. So, I know that JVector 3 is completed. So what are the major features, or what happened between JVector 2 and 3?

JE: JVector 2 was a fairly straightforward adaptation of Microsoft Research's DiskANN search indexing to Java and to Cassandra. And so that means that you have a two-pass search where you have a core index that works on a graph, whose comparisons are done with quantized vectors that are kept in memory. And then you refine the results of that search by using full resolution vectors from disk. And so JVector 3 has been about how we go beyond this DiskANN design and improve it along two dimensions. The two dimensions where we were feeling pain with our hosted database-as-a-service were around compression levels and around how large an index can get. And these are related problems.
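
[ed: As a rough illustration of that two-pass idea, here is a minimal Java sketch. The graph search over quantized vectors is only stubbed out, the method name approximateSearch and the on-disk vector layout are assumptions made for illustration, and none of this is JVector's actual API.]

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical two-pass search: pass 1 uses quantized vectors in memory,
// pass 2 re-ranks the candidates with full-resolution vectors read from disk.
public class TwoPassSearch {

    // Pass 1 is assumed: some graph search over quantized vectors that returns
    // candidate ordinals (deliberately more than topK, e.g. 4x as many).
    interface QuantizedGraph {
        int[] approximateSearch(float[] query, int candidateCount);
    }

    record Scored(int ordinal, float score) {}

    static int[] search(QuantizedGraph graph, FileChannel fullVectors,
                        float[] query, int dimension, int topK) {
        int[] candidates = graph.approximateSearch(query, topK * 4);
        return Arrays.stream(candidates)
                .mapToObj(ord -> new Scored(ord, exactScore(fullVectors, query, dimension, ord)))
                .sorted(Comparator.comparingDouble((Scored s) -> -s.score())) // highest score first
                .limit(topK)
                .mapToInt(Scored::ordinal)
                .toArray();
    }

    // Assumed on-disk layout: full-resolution float32 vectors stored contiguously by ordinal.
    static float exactScore(FileChannel fullVectors, float[] query, int dimension, int ordinal) {
        try {
            ByteBuffer buf = ByteBuffer.allocate(dimension * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
            fullVectors.read(buf, (long) ordinal * dimension * Float.BYTES);
            buf.flip();
            float dot = 0f;
            for (int i = 0; i < dimension; i++) {
                dot += query[i] * buf.getFloat();   // dot product as the similarity, for simplicity
            }
            return dot;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```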

JE: So the reason why we care so much about how large the index can get is that your search time is linear across the number of index segments you have, like the number of physical indexes you have, but it's logarithmic within an index. And so if you read the HNSW paper, if you read the DiskANN paper, if you read a lot of these papers, they're about how do we give you logarithmic search performance as the index size increases? But once you reach the limit of the largest graph that I can construct, then I have to spill across into multiple separate segments. And then combining the results from those segments is now a linear process.

And so the state of the art with DiskANN, and I think this is true for most of the other vector indexes as well, is that you build the index using full resolution vectors that are kept in memory. So that's the main bottleneck that you have: the size of your data set in memory. So in JVector 3, we added the ability to build the index using compressed vectors without losing accuracy. So that's the first big one, around increasing the size of those indexes that we can build. And then the other pieces in JVector 3 are around different ways of improving the compression that we're doing.

AB: So with JVector 3, I will need less RAM to achieve the same?

JE: Yes, you will need less RAM both to build the indexes and to serve the indexes. Building the indexes because of what I just discussed and serving the indexes is actually coming from a different optimization that we made.

AB: And how big or how much RAM is recommended? Of course, it depends on the size of the database, but, I mean, there is... What is the typical size?


JE: A good reference point is the blog post that I wrote for Foojay a little while ago about indexing all of Wikipedia on my laptop. And so that indexed the 41 million rows of a Wikipedia dataset that had been chunked, in other words, broken up into smaller pieces. So it's not 41 million articles, it's 41 million paragraph fragments that are embedded separately and indexed separately.

AB: This was the size of the embedding. So you decided to chunk the articles into pieces for which the vector was calculated, right?

JE: Yes. Because the larger the piece of text that you give to an embedding model like Cohere or like openai-v3-small or on the open source side, like the E5 family, the larger the piece of text you give it, the less accurate it's going to be about any of the specific contexts that are contained in that text.

AB: But it depends a bit on the dimensions, right? The more dimensions a vector has, the more text we can pass, maybe. Is that true?

JE: There is something to that, but there's also diminishing returns, right? So Cohere's embedding vectors are 1024 dimensions. openai-v3-small is 1536. openai-v3-large is double that. And then that's about the biggest that anyone has, and openai-v3-large is not twice as good as openai-v3-small. It's less than twice. And so you do have this diminishing returns as you get larger. So nobody knows how to make a vector embedding model that can capture all of the information in an entire Wikipedia article accurately. And so we break it up into smaller chunks.

AB: How long did it take to index the entire Wikipedia on your laptop?

JE: About six hours.

AB: And which laptop is it, Intel or M1?

JE: This is an Intel i9-12900, so it has eight performance cores with two threads apiece, and then eight efficiency cores with one thread each. So it's a total of 24 threads.

AB: And it was fully utilized? So 90%, 100% or?

JE: Yeah, effectively 100%. One of the goals with JVector that we covered last time is that it uses a non-blocking concurrency control design that lets it scale linearly with the number of threads you can throw at it, even when you're building or modifying the index.

AB: Okay, and which model did you use to create the embeddings, Cohere?

JE: I specifically use the Cohere embeddings because they have helpfully already done this. They chunked up the Wikipedia articles, and they ran it through their embeddings model, and they posted it on Hugging Face for anyone to use. If you did it yourself from scratch, then it would cost about $5000 in API costs. I didn't want to do that if I can avoid it. And so I used the one that they pre-built.


JE: Not for Cohere specifically. Cohere is closed source. They have a service that they sell. But yeah, there are open source embedding models. I mentioned E5 earlier. And there's a few others that are fairly competitive that you could also use. But if you're going to do that, then you really need GPU acceleration to do that when you're talking about 41 million pieces of text. Otherwise, you're going to be waiting a very long time.

AB: Okay, so you got the embeddings, and then in your particular case, you just read the embeddings and stored them in JVector. So it was not the calculation of the embeddings, but rather just storing the embeddings in JVector, right?

JE: Right. And the demo does the JVector piece, but then it also does the index for, given a JVector embedding ID, what was the article that it was originally associated with? So you want to be able to go both directions to be able to perform useful searches. And I used Chronicle Map for that second part, which is... I want to say it's LGPL licensed. I don't remember the license. It's an open source license from a fintech company. And it is by far the best performing option for: you want a Map, but you want it to be larger than memory. So they implement the Collections Map interface, but it can spill to disk. And there's a few rough edges that kind of come with the territory. It's a difficult problem that they're solving. So for instance, they use mmap. They use a bunch of internal things in the JDK that aren't technically meant to be general purpose. And so you need to pass a bunch of --add-opens flags and so forth. But it's definitely the best option for...

[overlapping conversation]

 

AB: Unsafe as well?

JE: I don't remember off the top of my head, but I'm almost certain.

AB: It seems everyone still uses that. But are they using off-heap storage, or rather are they flushing it to disk?

JE: This is really cool, actually. I measured it: I took a heap dump after I put around a million rows in. And the Chronicle Map heap footprint was 2KB. So I thought it was going to have some per-entry overhead on heap, like a reference or something. But no, it's all off heap. When they say it's off heap, it really is off heap.
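
[ed: For reference, using Chronicle Map this way looks roughly like the sketch below. The builder calls (of, name, entries, averageValueSize, createPersistedTo) are recalled from the OpenHFT API and the sizes are illustrative guesses, so treat it as a sketch rather than the demo's actual code; the --add-opens flags mentioned above are still needed at runtime.]

```java
import java.io.File;
import net.openhft.chronicle.map.ChronicleMap;

// Minimal sketch: an off-heap, disk-backed map from JVector's integer ordinals
// to the original text chunks. Sizes below are illustrative guesses.
public class ArticleStore {
    public static void main(String[] args) throws Exception {
        try (ChronicleMap<Integer, String> chunks = ChronicleMap
                .of(Integer.class, String.class)
                .name("wikipedia-chunks")
                .entries(41_000_000L)        // expected number of entries
                .averageValueSize(600)       // rough average chunk size (illustrative)
                .createPersistedTo(new File("wikipedia-chunks.dat"))) {

            chunks.put(0, "Anarchism is a political philosophy ...");
            System.out.println(chunks.get(0));
        } // closing the map flushes it; the data itself lives off heap / on disk
    }
}
```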

AB: This is the OpenHFT organization, right? Chronicle Map. Yeah, looks like it. Interesting. Never heard about that. So what it means is, with JVector and Wikipedia, we could now have semantic search in Wikipedia.

JE: That's right.

AB: So what is missing? I could send you a prompt. For the prompt, I would then have to use Cohere, right, to create the same embeddings? Or, this is what I wanted to ask you: are the embeddings compatible? So are Cohere embeddings compatible with E5, for instance?


 

AB: Is there no standard on the horizon?

JE: I don't think there's a standard on the horizon. And the reason is that all of these companies, OpenAI and Cohere and who else is in the embedding space? DeepSeek is open source. Somebody's doing it for code specifically, and they have a proprietary model. I don't remember the… [ed: it was voyage-code-2]

AB: Code Llama?

JE: Anything that says Llama is usually open source, so probably not. In any case, they have what they believe are their secret sauce, and they don't want to give that away. So everyone's competing to beat each other right now. So yeah, I don't think there's going to be a standard.

AB: Okay. After six hours, we have this amazing thing, right? After six hours, we have the entire Wikipedia in JVector. How big was the JVector index on disk?

JE: It's about 40 gigabytes.

AB: RAM and disk?

JE: That's the disk component. Well, it exists in RAM while you're building it, and then we write it to disk, and then it's 40 gigabytes. And then when you go to serve the requests, you don't need to read the full 40 gigabytes in. You only need to read about two gigabytes in. And the distinction is that the 40 gigabytes includes the edge lists, and it also includes a bunch of GC overhead. And so the edge lists of the graph are 32 edges times 4 bytes apiece, which is 128 bytes per node. And that becomes the bottleneck when you're compressing the vectors down enough. And so the compressed vectors are only about two gigabytes, and then everything else is the edge list. And the reason that you want to keep those in memory is that you're modifying them every time you insert a new vector. And so, if you tried to keep that on disk, then I think the performance would be unacceptably slow.

AB: How much RAM did you dedicate to JVector?

JE: I dedicated 40 gigabytes to the JVM. That was for everything.

AB: Is it a good practice to have the same amount of RAM as disk?

JE: No, no. In practice when we deploy this in the Astra service, we have index building machines separate from the request serving machines. And the index building machines are free to allocate much more memory because all they're doing is building the index and then the serving machines, they're smaller, less expensive machines and their job is to just serve requests. And so that's why we're okay with the footprint for constructing the index being larger than for serving it.

0:15:12 JVector/Cassandra integration

AB: I wanted to ask you this last time: with JVector, what about single point of failure? So can JVector easily be clustered or federated, or is there something like this?

JE: In the Cassandra space, when you give Cassandra a table definition, implicit in that is how to partition that table across the cluster. And so we use the first N components of your primary key as the partition key. And that controls how it gets spread across the cluster. And then each node in the cluster is responsible for indexing the data that is sent to it locally.

AB: So using the same mechanism in JVector.

JE: JVector's just the local index piece. So it plugs into Cassandra as, I can index data for you.

AB: Of course. So what it would mean is, if you have JVector with the index, then it would be the same on every Cassandra machine, because it doesn't matter, right? You would have to replicate the file because you would... Where I'm going with this is, if I would like to search something, right? So what usually happens is I need an association between the Cassandra partition key and the JVector ID or the vector, because this is the association, right? If you would implement, for instance, RAG, then Cassandra would hold my data, so the text, let's say, or whatever, JSON, and I would like to associate this with the meaning, which is the vector in JVector, right?

JE: Yes.

AB: This is how it would work.

JE: So JVector is like the low level component, right? So if you're using Cassandra, you say create table “documents” with primary key ID. I'll declare that to be a UUID. And then my body which is declared of type text and my embedding, which is declared as type of float vector 1024 dimensions. And so then I can just insert my text and my vector into that Cassandra table, and it takes care of managing the association between the two and indexing the vectors with JVector. So that's not something that as an application developer, you have to deal with manually.
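
[ed: In code, that looks roughly like the sketch below, using the DataStax Java driver. The CQL vector syntax and the SAI index statement reflect my reading of the Cassandra 5/Astra documentation, and the keyspace name and the binding via CqlVector are assumptions, so double-check against the current docs rather than treating this as the canonical example.]

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.data.CqlVector;
import java.util.UUID;

// Sketch: Cassandra manages the id <-> text <-> vector association and the
// JVector-backed index for us. Keyspace "demo" and index options are assumptions.
public class VectorTable {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().withKeyspace("demo").build()) {

            session.execute("CREATE TABLE IF NOT EXISTS documents ("
                    + " id uuid PRIMARY KEY,"
                    + " body text,"
                    + " embedding vector<float, 1024>)");

            // Storage-attached index on the vector column (backed by JVector under the hood).
            session.execute("CREATE CUSTOM INDEX IF NOT EXISTS ann_idx ON documents(embedding)"
                    + " USING 'StorageAttachedIndex'");

            Float[] embedding = new Float[1024];   // produced by your embedding model
            java.util.Arrays.fill(embedding, 0.0f);

            PreparedStatement insert = session.prepare(
                    "INSERT INTO documents (id, body, embedding) VALUES (?, ?, ?)");
            session.execute(insert.bind(UUID.randomUUID(), "some chunk of text",
                    CqlVector.newInstance(embedding)));

            // Approximate nearest-neighbor query, ordered by similarity to the query vector.
            PreparedStatement ann = session.prepare(
                    "SELECT id, body FROM documents ORDER BY embedding ANN OF ? LIMIT 10");
            session.execute(ann.bind(CqlVector.newInstance(embedding)))
                   .forEach(row -> System.out.println(row.getString("body")));
        }
    }
}
```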

AB: So what you did is, you used JVector directly, which is unusual. Usually you'd use Cassandra: on every insert you would write to the column, which would be indexed behind the scenes, right?

JE: Right. And there's a reason for that, which is that the version of JVector in the current Cassandra release is JVector 1.x. And Cassandra 5.0 RC-1 was just released a couple of days ago [ed: from when this was recorded in July]. So Cassandra's not looking to change a major dependency right now. And so to demonstrate JVector 3, my choices were either to use Astra in local mode, which is actually open source...

(Not many people know that, but the branch of Cassandra that we deploy in Astra is open source. So I could do that, but then I'd have to do a whole bunch of explanation about, here's Astra and here's how I'm running in local mode.)

Or I can just use raw JVector with Chronicle Map. And I thought that was an easier thing to explain to people.

AB: I had a conversation with a client the other day, and they had this highly complicated object graph. So the edges between the nodes were more important than the objects themselves. And my client asked me about queries, and there are many possibilities, like Neo4j or something like this. But I thought maybe it is possible to use JVector to index the edges, and then use an LLM to have queries, full text queries, and to find the things. And I think this could work, right?

JE: If you can represent those edges as vectors while retaining their spatial meaning inside the original graph, then yeah, that would work. But that's a fairly big if. I don't know how to do that.

AB: Okay, but because they already mentioned Cassandra, I was like, okay, Cassandra is interesting, because we could use JVector, and then instead of full text search we have full semantic search, right? This was the idea. Instead of using text comparison, we could use a meaning comparison, and actually it could or should work. There's actually no reason why not. But I was just curious whether you had other similar ideas with Cassandra and JVector.

JE: Yeah, just to make sure we're on the same page, JVector is not a generic graph engine at all. It just uses a graph structure to index the vectors.

AB: Yes, so we would use, of course, Cassandra or Astra, it would be a cloud service, and just store the objects, which are connected to each other. And for the connections, I would probably use single table design or something like this. And the objects would then be indexed, or hashed, or embedded.

JE: Right, you could absolutely represent an edge list in Cassandra, and then separately you could index a vector embedding associated with the node, or you can index the full text, or you can index both. It's actually really useful to be able to say, I want to restrict my vector search to rows that also match this specific full text search query.

For instance, vector indexes are not particularly good at very short queries. And so one of the examples that someone on our technical field team came to me with was, she was working with an e-commerce company. And so they wanted users to be able to specify things like red faucet. And if you just use the vector index, it would come back with faucets that were not red, right? It's like, this is also close in the vector embedding space. And so the way to solve that was to say, I'm going to put my red in quotes, and that will tell my application to send that to the full text term matching, and then restrict the vector search to only rows that match that full text predicate.
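
[ed: Continuing the CqlSession sketch above, a restricted vector search looks roughly like this. The real "red faucet" example used a full-text term match; a plain equality predicate on an SAI-indexed column is shown here instead, since the exact term-match syntax depends on how the text index is configured. The products schema is hypothetical.]

```java
// Continuation of the earlier sketch (same imports and session); schema is hypothetical.
// Restrict the ANN search to rows that also match another indexed predicate, then
// rank the survivors by vector similarity.
static void redFaucets(CqlSession session, CqlVector<Float> queryVector) {
    PreparedStatement query = session.prepare(
            "SELECT name, description FROM products"
            + " WHERE color = 'red'"                    // filter predicate (assumed SAI index on color)
            + " ORDER BY embedding ANN OF ? LIMIT 10"); // then order by vector similarity
    session.execute(query.bind(queryVector))
           .forEach(row -> System.out.println(row.getString("name")));
}
```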

AB: Interesting, so one had to exclude the color from the semantic search and perform a full text search on it instead, right?

JE: Right. Under the hood, JVector is wired into Cassandra along with all the other index types that we have. And so you can affect the search results by any of the other predicates.

AB: That's so cool. So it means that JVector is just another index for Cassandra.

AB: Okay, so what do we have? We have JVector 3, which uses compression and therefore uses significantly less memory than JVector 2. Do you know how much less?

JE: As I mentioned, the edge list becomes your bottleneck. And so it's something like 10x less ballpark.

AB: So what it means is for usual clients or customers, they would notice that Cassandra is smaller, needs less RAM, right?

JE: That's right, yes. Or alternatively, that you can ingest more data in the same amount of RAM. While we were talking, I checked the actual file on disk, and I had misremembered: the completed index is 50 gigabytes, not 40 gigabytes.

AB: It's not much, right? Entire Wikipedia is...

JE: Yeah, and then there's another 40 gigabytes, which is the Chronicle Map that stores the article contents. So the total for both of them is about 90.

AB: And you use the Chronicle Map just as a bridge between the Cohere model from Hugging Face and JVector?

JE: It's the JVector internal IDs and the text. JVector associates integers with vectors, and then when you do a search for another vector, what you get back is a list of integers saying, here's the vectors that I have internally that are closest to what you gave me. And so to turn that into something useful in your application, you need to be able to map those integers to whatever your source form is. And again, when you're using it from Cassandra, Cassandra takes care of doing that mapping for you, but when you're doing it with JVector raw, then you need to do that yourself. And so that's what Chronicle Map contains: the map of JVector integers to text fragments that correspond to the embedding. So it's a map of integer to String in the code.
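
[ed: In other words, the application-side step is just a lookup from each returned ordinal into that map, roughly like this self-contained sketch; how the ordinals are obtained is up to whatever ANN search you run.]

```java
import java.util.List;
import java.util.Map;
import java.util.stream.IntStream;

// Sketch: map JVector's integer ordinals back to the text they were embedded from.
// `neighborOrdinals` is whatever your ANN search returned; `chunks` is the
// ordinal -> text map (for example the Chronicle Map from the earlier sketch).
public class ResultMapper {
    static List<String> toText(int[] neighborOrdinals, Map<Integer, String> chunks) {
        return IntStream.of(neighborOrdinals)
                .mapToObj(chunks::get)
                .toList();
    }
}
```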

0:25:57 Compression techniques for vector indexes

AB: And so we have the compression, the RAM improvement, the architecture was changed. Is there any other thing you did for 3?

JE: We went pretty hard on the compression algorithms that we support. So in the vector search world, you commonly have product quantization, which is what DiskANN introduced. And you have binary quantization, which is a very simple transformation where, for each dimension in your embedding, you turn it into either a one or a zero bit. So I'm going from 32 bits of a float32 down to a single bit. And that's popular because you can do it without an external codebook.

With product quantization, you end up with this codebook that tells you what the quantized points correspond to in your original vector space. But for a bunch of database designs, including Lucene, including PostgreSQL, it's really difficult to fit in the concept of, I have this third source of information that was calculated from my original data, but it's not the same as my original data. And so binary quantization is super popular with pgvector, for instance, because you can quantize it down, compress it by a factor of 32. And that's all you need. All you need is that compressed set of ones and zeros. You don't need that extra codebook. The downside of binary quantization is that it is very, very inaccurate, which you would expect, because you're just doing this very simple transformation on it.
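
[ed: A minimal sketch of that transformation: sign-based binarization plus a Hamming-distance comparison. Thresholding each component at zero is a common convention and an assumption here, not necessarily what any particular library does.]

```java
// Sketch: binary quantization compresses each float32 dimension to one bit
// (here: 1 if the component is positive, 0 otherwise), a 32x size reduction.
public class BinaryQuantization {

    static long[] quantize(float[] vector) {
        long[] bits = new long[(vector.length + 63) / 64];
        for (int i = 0; i < vector.length; i++) {
            if (vector[i] > 0) {
                bits[i / 64] |= 1L << (i % 64);
            }
        }
        return bits;
    }

    // With binarized vectors, similarity is approximated by Hamming distance:
    // the number of bit positions where the two vectors disagree (lower = closer).
    static int hammingDistance(long[] a, long[] b) {
        int distance = 0;
        for (int i = 0; i < a.length; i++) {
            distance += Long.bitCount(a[i] ^ b[i]);
        }
        return distance;
    }
}
```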

So for instance, with OpenAI's Ada embedding model, if you index those vectors uncompressed, so with full resolution, and then you search for the top 100 in a dataset, then you're going to expect that 95 of those 100 results you get back from your approximated index are going to be in the correct exact top 100. And the way you measure this is you pre-compute offline the correct answers for your dataset. And you do that on a large machine with GPUs to accelerate it. And then once you have that answer key, then you can test arbitrary vector indexes to see how good the accuracy is. So if you're doing it with uncompressed, you get about 95% right. If you do it with binary compression, you get about 69% right.
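
[ed: The recall measurement described here is easy to state in code: recall@k is the fraction of the approximate top-k results that also appear in the exact, brute-force-computed top-k. A small sketch:]

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: recall@k = |approximate results that appear in the exact ground truth| / k.
// The ground truth is computed offline by brute force (e.g. on a GPU machine).
public class Recall {
    static double recallAtK(int[] exactTopK, int[] approxTopK) {
        Set<Integer> truth = new HashSet<>();
        for (int id : exactTopK) truth.add(id);
        int hits = 0;
        for (int id : approxTopK) {
            if (truth.contains(id)) hits++;
        }
        return (double) hits / exactTopK.length;
    }

    public static void main(String[] args) {
        // Toy example: 3 of the 4 approximate results are in the true top 4 -> recall 0.75.
        System.out.println(recallAtK(new int[]{1, 2, 3, 4}, new int[]{2, 3, 4, 9}));
    }
}
```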

So you've gone from 95% to about 70%. That's a pretty meaningful loss of accuracy. If you're using the E5 small V2 embeddings, which is the open source model I mentioned earlier, then you go from about 87% with the uncompressed vectors to 45% with binary compression. So it's really a massive accuracy loss that you get. And so in the DiskANN world, you can say I'm okay with that because I keep the uncompressed vectors on disk and I can pull them into memory on demand to re-rank the answers that I'm getting from my quantized search. And so in that case, you can trade off less accuracy from your quantization in exchange for faster comparisons. And that lets me go deeper into the graph and then make it up on the reranking.

But it turns out in the case of binary quantization that it's actually a very narrow sweet spot, because even with reranking for E5 V2, for instance, I cannot get that 45% back up to 87% with any meaningful amount of reranking. Even if I search five times deeper in the graph and rerank 500 instead of 100, I'm still not getting back up to my original accuracy. OpenAI's ada-002 and openai-v3-small, as well as Cohere's, are the three that I've found where binary quantization can be useful. And even for those three, I can get even better results using product quantization at 64x.

We did add binary quantization support. It's not as useful in JVector as it is in some of these other options because we have a high quality product quantization implementation as well, and that's usually better, but we do support the binary quantization. Now, something that is less often seen, and in fact, I don't know of anyone else doing it, is refining product quantization with anisotropic weights. And what that is, is this is an idea that came out of one of Google's vector search papers and...

AB: One question regarding the bit-level quantization: is it worth it to trade off RAM for CPU? I mean, because I think CPU is cheaper. Is it always the case that we... that this is the way to go, or is it a specific case for specific models?

JE: Other things being equal, I would rather have a smaller footprint in memory. Because that's usually my bottleneck. And the reason is that you need to keep the entire quantized data set in memory to do this core search quickly.

AB: You're right. So we would like to consume less RAM, but we need more CPU. And yeah, at one point in time... I think it's easier to get CPU. I think both is hard, right?

JE: Both is hard, but if you have an application and you have a hundred thousand users hitting it with requests, that request volume tends to stay similar even as they're generating more and more data that you need to index. And so that's why my default is I'm going to optimize for keeping that memory size low because that's what keeps growing over time, even if my request volume doesn't.

AB: You're right. Because what you're saying is that the CPU requirements are more or less equal, but RAM will grow because the database usually gets larger, not smaller. And this is, yeah, okay. This is actually a good point. Yeah. Which brings me to a completely different thing that happened in our projects. At the beginning of our cloud projects, we bought storage which was more cold than warm.

So it means at the beginning, it was more expensive for us because the API calls were more expensive. But the longer we were in the project, the cheaper it got, because at one point in time it reached break-even. So we could actually save some money, because the size requirements grow as the data grows over time, so it gets more expensive over time anyway. And this is what you said, this is also smart, because you say, okay, I have to keep the entire database in memory and it will just grow. This is usual for applications. So this is what I cannot control. But it seems like JVector's CPU requirements are linear, I guess. Or are they?

JE: The CPU requirements are logarithmic in the size of the data set, assuming that you can keep your index segment count under control.

AB: But still, what I didn't completely understand is: if I have multiple Cassandra nodes, the partition key and so forth, so we have consistent hashing. So Cassandra knows where the data lives, but the vector is more or less there with it, and it has to be prefilled with the data of the specific partition key, right? So what you will need... or no, it doesn't have to be, right? Because on every insert... okay, this is easier than I thought, because let's say we have three nodes in a cluster.

So you go to the leader, you store it there, or the leader gets the data. So it is stored into the index like any other index. And then behind the scenes, I don't know whether the gossip protocol is still there, it gets replicated to the other nodes. And then, because it is an identical setup, the state of JVector is the same as on the other nodes. So this is a non-issue, because we are always going through Cassandra, and Cassandra replicates the data because of the consistent hashing, right? So because we know the key, Cassandra knows how to route the data to particular nodes. And this was the idea there.

JE: Right. You can treat the replication as kind of a black box because what's going to happen is even in a simple case, like I have three nodes, the JVector indexes are probably not going to be identical because the indexing process is sensitive to the order in which the vectors are inserted and so forth. And Cassandra's replication does not guarantee any specific order that the follower nodes will see those writes in. And in fact, all of the nodes can be receiving writes simultaneously and replicating to the others.

So you will almost certainly, even if you try to stop it, people will almost certainly always see slightly different orders on the indexing, but that's okay. And you don't need to care about that at any level outside of "I'm debugging Cassandra's repair code", because Cassandra will take care of that reconciliation at the data level. And then the indexes are all derived from those SSTables that contain the raw data.

AB: Well, it means eventual consistency: if the writes stop, then all the followers will get the data anyway, but the data might be in a different order. That's what you're saying. And it doesn't matter, because the vector is associated with the Cassandra data.

So we have two major features: less RAM because of compression, and even less RAM because of one-bit quantization. So we have almost negative RAM requirements already.

0:38:07 Anisotropic product quantization

JE: I want to talk about just one more thing related to the compression, which is the anisotropic product quantization.

AB: Exactly.

JE: As I was saying, this is something that Google published in one of their vector search papers and hasn't really... it's flying under the radar for most people, I think. And the idea is that if you are doing vector search with cosine similarity, which is the most common way that you want to compare vectors from OpenAI or Cohere or E5 and so forth, if you know that you're comparing with cosine similarity, then you should be able to take advantage of that when you're performing your quantization, so that you care more about angular distance than you do about linear distance. And so the anisotropic PQ is a way to do that.

And the other thing you care about is you care more about being relatively accurate for vectors that are close together than you do about vectors that are far apart. If they're far apart, then it's like, okay, they're not going to be in my top 100 search, so I kind of don't care how far apart they are, as long as they sort towards the end of my candidate list. And so combining those two concepts, you can actually make product quantization more accurate. It's just as fast, the similarity computation is the same code, but you're encoding it in a more accurate way. And so you pay a price in terms of extra computation at the encoding time, to get more accurate results at query time. And since almost always you're doing the encoding just once or relatively few times compared to how many searches you're running, that's usually a good trade.

AB: So what you said is that there's a different similarity algorithm which actually compares vectors that are close to each other pretty well. And if so, shouldn't that be the default? Where are the cases where the far-away vectors matter? I mean, this would be... are there any use cases for that?

JE: This is specific to product quantization, where I'm splitting my vectors up into subspaces, and then I'm quantizing each of the subspaces into 256 centroids. And then I encode each of those subspaces as a single byte. And so the anisotropic algorithm is about, how do I influence how those 256 centroids are computed so that they give me those two properties of capturing the angular distance and emphasizing the accuracy more heavily for vectors that are closer together.
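
[ed: For readers who haven't seen product quantization spelled out, the sketch below shows the encode and lookup-table (ADC) steps being described: each subspace is encoded as one byte pointing at one of 256 centroids, and a query precomputes a per-subspace table of dot products so that a comparison is just table lookups and adds. Training the codebooks, with or without the anisotropic weighting, is omitted, and none of this is JVector's actual code.]

```java
// Sketch of product quantization (PQ) encode + asymmetric distance computation (ADC).
// codebooks[s][c] is the c-th centroid (c in 0..255) for subspace s; training them
// (k-means, optionally with anisotropic weights) is out of scope for this sketch.
public class ProductQuantization {

    // Encode: for each subspace, store the index of the nearest centroid as one byte.
    static byte[] encode(float[] vector, float[][][] codebooks, int subDim) {
        byte[] codes = new byte[codebooks.length];
        for (int s = 0; s < codebooks.length; s++) {
            int best = 0;
            float bestDist = Float.MAX_VALUE;
            for (int c = 0; c < codebooks[s].length; c++) {
                float d = 0f;
                for (int i = 0; i < subDim; i++) {
                    float diff = vector[s * subDim + i] - codebooks[s][c][i];
                    d += diff * diff;
                }
                if (d < bestDist) { bestDist = d; best = c; }
            }
            codes[s] = (byte) best;
        }
        return codes;
    }

    // ADC step 1: precompute, per subspace, the query's dot product with every centroid.
    static float[][] buildLookupTables(float[] query, float[][][] codebooks, int subDim) {
        float[][] tables = new float[codebooks.length][];
        for (int s = 0; s < codebooks.length; s++) {
            tables[s] = new float[codebooks[s].length];
            for (int c = 0; c < codebooks[s].length; c++) {
                float dot = 0f;
                for (int i = 0; i < subDim; i++) {
                    dot += query[s * subDim + i] * codebooks[s][c][i];
                }
                tables[s][c] = dot;
            }
        }
        return tables;
    }

    // ADC step 2: similarity to an encoded vector is just table lookups and adds.
    static float similarity(byte[] codes, float[][] tables) {
        float sum = 0f;
        for (int s = 0; s < codes.length; s++) {
            sum += tables[s][codes[s] & 0xFF];
        }
        return sum;
    }
}
```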

AB: So it almost sounds like consistent hashing, because of the partitioning, I think.

JE: Right. It's similar in the sense of, you know where to find it from the property of the key. Yes. But actually computing it for product quantization is a lot more difficult. And it's an iterative process rather than performing a deterministic hash on it.

AB: And the search is more accurate because I'm searching in a subspace. And the subspace is smaller, and this is maybe the reason why I get better results.

JE: Right. Because I've trained the subspaces, if you will, to encode this angular distance preferentially for vectors that are closer together. And so if I take my query vector and I sum up the dot products with all of the subspace quantizations, then I've introduced less error in that reconstruction, or in that asymmetric computation.

AB: And you mentioned JVector 3 is one of the, or is it the only, vector index that has implemented this?

JE: I don't know of anyone else doing it. But I could be wrong. There's a lot of vector search projects out there, but I don't know of anyone else doing it. So that's one of the reasons why I feel comfortable telling people that JVector's the most advanced vector index in the world, and not just the most advanced Java Vector index, but it's the most advanced vector index.

AB: And you picked Java because you wanted to have a scalable solution. So we had this in the first podcast, right?

JE: Exactly.

AB: Which is still amazing, because, yeah, I'm also amazed by the performance of Java. But you have a DB background, which matters even more. One last thing, because then I would like to ask you some questions regarding the implementation: you wanted to mention the last feature of JVector, right? It's called antiendomorphic now?

JE: Anisotropic.

AB: Anisotropic. Anisotropic. This is a nice term. Could you explain it the first time? Anisotropic?

JE: That's the term that the Google paper uses. I mean, if you just translate it into normal English, it just means weighted, that's all it means. I guess it sounds more impressive if you call it anisotropic.

0:43:53 SIMD acceleration with Fused ADC

AB: It should be specific... I cannot pronounce that [unintelligible], you know, the specificity and anisotropic, and that would be a great title for the next paper. Okay. What is the last feature?

JE: Well, we're getting short on time, so I'll just tell you what it is and then I'll refer our listeners to the JVector README to go into a little more detail. But the last one's called Fused ADC. And the way this came about is, there's a French SIMD expert. SIMD is the AVX-512 on Intel chips, Neon on ARM chips, the hardware that lets you do multiple operations against a stream of data concurrently on your CPU. And so this is perfect for doing things with vectors, because you have these long streams of float32s that you want to multiply and add together to compute similarities. And so this Frenchman named Fabien André came up with a way to store the lookup tables for product-quantized comparisons in the registers of the SIMD hardware. And so he published three papers over the years that refined his solution more and more. The first one was called, it was a long title, but the operative keywords are “fast scan”, for doing this accelerated similarity computation.

Then the next paper was called Quick ADC, and then the third one was called Quicker ADC. And it's the third one that implements full AVX-512 acceleration against product quantization lookup tables. Super, super interesting, but the long story short is JVector can store those lookup tables in line with the index on disk, and pull it out as it's retrieving edge lists, pull it out at the same time with the same disk access, put it into the SIMD registers and compute the similarities all at once, in the same time that we could have done a traditional asymmetric distance computation against traditional product quantization.

So I've gone from, I need to hold my quantized vectors in memory and that's my bottleneck. And now I've gone to, I don't have any memory footprint. My memory footprint is two megabytes. Like, that's it. If you have a hundred million vectors, if you have a billion vectors, it's still two megabytes. And I'm pulling everything else off from the disk on demand without a speed penalty. So that's the last thing I wanted to mention.
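
[ed: Fused ADC itself works at the register level, but for a flavor of how SIMD similarity code looks in Java, here is a small dot-product kernel using the incubating Panama Vector API (jdk.incubator.vector, so it needs --add-modules jdk.incubator.vector on a recent JDK). This is a generic illustration, not JVector's implementation.]

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Sketch: a SIMD dot product with the Panama Vector API. On AVX-512 hardware one
// lane-wise fma processes 16 floats at a time; the remainder is handled scalar-style.
public class SimdDot {
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        FloatVector acc = FloatVector.zero(SPECIES);
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc);          // acc += va * vb, lane-wise
        }
        float sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) {         // scalar tail
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```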

AB: But SIMD is on CPU still?

JE: SIMD is CPU, yes.

AB: Which becomes more and more important, I think, because of the GPU shortage. It's super interesting that you have a solution which is accelerated on CPU and not necessarily GPU, because GPUs are expensive, power hungry, and not available. So maybe, I think, in the research, more stuff happens on CPUs, right?

JE: It is true that most of the research has historically been done on CPUs. You are starting to see more research happen about, is there a way that we can accelerate vector search using GPU.

The easy way to accelerate vector search with GPUs is during index construction. And there's a bunch of people who are doing interesting work in solving that. Microsoft has an open source repository called SPTAG that uses GPUs to accelerate construction. NVIDIA has one called CAGRA that does the same thing.

Then the harder part is, can we also use GPUs to accelerate the searches themselves? And the reason that's harder is that every time you go to the GPU with a request, there's just this really, really high penalty in terms of launching the kernel. There's a fairly high latency penalty there. And also anything that you need the GPU to compute on, you need to copy into VRAM. And that's also a fairly slow operation. So I would say nobody knows the best way to combine the two. So I think it's fair to say that right now, the state of the art is doing searches on CPU.

0:48:44 Low-level optimizations in JVector

AB: How was the implementation of all the features? Did you use low-level Java, which looks like C, or were you able to use high-level Java constructs, nice code, you know, Java records, Java 21 features, and more? So what does the code look like?

JE: The bad news is it's still true that the more you care about performance in Java, the more you end up writing code that looks like C. I mentioned that the edge list was our bottleneck. Originally, the edge list was an array of a class that was nested three deep. And so you had a reference to a reference to a reference. And then you would have one of those entries for every edge. And the edge itself is just a 32-bit integer, so it's four bytes. So I've got four bytes for the edge, and then I've got another 48 or so bytes for my internals. And so that's obviously really bad efficiency.

I think it's the first time in 20-plus years of writing Java code where I went in and I re-optimized my objects for that memory footprint. And I'm used to doing it for reducing allocations. I've done that a lot, everyone knows that. But this is the first time that the reference overhead in that object hierarchy was one of the bottlenecks that we ran into. 

AB: Is it a common pattern, what you did, that you replaced the object references, or is it not?

JE: It was specific to the JVector internals.

AB: And were you able to use the Java vector API or was it...

JE: Yes. So there's three implementations of vector similarity in JVector. One is just normal, unoptimized CPU code, and that's because Cassandra still needs to run on JDK 11. And so we have a no-acceleration-at-all code path for JDK 11. The second one is for if you have JDK 19 or more recent. That's the last time that they changed the Panama API in an incompatible way for us. So if you have 19 or more recent, then you can use the Panama vector code. And then finally, for the Fused ADC, for the part about the transposed codebooks that we put into SIMD registers, that's just way too low level, even for the Panama intrinsics. And so we actually have some C code that we leverage with the Java 22 foreign function API. And that actually requires Java 22, because in Java 22 you can have a float array that Java can access, get a memory segment reference to it, and pass it off to a C API without having to copy it into native memory first.
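
[ed: The Java 22 piece referred to here is, as I understand it, the foreign-function Linker.Option.critical(true) option, which lets a downcall read a heap array through a MemorySegment without copying it to native memory first. A hedged sketch follows; the native library name "simdkernels", the symbol dot_f32, and its C signature float dot_f32(const float*, const float*, int) are made up for illustration.]

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.SymbolLookup;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

// Sketch (Java 22): call a hypothetical native similarity kernel on a Java float[]
// without copying it off-heap first, via Linker.Option.critical(true).
public class NativeSimilarity {
    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        SymbolLookup lib = SymbolLookup.libraryLookup("simdkernels", Arena.global());

        MethodHandle dot = linker.downcallHandle(
                lib.find("dot_f32").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_FLOAT,
                        ValueLayout.ADDRESS, ValueLayout.ADDRESS, ValueLayout.JAVA_INT),
                Linker.Option.critical(true));   // allows heap segments as by-reference arguments

        float[] a = new float[1024];
        float[] b = new float[1024];
        // Heap-backed segments: no copy into native memory before the call.
        MemorySegment segA = MemorySegment.ofArray(a);
        MemorySegment segB = MemorySegment.ofArray(b);
        float similarity = (float) dot.invokeExact(segA, segB, a.length);
        System.out.println(similarity);
    }
}
```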

JE: Most of the places that people are using float16, you see Lucene doing float16, for instance, that's because they don't have product quantization and the re-ranking pipeline. So no, I'm not going to be rushing to add that because I don't think it solves the problem that we have since we're all doing much higher compression levels with PQ.

0:52:45 Conclusion

AB: So we covered JVector 3. So all the features are there. So there is no reason not to use it, right? So just use it: it needs significantly less memory, and in the best possible case, you will need only about a tenth of the RAM, right?

JE: That's right. And if you enable the Fused ADC, then even less when you're serving the queries.

JE: I'll give one more shout out to that Foojay article about indexing Wikipedia with JVector. That's probably what I'd refer people to if they want to go deeper and say, Okay, what does the code actually look like?

AB: Perfect. Where can people find Astra, JVector, Cassandra, and you on the internet?

JE: Well, datastax.com for Astra. And then I'm on Twitter as Spyced. And then, of course, on github.com/jbellis. That's where the JVector repository lives and so forth.

AB: So thank you for your time and see you next time maybe in JVector 4, 5, or 6.

JE: All right. Thanks so much, Adam!
