Wednesday, January 28, 2009

All you ever wanted to know about writing bloom filters

It seems like Bloom filters are all the rage these days. Three years ago I had barely heard of them and now it seems like I see articles and code using them all the time. That's mostly a good thing, since bloom filters are a very useful tool to avoid performing expensive computations without the full memory overhead of a standard map/dictionary.
Bloom filters are surprisingly simple: divide a memory area into buckets (one bit per bucket for a standard bloom filter; more -- typically four -- for a counting bloom filter). To insert a key, generate several hashes per key, and mark the buckets for each hash. To check if a key is present, check each bucket; if any bucket is empty, the key was never inserted in the filter. If all buckets are non-empty, though, the key is only probably inserted -- other keys' hashes could have covered the same buckets. Determining exactly how big to make the filter and how many hashes to use to achieve a given false positive rate is a solved problem; the math is out there.
But it turns out that it's surprisingly hard to find good information on one part of the implementation: how do you generate an indefinite number of hashes? Even small filters will use three or four; a dozen or more is not unheard of.
Cassandra uses bloom filters to save IO when performing a key lookup: each SSTable has a bloom filter associated with it that Cassandra checks before doing any disk seeks, making queries for keys that don't exist almost free. Unfortunately, Cassandra was using perhaps the worst possible implementation of hash functions: a hardcoded list of hash functions, apparently from this page. A lot of those hashes are reimplementations of the same algorithm! For 6 or less hash functions, Cassandra was only actually generating 2 distinct hash values, so the false positive rate was far far higher than would have been expected.
Judging from the bloom filter implementations out there, generating appropriate hashes is surprisingly hard to get right. One implementation in java uses object.hashCode() to seed the stdlib's PRNG and calls Random.nextInt() to generate the bloom hashes. This works okay for small filters but the false positive rate is up to 140% of the expected rate for large filters. This one in python combines the stdlib hash and a pjw hash with simple arithmetic to achieve "only" an extra 10-15% false positives.
It turns out that most off-the-shelf algorithms suck for bloom filter purposes because their output isn't distributed well-enough across the (32-bit) hash space; most hash function authors are more concerned with speed than achieving a really good output distribution. For a lot of applications, this is a good tradeoff, but you are presumably using a bloom filter because a filter hit requires a relatively very expensive operation, so double-digit increases in this rate is Very Bad. (So I would be skeptical of any home-made hash function generator too, like PyBloom's.)
There are two approaches that I found generate results in line with the theoretical predictions. One is to use a cryptographic hash function like SHA-1. Cryptography's requirements may be overkill for a bloom filter, but crypto hashes do spread their output uniformly across the hash space. So SHA-1 works fine, and you can feed the hash a constant or part of the earlier-generated hash to generate extra data as needed so you're not limited to 5 hashes. (Although Java makes this a pain by having all the digest methods also reset the hash to uninitialized. Explicit is better than implicit!)
The other is the Murmur and Jenkins hashes. Both of these generate very well-distributed hashes; Murmur is about 10% faster than Jenkins. (Murmur3 is faster still.) You can generate extra hashes either by using the output of hash#i as the initial value for hash#i+1, which is the approach used by Hadoop, or by computing hash#i = hash0 + i*hash1, which is described by Kirsch and Mitzenmacher and is the approach Cassandra settled on (with hash0 and hash1 being the two halves of a 128-bit Murmur3 hash). The Hadoop approach is about twice as fast as the SHA approach for 5 hashes; Cassandra's is about 5x faster. (Remember, SHA will give you 160 bits of pseudo-randomness at a time, so Murmur will be faster still depending on how far you are from a multiple of that.)
In short: use the murmur hash in your bloom filter. If you're using java, you can grab Cassandra's implementation from my github tree until our new repository is up at the apache incubation site.  Unfortunately it's spread across half a dozen files, without even counting the tests.  Start with [Counting]BloomFilter and go up from there.  Your other option is Hadoop's (if you grab it from svn, because that has my fix included), which is even less self-contained and requires wrapping your keys in a Key object.  Sorry.
Bonus tip: if you do roll your own, you should probably port Cassandra's test suite to your language of choice.  In particular, nobody else bothers to check that false positive rates are within what is predicted by the math, but if you don't, it's easy for bugs that pass simple tests to slip by, like the one in Cassandra's older implementation, or the one I found in Hadoop's.
(Updated 2014 to link the Kirsch/Mitzenmacher paper in case people don't read the comments.)

5 comments:

Thomas Hurst said...

SHA512 gives you 3.2x the bits to play with, while only being 20-40% slower per call compared with SHA1; may be a better option.

You can also get more indexes out of the hash function by XORing different chunks of it together; a lot cheaper than generating another SHA512, especially if you only need a few more items.

Jonathan Ellis said...

Thomas: good point, although for low numbers of hashes you'd have more wasted computation. Perhaps a hybrid approach would be worth doing for very performance-sensitive uses.

Justin Mason said...

'You can also get more indexes out of the hash function by XORing different chunks of it together; a lot cheaper than generating another SHA512, especially if you only need a few more items.'

Alternatively, it's possible to simply use contiguous, non-overlapping chunks of the SHA512 hash as multiple indexes in this case, right? They've already been XORed over each other multiple times; another XOR won't make much difference...

Tony Finch said...

You only need two hash functions, since you can derive as many hash values as you need as a linear combination of two without compromising the performance of your Bloom filter. This is explained in a paper by Adam Kirsch and Michael Mitzenmacher: http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf

In practice you can use one pass of a crypto hash function and extract two segments of bits from the result to make your hash values.

Jonathan Ellis said...

Tony: one of the code bases I found (the python one I linked) used that approach to poor effect, so I dismissed it. But you are right, it turns out that if you use good hashes as input the combinatorial algorithm is fine.

Note that two iterations of Murmur are as good as a crypto hash, and faster. So that is what Cassandra is using now.

Hopefully that is the final word. :)