Wednesday, January 28, 2009

All you ever wanted to know about writing bloom filters

It seems like Bloom filters are all the rage these days. Three years ago I had barely heard of them and now it seems like I see articles and code using them all the time. That's mostly a good thing, since bloom filters are a very useful tool to avoid performing expensive computations without the full memory overhead of a standard map/dictionary.
Bloom filters are surprisingly simple: divide a memory area into buckets (one bit per bucket for a standard bloom filter; more -- typically four -- for a counting bloom filter). To insert a key, generate several hashes per key, and mark the buckets for each hash. To check if a key is present, check each bucket; if any bucket is empty, the key was never inserted in the filter. If all buckets are non-empty, though, the key is only probably inserted -- other keys' hashes could have covered the same buckets. Determining exactly how big to make the filter and how many hashes to use to achieve a given false positive rate is a solved problem; the math is out there.
But it turns out that it's surprisingly hard to find good information on one part of the implementation: how do you generate an indefinite number of hashes? Even small filters will use three or four; a dozen or more is not unheard of.
Cassandra uses bloom filters to save IO when performing a key lookup: each SSTable has a bloom filter associated with it that Cassandra checks before doing any disk seeks, making queries for keys that don't exist almost free. Unfortunately, Cassandra was using perhaps the worst possible implementation of hash functions: a hardcoded list of hash functions, apparently from this page. A lot of those hashes are reimplementations of the same algorithm! For 6 or less hash functions, Cassandra was only actually generating 2 distinct hash values, so the false positive rate was far far higher than would have been expected.
Judging from the bloom filter implementations out there, generating appropriate hashes is surprisingly hard to get right. One implementation in java uses object.hashCode() to seed the stdlib's PRNG and calls Random.nextInt() to generate the bloom hashes. This works okay for small filters but the false positive rate is up to 140% of the expected rate for large filters. This one in python combines the stdlib hash and a pjw hash with simple arithmetic to achieve "only" an extra 10-15% false positives.
It turns out that most off-the-shelf algorithms suck for bloom filter purposes because their output isn't distributed well-enough across the (32-bit) hash space; most hash function authors are more concerned with speed than achieving a really good output distribution. For a lot of applications, this is a good tradeoff, but you are presumably using a bloom filter because a filter hit requires a relatively very expensive operation, so double-digit increases in this rate is Very Bad. (So I would be skeptical of any home-made hash function generator too, like PyBloom's.)
There are two approaches that I found generate results in line with the theoretical predictions. One is to use a cryptographic hash function like SHA-1. Cryptography's requirements may be overkill for a bloom filter, but crypto hashes do spread their output uniformly across the hash space. So SHA-1 works fine, and you can feed the hash a constant or part of the earlier-generated hash to generate extra data as needed so you're not limited to 5 hashes. (Although Java makes this a pain by having all the digest methods also reset the hash to uninitialized. Explicit is better than implicit!)
The other is the Murmur and Jenkins hashes. Both of these generate very well-distributed hashes; Murmur is about 10% faster than Jenkins. (Murmur3 is faster still.) You can generate extra hashes either by using the output of hash#i as the initial value for hash#i+1, which is the approach used by Hadoop, or by computing hash#i = hash0 + i*hash1, which is described by Kirsch and Mitzenmacher and is the approach Cassandra settled on (with hash0 and hash1 being the two halves of a 128-bit Murmur3 hash). The Hadoop approach is about twice as fast as the SHA approach for 5 hashes; Cassandra's is about 5x faster. (Remember, SHA will give you 160 bits of pseudo-randomness at a time, so Murmur will be faster still depending on how far you are from a multiple of that.)
In short: use the murmur hash in your bloom filter. If you're using java, you can grab Cassandra's implementation from my github tree until our new repository is up at the apache incubation site.  Unfortunately it's spread across half a dozen files, without even counting the tests.  Start with [Counting]BloomFilter and go up from there.  Your other option is Hadoop's (if you grab it from svn, because that has my fix included), which is even less self-contained and requires wrapping your keys in a Key object.  Sorry.
Bonus tip: if you do roll your own, you should probably port Cassandra's test suite to your language of choice.  In particular, nobody else bothers to check that false positive rates are within what is predicted by the math, but if you don't, it's easy for bugs that pass simple tests to slip by, like the one in Cassandra's older implementation, or the one I found in Hadoop's.
(Updated 2014 to link the Kirsch/Mitzenmacher paper in case people don't read the comments.)