Skip to main content

Consistent hashing vs order-preserving partitioning in distributed databases

The Cassandra distributed database supports two partitioning schemes now: the traditional consistent hashing scheme, and an order-preserving partitioner.

The reason that almost all similar systems use consistent hashing (the dynamo paper has the best description; see sections 4.1-4.3) is that it provides a kind of brain-dead load balancing from the hash algorithm spreading keys across the ring. But the dynamo authors go into some detail about how this by itself doesn't actually give good results in practice; their solution was to assign multiple tokens to each node in the cluster and they describe several approaches to that. But Cassandra's original designer considers this a hack and prefers real load balancing.

An order-preserving partitioner, where keys are distributed to nodes in their natural order, has huge advantages over consistent hashing, particularly the ability to do range queries across the keys in the system (which has also been committed to Cassandra now). This is important, because the corollary of "the partitioner uses the key to determine what node the data is on" is, "each key should only have an amount of data associated with it (see the data model explanation) that is relatively small compared to a node's capacity." Cassandra column families will often have more columns in them than you'd see in a traditional database, but "millions" is pushing it (depending on column size) and "billions" is a bad idea. So you'll want to model things such that you spread data across multiple keys and if you then pick an appropriate key naming convention, range queries will let you slice and dice that data as needed.

Cassandra is in the process of implementing load balancing still, but in the meantime order-preserving partitioning is still be useful without that if you know what your key distribution will look like in advance and can pick your node tokens accordingly. Otherwise, there's always the old-school hash-based partitioner until we get that done (for the release after the one we'll have in the next week or so).

See the introduction and getting started pages of the Cassandra wiki for more on Cassandra, and drop us a line on the mailing list or in IRC if you have questions; we're actively trying to improve our docs.

Comments

Jay Farrimond said…
Very interesting. BTW, reading the post I get the feeling you're now a part of the Cassandra development team. Is this true?
Jonathan Ellis said…
Yes, I was added as a committer at the end of March.
Jay Farrimond said…
Cool! I started to say "good for you", then I realized that it's really "good for Cassandra!"

Popular posts from this blog

Why schema definition belongs in the database

Earlier, I wrote about how ORM developers shouldn't try to re-invent SQL . It doesn't need to be done, and you're not likely to end up with an actual improvement. SQL may be designed by committee, but it's also been refined from thousands if not millions of man-years of database experience. The same applies to DDL. (Data Definition Langage -- the part of the SQL standard that deals with CREATE and ALTER.) Unfortunately, a number of Python ORMs are trying to replace DDL with a homegrown Python API. This is a Bad Thing. There are at least four reasons why: Standards compliance Completeness Maintainability Beauty Standards compliance SQL DDL is a standard. That means if you want something more sophisticated than Emacs, you can choose any of half a dozen modeling tools like ERwin or ER/Studio to generate and edit your DDL. The Python data definition APIs, by contrast, aren't even compatibile with other Python tools. You can't take a table definition

Python at Mozy.com

At my day job, I write code for a company called Berkeley Data Systems. (They found me through this blog, actually. It's been a good place to work.) Our first product is free online backup at mozy.com . Our second beta release was yesterday; the obvious problems have been fixed, so I feel reasonably good about blogging about it. Our back end, which is the most algorithmically complex part -- as opposed to fighting-Microsoft-APIs complex, as we have to in our desktop client -- is 90% in python with one C extension for speed. We (well, they, since I wasn't at the company at that point) initially chose Python for speed of development, and it's definitely fulfilled that expectation. (It's also lived up to its reputation for readability, in that the Python code has had 3 different developers -- in serial -- with very quick ramp-ups in each case. Python's succinctness and and one-obvious-way-to-do-it philosophy played a big part in this.) If you try it out, pleas

A review of 6 Python IDEs

(March 2006: you may also be interested the updated review I did for PyCon -- http://spyced.blogspot.com/2006/02/pycon-python-ide-review.html .) For September's meeting, the Utah Python User Group hosted an IDE shootout. 5 presenters reviewed 6 IDEs: PyDev 0.9.8.1 Eric3 3.7.1 Boa Constructor 0.4.4 BlackAdder 1.1 Komodo 3.1 Wing IDE 2.0.3 (The windows version was tested for all but Eric3, which was tested on Linux. Eric3 is based on Qt, which basically means you can't run it on Windows unless you've shelled out $$$ for a commerical Qt license, since there is no GPL version of Qt for Windows. Yes, there's Qt Free , but that's not exactly production-ready software.) Perhaps the most notable IDEs not included are SPE and DrPython. Alas, nobody had time to review these, but if you're looking for a free IDE perhaps you should include these in your search, because PyDev was the only one of the 3 free ones that we'd consider using. And if you aren