Friday, March 27, 2009

Why I like the Cassandra distributed database

I need a distributed database. A real distributed database; replication doesn't count because under a replication-oriented db, each node still needs to be able to handle the full write volume, and you can only throw hardware at that for so long.

So, I'm working on the Cassandra distributed database. I gave a lightning talk on it at PyCon this morning. Cassandra is written in Java and implements a sort of hybrid between Dynamo and Bigtable. (Both papers are worth reading.) It takes its distribution algorithm from Dynamo and its data model from Bigtable -- sort of the best of both worlds. Avinash Lakshman, Cassandra's architect, is one of the authors of the Dynamo paper.

There is a video about Cassandra here. The first 1/4 is about using Cassandra and then the rest is mostly about the internals.

Cassandra is very bleeding edge. Facebook runs several Cassandra clusters in production (the largest is 120 machines and 40TB of data), but there are sharp edges that will cut you. If you want something that Just Works out of the box Cassandra is a poor fit right now and will be for several months.

Cassandra was open-sourced by Facebook last summer. There was some initial buzz but the facebook developers had trouble dealing with the community and the project looked moribund -- FB was doing development against an internal repository and throwing code over the wall every few months which is no way to run an OSS project. Now the code is being developed in the open on Apache and I was voted in as a committer so things are starting to move again.

There are other distributed databases that are worth considering. Here is why those don't fit my needs (and I am not saying that these are bad choices if your requirements are different, just that they don't work for me):


  • Follows the bigtable model, so it's more complicated than it needs to be. (300+kloc vs 50 for Cassandra; many more components). This means it's that much harder for me to troubleshoot. HBase is more bug-free than Cassandra but not so bug-free that troubleshooting would not be required.
  • Does not have any non-java clients. I need CPython support.
  • Sits on top of HDFS, which is optimized for streaming reads, not random accesses. So HBase is fine for batch processing but not so good for online apps.
  • See HBase, except it's written in C++.
Project Voldemort:
  • Voldemort is probably the key / value store in the pattern of Dynamo that is farthest along. If you only need key / value I would definitely recommend Voldemort. Next closest is probably Dynomite. Then there are a whole bunch of "me too" key value stores that made fatal architecture decisions or are writing a "memcached with persistence" without really thinking it through.
Running Cassandra [updated 04-22-09 for hacker news visitors]:
  1. prereqs: jdk6 and ant
  2. Check out the code from (did I mention this was early-adopter only?)
  3. run ant [optional: ant test]. If you get an error like "class file has wrong version 50.0, should be 49.0" then ant is using an old jdk version instead of 6.
  4. For non-java clients, install Thrift. (For java, trunk includes libthrift.jar.) This is a major undertaking in its own right. See this page for a list of dependencies, although most of the rest of that page is now outdated -- for instance, Cassandra no longer depends on the fb303 interface. Python users will have to hand-edit the generated in three obvious places until this bug is fixed -- just replace the broken argument with None.
  5. run bin/cassandra [optionally -f for foreground]
  6. Connect to the server. By default it listens on localhost:9160. Look at config/server-conf.xml for the columnfamily definitions.
  7. Insert and query some data. Here is an introduction.
  8. Ask on the mailing list or #cassandra on freenode if you have questions.
(Cassandra has a new website up to replace the google code one. We're actively working on the docs, so let us know what needs work.) Good luck!


Jim said...

> Then there are a whole bunch of "me too" key value stores that made fatal architecture decisions or are writing a "memcached with persistance" without really thinking it through.

Selecting the C and P parts of the CAP triangle is not a "fatal architecture decision", it is simply one choice among several equally valid options. By selecting availability over consistency Vogels/Amazon has decided to use application-specific code to reconcile inconsistent data; anyone who makes the A & P choice is doing this and it can impose a non-trivial cost on the end-users of the system.

Eventually consistent can easily turn into not consistent if the end-user isn't cognizant of the tradeoffs involved and doesn't understand what burden is being shifted from the database to their own codebase. An equally valid claim could be made that distributed databases which opt for availability and partition-tolerance over consistency will be good choices for end-users who never really needed a database in the first place while people who need ACID properties are more likely to hold on to consistency as a key property in their database.

Jonathan Ellis said...

It's fatal for my needs. (I chose my words carefully; I said 'decision' rather than 'mistake.' :)

Like I said, YMMV depending on your requirements.

jdunck said...

/me looks at the thrift depends list.

Possibly stupid question: from a system w/ a working thrift install, is it possible to generate the .py (or whatever lang) and then use that generated code on other systems?

I'm asking because it seems like a client-library buildbot would be useful here.


Jonathan Ellis said...

You can definitely pre-generate the pure Python library & client code. Not sure about the `fastbinary` c extension -- it may depend on the C++ library which would make things trickier.

Buildbot would be great. I have no idea how to set one up. (Maybe once Thrift gets a stable release out -- they are working on that now.)

Jason Dusek said...

It's no secret that Amazon is still a major Oracle customer; an A & P system like Dynamo is not suitable for certain kinds of data.

Michael Greene said...

I have to agree on the above comments' points about a Thrift library through buildbot/etc... but would also like to point out that HBase has a Thrift API which you have ignored in your post.

Alexander Fairley said...

As an FYI, you don't have a link to your github on
" (Update: also a commit Thursday added incorrectly-indented docstrings. Wonderful. Here is my github tree w/o that problem.) "

As an aside, calling this CAP business a theorem has been bugging me, but I guess that
establishes it at least in some models.

Phil Wise said...

Make sure that the API of cassandra is enough for what you need. There are some operations on supercolumns that you might expect to be there but are missing (I'm sure for good reason). I went through and documented the cassandra api a few months ago when I was investigating it.

Jonathan Ellis said...

Thanks for taking the time to look at Cassandra, Phil. Certainly back in January it took a brave soul to dive in. We're hoping to make things a little more user-friendly as we approach a release. :)

I'd reply on your blog but I don't see a comment section? So here goes:

* we do plan to add the ability to add and remove columnfamilies after cluster start. the "right" way is to use something like ZooKeeper. the "quick hack" way is to just allow adding new CFs via the config xml -- if anyone is really blocking on this I can point you in the right direction; it's a couple hours of work is all. Otherwise I'd rather wait until we can do it "right."
* the api isn't set in stone; we're definitely willing to evaluate proposals for new features.
* for your specific example, the functionality is already there; get_column will take either columnfamily:column or supercolumnfamily:supercolumn:column. This is admittedly confusing, but it does work. :)
* take another look at Remove; I've done a lot of work on making that actually work. (similar to get_column you can specify varying degrees of granularity there -- you can remove a subcolumn, supercolumn, or entire columnfamily associated w/ a key.)

Feel free to drop by #cassandra if you have any more questions!

Sam said...

Hi Jonathan,
Thanks for the info, it helps a lot.
One question: is Cassandra a good fit to keep media on distributed nodes - images, videos etc..
AFAIK, thrift interface wont allow streaming access to the db.

Can you please shed some light! any help is appreciated.


Jonathan Ellis said...

Cassandra is really designed for storing structured and semi-structured data than large media blobs, although I sketched out what large-object support might look like here:

You could also use cassandra to store metadata about where your blobs are stored on conventional servers for something like MogileFS w/o the mysql bottleneck.

Sam said...

Right, I am thinking of something similar.
Although MogileFS is great and even proven to perform at Digg, I am looking for a java alternative. Dont ask me why :-)

Thanks for your reply.