So, I'm working on the Cassandra distributed database. I gave a lightning talk on it at PyCon this morning. Cassandra is written in Java and implements a sort of hybrid between Dynamo and Bigtable. (Both papers are worth reading.) It takes its distribution algorithm from Dynamo and its data model from Bigtable -- sort of the best of both worlds. Avinash Lakshman, Cassandra's architect, is one of the authors of the Dynamo paper.
There is a video about Cassandra here. The first 1/4 is about using Cassandra and then the rest is mostly about the internals.
Cassandra is very bleeding edge. Facebook runs several Cassandra clusters in production (the largest is 120 machines and 40TB of data), but there are sharp edges that will cut you. If you want something that Just Works out of the box Cassandra is a poor fit right now and will be for several months.
Cassandra was open-sourced by Facebook last summer. There was some initial buzz but the facebook developers had trouble dealing with the community and the project looked moribund -- FB was doing development against an internal repository and throwing code over the wall every few months which is no way to run an OSS project. Now the code is being developed in the open on Apache and I was voted in as a committer so things are starting to move again.
There are other distributed databases that are worth considering. Here is why those don't fit my needs (and I am not saying that these are bad choices if your requirements are different, just that they don't work for me):
HBase:
- Follows the bigtable model, so it's more complicated than it needs to be. (300+kloc vs 50 for Cassandra; many more components). This means it's that much harder for me to troubleshoot. HBase is more bug-free than Cassandra but not so bug-free that troubleshooting would not be required.
- Does not have any non-java clients. I need CPython support.
- Sits on top of HDFS, which is optimized for streaming reads, not random accesses. So HBase is fine for batch processing but not so good for online apps.
- See HBase, except it's written in C++.
- Voldemort is probably the key / value store in the pattern of Dynamo that is farthest along. If you only need key / value I would definitely recommend Voldemort. Next closest is probably Dynomite. Then there are a whole bunch of "me too" key value stores that made fatal architecture decisions or are writing a "memcached with persistence" without really thinking it through.
- prereqs: jdk6 and ant
- Check out the code from http://svn.apache.org/repos/asf/incubator/cassandra/trunk (did I mention this was early-adopter only?)
- run ant [optional: ant test]. If you get an error like "class file has wrong version 50.0, should be 49.0" then ant is using an old jdk version instead of 6.
- For non-java clients, install Thrift. (For java, trunk includes libthrift.jar.) This is a major undertaking in its own right. See this page for a list of dependencies, although most of the rest of that page is now outdated -- for instance, Cassandra no longer depends on the fb303 interface. Python users will have to hand-edit the generated Cassandra.py in three obvious places until this bug is fixed -- just replace the broken argument with None.
- run bin/cassandra [optionally -f for foreground]
- Connect to the server. By default it listens on localhost:9160. Look at config/server-conf.xml for the columnfamily definitions.
- Insert and query some data. Here is an introduction.
- Ask on the mailing list or #cassandra on freenode if you have questions.
Comments
Selecting the C and P parts of the CAP triangle is not a "fatal architecture decision", it is simply one choice among several equally valid options. By selecting availability over consistency Vogels/Amazon has decided to use application-specific code to reconcile inconsistent data; anyone who makes the A & P choice is doing this and it can impose a non-trivial cost on the end-users of the system.
Eventually consistent can easily turn into not consistent if the end-user isn't cognizant of the tradeoffs involved and doesn't understand what burden is being shifted from the database to their own codebase. An equally valid claim could be made that distributed databases which opt for availability and partition-tolerance over consistency will be good choices for end-users who never really needed a database in the first place while people who need ACID properties are more likely to hold on to consistency as a key property in their database.
Like I said, YMMV depending on your requirements.
Possibly stupid question: from a system w/ a working thrift install, is it possible to generate the .py (or whatever lang) and then use that generated code on other systems?
I'm asking because it seems like a client-library buildbot would be useful here.
No?
Buildbot would be great. I have no idea how to set one up. (Maybe once Thrift gets a stable release out -- they are working on that now.)
" (Update: also a commit Thursday added incorrectly-indented docstrings. Wonderful. Here is my github tree w/o that problem.) "
As an aside, calling this CAP business a theorem has been bugging me, but I guess that http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.1495
establishes it at least in some models.
I'd reply on your blog but I don't see a comment section? So here goes:
* we do plan to add the ability to add and remove columnfamilies after cluster start. the "right" way is to use something like ZooKeeper. the "quick hack" way is to just allow adding new CFs via the config xml -- if anyone is really blocking on this I can point you in the right direction; it's a couple hours of work is all. Otherwise I'd rather wait until we can do it "right."
* the api isn't set in stone; we're definitely willing to evaluate proposals for new features.
* for your specific example, the functionality is already there; get_column will take either columnfamily:column or supercolumnfamily:supercolumn:column. This is admittedly confusing, but it does work. :)
* take another look at Remove; I've done a lot of work on making that actually work. (similar to get_column you can specify varying degrees of granularity there -- you can remove a subcolumn, supercolumn, or entire columnfamily associated w/ a key.)
Feel free to drop by #cassandra if you have any more questions!
Thanks for the info, it helps a lot.
One question: is Cassandra a good fit to keep media on distributed nodes - images, videos etc..
AFAIK, thrift interface wont allow streaming access to the db.
Can you please shed some light! any help is appreciated.
Cheers
You could also use cassandra to store metadata about where your blobs are stored on conventional servers for something like MogileFS w/o the mysql bottleneck.
Although MogileFS is great and even proven to perform at Digg, I am looking for a java alternative. Dont ask me why :-)
Thanks for your reply.