- Getting Started: Cassandra is surprisingly easy to try out. This walks you through both single-node and clustered setup.
- The Dynamo paper and Amazon's related article on eventual consistency: Cassandra's replication model is strongly influenced by Dynamo's. Almost everything you read here also applies to Cassandra. (The major exceptions are vector clocks, which Cassandra does not use -- though that may change -- and Cassandra's support for order-preserving partitioning with active load balancing.)
- WTF is a SuperColumn? Arin Sarkissian from Digg explains the Cassandra data model.
- Operations: stuff you will want to know when you run Cassandra in production
- Cassandra users survey from Nov 09: What Twitter, Mahalo, Ooyala, SimpleGeo, and others are using Cassandra for
- More articles here (Cassandra on OS X seems to be a particularly popular topic)
Tuesday, December 15, 2009
Wednesday, July 29, 2009
Stu Hood flew up from Rackspace's Virginia offices just for the night, which normally probably wouldn't have been worth it, but Cliff Moon, author of dynomite, showed up (thanks, Cliff!) and was able to give Stu a lot of pointers on implementing Merkle trees. Cliff and I also had a good discussion with Jun Rao about hinted handoff--Cliff and Jun are not fans, and I tend to agree with them--and eventual consistency.
I also met David Pollack and got to talk a little about persistence for Goat Rodeo, and talked to a ton of people from Twitter and Digg. I think those two, with Rackspace and IBM Research, constituted the companies with more than one engineer attending. The rest was "long tail."
Back at OSCON, my Cassandra talk was standing room only. Slides:
- Gearman: Bringing the Power of Map/Reduce to Everyday Applications
- High Performance SQL with PostgreSQL [8.4]
- Linux Filesystem Performance for Databases (reiserfs blows everyone away for random writes, by a factor of > 2!?)
- Neo4j - The Benefits of Graph Databases
- Release Mismanagement: How to Alienate Users and Frustrate Developers
Monday, July 06, 2009
We had two more bug-fix release candidates, and it's virtually certain that 0.3-final will be the same exact code as 0.3-rc3. (If you're using rc1, you do want to upgrade; see CHANGES.txt.) But, we got stuck in the ASF bureaucracy and it's going to take at least one more round-trip before the crack Release Prevention Team grudgingly lets us call it official.
In the meantime, work continues apace on trunk for 0.4.
Tuesday, June 23, 2009
In the meantime, there is git-svn.
(The ASF does have a git mirror set up, but I'm going to ignore that because (a) its reliability has been questionable and (b) sticking with git-svn hopefully makes this more useful for non-ASF projects.)
Disclaimer: I am not a git expert, and probably some of this will make you cringe if you are. Still, I hope it will be useful for some others fumbling their way towards enlightenment. As background, I suggest the git crash course for svn users. Just the parts up to the Remote section.
- git-svn init https://svn.apache.org/repos/asf/cassandra/trunk cassandra
Creating new code:
- git checkout -b [ticket number]
- [edit stuff, maybe git add or git rm new or obsolete files]
- git commit -a -m 'commit'
- repeat the edit and commit steps as necessary
- git-jira-attacher [revision] (usually some variant of HEAD^^^)
- git log (just to make sure I'm about to commit what I think I'm about to commit)
- git-svn dcommit
- git checkout master
- git-svn rebase -l (this will put the changes you just committed into master)
- git branch -d [ticket number]
- git checkout -b [ticket number]
- wget patches and git-apply, or jira-apply CASSANDRA-[ticket-number]
- review in gitk/qgit and/or IDE (the intellij git plugin is quite decent)
- commit .. branch -d as above
- git branch
One fairly common complication is if you finish a ticket A, then start on ticket B (that depends on A) while waiting for A to be reviewed. So you checkout -b from your branch A rather than master and build some patches on that. As sometimes happens, the reviewer finds something you need to improve in your patch set for A, so you make those changes. Now you need to rebase your patches to B on top of the changes you made to A. The best way to do this is to branch A to B-2, then git cherry-pick from B and resolve conflicts as necessary.
Final note: I often like to create lots of small commits as I am exploring a solution and combine them into larger units with git rebase -i for patch submission. (It's easier to combine small patches than to pull apart large ones.) So my early commit messages are often terse and need editing. You can change commit messages with edit mode in rebase, then use commit --amend and rebase --continue, but that is tedious. I complained about this to my friend Zach Wily and he made this git amend-message command (place in [alias] in your .gitconfig):
amend-message = "!bash -c ' \
    c=$0; \
    if [ $c == \"bash\" ]; then echo \"Usage: git amend-message <commit>\"; exit 1; fi; \
    saved_head=$(git rev-parse HEAD); \
    commit=$(git rev-parse $c); \
    commits=$(git log --reverse --pretty=format:%H $commit..HEAD); \
    echo \"Rewinding to $commit...\"; \
    git reset --hard $commit; \
    git commit --amend; \
    for X in $commits; do \
        echo \"Applying $X...\"; \
        git cherry-pick $X >> /dev/null; \
        if [ $? -ne 0 ]; then \
            echo \"  apply failed (is this a merge?), rolling back all changes\"; \
            git reset --hard $saved_head; \
            echo \"  ** AMEND-MESSAGE FAILED, sorry\"; \
            exit 1; \
        fi; \
    done; \
    echo \"Done\"'"

(Zach would like the record to show that he knows this is pretty hacky. "For instance, it won't work if one of the commits after the one you're changing is a merge, since cherry-pick can't handle those." But it's quite useful, all the same.)
For what it's worth, the rest of my aliases are
st = status
ci = commit
co = checkout
br = branch
cp = cherry-pick
Wednesday, May 27, 2009
Short version: no. A few specialized applications can and have been built on a plain DHT, but most applications built on DHTs have ended up having to customize the DHT's internals to achieve their functional or performance goals.
This paper describes the results of attempting to build a relatively complex datastructure (prefix hash trees, for range queries) on top of OpenDHT. The result was mostly failure:
A simple put-get interface was not quite enough. In particular, OpenDHT relies on timeouts to invalidate entries and has no support for atomicity primitives... In return for ease of implementation and deployment, we sacrificed performance. With the OpenDHT implementation, a PHT query operation took a median of 2–4 seconds. This is due to the fact that layering entirely on top of a DHT service inherently implies that applications must perform a sequence of put-get operations to implement higher level semantics with limited opportunity for optimization within the DHT.

In other words, there are two primary problems with the DHT approach:
- Most DHTs will require a second locking layer to achieve correctness when implementing a more complex data structure on top of the DHT semantics. In particular, this will certainly apply to eventually-consistent systems in the Dynamo mold.
- Advanced functionality like range queries needs to be supported natively to be at all efficient.
This is one reason why I think Cassandra is the most promising of the open-source distributed databases -- you get a relatively rich data model and a distribution model that supports efficient range queries. These are not things that can be grafted on top of a simpler DHT foundation, so Cassandra will be useful for a wider variety of applications.
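To make the round-trip cost concrete, here is a toy simulation of the two models. All of the names here are invented for illustration -- this is neither OpenDHT's nor Cassandra's actual interface -- but it shows why layering a range query on a put/get-only DHT burns one network round trip per candidate key, while a store with native range support answers in one:

```python
import hashlib

class PutGetDHT:
    """Toy hash-partitioned DHT exposing only put/get (the OpenDHT model).
    Every operation is counted as one network round trip."""
    def __init__(self, n_nodes=4):
        self.nodes = [dict() for _ in range(n_nodes)]
        self.round_trips = 0

    def _node(self, key):
        # hash partitioning: the key's hash picks the owning node
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]

    def put(self, key, value):
        self.round_trips += 1
        self._node(key)[key] = value

    def get(self, key):
        self.round_trips += 1
        return self._node(key).get(key)

class RangeStore:
    """Toy order-preserving store: a contiguous key range lives on one
    node, so a range scan is a single request to that node."""
    def __init__(self):
        self.data = {}
        self.round_trips = 0

    def put(self, key, value):
        self.round_trips += 1
        self.data[key] = value

    def range(self, lo, hi):
        self.round_trips += 1  # one request to the node owning [lo, hi]
        return sorted(k for k in self.data if lo <= k <= hi)

keys = ['user:%03d' % i for i in range(100)]
dht, rs = PutGetDHT(), RangeStore()
for k in keys:
    dht.put(k, 1)
    rs.put(k, 1)

# Query for user:010 .. user:019. Layered on put/get, the client must
# issue one get per candidate key (a real PHT also spends extra round
# trips walking its index); natively, it is a single range request.
dht.round_trips = rs.round_trips = 0
hits = [k for k in ('user:%03d' % i for i in range(10, 20)) if dht.get(k)]
native = rs.range('user:010', 'user:019')
print(dht.round_trips, rs.round_trips)  # 10 vs 1
```

And the toy version is generous to the DHT: it assumes the client already knows which keys to probe.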
Monday, May 18, 2009
Wednesday, May 13, 2009
Cassandra in a nutshell:
- Scales writes very, very well: just add more nodes!
- Has a much richer data model than vanilla key/value stores -- closer to what you'd be used to in a relational db.
- Is pretty bleeding edge -- to my knowledge, Facebook is the only group running Cassandra in production. (Their largest cluster is 120 machines and 40TB of data.) At Rackspace we are working on a Cassandra-based app now that 0.3 has the extra features we need.
- Moved to the Apache Incubator about 40 days ago, at which point development greatly accelerated.
- Range queries on keys, including user-defined key collation.
- Remove support, which is nontrivial in an eventually consistent world.
- Workaround for a weird bug in JDK select/register that seems particularly common on VM environments. Cassandra should deploy fine on EC2 now. (Oddly, it never had problems on Slicehost / Cloud Servers, which is also Xen-based.)
- Much improved infrastructure: the beginnings of a decent test suite ("ant test" for unit tests; "nosetests" for system tests), code coverage reporting, etc.
- Expanded node status reporting via JMX
- Improved error reporting/logging on both server and client
- Reduced memory footprint in default configuration
- and plenty of bug fixes.
- An advanced on-disk storage engine that never does random writes
- Transaction log-based data integrity
- P2P gossip failure detection
- Read repair
- Hinted handoff
- Bootstrap (adding new nodes to a running cluster)
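The "remove support" item in the feature list above is subtler than it sounds: in an eventually consistent system you can't simply delete data locally, because a replica that missed the delete will cheerfully "repair" the value right back. The standard fix is to write a timestamped deletion marker (a tombstone) that wins reconciliation against older writes. Here is a toy sketch of that idea -- class and method names are mine, not Cassandra's:

```python
class Replica:
    """Toy eventually-consistent replica. A delete is recorded as a
    timestamped tombstone (None) rather than removing the entry, so that
    when replicas reconcile, the delete beats the older write it covers.
    Illustrative only; not Cassandra's implementation."""
    def __init__(self):
        self.store = {}  # key -> (timestamp, value or None tombstone)

    def put(self, key, value, ts):
        # last-write-wins: keep whichever version has the newer timestamp
        if key not in self.store or self.store[key][0] < ts:
            self.store[key] = (ts, value)

    def delete(self, key, ts):
        self.put(key, None, ts)  # None is our tombstone marker

    def get(self, key):
        ts, value = self.store.get(key, (0, None))
        return value

    def sync_from(self, other):
        # read repair / anti-entropy: merge, newest timestamp winning,
        # so a tombstone suppresses the value it deleted
        for key, (ts, value) in other.store.items():
            self.put(key, value, ts)

a, b = Replica(), Replica()
a.put('k', 'v', ts=1)
b.put('k', 'v', ts=1)
a.delete('k', ts=2)   # b never sees the delete directly
b.sync_from(a)
print(b.get('k'))     # None: the tombstone wins instead of 'v' resurrecting
```

Without the tombstone, sync_from would have nothing newer than b's (ts=1, 'v') entry and the deleted key would come back from the dead.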
The cassandra development and user community is also growing at an exciting pace. Besides the original two developers from Facebook, we now have five developers regularly contributing improvements and fixes, and many others on a more ad-hoc basis.
How fast is it?
In a nutshell, Cassandra is much faster than relational databases, and much slower than memory-only systems or systems that don't sync each update to disk. Actual benchmarks are in the works. We plan to start performance tuning with the next release, but if you want to benchmark it, here are some suggestions to get numbers closer to what you'll see in the wild (and about 10x more throughput than if you don't do these):
- Do enough runs of your benchmark first that each operation tested by your suite runs 20k times before timing it for real. This will allow the JVM jit to compile down to machine code; otherwise you'll just be getting the interpreted version.
- Change the root logger level in conf/log4j.properties from DEBUG to INFO; we do a LOT of logging for debuggability and for small column values the logging has more overhead than the actual workload. (It would be even faster if we were to remove them entirely but that didn't make this release.)
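The warmup advice generalizes to any JIT-compiled runtime. Cassandra is Java, so this Python harness (the function and its defaults are mine, not part of any Cassandra tooling) only illustrates the warmup-then-measure shape:

```python
import time

def benchmark(op, warmup=20000, measured=10000):
    """Run op() enough times for a JIT to compile the hot path before
    timing it; otherwise you benchmark the interpreted version."""
    for _ in range(warmup):        # warmup runs: results discarded
        op()
    start = time.perf_counter()
    for _ in range(measured):      # timed runs
        op()
    elapsed = time.perf_counter() - start
    return measured / elapsed      # operations per second

rate = benchmark(lambda: sum(range(100)))
print('%.0f ops/sec' % rate)
```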
Friday, May 01, 2009
- We now have an order-preserving partitioner as well as the hash-based one
- Yes, if you tell Cassandra to wait for all replicas to be ack'd before calling a write a success, then you would have traditional consistency (as opposed to "eventual"), but you'd also have no tolerance for hardware failures, which is a main point of this kind of system.
- Zookeeper is not currently used by Cassandra, although we have plans to use it in the future.
- Load balancing is not implemented yet.
- The move to Apache is finished and development is active there now.
The reason that almost all similar systems use consistent hashing (the Dynamo paper has the best description; see sections 4.1-4.3) is that it provides a kind of brain-dead load balancing from the hash algorithm spreading keys across the ring. But the Dynamo authors go into some detail about how this by itself doesn't actually give good results in practice; their solution was to assign multiple tokens to each node in the cluster, and they describe several approaches to that. But Cassandra's original designer considers this a hack and prefers real load balancing.
An order-preserving partitioner, where keys are distributed to nodes in their natural order, has huge advantages over consistent hashing, particularly the ability to do range queries across the keys in the system (which has also been committed to Cassandra now). This is important, because the corollary of "the partitioner uses the key to determine what node the data is on" is "each key should only have an amount of data associated with it (see the data model explanation) that is relatively small compared to a node's capacity." Cassandra column families will often have more columns in them than you'd see in a traditional database, but "millions" is pushing it (depending on column size) and "billions" is a bad idea. So you'll want to model things such that you spread data across multiple keys, and if you then pick an appropriate key naming convention, range queries will let you slice and dice that data as needed.
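Here is a toy sketch of both points -- why order-preserving partitioning makes token placement matter, and how a key naming convention turns "one day's data" into a contiguous, sliceable key range. The class and token values are invented for illustration; this is not Cassandra's actual partitioner code:

```python
from bisect import bisect_right

class OrderPreservingRing:
    """Toy order-preserving partitioner: each node is identified by a
    token, and owns keys from its token up to (but not including) the
    next node's token. Illustrative only, not Cassandra's classes."""
    def __init__(self, tokens):
        self.tokens = sorted(tokens)

    def node_for(self, key):
        i = bisect_right(self.tokens, key) - 1
        return self.tokens[i % len(self.tokens)]  # keys below the first token wrap

# One key per day keeps any single key's data small relative to a node,
# and the naming convention makes a day range a contiguous key range:
keys = ['events-2009-05-%02d' % d for d in range(1, 31)]

# Tokens chosen with no knowledge of the key distribution: every key
# sorts between 'a' and 'n', so one node gets all the load. This is
# exactly why load balancing (or hand-picked tokens) matters here.
naive = OrderPreservingRing(['a', 'n'])
assert len({naive.node_for(k) for k in keys}) == 1

# Tokens picked from the known key distribution spread the same keys:
tuned = OrderPreservingRing(['events-2009-05-01', 'events-2009-05-11',
                             'events-2009-05-21'])
owners = {}
for k in keys:
    owners.setdefault(tuned.node_for(k), []).append(k)
print({t: len(ks) for t, ks in owners.items()})  # 10 keys per node
```

With the hash-based partitioner, by contrast, the keys would spread evenly no matter what -- but a range query over a day's keys would have to touch every node.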
Cassandra is still in the process of implementing load balancing, but in the meantime order-preserving partitioning is still useful without it if you know what your key distribution will look like in advance and can pick your node tokens accordingly. Otherwise, there's always the old-school hash-based partitioner until we get load balancing done (targeted for the release after the one we'll have in the next week or so).
See the introduction and getting started pages of the Cassandra wiki for more on Cassandra, and drop us a line on the mailing list or in IRC if you have questions; we're actively trying to improve our docs.
Thursday, April 30, 2009
David MacIver has an interesting blog entry up about determining logical project structure via commit logs. I was very interested because one of Cassandra's oldest issues is creating categories for our JIRA instance. (I've never been a big fan of JIRA, but you work with the tools you have. Or the ones the ASF inflicts on you, in this case.)
The desire to add extra work to issue reporting for a young project like Cassandra strikes me as slightly misguided in the first place. I have what may be an excessive aversion to overengineering, and I like to see a very clear benefit before adding complexity to anything, even an issue tracker. Still, I was curious to see what David's clustering algorithm made of things. And after pestering him to show me how to run his code I figure I owe it to him to show my results.
In general it did a pretty good job, particularly with the mid-sized groups of files. The large groups are just noise; the small groups, well, it's not exactly a revelation that Filter and FilterTest go together. I'd be tempted to play with it more but with only about two months and 250 commits in the apache repo there's not really all that much data there. (Cassandra's first two years were in an internal Facebook repository.) Working with data that exists as a side effect of natural activity is fascinating.
Saturday, April 11, 2009
box = store.new_sandbox()
print [c.Title for c in box.recall(Comic, lambda c: 'Hob' in c.Title or c.Views > 0)]

This is cool as hell. The Geniusql part starts about 15 minutes in.
Tuesday, April 07, 2009
Saturday, March 28, 2009
I wrote a little background information yesterday about why I think Cassandra in particular is compelling.
Friday, March 27, 2009
So, I'm working on the Cassandra distributed database. I gave a lightning talk on it at PyCon this morning. Cassandra is written in Java and implements a sort of hybrid between Dynamo and Bigtable. (Both papers are worth reading.) It takes its distribution algorithm from Dynamo and its data model from Bigtable -- sort of the best of both worlds. Avinash Lakshman, Cassandra's architect, is one of the authors of the Dynamo paper.
There is a video about Cassandra here. The first 1/4 is about using Cassandra and then the rest is mostly about the internals.
Cassandra is very bleeding edge. Facebook runs several Cassandra clusters in production (the largest is 120 machines and 40TB of data), but there are sharp edges that will cut you. If you want something that Just Works out of the box Cassandra is a poor fit right now and will be for several months.
Cassandra was open-sourced by Facebook last summer. There was some initial buzz, but the Facebook developers had trouble dealing with the community and the project looked moribund -- FB was doing development against an internal repository and throwing code over the wall every few months, which is no way to run an OSS project. Now the code is being developed in the open at Apache and I was voted in as a committer, so things are starting to move again.
There are other distributed databases that are worth considering. Here is why those don't fit my needs (and I am not saying that these are bad choices if your requirements are different, just that they don't work for me):
- Follows the bigtable model, so it's more complicated than it needs to be. (300+kloc vs 50 for Cassandra; many more components). This means it's that much harder for me to troubleshoot. HBase is more bug-free than Cassandra but not so bug-free that troubleshooting would not be required.
- Does not have any non-java clients. I need CPython support.
- Sits on top of HDFS, which is optimized for streaming reads, not random accesses. So HBase is fine for batch processing but not so good for online apps.
- See HBase, except it's written in C++.
- Voldemort is probably the key / value store in the pattern of Dynamo that is farthest along. If you only need key / value I would definitely recommend Voldemort. Next closest is probably Dynomite. Then there are a whole bunch of "me too" key value stores that made fatal architecture decisions or are writing a "memcached with persistence" without really thinking it through.
- prereqs: jdk6 and ant
- Check out the code from http://svn.apache.org/repos/asf/incubator/cassandra/trunk (did I mention this was early-adopter only?)
- run ant [optional: ant test]. If you get an error like "class file has wrong version 50.0, should be 49.0" then ant is using an old jdk version instead of 6.
- For non-java clients, install Thrift. (For java, trunk includes libthrift.jar.) This is a major undertaking in its own right. See this page for a list of dependencies, although most of the rest of that page is now outdated -- for instance, Cassandra no longer depends on the fb303 interface. Python users will have to hand-edit the generated Cassandra.py in three obvious places until this bug is fixed -- just replace the broken argument with None.
- run bin/cassandra [optionally -f for foreground]
- Connect to the server. By default it listens on localhost:9160. Look at config/server-conf.xml for the columnfamily definitions.
- Insert and query some data. Here is an introduction.
- Ask on the mailing list or #cassandra on freenode if you have questions.
Monday, February 16, 2009
I installed KDE 4.2 over the weekend to see if I was missing anything there.
It's like daylight after being in a cave for two months. I didn't realize how hard it has been to use a butt-ugly environment until I wasn't anymore. (Yes, I tried all the gnome themes I could find. Even Nimbus which took a bit of work. What's that recently-famous phrase? "Lipstick on a pig?")
What is better in KDE? In a word, everything. And put me in the camp that really likes having the desktop turned into a usable area for the first time. Like apple's dashboard, except it doesn't suck. I always hated dashboard.
Things that could be improved:
- Never in a thousand years would I have thought to look under "Regional & Language" for the preference to turn caps lock into control. I had to google this.
- I'm still not sure how to set F9 to Present Windows. Or how to bind a keystroke to the K menu as a poor man's quicksilver.
- More generally, a "Welcome to kde. Let me teach you how to be a power user" tutorial would be nice. I have the feeling there is lots of awesome under the hood if I knew where it was. I never got that feeling from gnome. ("Beauty is only skin deep, but ugly goes right to the bone.")
- Firefox UI widgets are imperfectly themed from XUL to GTK to KDE. But it is usable. (And having my second monitor redraw correctly instead of leaving artifacts when windows are moved makes up for that.) Is this KDE's fault? Firefox's? I don't know.
- Konqueror is still using KHTML instead of webkit which means it is mostly unusable in the world of "web 2.0." Yes, you can install webkitkde but that is Very Alpha. ("Open in new window" doesn't work, for instance. "Open in new tab" is gone entirely.)
- I couldn't find an option to just use icons in the task manager widget.