Monday, July 06, 2009

Cassandra 0.3 update

Two months after the first release candidate, Cassandra 0.3 is still not out. But, we're close!

We had two more bug-fix release candidates, and it's virtually certain that 0.3-final will be the same exact code as 0.3-rc3. (If you're using rc1, you do want to upgrade; see CHANGES.txt.) But, we got stuck in the ASF bureaucracy and it's going to take at least one more round-trip before the crack Release Prevention Team grudgingly lets us call it official.

In the meantime, work continues apace on trunk for 0.4.

Tuesday, June 23, 2009

Patch-oriented development made sane with git-svn

One of the drawbacks to working on Cassandra is that unlike every other OSS project I have ever worked on, we are using a patch-oriented development process rather than post-commit review. It's really quite painfully slow. Somehow this became sort of the default for ASF projects, but there is precedent for switching to post-commit review eventually.

In the meantime, there is git-svn.

(The ASF does have a git mirror set up, but I'm going to ignore that because (a) its reliability has been questionable and (b) sticking with git-svn hopefully makes this more useful for non-ASF projects.)

Disclaimer: I am not a git expert, and probably some of this will make you cringe if you are. Still, I hope it will be useful for some others fumbling their way towards enlightenment. As background, I suggest the git crash course for svn users. Just the parts up to the Remote section.

Checkout:

  1. git-svn init https://svn.apache.org/repos/asf/cassandra/trunk cassandra
Once that's done the only git-svn commands you need to know about are dcommit to push the changes in the current git branch back to svn, and rebase, to pull changes from svn and re-apply your uncommitted patches on top of that (basically exactly like svn up).

Creating new code:

  1. git checkout -b [ticket number]
  2. [edit stuff, maybe get add or git rm new or obsolete files]
  3. git commit -a -m 'commit'
  4. repeat 2-3 as necessary
  5. git-jira-attacher [revision] (usually some variant of HEAD^^^)
[after review]
  1. git log (just to make sure I'm about to commit what I think I'm about to commit)
  2. git-svn dcommit
  3. git checkout master
  4. git-svn rebase -l (this will put the changes you just committed into master)
  5. git branch -d [ticket number]
When I'm reviewing code it looks similar:
  1. git checkout -b [ticket number]
  2. wget patches and git-apply, or jira-apply CASSANDRA-[ticket-number]
  3. review in gitk/qgit and/or IDE (the intellij git plugin is quite decent)
  4. commit .. branch -d as above
The last operation is "see who I need to bug to get reviews moving." This is just a list of the branches I haven't merged into master and deleted yet:
  1. git branch
Git-svn takes a lot of the pain out of the ASF's patch-and-jira workflow. In particular, you can easily break changes for a ticket up into multiple patches that are easily reviewed, and the latency of waiting for patch review doesn't kill your throughput so badly since you can just leave that branch alone and start a new one for your next piece of functionality. And of course you get git commit --amend and git rebase -i for massaging patches during the review process.

One fairly common complication is if you finish a ticket A, then start on ticket B (that depends on A) while waiting for A to be reviewed. So you checkout -b from your branch A rather than master and build some patches on that. As sometimes happens, the reviewer finds something you need to improve in your patch set for A, so you make those changes. Now you need to rebase your patches to B on top of the changes you made to A. The best way to do this is to branch A to B-2, then git cherry-pick from B and resolve conflicts as necessary.

Final note: I often like to create lots of small commits as I am exploring a solution and combine them into larger units with git rebase -i for patch submission. (It's easier to combine small patches, than pull apart large ones.) So my early commit messages are often terse and need editing. You can change commit messages with edit mode in rebase, then using commit --amend and rebase --continue, but that is tedious. I complained about this to my friend Zach Wily and he made this git amend-message command (place in [alias] in your .gitconfig):

   amend-message = "!bash -c ' \
       c=$0; \
       if [ $c == \"bash\" ]; then echo \"Usage: git amend-message <commit>\"; exit 1; fi; \
       saved_head=$(git rev-parse HEAD); \
       commit=$(git rev-parse $c); \
       commits=$(git log --reverse --pretty=format:%H $commit..HEAD); \
       echo \"Rewinding to $commit...\"; \
       git reset --hard $commit; \
       git commit --amend; \
       for X in $commits; do \
           echo \"Applying $X...\"; \
           git cherry-pick $X >> /dev/null; \
           if [ $? -ne 0 ]; then \
               echo \"  apply failed (is this a merge?), rolling back all changes\"; \
               git reset --hard $saved_head; \
               echo \" ** AMEND-MESSAGE FAILED, sorry\"; \
               exit 1; \
           fi; \
       done; \
       echo \"Done\"'"
(Zach would like the record to show that he knows this is pretty hacky. "For instance, it won't work if one of the commits after the one you're changing is a merge, since cherry-pick can't handle those." But it's quite useful, all the same.)

For what it's worth, the rest of my aliases are

 st = status
 ci = commit
 co = checkout
 br = branch
 cp = cherry-pick

Wednesday, May 27, 2009

Why you won't be building your killer app on a distributed hash table

I ran across A case study in building layered DHT applications while doing some research on implementing load-balancing in Cassandra. The question addressed is, "Are DHTs a general-purpose tool that you can build more sophisticated services on?"

Short version: no. A few specialized applications can and have been built on a plain DHT, but most applications built on DHTs have ended up having to customize the DHT's internals to achieve their functional or performance goals.

This paper describes the results of attempting to build a relatively complex datastructure (prefix hash trees, for range queries) on top of OpenDHT. The result was mostly failure:

A simple put-get interface was not quite enough. In particular, OpenDHT relies on timeouts to invalidate entries and has no support for atomicity primitives... In return for ease of implementation and deployment, we sacrificed performance. With the OpenDHT implementation, a PHT query operation took a median of 2–4 seconds. This is due to the fact that layering entirely on top of a DHT service inherently implies that applications must perform a sequence of put-get operations to implement higher level semantics with limited opportunity for optimization within the DHT.
In other words, there are two primary problems with the DHT approach:
  • Most DHTs will require a second locking layer to achieve correctness when implementing a more complex data structure on top of the DHT semantics. In particular, this will certainly apply to eventually-consistent systems in the Dynamo mold.
  • Advanced functionality like range queries needs to be supported natively to be at all efficient.
While they spin this in a positive manner -- "hey, at least it didn't take much code" -- the reality is that for most of us, query latency of two to four seconds is several orders of magnitude away from acceptable.

This is one reason why I think Cassandra is the most promising of the open-source distributed databases -- you get a relatively rich data model and a distribution model that supports efficient range queries. These are not things that can be grafted on top of a simpler DHT foundation, so Cassandra will be useful for a wider variety of applications.

Monday, May 18, 2009

Belated 2009 Introduction to SQLAlchemy slides

I was asked to put my slides up again -- sorry it took so long. The slides and code samples are now up here. Video of the tutorial is also up. (3 parts, first is linked). There's definitely audio problems in parts but at least some is watchable.

Wednesday, May 13, 2009

Cassandra 0.3 release candidate and progress

We have a release candidate out for Cassandra 0.3. Grab the download and check out how to get started. The facebook presentation from almost a year ago now is also still a good intro to some of the features and data model.

Cassandra in a nutshell:

  • Scales writes very, very well: just add more nodes!
  • Has a much richer data model than vanilla key/value stores -- closer to what you'd be used to in a relational db.
  • Is pretty bleeding edge -- to my knowledge, Facebook is the only group running Cassandra in production. (Their largest cluster is 120 machines and 40TB of data.) At Rackspace we are working on a Cassandra-based app now that 0.3 has the extra features we need.
  • Moved to the Apache Incubator about 40 days ago, at which point development greatly accelerated.
Changes in 0.3 include
  • Range queries on keys, including user-defined key collation.
  • Remove support, which is nontrivial in an eventually consistent world.
  • Workaround for a weird bug in JDK select/register that seems particularly common on VM environments. Cassandra should deploy fine on EC2 now. (Oddly, it never had problems on Slicehost / Cloud Servers, which is also Xen-based.)
  • Much improved infrastructure: the beginnings of a decent test suite ("ant test" for unit tests; "nosetests" for system tests), code coverage reporting, etc.
  • Expanded node status reporting via JMX
  • Improved error reporting/logging on both server and client
  • Reduced memory footprint in default configuration
  • and plenty of bug fixes.
For those of you just joining us, Cassandra already had
  • An advanced on-disk storage engine that never does random writes
  • Transaction log-based data integrity
  • P2P gossip failure detection
  • Read repair
  • Hinted handoff
  • Bootstrap (adding new nodes to a running cluster)
(Read repair and hinted handoff are discussed in more detail in the Dynamo paper.)

The cassandra development and user community is also growing at an exciting pace. Besides the original two developers from Facebook, we now have five developers regularly contributing improvements and fixes, and many others on a more ad-hoc basis.

How fast is it?

In a nutshell, Cassandra is much faster than relational databases, and much slower than memory-only systems or systems that don't sync each update to disk. Actual benchmarks are in the works. We plan to start performance tuning with the next release, but if you want to benchmark it, here are some suggestions to get numbers closer to what you'll see in the wild (and about 10x more throughput than if you don't do these):

  • Do enough runs of your benchmark first that each operation tested by your suite runs 20k times before timing it for real. This will allow the JVM jit to compile down to machine code; otherwise you'll just be getting the interpreted version.
  • Change the root logger level in conf/log4j.properties from DEBUG to INFO; we do a LOT of logging for debuggability and for small column values the logging has more overhead than the actual workload. (It would be even faster if we were to remove them entirely but that didn't make this release.)

Friday, May 01, 2009

A better analysis of Cassandra than most

Vladimir Sedach wrote a three-part dive into Cassandra. (Almost two months ago now. Guess I need to set up Google Alerts. Trouble is there's a surprising amount of noise around the word `cassandra.`) A few notes:
  • We now have an order-preserving partitioner as well as the hash-based one
  • Yes, if you tell Cassandra to wait for all replicas to be ack'd before calling a write a success, then you would have traditional consistency (as opposed to "eventual") but you'd also have no tolerance for hardware failures which is a main point of this kind of system.
  • Zookeeper is not currently used by Cassandra, although we have plans to use it in the future.
  • Load balancing is not implemented yet.
  • The move to Apache is finished and development is active there now.

Consistent hashing vs order-preserving partitioning in distributed databases

The Cassandra distributed database supports two partitioning schemes now: the traditional consistent hashing scheme, and an order-preserving partitioner.

The reason that almost all similar systems use consistent hashing (the dynamo paper has the best description; see sections 4.1-4.3) is that it provides a kind of brain-dead load balancing from the hash algorithm spreading keys across the ring. But the dynamo authors go into some detail about how this by itself doesn't actually give good results in practice; their solution was to assign multiple tokens to each node in the cluster and they describe several approaches to that. But Cassandra's original designer considers this a hack and prefers real load balancing.

An order-preserving partitioner, where keys are distributed to nodes in their natural order, has huge advantages over consistent hashing, particularly the ability to do range queries across the keys in the system (which has also been committed to Cassandra now). This is important, because the corollary of "the partitioner uses the key to determine what node the data is on" is, "each key should only have an amount of data associated with it (see the data model explanation) that is relatively small compared to a node's capacity." Cassandra column families will often have more columns in them than you'd see in a traditional database, but "millions" is pushing it (depending on column size) and "billions" is a bad idea. So you'll want to model things such that you spread data across multiple keys and if you then pick an appropriate key naming convention, range queries will let you slice and dice that data as needed.

Cassandra is in the process of implementing load balancing still, but in the meantime order-preserving partitioning is still be useful without that if you know what your key distribution will look like in advance and can pick your node tokens accordingly. Otherwise, there's always the old-school hash-based partitioner until we get that done (for the release after the one we'll have in the next week or so).

See the introduction and getting started pages of the Cassandra wiki for more on Cassandra, and drop us a line on the mailing list or in IRC if you have questions; we're actively trying to improve our docs.

Thursday, April 30, 2009

Automatic project structure inference

David MacIver has an interesting blog entry up about determining logical project structure via commit logs. I was very interested because one of Cassandra's oldest issues is creating categories for our JIRA instance. (I've never been a big fan of JIRA, but you work with the tools you have. Or the ones the ASF inflicts on you, in this case.)

The desire to add extra work to issue reporting for a young project like Cassandra strikes me as slightly misguided in the first place. I have what may be an excessive aversion to overengineering, and I like to see a very clear benefit before adding complexity to anything, even an issue tracker. Still, I was curious to see what David's clustering algorithm made of things. And after pestering him to show me how to run his code I figure I owe it to him to show my results.

In general it did a pretty good job, particularly with the mid-sized groups of files. The large groups are just noise; the small groups, well, it's not exactly a revelation that Filter and FilterTest go together. I'd be tempted to play with it more but with only about two months and 250 commits in the apache repo there's not really all that much data there. (Cassandra's first two years were in an internal Facebook repository.) Working with data that exists as a side effect of natural activity is fascinating.