Monday, January 25, 2010

Cassandra 0.5.0 released

Apache Cassandra 0.5.0 was released over the weekend, four months after 0.4. (Upgrade notes; full changelog.) We're excited about releasing 0.5 because it makes life even better for people using Cassandra as their primary data source -- as opposed to a replica, possibly denormalized, of data that exists somewhere else.

The Cassandra distributed database has always had a commitlog to provide durable writes, and in 0.4 we added an option to wait for commitlog sync before acknowledging writes, for cases where even a few seconds of potential data loss was unacceptable. But what if a node goes down temporarily? 0.5 adds proactive repair (what Dynamo calls "anti-entropy") to synchronize any updates that Hinted Handoff or read repair didn't catch across all replicas for a given piece of data.
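Proactive repair works by having replicas exchange hash (Merkle) trees built over their key ranges, so only ranges whose hashes disagree need to be streamed between nodes. Here's a toy Python sketch of that comparison -- hypothetical names, a flat one-level "tree," and nothing like Cassandra's actual implementation, but it shows the idea:

```python
import hashlib

def leaf_hashes(data, buckets):
    """Hash each bucket of keys; one leaf hash per key range."""
    leaves = []
    for bucket in buckets:
        h = hashlib.sha256()
        for key in sorted(bucket):
            h.update(key.encode() + b"\x00" + data.get(key, "").encode())
        leaves.append(h.hexdigest())
    return leaves

def diff_ranges(leaves_a, leaves_b):
    """Indices of key ranges whose hashes disagree -- only these
    ranges would need to be streamed between replicas."""
    return [i for i, (a, b) in enumerate(zip(leaves_a, leaves_b)) if a != b]

# Two replicas, one of which missed an update to 'k3'.
replica_a = {"k1": "v1", "k2": "v2", "k3": "v3-new"}
replica_b = {"k1": "v1", "k2": "v2", "k3": "v3-old"}
buckets = [["k1", "k2"], ["k3"]]  # the key ranges (tree leaves)

print(diff_ranges(leaf_hashes(replica_a, buckets),
                  leaf_hashes(replica_b, buckets)))  # -> [1]
```

Only the second range differs, so only its data would be transferred -- the win over naively comparing every row.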

0.5 also adds load balancing and significantly improves bootstrap (adding nodes to a running cluster). We've also been busy adding documentation on operations in production and system internals.

Finally, in 0.5 we've improved concurrency across the board, improving insert speed by over 50% on the stress.py benchmark (from contrib/) on a relatively modest 4-core system with 2GB of RAM. We've also added a [row] key cache, enabling similar relative improvements in reads:

(You will note that unlike most systems, Cassandra reads are usually slower than writes. 0.6 will narrow this gap with full row caching and mmap'd I/O, but fundamentally we think optimizing for writes is the right thing to do since writes have always been harder to scale.)

Log replay, flush, compaction, and range queries are also faster.

0.5 also brings new tools, including JSON-based data export and import, an improved command-line interface, and new JMX metrics.

One final note: like all distributed systems, Cassandra is designed to maximize throughput when under load from many clients. Benchmarking with a single thread or a small handful will not give you numbers representative of production (unless you only ever have four or five users at a time in production, I suppose). Please don't ask "why is Cassandra so slow" and offer up a single-threaded benchmark as evidence; that makes me sad inside. Here's 1000 words:

(Thanks to Brandon Williams for the graphs.)
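To make the concurrency point concrete, here's a toy load generator -- a sketch, not py_stress, with a sleep standing in for a real network-bound Cassandra write. Because each "insert" spends its time blocked on I/O, throughput climbs as you add client threads, until the server (not the client) is the bottleneck:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_insert(i):
    """Stand-in for a Cassandra write; a real client would block on
    network I/O here, which is exactly why extra threads help."""
    time.sleep(0.001)
    return i

def run_benchmark(num_ops, num_threads):
    """Issue num_ops inserts across num_threads workers; return
    (ops completed, ops/sec)."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        done = sum(1 for _ in pool.map(fake_insert, range(num_ops)))
    elapsed = time.time() - start
    return done, done / elapsed

for threads in (1, 10, 50):
    ops, throughput = run_benchmark(num_ops=200, num_threads=threads)
    print(f"{threads:>2} threads: {throughput:.0f} ops/sec")
```

A single-threaded run measures round-trip latency, not throughput -- which is why the one-client numbers look nothing like production.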

13 comments:

Flavio said...

Hey Jonathan.
I would be really interested to know the size of the cluster you used for the benchmarks; it would give a better understanding of the throughput/sec.

And how many threads were used in the first graph? Was the experiment really executed on only a single node?

Jonathan Ellis said...

The first graph is a single node, benchmarked with 50 threads.

The second graph is a 4-node cluster running with a replication factor of 3 (that is, each write is replicated to 3 nodes), so you'd expect throughput to be about 30% more than the single-node case, minus some overhead for dealing with the network, and that's what happens.

Flavio said...

Okay great! Thanks. So you are using replication factor of 3 and read/write quorum of 2?

Jonathan Ellis said...

Both graphs are with ConsistencyLevel.ONE, but that's not going to affect throughput much (just latency), since ConsistencyLevel doesn't reduce the number of writes you do (in a non-failure scenario), just how many you wait for before returning success to the client.

Anonymous said...

I would be interested to see your thread writing code. I'd like to emulate multiple threads writing to cassandra.

Jonathan Ellis said...

It's in contrib/py_stress (it actually uses multiple processes if multiprocessing is available, since the GIL sucks, but it will fall back to threads if it has to).

dehora said...

Hey Jonathan, what was your physical disk setup (mounts, drives) for this bench?

Jonathan Ellis said...

One commitlog disk, one data disk, both xfs.

Michael said...

I apologize if this is trivial, but I want to be sure I'm interpreting the output from stress.py correctly. Default arguments to the script are OK (as long as I'm running Python 2.6 to get multiprocessing). Run the script first with "-o insert" and then "-o read". The total throughput is the bottom entry in the leftmost column divided by the bottom entry in the rightmost column. Is that correct? I happen to be using the 0.6.0-beta2 release.

Jonathan Ellis said...

Yes, although the time in the lower right is not a high-resolution clock :)

Michael said...

Right. I should be using "time python stress.py". Duh, of course. Sorry for being so dense.

Yousef Ourabi said...

The links to changelog / notes are 404s

Jonathan Ellis said...

Yousef: thanks for the heads up, I've updated the post to the correct ones. (they changed when we graduated from the incubator.)