Skip to main content

Cassandra 0.3 release candidate and progress

We have a release candidate out for Cassandra 0.3. Grab the download and check out how to get started. The facebook presentation from almost a year ago now is also still a good intro to some of the features and data model.

Cassandra in a nutshell:

  • Scales writes very, very well: just add more nodes!
  • Has a much richer data model than vanilla key/value stores -- closer to what you'd be used to in a relational db.
  • Is pretty bleeding edge -- to my knowledge, Facebook is the only group running Cassandra in production. (Their largest cluster is 120 machines and 40TB of data.) At Rackspace we are working on a Cassandra-based app now that 0.3 has the extra features we need.
  • Moved to the Apache Incubator about 40 days ago, at which point development greatly accelerated.
Changes in 0.3 include
  • Range queries on keys, including user-defined key collation.
  • Remove support, which is nontrivial in an eventually consistent world.
  • Workaround for a weird bug in JDK select/register that seems particularly common on VM environments. Cassandra should deploy fine on EC2 now. (Oddly, it never had problems on Slicehost / Cloud Servers, which is also Xen-based.)
  • Much improved infrastructure: the beginnings of a decent test suite ("ant test" for unit tests; "nosetests" for system tests), code coverage reporting, etc.
  • Expanded node status reporting via JMX
  • Improved error reporting/logging on both server and client
  • Reduced memory footprint in default configuration
  • and plenty of bug fixes.
For those of you just joining us, Cassandra already had
  • An advanced on-disk storage engine that never does random writes
  • Transaction log-based data integrity
  • P2P gossip failure detection
  • Read repair
  • Hinted handoff
  • Bootstrap (adding new nodes to a running cluster)
(Read repair and hinted handoff are discussed in more detail in the Dynamo paper.)

The cassandra development and user community is also growing at an exciting pace. Besides the original two developers from Facebook, we now have five developers regularly contributing improvements and fixes, and many others on a more ad-hoc basis.

How fast is it?

In a nutshell, Cassandra is much faster than relational databases, and much slower than memory-only systems or systems that don't sync each update to disk. Actual benchmarks are in the works. We plan to start performance tuning with the next release, but if you want to benchmark it, here are some suggestions to get numbers closer to what you'll see in the wild (and about 10x more throughput than if you don't do these):

  • Do enough runs of your benchmark first that each operation tested by your suite runs 20k times before timing it for real. This will allow the JVM jit to compile down to machine code; otherwise you'll just be getting the interpreted version.
  • Change the root logger level in conf/log4j.properties from DEBUG to INFO; we do a LOT of logging for debuggability and for small column values the logging has more overhead than the actual workload. (It would be even faster if we were to remove them entirely but that didn't make this release.)

Comments

Alex Popescu said…
Hi Jonathan,

I have just checked the Cassandra page and I still couldn't find a released version. While checking the SVN repository, I've noticed 3 tags (3 RCs for 0.3) and also a 0.3 branch.
Should I understand that there isn't yet a 0.3 final? Are there any plans to pack a distribution?

tia,
./alex
Jonathan Ellis said…
the saga of the 0.3 release is in the cassandra-dev archives.

the short version is, releasing with the ASF is like pulling teeth.

one at a time.

with no anaesthetic.
Alex Popescu said…
Jonathan, I've been active on ASF for quite a while (not anymore lately) and while I cannot say it was easy, I don't think I've heard anyone get killed by the process. Just check how may releases Struts2 or Jackrabbit had.
So, is there anything else? Or is it just the lack of somebody stepping up to run the process?
Jonathan Ellis said…
No major blockers, no. (Which is part of why it's frustrating. :)
Jonathan Ellis said…
Added some links to what's been happening over here: http://spyced.blogspot.com/2009/07/cassandra-03-update.html

Popular posts from this blog

The Missing Piece in AI Coding: Automated Context Discovery

I recently switched tasks from writing the ColBERT Live! library and related benchmarking tools to authoring BM25 search for Cassandra . I was able to implement the former almost entirely with "coding in English" via Aider . That is: I gave the LLM tasks, in English, and it generated diffs for me that Aider applied to my source files. This made me easily 5x more productive vs writing code by hand, even with AI autocomplete like Copilot. It felt amazing! (Take a minute to check out this short thread on a real-life session with Aider , if you've never tried it.) Coming back to Cassandra, by contrast, felt like swimming through molasses. Doing everything by hand is tedious when you know that an LLM could do it faster if you could just structure the problem correctly for it. It felt like writing assembly without a compiler -- a useful skill in narrow situations, but mostly not a good use of human intelligence today. The key difference in these two sce...

Why PHP sucks

(July 8 2005) Apparently I got linked by some PHP sites, and while there were a few well-reasoned comments here I mostly just got people who only knew PHP reacting like I told them their firstborn was ugly. These people tended to give variants on one or more themes: All environments have warts, so PHP is no worse than anything else in this respect I can work around PHP's problems, ergo they are not really problems You aren't experienced enough in PHP to judge it yet As to the first, it is true that PHP is not alone in having warts. However, the lack of qualitative difference does not mean that the quantitative difference is insignificant. Similarly, problems can be worked around, but languages/environments designed by people with more foresight and, to put it bluntly, clue, simply don't make the kind of really boneheaded architecture mistakes that you can't help but run into on a daily baisis in PHP. Finally, as I noted in my original introduction, with PHP, ...

A week of Windows Subsystem for Linux

I first experimented with WSL2 as a daily development environment two years ago. Things were still pretty rough around the edges, especially with JetBrains' IDEs, and I ended up buying a dedicated Linux workstation so I wouldn't have to deal with the pain.  Unfortunately, the Linux box developed a heat management problem, and simultaneously I found myself needing a beefier GPU than it had for working on multi-vector encoding , so I decided to give WSL2 another try. Here's some of the highlights and lowlights. TLDR, it's working well enough that I'm probably going to continue using it as my primary development machine going forward. The Good NVIDIA CUDA drivers just work. I was blown away that I ran conda install cuda -c nvidia and it worked the first try. No farting around with Linux kernel header versions or arcane errors from nvidia-smi. It just worked, including with PyTorch. JetBrains products work a lot better now in remote development mod...