Apache Cassandra: 2010 in review

In 2010, Apache Cassandra increased its momentum as the leading scalable database. Here is a summary of the notable activity in three areas: code, community and controversy. As always, comments are welcome.

Code

2010 started with the release of Cassandra 0.5, followed by 0.6 and graduation from the ASF incubator a few months later. Seven more stable releases of 0.6 followed, adding many features to improve operations in response to feedback from production users.

0.7 adds highly anticipated features like column value indexes, live schema updates, more efficient cluster expansion, and more control over replication, but it didn't quite make it into 2010: rc4 was released on New Year's 2011.

We also committed the distributed counters patchset, begun at Digg and enhanced by Twitter for their real-time analytics product. Notable as the most-involved feature discussion to date, distributed counters started with a vector clock approach, but switched to a new design by Kelvin Kakugawa after we realized vector clocks were a dead end for anything but the trivial case of monotonic-increments-by-one.

One of the biggest trends was increasing activity around Cassandra as well as in the core database itself. 2010 saw Hadoop map/reduce integration, as well as Pig support and a patch for Hive.

We also saw Lucandra, which implements a Cassandra back end for Lucene and is used in several high-volume production sites, grow up into Solandra, embedding Solr and Cassandra in the same JVM for even more performance.

Community

Cassandra hit its stride in 2010, starting with graduation from the ASF incubator in April. 2010 saw 1025 tickets resolved, nearly twice as many as in 2009 (565).

Like many Apache projects, Cassandra has a relatively small set of committers, but a much larger group of contributors. In 2010 Cassandra passed 100 people who have contributed at least one patch. Release manager Eric Evans put together a great way to visualize this with a Code Swarm video of Cassandra development.

I started Riptano with Matt Pfeil in April to provide professional products and services around Cassandra. In October, we announced funding from Lightspeed and Sequoia. In the eight months from May to December, we conducted eleven Cassandra training events, and twice that many private classes on-site with customers.

Riptano is now up to 25 employees, with offices in the San Francisco bay area, Austin, and New York, and engineers working remotely in San Antonio, France, and Belarus.

In August, Riptano and Rackspace organized a very successful inaugural Cassandra Summit, with about 200 attendees (videos available), followed by almost a full track at ApacheCon in November. Cassandra was also represented at many other conferences, covering multiple subjects, in several languages, and on several continents.

Controversy

Cassandra got a lot of negative publicity when Kevin Rose blamed Cassandra for Digg v4's teething problems. However, there was no deluge of bug reports coming out of Digg's Cassandra team, and Digg engineers Arin Sarkissian and Chris Goffinet (now working on Cassandra for Twitter) got on Quora to refute the idea that Cassandra was at fault:

The whole "Cassandra to blame" thing is 100% a result of folks clinging on to the NoSQL vs SQL thing. It's a red herring.

The new version of Digg has a whole new architecture with a bunch of technologies involved. Problem is, over the last few months or so the only technological change we mentioned (blogged about etc) was Cassandra. That made it pretty easy for folks to cling on to it as the "problem".

Meanwhile, Digg competitor Reddit has continued migrating to Cassandra, crediting it with enabling their 3x traffic growth in 2010.

More importantly, 2010 saw dozens of new Cassandra deployments, including a new contender for the largest-cluster crown when Digital Reasoning announced a 400-node cluster for the US government.

We look forward to another great year in 2011!

Comments

Jonathan, It doesn't seem that Arin Sarkissian and Chris Goffinet *refuted* the idea that Cassandra was at fault at Digg. They (esp. Chris) seem to suggest that Cassandra might not be the *only* system at fault, and that they should have load tested better.
Chris: "it wasn't the sole fault of Cassandra"
Jonathan Ellis said…
Having spent a lot of time talking with Chris, Arin, Rob, and other Digg engineers, I can confirm that you're reading too much into that extra word. :)
Anonymous said…
RPM build for the apache cassandra project hosted at google code
http://code.google.com/p/cassandra-rpm/
