Tuesday, January 04, 2011

Apache Cassandra: 2010 in review

In 2010, Apache Cassandra increased its momentum as the leading scalable database. Here is a summary of the notable activity in three areas: code, community and controversy. As always, comments are welcome.

Code

2010 started with the release of Cassandra 0.5, followed by 0.6 and graduation from the ASF incubator a few months later. Seven more stable releases of 0.6 followed, adding many features to improve operations in response to feedback from production users.

0.7 adds highly anticipated features like column value indexes, live schema updates, more efficient cluster expansion, and more control over replication. It didn't quite make it into 2010, though, with rc4 released on New Year's 2011.

We also committed the distributed counters patchset, begun at Digg and enhanced by Twitter for their real-time analytics product. This was the most involved feature discussion to date: distributed counters started with a vector clock approach, but switched to a new design by Kelvin Kakugawa after we realized that vector clocks were a dead end for anything but the trivial case of monotonic increments by one.

One of the biggest trends was increasing activity around Cassandra as well as in the core database itself. 2010 saw Hadoop map/reduce integration, as well as Pig support and a patch for Hive.

We also saw Lucandra, which implements a Cassandra back end for Lucene and is used in several high volume production sites, grow up into Solandra, embedding Solr and Cassandra in the same JVM for even more performance.

Community

Cassandra hit its stride in 2010, starting with graduation from the ASF incubator in April. 2010 saw 1025 tickets resolved, nearly twice as many as in 2009 (565).

Like many Apache projects, Cassandra has a relatively small set of committers, but a much larger group of contributors. In 2010, Cassandra passed 100 people who have contributed at least one patch. Release manager Eric Evans put together a great way to visualize this with a Code Swarm video of Cassandra development.

I started Riptano with Matt Pfeil in April to provide professional products and services around Cassandra. In October, we announced funding from Lightspeed and Sequoia. In the eight months from May to December, we conducted eleven Cassandra training events, and twice that many private classes on-site with customers.

Riptano is now up to 25 employees, with offices in the San Francisco Bay Area, Austin, and New York, and engineers working remotely in San Antonio, France, and Belarus.

In August, Riptano and Rackspace organized a very successful inaugural Cassandra Summit, with about 200 attendees (videos available), followed by almost a full track at ApacheCon in November. Cassandra was also represented at many other conferences, on multiple subjects, in several languages, and on several continents.

Controversy

Cassandra got a lot of negative publicity when Kevin Rose blamed Cassandra for Digg v4's teething problems. However, there was no deluge of bug reports coming out of Digg's Cassandra team, and Digg engineers Arin Sarkissian and Chris Goffinet (now working on Cassandra for Twitter) got on Quora to refute the idea that Cassandra was at fault:

The whole "Cassandra to blame" thing is 100% a result of folks clinging on to the NoSQL vs SQL thing. It's a red herring.

The new version of Digg has a whole new architecture with a bunch of technologies involved. Problem is, over the last few months or so the only technological change we mentioned (blogged about etc) was Cassandra. That made it pretty easy for folks to cling on to it as the "problem".

Meanwhile, Digg competitor Reddit has continued migrating to Cassandra, crediting it with enabling their 3x traffic growth in 2010.

More importantly, 2010 saw dozens of new Cassandra deployments, including a new contender for the largest-cluster crown when Digital Reasoning announced a 400-node cluster for the US government.

We look forward to another great year in 2011!