Wednesday, March 24, 2010

Cassandra in action

There's been a lot of new articles about Cassandra deployments in the past month, enough that I thought it would be useful to summarize in a post.

Ryan King explained in an interview with Alex Popescu why Twitter is moving to Cassandra for tweet storage, and why they selected Cassandra over the alternatives. My experience is that the more someone understands large systems and the problems you can run into with them from an operational standpoint, the more likely they are to choose Cassandra when doing this kind of evaluation. Ryan's list of criteria is worth checking out.

Digg followed up their earlier announcement that they had taken part of their site live on Cassandra with another saying that they've now "reimplemented most of Digg's functionality using Cassandra as our primary datastore." Digg engineer Ian Eure also gave some more details on Digg's cassandra data model in a Hacker News thread.

Om Malik quoted extensively from the Digg announcement and from Rackspace engineer Stu Hood, who explained Cassandra's appeal: "Over the Bigtable clones, Cassandra has huge high-availability advantages, and no single point of failure. When compared to the Dynamo adherents, Cassandra has the advantage of a more advanced datamodel, allowing for a single row to contain billions of column/value pairs: enough to fill a machine. You also get efficient range queries for the top level key, and even within your values."

The Twitter and Digg news kicked off a lot of publicity, including a lot of "me too" articles but some interesting ones, including a highscalability post wondering if this was the end of the mysql + memcached era. If not quite yet the end, then the beginning of it. As Ian Eure from Digg said, "If you're deploying memcache on top of your database, you're inventing your own ad-hoc, difficult to maintain NoSQL system." Possibly the best commentary on this idea is Dare Obasanjo's, who explained "Digg's usage of Cassandra actually serves as a rebuttal to [an article claiming SQL scales just fine] since they couldn't feasibly get what they want with either horizontal or vertical scaling of their relational database-based solution."

Reddit also migrated to Cassandra from memcachedb, in only 10 days, the fastest migration to Cassandra I've seen. More comments from the engineer doing the migration, ketralnis, in the reddit discussion thread.

CloudKick blogged about how they use Cassandra for time series data, including a sketch of their data model. CloudKick migrated from PostgreSQL, skewering the theory you will sometimes see proffered that "only MySQL users are migrating to NoSQL, not people who use [my favorite vendor's relational database]."

Jake Luciani wrote about how Lucandra, the Cassandra Lucene back-end works, and how he's using it to power the Twitter search app sparse.ly. IMO, Lucandra is one of Cassandra's killer apps.

The FightMyMonster team switched from HBase to Cassandra after concluding that "HBase is more suitable for data warehousing, and large scale data processing and analysis... and Cassandra is more suitable for real time transaction processing and the serving of interactive data." Dominic covers CAP, architecture considerations, benchmarks, map/reduce, and durability in explaining his conclusion.

Eric Peters gave a talk on Cassandra use at his company, Frugal Mechanic, at the Seattle Tech Startups Meetup. This was interesting not because Frugal Mechanic is a big name but because it's not. I haven't seen Eric's name on the Cassandra mailing lists at all, but there he was deploying it and giving a talk on it, showing that Cassandra is starting to move beyond early adopters. (And, just maybe, that our documentation is improving. :)

Finally, Eric Florenzano has a live demo up now of Cassandra running a Twitter clone at twissandra.com, with source at github, as an example of how to use Cassandra's data model. If you're interested in the nuts and bolts of how to build an app on Cassandra, you should check it out.

10 comments:

Alex Popescu said...

Pretty soon there will be some more exciting Cassandra coverage on myNoSQL. Thanks for the round-up.

Otis Gospodnetic said...

Note that Jake Luciani will be giving a talk on Lucandra next month at the Search & Discovery Meetup in NY:

http://www.meetup.com/NYC-Search-and-Discovery/

The Lucandra meetup is about to be scheduled there.

Jonathan Ellis said...

Glad to hear that Alex, I look forward to reading it.

Thanks for the link, Otis!

Birdman said...

Great summary of the recent news, these are all good resources.

Anonymous said...

There's = There is. Effectively, "There is been". D'oh!

Jonathan Ellis said...

"There's" is also the contracted form of "there has." Kids these days... :P

Jawaad Mahmood said...

For what it is worth, we are doing a talk about Cassandra at the Tokyo hackerspace next week. It definitely isn't "just Another nosql" anymore.

Jonathan Ellis said...

@Jawaad: Great! Hope you can post your slides afterwards!

mabstyle said...

[an article claiming SQL scales just fine]

I like how you censured the link. For those who want to read some good original articles on the subject, instead of mis-characterizations of same:

one

two

three

Chmouel said...

By the way the reedit move was from a key value model (memcache) to another key/value model (cassandra)... not the full thing like digg...