Monday, April 26, 2010

And now for something completely different

A month ago I left Rackspace to start Riptano, a Cassandra support and services company.

I was in the unusual position of being a technical person looking for a business-savvy co-founder. For whatever reason, the converse seems a lot more common. Maybe technical people tend to stereotype softer skills as being easy.

But despite some examples to the contrary (notably for me, Josh Coates at Mozy), I found that starting a company is too hard for just one person. Unfortunately, everyone in my fairly slim portfolio of business guys I'd like to co-found with was unavailable. So progress was slow, until Matt Pfeil heard that I was leaving Rackspace and drove to San Antonio from Austin to talk me out of it. Not only did he fail to talk me out of leaving, but he ended up co-founding Riptano. And here we are, with a Riptano mini-FAQ.

Isn't Cassandra mostly just a Web 2.0 thing for ex-MySQL shops?

Although most of the early adopters fit this stereotype, we're seeing interest from a lot of Oracle users and a lot of industries. Unlike many "NoSQL" databases, Cassandra doesn't drop durability (the D in ACID), and besides scalability, enterprises are very interested in our support for multiple data centers and Hadoop analytics.

Are you going to fork Cassandra?

No. Although the Apache License allows doing basically anything with the code, including creating proprietary forks, we think the track record of this strategy in the open source database world is mixed at best.

We might create a (still open-source) Cassandra distribution similar to Cloudera's Distribution for Hadoop, but the mainline Cassandra development is responsive enough that there isn't as much need for a third party to do this as there is with Hadoop.

What does Rackspace think?

Rackspace has been the primary driver of Cassandra development recently, employing (until I left) the three most active committers on the project. For the same reasons Rackspace supported Cassandra to begin with, Rackspace is excited to see Riptano help take the Cassandra ecosystem to the next level. Rackspace has invested in Riptano and has been completely supportive in every way.

Where did you get the name "Riptano"? Does it mean anything?

We took a sophisticated, augmented AI approach. By which I mean, we took a program that generated random, pronounceable strings, and put together a couple of fragments that sounded good together. (This is basically the same approach we took at Mozy, only there Josh insisted on a four-letter domain name, which narrowed it down a lot.)

I hope it doesn't mean "your dog has bad breath" somewhere.

And yes, Riptano is on Twitter.

Are you hiring?

Yes. We'll have a jobs page on the site soon. In the meantime you can email me a resume if you can't wait. Prior participation in the Apache Cassandra project is of course a huge plus.

Wednesday, April 07, 2010

Cassandra: Fact vs fiction

Cassandra has seen some impressive adoption success over the past months, leading some to conclude that Cassandra is the frontrunner in the highly scalable database space (a subset of the hot NoSQL category). Amid all the attention, some misunderstandings have been propagated, which I'd like to clear up.

Fiction: "Cassandra relies on high-speed fiber between datacenters" and can't reliably replicate between datacenters with more than a few ms of latency between them.

Fact: Cassandra's multi-datacenter replication is one of its earliest features and is by far the most battle-tested in the NoSQL space. Facebook has had Cassandra deployed in east and west coast datacenters since before open-sourcing it. SimpleGeo's Cassandra cluster spans 3 EC2 availability zones, and Digg is also deployed on both coasts. Claims that this can't possibly work are an excellent sign that you're reading an article by someone who doesn't know what he's talking about.

Fiction: "It’s impossible to tell when [Cassandra] replicas will be up-to-date."

Fact: Cassandra provides consistency when R + W > N (read replica count + write replica count > replication factor), to use the Dynamo vocabulary. If you do both writes and reads at QUORUM, for example, you can expect data consistency as soon as there are enough reachable nodes for a quorum. Cassandra also provides read repair and anti-entropy, so that even reads at ConsistencyLevel.ONE will be consistent after either of these events.
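To make the R + W > N rule concrete, here is a minimal sketch in Python. It is not Cassandra code; the function names are made up for illustration. It checks the inequality and then brute-forces every possible read set and write set to confirm that they share at least one replica exactly when the inequality holds.

```python
from itertools import combinations


def quorum(n: int) -> int:
    """Size of a majority of n replicas (hypothetical helper, not a Cassandra API)."""
    return n // 2 + 1


def overlap_guaranteed(r: int, w: int, n: int) -> bool:
    """The Dynamo-style rule: every read set meets every write set iff R + W > N."""
    return r + w > n


def always_overlaps(r: int, w: int, n: int) -> bool:
    """Brute-force check: does every r-replica read set share a node with every w-replica write set?"""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


if __name__ == "__main__":
    n = 3
    q = quorum(n)
    print("QUORUM/QUORUM overlap guaranteed:", overlap_guaranteed(q, q, n))
    print("ONE/ONE overlap guaranteed:", overlap_guaranteed(1, 1, n))
```

With a replication factor of 3, QUORUM is 2 replicas, so QUORUM reads plus QUORUM writes satisfy the rule (2 + 2 > 3), while ONE plus ONE does not (1 + 1 <= 3) and has to rely on read repair or anti-entropy to converge.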

Fiction: Cassandra has a small community

Fact: Although popularity has never been a good metric for determining correctness, it's true that when using bleeding edge technology, it's good to have company. As I write this late at night (in the USA), there are 175 people in the Cassandra IRC channel, 60 in the HBase one, 32 in Riak's, and 15 in Voldemort's. (Six months ago, the numbers were 90, 45, and 12 for Cassandra, HBase, and Voldemort. I was not yet hanging out in #riak then.) Mailing list participation tells a similar story.

It's also interesting that the creators of Thrudb and dynomite are both using Cassandra now, indicating that the predicted NoSQL consolidation is beginning.

Fiction: "Cassandra only supports one [keyspace] per install."

Fact: This has not been true for almost a year (June of 2009).

Fiction: Cassandra cannot support Hadoop, or supporting tools such as Pig.

Fact: It has always been straightforward to send the output of Hadoop jobs to Cassandra, and Facebook, Digg, and others have been using Hadoop like this as a Cassandra bulk-loader for over a year. For 0.6, I contributed a Hadoop InputFormat and related code to let Hadoop jobs process data from Cassandra as well, while cooperating with Hadoop to keep processing on the nodes that actually hold the data. Stu Hood then contributed a Pig LoadFunc, also in 0.6.

Fiction: Cassandra achieves its high performance by sacrificing reliability (alternately phrased: Cassandra is only good for data you can afford to lose)

Fact: Unlike some NoSQL databases (notably MongoDB and HBase), Cassandra offers full single-server durability. Relying on replication is not sufficient for can't-afford-to-lose-data scenarios. If your data center loses power, you are highly likely to lose data if you are not syncing to disk, no matter how many replicas you have. And if you run large systems in production long enough, you will realize that power outages, through some combination of equipment failure and human error, are not occurrences you can ignore. But with its fsync'd commitlog design, Cassandra can protect you against that scenario too.
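The general idea behind an fsync'd commit log can be sketched in a few lines of Python. This is a generic illustration of the technique, not Cassandra's actual implementation; the CommitLog class, the length-prefixed record format, and the replay function are all assumptions made up for the example. The point is that a write is acknowledged only after os.fsync returns, so it should survive a power loss on that one machine.

```python
import os


class CommitLog:
    """Append-only log that fsyncs every record (illustrative, not Cassandra's format)."""

    def __init__(self, path: str) -> None:
        # O_APPEND makes every write land at the end of the file.
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

    def append(self, record: bytes) -> None:
        # Length-prefix the record so replay can detect a torn final write.
        # (A production log would also loop on partial writes and add a checksum.)
        os.write(self._fd, len(record).to_bytes(4, "big") + record)
        # Do not return, and so do not acknowledge the write, until it is on disk.
        os.fsync(self._fd)

    def close(self) -> None:
        os.close(self._fd)


def replay(path: str) -> list:
    """Read back every complete record, skipping a truncated tail left by a crash."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break
            size = int.from_bytes(header, "big")
            body = f.read(size)
            if len(body) < size:
                break
            records.append(body)
    return records
```

After a crash, a node built this way would call replay on startup to recover every write it had acknowledged, which is what replication alone cannot promise if all replicas lose power at once.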

What to do after your data is saved, e.g. backups and snapshots, is outside of my scope here but covered in the operations wiki page.