Monday, April 26, 2010

And now for something completely different

A month ago I left Rackspace to start Riptano, a Cassandra support and services company.

I was in the unusual position of being a technical person looking for a business-savvy co-founder. For whatever reason, the converse seems a lot more common. Maybe technical people tend to stereotype softer skills as being easy.

But despite some examples to the contrary (notably for me, Josh Coates at Mozy), I found that starting a company is too hard for just one person. Unfortunately, everyone in my fairly slim portfolio of business guys I'd want to co-found with was unavailable. So progress was slow, until Matt Pfeil heard that I was leaving Rackspace and drove to San Antonio from Austin to talk me out of it. Not only did he fail to talk me out of leaving, but he ended up co-founding Riptano. And here we are, with a Riptano mini-FAQ.

Isn't Cassandra mostly just a web 2.0 thing for ex-mysql shops?

Although most of the early adopters fit this stereotype, we're seeing interest from a lot of Oracle users and a lot of industries. Unlike many "NoSQL" databases, Cassandra doesn't drop durability (the D in ACID), and besides scalability, enterprises are very interested in our support for multiple data centers and Hadoop analytics.

Are you going to fork Cassandra?

No. Although the ASF license allows doing basically anything with the code, including creating proprietary forks, we think the track record of this strategy in the open source database world is mixed at best.

We might create a (still open-source) Cassandra distribution similar to Cloudera's Distribution for Hadoop, but the mainline Cassandra development is responsive enough that there isn't as much need for a third party to do this as there is with Hadoop.

What does Rackspace think?

Rackspace has been the primary driver of Cassandra development recently, employing (until I left) the three most active committers on the project. For the same reasons Rackspace supported Cassandra to begin with, Rackspace is excited to see Riptano help take the Cassandra ecosystem to the next level. Rackspace has invested in Riptano and has been completely supportive in every way.

Where did you get the name "Riptano?" Does it mean anything?

We took a sophisticated, augmented AI approach. By which I mean, we took a program that generated random, pronounceable strings, and put together a couple of fragments that sounded good together. (This is basically the same approach we took at Mozy, only there Josh insisted on a four-letter domain name, which narrowed it down a lot.)

I hope it doesn't mean "your dog has bad breath" somewhere.

And yes, Riptano is on twitter.

Are you hiring?

Yes. We'll have a jobs page on the site soon. In the meantime you can email me a resume if you can't wait. Prior participation in the Apache Cassandra project is of course a huge plus.

Wednesday, April 07, 2010

Cassandra: Fact vs fiction

Cassandra has seen some impressive adoption success over the past months, leading some to conclude that Cassandra is the frontrunner in the highly scalable databases space (a subset of the hot NoSQL category). Among all the attention, some misunderstandings have been propagated, which I'd like to clear up.

Fiction: "Cassandra relies on high-speed fiber between datacenters" and can't reliably replicate between datacenters with more than a few ms of latency between them.

Fact: Cassandra's multi-datacenter replication is one of its earliest features and is by far the most battle-tested in the NoSQL space. Facebook had Cassandra deployed across east and west coast datacenters from before they open sourced it. SimpleGeo's Cassandra cluster spans 3 EC2 availability zones, and Digg is also deployed on both coasts. Claims that this can't possibly work are an excellent sign that you're reading an article by someone who doesn't know what he's talking about.

Fiction: "It’s impossible to tell when [Cassandra] replicas will be up-to-date."

Fact: Cassandra provides consistency when R + W > N (read replica count + write replica count > replication factor), to use the Dynamo vocabulary. If you do both writes and reads at QUORUM, for one example, you can expect data consistency as soon as there are enough reachable nodes for a quorum. Cassandra also provides read repair and anti-entropy, so that even reads at ConsistencyLevel.ONE will see consistent data once either of those has run.
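
To make the arithmetic concrete, here is a minimal sketch of the R + W > N overlap rule (illustrative only, not Cassandra client code); it just shows why QUORUM reads and writes must share at least one replica:

# Illustrative sketch of the Dynamo-style R + W > N rule; not Cassandra code.
def quorum(n):
    # smallest majority of n replicas
    return n // 2 + 1

def overlap_guaranteed(r, w, n):
    # if r + w > n, any set of r read replicas must intersect any set of w write replicas,
    # so at least one replica serving the read has seen the latest write
    return r + w > n

n = 3                                               # ReplicationFactor
print(overlap_guaranteed(quorum(n), quorum(n), n))  # True: 2 + 2 > 3 (QUORUM reads and writes)
print(overlap_guaranteed(1, 1, n))                  # False: ONE + ONE reads can see stale data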

Fiction: Cassandra has a small community

Fact: Although popularity has never been a good metric for determining correctness, it's true that when using bleeding edge technology, it's good to have company. As I write this late at night (in the USA), there are 175 people in the Cassandra irc channel, 60 in the HBase one, 32 in Riak's, and 15 in Voldemort's. (Six months ago, the numbers were 90, 45, and 12 for Cassandra, HBase, and Voldemort. I did not hang out in #riak back then.) Mailing list participation tells a similar story.

It's also interesting that the creators of Thrudb and dynomite are both using Cassandra now, indicating that the predicted NoSQL consolidation is beginning.

Fiction: "Cassandra only supports one [keyspace] per install."

Fact: This has not been true for almost a year (June of 2009).

Fiction: Cassandra cannot support Hadoop, or supporting tools such as Pig.

Fact: It has always been straightforward to send the output of Hadoop jobs to Cassandra, and Facebook, Digg, and others have been using Hadoop like this as a Cassandra bulk-loader for over a year. For 0.6, I contributed a Hadoop InputFormat and related code to let Hadoop jobs process data from Cassandra as well, while cooperating with Hadoop to keep processing on the nodes that actually hold the data. Stu Hood then contributed a Pig LoadFunc, also in 0.6.

Fiction: Cassandra achieves its high performance by sacrificing reliability (alternately phrased: Cassandra is only good for data you can afford to lose)

Fact: unlike some NoSQL databases (notably MongoDB and HBase), Cassandra offers full single-server durability. Relying on replication alone is not sufficient for can't-afford-to-lose-data scenarios: if your data center loses power and you are not syncing to disk, you are highly likely to lose data no matter how many replicas you have. And if you run large systems in production long enough, you will realize that power outages, through some combination of equipment failure and human error, are not occurrences you can ignore. With its fsync'd commitlog design, Cassandra can protect you against that scenario too.
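
As a toy illustration of why the fsync matters (this is not Cassandra's CommitLog code, just the general append-then-sync-then-acknowledge idea):

# Toy durable-write sketch: append the mutation to a log and fsync before acknowledging,
# so that an acknowledged write survives a power failure on this node. Illustrative only.
import os

class ToyCommitLog:
    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

    def append(self, mutation):
        os.write(self.fd, mutation + b"\n")
        os.fsync(self.fd)   # without this, "written" may only mean "in the OS page cache"
        return "ack"        # acknowledge only after the bytes are actually on disk

log = ToyCommitLog("/tmp/toy_commitlog")
print(log.append(b"row1:col1=value1"))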

What to do after your data is saved, e.g. backups and snapshots, is outside of my scope here but covered in the operations wiki page.

Tuesday, March 30, 2010

Cassandra in Google Summer of Code 2010

Cassandra is participating in the Google Summer of Code, which opened for proposal submission today. Cassandra is part of the Apache Software Foundation, which has its own page of guidelines up for students and mentors.

We have a good mix of project ideas involving both core and non-core areas, from straightforward code bashing to some pretty tricky stuff, depending on your appetite. Core tickets aren't necessarily harder than non-core, but they will require reading and understanding more existing code.

Non-core

  • Create a web ui for cassandra: we have a (fairly minimal) command line interface, but a web gui is more user-friendly. There are the beginnings of such a beast in the Cassandra source tree at contrib/cassandra_browser [pretty ugly Python code] and a gtk-based one at http://github.com/driftx/chiton [also Python, less ugly].
  • First-class commandline interface: if you prefer to kick things old-school, improving the cli itself would also be welcome.
  • Create a Cassandra demo application: we have Twissandra, but we can always use more examples to introduce people to "thinking in Cassandra," which is the hardest part of using it. This one seems to be the most popular with students so far. (So stand out from the crowd, and submit something else too. :)

Almost-core

Core

  • Avro RPC support: currently Cassandra's client layer is the Thrift RPC framework, which sucks for reasons outside our scope here. We're moving to Avro, the new hotness from Doug Cutting (creator of Lucene and Hadoop, you may have heard of those). Basically this means porting org.apache.cassandra.thrift.CassandraServer to org.apache.cassandra.avro.CassandraServer; some examples are already done by Eric Evans.
  • Session-level consistency: In one and two, Amazon discusses the concept of "eventual consistency." Cassandra uses eventual consistency in a design similar to Dynamo. Supporting session consistency would be useful and relatively easy to add: we already have the concept of a Memtable to "stage" updates in before flushing to disk; if we applied mutations to a session-level memtable on the coordinator machine (that is, the machine the client is connected to), and then did a final merge from that table against query results before handing them to the client, we'd get it almost for free. (A rough sketch of the idea follows this list.)
  • Optimize commitlog performance: this is about as low-level as you'll find in Cassandra's code base. fsync, CAS, it's all here. http://wiki.apache.org/cassandra/ArchitectureCommitLog describes the current CommitLog design.
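
For the session-level consistency idea above, here is a rough sketch of what the per-connection memtable merge could look like; the names and data structures are mine, purely for illustration, not a design commitment:

# Illustrative read-your-writes sketch; not Cassandra code. Values are (timestamp, data)
# pairs and the newest timestamp wins, mirroring how Cassandra columns reconcile.
class Session:
    def __init__(self, cluster):
        self.cluster = cluster        # stand-in for the real cluster: get(key) / put(...)
        self.memtable = {}            # {key: {column: (timestamp, value)}}, per connection

    def write(self, key, column, timestamp, value):
        self.memtable.setdefault(key, {})[column] = (timestamp, value)
        self.cluster.put(key, column, timestamp, value)    # normal replicated write

    def read(self, key):
        result = dict(self.cluster.get(key) or {})         # possibly stale replica data
        for column, (ts, value) in self.memtable.get(key, {}).items():
            if column not in result or result[column][0] < ts:
                result[column] = (ts, value)               # this session's write wins if newer
        return result

class StaleCluster:                    # fake cluster that always returns old data
    def get(self, key): return {"name": (1, "old")}
    def put(self, *args): pass

s = Session(StaleCluster())
s.write("row1", "name", 2, "new")
print(s.read("row1"))                  # {'name': (2, 'new')} -- the session sees its own write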

You can comment directly on the JIRA tickets after creating an account (it's open to the public) if you're interested or have other questions. And of course feel free to propose other ideas!

Wednesday, March 24, 2010

Cassandra in action

There have been a lot of new articles about Cassandra deployments in the past month, enough that I thought it would be useful to summarize them in a post.

Ryan King explained in an interview with Alex Popescu why Twitter is moving to Cassandra for tweet storage, and why they selected Cassandra over the alternatives. My experience is that the more someone understands large systems and the problems you can run into with them from an operational standpoint, the more likely they are to choose Cassandra when doing this kind of evaluation. Ryan's list of criteria is worth checking out.

Digg followed up their earlier announcement that they had taken part of their site live on Cassandra with another saying that they've now "reimplemented most of Digg's functionality using Cassandra as our primary datastore." Digg engineer Ian Eure also gave some more details on Digg's Cassandra data model in a Hacker News thread.

Om Malik quoted extensively from the Digg announcement and from Rackspace engineer Stu Hood, who explained Cassandra's appeal: "Over the Bigtable clones, Cassandra has huge high-availability advantages, and no single point of failure. When compared to the Dynamo adherents, Cassandra has the advantage of a more advanced datamodel, allowing for a single row to contain billions of column/value pairs: enough to fill a machine. You also get efficient range queries for the top level key, and even within your values."

The Twitter and Digg news kicked off a lot of publicity, including a lot of "me too" articles but also some interesting ones, such as a highscalability post wondering if this was the end of the mysql + memcached era. If not quite the end yet, then the beginning of it. As Ian Eure from Digg said, "If you're deploying memcache on top of your database, you're inventing your own ad-hoc, difficult to maintain NoSQL system." Possibly the best commentary on this idea is Dare Obasanjo's, who explained that "Digg's usage of Cassandra actually serves as a rebuttal to [an article claiming SQL scales just fine] since they couldn't feasibly get what they want with either horizontal or vertical scaling of their relational database-based solution."

Reddit also migrated to Cassandra from memcachedb, in only 10 days, the fastest migration to Cassandra I've seen. More comments from the engineer doing the migration, ketralnis, in the reddit discussion thread.

CloudKick blogged about how they use Cassandra for time series data, including a sketch of their data model. CloudKick migrated from PostgreSQL, skewering the theory you will sometimes see proffered that "only MySQL users are migrating to NoSQL, not people who use [my favorite vendor's relational database]."

Jake Luciani wrote about how Lucandra, the Cassandra-based Lucene back-end, works, and how he's using it to power the Twitter search app sparse.ly. IMO, Lucandra is one of Cassandra's killer apps.

The FightMyMonster team switched from HBase to Cassandra after concluding that "HBase is more suitable for data warehousing, and large scale data processing and analysis... and Cassandra is more suitable for real time transaction processing and the serving of interactive data." Dominic covers CAP, architecture considerations, benchmarks, map/reduce, and durability in explaining his conclusion.

Eric Peters gave a talk on Cassandra use at his company, Frugal Mechanic, at the Seattle Tech Startups Meetup. This was interesting not because Frugal Mechanic is a big name but because it's not. I haven't seen Eric's name on the Cassandra mailing lists at all, but there he was deploying it and giving a talk on it, showing that Cassandra is starting to move beyond early adopters. (And, just maybe, that our documentation is improving. :)

Finally, Eric Florenzano has a live demo up now of Cassandra running a Twitter clone at twissandra.com, with source at github, as an example of how to use Cassandra's data model. If you're interested in the nuts and bolts of how to build an app on Cassandra, you should check it out.

Monday, March 15, 2010

Why your data may not belong in the cloud

Several of the reports of the recently-concluded NoSQL Live event mentioned that I took a contrarian position on the "NoSQL in the Cloud" panel, arguing that traditional, bare metal servers usually make more sense. Here's why.

There are two reasons to use cloud infrastructure (and by cloud I mean here "commodity VMs such as those provided by Rackspace Cloud Servers or Amazon EC2"):

  1. You only need a fraction of the capacity of a single machine
  2. Your demand is highly elastic; you want to be able to quickly spin up many new instances, then drop them when you are done

Most people looking at NoSQL solutions are doing it because their data is larger than a traditional solution can handle, or will be, so (1) is not a very strong motivation. But what about (2)? At first glance, cloud is a great fit for adding capacity to a database cluster painlessly. But there's an important difference between load like web traffic that bounces up and down frequently, and databases: with few exceptions, databases only get larger with time. You won't have 20 TB of data this week, and 2 next.

When capacity only grows in one direction it makes less sense to pay a premium for the flexibility of being able to reduce your capacity nearly instantly, especially when you also get reduced I/O performance (the most common bottleneck for databases) in the bargain because of the virtualization layer. That's why, despite working for a cloud provider, I don't think it's always a good fit for databases. (It doesn't hurt that Rackspace also offers classic bare metal hosting in the same data centers, so you can have the best of both worlds.)

Monday, February 08, 2010

Distributed deletes in the Cassandra database

Handling deletes in a distributed, eventually consistent system is a little tricky, as demonstrated by the fairly frequent recurrence of the question, "Why doesn't disk usage immediately decrease when I remove data in Cassandra?"

As background, recall that a Cassandra cluster defines a ReplicationFactor that determines how many nodes each key and associated columns are written to. In Cassandra (as in Dynamo), the client controls how many replicas to block for on writes, which includes deletions. In particular, the client may (and typically will) specify a ConsistencyLevel of less than the cluster's ReplicationFactor, that is, the coordinating server node should report the write successful even if some replicas are down or otherwise not responsive to the write.

(Thus, the "eventual" in eventual consistency: if a client reads at a low enough ConsistencyLevel from a replica that did not get the update, it will potentially see old data. Cassandra uses Hinted Handoff, Read Repair, and Anti Entropy to reduce the inconsistency window, as well as offering higher consistency levels such as ConsistencyLevel.QUORUM, but it's still something we have to be aware of.)

Thus, a delete operation can't just wipe out all traces of the data being removed immediately: if we did, and a replica did not receive the delete operation, when it becomes available again it will treat the replicas that did receive the delete as having missed a write update, and repair them! So, instead of wiping out data on delete, Cassandra replaces it with a special value called a tombstone. The tombstone can then be propagated to replicas that missed the initial remove request.

There's one more piece to the problem: how do we know when it's safe to remove tombstones? In a fully distributed system, we can't. We could add a coordinator like ZooKeeper, but that would pollute the simplicity of the design, as well as complicating ops -- then you'd essentially have two systems to monitor, instead of one. (This is not to say ZK is bad software -- I believe it is best in class at what it does -- only that it solves a problem that we do not wish to add to our system.)

So, Cassandra does what distributed systems designers frequently do when confronted with a problem we don't know how to solve: define some additional constraints that turn it into one that we do. Here, we defined a constant, GCGraceSeconds, and had each node track tombstone age locally. Once it has aged past the constant, it can be GC'd. This means that if you have a node down for longer than GCGraceSeconds, you should treat it as a failed node and replace it as described in Cassandra Operations. The default setting is very conservative, at 10 days; you can reduce that once you have Anti Entropy configured to your satisfaction. And of course if you are only running a single Cassandra node, you can reduce it to zero, and tombstones will be GC'd at the first compaction.
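
Here is a minimal sketch of that tombstone lifecycle (my own simplification, not Cassandra's compaction code):

# Illustrative tombstone sketch; not Cassandra internals.
import time

GC_GRACE_SECONDS = 10 * 24 * 3600      # the conservative 10-day default mentioned above

def delete(column_family, key, now=None):
    now = time.time() if now is None else now
    # instead of wiping the data, record a tombstone with its deletion time
    column_family[key] = {"tombstone": True, "deleted_at": now}

def can_gc(entry, now=None):
    now = time.time() if now is None else now
    # a tombstone may only be purged at compaction once every replica has had
    # GCGraceSeconds to learn about the delete
    return entry.get("tombstone", False) and now - entry["deleted_at"] > GC_GRACE_SECONDS

cf = {"row1": {"tombstone": False}}
delete(cf, "row1")
print(can_gc(cf["row1"]))                                    # False immediately after the delete
print(can_gc(cf["row1"], now=time.time() + 11 * 24 * 3600))  # True once the grace period has passed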

Monday, January 25, 2010

Cassandra 0.5.0 released

Apache Cassandra 0.5.0 was released over the weekend, four months after 0.4. (Upgrade notes; full changelog.) We're excited about releasing 0.5 because it makes life even better for people using Cassandra as their primary data source -- as opposed to a replica, possibly denormalized, of data that exists somewhere else.

The Cassandra distributed database has always had a commitlog to provide durable writes, and in 0.4 we added an option to wait for commitlog sync before acknowledging writes, for cases where even a few seconds of potential data loss was not an option. But what if a node goes down temporarily? 0.5 adds proactive repair, what Dynamo calls "anti-entropy," to synchronize any updates that Hinted Handoff or read repair didn't catch across all replicas for a given piece of data.

0.5 also adds load balancing and significantly improves bootstrap (adding nodes to a running cluster). We've also been busy adding documentation on operations in production and system internals.

Finally, in 0.5 we've improved concurrency across the board, improving insert speed by over 50% on the stress.py benchmark (from contrib/) on a relatively modest 4-core system with 2GB of RAM. We've also added a [row] key cache, enabling similar relative improvements in reads:

(You will note that unlike most systems, Cassandra reads are usually slower than writes. 0.6 will narrow this gap with full row caching and mmap'd I/O, but fundamentally we think optimizing for writes is the right thing to do since writes have always been harder to scale.)

Log replay, flush, compaction, and range queries are also faster.

0.5 also brings new tools, including JSON-based data export and import, an improved command-line interface, and new JMX metrics.

One final note: like all distributed systems, Cassandra is designed to maximize throughput when under load from many clients. Benchmarking with a single thread or a small handful will not give you numbers representative of production (unless you only ever have four or five users at a time in production, I suppose). Please don't ask "why is Cassandra so slow" and offer up a single-threaded benchmark as evidence; that makes me sad inside. Here's 1000 words:

(Thanks to Brandon Williams for the graphs.)

Thursday, January 21, 2010

Linux performance basics

I want to write about Cassandra performance tuning, but first I need to cover some basics: how to use vmstat, iostat, and top to understand what part of your system is the bottleneck -- not just for Cassandra but for any system.

vmstat
You will typically run vmstat with "vmstat sampling-period", e.g., "vmstat 5." The output looks like this:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
20  0 195540  32772   6952 576752    0    0    11    12   38   43  1  0 99  0
22  2 195536  35988   6680 575132    6    0  2952    14  959 16375 72 21  4  3

The first line is your total system average since boot; typically this will not be very useful, since you are interested in what is causing problems NOW. Then you will get one line per sample period; most of the output is self explanatory. The reason to start with vmstat is the "swap" section: si and so are swap in (memory read from disk) and swap out (memory written to disk). Remember that a little swapping is normal, particularly during application startup: by default, Linux will swap infrequently used pages of application memory to disk to free up more room for disk caching, even if there is enough RAM to accommodate all applications.

iostat
To get more detail on I/O, use "iostat -x". Again, you want to give it a sampling interval, and ignore the first set of output. iostat also gives you some cpu information, but top does that better; let's focus on the Device section:

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               9.80     0.20   36.60    0.40  5326.40     4.80   144.09     0.06    1.62   1.41   5.20

There are 3 easy ways to tell if a disk is a probable bottleneck here, and none of them show up without the -x flag, so get in the habit of using that. "avgqu-sz" is the size of the io request queue; if it is large, there are lots of requests waiting in line. "await" is how long (in ms) the average request took to be satisfied (including time enqueued); recall that on non-SSDs, a single seek is between 5 and 10ms. Finally, "%util" is Linux's guess at how fully saturated the device is.
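
If you want to turn those three signs into a quick sanity check, here is a small helper; the thresholds are my rules of thumb for rotating disks, not hard limits:

# Rough heuristics for spotting a saturated (non-SSD) disk from iostat -x output.
def disk_looks_saturated(avgqu_sz, await_ms, util_pct):
    reasons = []
    if avgqu_sz > 2:        # a persistently long queue means requests are waiting in line
        reasons.append("large request queue")
    if await_ms > 15:       # well above a single 5-10ms seek
        reasons.append("high average wait")
    if util_pct > 80:       # Linux's own estimate of how busy the device is
        reasons.append("device mostly busy")
    return reasons

print(disk_looks_saturated(0.06, 1.62, 5.2))    # [] -- the sample output above is healthy
print(disk_looks_saturated(8.0, 45.0, 98.0))    # all three warning signs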

top
To learn more about per-process CPU and memory usage, use "top." I won't paste top output here because everyone is so familiar with it, but I will mention a few useful things to know:

  • "P" and "M" toggle between sorting by cpu usage and sorting by memory usage
  • "1" toggles breaking down the CPU summary by CPU core
  • SHR (shared memory) is included in RES (resident memory)
  • Amount of memory belonging to a process that has been swapped out is VIRT - RES
  • a state (S column) of D means the process (or thread, see below) is waiting for disk or network i/o
  • "steal" is how much CPU the hypervisor is giving to another VM in a virtual environment; as virtual provisioning becomes more common, avoiding noisy neighbors is increasingly important

"top -H" will split out individual threads into their own lines; both per-process and per-thread views are useful. The per-thread view is particularly useful when dealing with Java applications since you can easily correlate them with thread names from the JVM to see which threads are consuming your CPU. Briefly, you take the PID (thread ID) from top, convert it to hex -- e.g., "python -c 'print hex(12345)'" -- and match it with the corresponding thread ID from jstack.

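For example, assuming a thread ID of 12345 from top, the conversion looks like this; jstack reports the native thread ID in its nid field:

# Convert the decimal thread ID from "top -H" into the hex nid that jstack prints.
tid = 12345
print(hex(tid))   # 0x3039 -- look for a line containing nid=0x3039 in the jstack output,
                  # e.g. something like: "CompactionExecutor:1" ... nid=0x3039 runnable
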
Now you can troubleshoot with a process like: "Am I swapping? If so, what processes are using all the memory? If my application makes a lot of disk read requests, are my reads being cached or are they actually hitting the disk? If I am hitting the disk, is it saturated? How much 'hot data' can I have before I run out of cache room? Are any/all of my cpu cores maxed? Which threads are actually using the CPU? Which threads spend most of their time waiting for i/o?" Then if you go to ask for help tuning something, you can show that you've done your homework.