Monday, February 08, 2010

Distributed deletes in the Cassandra database

Handling deletes in a distributed, eventually consistent system is a little tricky, as demonstrated by the fairly frequent recurrence of the question, "Why doesn't disk usage immediately decrease when I remove data in Cassandra?"

As background, recall that a Cassandra cluster defines a ReplicationFactor that determines how many nodes each key and its associated columns are written to. In Cassandra (as in Dynamo), the client controls how many replicas to block for on writes, including deletes. In particular, the client may (and typically will) specify a ConsistencyLevel lower than the cluster's ReplicationFactor, meaning the coordinating node should report the write successful even if some replicas are down or otherwise unresponsive to the write.
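
To make this concrete, here is a minimal sketch, in Python rather than Cassandra's actual Java write path, of a coordinator that reports success once ConsistencyLevel replicas acknowledge a write, even though the remaining ReplicationFactor - ConsistencyLevel replicas may have missed it. All class and function names here are hypothetical illustrations, not Cassandra's API:

```python
class ReplicaDown(Exception):
    pass

class Replica:
    """A toy replica holding key/value data; `up` simulates availability."""
    def __init__(self, name, up=True):
        self.name, self.up, self.data = name, up, {}

    def apply_write(self, key, value):
        if not self.up:
            raise ReplicaDown(self.name)
        self.data[key] = value

def coordinate_write(replicas, key, value, consistency_level):
    """Report success once `consistency_level` replicas ack, even if the
    other ReplicationFactor - ConsistencyLevel replicas are unreachable."""
    acks = 0
    for replica in replicas:  # len(replicas) == ReplicationFactor
        try:
            replica.apply_write(key, value)
            acks += 1
        except ReplicaDown:
            pass  # a down replica does not by itself fail the write
    if acks < consistency_level:
        raise RuntimeError("not enough live replicas for ConsistencyLevel")
    return acks

# ReplicationFactor = 3, ConsistencyLevel = 2, one replica down:
replicas = [Replica("A"), Replica("B"), Replica("C", up=False)]
coordinate_write(replicas, "k1", "v1", consistency_level=2)  # succeeds
# Replica C never saw the write -- which is exactly what makes deletes tricky.
```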

(Thus the "eventual" in eventual consistency: with a low enough ConsistencyLevel, a client that reads from a replica that did not get the update will potentially see old data. Cassandra uses Hinted Handoff, Read Repair, and Anti Entropy to reduce the inconsistency window, as well as offering higher consistency levels such as ConsistencyLevel.QUORUM, but it's still something we have to be aware of.)
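
As a worked example of why QUORUM narrows the window: with ReplicationFactor RF, any read that consults R replicas must overlap any write acknowledged by W replicas in at least one node whenever R + W > RF, so a QUORUM read is guaranteed to see the latest QUORUM write. A quick check of the arithmetic:

```python
RF = 3
W = RF // 2 + 1  # QUORUM writes: 2 of 3 replicas
R = RF // 2 + 1  # QUORUM reads:  2 of 3 replicas
assert R + W > RF  # pigeonhole: every read set intersects every write set
```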

Thus, a delete operation can't just wipe out all traces of the data immediately: if it did, then a replica that missed the delete would, when it became available again, treat the replicas that did receive the delete as having missed a write, and repair them! So, instead of wiping out data on delete, Cassandra replaces it with a special marker called a tombstone. The tombstone can then be propagated to any replicas that missed the initial remove request.
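
Here is a sketch of the tombstone idea, assuming a toy single-node column store in which every cell carries a timestamp, as Cassandra's columns do. The names and representation are illustrative, not Cassandra's internals. Because a delete is just a write of a specially-marked value, a replica that missed it simply loses the usual last-write-wins timestamp comparison when the tombstone is replayed to it later:

```python
import time

TOMBSTONE = object()  # sentinel meaning "deleted" (hypothetical representation)

class ColumnStore:
    def __init__(self):
        self.cells = {}  # key -> (value-or-TOMBSTONE, timestamp)

    def write(self, key, value, ts=None):
        ts = time.time() if ts is None else ts
        current = self.cells.get(key)
        if current is None or ts > current[1]:  # last write wins by timestamp
            self.cells[key] = (value, ts)

    def delete(self, key):
        self.write(key, TOMBSTONE)  # replace the value; don't erase the cell

    def read(self, key):
        cell = self.cells.get(key)
        if cell is None or cell[0] is TOMBSTONE:
            return None  # deleted (or never written) as far as readers can tell
        return cell[0]
```

Note that the tombstone occupies space until it is garbage-collected, which is why disk usage does not drop the moment you remove data.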

There's one more piece to the problem: how do we know when it's safe to remove tombstones? In a fully distributed system, we can't. We could add a coordinator like ZooKeeper, but that would compromise the simplicity of the design and complicate operations -- you'd essentially have two systems to monitor instead of one. (This is not to say ZK is bad software -- I believe it is best in class at what it does -- only that it solves a problem we do not wish to add to our system.)

So, Cassandra does what distributed systems designers frequently do when confronted with a problem they don't know how to solve: add a constraint that turns it into one they do. Here, we define a constant, GCGraceSeconds, and have each node track tombstone age locally. Once a tombstone is older than GCGraceSeconds, it can be GC'd at compaction time. This means that if a node is down for longer than GCGraceSeconds, you should treat it as failed and replace it as described in Cassandra Operations. The default setting is very conservative at 10 days; you can reduce it once you have Anti Entropy configured to your satisfaction. And of course, if you are only running a single Cassandra node, you can reduce it to zero, and tombstones will be GC'd at the first compaction.
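
Continuing the toy ColumnStore sketch above (again illustrative, not Cassandra's actual compaction code), tombstone GC amounts to a filter applied at compaction time: a tombstone must survive until it is older than GCGraceSeconds, so it can still be shipped to replicas that missed the delete, and only then may it be dropped:

```python
import time

def compact(cells, gc_grace_seconds):
    """Return surviving cells: live values always, tombstones only while
    they are younger than gc_grace_seconds."""
    now = time.time()
    survivors = {}
    for key, (value, ts) in cells.items():
        if value is TOMBSTONE and now - ts > gc_grace_seconds:
            continue  # aged past GCGraceSeconds: safe to forget locally
        survivors[key] = (value, ts)
    return survivors

# With gc_grace_seconds = 0 (sensible only on a single-node cluster),
# every tombstone is dropped at the first compaction.
```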


Royans said...

Interesting problem. A couple of questions to see whether this could be made smarter.

1) If the concern is with dead nodes, can't delete operations clean up faster when there are no dead nodes? -- I'm not sure it's easy to figure out which nodes were part of the cluster and died off without being decommissioned properly.

2) If an old server does come up and joins the cluster after GCGraceSeconds has passed, could it be asked to rebuild its data store from scratch to avoid polluting the current state of the cluster?

Jonathan Ellis said...

1) There's no way to know whether the other nodes are alive and reachable (unreachable = dead, for all practical purposes) -- or, even if you magically did know, whether they would still be reachable in the millisecond or two it takes them to handle the delete. That is why this is an interesting problem. :)

2) OK, so there is a distinction between "dead" and "unreachable" -- if node X is unreachable from node Y, which one of them should have to rebuild its data store when the connection is restored? There is no "right" answer, so we leave it up to the operator. (Of course, 10 days is a really, really long time for a network partition to last in practice.)