Skip to main content


Showing posts from 2009

Cassandra reading list

I put together this list for a co-worker who wants to learn more about Cassandra: ( 0.5 beta 2 out now!) Getting Started : Cassandra is surprisingly easy to try out. This walks you through both single-node and clustered setup. The Dynamo paper and Amazon's related article on eventual consistency : Cassandra's replication model is strongly influenced by Dynamo's. Almost everything you read here also applies to Cassandra. (The major exceptions are vector clocks, and even that may change , and Cassandra's support for order-preserving partitioning with active load balancing.) WTF is a SuperColumn ? Arin Sarkissian from Digg explains the Cassandra data model. Operations : stuff you will want to know when you run Cassandra in production Cassandra users survey from Nov 09 : What Twitter, Mahalo, Ooyala, SimpleGeo, and others are using Cassandra for More articles here (Cassandra on OS X seems to be a particularly popular topic) If you want to know more about the interna

Cassandra hackfest and OSCON report

The best part of OSCON for me wasn't actually part of OSCON. The guys at Twitter put together a Cassandra hackfest on Wednesday night, with much awesomeness resulting. Thanks to Evan for organizing! Stu Hood flew up from Rackspace's Virginia offices just for the night, which normally probably wouldn't have been worth it, but Cliff Moon , author of dynomite , showed up (thanks, Cliff!) and was able to give Stu a lot of pointers on implementing merkle trees . Cliff and I also had a good discussion with Jun Rao about hinted handoff--Cliff and Jun are not fans, and I tend to agree with them--and eventual consistency . I also met David Pollack and got to talk a little about persistence for Goat Rodeo , and talked to a ton of people from Twitter and Digg. I think those two, with Rackspace and IBM Research, constituted the companies with more than one engineer attending. The rest was "long tail." Back at OSCON, my Cassandra talk was standing room only. Slid

Cassandra 0.3 update

Two months after the first release candidate , Cassandra 0.3 is still not out. But, we're close! We had two more bug-fix release candidates, and it's virtually certain that 0.3-final will be the same exact code as 0.3-rc3 . (If you're using rc1, you do want to upgrade; see CHANGES.txt .) But, we got stuck in the ASF bureaucracy and it's going to take at least one more round-trip before the crack Release Prevention Team grudgingly lets us call it official. In the meantime, work continues apace on trunk for 0.4.

Patch-oriented development made sane with git-svn

One of the drawbacks to working on Cassandra is that unlike every other OSS project I have ever worked on, we are using a patch-oriented development process rather than post-commit review. It's really quite painfully slow. Somehow this became sort of the default for ASF projects, but there is precedent for switching to post-commit review eventually. In the meantime, there is git-svn. (The ASF does have a git mirror set up , but I'm going to ignore that because (a) its reliability has been questionable and (b) sticking with git-svn hopefully makes this more useful for non-ASF projects.) Disclaimer: I am not a git expert , and probably some of this will make you cringe if you are. Still, I hope it will be useful for some others fumbling their way towards enlightenment. As background, I suggest the git crash course for svn users . Just the parts up to the Remote section. Checkout: git-svn init cassandra Once that&#

Why you won't be building your killer app on a distributed hash table

I ran across A case study in building layered DHT applications while doing some research on implementing load-balancing in Cassandra . The question addressed is, "Are DHTs a general-purpose tool that you can build more sophisticated services on?" Short version: no. A few specialized applications can and have been built on a plain DHT, but most applications built on DHTs have ended up having to customize the DHT's internals to achieve their functional or performance goals. This paper describes the results of attempting to build a relatively complex datastructure (prefix hash trees, for range queries) on top of OpenDHT. The result was mostly failure: A simple put-get interface was not quite enough. In particular, OpenDHT relies on timeouts to invalidate entries and has no support for atomicity primitives... In return for ease of implementation and deployment, we sacrificed performance. With the OpenDHT implementation, a PHT query operation took a median of 2–4 seconds

Cassandra 0.3 release candidate and progress

We have a release candidate out for Cassandra 0.3. Grab the download and check out how to get started . The facebook presentation from almost a year ago now is also still a good intro to some of the features and data model. Cassandra in a nutshell : Scales writes very, very well: just add more nodes! Has a much richer data model than vanilla key/value stores -- closer to what you'd be used to in a relational db. Is pretty bleeding edge -- to my knowledge, Facebook is the only group running Cassandra in production. (Their largest cluster is 120 machines and 40TB of data .) At Rackspace we are working on a Cassandra-based app now that 0.3 has the extra features we need. Moved to the Apache Incubator about 40 days ago, at which point development greatly accelerated. Changes in 0.3 include Range queries on keys, including user-defined key collation. Remove support, which is nontrivial in an eventually consistent world. Workaround for a weird bug in JDK select/register that see

A better analysis of Cassandra than most

Vladimir Sedach wrote a three-part dive into Cassandra . (Almost two months ago now. Guess I need to set up Google Alerts. Trouble is there's a surprising amount of noise around the word `cassandra.`) Part 0 Part 1 Part 2 A few notes: We now have an order-preserving partitioner as well as the hash-based one Yes, if you tell Cassandra to wait for all replicas to be ack'd before calling a write a success, then you would have traditional consistency (as opposed to "eventual") but you'd also have no tolerance for hardware failures which is a main point of this kind of system. Zookeeper is not currently used by Cassandra, although we have plans to use it in the future. Load balancing is not implemented yet. The move to Apache is finished and development is active there now.

Consistent hashing vs order-preserving partitioning in distributed databases

The Cassandra distributed database supports two partitioning schemes now: the traditional consistent hashing scheme, and an order-preserving partitioner. The reason that almost all similar systems use consistent hashing (the dynamo paper has the best description; see sections 4.1-4.3) is that it provides a kind of brain-dead load balancing from the hash algorithm spreading keys across the ring. But the dynamo authors go into some detail about how this by itself doesn't actually give good results in practice; their solution was to assign multiple tokens to each node in the cluster and they describe several approaches to that. But Cassandra's original designer considers this a hack and prefers real load balancing . An order-preserving partitioner, where keys are distributed to nodes in their natural order, has huge advantages over consistent hashing, particularly the ability to do range queries across the keys in the system (which has also been committed to Cassandra now

Automatic project structure inference

David MacIver has an interesting blog entry up about determining logical project structure via commit logs . I was very interested because one of Cassandra's oldest issues is creating categories for our JIRA instance. (I've never been a big fan of JIRA, but you work with the tools you have. Or the ones the ASF inflicts on you, in this case.) The desire to add extra work to issue reporting for a young project like Cassandra strikes me as slightly misguided in the first place. I have what may be an excessive aversion to overengineering, and I like to see a very clear benefit before adding complexity to anything, even an issue tracker. Still, I was curious to see what David's clustering algorithm made of things. And after pestering him to show me how to run his code I figure I owe it to him to show my results . In general it did a pretty good job, particularly with the mid-sized groups of files. The large groups are just noise; the small groups, well, it's not e

The best PyCon talk you didn't see

There were a lot of good talks at PyCon but I humbly submit that the best one you haven't seen yet is Robert Brewer's talk on DejaVu . Robert describes how his Geniusql layer disassembles and parses python bytecode to let his ORM turn python lambdas into SQL. Microsoft got a lot of press for doing something similar for .NET with LINQ , but Bob was there first. box = store.new_sandbox() print [c.Title for c in box.recall( Comic, lambda c: 'Hob' in c.Title or c.Views > 0)] This is cool as hell. The Geniusql part start about 15 minutes in.

Credit where credit is due

I'm starting to conclude that git just doesn't fit my brain . Several months in, I'm still confused when things don't work the way they "should." My co-worker says I should start a wiki for weird-ass things to do with git: "You keep coming up with use cases that would never occur to me." But, I have to give the git community credit: I've never gone in to #git on freenode and gotten less than fantastic help. Even with git-svn.

Distributed Databases and Cassandra at PyCon

I'll be leading an open-spaces discussion about distributed database architecture, implementation, and use today at 5:00 PM in the Lambert room. Specifically, we will cover bigtable, dynamo, and cassandra, and how to port a typical relational schema to cassandra's ColumnFamily model. I wrote a little background information yesterday about why I think Cassandra in particular is compelling .

Why I like the Cassandra distributed database

I need a distributed database. A real distributed database; replication doesn't count because under a replication-oriented db, each node still needs to be able to handle the full write volume, and you can only throw hardware at that for so long. So, I'm working on the Cassandra distributed database. I gave a lightning talk on it at PyCon this morning. Cassandra is written in Java and implements a sort of hybrid between Dynamo and Bigtable . (Both papers are worth reading.) It takes its distribution algorithm from Dynamo and its data model from Bigtable -- sort of the best of both worlds. Avinash Lakshman, Cassandra's architect, is one of the authors of the Dynamo paper. There is a video about Cassandra here . The first 1/4 is about using Cassandra and then the rest is mostly about the internals. Cassandra is very bleeding edge. Facebook runs several Cassandra clusters in production (the largest is 120 machines and 40TB of data ), but there are sharp edges th

Impressed by KDE 4.2

I'm running Linux on my desktop at work after a year of OS X, and Gnome as shipped by Ubuntu 8.10 has just been a world of hurt. The panel looks and works like ass when moved to the left side of the screen (the only sane place to put it in today's world of widescreen monitors), network-manager just decided to quit working one day (I got by with wicd after that), alt-tab behavior sucks both ways you can configure it, etc. I installed KDE 4.2 over the weekend to see if I was missing anything there. Wow. It's like daylight after being in a cave for two months. I didn't realize how hard it has been to use a butt-ugly environment until I wasn't anymore. (Yes, I tried all the gnome themes I could find. Even Nimbus which took a bit of work. What's that recently-famous phrase? "Lipstick on a pig?") What is better in KDE? In a word, everything. And put me in the camp that really likes having the desktop turned into a usable area for the first time.

SQLAlchemy at PyCon 2009

I will be giving an Introduction to SQLAlchemy tutorial and Mike Bayer and Jason Kirtland will be teaching Advanced SQLAlchemy , both on Thursday. I'll be covering similar material as last year , updated for 0.5. I'm also trying to see if I can get the emails of the registrants so far to see what else they would like covered. My tutorial style is exercise-heavy, so if you've read the docs or my slides but still find it hard to write SQLA code, coming to the tutorial is a great way to fix that. (Note: the blog link to the 2008 slides is broken since we moved If you want them, drop me a note.)

All you ever wanted to know about writing bloom filters

It seems like Bloom filters are all the rage these days. Three years ago I had barely heard of them and now it seems like I see articles and code using them all the time. That's mostly a good thing, since bloom filters are a very useful tool to avoid performing expensive computations without the full memory overhead of a standard map/dictionary. Bloom filters are surprisingly simple: divide a memory area into buckets (one bit per bucket for a standard bloom filter; more -- typically four -- for a counting bloom filter). To insert a key, generate several hashes per key, and mark the buckets for each hash. To check if a key is present, check each bucket; if any bucket is empty, the key was never inserted in the filter. If all buckets are non-empty, though, the key is only probably inserted -- other keys' hashes could have covered the same buckets. Determining exactly how big to make the filter and how many hashes to use to achieve a given false positive rate is a solved p