Skip to main content

Posts

On applying for jobs

A friend asks , If [I see a] job I could do, even though I don't meet the stated requirements, should I apply anyway? Short answer: yes. Longer answer: companies are all over the map here, although in general the less layers of bureaucracy there are between the team that the candidate will work with and the hiring process, the more likely the list of requirements is to be actual requirements. How can you tell? HR paper pushers like to think in terms of checklists because that lets them go through hundreds of resumes without any real understanding of the position, so they write ads like this one -- lots of really specific "5+ years of X," not much about what the position actually involves. But if it's the team lead himself writing the description, which you will see at smaller companies, then you get much more about what the position involves and less checklist items, because the lead is comfortable determining competence based on skill instead of p...

Apache Cassandra: 2010 in review

In 2010, Apache Cassandra increased its momentum as the leading scalable database. Here is a summary of the notable activity in three areas: code, community and controversy. As always, comments are welcome. Code 2010 started with the release of Cassandra 0.5 , followed by 0.6 and graduation from the ASF incubator a few months later. Seven more stable releases of 0.6 proceeded, adding many features to improve operations in response to feedback from production users. 0.7 adds highly anticipated features like column value indexes , live schema updates , more efficient cluster expansion, and more control over replication, but didn't quite make it into 2010, with rc4 released on new year's 2011 . We also committed the distributed counters patchset, begun at Digg and enhanced by Twitter for their real-time analytics product . Notable as the most-involved feature discussion to date, distributed counters started with a vector clock approach , but switched to a new desig...

And now for something completely different

A month ago I left Rackspace to start Riptano , a Cassandra support and services company. I was in the unusal position of being a technical person looking for a business-savvy co-founder. For whatever reason, the converse seems a lot more common . Maybe technical people tend to sterotype softer skills as being easy. But despite some examples to the contrary (notably for me, Josh Coates at Mozy ), I found that starting a company is too hard for just one person . Unfortunately, all of my fairly slim portfolio of business guys I'd like to co-found with were unavailable. So progress was slow, until Matt Pfeil heard that I was leaving Rackspace and drove to San Antonio from Austin to talk me out of it. Not only was he not successful in talking me out of leaving, but he ended up co-founding Riptano. And here we are, with a Riptano mini-faq. Isn't Cassandra mostly just a web 2.0 thing for ex-mysql shops? Although most of the early adopters fit this stereotype, we...

Cassandra: Fact vs fiction

Cassandra has seen some impressive adoption success over the past months, leading some to conclude that Cassandra is the frontrunner in the highly scalable databases space (a subset of the hot NoSQL category ). Among all the attention, some misunderstandings have been propagated, which I'd like to clear up. Fiction : "Cassandra relies on high-speed fiber between datacenters" and can't reliably replicate between datacenters with more than a few ms of latency between them. Fact : Cassandra's multi-datacenter replication is one of its earliest features and is by far the most battle-tested in the NoSQL space. Facebook had Cassandra deployed on east and west coast datacenters since before open sourcing it. SimpleGeo's Cassandra cluster spans 3 EC2 availability zones , and Digg is also deployed on both coasts. Claims that this can't possibly work are an excellent sign that you're reading an article by someone who doesn't know what he's ta...

Cassandra in Google Summer of Code 2010

Cassandra is participating in the Google Summer of Code, which opened for proposal submission today . Cassandra is part of the Apache Software Foundation, which has its own page of guidelines up for students and mentors. We have a good mix of project ideas involving both core and non-core areas, from straightforward code bashing to some pretty tricky stuff, depending on your appetite. Core tickets aren't necessarily harder than non-core, but they will require reading and understanding more existing code. Non-core Create a web ui for cassandra : we have a (fairly minimal) command line interface, but a web gui is more user-friendly. There is the beginnings of such a beast in the Cassandra source tree at contrib/cassandra_browser [pretty ugly Python code] and a gtk-based one at http://github.com/driftx/chiton [also Python, less ugly]. First-class commandline interface : if you prefer to kick things old-school, improving the cli itself would also be welcome. Create a Cassandr...

Cassandra in action

There's been a lot of new articles about Cassandra deployments in the past month, enough that I thought it would be useful to summarize in a post. Ryan King explained in an interview with Alex Popescu why Twitter is moving to Cassandra for tweet storage, and why they selected Cassandra over the alternatives. My experience is that the more someone understands large systems and the problems you can run into with them from an operational standpoint, the more likely they are to choose Cassandra when doing this kind of evaluation. Ryan's list of criteria is worth checking out. Digg followed up their earlier announcement that they had taken part of their site live on Cassandra with another saying that they've now "reimplemented most of Digg's functionality using Cassandra as our primary datastore." Digg engineer Ian Eure also gave some more details on Digg's cassandra data model in a Hacker News thread. Om Malik quoted extensively from the Digg...

Why your data may not belong in the cloud

Several of the reports of the recently-concluded NoSQL Live event mentioned that I took a contrarian position on the "NoSQL in the Cloud" panel, arguing that traditional, bare metal servers usually make more sense. Here's why. There are two reasons to use cloud infrastructure (and by cloud I mean here "commodity VMs such as those provided by Rackspace Cloud Servers or Amazon EC2): You only need a fraction of the capacity of a single machine Your demand is highly elastic; you want to be able to quickly spin up many new instances, then drop them when you are done Most people looking at NoSQL solutions are doing it because their data is larger than a traditional solution can handle, or will be, so (1) is not a very strong motivation. But what about (2)? At first glance, cloud is a great fit for adding capacity to a database cluster painlessly. But there's an important difference between load like web traffic that bounces up and down frequently,...

Distributed deletes in the Cassandra database

Handling deletes in a distributed, eventually consistent system is a little tricky, as demonstrated by the fairly frequent recurrence of the question, " Why doesn't disk usage immediately decrease when I remove data in Cassandra ?" As background, recall that a Cassandra cluster defines a ReplicationFactor that determines how many nodes each key and associated columns are written to. In Cassandra (as in Dynamo ), the client controls how many replicas to block for on writes, which includes deletions. In particular, the client may (and typically will) specify a ConsistencyLevel of less than the cluster's ReplicationFactor, that is, the coordinating server node should report the write successful even if some replicas are down or otherwise not responsive to the write. (Thus, the "eventual" in eventual consistency: if a client reads from a replica that did not get the update with a low enough ConsistencyLevel, it will potentially see old data. Cassandra u...

Cassandra 0.5.0 released

Apache Cassandra 0.5.0 was released over the weekend, four months after 0.4. ( Upgrade notes ; full changelog .) We're excited about releasing 0.5 because it makes life even better for people using Cassandra as their primary data source -- as opposed to a replica, possibly denormalized, of data that exists somewhere else. The Cassandra distributed database has always had a commitlog to provide durable writes, and in 0.4 we added an option to waiting for commitlog sync before acknowledging writes, for cases where even a few seconds of potential data loss was not an option. But what if a node goes down temporarily? 0.5 adds proactive repair, what Dynamo calls "anti-entropy," to synchronize any updates Hinted Handoff or read repair didn't catch across all replicas for a given piece of data. 0.5 also adds load balancing and significantly improves bootstrap (adding nodes to a running cluster). We've also been busy adding documentation on operations in product...

Linux performance basics

I want to write about Cassandra performance tuning, but first I need to cover some basics: how to use vmstat, iostat, and top to understand what part of your system is the bottleneck -- not just for Cassandra but for any system. vmstat You will typically run vmstat with "vmstat sampling-period", e.g., "vmstat 5." The output looks like this: procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 20 0 195540 32772 6952 576752 0 0 11 12 38 43 1 0 99 0 22 2 195536 35988 6680 575132 6 0 2952 14 959 16375 72 21 4 3 The first line is your total system average since boot; typically this will not be very useful, since you are interested in what is causing problems NOW. Then you will get one line per sample period; most of the output is self explanatory. The reason to start with vmstat is the "swap" section: si and s...

Cassandra reading list

I put together this list for a co-worker who wants to learn more about Cassandra: ( 0.5 beta 2 out now!) Getting Started : Cassandra is surprisingly easy to try out. This walks you through both single-node and clustered setup. The Dynamo paper and Amazon's related article on eventual consistency : Cassandra's replication model is strongly influenced by Dynamo's. Almost everything you read here also applies to Cassandra. (The major exceptions are vector clocks, and even that may change , and Cassandra's support for order-preserving partitioning with active load balancing.) WTF is a SuperColumn ? Arin Sarkissian from Digg explains the Cassandra data model. Operations : stuff you will want to know when you run Cassandra in production Cassandra users survey from Nov 09 : What Twitter, Mahalo, Ooyala, SimpleGeo, and others are using Cassandra for More articles here (Cassandra on OS X seems to be a particularly popular topic) If you want to know more about the interna...

Cassandra hackfest and OSCON report

The best part of OSCON for me wasn't actually part of OSCON. The guys at Twitter put together a Cassandra hackfest on Wednesday night, with much awesomeness resulting. Thanks to Evan for organizing! Stu Hood flew up from Rackspace's Virginia offices just for the night, which normally probably wouldn't have been worth it, but Cliff Moon , author of dynomite , showed up (thanks, Cliff!) and was able to give Stu a lot of pointers on implementing merkle trees . Cliff and I also had a good discussion with Jun Rao about hinted handoff--Cliff and Jun are not fans, and I tend to agree with them--and eventual consistency . I also met David Pollack and got to talk a little about persistence for Goat Rodeo , and talked to a ton of people from Twitter and Digg. I think those two, with Rackspace and IBM Research, constituted the companies with more than one engineer attending. The rest was "long tail." Back at OSCON, my Cassandra talk was standing room only. Slid...

Cassandra 0.3 update

Two months after the first release candidate , Cassandra 0.3 is still not out. But, we're close! We had two more bug-fix release candidates, and it's virtually certain that 0.3-final will be the same exact code as 0.3-rc3 . (If you're using rc1, you do want to upgrade; see CHANGES.txt .) But, we got stuck in the ASF bureaucracy and it's going to take at least one more round-trip before the crack Release Prevention Team grudgingly lets us call it official. In the meantime, work continues apace on trunk for 0.4.

Patch-oriented development made sane with git-svn

One of the drawbacks to working on Cassandra is that unlike every other OSS project I have ever worked on, we are using a patch-oriented development process rather than post-commit review. It's really quite painfully slow. Somehow this became sort of the default for ASF projects, but there is precedent for switching to post-commit review eventually. In the meantime, there is git-svn. (The ASF does have a git mirror set up , but I'm going to ignore that because (a) its reliability has been questionable and (b) sticking with git-svn hopefully makes this more useful for non-ASF projects.) Disclaimer: I am not a git expert , and probably some of this will make you cringe if you are. Still, I hope it will be useful for some others fumbling their way towards enlightenment. As background, I suggest the git crash course for svn users . Just the parts up to the Remote section. Checkout: git-svn init https://svn.apache.org/repos/asf/cassandra/trunk cassandra Once that...

Why you won't be building your killer app on a distributed hash table

I ran across A case study in building layered DHT applications while doing some research on implementing load-balancing in Cassandra . The question addressed is, "Are DHTs a general-purpose tool that you can build more sophisticated services on?" Short version: no. A few specialized applications can and have been built on a plain DHT, but most applications built on DHTs have ended up having to customize the DHT's internals to achieve their functional or performance goals. This paper describes the results of attempting to build a relatively complex datastructure (prefix hash trees, for range queries) on top of OpenDHT. The result was mostly failure: A simple put-get interface was not quite enough. In particular, OpenDHT relies on timeouts to invalidate entries and has no support for atomicity primitives... In return for ease of implementation and deployment, we sacrificed performance. With the OpenDHT implementation, a PHT query operation took a median of 2–4 seconds...

Cassandra 0.3 release candidate and progress

We have a release candidate out for Cassandra 0.3. Grab the download and check out how to get started . The facebook presentation from almost a year ago now is also still a good intro to some of the features and data model. Cassandra in a nutshell : Scales writes very, very well: just add more nodes! Has a much richer data model than vanilla key/value stores -- closer to what you'd be used to in a relational db. Is pretty bleeding edge -- to my knowledge, Facebook is the only group running Cassandra in production. (Their largest cluster is 120 machines and 40TB of data .) At Rackspace we are working on a Cassandra-based app now that 0.3 has the extra features we need. Moved to the Apache Incubator about 40 days ago, at which point development greatly accelerated. Changes in 0.3 include Range queries on keys, including user-defined key collation. Remove support, which is nontrivial in an eventually consistent world. Workaround for a weird bug in JDK select/register that see...

A better analysis of Cassandra than most

Vladimir Sedach wrote a three-part dive into Cassandra . (Almost two months ago now. Guess I need to set up Google Alerts. Trouble is there's a surprising amount of noise around the word `cassandra.`) Part 0 Part 1 Part 2 A few notes: We now have an order-preserving partitioner as well as the hash-based one Yes, if you tell Cassandra to wait for all replicas to be ack'd before calling a write a success, then you would have traditional consistency (as opposed to "eventual") but you'd also have no tolerance for hardware failures which is a main point of this kind of system. Zookeeper is not currently used by Cassandra, although we have plans to use it in the future. Load balancing is not implemented yet. The move to Apache is finished and development is active there now.

Consistent hashing vs order-preserving partitioning in distributed databases

The Cassandra distributed database supports two partitioning schemes now: the traditional consistent hashing scheme, and an order-preserving partitioner. The reason that almost all similar systems use consistent hashing (the dynamo paper has the best description; see sections 4.1-4.3) is that it provides a kind of brain-dead load balancing from the hash algorithm spreading keys across the ring. But the dynamo authors go into some detail about how this by itself doesn't actually give good results in practice; their solution was to assign multiple tokens to each node in the cluster and they describe several approaches to that. But Cassandra's original designer considers this a hack and prefers real load balancing . An order-preserving partitioner, where keys are distributed to nodes in their natural order, has huge advantages over consistent hashing, particularly the ability to do range queries across the keys in the system (which has also been committed to Cassandra now...

Automatic project structure inference

David MacIver has an interesting blog entry up about determining logical project structure via commit logs . I was very interested because one of Cassandra's oldest issues is creating categories for our JIRA instance. (I've never been a big fan of JIRA, but you work with the tools you have. Or the ones the ASF inflicts on you, in this case.) The desire to add extra work to issue reporting for a young project like Cassandra strikes me as slightly misguided in the first place. I have what may be an excessive aversion to overengineering, and I like to see a very clear benefit before adding complexity to anything, even an issue tracker. Still, I was curious to see what David's clustering algorithm made of things. And after pestering him to show me how to run his code I figure I owe it to him to show my results . In general it did a pretty good job, particularly with the mid-sized groups of files. The large groups are just noise; the small groups, well, it's not e...