Wednesday, December 31, 2008

CouchDB: not drinking the kool-aid

This is my attempt to clear up some misconceptions about CouchDB and point out some technical details that a lot of people seem to have overlooked.  For the record, I like Damien Katz's blog, he seems like a great programmer, and Erlang looks cool.  Please don't hurt me.
First, and most important: CouchDB is not a distributed database.  BigTable is a distributed database.  Cassandra and dynomite are distributed databases.  (And open source, and based on a better design than BigTable.  More on this in another post.)  It's true that with CouchDB you can "shard" data out to different instances just like you can with MySQL or PostgreSQL.  That's not what people think when they see "distributed database." It's also true that CouchDB has good replication, but even multi-master replication isn't the same as a distributed database: you're still limited to the write throughput of the slowest machine.
Here are some reasons you should think twice and do careful testing before using CouchDB in a non-toy project:
  • Writes are serialized.  Not serialized as in the isolation level, serialized as in there can only be one write active at a time.  Want to spread writes across multiple disks?  Sorry.
  • CouchDB uses a MVCC model, which means that updates and deletes need to be compacted for the space to be made available to new writes.  Just like PostgreSQL, only without the man-years of effort to make vacuum hurt less.
  • CouchDB is simple.  Gloriously simple.  Why is that a negative?  It's competing with systems (in the popular imagination, if not in its author's mind) that have been maturing for years.  The reason PostgreSQL et al have those features is because people want them.  And if you don't, you should at least ask a DBA with a few years of non-MySQL experience what you'll be missing.  The majority of CouchDB fans don't appear to really understand what a good relational database gives them, just as a lot of PHP programmers don't get what the big deal is with namespaces.
  • A special case of simplicity deserves mention: nontrivial queries must be created as a view with mapreduce.  MapReduce is a great approach to trivially parallelizing certain classes of problem.  The problem is, it's tedious and error-prone to write raw MapReduce code.  This is why Google and Yahoo have both created high-level languages on top of it (Sawzall and Pig, respectively).  Poor SQL; even with DSLs being the new hotness, people forget that SQL is one of the original domain-specific languages.  It's a little verbose, and you might be bored with it, but it's much better than writing low-level mapreduce code.

Wednesday, December 24, 2008


Today marks one month that I've been working for Rackspace's RackLabs with the Mosso group in San Antonio, Texas.  (Anyone want to start a Python group?  The closest one is in Austin.)

It's kind of a gentle introduction to big company culture for me; at around 2,000 employees, Rackspace is easily ten times as large as any other company I've worked for, and 100 times as large as most.  Mosso is a lot smaller and RackLabs itself is smaller still, but I still had to go to five days (!) of corporate orientation.  Other than that, though, we're pretty much left alone by our corporate parent.

To start with, I'm working on Mosso's Cloud Files, which is basically an S3 competitor.  Cloud Files is similar to the work I did at Mozy, but there are a lot of technical differences.  Some are driven by Cloud Files being more of a general purpose storage engine than the one I wrote for Mozy; others stem from the Cloud Files authors being Twisted fans.

Strange coincidence: as with Mozy, I share an office here with a Debian developer, probably the only one in San Antonio.  My experience is that debian developers are pretty sharp guys, probably in no small part due to the rigorous screening process you have to go through.  They set a high bar.

Of course this continues to be my personal blog, and all opinions are mine alone.  RackLabs has its own blog for when they want to say something official.

Friday, December 19, 2008

Frustrated with git

I'm a little over a week into a git immersion program.  Let me just say that git's reputation of being a little arcane (okay, more than a little) and having a steep learning curve is 100% deserved.

One thing that would mitigate things is if git would give you feedback when you tell it to do nonsense.  But it doesn't.  Here's me trying to get machine B to always merge the debug branch from machine A when I pull:

232 git config branch.debug.remote origin
234 git config branch.master.remote origin
236 git config branch.master.remote origin/debug

All of these commands completed silently. None accomplished what I wanted. In the end I renamed master to old and debug to master to avoid having to fight it. Then I blew away my working copy and re-cloned because those config statements had created a new problem that I didn't know how to undo.

I'm sure the git virtuosos out there will know what was wrong.  That's not the point.  The point is that the tool gave me no feedback.  It was like git was telling me, "Figure it out yourself.  Or don't.  I don't care."  Which is par for the course with my git experience so far.

Thursday, December 18, 2008

FormAlchemy 1.1: admin app, composite key support

FormAlchemy 1.1 is out, so you no longer need to run trunk to get the admin app goodness -- now with i8n support. We also added support for all composite primary keys, and most composite foreign keys. (The distinction is, rendering an object depends on the PK, but loading relations depends on FKs.) Gael also added the fsblob extension, which allows storing blobs on the filesystem and the path in the database. (FormAlchemy can handle blob-in-the-db out of the box.)
(I previously blogged about basic FormAlchemy and the admin app, which are still good introductions.)
FormAlchemy has pretty good documentation. The most important page is form generation; instructions to configure the admin app are here.