This is my attempt to clear up some misconceptions about CouchDB and point out some technical details that a lot of people seem to have overlooked. For the record, I like Damien Katz's blog, he seems like a great programmer, and Erlang looks cool. Please don't hurt me.
First, and most important: CouchDB is not a distributed database. BigTable is a distributed database. Cassandra and dynomite are distributed databases. (And open source, and based on a better design than BigTable. More on this in another post.) It's true that with CouchDB you can "shard" data out to different instances just like you can with MySQL or PostgreSQL. That's not what people think of when they see "distributed database." It's also true that CouchDB has good replication, but even multi-master replication isn't the same as a distributed database: you're still limited to the write throughput of the slowest machine.
Here are some reasons you should think twice and do careful testing before using CouchDB in a non-toy project:
- Writes are serialized. Not serialized as in the isolation level, serialized as in there can only be one write active at a time. Want to spread writes across multiple disks? Sorry.
- CouchDB uses an MVCC model, which means that updates and deletes need to be compacted before the space they occupied is made available to new writes. Just like PostgreSQL, only without the man-years of effort to make vacuum hurt less.
- CouchDB is simple. Gloriously simple. Why is that a negative? It's competing with systems (in the popular imagination, if not in its author's mind) that have been maturing for years. The reason PostgreSQL et al. have those features is that people want them. And if you don't, you should at least ask a DBA with a few years of non-MySQL experience what you'll be missing. The majority of CouchDB fans don't appear to really understand what a good relational database gives them, just as a lot of PHP programmers don't get what the big deal is with namespaces.
- A special case of simplicity deserves mention: nontrivial queries must be created as a view with MapReduce. MapReduce is a great approach to trivially parallelizing certain classes of problems. The problem is, it's tedious and error-prone to write raw MapReduce code. This is why Google and Yahoo have both created high-level languages on top of it (Sawzall and Pig, respectively). Poor SQL; even with DSLs being the new hotness, people forget that SQL is one of the original domain-specific languages. It's a little verbose, and you might be bored with it, but it's much better than writing low-level MapReduce code.
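To make the MapReduce-versus-SQL point concrete, here is a rough sketch in Python. It is not CouchDB's API (real CouchDB views are written as JavaScript functions); the documents, the `posts` table, and every function name below are made up for illustration. It computes the same per-author total twice: once as hand-written map and reduce steps with a toy driver, and once as a single SQL statement run through SQLite.

```python
import sqlite3
from collections import defaultdict

# Hypothetical documents, invented for this illustration.
docs = [
    {"author": "alice", "words": 120},
    {"author": "bob", "words": 300},
    {"author": "alice", "words": 80},
]


def map_doc(doc):
    """Map step: emit one (key, value) pair per document."""
    yield doc["author"], doc["words"]


def reduce_values(values):
    """Reduce step: combine all values emitted for one key."""
    return sum(values)


def run_view(documents):
    """Toy driver standing in for a database's view engine."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    return {key: reduce_values(values) for key, values in grouped.items()}


def run_sql(documents):
    """The same aggregate expressed as one SQL statement."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (author TEXT, words INTEGER)")
    conn.executemany("INSERT INTO posts VALUES (:author, :words)", documents)
    rows = conn.execute(
        "SELECT author, SUM(words) FROM posts GROUP BY author"
    ).fetchall()
    conn.close()
    return dict(rows)


mapreduce_totals = run_view(docs)
sql_totals = run_sql(docs)
```

Both calls produce the same totals. For an aggregate this small the map and reduce functions are still manageable; the gap widens once a query needs a join, a filter on the aggregate, or a second grouping key, each of which is a clause in SQL and a restructuring of the map and reduce functions otherwise.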