This is my attempt to clear up some misconceptions about CouchDB and point out some technical details that a lot of people seem to have overlooked. For the record, I like Damien Katz's blog, he seems like a great programmer, and Erlang looks cool. Please don't hurt me.
First, and most important: CouchDB is not a distributed database. BigTable is a distributed database. Cassandra and dynomite are distributed databases. (And open source, and based on a better design than BigTable. More on this in another post.) It's true that with CouchDB you can "shard" data out to different instances just like you can with MySQL or PostgreSQL. That's not what people think when they see "distributed database." It's also true that CouchDB has good replication, but even multi-master replication isn't the same as a distributed database: you're still limited to the write throughput of the slowest machine.
Here are some reasons you should think twice and do careful testing before using CouchDB in a non-toy project:
- Writes are serialized. Not serialized as in the isolation level, serialized as in there can only be one write active at a time. Want to spread writes across multiple disks? Sorry.
- CouchDB uses a MVCC model, which means that updates and deletes need to be compacted for the space to be made available to new writes. Just like PostgreSQL, only without the man-years of effort to make vacuum hurt less.
- CouchDB is simple. Gloriously simple. Why is that a negative? It's competing with systems (in the popular imagination, if not in its author's mind) that have been maturing for years. The reason PostgreSQL et al have those features is because people want them. And if you don't, you should at least ask a DBA with a few years of non-MySQL experience what you'll be missing. The majority of CouchDB fans don't appear to really understand what a good relational database gives them, just as a lot of PHP programmers don't get what the big deal is with namespaces.
- A special case of simplicity deserves mention: nontrivial queries must be created as a view with mapreduce. MapReduce is a great approach to trivially parallelizing certain classes of problem. The problem is, it's tedious and error-prone to write raw MapReduce code. This is why Google and Yahoo have both created high-level languages on top of it (Sawzall and Pig, respectively). Poor SQL; even with DSLs being the new hotness, people forget that SQL is one of the original domain-specific languages. It's a little verbose, and you might be bored with it, but it's much better than writing low-level mapreduce code.
Comments
However, your premise is that somehow it will replace current relational database technologies. I don't think that's its goal (in fact, in the "Introduction to CouchDB" on its own website, it states that it's not "a replacement for relational databases").
I think that comparing CouchDB to PostgreSQL is just comparing two different things.
If your premise is that other database technologies are better for storing logical documents in a distributed way, then I'd be really interested to see a post on that.
As far as scalability is concerned, CouchDB currently supports multi-master replication, as well as offline or disconnected writes. Availability is favored over consistency, so application developers will have to take seriously the chance that two different users could update the same document within a short time frame on separate nodes.
However, as Couch progresses, we're working to add cluster-wide transactions (as well as integrated partitioning) so in the long term, Couch should be able to scale with your data.
In my mind, that makes it totally useless for just about anything interesting. Why would I use CouchDB over memcached?
Replication flags any conflicting writes as they come in from other nodes. So if your application can process a queue of conflicted revs, you can scale it to multiple disconnected replicas, and have a guarantee of eventual consistency, as long as the nodes eventually complete a successful replication.
http://code.google.com/p/scalaris/
(Really-)Distributed, Erlang, Transactional Key/Value-Store.
What they lack is the JSON-API (easy to do yourself) and the server side map reduce (I miss that one, it's cool! but I'm still not sure how important that is or if it can be done with Hadoop as easily)
Peace
-stephan
Writes are serialized per database. Want to spread writes across multiple disks? Use multiple databases.
More importantly. Scalaris lacks persistent storage. It is memory-only.
--
@Peter
Same for memcached.
CouchDB is another such technology. RDBMS is overkill for many web related projects. Those same PHP programmers who don't understand namespaces, and struggle with SQL, are going to eat this up.
I'm personally more interested in the Drizzle fork of MySQL than I am about CouchDB, but I'm definitely going to watch it closely so I can consider it for projects where it might fit the bill.
CouchDB's storage model is closer to GFS than it is to Drizzle, so if you're looking for "big boys" to compare it to, that'd be one place to start.
CouchDB in that respect is a 'distributed' database.
CouchDB may not be the end all but document databases are better suited for the future it is just we will be coding different where the middle tier is smarter about where data lies not just the database. Lots of people are coding that way already dealing with terabytes of data that really go beyond the point of being able to even run a process across all data at all.
CouchDB was addressing a locking down of databases to one machine or cluster and freeing it to be more globally unique data (hash in the sky) when that type of system was just emerging, it will evolve just as RDBMS did.
I do believe there are strong futures for RESTful HTTP/JSON based databases that can handle some vertical scale like something like terracotta gives you but the future is horizontal scaling mostly. Document databases fit that much better than RDMBS even if they are young and still iterating to get to market ready.
Claiming it can't scale calls for actual proof - tests and numbers - and not just words.
But manual sharding is labor intensive, error prone, and inflexible. You _can_ deal with machine failures but it's painful. And that's the good news. Growing your cluster is much worse. So is dealing with load hot spots.
That's what my problem is with couchdb -- as Zach quoted, the first feature they tout is "distributed," which has become associated, fairly or not, with scalability features couchdb doesn't have. But none of their devs ever post a correction to articles lumping couchdb in with scalable databases to say, "actually, we mean this _other_ definition of distributed." They seem content to allow people to assume they have these other features too, which is understandable in some sense, but not really honest.