Comments on Spyced: CouchDB: not drinking the kool-aid

As everyone knows, scalability isn't about sin...

2009-06-26T05:30:32.952-07:00

As everyone knows, scalability isn't about single-node numbers (although those don't look too hot either); it's about whether adding Nx machines gives you Nx performance, and it's about automating growth and failure recovery so that you don't have to add Nx members to your ops team at the same time. If sharding + replication were enough to scale we'd all stick with pg and mysql, but they're not; at this point everyone's pretty much concluded that that's not adequate for big data.

But manual sharding is labor intensive, error prone, and inflexible. You _can_ deal with machine failures but it's painful. And that's the good news. Growing your cluster is much worse. So is dealing with load hot spots.

That's what my problem is with couchdb -- as Zach quoted, the first feature they tout is "distributed," which has become associated, fairly or not, with scalability features couchdb doesn't have. But none of their devs ever post a correction to articles lumping couchdb in with scalable databases to say, "actually, we mean this _other_ definition of distributed." They seem content to allow people to assume they have these other features too, which is understandable in some sense, but not really honest.

Bah, just saying something isn't distributed b...

2009-06-26T01:03:19.672-07:00

Bah, just saying something isn't distributed because it doesn't do X isn't really valid. There isn't a single technology that makes a thing distributed. The fact that it can do clusters makes it distributed.

Claiming it can't scale calls for actual proof - tests and numbers - and not just words.

CouchDB, MongoDB (http://www.mongodb.org/display/D...

2009-06-25T22:18:20.521-07:00

CouchDB, MongoDB (http://www.mongodb.org/display/DOCS/Home), and similar systems have a place because they are essentially REST-style service interfaces to databases, whereas a traditional RDBMS was only accessible over specific ports, pipes, or other network walls.

CouchDB in that respect is a 'distributed' database.

CouchDB may not be the be-all and end-all, but document databases are better suited for the future. It is just that we will be coding differently, with the middle tier, and not just the database, being smarter about where data lies. Lots of people are already coding that way, dealing with terabytes of data, well past the point where you can run a single process across all of the data.

CouchDB was addressing the locking down of databases to one machine or cluster, freeing the data to be more globally unique (a hash in the sky), at a time when that type of system was just emerging. It will evolve just as the RDBMS did.

I do believe there is a strong future for RESTful HTTP/JSON-based databases that can handle some vertical scale, like what something like Terracotta gives you, but the future is mostly horizontal scaling. Document databases fit that much better than RDBMSs, even if they are young and still iterating toward being market ready.

Maybe it's just me with being join challenged....

2009-06-25T22:14:59.826-07:00

Maybe it's just me being join challenged. Just yesterday, I was trying to do something with sqlite3 (which I love BTW for certain things) with nested selects and had to read up on the syntax for half an hour because it forced me to think about the problem in a different way. I think the simplicity of writing P.O.JS (plain old javascript) to express what you want to query vastly outweighs any limitations of a project that's barely 1.0. At least to me, that's the attraction. We'll eventually have sharding, parallel map-reduce and all the bells and whistles. But the magic of Couch is that it's simple and it __just__ works and it helps me think about the problem the way it needs to be thought about.

@PENIX I see where you are coming from, but as I m...

2009-01-02T16:42:00.000-08:00

@PENIX I see where you are coming from, but as I mentioned in my earlier comment, what is interesting about CouchDB is the p2p distribution model. Nothing else even comes close for the purpose of distributing simple applications to users, while letting them own their own data.

CouchDB's storage model is closer to GFS than it is to Drizzle, so if you're looking for "big boys" to compare it to, that'd be one place to start.

When MySQL first started taking hold, many DB guru...

2009-01-02T16:08:00.000-08:00

When MySQL first started taking hold, many DB gurus dismissed it as useless. Sure, it doesn't do half of what big boys like Oracle can do, but it doesn't need to. Oracle is overkill to the people using MySQL.

CouchDB is another such technology. RDBMS is overkill for many web related projects. Those same PHP programmers who don't understand namespaces, and struggle with SQL, are going to eat this up.

I'm personally more interested in the Drizzle fork of MySQL than I am in CouchDB, but I'm definitely going to watch CouchDB closely so I can consider it for projects where it might fit the bill.

Thanks for the dynomite mention. If you want to p...

2009-01-01T18:21:00.000-08:00

Thanks for the dynomite mention. If you want to pick my brain about it, feel free to get in touch. Thanks.

@Jonathan, CouchDB will handle that for you.

2009-01-01T15:00:00.000-08:00

@Jonathan, CouchDB will handle that for you.

@JanL: "just use database-per-disk" is not a solut...

2009-01-01T14:18:00.000-08:00

@JanL: "just use database-per-disk" is not a solution; you just run into the "CouchDB is not a truly distributed database" problem that much quicker, i.e., you have to manually partition. Manual partitioning is more painful the more nodes you have to manage, so a system that can only effectively handle a single disk per node is that much less attractive.

@StephanMore importantly. Scalaris lacks persisten...

2009-01-01T14:05:00.000-08:00

@Stephan

More importantly, Scalaris lacks persistent storage. It is memory-only.

--

@Peter

Same for memcached.

> Writes are serialized. Not serialized as in ...

2009-01-01T14:03:00.000-08:00

> Writes are serialized. Not serialized as in the isolation level, serialized as in there can only be one write active at a time. Want to spread writes across multiple disks? Sorry.

Writes are serialized per database. Want to spread writes across multiple disks? Use multiple databases.
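A minimal sketch of what "use multiple databases" could look like on the client side, assuming the application routes each document to one of several databases by a stable hash of its id. The database names and the `pick_database` helper are hypothetical and are not part of CouchDB; this only illustrates the routing idea.

```python
import hashlib

# Hypothetical database names, e.g. one per disk; not anything CouchDB defines.
DATABASES = ["events_0", "events_1", "events_2", "events_3"]


def pick_database(doc_id, databases):
    """Map a document id to one database using a stable hash of the id."""
    digest = hashlib.sha1(doc_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(databases)
    return databases[index]


# The same id always lands in the same database, so reads can find it again.
first = pick_database("user:42", DATABASES)
second = pick_database("user:42", DATABASES)

# Spread a batch of ids and count how many land in each database.
counts = {name: 0 for name in DATABASES}
for n in range(1000):
    counts[pick_database(f"user:{n}", DATABASES)] += 1
```

Note that with modulo routing like this, changing the number of databases remaps most ids, so growing the set means moving data by hand. That is the manual-partitioning cost other commenters in this thread are pointing at.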

I'm much more excited about Scalaris.http://code.g...

2008-12-31T23:20:00.000-08:00

I'm much more excited about Scalaris.

http://code.google.com/p/scalaris/

(Really-)Distributed, Erlang, Transactional Key/Value-Store.

What they lack is the JSON-API (easy to do yourself) and the server side map reduce (I miss that one, it's cool! but I'm still not sure how important that is or if it can be done with Hadoop as easily)

Peace
-stephan

@Peter CouchDB uses MVCC, so "whoever saves first ...

2008-12-31T17:06:00.000-08:00

@Peter CouchDB uses MVCC, so "whoever saves first wins" on a single node. Out of date saves fail, so clients must retry saving after fetching (and hopefully merging) the latest rev.

Replication flags any conflicting writes as they come in from other nodes. So if your application can process a queue of conflicted revs, you can scale it to multiple disconnected replicas, and have a guarantee of eventual consistency, as long as the nodes eventually complete a successful replication.
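A minimal sketch of the single-node "whoever saves first wins" behavior described above, using a toy in-memory store rather than CouchDB itself. `TinyStore`, `ConflictError`, and `save_with_retry` are hypothetical names made up for illustration, and the integer revisions are a simplification; the sketch does not model replication or conflicts between nodes.

```python
class ConflictError(Exception):
    """Raised when a save carries a revision that is no longer current."""


class TinyStore:
    """Toy store: each doc is (rev, body); a save must name the current rev."""

    def __init__(self):
        self._docs = {}

    def get(self, doc_id):
        # A missing document is reported as revision 0 with an empty body.
        rev, body = self._docs.get(doc_id, (0, {}))
        return rev, dict(body)

    def put(self, doc_id, rev, body):
        current_rev, _ = self._docs.get(doc_id, (0, {}))
        if rev != current_rev:
            # Out-of-date save: someone else saved first.
            raise ConflictError(f"{doc_id}: have rev {current_rev}, got {rev}")
        self._docs[doc_id] = (current_rev + 1, dict(body))
        return current_rev + 1


def save_with_retry(store, doc_id, merge, max_attempts=5):
    """Fetch the latest rev, apply merge(body), and retry on conflict."""
    for _ in range(max_attempts):
        rev, body = store.get(doc_id)
        try:
            return store.put(doc_id, rev, merge(body))
        except ConflictError:
            continue
    raise ConflictError(f"{doc_id}: gave up after {max_attempts} attempts")


store = TinyStore()
rev_a, _ = store.get("cart")  # client A reads
rev_b, _ = store.get("cart")  # client B reads the same revision
store.put("cart", rev_a, {"items": ["apple"]})  # A saves first and wins

try:
    store.put("cart", rev_b, {"items": ["pear"]})  # B's rev is now stale
    stale_rejected = False
except ConflictError:
    stale_rejected = True


def add_pear(body):
    # Merge step: keep what is already there and append B's change.
    return {"items": body.get("items", []) + ["pear"]}


final_rev = save_with_retry(store, "cart", add_pear)
final_body = store.get("cart")[1]
```

The point of the merge function is that client B re-reads the latest body before saving, so A's change survives instead of being overwritten.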

If I'm correct, CouchDB has no way to synchronize ...

2008-12-31T16:32:00.000-08:00

If I'm correct, CouchDB has no way to synchronize accesses across multiple keys (no transactions or locks).

In my mind, that makes it totally useless for just about anything interesting. Why would I use CouchDB over memcached?

CouchDB is no replacement for a relational databas...

2008-12-31T11:00:00.000-08:00

CouchDB is no replacement for a relational database. However, I think a lot of the excitement over it is driven by developers for whom the relational model is overkill. DHH was mocked for calling his database "a hash in the sky" but the reality is, that's what a lot of applications really need.

As far as scalability is concerned, CouchDB currently supports multi-master replication, as well as offline or disconnected writes. Availability is favored over consistency, so application developers will have to take seriously the chance that two different users could update the same document within a short time frame on separate nodes.

However, as Couch progresses, we're working to add cluster-wide transactions (as well as integrated partitioning) so in the long term, Couch should be able to scale with your data.

Eric, yes, that's why I said it's competing with t...

2008-12-31T10:22:00.000-08:00

Eric, yes, that's why I said it's competing with those systems "in the popular imagination, if not in its author's mind." The official position has always been what you say, but a lot of the blog activity hypes it as more than that.

I think that all your points are valid.However, yo...

2008-12-31T09:43:00.000-08:00

I think that all your points are valid.

However, your premise is that somehow it will replace current relational database technologies. I don't think that's its goal (in fact, in the "Introduction to CouchDB" on its own website, it states that it's not "a replacement for relational databases").

I think that comparing CouchDB to PostgreSQL is just comparing two different things.

If your premise is that other database technologies are better for storing logical documents in a distributed way, then I'd be really interested to see a post on that.

Amen. The CouchDB website starts off "Apache Couch...

2008-12-31T09:29:00.000-08:00

Amen. The CouchDB website starts off "Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API." It took me a couple weeks before I realized that their definition of "distributed" was not my definition of "distributed".