Comments on Spyced: Why I like the Cassandra distributed database

Right, I am thinking of something similar. Althoug...

2009-09-21T11:30:17.956-07:00

Right, I am thinking of something similar.
Although MogileFS is great and even proven to perform at Digg, I am looking for a Java alternative. Don't ask me why :-)

Thanks for your reply.

Cassandra is really designed for storing structure...

2009-09-21T11:13:28.102-07:00

Cassandra is really designed more for storing structured and semi-structured data than for large media blobs, although I sketched out what large-object support might look like here: https://issues.apache.org/jira/browse/CASSANDRA-265

You could also use Cassandra to store metadata about where your blobs are stored on conventional servers, for something like MogileFS w/o the MySQL bottleneck.
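A minimal sketch of that metadata idea, with a plain Python dict standing in for a Cassandra column family. The `BlobIndex` class, the host names, and the paths are all hypothetical, and no real Cassandra or MogileFS client call is shown:

```python
from collections import defaultdict


class BlobIndex:
    """Maps a blob key to the storage servers that hold a copy of it.

    The dict is only a stand-in for a column family:
    row key = blob key, column name = host, column value = path on that host.
    """

    def __init__(self):
        self._rows = defaultdict(dict)

    def add_replica(self, blob_key, host, path):
        # Record that `host` stores a copy of the blob at `path`.
        self._rows[blob_key][host] = path

    def remove_replica(self, blob_key, host):
        # Forget one copy; a missing host is silently ignored.
        self._rows[blob_key].pop(host, None)

    def locations(self, blob_key):
        # Sorted so the result is deterministic.
        return sorted(self._rows.get(blob_key, {}).items())


index = BlobIndex()
index.add_replica("img/42.jpg", "store-a.example", "/data/0/42.jpg")
index.add_replica("img/42.jpg", "store-b.example", "/data/7/42.jpg")
print(index.locations("img/42.jpg"))
```

The blobs themselves never pass through the index; a client would look up the locations and then fetch the file from one of the listed servers directly.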

Hi Jonathan, Thanks for the info, it helps a lot. ...

2009-09-21T11:05:18.000-07:00

Hi Jonathan,
Thanks for the info, it helps a lot.
One question: is Cassandra a good fit for keeping media (images, videos, etc.) on distributed nodes?
AFAIK, the Thrift interface won't allow streaming access to the db.

Can you please shed some light? Any help is appreciated.

Cheers

Thanks for taking the time to look at Cassandra, P...

2009-04-22T11:54:00.000-07:00

Thanks for taking the time to look at Cassandra, Phil. Certainly back in January it took a brave soul to dive in. We're hoping to make things a little more user-friendly as we approach a release. :)

I'd reply on your blog but I don't see a comment section? So here goes:

* we do plan to add the ability to add and remove columnfamilies after cluster start. the "right" way is to use something like ZooKeeper. the "quick hack" way is to just allow adding new CFs via the config xml -- if anyone is really blocking on this I can point you in the right direction; it's a couple hours of work is all. Otherwise I'd rather wait until we can do it "right."
* the api isn't set in stone; we're definitely willing to evaluate proposals for new features.
* for your specific example, the functionality is already there; get_column will take either columnfamily:column or supercolumnfamily:supercolumn:column. This is admittedly confusing, but it does work. :)
* take another look at Remove; I've done a lot of work on making that actually work. (similar to get_column you can specify varying degrees of granularity there -- you can remove a subcolumn, supercolumn, or entire columnfamily associated w/ a key.)
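To illustrate the colon-delimited addressing described in the list above, here is a rough sketch. It is not Cassandra's own parsing code: `ColumnPath`, `parse_path`, and the `is_super` flag are hypothetical names, and the flag exists only because a two-part path is ambiguous unless you know whether the column family is a super column family.

```python
from typing import NamedTuple, Optional


class ColumnPath(NamedTuple):
    column_family: str
    super_column: Optional[str]
    column: Optional[str]


def parse_path(path: str, is_super: bool) -> ColumnPath:
    """Split 'cf:col' or 'supercf:supercol:col' into its parts.

    Shorter paths address coarser units: a whole supercolumn,
    or everything in the column family for a key.
    """
    parts = path.split(":")
    max_parts = 3 if is_super else 2
    if not 1 <= len(parts) <= max_parts or "" in parts:
        raise ValueError(f"bad column path: {path!r}")
    if is_super:
        # Pad with None so missing levels mean "everything below here".
        family, super_column, column = parts + [None] * (3 - len(parts))
        return ColumnPath(family, super_column, column)
    column = parts[1] if len(parts) == 2 else None
    return ColumnPath(parts[0], None, column)


print(parse_path("Users:email", is_super=False))
print(parse_path("Inbox:2009-04:subject", is_super=True))
print(parse_path("Inbox:2009-04", is_super=True))
```

The third call addresses a whole supercolumn, which is the coarser granularity mentioned for remove.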

Feel free to drop by #cassandra if you have any more questions!

Make sure that the API of cassandra is enough for ...

2009-04-22T10:27:00.000-07:00

Make sure that the API of Cassandra is enough for what you need. There are some operations on supercolumns that you might expect to be there but are missing (I'm sure for good reason). I went through and documented the Cassandra API a few months ago when I was investigating it.

As an FYI, you don't have a link to your github on...

2009-04-20T13:27:00.000-07:00

As an FYI, you don't have a link to your github on
" (Update: also a commit Thursday added incorrectly-indented docstrings. Wonderful. Here is my github tree w/o that problem.) "

As an aside, calling this CAP business a theorem has been bugging me, but I guess that http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.1495
establishes it at least in some models.

I have to agree on the above comments' points abou...

2009-03-31T15:09:00.000-07:00

I have to agree on the above comments' points about a Thrift library through buildbot/etc... but would also like to point out that HBase has a Thrift API which you have ignored in your post.

It's no secret that Amazon is still a major Or...

2009-03-28T14:35:00.000-07:00

It's no secret that Amazon is still a major Oracle customer; an A & P system like Dynamo is not suitable for certain kinds of data.

You can definitely pre-generate the pure Python li...

2009-03-27T15:13:00.000-07:00

You can definitely pre-generate the pure Python library & client code. Not sure about the `fastbinary` C extension -- it may depend on the C++ library, which would make things trickier.

Buildbot would be great. I have no idea how to set one up. (Maybe once Thrift gets a stable release out -- they are working on that now.)

/me looks at the thrift depends list. Possibly stup...

2009-03-27T15:04:00.000-07:00

/me looks at the thrift depends list.

Possibly stupid question: from a system w/ a working thrift install, is it possible to generate the .py (or whatever lang) and then use that generated code on other systems?

I'm asking because it seems like a client-library buildbot would be useful here.

No?

It's fatal for my needs. (I chose my words carefu...

2009-03-27T10:23:00.000-07:00

It's fatal for my needs. (I chose my words carefully; I said 'decision' rather than 'mistake.' :)

Like I said, YMMV depending on your requirements.

> Then there are a whole bunch of "me too...

2009-03-27T10:19:00.000-07:00

> Then there are a whole bunch of "me too" key value stores that made fatal architecture decisions or are writing a "memcached with persistance" without really thinking it through.

Selecting the C and P parts of the CAP triangle is not a "fatal architecture decision", it is simply one choice among several equally valid options. By selecting availability over consistency Vogels/Amazon has decided to use application-specific code to reconcile inconsistent data; anyone who makes the A & P choice is doing this and it can impose a non-trivial cost on the end-users of the system.

"Eventually consistent" can easily turn into "not consistent" if the end-user isn't cognizant of the tradeoffs involved and doesn't understand what burden is being shifted from the database to their own codebase. An equally valid claim could be made that distributed databases which opt for availability and partition-tolerance over consistency will be good choices for end-users who never really needed a database in the first place, while people who need ACID properties are more likely to hold on to consistency as a key property in their database.
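To make the reconciliation burden concrete, here is a toy sketch of application-level merging. It assumes a shopping-cart value stored as a set and a simple union merge; `reconcile` and the replica names are hypothetical, and this is not how any particular database implements it.

```python
def reconcile(siblings):
    """Application-level merge: take the union of all divergent versions."""
    merged = set()
    for cart in siblings:
        merged |= cart
    return merged


base = {"book", "pen"}

# During a partition, two replicas accept conflicting writes.
replica_a = base | {"lamp"}  # one client adds a lamp via replica A
replica_b = base - {"pen"}   # another client removes the pen via replica B

merged = reconcile([replica_a, replica_b])
print(sorted(merged))  # ['book', 'lamp', 'pen']
```

The union keeps the added lamp but also brings back the pen that was removed, so the application has to decide whether that is acceptable or carry extra bookkeeping (for example, deletion markers) to get the merge right.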