Short version: no. A few specialized applications can and have been built on a plain DHT, but most applications built on DHTs have ended up having to customize the DHT's internals to achieve their functional or performance goals.
This paper describes the results of attempting to build a relatively complex datastructure (prefix hash trees, for range queries) on top of OpenDHT. The result was mostly failure:
A simple put-get interface was not quite enough. In particular, OpenDHT relies on timeouts to invalidate entries and has no support for atomicity primitives... In return for ease of implementation and deployment, we sacrificed performance. With the OpenDHT implementation, a PHT query operation took a median of 2–4 seconds. This is due to the fact that layering entirely on top of a DHT service inherently implies that applications must perform a sequence of put-get operations to implement higher level semantics with limited opportunity for optimization within the DHT.In other words, there are two primary problems with the DHT approach:
- Most DHTs will require a second locking layer to achieve correctness when implementing a more complex data structure on top of the DHT semantics. In particular, this will certainly apply to eventually-consistent systems in the Dynamo mold.
- Advanced functionality like range queries needs to be supported natively to be at all efficient.
This is one reason why I think Cassandra is the most promising of the open-source distributed databases -- you get a relatively rich data model and a distribution model that supports efficient range queries. These are not things that can be grafted on top of a simpler DHT foundation, so Cassandra will be useful for a wider variety of applications.
Comments
Also, I'm not sure what you mean about requiring additional locking. When working with eventually consistent DHTs, isn't this where the application is supposed to step in and resolve conflicts in the state of the resource caused by concurrent operations?
The additional locking refers to keeping the prefix hash structure consistent since updating that requires multiple commands sent to the underlying DHT. Not locking here will result in irresolvable corruption, so it's not a place to employ the "eventually consistent" approach to which you refer. See the actual paper for details.
(1) DHTs / key-value stores,
(2) nonrelational db's that scale but have deeper functionality such as range queries and secondary indexes, and
(3) traditional RDBMSes when one needs the most functionality from the DB (e.g. very complex transactions).
Really depends on the problem at hand -- the right tool for the right job.
(I work on MongoDB which is a database that fits into bucket (2) above.)
Currently we have gone too far in our mentality of a RDB for everything. The RDB is a great generic tool but it needs to be one of a set of options. I think we are starting to see the emergence and uptake of DBMS’s that are somewhere in the middle of the RDBMS and DHT which have a strong focus on scale, predictability, development model, manageability and transaction processing performance. This is much like the breakaway path we have seen emerging for high end analytics over the few years.
Found a really promising project that also has been shown in a google talk. http://www.zib.de/CSR/Projects/scalaris
It supports range queries and heaps of other stuff.
Second, Scalaris doesn't actually support range queries yet according to the most recent information I know of (http://groups.google.com/group/scalaris/msg/1d365e6605f1f030).
I'm with you on the idea that opening up the minimal interface a bit might be a big win though, I am particularly interested in MongoDB.
I agree but I think then it stops being a simple DHT and becomes a distributed key/value database which is kind of a DHT with knobs on.
TB
Google seems to be doing quite well for itself on Bigtable, for instance.
OrderPreservingPartitioner and range query are relatively new features (only 2 months old) in Cassandra, I'd expect plenty of bugs and race conditions.