Several reports from the recently concluded NoSQL Live event mentioned that I took a contrarian position on the "NoSQL in the Cloud" panel, arguing that traditional, bare metal servers usually make more sense. Here's why.
There are two reasons to use cloud infrastructure (and by "cloud" I mean commodity VMs such as those provided by Rackspace Cloud Servers or Amazon EC2):
- You only need a fraction of the capacity of a single machine
- Your demand is highly elastic; you want to be able to quickly spin up many new instances, then drop them when you are done (see the sketch after this list)
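To make the elasticity case concrete, here's a minimal sketch of bursting capacity up and tearing it down again. It uses today's boto3 EC2 API purely as an illustration; the AMI ID, instance type, and instance count are hypothetical placeholders, not recommendations.

```python
# Minimal sketch: burst capacity up, then tear it down when the job is done.
# Assumes AWS credentials are configured; the AMI ID and instance type are
# hypothetical placeholders.
import boto3

ec2 = boto3.resource("ec2")

# Spin up a batch of worker VMs for a temporary spike in demand.
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical AMI
    InstanceType="m5.large",
    MinCount=8,
    MaxCount=8,
)

# ... run the elastic workload here ...

# Drop the capacity the moment it's no longer needed; billing stops with it.
for instance in instances:
    instance.terminate()
```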
Databases usually fit neither profile: a serious database consumes a whole machine (or several), and its capacity requirements tend to grow in one direction only. When capacity only grows, it makes little sense to pay a premium for the flexibility of being able to shed capacity nearly instantly, especially when the virtualization layer also costs you I/O performance (the most common bottleneck for databases) in the bargain. That's why, despite working for a cloud provider, I don't think the cloud is always a good fit for databases. (It doesn't hurt that Rackspace also offers classic bare metal hosting in the same data centers, so you can have the best of both worlds.)
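To put rough numbers on that trade-off, here's a back-of-envelope sketch. All prices and the resulting utilization figure are hypothetical, purely for illustration.

```python
# Back-of-envelope sketch of when the cloud's elasticity premium pays off.
# All prices are hypothetical, purely for illustration.
vm_cost_per_hour = 0.50           # hypothetical commodity VM price
dedicated_cost_per_month = 200.0  # hypothetical comparable bare-metal box

hours_per_month = 730
vm_cost_per_month = vm_cost_per_hour * hours_per_month  # $365 at full utilization

# Running the VM around the clock costs more than the dedicated server,
# so elasticity only wins if you can turn capacity off often enough.
break_even_utilization = dedicated_cost_per_month / vm_cost_per_month
print(f"Break-even utilization: {break_even_utilization:.0%}")
# -> ~55%: below that, pay-as-you-go wins; above it, bare metal is cheaper.
# A database whose capacity needs only grow stays near 100% utilization,
# which is the worst case for this trade.
```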
Comments
Seems that NoSQL on top of VMs, auto-scaling to handle more requests to the DB (ignore storage, just requests), could be smart for scalability.
The catch is that adding capacity to a database means moving data onto the new nodes, which adds load exactly when you can least afford it. So it's generally better to just try to keep enough capacity for peak load.
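A rough sketch of why, using hypothetical figures for data size and streaming throughput: a new database node has to receive its share of the data before it can serve anything, unlike a stateless app server.

```python
# Rough sketch of why a database can't auto-scale like stateless app servers:
# a new node has to receive its share of the data before it helps at all.
# Both figures below are hypothetical, for illustration only.
data_per_node_gb = 500           # hypothetical data each new node must receive
streaming_throughput_mb_s = 100  # hypothetical sustained transfer rate

seconds = (data_per_node_gb * 1024) / streaming_throughput_mb_s
print(f"Time to bootstrap one node: {seconds / 3600:.1f} hours")
# -> ~1.4 hours of heavy streaming I/O, landing on a cluster that is
# already near capacity: too slow to absorb a sudden traffic spike.
```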
I think the primary utility of starting on a cloud instance is that Cassandra's appealing price scaling works downward too: you can start small without a hefty up-front price point. Sure, it's no big deal for most companies to buy a few dedicated machines, but for side projects that "might be something big one day," it's nice to start with something cheaper.
That way you can design for the future without having to fully pay for it until it's actually needed. :)
What are your views on database scaling on platforms like Slicehost, Linode, etc.?
What does this say for cloud applications today? Amazon.com is an excellent example. Amazon.com is a massive, mostly-static database. Once you deploy the database to a cloud node, updating it requires significantly less bandwidth, and replicating database updates back to Amazon corporate HQ similarly takes relatively little bandwidth.

In short, the big difference today is that we're exposing data *to customers*. Once you start doing that, it makes sense to put a copy of the data being served close to the program that serves it, and to put that program out in "the cloud," close to where the customers are.

This does not, however, mean that all data can live out in the cloud. Even Amazon.com likely doesn't do that: you can bet that their corporate crown jewels are well hidden inside corporate HQ, and all the massive data processing needed for data mining etc. runs on their local copy of the data, not on data that's out in the cloud.
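Here's a minimal sketch of the pattern this comment describes: reads served from a local replica near the customer, with the authoritative copy kept at the origin and only incremental updates shipped outward. All class names and the in-memory stores are hypothetical stand-ins for real databases.

```python
# Minimal sketch of the edge-replica pattern described above: reads are
# served from a local copy near the customer; the authoritative copy stays
# at the origin, which ships only new updates out to the edges.
# All names and the in-memory "stores" are hypothetical stand-ins.

class OriginStore:
    """Authoritative copy, kept at HQ."""
    def __init__(self):
        self.data = {}
        self.log = []  # ordered update log to replicate outward

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def updates_since(self, offset):
        return self.log[offset:]

class EdgeReplica:
    """Read-mostly copy deployed close to customers."""
    def __init__(self, origin):
        self.origin = origin
        self.data = {}
        self.offset = 0

    def read(self, key):
        return self.data.get(key)  # served locally, no round trip to HQ

    def sync(self):
        # Pull only the new updates: far less bandwidth than re-shipping
        # the whole (mostly static) dataset.
        for key, value in self.origin.updates_since(self.offset):
            self.data[key] = value
            self.offset += 1

origin = OriginStore()
edge = EdgeReplica(origin)
origin.write("product:42", {"title": "Widget", "price": 9.99})
edge.sync()
print(edge.read("product:42"))
```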