Monday, March 15, 2010

Why your data may not belong in the cloud

Several reports from the recently concluded NoSQL Live event mentioned that I took a contrarian position on the "NoSQL in the Cloud" panel, arguing that traditional bare-metal servers usually make more sense. Here's why.

There are two reasons to use cloud infrastructure (and by "cloud" I mean commodity VMs such as those provided by Rackspace Cloud Servers or Amazon EC2):

  1. You only need a fraction of the capacity of a single machine
  2. Your demand is highly elastic; you want to be able to quickly spin up many new instances, then drop them when you are done

Most people looking at NoSQL solutions are doing so because their data is (or soon will be) larger than a traditional solution can handle, so (1) is not a strong motivation. But what about (2)? At first glance, cloud looks like a great fit for painlessly adding capacity to a database cluster. But there's an important difference between load like web traffic, which bounces up and down frequently, and databases: with few exceptions, databases only get larger with time. You won't have 20 TB of data this week and 2 TB the next.

When capacity only grows in one direction it makes less sense to pay a premium for the flexibility of being able to reduce your capacity nearly instantly, especially when you also get reduced I/O performance (the most common bottleneck for databases) in the bargain because of the virtualization layer. That's why, despite working for a cloud provider, I don't think it's always a good fit for databases. (It doesn't hurt that Rackspace also offers classic bare metal hosting in the same data centers, so you can have the best of both worlds.)
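To make the "you can't spin up database capacity instantly" point concrete, here's a back-of-envelope sketch. All the figures in it are illustrative assumptions, not measurements from any particular provider or database:

```python
# Back-of-envelope: how long until a new database node is useful?
# Unlike a stateless web server, it can't serve reads until its share
# of the data has been copied over from the existing nodes.
# All numbers here are illustrative assumptions.

data_per_node_tb = 2.0     # data the new node will be responsible for
effective_mb_per_s = 40.0  # sustained copy rate, throttled so donor
                           # nodes can still serve live traffic

mb_to_copy = data_per_node_tb * 1024 * 1024       # TB -> MB
hours = mb_to_copy / effective_mb_per_s / 3600    # seconds -> hours

print(f"~{hours:.1f} hours before the new node can serve its data")
```

With these (made-up but not unrealistic) numbers, the copy alone takes over half a day, which is why adding database nodes mid-spike doesn't help the way adding web servers does.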

8 comments:

Lee said...

What about availability? A smart NoSQL management layer could detect down or underperforming nodes and automatically spin up a new instance to replace them.

Matt said...

If a site spikes in traffic, you'll need more IOPS from the cluster to serve the additional requests.

Seems that NoSQL on top of VMs, auto-scaling to handle more requests to the DB (ignoring storage, just requests), could be smart for scalability.

Jonathan Ellis said...

@Matt: that's the thing, though: with a database, you can't "ignore storage." You can't serve requests from a new database machine unless you move/replicate data to it first, and that takes I/O on the existing machines that isn't free. Trying to do it in the middle of a load spike could be a Very Bad Idea.

So it's generally better to just try to keep enough capacity for peak load.

Ben Bangert said...

Of course, if you start on something like Rackspace Cloud and keep moving up, when you hit the 'top', you have the whole machine to yourself. At which point you'd probably keep expanding by adding more 'bare metal' or 'managed' dedicated machines to the cluster.

I think the primary utility of starting on a cloud instance is that Cassandra's rather appealing price scaling extends downward without requiring a hefty starting price point. Sure, it's no big deal for most companies to buy a few dedicated machines, but for side projects that "might be something big one day," it's nice to start with something cheaper.

That way one can design for the future without having to fully pay for it until it's actually needed. :)

Christopher Clarke said...

Hi Jonathan
What are your views on database scaling on platforms like SliceHost, Linode, etc.?

Eric Lee Green said...

This I/O bandwidth issue, BTW, is one reason why cloud computing did not take off the last time it was the "fad of the day" -- in the mid-1970s. Back then it was assumed that soon no business would own its own computing infrastructure. Instead they would rent time on massive mainframes hooked up to the then-new packet-switching networks, and all they would have in-house would be 3270 terminals or ASCII terminals (depending on whether they were an IBM or non-IBM shop). The problem then became one of data ownership -- what if your cloud, er, timesharing vendor went bankrupt? So companies pulled their data back in-house... at which point accessing that data over the primitive communications infrastructure of the day became an insoluble bottleneck, requiring them to pull their computing infrastructure back in-house too. And that was that for the proto-cloud: the vendors labored on for another half-decade or so serving businesses too small for their own mainframes, and then personal computers put a stake in their heart.

What does this say for cloud applications today? Well, Amazon.com itself is an excellent example. Amazon.com is a massive, mostly-static database. Once you deploy the database to a cloud node, updating it requires significantly less bandwidth, and replicating database updates back to Amazon corporate HQ similarly takes relatively little bandwidth. In short, the big difference today is that we're exposing data *to customers*, and once you start doing that, it makes sense to put a copy of the data being served close to the program serving it, and to put that program out in "the cloud" close to where the customers are. This does not, however, mean that all data can live out in the cloud. Even Amazon.com likely doesn't do that -- you can bet that their corporate crown jewels are well hidden inside corporate HQ, and that all the heavy data processing needed for data mining etc. runs on their local copy of the data, not on data out in the cloud.

Scott said...

There's also the issue of operational overhead. A small development team without a dedicated operations team will save a lot of time and resources by deploying to the cloud and taking advantage of a cloud API that provides automation around things like EBS snapshot/restore capabilities, instant provisioning, and dynamically upgrading storage for a given set of machines. The operational savings in theory can more than make up for the added expense of having more VMs that are less performant.

Jonathan Ellis said...

@Scott That's a managed vs. do-it-yourself distinction, not a hardware vs. virtualized one. In fact, the self-service cloud providers tend to require a lot more in-house expertise than outsourcing to a managed provider like Rackspace.