
Why your data may not belong in the cloud

Several reports on the recently concluded NoSQL Live event mentioned that I took a contrarian position on the "NoSQL in the Cloud" panel, arguing that traditional, bare metal servers usually make more sense. Here's why.

There are two reasons to use cloud infrastructure (and by cloud I mean here "commodity VMs" such as those provided by Rackspace Cloud Servers or Amazon EC2):

  1. You only need a fraction of the capacity of a single machine
  2. Your demand is highly elastic; you want to be able to quickly spin up many new instances, then drop them when you are done

Most people looking at NoSQL solutions are doing so because their data is larger than a traditional solution can handle, or will be, so (1) is not a very strong motivation. But what about (2)? At first glance, cloud is a great fit for adding capacity to a database cluster painlessly. But there's an important difference between load like web traffic that bounces up and down frequently, and databases: with few exceptions, databases only get larger with time. You won't have 20 TB of data this week, and 2 next.

When capacity only grows in one direction it makes less sense to pay a premium for the flexibility of being able to reduce your capacity nearly instantly, especially when you also get reduced I/O performance (the most common bottleneck for databases) in the bargain because of the virtualization layer. That's why, despite working for a cloud provider, I don't think it's always a good fit for databases. (It doesn't hurt that Rackspace also offers classic bare metal hosting in the same data centers, so you can have the best of both worlds.)
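To make the argument above concrete, here is a rough sketch in Python. Every number in it is made up for illustration: the demand series, the unit price, and the 25% premium for elastic capacity are assumptions, not figures from any provider. It compares "grow-only" provisioning (capacity can be added but never returned, as with purchased or leased hardware) against fully elastic provisioning that tracks demand exactly but costs more per unit.

```python
from itertools import accumulate

def grow_only_cost(demand, unit_price):
    """Cost when capacity can be added but never removed.

    Provisioned capacity in each period is the highest demand seen so far.
    """
    return sum(accumulate(demand, max)) * unit_price

def elastic_cost(demand, unit_price, premium):
    """Cost when capacity tracks demand exactly, at a per-unit premium."""
    return sum(demand) * unit_price * premium

# Hypothetical demand per period, in arbitrary capacity units.
web_traffic = [2, 12, 3, 10, 2, 9]   # bounces up and down
database_tb = [2, 4, 6, 9, 14, 20]   # only ever grows

UNIT_PRICE = 1.0   # assumed price per capacity unit per period
PREMIUM = 1.25     # assumed markup for elastic capacity

for name, demand in [("web", web_traffic), ("database", database_tb)]:
    fixed = grow_only_cost(demand, UNIT_PRICE)
    elastic = elastic_cost(demand, UNIT_PRICE, PREMIUM)
    print(f"{name}: grow-only={fixed}, elastic={elastic}")
```

When demand never shrinks, the running maximum equals the demand itself, so elastic provisioning saves nothing and the premium is pure extra cost. With bouncing demand, the ability to release capacity can outweigh the premium. The sketch ignores the I/O penalty of virtualization, which would only tilt the database case further.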

Comments

Unknown said…
What about availability? A smart NoSQL management layer could detect down or underperforming nodes and automatically spin up new instances to replace them.
Matt said…
If a site spikes in traffic, you'll need IOPS to the cluster to serve more requests.

Seems that NoSQL on top of VMs to auto scale to handle more requests to the DB (ignore storage, just requests) could be smart for scalability.
Jonathan Ellis said…
@Matt: that's the thing, though: with a database, you can't "ignore storage." You can't serve requests from a new database machine unless you move/replicate data to it first, and that takes I/O on the existing machines that's not free. Trying to do it in the middle of a load spike could be a Very Bad Idea.

So it's generally better to just try to keep enough capacity for peak load.
Ben Bangert said…
Of course, if you start on something like Rackspace Cloud, and keep moving up, when you hit the 'top', you have the whole machine to yourself. At which point you'd prolly keep expanding by adding more 'bare metal' or 'managed' dedicated machines to the cluster.

I think the primary utility of starting on a cloud instance is that the price scaling, which is rather appealing with Cassandra, also works downward, without requiring a hefty starting price point. Sure, it's no big deal to buy a few dedicated machines for most companies, but for side projects that "might be something big one day", it's nice to start with something cheaper.

That way one can design for the future without having to fully pay for it until it's actually needed. :)
Hi Jonathan
What are your views on database scaling on platforms like Slicehost, Linode etc.?
Eric Lee Green said…
This I/O bandwidth issue, BTW, is one reason why cloud computing did not take off the last time that it was the "fad of the day" -- in the mid 1970's. In the mid 1970's it was assumed that soon no business would own their own computing infrastructure. Instead they would rent time on massive mainframes that were hooked up to the then-new packet switching networks, and all they would have in-house would be 3270 terminals or ASCII terminals (depending upon whether they were an IBM or non-IBM shop). The problem then became one of data ownership -- what if your cloud err timesharing vendor went bankrupt? So companies pulled their data back in-house... at which point accessing their data over the primitive communications infrastructure of the day became an insoluble bottleneck, thereby requiring them to pull their computing infrastructure back in-house too. And that was that for the proto-cloud, they labored on for another half-decade or so serving businesses too small for their own mainframes, and then personal computers put a stake in their heart.

What does this say for cloud applications today? Well, basically, Amazon.com is an excellent example of such. Amazon.com is a massive mostly-static database. Once you deploy the database to a cloud node, updating the database requires significantly less bandwidth, and replicating database updates back to Amazon Corporate HQ similarly takes relatively little bandwidth. In short, the big difference today is that we're exposing data *to customers*, and once you start doing that, it makes sense to put a copy of the data being served to customers close to the program that's doing the serving to customers, and it also makes sense to put the program doing the serving to customers out in "the cloud" close to where the customers are. This does not, however, mean that all data can live out in the cloud. Even Amazon.com likely doesn't do that -- you can bet that their corporate crown jewels are well hidden inside corporate HQ, and all massive data processing needed for data mining etc. is being run on their local copy of the data, not on data that's out in the cloud.
Scott said…
There's also the issue of operational overhead. A small development team without a dedicated operations team will save a lot of time and resources by deploying to the cloud and taking advantage of a cloud API that provides automation around things like EBS snapshot/restore capabilities, instant provisioning, and dynamically upgrading storage for a given set of machines. The operational savings in theory can more than make up for the added expense of having more VMs that are less performant.
Jonathan Ellis said…
@Scott That's a managed/do-it-yourself distinction, not a hardware/virtualized one. In fact, the self-service cloud providers tend to require a lot more in-house expertise than outsourcing that to a managed provider like Rackspace.
