Friday, November 16, 2007

Reed-Solomon libraries

If you want to run a multi-petabyte storage system then you don't want to do it with Raid 5 or Raid 6; with modern disks' ~3% per year failure rate, that's 300 a year when you have 10000 disks and the odds start to get pretty good (relatively speaking) that you'll face permanent data loss at some point when you lose a third disk from an array while two are rebuilding. And of course monitoring and replacing disks in lots of small arrays is manpower-intensive, which to investors translates as "expensive."

You probably don't want to go with triplication, either; disks are cheap, but not so cheap that you want to triple your hardware costs unnecessarily. While storing multiple copies of frequently used data is good, all your data probably isn't "frequently used."

What is the solution? As it turns out, Raid is actually a special case of Reed-Solomon encoding, which lets you specify any degree of redundancy you want. You can be safer than triplication with a fraction of the space needed.

I was prompted to write this because Mozy open-sourced the Reed-Solomon library I used while I was there, librs, complete with Python bindings. The original librs we used at Mozy was written by Byron Clark, a formidible task. Later we switched to the version you see on sourceforge, based on Plank's original encoder. I wasn't involved with librs at all except to fix a couple reference leaks in the Python wrapper.

But if you're actually looking for an rs library to use, Alen Peacock, who is much more knowledgeable than I about the gory details involved here, tells me that if you are starting from scratch the two libraries you should evaluate are zfec, which also comes with Python bindings, and Jerasure which is an updated -- i.e., probably faster than his first -- encoder by Plank. (Jerasure has nothing to do with Java.)

1 comment:

Shane said...

It's great to see these new libraries pop up. Reed Solomon accomplishes an amazing trick that more software developers should be aware of. I don't think the possibilities have been explored nearly enough.

The unintuitive truth is that a tunable error correction algorithm easily achieves much higher reliability than replication, with less hardware. For example, if I organize my data into blocks spanning 20 disks, then put 10 error correction blocks on 10 other disks, that data is statistically safer than it would be on a system that maintains 4 replicas. Even better, I only need 50% extra space instead of 300%! (In fact, 20+10 is overkill; I'd rather use around 20+7.)

I've created a petabyte-scale storage system based on RS called Bit Mountain, but my employer is not mature in the ways of open source, so I can't release it. Fortunately, several other groups are now seeing the light, including allmydata.org. I need to check out their Tahoe project.

Another way I'd like to see RS applied is in packet-level transmission, especially for VOIP. Bandwidth isn't a problem anymore and delays up to 500 ms are unimportant. What's bad is regular packet loss. RS could solve the packet loss by generating a stream of forward error correcting packets that accompany the normal packets. I wish I could just turn on some iptables filter to add RS coding to a connection.