If you use RAID, you still need backups

I have not personally used any RAID systems.  The acronym RAID stands for Redundant Array of Inexpensive Disks.  The basic idea is to put a bunch of PC disks into a configuration where each piece of data is stored in more than one place (the redundancy), using a controller that lets you treat the whole collection as a single disk.  This gives high disk capacity at modest prices, and the redundancy is supposed to provide protection against disk failure.

In a common setup, the RAID box has several disks that are designed to be hot-pluggable.  If a disk fails, you pull it out and plug in a replacement.  The RAID controller reconstructs what should have been on that disk from the redundant data on the other disks.  As a result, your system continues through the disk failure without interruption.
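To make the reconstruction step concrete, here is a minimal sketch in Python.  It assumes a single-parity scheme in which the parity block is the XOR of the data blocks, which is one common way to provide the redundancy; the post does not say which RAID level any particular box uses.  The names `xor_blocks` and `data_disks` and the block contents are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data disks, each holding one equal-sized block.
data_disks = [b"RAID", b"is n", b"ot a"]

# The parity block is the XOR of all the data blocks (an assumed scheme).
parity = xor_blocks(data_disks)

# Simulate losing disk 1, then rebuild it from the survivors plus parity.
failed = 1
survivors = [blk for i, blk in enumerate(data_disks) if i != failed]
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt == data_disks[failed])

# If a surviving block has been silently corrupted, the same rebuild
# produces wrong data, and parity alone cannot tell you which block is bad.
corrupted_survivors = [b"RAIX", data_disks[2]]
bad_rebuild = xor_blocks(corrupted_survivors + [parity])
print(bad_rebuild == data_disks[failed])
```

The rebuild works only because every surviving block is intact.  Once something writes bad data to the disks, the redundancy faithfully reproduces the damage, which is the situation a separate backup is meant to cover.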

As far as I know, this all works and works very well.  So RAID does protect you against the failure of individual disks.  But Murphy’s law is not thereby repealed.  Other things can go wrong besides an individual disk failure.  And when one of them does, you might have quite a mess on your hands.

There was an example of this at the dslreports forum.  The site was taken down by a power failure on April 16th, 2012.  Two weeks later, the site is still down, and it is not certain when it will be back up.

As best I can glean, from what information is available, the RAID controller failed.  But apparently it didn’t fail dead.  It failed in such a way that when it was brought back up it managed to corrupt some of the data on the disks, thus making recovery harder.  It appears that there was no backup, beyond the redundancy that the RAID system is supposed to provide.

The moral of the story seems to be this:  If your data is large enough to benefit from RAID, then it is large enough to warrant regularly backing up that RAID.  The backup should probably be off-site, so that you can also recover from a serious failure at the site.

There’s also a report of a somewhat similar problem at openSUSE, where apparently two disks failed at the same time.  Fortunately, in that case there were off-site mirrors, so recovery should be faster.


About Neil Rickert

Mathematician and computer scientist who dabbles in cognitive science.
