RAID, Can it Fail? If it Does is Data Recovery Possible?

Originally, as envisaged in 1987 by Patterson, Gibson and Katz from the University of California in Berkeley, the acronym RAID stood for a "Redundant Array of Inexpensive Disks". In short, a larger number of smaller, cheaper disks could be used in place of a single, much more expensive large hard disk, or even to create a disk that was larger than any currently available.
They went a stage further and postulated a variety of options that would not only provide a big disk at a lower cost, but could also improve performance or increase reliability at the same time. The options for improved reliability were needed partly because using multiple disks reduces the Mean Time Between Failures (MTBF): divide the MTBF of a single drive by the number of drives in the array and, in theory, a RAID will fail sooner than a single disk.
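The MTBF arithmetic above can be sketched in a few lines of Python. This is only the rule of thumb, assuming identical drives that fail independently at a constant rate; the function name and the 500,000-hour figure are illustrative assumptions, not measured values.

```python
def array_mtbf(drive_mtbf_hours: float, drive_count: int) -> float:
    """Approximate time to the first drive failure in an array.

    Assumes identical drives failing independently at a constant rate,
    which is the simplification behind the "divide by N" rule of thumb.
    """
    if drive_count < 1:
        raise ValueError("an array needs at least one drive")
    return drive_mtbf_hours / drive_count


# Hypothetical per-drive MTBF, for illustration only.
single_drive_mtbf = 500_000.0
for n in (1, 4, 8):
    hours = array_mtbf(single_drive_mtbf, n)
    print(f"{n} drive(s): about {hours:,.0f} hours to the first failure")
```

Without redundancy, the first drive failure is also the failure of the whole array, which is why the redundant levels described below exist.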
Today RAID is usually described as a "Redundant Array of Independent Disks". Technology has moved on, and even the most costly disks are not particularly expensive.
Six basic levels of RAID, numbered 0 to 5, are usually described, some geared towards performance, others towards improved fault tolerance. The first of these has no redundancy or fault tolerance, so it might not truly be considered RAID.
RAID 0 - Striped and not really "RAID"
RAID 0 provides capacity and speed but not redundancy. Data is striped across the drives, with all of the performance benefits that gives, but if one drive fails the RAID is dead, just as a single hard disk drive would be.
This is good for transient storage where performance matters but the data is either non-critical or a copy is also kept elsewhere. Other RAID levels are more suited for critical systems where backups might not be up-to-the-minute, or down-time is undesirable.
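As a rough illustration of striping, the sketch below maps a logical block number to a disk and a position on that disk. The function name, the parameter names and the 2-disk, 4-block-strip geometry are assumptions made for the example; real controllers choose their own strip sizes and layouts.

```python
def raid0_locate(logical_block: int, disk_count: int, strip_blocks: int) -> tuple[int, int]:
    """Map a logical block to (disk index, block number on that disk).

    Consecutive strips are dealt out to the disks in turn, so a long
    sequential transfer keeps every disk busy at once.
    """
    strip_index, within_strip = divmod(logical_block, strip_blocks)
    stripe_row, disk = divmod(strip_index, disk_count)
    return disk, stripe_row * strip_blocks + within_strip


# Hypothetical geometry: 2 disks, strips of 4 blocks.
for block in (0, 3, 4, 8, 13):
    disk, offset = raid0_locate(block, disk_count=2, strip_blocks=4)
    print(f"logical block {block} -> disk {disk}, block {offset}")
```

Because every disk holds part of almost every large file, losing any one disk damages data across the whole array.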
RAID 1 - Mirroring
RAID 1 is often used for the boot devices in servers or for critical data where reliability requirements are paramount. Usually 2 hard disk drives are used and any data written to one disk is also written to the other.
In the event of a failure of one drive the system can switch to single-drive operation. The failed drive can then be replaced and the data copied to the replacement to rebuild the mirror.
RAID 2
RAID 2 introduced error correction code generation to compensate for drives that did not have their own error detection. There are no such drives now, and there have not been for a long time, so RAID 2 is not really used anywhere.
RAID 3 - Dedicated Parity
RAID 3 stripes data down to the byte level, which adds a hardware overhead for no apparent benefit. It also introduces "parity", or error correction data, held on a separate drive, so an additional hard disk is needed that gives greater security but no additional space.
RAID 4 - Dedicated Parity
RAID 4 stripes to the block level, and like RAID 3 stores parity information on a dedicated drive.
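Parity in these levels is conventionally a byte-wise XOR of the data strips, which is what lets a single missing strip be recalculated from the others. The sketch below shows the idea for one stripe with a dedicated parity strip; the helper name and the sample data are assumptions made for illustration.

```python
from functools import reduce


def xor_strips(strips):
    """Return the byte-wise XOR of a list of equal-length strips."""
    if len({len(s) for s in strips}) != 1:
        raise ValueError("need at least one strip, all of the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


# One stripe: three data strips plus a parity strip on the dedicated drive.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_strips(data)

# Pretend the drive holding data[1] has failed: XOR the survivors and parity.
rebuilt = xor_strips([data[0], data[2], parity])
print(rebuilt == data[1])
```

The same calculation works whichever single strip is missing, including the parity strip itself.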
RAID 5 - The most common format
RAID 5 stripes at the block level but does not use a single dedicated drive for storing parity. Instead, parity is interspersed with the data: each stripe contains one strip of parity alongside its data strips, and the disk holding the parity changes from one stripe to the next.
This could mean, for example, that in a 3-disk RAID 5 there are data strips on disks 0 and 1 followed by a parity strip on disk 2. For the next stripe the data is on disks 0 and 2 with the parity on disk 1, then data on disks 1 and 2 with parity on disk 0.
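The rotation in the 3-disk example can be written as a small function. This is a sketch of that one layout only (parity starting on the last disk and moving one disk to the left each stripe); real controllers use several different rotation schemes, and the function name is an assumption.

```python
def raid5_layout(stripe: int, disk_count: int):
    """Return (parity_disk, data_disks) for a given stripe number.

    Parity starts on the last disk and moves one disk to the left with
    each stripe, wrapping round after disk 0.
    """
    parity_disk = (disk_count - 1) - (stripe % disk_count)
    data_disks = [d for d in range(disk_count) if d != parity_disk]
    return parity_disk, data_disks


for stripe in range(4):
    parity_disk, data_disks = raid5_layout(stripe, disk_count=3)
    print(f"stripe {stripe}: data on disks {data_disks}, parity on disk {parity_disk}")
```

Identifying which rotation a particular array uses is part of working out the RAID scheme before any recovery can start.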
RAID 5 is generally faster for smaller reads, so it is eminently suitable for server systems shared by large numbers of users creating smaller data files or accessing smaller amounts of data each time. For other applications, however, RAID 4 can outperform RAID 5 quite considerably.
Beyond RAID 5?
Advances on RAID 5 do exist, though in general these use RAID 5 techniques and enhance them, for example by mirroring two RAID 5 arrays, or by having 2 parity stripes.
RAID data recovery
It might be imagined that with all of this fault tolerance data recovery would not be a requirement, but things still go wrong.
With all RAID levels, logical corruption (damage to the file system) has just as devastating an effect as with a single hard disk. You might have a robustly stored file system, but it is a robustly stored and corrupted file system.
With RAID 0 the failure of one disk is terminal for the RAID. If data cannot be recovered from the failed disk then a percentage of the data is lost for good, and since RAID 0 uses data striping, this could be like losing 1 MB of data out of every 4 MB; the chances of that leaving any major files intact are low. Smaller files, those that fit entirely within the strips held on the working drives, may fortunately be intact. For larger files (e.g. Exchange or SQL databases) there will be considerable data loss and structural damage, and low-level work will be required to salvage any useful data from them.
For RAID levels where there is parity, and so the chance to recover from a single disk failure, the most common problems we see are:
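To see why small files may survive a RAID 0 disk loss while large ones rarely do, the sketch below works out which disks a contiguous run of bytes touches. It assumes the file is stored as one unfragmented extent, which real file systems do not guarantee; the function names and the 4-disk, 64 KiB-strip geometry are assumptions made for the example.

```python
def disks_touched(start: int, length: int, disk_count: int, strip_bytes: int) -> set[int]:
    """Return the set of disks holding any part of a contiguous byte range."""
    if length <= 0:
        return set()
    first_strip = start // strip_bytes
    last_strip = (start + length - 1) // strip_bytes
    # A run covering disk_count consecutive strips already touches every disk.
    last_needed = min(last_strip, first_strip + disk_count - 1)
    return {strip % disk_count for strip in range(first_strip, last_needed + 1)}


def survives(start: int, length: int, failed_disk: int,
             disk_count: int, strip_bytes: int) -> bool:
    """True if the byte range lies entirely on the working disks."""
    return failed_disk not in disks_touched(start, length, disk_count, strip_bytes)


KIB = 1024
# Hypothetical array: 4 disks, 64 KiB strips, disk 2 has failed.
print(survives(0, 100 * KIB, failed_disk=2, disk_count=4, strip_bytes=64 * KIB))
print(survives(0, 1024 * KIB, failed_disk=2, disk_count=4, strip_bytes=64 * KIB))
```

Any file longer than one full stripe necessarily has a piece on the failed disk, which is why large databases are hit hardest.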
Degraded running
A single disk fails and is ignored, or there is not a spare available and so one is ordered. Either way the RAID unit stays in operation but with a disk missing so there is no longer any redundancy.
Usually the hard disks in a RAID come from the same manufacturing batch and have been stored and run in the same environment. If the unit has been mishandled then every disk in the RAID has been mishandled. So there is quite a good chance that another drive will fail sometime soon, if not for any of the reasons just given then because bad things don't happen singly.
Multiple failure
Striped RAID with parity is fault tolerant if a single drive fails nice and cleanly. If multiple drives fail then the RAID is lost, and the same is true if one drive fails and destabilises the SCSI bus: multiple drives can then appear to fail, the RAID unit believes that they have failed, and so the RAID will not operate.
Configuration loss
When a RAID is configured, information is stored about the order of the disks, the size of a strip of data and so on. If there is a failure within the RAID controller and this information is lost then the RAID will not operate, and it is not always practicable to reinstate it.
Some RAID controllers will treat re-programming the RAID configuration as a rebuild request and re-write each of the disks, destroying the data.
People making it worse
One of the worst sounds we hear with RAID problems is that of human panic, and frantic attempts to repair the problem. "We're just going to try one more thing" is often the sound that signals the end of the data, as a RAID is repaired with the disks in the wrong slots, or rebuilt and set back to its original state.
What to do when a RAID fails
STOP
THINK
Make sure that anything you do is going to be non-destructive.
Get Advice
Do not let anyone push you into precipitous action. They might have a deadline and be applying pressure, but they will quickly forget their part in driving proceedings when the RAID is fatally damaged by a hurried repair attempt.
How can data be recovered from a RAID?
Much of RAID recovery is the same as for a single-disk recovery: data must be secured and backed up to guarantee that the problem will not be exacerbated. For logical problems the difficult work is all in the analysis of the file system; that it comes from a RAID makes no major difference once the RAID scheme has been identified and the correct access to it worked out.
For mirrored RAID, data can be "mixed and matched" from the good sectors of two drives to rebuild a good drive. With striped RAID schemes that use parity, data can be rebuilt at the stripe level rather than on a per-drive basis, so if there are bad sectors throughout more than one drive these can be corrected individually.
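The "mix and match" approach for a mirror can be sketched as below. Each drive image is modelled as a list of sectors, with None standing for a sector that could not be read; that representation and the function name are assumptions made for the example. The sketch does not handle sectors that read successfully from both drives but disagree, which a real recovery would have to investigate.

```python
def merge_mirrors(copy_a, copy_b):
    """Combine two mirror images sector by sector.

    Returns (merged sectors, list of sector numbers unreadable on both drives).
    """
    if len(copy_a) != len(copy_b):
        raise ValueError("mirror images must contain the same number of sectors")
    merged, lost = [], []
    for number, (a, b) in enumerate(zip(copy_a, copy_b)):
        sector = a if a is not None else b
        if sector is None:
            lost.append(number)
        merged.append(sector)
    return merged, lost


# Hypothetical images: each drive has bad sectors in different places.
drive_a = [b"s0", None, b"s2", None]
drive_b = [b"s0", b"s1", None, None]
merged, lost = merge_mirrors(drive_a, drive_b)
print(merged, lost)
```

Only sectors that are bad at the same position on both drives are truly lost.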
With non-redundant RAID schemes each sector that cannot be read from a disk results in data loss from the RAID set. For redundant RAID schemes, however, there is much that can be done to rebuild when data is missing. A RAID controller will take a disk off-line when it fails and operate in degraded mode, rebuilding the data from the missing disk on demand, but a data recovery process can be somewhat more sophisticated. With properly written recovery software the level of granularity can be one sector rather than one disk, so each sector that fails can be rebuilt so long as the corresponding sectors can be recovered from the remainder of the disks. Even if the next failed sector is on a different drive in the set, so long as the same sector can be read from the other disks a complete rebuild can be made.
For levels of RAID that have greater redundancy, the number of failed sectors across a set of disks can be even greater without data loss.
Even as data recovery specialists we are, however, still bound by the rules of mathematics. If sector 99 is missing from both disks 0 and 4 in a RAID 5 set then rebuilding the missing data is not possible.
Once the RAID/disk issues have been resolved, the data recovery process can continue just as it would for a single disk.
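Sector-level rebuilding, and the mathematical limit on it, can be illustrated with a minimal sketch for a single-parity set. It assumes XOR parity, models each disk image as a list of sectors with None for an unreadable sector, and treats the same sector number on every disk as one parity row; the function name and sample data are assumptions, not a description of any particular recovery tool.

```python
from functools import reduce


def rebuild_rows(disks):
    """Repair single missing sectors row by row in a single-parity set.

    Returns (repaired disk images, sector numbers that cannot be rebuilt).
    """
    repaired = [list(disk) for disk in disks]
    unrecoverable = []
    for sector in range(len(disks[0])):
        missing = [d for d, disk in enumerate(disks) if disk[sector] is None]
        if len(missing) == 1:
            # XOR of all surviving members of the row gives the missing one,
            # whether the missing member held data or parity.
            survivors = [disk[sector] for disk in disks if disk[sector] is not None]
            repaired[missing[0]][sector] = bytes(
                reduce(lambda a, b: a ^ b, column) for column in zip(*survivors)
            )
        elif len(missing) > 1:
            unrecoverable.append(sector)
    return repaired, unrecoverable


# Hypothetical 3-disk set; the third image holds the XOR of the first two.
disk0 = [None, b"\x02", None]
disk1 = [b"\x10", None, b"\x30"]
disk2 = [b"\x11", b"\x22", None]
repaired, unrecoverable = rebuild_rows([disk0, disk1, disk2])
print(repaired[0][0], repaired[1][1], unrecoverable)
```

Sectors 0 and 1 fail on different disks and are both rebuilt; sector 2 is missing from two disks at once, so single parity cannot recover it.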
Reviewed by SODIQ AFOLABI on March 02, 2018
