
RAID and Data Protection Solutions for Linux

When a system administrator is first asked to provide a reliable, redundant means of protecting critical data on a server, RAID is usually the first term that comes to mind. In fact, RAID is just one part of an overall data availability architecture. RAID, and some of the complementary storage technologies, are reviewed below.

RAID, short for Redundant Array of Inexpensive Disks, is a method whereby information is spread across several disks, using techniques such as disk striping (RAID level 0) and disk mirroring (RAID level 1) to achieve redundancy, lower latency, higher bandwidth for reading and/or writing, and recoverability from hard-disk crashes. More than six different RAID configurations have been defined. A brief introduction can be found in Mike Neuffer's What Is RAID? page.

If you are a sysadmin contemplating the use of RAID, I strongly encourage you to use EVMS instead. It's a more flexible tool that uses RAID under the covers, and provides a better and more comprehensive storage solution than stand-alone RAID.

Types of Data Loss

Many users come to RAID with the expectation that using it will prevent data loss. This is expecting too much: RAID can help avoid data loss, but it can't prevent it. To understand why, and to be able to plan a better data protection strategy, it is useful to understand the different types of failures, and the way they can cause data loss.

Accidental or Intentional Erasure
One of the leading causes of data loss is the accidental or intentional erasure of files by you or another (human) user. This includes files erased by hackers who broke into your system, files erased by disgruntled employees, and files erased by you, whether because you thought they weren't needed any more, or out of a sense of discovery, to find out what old-timers mean when they say they fixed it for good with the wizardly command su - root; cd /; rm -r *. RAID will not help you recover data lost in this way; to mitigate these kinds of losses, you need to perform regular backups (to archive media that can't easily be lost in a fire, stolen, or accidentally erased).
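
If you don't yet have a backup habit, even a crude nightly cron-driven tar job is better than nothing. A minimal sketch, not a tested script; the paths and the retention policy are hypothetical, and /backup is assumed to live on separate media:

    #!/bin/sh
    # /etc/cron.daily/backup-home -- a rough sketch, adjust to taste.
    # /backup is assumed to be mounted from a separate disk or server.
    tar czf /backup/home-$(date +%Y%m%d).tar.gz /home
    # Keep only the fourteen most recent archives.
    ls -t /backup/home-*.tar.gz | tail -n +15 | xargs -r rm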

Total Disk Drive Failure
One possible disk drive failure mode is "complete and total disk failure". This can happen when a computer is dropped or kicked, although it can also happen simply because the drive has grown old. Typically, the read head crashes into the disk platter, trashing the head and rendering everything on that platter unreadable; if the disk drive has only one platter, that means everything. Failure of the drive electronics (due to e.g. electrostatic discharge) can produce the same symptoms. This is the pre-eminent failure mode that RAID protects against: by spreading data redundantly across many disks, the total failure of any one disk will not cause any actual data loss. A far more common disk failure mode, however, is a slow accumulation of bad blocks: disk sectors that have become unreadable. RAID does not protect against this kind of data corruption; the case is discussed in detail below.

Power Loss and Ensuing Data Corruption
Many beginners think that they can test RAID by starting a disk-access-intensive job, and then pulling the plug while it is running. This is practically guaranteed to cause some kind of data corruption, and RAID does nothing to prevent it or to recover the resulting lost data. This kind of data corruption/loss can be avoided by using a journaling file system, and/or a journaling database server (to avoid data loss in a running SQL server when the system goes down). In discussions of journaling, there are typically two types of protection that can be offered: journaled meta-data, and journaled (user's) data. The term "meta-data" refers to the file name, the file owner, creation date, permissions, and so on, whereas "data" is the actual contents of the file. By journaling the meta-data, a journaling file system can guarantee fast system boot times by avoiding long integrity checks during boot. However, journaling the meta-data does not prevent the contents of a file from getting scrambled. Note that most journaling file systems journal only the meta-data, and not the data (ext3 can be made to journal data, but at a tremendous performance loss). Note also that databases have their own unique ways of guaranteeing data integrity in the face of power loss or system crash.
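
For example, ext3 exposes this choice as a mount option: the default (data=ordered) journals only the meta-data, though it orders data writes to limit the damage, while data=journal journals the file contents as well, at the performance cost noted above. A sketch, with a hypothetical device name:

    # Journal both data and meta-data (safest, slowest):
    mount -t ext3 -o data=journal /dev/hda2 /home

    # The equivalent /etc/fstab entry:
    /dev/hda2   /home   ext3   data=journal   1 2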

Bad Blocks on Disk Drive
The most common form of disk drive failure is a slow but steady loss of 'blocks' on the disk drive. Blocks can go bad in a number of ways: microscopic dust sticking to the platter, gouges in the platter where the head struck it, magnetic media applied too thinly at the factory or worn off through contact, etc. Over time, bad blocks can accumulate, and, in my personal experience, as fast as one a day. Once a block is bad, data cannot be read from it. Bad blocks are not uncommon: all brand-new disk drives leave the factory with hundreds (if not thousands) of bad blocks on them. The hard-drive electronics can detect a bad block and automatically reassign a new, good block from elsewhere on the disk in its place; all subsequent accesses to that block by the operating system are automatically and transparently redirected by the disk drive. This feature is both good and bad. As blocks slowly fail on the drive, they are automatically handled until one day the bad-block lookup table on the hard drive is full. At this point, bad blocks become painfully visible to the operating system: Linux grinds to a near halt, while spewing dma_intr: status=0x51 { DriveReady SeekComplete UnrecoverableError } messages.

Despite this being the most common disk failure mode, there are painfully few solutions and precious little that one can do. RAID, even in theory, does not address this problem, nor does file system journaling. At this point, I am aware of only two options: (1) run badblocks, or (2) use EVMS. The first option, in the form of 'e2fsck -f -cc', is terrible: it can only be run on an unmounted file system that was built on a raw disk partition, and it's painfully slow. A 5 or 10 gig partition can take up to an hour, and a 160 gig partition can take a day. Furthermore, it works only on a raw disk partition: if the file system sits on top of a RAID md device, or an LVM logical volume, the exercise is pointless. I have not yet personally tried EVMS ...
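
If you just want to know how bad things are, a read-only surface scan is considerably less risky than the full e2fsck pass, and can be run first. A sketch, with a hypothetical device name:

    # Read-only scan; -s shows progress, -v reports each bad block found:
    badblocks -sv /dev/hda1

    # The painfully slow option: unmount, then run a non-destructive
    # read-write test and record any bad blocks in the file system:
    umount /dev/hda1
    e2fsck -f -cc /dev/hda1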

The SMART (Self-Monitoring, Analysis and Reporting Technology) tools ide-smart and smartsuite can help you figure out whether this failure mode is about to bite you.

It is likely that md (Linux Software RAID) will gain bad-block replacement capabilities in the 2.5.x kernel series. See this (nasty) discussion on LKML.

General System Corruption
Windows users are familiar with the vague and uneasy regression of one's system into total chaos, eventually necessitating a clean-slate reinstall of the operating system. Due to bugs in the operating system, the database server, and other applications, there is a slow buildup of corrupted data until the system finally becomes unusable. There is little that one can do about this, other than to stay away from Windows (Win95/98 in particular), and avoid putting mission-critical services on beta software. Unfortunately, even regular data backups do little to avoid this kind of corruption: most likely, one is backing up corrupted data. The good news is that this is an uncommon phenomenon under Linux; I can't name any examples of this kind of corruption. That is not to say that it doesn't occur: although unseen when handling ordinary files under Linux/ext2fs, it may show up in some database products, or in systems that do a lot of document mangling (e.g. due to an obscure bug in a word processor). While Linux won't crash if the word processor has a bug in it, this kind of bug can lead to irretrievable data loss, which can be almost as bad. Other than file archiving, I know of no strategies for dealing with this kind of data loss.

Note that this kind of corruption can also occur due to bad hardware, cabling, or even an electrically noisy environment. A loose cable may slowly corrupt data, although it will usually show itself in other ways, which the device driver will interpret as broken hardware.

Linux RAID Solutions

There are three types of RAID solutions available to Linux users: software RAID, outboard DASD boxes, and RAID disk controllers.

Software RAID
Pure software RAID implements the various RAID levels in the kernel disk (block device) code. Pure-software RAID offers the cheapest possible solution: not only are expensive disk controller cards or hot-swap chassis not required, but software RAID works with cheaper IDE disks as well as SCSI disks. With today's fast CPUs, software RAID performance can hold its own against hardware RAID in all but the most heavily loaded and largest systems. The current Linux Software RAID is becoming increasingly fast, feature-rich and reliable, making many of the lower-end hardware solutions uninteresting. Expensive, high-end hardware may still offer advantages, but the nature of those advantages is not entirely clear.

Note that there are currently two Linux Software RAID implementations: the md (multi-disk) driver, which has been around since the early linux-2.0.x days, and the newer EVMS driver. The EVMS driver appears to be disk-format compatible with md. Features of the md driver include linear concatenation, RAID levels 0, 1, 4 and 5, hot-spare disks, and background reconstruction of a degraded array.
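
As a quick taste of the md driver in action, here is a sketch using the mdadm utility (device names hypothetical; the older raidtools package accomplishes the same thing with mkraid and /etc/raidtab):

    # Build a two-disk mirror (RAID-1):
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/hda1 /dev/hdc1

    # Watch array state and reconstruction progress:
    cat /proc/mdstat

    # Exercise the failure path: fail a disk, pull it, put it back.
    mdadm /dev/md0 --fail /dev/hdc1
    mdadm /dev/md0 --remove /dev/hdc1
    mdadm /dev/md0 --add /dev/hdc1    # kicks off a mirror rebuild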

Note that while Linux MD is tried and true, reliable, robust, and does what it promises, its development is essentially at a standstill. For this reason, it seems that EVMS, with its greater development activity and strong long-term vision, is the technology to investigate first.

Outboard DASD Solutions
DASDs (Direct Access Storage Devices, an old IBM mainframe term) are separate boxes that come with their own power supply, provide a cabinet/chassis for holding the hard drives, and appear to Linux as just another SCSI device. In many ways, these offer the most robust RAID solution. Most boxes provide hot-swap disk bays, where failing disk drives can be removed and replaced without turning off the power. Outboard solutions usually offer the greatest choice of RAID levels: RAID 0, 1, 3, 4 and 5 are common, as well as combinations of these levels. Some boxes offer redundant power supplies, so that the failure of a power supply will not disable the box. Finally, with Y-SCSI cables, such boxes can be attached to several computers, allowing high availability to be implemented, so that if one computer fails, another can take over operations.

Because these boxes appear as a single drive to the host operating system, yet are composed of multiple SCSI disks, they are sometimes known as SCSI-to-SCSI boxes. Outboard boxes are usually the most reliable RAID solutions, although they are usually the most expensive (e.g. some of the cheaper offerings from IBM are in the twenty-thousand-dollar ballpark). The high end of this technology is frequently called 'SAN', for 'Storage Area Network', and features cable lengths that stretch to kilometers, and the ability for a large number of host CPUs to access one array.

Inboard DASD Solutions
Similar in concept to outboard solutions, there are now a number of bus-to-bus RAID converters that will fit inside a PC case. These come in several varieties. One style is a small disk-like box that fits into a standard 3.5-inch drive bay and draws power from the power supply in the same way that a disk would. Another style plugs into a PCI, ISA or MicroChannel slot, and uses that slot only for electrical power (and the space it provides).

Both SCSI-to-SCSI and EIDE-to-EIDE converters are available. Because these are converters, they appear as ordinary hard drives to the operating system, and do not require any special drivers. Most such converters seem to support only RAID 0 (striping) and 1 (mirroring), apparently due to size and cabling restrictions.

The principal advantages of inboard converters are price, reliability, ease-of-use, and in some cases, performance. Disadvantages are usually the lack of RAID-5 support, lack of hot-plug capabilities, and the lack of dual-ended operation.

RAID Disk Controllers
Disk controllers are adapter cards that plug into the ISA/EISA/PCI bus. Just as with regular disk controller cards, a cable attaches them to the disk drives. Unlike regular disk controllers, RAID controllers implement RAID on the card itself, performing all the operations necessary to provide the various RAID levels. Just as with outboard boxes, the Linux kernel does not know (or need to know) that RAID is being used. However, just like ordinary disk controllers, these cards must have a corresponding device driver in the Linux kernel to be usable.

If the RAID disk controller has a modern, high-speed DSP/controller on board, and a sufficient amount of cache memory, it can outperform software RAID, especially on a heavily loaded system. However, using an old controller on a modern, fast 2-way or 4-way SMP machine may easily prove to be a performance bottleneck as compared to a pure software-RAID solution. Some of the performance figures below provide additional insight into this claim.

Related Data Storage Protection Technologies

There are several related storage technologies that can provide various amounts of data redundancy, fault tolerance and high-availability features. These are typically used in conjunction with RAID, as a part of the overall system data protection design strategy.

SAN and NAS
There are a variety of high-end storage solutions available for large installations. These typically go under the acronyms 'NAS' and 'SAN'. NAS stands for 'Network Attached Storage', and refers to NFS and Samba servers that Unix and Windows clients can mount. SAN stands for 'Storage Area Network', and refers to schemes that are the conceptual equivalent of thousand-foot-long disk-drive ribbon cables. Although the cables themselves may be fiber-optic (Fibre Channel) or Ethernet (e.g. iSCSI), the attached devices appear to be 'ordinary disk drives' from the point of view of the host computer. These systems can be quite sophisticated: for example, this white paper describes a SAN-like system that has built-in RAID and LVM features.

Journaling
Journaling refers to the concept of having a file system write a 'diary' of information to the disk in such a way as to allow the file system to be quickly restored to a consistent state after a power failure or other unanticipated hardware/software failure. A journaled file system can be brought back on-line quickly after a system reboot, and, as such, is a vital element of building a reliable, available storage solution.

There are a number of journaled file systems available for Linux, including ext3, ReiserFS, IBM's JFS, and SGI's XFS.

These different systems have different performance profiles and differ significantly in features and functions. There are many articles on the web which compare these. Note that some of these articles may be out-of-date with respect to features, performance or reputed bugs.
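
Note that ext3 deserves special mention for its backwards compatibility: an existing ext2 file system can be given a journal in place, with no reformatting. A sketch, with a hypothetical device name:

    # Add a journal to an existing ext2 file system:
    tune2fs -j /dev/hda3

    # From then on it can be mounted as ext3:
    mount -t ext3 /dev/hda3 /opt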

LVM and EVMS
Several volume management systems are available for Linux: LVM, the Logical Volume Manager, and EVMS, the Enterprise Volume Management System. LVM implements a set of features and functions that resemble those found in traditional LVM systems on other Unixes. EVMS is a far more ambitious project, and includes a superset of the features found in both LVM and Linux MD (Linux Software RAID). As of this writing (August 2002), EVMS appears to be the superior solution: it provides the right set of features for most system administrators today, and it has a long-term strategic vision and an active developer base. By contrast, Linux MD development is essentially at a complete standstill, and has been for years. These are discussed in greater detail below.

LVM
The Linux LVM (like all traditional Unix volume management systems) provides an abstraction of the physical disks that makes it easier to administer large file systems and disk arrays. It does this by grouping sets of disks (physical volumes) into a pool (volume group). The volume group can in turn be carved up into virtual partitions (logical volumes) that behave just like ordinary disk block devices, except that (unlike disk partitions) they can be dynamically grown, shrunk and moved about without rebooting the system or entering maintenance/standalone mode. A file system (or a swap space, or a raw block device) sits on top of a logical volume. In short, LVM adds an abstraction layer between the file system mount points (/, /usr, /opt, etc.) and the hard drive devices (/dev/hda, /dev/sdb2, etc.)

The benefit of LVM is that you can add and remove hard drives, and move data from one hard drive to another without disrupting the system or other users. Thus, LVM is ideal for administering servers to which disks are constantly being added, removed or simply moved around to accommodate new users, new applications or just provide more space for the data. If you have only one or two disks, the effort to learn LVM may outweigh any administrative benefits that you gain.
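
For instance, migrating data off a tired old drive can be done while the file systems stay mounted and the users keep working. A sketch, with hypothetical device and volume-group names:

    # Bring a new disk into the volume group:
    pvcreate /dev/sdc1
    vgextend vg0 /dev/sdc1

    # Move all allocated extents off the old disk, then retire it:
    pvmove /dev/sdb1
    vgreduce vg0 /dev/sdb1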

Linux LVM and Linux Software RAID can be used together, although neither layer knows about the other, and some of the advantages of LVM seem to be lost as a result. The usual way of using RAID with LVM is as follows (a command-line sketch follows the list):

  1. Use fdisk (or cfdisk, etc.) to create a set of equal-sized disk partitions.
  2. Create a RAID-5 (or other RAID level array) across these partitions.
  3. Use LVM to create a physical volume on the RAID device. For instance, if the RAID array was /dev/md0, then pvcreate /dev/md0.
  4. Finish setting up LVM as normal.
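
Spelled out as commands, the recipe looks roughly like this (a sketch only; device names, sizes and volume names are hypothetical):

    # 1. Equal-sized partitions /dev/sda1, /dev/sdb1, /dev/sdc1
    #    have already been created with fdisk.

    # 2. Build a RAID-5 array across them:
    mdadm --create /dev/md0 --level=5 --raid-devices=3 \
          /dev/sda1 /dev/sdb1 /dev/sdc1

    # 3. Make the array an LVM physical volume:
    pvcreate /dev/md0

    # 4. Carve it up as usual:
    vgcreate vg0 /dev/md0
    lvcreate -L 20G -n home vg0
    mke2fs -j /dev/vg0/home    # ext3, for example
    mount /dev/vg0/home /home
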
In this scenario, although LVM can still be used to dynamically resize logical volumes, one does lose the benefit of adding and removing hard drives willy-nilly. Linux RAID devices cannot be dynamically resized, nor is it easy to move a RAID array from one set of drives to another. One must still do space planning in order to have RAID arrays of the appropriate size. This may change: note that LVM is in the process of acquiring mirroring capabilities, although RAID-5 for LVM is still not envisioned.

Another serious drawback of this RAID+LVM combo is that neither Linux Software RAID (MD) nor LVM has any sort of bad-block replacement mechanism. If (or rather, when) disks start manifesting bad blocks, one is up a creek without a paddle.

EVMS
EVMS provides an over-arching storage management solution, ranging from low-level drivers that provide RAID and LVM features, to high-level command-line and graphical tools for managing partitions, RAID arrays, logical volumes and file systems. Of particular interest is that EVMS provides a bad-block replacement mechanism. Another interesting feature of EVMS is "snapshotting" (also supported by LVM): the ability to take a "snapshot" of a file system at a particular point in time, even while the system is active, thereby allowing a consistent backup. Traditionally, without snapshots, backups can take many hours to run, and if files are added, renamed or deleted while the backup is running, the backup will record this inconsistent state.
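
To make the snapshot idea concrete, here is a sketch using the LVM command set (EVMS exposes the same concept through its own tools; all names hypothetical):

    # Create a copy-on-write snapshot of a live volume; the 1G is
    # space reserved to hold blocks that change during the backup:
    lvcreate -L 1G -s -n home_snap /dev/vg0/home

    # Mount it read-only, back it up at leisure, then discard it:
    mount -o ro /dev/vg0/home_snap /mnt/snap
    tar czf /backup/home-snapshot.tar.gz /mnt/snap
    umount /mnt/snap
    lvremove -f /dev/vg0/home_snap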

Veritas
The Veritas Foundation Suite is a storage management software product that includes an LVM-like system. The following very old press release announces this system: VERITAS Software Unveils Linux Strategy and Roadmap (January 2000). It seems that it is now available for IBM mainframes running Linux (August 2003)!

Diagnostic and Monitoring Tools

Sooner or later, you will feel the need for tools to diagnose hardware problems, or simply to monitor hardware health. Some rescue operations also require low-level configuration tools. In either case, you might find the following useful:

smartmontools
The smartmontools package (http://smartmontools.sourceforge.net) provides a set of utilities for working with the Self-Monitoring, Analysis and Reporting Technology (SMART) system built into modern IDE/ATA and SCSI-3 disks. These tools can report a variety of disk drive health statistics, and the smartd daemon can run continuously to log events into the syslog.
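
Typical usage looks like this (assuming an IDE disk at /dev/hda; the device name is hypothetical):

    smartctl -i /dev/hda       # identify the drive
    smartctl -H /dev/hda       # overall health verdict
    smartctl -a /dev/hda       # full attribute dump, including the
                               # reallocated-sector count

    # Run the drive's built-in self-test, then read the results:
    smartctl -t short /dev/hda
    smartctl -l selftest /dev/hda

    # A line like this in /etc/smartd.conf tells the smartd daemon
    # to monitor the drive and mail root when trouble appears:
    #   /dev/hda -a -m root@localhost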

scsirastools
""This project includes changes that enhance the Reliability, Availability and Serviceability (RAS) of the drivers that are commonly used in a Linux software RAID-1 configuration. Other efforts have been made to enable various common hardware RAID adapters and their drivers on Linux." See http://scsirastools.sourceforge.net. The pacakage contains low level scsi utilities including sgdskfl to load disk firmware, sgmode to get and set mode pages, sgdefects to read primary and grown defect lists and sgdiag to perform format and other test functions."

sg3_utils
The sg3_utils package provides a set of utilities for use with the Linux SCSI Generic (sg) device driver. The utilities include sg variants for the traditional dd command, tools for scanning and mapping the SCSI bus, tools for issuing low-level SCSI commands, tools for timing and testing, and some example source & miscellany.
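
A few representative commands from the package (a sketch; the sg device name is hypothetical):

    # Map sg names to the usual sd/st/scd device names:
    sg_map

    # Send a low-level INQUIRY to a device:
    sg_inq /dev/sg0

    # A dd variant that talks to the sg driver directly:
    sg_dd if=/dev/sg0 of=/dev/null bs=512 count=1024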

The sg3_utils web page is also remarkable for providing a nice cross-reference to other diagnostic and monitoring tools.

scu
"The SCSI Command Utility (SCU) implements various SCSI commands necessary for normal maintenance and diagnostics of SCSI peripherals. Some of its features include: formatting, scanning for (and reassigning) bad blocks, downloading new firmware, executing diagnostics and obtaining performance information. It is available on several Unix platforms (and NT), however it is only currently available in binary form. See www.bit-net.com/~rmiller/scu.html for more details."

Hardware RAID Controllers

A hardware controller is a PCI or ISA card that mediates between the CPU and the disk drives via the I/O bus. Hardware controllers always need a device driver to be loaded into the kernel, so that the kernel can talk to the card. Note that there are some devices (which I've listed in the "outboard controllers" section below) that only draw power from the PCI/ISA bus, but do not use any of the signal pins, and do not require a (special) device driver. This section lists only those cards that use the PCI/ISA bus for actually moving data.

Vendors supported under Linux: (Current as of 1998; some of the information below may be rancid.)

Highpoint (New Listing!)
Highpoint offers several products, such as the RocketRAID 404 controller, which allows four IDE ribbon cables (and up to eight IDE drives) to be attached to the controller. Handy for saving PCI slots, as IDE is significantly cheaper than SCSI. The RocketRAID BIOS is interesting because it allows hardware-RAID arrays to be built across multiple controllers plugged into the same PCI bus. Thus, for example, a 13-drive hardware RAID array can be built by using multiple controllers.

MegaRAID (New Listing!)
MegaRAID offers the MegaRAID i4 controller for IDE drives, featuring four connectors for IDE ribbons (a total of eight IDE drives per controller). It supports RAID-5, hot-swap and hot-spare capabilities.

BigStorage
BigStorage offers a broad line of storage products tailored for Linux.

ICP Vortex
ICP Vortex offers a full line of disk array controllers. Drivers are a standard part of the 2.0.x and 2.2.x kernels; the boot diskettes for most major Linux distributions will recognize an ICP controller. Initial configuration can be done through the on-board ROM BIOS.

ICP Vortex also provides the GDTMonitor management utility. It provides the ability to monitor transfer rates, set hard drive and controller parameters, and hot-swap and reconstruct defective drives. For sites that cannot afford to take down and reboot a server in order to replace failed disks or do other maintenance, this utility is a gotta-have feature. As of January 1999, this is the only such program that I have heard of for a Linux hardware RAID controller, and this feature alone immediately elevates ICP above the competition.

A RAID Primer (PDF) and Manuals; see Chapter K for GDTmon.

Syred
Syred offers a series of RAID controllers. Their sales staff indicated that they use RedHat internally, so the Linux support should be solid.

BusLogic/Mylex
Buslogic/Mylex offers a series of SCSI controllers, including RAID controllers. BusLogic has been well known for their early support of SCSI on Linux. The latest drivers for these cards are being written & maintained by Dandelion Digital.

DPT
Look for the SmartCache [I/III/IV] and SmartRAID [I/III/IV] controllers from Distributed Processing Technology, Inc. Note that one must use the EATA-DMA driver, which is a part of the standard Linux kernel distribution. There are two drivers:
  • EATA-DMA: The driver for all EATA-DMA compliant (DPT) controllers.
  • EATA-PIO: The driver for the very old PM2001 and PM2012A from DPT.
IBM ServeRAID
IBM's Intel-based servers have onboard RAID.

Outboard RAID Vendors

There are many outboard box vendors, and, in theory, they should all work with Linux. In practice, some SCSI boxes support features that SCSI cards don't, and vice versa, so buyer beware. Note that some outboard controllers are not true stand-alone, external boxes with external power supplies, but are small devices that fit into a standard drive bay and draw power from the system power supply. Others are shaped as PCI or ISA cards, but use the PCI/ISA slots only to draw power, and do not use the signal pins on the bus. All of these devices need some other disk controller (typically, the stock, non-RAID controller that came with your box) to communicate with. The upside to such a scheme: no special device drivers are required. The downside: there are even more cards, cables and connectors that can fail.

StorComp
Storage Computer is an early pioneer in RAID, and has continued to provide sophisticated, advanced systems, focused primarily on the 'SAN' style of architecture. For example, their 'Virtual Storage Architecture' allows multiple CPUs to access the disks through SCSI interfaces. See, for example, their Product Sheet. They also have an interesting collection of White Papers.

www.raidweb.com

Arco Computer Products
Arco Computer offers the DupliDisk EIDE-to-EIDE converter for RAID-1 (mirroring). Three versions are supported: one that fits into an ISA slot, one that fits into an IDE slot, and one that fits into a drive bay.

DILOG
DILOG offers the 2XFR SCSI-to-SCSI RAID-0 product. Features:
  • Fits into a 3.5 inch drive bay.
  • Certified by an IBM SIT lab to inter-operate with Linux.

Dynamic Network Factory
Dynamic Network Factory specializes in larger arrays.

LAND-5
LAND-5 offers several products. These appear to be stand-alone SCSI-attached boxes, and require no special Linux support. See www.land-5.com

Disk Array Management Software

Most controllers can be configured and managed via brute force, by rebooting the machine and descending into on-card BIOS or possibly DOS utilities to reconfigure, exchange and rebuild failed drives. However, for many system operators, rebooting is a luxury that is not available. For these sites and servers, there is a real need for configuration and management software that will not only report on a variety of disk statistics, but also raise alarms when there is trouble, allow failed drives to be disabled, swapped out and reconstructed, and allow all this to be done without taking the array offline or halting any server functions. Currently (January 1999) I am aware of only one vendor that provides this capability: ICP-Vortex.

ICP Vortex (New Listing)
ICP Vortex provides the GDTMonitor management utility for its controllers. The utility provides the ability to monitor transfer rates, set hard drive and controller parameters, and hot-swap and reconstruct defective drives.

BusLogic
Buslogic offers the Global Array Manager which runs under SCO Unix and UnixWare. Thus, a port to Linux is at least theoretically possible. Contact your sales representative.

StorComp
Storage Computer offers an SNMP MIB for storage management. MIBs being what they are, any SNMP tool on Linux should be able to use this to query and manage the system. However, MIBs also being what they are, this is a rather low-level, (very) hard-to-use solution. See also a white paper on storage management.

DPT
DPT provides management software with their cards. The distribution includes SCO binaries. Thus, a port to Linux is at least theoretically possible. Contact your sales representative.


History

Last updated August 2003 by Linas Vepstas (linas@linas.org)

Copyright (c) 1996-1999, 2001-2003 Linas Vepstas.
Copyright (c) 2003 Douglas Gilbert <dgilbert@interlog.com>

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts. A copy of the license is included at the URL http://www.linas.org/fdl.html, the web page titled "GNU Free Documentation License".

The phrase 'Enterprise Linux' is a trademark of Linas Vepstas.
All trademarks on this page are property of their respective owners.