NAME
raidctl — configuration utility for the RAIDframe disk driver
SYNOPSIS
raidctl [-v] -A [yes | no | forceroot | softroot] dev
raidctl [-v] -a component dev
raidctl [-v] -B dev
raidctl [-v] -C config_file dev
raidctl [-v] -c config_file dev
raidctl [-v] -F component dev
raidctl [-v] -f component dev
raidctl [-v] -G dev
raidctl [-v] -g component dev
raidctl [-v] -I serial_number dev
raidctl [-v] -i dev
raidctl [-v] -M [yes | no | set params] dev
raidctl [-v] -m dev
raidctl [-v] -P dev
raidctl [-v] -p dev
raidctl [-v] -R component dev
raidctl [-v] -r component dev
raidctl [-v] -S dev
raidctl [-v] -s dev
raidctl [-v] -U unit dev
raidctl [-v] -u dev
DESCRIPTION
raidctl is the user-land control program for
raid(4), the RAIDframe disk
device.
raidctl is primarily used to dynamically configure
and unconfigure RAIDframe disk devices. For more information about the
RAIDframe disk device, see
raid(4).
This document assumes the reader has at least rudimentary knowledge of RAID and
RAID concepts.
The command-line options for
raidctl are as follows:
-A yes dev
Make the RAID set auto-configurable. The RAID set will be
automatically configured at boot before the root
file system is mounted. Note that all components of the set must be of
type
RAID
in the disklabel.
-A no dev
Turn off auto-configuration for the RAID set.
-A forceroot dev
Make the RAID set auto-configurable, and also mark the set
as being eligible to be the root partition. A RAID set configured this way
will override the use of the boot disk as the root
device. All components of the set must be of type
RAID
in the disklabel. Note that only certain
architectures (currently alpha, amd64, i386, pmax, sandpoint, sparc,
sparc64, and vax) support booting a kernel directly from a RAID set.
Please note that forceroot mode was referred to as
root mode on earlier versions of
NetBSD. For compatibility reasons,
root can be used as an alias for
forceroot.
-A softroot dev
Like forceroot, but only change the root
device if the boot device is part of the RAID set.
-a component dev
Add component as a hot spare for the
device dev. Component labels (which identify the
location of a given component within a particular RAID set) are
automatically added to the hot spare after it has been used and are not
required for component before it is used.
-B dev
Initiate a copyback of reconstructed data from a spare disk
to its original disk. This is performed after a component has failed, and
the failed drive has been reconstructed onto a spare drive.
-C config_file dev
As for -c, but forces the configuration
to take place. Fatal errors due to uninitialized components are ignored.
This is required the first time a RAID set is configured.
-c config_file dev
Configure the RAIDframe device dev
according to the configuration given in config_file.
A description of the contents of config_file is
given later.
-F component dev
Fails the specified component of the device, and immediately begins a
reconstruction of the failed disk onto an available hot spare. This is one of
the mechanisms used to start the reconstruction process if a component does
have a hardware failure.
-f component dev
This marks the specified component as
having failed, but does not initiate a reconstruction of that
component.
-G dev
Generate the configuration of the RAIDframe device in a
format suitable for use with the -c or
-C options.
-g component dev
Get the component label for the specified component.
-I serial_number dev
Initialize the component labels on each component of the
device. serial_number is used as one of the keys in
determining whether a particular set of components belong to the same RAID
set. While not strictly enforced, different serial numbers should be used
for different RAID sets. This step MUST be performed
when a new RAID set is created.
-i dev
Initialize the RAID device. In particular, (re-)write the
parity on the selected device. This MUST be done for
all RAID sets before the RAID device is labeled and
before file systems are created on the RAID device.
-M yes dev
Enable the use of a parity map on the RAID set; this is the
default, and greatly reduces the time taken to check parity after unclean
shutdowns at the cost of some very slight overhead during normal
operation. Changes to this setting will take effect the next time the set
is configured. Note that RAID-0 sets, having no parity, will not use a
parity map in any case.
-M no dev
Disable the use of a parity map on the RAID set; doing this
is not recommended. This will take effect the next time the set is
configured.
-M set cooldown tickms regions dev
Alter the parameters of the parity map; parameters to leave
unchanged can be given as 0, and trailing zeroes may be omitted. The RAID
set is divided into regions regions; each region is
marked dirty for at most cooldown intervals of
tickms milliseconds each after a write to it, and at
least cooldown - 1 such intervals. Changes to
regions take effect the next time the set is configured,
while changes to the other parameters are applied immediately. The default
parameters are expected to be reasonable for most workloads.
-m dev
Display status information about the parity map on the RAID
set, if any. If used with -v then the current contents
of the parity map will be output (in hexadecimal format) as well.
-P dev
Check the status of the parity on the RAID set, and
initialize (re-write) the parity if the parity is not known to be
up-to-date. This is normally used after a system crash (and before a
fsck(8)) to ensure the
integrity of the parity.
-p dev
Check the status of the parity on the RAID set. Displays a
status message, and returns successfully if the parity is up-to-date.
-R component dev
Fails the specified component, if
necessary, and immediately begins a reconstruction back to
component. This is useful for reconstructing back
onto a component after it has been replaced following a failure.
-r component dev
Remove the spare disk specified by
component from the set of available spare
components.
-S dev
Check the status of parity re-writing, component
reconstruction, and component copyback. The output indicates the amount of
progress achieved in each of these areas.
-s dev
Display the status of the RAIDframe device for each of the
components and spares.
-U unit dev
Set the last_unit field in all components of the RAID set, so that the next
time the set is auto-configured it uses that unit number.
-u dev
Unconfigure the RAIDframe device. This does not remove any
component labels or change any configuration settings (e.g.
auto-configuration settings) for the RAID set.
-v
Be more verbose. For operations such as reconstructions,
parity re-writing, and copybacks, provide a progress indicator.
The device used by
raidctl is specified by
dev.
dev may be either the full
name of the device, e.g.,
/dev/rraid0d, for the i386
architecture, or
/dev/rraid0c for many others, or just
simply
raid0 (for
/dev/rraid0[cd]). It is
recommended that the partitions used to represent the RAID device are not used
for file systems.
Configuration file
The format of the configuration file is complex, and only an abbreviated
treatment is given here. In the configuration files, a ‘#’
indicates the beginning of a comment.
There are 4 required sections of a configuration file, and 2 optional sections.
Each section begins with a ‘START’, followed by the section name,
and the configuration parameters associated with that section. The first
section is the ‘array’ section, and it specifies the number of
rows, columns, and spare disks in the RAID set. For example:
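START array
# numRow numCol numSpare
1 3 0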
indicates an array with 1 row, 3 columns, and 0 spare disks. Note that although
multi-dimensional arrays may be specified, they are
NOT
supported in the driver.
The second section, the ‘disks’ section, specifies the actual
components of the device. For example:
START disks
/dev/sd0e
/dev/sd1e
/dev/sd2e
specifies the three component disks to be used in the RAID device. If any of the
specified drives cannot be found when the RAID device is configured, then they
will be marked as ‘failed’, and the system will operate in
degraded mode. Note that it is
imperative that the order of
the components in the configuration file does not change between
configurations of a RAID device. Changing the order of the components will
result in data loss if the set is configured with the
-C
option. In normal circumstances, the RAID set will not configure if only
-c is specified, and the components are out-of-order.
The next section, which is the ‘spare’ section, is optional, and, if
present, specifies the devices to be used as ‘hot spares’ —
devices which are on-line, but are not actively used by the RAID driver unless
one of the main components fail. A simple ‘spare’ section might
be:
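START spare
/dev/sd3e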
for a configuration with a single spare component. If no spare drives are to be
used in the configuration, then the ‘spare’ section may be
omitted.
The next section is the ‘layout’ section. This section describes the
general layout parameters for the RAID device, and provides such information
as sectors per stripe unit, stripe units per parity unit, stripe units per
reconstruction unit, and the parity configuration to use. This section might
look like:
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
32 1 1 5
The sectors per stripe unit specifies, in blocks, the interleave factor; i.e.,
the number of contiguous sectors to be written to each component for a single
stripe. Appropriate selection of this value (32 in this example) is the
subject of much research in RAID architectures. The stripe units per parity
unit and stripe units per reconstruction unit are normally each set to 1.
While certain values above 1 are permitted, a discussion of valid values and
the consequences of using anything other than 1 are outside the scope of this
document. The last value in this section (5 in this example) indicates the
parity configuration desired. Valid entries include:
0   RAID level 0. No parity, only simple striping.
1   RAID level 1. Mirroring. The parity is the mirror.
4   RAID level 4. Striping across components, with parity
    stored on the last component.
5   RAID level 5. Striping across components, parity
    distributed across all components.
There are other valid entries here, including those for Even-Odd parity, RAID
level 5 with rotated sparing, Chained declustering, and Interleaved
declustering, but as of this writing the code for those parity operations has
not been tested with
NetBSD.
The next required section is the ‘queue’ section. This is most often
specified as:
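START queue
fifo 100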
where the queuing method is specified as fifo (first-in, first-out), and the
size of the per-component queue is limited to 100 requests. Other queuing
methods may also be specified, but a discussion of them is beyond the scope of
this document.
The final section, the ‘debug’ section, is optional. For more
details on this the reader is referred to the RAIDframe documentation
discussed in the
HISTORY section.
See
EXAMPLES for a more complete
configuration file example.
FILES
/dev/{,r}raid*
raid device special files.
EXAMPLES
It is highly recommended that before using the RAID driver for real file systems
that the system administrator(s) become quite familiar with the use of
raidctl, and that they understand how the component
reconstruction process works. The examples in this section will focus on
configuring a number of different RAID sets of varying degrees of redundancy.
By working through these examples, administrators should be able to develop a
good feel for how to configure a RAID set, and how to initiate reconstruction
of failed components.
In the following examples ‘raid0’ will be used to denote the RAID
device. Depending on the architecture,
/dev/rraid0c or
/dev/rraid0d may be used in place of
raid0.
Initialization and Configuration
The initial step in configuring a RAID set is to identify the components that
will be used in the RAID set. All components should be the same size. Each
component should have a disklabel type of FS_RAID, and a typical disklabel
entry for a RAID component might look like:
f: 1800000 200495 RAID # (Cyl. 405*- 4041*)
While FS_BSDFFS will also work as the component type, the type FS_RAID is
preferred for RAIDframe use, as it
is required for features such as auto-configuration. As part of the initial
configuration of each RAID set, each component will be given a
‘component label’. A ‘component label’ contains
important information about the component, including a user-specified serial
number, the row and column of that component in the RAID set, the redundancy
level of the RAID set, a ‘modification counter’, and whether the
parity information (if any) on that component is known to be correct.
Component labels are an integral part of the RAID set, since they are used to
ensure that components are configured in the correct order, and used to keep
track of other vital information about the RAID set. Component labels are also
required for the auto-detection and auto-configuration of RAID sets at boot
time. For a component label to be considered valid, that particular component
label must be in agreement with the other component labels in the set. For
example, the serial number, ‘modification counter’, number of rows
and number of columns must all be in agreement. If any of these are different,
then the component is not considered to be part of the set. See
raid(4) for more information about
component labels.
Once the components have been identified, and the disks have appropriate labels,
raidctl is then used to configure the
raid(4) device. To configure the
device, a configuration file which looks something like:
START array
# numRow numCol numSpare
1 3 1
START disks
/dev/sd1e
/dev/sd2e
/dev/sd3e
START spare
/dev/sd4e
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
32 1 1 5
START queue
fifo 100
is created in a file. The above configuration file specifies a RAID 5 set
consisting of the components
/dev/sd1e,
/dev/sd2e, and
/dev/sd3e, with
/dev/sd4e available as a ‘hot spare’ in case one
of the three main drives should fail. A RAID 0 set would be specified in a
similar way:
START array
# numRow numCol numSpare
1 4 0
START disks
/dev/sd10e
/dev/sd11e
/dev/sd12e
/dev/sd13e
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_0
64 1 1 0
START queue
fifo 100
In this case, devices
/dev/sd10e,
/dev/sd11e,
/dev/sd12e, and
/dev/sd13e are the components that make up this RAID set.
Note that there are no hot spares for a RAID 0 set, since there is no way to
recover data if any of the components fail.
For a RAID 1 (mirror) set, the following configuration might be used:
START array
# numRow numCol numSpare
1 2 0
START disks
/dev/sd20e
/dev/sd21e
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_1
128 1 1 1
START queue
fifo 100
In this case,
/dev/sd20e and
/dev/sd21e are
the two components of the mirror set. While no hot spares have been specified
in this configuration, they easily could be, just as they were specified in
the RAID 5 case above. Note as well that RAID 1 sets are currently limited to
only 2 components. At present, n-way mirroring is not possible.
The first time a RAID set is configured, the
-C option must be
used:
raidctl -C raid0.conf raid0
where
raid0.conf is the name of the RAID configuration file.
The
-C forces the configuration to succeed, even if any of
the component labels are incorrect. The
-C option should not
be used lightly in situations other than initial configurations, as if the
system is refusing to configure a RAID set, there is probably a very good
reason for it. After the initial configuration is done (and appropriate
component labels are added with the
-I option) then raid0
can be configured normally with:
raidctl -c raid0.conf raid0
When the RAID set is configured for the first time, it is necessary to
initialize the component labels, and to initialize the parity on the RAID set.
Initializing the component labels is done with:
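raidctl -I 112341 raid0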
where ‘112341’ is a user-specified serial number for the RAID set.
This initialization step is
required for all RAID sets. As
well, using different serial numbers between RAID sets is
strongly encouraged, as using the same serial number for all
RAID sets will only serve to decrease the usefulness of the component label
checking.
Initializing the RAID set is done via the
-i option. This
initialization
MUST be done for
all RAID
sets, since among other things it verifies that the parity (if any) on the
RAID set is correct. Since this initialization may be quite time-consuming,
the
-v option may also be used in conjunction with
-i:
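raidctl -iv raid0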
This will give more verbose output on the status of the initialization:
Initiating re-write of parity
Parity Re-write status:
10% |**** | ETA: 06:03 /
The output provides a ‘Percent Complete’ in both a numeric and
graphical format, as well as an estimated time to completion of the operation.
Since it is the parity that provides the ‘redundancy’ part of RAID,
it is critical that the parity is correct as much as possible. If the parity
is not correct, then there is no guarantee that data will not be lost if a
component fails.
Once the parity is known to be correct, it is then safe to perform
disklabel(8),
newfs(8), or
fsck(8) on the device or its file
systems, and then to mount the file systems for use.
Under certain circumstances (e.g., the additional component has not arrived, or
data is being migrated off of a disk destined to become a component) it may be
desirable to configure a RAID 1 set with only a single component. This can be
achieved by using the word “absent” to indicate that a particular
component is not present. In the following:
START array
# numRow numCol numSpare
1 2 0
START disks
absent
/dev/sd0e
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_1
128 1 1 1
START queue
fifo 100
/dev/sd0e is the real component, and will be the second disk
of a RAID 1 set. The first component is simply marked as being absent.
Configuration (using -C and -I 12345 as above) proceeds normally, but
initialization of
the RAID set will have to wait until all physical components are present.
After configuration, this set can be used normally, but will be operating in
degraded mode. Once a second physical component is obtained, it can be
hot-added, the existing data mirrored, and normal operation resumed.
The size of the resulting RAID set will depend on the number of data components
in the set. Space is automatically reserved for the component labels, and the
actual amount of space used for data on a component will be rounded down to
the largest possible multiple of the sectors per stripe unit (sectPerSU)
value. Thus, the amount of space provided by the RAID set will be less than
the sum of the size of the components.
Maintenance of the RAID set
After the parity has been initialized for the first time, the command:
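raidctl -p raid0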
can be used to check the current status of the parity. To check the parity and
rebuild it if necessary (for example, after an unclean shutdown) the command:
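raidctl -P raid0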
is used. Note that re-writing the parity can be done while other operations on
the RAID set are taking place (e.g., while doing a
fsck(8) on a file system on the
RAID set). However: for maximum effectiveness of the RAID set, the parity
should be known to be correct before any data on the set is modified.
To see how the RAID set is doing, the following command can be used to show the
RAID set's status:
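raidctl -s raid0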
The output will look something like:
Components:
/dev/sd1e: optimal
/dev/sd2e: optimal
/dev/sd3e: optimal
Spares:
/dev/sd4e: spare
Component label for /dev/sd1e:
Row: 0 Column: 0 Num Rows: 1 Num Columns: 3
Version: 2 Serial Number: 13432 Mod Counter: 65
Clean: No Status: 0
sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
RAID Level: 5 blocksize: 512 numBlocks: 1799936
Autoconfig: No
Last configured as: raid0
Component label for /dev/sd2e:
Row: 0 Column: 1 Num Rows: 1 Num Columns: 3
Version: 2 Serial Number: 13432 Mod Counter: 65
Clean: No Status: 0
sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
RAID Level: 5 blocksize: 512 numBlocks: 1799936
Autoconfig: No
Last configured as: raid0
Component label for /dev/sd3e:
Row: 0 Column: 2 Num Rows: 1 Num Columns: 3
Version: 2 Serial Number: 13432 Mod Counter: 65
Clean: No Status: 0
sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
RAID Level: 5 blocksize: 512 numBlocks: 1799936
Autoconfig: No
Last configured as: raid0
Parity status: clean
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
This indicates that all is well with the RAID set. Of importance here are the
component lines which read ‘optimal’, and the ‘Parity
status’ line. ‘Parity status: clean’ indicates that the
parity is up-to-date for this RAID set, whether or not the RAID set is in
redundant or degraded mode. ‘Parity status: DIRTY’ indicates that
it is not known if the parity information is consistent with the data, and
that the parity information needs to be checked. Note that if there are file
systems open on the RAID set, the individual components will not be
‘clean’ but the set as a whole can still be clean.
To check the component label of
/dev/sd1e, the following is
used:
raidctl -g /dev/sd1e raid0
The output of this command will look something like:
Component label for /dev/sd1e:
Row: 0 Column: 0 Num Rows: 1 Num Columns: 3
Version: 2 Serial Number: 13432 Mod Counter: 65
Clean: No Status: 0
sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
RAID Level: 5 blocksize: 512 numBlocks: 1799936
Autoconfig: No
Last configured as: raid0
Dealing with Component Failures
If for some reason (perhaps to test reconstruction) it is necessary to pretend a
drive has failed, the following will perform that function:
raidctl -f /dev/sd2e raid0
The system will then be performing all operations in degraded mode, where
missing data is re-computed from existing data and the parity. In this case,
obtaining the status of raid0 will return (in part):
Components:
/dev/sd1e: optimal
/dev/sd2e: failed
/dev/sd3e: optimal
Spares:
/dev/sd4e: spare
Note that with the use of
-f a reconstruction has not been
started. To both fail the disk and start a reconstruction, the
-F option must be used:
raidctl -F /dev/sd2e raid0
The
-f option may be used first, and then the
-F option used later, on the same disk, if desired.
Immediately after the reconstruction is started, the status will report:
Components:
/dev/sd1e: optimal
/dev/sd2e: reconstructing
/dev/sd3e: optimal
Spares:
/dev/sd4e: used_spare
[...]
Parity status: clean
Reconstruction is 10% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
This indicates that a reconstruction is in progress. To find out how the
reconstruction is progressing the
-S option may be used.
This will indicate the progress in terms of the percentage of the
reconstruction that is completed. When the reconstruction is finished the
-s option will show:
Components:
/dev/sd1e: optimal
/dev/sd2e: spared
/dev/sd3e: optimal
Spares:
/dev/sd4e: used_spare
[...]
Parity status: clean
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
At this point there are at least two options. First, if
/dev/sd2e is known to be good (i.e., the failure was either
caused by
-f or
-F, or the failed disk was
replaced), then a copyback of the data can be initiated with the
-B option. In this example, this would copy the entire
contents of
/dev/sd4e to
/dev/sd2e. Once
the copyback procedure is complete, the status of the device would be (in
part):
Components:
/dev/sd1e: optimal
/dev/sd2e: optimal
/dev/sd3e: optimal
Spares:
/dev/sd4e: spare
and the system is back to normal operation.
The second option after the reconstruction is to simply use
/dev/sd4e in place of
/dev/sd2e in the
configuration file. For example, the configuration file (in part) might now
look like:
START array
1 3 0
START disks
/dev/sd1e
/dev/sd4e
/dev/sd3e
This can be done as
/dev/sd4e is completely interchangeable
with
/dev/sd2e at this point. Note that extreme care must be
taken when changing the order of the drives in a configuration. This is one of
the few instances where the devices and/or their orderings can be changed
without loss of data! In general, the ordering of components in a
configuration file should
never be changed.
If a component fails and there are no hot spares available on-line, the status
of the RAID set might (in part) look like:
Components:
/dev/sd1e: optimal
/dev/sd2e: failed
/dev/sd3e: optimal
No spares.
In this case there are a number of options. The first option is to add a hot
spare using:
raidctl -a /dev/sd4e raid0
After the hot add, the status would then be:
Components:
/dev/sd1e: optimal
/dev/sd2e: failed
/dev/sd3e: optimal
Spares:
/dev/sd4e: spare
Reconstruction could then take place using
-F as described
above.
A second option is to rebuild directly onto
/dev/sd2e. Once
the disk containing
/dev/sd2e has been replaced, one can
simply use:
raidctl -R /dev/sd2e raid0
to rebuild the
/dev/sd2e component. As the rebuilding is in
progress, the status will be:
Components:
/dev/sd1e: optimal
/dev/sd2e: reconstructing
/dev/sd3e: optimal
No spares.
and when completed, will be:
Components:
/dev/sd1e: optimal
/dev/sd2e: optimal
/dev/sd3e: optimal
No spares.
In circumstances where a particular component is completely unavailable after a
reboot, a special component name will be used to indicate the missing
component. For example:
Components:
/dev/sd2e: optimal
component1: failed
No spares.
indicates that the second component of this RAID set was not detected at all by
the auto-configuration code. The name ‘component1’ can be used
anywhere a normal component name would be used. For example, to add a hot
spare to the above set, and rebuild to that hot spare, the following could be
done:
raidctl -a /dev/sd3e raid0
raidctl -F component1 raid0
at which point the data missing from ‘component1’ would be
reconstructed onto
/dev/sd3e.
When more than one component is marked as ‘failed’ due to a
non-component hardware failure (e.g., loss of power to two components, adapter
problems, termination problems, or cabling issues) it is quite possible to
recover the data on the RAID set. The first thing to be aware of is that the
first disk to fail will almost certainly be out-of-sync with the remainder of
the array. If any IO was performed between the time the first component is
considered ‘failed’ and when the second component is considered
‘failed’, then the first component to fail will
not contain correct data, and should be ignored. When the
second component is marked as failed, however, the RAID device will
(currently) panic the system. At this point the data on the RAID set (not
including the first failed component) is still self consistent, and will be in
no worse state of repair than had the power gone out in the middle of a write
to a file system on a non-RAID device. The problem, however, is that the
component labels may now have 3 different ‘modification counters’
(one value on the first component that failed, one value on the second
component that failed, and a third value on the remaining components). In such
a situation, the RAID set will not autoconfigure, and can only be forcibly
re-configured with the
-C option. To recover the RAID set,
one must first remedy whatever physical problem caused the multiple-component
failure. After that is done, the RAID set can be restored by forcibly
configuring the raid set
without the component that failed
first. For example, if
/dev/sd1e and
/dev/sd2e fail (in that order) in a RAID set of the
following configuration:
START array
1 4 0
START disks
/dev/sd1e
/dev/sd2e
/dev/sd3e
/dev/sd4e
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
64 1 1 5
START queue
fifo 100
then the following configuration (say "recover_raid0.conf")
START array
1 4 0
START disks
absent
/dev/sd2e
/dev/sd3e
/dev/sd4e
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
64 1 1 5
START queue
fifo 100
can be used with
raidctl -C recover_raid0.conf raid0
to force the configuration of raid0. A ‘raidctl -I 12345 raid0’ will be
required in order to synchronize the component labels. At this point the
file systems on the RAID set can then be checked and corrected. To complete
the re-construction of the RAID set,
/dev/sd1e is simply
hot-added back into the array, and reconstructed as described earlier.
RAID on RAID
RAID sets can be layered to create more complex and much larger RAID sets. A
RAID 0 set, for example, could be constructed from four RAID 5 sets. The
following configuration file shows such a setup:
START array
# numRow numCol numSpare
1 4 0
START disks
/dev/raid1e
/dev/raid2e
/dev/raid3e
/dev/raid4e
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_0
128 1 1 0
START queue
fifo 100
A similar configuration file might be used for a RAID 0 set constructed from
components on RAID 1 sets. In such a configuration, the mirroring provides a
high degree of redundancy, while the striping provides additional speed
benefits.
Auto-configuration and Root on RAID
RAID sets can also be auto-configured at boot. To make a set auto-configurable,
simply prepare the RAID set as above, and then do a:
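raidctl -A yes raid0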
to turn on auto-configuration for that set. To turn off auto-configuration, use:
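raidctl -A no raid0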
RAID sets which are auto-configurable will be configured before the root file
system is mounted. These RAID sets are thus available for use as a root file
system, or for any other file system. A primary advantage of using the
auto-configuration is that RAID components become more independent of the
disks they reside on. For example, SCSI ID's can change, but auto-configured
sets will always be configured correctly, even if the SCSI ID's of the
component disks have become scrambled.
Having a system's root file system (
/) on a RAID set is also
allowed, with the ‘a’ partition of such a RAID set being used for
/. To use raid0a as the root file system, simply use:
raidctl -A forceroot raid0
To return raid0a to being just an auto-configuring set, simply use the
-A yes arguments:
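raidctl -A yes raid0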
Note that kernels can only be directly read from RAID 1 components on
architectures that support that (currently alpha, i386, pmax, sandpoint,
sparc, sparc64, and vax). On those architectures, the
FS_RAID
file system is recognized by the bootblocks,
and will properly load the kernel directly from a RAID 1 component. For other
architectures, or to support the root file system on other RAID sets, some
other mechanism must be used to get a kernel booting. For example, a small
partition containing only the secondary boot-blocks and an alternate kernel
(or two) could be used. Once a kernel is booting however, and an
auto-configuring RAID set is found that is eligible to be root, then that RAID
set will be auto-configured and used as the root device. If two or more RAID
sets claim to be root devices, then the user will be prompted to select the
root device. At this time, RAID 0, 1, 4, and 5 sets are all supported as root
devices.
A typical RAID 1 setup with root on RAID might be as follows:
- wd0a - a small partition, which contains a complete,
bootable, basic NetBSD installation.
- wd1a - also contains a complete, bootable, basic
NetBSD installation.
- wd0e and wd1e - a RAID 1 set, raid0, used for the root
file system.
- wd0f and wd1f - a RAID 1 set, raid1, which will be used
only for swap space.
- wd0g and wd1g - a RAID 1 set, raid2, used for
/usr, /home, or other data, if
desired.
- wd0h and wd1h - a RAID 1 set, raid3, if desired.
RAID sets raid0, raid1, and raid2 are all marked as auto-configurable. raid0 is
marked as being a root file system. When new kernels are installed, the kernel
is not only copied to
/, but also to wd0a and wd1a. The
kernel on wd0a is required, since that is the kernel the system boots from.
The kernel on wd1a is also required, since that will be the kernel used should
wd0 fail. The important point here is to have redundant copies of the kernel
available, in the event that one of the drives fails.
There is no requirement that the root file system be on the same disk as the
kernel. For example, obtaining the kernel from wd0a, and using sd0e and sd1e
for raid0, and the root file system, is fine. It is critical, however, that
there be multiple kernels available, in the event of
media failure.
Multi-layered RAID devices (such as a RAID 0 set made up of RAID 1 sets) are
not supported as root devices or auto-configurable devices
at this point. (Multi-layered RAID devices
are supported in
general, however, as mentioned earlier.) Note that in order to enable
component auto-detection and auto-configuration of RAID devices, the line:
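options    RAID_AUTOCONFIG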
must be in the kernel configuration file. See
raid(4) for more details.
Swapping on RAID
A RAID device can be used as a swap device. In order to ensure that a RAID
device used as a swap device is correctly unconfigured when the system is
shutdown or rebooted, it is recommended that the line
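swapoff=YES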
be added to
/etc/rc.conf.
Unconfiguration
The final operation performed by
raidctl is to unconfigure a
raid(4) device. This is
accomplished via a simple:
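raidctl -u raid0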
at which point the device is ready to be reconfigured.
Performance Tuning
Selection of the various parameter values which result in the best performance
can be quite tricky, and often requires a bit of trial-and-error to get those
values most appropriate for a given system. A whole range of factors come into
play, including:
- Types of components (e.g., SCSI vs. IDE) and their
bandwidth
- Types of controller cards and their bandwidth
- Distribution of components among controllers
- IO bandwidth
- file system access patterns
- CPU speed
As with most performance tuning, benchmarking under real-life loads may be the
only way to measure expected performance. Understanding some of the underlying
technology is also useful in tuning. The goal of this section is to provide
pointers to those parameters which may make significant differences in
performance.
For a RAID 1 set, a SectPerSU value of 64 or 128 is typically sufficient. Since
data in a RAID 1 set is arranged in a linear fashion on each component,
selecting an appropriate stripe size is somewhat less critical than it is for
a RAID 5 set. However: a stripe size that is too small will cause large IO's
to be broken up into a number of smaller ones, hurting performance. At the
same time, a large stripe size may cause problems with concurrent accesses to
stripes, which may also affect performance. Thus values in the range of 32 to
128 are often the most effective.
Tuning RAID 5 sets is trickier. In the best case, IO is presented to the RAID
set one stripe at a time. Since the entire stripe is available at the
beginning of the IO, the parity of that stripe can be calculated before the
stripe is written, and then the stripe data and parity can be written in
parallel. When the amount of data being written is less than a full stripe
worth, the ‘small write’ problem occurs. Since a ‘small
write’ means only a portion of the stripe on the components is going to
change, the data (and parity) on the components must be updated slightly
differently. First, the ‘old parity’ and ‘old data’
must be read from the components. Then the new parity is constructed, using
the new data to be written, and the old data and old parity. Finally, the new
data and new parity are written. All this extra data shuffling results in a
serious loss of performance, and is typically 2 to 4 times slower than a full
stripe write (or read). To combat this problem in the real world, it may be
useful to ensure that stripe sizes are small enough that a ‘large
IO’ from the system will use exactly one large stripe write. As is seen
later, there are some file system dependencies which may come into play here
as well.
Since the size of a ‘large IO’ is often (currently) only 32K or 64K,
on a 5-drive RAID 5 set it may be desirable to select a SectPerSU value of 16
blocks (8K) or 32 blocks (16K). Since there are 4 data sectors per stripe, the
maximum data per stripe is 64 blocks (32K) or 128 blocks (64K). Again,
empirical measurement will provide the best indicators of which values will
yield better performance.
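For example, a layout section reflecting the 16-block (8K) stripe unit case
discussed above might look like (the values shown are illustrative only):
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
16 1 1 5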
The parameters used for the file system are also critical to good performance.
For
newfs(8), for example,
increasing the block size to 32K or 64K may improve performance dramatically.
As well, changing the cylinders-per-group parameter from 16 to 32 or higher is
often not only necessary for larger file systems, but may also have positive
performance implications.
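As an illustration only (this assumes the -b and -c options of
newfs(8) control the block size and cylinders-per-group parameters
mentioned above; consult newfs(8) for the exact semantics on a given release):
newfs -b 32768 -c 32 /dev/rraid0e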
Summary
Despite the length of this man-page, configuring a RAID set is a relatively
straight-forward process. All that needs to be done is the following steps:
- Use disklabel(8) to create the components (of type RAID).
- Construct a RAID configuration file: e.g.,
raid0.conf
- Configure the RAID set with:
raidctl -C raid0.conf raid0
- Initialize the component labels with:
raidctl -I 112341 raid0
- Initialize other important parts of the set with:
raidctl -i raid0
- Get the default label for the RAID set:
disklabel raid0 > /tmp/label
- Edit the label (with the editor of your choice), e.g.:
vi /tmp/label
- Put the new label on the RAID set:
disklabel -R -r raid0 /tmp/label
- Create the file system, e.g.:
newfs /dev/rraid0e
- Mount the file system (here on /mnt), e.g.:
mount /dev/raid0e /mnt
- Use:
raidctl -c raid0.conf raid0
to re-configure the RAID set the next time it is needed, or put
raid0.conf into /etc where it will
automatically be started by the /etc/rc.d scripts.
SEE ALSO
ccd(4),
raid(4),
rc(8)
HISTORY
RAIDframe is a framework for rapid prototyping of RAID structures developed by
the folks at the Parallel Data Laboratory at Carnegie Mellon University (CMU).
A more complete description of the internals and functionality of RAIDframe is
found in the paper "RAIDframe: A Rapid Prototyping Tool for RAID
Systems", by William V. Courtright II, Garth Gibson, Mark Holland, LeAnn
Neal Reilly, and Jim Zelenka, and published by the Parallel Data Laboratory of
Carnegie Mellon University.
The
raidctl command first appeared as a program in CMU's
RAIDframe v1.1 distribution. This version of
raidctl is a
complete re-write, and first appeared in
NetBSD 1.4.
COPYRIGHT
The RAIDframe Copyright is as follows:
Copyright (c) 1994-1996 Carnegie-Mellon University.
All rights reserved.
Permission to use, copy, modify and distribute this software and
its documentation is hereby granted, provided that both the copyright
notice and this permission notice appear in all copies of the
software, derivative works or modified versions, and any portions
thereof, and that both notices appear in supporting documentation.
CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS"
CONDITION. CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND
FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE.
Carnegie Mellon requests users of this software to return to
Software Distribution Coordinator or Software.Distribution@CS.CMU.EDU
School of Computer Science
Carnegie Mellon University
Pittsburgh PA 15213-3890
any improvements or extensions that they make and grant Carnegie the
rights to redistribute these changes.
WARNINGS
Certain RAID levels (1, 4, 5, 6, and others) can protect against some data loss
due to component failure. However, the loss of two components of a RAID 4 or 5
system, or the loss of a single component of a RAID 0 system will result in
the entire file system being lost. RAID is
NOT a substitute
for good backup practices.
Recomputation of parity
MUST be performed whenever there is a
chance that it may have been compromised. This includes after system crashes,
or before a RAID device has been used for the first time. Failure to keep
parity correct will be catastrophic should a component ever fail — it is
better to use RAID 0 and get the additional space and speed, than it is to use
parity, but not keep the parity correct. At least with RAID 0 there is no
perception of increased data security.
When replacing a failed component of a RAID set, it is a good idea to zero out
the first 64 blocks of the new component to ensure the RAIDframe driver
doesn't erroneously detect a component label in the new component. This is
particularly true on
RAID 1 sets because there is at most
one correct component label in a failed RAID 1 installation, and the RAIDframe
driver picks the component label with the highest serial number and
modification value as the authoritative source for the failed RAID set when
choosing which component label to use to configure the RAID set.
BUGS
Hot-spare removal is currently not available.