Chan-Ik Park
cipark@vision.postech.ac.kr
Tae-Young Choe
In the past few years, the performance of processors has been growing steadily, doubling approximately every two years [6], and with the advance of parallel processing technology, the processing power of a computer system is expected to keep increasing. However, I/O performance has fallen far behind processing power. It is therefore very important to balance the I/O bandwidth and the processing power in order to improve the overall performance of a system [6,14].
There has been much research on improving I/O performance [7,8,12,14,15], known as data declustering and disk striping in disk array systems. All of these techniques exploit the parallelism among many disks. However, incorporating such a large number of disks makes a disk array system more prone to failure than a single disk [12]. High reliability can be achieved by using parity encoding of data to survive disk failures. For example, RAID technology [12] makes a disk array system tolerate only one disk failure. Since there is a growing demand for higher reliability of user data, an efficient data placement scheme called RM2 has been proposed in [11], which makes a disk array system tolerate double disk failures. Any error correcting code, such as Hamming or Reed-Solomon codes [13], can be used to tolerate more than one disk failure in a disk array system, but it requires that a certain number of disks be exclusively reserved for storing redundant (e.g., parity) information [5], resulting in severe performance degradation due to a parity-update bottleneck. RM2 eliminates this kind of bottleneck by evenly distributing parity and data across all disks.
This paper investigates how to choose an optimal striping unit for an RM2 disk array. Past research has studied how to stripe data in RAID level 0 [2] and RAID level 5 [4]. A disk array simulator incorporating RM2, as well as the RAID levels, as possible data placement schemes has been developed for the purpose of evaluating disk array systems.
The rest of the paper is organized as follows. Section 2 describes basic terminology and how RM2 works. In Section 3, I/O methods are defined for RM2 and our disk array simulator is presented. Section 4 gives simulation results. Finally, Section 5 concludes the paper and presents suggestions for future work.
Here we explain basic terminology used in disk array systems and show examples of RAID level 5 and RM2 in Figure 1.
- Stripe unit: the unit of data interleaving, i.e., a group of logically contiguous blocks that are placed consecutively on a single disk before blocks are placed on a different disk.
- Stripe: a set of N stripe units, each of which has the same physical address on its disk, where N is the number of disks.
- Parity group: a set of data units and a parity unit, where the parity is computed from the set of data units.
- Parity stripe: a minimum set of stripes covering a given parity group.
Figure 1: Basic terminology for disk arrays: (a) RAID level 5, (b) RM2.
As shown in Figure 1, a stripe, a parity group, and its corresponding parity stripe are all equivalent in RAID level 5, but they differ in RM2. Several placement schemes for RAID level 5 have been proposed in [9]; Figure 1 (a) shows the one called left-symmetric, since it produces the best performance among them.
Recall that RAID [12] tolerates only a single disk failure, while there is a growing demand for reliability of user data beyond what current RAIDs can provide; RM2 [11] meets this demand by tolerating double disk failures without any explicit bottleneck.
Figure 1 (b) shows how RM2 places data and parity units when
the number of disks is seven.
Note that we have three partitions in a disk. One of the partitions
is used to store only parity units and the other two partitions
are used to store data units.
Assume that the size of one partition is m and the number of disks is N. Then, the amount of disk space used to store redundant (i.e., parity) information is m x N out of a total capacity of 3 x m x N, i.e., the fraction p = mN / (3mN) = 1/3. This fraction p is called the redundancy rate.
The redundancy rate thus specifies the fraction of disk space that stores only redundant information, i.e., parity units. In Figure 1 (b), P_i denotes the i-th parity unit (representing the i-th parity group), and D_{i,j} denotes a data unit that is included in both the i-th and the j-th parity groups. For example, in order to compute the parity unit P_i, we have to consider the four data units whose index pair contains i. For more details on RM2, refer to [11].
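To make the parity relationship concrete, the following minimal sketch (ours, not taken from [11]) computes a parity unit as the bytewise XOR of the data units in its parity group and recovers a lost unit from the survivors; which data units form a group is determined by RM2's redundancy matrix and is simply assumed here.

```python
# Minimal sketch of parity encoding/recovery over one parity group.
# The grouping itself (which data units D_{i,j} belong to parity P_i)
# is determined by RM2's redundancy matrix [11] and is assumed here.

def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for k, byte in enumerate(block):
            result[k] ^= byte
    return bytes(result)

# Four hypothetical data units of one parity group (cf. Figure 1 (b)).
data_units = [bytes([i] * 8) for i in (1, 2, 3, 4)]
parity = xor_blocks(data_units)            # P_i = XOR of its data units

# Recover a lost data unit from the parity and the surviving data units.
lost = data_units[2]
recovered = xor_blocks([parity] + data_units[:2] + data_units[3:])
assert recovered == lost
```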
We have considered the IBM 0661 3.5'' 320MB SCSI disk drive as the basic disk model when constructing a disk array system. The disk parameters are shown in Table 1. All disks are assumed to be rotationally synchronized, and seek latency is computed in the same way as in [16]:

seek(x) = a * sqrt(x - 1) + b * (x - 1) + c  for x >= 1, with seek(0) = 0,

where x represents the seek distance in cylinders, and a, b, and c are coefficients approximately modelled from the single-cylinder seek time (minSeek), the average seek time (avgSeek), and the full-stroke seek time (maxSeek); numCyls denotes the total number of cylinders.
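As a rough illustration, the sketch below evaluates this seek model; the coefficient fit for a and b from minSeek, avgSeek, maxSeek, and numCyls is the commonly used one for this model and is stated here as an assumption, and the parameter values are placeholders rather than the Table 1 figures.

```python
import math

# Seek-time model: seek(x) = a*sqrt(x-1) + b*(x-1) + c for x >= 1.
# The coefficient fit below is the widely used one for this three-parameter
# model; treat it as an assumption, not a quotation from the paper.
minSeek, avgSeek, maxSeek = 2.0, 12.5, 25.0   # ms, placeholder values
numCyls = 1000                                 # placeholder cylinder count

a = (-10 * minSeek + 15 * avgSeek - 5 * maxSeek) / (3 * math.sqrt(numCyls))
b = (7 * minSeek - 15 * avgSeek + 8 * maxSeek) / (3 * numCyls)
c = minSeek

def seek_time(x):
    """Seek latency in ms for a seek distance of x cylinders."""
    if x == 0:
        return 0.0
    return a * math.sqrt(x - 1) + b * (x - 1) + c

print(seek_time(1), seek_time(numCyls - 1))   # ~minSeek and ~maxSeek
```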
Table 1: IBM 0661 3.5" SCSI disk drive parameters.
As noted by Chen et al. [3], large-scale secondary storage systems such as disk arrays are usually throughput-oriented rather than response-time-oriented. Asynchronous I/O, read caching, and write buffering are widely used in disk arrays in order to improve throughput. So, we consider throughput as the main criterion to evaluate a disk array system. However, when comparing disk array systems, the resulting throughput is highly dependent on the number of disks.
Since the storage efficiency of a disk array varies with the redundancy scheme, the number of disks in each disk array may have to differ in order to provide an equal amount of data capacity. This gives two alternatives for comparing disk arrays with different redundancy schemes, as shown in [1]. First, compare disk arrays under the same cost, meaning that all disk arrays use the same number of disks. But if we fix the number of disks, the data capacities of the disk arrays are not equal because of the different redundancy schemes, which makes it hard to determine input workloads that evaluate all disk arrays fairly. Second, compare disk arrays under the same data capacity, meaning that each disk array has a different number of disks. In this case we can apply any workload to every disk array without a fairness problem. Since overall throughput depends highly on the number of disks used, we adopt `per-disk throughput' as a new criterion, obtained by normalizing overall throughput over the number of disks. Note that `per-disk throughput' is useful when the overall throughput of a disk array depends only on the number of disks.
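As a minimal illustration of this comparison methodology, the sketch below sizes each array for the same data capacity from a storage efficiency and then normalizes throughput per disk; the efficiency and throughput numbers are placeholders, not measured values.

```python
# Sketch of the equal-data-capacity comparison and per-disk throughput.
# Storage efficiencies and the throughput figure are placeholders.
import math

def disks_needed(data_capacity_mb, disk_capacity_mb, storage_efficiency):
    """Smallest disk count whose usable (non-parity) capacity covers the data."""
    return math.ceil(data_capacity_mb / (disk_capacity_mb * storage_efficiency))

def per_disk_throughput(total_iops, num_disks):
    """Overall throughput normalized by the number of disks used."""
    return total_iops / num_disks

# Example: 1920 MB of user data on 320 MB disks (cf. the disk model above).
for scheme, efficiency in [("RAID0", 1.0), ("RAID5", 4 / 5), ("RM2", 2 / 3)]:
    n = disks_needed(1920, 320, efficiency)
    print(scheme, n, per_disk_throughput(600.0, n))  # 600 IOPS is made up
```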
A placement function is given to formally describe a data placement scheme such as RAID level 0, RAID level 5, or RM2. Similarly, a mapping function, which converts a logical address into a physical address for the I/O access methods, can be described. Note that placement functions are equivalent to mapping functions.
After data and parity units are placed by the placement function, a logical address given as a blockNumber is converted into a physical address, denoted (DiskNo, Sector), by the corresponding mapping function. A logical address blockNumber can simply be rewritten as a triple (ParityStripeID, stripeUnitID, blockNumber) defined by the disk array configuration. From now on, a logical address means this triple, without loss of generality.
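The rewriting of a blockNumber into this triple can be sketched as follows; the configuration parameters (blocks per stripe unit, stripe units per parity stripe) and the interpretation of stripeUnitID as a global index are our assumptions for illustration only.

```python
# Sketch: rewrite a logical blockNumber as (ParityStripeID, stripeUnitID, blockNumber).
# blocks_per_su and sus_per_parity_stripe are hypothetical configuration values.

def to_triple(block_number, blocks_per_su, sus_per_parity_stripe):
    stripe_unit_id = block_number // blocks_per_su             # global stripe unit index
    parity_stripe_id = stripe_unit_id // sus_per_parity_stripe # enclosing parity stripe
    return (parity_stripe_id, stripe_unit_id, block_number)

print(to_triple(block_number=100, blocks_per_su=8, sus_per_parity_stripe=24))
# -> (0, 12, 100): block 100 lies in stripe unit 12 of parity stripe 0
```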
The following notations are used to define mapping functions, i.e., placement functions in a disk array system.
In addition to notations shown in Figure 1, we show notations for mapping functions in Figure 2. Here, we see M = 1 and ndata = 4.
Figure 2: Notations for mapping and placement functions in RAID level 5 with
N = 5.
The mapping functions of RAID levels 0 and 5 and of RM2 are given as follows; the two RAID levels are used for comparison with RM2.
For example, Figure 3 shows how RAID level 0 places data units, where N represents the number of disks and each column represents a single disk drive.
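As an illustration, a minimal RAID level 0 mapping is sketched below under the assumption of plain round-robin striping of stripe units; the exact formula of the paper is not reproduced in this text.

```python
# Sketch of a RAID level 0 mapping: round-robin striping of stripe units,
# assuming each block occupies `sectors_per_block` contiguous sectors.

def raid0_map(block_number, blocks_per_su, sectors_per_block, n_disks):
    stripe_unit = block_number // blocks_per_su      # which stripe unit the block is in
    offset = block_number % blocks_per_su            # block offset inside the stripe unit
    disk_no = stripe_unit % n_disks                  # round-robin over the disks
    stripe = stripe_unit // n_disks                  # row (stripe) on that disk
    sector = (stripe * blocks_per_su + offset) * sectors_per_block
    return (disk_no, sector)

print(raid0_map(block_number=37, blocks_per_su=8, sectors_per_block=1, n_disks=5))
# -> (4, 5): block 37 falls in stripe unit 4, which lands on disk 4, sector 5
```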
Lee et al. [9] have proposed several schemes for placing data and parity units in RAID level 5. We consider the left-symmetric placement scheme among them since it produces the best performance under various workloads.
Figure 4 shows how the left-symmetric placement scheme works when the number of disks, N, is 5. Shaded rectangles represent parity units.
Figure 4: Data placement of RAID level 5 according to left-symmetric placement scheme when N = 5.
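For reference, the usual left-symmetric rule can be sketched as below: data units are assigned round-robin over all N disks, and the parity unit of stripe s rotates from the rightmost disk leftwards. We believe this reproduces the layout of Figure 4 for N = 5, but the sketch is our illustration rather than a transcription of [9].

```python
# Sketch of a left-symmetric RAID level 5 layout: data units visit the disks
# round-robin, and the parity unit of stripe s sits on disk (N - 1 - s) mod N.

def left_symmetric(n_disks, n_stripes):
    layout = [["" for _ in range(n_disks)] for _ in range(n_stripes)]
    for s in range(n_stripes):
        layout[s][(n_disks - 1 - s) % n_disks] = "P%d" % s     # rotating parity
    for d in range(n_stripes * (n_disks - 1)):
        layout[d // (n_disks - 1)][d % n_disks] = "D%d" % d    # round-robin data
    return layout

for stripe in left_symmetric(5, 5):
    print(stripe)
# ['D0', 'D1', 'D2', 'D3', 'P0']
# ['D5', 'D6', 'D7', 'P1', 'D4']
# ...
```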
We now explain how RM2 places data and parity units.
RM2 has been proposed in [11],
but we have extended it in order to improve performance
by reducing seek latency overhead.
For example, Figure 5 shows how the original RM2 scheme places data and parity units in the case of N = 7 and p = 33.3%, where N represents the number of disks and p represents the redundancy rate.
Note that each disk has three partitions where one partition stores
only parity units and the other two partitions store only data units.
This causes large seek latencies since a data unit and its corresponding parity units may be placed very far apart.
Therefore, we have modified the original RM2 scheme such that all units included in the same parity group are placed as close together as possible.
Figure 6 shows how data and parity units are placed by our modified RM2 scheme in the same configuration of N = 7 and p = 33.3%.
Note that each parity stripe consists of three stripes,
and one of them is used to store parity units only whereas the other
two are used to store only data units.
Figure 5: Data placement of the original RM2 for N = 7 and p = 33.3%.
The mapping function of our modified RM2 is defined as follows:
Note that, for a given logical address, the parity mapping function yields two physical addresses, (DiskNo1, Sector) and (DiskNo2, Sector), since a data unit belongs to two parity groups in RM2.
Each disk operation such as read and write is processed in the following ways:
In disk write operations, data and parity units must be updated together in order to maintain data consistency regardless of disk failures. Since updating them truly at once is almost impossible to implement, we rely on a locking mechanism to maintain data consistency. Locking information is needed for each parity stripe because a parity stripe is the basic unit of an input/output request. The number of parity stripes in a disk array system is usually very large due to its large data capacity, so we use a hashing method to implement locking efficiently over this large number of parity stripes.
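A minimal sketch of the hashed locking idea is given below, under the assumptions that one lock guards each parity stripe and that a fixed-size hash table keyed by ParityStripeID keeps the lock state small; the class and parameter names are illustrative, not DASim's.

```python
# Sketch: per-parity-stripe locking via a fixed-size hash table. Hashing keeps
# lock state bounded even though the number of parity stripes is very large;
# stripes that hash to the same bucket share a lock (conservative but safe).
import threading

class ParityStripeLocks:
    def __init__(self, n_buckets=1024):
        self.locks = [threading.Lock() for _ in range(n_buckets)]

    def _bucket(self, parity_stripe_id):
        return hash(parity_stripe_id) % len(self.locks)

    def acquire(self, parity_stripe_id):
        self.locks[self._bucket(parity_stripe_id)].acquire()

    def release(self, parity_stripe_id):
        self.locks[self._bucket(parity_stripe_id)].release()

locks = ParityStripeLocks()
locks.acquire(42)        # lock parity stripe 42 before updating data + parity
try:
    pass                 # read old units, compute new parity, write new units
finally:
    locks.release(42)
```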
Figure 7: I/O methods defined for a parity stripe in RAID level 5 :
For a given method, there are some shaded rectangles and a single
black-filled rectangle.
The shaded rectangles represent a request and the black-filled
rectangle represents the corresponding parity unit.
Now, we define I/O methods for RM2, meaning low-level disk access methods that address multiple disks rather than a single disk. All I/O methods are defined for a parity stripe since it is the basic I/O unit. In the case of RAID level 0, no special I/O methods need to be defined since it does not require parity maintenance. Three kinds of I/O methods have been proposed for RAID level 5 by Lee et al. [9], as shown in Figure 7: one is for disk read operations while the other two are for disk write operations. Shaded rectangles represent a request and a black-filled rectangle represents the corresponding parity unit for each method.
Figure 8: I/O methods defined for a parity stripe in RM2 : Note that
a parity stripe in RM2 consists of three stripes.
Shaded rectangles represent data units requested and black-filled
rectangles represent parity units.
Similarly, we define three kinds of I/O methods for RM2, as shown in Figure 8. For a given request, the corresponding parity stripe can be derived. All I/O methods are applied to the parity stripe.
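To see where the six disk operations of a small write come from (discussed with Table 2 below), the sketch applies the standard read-modify-write parity update, new_P = old_P XOR old_D XOR new_D, to both parity groups of a data unit; the operation counting in the comments is our reading of Figure 8.

```python
# Sketch: small (single data unit) read-modify-write in RM2. A data unit
# belongs to two parity groups, so both parity units are updated with the
# usual XOR rule: new_P = old_P ^ old_D ^ new_D.

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def small_write_rm2(old_data, new_data, old_parity1, old_parity2):
    # 3 reads (old data, two old parities) + 3 writes (new data, two new parities)
    delta = xor(old_data, new_data)
    new_parity1 = xor(old_parity1, delta)
    new_parity2 = xor(old_parity2, delta)
    return new_data, new_parity1, new_parity2   # 6 disk operations in total

d_old, d_new = bytes([0x0F] * 4), bytes([0xF0] * 4)
p1, p2 = bytes([0xAA] * 4), bytes([0x55] * 4)
print(small_write_rm2(d_old, d_new, p1, p2))
```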
Figure 9: The first step of the Read-Modify-Write method defined for RM2: reading old data and old parity.
We need a disk array simulator supporting RM2 as well as RAID levels 0 and 5 for our experiments. The disk array simulator RaidSim, developed by Lee et al. [9], supports only RAID levels 0-5. Based on RaidSim, we have developed a disk array simulator called DASim for this purpose. DASim differs from RaidSim in that it supports RM2 and can handle request sizes drawn from given distributions (rather than a constant size). Like RaidSim, all disks are assumed to be rotationally synchronized, and any disk requests accessing contiguous blocks are merged into a single disk request.
Operation flows of DASim are shown in Figure 10.
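A sketch of how non-constant request sizes could be drawn for such a simulator is given below, assuming the distributions named in Section 4 (exp4KB, exp16KB, norm400KB, norm1.5MB) parameterized only by their means; the rounding, truncation, and standard-deviation choices are ours, not DASim's.

```python
# Sketch: drawing request sizes from the workload distributions used later
# (exponential with mean 4KB/16KB, normal with mean 400KB/1.5MB).
# Truncation to at least one block and the std-dev values are assumptions.
import random

BLOCK = 512  # bytes

def exp_request_size(mean_bytes):
    return max(BLOCK, int(random.expovariate(1.0 / mean_bytes)) // BLOCK * BLOCK)

def norm_request_size(mean_bytes, stddev_bytes):
    return max(BLOCK, int(random.gauss(mean_bytes, stddev_bytes)) // BLOCK * BLOCK)

workloads = {
    "exp4KB":    lambda: exp_request_size(4 * 1024),
    "exp16KB":   lambda: exp_request_size(16 * 1024),
    "norm400KB": lambda: norm_request_size(400 * 1024, 40 * 1024),
    "norm1.5MB": lambda: norm_request_size(1536 * 1024, 150 * 1024),
}

for name, draw in workloads.items():
    print(name, [draw() for _ in range(3)])
```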
By simulation, we show how RM2 affects overall throughput compared with RAID levels 0 and 5. The data capacities of the disk array systems are made equal by setting the number of data disks to six. Since the size of the stripe unit (SU size) significantly affects the overall performance of a disk array, we first find an optimal SU size for a given workload by simulation. An input workload is specified by request size, read/write ratio, and the degree of concurrency. The degree of concurrency is defined as the number of processes generating input/output requests.
Input workloads considered for experiments are given as follows:
Table 2 shows the maximum data throughput obtained theoretically in RAID level 5 and RM2 relative to RAID level 0 (i.e., with RAID level 0 taken as one). For simple analysis, we consider only two kinds of request size: small and large requests. `Small' means that the request size is equal to a single stripe unit, whereas `large' means that it is equal to the size of a full parity stripe.
N represents the number of disks, and M represents the number of
stripes in a single parity stripe.
Storage efficiency means the ratio of data capacity to total disk capacity. Remember that six disk operations are required for a small write in RM2, since disk read and write operations have to be completed on the data unit as well as on two parity units. That is why a relative throughput of 1/6 is obtained for small writes in RM2. In the case of a large write (a full parity stripe), disk read operations on the data unit and the two parity units are not required, since the new parity can be computed from the data units being written. So, M disk write operations are enough in order to write (M-1) data units.
Table 2: Theoretical maximum data throughputs of RAID level 5 and RM2 relative to RAID level 0. N represents the number of disks and M represents the number of stripes in a single parity stripe.
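The sketch below reproduces the counting argument of the preceding paragraph under per-request assumptions, ignoring normalization over the number of disks: a small write costs 4 disk operations in RAID level 5 (the standard read-modify-write count) and 6 in RM2, while a large write carries (N-1)/N or (M-1)/M useful data. These ratios illustrate the argument; they are not a transcription of Table 2.

```python
# Sketch of the operation-counting argument behind Table 2 (per request,
# relative to RAID level 0, ignoring normalization over the number of disks).

SMALL_WRITE_OPS = {"RAID0": 1, "RAID5": 4, "RM2": 6}

def small_write_relative(scheme):
    # RAID0 needs 1 disk operation; RAID5 needs 4 (read/write data + parity);
    # RM2 needs 6 (read/write data + two parities) -> 1/6 relative throughput.
    return SMALL_WRITE_OPS["RAID0"] / SMALL_WRITE_OPS[scheme]

def large_write_relative(scheme, N=7, M=3):
    # Full-parity-stripe write: parity units are written alongside the data,
    # so the useful fraction is (data units written) / (units written).
    if scheme == "RAID0":
        return 1.0
    if scheme == "RAID5":
        return (N - 1) / N        # one parity unit per stripe of N units
    if scheme == "RM2":
        return (M - 1) / M        # one parity stripe per M stripes

for s in ("RAID0", "RAID5", "RM2"):
    print(s, small_write_relative(s), large_write_relative(s))
```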
We investigate how SU sizes affect overall system performance for various workloads in order to determine an optimal SU size. Chen et al. [2,4] have presented how to determine an optimal SU size in RAID levels 0 and 5. As in [2,4], request sizes are assumed to have the following distributions: exp4KB, exp16KB, norm400KB, and norm1.5MB. Since it is very hard to model a system's workload exactly, an optimal SU size should work well even when the system workload is unpredictable. That is why various workloads are considered to determine an optimal SU size.
Figure 11, Figure 12, and Figure 13 show the performance relative to the maximum data throughput of RM2, RAID level 5, and RAID level 0, respectively, for various workloads specified by request size distribution, read/write ratio, and the degree of concurrency. From these results, an optimal SU size that works for various workloads is determined, as shown in Table 3. The values inside parentheses are the optimal values obtained directly from the simulation results, whereas the values outside parentheses are the powers of two nearest to them, which are what would typically be used in practice. Note that the read/write ratio does not affect the overall throughput of RAID level 0, since RAID level 0 has no parity update overhead.
Table 3: Optimal SU sizes applicable for various workloads with a specified
read/write ratio.
Figure 11: Relative per-disk throughput of RM2 with respect to maximum vs. SU sizes for various workloads with read/write ratios of (a) 1.0/0.0, (b) 0.6/0.4, (c) 0.3/0.7, (d) 0.0/1.0.
Figure 13: Relative per-disk throughput of RAID0 with respect to maximum vs.
SU sizes for various workloads.
In the case of large requests (norm400KB and norm1.5MB), the optimal striping unit turns out to be 8KB, i.e., one third of a single track, whenever disk write operations are involved (write ratios of 40, 70, and 100%). We now explain this behavior in detail.
Figure 14: Relationships between a parity stripe and a disk track:
(a) Overall diagram of a parity stripe when the number of disks is 9, a single parity stripe consists of one stripe of parity units and two stripes of data units, and an input request equivalent to 16 stripe units has arrived.
(b) Diagram of a 24KB disk track when the stripe unit is 8KB, i.e., three stripe units fit in a track.
(c) Diagram of a 24KB disk track when the stripe unit is 4KB, i.e., six stripe units fit in a track.
Figure 14 (a) shows the overall diagram of a parity stripe consisting of one stripe of parity units and two stripes of data units, i.e., M = 3; the number of disks is 9. The Read-Modify-Write method converts an input request into pairs of disk read and write operations on the corresponding stripe units of each disk involved. With M = 3, under the Read-Modify-Write I/O access method some disks have to handle three stripe units while others handle two. As the input request size increases, the number of disks handling three stripe units increases. If these three stripe units can be stored in a single track, then the additional rotational delay required by the Read-Modify-Write method is greatly reduced. As shown in Figure 14 (b), if we choose 8KB as the stripe unit size and a track is 24KB, then three stripe units can be stored in a single track. However, if we choose 4KB as the stripe unit size, then six stripe units are stored in a single track, and in this case we cannot avoid the additional rotational delay incurred when disk write operations follow disk read operations on the same stripe units.
That is, when the request size is large enough to cover a whole parity stripe
consisting of M stripes,
we get the best per-disk throughputs by adjusting
the size of a stripe unit such that a single track
can store M stripe units.
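A small sketch of this rule follows, assuming the 24KB track of the disk model above: the optimal stripe unit for large writes is the track size divided by M, and an exact fit exists only when the track size is a multiple of M. The (N, M) pairs in the example are illustrative assumptions, not values quoted from Table 4.

```python
# Sketch: choose the stripe unit so that one track holds exactly M stripe units.
TRACK_KB = 24  # track size of the disk model used here (24KB)

def optimal_su_kb(M):
    """Stripe unit size (KB) such that M stripe units fill one track, if possible."""
    if TRACK_KB % M != 0:
        return None           # M stripe units cannot fit a track exactly
    return TRACK_KB // M

# The M value per disk count is illustrative; the paper derives M from N for RM2.
for N, M in [(9, 3), (12, 4), (15, 5), (18, 6), (21, 7), (24, 8)]:
    print(N, M, optimal_su_kb(M))
# Some configurations give None: no exact fit, cf. the note under Table 4.
```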
Table 4 is obtained analytically. It shows the optimal stripe unit size for a given N and M such that a single track stores M stripe units, where N represents the number of disks and M represents the number of stripes in a single parity stripe. Note that for N = 15 and 21 a track cannot be divided evenly into M stripe units, i.e., M stripe units cannot be stored exactly in a single track.
From the experimental results shown in Figure 15, we can see that our analysis of the optimal stripe unit size in Table 4 is correct, particularly when N = 9, 12, 18, and 24 (i.e., when M stripe units fit exactly in a single track).
Figure 15: SU size vs per-disk throughputs when the number of disks is 9, 12,
15, 18, 21, and 24.
Table 4: Optimal SU size vs. M in RM2 disk array for
large requests: N represents the number of disks and M
represents the number of stripes in a single parity stripe.
An efficient data placement scheme, RM2, has been proposed in [11]; it significantly improves the reliability of a disk array system by tolerating double disk failures. In this paper, we have presented how to choose an optimal striping unit for RM2. A disk array simulator called DASim has been developed from RaidSim [9] for the experimental work. The simulations show that the optimal striping unit of RM2 is two and a half tracks when only disk read operations are involved, and one third of a single track when any disk write operations are involved.