Chan-Ik Park
cipark@vision.postech.ac.kr
Tae-Young Choe
In the past few years, the performance of processors has been growing steadily, doubling approximately every two years [6], and with the advance of parallel processing technology, the processing power of a computer system is expected to keep increasing. However, I/O performance has fallen far behind processing power. It is therefore very important to balance the I/O bandwidth and the processing power in order to improve the overall performance of a system [6,14].
There has been much research on improving I/O performance [7,8,12,14,15], known as data declustering and disk striping in disk array systems. All of these techniques exploit the parallelism among many disks. However, incorporating such a large number of disks makes a disk array system more prone to failure than a single disk [12]. High reliability can be achieved by using parity encoding of data to survive disk failures. For example, RAID technology [12] makes a disk array system tolerate only one disk failure. Since there is a growing demand for higher reliability of user data, an efficient data placement scheme called RM2 has been proposed in [11], which makes a disk array system tolerate double disk failures. Any error correcting code, such as Hamming or Reed-Solomon codes [13], can be used to tolerate more than one disk failure in a disk array system, but it requires that a certain number of disks be exclusively reserved for storing redundant (e.g., parity) information [5], resulting in severe performance degradation due to a parity-update bottleneck. RM2 eliminates this kind of bottleneck by evenly distributing parity and data across all disks.
This paper investigates how to choose an optimal striping unit for an RM2 disk array. Past research has studied how to stripe data in RAID level 0 [2] and RAID level 5 [4]. A disk array simulator incorporating RM2, as well as the RAID levels, as possible data placement schemes has been developed for the purpose of evaluating disk array systems.
The rest of the paper is organized as follows. Section 2 describes basic terminology and how RM2 works. In Section 3, I/O methods are defined for RM2 and our disk array simulator is presented. Section 4 gives simulation results. Finally, Section 5 concludes the paper and presents suggestions for future work.
Here we explain basic terminology used in disk array systems and show examples of RAID level 5 and RM2 in Figure 1.
- Stripe unit: the unit of data interleaving, i.e., a group of logically contiguous blocks that are placed consecutively on a single disk before blocks are placed on a different disk.
- Stripe: a set of N stripe units, each of which has the same physical address on its disk, where N is the number of disks.
- Parity group: a set of data units and a parity unit, where the parity is computed from the set of data units.
- Parity stripe: a minimum set of stripes covering a given parity group.
Figure 1: Basic terminology for disk arrays: (a) RAID level 5, (b) RM2.
As shown in Figure 1, a stripe, a parity group, and its corresponding parity stripe are all equivalent in RAID level 5, but they differ in RM2. Several placement schemes for RAID level 5 have been proposed in [9]; Figure 1 (a) shows the one called left-symmetric, since it produces the best performance among them.
Recall that RAID [12] tolerates only a single disk failure, while there is a growing demand for reliability of user data beyond what current RAIDs can provide; RM2 [11] meets this demand by tolerating double disk failures without any explicit bottleneck.
Figure 1 (b) shows how RM2 places data and parity units when
the number of disks is seven.
Note that we have three partitions in a disk. One of the partitions
is used to store only parity units and the other two partitions
are used to store data units.
Assume that the size of one partition is m and the number of disks is N. Then, the amount of disk space used to store redundant (i.e., parity) information is m x N out of a total capacity of 3 x m x N, i.e., the fraction p = mN / (3mN) = 1/3. This fraction p is called the redundancy rate.
The redundancy rate thus specifies the fraction of disk space that stores only redundant information, i.e., parity units. In Figure 1 (b), P_i denotes the i-th parity unit (representing the i-th parity group), and D_{i,j} denotes a data unit that is included in both the i-th and the j-th parity groups. For example, in order to compute the parity unit P_i, we have to consider the four data units whose index pair contains i. For more details on RM2, refer to [11].
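To make the parity relationship concrete, the following minimal sketch (ours, not taken from [11]) computes a parity unit as the bytewise XOR of the data units in its parity group and recovers a lost unit from the survivors; which data units form a group is determined by RM2's redundancy matrix and is simply assumed here.

```python
# Minimal sketch of parity encoding/recovery over one parity group.
# The grouping itself (which data units D_{i,j} belong to parity P_i)
# is determined by RM2's redundancy matrix [11] and is assumed here.

def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for k, byte in enumerate(block):
            result[k] ^= byte
    return bytes(result)

# Four hypothetical data units of one parity group (cf. Figure 1 (b)).
data_units = [bytes([i] * 8) for i in (1, 2, 3, 4)]
parity = xor_blocks(data_units)            # P_i = XOR of its data units

# Recover a lost data unit from the parity and the surviving data units.
lost = data_units[2]
recovered = xor_blocks([parity] + data_units[:2] + data_units[3:])
assert recovered == lost
```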
We have considered the IBM 0661 3.5'' 320MB SCSI disk drive as the basic disk model when constructing a disk array system. The disk parameters are shown in Table 1. All disks are assumed to be rotationally synchronized, and seek latency is computed in the same way as in [16]:

seek(x) = a * sqrt(x - 1) + b * (x - 1) + c  for x >= 1, with seek(0) = 0,

where x represents the seek distance in cylinders, and a, b, and c are coefficients approximately modelled from the single-cylinder seek time (minSeek), the average seek time (avgSeek), and the full-stroke seek time (maxSeek); numCyls denotes the total number of cylinders.
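As a rough illustration, the sketch below evaluates this seek model; the coefficient fit for a and b from minSeek, avgSeek, maxSeek, and numCyls is the commonly used one for this model and is stated here as an assumption, and the parameter values are placeholders rather than the Table 1 figures.

```python
import math

# Seek-time model: seek(x) = a*sqrt(x-1) + b*(x-1) + c for x >= 1.
# The coefficient fit below is the widely used one for this three-parameter
# model; treat it as an assumption, not a quotation from the paper.
minSeek, avgSeek, maxSeek = 2.0, 12.5, 25.0   # ms, placeholder values
numCyls = 1000                                 # placeholder cylinder count

a = (-10 * minSeek + 15 * avgSeek - 5 * maxSeek) / (3 * math.sqrt(numCyls))
b = (7 * minSeek - 15 * avgSeek + 8 * maxSeek) / (3 * numCyls)
c = minSeek

def seek_time(x):
    """Seek latency in ms for a seek distance of x cylinders."""
    if x == 0:
        return 0.0
    return a * math.sqrt(x - 1) + b * (x - 1) + c

print(seek_time(1), seek_time(numCyls - 1))   # ~minSeek and ~maxSeek
```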
Table 1: IBM 0661 3.5" SCSI disk drive parameters.
As noted by Chen et al. [3], large-scale secondary storage systems such as disk arrays are usually throughput-oriented rather than response-time-oriented. Asynchronous I/O, read caching, and write buffering are widely used in disk arrays in order to improve throughput. So, we consider throughput as the main criterion to evaluate a disk array system. However, when comparing disk array systems, the resulting throughput is highly dependent on the number of disks.
Since the storage efficiency of a disk array varies with the redundancy scheme, the number of disks in each disk array may have to differ in order to provide an equal amount of data capacity. This gives two alternatives for comparing disk arrays with different redundancy schemes, as shown in [1]. First, compare disk arrays under the same cost, meaning that all disk arrays use the same number of disks. But if we fix the number of disks, the data capacities of the disk arrays are not equal because of the different redundancy schemes, which makes it hard to determine input workloads that evaluate all disk arrays fairly. Second, compare disk arrays under the same data capacity, meaning that each disk array has a different number of disks. In this case we can apply any workload to every disk array without a fairness problem. Since overall throughput depends highly on the number of disks used, we adopt `per-disk throughput' as a new criterion, obtained by normalizing overall throughput over the number of disks. Note that `per-disk throughput' is useful when the overall throughput of a disk array depends only on the number of disks.
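As a minimal illustration of this comparison methodology, the sketch below sizes each array for the same data capacity from a storage efficiency and then normalizes throughput per disk; the efficiency and throughput numbers are placeholders, not measured values.

```python
# Sketch of the equal-data-capacity comparison and per-disk throughput.
# Storage efficiencies and the throughput figure are placeholders.
import math

def disks_needed(data_capacity_mb, disk_capacity_mb, storage_efficiency):
    """Smallest disk count whose usable (non-parity) capacity covers the data."""
    return math.ceil(data_capacity_mb / (disk_capacity_mb * storage_efficiency))

def per_disk_throughput(total_iops, num_disks):
    """Overall throughput normalized by the number of disks used."""
    return total_iops / num_disks

# Example: 1920 MB of user data on 320 MB disks (cf. the disk model above).
for scheme, efficiency in [("RAID0", 1.0), ("RAID5", 4 / 5), ("RM2", 2 / 3)]:
    n = disks_needed(1920, 320, efficiency)
    print(scheme, n, per_disk_throughput(600.0, n))  # 600 IOPS is made up
```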
A placement function is given to formally describe a data placement scheme such as RAID level 0, RAID level 5, or RM2. Similarly, a mapping function, which converts a logical address into a physical address for the I/O access methods, can be described. Note that placement functions are equivalent to mapping functions.
After data and parity units are placed by the placement function, a logical address given as a blockNumber is converted into a physical address, denoted (DiskNo, Sector), by the corresponding mapping function. A logical address blockNumber can simply be rewritten as a triple (ParityStripeID, stripeUnitID, blockNumber) defined by the disk array configuration. From now on, a logical address means this triple, without loss of generality.
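The rewriting of a blockNumber into this triple can be sketched as follows; the configuration parameters (blocks per stripe unit, stripe units per parity stripe) and the interpretation of stripeUnitID as a global index are our assumptions for illustration only.

```python
# Sketch: rewrite a logical blockNumber as (ParityStripeID, stripeUnitID, blockNumber).
# blocks_per_su and sus_per_parity_stripe are hypothetical configuration values.

def to_triple(block_number, blocks_per_su, sus_per_parity_stripe):
    stripe_unit_id = block_number // blocks_per_su             # global stripe unit index
    parity_stripe_id = stripe_unit_id // sus_per_parity_stripe # enclosing parity stripe
    return (parity_stripe_id, stripe_unit_id, block_number)

print(to_triple(block_number=100, blocks_per_su=8, sus_per_parity_stripe=24))
# -> (0, 12, 100): block 100 lies in stripe unit 12 of parity stripe 0
```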
The following notations are used to define mapping functions, i.e., placement functions in a disk array system.
In addition to notations shown in Figure 1, we show notations for mapping functions in Figure 2. Here, we see M = 1 and ndata = 4.
Figure 2: Notations for mapping and placement functions in RAID level 5 with
N = 5.
The mapping functions of RAID levels 0 and 5 and of RM2 are given as follows; the two RAID levels are used for comparison with RM2.
For example, Figure 3 shows how RAID level 0 places data units, where N represents the number of disks and each column represents a single disk drive.
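As an illustration, a minimal RAID level 0 mapping is sketched below under the assumption of plain round-robin striping of stripe units; the exact formula of the paper is not reproduced in this text.

```python
# Sketch of a RAID level 0 mapping: round-robin striping of stripe units,
# assuming each block occupies `sectors_per_block` contiguous sectors.

def raid0_map(block_number, blocks_per_su, sectors_per_block, n_disks):
    stripe_unit = block_number // blocks_per_su      # which stripe unit the block is in
    offset = block_number % blocks_per_su            # block offset inside the stripe unit
    disk_no = stripe_unit % n_disks                  # round-robin over the disks
    stripe = stripe_unit // n_disks                  # row (stripe) on that disk
    sector = (stripe * blocks_per_su + offset) * sectors_per_block
    return (disk_no, sector)

print(raid0_map(block_number=37, blocks_per_su=8, sectors_per_block=1, n_disks=5))
# -> (4, 5): block 37 falls in stripe unit 4, which lands on disk 4, sector 5
```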
Lee et al. [9] have proposed several schemes for placing data and parity units in RAID level 5. We consider the left-symmetric placement scheme among them since it produces the best performance under various workloads.
Figure 4 shows how the left-symmetric placement scheme works when the number of disks, N, is 5. Shaded rectangles represent parity units.
Figure 4: Data placement of RAID level 5 according to left-symmetric placement scheme when N = 5.
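For reference, the usual left-symmetric rule can be sketched as below: data units are assigned round-robin over all N disks, and the parity unit of stripe s rotates from the rightmost disk leftwards. We believe this reproduces the layout of Figure 4 for N = 5, but the sketch is our illustration rather than a transcription of [9].

```python
# Sketch of a left-symmetric RAID level 5 layout: data units visit the disks
# round-robin, and the parity unit of stripe s sits on disk (N - 1 - s) mod N.

def left_symmetric(n_disks, n_stripes):
    layout = [["" for _ in range(n_disks)] for _ in range(n_stripes)]
    for s in range(n_stripes):
        layout[s][(n_disks - 1 - s) % n_disks] = "P%d" % s     # rotating parity
    for d in range(n_stripes * (n_disks - 1)):
        layout[d // (n_disks - 1)][d % n_disks] = "D%d" % d    # round-robin data
    return layout

for stripe in left_symmetric(5, 5):
    print(stripe)
# ['D0', 'D1', 'D2', 'D3', 'P0']
# ['D5', 'D6', 'D7', 'P1', 'D4']
# ...
```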
We now explain how RM2 places data and parity units.
RM2 has been proposed in [11],
but we have extended it in order to improve performance
by reducing seek latency overhead.
For example, Figure 5 shows how the original RM2 scheme places data and parity units in the case of N = 7 and p = 33.3%, where N represents the number of disks and p represents the redundancy rate.
Note that each disk has three partitions where one partition stores
only parity units and the other two partitions store only data units.
This causes large seek latencies since a data unit and its corresponding parity units may be placed very far apart.
Therefore, we have modified the original RM2 scheme such that all units included in the same parity group are placed as close together as possible.
Figure 6 shows how data and parity units are placed by our modified RM2 scheme in the same configuration of N = 7 and p = 33.3%.
Note that each parity stripe consists of three stripes,
and one of them is used to store parity units only whereas the other
two are used to store only data units.
Figure 5: Data placement of the original RM2 for N = 7 and p = 33.3%.
The mapping function of our modified RM2 is defined as follows:
Note that, for a given logical address, the parity mapping function yields two physical addresses, (DiskNo1, Sector) and (DiskNo2, Sector), since a data unit belongs to two parity groups in RM2.
Each disk operation such as read and write is processed in the following ways:
In disk write operations, data and parity units must be updated together in order to maintain data consistency regardless of disk failures. Since updating them truly at once is almost impossible to implement, we rely on a locking mechanism to maintain data consistency. Locking information is needed for each parity stripe because a parity stripe is the basic unit of an input/output request. The number of parity stripes in a disk array system is usually very large due to its large data capacity, so we use a hashing method to implement locking efficiently over this large number of parity stripes.
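A minimal sketch of the hashed locking idea is given below, under the assumptions that one lock guards each parity stripe and that a fixed-size hash table keyed by ParityStripeID keeps the lock state small; the class and parameter names are illustrative, not DASim's.

```python
# Sketch: per-parity-stripe locking via a fixed-size hash table. Hashing keeps
# lock state bounded even though the number of parity stripes is very large;
# stripes that hash to the same bucket share a lock (conservative but safe).
import threading

class ParityStripeLocks:
    def __init__(self, n_buckets=1024):
        self.locks = [threading.Lock() for _ in range(n_buckets)]

    def _bucket(self, parity_stripe_id):
        return hash(parity_stripe_id) % len(self.locks)

    def acquire(self, parity_stripe_id):
        self.locks[self._bucket(parity_stripe_id)].acquire()

    def release(self, parity_stripe_id):
        self.locks[self._bucket(parity_stripe_id)].release()

locks = ParityStripeLocks()
locks.acquire(42)        # lock parity stripe 42 before updating data + parity
try:
    pass                 # read old units, compute new parity, write new units
finally:
    locks.release(42)
```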
Figure 7: I/O methods defined for a parity stripe in RAID level 5 :
For a given method, there are some shaded rectangles and a single
black-filled rectangle.
The shaded rectangles represent a request and the black-filled
rectangle represents the corresponding parity unit.
Now, we define I/O methods for RM2, meaning low-level disk access methods that address multiple disks rather than a single disk. All I/O methods are defined for a parity stripe since it is the basic I/O unit. In the case of RAID level 0, no special I/O methods need to be defined since it does not require parity maintenance. Three kinds of I/O methods have been proposed for RAID level 5 by Lee et al. [9], as shown in Figure 7: one is for disk read operations while the other two are for disk write operations. Shaded rectangles represent a request and a black-filled rectangle represents the corresponding parity unit for each method.
Figure 8: I/O methods defined for a parity stripe in RM2 : Note that
a parity stripe in RM2 consists of three stripes.
Shaded rectangles represent data units requested and black-filled
rectangles represent parity units.
Similarly, we define three kinds of I/O methods for RM2, as shown in Figure 8. For a given request, the corresponding parity stripe can be derived. All I/O methods are applied to the parity stripe.
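To see where the six disk operations of a small write come from (discussed with Table 2 below), the sketch applies the standard read-modify-write parity update, new_P = old_P XOR old_D XOR new_D, to both parity groups of a data unit; the operation counting in the comments is our reading of Figure 8.

```python
# Sketch: small (single data unit) read-modify-write in RM2. A data unit
# belongs to two parity groups, so both parity units are updated with the
# usual XOR rule: new_P = old_P ^ old_D ^ new_D.

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def small_write_rm2(old_data, new_data, old_parity1, old_parity2):
    # 3 reads (old data, two old parities) + 3 writes (new data, two new parities)
    delta = xor(old_data, new_data)
    new_parity1 = xor(old_parity1, delta)
    new_parity2 = xor(old_parity2, delta)
    return new_data, new_parity1, new_parity2   # 6 disk operations in total

d_old, d_new = bytes([0x0F] * 4), bytes([0xF0] * 4)
p1, p2 = bytes([0xAA] * 4), bytes([0x55] * 4)
print(small_write_rm2(d_old, d_new, p1, p2))
```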
Figure 9: The first step of the Read-Modify-Write method defined for RM2: reading old data and old parity.
We need a disk array simulator supporting RM2 as well as RAID levels 0 and 5 for our experiments. The disk array simulator RaidSim, developed by Lee et al. [9], supports only RAID levels 0-5. Based on RaidSim, we have developed a disk array simulator called DASim for this purpose. DASim differs from RaidSim in that it supports RM2 and can handle request sizes drawn from given distributions (rather than a constant size). Like RaidSim, all disks are assumed to be rotationally synchronized, and any disk requests accessing contiguous blocks are merged into a single disk request.
Operation flows of DASim are shown in Figure 10.
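A sketch of how non-constant request sizes could be drawn for such a simulator is given below, assuming the distributions named in Section 4 (exp4KB, exp16KB, norm400KB, norm1.5MB) parameterized only by their means; the rounding, truncation, and standard-deviation choices are ours, not DASim's.

```python
# Sketch: drawing request sizes from the workload distributions used later
# (exponential with mean 4KB/16KB, normal with mean 400KB/1.5MB).
# Truncation to at least one block and the std-dev values are assumptions.
import random

BLOCK = 512  # bytes

def exp_request_size(mean_bytes):
    return max(BLOCK, int(random.expovariate(1.0 / mean_bytes)) // BLOCK * BLOCK)

def norm_request_size(mean_bytes, stddev_bytes):
    return max(BLOCK, int(random.gauss(mean_bytes, stddev_bytes)) // BLOCK * BLOCK)

workloads = {
    "exp4KB":    lambda: exp_request_size(4 * 1024),
    "exp16KB":   lambda: exp_request_size(16 * 1024),
    "norm400KB": lambda: norm_request_size(400 * 1024, 40 * 1024),
    "norm1.5MB": lambda: norm_request_size(1536 * 1024, 150 * 1024),
}

for name, draw in workloads.items():
    print(name, [draw() for _ in range(3)])
```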
By simulation, we show how RM2 affects overall throughput compared with RAID levels 0 and 5. The data capacities of the disk array systems are made equal by setting the number of data disks to six. Since the size of the stripe unit (SU size) significantly affects the overall performance of a disk array, we first find an optimal SU size for a given workload by simulation. An input workload is specified by request size, read/write ratio, and the degree of concurrency. The degree of concurrency is defined as the number of processes generating input/output requests.
Input workloads considered for experiments are given as follows:
Table 2 shows the maximum data throughput obtained theoretically in RAID level 5 and RM2 relative to RAID level 0 (i.e., with RAID level 0 taken as one). For simple analysis, we consider only two kinds of request size: small and large requests. `Small' means that the request size is equal to a single stripe unit, whereas `large' means that it is equal to the size of a full parity stripe.
N represents the number of disks, and M represents the number of
stripes in a single parity stripe.
Storage efficiency means the ratio of data capacity to total disk capacity. Remember that six disk operations are required for a small write in RM2, since disk read and write operations have to be completed on the data unit as well as on two parity units. That is why a relative throughput of 1/6 is obtained for small writes in RM2. In the case of a large write (a full parity stripe), disk read operations on the data unit and the two parity units are not required, since the new parity can be computed from the data units being written. So, M disk write operations are enough in order to write (M-1) data units.
Table 2: Theoretical maximum data throughputs of RAID level 5 and RM2 relative to RAID level 0. N represents the number of disks and M represents the number of stripes in a single parity stripe.
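The sketch below reproduces the counting argument of the preceding paragraph under per-request assumptions, ignoring normalization over the number of disks: a small write costs 4 disk operations in RAID level 5 (the standard read-modify-write count) and 6 in RM2, while a large write carries (N-1)/N or (M-1)/M useful data. These ratios illustrate the argument; they are not a transcription of Table 2.

```python
# Sketch of the operation-counting argument behind Table 2 (per request,
# relative to RAID level 0, ignoring normalization over the number of disks).

SMALL_WRITE_OPS = {"RAID0": 1, "RAID5": 4, "RM2": 6}

def small_write_relative(scheme):
    # RAID0 needs 1 disk operation; RAID5 needs 4 (read/write data + parity);
    # RM2 needs 6 (read/write data + two parities) -> 1/6 relative throughput.
    return SMALL_WRITE_OPS["RAID0"] / SMALL_WRITE_OPS[scheme]

def large_write_relative(scheme, N=7, M=3):
    # Full-parity-stripe write: parity units are written alongside the data,
    # so the useful fraction is (data units written) / (units written).
    if scheme == "RAID0":
        return 1.0
    if scheme == "RAID5":
        return (N - 1) / N        # one parity unit per stripe of N units
    if scheme == "RM2":
        return (M - 1) / M        # one parity stripe per M stripes

for s in ("RAID0", "RAID5", "RM2"):
    print(s, small_write_relative(s), large_write_relative(s))
```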
We investigate how SU sizes affect overall system performance for various workloads in order to determine an optimal SU size. Chen et al. [2,4] have presented how to determine an optimal SU size in RAID levels 0 and 5. As in [2,4], request sizes are assumed to have the following distributions: exp4KB, exp16KB, norm400KB, and norm1.5MB. Since it is very hard to model a system's workload exactly, an optimal SU size should work well even when the system workload is unpredictable. That is why various workloads are considered to determine an optimal SU size.
Figure 11, Figure 12, and Figure 13 show the performance relative to the maximum data throughput of RM2, RAID level 5, and RAID level 0, respectively, for various workloads specified by request size distribution, read/write ratio, and the degree of concurrency. From these results, an optimal SU size that works for various workloads is determined, as shown in Table 3. The values inside parentheses are the optimal values obtained directly from the simulation results, whereas the values outside parentheses are the powers of two nearest to them, which are what would typically be used in practice. Note that the read/write ratio does not affect the overall throughput of RAID level 0, since RAID level 0 has no parity update overhead.
Table 3: Optimal SU sizes applicable for various workloads with a specified
read/write ratio.
Figure 11: Relative per-disk throughput of RM2 with respect to maximum vs. SU sizes for various workloads with read/write ratios of (a) 1.0/0.0, (b) 0.6/0.4, (c) 0.3/0.7, (d) 0.0/1.0.
Figure 13: Relative per-disk throughput of RAID0 with respect to maximum vs.
SU sizes for various workloads.
In the case of large requests (norm400KB and norm1.5MB), the optimal striping unit turns out to be 8KB, i.e., one third of a single track, whenever disk write operations are involved (write ratios of 40, 70, and 100%). We now explain this behavior in detail.
Figure 14: Relationships between a parity stripe and a disk track:
(a) Overall diagram of a parity stripe when the number of disks is 9, a single parity stripe consists of one stripe of parity units and two stripes of data units, and an input request equivalent to 16 stripe units has arrived.
(b) Diagram of a 24KB disk track when the stripe unit is 8KB, i.e., three stripe units fit in a track.
(c) Diagram of a 24KB disk track when the stripe unit is 4KB, i.e., six stripe units fit in a track.
Figure 14 (a) shows the overall diagram of a parity stripe consisting of one stripe of parity units and two stripes of data units, i.e., M = 3; the number of disks is 9. The Read-Modify-Write method converts an input request into pairs of disk read and write operations on the corresponding stripe units of each disk involved. With M = 3, under the Read-Modify-Write I/O access method some disks have to handle three stripe units while others handle two. As the input request size increases, the number of disks handling three stripe units increases. If these three stripe units can be stored in a single track, then the additional rotational delay required by the Read-Modify-Write method is greatly reduced. As shown in Figure 14 (b), if we choose 8KB as the stripe unit size and a track is 24KB, then three stripe units can be stored in a single track. However, if we choose 4KB as the stripe unit size, then six stripe units are stored in a single track, and in this case we cannot avoid the additional rotational delay incurred when disk write operations follow disk read operations on the same stripe units.
That is, when the request size is large enough to cover a whole parity stripe
consisting of M stripes,
we get the best per-disk throughputs by adjusting
the size of a stripe unit such that a single track
can store M stripe units.
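A small sketch of this rule follows, assuming the 24KB track of the disk model above: the optimal stripe unit for large writes is the track size divided by M, and an exact fit exists only when the track size is a multiple of M. The (N, M) pairs in the example are illustrative assumptions, not values quoted from Table 4.

```python
# Sketch: choose the stripe unit so that one track holds exactly M stripe units.
TRACK_KB = 24  # track size of the disk model used here (24KB)

def optimal_su_kb(M):
    """Stripe unit size (KB) such that M stripe units fill one track, if possible."""
    if TRACK_KB % M != 0:
        return None           # M stripe units cannot fit a track exactly
    return TRACK_KB // M

# The M value per disk count is illustrative; the paper derives M from N for RM2.
for N, M in [(9, 3), (12, 4), (15, 5), (18, 6), (21, 7), (24, 8)]:
    print(N, M, optimal_su_kb(M))
# Some configurations give None: no exact fit, cf. the note under Table 4.
```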
Table 4 is obtained analytically. It shows the optimal stripe unit size for a given N and M such that a single track stores M stripe units, where N represents the number of disks and M represents the number of stripes in a single parity stripe. Note that for N = 15 and 21 a track cannot be divided evenly into M stripe units, i.e., M stripe units cannot be stored exactly in a single track.
From the experimental results shown in Figure 15, we can see that our analysis of the optimal stripe unit size in Table 4 is correct, particularly when N = 9, 12, 18, and 24 (i.e., when M stripe units fit exactly in a single track).
Figure 15: SU size vs per-disk throughputs when the number of disks is 9, 12,
15, 18, 21, and 24.
Table 4: Optimal SU size vs. M in RM2 disk array for
large requests: N represents the number of disks and M
represents the number of stripes in a single parity stripe.
An efficient data placement scheme, RM2, has been proposed in [11]; it significantly improves the reliability of a disk array system by tolerating double disk failures. In this paper, we have presented how to choose an optimal striping unit for RM2. A disk array simulator called DASim has been developed from RaidSim [9] for the experimental work. The simulations show that the optimal striping unit of RM2 is two and a half tracks when only disk read operations are involved, and one third of a single track when any disk write operations are involved.