~ubuntu-branches/ubuntu/precise/mdadm/precise-updates

Viewing changes to docs/RAID5_versus_RAID10.txt

Committer: Package Import Robot
Author(s): Clint Byrum
Date: 2012-02-09 16:53:02 UTC
mfrom: (1.1.27 sid)
Revision ID: package-import@ubuntu.com-20120209165302-bs4cfosmhoga2rpt

Tags: 3.2.3-2ubuntu1

https://launchpad.net/bugs/920324

* Merge from Debian testing. (LP: #920324)  Remaining changes:
  - Call checks in local-premount to avoid race condition with udev
    and opening a degraded array.
  - d/initramfs/mdadm-functions: Record in /run when boot-degraded
    question has been asked so that it is only asked once
  - pass --test to mdadm to enable result codes for degraded arrays.
  - Build udeb with -O2 on ppc64, working around a link error.
  - debian/control: we need udev and util-linux in the right version. We
    also remove the build dependency from quilt and docbook-to-man as both
    are not used in Ubuntu's mdadm.
  - debian/initramfs/hook: kept Ubuntu's version for handling the absence
    of active raid arrays in <initramfs>/etc/mdadm/mdadm.conf
  - debian/initramfs/script.local-top.DEBIAN, debian/mdadm-startall,
    debian/mdadm.raid.DEBIAN: removed. udev does its job now instead.
  - debian/mdadm-startall.sgml, debian/mdadm-startall.8: documentation of
    unused startall script
  - debian/mdadm.config, debian/mdadm.postinst - let udev do the handling
    instead. Resolved merge conflict by keeping Ubuntu's version.
  - debian/mdadm.postinst, debian/mdadm.config, initramfs/init-premount:
    boot-degraded enablement; maintain udev starting of RAID devices;
    init-premount hook script for the initramfs, to provide information at
    boot
  - debian/mkconf.in is the older mkconf. Kept the Ubuntu version.
  - debian/rules: Kept Ubuntu's version for installing apport hooks, not
    installing the unused startall script and for adding a udev rule
    corresponding to mdadm.
  - debian/install-rc, check.d/_numbers, check.d/root_on_raid: Ubuntu partman
    installer changes
  - debian/presubj: Dropped this unused bug reporting file. Instead use
    source_mdadm.py as an apport hook for bug handling.
  - rename debian/mdadm.vol_id.udev to debian/mdadm.mdadm-blkid.udev so that
    the rules file ends up with a more reasonable name
  - d/p/debian-changes-3.1.4-1+8efb9d1ubuntu4: mdadm udev rule
    incrementally adds mdadm member when detected. Starting such an
    array in degraded mode is possible by mdadm -IRs. Using mdadm
    -ARs without stopping the array first does nothing when no
    mdarray-unassociated device is available. Using mdadm -IRs to
    start a previously partially assembled array through incremental
    mode. Keeping the mdadm -ARs for assembling arrays which were for
    some reason not assembled through incremental mode (i.e through
    mdadm's udev rule).

files added:
ANNOUNCE-3.2.3

debian/docs

debian/docs/RAID5_versus_RAID10.txt

debian/docs/md.txt

debian/docs/md_superblock_formats.txt

debian/docs/rebuilding-raid.html

debian/mdadd.sh

debian/po/sk.po

raid6check.8

files removed:
.pc/contrib

.pc/contrib/docs

.pc/contrib/docs/jd-rebuilding-raid.diff

.pc/contrib/docs/jd-rebuilding-raid.diff/docs

.pc/contrib/docs/jd-rebuilding-raid.diff/docs/rebuilding-raid.html

.pc/contrib/docs/md.txt.diff

.pc/contrib/docs/md.txt.diff/docs

.pc/contrib/docs/md.txt.diff/docs/md.txt

.pc/contrib/docs/raid5-vs-raid10.diff

.pc/contrib/docs/raid5-vs-raid10.diff/docs

.pc/contrib/docs/raid5-vs-raid10.diff/docs/RAID5_versus_RAID10.txt

.pc/contrib/docs/superblock_formats.diff

.pc/contrib/docs/superblock_formats.diff/docs

.pc/contrib/docs/superblock_formats.diff/docs/md_superblock_formats.txt

.pc/contrib/scripts

.pc/contrib/scripts/mdadd.diff

.pc/contrib/scripts/mdadd.diff/contrib

.pc/contrib/scripts/mdadd.diff/contrib/mdadd.sh

contrib

contrib/mdadd.sh

debian/patches/contrib

debian/patches/contrib/docs

debian/patches/contrib/docs/jd-rebuilding-raid.diff

debian/patches/contrib/docs/md.txt.diff

debian/patches/contrib/docs/raid5-vs-raid10.diff

debian/patches/contrib/docs/superblock_formats.diff

debian/patches/contrib/scripts

debian/patches/contrib/scripts/mdadd.diff

docs

docs/RAID5_versus_RAID10.txt

docs/md.txt

docs/md_superblock_formats.txt

docs/rebuilding-raid.html

files modified:
.gitignore

.pc/applied-patches

.pc/debian-changes-3.1.4-1+8efb9d1ubuntu4/Assemble.c

.pc/debian-changes-3.1.4-1+8efb9d1ubuntu4/ReadMe.c

.pc/debian-changes-3.1.4-1+8efb9d1ubuntu4/config.c

.pc/debian/conffile-location.diff/Makefile

.pc/debian/conffile-location.diff/ReadMe.c

.pc/debian/conffile-location.diff/mdadm.8.in

.pc/debian/conffile-location.diff/mdadm.conf.5

.pc/debian/conffile-location.diff/mdassemble.8

.pc/debian/disable-udev-incr-assembly.diff/udev-md-raid.rules

.pc/debian/no-Werror.diff/Makefile

Assemble.c

COPYING

Create.c

Detail.c

Grow.c

Incremental.c

Kill.c

Makefile

Manage.c

Monitor.c

ReadMe.c

bitmap.c

config.c

debian/FAQ

debian/changelog

debian/checkarray

debian/control

debian/copyright

debian/mdadm.docs

debian/mdadm.init

debian/patches/debian/disable-udev-incr-assembly.diff

debian/patches/debian/no-Werror.diff

debian/patches/series

debian/po/ca.po

debian/po/cs.po

debian/po/da.po

debian/po/de.po

debian/po/es.po

debian/po/eu.po

debian/po/fi.po

debian/po/fr.po

debian/po/gl.po

debian/po/it.po

debian/po/ja.po

debian/po/nl.po

debian/po/pt.po

debian/po/pt_BR.po

debian/po/ru.po

debian/po/sv.po

debian/po/templates.pot

debian/po/vi.po

debian/rules

inventory

makedist

managemon.c

mapfile.c

md.4

md_p.h

mdadm.8.in

mdadm.c

mdadm.conf.5

mdadm.h

mdadm.spec

mdassemble.8

mdassemble.c

mdmon.8

mdmon.c

mdmon.h

mdopen.c

mdstat.c

monitor.c

msg.c

platform-intel.h

policy.c

restripe.c

super-ddf.c

super-gpt.c

super-intel.c

super-mbr.c

super0.c

super1.c

sysfs.c

tests/03r5assemV1

udev-md-raid.rules

util.c

docs/RAID5_versus_RAID10.txt

# from http://www.miracleas.com/BAARF/RAID5_versus_RAID10.txt
# also see http://www.miracleas.com/BAARF/BAARF2.html

# Note: I, the Debian maintainer, do not agree with some of the arguments,
# especially not with the total condemning of RAID5. Anyone who talks about
# data loss and blames the RAID system should spend time reading up on Backups
# instead of trying to evangelise, but that's only my opinion. RAID5 has its
# merits and its shortcomings, just like any other method. However, the author
# of this argument puts forth a good case and thus I am including the
# document. Remember that you're the only one that can decide which RAID level
# to use.

RAID5 versus RAID10 (or even RAID3 or RAID4)

First let's get on the same page so we're all talking about apples.

What is RAID5?

OK here is the deal: RAID5 uses ONLY ONE parity drive per stripe, and many
RAID5 arrays are 5 drives (if your counts are different adjust the
calculations appropriately), 4 data and 1 parity, though it is not a single
drive that is holding all of the parity as in RAID 3 & 4, but read on. If you
have 10 drives of say 20GB each for 200GB, RAID5 will use 20% for parity
(assuming you set it up as two 5 drive arrays) so you will have 160GB of
storage. Now since RAID10, like mirroring (RAID1), uses 1 (or more) mirror
drive for each primary drive you are using 50% for redundancy, so to get the
same 160GB of storage you will need 8 pairs, or 16 20GB drives, which is
why RAID5 is so popular. This intro is just to put things into
perspective.
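
The capacity arithmetic above can be checked with a short sketch. The
function names are made up for illustration, and the model assumes that each
RAID5 array gives up exactly one drive's worth of space to parity and that
RAID10 keeps two copies of every block.

```python
def raid5_usable_gb(drives_per_array: int, arrays: int, drive_gb: int) -> int:
    """Usable space when each array spends one drive's worth of space on parity."""
    return arrays * (drives_per_array - 1) * drive_gb


def raid10_drives_needed(usable_gb: int, drive_gb: int, copies: int = 2) -> int:
    """Drives needed when every primary drive has (copies - 1) mirror drives."""
    primaries = -(-usable_gb // drive_gb)  # ceiling division
    return primaries * copies


usable = raid5_usable_gb(drives_per_array=5, arrays=2, drive_gb=20)
print(usable)                            # 160
print(raid10_drives_needed(usable, 20))  # 16
```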

RAID5 is physically a stripe set like RAID0 but with data recovery
included. RAID5 reserves one disk block out of each stripe block for
parity data. The parity block contains an error correction code; in effect
it is used in combination with the remaining data blocks to recreate any
single missing block, gone missing because a drive has failed. The
innovation of RAID5 over RAID3 & RAID4 is that the parity is distributed on
a round robin basis so that there can be independent reading of different
blocks from the several drives. This is why RAID5 became more popular than
RAID3 & RAID4, which must synchronously read the same block from all drives
together. So, suppose Drive2 fails, and blocks 1, 2, 4, 5, 6 & 7 are data
blocks on this drive while blocks 3 and 8 are parity blocks on this drive.
That means that the parity on Drive5 will be used to recreate the data block
from Drive2 if block 1 is requested before a new drive replaces Drive2 or
during the rebuilding of the new Drive2 replacement. Likewise the parity on
Drive1 will be used to repair block 2 and the parity on Drive3 will repair
block 4, etc. For block 3 all the data is safely on the remaining drives, but
during the rebuilding of Drive2's replacement a new parity block will be
calculated from the block 3 data and will be written to Drive2.
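
As a minimal sketch of the reconstruction described above, assume the parity
block is the byte-wise XOR of the data blocks in the stripe (the usual RAID5
scheme). The four-byte blocks and the helper name are made up for
illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a hypothetical 5 drive array: 4 data blocks plus 1 parity block.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# The drive holding data[1] fails: rebuild its block from the survivors and parity.
survivors = [data[0], data[2], data[3], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])  # True
```

Note that the rebuild has to read every surviving block in the stripe, which
is where the extra physical reads on a degraded array come from.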

Now when a disk block is read from the array the RAID software/firmware
calculates which RAID block contains the disk block, which drive the disk
block is on and which drive contains the parity block for that RAID block,
and reads ONLY the one data drive. It returns the data block. If you
later modify the data block it recalculates the parity by subtracting the
old block and adding in the new version, then in two separate operations it
writes the data block followed by the new parity block. To do this it must
first read the parity block from whichever drive contains the parity for
that stripe block and reread the unmodified data for the updated block from
the original drive. This read-read-write-write is known as the RAID5 write
penalty. Since these two writes are sequential and synchronous, the write
system call cannot return until the reread and both writes complete, for
safety, so writing to RAID5 is up to 50% slower than RAID0 for an array of
the same capacity. (Some software RAID5s avoid the re-read by keeping an
unmodified copy of the original block in memory.)
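
The "subtract the old block and add in the new" step can be sketched as
follows, again assuming XOR parity, where subtracting and adding are both
XOR. The function names and block contents are made up for illustration.

```python
from functools import reduce


def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def updated_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """New parity after one data block changes: needs two reads, no full-stripe read."""
    return xor(xor(old_parity, old_data), new_data)


stripe = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = reduce(xor, stripe)

# Read old data and old parity (read-read), then write both back (write-write).
new_block = b"ZZZZ"
parity = updated_parity(parity, stripe[1], new_block)
stripe[1] = new_block

# The shortcut gives the same parity as recomputing it over the whole stripe.
print(parity == reduce(xor, stripe))  # True
```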

Now what is RAID10:

RAID10 is one of the combinations of RAID1 (mirroring) and RAID0
(striping) which are possible. There used to be confusion about what
RAID01 or RAID10 meant and different RAID vendors defined them
differently. About five years or so ago I proposed the following standard
language which seems to have taken hold. When N mirrored pairs are
striped together this is called RAID10 because the mirroring (RAID1) is
applied before striping (RAID0). The other option is to create two stripe
sets and mirror them one to the other; this is known as RAID01 (because
the RAID0 is applied first). In either a RAID01 or RAID10 system each and
every disk block is completely duplicated on its drive's mirror.
Performance-wise both RAID01 and RAID10 are functionally equivalent. The
difference comes in during recovery, where RAID01 suffers from some of the
same problems I will describe affecting RAID5 while RAID10 does not.
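
One way to see a difference between the two layouts is to count which double
drive failures each survives. The sketch below uses a hypothetical 8 drive
array and a deliberately simple model, in which a RAID01 stripe set is
unusable as soon as any one of its drives has failed. The drive numbering and
function names are made up for illustration.

```python
from itertools import combinations

DRIVES = range(8)
RAID10_PAIRS = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]   # 4 mirrored pairs, striped
RAID01_STRIPES = [{0, 1, 2, 3}, {4, 5, 6, 7}]     # 2 stripe sets, mirrored


def raid10_survives(failed: set) -> bool:
    # Data is lost only if both drives of some mirrored pair have failed.
    return not any(pair <= failed for pair in RAID10_PAIRS)


def raid01_survives(failed: set) -> bool:
    # Assumed model: at least one stripe set must be completely intact.
    return any(not (stripe & failed) for stripe in RAID01_STRIPES)


double_failures = [set(c) for c in combinations(DRIVES, 2)]
raid10_fatal = sum(not raid10_survives(f) for f in double_failures)
raid01_fatal = sum(not raid01_survives(f) for f in double_failures)
print(len(double_failures), raid10_fatal, raid01_fatal)  # 28 4 16
```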

Now if a drive in the RAID5 array dies, is removed, or is shut off, data is
returned by reading the blocks from the remaining drives and calculating
the missing data using the parity, assuming the defunct drive is not the
parity block drive for that RAID block. Note that it takes 4 physical
reads to replace the missing disk block (for a 5 drive array) for four out
of every five disk blocks, leading to a 64% performance degradation until
the problem is discovered and a new drive can be mapped in to begin
recovery. Performance is degraded further during recovery because all
drives are being actively accessed in order to rebuild the replacement
drive (see below).

If a drive in the RAID10 array dies, data is returned from its mirror drive
in a single read with only minor (6.25% on average for a 4 pair array as a
whole) performance reduction when two non-contiguous blocks are needed from
the damaged pair (since the two blocks cannot be read in parallel from both
drives) and none otherwise.
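
The text does not show how the 6.25% figure is derived. One plausible
reading, offered here only as an assumption, is that it is the chance that
two independently chosen blocks both land on the damaged pair of a 4 pair
array:

```python
from fractions import Fraction

pairs = 4
p_block_on_damaged_pair = Fraction(1, pairs)

# Assumed model: two requested blocks are placed independently and uniformly.
p_both_on_damaged_pair = p_block_on_damaged_pair ** 2
print(p_both_on_damaged_pair, float(p_both_on_damaged_pair))  # 1/16 0.0625
```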

One begins to get an inkling of what is going on and why I dislike RAID5,
but, as they say on late night infomercials, there's more.

What's wrong besides a bit of performance I don't know I'm missing?

OK, so that brings us to the final question of the day which is: What is
the problem with RAID5? It does recover a failed drive right? So writes
are slower, I don't do enough writing to worry about it and the cache
helps a lot also, I've got LOTS of cache! The problem is that despite the
improved reliability of modern drives and the improved error correction
codes on most drives, and even despite the additional 8 bytes of error
correction that EMC puts on every Clariion drive disk block (if you are
lucky enough to use EMC systems), it is more than a little possible that a
drive will become flaky and begin to return garbage. This is known as
partial media failure. Now SCSI controllers reserve several hundred disk
blocks to be remapped to replace fading sectors with unused ones, but if
the drive is going these will not last very long and will run out, and SCSI
does NOT report correctable errors back to the OS! Therefore you will not
know the drive is becoming unstable until it is too late and there are no
more replacement sectors and the drive begins to return garbage. [Note
that the recently popular IDE/ATA drives do not (TMK) include bad sector
remapping in their hardware so garbage is returned that much sooner.]
When a drive returns garbage, since RAID5 does not EVER check parity on
read (RAID3 & RAID4 do BTW and both perform better for databases than
RAID5 to boot), when you write the garbage sector back garbage parity will
be calculated and your RAID5 integrity is lost! Similarly if a drive
fails and one of the remaining drives is flaky the replacement will be
rebuilt with garbage also, propagating the problem to two blocks instead of
just one.
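
The propagation described above can be illustrated with a small sketch,
assuming XOR parity. The block contents, the corrupted byte and the helper
names are made up for illustration. A parity check would notice the garbage
block, but a rebuild that trusts every surviving drive turns it into a
second bad block.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(blocks, parity):
    """The parity check that, per the text, RAID5 does not perform on read."""
    return xor_blocks(blocks) == parity


data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# The drive holding data[3] turns flaky and silently returns garbage.
garbage = b"D?DD"
print(stripe_is_consistent(data[:3] + [garbage], parity))  # False

# The drive holding data[1] then fails outright; the rebuild uses the garbage.
rebuilt = xor_blocks([data[0], data[2], garbage, parity])
print(rebuilt == data[1])  # False
```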

Need more? During recovery, read performance for a RAID5 array is
degraded by as much as 80%. Some advanced arrays let you configure the
preference more toward recovery or toward performance. However, favoring
performance will increase recovery time and increase the likelihood of
losing a second drive in the array before recovery completes, resulting in
catastrophic data loss. RAID10 on the other hand will only be recovering
one drive out of 4 or more pairs, with performance ONLY of reads from the
recovering pair degraded, making the performance hit to the array overall
only about 20%! Plus there is no parity calculation time used during
recovery - it's a straight data copy.

What about that thing about losing a second drive? Well with RAID10 there
is no danger unless the one mirror that is recovering also fails, and
that's 80% or more less likely than that any other drive in a RAID5 array
will fail! And since most multiple drive failures are caused by
undetected manufacturing defects you can make even this possibility
vanishingly small by making sure to mirror every drive with one from a
different manufacturer's lot number. ("Oh", you say, "this scenario does
not seem likely!" Pooh, we lost 50 drives over two weeks when a batch of
200 IBM drives began to fail. IBM discovered that the single lot of
drives would have their spindle bearings freeze after so many hours of
operation. Fortunately due in part to RAID10 and in part to a herculean
effort by DG techs and our own people over 2 weeks no data was lost.
HOWEVER, one RAID5 filesystem was a total loss after a second drive failed
during recovery. Fortunately everything was on tape.)

Conclusion? For safety and performance favor RAID10 first, RAID3 second,
RAID4 third, and RAID5 last! The original reason for the RAID2-5 specs
was that the high cost of disks was making RAID1, mirroring, impractical.
That is no longer the case! Drives are commodity priced, even the biggest,
fastest drives are cheaper in absolute dollars than drives were then and
cost per MB is a tiny fraction of what it was. Does RAID5 make ANY sense
anymore? Obviously I think not.

To put things into perspective: If a drive costs $1000US (and most are far
less expensive than that) then switching from a 4 pair RAID10 array to a 5
drive RAID5 array will save 3 drives or $3000US. What is the cost of
overtime, wear and tear on the technicians, DBAs, managers, and customers
of even a recovery scare? What is the cost of reduced performance and
possibly reduced customer satisfaction? Finally what is the cost of lost
business if data is unrecoverable? I maintain that the drives are FAR
cheaper! Hence my mantra:

NO RAID5! NO RAID5! NO RAID5! NO RAID5! NO RAID5! NO RAID5! NO RAID5!

Art S. Kagel