# From: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob_plain;f=Documentation/md.txt;hb=v2.6.31

Tools that manage md devices can be found at
   http://www.<country>.kernel.org/pub/linux/utils/raid/....
Boot time assembly of RAID arrays
---------------------------------

You can boot with your md device with the following kernel command
lines:

for old raid arrays without persistent superblocks:
   md=<md device no.>,<raid level>,<chunk size factor>,<fault level>,dev0,dev1,...,devn

for raid arrays with persistent superblocks:
   md=<md device no.>,dev0,dev1,...,devn
or, to assemble a partitionable array:
   md=d<md device no.>,dev0,dev1,...,devn

md device no. = the number of the md device ...

raid level = -1 linear mode
              0 striped mode
              other modes are only supported with persistent super blocks

chunk size factor = (raid-0 and raid-1 only)
              Set the chunk size as 4k << n.

fault level = totally ignored

dev0-devn: e.g. /dev/hda1,/dev/hdc1,/dev/sda1,/dev/sdb1

A possible loadlin line (Harald Hoyer <HarryH@Royal.Net>) looks like this:

   e:\loadlin\loadlin e:\zimage root=/dev/md0 md=0,0,4,0,/dev/hdb2,/dev/hdc3 ro
Boot time autodetection of RAID arrays
--------------------------------------

When md is compiled into the kernel (not as module), partitions of
type 0xfd are scanned and automatically assembled into RAID arrays.
This autodetection may be suppressed with the kernel parameter
"raid=noautodetect".  As of kernel 2.6.9, only drives with a type 0
superblock can be autodetected and run at boot time.

The kernel parameter "raid=partitionable" (or "raid=part") means
that all auto-detected arrays are assembled as partitionable.
Boot time assembly of degraded/dirty arrays
-------------------------------------------

If a raid5 or raid6 array is both dirty and degraded, it could have
undetectable data corruption.  This is because the fact that it is
'dirty' means that the parity cannot be trusted, and the fact that it
is degraded means that some data blocks are missing and cannot reliably
be reconstructed (due to no parity).

For this reason, md will normally refuse to start such an array.  This
requires the sysadmin to take action to explicitly start the array
despite possible corruption.  This is normally done with

   mdadm --assemble --force ....

This option is not really available if the array has the root
filesystem on it.  In order to support booting from such an
array, md supports a module parameter "start_dirty_degraded" which,
when set to 1, bypasses the checks and allows dirty degraded
arrays to be started.

So, to boot with a root filesystem of a dirty degraded raid[56], use

   md-mod.start_dirty_degraded=1
Superblock formats
------------------

The md driver can support a variety of different superblock formats.
Currently, it supports superblock formats "0.90.0" and the "md-1" format
introduced in the 2.5 development series.

The kernel will autodetect which format superblock is being used.

Superblock format '0' is treated differently to others for legacy
reasons - it is the original superblock format.
General Rules - apply for all superblock formats
------------------------------------------------

An array is 'created' by writing appropriate superblocks to all
devices.

It is 'assembled' by associating each of these devices with a
particular md virtual device.  Once it is completely assembled, it can
be accessed.

An array should be created by a user-space tool.  This will write
superblocks to all devices.  It will usually mark the array as
'unclean', or with some devices missing so that the kernel md driver
can create appropriate redundancy (copying in raid1, parity
calculation in raid4/5).

When an array is assembled, it is first initialized with the
SET_ARRAY_INFO ioctl.  This contains, in particular, a major and minor
version number.  The major version number selects which superblock
format is to be used.  The minor number might be used to tune handling
of the format, such as suggesting where on each device to look for the
superblock.

Then each device is added using the ADD_NEW_DISK ioctl.  This
provides, in particular, a major and minor number identifying the
device to add.

The array is started with the RUN_ARRAY ioctl.

Once started, new devices can be added.  They should have an
appropriate superblock written to them, and then be passed in with
ADD_NEW_DISK.

Devices that have failed or are not yet active can be detached from an
array using HOT_REMOVE_DISK.
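
In practice this sequence is driven by a user-space tool such as
mdadm.  A rough sketch of the corresponding commands (the device
names /dev/md0, /dev/sda1 etc. are illustrative only):

   # create the array: writes superblocks to the components
   mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
   # re-assemble and start it later (SET_ARRAY_INFO/ADD_NEW_DISK/RUN_ARRAY)
   mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1
   # hot-add another device to the running array (ADD_NEW_DISK)
   mdadm /dev/md0 --add /dev/sdc1
   # detach a failed or inactive device (HOT_REMOVE_DISK)
   mdadm /dev/md0 --remove /dev/sdc1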
Specific Rules that apply to format-0 super block arrays, and
arrays with no superblock (non-persistent).
-------------------------------------------------------------

An array can be 'created' by describing the array (level, chunksize
etc) in a SET_ARRAY_INFO ioctl.  This must have major_version==0 and
raid_disks != 0.

Then uninitialized devices can be added with ADD_NEW_DISK.  The
structure passed to ADD_NEW_DISK must specify the state of the device
and its role in the array.

Once started with RUN_ARRAY, uninitialized spares can be added with
HOT_ADD_DISK.
MD devices in sysfs
-------------------

md devices appear in sysfs (/sys) as regular block devices,
e.g.
   /sys/block/md0

Each 'md' device will contain a subdirectory called 'md' which
contains further md-specific information about the device.

All md devices contain:
   level
      a text file indicating the 'raid level'. e.g. raid0, raid1,
      raid5, linear, multipath, faulty.
      If no raid level has been set yet (array is still being
      assembled), the value will reflect whatever has been written
      to it, which may be a name like the above, or may be a number
      such as '0', '5', etc.

   raid_disks
      a text file with a simple number indicating the number of devices
      in a fully functional array.  If this is not yet known, the file
      will be empty.  If an array is being resized this will contain
      the new number of devices.
      Some raid levels allow this value to be set while the array is
      active.  This will reconfigure the array.  Otherwise it can only
      be set while assembling an array.
      A change to this attribute will not be permitted if it would
      reduce the size of the array.  To reduce the number of drives
      in an e.g. raid5, the array size must first be reduced by
      setting the 'array_size' attribute.
   chunk_size
      This is the size in bytes for 'chunks' and is only relevant to
      raid levels that involve striping (0,4,5,6,10).  The address space
      of the array is conceptually divided into chunks and consecutive
      chunks are striped onto neighbouring devices.
      The size should be at least PAGE_SIZE (4k) and should be a power
      of 2.  This can only be set while assembling an array.

   layout
      The "layout" for the array for the particular level.  This is
      simply a number that is interpreted differently by different
      levels.  It can be written while assembling an array.
   array_size
      This can be used to artificially constrain the available space in
      the array to be less than is actually available on the combined
      devices.  Writing a number (in Kilobytes) which is less than
      the available size will set the size.  Any reconfiguration of the
      array (e.g. adding devices) will not cause the size to change.
      Writing the word 'default' will cause the effective size of the
      array to be whatever size is actually available based on
      'level', 'chunk_size' and 'component_size'.

      This can be used to reduce the size of the array before reducing
      the number of devices in a raid4/5/6, or to support external
      metadata formats which mandate such clipping.
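
      For example, a sketch of dropping a drive from a raid5 via
      sysfs (the array name and size are illustrative, and the data
      must already fit within the reduced size):

         # clip the usable size first (kilobytes), then drop a drive
         echo 1000000 > /sys/block/md0/md/array_size
         echo 3 > /sys/block/md0/md/raid_disks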
   reshape_position
      This is either "none" or a sector number within the devices of
      the array where "reshape" is up to.  If this is set, the three
      attributes mentioned above (raid_disks, chunk_size, layout) can
      potentially have 2 values, an old and a new value.  If these
      values differ, reading the attribute returns
         new (old)
      and writing will affect the 'new' value, leaving the 'old'
      unchanged.
   component_size
      For arrays with data redundancy (i.e. not raid0, linear, faulty,
      multipath), all components must be the same size - or at least
      there must be a size that they all provide space for.  This is a key
      part of the geometry of the array.  It is measured in sectors
      and can be read from here.  Writing to this value may resize
      the array if the personality supports it (raid1, raid5, raid6),
      and if the component drives are large enough.
   metadata_version
      This indicates the format that is being used to record metadata
      about the array.  It can be 0.90 (traditional format), 1.0, 1.1,
      1.2 (newer format in varying locations) or "none" indicating that
      the kernel isn't managing metadata at all.
      Alternately it can be "external:" followed by a string which
      is set by user-space.  This indicates that metadata is managed
      by a user-space program.  Any device failure or other event that
      requires a metadata update will cause array activity to be
      suspended until the event is acknowledged.
   resync_start
      The point at which resync should start.  If no resync is needed,
      this will be a very large number.  At array creation it will
      default to 0, though starting the array as 'clean' will
      set it much larger.
   new_dev
      This file can be written but not read.  The value written should
      be a block device number as major:minor.  e.g. 8:0
      This will cause that device to be attached to the array, if it is
      available.  It will then appear at md/dev-XXX (depending on the
      name of the device) and further configuration is then possible.
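
      For example (8:17 is a hypothetical major:minor pair,
      typically sdb1):

         echo 8:17 > /sys/block/md0/md/new_dev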
   safe_mode_delay
      When an md array has seen no write requests for a certain period
      of time, it will be marked as 'clean'.  When another write
      request arrives, the array is marked as 'dirty' before the write
      commences.  This is known as 'safe_mode'.
      The 'certain period' is controlled by this file which stores the
      period as a number of seconds.  The default is 200msec (0.200).
      Writing a value of 0 disables safemode.
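
      For example, to wait five seconds of write idleness before
      marking the array clean (the array name md0 is illustrative):

         echo 5.000 > /sys/block/md0/md/safe_mode_delay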
   array_state
      This file contains a single word which describes the current
      state of the array.  In many cases, the state can be set by
      writing the word for the desired state, however some states
      cannot be explicitly set, and some transitions are not allowed.

      Select/poll works on this file.  All changes except between
      active_idle and active (which can be frequent and are not
      very interesting) are notified.  active->active_idle is
      reported if the metadata is externally managed.

      clear
         No devices, no size, no level
         Writing is equivalent to STOP_ARRAY ioctl
      inactive
         May have some settings, but array is not active
         all IO results in error
         When written, doesn't tear down array, but just stops it
      suspended (not supported yet)
         All IO requests will block.  The array can be reconfigured.
         Writing this, if accepted, will block until array is quiescent
      readonly
         no resync can happen.  no superblocks get written.
         write requests fail
      read-auto
         like readonly, but behaves like 'clean' on a write request.
      clean
         no pending writes, but otherwise active.
         When written to inactive array, starts without resync
         If a write request arrives then
            if metadata is known, mark 'dirty' and switch to 'active'.
            if not known, block and switch to write-pending
         If written to an active array that has pending writes, then fails.
      active
         fully active: IO and resync can be happening.
         When written to inactive array, starts with resync
      write-pending
         clean, but writes are blocked waiting for 'active' to be written.
      active-idle
         like active, but no writes have been seen for a while (safe_mode_delay).
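
      For example, inspecting and changing the state of a
      hypothetical array md0:

         cat /sys/block/md0/md/array_state        # e.g. prints 'clean'
         echo readonly > /sys/block/md0/md/array_state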
As component devices are added to an md array, they appear in the 'md'
directory as new directories named
   dev-XXX
where XXX is a name that the kernel knows for the device, e.g. hdb1.
Each directory contains:

   block
      a symlink to the block device in /sys/block, e.g.
      /sys/block/md0/md/dev-hdb1/block -> ../../../../block/hdb/hdb1

   super
      A file containing an image of the superblock read from, or
      written to, that device.
   state
      A file recording the current state of the device in the array
      which can be a comma separated list of
         faulty      - device has been kicked from active use due to
                       a detected fault
         in_sync     - device is a fully in-sync member of the array
         writemostly - device will only be subject to read
                       requests if there are no other options.
                       This applies only to raid1 arrays.
         blocked     - device has failed, metadata is "external",
                       and the failure hasn't been acknowledged yet.
                       Writes that would write to this device if
                       it were not faulty are blocked.
         spare       - device is working, but not a full member.
                       This includes spares that are in the process
                       of being recovered into the array.
      This list may grow in future.
      This can be written to.
      Writing "faulty" simulates a failure on the device.
      Writing "remove" removes the device from the array.
      Writing "writemostly" sets the writemostly flag.
      Writing "-writemostly" clears the writemostly flag.
      Writing "blocked" sets the "blocked" flag.
      Writing "-blocked" clears the "blocked" flag and allows writes
      to complete.

      This file responds to select/poll.  Any change to 'faulty'
      or 'blocked' causes an event.
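
      For example, manually failing and then detaching a component
      (the array and device names are illustrative):

         echo faulty > /sys/block/md0/md/dev-sdb1/state
         echo remove > /sys/block/md0/md/dev-sdb1/state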
   errors
      An approximate count of read errors that have been detected on
      this device but have not caused the device to be evicted from
      the array (either because they were corrected or because they
      happened while the array was read-only).  When using version-1
      metadata, this value persists across restarts of the array.

      This value can be written while assembling an array thus
      providing an ongoing count for arrays with metadata managed by
      userspace.
   slot
      This gives the role that the device has in the array.  It will
      either be 'none' if the device is not active in the array
      (i.e. is a spare or has failed) or an integer less than the
      'raid_disks' number for the array indicating which position
      it currently fills.  This can only be set while assembling an
      array.  A device for which this is set is assumed to be working.
   offset
      This gives the location in the device (in sectors from the
      start) where data from the array will be stored.  Any part of
      the device before this offset is not touched, unless it is
      used for storing metadata (Formats 1.1 and 1.2).
   size
      The amount of the device, after the offset, that can be used
      for storage of data.  This will normally be the same as the
      component_size.  This can be written while assembling an
      array.  If a value less than the current component_size is
      written, it will be rejected.
An active md device will also contain an entry for each active device
in the array.  These are named
   rdNN
where 'NN' is the position in the array, starting from 0.
So for a 3 drive array there will be rd0, rd1, rd2.
These are symbolic links to the appropriate 'dev-XXX' entry.
Thus, for example,
   cat /sys/block/md*/md/rd*/state
will show 'in_sync' on every line.
Active md devices for levels that support data redundancy (1,4,5,6)
also have

   sync_action
      a text file that can be used to monitor and control the rebuild
      process.  It contains one word which can be one of:
         resync  - redundancy is being recalculated after unclean
                   shutdown or creation
         recover - a hot spare is being built to replace a
                   failed/missing device
         idle    - nothing is happening
         check   - A full check of redundancy was requested and is
                   happening.  This reads all blocks and checks
                   them.  A repair may also happen for some raid
                   levels.
         repair  - A full check and repair is happening.  This is
                   similar to 'resync', but was requested by the
                   user, and the write-intent bitmap is NOT used to
                   optimise the process.

      This file is writable, and each of the strings that could be
      read are meaningful for writing.

      'idle' will stop an active resync/recovery etc.  There is no
      guarantee that another resync/recovery may not be automatically
      started again, though some event will be needed to trigger
      this.
      'resync' or 'recovery' can be used to restart the
      corresponding operation if it was stopped with 'idle'.
      'check' and 'repair' will start the appropriate process
      providing the current state is 'idle'.

      This file responds to select/poll.  Any important change in the value
      triggers a poll event.  Sometimes the value will briefly be
      "recover" if a recovery seems to be needed, but cannot be
      achieved.  In that case, the transition to "recover" isn't
      notified, but the transition away is.
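
      For example, requesting a full consistency check on a
      hypothetical array md0:

         echo check > /sys/block/md0/md/sync_action
         cat /sys/block/md0/md/sync_action   # 'check' while it runs, then 'idle'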
   degraded
      This contains a count of the number of devices by which the
      array is degraded.  So an optimal array will show '0'.  A
      single failed/missing drive will show '1', etc.
      This file responds to select/poll, any increase or decrease
      in the count of missing devices will trigger an event.
   mismatch_cnt
      When performing 'check' and 'repair', and possibly when
      performing 'resync', md will count the number of errors that are
      found.  The count in 'mismatch_cnt' is the number of sectors
      that were re-written, or (for 'check') would have been
      re-written.  As most raid levels work in units of pages rather
      than sectors, this may be larger than the number of actual errors
      by a factor of the number of sectors in a page.
   bitmap_set_bits
      If the array has a write-intent bitmap, then writing to this
      attribute can set bits in the bitmap, indicating that a resync
      would need to check the corresponding blocks.  Either individual
      numbers or start-end pairs can be written.  Multiple numbers
      can be separated by a space.
      Note that the numbers are 'bit' numbers, not 'block' numbers.
      They should be scaled by the bitmap_chunksize.
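
      For example (the bit numbers here are arbitrary placeholders):

         echo "100 200-250" > /sys/block/md0/md/bitmap_set_bits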
   sync_speed_min
   sync_speed_max
      These are similar to /proc/sys/dev/raid/speed_limit_{min,max}
      however they only apply to the particular array.
      If no value has been written to these, or if the word 'system'
      is written, then the system-wide value is used.  If a value,
      in kibibytes-per-second, is written, then it is used.
      When the files are read, they show the currently active value
      followed by "(local)" or "(system)" depending on whether it is
      a locally set or system-wide value.
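
      For example, capping the rebuild speed of one array and then
      reverting to the system-wide limit (the value is illustrative):

         echo 50000 > /sys/block/md0/md/sync_speed_max
         echo system > /sys/block/md0/md/sync_speed_max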
   sync_completed
      This shows the number of sectors that have been completed of
      whatever the current sync_action is, followed by the number of
      sectors in total that could need to be processed.  The two
      numbers are separated by a '/' thus effectively showing one
      value, a fraction of the process that is complete.
      A 'select' on this attribute will return when resync completes,
      when it reaches the current sync_max (below) and possibly at
      other times.
   sync_max
      This is a number of sectors at which point a resync/recovery
      process will pause.  When a resync is active, the value can
      only ever be increased, never decreased.  The value of 'max'
      effectively disables the limit.
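
      For example, letting a resync proceed in stages (the sector
      counts are placeholders):

         echo 1000000 > /sys/block/md0/md/sync_max   # pause at this sector
         cat /sys/block/md0/md/sync_completed        # e.g. '1000000 / 3907029168'
         echo max > /sys/block/md0/md/sync_max       # remove the limit again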
   sync_speed
      This shows the current actual speed, in K/sec, of the current
      sync_action.  It is averaged over the last 30 seconds.
   suspend_lo
   suspend_hi
      The two values, given as numbers of sectors, indicate a range
      within the array where IO will be blocked.  This is currently
      only supported for raid4/5/6.
Each active md device may also have attributes specific to the
personality module that manages it.
These are specific to the implementation of the module and could
change substantially if the implementation changes.

These currently include

   stripe_cache_size  (currently raid5 only)
      number of entries in the stripe cache.  This is writable, but
      there are upper and lower limits (32768, 16).  Default is 128.
      (See the tuning sketch after this list.)
   stripe_cache_active (currently raid5 only)
      number of active entries in the stripe cache
   preread_bypass_threshold (currently raid5 only)
      number of times a stripe requiring preread will be bypassed by
      a stripe that does not require preread.  For fairness defaults
      to 1.  Setting this to 0 disables bypass accounting and
      requires preread stripes to wait until all full-width stripe-
      writes are complete.  Valid values are 0 to stripe_cache_size.
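
For example, enlarging the raid5 stripe cache, which can help large
sequential writes (the value and array name are illustrative; each
entry costs roughly one page per component device):

   echo 1024 > /sys/block/md0/md/stripe_cache_size
   cat /sys/block/md0/md/stripe_cache_active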