~ubuntu-branches/ubuntu/wily/sqlite3/wily

onfocus="entersearch()" onblur="leavesearch()" style="width:24ex;padding:1px 1ex; border:solid white 1px; font-size:0.9em ; font-style:italic;color:#044a64;" value="Search SQLite Docs...">

112

113

</form>

114

</div>

115

</table>

116

</div></div></div></div>

117

</td></tr></table>

118

119

120

121

122

123

124

Atomic Commit In SQLite

125

</h1>

126

127

<h2>1.0 Introduction</h2>

128

129

<p>An important feature of transactional databases like SQLite

130

is "atomic commit".

131

Atomic commit means that either all database changes within a single

132

transaction occur or none of them occur. With atomic commit, it

133

is as if many different writes to different sections of the database

134

file occur instantaneously and simultaneously.

135

Real hardware serializes writes to mass storage, and writing

136

a single sector takes a finite amount of time.

137

So it is impossible to truly write many different sectors of a

138

database file simultaneously and/or instantaneously.

139

But the atomic commit logic within

140

SQLite makes it appear as if the changes for a transaction

141

are all written instantaneously and simultaneously.</p>

142

143

<p>SQLite has the important property that transactions appear

144

to be atomic even if the transaction is interrupted by an

145

operating system crash or power failure.</p>

146

147

<p>This article describes the techniques used by SQLite to create the

148

illusion of atomic commit.</p>

149

150

<p>The information in this article applies only when SQLite is operating

151

in "rollback mode", or in other words when SQLite is not

152

using a <a href="wal.html">write-ahead log</a>. SQLite still supports atomic commit when

153

write-ahead logging is enabled, but it accomplishes atomic commit by

154

a different mechanism from the one describe in this article. See

155

the <a href="wal.html">write-ahead log documentation</a> for additional information on how

156

SQLite supports atomic commit in that context.</p>

157

158

159

160

<h2>2.0 Hardware Assumptions</h2>

161

162

<p>Throughout this article, we will call the mass storage device "disk"

163

even though the mass storage device might really be flash memory.</p>

164

165

<p>We assume that disk is written in chunks which we call a "sector".

166

It is not possible to modify any part of the disk smaller than a sector.

167

To change a part of the disk smaller than a sector, you have to read in

168

the full sector that contains the part you want to change, make the

169

change, then write back out the complete sector.</p>

170

171

<p>On a traditional spinning disk, a sector is the minimum unit of transfer

172

in both directions, both reading and writing. On flash memory, however,

173

the minimum size of a read is typically much smaller than a minimum write.

174

SQLite is only concerned with the minimum write amount and so for the

175

purposes of this article, when we say "sector" we mean the minimum amount

176

of data that can be written to mass storage in a single go.</p>

177

178

<p>

179

Prior to SQLite version 3.3.14, a sector size of 512 bytes was

180

assumed in all cases. There was a compile-time option to change

181

this but the code had never been tested with a larger value. The

182

512 byte sector assumption seemed reasonable since until very recently

183

all disk drives used a 512 byte sector internally. However, there

184

has recently been a push to increase the sector size of disks to

185

4096 bytes. Also the sector size

186

for flash memory is usually larger than 512 bytes. For these reasons,

187

versions of SQLite beginning with 3.3.14 have a method in the OS

188

interface layer that interrogates the underlying filesystem to find

189

the true sector size. As currently implemented (version 3.5.0) this

190

method still returns a hard-coded value of 512 bytes, since there

191

is no standard way of discovering the true sector size on either

192

Unix or Windows. But the method is available for embedded device

193

manufactures to tweak according to their own needs. And we have

194

left open the possibility of filling in a more meaningful implementation

195

on Unix and Windows in the future.</p>

196

197

<p>SQLite has traditionally assumed that a sector write is <u>not</u> atomic.

198

However, SQLite does always assume that a sector write is linear. By "linear"

199

we mean that SQLite assumes that when writing a sector, the hardware begins

200

at one end of the data and writes byte by byte until it gets to

201

the other end. The write might go from beginning to end or from

202

end to beginning. If a power failure occurs in the middle of a

203

sector write it might be that part of the sector was modified

204

and another part was left unchanged. The key assumption by SQLite

205

is that if any part of the sector gets changed, then either the

206

first or the last bytes will be changed. So the hardware will

207

never start writing a sector in the middle and work towards the

208

ends. We do not know if this assumption is always true but it

209

seems reasonable.</p>

210

211

<p>The previous paragraph states that SQLite does not assume that

212

sector writes are atomic. This is true by default. But as of

213

SQLite version 3.5.0, there is a new interface called the

214

Virtual File System (<a href="vfs.html">VFS</a>) interface. The <a href="vfs.html">VFS</a> is the only means

215

by which SQLite communicates to the underlying filesystem. The

216

code comes with default VFS implementations for Unix and Windows

217

and there is a mechanism for creating new custom VFS implementations

218

at runtime. In this new VFS interface there is a method called

219

xDeviceCharacteristics. This method interrogates the underlying

220

filesystem to discover various properties and behaviors that the

221

filesystem may or may not exhibit. The xDeviceCharacteristics

222

method might indicate that sector writes are atomic, and if it does

223

so indicate, SQLite will try to take advantage of that fact. But

224

the default xDeviceCharacteristics method for both Unix and Windows

225

does not indicate atomic sector writes and so these optimizations

226

are normally omitted.</p>

227

228

<p>SQLite assumes that the operating system will buffer writes and

229

that a write request will return before data has actually been stored

230

in the mass storage device.

231

SQLite further assumes that write operations will be reordered by

232

the operating system.

233

For this reason, SQLite does a "flush" or "fsync" operation at key

234

points. SQLite assumes that the flush or fsync will not return until

235

all pending write operations for the file that is being flushed have

236

completed. We are told that the flush and fsync primitives

237

are broken on some versions of Windows and Linux. This is unfortunate.

238

It opens SQLite up to the possibility of database corruption following

239

a power loss in the middle of a commit. However, there is nothing

240

that SQLite can do to test for or remedy the situation. SQLite

241

assumes that the operating system that it is running on works as

242

advertised. If that is not quite the case, well then hopefully you

243

will not lose power too often.</p>

244

245

<p>SQLite assumes that when a file grows in length that the new

246

file space originally contains garbage and then later is filled in

247

with the data actually written. In other words, SQLite assumes that

248

the file size is updated before the file content. This is a

249

pessimistic assumption and SQLite has to do some extra work to make

250

sure that it does not cause database corruption if power is lost

251

between the time when the file size is increased and when the

252

new content is written. The xDeviceCharacteristics method of

253

the <a href="vfs.html">VFS</a> might indicate that the filesystem will always write the

254

data before updating the file size. (This is the

255

SQLITE_IOCAP_SAFE_APPEND property for those readers who are looking

256

at the code.) When the xDeviceCharacteristics method indicates

257

that files content is written before the file size is increased,

258

SQLite can forego some of its pedantic database protection steps

259

and thereby decrease the amount of disk I/O needed to perform a

260

commit. The current implementation, however, makes no such assumptions

261

for the default VFSes for Windows and Unix.</p>

262

263

<p>SQLite assumes that a file deletion is atomic from the

264

point of view of a user process. By this we mean that if SQLite

265

requests that a file be deleted and the power is lost during the

266

delete operation, once power is restored either the file will

267

exist completely with all if its original content unaltered, or

268

else the file will not be seen in the filesystem at all. If

269

after power is restored the file is only partially deleted,

270

if some of its data has been altered or erased,

271

or the file has been truncated but not completely removed, then

272

database corruption will likely result.</p>

273

274

<p>SQLite assumes that the detection and/or correction of

275

bit errors caused by cosmic rays, thermal noise, quantum

276

fluctuations, device driver bugs, or other mechanisms, is the

277

responsibility of the underlying hardware and operating system.

278

SQLite does not add any redundancy to the database file for

279

the purpose of detecting corruption or I/O errors.

280

SQLite assumes that the data it reads is exactly the same data

281

that it previously wrote.</p>

282

283

<p>By default, SQLite assumes that an operating system call to write

284

a range of bytes will not damage or alter any bytes outside of that range

285

even if a power lose or OS crash occurs during that write. We

286

call this the "<a href="psow.html">powersafe overwrite</a>" property. Prior to version 3.7.9,

287

SQLite did not assume powersafe overwrite. But with the standard

288

sector size increasing from 512 to 4096 bytes on most disk drives, it

289

has become necessary to assume powersafe overwrite in order to maintain

290

historical performance levels and so powersafe overwrite is assumed by

291

default in recent versions of SQLite. The assumption of powersafe

292

overwrite property can be disabled at compile-time or a run-time if

293

desired. See the <a href="psow.html">powersafe overwrite documentation</a> for further

294

details.

295

296

297

298

<h2>3.0 Single File Commit</h2>

299

300

<p>We begin with an overview of the steps SQLite takes in order to

301

perform an atomic commit of a transaction against a single database

302

file. The details of file formats used to guard against damage from

303

power failures and techniques for performing an atomic commit across

304

multiple databases are discussed in later sections.</p>

305

306

307

308

<h3>3.1 Initial State</h3>

309

310

311

312

<p>The state of the computer when a database connection is

313

first opened is shown conceptually by the diagram at the

314

right.

315

The area of the diagram on the extreme right (labeled "Disk") represents

316

information stored on the mass storage device. Each rectangle is

317

a sector. The blue color represents that the sectors contain

318

original data.

319

The middle area is the operating systems disk cache. At the

320

onset of our example, the cache is cold and this is represented

321

by leaving the rectangles of the disk cache empty.

322

The left area of the diagram shows the content of memory for

323

the process that is using SQLite. The database connection has

324

just been opened and no information has been read yet, so the

325

user space is empty.

326

</p>

327

328

329

330

331

<h3>3.2 Acquiring A Read Lock</h3>

332

333

334

335

<p>Before SQLite can write to a database, it must first read

336

the database to see what is there already. Even if it is just

337

appending new data, SQLite still has to read in the database

338

schema from the <b>sqlite_master</b> table so that it can know

339

how to parse the INSERT statements and discover where in the

340

database file the new information should be stored.</p>

341

342

<p>The first step toward reading from the database file

343

is obtaining a shared lock on the database file. A "shared"

344

lock allows two or more database connections to read from the

345

database file at the same time. But a shared lock prevents

346

another database connection from writing to the database file

347

while we are reading it. This is necessary because if another

348

database connection were writing to the database file at the

349

same time we are reading from the database file, we might read

350

some data before the change and other data after the change.

351

This would make it appears as if the change made by the other

352

process is not atomic.</p>

353

354

<p>Notice that the shared lock is on the operating system

355

disk cache, not on the disk itself. File locks

356

really are just flags within the operating system kernel,

357

usually. (The details depend on the specific OS layer

358

interface.) Hence, the lock will instantly vanish if the

359

operating system crashes or if there is a power loss. It

360

is usually also the case that the lock will vanish if the

361

process that created the lock exits.</p>

362

363

364

365

366

<h3>3.3 Reading Information Out Of The Database</h3>

367

368

369

370

<p>After the shared lock is acquired, we can begin reading

371

information from the database file. In this scenario, we

372

are assuming a cold cache, so information must first be

373

read from mass storage into the operating system cache then

374

transferred from operating system cache into user space.

375

On subsequent reads, some or all of the information might

376

already be found in the operating system cache and so only

377

the transfer to user space would be required.</p>

378

379

<p>Usually only a subset of the pages in the database file

380

are read. In this example we are showing three

381

pages out of eight being read. In a typical application, a

382

database will have thousands of pages and a query will normally

383

only touch a small percentage of those pages.</p>

384

385

386

387

388

389

<h3>3.4 Obtaining A Reserved Lock</h3>

390

391

392

393

<p>Before making changes to the database, SQLite first

394

obtains a "reserved" lock on the database file. A reserved

395

lock is similar to a shared lock in that both a reserved lock

396

and shared lock allow other processes to read from the database

397

file. A single reserve lock can coexist with multiple shared

398

locks from other processes. However, there can only be a

399

single reserved lock on the database file. Hence only a

400

single process can be attempting to write to the database

401

at one time.</p>

402

403

<p>The idea behind a reserved locks is that it signals that

404

a process intends to modify the database file in the near

405

future but has not yet started to make the modifications.

406

And because the modifications have not yet started, other

407

processes can continue to read from the database. However,

408

no other process should also begin trying to write to the

409

database.</p>

410

411

412

413

<h3>3.5 Creating A Rollback Journal File</h3>

414

415

416

<p>Prior to making any changes to the database file, SQLite first

417

creates a separate rollback journal file and writes into the

418

rollback journal the original

419

content of the database pages that are to be altered.

420

The idea behind the rollback journal is that it contains

421

all information needed to restore the database back to

422

its original state.</p>

423

424

<p>The rollback journal contains a small header (shown in green

425

in the diagram) that records the original size of the database

426

file. So if a change causes the database file to grow, we

427

will still know the original size of the database. The page

428

number is stored together with each database page that is

429

written into the rollback journal.</p>

430

431

<p>

432

When a new file is created, most desktop operating systems

433

(Windows, Linux, Mac OS X) will not actually write anything to

434

disk. The new file is created in the operating systems disk

435

cache only. The file is not created on mass storage until sometime

436

later, when the operating system has a spare moment. This creates

437

the impression to users that I/O is happening much faster than

438

is possible when doing real disk I/O. We illustrate this idea in

439

the diagram to the right by showing that the new rollback journal

440

appears in the operating system disk cache only and not on the

441

disk itself.</p>

442

443

444

445

<h3>3.6 Changing Database Pages In User Space</h3>

446

447

448

<p>After the original page content has been saved in the rollback

449

journal, the pages can be modified in user memory. Each database

450

connection has its own private copy of user space, so the changes

451

that are made in user space are only visible to the database connection

452

that is making the changes. Other database connections still see

453

the information in operating system disk cache buffers which have

454

not yet been changed. And so even though one process is busy

455

modifying the database, other processes can continue to read their

456

own copies of the original database content.</p>

457

458

459

460

<h3>3.7 Flushing The Rollback Journal File To Mass Storage</h3>

461

462

463

<p>The next step is to flush the content of the rollback journal

464

file to nonvolatile storage.

465

As we will see later,

466

this is a critical step in insuring that the database can survive

467

an unexpected power loss.

468

This step also takes a lot of time, since writing to nonvolatile

469

storage is normally a slow operation.</p>

470

471

<p>This step is usually more complicated than simply flushing

472

the rollback journal to the disk. On most platforms two separate

473

flush (or fsync()) operations are required. The first flush writes

474

out the base rollback journal content. Then the header of the

475

rollback journal is modified to show the number of pages in the

476

rollback journal. Then the header is flushed to disk. The details

477

on why we do this header modification and extra flush are provided

478

in a later section of this paper.</p>

479

480

481

482

<h3>3.8 Obtaining An Exclusive Lock</h3>

483

484

485

<p>Prior to making changes to the database file itself, we must

486

obtain an exclusive lock on the database file. Obtaining an

487

exclusive lock is really a two-step process. First SQLite obtains

488

a "pending" lock. Then it escalates the pending lock to an

489

exclusive lock.</p>

490

491

<p>A pending lock allows other processes that already have a

492

shared lock to continue reading the database file. But it

493

prevents new shared locks from being established. The idea

494

behind a pending lock is to prevent writer starvation caused

495

by a large pool of readers. There might be dozens, even hundreds,

496

of other processes trying to read the database file. Each process

497

acquires a shared lock before it starts reading, reads what it

498

needs, then releases the shared lock. If, however, there are

499

many different processes all reading from the same database, it

500

might happen that a new process always acquires its shared lock before

501

the previous process releases its shared lock. And so there is

502

never an instant when there are no shared locks on the database

503

file and hence there is never an opportunity for the writer to

504

seize the exclusive lock. A pending lock is designed to prevent

505

that cycle by allowing existing shared locks to proceed but

506

blocking new shared locks from being established. Eventually

507

all shared locks will clear and the pending lock will then be

508

able to escalate into an exclusive lock.</p>

509

510

511

512

<h3>3.9 Writing Changes To The Database File</h3>

513

514

515

<p>Once an exclusive lock is held, we know that no other

516

processes are reading from the database file and it is

517

safe to write changes into the database file. Usually

518

those changes only go as far as the operating systems disk

519

cache and do not make it all the way to mass storage.</p>

520

521

522

523

<h3>3.10 Flushing Changes To Mass Storage</h3>

524

525

526

<p>Another flush must occur to make sure that all the

527

database changes are written into nonvolatile storage.

528

This is a critical step to ensure that the database will

529

survive a power loss without damage. However, because

530

of the inherent slowness of writing to disk or flash memory,

531

this step together with the rollback journal file flush in section

532

3.7 above takes up most the time required to complete a

533

transaction commit in SQLite.</p>

534

535

536

537

<h3>3.11 Deleting The Rollback Journal</h3>

538

539

540

<p>After the database changes are all safely on the mass

541

storage device, the rollback journal file is deleted.

542

This is the instant where the transaction commits.

543

If a power failure or system crash occurs prior to this

544

point, then recovery processes to be described later make

545

it appears as if no changes were ever made to the database

546

file. If a power failure or system crash occurs after

547

the rollback journal is deleted, then it appears as if

548

all changes have been written to disk. Thus, SQLite gives

549

the appearance of having made no changes to the database

550

file or having made the complete set of changes to the

551

database file depending on whether or not the rollback

552

journal file exists.</p>

553

554

<p>Deleting a file is not really an atomic operation, but

555

it appears to be from the point of view of a user process.

556

A process is always able to ask the operating system "does

557

this file exist?" and the process will get back a yes or no

558

answer. After a power failure that occurs during a

559

transaction commit, SQLite will ask the operating system

560

whether or not the rollback journal file exists. If the

561

answer is "yes" then the transaction is incomplete and is

562

rolled back. If the answer is "no" then it means the transaction

563

did commit.</p>

564

565

<p>The existence of a transaction depends on whether or

566

not the rollback journal file exists and the deletion

567

of a file appears to be an atomic operation from the point of

568

view of a user-space process. Therefore,

569

a transaction appears to be an atomic operation.</p>

570

571

<p>The act of deleting a file is expensive on many systems.

572

As an optimization, SQLite can be configured to truncate

573

the journal file to zero bytes in length

574

or overwrite the journal file header with zeros. In either

575

case, the resulting journal file is no longer capable of rolling

576

back and so the transaction still commits. Truncating a file

577

to zero length, like deleting a file, is assumed to be an atomic

578

operation from the point of view of a user process. Overwriting

579

the header of the journal with zeros is not atomic, but if any

580

part of the header is malformed the journal will not roll back.

581

Hence, one can say that the commit occurs as soon as the header

582

is sufficiently changed to make it invalid. Typically this happens

583

as soon as the first byte of the header is zeroed.</p>

584

585

586

587

<h3>3.12 Releasing The Lock</h3>

588

589

590

<p>The last step in the commit process is to release the

591

exclusive lock so that other processes can once again

592

start accessing the database file.</p>

593

594

<p>In the diagram at the right, we show that the information

595

that was held in user space is cleared when the lock is released.

596

This used to be literally true for older versions of SQLite. But

597

more recent versions of SQLite keep the user space information

598

in memory in case it might be needed again at the start of the

599

next transaction. It is cheaper to reuse information that is

600

already in local memory than to transfer the information back

601

from the operating system disk cache or to read it off of the

602

disk drive again. Prior to reusing the information in user space,

603

we must first reacquire the shared lock and then we have to check

604

to make sure that no other process modified the database file while

605

we were not holding a lock. There is a counter in the first page

606

of the database that is incremented every time the database file

607

is modified. We can find out if another process has modified the

608

database by checking that counter. If the database was modified,

609

then the user space cache must be cleared and reread. But it is

610

commonly the case that no changes have been made and the user

611

space cache can be reused for a significant performance savings.</p>

612

613

614

615

616

<h2>4.0 Rollback</h2>

617

618

<p>An atomic commit is supposed to happen instantaneously. But the processing

619

described above clearly takes a finite amount of time.

620

Suppose the power to the computer were cut

621

part way through the commit operation described above. In order

622

to maintain the illusion that the changes were instantaneous, we

623

have to "rollback" any partial changes and restore the database to

624

the state it was in prior to the beginning of the transaction.</p>

625

626

627

628

<h3>4.1 When Something Goes Wrong...</h3>

629

630

631

<p>Suppose the power loss occurred during step 3.10 above,

632

while the database changes were being written to disk.

633

After power is restored, the situation might be something

634

like what is shown to the right. We were trying to change

635

three pages of the database file but only one page was

636

successfully written. Another page was partially written

637

and a third page was not written at all.</p>

638

639

<p>The rollback journal is complete and intact on disk when

640

the power is restored. This is a key point. The reason for

641

the flush operation in step 3.7 is to make absolutely sure that

642

all of the rollback journal is safely on nonvolatile storage

643

prior to making any changes to the database file itself.</p>

644

645

646

647

<h3>4.2 Hot Rollback Journals</h3>

648

649

650

<p>The first time that any SQLite process attempts to access

651

the database file, it obtains a shared lock as described in

652

section 3.2 above. But then it notices that there is a

653

rollback journal file present. SQLite then checks to see if

654

the rollback journal is a "hot journal". A hot journal is

655

a rollback journal that needs to be played back in order to

656

restore the database to a sane state. A hot journal only

657

exists when an earlier process was in the middle of committing

658

a transaction when it crashed or lost power.</p>

659

660

<p>A rollback journal is a "hot" journal if all of the following

661

are true:</p>

662

663

<ul>

664

<li>The rollback journal exist.

665

<li>The rollback journal is not an empty file.

666

<li>There is no reserved lock on the main database file.

667

<li>The header of the rollback journal is well-formed and in particular

668

has not been zeroed out.

669

<li>The rollback journal does not

670

contain the name of a master journal file (see

671

<a href="#section_5_5">section 5.5</a> below) or if does

672

contain the name of a master journal, then that master journal

673

file exists.

674

</ul>

675

676

<p>The presence of a hot journal is our indication

677

that a previous process was trying to commit a transaction but

678

it aborted for some reason prior to the completion of the

679

commit. A hot journal means that

680

the database file is in an inconsistent state and needs to

681

be repaired (by rollback) prior to being used.</p>

682

683

684

685

686

<h3>4.3 Obtaining An Exclusive Lock On The Database</h3>

687

688

689

<p>The first step toward dealing with a hot journal is to

690

obtain an exclusive lock on the database file. This prevents two

691

or more processes from trying to rollback the same hot journal

692

at the same time.</p>

693

694

695

696

<h3>4.4 Rolling Back Incomplete Changes</h3>

697

698

699

<p>Once a process obtains an exclusive lock, it is permitted

700

to write to the database file. It then proceeds to read the

701

original content of pages out of the rollback journal and write

702

that content back to were it came from in the database file.

703

Recall that the header of the rollback journal records the original

704

size of the database file prior to the start of the aborted

705

transaction. SQLite uses this information to truncate the

706

database file back to its original size in cases where the

707

incomplete transaction caused the database to grow. At the

708

end of this step, the database should be the same size and

709

contain the same information as it did before the start of

710

the aborted transaction.</p>

711

712

713

714

715

<h3>4.5 Deleting The Hot Journal</h3>

716

717

718

<p>After all information in the rollback journal has been

719

played back into the database file (and flushed to disk in case

720

we encounter yet another power failure), the hot rollback journal

721

can be deleted.</p>

722

723

<p>As in <a href="#section_3_11">section 3.11</a>, the journal

724

file might be truncated to zero length or its header might

725

be overwritten with zeros as an optimization on systems where

726

deleting a file is expense. Either way, the journal is no

727

long hot after this step.</p>

728

729

730

731

732

<h3>4.6 Continue As If The Uncompleted Writes Had Never Happened</h3>

733

734

735

<p>The final recovery step is to reduce the exclusive lock back

736

to a shared lock. Once this happens, the database is back in the

737

state that it would have been if the aborted transaction had never

738

started. Since all of this recovery activity happens completely

739

automatically and transparently, it appears to the program using

740

SQLite as if the aborted transaction had never begun.</p>

741

742

743

744

745

<h2>5.0 Multi-file Commit</h2>

746

747

<p>SQLite allows a single

748

<a href="c3ref/sqlite3.html">database connection</a> to talk to

749

two or more database files simultaneously through the use of

750

the <a href="lang_attach.html">ATTACH DATABASE</a> command.

751

When multiple database files are modified within a single

752

transaction, all files are updated atomically.

753

In other words, either all of the database files are updated or

754

else none of them are.

755

Achieving an atomic commit across multiple database files is

756

more complex that doing so for a single file. This section

757

describes how SQLite works that bit of magic.</p>

758

759

760

761

<h3>5.1 Separate Rollback Journals For Each Database</h3>

762

763

764

<p>When multiple database files are involved in a transaction,

765

each database has its own rollback journal and each database

766

is locked separately. The diagram at the right shows a scenario

767

where three different database files have been modified within

768

one transaction. The situation at this step is analogous to

769

the single-file transaction scenario at

770

<a href="#section_3_6">step 3.6</a>. Each database file has

771

a reserved lock. For each database, the original content of pages

772

that are being changed have been written into the rollback journal

773

for that database, but the content of the journals have not yet

774

been flushed to disk. No changes have been made to the database

775

file itself yet, though presumably there are changes being held

776

in user memory.</p>

777

778

<p>For brevity, the diagrams in this section are simplified from

779

those that came before. Blue color still signifies original content

780

and pink still signifies new content. But the individual pages

781

in the rollback journal and the database file are not shown and

782

we are not making the distinction between information in the

783

operating system cache and information that is on disk. All of

784

these factors still apply in a multi-file commit scenario. They

785

just take up a lot of space in the diagrams and they do not add

786

any new information, so they are omitted here.</p>

787

788

789

790

791

<h3>5.2 The Master Journal File</h3>

792

793

794

<p>The next step in a multi-file commit is the creation of a

795

"master journal" file. The name of the master journal file is

796

the same name as the original database filename (the database

797

that was opened using the

798

<a href="c3ref/open.html">sqlite3_open()</a> interface,

799

not one of the <a href="lang_attach.html">ATTACHed</a> auxiliary

800

databases) with the text "<b>-mj</b><i>HHHHHHHH</i>" appended where

801

<i>HHHHHHHH</i> is a random 32-bit hexadecimal number. The

802

random <i>HHHHHHHH</i> suffix changes for every new master journal.</p>

803

804

<p><i>(Nota bene: The formula for computing the master journal filename

805

given in the previous paragraph corresponds to the implementation as

806

of SQLite version 3.5.0. But this formula is not part of the SQLite

807

specification and is subject to change in future releases.)</i></p>

808

809

<p>Unlike the rollback journals, the master journal does not contain

810

any original database page content. Instead, the master journal contains

811

the full pathnames for rollback journals for every database that is

812

participating in the transaction.</p>

813

814

<p>After the master journal is constructed, its content is flushed

815

to disk before any further actions are taken. On Unix, the directory

816

that contains the master journal is also synced in order to make sure

817

the master journal file will appear in the directory following a

818

power failure.</p>

819

820

821

822

823

<h3>5.3 Updating Rollback Journal Headers</h3>

824

825

826

<p>The next step is to record the full pathname of the master journal file

827

in the header of every rollback journal. Space to hold the master

828

journal filename was reserved at the beginning of each rollback journal

829

as the rollback journals were created.</p>

830

831

<p>The content of each rollback journal is flushed to disk both before

832

and after the master journal filename is written into the rollback

833

journal header. It is important to do both of these flushes. Fortunately,

834

the second flush is usually inexpensive since typically only a single

835

page of the journal file (the first page) has changed.</p>

836

837

<p>This step is analogous to

838

<a href="#section_3_7">step 3.7</a> in the single-file commit

839

scenario described above.</p>

840

841

842

843

844

<h3>5.4 Updating The Database Files</h3>

845

846

847

<p>Once all rollback journal files have been flushed to disk, it

848

is safe to begin updating database files. We have to obtain an

849

exclusive lock on all database files before writing the changes.

850

After all the changes are written, it is important to flush the

851

changes to disk so that they will be preserved in the event of

852

a power failure or operating system crash.</p>

853

854

<p>This step corresponds to steps

855

<a href="#section_3_8">3.8</a>,

856

<a href="#section_3_9">3.9</a>, and

857

<a href="#section_3_10">3.10</a> in the single-file commit

858

scenario described previously.</p>

859

860

861

862

863

<h3>5.5 Delete The Master Journal File</h3>

864

865

866

<p>The next step is to delete the master journal file.

867

This is the point where the multi-file transaction commits.

868

This step corresponds to

869

<a href="#section_3_11">step 3.11</a> in the single-file

870

commit scenario where the rollback journal is deleted.</p>

871

872

<p>If a power failure or operating system crash occurs at this

873

point, the transaction will not rollback when the system reboots

874

even though there are rollback journals present. The

875

difference is the master journal pathname in the header of the

876

rollback journal. Upon restart, SQLite only considers a journal

877

to be hot and will only playback the journal if there is no

878

master journal filename in the header (which is the case for

879

a single-file commit) or if the master journal file still

880

exists on disk.</p>

881

882

883

884

885

<h3>5.6 Clean Up The Rollback Journals</h3>

886

887

888

<p>The final step in a multi-file commit is to delete the

889

individual rollback journals and drop the exclusive locks on

890

the database files so that other processes can see the changes.

891

This corresponds to

892

<a href="#section_3_12">step 3.12</a> in the single-file

893

commit sequence.</p>

894

895

<p>The transaction has already committed at this point so timing

896

is not critical in the deletion of the rollback journals.

897

The current implementation deletes a single rollback journal

898

then unlocks the corresponding database file before proceeding

899

to the next rollback journal. But in the future we might change

900

this so that all rollback journals are deleted before any database

901

files are unlocked. As long as the rollback journal is deleted before

902

its corresponding database file is unlocked it does not matter in what

903

order the rollback journals are deleted or the database files are

904

unlocked.</p>

905

906

907

908

<h2>6.0 Additional Details Of The Commit Process</h2>

909

910

<p><a href="#section_3_0">Section 3.0</a> above provides an overview of

911

how atomic commit works in SQLite. But it glosses over a number of

912

important details. The following subsections will attempt to fill

913

in the gaps.</p>

914

915

916

917

<h3>6.1 Always Journal Complete Sectors</h3>

918

919

<p>When the original content of a database page is written into

920

the rollback journal (as shown in <a href="#section_3_5">section 3.5</a>),

921

SQLite always writes a complete sectors worth of data, even if the

922

page size of the database is smaller than the sector size.

923

Historically, the sector size in SQLite has been hard coded to 512

924

bytes and since the minimum page size is also 512 bytes, this has never

925

been an issue. But beginning with SQLite version 3.3.14, it is possible

926

for SQLite to use mass storage devices with a sector size larger than 512

927

bytes. So, beginning with version 3.3.14, whenever any page within a

928

sector is written into the journal file, all pages in that same sector

929

are stored with it.</p>

930

931

<p>It is important to store all pages of a sector in the rollback

932

journal in order to prevent database corruption following a power

933

loss while writing the sector. Suppose that pages 1, 2, 3, and 4 are

934

all stored in sector 1 and that page 2 is modified. In order to write

935

the changes to page 2, the underlying hardware must also rewrite the

936

content of pages 1, 3, and 4 since the hardware must write the complete

937

sector. If this write operation is interrupted by a power outage,

938

one or more of the pages 1, 3, or 4 might be left with incorrect data.

939

Hence, to avoid lasting corruption to the database, the original content

940

of all of those pages must be contained in the rollback journal.</p>

941

942

943

944

<h3>6.2 Dealing With Garbage Written Into Journal Files</h3>

945

946

<p>When data is appended to the end of the rollback journal,

947

SQLite normally makes the pessimistic assumption that the file

948

is first extended with invalid "garbage" data and that afterwards

949

the correct data replaces the garbage. In other words, SQLite assumes

950

that the file size is increased first and then afterwards the content

951

is written into the file. If a power failure occurs after the file

952

size has been increased but before the file content has been written,

953

the rollback journal can be left containing garbage data. If after

954

power is restored, another SQLite process sees the rollback journal

955

containing the garbage data and tries to roll it back into the original

956

database file, it might copy some of the garbage into the database file

957

and thus corrupt the database file.</p>

958

959

<p>SQLite uses two defenses against this problem. In the first place,

960

SQLite records the number of pages in the rollback journal in the header

961

of the rollback journal. This number is initially zero. So during an

962

attempt to rollback an incomplete (and possibly corrupt) rollback

963

journal, the process doing the rollback will see that the journal

964

contains zero pages and will thus make no changes to the database. Prior

965

to a commit, the rollback journal is flushed to disk to ensure that

966

all content has been synced to disk and there is no "garbage" left

967

in the file, and only then is the page count in the header changed from

968

zero to true number of pages in the rollback journal. The rollback journal

969

header is always kept in a separate sector from any page data so that

970

it can be overwritten and flushed without risking damage to a data

971

page if a power outage occurs. Notice that the rollback journal

972

is flushed to disk twice: once to write the page content and a second

973

time to write the page count in the header.</p>

974

975

<p>The previous paragraph describes what happens when the

976

synchronous pragma setting is "full".</p>

977

978

979

PRAGMA synchronous=FULL;

980

</blockquote>

981

982

<p>The default synchronous setting is full so the above is what usually

983

happens. However, if the synchronous setting is lowered to "normal",

984

SQLite only flushes the rollback journal once, after the page count has

985

been written.

986

This carries a risk of corruption because it might happen that the

987

modified (non-zero) page count reaches the disk surface before all

988

of the data does. The data will have been written first, but SQLite

989

assumes that the underlying filesystem can reorder write requests and

990

that the page count can be burned into oxide first even though its

991

write request occurred last. So as a second line of defense, SQLite

992

also uses a 32-bit checksum on every page of data in the rollback

993

journal. This checksum is evaluated for each page during rollback

994

while rolling back a journal as described in

995

<a href="#section_4_4">section 4.4</a>. If an incorrect checksum

996

is seen, the rollback is abandoned. Note that the checksum does

997

not guarantee that the page data is correct since there is a small

998

but finite probability that the checksum might be right even if the data is

999

corrupt. But the checksum does at least make such an error unlikely.

1000

</p>

1001

1002

<p>Note that the checksums in the rollback journal are not necessary

1003

if the synchronous setting is FULL. We only depend on the checksums

1004

when synchronous is lowered to NORMAL. Nevertheless, the checksums

1005

never hurt and so they are included in the rollback journal regardless

1006

of the synchronous setting.</p>

1007

1008

1009

1010

<h3>6.3 Cache Spill Prior To Commit</h3>

1011

1012

<p>The commit process shown in <a href="#section_3_0">section 3.0</a>

1013

assumes that all database changes fit in memory until it is time to

1014

commit. This is the common case. But sometimes a larger change will

1015

overflow the user-space cache prior to transaction commit. In those

1016

cases, the cache must spill to the database before the transaction

1017

is complete.</p>

1018

1019

<p>At the beginning of a cache spill, the status of the database

1020

connection is as shown in <a href="#section_3_6">step 3.6</a>.

1021

Original page content has been saved in the rollback journal and

1022

modifications of the pages exist in user memory. To spill the cache,

1023

SQLite executes steps <a href="#section_3_7">3.7</a> through

1024

<a href="#section_3_9">3.9</a>. In other words, the rollback journal

1025

is flushed to disk, an exclusive lock is acquired, and changes are

1026

written into the database. But the remaining steps are deferred

1027

until the transaction really commits. A new journal header is

1028

appended to the end of the rollback journal (in its own sector)

1029

and the exclusive database lock is retained, but otherwise processing

1030

returns to <a href="#section_3_6">step 3.6</a>. When the transaction

1031

commits, or if another cache spill occurs, steps

1032

<a href="#section_3_7">3.7</a> and <a href="#section_3_9">3.9</a> are

1033

repeated. (Step <a href="#section_3_8">3.8</a> is omitted on second

1034

and subsequent passes since an exclusive database lock is already held

1035

due to the first pass.)</p>

1036

1037

<p>A cache spill causes the lock on the database file to

1038

escalate from reserved to exclusive. This reduces concurrency.

1039

A cache spill also causes extra disk flush or fsync operations to

1040

occur and these operations are slow, hence a cache spill can

1041

seriously reduce performance.

1042

For these reasons a cache spill is avoided whenever possible.</p>

1043

1044

1045

1046

<h2>7.0 Optimizations</h2>

1047

1048

<p>Profiling indicates that for most systems and in most circumstances

1049

SQLite spends most of its time doing disk I/O. It follows then that

1050

anything we can do to reduce the amount of disk I/O will likely have a

1051

large positive impact on the performance of SQLite. This section

1052

describes some of the techniques used by SQLite to try to reduce the

1053

amount of disk I/O to a minimum while still preserving atomic commit.</p>

1054

1055

1056

1057

<h3>7.1 Cache Retained Between Transactions</h3>

1058

1059

<p><a href="#section_3_12">Step 3.12</a> of the commit process shows

1060

that once the shared lock has been released, all user-space cache

1061

images of database content must be discarded. This is done because

1062

without a shared lock, other processes are free to modify the database

1063

file content and so any user-space image of that content might become

1064

obsolete. Consequently, each new transaction would begin by rereading

1065

data which had previously been read. This is not as bad as it sounds

1066

at first since the data being read is still likely in the operating

1067

systems file cache. So the "read" is really just a copy of data

1068

from kernel space into user space. But even so, it still takes time.</p>

1069

1070

<p>Beginning with SQLite version 3.3.14 a mechanism has been added

1071

to try to reduce the needless rereading of data. In newer versions

1072

of SQLite, the data in the user-space pager cache is retained when

1073

the lock on the database file is released. Later, after the

1074

shared lock is acquired at the beginning of the next transaction,

1075

SQLite checks to see if any other process has modified the database

1076

file. If the database has been changed in any way since the lock

1077

was last released, the user-space cache is erased at that point.

1078

But commonly the database file is unchanged and the user-space cache

1079

can be retained, and some unnecessary read operations can be avoided.</p>

1080

1081

<p>In order to determine whether or not the database file has changed,

1082

SQLite uses a counter in the database header (in bytes 24 through 27)

1083

which is incremented during every change operation. SQLite saves a copy

1084

of this counter prior to releasing its database lock. Then after

1085

acquiring the next database lock it compares the saved counter value

1086

against the current counter value and erases the cache if the values

1087

are different, or reuses the cache if they are the same.</p>

1088

1089

1090

<h3>7.2 Exclusive Access Mode</h3>

1091

1092

<p>SQLite version 3.3.14 adds the concept of "Exclusive Access Mode".

1093

In exclusive access mode, SQLite retains the exclusive

1094

database lock at the conclusion of each transaction. This prevents

1095

other processes for accessing the database, but in many deployments

1096

only a single process is using a database so this is not a

1097

serious problem. The advantage of exclusive access mode is that

1098

disk I/O can be reduced in three ways:</p>

1099

1100

<ol>

1101

<li><p>It is not necessary to increment the change counter in the

1102

database header for transactions after the first transaction. This

1103

will often save a write of page one to both the rollback

1104

journal and the main database file.</p></li>

1105

1106

<li><p>No other processes can change the database so there is never

1107

a need to check the change counter and clear the user-space cache

1108

at the beginning of a transaction.</p></li>

1109

1110

<li><p>Each transaction can be committed by overwriting the rollback

1111

journal header with zeros rather than deleting the journal file.

1112

This avoids having to modify the directory entry for the journal file

1113

and it avoids having to deallocate disk sectors associated with the

1114

journal. Furthermore, the next transaction will overwrite existing

1115

journal file content rather than append new content and on most systems

1116

overwriting is much faster than appending.</p></li>

1117

</ol>

1118

1119

<p>The third optimization, zeroing the journal file header rather than

1120

deleting the rollback journal file,

1121

does not depend on holding an exclusive lock at all times.

1122

This optimization can be set independently of exclusive lock mode

1123

using the <a href="pragma.html#pragma_journal_mode">journal_mode pragma</a>

1124

as described in <a href="#section_7_6">section 7.6</a> below.</p>

1125

1126

1127

1128

<h3>7.3 Do Not Journal Freelist Pages</h3>

1129

1130

<p>When information is deleted from an SQLite database, the pages used

1131

to hold the deleted information are added to a "<a href="fileformat2.html#freelist">freelist</a>". Subsequent

1132

inserts will draw pages off of this freelist rather than expanding the

1133

database file.</p>

1134

1135

<p>Some freelist pages contain critical data; specifically the locations

1136

of other freelist pages. But most freelist pages contain nothing useful.

1137

These latter freelist pages are called "leaf" pages. We are free to

1138

modify the content of a leaf freelist page in the database without

1139

changing the meaning of the database in any way.</p>

1140

1141

<p>Because the content of leaf freelist pages is unimportant, SQLite

1142

avoids storing leaf freelist page content in the rollback journal

1143

in <a href="#section_3_5">step 3.5</a> of the commit process.

1144

If a leaf freelist page is changed and that change does not get rolled back

1145

during a transaction recovery, the database is not harmed by the omission.

1146

Similarly, the content of a new freelist page is never written back

1147

into the database at <a href="#section_3_9">step 3.9</a> nor

1148

read from the database at <a href="#section_3_3">step 3.3</a>.

1149

These optimizations can greatly reduce the amount of I/O that occurs

1150

when making changes to a database file that contains free space.</p>

1151

1152

1153

1154

<h3>7.4 Single Page Updates And Atomic Sector Writes</h3>

1155

1156

<p>Beginning in SQLite version 3.5.0, the new Virtual File System (VFS)

1157

interface contains a method named xDeviceCharacteristics which reports

1158

on special properties that the underlying mass storage device

1159

might have. Among the special properties that

1160

xDeviceCharacteristics might report is the ability of to do an

1161

atomic sector write.</p>

1162

1163

<p>Recall that by default SQLite assumes that sector writes are

1164

linear but not atomic. A linear write starts at one end of the

1165

sector and changes information byte by byte until it gets to the

1166

other end of the sector. If a power loss occurs in the middle of

1167

a linear write then part of the sector might be modified while the

1168

other end is unchanged. In an atomic sector write, either the entire

1169

sector is overwritten or else nothing in the sector is changed.</p>

1170

1171

<p>We believe that most modern disk drives implement atomic sector

1172

writes. When power is lost, the drive uses energy stored in capacitors

1173

and/or the angular momentum of the disk platter to provide power to

1174

complete any operation in progress. Nevertheless, there are so many

1175

layers in between the write system call and the on-board disk drive

1176

electronics that we take the safe approach in both Unix and w32 VFS

1177

implementations and assume that sector writes are not atomic. On the

1178

other hand, device

1179

manufactures with more control over their filesystems might want

1180

to consider enabling the atomic write property of xDeviceCharacteristics

1181

if their hardware really does do atomic writes.</p>

1182

1183

<p>When sector writes are atomic and the page size of a database is

1184

the same as a sector size, and when there is a database change that

1185

only touches a single database page, then SQLite skips the whole

1186

journaling and syncing process and simply writes the modified page

1187

directly into the database file. The change counter in the first

1188

page of the database file is modified separately since no harm is

1189

done if power is lost before the change counter can be updated.</p>

1190

1191

1192

1193

<h3>7.5 Filesystems With Safe Append Semantics</h3>

1194

1195

<p>Another optimization introduced in SQLite version 3.5.0 makes

1196

use of "safe append" behavior of the underlying disk.

1197

Recall that SQLite assumes that when data is appended to a file

1198

(specifically to the rollback journal) that the size of the file

1199

is increased first and that the content is written second. So

1200

if power is lost after the file size is increased but before the

1201

content is written, the file is left containing invalid "garbage"

1202

data. The xDeviceCharacteristics method of the VFS might, however,

1203

indicate that the filesystem implements "safe append" semantics.

1204

This means that the content is written before the file size is

1205

increased so that it is impossible for garbage to be introduced

1206

into the rollback journal by a power loss or system crash.</p>

1207

1208

<p>When safe append semantics are indicated for a filesystem,

1209

SQLite always stores the special value of -1 for the page count

1210

in the header of the rollback journal. The -1 page count value

1211

tells any process attempting to rollback the journal that the

1212

number of pages in the journal should be computed from the journal

1213

size. This -1 value is never changed. So that when a commit

1214

occurs, we save a single flush operation and a sector write of

1215

the first page of the journal file. Furthermore, when a cache

1216

spill occurs we no longer need to append a new journal header

1217

to the end of the journal; we can simply continue appending

1218

new pages to the end of the existing journal.</p>

1219

1220

1221

<h3>7.6 Persistent Rollback Journals</h3>

1222

1223

<p>Deleting a file is an expensive operation on many systems.

1224

So as an optimization, SQLite can be configured to avoid the

1225

delete operation of <a href="#section_3_11">section 3.11</a>.

1226

Instead of deleting the journal file in order to commit a transaction,

1227

the file is either truncated to zero bytes in length or its

1228

header is overwritten with zeros. Truncating the file to zero

1229

length saves having to make modifications to the directory containing

1230

the file since the file is not removed from the directory.

1231

Overwriting the header has the additional savings of not having

1232

to update the length of the file (in the "inode" on many systems)

1233

and not having to deal with newly freed disk sectors. Furthermore,

1234

at the next transaction the journal will be created by overwriting

1235

existing content rather than appending new content onto the end

1236

of a file, and overwriting is often much faster than appending.</p>

1237

1238

<p>SQLite can be configured to commit transactions by overwriting

1239

the journal header with zeros instead of deleting the journal file

1240

by setting the "PERSIST" journaling mode using the

1241

<a href="pragma.html#pragma_journal_mode">journal_mode</a> PRAGMA.

1242

For example:</p>

1243

1244

1245

PRAGMA journal_mode=PERSIST;

1246

</per></blockquote>

1247

1248

<p>The use of persistent journal mode provide a noticeable performance

1249

improvement on many systems. Of course, the drawback is that the

1250

journal files remain on the disk, using disk space and cluttering

1251

directories, long after the transaction commits. The only safe way

1252

to delete a persistent journal file is to commit a transaction

1253

with journaling mode set to DELETE:</p>

1254

1255

1256

PRAGMA journal_mode=DELETE;

1257

BEGIN EXCLUSIVE;

1258

COMMIT;

1259

</per></blockquote>

1260

1261

<p>Beware of deleting persistent journal files by any other means

1262

since the journal file might be hot, in which case deleting it will

1263

corrupt the corresponding database file.</p>

1264

1265

<p>Beginning in SQLite version 3.6.4, the TRUNCATE journal mode is

1266

also supported:</p>

1267

1268

1269

PRAGMA journal_mode=TRUNCATE;

1270

</pre></blockquote>

1271

1272

<p>In truncate journal mode, the transaction is committed by truncating

1273

the journal file to zero length rather than deleting the journal file

1274

(as in DELETE mode) or by zeroing the header (as in PERSIST mode).

1275

TRUNCATE mode shares the advantage of PERSIST mode that the directory

1276

that contains the journal file and database does not need to be updated.

1277

Hence truncating a file is often faster than deleting it. TRUNCATE has

1278

the additional advantage that it is not followed by a

1279

system call (ex: fsync()) to synchronize the change to disk. It might

1280

be safer if it did.

1281

But on many modern filesystems, a truncate is an atomic and

1282

synchronous operation and so we think that TRUNCATE will usually be safe

1283

in the face of power failures. If you are uncertain about whether or

1284

not TRUNCATE will be synchronous and atomic on your filesystem and it is

1285

important to you that your database survive a power loss or operating

1286

system crash that occurs during the truncation operation, then you might

1287

consider using a different journaling mode.</p>

1288

1289

<p>On embedded systems with synchronous filesystems, TRUNCATE results

1290

in slower behavior than PERSIST. The commit operation is the same speed.

1291

But subsequent transactions are slower following a TRUNCATE because it is

1292

faster to overwrite existing content than to append to the end of a file.

1293

New journal file entries will always be appended following a TRUNCATE but

1294

will usually overwrite with PERSIST.</p>

1295

1296

1297

1298

<h2>8.0 Testing Atomic Commit Behavior</h2>

1299

1300

<p>The developers of SQLite are confident that it is robust

1301

in the face of power failures and system crashes because the

1302

automatic test procedures do extensive checks on

1303

the ability of SQLite to recover from simulated power loss.

1304

We call these the "crash tests".</p>

1305

1306

<p>Crash tests in SQLite use a modified VFS that can simulate

1307

the kinds of filesystem damage that occur during a power

1308

loss or operating system crash. The crash-test VFS can simulate

1309

incomplete sector writes, pages filled with garbage data because

1310

a write has not completed, and out of order writes, all occurring

1311

at varying points during a test scenario. Crash tests execute

1312

transactions over and over, varying the time at which a simulated

1313

power loss occurs and the properties of the damage inflicted.

1314

Each test then reopens the database after the simulated crash and

1315

verifies that the transaction either occurred completely

1316

or not at all and that the database is in a completely

1317

consistent state.</p>

1318

1319

<p>The crash tests in SQLite have discovered a number of very

1320

subtle bugs (now fixed) in the recovery mechanism. Some of

1321

these bugs were very obscure and unlikely to have been found

1322

using only code inspection and analysis techniques. From this

1323

experience, the developers of SQLite feel confident that any other

1324

database system that does not use a similar crash test system

1325

likely contains undetected bugs that will lead to database

1326

corruption following a system crash or power failure.</p>

1327

1328

1329

1330

<h2>9.0 Things That Can Go Wrong</h2>

1331

1332

<p>The atomic commit mechanism in SQLite has proven to be robust,

1333

but it can be circumvented by a sufficiently creative

1334

adversary or a sufficiently broken operating system implementation.

1335

This section describes a few of the ways in which an SQLite database

1336

might be corrupted by a power failure or system crash.

1337

(See also: <a href="lockingv3.html#how_to_corrupt">How To Corrupt Your Database Files</a>.)</p>

1338

1339

1340

1341

<h3>9.1 Broken Locking Implementations</h3>

1342

1343

<p>SQLite uses filesystem locks to make sure that only one

1344

process and database connection is trying to modify the database

1345

at a time. The filesystem locking mechanism is implemented

1346

in the VFS layer and is different for every operating system.

1347

SQLite depends on this implementation being correct. If something

1348

goes wrong and two or more processes are able to write the same

1349

database file at the same time, severe damage can result.</p>

1350

1351

<p>We have received reports of implementations of both

1352

Windows network filesystems and NFS in which locking was

1353

subtly broken. We can not verify these reports, but as

1354

locking is difficult to get right on a network filesystem

1355

we have no reason to doubt them. You are advised to

1356

avoid using SQLite on a network filesystem in the first place,

1357

since performance will be slow. But if you must use a

1358

network filesystem to store SQLite database files, consider

1359

using a secondary locking mechanism to prevent simultaneous

1360

writes to the same database even if the native filesystem

1361

locking mechanism malfunctions.</p>

1362

1363

<p>The versions of SQLite that come preinstalled on Apple

1364

Mac OS X computers contain a version of SQLite that has been

1365

extended to use alternative locking strategies that work on

1366

all network filesystems that Apple supports. These extensions

1367

used by Apple work great as long as all processes are accessing

1368

the database file in the same way. Unfortunately, the locking

1369

mechanisms do not exclude one another, so if one process is

1370

accessing a file using (for example) AFP locking and another

1371

process (perhaps on a different machine) is using dot-file locks,

1372

the two processes might collide because AFP locks do not exclude

1373

dot-file locks or vice versa.</p>

1374

1375

1376

1377

<h3>9.2 Incomplete Disk Flushes</h3>

1378

1379

<p>SQLite uses the fsync() system call on Unix and the FlushFileBuffers()

1380

system call on w32 in order to sync the file system buffers onto disk

1381

oxide as shown in <a href="#section_3_7">step 3.7</a> and

1382

<a href="#section_3_10">step 3.10</a>. Unfortunately, we have received

1383

reports that neither of these interfaces works as advertised on many

1384

systems. We hear that FlushFileBuffers() can be completely disabled

1385

using registry settings on some Windows versions. Some historical

1386

versions of Linux contain versions of fsync() which are no-ops on

1387

some filesystems, we are told. Even on systems where

1388

FlushFileBuffers() and fsync() are said to be working, often

1389

the IDE disk control lies and says that data has reached oxide

1390

while it is still held only in the volatile control cache.</p>

1391

1392

<p>On the Mac, you can set this pragma:</p>

1393

1394

1395

PRAGMA fullfsync=ON;

1396

</blockquote>

1397

1398

<p>Setting fullfsync on a Mac will guarantee that data really does

1399

get pushed out to the disk platter on a flush. But the implementation

1400

of fullfsync involves resetting the disk controller. And so not only

1401

is it profoundly slow, it also slows down other unrelated disk I/O.

1402

So its use is not recommended.</p>

1403

1404

1405

1406

<h3>9.3 Partial File Deletions</h3>

1407

1408

<p>SQLite assumes that file deletion is an atomic operation from the

1409

point of view of a user process. If power fails in the middle of

1410

a file deletion, then after power is restored SQLite expects to see

1411

either the entire file with all of its original data intact, or it

1412

expects not to find the file at all. Transactions may not be atomic

1413

on systems that do not work this way.</p>

1414

1415

1416

1417

<h3>9.4 Garbage Written Into Files</h3>

1418

1419

<p>SQLite database files are ordinary disk files that can be

1420

opened and written by ordinary user processes. A rogue process

1421

can open an SQLite database and fill it with corrupt data.

1422

Corrupt data might also be introduced into an SQLite database

1423

by bugs in the operating system or disk controller; especially

1424

bugs triggered by a power failure. There is nothing SQLite can

1425

do to defend against these kinds of problems.</p>

1426

1427

1428

1429

<h3>9.5 Deleting Or Renaming A Hot Journal</h3>

1430

1431

<p>If a crash or power loss does occur and a hot journal is left on

1432

the disk, it is essential that the original database file and the hot

1433

journal remain on disk with their original names until the database

1434

file is opened by another SQLite process and rolled back.

1435

During recovery at <a href="section_4_2">step 4.2</a> SQLite locates

1436

the hot journal by looking for a file in the same directory as the

1437

database being opened and whose name is derived from the name of the

1438

file being opened. If either the original database file or the

1439

hot journal have been moved or renamed, then the hot journal will

1440

not be seen and the database will not be rolled back.</p>

1441

1442

<p>We suspect that a common failure mode for SQLite recovery happens

1443

like this: A power failure occurs. After power is restored, a well-meaning

1444

user or system administrator begins looking around on the disk for

1445

damage. They see their database file named "important.data". This file

1446

is perhaps familiar to them. But after the crash, there is also a

1447

hot journal named "important.data-journal". The user then deletes

1448

the hot journal, thinking that they are helping to cleanup the system.

1449

We know of no way to prevent this other than user education.</p>

1450

1451

<p>If there are multiple (hard or symbolic) links to a database file,

1452

the journal will be created using the name of the link through which

1453

the file was opened. If a crash occurs and the database is opened again

1454

using a different link, the hot journal will not be located and no

1455

rollback will occur.</p>

1456

1457

<p>Sometimes a power failure will cause a filesystem to be corrupted

1458

such that recently changed filenames are forgotten and the file is

1459

moved into a "/lost+found" directory. When that happens, the hot

1460

journal will not be found and recovery will not occur.

1461

SQLite tries to prevent this

1462

by opening and syncing the directory containing the rollback journal

1463

at the same time it syncs the journal file itself. However, the

1464

movement of files into /lost+found can be caused by unrelated processes

1465

creating unrelated files in the same directory as the main database file.

1466

And since this is out from under the control of SQLite, there is nothing

1467

that SQLite can do to prevent it. If you are running on a system that

1468

is vulnerable to this kind of filesystem namespace corruption (most

1469

modern journalling filesystems are immune, we believe) then you might

1470

want to consider putting each SQLite database file in its own private

1471

subdirectory.</p>

1472

1473

1474

1475

<h2>10.0 Future Directions And Conclusion</h2>

1476

1477

<p>Every now and then someone discovers a new failure mode for

1478

the atomic commit mechanism in SQLite and the developers have to

1479

put in a patch. This is happening less and less and the

1480

failure modes are becoming more and more obscure. But it would

1481

still be foolish to suppose that the atomic commit logic of

1482

SQLite is entirely bug-free. The developers are committed to fixing

1483

these bugs as quickly as they might be found.</p>

1484

1485

<p>

1486

The developers are also on the lookout for new ways to

1487

optimize the commit mechanism. The current VFS implementations

1488

for Unix (Linux and Mac OS X) and Windows make pessimistic assumptions about

1489

the behavior of those systems. After consultation with experts

1490

on how these systems work, we might be able to relax some of the

1491

assumptions on these systems and allow them to run faster. In

1492

particular, we suspect that most modern filesystems exhibit the

1493

safe append property and that many of them might support atomic

1494

sector writes. But until this is known for certain, SQLite will

1495

take the conservative approach and assume the worst.</p>

1496