
onfocus="entersearch()" onblur="leavesearch()" style="width:24ex;padding:1px 1ex; border:solid white 1px; font-size:0.9em ; font-style:italic;color:#044a64;" value="Search SQLite Docs...">


</form>

114

</div>

115

</table>

116

</div></div></div></div>

117

</td></tr></table>


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

122

<html>

123

<head>


</head>

127

<div id=document_title>SQLite File IO Specification</div>

128

<div id=toc_header>Table Of Contents</div>

129

130

JavaScript is required for some features of this document, including the
table of contents, figure numbering and internal references (section
numbers and hyperlinks).

133

134

</div>

135

136

<h1 id=overview>Overview</h1>

137

138

SQLite stores an entire database within a single file, the format of
which is described in the SQLite Database File Format
document <cite>ff_sqlitert_requirements</cite>. Each database file is
stored within a file system, presumably provided by the host operating
system. Instead of interfacing with the operating system directly,
SQLite requires the host application to supply an adaptor component that
implements the SQLite Virtual File System (VFS) interface
(described in <cite>capi_sqlitert_requirements</cite>). The adaptor
component is responsible for translating the calls made by SQLite to
the VFS interface into calls to the file-system interface
provided by the operating system. This arrangement is depicted in figure
<cite>figure_vfs_role</cite>.

150

151

Figure - Virtual File System (VFS) Adaptor

152

</center>

153

154

Although it would be easy to design a system that uses the VFS

155

interface to read and update the content of a database file stored

156

within a file-system, there are several complicated issues that need

157

to be addressed by such a system:

158

<ol>

159

<li>SQLite is required to implement atomic and durable

160

transactions (the 'A' and 'D' from the ACID acronym), even if an

161

application, operating system or power failure occurs midway through or

162

shortly after updating a database file.

163

To implement atomic transactions in the face of potential

164

application, operating system or power failures, database writers write

165

a copy of those portions of the database file that they are going to

166

modify into a second file, the journal file, before writing

167

to the database file. If a failure does occur while modifying the

168

database file, SQLite can reconstruct the original database

169

(before the modifications were attempted) based on the contents of

170

the journal file.

171

<li>SQLite is required to implement isolated transactions (the 'I'

172

from the ACID acronym).

173

This is done by using the file locking facilities provided by the
VFS adaptor to serialize writers (write transactions) and to prevent
readers (read transactions) from accessing database files while writers
are midway through updating them.

177

<li>For performance reasons, it is advantageous to minimize the
quantity of data read from and written to the file-system.

179

As one might expect, the amount of data read from the database

180

file is minimized by caching portions of the database file in main

181

memory. Additionally, multiple updates to the database file that

182

are part of the same write transaction may be cached in

183

main memory and written to the file together, allowing for

184

more efficient IO patterns and eliminating the redundant write

185

operations that could take place if part of the database file is

186

modified more than once within a single write transaction.

187

</ol>
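The journalling idea from the first point above can be sketched in a few lines of Python. This is only an illustrative model, not SQLite's implementation: the database and the journal are plain dictionaries mapping page numbers to content, the names `update_with_journal` and `recover` are hypothetical, and the syncing of the journal that a real writer must perform before touching the database file is omitted.

```python
def update_with_journal(database, journal, changes, fail_after=None):
    """Apply `changes` (page number -> new content) to `database`.

    Before a page is overwritten, its original content is saved in
    `journal`. `fail_after` simulates a failure after that many page
    writes. Returns True if the transaction committed.
    """
    for written, (pgno, content) in enumerate(changes.items()):
        if fail_after is not None and written >= fail_after:
            return False  # simulated failure midway through the update
        if pgno not in journal:
            # None records that the page did not exist beforehand.
            journal[pgno] = database.get(pgno)
        database[pgno] = content
    journal.clear()  # commit: the journal is no longer needed
    return True


def recover(database, journal):
    """Restore the original pages recorded in a leftover journal."""
    for pgno, original in journal.items():
        if original is None:
            database.pop(pgno, None)
        else:
            database[pgno] = original
    journal.clear()
```

If `update_with_journal()` returns False, calling `recover()` with the same two dictionaries puts the database back into the state it was in before the write transaction started.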

188

189

System requirement references for the above points.

190

191

This document describes in detail the way that SQLite uses the API

192

provided by the VFS adaptor component to solve the problems and implement

193

the strategies enumerated above. It also specifies the assumptions made

194

about the properties of the system that the VFS adaptor provides

195

access to. For example, specific assumptions about the extent of

196

data corruption that may occur if a power failure occurs while a

197

database file is being updated are presented in section

198

<cite>fs_characteristics</cite>.

199

200

This document does not specify the details of the interface that must
be implemented by the VFS adaptor component; that is left to
<cite>capi_sqlitert_requirements</cite>.

203

<h2>Relationship to Other Documents</h2>

204

205

Related to C-API requirements:

206

<ol>

207

<li>Opening a connection.

208

<li>Closing a connection.

209

</ol>

210

211

Related to SQL requirements:

212

<ol>

213

<li value=3>Opening a read-only transaction.

214

<li>Terminating a read-only transaction.

215

<li>Opening a read-write transaction.

216

<li>Committing a read-write transaction.

217

<li>Rolling back a read-write transaction.

218

<li>Opening a statement transaction.

219

<li>Committing a statement transaction.

220

<li>Rolling back a statement transaction.

221

<li>Committing a multi-file transaction.

222

</ol>

223

224

Related to file-format requirements:

225

<ol>

226

<li value=12>Pinning (reading) a database page.

227

<li>Unpinning a database page.

228

<li>Modifying the contents of a database page.

229

<li>Appending a new page to the database file.

230

<li>Truncating a page from the end of the database file.

231

</ol>

232

<h2>Document Structure</h2>

233

234

Section <cite>vfs_assumptions</cite> of this document describes the

235

various assumptions made about the system to which the VFS adaptor

236

component provides access. The basic capabilities and functions

237

required from the VFS implementation are presented along with the

238

description of the VFS interface in

239

<cite>capi_sqlitert_requirements</cite>. Section

240

<cite>vfs_assumptions</cite> complements this by describing in more

241

detail the assumptions made about VFS implementations on which the

242

algorithms presented in this document depend. Some of these assumptions

243

relate to performance issues, but most concern the expected state of

244

the file-system following a failure that occurs midway through

245

modifying a database file.

246

247

Section <cite>database_connections</cite> introduces the concept of

248

a database connection, a combination of a file-handle and

249

in-memory cache used to access a database file. It also describes the

250

VFS operations required when a new database connection is

251

created (opened), and when one is destroyed (closed).

252

253

Section <cite>reading_data</cite> describes the steps required to

254

open a read transaction and read data from a database file.

255

256

Section <cite>writing_data</cite> describes the steps required to

257

open a write transaction and write data to a database file.

258

259

Section <cite>rollback</cite> describes the way in which aborted

260

write transactions may be rolled back (reverted), either as

261

a result of an explicit user directive or because an application,

262

operating system or power failure occurred while SQLite was midway

263

through updating a database file.

264

265

Section <cite>page_cache_algorithms</cite> describes some of the

266

algorithms used to determine exactly which portions of the database

267

file are cached by a page cache, and the effect that they

268

have on the quantity and nature of the required VFS operations.

269

It may at first seem odd to include the page cache, which is

270

primarily an implementation detail, in this document. However, it is

271

necessary to acknowledge and describe the page cache in order to

272

provide a more complete explanation of the nature and quantity of IO

273

performed by SQLite.

274

<h2>Glossary</h2>

275

276

After this document is ready, make the vocabulary consistent and

277

then add a glossary here.

278

<h1 id=vfs_assumptions>VFS Adaptor Related Assumptions</h1>

279

280

This section documents those assumptions made about the system that

281

the VFS adaptor provides access to. The assumptions noted in section

282

<cite>fs_characteristics</cite> are particularly important. If these

283

assumptions are not true, then a power or operating system failure

284

may cause SQLite databases to become corrupted.

285

<h2 id=fs_performance>Performance Related Assumptions</h2>

286

287

SQLite uses the assumptions in this section to try to speed up

288

reading from and writing to the database file.

289

290

It is assumed that writing a series of sequential blocks of data to

291

a file in order is faster than writing the same blocks in an arbitrary

292

order.

293

<h2 id=fs_characteristics>System Failure Related Assumptions</h2>

294

295

In the event of an operating system or power failure, the various
combinations of file-system software and storage hardware available
provide varying levels of guarantee as to the integrity of the data
written to the file system just before or during the failure. The exact
combination of IO operations that SQLite is required to perform in
order to safely modify a database file depends on the exact
characteristics of the target platform.

302

303

This section describes the assumptions that SQLite makes about the

304

content of a file-system following a power or system failure. In

305

other words, it describes the extent of file and file-system corruption

306

that such an event may cause.

307

308

SQLite queries an implementation for file-system characteristics
using the xDeviceCharacteristics() and xSectorSize() methods of the
file-handle open on the database file. These two methods are only ever
called on file-handles open on database files. They are not called for
journal files, master-journal files or
temporary database files.

314

315

The file-system sector size value determined by calling the
xSectorSize() method is a power of 2 value between 512 and 32768,
inclusive (reference to exactly how this is
determined). SQLite assumes that the underlying storage
device stores data in blocks of sector-size bytes each,
called sectors. It is also assumed that each aligned block of
sector-size bytes of each file is stored in a single device
sector. If the file is not an exact multiple of sector-size
bytes in size, then the final device sector is partially empty.

324

325

Normally, SQLite assumes that if a power failure occurs while
updating any portion of a sector then the contents of the entire
device sector are suspect following recovery. After writing to
any part of a sector within a file, it is assumed that the modified
sector contents are held in a volatile buffer somewhere within
the system (main memory, disk cache etc.). SQLite does not assume
that the updated data has reached the persistent storage media until
after it has successfully synced the corresponding file by
invoking the VFS xSync() method. Syncing a file causes all
modifications to the file up until that point to be committed to
persistent storage.

336

337

Based on the above, SQLite is designed around a model of the

338

file-system whereby any sector of a file written to is considered to be

339

in a transient state until after the file has been successfully

340

synced. Should a power or system failure occur while a sector

341

is in a transient state, it is impossible to predict its contents

342

following recovery. It may be written correctly, not written at all,

343

overwritten with random data, or any combination thereof.

344

345

For example, if the sector-size of a given file-system is
2048 bytes, and SQLite opens a file and writes a 1024 byte block
of data to offset 3072 of the file, then according to the model
the second sector of the file is in the transient state. If a
power failure or operating system crash occurs before or during
the next call to xSync() on the file handle, then following system
recovery SQLite assumes that all file data between byte offsets 2048
and 4095, inclusive, is invalid. It also assumes that the first
sector of the file, containing the data from byte offset 0 to 2047
inclusive, is valid, since it was not in a transient state when the
crash occurred.
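The arithmetic behind this example can be written down as a short Python sketch. The function names are hypothetical and sectors are numbered from zero, so the "second sector" of the text is sector index 1.

```python
def sectors_spanned(offset, length, sector_size):
    """Return the zero-based indexes of the sectors touched by a write."""
    if length <= 0:
        return range(0)
    first = offset // sector_size
    last = (offset + length - 1) // sector_size
    return range(first, last + 1)


def untrusted_byte_range(offset, length, sector_size):
    """Return (first, last) byte offsets, inclusive, that must be treated
    as suspect if a failure occurs before the file is next synced."""
    sectors = sectors_spanned(offset, length, sector_size)
    if not sectors:
        return None
    return (sectors[0] * sector_size, (sectors[-1] + 1) * sector_size - 1)
```

For the write described above, `untrusted_byte_range(3072, 1024, 2048)` returns `(2048, 4095)`. A write that straddles a sector boundary places every sector it touches in the transient state, including the bytes of those sectors that the write did not modify.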

356

357

Assuming that any and all sectors in the transient state may be

358

corrupted following a power or system failure is a very pessimistic

359

approach. Some modern systems provide more sophisticated guarantees

360

than this. SQLite allows the VFS implementation to specify at runtime

361

that the current platform supports zero or more of the following

362

properties:

363

<ul>

364

<li>The safe-append property. If a system supports the

365

safe-append property, it means that when a file is extended

366

the new data is written to the persistent media before the size

367

of the file itself is updated. This guarantees that if a failure

368

occurs after a file has been extended, following recovery

369

the write operations that extended the file will appear to have

370

succeeded or not occurred at all. It is not possible for invalid

371

or garbage data to appear in the extended region of the file.

372

<li>The atomic-write property. A system that supports this
property also specifies the size or sizes of the blocks that it
is capable of writing atomically. Valid sizes are powers of two
greater than or equal to 512. If a write operation modifies a block
of n bytes, where n is one of the block sizes for which atomic-write
is supported, then it is impossible for an aligned write of n
bytes to cause data corruption. If a failure occurs after such
a write operation and before the applicable file handle is
synced, then following recovery it will appear as if the
write operation succeeded or did not take place at all. It is not
possible that only part of the data specified by the write operation
was written to persistent media, nor is it possible for any content
of the sectors spanned by the write operation to be replaced with
garbage data, as is normally assumed to be possible.

386

<li>The sequential-write property. A system that supports the

387

sequential-write property guarantees that the various write

388

operations on files within the same file-system are written to the

389

persistent media in the same order that they are performed by the

390

application and that each operation is concluded before the next

391

is begun. If a system supports the sequential-write

392

property, then the model used to determine the possible states of

393

the file-system following a failure is different.

394

If a system supports sequential-write, it is assumed that
syncing any file within the file system flushes all write
operations on all files (not just the synced file) to
the persistent media. If a failure does occur, it is not known
whether or not any of the write operations performed by SQLite
since the last time a file was synced reached the persistent
media. SQLite is able to
assume that if the write operations of unknown status are arranged
in the order that they occurred:

402

<ol>

403

<li> the first n operations will have been executed

404

successfully,

405

<li> the next operation puts all device sectors that it modifies

406

into the transient state, so that following recovery each

407

sector may be partially written, completely written, not

408

written at all or populated with garbage data,

409

<li> the remaining operations will not have had any effect on

410

the contents of the file-system.

411

</ol>

412

</ul>
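As an illustration, the three properties above are reported by xDeviceCharacteristics() as a bit mask. The sketch below decodes such a mask in Python, using the SQLITE_IOCAP_* values defined in sqlite3.h. The function name `describe_characteristics` is hypothetical. Treating SQLITE_IOCAP_ATOMIC (writes of any size are atomic) as "every listed block size" is a simplification made for this example.

```python
# Bit values of the SQLITE_IOCAP_* flags from sqlite3.h.
SQLITE_IOCAP_ATOMIC = 0x001
ATOMIC_BLOCK_FLAGS = {
    0x002: 512,    # SQLITE_IOCAP_ATOMIC512
    0x004: 1024,   # SQLITE_IOCAP_ATOMIC1K
    0x008: 2048,   # SQLITE_IOCAP_ATOMIC2K
    0x010: 4096,   # SQLITE_IOCAP_ATOMIC4K
    0x020: 8192,   # SQLITE_IOCAP_ATOMIC8K
    0x040: 16384,  # SQLITE_IOCAP_ATOMIC16K
    0x080: 32768,  # SQLITE_IOCAP_ATOMIC32K
    0x100: 65536,  # SQLITE_IOCAP_ATOMIC64K
}
SQLITE_IOCAP_SAFE_APPEND = 0x200
SQLITE_IOCAP_SEQUENTIAL = 0x400


def describe_characteristics(flags):
    """Translate a device-characteristics bit mask into the three
    properties discussed in this section."""
    if flags & SQLITE_IOCAP_ATOMIC:
        # Simplification: report every block size listed above.
        sizes = sorted(ATOMIC_BLOCK_FLAGS.values())
    else:
        sizes = [size for bit, size in sorted(ATOMIC_BLOCK_FLAGS.items())
                 if flags & bit]
    return {
        "atomic_write_sizes": sizes,
        "safe_append": bool(flags & SQLITE_IOCAP_SAFE_APPEND),
        "sequential_write": bool(flags & SQLITE_IOCAP_SEQUENTIAL),
    }
```

A mask of zero yields no atomic sizes and both boolean properties false, which corresponds to the pessimistic default model described earlier in this section.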

413

<h3 id=fs_assumption_details>Failure Related Assumption Details</h3>

414

415

This section describes how the assumptions presented in the parent

416

section apply to the individual API functions and operations provided

417

by the VFS to SQLite for the purposes of modifying the contents of the

418

file-system.

419

420

SQLite manipulates the contents of the file-system using a combination

421

of the following four types of operation:

422

<ul>

423

<li> Create file operations. SQLite may create new files
within the file-system by invoking the xOpen() method of
the sqlite3_vfs object.
<li> Delete file operations. SQLite may remove files from the
file system by calling the xDelete() method of the
sqlite3_vfs object.
<li> Truncate file operations. SQLite may truncate existing
files by invoking the xTruncate() method of the sqlite3_file
object.
<li> Write file operations. SQLite may modify the contents
and increase the size of a file by invoking the xWrite()
method of the sqlite3_file object.

435

</ul>

436

437

Additionally, all VFS implementations are required to provide the

438

sync file operation, accessed via the xSync() method of the

439

sqlite3_file object, used to flush create, write and truncate operations

440

on a file to the persistent storage medium.
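The relationship between a write and a sync can be illustrated with ordinary operating system calls. The Python sketch below is an analogy only: it uses `os.write()` and `os.fsync()` directly rather than the VFS xWrite() and xSync() methods, and the names `write_and_sync` and `demo` are hypothetical.

```python
import os
import tempfile


def write_and_sync(path, data):
    """Create (or truncate) a file, write data and flush it to storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write() may write fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
        # Until fsync() returns, the data may exist only in volatile
        # buffers and could be lost in a system failure.
        os.fsync(fd)
    finally:
        os.close(fd)


def demo():
    """Write a small file durably and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.db")
        write_and_sync(path, b"page data")
        with open(path, "rb") as f:
            return f.read()
```

In the terms of this document, everything written before the `os.fsync()` call is in the transient state, and only after the call returns successfully is it assumed to be on persistent media.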

441

442

The formalized assumptions in this section refer to system failure
events. In this context, a system failure is any failure that
causes the system to stop operating, for example a power failure or
operating system crash.

446

447

SQLite does not assume that a create file operation has actually

448

modified the file-system records within persistent storage until

449

after the file has been successfully synced.

450

451

If a system failure occurs during or after a "create file"

452

operation, but before the created file has been synced, then

453

SQLite assumes that it is possible that the created file may not

454

exist following system recovery.

455

456

Of course, it is also possible that it does exist following system

457

recovery.

458

459

If a "create file" operation is executed by SQLite, and then the

460

created file synced, then SQLite assumes that the file-system

461

modifications corresponding to the "create file" operation have been

462

committed to persistent media. It is assumed that if a system

463

failure occurs any time after the file has been successfully

464

synced, then the file is guaranteed to appear in the file-system

465

following system recovery.

466

467

A delete file operation (invoked by a call to the VFS xDelete()

468

method) is assumed to be an atomic and durable operation.

469

470

471

If a system failure occurs at any time after a "delete file"

472

operation (call to the VFS xDelete() method) returns successfully, it is

473

assumed that the file-system will not contain the deleted file following

474

system recovery.

475

476

If a system failure occurs during a "delete file" operation,

477

it is assumed that following system recovery the file-system will

478

either contain the file being deleted in the state it was in before

479

the operation was attempted, or not contain the file at all. It is

480

assumed that it is not possible for the file to have become corrupted

481

purely as a result of a failure occurring during a "delete file"

482

operation.

483

484

The effects of a truncate file operation are not assumed to

485

be made persistent until after the corresponding file has been

486

synced.

487

488

If a system failure occurs during or after a "truncate file"
operation, but before the truncated file has been synced, then
SQLite assumes that the size of the truncated file is at least as
large as the size that it was to be truncated to.

492

493

If a system failure occurs during or after a "truncate file"

494

operation, but before the truncated file has been synced, then

495

it is assumed that the contents of the file up to the size that the

496

file was to be truncated to are not corrupted.

497

498

The above two assumptions may be interpreted to mean that if a
system failure occurs after file truncation but before the truncated
file is synced, the contents of the file following the point
at which it was to be truncated cannot be trusted. They may contain
the original file data, or may contain garbage.

503

504

If a "truncate file" operation is executed by SQLite, and then the

505

truncated file synced, then SQLite assumes that the file-system

506

modifications corresponding to the "truncate file" operation have been

507

committed to persistent media. It is assumed that if a system

508

failure occurs any time after the file has been successfully

509

synced, then the effects of the file truncation are guaranteed

510

to appear in the file system following recovery.

511

512

A write file operation modifies the contents of an existing file

513

within the file-system. It may also increase the size of the file.

514

The effects of a write file operation are not assumed to

515

be made persistent until after the corresponding file has been

516

synced.

517

518

If a system failure occurs during or after a "write file"

519

operation, but before the corresponding file has been synced,

520

then it is assumed that the content of all sectors spanned by the

521

write file operation are untrustworthy following system

522

recovery. This includes regions of the sectors that were not

523

actually modified by the write file operation.

524

525

If a system failure occurs on a system that supports the
atomic-write property for blocks of size N bytes
following an aligned write of N
bytes to a file but before the file has been successfully synced,
then it is assumed following recovery that all sectors spanned by the
write operation were correctly updated, or that none of the sectors were
modified at all.

532

533

If a system failure occurs on a system that supports the
safe-append property following a write operation that appends data
to the end of the file without modifying any of the existing file
content but before the file has been successfully synced,
then it is assumed following recovery that either the data was
correctly appended to the file, or that the file size remains
unchanged. It is assumed that it is impossible for the file to be
extended but populated with incorrect data.

541

542

Following a system recovery, if a device sector is deemed to be
untrustworthy as defined by A21008 and neither A21011 nor A21012
applies to the range of bytes written, then no assumption can be
made about the content of the sector following recovery. It is
assumed that it is possible for such a sector to be written
correctly, not written at all, populated with garbage data or any
combination thereof.

549

550

If a system failure occurs during or after a "write file"
operation that causes the file to grow, but before the corresponding
file has been synced, then it is assumed that the size of
the file following recovery is at least as large as it was when
it was most recently synced.

555

556

If a system supports the sequential-write property, then further

557

assumptions may be made with respect to the state of the file-system

558

following recovery from a system failure. Specifically, it is

559

assumed that create, truncate, delete and write file operations are

560

applied to the persistent representation in the same order as they

561

are performed by SQLite. Furthermore, it is assumed that the

562

file-system waits until one operation is safely written to the

563

persistent media before the next is attempted, just as if the relevant

564

file were synced following each operation.

565

566

If a system failure occurs on a system that supports the

567

sequential-write property, then it is assumed that all

568

operations completed before the last time any file was synced

569

have been successfully committed to persistent media.

570

571

If a system failure occurs on a system that supports the

572

sequential-write property, then it is assumed that the set

573

of possible states that the file-system may be in following recovery

574

is the same as if each of the write operations performed since the most

575

recent time a file was synced was itself followed by a sync

576

file operation, and that the system failure may have occurred during

577

any of the write or sync file operations.
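One way to picture the sequential-write model is to enumerate the states it allows. The Python sketch below takes the operations performed since the last sync, in the order they were issued, and lists every outcome permitted by the model: some prefix has completed, at most one operation is left in the transient state, and the remainder had no effect. The function name `possible_outcomes` and the dictionary keys are illustrative assumptions.

```python
def possible_outcomes(unsynced_ops):
    """List the post-failure states permitted by the sequential-write
    model for operations issued since the last sync, oldest first."""
    ops = list(unsynced_ops)
    outcomes = []
    for n in range(len(ops) + 1):
        outcomes.append({
            "completed": ops[:n],        # safely on persistent media
            "transient": ops[n:n + 1],   # at most one, contents unknown
            "not_applied": ops[n + 1:],  # no effect on the file-system
        })
    return outcomes
```

For k unsynced operations this yields k + 1 candidate states, far fewer than under the default model, in which every sector written since the last sync may independently be corrupt.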

578

<!--

579

580

The return value of the xSectorSize() method, the sector-size, is

581

expected by SQLite to be a power of 2 value greater than or equal to 512.

582

583

What does it do if this is not the case? If the sector size is less

584

than 512 then 512 is used instead. How about a non power-of-two value?

585

UPDATE: How this situation is handled should be described in the API

586

requirements. Here we can just refer to the other document.

587

588

SQLite assumes that files are stored and written to within the

589

file-system as a collection of blocks (hereafter sectors) of data, each

590

sector-size bytes in size. This model is used to derive

591

the following assumptions related to the expected state of the

592

file-system following a power failure or operating system crash.

593

<ul>

594

<li>

595

After part or all of a file sector has been modified

596

using the xWrite() method of an open file-handle, the sector

597

is said to be in a transient state, where the operating system

598

makes no guarantees about the actual content of the sector on the

599

persistent media. The sector remains in the transient state until

600

the next successful call to xSync() on the same file-handle

601

returns. If a power failure or operating system crash occurs, then

602

part or all of all sectors in the transient state when the crash

603

occurred may contain invalid data following system recovery.

604

<li>

605

Following a power failure or operating system crash, the content

606

of all sectors that were not in a transient state when the crash

607

occurred may be trusted.

608

</ul>

609

610

What do we assume about the other three file-system write

611

operations - xTruncate(), xDelete() and "create file"?

612

613

The xDeviceCharacteristics() method returns a set of flags,

614

indicating which of the following properties (if any) the

615

file-system provides:

616

<ul>

617

<li>The sequential IO property. If a file-system has this

618

property, then in the event of a crash at most a single sector

619

may contain invalid data. The file-system guarantees

620

<li>The safe-append property.

621

<li>The atomic write property.

622

</ul>

623

624

Write an explanation as to how the file-system properties influence

625

the model used to predict file damage after a catastrophe.

626

-->

627

<h1 id=database_connections>Database Connections</h1>

628

629

Within this document, the term database connection has a slightly

630

different meaning from that which one might assume. The handles returned

631

by the <code>sqlite3_open()</code> and <code>sqlite3_open16()</code>

632

APIs (reference) are referred to as database

633

handles. A database connection is a connection to a single

634

database file using a single file-handle, which is held open for the

635

lifetime of the connection. Using the SQL ATTACH syntax, multiple

636

database connections may be accessed via a single database

637

handle. Or, using SQLite's shared-cache mode feature, multiple

638

database handles may access a single database connection.

639

640

Usually, a new database connection is opened whenever the user opens
a new database handle on a real database file (not an in-memory
database) or when a database file is attached to an existing database
connection using the SQL ATTACH syntax. However, if the shared-cache
mode feature is enabled, then the database file may be accessed through
an existing database connection. For more information on
shared-cache mode, refer to Reference. The
various IO operations required to open a new connection are detailed in
section <cite>open_new_connection</cite> of this document.

649

650

Similarly, a database connection is usually closed when the user

651

closes a database handle that is open on a real database file or

652

has had one or more real database files attached to it using the ATTACH

653

mechanism, or when a real database file is detached from a database

654

connection using the DETACH syntax. Again, the exception is if

655

shared-cache mode is enabled. In this case, a database

656

connection is not closed until its number of users reaches zero.

657

The IO related steps required to close a database connection are

658

described in section <cite>closing_database_connection</cite>.

659

660

After sections 4 and 5 are finished, come back here and see if we can add a
list of state items associated with each database connection to make things
easier to understand, i.e. each database connection has a file handle, a set
of entries in the page cache, an expected page size etc.

664

<h2 id=open_new_connection>Opening a New Connection</h2>

665

666

This section describes the VFS operations that take place when a

667

new database connection is created.

668

669

Opening a new database connection is a two-step process:

670

<ol>

671

<li> A file-handle is opened on the database file.

672

<li> If step 1 was successful, an attempt is made to read the

673

database file header from the database file using the

674

new file-handle.

675

</ol>

676

677

In step 2 of the procedure above, the database file is not locked

678

before it is read from. This is the only exception to the locking

679

rules described in section <cite>reading_data</cite>.

680

681

The reason for attempting to read the database file header

682

is to determine the page-size used by the database file.

683

Because it is not possible to be certain as to the page-size

684

without holding at least a shared lock on the database file

685

(because some other database connection might have changed it

686

since the database file header was read), the value read from the

687

database file header is known as the expected page size.

688

689

When a new database connection is required, SQLite shall attempt
to open a file-handle on the database file. If the attempt fails, then
no new database connection is created and an error is returned.

692

693

When a new database connection is required, after opening the
new file-handle, SQLite shall attempt to read the first 100 bytes
of the database file. If the attempt fails for any reason other than
that the opened file is less than 100 bytes in size, then
the file-handle is closed, no new database connection is created
and an error is returned instead.

699

700

If the database file header is successfully read from a newly
opened database file, the connection's expected page-size shall
be set to the value stored in the page-size field of the
database header.

704

705

If the database file header cannot be read from a newly opened
database file (because the file is less than 100 bytes in size), the
connection's expected page-size shall be set to the compile time
value of the SQLITE_DEFAULT_PAGE_SIZE option.
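The rules above for choosing the expected page size can be sketched as follows. The sketch relies on two details taken from the SQLite file format documentation rather than from this document: the page-size field is a 2-byte big-endian integer at byte offset 16 of the header, and a stored value of 1 denotes a page size of 65536. The function name and the `DEFAULT_PAGE_SIZE` constant are stand-ins, and the value 1024 is merely an assumed placeholder for the compile-time default.

```python
import struct

HEADER_SIZE = 100         # size of the database file header in bytes
PAGE_SIZE_OFFSET = 16     # offset of the page-size field in the header
DEFAULT_PAGE_SIZE = 1024  # assumed stand-in for the compile-time default


def expected_page_size(first_bytes, default=DEFAULT_PAGE_SIZE):
    """Return the expected page size for a newly opened database file.

    `first_bytes` is whatever was read from the start of the file; it is
    shorter than 100 bytes if the file itself is shorter than that.
    """
    if len(first_bytes) < HEADER_SIZE:
        return default
    (raw,) = struct.unpack_from(">H", first_bytes, PAGE_SIZE_OFFSET)
    return 65536 if raw == 1 else raw
```

The result is only an expectation: as noted earlier in this section, the header is read without a lock, so the value must be checked again once a shared lock is held.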

709

<h2 id=closing_database_connection>Closing a Connection</h2>

710

711

This section describes the VFS operations that take place when an

712

existing database connection is closed (destroyed).

713

714

Closing a database connection is a simple matter. The open VFS

715

file-handle is closed and in-memory page cache related resources

716

are released.

717

718

When a database connection is closed, SQLite shall close the

719

associated file handle at the VFS level.

720

721

When a database connection is closed, all associated page

722

cache entries shall be discarded.

723

<h1 id=page_cache>The Page Cache</h1>

724

725

The contents of an SQLite database file are formatted as a set of

726

fixed size pages. See <cite>ff_sqlitert_requirements</cite> for a

727

complete description of the format used. The page size used

728

for a particular database is stored as part of the database file

729

header at a well-known offset within the first 100 bytes of the

730

file. Almost all read and write operations performed by SQLite on

731

database files are done on blocks of data page-size bytes

732

in size.

733

734

All SQLite database connections running within a single process share

735

a single page cache. The page cache caches data read from

736

database files in main-memory on a per-page basis. When SQLite requires

737

data from a database file to satisfy a database query, it checks the

738

page cache for usable cached versions of the required database

739

pages before loading them from the database file. If no usable cache

740

entry can be found and the database page data is loaded from the database

741

file, it is cached in the page cache in case the same data is

742

needed again later. Because reading from the database file is assumed to

743

be an order of magnitude slower than reading from main-memory, caching

744

database page content in the page cache to minimize the number

745

of read operations performed on the database file is a significant

746

performance enhancement.

747

748

The page cache is also used to buffer database write operations.

749

When SQLite is required to modify one or more of the database pages

750

that make up a database file, it first modifies the cached version of

751

the page in the page cache. At that point the page is considered

752

a "dirty" page. At some point later on, the new content of the "dirty"

753

page is copied from the page cache into the database file via

754

the VFS interface. Buffering writes in the page cache can reduce

755

the number of write operations required on the database file (in cases

756

where the same page is updated twice) and allows optimizations based

757

on the assumptions outlined in section <cite>fs_performance</cite>.

758

759

Database read and write operations, and the way in which they interact

760

with and use the page cache, are described in detail in sections

761

<cite>reading_data</cite> and <cite>writing_data</cite> of this document,

762

respectively.

763

764

At any one time, the page cache contains zero or more page cache

765

entries, each of which has the following data associated with it:

766

<ul>

767

<li>

768

A reference to the associated database connection. Each

769

entry in the page cache is associated with a single database

770

connection; the database connection that created the entry.

771

A page cache entry is only ever used by the database

772

connection that created it. Page cache entries are not shared between

773

database connections.

774

<li>

775

The page number of the cached page. Pages are sequentially

776

numbered within a database file starting from page 1 (page 1 begins at

777

byte offset 0). Refer to <cite>ff_sqlitert_requirements</cite> for

778

details.

779

<li>

780

The cached data; a blob of data page-size bytes in size.

781

</ul>

782

783

The first two elements in the list above, the associated database

784

connection and the page number, uniquely identify the

785

page cache entry. At no time may the page cache contain two

786

entries for which both the database connection and page

787

number are identical. Or, put another way, a single database

788

connection never caches more than one copy of a database page

789

within the page cache.
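The identification rule above can be illustrated with a small C sketch. The CacheEntry structure and the cache_lookup() function are hypothetical names introduced for illustration, and a linear search over an array stands in for whatever data structure an implementation actually uses.

```c
#include <stddef.h>
#include <stdint.h>

/* One page cache entry, identified by (connection, page number). */
typedef struct CacheEntry {
  const void *conn;     /* owning database connection (opaque here) */
  uint32_t pgno;        /* page number, starting from 1 */
  unsigned char *data;  /* page-size bytes of cached content */
} CacheEntry;

/*
** Return the entry created by connection conn for page pgno, or NULL
** if that connection has not cached the page. Entries belonging to
** other connections are never returned.
*/
CacheEntry *cache_lookup(CacheEntry *entries, size_t n,
                         const void *conn, uint32_t pgno){
  for(size_t i = 0; i < n; i++){
    if( entries[i].conn == conn && entries[i].pgno == pgno ){
      return &entries[i];
    }
  }
  return NULL;
}
```

Because the pair (connection, page number) is unique, the lookup can stop at the first match.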

790

791

At any one time, each page cache entry may be said to be a clean

792

page, a non-writable dirty page or a writable dirty page,

793

according to the following definitions:

794

<ul>

795

<li> A clean page is one for which the cached data

796

currently matches the contents of the corresponding page of

797

the database file. The page has not been modified since it was

798

loaded from the file.

799

<li> A dirty page is a page cache entry for which

800

the cached data has been modified since it was loaded from the database

801

file, and so no longer matches the current contents of the

802

corresponding database file page. A dirty page is one that is

803

currently buffering a modification made to the database file as part

804

of a write transaction.

805

<li> Within this document, the term non-writable dirty

806

page is used specifically to refer to a page cache

807

entry whose modified content cannot yet safely be written to
the database file. It is not safe to update a database file with

809

a buffered write if a power or system failure that occurs during or

810

soon after the update may cause the database to become corrupt

811

following system recovery, according to the assumptions made in

812

section <cite>fs_assumption_details</cite>.

813

<li> A dirty page whose modified contents could be written to the
corresponding database file page without risking database corruption
is known as a writable dirty page.

817

</ul>
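The definitions above amount to a three-way classification, sketched below in C. The names PageState and classify_page() are hypothetical. The safe_to_write flag is an assumption standing in for the safety test that this document defers to another section.

```c
/* The three states a page cache entry may be in. */
typedef enum {
  PAGE_CLEAN,               /* matches the database file */
  PAGE_DIRTY_NON_WRITABLE,  /* modified, not yet safe to write out */
  PAGE_DIRTY_WRITABLE       /* modified, safe to write out */
} PageState;

/*
** Classify an entry from two facts: whether its content was modified
** since it was loaded, and whether writing it to the database file
** now would be safe in the event of a power or system failure.
*/
PageState classify_page(int modified, int safe_to_write){
  if( !modified ) return PAGE_CLEAN;
  return safe_to_write ? PAGE_DIRTY_WRITABLE : PAGE_DIRTY_NON_WRITABLE;
}
```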

818

819

The exact logic used to determine if a page cache entry with

820

modified content is a non-writable dirty page or a writable dirty page is

821

presented in section <cite>page_cache_algorithms</cite>.

822

823

Because main-memory is a limited resource, the page cache cannot

824

be allowed to grow indefinitely. As a result, unless all database files

825

opened by database connections within the process are quite small,

826

sometimes data must be discarded from the page cache. In practice

827

this means page cache entries must be purged to make room

828

for new ones. If a page cache entry being removed from the page

829

cache to free main-memory is a dirty page, then its contents

830

must be saved into the database file before it can be discarded without

831

data loss. The following two sub-sections describe the algorithms used by

832

the page cache to determine exactly when existing page cache

833

entries are purged (discarded).

834

<h2>Page Cache Configuration</h2>

835

836

Describe the parameters set to configure the page cache limits.

837

<h2 id=page_cache_algorithms>Page Cache Algorithms</h2>

838

839

Requirements describing the way in which the configuration parameters

840

are used. About LRU etc.

841

<h1 id=reading_data>Reading Data</h1>

842

843

In order to return data from the database to the user, for example as

844

the results of a SELECT query, SQLite must at some point read data

845

from the database file. Usually, data is read from the database file in

846

aligned blocks of page-size bytes. The exception is when the

847

database file header fields are being inspected, before the

848

page-size used by the database can be known.

849

850

With two exceptions, a database connection must have an open

851

transaction (either a read-only transaction or a

852

read/write transaction) on the database before data may be

853

read from the database file.

854

855

The two exceptions are:

856

<ul>

857

<li> When an attempt is made to read the 100 byte database file

858

header immediately after opening the database connection

859

(see section <cite>open_new_connection</cite>). When this occurs

860

no lock is held on the database file.

861

<li> Data read while in the process of opening a read-only transaction

862

(see section <cite>open_read_only_trans</cite>). These read

863

operations occur after a shared lock is held on the database

864

file.

865

</ul>

866

867

Once a transaction has been opened, reading data from a database

868

connection is a simple operation. Using the xRead() method of the

869

file-handle open on the database file, the required database file

870

pages are read one at a time. SQLite never reads partial pages and

871

always uses a single call to xRead() for each required page.
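Because pages are numbered from 1 and page 1 begins at byte offset 0, the file offset of each page-sized read follows directly from the page number. The helper below is a minimal sketch. The name page_offset() is hypothetical, and the caller is assumed to pass a page number of at least 1.

```c
#include <stdint.h>

/*
** Return the byte offset within the database file at which page pgno
** begins. Pages are numbered from 1, so page 1 maps to offset 0.
** The caller must ensure that pgno is at least 1.
*/
int64_t page_offset(uint32_t pgno, uint32_t page_size){
  return (int64_t)(pgno - 1) * (int64_t)page_size;
}
```

A single read of page_size bytes at this offset then retrieves exactly one aligned page.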

872

873

After reading the data for a database page, SQLite stores the raw

874

page of data in the page cache. Each time a page of data is

875

required by the upper layers, the page cache is queried

876

to see if it contains a copy of the required page stored by

877

the current database connection. If such an entry can be

878

found, then the required data is read from the page cache instead

879

of the database file. Only a connection with an open transaction

880

(either a read-only transaction or a

881

read/write transaction) on the database may read data from the

882

page cache. In this sense reading from the page cache is no

883

different to reading from the database file.

884

885

Refer to section <cite>page_cache_algorithms</cite> for a description

886

of exactly how and for how long page data is stored in the

887

page cache.

888

889

Except for the read operation required by H35070 and those reads made

890

as part of opening a read-only transaction, SQLite shall ensure that

891

a database connection has an open read-only or read/write

892

transaction when any data is read from the database file.

893

894

Aside from those read operations described by H35070 and H21XXX, SQLite

895

shall read data from the database file in aligned blocks of

896

page-size bytes, where page-size is the database page size

897

used by the database file.

898

899

SQLite shall ensure that a database connection has an open

900

read-only or read/write transaction before using data stored in the page

901

cache to satisfy user queries.

902

<h2 id=open_read_only_trans>Opening a Read-Only Transaction</h2>

903

904

Before data may be read from a database file or queried from

905

the page cache, a read-only transaction must be

906

successfully opened by the associated database connection (this is true

907

even if the connection will eventually write to the database, as a

908

read/write transaction may only be opened by upgrading from a

909

read-only transaction). This section describes the procedure

910

for opening a read-only transaction.

911

912

The key element of a read-only transaction is that the

913

file-handle open on the database file obtains and holds a

914

shared-lock on the database file. Because a connection requires

915

an exclusive-lock before it may actually modify the contents

916

of the database file, and by definition while one connection is holding

917

a shared-lock no other connection may hold an

918

exclusive-lock, holding a shared-lock guarantees that

919

no other process may modify the database file while the read-only

920

transaction remains open. This ensures that read-only

921

transactions are sufficiently isolated from the transactions of

922

other database users (see section <cite>overview</cite>).

923

Obtaining the shared lock itself on the database file is quite

924

simple: SQLite just calls the xLock() method of the database file

925

handle. Some of the other processes that take place as part of

926

opening the read-only transaction are quite complex. The

927

steps that SQLite is required to take to open a read-only

928

transaction, in the order in which they must occur, are as follows:

929

<ol>

930

<li>A shared-lock is obtained on the database file.

931

<li>The connection checks if a hot journal file exists in the

932

file-system. If one does, then it is rolled back before continuing.

933

<li>The connection checks if the data in the page cache may

934

still be trusted. If not, all page cache data is discarded.

935

<li>If the file-size is not zero bytes and the page cache does not

936

contain valid data for the first page of the database, then the

937

data for the first page must be read from the database.

938

</ol>
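The ordering of the four steps, and the rule that the shared lock is released only if it was obtained, can be sketched in C as follows. This is a sketch under stated assumptions, not SQLite's implementation. The ReadTxnOps structure, the open_read_txn() function and the Mock helpers are all hypothetical. Each step is modelled as a callback returning 0 on success, and the fourth callback is assumed to do nothing when the file is empty or page 1 is already cached.

```c
/* A step returns 0 on success or a non-zero error code. */
typedef int (*StepFn)(void *ctx);

typedef struct ReadTxnOps {
  StepFn lock_shared;           /* step 1: obtain the shared lock */
  StepFn rollback_hot_journal;  /* step 2: detect and roll back a hot journal */
  StepFn validate_cache;        /* step 3: discard untrusted cache entries */
  StepFn load_page_one;         /* step 4: read page 1 if required */
  void (*unlock)(void *ctx);    /* release the shared lock */
} ReadTxnOps;

/* Run the steps in order, stopping at the first error. */
int open_read_txn(const ReadTxnOps *ops, void *ctx){
  int rc = ops->lock_shared(ctx);
  if( rc != 0 ) return rc;  /* lock never obtained: nothing to release */
  rc = ops->rollback_hot_journal(ctx);
  if( rc == 0 ) rc = ops->validate_cache(ctx);
  if( rc == 0 ) rc = ops->load_page_one(ctx);
  if( rc != 0 ) ops->unlock(ctx);  /* error after locking: release the lock */
  return rc;
}

/* Mock for demonstration: each step returns the next preset code. */
typedef struct Mock {
  int rc[4];     /* result codes for steps 1 to 4 */
  int next;      /* index of the next step to run */
  int unlocked;  /* set to 1 when the lock is released */
} Mock;

int mock_step(void *p){
  Mock *m = p;
  return m->rc[m->next++];
}

void mock_unlock(void *p){
  Mock *m = p;
  m->unlocked = 1;
}
```

With the mock, a failure in step 3 stops the procedure before step 4 runs and releases the lock, whereas a failure in step 1 returns without calling the unlock callback.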

939

940

Of course, an error may occur while attempting any of the 4 steps

941

enumerated above. If this happens, then the shared-lock is

942

released (if it was obtained) and an error returned to the user.

943

Step 2 of the procedure above is described in more detail in section

944

<cite>hot_journal_detection</cite>. Section <cite>cache_validation</cite>

945

describes the process identified by step 3 above. Further detail

946

on step 4 may be found in section <cite>read_page_one</cite>.

947

948

When required to open a read-only transaction using a

949

database connection, SQLite shall first attempt to obtain

950

a shared-lock on the file-handle open on the database file.

951

952

If, while opening a read-only transaction, SQLite fails to obtain

953

the shared-lock on the database file, then the process is

954

abandoned, no transaction is opened and an error returned to the user.

955

956

The most common reason an attempt to obtain a shared-lock may

957

fail is that some other connection is holding an exclusive or

958

pending lock. However it may also fail because some other

959

error (e.g. an IO or comms related error) occurs within the call to the

960

xLock() method.

961

962

While opening a read-only transaction, after successfully

963

obtaining a shared lock on the database file, SQLite shall

964

attempt to detect and roll back a hot journal file associated

965

with the same database file.

966

967

If, while opening a read-only transaction, SQLite encounters

968

an error while attempting to detect or roll back a hot journal

969

file, then the shared-lock on the database file is released,

970

no transaction is opened and an error returned to the user.

971

972

Section <cite>hot_journal_detection</cite> contains a description of

973

and requirements governing the detection of a hot-journal file referred

974

to in the above requirements.

975

976

Assuming no errors have occurred, then after attempting to detect and

977

roll back a hot journal file, if the page cache contains

978

any entries associated with the current database connection,

979

then SQLite shall validate the contents of the page cache by

980

testing the file change counter. This procedure is known as

981

cache validation.

982

983

The cache validation process is described in detail in section

984

<cite>cache_validation</cite>.

985

986

If the cache validation procedure prescribed by H35040 is required and

987

does not prove that the page cache entries associated with the

988

current database connection are valid, then SQLite shall discard

989

all entries associated with the current database connection from

990

the page cache.

991

992

The numbered list above notes that the data for the first page of the

993

database file, if it exists and is not already loaded into the page

994

cache, must be read from the database file before the read-only

995

transaction may be considered opened. This is handled by

996

requirement H35240.

997

<h3 id=hot_journal_detection>Hot Journal Detection</h3>

998

999

This section describes the procedure that SQLite uses to detect a

1000

hot journal file. If a hot journal file is detected,

1001

this indicates that at some point the process of writing a

1002

transaction to the database was interrupted and a recovery operation

1003

(hot journal rollback) needs to take place. This section does

1004

not describe the process of hot journal rollback (see section

1005

<cite>hot_journal_rollback</cite>) or the processes by which a

1006

hot journal file may be created (see section

1007

<cite>writing_data</cite>).

1008

1009

The procedure used to detect a hot-journal file is quite

1010

complex. The following steps take place:

<ol>

1012

<li>Using the VFS xAccess() method, SQLite queries the file-system

1013

to see if the journal file associated with the database exists.

1014

If it does not, then there is no hot-journal file.

1015

<li>By invoking the xCheckReservedLock() method of the file-handle

1016

opened on the database file, SQLite checks if some other connection

1017

holds a reserved lock or greater. If some other connection

1018

does hold a reserved lock, this indicates that the other

1019

connection is midway through a read/write transaction (see

1020

section <cite>writing_data</cite>). In this case the

1021

journal file is not a hot-journal and must not be

1022

rolled back.

1023

<li>Using the xFileSize() method of the file-handle opened

1024

on the database file, SQLite checks if the database file is

1025

0 bytes in size. If it is, the journal file is not considered

1026

to be a hot journal file. Instead of rolling back the

1027

journal file, in this case it is deleted from the file-system

1028

by calling the VFS xDelete() method. Technically,

1029

there is a race condition here. This step should be moved to

1030

after the exclusive lock is held.

1031

<li>An attempt is made to upgrade to an exclusive lock on the

1032

database file. If the attempt fails, then all locks, including

1033

the recently obtained shared lock, are dropped. The attempt

1034

to open a read-only transaction has failed. This occurs

1035

when some other connection is also attempting to open a

1036

read-only transaction and the attempt to gain the

1037

exclusive lock fails because the other connection is also

1038

holding a shared lock. It is left to the other connection

1039

to roll back the hot journal.

1040

1041

It is important that the file-handle lock is upgraded

1042

directly from shared to exclusive in this case,

1043

instead of first upgrading to reserved or pending

1044

locks as is required when obtaining an exclusive lock to

1045

write to the database file (section <cite>writing_data</cite>).

1046

If SQLite were to first upgrade to a reserved or

1047

pending lock in this scenario, then a second process also

1048

trying to open a read-transaction on the database file might

1049

detect the reserved lock in step 2 of this process,

1050

conclude that there was no hot journal, and commence

1051

reading data from the database file.

1052

<li>The xAccess() method is invoked again to detect if the journal

1053

file is still in the file system. If it is, then it is a

1054

hot-journal file and SQLite tries to roll it back (see section

1055

<cite>rollback</cite>).

1056

</ol>
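The decision logic of the detection procedure can be summarized as a pure function over the facts gathered at each step. The C sketch below is illustrative only. The HotJournalFacts structure, the HotJournalResult enumeration and the detect_hot_journal() function are hypothetical names, and the VFS calls themselves are not modelled. Treating a journal that has disappeared by step 5 as a failed attempt is based on the requirements that follow in this section, not on the list itself.

```c
typedef enum {
  HJ_NONE,      /* no hot journal: continue opening the transaction */
  HJ_DELETE,    /* database file is 0 bytes: delete the journal instead */
  HJ_FAIL,      /* the attempt to open the transaction fails */
  HJ_ROLLBACK   /* the journal is hot and must be rolled back */
} HotJournalResult;

typedef struct HotJournalFacts {
  int journal_exists;        /* step 1: first xAccess() result */
  int other_has_reserved;    /* step 2: xCheckReservedLock() result */
  int db_size_is_zero;       /* step 3: xFileSize() reported 0 bytes */
  int got_exclusive;         /* step 4: shared-to-exclusive upgrade succeeded */
  int journal_still_exists;  /* step 5: second xAccess() result */
} HotJournalFacts;

/* Apply the five steps in order; later facts are ignored once decided. */
HotJournalResult detect_hot_journal(const HotJournalFacts *f){
  if( !f->journal_exists ) return HJ_NONE;          /* step 1 */
  if( f->other_has_reserved ) return HJ_NONE;       /* step 2 */
  if( f->db_size_is_zero ) return HJ_DELETE;        /* step 3 */
  if( !f->got_exclusive ) return HJ_FAIL;           /* step 4 */
  if( !f->journal_still_exists ) return HJ_FAIL;    /* step 5: journal gone */
  return HJ_ROLLBACK;                               /* step 5: journal is hot */
}
```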

1057

Master journal file pointers?

1058

1059

The following requirements describe step 1 of the above procedure in

1060

more detail.

1061

1062

When required to attempt to detect a hot-journal file, SQLite

1063

shall first use the xAccess() method of the VFS layer to check if a

1064

journal file exists in the file-system.

1065

1066

If the call to xAccess() required by H35140 fails (due to an IO error or

1067

similar), then SQLite shall abandon the attempt to open a read-only

1068

transaction, relinquish the shared lock held on the database

1069

file and return an error to the user.

1070

1071

When required to attempt to detect a hot-journal file, if the

1072

call to xAccess() required by H35140 indicates that a journal file does

1073

not exist, then SQLite shall conclude that there is no hot-journal

1074

file in the file system and therefore that no hot journal

1075

rollback is required.

1076

1077

The following requirements describe step 2 of the above procedure in

1078

more detail.

1079

1080

When required to attempt to detect a hot-journal file, if the

1081

call to xAccess() required by H35140 indicates that a journal file

1082

is present, then the xCheckReservedLock() method of the database file

1083

file-handle is invoked to determine whether or not some other

1084

process is holding a reserved or greater lock on the database

1085

file.

1086

1087

If the call to xCheckReservedLock() required by H35160 fails (due to an

1088

IO or other internal VFS error), then SQLite shall abandon the attempt

1089

to open a read-only transaction, relinquish the shared lock

1090

held on the database file and return an error to the user.

1091

1092

If the call to xCheckReservedLock() required by H35160 indicates that

1093

some other database connection is holding a reserved

1094

or greater lock on the database file, then SQLite shall conclude that

1095

there is no hot journal file. In this case the attempt to detect

1096

a hot journal file is concluded.

1097

1098

The following requirements describe step 3 of the above procedure in

1099

more detail.

1100

1101

If while attempting to detect a hot-journal file the call to

1102

xCheckReservedLock() indicates that no process holds a reserved

1103

or greater lock on the database file, then SQLite shall open

1104

a file handle on the potentially hot journal file using the VFS xOpen()

1105

method.

1106

1107

If the call to xOpen() required by H35440 fails (due to an IO or other

1108

internal VFS error), then SQLite shall abandon the attempt to open a

1109

read-only transaction, relinquish the shared lock held on

1110

the database file and return an error to the user.

1111

1112

After successfully opening a file-handle on a potentially hot journal

1113

file, SQLite shall query the file for its size in bytes using the

1114

xFileSize() method of the open file handle.

1115

1116

If the call to xFileSize() required by H35450 fails (due to an IO or

1117

other internal VFS error), then SQLite shall abandon the attempt to open

1118

a read-only transaction, relinquish the shared lock held on

1119

the database file, close the file handle opened on the journal file and

1120

return an error to the user.

1121

1122

If the size of a potentially hot journal file is revealed to be zero

1123

bytes by a query required by H35450, then SQLite shall close the

1124

file handle opened on the journal file and delete the journal file using

1125

a call to the VFS xDelete() method. In this case SQLite shall conclude

1126

that there is no hot journal file.

1127

1128

If the call to xDelete() required by H35450 fails (due to an IO or

1129

other internal VFS error), then SQLite shall abandon the attempt to open

1130

a read-only transaction, relinquish the shared lock held on

1131

the database file and return an error to the user.

1132

1133

The following requirements describe step 4 of the above procedure in

1134

more detail.

1135

1136

If the size of a potentially hot journal file is revealed to be greater

1137

than zero bytes by a query required by H35450, then SQLite shall attempt

1138

to upgrade the shared lock held by the database connection

1139

on the database file directly to an exclusive lock.

1140

1141

If an attempt to upgrade to an exclusive lock prescribed by

1142

H35470 fails for any reason, then SQLite shall release all locks held by

1143

the database connection and close the file handle opened on the

1144

journal file. The attempt to open a read-only transaction

1145

shall be deemed to have failed and an error returned to the user.

1146

1147

Finally, the following requirements describe step 5 of the above

1148

procedure in more detail.

1149

1150

If, as part of the hot journal file detection process, the

1151

attempt to upgrade to an exclusive lock mandated by H35470 is

1152

successful, then SQLite shall query the file-system using the xAccess()

1153

method of the VFS implementation to test whether or not the journal

1154

file is still present in the file-system.

1155

1156

If the call to xAccess() required by H35490 fails (due to an IO or

1157

other internal VFS error), then SQLite shall abandon the attempt to open

1158

a read-only transaction, relinquish the lock held on the

1159

database file, close the file handle opened on the journal file and

1160

return an error to the user.

1161

1162

If the call to xAccess() required by H35490 reveals that the journal

1163

file is no longer present in the file system, then SQLite shall abandon

1164

the attempt to open a read-only transaction, relinquish the

1165

lock held on the database file, close the file handle opened on the

1166

journal file and return an SQLITE_BUSY error to the user.

1167

1168

If the xAccess() query required by H35490 reveals that the journal

1169

file is still present in the file system, then SQLite shall conclude

1170

that the journal file is a hot journal file that needs to

1171

be rolled back. SQLite shall immediately begin hot journal

1172

rollback.

1173

<h3 id=cache_validation>Cache Validation</h3>

1174

1175

When a database connection opens a read transaction, the

1176

page cache may already contain data associated with the

1177

database connection. However, if another process has modified

1178

the database file since the cached pages were loaded it is possible that

1179

the cached data is invalid.

1180

1181

SQLite determines whether or not the page cache entries belonging

1182

to the database connection are valid or not using the file

1183

change counter, a field in the database file header. The

1184

file change counter is a 4-byte big-endian integer field stored

1185

starting at byte offset 24 of the database file header. Before the

1186

conclusion of a read/write transaction that modifies the contents

1187

of the database file in any way (see section <cite>writing_data</cite>),

1188

the value stored in the file change counter is incremented. When

1189

a database connection unlocks the database file, it stores the

1190

current value of the file change counter. Later, while opening a

1191

new read-only transaction, SQLite checks the value of the file

1192

change counter stored in the database file. If the value has not

1193

changed since the database file was unlocked, then the page cache

1194

entries can be trusted. If the value has changed, then the page

1195

cache entries cannot be trusted and all entries associated with

1196

the current database connection are discarded.

1197

1198

When a file-handle open on a database file is unlocked, if the

1199

page cache contains one or more entries belonging to the

1200

associated database connection, SQLite shall store the value

1201

of the file change counter internally.

1202

1203

When required to perform cache validation as part of opening

1204

a read transaction, SQLite shall read a 16 byte block

1205

starting at byte offset 24 of the database file using the xRead()

1206

method of the database connection's file handle.

1207

1208

Why a 16 byte block? Why not 4? (something to do with encrypted

1209

databases).

1210

1211

While performing cache validation, after loading the 16 byte

1212

block as required by H35190, SQLite shall compare the 32-bit big-endian

1213

integer stored in the first 4 bytes of the block to the most

1214

recently stored value of the file change counter (see H35180).

1215

If the values are not the same, then SQLite shall conclude that

1216

the contents of the cache are invalid.
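The comparison described above can be sketched in C as follows. The function names get_be32() and cache_is_valid() are hypothetical. The block is assumed to have already been read from byte offset 24 of the database file, and only its first 4 bytes are examined.

```c
#include <stdint.h>

#define CHANGE_COUNTER_OFFSET 24  /* byte offset of the counter in the header */
#define VALIDATION_BLOCK_SIZE 16  /* bytes read starting at that offset */

/* Decode a 4-byte big-endian unsigned integer. */
uint32_t get_be32(const unsigned char *p){
  return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
       | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
** block holds the 16 bytes read from offset CHANGE_COUNTER_OFFSET.
** stored_counter is the value saved when the file was last unlocked.
** Return 1 if the cache may be trusted, or 0 if it must be discarded.
*/
int cache_is_valid(const unsigned char block[VALIDATION_BLOCK_SIZE],
                   uint32_t stored_counter){
  return get_be32(block) == stored_counter;
}
```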

1217

1218

Requirement H35050 (section <cite>open_read_only_trans</cite>)

1219

specifies the action SQLite is required to take upon determining that

1220

the cache contents are invalid.

1221

<h3 id=read_page_one>Page 1 and the Expected Page Size</h3>

1222

1223

As the last step in opening a read transaction on a database

1224

file that is more than 0 bytes in size, SQLite is required to load

1225

data for page 1 of the database into the page cache, if it is

1226

not already there. This is slightly more complicated than it seems,

1227

as the database page-size is not known at this point.

1228

1229

Even though the database page-size cannot be known for sure,

1230

SQLite is usually able to guess correctly by assuming it to be equal to

1231

the connection's expected page size. The expected page size

1232

is the value of the page-size field read from the

1233

database file header while opening the database connection

1234

(see section <cite>open_new_connection</cite>), or the page-size

1235

of the database file when the most recent read transaction was concluded.

1236

1237

During the conclusion of a read transaction, before unlocking

1238

the database file, SQLite shall set the connection's

1239

expected page size to the current database page-size.

1240

1241

As part of opening a new read transaction, immediately after

1242

performing cache validation, if there is no data for database

1243

page 1 in the page cache, SQLite shall read N bytes from

1244

the start of the database file using the xRead() method of the

1245

connection's file handle, where N is the connection's current

1246

expected page size value.

1247

1248

If page 1 data is read as required by H35230, and the value of the

1249

page-size field that appears in the database file header that

1250

consumes the first 100 bytes of the read block is not the same as the

1251

connection's current expected page size, then the

1252

expected page size is set to this value, the database file is

1253

unlocked and the entire procedure to open a read transaction

1254

is repeated.

1255

1256

If page 1 data is read as required by H35230, and the value of the

1257

page-size field that appears in the database file header that

1258

consumes the first 100 bytes of the read block is the same as the

1259

connection's current expected page size, then the block of data

1260

read is stored in the page cache as page 1.
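The two outcomes described above can be sketched as a single check in C. The names Page1Action and check_page_one() are hypothetical. The page-size field is assumed to be a two-byte big-endian integer at byte offset 16 of the header, as given in the file format document rather than in this section. Unlocking the file and repeating the procedure is left to the caller.

```c
#include <stdint.h>

typedef enum {
  PAGE1_STORE,  /* sizes agree: store the block in the page cache as page 1 */
  PAGE1_RETRY   /* sizes differ: unlock and reopen the read transaction */
} Page1Action;

/*
** block holds the N bytes read from the start of the database file,
** where N is the current expected page size (at least 100 bytes).
** If the header disagrees with *expected, *expected is updated to the
** value found in the header and PAGE1_RETRY is returned.
*/
Page1Action check_page_one(const unsigned char *block, uint32_t *expected){
  /* Decode the two-byte big-endian page-size field at offset 16. */
  uint32_t actual = ((uint32_t)block[16] << 8) | block[17];
  if( actual != *expected ){
    *expected = actual;
    return PAGE1_RETRY;
  }
  return PAGE1_STORE;
}
```

Once the expected page size has been corrected, the next attempt reads the correct number of bytes and the block can be stored.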

1261

<h2>Reading Database Data</h2>

1262

1263

Add something about checking the page-cache first etc.

1264

<h2>Ending a Read-only Transaction</h2>

1265

1266

To end a read-only transaction, SQLite simply relinquishes the

1267

shared lock on the file-handle open on the database file. No

1268

other action is required.

1269

1270

When required to end a read-only transaction, SQLite shall

1271

relinquish the shared lock held on the database file by

1272

calling the xUnlock() method of the file-handle.

1273

1274

See also requirements H35180 and H35210 above.

1275

<h1 id=writing_data>Writing Data</h1>

1276

1277

Using DDL or DML SQL statements, SQLite users may modify the contents and

1278

size of a database file. Exactly how changes to the logical database are

1279

translated to modifications to the database file is described in

1280

<cite>ff_sqlitert_requirements</cite>. From the point of view of the

1281

sub-systems described in this document, each DDL or DML statement executed

1282

results in the contents of zero or more database file pages being

1283

overwritten with new data. A DDL or DML statement may also append or

1284

truncate one or more pages to or from the end of the database file. One

1285

or more DDL and/or DML statements are grouped together to make up a

1286

single write transaction. A write transaction is required

1287

to have the special properties described in section <cite>overview</cite>;

1288

a write transaction must be isolated, durable and atomic.

1289

1290

SQLite accomplishes these goals using the following techniques:

<ul>
<li>
To ensure that write transactions are isolated, before
beginning to modify the contents of the database file to reflect the
results of a write transaction, SQLite obtains an exclusive
lock on the database file. The lock is not relinquished
until the write transaction is concluded. Because reading from
the database file requires a shared lock (see section
<cite>reading_data</cite>) and holding an exclusive
lock guarantees that no other database connection is holding
or can obtain a shared lock, this ensures that no other
connection may read data from the database file at a point when
a write transaction has been partially applied.
<li>Ensuring that write transactions are atomic is the most
complex task required of the system. In this case, atomic means
that even if a system failure occurs, an attempt to commit a write
transaction to the database file either results in all changes
that are a part of the transaction being successfully applied to the
database file, or in none of the changes being applied. There
is no chance that only a subset of the changes is applied. Hence from
the point of view of an external observer, the write transaction
appears to be an atomic event.

Of course, it is usually not possible to atomically apply all the
changes required by a write transaction to a database file
within the file-system. For example, if a write transaction
requires ten pages of a database file to be modified, and a power
outage causes a system failure after SQLite has modified only five
pages, then the database file will almost certainly be in an
inconsistent state following system recovery.

SQLite solves this problem by using a journal file. In almost
all cases, before the database file is modified in any way,
SQLite stores sufficient information in the journal file to
allow the original database file to be reconstructed if a system
failure occurs while the database file is being updated to reflect
the modifications made by the write transaction. Each time
SQLite opens a database file, it checks if such a system failure has
occurred and, if so,
reconstructs the database file based on the contents
of the journal file. The procedure used to detect whether or not this
process, called hot journal rollback, is required is described
in section <cite>hot_journal_detection</cite>. Hot journal rollback
itself is described in section <cite>hot_journal_rollback</cite>.

The same technique ensures that an SQLite database file cannot be
corrupted by a system failure that occurs at an inopportune moment.
If a system failure does occur before SQLite has had a chance to
execute sufficient sync file operations to ensure that the
changes that make up a write transaction have made it safely
to persistent storage, then the journal file will be used
to restore the database to a known good state following system
recovery.
<li>
So that write transactions are durable in the face of
a system failure, SQLite executes a sync file operation on the
database file before concluding the write transaction.
</ul>

The page cache is used to buffer modifications to the database
file image before they are written to the database file. When
the content of a page must be modified as the result of
an operation within a write transaction, the modified copy is
stored in the page cache. Similarly, if new pages are appended
to the end of a database file, they are added to the page cache
instead of being immediately written to the database file within the
file-system.

Ideally, all changes for an entire write transaction are buffered in
the page cache until the end of the transaction. When the user commits
the transaction, all changes are applied to the database file in the
most efficient way possible, taking into account the assumptions
enumerated in section <cite>fs_performance</cite>. Unfortunately, since
main-memory is a limited resource, this is not always possible for
large transactions. In this case changes are buffered in the page
cache until some internal condition or limit is reached,
then written out to the database file in order to free resources
as they are required. Section <cite>page_cache_algorithms</cite>
describes the circumstances under which changes are flushed through
to the database file mid-transaction to free page cache resources.

Even if an application or system failure does not occur while a
write transaction is in progress, a rollback operation to restore
the database file and page cache to the state they were in before
the transaction started may be required. This may occur if the user
explicitly requests transaction rollback (by issuing a "ROLLBACK" command),
or automatically, as a result of encountering an SQL constraint violation (see
<cite>sql_sqlitert_requirements</cite>). For this reason, the original page
content is stored in the journal file before the page is even
modified within the page cache.

Introduce the following sub-sections.

<h2 id=journal_file_format>Journal File Format</h2>

This section describes the format used by an SQLite journal file.

A journal file consists of one or more journal headers, zero
or more journal records and optionally a master journal
pointer. Each journal file always begins with a
journal header, followed by zero or more journal records.
Following this may be a second journal header followed by a
second set of zero or more journal records and so on. There
is no limit to the number of journal headers a journal file
may contain. Following the journal headers and their accompanying
sets of journal records may be the optional master journal
pointer. Or, the file may simply end following the final journal
record.

This section only describes the format of the journal file and the
various objects that make it up. But because a journal file may be
read by an SQLite process following recovery from a system failure
(hot journal rollback, see section
<cite>hot_journal_rollback</cite>) it is also important to describe
the way the file is created and populated within the file-system
using a combination of write file, sync file and
truncate file operations. These are described in section
<cite>write_transactions</cite>.

<h3 id=journal_header_format>Journal Header Format</h3>

A journal header is sector-size bytes in size, where
sector-size is the value returned by the xSectorSize method of
the file handle opened on the database file. Only the first 28 bytes
of the journal header are used; the remainder may contain garbage
data. The first 28 bytes of each journal header consist of an
eight byte block set to a well-known value, followed by five big-endian
32-bit unsigned integer fields.

Figure - Journal Header Format
</center>

Figure <cite>figure_journal_header</cite> graphically depicts the layout
of a journal header. The individual fields are described in
the following table. The offsets in the 'byte offset' column of the
table are relative to the start of the journal header.

<table>
<tr><th>Byte offset<th>Size in bytes<th width=100%>Description
<tr><td>0<td>8<td>The journal magic field always contains a
well-known 8-byte string value used to identify SQLite
journal files. The well-known sequence of byte values
is: 0xd9, 0xd5, 0x05, 0xf9, 0x20, 0xa1, 0x63, 0xd7.
<tr><td>8<td>4<td>This field, the record count, is set to the
number of journal records that follow this
journal header in the journal file.
<tr><td>12<td>4<td>The checksum initializer field is set to a
pseudo-random value. It is used as part of the
algorithm to calculate the checksum for all journal
records that follow this journal header.
<tr><td>16<td>4<td>This field, the database page count, is set
to the number of pages that the database file
contained before any modifications associated with
the write transaction are applied.
<tr><td>20<td>4<td>This field, the sector size, is set to the
sector size of the device on which the
journal file was created, in bytes. This value
is required when reading the journal file to determine
the size of each journal header.
<tr><td>24<td>4<td>The page size field contains the database page
size used by the corresponding database file
when the journal file was created, in bytes.
</table>

All journal headers are positioned in the file so that they
start at a sector size aligned offset. To achieve this, unused
space may be left between the start of the second and subsequent
journal headers and the end of the journal records
associated with the previous header.
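The 28-byte layout in the table above can be illustrated with a short Python sketch. It is not taken from SQLite: the function names are hypothetical, the magic bytes are the ones listed by the header-writing requirements later in this document, and the unused remainder of the sector is zero-filled here even though the text allows it to contain garbage.

```python
import struct

# Magic bytes as listed in the requirements for writing a journal header.
JOURNAL_MAGIC = bytes([0xd9, 0xd5, 0x05, 0xf9, 0x20, 0xa1, 0x63, 0xd7])
# 8-byte magic followed by five big-endian unsigned 32-bit fields (28 bytes).
HEADER_FORMAT = ">8s5I"


def pack_journal_header(record_count, checksum_init, page_count,
                        sector_size, page_size):
    """Return a sector-size block whose first 28 bytes are the header."""
    used = struct.pack(HEADER_FORMAT, JOURNAL_MAGIC, record_count,
                       checksum_init, page_count, sector_size, page_size)
    # Assumption: pad with zeros; the specification permits any content here.
    return used + bytes(sector_size - len(used))


def unpack_journal_header(block):
    """Decode the first 28 bytes of a header block into a dict."""
    magic, nrec, cksum, npage, sector, page = struct.unpack(
        HEADER_FORMAT, block[:28])
    if magic != JOURNAL_MAGIC:
        raise ValueError("not a journal header")
    return {"record_count": nrec, "checksum_init": cksum,
            "page_count": npage, "sector_size": sector, "page_size": page}


header = pack_journal_header(0, 0x12345678, 10, 512, 1024)
fields = unpack_journal_header(header)
```

Reading the sector size field back out of the header is what lets a reader skip from one journal header to the first journal record that follows it.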

<h3 id=journal_record_format>Journal Record Format</h3>

Each journal record contains the original data for a database page
modified by the write transaction. If a rollback is required, then
this data may be used to restore the contents of the database page to the
state it was in before the write transaction was started.

Figure - Journal Record Format
</center>

A journal record, depicted graphically by figure
<cite>figure_journal_record</cite>, contains three fields, as described
in the following table. Byte offsets are relative to the start of the
journal record.

<table>
<tr><th>Byte offset<th>Size in bytes<th width=100%>Description
<tr><td>0<td>4<td>The page number of the database page associated with
this journal record, stored as a 4-byte
big-endian unsigned integer.
<tr><td>4<td>page-size<td>
This field contains the original data for the page,
exactly as it appeared in the database file before the
write transaction began.
<tr><td>4 + page-size<td>4<td>
This field contains a checksum value, calculated based
on the contents of the journalled database page data
(the previous field) and the value stored in the
checksum initializer field of the preceding
journal header.
</table>

The journal records that follow a journal header
in a journal file are packed tightly together. There are no
alignment requirements for journal records as there are for
journal headers.
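As a rough illustration of the three-field record layout, the following Python sketch packs and unpacks a single journal record. The function names are hypothetical, the checksum is treated as an opaque input (its calculation is covered where journalling a page is described), and storing it as an unsigned value is an assumption.

```python
import struct


def pack_journal_record(page_number, page_data, checksum):
    """Concatenate page number, original page data and checksum."""
    return (struct.pack(">I", page_number)
            + page_data
            + struct.pack(">I", checksum & 0xFFFFFFFF))


def unpack_journal_record(record, page_size):
    """Split a record of 4 + page_size + 4 bytes into its three fields."""
    if len(record) != page_size + 8:
        raise ValueError("unexpected record length")
    (page_number,) = struct.unpack(">I", record[:4])
    page_data = record[4:4 + page_size]
    (checksum,) = struct.unpack(">I", record[4 + page_size:])
    return page_number, page_data, checksum


page = bytes(range(256)) * 4  # 1024 bytes of sample page content
record = pack_journal_record(7, page, 0xDEADBEEF)
```

Because records carry no padding, the n-th record after a header starts exactly n * (page-size + 8) bytes after the end of that header's sector.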

<h3>Master Journal Pointer</h3>

To support atomic transactions that modify more than one
database file, SQLite sometimes includes a master journal pointer
record in a journal file. Multiple file transactions are
described in section <cite>multifile_transactions</cite>. A
master journal pointer contains the name of a master journal-file
along with a check-sum and some well known values that allow
the master journal pointer to be recognized as such when
the journal file is read during a rollback operation (section
<cite>rollback</cite>).

As is the case for a journal header, the start of a master
journal pointer is always positioned at a sector size
aligned offset. If the journal record or journal header
that appears immediately before the master journal pointer does
not end at an aligned offset, then unused space is left between the
end of the journal record or journal header and the start
of the master journal pointer.

Figure - Master Journal Pointer Format
</center>

A master journal pointer, depicted graphically by figure
<cite>figure_master_journal_ptr</cite>, contains five fields, as
described in the following table. Byte offsets are relative to the
start of the master journal pointer.

<table>
<tr><th>Byte offset<th>Size in bytes<th width=100%>Description
<tr><td>0<td>4<td>This field, the locking page number, is always
set to the page number of the database locking page
stored as a 4-byte big-endian integer. The locking page
is the page that begins at byte offset 2<sup>30</sup> of the
database file. Even if the database file is large enough to
contain the locking page, the locking page is
never used to store any data and so the first four bytes of a
valid journal record will never contain this value. For
further description of the locking page, refer to
<cite>ff_sqlitert_requirements</cite>.
<tr><td>4<td>name-length<td>
The master journal name field contains the name of the
master journal file, encoded as a utf-8 string. There is no
nul-terminator appended to the string.
<tr><td>4 + name-length<td>4<td>
The name-length field contains the length of the
previous field in bytes, formatted as a 4-byte big-endian
unsigned integer.
<tr><td>8 + name-length<td>4<td>
The checksum field contains a checksum value stored as
a 4-byte big-endian signed integer. The checksum value is
calculated as the sum of the bytes that make up the
master journal name field, interpreting each byte as
an 8-bit signed integer.
<tr><td style="white-space: nowrap">12 + name-length<td>8<td>
Finally, the journal magic field always contains a
well-known 8-byte string value; the same value stored in the
first 8 bytes of a journal header. The well-known
sequence of bytes is: 0xd9, 0xd5, 0x05, 0xf9, 0x20, 0xa1, 0x63, 0xd7.
</table>
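The five fields of a master journal pointer can be assembled as in the sketch below. This is an illustration under stated assumptions rather than SQLite's own code: the function names and the file name are hypothetical, pages are assumed to be numbered from 1 (so the page starting at byte offset 2^30 has number 2^30 / page-size + 1), and the magic bytes are those given by the journal header requirements.

```python
import struct

JOURNAL_MAGIC = bytes([0xd9, 0xd5, 0x05, 0xf9, 0x20, 0xa1, 0x63, 0xd7])


def locking_page_number(page_size):
    """Number of the page that begins at byte offset 2**30 (pages from 1)."""
    return (1 << 30) // page_size + 1


def name_checksum(name_bytes):
    """Sum of the name bytes, each read as an 8-bit signed integer."""
    return sum(b - 256 if b > 127 else b for b in name_bytes)


def pack_master_journal_pointer(master_name, page_size):
    """Build the pointer: page number, name, name length, checksum, magic."""
    name = master_name.encode("utf-8")  # no nul-terminator is appended
    return (struct.pack(">I", locking_page_number(page_size))
            + name
            + struct.pack(">I", len(name))
            + struct.pack(">i", name_checksum(name))
            + JOURNAL_MAGIC)


# "test.db-mj0001" is a made-up master journal name used only for illustration.
pointer = pack_master_journal_pointer("test.db-mj0001", 1024)
```

Placing the name length after the name, with the magic value last, means a reader can locate the pointer by working backwards from the end of the journal file.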

<h2 id=write_transactions>Write Transactions</h2>

This section describes the progression of an SQLite write
transaction. From the point of view of the systems described in
this document, most write transactions consist of three steps:
<ol>
<li>The write transaction is opened. This process is described
in section <cite>opening_a_write_transaction</cite>.
<li>The end-user executes DML or DDL SQL statements that require the
structure of the database file to be modified.
These modifications may be any combination of operations to
<ul><li>modify the content of an existing database page,
<li>append a new database page to the database file image, or
<li>truncate (discard) a database page from the end of the
database file.
</ul>
These operations are described in detail in section
<cite>modifying_appending_truncating</cite>. How user DDL or DML
SQL statements are mapped to combinations of these three operations
is described in <cite>ff_sqlitert_requirements</cite>.
<li>The write transaction is concluded and the changes made
permanently committed to the database. The process required to
commit a transaction is described in section
<cite>committing_a_transaction</cite>.
</ol>

As an alternative to step 3 above, the transaction may be rolled back.
Transaction rollback is described in section <cite>rollback</cite>.
Finally, it is also important to remember that a write transaction
may be interrupted by a system failure at any point. In this
case, the contents of the file-system (the database file and
journal file) must be left in a state that enables
the database file to be restored to the state it was in before
the interrupted write transaction was started. This is known
as hot journal rollback, and is described in section
<cite>hot_journal_rollback</cite>. Section
<cite>fs_assumption_details</cite> describes the assumptions made
regarding the effects of a system failure on the file-system
contents following recovery.

<h3 id=opening_a_write_transaction>Beginning a Write Transaction</h3>

Before any database pages may be modified within the page cache,
the database connection must open a write transaction.
Opening a write transaction requires that the database
connection obtains a reserved lock (or greater) on the
database file. Because obtaining a reserved lock on
a database file guarantees that no other database
connection may hold or obtain a reserved lock or greater,
it follows that no other database connection may have an
open write transaction.

A reserved lock on the database file may be thought of
as an exclusive lock on the journal file. No
database connection may read from or write to a journal
file without a reserved or greater lock on the corresponding
database file.

Before opening a write transaction, a database connection
must have an open read transaction, opened via the procedure
described in section <cite>open_read_only_trans</cite>. This ensures
that there is no hot-journal file that needs to be rolled back
and that any data stored in the page cache can be trusted.

Once a read transaction has been opened, upgrading to a
write transaction is a two step process, as follows:
<ol>
<li>A reserved lock is obtained on the database file.
<li>The journal file is opened and created if necessary (using
the VFS xOpen method), and a journal file header written
to the start of it using a single call to the file handle's xWrite
method.
</ol>

Requirements describing step 1 of the above procedure in detail:

When required to open a write transaction on the database,
SQLite shall first open a read transaction, if the database
connection in question has not already opened one.

When required to open a write transaction on the database, after
ensuring a read transaction has already been opened, SQLite
shall obtain a reserved lock on the database file by calling
the xLock method of the file-handle open on the database file.

If an attempt to acquire a reserved lock prescribed by
requirement H35360 fails, then SQLite shall deem the attempt to
open a write transaction to have failed and return an error
to the user.

Requirements describing step 2 of the above procedure in detail:

When required to open a write transaction on the database, after
obtaining a reserved lock on the database file, SQLite shall
open a read/write file-handle on the corresponding journal file.

When required to open a write transaction on the database, after
opening a file-handle on the journal file, SQLite shall append
a journal header to the (currently empty) journal file.

<h4 id=writing_journal_header>Writing a Journal Header</h4>

Requirements describing how a journal header is appended to
a journal file:

When required to append a journal header to the journal
file, SQLite shall do so by writing a block of sector-size
bytes using a single call to the xWrite method of the file-handle
open on the journal file. The block of data written shall begin
at the smallest sector-size aligned offset at or following the current
end of the journal file.
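The "smallest sector-size aligned offset at or following the current end of the journal file" is a round-up to the next multiple of the sector size. A minimal Python sketch, with a hypothetical function name:

```python
def next_header_offset(journal_size, sector_size):
    """Smallest multiple of sector_size that is >= journal_size."""
    return ((journal_size + sector_size - 1) // sector_size) * sector_size


# One 512-byte header followed by three records of 1024 + 8 bytes each.
end_of_records = 512 + 3 * 1032
second_header_at = next_header_offset(end_of_records, 512)
unused_gap = second_header_at - end_of_records
```

In this example the records end at byte 3608, so a second header would begin at byte 4096 and 488 bytes would be left unused between them. An empty journal file yields offset 0, which matches the first header being written at the start of the file.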

The first 8 bytes of the journal header required to be written
by H35680 shall contain the following values, in order from byte offset 0
to 7: 0xd9, 0xd5, 0x05, 0xf9, 0x20, 0xa1, 0x63 and 0xd7.

Bytes 8-11 of the journal header required to be written by
H35680 shall contain 0x00.

Bytes 12-15 of the journal header required to be written by
H35680 shall contain pseudo-randomly generated values.

Bytes 16-19 of the journal header required to be written by
H35680 shall contain the number of pages that the database file
contained when the current write-transaction was started,
formatted as a 4-byte big-endian unsigned integer.

Bytes 20-23 of the journal header required to be written by
H35680 shall contain the sector size used by the VFS layer,
formatted as a 4-byte big-endian unsigned integer.

Bytes 24-27 of the journal header required to be written by
H35680 shall contain the page size used by the database at
the start of the write transaction, formatted as a 4-byte
big-endian unsigned integer.

<h3 id=modifying_appending_truncating>Modifying, Adding or Truncating a Database Page</h3>

When the end-user executes a DML or DDL SQL statement to modify the
database schema or content, SQLite is required to update the database
file image to reflect the new database state. This involves modifying
the content of, appending or truncating one or more database file
pages. Instead of modifying the database file directly using the VFS
interface, changes are first buffered within the page cache.

Before modifying a database page within the page cache that
may need to be restored by a rollback operation, the page must be
journalled. Journalling a page is the process of copying
that page's original data into the journal file so that it can be
recovered if the write transaction is rolled back. The process
of journalling a page is described in section
<cite>journalling_a_page</cite>.

When required to modify the contents of an existing database page that
existed and was not a free-list leaf page when the write
transaction was opened, SQLite shall journal the page if it has not
already been journalled within the current write transaction.

When required to modify the contents of an existing database page,
SQLite shall update the cached version of the database page content
stored as part of the page cache entry associated with the page.

When a new database page is appended to a database file, there is
no requirement to add a record to the journal file. If a
rollback is required the database file will simply be truncated back
to its original size based on the value stored at byte offset 16
of the journal file (the database page count field of the first
journal header).

When required to append a new database page to the database file,
SQLite shall create a new page cache entry corresponding to
the new page and insert it into the page cache. The dirty
flag of the new page cache entry shall be set.

If required to truncate a database page from the end of the database
file, the associated page cache entry is discarded. The adjusted
size of the database file is stored internally. The database file
is not actually truncated until the current write transaction
is committed (see section <cite>committing_a_transaction</cite>).

When required to truncate (remove) from the end of the database file
a database page that existed and was not a free-list leaf page when
the write transaction was opened, SQLite shall journal the page if
it has not already been journalled within the current write
transaction.

When required to truncate a database page from the end of the database
file, SQLite shall discard the associated page cache entry
from the page cache.

<h4 id=journalling_a_page>Journalling a Database Page</h4>

A page is journalled by adding a journal record to the
journal file. The format of a journal record is described
in section <cite>journal_record_format</cite>.

When required to journal a database page, SQLite shall first
append the page number of the page being journalled to the
journal file, formatted as a 4-byte big-endian unsigned integer,
using a single call to the xWrite method of the file-handle opened
on the journal file.

When required to journal a database page, if the attempt to
append the page number to the journal file is successful,
then the current page data (page-size bytes) shall be appended
to the journal file, using a single call to the xWrite method of the
file-handle opened on the journal file.

When required to journal a database page, if the attempt to
append the current page data to the journal file is successful,
then SQLite shall append a 4-byte big-endian integer checksum value
to the journal file, using a single call to the xWrite method
of the file-handle opened on the journal file.

The checksum value written to the journal file immediately after
the page data (requirement H35290) is a function of both the page
data and the checksum initializer field stored in the
journal header (see section <cite>journal_header_format</cite>).
Specifically, it is the sum of the checksum initializer and
the value of every 200th byte of page data interpreted as an 8-bit
unsigned integer, starting with the (page-size % 200)'th
byte of page data. For example, if the page-size is 1024 bytes,
then a checksum is calculated by adding the values of the bytes at
offsets 23, 223, 423, 623, 823 and 1023 (the last byte of the page)
together with the value of the checksum initializer.

The checksum value written to the journal file by the write
required by H35290 shall be equal to the sum of the checksum
initializer field stored in the journal header (H35700) and
every 200th byte of the page data, beginning with the
(page-size % 200)th byte.

The '%' character is used in requirement H35300 to represent the
modulo operator, just as it is in programming languages such as C, Java
and Javascript.
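The checksum rule as worded above can be sketched in Python as follows. The sketch reads "the (page-size % 200)th byte" as a 1-based position, which is the reading implied by the 1024-byte example (offsets 23 to 1023). The function names are hypothetical, and wrapping the sum to 32 bits so that it fits the 4-byte field is an assumption.

```python
def sampled_offsets(page_size):
    """0-based offsets of the page bytes that contribute to the checksum."""
    # The (page_size % 200)th byte, counted from 1, is at this 0-based offset.
    # Assumption: page sizes are powers of two, so page_size % 200 is never 0.
    first = page_size % 200 - 1
    return list(range(first, page_size, 200))


def journal_checksum(checksum_init, page_data):
    """Checksum initializer plus every 200th byte of the page data."""
    total = checksum_init
    for offset in sampled_offsets(len(page_data)):
        total += page_data[offset]  # each byte is an unsigned 8-bit value
    return total & 0xFFFFFFFF  # assumption: wrap to fit the 4-byte field
```

Sampling only a handful of bytes keeps the checksum cheap to compute. Including the pseudo-random initializer means stale records left over from an earlier journal are unlikely to carry a matching checksum.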

<h3 id=syncing_journal_file>Syncing the Journal File</h3>

Even after the original data of a database page has been written into
the journal file using calls to the xWrite method of the file-handle
open on the journal file (section <cite>journalling_a_page</cite>), it is
still not safe to write to the page within the database file. This is because
in the event of a system failure the data written to the journal file
may still be corrupted (see section <cite>fs_characteristics</cite>).
Before the page can be updated within the database itself, the
following procedure takes place:
<ol>
<li> The xSync method of the file-handle opened on the journal file
is called. This operation ensures that all journal records
in the journal file have been written to persistent storage, and
that they will not become corrupted as a result of a subsequent
system failure.
<li> The journal record count field (see section
<cite>journal_header_format</cite>) of the most recently written
journal header in the journal file is updated to contain the
number of journal records added to the journal file since
the header was written.
<li> The xSync method is called again, to ensure that the update to
the journal record count has been committed to persistent
storage.
</ol>

If all three of the steps enumerated above are executed successfully,
then it is safe to modify the content of the journalled
database pages within the database file itself. The combination of
the three steps above is referred to as syncing the journal file.

When required to sync the journal file, SQLite shall invoke the
xSync method of the file handle open on the journal file.

When required to sync the journal file, after invoking the
xSync method as required by H35750, SQLite shall update the record
count of the journal header most recently written to the
journal file. The 4-byte field shall be updated to contain
the number of journal records that have been written to the
journal file since the journal header was written,
formatted as a 4-byte big-endian unsigned integer.

When required to sync the journal file, after updating the
record count field of a journal header as required by
H35760, SQLite shall invoke the xSync method of the file handle open
on the journal file.
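The sync, update, sync sequence can be sketched with ordinary file operations. In the Python sketch below, os.fsync stands in for the xSync method and os.write for xWrite; the function name, the file name and the zero-filled stand-in data are all hypothetical, and the record count is assumed to live at byte offset 8 of the header as given in the journal header table.

```python
import os
import struct
import tempfile

RECORD_COUNT_OFFSET = 8  # offset of the record count within a journal header


def sync_journal(fd, header_offset, record_count):
    """Sync the records, update the latest header's count, sync again."""
    os.fsync(fd)  # step 1: journal records reach persistent storage
    os.lseek(fd, header_offset + RECORD_COUNT_OFFSET, os.SEEK_SET)
    os.write(fd, struct.pack(">I", record_count))  # step 2: update the count
    os.fsync(fd)  # step 3: the updated count reaches persistent storage


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.db-journal")
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    try:
        os.write(fd, bytes(512))       # stand-in header with record count 0
        os.write(fd, bytes(3 * 1032))  # stand-in for three journal records
        sync_journal(fd, 0, 3)
        os.lseek(fd, RECORD_COUNT_OFFSET, os.SEEK_SET)
        stored_count = struct.unpack(">I", os.read(fd, 4))[0]
    finally:
        os.close(fd)
```

Writing the count only after the first sync means that a header never claims more records than are known to be safely on disk.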

<h3 id=upgrading_to_exclusive_lock>Upgrading to an Exclusive Lock</h3>

Before the content of a page modified within the page cache may
be written to the database file, an exclusive lock must be held
on the database file. The purpose of this lock is to prevent another
connection from reading from the database file while the first
connection is midway through writing to it. Whether the database file
is being written because a transaction is being committed
or to free up space within the page cache, upgrading to an
exclusive lock always occurs immediately after
syncing the journal file.

When required to upgrade to an exclusive lock as part of a write
transaction, SQLite shall first attempt to obtain a pending lock
on the database file if one is not already held by invoking the xLock
method of the file handle opened on the database file.

When required to upgrade to an exclusive lock as part of a write
transaction, after successfully obtaining a pending lock SQLite
shall attempt to obtain an exclusive lock by invoking the
xLock method of the file handle opened on the database file.

What happens if the exclusive lock cannot be obtained? It is not
possible for the attempt to upgrade from a reserved to a pending
lock to fail.

<h3 id=committing_a_transaction>Committing a Transaction</h3>

Committing a write transaction is the final step in updating the
database file. Committing a transaction is a seven step process,
summarized as follows:
<ol>
<li>
The database file header change counter field is incremented.
The change counter, described in
<cite>ff_sqlitert_requirements</cite>, is used by the cache
validation procedure described in section
<cite>cache_validation</cite>.
<li>
The journal file is synced. The steps required to sync the
journal file are described in section
<cite>syncing_journal_file</cite>.
<li>
Upgrade to an exclusive lock on the database file, if an
exclusive lock is not already held. Upgrading to an
exclusive lock is described in section
<cite>upgrading_to_exclusive_lock</cite>.
<li>
Copy the contents of all dirty pages stored in the page
cache into the database file. The dirty pages are written
to the database file in page number order to improve
performance (see the assumptions in section <cite>fs_performance</cite>
for details).
<li>
The database file is synced to ensure that all updates are stored
safely on the persistent media.
<li>
The file-handle open on the journal file is closed and the
journal file itself deleted. At this point the write
transaction has been irrevocably committed.
<li>
The database file is unlocked.
</ol>

Expand on and explain the above a bit.

The following requirements describe the steps enumerated above in more
detail.

1898

1899

When required to commit a write-transaction, SQLite shall
modify page 1 to increment the value stored in the change counter
field of the database file header.

The change counter is a 4-byte big-endian integer field stored
at byte offset 24 of the database file. The modification to page 1
required by H35800 is made using the process described in section
<cite>modifying_appending_truncating</cite>. If page 1 has not already
been journalled as a part of the current write-transaction, then
incrementing the change counter may require that page 1 be
journalled. In all cases the page cache entry corresponding to
page 1 becomes a dirty page as part of incrementing the change
counter value.
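As a minimal illustration of the field layout described above, the
following Python sketch increments a 4-byte big-endian counter at byte
offset 24 of an in-memory copy of page 1. The function name
increment_change_counter, the wrap-around at 2**32 and the 1024-byte
buffer are assumptions made for the example. Journalling and dirty-page
bookkeeping are not modelled.

```python
import struct

CHANGE_COUNTER_OFFSET = 24    # byte offset of the field within page 1
CHANGE_COUNTER_FORMAT = ">I"  # 4-byte big-endian unsigned integer


def increment_change_counter(page1: bytearray) -> int:
    """Increment the change counter in place and return the new value.

    Wrapping at 2**32 is an assumption of this sketch.
    """
    (value,) = struct.unpack_from(
        CHANGE_COUNTER_FORMAT, page1, CHANGE_COUNTER_OFFSET
    )
    value = (value + 1) & 0xFFFFFFFF
    struct.pack_into(CHANGE_COUNTER_FORMAT, page1, CHANGE_COUNTER_OFFSET, value)
    return value


# Usage: a hypothetical 1024-byte page 1 whose counter starts at 41.
page1 = bytearray(1024)
struct.pack_into(CHANGE_COUNTER_FORMAT, page1, CHANGE_COUNTER_OFFSET, 41)
new_value = increment_change_counter(page1)
```

Only bytes 24 to 27 of the buffer change. The rest of page 1 is left
untouched.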

When required to commit a write-transaction, after incrementing
the change counter field, SQLite shall sync the journal
file.

When required to commit a write-transaction, after syncing
the journal file as required by H35810, if an exclusive lock
on the database file is not already held, SQLite shall attempt to
upgrade to an exclusive lock.

When required to commit a write-transaction, after syncing
the journal file as required by H35810 and ensuring that an
exclusive lock is held on the database file as required by
H35820, SQLite shall copy the contents of all dirty pages
stored in the page cache into the database file using
calls to the xWrite method of the database connection file
handle. Each call to xWrite shall write the contents of a single
dirty page (page-size bytes of data) to the database
file. Dirty pages shall be written in order of page number,
from lowest to highest.
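The write ordering required above can be sketched as follows. The
sketch assumes that pages are numbered from 1 and stored contiguously,
so that page N begins at byte offset (N - 1) * page-size. The x_write
callable is a hypothetical stand-in for the xWrite method, and the
dictionary of dirty pages is an illustrative data structure rather than
SQLite's page cache.

```python
from typing import Callable, Dict, List, Tuple


def write_dirty_pages(
    dirty_pages: Dict[int, bytes],
    page_size: int,
    x_write: Callable[[bytes, int], None],
) -> None:
    """Write each dirty page with one x_write call, lowest page number first."""
    for page_number in sorted(dirty_pages):
        data = dirty_pages[page_number]
        if len(data) != page_size:
            raise ValueError(f"page {page_number} is not page-size bytes long")
        # Assumption: pages are numbered from 1 and stored contiguously.
        offset = (page_number - 1) * page_size
        x_write(data, offset)


# Usage: record the calls instead of touching a real file.
calls: List[Tuple[int, int]] = []
pages = {7: b"c" * 512, 1: b"a" * 512, 3: b"b" * 512}
write_dirty_pages(pages, 512, lambda data, offset: calls.append((offset, len(data))))
```

Recording the calls, rather than writing to a file, makes the ordering
easy to inspect: the offsets appear in ascending order regardless of
the order in which the pages were added to the dictionary.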

When required to commit a write-transaction, after copying the
contents of any dirty pages to the database file as required
by H35830, SQLite shall sync the database file by invoking the xSync
method of the database connection file handle.

When required to commit a write-transaction, after syncing
the database file as required by H35840, SQLite shall close the
file-handle opened on the journal file and delete the
journal file from the file system via a call to the VFS
xDelete method.

When required to commit a write-transaction, after deleting
the journal file as required by H35850, SQLite shall relinquish
all locks held on the database file by invoking the xUnlock
method of the database connection file handle.
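The ordering constraints in the requirements above can be summarised
with a toy model. Everything named here (RecordingPager, its methods
and the lock-state strings) is hypothetical and exists only to record
the order of operations. None of it is part of SQLite's API, and no
file IO is performed.

```python
from typing import Dict, List


class RecordingPager:
    """Toy pager that logs the operations a commit performs, in order."""

    def __init__(self, dirty_pages: Dict[int, bytes], lock: str = "RESERVED") -> None:
        self.dirty_pages = dict(dirty_pages)
        self.lock = lock
        self.log: List[str] = []

    def commit(self) -> None:
        # 1. Incrementing the change counter makes page 1 a dirty page.
        self.dirty_pages.setdefault(1, b"")
        self.log.append("increment change counter")
        # 2. Sync the journal file.
        self.log.append("sync journal")
        # 3. Upgrade to an exclusive lock if one is not already held.
        if self.lock != "EXCLUSIVE":
            self.lock = "EXCLUSIVE"
            self.log.append("lock exclusive")
        # 4. Write dirty pages, lowest page number first.
        for page_number in sorted(self.dirty_pages):
            self.log.append(f"write page {page_number}")
        self.dirty_pages.clear()
        # 5. Sync the database file.
        self.log.append("sync database")
        # 6. Close and delete the journal file: the commit point.
        self.log.append("delete journal")
        # 7. Relinquish all locks on the database file.
        self.lock = "UNLOCKED"
        self.log.append("unlock")


# Usage: commit a transaction that dirtied pages 5 and 2.
pager = RecordingPager({5: b"x", 2: b"y"})
pager.commit()
```

The log shows page 1 being written along with pages 2 and 5, because
incrementing the change counter always dirties page 1.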


Is the shared lock held after committing a write transaction?


<h3>Purging a Dirty Page</h3>

Usually, no data is actually written to the database file until the
user commits the active write transaction. The exception is
when a single write transaction contains too many modifications
to be stored in the page cache. In this case, some of the
database file modifications stored in the page cache must be
applied to the database file before the transaction is committed so
that the associated page cache entries can be purged from the
page cache to free memory. Exactly when this condition is reached and
dirty pages must be purged is described in section
<cite>page_cache_algorithms</cite>.

Before the contents of the page cache entry can be written into
the database file, the page cache entry must meet the criteria
for a writable dirty page, as defined in section
<cite>page_cache_algorithms</cite>. If the dirty page selected for
purging by the algorithms in section <cite>page_cache_algorithms</cite>
is not a writable dirty page,
SQLite is required to sync the journal file. Immediately after
the journal file is synced, all dirty pages associated with the
database connection are classified as writable dirty pages.

When required to purge a non-writable dirty page from the
page cache, SQLite shall sync the journal file before
proceeding with the write operation required by H35670.

After syncing the journal file as required by H35640, SQLite
shall append a new journal header to the journal file
before proceeding with the write operation required by H35670.

Appending a new journal header to the journal file is described
in section <cite>writing_journal_header</cite>.

Once the dirty page being purged is writable, it is simply written
into the database file.

When required to purge a page cache entry that is a
dirty page, SQLite shall write the page data into the database
file, using a single call to the xWrite method of the database
connection file handle.
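The purge procedure can be sketched as a small model that records the
operations it would perform. The PurgeModel class, its writable set and
the log strings are assumptions made for illustration. The choice of
which page to purge is described elsewhere and is not modelled here.

```python
from typing import Dict, List, Set


class PurgeModel:
    """Toy model of purging one dirty page from the page cache."""

    def __init__(self, dirty_pages: Dict[int, bytes], writable: Set[int]) -> None:
        self.dirty_pages = dict(dirty_pages)
        self.writable = set(writable)  # dirty pages already safe to write
        self.log: List[str] = []

    def purge(self, page_number: int) -> None:
        if page_number not in self.writable:
            # A non-writable dirty page requires a journal sync first,
            # followed by a new journal header.
            self.log.append("sync journal")
            self.log.append("append journal header")
            # After the sync every dirty page is a writable dirty page.
            self.writable = set(self.dirty_pages)
        # The page is written with a single write call, then dropped.
        self.log.append(f"write page {page_number}")
        del self.dirty_pages[page_number]
        self.writable.discard(page_number)


# Usage: neither dirty page is writable before the first purge.
model = PurgeModel({4: b"a", 9: b"b"}, writable=set())
model.purge(9)
model.purge(4)
```

The second purge needs no journal sync, because the sync performed for
the first purge made every dirty page writable.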

<h2 id="multifile_transactions">Multi-File Transactions</h2>

<h2 id="statement_transactions">Statement Transactions</h2>

<h1 id=rollback>Rollback</h1>

<h2 id=hot_journal_rollback>Hot Journal Rollback</h2>

<h2>Transaction Rollback</h2>

<h2>Statement Rollback</h2>

<h1>References</h1>

C API Requirements Document.

SQL Requirements Document.

File Format Requirements Document.

</table>
