1
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
3
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
4
<title>No Title</title>
5
<style type="text/css">
8
font-family: Verdana, sans-serif;
13
a:visited { color: #734559 }
15
.logo { position:absolute; margin:3px; }
31
.toolbar a { color: white; text-decoration: none; padding: 6px 12px; }
32
.toolbar a:visited { color: white; }
33
.toolbar a:hover { color: #044a64; background: white; }
35
.content { margin: 5%; }
36
.content dt { font-weight:bold; }
37
.content dd { margin-bottom: 25px; margin-left:20%; }
38
.content ul { padding:0px; padding-left: 15px; margin:0px; }
41
.se { background: url(images/se.gif) 100% 100% no-repeat #044a64}
42
.sw { background: url(images/sw.gif) 0% 100% no-repeat }
43
.ne { background: url(images/ne.gif) 100% 0% no-repeat }
44
.nw { background: url(images/nw.gif) 0% 0% no-repeat }
46
/* Things for "fancyformat" documents start here. */
47
.fancy img+p {font-style:italic}
48
.fancy .codeblock i { color: darkblue; }
49
.fancy h1,.fancy h2,.fancy h3,.fancy h4 {font-weight:normal;color:#044a64}
50
.fancy h2 { margin-left: 10px }
51
.fancy h3 { margin-left: 20px }
52
.fancy h4 { margin-left: 30px }
53
.fancy th {white-space:nowrap;text-align:left;border-bottom:solid 1px #444}
54
.fancy th, .fancy td {padding: 0.2em 1ex; vertical-align:top}
55
.fancy #toc a { color: darkblue ; text-decoration: none }
56
.fancy .todo { color: #AA3333 ; font-style : italic }
57
.fancy .todo:before { content: 'TODO:' }
58
.fancy p.todo { border: solid #AA3333 1px; padding: 1ex }
59
.fancy img { display:block; }
60
.fancy :link:hover, .fancy :visited:hover { background: wheat }
61
.fancy p,.fancy ul,.fancy ol { margin: 1em 5ex }
62
.fancy li p { margin: 1em 0 }
63
/* End of "fancyformat" specific rules. */
69
<div><!-- container div to satisfy validator -->
72
<img class="logo" src="images/sqlite370_banner.gif" alt="SQLite Logo"
74
<div><!-- IE hack to prevent disappearing logo--></div>
75
<div class="tagline">Small. Fast. Reliable.<br>Choose any three.</div>
77
<table width=100% style="clear:both"><tr><td>
78
<div class="se"><div class="sw"><div class="ne"><div class="nw">
79
<table width=100% style="padding:0;margin:0;cell-spacing:0"><tr>
82
<a href="about.html">About</a>
83
<a href="sitemap.html">Sitemap</a>
84
<a href="docs.html">Documentation</a>
85
<a href="download.html">Download</a>
86
<a href="copyright.html">License</a>
87
<a href="news.html">News</a>
88
<a href="support.html">Support</a>
91
gMsg = "Search SQLite Docs..."
92
function entersearch() {
93
var q = document.getElementById("q");
94
if( q.value == gMsg ) { q.value = "" }
95
q.style.color = "black"
96
q.style.fontStyle = "normal"
98
function leavesearch() {
99
var q = document.getElementById("q");
100
if( q.value == "" ) {
102
q.style.color = "#044a64"
103
q.style.fontStyle = "italic"
108
<div style="padding:0 1em 0px 0;white-space:nowrap">
109
<form name=f method="GET" action="http://www.sqlite.org/search">
110
<input id=q name=q type=text
111
onfocus="entersearch()" onblur="leavesearch()" style="width:24ex;padding:1px 1ex; border:solid white 1px; font-size:0.9em ; font-style:italic;color:#044a64;" value="Search SQLite Docs...">
112
<input type=submit value="Go" style="border:solid white 1px;background-color:#044a64;color:white;font-size:0.9em;padding:0 1ex">
116
</div></div></div></div>
118
<div class=startsearch></div>
121
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
124
<link type="text/css" rel="stylesheet" href="images/fileformat/rtdocs.css">
125
<script type="text/javascript" src=images/fileformat/rtdocs.js></script>
127
<div id=document_title>SQLite File IO Specification</div>
128
<div id=toc_header>Table Of Contents</div>
130
<b>Javascript is required for some features of this document, including
131
table of contents, figure numbering and internal references (section
132
numbers and hyper-links.
135
<!-- End of standard rt docs header -->
136
<h1 id=overview>Overview</h1>
138
SQLite stores an entire database within a single file, the format of
139
which is described in the <i>SQLite File Database File Format</i>
140
document <cite>ff_sqlitert_requirements</cite>. Each database file is
141
stored within a file system, presumably provided by the host operating
142
system. Instead of interfacing with the operating system directly,
143
the host application is required to supply an adaptor component that
144
implements the <i>SQLite Virtual File System</i> interface
145
(described in <cite>capi_sqlitert_requirements</cite>). The adaptor
146
component is responsible for translating the calls made by SQLite to
147
the <i>VFS</i> interface into calls to the file-system interface
148
provided by the operating system. This arrangement is depicted in figure
149
<cite>figure_vfs_role</cite>.
150
<center><img src="images/fileformat/vfs_role.gif">
151
<p><i>Figure <span class=fig id=figure_vfs_role></span> - Virtual File System (VFS) Adaptor</i>
154
Although it would be easy to design a system that uses the <i>VFS</i>
155
interface to read and update the content of a database file stored
156
within a file-system, there are several complicated issues that need
157
to be addressed by such a system:
159
<li><p>SQLite is required to <b>implement atomic and durable
160
transactions</b> (the 'A' and 'D' from the ACID acronym), even if an
161
application, operating system or power failure occurs midway through or
162
shortly after updating a database file.
163
<p>To implement atomic transactions in the face of potential
164
application, operating system or power failures, database writers write
165
a copy of those portions of the database file that they are going to
166
modify into a second file, the <i>journal file</i>, before writing
167
to the database file. If a failure does occur while modifying the
168
database file, SQLite can reconstruct the original database
169
(before the modifications were attempted) based on the contents of
170
the <i>journal file</i>.
171
<li><p>SQLite is required to <b>implement isolated transactions</b> (the 'I'
172
from the ACID acronym).
173
<p>This is done by using the file locking facilities provided by the
174
VFS adaptor to serialize writers (write transactions) and preventing
175
readers (read transactions) from accessing database files while writers
176
are midway through updating them.
177
<li><p>For performance reasons, it is advantageous to <b>minimize the
178
quantity of data read and written</b> to and from the file-system.
179
<p>As one might expect, the amount of data read from the database
180
file is minimized by caching portions of the database file in main
181
memory. Additionally, multiple updates to the database file that
182
are part of the same <i>write transaction</i> may be cached in
183
main memory and written to the file together, allowing for
184
more efficient IO patterns and eliminating the redundant write
185
operations that could take place if part of the database file is
186
modified more than once within a single <i>write transaction</i>.
189
System requirement references for the above points.
191
This document describes in detail the way that SQLite uses the API
192
provided by the VFS adaptor component to solve the problems and implement
193
the strategies enumerated above. It also specifies the assumptions made
194
about the properties of the system that the VFS adaptor provides
195
access to. For example, specific assumptions about the extent of
196
data corruption that may occur if a power failure occurs while a
197
database file is being updated are presented in section
198
<cite>fs_characteristics</cite>.
200
This document does not specify the details of the interface that must
201
be implemented by the VFS adaptor component, that is left to
202
<cite>capi_sqlitert_requirements</cite>.
203
<h2>Relationship to Other Documents</h2>
205
Related to C-API requirements:
207
<li>Opening a connection.
208
<li>Closing a connection.
211
Related to SQL requirements:
213
<li value=3>Opening a read-only transaction.
214
<li>Terminating a read-only transaction.
215
<li>Opening a read-write transaction.
216
<li>Committing a read-write transaction.
217
<li>Rolling back a read-write transaction.
218
<li>Opening a statement transaction.
219
<li>Committing a statement transaction.
220
<li>Rolling back a statement transaction.
221
<li>Committing a multi-file transaction.
224
Related to file-format requirements:
226
<li value=12>Pinning (reading) a database page.
227
<li>Unpinning a database page.
228
<li>Modifying the contents of a database page.
229
<li>Appending a new page to the database file.
230
<li>Truncating a page from the end of the database file.
232
<h2>Document Structure</h2>
234
Section <cite>vfs_assumptions</cite> of this document describes the
235
various assumptions made about the system to which the VFS adaptor
236
component provides access. The basic capabilities and functions
237
required from the VFS implementation are presented along with the
238
description of the VFS interface in
239
<cite>capi_sqlitert_requirements</cite>. Section
240
<cite>vfs_assumptions</cite> complements this by describing in more
241
detail the assumptions made about VFS implementations on which the
242
algorithms presented in this document depend. Some of these assumptions
243
relate to performance issues, but most concern the expected state of
244
the file-system following a failure that occurs midway through
245
modifying a database file.
247
Section <cite>database_connections</cite> introduces the concept of
248
a <i>database connection</i>, a combination of a file-handle and
249
in-memory cache used to access a database file. It also describes the
250
VFS operations required when a new <i>database connection</i> is
251
created (opened), and when one is destroyed (closed).
253
Section <cite>reading_data</cite> describes the steps required to
254
open a <i>read transaction</i> and read data from a database file.
256
Section <cite>writing_data</cite> describes the steps required to
257
open a <i>write transaction </i> and write data to a database file.
259
Section <cite>rollback</cite> describes the way in which aborted
260
<i>write transactions</i> may be rolled back (reverted), either as
261
a result of an explicit user directive or because an application,
262
operating system or power failure occurred while SQLite was midway
263
through updating a database file.
265
Section <cite>page_cache_algorithms</cite> describes some of the
266
algorithms used to determine exactly which portions of the database
267
file are cached by a <i>page cache</i>, and the effect that they
268
have on the quantity and nature of the required VFS operations.
269
It may at first seem odd to include the <i>page cache</i>, which is
270
primarily an implementation detail, in this document. However, it is
271
necessary to acknowledge and describe the <i>page cache</i> in order to
272
provide a more complete explanation of the nature and quantity of IO
276
After this document is ready, make the vocabulary consistent and
277
then add a glossary here.
278
<h1 id=vfs_assumptions>VFS Adaptor Related Assumptions</h1>
280
This section documents those assumptions made about the system that
281
the VFS adaptor provides access to. The assumptions noted in section
282
<cite>fs_characteristics</cite> are particularly important. If these
283
assumptions are not true, then a power or operating system failure
284
may cause SQLite databases to become corrupted.
285
<h2 id=fs_performance>Performance Related Assumptions</h2>
287
SQLite uses the assumptions in this section to try to speed up
288
reading from and writing to the database file.
289
<p class=req id=A21010>
290
It is assumed that writing a series of sequential blocks of data to
291
a file in order is faster than writing the same blocks in an arbitrary
293
</p> <h2 id=fs_characteristics>System Failure Related Assumptions</h2>
295
In the event of an operating system or power failure, the various
296
combinations of file-system software and storage hardware available
297
provide varying levels of guarantee as to the integrity of the data
298
written to the file system just before or during the failure. The exact
299
combination of IO operations that SQLite is required to perform in
300
order to safely modify a database file depend on the exact
301
characteristics of the target platform.
303
This section describes the assumptions that SQLite makes about the
304
content of a file-system following a power or system failure. In
305
other words, it describes the extent of file and file-system corruption
306
that such an event may cause.
308
SQLite queries an implementation for file-system characteristics
309
using the xDeviceCharacteristics() and xSectorSize() methods of the
310
database file file-handle. These two methods are only ever called
311
on file-handles open on database files. They are not called for
312
<i>journal files</i>, <i>master-journal files</i> or
313
<i>temporary database files</i>.
315
The file-system <i>sector size</i> value determined by calling the
316
xSectorSize() method is a power of 2 value between 512 and 32768,
317
inclusive <span class=todo>reference to exactly how this is
318
determined</span>. SQLite assumes that the underlying storage
319
device stores data in blocks of <i>sector-size</i> bytes each,
320
sectors. It is also assumed that each aligned block of
321
<i>sector-size</i> bytes of each file is stored in a single device
322
sector. If the file is not an exact multiple of <i>sector-size</i>
323
bytes in size, then the final device sector is partially empty.
325
Normally, SQLite assumes that if a power failure occurs while
326
updating any portion of a sector then the contents of the entire
327
device sector is suspect following recovery. After writing to
328
any part of a sector within a file, it is assumed that the modified
329
sector contents are held in a volatile buffer somewhere within
330
the system (main memory, disk cache etc.). SQLite does not assume
331
that the updated data has reached the persistent storage media, until
332
after it has successfully <i>synced</i> the corresponding file by
333
invoking the VFS xSync() method. <i>Syncing</i> a file causes all
334
modifications to the file up until that point to be committed to
337
Based on the above, SQLite is designed around a model of the
338
file-system whereby any sector of a file written to is considered to be
339
in a transient state until after the file has been successfully
340
<i>synced</i>. Should a power or system failure occur while a sector
341
is in a transient state, it is impossible to predict its contents
342
following recovery. It may be written correctly, not written at all,
343
overwritten with random data, or any combination thereof.
345
For example, if the <i>sector-size</i> of a given file-system is
346
2048 bytes, and SQLite opens a file and writes a 1024 byte block
347
of data to offset 3072 of the file, then according to the model
348
the second sector of the file is in the transient state. If a
349
power failure or operating system crash occurs before or during
350
the next call to xSync() on the file handle, then following system
351
recovery SQLite assumes that all file data between byte offsets 2048
352
and 4095, inclusive, is invalid. It also assumes that since the first
353
sector of the file, containing the data from byte offset 0 to 2047
354
inclusive, is valid, since it was not in a transient state when the
357
Assuming that any and all sectors in the transient state may be
358
corrupted following a power or system failure is a very pessimistic
359
approach. Some modern systems provide more sophisticated guarantees
360
than this. SQLite allows the VFS implementation to specify at runtime
361
that the current platform supports zero or more of the following
364
<li><p>The <b>safe-append</b> property. If a system supports the
365
<i>safe-append</i> property, it means that when a file is extended
366
the new data is written to the persistent media before the size
367
of the file itself is updated. This guarantees that if a failure
368
occurs after a file has been extended, following recovery
369
the write operations that extended the file will appear to have
370
succeeded or not occurred at all. It is not possible for invalid
371
or garbage data to appear in the extended region of the file.
372
<li><p>The <b>atomic-write</b> property. A system that supports this
373
property also specifies the size or sizes of the blocks that it
374
is capable of writing. Valid sizes are powers of two greater than
375
512. If a write operation modifies a block of <i>n</i> bytes,
376
where <i>n</i> is one of the block sizes for which <i>atomic-write</i>
377
is supported, then it is impossible for an aligned write of <i>n</i>
378
bytes to cause data corruption. If a failure occurs after such
379
a write operation and before the applicable file handle is
380
<i>synced</i>, then following recovery it will appear as if the
381
write operation succeeded or did not take place at all. It is not
382
possible that only part of the data specified by the write operation
383
was written to persistent media, nor is it possible for any content
384
of the sectors spanned by the write operation to be replaced with
385
garbage data, as it is normally assumed to be.
386
<li><p>The <b>sequential-write</b> property. A system that supports the
387
<i>sequential-write</i> property guarantees that the various write
388
operations on files within the same file-system are written to the
389
persistent media in the same order that they are performed by the
390
application and that each operation is concluded before the next
391
is begun. If a system supports the <i>sequential-write</i>
392
property, then the model used to determine the possible states of
393
the file-system following a failure is different.
394
<p>If a system supports <i>sequential-write</i> it is assumed that
395
<i>syncing</i> any file within the file system flushes all write
396
operations on all files (not just the <i>synced</i> file) to
397
the persistent media. If a failure does occur, it is not known
398
whether or not any of the write operations performed by SQLite
399
since the last time a file was <i>synced</i>. SQLite is able to
400
assume that if the write operations of unknown status are arranged
401
in the order that they occurred:
403
<li> the first <i>n</i> operations will have been executed
405
<li> the next operation puts all device sectors that it modifies
406
into the transient state, so that following recovery each
407
sector may be partially written, completely written, not
408
written at all or populated with garbage data,
409
<li> the remaining operations will not have had any effect on
410
the contents of the file-system.
413
<h3 id=fs_assumption_details>Failure Related Assumption Details</h3>
415
This section describes how the assumptions presented in the parent
416
section apply to the individual API functions and operations provided
417
by the VFS to SQLite for the purposes of modifying the contents of the
420
SQLite manipulates the contents of the file-system using a combination
421
of the following four types of operation:
423
<li> <b>Create file</b> operations. SQLite may create new files
424
within the file-system by invoking the xOpen() method of
425
the sqlite3_io_methods object.
426
<li> <b>Delete file</b> operations. SQLite may remove files from the
427
file system by calling the xDelete() method of the
428
sqlite3_io_methods object.
429
<li> <b>Truncate file</b> operations. SQLite may truncate existing
430
files by invoking the xTruncate() method of the sqlite3_file
432
<li> <b>Write file</b> operations. SQLite may modify the contents
433
and increase the size of a file by files by invoking the xWrite()
434
method of the sqlite3_file object.
437
Additionally, all VFS implementations are required to provide the
438
<i>sync file</i> operation, accessed via the xSync() method of the
439
sqlite3_file object, used to flush create, write and truncate operations
440
on a file to the persistent storage medium.
442
The formalized assumptions in this section refer to <i>system failure</i>
443
events. In this context, this should be interpreted as any failure that
444
causes the system to stop operating. For example a power failure or
445
operating system crash.
447
SQLite does not assume that a <b>create file</b> operation has actually
448
modified the file-system records within persistent storage until
449
after the file has been successfully <i>synced</i>.
450
<p class=req id=A21001>
451
If a system failure occurs during or after a "create file"
452
operation, but before the created file has been <i>synced</i>, then
453
SQLite assumes that it is possible that the created file may not
454
exist following system recovery.
456
Of course, it is also possible that it does exist following system
458
<p class=req id=A21002>
459
If a "create file" operation is executed by SQLite, and then the
460
created file <i>synced</i>, then SQLite assumes that the file-system
461
modifications corresponding to the "create file" operation have been
462
committed to persistent media. It is assumed that if a system
463
failure occurs any time after the file has been successfully
464
<i>synced</i>, then the file is guaranteed to appear in the file-system
465
following system recovery.
467
A <b>delete file</b> operation (invoked by a call to the VFS xDelete()
468
method) is assumed to be an atomic and durable operation.
470
<p class=req id=A21003>
471
If a system failure occurs at any time after a "delete file"
472
operation (call to the VFS xDelete() method) returns successfully, it is
473
assumed that the file-system will not contain the deleted file following
475
</p><p class=req id=A21004>
476
If a system failure occurs during a "delete file" operation,
477
it is assumed that following system recovery the file-system will
478
either contain the file being deleted in the state it was in before
479
the operation was attempted, or not contain the file at all. It is
480
assumed that it is not possible for the file to have become corrupted
481
purely as a result of a failure occurring during a "delete file"
484
The effects of a <b>truncate file</b> operation are not assumed to
485
be made persistent until after the corresponding file has been
487
<p class=req id=A21005>
488
If a system failure occurs during or after a "truncate file"
489
operation, but before the truncated file has been <i>synced</i>, then
490
SQLite assumes that the size of the truncated file is either as large
491
or larger than the size that it was to be truncated to.
492
</p><p class=req id=A21006>
493
If a system failure occurs during or after a "truncate file"
494
operation, but before the truncated file has been <i>synced</i>, then
495
it is assumed that the contents of the file up to the size that the
496
file was to be truncated to are not corrupted.
498
The above two assumptions may be interpreted to mean that if a
499
system failure occurs after file truncation but before the truncated
500
file is <i>synced</i>, the contents of the file following the point
501
at which it was to be truncated may not be trusted. They may contain
502
the original file data, or may contain garbage.
503
<p class=req id=A21007>
504
If a "truncate file" operation is executed by SQLite, and then the
505
truncated file <i>synced</i>, then SQLite assumes that the file-system
506
modifications corresponding to the "truncate file" operation have been
507
committed to persistent media. It is assumed that if a system
508
failure occurs any time after the file has been successfully
509
<i>synced</i>, then the effects of the file truncation are guaranteed
510
to appear in the file system following recovery.
512
A <b>write file</b> operation modifies the contents of an existing file
513
within the file-system. It may also increase the size of the file.
514
The effects of a <i>write file</i> operation are not assumed to
515
be made persistent until after the corresponding file has been
517
<p class=req id=A21008>
518
If a system failure occurs during or after a "write file"
519
operation, but before the corresponding file has been <i>synced</i>,
520
then it is assumed that the content of all sectors spanned by the
521
<i>write file</i> operation are untrustworthy following system
522
recovery. This includes regions of the sectors that were not
523
actually modified by the write file operation.
524
</p><p class=req id=A21011>
525
If a system failure occurs on a system that supports the
526
<i>atomic-write</i> property for blocks of size <i>N</i> bytes
527
following an aligned write of <i>N</i>
528
bytes to a file but before the file has been successfully <i>synced</i>,
529
then is assumed following recovery that all sectors spanned by the
530
write operation were correctly updated, or that none of the sectors were
532
</p><p class=req id=A21012>
533
If a system failure occurs on a system that supports the
534
<i>safe-append</i> following a write operation that appends data
535
to the end of the file without modifying any of the existing file
536
content but before the file has been successfully <i>synced</i>,
537
then is assumed following recovery that either the data was
538
correctly appended to the file, or that the file size remains
539
unchanged. It is assumed that it is impossible that the file be
540
extended but populated with incorrect data.
541
</p><p class=req id=A21013>
542
Following a system recovery, if a device sector is deemed to be
543
untrustworthy as defined by A21008 and neither A21011 or A21012
544
apply to the range of bytes written, then no assumption can be
545
made about the content of the sector following recovery. It is
546
assumed that it is possible for such a sector to be written
547
correctly, not written at all, populated with garbage data or any
549
</p><p class=req id=A21009>
550
If a system failure occurs during or after a "write file"
551
operation that causes the file to grow, but before the corresponding
552
file has been <i>synced</i>, then it is assumed that the size of
553
the file following recovery is as large or larger than it was when
554
it was most recently <i>synced</i>.
556
If a system supports the <i>sequential-write</i> property, then further
557
assumptions may be made with respect to the state of the file-system
558
following recovery from a <i>system failure</i>. Specifically, it is
559
assumed that create, truncate, delete and write file operations are
560
applied to the persistent representation in the same order as they
561
are performed by SQLite. Furthermore, it is assumed that the
562
file-system waits until one operation is safely written to the
563
persistent media before the next is attempted, just as if the relevant
564
file were <i>synced</i> following each operation.
565
<p class=req id=A21014>
566
If a system failure occurs on a system that supports the
567
<i>sequential-write</i> property, then it is assumed that all
568
operations completed before the last time any file was <i>synced</i>
569
have been successfully committed to persistent media.
570
</p><p class=req id=A21015>
571
If a system failure occurs on a system that supports the
572
<i>sequential-write</i> property, then it is assumed that the set
573
of possible states that the file-system may be in following recovery
574
is the same as if each of the write operations performed since the most
575
recent time a file was <i>synced</i> was itself followed by a <i>sync
576
file</i> operation, and that the system failure may have occurred during
577
any of the write or <i>sync file</i> operations.
580
The return value of the xSectorSize() method, the <i>sector-size</i>, is
581
expected by SQLite to be a power of 2 value greater than or equal to 512.
583
What does it do if this is not the case? If the sector size is less
584
than 512 then 512 is used instead. How about a non power-of-two value?
585
UPDATE: How this situation is handled should be described in the API
586
requirements. Here we can just refer to the other document.
588
SQLite assumes that files are stored and written to within the
589
file-system as a collection of blocks (hereafter sectors) of data, each
590
<i>sector-size</i> bytes in size. This model is used to derive
591
the following assumptions related to the expected state of the
592
file-system following a power failure or operating system crash.
595
After part or all of a file sector has been modified
596
using the xWrite() method of an open file-handle, the sector
597
is said to be in a transient state, where the operating system
598
makes no guarantees about the actual content of the sector on the
599
persistent media. The sector remains in the transient state until
600
the next successful call to xSync() on the same file-handle
601
returns. If a power failure or operating system crash occurs, then
602
part or all of all sectors in the transient state when the crash
603
occurred may contain invalid data following system recovery.
605
Following a power failure or operating system crash, the content
606
of all sectors that were not in a transient state when the crash
607
occurred may be trusted.
610
What do we assume about the other three file-system write
611
operations - xTruncate(), xDelete() and "create file"?
613
The xDeviceCharacteristics() method returns a set of flags,
614
indicating which of the following properties (if any) the
615
file-system provides:
617
<li>The <b><i>sequential IO</i></b> property. If a file-system has this
618
property, then in the event of a crash at most a single sector
619
may contain invalid data. The file-system guarantees
620
<li>The <b><i>safe-append</i></b> property.
621
<li>The <b><i>atomic write</i></b> property.
624
Write an explanation as to how the file-system properties influence
625
the model used to predict file damage after a catastrophe.
627
<h1 id=database_connections>Database Connections</h1>
629
Within this document, the term <i>database connection</i> has a slightly
630
different meaning from that which one might assume. The handles returned
631
by the <code>sqlite3_open()</code> and <code>sqlite3_open16()</code>
632
APIs (<span class=todo>reference</span>) are referred to as <i>database
633
handles</i>. A <i>database connection</i> is a connection to a single
634
database file using a single file-handle, which is held open for the
635
lifetime of the connection. Using the SQL ATTACH syntax, multiple
636
<i>database connections</i> may be accessed via a single <i>database
637
handle</i>. Or, using SQLite's <i>shared-cache mode</i> feature, multiple
638
<i>database handles</i> may access a single <i>database connection</i>.
640
Usually, a new <i>database connection</i> is opened whenever the user opens
641
new <i>database handle</i> on a real database file (not an in-memory
642
database) or when a database file is attached to an existing <i>database
643
connection</i> using the SQL ATTACH syntax. However if the <i>shared-cache
644
mode</i> feature is enabled, then the database file may be accessed through
645
an existing <i>database connection</i>. For more information on
646
<i>shared-cache mode</i>, refer to <span class=todo>Reference</span>. The
647
various IO operations required to open a new connection are detailed in
648
section <cite>open_new_connection</cite> of this document.
650
Similarly, a <i>database connection</i> is usually closed when the user
651
closes a <i>database handle</i> that is open on a real database file or
652
has had one or more real database files attached to it using the ATTACH
653
mechanism, or when a real database file is detached from a <i>database
654
connection</i> using the DETACH syntax. Again, the exception is if
655
<i>shared-cache mode</i> is enabled. In this case, a <i>database
656
connection</i> is not closed until its number of users reaches zero.
657
The IO related steps required to close a <i>database connection</i> are
658
described in section <cite>closing_database_connection</cite>.
660
After sections 4 and 5 are finished, come back here and see if we can add a
661
list of state items associated with each database connection to make things
662
easier to understand. i.e each database connection has a file handle, a set
663
of entries in the page cache, an expected page size etc.
664
<h2 id=open_new_connection>Opening a New Connection</h2>
666
This section describes the VFS operations that take place when a
667
new database connection is created.
669
Opening a new database connection is a two-step process:
671
<li> A file-handle is opened on the database file.
672
<li> If step 1 was successful, an attempt is made to read the
673
<i>database file header</i> from the database file using the
677
In step 2 of the procedure above, the database file is not locked
678
before it is read from. This is the only exception to the locking
679
rules described in section <cite>reading_data</cite>.
681
The reason for attempting to read the <i>database file header</i>
682
is to determine the <i>page-size</i> used by the database file.
683
Because it is not possible to be certain as to the <i>page-size</i>
684
without holding at least a <i>shared lock</i> on the database file
685
(because some other <i>database connection</i> might have changed it
686
since the <i>database file header</i> was read), the value read from the
687
<i>database file header</i> is known as the <i>expected page size</i>.
688
<p class=req id=H35060>
689
When a new <i>database connection</i> is required, SQLite shall attempt
690
to open a file-handle on the database file. If the attempt fails, then
691
no new <i>database connection</i> is created and an error returned.
692
<p class=req id=H35070>
693
When a new <i>database connection</i> is required, after opening the
694
new file-handle, SQLite shall attempt to read the first 100 bytes
695
of the database file. If the attempt fails for any other reason than
696
that the opened file is less than 100 bytes in size, then
697
the file-handle is closed, no new <i>database connection</i> is created
698
and an error returned instead.
699
<p class=req id=H35080>
700
If the <i>database file header</i> is successfully read from a newly
701
opened database file, the connections <i>expected page-size</i> shall
702
be set to the value stored in the <i>page-size field</i> of the
704
<p class=req id=H35090>
705
If the <i>database file header</i> cannot be read from a newly opened
706
database file (because the file is less than 100 bytes in size), the
707
connections <i>expected page-size</i> shall be set to the compile time
708
value of the SQLITE_DEFAULT_PAGESIZE option.
709
<h2 id=closing_database_connection>Closing a Connection</h2>
711
This section describes the VFS operations that take place when an
712
existing database connection is closed (destroyed).
714
Closing a database connection is a simple matter. The open VFS
715
file-handle is closed and in-memory <i>page cache</i> related resources
717
<p class=req id=H35400>
718
When a <i>database connection</i> is closed, SQLite shall close the
719
associated file handle at the VFS level.
720
<p class=req id=H35430>
721
When a <i>database connection</i> is closed, all associated <i>page
722
cache</i> entries shall be discarded.
723
<h1 id=page_cache>The Page Cache</h1>
725
The contents of an SQLite database file are formatted as a set of
726
fixed size pages. See <cite>ff_sqlitert_requirements</cite> for a
727
complete description of the format used. The <i>page size</i> used
728
for a particular database is stored as part of the database file
729
header at a well-known offset within the first 100 bytes of the
730
file. Almost all read and write operations performed by SQLite on
731
database files are done on blocks of data <i>page-size</i> bytes
734
All SQLite database connections running within a single process share
735
a single <i>page cache</i>. The <i>page cache</i> caches data read from
736
database files in main-memory on a per-page basis. When SQLite requires
737
data from a database file to satisfy a database query, it checks the <i>
738
page cache</i> for usable cached versions of the required database
739
pages before loading it from the database file. If no usable cache
740
entry can be found and the database page data is loaded from the database
741
file, it is cached in the <i>page cache</i> in case the same data is
742
needed again later. Because reading from the database file is assumed to
743
be an order of magnitude faster than reading from main-memory, caching
744
database page content in the <i>page cache</i> to minimize the number
745
of read operations performed on the database file is a significant
746
performance enhancement.
748
The <i>page cache</i> is also used to buffer database write operations.
749
When SQLite is required to modify one of more of the <i>database pages</i>
750
that make up a database file, it first modifies the cached version of
751
the page in the <i>page cache</i>. At that point the page is considered
752
a "dirty" page. At some point later on, the new content of the "dirty"
753
page is copied from the <i>page cache</i> into the database file via
754
the VFS interface. Buffering writes in the <i>page cache</i> can reduce
755
the number of write operations required on the database file (in cases
756
where the same page is updated twice) and allows optimizations based
757
on the assumptions outlined in section <cite>fs_performance</cite>.
759
Database read and write operations, and the way in which they interact
760
with and use the <i>page cache</i>, are described in detail in sections
761
<cite>reading_data</cite> and <cite>writing_data</cite> of this document,
764
At any one time, the <i>page cache</i> contains zero or more <i>page cache
765
entries</i>, each of which has the following data associated with it:
768
A reference to <b>the associated <i>database connection</i></b>. Each
769
entry in the <i>page cache</i> is associated with a single <i>database
770
connection</i>; the <i>database connection</i> that created the entry.
771
A <i>page cache entry</i> is only ever used by the <i>database
772
connection</i> that created it. Page cache entries are not shared between
773
<i>database connections</i>.
775
The <b><i>page number</i></b> of the cached page. Pages are sequentially
776
numbered within a database file starting from page 1 (page 1 begins at
777
byte offset 0). Refer to <cite>ff_sqlitert_requirements</cite> for
780
The <b>cached data</b>; a blob of data <i>page-size</i> bytes in size.
783
The first two elements in the list above, the associated <i>database
784
connection</i> and the <i>page number</i>, uniquely identify the
785
<i>page cache entry</i>. At no time may the <i>page cache</i> contain two
786
entries for which both the <i>database connection</i> and <i>page
787
number</i> are identical. Or, put another way, a single <i>database
788
connection</i> never caches more than one copy of a database page
789
within the <i>page cache</i>.
791
At any one time, each <i>page cache entry</i> may be said to be a <i>clean
792
page</i>, a <i>non-writable dirty page</i> or a <i>writable dirty page</i>,
793
according to the following definitions:
795
<li> <p>A <b><i>clean page</i></b> is one for which the cached data
796
currently matches the contents of the corresponding page of
797
the database file. The page has not been modified since it was
798
loaded from the file.
799
<li> <p>A <b><i>dirty page</i></b> is a <i>page cache entry</i> for which
800
the cached data has been modified since it was loaded from the database
801
file, and so no longer matches the current contents of the
802
corresponding database file page. A <i>dirty page</i> is one that is
803
currently buffering a modification made to the database file as part
804
of a <i>write transaction</i>.
805
<li> <p>Within this document, the term <b><i>non-writable dirty
806
page</i></b> is used specifically to refer to a <i>page cache
807
entry</i> with modified content for which it is not yet safe to update
808
the database file with. It is not safe to update a database file with
809
a buffered write if a power or system failure that occurs during or
810
soon after the update may cause the database to become corrupt
811
following system recovery, according to the assumptions made in
812
section <cite>fs_assumption_details</cite>.
813
<li> <p>A <i>dirty page</i> for which it would be safe to update the
814
corresponding database file page with the modified contents of
815
without risking database corruption is known as a
816
<b><i>writable dirty page</i></b>.
819
The exact logic used to determine if a <i>page cache entry</i> with
820
modified content is a <i>dirty page</i> or <i>writable page</i> is
821
presented in section <cite>page_cache_algorithms</cite>.
823
Because main-memory is a limited resource, the <i>page cache</i> cannot
824
be allowed to grow indefinitely. As a result, unless all database files
825
opened by database connections within the process are quite small,
826
sometimes data must be discarded from the <i>page cache</i>. In practice
827
this means <i>page cache entries</i> must be purged to make room
828
for new ones. If a <i>page cache entry</i> being removed from the <i>page
829
cache</i> to free main-memory is a <i>dirty page</i>, then its contents
830
must be saved into the database file before it can be discarded without
831
data loss. The following two sub-sections describe the algorithms used by
832
the <i>page cache</i> to determine exactly when existing <i>page cache
833
entries</i> are purged (discarded).
834
<h2>Page Cache Configuration</h2>
836
Describe the parameters set to configure the page cache limits.
837
<h2 id=page_cache_algorithms>Page Cache Algorithms</h2>
839
Requirements describing the way in which the configuration parameters
840
are used. About LRU etc.
841
<h1 id=reading_data>Reading Data</h1>
843
In order to return data from the database to the user, for example as
844
the results of a SELECT query, SQLite must at some point read data
845
from the database file. Usually, data is read from the database file in
846
aligned blocks of <i>page-size</i> bytes. The exception is when the
847
database file header fields are being inspected, before the
848
<i>page-size</i> used by the database can be known.
850
With two exceptions, a <i>database connection</i> must have an open
851
transaction (either a <i>read-only transaction</i> or a
852
<i>read/write transaction</i>) on the database before data may be
853
read from the database file.
855
The two exceptions are:
857
<li> When an attempt is made to read the 100 byte <i>database file
858
header</i> immediately after opening the <i>database connection</i>
859
(see section <cite>open_new_connection</cite>). When this occurs
860
no lock is held on the database file.
861
<li> Data read while in the process of opening a read-only transaction
862
(see section <cite>open_read_only_trans</cite>). These read
863
operations occur after a <i>shared lock</i> is held on the database
867
Once a transaction has been opened, reading data from a database
868
connection is a simple operation. Using the xRead() method of the
869
file-handle open on the database file, the required database file
870
pages are read one at a time. SQLite never reads partial pages and
871
always uses a single call to xRead() for each required page.
873
After reading the data for a database page, SQLite stores the raw
874
page of data in the <i>page cache</i>. Each time a page of data is
875
required by the upper layers, the <i>page cache</i> is queried
876
to see if it contains a copy of the required page stored by
877
the current <i>database connection</i>. If such an entry can be
878
found, then the required data is read from the <i>page cache</i> instead
879
of the database file. Only a connection with an open transaction
880
transaction (either a <i>read-only transaction</i> or a
881
<i>read/write transaction</i>) on the database may read data from the
882
<i>page cache</i>. In this sense reading from the <i>page cache</i> is no
883
different to reading from the <i>database file</i>.
885
Refer to section <cite>page_cache_algorithms</cite> for a description
886
of exactly how and for how long page data is stored in the
888
<p class=req id=H35010>
889
Except for the read operation required by H35070 and those reads made
890
as part of opening a read-only transaction, SQLite shall ensure that
891
a <i>database connection</i> has an open read-only or read/write
892
transaction when any data is read from the <i>database file</i>.
893
<p class=req id=H35020>
894
Aside from those read operations described by H35070 and H21XXX, SQLite
895
shall read data from the database file in aligned blocks of
896
<i>page-size</i> bytes, where <i>page-size</i> is the database page size
897
used by the database file.
898
<p class=req id=H35420>
899
SQLite shall ensure that a <i>database connection</i> has an open
900
read-only or read/write transaction before using data stored in the <i>page
901
cache</i> to satisfy user queries.
902
<h2 id=open_read_only_trans>Opening a Read-Only Transaction</h2>
904
Before data may be read from a <i>database file</i> or queried from
905
the <i>page cache</i>, a <i>read-only transaction</i> must be
906
successfully opened by the associated database connection (this is true
907
even if the connection will eventually write to the database, as a
908
<i>read/write transaction</i> may only be opened by upgrading from a
909
<i>read-only transaction</i>). This section describes the procedure
910
for opening a <i>read-only transaction</i>.
912
The key element of a <i>read-only transaction</i> is that the
913
file-handle open on the database file obtains and holds a
914
<i>shared-lock</i> on the database file. Because a connection requires
915
an <i>exclusive-lock</i> before it may actually modify the contents
916
of the database file, and by definition while one connection is holding
917
a <i>shared-lock</i> no other connection may hold an
918
<i>exclusive-lock</i>, holding a <i>shared-lock</i> guarantees that
919
no other process may modify the database file while the <i>read-only
920
transaction</i> remains open. This ensures that <i>read-only
921
transactions</i> are sufficiently isolated from the transactions of
922
other database users (see section <cite>overview</cite>).
923
<p>Obtaining the <i>shared lock</i> itself on the database file is quite
924
simple, SQLite just calls the xLock() method of the database file
925
handle. Some of the other processes that take place as part of
926
opening the <i>read-only transaction</i> are quite complex. The
927
steps that SQLite is required to take to open a <i>read-only
928
transaction</i>, in the order in which they must occur, is as follows:
930
<li>A <i>shared-lock</i> is obtained on the database file.
931
<li>The connection checks if a <i>hot journal file</i> exists in the
932
file-system. If one does, then it is rolled back before continuing.
933
<li>The connection checks if the data in the <i>page cache</i> may
934
still be trusted. If not, all page cache data is discarded.
935
<li>If the file-size is not zero bytes and the page cache does not
936
contain valid data for the first page of the database, then the
937
data for the first page must be read from the database.
940
Of course, an error may occur while attempting any of the 4 steps
941
enumerated above. If this happens, then the <i>shared-lock</i> is
942
released (if it was obtained) and an error returned to the user.
943
Step 2 of the procedure above is described in more detail in section
944
<cite>hot_journal_detection</cite>. Section <cite>cache_validation</cite>
945
describes the process identified by step 3 above. Further detail
946
on step 4 may be found in section <cite>read_page_one</cite>.
947
<p class=req id=H35100>
948
When required to open a <i>read-only transaction</i> using a
949
<i>database connection</i>, SQLite shall first attempt to obtain
950
a <i>shared-lock</i> on the file-handle open on the database file.
951
<p class=req id=H35110>
952
If, while opening a <i>read-only transaction</i>, SQLite fails to obtain
953
the <i>shared-lock</i> on the database file, then the process is
954
abandoned, no transaction is opened and an error returned to the user.
956
The most common reason an attempt to obtain a <i>shared-lock</i> may
957
fail is that some other connection is holding an <i>exclusive</i> or
958
<i>pending lock</i>. However it may also fail because some other
959
error (e.g. an IO or comms related error) occurs within the call to the
961
<p class=req id=H35030>
962
While opening a <i>read-only transaction</i>, after successfully
963
obtaining a <i>shared lock</i> on the database file, SQLite shall
964
attempt to detect and roll back a <i>hot journal file</i> associated
965
with the same database file.
966
<p class=req id=H35120>
967
If, while opening a <i>read-only transaction</i>, SQLite encounters
968
an error while attempting to detect or roll back a <i>hot journal
969
file</i>, then the <i>shared-lock</i> on the database file is released,
970
no transaction is opened and an error returned to the user.
972
Section <cite>hot_journal_detection</cite> contains a description of
973
and requirements governing the detection of a hot-journal file referred
974
to in the above requirements.
975
<p class=req id=H35040>
976
Assuming no errors have occurred, then after attempting to detect and
977
roll back a <i>hot journal file</i>, if the <i>page cache</i> contains
978
any entries associated with the current <i>database connection</i>,
979
then SQLite shall validate the contents of the <i>page cache</i> by
980
testing the <i>file change counter</i>. This procedure is known as
981
<i>cache validation</i>.
983
The <i>cache validation</i> process is described in detail in section
984
<cite>cache_validation</cite>
985
<p class=req id=H35050>
986
If the cache validate procedure prescribed by H35040 is required and
987
does not prove that the <i>page cache</i> entries associated with the
988
current <i>database connection</i> are valid, then SQLite shall discard
989
all entries associated with the current <i>database connection</i> from
990
the <i>page cache</i>.
992
The numbered list above notes that the data for the first page of the
993
database file, if it exists and is not already loaded into the <i>page
994
cache</i>, must be read from the database file before the <i>read-only
995
transaction</i> may be considered opened. This is handled by
997
<h3 id=hot_journal_detection>Hot Journal Detection</h3>
999
This section describes the procedure that SQLite uses to detect a
1000
<i>hot journal file</i>. If a <i>hot journal file</i> is detected,
1001
this indicates that at some point the process of writing a
1002
transaction to the database was interrupted and a recovery operation
1003
(<i>hot journal rollback</i>) needs to take place. This section does
1004
not describe the process of <i>hot journal rollback</i> (see section
1005
<cite>hot_journal_rollback</cite>) or the processes by which a
1006
<i>hot journal file</i> may be created (see section
1007
<cite>writing_data</cite>).
1009
The procedure used to detect a <i>hot-journal file</i> is quite
1010
complex. The following steps take place:
1011
<ol class=spacedlist>
1012
<li>Using the VFS xAccess() method, SQLite queries the file-system
1013
to see if the journal file associated with the database exists.
1014
If it does not, then there is no hot-journal file.
1015
<li>By invoking the xCheckReservedLock() method of the file-handle
1016
opened on the database file, SQLite checks if some other connection
1017
holds a <i>reserved lock</i> or greater. If some other connection
1018
does hold a <i>reserved lock</i>, this indicates that the other
1019
connection is midway through a <i>read/write transaction</i> (see
1020
section <cite>writing_data</cite>). In this case the
1021
<i>journal file</i> is not a <i>hot-journal</i> and must not be
1023
<li>Using the xFileSize() method of the file-handle opened
1024
on the database file, SQLite checks if the database file is
1025
0 bytes in size. If it is, the journal file is not considered
1026
to be a <i>hot journal</i> file. Instead of rolling back the
1027
journal file, in this case it is deleted from the file-system
1028
by calling the VFS xDelete() method. <span class=todo>Technically,
1029
there is a race condition here. This step should be moved to
1030
after the exclusive lock is held.</span>
1031
<li>An attempt is made to upgrade to an <i>exclusive lock</i> on the
1032
database file. If the attempt fails, then all locks, including
1033
the recently obtained <i>shared lock</i> are dropped. The attempt
1034
to open a <i>read-only transaction</i> has failed. This occurs
1035
when some other connection is also attempting to open a
1036
<i>read-only transaction</i> and the attempt to gain the
1037
<i>exclusive lock</i> fails because the other connection is also
1038
holding a <i>shared lock</i>. It is left to the other connection
1039
to roll back the <i>hot journal</i>.
1040
<div style="margin-top:0.5em"></div>
1041
It is important that the file-handle lock is upgraded
1042
directly from <i>shared</i> to <i>exclusive</i> in this case,
1043
instead of first upgrading to <i>reserved</i> or </i>pending</i>
1044
locks as is required when obtaining an <i>exclusive lock</i> to
1045
write to the database file (section <cite>writing_data</cite>).
1046
If SQLite were to first upgrade to a <i>reserved</i> or
1047
<i>pending</i> lock in this scenario, then a second process also
1048
trying to open a <i>read-transaction</i> on the database file might
1049
detect the <i>reserved</i> lock in step 2 of this process,
1050
conclude that there was no <i>hot journal</i>, and commence
1051
reading data from the <i>database file</i>.
1052
<li>The xAccess() method is invoked again to detect if the journal
1053
file is still in the file system. If it is, then it is a
1054
hot-journal file and SQLite tries to roll it back (see section
1055
<cite>rollback</cite>).
1057
<p class=todo> Master journal file pointers?
1059
The following requirements describe step 1 of the above procedure in
1061
<p class=req id=H35140>
1062
When required to attempt to detect a <i>hot-journal file</i>, SQLite
1063
shall first use the xAccess() method of the VFS layer to check if a
1064
journal file exists in the file-system.
1065
<p class=req id=H35510>
1066
If the call to xAccess() required by H35140 fails (due to an IO error or
1067
similar), then SQLite shall abandon the attempt to open a <i>read-only
1068
transaction</i>, relinquish the <i>shared lock</i> held on the database
1069
file and return an error to the user.
1070
<p class=req id=H35150>
1071
When required to attempt to detect a <i>hot-journal file</i>, if the
1072
call to xAccess() required by H35140 indicates that a journal file does
1073
not exist, then SQLite shall conclude that there is no <i>hot-journal
1074
file</i> in the file system and therefore that no <i>hot journal
1075
rollback</i> is required.
1077
The following requirements describe step 2 of the above procedure in
1079
<p class=req id=H35160>
1080
When required to attempt to detect a <i>hot-journal file</i>, if the
1081
call to xAccess() required by H35140 indicates that a journal file
1082
is present, then the xCheckReservedLock() method of the database file
1083
file-handle is invoked to determine whether or not some other
1084
process is holding a <i>reserved</i> or greater lock on the database
1086
<p class=req id=H35520>
1087
If the call to xCheckReservedLock() required by H35160 fails (due to an
1088
IO or other internal VFS error), then SQLite shall abandon the attempt
1089
to open a <i>read-only transaction</i>, relinquish the <i>shared lock</i>
1090
held on the database file and return an error to the user.
1091
<p class=req id=H35170>
1092
If the call to xCheckReservedLock() required by H35160 indicates that
1093
some other <i>database connection</i> is holding a <i>reserved</i>
1094
or greater lock on the database file, then SQLite shall conclude that
1095
there is no <i>hot journal file</i>. In this case the attempt to detect
1096
a <i>hot journal file</i> is concluded.
1098
The following requirements describe step 3 of the above procedure in
1100
<p class=req id=H35440>
1101
If while attempting to detect a <i>hot-journal file</i> the call to
1102
xCheckReservedLock() indicates that no process holds a <i>reserved</i>
1103
or greater lock on the <i>database file</i>, then SQLite shall open
1104
a file handle on the potentially hot journal file using the VFS xOpen()
1106
<p class=req id=H35530>
1107
If the call to xOpen() required by H35440 fails (due to an IO or other
1108
internal VFS error), then SQLite shall abandon the attempt to open a
1109
<i>read-only transaction</i>, relinquish the <i>shared lock</i> held on
1110
the database file and return an error to the user.
1111
<p class=req id=H35450>
1112
After successfully opening a file-handle on a potentially hot journal
1113
file, SQLite shall query the file for its size in bytes using the
1114
xFileSize() method of the open file handle.
1115
<p class=req id=H35540>
1116
If the call to xFileSize() required by H35450 fails (due to an IO or
1117
other internal VFS error), then SQLite shall abandon the attempt to open
1118
a <i>read-only transaction</i>, relinquish the <i>shared lock</i> held on
1119
the database file, close the file handle opened on the journal file and
1120
return an error to the user.
1121
<p class=req id=H35460>
1122
If the size of a potentially hot journal file is revealed to be zero
1123
bytes by a query required by H35450, then SQLite shall close the
1124
file handle opened on the journal file and delete the journal file using
1125
a call to the VFS xDelete() method. In this case SQLite shall conclude
1126
that there is no <i>hot journal file</i>.
1127
<p class=req id=H35550>
1128
If the call to xDelete() required by H35450 fails (due to an IO or
1129
other internal VFS error), then SQLite shall abandon the attempt to open
1130
a <i>read-only transaction</i>, relinquish the <i>shared lock</i> held on
1131
the database file and return an error to the user.
1133
The following requirements describe step 4 of the above procedure in
1135
<p class=req id=H35470>
1136
If the size of a potentially hot journal file is revealed to be greater
1137
than zero bytes by a query required by H35450, then SQLite shall attempt
1138
to upgrade the <i>shared lock</i> held by the <i>database connection</i>
1139
on the <i>database file</i> directly to an <i>exclusive lock</i>.
1140
<p class=req id=H35480>
1141
If an attempt to upgrade to an <i>exclusive lock</i> prescribed by
1142
H35470 fails for any reason, then SQLite shall release all locks held by
1143
the <i>database connection</i> and close the file handle opened on the
1144
<i>journal file</i>. The attempt to open a <i>read-only transaction</i>
1145
shall be deemed to have failed and an error returned to the user.
1147
Finally, the following requirements describe step 5 of the above
1148
procedure in more detail.
1149
<p class=req id=H35490>
1150
If, as part of the <i>hot journal file</i> detection process, the
1151
attempt to upgrade to an <i>exclusive lock</i> mandated by H35470 is
1152
successful, then SQLite shall query the file-system using the xAccess()
1153
method of the VFS implementation to test whether or not the journal
1154
file is still present in the file-system.
1155
<p class=req id=H35560>
1156
If the call to xAccess() required by H35490 fails (due to an IO or
1157
other internal VFS error), then SQLite shall abandon the attempt to open
1158
a <i>read-only transaction</i>, relinquish the lock held on the
1159
database file, close the file handle opened on the journal file and
1160
return an error to the user.
1161
<p class=req id=H35570>
1162
If the call to xAccess() required by H35490 reveals that the journal
1163
file is no longer present in the file system, then SQLite shall abandon
1164
the attempt to open a <i>read-only transaction</i>, relinquish the
1165
lock held on the database file, close the file handle opened on the
1166
journal file and return an SQLITE_BUSY error to the user.
1167
<p class=req id=H35500>
1168
If the xAccess() query required by H35490 reveals that the journal
1169
file is still present in the file system, then SQLite shall conclude
1170
that the journal file is a <i>hot journal file</i> that needs to
1171
be rolled back. SQLite shall immediately begin <i>hot journal
1173
<h3 id=cache_validation>Cache Validation</h3>
1175
When a <i>database connection</i> opens a <i>read transaction</i>, the
1176
<i>page cache</i> may already contain data associated with the
1177
<i>database connection</i>. However, if another process has modified
1178
the database file since the cached pages were loaded it is possible that
1179
the cached data is invalid.
1181
SQLite determines whether or not the <i>page cache</i> entries belonging
1182
to the <i>database connection</i> are valid or not using the <i>file
1183
change counter</i>, a field in the <i>database file header</i>. The
1184
<i>file change counter</i> is a 4-byte big-endian integer field stored
1185
starting at byte offset 24 of the <i>database file header</i>. Before the
1186
conclusion of a <i>read/write transaction</i> that modifies the contents
1187
of the database file in any way (see section <cite>writing_data</cite>),
1188
the value stored in the <i>file change counter</i> is incremented. When
1189
a <i>database connection</i> unlocks the database file, it stores the
1190
current value of the <i>file change counter</i>. Later, while opening a
1191
new <i>read-only transaction</i>, SQLite checks the value of the <i>file
1192
change counter</i> stored in the database file. If the value has not
1193
changed since the database file was unlocked, then the <i>page cache</i>
1194
entries can be trusted. If the value has changed, then the <i>page
1195
cache</i> entries cannot be trusted and all entries associated with
1196
the current <i>database connection</i> are discarded.
1197
<p class=req id=H35180>
1198
When a file-handle open on a database file is unlocked, if the
1199
<i>page cache</i> contains one or more entries belonging to the
1200
associated <i>database connection</i>, SQLite shall store the value
1201
of the <i>file change counter</i> internally.
1202
<p class=req id=H35190>
1203
When required to perform <i>cache validation</i> as part of opening
1204
a <i>read transaction</i>, SQLite shall read a 16 byte block
1205
starting at byte offset 24 of the <i>database file</i> using the xRead()
1206
method of the <i>database connections</i> file handle.
1208
Why a 16 byte block? Why not 4? (something to do with encrypted
1210
<p class=req id=H35200>
1211
While performing <i>cache validation</i>, after loading the 16 byte
1212
block as required by H35190, SQLite shall compare the 32-bit big-endian
1213
integer stored in the first 4 bytes of the block to the most
1214
recently stored value of the <i>file change counter</i> (see H35180).
1215
If the values are not the same, then SQLite shall conclude that
1216
the contents of the cache are invalid.
1218
Requirement H35050 (section <cite>open_read_only_trans</cite>)
1219
specifies the action SQLite is required to take upon determining that
1220
the cache contents are invalid.
1221
<h3 id=read_page_one>Page 1 and the Expected Page Size</h3>
1223
As the last step in opening a <i>read transaction</i> on a database
1224
file that is more than 0 bytes in size, SQLite is required to load
1225
data for page 1 of the database into the <i>page cache</i>, if it is
1226
not already there. This is slightly more complicated than it seems,
1227
as the database <i>page-size</i> is no known at this point.
1229
Even though the database <i>page-size</i> cannot be known for sure,
1230
SQLite is usually able to guess correctly by assuming it to be equal to
1231
the connections <i>expected page size</i>. The <i>expected page size</i>
1232
is the value of the <i>page-size</i> field read from the
1233
<i>database file header</i> while opening the database connection
1234
(see section <cite>open_new_connection</cite>), or the <i>page-size</i>
1235
of the database file when the most <i>read transaction</i> was concluded.
1236
<p class=req id=H35210>
1237
During the conclusion of a <i>read transaction</i>, before unlocking
1238
the database file, SQLite shall set the connections
1239
<i>expected page size</i> to the current database <i>page-size</i>.
1240
<p class=req id=H35220>
1241
As part of opening a new <i>read transaction</i>, immediately after
1242
performing <i>cache validation</i>, if there is no data for database
1243
page 1 in the <i>page cache</i>, SQLite shall read <i>N</i> bytes from
1244
the start of the database file using the xRead() method of the
1245
connections file handle, where <i>N</i> is the connections current
1246
<i>expected page size</i> value.
1247
<p class=req id=H35230>
1248
If page 1 data is read as required by H35230, then the value of the
1249
<i>page-size</i> field that appears in the database file header that
1250
consumes the first 100 bytes of the read block is not the same as the
1251
connections current <i>expected page size</i>, then the
1252
<i>expected page size</i> is set to this value, the database file is
1253
unlocked and the entire procedure to open a <i>read transaction</i>
1255
<p class=req id=H35240>
1256
If page 1 data is read as required by H35230, then the value of the
1257
<i>page-size</i> field that appears in the database file header that
1258
consumes the first 100 bytes of the read block is the same as the
1259
connections current <i>expected page size</i>, then the block of data
1260
read is stored in the <i>page cache</i> as page 1.
1261
<h2>Reading Database Data</h2>
1263
Add something about checking the page-cache first etc.
1264
<h2>Ending a Read-only Transaction</h2>
1266
To end a <i>read-only transaction</i>, SQLite simply relinquishes the
1267
<i>shared lock</i> on the file-handle open on the database file. No
1268
other action is required.
1269
<p class=req id=H35130>
1270
When required to end a <i>read-only transaction</i>, SQLite shall
1271
relinquish the <i>shared lock</i> held on the database file by
1272
calling the xUnlock() method of the file-handle.
1274
See also requirements H35180 and H35210 above.
1275
<h1 id=writing_data>Writing Data</h1>
1277
Using DDL or DML SQL statements, SQLite users may modify the contents and
1278
size of a database file. Exactly how changes to the logical database are
1279
translated to modifications to the database file is described in
1280
<cite>ff_sqlitert_requirements</cite>. From the point of view of the
1281
sub-systems described in this document, each DDL or DML statement executed
1282
results in the contents of zero or more database file pages being
1283
overwritten with new data. A DDL or DML statement may also append or
1284
truncate one or more pages to or from the end of the database file. One
1285
or more DDL and/or DML statements are grouped together to make up a
1286
single <i>write transaction</i>. A <i>write transaction</i> is required
1287
to have the special properties described in section <cite>overview</cite>;
1288
a <i>write transaction</i> must be isolated, durable and atomic.
1290
SQLite accomplishes these goals using the following techniques:
1293
To ensure that <i>write transactions</i> are <b>isolated</b>, before
1294
beginning to modify the contents of the database file to reflect the
1295
results of a <i>write transaction</i>, SQLite obtains an <i>exclusive
1296
lock</i> on the <i>database file</i>. The lock is not relinquished
1297
until the <i>write transaction</i> is concluded. Because reading from
1298
the <i>database file</i> requires a <i>shared lock</i> (see section
1299
<cite>reading_data</cite>) and holding an <i>exclusive
1300
lock</i> guarantees that no other <i>database connection</i> is holding
1301
or can obtain a <i>shared lock</i>, this ensures that no other
1302
connection may read data from the <i>database file</i> at a point when
1303
a <i>write transaction</i> has been partially applied.
1304
<li><p>Ensuring that <i>write transactions</i> are <b>atomic</b> is the most
1305
complex task required of the system. In this case, <i>atomic</i> means
1306
that even if a system failure occurs, an attempt to commit a <i>write
1307
transaction</i> to the database file either results in all changes
1308
that are a part of the transaction being successfully applied to the
1309
database file, or none of the changes are successfully applied. There
1310
is no chance that a subset of the changes only are applied. Hence from
1311
the point of view of an external observer, the <i>write transaction</i>
1312
appears to be an atomic event.
1314
Of course, it is usually not possible to atomically apply all the
1315
changes required by a <i>write transaction</i> to a database file
1316
within the file-system. For example, if a <i>write transaction</i>
1317
requires ten pages of a database file to be modified, and a power
1318
outage causes a system failure after sqlite has modified only five
1319
pages, then the database file will almost certainly be in an
1320
inconsistent state following system recovery.
1322
SQLite solves this problem by using a <i>journal file</i>. In almost
1323
all cases, before the <i>database file</i> is modified in any way,
1324
SQLite stores sufficient information in the <i>journal file</i> to
1325
allow the original the database file to be reconstructed if a system
1326
failure occurs while the database file is being updated to reflect
1327
the modifications made by the <i>write transaction</i>. Each time
1328
SQLite opens a database file, it checks if such a system failure has
1329
occurred and, if so,
1330
reconstructs the database file based on the contents
1331
of the journal file. The procedure used to detect whether or not this
1332
process, coined <i>hot journal rollback</i>, is required is described
1333
in section <cite>hot_journal_detection</cite>. <i>Hot journal rollback
1334
</i> itself is described in section <cite>hot_journal_rollback</cite>.
1336
The same technique ensures that an SQLite database file cannot be
1337
corrupted by a system failure that occurs at an inopportune moment.
1338
If a system failure does occur before SQLite has had a chance to
1339
execute sufficient <i>sync file</i> operations to ensure that the
1340
changes that make up a <i>write transaction</i> have made it safely
1341
to persistent storage, then the <i>journal file</i> will be used
1342
to restore the database to a known good state following system
1345
So that <i>write transactions</i> are <b>durable</b> in the face of
1346
a system failure, SQLite executes a <i>sync file</i> operation on the
1347
database file before concluding the <i>write transaction</i>
1350
The <i>page cache</i> is used to buffer modifications to the database
1351
file image before they are written to the <i>database file</i>. When
1352
the contents of a page is required to be modified as the results of
1353
an operation within a <i>write transaction</i>, the modified copy is
1354
stored in the <i>page cache</i>. Similarly, if new pages are appended
1355
to the end of a database file, they are added to the <i>page cache</i>
1356
instead of being immediately written to the database file within the
1359
Ideally, all changes for an entire write transaction are buffered in
1360
the page cache until the end of the transaction. When the user commits
1361
the transaction, all changes are applied to the database file in the
1362
most efficient way possible, taking into account the assumptions
1363
enumerated in section <cite>fs_performance</cite>. Unfortunately, since
1364
main-memory is a limited resource, this is not always possible for
1365
large transactions. In this case changes are buffered in the <i>page
1366
cache</i> until some internal condition or limit is reached,
1367
then written out to the database file in order to free resources
1368
as they are required. Section <cite>page_cache_algorithms</cite>
1369
describes the circumstances under which changes are flushed through
1370
to the database file mid-transaction to free <i>page cache</i> resources.
1372
Even if an application or system failure does not occur while a
1373
<i>write transaction</i> is in progress, a rollback operation to restore
1374
the database file and <i>page cache</i> to the state that it was in before
1375
the transaction started may be required. This may occur if the user
1376
explicitly requests transaction rollback (by issuing a "ROLLBACK" command),
1377
or automatically, as a result of encountering an SQL constraint (see
1378
<cite>sql_sqlitert_requirements</cite>). For this reason, the original page
1379
content is stored in the <i>journal file</i> before the page is even
1380
modified within the <i>page cache</i>.
1382
Introduce the following sub-sections.
1383
<h2 id=journal_file_format>Journal File Format</h2>
1385
This section describes the format used by an SQLite <i>journal file</i>.
1387
A journal file consists of one or more <i>journal headers</i>, zero
1388
or more <i>journal records</i> and optionally a <i>master journal
1389
pointer</i>. Each journal file always begins with a
1390
<i>journal header</i>, followed by zero or more <i>journal records</i>.
1391
Following this may be a second <i>journal header</i> followed by a
1392
second set of zero or more <i>journal records</i> and so on. There
1393
is no limit to the number of <i>journal headers</i> a journal file
1394
may contain. Following the <i>journal headers</i> and their accompanying
1395
sets of <i>journal records</i> may be the optional <i>master journal
1396
pointer</i>. Or, the file may simply end following the final <i>journal
1399
This section only describes the format of the journal file and the
1400
various objects that make it up. But because a journal file may be
1401
read by an SQLite process following recovery from a system failure
1402
(<i>hot journal rollback</i>, see section
1403
<cite>hot_journal_rollback</cite>) it is also important to describe
1404
the way the file is created and populated within the file-system
1405
using a combination of <i>write file</i>, <i>sync file</i> and
1406
<i>truncate file</i> operations. These are described in section
1407
<cite>write_transactions</cite>.
1408
<h3 id=journal_header_format>Journal Header Format</h3>
1410
A <i>journal header</i> is <i>sector-size</i> bytes in size, where <i>
1411
sector-size</i> is the value returned by the xSectorSize method of
1412
the file handle opened on the database file. Only the first 28 bytes
1413
of the <i>journal header</i> are used, the remainder may contain garbage
1414
data. The first 28 bytes of each <i>journal header</i> consists of an
1415
eight byte block set to a well-known value, followed by five big-endian
1416
32-bit unsigned integer fields.
1417
<center><img src="images/fileformat/journal_header.gif">
1418
<p><i>Figure <span class=fig id=figure_journal_header></span> - Journal Header Format</i>
1421
Figure <cite>figure_journal_header</cite> graphically depicts the layout
1422
of a <i>journal header</i>. The individual fields are described in
1423
the following table. The offsets in the 'byte offset' column of the
1424
table are relative to the start of the <i>journal header</i>.
1425
<table class=striped>
1426
<tr><th>Byte offset<th>Size in bytes<th width=100%>Description
1427
<tr><td>0<td>8<td>The <b>journal magic</b> field always contains a
1428
well-known 8-byte string value used to identify SQLite
1429
journal files. The well-known sequence of byte values
1431
<pre>0xd9 0xd5 0x05 0xf9 0x20 0xa1 0x63 0xd7</pre>
1432
<tr><td>8<td>4<td>This field, the <b>record count</b>, is set to the
1433
number of <i>journal records</i> that follow this
1434
<i>journal header</i> in the <i>journal file</i>.
1435
<tr><td>12<td>4<td>The <b>checksum initializer</b> field is set to a
1436
pseudo-random value. It is used as part of the
1437
algorithm to calculate the checksum for all <i>journal
1438
records</i> that follow this <i>journal header</i>.
1439
<tr><td>16<td>4<td>This field, the <b>database page count</b>, is set
1440
to the number of pages that the <i>database file</i>
1441
contained before any modifications associated with
1442
<i>write transaction</i> are applied.
1443
<tr><td>20<td>4<td>This field, the <b>sector size</b>, is set to the
1444
<i>sector size</i> of the device on which the
1445
<i>journal file</i> was created, in bytes. This value
1446
is required when reading the journal file to determine
1447
the size of each <i>journal header</i>.
1448
<tr><td>24<td>4<td>The <b>page size</b> field contains the database page
1449
size used by the corresponding <i>database file</i>
1450
when the <i>journal file</i> was created, in bytes.
1453
All <i>journal headers</i> are positioned in the file so that they
1454
start at a <i>sector size</i> aligned offset. To achieve this, unused
1455
space may be left between the start of the second and subsequent
1456
<i>journal headers</i> and the end of the <i>journal records</i>
1457
associated with the previous header.
1458
<h3 id=journal_record_format>Journal Record Format</h3>
1460
Each <i>journal record</i> contains the original data for a database page
1461
modified by the <i>write transaction</i>. If a rollback is required, then
1462
this data may be used to restore the contents of the database page to the
1463
state it was in before the <i>write transaction</i> was started.
1464
<center><img src="images/fileformat/journal_record.gif">
1465
<p><i>Figure <span class=fig id=figure_journal_record></span> - Journal Record Format</i>
1468
A <i>journal record</i>, depicted graphically by figure
1469
<cite>figure_journal_record</cite>, contains three fields, as described
1470
in the following table. Byte offsets are relative to the start of the
1471
<i>journal record</i>.
1472
<table class=striped>
1473
<tr><th>Byte offset<th>Size in bytes<th width=100%>Description
1474
<tr><td>0<td>4<td>The page number of the database page associated with
1475
this <i>journal record</i>, stored as a 4 byte
1476
big-endian unsigned integer.
1477
<tr><td>4<td><i>page-size<td>
1478
This field contains the original data for the page,
1479
exactly as it appeared in the database file before the
1480
<i>write transaction</i> began.
1481
<tr><td style="white-space: nowrap">4 + <i>page-size</i><td>4<td>
1482
This field contains a checksum value, calculated based
1483
on the contents of the journaled database page data
1484
(the previous field) and the values stored in the
1485
<i>checksum initializer</i> field of the preceding
1486
<i>journal header</i>.
1489
The set of <i>journal records</i> that follow a <i>journal header</i>
1490
in a <i>journal file</i> are packed tightly together. There are no
1491
alignment requirements for <i>journal records</i> as there are for
1492
<i>journal headers</i>.
1493
<h3>Master Journal Pointer</h3>
1495
To support <i>atomic</i> transactions that modify more than one
1496
database file, SQLite sometimes includes a <i>master journal pointer</i>
1497
record in a <i>journal file</i>. Multiple file transactions are
1498
described in section <cite>multifile_transactions</cite>. A <i>
1499
master journal pointer</i> contains the name of a <i>master journal-file
1500
</i> along with a check-sum and some well known values that allow
1501
the <i>master journal pointer</i> to be recognized as such when
1502
the <i>journal file</i> is read during a rollback operation (section
1503
<cite>rollback</cite>).
1505
As is the case for a <i>journal header</i>, the start of a <i>master
1506
journal pointer</i> is always positioned at a <i>sector size</i>
1507
aligned offset. If the <i>journal record</i> or <i>journal header</i>
1508
that appears immediately before the <i>master journal pointer</i> does
1509
not end at an aligned offset, then unused space is left between the
1510
end of the <i>journal record</i> or <i>journal header</i> and the start
1511
of the <i>master journal pointer</i>.
1512
<center><img src="images/fileformat/master_journal_ptr.gif">
1513
<p><i>Figure <span class=fig id=figure_master_journal_ptr></span> - Master Journal Pointer Format</i>
1516
A <i>master journal pointer</i>, depicted graphically by figure
1517
<cite>figure_master_journal_ptr</cite>, contains five fields, as
1518
described in the following table. Byte offsets are relative to the
1519
start of the <i>master journal pointer</i>.
1520
<table class=striped>
1521
<tr><th>Byte offset<th>Size in bytes<th width=100%>Description
1522
<tr><td>0<td>4<td>This field, the <b>locking page number</b>, is always
1523
set to the page number of the database <i>locking page</i>
1524
stored as a 4-byte big-endian integer. The <i>locking page</i>
1525
is the page that begins at byte offset 2<super>30</super> of the
1526
database file. Even if the database file is large enough to
1527
contain the <i>locking page</i>, the <i>locking page</i> is
1528
never used to store any data and so the first four bytes of of a
1529
valid <i>journal record</i> will never contain this value. For
1530
further description of the <i>locking page</i>, refer to
1531
<cite>ff_sqlitert_requirements</cite>.
1532
<tr><td>4<td><i>name-length</i><td>
1533
The <b>master journal name</b> field contains the name of the
1534
master journal file, encoded as a utf-8 string. There is no
1535
nul-terminator appended to the string.
1536
<tr><td>4 + <i>name-length</i><td><i>4<td>
1537
The <b>name-length</b> field contains the length of the
1538
previous field in bytes, formatted as a 4-byte big-endian
1540
<tr><td>8 + <i>name-length</i><td><i>4<td>
1541
The <b>checksum</b> field contains a checksum value stored as
1542
a 4-byte big-endian signed integer. The checksum value is
1543
calculated as the sum of the bytes that make up the <i>
1544
master journal name</i> field, interpreting each byte as
1545
an 8-bit signed integer.
1546
<tr><td style="white-space: nowrap">12 + <i>name-length</i><td><i>8<td>
1547
Finally, the <b>journal magic</b> field always contains a
1548
well-known 8-byte string value; the same value stored in the
1549
first 8 bytes of a <i>journal header</i>. The well-known
1550
sequence of bytes is:
1551
<pre>0xd9 0xd5 0x05 0xf9 0x20 0xa1 0x63 0xd7</pre>
1553
<h2 id=write_transactions>Write Transactions</h2>
1555
This section describes the progression of an SQLite <i>write
1556
transaction</i>. From the point of view of the systems described in
1557
this document, most <i>write transactions</i> consist of three steps:
1559
<li><p>The <i>write transaction</i> is opened. This process is described
1560
in section <cite>opening_a_write_transaction</cite>.
1561
<li><p>The end-user executes DML or DDL SQL statements that require the
1562
structure of the database file of the database file to be modified.
1563
These modifications may be any combination of operations to
1564
<ul><li>modify the content of an existing database page,
1565
<li>append a new database page to the database file image, or
1566
<li>truncate (discard) a database page from the end of the
1569
These operations are described in detail in section
1570
<cite>modifying_appending_truncating</cite>. How user DDL or DML
1571
SQL statements are mapped to combinations of these three operations
1572
is described in <cite>ff_sqlitert_requirements</cite>.
1573
<li><p>The <i>write transaction</i> is concluded and the changes made
1574
permanently committed to the database. The process required to
1575
commit a transaction is described in section
1576
<cite>committing_a_transaction</cite>.
1579
As an alternative to step 3 above, the transaction may be rolled back.
1580
Transaction rollback is described in section <cite>rollback</cite>.
1581
Finally, it is also important to remember that a <i>write transaction</i>
1582
may be interrupted by a <i>system failure</i> at any point. In this
1583
case, the contents of the file-system (the <i>database file</i> and
1584
<i>journal file</i>) must be left in such a state so as to enable
1585
the <i>database file</i> to be restored to the state it was in before
1586
the interrupted <i>write transaction</i> was started. This is known
1587
as <i>hot journal rollback</i>, and is described in section
1588
<cite>hot_journal_rollback</cite>. Section
1589
<cite>fs_assumption_details</cite> describes the assumptions made
1590
regarding the effects of a <i>system failure</i> on the file-system
1591
contents following recovery.
1592
<h3 id=opening_a_write_transaction>Beginning a Write Transaction</h3>
1594
Before any database pages may be modified within the <i>page cache</i>,
1595
the <i>database connection</i> must open a <i>write transaction</i>.
1596
Opening a <i>write transaction</i> requires that the <i>database
1597
connection</i> obtains a <i>reserved lock</i> (or greater) on the
1598
<i>database file</i>. Because a obtaining a <i>reserved lock</i> on
1599
a <i>database file</i> guarantees that no other <i>database
1600
connection</i> may hold or obtain a <i>reserved lock</i> or greater,
1601
it follows that no other <i>database connection</i> may have an
1602
open <i>write transaction</i>.
1604
A <i>reserved lock</i> on the <i>database file</i> may be thought of
1605
as an exclusive lock on the <i>journal file</i>. No
1606
<i>database connection</i> may read from or write to a <i>journal
1607
file</i> without a <i>reserved</i> or greater lock on the corresponding
1608
<i>database file</i>.
1610
Before opening a <i>write transaction</i>, a <i>database connection</i>
1611
must have an open <i>read transaction</i>, opened via the procedure
1612
described in section <cite>open_read_only_trans</cite>. This ensures
1613
that there is no <i>hot-journal file</i> that needs to be rolled back
1614
and that any data stored in the <i>page cache</i> can be trusted.
1616
Once a <i>read transaction</i> has been opened, upgrading to a
1617
<i>write transaction</i> is a two step process, as follows:
1619
<li>A <i>reserved lock</i> is obtained on the <i>database file</i>.
1620
<li>The <i>journal file</i> is opened and created if necessary (using
1621
the VFS xOpen method), and a <i>journal file header</i> written
1622
to the start of it using a single call to the file handles xWrite
1626
Requirements describing step 1 of the above procedure in detail:
1627
<p class=req id=H35350>
1628
When required to open a <i>write transaction</i> on the database,
1629
SQLite shall first open a <i>read transaction</i>, if the <i>database
1630
connection</i> in question has not already opened one.
1631
<p class=req id=H35360>
1632
When required to open a <i>write transaction</i> on the database, after
1633
ensuring a <i>read transaction</i> has already been opened, SQLite
1634
shall obtain a <i>reserved lock</i> on the database file by calling
1635
the xLock method of the file-handle open on the database file.
1636
<p class=req id=H35580>
1637
If an attempt to acquire a <i>reserved lock</i> prescribed by
1638
requirement H35360 fails, then SQLite shall deem the attempt to
1639
open a <i>write transaction</i> to have failed and return an error
1642
Requirements describing step 2 of the above procedure in detail:
1643
<p class=req id=H35370>
1644
When required to open a <i>write transaction</i> on the database, after
1645
obtaining a <i>reserved lock</i> on the database file, SQLite shall
1646
open a read/write file-handle on the corresponding <i>journal file</i>.
1647
<p class=req id=H35380>
1648
When required to open a <i>write transaction</i> on the database, after
1649
opening a file-handle on the <i>journal file</i>, SQLite shall append
1650
a <i>journal header</i> to the (currently empty) <i>journal file</i>.
1651
<h4 id=writing_journal_header>Writing a Journal Header</h4>
1653
Requirements describing how a <i>journal header</i> is appended to
1655
<p class=req id=H35680>
1656
When required to append a <i>journal header</i> to the <i>journal
1657
file</i>, SQLite shall do so by writing a block of <i>sector-size</i>
1658
bytes using a single call to the xWrite method of the file-handle
1659
open on the <i>journal file</i>. The block of data written shall begin
1660
at the smallest sector-size aligned offset at or following the current
1661
end of the <i>journal file</i>.
1662
<p class=req id=H35690>
1663
The first 8 bytes of the <i>journal header</i> required to be written
1664
by H35680 shall contain the following values, in order from byte offset 0
1665
to 7: 0xd9, 0xd5, 0x05, 0xf9, 0x20, 0xa1, 0x63 and 0xd7.
1666
<p class=req id=H35700>
1667
Bytes 8-11 of the <i>journal header</i> required to be written by
1668
H35680 shall contain 0x00.
1669
<p class=req id=H35710>
1670
Bytes 12-15 of the <i>journal header</i> required to be written by
1671
H35680 shall contain the number of pages that the database file
1672
contained when the current <i>write-transaction</i> was started,
1673
formatted as a 4-byte big-endian unsigned integer.
1674
<p class=req id=H35720>
1675
Bytes 16-19 of the <i>journal header</i> required to be written by
1676
H35680 shall contain pseudo-randomly generated values.
1677
<p class=req id=H35730>
1678
Bytes 20-23 of the <i>journal header</i> required to be written by
1679
H35680 shall contain the <i>sector size</i> used by the VFS layer,
1680
formatted as a 4-byte big-endian unsigned integer.
1681
<p class=req id=H35740>
1682
Bytes 24-27 of the <i>journal header</i> required to be written by
1683
H35680 shall contain the <i>page size</i> used by the database at
1684
the start of the <i>write transaction</i>, formatted as a 4-byte
1685
big-endian unsigned integer.
1686
<h3 id=modifying_appending_truncating>
1687
Modifying, Adding or Truncating a Database Page
1690
When the end-user executes a DML or DDL SQL statement to modify the
1691
database schema or content, SQLite is required to update the database
1692
file image to reflect the new database state. This involves modifying
1693
the content of, appending or truncating one of more database file
1694
pages. Instead of modifying the database file directly using the VFS
1695
interface, changes are first buffered within the <i>page cache</i>.
1697
Before modifying a database page within the <i>page cache</i> that
1698
may need to be restored by a rollback operation, the page must be
1699
<i>journalled</i>. <i>Journalling a page</i> is the process of copying
1700
that pages original data into the journal file so that it can be
1701
recovered if the <i>write transaction</i> is rolled back. The process
1702
of journalling a page is described in section
1703
<cite>journalling_a_page</cite>.
1704
<p class=req id=H35590>
1705
When required to modify the contents of an existing database page that
1706
existed and was not a <i>free-list leaf page</i> when the <i>write
1707
transaction</i> was opened, SQLite shall journal the page if it has not
1708
already been journalled within the current <i>write transaction</i>.
1709
<p class=req id=H35600>
1710
When required to modify the contents of an existing database page,
1711
SQLite shall update the cached version of the database page content
1712
stored as part of the <i>page cache entry</i> associated with the page.
1714
When a new database page is appended to a database file, there is
1715
no requirement to add a record to the <i>journal file</i>. If a
1716
rollback is required the database file will simply be truncated back
1717
to its original size based on the value stored at byte offset 12
1718
of the <i>journal file</i>.
1719
<p class=req id=H35610>
1720
When required to append a new database page to the database file,
1721
SQLite shall create a new <i>page cache entry</i> corresponding to
1722
the new page and insert it into the <i>page cache</i>. The <i>dirty
1723
flag</i> of the new <i>page cache entry</i> shall be set.
1725
If required to truncate a database page from the end of the database
1726
file, the associated <i>page cache entry</i> is discarded. The adjusted
1727
size of the database file is stored internally. The database file
1728
is not actually truncated until the current <i>write transaction</i>
1729
is committed (see section <cite>committing_a_transaction</cite>).
1730
<p class=req id=H35620>
1731
When required to truncate (remove) a database page that existed and was
1732
not a <i>free-list leaf page</i> when the <i>write transaction</i> was
1733
opened from the end of a database file, SQLite shall journal the page if
1734
it has not already been journalled within the current <i>write
1736
<p class=req id=H35630>
1737
When required to truncate a database page from the end of the database
1738
file, SQLite shall discard the associated <i>page cache entry</i>
1739
from the page cache.
1740
<h4 id=journalling_a_page>Journalling a Database Page</h4>
1742
A page is journalled by adding a <i>journal record</i> to the <i>
1743
journal file</i>. The format of a <i>journal record</i> is described
1744
in section <cite>journal_record_format</cite>.
1745
<p class=req id=H35270>
1746
When required to <i>journal a database page</i>, SQLite shall first
1747
append the <i>page number</i> of the page being journalled to the
1748
<i>journal file</i>, formatted as a 4-byte big-endian unsigned integer,
1749
using a single call to the xWrite method of the file-handle opened
1750
on the journal file.
1751
<p class=req id=H35280>
1752
When required to <i>journal a database page</i>, if the attempt to
1753
append the <i>page number</i> to the journal file is successful,
1754
then the current page data (<i>page-size</i> bytes) shall be appended
1755
to the journal file, using a single call to the xWrite method of the
1756
file-handle opened on the journal file.
1757
<p class=req id=H35290>
1758
When required to <i>journal a database page</i>, if the attempt to
1759
append the current page data to the journal file is successful,
1760
then SQLite shall append a 4-byte big-endian integer checksum value
1761
to the to the journal file, using a single call to the xWrite method
1762
of the file-handle opened on the journal file.
1764
The checksum value written to the <i>journal file</i> immediately after
1765
the page data (requirement H35290), is a function of both the page
1766
data and the <i>checksum initializer</i> field stored in the
1767
<i>journal header</i> (see section <cite>journal_header_format</cite>).
1768
Specifically, it is the sum of the <i>checksum initializer</i> and
1769
the value of every 200th byte of page data interpreted as an 8-bit
1770
unsigned integer, starting with the (<i>page-size</i> % 200)'th
1771
byte of page data. For example, if the <i>page-size</i> is 1024 bytes,
1772
then a checksum is calculated by adding the values of the bytes at
1773
offsets 23, 223, 423, 623, 823 and 1023 (the last byte of the page)
1774
together with the value of the <i>checksum initializer</i>.
1775
<p class=req id=H35300>
1776
The checksum value written to the <i>journal file</i> by the write
1777
required by H35290 shall be equal to the sum of the <i>checksum
1778
initializer</i> field stored in the <i>journal header</i> (H35700) and
1779
every 200th byte of the page data, beginning with the
1780
(<i>page-size</i> % 200)th byte.
1782
The '%' character is used in requirement H35300 to represent the
1783
modulo operator, just as it is in programming languages such as C, Java
1785
<h3 id=syncing_journal_file>Syncing the Journal File</h3>
1787
Even after the original data of a database page has been written into
1788
the journal file using calls to the journal file file-handle xWrite
1789
method (section <cite>journalling_a_page</cite>), it is still not
1790
safe to write to the page within the database file. This is because
1791
in the event of a system failure the data written to the journal file
1792
may still be corrupted (see section <cite>fs_characteristics</cite>).
1793
Before the page can be updated within the database itself, the
1794
following procedure takes place:
1796
<li> The xSync method of the file-handle opened on the journal file
1797
is called. This operation ensures that all <i>journal records</i>
1798
in the journal file have been written to persistent storage, and
1799
that they will not become corrupted as a result of a subsequent
1801
<li> The <i>journal record count</i> field (see section
1802
<cite>journal_header_format</cite>) of the most recently written
1803
journal header in the journal file is updated to contain the
1804
number of <i>journal records</i> added to the journal file since
1805
the header was written.
1806
<li> The xSync method is called again, to ensure that the update to
1807
the <i>journal record count</i> has been committed to persistent
1811
If all three of the steps enumerated above are executed successfully,
1812
then it is safe to modify the content of the <i>journalled</i>
1813
database pages within the database file itself. The combination of
1814
the three steps above is referred to as <i>syncing the journal file</i>.
1815
<p class=req id=H35750>
1816
When required to <i>sync the journal file</i>, SQLite shall invoke the
1817
xSync method of the file handle open on the <i>journal file</i>.
1818
<p class=req id=H35760>
1819
When required to <i>sync the journal file</i>, after invoking the
1820
xSync method as required by H35750, SQLite shall update the <i>record
1821
count</i> of the <i>journal header</i> most recently written to the
1822
<i>journal file</i>. The 4-byte field shall be updated to contain
1823
the number of <i>journal records</i> that have been written to the
1824
<i>journal file</i> since the <i>journal header</i> was written,
1825
formatted as a 4-byte big-endian unsigned integer.
1826
<p class=req id=H35770>
1827
When required to <i>sync the journal file</i>, after updating the
1828
<i>record count</i> field of a <i>journal header</i> as required by
1829
H35760, SQLite shall invoke the xSync method of the file handle open
1830
on the <i>journal file</i>.
1831
<h3 id=upgrading_to_exclusive_lock>Upgrading to an Exclusive Lock</h3>
1833
Before the content of a page modified within the <i>page cache</i> may
1834
be written to the database file, an <i>exclusive lock</i> must be held
1835
on the database file. The purpose of this lock is to prevent another
1836
connection from reading from the database file while the first
1837
connection is midway through writing to it. Whether the reason for
1838
writing to the database file is because a transaction is being committed,
1839
or to free up space within the <i>page cache</i>, upgrading to an
1840
<i>exclusive lock</i> always occurs immediately after
1841
<i>syncing the journal file</i>.
1842
<p class=req id=H35780>
1843
When required to upgrade to an <i>exclusive lock</i> as part of a write
1844
transaction, SQLite shall first attempt to obtain a <i>pending lock</i>
1845
on the database file if one is not already held by invoking the xLock
1846
method of the file handle opened on the <i>database file</i>.
1847
<p class=req id=H35790>
1848
When required to upgrade to an <i>exclusive lock</i> as part of a write
1849
transaction, after successfully obtaining a <i>pending lock</i> SQLite
1850
shall attempt to obtain an <i>exclusive lock</i> by invoking the
1851
xLock method of the file handle opened on the <i>database file</i>.
1853
What happens if the exclusive lock cannot be obtained? It is not
1854
possible for the attempt to upgrade from a reserved to a pending
1856
<h3 id=committing_a_transaction>Committing a Transaction</h3>
1858
Committing a <i>write transaction</i> is the final step in updating the
1859
database file. Committing a transaction is a seven step process,
1860
summarized as follows:
1863
The database file header <i>change counter</i> field is incremented.
1864
The <i>change counter</i>, described in
1865
<cite>ff_sqlitert_requirements</cite>, is used by the <i>cache
1866
validation</i> procedure described in section
1867
<cite>cache_validation</cite>.
1869
The <i>journal file</i> is synced. The steps required to <i>sync the
1870
journal file</i> are described in section
1871
<cite>syncing_journal_file</cite>.
1873
Upgrade to an <i>exclusive lock</i> on the database file, if an
1874
<i>exclusive lock</i> is not already held. Upgrading to an
1875
<i>exclusive lock</i> is described in section
1876
<cite>upgrading_to_exclusive_lock</cite>.
1878
Copy the contents of all <i>dirty pages</i> stored in the <i>page
1879
cache</i> into the database file. The set of dirty pages are written
1880
to the database file in page number order in order to improve
1881
performance (see the assumptions in section <cite>fs_performance</cite>
1884
The database file is synced to ensure that all updates are stored
1885
safely on the persistent media.
1887
The file-handle open on the <i>journal file</i> is closed and the
1888
journal file itself deleted. At this point the <i>write transaction</i>
1889
transaction has been irrevocably committed.
1891
The database file is unlocked.
1894
Expand on and explain the above a bit.
1896
The following requirements describe the steps enumerated above in more
1898
<p class=req id=H35800>
1899
When required to <i>commit a write-transaction</i>, SQLite shall
1900
modify page 1 to increment the value stored in the <i>change counter</i>
1901
field of the <i>database file header</i>.
1903
The <i>change counter</i> is a 4-byte big-endian integer field stored
1904
at byte offset 24 of the <i>database file</i>. The modification to page 1
1905
required by H35800 is made using the process described in section
1906
<cite>modifying_appending_truncating</cite>. If page 1 has not already
1907
been journalled as a part of the current write-transaction, then
1908
incrementing the <i>change counter</i> may require that page 1 be
1909
journalled. In all cases the <i>page cache entry</i> corresponding to
1910
page 1 becomes a <i>dirty page</i> as part of incrementing the <i>change
1912
<p class=req id=H35810>
1913
When required to <i>commit a write-transaction</i>, after incrementing
1914
the <i>change counter</i> field, SQLite shall <i>sync the journal
1916
<p class=req id=H35820>
1917
When required to <i>commit a write-transaction</i>, after <i>syncing
1918
the journal file</i> as required by H35810, if an <i>exclusive lock</i>
1919
on the database file is not already held, SQLite shall attempt to
1920
<i>upgrade to an exclusive lock</i>.
1921
<p class=req id=H35830>
1922
When required to <i>commit a write-transaction</i>, after <i>syncing
1923
the journal file</i> as required by H35810 and ensuring that an
1924
<i>exclusive lock</i> is held on the database file as required by
1925
H35830, SQLite shall copy the contents of all <i>dirty page</i>
1926
stored in the <i>page cache</i> into the <i>database file</i> using
1927
calls to the xWrite method of the <i>database connection</i> file
1928
handle. Each call to xWrite shall write the contents of a single
1929
<i>dirty page</i> (<i>page-size</i> bytes of data) to the database
1930
file. Dirty pages shall be written in order of <i>page number</i>,
1931
from lowest to highest.
1932
<p class=req id=H35840>
1933
When required to <i>commit a write-transaction</i>, after copying the
1934
contents of any <i>dirty pages</i> to the database file as required
1935
by H35830, SQLite shall sync the database file by invoking the xSync
1936
method of the <i>database connection</i> file handle.
1937
<p class=req id=H35850>
1938
When required to <i>commit a write-transaction</i>, after syncing
1939
the database file as required by H35840, SQLite shall close the
1940
file-handle opened on the <i>journal file</i> and delete the
1941
<i>journal file</i> from the file system via a call to the VFS
1943
<p class=req id=H35860>
1944
When required to <i>commit a write-transaction</i>, after deleting
1945
the <i>journal file</i> as required by H35850, SQLite shall relinquish
1946
all locks held on the <i>database file</i> by invoking the xUnlock
1947
method of the <i>database connection</i> file handle.
1949
Is the shared lock held after committing a <i>write transaction</i>?
1950
<h3>Purging a Dirty Page</h3>
1952
Usually, no data is actually written to the database file until the
1953
user commits the active <i>write transaction</i>. The exception is
1954
if a single <i>write transaction</i> contains too many modifications
1955
to be stored in the <i>page cache</i>. In this case, some of the
1956
database file modifications stored in the <i>page cache</i> must be
1957
applied to the database file before the transaction is committed so
1958
that the associated <i>page cache entries</i> can be purged from the
1959
page cache to free memory. Exactly when this condition is reached and
1960
dirty pages must be purged is described in section
1961
<cite>page_cache_algorithms</cite>.
1963
Before the contents of the <i>page cache entry</i> can be written into
1964
the database file, the <i>page cache entry</i> must meet the criteria
1965
for a <i>writable dirty page</i>, as defined in section
1966
<cite>page_cache_algorithms</cite>. If the dirty page selected by the
1967
algorithms in section <cite>page_cache_algorithms</cite> for purging,
1968
SQLite is required to <i>sync the journal file</i>. Immediately after
1969
the journal file is synced, all dirty pages associated with the
1970
<i>database connection</i> are classified as <i>writable dirty pages</i>.
1971
<p class=req id=H35640>
1972
When required to purge a <i>non-writable dirty page</i> from the
1973
<i>page cache</i>, SQLite shall <i>sync the journal file</i> before
1974
proceeding with the write operation required by H35670.
1975
<p class=req id=H35660>
1976
After <i>syncing the journal file</i> as required by H35640, SQLite
1977
shall append a new <i>journal header</i> to the <i>journal file</i>
1978
before proceeding with the write operation required by H35670.
1980
Appending a new <i>journal header</i> to the journal file is described
1981
in section <cite>writing_journal_header</cite>.
1983
Once the dirty page being purged is writable, it is simply written
1984
into the database file.
1985
<p class=req id=H35670>
1986
When required to purge a <i>page cache entry</i> that is a
1987
<i>dirty page</i> SQLite shall write the page data into the database
1988
file, using a single call to the xWrite method of the <i>database
1989
connection</i> file handle.
1990
<h2 id="multifile_transactions">Multi-File Transactions</h2>
1991
<h2 id="statement_transactions">Statement Transactions</h2>
1992
<h1 id=rollback>Rollback</h1>
1993
<h2 id=hot_journal_rollback>Hot Journal Rollback</h2>
1994
<h2>Transaction Rollback</h2>
1995
<h2>Statement Rollback</h2>
1997
<table id="refs" style="width:auto; margin: 1em 5ex">
1998
<tr><td style="width:5ex" id="capi_sqlitert_requirements">[1]<td>
1999
C API Requirements Document.
2000
<tr><td style="width:5ex" id="sql_sqlitert_requirements">[2]<td>
2001
SQL Requirements Document.
2002
<tr><td style="width:5ex" id="ff_sqlitert_requirements">[3]<td>
2003
File Format Requirements Document.