1
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
3
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
4
<title>Atomic Commit In SQLite</title>
5
<style type="text/css">
8
font-family: Verdana, sans-serif;
13
a:visited { color: #734559 }
15
.logo { position:absolute; margin:3px; }
31
.toolbar a { color: white; text-decoration: none; padding: 6px 12px; }
32
.toolbar a:visited { color: white; }
33
.toolbar a:hover { color: #044a64; background: white; }
35
.content { margin: 5%; }
36
.content dt { font-weight:bold; }
37
.content dd { margin-bottom: 25px; margin-left:20%; }
38
.content ul { padding:0px; padding-left: 15px; margin:0px; }
41
.se { background: url(images/se.gif) 100% 100% no-repeat #044a64}
42
.sw { background: url(images/sw.gif) 0% 100% no-repeat }
43
.ne { background: url(images/ne.gif) 100% 0% no-repeat }
44
.nw { background: url(images/nw.gif) 0% 0% no-repeat }
46
/* Things for "fancyformat" documents start here. */
47
.fancy img+p {font-style:italic}
48
.fancy .codeblock i { color: darkblue; }
49
.fancy h1,.fancy h2,.fancy h3,.fancy h4 {font-weight:normal;color:#044a64}
50
.fancy h2 { margin-left: 10px }
51
.fancy h3 { margin-left: 20px }
52
.fancy h4 { margin-left: 30px }
53
.fancy th {white-space:nowrap;text-align:left;border-bottom:solid 1px #444}
54
.fancy th, .fancy td {padding: 0.2em 1ex; vertical-align:top}
55
.fancy #toc a { color: darkblue ; text-decoration: none }
56
.fancy .todo { color: #AA3333 ; font-style : italic }
57
.fancy .todo:before { content: 'TODO:' }
58
.fancy p.todo { border: solid #AA3333 1px; padding: 1ex }
59
.fancy img { display:block; }
60
.fancy :link:hover, .fancy :visited:hover { background: wheat }
61
.fancy p,.fancy ul,.fancy ol { margin: 1em 5ex }
62
.fancy li p { margin: 1em 0 }
63
/* End of "fancyformat" specific rules. */
69
<div><!-- container div to satisfy validator -->
72
<img class="logo" src="images/sqlite370_banner.gif" alt="SQLite Logo"
74
<div><!-- IE hack to prevent disappearing logo--></div>
75
<div class="tagline">Small. Fast. Reliable.<br>Choose any three.</div>
77
<table width=100% style="clear:both"><tr><td>
78
<div class="se"><div class="sw"><div class="ne"><div class="nw">
79
<table width=100% style="padding:0;margin:0;cell-spacing:0"><tr>
82
<a href="about.html">About</a>
83
<a href="sitemap.html">Sitemap</a>
84
<a href="docs.html">Documentation</a>
85
<a href="download.html">Download</a>
86
<a href="copyright.html">License</a>
87
<a href="news.html">News</a>
88
<a href="support.html">Support</a>
91
gMsg = "Search SQLite Docs..."
92
function entersearch() {
93
var q = document.getElementById("q");
94
if( q.value == gMsg ) { q.value = "" }
95
q.style.color = "black"
96
q.style.fontStyle = "normal"
98
function leavesearch() {
99
var q = document.getElementById("q");
100
if( q.value == "" ) {
102
q.style.color = "#044a64"
103
q.style.fontStyle = "italic"
108
<div style="padding:0 1em 0px 0;white-space:nowrap">
109
<form name=f method="GET" action="http://www.sqlite.org/search">
110
<input id=q name=q type=text
111
onfocus="entersearch()" onblur="leavesearch()" style="width:24ex;padding:1px 1ex; border:solid white 1px; font-size:0.9em ; font-style:italic;color:#044a64;" value="Search SQLite Docs...">
112
<input type=submit value="Go" style="border:solid white 1px;background-color:#044a64;color:white;font-size:0.9em;padding:0 1ex">
116
</div></div></div></div>
118
<div class=startsearch></div>
124
Atomic Commit In SQLite
127
<h2>1.0 Introduction</h2>
129
<p>An important feature of transactional databases like SQLite
131
Atomic commit means that either all database changes within a single
132
transaction occur or none of them occur. With atomic commit, it
133
is as if many different writes to different sections of the database
134
file occur instantaneously and simultaneously.
135
Real hardware serializes writes to mass storage, and writing
136
a single sector takes a finite amount of time.
137
So it is impossible to truly write many different sectors of a
138
database file simultaneously and/or instantaneously.
139
But the atomic commit logic within
140
SQLite makes it appear as if the changes for a transaction
141
are all written instantaneously and simultaneously.</p>
143
<p>SQLite has the important property that transactions appear
144
to be atomic even if the transaction is interrupted by an
145
operating system crash or power failure.</p>
147
<p>This article describes the techniques used by SQLite to create the
148
illusion of atomic commit.</p>
150
<p>The information in this article applies only when SQLite is operating
151
in "rollback mode", or in other words when SQLite is not
152
using a <a href="wal.html">write-ahead log</a>. SQLite still supports atomic commit when
153
write-ahead logging is enabled, but it accomplishes atomic commit by
154
a different mechanism from the one describe in this article. See
155
the <a href="wal.html">write-ahead log documentation</a> for additional information on how
156
SQLite supports atomic commit in that context.</p>
158
<a name="hardware"></a>
160
<h2>2.0 Hardware Assumptions</h2>
162
<p>Throughout this article, we will call the mass storage device "disk"
163
even though the mass storage device might really be flash memory.</p>
165
<p>We assume that disk is written in chunks which we call a "sector".
166
It is not possible to modify any part of the disk smaller than a sector.
167
To change a part of the disk smaller than a sector, you have to read in
168
the full sector that contains the part you want to change, make the
169
change, then write back out the complete sector.</p>
171
<p>On a traditional spinning disk, a sector is the minimum unit of transfer
172
in both directions, both reading and writing. On flash memory, however,
173
the minimum size of a read is typically much smaller than a minimum write.
174
SQLite is only concerned with the minimum write amount and so for the
175
purposes of this article, when we say "sector" we mean the minimum amount
176
of data that can be written to mass storage in a single go.</p>
179
Prior to SQLite version 3.3.14, a sector size of 512 bytes was
180
assumed in all cases. There was a compile-time option to change
181
this but the code had never been tested with a larger value. The
182
512 byte sector assumption seemed reasonable since until very recently
183
all disk drives used a 512 byte sector internally. However, there
184
has recently been a push to increase the sector size of disks to
185
4096 bytes. Also the sector size
186
for flash memory is usually larger than 512 bytes. For these reasons,
187
versions of SQLite beginning with 3.3.14 have a method in the OS
188
interface layer that interrogates the underlying filesystem to find
189
the true sector size. As currently implemented (version 3.5.0) this
190
method still returns a hard-coded value of 512 bytes, since there
191
is no standard way of discovering the true sector size on either
192
Unix or Windows. But the method is available for embedded device
193
manufactures to tweak according to their own needs. And we have
194
left open the possibility of filling in a more meaningful implementation
195
on Unix and Windows in the future.</p>
197
<p>SQLite has traditionally assumed that a sector write is <u>not</u> atomic.
198
However, SQLite does always assume that a sector write is linear. By "linear"
199
we mean that SQLite assumes that when writing a sector, the hardware begins
200
at one end of the data and writes byte by byte until it gets to
201
the other end. The write might go from beginning to end or from
202
end to beginning. If a power failure occurs in the middle of a
203
sector write it might be that part of the sector was modified
204
and another part was left unchanged. The key assumption by SQLite
205
is that if any part of the sector gets changed, then either the
206
first or the last bytes will be changed. So the hardware will
207
never start writing a sector in the middle and work towards the
208
ends. We do not know if this assumption is always true but it
209
seems reasonable.</p>
211
<p>The previous paragraph states that SQLite does not assume that
212
sector writes are atomic. This is true by default. But as of
213
SQLite version 3.5.0, there is a new interface called the
214
Virtual File System (<a href="vfs.html">VFS</a>) interface. The <a href="vfs.html">VFS</a> is the only means
215
by which SQLite communicates to the underlying filesystem. The
216
code comes with default VFS implementations for Unix and Windows
217
and there is a mechanism for creating new custom VFS implementations
218
at runtime. In this new VFS interface there is a method called
219
xDeviceCharacteristics. This method interrogates the underlying
220
filesystem to discover various properties and behaviors that the
221
filesystem may or may not exhibit. The xDeviceCharacteristics
222
method might indicate that sector writes are atomic, and if it does
223
so indicate, SQLite will try to take advantage of that fact. But
224
the default xDeviceCharacteristics method for both Unix and Windows
225
does not indicate atomic sector writes and so these optimizations
226
are normally omitted.</p>
228
<p>SQLite assumes that the operating system will buffer writes and
229
that a write request will return before data has actually been stored
230
in the mass storage device.
231
SQLite further assumes that write operations will be reordered by
232
the operating system.
233
For this reason, SQLite does a "flush" or "fsync" operation at key
234
points. SQLite assumes that the flush or fsync will not return until
235
all pending write operations for the file that is being flushed have
236
completed. We are told that the flush and fsync primitives
237
are broken on some versions of Windows and Linux. This is unfortunate.
238
It opens SQLite up to the possibility of database corruption following
239
a power loss in the middle of a commit. However, there is nothing
240
that SQLite can do to test for or remedy the situation. SQLite
241
assumes that the operating system that it is running on works as
242
advertised. If that is not quite the case, well then hopefully you
243
will not lose power too often.</p>
245
<p>SQLite assumes that when a file grows in length that the new
246
file space originally contains garbage and then later is filled in
247
with the data actually written. In other words, SQLite assumes that
248
the file size is updated before the file content. This is a
249
pessimistic assumption and SQLite has to do some extra work to make
250
sure that it does not cause database corruption if power is lost
251
between the time when the file size is increased and when the
252
new content is written. The xDeviceCharacteristics method of
253
the <a href="vfs.html">VFS</a> might indicate that the filesystem will always write the
254
data before updating the file size. (This is the
255
SQLITE_IOCAP_SAFE_APPEND property for those readers who are looking
256
at the code.) When the xDeviceCharacteristics method indicates
257
that files content is written before the file size is increased,
258
SQLite can forego some of its pedantic database protection steps
259
and thereby decrease the amount of disk I/O needed to perform a
260
commit. The current implementation, however, makes no such assumptions
261
for the default VFSes for Windows and Unix.</p>
263
<p>SQLite assumes that a file deletion is atomic from the
264
point of view of a user process. By this we mean that if SQLite
265
requests that a file be deleted and the power is lost during the
266
delete operation, once power is restored either the file will
267
exist completely with all if its original content unaltered, or
268
else the file will not be seen in the filesystem at all. If
269
after power is restored the file is only partially deleted,
270
if some of its data has been altered or erased,
271
or the file has been truncated but not completely removed, then
272
database corruption will likely result.</p>
274
<p>SQLite assumes that the detection and/or correction of
275
bit errors caused by cosmic rays, thermal noise, quantum
276
fluctuations, device driver bugs, or other mechanisms, is the
277
responsibility of the underlying hardware and operating system.
278
SQLite does not add any redundancy to the database file for
279
the purpose of detecting corruption or I/O errors.
280
SQLite assumes that the data it reads is exactly the same data
281
that it previously wrote.</p>
283
<p>By default, SQLite assumes that an operating system call to write
284
a range of bytes will not damage or alter any bytes outside of that range
285
even if a power lose or OS crash occurs during that write. We
286
call this the "<a href="psow.html">powersafe overwrite</a>" property. Prior to version 3.7.9,
287
SQLite did not assume powersafe overwrite. But with the standard
288
sector size increasing from 512 to 4096 bytes on most disk drives, it
289
has become necessary to assume powersafe overwrite in order to maintain
290
historical performance levels and so powersafe overwrite is assumed by
291
default in recent versions of SQLite. The assumption of powersafe
292
overwrite property can be disabled at compile-time or a run-time if
293
desired. See the <a href="psow.html">powersafe overwrite documentation</a> for further
297
<a name="section_3_0"></a>
298
<h2>3.0 Single File Commit</h2>
300
<p>We begin with an overview of the steps SQLite takes in order to
301
perform an atomic commit of a transaction against a single database
302
file. The details of file formats used to guard against damage from
303
power failures and techniques for performing an atomic commit across
304
multiple databases are discussed in later sections.</p>
306
<a name="initstate"></a>
308
<h3>3.1 Initial State</h3>
310
<img src="images/ac/commit-0.gif" align="right" hspace="15">
312
<p>The state of the computer when a database connection is
313
first opened is shown conceptually by the diagram at the
315
The area of the diagram on the extreme right (labeled "Disk") represents
316
information stored on the mass storage device. Each rectangle is
317
a sector. The blue color represents that the sectors contain
319
The middle area is the operating systems disk cache. At the
320
onset of our example, the cache is cold and this is represented
321
by leaving the rectangles of the disk cache empty.
322
The left area of the diagram shows the content of memory for
323
the process that is using SQLite. The database connection has
324
just been opened and no information has been read yet, so the
331
<h3>3.2 Acquiring A Read Lock</h3>
333
<img src="images/ac/commit-1.gif" align="right" hspace="15">
335
<p>Before SQLite can write to a database, it must first read
336
the database to see what is there already. Even if it is just
337
appending new data, SQLite still has to read in the database
338
schema from the <b>sqlite_master</b> table so that it can know
339
how to parse the INSERT statements and discover where in the
340
database file the new information should be stored.</p>
342
<p>The first step toward reading from the database file
343
is obtaining a shared lock on the database file. A "shared"
344
lock allows two or more database connections to read from the
345
database file at the same time. But a shared lock prevents
346
another database connection from writing to the database file
347
while we are reading it. This is necessary because if another
348
database connection were writing to the database file at the
349
same time we are reading from the database file, we might read
350
some data before the change and other data after the change.
351
This would make it appears as if the change made by the other
352
process is not atomic.</p>
354
<p>Notice that the shared lock is on the operating system
355
disk cache, not on the disk itself. File locks
356
really are just flags within the operating system kernel,
357
usually. (The details depend on the specific OS layer
358
interface.) Hence, the lock will instantly vanish if the
359
operating system crashes or if there is a power loss. It
360
is usually also the case that the lock will vanish if the
361
process that created the lock exits.</p>
365
<a name="section_3_3"></a>
366
<h3>3.3 Reading Information Out Of The Database</h3>
368
<img src="images/ac/commit-2.gif" align="right" hspace="15">
370
<p>After the shared lock is acquired, we can begin reading
371
information from the database file. In this scenario, we
372
are assuming a cold cache, so information must first be
373
read from mass storage into the operating system cache then
374
transferred from operating system cache into user space.
375
On subsequent reads, some or all of the information might
376
already be found in the operating system cache and so only
377
the transfer to user space would be required.</p>
379
<p>Usually only a subset of the pages in the database file
380
are read. In this example we are showing three
381
pages out of eight being read. In a typical application, a
382
database will have thousands of pages and a query will normally
383
only touch a small percentage of those pages.</p>
387
<a name="rsvdlock"></a>
389
<h3>3.4 Obtaining A Reserved Lock</h3>
391
<img src="images/ac/commit-3.gif" align="right" hspace="15">
393
<p>Before making changes to the database, SQLite first
394
obtains a "reserved" lock on the database file. A reserved
395
lock is similar to a shared lock in that both a reserved lock
396
and shared lock allow other processes to read from the database
397
file. A single reserve lock can coexist with multiple shared
398
locks from other processes. However, there can only be a
399
single reserved lock on the database file. Hence only a
400
single process can be attempting to write to the database
403
<p>The idea behind a reserved locks is that it signals that
404
a process intends to modify the database file in the near
405
future but has not yet started to make the modifications.
406
And because the modifications have not yet started, other
407
processes can continue to read from the database. However,
408
no other process should also begin trying to write to the
412
<a name="section_3_5"></a>
413
<h3>3.5 Creating A Rollback Journal File</h3>
414
<img src="images/ac/commit-4.gif" align="right" hspace="15">
416
<p>Prior to making any changes to the database file, SQLite first
417
creates a separate rollback journal file and writes into the
418
rollback journal the original
419
content of the database pages that are to be altered.
420
The idea behind the rollback journal is that it contains
421
all information needed to restore the database back to
422
its original state.</p>
424
<p>The rollback journal contains a small header (shown in green
425
in the diagram) that records the original size of the database
426
file. So if a change causes the database file to grow, we
427
will still know the original size of the database. The page
428
number is stored together with each database page that is
429
written into the rollback journal.</p>
432
When a new file is created, most desktop operating systems
433
(Windows, Linux, Mac OS X) will not actually write anything to
434
disk. The new file is created in the operating systems disk
435
cache only. The file is not created on mass storage until sometime
436
later, when the operating system has a spare moment. This creates
437
the impression to users that I/O is happening much faster than
438
is possible when doing real disk I/O. We illustrate this idea in
439
the diagram to the right by showing that the new rollback journal
440
appears in the operating system disk cache only and not on the
444
<a name="section_3_6"></a>
445
<h3>3.6 Changing Database Pages In User Space</h3>
446
<img src="images/ac/commit-5.gif" align="right" hspace="15">
448
<p>After the original page content has been saved in the rollback
449
journal, the pages can be modified in user memory. Each database
450
connection has its own private copy of user space, so the changes
451
that are made in user space are only visible to the database connection
452
that is making the changes. Other database connections still see
453
the information in operating system disk cache buffers which have
454
not yet been changed. And so even though one process is busy
455
modifying the database, other processes can continue to read their
456
own copies of the original database content.</p>
459
<a name="section_3_7"></a>
460
<h3>3.7 Flushing The Rollback Journal File To Mass Storage</h3>
461
<img src="images/ac/commit-6.gif" align="right" hspace="15">
463
<p>The next step is to flush the content of the rollback journal
464
file to nonvolatile storage.
465
As we will see later,
466
this is a critical step in insuring that the database can survive
467
an unexpected power loss.
468
This step also takes a lot of time, since writing to nonvolatile
469
storage is normally a slow operation.</p>
471
<p>This step is usually more complicated than simply flushing
472
the rollback journal to the disk. On most platforms two separate
473
flush (or fsync()) operations are required. The first flush writes
474
out the base rollback journal content. Then the header of the
475
rollback journal is modified to show the number of pages in the
476
rollback journal. Then the header is flushed to disk. The details
477
on why we do this header modification and extra flush are provided
478
in a later section of this paper.</p>
481
<a name="section_3_8"></a>
482
<h3>3.8 Obtaining An Exclusive Lock</h3>
483
<img src="images/ac/commit-7.gif" align="right" hspace="15">
485
<p>Prior to making changes to the database file itself, we must
486
obtain an exclusive lock on the database file. Obtaining an
487
exclusive lock is really a two-step process. First SQLite obtains
488
a "pending" lock. Then it escalates the pending lock to an
491
<p>A pending lock allows other processes that already have a
492
shared lock to continue reading the database file. But it
493
prevents new shared locks from being established. The idea
494
behind a pending lock is to prevent writer starvation caused
495
by a large pool of readers. There might be dozens, even hundreds,
496
of other processes trying to read the database file. Each process
497
acquires a shared lock before it starts reading, reads what it
498
needs, then releases the shared lock. If, however, there are
499
many different processes all reading from the same database, it
500
might happen that a new process always acquires its shared lock before
501
the previous process releases its shared lock. And so there is
502
never an instant when there are no shared locks on the database
503
file and hence there is never an opportunity for the writer to
504
seize the exclusive lock. A pending lock is designed to prevent
505
that cycle by allowing existing shared locks to proceed but
506
blocking new shared locks from being established. Eventually
507
all shared locks will clear and the pending lock will then be
508
able to escalate into an exclusive lock.</p>
511
<a name="section_3_9"></a>
512
<h3>3.9 Writing Changes To The Database File</h3>
513
<img src="images/ac/commit-8.gif" align="right" hspace="15">
515
<p>Once an exclusive lock is held, we know that no other
516
processes are reading from the database file and it is
517
safe to write changes into the database file. Usually
518
those changes only go as far as the operating systems disk
519
cache and do not make it all the way to mass storage.</p>
522
<a name="section_3_10"></a>
523
<h3>3.10 Flushing Changes To Mass Storage</h3>
524
<img src="images/ac/commit-9.gif" align="right" hspace="15">
526
<p>Another flush must occur to make sure that all the
527
database changes are written into nonvolatile storage.
528
This is a critical step to ensure that the database will
529
survive a power loss without damage. However, because
530
of the inherent slowness of writing to disk or flash memory,
531
this step together with the rollback journal file flush in section
532
3.7 above takes up most the time required to complete a
533
transaction commit in SQLite.</p>
536
<a name="section_3_11"></a>
537
<h3>3.11 Deleting The Rollback Journal</h3>
538
<img src="images/ac/commit-A.gif" align="right" hspace="15">
540
<p>After the database changes are all safely on the mass
541
storage device, the rollback journal file is deleted.
542
This is the instant where the transaction commits.
543
If a power failure or system crash occurs prior to this
544
point, then recovery processes to be described later make
545
it appears as if no changes were ever made to the database
546
file. If a power failure or system crash occurs after
547
the rollback journal is deleted, then it appears as if
548
all changes have been written to disk. Thus, SQLite gives
549
the appearance of having made no changes to the database
550
file or having made the complete set of changes to the
551
database file depending on whether or not the rollback
552
journal file exists.</p>
554
<p>Deleting a file is not really an atomic operation, but
555
it appears to be from the point of view of a user process.
556
A process is always able to ask the operating system "does
557
this file exist?" and the process will get back a yes or no
558
answer. After a power failure that occurs during a
559
transaction commit, SQLite will ask the operating system
560
whether or not the rollback journal file exists. If the
561
answer is "yes" then the transaction is incomplete and is
562
rolled back. If the answer is "no" then it means the transaction
565
<p>The existence of a transaction depends on whether or
566
not the rollback journal file exists and the deletion
567
of a file appears to be an atomic operation from the point of
568
view of a user-space process. Therefore,
569
a transaction appears to be an atomic operation.</p>
571
<p>The act of deleting a file is expensive on many systems.
572
As an optimization, SQLite can be configured to truncate
573
the journal file to zero bytes in length
574
or overwrite the journal file header with zeros. In either
575
case, the resulting journal file is no longer capable of rolling
576
back and so the transaction still commits. Truncating a file
577
to zero length, like deleting a file, is assumed to be an atomic
578
operation from the point of view of a user process. Overwriting
579
the header of the journal with zeros is not atomic, but if any
580
part of the header is malformed the journal will not roll back.
581
Hence, one can say that the commit occurs as soon as the header
582
is sufficiently changed to make it invalid. Typically this happens
583
as soon as the first byte of the header is zeroed.</p>
586
<a name="section_3_12"></a>
587
<h3>3.12 Releasing The Lock</h3>
588
<img src="images/ac/commit-B.gif" align="right" hspace="15">
590
<p>The last step in the commit process is to release the
591
exclusive lock so that other processes can once again
592
start accessing the database file.</p>
594
<p>In the diagram at the right, we show that the information
595
that was held in user space is cleared when the lock is released.
596
This used to be literally true for older versions of SQLite. But
597
more recent versions of SQLite keep the user space information
598
in memory in case it might be needed again at the start of the
599
next transaction. It is cheaper to reuse information that is
600
already in local memory than to transfer the information back
601
from the operating system disk cache or to read it off of the
602
disk drive again. Prior to reusing the information in user space,
603
we must first reacquire the shared lock and then we have to check
604
to make sure that no other process modified the database file while
605
we were not holding a lock. There is a counter in the first page
606
of the database that is incremented every time the database file
607
is modified. We can find out if another process has modified the
608
database by checking that counter. If the database was modified,
609
then the user space cache must be cleared and reread. But it is
610
commonly the case that no changes have been made and the user
611
space cache can be reused for a significant performance savings.</p>
614
<a name="rollback"></a>
616
<h2>4.0 Rollback</h2>
618
<p>An atomic commit is supposed to happen instantaneously. But the processing
619
described above clearly takes a finite amount of time.
620
Suppose the power to the computer were cut
621
part way through the commit operation described above. In order
622
to maintain the illusion that the changes were instantaneous, we
623
have to "rollback" any partial changes and restore the database to
624
the state it was in prior to the beginning of the transaction.</p>
626
<a name="crisis"></a>
628
<h3>4.1 When Something Goes Wrong...</h3>
629
<img src="images/ac/rollback-0.gif" align="right" hspace="15">
631
<p>Suppose the power loss occurred during step 3.10 above,
632
while the database changes were being written to disk.
633
After power is restored, the situation might be something
634
like what is shown to the right. We were trying to change
635
three pages of the database file but only one page was
636
successfully written. Another page was partially written
637
and a third page was not written at all.</p>
639
<p>The rollback journal is complete and intact on disk when
640
the power is restored. This is a key point. The reason for
641
the flush operation in step 3.7 is to make absolutely sure that
642
all of the rollback journal is safely on nonvolatile storage
643
prior to making any changes to the database file itself.</p>
646
<a name="section_4_2"></a>
647
<h3>4.2 Hot Rollback Journals</h3>
648
<img src="images/ac/rollback-1.gif" align="right" hspace="15">
650
<p>The first time that any SQLite process attempts to access
651
the database file, it obtains a shared lock as described in
652
section 3.2 above. But then it notices that there is a
653
rollback journal file present. SQLite then checks to see if
654
the rollback journal is a "hot journal". A hot journal is
655
a rollback journal that needs to be played back in order to
656
restore the database to a sane state. A hot journal only
657
exists when an earlier process was in the middle of committing
658
a transaction when it crashed or lost power.</p>
660
<p>A rollback journal is a "hot" journal if all of the following
664
<li>The rollback journal exist.
665
<li>The rollback journal is not an empty file.
666
<li>There is no reserved lock on the main database file.
667
<li>The header of the rollback journal is well-formed and in particular
668
has not been zeroed out.
669
<li>The rollback journal does not
670
contain the name of a master journal file (see
671
<a href="#section_5_5">section 5.5</a> below) or if does
672
contain the name of a master journal, then that master journal
676
<p>The presence of a hot journal is our indication
677
that a previous process was trying to commit a transaction but
678
it aborted for some reason prior to the completion of the
679
commit. A hot journal means that
680
the database file is in an inconsistent state and needs to
681
be repaired (by rollback) prior to being used.</p>
684
<a name="exlock"></a>
686
<h3>4.3 Obtaining An Exclusive Lock On The Database</h3>
687
<img src="images/ac/rollback-2.gif" align="right" hspace="15">
689
<p>The first step toward dealing with a hot journal is to
690
obtain an exclusive lock on the database file. This prevents two
691
or more processes from trying to rollback the same hot journal
692
at the same time.</p>
695
<a name="section_4_4"></a>
696
<h3>4.4 Rolling Back Incomplete Changes</h3>
697
<img src="images/ac/rollback-3.gif" align="right" hspace="15">
699
<p>Once a process obtains an exclusive lock, it is permitted
700
to write to the database file. It then proceeds to read the
701
original content of pages out of the rollback journal and write
702
that content back to were it came from in the database file.
703
Recall that the header of the rollback journal records the original
704
size of the database file prior to the start of the aborted
705
transaction. SQLite uses this information to truncate the
706
database file back to its original size in cases where the
707
incomplete transaction caused the database to grow. At the
708
end of this step, the database should be the same size and
709
contain the same information as it did before the start of
710
the aborted transaction.</p>
713
<a name="delhotjrnl"></a>
715
<h3>4.5 Deleting The Hot Journal</h3>
716
<img src="images/ac/rollback-4.gif" align="right" hspace="15">
718
<p>After all information in the rollback journal has been
719
played back into the database file (and flushed to disk in case
720
we encounter yet another power failure), the hot rollback journal
723
<p>As in <a href="#section_3_11">section 3.11</a>, the journal
724
file might be truncated to zero length or its header might
725
be overwritten with zeros as an optimization on systems where
726
deleting a file is expense. Either way, the journal is no
727
long hot after this step.</p>
732
<h3>4.6 Continue As If The Uncompleted Writes Had Never Happened</h3>
733
<img src="images/ac/rollback-5.gif" align="right" hspace="15">
735
<p>The final recovery step is to reduce the exclusive lock back
736
to a shared lock. Once this happens, the database is back in the
737
state that it would have been if the aborted transaction had never
738
started. Since all of this recovery activity happens completely
739
automatically and transparently, it appears to the program using
740
SQLite as if the aborted transaction had never begun.</p>
743
<a name="multicommit"></a>
745
<h2>5.0 Multi-file Commit</h2>
747
<p>SQLite allows a single
748
<a href="c3ref/sqlite3.html">database connection</a> to talk to
749
two or more database files simultaneously through the use of
750
the <a href="lang_attach.html">ATTACH DATABASE</a> command.
751
When multiple database files are modified within a single
752
transaction, all files are updated atomically.
753
In other words, either all of the database files are updated or
754
else none of them are.
755
Achieving an atomic commit across multiple database files is
756
more complex that doing so for a single file. This section
757
describes how SQLite works that bit of magic.</p>
759
<a name="multijrnl"></a>
761
<h3>5.1 Separate Rollback Journals For Each Database</h3>
762
<img src="images/ac/multi-0.gif" align="right" hspace="15">
764
<p>When multiple database files are involved in a transaction,
765
each database has its own rollback journal and each database
766
is locked separately. The diagram at the right shows a scenario
767
where three different database files have been modified within
768
one transaction. The situation at this step is analogous to
769
the single-file transaction scenario at
770
<a href="#section_3_6">step 3.6</a>. Each database file has
771
a reserved lock. For each database, the original content of pages
772
that are being changed have been written into the rollback journal
773
for that database, but the content of the journals have not yet
774
been flushed to disk. No changes have been made to the database
775
file itself yet, though presumably there are changes being held
778
<p>For brevity, the diagrams in this section are simplified from
779
those that came before. Blue color still signifies original content
780
and pink still signifies new content. But the individual pages
781
in the rollback journal and the database file are not shown and
782
we are not making the distinction between information in the
783
operating system cache and information that is on disk. All of
784
these factors still apply in a multi-file commit scenario. They
785
just take up a lot of space in the diagrams and they do not add
786
any new information, so they are omitted here.</p>
789
<a name="masterjrnl"></a>
791
<h3>5.2 The Master Journal File</h3>
792
<img src="images/ac/multi-1.gif" align="right" hspace="15">
794
<p>The next step in a multi-file commit is the creation of a
795
"master journal" file. The name of the master journal file is
796
the same name as the original database filename (the database
797
that was opened using the
798
<a href="c3ref/open.html">sqlite3_open()</a> interface,
799
not one of the <a href="lang_attach.html">ATTACHed</a> auxiliary
800
databases) with the text "<b>-mj</b><i>HHHHHHHH</i>" appended where
801
<i>HHHHHHHH</i> is a random 32-bit hexadecimal number. The
802
random <i>HHHHHHHH</i> suffix changes for every new master journal.</p>
804
<p><i>(Nota bene: The formula for computing the master journal filename
805
given in the previous paragraph corresponds to the implementation as
806
of SQLite version 3.5.0. But this formula is not part of the SQLite
807
specification and is subject to change in future releases.)</i></p>
809
<p>Unlike the rollback journals, the master journal does not contain
810
any original database page content. Instead, the master journal contains
811
the full pathnames for rollback journals for every database that is
812
participating in the transaction.</p>
814
<p>After the master journal is constructed, its content is flushed
815
to disk before any further actions are taken. On Unix, the directory
816
that contains the master journal is also synced in order to make sure
817
the master journal file will appear in the directory following a
821
<a name="multijrnlupdate"></a>
823
<h3>5.3 Updating Rollback Journal Headers</h3>
824
<img src="images/ac/multi-2.gif" align="right" hspace="15">
826
<p>The next step is to record the full pathname of the master journal file
827
in the header of every rollback journal. Space to hold the master
828
journal filename was reserved at the beginning of each rollback journal
829
as the rollback journals were created.</p>
831
<p>The content of each rollback journal is flushed to disk both before
832
and after the master journal filename is written into the rollback
833
journal header. It is important to do both of these flushes. Fortunately,
834
the second flush is usually inexpensive since typically only a single
835
page of the journal file (the first page) has changed.</p>
837
<p>This step is analogous to
838
<a href="#section_3_7">step 3.7</a> in the single-file commit
839
scenario described above.</p>
842
<a name="multidbupdate"></a>
844
<h3>5.4 Updating The Database Files</h3>
845
<img src="images/ac/multi-3.gif" align="right" hspace="15">
847
<p>Once all rollback journal files have been flushed to disk, it
848
is safe to begin updating database files. We have to obtain an
849
exclusive lock on all database files before writing the changes.
850
After all the changes are written, it is important to flush the
851
changes to disk so that they will be preserved in the event of
852
a power failure or operating system crash.</p>
854
<p>This step corresponds to steps
855
<a href="#section_3_8">3.8</a>,
856
<a href="#section_3_9">3.9</a>, and
857
<a href="#section_3_10">3.10</a> in the single-file commit
858
scenario described previously.</p>
862
<a name="section_5_5"></a>
863
<h3>5.5 Delete The Master Journal File</h3>
864
<img src="images/ac/multi-4.gif" align="right" hspace="15">
866
<p>The next step is to delete the master journal file.
867
This is the point where the multi-file transaction commits.
868
This step corresponds to
869
<a href="#section_3_11">step 3.11</a> in the single-file
870
commit scenario where the rollback journal is deleted.</p>
872
<p>If a power failure or operating system crash occurs at this
873
point, the transaction will not rollback when the system reboots
874
even though there are rollback journals present. The
875
difference is the master journal pathname in the header of the
876
rollback journal. Upon restart, SQLite only considers a journal
877
to be hot and will only playback the journal if there is no
878
master journal filename in the header (which is the case for
879
a single-file commit) or if the master journal file still
883
<a name="cleanup"></a>
885
<h3>5.6 Clean Up The Rollback Journals</h3>
886
<img src="images/ac/multi-5.gif" align="right" hspace="15">
888
<p>The final step in a multi-file commit is to delete the
889
individual rollback journals and drop the exclusive locks on
890
the database files so that other processes can see the changes.
892
<a href="#section_3_12">step 3.12</a> in the single-file
895
<p>The transaction has already committed at this point so timing
896
is not critical in the deletion of the rollback journals.
897
The current implementation deletes a single rollback journal
898
then unlocks the corresponding database file before proceeding
899
to the next rollback journal. But in the future we might change
900
this so that all rollback journals are deleted before any database
901
files are unlocked. As long as the rollback journal is deleted before
902
its corresponding database file is unlocked it does not matter in what
903
order the rollback journals are deleted or the database files are
906
<a name="moredetail"></a>
908
<h2>6.0 Additional Details Of The Commit Process</h2>
910
<p><a href="#section_3_0">Section 3.0</a> above provides an overview of
911
how atomic commit works in SQLite. But it glosses over a number of
912
important details. The following subsections will attempt to fill
915
<a name="completesectors"></a>
917
<h3>6.1 Always Journal Complete Sectors</h3>
919
<p>When the original content of a database page is written into
920
the rollback journal (as shown in <a href="#section_3_5">section 3.5</a>),
921
SQLite always writes a complete sectors worth of data, even if the
922
page size of the database is smaller than the sector size.
923
Historically, the sector size in SQLite has been hard coded to 512
924
bytes and since the minimum page size is also 512 bytes, this has never
925
been an issue. But beginning with SQLite version 3.3.14, it is possible
926
for SQLite to use mass storage devices with a sector size larger than 512
927
bytes. So, beginning with version 3.3.14, whenever any page within a
928
sector is written into the journal file, all pages in that same sector
929
are stored with it.</p>
931
<p>It is important to store all pages of a sector in the rollback
932
journal in order to prevent database corruption following a power
933
loss while writing the sector. Suppose that pages 1, 2, 3, and 4 are
934
all stored in sector 1 and that page 2 is modified. In order to write
935
the changes to page 2, the underlying hardware must also rewrite the
936
content of pages 1, 3, and 4 since the hardware must write the complete
937
sector. If this write operation is interrupted by a power outage,
938
one or more of the pages 1, 3, or 4 might be left with incorrect data.
939
Hence, to avoid lasting corruption to the database, the original content
940
of all of those pages must be contained in the rollback journal.</p>
942
<a name="journalgarbage"></a>
944
<h3>6.2 Dealing With Garbage Written Into Journal Files</h3>
946
<p>When data is appended to the end of the rollback journal,
947
SQLite normally makes the pessimistic assumption that the file
948
is first extended with invalid "garbage" data and that afterwards
949
the correct data replaces the garbage. In other words, SQLite assumes
950
that the file size is increased first and then afterwards the content
951
is written into the file. If a power failure occurs after the file
952
size has been increased but before the file content has been written,
953
the rollback journal can be left containing garbage data. If after
954
power is restored, another SQLite process sees the rollback journal
955
containing the garbage data and tries to roll it back into the original
956
database file, it might copy some of the garbage into the database file
957
and thus corrupt the database file.</p>
959
<p>SQLite uses two defenses against this problem. In the first place,
960
SQLite records the number of pages in the rollback journal in the header
961
of the rollback journal. This number is initially zero. So during an
962
attempt to rollback an incomplete (and possibly corrupt) rollback
963
journal, the process doing the rollback will see that the journal
964
contains zero pages and will thus make no changes to the database. Prior
965
to a commit, the rollback journal is flushed to disk to ensure that
966
all content has been synced to disk and there is no "garbage" left
967
in the file, and only then is the page count in the header changed from
968
zero to true number of pages in the rollback journal. The rollback journal
969
header is always kept in a separate sector from any page data so that
970
it can be overwritten and flushed without risking damage to a data
971
page if a power outage occurs. Notice that the rollback journal
972
is flushed to disk twice: once to write the page content and a second
973
time to write the page count in the header.</p>
975
<p>The previous paragraph describes what happens when the
976
synchronous pragma setting is "full".</p>
979
PRAGMA synchronous=FULL;
982
<p>The default synchronous setting is full so the above is what usually
983
happens. However, if the synchronous setting is lowered to "normal",
984
SQLite only flushes the rollback journal once, after the page count has
986
This carries a risk of corruption because it might happen that the
987
modified (non-zero) page count reaches the disk surface before all
988
of the data does. The data will have been written first, but SQLite
989
assumes that the underlying filesystem can reorder write requests and
990
that the page count can be burned into oxide first even though its
991
write request occurred last. So as a second line of defense, SQLite
992
also uses a 32-bit checksum on every page of data in the rollback
993
journal. This checksum is evaluated for each page during rollback
994
while rolling back a journal as described in
995
<a href="#section_4_4">section 4.4</a>. If an incorrect checksum
996
is seen, the rollback is abandoned. Note that the checksum does
997
not guarantee that the page data is correct since there is a small
998
but finite probability that the checksum might be right even if the data is
999
corrupt. But the checksum does at least make such an error unlikely.
1002
<p>Note that the checksums in the rollback journal are not necessary
1003
if the synchronous setting is FULL. We only depend on the checksums
1004
when synchronous is lowered to NORMAL. Nevertheless, the checksums
1005
never hurt and so they are included in the rollback journal regardless
1006
of the synchronous setting.</p>
1008
<a name="cachespill"></a>
1010
<h3>6.3 Cache Spill Prior To Commit</h3>
1012
<p>The commit process shown in <a href="#section_3_0">section 3.0</a>
1013
assumes that all database changes fit in memory until it is time to
1014
commit. This is the common case. But sometimes a larger change will
1015
overflow the user-space cache prior to transaction commit. In those
1016
cases, the cache must spill to the database before the transaction
1019
<p>At the beginning of a cache spill, the status of the database
1020
connection is as shown in <a href="#section_3_6">step 3.6</a>.
1021
Original page content has been saved in the rollback journal and
1022
modifications of the pages exist in user memory. To spill the cache,
1023
SQLite executes steps <a href="#section_3_7">3.7</a> through
1024
<a href="#section_3_9">3.9</a>. In other words, the rollback journal
1025
is flushed to disk, an exclusive lock is acquired, and changes are
1026
written into the database. But the remaining steps are deferred
1027
until the transaction really commits. A new journal header is
1028
appended to the end of the rollback journal (in its own sector)
1029
and the exclusive database lock is retained, but otherwise processing
1030
returns to <a href="#section_3_6">step 3.6</a>. When the transaction
1031
commits, or if another cache spill occurs, steps
1032
<a href="#section_3_7">3.7</a> and <a href="#section_3_9">3.9</a> are
1033
repeated. (Step <a href="#section_3_8">3.8</a> is omitted on second
1034
and subsequent passes since an exclusive database lock is already held
1035
due to the first pass.)</p>
1037
<p>A cache spill causes the lock on the database file to
1038
escalate from reserved to exclusive. This reduces concurrency.
1039
A cache spill also causes extra disk flush or fsync operations to
1040
occur and these operations are slow, hence a cache spill can
1041
seriously reduce performance.
1042
For these reasons a cache spill is avoided whenever possible.</p>
1046
<h2>7.0 Optimizations</h2>
1048
<p>Profiling indicates that for most systems and in most circumstances
1049
SQLite spends most of its time doing disk I/O. It follows then that
1050
anything we can do to reduce the amount of disk I/O will likely have a
1051
large positive impact on the performance of SQLite. This section
1052
describes some of the techniques used by SQLite to try to reduce the
1053
amount of disk I/O to a minimum while still preserving atomic commit.</p>
1055
<a name="keepcache"></a>
1057
<h3>7.1 Cache Retained Between Transactions</h3>
1059
<p><a href="#section_3_12">Step 3.12</a> of the commit process shows
1060
that once the shared lock has been released, all user-space cache
1061
images of database content must be discarded. This is done because
1062
without a shared lock, other processes are free to modify the database
1063
file content and so any user-space image of that content might become
1064
obsolete. Consequently, each new transaction would begin by rereading
1065
data which had previously been read. This is not as bad as it sounds
1066
at first since the data being read is still likely in the operating
1067
systems file cache. So the "read" is really just a copy of data
1068
from kernel space into user space. But even so, it still takes time.</p>
1070
<p>Beginning with SQLite version 3.3.14 a mechanism has been added
1071
to try to reduce the needless rereading of data. In newer versions
1072
of SQLite, the data in the user-space pager cache is retained when
1073
the lock on the database file is released. Later, after the
1074
shared lock is acquired at the beginning of the next transaction,
1075
SQLite checks to see if any other process has modified the database
1076
file. If the database has been changed in any way since the lock
1077
was last released, the user-space cache is erased at that point.
1078
But commonly the database file is unchanged and the user-space cache
1079
can be retained, and some unnecessary read operations can be avoided.</p>
1081
<p>In order to determine whether or not the database file has changed,
1082
SQLite uses a counter in the database header (in bytes 24 through 27)
1083
which is incremented during every change operation. SQLite saves a copy
1084
of this counter prior to releasing its database lock. Then after
1085
acquiring the next database lock it compares the saved counter value
1086
against the current counter value and erases the cache if the values
1087
are different, or reuses the cache if they are the same.</p>
1089
<a name="section_7_2"></a>
1090
<h3>7.2 Exclusive Access Mode</h3>
1092
<p>SQLite version 3.3.14 adds the concept of "Exclusive Access Mode".
1093
In exclusive access mode, SQLite retains the exclusive
1094
database lock at the conclusion of each transaction. This prevents
1095
other processes for accessing the database, but in many deployments
1096
only a single process is using a database so this is not a
1097
serious problem. The advantage of exclusive access mode is that
1098
disk I/O can be reduced in three ways:</p>
1101
<li><p>It is not necessary to increment the change counter in the
1102
database header for transactions after the first transaction. This
1103
will often save a write of page one to both the rollback
1104
journal and the main database file.</p></li>
1106
<li><p>No other processes can change the database so there is never
1107
a need to check the change counter and clear the user-space cache
1108
at the beginning of a transaction.</p></li>
1110
<li><p>Each transaction can be committed by overwriting the rollback
1111
journal header with zeros rather than deleting the journal file.
1112
This avoids having to modify the directory entry for the journal file
1113
and it avoids having to deallocate disk sectors associated with the
1114
journal. Furthermore, the next transaction will overwrite existing
1115
journal file content rather than append new content and on most systems
1116
overwriting is much faster than appending.</p></li>
1119
<p>The third optimization, zeroing the journal file header rather than
1120
deleting the rollback journal file,
1121
does not depend on holding an exclusive lock at all times.
1122
This optimization can be set independently of exclusive lock mode
1123
using the <a href="pragma.html#pragma_journal_mode">journal_mode pragma</a>
1124
as described in <a href="#section_7_6">section 7.6</a> below.</p>
1126
<a name="freelistjrnl"></a>
1128
<h3>7.3 Do Not Journal Freelist Pages</h3>
1130
<p>When information is deleted from an SQLite database, the pages used
1131
to hold the deleted information are added to a "<a href="fileformat2.html#freelist">freelist</a>". Subsequent
1132
inserts will draw pages off of this freelist rather than expanding the
1135
<p>Some freelist pages contain critical data; specifically the locations
1136
of other freelist pages. But most freelist pages contain nothing useful.
1137
These latter freelist pages are called "leaf" pages. We are free to
1138
modify the content of a leaf freelist page in the database without
1139
changing the meaning of the database in any way.</p>
1141
<p>Because the content of leaf freelist pages is unimportant, SQLite
1142
avoids storing leaf freelist page content in the rollback journal
1143
in <a href="#section_3_5">step 3.5</a> of the commit process.
1144
If a leaf freelist page is changed and that change does not get rolled back
1145
during a transaction recovery, the database is not harmed by the omission.
1146
Similarly, the content of a new freelist page is never written back
1147
into the database at <a href="#section_3_9">step 3.9</a> nor
1148
read from the database at <a href="#section_3_3">step 3.3</a>.
1149
These optimizations can greatly reduce the amount of I/O that occurs
1150
when making changes to a database file that contains free space.</p>
1152
<a name="atomicsector"></a>
1154
<h3>7.4 Single Page Updates And Atomic Sector Writes</h3>
1156
<p>Beginning in SQLite version 3.5.0, the new Virtual File System (VFS)
1157
interface contains a method named xDeviceCharacteristics which reports
1158
on special properties that the underlying mass storage device
1159
might have. Among the special properties that
1160
xDeviceCharacteristics might report is the ability of to do an
1161
atomic sector write.</p>
1163
<p>Recall that by default SQLite assumes that sector writes are
1164
linear but not atomic. A linear write starts at one end of the
1165
sector and changes information byte by byte until it gets to the
1166
other end of the sector. If a power loss occurs in the middle of
1167
a linear write then part of the sector might be modified while the
1168
other end is unchanged. In an atomic sector write, either the entire
1169
sector is overwritten or else nothing in the sector is changed.</p>
1171
<p>We believe that most modern disk drives implement atomic sector
1172
writes. When power is lost, the drive uses energy stored in capacitors
1173
and/or the angular momentum of the disk platter to provide power to
1174
complete any operation in progress. Nevertheless, there are so many
1175
layers in between the write system call and the on-board disk drive
1176
electronics that we take the safe approach in both Unix and w32 VFS
1177
implementations and assume that sector writes are not atomic. On the
1179
manufactures with more control over their filesystems might want
1180
to consider enabling the atomic write property of xDeviceCharacteristics
1181
if their hardware really does do atomic writes.</p>
1183
<p>When sector writes are atomic and the page size of a database is
1184
the same as a sector size, and when there is a database change that
1185
only touches a single database page, then SQLite skips the whole
1186
journaling and syncing process and simply writes the modified page
1187
directly into the database file. The change counter in the first
1188
page of the database file is modified separately since no harm is
1189
done if power is lost before the change counter can be updated.</p>
1191
<a name="safeappend"></a>
1193
<h3>7.5 Filesystems With Safe Append Semantics</h3>
1195
<p>Another optimization introduced in SQLite version 3.5.0 makes
1196
use of "safe append" behavior of the underlying disk.
1197
Recall that SQLite assumes that when data is appended to a file
1198
(specifically to the rollback journal) that the size of the file
1199
is increased first and that the content is written second. So
1200
if power is lost after the file size is increased but before the
1201
content is written, the file is left containing invalid "garbage"
1202
data. The xDeviceCharacteristics method of the VFS might, however,
1203
indicate that the filesystem implements "safe append" semantics.
1204
This means that the content is written before the file size is
1205
increased so that it is impossible for garbage to be introduced
1206
into the rollback journal by a power loss or system crash.</p>
1208
<p>When safe append semantics are indicated for a filesystem,
1209
SQLite always stores the special value of -1 for the page count
1210
in the header of the rollback journal. The -1 page count value
1211
tells any process attempting to rollback the journal that the
1212
number of pages in the journal should be computed from the journal
1213
size. This -1 value is never changed. So that when a commit
1214
occurs, we save a single flush operation and a sector write of
1215
the first page of the journal file. Furthermore, when a cache
1216
spill occurs we no longer need to append a new journal header
1217
to the end of the journal; we can simply continue appending
1218
new pages to the end of the existing journal.</p>
1220
<a name="section_7_6"></a>
1221
<h3>7.6 Persistent Rollback Journals</h3>
1223
<p>Deleting a file is an expensive operation on many systems.
1224
So as an optimization, SQLite can be configured to avoid the
1225
delete operation of <a href="#section_3_11">section 3.11</a>.
1226
Instead of deleting the journal file in order to commit a transaction,
1227
the file is either truncated to zero bytes in length or its
1228
header is overwritten with zeros. Truncating the file to zero
1229
length saves having to make modifications to the directory containing
1230
the file since the file is not removed from the directory.
1231
Overwriting the header has the additional savings of not having
1232
to update the length of the file (in the "inode" on many systems)
1233
and not having to deal with newly freed disk sectors. Furthermore,
1234
at the next transaction the journal will be created by overwriting
1235
existing content rather than appending new content onto the end
1236
of a file, and overwriting is often much faster than appending.</p>
1238
<p>SQLite can be configured to commit transactions by overwriting
1239
the journal header with zeros instead of deleting the journal file
1240
by setting the "PERSIST" journaling mode using the
1241
<a href="pragma.html#pragma_journal_mode">journal_mode</a> PRAGMA.
1245
PRAGMA journal_mode=PERSIST;
1248
<p>The use of persistent journal mode provide a noticeable performance
1249
improvement on many systems. Of course, the drawback is that the
1250
journal files remain on the disk, using disk space and cluttering
1251
directories, long after the transaction commits. The only safe way
1252
to delete a persistent journal file is to commit a transaction
1253
with journaling mode set to DELETE:</p>
1256
PRAGMA journal_mode=DELETE;
1261
<p>Beware of deleting persistent journal files by any other means
1262
since the journal file might be hot, in which case deleting it will
1263
corrupt the corresponding database file.</p>
1265
<p>Beginning in SQLite version 3.6.4, the TRUNCATE journal mode is
1269
PRAGMA journal_mode=TRUNCATE;
1272
<p>In truncate journal mode, the transaction is committed by truncating
1273
the journal file to zero length rather than deleting the journal file
1274
(as in DELETE mode) or by zeroing the header (as in PERSIST mode).
1275
TRUNCATE mode shares the advantage of PERSIST mode that the directory
1276
that contains the journal file and database does not need to be updated.
1277
Hence truncating a file is often faster than deleting it. TRUNCATE has
1278
the additional advantage that it is not followed by a
1279
system call (ex: fsync()) to synchronize the change to disk. It might
1281
But on many modern filesystems, a truncate is an atomic and
1282
synchronous operation and so we think that TRUNCATE will usually be safe
1283
in the face of power failures. If you are uncertain about whether or
1284
not TRUNCATE will be synchronous and atomic on your filesystem and it is
1285
important to you that your database survive a power loss or operating
1286
system crash that occurs during the truncation operation, then you might
1287
consider using a different journaling mode.</p>
1289
<p>On embedded systems with synchronous filesystems, TRUNCATE results
1290
in slower behavior than PERSIST. The commit operation is the same speed.
1291
But subsequent transactions are slower following a TRUNCATE because it is
1292
faster to overwrite existing content than to append to the end of a file.
1293
New journal file entries will always be appended following a TRUNCATE but
1294
will usually overwrite with PERSIST.</p>
1296
<a name="testing"></a>
1298
<h2>8.0 Testing Atomic Commit Behavior</h2>
1300
<p>The developers of SQLite are confident that it is robust
1301
in the face of power failures and system crashes because the
1302
automatic test procedures do extensive checks on
1303
the ability of SQLite to recover from simulated power loss.
1304
We call these the "crash tests".</p>
1306
<p>Crash tests in SQLite use a modified VFS that can simulate
1307
the kinds of filesystem damage that occur during a power
1308
loss or operating system crash. The crash-test VFS can simulate
1309
incomplete sector writes, pages filled with garbage data because
1310
a write has not completed, and out of order writes, all occurring
1311
at varying points during a test scenario. Crash tests execute
1312
transactions over and over, varying the time at which a simulated
1313
power loss occurs and the properties of the damage inflicted.
1314
Each test then reopens the database after the simulated crash and
1315
verifies that the transaction either occurred completely
1316
or not at all and that the database is in a completely
1317
consistent state.</p>
1319
<p>The crash tests in SQLite have discovered a number of very
1320
subtle bugs (now fixed) in the recovery mechanism. Some of
1321
these bugs were very obscure and unlikely to have been found
1322
using only code inspection and analysis techniques. From this
1323
experience, the developers of SQLite feel confident that any other
1324
database system that does not use a similar crash test system
1325
likely contains undetected bugs that will lead to database
1326
corruption following a system crash or power failure.</p>
1328
<a name="sect_9_0"></a>
1330
<h2>9.0 Things That Can Go Wrong</h2>
1332
<p>The atomic commit mechanism in SQLite has proven to be robust,
1333
but it can be circumvented by a sufficiently creative
1334
adversary or a sufficiently broken operating system implementation.
1335
This section describes a few of the ways in which an SQLite database
1336
might be corrupted by a power failure or system crash.
1337
(See also: <a href="lockingv3.html#how_to_corrupt">How To Corrupt Your Database Files</a>.)</p>
1339
<a name="brokenlocks"></a>
1341
<h3>9.1 Broken Locking Implementations</h3>
1343
<p>SQLite uses filesystem locks to make sure that only one
1344
process and database connection is trying to modify the database
1345
at a time. The filesystem locking mechanism is implemented
1346
in the VFS layer and is different for every operating system.
1347
SQLite depends on this implementation being correct. If something
1348
goes wrong and two or more processes are able to write the same
1349
database file at the same time, severe damage can result.</p>
1351
<p>We have received reports of implementations of both
1352
Windows network filesystems and NFS in which locking was
1353
subtly broken. We can not verify these reports, but as
1354
locking is difficult to get right on a network filesystem
1355
we have no reason to doubt them. You are advised to
1356
avoid using SQLite on a network filesystem in the first place,
1357
since performance will be slow. But if you must use a
1358
network filesystem to store SQLite database files, consider
1359
using a secondary locking mechanism to prevent simultaneous
1360
writes to the same database even if the native filesystem
1361
locking mechanism malfunctions.</p>
1363
<p>The versions of SQLite that come preinstalled on Apple
1364
Mac OS X computers contain a version of SQLite that has been
1365
extended to use alternative locking strategies that work on
1366
all network filesystems that Apple supports. These extensions
1367
used by Apple work great as long as all processes are accessing
1368
the database file in the same way. Unfortunately, the locking
1369
mechanisms do not exclude one another, so if one process is
1370
accessing a file using (for example) AFP locking and another
1371
process (perhaps on a different machine) is using dot-file locks,
1372
the two processes might collide because AFP locks do not exclude
1373
dot-file locks or vice versa.</p>
1375
<a name="fsync"></a>
1377
<h3>9.2 Incomplete Disk Flushes</h3>
1379
<p>SQLite uses the fsync() system call on Unix and the FlushFileBuffers()
1380
system call on w32 in order to sync the file system buffers onto disk
1381
oxide as shown in <a href="#section_3_7">step 3.7</a> and
1382
<a href="#section_3_10">step 3.10</a>. Unfortunately, we have received
1383
reports that neither of these interfaces works as advertised on many
1384
systems. We hear that FlushFileBuffers() can be completely disabled
1385
using registry settings on some Windows versions. Some historical
1386
versions of Linux contain versions of fsync() which are no-ops on
1387
some filesystems, we are told. Even on systems where
1388
FlushFileBuffers() and fsync() are said to be working, often
1389
the IDE disk control lies and says that data has reached oxide
1390
while it is still held only in the volatile control cache.</p>
1392
<p>On the Mac, you can set this pragma:</p>
1395
PRAGMA fullfsync=ON;
1398
<p>Setting fullfsync on a Mac will guarantee that data really does
1399
get pushed out to the disk platter on a flush. But the implementation
1400
of fullfsync involves resetting the disk controller. And so not only
1401
is it profoundly slow, it also slows down other unrelated disk I/O.
1402
So its use is not recommended.</p>
1404
<a name="filedel"></a>
1406
<h3>9.3 Partial File Deletions</h3>
1408
<p>SQLite assumes that file deletion is an atomic operation from the
1409
point of view of a user process. If power fails in the middle of
1410
a file deletion, then after power is restored SQLite expects to see
1411
either the entire file with all of its original data intact, or it
1412
expects not to find the file at all. Transactions may not be atomic
1413
on systems that do not work this way.</p>
1415
<a name="filegarbage"></a>
1417
<h3>9.4 Garbage Written Into Files</h3>
1419
<p>SQLite database files are ordinary disk files that can be
1420
opened and written by ordinary user processes. A rogue process
1421
can open an SQLite database and fill it with corrupt data.
1422
Corrupt data might also be introduced into an SQLite database
1423
by bugs in the operating system or disk controller; especially
1424
bugs triggered by a power failure. There is nothing SQLite can
1425
do to defend against these kinds of problems.</p>
1427
<a name="mvhotjrnl"></a>
1429
<h3>9.5 Deleting Or Renaming A Hot Journal</h3>
1431
<p>If a crash or power loss does occur and a hot journal is left on
1432
the disk, it is essential that the original database file and the hot
1433
journal remain on disk with their original names until the database
1434
file is opened by another SQLite process and rolled back.
1435
During recovery at <a href="section_4_2">step 4.2</a> SQLite locates
1436
the hot journal by looking for a file in the same directory as the
1437
database being opened and whose name is derived from the name of the
1438
file being opened. If either the original database file or the
1439
hot journal have been moved or renamed, then the hot journal will
1440
not be seen and the database will not be rolled back.</p>
1442
<p>We suspect that a common failure mode for SQLite recovery happens
1443
like this: A power failure occurs. After power is restored, a well-meaning
1444
user or system administrator begins looking around on the disk for
1445
damage. They see their database file named "important.data". This file
1446
is perhaps familiar to them. But after the crash, there is also a
1447
hot journal named "important.data-journal". The user then deletes
1448
the hot journal, thinking that they are helping to cleanup the system.
1449
We know of no way to prevent this other than user education.</p>
1451
<p>If there are multiple (hard or symbolic) links to a database file,
1452
the journal will be created using the name of the link through which
1453
the file was opened. If a crash occurs and the database is opened again
1454
using a different link, the hot journal will not be located and no
1455
rollback will occur.</p>
1457
<p>Sometimes a power failure will cause a filesystem to be corrupted
1458
such that recently changed filenames are forgotten and the file is
1459
moved into a "/lost+found" directory. When that happens, the hot
1460
journal will not be found and recovery will not occur.
1461
SQLite tries to prevent this
1462
by opening and syncing the directory containing the rollback journal
1463
at the same time it syncs the journal file itself. However, the
1464
movement of files into /lost+found can be caused by unrelated processes
1465
creating unrelated files in the same directory as the main database file.
1466
And since this is out from under the control of SQLite, there is nothing
1467
that SQLite can do to prevent it. If you are running on a system that
1468
is vulnerable to this kind of filesystem namespace corruption (most
1469
modern journalling filesystems are immune, we believe) then you might
1470
want to consider putting each SQLite database file in its own private
1473
<a name="future"></a>
1475
<h2>10.0 Future Directions And Conclusion</h2>
1477
<p>Every now and then someone discovers a new failure mode for
1478
the atomic commit mechanism in SQLite and the developers have to
1479
put in a patch. This is happening less and less and the
1480
failure modes are becoming more and more obscure. But it would
1481
still be foolish to suppose that the atomic commit logic of
1482
SQLite is entirely bug-free. The developers are committed to fixing
1483
these bugs as quickly as they might be found.</p>
1486
The developers are also on the lookout for new ways to
1487
optimize the commit mechanism. The current VFS implementations
1488
for Unix (Linux and Mac OS X) and Windows make pessimistic assumptions about
1489
the behavior of those systems. After consultation with experts
1490
on how these systems work, we might be able to relax some of the
1491
assumptions on these systems and allow them to run faster. In
1492
particular, we suspect that most modern filesystems exhibit the
1493
safe append property and that many of them might support atomic
1494
sector writes. But until this is known for certain, SQLite will
1495
take the conservative approach and assume the worst.</p>