$PostgreSQL: pgsql/src/backend/storage/lmgr/README,v 1.15 2004-08-27 17:07:41 tgl Exp $
Postgres uses three types of interprocess locks:

* Spinlocks.  These are intended for *very* short-term locks.  If a lock
is to be held more than a few dozen instructions, or across any sort of
kernel call (or even a call to a nontrivial subroutine), don't use a
spinlock.  Spinlocks are primarily used as infrastructure for lightweight
locks.  They are implemented using a hardware atomic-test-and-set
instruction, if available.  Waiting processes busy-loop until they can
get the lock.  There is no provision for deadlock detection, automatic
release on error, or any other nicety.  There is a timeout if the lock
cannot be gotten after a minute or so (which is approximately forever in
comparison to the intended lock hold time, so this is certainly an error
condition.)
* Lightweight locks (LWLocks).  These locks are typically used to
interlock access to datastructures in shared memory.  LWLocks support
both exclusive and shared lock modes (for read/write and read-only
access to a shared object).  There is no provision for deadlock
detection, but the LWLock manager will automatically release held
LWLocks during elog() recovery, so it is safe to raise an error while
holding LWLocks.  Obtaining or releasing an LWLock is quite fast (a few
dozen instructions) when there is no contention for the lock.  When a
process has to wait for an LWLock, it blocks on a SysV semaphore so as
to not consume CPU time.  Waiting processes will be granted the lock in
arrival order.  There is no timeout.
* Regular locks (a/k/a heavyweight locks).  The regular lock manager
supports a variety of lock modes with table-driven semantics, and it has
full deadlock detection and automatic release at transaction end.
Regular locks should be used for all user-driven lock requests.
Acquisition of either a spinlock or a lightweight lock causes query
cancel and die() interrupts to be held off until all such locks are
released.  No such restriction exists for regular locks, however.  Also
note that we can accept query cancel and die() interrupts while waiting
for a regular lock, but we will not accept them while waiting for
spinlocks or LW locks.  It is therefore not a good idea to use LW locks
when the wait time might exceed a few seconds.

The rest of this README file discusses the regular lock manager in detail.
---------------------------------------------------------------------------

Lock methods describe the overall locking behavior.  Currently there are
two lock methods: DEFAULT and USER.  (USER locks are non-blocking.)

Lock modes describe the type of the lock (read/write or shared/exclusive).
See src/tools/backend/index.html and src/include/storage/lock.h for more
details.
There are two fundamental lock structures in shared memory: the
per-lockable-object LOCK struct, and the per-lock-and-requestor PROCLOCK
struct.  A LOCK object exists for each lockable object that currently has
locks held or requested on it.  A PROCLOCK struct exists for each transaction
that is holding or requesting lock(s) on each LOCK object.
In addition to these, each backend maintains an unshared LOCALLOCK structure
for each lockable object and lock mode that it is currently holding or
requesting.  The shared lock structures only allow a single lock grant to
be made per lockable object/lock mode/transaction.  Internally to a backend,
however, the same lock may be requested and perhaps released multiple times
in a transaction.  The internal request counts are held in LOCALLOCK so that
the shared LockMgrLock need not be obtained to alter them.
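The per-backend counting described above can be sketched as follows.  This is only an illustration of the idea, not the real LOCALLOCK declaration in src/include/storage/lock.h; the struct and function names here are invented for the example:

```c
#include <stdbool.h>

/* Invented, reduced stand-in for LOCALLOCK: only the request count is shown. */
typedef struct LocalLockSketch
{
    int nLocks;     /* number of times this backend holds this lock/mode */
} LocalLockSketch;

/* Count an acquisition; returns true only when the shared lock table
 * must actually be consulted (first acquisition by this backend). */
static bool
local_lock_acquire(LocalLockSketch *ll)
{
    ll->nLocks++;
    return ll->nLocks == 1;
}

/* Count a release; returns true only when the shared lock table must
 * actually be updated (the last local hold just went away). */
static bool
local_lock_release(LocalLockSketch *ll)
{
    ll->nLocks--;
    return ll->nLocks == 0;
}
```

Repeated acquisitions and releases in between touch only backend-local memory, which is the point of keeping these counts outside shared memory.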
---------------------------------------------------------------------------

The lock manager's LOCK objects contain:
tag -
    The key fields that are used for hashing locks in the shared memory
    lock hash table.  This is declared as a separate struct to ensure that
    we always zero out the correct number of bytes.  It is critical that
    any alignment-padding bytes the compiler might insert in the struct
    be zeroed out, else the hash computation will be random.
    tag.relId -
        Uniquely identifies the relation that the lock corresponds to.
    tag.dbId -
        Uniquely identifies the database in which the relation lives.  If
        this is a shared system relation (e.g. pg_database) the dbId must
        be set to 0.
    tag.objId -
        Uniquely identifies the block/page within the relation and the
        tuple within the block.  If we are setting a table level lock
        both the blockId and tupleId (in an item pointer this is called
        the position) are set to invalid; if it is a page level lock the
        blockId is valid, while the tupleId is still invalid.  Finally, if
        this is a tuple level lock (we currently never do this) then both
        the blockId and tupleId are set to valid specifications.  This is
        how we get the appearance of a multi-level lock table while using
        only a single table (see Gray's paper on 2 phase locking if
        you are puzzled about how multi-level lock tables work).
grantMask -
    This bitmask indicates what types of locks are currently held on the
    given lockable object.  It is used (against the lock table's conflict
    table) to determine if a new lock request will conflict with existing
    lock types held.  Conflicts are determined by bitwise AND operations
    between the grantMask and the conflict table entry for the requested
    lock type.  Bit i of grantMask is 1 if and only if granted[i] > 0.
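The bitwise-AND conflict test can be sketched like this.  The two-mode toy conflict table and the mode numbering below are purely hypothetical; the real conflict tables are defined in src/backend/storage/lmgr/lock.c:

```c
#include <stdbool.h>

typedef int LOCKMODE;
typedef int LOCKMASK;

#define LOCKBIT_ON(lockmode) (1 << (lockmode))

/* conflictTab[i] is a bitmask of the lock modes that conflict with mode i. */
static bool
lock_conflicts(const LOCKMASK *conflictTab, LOCKMODE mode, LOCKMASK grantMask)
{
    /* A request conflicts iff some already-granted mode's bit is set in
     * the requested mode's conflict-table entry. */
    return (conflictTab[mode] & grantMask) != 0;
}

/* Toy two-mode table (hypothetical numbering): mode 1 = shared,
 * mode 2 = exclusive.  Shared conflicts only with exclusive;
 * exclusive conflicts with both. */
static const LOCKMASK toyConflictTab[3] = {
    0,
    LOCKBIT_ON(2),                  /* shared */
    LOCKBIT_ON(1) | LOCKBIT_ON(2)   /* exclusive */
};
```

Note that a granted shared lock does not block another shared request, while an exclusive request is blocked by anything already granted.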
waitMask -
    This bitmask shows the types of locks being waited for.  Bit i of waitMask
    is 1 if and only if requested[i] > granted[i].
procLocks -
    This is a shared memory queue of all the PROCLOCK structs associated with
    the lock object.  Note that both granted and waiting PROCLOCKs are in this
    list (indeed, the same PROCLOCK might have some already-granted locks and
    be waiting for more!).
waitProcs -
    This is a shared memory queue of all process structures corresponding to
    a backend that is waiting (sleeping) until another backend releases this
    lock.  The process structure holds the information needed to determine
    if it should be woken up when this lock is released.
nRequested -
    Keeps a count of how many times this lock has been attempted to be
    acquired.  The count includes attempts by processes which were put
    to sleep due to conflicts.  It also counts the same backend twice
    if, for example, a backend process first acquires a read and then
    acquires a write, or acquires the lock under two different transaction
    IDs.  (But multiple acquisitions of the same lock/lock mode under the
    same transaction ID are not multiply counted here; they are recorded
    only in the backend's LOCALLOCK structure.)
requested -
    Keeps a count of how many locks of each type have been attempted.  Only
    elements 1 through MAX_LOCKMODES-1 are used as they correspond to the lock
    type defined constants.  Summing the values of requested[] should come out
    equal to nRequested.
nGranted -
    Keeps count of how many times this lock has been successfully acquired.
    This count does not include attempts that are waiting due to conflicts.
    Otherwise the counting rules are the same as for nRequested.
granted -
    Keeps count of how many locks of each type are currently held.  Once again
    only elements 1 through MAX_LOCKMODES-1 are used (0 is not).  Also, like
    requested, summing the values of granted should total to the value
    of nGranted.

We should always have 0 <= nGranted <= nRequested, and
0 <= granted[i] <= requested[i] for each i.  If the request counts go to
zero, the lock object is no longer needed and can be freed.
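These invariants can be written directly as assertions.  The struct below is an invented, reduced stand-in for the count fields of LOCK, and the MAX_LOCKMODES value here is illustrative only (see lock.h for the real definitions):

```c
#include <assert.h>

#define MAX_LOCKMODES 10    /* illustrative value for this sketch */

/* Invented, reduced stand-in for the count fields of a LOCK object. */
typedef struct LockCountsSketch
{
    int nRequested;
    int requested[MAX_LOCKMODES];
    int nGranted;
    int granted[MAX_LOCKMODES];
} LockCountsSketch;

/* Check the consistency conditions stated in the text above. */
static void
check_lock_counts(const LockCountsSketch *lock)
{
    int sumReq = 0, sumGrant = 0;

    assert(0 <= lock->nGranted && lock->nGranted <= lock->nRequested);
    for (int i = 1; i < MAX_LOCKMODES; i++)
    {
        assert(0 <= lock->granted[i] && lock->granted[i] <= lock->requested[i]);
        sumReq += lock->requested[i];
        sumGrant += lock->granted[i];
    }
    /* element 0 is unused, so the sums over 1..MAX_LOCKMODES-1 must match */
    assert(sumReq == lock->nRequested);
    assert(sumGrant == lock->nGranted);
}
```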
---------------------------------------------------------------------------

The lock manager's PROCLOCK objects contain:
tag -
    The key fields that are used for hashing entries in the shared memory
    PROCLOCK hash table.  This is declared as a separate struct to ensure that
    we always zero out the correct number of bytes.
    tag.lock -
        SHMEM offset of the LOCK object this PROCLOCK is for.
    tag.proc -
        SHMEM offset of PROC of backend process that owns this PROCLOCK.
    tag.xid -
        XID of transaction this PROCLOCK is for, or InvalidTransactionId
        if the PROCLOCK is for session-level locking.

Note that this structure will support multiple transactions running
concurrently in one backend.  Currently we do not use it for that
purpose: subtransactions acquire locks in the name of their top parent
transaction, to simplify reassigning lock ownership at subtransaction end.
So the XID field is really only needed to distinguish per-transaction
locks from session locks.  User locks are always session locks, and we
also use session locks for multi-transaction operations like VACUUM.
holding -
    A bitmask for the lock types successfully acquired by this PROCLOCK.
    This should be a subset of the LOCK object's grantMask, and also a
    subset of the PGPROC object's heldLocks mask.
lockLink -
    List link for shared memory queue of all the PROCLOCK objects for the
    same LOCK.
procLink -
    List link for shared memory queue of all the PROCLOCK objects for the
    same backend process.
---------------------------------------------------------------------------

The deadlock detection algorithm:
Since we allow user transactions to request locks in any order, deadlock
is possible.  We use a deadlock detection/breaking algorithm that is
fairly standard in essence, but there are many special considerations
needed to deal with Postgres' generalized locking model.
A key design consideration is that we want to make routine operations
(lock grant and release) run quickly when there is no deadlock, and
avoid the overhead of deadlock handling as much as possible.  We do this
using an "optimistic waiting" approach: if a process cannot acquire the
lock it wants immediately, it goes to sleep without any deadlock check.
But it also sets a delay timer, with a delay of DeadlockTimeout
milliseconds (typically set to one second).  If the delay expires before
the process is granted the lock it wants, it runs the deadlock
detection/breaking code.  Normally this code will determine that there is
no deadlock condition, and then the process will go back to sleep and
wait quietly until it is granted the lock.  But if a deadlock condition
does exist, it will be resolved, usually by aborting the detecting
process' transaction.  In this way, we avoid deadlock handling overhead
whenever the wait time for a lock is less than DeadlockTimeout, while
not imposing an unreasonable delay of detection when there is an error.
Lock acquisition (routines LockAcquire and ProcSleep) follows these rules:

1. A lock request is granted immediately if it does not conflict with
any existing or waiting lock request, or if the process already holds an
instance of the same lock type (eg, there's no penalty to acquire a read
lock twice).  Note that a process never conflicts with itself, eg one
can obtain read lock when one already holds exclusive lock.

2. Otherwise the process joins the lock's wait queue.  Normally it will
be added to the end of the queue, but there is an exception: if the
process already holds locks on this same lockable object that conflict
with the request of any pending waiter, then the process will be
inserted in the wait queue just ahead of the first such waiter.  (If we
did not make this check, the deadlock detection code would adjust the
queue order to resolve the conflict, but it's relatively cheap to make
the check in ProcSleep and avoid a deadlock timeout delay in this case.)
Note special case when inserting before the end of the queue: if the
process's request does not conflict with any existing lock nor any
waiting request before its insertion point, then go ahead and grant the
lock without waiting.
When a lock is released, the lock release routine (ProcLockWakeup) scans
the lock object's wait queue.  Each waiter is awoken if (a) its request
does not conflict with already-granted locks, and (b) its request does
not conflict with the requests of prior un-wakable waiters.  Rule (b)
ensures that conflicting requests are granted in order of arrival.  There
are cases where a later waiter must be allowed to go in front of
conflicting earlier waiters to avoid deadlock, but it is not
ProcLockWakeup's responsibility to recognize these cases; instead, the
deadlock detection code will re-order the wait queue when necessary.
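The wakeup scan under rules (a) and (b) can be sketched as below.  The structures are invented and reduced for the example: each waiter carries precomputed request and conflict masks, whereas the real code derives these from the lock method's conflict table:

```c
#include <stdbool.h>

#define MAX_WAITERS 8   /* toy bound for this sketch */

typedef int LOCKMASK;

/* Invented, reduced waiter entry for illustrating the scan. */
typedef struct WaiterSketch
{
    LOCKMASK requestMask;   /* bit of the mode being requested */
    LOCKMASK conflictMask;  /* modes that conflict with the request */
} WaiterSketch;

/* Scan the queue front to back; waiter i is wakable if its request
 * conflicts neither with grantMask (rule a) nor with the requests of
 * earlier waiters that could not be woken (rule b).  A woken waiter's
 * mode becomes granted, so later waiters must not conflict with it
 * either.  Sets wake[i] and returns how many processes to wake. */
static int
wakeup_scan(const WaiterSketch *queue, int n, LOCKMASK grantMask, bool *wake)
{
    LOCKMASK aheadRequests = 0;     /* requests of un-wakable prior waiters */
    int      nwoken = 0;

    for (int i = 0; i < n; i++)
    {
        if ((queue[i].conflictMask & (grantMask | aheadRequests)) == 0)
        {
            wake[i] = true;
            grantMask |= queue[i].requestMask;   /* now granted */
            nwoken++;
        }
        else
        {
            wake[i] = false;
            aheadRequests |= queue[i].requestMask;
        }
    }
    return nwoken;
}
```

For instance, with an exclusive waiter ahead of a shared waiter, rule (b) keeps the shared waiter asleep even when its request would not conflict with the currently granted locks.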
To perform deadlock checking, we use the standard method of viewing the
various processes as nodes in a directed graph (the waits-for graph or
WFG).  There is a graph edge leading from process A to process B if A
waits for B, ie, A is waiting for some lock and B holds a conflicting
lock.  There is a deadlock condition if and only if the WFG contains a
cycle.  We detect cycles by searching outward along waits-for edges to
see if we return to our starting point.  There are three possible
outcomes:
1. All outgoing paths terminate at a running process (which has no
outgoing edges).
2. A deadlock is detected by looping back to the start point.  We
resolve such a deadlock by canceling the start point's lock request and
reporting an error in that transaction, which normally leads to
transaction abort and release of that transaction's held locks.  Note
that it's sufficient to cancel one request to remove the cycle; we don't
need to kill all the transactions involved.
3. Some path(s) loop back to a node other than the start point.  This
indicates a deadlock, but one that does not involve our starting
process.  We ignore this condition on the grounds that resolving such a
deadlock is the responsibility of the processes involved --- killing our
start-point process would not resolve the deadlock.  So, cases 1 and 3
both report "no deadlock".
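The outward search can be sketched as a small recursive scan over an adjacency matrix.  This is an illustration only (the real FindLockCycle() in deadlock.c walks PROCLOCK lists and additionally classifies soft versus hard edges); note how a loop back to a non-start node is deliberately not reported, per case 3:

```c
#include <stdbool.h>

#define MAX_PROCS 8     /* toy bound; the real code sizes by MaxBackends */

/* edge[a][b] is true if process a waits for process b. */
static bool
path_reaches_start(bool edge[MAX_PROCS][MAX_PROCS], int nprocs,
                   int start, int cur, bool *visited)
{
    for (int next = 0; next < nprocs; next++)
    {
        if (!edge[cur][next])
            continue;
        if (next == start)
            return true;        /* looped back: deadlock involving start */
        if (!visited[next])
        {
            visited[next] = true;
            if (path_reaches_start(edge, nprocs, start, next, visited))
                return true;
        }
        /* a loop back to a non-start node is ignored (case 3 above) */
    }
    return false;
}

/* Returns true iff the WFG contains a cycle through 'start'. */
static bool
find_lock_cycle(bool edge[MAX_PROCS][MAX_PROCS], int nprocs, int start)
{
    bool visited[MAX_PROCS] = {false};

    visited[start] = true;
    return path_reaches_start(edge, nprocs, start, start, visited);
}
```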
Postgres' situation is a little more complex than the standard discussion
of deadlock detection, for two reasons:
1. A process can be waiting for more than one other process, since there
might be multiple PROCLOCKs of (non-conflicting) lock types that all
conflict with the waiter's request.  This creates no real difficulty
however; we simply need to be prepared to trace more than one outgoing
WFG edge.
2. If a process A is behind a process B in some lock's wait queue, and
their requested locks conflict, then we must say that A waits for B, since
ProcLockWakeup will never awaken A before B.  This creates additional
edges in the WFG.  We call these "soft" edges, as opposed to the "hard"
edges induced by locks already held.  Note that if B already holds any
locks conflicting with A's request, then their relationship is a hard edge
too.
A "soft" block, or wait-priority block, has the same potential for
inducing deadlock as a hard block.  However, we may be able to resolve
a soft block without aborting the transactions involved: we can instead
rearrange the order of the wait queue.  This rearrangement reverses the
direction of the soft edge between two processes with conflicting requests
whose queue order is reversed.  If we can find a rearrangement that
eliminates a cycle without creating new ones, then we can avoid an abort.
Checking for such possible rearrangements is the trickiest part of the
algorithm.
The workhorse of the deadlock detector is a routine FindLockCycle() which
is given a starting point process (which must be a waiting process).
It recursively scans outward across waits-for edges as discussed above.
If it finds no cycle involving the start point, it returns "false".
(As discussed above, we can ignore cycles not involving the start point.)
When such a cycle is found, FindLockCycle() returns "true", and as it
unwinds it also builds a list of any "soft" edges involved in the cycle.
If the resulting list is empty then there is a hard deadlock and the
configuration cannot succeed.  However, if the list is not empty, then
reversing any one of the listed edges through wait-queue rearrangement
will eliminate that cycle.  Since such a reversal might create cycles
elsewhere, we may need to try every possibility.  Therefore, we need to
be able to invoke FindLockCycle() on hypothetical configurations (wait
orders) as well as the current real order.
The easiest way to handle this seems to be to have a lookaside table that
shows the proposed new queue order for each wait queue that we are
considering rearranging.  This table is checked by FindLockCycle, and it
believes the proposed queue order rather than the real order for each lock
that has an entry in the lookaside table.
We build a proposed new queue order by doing a "topological sort" of the
existing entries.  Each soft edge that we are currently considering
reversing creates a property of the partial order that the topological sort
has to enforce.  We must use a sort method that preserves the input
ordering as much as possible, so as not to gratuitously break arrival
order for processes not involved in a deadlock.  (This is not true of the
tsort method shown in Knuth, for example, but it's easily done by a simple
doubly-nested-loop method that emits the first legal candidate at each
step.  Fortunately, we don't need a highly efficient sort algorithm, since
the number of partial order constraints is not likely to be large.)  Note
that failure of the topological sort tells us we have conflicting ordering
constraints, and therefore that the last-added soft edge reversal
conflicts with a prior edge reversal.  We need to detect this case to
avoid an infinite loop in the case where no possible rearrangement will
work: otherwise, we might try a reversal, find that it still leads to
a cycle, then try to un-reverse the reversal while trying to get rid of
that cycle, etc etc.  Topological sort failure tells us the un-reversal
is not a legitimate move in this context.
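One way such an order-preserving sort can look is sketched below, under the assumption (invented for this example) that queue entries are numbered by input position and each constraint names a pair of them.  This is not the actual TopoSort() in deadlock.c: the sketch fills the output back to front, at each step emitting the latest unplaced input element all of whose required successors are already placed, and reports contradictory constraints as sort failure, as described above:

```c
#include <stdbool.h>

#define MAX_QUEUE 8     /* toy bound for this sketch */

/* Constraint: input element 'before' must come out ahead of 'after'. */
typedef struct ConstraintSketch
{
    int before;
    int after;
} ConstraintSketch;

/* Order-preserving topological sort of input positions 0..n-1.
 * Returns false if the constraints are contradictory. */
static bool
stable_tsort(int n, const ConstraintSketch *cons, int ncons, int *out)
{
    bool placed[MAX_QUEUE] = {false};

    for (int pos = n - 1; pos >= 0; pos--)
    {
        int chosen = -1;

        for (int cand = n - 1; cand >= 0 && chosen < 0; cand--)
        {
            if (placed[cand])
                continue;
            /* cand may take the last open slot only if every element
             * that must follow it has already been placed */
            bool legal = true;

            for (int c = 0; c < ncons; c++)
            {
                if (cons[c].before == cand && !placed[cons[c].after])
                {
                    legal = false;
                    break;
                }
            }
            if (legal)
                chosen = cand;
        }
        if (chosen < 0)
            return false;   /* contradictory ordering constraints */
        placed[chosen] = true;
        out[pos] = chosen;
    }
    return true;
}
```

On the example discussed further down (input order A,B,C with the single constraint "C before A", encoded here as A=0, B=1, C=2), this sketch produces C,A,B; adding the reverse constraint "A before C" makes the sort fail.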
So, the basic step in our rearrangement method is to take a list of
soft edges in a cycle (as returned by FindLockCycle()) and successively
try the reversal of each one as a topological-sort constraint added to
whatever constraints we are already considering.  We recursively search
through all such sets of constraints to see if any one eliminates all
the deadlock cycles at once.  Although this might seem impossibly
inefficient, it shouldn't be a big problem in practice, because there
will normally be very few, and not very large, deadlock cycles --- if
any at all.  So the combinatorial inefficiency isn't going to hurt us.
Besides, it's better to spend some time to guarantee that we've checked
all possible escape routes than to abort a transaction when we didn't
really need to.
Each edge reversal constraint can be viewed as requesting that the waiting
process A be moved to before the blocking process B in the wait queue they
are both in.  This action will reverse the desired soft edge, as well as
any other soft edges between A and other processes it is advanced over.
No other edges will be affected (note this is actually a constraint on our
topological sort method to not re-order the queue more than necessary.)
Therefore, we can be sure we have not created any new deadlock cycles if
neither FindLockCycle(A) nor FindLockCycle(B) discovers any cycle.  Given
the above-defined behavior of FindLockCycle, each of these searches is
necessary as well as sufficient, since FindLockCycle starting at the
original start point will not complain about cycles that include A or B
but not the original start point.
In short then, a proposed rearrangement of the wait queue(s) is determined
by one or more broken soft edges A->B, fully specified by the output of
topological sorts of each wait queue involved, and then tested by invoking
FindLockCycle() starting at the original start point as well as each of
the mentioned processes (A's and B's).  If none of the tests detect a
cycle, then we have a valid configuration and can implement it by
reordering the wait queues per the sort outputs (and then applying
ProcLockWakeup on each reordered queue, in case a waiter has become wakable).
If any test detects a soft cycle, we can try to resolve it by adding each
soft link in that cycle, in turn, to the proposed rearrangement list.
This is repeated recursively until we either find a workable rearrangement
or determine that none exists.  In the latter case, the outer level
resolves the deadlock by aborting the original start-point transaction.
The particular order in which rearrangements are tried depends on the
order FindLockCycle() happens to scan in, so if there are multiple
workable rearrangements of the wait queues, then it is unspecified which
one will be chosen.  What's more important is that we guarantee to try
every queue rearrangement that could lead to success.  (For example,
if we have A before B before C and the needed order constraints are
C before A and B before C, we would first discover that A before C
doesn't work and try the rearrangement C before A before B.  This would
eventually lead to the discovery of the additional constraint B before C.)
---------------------------------------------------------------------------

Miscellaneous notes:

1. It is easily proven that no deadlock will be missed due to our
asynchronous invocation of deadlock checking.  A deadlock cycle in the WFG
is formed when the last edge in the cycle is added; therefore the last
process in the cycle to wait (the one from which that edge is outgoing) is
certain to detect and resolve the cycle when it later runs CheckDeadLock.
This holds even if that edge addition created multiple cycles; the process
may indeed abort without ever noticing those additional cycles, but we
don't particularly care.  The only other possible creation of deadlocks is
during deadlock resolution's rearrangement of wait queues, and we already
saw that that algorithm will prove that it creates no new deadlocks before
it attempts to actually execute any rearrangement.
2. It is not certain that a deadlock will be resolved by aborting the
last-to-wait process.  If earlier waiters in the cycle have not yet run
CheckDeadLock, then the first one to do so will be the victim.
3. No live (wakable) process can be missed by ProcLockWakeup, since it
examines every member of the wait queue (this was not true in the 7.0
implementation, BTW).  Therefore, if ProcLockWakeup is always invoked
after a lock is released or a wait queue is rearranged, there can be no
failure to wake a wakable process.  One should also note that
LockWaitCancel (abort a waiter due to outside factors) must run
ProcLockWakeup, in case the canceled waiter was soft-blocking other
waiters.
4. We can minimize excess rearrangement-trial work by being careful to
scan the wait queue from the front when looking for soft edges.  For
example, if we have queue order A,B,C and C has deadlock conflicts with
both A and B, we want to generate the "C before A" constraint first,
rather than wasting time with "C before B", which won't move C far
enough up.  So we look for soft edges outgoing from C starting at the
front of the wait queue.
5. The working data structures needed by the deadlock detection code can
be limited to numbers of entries computed from MaxBackends.  Therefore,
we can allocate the worst-case space needed during backend startup.  This
seems a safer approach than trying to allocate workspace on the fly; we
don't want to risk having the deadlock detector run out of memory, else
we really have no guarantees at all that deadlock will be detected.