~vcs-imports/mammoth-replicator/trunk

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
$PostgreSQL: pgsql/src/backend/storage/lmgr/README,v 1.15 2004-08-27 17:07:41 tgl Exp $


LOCKING OVERVIEW

Postgres uses three types of interprocess locks:

* Spinlocks.  These are intended for *very* short-term locks.  If a lock
is to be held more than a few dozen instructions, or across any sort of
kernel call (or even a call to a nontrivial subroutine), don't use a
spinlock. Spinlocks are primarily used as infrastructure for lightweight
locks. They are implemented using a hardware atomic-test-and-set
instruction, if available.  Waiting processes busy-loop until they can
get the lock. There is no provision for deadlock detection, automatic
release on error, or any other nicety.  There is a timeout if the lock
cannot be gotten after a minute or so (which is approximately forever in
comparison to the intended lock hold time, so this is certainly an error
condition).

* Lightweight locks (LWLocks).  These locks are typically used to
interlock access to datastructures in shared memory.  LWLocks support
both exclusive and shared lock modes (for read/write and read-only
access to a shared object). There is no provision for deadlock
detection, but the LWLock manager will automatically release held
LWLocks during elog() recovery, so it is safe to raise an error while
holding LWLocks.  Obtaining or releasing an LWLock is quite fast (a few
dozen instructions) when there is no contention for the lock.  When a
process has to wait for an LWLock, it blocks on a SysV semaphore so as
to not consume CPU time.  Waiting processes will be granted the lock in
arrival order.  There is no timeout.

* Regular locks (a/k/a heavyweight locks).  The regular lock manager
supports a variety of lock modes with table-driven semantics, and it has
full deadlock detection and automatic release at transaction end. 
Regular locks should be used for all user-driven lock requests.

Acquisition of either a spinlock or a lightweight lock causes query
cancel and die() interrupts to be held off until all such locks are
released. No such restriction exists for regular locks, however.  Also
note that we can accept query cancel and die() interrupts while waiting
for a regular lock, but we will not accept them while waiting for
spinlocks or LW locks. It is therefore not a good idea to use LW locks
when the wait time might exceed a few seconds.

The rest of this README file discusses the regular lock manager in detail.


LOCK DATA STRUCTURES

Lock methods describe the overall locking behavior.  Currently there are
two lock methods: DEFAULT and USER.  (USER locks are non-blocking.)

Lock modes describe the type of the lock (read/write or shared/exclusive). 
See src/tools/backend/index.html and src/include/storage/lock.h for more
details.

There are two fundamental lock structures in shared memory: the
per-lockable-object LOCK struct, and the per-lock-and-requestor PROCLOCK
struct.  A LOCK object exists for each lockable object that currently has
locks held or requested on it.  A PROCLOCK struct exists for each transaction
that is holding or requesting lock(s) on each LOCK object.

In addition to these, each backend maintains an unshared LOCALLOCK structure
for each lockable object and lock mode that it is currently holding or
requesting.  The shared lock structures only allow a single lock grant to
be made per lockable object/lock mode/transaction.  Internally to a backend,
however, the same lock may be requested and perhaps released multiple times
in a transaction.  The internal request counts are held in LOCALLOCK so that
the shared LockMgrLock need not be obtained to alter them.

---------------------------------------------------------------------------

The lock manager's LOCK objects contain:

tag -
    The key fields that are used for hashing locks in the shared memory
    lock hash table.  This is declared as a separate struct to ensure that
    we always zero out the correct number of bytes.  It is critical that
    any alignment-padding bytes the compiler might insert in the struct
    be zeroed out, else the hash computation will be random.

    tag.relId -
	Uniquely identifies the relation that the lock corresponds to.
    
    tag.dbId -
	Uniquely identifies the database in which the relation lives.  If
	this is a shared system relation (e.g. pg_database) the dbId must
	be set to 0.

    tag.objId -
	Uniquely identifies the block/page within the relation and the
	tuple within the block.  If we are setting a table level lock
	both the blockId and tupleId (in an item pointer this is called
	the position) are set to invalid, if it is a page level lock the
	blockId is valid, while the tupleId is still invalid.  Finally if
	this is a tuple level lock (we currently never do this) then both
	the blockId and tupleId are set to valid specifications.  This is
	how we get the appearance of a multi-level lock table while using
	only a single table (see Gray's paper on 2 phase locking if
	you are puzzled about how multi-level lock tables work).

grantMask -
    This bitmask indicates what types of locks are currently held on the
    given lockable object.  It is used (against the lock table's conflict
    table) to determine if a new lock request will conflict with existing
    lock types held.  Conflicts are determined by bitwise AND operations
    between the grantMask and the conflict table entry for the requested
    lock type.  Bit i of grantMask is 1 if and only if granted[i] > 0.

waitMask -
    This bitmask shows the types of locks being waited for.  Bit i of waitMask
    is 1 if and only if requested[i] > granted[i].

procLocks -
    This is a shared memory queue of all the PROCLOCK structs associated with
    the lock object.  Note that both granted and waiting PROCLOCKs are in this
    list (indeed, the same PROCLOCK might have some already-granted locks and
    be waiting for more!).

waitProcs -
    This is a shared memory queue of all process structures corresponding to
    a backend that is waiting (sleeping) until another backend releases this
    lock.  The process structure holds the information needed to determine
    if it should be woken up when this lock is released.

nRequested -
    Keeps a count of how many times this lock has been attempted to be
    acquired.  The count includes attempts by processes which were put
    to sleep due to conflicts.  It also counts the same backend twice
    if, for example, a backend process first acquires a read and then
    acquires a write, or acquires the lock under two different transaction
    IDs.  (But multiple acquisitions of the same lock/lock mode under the
    same transaction ID are not multiply counted here; they are recorded
    only in the backend's LOCALLOCK structure.)

requested -
    Keeps a count of how many locks of each type have been attempted.  Only
    elements 1 through MAX_LOCKMODES-1 are used as they correspond to the lock
    type defined constants.  Summing the values of requested[] should come out
    equal to nRequested.

nGranted -
    Keeps count of how many times this lock has been successfully acquired.
    This count does not include attempts that are waiting due to conflicts.
    Otherwise the counting rules are the same as for nRequested.

granted -
    Keeps count of how many locks of each type are currently held.  Once again
    only elements 1 through MAX_LOCKMODES-1 are used (0 is not).  Also, like
    requested, summing the values of granted should total to the value
    of nGranted.

We should always have 0 <= nGranted <= nRequested, and
0 <= granted[i] <= requested[i] for each i.  If the request counts go to
zero, the lock object is no longer needed and can be freed.

---------------------------------------------------------------------------

The lock manager's PROCLOCK objects contain:

tag -
    The key fields that are used for hashing entries in the shared memory
    PROCLOCK hash table.  This is declared as a separate struct to ensure that
    we always zero out the correct number of bytes.

    tag.lock
        SHMEM offset of the LOCK object this PROCLOCK is for.

    tag.proc
        SHMEM offset of PROC of backend process that owns this PROCLOCK.

    tag.xid
        XID of transaction this PROCLOCK is for, or InvalidTransactionId
        if the PROCLOCK is for session-level locking.

    Note that this structure will support multiple transactions running
    concurrently in one backend.  Currently we do not use it for that
    purpose: subtransactions acquire locks in the name of their top parent
    transaction, to simplify reassigning lock ownership at subtransaction end.
    So the XID field is really only needed to distinguish per-transaction
    locks from session locks.  User locks are always session locks, and we
    also use session locks for multi-transaction operations like VACUUM.

holdMask -
    A bitmask for the lock types successfully acquired by this PROCLOCK.
    This should be a subset of the LOCK object's grantMask, and also a
    subset of the PGPROC object's heldLocks mask.

lockLink -
    List link for shared memory queue of all the PROCLOCK objects for the
    same LOCK.

procLink -
    List link for shared memory queue of all the PROCLOCK objects for the
    same backend.

---------------------------------------------------------------------------

The deadlock detection algorithm:

Since we allow user transactions to request locks in any order, deadlock
is possible.  We use a deadlock detection/breaking algorithm that is
fairly standard in essence, but there are many special considerations
needed to deal with Postgres' generalized locking model.

A key design consideration is that we want to make routine operations
(lock grant and release) run quickly when there is no deadlock, and
avoid the overhead of deadlock handling as much as possible.  We do this
using an "optimistic waiting" approach: if a process cannot acquire the
lock it wants immediately, it goes to sleep without any deadlock check. 
But it also sets a delay timer, with a delay of DeadlockTimeout
milliseconds (typically set to one second).  If the delay expires before
the process is granted the lock it wants, it runs the deadlock
detection/breaking code. Normally this code will determine that there is
no deadlock condition, and then the process will go back to sleep and
wait quietly until it is granted the lock.  But if a deadlock condition
does exist, it will be resolved, usually by aborting the detecting
process' transaction.  In this way, we avoid deadlock handling overhead
whenever the wait time for a lock is less than DeadlockTimeout, while
not imposing an unreasonable delay of detection when there is an error.

Lock acquisition (routines LockAcquire and ProcSleep) follows these rules:

1. A lock request is granted immediately if it does not conflict with
any existing or waiting lock request, or if the process already holds an
instance of the same lock type (eg, there's no penalty to acquire a read
lock twice).  Note that a process never conflicts with itself, eg one
can obtain read lock when one already holds exclusive lock.

2. Otherwise the process joins the lock's wait queue.  Normally it will
be added to the end of the queue, but there is an exception: if the
process already holds locks on this same lockable object that conflict
with the request of any pending waiter, then the process will be
inserted in the wait queue just ahead of the first such waiter.  (If we
did not make this check, the deadlock detection code would adjust the
queue order to resolve the conflict, but it's relatively cheap to make
the check in ProcSleep and avoid a deadlock timeout delay in this case.)
 Note special case when inserting before the end of the queue: if the
process's request does not conflict with any existing lock nor any
waiting request before its insertion point, then go ahead and grant the
lock without waiting.

When a lock is released, the lock release routine (ProcLockWakeup) scans
the lock object's wait queue.  Each waiter is awoken if (a) its request
does not conflict with already-granted locks, and (b) its request does
not conflict with the requests of prior un-wakable waiters.  Rule (b)
ensures that conflicting requests are granted in order of arrival. There
are cases where a later waiter must be allowed to go in front of
conflicting earlier waiters to avoid deadlock, but it is not
ProcLockWakeup's responsibility to recognize these cases; instead, the
deadlock detection code will re-order the wait queue when necessary.

To perform deadlock checking, we use the standard method of viewing the
various processes as nodes in a directed graph (the waits-for graph or
WFG).  There is a graph edge leading from process A to process B if A
waits for B, ie, A is waiting for some lock and B holds a conflicting
lock.  There is a deadlock condition if and only if the WFG contains a
cycle.  We detect cycles by searching outward along waits-for edges to
see if we return to our starting point.  There are three possible
outcomes:

1. All outgoing paths terminate at a running process (which has no
outgoing edge).

2. A deadlock is detected by looping back to the start point.  We
resolve such a deadlock by canceling the start point's lock request and
reporting an error in that transaction, which normally leads to
transaction abort and release of that transaction's held locks.  Note
that it's sufficient to cancel one request to remove the cycle; we don't
need to kill all the transactions involved.

3. Some path(s) loop back to a node other than the start point.  This
indicates a deadlock, but one that does not involve our starting
process. We ignore this condition on the grounds that resolving such a
deadlock is the responsibility of the processes involved --- killing our
start- point process would not resolve the deadlock.  So, cases 1 and 3
both report "no deadlock".

Postgres' situation is a little more complex than the standard discussion
of deadlock detection, for two reasons:

1. A process can be waiting for more than one other process, since there
might be multiple PROCLOCKs of (non-conflicting) lock types that all
conflict with the waiter's request.  This creates no real difficulty
however; we simply need to be prepared to trace more than one outgoing
edge.

2. If a process A is behind a process B in some lock's wait queue, and
their requested locks conflict, then we must say that A waits for B, since
ProcLockWakeup will never awaken A before B.  This creates additional
edges in the WFG.  We call these "soft" edges, as opposed to the "hard"
edges induced by locks already held.  Note that if B already holds any
locks conflicting with A's request, then their relationship is a hard edge
not a soft edge.

A "soft" block, or wait-priority block, has the same potential for
inducing deadlock as a hard block.  However, we may be able to resolve
a soft block without aborting the transactions involved: we can instead
rearrange the order of the wait queue.  This rearrangement reverses the
direction of the soft edge between two processes with conflicting requests
whose queue order is reversed.  If we can find a rearrangement that
eliminates a cycle without creating new ones, then we can avoid an abort.
Checking for such possible rearrangements is the trickiest part of the
algorithm.

The workhorse of the deadlock detector is a routine FindLockCycle() which
is given a starting point process (which must be a waiting process).
It recursively scans outward across waits-for edges as discussed above.
If it finds no cycle involving the start point, it returns "false".
(As discussed above, we can ignore cycles not involving the start point.)
When such a cycle is found, FindLockCycle() returns "true", and as it
unwinds it also builds a list of any "soft" edges involved in the cycle.
If the resulting list is empty then there is a hard deadlock and the
configuration cannot succeed.  However, if the list is not empty, then
reversing any one of the listed edges through wait-queue rearrangement
will eliminate that cycle.  Since such a reversal might create cycles
elsewhere, we may need to try every possibility.  Therefore, we need to
be able to invoke FindLockCycle() on hypothetical configurations (wait
orders) as well as the current real order.

The easiest way to handle this seems to be to have a lookaside table that
shows the proposed new queue order for each wait queue that we are
considering rearranging.  This table is checked by FindLockCycle, and it
believes the proposed queue order rather than the real order for each lock
that has an entry in the lookaside table.

We build a proposed new queue order by doing a "topological sort" of the
existing entries.  Each soft edge that we are currently considering
reversing creates a property of the partial order that the topological sort
has to enforce.  We must use a sort method that preserves the input
ordering as much as possible, so as not to gratuitously break arrival
order for processes not involved in a deadlock.  (This is not true of the
tsort method shown in Knuth, for example, but it's easily done by a simple
doubly-nested-loop method that emits the first legal candidate at each
step.  Fortunately, we don't need a highly efficient sort algorithm, since
the number of partial order constraints is not likely to be large.)  Note
that failure of the topological sort tells us we have conflicting ordering
constraints, and therefore that the last-added soft edge reversal
conflicts with a prior edge reversal.  We need to detect this case to
avoid an infinite loop in the case where no possible rearrangement will
work: otherwise, we might try a reversal, find that it still leads to
a cycle, then try to un-reverse the reversal while trying to get rid of
that cycle, etc etc.  Topological sort failure tells us the un-reversal
is not a legitimate move in this context.

So, the basic step in our rearrangement method is to take a list of
soft edges in a cycle (as returned by FindLockCycle()) and successively
try the reversal of each one as a topological-sort constraint added to
whatever constraints we are already considering.  We recursively search
through all such sets of constraints to see if any one eliminates all
the deadlock cycles at once.  Although this might seem impossibly
inefficient, it shouldn't be a big problem in practice, because there
will normally be very few, and not very large, deadlock cycles --- if
any at all.  So the combinatorial inefficiency isn't going to hurt us.
Besides, it's better to spend some time to guarantee that we've checked
all possible escape routes than to abort a transaction when we didn't
really have to.

Each edge reversal constraint can be viewed as requesting that the waiting
process A be moved to before the blocking process B in the wait queue they
are both in.  This action will reverse the desired soft edge, as well as
any other soft edges between A and other processes it is advanced over.
No other edges will be affected (note this is actually a constraint on our
topological sort method to not re-order the queue more than necessary.)
Therefore, we can be sure we have not created any new deadlock cycles if
neither FindLockCycle(A) nor FindLockCycle(B) discovers any cycle.  Given
the above-defined behavior of FindLockCycle, each of these searches is
necessary as well as sufficient, since FindLockCycle starting at the
original start point will not complain about cycles that include A or B
but not the original start point.

In short then, a proposed rearrangement of the wait queue(s) is determined
by one or more broken soft edges A->B, fully specified by the output of
topological sorts of each wait queue involved, and then tested by invoking
FindLockCycle() starting at the original start point as well as each of
the mentioned processes (A's and B's).  If none of the tests detect a
cycle, then we have a valid configuration and can implement it by
reordering the wait queues per the sort outputs (and then applying
ProcLockWakeup on each reordered queue, in case a waiter has become wakable).
If any test detects a soft cycle, we can try to resolve it by adding each
soft link in that cycle, in turn, to the proposed rearrangement list.
This is repeated recursively until we either find a workable rearrangement
or determine that none exists.  In the latter case, the outer level
resolves the deadlock by aborting the original start-point transaction.

The particular order in which rearrangements are tried depends on the
order FindLockCycle() happens to scan in, so if there are multiple
workable rearrangements of the wait queues, then it is unspecified which
one will be chosen.  What's more important is that we guarantee to try
every queue rearrangement that could lead to success.  (For example,
if we have A before B before C and the needed order constraints are
C before A and B before C, we would first discover that A before C
doesn't work and try the rearrangement C before A before B.  This would
eventually lead to the discovery of the additional constraint B before C.)

Got that?

Miscellaneous notes:

1. It is easily proven that no deadlock will be missed due to our
asynchronous invocation of deadlock checking.  A deadlock cycle in the WFG
is formed when the last edge in the cycle is added; therefore the last
process in the cycle to wait (the one from which that edge is outgoing) is
certain to detect and resolve the cycle when it later runs CheckDeadLock.
This holds even if that edge addition created multiple cycles; the process
may indeed abort without ever noticing those additional cycles, but we
don't particularly care.  The only other possible creation of deadlocks is
during deadlock resolution's rearrangement of wait queues, and we already
saw that that algorithm will prove that it creates no new deadlocks before
it attempts to actually execute any rearrangement.

2. It is not certain that a deadlock will be resolved by aborting the
last-to-wait process.  If earlier waiters in the cycle have not yet run
CheckDeadLock, then the first one to do so will be the victim.

3. No live (wakable) process can be missed by ProcLockWakeup, since it
examines every member of the wait queue (this was not true in the 7.0
implementation, BTW).  Therefore, if ProcLockWakeup is always invoked
after a lock is released or a wait queue is rearranged, there can be no
failure to wake a wakable process.  One should also note that
LockWaitCancel (abort a waiter due to outside factors) must run
ProcLockWakeup, in case the canceled waiter was soft-blocking other
waiters.

4. We can minimize excess rearrangement-trial work by being careful to
scan the wait queue from the front when looking for soft edges.  For
example, if we have queue order A,B,C and C has deadlock conflicts with
both A and B, we want to generate the "C before A" constraint first,
rather than wasting time with "C before B", which won't move C far
enough up.  So we look for soft edges outgoing from C starting at the
front of the wait queue.

5. The working data structures needed by the deadlock detection code can
be limited to numbers of entries computed from MaxBackends.  Therefore,
we can allocate the worst-case space needed during backend startup. This
seems a safer approach than trying to allocate workspace on the fly; we
don't want to risk having the deadlock detector run out of memory, else
we really have no guarantees at all that deadlock will be detected.