The fundamental problem with a journaled cluster filesystem is handling
journal replay with multiple journals. A single block of metadata can
be modified sequentially by many different nodes in the cluster. As
each node modifies the block, the block is logged in that node's
journal. If care is not taken, a journal replay can actually corrupt
the filesystem. The error scenario is:

1) Node A modifies a metadata block by putting an updated copy into its
   incore log.
2) Node B wants to read and modify the block, so it requests the lock
   and a blocking callback is sent to Node A.
3) Node A flushes its incore log to disk, and then syncs out the
   metadata block to its inplace location.
4) Node A then releases the lock.
5) Node B reads in the block and puts a modified copy into its ondisk
   log and then the inplace block location.

Suppose that at this point Node A's journal needs to be replayed.
Since there is a newer version of the block inplace, replaying Node A's
older copy of that block will corrupt the filesystem. There are a few
different ways of avoiding this problem.
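
The error scenario can be reproduced with a toy model. This is only a
sketch: the dictionary standing in for the disk, the list-based
journals, and the names log_and_write and naive_replay are all
hypothetical and are not taken from any real filesystem code.

```python
# Hypothetical model: "disk" maps block numbers to contents, and each
# node's journal is a list of (block_number, contents) records.
disk = {7: "v0"}
journal_a = []
journal_b = []


def log_and_write(journal, blkno, contents):
    """Log a block in a node's journal, then write it in place."""
    journal.append((blkno, contents))
    disk[blkno] = contents


def naive_replay(journal):
    """Replay every journaled block unconditionally."""
    for blkno, contents in journal:
        disk[blkno] = contents


log_and_write(journal_a, 7, "v1 (Node A)")  # steps 1 and 3
# Steps 2 and 4: the lock passes from Node A to Node B.
log_and_write(journal_b, 7, "v2 (Node B)")  # step 5

naive_replay(journal_a)  # Node A's journal is replayed
print(disk[7])           # prints "v1 (Node A)": Node B's update is lost
```

Each of the three methods below is a different way of making sure that
the final replay cannot overwrite Node B's version.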

1) Generation Numbers (GFS1)

Each metadata block has a header in it that contains a 64-bit
generation number. As each block is logged into a journal, the
generation number is incremented. This provides a strict ordering of
the different versions of the block as they are logged in the FS'
different journals. When journal replay happens, a block in the
journal is not replayed if the generation number in the journal is
less than the generation number in place. This ensures that a newer
version of a block is never replaced with an older version. So, this
solution allows multiple copies of the same block in different
journals, but the generation numbers always tell you which copy is the
newest.
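
The replay rule can be sketched as follows. This is a minimal model,
not GFS1 code: the Block class, the in-memory disk dictionary, and the
function names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    generation: int  # stands in for the 64-bit number in the block header
    contents: str


disk = {7: Block(0, "v0")}


def log_block(journal, blkno, contents):
    """Log a new version of a block, incrementing its generation number."""
    new = Block(disk[blkno].generation + 1, contents)
    journal.append((blkno, new))
    return new


def replay(journal):
    """Skip any journaled copy that is older than the copy in place."""
    for blkno, logged in journal:
        if logged.generation < disk[blkno].generation:
            continue
        disk[blkno] = logged


journal_a, journal_b = [], []
disk[7] = log_block(journal_a, 7, "v1 (Node A)")  # Node A logs, then syncs
disk[7] = log_block(journal_b, 7, "v2 (Node B)")  # Node B logs, then syncs

replay(journal_a)        # generation 1 < 2, so the block is not replayed
print(disk[7].contents)  # prints "v2 (Node B)"
```

Note that replay has to read the block from disk before it can decide
whether to write it.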

Pros:

A) This method allows the fastest callbacks. To release a lock, the
   incore log for the lock must be flushed and then the inplace data
   and metadata must be synced. That's it. The sync operations
   involved are: start the log body and wait for it to become stable
   on the disk, synchronously write the commit block, start the
   inplace metadata and wait for it to become stable on the disk.

Cons:

A) Maintaining the generation numbers is expensive. Every newly
   allocated metadata block must be read off the disk in order to
   figure out what the previous value of the generation number was.
   When deallocating metadata, extra work and care must be taken to
   make sure dirty data isn't thrown away in such a way that the
   generation numbers no longer provide a strict ordering.
B) You can't continue to modify the filesystem during journal replay.
   Replay of a block is a read-modify-write operation: the block is
   read from disk, the generation number is compared, and (maybe) the
   new version is written out. Replay requires that the R-M-W
   operation is atomic with respect to other R-M-W operations that
   might be happening (say, by a normal I/O process). Since journal
   replay doesn't (and can't) play by the normal metadata locking
   rules, you can't count on them to protect replay. Hence GFS1
   quiesces all writes on a filesystem before starting replay. This
   provides the mutual exclusion required, but it's slow and
   unnecessarily interrupts service on the cluster.

2) Total Metadata Sync (OCFS2)

This method is really simple in that it uses exactly the same
infrastructure that a local journaled filesystem uses. Every time a
node receives a callback, it stops all metadata modification, syncs
out the whole incore journal, syncs out any dirty data, marks the
journal as being clean (unmounted), and then releases the lock.
Because the journal is marked as clean, recovery won't look at any of
the journaled blocks in it. A valid copy of any particular block
therefore exists in only one journal at a time, and that journal is
always the journal of the node that modified it last.
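
A rough sketch of the callback path is shown below. It is a toy model
under stated assumptions: the Node class and its method names are
hypothetical, and clearing the journal list stands in for writing the
on-disk marker that tells recovery the journal is clean.

```python
class Node:
    def __init__(self, disk):
        self.disk = disk
        self.journal = []  # records logged since the journal was last clean
        self.dirty = {}    # incore metadata not yet written in place

    def modify(self, blkno, contents):
        """Log a metadata change and keep the dirty copy incore."""
        self.journal.append((blkno, contents))
        self.dirty[blkno] = contents

    def on_callback(self):
        """Sync all dirty metadata and mark the journal clean."""
        for blkno, contents in self.dirty.items():
            self.disk[blkno] = contents
        self.dirty.clear()
        self.journal.clear()  # stand-in for marking the journal clean

    def recover(self):
        """Replay whatever is still valid in this node's journal."""
        for blkno, contents in self.journal:
            self.disk[blkno] = contents


disk = {7: "v0"}
a, b = Node(disk), Node(disk)
a.modify(7, "v1 (Node A)")
a.on_callback()             # Node A releases the lock with a clean journal
b.modify(7, "v2 (Node B)")  # Node B now holds the only valid journaled copy

a.recover()                 # nothing to replay from Node A
b.recover()                 # Node B's copy is the one that gets replayed
print(disk[7])              # prints "v2 (Node B)"
```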

Pros:

A) Very simple to implement.
B) You can reuse journaling code from other places (such as JBD).
C) No quiesce necessary for replay.
D) No need for generation numbers sprinkled throughout the metadata.

Cons:

A) This method has the slowest possible callbacks. The sync
   operations are: stop all metadata operations, start and wait for
   the log body, write the log commit block, start and wait for all
   the FS' dirty metadata, write an unmount block. Writing the
   metadata for the whole filesystem can be particularly expensive
   because it can be scattered all over the disk and there can be a
   whole journal's worth of it.

3) Revocation of a lock's buffers (GFS2)

This method prevents a block from appearing in more than one journal
by canceling out the metadata blocks in the journal that belong to the
lock being released. Journaling works very similarly to a local
filesystem or to #2 above.

The biggest difference is that you have to keep track of buffers in
the active region of the ondisk journal, even after the inplace blocks
have been written back. This is done in GFS2 by adding a second part
to the Active Items List. The first part (in GFS2 called AIL1)
contains a list of all the blocks which have been logged to the
journal, but not written back to their inplace location. Once an item
in AIL1 has been written back to its inplace location, it is moved to
AIL2. Once the tail of the log moves past the block's transaction in
the log, it can be removed from AIL2.

When a callback occurs, the log is flushed to the disk and the
metadata for the lock is synced to disk. At this point, any metadata
blocks for the lock that are in the current active region of the log
will be in the AIL2 list. We then build a transaction that contains
revoke tags for each buffer in the AIL2 list that belongs to that
lock.
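
The bookkeeping described above can be sketched like this. It is not
the GFS2 implementation: the Journal class, its method names, the lock
names, and the choice to drop a block from AIL2 once its revoke tag is
logged are all assumptions of the sketch.

```python
class Journal:
    def __init__(self):
        self.log = []   # active region of the ondisk log, oldest first
        self.ail1 = {}  # blkno -> (lock, contents): logged, not yet inplace
        self.ail2 = {}  # blkno -> lock: written inplace, still in the log

    def log_block(self, lock, blkno, contents):
        """Log a block; it waits in AIL1 until it is written back."""
        self.log.append(("block", blkno, contents))
        # Assumption of this sketch: a re-logged block leaves AIL2.
        self.ail2.pop(blkno, None)
        self.ail1[blkno] = (lock, contents)

    def sync_lock(self, lock, disk):
        """Write a lock's blocks inplace, moving them from AIL1 to AIL2."""
        for blkno, (owner, contents) in list(self.ail1.items()):
            if owner == lock:
                disk[blkno] = contents
                del self.ail1[blkno]
                self.ail2[blkno] = owner

    def revoke_lock(self, lock):
        """Log a revoke tag for each AIL2 buffer that belongs to the lock."""
        revoked = sorted(b for b, owner in self.ail2.items() if owner == lock)
        for blkno in revoked:
            self.log.append(("revoke", blkno))
            del self.ail2[blkno]
        return revoked


disk = {}
j = Journal()
j.log_block("lock-1", 7, "v1")
j.log_block("lock-2", 9, "x1")
j.sync_lock("lock-1", disk)     # callback for lock-1: sync its metadata
print(j.revoke_lock("lock-1"))  # prints [7]; block 9 is left alone
```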

Pros:

A) No quiesce necessary for replay.
B) No need for generation numbers sprinkled throughout the metadata.
C) The sync operations are: stop all metadata operations, start and
   wait for the log body, write the log commit block, start and wait
   for all the FS' dirty metadata, start and wait for the log body of
   a transaction that revokes any of the lock's metadata buffers in
   the journal's active region, and write the commit block for that
   transaction.

Cons:

A) Recovery takes two passes, one to find all the revoke tags in the
   log and one to replay the metadata blocks using the revoke tags as
   a filter. This is necessary for a local filesystem and the total
   sync method, too. It's just that there will probably be more
   revoke tags.
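
A minimal sketch of the two-pass recovery follows. The record layout
and the replay function are hypothetical, and comparing log positions
is an assumption standing in for whatever ordering a real journal uses
to decide whether a revoke tag covers a given logged block.

```python
def replay(log, disk):
    """Two-pass replay: collect revoke tags, then filter the blocks."""
    # Pass 1: remember the position of the last revoke tag for each block.
    last_revoke = {}
    for pos, record in enumerate(log):
        if record[0] == "revoke":
            last_revoke[record[1]] = pos

    # Pass 2: replay a block only if no later revoke tag cancels it.
    for pos, record in enumerate(log):
        if record[0] == "block":
            _, blkno, contents = record
            if last_revoke.get(blkno, -1) < pos:
                disk[blkno] = contents


# Node A logged blocks 7 and 9, then revoked block 7 when it released
# the lock. Node B has since written a newer block 7 inplace.
log_a = [("block", 7, "v1 (Node A)"), ("block", 9, "x1"), ("revoke", 7)]
disk = {7: "v2 (Node B)", 9: "x0"}

replay(log_a, disk)
print(disk[7])  # prints "v2 (Node B)": the revoked copy is filtered out
print(disk[9])  # prints "x1": blocks without a revoke tag are replayed
```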

Comparing #2 and #3, both do extra I/O during a lock callback to make
sure that any metadata blocks in the log for that lock will be
removed. I believe #2 will be slower because syncing out all the
dirty metadata for the entire filesystem requires lots of little,
scattered I/O across the whole disk. The extra I/O done by #3 is a
log write to the disk. So, not only should it be less I/O, but it
should also be better suited to get good performance out of the disk