~vcs-imports/redhatcluster/master

Viewing changes to doc/journaling.txt

Committer: Fabio M. Di Nitto
Date: 2010-11-25 09:21:00 UTC
Revision ID: git-v1:aabd30fca943f8c865691bbc76ddb045c1ec0a33

obsolete master branch

Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>

files added:
README

files removed:
Makefile.am

autogen.sh

bindings

bindings/Makefile.am

bindings/perl

bindings/perl/Makefile.am

bindings/perl/ccs

bindings/perl/ccs/CCS.pm.in

bindings/perl/ccs/CCS.xs

bindings/perl/ccs/MANIFEST

bindings/perl/ccs/META.yml.in

bindings/perl/ccs/Makefile.PL

bindings/perl/ccs/Makefile.am

bindings/perl/ccs/test.pl

bindings/perl/ccs/typemap

cman

cman/Makefile.am

cman/cman_tool

cman/cman_tool/Makefile.am

cman/cman_tool/cman_tool.h

cman/cman_tool/join.c

cman/cman_tool/main.c

cman/config

cman/config/Makefile.am

cman/config/cman-preconfig.c

cman/config/cman.h

cman/config/nodelist.h

cman/init.d

cman/init.d/cman.in

cman/man

cman/man/Makefile.am

cman/man/cman.5

cman/man/cman_tool.8

cman/man/cmannotifyd.8

cman/man/mkqdisk.8

cman/man/qdisk.5

cman/man/qdiskd.8

cman/notifyd

cman/notifyd/Makefile.am

cman/notifyd/cman_notify.in

cman/notifyd/main.c

cman/qdisk

cman/qdisk/Makefile.am

cman/qdisk/bitmap.c

cman/qdisk/daemon_init.c

cman/qdisk/disk.c

cman/qdisk/disk.h

cman/qdisk/disk_util.c

cman/qdisk/iostate.c

cman/qdisk/iostate.h

cman/qdisk/main.c

cman/qdisk/mkqdisk.c

cman/qdisk/platform.h

cman/qdisk/proc.c

cman/qdisk/scandisk.c

cman/qdisk/scandisk.h

cman/qdisk/score.c

cman/qdisk/score.h

cman/services

cman/services/Makefile.am

cman/services/cman

cman/services/cman/Makefile.am

cman/services/cman/include

cman/services/cman/include/Makefile.am

cman/services/cman/include/corosync

cman/services/cman/include/corosync/cman.h

cman/services/cman/include/corosync/ipc_cman.h

cman/services/cman/lib

cman/services/cman/lib/Makefile.am

cman/services/cman/lib/libcman.c

cman/services/cman/lib/libcman.h

cman/services/cman/lib/libcman.pc.in

cman/services/cman/services

cman/services/cman/services/Makefile.am

cman/services/cman/services/cman.c

cman/tests

cman/tests/Makefile.am

cman/tests/client.c

cman/tests/libtest.c

cman/tests/qwait.c

cman/tests/sysman.c

cman/tests/sysmand.c

cman/tests/user_service.c

common

common/Makefile.am

common/liblogthread

common/liblogthread/Makefile.am

common/liblogthread/liblogthread.c

common/liblogthread/liblogthread.h

common/liblogthread/liblogthread.pc.in

config

config/Makefile.am

config/libs

config/libs/Makefile.am

config/libs/libccsconfdb

config/libs/libccsconfdb/Makefile.am

config/libs/libccsconfdb/ccs.h

config/libs/libccsconfdb/ccs_internal.h

config/libs/libccsconfdb/extras.c

config/libs/libccsconfdb/fullxpath.c

config/libs/libccsconfdb/libccs.c

config/libs/libccsconfdb/libccs.pc.in

config/libs/libccsconfdb/xpathlite.c

config/man

config/man/Makefile.am

config/man/cluster.conf.5

config/plugins

config/plugins/Makefile.am

config/plugins/ldap

config/plugins/ldap/99cluster.ldif

config/plugins/ldap/Makefile.am

config/plugins/ldap/configldap.c

config/plugins/ldap/example.ldif

config/plugins/xml

config/plugins/xml/Makefile.am

config/plugins/xml/config.c

config/tools

config/tools/Makefile.am

config/tools/ccs_tool

config/tools/ccs_tool/Makefile.am

config/tools/ccs_tool/ccs_tool.c

config/tools/ccs_tool/editconf.c

config/tools/ccs_tool/editconf.h

config/tools/ldap

config/tools/ldap/Makefile.am

config/tools/ldap/confdb2ldif.c

config/tools/man

config/tools/man/Makefile.am

config/tools/man/ccs_tool.8

config/tools/man/confdb2ldif.8

config/tools/mkconf

config/tools/mkconf/Makefile.am

config/tools/mkconf/mkconf.c

configure.ac

doc/COPYING.applications

doc/COPYING.libraries

doc/COPYRIGHT

doc/Makefile.am

doc/README.licence

doc/cluster.logrotate.in

doc/cman_notify_template.sh

doc/gfs2.txt

doc/journaling.txt

doc/min-gfs.txt

doc/usage.txt

group

group/Makefile.am

group/man

group/man/Makefile.am

group/man/group_tool.8

group/tool

group/tool/Makefile.am

group/tool/main.c

make

make/copyright.cf

make/lcrso.mk

doc/journaling.txt

o Journaling & Replay

The fundamental problem with a journaled cluster filesystem is

handling journal replay with multiple journals. A single block of

metadata can be modified sequentially by many different nodes in the

cluster. As the block is modified by each node, it gets logged in the

journal for each node. If care is not taken, it's possible to get

into a situation where a journal replay can actually corrupt a

filesystem. The error scenario is:

1) Node A modifies a metadata block by putting an updated copy into its

incore log.

2) Node B wants to read and modify the block so it requests the lock

and a blocking callback is sent to Node A.

3) Node A flushes its incore log to disk, and then syncs out the

metadata block to its inplace location.

4) Node A then releases the lock.

5) Node B reads in the block and puts a modified copy into its ondisk

log and then the inplace block location.

6) Node A crashes.

At this point, Node A's journal needs to be replayed. Since there is

a newer version of the block inplace, if that block is replayed, the

filesystem will be corrupted. There are a few different ways of

avoiding this problem.
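
The failure is easy to reproduce in a toy model. The sketch below is only an illustration: the dict-based disk, the list-based journals, and the names log_and_write and naive_replay are all made up for the example and are not taken from any GFS code.

```python
# Toy model (hypothetical): the disk maps block numbers to contents, and
# each node's journal is a list of (block_number, contents) entries.
disk = {}
journal_a = []
journal_b = []

def log_and_write(journal, block, contents):
    """Log a block to a node's journal, then write it in place."""
    journal.append((block, contents))
    disk[block] = contents

def naive_replay(journal):
    """Replay every journaled block unconditionally."""
    for block, contents in journal:
        disk[block] = contents

# Steps 1-4: Node A logs and writes block 7, then releases the lock.
log_and_write(journal_a, 7, "version from A")
# Step 5: Node B modifies the same block.
log_and_write(journal_b, 7, "version from B")
# Step 6: Node A crashes and its journal is replayed blindly.
naive_replay(journal_a)
after_replay = disk[7]  # Node B's newer copy has been overwritten
```

After the replay, block 7 again holds Node A's stale copy, which is the corruption described above.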

1) Generation Numbers (GFS1)

Each metadata block has a header in it that contains a 64-bit

generation number. As each block is logged into a journal, the

generation number is incremented. This provides a strict ordering

of the different versions of the block as they are logged in the FS'

different journals. When journal replay happens, each block in the

journal is not replayed if the generation number in the journal is less

than the generation number in place. This ensures that a newer

version of a block is never replaced with an older version. So,

this solution basically allows multiple copies of the same block in

different journals, but it allows you to always know which is the

correct one.
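
A minimal sketch of the replay rule, under some assumptions: the Block class, log_block, and replay are hypothetical names, the disk is a plain dict, and a freshly allocated block is assumed to start at generation 1. Only the comparison in replay comes from the description above.

```python
from dataclasses import dataclass

@dataclass
class Block:
    generation: int
    contents: str

def log_block(disk, journal, number, contents):
    """Bump the generation, log the block, and write it in place."""
    # The previous generation has to be read from the copy in place.
    generation = disk[number].generation + 1 if number in disk else 1
    block = Block(generation, contents)
    journal.append((number, block))
    disk[number] = block

def replay(disk, journal):
    """Replay a journal, skipping copies older than the copy in place."""
    replayed = 0
    for number, logged in journal:
        in_place = disk.get(number)
        if in_place is not None and logged.generation < in_place.generation:
            continue  # a newer version is already in place
        disk[number] = logged
        replayed += 1
    return replayed

disk = {}
journal_a, journal_b = [], []
log_block(disk, journal_a, 7, "version from A")  # generation 1
log_block(disk, journal_b, 7, "version from B")  # generation 2
replayed = replay(disk, journal_a)               # Node A's copy is skipped
```

Replaying Node A's journal leaves Node B's version in place, because the journaled generation is lower than the one on disk.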

Pros:

A) This method allows the fastest callbacks. To release a lock,

the incore log for the lock must be flushed and then the inplace

data and metadata must be synced. That's it. The sync

operations involved are: start the log body and wait for it to

become stable on the disk, synchronously write the commit block,

start the inplace metadata and wait for it to become stable on

the disk.

Cons:

A) Maintaining the generation numbers is expensive. All newly

allocated metadata blocks must be read off the disk in order to

figure out what the previous value of the generation number was.

When deallocating metadata, extra work and care must be taken to

make sure dirty data isn't thrown away in such a way that the

generation numbers stop ordering the versions of a block correctly.

B) You can't continue to modify the filesystem during journal

replay. Basically, replay of a block is a read-modify-write

operation: the block is read from disk, the generation number is

compared, and (maybe) the new version is written out. Replay

requires that the R-M-W operation is atomic with respect to

other R-M-W operations that might be happening (say by a normal

I/O process). Since journal replay doesn't (and can't) play by

the normal metadata locking rules, you can't count on them to

protect replay. Hence, GFS1 quiesces all writes on a filesystem

before starting replay. This provides the mutual exclusion

required, but it's slow and unnecessarily interrupts service on

the whole cluster.

2) Total Metadata Sync (OCFS2)

This method is really simple in that it uses exactly the same

infrastructure that a local journaled filesystem uses. Every time

a node receives a callback, it stops all metadata modification,

syncs out the whole incore journal, syncs out any dirty data, marks

the journal as being clean (unmounted), and then releases the lock.

Because the journal is marked as clean and recovery won't look at any

of the journaled blocks in it, a valid copy of any particular block

only exists in one journal at a time, and that journal is always the

journal of the node that modified it last.
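
As a rough illustration, the callback and recovery paths might look like the toy below. The Node class and its fields are hypothetical; the journal_clean flag stands in for the unmount block, and discarding old entries when a clean journal is reused is an assumption of the sketch.

```python
class Node:
    """Toy node for the total-sync scheme; all names are hypothetical."""

    def __init__(self, disk):
        self.disk = disk            # shared dict: block number -> contents
        self.journal = []           # (block, contents) entries
        self.dirty = {}             # metadata not yet written in place
        self.journal_clean = True   # stands in for the unmount block

    def modify(self, block, contents):
        if self.journal_clean:
            self.journal = []       # a clean journal is reused from scratch
        self.journal.append((block, contents))
        self.dirty[block] = contents
        self.journal_clean = False

    def on_callback(self):
        """Sync all dirty metadata, then mark the journal clean."""
        self.disk.update(self.dirty)
        self.dirty.clear()
        self.journal_clean = True

    def recover(self):
        """Replay unless the journal is clean; return the blocks replayed."""
        if self.journal_clean:
            return 0
        for block, contents in self.journal:
            self.disk[block] = contents
        return len(self.journal)


disk = {}
node_a, node_b = Node(disk), Node(disk)
node_a.modify(7, "version from A")
node_a.on_callback()                 # A syncs and marks its journal clean
node_b.modify(7, "version from B")   # B's journal holds the only valid copy
replayed_from_a = node_a.recover()
replayed_from_b = node_b.recover()
```

Recovering Node A replays nothing, so only Node B's journal can put a copy of block 7 back on disk.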

Pros:

A) Very simple to implement.

B) You can reuse journaling code from other places (such as JBD).

C) No quiesce necessary for replay.

D) No need for generation numbers sprinkled throughout the metadata.

Cons:

A) This method has the slowest possible callbacks. The sync

operations are: stop all metadata operations, start and wait for

the log body, write the log commit block, start and wait for all

the FS' dirty metadata, write an unmount block. Writing the

metadata for the whole filesystem can be particularly expensive

because it can be scattered all over the disk and there can be a

whole journal's worth of it.

3) Revocation of a lock's buffers (GFS2)

This method prevents a block from appearing in more than one

journal by canceling out the metadata blocks in the journal that

belong to the lock being released. Journaling works very similarly

to a local filesystem or to #2 above.

The biggest difference is you have to keep track of buffers in the

active region of the ondisk journal, even after the inplace blocks

have been written back. This is done in GFS2 by adding a second

part to the Active Items List. The first part (in GFS2 called

AIL1) contains a list of all the blocks which have been logged to

the journal, but not written back to their inplace location. Once

an item in AIL1 has been written back to its inplace location, it

is moved to AIL2. Once the tail of the log moves past the block's

transaction in the log, it can be removed from AIL2.
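
The life cycle of a block through the two lists can be sketched as below. This is not the GFS2 data structure: the ActiveItems class, its method names, and the use of an integer log position per block are assumptions made for the example, and re-logging a block that is already tracked is not modelled.

```python
class ActiveItems:
    """Toy two-part Active Items List; all names are hypothetical."""

    def __init__(self):
        self.ail1 = {}  # block -> log position; logged, not yet in place
        self.ail2 = {}  # block -> log position; in place, still in active log

    def logged(self, block, log_position):
        """A block has been written to the journal."""
        self.ail1[block] = log_position

    def written_in_place(self, block):
        """The block reached its inplace location: move AIL1 -> AIL2."""
        self.ail2[block] = self.ail1.pop(block)

    def tail_moved(self, new_tail):
        """Drop AIL2 entries whose transactions the log tail has passed."""
        passed = [b for b, pos in self.ail2.items() if pos < new_tail]
        for block in passed:
            del self.ail2[block]


ail = ActiveItems()
ail.logged(7, log_position=10)
ail.written_in_place(7)
in_ail2_before = 7 in ail.ail2   # still inside the active region of the log
ail.tail_moved(new_tail=11)
in_ail2_after = 7 in ail.ail2    # the tail has moved past its transaction
```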

When a callback occurs, the log is flushed to the disk and the

metadata for the lock is synced to disk. At this point, any

metadata blocks for the lock that are in the current active region

of the log will be in the AIL2 list. We then build a transaction

that contains revoke tags for each buffer in the AIL2 list that

belongs to that lock.
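
Building the revoke transaction amounts to filtering AIL2 by lock. A sketch, assuming a hypothetical lock_of mapping from block number to the lock that covers it and a ("revoke", block) tuple as the tag format:

```python
def build_revoke_transaction(ail2, lock_of, lock):
    """Return revoke tags for the AIL2 blocks that belong to `lock`."""
    return [("revoke", block) for block in sorted(ail2) if lock_of[block] == lock]


# AIL2 after the flush and sync: block -> position in the log (made-up data).
ail2 = {7: 10, 8: 11, 9: 12}
lock_of = {7: "lock-1", 8: "lock-2", 9: "lock-1"}
transaction = build_revoke_transaction(ail2, lock_of, "lock-1")
```

Only blocks 7 and 9 are revoked here; block 8 belongs to a lock that is not being released, so its journaled copy stays valid.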

Pros:

A) No quiesce necessary for replay.

B) No need for generation numbers sprinkled throughout the

metadata.

C) The sync operations are: stop all metadata operations, start and

wait for the log body, write the log commit block, start and

wait for all the FS' dirty metadata, start and wait for the log

body of a transaction that revokes any of the lock's metadata

buffers in the journal's active region, and write the commit

block for that transaction.

Cons:

A) Recovery takes two passes, one to find all the revoke tags in

the log and one to replay the metadata blocks using the revoke

tags as a filter. This is necessary for a local filesystem and

the total sync method, too. It's just that there will probably

be more tags.
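
The two passes can be sketched as follows. The journal entry format is made up for the example, and the rule that a revoke cancels only copies logged before it is an assumption of this sketch, not something stated above.

```python
def recover(disk, journal):
    """Two-pass replay of a journal of ("block", n, data) / ("revoke", n) entries."""
    # Pass 1: record the latest log position at which each block was revoked.
    revoked_at = {}
    for position, entry in enumerate(journal):
        if entry[0] == "revoke":
            revoked_at[entry[1]] = position

    # Pass 2: replay metadata blocks, using the revoke tags as a filter.
    replayed = []
    for position, entry in enumerate(journal):
        if entry[0] != "block":
            continue
        _, number, contents = entry
        if revoked_at.get(number, -1) > position:
            continue  # cancelled by a later revoke tag
        disk[number] = contents
        replayed.append(number)
    return replayed


disk = {7: "version from B", 8: "stale"}
journal_a = [
    ("block", 7, "version from A"),
    ("block", 8, "version from A"),
    ("revoke", 7),   # written when Node A released the lock covering block 7
]
replayed = recover(disk, journal_a)
```

Block 7 keeps Node B's newer contents because Node A revoked it before releasing the lock, while block 8 is replayed as usual.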

Comparing #2 and #3, both do extra I/O during a lock callback to make

sure that any metadata blocks in the log for that lock will be

removed. I believe #2 will be slower because syncing out all the

dirty metadata for the entire filesystem requires lots of little,

scattered I/O across the whole disk. The extra I/O done by #3 is a

log write to the disk. So, not only should it be less I/O, but it

should also be better suited to get good performance out of the disk

subsystem.

KWP 07/06/05
