<!--Copyright 1997-2002 by Sleepycat Software, Inc.-->
<!--All rights reserved.-->
<!--See the file LICENSE for redistribution information.-->
<title>Berkeley DB Reference Guide: Network partitions</title>
<meta name="description" content="Berkeley DB: An embedded database programmatic toolkit.">
<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,java,C,C++">
<table width="100%"><tr valign=top>
<td><h3><dl><dt>Berkeley DB Reference Guide:<dd>Berkeley DB Replication</dl></h3></td>
<td align=right><a href="../../ref/rep/trans.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/faq.html"><img src="../../images/next.gif" alt="Next"></a>
<h1 align=center>Network partitions</h1>
<p>The Berkeley DB replication implementation can be affected by network
partitioning problems.
<p>For example, consider a replication group with N members. The network
partitions with the master on one side and more than N/2 of the sites
on the other side. The sites on the side with the master will continue
forward, and the master will continue to accept write queries for the
databases. Unfortunately, the sites on the other side of the partition,
realizing they no longer have a master, will hold an election. The
election will succeed as there are more than N/2 of the total sites
participating, and there will then be two masters for the replication
group. Since both masters are potentially accepting write queries, the
databases could diverge in incompatible ways.
<p>If multiple masters are ever found to exist in a replication group, a
master detecting the problem will return <a href="../../api_c/rep_message.html#DB_REP_DUPMASTER">DB_REP_DUPMASTER</a>. If
the application sees this return, it should reconfigure itself as a
client (by calling <a href="../../api_c/rep_start.html">DB_ENV->rep_start</a>), and then call for an election
(by calling <a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a>). The site that wins the election may be
one of the two previous masters, or it may be another site entirely.
Regardless, the winning system will bring all of the other systems into
conformance with itself, and the replication group will continue forward.
<p>As another example, consider a replication group with a master
environment and two clients A and B, where client A may upgrade to
master status and client B cannot. Now assume client A is partitioned
from the other two database environments, and it becomes out-of-date
with respect to the master. Next, assume the master crashes and does
not come back on-line. Subsequently, the network partition is restored,
and clients A and B hold an election. As client B cannot win the
election, client A will win by default, and in order for client B to
get back into sync with client A, committed transactions on client B
may be unrolled until the two sites can once again move forward together.
<p>In both of these examples, there is a phase where a newly elected master
brings the members of a replication group into conformance with itself
so that it can start sending new information to them. This can result
in the loss of information as previously committed transactions are
unrolled.
<p>In architectures where network partitions are an issue, applications
may want to implement a heart-beat protocol to minimize the consequences
of a bad network partition. As long as a master is able to contact at
least half of the sites in the replication group, it is impossible for
there to be two masters. If the master can no longer contact a
sufficient number of systems, it should reconfigure itself as a client
and hold an election.
<p>There is another tool applications can use to minimize the damage in
the case of a network partition. By specifying an <b>nsites</b>
argument to <a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a> that is larger than the actual number of
database environments in the replication group, applications can keep
systems from declaring themselves the master unless they can talk to at
least a large percentage of the sites in the system. For example, if there
are 20 database environments in the replication group, and an argument
of 30 is specified to the <a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a> method, then a system will have
to be able to talk to at least 16 of the sites to declare itself the
master.
<p>Specifying an <b>nsites</b> argument to <a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a> that is
smaller than the actual number of database environments in the
replication group has its uses as well. For example, consider a
replication group with 2 environments. If they are partitioned from
each other, neither of the sites could ever get enough votes to become
the master. A reasonable alternative would be to specify an
<b>nsites</b> argument of 2 to one of the systems and an <b>nsites</b>
argument of 1 to the other. That way, one of the systems could win
elections even when partitioned, while the other one could not. This
would allow one of the systems to continue accepting write queries
even when the network is partitioned.
<p>These scenarios stress the importance of good network infrastructure in
Berkeley DB replicated environments. When replicating database environments
over sufficiently lossy networking, the best solution may well be to
pick a single master, and only hold elections when human intervention
has determined the selected master is unable to recover at all.
<table width="100%"><tr><td><br></td><td align=right><a href="../../ref/rep/trans.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/faq.html"><img src="../../images/next.gif" alt="Next"></a>
<p><font size=1><a href="http://www.sleepycat.com">Copyright Sleepycat Software</a></font>