<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE chapter PUBLIC "-//Samba-Team//DTD DocBook V4.2-Based Variant V1.0//EN" "http://www.samba.org/samba/DTD/samba-doc">
<chapter id="SambaHA">
<chapterinfo>
	&author.jht;
	&author.jeremy;
</chapterinfo>

<title>High Availability</title>

<sect1>
<title>Features and Benefits</title>

<para>
<indexterm><primary>availability</primary></indexterm>
<indexterm><primary>intolerance</primary></indexterm>
<indexterm><primary>vital task</primary></indexterm>
Network administrators are often concerned about the availability of file and print
services. Network users are inclined toward intolerance of any failure in the services
they depend on to perform vital tasks.
</para>

<para>
A sign in a computer room served to remind staff of their responsibilities. It read:
</para>

<blockquote>
<para>
<indexterm><primary>fail</primary></indexterm>
<indexterm><primary>managed by humans</primary></indexterm>
<indexterm><primary>economically wise</primary></indexterm>
<indexterm><primary>anticipate failure</primary></indexterm>
All humans fail, in both great and small ways we fail continually. Machines fail too.
Computers are machines that are managed by humans, the fallout from failure
can be spectacular. Your responsibility is to deal with failure, to anticipate it
and to eliminate it as far as is humanly and economically wise to achieve.
Are your actions part of the problem or part of the solution?
</para>
</blockquote>

<para>
If we are to deal with failure in a planned and productive manner, then first we must
understand the problem. That is the purpose of this chapter.
</para>

<para>
<indexterm><primary>high availability</primary></indexterm>
<indexterm><primary>CIFS/SMB</primary></indexterm>
<indexterm><primary>state of knowledge</primary></indexterm>
Parenthetically, in the following discussion there are seeds of information on how to
provision a network infrastructure against failure. Our purpose here is not to provide
a lengthy dissertation on the subject of high availability. Additionally, we have made
a conscious decision to not provide detailed working examples of high availability
solutions; instead we present an overview of the issues in the hope that someone will
rise to the challenge of providing a detailed document that is focused purely on
presentation of the current state of knowledge and practice in high availability as it
applies to the deployment of Samba and other CIFS/SMB technologies.
</para>

</sect1>

<sect1>
<title>Technical Discussion</title>

<para>
<indexterm><primary>SambaXP conference</primary></indexterm>
<indexterm><primary>Germany</primary></indexterm>
<indexterm><primary>inspired structure</primary></indexterm>
The following summary was part of a presentation by Jeremy Allison at the SambaXP 2003
conference that was held at Goettingen, Germany, in April 2003. Material has been added
from other sources, but it was Jeremy who inspired the structure that follows.
</para>

	<sect2>
	<title>The Ultimate Goal</title>

	<para>
<indexterm><primary>clustering technologies</primary></indexterm>
<indexterm><primary>affordable power</primary></indexterm>
<indexterm><primary>unstoppable services</primary></indexterm>
	All clustering technologies aim to achieve one or more of the following:
	</para>

	<itemizedlist>
		<listitem><para>Obtain the maximum affordable computational power.</para></listitem>
		<listitem><para>Obtain faster program execution.</para></listitem>
		<listitem><para>Deliver unstoppable services.</para></listitem>
		<listitem><para>Avert points of failure.</para></listitem>
		<listitem><para>Exact the most effective utilization of resources.</para></listitem>
	</itemizedlist>

	<para>
	A clustered file server ideally has the following properties:
<indexterm><primary>clustered file server</primary></indexterm>
<indexterm><primary>connect transparently</primary></indexterm>
<indexterm><primary>transparently reconnected</primary></indexterm>
<indexterm><primary>distributed file system</primary></indexterm>
	</para>

	<itemizedlist>
		<listitem><para>All clients can connect transparently to any server.</para></listitem>
		<listitem><para>A server can fail and clients are transparently reconnected to another server.</para></listitem>
		<listitem><para>All servers serve out the same set of files.</para></listitem>
		<listitem><para>All file changes are immediately seen on all servers.</para>
			<itemizedlist><listitem><para>Requires a distributed file system.</para></listitem></itemizedlist></listitem>
		<listitem><para>Infinite ability to scale by adding more servers or disks.</para></listitem>
	</itemizedlist>

	</sect2>

	<sect2>
	<title>Why Is This So Hard?</title>

	<para>
	In short, the problem is one of <emphasis>state</emphasis>.
	</para>

	<itemizedlist>
		<listitem>
			<para>
<indexterm><primary>state information</primary></indexterm>
			All TCP/IP connections are dependent on state information.
			</para>
			<para>
<indexterm><primary>TCP failover</primary></indexterm>
			The TCP connection involves a packet sequence number. This
			sequence number would need to be dynamically updated on all
			machines in the cluster to effect seamless TCP failover.
			</para>
		</listitem>
		<listitem>
			<para>
<indexterm><primary>CIFS/SMB</primary></indexterm>
<indexterm><primary>TCP</primary></indexterm>
			CIFS/SMB (the Windows networking protocols) uses TCP connections.
			</para>
			<para>
			This means that from a basic design perspective, failover is not
			seriously considered.
			<itemizedlist>
				<listitem><para>
				All current SMB clusters are failover solutions
				&smbmdash; they rely on the clients to reconnect. They provide server
				failover, but clients can lose information due to a server failure.
<indexterm><primary>server failure</primary></indexterm>
				</para></listitem>
			</itemizedlist>
			</para>
		</listitem>
		<listitem>
			<para>
			Servers keep state information about client connections.
			<itemizedlist>
<indexterm><primary>state</primary></indexterm>
				<listitem><para>CIFS/SMB involves a lot of state.</para></listitem>
				<listitem><para>Every file open must be compared with other open files
						to check share modes.</para></listitem>
			</itemizedlist>
			</para>
		</listitem>
	</itemizedlist>
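
	<para>
	The share mode check mentioned above can be illustrated with a minimal sketch. It is written in
	Python for brevity and is not Samba code: the flag values, the <literal>Open</literal> record, and the
	<literal>try_open</literal> helper are assumptions made for illustration. The point it shows is that
	every new open must be compared against every existing open of the same file, which is exactly the
	state that all servers in a cluster would have to share.
	</para>

<programlisting language="python"><![CDATA[
from dataclasses import dataclass

# Simplified access/share flags (illustrative values, not SMB wire values).
READ, WRITE, DELETE = 1, 2, 4


@dataclass(frozen=True)
class Open:
    client: str
    access: int  # what this open wants to do with the file
    share: int   # what this open allows other opens to do


def conflicts(existing: Open, new: Open) -> bool:
    """Return True if the two opens cannot coexist on the same file."""
    # Access the new open wants that the existing open does not share.
    new_denied = new.access & ~existing.share
    # Access the existing open already has that the new open does not share.
    old_denied = existing.access & ~new.share
    return bool(new_denied or old_denied)


def try_open(open_table: list, new: Open) -> bool:
    """Admit the new open only if it conflicts with no existing open."""
    if any(conflicts(existing, new) for existing in open_table):
        return False
    open_table.append(new)
    return True
]]></programlisting>

	<para>
	In a single server the open table lives in one place. In a cluster, the same table would have to be
	consulted, and kept consistent, across every server that can open the file.
	</para>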

		<sect3>
		<title>The Front-End Challenge</title>

		<para>
<indexterm><primary>cluster servers</primary></indexterm>
<indexterm><primary>single server</primary></indexterm>
<indexterm><primary>TCP data streams</primary></indexterm>
<indexterm><primary>front-end virtual server</primary></indexterm>
<indexterm><primary>virtual server</primary></indexterm>
<indexterm><primary>de-multiplex</primary></indexterm>
<indexterm><primary>SMB</primary></indexterm>
		To make it possible for a cluster of file servers to appear as a single server that has one
		name and one IP address, the incoming TCP data streams from clients must be processed by the
		front-end virtual server. This server must de-multiplex the incoming packets at the SMB protocol
		layer level and then feed the SMB packet to different servers in the cluster.
		</para>

		<para>
<indexterm><primary>IPC$ connections</primary></indexterm>
<indexterm><primary>RPC calls</primary></indexterm>
		One could direct all IPC$ connections and RPC calls to one server to handle printing and user
		lookup requirements. RPC printing handles are shared between different IPC$ sessions &smbmdash; it is
		hard to split this across clustered servers!
		</para>

		<para>
		Conceptually speaking, all other servers would then provide only file services. This is a simpler
		problem to concentrate on.
		</para>

		</sect3>

		<sect3>
		<title>Demultiplexing SMB Requests</title>

		<para>
<indexterm><primary>SMB requests</primary></indexterm>
<indexterm><primary>SMB state information</primary></indexterm>
<indexterm><primary>front-end virtual server</primary></indexterm>
<indexterm><primary>complicated problem</primary></indexterm>
		De-multiplexing of SMB requests requires knowledge of SMB state information,
		all of which must be held by the front-end <emphasis>virtual</emphasis> server.
		This is a perplexing and complicated problem to solve.
		</para>

		<para>
<indexterm><primary>vuid</primary></indexterm>
<indexterm><primary>tid</primary></indexterm>
<indexterm><primary>fid</primary></indexterm>
		Windows XP and later have changed semantics so state information (vuid, tid, fid)
		must match for a successful operation. This makes things simpler than before and is a
		positive step forward.
		</para>

		<para>
<indexterm><primary>SMB requests</primary></indexterm>
<indexterm><primary>Terminal Server</primary></indexterm>
		SMB requests would be routed by vuid to their associated server. No code exists today to
		effect this solution. This problem is conceptually similar to the problem of
		correctly handling multiple requests from Windows 2000
		Terminal Server in Samba.
		</para>
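
		<para>
		As a purely hypothetical illustration of routing by vuid (no such code exists in Samba), the
		front-end could keep a table that maps each vuid to the backend server chosen when the session was
		set up. The class name, method names, and round-robin assignment below are all assumptions made
		for this sketch, written in Python.
		</para>

<programlisting language="python"><![CDATA[
class VuidRouter:
    """Hypothetical front-end table mapping a session's vuid to a backend."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.by_vuid = {}
        self.next_index = 0

    def session_setup(self, vuid):
        """Pick a backend for a new session (round-robin) and remember it."""
        backend = self.backends[self.next_index % len(self.backends)]
        self.next_index += 1
        self.by_vuid[vuid] = backend
        return backend

    def route(self, vuid):
        """Return the backend that owns this vuid; KeyError if it is unknown."""
        return self.by_vuid[vuid]
]]></programlisting>

		<para>
		Even this small table is state that the front-end must hold for every session, and that would be
		lost if the front-end itself failed.
		</para>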

		<para>
<indexterm><primary>de-multiplexing</primary></indexterm>
		One possibility is to start by exposing the server pool to clients directly.
		This could eliminate the de-multiplexing step.
		</para>

		</sect3>

		<sect3>
		<title>The Distributed File System Challenge</title>

		<para>
<indexterm><primary>Distributed File Systems</primary></indexterm>
		There exist many distributed file systems for UNIX and Linux.
		</para>

		<para>
<indexterm><primary>backend</primary></indexterm>
<indexterm><primary>SMB semantics</primary></indexterm>
<indexterm><primary>share modes</primary></indexterm>
<indexterm><primary>locking</primary></indexterm>
<indexterm><primary>oplock</primary></indexterm>
<indexterm><primary>distributed file systems</primary></indexterm>
		Many could be adopted as the backend for our cluster, so long as SMB
		semantics are kept in mind (share modes, locking, and oplock issues in particular).
		Common free distributed file systems include:
<indexterm><primary>NFS</primary></indexterm>
<indexterm><primary>AFS</primary></indexterm>
<indexterm><primary>OpenGFS</primary></indexterm>
<indexterm><primary>Lustre</primary></indexterm>
		</para>

		<itemizedlist>
			<listitem><para>NFS</para></listitem>
			<listitem><para>AFS</para></listitem>
			<listitem><para>OpenGFS</para></listitem>
			<listitem><para>Lustre</para></listitem>
		</itemizedlist>

		<para>
<indexterm><primary>server pool</primary></indexterm>
		The server pool (cluster) can use any distributed file system backend if all SMB
		semantics are performed within this pool.
		</para>

		</sect3>

		<sect3>
		<title>Restrictive Constraints on Distributed File Systems</title>

		<para>
<indexterm><primary>SMB services</primary></indexterm>
<indexterm><primary>oplock handling</primary></indexterm>
<indexterm><primary>server pool</primary></indexterm>
<indexterm><primary>backend file system pool</primary></indexterm>
		Where a clustered server provides purely SMB services, oplock handling
		may be done within the server pool without imposing a need for this to
		be passed to the backend file system pool.
		</para>

		<para>
<indexterm><primary>NFS</primary></indexterm>
<indexterm><primary>interoperability</primary></indexterm>
		On the other hand, where the server pool also provides NFS or other file services,
		it will be essential that the implementation be oplock-aware so it can
		interoperate with SMB services. This is a significant challenge today. A failure
		to provide this interoperability will result in a significant loss of performance that will be
		sorely noted by users of Microsoft Windows clients.
		</para>

		<para>
		Last, all state information must be shared across the server pool.
		</para>

		</sect3>

		<sect3>
		<title>Server Pool Communications</title>

		<para>
<indexterm><primary>POSIX semantics</primary></indexterm>
<indexterm><primary>SMB</primary></indexterm>
<indexterm><primary>POSIX locks</primary></indexterm>
<indexterm><primary>SMB locks</primary></indexterm>
		Most backend file systems support POSIX file semantics. This makes it difficult
		to push SMB semantics back into the file system. POSIX locks have different properties
		and semantics from SMB locks.
		</para>

		<para>
<indexterm><primary>smbd</primary></indexterm>
<indexterm><primary>tdb</primary></indexterm>
<indexterm><primary>Clustered smbds</primary></indexterm>
		All <command>smbd</command> processes in the server pool must of necessity communicate
		very quickly. For this, the current <parameter>tdb</parameter> file structure that Samba
		uses is not suitable for use across a network. Clustered <command>smbd</command>s must use something else.
		</para>

		</sect3>

		<sect3>
		<title>Server Pool Communications Demands</title>

		<para>
		High-speed interserver communication in the server pool is a design prerequisite
		for a fully functional system. Possibilities for this include:
		</para>

		<itemizedlist>
<indexterm><primary>Myrinet</primary></indexterm>
<indexterm><primary>scalable coherent interface</primary><see>SCI</see></indexterm>
			<listitem><para>
			Proprietary shared memory bus (example: Myrinet or SCI [scalable coherent interface]).
			These are high-cost items.
			</para></listitem>
		
			<listitem><para>
			Gigabit Ethernet (now quite affordable).
			</para></listitem>
		
			<listitem><para>
			Raw Ethernet framing (to bypass TCP and UDP overheads).
			</para></listitem>
		</itemizedlist>

		<para>
		We have yet to identify metrics for the performance demands that must be met for this
		to work effectively.
		</para>

		</sect3>

		<sect3>
		<title>Required Modifications to Samba</title>

		<para>
		Samba needs to be significantly modified to work with a high-speed server interconnect
		system to permit transparent failover clustering.
		</para>

		<para>
		Particular functions inside Samba that will be affected include:
		</para>

		<itemizedlist>
			<listitem><para>
			The locking database, oplock notifications,
			and the share mode database.
			</para></listitem>

			<listitem><para>
<indexterm><primary>failure semantics</primary></indexterm>
<indexterm><primary>oplock messages</primary></indexterm>
			Failure semantics need to be defined. Samba behaves the same way as Windows.
			When oplock messages fail, a file open request is allowed, but this is 
			potentially dangerous in a clustered environment. So how should interserver
			pool failure semantics function, and how should such functionality be implemented?
			</para></listitem>

			<listitem><para>
			Should this be implemented using a point-to-point lock manager, or can this
			be done using multicast techniques?
			</para></listitem>

		</itemizedlist>
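
		<para>
		The failure-semantics question can be made concrete with a small sketch. It is a hypothetical
		illustration in Python, not a proposed design: <literal>handle_open</literal>,
		<literal>notify_holder</literal>, and the <literal>fail_open</literal> switch are assumed names.
		With <literal>fail_open</literal> set, the open is allowed when the oplock message fails, which is
		the behavior described above; with it cleared, the open is refused, which is the safer but less
		available choice a cluster might prefer.
		</para>

<programlisting language="python"><![CDATA[
def handle_open(notify_holder, fail_open=True):
    """Decide whether a new open may proceed while another node holds an oplock.

    notify_holder is a callable that delivers the oplock break message and
    returns True once the holder has acknowledged it. It may return False or
    raise OSError (for example ConnectionError) when the message cannot be
    delivered.
    """
    try:
        acknowledged = notify_holder()
    except OSError:
        acknowledged = False

    if acknowledged:
        return True
    # The message failed: the policy decides between availability and safety.
    return fail_open
]]></programlisting>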

		</sect3>
	</sect2>

	<sect2>
	<title>A Simple Solution</title>

	<para>
<indexterm><primary>failover servers</primary></indexterm>
<indexterm><primary>exported file system</primary></indexterm>
<indexterm><primary>distributed locking protocol</primary></indexterm>
	Allowing failover servers to handle different functions within the exported file system
	removes the problem of requiring a distributed locking protocol.
	</para>

	<para>
<indexterm><primary>high-speed server interconnect</primary></indexterm>
<indexterm><primary>complex file name space</primary></indexterm>
	If only one server is active in a pair, the need for high-speed server interconnect is avoided.
	This allows the use of existing high-availability solutions, instead of inventing a new one.
	This simpler solution comes at a price &smbmdash; the need to manage a more
	complex file name space. Since there is no longer a single file system, administrators
	must remember where all services are located &smbmdash; a complexity not easily dealt with.
	</para>

	<para>
<indexterm><primary>virtual server</primary></indexterm>
	The <emphasis>virtual server</emphasis> is still needed to redirect requests to backend
	servers. Backend file space integrity is the responsibility of the administrator.
	</para>

	</sect2>

	<sect2>
	<title>High-Availability Server Products</title>

	<para>
<indexterm><primary>resource failover</primary></indexterm>
<indexterm><primary>high-availability services</primary></indexterm>
<indexterm><primary>dedicated heartbeat</primary></indexterm>
<indexterm><primary>LAN</primary></indexterm>
<indexterm><primary>failover process</primary></indexterm>
	Failover servers must communicate in order to handle resource failover. This is essential
	for high-availability services. The use of a dedicated heartbeat is a common technique to
	introduce some intelligence into the failover process. This is often done over a dedicated
	link (LAN or serial).
	</para>
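
	<para>
	The heartbeat idea can be sketched in a few lines of Python. This is a minimal illustration under
	stated assumptions, not how any particular product works: the <literal>HeartbeatMonitor</literal>
	class and its timeout are hypothetical, and a real implementation would also have to guard against
	both nodes believing the other has failed.
	</para>

<programlisting language="python"><![CDATA[
import time


class HeartbeatMonitor:
    """Track when the peer was last heard from over the heartbeat link."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout      # seconds of silence tolerated
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = clock()

    def beat(self):
        """Record that a heartbeat message has just arrived from the peer."""
        self.last_seen = self.clock()

    def peer_alive(self):
        """Return False once the peer has been silent longer than the timeout."""
        return (self.clock() - self.last_seen) <= self.timeout
]]></programlisting>

	<para>
	The standby server would call <literal>beat()</literal> for each message received and begin resource
	failover once <literal>peer_alive()</literal> returns false.
	</para>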

	<para>
<indexterm><primary>SCSI</primary></indexterm>
<indexterm><primary>Red Hat Cluster Manager</primary></indexterm>
<indexterm><primary>Microsoft Wolfpack</primary></indexterm>
<indexterm><primary>Fiber Channel</primary></indexterm>
<indexterm><primary>failover communication</primary></indexterm>
	Many failover solutions (like Red Hat Cluster Manager and Microsoft Wolfpack)
	can use a shared SCSI or Fiber Channel disk storage array for failover communication.
	Information regarding Red Hat high availability solutions for Samba may be obtained from
	<ulink url="http://www.redhat.com/docs/manuals/enterprise/RHEL-AS-2.1-Manual/cluster-manager/s1-service-samba.html">www.redhat.com</ulink>.
	</para>

	<para>
<indexterm><primary>Linux High Availability project</primary></indexterm>
	The Linux High Availability project is a resource worthy of consultation if your desire is
	to build a highly available Samba file server solution. Please consult the home page at
	<ulink url="http://www.linux-ha.org/">www.linux-ha.org/</ulink>.
	</para>

	<para>
<indexterm><primary>backend failures</primary></indexterm>
<indexterm><primary>continuity of service</primary></indexterm>
	Front-end server complexity remains a challenge for high availability because it must deal
	gracefully with backend failures, while at the same time providing continuity of service
	to all network clients.
	</para>
	
	</sect2>

	<sect2>
	<title>MS-DFS: The Poor Man's Cluster</title>

	<para>
<indexterm><primary>MS-DFS</primary></indexterm>
<indexterm><primary>DFS</primary><see>MS-DFS, Distributed File Systems</see></indexterm>
	MS-DFS links can be used to redirect clients to disparate backend servers. This pushes
	complexity back to the network client, where Microsoft has already included the necessary support.
	MS-DFS creates the illusion of a simple, continuous file system name space that works even
	at the file level.
	</para>
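
	<para>
	To show what such a redirection carries, the following Python sketch parses a link target of the
	form <literal>msdfs:serverB\share,serverC\share</literal>, which is how Samba's MS-DFS
	documentation describes the symbolic links placed in a DFS root. The function name is hypothetical
	and the sketch only handles this simple form; treat it as an illustration of the idea, not as a
	description of Samba's own parser.
	</para>

<programlisting language="python"><![CDATA[
def parse_msdfs_target(target):
    """Split an 'msdfs:' link target into a list of (server, share) referrals."""
    prefix = "msdfs:"
    if not target.startswith(prefix):
        raise ValueError("not an MS-DFS link target")

    referrals = []
    # Alternative servers for the same link are separated by commas.
    for entry in target[len(prefix):].split(","):
        server, _, share = entry.strip().partition("\\")
        referrals.append((server, share))
    return referrals
]]></programlisting>

	<para>
	A link with more than one referral gives the client alternative servers to try, which is where the
	pseudo-cluster behavior comes from.
	</para>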

	<para>
	Above all, at the cost of complexity of management, a distributed system (pseudo-cluster) can
	be created using existing Samba functionality.
	</para>

	</sect2>

	<sect2>
	<title>Conclusions</title>

	<itemizedlist>
		<listitem><para>Transparent SMB clustering is hard to do!</para></listitem>
		<listitem><para>Client failover is the best we can do today.</para></listitem>
		<listitem><para>Much more work is needed before a practical and manageable high-availability transparent cluster solution will be possible.</para></listitem>
		<listitem><para>MS-DFS can be used to create the illusion of a single transparent cluster.</para></listitem>
	</itemizedlist>

	</sect2>

</sect1>
</chapter>