
<li> Douglas Thain and Miron Livny, <a href=http://www.cse.nd.edu/~dthain/papers/ethernet-hpdc12.pdf>The Ethernet Approach to Grid Computing</a>, IEEE High Performance Distributed Computing, August 2003.

</dir>

<h2>Introduction</h2>

Shell scripts are a vital tool for integrating software.

They are indispensable for rapid prototyping and system assembly.

Yet, they are extraordinarily sensitive to errors.

A missing file, a broken network, or a sick file server can

cause a script to barrel through its actions with surprising results.

It is possible to write error-safe scripts,

but only with extraordinary discipline and complexity.

<p>

The Fault Tolerant Shell (ftsh) aims to solve this problem

by combining the ease of scripting with precise error semantics.

Ftsh is a balance between the flexibility and power of script languages

and the precision of most compiled languages. Ftsh parses complete

programs in order to eliminate run-time errors.

An exception-like structure allows scripts to be both succinct and safe.

A focus on timed repetition simplifies the most common form of

recovery in a distributed system. A carefully-vetted set of language

features limits the "surprises" that haunt system programmers.

<p>

As an example, consider this innocuous script written in Bourne Shell:

<pre>
#!/bin/sh

cd /work/foo
rm -rf bar
cp -r /fresh/data .
</pre>

Suppose that the <tt>/work</tt> filesystem is temporarily unavailable,

perhaps due to an NFS failure.

The <tt>cd</tt> command will fail and print a message on the console.

The shell will ignore this error result -- it is primarily designed

as a user interface tool -- and proceed to execute the <tt>rm</tt> and

<tt>cp</tt> in whatever directory it happened to be in before.

<p>

Naturally, we may attempt to head off these cases with code that

checks error codes, attempts to recover, and so on. However,

even the disciplined programmer who leaves no value unturned must

admit that this makes shell scripts incomprehensible:

<pre>
#!/bin/sh

ok=0
for attempt in 1 2 3
do
    if cd /work/foo
    then
        ok=1
        break
    else
        echo "cd failed, trying again..."
        sleep 5
    fi
done

if [ "$ok" -ne 1 ]
then
    echo "couldn't cd, giving up..."
    exit 1
fi
</pre>

And that's just the first line!

<p>

If we accept that failure, looping, timeouts, and job cancellation

are fundamental concerns in distributed systems, we may both simplify and

strengthen programs by making them fundamental expressions in a programming

language. These concepts are embodied in the simple <tt>try</tt> command:

<pre>
#!/usr/bin/ftsh

try for 5 minutes every 30 seconds
    cd /work/foo
    rm -rf bar
    cp -r /fresh/data .
end
</pre>

<p>

Ftsh provides simple structures that encourage explicit acknowledgement
of failure while maintaining the readability of script code. You might
think of this as exceptions for scripts.
<P>
Want to learn more?
This document is a short introduction to the fault tolerant shell.
It quickly breezes over the language features in order to get
you started. You can learn more about the motivation for the language in
<a href=http://www.cse.nd.edu/~ccl/software/ftsh/ethernet-hpdc12.pdf>"The Ethernet Approach to Grid Computing"</a>,
available from the <a href=http://www.cse.nd.edu/~ccl/software/ftsh>ftsh web page</a>.
For a quick introduction, read on!


<h2>Basics</h2>


<h3>Simple Commands</h3>

An ftsh program is built up from simple commands.
A simple command names a program to be executed, just as in <tt>sh</tt> or <tt>csh</tt>.
The command is separated from its arguments by whitespace.
Quotation marks may be used to escape whitespace. For example:
<pre>
ls -l /etc/hosts
</pre>
or:
<pre>
cat /etc/passwd
</pre>
or:
<pre>
cp "This File" "That File"
</pre>
<p>
As you may know, a command (a UNIX process) returns an integer known
as its "exit code." Convention dictates that an exit code of zero
indicates success while any other number indicates failure. Languages
tend to differ in their mapping of integers to success or failure, so
from here on, we will simply use the abstract terms "success" and "failure."
<p>
A command may also fail in a variety of other ways without returning an
exit code. It may be killed by a signal, or it may fail to start
altogether if the program does not exist or its image cannot be loaded.
These cases are also considered failures.
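<p>
For comparison, a plain Bourne shell can tell these outcomes apart by inspecting <tt>$?</tt>.
The sketch below is not part of ftsh: <tt>classify_status</tt> is a hypothetical helper,
and the 126/127 and above-128 conventions are those of POSIX shells, not something ftsh defines.

```shell
#!/bin/sh
# classify_status: hypothetical helper that maps a POSIX shell exit status
# onto the abstract outcomes used in this document.
classify_status() {
	status=$1
	if [ "$status" -eq 0 ]; then
		echo "success"
	elif [ "$status" -eq 126 ] || [ "$status" -eq 127 ]; then
		# The shell could not start the program at all.
		echo "failure: could not start"
	elif [ "$status" -gt 128 ]; then
		# By shell convention, 128+N means "killed by signal N".
		echo "failure: killed by signal $((status - 128))"
	else
		echo "failure: exit code $status"
	fi
}

# Demonstration: a child that exits with code 3.
sh -c 'exit 3' || classify_status $?
```

The demonstration line prints <tt>failure: exit code 3</tt>.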


<h3>Groups</h3>

A "group" is simply a list of commands.
Each command only runs if the previous command succeeded.
Let's return to our first example:
<pre>
#!/usr/bin/ftsh

cd /work/foo
rm -rf bar
cp -r /fresh/data .
</pre>
This group succeeds only if <b>every command</b> in the group succeeds.
So, if <tt>cd</tt> fails, then the whole group fails and no further commands
are executed.
<p>
This is called the "brittle" property of ftsh. If anything goes wrong,
then processing stops instantly. When something goes wrong, you will
know it, and the program will not "run away" executing more commands blindly.
We will see ways to contain the brittleness of a program below.
<p>
Ftsh itself has an exit code.
Ftsh returns the result of the top-level group that makes up the program.
So, if any command in the top-level group fails, then ftsh itself will
fail. If they all succeed, then ftsh succeeds.
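<p>
To make the brittle property concrete, here is a rough Bourne shell imitation of a group.
It is only a sketch: <tt>run_group</tt> is a hypothetical helper, and each argument is
assumed to be one complete command line.

```shell
#!/bin/sh
# run_group: hypothetical helper that runs each argument as one command line
# and stops at the first failure, much as an ftsh group does.
run_group() {
	for cmd in "$@"; do
		sh -c "$cmd" || return 1
	done
	return 0
}

# The third command never runs because the second one fails.
run_group "echo first" "false" "echo third" || echo "group failed"
```

The demonstration prints <tt>first</tt> and then <tt>group failed</tt>; the third command is never started.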


<h3>Try Statements</h3>

A try statement is used to contain and retry group failure.
Here is a simple try statement:
<pre>
#!/usr/bin/ftsh

try 5 times
    cd /work/foo
    rm -rf bar
    cp -r /fresh/data .
end
</pre>

The try statement attempts to execute the enclosed group until
the conditions in its header expire. Here, the group will
be attempted five times. Recall that a group fails as soon as any
one command fails. So, if <tt>rm</tt> fails, then the try statement
will stop and attempt the group again from the beginning.
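<p>
The retry behaviour can be approximated in plain Bourne shell. The following is a
minimal sketch, not ftsh's implementation: <tt>try_times</tt> is a hypothetical helper,
it retries a single command, not a whole group, and it omits the delay that ftsh inserts
between attempts.

```shell
#!/bin/sh
# try_times: hypothetical stand-in for "try N times". Runs a command up to
# N times and succeeds as soon as one attempt succeeds.
try_times() {
	limit=$1
	shift
	attempt=1
	while [ "$attempt" -le "$limit" ]; do
		"$@" && return 0
		attempt=$((attempt + 1))
	done
	return 1	# every attempt failed, so the try itself fails
}

try_times 5 test -d / && echo "succeeded within five attempts"
```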

<p>
If the five attempts are exhausted, then the try statement itself
fails, and (if it is the top-level try statement) the whole
shell program itself will fail. If you prefer, you may
catch and react to a failed try statement in a manner similar to
an exception. The <tt>failure</tt> keyword may be used
to cause a new exception, just like <tt>throw</tt> in other languages.
<pre>
try 5 times
    cd /work/foo
    rm -rf bar
    cp -r /fresh/data .
catch
    echo "Oops, it failed. Oh well!"
    failure
end
</pre>

<p>
Try statements come in several forms.
They may limit the number of times the group is executed.
For example:
<pre>
try for 10 times
</pre>
A try statement may allow an unlimited number of loops,
terminated by a maximum amount of time, given in seconds,
minutes, hours, or days:
<pre>
try for 45 seconds
</pre>
Both may be combined, yielding a try statement that stops
when either the loop limit or the time limit expires:
<pre>
try for 3 days or 100 times
</pre>
Note that such a statement does not limit the length
of any single attempt to execute the contained group.
If a single command is delayed for three days, the try statement
will wait that long and then kill the command.
To force individual attempts to be shorter, try statements
may be nested. For example:
<pre>
try for 3 days or 100 times
    try for 1 time or 1 minute
        /bin/big-simulation
    end
end
</pre>
Here, <tt>big-simulation</tt> will be executed for
no more than a minute at a time. Such one-minute attempts
will be tried for up to three days or one hundred attempts
before the outer try statement fails.
<p>
By default, ftsh uses an exponential backoff.
If a group fails once, ftsh will wait one second, and then
try again. If it fails again, it will wait 2 seconds,
then 4 seconds, and so on, doubling the waiting time after
each failure, up to a maximum of one hour.
This prevents failures from consuming
excessive resources in fruitless retries.
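<p>
The backoff schedule described above can be written down as a small calculation.
This is a sketch under assumptions: <tt>backoff_delay</tt> is a hypothetical helper,
and clamping the delay to exactly 3600 seconds is a guess at how the one-hour
maximum is applied.

```shell
#!/bin/sh
# backoff_delay: hypothetical helper giving the wait, in seconds, after the
# Nth consecutive failure: 1, 2, 4, ... capped at one hour (assumed 3600).
backoff_delay() {
	failures=$1
	delay=1
	count=1
	while [ "$count" -lt "$failures" ] && [ "$delay" -lt 3600 ]; do
		delay=$((delay * 2))
		count=$((count + 1))
	done
	if [ "$delay" -gt 3600 ]; then
		delay=3600
	fi
	echo "$delay"
}

# Waits after the first five failures.
for n in 1 2 3 4 5; do
	backoff_delay "$n"
done
```

The loop prints 1, 2, 4, 8, and 16, one value per line.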

<p>
If you prefer to have the retries occur at regular
intervals (though we don't recommend it),
use the <tt>every</tt> keyword to control
how frequently errors are retried. For example:
<pre>
try for 3 days every 1 hour

try for 10 times every 30 seconds

try for 1 minute or 3 times every 15 seconds
</pre>
If a time limit expires in the middle of a try statement, then
the currently running command is forcibly cancelled. If an
<tt>every</tt> clause is used, it merely ensures that each attempt
is at <b>least</b> that long. However, a group will not be cancelled
merely to satisfy an <tt>every</tt> clause. To ensure that a single
loop attempt will be time limited, you may combine two try
statements as above:
<pre>
try for 3 days or 100 times <b>every 1 minute</b>
    try for 1 time or <b>1 minute</b>
        /bin/big-simulation
    end
end
</pre>
Try statements themselves return either success or failure
in the same way as a simple command.
If the enclosed group finally succeeds, then the try
statement itself succeeds. If the try statement exhausts
its attempts, then the try statement itself fails.
We will make use of this success or failure value in the next section.
<p>
Cancelling a process is somewhat more complicated than one might think.
For all the details on how this actually works, see the section
on cancelling processes below.
<p>
In (almost) all cases, a try statement absolutely controls
what comes inside of it. There are two ways for a subprogram
to break out of the control of a try. The first is to invoke
<tt>exit</tt>, which causes the entire ftsh process to exit
immediately with the given exit code. The second is to call
<tt>exec</tt>, which causes the given process to be run
in place of the current shell process, thus voiding any
surrounding controls.


<h3>Redirection</h3>

Ftsh uses Bourne shell style I/O redirection. For example:
<pre>
echo "hello" > outfile
</pre>
...sends the output <tt>hello</tt> into the file <tt>outfile</tt>.
Likewise, ftsh supports many of the more arcane redirections
of the Bourne shell, such as the redirection of explicit file descriptors:
<pre>
grep needle 0&lt;infile 1>outfile 2>errfile
</pre>
...appending to output files:
<pre>
grep needle >>outfile 2>>errfile
</pre>
...redirection to an open file descriptor:
<pre>
grep needle >outfile 2>&1
</pre>
...and redirection of both output and error at once:
<pre>
grep needle >& out-and-err-file
</pre>


<h2>Variables</h2>

Ftsh provides variables similar to those of the Bourne shell.
For example,
<pre>
name=Douglas
echo "Hello, ${name}!"
echo "Hello, $(name)!"
echo "Hello, $name!"
</pre>
Ftsh also allows variables to be the source and target
of redirections. That is, a variable may act as a file!
The benefit of this approach is that ftsh manages the storage
and name space of variables for you. You don't have to worry
about cleaning up or clashing with other programs.
<p>
Variable redirection looks just like file redirection, except
a dash is put in front of the redirector.
For example, suppose that we want to capture the output of
<tt>grep</tt> and then run it through <tt>sort</tt>:
<pre>
grep needle /tmp/haystack -> needles
sort -< needles
</pre>

This sort of operation takes the place of a pipeline,
which ftsh does not have (yet). However, by using variables
instead of pipelines, different retry conditions may be
placed on each stage of the work:
<pre>
try for 5 times
    grep needle /tmp/haystack -> needles
end
try for 1 hour
    sort -< needles
end
</pre>
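<p>
For readers more familiar with the Bourne shell, the same capture-then-sort idea can be
approximated with command substitution. This is only a sketch for comparison:
<tt>capture_then_sort</tt> is a hypothetical helper, and a shell variable holds the
text in memory, whereas an ftsh variable acts as a file.

```shell
#!/bin/sh
# capture_then_sort: hypothetical helper. Captures the output of grep in a
# variable, then feeds that variable to sort as a separate stage.
capture_then_sort() {
	pattern=$1
	file=$2
	needles=$(grep "$pattern" "$file") || return 1	# like: grep PATTERN FILE -> needles
	printf '%s\n' "$needles" | sort			# like: sort -< needles
}

# Demonstration on a throwaway input file.
haystack=$(mktemp) || exit 1
printf 'needle b\nhay\nneedle a\n' > "$haystack"
capture_then_sort needle "$haystack"
rm -f "$haystack"
```

Because the two stages are separate commands, each could be wrapped in its own retry logic.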

<p>
All of the variations on file redirection are available for variable
redirection, including -> and 2-> and ->>
and 2->> and ->& and ->>&.
<p>
As in the Bourne shell, several variable names are reserved.
Simple integers are used to refer
to the command line arguments given to ftsh itself.
<tt>$$</tt> names the current process.
<tt>$#</tt> gives the number of arguments passed to the program.
<tt>$*</tt> gives all of the current arguments unquoted, while
<tt>"$@"</tt> gives all of the arguments individually quoted.
The <tt>shift</tt> command can be used to pop off the
first positional argument.
<p>
Variables are implemented by creating temporary files and immediately unlinking
them after creation. Thus, no matter how ftsh exits
-- even if it crashes -- the kernel cleans up the buffer space after you.
This prevents both the namespace and garbage
collection problems left by scripts that manually read and write to files.
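<p>
The unlink-after-create technique can be demonstrated in plain Bourne shell.
This is a sketch of the general idea and not ftsh's actual code: <tt>buffer_roundtrip</tt>
is a hypothetical helper, <tt>mktemp</tt> is assumed to be available, and descriptors 3 and 4
are arbitrary choices.

```shell
#!/bin/sh
# buffer_roundtrip: hypothetical helper. Writes its argument into an
# anonymous buffer (a temporary file that has already been unlinked)
# and reads it back through a second descriptor.
buffer_roundtrip() {
	buffer=$(mktemp) || return 1
	exec 3>"$buffer" 4<"$buffer"	# one descriptor to write, one to read
	rm -f "$buffer"			# the name is gone; the data lives while 3 and 4 stay open
	printf '%s\n' "$1" >&3
	read -r line <&4
	exec 3>&- 4<&-			# closing both lets the kernel reclaim the space
	printf '%s\n' "$line"
}

buffer_roundtrip "needle in a haystack"
```

The file never appears under a name that another program could collide with, and
nothing is left behind even if the shell is killed between the write and the read.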



<h2>Structures</h2>

Complex programs are built up by combining basic elements
with programming structures. Ftsh has most of the decision
elements of other programming languages, such as conditionals
and loops. Each of these elements behaves in a very precise way with
respect to successes and failures.

<h3>For-Statements</h3>

A for-statement executes a command group once for each
word in a list. For example:

<pre>
for food in bread wine meatballs
    echo "I like ${food}"
end
</pre>

Of course, the list of items may also come from a variable:

<pre>
packages="bread.tar.gz wine.tar.gz meatballs.tar.gz"

for p in ${packages}
    echo "Unpacking package ${p}..."
    tar xvzf ${p}
end
</pre>

The more interesting variations are <tt>forany</tt> and <tt>forall</tt>.
A <tt>forany</tt> attempts to make a group succeed once for any of the options given
in the header, chosen randomly. After the for-statement has run, the branch that
succeeded is made available through the control variable:
<pre>
hosts="mirror1.wisc.edu mirror2.wisc.edu mirror3.wisc.edu"
forany h in ${hosts}
    echo "Attempting host ${h}"
    wget http://${h}/some-file
end
echo "Got file from ${h}"
</pre>
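<p>
The "first option that works" idea behind <tt>forany</tt> can be imitated in plain Bourne shell.
This is a simplified sketch: <tt>forany_sketch</tt> and <tt>is_dir</tt> are hypothetical
helpers, and the options are tried in the order given where ftsh chooses randomly.

```shell
#!/bin/sh
# forany_sketch: hypothetical helper. Runs "command option" for each option
# in turn and stops at the first success, printing the winning option.
forany_sketch() {
	command=$1
	shift
	for option in "$@"; do
		if "$command" "$option"; then
			echo "$option"
			return 0
		fi
	done
	return 1	# no option succeeded
}

# is_dir stands in for a download attempt; only an existing directory succeeds.
is_dir() {
	test -d "$1"
}

winner=$(forany_sketch is_dir /no/such/dir /) && echo "Got it from ${winner}"
```

The demonstration prints <tt>Got it from /</tt>, since the first option fails and the second succeeds.
<p>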


A <tt>forall</tt> attempts to make a group succeed for all of the options
given in the header, simultaneously:
<pre>
forall h in ${hosts}
    ssh ${h} reboot
end
</pre>
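<p>
A rough Bourne shell counterpart of <tt>forall</tt> starts every branch in the background
and then collects the results. This is a sketch only: <tt>forall_sketch</tt> and <tt>is_dir</tt>
are hypothetical helpers, and this version waits for every branch to finish instead of
cancelling the others when one fails.

```shell
#!/bin/sh
# forall_sketch: hypothetical helper. Starts "command option" for every
# option in the background, then succeeds only if all of them succeeded.
forall_sketch() {
	command=$1
	shift
	pids=""
	for option in "$@"; do
		"$command" "$option" &
		pids="$pids $!"
	done
	result=0
	for pid in $pids; do
		wait "$pid" || result=1
	done
	return "$result"
}

# is_dir stands in for the real per-host work.
is_dir() {
	test -d "$1"
}

forall_sketch is_dir / /no/such/dir || echo "at least one branch failed"
```
<p>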


Both <tt>for</tt> and <tt>forall</tt> are brittle with respect to failures.
If any instance fails, then the entire for-statement
fails. A try-statement may be added in one of two ways.
If you wish to make each iteration resilient, place
the try-statement inside the for-statement:

<pre>
for p in ${packages}
    try for 1 hour every 5 minutes
        echo "Unpacking package ${p}..."
        tar xvzf ${p}
    end
end
</pre>

Or, if you wish to make the entire for-statement
restart after a failure, place it outside:

<pre>
try for 1 hour every 5 minutes
    for p in ${packages}
        echo "Unpacking package ${p}..."
        tar xvzf ${p}
    end
end
</pre>


<h3>Loops, Conditionals, and Expressions</h3>

Ftsh has loops and conditionals similar to other languages.
For example:
<pre>
n=0
while $n .lt. 10
    echo "n is now ${n}"
    n=$n .add. 1
end
</pre>
And:
<pre>
if $n .lt. 1000
    echo "n is less than 1000"
else if $n .eq. 1000
    echo "n is equal to 1000"
else
    echo "n is greater than 1000"
end
</pre>

You'll notice right away that arithmetic expressions look a little
different from those in other languages.
Here's how it works:
<p>
The arithmetic operators .add. .sub. .mul. .div. .mod. .pow.
represent addition, subtraction, multiplication, division, modulus,
and exponentiation, with parentheses and the usual order
of operations. For example:
<pre>
a=$x .mul. ( $y .add. $z )
</pre>
The comparison operators .eq. .ne. .le. .lt. .ge. .gt.
represent equal, not-equal, less-than-or-equal, less-than,
greater-than-or-equal, and greater-than. These return
the literal strings "true" and "false".
<pre>
uname -s -> n
if $n .ne. Linux
    ...
end
</pre>
For integer comparison, use the operators .eql. and .neql.:
<pre>
if $x .eql. 5
    ...
end
</pre>

The Boolean operators .and. .or. .not. have the usual meaning.
An exception is thrown if they are given arguments that are
not the boolean strings "true" or "false".
<pre>
while ( $x .lt. 10 ) .and. ( $y .gt. 20 )
    ...
end
</pre>
The unary file operators .exists. .isr. .isw. .isx. test whether
a filename exists, is readable, writable, or executable, respectively.
The similar operators .isfile. .isdir. .issock. .isblock. .ischar.
test for the type of a named file.
All these operators throw exceptions if the named file is unavailable
for examination.
<pre>
f=/etc/passwd
if ( .isfile. $f ) .and. ( .isr. $f )
    ...
end
</pre>
Finally, the .to. and .step. operators are conveniences for generating
numeric lists to be used with for-loops:
<pre>
forall x in 1 .to. 100
    ssh c$x reboot
end

for x in 1 .to. 100 .step. 5
    y=$x .mul. $x
    echo "$x times $x is $y"
end
</pre>
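<p>
As a rough illustration of the list that <tt>.to.</tt> and <tt>.step.</tt> stand for, the
following Bourne shell sketch prints such a sequence. <tt>range_sketch</tt> is a hypothetical
helper, and treating the upper bound as inclusive is an assumption that the text above does not settle.

```shell
#!/bin/sh
# range_sketch: hypothetical helper printing START, START+STEP, ... one per
# line, up to END. END is assumed to be inclusive; STEP defaults to 1.
range_sketch() {
	value=$1
	end=$2
	step=${3:-1}
	while [ "$value" -le "$end" ]; do
		echo "$value"
		value=$((value + step))
	done
}

# Roughly what "1 .to. 20 .step. 5" would enumerate under that assumption.
range_sketch 1 20 5
```

The call prints 1, 6, 11, and 16, one value per line.
<p>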


Notice that, unlike other shells, there is a distinction
between expressions, which compute a value or throw an
exception, and external commands, which return no value.
Therefore, you cannot do this:
<pre>
# !!! This is wrong !!!
if rm $f
    echo "Removed $f"
else
    echo "Couldn't remove $f"
end
</pre>
Instead, you want this:
<pre>
try
    rm $f
    echo "Removed $f"
catch
    echo "Couldn't remove $f"
end
</pre>


<h3>Functions</h3>

Simple functions are named groups of commands that may
be called in the same manner as an external program.
The arguments passed to the function are available
in the same way as arguments to the shell:

<pre>
function compress_and_move
    echo "Working on ${1}..."
    gzip ${1}
    mv ${1}.gz ${2}
end

compress_and_move /etc/hosts /tmp/hosts.gz
compress_and_move /etc/passwd /tmp/passwd.gz
compress_and_move /usr/dict/words /tmp/dict.gz
</pre>

A function may also be used to compute and return
a value:

<pre>
function fib
    if $1 .le. 1
        return 1
    else
        return fib($1 .sub. 1) .add. fib($1 .sub. 2)
    end
end

value=fib(100)
echo $value
</pre>

Functions, like groups, are brittle with respect
to failures. A failure inside a function causes
the entire function to stop and fail immediately.
As in most languages, functions may be both nested
and recursive. However, ftsh aborts recursive
function calls deeper than 1000 steps.
If a function is used in an expression but does not
return a value, then the expression evaluation fails.



<h2>Miscellaneous Features</h2>

<h3>Environment</h3>
Variables may be exported into the environment, just as in the Bourne shell:
<pre>
PATH="/usr/bin:/usr/local/bin"
export PATH
</pre>

<h3>Nested Shells</h3>

Ftsh is perfectly safe to nest.
That is, an ftsh script may safely call other scripts written in ftsh.
An ftsh passes all of its options to sub-shells using environment
variables, so logs, error settings, and timeouts are uniform from
top to bottom. If a sub-shell provides its own arguments, these
override the environment settings of the parent.

<h3>Error Logging</h3>

Ftsh may optionally keep a log that describes all the details
of a script's execution. The -f option specifies a log file.
Logs are opened for appending, so parallel shells and sub-shells may
share the same log. The time, process number, script, and
line number are all recorded with every event.
<p>
<b>Note: Logs shared between processes must not be recorded in NFS or AFS filesystems.
NFS is not designed to support shared appending: your logs
will be corrupted sooner or later. AFS is not designed to support
simultaneous write sharing of a file: you will end up with the
log of one process or another, but not both. These are deliberate
design limitations of these filesystems and are not bugs in
UNIX or ftsh.
</b>
<p>
The amount of detail kept in a log
is controlled with the -l option. These logging
levels are currently defined:
<ul>
<li><b>0</b> - Nothing is logged.
<li><b>10</b> - Display failed commands and structures.
<li><b>20</b> - Display executed commands and their exit codes.
<li><b>30</b> - Display structural elements such as TRY and IF-THEN.
<li><b>40</b> - Display process activities such as signals and completions.
</ul>


<h3>Command-Line Arguments</h3>
Ftsh accepts the following command-line arguments:
<ul>
<li> <b>-f &lt;file&gt;</b> - The name of a log file for tracing the
execution of a script. This log file is opened in append mode.
Equivalent to the environment variable FTSH_LOG_FILE.
<li> <b>-l &lt;level&gt;</b> - The logging level, on a scale of 0 to 100.
Higher numbers log more data about a script.
Equivalent to the environment variable FTSH_LOG_LEVEL.
<li> <b>-k &lt;mode&gt;</b> - Controls whether ftsh trusts the operating
system to actually kill a process. If set to 'weak', ftsh will assume
that processes die without checking. If set to 'strong', ftsh will issue
SIGKILL repeatedly until a process actually dies.
Equivalent to the environment variable FTSH_KILL_MODE.
<li> <b>-t &lt;secs&gt;</b> - The amount of time ftsh will wait
between requesting a process to exit (SIGTERM) and killing it forcibly (SIGKILL).
Equivalent to the environment variable FTSH_KILL_TIMEOUT.
<li> <b>-p</b> - Parse, but do not execute the script. This option may be used to test the validity of an ftsh script.
<li> <b>-v</b> - Show the version of ftsh.
<li> <b>-h</b> - Show the help screen.
</ul>


<h3>Cancelling Processes</h3>

Cancelling a running process in UNIX is quite complex.
Although starting and stopping one single process is fairly
easy, there are several complications to managing a tree of
processes, as well as dealing with the various failures that
can occur in the transmission of a signal.
<p>
Ftsh can clean up any set of processes that it starts,
given the following restrictions:
<dir>
<li> Your programs must not create a new UNIX "process session"
with the <tt>setsid()</tt> system call. If you don't know what
this is, then don't worry about it.
<li> The operating system must actually kill a process when
ftsh asks it to. Some variants of Linux won't kill processes
using distributed file systems. In that case, consider using the "weak" mode
of ftsh.
<li> Rogue system administrators must not forcibly kill
an ftsh with a SIGKILL. However, you may safely send a SIGTERM, SIGHUP,
SIGINT, or SIGQUIT to an ftsh, and it will clean up its children and exit.
</dir>
<p>
Ftsh starts every command as a separate UNIX process in its
own process session (i.e. <tt>setsid</tt>).
This simplifies the administration of
large process trees. To cancel a command, ftsh sends a SIGTERM
to every process in the group. Ftsh then waits up to
thirty seconds for the child to exit willingly. At the end of
that time, it forcibly terminates the entire process group
with a SIGKILL.
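<p>
The polite-then-forcible sequence can be sketched in plain Bourne shell. This is an
illustration and not ftsh's implementation: <tt>cancel_process</tt> is a hypothetical helper,
it signals a single process where ftsh signals a whole group, and it polls once per second.

```shell
#!/bin/sh
# cancel_process: hypothetical helper. Sends SIGTERM, polls for up to GRACE
# seconds for the process to exit, then sends SIGKILL.
cancel_process() {
	pid=$1
	grace=$2
	kill -TERM "$pid" 2>/dev/null || return 0	# already gone
	waited=0
	while [ "$waited" -lt "$grace" ]; do
		kill -0 "$pid" 2>/dev/null || return 0	# exited willingly
		sleep 1
		waited=$((waited + 1))
	done
	kill -KILL "$pid" 2>/dev/null || true		# grace period over: force it
	return 0
}

# Demonstration: start a long sleep, then cancel it with a 3 second grace period.
sleep 60 &
cancel_process $! 3
```

In this sketch the grace period is an argument; the text above describes a default of thirty seconds.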

<p>
Surprisingly, SIGKILL is not always effective.
Some operating systems have bugs in which signals
are occasionally lost or the process may be in such
a state that it cannot be killed at all.
By default, ftsh tries very hard to kill processes
by issuing SIGKILL repeatedly until the process actually
dies. This is called the "strong" kill mode.
If you do not wish to have this behavior -- perhaps
you have a bug resulting in unkillable processes --
then you may run ftsh in the "weak" kill mode, using
the "-k weak" option.
<p>
Ftsh may be safely nested. That is, an ftsh may invoke
another program written using ftsh. However, this child
needs to clean up faster than its parents. If the parent
shell issues forcible kills after waiting for 30 seconds,
then the child must issue forcible kills before that.
This problem is handled transparently for you.
Each ftsh informs its children of the current kill timeout
by setting the FTSH_KILL_TIMEOUT variable to five seconds
less than the current timeout. Thus, subshells are progressively
less tolerant of programs that refuse to exit cleanly.
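<p>
The arithmetic of the inherited timeout is simple enough to show directly. The sketch
below only models the subtraction described above: <tt>child_kill_timeout</tt> is a
hypothetical helper, and what ftsh does once the value reaches zero is not covered here.

```shell
#!/bin/sh
# child_kill_timeout: hypothetical helper returning the kill timeout that a
# shell would hand to its children: five seconds less than its own.
child_kill_timeout() {
	echo $(( $1 - 5 ))
}

# Timeouts seen by a top-level shell and two levels of nested shells.
timeout=30
for depth in 0 1 2; do
	echo "depth $depth: kill timeout $timeout seconds"
	timeout=$(child_kill_timeout "$timeout")
done
```

Starting from 30 seconds, the nested shells see 25 and then 20 seconds.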


<hr>


</body>


</html>
