\input texinfo @c -*-texinfo-*-
@c %**start of header (This is for running Texinfo on a region.)
@setfilename gawk.info
@settitle AWK Language Programming
@c %**end of header (This is for running Texinfo on a region.)

@ignore
@ifinfo
@format
START-INFO-DIR-ENTRY
* Gawk: (gawk.info). A Text Scanning and Processing Language.
END-INFO-DIR-ENTRY
@end format
@end ifinfo
@end ignore

@c @set xref-automatic-section-title
@c @set DRAFT

@c The following information should be updated here only!
@c This sets the edition of the document, the version of gawk it
@c applies to, and when the document was updated.
@set TITLE AWK Language Programming
@set EDITION 1.0
@set VERSION 3.0
@set UPDATE-MONTH January 1996
@iftex
@set DOCUMENT book
@end iftex
@ifinfo
@set DOCUMENT Info file
@end ifinfo

@ignore
Some comments on the layout for TeX.
1. Use the texinfo.tex from the gawk distribution. It contains fixes that
   are needed to get the footings for draft mode to not appear.
2. I have done A LOT of work to make this look good. There are `@page' commands
   and use of `@group ... @end group' in a number of places. If you muck
   with anything, it's your responsibility not to break the layout.
@end ignore

@c merge the function and variable indexes into the concept index
@ifinfo
@synindex fn cp
@synindex vr cp
@end ifinfo
@iftex
@syncodeindex fn cp
@syncodeindex vr cp
@end iftex

@c If "finalout" is commented out, the printed output will show
@c black boxes that mark lines that are too long. Thus, it is
@c unwise to comment it out when running a master in case there are
@c overfulls which are deemed okay.

@ifclear DRAFT
@iftex
@finalout
@end iftex
@end ifclear

@smallbook
@iftex
@cropmarks
@end iftex

@ifinfo
This file documents @code{awk}, a program that you can use to select
particular records in a file and perform operations upon them.

This is Edition @value{EDITION} of @cite{@value{TITLE}},
for the @value{VERSION} version of the GNU implementation of AWK.



Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

@ignore
Permission is granted to process this file through TeX and print the
results, provided the printed document carries copying permission
notice identical to this one except for the removal of this paragraph
(this paragraph not being relevant to the printed manual).

@end ignore
Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided that the entire
resulting derived work is distributed under the terms of a permission
notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions,
except that this permission notice may be stated in a translation approved
by the Foundation.
@end ifinfo

@setchapternewpage odd

@titlepage
@title @value{TITLE}
@subtitle A User's Guide for GNU AWK
@subtitle Edition @value{EDITION}
@subtitle @value{UPDATE-MONTH}
@author Arnold D. Robbins
@sp
@author Based on @cite{The GAWK Manual},
@author by Robbins, Close, Rubin, and Stallman

@c Include the Distribution inside the titlepage environment so
@c that headings are turned off. Headings on and off do not work.

@page
@vskip 0pt plus 1filll
@ifset LEGALJUNK
The programs and applications presented in this book have been
included for their instructional value. They have been tested with care,
but are not guaranteed for any particular purpose. The publisher does not
offer any warranties or representations, nor does it accept any
liabilities with respect to the programs or applications.
So there.
@sp 2
UNIX is a registered trademark of X/Open, Ltd. @*
Microsoft, MS, and MS-DOS are registered trademarks, and Windows is a
trademark of Microsoft Corporation in the United States and other
countries. @*
Atari, 520ST, 1040ST, TT, STE, Mega, and Falcon are registered trademarks
or trademarks of Atari Corporation. @*
DEC, Digital, OpenVMS, ULTRIX, and VMS, are trademarks of Digital Equipment
Corporation. @*
@end ifset
``To boldly go where no man has gone before'' is a
Registered Trademark of Paramount Pictures Corporation. @*
@c sorry, i couldn't resist
@sp 3
Copyright @copyright{} 1989, 1991 - 1996 Free Software Foundation, Inc.
@sp 2

This is Edition @value{EDITION} of @cite{@value{TITLE}}, @*
for the @value{VERSION} (or later) version of the GNU implementation of AWK.

@sp 2
Published by the Free Software Foundation @*
59 Temple Place --- Suite 330 @*
Boston, MA 02111-1307 USA @*
Phone: +1-617-542-5942 @*
Fax (including Japan): +1-617-542-2652 @*
Printed copies are available for $25 each. @*
@c this ISBN can change! Check with the FSF office...
@c This one is correct for gawk 3.0 and edition 1.0
ISBN 1-882114-26-4 @*

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided that the entire
resulting derived work is distributed under the terms of a permission
notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions,
except that this permission notice may be stated in a translation approved
by the Foundation.
@sp 2
Cover art by Etienne Suvasa.
@end titlepage

@c Thanks to Bob Chassell for directions on doing dedications.
@iftex
@headings off
@page
@w{ }
@sp 9
@center @i{To Miriam, for making me complete.}
@sp
@center @i{To Chana, for the joy you bring us.}
@sp
@center @i{To Rivka, for the exponential increase.}
@page
@w{ }
@page
@headings on
@end iftex

@iftex
@headings off
@evenheading @thispage@ @ @ @b{@thistitle} @| @|
@oddheading @| @| @b{@thischapter}@ @ @ @thispage
@ifset DRAFT
@evenfooting @today{} @| @emph{DRAFT!} @| Please Do Not Redistribute
@oddfooting Please Do Not Redistribute @| @emph{DRAFT!} @| @today{}
@end ifset
@end iftex

@ifinfo
@node Top, Preface, (dir), (dir)
@top General Introduction
@c Preface or Licensing nodes should come right after the Top
@c node, in `unnumbered' sections, then the chapter, `What is gawk'.

This file documents @code{awk}, a program that you can use to select
particular records in a file and perform operations upon them.

This is Edition @value{EDITION} of @cite{@value{TITLE}}, @*
for the @value{VERSION} version of the GNU implementation @*
of AWK.

@end ifinfo

@menu
* Preface::                     What this @value{DOCUMENT} is about; brief
                                history and acknowledgements.
* What Is Awk::                 What is the @code{awk} language; using this
                                @value{DOCUMENT}.
* Getting Started::             A basic introduction to using @code{awk}. How
                                to run an @code{awk} program. Command line
                                syntax.
* One-liners::                  Short, sample @code{awk} programs.
* Regexp::                      All about matching things using regular
                                expressions.
* Reading Files::               How to read files and manipulate fields.
* Printing::                    How to print using @code{awk}. Describes the
                                @code{print} and @code{printf} statements.
                                Also describes redirection of output.
* Expressions::                 Expressions are the basic building blocks of
                                statements.
* Patterns and Actions::        Overviews of patterns and actions.
* Statements::                  The various control statements are described
                                in detail.
* Built-in Variables::          Built-in Variables
* Arrays::                      The description and use of arrays. Also
                                includes array-oriented control statements.
* Built-in::                    The built-in functions are summarized here.
* User-defined::                User-defined functions are described in
                                detail.
* Invoking Gawk::               How to run @code{gawk}.
* Library Functions::           A Library of @code{awk} Functions.
* Sample Programs::             Many @code{awk} programs with complete
                                explanations.
* Language History::            The evolution of the @code{awk} language.
* Gawk Summary::                @code{gawk} Options and Language Summary.
* Installation::                Installing @code{gawk} under various operating
                                systems.
* Notes::                       Something about the implementation of
                                @code{gawk}.
* Glossary::                    An explanation of some unfamiliar terms.
* Copying::                     Your right to copy and distribute @code{gawk}.
* Index::                       Concept and Variable Index.

* History::                     The history of @code{gawk} and @code{awk}.
* Manual History::              Brief history of the GNU project and this
                                @value{DOCUMENT}.
* Acknowledgements::            Acknowledgements.
* This Manual::                 Using this @value{DOCUMENT}. Includes sample
                                input files that you can use.
* Conventions::                 Typographical Conventions.
* Sample Data Files::           Sample data files for use in the @code{awk}
                                programs illustrated in this @value{DOCUMENT}.
* Names::                       What name to use to find @code{awk}.
* Running gawk::                How to run @code{gawk} programs; includes
                                command line syntax.
* One-shot::                    Running a short throw-away @code{awk} program.
* Read Terminal::               Using no input files (input from terminal
                                instead).
* Long::                        Putting permanent @code{awk} programs in
                                files.
* Executable Scripts::          Making self-contained @code{awk} programs.
* Comments::                    Adding documentation to @code{gawk} programs.
* Very Simple::                 A very simple example.
* Two Rules::                   A less simple one-line example with two rules.
* More Complex::                A more complex example.
* Statements/Lines::            Subdividing or combining statements into
                                lines.
* Other Features::              Other Features of @code{awk}.
* When::                        When to use @code{gawk} and when to use other
                                things.
* Regexp Usage::                How to Use Regular Expressions.
* Escape Sequences::            How to write non-printing characters.
* Regexp Operators::            Regular Expression Operators.
* GNU Regexp Operators::        Operators specific to GNU software.
* Case-sensitivity::            How to do case-insensitive matching.
* Leftmost Longest::            How much text matches.
* Computed Regexps::            Using Dynamic Regexps.
* Records::                     Controlling how data is split into records.
* Fields::                      An introduction to fields.
* Non-Constant Fields::         Non-constant Field Numbers.
* Changing Fields::             Changing the Contents of a Field.
* Field Separators::            The field separator and how to change it.
* Basic Field Splitting::       How fields are split with single characters or
                                simple strings.
* Regexp Field Splitting::      Using regexps as the field separator.
* Single Character Fields::     Making each character a separate field.
* Command Line Field Separator::  Setting @code{FS} from the command line.
* Field Splitting Summary::     Some final points and a summary table.
* Constant Size::               Reading constant width data.
* Multiple Line::               Reading multi-line records.
* Getline::                     Reading files under explicit program control
                                using the @code{getline} function.
* Getline Intro::               Introduction to the @code{getline} function.
* Plain Getline::               Using @code{getline} with no arguments.
* Getline/Variable::            Using @code{getline} into a variable.
* Getline/File::                Using @code{getline} from a file.
* Getline/Variable/File::       Using @code{getline} into a variable from a
                                file.
* Getline/Pipe::                Using @code{getline} from a pipe.
* Getline/Variable/Pipe::       Using @code{getline} into a variable from a
                                pipe.
* Getline Summary::             Summary Of @code{getline} Variants.
* Print::                       The @code{print} statement.
* Print Examples::              Simple examples of @code{print} statements.
* Output Separators::           The output separators and how to change them.
* OFMT::                        Controlling Numeric Output With @code{print}.
* Printf::                      The @code{printf} statement.
* Basic Printf::                Syntax of the @code{printf} statement.
* Control Letters::             Format-control letters.
* Format Modifiers::            Format-specification modifiers.
* Printf Examples::             Several examples.
* Redirection::                 How to redirect output to multiple files and
                                pipes.
* Special Files::               File name interpretation in @code{gawk}.
                                @code{gawk} allows access to inherited file
                                descriptors.
* Close Files And Pipes::       Closing Input and Output Files and Pipes.
* Constants::                   String, numeric, and regexp constants.
* Scalar Constants::            Numeric and string constants.
* Regexp Constants::            Regular Expression constants.
* Using Constant Regexps::      When and how to use a regexp constant.
* Variables::                   Variables give names to values for later use.
* Using Variables::             Using variables in your programs.
* Assignment Options::          Setting variables on the command line and a
                                summary of command line syntax. This is an
                                advanced method of input.
* Conversion::                  The conversion of strings to numbers and vice
                                versa.
* Arithmetic Ops::              Arithmetic operations (@samp{+}, @samp{-},
                                etc.)
* Concatenation::               Concatenating strings.
* Assignment Ops::              Changing the value of a variable or a field.
* Increment Ops::               Incrementing the numeric value of a variable.
* Truth Values::                What is ``true'' and what is ``false''.
* Typing and Comparison::       How variables acquire types, and how this
                                affects comparison of numbers and strings with
                                @samp{<}, etc.
* Boolean Ops::                 Combining comparison expressions using boolean
                                operators @samp{||} (``or''), @samp{&&}
                                (``and'') and @samp{!} (``not'').
* Conditional Exp::             Conditional expressions select between two
                                subexpressions under control of a third
                                subexpression.
* Function Calls::              A function call is an expression.
* Precedence::                  How various operators nest.
* Pattern Overview::            What goes into a pattern.
* Kinds of Patterns::           A list of all kinds of patterns.
* Regexp Patterns::             Using regexps as patterns.
* Expression Patterns::         Any expression can be used as a pattern.
* Ranges::                      Pairs of patterns specify record ranges.
* BEGIN/END::                   Specifying initialization and cleanup rules.
* Using BEGIN/END::             How and why to use BEGIN/END rules.
* I/O And BEGIN/END::           I/O issues in BEGIN/END rules.
* Empty::                       The empty pattern, which matches every record.
* Action Overview::             What goes into an action.
* If Statement::                Conditionally execute some @code{awk}
                                statements.
* While Statement::             Loop until some condition is satisfied.
* Do Statement::                Do specified action while looping until some
                                condition is satisfied.
* For Statement::               Another looping statement, that provides
                                initialization and increment clauses.
* Break Statement::             Immediately exit the innermost enclosing loop.
* Continue Statement::          Skip to the end of the innermost enclosing
                                loop.
* Next Statement::              Stop processing the current input record.
* Nextfile Statement::          Stop processing the current file.
* Exit Statement::              Stop execution of @code{awk}.
* User-modified::               Built-in variables that you change to control
                                @code{awk}.
* Auto-set::                    Built-in variables where @code{awk} gives you
                                information.
* ARGC and ARGV::               Ways to use @code{ARGC} and @code{ARGV}.
* Array Intro::                 Introduction to Arrays
* Reference to Elements::       How to examine one element of an array.
* Assigning Elements::          How to change an element of an array.
* Array Example::               Basic Example of an Array
* Scanning an Array::           A variation of the @code{for} statement. It
                                loops through the indices of an array's
                                existing elements.
* Delete::                      The @code{delete} statement removes an element
                                from an array.
* Numeric Array Subscripts::    How to use numbers as subscripts in
                                @code{awk}.
* Uninitialized Subscripts::    Using Uninitialized variables as subscripts.
* Multi-dimensional::           Emulating multi-dimensional arrays in
                                @code{awk}.
* Multi-scanning::              Scanning multi-dimensional arrays.
* Calling Built-in::            How to call built-in functions.
* Numeric Functions::           Functions that work with numbers, including
                                @code{int}, @code{sin} and @code{rand}.
* String Functions::            Functions for string manipulation, such as
                                @code{split}, @code{match}, and
                                @code{sprintf}.
* I/O Functions::               Functions for files and shell commands.
* Time Functions::              Functions for dealing with time stamps.
* Definition Syntax::           How to write definitions and what they mean.
* Function Example::            An example function definition and what it
                                does.
* Function Caveats::            Things to watch out for.
* Return Statement::            Specifying the value a function returns.
* Options::                     Command line options and their meanings.
* Other Arguments::             Input file names and variable assignments.
* AWKPATH Variable::            Searching directories for @code{awk} programs.
* Obsolete::                    Obsolete Options and/or features.
* Undocumented::                Undocumented Options and Features.
* Known Bugs::                  Known Bugs in @code{gawk}.
* Portability Notes::           What to do if you don't have @code{gawk}.
* Nextfile Function::           Two implementations of a @code{nextfile}
                                function.
* Assert Function::             A function for assertions in @code{awk}
                                programs.
* Ordinal Functions::           Functions for using characters as numbers and
                                vice versa.
* Join Function::               A function to join an array into a string.
* Mktime Function::             A function to turn a date into a timestamp.
* Gettimeofday Function::       A function to get formatted times.
* Filetrans Function::          A function for handling data file transitions.
* Getopt Function::             A function for processing command line
                                arguments.
* Passwd Functions::            Functions for getting user information.
* Group Functions::             Functions for getting group information.
* Library Names::               How to best name private global variables in
                                library functions.
* Clones::                      Clones of common utilities.
* Cut Program::                 The @code{cut} utility.
* Egrep Program::               The @code{egrep} utility.
* Id Program::                  The @code{id} utility.
* Split Program::               The @code{split} utility.
* Tee Program::                 The @code{tee} utility.
* Uniq Program::                The @code{uniq} utility.
* Wc Program::                  The @code{wc} utility.
* Miscellaneous Programs::      Some interesting @code{awk} programs.
* Dupword Program::             Finding duplicated words in a document.
* Alarm Program::               An alarm clock.
* Translate Program::           A program similar to the @code{tr} utility.
* Labels Program::              Printing mailing labels.
* Word Sorting::                A program to produce a word usage count.
* History Sorting::             Eliminating duplicate entries from a history
                                file.
* Extract Program::             Pulling out programs from Texinfo source
                                files.
* Simple Sed::                  A Simple Stream Editor.
* Igawk Program::               A wrapper for @code{awk} that includes files.
* V7/SVR3.1::                   The major changes between V7 and System V
                                Release 3.1.
* SVR4::                        Minor changes between System V Releases 3.1
                                and 4.
* POSIX::                       New features from the POSIX standard.
* BTL::                         New features from the AT&T Bell Laboratories
                                version of @code{awk}.
* POSIX/GNU::                   The extensions in @code{gawk} not in POSIX
                                @code{awk}.
* Command Line Summary::        Recapitulation of the command line.
* Language Summary::            A terse review of the language.
* Variables/Fields::            Variables, fields, and arrays.
* Fields Summary::              Input field splitting.
* Built-in Summary::            @code{awk}'s built-in variables.
* Arrays Summary::              Using arrays.
* Data Type Summary::           Values in @code{awk} are numbers or strings.
* Rules Summary::               Patterns and Actions, and their component
                                parts.
* Pattern Summary::             Quick overview of patterns.
* Regexp Summary::              Quick overview of regular expressions.
* Actions Summary::             Quick overview of actions.
* Operator Summary::            @code{awk} operators.
* Control Flow Summary::        The control statements.
* I/O Summary::                 The I/O statements.
* Printf Summary::              A summary of @code{printf}.
* Special File Summary::        Special file names interpreted internally.
* Built-in Functions Summary::  Built-in numeric and string functions.
* Time Functions Summary::      Built-in time functions.
* String Constants Summary::    Escape sequences in strings.
* Functions Summary::           Defining and calling functions.
* Historical Features::         Some undocumented but supported ``features''.
* Gawk Distribution::           What is in the @code{gawk} distribution.
* Getting::                     How to get the distribution.
* Extracting::                  How to extract the distribution.
* Distribution contents::       What is in the distribution.
* Unix Installation::           Installing @code{gawk} under various versions
                                of Unix.
* Quick Installation::          Compiling @code{gawk} under Unix.
* Configuration Philosophy::    How it's all supposed to work.
* VMS Installation::            Installing @code{gawk} on VMS.
* VMS Compilation::             How to compile @code{gawk} under VMS.
* VMS Installation Details::    How to install @code{gawk} under VMS.
* VMS Running::                 How to run @code{gawk} under VMS.
* VMS POSIX::                   Alternate instructions for VMS POSIX.
* PC Installation::             Installing and Compiling @code{gawk} on MS-DOS
                                and OS/2
* Atari Installation::          Installing @code{gawk} on the Atari ST.
* Atari Compiling::             Compiling @code{gawk} on Atari
* Atari Using::                 Running @code{gawk} on Atari
* Amiga Installation::          Installing @code{gawk} on an Amiga.
* Bugs::                        Reporting Problems and Bugs.
* Other Versions::              Other freely available @code{awk}
                                implementations.
* Compatibility Mode::          How to disable certain @code{gawk} extensions.
* Additions::                   Making Additions To @code{gawk}.
* Adding Code::                 Adding code to the main body of @code{gawk}.
* New Ports::                   Porting @code{gawk} to a new operating system.
* Future Extensions::           New features that may be implemented one day.
* Improvements::                Suggestions for improvements by volunteers.

@end menu

@c dedication for Info file
@ifinfo
@center To Miriam, for making me complete.
@sp 1
@center To Chana, for the joy you bring us.
@sp 1
@center To Rivka, for the exponential increase.
@end ifinfo

@node Preface, What Is Awk, Top, Top
@unnumbered Preface

@c I saw a comment somewhere that the preface should describe the book itself,
@c and the introduction should describe what the book covers.

This @value{DOCUMENT} teaches you about the @code{awk} language and
how you can use it effectively. You should already be familiar with basic
system commands, such as @code{cat} and @code{ls},@footnote{These commands
are available on POSIX-compliant systems, as well as on traditional Unix-based
systems. If you are using some other operating system, you still need to
be familiar with the ideas of I/O redirection and pipes.} and basic shell
facilities, such as Input/Output (I/O) redirection and pipes.

Implementations of the @code{awk} language are available for many different
computing environments. This @value{DOCUMENT}, while describing the @code{awk} language
in general, also describes a particular implementation of @code{awk} called
@code{gawk} (which stands for ``GNU Awk''). @code{gawk} runs on a broad range
of Unix systems, from 80386 PC-based computers up through large-scale
systems such as Crays. @code{gawk} has also been ported to MS-DOS and
OS/2 PCs, Atari and Amiga micro-computers, and VMS.

@menu
* History::                     The history of @code{gawk} and @code{awk}.
* Manual History::              Brief history of the GNU project and this
                                @value{DOCUMENT}.
* Acknowledgements::            Acknowledgements.
@end menu

@node History, Manual History, Preface, Preface
@unnumberedsec History of @code{awk} and @code{gawk}

@cindex acronym
@cindex history of @code{awk}
@cindex Aho, Alfred
@cindex Weinberger, Peter
@cindex Kernighan, Brian
@cindex old @code{awk}
@cindex new @code{awk}
The name @code{awk} comes from the initials of its designers: Alfred V.@:
Aho, Peter J.@: Weinberger, and Brian W.@: Kernighan. The original version of
@code{awk} was written in 1977 at AT&T Bell Laboratories.
In 1985 a new version made the programming
language more powerful, introducing user-defined functions, multiple input
streams, and computed regular expressions.
This new version became generally available with Unix System V Release 3.1.
The version in System V Release 4 added some new features and also cleaned
up the behavior in some of the ``dark corners'' of the language.
The specification for @code{awk} in the POSIX Command Language
and Utilities standard further clarified the language based on feedback
from both the @code{gawk} designers and the original Bell Labs @code{awk}
designers.

The GNU implementation, @code{gawk}, was written in 1986 by Paul Rubin
and Jay Fenlason, with advice from Richard Stallman. John Woods
contributed parts of the code as well. In 1988 and 1989, David Trueman, with
help from Arnold Robbins, thoroughly reworked @code{gawk} for compatibility
with the newer @code{awk}. Current development focuses on bug fixes,
performance improvements, standards compliance, and, occasionally, new features.

@node Manual History, Acknowledgements, History, Preface
@unnumberedsec The GNU Project and This Book

@cindex Free Software Foundation
The Free Software Foundation (FSF) is a non-profit organization dedicated
to the production and distribution of freely distributable software.
It was founded by Richard M.@: Stallman, the author of the original
Emacs editor. GNU Emacs is the most widely used version of Emacs today.

@cindex GNU Project
The GNU project is an on-going effort on the part of the Free Software
Foundation to create a complete, freely distributable, POSIX compliant
computing environment. (GNU stands for ``GNU's not Unix''.)
The FSF uses the ``GNU General Public License'' (or GPL) to ensure that
source code for their software is always available to the end user. A
copy of the GPL is included for your reference
(@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}).
The GPL applies to the C language source code for @code{gawk}.

As of this writing (1995), the only major component of the
GNU environment still uncompleted is the operating system kernel, and
work proceeds apace on that. A shell, an editor (Emacs), highly portable
optimizing C, C++, and Objective-C compilers, a symbolic debugger, and dozens
of large and small utilities (such as @code{gawk}),
have all been completed and are freely available.

@cindex Linux
@cindex NetBSD
@cindex FreeBSD
Until the GNU operating system is released, the FSF recommends the use
of Linux, a freely distributable, Unix-like operating system for 80386
and other systems. There are many books on Linux. One freely available book
is @cite{Linux Installation and Getting Started}, by Matt Welsh.
Many Linux distributions are available, often in computer stores or
bundled on CD-ROM with books about Linux. Also, the FSF provides a Linux
distribution (``Debian''); contact them for more information.
@xref{Getting, ,Getting the @code{gawk} Distribution}, for the FSF's contact
information.
(There are two other freely available, Unix-like operating systems for
80386 and other systems, NetBSD and FreeBSD. Both are based on the
4.4-Lite Berkeley Software Distribution, and both use recent versions
of @code{gawk} for their versions of @code{awk}.)

@iftex
This @value{DOCUMENT} you are reading now is actually free. The
information in it is freely available to anyone, the machine readable
source code for the @value{DOCUMENT} comes with @code{gawk}, and anyone
may take this @value{DOCUMENT} to a copying machine and make as many
copies of it as they like. (Take a moment to check the copying
permissions on the Copyright page.)

If you paid money for this @value{DOCUMENT}, what you actually paid for
was the @value{DOCUMENT}'s nice printing and binding, and the
publisher's associated costs to produce it. We have made an effort to
keep these costs reasonable; most people would prefer a bound book to
over 300 pages of photo-copied text that would then have to be held in
a loose-leaf binder (not to mention the time and labor involved in
doing the copying). The same is true of producing this
@value{DOCUMENT} from the machine readable source; the retail price is
only slightly more than the cost per page of printing it
on a laser printer.
@end iftex

This @value{DOCUMENT} itself has gone through several previous,
preliminary editions. I started working on a preliminary draft of
@cite{The GAWK Manual}, by Diane Close, Paul Rubin, and Richard
Stallman in the fall of 1988.
It was around 90 pages long, and barely described the original, ``old''
version of @code{awk}. After substantial revision, the first version of
@cite{The GAWK Manual} to be released was Edition 0.11 Beta in
October of 1989. The manual then underwent more substantial revision
for Edition 0.13 of December 1991.
David Trueman, Pat Rankin, and Michal Jaegermann contributed sections
of the manual for Edition 0.13.
That edition was published by the
FSF as a bound book early in 1992. Since then there have been several
minor revisions, notably Edition 0.14 of November 1992 that was published
by the FSF in January of 1993, and Edition 0.16 of August 1993.

Edition 1.0 of @cite{@value{TITLE}} represents a significant re-working
of @cite{The GAWK Manual}, with much additional material.
The FSF and I agree that I am now the primary author.
I also felt that it needed a more descriptive title.

@cite{@value{TITLE}} will undoubtedly continue to evolve.
An electronic version
comes with the @code{gawk} distribution from the FSF.
If you find an error in this @value{DOCUMENT}, please report it!
@xref{Bugs, ,Reporting Problems and Bugs}, for information on submitting
problem reports electronically, or write to me in care of the FSF.

@node Acknowledgements, , Manual History, Preface
678

@unnumberedsec Acknowledgements

679

680

I would like to acknowledge Richard M.@: Stallman, for his vision of a

681

better world, and for his courage in founding the FSF and starting the

682

GNU project.

683

684

The initial draft of @cite{The GAWK Manual} had the following acknowledgements:

685

686

@quotation

687

Many people need to be thanked for their assistance in producing this

688

manual. Jay Fenlason contributed many ideas and sample programs. Richard

689

Mlynarik and Robert Chassell gave helpful comments on drafts of this

690

manual. The paper @cite{A Supplemental Document for @code{awk}} by John W.@:

691

Pierce of the Chemistry Department at UC San Diego, pinpointed several

692

issues relevant both to @code{awk} implementation and to this manual, that

693

would otherwise have escaped us.

694

@end quotation

695

696

The following people provided many helpful comments on Edition 0.13 of

697

@cite{The GAWK Manual}: Rick Adams, Michael Brennan, Rich Burridge, Diane Close,

698

Christopher (``Topher'') Eliot, Michael Lijewski, Pat Rankin, Miriam Robbins,

699

and Michal Jaegermann.

700

701

The following people provided many helpful comments for Edition 1.0 of

702

@cite{@value{TITLE}}: Karl Berry, Michael Brennan, Darrel

703

Hankerson, Michal Jaegermann, Michael Lijewski, and Miriam Robbins.

704

Pat Rankin, Michal Jaegermann, Darrel Hankerson and Scott Deifik

705

updated their respective sections for Edition 1.0.

706

707

Robert J.@: Chassell provided much valuable advice on

708

the use of Texinfo. He also deserves special thanks for

709

convincing me @emph{not} to title this @value{DOCUMENT}

710

@cite{How To Gawk Politely}.

711

Karl Berry helped significantly with the @TeX{} part of Texinfo.

712

713

@cindex Trueman, David

714

David Trueman deserves special credit; he has done a yeoman job

715

of evolving @code{gawk} so that it performs well, and without bugs.

716

Although he is no longer involved with @code{gawk},

717

working with him on this project was a significant pleasure.

718

719

@cindex Deifik, Scott

720

@cindex Hankerson, Darrel

721

@cindex Rommel, Kai Uwe

722

@cindex Rankin, Pat

723

@cindex Jaegermann, Michal

724

Scott Deifik, Darrel Hankerson, Kai Uwe Rommel, Pat Rankin, and Michal

725

Jaegermann (in no particular order) are long-time members of the

726

@code{gawk} ``crack portability team.'' Without their hard work and

727

help, @code{gawk} would not be nearly the fine program it is today. It

728

has been and continues to be a pleasure working with this team of fine

729

people.

730

731

@cindex Friedl, Jeffrey

732

Jeffrey Friedl provided invaluable help in tracking down a number

733

of last-minute problems with regular expressions in @code{gawk} 3.0.

734

735

@cindex Kernighan, Brian

736

David and I would like to thank Brian Kernighan of Bell Labs for

737

invaluable assistance during the testing and debugging of @code{gawk}, and for

738

help in clarifying numerous points about the language. We could not have

739

done nearly as good a job on either @code{gawk} or its documentation without

740

his help.

741

742

@cindex Hughes, Phil

743

I would like to thank Marshall and Elaine Hartholz of Seattle, and Dr.@:

744

Bert and Rita Schreiber of Detroit for large amounts of quiet vacation

745

time in their homes, which allowed me to make significant progress on

746

this @value{DOCUMENT} and on @code{gawk} itself. Phil Hughes of SSC

747

contributed in a very important way by loaning me his laptop Linux

748

system, not once, but twice, allowing me to do a lot of work while

749

away from home.

750

751

@cindex Robbins, Miriam

752

Finally, I must thank my wonderful wife, Miriam, for her patience through

753

the many versions of this project, for her proof-reading,

754

and for sharing me with the computer.

755

I would like to thank my parents for their love, and for the grace with

756

which they raised and educated me.

757

I also must acknowledge my gratitude to G-d, for the many opportunities

758

He has sent my way, as well as for the gifts He has given me with which to

759

take advantage of those opportunities.

760

@sp 2

761

@noindent

762

Arnold Robbins @*

763

Atlanta, Georgia @*

764

January, 1996

765

766

@ignore

767

Stuff still not covered anywhere:

768

BASICS:

769

Integer vs. floating point

770

Hex vs. octal vs. decimal

771

Interpreter vs compiler

772

input/output

773

@end ignore

774

775

@node What Is Awk, Getting Started, Preface, Top

776

@chapter Introduction

777

778

If you are like many computer users, you would frequently like to make

779

changes in various text files wherever certain patterns appear, or

780

extract data from parts of certain lines while discarding the rest. To

781

write a program to do this in a language such as C or Pascal is a

782

time-consuming inconvenience that may take many lines of code. The job

783

may be easier with @code{awk}.

784

785

The @code{awk} utility interprets a special-purpose programming language

786

that makes it possible to handle simple data-reformatting jobs

787

with just a few lines of code.

788

789

The GNU implementation of @code{awk} is called @code{gawk}; it is fully

790

upward compatible with the System V Release 4 version of

791

@code{awk}. @code{gawk} is also upward compatible with the POSIX

792

specification of the @code{awk} language. This means that all

793

properly written @code{awk} programs should work with @code{gawk}.

794

Thus, we usually don't distinguish between @code{gawk} and other @code{awk}

795

implementations.

796

797

@cindex uses of @code{awk}

798

Using @code{awk} you can:

799

800

@itemize @bullet

801

@item

802

manage small, personal databases

803

804

@item

805

generate reports

806

807

@item

808

validate data

809

810

@item

811

produce indexes, and perform other document preparation tasks

812

813

@item

814

even experiment with algorithms that can be adapted later to other computer

815

languages

816

@end itemize

817

818

@menu

819

* This Manual:: Using this @value{DOCUMENT}. Includes sample

820

input files that you can use.

821

* Conventions:: Typographical Conventions.

822

* Sample Data Files:: Sample data files for use in the @code{awk}

823

programs illustrated in this @value{DOCUMENT}.

824

@end menu

825

826

@node This Manual, Conventions, What Is Awk, What Is Awk

827

@section Using This Book

828

@cindex book, using this

829

@cindex using this book

830

@cindex language, @code{awk}

831

@cindex program, @code{awk}

832

@ignore

833

@cindex @code{awk} language

834

@cindex @code{awk} program

835

@end ignore

836

837

The term @code{awk} refers to a particular program, and to the language you

838

use to tell this program what to do. When we need to be careful, we call

839

the program ``the @code{awk} utility'' and the language ``the @code{awk}

840

language.'' The term @code{gawk} refers to a version of @code{awk} developed

841

as part of the GNU project. The purpose of this @value{DOCUMENT} is to explain

842

both the @code{awk} language and how to run the @code{awk} utility.

843

844

The main purpose of the @value{DOCUMENT} is to explain the features

845

of @code{awk}, as defined in the POSIX standard. It does so in the context

846

of one particular implementation, @code{gawk}. While doing so, it will also

847

attempt to describe important differences between @code{gawk} and other

848

@code{awk} implementations. Finally, any @code{gawk} features that

849

are not in the POSIX standard for @code{awk} will be noted.

850

851

@iftex

852

This @value{DOCUMENT} has the difficult task of being both tutorial and reference.

853

If you are a novice, feel free to skip over details that seem too complex.

854

You should also ignore the many cross references; they are for the

855

expert user, and for the on-line Info version of the document.

856

@end iftex

857

858

The term @dfn{@code{awk} program} refers to a program written by you in

859

the @code{awk} programming language.

860

861

@xref{Getting Started, ,Getting Started with @code{awk}}, for the bare

862

essentials you need to know to start using @code{awk}.

863

864

Some useful ``one-liners'' are included to give you a feel for the

865

@code{awk} language (@pxref{One-liners, ,Useful One Line Programs}).

866

867

Many sample @code{awk} programs have been provided for you

868

(@pxref{Library Functions, ,A Library of @code{awk} Functions}; also

869

@pxref{Sample Programs, ,Practical @code{awk} Programs}).

870

871

The entire @code{awk} language is summarized for quick reference in

872

@ref{Gawk Summary, ,@code{gawk} Summary}. Look there if you just need

873

to refresh your memory about a particular feature.

874

875

If you find terms that you aren't familiar with, try looking them

876

up in the glossary (@pxref{Glossary}).

877

878

Most of the time, complete @code{awk} programs are used as examples, but in

879

some of the more advanced sections, only the part of the @code{awk} program

880

that illustrates the concept being described is shown.

881

882

While this @value{DOCUMENT} is aimed principally at people who have not been

883

exposed

884

to @code{awk}, there is a lot of information here that even the @code{awk}

885

expert should find useful. In particular, the description of POSIX

886

@code{awk}, and the example programs in

887

@ref{Library Functions, ,A Library of @code{awk} Functions}, and

888

@ref{Sample Programs, ,Practical @code{awk} Programs},

889

should be of interest.

890

891

@c fakenode --- for prepinfo

892

@unnumberedsubsec Dark Corners

893

894

@cindex d.c., see ``dark corner''

895

@cindex dark corner

896

Until the POSIX standard (and @cite{The GAWK Manual}),

897

many features of @code{awk} were either poorly documented, or not

898

documented at all. Descriptions of such features

899

(often called ``dark corners'') are noted in this @value{DOCUMENT} with

900

``(d.c.)''.

901

They also appear in the index under the heading ``dark corner.''

902

903

@node Conventions, Sample Data Files, This Manual, What Is Awk

904

@section Typographical Conventions

905

906

This @value{DOCUMENT} is written using Texinfo, the GNU documentation formatting language.

907

A single Texinfo source file is used to produce both the printed and on-line

908

versions of the documentation.

909

@iftex

910

Because of this, the typographical conventions
are slightly different from those in other books you may have read.

912

@end iftex

913

@ifinfo

914

This section briefly documents the typographical conventions used in Texinfo.

915

@end ifinfo

916

917

Examples you would type at the command line are preceded by the common

918

shell primary and secondary prompts, @samp{$} and @samp{>}.

919

Output from the command is preceded by the glyph ``@print{}''.

920

This typically represents the command's standard output.

921

Error messages, and other output on the command's standard error, are preceded

922

by the glyph ``@error{}''. For example:

923

924

@example
$ echo hi on stdout
@print{} hi on stdout
$ echo hello on stderr 1>&2
@error{} hello on stderr
@end example

930

931

@iftex

932

In the text, command names appear in @code{this font}, while code segments

933

appear in the same font and quoted, @samp{like this}. Some things will

934

be emphasized @emph{like this}, and if a point needs to be made

935

strongly, it will be done @strong{like this}. The first occurrence of

936

a new term is usually its @dfn{definition}, and appears in the same

937

font as the previous occurrence of ``definition'' in this sentence.

938

File names are indicated like this: @file{/path/to/ourfile}.

939

@end iftex

940

941

Characters that you type at the keyboard look @kbd{like this}. In particular,

942

there are special characters called ``control characters.'' These are

943

characters that you type by holding down both the @kbd{CONTROL} key and

944

another key, at the same time. For example, a @kbd{Control-d} is typed

945

by first pressing and holding the @kbd{CONTROL} key, next

946

pressing the @kbd{d} key, and finally releasing both keys.

947

948

@node Sample Data Files, , Conventions, What Is Awk

949

@section Data Files for the Examples

950

951

@cindex input file, sample

952

@cindex sample input file

953

@cindex @file{BBS-list} file

954

Many of the examples in this @value{DOCUMENT} take their input from two sample

955

data files. The first, called @file{BBS-list}, represents a list of

956

computer bulletin board systems together with information about those systems.

957

The second data file, called @file{inventory-shipped}, contains

958

information about shipments on a monthly basis. In both files,

959

each line is considered to be one @dfn{record}.

960

961

In the file @file{BBS-list}, each record contains the name of a computer

962

bulletin board, its phone number, the board's baud rate(s), and a code for

963

the number of hours it is operational. An @samp{A} in the last column

964

means the board operates 24 hours a day. A @samp{B} in the last

965

column means the board operates only during evening and weekend hours. A

966

@samp{C} means the board operates only on weekends.

967

968

@c 2e: Update the baud rates to reflect today's faster modems

969

@example
@c system mkdir eg
@c system mkdir eg/lib
@c system mkdir eg/data
@c system mkdir eg/prog
@c system mkdir eg/misc
@c file eg/data/BBS-list
aardvark     555-5553     1200/300          B
alpo-net     555-3412     2400/1200/300     A
barfly       555-7685     1200/300          A
bites        555-1675     2400/1200/300     A
camelot      555-0542     300               C
core         555-2912     1200/300          C
fooey        555-1234     2400/1200/300     B
foot         555-6699     1200/300          B
macfoo       555-6480     1200/300          A
sdace        555-3430     2400/1200/300     A
sabafoo      555-2127     1200/300          C
@c endfile
@end example
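
As a hedged illustration of how these records can be used, the following
sketch recreates a few lines of @file{BBS-list} in a scratch directory and
prints the name of each board whose last column is @samp{A}. The scratch
directory and the variable names @code{tmpdir} and @code{always_up} are
assumptions made only for this illustration. The sketch also uses field
references (@code{$1}, @code{$NF}), which are explained later in this
@value{DOCUMENT}.

```shell
# Recreate part of BBS-list in a scratch directory (hypothetical location).
tmpdir=$(mktemp -d)
cat > "$tmpdir/BBS-list" <<'EOF'
aardvark     555-5553     1200/300          B
alpo-net     555-3412     2400/1200/300     A
barfly       555-7685     1200/300          A
camelot      555-0542     300               C
EOF

# $NF is the last field of the record (the hours code); $1 is the board name.
always_up=$(awk '$NF == "A" { print $1 }' "$tmpdir/BBS-list")
echo "$always_up"

rm -rf "$tmpdir"
```

Because @code{awk} splits fields on runs of blanks by default, the column
alignment in the data file does not affect the result.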

989

990

@cindex @file{inventory-shipped} file

991

The second data file, called @file{inventory-shipped}, represents

992

information about shipments during the year.

993

Each record contains the month of the year, the number

994

of green crates shipped, the number of red boxes shipped, the number of

995

orange bags shipped, and the number of blue packages shipped,

996

respectively. There are 16 entries, covering the 12 months of one year

997

and four months of the next year.

998

999

@example
@c file eg/data/inventory-shipped
Jan  13  25  15 115
Feb  15  32  24 226
Mar  15  24  34 228
Apr  31  52  63 420
May  16  34  29 208
Jun  31  42  75 492
Jul  24  34  67 436
Aug  15  34  47 316
Sep  13  55  37 277
Oct  29  54  68 525
Nov  20  87  82 577
Dec  17  35  61 401

Jan  21  36  64 620
Feb  26  58  80 652
Mar  24  75  70 495
Apr  21  70  74 514
@c endfile
@end example
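
The blank line that separates the two years is itself a record, although
an empty one. The following sketch, which is only an illustration, uses a
shortened copy of @file{inventory-shipped} to count the non-empty records
and to total the green crates in the second column. The names
@code{tmpdir}, @code{summary}, @code{n}, and @code{green} are made up for
this example, and the @code{NF} variable and @code{END} rule it relies on
are described later.

```shell
# A shortened copy of inventory-shipped (three months, a blank line, one month).
tmpdir=$(mktemp -d)
cat > "$tmpdir/inventory-shipped" <<'EOF'
Jan  13  25  15 115
Feb  15  32  24 226
Mar  15  24  34 228

Jan  21  36  64 620
EOF

# NF > 0 skips the empty record; $2 is the number of green crates.
summary=$(awk 'NF > 0 { n++; green += $2 }
               END    { print n, green }' "$tmpdir/inventory-shipped")
echo "$summary"

rm -rf "$tmpdir"
```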

1020

1021

@ifinfo

1022

If you are reading this in GNU Emacs using Info, you can copy the regions

1023

of text showing these sample files into your own test files. This way you

1024

can try out the examples shown in the remainder of this document. You do

1025

this by using the command @kbd{M-x write-region} to copy text from the Info

1026

file into a file for use with @code{awk}

1027

(@pxref{Misc File Ops, , Miscellaneous File Operations, emacs, GNU Emacs Manual},

1028

for more information). Using this information, create your own

1029

@file{BBS-list} and @file{inventory-shipped} files, and practice what you

1030

learn in this @value{DOCUMENT}.

1031

1032

If you are using the stand-alone version of Info,

1033

see @ref{Extract Program, ,Extracting Programs from Texinfo Source Files},

1034

for an @code{awk} program that will extract these data files from

1035

@file{gawk.texi}, the Texinfo source file for this Info file.

1036

@end ifinfo

1037

1038

@node Getting Started, One-liners, What Is Awk, Top

1039

@chapter Getting Started with @code{awk}

1040

@cindex script, definition of

1041

@cindex rule, definition of

1042

@cindex program, definition of

1043

@cindex basic function of @code{awk}

1044

1045

The basic function of @code{awk} is to search files for lines (or other

1046

units of text) that contain certain patterns. When a line matches one

1047

of the patterns, @code{awk} performs specified actions on that line.

1048

@code{awk} keeps processing input lines in this way until the end of the

1049

input files is reached.

1050

1051

@cindex data-driven languages

1052

@cindex procedural languages

1053

@cindex language, data-driven

1054

@cindex language, procedural

1055

Programs in @code{awk} are different from programs in most other languages,

1056

because @code{awk} programs are @dfn{data-driven}; that is, you describe

1057

the data you wish to work with, and then what to do when you find it.

1058

Most other languages are @dfn{procedural}; you have to describe, in great

1059

detail, every step the program is to take. When working with procedural

1060

languages, it is usually much

1061

harder to clearly describe the data your program will process.

1062

For this reason, @code{awk} programs are often refreshingly easy to both

1063

write and read.

1064

1065

@cindex program, definition of

1066

@cindex rule, definition of

1067

When you run @code{awk}, you specify an @code{awk} @dfn{program} that

1068

tells @code{awk} what to do. The program consists of a series of

1069

@dfn{rules}. (It may also contain @dfn{function definitions},

1070

an advanced feature which we will ignore for now.

1071

@xref{User-defined, ,User-defined Functions}.) Each rule specifies one

1072

pattern to search for, and one action to perform when that pattern is found.

1073

1074

Syntactically, a rule consists of a pattern followed by an action. The

1075

action is enclosed in curly braces to separate it from the pattern.

1076

Rules are usually separated by newlines. Therefore, an @code{awk}

1077

program looks like this:

1078

1079

@example
@var{pattern} @{ @var{action} @}
@var{pattern} @{ @var{action} @}
@dots{}
@end example
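
To make the general form above concrete, here is a small sketch with two
rules. The input lines and the messages are invented for the
illustration. Each input line is tested against every rule in order, so a
line that matches both patterns triggers both actions.

```shell
# Two rules; every input line is tested against each pattern in turn.
out=$(printf 'apple 3\norange 5\nkiwi 9\n' | awk '
/an/ { print "has an:", $1 }
/e/  { print "has e:", $1 }
')
echo "$out"
```

Here @samp{orange 5} matches both patterns and is reported twice, while
@samp{kiwi 9} matches neither and produces no output.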

1084

1085

@menu

1086

* Names:: What name to use to find @code{awk}.

1087

* Running gawk:: How to run @code{gawk} programs; includes

1088

command line syntax.

1089

* Very Simple:: A very simple example.

1090

* Two Rules:: A less simple one-line example with two rules.

1091

* More Complex:: A more complex example.

1092

* Statements/Lines:: Subdividing or combining statements into

1093

lines.

1094

* Other Features:: Other Features of @code{awk}.

1095

* When:: When to use @code{gawk} and when to use other

1096

things.

1097

@end menu

1098

1099

@node Names, Running gawk , Getting Started, Getting Started

1100

@section A Rose By Any Other Name

1101

1102

@cindex old @code{awk} vs. new @code{awk}

1103

@cindex new @code{awk} vs. old @code{awk}

1104

The @code{awk} language has evolved over the years. Full details are

1105

provided in @ref{Language History, ,The Evolution of the @code{awk} Language}.

1106

The language described in this @value{DOCUMENT}

1107

is often referred to as ``new @code{awk}.''

1108

1109

Because of this, many systems have multiple

1110

versions of @code{awk}.

1111

Some systems have an @code{awk} utility that implements the

1112

original version of the @code{awk} language, and a @code{nawk} utility

1113

for the new version. Others have an @code{oawk} for the ``old @code{awk}''

1114

language, and plain @code{awk} for the new one. Still others only

1115

have one version, usually the new one.@footnote{Often, these systems

1116

use @code{gawk} for their @code{awk} implementation!}

1117

1118

All in all, this makes it difficult for you to know which version of

1119

@code{awk} you should run when writing your programs. The best advice

1120

we can give here is to check your local documentation. Look for @code{awk},

1121

@code{oawk}, and @code{nawk}, as well as for @code{gawk}. Chances are, you

1122

will have some version of new @code{awk} on your system, and that is what

1123

you should use when running your programs. (Of course, if you're reading

1124

this @value{DOCUMENT}, chances are good that you have @code{gawk}!)

1125

1126

Throughout this @value{DOCUMENT}, whenever we refer to a language feature

1127

that should be available in any complete implementation of POSIX @code{awk},

1128

we simply use the term @code{awk}. When referring to a feature that is

1129

specific to the GNU implementation, we use the term @code{gawk}.

1130

1131

@node Running gawk, Very Simple, Names, Getting Started

1132

@section How to Run @code{awk} Programs

1133

1134

@cindex command line formats

1135

@cindex running @code{awk} programs

1136

There are several ways to run an @code{awk} program. If the program is

1137

short, it is easiest to include it in the command that runs @code{awk},

1138

like this:

1139

1140

@example

1141

awk '@var{program}' @var{input-file1} @var{input-file2} @dots{}

1142

@end example

1143

1144

@noindent

1145

where @var{program} consists of a series of patterns and actions, as

1146

described earlier.

1147

(The reason for the single quotes is described below, in

1148

@ref{One-shot, ,One-shot Throw-away @code{awk} Programs}.)

1149

1150

When the program is long, it is usually more convenient to put it in a file

1151

and run it with a command like this:

1152

1153

@example

1154

awk -f @var{program-file} @var{input-file1} @var{input-file2} @dots{}

1155

@end example

1156

1157

@menu

1158

* One-shot:: Running a short throw-away @code{awk} program.

1159

* Read Terminal:: Using no input files (input from terminal

1160

instead).

1161

* Long:: Putting permanent @code{awk} programs in

1162

files.

1163

* Executable Scripts:: Making self-contained @code{awk} programs.

1164

* Comments:: Adding documentation to @code{gawk} programs.

1165

@end menu

1166

1167

@node One-shot, Read Terminal, Running gawk, Running gawk

1168

@subsection One-shot Throw-away @code{awk} Programs

1169

1170

Once you are familiar with @code{awk}, you will often type in simple

1171

programs the moment you want to use them. Then you can write the

1172

program as the first argument of the @code{awk} command, like this:

1173

1174

@example

1175

awk '@var{program}' @var{input-file1} @var{input-file2} @dots{}

1176

@end example

1177

1178

@noindent

1179

where @var{program} consists of a series of @var{patterns} and

1180

@var{actions}, as described earlier.

1181

1182

@cindex single quotes, why needed

1183

This command format instructs the @dfn{shell}, or command interpreter,

1184

to start @code{awk} and use the @var{program} to process records in the

1185

input file(s). There are single quotes around @var{program} so that

1186

the shell doesn't interpret any @code{awk} characters as special shell

1187

characters. They also cause the shell to treat all of @var{program} as

1188

a single argument for @code{awk} and allow @var{program} to be more

1189

than one line long.
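
A brief sketch of both points, assuming a POSIX-style shell: the
single-quoted program below spans several lines, and the shell passes
@code{$2} through to @code{awk} unchanged instead of trying to expand it
itself. The variable names @code{line} and @code{out} are chosen only for
this example.

```shell
line='alpha beta'

# Inside single quotes the shell leaves $2 and the newlines alone,
# so awk receives the whole program as one argument.
out=$(echo "$line" | awk '
{
    print $2
}')
echo "$out"
```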

1190

1191

This format is also useful for running short or medium-sized @code{awk}

1192

programs from shell scripts, because it avoids the need for a separate

1193

file for the @code{awk} program. A self-contained shell script is more

1194

reliable since there are no other files to misplace.

1195

1196

@ref{One-liners, , Useful One Line Programs}, presents several short,

1197

self-contained programs.

1198

1199

@iftex

1200

@page

1201

@end iftex

1202

As an interesting side point, the command

1203

1204

@example

1205

awk '/foo/' @var{files} @dots{}

1206

@end example

1207

1208

@noindent

1209

is essentially the same as

1210

1211

@cindex @code{egrep}

1212

@example

1213

egrep foo @var{files} @dots{}

1214

@end example
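
The equivalence can be checked with a short sketch such as the following.
It is only an illustration: the sample file and the variable names are
made up, and it uses @samp{grep -E}, the spelling of @code{egrep} that
current POSIX systems provide.

```shell
tmpdir=$(mktemp -d)
printf 'fooey 555-1234\nbarfly 555-7685\nmacfoo 555-6480\n' > "$tmpdir/sample"

# A pattern with no action prints every matching line, just as grep does.
awk_out=$(awk '/foo/' "$tmpdir/sample")
grep_out=$(grep -E foo "$tmpdir/sample")

[ "$awk_out" = "$grep_out" ] && echo "same output"

rm -rf "$tmpdir"
```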

1215

1216

@node Read Terminal, Long, One-shot, Running gawk

1217

@subsection Running @code{awk} without Input Files

1218

1219

@cindex standard input

1220

@cindex input, standard

1221

You can also run @code{awk} without any input files. If you type the

1222

command line:

1223

1224

@example

1225

awk '@var{program}'

1226

@end example

1227

1228

@noindent

1229

then @code{awk} applies the @var{program} to the @dfn{standard input},

1230

which usually means whatever you type on the terminal. This continues

1231

until you indicate end-of-file by typing @kbd{Control-d}.

1232

(On other operating systems, the end-of-file character may be different.

1233

For example, on OS/2 and MS-DOS, it is @kbd{Control-z}.)

1234

1235

For example, the following program prints a friendly piece of advice

1236

(from Douglas Adams' @cite{The Hitchhiker's Guide to the Galaxy}),

1237

to keep you from worrying about the complexities of computer programming

1238

(@samp{BEGIN} is a feature we haven't discussed yet).

1239

1240

@example
$ awk "BEGIN @{ print \"Don't Panic!\" @}"
@print{} Don't Panic!
@end example

1244

1245

@cindex quoting, shell

1246

@cindex shell quoting

1247

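This program does not read any input. The @samp{\} before each of the
inner double quotes is necessary because of the shell's quoting rules:
the program text contains a single quote (the apostrophe in @samp{Don't}),
so it is enclosed in double quotes rather than single quotes, and the
double quotes that belong to the @code{awk} string must then be escaped.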
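
If you would rather keep the program in single quotes, two other
spellings should produce the same message in a POSIX-style shell. This is
a sketch, not the only way to do it. The first closes the single-quoted
text, adds an escaped apostrophe, and reopens the quotes. The second
writes the apostrophe as the octal escape @samp{\47} inside the
@code{awk} string. The variable names are illustrative.

```shell
# Variant 1: '\'' ends the quoted text, adds a literal ', and reopens it.
out1=$(awk 'BEGIN { print "Don'\''t Panic!" }')

# Variant 2: \47 is the octal escape for the apostrophe in an awk string.
out2=$(awk 'BEGIN { print "Don\47t Panic!" }')

echo "$out1"
echo "$out2"
```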

1250

1251

This next simple @code{awk} program

1252

emulates the @code{cat} utility; it copies whatever you type at the

1253

keyboard to its standard output. (Why this works is explained shortly.)

1254

1255

@example
$ awk '@{ print @}'
Now is the time for all good men
@print{} Now is the time for all good men
to come to the aid of their country.
@print{} to come to the aid of their country.
Four score and seven years ago, ...
@print{} Four score and seven years ago, ...
What, me worry?
@print{} What, me worry?
@kbd{Control-d}
@end example

1267

1268

@node Long, Executable Scripts, Read Terminal, Running gawk

1269

@subsection Running Long Programs

1270

1271

@cindex running long programs

1272

@cindex @code{-f} option

1273

@cindex program file

1274

@cindex file, @code{awk} program

1275

Sometimes your @code{awk} programs can be very long. In this case it is

1276

more convenient to put the program into a separate file. To tell

1277

@code{awk} to use that file for its program, you type:

1278

1279

@example

1280

awk -f @var{source-file} @var{input-file1} @var{input-file2} @dots{}

1281

@end example

1282

1283

The @samp{-f} instructs the @code{awk} utility to get the @code{awk} program

1284

from the file @var{source-file}. Any file name can be used for

1285

@var{source-file}. For example, you could put the program:

1286

1287

@example

1288

BEGIN @{ print "Don't Panic!" @}

1289

@end example

1290

1291

@noindent

1292

into the file @file{advice}. Then this command:

1293

1294

@example

1295

awk -f advice

1296

@end example

1297

1298

@noindent

1299

does the same thing as this one:

1300

1301

@example

1302

awk "BEGIN @{ print \"Don't Panic!\" @}"

1303

@end example

1304

1305

@cindex quoting, shell

1306

@cindex shell quoting

1307

@noindent

1308

which was explained earlier (@pxref{Read Terminal, ,Running @code{awk} without Input Files}).

1309

Note that you don't usually need single quotes around the file name that you

1310

specify with @samp{-f}, because most file names don't contain any of the shell's

1311

special characters. Notice that in @file{advice}, the @code{awk}

1312

program did not have single quotes around it. The quotes are only needed

1313

for programs that are provided on the @code{awk} command line.

1314

1315

If you want to identify your @code{awk} program files clearly as such,

1316

you can add the extension @file{.awk} to the file name. This doesn't

1317

affect the execution of the @code{awk} program, but it does make

1318

``housekeeping'' easier.
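
Putting these pieces together, the following sketch stores the program in
a file and runs it with @samp{-f}. The scratch directory and the file
name @file{advice.awk} are assumptions made for the illustration. Note
that the program text inside the file needs no shell quoting at all.

```shell
tmpdir=$(mktemp -d)

# The quoted 'EOF' keeps the shell from touching the program text.
cat > "$tmpdir/advice.awk" <<'EOF'
BEGIN { print "Don't Panic!" }
EOF

out=$(awk -f "$tmpdir/advice.awk")
echo "$out"

rm -rf "$tmpdir"
```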

1319

1320

@node Executable Scripts, Comments, Long, Running gawk

1321

@subsection Executable @code{awk} Programs

1322

@cindex executable scripts

1323

@cindex scripts, executable

1324

@cindex self contained programs

1325

@cindex program, self contained

1326

@cindex @code{#!} (executable scripts)

1327

1328

Once you have learned @code{awk}, you may want to write self-contained

1329

@code{awk} scripts, using the @samp{#!} script mechanism. You can do

1330

this on many Unix systems@footnote{The @samp{#!} mechanism works on

1331

Linux systems,

1332

Unix systems derived from Berkeley Unix, System V Release 4, and some System

1333

V Release 3 systems.} (and someday on the GNU system).

1334

1335

For example, you could update the file @file{advice} to look like this:

1336

1337

@example
#! /bin/awk -f

BEGIN @{ print "Don't Panic!" @}
@end example

1342

1343

@noindent

1344

After making this file executable (with the @code{chmod} utility), you

1345

can simply type @samp{advice}

1346

at the shell, and the system will arrange to run @code{awk}@footnote{The

1347

line beginning with @samp{#!} lists the full file name of an interpreter

1348

to be run, and an optional initial command line argument to pass to that

1349

interpreter. The operating system then runs the interpreter with the given

1350

argument and the full argument list of the executed program. The first argument

1351

in the list is the full file name of the @code{awk} program. The rest of the

1352

argument list will either be options to @code{awk}, or data files,

1353

or both.} as if you had typed @samp{awk -f advice}.

1354

1355

@example
$ advice
@print{} Don't Panic!
@end example

1359

1360

@noindent

1361

Self-contained @code{awk} scripts are useful when you want to write a

1362

program which users can invoke without their having to know that the program is

1363

written in @code{awk}.
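
The whole sequence can be sketched as follows. The location of
@code{awk} varies from system to system, so this illustration looks it up
with @samp{command -v} rather than assuming @file{/bin/awk}. The scratch
directory and variable names are likewise assumptions. It also relies on
the system supporting the @samp{#!} mechanism.

```shell
tmpdir=$(mktemp -d)
awk_path=$(command -v awk)      # for example /usr/bin/awk; varies by system

# Write the script: a #! line naming the interpreter, then the program.
{
    printf '#!%s -f\n' "$awk_path"
    printf '%s\n' 'BEGIN { print "Don'\''t Panic!" }'
} > "$tmpdir/advice"

chmod +x "$tmpdir/advice"       # make the file executable
out=$("$tmpdir/advice")         # run it like any other command
echo "$out"

rm -rf "$tmpdir"
```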

1364

1365

@cindex shell scripts

1366

@cindex scripts, shell

1367

Some older systems do not support the @samp{#!} mechanism. You can get a

1368

similar effect using a regular shell script. It would look something

1369

like this:

1370

1371

@example
: The colon ensures execution by the standard shell.
awk '@var{program}' "$@@"
@end example

1375

1376

Using this technique, it is @emph{vital} to enclose the @var{program} in

1377

single quotes to protect it from interpretation by the shell. If you

1378

omit the quotes, only a shell wizard can predict the results.

1379

1380

The @code{"$@@"} causes the shell to forward all the command line

1381

arguments to the @code{awk} program, without interpretation. The first

1382

line, which starts with a colon, is used so that this shell script will

1383

work even if invoked by a user who uses the C shell. (Not all older systems

1384

obey this convention, but many do.)
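
Here is a sketch of such a wrapper in use. The script name
@file{first-field}, the data file, and the one-line program are invented
for the illustration. The data file name deliberately contains a space,
to show that @code{"$@@"} hands each argument to @code{awk} intact.

```shell
tmpdir=$(mktemp -d)

# The wrapper script; the quoted 'EOF' keeps $1 and "$@" literal.
cat > "$tmpdir/first-field" <<'EOF'
: The colon ensures execution by the standard shell.
awk '{ print $1 }' "$@"
EOF

printf 'fooey 555-1234\nfoot 555-6699\n' > "$tmpdir/data file"

# Run the wrapper with sh, passing a file name that contains a space.
out=$(sh "$tmpdir/first-field" "$tmpdir/data file")
echo "$out"

rm -rf "$tmpdir"
```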

1385

@c 2e:

1386

@c Someday: (See @cite{The Bourne Again Shell}, by ??.)

1387

1388

@node Comments, , Executable Scripts, Running gawk

1389

@subsection Comments in @code{awk} Programs

1390

@cindex @code{#} (comment)

1391

@cindex comments

1392

@cindex use of comments

1393

@cindex documenting @code{awk} programs

1394

@cindex programs, documenting

1395

1396

A @dfn{comment} is some text that is included in a program for the sake

1397

of human readers; it is not really part of the program. Comments

1398

can explain what the program does, and how it works. Nearly all

1399

programming languages have provisions for comments, because programs are

1400

typically hard to understand without their extra help.

1401

1402

In the @code{awk} language, a comment starts with the sharp sign

1403

character, @samp{#}, and continues to the end of the line.

1404

The @samp{#} does not have to be the first character on the line. The

1405

@code{awk} language ignores the rest of a line following a sharp sign.

1406

For example, we could have put the following into @file{advice}:

1407

1408

@example
# This program prints a nice friendly message.  It helps
# keep novice users from being afraid of the computer.
BEGIN    @{ print "Don't Panic!" @}
@end example

1413

1414

You can put comment lines into keyboard-composed throw-away @code{awk}

1415

programs also, but this usually isn't very useful; the purpose of a

1416

comment is to help you or another person understand the program at

1417

a later time.
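
Because a comment may begin anywhere on a line, you can also annotate
individual statements.  The following variation of @file{advice} is only an
illustration of comment placement:

@example
BEGIN @{
    print "Don't Panic!"    # this runs once, before any input is read
@}
@end example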

1418

1419

@node Very Simple, Two Rules, Running gawk, Getting Started

1420

@section A Very Simple Example

1421

1422

The following command runs a simple @code{awk} program that searches the

1423

input file @file{BBS-list} for the string of characters: @samp{foo}. (A

1424

string of characters is usually called a @dfn{string}.

1425

The term @dfn{string} is perhaps based on similar usage in English, such

1426

as ``a string of pearls,'' or, ``a string of cars in a train.'')

1427

1428

@example
awk '/foo/ @{ print $0 @}' BBS-list
@end example

1431

1432

@noindent

1433

When lines containing @samp{foo} are found, they are printed, because

1434

@w{@samp{print $0}} means print the current line. (Just @samp{print} by

1435

itself means the same thing, so we could have written that

1436

instead.)

1437

1438

You will notice that slashes, @samp{/}, surround the string @samp{foo}

1439

in the @code{awk} program. The slashes indicate that @samp{foo}

1440

is a pattern to search for. This type of pattern is called a

1441

@dfn{regular expression}, and is covered in more detail later

1442

(@pxref{Regexp, ,Regular Expressions}).

1443

The pattern is allowed to match parts of words.

1444

There are

1445

single-quotes around the @code{awk} program so that the shell won't

1446

interpret any of it as special shell characters.

1447

1448

Here is what this program prints:

1449

1450

@example
@group
$ awk '/foo/ @{ print $0 @}' BBS-list
@print{} fooey        555-1234     2400/1200/300     B
@print{} foot         555-6699     1200/300          B
@print{} macfoo       555-6480     1200/300          A
@print{} sabafoo      555-2127     1200/300          C
@end group
@end example

1459

1460

@cindex action, default

1461

@cindex pattern, default

1462

@cindex default action

1463

@cindex default pattern

1464

In an @code{awk} rule, either the pattern or the action can be omitted,

1465

but not both. If the pattern is omitted, then the action is performed

1466

for @emph{every} input line. If the action is omitted, the default

1467

action is to print all lines that match the pattern.

1468

1469

@cindex empty action

1470

@cindex action, empty

1471

Thus, we could leave out the action (the @code{print} statement and the curly

1472

braces) in the above example, and the result would be the same: all

1473

lines matching the pattern @samp{foo} would be printed. By comparison,

1474

omitting the @code{print} statement but retaining the curly braces makes an

1475

empty action that does nothing; then no lines would be printed.
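
The three cases can be compared side by side.  This is only a sketch using
the sample file @file{BBS-list}, and it assumes a Bourne-style shell; the
text after each @samp{#} is a shell comment, not part of the @code{awk}
program:

@example
awk '/foo/' BBS-list             # action omitted: print matching lines
awk '@{ print $0 @}' BBS-list      # pattern omitted: print every line
awk '/foo/ @{ @}' BBS-list         # empty action: print nothing
@end example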

1476

1477

@node Two Rules, More Complex, Very Simple, Getting Started

1478

@section An Example with Two Rules

1479

@cindex how @code{awk} works

1480

1481

The @code{awk} utility reads the input files one line at a

1482

time. For each line, @code{awk} tries the patterns of each of the rules.

1483

If several patterns match then several actions are run, in the order in

1484

which they appear in the @code{awk} program. If no patterns match, then

1485

no actions are run.

1486

1487

After processing all the rules (perhaps none) that match the line,

1488

@code{awk} reads the next line (however,

1489

@pxref{Next Statement, ,The @code{next} Statement},

1490

and also @pxref{Nextfile Statement, ,The @code{nextfile} Statement}).

1491

This continues until the end of the file is reached.

1492

1493

For example, the @code{awk} program:

1494

1495

@example
/12/  @{ print $0 @}
/21/  @{ print $0 @}
@end example

1499

1500

@noindent

1501

contains two rules. The first rule has the string @samp{12} as the

1502

pattern and @samp{print $0} as the action. The second rule has the

1503

string @samp{21} as the pattern and also has @samp{print $0} as the

1504

action. Each rule's action is enclosed in its own pair of braces.

1505

1506

This @code{awk} program prints every line that contains the string

1507

@samp{12} @emph{or} the string @samp{21}. If a line contains both

1508

strings, it is printed twice, once by each rule.

1509

1510

This is what happens if we run this program on our two sample data files,

1511

@file{BBS-list} and @file{inventory-shipped}, as shown here:

1512

1513

@example
$ awk '/12/ @{ print $0 @}
>      /21/ @{ print $0 @}' BBS-list inventory-shipped
@print{} aardvark     555-5553     1200/300          B
@print{} alpo-net     555-3412     2400/1200/300     A
@print{} barfly       555-7685     1200/300          A
@print{} bites        555-1675     2400/1200/300     A
@print{} core         555-2912     1200/300          C
@print{} fooey        555-1234     2400/1200/300     B
@print{} foot         555-6699     1200/300          B
@print{} macfoo       555-6480     1200/300          A
@print{} sdace        555-3430     2400/1200/300     A
@print{} sabafoo      555-2127     1200/300          C
@print{} sabafoo      555-2127     1200/300          C
@print{} Jan  21  36  64 620
@print{} Apr  21  70  74 514
@end example

1530

1531

@noindent

1532

Note how the line in @file{BBS-list} beginning with @samp{sabafoo}

1533

was printed twice, once for each rule.
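
You can observe the same effect without any data file.  In this small
sketch, the labels @samp{first rule} and @samp{second rule} are added only
to show which rule produced each line of output:

@example
$ echo 1221 | awk '/12/ @{ print "first rule: " $0 @}
>                  /21/ @{ print "second rule: " $0 @}'
@print{} first rule: 1221
@print{} second rule: 1221
@end example

@noindent
The single input line @samp{1221} contains both @samp{12} and @samp{21},
so both actions run, in the order in which the rules appear.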

1534

1535

@node More Complex, Statements/Lines, Two Rules, Getting Started

1536

@section A More Complex Example

1537

1538

@ignore

1539

We have to use ls -lg here to get portable output across Unix systems.

1540

The POSIX ls matches this behavior too. Sigh.

1541

@end ignore

1542

Here is an example to give you an idea of what typical @code{awk}

1543

programs do. This example shows how @code{awk} can be used to

1544

summarize, select, and rearrange the output of another utility. It uses

1545

features that haven't been covered yet, so don't worry if you don't

1546

understand all the details.

1547

1548

@example
ls -lg | awk '$6 == "Nov" @{ sum += $5 @}
             END @{ print sum @}'
@end example

1552

1553

@cindex @code{csh}, backslash continuation

1554

@cindex backslash continuation in @code{csh}

1555

This command prints the total number of bytes in all the files in the

1556

current directory that were last modified in November (of any year).

1557

(In the C shell you would need to type a semicolon and then a backslash

1558

at the end of the first line; in a POSIX-compliant shell, such as the

1559

Bourne shell or Bash, the GNU Bourne-Again shell, you can type the example

1560

as shown.)

1561

@ignore

1562

FIXME: how can users tell what shell they are running? Need a footnote

1563

or something, but getting into this is a distraction.

1564

@end ignore

1565

1566

The @w{@samp{ls -lg}} part of this example is a system command that gives

1567

you a listing of the files in a directory, including file size and the date

1568

the file was last modified. Its output looks like this:

1569

1570

@example
-rw-r--r--  1 arnold   user   1933 Nov  7 13:05 Makefile
-rw-r--r--  1 arnold   user  10809 Nov  7 13:03 gawk.h
-rw-r--r--  1 arnold   user    983 Apr 13 12:14 gawk.tab.h
-rw-r--r--  1 arnold   user  31869 Jun 15 12:20 gawk.y
-rw-r--r--  1 arnold   user  22414 Nov  7 13:03 gawk1.c
-rw-r--r--  1 arnold   user  37455 Nov  7 13:03 gawk2.c
-rw-r--r--  1 arnold   user  27511 Dec  9 13:07 gawk3.c
-rw-r--r--  1 arnold   user   7989 Nov  7 13:03 gawk4.c
@end example

1580

1581

@noindent

1582

The first field contains read-write permissions, the second field contains

1583

the number of links to the file, and the third field identifies the owner of

1584

the file. The fourth field identifies the group of the file.

1585

The fifth field contains the size of the file in bytes. The

1586

sixth, seventh and eighth fields contain the month, day, and time,

1587

respectively, that the file was last modified. Finally, the ninth field

1588

contains the name of the file.

1589

1590

@cindex automatic initialization

1591

@cindex initialization, automatic

1592

The @samp{$6 == "Nov"} in our @code{awk} program is an expression that

1593

tests whether the sixth field of the output from @w{@samp{ls -lg}}

1594

matches the string @samp{Nov}. Each time a line has the string

1595

@samp{Nov} for its sixth field, the action @samp{sum += $5} is

1596

performed. This adds the fifth field (the file size) to the variable

1597

@code{sum}. As a result, when @code{awk} has finished reading all the

1598

input lines, @code{sum} is the sum of the sizes of files whose

1599

lines matched the pattern. (This works because @code{awk} variables

1600

are automatically initialized to zero.)

1601

1602

After the last line of output from @code{ls} has been processed, the

1603

@code{END} rule is executed, and the value of @code{sum} is

1604

printed. In this example, the value of @code{sum} would be 80600.
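
To see where this figure comes from, add up the sizes on the five
@samp{Nov} lines of the listing: 1933 + 10809 + 22414 + 37455 + 7989 = 80600.
As a further sketch (the variable name @code{count} is our own choice, and
the field positions assume the @w{@samp{ls -lg}} layout shown above),
the same rule can also count the matching files:

@example
ls -lg | awk '$6 == "Nov" @{ sum += $5; count++ @}
             END @{ print count, "files,", sum, "bytes" @}'
@end example

@noindent
For the listing above, this would print @samp{5 files, 80600 bytes}.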

1605

1606

These more advanced @code{awk} techniques are covered in later sections

1607

(@pxref{Action Overview, ,Overview of Actions}). Before you can move on to more

1608

advanced @code{awk} programming, you have to know how @code{awk} interprets

1609

your input and displays your output. By manipulating fields and using

1610

@code{print} statements, you can produce some very useful and impressive

1611

looking reports.

1612

1613

@node Statements/Lines, Other Features, More Complex, Getting Started

1614

@section @code{awk} Statements Versus Lines

1615

@cindex line break

1616

@cindex newline

1617

1618

Most often, each line in an @code{awk} program is a separate statement or

1619

separate rule, like this:

1620

1621

@example
awk '/12/  @{ print $0 @}
     /21/  @{ print $0 @}' BBS-list inventory-shipped
@end example

1625

1626

However, @code{gawk} will ignore newlines after any of the following:

1627

1628

@example
,    @{    ?    :    ||    &&    do    else
@end example

1631

1632

@noindent

1633

A newline at any other point is considered the end of the statement.

1634

(Splitting lines after @samp{?} and @samp{:} is a minor @code{gawk}
extension.  The @samp{?} and @samp{:} referred to here are the operators
of the three-operand conditional expression described in
@ref{Conditional Exp, ,Conditional Expressions}.)
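
For instance, the following sketch breaks one rule across several lines
without any backslashes, because each line break follows @samp{&&},
@samp{@{}, or a comma.  (The file name @file{data} is a placeholder.)

@example
awk 'NF > 0 &&
     length($0) > 80 @{
         print NR,
               length($0)
     @}' data
@end example

@noindent
For every non-blank line longer than 80 characters, this prints the line
number and the line's length.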

1638

1639

@cindex backslash continuation

1640

@cindex continuation of lines

1641

@cindex line continuation

1642

If you would like to split a single statement into two lines at a point

1643

where a newline would terminate it, you can @dfn{continue} it by ending the

1644

first line with a backslash character, @samp{\}. The backslash must be

1645

the final character on the line to be recognized as a continuation

1646

character. This is allowed absolutely anywhere in the statement, even

1647

in the middle of a string or regular expression. For example:

1648

1649

@example
awk '/This regular expression is too long, so continue it\
on the next line/ @{ print $1 @}'
@end example

1653

1654

@noindent

1655

@cindex portability issues

1656

We have generally not used backslash continuation in the sample programs

1657

in this @value{DOCUMENT}. Since in @code{gawk} there is no limit on the

1658

length of a line, it is never strictly necessary; it just makes programs

1659

more readable. For this same reason, as well as for clarity, we have

1660

kept most statements short in the sample programs presented throughout

1661

the @value{DOCUMENT}. Backslash continuation is most useful when your

1662

@code{awk} program is in a separate source file, instead of typed in on

1663

the command line. You should also note that many @code{awk}

1664

implementations are more particular about where you may use backslash

1665

continuation. For example, they may not allow you to split a string

1666

constant using backslash continuation. Thus, for maximal portability of

1667

your @code{awk} programs, it is best not to split your lines in the

1668

middle of a regular expression or a string.

1669

1670

@cindex @code{csh}, backslash continuation

1671

@cindex backslash continuation in @code{csh}

1672

@strong{Caution: backslash continuation does not work as described above

1673

with the C shell.} Continuation with backslash works for @code{awk}

1674

programs in files, and also for one-shot programs @emph{provided} you

1675

are using a POSIX-compliant shell, such as the Bourne shell or Bash, the

1676

GNU Bourne-Again shell. But the C shell (@code{csh}) behaves

1677

differently! There, you must use two backslashes in a row, followed by

1678

a newline. Note also that when using the C shell, @emph{every} newline

1679

in your awk program must be escaped with a backslash. To illustrate:

1680

1681

@example
% awk 'BEGIN @{ \
? print \\
? "hello, world" \
? @}'
@print{} hello, world
@end example

1688

1689

@noindent

1690

Here, the @samp{%} and @samp{?} are the C shell's primary and secondary

1691

prompts, analogous to the standard shell's @samp{$} and @samp{>}.

1692

1693

@code{awk} is a line-oriented language. Each rule's action has to

1694

begin on the same line as the pattern. To have the pattern and action

1695

on separate lines, you @emph{must} use backslash continuation---there

1696

is no other way.
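
As an illustration, here is a rule whose action has been moved to the line
after its pattern.  This is a sketch for a program kept in a file:

@example
/foo/ \
@{ print $0 @}
@end example

@noindent
Without the backslash, @code{awk} would see two separate rules:
@samp{/foo/} with the default action, and an action with no pattern that
prints every line.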

1697

1698

@cindex multiple statements on one line

1699

When @code{awk} statements within one rule are short, you might want to put

1700

more than one of them on a line. You do this by separating the statements

1701

with a semicolon, @samp{;}.

1702

1703

This also applies to the rules themselves.

1704

Thus, the previous program could have been written:

1705

1706

@example
/12/ @{ print $0 @} ; /21/ @{ print $0 @}
@end example

1709

1710

@noindent

1711

@strong{Note:} the requirement that rules on the same line must be

1712

separated with a semicolon was not in the original @code{awk}

1713

language; it was added for consistency with the treatment of statements

1714

within an action.
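
The following sketch shows both uses of the semicolon on one line: it
separates two statements inside the first action, and it separates the two
rules.  (The variable names are arbitrary, and @file{data} is a placeholder.)

@example
awk '@{ nf += NF; nr++ @} ; END @{ print nf, nr @}' data
@end example

@noindent
This prints the total number of fields followed by the number of lines read.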

1715

1716

@node Other Features, When, Statements/Lines, Getting Started

1717

@section Other Features of @code{awk}

1718

1719

The @code{awk} language provides a number of predefined, or built-in variables, which

1720

your programs can use to get information from @code{awk}. There are other

1721

variables your program can set to control how @code{awk} processes your

1722

data.

1723

1724

In addition, @code{awk} provides a number of built-in functions for doing

1725

common computational and string related operations.

1726

1727

As we develop our presentation of the @code{awk} language, we introduce

1728

most of the variables and many of the functions. They are defined

1729

systematically in @ref{Built-in Variables}, and

1730

@ref{Built-in, ,Built-in Functions}.
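
As a small foretaste, this sketch combines two built-in variables,
@code{NR} (the number of records read so far) and @code{NF} (the number of
fields in the current record), with the built-in function @code{length}:

@example
awk '@{ print NR, NF, length($0) @}' data
@end example

@noindent
For each line of the placeholder file @file{data}, it prints the line number,
the number of fields, and the number of characters in the line.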

1731

1732

@node When, , Other Features, Getting Started

1733

@section When to Use @code{awk}

1734

1735

@cindex when to use @code{awk}

1736

@cindex applications of @code{awk}

1737

You might wonder how @code{awk} might be useful for you. Using

1738

utility programs, advanced patterns, field separators, arithmetic

1739

statements, and other selection criteria, you can produce much more

1740

complex output. The @code{awk} language is very useful for producing

1741

reports from large amounts of raw data, such as summarizing information

1742

from the output of other utility programs like @code{ls}.

1743

(@xref{More Complex, ,A More Complex Example}.)

1744

1745

Programs written with @code{awk} are usually much smaller than they would

1746

be in other languages. This makes @code{awk} programs easy to compose and

1747

use. Often, @code{awk} programs can be quickly composed at your terminal,

1748

used once, and thrown away. Since @code{awk} programs are interpreted, you

1749

can avoid the (usually lengthy) compilation part of the typical

1750

edit-compile-test-debug cycle of software development.

1751

1752

Complex programs have been written in @code{awk}, including a complete

1753

retargetable assembler for eight-bit microprocessors (@pxref{Glossary}, for

1754

more information) and a microcode assembler for a special purpose Prolog

1755

computer. However, @code{awk}'s capabilities are strained by tasks of

1756

such complexity.

1757

1758

If you find yourself writing @code{awk} scripts of more than, say, a few

1759

hundred lines, you might consider using a different programming

1760

language. Emacs Lisp is a good choice if you need sophisticated string

1761

or pattern matching capabilities. The shell is also good at string and

1762

pattern matching; in addition, it allows powerful use of the system

1763

utilities. More conventional languages, such as C, C++, and Lisp, offer

1764

better facilities for system programming and for managing the complexity

1765

of large programs. Programs in these languages may require more lines

1766

of source code than the equivalent @code{awk} programs, but they are

1767

easier to maintain and usually run more efficiently.

1768

1769

@node One-liners, Regexp, Getting Started, Top

1770

@chapter Useful One Line Programs

1771

1772

@cindex one-liners

1773

Many useful @code{awk} programs are short, just a line or two. Here is a

1774

collection of useful, short programs to get you started. Some of these

1775

programs contain constructs that haven't been covered yet. The description

1776

of the program will give you a good idea of what is going on, but please

1777

read the rest of the @value{DOCUMENT} to become an @code{awk} expert!

1778

1779

Most of the examples use a data file named @file{data}. This is just a

1780

placeholder; if you were to use these programs yourself, you would substitute

1781

your own file names for @file{data}.

1782

1783

@ifinfo

1784

Since you are reading this in Info, each line of the example code is

1785

enclosed in quotes, to represent text that you would type literally.

1786

The examples themselves represent shell commands that use single quotes

1787

to keep the shell from interpreting the contents of the program.

1788

When reading the examples, focus on the text between the open and close

1789

quotes.

1790

@end ifinfo

1791

1792

@table @code
@item awk '@{ if (length($0) > max) max = length($0) @}
@itemx @ @ @ @ @ END @{ print max @}' data
This program prints the length of the longest input line.

@item awk 'length($0) > 80' data
This program prints every line that is longer than 80 characters.  The sole
rule has a relational expression as its pattern, and has no action (so the
default action, printing the record, is used).

@item expand@ data@ |@ awk@ '@{ if (x < length()) x = length() @}
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ END @{ print "maximum line length is " x @}'
This program prints the length of the longest line in @file{data}.  The input
is processed by the @code{expand} program to change tabs into spaces,
so the widths compared are actually the right-margin columns.

@item awk 'NF > 0' data
This program prints every line that has at least one field.  This is an
easy way to delete blank lines from a file (or rather, to create a new
file similar to the old file but from which the blank lines have been
deleted).

@c Karl Berry points out that new users probably don't want to see
@c multiple ways to do things, just the `best' way.  He's probably
@c right.  At some point it might be worth adding something about there
@c often being multiple ways to do things in awk, but for now we'll
@c just take this one out.
@ignore
@item awk '@{ if (NF > 0) print @}' data
This program also prints every line that has at least one field.  Here we
allow the rule to match every line, and then decide in the action whether
to print.
@end ignore

@item awk@ 'BEGIN@ @{@ for (i = 1; i <= 7; i++)
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ print int(101 * rand()) @}'
This program prints seven random numbers from zero to 100, inclusive.

@item ls -lg @var{files} | awk '@{ x += $5 @} ; END @{ print "total bytes: " x @}'
This program prints the total number of bytes used by @var{files}.

@item ls -lg @var{files} | awk '@{ x += $5 @}
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ END @{ print "total K-bytes: " (x + 1023)/1024 @}'
This program prints the total number of kilobytes used by @var{files}.

@item awk -F: '@{ print $1 @}' /etc/passwd | sort
This program prints a sorted list of the login names of all users.

@item awk 'END @{ print NR @}' data
This program counts lines in a file.

@item awk 'NR % 2 == 0' data
This program prints the even numbered lines in the data file.
If you were to use the expression @samp{NR % 2 == 1} instead,
it would print the odd numbered lines.
@end table
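
To check the line-number arithmetic of the last item without a data file,
you can feed @code{awk} four short lines.  This sketch assumes a shell
whose @code{printf} command expands @samp{\n}:

@example
$ printf 'a\nb\nc\nd\n' | awk 'NR % 2 == 0'
@print{} b
@print{} d
@end example

@noindent
@code{NR} is 2 for @samp{b} and 4 for @samp{d}, so the remainder is zero
for exactly those lines.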

1848

1849

@node Regexp, Reading Files, One-liners, Top

1850

@chapter Regular Expressions

1851

@cindex pattern, regular expressions

1852

@cindex regexp

1853

@cindex regular expression

1854

@cindex regular expressions as patterns

1855

1856

A @dfn{regular expression}, or @dfn{regexp}, is a way of describing a

1857

set of strings.

1858

Because regular expressions are such a fundamental part of @code{awk}

1859

programming, their format and use deserve a separate chapter.

1860

1861

A regular expression enclosed in slashes (@samp{/})

1862

is an @code{awk} pattern that matches every input record whose text

1863

belongs to that set.

1864

1865

The simplest regular expression is a sequence of letters, numbers, or

1866

both. Such a regexp matches any string that contains that sequence.

1867

Thus, the regexp @samp{foo} matches any string containing @samp{foo}.

1868

Therefore, the pattern @code{/foo/} matches any input record containing

1869

the three characters @samp{foo}, @emph{anywhere} in the record. Other

1870

kinds of regexps let you specify more complicated classes of strings.

1871

1872

@iftex

1873

Initially, the examples will be simple. As we explain more about how

1874

regular expressions work, we will present more complicated examples.

1875

@end iftex

1876

1877

@menu

1878

* Regexp Usage:: How to Use Regular Expressions.

1879

* Escape Sequences:: How to write non-printing characters.

1880

* Regexp Operators:: Regular Expression Operators.

1881

* GNU Regexp Operators:: Operators specific to GNU software.

1882

* Case-sensitivity:: How to do case-insensitive matching.

1883

* Leftmost Longest:: How much text matches.

1884

* Computed Regexps:: Using Dynamic Regexps.

1885

@end menu

1886

1887

@node Regexp Usage, Escape Sequences, Regexp, Regexp

1888

@section How to Use Regular Expressions

1889

1890

A regular expression can be used as a pattern by enclosing it in

1891

slashes. Then the regular expression is tested against the

1892

entire text of each record. (Normally, it only needs

1893

to match some part of the text in order to succeed.) For example, this

1894

prints the second field of each record that contains the three

1895

characters @samp{foo} anywhere in it:

1896

1897

@example
@group
$ awk '/foo/ @{ print $2 @}' BBS-list
@print{} 555-1234
@print{} 555-6699
@print{} 555-6480
@print{} 555-2127
@end group
@end example

1906

1907

@cindex regexp matching operators

1908

@cindex string-matching operators

1909

@cindex operators, string-matching

1910

@cindex operators, regexp matching

1911

@cindex regexp match/non-match operators

1912

@cindex @code{~} operator

1913

@cindex @code{!~} operator

1914

Regular expressions can also be used in matching expressions. These

1915

expressions allow you to specify the string to match against; it need

1916

not be the entire current input record. The two operators, @samp{~}

1917

and @samp{!~}, perform regular expression comparisons. Expressions

1918

using these operators can be used as patterns or in @code{if},

1919

@code{while}, @code{for}, and @code{do} statements.

1920

@ifinfo

1921

@c adding this xref in TeX screws up the formatting too much

1922

(@xref{Statements, ,Control Statements in Actions}.)

1923

@end ifinfo

1924

1925

@table @code
@item @var{exp} ~ /@var{regexp}/
This is true if the expression @var{exp} (taken as a string)
is matched by @var{regexp}.  The following example matches, or selects,
all input records with the upper-case letter @samp{J} somewhere in the
first field:

@example
@group
$ awk '$1 ~ /J/' inventory-shipped
@print{} Jan  13  25  15 115
@print{} Jun  31  42  75 492
@print{} Jul  24  34  67 436
@print{} Jan  21  36  64 620
@end group
@end example

So does this:

@example
awk '@{ if ($1 ~ /J/) print @}' inventory-shipped
@end example

@item @var{exp} !~ /@var{regexp}/
This is true if the expression @var{exp} (taken as a character string)
is @emph{not} matched by @var{regexp}.  The following example matches,
or selects, all input records whose first field @emph{does not} contain
the upper-case letter @samp{J}:

@example
@group
$ awk '$1 !~ /J/' inventory-shipped
@print{} Feb  15  32  24 226
@print{} Mar  15  24  34 228
@print{} Apr  31  52  63 420
@print{} May  16  34  29 208
@dots{}
@end group
@end example
@end table
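
Because the string being matched is an ordinary expression, it can be
any field, not just the first.  The following sketch selects records of
@file{BBS-list} by their third field and prints the first field of each;
the pattern @samp{2400} is just an example value:

@example
awk '$3 ~ /2400/ @{ print $1 @}' BBS-list
@end example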

1965

1966

@cindex regexp constant

1967

When a regexp is written enclosed in slashes, like @code{/foo/}, we call it

1968

a @dfn{regexp constant}, much like @code{5.27} is a numeric constant, and

1969

@code{"foo"} is a string constant.

1970

1971

@node Escape Sequences, Regexp Operators, Regexp Usage, Regexp

1972

@section Escape Sequences

1973

1974

@cindex escape sequence notation

1975

Some characters cannot be included literally in string constants

1976

(@code{"foo"}) or regexp constants (@code{/foo/}). You represent them

1977

instead with @dfn{escape sequences}, which are character sequences

1978

beginning with a backslash (@samp{\}).

1979

1980

One use of an escape sequence is to include a double-quote character in

1981

a string constant. Since a plain double-quote would end the string, you

1982

must use @samp{\"} to represent an actual double-quote character as a

1983

part of the string. For example:

1984

1985

@example
$ awk 'BEGIN @{ print "He said \"hi!\" to her." @}'
@print{} He said "hi!" to her.
@end example

1989

1990

The backslash character itself is another character that cannot be

1991

included normally; you write @samp{\\} to put one backslash in the

1992

string or regexp. Thus, the string whose contents are the two characters

1993

@samp{"} and @samp{\} must be written @code{"\"\\"}.
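
A quick check at the prompt of a Bourne-style shell confirms this:

@example
$ awk 'BEGIN @{ print "\"\\" @}'
@print{} "\
@end example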

1994

1995

Another use of backslash is to represent unprintable characters

1996

such as tab or newline. While there is nothing to stop you from entering most

1997

unprintable characters directly in a string constant or regexp constant,

1998

they may look ugly.

1999

2000

Here is a table of all the escape sequences used in @code{awk}, and

2001

what they represent. Unless noted otherwise, all of these escape

2002

sequences apply to both string constants and regexp constants.

2003

2004

@iftex

2005

@page

2006

@end iftex

2007

@c @cartouche

2008

@table @code
@item \\
A literal backslash, @samp{\}.

@cindex @code{awk} language, V.4 version
@item \a
The ``alert'' character, @kbd{Control-g}, ASCII code 7 (BEL).

@item \b
Backspace, @kbd{Control-h}, ASCII code 8 (BS).

@item \f
Formfeed, @kbd{Control-l}, ASCII code 12 (FF).

@item \n
Newline, @kbd{Control-j}, ASCII code 10 (LF).

@item \r
Carriage return, @kbd{Control-m}, ASCII code 13 (CR).

@item \t
Horizontal tab, @kbd{Control-i}, ASCII code 9 (HT).

@cindex @code{awk} language, V.4 version
@item \v
Vertical tab, @kbd{Control-k}, ASCII code 11 (VT).

@item \@var{nnn}
The octal value @var{nnn}, where @var{nnn} are one to three digits
between @samp{0} and @samp{7}.  For example, the code for the ASCII ESC
(escape) character is @samp{\033}.

@cindex @code{awk} language, V.4 version
@cindex @code{awk} language, POSIX version
@cindex POSIX @code{awk}
@item \x@var{hh}@dots{}
The hexadecimal value @var{hh}, where @var{hh} are hexadecimal
digits (@samp{0} through @samp{9} and either @samp{A} through @samp{F} or
@samp{a} through @samp{f}).  Like the same construct in ANSI C, the escape
sequence continues until the first non-hexadecimal digit is seen.  However,
using more than two hexadecimal digits produces undefined results.  (The
@samp{\x} escape sequence is not allowed in POSIX @code{awk}.)

@item \/
A literal slash (necessary for regexp constants only).
You use this when you wish to write a regexp
constant that contains a slash.  Since the regexp is delimited by
slashes, you need to escape the slash that is part of the pattern,
in order to tell @code{awk} to keep processing the rest of the regexp.

@item \"
A literal double-quote (necessary for string constants only).
You use this when you wish to write a string
constant that contains a double-quote.  Since the string is delimited by
double-quotes, you need to escape the quote that is part of the string,
in order to tell @code{awk} to keep processing the rest of the string.
@end table

2065

@c @end cartouche
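
As a brief sketch of several of these sequences working together, the
following command uses @samp{\n} and two octal escapes (@samp{\101} and
@samp{\102} are the ASCII codes for @samp{A} and @samp{B}):

@example
$ awk 'BEGIN @{ print "one\ntwo \101\102" @}'
@print{} one
@print{} two AB
@end example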

2066

2067

In @code{gawk}, there are additional two character sequences that begin

2068

with backslash that have special meaning in regexps.

2069

@xref{GNU Regexp Operators, ,Additional Regexp Operators Only in @code{gawk}}.

2070

2071

In a string constant,

2072

what happens if you place a backslash before something that is not one of

2073

the characters listed above? POSIX @code{awk} purposely leaves this case

2074

undefined. There are two choices.

2075

2076

@itemize @bullet

2077

@item

2078

Strip the backslash out. This is what Unix @code{awk} and @code{gawk} both do.

2079

For example, @code{"a\qc"} is the same as @code{"aqc"}.

2080

2081

@item

2082

Leave the backslash alone. Some other @code{awk} implementations do this.

2083

In such implementations, @code{"a\qc"} is the same as if you had typed

2084

@code{"a\\qc"}.

2085

@end itemize

2086

2087

In a regexp, a backslash before any character that is not in the above table,

2088

and not listed in

2089

@ref{GNU Regexp Operators, ,Additional Regexp Operators Only in @code{gawk}},

2090

means that the next character should be taken literally, even if it would

2091

normally be a regexp operator. E.g., @code{/a\+b/} matches the three

2092

characters @samp{a+b}.

2093

2094

@cindex portability issues

2095

For complete portability, do not use a backslash before any character not

2096

listed in the table above.
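
As a quick check of the rule for backslashes in regexps, here is a
small sketch that runs the @code{/a\+b/} pattern from the text against two
made-up input lines. Only the line that contains a literal @samp{+} should
be printed; the line @samp{aab}, which would match if @samp{+} were acting
as an operator, should not.

@example
$ printf 'a+b\naab\n' | awk '/a\+b/ @{ print "literal plus:", $0 @}'
@print{} literal plus: a+b
@end example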

Another interesting question arises. Suppose you use an octal or hexadecimal
escape to represent a regexp metacharacter
(@pxref{Regexp Operators, , Regular Expression Operators}).
Does @code{awk} treat the character as a literal character, or as a regexp
operator?

@cindex dark corner
It turns out that historically, such characters were taken literally (d.c.).
However, the POSIX standard indicates that they should be treated
as real metacharacters, and this is what @code{gawk} does.
However, in compatibility mode (@pxref{Options, ,Command Line Options}),
@code{gawk} treats the characters represented by octal and hexadecimal
escape sequences literally when used in regexp constants. Thus,
@code{/a\52b/} is equivalent to @code{/a\*b/}.

To summarize:

@enumerate 1
@item
The escape sequences in the table above are always processed first,
for both string constants and regexp constants. This happens very early,
as soon as @code{awk} reads your program.

@item
@code{gawk} processes both regexp constants and dynamic regexps
(@pxref{Computed Regexps, ,Using Dynamic Regexps}),
for the special operators listed in
@ref{GNU Regexp Operators, ,Additional Regexp Operators Only in @code{gawk}}.

@item
A backslash before any other character means to treat that character
literally.
@end enumerate
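
The two delimiter escapes from the table, @samp{\/} in a regexp constant
and @samp{\"} in a string constant, can be tried out with the following
sketch. The input text and the messages are made up for illustration.

@example
$ echo 'a/b' | awk '/a\/b/ @{ print "slash matched" @}'
@print{} slash matched
$ awk 'BEGIN @{ print "say \"hi\"" @}'
@print{} say "hi"
@end example

@noindent
In both cases the backslash keeps @code{awk} from treating the delimiter
character as the end of the constant.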

@node Regexp Operators, GNU Regexp Operators, Escape Sequences, Regexp
@section Regular Expression Operators
@cindex metacharacters
@cindex regular expression metacharacters
@cindex regexp operators

You can combine regular expressions with the following characters,
called @dfn{regular expression operators}, or @dfn{metacharacters}, to
increase the power and versatility of regular expressions.

The escape sequences described
@iftex
above
@end iftex
in @ref{Escape Sequences},
are valid inside a regexp. They are introduced by a @samp{\}. They
are recognized and converted into the corresponding real characters as
the very first step in processing regexps.

Here is a table of metacharacters. All characters that are not escape
sequences and that are not listed in the table stand for themselves.

@iftex
@page
@end iftex
@table @code
@item \
This is used to suppress the special meaning of a character when
matching. For example:

@example
\$
@end example

@noindent
matches the character @samp{$}.

@cindex anchors in regexps
@cindex regexp, anchors
@item ^
This matches the beginning of a string. For example:

@example
^@@chapter
@end example

@noindent
matches the @samp{@@chapter} at the beginning of a string, and can be used
to identify chapter beginnings in Texinfo source files.
The @samp{^} is known as an @dfn{anchor}, since it anchors the pattern to
matching only at the beginning of the string.

It is important to realize that @samp{^} does not match the beginning of
a line embedded in a string. In this example the condition is not true:

@example
if ("line1\nLINE 2" ~ /^L/) @dots{}
@end example

@item $
This is similar to @samp{^}, but it matches only at the end of a string.
For example:

@example
p$
@end example

@noindent
matches a record that ends with a @samp{p}. The @samp{$} is also an anchor,
and also does not match the end of a line embedded in a string. In this
example the condition is not true:

@example
if ("line1\nLINE 2" ~ /1$/) @dots{}
@end example

@item .
The period, or dot, matches any single character,
@emph{including} the newline character. For example:

@example
.P
@end example

@noindent
matches any single character followed by a @samp{P} in a string. Using
concatenation we can make a regular expression like @samp{U.A}, which
matches any three-character sequence that begins with @samp{U} and ends
with @samp{A}.

@cindex @code{awk} language, POSIX version
@cindex POSIX @code{awk}
In strict POSIX mode (@pxref{Options, ,Command Line Options}),
@samp{.} does not match the @sc{nul}
character, which is a character with all bits equal to zero.
Otherwise, @sc{nul} is just another character. Other versions of @code{awk}
may not be able to match the @sc{nul} character.

@ignore
2e: Add stuff that character list is the POSIX terminology. In other
literature known as character set or character class.
@end ignore

@cindex character list
@item [@dots{}]
This is called a @dfn{character list}. It matches any @emph{one} of the
characters that are enclosed in the square brackets. For example:

@example
[MVX]
@end example

@noindent
matches any one of the characters @samp{M}, @samp{V}, or @samp{X} in a
string.

Ranges of characters are indicated by using a hyphen between the beginning
and ending characters, and enclosing the whole thing in brackets. For
example:

@example
[0-9]
@end example

@noindent
matches any digit.
Multiple ranges are allowed. E.g., the list @code{@w{[A-Za-z0-9]}} is a
common way to express the idea of ``all alphanumeric characters.''

To include one of the characters @samp{\}, @samp{]}, @samp{-} or @samp{^} in a
character list, put a @samp{\} in front of it. For example:

@example
[d\]]
@end example

@noindent
matches either @samp{d} or @samp{]}.

@cindex @code{egrep}
This treatment of @samp{\} in character lists
is compatible with other @code{awk}
implementations, and is also mandated by POSIX.
The regular expressions in @code{awk} are a superset
of the POSIX specification for Extended Regular Expressions (EREs).
POSIX EREs are based on the regular expressions accepted by the
traditional @code{egrep} utility.

@cindex character classes
@cindex @code{awk} language, POSIX version
@cindex POSIX @code{awk}
@dfn{Character classes} are a new feature introduced in the POSIX standard.
A character class is a special notation for describing
lists of characters that have a specific attribute, but where the
actual characters themselves can vary from country to country and/or
from character set to character set. For example, the notion of what
is an alphabetic character differs in the USA and in France.

A character class is only valid in a regexp @emph{inside} the
brackets of a character list. Character classes consist of @samp{[:},
a keyword denoting the class, and @samp{:]}. Here are the character
classes defined by the POSIX standard.

@table @code
@item [:alnum:]
Alphanumeric characters.

@item [:alpha:]
Alphabetic characters.

@item [:blank:]
Space and tab characters.

@item [:cntrl:]
Control characters.

@item [:digit:]
Numeric characters.

@item [:graph:]
Characters that are printable and are also visible.
(A space is printable, but not visible, while an @samp{a} is both.)

@item [:lower:]
Lower-case alphabetic characters.

@item [:print:]
Printable characters (characters that are not control characters).

@item [:punct:]
Punctuation characters (characters that are not letters, digits,
control characters, or space characters).

@item [:space:]
Space characters (such as space, tab, and formfeed, to name a few).

@item [:upper:]
Upper-case alphabetic characters.

@item [:xdigit:]
Characters that are hexadecimal digits.
@end table

For example, before the POSIX standard, to match alphanumeric
characters, you had to write @code{/[A-Za-z0-9]/}. If your
character set had other alphabetic characters in it, this would not
match them. With the POSIX character classes, you can write
@code{/[[:alnum:]]/}, and this will match @emph{all} the alphabetic
and numeric characters in your character set.

@cindex collating elements
Two additional special sequences can appear in character lists.
These apply to non-ASCII character sets, which can have single symbols
(called @dfn{collating elements}) that are represented with more than one
character, as well as several characters that are equivalent for
@dfn{collating}, or sorting, purposes. (E.g., in French, a plain ``e''
and a grave-accented
@iftex
``@`e''
@end iftex
@ifinfo
``e''
@end ifinfo
are equivalent.)

@table @asis
@cindex collating symbols
@item Collating Symbols
A @dfn{collating symbol} is a multi-character collating element enclosed in
@samp{[.} and @samp{.]}. For example, if @samp{ch} is a collating element,
then @code{[[.ch.]]} is a regexp that matches this collating element, while
@code{[ch]} is a regexp that matches either @samp{c} or @samp{h}.

@cindex equivalence classes
@item Equivalence Classes
An @dfn{equivalence class} is a list of equivalent characters enclosed in
@samp{[=} and @samp{=]}.
@iftex
Thus, @code{[[=e@`e=]]} is a regexp that matches either @samp{e} or @samp{@`e}.
@end iftex
@ifinfo
Because Info files use plain ASCII characters, it is not possible to present
a realistic equivalence class example here.
@end ifinfo
@end table

These features are very valuable in non-English-speaking locales.

@strong{Caution:} The library functions that @code{gawk} uses for regular
expression matching currently only recognize POSIX character classes;
they do not recognize collating symbols or equivalence classes.
@c maybe one day ...

@cindex complemented character list
@cindex character list, complemented
@item [^ @dots{}]
This is a @dfn{complemented character list}. The first character after
the @samp{[} @emph{must} be a @samp{^}. It matches any characters
@emph{except} those in the square brackets, or newline. For example:

@example
[^0-9]
@end example

@noindent
matches any character that is not a digit.

@item |
This is the @dfn{alternation operator}, and it is used to specify
alternatives. For example:

@example
^P|[0-9]
@end example

@noindent
matches any string that matches either @samp{^P} or @samp{[0-9]}. This
means it matches any string that starts with @samp{P} or contains a digit.

The alternation applies to the largest possible regexps on either side.
In other words, @samp{|} has the lowest precedence of all the regular
expression operators.

@item (@dots{})
Parentheses are used for grouping in regular expressions as in
arithmetic. They can be used to concatenate regular expressions
containing the alternation operator, @samp{|}. For example,
@samp{@@(samp|code)\@{[^@}]+\@}} matches both @samp{@@code@{foo@}} and
@samp{@@samp@{bar@}}. (These are Texinfo formatting control sequences.)

@item *
This symbol means that the preceding regular expression is to be
repeated as many times as necessary to find a match. For example:

@example
ph*
@end example

@noindent
applies the @samp{*} symbol to the preceding @samp{h} and looks for matches
of one @samp{p} followed by any number of @samp{h}s. This will also match
just @samp{p} if no @samp{h}s are present.

The @samp{*} repeats the @emph{smallest} possible preceding expression.
(Use parentheses if you wish to repeat a larger expression.) It finds
as many repetitions as possible. For example:

@example
awk '/\(c[ad][ad]*r x\)/ @{ print @}' sample
@end example

@noindent
prints every record in @file{sample} containing a string of the form
@samp{(car x)}, @samp{(cdr x)}, @samp{(cadr x)}, and so on.
Notice the escaping of the parentheses by preceding them
with backslashes.

@item +
This symbol is similar to @samp{*}, but the preceding expression must be
matched at least once. This means that:

@example
wh+y
@end example

@noindent
would match @samp{why} and @samp{whhy} but not @samp{wy}, whereas
@samp{wh*y} would match all three of these strings. This is a simpler
way of writing the last @samp{*} example:

@example
awk '/\(c[ad]+r x\)/ @{ print @}' sample
@end example

@item ?
This symbol is similar to @samp{*}, but the preceding expression can be
matched either once or not at all. For example:

@example
fe?d
@end example

@noindent
will match @samp{fed} and @samp{fd}, but nothing else.

@cindex @code{awk} language, POSIX version
@cindex POSIX @code{awk}
@cindex interval expressions
@item @{@var{n}@}
@itemx @{@var{n},@}
@itemx @{@var{n},@var{m}@}
One or two numbers inside braces denote an @dfn{interval expression}.
If there is one number in the braces, the preceding regexp is repeated
@var{n} times.
If there are two numbers separated by a comma, the preceding regexp is
repeated @var{n} to @var{m} times.
If there is one number followed by a comma, then the preceding regexp
is repeated at least @var{n} times.

@table @code
@item wh@{3@}y
matches @samp{whhhy} but not @samp{why} or @samp{whhhhy}.

@item wh@{3,5@}y
matches @samp{whhhy} or @samp{whhhhy} or @samp{whhhhhy}, only.

@item wh@{2,@}y
matches @samp{whhy} or @samp{whhhy}, and so on.
@end table

Interval expressions were not traditionally available in @code{awk}.
They were added as part of the POSIX standard, to make @code{awk}
and @code{egrep} consistent with each other.

However, since old programs may use @samp{@{} and @samp{@}} in regexp
constants, by default @code{gawk} does @emph{not} match interval expressions
in regexps. If either @samp{--posix} or @samp{--re-interval} is specified
(@pxref{Options, , Command Line Options}), then interval expressions
are allowed in regexps.
@end table
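
The behavior of the two anchors with an embedded newline, described in the
table above, can be checked with a short sketch like the following. The
variable names are made up for illustration; each one holds 1 if the
corresponding test matched and 0 if it did not.

@example
$ awk 'BEGIN @{
>     s = "line1\nLINE 2"
>     starts_L = (s ~ /^L/);  starts_l = (s ~ /^l/)
>     ends_1   = (s ~ /1$/);  ends_2   = (s ~ /2$/)
>     print starts_L, starts_l, ends_1, ends_2
> @}'
@print{} 0 1 0 1
@end example

@noindent
Only the tests against the very beginning and the very end of the string
succeed; the @samp{L} and the @samp{1} next to the embedded newline are
not seen by the anchors.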

@cindex precedence, regexp operators
@cindex regexp operators, precedence of
In regular expressions, the @samp{*}, @samp{+}, and @samp{?} operators,
as well as the braces @samp{@{} and @samp{@}},
have the highest precedence, followed by concatenation, and finally
by @samp{|}.
As in arithmetic, parentheses can change how operators are grouped.
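
To see the effect of precedence, here is a sketch that compares the
alternation example @samp{^P|[0-9]} with a parenthesized variant. The three
input lines are made up for illustration. Without parentheses, the
@samp{^} applies only to the @samp{P}; with them, it applies to both
alternatives.

@example
$ printf 'x7\nPat\n7x\n' | awk '/^P|[0-9]/ @{ print "loose:", $0 @}'
@print{} loose: x7
@print{} loose: Pat
@print{} loose: 7x
$ printf 'x7\nPat\n7x\n' | awk '/^(P|[0-9])/ @{ print "grouped:", $0 @}'
@print{} grouped: Pat
@print{} grouped: 7x
@end example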

If @code{gawk} is in compatibility mode
(@pxref{Options, ,Command Line Options}),
character classes and interval expressions are not available in
regular expressions.

The next
@ifinfo
node
@end ifinfo
@iftex
section
@end iftex
discusses the GNU-specific regexp operators, and provides
more detail concerning how command line options affect the way @code{gawk}
interprets the characters in regular expressions.

@node GNU Regexp Operators, Case-sensitivity, Regexp Operators, Regexp
@section Additional Regexp Operators Only in @code{gawk}

@c This section adapted from the regex-0.12 manual

@cindex regexp operators, GNU specific
GNU software that deals with regular expressions provides a number of
additional regexp operators. These operators are described in this
section, and are specific to @code{gawk}; they are not available in other
@code{awk} implementations.

@cindex word, regexp definition of
Most of the additional operators are for dealing with word matching.
For our purposes, a @dfn{word} is a sequence of one or more letters, digits,
or underscores (@samp{_}).

@table @code
@cindex @code{\w} regexp operator
@item \w
This operator matches any word-constituent character, i.e.@: any
letter, digit, or underscore. Think of it as a short-hand for
@c @w{@code{[A-Za-z0-9_]}} or
@w{@code{[[:alnum:]_]}}.

@cindex @code{\W} regexp operator
@item \W
This operator matches any character that is not word-constituent.
Think of it as a short-hand for
@c @w{@code{[^A-Za-z0-9_]}} or
@w{@code{[^[:alnum:]_]}}.

@cindex @code{\<} regexp operator
@item \<
This operator matches the empty string at the beginning of a word.
For example, @code{/\<away/} matches @samp{away}, but not
@samp{stowaway}.

@cindex @code{\>} regexp operator
@item \>
This operator matches the empty string at the end of a word.
For example, @code{/stow\>/} matches @samp{stow}, but not @samp{stowaway}.

@cindex @code{\y} regexp operator
@cindex word boundaries, matching
@item \y
This operator matches the empty string at either the beginning or the
end of a word (the word boundar@strong{y}). For example, @samp{\yballs?\y}
matches either @samp{ball} or @samp{balls} as a separate word.

@cindex @code{\B} regexp operator
@item \B
This operator matches the empty string within a word. In other words,
@samp{\B} matches the empty string that occurs between two
word-constituent characters. For example,
@code{/\Brat\B/} matches @samp{crate}, but it does not match @samp{dirty rat}.
@samp{\B} is essentially the opposite of @samp{\y}.
@end table
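
The word-matching operators can be tried out as in the following sketch.
It assumes that a @code{gawk} executable is installed under that name and
is running in its default (non-compatibility) mode; the input lines are
made up for illustration.

@example
$ printf 'stowaway\ngo away\n' | gawk '/\<away/ @{ print @}'
@print{} go away
$ printf 'ballsy\nthree balls\n' | gawk '/\yballs?\y/ @{ print @}'
@print{} three balls
@end example

@noindent
In the first command, @samp{stowaway} is skipped because its @samp{away}
does not start a word. In the second, @samp{ballsy} is skipped because
neither @samp{ball} nor @samp{balls} is followed by a word boundary.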

There are two other operators that work on buffers. In Emacs, a
@dfn{buffer} is, naturally, an Emacs buffer. For other programs, the
regexp library routines that @code{gawk} uses consider the entire
string to be matched as the buffer.

For @code{awk}, since @samp{^} and @samp{$} always work in terms
of the beginning and end of strings, these operators don't add any
new capabilities. They are provided for compatibility with other GNU
software.

@cindex buffer matching operators
@table @code
@cindex @code{\`} regexp operator
@item \`
This operator matches the empty string at the
beginning of the buffer.

@cindex @code{\'} regexp operator
@item \'
This operator matches the empty string at the
end of the buffer.
@end table

In other GNU software, the word boundary operator is @samp{\b}. However,
that conflicts with the @code{awk} language's definition of @samp{\b}
as backspace, so @code{gawk} uses a different letter.

An alternative method would have been to require two backslashes in the
GNU operators, but this was deemed to be too confusing, and the current
method of using @samp{\y} for the GNU @samp{\b} appears to be the
lesser of two evils.

@c NOTE!!! Keep this in sync with the same table in the summary appendix!
@cindex regexp, effect of command line options
The various command line options
(@pxref{Options, ,Command Line Options})
control how @code{gawk} interprets characters in regexps.

@table @asis
@item No options
In the default case, @code{gawk} provides all the facilities of
POSIX regexps and the GNU regexp operators described
@iftex
above.
@end iftex
@ifinfo
in @ref{Regexp Operators, ,Regular Expression Operators}.
@end ifinfo
However, interval expressions are not supported.

@item @code{--posix}
Only POSIX regexps are supported; the GNU operators are not special
(e.g., @samp{\w} matches a literal @samp{w}). Interval expressions
are allowed.

@item @code{--traditional}
Traditional Unix @code{awk} regexps are matched. The GNU operators
are not special, interval expressions are not available, and neither
are the POSIX character classes (@code{[[:alnum:]]} and so on).
Characters described by octal and hexadecimal escape sequences are
treated literally, even if they represent regexp metacharacters.

@item @code{--re-interval}
Allow interval expressions in regexps, even if @samp{--traditional}
has been provided.
@end table

@node Case-sensitivity, Leftmost Longest, GNU Regexp Operators, Regexp
@section Case-sensitivity in Matching

@cindex case sensitivity
@cindex ignoring case
Case is normally significant in regular expressions, both when matching
ordinary characters (i.e.@: not metacharacters), and inside character
lists. Thus a @samp{w} in a regular expression matches only a lower-case
@samp{w} and not an upper-case @samp{W}.

The simplest way to do a case-independent match is to use a character
list: @samp{[Ww]}. However, this can be cumbersome if you need to use it
often, and it can make the regular expressions harder to
read. There are two alternatives that you might prefer.

One way to do a case-insensitive match at a particular point in the
program is to convert the data to a single case, using the
@code{tolower} or @code{toupper} built-in string functions (which we
haven't discussed yet;
@pxref{String Functions, ,Built-in Functions for String Manipulation}).
For example:

@example
tolower($1) ~ /foo/ @{ @dots{} @}
@end example

@noindent
converts the first field to lower-case before matching against it.
This will work in any POSIX-compliant implementation of @code{awk}.
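
Here is a sketch of the @code{tolower} approach applied to some made-up
input. The action, which prints the second field, is only an illustration.

@example
$ printf 'FOO one\nbar two\nFoo three\n' |
> awk 'tolower($1) ~ /foo/ @{ print $2 @}'
@print{} one
@print{} three
@end example

@noindent
The first field itself is not changed; only the copy returned by
@code{tolower} is matched against the regexp.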

@cindex differences between @code{gawk} and @code{awk}
@cindex @code{~} operator
@cindex @code{!~} operator
@vindex IGNORECASE
Another method, specific to @code{gawk}, is to set the variable
@code{IGNORECASE} to a non-zero value (@pxref{Built-in Variables}).
When @code{IGNORECASE} is not zero, @emph{all} regexp and string
operations ignore case. Changing the value of
@code{IGNORECASE} dynamically controls the case sensitivity of your
program as it runs. Case is significant by default because
@code{IGNORECASE} (like most variables) is initialized to zero.

@example
x = "aB"
if (x ~ /ab/) @dots{} # this test will fail

IGNORECASE = 1
if (x ~ /ab/) @dots{} # now it will succeed
@end example

In general, you cannot use @code{IGNORECASE} to make certain rules
case-insensitive and other rules case-sensitive, because there is no way
to set @code{IGNORECASE} just for the pattern of a particular rule.
@ignore
This isn't quite true. Consider:

IGNORECASE=1 && /foObAr/ { .... }
IGNORECASE=0 || /foobar/ { .... }

But that's pretty bad style and I don't want to get into it at this
late date.
@end ignore
To do this, you must use character lists or @code{tolower}. However, one
thing you can do only with @code{IGNORECASE} is turn case-sensitivity on
or off dynamically for all the rules at once.

@code{IGNORECASE} can be set on the command line, or in a @code{BEGIN} rule
(@pxref{Other Arguments, ,Other Command Line Arguments}; also
@pxref{Using BEGIN/END, ,Startup and Cleanup Actions}).
Setting @code{IGNORECASE} from the command line is a way to make
a program case-insensitive without having to edit it.
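
As a sketch of the command-line approach, the following runs the same
one-line program twice, first unchanged and then with @code{IGNORECASE}
set. It assumes a @code{gawk} executable in default (non-compatibility)
mode, and it uses the @samp{-v} option to make the assignment, which is
one of several ways to do so.

@example
$ echo 'FooBar' | gawk '/foobar/ @{ print "matched" @}'
$ echo 'FooBar' | gawk -v IGNORECASE=1 '/foobar/ @{ print "matched" @}'
@print{} matched
@end example

@noindent
The first command prints nothing, because case is significant by default.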

Prior to version 3.0 of @code{gawk}, the value of @code{IGNORECASE}
only affected regexp operations. It did not affect string comparison
with @samp{==}, @samp{!=}, and so on.
Beginning with version 3.0, both regexp and string comparison
operations are affected by @code{IGNORECASE}.

@cindex ISO 8859-1
@cindex ISO Latin-1
Beginning with version 3.0 of @code{gawk}, the equivalences between upper-case
and lower-case characters are based on the ISO-8859-1 (ISO Latin-1)
character set. This character set is a superset of the traditional 128
ASCII characters and also provides a number of characters suitable
for use with European languages.
@ignore
A pure ASCII character set can be used instead if @code{gawk} is compiled
with @samp{-DUSE_PURE_ASCII}.
@end ignore

The value of @code{IGNORECASE} has no effect if @code{gawk} is in
compatibility mode (@pxref{Options, ,Command Line Options}).
Case is always significant in compatibility mode.

@node Leftmost Longest, Computed Regexps, Case-sensitivity, Regexp
@section How Much Text Matches?

@cindex leftmost longest match
@cindex matching, leftmost longest
Consider the following example:

@example
echo aaaabcd | awk '@{ sub(/a+/, "<A>"); print @}'
@end example

This example uses the @code{sub} function (which we haven't discussed yet,
@pxref{String Functions, ,Built-in Functions for String Manipulation})
to make a change to the input record. Here, the regexp @code{/a+/}
indicates ``one or more @samp{a} characters,'' and the replacement
text is @samp{<A>}.

The input contains four @samp{a} characters. What will the output be?
In other words, how many is ``one or more''---will @code{awk} match two,
three, or all four @samp{a} characters?

The answer is, @code{awk} (and POSIX) regular expressions always match
the leftmost, @emph{longest} sequence of input characters that can
match. Thus, in this example, all four @samp{a} characters are
replaced with @samp{<A>}.

@example
$ echo aaaabcd | awk '@{ sub(/a+/, "<A>"); print @}'
@print{} <A>bcd
@end example

For simple match/no-match tests, this is not so important. But when doing
regexp-based field and record splitting, and
text matching and substitutions with the @code{match}, @code{sub}, @code{gsub},
and @code{gensub} functions, it is very important.
@ifinfo
@xref{String Functions, ,Built-in Functions for String Manipulation},
for more information on these functions.
@end ifinfo
Understanding this principle is also important for regexp-based record
and field splitting (@pxref{Records, ,How Input is Split into Records},
and also @pxref{Field Separators, ,Specifying How Fields are Separated}).
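
The @code{match} function offers another way to observe the
leftmost-longest rule. In this sketch, @code{RSTART} and @code{RLENGTH}
are the built-in variables that @code{match} sets to the position and
length of the matched text; they are described with the string functions
and are used here only to show how much of the input was matched.

@example
$ echo aaaabcd | awk '@{ if (match($0, /a+/)) print RSTART, RLENGTH @}'
@print{} 1 4
@end example

@noindent
The match starts at the first character and covers all four @samp{a}
characters.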

@node Computed Regexps, , Leftmost Longest, Regexp
@section Using Dynamic Regexps

@cindex computed regular expressions
@cindex regular expressions, computed
@cindex dynamic regular expressions
@cindex regexp, dynamic
@cindex @code{~} operator
@cindex @code{!~} operator
The right hand side of a @samp{~} or @samp{!~} operator need not be a
regexp constant (i.e.@: a string of characters between slashes). It may
be any expression. The expression is evaluated, and converted if
necessary to a string; the contents of the string are used as the
regexp. A regexp that is computed in this way is called a @dfn{dynamic
regexp}. For example:

@example
BEGIN @{ identifier_regexp = "[A-Za-z_][A-Za-z_0-9]*" @}
$0 ~ identifier_regexp @{ print @}
@end example

@noindent
sets @code{identifier_regexp} to a regexp that describes @code{awk}
variable names, and tests if the input record matches this regexp.
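
Here is a second, runnable sketch of a dynamic regexp. The variable name
@code{number_regexp} and the input lines are made up for illustration; the
regexp is anchored at both ends so that only records consisting entirely
of digits are selected.

@example
$ printf 'x1\n42\n' |
> awk 'BEGIN @{ number_regexp = "^[0-9]+$" @}
>      $0 ~ number_regexp @{ print "number:", $0 @}'
@print{} number: 42
@end example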

@strong{Caution:} When using the @samp{~} and @samp{!~}
operators, there is a difference between a regexp constant
enclosed in slashes, and a string constant enclosed in double quotes.
If you are going to use a string constant, you have to understand that
the string is in essence scanned @emph{twice}: the first time when
@code{awk} reads your program, and the second time when it goes to
match the string on the left-hand side of the operator with the pattern
on the right. This is true of any string-valued expression (such as
@code{identifier_regexp} above), not just string constants.

@cindex regexp constants, difference between slashes and quotes
What difference does it make if the string is
scanned twice? The answer has to do with escape sequences, and particularly
with backslashes. To get a backslash into a regular expression inside a
string, you have to type two backslashes.

For example, @code{/\*/} is a regexp constant for a literal @samp{*}.
Only one backslash is needed. To do the same thing with a string,
you would have to type @code{"\\*"}. The first backslash escapes the
second one, so that the string actually contains the
two characters @samp{\} and @samp{*}.
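
The two spellings can be compared side by side, as in this sketch; the
input line and the messages are made up for illustration. The single
quotes keep the shell from touching the backslashes, so @code{awk} sees
exactly what is typed here.

@example
$ echo 'a*b' | awk '$0 ~ /\*/  @{ print "regexp constant matched" @}'
@print{} regexp constant matched
$ echo 'a*b' | awk '$0 ~ "\\*" @{ print "string constant matched" @}'
@print{} string constant matched
@end example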

@cindex common mistakes
@cindex mistakes, common
@cindex errors, common
Given that you can use both regexp and string constants to describe
regular expressions, which should you use? The answer is ``regexp
constants,'' for several reasons.

@enumerate 1
@item
String constants are more complicated to write, and
more difficult to read. Using regexp constants makes your programs
less error-prone. Not understanding the difference between the two
kinds of constants is a common source of errors.

@item
It is also more efficient to use regexp constants: @code{awk} can note
that you have supplied a regexp and store it internally in a form that
makes pattern matching more efficient. When using a string constant,
@code{awk} must first convert the string into this internal form, and
then perform the pattern matching.

@item
Using regexp constants is better style; it shows clearly that you
intend a regexp match.
@end enumerate

@node Reading Files, Printing, Regexp, Top
@chapter Reading Input Files

@cindex reading files
@cindex input
@cindex standard input
@vindex FILENAME
In the typical @code{awk} program, all input is read either from the
standard input (by default the keyboard, but often a pipe from another
command) or from files whose names you specify on the @code{awk} command
line. If you specify input files, @code{awk} reads them in order, reading
all the data from one before going on to the next. The name of the current
input file can be found in the built-in variable @code{FILENAME}
(@pxref{Built-in Variables}).

The input is read in units called @dfn{records}, and processed by the
rules of your program one record at a time.
By default, each record is one line. Each
record is automatically split into chunks called @dfn{fields}.
This makes it more convenient for programs to work on the parts of a record.

2892

2893

On rare occasions you will need to use the @code{getline} command.

2894

The @code{getline} command is valuable, both because it

2895

can do explicit input from any number of files, and because the files

2896

used with it do not have to be named on the @code{awk} command line

2897

(@pxref{Getline, ,Explicit Input with @code{getline}}).

2898

2899

@menu

2900

* Records::                     Controlling how data is split into records.
* Fields::                      An introduction to fields.
* Non-Constant Fields::         Non-constant Field Numbers.
* Changing Fields::             Changing the Contents of a Field.
* Field Separators::            The field separator and how to change it.
* Constant Size::               Reading constant width data.
* Multiple Line::               Reading multi-line records.
* Getline::                     Reading files under explicit program control
                                using the @code{getline} function.

2909

@end menu

2910

2911

@node Records, Fields, Reading Files, Reading Files

2912

@section How Input is Split into Records

2913

2914

@cindex record separator, @code{RS}

2915

@cindex changing the record separator

2916

@cindex record, definition of

2917

@vindex RS

2918

The @code{awk} utility divides the input for your @code{awk}

2919

program into records and fields.

2920

Records are separated by a character called the @dfn{record separator}.

2921

By default, the record separator is the newline character.

2922

This is why records are, by default, single lines.

2923

You can use a different character for the record separator by

2924

assigning the character to the built-in variable @code{RS}.

2925

2926

You can change the value of @code{RS} in the @code{awk} program,

2927

like any other variable, with the

2928

assignment operator, @samp{=} (@pxref{Assignment Ops, ,Assignment Expressions}).

2929

The new record-separator character should be enclosed in quotation marks,

2930

which indicate

2931

a string constant. Often the right time to do this is at the beginning

2932

of execution, before any input has been processed, so that the very

2933

first record will be read with the proper separator. To do this, use

2934

the special @code{BEGIN} pattern

2935

(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}). For

2936

example:

2937

2938

@example

2939

awk 'BEGIN @{ RS = "/" @} ; @{ print $0 @}' BBS-list

2940

@end example

2941

2942

@noindent

2943

changes the value of @code{RS} to @code{"/"}, before reading any input.

2944

This is a string whose first character is a slash; as a result, records

2945

are separated by slashes. Then the input file is read, and the second

2946

rule in the @code{awk} program (the action with no pattern) prints each

2947

record. Since each @code{print} statement adds a newline at the end of

2948

its output, the effect of this @code{awk} program is to copy the input

2949

with each slash changed to a newline. Here are the results of running

2950

the program on @file{BBS-list}:

2951

2952

@example

2953

@group

2954

$ awk 'BEGIN @{ RS = "/" @} ; @{ print $0 @}' BBS-list
@print{} aardvark 555-5553 1200
@print{} 300 B
@print{} alpo-net 555-3412 2400
@print{} 1200
@print{} 300 A
@print{} barfly 555-7685 1200
@print{} 300 A
@print{} bites 555-1675 2400
@print{} 1200
@print{} 300 A
@print{} camelot 555-0542 300 C
@print{} core 555-2912 1200
@print{} 300 C
@print{} fooey 555-1234 2400
@print{} 1200
@print{} 300 B
@print{} foot 555-6699 1200
@print{} 300 B
@print{} macfoo 555-6480 1200
@print{} 300 A
@print{} sdace 555-3430 2400
@print{} 1200
@print{} 300 A
@print{} sabafoo 555-2127 1200
@print{} 300 C
@print{}

2981

@end group

2982

@end example

2983

2984

@noindent

2985

Note that the entry for the @samp{camelot} BBS is not split.

2986

In the original data file

2987

(@pxref{Sample Data Files, , Data Files for the Examples}),

2988

the line looks like this:

2989

2990

@example

2991

camelot 555-0542 300 C

2992

@end example

2993

2994

@noindent

2995

It only has one baud rate; there are no slashes in the record.

2996

2997

Another way to change the record separator is on the command line,

2998

using the variable-assignment feature

2999

(@pxref{Other Arguments, ,Other Command Line Arguments}).

3000

3001

@example

3002

awk '@{ print $0 @}' RS="/" BBS-list

3003

@end example

3004

3005

@noindent

3006

This sets @code{RS} to @samp{/} before processing @file{BBS-list}.

3007

3008

Using an unusual character such as @samp{/} for the record separator

3009

produces correct behavior in the vast majority of cases. However,

3010

the following (extreme) pipeline prints a surprising @samp{1}. There

3011

is one field, consisting of a newline. The value of the built-in

3012

variable @code{NF} is the number of fields in the current record.

3013

3014

@example

3015

$ echo | awk 'BEGIN @{ RS = "a" @} ; @{ print NF @}'

3016

@print{} 1

3017

@end example

3018

3019

@cindex dark corner

3020

@noindent

3021

Reaching the end of an input file terminates the current input record,

3022

even if the last character in the file is not the character in @code{RS}

3023

(d.c.).
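
A minimal sketch of this behavior, using made-up input that ends
without a trailing record separator (and without a trailing newline):

```shell
# The text after the last slash is still delivered as a record,
# even though no "/" follows it.
out=$(printf 'a/b/c' | awk 'BEGIN { RS = "/" } { print NR ": " $0 }')
printf '%s\n' "$out"
```

Three records are reported, the last one being @samp{c}.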

3024

3025

@cindex empty string

3026

The empty string, @code{""} (a string of no characters), has a special meaning

3027

as the value of @code{RS}: it means that records are separated

3028

by one or more blank lines, and nothing else.

3029

@xref{Multiple Line, ,Multiple-Line Records}, for more details.
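
As a small illustration (the sample input is invented), the sketch
below sets @code{RS} to the empty string and counts the lines that end
up in each record. The @code{split} call and the array name
@code{lines} are only there to do the counting.

```shell
# Two groups of lines separated by blank lines; each group is one record.
out=$(printf 'a\nb\n\n\nc\n' | awk 'BEGIN { RS = "" }
    { n = split($0, lines, "\n")
      print "record " NR " has " n " line(s)" }')
printf '%s\n' "$out"
```

The two consecutive blank lines count as a single separator, so only
two records are read.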

3030

3031

If you change the value of @code{RS} in the middle of an @code{awk} run,

3032

the new value is used to delimit subsequent records, but the record
currently being processed, as well as records already processed, is not
affected.

3035

3036

@vindex RT

3037

@cindex record terminator, @code{RT}

3038

@cindex terminator, record

3039

@cindex differences between @code{gawk} and @code{awk}

3040

After the end of the record has been determined, @code{gawk}

3041

sets the variable @code{RT} to the text in the input that matched

3042

@code{RS}.

3043

3044

@cindex regular expressions as record separators

3045

The value of @code{RS} is in fact not limited to a one-character

3046

string. It can be any regular expression

3047

(@pxref{Regexp, ,Regular Expressions}).

3048

In general, each record

3049

ends at the next string that matches the regular expression; the next

3050

record starts at the end of the matching string. This general rule is

3051

actually at work in the usual case, where @code{RS} contains just a

3052

newline: a record ends at the beginning of the next matching string (the

3053

next newline in the input) and the following record starts just after

3054

the end of this string (at the first character of the following line).

3055

The newline, since it matches @code{RS}, is not part of either record.

3056

3057

When @code{RS} is a single character, @code{RT} will

3058

contain the same single character. However, when @code{RS} is a

3059

regular expression, then @code{RT} becomes more useful; it contains

3060

the actual input text that matched the regular expression.

3061

3062

The following example illustrates both of these features.

3063

It sets @code{RS} equal to a regular expression that

3064

matches either a newline, or a series of one or more upper-case letters

3065

with optional leading and/or trailing white space

3066

(@pxref{Regexp, , Regular Expressions}).

3067

3068

@example

3069

$ echo record 1 AAAA record 2 BBBB record 3 |
> gawk 'BEGIN @{ RS = "\n|( *[[:upper:]]+ *)" @}
>             @{ print "Record =", $0, "and RT =", RT @}'
@print{} Record = record 1 and RT =  AAAA
@print{} Record = record 2 and RT =  BBBB
@print{} Record = record 3 and RT =
@print{}

3076

@end example

3077

3078

@noindent

3079

The output ends with an extra blank line. This is because the final
value of @code{RT} is a newline, and the @code{print} statement then
supplies its own terminating newline.

3082

3083

@xref{Simple Sed, ,A Simple Stream Editor}, for a more useful example

3084

of @code{RS} as a regexp and @code{RT}.

3085

3086

@cindex differences between @code{gawk} and @code{awk}

3087

The use of @code{RS} as a regular expression and the @code{RT}

3088

variable are @code{gawk} extensions; they are not available in

3089

compatibility mode

3090

(@pxref{Options, ,Command Line Options}).

3091

In compatibility mode, only the first character of the value of

3092

@code{RS} is used to determine the end of the record.

3093

3094

@cindex number of records, @code{NR}, @code{FNR}

3095

@vindex NR

3096

@vindex FNR

3097

The @code{awk} utility keeps track of the number of records that have

3098

been read so far from the current input file. This value is stored in a

3099

built-in variable called @code{FNR}. It is reset to zero when a new

3100

file is started. Another built-in variable, @code{NR}, is the total

3101

number of input records read so far from all data files. It starts at zero

3102

but is never automatically reset to zero.
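
The difference between the two counters can be seen with a short
sketch. The file names @file{one} and @file{two} are hypothetical
scratch files created just for this demonstration.

```shell
# Create two small input files in a temporary directory.
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/one"
printf 'c\n'    > "$dir/two"
# Print the current file name, the per-file count, and the overall count.
out=$(cd "$dir" && awk '{ print FILENAME, FNR, NR }' one two)
rm -rf "$dir"
printf '%s\n' "$out"
```

@code{FNR} restarts at one for the second file, while @code{NR} keeps
counting.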

3103

3104

@node Fields, Non-Constant Fields, Records, Reading Files

3105

@section Examining Fields

3106

3107

@cindex examining fields

3108

@cindex fields

3109

@cindex accessing fields

3110

When @code{awk} reads an input record, the record is

3111

automatically separated or @dfn{parsed} by the interpreter into chunks

3112

called @dfn{fields}. By default, fields are separated by whitespace,

3113

like words in a line.

3114

Whitespace in @code{awk} means any string of one or more spaces and/or

3115

tabs; other characters such as newline, formfeed, and so on, that are

3116

considered whitespace by other languages are @emph{not} considered

3117

whitespace by @code{awk}.

3118

3119

The purpose of fields is to make it more convenient for you to refer to

3120

these pieces of the record. You don't have to use them---you can

3121

operate on the whole record if you wish---but fields are what make

3122

simple @code{awk} programs so powerful.

3123

3124

@cindex @code{$} (field operator)

3125

@cindex field operator @code{$}

3126

To refer to a field in an @code{awk} program, you use a dollar-sign,

3127

@samp{$}, followed by the number of the field you want. Thus, @code{$1}

3128

refers to the first field, @code{$2} to the second, and so on. For

3129

example, suppose the following is a line of input:

3130

3131

@example

3132

This seems like a pretty nice example.

3133

@end example

3134

3135

@noindent

3136

Here the first field, or @code{$1}, is @samp{This}; the second field, or

3137

@code{$2}, is @samp{seems}; and so on. Note that the last field,

3138

@code{$7}, is @samp{example.}. Because there is no space between the

3139

@samp{e} and the @samp{.}, the period is considered part of the seventh

3140

field.

3141

3142

@vindex NF

3143

@cindex number of fields, @code{NF}

3144

@code{NF} is a built-in variable whose value

3145

is the number of fields in the current record.

3146

@code{awk} updates the value of @code{NF} automatically, each time

3147

a record is read.

3148

3149

No matter how many fields there are, the last field in a record can be

3150

represented by @code{$NF}. So, in the example above, @code{$NF} would

3151

be the same as @code{$7}, which is @samp{example.}. Why this works is

3152

explained below (@pxref{Non-Constant Fields, ,Non-constant Field Numbers}).

3153

If you try to reference a field beyond the last one, such as @code{$8}

3154

when the record has only seven fields, you get the empty string.

3155

@c the empty string acts like 0 in some contexts, but I don't want to

3156

@c get into that here....

3157

3158

@code{$0}, which looks like a reference to the ``zeroth'' field, is

3159

a special case: it represents the whole input record. @code{$0} is

3160

used when you are not interested in fields.
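
The points above can be tried on the sample line in one short sketch.
The square brackets around @code{$8} are added only to make the empty
string visible.

```shell
out=$(echo 'This seems like a pretty nice example.' | awk '{
    print NF             # number of fields
    print $NF            # last field
    print "[" $8 "]"     # beyond the last field: empty string
    print $0             # the whole record
}')
printf '%s\n' "$out"
```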

3161

3162

Here are some more examples:

3163

3164

@example

3165

@group

3166

$ awk '$1 ~ /foo/ @{ print $0 @}' BBS-list
@print{} fooey 555-1234 2400/1200/300 B
@print{} foot 555-6699 1200/300 B
@print{} macfoo 555-6480 1200/300 A
@print{} sabafoo 555-2127 1200/300 C

3171

@end group

3172

@end example

3173

3174

@noindent

3175

This example prints each record in the file @file{BBS-list} whose first

3176

field contains the string @samp{foo}. The operator @samp{~} is called a

3177

@dfn{matching operator}

3178

(@pxref{Regexp Usage, , How to Use Regular Expressions});

3179

it tests whether a string (here, the field @code{$1}) matches a given regular

3180

expression.

3181

3182

By contrast, the following example

3183

looks for @samp{foo} in @emph{the entire record} and prints the first

3184

field and the last field for each input record containing a

3185

match.

3186

3187

@example

3188

@group

3189

$ awk '/foo/ @{ print $1, $NF @}' BBS-list
@print{} fooey B
@print{} foot B
@print{} macfoo A
@print{} sabafoo C

3194

@end group

3195

@end example

3196

3197

@node Non-Constant Fields, Changing Fields, Fields, Reading Files

3198

@section Non-constant Field Numbers

3199

3200

The number of a field does not need to be a constant. Any expression in

3201

the @code{awk} language can be used after a @samp{$} to refer to a

3202

field. The value of the expression specifies the field number. If the

3203

value is a string, rather than a number, it is converted to a number.

3204

Consider this example:

3205

3206

@example

3207

awk '@{ print $NR @}'

3208

@end example

3209

3210

@noindent

3211

Recall that @code{NR} is the number of records read so far: one in the

3212

first record, two in the second, etc. So this example prints the first

3213

field of the first record, the second field of the second record, and so

3214

on. For the twentieth record, field number 20 is printed; most likely,

3215

the record has fewer than 20 fields, so this prints a blank line.
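
To see this on concrete data, here is a sketch with a made-up
three-line input in which every record has three fields:

```shell
# Record 1 prints field 1, record 2 prints field 2, and so on.
out=$(printf 'a b c\nd e f\ng h i\n' | awk '{ print $NR }')
printf '%s\n' "$out"
```

The output is the ``diagonal'' of the input: @samp{a}, @samp{e}, and
@samp{i}.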

3216

3217

Here is another example of using expressions as field numbers:

3218

3219

@example

3220

awk '@{ print $(2*2) @}' BBS-list

3221

@end example

3222

3223

@code{awk} must evaluate the expression @samp{(2*2)} and use

3224

its value as the number of the field to print. The @samp{*} sign

3225

represents multiplication, so the expression @samp{2*2} evaluates to four.

3226

The parentheses are used so that the multiplication is done before the

3227

@samp{$} operation; they are necessary whenever there is a binary

3228

operator in the field-number expression. This example, then, prints the

3229

hours of operation (the fourth field) for every line of the file

3230

@file{BBS-list}. (All of the @code{awk} operators are listed, in

3231

order of decreasing precedence, in

3232

@ref{Precedence, , Operator Precedence (How Operators Nest)}.)
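
The effect of the parentheses can be checked with a short sketch; the
input line of four numbers is made up for the purpose.

```shell
# $(2*2) is field four; $2*2 is field two multiplied by two,
# because "$" binds more tightly than "*".
out=$(echo '5 6 7 8' | awk '{ print $(2*2); print $2*2 }')
printf '%s\n' "$out"
```

The first value printed is the fourth field, 8; the second is 6 times
2, or 12.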

3233

3234

If the field number you compute is zero, you get the entire record.

3235

Thus, @code{$(2-2)} has the same value as @code{$0}. Negative field

3236

numbers are not allowed; trying to reference one will usually terminate

3237

your running @code{awk} program. (The POSIX standard does not define

3238

what happens when you reference a negative field number. @code{gawk}

3239

will notice this and terminate your program. Other @code{awk}

3240

implementations may behave differently.)

3241

3242

As mentioned in @ref{Fields, ,Examining Fields},

3243

the number of fields in the current record is stored in the built-in

3244

variable @code{NF} (also @pxref{Built-in Variables}). The expression

3245

@code{$NF} is not a special feature: it is the direct consequence of

3246

evaluating @code{NF} and using its value as a field number.
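
Since any expression may follow the @samp{$}, arithmetic on @code{NF}
works as well. The following sketch, on made-up input, prints the
next-to-last field and then the whole record by way of a computed
field number of zero.

```shell
out=$(echo 'a b c d' | awk '{ print $(NF-1); print $(2-2) }')
printf '%s\n' "$out"
```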

3247

3248

@node Changing Fields, Field Separators, Non-Constant Fields, Reading Files

3249

@section Changing the Contents of a Field

3250

3251

@cindex field, changing contents of

3252

@cindex changing contents of a field

3253

@cindex assignment to fields

3254

You can change the contents of a field as seen by @code{awk} within an

3255

@code{awk} program; this changes what @code{awk} perceives as the

3256

current input record. (The actual input is untouched; @code{awk} @emph{never}

3257

modifies the input file.)

3258

3259

Consider this example and its output:

3260

3261

@example

3262

@group

3263

$ awk '@{ $3 = $2 - 10; print $2, $3 @}' inventory-shipped
@print{} 13 3
@print{} 15 5
@print{} 15 5
@dots{}

3268

@end group

3269

@end example

3270

3271

@noindent

3272

The @samp{-} sign represents subtraction, so this program reassigns

3273

field three, @code{$3}, to be the value of field two minus ten,

3274

@samp{$2 - 10}. (@xref{Arithmetic Ops, ,Arithmetic Operators}.)

3275

Then field two, and the new value for field three, are printed.

3276

3277

In order for this to work, the text in field @code{$2} must make sense

3278

as a number; the string of characters must be converted to a number in

3279

order for the computer to do arithmetic on it. The number resulting

3280

from the subtraction is converted back to a string of characters which

3281

then becomes field three.

3282

@xref{Conversion, ,Conversion of Strings and Numbers}.

3283

3284

When you change the value of a field (as perceived by @code{awk}), the

3285

text of the input record is recalculated to contain the new field where

3286

the old one was. Therefore, @code{$0} changes to reflect the altered

3287

field. Thus, this program

3288

prints a copy of the input file, with 10 subtracted from the second

3289

field of each line.

3290

3291

@example

3292

@group

3293

$ awk '@{ $2 = $2 - 10; print $0 @}' inventory-shipped
@print{} Jan 3 25 15 115
@print{} Feb 5 32 24 226
@print{} Mar 5 24 34 228
@dots{}

3298

@end group

3299

@end example

3300

3301

You can also assign contents to fields that are out of range. For

3302

example:

3303

3304

@example

3305

$ awk '@{ $6 = ($5 + $4 + $3 + $2)
>        print $6 @}' inventory-shipped
@print{} 168
@print{} 297
@print{} 301
@dots{}

3311

@end example

3312

3313

@noindent

3314

We've just created @code{$6}, whose value is the sum of fields

3315

@code{$2}, @code{$3}, @code{$4}, and @code{$5}. The @samp{+} sign

3316

represents addition. For the file @file{inventory-shipped}, @code{$6}

3317

represents the total number of parcels shipped for a particular month.

3318

3319

Creating a new field changes @code{awk}'s internal copy of the current

3320

input record---the value of @code{$0}. Thus, if you do @samp{print $0}

3321

after adding a field, the record printed includes the new field, with

3322

the appropriate number of field separators between it and the previously

3323

existing fields.

3324

3325

This recomputation affects and is affected by

3326

@code{NF} (the number of fields; @pxref{Fields, ,Examining Fields}),

3327

and by a feature that has not been discussed yet,

3328

the @dfn{output field separator}, @code{OFS},

3329

which is used to separate the fields (@pxref{Output Separators}).

3330

For example, the value of @code{NF} is set to the number of the highest

3331

field you create.

3332

3333

Note, however, that merely @emph{referencing} an out-of-range field

3334

does @emph{not} change the value of either @code{$0} or @code{NF}.

3335

Referencing an out-of-range field only produces an empty string. For

3336

example:

3337

3338

@example

3339

if ($(NF+1) != "")
    print "can't happen"
else
    print "everything is normal"

3343

@end example

3344

3345

@noindent

3346

should print @samp{everything is normal}, because @code{NF+1} is certain

3347

to be out of range. (@xref{If Statement, ,The @code{if}-@code{else} Statement},

3348

for more information about @code{awk}'s @code{if-else} statements.

3349

@xref{Typing and Comparison, ,Variable Typing and Comparison Expressions}, for more information

3350

about the @samp{!=} operator.)

3351

3352

It is important to note that making an assignment to an existing field

3353

will change the

3354

value of @code{$0}, but will not change the value of @code{NF},

3355

even when you assign the empty string to a field. For example:

3356

3357

@example

3358

@group

3359

$ echo a b c d | awk '@{ OFS = ":"; $2 = ""
>                        print $0; print NF @}'
@print{} a::c:d
@print{} 4

3363

@end group

3364

@end example

3365

3366

@noindent

3367

The field is still there; it just has an empty value. You can tell

3368

because there are two colons in a row.

3369

3370

This example shows what happens if you create a new field.

3371

3372

@example

3373

$ echo a b c d | awk '@{ OFS = ":"; $2 = ""; $6 = "new"
>                        print $0; print NF @}'
@print{} a::c:d::new
@print{} 6

3377

@end example

3378

3379

@noindent

3380

The intervening field, @code{$5}, is created with an empty value

3381

(indicated by the second pair of adjacent colons),

3382

and @code{NF} is updated with the value six.

3383

3384

@node Field Separators, Constant Size, Changing Fields, Reading Files

3385

@section Specifying How Fields are Separated

3386

3387

This section is rather long; it describes one of the most fundamental

3388

operations in @code{awk}.

3389

3390

@menu

3391

* Basic Field Splitting::        How fields are split with single characters
                                 or simple strings.
* Regexp Field Splitting::       Using regexps as the field separator.
* Single Character Fields::      Making each character a separate field.
* Command Line Field Separator:: Setting @code{FS} from the command line.
* Field Splitting Summary::      Some final points and a summary table.

3397

@end menu

3398

3399

@node Basic Field Splitting, Regexp Field Splitting, Field Separators, Field Separators

3400

@subsection The Basics of Field Separating

3401

@vindex FS

3402

@cindex fields, separating

3403

@cindex field separator, @code{FS}

3404

3405

The @dfn{field separator}, which is either a single character or a regular

3406

expression, controls the way @code{awk} splits an input record into fields.

3407

@code{awk} scans the input record for character sequences that

3408

match the separator; the fields themselves are the text between the matches.

3409

3410

In the examples below, we use the bullet symbol ``@bullet{}'' to represent

3411

spaces in the output.

3412

3413

If the field separator is @samp{oo}, then the following line:

3414

3415

@example

3416

moo goo gai pan

3417

@end example

3418

3419

@noindent

3420

would be split into three fields: @samp{m}, @samp{@bullet{}g} and

3421

@samp{@bullet{}gai@bullet{}pan}.

3422

Note the leading spaces in the values of the second and third fields.
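
The split described above can be reproduced with the sketch below. The
loop and the square brackets are additions for illustration; they make
the leading spaces in each field visible.

```shell
out=$(echo 'moo goo gai pan' | awk 'BEGIN { FS = "oo" }
    { for (i = 1; i <= NF; i++) print i ": [" $i "]" }')
printf '%s\n' "$out"
```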

3423

3424

@cindex common mistakes

3425

@cindex mistakes, common

3426

@cindex errors, common

3427

The field separator is represented by the built-in variable @code{FS}.

3428

Shell programmers take note! @code{awk} does @emph{not} use the name @code{IFS},
which is used by the POSIX-compatible shells (such as the Bourne shell,
@code{sh}, or the GNU Bourne-Again Shell, Bash).

3431

3432

You can change the value of @code{FS} in the @code{awk} program with the

3433

assignment operator, @samp{=} (@pxref{Assignment Ops, ,Assignment Expressions}).

3434

Often the right time to do this is at the beginning of execution,

3435

before any input has been processed, so that the very first record

3436

will be read with the proper separator. To do this, use the special

3437

@code{BEGIN} pattern

3438

(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}).

3439

For example, here we set the value of @code{FS} to the string

3440

@code{","}:

3441

3442

@example

3443

awk 'BEGIN @{ FS = "," @} ; @{ print $2 @}'

3444

@end example

3445

3446

@noindent

3447

Given the input line,

3448

3449

@example

3450

John Q. Smith, 29 Oak St., Walamazoo, MI 42139

3451

@end example

3452

3453

@noindent

3454

this @code{awk} program extracts and prints the string

3455

@samp{@bullet{}29@bullet{}Oak@bullet{}St.}.
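
Run as a complete pipeline, the same program and input line look like
the sketch below. The square brackets are added here only so that the
leading space in the second field can be seen.

```shell
out=$(echo 'John Q. Smith, 29 Oak St., Walamazoo, MI 42139' |
      awk 'BEGIN { FS = "," } { print "[" $2 "]" }')
printf '%s\n' "$out"
```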

3456

3457

@cindex field separator, choice of

3458

@cindex regular expressions as field separators

3459

Sometimes your input data will contain separator characters that don't

3460

separate fields the way you thought they would. For instance, the

3461

person's name in the example we just used might have a title or

3462

suffix attached, such as @samp{John Q. Smith, LXIX}. From input

3463

containing such a name:

3464

3465

@example

3466

John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139

3467

@end example

3468

3469

@noindent

3470

@c careful of an overfull hbox here!

3471

the above program would extract @samp{@bullet{}LXIX}, instead of

3472

@samp{@bullet{}29@bullet{}Oak@bullet{}St.}.

3473

If you were expecting the program to print the

3474

address, you would be surprised. The moral is: choose your data layout and

3475

separator characters carefully to prevent such problems.

3476

3477

@iftex

3478

As you know, normally,

3479

@end iftex

3480

@ifinfo

3481

Normally,

3482

@end ifinfo

3483

fields are separated by whitespace sequences

3484

(spaces and tabs), not by single spaces: two spaces in a row do not

3485

delimit an empty field. The default value of the field separator @code{FS}

3486

is a string containing a single space, @w{@code{" "}}. If this value were

3487

interpreted in the usual way, each space character would separate

3488

fields, so two spaces in a row would make an empty field between them.

3489

The reason this does not happen is that a single space as the value of

3490

@code{FS} is a special case: it is taken to specify the default manner

3491

of delimiting fields.

3492

3493

If @code{FS} is any other single character, such as @code{","}, then

3494

each occurrence of that character separates two fields. Two consecutive

3495

occurrences delimit an empty field. If the character occurs at the

3496

beginning or the end of the line, that too delimits an empty field. The

3497

space character is the only single character which does not follow these

3498

rules.
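
A short sketch contrasting the two cases, on made-up input. The first
line has a comma at each end and two commas in a row; the second has
leading, trailing, and doubled spaces.

```shell
# With FS = ",", every comma separates two fields, empty or not.
commas=$(echo ',a,,b,' | awk 'BEGIN { FS = "," } { print NF }')
# With the default FS, runs of spaces act as one separator and
# leading and trailing spaces are ignored.
spaces=$(echo ' a  b ' | awk '{ print NF }')
printf 'comma-separated: %s fields\n' "$commas"
printf 'default FS:      %s fields\n' "$spaces"
```

The comma-separated line yields five fields, three of them empty,
while the space-separated line yields only two.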

3499

3500

@node Regexp Field Splitting, Single Character Fields, Basic Field Splitting, Field Separators

3501

@subsection Using Regular Expressions to Separate Fields

3502

3503

The previous

3504

@iftex

3505

subsection

3506

@end iftex

3507

@ifinfo

3508

node

3509

@end ifinfo

3510

discussed the use of single characters or simple strings as the

3511

value of @code{FS}.

3512

More generally, the value of @code{FS} may be a string containing any

3513

regular expression. In this case, each match in the record for the regular

3514

expression separates fields. For example, the assignment:

3515

3516

@example

3517

FS = ", \t"

3518

@end example

3519

3520

@noindent

3521

makes every area of an input line that consists of a comma followed by a

3522

space and a tab, into a field separator. (@samp{\t}

3523

is an @dfn{escape sequence} that stands for a tab;

3524

@pxref{Escape Sequences},

3525

for the complete list of similar escape sequences.)

3526

3527

For a less trivial example of a regular expression, suppose you want

3528

single spaces to separate fields the way single commas were used above.

3529

You can set @code{FS} to @w{@code{"[@ ]"}} (left bracket, space, right

3530

bracket). This regular expression matches a single space and nothing else

3531

(@pxref{Regexp, ,Regular Expressions}).
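
As a brief sketch of the effect, on made-up input containing two
spaces in a row:

```shell
# Each single space now separates two fields, so the doubled
# space delimits an empty field between "a" and "b".
out=$(echo 'a  b' | awk 'BEGIN { FS = "[ ]" } { print NF }')
printf '%s\n' "$out"
```

This reports three fields, where the default @code{FS} would find two.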

3532

3533

There is an important difference between the two cases of @samp{FS = @w{" "}}

3534

(a single space) and @samp{FS = @w{"[ \t]+"}} (left bracket, space, backslash,
``t'', right bracket, plus sign, which is a regular expression
matching one or more spaces or tabs). For both values of @code{FS}, fields

3537

are separated by runs of spaces and/or tabs. However, when the value of

3538

@code{FS} is @w{@code{" "}}, @code{awk} will first strip leading and trailing

3539

whitespace from the record, and then decide where the fields are.

3540

3541

For example, the following pipeline prints @samp{b}:

3542

3543

@example

3544

$ echo ' a b c d ' | awk '@{ print $2 @}'

3545

@print{} b

3546

@end example

3547

3548

@noindent

3549

However, this pipeline prints @samp{a} (note the extra spaces around

3550

each letter):

3551

3552

@example

3553

$ echo ' a  b  c  d ' | awk 'BEGIN @{ FS = "[ \t]+" @}
>                                  @{ print $2 @}'
@print{} a

3556

@end example

3557

3558

@noindent

3559

@cindex null string

3560

@cindex empty string

3561

In this case, the first field is @dfn{null}, or empty.

3562

3563

The stripping of leading and trailing whitespace also comes into

3564

play whenever @code{$0} is recomputed. For instance, study this pipeline:

3565

3566

@example

3567

$ echo '   a b c d' | awk '@{ print; $2 = $2; print @}'
@print{}    a b c d
@print{} a b c d

3570

@end example

3571

3572

@noindent

3573

The first @code{print} statement prints the record as it was read,

3574

with leading whitespace intact. The assignment to @code{$2} rebuilds

3575

@code{$0} by concatenating @code{$1} through @code{$NF} together,

3576

separated by the value of @code{OFS}. Since the leading whitespace

3577

was ignored when finding @code{$1}, it is not part of the new @code{$0}.

3578

Finally, the last @code{print} statement prints the new @code{$0}.
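
The same rebuilding can be put to use deliberately. In the sketch
below, on made-up input, assigning a field to itself forces @code{$0}
to be rebuilt with a new value of @code{OFS}.

```shell
# $1 = $1 changes no field value, but it makes awk rebuild $0
# by joining the fields with OFS.
out=$(echo 'a b c' | awk 'BEGIN { OFS = "-" } { $1 = $1; print }')
printf '%s\n' "$out"
```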

3579

3580

@node Single Character Fields, Command Line Field Separator, Regexp Field Splitting, Field Separators

3581

@subsection Making Each Character a Separate Field

3582

3583

@cindex differences between @code{gawk} and @code{awk}

3584

@cindex single character fields

3585

There are times when you may want to examine each character

3586

of a record separately. In @code{gawk}, this is easy to do: you

3587

simply assign the null string (@code{""}) to @code{FS}. In this case,

3588

each individual character in the record will become a separate field.

3589

Here is an example:

3590

@c extra verbiage due to page boundaries

3591

3592

@example

3593

echo a b | gawk 'BEGIN @{ FS = "" @}
@{
    for (i = 1; i <= NF; i = i + 1)
        print "Field", i, "is", $i
@}'

3598

@end example

3599

3600

@noindent

3601

The output from this is:

3602

3603

@example

3604

Field 1 is a
Field 2 is
Field 3 is b

3607

@end example

3608

3609

@cindex dark corner

3610

Traditionally, the behavior for @code{FS} equal to @code{""} was not defined.

3611

In this case, Unix @code{awk} would simply treat the entire record

3612

as only having one field (d.c.). In compatibility mode

3613

(@pxref{Options, ,Command Line Options}),

3614

if @code{FS} is the null string, then @code{gawk} will also

3615

behave this way.
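Because a null @code{FS} was traditionally undefined, a program that must also run under other versions of @code{awk} may prefer not to rely on it. The following is one possible sketch, not the only approach: it walks the record with @code{substr} instead. The brackets in the output and the shell variable name @code{chars} are only illustrative.

```shell
chars=$(echo 'a b' | awk '{
    # Walk the record one character at a time with substr().
    for (i = 1; i <= length($0); i++)
        print "Character", i, "is", "[" substr($0, i, 1) "]"
}')
printf '%s\n' "$chars"
```

The brackets make the space in the second position visible; without them that line would appear to end after @samp{is}.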

3616

3617

@node Command Line Field Separator, Field Splitting Summary, Single Character Fields, Field Separators

3618

@subsection Setting @code{FS} from the Command Line

3619

@cindex @code{-F} option

3620

@cindex field separator, on command line

3621

@cindex command line, setting @code{FS} on

3622

3623

@code{FS} can be set on the command line. You use the @samp{-F} option to

3624

do so. For example:

3625

3626

@example

3627

awk -F, '@var{program}' @var{input-files}

3628

@end example

3629

3630

@noindent

3631

sets @code{FS} to be the @samp{,} character. Notice that the option uses

3632

a capital @samp{F}. Contrast this with @samp{-f}, which specifies a file

3633

containing an @code{awk} program. Case is significant in command line options:

3634

the @samp{-F} and @samp{-f} options have nothing to do with each other.

3635

You can use both options at the same time to set the @code{FS} variable

3636

@emph{and} get an @code{awk} program from a file.

3637

3638

The value used for the argument to @samp{-F} is processed in exactly the

3639

same way as assignments to the built-in variable @code{FS}. This means that

3640

if the field separator contains special characters, they must be escaped

3641

appropriately. For example, to use a @samp{\} as the field separator, you

3642

would have to type:

3643

3644

@example

3645

# same as FS = "\\"

3646

awk -F\\\\ '@dots{}' files @dots{}

3647

@end example

3648

3649

@noindent

3650

Since @samp{\} is used for quoting in the shell, @code{awk} will see

3651

@samp{-F\\}. Then @code{awk} processes the @samp{\\} for escape

3652

characters (@pxref{Escape Sequences}), finally yielding

3653

a single @samp{\} to be used for the field separator.

3654

3655

@cindex historical features

3656

As a special case, in compatibility mode

3657

(@pxref{Options, ,Command Line Options}), if the

3658

argument to @samp{-F} is @samp{t}, then @code{FS} is set to the tab

3659

character. This is because if you type @samp{-F\t} at the shell,

3660

without any quotes, the @samp{\} gets deleted, so @code{awk} figures that you

3661

really want your fields to be separated with tabs, and not @samp{t}s.

3662

Use @samp{-v FS="t"} on the command line if you really do want to separate

3663

your fields with @samp{t}s

3664

(@pxref{Options, ,Command Line Options}).

3665

3666

For example, let's use an @code{awk} program file called @file{baud.awk}

3667

that contains the pattern @code{/300/}, and the action @samp{print $1}.

3668

Here is the program:

3669

3670

@example

3671

/300/ @{ print $1 @}

3672

@end example

3673

3674

Let's also set @code{FS} to be the @samp{-} character, and run the

3675

program on the file @file{BBS-list}. The following command prints a

3676

list of the names of the bulletin boards that operate at 300 baud and

3677

the first three digits of their phone numbers:

3678

3679

@c tweaked to make the tex output look better in @smallbook

3680

@example

3681

@group

3682

$ awk -F- -f baud.awk BBS-list

3683

@print{} aardvark 555

3684

@print{} alpo

3685

@print{} barfly 555

3686

@dots{}

3687

@end group

3688

@ignore

3689

@print{} bites 555

3690

@print{} camelot 555

3691

@print{} core 555

3692

@print{} fooey 555

3693

@print{} foot 555

3694

@print{} macfoo 555

3695

@print{} sdace 555

3696

@print{} sabafoo 555

3697

@end ignore

3698

@end example

3699

3700

@noindent

3701

Note the second line of output. In the original file

3702

(@pxref{Sample Data Files, ,Data Files for the Examples}),

3703

the second line looked like this:

3704

3705

@example

3706

alpo-net 555-3412 2400/1200/300 A

3707

@end example

3708

3709

The @samp{-} as part of the system's name was used as the field

3710

separator, instead of the @samp{-} in the phone number that was

3711

originally intended. This demonstrates why you have to be careful in

3712

choosing your field and record separators.
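The pitfall can be reproduced with just that one line of data. The sketch below first splits the whole record on @samp{-}, and then shows one possible alternative: split on whitespace, and apply @samp{-} only to the phone number with @code{split}. The names @code{naive}, @code{careful}, and @code{phone} are hypothetical.

```shell
line='alpo-net 555-3412 2400/1200/300 A'

# Splitting the whole record on "-" cuts the system name in two.
naive=$(echo "$line" | awk -F- '{ print $1 }')

# Splitting on whitespace first, then splitting only the phone
# number on "-", keeps the name intact.
careful=$(echo "$line" | awk '{ split($2, phone, "-"); print $1, phone[1] }')

printf '%s\n%s\n' "$naive" "$careful"
```

The first command prints @samp{alpo}; the second prints @samp{alpo-net 555}.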

3713

3714

On many Unix systems, each user has a separate entry in the system password

3715

file, one line per user. The information in these lines is separated

3716

by colons. The first field is the user's logon name, and the second is

3717

the user's encrypted password. A password file entry might look like this:

3718

3719

@example

3720

arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh

3721

@end example

3722

3723

The following program searches the system password file, and prints

3724

the entries for users who have no password:

3725

3726

@example

3727

awk -F: '$2 == ""' /etc/passwd

3728

@end example
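To try the idea without reading the real password file, the following sketch feeds two entries through a pipe. The @samp{guest} entry is made up for this illustration, and the action prints only the logon name rather than the whole entry.

```shell
# Two sample entries; the second (invented) one has an empty password field.
nopass=$(printf '%s\n' \
    'arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh' \
    'guest::2077:10:Guest Account:/home/guest:/bin/sh' |
    awk -F: '$2 == "" { print $1 }')
printf '%s\n' "$nopass"
```

With this input, only @samp{guest} is printed.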

3729

3730

@node Field Splitting Summary, , Command Line Field Separator, Field Separators

3731

@subsection Field Splitting Summary

3732

3733

@cindex @code{awk} language, POSIX version

3734

@cindex POSIX @code{awk}

3735

According to the POSIX standard, @code{awk} is supposed to behave

3736

as if each record is split into fields at the time that it is read.

3737

In particular, this means that you can change the value of @code{FS}

3738

after a record is read, and the value of the fields (i.e.@: how they were split)

3739

should reflect the old value of @code{FS}, not the new one.

3740

3741

@cindex dark corner

3742

@cindex @code{sed} utility

3743

@cindex stream editor

3744

However, many implementations of @code{awk} do not work this way. Instead,

3745

they defer splitting the fields until a field is actually

3746

referenced. The fields will be split

3747

using the @emph{current} value of @code{FS}! (d.c.)

3748

This behavior can be difficult

3749

to diagnose. The following example illustrates the difference

3750

between the two methods.

3751

(The @code{sed}@footnote{The @code{sed} utility is a ``stream editor.''

3752

Its behavior is also defined by the POSIX standard.}

3753

command prints just the first line of @file{/etc/passwd}.)

3754

3755

@example

3756

sed 1q /etc/passwd | awk '@{ FS = ":" ; print $1 @}'

3757

@end example

3758

3759

@noindent

3760

will usually print

3761

3762

@example

3763

root

3764

@end example

3765

3766

@noindent

3767

on an incorrect implementation of @code{awk}, while @code{gawk}

3768

will print something like

3769

3770

@example

3771

root:nSijPlPhZZwgE:0:0:Root:/:

3772

@end example
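A program that should behave the same way under either kind of implementation can avoid the question by setting @code{FS} before the first record is read. The following is a small sketch of that approach, using the sample line from above as literal input; the shell variable name @code{first} is hypothetical.

```shell
# FS is set in BEGIN, before any record is read, so the first
# record is split on ":" no matter when the implementation
# actually performs field splitting.
first=$(echo 'root:nSijPlPhZZwgE:0:0:Root:/:' |
    awk 'BEGIN { FS = ":" } { print $1 }')
printf '%s\n' "$first"
```

This prints @samp{root}. Using @samp{-F:} on the command line has the same effect.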

3773

3774

The following table summarizes how fields are split, based on the

3775

value of @code{FS}. (@samp{==} means ``is equal to.'')

3776

3777

@c @cartouche

3778

@table @code

3779

@item FS == " "

3780

Fields are separated by runs of whitespace. Leading and trailing

3781

whitespace are ignored. This is the default.

3782

3783

@item FS == @var{any other single character}

3784

Fields are separated by each occurrence of the character. Multiple

3785

successive occurrences delimit empty fields, as do leading and

3786

trailing occurrences.

3787

The character can even be a regexp metacharacter; it does not need

3788

to be escaped.

3789

3790

@item FS == @var{regexp}

3791

Fields are separated by occurrences of characters that match @var{regexp}.

3792

Leading and trailing matches of @var{regexp} delimit empty fields.

3793

3794

@item FS == ""

3795

Each individual character in the record becomes a separate field.

3796

@end table
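The first three rows of the table can be checked with a few one-line commands. This is only a sketch: the input strings, the regexp @samp{[ ]*,[ ]*}, and the shell variable names are chosen for illustration.

```shell
# FS == " " (the default): runs of whitespace separate fields,
# and leading and trailing whitespace is ignored.
default_nf=$(echo '  a   b  ' | awk '{ print NF }')

# FS == a single character: every occurrence counts, so "a::b"
# has an empty second field and ":a" has an empty first field.
colon_nf=$(echo 'a::b' | awk -F: '{ print NF }')
lead_nf=$(echo ':a' | awk -F: '{ print NF }')

# A single regexp metacharacter is taken literally.
pipe_f2=$(echo 'a|b|c' | awk -F'|' '{ print $2 }')

# FS == regexp: a comma with optional spaces around it.
regex_f2=$(echo 'a , b,c' | awk -F'[ ]*,[ ]*' '{ print $2 }')

printf '%s %s %s %s %s\n' "$default_nf" "$colon_nf" "$lead_nf" "$pipe_f2" "$regex_f2"
```

With these inputs the final line printed is @samp{2 3 2 b b}.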

3797

@c @end cartouche

3798

3799

@node Constant Size, Multiple Line, Field Separators, Reading Files

3800

@section Reading Fixed-width Data

3801

3802

(This section discusses an advanced, experimental feature. If you are

3803

a novice @code{awk} user, you may wish to skip it on the first reading.)

3804

3805

@code{gawk} version 2.13 introduced a new facility for dealing with

3806

fixed-width fields with no distinctive field separator. Data of this

3807

nature arises, for example, in the input for old FORTRAN programs where

3808

numbers are run together; or in the output of programs that did not

3809

anticipate the use of their output as input for other programs.

3810

3811

An example of the latter is a table where all the columns are lined up by

3812

the use of a variable number of spaces and @emph{empty fields are just

3813

spaces}. Clearly, @code{awk}'s normal field splitting based on @code{FS}

3814

will not work well in this case. Although a portable @code{awk} program

3815

can use a series of @code{substr} calls on @code{$0}

3816

(@pxref{String Functions, ,Built-in Functions for String Manipulation}),

3817

this is awkward and inefficient for a large number of fields.
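To make the portable approach concrete, here is a sketch that pulls three columns out of one fixed-width line with @code{substr}. The column widths (9, 6, 10, and 6 characters for the first four columns) and the sample values are assumptions made for this illustration; the line is built with @code{printf} so that its spacing is unambiguous.

```shell
# Build one fixed-width line (column widths 9, 6, 10, 6, 7, 7).
wline=$(printf '%-9s%-6s%10s%6s%7s%7s  %s' \
    brent ttyp0 26Jun91 4:46 26:46 4:41 bash)

fields=$(printf '%s\n' "$wline" | awk '{
    user = substr($0, 1, 9)     # columns 1-9
    tty  = substr($0, 10, 6)    # columns 10-15
    idle = substr($0, 26, 6)    # columns 26-31
    # Remove the padding spaces from each extracted column.
    gsub(/ /, "", user); gsub(/ /, "", tty); gsub(/ /, "", idle)
    print user, tty, idle
}')
printf '%s\n' "$fields"
```

This prints @samp{brent ttyp0 4:46}. Every column needs its own starting position and length, which is why the approach becomes tedious as the number of fields grows.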

3818

3819

The splitting of an input record into fixed-width fields is specified by

3820

assigning a string containing space-separated numbers to the built-in

3821

variable @code{FIELDWIDTHS}. Each number specifies the width of the field

3822

@emph{including} columns between fields. If you want to ignore the columns

3823

between fields, you can specify the width as a separate field that is

3824

subsequently ignored.

3825

3826

The following data is the output of the Unix @code{w} utility. It is useful

3827

to illustrate the use of @code{FIELDWIDTHS}.

3828

3829

@example

3830

@group

3831

 10:06pm  up 21 days, 14:04,  23 users
User     tty       login@  idle   JCPU   PCPU  what
hzuo     ttyV0     8:58pm            9      5  vi p24.tex
hzang    ttyV3     6:37pm    50                -csh
eklye    ttyV5     9:53pm            7      1  em thes.tex
dportein ttyV6     8:17pm  1:47                -csh
gierd    ttyD3    10:00pm     1                elm
dave     ttyD4     9:47pm            4      4  w
brent    ttyp0    26Jun91  4:46  26:46   4:41  bash
dave     ttyq4    26Jun9115days     46     46  wnewmail

3841

@end group

3842

@end example

3843

3844

The following program takes the above input, converts the idle time to

3845

number of seconds and prints out the first two fields and the calculated

3846

idle time. (This program uses a number of @code{awk} features that

3847

haven't been introduced yet.)

3848

3849

@example

3850

@group

3851

BEGIN  @{ FIELDWIDTHS = "9 6 10 6 7 7 35" @}
NR > 2 @{
    idle = $4
    sub(/^ */, "", idle)   # strip leading spaces
    if (idle == "")
        idle = 0
    if (idle ~ /:/) @{
        split(idle, t, ":")
        idle = t[1] * 60 + t[2]
    @}
    if (idle ~ /days/)
        idle *= 24 * 60 * 60

    print $1, $2, idle
@}

3866

@end group

3867

@end example

3868

3869

Here is the result of running the program on the data:

3870

3871

@example

3872

hzuo ttyV0 0

3873

hzang ttyV3 50

3874

eklye ttyV5 0

3875

dportein ttyV6 107

3876

gierd ttyD3 1

3877

dave ttyD4 0

3878

brent ttyp0 286

3879

dave ttyq4 1296000

3880

@end example

3881

3882

Another (possibly more practical) example of fixed-width input data

3883

would be the input from a deck of balloting cards. In some parts of

3884

the United States, voters mark their choices by punching holes in computer

3885

cards. These cards are then processed to count the votes for any particular

3886

candidate or on any particular issue. Since a voter may choose not to

3887

vote on some issue, any column on the card may be empty. An @code{awk}

3888

program for processing such data could use the @code{FIELDWIDTHS} feature

3889

to simplify reading the data. (Of course, getting @code{gawk} to run on

3890

a system with card readers is another story!)

3891

3892

@ignore

3893

Exercise: Write a ballot card reading program

3894

@end ignore

3895

3896

Assigning a value to @code{FS} causes @code{gawk} to return to using

3897

@code{FS} for field splitting. Use @samp{FS = FS} to make this happen,

3898

without having to know the current value of @code{FS}.

3899

3900

This feature is still experimental, and may evolve over time.

3901

Note that in particular, @code{gawk} does not attempt to verify

3902

the sanity of the values used in the value of @code{FIELDWIDTHS}.

3903

3904

@node Multiple Line, Getline, Constant Size, Reading Files

3905

@section Multiple-Line Records

3906

3907

@cindex multiple line records

3908

@cindex input, multiple line records

3909

@cindex reading files, multiple line records

3910

@cindex records, multiple line

3911

In some data bases, a single line cannot conveniently hold all the

3912

information in one entry. In such cases, you can use multi-line

3913

records.

3914

3915

The first step in doing this is to choose your data format: when records

3916

are not defined as single lines, how do you want to define them?

3917

What should separate records?

3918

3919

One technique is to use an unusual character or string to separate

3920

records. For example, you could use the formfeed character (written

3921

@samp{\f} in @code{awk}, as in C) to separate them, making each record

3922

a page of the file. To do this, just set the variable @code{RS} to

3923

@code{"\f"} (a string containing the formfeed character). Any

3924

other character could equally well be used, as long as it won't be part

3925

of the data in a record.

3926

3927

Another technique is to have blank lines separate records. By a special

3928

dispensation, an empty string as the value of @code{RS} indicates that

3929

records are separated by one or more blank lines. If you set @code{RS}

3930

to the empty string, a record always ends at the first blank line

3931

encountered. And the next record doesn't start until the first non-blank

3932

line that follows---no matter how many blank lines appear in a row, they

3933

are considered one record-separator.
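The following sketch illustrates the blank-line rule on made-up input. It sets @code{RS} to the empty string and prints the record number and field count of each record; the shell variable name @code{summary} is hypothetical.

```shell
# Leading blank lines are ignored, and the run of blank lines
# between "B" and "C" counts as a single separator.
summary=$(printf '\n\nA\nB\n\n\n\nC\n' |
    awk 'BEGIN { RS = "" } { print NR ": " NF " field(s)" }')
printf '%s\n' "$summary"
```

The input yields two records: the first has two fields (@samp{A} and @samp{B}), and the second has one.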

3934

3935

@cindex leftmost longest match

3936

@cindex matching, leftmost longest

3937

You can achieve the same effect as @samp{RS = ""} by assigning the

3938

string @code{"\n\n+"} to @code{RS}. This regexp matches the newline

3939

at the end of the record, and one or more blank lines after the record.

3940

In addition, a regular expression always matches the longest possible

3941

sequence when there is a choice

3942

(@pxref{Leftmost Longest, ,How Much Text Matches?}).

3943

So the next record doesn't start until

3944

the first non-blank line that follows---no matter how many blank lines

3945

appear in a row, they are considered one record-separator.

3946

3947

@cindex dark corner

3948

There is an important difference between @samp{RS = ""} and

3949

@samp{RS = "\n\n+"}. In the first case, leading newlines in the input

3950

data file are ignored, and if a file ends without extra blank lines

3951

after the last record, the final newline is removed from the record.

3952

In the second case, this special processing is not done (d.c.).

3953

3954

Now that the input is separated into records, the second step is to

3955

separate the fields in the record. One way to do this is to divide each

3956

of the lines into fields in the normal manner. This happens by default

3957

as the result of a special feature: when @code{RS} is set to the empty

3958

string, the newline character @emph{always} acts as a field separator.

3959

This is in addition to whatever field separations result from @code{FS}.

3960

3961

The original motivation for this special exception was probably to provide

3962

useful behavior in the default case (i.e.@: @code{FS} is equal

3963

to @w{@code{" "}}). This feature can be a problem if you really don't

3964

want the newline character to separate fields, since there is no way to

3965

prevent it. However, you can work around this by using the @code{split}

3966

function to break up the record manually

3967

(@pxref{String Functions, ,Built-in Functions for String Manipulation}).

3968

3969

Another way to separate fields is to

3970

put each field on a separate line: to do this, just set the

3971

variable @code{FS} to the string @code{"\n"}. (This simple regular

3972

expression matches a single newline.)

3973

3974

A practical example of a data file organized this way might be a mailing

3975

list, where each entry is separated by blank lines. If we have a mailing

3976

list in a file named @file{addresses}, that looks like this:

3977

3978

@example

3979

Jane Doe

3980

123 Main Street

3981

Anywhere, SE 12345-6789

3982

3983

John Smith

3984

456 Tree-lined Avenue

3985

Smallville, MW 98765-4321

3986

3987

@dots{}

3988

@end example

3989

3990

@noindent

3991

A simple program to process this file would look like this:

3992

3993

@example

3994

@group

3995

# addrs.awk --- simple mailing list program

3996

3997

# Records are separated by blank lines.

3998

# Each line is one field.

3999

BEGIN @{ RS = "" ; FS = "\n" @}

4000

4001

@{
    print "Name is:", $1
    print "Address is:", $2
    print "City and State are:", $3
    print ""
@}

4007

@end group

4008

@end example

4009

4010

Running the program produces the following output:

4011

4012

@example

4013

@group

4014

$ awk -f addrs.awk addresses

4015

@print{} Name is: Jane Doe

4016

@print{} Address is: 123 Main Street

4017

@print{} City and State are: Anywhere, SE 12345-6789

4018

@print{}

4019

@end group

4020

@group

4021

@print{} Name is: John Smith

4022

@print{} Address is: 456 Tree-lined Avenue

4023

@print{} City and State are: Smallville, MW 98765-4321

4024

@print{}

4025

@dots{}

4026

@end group

4027

@end example
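The same settings combine naturally with the rule that assigning to a field rebuilds @code{$0}. As a sketch, the following variation joins each address onto a single line; the separator @samp{; } and the shell variable name @code{oneline} are arbitrary choices for this illustration, and the input is supplied inline rather than read from @file{addresses}.

```shell
oneline=$(printf '%s\n' \
    'Jane Doe' '123 Main Street' 'Anywhere, SE 12345-6789' \
    '' \
    'John Smith' '456 Tree-lined Avenue' 'Smallville, MW 98765-4321' |
    awk 'BEGIN { RS = ""; FS = "\n"; OFS = "; " }
         {
             $1 = $1     # rebuild $0 so the fields are joined by OFS
             print
         }')
printf '%s\n' "$oneline"
```

Each three-line record is printed as one line, for example @samp{Jane Doe; 123 Main Street; Anywhere, SE 12345-6789}.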

4028

4029

@xref{Labels Program, ,Printing Mailing Labels}, for a more realistic

4030

program that deals with address lists.

4031

4032

The following table summarizes how records are split, based on the

4033

value of @code{RS}. (@samp{==} means ``is equal to.'')

4034

4035

@c @cartouche

4036

@table @code

4037

@item RS == "\n"

4038

Records are separated by the newline character (@samp{\n}). In effect,

4039

every line in the data file is a separate record, including blank lines.

4040

This is the default.

4041

4042

@item RS == @var{any single character}

4043

Records are separated by each occurrence of the character. Multiple

4044

successive occurrences delimit empty records.

4045

4046

@item RS == ""

4047

Records are separated by runs of blank lines. The newline character

4048

always serves as a field separator, in addition to whatever value

4049

@code{FS} may have. Leading and trailing newlines in a file are ignored.

4050

4051

@item RS == @var{regexp}

4052

Records are separated by occurrences of characters that match @var{regexp}.

4053

Leading and trailing matches of @var{regexp} delimit empty records.

4054

@end table

4055

@c @end cartouche

4056

4057

@vindex RT

4058

In all cases, @code{gawk} sets @code{RT} to the input text that matched the

4059

value specified by @code{RS}.

4060

4061

@node Getline, , Multiple Line, Reading Files

4062

@section Explicit Input with @code{getline}

4063

4064

@findex getline

4065

@cindex input, explicit

4066

@cindex explicit input

4067

@cindex input, @code{getline} command

4068

@cindex reading files, @code{getline} command

4069

So far we have been getting our input data from @code{awk}'s main

4070

input stream---either the standard input (usually your terminal, sometimes

4071

the output from another program) or from the

4072

files specified on the command line. The @code{awk} language has a

4073

special built-in command called @code{getline} that

4074

can be used to read input under your explicit control.

4075

4076

@menu

4077

* Getline Intro:: Introduction to the @code{getline} function.

4078

* Plain Getline:: Using @code{getline} with no arguments.

4079

* Getline/Variable:: Using @code{getline} into a variable.

4080

* Getline/File:: Using @code{getline} from a file.

4081

* Getline/Variable/File:: Using @code{getline} into a variable from a

4082

file.

4083

* Getline/Pipe:: Using @code{getline} from a pipe.

4084

* Getline/Variable/Pipe:: Using @code{getline} into a variable from a

4085

pipe.

4086

* Getline Summary:: Summary Of @code{getline} Variants.

4087

@end menu

4088

4089

@node Getline Intro, Plain Getline, Getline, Getline

4090

@subsection Introduction to @code{getline}

4091

4092

This command is used in several different ways, and should @emph{not} be

4093

used by beginners. It is covered here because this is the chapter on input.

4094

The examples that follow the explanation of the @code{getline} command

4095

include material that has not been covered yet. Therefore, come back

4096

and study the @code{getline} command @emph{after} you have reviewed the

4097

rest of this @value{DOCUMENT} and have a good knowledge of how @code{awk} works.

4098

4099

@vindex ERRNO

4100

@cindex differences between @code{gawk} and @code{awk}

4101

@cindex @code{getline}, return values

4102

@code{getline} returns one if it finds a record, and zero if the end of the

4103

file is encountered. If there is some error in getting a record, such

4104

as a file that cannot be opened, then @code{getline} returns @minus{}1.

4105

In this case, @code{gawk} sets the variable @code{ERRNO} to a string

4106

describing the error that occurred.
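The three return values can be observed with a short sketch. It assumes that the @code{mktemp} utility is available and that the path @file{/nonexistent/dir/file} does not exist on your system; the variable names are hypothetical.

```shell
datafile=$(mktemp)
printf 'only line\n' > "$datafile"

codes=$(awk -v f="$datafile" 'BEGIN {
    ok   = (getline line < f)    # a record was read
    eof  = (getline line < f)    # nothing left to read
    fail = (getline line < "/nonexistent/dir/file")  # cannot be opened
    print ok, eof, fail
}')
rm -f "$datafile"
printf '%s\n' "$codes"
```

Under those assumptions the program prints @samp{1 0 -1}. Testing for a value greater than zero, rather than merely non-zero, keeps a loop from running forever when the file cannot be opened.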

4107

4108

In the following examples, @var{command} stands for a string value that

4109

represents a shell command.

4110

4111

@node Plain Getline, Getline/Variable, Getline Intro, Getline

4112

@subsection Using @code{getline} with No Arguments

4113

4114

The @code{getline} command can be used without arguments to read input

4115

from the current input file. All it does in this case is read the next

4116

input record and split it up into fields. This is useful if you've

4117

finished processing the current record, but you want to do some special

4118

processing @emph{right now} on the next record. Here's an

4119

example:

4120

4121

@example

4122

@group

4123

awk '@{
    if ((t = index($0, "/*")) != 0) @{
        # value will be "" if t is 1
        tmp = substr($0, 1, t - 1)
        u = index(substr($0, t + 2), "*/")
        while (u == 0) @{
            if (getline <= 0) @{
                m = "unexpected EOF or error"
                m = (m ": " ERRNO)
                print m > "/dev/stderr"
                exit
            @}
            t = -1
            u = index($0, "*/")
        @}
@end group
@group
        # substr expression will be "" if */
        # occurred at end of line
        $0 = tmp substr($0, t + u + 3)
    @}
    print $0
@}'

4146

@end group

4147

@end example

4148

4149

This @code{awk} program deletes all C-style comments, @samp{/* @dots{}

4150

*/}, from the input. By replacing the @samp{print $0} with other

4151

statements, you could perform more complicated processing on the

4152

decommented input, like searching for matches of a regular

4153

expression. This program has a subtle problem---it does not work if one

4154

comment ends and another begins on the same line.

4155

4156

@ignore

4157

Exercise,

4158

write a program that does handle multiple comments on the line.

4159

@end ignore

4160

4161

This form of the @code{getline} command sets @code{NF} (the number of

4162

fields; @pxref{Fields, ,Examining Fields}), @code{NR} (the number of

4163

records read so far; @pxref{Records, ,How Input is Split into Records}),

4164

@code{FNR} (the number of records read from this input file), and the

4165

value of @code{$0}.
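A small sketch makes the effect on @code{NR} and @code{$0} visible. The three-line input and the variable names @code{before} and @code{plain} are made up for this illustration.

```shell
plain=$(printf 'a\nb\nc\n' | awk '{
    before = NR
    if ((getline) > 0)          # replaces $0 and increments NR
        print before, NR, $0
}')
printf '%s\n' "$plain"
```

This prints @samp{1 2 b}. The rule runs for the first record, and @code{getline} then consumes the second. When the rule runs again for the third record, @code{getline} finds the end of the input and returns zero, so nothing more is printed.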

4166

4167

@cindex dark corner

4168

@strong{Note:} the new value of @code{$0} is used in testing

4169

the patterns of any subsequent rules. The original value

4170

of @code{$0} that triggered the rule which executed @code{getline}

4171

is lost (d.c.).

4172

By contrast, the @code{next} statement reads a new record

4173

but immediately begins processing it normally, starting with the first

4174

rule in the program. @xref{Next Statement, ,The @code{next} Statement}.

4175

4176

@node Getline/Variable, Getline/File, Plain Getline, Getline

4177

@subsection Using @code{getline} Into a Variable

4178

4179

You can use @samp{getline @var{var}} to read the next record from

4180

@code{awk}'s input into the variable @var{var}. No other processing is

4181

done.

4182

4183

For example, suppose the next line is a comment, or a special string,

4184

and you want to read it, without triggering

4185

any rules. This form of @code{getline} allows you to read that line

4186

and store it in a variable so that the main

4187

read-a-line-and-check-each-rule loop of @code{awk} never sees it.

4188

4189

The following example swaps every two lines of input. For example, given:

4190

4191

@example

4192

wan

4193

tew

4194

free

4195

phore

4196

@end example

4197

4198

@noindent

4199

it outputs:

4200

4201

@example

4202

tew

4203

wan

4204

phore

4205

free

4206

@end example

4207

4208

@noindent

4209

Here's the program:

4210

4211

@example

4212

@group

4213

awk '@{
    if ((getline tmp) > 0) @{
        print tmp
        print $0
    @} else
        print $0
@}'

4220

@end group

4221

@end example

4222

4223

The @code{getline} command used in this way sets only the variables

4224

@code{NR} and @code{FNR} (and of course, @var{var}). The record is not

4225

split into fields, so the values of the fields (including @code{$0}) and

4226

the value of @code{NF} do not change.
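The following sketch checks this on two made-up input lines: it remembers @code{NF}, reads the next record into @code{tmp}, and then prints @code{NR}, whether @code{NF} is unchanged, the current @code{$0}, and @code{tmp}. The names @code{nf} and @code{intovar} are hypothetical.

```shell
intovar=$(printf 'a b c\nd e\n' | awk '{
    nf = NF
    if ((getline tmp) > 0)
        print NR, (NF == nf), $0, "|", tmp
}')
printf '%s\n' "$intovar"
```

This prints @samp{2 1 a b c | d e}: @code{NR} has advanced to two, while @code{NF} and @code{$0} still describe the first record.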

4227

4228

@node Getline/File, Getline/Variable/File, Getline/Variable, Getline

4229

@subsection Using @code{getline} from a File

4230

4231

@cindex input redirection

4232

@cindex redirection of input

4233

Use @samp{getline < @var{file}} to read

4234

the next record from the file

4235

@var{file}. Here @var{file} is a string-valued expression that

4236

specifies the file name. @samp{< @var{file}} is called a @dfn{redirection}

4237

since it directs input to come from a different place.

4238

4239

For example, the following

4240

program reads its input record from the file @file{secondary.input} when it

4241

encounters a first field with a value equal to 10 in the current input

4242

file.

4243

4244

@example

4245

@group

4246

awk '@{
    if ($1 == 10) @{
        getline < "secondary.input"
        print
    @} else
        print
@}'

4253

@end group

4254

@end example

4255

4256

Since the main input stream is not used, the values of @code{NR} and

4257

@code{FNR} are not changed. But the record read is split into fields in

4258

the normal manner, so the values of @code{$0} and other fields are

4259

changed. So is the value of @code{NF}.
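As a sketch of these effects, the following reads one record from a temporary file while processing a one-line main input. It assumes the @code{mktemp} utility is available; the file contents and the names @code{side} and @code{fromfile} are made up.

```shell
side=$(mktemp)
printf 'x y z\n' > "$side"

fromfile=$(echo 'one' | awk -v f="$side" '{
    getline < f                 # replaces $0 and NF, leaves NR alone
    print NR, NF, $0
}')
rm -f "$side"
printf '%s\n' "$fromfile"
```

This prints @samp{1 3 x y z}: @code{NR} still counts only the main input, but @code{$0} and @code{NF} now describe the record read from the file.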

4260

4261

@node Getline/Variable/File, Getline/Pipe, Getline/File, Getline

4262

@subsection Using @code{getline} Into a Variable from a File

4263

4264

Use @samp{getline @var{var} < @var{file}} to read input

4265

from the file

4266

@var{file} and put it in the variable @var{var}. As above, @var{file}

4267

is a string-valued expression that specifies the file from which to read.

4268

4269

In this version of @code{getline}, none of the built-in variables are

4270

changed, and the record is not split into fields. The only variable

4271

changed is @var{var}.

4272

4273

For example, the following program copies all the input files to the

4274

output, except for records that say @w{@samp{@@include @var{filename}}}.

4275

Such a record is replaced by the contents of the file

4276

@var{filename}.

4277

4278

@example

4279

@group

4280

awk '@{
    if (NF == 2 && $1 == "@@include") @{
        while ((getline line < $2) > 0)
            print line
        close($2)
    @} else
        print
@}'

4288

@end group

4289

@end example

4290

4291

Note here how the name of the extra input file is not built into

4292

the program; it is taken directly from the data, from the second field on

4293

the @samp{@@include} line.

4294

4295

The @code{close} function is called to ensure that if two identical

4296

@samp{@@include} lines appear in the input, the entire specified file is

4297

included twice.

4298

@xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes}.
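The effect of @code{close} can be seen in a short sketch that reads the same two-line file three times. It assumes the @code{mktemp} utility is available; the counter names and the file contents are made up.

```shell
inc=$(mktemp)
printf 'line 1\nline 2\n' > "$inc"

counts=$(awk -v f="$inc" 'BEGIN {
    while ((getline line < f) > 0) first++
    while ((getline line < f) > 0) again++     # still at end of file
    close(f)
    while ((getline line < f) > 0) reread++    # starts over at line 1
    close(f)
    print first, again + 0, reread
}')
rm -f "$inc"
printf '%s\n' "$counts"
```

This prints @samp{2 0 2}. The second loop reads nothing because the file is still open and positioned at its end; only after @code{close} does reading start again from the beginning.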

4299

4300

One deficiency of this program is that it does not process nested

4301

@samp{@@include} statements

4302

(@samp{@@include} statements in included files)

4303

the way a true macro preprocessor would.

4304

@xref{Igawk Program, ,An Easy Way to Use Library Functions}, for a program

4305

that does handle nested @samp{@@include} statements.

4306

4307

@node Getline/Pipe, Getline/Variable/Pipe, Getline/Variable/File, Getline

4308

@subsection Using @code{getline} from a Pipe

4309

4310

@cindex input pipeline

4311

@cindex pipeline, input

4312

You can pipe the output of a command into @code{getline}, using

4313

@samp{@var{command} | getline}. In

4314

this case, the string @var{command} is run as a shell command and its output

4315

is piped into @code{awk} to be used as input. This form of @code{getline}

4316

reads one record at a time from the pipe.

4317

4318

For example, the following program copies its input to its output, except for

4319

lines that begin with @samp{@@execute}, which are replaced by the output

4320

produced by running the rest of the line as a shell command:

4321

4322

@example

4323

@group

4324

awk '@{
    if ($1 == "@@execute") @{
        tmp = substr($0, 10)
        while ((tmp | getline) > 0)
            print
        close(tmp)
    @} else
        print
@}'

4333

@end group

4334

@end example

4335

4336

@noindent

4337

The @code{close} function is called to ensure that if two identical

4338

@samp{@@execute} lines appear in the input, the command is run for

4339

each one.

4340

@xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes}.

4341

@c Exercise!!

4342

@c This example is unrealistic, since you could just use system

4343

4344

Given the input:

4345

4346

@example

4347

@group

4348

foo

4349

bar

4350

baz

4351

@@execute who

4352

bletch

4353

@end group

4354

@end example

4355

4356

@noindent

4357

the program might produce:

4358

4359

@example

4360

@group

4361

foo

4362

bar

4363

baz

4364

arnold ttyv0 Jul 13 14:22

4365

miriam ttyp0 Jul 13 14:23 (murphy:0)

4366

bill ttyp1 Jul 13 14:23 (murphy:0)

4367

bletch

4368

@end group

4369

@end example

4370

4371

@noindent

4372

Notice that this program ran the command @code{who} and printed the result.

4373

(If you try this program yourself, you will of course get different results,

4374

showing you who is logged in on your system.)

4375

4376

This variation of @code{getline} splits the record into fields, sets the

4377

value of @code{NF} and recomputes the value of @code{$0}. The values of

4378

@code{NR} and @code{FNR} are not changed.
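A minimal sketch of these effects, using the @code{echo} command as a stand-in for a real data source, follows. The names @code{cmd} and @code{piped} are hypothetical.

```shell
piped=$(awk 'BEGIN {
    cmd = "echo hello world"
    cmd | getline               # sets $0 and NF, but not NR
    close(cmd)
    print NF, NR, $2
}')
printf '%s\n' "$piped"
```

This prints @samp{2 0 world}: the record from the pipe was split into two fields, while @code{NR} is still zero because no record has been read from the main input.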

4379

4380

@node Getline/Variable/Pipe, Getline Summary, Getline/Pipe, Getline

4381

@subsection Using @code{getline} Into a Variable from a Pipe

4382

4383

When you use @samp{@var{command} | getline @var{var}}, the

4384

output of the command @var{command} is sent through a pipe to

4385

@code{getline} and into the variable @var{var}. For example, the

4386

following program reads the current date and time into the variable

4387

@code{current_time}, using the @code{date} utility, and then

4388

prints it.

4389

4390

@example

4391

@group

4392

awk 'BEGIN @{
    "date" | getline current_time
    close("date")
    print "Report printed on " current_time
@}'

4397

@end group

4398

@end example

4399

4400

In this version of @code{getline}, none of the built-in variables are

4401

changed, and the record is not split into fields.
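When a command produces more than one line, this form is usually placed in a loop. The sketch below is one way to write such a loop; the command string and the names @code{cmd}, @code{n}, @code{last}, and @code{lines} are made up. Keeping the command in a variable means the identical string is passed to @code{close}.

```shell
lines=$(awk 'BEGIN {
    cmd = "echo one; echo two"
    while ((cmd | getline line) > 0) {
        n++                     # count the lines read
        last = line             # remember the most recent one
    }
    close(cmd)
    print n, last, NF
}')
printf '%s\n' "$lines"
```

This prints @samp{2 two 0}: two lines were read into @code{line}, and @code{NF} is still zero because no record was ever split into fields.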

4402

4403

@node Getline Summary, , Getline/Variable/Pipe, Getline

4404

@subsection Summary of @code{getline} Variants

4405

4406

With all the forms of @code{getline}, even though @code{$0} and @code{NF}

4407

may be updated, the record will not be tested against all the patterns

4408

in the @code{awk} program, in the way that would happen if the record

4409

were read normally by the main processing loop of @code{awk}. However,

4410

the new record is tested against any subsequent rules.

4411

4412

@cindex differences between @code{gawk} and @code{awk}

4413

@cindex limitations

4414

@cindex implementation limits

4415

Many @code{awk} implementations limit the number of pipelines an @code{awk}

4416

program may have open to just one! In @code{gawk}, there is no such limit.

4417

You can open as many pipelines as the underlying operating system will

4418

permit.

4419

4420

The following table summarizes the six variants of @code{getline},

4421

listing which built-in variables are set by each one.

4422

4423

@iftex

4424

@page

4425

@end iftex

4426

@c @cartouche

4427

@table @code

4428

@item getline

4429

sets @code{$0}, @code{NF}, @code{FNR}, and @code{NR}.

4430

4431

@item getline @var{var}

4432

sets @var{var}, @code{FNR}, and @code{NR}.

4433

4434

@item getline < @var{file}

4435

sets @code{$0} and @code{NF}.

4436

4437

@item getline @var{var} < @var{file}

4438

sets @var{var}.

4439

4440

@item @var{command} | getline

4441

sets @code{$0} and @code{NF}.

4442

4443

@item @var{command} | getline @var{var}

4444

sets @var{var}.

4445

@end table
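
One row of the table can be spot-checked with a short experiment. This is
only a sketch: the command @samp{echo p q}, the variable names, and the
sample record are invented for illustration. It exercises the
@samp{@var{command} | getline} form, which should replace @code{$0} and
@code{NF} but leave @code{NR} alone.

```shell
# Sketch: a plain getline from a pipe sets $0 and NF, but not NR.
out=$(echo "one two three" | awk '{
    before = NR
    "echo p q" | getline          # replaces $0 and NF
    close("echo p q")
    status = (NR == before) ? "NR unchanged" : "NR changed"
    print NF, $0, status
}')
echo "$out"    # 2 p q NR unchanged
```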

4446

@c @end cartouche

4447

4448

@node Printing, Expressions, Reading Files, Top

4449

@chapter Printing Output

4450

4451

@cindex printing

4452

@cindex output

4453

One of the most common actions is to @dfn{print}, or output,

4454

some or all of the input. You use the @code{print} statement

4455

for simple output. You use the @code{printf} statement

4456

for fancier formatting. Both are described in this chapter.

4457

4458

@menu

4459

* Print:: The @code{print} statement.

4460

* Print Examples:: Simple examples of @code{print} statements.

4461

* Output Separators:: The output separators and how to change them.

4462

* OFMT:: Controlling Numeric Output With @code{print}.

4463

* Printf:: The @code{printf} statement.

4464

* Redirection:: How to redirect output to multiple files and

4465

pipes.

4466

* Special Files:: File name interpretation in @code{gawk}.

4467

@code{gawk} allows access to inherited file

4468

descriptors.

4469

* Close Files And Pipes:: Closing Input and Output Files and Pipes.

4470

@end menu

4471

4472

@node Print, Print Examples, Printing, Printing

4473

@section The @code{print} Statement

4474

@cindex @code{print} statement

4475

4476

The @code{print} statement does output with simple, standardized

4477

formatting. You specify only the strings or numbers to be printed, in a

4478

list separated by commas. They are output, separated by single spaces,

4479

followed by a newline. The statement looks like this:

4480

4481

@example

4482

print @var{item1}, @var{item2}, @dots{}

4483

@end example

4484

4485

@noindent

4486

The entire list of items may optionally be enclosed in parentheses. The

4487

parentheses are necessary if any of the item expressions uses the @samp{>}

4488

relational operator; otherwise it could be confused with a redirection

4489

(@pxref{Redirection, ,Redirecting Output of @code{print} and @code{printf}}).
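
The following minimal sketch shows the parenthesized form; the variable
@code{x} and its value are arbitrary. Without the parentheses, the same
statement would be parsed as a redirection of the value of @code{x} to a
file named @file{3}.

```shell
# Sketch: parentheses make ">" a comparison rather than a redirection.
out=$(awk 'BEGIN { x = 5; print (x > 3) }')
echo "$out"    # 1
```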

4490

4491

The items to be printed can be constant strings or numbers, fields of the

4492

current record (such as @code{$1}), variables, or any @code{awk}

4493

expressions.

4494

Numeric values are converted to strings, and then printed.

4495

4496

The @code{print} statement is completely general for

4497

computing @emph{what} values to print. However, with two exceptions,

4498

you cannot specify @emph{how} to print them---how many

4499

columns, whether to use exponential notation or not, and so on.

4500

(For the exceptions, @pxref{Output Separators}, and

4501

@ref{OFMT, ,Controlling Numeric Output with @code{print}}.)

4502

For full control over formatting, you need the @code{printf} statement

4503

(@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}).

4504

4505

The simple statement @samp{print} with no items is equivalent to

4506

@samp{print $0}: it prints the entire current record. To print a blank

4507

line, use @samp{print ""}, where @code{""} is the empty string.
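
A brief sketch of these three forms, using a made-up one-line input:

```shell
# Sketch: bare print, print of the empty string, and print $0.
out=$(printf 'a b\n' | awk '{ print; print ""; print $0 }')
echo "$out"
```

The output is the record, a blank line, and the record again.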

4508

4509

To print a fixed piece of text, use a string constant such as

4510

@w{@code{"Don't Panic"}} as one item. If you forget to use the

4511

double-quote characters, your text will be taken as an @code{awk}

4512

expression, and you will probably get an error. Keep in mind that a

4513

space is printed between any two items.

4514

4515

Each @code{print} statement makes at least one line of output. But it

4516

isn't limited to one line. If an item value is a string that contains a

4517

newline, the newline is output along with the rest of the string. A

4518

single @code{print} can make any number of lines this way.

4519

4520

@node Print Examples, Output Separators, Print, Printing

4521

@section Examples of @code{print} Statements

4522

4523

Here is an example of printing a string that contains embedded newlines

4524

(the @samp{\n} is an escape sequence, used to represent the newline

4525

character; see @ref{Escape Sequences}):

4526

4527

@example

4528

@group

4529

$ awk 'BEGIN @{ print "line one\nline two\nline three" @}'

4530

@print{} line one

4531

@print{} line two

4532

@print{} line three

4533

@end group

4534

@end example

4535

4536

Here is an example that prints the first two fields of each input record,

4537

with a space between them:

4538

4539

@example

4540

@group

4541

$ awk '@{ print $1, $2 @}' inventory-shipped

4542

@print{} Jan 13

4543

@print{} Feb 15

4544

@print{} Mar 15

4545

@dots{}

4546

@end group

4547

@end example

4548

4549

@cindex common mistakes

4550

@cindex mistakes, common

4551

@cindex errors, common

4552

A common mistake in using the @code{print} statement is to omit the comma

4553

between two items. This often has the effect of making the items run

4554

together in the output, with no space. The reason for this is that

4555

juxtaposing two string expressions in @code{awk} means to concatenate

4556

them. Here is the same program, without the comma:

4557

4558

@example

4559

@group

4560

$ awk '@{ print $1 $2 @}' inventory-shipped

4561

@print{} Jan13

4562

@print{} Feb15

4563

@print{} Mar15

4564

@dots{}

4565

@end group

4566

@end example

4567

4568

To someone unfamiliar with the file @file{inventory-shipped}, neither

4569

example's output makes much sense. A heading line at the beginning

4570

would make it clearer. Let's add some headings to our table of months

4571

(@code{$1}) and green crates shipped (@code{$2}). We do this using the

4572

@code{BEGIN} pattern

4573

(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns})

4574

to force the headings to be printed only once:

4575

4576

@example

4577

awk 'BEGIN @{ print "Month Crates"

4578

print "----- ------" @}

4579

@{ print $1, $2 @}' inventory-shipped

4580

@end example

4581

4582

@noindent

4583

Did you already guess what happens? When run, the program prints

4584

the following:

4585

4586

@example

4587

@group

4588

Month Crates

4589

----- ------

4590

Jan 13

4591

Feb 15

4592

Mar 15

4593

@dots{}

4594

@end group

4595

@end example

4596

4597

@noindent

4598

The headings and the table data don't line up! We can fix this by printing

4599

some spaces between the two fields:

4600

4601

@example

4602

awk 'BEGIN @{ print "Month Crates"

4603

print "----- ------" @}

4604

@{ print $1, " ", $2 @}' inventory-shipped

4605

@end example

4606

4607

You can imagine that this way of lining up columns can get pretty

4608

complicated when you have many columns to fix. Counting spaces for two

4609

or three columns can be simple, but more than this and you can get

4610

lost quite easily. This is why the @code{printf} statement was

4611

created (@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing});

4612

one of its specialties is lining up columns of data.

4613

4614

@cindex line continuation

4615

As a side point,

4616

you can continue either a @code{print} or @code{printf} statement simply

4617

by putting a newline after any comma

4618

(@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}).

4619

4620

@node Output Separators, OFMT, Print Examples, Printing

4621

@section Output Separators

4622

4623

@cindex output field separator, @code{OFS}

4624

@cindex output record separator, @code{ORS}

4625

@vindex OFS

4626

@vindex ORS

4627

As mentioned previously, a @code{print} statement contains a list

4628

of items, separated by commas. In the output, the items are normally

4629

separated by single spaces. This need not be the case; a

4630

single space is only the default. You can specify any string of

4631

characters to use as the @dfn{output field separator} by setting the

4632

built-in variable @code{OFS}. The initial value of this variable

4633

is the string @w{@code{" "}}, that is, a single space.

4634

4635

The output from an entire @code{print} statement is called an

4636

@dfn{output record}. Each @code{print} statement outputs one output

4637

record and then outputs a string called the @dfn{output record separator}.

4638

The built-in variable @code{ORS} specifies this string. The initial

4639

value of @code{ORS} is the string @code{"\n"}, i.e.@: a newline

4640

character; thus, normally each @code{print} statement makes a separate line.

4641

4642

You can change how output fields and records are separated by assigning

4643

new values to the variables @code{OFS} and/or @code{ORS}. The usual

4644

place to do this is in the @code{BEGIN} rule

4645

(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}), so

4646

that it happens before any input is processed. You may also do this

4647

with assignments on the command line, before the names of your input

4648

files, or using the @samp{-v} command line option

4649

(@pxref{Options, ,Command Line Options}).
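
The two command-line alternatives can be sketched as follows. The sample
input and the shell variable names are assumptions for the demonstration;
the @samp{-} operand names the standard input, so that the assignment
appears before an input file as described above.

```shell
# Sketch: setting OFS from the command line instead of a BEGIN rule.
input='a b
c d'
with_v=$(printf '%s\n' "$input" | awk -v OFS=';' '{ print $1, $2 }')
with_arg=$(printf '%s\n' "$input" | awk '{ print $1, $2 }' OFS=';' -)
echo "$with_v"
echo "$with_arg"
```

Both commands should print @samp{a;b} followed by @samp{c;d}.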

4650

4651

@ignore

4652

Exercise,

4653

Rewrite the

4654

@example

4655

awk 'BEGIN @{ print "Month Crates"

4656

print "----- ------" @}

4657

@{ print $1, " ", $2 @}' inventory-shipped

4658

@end example

4659

program by using a new value of @code{OFS}.

4660

@end ignore

4661

4662

The following example prints the first and second fields of each input

4663

record separated by a semicolon, with a blank line added after each

4664

line:

4665

4666

@example

4667

@group

4668

$ awk 'BEGIN @{ OFS = ";"; ORS = "\n\n" @}

4669

> @{ print $1, $2 @}' BBS-list

4670

@print{} aardvark;555-5553

4671

@print{}

4672

@print{} alpo-net;555-3412

4673

@print{}

4674

@print{} barfly;555-7685

4675

@dots{}

4676

@end group

4677

@end example

4678

4679

If the value of @code{ORS} does not contain a newline, all your output

4680

will be run together on a single line, unless you output newlines some

4681

other way.
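
For instance, in this sketch (with invented input) @code{ORS} is a comma,
so the three records come out on one line. The closing newline is written
separately with @code{printf} in an @code{END} rule.

```shell
# Sketch: an ORS without a newline runs the records together.
out=$(printf 'a\nb\nc\n' |
    awk 'BEGIN { ORS = "," } { print $1 } END { printf "\n" }')
echo "$out"    # a,b,c,
```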

4682

4683

@node OFMT, Printf, Output Separators, Printing

4684

@section Controlling Numeric Output with @code{print}

4685

@vindex OFMT

4686

@cindex numeric output format

4687

@cindex format, numeric output

4688

@cindex output format specifier, @code{OFMT}

4689

When you use the @code{print} statement to print numeric values,

4690

@code{awk} internally converts the number to a string of characters,

4691

and prints that string. @code{awk} uses the @code{sprintf} function

4692

to do this conversion

4693

(@pxref{String Functions, ,Built-in Functions for String Manipulation}).

4694

For now, it suffices to say that the @code{sprintf}

4695

function accepts a @dfn{format specification} that tells it how to format

4696

numbers (or strings), and that there are a number of different ways in which

4697

numbers can be formatted. The different format specifications are discussed

4698

more fully in

4699

@ref{Control Letters, , Format-Control Letters}.

4700

4701

The built-in variable @code{OFMT} contains the default format specification

4702

that @code{print} uses with @code{sprintf} when it wants to convert a

4703

number to a string for printing.

4704

The default value of @code{OFMT} is @code{"%.6g"}.

4705

By supplying different format specifications

4706

as the value of @code{OFMT}, you can change how @code{print} will print

4707

your numbers. As a brief example:

4708

4709

@example

4710

@group

4711

$ awk 'BEGIN @{

4712

> OFMT = "%.0f" # print numbers as integers (rounds)

4713

> print 17.23 @}'

4714

@print{} 17

4715

@end group

4716

@end example

4717

4718

@noindent

4719

@cindex dark corner

4720

@cindex @code{awk} language, POSIX version

4721

@cindex POSIX @code{awk}

4722

According to the POSIX standard, @code{awk}'s behavior will be undefined

4723

if @code{OFMT} contains anything but a floating point conversion specification

4724

(d.c.).
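
Here is one more hedged sketch, with values chosen arbitrarily. In the
@code{awk} implementations we are aware of, a number with an integral
value is printed as an integer regardless of @code{OFMT}, so only the
first item is affected.

```shell
# Sketch: OFMT controls how print formats non-integral numbers.
out=$(awk 'BEGIN { OFMT = "%.2f"; print 3.14159, 17 }')
echo "$out"    # 3.14 17
```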

4725

4726

@node Printf, Redirection, OFMT, Printing

4727

@section Using @code{printf} Statements for Fancier Printing

4728

@cindex formatted output

4729

@cindex output, formatted

4730

4731

If you want more precise control over the output format than

4732

@code{print} gives you, use @code{printf}. With @code{printf} you can

4733

specify the width to use for each item, and you can specify various

4734

formatting choices for numbers (such as what radix to use, whether to

4735

print an exponent, whether to print a sign, and how many digits to print

4736

after the decimal point). You do this by supplying a string, called

4737

the @dfn{format string}, which controls how and where to print the other

4738

arguments.

4739

4740

@menu

4741

* Basic Printf:: Syntax of the @code{printf} statement.

4742

* Control Letters:: Format-control letters.

4743

* Format Modifiers:: Format-specification modifiers.

4744

* Printf Examples:: Several examples.

4745

@end menu

4746

4747

@node Basic Printf, Control Letters, Printf, Printf

4748

@subsection Introduction to the @code{printf} Statement

4749

4750

@cindex @code{printf} statement, syntax of

4751

The @code{printf} statement looks like this:

4752

4753

@example

4754

printf @var{format}, @var{item1}, @var{item2}, @dots{}

4755

@end example

4756

4757

@noindent

4758

The entire list of arguments may optionally be enclosed in parentheses. The

4759

parentheses are necessary if any of the item expressions use the @samp{>}

4760

relational operator; otherwise it could be confused with a redirection

4761

(@pxref{Redirection, ,Redirecting Output of @code{print} and @code{printf}}).

4762

4763

@cindex format string

4764

The difference between @code{printf} and @code{print} is the @var{format}

4765

argument. This is an expression whose value is taken as a string; it

4766

specifies how to output each of the other arguments. It is called

4767

the @dfn{format string}.

4768

4769

The format string is very similar to that in the ANSI C library function

4770

@code{printf}. Most of @var{format} is text to be output verbatim.

4771

Scattered among this text are @dfn{format specifiers}, one per item.

4772

Each format specifier says to output the next item in the argument list

4773

at that place in the format.

4774

4775

The @code{printf} statement does not automatically append a newline to its

4776

output. It outputs only what the format string specifies. So if you want

4777

a newline, you must include one in the format string. The output separator

4778

variables @code{OFS} and @code{ORS} have no effect on @code{printf}

4779

statements. For example:

4780

4781

@example

4782

@group

4783

BEGIN @{

4784

ORS = "\nOUCH!\n"; OFS = "!"

4785

msg = "Don't Panic!"; printf "%s\n", msg

4786

@}

4787

@end group

4788

@end example

4789

4790

This program still prints the familiar @samp{Don't Panic!} message.

4791

4792

@node Control Letters, Format Modifiers, Basic Printf, Printf

4793

@subsection Format-Control Letters

4794

@cindex @code{printf}, format-control characters

4795

@cindex format specifier

4796

4797

A format specifier starts with the character @samp{%} and ends with a

4798

@dfn{format-control letter}; it tells the @code{printf} statement how

4799

to output one item. (If you actually want to output a @samp{%}, write

4800

@samp{%%}.) The format-control letter specifies what kind of value to

4801

print. The rest of the format specifier is made up of optional

4802

@dfn{modifiers} which are parameters to use, such as the field width.

4803

4804

Here is a list of the format-control letters:

4805

4806

@table @code

4807

@item c

4808

This prints a number as an ASCII character. Thus, @samp{printf "%c",

4809

65} outputs the letter @samp{A}. The output for a string value is

4810

the first character of the string.

4811

4812

@iftex

4813

@page

4814

@end iftex

4815

@item d

4816

@itemx i

4817

These are equivalent. They both print a decimal integer.

4818

The @samp{%i} specification is for compatibility with ANSI C.

4819

4820

@item e

4821

@itemx E

4822

This prints a number in scientific (exponential) notation.

4823

For example,

4824

4825

@example

4826

printf "%4.3e\n", 1950

4827

@end example

4828

4829

@noindent

4830

prints @samp{1.950e+03}, with a total of four significant figures of

4831

which three follow the decimal point. The @samp{4.3} are modifiers,

4832

discussed below. @samp{%E} uses @samp{E} instead of @samp{e} in the output.

4833

4834

@item f

4835

This prints a number in floating point notation.

4836

For example,

4837

4838

@example

4839

printf "%4.3f", 1950

4840

@end example

4841

4842

@noindent

4843

prints @samp{1950.000}, with three digits after the decimal point.
The @samp{4} is a minimum field width, which has no effect here because
the result is already wider than four characters. The @samp{4.3} are
modifiers, discussed below.

4846

4847

@item g

4848

@itemx G

4849

This prints a number in either scientific notation or floating point

4850

notation, whichever uses fewer characters. If the result is printed in

4851

scientific notation, @samp{%G} uses @samp{E} instead of @samp{e}.

4852

4853

@item o

4854

This prints an unsigned octal integer.

4855

(In octal, or base-eight notation, the digits run from @samp{0} to @samp{7};

4856

the decimal number eight is represented as @samp{10} in octal.)

4857

4858

@item s

4859

This prints a string.

4860

4861

@item x

4862

@itemx X

4863

This prints an unsigned hexadecimal integer.

4864

(In hexadecimal, or base-16 notation, the digits are @samp{0} through @samp{9}

4865

and @samp{a} through @samp{f}. The hexadecimal digit @samp{f} represents

4866

the decimal number 15.) @samp{%X} uses the letters @samp{A} through @samp{F}

4867

instead of @samp{a} through @samp{f}.

4868

4869

@item %

4870

This isn't really a format-control letter, but it does have a meaning

4871

when used after a @samp{%}: the sequence @samp{%%} outputs one

4872

@samp{%}. It does not consume an argument, and it ignores any

4873

modifiers.

4874

@end table
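
The letters in this table can be tried together in one short program. This
is a sketch with arbitrary sample values. The exact form of the exponent
(@samp{e+03}) is what common C libraries produce and may differ on unusual
systems.

```shell
# Sketch: one printf call per group of format-control letters.
out=$(awk 'BEGIN {
    printf "%c %d %o %x %X %s %%\n", 65, 42, 8, 255, 255, "str"
    printf "%.3e %.3f %g\n", 1950, 1950, 1950
}')
echo "$out"
```

On a typical system this prints @samp{A 42 10 ff FF str %} and then
@samp{1.950e+03 1950.000 1950}.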

4875

4876

@cindex dark corner

4877

When using the integer format-control letters for values that are outside

4878

the range of a C @code{long} integer, @code{gawk} will switch to the

4879

@samp{%g} format specifier. Other versions of @code{awk} may print

4880

invalid values, or do something else entirely (d.c.).

4881

4882

@node Format Modifiers, Printf Examples, Control Letters, Printf

4883

@subsection Modifiers for @code{printf} Formats

4884

4885

@cindex @code{printf}, modifiers

4886

@cindex modifiers (in format specifiers)

4887

A format specification can also include @dfn{modifiers} that can control

4888

how much of the item's value is printed and how much space it gets. The

4889

modifiers come between the @samp{%} and the format-control letter.

4890

In the examples below, we use the bullet symbol ``@bullet{}'' to represent

4891

spaces in the output. Here are the possible modifiers, in the order in

4892

which they may appear:

4893

4894

@table @code

4895

@item -

4896

The minus sign, used before the width modifier (see below),

4897

says to left-justify

4898

the argument within its specified width. Normally the argument

4899

is printed right-justified in the specified width. Thus,

4900

4901

@example

4902

printf "%-4s", "foo"

4903

@end example

4904

4905

@noindent

4906

prints @samp{foo@bullet{}}.

4907

4908

@item @var{space}

4909

For numeric conversions, prefix positive values with a space, and

4910

negative values with a minus sign.

4911

4912

@item +

4913

The plus sign, used before the width modifier (see below),

4914

says to always supply a sign for numeric conversions, even if the data

4915

to be formatted is positive. The @samp{+} overrides the space modifier.

4916

4917

@item #

4918

Use an ``alternate form'' for certain control letters.

4919

For @samp{%o}, supply a leading zero.

4920

For @samp{%x} and @samp{%X}, supply a leading @samp{0x} or @samp{0X} for

4921

a non-zero result.

4922

For @samp{%e}, @samp{%E}, and @samp{%f}, the result will always contain a

4923

decimal point.

4924

For @samp{%g} and @samp{%G}, trailing zeros are not removed from the result.

4925

4926

@cindex dark corner

4927

@item 0

4928

A leading @samp{0} (zero) acts as a flag that indicates output should be

4929

padded with zeros instead of spaces.

4930

This applies even to non-numeric output formats (d.c.).

4931

This flag only has an effect when the field width is wider than the

4932

value to be printed.

4933

4934

@item @var{width}

4935

This is a number specifying the desired minimum width of a field. Inserting any

4936

number between the @samp{%} sign and the format control character forces the

4937

field to be expanded to this width. The default way to do this is to

4938

pad with spaces on the left. For example,

4939

4940

@example

4941

printf "%4s", "foo"

4942

@end example

4943

4944

@noindent

4945

prints @samp{@bullet{}foo}.

4946

4947

The value of @var{width} is a minimum width, not a maximum. If the item

4948

value requires more than @var{width} characters, it can be as wide as

4949

necessary. Thus,

4950

4951

@example

4952

printf "%4s", "foobar"

4953

@end example

4954

4955

@noindent

4956

prints @samp{foobar}.

4957

4958

Preceding the @var{width} with a minus sign causes the output to be

4959

padded with spaces on the right, instead of on the left.

4960

4961

@item .@var{prec}

4962

This is a number that specifies the precision to use when printing.

4963

For the @samp{e}, @samp{E}, and @samp{f} formats, this specifies the

4964

number of digits you want printed to the right of the decimal point.

4965

For the @samp{g} and @samp{G} formats, it specifies the maximum number

4966

of significant digits. For the @samp{d}, @samp{o}, @samp{i}, @samp{u},

4967

@samp{x}, and @samp{X} formats, it specifies the minimum number of

4968

digits to print. For a string, it specifies the maximum number of

4969

characters from the string that should be printed. Thus,

4970

4971

@example

4972

printf "%.4s", "foobar"

4973

@end example

4974

4975

@noindent

4976

prints @samp{foob}.

4977

@end table
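
The modifiers above can be compared side by side. In this sketch the
square brackets are only there to make the padding visible, and the
sample values are arbitrary.

```shell
# Sketch: width, flags, and precision applied to sample values.
out=$(awk 'BEGIN {
    printf "[%5d] [%-5d] [%05d] [%+d] [% d]\n", 42, 42, 42, 42, 42
    printf "[%#o] [%#x] [%.2f] [%.3s] [%8.3f]\n", 8, 255, 3.14159, "abcdef", 3.14159
}')
echo "$out"
```

The first line shows right justification, left justification, zero
padding, a forced sign, and a leading space. The second shows the
alternate forms for octal and hexadecimal, then precision applied to a
number and to a string, and finally width and precision combined.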

4978

4979

The C library @code{printf}'s dynamic @var{width} and @var{prec}

4980

capability (for example, @code{"%*.*s"}) is supported. Instead of

4981

supplying explicit @var{width} and/or @var{prec} values in the format

4982

string, you pass them in the argument list. For example:

4983

4984

@example

4985

w = 5

4986

p = 3

4987

s = "abcdefg"

4988

printf "%*.*s\n", w, p, s

4989

@end example

4990

4991

@noindent

4992

is exactly equivalent to

4993

4994

@example

4995

s = "abcdefg"

4996

printf "%5.3s\n", s

4997

@end example

4998

4999

@noindent

5000

Both programs output @samp{@w{@bullet{}@bullet{}abc}}.

5001

5002

Earlier versions of @code{awk} did not support this capability.

5003

If you must use such a version, you may simulate this feature by using

5004

concatenation to build up the format string, like so:

5005

5006

@example

5007

w = 5

5008

p = 3

5009

s = "abcdefg"

5010

printf "%" w "." p "s\n", s

5011

@end example

5012

5013

@noindent

5014

This is not particularly easy to read, but it does work.

5015

5016

@cindex @code{awk} language, POSIX version

5017

@cindex POSIX @code{awk}

5018

C programmers may be used to supplying additional @samp{l} and @samp{h}

5019

flags in @code{printf} format strings. These are not valid in @code{awk}.

5020

Most @code{awk} implementations silently ignore these flags.

5021

If @samp{--lint} is provided on the command line

5022

(@pxref{Options, ,Command Line Options}),

5023

@code{gawk} will warn about their use. If @samp{--posix} is supplied,

5024

their use is a fatal error.

5025

5026

@node Printf Examples, , Format Modifiers, Printf

5027

@subsection Examples Using @code{printf}

5028

5029

Here is how to use @code{printf} to make an aligned table:

5030

5031

@example

5032

awk '@{ printf "%-10s %s\n", $1, $2 @}' BBS-list

5033

@end example

5034

5035

@noindent

5036

prints the names of bulletin boards (@code{$1}) of the file

5037

@file{BBS-list} as a string of 10 characters, left justified. It also

5038

prints the phone numbers (@code{$2}) afterward on the line. This

5039

produces an aligned two-column table of names and phone numbers:

5040

5041

@example
@group
$ awk '@{ printf "%-10s %s\n", $1, $2 @}' BBS-list
@print{} aardvark   555-5553
@print{} alpo-net   555-3412
@print{} barfly     555-7685
@print{} bites      555-1675
@print{} camelot    555-0542
@print{} core       555-2912
@print{} fooey      555-1234
@print{} foot       555-6699
@print{} macfoo     555-6480
@print{} sdace      555-3430
@print{} sabafoo    555-2127
@end group
@end example

5057

5058

Did you notice that we did not specify that the phone numbers be printed

5059

as numbers? They had to be printed as strings because the numbers are

5060

separated by a dash.

5061

If we had tried to print the phone numbers as numbers, all we would have

5062

gotten would have been the first three digits, @samp{555}.

5063

This would have been pretty confusing.
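
A one-line sketch makes the difference visible. The variable name
@code{phone} is our own, and the value is the first number from the table
above.

```shell
# Sketch: the same value printed with %d and with %s.
out=$(awk 'BEGIN { phone = "555-5553"; printf "%d|%s\n", phone, phone }')
echo "$out"    # 555|555-5553
```

With @samp{%d}, the string is converted to a number, and the conversion
stops at the dash.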

5064

5065

We did not specify a width for the phone numbers because they are the

5066

last things on their lines. We don't need to put spaces after them.

5067

5068

We could make our table look even nicer by adding headings to the tops

5069

of the columns. To do this, we use the @code{BEGIN} pattern

5070

(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns})

5071

to force the header to be printed only once, at the beginning of

5072

the @code{awk} program:

5073

5074

@example
@group
awk 'BEGIN @{ print "Name       Number"
             print "----       ------" @}
     @{ printf "%-10s %s\n", $1, $2 @}' BBS-list
@end group
@end example

5081

5082

Did you notice that we mixed @code{print} and @code{printf} statements in

5083

the above example? We could have used just @code{printf} statements to get

5084

the same results:

5085

5086

@example

5087

@group

5088

awk 'BEGIN @{ printf "%-10s %s\n", "Name", "Number"

5089

printf "%-10s %s\n", "----", "------" @}

5090

@{ printf "%-10s %s\n", $1, $2 @}' BBS-list

5091

@end group

5092

@end example

5093

5094

@noindent

5095

By printing each column heading with the same format specification

5096

used for the elements of the column, we have made sure that the headings

5097

are aligned just like the columns.

5098

5099

The fact that the same format specification is used three times can be

5100

emphasized by storing it in a variable, like this:

5101

5102

@example

5103

@group

5104

awk 'BEGIN @{ format = "%-10s %s\n"

5105

printf format, "Name", "Number"

5106

printf format, "----", "------" @}

5107

@{ printf format, $1, $2 @}' BBS-list

5108

@end group

5109

@end example

5110

5111

@c !!! exercise

5112

See if you can use the @code{printf} statement to line up the headings and

5113

table data for our @file{inventory-shipped} example covered earlier in the

5114

section on the @code{print} statement

5115

(@pxref{Print, ,The @code{print} Statement}).
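
Here is one possible answer, offered as a sketch rather than the
definitive solution. Two sample records stand in for
@file{inventory-shipped} so that the example is self-contained. The
variable name @code{fmt} and the column widths are our own choices.

```shell
# Sketch: headings and data share one format string, so they line up.
out=$(printf 'Jan 13 25 15 115\nFeb 15 32 24 226\n' | awk '
BEGIN { fmt = "%-5s %6s\n"              # one format shared by all rows
        printf fmt, "Month", "Crates"
        printf fmt, "-----", "------" }
      { printf fmt, $1, $2 }')
echo "$out"
```

The month is left-justified in five columns and the crate count is
right-justified in six, so the numbers end under the last letter of
@samp{Crates}.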

5116

5117

@node Redirection, Special Files, Printf, Printing

5118

@section Redirecting Output of @code{print} and @code{printf}

5119

5120

@cindex output redirection

5121

@cindex redirection of output

5122

So far we have been dealing only with output that prints to the standard

5123

output, usually your terminal. Both @code{print} and @code{printf} can

5124

also send their output to other places.

5125

This is called @dfn{redirection}.

5126

5127

A redirection appears after the @code{print} or @code{printf} statement.

5128

Redirections in @code{awk} are written just like redirections in shell

5129

commands, except that they are written inside the @code{awk} program.

5130

5131

There are three forms of output redirection: output to a file,

5132

output appended to a file, and output through a pipe to another

5133

command.

5134

They are all shown for

5135

the @code{print} statement, but they work identically for @code{printf}

5136

also.

5137

5138

@table @code

5139

@item print @var{items} > @var{output-file}

5140

This type of redirection prints the items into the output file

5141

@var{output-file}. The file name @var{output-file} can be any

5142

expression. Its value is changed to a string and then used as a

5143

file name (@pxref{Expressions}).

5144

5145

When this type of redirection is used, the @var{output-file} is erased

5146

before the first output is written to it. Subsequent writes

5147

to the same @var{output-file} do not

5148

erase @var{output-file}, but append to it. If @var{output-file} does

5149

not exist, then it is created.

5150

5151

For example, here is how an @code{awk} program can write a list of

5152

BBS names to a file @file{name-list} and a list of phone numbers to a

5153

file @file{phone-list}. Each output file contains one name or number

5154

per line.

5155

5156

@example

5157

@group

5158

$ awk '@{ print $2 > "phone-list"

5159

> print $1 > "name-list" @}' BBS-list

5160

@end group

5161

@group

5162

$ cat phone-list

5163

@print{} 555-5553

5164

@print{} 555-3412

5165

@dots{}

5166

@end group

5167

@group

5168

$ cat name-list

5169

@print{} aardvark

5170

@print{} alpo-net

5171

@dots{}

5172

@end group

5173

@end example

5174

5175

@item print @var{items} >> @var{output-file}

5176

This type of redirection prints the items into the pre-existing output file

5177

@var{output-file}. The difference between this and the

5178

single-@samp{>} redirection is that the old contents (if any) of

5179

@var{output-file} are not erased. Instead, the @code{awk} output is

5180

appended to the file.

5181

If @var{output-file} does not exist, then it is created.

5182

5183

@cindex pipes for output

5184

@cindex output, piping

5185

@item print @var{items} | @var{command}

5186

It is also possible to send output to another program through a pipe

5187

instead of into a

5188

file. This type of redirection opens a pipe to @var{command} and writes

5189

the values of @var{items} through this pipe, to another process created

5190

to execute @var{command}.

5191

5192

The redirection argument @var{command} is actually an @code{awk}

5193

expression. Its value is converted to a string, whose contents give the

5194

shell command to be run.

5195

5196

For example, this produces two files, one unsorted list of BBS names

5197

and one list sorted in reverse alphabetical order:

5198

5199

@example
awk '@{ print $1 > "names.unsorted"
       command = "sort -r > names.sorted"
       print $1 | command @}' BBS-list
@end example

5204

5205

Here the unsorted list is written with an ordinary redirection while

5206

the sorted list is written by piping through the @code{sort} utility.

5207

5208

This example uses redirection to mail a message to a mailing

5209

list @samp{bug-system}. This might be useful when trouble is encountered

5210

in an @code{awk} script run periodically for system maintenance.

5211

5212

@example
report = "mail bug-system"
print "Awk script failed:", $0 | report
m = ("at record number " FNR " of " FILENAME)
print m | report
close(report)
@end example

5219

5220

The message is built using string concatenation and saved in the variable

5221

@code{m}. It is then sent down the pipeline to the @code{mail} program.

5222

5223

We call the @code{close} function here because it's a good idea to close

5224

the pipe as soon as all the intended output has been sent to it.

5225

@xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes},

5226

for more information

5227

on this. This example also illustrates the use of a variable to represent

5228

a @var{file} or @var{command}: it is not necessary to always

5229

use a string constant. Using a variable is generally a good idea,

5230

since @code{awk} requires you to spell the string value identically

5231

every time.

5232

@end table

5233

5234

Redirecting output using @samp{>}, @samp{>>}, or @samp{|} asks the system

5235

to open a file or pipe only if the particular @var{file} or @var{command}

5236

you've specified has not already been written to by your program, or if

5237

it has been closed since it was last written to.
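
As a minimal sketch of this rule, assuming the sample file @file{BBS-list}
used above and a hypothetical output file @file{names.out}, compare the
following two commands:

@example
$ awk '@{ print $1 > "names.out" @}' BBS-list
$ awk '@{ print $1 >> "names.out" @}' BBS-list
@end example

@noindent
The first command opens (and empties) @file{names.out} only once, when the
first record is printed, so every later @code{print} in that run adds to the
file instead of overwriting it. The second command is a separate run that
opens the file with @samp{>>}, so its output should land after the names
written by the first.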

5238

5239

@cindex differences between @code{gawk} and @code{awk}

5240

@cindex limitations

5241

@cindex implementation limits

5242

Many @code{awk} implementations limit the number of pipelines an @code{awk}

5243

program may have open to just one! In @code{gawk}, there is no such limit.

5244

You can open as many pipelines as the underlying operating system will

5245

permit.

5246

5247

@node Special Files, Close Files And Pipes, Redirection, Printing

5248

@section Special File Names in @code{gawk}

5249

@cindex standard input

5250

@cindex standard output

5251

@cindex standard error output

5252

@cindex file descriptors

5253

5254

Running programs conventionally have three input and output streams

5255

already available to them for reading and writing. These are known as

5256

the @dfn{standard input}, @dfn{standard output}, and @dfn{standard error

5257

output}. These streams are, by default, connected to your terminal, but

5258

they are often redirected with the shell, via the @samp{<}, @samp{<<},

5259

@samp{>}, @samp{>>}, @samp{>&} and @samp{|} operators. Standard error

5260

is typically used for writing error messages; the reason we have two separate

5261

streams, standard output and standard error, is so that they can be

5262

redirected separately.

5263

5264

@cindex differences between @code{gawk} and @code{awk}

5265

In other implementations of @code{awk}, the only way to write an error

5266

message to standard error in an @code{awk} program is as follows:

5267

5268

@example
print "Serious error detected!" | "cat 1>&2"
@end example

5271

5272

@noindent

5273

This works by opening a pipeline to a shell command which can access the

5274

standard error stream which it inherits from the @code{awk} process.

5275

This is far from elegant, and is also inefficient, since it requires a

5276

separate process. So people writing @code{awk} programs often

5277

neglect to do this. Instead, they send the error messages to the

5278

terminal, like this:

5279

5280

@example
@group
print "Serious error detected!" > "/dev/tty"
@end group
@end example

5285

5286

@noindent

5287

This usually has the same effect, but not always: although the

5288

standard error stream is usually the terminal, it can be redirected, and

5289

when that happens, writing to the terminal is not correct. In fact, if

5290

@code{awk} is run from a background job, it may not have a terminal at all.

5291

Then opening @file{/dev/tty} will fail.

5292

5293

@code{gawk} provides special file names for accessing the three standard

5294

streams. When you redirect input or output in @code{gawk}, if the file name

5295

matches one of these special names, then @code{gawk} directly uses the

5296

stream it stands for.

5297

5298

@cindex @file{/dev/stdin}

5299

@cindex @file{/dev/stdout}

5300

@cindex @file{/dev/stderr}

5301

@cindex @file{/dev/fd}

5302

@c @cartouche

5303

@table @file

5304

@item /dev/stdin

5305

The standard input (file descriptor 0).

5306

5307

@item /dev/stdout

5308

The standard output (file descriptor 1).

5309

5310

@item /dev/stderr

5311

The standard error output (file descriptor 2).

5312

5313

@item /dev/fd/@var{N}

5314

The file associated with file descriptor @var{N}. Such a file must have

5315

been opened by the program initiating the @code{awk} execution (typically

5316

the shell). Unless you take special pains in the shell from which

5317

you invoke @code{gawk}, only descriptors 0, 1 and 2 are available.

5318

@end table

5319

@c @end cartouche

5320

5321

The file names @file{/dev/stdin}, @file{/dev/stdout}, and @file{/dev/stderr}

5322

are aliases for @file{/dev/fd/0}, @file{/dev/fd/1}, and @file{/dev/fd/2},

5323

respectively, but they are more self-explanatory.

5324

5325

The proper way to write an error message in a @code{gawk} program

5326

is to use @file{/dev/stderr}, like this:

5327

5328

@example

5329

print "Serious error detected!" > "/dev/stderr"

5330

@end example
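
A small helper function can keep this idiom in one place. The following is
only a sketch: the function name @code{warn} and the two-field check are
illustrative assumptions, not something @code{gawk} provides.

@example
# warn: hypothetical helper that writes a message to standard error
function warn(msg)
@{
    print "warning: " msg > "/dev/stderr"
@}

# complain about short records, then skip them
NF < 2 @{
    warn("record " FNR " of " FILENAME " has fewer than two fields")
    next
@}

# normal processing for every other record
@{ print $2 @}
@end example

@noindent
Because the warnings go to standard error, redirecting the program's
standard output with the shell still leaves them visible on the terminal.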

5331

5332

@code{gawk} also provides special file names that give access to information

5333

about the running @code{gawk} process. Each of these ``files'' provides

5334

a single record of information. To read them more than once, you must

5335

first close them with the @code{close} function

5336

(@pxref{Close Files And Pipes, ,Closing Input and Output Files and Pipes}).

5337

The filenames are:

5338

5339

@cindex process information

5340

@cindex @file{/dev/pid}

5341

@cindex @file{/dev/pgrpid}

5342

@cindex @file{/dev/ppid}

5343

@cindex @file{/dev/user}

5344

@c @cartouche

5345

@table @file

5346

@item /dev/pid

5347

Reading this file returns the process ID of the current process,

5348

in decimal, terminated with a newline.

5349

5350

@item /dev/ppid

5351

Reading this file returns the parent process ID of the current process,

5352

in decimal, terminated with a newline.

5353

5354

@item /dev/pgrpid

5355

Reading this file returns the process group ID of the current process,

5356

in decimal, terminated with a newline.

5357

5358

@item /dev/user

5359

Reading this file returns a single record terminated with a newline.

5360

The fields are separated with spaces. The fields represent the

5361

following information:

5362

5363

@table @code

5364

@item $1

5365

The return value of the @code{getuid} system call

5366

(the real user ID number).

5367

5368

@item $2

5369

The return value of the @code{geteuid} system call

5370

(the effective user ID number).

5371

5372

@item $3

5373

The return value of the @code{getgid} system call

5374

(the real group ID number).

5375

5376

@item $4

5377

The return value of the @code{getegid} system call

5378

(the effective group ID number).

5379

@end table

5380

5381

If there are any additional fields, they are the group IDs returned by

5382

the @code{getgroups} system call.

5383

(Multiple groups may not be supported on all systems.)

5384

@end table

5385

@c @end cartouche

5386

5387

These special file names may be used on the command line as data

5388

files, as well as for I/O redirections within an @code{awk} program.

5389

They may not be used as source files with the @samp{-f} option.
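
As an illustration, the process-related files described above might be read
with @code{getline} from a @code{BEGIN} rule. This sketch assumes a
@code{gawk} of the version documented here that is not running in
compatibility mode; the messages it prints are arbitrary.

@example
BEGIN @{
    # /dev/pid supplies a single record: the process ID
    if ((getline pid < "/dev/pid") > 0)
        print "process ID:", pid
    close("/dev/pid")

    # a plain getline from /dev/user sets $0 and NF
    if ((getline < "/dev/user") > 0)
        printf "real uid %d, effective uid %d, %d fields in all\n",
               $1, $2, NF
    close("/dev/user")
@}
@end example

@noindent
Each file is closed after it is read, so that a later @code{getline} from
the same name would start over with a fresh record.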

5390

5391

Recognition of these special file names is disabled if @code{gawk} is in

5392

compatibility mode (@pxref{Options, ,Command Line Options}).

5393

5394

@strong{Caution}: Unless your system actually has a @file{/dev/fd} directory

5395

(or any of the other above listed special files),

5396

the interpretation of these file names is done by @code{gawk} itself.

5397

For example, using @samp{/dev/fd/4} for output will actually write on

5398

file descriptor 4, and not on a new file descriptor that was @code{dup}'ed

5399

from file descriptor 4. Most of the time this does not matter; however, it

5400

is important to @emph{not} close any of the files related to file descriptors

5401

0, 1, and 2. If you do close one of these files, unpredictable behavior

5402

will result.

5403

5404

The special files that provide process-related information may disappear

5405

in a future version of @code{gawk}.

5406

@xref{Future Extensions, ,Probable Future Extensions}.

5407

5408

@node Close Files And Pipes, , Special Files, Printing

5409

@section Closing Input and Output Files and Pipes

5410

@cindex closing input files and pipes

5411

@cindex closing output files and pipes

5412

@findex close

5413

5414

If the same file name or the same shell command is used with

5415

@code{getline}

5416

(@pxref{Getline, ,Explicit Input with @code{getline}})

5417

more than once during the execution of an @code{awk}

5418

program, the file is opened (or the command is executed) only the first time.

5419

At that time, the first record of input is read from that file or command.

5420

The next time the same file or command is used in @code{getline}, another

5421

record is read from it, and so on.

5422

5423

Similarly, when a file or pipe is opened for output, the file name or command

5424

associated with

5425

it is remembered by @code{awk} and subsequent writes to the same file or

5426

command are appended to the previous writes. The file or pipe stays

5427

open until @code{awk} exits.

5428

5429

This implies that if you want to start reading the same file again from

5430

the beginning, or if you want to rerun a shell command (rather than

5431

reading more output from the command), you must take special steps.

5432

What you must do is use the @code{close} function, as follows:

5433

5434

@example
close(@var{filename})
@end example

5437

5438

@noindent

5439

or

5440

5441

@example
close(@var{command})
@end example

5444

5445

The argument @var{filename} or @var{command} can be any expression. Its

5446

value must @emph{exactly} match the string that was used to open the file or

5447

start the command (spaces and other ``irrelevant'' characters

5448

included). For example, if you open a pipe with this:

5449

5450

@example
"sort -r names" | getline foo
@end example

5453

5454

@noindent

5455

then you must close it with this:

5456

5457

@example
close("sort -r names")
@end example

5460

5461

Once this function call is executed, the next @code{getline} from that

5462

file or command, or the next @code{print} or @code{printf} to that

5463

file or command, will reopen the file or rerun the command.
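
For instance, a program that needs two passes over the same file can close
it between the passes. The following is a sketch only; it assumes the sample
file @file{BBS-list}, and the variable names are arbitrary.

@example
BEGIN @{
    file = "BBS-list"

    # first pass: count the records
    while ((getline line < file) > 0)
        total++
    close(file)     # without this, the second loop would find no more input

    # second pass: the file is reopened and read from the beginning
    while ((getline line < file) > 0) @{
        n++
        print n " of " total ": " line
    @}
    close(file)
@}
@end example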

5464

5465

Because the expression that you use to close a file or pipeline must

5466

exactly match the expression used to open the file or run the command,

5467

it is good practice to use a variable to store the file name or command.

5468

The previous example would become

5469

5470

@example
sortcom = "sort -r names"
sortcom | getline foo
@dots{}
close(sortcom)
@end example

5476

5477

@noindent

5478

This helps avoid hard-to-find typographical errors in your @code{awk}

5479

programs.

5480

5481

Here are some reasons why you might need to close an output file:

5482

5483

@itemize @bullet

5484

@item

5485

To write a file and read it back later on in the same @code{awk}

5486

program. Close the file when you are finished writing it; then

5487

you can start reading it with @code{getline}.

5488

5489

@item

5490

To write numerous files, successively, in the same @code{awk}

5491

program. If you don't close the files, eventually you may exceed a

5492

system limit on the number of open files in one process. So close

5493

each one when you are finished writing it.

5494

5495

@item

5496

To make a command finish. When you redirect output through a pipe,

5497

the command reading the pipe normally continues to try to read input

5498

as long as the pipe is open. Often this means the command cannot

5499

really do its work until the pipe is closed. For example, if you

5500

redirect output to the @code{mail} program, the message is not

5501

actually sent until the pipe is closed.

5502

5503

@item

5504

To run the same program a second time, with the same arguments.

5505

This is not the same thing as giving more input to the first run!

5506

5507

For example, suppose you pipe output to the @code{mail} program. If you

5508

output several lines redirected to this pipe without closing it, they make

5509

a single message of several lines. By contrast, if you close the pipe

5510

after each line of output, then each line makes a separate message.

5511

@end itemize
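
The first of these reasons, writing a file and reading it back, can be
sketched as follows. The scratch file name @file{scratch.tmp} is a
hypothetical choice for illustration.

@example
BEGIN @{
    tmp = "scratch.tmp"          # hypothetical scratch file name
    print "first line" > tmp
    print "second line" > tmp    # same open file, so this is appended
    close(tmp)                   # finish writing before reading

    while ((getline line < tmp) > 0)
        print "read back:", line
    close(tmp)
@}
@end example

@noindent
The program prints the two lines it wrote, each preceded by
@samp{read back:}. Without the first @code{close}, the output might still
be sitting in a buffer when the read loop starts.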

5512

5513

@vindex ERRNO

5514

@cindex differences between @code{gawk} and @code{awk}

5515

@code{close} returns a value of zero if the close succeeded.

5516

Otherwise, the value will be non-zero.

5517

In this case, @code{gawk} sets the variable @code{ERRNO} to a string

5518

describing the error that occurred.
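
A program can test this return value. The sketch below follows the
description above and assumes @code{gawk}, since @code{ERRNO} and
@file{/dev/stderr} are not available in every @code{awk}; the command
string is borrowed from the earlier sorting example.

@example
BEGIN @{
    cmd = "sort -r > names.sorted"
    print "aardvark" | cmd
    print "alpo-net" | cmd
    # a non-zero result means the close did not succeed
    if (close(cmd) != 0)
        print "close failed: " ERRNO > "/dev/stderr"
@}
@end example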

5519

5520

@cindex differences between @code{gawk} and @code{awk}

5521

@cindex portability issues

5522

If you use more files than the system allows you to have open,

5523

@code{gawk} will attempt to multiplex the available open files among

5524

your data files. @code{gawk}'s ability to do this depends upon the

5525

facilities of your operating system: it may not always work. It is

5526

therefore both good practice and good portability advice to always

5527

use @code{close} on your files when you are done with them.
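
One way to follow this advice when a program writes to many files is to
close each file right after writing to it. This is a sketch under stated
assumptions: the naming scheme (the first field plus @samp{.out}) is
hypothetical, and the first field is assumed to be usable as part of a
file name.

@example
@{
    out = $1 ".out"     # hypothetical: one output file per first field
    print $0 >> out     # >> so that reopening the file does not erase it
    close(out)          # at most one output file is open at a time
@}
@end example

@noindent
Note the use of @samp{>>}: once a file has been closed, writing to it again
with @samp{>} would reopen it and erase what was written earlier.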

5528

5529

@node Expressions, Patterns and Actions, Printing, Top

5530

@chapter Expressions

5531

@cindex expression

5532

5533

Expressions are the basic building blocks of @code{awk} patterns

5534

and actions. An expression evaluates to a value, which you can print, test,

5535

store in a variable or pass to a function. Additionally, an expression

5536

can assign a new value to a variable or a field, with an assignment operator.

5537

5538

An expression can serve as a pattern or action statement on its own.

5539

Most other kinds of

5540

statements contain one or more expressions which specify data on which to

5541

operate. As in other languages, expressions in @code{awk} include

5542

variables, array references, constants, and function calls, as well as

5543

combinations of these with various operators.

5544

5545

@menu

5546

* Constants:: String, numeric, and regexp constants.

5547

* Using Constant Regexps:: When and how to use a regexp constant.

5548

* Variables:: Variables give names to values for later use.

5549

* Conversion:: The conversion of strings to numbers and vice

5550

versa.

5551

* Arithmetic Ops:: Arithmetic operations (@samp{+}, @samp{-},

5552

etc.)

5553

* Concatenation:: Concatenating strings.

5554

* Assignment Ops:: Changing the value of a variable or a field.

5555

* Increment Ops:: Incrementing the numeric value of a variable.

5556

* Truth Values:: What is ``true'' and what is ``false''.

5557

* Typing and Comparison:: How variables acquire types, and how this

5558

affects comparison of numbers and strings with

5559

@samp{<}, etc.

5560

* Boolean Ops:: Combining comparison expressions using boolean

5561

operators @samp{||} (``or''), @samp{&&}

5562

(``and'') and @samp{!} (``not'').

5563

* Conditional Exp:: Conditional expressions select between two

5564

subexpressions under control of a third

5565

subexpression.

5566

* Function Calls:: A function call is an expression.

5567

* Precedence:: How various operators nest.

5568

@end menu

5569

5570

@node Constants, Using Constant Regexps, Expressions, Expressions

5571

@section Constant Expressions

5572

@cindex constants, types of

5573

@cindex string constants

5574

5575

The simplest type of expression is the @dfn{constant}, which always has

5576

the same value. There are three types of constants: numeric constants,

5577

string constants, and regular expression constants.

5578

5579

@menu

5580

* Scalar Constants:: Numeric and string constants.

5581

* Regexp Constants:: Regular Expression constants.

5582

@end menu

5583

5584

@node Scalar Constants, Regexp Constants, Constants, Constants

5585

@subsection Numeric and String Constants

5586

5587

@cindex numeric constant

5588

@cindex numeric value

5589

A @dfn{numeric constant} stands for a number. This number can be an

5590

integer, a decimal fraction, or a number in scientific (exponential)

5591

notation.@footnote{The internal representation uses double-precision

5592

floating point numbers. If you don't know what that means, then don't

5593

worry about it.} Here are some examples of numeric constants, which all

5594

have the same value:

5595

5596

@example
105
1.05e+2
1050e-1
@end example

5601

5602

A string constant consists of a sequence of characters enclosed in

5603

double-quote marks. For example:

5604

5605

@example
"parrot"
@end example

5608

5609

@noindent

5610

@cindex differences between @code{gawk} and @code{awk}

5611

represents the string whose contents are @samp{parrot}. Strings in

5612

@code{gawk} can be of any length and they can contain any of the possible

5613

eight-bit ASCII characters including ASCII NUL (character code zero).

5614

Other @code{awk}

5615

implementations may have difficulty with some character codes.
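
One quick way to check the constants shown in this section is to print
them from a @code{BEGIN} rule. This is a sketch for experimentation, not an
exhaustive test:

@example
$ awk 'BEGIN @{ print 105, 1.05e+2, 1050e-1, length("parrot") @}'
@print{} 105 105 105 6
@end example

@noindent
All three numeric constants print as the same number, and the string
constant is six characters long.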

5616

5617

@node Regexp Constants, , Scalar Constants, Constants

5618

@subsection Regular Expression Constants

5619

5620

@cindex @code{~} operator

5621

@cindex @code{!~} operator

5622

A regexp constant is a regular expression description enclosed in

5623

slashes, such as @code{@w{/^beginning and end$/}}. Most regexps used in

5624

@code{awk} programs are constant, but the @samp{~} and @samp{!~}

5625

matching operators can also match computed or ``dynamic'' regexps

5626

(which are just ordinary strings or variables that contain a regexp).
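
As a brief sketch of a dynamic regexp, the variable @code{digits} below
(a name chosen only for this illustration) holds a regexp as an ordinary
string, which is then used on the right-hand side of the matching
operators:

@example
BEGIN @{ digits = "^[0-9]+$" @}

$1 ~ digits  @{ print $1, "is all digits" @}
$1 !~ digits @{ print $1, "is not all digits" @}
@end example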

5627

5628

@node Using Constant Regexps, Variables, Constants, Expressions

5629

@section Using Regular Expression Constants

5630

5631

When used on the right hand side of the @samp{~} or @samp{!~}

5632

operators, a regexp constant merely stands for the regexp that is to be

5633

matched.

5634

5635

@cindex dark corner

5636

Regexp constants (such as @code{/foo/}) may be used like simple expressions.

5637

When a

5638

regexp constant appears by itself, it has the same meaning as if it appeared

5639

in a pattern, i.e.@: @samp{($0 ~ /foo/)} (d.c.)

5640

(@pxref{Expression Patterns, ,Expressions as Patterns}).

5641

This means that the two code segments,

5642

5643

@example
if ($0 ~ /barfly/ || $0 ~ /camelot/)
    print "found"
@end example

5647

5648

@noindent

5649

and

5650

5651

@example
if (/barfly/ || /camelot/)
    print "found"
@end example

5655

5656

@noindent

5657

are exactly equivalent.

5658

5659

One rather bizarre consequence of this rule is that the following

5660

boolean expression is valid, but does not do what the user probably

5661

intended:

5662

5663

@example
# note that /foo/ is on the left of the ~
if (/foo/ ~ $1) print "found foo"
@end example

5667

5668

@noindent

5669

This code is ``obviously'' testing @code{$1} for a match against the regexp

5670

@code{/foo/}. But in fact, the expression @samp{/foo/ ~ $1} actually means

5671

@samp{($0 ~ /foo/) ~ $1}. In other words, first match the input record

5672

against the regexp @code{/foo/}. The result will be either zero or one,

5673

depending upon the success or failure of the match. Then match that result

5674

against the first field in the record.

5675

5676

Since it is unlikely that you would ever really wish to make this kind of

5677

test, @code{gawk} will issue a warning when it sees this construct in

5678

a program.
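
The test that was presumably intended puts the field on the left and the
regexp constant on the right. A small sketch, with input invented for the
purpose:

@example
$ echo "food bar" | awk '@{ if ($1 ~ /foo/) print "found foo" @}'
@print{} found foo
@end example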

5679

5680

Another consequence of this rule is that the assignment statement

5681

5682

@example
matches = /foo/
@end example

5685

5686

@noindent

5687

will assign either zero or one to the variable @code{matches}, depending

5688

upon the contents of the current input record.
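
This behavior can occasionally be put to use. The following sketch counts
the input records that match @code{/foo/}; the three input lines and the
variable name @code{total} are made up for the illustration.

@example
$ printf 'foo\nbar\nfood\n' |
> awk '@{ matches = /foo/; total += matches @}
>      END @{ print total @}'
@print{} 2
@end example

@noindent
The first and third records match, so @code{matches} is one for each of
them and zero for the second record.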

5689

5690

This feature of the language was never well documented until the

5691

POSIX specification.

5692

5693

@cindex differences between @code{gawk} and @code{awk}

5694

@cindex dark corner

5695

Constant regular expressions are also used as the first argument for

5696

the @code{gensub}, @code{sub} and @code{gsub} functions, and as the

5697

second argument of the @code{match} function

5698

(@pxref{String Functions, ,Built-in Functions for String Manipulation}).

5699

Modern implementations of @code{awk}, including @code{gawk}, allow

5700

the third argument of @code{split} to be a regexp constant, while some

5701

older implementations do not (d.c.).

5702

5703

This can lead to confusion when attempting to use regexp constants

5704

as arguments to user defined functions

5705

(@pxref{User-defined, , User-defined Functions}).

5706

For example:

5707

5708

@example
function mysub(pat, repl, str, global)
@{
    if (global)
        gsub(pat, repl, str)
    else
        sub(pat, repl, str)
    return str
@}

@{
    @dots{}
    text = "hi! hi yourself!"
    mysub(/hi/, "howdy", text, 1)
    @dots{}
@}
@end example

5725

5726

In this example, the programmer wishes to pass a regexp constant to the

5727

user-defined function @code{mysub}, which will in turn pass it on to

5728

either @code{sub} or @code{gsub}. However, what really happens is that

5729

the @code{pat} parameter will be either one or zero, depending upon whether

5730

or not @code{$0} matches @code{/hi/}.

5731

5732

As it is unlikely that you would ever really wish to pass a truth value

5733

in this way, @code{gawk} will issue a warning when it sees a regexp

5734

constant used as a parameter to a user-defined function.
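
One way around the problem, offered here as a suggestion and not as the
only possible fix, is to pass the pattern as a string and let @code{sub} or
@code{gsub} treat it as a dynamic regexp. The sketch reuses the
@code{mysub} function from the example above and prints its return value:

@example
function mysub(pat, repl, str, global)
@{
    if (global)
        gsub(pat, repl, str)
    else
        sub(pat, repl, str)
    return str
@}

BEGIN @{
    text = "hi! hi yourself!"
    # "hi" is a string, so it reaches gsub unchanged
    print mysub("hi", "howdy", text, 1)
@}
@end example

@noindent
This prints @samp{howdy! howdy yourself!}.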

5735

5736

@node Variables, Conversion, Using Constant Regexps, Expressions

5737

@section Variables

5738

5739

Variables are ways of storing values at one point in your program for

5740

use later in another part of your program. You can manipulate them

5741

entirely within your program text, and you can also assign values to

5742

them on the @code{awk} command line.

5743

5744

@menu

5745

* Using Variables:: Using variables in your programs.

5746

* Assignment Options:: Setting variables on the command line and a

5747

summary of command line syntax. This is an

5748

advanced method of input.

5749

@end menu

5750

5751

@node Using Variables, Assignment Options, Variables, Variables

5752

@subsection Using Variables in a Program

5753

5754

@cindex variables, user-defined

5755

@cindex user-defined variables

5756

Variables let you give names to values and refer to them later. You have

5757

already seen variables in many of the examples. The name of a variable

5758

must be a sequence of letters, digits and underscores, but it may not begin

5759

with a digit. Case is significant in variable names; @code{a} and @code{A}

5760

are distinct variables.

5761

5762

A variable name is a valid expression by itself; it represents the

5763

variable's current value. Variables are given new values with

5764

@dfn{assignment operators}, @dfn{increment operators} and

5765

@dfn{decrement operators}.

5766

@xref{Assignment Ops, ,Assignment Expressions}.

5767

5768

A few variables have special built-in meanings, such as @code{FS}, the

5769

field separator, and @code{NF}, the number of fields in the current

5770

input record. @xref{Built-in Variables}, for a list of them. These

5771

built-in variables can be used and assigned just like all other

5772

variables, but their values are also used or changed automatically by

5773

@code{awk}. All built-in variable names are entirely upper-case.

5774

5775

Variables in @code{awk} can be assigned either numeric or string

5776

values. By default, variables are initialized to the empty string, which

5777

is zero if converted to a number. There is no need to

5778

``initialize'' each variable explicitly in @code{awk},

5779

the way you would in C and in most other traditional languages.
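
A one-line sketch shows both views of a variable that has never been
assigned (the name @code{x} is arbitrary):

@example
$ awk 'BEGIN @{ print "[" x "]", x + 0 @}'
@print{} [] 0
@end example

@noindent
Used as a string, @code{x} is empty; used as a number, it is zero.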

5780

5781

@node Assignment Options, , Using Variables, Variables

5782

@subsection Assigning Variables on the Command Line

5783

5784

You can set any @code{awk} variable by including a @dfn{variable assignment}

5785

among the arguments on the command line when you invoke @code{awk}

5786

(@pxref{Other Arguments, ,Other Command Line Arguments}). Such an assignment has

5787

this form:

5788

5789

@example
@var{variable}=@var{text}
@end example

5792

5793

@noindent

5794

With it, you can set a variable either at the beginning of the

5795

@code{awk} run or in between input files.

5796

5797

If you precede the assignment with the @samp{-v} option, like this:

5798

5799

@example
-v @var{variable}=@var{text}
@end example

5802

5803

@noindent

5804

then the variable is set at the very beginning, before even the

5805

@code{BEGIN} rules are run. The @samp{-v} option and its assignment

5806

must precede all the file name arguments, as well as the program text.

5807

(@xref{Options, ,Command Line Options}, for more information about

5808

the @samp{-v} option.)

5809

5810

Otherwise, the variable assignment is performed at a time determined by

5811

its position among the input file arguments: after the processing of the

5812

preceding input file argument. For example:

5813

5814

@example
awk '@{ print $n @}' n=4 inventory-shipped n=2 BBS-list
@end example

5817

5818

@noindent

5819

prints the value of field number @code{n} for all input records. Before

5820

the first file is read, the command line sets the variable @code{n}

5821

equal to four. This causes the fourth field to be printed in lines from

5822

the file @file{inventory-shipped}. After the first file has finished,

5823

but before the second file is started, @code{n} is set to two, so that the

5824

second field is printed in lines from @file{BBS-list}.

5825

5826

@example
@group
$ awk '@{ print $n @}' n=4 inventory-shipped n=2 BBS-list
@print{} 15
@print{} 24
@dots{}
@print{} 555-5553
@print{} 555-3412
@dots{}
@end group
@end example

5837

5838

Command line arguments are made available for explicit examination by

5839

the @code{awk} program in an array named @code{ARGV}

5840

(@pxref{ARGC and ARGV, ,Using @code{ARGC} and @code{ARGV}}).
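
To contrast the two forms of assignment described in this section, here is
a short sketch that prints a variable from a @code{BEGIN} rule. The
variable name @code{n} and the bracketed message are arbitrary.

@example
$ awk -v n=2 'BEGIN @{ print "n is [" n "]" @}'
@print{} n is [2]
$ awk 'BEGIN @{ print "n is [" n "]" @}' n=2
@print{} n is []
@end example

@noindent
With @samp{-v}, the assignment happens before the @code{BEGIN} rule runs.
Without it, the assignment is an ordinary command line argument that would
only be processed once input begins, which is after the @code{BEGIN} rule
has already printed the still-empty @code{n}.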

5841

5842

@cindex dark corner

5843

@code{awk} processes the values of command line assignments for escape

5844

sequences (d.c.) (@pxref{Escape Sequences}).
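
A small sketch makes the escape processing visible. The variable name
@code{s} is arbitrary, and @code{echo} is used only to supply one empty
input record:

@example
$ echo | awk '@{ print length(s) @}' s='a\tb'
@print{} 3
@end example

@noindent
The two characters @samp{\t} in the argument become a single tab, so the
value of @code{s} is three characters long, not four.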

5845

5846

@node Conversion, Arithmetic Ops, Variables, Expressions

5847

@section Conversion of Strings and Numbers

5848

5849

@cindex conversion of strings and numbers

5850

Strings are converted to numbers, and numbers to strings, if the context

5851

of the @code{awk} program demands it. For example, if the value of

5852

either @code{foo} or @code{bar} in the expression @samp{foo + bar}

5853

happens to be a string, it is converted to a number before the addition

5854

is performed. If numeric values appear in string concatenation, they

5855

are converted to strings. Consider this:

5856

5857

@example
two = 2; three = 3
print (two three) + 4
@end example

5861

5862

@noindent

5863

This prints the (numeric) value 27. The numeric values of

5864

the variables @code{two} and @code{three} are converted to strings and

5865

concatenated together, and the resulting string is converted back to the

5866

number 23, to which four is then added.

5867

5868

@cindex null string

5869

@cindex empty string

5870

@cindex type conversion

5871

If, for some reason, you need to force a number to be converted to a

5872

string, concatenate the empty string, @code{""}, with that number.

5873

To force a string to be converted to a number, add zero to that string.

5874

5875

A string is converted to a number by interpreting any numeric prefix

5876

of the string as numerals:

5877

@code{"2.5"} converts to 2.5, @code{"1e3"} converts to 1000, and @code{"25fix"}

5878

has a numeric value of 25.

5879

Strings that can't be interpreted as valid numbers are converted to

5880

zero.
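
These conversion rules can be checked directly by adding zero to a few
strings. A short sketch:

@example
$ awk 'BEGIN @{ print "2.5" + 0, "1e3" + 0, "25fix" + 0, "fix" + 0 @}'
@print{} 2.5 1000 25 0
@end example

@noindent
The last string has no numeric prefix at all, so it converts to zero.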

5881

5882

@vindex CONVFMT

5883

The exact manner in which numbers are converted into strings is controlled

5884

by the @code{awk} built-in variable @code{CONVFMT} (@pxref{Built-in Variables}).

5885

Numbers are converted using the @code{sprintf} function

5886

(@pxref{String Functions, ,Built-in Functions for String Manipulation})

5887

with @code{CONVFMT} as the format

5888

specifier.

5889

5890

@code{CONVFMT}'s default value is @code{"%.6g"}, which prints a value with

5891

at most six significant digits. For some applications you will want to

5892

change it to specify more precision. Double precision on most modern

5893

machines gives you 16 or 17 decimal digits of precision.
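
The effect of @code{CONVFMT} can be seen by converting the same number to a
string twice, once with the default format and once with a different one.
In this sketch the value and the format @code{"%.2f"} are arbitrary
choices:

@example
$ awk 'BEGIN @{ x = 3.14159265
>              s = x ""; print s
>              CONVFMT = "%.2f"
>              s = x ""; print s @}'
@print{} 3.14159
@print{} 3.14
@end example

@noindent
Concatenating @code{x} with the empty string forces the conversion, so
@code{s} is a string in both cases and @code{print} outputs it unchanged.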

5894

5895

Strange results can happen if you set @code{CONVFMT} to a string that doesn't

5896

tell @code{sprintf} how to format floating point numbers in a useful way.

5897

For example, if you forget the @samp{%} in the format, all numbers will be

5898

converted to the same constant string.

5899

5900

@cindex dark corner

5901

As a special case, if a number is an integer, then the result of converting

5902

it to a string is @emph{always} an integer, no matter what the value of

5903

@code{CONVFMT} may be. Given the following code fragment:

5904

5905

@example
CONVFMT = "%2.2f"
a = 12
b = a ""
@end example

5910

5911

@noindent

5912

@code{b} has the value @code{"12"}, not @code{"12.00"} (d.c.).

5913

5914

@cindex @code{awk} language, POSIX version

5915

@cindex POSIX @code{awk}

5916

@vindex OFMT

5917

Prior to the POSIX standard, @code{awk} specified that the value

5918

of @code{OFMT} was used for converting numbers to strings. @code{OFMT}

5919

specifies the output format to use when printing numbers with @code{print}.

5920

@code{CONVFMT} was introduced in order to separate the semantics of

5921

conversion from the semantics of printing. Both @code{CONVFMT} and

5922

@code{OFMT} have the same default value: @code{"%.6g"}. In the vast majority

5923

of cases, old @code{awk} programs will not change their behavior.

5924

However, this use of @code{OFMT} is something to keep in mind if you must

5925

port your program to other implementations of @code{awk}; we recommend

5926

that instead of changing your programs, you just port @code{gawk} itself!

5927

@xref{Print, ,The @code{print} Statement},

5928

for more information on the @code{print} statement.
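
The separation between the two variables can be sketched by giving them
different values. This assumes an @code{awk} that follows the POSIX rules,
such as @code{gawk}; the formats and the value of @code{x} are arbitrary.

@example
$ awk 'BEGIN @{ OFMT = "%.2f"; CONVFMT = "%.4f"
>              x = 3.14159265
>              print x
>              print x "" @}'
@print{} 3.14
@print{} 3.1416
@end example

@noindent
The first @code{print} outputs a number, so @code{OFMT} applies. The second
outputs a string that was produced by concatenation, so the digits were
chosen by @code{CONVFMT}.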

5929

5930

@node Arithmetic Ops, Concatenation, Conversion, Expressions

5931

@section Arithmetic Operators

5932

@cindex arithmetic operators

5933

@cindex operators, arithmetic

5934

@cindex addition

5935

@cindex subtraction

5936

@cindex multiplication

5937

@cindex division

5938

@cindex remainder

5939

@cindex quotient

5940

@cindex exponentiation

5941

5942

The @code{awk} language uses the common arithmetic operators when

5943

evaluating expressions. All of these arithmetic operators follow normal

5944

precedence rules, and work as you would expect them to.

5945

5946

Here is a file @file{grades} containing a list of student names and

5947

three test scores per student (it's a small class):

5948

5949

@example

5950

Pat 100 97 58

5951

Sandy 84 72 93

5952

Chris 72 92 89

5953

@end example

5954

5955

@noindent

5956

This program takes the file @file{grades} and prints the average
of the scores.

5958

5959

@example

5960

$ awk '@{ sum = $2 + $3 + $4 ; avg = sum / 3

5961

> print $1, avg @}' grades

5962

@print{} Pat 85

5963

@print{} Sandy 83

5964

@print{} Chris 84.3333

5965

@end example

5966

5967

This table lists the arithmetic operators in @code{awk}, in order from

5968

highest precedence to lowest:

5969

5970

@c sigh. this seems necessary

5971

@iftex

5972

@page

5973

@end iftex

5974

@c @cartouche

5975

@table @code

5976

@cindex @code{awk} language, POSIX version
@cindex POSIX @code{awk}
@item @var{x} ^ @var{y}
@itemx @var{x} ** @var{y}
Exponentiation: @var{x} raised to the @var{y} power.  @samp{2 ^ 3} has
the value eight.  The character sequence @samp{**} is equivalent to
@samp{^}.  (The POSIX standard only specifies the use of @samp{^}
for exponentiation.)  Exponentiation binds more tightly than the unary
operators, so @samp{-2 ^ 2} has the value @minus{}4.

@item - @var{x}
Negation.

@item + @var{x}
Unary plus.  The expression is converted to a number.

5990

5991

@item @var{x} * @var{y}

5992

Multiplication.

5993

5994

@item @var{x} / @var{y}

5995

Division.  Since all numbers in @code{awk} are
floating-point numbers, the result is not rounded to an integer: @samp{3 / 4}
has the value 0.75.

5998

5999

@item @var{x} % @var{y}

6000

@cindex differences between @code{gawk} and @code{awk}

6001

Remainder. The quotient is rounded toward zero to an integer,

6002

multiplied by @var{y} and this result is subtracted from @var{x}.

6003

This operation is sometimes known as ``trunc-mod.'' The following

6004

relation always holds:

6005

6006

@example

6007

b * int(a / b) + (a % b) == a

6008

@end example

6009

6010

One possibly undesirable effect of this definition of remainder is that

6011

@code{@var{x} % @var{y}} is negative if @var{x} is negative. Thus,

6012

6013

@example

6014

-17 % 8 = -1

6015

@end example

6016

6017

In other @code{awk} implementations, the signedness of the remainder

6018

may be machine dependent.

6019

@c !!! what does posix say?

6020

6021

@item @var{x} + @var{y}

6022

Addition.

6023

6024

@item @var{x} - @var{y}

6025

Subtraction.

6026

@end table

6027

@c @end cartouche

6028

6029

For maximum portability, do not use the @samp{**} operator.

6030

6031

Unary plus and minus have the same precedence,
the multiplicative operators (@samp{*}, @samp{/} and @samp{%}) all have
the same precedence, and
addition and subtraction have the same precedence.
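
The following one-liner is a small sketch that exercises several of the
operators described above in a single @code{print} statement.  The
remainder shown is the result that @code{gawk} produces; as noted
earlier, other implementations may differ for negative operands.

@example
$ awk 'BEGIN @{ print 2 ^ 3, 3 / 4, -17 % 8, 2 + 3 * 4 @}'
@print{} 8 0.75 -1 14
@end example

@noindent
The last value is 14, not 20, because multiplication has higher
precedence than addition.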

6034

6035

@node Concatenation, Assignment Ops, Arithmetic Ops, Expressions

6036

@section String Concatenation

6037

6038

@cindex string operators

6039

@cindex operators, string

6040

@cindex concatenation

6041

There is only one string operation: concatenation. It does not have a

6042

specific operator to represent it. Instead, concatenation is performed by

6043

writing expressions next to one another, with no operator. For example:

6044

6045

@example

6046

@group

6047

$ awk '@{ print "Field number one: " $1 @}' BBS-list

6048

@print{} Field number one: aardvark

6049

@print{} Field number one: alpo-net

6050

@dots{}

6051

@end group

6052

@end example

6053

6054

Without the space in the string constant after the @samp{:}, the line

6055

would run together. For example:

6056

6057

@example

6058

@group

6059

$ awk '@{ print "Field number one:" $1 @}' BBS-list

6060

@print{} Field number one:aardvark

6061

@print{} Field number one:alpo-net

6062

@dots{}

6063

@end group

6064

@end example

6065

6066

Since string concatenation does not have an explicit operator, it is
often necessary to ensure that it happens where you want it to by
using parentheses to enclose the items to be concatenated.  For example, the
following code fragment does not concatenate @code{file} and @code{name}
as you might expect:

6072

6073

@example

6074

file = "file"

6075

name = "name"

6076

print "something meaningful" > file name

6077

@end example

6078

6079

@noindent

6080

It is necessary to use the following:

6081

6082

@example

6083

print "something meaningful" > (file name)

6084

@end example

6085

6086

We recommend that you use parentheses around concatenation in all but the

6087

most common contexts (such as on the right-hand side of @samp{=}).
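
Parentheses also matter when concatenation is mixed with arithmetic,
since the arithmetic operators bind more tightly than concatenation.
The following fragment is a minimal sketch of the difference; the
variable names @code{x} and @code{y} are only for illustration:

@example
BEGIN @{
    x = "sum: " 1 + 2      # same as "sum: " (1 + 2)
    y = ("sum: " 1) + 2    # "sum: 1" has the numeric value zero
    print x
    print y
@}
@end example

@noindent
This prints @samp{sum: 3} and then @samp{2}.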

6088

6089

@node Assignment Ops, Increment Ops, Concatenation, Expressions

6090

@section Assignment Expressions

6091

@cindex assignment operators

6092

@cindex operators, assignment

6093

@cindex expression, assignment

6094

6095

An @dfn{assignment} is an expression that stores a new value into a

6096

variable. For example, let's assign the value one to the variable

6097

@code{z}:

6098

6099

@example

6100

z = 1

6101

@end example

6102

6103

After this expression is executed, the variable @code{z} has the value one.

6104

Whatever old value @code{z} had before the assignment is forgotten.

6105

6106

Assignments can store string values also. For example, this would store

6107

the value @code{"this food is good"} in the variable @code{message}:

6108

6109

@example

6110

thing = "food"

6111

predicate = "good"

6112

message = "this " thing " is " predicate

6113

@end example

6114

6115

@noindent

6116

(This also illustrates string concatenation.)

6117

6118

The @samp{=} sign is called an @dfn{assignment operator}. It is the

6119

simplest assignment operator because the value of the right-hand

6120

operand is stored unchanged.

6121

6122

@cindex side effect

6123

Most operators (addition, concatenation, and so on) have no effect

6124

except to compute a value. If you ignore the value, you might as well

6125

not use the operator. An assignment operator is different; it does

6126

produce a value, but even if you ignore the value, the assignment still

6127

makes itself felt through the alteration of the variable. We call this

6128

a @dfn{side effect}.

6129

6130

@cindex lvalue

6131

@cindex rvalue

6132

The left-hand operand of an assignment need not be a variable

6133

(@pxref{Variables}); it can also be a field

6134

(@pxref{Changing Fields, ,Changing the Contents of a Field}) or

6135

an array element (@pxref{Arrays, ,Arrays in @code{awk}}).

6136

These are all called @dfn{lvalues},

6137

which means they can appear on the left-hand side of an assignment operator.

6138

The right-hand operand may be any expression; it produces the new value

6139

which the assignment stores in the specified variable, field or array

6140

element.  (Such values are called @dfn{rvalues}.)

6141

6142

@cindex types of variables

6143

It is important to note that variables do @emph{not} have permanent types.

6144

The type of a variable is simply the type of whatever value it happens

6145

to hold at the moment. In the following program fragment, the variable

6146

@code{foo} has a numeric value at first, and a string value later on:

6147

6148

@example

6149

foo = 1

6150

print foo

6151

foo = "bar"

6152

print foo

6153

@end example

6154

6155

@noindent

6156

When the second assignment gives @code{foo} a string value, the fact that

6157

it previously had a numeric value is forgotten.

6158

6159

String values that do not begin with a digit have a numeric value of

6160

zero. After executing this code, the value of @code{foo} is five:

6161

6162

@example

6163

foo = "a string"

6164

foo = foo + 5

6165

@end example

6166

6167

@noindent

6168

(Note that using a variable as a number and then later as a string can

6169

be confusing and is poor programming style. The above examples illustrate how

6170

@code{awk} works, @emph{not} how you should write your own programs!)

6171

6172

An assignment is an expression, so it has a value: the same value that

6173

is assigned. Thus, @samp{z = 1} as an expression has the value one.

6174

One consequence of this is that you can write multiple assignments together:

6175

6176

@example

6177

x = y = z = 0

6178

@end example

6179

6180

@noindent

6181

stores the value zero in all three variables. It does this because the

6182

value of @samp{z = 0}, which is zero, is stored into @code{y}, and then

6183

the value of @samp{y = z = 0}, which is zero, is stored into @code{x}.
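
Here is a short sketch showing both the chained assignment and the use
of an assignment's value inside a larger expression.  The variables
@code{w} and @code{n} are introduced only for this illustration:

@example
$ awk 'BEGIN @{ x = y = z = 0
>              n = (w = 5) + 1    # the value of w = 5 is five
>              print x, y, z, w, n @}'
@print{} 0 0 0 5 6
@end example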

6184

6185

You can use an assignment anywhere an expression is called for. For

6186

example, it is valid to write @samp{x != (y = 1)} to set @code{y} to one

6187

and then test whether @code{x} equals one. But this style tends to make

6188

programs hard to read; except in a one-shot program, you should

6189

not use such nesting of assignments.

6190

6191

Aside from @samp{=}, there are several other assignment operators that

6192

do arithmetic with the old value of the variable. For example, the

6193

operator @samp{+=} computes a new value by adding the right-hand value

6194

to the old value of the variable. Thus, the following assignment adds

6195

five to the value of @code{foo}:

6196

6197

@example

6198

foo += 5

6199

@end example

6200

6201

@noindent

6202

This is equivalent to the following:

6203

6204

@example

6205

foo = foo + 5

6206

@end example

6207

6208

@noindent

6209

Use whichever one makes the meaning of your program clearer.

6210

6211

There are situations where using @samp{+=} (or any assignment operator)

6212

is @emph{not} the same as simply repeating the left-hand operand in the

6213

right-hand expression. For example:

6214

6215

@cindex Rankin, Pat

6216

@example

6217

@group

6218

# Thanks to Pat Rankin for this example

6219

BEGIN @{

6220

foo[rand()] += 5

6221

for (x in foo)

6222

print x, foo[x]

6223

6224

bar[rand()] = bar[rand()] + 5

6225

for (x in bar)

6226

print x, bar[x]

6227

@}

6228

@end group

6229

@end example

6230

6231

@noindent

6232

The indices of @code{bar} are practically guaranteed to be different, because
@code{rand} returns a different value each time it is called.

6234

(Arrays and the @code{rand} function haven't been covered yet.

6235

@xref{Arrays, ,Arrays in @code{awk}},

6236

and see @ref{Numeric Functions, ,Numeric Built-in Functions}, for more information.)

6237

This example illustrates an important fact about the assignment

6238

operators: the left-hand expression is only evaluated @emph{once}.

6239

6240

It is also up to the implementation as to which expression is evaluated

6241

first, the left-hand one or the right-hand one.

6242

Consider this example:

6243

6244

@example

6245

i = 1

6246

a[i += 2] = i + 1

6247

@end example

6248

6249

@noindent

6250

The value of @code{a[3]} could be either two or four.
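
If the result must not depend on the order of evaluation, one possible
approach (a sketch, not the only way to do it) is to move the side
effect into a statement of its own:

@example
i = 1
i += 2           # i is now three
a[i] = i + 1     # a[3] is four, whichever side is evaluated first
@end example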

6251

6252

Here is a table of the arithmetic assignment operators. In each

6253

case, the right-hand operand is an expression whose value is converted

6254

to a number.

6255

6256

@c @cartouche

6257

@table @code

6258

@item @var{lvalue} += @var{increment}

6259

Adds @var{increment} to the value of @var{lvalue} to make the new value

6260

of @var{lvalue}.

6261

6262

@item @var{lvalue} -= @var{decrement}

6263

Subtracts @var{decrement} from the value of @var{lvalue}.

6264

6265

@item @var{lvalue} *= @var{coefficient}

6266

Multiplies the value of @var{lvalue} by @var{coefficient}.

6267

6268

@item @var{lvalue} /= @var{divisor}

6269

Divides the value of @var{lvalue} by @var{divisor}.

6270

6271

@item @var{lvalue} %= @var{modulus}

6272

Sets @var{lvalue} to its remainder by @var{modulus}.

6273

6274

@cindex @code{awk} language, POSIX version

6275

@cindex POSIX @code{awk}

6276

@item @var{lvalue} ^= @var{power}

6277

@itemx @var{lvalue} **= @var{power}

6278

Raises @var{lvalue} to the power @var{power}.

6279

(Only the @samp{^=} operator is specified by POSIX.)

6280

@end table

6281

@c @end cartouche

6282

6283

For maximum portability, do not use the @samp{**=} operator.
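
As a quick sketch, the following program applies each of the portable
arithmetic assignment operators in turn to a single variable, printing
the value after every step.  The variable name @code{n} is arbitrary:

@example
BEGIN @{
    n = 10
    n += 5; print n    # 15
    n -= 3; print n    # 12
    n *= 2; print n    # 24
    n /= 4; print n    # 6
    n %= 4; print n    # 2
    n ^= 3; print n    # 8
@}
@end example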

6284

6285

@node Increment Ops, Truth Values, Assignment Ops, Expressions

6286

@section Increment and Decrement Operators

6287

6288

@cindex increment operators

6289

@cindex operators, increment

6290

@dfn{Increment} and @dfn{decrement operators} increase or decrease the value of

6291

a variable by one. You could do the same thing with an assignment operator, so

6292

the increment operators add no power to the @code{awk} language; but they

6293

are convenient abbreviations for very common operations.

6294

6295

The operator to add one is written @samp{++}. It can be used to increment

6296

a variable either before or after taking its value.

6297

6298

To pre-increment a variable @var{v}, write @samp{++@var{v}}. This adds

6299

one to the value of @var{v} and that new value is also the value of this

6300

expression. The assignment expression @samp{@var{v} += 1} is completely

6301

equivalent.

6302

6303

Writing the @samp{++} after the variable specifies post-increment. This

6304

increments the variable value just the same; the difference is that the

6305

value of the increment expression itself is the variable's @emph{old}

6306

value. Thus, if @code{foo} has the value four, then the expression @samp{foo++}

6307

has the value four, but it changes the value of @code{foo} to five.

6308

6309

The post-increment @samp{foo++} is nearly equivalent to writing @samp{(foo

6310

+= 1) - 1}. It is not perfectly equivalent because all numbers in

6311

@code{awk} are floating point: in floating point, @samp{foo + 1 - 1} does

6312

not necessarily equal @code{foo}. But the difference is minute as

6313

long as you stick to numbers that are fairly small (less than 10e12).

6314

6315

Any lvalue can be incremented. Fields and array elements are incremented

6316

just like variables. (Use @samp{$(i++)} when you wish to do a field reference

6317

and a variable increment at the same time. The parentheses are necessary

6318

because of the precedence of the field reference operator, @samp{$}.)

6319

6320

@cindex decrement operators

6321

@cindex operators, decrement

6322

The decrement operator @samp{--} works just like @samp{++} except that

6323

it subtracts one instead of adding. Like @samp{++}, it can be used before

6324

the lvalue to pre-decrement or after it to post-decrement.

6325

6326

Here is a summary of increment and decrement expressions.

6327

6328

@c @cartouche

6329

@table @code

6330

@item ++@var{lvalue}

6331

This expression increments @var{lvalue} and the new value becomes the

6332

value of the expression.

6333

6334

@item @var{lvalue}++

6335

This expression increments @var{lvalue}, but

6336

the value of the expression is the @emph{old} value of @var{lvalue}.

6337

6338

@item --@var{lvalue}

6339

Like @samp{++@var{lvalue}}, but instead of adding, it subtracts. It

6340

decrements @var{lvalue} and delivers the value that results.

6341

6342

@item @var{lvalue}--

6343

Like @samp{@var{lvalue}++}, but instead of adding, it subtracts. It

6344

decrements @var{lvalue}. The value of the expression is the @emph{old}

6345

value of @var{lvalue}.

6346

@end table

6347

@c @end cartouche
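
The following sketch walks one variable through all four forms.  The
comments show the value printed by each statement:

@example
BEGIN @{
    foo = 4
    print foo++    # prints 4; foo is now 5
    print ++foo    # prints 6; foo is now 6
    print foo--    # prints 6; foo is now 5
    print --foo    # prints 4; foo is now 4
@}
@end example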

6348

6349

@node Truth Values, Typing and Comparison, Increment Ops, Expressions

6350

@section True and False in @code{awk}

6351

@cindex truth values

6352

@cindex logical true

6353

@cindex logical false

6354

6355

Many programming languages have a special representation for the concepts

6356

of ``true'' and ``false.'' Such languages usually use the special

6357

constants @code{true} and @code{false}, or perhaps their upper-case

6358

equivalents.

6359

6360

@cindex null string

6361

@cindex empty string

6362

@code{awk} is different. It borrows a very simple concept of true and

6363

false from C. In @code{awk}, any non-zero numeric value, @emph{or} any

6364

non-empty string value is true. Any other value (zero or the null

6365

string, @code{""}) is false. The following program will print @samp{A strange

6366

truth value} three times:

6367

6368

@example

6369

BEGIN @{

6370

if (3.1415927)

6371

print "A strange truth value"

6372

if ("Four Score And Seven Years Ago")

6373

print "A strange truth value"

6374

if (j = 57)

6375

print "A strange truth value"

6376

@}

6377

@end example

6378

6379

@cindex dark corner

6380

There is a surprising consequence of the ``non-zero or non-null'' rule:

6381

The string constant @code{"0"} is actually true, since it is non-null (d.c.).
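
The difference between the number zero, the null string, and the string
constant @code{"0"} can be seen with a small test such as this one
(a sketch; the labels are only for illustration):

@example
BEGIN @{
    print "numeric 0:", (0 ? "true" : "false")
    print "null string:", ("" ? "true" : "false")
    print "string \"0\":", ("0" ? "true" : "false")
@}
@end example

@noindent
The first two lines of output end in @samp{false}, and the third
ends in @samp{true}.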

6382

6383

@node Typing and Comparison, Boolean Ops, Truth Values, Expressions

6384

@section Variable Typing and Comparison Expressions

6385

@cindex comparison expressions

6386

@cindex expression, comparison

6387

@cindex expression, matching

6388

@cindex relational operators

6389

@cindex operators, relational

6390

@cindex regexp match/non-match operators

6391

@cindex variable typing

6392

@cindex types of variables

6393

6394

@c 2e: consider splitting this section into subsections

6395

6396

Unlike variables in many other programming languages, @code{awk} variables
do not have a fixed type.  Instead, they can be either a number or a string,
depending upon the value that is assigned to them.

6399

6400

@cindex numeric string

6401

The 1992 POSIX standard introduced

6402

the concept of a @dfn{numeric string}, which is simply a string that looks

6403

like a number, for example, @code{@w{" +2"}}. This concept is used

6404

for determining the type of a variable.

6405

6406

The type of the variable is important, since the types of two variables

6407

determine how they are compared.

6408

6409

In @code{gawk}, variable typing follows these rules.

6410

6411

@enumerate 1

6412

@item

6413

A numeric literal or the result of a numeric operation has the @var{numeric}

6414

attribute.

6415

6416

@item

6417

A string literal or the result of a string operation has the @var{string}

6418

attribute.

6419

6420

@item

6421

Fields, @code{getline} input, @code{FILENAME}, @code{ARGV} elements,

6422

@code{ENVIRON} elements and the

6423

elements of an array created by @code{split} that are numeric strings

6424

have the @var{strnum} attribute. Otherwise, they have the @var{string}

6425

attribute.

6426

Uninitialized variables also have the @var{strnum} attribute.

6427

6428

@item

6429

Attributes propagate across assignments, but are not changed by

6430

any use.

6431

@c (Although a use may cause the entity to acquire an additional

6432

@c value such that it has both a numeric and string value -- this leaves the

6433

@c attribute unchanged.)

6434

@c This is important but not relevant

6435

@end enumerate

6436

6437

The last rule is particularly important. In the following program,

6438

@code{a} has numeric type, even though it is later used in a string

6439

operation.

6440

6441

@example

6442

BEGIN @{

6443

a = 12.345

6444

b = a " is a cute number"

6445

print b

6446

@}

6447

@end example

6448

6449

When two operands are compared, either string comparison or numeric comparison

6450

may be used, depending on the attributes of the operands, according to the

6451

following, symmetric, matrix:

6452

6453

@c thanks to Karl Berry, kb@cs.umb.edu, for major help with TeX tables

6454

@tex

6455

\centerline{
\vbox{\bigskip % space above the table (about 1 linespace)
% Because we have vertical rules, we can't let TeX insert interline space
% in its usual way.
\offinterlineskip
%
% Define the table template. & separates columns, and \cr ends the
% template (and each row). # is replaced by the text of that entry on
% each row. The template for the first column breaks down like this:
%   \strut -- a way to make each line have the height and depth
%             of a normal line of type, since we turned off interline spacing.
%   \hfil -- infinite glue; has the effect of right-justifying in this case.
%   # -- replaced by the text (for instance, `STRNUM', in the last row).
%   \quad -- about the width of an `M'. Just separates the columns.
%
% The second column (\vrule#) is what generates the vertical rule that
% spans table rows.
%
% The doubled && before the next entry means `repeat the following
% template as many times as necessary on each line' -- in our case, twice.
%
% The template itself, \quad#\hfil, left-justifies with a little space before.
%
\halign{\strut\hfil#\quad&\vrule#&&\quad#\hfil\cr
	&&STRING	&NUMERIC	&STRNUM\cr
% The \omit tells TeX to skip inserting the template for this column on
% this particular row. In this case, we only want a little extra space
% to separate the heading row from the rule below it. the depth 2pt --
% `\vrule depth 2pt' is that little space.
\omit	&depth 2pt\cr
% This is the horizontal rule below the heading. Since it has nothing to
% do with the columns of the table, we use \noalign to get it in there.
\noalign{\hrule}
% Like above, this time a little more space.
\omit	&depth 4pt\cr
% The remaining rows have nothing special about them.
STRING	&&string	&string		&string\cr
NUMERIC	&&string	&numeric	&numeric\cr
STRNUM	&&string	&numeric	&numeric\cr
}}}

6495

@end tex

6496

@ifinfo

6497

@display

6498

        +----------------------------------------------
        |       STRING          NUMERIC         STRNUM
--------+----------------------------------------------
        |
STRING  |       string          string          string
        |
NUMERIC |       string          numeric         numeric
        |
STRNUM  |       string          numeric         numeric
--------+----------------------------------------------

6508

@end display

6509

@end ifinfo

6510

6511

The basic idea is that user input that looks numeric, and @emph{only}

6512

user input, should be treated as numeric, even though it is actually

6513

made of characters, and is therefore also a string.

6514

6515

@dfn{Comparison expressions} compare strings or numbers for

6516

relationships such as equality. They are written using @dfn{relational

6517

operators}, which are a superset of those in C. Here is a table of

6518

them:

6519

6520

@cindex relational operators

6521

@cindex operators, relational

6522

@cindex @code{<} operator

6523

@cindex @code{<=} operator

6524

@cindex @code{>} operator

6525

@cindex @code{>=} operator

6526

@cindex @code{==} operator

6527

@cindex @code{!=} operator

6528

@cindex @code{~} operator

6529

@cindex @code{!~} operator

6530

@cindex @code{in} operator

6531

@c @cartouche

6532

@table @code

6533

@item @var{x} < @var{y}

6534

True if @var{x} is less than @var{y}.

6535

6536

@item @var{x} <= @var{y}

6537

True if @var{x} is less than or equal to @var{y}.

6538

6539

@item @var{x} > @var{y}

6540

True if @var{x} is greater than @var{y}.

6541

6542

@item @var{x} >= @var{y}

6543

True if @var{x} is greater than or equal to @var{y}.

6544

6545

@item @var{x} == @var{y}

6546

True if @var{x} is equal to @var{y}.

6547

6548

@item @var{x} != @var{y}

6549

True if @var{x} is not equal to @var{y}.

6550

6551

@item @var{x} ~ @var{y}

6552

True if the string @var{x} matches the regexp denoted by @var{y}.

6553

6554

@item @var{x} !~ @var{y}

6555

True if the string @var{x} does not match the regexp denoted by @var{y}.

6556

6557

@item @var{subscript} in @var{array}

6558

True if the array @var{array} has an element with the subscript @var{subscript}.

6559

@end table

6560

@c @end cartouche

6561

6562

Comparison expressions have the value one if true and zero if false.

6563

6564

When comparing operands of mixed types, numeric operands are converted

6565

to strings using the value of @code{CONVFMT}

6566

(@pxref{Conversion, ,Conversion of Strings and Numbers}).

6567

6568

Strings are compared

6569

by comparing the first character of each, then the second character of each,

6570

and so on. Thus @code{"10"} is less than @code{"9"}. If there are two

6571

strings where one is a prefix of the other, the shorter string is less than

6572

the longer one. Thus @code{"abc"} is less than @code{"abcd"}.
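
A brief sketch makes the contrast with numeric comparison concrete.
The first and third comparisons below involve string constants and so
are done character by character; the second involves numeric constants:

@example
$ awk 'BEGIN @{ print ("10" < "9"), (10 < 9), ("abc" < "abcd") @}'
@print{} 1 0 1
@end example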

6573

6574

@cindex common mistakes

6575

@cindex mistakes, common

6576

@cindex errors, common

6577

It is very easy to accidentally mistype the @samp{==} operator, and

6578

leave off one of the @samp{=}s. The result is still valid @code{awk}

6579

code, but the program will not do what you mean:

6580

6581

@example

6582

if (a = b) # oops! should be a == b

6583

@dots{}

6584

else

6585

@dots{}

6586

@end example

6587

6588

@noindent

6589

Unless @code{b} happens to be zero or the null string, the @code{if}

6590

part of the test will always succeed. Because the operators are

6591

so similar, this kind of error is very difficult to spot when

6592

scanning the source code.

6593

6594

Here are some sample expressions, how @code{gawk} compares them, and what

6595

the result of the comparison is.

6596

6597

@table @code

6598

@item 1.5 <= 2.0

6599

numeric comparison (true)

6600

6601

@item "abc" >= "xyz"

6602

string comparison (false)

6603

6604

@item 1.5 != " +2"

6605

string comparison (true)

6606

6607

@item "1e2" < "3"

6608

string comparison (true)

6609

6610

@item a = 2; b = "2"

6611

@itemx a == b

6612

string comparison (true)

6613

6614

@item a = 2; b = " +2"

6615

@itemx a == b

6616

string comparison (false)

6617

@end table

6618

6619

In this example,

6620

6621

@example

6622

@group

6623

$ echo 1e2 3 | awk '@{ print ($1 < $2) ? "true" : "false" @}'

6624

@print{} false

6625

@end group

6626

@end example

6627

6628

@noindent

6629

the result is @samp{false} since both @code{$1} and @code{$2} are numeric

6630

strings and thus both have the @var{strnum} attribute,

6631

dictating a numeric comparison.
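
When the input may or may not look numeric, you can choose the kind of
comparison yourself.  The following sketch relies on the typing rules
given earlier: concatenating the null string yields a value with the
@var{string} attribute, while adding zero yields a value with the
@var{numeric} attribute.

@example
$ echo 1e2 3 | awk '@{ print ((($1 "") < ($2 "")) ? "true" : "false") @}'
@print{} true
$ echo 1e2 3 | awk '@{ print ((($1 + 0) < ($2 + 0)) ? "true" : "false") @}'
@print{} false
@end example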

6632

6633

The purpose of the comparison rules and the use of numeric strings is

6634

to attempt to produce the behavior that is ``least surprising,'' while

6635

still ``doing the right thing.''

6636

6637

@cindex comparisons, string vs. regexp

6638

@cindex string comparison vs. regexp comparison

6639

@cindex regexp comparison vs. string comparison

6640

String comparisons and regular expression comparisons are very different.

6641

For example,

6642

6643

@example

6644

x == "foo"

6645

@end example

6646

6647

@noindent

6648

has the value of one, or is true, if the variable @code{x}

6649

is precisely @samp{foo}. By contrast,

6650

6651

@example

6652

x ~ /foo/

6653

@end example

6654

6655

@noindent

6656

has the value one if @code{x} contains @samp{foo}, such as

6657

@code{"Oh, what a fool am I!"}.
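
The two kinds of comparison can be tried side by side on the same
value, as in this small sketch:

@example
$ awk 'BEGIN @{ x = "Oh, what a fool am I!"
>              print (x == "foo"), (x ~ /foo/) @}'
@print{} 0 1
@end example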

6658

6659

The right-hand operand of the @samp{~} and @samp{!~} operators may be

6660

either a regexp constant (@code{/@dots{}/}), or an ordinary

6661

expression, in which case the value of the expression as a string is used as a

6662

dynamic regexp (@pxref{Regexp Usage, ,How to Use Regular Expressions}; also

6663

@pxref{Computed Regexps, ,Using Dynamic Regexps}).

6664

6665

@cindex regexp as expression

6666

In recent implementations of @code{awk}, a constant regular

6667

expression in slashes by itself is also an expression. The regexp

6668

@code{/@var{regexp}/} is an abbreviation for this comparison expression:

6669

6670

@example

6671

$0 ~ /@var{regexp}/

6672

@end example

6673

6674

One special place where @code{/foo/} is @emph{not} an abbreviation for

6675

@samp{$0 ~ /foo/} is when it is the right-hand operand of @samp{~} or

6676

@samp{!~}!

6677

@xref{Using Constant Regexps, ,Using Regular Expression Constants},

6678

where this is discussed in more detail.

6679

6680

@c This paragraph has been here since day 1, and has always bothered

6681

@c me, especially since the expression doesn't really make a lot of

6682

@c sense. So, just take it out.

6683

@ignore

6684

In some contexts it may be necessary to write parentheses around the

6685

regexp to avoid confusing the @code{gawk} parser. For example,

6686

@samp{(/x/ - /y/) > threshold} is not allowed, but @samp{((/x/) - (/y/))

6687

> threshold} parses properly.

6688

@end ignore

6689

6690

@node Boolean Ops, Conditional Exp, Typing and Comparison, Expressions

6691

@section Boolean Expressions

6692

@cindex expression, boolean

6693

@cindex boolean expressions

6694

@cindex operators, boolean

6695

@cindex boolean operators

6696

@cindex logical operations

6697

@cindex operations, logical

6698

@cindex short-circuit operators

6699

@cindex operators, short-circuit

6700

@cindex and operator

6701

@cindex or operator

6702

@cindex not operator

6703

@cindex @code{&&} operator

6704

@cindex @code{||} operator

6705

@cindex @code{!} operator

6706

6707

A @dfn{boolean expression} is a combination of comparison expressions or

6708

matching expressions, using the boolean operators ``or''

6709

(@samp{||}), ``and'' (@samp{&&}), and ``not'' (@samp{!}), along with

6710

parentheses to control nesting. The truth value of the boolean expression is

6711

computed by combining the truth values of the component expressions.

6712

Boolean expressions are also referred to as @dfn{logical expressions}.

6713

The terms are equivalent.

6714

6715

Boolean expressions can be used wherever comparison and matching

6716

expressions can be used. They can be used in @code{if}, @code{while},

6717

@code{do} and @code{for} statements

6718

(@pxref{Statements, ,Control Statements in Actions}).

6719

They have numeric values (one if true, zero if false), which come into play

6720

if the result of the boolean expression is stored in a variable, or

6721

used in arithmetic.

6722

6723

In addition, every boolean expression is also a valid pattern, so

6724

you can use one as a pattern to control the execution of rules.

6725

6726

Here are descriptions of the three boolean operators, with examples.

6727

6728

@c @cartouche

6729

@table @code

6730

@item @var{boolean1} && @var{boolean2}

6731

True if both @var{boolean1} and @var{boolean2} are true. For example,

6732

the following statement prints the current input record if it contains

6733

both @samp{2400} and @samp{foo}.

6734

6735

@example

6736

if ($0 ~ /2400/ && $0 ~ /foo/) print

6737

@end example

6738

6739

The subexpression @var{boolean2} is evaluated only if @var{boolean1}

6740

is true. This can make a difference when @var{boolean2} contains

6741

expressions that have side effects: in the case of @samp{$0 ~ /foo/ &&

6742

($2 == bar++)}, the variable @code{bar} is not incremented if there is

6743

no @samp{foo} in the record.

6744

6745

@item @var{boolean1} || @var{boolean2}

6746

True if at least one of @var{boolean1} or @var{boolean2} is true.

6747

For example, the following statement prints all records in the input

6748

that contain @emph{either} @samp{2400} or

6749

@samp{foo}, or both.

6750

6751

@example

6752

if ($0 ~ /2400/ || $0 ~ /foo/) print

6753

@end example

6754

6755

The subexpression @var{boolean2} is evaluated only if @var{boolean1}

6756

is false. This can make a difference when @var{boolean2} contains

6757

expressions that have side effects.

6758

6759

@item ! @var{boolean}

6760

True if @var{boolean} is false. For example, the following program prints

6761

all records in the input file @file{BBS-list} that do @emph{not} contain the

6762

string @samp{foo}.

6763

6764

@c A better example would be `if (! (subscript in array)) ...' but we

6765

@c haven't done anything with arrays or `in' yet. Sigh.

6766

@example
awk '@{ if (! ($0 ~ /foo/)) print @}' BBS-list
@end example

6769

@end table

6770

@c @end cartouche

6771

6772

The @samp{&&} and @samp{||} operators are called @dfn{short-circuit}

6773

operators because of the way they work. Evaluation of the full expression

6774

is ``short-circuited'' if the result can be determined part way through

6775

its evaluation.
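As a minimal sketch of short-circuit evaluation, the following shell command feeds three made-up records to @code{awk}. The input data and the counter @code{hits} are assumptions chosen for illustration. The variable @code{bar} is incremented only for the records that contain @samp{foo}, because the right-hand operand of @samp{&&} is skipped for the others:

```shell
# bar++ runs only when the left operand ($0 ~ /foo/) is true.
short_circuit_out=$(printf 'foo 0\nbaz 1\nfoo 1\n' |
    awk '$0 ~ /foo/ && ($2 == bar++) { hits++ }
         END { print "bar =", bar, "hits =", hits + 0 }')
printf '%s\n' "$short_circuit_out"
```

Two of the three records contain @samp{foo}, so @code{bar} ends up with the value two.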

6776

6777

@cindex line continuation

6778

You can continue a statement that uses @samp{&&} or @samp{||} simply

6779

by putting a newline after them. But you cannot put a newline in front

6780

of either of these operators without using backslash continuation

6781

(@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}).
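The following sketch shows a pattern continued onto a second line by ending the first line with @samp{&&}. The sample records are made up for illustration:

```shell
# The newline comes after &&, so no backslash is needed.
continued_out=$(printf '2 x\n3 y\n1 x\n' |
    awk '$1 > 1 &&
         $2 == "x" { print }')
printf '%s\n' "$continued_out"
```

Only the record @samp{2 x} satisfies both conditions.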

6782

6783

The actual value of an expression using the @samp{!} operator will be

6784

either one or zero, depending upon the truth value of the expression it

6785

is applied to.

6786

6787

The @samp{!} operator is often useful for changing the sense of a flag

6788

variable from false to true and back again. For example, the following

6789

program is one way to print lines in between special bracketing lines:

6790

6791

@example
$1 == "START"   @{ interested = ! interested @}
interested == 1 @{ print @}
$1 == "END"     @{ interested = ! interested @}
@end example

6796

6797

@noindent

6798

The variable @code{interested}, like all @code{awk} variables, starts

6799

out initialized to zero, which is also false. When a line is seen whose

6800

first field is @samp{START}, the value of @code{interested} is toggled

6801

to true, using @samp{!}. The next rule prints lines as long as

6802

@code{interested} is true. When a line is seen whose first field is

6803

@samp{END}, @code{interested} is toggled back to false.
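Here is one way to try the flag program on a small, made-up input (the six sample lines are an assumption for illustration). Note that, as written, the program prints the bracketing lines themselves as well as the lines between them:

```shell
# Lines outside the START ... END bracket ("a" and "d") are not printed.
toggle_out=$(printf 'a\nSTART\nb\nc\nEND\nd\n' | awk '
$1 == "START"   { interested = ! interested }
interested == 1 { print }
$1 == "END"     { interested = ! interested }')
printf '%s\n' "$toggle_out"
```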

6804

@ignore

6805

We should discuss using `next' in the two rules that toggle the

6806

variable, to avoid printing the bracketing lines, but that's more

6807

distraction than really needed.

6808

@end ignore

6809

6810

@node Conditional Exp, Function Calls, Boolean Ops, Expressions

6811

@section Conditional Expressions

6812

@cindex conditional expression

6813

@cindex expression, conditional

6814

6815

A @dfn{conditional expression} is a special kind of expression with

6816

three operands. It allows you to use one expression's value to select

6817

one of two other expressions.

6818

6819

The conditional expression is the same as in the C language:

6820

6821

@example
@var{selector} ? @var{if-true-exp} : @var{if-false-exp}
@end example

6824

6825

@noindent

6826

There are three subexpressions. The first, @var{selector}, is always

6827

computed first. If it is ``true'' (not zero and not null) then

6828

@var{if-true-exp} is computed next and its value becomes the value of

6829

the whole expression. Otherwise, @var{if-false-exp} is computed next

6830

and its value becomes the value of the whole expression.

6831

6832

For example, this expression produces the absolute value of @code{x}:

6833

6834

@example
x > 0 ? x : -x
@end example

6837

6838

Each time the conditional expression is computed, exactly one of

6839

@var{if-true-exp} and @var{if-false-exp} is computed; the other is ignored.

6840

This is important when the expressions contain side effects. For example,

6841

this conditional expression examines element @code{i} of either array

6842

@code{a} or array @code{b}, and increments @code{i}.

6843

6844

@example
x == y ? a[i++] : b[i++]
@end example

6847

6848

@noindent

6849

This is guaranteed to increment @code{i} exactly once, because each time

6850

only one of the two increment expressions is executed,

6851

and the other is not.

6852

@xref{Arrays, ,Arrays in @code{awk}},

6853

for more information about arrays.
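The two conditional expressions above can be tried in a @code{BEGIN} rule. This is only a sketch: the array contents and the starting values of @code{x}, @code{y} and @code{i} are made up for illustration.

```shell
cond_out=$(awk 'BEGIN {
    a[0] = "from a"; b[0] = "from b"
    i = 0; x = 1; y = 1
    v = (x == y ? a[i++] : b[i++])   # only a[i++] is evaluated
    print v, i                       # i was incremented exactly once
    x = -3
    print (x > 0 ? x : -x)           # absolute value of x
}')
printf '%s\n' "$cond_out"
```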

6854

6855

@cindex differences between @code{gawk} and @code{awk}

6856

@cindex line continuation

6857

As a minor @code{gawk} extension,

6858

you can continue a statement that uses @samp{?:} simply

6859

by putting a newline after either character.

6860

However, you cannot put a newline in front

6861

of either character without using backslash continuation

6862

(@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}).

6863

6864

@node Function Calls, Precedence, Conditional Exp, Expressions

6865

@section Function Calls

6866

@cindex function call

6867

@cindex calling a function

6868

6869

A @dfn{function} is a name for a particular calculation. Because it has

6870

a name, you can ask for it by name at any point in the program. For

6871

example, the function @code{sqrt} computes the square root of a number.

6872

6873

A fixed set of functions are @dfn{built-in}, which means they are

6874

available in every @code{awk} program. The @code{sqrt} function is one

6875

of these. @xref{Built-in, ,Built-in Functions}, for a list of built-in

6876

functions and their descriptions. In addition, you can define your own

6877

functions for use in your program.

6878

@xref{User-defined, ,User-defined Functions}, for how to do this.

6879

6880

@cindex arguments in function call

6881

The way to use a function is with a @dfn{function call} expression,

6882

which consists of the function name followed immediately by a list of

6883

@dfn{arguments} in parentheses. The arguments are expressions which

6884

provide the raw materials for the function's calculations.

6885

When there is more than one argument, they are separated by commas. If

6886

there are no arguments, write just @samp{()} after the function name.

6887

Here are some examples:

6888

6889

@example
sqrt(x^2 + y^2)        @i{one argument}
atan2(y, x)            @i{two arguments}
rand()                 @i{no arguments}
@end example

6894

6895

@strong{Do not put any space between the function name and the

6896

open-parenthesis!} A user-defined function name looks just like the name of

6897

a variable, and space would make the expression look like concatenation

6898

of a variable with an expression inside parentheses. Space before the

6899

parenthesis is harmless with built-in functions, but it is best not to get
into the habit of using it; that way you avoid mistakes with user-defined
functions.

6902

6903

Each function expects a particular number of arguments. For example, the

6904

@code{sqrt} function must be called with a single argument, the number

6905

to take the square root of:

6906

6907

@example
sqrt(@var{argument})
@end example

6910

6911

Some of the built-in functions allow you to omit the final argument.

6912

If you do so, they use a reasonable default.

6913

@xref{Built-in, ,Built-in Functions}, for full details. If arguments

6914

are omitted in calls to user-defined functions, then those arguments are

6915

treated as local variables, initialized to the empty string

6916

(@pxref{User-defined, ,User-defined Functions}).
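As a small sketch of an omitted argument, the hypothetical function @code{describe} below takes two parameters but is called with only one. The function name and its parameters are made up for this illustration. The missing parameter @code{note} behaves like a local variable whose value is the empty string:

```shell
omitted_arg_out=$(awk '
function describe(value, note) {
    # note is omitted by the caller, so it starts out empty
    return value " [" note "] length=" length(note)
}
BEGIN { print describe("x") }')
printf '%s\n' "$omitted_arg_out"
```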

6917

6918

Like every other expression, the function call has a value, which is

6919

computed by the function based on the arguments you give it. In this

6920

example, the value of @samp{sqrt(@var{argument})} is the square root of

6921

@var{argument}. A function can also have side effects, such as assigning

6922

values to certain variables or doing I/O.

6923

6924

Here is a command to read numbers, one number per line, and print the

6925

square root of each one:

6926

6927

@example
@group
$ awk '@{ print "The square root of", $1, "is", sqrt($1) @}'
1
@print{} The square root of 1 is 1
3
@print{} The square root of 3 is 1.73205
5
@print{} The square root of 5 is 2.23607
@kbd{Control-d}
@end group
@end example

6939

6940

@node Precedence, , Function Calls, Expressions

6941

@section Operator Precedence (How Operators Nest)

6942

@cindex precedence

6943

@cindex operator precedence

6944

6945

@dfn{Operator precedence} determines how operators are grouped, when

6946

different operators appear close by in one expression. For example,

6947

@samp{*} has higher precedence than @samp{+}; thus, @samp{a + b * c}

6948

means to multiply @code{b} and @code{c}, and then add @code{a} to the

6949

product (i.e.@: @samp{a + (b * c)}).

6950

6951

You can overrule the precedence of the operators by using parentheses.

6952

You can think of the precedence rules as saying where the

6953

parentheses are assumed to be if you do not write parentheses yourself. In

6954

fact, it is wise to always use parentheses whenever you have an unusual

6955

combination of operators, because other people who read the program may

6956

not remember what the precedence is in this case. You might forget,

6957

too; then you could make a mistake. Explicit parentheses will help prevent

6958

any such mistake.

6959

6960

When operators of equal precedence are used together, the leftmost

6961

operator groups first, except for the assignment, conditional and

6962

exponentiation operators, which group in the opposite order.

6963

Thus, @samp{a - b + c} groups as @samp{(a - b) + c}, and

6964

@samp{a = b = c} groups as @samp{a = (b = c)}.
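The grouping rules can be checked with a few constant expressions. The numbers below are arbitrary values chosen for illustration:

```shell
grouping_out=$(awk 'BEGIN {
    print 10 - 4 + 3      # groups as (10 - 4) + 3
    print 2 ^ 3 ^ 2       # groups as 2 ^ (3 ^ 2)
    a = b = 7             # groups as a = (b = 7)
    print a, b
}')
printf '%s\n' "$grouping_out"
```

If exponentiation grouped left-to-right instead, the second line would be 64 rather than 512.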

6965

6966

The precedence of prefix unary operators does not matter as long as only

6967

unary operators are involved, because there is only one way to interpret

6968

them---innermost first. Thus, @samp{$++i} means @samp{$(++i)} and

6969

@samp{++$x} means @samp{++($x)}. However, when another operator follows

6970

the operand, then the precedence of the unary operators can matter.

6971

Thus, @samp{$x^2} means @samp{($x)^2}, but @samp{-x^2} means

6972

@samp{-(x^2)}, because @samp{-} has lower precedence than @samp{^}

6973

while @samp{$} has higher precedence.
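A short sketch of these cases, using a made-up record with two fields (@samp{3 5}) and made-up values for @code{x} and @code{i}:

```shell
unary_out=$(printf '3 5\n' | awk '{
    x = 2
    print $x^2      # ($x)^2, the square of the second field
    print -x^2      # -(x^2)
    i = 0
    print $++i      # $(++i), the first field
    print ++$x      # ++($x), increments the second field
}')
printf '%s\n' "$unary_out"
```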

6974

6975

Here is a table of @code{awk}'s operators, in order from highest

6976

precedence to lowest:

6977

6978

@c use @code in the items, looks better in TeX w/o all the quotes

6979

@table @code

6980

@item (@dots{})

6981

Grouping.

6982

6983

@item $

6984

Field.

6985

6986

@item ++ --

6987

Increment, decrement.

6988

6989

@cindex @code{awk} language, POSIX version

6990

@cindex POSIX @code{awk}

6991

@item ^ **

6992

Exponentiation. These operators group right-to-left.

6993

(The @samp{**} operator is not specified by POSIX.)

6994

6995

@item + - !

6996

Unary plus, minus, logical ``not''.

6997

6998

@item * / %

6999

Multiplication, division, modulus.

7000

7001

@item + -

7002

Addition, subtraction.

7003

7004

@item @r{Concatenation}

7005

No special token is used to indicate concatenation.

7006

The operands are simply written side by side.

7007

7008

@item < <= == !=

7009

@itemx > >= >> |

7010

Relational, and redirection.

7011

The relational operators and the redirections have the same precedence

7012

level. Characters such as @samp{>} serve both as relationals and as

7013

redirections; the context distinguishes between the two meanings.

7014

7015

Note that the I/O redirection operators in @code{print} and @code{printf}

7016

statements belong to the statement level, not to expressions. The

7017

redirection does not produce an expression which could be the operand of

7018

another operator. As a result, it does not make sense to use a

7019

redirection operator near another operator of lower precedence, without

7020

parentheses. Such combinations, for example @samp{print foo > a ? b : c},

7021

result in syntax errors.

7022

The correct way to write this statement is @samp{print foo > (a ? b : c)}.

7023

7024

@item ~ !~

7025

Matching, non-matching.

7026

7027

@item in

7028

Array membership.

7029

7030

@item &&

7031

Logical ``and''.

7032

7033

@item ||

7034

Logical ``or''.

7035

7036

@item ?:

7037

Conditional. This operator groups right-to-left.

7038

7039

@cindex @code{awk} language, POSIX version

7040

@cindex POSIX @code{awk}

7041

@item = += -= *=

7042

@itemx /= %= ^= **=

7043

Assignment. These operators group right-to-left.

7044

(The @samp{**=} operator is not specified by POSIX.)

7045

@end table
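The parenthesized form of a redirection whose target is a conditional expression can be sketched as follows. The temporary directory, the file names and the variable @code{flag} are assumptions made for this illustration:

```shell
redir_dir=$(mktemp -d)
awk -v dir="$redir_dir" 'BEGIN {
    yes = dir "/yes.txt"; no = dir "/no.txt"
    flag = 1
    # The parentheses make the whole conditional the redirection target.
    print "hello" > (flag ? yes : no)
}'
redir_out=$(cat "$redir_dir/yes.txt")
printf '%s\n' "$redir_out"
rm -r "$redir_dir"
```

Because @code{flag} is true, the output goes to the first file.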

7046

7047

@node Patterns and Actions, Statements, Expressions, Top

7048

@chapter Patterns and Actions

7049

@cindex pattern, definition of

7050

7051

As you have already seen, each @code{awk} rule consists of

7052

a pattern with an associated action. This chapter describes how

7053

you build patterns and actions.

7054

7055

@menu

7056

* Pattern Overview:: What goes into a pattern.

7057

* Action Overview:: What goes into an action.

7058

@end menu

7059

7060

@node Pattern Overview, Action Overview, Patterns and Actions, Patterns and Actions

7061

@section Pattern Elements

7062

7063

Patterns in @code{awk} control the execution of rules: a rule is

7064

executed when its pattern matches the current input record. This

7065

section explains all about how to write patterns.

7066

7067

@menu

7068

* Kinds of Patterns:: A list of all kinds of patterns.

7069

* Regexp Patterns:: Using regexps as patterns.

7070

* Expression Patterns:: Any expression can be used as a pattern.

7071

* Ranges:: Pairs of patterns specify record ranges.

7072

* BEGIN/END:: Specifying initialization and cleanup rules.

7073

* Empty:: The empty pattern, which matches every record.

7074

@end menu

7075

7076

@node Kinds of Patterns, Regexp Patterns, Pattern Overview, Pattern Overview

7077

@subsection Kinds of Patterns

7078

@cindex patterns, types of

7079

7080

Here is a summary of the types of patterns supported in @code{awk}.

7081

7082

@table @code

7083

@item /@var{regular expression}/

7084

A regular expression as a pattern. It matches when the text of the

7085

input record fits the regular expression.

7086

(@xref{Regexp, ,Regular Expressions}.)

7087

7088

@item @var{expression}

7089

A single expression. It matches when its value

7090

is non-zero (if a number) or non-null (if a string).

7091

(@xref{Expression Patterns, ,Expressions as Patterns}.)

7092

7093

@item @var{pat1}, @var{pat2}

7094

A pair of patterns separated by a comma, specifying a range of records.

7095

The range includes both the initial record that matches @var{pat1}, and

7096

the final record that matches @var{pat2}.

7097

(@xref{Ranges, ,Specifying Record Ranges with Patterns}.)

7098

7099

@item BEGIN

7100

@itemx END

7101

Special patterns for you to supply start-up or clean-up actions for your

7102

@code{awk} program.

7103

(@xref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}.)

7104

7105

@item @var{empty}

7106

The empty pattern matches every input record.

7107

(@xref{Empty, ,The Empty Pattern}.)

7108

@end table
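The kinds of patterns summarized above can be combined in one program. The following is only a sketch; the five input records and the labels printed by each rule are made up for illustration:

```shell
kinds_out=$(printf 'alpha 1\non\nbeta 0\noff\nfoo 2\n' | awk '
BEGIN                   { print "start" }          # special pattern
/foo/                   { print "regexp:", $0 }    # regular expression
$2 > 1                  { print "expr:", $0 }      # expression
$1 == "on", $1 == "off" { print "range:", $0 }     # range
                        { n++ }                    # empty pattern
END                     { print "records:", n }')
printf '%s\n' "$kinds_out"
```

Each record is tested against every rule in order, so a single record can trigger more than one action.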

7109

7110

@node Regexp Patterns, Expression Patterns, Kinds of Patterns, Pattern Overview

7111

@subsection Regular Expressions as Patterns

7112

7113

We have been using regular expressions as patterns since our early examples.

7114

This kind of pattern is simply a regexp constant in the pattern part of

7115

a rule. Its meaning is @samp{$0 ~ /@var{pattern}/}.

7116

The pattern matches when the input record matches the regexp.

7117

For example:

7118

7119

@example
/foo|bar|baz/  @{ buzzwords++ @}
END            @{ print buzzwords, "buzzwords seen" @}
@end example

7123

7124

@node Expression Patterns, Ranges, Regexp Patterns, Pattern Overview

7125

@subsection Expressions as Patterns

7126

7127

Any @code{awk} expression is valid as an @code{awk} pattern.

7128

The pattern matches if the expression's value is non-zero (if a
number) or non-null (if a string).

7130

7131

The expression is reevaluated each time the rule is tested against a new

7132

input record. If the expression uses fields such as @code{$1}, the

7133

value depends directly on the new input record's text; otherwise, it

7134

depends only on what has happened so far in the execution of the

7135

@code{awk} program, but that may still be useful.

7136

7137

A very common kind of expression used as a pattern is the comparison

7138

expression, using the comparison operators described in

7139

@ref{Typing and Comparison, ,Variable Typing and Comparison Expressions}.

7140

7141

Regexp matching and non-matching are also very common expressions.

7142

The left operand of the @samp{~} and @samp{!~} operators is a string.

7143

The right operand is either a constant regular expression enclosed in

7144

slashes (@code{/@var{regexp}/}), or any expression, whose string value

7145

is used as a dynamic regular expression

7146

(@pxref{Computed Regexps, , Using Dynamic Regexps}).

7147

7148

The following example prints the second field of each input record

7149

whose first field is precisely @samp{foo}.

7150

7151

@example
$ awk '$1 == "foo" @{ print $2 @}' BBS-list
@end example

7154

7155

@noindent

7156

(There is no output, since there is no BBS site named ``foo''.)

7157

Contrast this with the following regular expression match, which would

7158

accept any record with a first field that contains @samp{foo}:

7159

7160

@example
@group
$ awk '$1 ~ /foo/ @{ print $2 @}' BBS-list
@print{} 555-1234
@print{} 555-6699
@print{} 555-6480
@print{} 555-2127
@end group
@end example

7169

7170

Boolean expressions are also commonly used as patterns.

7171

Whether the pattern

7172

matches an input record depends on whether its subexpressions match.

7173

7174

For example, the following command prints all records in

7175

@file{BBS-list} that contain both @samp{2400} and @samp{foo}.

7176

7177

@example
$ awk '/2400/ && /foo/' BBS-list
@print{} fooey        555-1234     2400/1200/300     B
@end example

7181

7182

The following command prints all records in

7183

@file{BBS-list} that contain @emph{either} @samp{2400} or @samp{foo}, or

7184

both.

7185

7186

@example
@group
$ awk '/2400/ || /foo/' BBS-list
@print{} alpo-net     555-3412     2400/1200/300     A
@print{} bites        555-1675     2400/1200/300     A
@print{} fooey        555-1234     2400/1200/300     B
@print{} foot         555-6699     1200/300          B
@print{} macfoo       555-6480     1200/300          A
@print{} sdace        555-3430     2400/1200/300     A
@print{} sabafoo      555-2127     1200/300          C
@end group
@end example

7198

7199

The following command prints all records in

7200

@file{BBS-list} that do @emph{not} contain the string @samp{foo}.

7201

7202

@example
@group
$ awk '! /foo/' BBS-list
@print{} aardvark     555-5553     1200/300          B
@print{} alpo-net     555-3412     2400/1200/300     A
@print{} barfly       555-7685     1200/300          A
@print{} bites        555-1675     2400/1200/300     A
@print{} camelot      555-0542     300               C
@print{} core         555-2912     1200/300          C
@print{} sdace        555-3430     2400/1200/300     A
@end group
@end example

7214

7215

The subexpressions of a boolean operator in a pattern can be constant regular

7216

expressions, comparisons, or any other @code{awk} expressions. Range

7217

patterns are not expressions, so they cannot appear inside boolean

7218

patterns. Likewise, the special patterns @code{BEGIN} and @code{END},

7219

which never match any input record, are not expressions and cannot

7220

appear inside boolean patterns.

7221

7222

A regexp constant as a pattern is also a special case of an expression

7223

pattern. @code{/foo/} as an expression has the value one if @samp{foo}

7224

appears in the current input record; thus, as a pattern, @code{/foo/}

7225

matches any record containing @samp{foo}.
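To see the value of a regexp constant used as an expression, the sketch below stores it in a variable and prints it next to the equivalent explicit match. The two input records and the variable @code{v} are made up for illustration:

```shell
regexp_value_out=$(printf 'foot\nbar\n' |
    awk '{ v = (/foo/); print v, ($0 ~ /foo/) }')
printf '%s\n' "$regexp_value_out"
```

Both columns agree for every record: one when the record contains @samp{foo}, zero otherwise.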

7226

7227

@node Ranges, BEGIN/END, Expression Patterns, Pattern Overview

7228

@subsection Specifying Record Ranges with Patterns

7229

7230

@cindex range pattern

7231

@cindex pattern, range

7232

@cindex matching ranges of lines

7233

A @dfn{range pattern} is made of two patterns separated by a comma, of

7234

the form @samp{@var{begpat}, @var{endpat}}. It matches ranges of

7235

consecutive input records. The first pattern, @var{begpat}, controls

7236

where the range begins, and the second one, @var{endpat}, controls where

7237

it ends. For example,

7238

7239

@example
awk '$1 == "on", $1 == "off"'
@end example

7242

7243

@noindent

7244

prints every record between @samp{on}/@samp{off} pairs, inclusive.

7245

7246

A range pattern starts out by matching @var{begpat}

7247

against every input record; when a record matches @var{begpat}, the

7248

range pattern becomes @dfn{turned on}. The range pattern matches this

7249

record. As long as it stays turned on, it automatically matches every

7250

input record read. It also matches @var{endpat} against

7251

every input record; when that succeeds, the range pattern is turned

7252

off again for the following record. Then it goes back to checking

7253

@var{begpat} against each record.

7254

7255

The record that turns on the range pattern and the one that turns it

7256

off both match the range pattern. If you don't want to operate on

7257

these records, you can write @code{if} statements in the rule's action

7258

to distinguish them from the records you are interested in.

7259

7260

It is possible for a pattern to be turned both on and off by the same

7261

record, if the record satisfies both conditions. Then the action is

7262

executed for just that record.
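The behavior just described can be observed with a small sketch. The markers @samp{<} and @samp{>} and the sample records are assumptions chosen for illustration. One range spans three records, and a later record turns the range both on and off by itself:

```shell
range_out=$(printf 'x\n<a\nb\nc>\ny\n<one-line>\nz\n' | awk '/</,/>/')
printf '%s\n' "$range_out"
```

The records @samp{x}, @samp{y} and @samp{z} lie outside every range and are not printed.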

7263

7264

For example, suppose you have text between two identical markers (say

7265

the @samp{%} symbol) that you wish to ignore. You might try to

7266

combine a range pattern that describes the delimited text with the

7267

@code{next} statement

7268

(not discussed yet, @pxref{Next Statement, , The @code{next} Statement}),

7269

which causes @code{awk} to skip any further processing of the current

7270

record and start over again with the next input record. Such a program

7271

would look like this:

7272

7273

@example
/^%$/,/^%$/    @{ next @}
               @{ print @}
@end example

7277

7278

@noindent

7279

@cindex skipping lines between markers

7280

This program fails because the range pattern is both turned on and turned off

7281

by the first line with just a @samp{%} on it. To accomplish this task, you

7282

must write the program this way, using a flag:

7283

7284

@example
/^%$/     @{ skip = ! skip; next @}
skip == 1 @{ next @} # skip lines with `skip' set
          @{ print @} # print everything else
@end example
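The difference between the two approaches shows up on a small, made-up input in which one line is enclosed between two @samp{%} markers. In this sketch the flag version includes a final printing rule so that the surviving lines are visible:

```shell
# Range version: each % line turns the range on and off by itself,
# so the enclosed line is not skipped.
range_skip_out=$(printf 'keep 1\n%%\nhidden\n%%\nkeep 2\n' | awk '
/^%$/,/^%$/ { next }
            { print }')

# Flag version: the enclosed line is skipped.
flag_skip_out=$(printf 'keep 1\n%%\nhidden\n%%\nkeep 2\n' | awk '
/^%$/     { skip = ! skip; next }
skip == 1 { next }
          { print }')

printf '%s\n' "$range_skip_out"
printf '%s\n' "$flag_skip_out"
```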

7288

7289

Note that in a range pattern, the @samp{,} has the lowest precedence

7290

(is evaluated last) of all the operators. Thus, for example, the

7291

following program attempts to combine a range pattern with another,

7292

simpler test.

7293

7294

@example
echo Yes | awk '/1/,/2/ || /Yes/'
@end example

7297

7298

The author of this program intended it to mean @samp{(/1/,/2/) || /Yes/}.

7299

However, @code{awk} interprets this as @samp{/1/, (/2/ || /Yes/)}.

7300

This cannot be changed or worked around; range patterns do not combine

7301

with other patterns.
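A sketch that makes the actual grouping visible: with the single record @samp{Yes} nothing is printed, and with a longer made-up input the record @samp{Yes} only serves to end a range that was started by @samp{1}.

```shell
# /1/ never matches, so the range never starts and nothing is printed.
yes_only_out=$(echo Yes | awk '/1/,/2/ || /Yes/')

# The range starts at "1" and ends at the first record matching /2/ or /Yes/.
yes_range_out=$(printf 'Yes\n1\nx\nYes\nafter\n' | awk '/1/,/2/ || /Yes/')
printf '%s\n' "$yes_range_out"
```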

7302

7303

@node BEGIN/END, Empty, Ranges, Pattern Overview

7304

@subsection The @code{BEGIN} and @code{END} Special Patterns

7305

7306

@cindex @code{BEGIN} special pattern

7307

@cindex pattern, @code{BEGIN}

7308

@cindex @code{END} special pattern

7309

@cindex pattern, @code{END}

7310

@code{BEGIN} and @code{END} are special patterns. They are not used to

7311

match input records. Rather, they supply start-up or

7312

clean-up actions for your @code{awk} script.

7313

7314

@menu

7315

* Using BEGIN/END:: How and why to use BEGIN/END rules.

7316

* I/O And BEGIN/END:: I/O issues in BEGIN/END rules.

7317

@end menu

7318

7319

@node Using BEGIN/END, I/O And BEGIN/END, BEGIN/END, BEGIN/END

7320

@subsubsection Startup and Cleanup Actions

7321

7322

A @code{BEGIN} rule is executed, once, before the first input record

7323

has been read. An @code{END} rule is executed, once, after all the

7324

input has been read. For example:

7325

7326

@example
@group
$ awk '
> BEGIN @{ print "Analysis of \"foo\"" @}
> /foo/ @{ ++n @}
> END   @{ print "\"foo\" appears " n " times." @}' BBS-list
@print{} Analysis of "foo"
@print{} "foo" appears 4 times.
@end group
@end example

7336

7337

This program finds the number of records in the input file @file{BBS-list}

7338

that contain the string @samp{foo}. The @code{BEGIN} rule prints a title

7339

for the report. There is no need to use the @code{BEGIN} rule to

7340

initialize the counter @code{n} to zero, as @code{awk} does this

7341

automatically (@pxref{Variables}).

7342

7343

The second rule increments the variable @code{n} every time a

7344

record containing the pattern @samp{foo} is read. The @code{END} rule

7345

prints the value of @code{n} at the end of the run.

7346

7347

The special patterns @code{BEGIN} and @code{END} cannot be used in ranges

7348

or with boolean operators (indeed, they cannot be used with any operators).

7349

7350

An @code{awk} program may have multiple @code{BEGIN} and/or @code{END}

7351

rules. They are executed in the order they appear, all the @code{BEGIN}

7352

rules at start-up and all the @code{END} rules at termination.

7353

@code{BEGIN} and @code{END} rules may be intermixed with other rules.

7354

This feature was added in the 1987 version of @code{awk}, and is included

7355

in the POSIX standard. The original (1978) version of @code{awk}

7356

required you to put the @code{BEGIN} rule at the beginning of the

7357

program, and the @code{END} rule at the end, and only allowed one of

7358

each. This is no longer required, but it is a good idea in terms of

7359

program organization and readability.
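The following sketch intermixes two @code{BEGIN} rules and two @code{END} rules with an ordinary rule. The messages and the two input records are made up for illustration:

```shell
multi_rule_out=$(printf 'one\ntwo\n' | awk '
BEGIN { print "first BEGIN" }
END   { print "first END" }
      { count++ }
BEGIN { print "second BEGIN" }
END   { print "second END, records:", count }')
printf '%s\n' "$multi_rule_out"
```

Both @code{BEGIN} rules run before any input is read, and both @code{END} rules run after all of it, each group in the order written.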

7360

7361

Multiple @code{BEGIN} and @code{END} rules are useful for writing

7362

library functions, since each library file can have its own @code{BEGIN} and/or

7363

@code{END} rule to do its own initialization and/or cleanup. Note that

7364

the order in which library functions are named on the command line

7365

controls the order in which their @code{BEGIN} and @code{END} rules are

7366

executed. Therefore you have to be careful to write such rules in

7367

library files so that the order in which they are executed doesn't matter.

7368

@xref{Options, ,Command Line Options}, for more information on

7369

using library functions.

7370

@xref{Library Functions, ,A Library of @code{awk} Functions},

7371

for a number of useful library functions.

7372

7373

@cindex dark corner

7374

If an @code{awk} program only has a @code{BEGIN} rule, and no other

7375

rules, then the program exits after the @code{BEGIN} rule has been run.

7376

(The original version of @code{awk} used to keep reading and ignoring input

7377

until end of file was seen.) However, if an @code{END} rule exists,

7378

then the input will be read, even if there are no other rules in

7379

the program. This is necessary in case the @code{END} rule checks the

7380

@code{FNR} and @code{NR} variables (d.c.).
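As a sketch of this behavior, the commands below run a @code{BEGIN}-only program and an @code{END}-only program against the same three-line file. The temporary file and its contents are assumptions made for illustration:

```shell
sample=$(mktemp)
printf 'a\nb\nc\n' > "$sample"

# Only a BEGIN rule: the program exits without reading the file.
begin_only=$(awk 'BEGIN { print "no input needed" }' "$sample")

# Only an END rule: the input is still read, so NR counts the records.
end_only=$(awk 'END { print NR }' "$sample")

printf '%s\n%s\n' "$begin_only" "$end_only"
rm -f "$sample"
```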

7381

7382

@code{BEGIN} and @code{END} rules must have actions; there is no default

7383

action for these rules since there is no current record when they run.

7384

7385

@node I/O And BEGIN/END, , Using BEGIN/END, BEGIN/END

7386

@subsubsection Input/Output from @code{BEGIN} and @code{END} Rules

7387

7388

@cindex I/O from @code{BEGIN} and @code{END}

7389

There are several (sometimes subtle) issues involved when doing I/O

7390

from a @code{BEGIN} or @code{END} rule.

7391

7392

The first has to do with the value of @code{$0} in a @code{BEGIN}

7393

rule. Since @code{BEGIN} rules are executed before any input is read,

7394

there simply is no input record, and therefore no fields, when

7395

executing @code{BEGIN} rules. References to @code{$0} and the fields

7396

yield a null string or zero, depending upon the context. One way

7397

to give @code{$0} a real value is to execute a @code{getline} command

7398

without a variable (@pxref{Getline, ,Explicit Input with @code{getline}}).

7399

Another way is to simply assign a value to it.

7400

7401

@cindex differences between @code{gawk} and @code{awk}

7402

The second point is similar to the first, but from the other direction.

7403

Inside an @code{END} rule, what is the value of @code{$0} and @code{NF}?

7404

Traditionally, due largely to implementation issues, @code{$0} and

7405

@code{NF} were @emph{undefined} inside an @code{END} rule.

7406

The POSIX standard specified that @code{NF} was available in an @code{END}

7407

rule, containing the number of fields from the last input record.

7408

Due most probably to an oversight, the standard does not say that @code{$0}

7409

is also preserved, although logically one would think that it should be.

7410

In fact, @code{gawk} does preserve the value of @code{$0} for use in

7411

@code{END} rules. Be aware, however, that Unix @code{awk}, and possibly

7412

other implementations, do not.

7413

7414

The third point follows from the first two. What is the meaning of

7415

@samp{print} inside a @code{BEGIN} or @code{END} rule? The meaning is

7416

the same as always, @samp{print $0}. If @code{$0} is the null string,

7417

then this prints an empty line. Many long-time @code{awk} programmers

7418

use @samp{print} in @code{BEGIN} and @code{END} rules, to mean

7419

@samp{@w{print ""}}, relying on @code{$0} being null. While you might

7420

generally get away with this in @code{BEGIN} rules, in @code{gawk} at

7421

least, it is a very bad idea in @code{END} rules. It is also poor

7422

style, since if you want an empty line in the output, you

7423

should say so explicitly in your program.
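A brief sketch of two of the points above: assigning a value to @code{$0} inside a @code{BEGIN} rule, and asking for an empty line explicitly. The string assigned to @code{$0} is made up for illustration:

```shell
begin_io_out=$(awk 'BEGIN {
    $0 = "a b c"     # assigning to $0 splits it into fields
    print NF, $2
    print ""         # an explicit empty line
    print "end"
}')
printf '%s\n' "$begin_io_out"
```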

7424

7425

@node Empty, , BEGIN/END, Pattern Overview

7426

@subsection The Empty Pattern

7427

7428

@cindex empty pattern

7429

@cindex pattern, empty

7430

An empty (i.e.@: non-existent) pattern is considered to match @emph{every}

7431

input record. For example, the program:

7432

7433

@example
awk '@{ print $1 @}' BBS-list
@end example

7436

7437

@noindent

7438

prints the first field of every record.

7439

7440

@node Action Overview, , Pattern Overview, Patterns and Actions

7441

@section Overview of Actions

7442

@cindex action, definition of

7443

@cindex curly braces

7444

@cindex action, curly braces

7445

@cindex action, separating statements

7446

7447

An @code{awk} program or script consists of a series of

7448

rules and function definitions, interspersed. (Functions are

7449

described later. @xref{User-defined, ,User-defined Functions}.)

7450

7451

A rule contains a pattern and an action, either of which (but not

7452

both) may be

7453

omitted. The purpose of the @dfn{action} is to tell @code{awk} what to do

7454

once a match for the pattern is found. Thus, in outline, an @code{awk}

7455

program generally looks like this:

7456

7457

@example
@r{[}@var{pattern}@r{]} @r{[}@{ @var{action} @}@r{]}
@r{[}@var{pattern}@r{]} @r{[}@{ @var{action} @}@r{]}
@dots{}
function @var{name}(@var{args}) @{ @dots{} @}
@dots{}
@end example

7464

7465

An action consists of one or more @code{awk} @dfn{statements}, enclosed

7466

in curly braces (@samp{@{} and @samp{@}}). Each statement specifies one

7467

thing to be done. The statements are separated by newlines or

7468

semicolons.

7469

7470

The curly braces around an action must be used even if the action

7471

contains only one statement, or even if it contains no statements at

7472

all. However, if you omit the action entirely, omit the curly braces as

7473

well. An omitted action is equivalent to @samp{@{ print $0 @}}.

7474

7475

@example
/foo/  @{ @}  # match foo, do nothing - empty action
/foo/       # match foo, print the record - omitted action
@end example

7479

7480

Here are the kinds of statements supported in @code{awk}:

7481

7482

@itemize @bullet

7483

@item

7484

Expressions, which can call functions or assign values to variables

7485

(@pxref{Expressions}). Executing

7486

this kind of statement simply computes the value of the expression.

7487

This is useful when the expression has side effects

7488

(@pxref{Assignment Ops, ,Assignment Expressions}).

7489

7490

@item

7491

Control statements, which specify the control flow of @code{awk}

7492

programs. The @code{awk} language gives you C-like constructs

7493

(@code{if}, @code{for}, @code{while}, and @code{do}) as well as a few

7494

special ones (@pxref{Statements, ,Control Statements in Actions}).

7495

7496

@item

7497

Compound statements, which consist of one or more statements enclosed in

7498

curly braces. A compound statement is used in order to put several

7499

statements together in the body of an @code{if}, @code{while}, @code{do}

7500

or @code{for} statement.

7501

7502

@item

7503

Input statements, using the @code{getline} command

7504

(@pxref{Getline, ,Explicit Input with @code{getline}}), the @code{next}

7505

statement (@pxref{Next Statement, ,The @code{next} Statement}),

7506

and the @code{nextfile} statement

7507

(@pxref{Nextfile Statement, ,The @code{nextfile} Statement}).

7508

7509

@item

7510

Output statements, @code{print} and @code{printf}.

7511

@xref{Printing, ,Printing Output}.

7512

7513

@item

7514

Deletion statements, for deleting array elements.

7515

@xref{Delete, ,The @code{delete} Statement}.

7516

@end itemize

7517

7518

@iftex

7519

The next chapter covers control statements in detail.

7520

@end iftex

7521

7522

@node Statements, Built-in Variables, Patterns and Actions, Top

7523

@chapter Control Statements in Actions

7524

@cindex control statement

7525

7526

@dfn{Control statements} such as @code{if}, @code{while}, and so on

7527

control the flow of execution in @code{awk} programs. Most of the

7528

control statements in @code{awk} are patterned on similar statements in

7529

C.

7530

7531

All the control statements start with special keywords such as @code{if}

7532

and @code{while}, to distinguish them from simple expressions.

7533

7534

@cindex compound statement

7535

@cindex statement, compound

7536

Many control statements contain other statements; for example, the

7537

@code{if} statement contains another statement which may or may not be

7538

executed. The contained statement is called the @dfn{body}. If you

7539

want to include more than one statement in the body, group them into a

7540

single @dfn{compound statement} with curly braces, separating them with

7541

newlines or semicolons.

7542

7543

@menu

7544

* If Statement:: Conditionally execute some @code{awk}

7545

statements.

7546

* While Statement:: Loop until some condition is satisfied.

7547

* Do Statement:: Do specified action while looping until some

7548

condition is satisfied.

7549

* For Statement:: Another looping statement, that provides

7550

initialization and increment clauses.

7551

* Break Statement:: Immediately exit the innermost enclosing loop.

7552

* Continue Statement:: Skip to the end of the innermost enclosing

7553

loop.

7554

* Next Statement:: Stop processing the current input record.

7555

* Nextfile Statement:: Stop processing the current file.

7556

* Exit Statement:: Stop execution of @code{awk}.

7557

@end menu

7558

7559

@node If Statement, While Statement, Statements, Statements

7560

@section The @code{if}-@code{else} Statement

7561

7562

@cindex @code{if}-@code{else} statement

7563

The @code{if}-@code{else} statement is @code{awk}'s decision-making

7564

statement. It looks like this:

7565

7566

@example
if (@var{condition}) @var{then-body} @r{[}else @var{else-body}@r{]}
@end example

7569

7570

@noindent

7571

The @var{condition} is an expression that controls what the rest of the

7572

statement will do. If @var{condition} is true, @var{then-body} is

7573

executed; otherwise, @var{else-body} is executed.

7574

The @code{else} part of the statement is

7575

optional. The condition is considered false if its value is zero or

7576

the null string, and true otherwise.

7577

7578

Here is an example:

7579

7580

@example
if (x % 2 == 0)
    print "x is even"
else
    print "x is odd"
@end example

7586

7587

In this example, if the expression @samp{x % 2 == 0} is true (that is,

7588

the value of @code{x} is evenly divisible by two), then the first @code{print}

7589

statement is executed, otherwise the second @code{print} statement is

7590

executed.

7591

7592

If the @code{else} appears on the same line as @var{then-body}, and

7593

@var{then-body} is not a compound statement (i.e.@: not surrounded by

7594

curly braces), then a semicolon must separate @var{then-body} from

7595

@code{else}. To illustrate this, let's rewrite the previous example:

7596

7597

@example
if (x % 2 == 0) print "x is even"; else
        print "x is odd"
@end example

7601

7602

@noindent

7603

If you forget the @samp{;}, @code{awk} won't be able to interpret the

7604

statement, and you will get a syntax error.

7605

7606

We would not actually write this example this way, because a human

7607

reader might fail to see the @code{else} if it were not the first thing

7608

on its line.
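
As a further sketch (not taken from any particular program; the variable
name @code{n} and the messages are invented for the illustration),
@code{if} statements can be chained, and either body can be a compound
statement. This program classifies the number found in the first field
of each record:

@example
awk '@{ n = $1
       if (n < 0) @{
           print n, "is negative"
       @} else if (n == 0)
           print "zero"
       else
           print n, "is positive"
@}'
@end example

@noindent
Because the first @var{then-body} is enclosed in curly braces, no
semicolon is needed before the @code{else} that follows it on the same line.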

7609

7610

@node While Statement, Do Statement, If Statement, Statements

7611

@section The @code{while} Statement

7612

@cindex @code{while} statement

7613

@cindex loop

7614

@cindex body of a loop

7615

7616

In programming, a @dfn{loop} means a part of a program that can

7617

be executed two or more times in succession.

7618

7619

The @code{while} statement is the simplest looping statement in

7620

@code{awk}. It repeatedly executes a statement as long as a condition is

7621

true. It looks like this:

7622

7623

@example
while (@var{condition})
  @var{body}
@end example

7627

7628

@noindent

7629

Here @var{body} is a statement that we call the @dfn{body} of the loop,

7630

and @var{condition} is an expression that controls how long the loop

7631

keeps running.

7632

7633

The first thing the @code{while} statement does is test @var{condition}.

7634

If @var{condition} is true, it executes the statement @var{body}.

7635

@ifinfo

7636

(The @var{condition} is true when the value

7637

is not zero and not a null string.)

7638

@end ifinfo

7639

After @var{body} has been executed,

7640

@var{condition} is tested again, and if it is still true, @var{body} is

7641

executed again. This process repeats until @var{condition} is no longer

7642

true. If @var{condition} is initially false, the body of the loop is

7643

never executed, and @code{awk} continues with the statement following

7644

the loop.

7645

7646

This example prints the first three fields of each record, one per line.

7647

7648

@example
awk '@{ i = 1
       while (i <= 3) @{
           print $i
           i++
       @}
@}' inventory-shipped
@end example

7656

7657

@noindent

7658

Here the body of the loop is a compound statement enclosed in braces,

7659

containing two statements.

7660

7661

The loop works like this: first, the value of @code{i} is set to one.

7662

Then, the @code{while} tests whether @code{i} is less than or equal to

7663

three. This is true when @code{i} equals one, so the @code{i}-th

7664

field is printed. Then the @samp{i++} increments the value of @code{i}

7665

and the loop repeats. The loop terminates when @code{i} reaches four.

7666

7667

As you can see, a newline is not required between the condition and the

7668

body; but using one makes the program clearer unless the body is a

7669

compound statement or is very simple. The newline after the open-brace

7670

that begins the compound statement is not required either, but the

7671

program would be harder to read without it.
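
The loop need not count upwards. As a sketch under the same assumptions
(the @file{inventory-shipped} sample file and a made-up counter
@code{i}), this variation starts at the last field, @code{NF}, and works
backwards, printing the fields of each record in reverse order on one line:

@example
awk '@{ i = NF
       while (i > 0) @{
           printf "%s ", $i
           i--
       @}
       print ""
@}' inventory-shipped
@end example

@noindent
For an empty record @code{NF} is zero, so the condition is false at the
start and the body is never executed; only the empty line is printed.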

7672

7673

@node Do Statement, For Statement, While Statement, Statements

7674

@section The @code{do}-@code{while} Statement

7675

7676

The @code{do} loop is a variation of the @code{while} looping statement.

7677

The @code{do} loop executes the @var{body} once, and then repeats @var{body}

7678

as long as @var{condition} is true. It looks like this:

7679

7680

@example
do
  @var{body}
while (@var{condition})
@end example

7685

7686

Even if @var{condition} is false at the start, @var{body} is executed at

7687

least once (and only once, unless executing @var{body} makes

7688

@var{condition} true). Contrast this with the corresponding

7689

@code{while} statement:

7690

7691

@example
while (@var{condition})
  @var{body}
@end example

7695

7696

@noindent

7697

This statement does not execute @var{body} even once if @var{condition}

7698

is false to begin with.

7699

7700

Here is an example of a @code{do} statement:

7701

7702

@example
awk '@{ i = 1
       do @{
          print $0
          i++
       @} while (i <= 10)
@}'
@end example

7710

7711

@noindent

7712

This program prints each input record ten times. It isn't a very

7713

realistic example, since in this case an ordinary @code{while} would do

7714

just as well. But this reflects actual experience; there is only

7715

occasionally a real use for a @code{do} statement.
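
The difference from @code{while} shows up only when the condition is
false from the start. The following sketch (the starting value ten and
the limit five are arbitrary choices for the illustration) makes this
visible:

@example
awk 'BEGIN @{
    i = 10
    do @{
        print "i is", i
        i++
    @} while (i < 5)
@}'
@end example

@noindent
This prints @samp{i is 10} exactly once, even though @samp{i < 5} is
never true. The same loop written with @code{while} would print nothing.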

7716

7717

@node For Statement, Break Statement, Do Statement, Statements

7718

@section The @code{for} Statement

7719

@cindex @code{for} statement

7720

7721

The @code{for} statement makes it more convenient to count iterations of a

7722

loop. The general form of the @code{for} statement looks like this:

7723

7724

@example
for (@var{initialization}; @var{condition}; @var{increment})
  @var{body}
@end example

7728

7729

@noindent

7730

The @var{initialization}, @var{condition} and @var{increment} parts are

7731

arbitrary @code{awk} expressions, and @var{body} stands for any

7732

@code{awk} statement.

7733

7734

The @code{for} statement starts by executing @var{initialization}.

7735

Then, as long

7736

as @var{condition} is true, it repeatedly executes @var{body} and then

7737

@var{increment}. Typically @var{initialization} sets a variable to

7738

either zero or one, @var{increment} adds one to it, and @var{condition}

7739

compares it against the desired number of iterations.

7740

7741

Here is an example of a @code{for} statement:

7742

7743

@example
@group
awk '@{ for (i = 1; i <= 3; i++)
          print $i
@}' inventory-shipped
@end group
@end example

7750

7751

@noindent

7752

This prints the first three fields of each input record, one field per

7753

line.

7754

7755

You cannot set more than one variable in the

7756

@var{initialization} part unless you use a multiple assignment statement

7757

such as @samp{x = y = 0}, which is possible only if all the initial values

7758

are equal. (But you can initialize additional variables by writing

7759

their assignments as separate statements preceding the @code{for} loop.)

7760

7761

The same is true of the @var{increment} part; to increment additional

7762

variables, you must write separate statements at the end of the loop.

7763

The C compound expression, using C's comma operator, would be useful in

7764

this context, but it is not supported in @code{awk}.
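
Here is a minimal sketch of that workaround. The variable names
@code{i} and @code{j} and the step sizes are assumptions made for the
example. The second variable is initialized before the loop and updated
by an ordinary statement at the end of the body:

@example
awk 'BEGIN @{
    j = 10
    for (i = 1; i <= 3; i++) @{
        print i, j
        j -= 2
    @}
@}'
@end example

@noindent
This prints @samp{1 10}, @samp{2 8} and @samp{3 6}, each on its own line.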

7765

7766

Most often, @var{increment} is an increment expression, as in the

7767

example above. But this is not required; it can be any expression

7768

whatever. For example, this statement prints all the powers of two

7769

between one and 100:

7770

7771

@example
for (i = 1; i <= 100; i *= 2)
  print i
@end example

7775

7776

Any of the three expressions in the parentheses following the @code{for} may

7777

be omitted if there is nothing to be done there. Thus, @w{@samp{for (; x

7778

> 0;)}} is equivalent to @w{@samp{while (x > 0)}}. If the

7779

@var{condition} is omitted, it is treated as @var{true}, effectively

7780

yielding an @dfn{infinite loop} (i.e.@: a loop that will never

7781

terminate).

7782

7783

In most cases, a @code{for} loop is an abbreviation for a @code{while}

7784

loop, as shown here:

7785

7786

@example
@var{initialization}
while (@var{condition}) @{
  @var{body}
  @var{increment}
@}
@end example

7793

7794

@noindent

7795

The only exception is when the @code{continue} statement

7796

(@pxref{Continue Statement, ,The @code{continue} Statement}) is used

7797

inside the loop; changing a @code{for} statement to a @code{while}

7798

statement in this way can change the effect of the @code{continue}

7799

statement inside the loop.

7800

7801

There is an alternate version of the @code{for} loop, for iterating over

7802

all the indices of an array:

7803

7804

@example
for (i in array)
    @var{do something with} array[i]
@end example

7808

7809

@noindent

7810

@xref{Scanning an Array, ,Scanning All Elements of an Array},

7811

for more information on this version of the @code{for} loop.

7812

7813

The @code{awk} language has a @code{for} statement in addition to a

7814

@code{while} statement because often a @code{for} loop is both less work to

7815

type and more natural to think of. Counting the number of iterations is

7816

very common in loops. It can be easier to think of this counting as part

7817

of looping rather than as something to do inside the loop.

7818

7819

The next section has more complicated examples of @code{for} loops.

7820

7821

@node Break Statement, Continue Statement, For Statement, Statements

7822

@section The @code{break} Statement

7823

@cindex @code{break} statement

7824

@cindex loops, exiting

7825

7826

The @code{break} statement jumps out of the innermost @code{for},

7827

@code{while}, or @code{do} loop that encloses it. The

7828

following example finds the smallest divisor of any integer, and also

7829

identifies prime numbers:

7830

7831

@example
awk '# find smallest divisor of num
     @{ num = $1
       for (div = 2; div*div <= num; div++)
         if (num % div == 0)
           break
       if (num % div == 0)
         printf "Smallest divisor of %d is %d\n", num, div
       else
         printf "%d is prime\n", num
     @}'
@end example

7843

7844

When the remainder is zero in the first @code{if} statement, @code{awk}

7845

immediately @dfn{breaks out} of the containing @code{for} loop. This means

7846

that @code{awk} proceeds immediately to the statement following the loop

7847

and continues processing. (This is very different from the @code{exit}

7848

statement which stops the entire @code{awk} program.

7849

@xref{Exit Statement, ,The @code{exit} Statement}.)

7850

7851

Here is another program equivalent to the previous one. It illustrates how

7852

the @var{condition} of a @code{for} or @code{while} could just as well be

7853

replaced with a @code{break} inside an @code{if}:

7854

7855

@example
@group
awk '# find smallest divisor of num
     @{ num = $1
       for (div = 2; ; div++) @{
         if (num % div == 0) @{
           printf "Smallest divisor of %d is %d\n", num, div
           break
         @}
         if (div*div > num) @{
           printf "%d is prime\n", num
           break
         @}
       @}
@}'
@end group
@end example

7872

7873

@cindex @code{break}, outside of loops

7874

@cindex historical features

7875

@cindex @code{awk} language, POSIX version

7876

@cindex POSIX @code{awk}

7877

@cindex dark corner

7878

As described above, the @code{break} statement has no meaning when

7879

used outside the body of a loop. However, although it was never documented,

7880

historical implementations of @code{awk} have treated the @code{break}

7881

statement outside of a loop as if it were a @code{next} statement

7882

(@pxref{Next Statement, ,The @code{next} Statement}).

7883

Recent versions of Unix @code{awk} no longer allow this usage.

7884

@code{gawk} will support this use of @code{break} only if @samp{--traditional}

7885

has been specified on the command line

7886

(@pxref{Options, ,Command Line Options}).

7887

Otherwise, it will be treated as an error, since the POSIX standard

7888

specifies that @code{break} should only be used inside the body of a

7889

loop (d.c.).
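
Inside nested loops, @code{break} leaves only the innermost enclosing
loop, as stated at the start of this section. The following sketch
(with made-up loop variables @code{i} and @code{j}) illustrates this:

@example
awk 'BEGIN @{
    for (i = 1; i <= 3; i++) @{
        for (j = 1; j <= 3; j++) @{
            if (j > i)
                break
            printf "%d ", j
        @}
        print ""
    @}
@}'
@end example

@noindent
The @code{break} ends the inner loop as soon as @code{j} exceeds
@code{i}, but the outer loop keeps running, so the program prints three
lines containing @samp{1}, @samp{1 2} and @samp{1 2 3}.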

7890

7891

@node Continue Statement, Next Statement, Break Statement, Statements

7892

@section The @code{continue} Statement

7893

7894

@cindex @code{continue} statement

7895

The @code{continue} statement, like @code{break}, is used only inside

7896

@code{for}, @code{while}, and @code{do} loops. It skips

7897

over the rest of the loop body, causing the next cycle around the loop

7898

to begin immediately. Contrast this with @code{break}, which jumps out

7899

of the loop altogether.

7900

7901

@c The point of this program was to illustrate the use of continue with

7902

@c a while loop. But Karl Berry points out that that is done adequately

7903

@c below, and that this example is very un-awk-like. So for now, we'll

7904

@c omit it.

7905

@ignore

7906

In Texinfo source files, text that the author wishes to ignore can be

7907

enclosed between lines that start with @samp{@@ignore} and end with

7908

@samp{@@end ignore}. Here is a program that strips out lines between

7909

@samp{@@ignore} and @samp{@@end ignore} pairs.

7910

7911

@example

7912

BEGIN @{

7913

while (getline > 0) @{

7914

if (/^@@ignore/)

7915

ignoring = 1

7916

else if (/^@@end[ \t]+ignore/) @{

7917

ignoring = 0

7918

continue

7919

@}

7920

if (ignoring)

7921

continue

7922

print

7923

@}

7924

@}

7925

@end example

7926

7927

When an @samp{@@ignore} is seen, the @code{ignoring} flag is set to one (true).

7928

When @samp{@@end ignore} is seen, the flag is reset to zero (false). As long

7929

as the flag is true, the input record is not printed, because the

7930

@code{continue} restarts the @code{while} loop, skipping over the @code{print}

7931

statement.

7932

7933

@c Exercise!!!

7934

@c How could this program be written to make better use of the awk language?

7935

@end ignore

7936

7937

The @code{continue} statement in a @code{for} loop directs @code{awk} to

7938

skip the rest of the body of the loop, and resume execution with the

7939

increment-expression of the @code{for} statement. The following program

7940

illustrates this fact:

7941

7942

@example
awk 'BEGIN @{
     for (x = 0; x <= 20; x++) @{
         if (x == 5)
             continue
         printf "%d ", x
     @}
     print ""
@}'
@end example

7952

7953

@noindent

7954

This program prints all the numbers from zero to 20, except for five, for

7955

which the @code{printf} is skipped. Since the increment @samp{x++}

7956

is not skipped, @code{x} does not remain stuck at five. Contrast the

7957

@code{for} loop above with this @code{while} loop:

7958

7959

@example
awk 'BEGIN @{
     x = 0
     while (x <= 20) @{
         if (x == 5)
             continue
         printf "%d ", x
         x++
     @}
     print ""
@}'
@end example

7971

7972

@noindent

7973

This program loops forever once @code{x} gets to five.
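
One possible repair, shown here only as a sketch, is to perform the
increment before the @code{continue}, so that it is not skipped:

@example
awk 'BEGIN @{
     x = 0
     while (x <= 20) @{
         if (x == 5) @{
             x++
             continue
         @}
         printf "%d ", x
         x++
     @}
     print ""
@}'
@end example

@noindent
Like the @code{for} version, this prints the numbers from zero to 20,
omitting five, and then terminates.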

7974

7975

@cindex @code{continue}, outside of loops

7976

@cindex historical features

7977

@cindex @code{awk} language, POSIX version

7978

@cindex POSIX @code{awk}

7979

@cindex dark corner

7980

As described above, the @code{continue} statement has no meaning when

7981

used outside the body of a loop. However, although it was never documented,

7982

historical implementations of @code{awk} have treated the @code{continue}

7983

statement outside of a loop as if it were a @code{next} statement

7984

(@pxref{Next Statement, ,The @code{next} Statement}).

7985

Recent versions of Unix @code{awk} no longer allow this usage.

7986

@code{gawk} will support this use of @code{continue} only if

7987

@samp{--traditional} has been specified on the command line

7988

(@pxref{Options, ,Command Line Options}).

7989

Otherwise, it will be treated as an error, since the POSIX standard

7990

specifies that @code{continue} should only be used inside the body of a

7991

loop (d.c.).

7992

7993

@node Next Statement, Nextfile Statement, Continue Statement, Statements

7994

@section The @code{next} Statement

7995

@cindex @code{next} statement

7996

7997

The @code{next} statement forces @code{awk} to immediately stop processing

7998

the current record and go on to the next record. This means that no

7999

further rules are executed for the current record. The rest of the

8000

current rule's action is not executed either.

8001

8002

Contrast this with the effect of the @code{getline} function

8003

(@pxref{Getline, ,Explicit Input with @code{getline}}). That too causes

8004

@code{awk} to read the next record immediately, but it does not alter the

8005

flow of control in any way. So the rest of the current action executes

8006

with a new input record.
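
A short sketch may make the difference concrete. Both programs below are
illustrations only, and assume input in which some lines begin with
@samp{#}. In the first, @code{next} abandons such a line, so the second
rule never sees it:

@example
awk '/^#/ @{ next @}
     @{ print "data:", $0 @}'
@end example

@noindent
In the second, @code{getline} replaces the @samp{#} line with the
following record and execution simply continues, so the second rule runs
on the newly read record:

@example
awk '/^#/ @{ getline @}
     @{ print "data:", $0 @}'
@end example

@noindent
The new record is not tested against the first rule again, so it is
printed even if it also begins with @samp{#}. If there is no following
record, @code{$0} is left unchanged and the @samp{#} line itself is printed.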

8007

8008

At the highest level, @code{awk} program execution is a loop that reads

8009

an input record and then tests each rule's pattern against it. If you

8010

think of this loop as a @code{for} statement whose body contains the

8011

rules, then the @code{next} statement is analogous to a @code{continue}

8012

statement: it skips to the end of the body of this implicit loop, and

8013

executes the increment (which reads another record).

8014

8015

For example, if your @code{awk} program works only on records with four

8016

fields, and you don't want it to fail when given bad input, you might

8017

use this rule near the beginning of the program:

8018

8019

@example
@group
NF != 4 @{
  err = sprintf("%s:%d: skipped: NF != 4\n", FILENAME, FNR)
  print err > "/dev/stderr"
  next
@}
@end group
@end example

8028

8029

@noindent

8030

so that the following rules will not see the bad record. The error

8031

message is redirected to the standard error output stream, as error

8032

messages should be. @xref{Special Files, ,Special File Names in @code{gawk}}.

8033

8034

@cindex @code{awk} language, POSIX version

8035

@cindex POSIX @code{awk}

8036

According to the POSIX standard, the behavior is undefined if

8037

the @code{next} statement is used in a @code{BEGIN} or @code{END} rule.

8038

@code{gawk} will treat it as a syntax error.

8039

Although POSIX permits it,

8040

some other @code{awk} implementations don't allow the @code{next}

8041

statement inside function bodies

8042

(@pxref{User-defined, ,User-defined Functions}).

8043

Just like any other @code{next} statement, a @code{next} inside a

8044

function body reads the next record and starts processing it with the

8045

first rule in the program.

8046

8047

If the @code{next} statement causes the end of the input to be reached,

8048

then the code in any @code{END} rules will be executed.

8049

@xref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}.

8050

8051

@node Nextfile Statement, Exit Statement, Next Statement, Statements

8052

@section The @code{nextfile} Statement

8053

@cindex @code{nextfile} statement

8054

@cindex differences between @code{gawk} and @code{awk}

8055

8056

@code{gawk} provides the @code{nextfile} statement,

8057

which is similar to the @code{next} statement.

8058

However, instead of abandoning processing of the current record, the

8059

@code{nextfile} statement instructs @code{gawk} to stop processing the

8060

current data file.

8061

8062

Upon execution of the @code{nextfile} statement, @code{FILENAME} is

8063

updated to the name of the next data file listed on the command line,

8064

@code{FNR} is reset to one, @code{ARGIND} is incremented, and processing

8065

starts over with the first rule in the program. @xref{Built-in Variables}.

8066

8067

If the @code{nextfile} statement causes the end of the input to be reached,

8068

then the code in any @code{END} rules will be executed.

8069

@xref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}.

8070

8071

The @code{nextfile} statement is a @code{gawk} extension; it is not

8072

(currently) available in any other @code{awk} implementation.

8073

@xref{Nextfile Function, ,Implementing @code{nextfile} as a Function},

8074

for a user-defined function you can use to simulate the @code{nextfile}

8075

statement.

8076

8077

The @code{nextfile} statement is useful when you have many data
files to process, and you do not expect to need
every record in every file.

8080

Normally, in order to move on to

8081

the next data file, you would have to continue scanning the unwanted

8082

records. The @code{nextfile} statement accomplishes this much more

8083

efficiently.
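
For instance, the following sketch prints only the first two records of
each data file. The file names @file{file1} and @file{file2} are
placeholders; any list of input files would do:

@example
gawk 'FNR > 2 @{ nextfile @}
      @{ print FILENAME ": " $0 @}' file1 file2
@end example

@noindent
As soon as the third record of a file is read, the first rule skips the
rest of that file, so the remaining records are never run through the rules.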

8084

8085

@cindex @code{next file} statement

8086

@strong{Caution:} Versions of @code{gawk} prior to 3.0 used two

8087

words (@samp{next file}) for the @code{nextfile} statement. This was

8088

changed in 3.0 to one word, since the treatment of @samp{file} was

8089

inconsistent. When it appeared after @code{next}, it was a keyword.

8090

Otherwise, it was a regular identifier. The old usage is still

8091

accepted. However, @code{gawk} will generate a warning message, and

8092

support for @code{next file} will eventually be discontinued in a

8093

future version of @code{gawk}.

8094

8095

@node Exit Statement, , Nextfile Statement, Statements

8096

@section The @code{exit} Statement

8097

8098

@cindex @code{exit} statement

8099

The @code{exit} statement causes @code{awk} to immediately stop

8100

executing the current rule and to stop processing input; any remaining input

8101

is ignored. It looks like this:

8102

8103

@example
exit @r{[}@var{return code}@r{]}
@end example

8106

8107

If an @code{exit} statement is executed from a @code{BEGIN} rule, the

8108

program stops processing everything immediately. No input records are

8109

read. However, if an @code{END} rule is present, it is executed

8110

(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}).

8111

8112

If @code{exit} is used as part of an @code{END} rule, it causes

8113

the program to stop immediately.

8114

8115

An @code{exit} statement that is not part

8116

of a @code{BEGIN} or @code{END} rule stops the execution of any further

8117

automatic rules for the current record, skips reading any remaining input

8118

records, and executes

8119

the @code{END} rule if there is one.

8120

8121

If you do not want the @code{END} rule to do its job in this case, you

8122

can set a variable to non-zero before the @code{exit} statement, and check

8123

that variable in the @code{END} rule.

8124

@xref{Assert Function, ,Assertions},

8125

for an example that does this.
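
Independently of that example, the idea can be sketched as follows. The
marker @samp{STOP} and the variable names @code{stopped} and
@code{count} are invented for this illustration:

@example
/^STOP/ @{
    stopped = 1
    exit 1
@}
@{ count++ @}
END @{
    if (! stopped)
        print count + 0, "records read"
@}
@end example

@noindent
If a record beginning with @samp{STOP} is seen, the @code{END} rule
still runs, but the flag tells it to skip its normal summary.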

8126

8127

@cindex dark corner

8128

If an argument is supplied to @code{exit}, its value is used as the exit

8129

status code for the @code{awk} process. If no argument is supplied,

8130

@code{exit} returns status zero (success). In the case where an argument

8131

is supplied to a first @code{exit} statement, and then @code{exit} is

8132

called a second time with no argument, the previously supplied exit value

8133

is used (d.c.).
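
The exit status can be inspected from the shell. This sketch assumes a
Bourne-style shell, where @samp{$?} holds the status of the last command;
the value three is arbitrary:

@example
awk 'BEGIN @{ exit 3 @}'
echo $?
@end example

@noindent
The @code{echo} command prints @samp{3}. According to the description
above, adding @samp{END @{ exit @}} to the program should leave the
status unchanged, since the second @code{exit} has no argument.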

8134

8135

For example, let's say you've discovered an error condition you really

8136

don't know how to handle. Conventionally, programs report this by

8137

exiting with a non-zero status. Your @code{awk} program can do this

8138

using an @code{exit} statement with a non-zero argument. Here is an

8139

example:

8140

8141

@example
@group
BEGIN @{
       # getline returns 0 at end of file and -1 on error
       if (("date" | getline date_now) <= 0) @{
         print "Can't get system date" > "/dev/stderr"
         exit 1
       @}
       print "current date is", date_now
       close("date")
@}
@end group
@end example

8153

8154

@node Built-in Variables, Arrays, Statements, Top

8155

@chapter Built-in Variables

8156

@cindex built-in variables

8157

8158

Most @code{awk} variables are available for you to use for your own

8159

purposes; they never change except when your program assigns values to

8160

them, and never affect anything except when your program examines them.

8161

However, a few variables in @code{awk} have special built-in meanings.

8162

Some of them @code{awk} examines automatically, so that they enable you

8163

to tell @code{awk} how to do certain things. Others are set

8164

automatically by @code{awk}, so that they carry information from the

8165

internal workings of @code{awk} to your program.

8166

8167

This chapter documents all the built-in variables of @code{gawk}. Most

8168

of them are also documented in the chapters describing their areas of

8169

activity.

8170

8171

@menu

8172

* User-modified:: Built-in variables that you change to control

8173

@code{awk}.

8174

* Auto-set:: Built-in variables where @code{awk} gives you

8175

information.

8176

* ARGC and ARGV:: Ways to use @code{ARGC} and @code{ARGV}.

8177

@end menu

8178

8179

@node User-modified, Auto-set, Built-in Variables, Built-in Variables

8180

@section Built-in Variables that Control @code{awk}

8181

@cindex built-in variables, user modifiable

8182

8183

This is an alphabetical list of the variables which you can change to

8184

control how @code{awk} does certain things. Those variables that are

8185

specific to @code{gawk} are marked with an asterisk, @samp{*}.

8186

8187

@table @code

8188

@vindex CONVFMT

8189

@cindex @code{awk} language, POSIX version

8190

@cindex POSIX @code{awk}

8191

@item CONVFMT

8192

This string controls conversion of numbers to

8193

strings (@pxref{Conversion, ,Conversion of Strings and Numbers}).

8194

It works by being passed, in effect, as the first argument to the

8195

@code{sprintf} function

8196

(@pxref{String Functions, ,Built-in Functions for String Manipulation}).

8197

Its default value is @code{"%.6g"}.

8198

@code{CONVFMT} was introduced by the POSIX standard.

8199

8200

@vindex FIELDWIDTHS

8201

@item FIELDWIDTHS *

8202

This is a space-separated list of columns that tells @code{gawk}

8203

how to split input with fixed, columnar boundaries. It is an

8204

experimental feature. Assigning to @code{FIELDWIDTHS}

8205

overrides the use of @code{FS} for field splitting.

8206

@xref{Constant Size, ,Reading Fixed-width Data}, for more information.

8207

8208

If @code{gawk} is in compatibility mode

8209

(@pxref{Options, ,Command Line Options}), then @code{FIELDWIDTHS}

8210

has no special meaning, and field splitting operations are done based

8211

exclusively on the value of @code{FS}.

8212

8213

@vindex FS

8214

@item FS

8215

@code{FS} is the input field separator

8216

(@pxref{Field Separators, ,Specifying How Fields are Separated}).

8217

The value is a single-character string or a multi-character regular

8218

expression that matches the separations between fields in an input

8219

record. If the value is the null string (@code{""}), then each

8220

character in the record becomes a separate field.

8221

8222

The default value is @w{@code{" "}}, a string consisting of a single

8223

space. As a special exception, this value means that any

8224

sequence of spaces and tabs is a single separator. It also causes

8225

spaces and tabs at the beginning and end of a record to be ignored.

8226

8227

You can set the value of @code{FS} on the command line using the

8228

@samp{-F} option:

8229

8230

@example
awk -F, '@var{program}' @var{input-files}
@end example

8233

8234

If @code{gawk} is using @code{FIELDWIDTHS} for field-splitting,

8235

assigning a value to @code{FS} will cause @code{gawk} to return to

8236

the normal, @code{FS}-based, field splitting. An easy way to do this

8237

is to simply say @samp{FS = FS}, perhaps with an explanatory comment.

8238

8239

@vindex IGNORECASE

8240

@item IGNORECASE *

8241

If @code{IGNORECASE} is non-zero or non-null, then all string comparisons,

8242

and all regular expression matching are case-independent. Thus, regexp

8243

matching with @samp{~} and @samp{!~}, and the @code{gensub},

8244

@code{gsub}, @code{index}, @code{match}, @code{split} and @code{sub}

8245

functions, record termination with @code{RS}, and field splitting with

8246

@code{FS} all ignore case when doing their particular regexp operations.

8247

@xref{Case-sensitivity, ,Case-sensitivity in Matching}.

8248

8249

If @code{gawk} is in compatibility mode

8250

(@pxref{Options, ,Command Line Options}),

8251

then @code{IGNORECASE} has no special meaning, and string

8252

and regexp operations are always case-sensitive.

@vindex OFMT
@item OFMT
This string controls conversion of numbers to
strings (@pxref{Conversion, ,Conversion of Strings and Numbers}) for
printing with the @code{print} statement. It works by being passed, in
effect, as the first argument to the @code{sprintf} function
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
Its default value is @code{"%.6g"}. Earlier versions of @code{awk}
also used @code{OFMT} to specify the format for converting numbers to
strings in general expressions; this is now done by @code{CONVFMT}.

@vindex OFS
@item OFS
This is the output field separator (@pxref{Output Separators}). It is
output between the fields output by a @code{print} statement. Its
default value is @w{@code{" "}}, a string consisting of a single space.

@vindex ORS
@item ORS
This is the output record separator. It is output at the end of every
@code{print} statement. Its default value is @code{"\n"}.
(@xref{Output Separators}.)

@vindex RS
@item RS
This is @code{awk}'s input record separator. Its default value is a string
containing a single newline character, which means that an input record
consists of a single line of text.
It can also be the null string, in which case records are separated by
runs of blank lines, or a regexp, in which case records are separated by
matches of the regexp in the input text.
(@xref{Records, ,How Input is Split into Records}.)

@vindex SUBSEP
@item SUBSEP
@code{SUBSEP} is the subscript separator. It has the default value of
@code{"\034"}, and is used to separate the parts of the indices of a
multi-dimensional array. Thus, the expression @code{@w{foo["A", "B"]}}
really accesses @code{foo["A\034B"]}
(@pxref{Multi-dimensional, ,Multi-dimensional Arrays}).
@end table
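
The interplay of some of these variables is easiest to see in a short
session. The following is only a sketch; the input text and the array
names @code{pair} and @code{part} are made up for illustration.
Assigning a field to itself makes @code{awk} rebuild the record using
@code{OFS}, @code{print} appends @code{ORS}, and @code{SUBSEP}
joins the parts of a multi-dimensional subscript:

@example
@group
$ echo 'a b c' | awk 'BEGIN @{ OFS = "-"; ORS = "!\n" @}
>                     @{ $1 = $1; print @}'
@print{} a-b-c!
@end group

@group
$ awk 'BEGIN @{
>     pair["A", "B"] = 1
>     for (k in pair) @{
>         split(k, part, SUBSEP)    # recover the two halves of the subscript
>         print part[1], part[2]
>     @}
> @}'
@print{} A B
@end group
@end example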

@node Auto-set, ARGC and ARGV, User-modified, Built-in Variables
@section Built-in Variables that Convey Information
@cindex built-in variables, convey information

This is an alphabetical list of the variables that are set
automatically by @code{awk} on certain occasions in order to provide
information to your program. Those variables that are specific to
@code{gawk} are marked with an asterisk, @samp{*}.

@table @code
@vindex ARGC
@vindex ARGV
@item ARGC
@itemx ARGV
The command-line arguments available to @code{awk} programs are stored in
an array called @code{ARGV}. @code{ARGC} is the number of command-line
arguments present. @xref{Other Arguments, ,Other Command Line Arguments}.
Unlike most @code{awk} arrays,
@code{ARGV} is indexed from zero to @code{ARGC} @minus{} 1. For example:

@example
@group
$ awk 'BEGIN @{
>        for (i = 0; i < ARGC; i++)
>            print ARGV[i]
>      @}' inventory-shipped BBS-list
@print{} awk
@print{} inventory-shipped
@print{} BBS-list
@end group
@end example

@noindent
In this example, @code{ARGV[0]} contains @code{"awk"}, @code{ARGV[1]}
contains @code{"inventory-shipped"}, and @code{ARGV[2]} contains
@code{"BBS-list"}. The value of @code{ARGC} is three, one more than the
index of the last element in @code{ARGV}, since the elements are numbered
from zero.

The names @code{ARGC} and @code{ARGV}, as well as the convention of indexing
the array from zero to @code{ARGC} @minus{} 1, are derived from the C language's
method of accessing command line arguments.
@xref{ARGC and ARGV, , Using @code{ARGC} and @code{ARGV}}, for information
about how @code{awk} uses these variables.

@vindex ARGIND
@item ARGIND *
The index in @code{ARGV} of the current file being processed.
Every time @code{gawk} opens a new data file for processing, it sets
@code{ARGIND} to the index in @code{ARGV} of the file name.
When @code{gawk} is processing the input files, it is always
true that @samp{FILENAME == ARGV[ARGIND]}.

This variable is useful in file processing; it allows you to tell how far
along you are in the list of data files, and to distinguish between
successive instances of the same filename on the command line.

While you can change the value of @code{ARGIND} within your @code{awk}
program, @code{gawk} will automatically set it to a new value when the
next file is opened.

This variable is a @code{gawk} extension. In other @code{awk} implementations,
or if @code{gawk} is in compatibility mode
(@pxref{Options, ,Command Line Options}),
it is not special.

@vindex ENVIRON
@item ENVIRON
An associative array that contains the values of the environment. The array
indices are the environment variable names; the values are the values of
the particular environment variables. For example,
@code{ENVIRON["HOME"]} might be @file{/home/arnold}. Changing this array
does not affect the environment passed on to any programs that
@code{awk} may spawn via redirection or the @code{system} function.
(In a future version of @code{gawk}, it may do so.)

Some operating systems may not have environment variables.
On such systems, the @code{ENVIRON} array is empty (except for
@w{@code{ENVIRON["AWKPATH"]}}).

@vindex ERRNO
@item ERRNO *
If a system error occurs either doing a redirection for @code{getline},
during a read for @code{getline}, or during a @code{close} operation,
then @code{ERRNO} will contain a string describing the error.

This variable is a @code{gawk} extension. In other @code{awk} implementations,
or if @code{gawk} is in compatibility mode
(@pxref{Options, ,Command Line Options}),
it is not special.

@cindex dark corner
@vindex FILENAME
@item FILENAME
This is the name of the file that @code{awk} is currently reading.
When no data files are listed on the command line, @code{awk} reads
from the standard input, and @code{FILENAME} is set to @code{"-"}.
@code{FILENAME} is changed each time a new file is read
(@pxref{Reading Files, ,Reading Input Files}).
Inside a @code{BEGIN} rule, the value of @code{FILENAME} is
@code{""}, since there are no input files being processed
yet.@footnote{Some early implementations of Unix @code{awk} initialized
@code{FILENAME} to @code{"-"}, even if there were data files to be
processed. This behavior was incorrect, and should not be relied
upon in your programs.} (d.c.)

@vindex FNR
@item FNR
@code{FNR} is the current record number in the current file. @code{FNR} is
incremented each time a new record is read
(@pxref{Getline, ,Explicit Input with @code{getline}}). It is reinitialized
to zero each time a new input file is started.

@vindex NF
@item NF
@code{NF} is the number of fields in the current input record.
@code{NF} is set each time a new record is read, when a new field is
created, or when @code{$0} changes (@pxref{Fields, ,Examining Fields}).

@vindex NR
@item NR
This is the number of input records @code{awk} has processed since
the beginning of the program's execution
(@pxref{Records, ,How Input is Split into Records}).
@code{NR} is set each time a new record is read.

@vindex RLENGTH
@item RLENGTH
@code{RLENGTH} is the length of the substring matched by the
@code{match} function
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
@code{RLENGTH} is set by invoking the @code{match} function. Its value
is the length of the matched string, or @minus{}1 if no match was found.

@vindex RSTART
@item RSTART
@code{RSTART} is the start-index in characters of the substring matched by the
@code{match} function
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
@code{RSTART} is set by invoking the @code{match} function. Its value
is the position of the string where the matched substring starts, or zero
if no match was found.

@vindex RT
@item RT *
@code{RT} is set each time a record is read. It contains the input text
that matched the text denoted by @code{RS}, the record separator.

This variable is a @code{gawk} extension. In other @code{awk} implementations,
or if @code{gawk} is in compatibility mode
(@pxref{Options, ,Command Line Options}),
it is not special.
@end table
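
As an illustration of @code{RSTART} and @code{RLENGTH}, here is a small
sketch; the string and the regexps are arbitrary. The first call to
@code{match} succeeds and the second one fails:

@example
@group
$ awk 'BEGIN @{
>     match("foobarbaz", /ba[rz]/)
>     print RSTART, RLENGTH
>     match("foobarbaz", /xyz/)
>     print RSTART, RLENGTH
> @}'
@print{} 4 3
@print{} 0 -1
@end group
@end example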

@cindex dark corner
A side note about @code{NR} and @code{FNR}.
@code{awk} simply increments both of these variables
each time it reads a record, instead of setting them to the absolute
value of the number of records read. This means that your program can
change these variables, and their new values will be incremented for
each record (d.c.). For example:

@example
@group
$ echo '1
> 2
> 3
> 4' | awk 'NR == 2 @{ NR = 17 @}
> @{ print NR @}'
@print{} 1
@print{} 17
@print{} 18
@print{} 19
@end group
@end example

@noindent
Before @code{FNR} was added to the @code{awk} language
(@pxref{V7/SVR3.1, ,Major Changes between V7 and SVR3.1}),
many @code{awk} programs used this feature to track the number of
records in a file by resetting @code{NR} to zero when @code{FILENAME}
changed.

@node ARGC and ARGV, , Auto-set, Built-in Variables
@section Using @code{ARGC} and @code{ARGV}

In @ref{Auto-set, , Built-in Variables that Convey Information},
you saw this program describing the information contained in @code{ARGC}
and @code{ARGV}:

@example
@group
$ awk 'BEGIN @{
>        for (i = 0; i < ARGC; i++)
>            print ARGV[i]
>      @}' inventory-shipped BBS-list
@print{} awk
@print{} inventory-shipped
@print{} BBS-list
@end group
@end example

@noindent
In this example, @code{ARGV[0]} contains @code{"awk"}, @code{ARGV[1]}
contains @code{"inventory-shipped"}, and @code{ARGV[2]} contains
@code{"BBS-list"}.

Notice that the @code{awk} program is not entered in @code{ARGV}. The
other special command line options, with their arguments, are also not
entered. But variable assignments on the command line @emph{are}
treated as arguments, and do show up in the @code{ARGV} array.

Your program can alter @code{ARGC} and the elements of @code{ARGV}.
Each time @code{awk} reaches the end of an input file, it uses the next
element of @code{ARGV} as the name of the next input file. By storing a
different string there, your program can change which files are read.
You can use @code{"-"} to represent the standard input. By storing
additional elements and incrementing @code{ARGC} you can cause
additional files to be read.

If you decrease the value of @code{ARGC}, that eliminates input files
from the end of the list. By recording the old value of @code{ARGC}
elsewhere, your program can treat the eliminated arguments as
something other than file names.

To eliminate a file from the middle of the list, store the null string
(@code{""}) into @code{ARGV} in place of the file's name. As a
special feature, @code{awk} ignores file names that have been
replaced with the null string.
You may also use the @code{delete} statement to remove elements from
@code{ARGV} (@pxref{Delete, ,The @code{delete} Statement}).

All of these actions are typically done from the @code{BEGIN} rule,
before actual processing of the input begins.
@xref{Split Program, ,Splitting a Large File Into Pieces}, and see
@ref{Tee Program, ,Duplicating Output Into Multiple Files}, for an example
of each way of removing elements from @code{ARGV}.
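
As a minimal sketch of these techniques (the argument @samp{3} and the
variable name @code{limit} are invented for the illustration), the
following @code{BEGIN} rule takes the first command-line argument as a
value rather than a file name, replaces it with the null string so that
@code{awk} skips it, and then appends @code{"-"} so that the standard
input is read:

@example
@group
$ echo 'from stdin' | awk 'BEGIN @{
>     limit = ARGV[1]       # keep the argument for our own use
>     ARGV[1] = ""          # the null string is ignored as a file name
>     ARGV[ARGC++] = "-"    # add the standard input as a data file
> @}
> @{ print limit ": " $0 @}' 3
@print{} 3: from stdin
@end group
@end example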

The following fragment processes @code{ARGV} in order to examine, and
then remove, command line options.

@example
@group
BEGIN @{
    for (i = 1; i < ARGC; i++) @{
        if (ARGV[i] == "-v")
            verbose = 1
        else if (ARGV[i] == "-d")
            debug = 1
@end group
@group
        else if (ARGV[i] ~ /^-./) @{
            # any other argument that starts with a dash is an error
            e = sprintf("%s: unrecognized option -- %c",
                    ARGV[0], substr(ARGV[i], 2, 1))
            print e > "/dev/stderr"
        @} else
            break
        delete ARGV[i]
    @}
@}
@end group
@end example

@node Arrays, Built-in, Built-in Variables, Top
@chapter Arrays in @code{awk}

An @dfn{array} is a table of values, called @dfn{elements}. The
elements of an array are distinguished by their indices. @dfn{Indices}
may be either numbers or strings. @code{awk} maintains a single set
of names that may be used for naming variables, arrays and functions
(@pxref{User-defined, ,User-defined Functions}).
Thus, you cannot have a variable and an array with the same name in the
same @code{awk} program.

@menu
* Array Intro::                 Introduction to Arrays
* Reference to Elements::       How to examine one element of an array.
* Assigning Elements::          How to change an element of an array.
* Array Example::               Basic Example of an Array
* Scanning an Array::           A variation of the @code{for} statement. It
                                loops through the indices of an array's
                                existing elements.
* Delete::                      The @code{delete} statement removes an element
                                from an array.
* Numeric Array Subscripts::    How to use numbers as subscripts in
                                @code{awk}.
* Uninitialized Subscripts::    Using Uninitialized variables as subscripts.
* Multi-dimensional::           Emulating multi-dimensional arrays in
                                @code{awk}.
* Multi-scanning::              Scanning multi-dimensional arrays.
@end menu

@node Array Intro, Reference to Elements, Arrays, Arrays
@section Introduction to Arrays

@cindex arrays
The @code{awk} language provides one-dimensional @dfn{arrays} for storing groups
of related strings or numbers.

Every @code{awk} array must have a name. Array names have the same
syntax as variable names; any valid variable name would also be a valid
array name. But you cannot use one name in both ways (as an array and
as a variable) in one @code{awk} program.

Arrays in @code{awk} superficially resemble arrays in other programming
languages; but there are fundamental differences. In @code{awk}, you
don't need to specify the size of an array before you start to use it.
Additionally, any number or string in @code{awk} may be used as an
array index, not just consecutive integers.

In most other languages, you have to @dfn{declare} an array and specify
how many elements or components it contains. In such languages, the
declaration causes a contiguous block of memory to be allocated for that
many elements. An index in the array usually must be a non-negative integer; for
example, the index zero specifies the first element in the array, which is
actually stored at the beginning of the block of memory. Index one
specifies the second element, which is stored in memory right after the
first element, and so on. It is impossible to add more elements to the
array, because it has room for only as many elements as you declared.
(Some languages allow arbitrary starting and ending indices,
e.g., @samp{15 .. 27}, but the size of the array is still fixed when
the array is declared.)

A contiguous array of four elements might look like this,
conceptually, if the element values are eight, @code{"foo"},
@code{""} and 30:

@iftex
@c from Karl Berry, much thanks for the help.
@tex
\bigskip % space above the table (about 1 linespace)
\offinterlineskip
\newdimen\width \width = 1.5cm
\newdimen\hwidth \hwidth = 4\width \advance\hwidth by 2pt % 5 * 0.4pt
\centerline{\vbox{
\halign{\strut\hfil\ignorespaces#&&\vrule#&\hbox to\width{\hfil#\unskip\hfil}\cr
\noalign{\hrule width\hwidth}
&&{\tt 8} &&{\tt "foo"} &&{\tt ""} &&{\tt 30} &&\quad value\cr
\noalign{\hrule width\hwidth}
\noalign{\smallskip}
&\omit&0&\omit &1 &\omit&2 &\omit&3 &\omit&\quad index\cr
}
}}
@end tex
@end iftex
@ifinfo
@example
+---------+---------+--------+---------+
|    8    |  "foo"  |   ""   |    30   |    @r{value}
+---------+---------+--------+---------+
     0         1         2         3        @r{index}
@end example
@end ifinfo

@noindent
Only the values are stored; the indices are implicit from the order of
the values. Eight is the value at index zero, because eight appears in the
position with zero elements before it.

@cindex arrays, definition of
@cindex associative arrays
@cindex arrays, associative
Arrays in @code{awk} are different: they are @dfn{associative}. This means
that each array is a collection of pairs: an index, and its corresponding
array element value:

@example
@r{Element} 4     @r{Value} 30
@r{Element} 2     @r{Value} "foo"
@r{Element} 1     @r{Value} 8
@r{Element} 3     @r{Value} ""
@end example

@noindent
We have shown the pairs in jumbled order because their order is irrelevant.

One advantage of associative arrays is that new pairs can be added
at any time. For example, suppose we add to the above array a tenth element
whose value is @w{@code{"number ten"}}. The result is this:

@example
@r{Element} 10    @r{Value} "number ten"
@r{Element} 4     @r{Value} 30
@r{Element} 2     @r{Value} "foo"
@r{Element} 1     @r{Value} 8
@r{Element} 3     @r{Value} ""
@end example

@noindent
@cindex sparse arrays
@cindex arrays, sparse
Now the array is @dfn{sparse}, which just means some indices are missing:
it has elements 1--4 and 10, but doesn't have elements 5, 6, 7, 8, or 9.
@c ok, I should spell out the above, but ...

Another consequence of associative arrays is that the indices don't
have to be positive integers. Any number, or even a string, can be
an index. For example, here is an array which translates words from
English into French:

@example
@r{Element} "dog" @r{Value} "chien"
@r{Element} "cat" @r{Value} "chat"
@r{Element} "one" @r{Value} "un"
@r{Element} 1     @r{Value} "un"
@end example

@noindent
Here we decided to translate the number one in both spelled-out and
numeric form---thus illustrating that a single array can have both
numbers and strings as indices.
(In fact, array subscripts are always strings; this is discussed
in more detail in
@ref{Numeric Array Subscripts, ,Using Numbers to Subscript Arrays}.)

When @code{awk} creates an array for you, e.g., with the @code{split}
built-in function,
that array's indices are consecutive integers starting at one.
(@xref{String Functions, ,Built-in Functions for String Manipulation}.)
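
For instance, the following sketch (the sample string and the array name
@code{words} are arbitrary) splits a string and then walks the resulting
array by counting from one up to the value that @code{split} returns:

@example
@group
$ awk 'BEGIN @{
>     n = split("un deux trois", words)
>     for (i = 1; i <= n; i++)
>         print i, words[i]
> @}'
@print{} 1 un
@print{} 2 deux
@print{} 3 trois
@end group
@end example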

@node Reference to Elements, Assigning Elements, Array Intro, Arrays
@section Referring to an Array Element
@cindex array reference
@cindex element of array
@cindex reference to array

The principal way of using an array is to refer to one of its elements.
An array reference is an expression which looks like this:

@example
@var{array}[@var{index}]
@end example

@noindent
Here, @var{array} is the name of an array. The expression @var{index} is
the index of the element of the array that you want.

The value of the array reference is the current value of that array
element. For example, @code{foo[4.3]} is an expression for the element
of array @code{foo} at index @samp{4.3}.

If you refer to an array element that has no recorded value, the value
of the reference is @code{""}, the null string. This includes elements
to which you have not assigned any value, and elements that have been
deleted (@pxref{Delete, ,The @code{delete} Statement}). Such a reference
automatically creates that array element, with the null string as its value.
(In some cases, this is unfortunate, because it might waste memory inside
@code{awk}.)

@cindex arrays, presence of elements
@cindex arrays, the @code{in} operator
You can find out if an element exists in an array at a certain index with
the expression:

@example
@var{index} in @var{array}
@end example

@noindent
This expression tests whether or not the particular index exists,
without the side effect of creating that element if it is not present.
The expression has the value one (true) if @code{@var{array}[@var{index}]}
exists, and zero (false) if it does not exist.

For example, to test whether the array @code{frequencies} contains the
index @samp{2}, you could write this statement:

@example
if (2 in frequencies)
    print "Subscript 2 is present."
@end example

Note that this is @emph{not} a test of whether or not the array
@code{frequencies} contains an element whose @emph{value} is two.
(There is no way to do that except to scan all the elements.) Also, this
@emph{does not} create @code{frequencies[2]}, while the following
(incorrect) alternative would do so:

@example
if (frequencies[2] != "")
    print "Subscript 2 is present."
@end example
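
The difference can be observed directly. In this sketch (the array name
@code{a} and the subscript are arbitrary), the @code{in} test reports the
element as absent until the program merely refers to it:

@example
@group
$ awk 'BEGIN @{
>     print ("x" in a)      # not created yet
>     if (a["x"] == "")     # this reference creates a["x"]
>         print "null"
>     print ("x" in a)
> @}'
@print{} 0
@print{} null
@print{} 1
@end group
@end example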

@node Assigning Elements, Array Example, Reference to Elements, Arrays
@section Assigning Array Elements
@cindex array assignment
@cindex element assignment

Array elements are lvalues: they can be assigned values just like
@code{awk} variables:

@example
@var{array}[@var{subscript}] = @var{value}
@end example

@noindent
Here @var{array} is the name of your array. The expression
@var{subscript} is the index of the element of the array that you want
to assign a value. The expression @var{value} is the value you are
assigning to that element of the array.
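
Because array elements are lvalues, the other assignment operators work
on them too. A small sketch, with an invented array named @code{count}:

@example
@group
$ awk 'BEGIN @{
>     count["apple"] = 3
>     count["apple"] += 2    # update an existing element
>     count["pear"]++        # create an element by incrementing it
>     print count["apple"], count["pear"]
> @}'
@print{} 5 1
@end group
@end example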

@node Array Example, Scanning an Array, Assigning Elements, Arrays
@section Basic Array Example

The following program takes a list of lines, each beginning with a line
number, and prints them out in order of line number. The line numbers are
not in order, however, when they are first read: they are scrambled. This
program sorts the lines by making an array using the line numbers as
subscripts. It then prints out the lines in sorted order of their numbers.
It is a very simple program, and gets confused if it encounters repeated
numbers, gaps, or lines that don't begin with a number.

@example
@c file eg/misc/arraymax.awk
@{
    if ($1 > max)
        max = $1
    arr[$1] = $0
@}

END @{
    for (x = 1; x <= max; x++)
        print arr[x]
@}
@c endfile
@end example

The first rule keeps track of the largest line number seen so far;
it also stores each line into the array @code{arr}, at an index that
is the line's number.

The second rule runs after all the input has been read, to print out
all the lines.

When this program is run with the following input:

@example
@group
@c file eg/misc/arraymax.data
5 I am the Five man
2 Who are you? The new number two!
4 . . . And four on the floor
1 Who is number one?
3 I three you.
@c endfile
@end group
@end example

@noindent
its output is this:

@example
1 Who is number one?
2 Who are you? The new number two!
3 I three you.
4 . . . And four on the floor
5 I am the Five man
@end example

If a line number is repeated, the last line with a given number overrides
the others.

Gaps in the line numbers can be handled with an easy improvement to the
program's @code{END} rule:

@example
END @{
    for (x = 1; x <= max; x++)
        if (x in arr)
            print arr[x]
@}
@end example

@node Scanning an Array, Delete, Array Example, Arrays
@section Scanning All Elements of an Array
@cindex @code{for (x in @dots{})}
@cindex arrays, special @code{for} statement
@cindex scanning an array

In programs that use arrays, you often need a loop that executes
once for each element of an array. In other languages, where arrays are
contiguous and indices are limited to positive integers, this is
easy: you can find all the valid indices by counting from the lowest index
up to the highest. This
technique won't do the job in @code{awk}, since any number or string
can be an array index. So @code{awk} has a special kind of @code{for}
statement for scanning an array:

@example
for (@var{var} in @var{array})
    @var{body}
@end example

@noindent
This loop executes @var{body} once for each index in @var{array} that your
program has previously used, with the
variable @var{var} set to that index.

Here is a program that uses this form of the @code{for} statement. The
first rule scans the input records and notes which words appear (at
least once) in the input, by storing a one into the array @code{used} with
the word as index. The second rule scans the elements of @code{used} to
find all the distinct words that appear in the input. It prints each
word that is more than 10 characters long, and also prints the number of
such words. @xref{String Functions, ,Built-in Functions for String Manipulation}, for more information
on the built-in function @code{length}.

@example
# Record a 1 for each word that is used at least once.
@{
    for (i = 1; i <= NF; i++)
        used[$i] = 1
@}

# Find number of distinct words more than 10 characters long.
END @{
    for (x in used)
        if (length(x) > 10) @{
            ++num_long_words
            print x
        @}
    print num_long_words, "words longer than 10 characters"
@}
@end example

@noindent
@xref{Word Sorting, ,Generating Word Usage Counts},
for a more detailed example of this type.

The order in which elements of the array are accessed by this statement
is determined by the internal arrangement of the array elements within
@code{awk} and cannot be controlled or changed. This can lead to
problems if new elements are added to @var{array} by statements in
the loop body; you cannot predict whether or not the @code{for} loop will
reach them. Similarly, changing @var{var} inside the loop may produce
strange results. It is best to avoid such things.
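
One way to stay clear of these problems is to leave the array alone while
scanning it. The sketch below (the names @code{count} and @code{w} are
arbitrary) counts words and then prints one line per distinct word. No
expected output is shown, because the order of the lines depends on the
@code{awk} implementation; pipe the result through @code{sort} if a
stable order is needed.

@example
@group
@{
    for (i = 1; i <= NF; i++)
        count[$i]++
@}

END @{
    for (w in count)
        print w, count[w]
@}
@end group
@end example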

8933

8934

@node Delete, Numeric Array Subscripts, Scanning an Array, Arrays

8935

@section The @code{delete} Statement

8936

@cindex @code{delete} statement

8937

@cindex deleting elements of arrays

8938

@cindex removing elements of arrays

8939

@cindex arrays, deleting an element

8940

8941

You can remove an individual element of an array using the @code{delete}

8942

statement:

8943

8944

@example

8945

delete @var{array}[@var{index}]

8946

@end example

8947

8948

Once you have deleted an array element, you can no longer obtain any

8949

value the element once had. It is as if you had never referred

8950

to it and had never given it any value.

8951

8952

Here is an example of deleting elements in an array:

8953

8954

@example

8955

for (i in frequencies)

8956

delete frequencies[i]

8957

@end example

8958

8959

@noindent

8960

This example removes all the elements from the array @code{frequencies}.

8961

8962

If you delete an element, a subsequent @code{for} statement to scan the array

8963

will not report that element, and the @code{in} operator to check for

8964

the presence of that element will return zero (i.e.@: false):

8965

8966

@example

8967

delete foo[4]

8968

if (4 in foo)

8969

print "This will never be printed"

8970

@end example

8971

8972

It is important to note that deleting an element is @emph{not} the

8973

same as assigning it a null value (the empty string, @code{""}).

8974

8975

@example

8976

foo[4] = ""

8977

if (4 in foo)

8978

print "This is printed, even though foo[4] is empty"

8979

@end example

8980

8981

It is not an error to delete an element that does not exist.

8982

8983

@cindex arrays, deleting entire contents

8984

@cindex deleting entire arrays

8985

@cindex differences between @code{gawk} and @code{awk}

8986

You can delete all the elements of an array with a single statement,

8987

by leaving off the subscript in the @code{delete} statement.

8988

8989

@example

8990

delete @var{array}

8991

@end example

8992

8993

This ability is a @code{gawk} extension; it is not available in

8994

compatibility mode (@pxref{Options, ,Command Line Options}).

8995

8996

Using this version of the @code{delete} statement is about three times

8997

more efficient than the equivalent loop that deletes each element one

8998

at a time.

8999

9000

@cindex portability issues

9001

The following statement provides a portable, but non-obvious, way to
clear out an array:

9003

9004

@cindex Brennan, Michael

9005

@example

9006

@group

9007

# thanks to Michael Brennan for pointing this out

9008

split("", array)

9009

@end group

9010

@end example

9011

9012

The @code{split} function

9013

(@pxref{String Functions, ,Built-in Functions for String Manipulation})

9014

clears out the target array first. This call asks it to split

9015

apart the null string. Since there is no data to split out, the

9016

function simply clears the array and then returns.
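
As a minimal sketch of this idiom (the array name @code{data} and its
contents are purely illustrative), the following program fills an
array, clears it with @code{split}, and then counts what is left:

@example
BEGIN @{
    data["a"] = 1
    data["b"] = 2
    split("", data)     # portable way to empty the array
    n = 0
    for (k in data)     # body never runs: no elements remain
        n++
    print n
@}
@print{} 0
@end example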

9017

9018

@node Numeric Array Subscripts, Uninitialized Subscripts, Delete, Arrays

9019

@section Using Numbers to Subscript Arrays

9020

9021

An important aspect of arrays to remember is that @emph{array subscripts

9022

are always strings}. If you use a numeric value as a subscript,

9023

it will be converted to a string value before it is used for subscripting

9024

(@pxref{Conversion, ,Conversion of Strings and Numbers}).

9025

9026

@cindex conversions, during subscripting

9027

@cindex numbers, used as subscripts

9028

@vindex CONVFMT

9029

This means that the value of the built-in variable @code{CONVFMT} can potentially

9030

affect how your program accesses elements of an array. For example:

9031

9032

@example

9033

xyz = 12.153

9034

data[xyz] = 1

9035

CONVFMT = "%2.2f"

9036

@group

9037

if (xyz in data)
    printf "%s is in data\n", xyz
else
    printf "%s is not in data\n", xyz

9041

@end group

9042

@end example

9043

9044

@noindent

9045

This prints @samp{12.15 is not in data}. The first statement gives

9046

@code{xyz} a numeric value. Assigning to

9047

@code{data[xyz]} subscripts @code{data} with the string value @code{"12.153"}

9048

(using the default conversion value of @code{CONVFMT}, @code{"%.6g"}),

9049

and assigns one to @code{data["12.153"]}. The program then changes

9050

the value of @code{CONVFMT}. The test @samp{(xyz in data)} generates a new

9051

string value from @code{xyz}, this time @code{"12.15"}, since the value of

9052

@code{CONVFMT} only allows two digits after the decimal point. This test fails,

9053

since @code{"12.15"} is a different string from @code{"12.153"}.
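
One way to guard against this, sketched here on the assumption that it
is acceptable to convert the subscript to a string once and reuse it
(the variable @code{key} is hypothetical), might look like this:

@example
BEGIN @{
    xyz = 12.153
    key = xyz ""        # convert once, using the current CONVFMT
    data[key] = 1
    CONVFMT = "%2.2f"
    if (key in data)    # key is already a string; CONVFMT is not used
        printf "%s is in data\n", key
@}
@print{} 12.153 is in data
@end example

@noindent
Because @code{key} holds a string, later changes to @code{CONVFMT}
no longer affect which element is accessed.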

9054

9055

According to the rules for conversions

9056

(@pxref{Conversion, ,Conversion of Strings and Numbers}), integer

9057

values are always converted to strings as integers, no matter what the

9058

value of @code{CONVFMT} may happen to be. So the usual case of:

9059

9060

@example

9061

for (i = 1; i <= maxsub; i++)

9062

@i{do something with} array[i]

9063

@end example

9064

9065

@noindent

9066

will work, no matter what the value of @code{CONVFMT}.
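
A short sketch of this rule (the array @code{a} is illustrative only):

@example
BEGIN @{
    CONVFMT = "%2.2f"
    a[3] = "three"      # integer value: the subscript is "3", not "3.00"
    if ("3" in a)
        print "found"
@}
@print{} found
@end example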

9067

9068

As with many things in @code{awk}, most of the time things work
as you would expect. Still, it is useful to know the precise rules,
since they can occasionally have a subtle effect on your programs.

9072

9073

@node Uninitialized Subscripts, Multi-dimensional, Numeric Array Subscripts, Arrays

9074

@section Using Uninitialized Variables as Subscripts

9075

9076

@cindex uninitialized variables, as array subscripts

9077

@cindex array subscripts, uninitialized variables

9078

Suppose you want to print your input data in reverse order.

9079

A reasonable attempt at a program to do so (with some test

9080

data) might look like this:

9081

9082

@example

9083

$ echo 'line 1

9084

> line 2

9085

> line 3' | awk '@{ l[lines] = $0; ++lines @}

9086

> END @{

9087

> for (i = lines-1; i >= 0; --i)

9088

> print l[i]

9089

> @}'

9090

@print{} line 3

9091

@print{} line 2

9092

@end example

9093

9094

Unfortunately, the very first line of input data did not come out in the

9095

output!

9096

9097

At first glance, this program should have worked. The variable @code{lines}

9098

is uninitialized, and uninitialized variables have the numeric value zero.

9099

So, the value of @code{l[0]} should have been printed.

9100

9101

The issue here is that subscripts for @code{awk} arrays are @strong{always}

9102

strings. And uninitialized variables, when used as strings, have the

9103

value @code{""}, not zero. Thus, @samp{line 1} ended up stored in

9104

@code{l[""]}.

9105

9106

The following version of the program works correctly:

9107

9108

@example

9109

@{ l[lines++] = $0 @}

9110

END @{

9111

for (i = lines - 1; i >= 0; --i)

9112

print l[i]

9113

@}

9114

@end example

9115

9116

Here, the @samp{++} forces @code{lines} to be numeric, thus making

9117

the ``old value'' numeric zero, which is then converted to @code{"0"}

9118

as the array subscript.
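
Another possible fix, shown here only as a sketch, is to give
@code{lines} an explicit numeric value before the first record is read,
so that the first subscript is @code{"0"} instead of @code{""}:

@example
BEGIN @{ lines = 0 @}
@{ l[lines] = $0; ++lines @}
END @{
    for (i = lines - 1; i >= 0; --i)
        print l[i]
@}
@end example

@noindent
With the three-line test input shown earlier, this variant prints
@samp{line 3}, @samp{line 2} and @samp{line 1}, in that order.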

9119

9120

@cindex null string, as array subscript

9121

@cindex dark corner

9122

As we have just seen, even though it is somewhat unusual, the null string

9123

(@code{""}) is a valid array subscript (d.c.). If @samp{--lint} is provided

9124

on the command line (@pxref{Options, ,Command Line Options}),

9125

@code{gawk} will warn about the use of the null string as a subscript.

9126

9127

@node Multi-dimensional, Multi-scanning, Uninitialized Subscripts, Arrays

9128

@section Multi-dimensional Arrays

9129

9130

@cindex subscripts in arrays

9131

@cindex arrays, multi-dimensional subscripts

9132

@cindex multi-dimensional subscripts

9133

A multi-dimensional array is an array in which an element is identified

9134

by a sequence of indices, instead of a single index. For example, a

9135

two-dimensional array requires two indices. The usual way (in most

9136

languages, including @code{awk}) to refer to an element of a

9137

two-dimensional array named @code{grid} is with

9138

@code{grid[@var{x},@var{y}]}.

9139

9140

@vindex SUBSEP

9141

Multi-dimensional arrays are supported in @code{awk} through

9142

concatenation of indices into one string. What happens is that

9143

@code{awk} converts the indices into strings

9144

(@pxref{Conversion, ,Conversion of Strings and Numbers}) and

9145

concatenates them together, with a separator between them. This creates

9146

a single string that describes the values of the separate indices. The

9147

combined string is used as a single index into an ordinary,

9148

one-dimensional array. The separator used is the value of the built-in

9149

variable @code{SUBSEP}.

9150

9151

For example, suppose we evaluate the expression @samp{foo[5,12] = "value"}

9152

when the value of @code{SUBSEP} is @code{"@@"}. The numbers five and 12 are

9153

converted to strings and

9154

concatenated with an @samp{@@} between them, yielding @code{"5@@12"}; thus,

9155

the array element @code{foo["5@@12"]} is set to @code{"value"}.

9156

9157

Once the element's value is stored, @code{awk} has no record of whether

9158

it was stored with a single index or a sequence of indices. The two

9159

expressions @samp{foo[5,12]} and @w{@samp{foo[5 SUBSEP 12]}} are always

9160

equivalent.
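
The equivalence can be checked with a small sketch such as the
following (the array @code{foo} and the stored value are the
illustrative ones used above):

@example
BEGIN @{
    foo[5,12] = "value"
    if ((5 SUBSEP 12) in foo)
        print "same element:", foo[5 SUBSEP 12]
@}
@print{} same element: value
@end example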

9161

9162

The default value of @code{SUBSEP} is the string @code{"\034"},

9163

which contains a non-printing character that is unlikely to appear in an

9164

@code{awk} program or in most input data.

9165

9166

The usefulness of choosing an unlikely character comes from the fact

9167

that index values that contain a string matching @code{SUBSEP} lead to

9168

combined strings that are ambiguous. Suppose that @code{SUBSEP} were

9169

@code{"@@"}; then @w{@samp{foo["a@@b", "c"]}} and @w{@samp{foo["a",

9170

"b@@c"]}} would be indistinguishable because both would actually be

9171

stored as @samp{foo["a@@b@@c"]}.
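
The collision described above can be reproduced with a sketch like
this one, which deliberately sets @code{SUBSEP} to a character that
also occurs in the index values:

@example
BEGIN @{
    SUBSEP = "@@"
    foo["a@@b", "c"] = 1
    if (("a", "b@@c") in foo)
        print "the two index sequences collide"
@}
@print{} the two index sequences collide
@end example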

9172

9173

You can test whether a particular index-sequence exists in a

9174

``multi-dimensional'' array with the same operator @samp{in} used for single

9175

dimensional arrays. Instead of a single index as the left-hand operand,

9176

write the whole sequence of indices, separated by commas, in

9177

parentheses:

9178

9179

@example

9180

(@var{subscript1}, @var{subscript2}, @dots{}) in @var{array}

9181

@end example
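
For instance, in the following sketch (the array @code{grid} and its
single element are illustrative), only the index sequence that was
actually stored is reported as present:

@example
BEGIN @{
    grid[1,2] = "x"
    if ((1,2) in grid)
        print "(1,2) is present"
    if (! ((2,1) in grid))
        print "(2,1) is absent"
@}
@print{} (1,2) is present
@print{} (2,1) is absent
@end example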

9182

9183

The following example treats its input as a two-dimensional array of

9184

fields; it rotates this array 90 degrees clockwise and prints the

9185

result. It assumes that all lines have the same number of

9186

elements.

9187

9188

@example

9189

@group

9190

awk '@{

9191

if (max_nf < NF)

9192

max_nf = NF

9193

max_nr = NR

9194

for (x = 1; x <= NF; x++)

9195

vector[x, NR] = $x

9196

@}

9197

@end group

9198

9199

@group

9200

END @{

9201

for (x = 1; x <= max_nf; x++) @{

9202

for (y = max_nr; y >= 1; --y)

9203

printf("%s ", vector[x, y])

9204

printf("\n")

9205

@}

9206

@}'

9207

@end group

9208

@end example

9209

9210

@noindent

9211

When given the input:

9212

9213

@example

9214

@group

9215

1 2 3 4 5 6

9216

2 3 4 5 6 1

9217

3 4 5 6 1 2

9218

4 5 6 1 2 3

9219

@end group

9220

@end example

9221

9222

@noindent

9223

it produces:

9224

9225

@example

9226

@group

9227

4 3 2 1

9228

5 4 3 2

9229

6 5 4 3

9230

1 6 5 4

9231

2 1 6 5

9232

3 2 1 6

9233

@end group

9234

@end example

9235

9236

@node Multi-scanning, , Multi-dimensional, Arrays

9237

@section Scanning Multi-dimensional Arrays

9238

9239

There is no special @code{for} statement for scanning a

9240

``multi-dimensional'' array; there cannot be one, because in truth there

9241

are no multi-dimensional arrays or elements; there is only a

9242

multi-dimensional @emph{way of accessing} an array.

9243

9244

However, if your program has an array that is always accessed as

9245

multi-dimensional, you can get the effect of scanning it by combining

9246

the scanning @code{for} statement

9247

(@pxref{Scanning an Array, ,Scanning All Elements of an Array}) with the

9248

@code{split} built-in function

9249

(@pxref{String Functions, ,Built-in Functions for String Manipulation}).

9250

It works like this:

9251

9252

@example

9253

for (combined in array) @{

9254

split(combined, separate, SUBSEP)

9255

@dots{}

9256

@}

9257

@end example

9258

9259

@noindent

9260

This sets @code{combined} to

9261

each concatenated, combined index in the array, and splits it

9262

into the individual indices by breaking it apart where the value of

9263

@code{SUBSEP} appears. The split-out indices become the elements of

9264

the array @code{separate}.

9265

9266

Thus, suppose you have previously stored a value in @code{array[1, "foo"]};

9267

then an element with index @code{"1\034foo"} exists in

9268

@code{array}. (Recall that the default value of @code{SUBSEP} is

9269

the character with code 034.) Sooner or later the @code{for} statement

9270

will find that index and do an iteration with @code{combined} set to

9271

@code{"1\034foo"}. Then the @code{split} function is called as

9272

follows:

9273

9274

@example

9275

split("1\034foo", separate, "\034")

9276

@end example

9277

9278

@noindent

9279

The result of this is to set @code{separate[1]} to @code{"1"} and

9280

@code{separate[2]} to @code{"foo"}. Presto, the original sequence of

9281

separate indices has been recovered.
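
Putting the pieces together, a complete sketch of such a scan might
look like the following. The two stored elements are made up for
illustration.

@example
BEGIN @{
    array[1, "foo"] = "a"
    array[2, "bar"] = "b"
    for (combined in array) @{
        split(combined, separate, SUBSEP)
        print separate[1], separate[2], array[combined]
    @}
@}
@end example

@noindent
This prints one line per element, such as @samp{1 foo a}. The order in
which the elements are visited depends on the implementation, so the
lines may come out in either order.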

9282

9283

@node Built-in, User-defined, Arrays, Top

9284

@chapter Built-in Functions

9285

9286

@c 2e: USE TEXINFO-2 FUNCTION DEFINITION STUFF!!!!!!!!!!!!!

9287

@cindex built-in functions

9288

@dfn{Built-in} functions are functions that are always available for

9289

your @code{awk} program to call. This chapter defines all the built-in

9290

functions in @code{awk}; some of them are mentioned in other sections,

9291

but they are summarized here for your convenience. (You can also define

9292

new functions yourself. @xref{User-defined, ,User-defined Functions}.)

9293

9294

@menu

9295

* Calling Built-in:: How to call built-in functions.

9296

* Numeric Functions:: Functions that work with numbers, including

9297

@code{int}, @code{sin} and @code{rand}.

9298

* String Functions:: Functions for string manipulation, such as

9299

@code{split}, @code{match}, and

9300

@code{sprintf}.

9301

* I/O Functions:: Functions for files and shell commands.

9302

* Time Functions:: Functions for dealing with time stamps.

9303

@end menu

9304

9305

@node Calling Built-in, Numeric Functions, Built-in, Built-in

9306

@section Calling Built-in Functions

9307

9308

To call a built-in function, write the name of the function followed

9309

by arguments in parentheses. For example, @samp{atan2(y + z, 1)}

9310

is a call to the function @code{atan2}, with two arguments.

9311

9312

Whitespace is ignored between the built-in function name and the

9313

open-parenthesis, but we recommend that you avoid using whitespace

9314

there. User-defined functions do not permit whitespace in this way, and

9315

you will find it easier to avoid mistakes by following a simple

9316

convention which always works: no whitespace after a function name.

9317

9318

@cindex differences between @code{gawk} and @code{awk}

9319

Each built-in function accepts a certain number of arguments.

9320

In some cases, arguments can be omitted. The defaults for omitted

9321

arguments vary from function to function and are described under the

9322

individual functions. In some @code{awk} implementations, extra

9323

arguments given to built-in functions are ignored. However, in @code{gawk},

9324

it is a fatal error to give extra arguments to a built-in function.

9325

9326

When a function is called, expressions that create the function's actual

9327

parameters are evaluated completely before the function call is performed.

9328

For example, in the code fragment:

9329

9330

@example

9331

i = 4

9332

j = sqrt(i++)

9333

@end example

9334

9335

@noindent

9336

the variable @code{i} is set to five before @code{sqrt} is called

9337

with a value of four for its actual parameter.
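
Printing both variables afterwards makes this visible:

@example
BEGIN @{
    i = 4
    j = sqrt(i++)
    print i, j
@}
@print{} 5 2
@end example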

9338

9339

@cindex evaluation, order of

9340

@cindex order of evaluation

9341

The order of evaluation of the expressions used for the function's

9342

parameters is undefined. Thus, you should not write programs that

9343

assume that parameters are evaluated from left to right or from

9344

right to left. For example,

9345

9346

@example

9347

i = 5

9348

j = atan2(++i, i *= 2)

9349

@end example

9350

9351

If the order of evaluation is left to right, then @code{i} first becomes

9352

six, and then 12, and @code{atan2} is called with the two arguments six

9353

and 12. But if the order of evaluation is right to left, @code{i}

9354

first becomes 10, and then 11, and @code{atan2} is called with the

9355

two arguments 11 and 10.

9356

9357

@node Numeric Functions, String Functions, Calling Built-in, Built-in

9358

@section Numeric Built-in Functions

9359

9360

Here is a full list of built-in functions that work with numbers.

9361

Optional parameters are enclosed in square brackets (``['' and ``]'').

9362

9363

@table @code

9364

@item int(@var{x})

9365

@findex int

9366

This truncates @var{x} toward zero, producing the nearest integer
that is located between @var{x} and zero.

9368

9369

For example, @code{int(3)} is three, @code{int(3.9)} is three, @code{int(-3.9)}

9370

is @minus{}3, and @code{int(-3)} is @minus{}3 as well.

9371

9372

@item sqrt(@var{x})

9373

@findex sqrt

9374

This gives you the positive square root of @var{x}. It reports an error

9375

if @var{x} is negative. Thus, @code{sqrt(4)} is two.

9376

9377

@item exp(@var{x})

9378

@findex exp

9379

This gives you the exponential of @var{x} (@code{e ^ @var{x}}), or reports

9380

an error if @var{x} is out of range. The range of values @var{x} can have

9381

depends on your machine's floating point representation.

9382

9383

@item log(@var{x})

9384

@findex log

9385

This gives you the natural logarithm of @var{x}, if @var{x} is positive;

9386

otherwise, it reports an error.

9387

9388

@item sin(@var{x})

9389

@findex sin

9390

This gives you the sine of @var{x}, with @var{x} in radians.

9391

9392

@item cos(@var{x})

9393

@findex cos

9394

This gives you the cosine of @var{x}, with @var{x} in radians.

9395

9396

@item atan2(@var{y}, @var{x})

9397

@findex atan2

9398

This gives you the arctangent of @code{@var{y} / @var{x}} in radians.
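
A common use of @code{atan2}, given here purely as an illustration, is
to obtain the value of pi, since the arctangent of a zero @var{y} with
a negative @var{x} is pi:

@example
BEGIN @{ printf "%.5f\n", atan2(0, -1) @}
@print{} 3.14159
@end example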

9399

9400

@item rand()

9401

@findex rand

9402

This gives you a random number. The values of @code{rand} are

9403

uniformly-distributed between zero and one.

9404

The value may be zero, but it is never one.

9405

9406

Often you want random integers instead. Here is a user-defined function

9407

you can use to obtain a random non-negative integer less than @var{n}:

9408

9409

@example

9410

function randint(n) @{

9411

return int(n * rand())

9412

@}

9413

@end example

9414

9415

@noindent

9416

The multiplication produces a non-negative random real number less

9417

than @code{n}. We then make it an integer (using @code{int}) between zero

9418

and @code{n} @minus{} 1, inclusive.

9419

9420

Here is an example where a similar function is used to produce

9421

random integers between one and @var{n}. This program

9422

prints a new random number for each input record.

9423

9424

@example

9425

@group

9426

awk '

9427

# Function to roll a simulated die.

9428

function roll(n) @{ return 1 + int(rand() * n) @}

9429

@end group

9430

9431

@group

9432

# Roll 3 six-sided dice and

9433

# print total number of points.

9434

@{

9435

printf("%d points\n",

9436

roll(6)+roll(6)+roll(6))

9437

@}'

9438

@end group

9439

@end example

9440

9441

@cindex seed for random numbers

9442

@cindex random numbers, seed of

9443

@comment MAWK uses a different seed each time.

9444

@strong{Caution:} In most @code{awk} implementations, including @code{gawk},

9445

@code{rand} starts generating numbers from the same

9446

starting number, or @dfn{seed}, each time you run @code{awk}. Thus,

9447

a program will generate the same results each time you run it.

9448

The numbers are random within one @code{awk} run, but predictable

9449

from run to run. This is convenient for debugging, but if you want

9450

a program to do different things each time it is used, you must change

9451

the seed to a value that will be different in each run. To do this,

9452

use @code{srand}.

9453

9454

@item srand(@r{[}@var{x}@r{]})

9455

@findex srand

9456

The function @code{srand} sets the starting point, or seed,

9457

for generating random numbers to the value @var{x}.

9458

9459

Each seed value leads to a particular sequence of random

9460

numbers.@footnote{Computer generated random numbers really are not truly

9461

random. They are technically known as ``pseudo-random.'' This means

9462

that while the numbers in a sequence appear to be random, you can in

9463

fact generate the same sequence of random numbers over and over again.}

9464

Thus, if you set the seed to the same value a second time, you will get

9465

the same sequence of random numbers again.

9466

9467

If you omit the argument @var{x}, as in @code{srand()}, then the current

9468

date and time of day are used for a seed. This is the way to get a
sequence of random numbers that differs from one run to the next.

9470

9471

The return value of @code{srand} is the previous seed. This makes it

9472

easy to keep track of the seeds for use in consistently reproducing

9473

sequences of random numbers.
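
The following sketch illustrates both points. The seed value 42 is
arbitrary, and the variable names are hypothetical.

@example
BEGIN @{
    srand(42)
    a = rand()
    prev = srand(42)    # reseed; the previous seed (42) is returned
    b = rand()
    same = (a == b)     # same seed, so the sequence repeats
    print same, prev
@}
@print{} 1 42
@end example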

9474

@end table

9475

9476

@node String Functions, I/O Functions, Numeric Functions, Built-in

9477

@section Built-in Functions for String Manipulation

9478

9479

The functions in this section look at or change the text of one or more

9480

strings.

9481

Optional parameters are enclosed in square brackets (``['' and ``]'').

9482

9483

@table @code

9484

@item index(@var{in}, @var{find})

9485

@findex index

9486

This searches the string @var{in} for the first occurrence of the string

9487

@var{find}, and returns the position in characters where that occurrence

9488

begins in the string @var{in}. For example:

9489

9490

@example

9491

$ awk 'BEGIN @{ print index("peanut", "an") @}'

9492

@print{} 3

9493

@end example

9494

9495

@noindent

9496

If @var{find} is not found, @code{index} returns zero.

9497

(Remember that string indices in @code{awk} start at one.)

9498

9499

@item length(@r{[}@var{string}@r{]})

9500

@findex length

9501

This gives you the number of characters in @var{string}. If

9502

@var{string} is a number, the length of the digit string representing

9503

that number is returned. For example, @code{length("abcde")} is five. By

9504

contrast, @code{length(15 * 35)} works out to three. How? Well, 15 * 35 =

9505

525, and 525 is then converted to the string @code{"525"}, which has

9506

three characters.

9507

9508

If no argument is supplied, @code{length} returns the length of @code{$0}.

9509

9510

@cindex historical features

9511

@cindex portability issues

9512

@cindex @code{awk} language, POSIX version

9513

@cindex POSIX @code{awk}

9514

In older versions of @code{awk}, you could call the @code{length} function

9515

without any parentheses. Doing so is marked as ``deprecated'' in the

9516

POSIX standard. This means that while you can do this in your

9517

programs, it is a feature that can eventually be removed from a future

9518

version of the standard. Therefore, for maximal portability of your

9519

@code{awk} programs, you should always supply the parentheses.

9520

9521

@item match(@var{string}, @var{regexp})

9522

@findex match

9523

The @code{match} function searches the string, @var{string}, for the

9524

longest, leftmost substring matched by the regular expression,

9525

@var{regexp}. It returns the character position, or @dfn{index}, of

9526

where that substring begins (one, if it starts at the beginning of

9527

@var{string}). If no match is found, it returns zero.

9528

9529

@vindex RSTART

9530

@vindex RLENGTH

9531

The @code{match} function sets the built-in variable @code{RSTART} to

9532

the index. It also sets the built-in variable @code{RLENGTH} to the

9533

length in characters of the matched substring. If no match is found,

9534

@code{RSTART} is set to zero, and @code{RLENGTH} to @minus{}1.

9535

9536

For example:

9537

9538

@example

9539

@group

9540

@c file eg/misc/findpat.sh

9541

awk '@{

9542

if ($1 == "FIND")

9543

regex = $2

9544

else @{

9545

where = match($0, regex)

9546

if (where != 0)

9547

print "Match of", regex, "found at", \

9548

where, "in", $0

9549

@}

9550

@}'

9551

@c endfile

9552

@end group

9553

@end example

9554

9555

@noindent

9556

This program looks for lines that match the regular expression stored in

9557

the variable @code{regex}. This regular expression can be changed. If the

9558

first word on a line is @samp{FIND}, @code{regex} is changed to be the

9559

second word on that line. Therefore, given:

9560

9561

@example

9562

@c file eg/misc/findpat.data

9563

FIND ru+n

9564

My program runs

9565

but not very quickly

9566

FIND Melvin

9567

JF+KM

9568

This line is property of Reality Engineering Co.

9569

Melvin was here.

9570

@c endfile

9571

@end example

9572

9573

@noindent

9574

@code{awk} prints:

9575

9576

@example

9577

Match of ru+n found at 12 in My program runs

9578

Match of Melvin found at 1 in Melvin was here.

9579

@end example
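
The variables @code{RSTART} and @code{RLENGTH} can be combined with
@code{substr} to pull out the matched text itself. A minimal sketch,
reusing one line of the sample data above (the variable @code{s} is
illustrative):

@example
BEGIN @{
    s = "My program runs"
    if (match(s, /ru+n/))
        print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
@}
@print{} 12 3 run
@end example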

9580

9581

@item split(@var{string}, @var{array} @r{[}, @var{fieldsep}@r{]})

9582

@findex split

9583

This divides @var{string} into pieces separated by @var{fieldsep},

9584

and stores the pieces in @var{array}. The first piece is stored in

9585

@code{@var{array}[1]}, the second piece in @code{@var{array}[2]}, and so

9586

forth. The string value of the third argument, @var{fieldsep}, is

9587

a regexp describing where to split @var{string} (much as @code{FS} can

9588

be a regexp describing where to split input records). If

9589

the @var{fieldsep} is omitted, the value of @code{FS} is used.

9590

@code{split} returns the number of elements created.

9591

9592

The @code{split} function splits strings into pieces in a

9593

manner similar to the way input lines are split into fields. For example:

9594

9595

@example

9596

split("cul-de-sac", a, "-")

9597

@end example

9598

9599

@noindent

9600

splits the string @samp{cul-de-sac} into three fields using @samp{-} as the

9601

separator. It sets the contents of the array @code{a} as follows:

9602

9603

@example

9604

a[1] = "cul"

9605

a[2] = "de"

9606

a[3] = "sac"

9607

@end example

9608

9609

@noindent

9610

The value returned by this call to @code{split} is three.

9611

9612

As with input field-splitting, when the value of @var{fieldsep} is

9613

@w{@code{" "}}, leading and trailing whitespace is ignored, and the elements

9614

are separated by runs of whitespace.
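
For instance, in this sketch (the string and the array name
@code{parts} are illustrative), the extra blanks do not produce empty
elements:

@example
BEGIN @{
    n = split("  a   b  c ", parts, " ")
    print n, parts[1], parts[3]
@}
@print{} 3 a c
@end example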

9615

9616

@cindex differences between @code{gawk} and @code{awk}

9617

Also as with input field-splitting, if @var{fieldsep} is the null string, each

9618

individual character in the string is split into its own array element.

9619

(This is a @code{gawk}-specific extension.)

9620

9621

@cindex dark corner

9622

Recent implementations of @code{awk}, including @code{gawk}, allow

9623

the third argument to be a regexp constant (@code{/abc/}), as well as a

9624

string (d.c.). The POSIX standard allows this as well.

9625

9626

Before splitting the string, @code{split} deletes any previously existing

9627

elements in the array @var{array} (d.c.).

9628

9629

@item sprintf(@var{format}, @var{expression1},@dots{})

9630

@findex sprintf

9631

This returns (without printing) the string that @code{printf} would

9632

have printed out with the same arguments

9633

(@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}).

9634

For example:

9635

9636

@example

9637

sprintf("pi = %.2f (approx.)", 22/7)

9638

@end example

9639

9640

@noindent

9641

returns the string @w{@code{"pi = 3.14 (approx.)"}}.

9642

9643

@ignore

9644

2e: For sub, gsub, and gensub, either here or in the "how much matches"

9645

section, we need some explanation that it is possible to match the

9646

null string when using closures like *. E.g.,

9647

9648

$ echo abc | awk '{ gsub(/m*/, "X"); print }'

9649

@print{} XaXbXc

9650

9651

Although this makes a certain amount of sense, it can be very

9652

surprising.

9653

@end ignore

9654

9655

@item sub(@var{regexp}, @var{replacement} @r{[}, @var{target}@r{]})

9656

@findex sub

9657

The @code{sub} function alters the value of @var{target}.

9658

It searches this value, which is treated as a string, for the

9659

leftmost longest substring matched by the regular expression, @var{regexp},

9660

extending this match as far as possible. Then the entire string is

9661

changed by replacing the matched text with @var{replacement}.

9662

The modified string becomes the new value of @var{target}.

9663

9664

This function is peculiar because @var{target} is not simply

9665

used to compute a value, and not just any expression will do: it

9666

must be a variable, field or array element, so that @code{sub} can

9667

store a modified value there. If this argument is omitted, then the

9668

default is to use and alter @code{$0}.

9669

9670

For example:

9671

9672

@example

9673

str = "water, water, everywhere"

9674

sub(/at/, "ith", str)

9675

@end example

9676

9677

@noindent

9678

sets @code{str} to @w{@code{"wither, water, everywhere"}}, by replacing the

9679

leftmost, longest occurrence of @samp{at} with @samp{ith}.

9680

9681

The @code{sub} function returns the number of substitutions made (either

9682

one or zero).

9683

9684

If the special character @samp{&} appears in @var{replacement}, it

9685

stands for the precise substring that was matched by @var{regexp}. (If

9686

the regexp can match more than one string, then this precise substring

9687

may vary.) For example:

9688

9689

@example

9690

awk '@{ sub(/candidate/, "& and his wife"); print @}'

9691

@end example

9692

9693

@noindent

9694

changes the first occurrence of @samp{candidate} to @samp{candidate

9695

and his wife} on each input line.

9696

9697

Here is another example:

9698

9699

@example

9700

awk 'BEGIN @{

9701

str = "daabaaa"

9702

sub(/a+/, "c&c", str)

9703

print str

9704

@}'

9705

@print{} dcaacbaaa

9706

@end example

9707

9708

@noindent

9709

This shows how @samp{&} can represent a non-constant string, and also

9710

illustrates the ``leftmost, longest'' rule in regexp matching

9711

(@pxref{Leftmost Longest, ,How Much Text Matches?}).

9712

9713

The effect of this special character (@samp{&}) can be turned off by putting a

9714

backslash before it in the string. As usual, to insert one backslash in

9715

the string, you must write two backslashes. Therefore, write @samp{\\&}

9716

in a string constant to include a literal @samp{&} in the replacement.

9717

For example, here is how to replace the first @samp{|} on each line with

9718

an @samp{&}:

9719

9720

@example

9721

awk '@{ sub(/\|/, "\\&"); print @}'

9722

@end example

9723

9724

@strong{Note:} As mentioned above, the third argument to @code{sub} must

9725

be a variable, field or array reference.

9726

Some versions of @code{awk} allow the third argument to

9727

be an expression which is not an lvalue. In such a case, @code{sub}

9728

would still search for the pattern and return zero or one, but the result of

9729

the substitution (if any) would be thrown away because there is no place

9730

to put it. Such versions of @code{awk} accept expressions like

9731

this:

9732

9733

@example

9734

sub(/USA/, "United States", "the USA and Canada")

9735

@end example

9736

9737

@noindent

9738

This is considered erroneous in @code{gawk}.

9739

9740

@item gsub(@var{regexp}, @var{replacement} @r{[}, @var{target}@r{]})

9741

@findex gsub

9742

This is similar to the @code{sub} function, except @code{gsub} replaces

9743

@emph{all} of the longest, leftmost, @emph{non-overlapping} matching

9744

substrings it can find. The @samp{g} in @code{gsub} stands for

9745

``global,'' which means replace everywhere. For example:

9746

9747

@example

9748

awk '@{ gsub(/Britain/, "United Kingdom"); print @}'

9749

@end example

9750

9751

@noindent

9752

replaces all occurrences of the string @samp{Britain} with @samp{United

9753

Kingdom} for all input records.

9754

9755

The @code{gsub} function returns the number of substitutions made. If

9756

the variable to be searched and altered, @var{target}, is

9757

omitted, then the entire input record, @code{$0}, is used.

9758

9759

As in @code{sub}, the characters @samp{&} and @samp{\} are special,

9760

and the third argument must be an lvalue.
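
As a sketch of the return value, the string used earlier to illustrate
@code{sub} can be passed to @code{gsub} instead:

@example
BEGIN @{
    str = "water, water, everywhere"
    n = gsub(/at/, "ith", str)
    print n
    print str
@}
@print{} 2
@print{} wither, wither, everywhere
@end example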

9761

@end table

9762

9763

@table @code

9764

@item gensub(@var{regexp}, @var{replacement}, @var{how} @r{[}, @var{target}@r{]})

9765

@findex gensub

9766

@code{gensub} is a general substitution function. Like @code{sub} and

9767

@code{gsub}, it searches the target string @var{target} for matches of

9768

the regular expression @var{regexp}. Unlike @code{sub} and

9769

@code{gsub}, the modified string is returned as the result of the

9770

function, and the original target string is @emph{not} changed. If

9771

@var{how} is a string beginning with @samp{g} or @samp{G}, then it

9772

replaces all matches of @var{regexp} with @var{replacement}.

9773

Otherwise, @var{how} is a number indicating which match of @var{regexp}

9774

to replace. If no @var{target} is supplied, @code{$0} is used instead.

9775

9776

@code{gensub} provides an additional feature that is not available

9777

in @code{sub} or @code{gsub}: the ability to specify components of

9778

a regexp in the replacement text. This is done by using parentheses

9779

in the regexp to mark the components, and then specifying @samp{\@var{n}}

9780

in the replacement text, where @var{n} is a digit from one to nine.

9781

For example:

9782

9783

@example
@group
$ gawk '
> BEGIN @{
>      a = "abc def"
>      b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
>      print b
> @}'
@print{} def abc
@end group
@end example

9794

9795

@noindent

9796

As described above for @code{sub}, you must type two backslashes in order

9797

to get one into the string.

9798

9799

In the replacement text, the sequence @samp{\0} represents the entire

9800

matched text, as does the character @samp{&}.

9801

9802

This example shows how you can use the third argument to control

9803

which match of the regexp should be changed.

9804

9805

@example
$ echo a b c a b c |
> gawk '@{ print gensub(/a/, "AA", 2) @}'
@print{} a b c AA b c
@end example

9810

9811

In this case, @code{$0} is used as the default target string.

9812

@code{gensub} returns the new string as its result, which is

9813

passed directly to @code{print} for printing.

9814

9815

If the @var{how} argument is a string that does not begin with @samp{g} or

9816

@samp{G}, or if it is a number that is less than zero, only one

9817

substitution is performed.

9818

9819

@cindex differences between @code{gawk} and @code{awk}

9820

@code{gensub} is a @code{gawk} extension; it is not available

9821

in compatibility mode (@pxref{Options, ,Command Line Options}).

9822

9823

@item substr(@var{string}, @var{start} @r{[}, @var{length}@r{]})

9824

@findex substr

9825

This returns a @var{length}-character-long substring of @var{string},

9826

starting at character number @var{start}. The first character of a

9827

string is character number one. For example,

9828

@code{substr("washington", 5, 3)} returns @code{"ing"}.

9829

9830

If @var{length} is not present, this function returns the whole suffix of

9831

@var{string} that begins at character number @var{start}. For example,

9832

@code{substr("washington", 5)} returns @code{"ington"}. The whole

9833

suffix is also returned

9834

if @var{length} is greater than the number of characters remaining

9835

in the string, counting from character number @var{start}.

9836

9837

@cindex case conversion

9838

@cindex conversion of case

9839

@item tolower(@var{string})

9840

@findex tolower

9841

This returns a copy of @var{string}, with each upper-case character

9842

in the string replaced with its corresponding lower-case character.

9843

Non-alphabetic characters are left unchanged. For example,

9844

@code{tolower("MiXeD cAsE 123")} returns @code{"mixed case 123"}.

9845

9846

@item toupper(@var{string})

9847

@findex toupper

9848

This returns a copy of @var{string}, with each lower-case character

9849

in the string replaced with its corresponding upper-case character.

9850

Non-alphabetic characters are left unchanged. For example,

9851

@code{toupper("MiXeD cAsE 123")} returns @code{"MIXED CASE 123"}.

9852

@end table
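
The @code{substr}, @code{tolower} and @code{toupper} descriptions above
can be tried together in one short program. This is only a sketch; the
shell variable @code{string_demo} is a name chosen for this example.

```shell
string_demo=$(awk 'BEGIN {
    s = "washington"
    print substr(s, 5, 3)                         # three characters from position 5
    print substr(s, 5)                            # no length: the whole suffix
    print substr(s, 5, 100)                       # length too large: the whole suffix
    print toupper(substr(s, 1, 1)) substr(s, 2)   # capitalize the first letter
    print tolower("MiXeD cAsE 123")               # digits and spaces are unchanged
}')
printf '%s\n' "$string_demo"
```

The fourth line shows a common idiom: @code{substr} splits off the first
character so that @code{toupper} can be applied to it alone.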

9853

9854

@c fakenode --- for prepinfo

9855

@subheading More About @samp{\} and @samp{&} with @code{sub}, @code{gsub} and @code{gensub}

9856

9857

@cindex escape processing, @code{sub} et al.

9858

When using @code{sub}, @code{gsub} or @code{gensub}, and trying to get literal

9859

backslashes and ampersands into the replacement text, you need to remember

9860

that there are several levels of @dfn{escape processing} going on.

9861

9862

First, there is the @dfn{lexical} level, which is when @code{awk} reads

9863

your program, and builds an internal copy of your program that can

9864

be executed.

9865

9866

Then there is the run-time level, when @code{awk} actually scans the

9867

replacement string to determine what to generate.

9868

9869

At both levels, @code{awk} looks for a defined set of characters that

9870

can come after a backslash. At the lexical level, it looks for the

9871

escape sequences listed in @ref{Escape Sequences}.

9872

Thus, for every @samp{\} that @code{awk} will process at the run-time

9873

level, you type two @samp{\}s at the lexical level.

9874

When a character that is not valid for an escape sequence follows the

9875

@samp{\}, Unix @code{awk} and @code{gawk} both simply remove the initial

9876

@samp{\}, and put the following character into the string. Thus, for

9877

example, @code{"a\qb"} is treated as @code{"aqb"}.

9878

9879

At the run-time level, the various functions handle sequences of

9880

@samp{\} and @samp{&} differently. The situation is (sadly) somewhat complex.

9881

9882

Historically, the @code{sub} and @code{gsub} functions treated the two

9883

character sequence @samp{\&} specially; this sequence was replaced in

9884

the generated text with a single @samp{&}. Any other @samp{\} within

9885

the @var{replacement} string that did not precede an @samp{&} was passed

9886

through unchanged. To illustrate with a table:

9887

9888

@c Thanks to Karl Berry for help with the TeX stuff.

9889

@tex

9890

\vbox{\bigskip

9891

% This table has lots of &'s and \'s, so unspecialize them.

9892

\catcode`\& = \other \catcode`\\ = \other

9893

% But then we need character for escape and tab.

9894

@catcode`! = 4

9895

@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr

9896

You type!@code{sub} sees!@code{sub} generates@cr

9897

@hrulefill!@hrulefill!@hrulefill@cr

9898

@code{\&}! @code{&}!the matched text@cr

9899

@code{\\&}! @code{\&}!a literal @samp{&}@cr

9900

@code{\\\&}! @code{\&}!a literal @samp{&}@cr

9901

@code{\\\\&}! @code{\\&}!a literal @samp{\&}@cr

9902

@code{\\\\\&}! @code{\\&}!a literal @samp{\&}@cr

9903

@code{\\\\\\&}! @code{\\\&}!a literal @samp{\\&}@cr

9904

@code{\\q}! @code{\q}!a literal @samp{\q}@cr

9905

}

9906

@bigskip}

9907

@end tex

9908

@ifinfo

9909

@display

9910

You type @code{sub} sees @code{sub} generates

9911

-------- ---------- ---------------

9912

@code{\&} @code{&} the matched text

9913

@code{\\&} @code{\&} a literal @samp{&}

9914

@code{\\\&} @code{\&} a literal @samp{&}

9915

@code{\\\\&} @code{\\&} a literal @samp{\&}

9916

@code{\\\\\&} @code{\\&} a literal @samp{\&}

9917

@code{\\\\\\&} @code{\\\&} a literal @samp{\\&}

9918

@code{\\q} @code{\q} a literal @samp{\q}

9919

@end display

9920

@end ifinfo

9921

9922

@noindent

9923

This table shows both the lexical level processing, where

9924

an odd number of backslashes becomes an even number at the run time level,

9925

and the run-time processing done by @code{sub}.

9926

(For the sake of simplicity, the rest of the tables below only show the

9927

case of even numbers of @samp{\}s entered at the lexical level.)

9928

9929

The problem with the historical approach is that there is no way to get

9930

a literal @samp{\} followed by the matched text.

9931

9932

@cindex @code{awk} language, POSIX version

9933

@cindex POSIX @code{awk}

9934

The 1992 POSIX standard attempted to fix this problem. The standard

9935

says that @code{sub} and @code{gsub} look for either a @samp{\} or an @samp{&}

9936

after the @samp{\}. If either one follows a @samp{\}, that character is

9937

output literally. The interpretation of @samp{\} and @samp{&} then becomes

9938

like this:

9939

9940

@c thanks to Karl Berry for formatting this table

9941

@tex

9942

\vbox{\bigskip

9943

% This table has lots of &'s and \'s, so unspecialize them.

9944

\catcode`\& = \other \catcode`\\ = \other

9945

% But then we need character for escape and tab.

9946

@catcode`! = 4

9947

@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr

9948

You type!@code{sub} sees!@code{sub} generates@cr

9949

@hrulefill!@hrulefill!@hrulefill@cr

9950

@code{&}! @code{&}!the matched text@cr

9951

@code{\\&}! @code{\&}!a literal @samp{&}@cr

9952

@code{\\\\&}! @code{\\&}!a literal @samp{\}, then the matched text@cr

9953

@code{\\\\\\&}! @code{\\\&}!a literal @samp{\&}@cr

9954

}

9955

@bigskip}

9956

@end tex

9957

@ifinfo

9958

@display

9959

You type @code{sub} sees @code{sub} generates

9960

-------- ---------- ---------------

9961

@code{&} @code{&} the matched text

9962

@code{\\&} @code{\&} a literal @samp{&}

9963

@code{\\\\&} @code{\\&} a literal @samp{\}, then the matched text

9964

@code{\\\\\\&} @code{\\\&} a literal @samp{\&}

9965

@end display

9966

@end ifinfo

9967

9968

@noindent

9969

This would appear to solve the problem.

9970

Unfortunately, the phrasing of the standard is unusual. It

9971

says, in effect, that @samp{\} turns off the special meaning of any

9972

following character, but that for anything other than @samp{\} and @samp{&},

9973

such special meaning is undefined. This wording leads to two problems.

9974

9975

@enumerate

9976

@item

9977

Backslashes must now be doubled in the @var{replacement} string, breaking

9978

historical @code{awk} programs.

9979

9980

@item

9981

To make sure that an @code{awk} program is portable, @emph{every} character

9982

in the @var{replacement} string must be preceded with a

9983

backslash.@footnote{This consequence was certainly unintended.}

9984

@c I can say that, 'cause I was involved in making this change

9985

@end enumerate

9986

9987

The POSIX standard is under revision.@footnote{As of December 1995,

9988

with final approval and publication hopefully sometime in 1996.}

9989

Because of the above problems, proposed text for the revised standard

9990

reverts to rules that correspond more closely to the original existing

9991

practice. The proposed rules have special cases that make it possible

9992

to produce a @samp{\} preceding the matched text.

9993

9994

@tex

9995

\vbox{\bigskip

9996

% This table has lots of &'s and \'s, so unspecialize them.

9997

\catcode`\& = \other \catcode`\\ = \other

9998

% But then we need character for escape and tab.

9999

@catcode`! = 4

10000

@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr

10001

You type!@code{sub} sees!@code{sub} generates@cr

10002

@hrulefill!@hrulefill!@hrulefill@cr

10003

@code{\\\\\\&}! @code{\\\&}!a literal @samp{\&}@cr

10004

@code{\\\\&}! @code{\\&}!a literal @samp{\}, followed by the matched text@cr

10005

@code{\\&}! @code{\&}!a literal @samp{&}@cr

10006

@code{\\q}! @code{\q}!a literal @samp{\q}@cr

10007

}

10008

@bigskip}

10009

@end tex

10010

@ifinfo

10011

@display

10012

You type @code{sub} sees @code{sub} generates

10013

-------- ---------- ---------------

10014

@code{\\\\\\&} @code{\\\&} a literal @samp{\&}

10015

@code{\\\\&} @code{\\&} a literal @samp{\}, followed by the matched text

10016

@code{\\&} @code{\&} a literal @samp{&}

10017

@code{\\q} @code{\q} a literal @samp{\q}

10018

@end display

10019

@end ifinfo

10020

10021

In a nutshell, at the run-time level, there are now three special sequences

10022

of characters, @samp{\\\&}, @samp{\\&} and @samp{\&}, whereas historically,

10023

there was only one. However, as in the historical case, any @samp{\} that

10024

is not part of one of these three sequences is not special, and appears

10025

in the output literally.
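
The two simplest rows of the tables above behave the same way under the
historical, 1992 POSIX and proposed rules, so they make a safe sketch of
the two levels of escape processing. The input strings and the shell
variable names below are invented for this example. The
@samp{\\\\&} case is deliberately left out, since that is exactly where
the rule sets disagree.

```shell
# A bare & in the replacement stands for the matched text.
amp_matched=$(echo "fish & chips" | awk '{ gsub(/chips/, "[&]"); print }')
printf '%s\n' "$amp_matched"

# You type \\& ; after lexical processing gsub sees \& ; it generates a literal &.
amp_literal=$(echo "salt and vinegar" | awk '{ gsub(/and/, "\\&"); print }')
printf '%s\n' "$amp_literal"
```

Because the @code{awk} program is inside single quotes, the shell passes
both backslashes through untouched; the halving from two backslashes to
one happens inside @code{awk} at the lexical level.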

10026

10027

@code{gawk} 3.0 follows these proposed POSIX rules for @code{sub} and

10028

@code{gsub}.

10029

@c As much as we think it's a lousy idea. You win some, you lose some. Sigh.

10030

Whether these proposed rules will actually become codified into the

10031

standard is unknown at this point. Subsequent @code{gawk} releases will

10032

track the standard and implement whatever the final version specifies;

10033

this @value{DOCUMENT} will be updated as well.

10034

10035

The rules for @code{gensub} are considerably simpler. At the run-time

10036

level, whenever @code{gawk} sees a @samp{\}, if the following character

10037

is a digit, then the text that matched the corresponding parenthesized

10038

subexpression is placed in the generated output. Otherwise,

10039

no matter what the character after the @samp{\} is, that character will

10040

appear in the generated text, and the @samp{\} will not.

10041

10042

@tex

10043

\vbox{\bigskip

10044

% This table has lots of &'s and \'s, so unspecialize them.

10045

\catcode`\& = \other \catcode`\\ = \other

10046

% But then we need character for escape and tab.

10047

@catcode`! = 4

10048

@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr

10049

You type!@code{gensub} sees!@code{gensub} generates@cr

10050

@hrulefill!@hrulefill!@hrulefill@cr

10051

@code{&}! @code{&}!the matched text@cr

10052

@code{\\&}! @code{\&}!a literal @samp{&}@cr

10053

@code{\\\\}! @code{\\}!a literal @samp{\}@cr

10054

@code{\\\\&}! @code{\\&}!a literal @samp{\}, then the matched text@cr

10055

@code{\\\\\\&}! @code{\\\&}!a literal @samp{\&}@cr

10056

@code{\\q}! @code{\q}!a literal @samp{q}@cr

10057

}

10058

@bigskip}

10059

@end tex

10060

@ifinfo

10061

@display

10062

You type @code{gensub} sees @code{gensub} generates

10063

-------- ------------- ------------------

10064

@code{&} @code{&} the matched text

10065

@code{\\&} @code{\&} a literal @samp{&}

10066

@code{\\\\} @code{\\} a literal @samp{\}

10067

@code{\\\\&} @code{\\&} a literal @samp{\}, then the matched text

10068

@code{\\\\\\&} @code{\\\&} a literal @samp{\&}

10069

@code{\\q} @code{\q} a literal @samp{q}

10070

@end display

10071

@end ifinfo
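
The following sketch exercises two of the @code{gensub} rules just
described. It assumes that @code{gawk} is installed under that name; the
shell function names @code{mark_numbers} and @code{ampersandify} are
hypothetical helpers invented for this example.

```shell
# Usage: echo "a1 b22" | mark_numbers      (prints: a<1> b<22>)
mark_numbers() {
    # "\\0" reaches gensub as \0, which stands for the whole match.
    gawk '{ print gensub(/[0-9]+/, "<\\0>", "g") }'
}

# Usage: echo "you and me" | ampersandify  (prints: you & me)
ampersandify() {
    # "\\&" reaches gensub as \&, which yields a literal ampersand.
    gawk '{ print gensub(/and/, "\\&", "g") }'
}
```

In both helpers the result of @code{gensub} is printed directly; the
input record itself is not modified.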

10072

10073

Because of the complexity of the lexical and run-time level processing,
and the special cases for @code{sub} and @code{gsub},
we recommend using @code{gawk} and @code{gensub} when you have
to do substitutions.

10077

10078

@node I/O Functions, Time Functions, String Functions, Built-in

10079

@section Built-in Functions for Input/Output

10080

10081

The following functions are related to Input/Output (I/O).

10082

Optional parameters are enclosed in square brackets (``['' and ``]'').

10083

10084

@table @code

10085

@item close(@var{filename})

10086

@findex close

10087

Close the file @var{filename}, for input or output. The argument may

10088

alternatively be a shell command that was used for redirecting to or

10089

from a pipe; then the pipe is closed.

10090

@xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes},

10091

for more information.

10092

10093

@item fflush(@r{[}@var{filename}@r{]})

10094

@findex fflush

10095

@cindex portability issues

10096

@cindex flushing buffers

10097

@cindex buffers, flushing

10098

@cindex buffering output

10099

@cindex output, buffering

10100

Flush any buffered output associated with @var{filename}, which is either a

10101

file opened for writing, or a shell command for redirecting output to

10102

a pipe.

10103

10104

Many utility programs will @dfn{buffer} their output; they save information

10105

to be written to a disk file or terminal in memory, until there is enough

10106

for it to be worthwhile to send the data to the output device.

10107

This is often more efficient than writing

10108

every little bit of information as soon as it is ready. However, sometimes

10109

it is necessary to force a program to @dfn{flush} its buffers; that is,

10110

write the information to its destination, even if a buffer is not full.

10111

This is the purpose of the @code{fflush} function; @code{gawk} too

10112

buffers its output, and the @code{fflush} function can be used to force

10113

@code{gawk} to flush its buffers.

10114

10115

@code{fflush} is a recent (1994) addition to the Bell Labs research

10116

version of @code{awk}; it is not part of the POSIX standard, and will

10117

not be available if @samp{--posix} has been specified on the command

10118

line (@pxref{Options, ,Command Line Options}).

10119

10120

@code{gawk} extends the @code{fflush} function in two ways. The first

10121

is to allow no argument at all. In this case, the buffer for the

10122

standard output is flushed. The second way is to allow the null string

10123

(@w{@code{""}}) as the argument. In this case, the buffers for

10124

@emph{all} open output files and pipes are flushed.

10125

10126

@code{fflush} returns zero if the buffer was successfully flushed,

10127

and nonzero otherwise.

10128

10129

@item system(@var{command})

10130

@findex system

10131

@cindex interaction, @code{awk} and other programs

10132

The @code{system} function allows the user to execute operating system commands
and then return to the @code{awk} program. It executes the command given
by the string @var{command} and returns, as
its value, the status returned by that command.

10136

10137

For example, if the following fragment of code is put in your @code{awk}

10138

program:

10139

10140

@example
END @{
     system("date | mail -s 'awk run done' root")
@}
@end example

10145

10146

@noindent

10147

the system administrator will be sent mail when the @code{awk} program

10148

finishes processing input and begins its end-of-input processing.

10149

10150

Note that redirecting @code{print} or @code{printf} into a pipe is often

10151

enough to accomplish your task. However, if your @code{awk}

10152

program is interactive, @code{system} is useful for cranking up large

10153

self-contained programs, such as a shell or an editor.

10154

10155

Some operating systems cannot implement the @code{system} function.

10156

@code{system} causes a fatal error if it is not supported.

10157

@end table
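
As a brief sketch of @code{close} applied to an output pipe, the program
below sends two lines to @code{sort} and closes the pipe before printing
anything else. The data and the shell variable name @code{close_demo} are
made up for this example.

```shell
close_demo=$(awk 'BEGIN {
    cmd = "sort"
    print "pear"  | cmd
    print "apple" | cmd
    close(cmd)        # sort sees end of input, writes its result, and exits
    print "done"
}')
printf '%s\n' "$close_demo"
```

Closing the pipe makes @code{awk} wait for @code{sort} to finish, so the
sorted lines appear before @samp{done}.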

10158

10159

@c fakenode --- for prepinfo

10160

@subheading Controlling Output Buffering with @code{system}

10161

@cindex flushing buffers

10162

@cindex buffers, flushing

10163

@cindex buffering output

10164

@cindex output, buffering

10165

10166

The @code{fflush} function provides explicit control over output buffering for

10167

individual files and pipes. However, its use is not portable to many other

10168

@code{awk} implementations. An alternative method to flush output

10169

buffers is by calling @code{system} with a null string as its argument:

10170

10171

@example

10172

system("") # flush output

10173

@end example

10174

10175

@noindent

10176

@code{gawk} treats this use of the @code{system} function as a special

10177

case, and is smart enough not to run a shell (or other command

10178

interpreter) with the empty command. Therefore, with @code{gawk}, this

10179

idiom is not only useful, it is efficient. While this method should work

10180

with other @code{awk} implementations, it will not necessarily avoid

10181

starting an unnecessary shell. (Other implementations may only

10182

flush the buffer associated with the standard output, and not necessarily

10183

all buffered output.)

10184

10185

If you think about what a programmer expects, it makes sense that

10186

@code{system} should flush any pending output. The following program:

10187

10188

@example
BEGIN @{
     print "first print"
     system("echo system echo")
     print "second print"
@}
@end example

10195

10196

@noindent

10197

must print

10198

10199

@example

10200

first print

10201

system echo

10202

second print

10203

@end example

10204

10205

@noindent

10206

and not

10207

10208

@example

10209

system echo

10210

first print

10211

second print

10212

@end example

10213

10214

If @code{awk} did not flush its buffers before calling @code{system}, the

10215

latter (undesirable) output is what you would see.
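
For a program that should not depend on whether a particular @code{awk}
flushes before running a command, the @code{system("")} idiom can be
written out explicitly. This is a sketch only; the output is captured
through a pipe here because that is when buffering is most likely to
matter, and the shell variable name @code{flush_demo} is invented for
the example.

```shell
flush_demo=$(awk 'BEGIN {
    print "first print"
    system("")                     # flush pending output before the child runs
    system("echo system echo")
    print "second print"
}')
printf '%s\n' "$flush_demo"
```

With the explicit flush, the three lines come out in program order.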

10216

10217

@node Time Functions, , I/O Functions, Built-in

10218

@section Functions for Dealing with Time Stamps

10219

10220

@cindex timestamps

10221

@cindex time of day

10222

A common use for @code{awk} programs is the processing of log files

10223

containing time stamp information, indicating when a

10224

particular log record was written. Many programs log their time stamp

10225

in the form returned by the @code{time} system call, which is the

10226

number of seconds since a particular epoch. On POSIX systems,

10227

it is the number of seconds since Midnight, January 1, 1970, UTC.

10228

10229

In order to make it easier to process such log files, and to produce

10230

useful reports, @code{gawk} provides two functions for working with time

10231

stamps. Both of these are @code{gawk} extensions; they are not specified

10232

in the POSIX standard, nor are they in any other known version

10233

of @code{awk}.

10234

10235

Optional parameters are enclosed in square brackets (``['' and ``]'').

10236

10237

@table @code

10238

@item systime()

10239

@findex systime

10240

This function returns the current time as the number of seconds since

10241

the system epoch. On POSIX systems, this is the number of seconds

10242

since Midnight, January 1, 1970, UTC. It may be a different number on

10243

other systems.

10244

10245

@item strftime(@r{[}@var{format} @r{[}, @var{timestamp}@r{]]})

10246

@findex strftime

10247

This function returns a string. It is similar to the function of the

10248

same name in ANSI C. The time specified by @var{timestamp} is used to

10249

produce a string, based on the contents of the @var{format} string.

10250

The @var{timestamp} is in the same format as the value returned by the

10251

@code{systime} function. If no @var{timestamp} argument is supplied,

10252

@code{gawk} will use the current time of day as the time stamp.

10253

If no @var{format} argument is supplied, @code{strftime} uses

10254

@code{@w{"%a %b %d %H:%M:%S %Z %Y"}}. This format string produces

10255

output (almost) equivalent to that of the @code{date} utility.

10256

(Versions of @code{gawk} prior to 3.0 require the @var{format} argument.)

10257

@end table

10258

10259

The @code{systime} function allows you to compare a time stamp from a

10260

log file with the current time of day. In particular, it is easy to

10261

determine how long ago a particular record was logged. It also allows

10262

you to produce log records using the ``seconds since the epoch'' format.

10263

10264

The @code{strftime} function allows you to easily turn a time stamp

10265

into human-readable information. It is similar in nature to the @code{sprintf}

10266

function

10267

(@pxref{String Functions, ,Built-in Functions for String Manipulation}),

10268

in that it copies non-format specification characters verbatim to the

10269

returned string, while substituting date and time values for format

10270

specifications in the @var{format} string.
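
The two uses described above can be sketched as a pair of shell helper
functions. They assume that @code{gawk} is installed under that name;
@code{format_epoch} and @code{seconds_since} are hypothetical names
chosen for this example. The time zone is forced to UTC so that the
result does not depend on the local setting.

```shell
# Usage: format_epoch 0 '%Y-%m-%d %H:%M:%S'   (prints: 1970-01-01 00:00:00)
format_epoch() {
    # $1: seconds since the epoch, $2: a strftime format string
    TZ=UTC0 gawk -v ts="$1" -v fmt="$2" 'BEGIN { print strftime(fmt, ts) }'
}

# Usage: seconds_since 0   (prints how long ago time stamp 0 was, in seconds)
seconds_since() {
    gawk -v ts="$1" 'BEGIN { print systime() - ts }'
}
```

Note that @samp{-v} assignments undergo escape sequence processing, so a
format containing backslashes would need extra care.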

10271

10272

@code{strftime} is guaranteed by the ANSI C standard to support

10273

the following date format specifications:

10274

10275

@table @code

10276

@item %a

10277

The locale's abbreviated weekday name.

10278

10279

@item %A

10280

The locale's full weekday name.

10281

10282

@item %b

10283

The locale's abbreviated month name.

10284

10285

@item %B

10286

The locale's full month name.

10287

10288

@item %c

10289

The locale's ``appropriate'' date and time representation.

10290

10291

@item %d

10292

The day of the month as a decimal number (01--31).

10293

10294

@item %H

10295

The hour (24-hour clock) as a decimal number (00--23).

10296

10297

@item %I

10298

The hour (12-hour clock) as a decimal number (01--12).

10299

10300

@item %j

10301

The day of the year as a decimal number (001--366).

10302

10303

@item %m

10304

The month as a decimal number (01--12).

10305

10306

@item %M

10307

The minute as a decimal number (00--59).

10308

10309

@item %p

10310

The locale's equivalent of the AM/PM designations associated

10311

with a 12-hour clock.

10312

10313

@item %S

10314

The second as a decimal number (00--61).@footnote{Occasionally there are

10315

minutes in a year with one or two leap seconds, which is why the

10316

seconds can go up to 61.}

10317

10318

@item %U

10319

The week number of the year (the first Sunday as the first day of week one)

10320

as a decimal number (00--53).

10321

10322

@item %w

10323

The weekday as a decimal number (0--6). Sunday is day zero.

10324

10325

@item %W

10326

The week number of the year (the first Monday as the first day of week one)

10327

as a decimal number (00--53).

10328

10329

@item %x

10330

The locale's ``appropriate'' date representation.

10331

10332

@item %X

10333

The locale's ``appropriate'' time representation.

10334

10335

@item %y

10336

The year without century as a decimal number (00--99).

10337

10338

@item %Y

10339

The year with century as a decimal number (e.g., 1995).

10340

10341

@item %Z

10342

The time zone name or abbreviation, or no characters if

10343

no time zone is determinable.

10344

10345

@item %%

10346

A literal @samp{%}.

10347

@end table

10348

10349

If a conversion specifier is not one of the above, the behavior is

10350

undefined.@footnote{This is because ANSI C leaves the

10351

behavior of the C version of @code{strftime} undefined, and @code{gawk}

10352

will use the system's version of @code{strftime} if it's there.

10353

Typically, the conversion specifier will either not appear in the

10354

returned string, or it will appear literally.}

10355

10356

@cindex locale, definition of

10357

Informally, a @dfn{locale} is the geographic place in which a program

10358

is meant to run. For example, a common way to abbreviate the date

10359

September 4, 1991 in the United States would be ``9/4/91''.

10360

In many countries in Europe, however, it would be abbreviated ``4.9.91''.

10361

Thus, the @samp{%x} specification in a @code{"US"} locale might produce

10362

@samp{9/4/91}, while in a @code{"EUROPE"} locale, it might produce

10363

@samp{4.9.91}. The ANSI C standard defines a default @code{"C"}

10364

locale, which is an environment that is typical of what most C programmers

10365

are used to.

10366

10367

A public-domain C version of @code{strftime} is supplied with @code{gawk}

10368

for systems that are not yet fully ANSI-compliant. If that version is

10369

used to compile @code{gawk} (@pxref{Installation, ,Installing @code{gawk}}),

10370

then the following additional format specifications are available:

10371

10372

@table @code

10373

@item %D

10374

Equivalent to specifying @samp{%m/%d/%y}.

10375

10376

@item %e

10377

The day of the month, padded with a space if it is only one digit.

10378

10379

@item %h

10380

Equivalent to @samp{%b}, above.

10381

10382

@item %n

10383

A newline character (ASCII LF).

10384

10385

@item %r

10386

Equivalent to specifying @samp{%I:%M:%S %p}.

10387

10388

@item %R

10389

Equivalent to specifying @samp{%H:%M}.

10390

10391

@item %T

10392

Equivalent to specifying @samp{%H:%M:%S}.

10393

10394

@item %t

10395

A tab character.

10396

10397

@item %k

10398

The hour (24-hour clock) as a decimal number (0--23).

10399

Single digit numbers are padded with a space.

10400

10401

@item %l

10402

The hour (12-hour clock) as a decimal number (1--12).

10403

Single digit numbers are padded with a space.

10404

10405

@item %C

10406

The century, as a number between 00 and 99.

10407

10408

@item %u

10409

The weekday as a decimal number

10410

[1 (Monday)--7].

10411

10412

@cindex ISO 8601

10413

@item %V

10414

The week number of the year (the first Monday as the first

10415

day of week one) as a decimal number (01--53).

10416

The method for determining the week number is as specified by ISO 8601

10417

(to wit: if the week containing January 1 has four or more days in the

10418

new year, then it is week one, otherwise it is week 53 of the previous year

10419

and the next week is week one).

10420

10421

@item %G

10422

The year with century of the ISO week number, as a decimal number.

10423

10424

For example, January 1, 1993, is in week 53 of 1992. Thus, the year

10425

of its ISO week number is 1992, even though its year is 1993.

10426

Similarly, December 31, 1973, is in week 1 of 1974. Thus, the year

10427

of its ISO week number is 1974, even though its year is 1973.

10428

10429

@item %g

10430

The year without century of the ISO week number, as a decimal number (00--99).

10431

10432

@item %Ec %EC %Ex %Ey %EY %Od %Oe %OH %OI

10433

@itemx %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy

10434

These are ``alternate representations'' for the specifications

10435

that use only the second letter (@samp{%c}, @samp{%C}, and so on).

10436

They are recognized, but their normal representations are

10437

used.@footnote{If you don't understand any of this, don't worry about

10438

it; these facilities are meant to make it easier to ``internationalize''

10439

programs.}

10440

(These facilitate compliance with the POSIX @code{date} utility.)

10441

10442

@item %v

10443

The date in VMS format (e.g., 20-JUN-1991).

10444

10445

@cindex RFC-822

10446

@cindex RFC-1036

10447

@item %z

10448

The timezone offset in a +HHMM format (e.g., the format necessary to

10449

produce RFC-822/RFC-1036 date headers).

10450

@end table
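
The two ISO 8601 dates used in the @samp{%G} description can be checked
with a short helper. This sketch assumes that @code{gawk} is installed
and that the @code{strftime} it was built with supports @samp{%G} and
@samp{%V}, which is not guaranteed on every system. The function name
@code{iso_week} is hypothetical, and the two time stamps are the
seconds-since-the-epoch values for midnight UTC on the dates shown.

```shell
# Usage: iso_week 725846400   (January 1, 1993, 00:00:00 UTC)
#        iso_week 126144000   (December 31, 1973, 00:00:00 UTC)
iso_week() {
    # Print the ISO year and week number for the time stamp given in $1.
    TZ=UTC0 gawk -v ts="$1" 'BEGIN { print strftime("%G-W%V", ts) }'
}
```

On a system where both specifications are supported, the first call
should report week 53 of 1992 and the second week 01 of 1974, matching
the text above.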

10451

10452

This example is an @code{awk} implementation of the POSIX

10453

@code{date} utility. Normally, the @code{date} utility prints the

10454

current date and time of day in a well known format. However, if you

10455

provide an argument to it that begins with a @samp{+}, @code{date}

10456

will copy non-format specifier characters to the standard output, and

10457

will interpret the current time according to the format specifiers in

10458

the string. For example:

10459

10460

@example

10461

$ date '+Today is %A, %B %d, %Y.'

10462

@print{} Today is Thursday, July 11, 1991.

10463

@end example

10464

10465

Here is the @code{gawk} version of the @code{date} utility.

10466

It has a shell ``wrapper'', to handle the @samp{-u} option,

10467

which requires that @code{date} run as if the time zone

10468

was set to UTC.

10469

10470

@example
@group
#! /bin/sh
#
# date --- approximate the P1003.2 'date' command

case $1 in
-u)  TZ=GMT0     # use UTC
     export TZ
     shift ;;
esac
@end group

@group
gawk 'BEGIN  @{
    format = "%a %b %d %H:%M:%S %Z %Y"
    exitval = 0
@end group

@group
    if (ARGC > 2)
        exitval = 1
    else if (ARGC == 2) @{
        format = ARGV[1]
        if (format ~ /^\+/)
            format = substr(format, 2)   # remove leading +
    @}
    print strftime(format)
    exit exitval
@}' "$@@"
@end group
@end example

10502

10503

@node User-defined, Invoking Gawk, Built-in, Top

10504

@chapter User-defined Functions

10505

10506

@cindex user-defined functions

10507

@cindex functions, user-defined

10508

Complicated @code{awk} programs can often be simplified by defining

10509

your own functions. User-defined functions can be called just like

10510

built-in ones (@pxref{Function Calls}), but it is up to you to define

10511

them---to tell @code{awk} what they should do.

10512

10513

@menu

10514

* Definition Syntax:: How to write definitions and what they mean.

10515

* Function Example:: An example function definition and what it

10516

does.

10517

* Function Caveats:: Things to watch out for.

10518

* Return Statement:: Specifying the value a function returns.

10519

@end menu

@node Definition Syntax, Function Example, User-defined, User-defined
@section Function Definition Syntax
@cindex defining functions
@cindex function definition

Definitions of functions can appear anywhere between the rules of an
@code{awk} program. Thus, the general form of an @code{awk} program is
extended to include sequences of rules @emph{and} user-defined function
definitions.
There is no need in @code{awk} to put the definition of a function
before all uses of the function. This is because @code{awk} reads the
entire program before starting to execute any of it.

The definition of a function named @var{name} looks like this:

@example
function @var{name}(@var{parameter-list})
@{
     @var{body-of-function}
@}
@end example

@cindex names, use of
@cindex namespaces
@noindent
@var{name} is the name of the function to be defined. A valid function
name is like a valid variable name: a sequence of letters, digits and
underscores, not starting with a digit.
Within a single @code{awk} program, any particular name can be used in
only one way: as a variable, as an array, or as a function.

@var{parameter-list} is a list of the function's arguments and local
variable names, separated by commas. When the function is called,
the argument names are used to hold the argument values given in
the call. The local variables are initialized to the empty string.
A function cannot have two parameters with the same name.

The @var{body-of-function} consists of @code{awk} statements. It is the
most important part of the definition, because it says what the function
should actually @emph{do}. The argument names exist to give the body a
way to talk about the arguments; local variables, to give the body
places to keep temporary values.

Argument names are not distinguished syntactically from local variable
names; instead, the number of arguments supplied when the function is
called determines how many argument variables there are. Thus, if three
argument values are given, the first three names in @var{parameter-list}
are arguments, and the rest are local variables.

It follows that if the number of arguments is not the same in all calls
to the function, some of the names in @var{parameter-list} may be
arguments on some occasions and local variables on others. Another
way to think of this is that omitted arguments default to the
null string.

Usually when you write a function you know how many names you intend to
use for arguments and how many you intend to use as local variables. It is
conventional to place some extra space between the arguments and
the local variables, to document how your function is supposed to be used.
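
The rule that omitted arguments start out as the null string can be
checked with a short shell session. This is only a sketch: the function
name @code{describe} and its parameter names are made up for illustration,
and the extra space in the parameter list marks @code{tmp} as a local
variable by convention only.

```shell
# Sketch: tmp is declared as a parameter but never passed by the caller,
# so inside the function it behaves as a local variable that starts empty.
out=$(awk '
function describe(arg,    tmp)
{
    # (tmp == "") evaluates to 1 when tmp is the null string
    return "arg=[" arg "] tmp=[" tmp "] empty=" (tmp == "")
}
BEGIN { print describe("x") }')
echo "$out"
```

Calling @code{describe} with two arguments instead would turn @code{tmp}
into an ordinary argument, which is why the spacing convention is worth
following.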

@cindex variable shadowing
During execution of the function body, the arguments and local variable
values hide or @dfn{shadow} any variables of the same names used in the
rest of the program. The shadowed variables are not accessible in the
function definition, because there is no way to name them while their
names have been taken away for the local variables. All other variables
used in the @code{awk} program can be referenced or set normally in the
function's body.

The arguments and local variables last only as long as the function body
is executing. Once the body finishes, you can once again access the
variables that were shadowed while the function was running.
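
Here is a small sketch of shadowing in action. The function name
@code{show} and the variables @code{x}, @code{y} and @code{r} are
illustrative only; they do not appear elsewhere in this @value{DOCUMENT}.

```shell
# Sketch: the parameter x shadows the global x; y is not a parameter,
# so assignments to y inside the function reach the global variable.
out=$(awk '
function show(x)
{
    x = "inner"           # changes only the parameter
    y = "set by show"     # changes the global y
    return x
}
BEGIN {
    x = "outer"
    y = "initial"
    r = show("ignored")
    print r
    print x
    print y
}')
echo "$out"
```

The global @code{x} still holds its old value after the call, while the
global @code{y} has been changed by the function body.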

@cindex recursive function
@cindex function, recursive
The function body can contain expressions which call functions. They
can even call this function, either directly or by way of another
function. When this happens, we say the function is @dfn{recursive}.

@cindex @code{awk} language, POSIX version
@cindex POSIX @code{awk}
In many @code{awk} implementations, including @code{gawk},
the keyword @code{function} may be
abbreviated @code{func}. However, POSIX only specifies the use of
the keyword @code{function}. This actually has some practical implications.
If @code{gawk} is in POSIX-compatibility mode
(@pxref{Options, ,Command Line Options}), then the following
statement will @emph{not} define a function:

@example
func foo() @{ a = sqrt($1) ; print a @}
@end example

@noindent
Instead it defines a rule that, for each record, concatenates the value
of the variable @samp{func} with the return value of the function @samp{foo}.
If the resulting string is non-null, the action is executed.
This is probably not what was desired. (@code{awk} accepts this input as
syntactically valid, since functions may be used before they are defined
in @code{awk} programs.)

@cindex portability issues
To ensure that your @code{awk} programs are portable, always use the
keyword @code{function} when defining a function.

@node Function Example, Function Caveats, Definition Syntax, User-defined
@section Function Definition Examples

Here is an example of a user-defined function, called @code{myprint}, that
takes a number and prints it in a specific format.

@example
function myprint(num)
@{
     printf "%6.3g\n", num
@}
@end example

@noindent
To illustrate, here is an @code{awk} rule which uses our @code{myprint}
function:

@example
$3 > 0 @{ myprint($3) @}
@end example

@noindent
This program prints, in our special format, all the third fields that
contain a positive number in our input. Therefore, when given:

@example
1.2 3.4 5.6 7.8
9.10 11.12 -13.14 15.16
17.18 19.20 21.22 23.24
@end example

@noindent
this program, using our function to format the results, prints:

@example
   5.6
  21.2
@end example

This function deletes all the elements in an array.

@example
function delarray(a,    i)
@{
    for (i in a)
       delete a[i]
@}
@end example

When working with arrays, it is often necessary to delete all the elements
in an array and start over with a new list of elements
(@pxref{Delete, ,The @code{delete} Statement}).
Instead of having
to repeat this loop everywhere in your program that you need to clear out
an array, your program can just call @code{delarray}.
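
As a quick check, the following sketch calls @code{delarray} on a small
array and counts the elements before and after. The helper function
@code{count} and the array name @code{items} are assumptions introduced
only for this illustration.

```shell
# Sketch: count() is a hypothetical helper that counts array elements.
out=$(awk '
function delarray(a,    i)
{
    for (i in a)
        delete a[i]
}
function count(a,    i, n)
{
    n = 0
    for (i in a)
        n++
    return n
}
BEGIN {
    split("x y z", items, " ")    # items[1..3] = "x", "y", "z"
    before = count(items)
    delarray(items)
    print before, count(items)
}')
echo "$out"
```

Because arrays are passed by reference, the deletions made inside
@code{delarray} are visible to the caller.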

Here is an example of a recursive function. It takes a string
as an input parameter, and returns the string in backwards order.

@example
function rev(str, start)
@{
    if (start == 0)
        return ""

    return (substr(str, start, 1) rev(str, start - 1))
@}
@end example

If this function is in a file named @file{rev.awk}, we can test it
this way:

@example
$ echo "Don't Panic!" |
> gawk --source '@{ print rev($0, length($0)) @}' -f rev.awk
@print{} !cinaP t'noD
@end example

Here is an example that uses the built-in function @code{strftime}.
(@xref{Time Functions, ,Functions for Dealing with Time Stamps},
for more information on @code{strftime}.)
The C @code{ctime} function takes a timestamp and returns it in a string,
formatted in a well known fashion. Here is an @code{awk} version:

@example
@c file eg/lib/ctime.awk
@group
# ctime.awk
#
# awk version of C ctime(3) function

function ctime(ts,    format)
@{
    format = "%a %b %d %H:%M:%S %Z %Y"
    if (ts == 0)
        ts = systime()       # use current time as default
    return strftime(format, ts)
@}
@c endfile
@end group
@end example

@node Function Caveats, Return Statement, Function Example, User-defined
@section Calling User-defined Functions

@cindex call by value
@cindex call by reference
@cindex calling a function
@cindex function call
@dfn{Calling a function} means causing the function to run and do its job.
A function call is an expression, and its value is the value returned by
the function.

A function call consists of the function name followed by the arguments
in parentheses. What you write in the call for the arguments are
@code{awk} expressions; each time the call is executed, these
expressions are evaluated, and the values are the actual arguments. For
example, here is a call to @code{foo} with three arguments (the first
being a string concatenation):

@example
foo(x y, "lose", 4 * z)
@end example

@strong{Caution:} whitespace characters (spaces and tabs) are not allowed
between the function name and the open-parenthesis of the argument list.
If you write whitespace by mistake, @code{awk} might think that you mean
to concatenate a variable with an expression in parentheses. However, it
notices that you used a function name and not a variable name, and reports
an error.

@cindex call by value
When a function is called, it is given a @emph{copy} of the values of
its arguments. This is known as @dfn{call by value}. The caller may use
a variable as the expression for the argument, but the called function
does not know this: it only knows what value the argument had. For
example, if you write this code:

@example
foo = "bar"
z = myfunc(foo)
@end example

@noindent
then you should not think of the argument to @code{myfunc} as being
``the variable @code{foo}.'' Instead, think of the argument as the
string value, @code{"bar"}.

If the function @code{myfunc} alters the values of its local variables,
this has no effect on any other variables. Thus, if @code{myfunc}
does this:

@example
@group
function myfunc(str)
@{
  print str
  str = "zzz"
  print str
@}
@end group
@end example

@noindent
to change its first argument variable @code{str}, this @emph{does not}
change the value of @code{foo} in the caller. The role of @code{foo} in
calling @code{myfunc} ended when its value, @code{"bar"}, was computed.
If @code{str} also exists outside of @code{myfunc}, the function body
cannot alter this outer value, because it is shadowed during the
execution of @code{myfunc} and cannot be seen or changed from there.
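
The pieces above can be put together into one small program to confirm
the call-by-value behavior. This is a sketch only; the version of
@code{myfunc} used here returns its modified parameter instead of printing
it, which is a change made purely for the illustration.

```shell
# Sketch: myfunc changes its own copy of the argument; foo is unaffected.
out=$(awk '
function myfunc(str)
{
    str = "zzz"
    return str
}
BEGIN {
    foo = "bar"
    z = myfunc(foo)
    print foo, z
}')
echo "$out"
```

The first value printed is still the original value of @code{foo}; only
the return value carries the change back to the caller.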

@cindex call by reference
However, when arrays are the parameters to functions, they are @emph{not}
copied. Instead, the array itself is made available for direct manipulation
by the function. This is usually called @dfn{call by reference}.
Changes made to an array parameter inside the body of a function @emph{are}
visible outside that function.
@ifinfo
This can be @strong{very} dangerous if you do not watch what you are
doing. For example:
@end ifinfo
@iftex
@emph{This can be very dangerous if you do not watch what you are
doing.} For example:
@end iftex

@example
function changeit(array, ind, nvalue)
@{
     array[ind] = nvalue
@}

BEGIN @{
    a[1] = 1; a[2] = 2; a[3] = 3
    changeit(a, 2, "two")
    printf "a[1] = %s, a[2] = %s, a[3] = %s\n",
            a[1], a[2], a[3]
@}
@end example

@noindent
This program prints @samp{a[1] = 1, a[2] = two, a[3] = 3}, because
@code{changeit} stores @code{"two"} in the second element of @code{a}.

@cindex undefined functions
@cindex functions, undefined
Some @code{awk} implementations allow you to call a function that
has not been defined, and only report a problem at run-time when the
program actually tries to call the function. For example:

@example
@group
BEGIN @{
    if (0)
        foo()
    else
        bar()
@}
function bar() @{ @dots{} @}
# note that `foo' is not defined
@end group
@end example

@noindent
Since the condition of the @samp{if} statement is never true, it is not
really a problem that @code{foo} has not been defined. Usually though, it
is a problem if a program calls an undefined function.

@ignore
At one point, I had gawk dying on this, but later decided that this might
break old programs and/or test suites.
@end ignore

If @samp{--lint} has been specified
(@pxref{Options, ,Command Line Options}),
@code{gawk} will report calls to undefined functions.

@node Return Statement, , Function Caveats, User-defined
@section The @code{return} Statement
@cindex @code{return} statement

The body of a user-defined function can contain a @code{return} statement.
This statement returns control to the part of the @code{awk} program that
called the function. It can also be used to return a value for use in the
rest of the @code{awk} program. It looks like this:

@example
return @r{[}@var{expression}@r{]}
@end example

The @var{expression} part is optional. If it is omitted, then the returned
value is undefined and, therefore, unpredictable.

A @code{return} statement with no value expression is assumed at the end of
every function definition. So if control reaches the end of the function
body, then the function returns an unpredictable value. @code{awk}
will @emph{not} warn you if you use the return value of such a function.

Sometimes, you want to write a function for what it does, not for
what it returns. Such a function corresponds to a @code{void} function
in C or to a @code{procedure} in Pascal. Thus, it may be appropriate to not
return any value; you should simply bear in mind that if you use the return
value of such a function, you do so at your own risk.
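
Here is a sketch of such a ``procedure-style'' function. The name
@code{report} and its parameters are hypothetical. The function is called
only for its side effect, and its bare @code{return} is used just to leave
the function early; the caller never looks at a return value.

```shell
# Sketch: report() prints a labelled value and returns nothing useful.
out=$(awk '
function report(label, value)
{
    if (label == "")
        return              # leave early; no value is returned
    printf "%s: %s\n", label, value
}
BEGIN {
    report("", 1)           # prints nothing
    report("total", 42)
}')
echo "$out"
```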

Here is an example of a user-defined function that returns a value
for the largest number among the elements of an array:

@example
@group
function maxelt(vec,   i, ret)
@{
     for (i in vec) @{
          if (ret == "" || vec[i] > ret)
               ret = vec[i]
     @}
     return ret
@}
@end group
@end example

@noindent
You call @code{maxelt} with one argument, which is an array name. The local
variables @code{i} and @code{ret} are not intended to be arguments;
while there is nothing to stop you from passing two or three arguments
to @code{maxelt}, the results would be strange. The extra space before
@code{i} in the function parameter list indicates that @code{i} and
@code{ret} are not supposed to be arguments. This is a convention that
you should follow when you define functions.

Here is a program that uses our @code{maxelt} function. It loads an
array, calls @code{maxelt}, and then reports the maximum number in that
array:

@example
@group
awk '
function maxelt(vec,   i, ret)
@{
     for (i in vec) @{
          if (ret == "" || vec[i] > ret)
               ret = vec[i]
     @}
     return ret
@}
@end group

@group
# Load all fields of each record into nums.
@{
     for(i = 1; i <= NF; i++)
          nums[NR, i] = $i
@}

END @{
     print maxelt(nums)
@}'
@end group
@end example

Given the following input:

@example
@group
1 5 23 8 16
44 3 5 2 8 26
256 291 1396 2962 100
-6 467 998 1101
99385 11 0 225
@end group
@end example

@noindent
our program tells us (predictably) that @code{99385} is the largest number
in our array.

@node Invoking Gawk, Library Functions, User-defined, Top
@chapter Running @code{awk}
@cindex command line
@cindex invocation of @code{gawk}
@cindex arguments, command line
@cindex options, command line
@cindex long options
@cindex options, long

There are two ways to run @code{awk}: with an explicit program, or with
one or more program files. Here are templates for both of them; items
enclosed in @samp{@r{[}@dots{}@r{]}} in these templates are optional.

Besides traditional one-letter POSIX-style options, @code{gawk} also
supports GNU long options.

@example
awk @r{[@var{options}]} -f progfile @r{[@code{--}]} @var{file} @dots{}
awk @r{[@var{options}]} @r{[@code{--}]} '@var{program}' @var{file} @dots{}
@end example

@cindex empty program
@cindex dark corner
It is possible to invoke @code{awk} with an empty program:

@example
$ awk '' datafile1 datafile2
@end example

@noindent
Doing so makes little sense though; @code{awk} will simply exit
silently when given an empty program (d.c.). If @samp{--lint} has
been specified on the command line, @code{gawk} will issue a
warning that the program is empty.

@menu
* Options:: Command line options and their meanings.
* Other Arguments:: Input file names and variable assignments.
* AWKPATH Variable:: Searching directories for @code{awk} programs.
* Obsolete:: Obsolete Options and/or features.
* Undocumented:: Undocumented Options and Features.
* Known Bugs:: Known Bugs in @code{gawk}.
@end menu

@node Options, Other Arguments, Invoking Gawk, Invoking Gawk
@section Command Line Options

Options begin with a dash, and consist of a single character.
GNU style long options consist of two dashes and a keyword.
The keyword can be abbreviated, as long as the abbreviation allows the option
to be uniquely identified. If the option takes an argument, then the
keyword is either immediately followed by an equals sign (@samp{=}) and the
argument's value, or the keyword and the argument's value are separated
by whitespace. For brevity, the discussion below only refers to the
traditional short options; however the long and short options are
interchangeable in all contexts.

Each long option for @code{gawk} has a corresponding
POSIX-style option. The options and their meanings are as follows:

@table @code
@item -F @var{fs}
@itemx --field-separator @var{fs}
@cindex @code{-F} option
@cindex @code{--field-separator} option
Sets the @code{FS} variable to @var{fs}
(@pxref{Field Separators, ,Specifying How Fields are Separated}).

@item -f @var{source-file}
@itemx --file @var{source-file}
@cindex @code{-f} option
@cindex @code{--file} option
Indicates that the @code{awk} program is to be found in @var{source-file}
instead of in the first non-option argument.

@item -v @var{var}=@var{val}
@itemx --assign @var{var}=@var{val}
@cindex @code{-v} option
@cindex @code{--assign} option
Sets the variable @var{var} to the value @var{val} @strong{before}
execution of the program begins. Such variable values are available
inside the @code{BEGIN} rule
(@pxref{Other Arguments, ,Other Command Line Arguments}).

The @samp{-v} option can only set one variable, but you can use
it more than once, setting another variable each time, like this:
@samp{awk @w{-v foo=1} @w{-v bar=2} @dots{}}.

@item -mf=@var{NNN}
@itemx -mr=@var{NNN}
Set various memory limits to the value @var{NNN}. The @samp{f} flag sets
the maximum number of fields, and the @samp{r} flag sets the maximum
record size. These two flags and the @samp{-m} option are from the
Bell Labs research version of Unix @code{awk}. They are provided
for compatibility, but otherwise ignored by
@code{gawk}, since @code{gawk} has no predefined limits.

@item -W @var{gawk-opt}
@cindex @code{-W} option
Following the POSIX standard, options that are implementation
specific are supplied as arguments to the @samp{-W} option. With @code{gawk},
these arguments may be separated by commas, or quoted and separated by
whitespace. Case is ignored when processing these options. These options
also have corresponding GNU style long options.
See below.

@item --
Signals the end of the command line options. The following arguments
are not treated as options even if they begin with @samp{-}. This
interpretation of @samp{--} follows the POSIX argument parsing
conventions.

This is useful if you have file names that start with @samp{-},
or in shell scripts, if you have file names that will be specified
by the user which could start with @samp{-}.
@end table
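
The following shell session sketches the two most commonly used options
from the table above, @samp{-v} and @samp{-F}. The variable names
@code{greeting} and @code{count} and the sample input are made up for
illustration; only the short, POSIX-style option forms are used, so the
commands should behave the same with other @code{awk} implementations.

```shell
# -v assigns variables before the program starts, so BEGIN can see them.
out_v=$(awk -v greeting=hello -v count=2 'BEGIN { print greeting, count + 1 }')

# -F sets the field separator FS; here fields are separated by colons.
out_F=$(printf 'a:b:c\n' | awk -F: '{ print $2 }')

echo "$out_v"
echo "$out_F"
```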

The following @code{gawk}-specific options are available:

@table @code
@item -W traditional
@itemx -W compat
@itemx --traditional
@itemx --compat
@cindex @code{--compat} option
@cindex @code{--traditional} option
@cindex compatibility mode
Specifies @dfn{compatibility mode}, in which the GNU extensions to
the @code{awk} language are disabled, so that @code{gawk} behaves just
like the Bell Labs research version of Unix @code{awk}.
@samp{--traditional} is the preferred form of this option.
@xref{POSIX/GNU, ,Extensions in @code{gawk} Not in POSIX @code{awk}},
which summarizes the extensions. Also see
@ref{Compatibility Mode, ,Downward Compatibility and Debugging}.

@item -W copyleft
@itemx -W copyright
@itemx --copyleft
@itemx --copyright
@cindex @code{--copyleft} option
@cindex @code{--copyright} option
Print the short version of the General Public License.
This option may disappear in a future version of @code{gawk}.

@item -W help
@itemx -W usage
@itemx --help
@itemx --usage
@cindex @code{--help} option
@cindex @code{--usage} option
Print a ``usage'' message summarizing the short and long style options
that @code{gawk} accepts, and then exit.

@item -W lint
@itemx --lint
@cindex @code{--lint} option
Warn about constructs that are dubious or non-portable to
other @code{awk} implementations.
Some warnings are issued when @code{gawk} first reads your program. Others
are issued at run-time, as your program executes.

@item -W lint-old
@itemx --lint-old
@cindex @code{--lint-old} option
Warn about constructs that are not available in
the original Version 7 Unix version of @code{awk}
(@pxref{V7/SVR3.1, , Major Changes between V7 and SVR3.1}).

@item -W posix
@itemx --posix
@cindex @code{--posix} option
@cindex POSIX mode
Operate in strict POSIX mode. This disables all @code{gawk}
extensions (just like @samp{--traditional}), and adds the following additional
restrictions:

@c IMPORTANT! Keep this list in sync with the one in node POSIX

@itemize @bullet
@item
@code{\x} escape sequences are not recognized
(@pxref{Escape Sequences}).

@item
The synonym @code{func} for the keyword @code{function} is not
recognized (@pxref{Definition Syntax, ,Function Definition Syntax}).

@item
The operators @samp{**} and @samp{**=} cannot be used in
place of @samp{^} and @samp{^=} (@pxref{Arithmetic Ops, ,Arithmetic Operators},
and also @pxref{Assignment Ops, ,Assignment Expressions}).

@item
Specifying @samp{-Ft} on the command line does not set the value
of @code{FS} to be a single tab character
(@pxref{Field Separators, ,Specifying How Fields are Separated}).

@item
The @code{fflush} built-in function is not supported
(@pxref{I/O Functions, , Built-in Functions for Input/Output}).
@end itemize

If you supply both @samp{--traditional} and @samp{--posix} on the
command line, @samp{--posix} will take precedence. @code{gawk}
will also issue a warning if both options are supplied.

@item -W re-interval
@itemx --re-interval
Allow interval expressions
(@pxref{Regexp Operators, , Regular Expression Operators}),
in regexps.
Because interval expressions were traditionally not available in @code{awk},
@code{gawk} does not provide them by default. This prevents old @code{awk}
programs from breaking.

@item -W source @var{program-text}
@itemx --source @var{program-text}
@cindex @code{--source} option
Program source code is taken from the @var{program-text}. This option
allows you to mix source code in files with source
code that you enter on the command line. This is particularly useful
when you have library functions that you wish to use from your command line
programs (@pxref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}).

@item -W version
@itemx --version
@cindex @code{--version} option
Prints version information for this particular copy of @code{gawk}.
This allows you to determine if your copy of @code{gawk} is up to date
with respect to whatever the Free Software Foundation is currently
distributing.
It is also useful for bug reports
(@pxref{Bugs, , Reporting Problems and Bugs}).
@end table

Any other options are flagged as invalid with a warning message, but
are otherwise ignored.

In compatibility mode, as a special case, if the value of @var{fs} supplied
to the @samp{-F} option is @samp{t}, then @code{FS} is set to the tab
character (@code{"\t"}). This is only true for @samp{--traditional}, and not
for @samp{--posix}
(@pxref{Field Separators, ,Specifying How Fields are Separated}).

The @samp{-f} option may be used more than once on the command line.
If it is, @code{awk} reads its program source from all of the named files, as
if they had been concatenated together into one big file. This is
useful for creating libraries of @code{awk} functions. Useful functions
can be written once, and then retrieved from a standard place, instead
of having to be included into each individual program.
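
As a sketch of this technique, the session below keeps a function in one
file and the rules that use it in another. The file names @file{lib.awk}
and @file{main.awk}, the temporary directory, and the function name
@code{double} are all assumptions made for the illustration.

```shell
# Sketch: a hypothetical library file plus a main program file,
# combined by giving -f twice.
tmpdir=$(mktemp -d)

cat > "$tmpdir/lib.awk" <<'EOF'
# lib.awk --- hypothetical library file
function double(n) { return 2 * n }
EOF

cat > "$tmpdir/main.awk" <<'EOF'
# main.awk --- uses the function defined in lib.awk
{ print double($1) }
EOF

out=$(echo 21 | awk -f "$tmpdir/lib.awk" -f "$tmpdir/main.awk")
echo "$out"

rm -rf "$tmpdir"
```

The order of the two @samp{-f} options does not matter here, since
@code{awk} reads the entire program before executing any of it.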

You can type in a program at the terminal and still use library functions,
by specifying @samp{-f /dev/tty}. @code{awk} will read a file from the terminal
to use as part of the @code{awk} program. After typing your program,
type @kbd{Control-d} (the end-of-file character) to terminate it.
(You may also use @samp{-f -} to read program source from the standard
input, but then you will not be able to also use the standard input as a
source of data.)

Because it is clumsy using the standard @code{awk} mechanisms to mix source
file and command line @code{awk} programs, @code{gawk} provides the
@samp{--source} option. This does not require you to pre-empt the standard
input for your source code, and allows you to easily mix command line
and library source code
(@pxref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}).

If no @samp{-f} or @samp{--source} option is specified, then @code{gawk}
will use the first non-option command line argument as the text of the
program source code.

@cindex @code{POSIXLY_CORRECT} environment variable
@cindex environment variable, @code{POSIXLY_CORRECT}
If the environment variable @code{POSIXLY_CORRECT} exists,
then @code{gawk} will behave in strict POSIX mode, exactly as if
you had supplied the @samp{--posix} command line option.
Many GNU programs look for this environment variable to turn on
strict POSIX mode. If you supply @samp{--lint} on the command line,
and @code{gawk} turns on POSIX mode because of @code{POSIXLY_CORRECT},
then it will print a warning message indicating that POSIX
mode is in effect.

You would typically set this variable in your shell's startup file.
For a Bourne compatible shell (such as Bash), you would add these
lines to the @file{.profile} file in your home directory.

@example
@group
POSIXLY_CORRECT=true
export POSIXLY_CORRECT
@end group
@end example

For a @code{csh} compatible shell,@footnote{Not recommended.}
you would add this line to the @file{.login} file in your home directory.

@example
setenv POSIXLY_CORRECT true
@end example

@node Other Arguments, AWKPATH Variable, Options, Invoking Gawk
@section Other Command Line Arguments

Any additional arguments on the command line are normally treated as
input files to be processed in the order specified. However, an
argument that has the form @code{@var{var}=@var{value}}, assigns
the value @var{value} to the variable @var{var}---it does not specify a
file at all.

@vindex ARGIND
@vindex ARGV
All these arguments are made available to your @code{awk} program in the
@code{ARGV} array (@pxref{Built-in Variables}). Command line options
and the program text (if present) are omitted from @code{ARGV}.
All other arguments, including variable assignments, are
included. As each element of @code{ARGV} is processed, @code{gawk}
sets the variable @code{ARGIND} to the index in @code{ARGV} of the
current element.
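
A short sketch makes the contents of @code{ARGV} visible. The arguments
@samp{x=1} and @file{notes.txt} are made up; because the program consists
only of a @code{BEGIN} rule, no input is read and the file does not need
to exist.

```shell
# Sketch: list the command line arguments as awk sees them.
# The option-free program text itself does not appear in ARGV.
out=$(awk 'BEGIN { for (i = 1; i < ARGC; i++) print i, ARGV[i] }' x=1 notes.txt)
echo "$out"
```

Note that the variable assignment shows up in @code{ARGV} alongside the
file name, exactly as described above.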

11278

11279

The distinction between file name arguments and variable-assignment

11280

arguments is made when @code{awk} is about to open the next input file.

11281

At that point in execution, it checks the ``file name'' to see whether

11282

it is really a variable assignment; if so, @code{awk} sets the variable

11283

instead of reading a file.

11284

11285

Therefore, the variables actually receive the given values after all

11286

previously specified files have been read. In particular, the values of

11287

variables assigned in this fashion are @emph{not} available inside a

11288

@code{BEGIN} rule

11289

(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}),

11290

since such rules are run before @code{awk} begins scanning the argument list.
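
As a minimal sketch of this behavior (the variable name @code{n} and the
file name @file{mydata} are placeholders chosen for illustration), the
following command prints the value of @code{n} from a @code{BEGIN} rule
and again when the first record is read:

@example
awk 'BEGIN @{ print "BEGIN: n = [" n "]" @}
     @{ print "record: n = [" n "]"; exit @}' n=5 mydata
@end example

@noindent
Assuming @file{mydata} contains at least one record, an @code{awk} that
behaves as described here should print @samp{BEGIN: n = []} followed by
@samp{record: n = [5]}.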

11291

11292

@cindex dark corner

11293

The variable values given on the command line are processed for escape

11294

sequences (d.c.) (@pxref{Escape Sequences}).
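
For instance, in the following sketch (@code{sep} and @file{mydata} are
hypothetical names), the two characters @samp{\t} in the assignment should
be converted to a single tab before the first record is read:

@example
awk '@{ print $1 sep $2 @}' 'sep=\t' mydata
@end example

@noindent
The single quotes keep the shell from interpreting the backslash, so it is
@code{awk} that processes the escape sequence.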

11295

11296

In some earlier implementations of @code{awk}, when a variable assignment

11297

occurred before any file names, the assignment would happen @emph{before}

11298

the @code{BEGIN} rule was executed. @code{awk}'s behavior was thus

11299

inconsistent; some command line assignments were available inside the

11300

@code{BEGIN} rule, while others were not. However,

11301

some applications came to depend

11302

upon this ``feature.'' When @code{awk} was changed to be more consistent,

11303

the @samp{-v} option was added to accommodate applications that depended

11304

upon the old behavior.
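
A short illustration, using a made-up variable name @code{n}: because
the assignment is given with @samp{-v}, the value is already set when
the @code{BEGIN} rule runs.

@example
awk -v n=5 'BEGIN @{ print "BEGIN: n = [" n "]" @}'
@end example

@noindent
This should print @samp{BEGIN: n = [5]}.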

11305

11306

The variable assignment feature is most useful for assigning to variables

11307

such as @code{RS}, @code{OFS}, and @code{ORS}, which control input and

11308

output formats, before scanning the data files. It is also useful for

11309

controlling state if multiple passes are needed over a data file. For

11310

example:

11311

11312

@cindex multiple passes over data

11313

@cindex passes, multiple

11314

@example
awk 'pass == 1  @{ @var{pass 1 stuff} @}
     pass == 2  @{ @var{pass 2 stuff} @}' pass=1 mydata pass=2 mydata
@end example

11318

11319

Given the variable assignment feature, the @samp{-F} option for setting

11320

the value of @code{FS} is not

11321

strictly necessary. It remains for historical compatibility.
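
As a sketch of the equivalence, each of the following commands should
print the first colon-separated field of every line in @file{/etc/passwd}
(the choice of file is only for illustration):

@example
awk -F: '@{ print $1 @}' /etc/passwd
awk '@{ print $1 @}' FS=: /etc/passwd
@end example

@noindent
The two forms differ in one respect: the value set with @samp{-F} is in
effect inside a @code{BEGIN} rule, while a value assigned as a command
line argument is not set until @code{awk} reaches that argument.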

11322

11323

@node AWKPATH Variable, Obsolete, Other Arguments, Invoking Gawk

11324

@section The @code{AWKPATH} Environment Variable

11325

@cindex @code{AWKPATH} environment variable

11326

@cindex environment variable, @code{AWKPATH}

11327

@cindex search path

11328

@cindex directory search

11329

@cindex path, search

11330

@cindex differences between @code{gawk} and @code{awk}

11331

11332

The previous section described how @code{awk} program files can be named

11333

on the command line with the @samp{-f} option. In most @code{awk}

11334

implementations, you must supply a precise path name for each program

11335

file, unless the file is in the current directory.

11336

11337

@cindex search path, for source files

11338

But in @code{gawk}, if the file name supplied to the @samp{-f} option

11339

does not contain a @samp{/}, then @code{gawk} searches a list of

11340

directories (called the @dfn{search path}), one by one, looking for a

11341

file with the specified name.

11342

11343

The search path is a string consisting of directory names

11344

separated by colons. @code{gawk} gets its search path from the

11345

@code{AWKPATH} environment variable. If that variable does not exist,

11346

@code{gawk} uses a default path, which is

11347

@samp{.:/usr/local/share/awk}.@footnote{Your version of @code{gawk}

11348

may use a directory that is different from @file{/usr/local/share/awk}; it

11349

will depend upon how @code{gawk} was built and installed. The actual

11350

directory will be the value of @samp{$(datadir)} generated when

11351

@code{gawk} was configured. You probably don't need to worry about this

11352

though.} (Programs written for use by

11353

system administrators should use an @code{AWKPATH} variable that

11354

does not include the current directory, @file{.}.)

11355

11356

The search path feature is particularly useful for building up libraries

11357

of useful @code{awk} functions. The library files can be placed in a

11358

standard directory that is in the default path, and then specified on

11359

the command line with a short file name. Otherwise, the full file name

11360

would have to be typed for each file.

11361

11362

By using both the @samp{--source} and @samp{-f} options, your command line

11363

@code{awk} programs can use facilities in @code{awk} library files.

11364

@xref{Library Functions, , A Library of @code{awk} Functions}.
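
For example, assuming that the @file{join.awk} library file presented later
(@pxref{Join Function, ,Merging an Array Into a String}) can be found
along the search path, a command along these lines combines it with
program text given on the command line. The array name @code{parts} is
arbitrary.

@example
gawk -f join.awk --source 'BEGIN @{
    n = split("usr local share awk", parts, " ")
    print "/" join(parts, 1, n, "/")
@}'
@end example

@noindent
With the version of @code{join} shown in this @value{DOCUMENT}, this
should print @samp{/usr/local/share/awk}.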

11365

11366

Path searching is not done if @code{gawk} is in compatibility mode.

11367

This is true for both @samp{--traditional} and @samp{--posix}.

11368

@xref{Options, ,Command Line Options}.

11369

11370

@strong{Note:} if you want files in the current directory to be found,

11371

you must include the current directory in the path, either by including

11372

@file{.} explicitly in the path, or by writing a null entry in the

11373

path. (A null entry is indicated by starting or ending the path with a

11374

colon, or by placing two colons next to each other (@samp{::}).) If the

11375

current directory is not included in the path, then files cannot be

11376

found in the current directory. This path search mechanism is identical

11377

to the shell's.
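
As an illustration for a Bourne-style shell (the file names
@file{prog.awk} and @file{mydata} are placeholders), both of the following
commands ask @code{gawk} to look in @file{/usr/local/share/awk} first and
then in the current directory. The first uses a null entry at the end of
the path; the second names @file{.} explicitly.

@example
AWKPATH=/usr/local/share/awk: gawk -f prog.awk mydata
AWKPATH=/usr/local/share/awk:. gawk -f prog.awk mydata
@end example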

11378

@c someday, @cite{The Bourne Again Shell}....

11379

11380

Starting with version 3.0, if @code{AWKPATH} is not defined in the

11381

environment, @code{gawk} will place its default search path into

11382

@code{ENVIRON["AWKPATH"]}. This makes it easy to determine

11383

the actual search path @code{gawk} will use.
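
A one-line check, offered as a sketch:

@example
gawk 'BEGIN @{ print ENVIRON["AWKPATH"] @}'
@end example

@noindent
If @code{AWKPATH} is set in the environment, this prints that value.
Otherwise, with version 3.0 or later, it should print the default path
built into your copy of @code{gawk}.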

11384

11385

@node Obsolete, Undocumented, AWKPATH Variable, Invoking Gawk

11386

@section Obsolete Options and/or Features

11387

11388

@cindex deprecated options

11389

@cindex obsolete options

11390

@cindex deprecated features

11391

@cindex obsolete features

11392

This section describes features and/or command line options from

11393

previous releases of @code{gawk} that are either not available in the

11394

current version, or that are still supported but deprecated (meaning that

11395

they will @emph{not} be in the next release).

11396

11397

@c update this section for each release!

11398

11399

For version @value{VERSION} of @code{gawk}, there are no obsolete or deprecated
command line options or features from the previous version of @code{gawk}.

11401

@iftex

11402

This section

11403

@end iftex

11404

@ifinfo

11405

This node

11406

@end ifinfo

11407

is thus essentially a place holder,

11408

in case some option becomes obsolete in a future version of @code{gawk}.

11409

11410

@ignore

11411

@c This is pretty old news...

11412

The public-domain version of @code{strftime} that is distributed with

11413

@code{gawk} changed for the 2.14 release. The @samp{%V} conversion specifier

11414

that used to generate the date in VMS format was changed to @samp{%v}.

11415

This is because the POSIX standard for the @code{date} utility now

11416

specifies a @samp{%V} conversion specifier.

11417

@xref{Time Functions, ,Functions for Dealing with Time Stamps}, for details.

11418

@end ignore

11419

11420

@node Undocumented, Known Bugs, Obsolete, Invoking Gawk

11421

@section Undocumented Options and Features

11422

@cindex undocumented features

11423

11424

This section intentionally left blank.

11425

11426

@c Read The Source, Luke!

11427

11428

@ignore

11429

@c If these came out in the Info file or TeX document, then they wouldn't

11430

@c be undocumented, would they?

11431

11432

@code{gawk} has one undocumented option:

11433

11434

@table @code

11435

@item -W nostalgia

11436

@itemx --nostalgia

11437

Print the message @code{"awk: bailing out near line 1"} and dump core.

11438

This option was inspired by the common behavior of very early versions of

11439

Unix @code{awk}, and by a t--shirt.

11440

@end table

11441

11442

Early versions of @code{awk} used to not require any separator (either

11443

a newline or @samp{;}) between the rules in @code{awk} programs. Thus,

11444

it was common to see one-line programs like:

11445

11446

@example

11447

awk '@{ sum += $1 @} END @{ print sum @}'

11448

@end example

11449

11450

@code{gawk} actually supports this, but it is purposely undocumented

11451

since it is considered bad style. The correct way to write such a program

11452

is either

11453

11454

@example

11455

awk '@{ sum += $1 @} ; END @{ print sum @}'

11456

@end example

11457

11458

@noindent

11459

or

11460

11461

@example

11462

awk '@{ sum += $1 @}

11463

END @{ print sum @}' data

11464

@end example

11465

11466

@noindent

11467

@xref{Statements/Lines, ,@code{awk} Statements Versus Lines}, for a fuller

11468

explanation.

11469

11470

@end ignore

11471

11472

@node Known Bugs, , Undocumented, Invoking Gawk

11473

@section Known Bugs in @code{gawk}

11474

@cindex bugs, known in @code{gawk}

11475

@cindex known bugs

11476

11477

@itemize @bullet

11478

@item

11479

The @samp{-F} option for changing the value of @code{FS}

11480

(@pxref{Options, ,Command Line Options})

11481

is not necessary given the command line variable

11482

assignment feature; it remains only for backwards compatibility.

11483

11484

@item

11485

If your system actually has support for @file{/dev/fd} and the

11486

associated @file{/dev/stdin}, @file{/dev/stdout}, and

11487

@file{/dev/stderr} files, you may get different output from @code{gawk}

11488

than you would get on a system without those files. When @code{gawk}

11489

interprets these files internally, it synchronizes output to the

11490

standard output with output to @file{/dev/stdout}, while on a system

11491

with those files, the output is actually to different open files

11492

(@pxref{Special Files, ,Special File Names in @code{gawk}}).

11493

11494

@item

11495

Syntactically invalid single character programs tend to overflow

11496

the parse stack, generating a rather unhelpful message. Such programs

11497

are surprisingly difficult to diagnose in the completely general case,

11498

and the effort to do so really is not worth it.

11499

11500

@item

11501

The word ``GNU'' is incorrectly capitalized in at least one

11502

file in the source code.

11503

@end itemize

11504

11505

@node Library Functions, Sample Programs, Invoking Gawk, Top

11506

@chapter A Library of @code{awk} Functions

11507

11508

@c 2e: USE TEXINFO-2 FUNCTION DEFINITION STUFF!!!!!!!!!!!!!

11509

This chapter presents a library of useful @code{awk} functions. The

11510

sample programs presented later

11511

(@pxref{Sample Programs, ,Practical @code{awk} Programs})

11512

use these functions.

11513

The functions are presented here in a progression from simple to complex.

11514

11515

@ref{Extract Program, ,Extracting Programs from Texinfo Source Files},

11516

presents a program that you can use to extract the source code for

11517

these example library functions and programs from the Texinfo source

11518

for this @value{DOCUMENT}.

11519

(This has already been done as part of the @code{gawk} distribution.)

11520

11521

If you have written one or more useful, general purpose @code{awk} functions,

11522

and would like to contribute them for a subsequent edition of this @value{DOCUMENT},

11523

please contact the author. @xref{Bugs, ,Reporting Problems and Bugs},

11524

for information on doing this. Don't just send code, as you will be

11525

required to either place your code in the public domain,

11526

publish it under the GPL (@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}),

11527

or assign the copyright in it to the Free Software Foundation.

11528

11529

@menu

11530

* Portability Notes:: What to do if you don't have @code{gawk}.

11531

* Nextfile Function:: Two implementations of a @code{nextfile}

11532

function.

11533

* Assert Function:: A function for assertions in @code{awk}

11534

programs.

11535

* Ordinal Functions:: Functions for using characters as numbers and

11536

vice versa.

11537

* Join Function:: A function to join an array into a string.

11538

* Mktime Function:: A function to turn a date into a timestamp.

11539

* Gettimeofday Function:: A function to get formatted times.

11540

* Filetrans Function:: A function for handling data file transitions.

11541

* Getopt Function:: A function for processing command line

11542

arguments.

11543

* Passwd Functions:: Functions for getting user information.

11544

* Group Functions:: Functions for getting group information.

11545

* Library Names:: How to best name private global variables in

11546

library functions.

11547

@end menu

11548

11549

@node Portability Notes, Nextfile Function, Library Functions, Library Functions

11550

@section Simulating @code{gawk}-specific Features

11551

@cindex portability issues

11552

11553

The programs in this chapter and in

11554

@ref{Sample Programs, ,Practical @code{awk} Programs},

11555

freely use features that are specific to @code{gawk}.

11556

This section briefly discusses how you can rewrite these programs for

11557

different implementations of @code{awk}.

11558

11559

Diagnostic error messages are sent to @file{/dev/stderr}.

11560

Use @samp{| "cat 1>&2"} instead of @samp{> "/dev/stderr"}, if your system

11561

does not have a @file{/dev/stderr}, or if you cannot use @code{gawk}.
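
A minimal sketch of the portable form follows. The message text is
made up; the call to @code{close} is there so that the message is
written before the program continues.

@example
awk 'BEGIN @{
    print "warning: something looks wrong" | "cat 1>&2"
    close("cat 1>&2")
@}'
@end example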

11562

11563

A number of programs use @code{nextfile}

11564

(@pxref{Nextfile Statement, ,The @code{nextfile} Statement}),

11565

to skip any remaining input in the input file.

11566

@ref{Nextfile Function, ,Implementing @code{nextfile} as a Function},

11567

shows you how to write a function that will do the same thing.

11568

11569

Finally, some of the programs choose to ignore upper-case and lower-case

11570

distinctions in their input. They do this by assigning one to @code{IGNORECASE}.

11571

You can achieve the same effect by adding the following rule to the

11572

beginning of the program:

11573

11574

@example
# ignore case
@{ $0 = tolower($0) @}
@end example

11578

11579

@noindent

11580

Also, verify that all regexp and string constants used in
comparisons use only lower-case letters.
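
As a sketch, the following program counts the lines in a hypothetical
file @file{mydata} that contain the word ``error'' in any mixture of
upper-case and lower-case letters. The regexp constant is written
entirely in lower-case:

@example
awk '@{ $0 = tolower($0) @}
     /error/ @{ count++ @}
     END @{ print count + 0 @}' mydata
@end example

@noindent
Keep in mind that this rule changes @code{$0} itself, so any output
produced from the record is also in lower-case.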

11582

11583

@node Nextfile Function, Assert Function, Portability Notes, Library Functions

11584

@section Implementing @code{nextfile} as a Function

11585

11586

@cindex skipping input files

11587

@cindex input files, skipping

11588

The @code{nextfile} statement presented in

11589

@ref{Nextfile Statement, ,The @code{nextfile} Statement},

11590

is a @code{gawk}-specific extension. It is not available in other

11591

implementations of @code{awk}. This section shows two versions of a

11592

@code{nextfile} function that you can use to simulate @code{gawk}'s

11593

@code{nextfile} statement if you cannot use @code{gawk}.

11594

11595

Here is a first attempt at writing a @code{nextfile} function.

11596

11597

@example
@group
# nextfile --- skip remaining records in current file

# this should be read in before the "main" awk program

function nextfile()    @{ _abandon_ = FILENAME; next @}

_abandon_ == FILENAME  @{ next @}
@end group
@end example

11608

11609

This file should be included before the main program, because it supplies

11610

a rule that must be executed first. This rule compares the current data

11611

file's name (which is always in the @code{FILENAME} variable) to a private

11612

variable named @code{_abandon_}. If the file name matches, then the action

11613

part of the rule executes a @code{next} statement, to go on to the next

11614

record. (The use of @samp{_} in the variable name is a convention.

11615

It is discussed more fully in

11616

@ref{Library Names, , Naming Library Function Global Variables}.)

11617

11618

The use of the @code{next} statement effectively creates a loop that reads

11619

all the records from the current data file.

11620

Eventually, the end of the file is reached, and

11621

a new data file is opened, changing the value of @code{FILENAME}.

11622

Once this happens, the comparison of @code{_abandon_} to @code{FILENAME}

11623

fails, and execution continues with the first rule of the ``real'' program.

11624

11625

The @code{nextfile} function itself simply sets the value of @code{_abandon_}

11626

and then executes a @code{next} statement to start the loop

11627

going.@footnote{Some implementations of @code{awk} do not allow you to

11628

execute @code{next} from within a function body. Some other work-around

11629

will be necessary if you use such a version.}

11630

@c mawk is what we're talking about.

11631

11632

This initial version has a subtle problem. What happens if the same data

11633

file is listed @emph{twice} on the command line, one right after the other,

11634

or even with just a variable assignment between the two occurrences of

11635

the file name?

11636

11637

@c @findex nextfile

11638

@c do it this way, since all the indices are merged

11639

@cindex @code{nextfile} function

11640

In such a case,
this code will skip right through the file a second time, even though
it should stop when it gets to the end of the first occurrence.

11643

Here is a second version of @code{nextfile} that remedies this problem.

11644

11645

@example
@group
@c file eg/lib/nextfile.awk
# nextfile --- skip remaining records in current file
# correctly handle successive occurrences of the same file
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
# May, 1993

# this should be read in before the "main" awk program

function nextfile()   @{ _abandon_ = FILENAME; next @}

_abandon_ == FILENAME @{
      if (FNR == 1)
          _abandon_ = ""
      else
          next
@}
@c endfile
@end group
@end example

11666

11667

The @code{nextfile} function has not changed. It sets @code{_abandon_}

11668

equal to the current file name and then executes a @code{next} statement.

11669

The @code{next} statement reads the next record and increments @code{FNR},

11670

so @code{FNR} is guaranteed to have a value of at least two.

11671

However, if @code{nextfile} is called for the last record in the file,

11672

then @code{awk} will close the current data file and move on to the next

11673

one. Upon doing so, @code{FILENAME} will be set to the name of the new file,

11674

and @code{FNR} will be reset to one. If this next file is the same as

11675

the previous one, @code{_abandon_} will still be equal to @code{FILENAME}.

11676

However, @code{FNR} will be equal to one, telling us that this is a new

11677

occurrence of the file, and not the one we were reading when the

11678

@code{nextfile} function was executed. In that case, @code{_abandon_}

11679

is reset to the empty string, so that further executions of this rule

11680

will fail (until the next time that @code{nextfile} is called).

11681

11682

If @code{FNR} is not one, then we are still in the original data file,

11683

and the program executes a @code{next} statement to skip through it.

11684

11685

An important question to ask at this point is: ``Given that the
functionality of @code{nextfile} can be provided with a library file,
why is it built into @code{gawk}?'' Adding
features for little reason leads to larger, slower programs that are
harder to maintain.

11690

11691

The answer is that building @code{nextfile} into @code{gawk} provides

11692

significant gains in efficiency. If the @code{nextfile} function is executed

11693

at the beginning of a large data file, @code{awk} still has to scan the entire

11694

file, splitting it up into records, just to skip over it. The built-in

11695

@code{nextfile} can simply close the file immediately and proceed to the

11696

next one, saving a lot of time. This is particularly important in

11697

@code{awk}, since @code{awk} programs are generally I/O bound (i.e.@:

11698

they spend most of their time doing input and output, instead of performing

11699

computations).
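
For comparison, here is a sketch that uses the built-in statement.
It prints the first record of each data file and then moves directly
to the next file; @file{file1} and @file{file2} are placeholder names.

@example
gawk '@{ print FILENAME ": " $0; nextfile @}' file1 file2
@end example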

11700

11701

@node Assert Function, Ordinal Functions, Nextfile Function, Library Functions

11702

@section Assertions

11703

11704

@cindex assertions

11705

@cindex @code{assert}, C version

11706

When writing large programs, it is often useful to be able to know

11707

that a condition or set of conditions is true. Before proceeding with a

11708

particular computation, you make a statement about what you believe to be

11709

the case. Such a statement is known as an

11710

``assertion.'' The C language provides an @code{<assert.h>} header file

11711

and corresponding @code{assert} macro that the programmer can use to make

11712

assertions. If an assertion fails, the @code{assert} macro arranges to

11713

print a diagnostic message describing the condition that should have

11714

been true but was not, and then it kills the program. In C, using

11715

@code{assert} looks like this:

11716

11717

@example
#include <assert.h>

int myfunc(int a, double b)
@{
     assert(a <= 5 && b >= 17);
     @dots{}
@}
@end example

11726

11727

If the assertion failed, the program would print a message similar to

11728

this:

11729

11730

@example

11731

prog.c:5: assertion failed: a <= 5 && b >= 17

11732

@end example

11733

11734

@findex assert

11735

The ANSI C language makes it possible to turn the condition into a string for use

11736

in printing the diagnostic message. This is not possible in @code{awk}, so

11737

this @code{assert} function also requires a string version of the condition

11738

that is being tested.

11739

11740

@example
@c @group
@c file eg/lib/assert.awk
# assert --- assert that a condition is true. Otherwise exit.
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
# May, 1993

function assert(condition, string)
@{
    if (! condition) @{
        printf("%s:%d: assertion failed: %s\n",
            FILENAME, FNR, string) > "/dev/stderr"
        _assert_exit = 1
        exit 1
    @}
@}

END @{
    if (_assert_exit)
        exit 1
@}
@c endfile
@c @end group
@end example

11764

11765

The @code{assert} function tests the @code{condition} parameter. If it

11766

is false, it prints a message to standard error, using the @code{string}

11767

parameter to describe the failed condition. It then sets the variable

11768

@code{_assert_exit} to one, and executes the @code{exit} statement.

11769

The @code{exit} statement jumps to the @code{END} rule. If the @code{END}

11770

rule finds @code{_assert_exit} to be true, then it exits immediately.

11771

11772

The purpose of the @code{END} rule with its test is to

11773

keep any other @code{END} rules from running. When an assertion fails, the

11774

program should exit immediately.

11775

If no assertions fail, then @code{_assert_exit} will still be

11776

false when the @code{END} rule is run normally, and the rest of the

11777

program's @code{END} rules will execute.

11778

For all of this to work correctly, @file{assert.awk} must be the

11779

first source file read by @code{awk}.

11780

11781

You would use this function in your programs this way:

11782

11783

@example
function myfunc(a, b)
@{
     assert(a <= 5 && b >= 17, "a <= 5 && b >= 17")
     @dots{}
@}
@end example

11790

11791

@noindent

11792

If the assertion failed, you would see a message like this:

11793

11794

@example

11795

mydata:1357: assertion failed: a <= 5 && b >= 17

11796

@end example

11797

11798

There is a problem with this version of @code{assert} that it may not
be possible to work around. An @code{END} rule is automatically added

11800

to the program calling @code{assert}. Normally, if a program consists

11801

of just a @code{BEGIN} rule, the input files and/or standard input are

11802

not read. However, now that the program has an @code{END} rule, @code{awk}

11803

will attempt to read the input data files, or standard input

11804

(@pxref{Using BEGIN/END, , Startup and Cleanup Actions}),

11805

most likely causing the program to hang, waiting for input.

11806

11807

@cindex backslash continuation

11808

Just a note on programming style. You may have noticed that the @code{END}

11809

rule uses backslash continuation, with the open brace on a line by

11810

itself. This is so that it more closely resembles the way functions

11811

are written. Many of the examples

11812

@iftex

11813

in this chapter and the next one

11814

@end iftex

11815

use this style. You can decide for yourself if you like writing

11816

your @code{BEGIN} and @code{END} rules this way,

11817

or not.

11818

11819

@node Ordinal Functions, Join Function, Assert Function, Library Functions

11820

@section Translating Between Characters and Numbers

11821

11822

@cindex numeric character values

11823

@cindex values of characters as numbers

11824

One commercial implementation of @code{awk} supplies a built-in function,

11825

@code{ord}, which takes a character and returns the numeric value for that

11826

character in the machine's character set. If the string passed to

11827

@code{ord} has more than one character, only the first one is used.

11828

11829

The inverse of this function is @code{chr} (from the function of the same

11830

name in Pascal), which takes a number and returns the corresponding character.

11831

11832

Both functions can be written very nicely in @code{awk}; there is no real

11833

reason to build them into the @code{awk} interpreter.

11834

11835

@findex ord

11836

@findex chr

11837

@example
@c @group
@c file eg/lib/ord.awk
# ord.awk --- do ord and chr
#
# Global identifiers:
#    _ord_:        numerical values indexed by characters
#    _ord_init:    function to initialize _ord_
#
# Arnold Robbins
# arnold@@gnu.ai.mit.edu
# Public Domain
# 16 January, 1992
# 20 July, 1992, revised

BEGIN    @{ _ord_init() @}
@c endfile
@c @end group

@c @group
@c file eg/lib/ord.awk
function _ord_init(    low, high, i, t)
@{
    low = sprintf("%c", 7) # BEL is ascii 7
    if (low == "\a") @{    # regular ascii
        low = 0
        high = 127
    @} else if (sprintf("%c", 128 + 7) == "\a") @{
        # ascii, mark parity
        low = 128
        high = 255
    @} else @{        # ebcdic(!)
        low = 0
        high = 255
    @}

    for (i = low; i <= high; i++) @{
        t = sprintf("%c", i)
        _ord_[t] = i
    @}
@}
@c endfile
@c @end group
@end example

11881

11882

@cindex character sets

11883

@cindex character encodings

11884

@cindex ASCII

11885

@cindex EBCDIC

11886

@cindex mark parity

11887

Some explanation of the numbers used by @code{chr} is worthwhile.

11888

The most prominent character set in use today is ASCII. Although an

11889

eight-bit byte can hold 256 distinct values (from zero to 255), ASCII only

11890

defines characters that use the values from zero to 127.@footnote{ASCII

11891

has been extended in many countries to use the values from 128 to 255

11892

for country-specific characters. If your system uses these extensions,

11893

you can simplify @code{_ord_init} to simply loop from zero to 255.}

11894

At least one computer manufacturer that we know of

11895

@c Pr1me, blech

11896

uses ASCII, but with mark parity, meaning that the leftmost bit in the byte

11897

is always one. What this means is that on those systems, characters

11898

have numeric values from 128 to 255.

11899

Finally, large mainframe systems use the EBCDIC character set, which

11900

uses all 256 values.

11901

While there are other character sets in use on some older systems,

11902

they are not really worth worrying about.

11903

11904

@example
@group
@c file eg/lib/ord.awk
function ord(str,    c)
@{
    # only first character is of interest
    c = substr(str, 1, 1)
    return _ord_[c]
@}
@c endfile
@end group

@group
@c file eg/lib/ord.awk
function chr(c)
@{
    # force c to be numeric by adding 0
    return sprintf("%c", c + 0)
@}
@c endfile
@end group

@c @group
@c file eg/lib/ord.awk
#### test code ####
# BEGIN    \
# @{
#    for (;;) @{
#        printf("enter a character: ")
#        if (getline var <= 0)
#            break
#        printf("ord(%s) = %d\n", var, ord(var))
#    @}
# @}
@c endfile
@c @end group
@end example

11941

11942

An obvious improvement to these functions would be to move the code for the

11943

@code{@w{_ord_init}} function into the body of the @code{BEGIN} rule. It was

11944

written this way initially for ease of development.

11945

11946

There is a ``test program'' in a @code{BEGIN} rule, for testing the

11947

function. It is commented out for production use.
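
As a usage sketch, assuming @file{ord.awk} can be found along the search
path and that the system uses ASCII, the functions can be tried from the
command line like this:

@example
gawk -f ord.awk --source 'BEGIN @{
    print ord("A")
    print chr(97)
    print chr(ord("a") - 32)
@}'
@end example

@noindent
On such a system this should print @samp{65}, @samp{a}, and @samp{A},
one per line. The @code{BEGIN} rule in @file{ord.awk} runs first because
that file is named first on the command line.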

11948

11949

@node Join Function, Mktime Function, Ordinal Functions, Library Functions

11950

@section Merging an Array Into a String

11951

11952

@cindex merging strings

11953

When doing string processing, it is often useful to be able to join

11954

all the strings in an array into one long string. The following function,

11955

@code{join}, accomplishes this task. It is used later in several of

11956

the application programs

11957

(@pxref{Sample Programs, ,Practical @code{awk} Programs}).

11958

11959

Good function design is important; this function needs to be general, but it

11960

should also have a reasonable default behavior. It is called with an array

11961

and the beginning and ending indices of the elements in the array to be

11962

merged. This assumes that the array indices are numeric---a reasonable

11963

assumption since the array was likely created with @code{split}

11964

(@pxref{String Functions, ,Built-in Functions for String Manipulation}).

11965

11966

@findex join

11967

@example
@group
@c file eg/lib/join.awk
# join.awk --- join an array into a string
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
# May 1993

function join(array, start, end, sep,    result, i)
@{
    if (sep == "")
        sep = " "
    else if (sep == SUBSEP) # magic value
        sep = ""
    result = array[start]
    for (i = start + 1; i <= end; i++)
        result = result sep array[i]
    return result
@}
@c endfile
@end group
@end example

11988

11989

An optional additional argument is the separator to use when joining the

11990

strings back together. If the caller supplies a non-empty value,

11991

@code{join} uses it. If it is not supplied, it will have a null

11992

value. In this case, @code{join} uses a single blank as a default

11993

separator for the strings. If the value is equal to @code{SUBSEP},

11994

then @code{join} joins the strings with no separator between them.

11995

@code{SUBSEP} serves as a ``magic'' value to indicate that there should

11996

be no separation between the component strings.
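
A short usage sketch, assuming @file{join.awk} can be found along the
search path (the array name @code{parts} is arbitrary), exercises the
default separator, an explicit separator, and the ``magic'' value:

@example
gawk -f join.awk --source 'BEGIN @{
    n = split("a b c", parts)
    print join(parts, 1, n)
    print join(parts, 1, n, "-")
    print join(parts, 2, n, SUBSEP)
@}'
@end example

@noindent
This should print @samp{a b c}, @samp{a-b-c}, and @samp{bc}, one per line.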

11997

11998

It would be nice if @code{awk} had an assignment operator for concatenation.

11999

The lack of an explicit operator for concatenation makes string operations

12000

more difficult than they really need to be.

12001

12002

@node Mktime Function, Gettimeofday Function, Join Function, Library Functions

12003

@section Turning Dates Into Timestamps

12004

12005

The @code{systime} function built into @code{gawk}

12006

returns the current time of day as

12007

a timestamp in ``seconds since the Epoch.'' This timestamp

12008

can be converted into a printable date of almost infinitely variable

12009

format using the built-in @code{strftime} function.

12010

(For more information on @code{systime} and @code{strftime},

12011

@pxref{Time Functions, ,Functions for Dealing with Time Stamps}.)

12012

12013

@cindex converting dates to timestamps

12014

@cindex dates, converting to timestamps

12015

@cindex timestamps, converting from dates

12016

An interesting but difficult problem is to convert a readable representation

12017

of a date back into a timestamp. The ANSI C library provides a @code{mktime}

12018

function that does the basic job, converting a canonical representation of a

12019

date into a timestamp.

12020

12021

It would appear at first glance that @code{gawk} would have to supply a

12022

@code{mktime} built-in function that was simply a ``hook'' to the C language

12023

version. In fact though, @code{mktime} can be implemented entirely in

12024

@code{awk}.

12025

12026

Here is a version of @code{mktime} for @code{awk}. It takes a simple

12027

representation of the date and time, and converts it into a timestamp.

12028

12029

The code is presented here intermixed with explanatory prose. In

12030

@ref{Extract Program, ,Extracting Programs from Texinfo Source Files},

12031

you will see how the Texinfo source file for this @value{DOCUMENT}

12032

can be processed to extract the code into a single source file.

12033

12034

The program begins with a descriptive comment and a @code{BEGIN} rule

12035

that initializes a table @code{_tm_months}. This table is a two-dimensional

12036

array that has the lengths of the months. The first index is zero for

12037

regular years, and one for leap years. The values are the same for all the

12038

months in both kinds of years, except for February; thus the use of multiple

12039

assignment.

@example
@c @group
@c file eg/lib/mktime.awk
# mktime.awk --- convert a canonical date representation
#                into a timestamp
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
# May 1993

BEGIN    \
@{
    # Initialize table of month lengths
    _tm_months[0,1] = _tm_months[1,1] = 31
    _tm_months[0,2] = 28; _tm_months[1,2] = 29
    _tm_months[0,3] = _tm_months[1,3] = 31
    _tm_months[0,4] = _tm_months[1,4] = 30
    _tm_months[0,5] = _tm_months[1,5] = 31
    _tm_months[0,6] = _tm_months[1,6] = 30
    _tm_months[0,7] = _tm_months[1,7] = 31
    _tm_months[0,8] = _tm_months[1,8] = 31
    _tm_months[0,9] = _tm_months[1,9] = 30
    _tm_months[0,10] = _tm_months[1,10] = 31
    _tm_months[0,11] = _tm_months[1,11] = 30
    _tm_months[0,12] = _tm_months[1,12] = 31
@}
@c endfile
@c @end group
@end example
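
The same table could be built more compactly. The following is only a
sketch of one alternative, not the method used in @file{mktime.awk}: a
hypothetical function, @code{_tm_init_months}, splits a string of month
lengths into a temporary array and then adjusts February for leap years.

```awk
# _tm_init_months --- hypothetical alternative to the BEGIN rule above;
# fills the global table _tm_months[leap, month]
function _tm_init_months(    lengths, n, m)
{
    n = split("31 28 31 30 31 30 31 31 30 31 30 31", lengths, " ")
    for (m = 1; m <= n; m++) {
        _tm_months[0, m] = lengths[m] + 0
        _tm_months[1, m] = lengths[m] + 0
    }
    _tm_months[1, 2] = 29    # February in a leap year
}
```

The explicit assignments in the @code{BEGIN} rule are longer, but they can
be checked against a calendar at a glance, which may be why that form was
chosen.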

The benefit of merging multiple @code{BEGIN} rules
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns})
is particularly clear when writing library files. Functions in library
files can cleanly initialize their own private data and also provide clean-up
actions in private @code{END} rules.

The next function is a simple one that computes whether a given year is or
is not a leap year. If a year is evenly divisible by four, but not evenly
divisible by 100, or if it is evenly divisible by 400, then it is a leap
year. Thus, 1904 was a leap year, 1900 was not, but 2000 was.

@findex _tm_isleap
@example
@group
@c file eg/lib/mktime.awk
# decide if a year is a leap year
function _tm_isleap(year,    ret)
@{
    ret = (year % 4 == 0 && year % 100 != 0) ||
            (year % 400 == 0)

    return ret
@}
@c endfile
@end group
@end example

This function is only used a few times in this file, and its computation
could have been written @dfn{in-line} (at the point where it's used).
Making it a separate function made the original development easier, and also
avoids the possibility of typing errors when duplicating the code in
multiple places.
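
The three years mentioned earlier make a convenient check of the rule.
In this sketch, @code{_tm_isleap} is repeated so that the fragment stands
alone, and @code{leap_report} is a hypothetical helper added only for the
illustration.

```awk
# decide if a year is a leap year (as presented above)
function _tm_isleap(year,    ret)
{
    ret = (year % 4 == 0 && year % 100 != 0) ||
            (year % 400 == 0)

    return ret
}

# leap_report --- hypothetical helper: describe a year in words
function leap_report(year)
{
    return sprintf("%d %s a leap year", year,
                   _tm_isleap(year) ? "is" : "is not")
}
```

Calling @code{leap_report} with 1904 and 2000 reports a leap year in both
cases; with 1900 it reports that the year is not one.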

The next function is more interesting. It does most of the work of
generating a timestamp, which is converting a date and time into some number
of seconds since the Epoch. The caller passes an array (rather
imaginatively named @code{a}) containing six
values: the year including century, the month as a number between one and 12,
the day of the month, the hour as a number between zero and 23, the minute in
the hour, and the seconds within the minute.

The function uses several local variables to precompute the number of
seconds in an hour, seconds in a day, and seconds in a 365-day year. Often,
similar C code simply writes out the expression in-line, expecting the
compiler to do @dfn{constant folding}. For example, most C compilers would
turn @samp{60 * 60} into @samp{3600} at compile time, instead of recomputing
it every time at run time. Precomputing these values makes the
function more efficient.

@findex _tm_addup
@example
@c @group
@c file eg/lib/mktime.awk
# convert a date into seconds
function _tm_addup(a,    total, yearsecs, daysecs,
                         hoursecs, i, j)
@{
    hoursecs = 60 * 60
    daysecs = 24 * hoursecs
    yearsecs = 365 * daysecs

    total = (a[1] - 1970) * yearsecs

@group
    # extra day for leap years
    for (i = 1970; i < a[1]; i++)
        if (_tm_isleap(i))
            total += daysecs
@end group

@group
    j = _tm_isleap(a[1])
    for (i = 1; i < a[2]; i++)
        total += _tm_months[j, i] * daysecs
@end group

    total += (a[3] - 1) * daysecs
    total += a[4] * hoursecs
    total += a[5] * 60
    total += a[6]

    return total
@}
@c endfile
@c @end group
@end example

The function starts with a first approximation of all the seconds between
Midnight, January 1, 1970,@footnote{This is the Epoch on POSIX systems.
It may be different on other systems.} and the beginning of the current
year. It then goes through all those years, and for every leap year,
adds an additional day's worth of seconds.

The variable @code{j} holds one if the current year is a leap year, and zero
otherwise.
For every month in the current year prior to the current month, it adds
the number of seconds in the month, using the appropriate entry in the
@code{_tm_months} array.

Finally, it adds in the seconds for the number of days prior to the current
day, and the number of hours, minutes, and seconds in the current day.

The result is a count of seconds since January 1, 1970. This value is not
yet what is needed, though. The reason why is described shortly.
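
To make the arithmetic concrete, here is a condensed restatement of the
same steps. It is a sketch only: the function name @code{seconds_utc} and
its private month table are assumptions made for this illustration, and it
counts days before converting to seconds instead of accumulating seconds
as @code{_tm_addup} does.

```awk
# seconds_utc --- hypothetical, condensed restatement of the steps above;
# it keeps its own month table instead of using _tm_months
function seconds_utc(year, month, day, hour, minute, second,
                     mlen, days, leap, i)
{
    split("31 28 31 30 31 30 31 31 30 31 30 31", mlen, " ")

    # whole years since 1970, one extra day for each leap year
    days = (year - 1970) * 365
    for (i = 1970; i < year; i++)
        if ((i % 4 == 0 && i % 100 != 0) || i % 400 == 0)
            days++

    # whole months in the current year
    leap = (year % 4 == 0 && year % 100 != 0) || year % 400 == 0
    for (i = 1; i < month; i++)
        days += mlen[i] + (i == 2 && leap)

    # whole days in the current month
    days += day - 1

    return ((days * 24 + hour) * 60 + minute) * 60 + second
}
```

For the arguments 1993, 5, 23, 18, 23, and 12, the day count is
8395 + 6 + 120 + 22 = 8543 (23 years of 365 days, six leap days, January
through April, and 22 full days in May), and the result is 738181392
seconds.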

The main @code{mktime} function takes a single character string argument.
This string is a representation of a date and time in a ``canonical''
(fixed) form. This string should be
@code{"@var{year} @var{month} @var{day} @var{hour} @var{minute} @var{second}"}.

@findex mktime
@example
@c @group
@c file eg/lib/mktime.awk
# mktime --- convert a date into seconds,
#            compensate for time zone

function mktime(str,    res1, res2, a, b, i, j, t, diff)
@{
    i = split(str, a, " ")    # don't rely on FS

    if (i != 6)
        return -1

    # force numeric
    for (j in a)
        a[j] += 0

@group
    # validate
    if (a[1] < 1970 ||
        a[2] < 1 || a[2] > 12 ||
        a[3] < 1 || a[3] > 31 ||
        a[4] < 0 || a[4] > 23 ||
        a[5] < 0 || a[5] > 59 ||
        a[6] < 0 || a[6] > 61 )
            return -1
@end group

    res1 = _tm_addup(a)
    t = strftime("%Y %m %d %H %M %S", res1)

    if (_tm_debug)
        printf("(%s) -> (%s)\n", str, t) > "/dev/stderr"

    split(t, b, " ")
    res2 = _tm_addup(b)

    diff = res1 - res2

    if (_tm_debug)
        printf("diff = %d seconds\n", diff) > "/dev/stderr"

    res1 += diff

    return res1
@}
@c endfile
@c @end group
@end example

The function first splits the string into an array, using spaces and tabs as
separators. If there are not six elements in the array, it returns an
error, signaled as the value @minus{}1.
Next, it forces each element of the array to be numeric, by adding zero to it.
The following @samp{if} statement then makes sure that each element is
within an allowable range. (This checking could be extended further, e.g.,
to make sure that the day of the month is within the correct range for the
particular month supplied.) All of this is essentially preliminary set-up
and error checking.
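
The extended check suggested in the parenthetical remark might look like
the following sketch. The name @code{_tm_validday} and its private month
table are assumptions made for this illustration; the function is not part
of @file{mktime.awk}, where the existing @code{_tm_months} table and
@code{_tm_isleap} could be used instead.

```awk
# _tm_validday --- hypothetical helper: is day within range for the
# given month and year?  Returns 1 if so, 0 otherwise.
function _tm_validday(year, month, day,    mlen, leap)
{
    if (month < 1 || month > 12 || day < 1)
        return 0
    split("31 28 31 30 31 30 31 31 30 31 30 31", mlen, " ")
    leap = (year % 4 == 0 && year % 100 != 0) || year % 400 == 0
    # February gains one day in a leap year
    return day <= mlen[month] + (month == 2 && leap)
}
```

A caller could reject the date when
@code{_tm_validday(a[1], a[2], a[3])} is zero, in the same place where the
other range checks return @minus{}1.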

Recall that @code{_tm_addup} generated a value in seconds since Midnight,
January 1, 1970. This value is not directly usable as the result we want,
@emph{since the calculation does not account for the local timezone}. In other
words, the value represents the count in seconds since the Epoch, but only
for UTC (Coordinated Universal Time). If the local timezone is east or west
of UTC, then some number of hours should be either added to, or subtracted from
the resulting timestamp.

For example, Atlanta, Georgia (USA), is normally five hours west
of (behind) UTC. It is only four hours behind UTC if daylight saving
time is in effect.
If you are calling @code{mktime} in Atlanta, with the argument
@code{@w{"1993 5 23 18 23 12"}}, the result from @code{_tm_addup} will be
for 6:23 p.m. UTC, which is only 2:23 p.m. in Atlanta. It is necessary to
add another four hours' worth of seconds to the result.

How can @code{mktime} determine how far away it is from UTC? This is
surprisingly easy. The returned timestamp represents the time passed to
@code{mktime} @emph{as UTC}. This timestamp can be fed back to
@code{strftime}, which will format it as a @emph{local} time; i.e.@: as
if it already had the UTC difference added in to it. This is done by
giving @code{@w{"%Y %m %d %H %M %S"}} to @code{strftime} as the format
argument. It returns the computed timestamp in the original string
format. The result represents a time that accounts for the UTC
difference. When the new time is converted back to a timestamp, the
difference between the two timestamps is the difference (in seconds)
between the local timezone and UTC. This difference is then added back
to the original result. An example demonstrating this is presented below.

Finally, there is a ``main'' program for testing the function.

@example
@c @group
@c file eg/lib/mktime.awk
BEGIN  @{
    if (_tm_test) @{
        printf "Enter date as yyyy mm dd hh mm ss: "
        getline _tm_test_date

        t = mktime(_tm_test_date)
        r = strftime("%Y %m %d %H %M %S", t)
        printf "Got back (%s)\n", r
    @}
@}
@c endfile
@c @end group
@end example

The entire program uses two variables that can be set on the command
line to control debugging output and to enable the test in the final
@code{BEGIN} rule. Here is the result of a test run. (Note that debugging
output is to standard error, and test output is to standard output.)

@example
@c @group
$ gawk -f mktime.awk -v _tm_test=1 -v _tm_debug=1
@print{} Enter date as yyyy mm dd hh mm ss: 1993 5 23 15 35 10
@error{} (1993 5 23 15 35 10) -> (1993 05 23 11 35 10)
@error{} diff = 14400 seconds
@print{} Got back (1993 05 23 15 35 10)
@c @end group
@end example

The time entered was 3:35 p.m. (15:35 on a 24-hour clock), on May 23, 1993.
The first line
of debugging output shows the entered time treated as UTC and then formatted
as local time; it is four hours earlier, because UTC is four hours ahead of
the local time zone. The second line shows that the difference is 14400
seconds, which is four hours. (The difference is only four hours, since
daylight saving time is in effect during May.)
The final line of test output shows that the timezone compensation
algorithm works; the returned time is the same as the entered time.

This program does not solve the general problem of turning an arbitrary date
representation into a timestamp. That problem is very involved. However,
the @code{mktime} function provides a foundation upon which to build. Other
software can convert month names into numeric months, and AM/PM times into
24-hour clocks, to generate the ``canonical'' format that @code{mktime}
requires.
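
As a sketch of what such a front end might look like, the following
hypothetical function, @code{canon_date}, accepts one fixed input form,
@samp{May 23 1993 6:23:12 PM}, and produces the canonical string. The
function name and the input form are assumptions chosen for this example;
a real converter would have to accept many more layouts.

```awk
# canon_date --- hypothetical front end for mktime: convert a string
# such as "May 23 1993 6:23:12 PM" into "1993 5 23 18 23 12".
# Returns the null string if the input is not in that form.
function canon_date(str,    f, n, t, month, hour)
{
    n = split(str, f, " ")
    if (n != 5 || length(f[1]) < 3 || split(f[4], t, ":") != 3)
        return ""

    # position of the three-letter month abbreviation, 1 through 12
    month = index("JanFebMarAprMayJunJulAugSepOctNovDec", substr(f[1], 1, 3))
    if (month == 0 || month % 3 != 1)
        return ""
    month = (month + 2) / 3

    # 12 AM is hour 0, 12 PM is hour 12
    hour = t[1] % 12
    if (toupper(f[5]) == "PM")
        hour += 12
    else if (toupper(f[5]) != "AM")
        return ""

    return sprintf("%d %d %d %d %d %d", f[3], month, f[2], hour, t[2], t[3])
}
```

The returned string can be passed directly to @code{mktime}; a null
return value lets the caller report the bad input itself.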

@node Gettimeofday Function, Filetrans Function, Mktime Function, Library Functions
@section Managing the Time of Day

@cindex formatted timestamps
@cindex timestamps, formatted
The @code{systime} and @code{strftime} functions described in
@ref{Time Functions, ,Functions for Dealing with Time Stamps},
provide the minimum functionality necessary for dealing with the time of day
in human-readable form. While @code{strftime} is extensive, the control
formats are not necessarily easy to remember or intuitively obvious when
reading a program.

The following function, @code{gettimeofday}, populates a user-supplied array
with pre-formatted time information. It returns a string with the current
time formatted in the same way as the @code{date} utility.

@findex gettimeofday
@example
@c @group
@c file eg/lib/gettime.awk
# gettimeofday --- get the time of day in a usable format
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain, May 1993
#
# Returns a string in the format of output of date(1)
# Populates the array argument time with individual values:
#    time["second"]       -- seconds (0 - 59)
#    time["minute"]       -- minutes (0 - 59)
#    time["hour"]         -- hours (0 - 23)
#    time["althour"]      -- hours (1 - 12)
#    time["monthday"]     -- day of month (1 - 31)
#    time["month"]        -- month of year (1 - 12)
#    time["monthname"]    -- name of the month
#    time["shortmonth"]   -- short name of the month
#    time["year"]         -- year within century (0 - 99)
#    time["fullyear"]     -- year with century (19xx or 20xx)
#    time["weekday"]      -- day of week (Sunday = 0)
#    time["altweekday"]   -- day of week (Monday = 1, Sunday = 7)
#    time["weeknum"]      -- week number, Sunday first day
#    time["altweeknum"]   -- week number, Monday first day
#    time["dayname"]      -- name of weekday
#    time["shortdayname"] -- short name of weekday
#    time["yearday"]      -- day of year (1 - 366)
#    time["timezone"]     -- abbreviation of timezone name
#    time["ampm"]         -- AM or PM designation

@group
function gettimeofday(time,    ret, now, i)
@{
    # get time once, avoids unnecessary system calls
    now = systime()

    # return date(1)-style output
    ret = strftime("%a %b %d %H:%M:%S %Z %Y", now)

    # clear out target array
    for (i in time)
        delete time[i]
@end group

@group
    # fill in values, force numeric values to be
    # numeric by adding 0
    time["second"]       = strftime("%S", now) + 0
    time["minute"]       = strftime("%M", now) + 0
    time["hour"]         = strftime("%H", now) + 0
    time["althour"]      = strftime("%I", now) + 0
    time["monthday"]     = strftime("%d", now) + 0
    time["month"]        = strftime("%m", now) + 0
    time["monthname"]    = strftime("%B", now)
    time["shortmonth"]   = strftime("%b", now)
    time["year"]         = strftime("%y", now) + 0
    time["fullyear"]     = strftime("%Y", now) + 0
    time["weekday"]      = strftime("%w", now) + 0
    time["altweekday"]   = strftime("%u", now) + 0
    time["dayname"]      = strftime("%A", now)
    time["shortdayname"] = strftime("%a", now)
    time["yearday"]      = strftime("%j", now) + 0
    time["timezone"]     = strftime("%Z", now)
    time["ampm"]         = strftime("%p", now)
    time["weeknum"]      = strftime("%U", now) + 0
    time["altweeknum"]   = strftime("%W", now) + 0

    return ret
@}
@end group
@c endfile
@end example

The string indices are easier to use and read than the various formats
required by @code{strftime}. The @code{alarm} program presented in
@ref{Alarm Program, ,An Alarm Clock Program},
uses this function.

@c exercise!!!
The @code{gettimeofday} function is presented above as it was written. A
more general design for this function would have allowed the user to supply
an optional timestamp value that would have been used instead of the current
time.
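
One way to approach that more general design is sketched below. The name
@code{gettimeofday_at} and the parameter @code{stamp} are assumptions made
for this illustration, and only a few of the array elements are filled in
to keep the example short. Like the original, it relies on the
@code{systime} and @code{strftime} functions.

```awk
# gettimeofday_at --- hypothetical variant of gettimeofday: use the
# timestamp in stamp if one is supplied, otherwise the current time.
# Only a few of the array elements are filled in here, for brevity.
function gettimeofday_at(time, stamp,    now, i)
{
    # an omitted argument compares equal to the null string
    now = (stamp == "") ? systime() : stamp + 0

    # clear out target array
    for (i in time)
        delete time[i]

    time["second"]   = strftime("%S", now) + 0
    time["minute"]   = strftime("%M", now) + 0
    time["hour"]     = strftime("%H", now) + 0
    time["monthday"] = strftime("%d", now) + 0
    time["month"]    = strftime("%m", now) + 0
    time["fullyear"] = strftime("%Y", now) + 0

    # return date(1)-style output
    return strftime("%a %b %d %H:%M:%S %Z %Y", now)
}
```

Because an omitted argument is the null string, existing callers that pass
only the array would continue to get the current time.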

@node Filetrans Function, Getopt Function, Gettimeofday Function, Library Functions
@section Noting Data File Boundaries

@cindex per file initialization and clean-up
The @code{BEGIN} and @code{END} rules are each executed exactly once, at
the beginning and end respectively of your @code{awk} program
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}).
We (the @code{gawk} authors) once had a user who mistakenly thought that the
@code{BEGIN} rule was executed at the beginning of each data file and the
@code{END} rule was executed at the end of each data file. When informed
that this was not the case, the user requested that we add new special
patterns to @code{gawk}, named @code{BEGIN_FILE} and @code{END_FILE}, that
would have the desired behavior. He even supplied us the code to do so.

However, after a little thought, I came up with the following library program.
It arranges to call two user-supplied functions, @code{beginfile} and
@code{endfile}, at the beginning and end of each data file.
Besides solving the problem in only nine(!) lines of code, it does so
@emph{portably}; this will work with any implementation of @code{awk}.

@example
@c @group
# transfile.awk
#
# Give the user a hook for filename transitions
#
# The user must supply functions beginfile() and endfile()
# that each take the name of the file being started or
# finished, respectively.
#
# Arnold Robbins, arnold@@gnu.ai.mit.edu, January 1992
# Public Domain

FILENAME != _oldfilename \
@{
    if (_oldfilename != "")
        endfile(_oldfilename)
    _oldfilename = FILENAME
    beginfile(FILENAME)
@}

END   @{ endfile(FILENAME) @}
@c @end group
@end example

This file must be loaded before the user's ``main'' program, so that the
rule it supplies will be executed first.

This rule relies on @code{awk}'s @code{FILENAME} variable that
automatically changes for each new data file. The current file name is
saved in a private variable, @code{_oldfilename}. If @code{FILENAME} does
not equal @code{_oldfilename}, then a new data file is being processed, and
it is necessary to call @code{endfile} for the old file. Since
@code{endfile} should only be called if a file has been processed, the
program first checks to make sure that @code{_oldfilename} is not the null
string. The program then assigns the current file name to
@code{_oldfilename}, and calls @code{beginfile} for the file.
Since, like all @code{awk} variables, @code{_oldfilename} will be
initialized to the null string, this rule executes correctly even for the
first data file.

The program also supplies an @code{END} rule, to do the final processing for
the last file. Since this @code{END} rule comes before any @code{END} rules
supplied in the ``main'' program, @code{endfile} will be called first. Once
again the value of multiple @code{BEGIN} and @code{END} rules should be clear.

@findex beginfile
@findex endfile
This version has the same problem as the first version of @code{nextfile}
(@pxref{Nextfile Function, ,Implementing @code{nextfile} as a Function}).
If the same data file occurs twice in a row on the command line, then
@code{endfile} and @code{beginfile} will not be executed at the end of the
first pass and at the beginning of the second pass.
The following version solves the problem.

@example
@c @group
@c file eg/lib/ftrans.awk
# ftrans.awk --- handle data file transitions
#
# user supplies beginfile() and endfile() functions
#
# Arnold Robbins, arnold@@gnu.ai.mit.edu. November 1992
# Public Domain

FNR == 1 @{
    if (_filename_ != "")
        endfile(_filename_)
    _filename_ = FILENAME
    beginfile(FILENAME)
@}

END  @{ endfile(_filename_) @}
@c endfile
@c @end group
@end example

In @ref{Wc Program, ,Counting Things},
you will see how this library function can be used, and
how it simplifies writing the main program.
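
As a small illustration of the user's side of this arrangement, the
following sketch supplies the two functions for a program that counts the
records in each data file. The variable names @code{_nrecs} and
@code{Records} are assumptions made for this example.

```awk
# beginfile --- called by ftrans.awk when a data file starts
function beginfile(name)
{
    _nrecs = 0
}

# endfile --- called by ftrans.awk when a data file is finished
function endfile(name)
{
    Records[name] = _nrecs
    printf "%s: %d records\n", name, _nrecs
}

# main rule: count the records of the current file
{ _nrecs++ }
```

Assuming the sketch is saved in a file named @file{count.awk} (a made-up
name), it would be run as
@samp{gawk -f ftrans.awk -f count.awk file1 file2}, with the library file
listed first so that its rule is executed before the counting rule.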

@node Getopt Function, Passwd Functions, Filetrans Function, Library Functions
@section Processing Command Line Options

@cindex @code{getopt}, C version
@cindex processing arguments
@cindex argument processing
Most utilities on POSIX-compatible systems take options or ``switches'' on
the command line that can be used to change the way a program behaves.
@code{awk} is an example of such a program
(@pxref{Options, ,Command Line Options}).
Often, options take @dfn{arguments}, data that the program needs to
correctly obey the command line option. For example, @code{awk}'s
@samp{-F} option requires a string to use as the field separator.
The first occurrence on the command line of either @samp{--} or a
string that does not begin with @samp{-} ends the options.

Most Unix systems provide a C function named @code{getopt} for processing
command line arguments. The programmer provides a string describing the
one-letter options. If an option requires an argument, it is followed in the
string with a colon. @code{getopt} is also passed the
count and values of the command line arguments, and is called in a loop.
@code{getopt} processes the command line arguments for option letters.
Each time around the loop, it returns a single character representing the
next option letter that it found, or @samp{?} if it found an invalid option.
When it returns @minus{}1, there are no options left on the command line.

When using @code{getopt}, options that do not take arguments can be
grouped together. Furthermore, options that take arguments require that the
argument be present. The argument can immediately follow the option letter,
or it can be a separate command line argument.

Given a hypothetical program that takes
three command line options, @samp{-a}, @samp{-b}, and @samp{-c}, and
@samp{-b} requires an argument, all of the following are valid ways of
invoking the program:

@example
@c @group
prog -a -b foo -c data1 data2 data3
prog -ac -bfoo -- data1 data2 data3
prog -acbfoo data1 data2 data3
@c @end group
@end example

Notice that when the argument is grouped with its option, the rest of
the command line argument is considered to be the option's argument.
In the above example, @samp{-acbfoo} indicates that all of the
@samp{-a}, @samp{-b}, and @samp{-c} options were supplied,
and that @samp{foo} is the argument to the @samp{-b} option.

@code{getopt} provides four external variables that the programmer can use.

@table @code
@item optind
The index in the argument value array (@code{argv}) where the first
non-option command line argument can be found.

@item optarg
The string value of the argument to an option.

@item opterr
Usually @code{getopt} prints an error message when it finds an invalid
option. Setting @code{opterr} to zero disables this feature. (An
application might wish to print its own error message.)

@item optopt
The letter representing the command line option.
While not usually documented, most versions supply this variable.
@end table

The following C fragment shows how @code{getopt} might process command line
arguments for @code{awk}.

@example
@group
int
main(int argc, char *argv[])
@{
    @dots{}
    /* print our own message */
    opterr = 0;
@end group
@group
    while ((c = getopt(argc, argv, "v:f:F:W:")) != -1) @{
        switch (c) @{
        case 'f':    /* file */
            @dots{}
            break;
        case 'F':    /* field separator */
            @dots{}
            break;
        case 'v':    /* variable assignment */
            @dots{}
            break;
        case 'W':    /* extension */
            @dots{}
            break;
        case '?':
        default:
            usage();
            break;
        @}
    @}
    @dots{}
@}
@end group
@end example

As a side point, @code{gawk} actually uses the GNU @code{getopt_long}
function to process both normal and GNU-style long options
(@pxref{Options, ,Command Line Options}).

The abstraction provided by @code{getopt} is very useful, and would be quite
handy in @code{awk} programs as well. Here is an @code{awk} version of
@code{getopt}. This function highlights one of the greatest weaknesses in
@code{awk}, which is that it is very poor at manipulating single characters.
Repeated calls to @code{substr} are necessary for accessing individual
characters (@pxref{String Functions, ,Built-in Functions for String Manipulation}).

The discussion walks through the code a bit at a time.

@example
@c @group
@c file eg/lib/getopt.awk
# getopt --- do C library getopt(3) function in awk
#
# arnold@@gnu.ai.mit.edu
# Public domain
#
# Initial version: March, 1991
# Revised: May, 1993

# External variables:
#    Optind -- index of ARGV for first non-option argument
#    Optarg -- string value of argument to current option
#    Opterr -- if non-zero, print our own diagnostic
#    Optopt -- current option letter

# Returns
#    -1     at end of options
#    ?      for unrecognized option
#    <c>    a character representing the current option

# Private Data
#    _opti    index in multi-flag option, e.g., -abc
@c endfile
@c @end group
@end example

The function starts out with some documentation: who wrote the code,
and when it was revised, followed by a list of the global variables it uses,
what the return values are and what they mean, and any global variables that
are ``private'' to this library function. Such documentation is essential
for any program, and particularly for library functions.

@findex getopt
@example
@c @group
@c file eg/lib/getopt.awk
function getopt(argc, argv, options,    optl, thisopt, i)
@{
    optl = length(options)
    if (optl == 0)        # no options given
        return -1

    if (argv[Optind] == "--") @{  # all done
        Optind++
        _opti = 0
        return -1
    @} else if (argv[Optind] !~ /^-[^: \t\n\f\r\v\b]/) @{
        _opti = 0
        return -1
    @}
@c endfile
@c @end group
@end example

The function first checks that it was indeed called with a string of options
(the @code{options} parameter). If @code{options} has a zero length,
@code{getopt} immediately returns @minus{}1.

The next thing to check for is the end of the options. A @samp{--} ends the
command line options, as does any command line argument that does not begin
with a @samp{-}. @code{Optind} is used to step through the array of command
line arguments; it retains its value across calls to @code{getopt}, since it
is a global variable.

The regexp used, @code{@w{/^-[^: \t\n\f\r\v\b]/}}, is
perhaps a bit of overkill; it checks for a @samp{-} followed by anything
that is not whitespace and not a colon.
If the current command line argument does not match this pattern,
it is not an option, and it ends option processing.

@example
@group
@c file eg/lib/getopt.awk
    if (_opti == 0)
        _opti = 2
    thisopt = substr(argv[Optind], _opti, 1)
    Optopt = thisopt
    i = index(options, thisopt)
    if (i == 0) @{
        if (Opterr)
            printf("%c -- invalid option\n",
                                  thisopt) > "/dev/stderr"
        if (_opti >= length(argv[Optind])) @{
            Optind++
            _opti = 0
        @} else
            _opti++
        return "?"
    @}
@c endfile
@end group
@end example

The @code{_opti} variable tracks the position in the current command line
argument (@code{argv[Optind]}). In the case that multiple options were
grouped together with one @samp{-} (e.g., @samp{-abx}), it is necessary
to return them to the user one at a time.

If @code{_opti} is equal to zero, it is set to two, the index in the string
of the next character to look at (we skip the @samp{-}, which is at position
one). The variable @code{thisopt} holds the character, obtained with
@code{substr}. It is saved in @code{Optopt} for the main program to use.
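
The character-at-a-time stepping can be seen in isolation in the following
sketch. The function @code{option_letters} is hypothetical and is not part
of @file{getopt.awk}; it simply visits each position after the leading
@samp{-} with one @code{substr} call, the way successive calls to
@code{getopt} advance @code{_opti}.

```awk
# option_letters --- hypothetical helper: list the letters in a grouped
# option argument such as "-abx", visiting one character per substr call
function option_letters(arg,    pos, sep, result)
{
    # position 1 holds the "-", so start at position 2
    for (pos = 2; pos <= length(arg); pos++) {
        result = result sep substr(arg, pos, 1)
        sep = " "
    }
    return result
}
```

For the argument @samp{-abx}, the function returns the string
@code{"a b x"}.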

If @code{thisopt} is not in the @code{options} string, then it is an
invalid option. If @code{Opterr} is non-zero, @code{getopt} prints an error
message on the standard error that is similar to the message from the C
version of @code{getopt}.

Since the option is invalid, it is necessary to skip it and move on to the
next option character. If @code{_opti} is greater than or equal to the
length of the current command line argument, then it is necessary to move on
to the next one, so @code{Optind} is incremented and @code{_opti} is reset
to zero. Otherwise, @code{Optind} is left alone and @code{_opti} is merely
incremented.

In any case, since the option was invalid, @code{getopt} returns @samp{?}.
The main program can examine @code{Optopt} if it needs to know what the
invalid option letter actually was.

12761

12762

@example

12763

@group

12764

@c file eg/lib/getopt.awk

12765

    if (substr(options, i + 1, 1) == ":") @{
        # get option argument
        if (length(substr(argv[Optind], _opti + 1)) > 0)
            Optarg = substr(argv[Optind], _opti + 1)
        else
            Optarg = argv[++Optind]
        _opti = 0
    @} else
        Optarg = ""

12774

@c endfile

12775

@end group

12776

@end example

12777

12778

If the option requires an argument, the option letter is followed by a colon

12779

in the @code{options} string. If there are remaining characters in the

12780

current command line argument (@code{argv[Optind]}), then the rest of that

12781

string is assigned to @code{Optarg}. Otherwise, the next command line

12782

argument is used (@samp{-xFOO} vs. @samp{@w{-x FOO}}). In either case,

12783

@code{_opti} is reset to zero, since there are no more characters left to

12784

examine in the current command line argument.

12785

12786

@example

12787

@c @group

12788

@c file eg/lib/getopt.awk

12789

    if (_opti == 0 || _opti >= length(argv[Optind])) @{
        Optind++
        _opti = 0
    @} else
        _opti++
    return thisopt
@}

12796

@c endfile

12797

@c @end group

12798

@end example

12799

12800

Finally, if @code{_opti} is either zero or greater than or equal to the
length of the current command line argument, it means this element in
@code{argv} is through being processed, so @code{Optind} is incremented to
point to the next element in @code{argv}, and @code{_opti} is reset to zero.
If neither condition is true, then only @code{_opti} is incremented, so that
the next option letter can be processed on the next call to @code{getopt}.

12806

12807

@example

12808

@c @group

12809

@c file eg/lib/getopt.awk

12810

BEGIN @{
    Opterr = 1    # default is to diagnose
    Optind = 1    # skip ARGV[0]

    # test program
    if (_getopt_test) @{
        while ((_go_c = getopt(ARGC, ARGV, "ab:cd")) != -1)
            printf("c = <%c>, optarg = <%s>\n",
                   _go_c, Optarg)
        printf("non-option arguments:\n")
        for (; Optind < ARGC; Optind++)
            printf("\tARGV[%d] = <%s>\n",
                   Optind, ARGV[Optind])
    @}
@}

12825

@c endfile

12826

@c @end group

12827

@end example

12828

12829

The @code{BEGIN} rule initializes both @code{Opterr} and @code{Optind} to one.

12830

@code{Opterr} is set to one, since the default behavior is for @code{getopt}

12831

to print a diagnostic message upon seeing an invalid option. @code{Optind}

12832

is set to one, since there's no reason to look at the program name, which is

12833

in @code{ARGV[0]}.

12834

12835

The rest of the @code{BEGIN} rule is a simple test program. Here is the

12836

result of two sample runs of the test program.

12837

12838

@example

12839

@group

12840

$ awk -f getopt.awk -v _getopt_test=1 -- -a -cbARG bax -x

12841

@print{} c = <a>, optarg = <>

12842

@print{} c = <c>, optarg = <>

12843

@print{} c = <b>, optarg = <ARG>

12844

@print{} non-option arguments:

12845

@print{} ARGV[3] = <bax>

12846

@print{} ARGV[4] = <-x>

12847

@end group

12848

12849

@group

12850

$ awk -f getopt.awk -v _getopt_test=1 -- -a -x -- xyz abc

12851

@print{} c = <a>, optarg = <>

12852

@error{} x -- invalid option

12853

@print{} c = <?>, optarg = <>

12854

@print{} non-option arguments:

12855

@print{} ARGV[4] = <xyz>

12856

@print{} ARGV[5] = <abc>

12857

@end group

12858

@end example

12859

12860

The first @samp{--} terminates the arguments to @code{awk}, so that it does

12861

not try to interpret the @samp{-a} etc. as its own options.

12862

12863

Several of the sample programs presented in

12864

@ref{Sample Programs, ,Practical @code{awk} Programs},

12865

use @code{getopt} to process their arguments.
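
As a rough illustration of how a main program might drive @code{getopt},
here is a minimal sketch. It assumes that @file{getopt.awk} is loaded first
with @samp{-f}, so that its @code{BEGIN} rule has already set @code{Optind}.
The option string @code{"vo:"} and the variables @code{verbose} and
@code{outfile} are invented for this example only.

@example
# hypothetical driver; the option letters are for illustration only
BEGIN @{
    while ((c = getopt(ARGC, ARGV, "vo:")) != -1) @{
        if (c == "v")
            verbose = 1
        else if (c == "o")
            outfile = Optarg    # -o takes an argument
        else    # c is "?"; getopt has already printed a diagnostic
            exit 1
    @}
    # clear the options that were consumed, so that awk
    # does not try to open them as input files
    for (i = 1; i < Optind; i++)
        ARGV[i] = ""
@}
@end example

Setting the processed elements of @code{ARGV} to the null string relies on
the fact that @code{awk} skips over empty elements when it looks for input
files.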

12866

12867

@node Passwd Functions, Group Functions, Getopt Function, Library Functions

12868

@section Reading the User Database

12869

12870

@cindex @file{/dev/user}

12871

The @file{/dev/user} special file

12872

(@pxref{Special Files, ,Special File Names in @code{gawk}})

12873

provides access to the current user's real and effective user and group id

12874

numbers, and if available, the user's supplementary group set.

12875

However, since these are numbers, they do not provide very useful

12876

information to the average user. There needs to be some way to find the

12877

user information associated with the user and group numbers. This

12878

section presents a suite of functions for retrieving information from the

12879

user database. @xref{Group Functions, ,Reading the Group Database},

12880

for a similar suite that retrieves information from the group database.

12881

12882

@cindex @code{getpwent}, C version

12883

@cindex user information

12884

@cindex login information

12885

@cindex account information

12886

@cindex password file

12887

The POSIX standard does not define the file where user information is

12888

kept. Instead, it provides the @code{<pwd.h>} header file

12889

and several C language subroutines for obtaining user information.

12890

The primary function is @code{getpwent}, for ``get password entry.''

12891

The ``password'' comes from the original user database file,

12892

@file{/etc/passwd}, which kept user information, along with the

12893

encrypted passwords (hence the name).

12894

12895

While an @code{awk} program could simply read @file{/etc/passwd} directly

12896

(the format is well known), because of the way password

12897

files are handled on networked systems,

12898

this file may not contain complete information about the system's set of users.

12899

12900

@cindex @code{pwcat} program

12901

To be sure of being

12902

able to produce a readable, complete version of the user database, it is

12903

necessary to write a small C program that calls @code{getpwent}.

12904

@code{getpwent} is defined to return a pointer to a @code{struct passwd}.

12905

Each time it is called, it returns the next entry in the database.

12906

When there are no more entries, it returns @code{NULL}, the null pointer.

12907

When this happens, the C program should call @code{endpwent} to close the

12908

database.

12909

Here is @code{pwcat}, a C program that ``cats'' the password database.

12910

12911

@findex pwcat.c

12912

@example

12913

@c @group

12914

@c file eg/lib/pwcat.c

12915

/*

12916

* pwcat.c

12917

*

12918

* Generate a printable version of the password database

12919

*

12920

* Arnold Robbins

12921

* arnold@@gnu.ai.mit.edu

12922

* May 1993

12923

* Public Domain

12924

*/

12925

12926

#include <stdio.h>
#include <stdlib.h>
#include <pwd.h>

12928

12929

int

12930

main(argc, argv)

12931

int argc;

12932

char **argv;

12933

@{

12934

struct passwd *p;

12935

12936

while ((p = getpwent()) != NULL)

12937

printf("%s:%s:%d:%d:%s:%s:%s\n",

12938

p->pw_name, p->pw_passwd, p->pw_uid,

12939

p->pw_gid, p->pw_gecos, p->pw_dir, p->pw_shell);

12940

12941

endpwent();

12942

exit(0);

12943

@}

12944

@c endfile

12945

@c @end group

12946

@end example

12947

12948

If you don't understand C, don't worry about it.

12949

The output from @code{pwcat} is the user database, in the traditional

12950

@file{/etc/passwd} format of colon-separated fields. The fields are:

12951

12952

@table @asis

12953

@item Login name

12954

The user's login name.

12955

12956

@item Encrypted password

12957

The user's encrypted password. This may not be available on some systems.

12958

12959

@item User-ID

12960

The user's numeric user-id number.

12961

12962

@item Group-ID

12963

The user's numeric group-id number.

12964

12965

@item Full name

12966

The user's full name, and perhaps other information associated with the

12967

user.

12968

12969

@item Home directory

12970

The user's login, or ``home'' directory (familiar to shell programmers as

12971

@code{$HOME}).

12972

12973

@item Login shell

12974

The program that will be run when the user logs in. This is usually a
shell, such as Bash (the GNU Bourne-Again shell).

12976

@end table

12977

12978

Here are a few lines representative of @code{pwcat}'s output.

12979

12980

@example

12981

@c @group

12982

$ pwcat

12983

@print{} root:3Ov02d5VaUPB6:0:1:Operator:/:/bin/sh

12984

@print{} nobody:*:65534:65534::/:

12985

@print{} daemon:*:1:1::/:

12986

@print{} sys:*:2:2::/:/bin/csh

12987

@print{} bin:*:3:3::/bin:

12988

@print{} arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh

12989

@print{} miriam:yxaay:112:10:Miriam Robbins:/home/miriam:/bin/sh

12990

@dots{}

12991

@c @end group

12992

@end example

12993

12994

With that introduction, here is a group of functions for getting user

12995

information. There are several functions here, corresponding to the C

12996

functions of the same name.

12997

12998

@findex _pw_init

12999

@example

13000

@c file eg/lib/passwdawk.in

13001

@group

13002

# passwd.awk --- access password file information

13003

# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain

13004

# May 1993

13005

13006

BEGIN @{

13007

# tailor this to suit your system

13008

_pw_awklib = "/usr/local/libexec/awk/"

13009

@}

13010

@end group

13011

13012

function _pw_init(    oldfs, oldrs, olddol0, pwcat)
@{
    if (_pw_inited)
        return
    oldfs = FS
    oldrs = RS
    olddol0 = $0
    FS = ":"
    RS = "\n"
    pwcat = _pw_awklib "pwcat"
    while ((pwcat | getline) > 0) @{
        _pw_byname[$1] = $0
        _pw_byuid[$3] = $0
        _pw_bycount[++_pw_total] = $0
    @}
    close(pwcat)
    _pw_count = 0
    _pw_inited = 1
    FS = oldfs
    RS = oldrs
    $0 = olddol0
@}

13034

@c endfile

13035

@c @end group

13036

@end example

13037

13038

The @code{BEGIN} rule sets a private variable to the directory where

13039

@code{pwcat} is stored. Since it is used to help out an @code{awk} library

13040

routine, we have chosen to put it in @file{/usr/local/libexec/awk}.

13041

You might want it to be in a different directory on your system.

13042

13043

The function @code{_pw_init} keeps three copies of the user information

13044

in three associative arrays. The arrays are indexed by user name

13045

(@code{_pw_byname}), by user-id number (@code{_pw_byuid}), and by order of

13046

occurrence (@code{_pw_bycount}).

13047

13048

The variable @code{_pw_inited} is used for efficiency; @code{_pw_init} only

13049

needs to be called once.

13050

13051

Since this function uses @code{getline} to read information from

13052

@code{pwcat}, it first saves the values of @code{FS}, @code{RS}, and

13053

@code{$0}. Doing so is necessary, since these functions could be called

13054

from anywhere within a user's program, and the user may have his or her

13055

own values for @code{FS} and @code{RS}.

13056

@ignore

13057

Problem, what if FIELDWIDTHS is in use? Sigh.

13058

@end ignore

13059

13060

The main part of the function uses a loop to read database lines, split

13061

the line into fields, and then store the line into each array as necessary.

13062

When the loop is done, @code{@w{_pw_init}} cleans up by closing the pipeline,

13063

setting @code{@w{_pw_inited}} to one, and restoring @code{FS}, @code{RS}, and

13064

@code{$0}. The use of @code{@w{_pw_count}} will be explained below.

13065

13066

@findex getpwnam

13067

@example

13068

@group

13069

@c file eg/lib/passwdawk.in

13070

function getpwnam(name)

13071

@{

13072

_pw_init()

13073

if (name in _pw_byname)

13074

return _pw_byname[name]

13075

return ""

13076

@}

13077

@c endfile

13078

@end group

13079

@end example

13080

13081

The @code{getpwnam} function takes a user name as a string argument. If that

13082

user is in the database, it returns the appropriate line. Otherwise it

13083

returns the null string.
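
Since the value returned is a whole database line, a caller will usually
want to break it into fields. The following sketch shows one way this might
be done. It assumes that @file{passwd.awk} has been loaded with @samp{-f};
the login name @samp{arnold} and the array name @code{fields} are only
placeholders.

@example
# illustration only; "arnold" is a sample login name
BEGIN @{
    entry = getpwnam("arnold")
    if (entry == "")
        print "no such user" > "/dev/stderr"
    else @{
        split(entry, fields, ":")
        printf("uid %d, home %s, shell %s\n",
               fields[3], fields[6], fields[7])
    @}
@}
@end example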

13084

13085

@findex getpwuid

13086

@example

13087

@group

13088

@c file eg/lib/passwdawk.in

13089

function getpwuid(uid)

13090

@{

13091

_pw_init()

13092

if (uid in _pw_byuid)

13093

return _pw_byuid[uid]

13094

return ""

13095

@}

13096

@c endfile

13097

@end group

13098

@end example

13099

13100

Similarly,

13101

the @code{getpwuid} function takes a user-id number argument. If that

13102

user number is in the database, it returns the appropriate line. Otherwise it

13103

returns the null string.

13104

13105

@findex getpwent

13106

@example

13107

@c @group

13108

@c file eg/lib/passwdawk.in

13109

function getpwent()

13110

@{

13111

_pw_init()

13112

if (_pw_count < _pw_total)

13113

return _pw_bycount[++_pw_count]

13114

return ""

13115

@}

13116

@c endfile

13117

@c @end group

13118

@end example

13119

13120

The @code{getpwent} function simply steps through the database, one entry at

13121

a time. It uses @code{_pw_count} to track its current position in the

13122

@code{_pw_bycount} array.

13123

13124

@findex endpwent

13125

@example

13126

@c @group

13127

@c file eg/lib/passwdawk.in

13128

function endpwent()

13129

@{

13130

_pw_count = 0

13131

@}

13132

@c endfile

13133

@c @end group

13134

@end example

13135

13136

The @code{@w{endpwent}} function resets @code{@w{_pw_count}} to zero, so that

13137

subsequent calls to @code{getpwent} will start over again.
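
A typical scan of the whole database might look something like the
following sketch, which prints every login name and then rewinds. The
variable names @code{entry} and @code{fields} are chosen just for this
example.

@example
# illustration only: list all login names
BEGIN @{
    while ((entry = getpwent()) != "") @{
        split(entry, fields, ":")
        print fields[1]
    @}
    endpwent()    # rewind, so a later scan starts at the first entry
@}
@end example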

13138

13139

A conscious design decision in this suite is that each subroutine calls

13140

@code{@w{_pw_init}} to initialize the database arrays. The overhead of running

13141

a separate process to generate the user database, and the I/O to scan it,

13142

will only be incurred if the user's main program actually calls one of these

13143

functions. If this library file is loaded along with a user's program, but

13144

none of the routines are ever called, then there is no extra run-time overhead.

13145

(The alternative would be to move the body of @code{@w{_pw_init}} into a

13146

@code{BEGIN} rule, which would always run @code{pwcat}. This simplifies the

13147

code but runs an extra process that may never be needed.)

13148

13149

In turn, calling @code{_pw_init} is not too expensive, since the

13150

@code{_pw_inited} variable keeps the program from reading the data more than

13151

once. If you are worried about squeezing every last cycle out of your

13152

@code{awk} program, the check of @code{_pw_inited} could be moved out of

13153

@code{_pw_init} and duplicated in all the other functions. In practice,

13154

this is not necessary, since most @code{awk} programs are I/O bound, and it

13155

would clutter up the code.

13156

13157

The @code{id} program in @ref{Id Program, ,Printing Out User Information},

13158

uses these functions.

13159

13160

@node Group Functions, Library Names, Passwd Functions, Library Functions

13161

@section Reading the Group Database

13162

13163

@cindex @code{getgrent}, C version

13164

@cindex group information

13165

@cindex account information

13166

@cindex group file

13167

Much of the discussion presented in

13168

@ref{Passwd Functions, ,Reading the User Database},

13169

applies to the group database as well. Although there has traditionally

13170

been a well known file, @file{/etc/group}, in a well known format, the POSIX

13171

standard only provides a set of C library routines

13172

(@code{<grp.h>} and @code{getgrent})

13173

for accessing the information.

13174

Even though this file may exist, it likely does not have

13175

complete information. Therefore, as with the user database, it is necessary

13176

to have a small C program that generates the group database as its output.

13177

13178

@cindex @code{grcat} program

13179

Here is @code{grcat}, a C program that ``cats'' the group database.

13180

13181

@findex grcat.c

13182

@example

13183

@c @group

13184

@c file eg/lib/grcat.c

13185

/*

13186

* grcat.c

13187

*

13188

* Generate a printable version of the group database

13189

*

13190

* Arnold Robbins, arnold@@gnu.ai.mit.edu

13191

* May 1993

13192

* Public Domain

13193

*/

13194

13195

#include <stdio.h>
#include <stdlib.h>
#include <grp.h>

13197

13198

@group

13199

int

13200

main(argc, argv)

13201

int argc;

13202

char **argv;

13203

@{

13204

struct group *g;

13205

int i;

13206

@end group

13207

13208

while ((g = getgrent()) != NULL) @{

13209

printf("%s:%s:%d:", g->gr_name, g->gr_passwd,

13210

g->gr_gid);

13211

for (i = 0; g->gr_mem[i] != NULL; i++) @{

13212

printf("%s", g->gr_mem[i]);

13213

if (g->gr_mem[i+1] != NULL)

13214

putchar(',');

13215

@}

13216

putchar('\n');

13217

@}

13218

endgrent();

13219

exit(0);

13220

@}

13221

@c endfile

13222

@c @end group

13223

@end example

13224

13225

Each line in the group database represents one group. The fields are
separated with colons, and represent the following information.

13227

13228

@table @asis

13229

@item Group Name

13230

The name of the group.

13231

13232

@item Group Password

13233

The encrypted group password. In practice, this field is never used. It is

13234

usually empty, or set to @samp{*}.

13235

13236

@item Group ID Number

13237

The numeric group-id number. This number should be unique within the file.

13238

13239

@item Group Member List

13240

A comma-separated list of user names. These users are members of the group.

13241

Most Unix systems allow users to be members of several groups

13242

simultaneously. If your system does, then reading @file{/dev/user} will

13243

return those group-id numbers in @code{$5} through @code{$NF}.

13244

(Note that @file{/dev/user} is a @code{gawk} extension;

13245

@pxref{Special Files, ,Special File Names in @code{gawk}}.)

13246

@end table

13247

13248

@iftex

13249

@page

13250

@end iftex

13251

Here is what running @code{grcat} might produce:

13252

13253

@example

13254

@group

13255

$ grcat

13256

@print{} wheel:*:0:arnold

13257

@print{} nogroup:*:65534:

13258

@print{} daemon:*:1:

13259

@print{} kmem:*:2:

13260

@print{} staff:*:10:arnold,miriam,andy

13261

@print{} other:*:20:

13262

@dots{}

13263

@end group

13264

@end example

13265

13266

Here are the functions for obtaining information from the group database.

13267

There are several, modeled after the C library functions of the same names.

13268

13269

@findex _gr_init

13270

@example

13271

@group

13272

@c file eg/lib/groupawk.in

13273

# group.awk --- functions for dealing with the group file

13274

# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain

13275

# May 1993

13276

13277

BEGIN \

13278

@{

13279

# Change to suit your system

13280

_gr_awklib = "/usr/local/libexec/awk/"

13281

@}

13282

@c endfile

13283

@end group

13284

13285

@group

13286

@c file eg/lib/groupawk.in

13287

function _gr_init( oldfs, oldrs, olddol0, grcat, n, a, i)

13288

@{

13289

if (_gr_inited)

13290

return

13291

@end group

13292

13293

@group

13294

oldfs = FS

13295

oldrs = RS

13296

olddol0 = $0

13297

FS = ":"

13298

RS = "\n"

13299

@end group

13300

13301

@group

13302

grcat = _gr_awklib "grcat"

13303

while ((grcat | getline) > 0) @{

13304

if ($1 in _gr_byname)

13305

_gr_byname[$1] = _gr_byname[$1] "," $4

13306

else

13307

_gr_byname[$1] = $0

13308

if ($3 in _gr_bygid)

13309

_gr_bygid[$3] = _gr_bygid[$3] "," $4

13310

else

13311

_gr_bygid[$3] = $0

13312

13313

n = split($4, a, "[ \t]*,[ \t]*")

13314

@end group

13315

@group

13316

for (i = 1; i <= n; i++)

13317

if (a[i] in _gr_groupsbyuser)

13318

_gr_groupsbyuser[a[i]] = \

13319

_gr_groupsbyuser[a[i]] " " $1

13320

else

13321

_gr_groupsbyuser[a[i]] = $1

13322

@end group

13323

13324

@group

13325

_gr_bycount[++_gr_count] = $0

13326

@}

13327

@end group

13328

@group

13329

close(grcat)

13330

_gr_count = 0

13331

_gr_inited++

13332

FS = oldfs

13333

RS = oldrs

13334

$0 = olddol0

13335

@}

13336

@c endfile

13337

@end group

13338

@end example

13339

13340

The @code{BEGIN} rule sets a private variable to the directory where

13341

@code{grcat} is stored. Since it is used to help out an @code{awk} library

13342

routine, we have chosen to put it in @file{/usr/local/libexec/awk}. You might

13343

want it to be in a different directory on your system.

13344

13345

These routines follow the same general outline as the user database routines

13346

(@pxref{Passwd Functions, ,Reading the User Database}).

13347

The @code{@w{_gr_inited}} variable is used to

13348

ensure that the database is scanned no more than once.

13349

The @code{@w{_gr_init}} function first saves @code{FS}, @code{RS}, and

13350

@code{$0}, and then sets @code{FS} and @code{RS} to the correct values for

13351

scanning the group information.

13352

13353

The group information is stored in several associative arrays.
The arrays are indexed by group name (@code{@w{_gr_byname}}), by group-id number
(@code{@w{_gr_bygid}}), and by position in the database (@code{@w{_gr_bycount}}).
There is an additional array indexed by user name (@code{@w{_gr_groupsbyuser}}),
each element of which is a space-separated list of the groups that user
belongs to.

13358

13359

Unlike the user database, it is possible to have multiple records in the

13360

database for the same group. This is common when a group has a large number

13361

of members. Such a pair of entries might look like:

13362

13363

@example

13364

tvpeople:*:101:johny,jay,arsenio

13365

tvpeople:*:101:david,conan,tom,joan

13366

@end example

13367

13368

For this reason, @code{_gr_init} looks to see if a group name or
group-id number has already been seen. If it has, then the user names are
simply concatenated onto the previous list of users. (There is actually a
subtle problem with the code presented above. Suppose that the first record
for a group listed no members. The code then appends the later names with
a leading comma. It also doesn't check that there is a @code{$4}.)
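
One possible way around both problems is sketched below. The function
@code{_gr_merge} is hypothetical and is not part of @file{group.awk}. It
assumes that every stored record has the four colon-separated fields that
@code{grcat} prints, so that a record with an empty member list ends in a
colon.

@example
# hypothetical helper, not part of group.awk: merge the member
# list of a repeated record into the entry stored in arr[key]
function _gr_merge(arr, key, members)
@{
    if (members == "")          # no $4, or an empty list
        return
    if (arr[key] ~ /:$/)        # stored member list is empty
        arr[key] = arr[key] members
    else
        arr[key] = arr[key] "," members
@}
@end example

With such a helper, the two concatenations in @code{_gr_init} could become
calls like @code{_gr_merge(_gr_byname, $1, $4)}.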

13374

13375

Finally, @code{_gr_init} closes the pipeline to @code{grcat}, restores

13376

@code{FS}, @code{RS}, and @code{$0}, initializes @code{_gr_count} to zero

13377

(it is used later), and makes @code{_gr_inited} non-zero.

13378

13379

@findex getgrnam

13380

@example

13381

@c @group

13382

@c file eg/lib/groupawk.in

13383

function getgrnam(group)

13384

@{

13385

_gr_init()

13386

if (group in _gr_byname)

13387

return _gr_byname[group]

13388

return ""

13389

@}

13390

@c endfile

13391

@c @end group

13392

@end example

13393

13394

The @code{getgrnam} function takes a group name as its argument, and if that

13395

group exists, it is returned. Otherwise, @code{getgrnam} returns the null

13396

string.

13397

13398

@findex getgrgid

13399

@example

13400

@c @group

13401

@c file eg/lib/groupawk.in

13402

function getgrgid(gid)

13403

@{

13404

_gr_init()

13405

if (gid in _gr_bygid)

13406

return _gr_bygid[gid]

13407

return ""

13408

@}

13409

@c endfile

13410

@c @end group

13411

@end example

13412

13413

The @code{getgrgid} function is similar; it takes a numeric group-id, and
looks up the information associated with that group-id.

13415

13416

@findex getgruser

13417

@example

13418

@group

13419

@c file eg/lib/groupawk.in

13420

function getgruser(user)

13421

@{

13422

_gr_init()

13423

if (user in _gr_groupsbyuser)

13424

return _gr_groupsbyuser[user]

13425

return ""

13426

@}

13427

@c endfile

13428

@end group

13429

@end example

13430

13431

The @code{getgruser} function does not have a C counterpart. It takes a

13432

user name, and returns the list of groups that have the user as a member.
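
Because the list is space-separated, it can be taken apart with
@code{split}. Here is a small sketch, assuming that @file{group.awk} is
loaded with @samp{-f}; the user name @samp{arnold} and the array name
@code{groups} are only placeholders.

@example
# illustration only: print each group that "arnold" belongs to
BEGIN @{
    n = split(getgruser("arnold"), groups, " ")
    for (i = 1; i <= n; i++)
        print groups[i]
@}
@end example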

13433

13434

@findex getgrent

13435

@example

13436

@c @group

13437

@c file eg/lib/groupawk.in

13438

function getgrent()
@{
    _gr_init()
    if (++_gr_count in _gr_bycount)
        return _gr_bycount[_gr_count]
    return ""
@}

13445

@c endfile

13446

@c @end group

13447

@end example

13448

13449

The @code{getgrent} function steps through the database one entry at a time.

13450

It uses @code{_gr_count} to track its position in the list.

13451

13452

@findex endgrent

13453

@example

13454

@group

13455

@c file eg/lib/groupawk.in

13456

function endgrent()

13457

@{

13458

_gr_count = 0

13459

@}

13460

@c endfile

13461

@end group

13462

@end example

13463

13464

@code{endgrent} resets @code{_gr_count} to zero so that @code{getgrent} can

13465

start over again.

13466

13467

As with the user database routines, each function calls @code{_gr_init} to

13468

initialize the arrays. Doing so only incurs the extra overhead of running

13469

@code{grcat} if these functions are used (as opposed to moving the body of

13470

@code{_gr_init} into a @code{BEGIN} rule).

13471

13472

Most of the work is in scanning the database and building the various

13473

associative arrays. The functions that the user calls are themselves very

13474

simple, relying on @code{awk}'s associative arrays to do work.

13475

13476

The @code{id} program in @ref{Id Program, ,Printing Out User Information},

13477

uses these functions.

13478

13479

@node Library Names, , Group Functions, Library Functions

13480

@section Naming Library Function Global Variables

13481

13482

@cindex namespace issues in @code{awk}

13483

@cindex documenting @code{awk} programs

13484

@cindex programs, documenting

13485

Due to the way the @code{awk} language evolved, variables are either

13486

@dfn{global} (usable by the entire program), or @dfn{local} (usable just by

13487

a specific function). There is no intermediate state analogous to

13488

@code{static} variables in C.

13489

13490

Library functions often need to have global variables that they can use to

13491

preserve state information between calls to the function. For example,

13492

@code{getopt}'s variable @code{_opti}

13493

(@pxref{Getopt Function, ,Processing Command Line Options}),

13494

and the @code{_tm_months} array used by @code{mktime}

13495

(@pxref{Mktime Function, ,Turning Dates Into Timestamps}).

13496

Such variables are called @dfn{private}, since the only functions that need to

13497

use them are the ones in the library.

13498

13499

When writing a library function, you should try to choose names for your

13500

private variables so that they will not conflict with any variables used by

13501

either another library function or a user's main program. For example, a

13502

name like @samp{i} or @samp{j} is not a good choice, since user programs

13503

often use variable names like these for their own purposes.

13504

13505

The example programs shown in this chapter all start the names of their

13506

private variables with an underscore (@samp{_}). Users generally don't use

13507

leading underscores in their variable names, so this convention immediately

13508

decreases the chances that the variable name will be accidentally shared

13509

with the user's program.

13510

13511

In addition, several of the library functions use a prefix that helps

13512

indicate what function or set of functions uses the variables. For example,

13513

@code{_tm_months} in @code{mktime}

13514

(@pxref{Mktime Function, ,Turning Dates Into Timestamps}), and

13515

@code{_pw_byname} in the user data base routines

13516

(@pxref{Passwd Functions, ,Reading the User Database}).

13517

This convention is recommended, since it even further decreases the chance

13518

of inadvertent conflict among variable names.

13519

Note that this convention can be used equally well both for variable names

13520

and for private function names too.

13521

13522

While I could have re-written all the library routines to use this

13523

convention, I did not do so, in order to show how my own @code{awk}

13524

programming style has evolved, and to provide some basis for this

13525

discussion.

13526

13527

As a final note on variable naming, if a function makes global variables

13528

available for use by a main program, it is a good convention to start that

13529

variable's name with a capital letter.

13530

For example, @code{getopt}'s @code{Opterr} and @code{Optind} variables

13531

(@pxref{Getopt Function, ,Processing Command Line Options}).

13532

The leading capital letter indicates that it is global, while the fact that

13533

the variable name is not all capital letters indicates that the variable is

13534

not one of @code{awk}'s built-in variables, like @code{FS}.

13535

13536

It is also important that @emph{all} variables in library functions

13537

that do not need to save state are in fact declared local. If this is

13538

not done, the variable could accidentally be used in the user's program,

13539

leading to bugs that are very difficult to track down.

13540

13541

@example

13542

function lib_func(x, y, l1, l2)

13543

@{

13544

@dots{}

13545

@var{use variable} some_var # some_var could be local

13546

@dots{} # but is not by oversight

13547

@}

13548

@end example
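
For comparison, here is a sketch of a function whose working variables are
all declared local. In @code{awk}, this is done by listing them as extra
parameters, conventionally set off by extra white space. The function name
@code{sum_fields} is made up for this example.

@example
# illustration only: i and total are local to the function
function sum_fields(n,    i, total)
@{
    for (i = 1; i <= n; i++)
        total += $i
    return total + 0    # force a number even if n is zero
@}
@end example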

13549

13550

@cindex Tcl

13551

A different convention, common in the Tcl community, is to use a single

13552

associative array to hold the values needed by the library function(s), or

13553

``package.'' This significantly decreases the number of actual global names

13554

in use. For example, the functions described in

13555

@ref{Passwd Functions, , Reading the User Database},

13556

might have used @code{@w{PW_data["inited"]}}, @code{@w{PW_data["total"]}},
@code{@w{PW_data["count"]}}, and @code{@w{PW_data["awklib"]}}, instead of
@code{@w{_pw_inited}}, @code{@w{_pw_total}}, @code{@w{_pw_count}},
and @code{@w{_pw_awklib}}.
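
As a sketch of what this might look like, here are two of the user database
functions rewritten in that style. This is only an illustration; the library
as presented does not use @code{PW_data}, and the sketch assumes that
@code{_pw_init} has been changed to store the number of entries in
@code{PW_data["total"]}.

@example
# sketch only: passwd.awk does not actually use PW_data
function getpwent()
@{
    _pw_init()
    if (PW_data["count"] < PW_data["total"])
        return _pw_bycount[++PW_data["count"]]
    return ""
@}

function endpwent()
@{
    PW_data["count"] = 0
@}
@end example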

13560

13561

The conventions presented in this section are exactly that: conventions. You
are not required to write your programs this way; we merely recommend that
you do so.

13564

13565

@node Sample Programs, Language History, Library Functions, Top

13566

@chapter Practical @code{awk} Programs

13567

13568

This chapter presents a potpourri of @code{awk} programs for your reading

13569

enjoyment.

13570

@iftex

13571

There are two sections. The first presents @code{awk}

13572

versions of several common POSIX utilities.

13573

The second is a grab-bag of interesting programs.

13574

@end iftex

13575

13576

Many of these programs use the library functions presented in

13577

@ref{Library Functions, ,A Library of @code{awk} Functions}.

13578

13579

@menu

13580

* Clones:: Clones of common utilities.

13581

* Miscellaneous Programs:: Some interesting @code{awk} programs.

13582

@end menu

13583

13584

@node Clones, Miscellaneous Programs, Sample Programs, Sample Programs

13585

@section Re-inventing Wheels for Fun and Profit

13586

13587

This section presents a number of POSIX utilities that are implemented in

13588

@code{awk}. Re-inventing these programs in @code{awk} is often enjoyable,

13589

since the algorithms can be very clearly expressed, and usually the code is

13590

very concise and simple. This is true because @code{awk} does so much for you.

13591

13592

It should be noted that these programs are not necessarily intended to

13593

replace the installed versions on your system. Instead, their

13594

purpose is to illustrate @code{awk} language programming for ``real world''

13595

tasks.

13596

13597

The programs are presented in alphabetical order.

13598

13599

@menu

13600

* Cut Program:: The @code{cut} utility.

13601

* Egrep Program:: The @code{egrep} utility.

13602

* Id Program:: The @code{id} utility.

13603

* Split Program:: The @code{split} utility.

13604

* Tee Program:: The @code{tee} utility.

13605

* Uniq Program:: The @code{uniq} utility.

13606

* Wc Program:: The @code{wc} utility.

13607

@end menu

13608

13609

@node Cut Program, Egrep Program, Clones, Clones

13610

@subsection Cutting Out Fields and Columns

13611

13612

@cindex @code{cut} utility

13613

The @code{cut} utility selects, or ``cuts,'' either characters or fields

13614

from its standard

13615

input and sends them to its standard output. @code{cut} can cut out either

13616

a list of characters, or a list of fields. By default, fields are separated

13617

by tabs, but you may supply a command line option to change the field

13618

@dfn{delimiter}, i.e.@: the field separator character. @code{cut}'s definition

13619

of fields is less general than @code{awk}'s.

13620

13621

A common use of @code{cut} might be to pull out just the login name of

13622

logged-on users from the output of @code{who}. For example, the following

13623

pipeline generates a sorted, unique list of the logged-on users:

13624

13625

@example

13626

who | cut -c1-8 | sort | uniq

13627

@end example
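
For comparison, a rough @code{awk} equivalent is sketched below. It assumes
that the login name is the first whitespace-separated field of each line of
@code{who} output, which is common but not guaranteed on every system.
Unlike @samp{cut -c1-8}, it does not truncate long names to eight characters:

@example
who | awk '@{ print $1 @}' | sort | uniq
@end example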

13628

13629

The options for @code{cut} are:

13630

13631

@table @code

13632

@item -c @var{list}

13633

Use @var{list} as the list of characters to cut out. Items within the list

13634

may be separated by commas, and ranges of characters can be separated with

13635

dashes. The list @samp{1-8,15,22-35} specifies characters one through

13636

eight, 15, and 22 through 35.

13637

13638

@item -f @var{list}

13639

Use @var{list} as the list of fields to cut out.

13640

13641

@item -d @var{delim}

13642

Use @var{delim} as the field separator character instead of the tab

13643

character.

13644

13645

@item -s

13646

Suppress printing of lines that do not contain the field delimiter.

13647

@end table

13648

13649

The @code{awk} implementation of @code{cut} uses the @code{getopt} library

13650

function (@pxref{Getopt Function, ,Processing Command Line Options}),

13651

and the @code{join} library function

13652

(@pxref{Join Function, ,Merging an Array Into a String}).

13653

13654

The program begins with a comment describing the options and a @code{usage}

13655

function which prints out a usage message and exits. @code{usage} is called

13656

if invalid arguments are supplied.

13657

13658

@findex cut.awk

13659

@example

13660

@c @group

13661

@c file eg/prog/cut.awk

13662

# cut.awk --- implement cut in awk

13663

# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain

13664

# May 1993

13665

13666

# Options:

13667

# -f list Cut fields

13668

# -d c Field delimiter character

13669

# -c list Cut characters

13670

#

13671

# -s Suppress lines without the delimiter character

13672

13673

function usage( e1, e2)

13674

@{

13675

e1 = "usage: cut [-f list] [-d c] [-s] [files...]"

13676

e2 = "usage: cut [-c list] [files...]"

13677

print e1 > "/dev/stderr"

13678

print e2 > "/dev/stderr"

13679

exit 1

13680

@}

13681

@c endfile

13682

@c @end group

13683

@end example

13684

13685

@noindent

13686

The variables @code{e1} and @code{e2} are used so that the function

13687

fits nicely on the

13688

@iftex

13689

page.

13690

@end iftex

13691

@ifinfo

13692

screen.

13693

@end ifinfo

13694

13695

Next comes a @code{BEGIN} rule that parses the command line options.

13696

It sets @code{FS} to a single tab character, since that is @code{cut}'s

13697

default field separator. The output field separator is also set to be the

13698

same as the input field separator. Then @code{getopt} is used to step

13699

through the command line options. One or the other of the variables

13700

@code{by_fields} or @code{by_chars} is set to true, to indicate that

13701

processing should be done by fields or by characters respectively.

13702

When cutting by characters, the output field separator is set to the null

13703

string.

13704

13705

@example

13706

@c @group

13707

@c file eg/prog/cut.awk

13708

BEGIN \

13709

@{

13710

FS = "\t" # default

13711

OFS = FS

13712

while ((c = getopt(ARGC, ARGV, "sf:c:d:")) != -1) @{

13713

if (c == "f") @{

13714

by_fields = 1

13715

fieldlist = Optarg

13716

@} else if (c == "c") @{

13717

by_chars = 1

13718

fieldlist = Optarg

13719

OFS = ""

13720

@} else if (c == "d") @{

13721

if (length(Optarg) > 1) @{

13722

printf("Using first character of %s" \

13723

" for delimiter\n", Optarg) > "/dev/stderr"

13724

Optarg = substr(Optarg, 1, 1)

13725

@}

13726

FS = Optarg

13727

OFS = FS

13728

if (FS == " ") # defeat awk semantics

13729

FS = "[ ]"

13730

@} else if (c == "s")

13731

suppress++

13732

else

13733

usage()

13734

@}

13735

13736

for (i = 1; i < Optind; i++)

13737

ARGV[i] = ""

13738

@c endfile

13739

@c @end group

13740

@end example

13741

13742

Special care is taken when the field delimiter is a space. Using

13743

@code{@w{" "}} (a single space) for the value of @code{FS} is

13744

incorrect---@code{awk} would

13745

separate fields with runs of spaces and/or tabs, and we want them to be

13746

separated with individual spaces. Also, note that after @code{getopt} is

13747

through, we have to clear out all the elements of @code{ARGV} from one to

13748

@code{Optind}, so that @code{awk} will not try to process the command line

13749

options as file names.
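
The difference between the two settings of @code{FS} can be seen in a small
experiment. This is only an illustrative sketch; the input line (with two
spaces between @samp{a} and @samp{b}) is made up for the example:

@example
$ echo "a  b" | gawk 'BEGIN @{ FS = " " @} @{ print NF @}'
@print{} 2
$ echo "a  b" | gawk 'BEGIN @{ FS = "[ ]" @} @{ print NF @}'
@print{} 3
@end example

@noindent
With @code{@w{"[ ]"}}, each space is a separator of its own, so the empty
string between the two spaces counts as a field. This matches the way
@code{cut} treats its delimiter.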

13750

13751

After dealing with the command line options, the program verifies that the

13752

options make sense. Only one or the other of @samp{-c} and @samp{-f} should

13753

be used, and both require a field list. Then either @code{set_fieldlist} or

13754

@code{set_charlist} is called to pull apart the list of fields or

13755

characters.

13756

13757

@example

13758

@c @group

13759

@c file eg/prog/cut.awk

13760

if (by_fields && by_chars)

13761

usage()

13762

13763

if (by_fields == 0 && by_chars == 0)

13764

by_fields = 1 # default

13765

13766

if (fieldlist == "") @{

13767

print "cut: needs list for -c or -f" > "/dev/stderr"

13768

exit 1

13769

@}

13770

13771

@group

13772

if (by_fields)

13773

set_fieldlist()

13774

else

13775

set_charlist()

13776

@}

13777

@c endfile

13778

@end group

13779

@end example

13780

13781

Here is @code{set_fieldlist}. It first splits the field list apart

13782

at the commas, into an array. Then, for each element of the array, it

13783

looks to see if it is actually a range, and if so splits it apart. The range

13784

is verified to make sure the first number is smaller than the second.

13785

Each number in the list is added to the @code{flist} array, which simply

13786

lists the fields that will be printed.

13787

Normal field splitting is used: the program lets @code{awk}
handle the job of splitting each record into fields.

13790

13791

@example

13792

@c @group

13793

@c file eg/prog/cut.awk

13794

function set_fieldlist( n, m, i, j, k, f, g)

13795

@{

13796

n = split(fieldlist, f, ",")

13797

j = 1 # index in flist

13798

for (i = 1; i <= n; i++) @{

13799

if (index(f[i], "-") != 0) @{ # a range

13800

m = split(f[i], g, "-")

13801

if (m != 2 || g[1] >= g[2]) @{

13802

printf("bad field list: %s\n",

13803

f[i]) > "/dev/stderr"

13804

exit 1

13805

@}

13806

for (k = g[1]; k <= g[2]; k++)

13807

flist[j++] = k

13808

@} else

13809

flist[j++] = f[i]

13810

@}

13811

nfields = j - 1

13812

@}

13813

@c endfile

13814

@c @end group

13815

@end example

13816

13817

The @code{set_charlist} function is more complicated than @code{set_fieldlist}.

13818

The idea here is to use @code{gawk}'s @code{FIELDWIDTHS} variable

13819

(@pxref{Constant Size, ,Reading Fixed-width Data}),

13820

which describes constant width input. When using a character list, that is

13821

exactly what we have.

13822

13823

Setting up @code{FIELDWIDTHS} is more complicated than simply listing the

13824

fields that need to be printed. We have to keep track of the fields to be

13825

printed, and also the intervening characters that have to be skipped.

13826

For example, suppose you wanted characters one through eight, 15, and

13827

22 through 35. You would use @samp{-c 1-8,15,22-35}. The necessary value

13828

for @code{FIELDWIDTHS} would be @code{@w{"8 6 1 6 14"}}. This gives us five

13829

fields, and what should be printed are @code{$1}, @code{$3}, and @code{$5}.

13830

The intermediate fields are ``filler,'' stuff in between the desired data.

13831

13832

@code{flist} lists the fields to be printed, and @code{t} tracks the

13833

complete field list, including filler fields.

13834

13835

@example

13836

@c @group

13837

@c file eg/prog/cut.awk

13838

function set_charlist(    field, i, j, n, m, f, g, t,
                          filler, last, len)

13840

@{

13841

field = 1 # count total fields

13842

n = split(fieldlist, f, ",")

13843

j = 1 # index in flist

13844

for (i = 1; i <= n; i++) @{

13845

if (index(f[i], "-") != 0) @{ # range

13846

m = split(f[i], g, "-")

13847

if (m != 2 || g[1] >= g[2]) @{

13848

printf("bad character list: %s\n",

13849

f[i]) > "/dev/stderr"

13850

exit 1

13851

@}

13852

len = g[2] - g[1] + 1

13853

if (g[1] > 1) # compute length of filler

13854

filler = g[1] - last - 1

13855

else

13856

filler = 0

13857

if (filler)

13858

t[field++] = filler

13859

t[field++] = len # length of field

13860

last = g[2]

13861

flist[j++] = field - 1

13862

@} else @{

13863

if (f[i] > 1)

13864

filler = f[i] - last - 1

13865

else

13866

filler = 0

13867

if (filler)

13868

t[field++] = filler

13869

t[field++] = 1

13870

last = f[i]

13871

flist[j++] = field - 1

13872

@}

13873

@}

13874

@group

13875

FIELDWIDTHS = join(t, 1, field - 1)

13876

nfields = j - 1

13877

@}

13878

@end group

13879

@c endfile

13880

@end example
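
The effect of the @code{FIELDWIDTHS} value from the example above can be
tried in isolation. The following is a minimal sketch, not part of
@file{cut.awk}; the 36-character input line is invented for the
illustration, and the fields to print are written out by hand instead of
being taken from @code{flist}:

@example
$ echo "abcdefghijklmnopqrstuvwxyz0123456789" |
> gawk 'BEGIN @{ FIELDWIDTHS = "8 6 1 6 14" @}
>       @{ print $1 $3 $5 @}'
@print{} abcdefghovwxyz012345678
@end example

@noindent
Fields two and four are the filler, and the single character left over
after position 35 is not part of any field.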

13881

13882

Here is the rule that actually processes the data. If the @samp{-s} option

13883

was given, then @code{suppress} will be true. The first @code{if} statement

13884

makes sure that the input record does have the field separator. If

13885

@code{cut} is processing fields, @code{suppress} is true, and the field

13886

separator character is not in the record, then the record is skipped.

13887

13888

If the record is valid, then at this point, @code{gawk} has split the data

13889

into fields, either using the character in @code{FS} or using fixed-length

13890

fields and @code{FIELDWIDTHS}. The loop goes through the list of fields

13891

that should be printed. If the corresponding field has data in it, it is

13892

printed. If the next field also has data, then the separator character is

13893

written out in between the fields.

13894

13895

@c 2e: Could use `index($0, FS) != 0' instead of `$0 !~ FS', below

13896

13897

@example

13898

@c @group

13899

@c file eg/prog/cut.awk

13900

@{

13901

if (by_fields && suppress && $0 !~ FS)
    next

for (i = 1; i <= nfields; i++) @{

13905

if ($flist[i] != "") @{

13906

printf "%s", $flist[i]

13907

if (i < nfields && $flist[i+1] != "")

13908

printf "%s", OFS

13909

@}

13910

@}

13911

print ""

13912

@}

13913

@c endfile

13914

@c @end group

13915

@end example
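
As a usage sketch, the program might be run as shown here. The library
file names @file{getopt.awk} and @file{join.awk} are assumptions about how
the library functions were saved, and the input line is made up. The
@samp{--} keeps @code{gawk} from treating the options meant for
@code{cut} as its own:

@example
$ echo "a:b:c" |
> gawk -f getopt.awk -f join.awk -f cut.awk -- -d : -f 1,3
@print{} a:c
@end example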

13916

13917

This version of @code{cut} relies on @code{gawk}'s @code{FIELDWIDTHS}

13918

variable to do the character-based cutting. While it would be possible in

13919

other @code{awk} implementations to use @code{substr}

13920

(@pxref{String Functions, ,Built-in Functions for String Manipulation}),

13921

it would also be extremely painful to do so.

13922

The @code{FIELDWIDTHS} variable supplies an elegant solution to the problem

13923

of picking the input line apart by characters.

13924

13925

@node Egrep Program, Id Program, Cut Program, Clones

13926

@subsection Searching for Regular Expressions in Files

13927

13928

@cindex @code{egrep} utility

13929

The @code{egrep} utility searches files for patterns. It uses regular

13930

expressions that are almost identical to those available in @code{awk}

13931

(@pxref{Regexp Constants, ,Regular Expression Constants}). It is used this way:

13932

13933

@example

13934

egrep @r{[} @var{options} @r{]} '@var{pattern}' @var{files} @dots{}

13935

@end example

13936

13937

The @var{pattern} is a regexp.

13938

In typical usage, the regexp is quoted to prevent the shell from expanding

13939

any of the special characters as file name wildcards.

13940

Normally, @code{egrep} prints the

13941

lines that matched. If multiple file names are provided on the command

13942

line, each output line is preceded by the name of the file and a colon.

13943

13944

The options are:

13945

13946

@table @code

13947

@item -c

13948

Print out a count of the lines that matched the pattern, instead of the

13949

lines themselves.

13950

13951

@item -s

13952

Be silent. No output is produced, and the exit value indicates whether

13953

or not the pattern was matched.

13954

13955

@item -v

13956

Invert the sense of the test. @code{egrep} prints the lines that do

13957

@emph{not} match the pattern, and exits successfully if the pattern was not

13958

matched.

13959

13960

@item -i

13961

Ignore case distinctions in both the pattern and the input data.

13962

13963

@item -l

13964

Only print the names of the files that matched, not the lines that matched.

13965

13966

@item -e @var{pattern}

13967

Use @var{pattern} as the regexp to match. The purpose of the @samp{-e}

13968

option is to allow patterns that start with a @samp{-}.

13969

@end table

13970

13971

This version uses the @code{getopt} library function

13972

(@pxref{Getopt Function, ,Processing Command Line Options}),

13973

and the file transition library program

13974

(@pxref{Filetrans Function, ,Noting Data File Boundaries}).

13975

13976

The program begins with a descriptive comment, and then a @code{BEGIN} rule

13977

that processes the command line arguments with @code{getopt}. The @samp{-i}

13978

(ignore case) option is particularly easy with @code{gawk}; we just use the

13979

@code{IGNORECASE} built-in variable

13980

(@pxref{Built-in Variables}).

13981

13982

@findex egrep.awk

13983

@example

13984

@c @group

13985

@c file eg/prog/egrep.awk

13986

# egrep.awk --- simulate egrep in awk

13987

# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain

13988

# May 1993

13989

13990

# Options:

13991

# -c count of lines

13992

# -s silent - use exit value

13993

# -v invert test, success if no match

13994

# -i ignore case

13995

# -l print filenames only

13996

# -e argument is pattern

13997

13998

BEGIN @{

13999

while ((c = getopt(ARGC, ARGV, "ce:svil")) != -1) @{

14000

if (c == "c")

14001

count_only++

14002

else if (c == "s")

14003

no_print++

14004

else if (c == "v")

14005

invert++

14006

else if (c == "i")

14007

IGNORECASE = 1

14008

else if (c == "l")

14009

filenames_only++

14010

else if (c == "e")

14011

pattern = Optarg

14012

else

14013

usage()

14014

@}

14015

@c endfile

14016

@c @end group

14017

@end example

14018

14019

Next comes the code that handles the @code{egrep}-specific behavior. If no

14020

pattern was supplied with @samp{-e}, the first non-option on the command

14021

line is used. The @code{awk} command line arguments up to @code{ARGV[Optind]}

14022

are cleared, so that @code{awk} won't try to process them as files. If no

14023

files were specified, the standard input is used, and if multiple files were

14024

specified, we make sure to note this so that the file names can precede the

14025

matched lines in the output.

14026

14027

The last two lines are commented out, since they are not needed in

14028

@code{gawk}. They should be uncommented if you have to use another version

14029

of @code{awk}.

14030

14031

@example

14032

@c @group

14033

@c file eg/prog/egrep.awk

14034

if (pattern == "")

14035

pattern = ARGV[Optind++]

14036

14037

for (i = 1; i < Optind; i++)

14038

ARGV[i] = ""

14039

if (Optind >= ARGC) @{

14040

ARGV[1] = "-"

14041

ARGC = 2

14042

@} else if (ARGC - Optind > 1)

14043

do_filenames++

14044

14045

# if (IGNORECASE)

14046

# pattern = tolower(pattern)

14047

@}

14048

@c endfile

14049

@c @end group

14050

@end example

14051

14052

The next set of lines should be uncommented if you are not using

14053

@code{gawk}. This rule translates all the characters in the input line

14054

into lower-case if the @samp{-i} option was specified. The rule is

14055

commented out since it is not necessary with @code{gawk}.

14056

@c bug: if a match happens, we output the translated line, not the original

14057

14058

@example

14059

@c @group

14060

@c file eg/prog/egrep.awk

14061

#@{

14062

# if (IGNORECASE)

14063

# $0 = tolower($0)

14064

#@}

14065

@c endfile

14066

@c @end group

14067

@end example
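
For other versions of @code{awk}, one possible alternative is to compare a
lower-case copy of the record and leave @code{$0} itself untouched, so that
the line is printed as it was read. The following is only a sketch of that
idea, with a made-up input line and a literal regexp that is already in
lower-case; it is not the code used in @file{egrep.awk}:

@example
$ echo "Hello World" |
> awk '@{ if (tolower($0) ~ /hello/) print @}'
@print{} Hello World
@end example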

14068

14069

The @code{beginfile} function is called by the rule in @file{ftrans.awk}

14070

when each new file is processed. In this case, it is very simple; all it

14071

does is initialize a variable @code{fcount} to zero. @code{fcount} tracks

14072

how many lines in the current file matched the pattern.

14073

14074

@example

14075

@c @group

14076

@c file eg/prog/egrep.awk

14077

function beginfile(junk)

14078

@{

14079

fcount = 0

14080

@}

14081

@c endfile

14082

@c @end group

14083

@end example

14084

14085

The @code{endfile} function is called after each file has been processed.

14086

It is used only when the user wants a count of the number of lines that

14087

matched. @code{no_print} will be true only if the exit status is desired.

14088

@code{count_only} will be true if line counts are desired. @code{egrep}

14089

will therefore only print line counts if printing and counting are enabled.

14090

The output format must be adjusted depending upon the number of files to be

14091

processed. Finally, @code{fcount} is added to @code{total}, so that we

14092

know how many lines altogether matched the pattern.

14093

14094

@example

14095

@c @group

14096

@c file eg/prog/egrep.awk

14097

function endfile(file)

14098

@{

14099

if (! no_print && count_only)

14100

if (do_filenames)

14101

print file ":" fcount

14102

else

14103

print fcount

14104

14105

total += fcount

14106

@}

14107

@c endfile

14108

@c @end group

14109

@end example

14110

14111

This rule does most of the work of matching lines. The variable

14112

@code{matches} will be true if the line matched the pattern. If the user

14113

wants lines that did not match, the sense of the @code{matches} is inverted

14114

using the @samp{!} operator. @code{fcount} is incremented with the value of

14115

@code{matches}, which will be either one or zero, depending upon a

14116

successful or unsuccessful match. If the line did not match, the

14117

@code{next} statement just moves on to the next record.

14118

14119

There are several optimizations for performance in the following few lines

14120

of code. If the user only wants exit status (@code{no_print} is true), and

14121

we don't have to count lines, then it is enough to know that one line in

14122

this file matched, and we can skip on to the next file with @code{nextfile}.

14123

Along similar lines, if we are only printing file names, and we

14124

don't need to count lines, we can print the file name, and then skip to the

14125

next file with @code{nextfile}.

14126

14127

Finally, each line is printed, with a leading filename and colon if

14128

necessary.

14129

14130

@ignore

14131

2e: note, probably better to recode the last few lines as

14132

if (! count_only) @{

14133

if (no_print)

14134

nextfile

14135

14136

if (filenames_only) @{

14137

print FILENAME

14138

nextfile

14139

@}

14140

14141

if (do_filenames)

14142

print FILENAME ":" $0

14143

else

14144

print

14145

@}

14146

@end ignore

14147

14148

@example

14149

@c @group

14150

@c file eg/prog/egrep.awk

14151

@{

14152

matches = ($0 ~ pattern)

14153

if (invert)

14154

matches = ! matches

14155

14156

fcount += matches # 1 or 0

14157

14158

if (! matches)
    next

if (no_print && ! count_only)

14162

nextfile

14163

14164

if (filenames_only && ! count_only) @{

14165

print FILENAME

14166

nextfile

14167

@}

14168

14169

if (do_filenames && ! count_only)

14170

print FILENAME ":" $0

14171

else if (! count_only)

14172

print

14173

@}

14174

@c endfile

14175

@c @end group

14176

@end example
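
The counting technique relies on the fact that a comparison expression
yields one or zero. It can be seen on its own in this small sketch, where
the three input lines and the regexp @code{/a/} are invented for the
illustration:

@example
$ printf "apple\nberry\nbanana\n" |
> awk '@{ matches = ($0 ~ /a/); fcount += matches @}
>      END @{ print fcount @}'
@print{} 2
@end example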

14177

14178

@c @strong{Exercise}: rearrange the code inside @samp{if (! count_only)}.

14179

14180

The @code{END} rule takes care of producing the correct exit status. If

14181

there were no matches, the exit status is one, otherwise it is zero.

14182

14183

@example

14184

@c @group

14185

@c file eg/prog/egrep.awk

14186

END \

14187

@{

14188

if (total == 0)

14189

exit 1

14190

exit 0

14191

@}

14192

@c endfile

14193

@c @end group

14194

@end example

14195

14196

The @code{usage} function prints a usage message in case of invalid options

14197

and then exits.

14198

14199

@example

14200

@c @group

14201

@c file eg/prog/egrep.awk

14202

function usage( e)

14203

@{

14204

e = "Usage: egrep [-csvil] [-e pat] [files ...]"

14205

print e > "/dev/stderr"

14206

exit 1

14207

@}

14208

@c endfile

14209

@c @end group

14210

@end example

14211

14212

The variable @code{e} is used so that the function fits nicely

14213

on the printed page.

14214

14215

@node Id Program, Split Program, Egrep Program, Clones

14216

@subsection Printing Out User Information

14217

14218

@cindex @code{id} utility

14219

The @code{id} utility lists a user's real and effective user-id numbers,

14220

real and effective group-id numbers, and the user's group set, if any.

14221

@code{id} will only print the effective user-id and group-id if they are

14222

different from the real ones. If possible, @code{id} will also supply the

14223

corresponding user and group names. The output might look like this:

14224

14225

@example

14226

$ id

14227

@print{} uid=2076(arnold) gid=10(staff) groups=10(staff),4(tty)

14228

@end example

14229

14230

This information is exactly what is provided by @code{gawk}'s

14231

@file{/dev/user} special file (@pxref{Special Files, ,Special File Names in @code{gawk}}).

14232

However, the @code{id} utility provides a more palatable output than just a

14233

string of numbers.

14234

14235

Here is a simple version of @code{id} written in @code{awk}.

14236

It uses the user database library functions

14237

(@pxref{Passwd Functions, ,Reading the User Database}),

14238

and the group database library functions

14239

(@pxref{Group Functions, ,Reading the Group Database}).

14240

14241

The program is fairly straightforward. All the work is done in the

14242

@code{BEGIN} rule. The user and group id numbers are obtained from

14243

@file{/dev/user}. If there is no support for @file{/dev/user}, the program

14244

gives up.

14245

14246

The code is repetitive. The entry in the user database for the real user-id

14247

number is split into parts at the @samp{:}. The name is the first field.

14248

Similar code is used for the effective user-id number, and the group

14249

numbers.

14250

14251

@findex id.awk

14252

@example

14253

@c @group

14254

@c file eg/prog/id.awk

14255

# id.awk --- implement id in awk

14256

# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain

14257

# May 1993

14258

14259

# output is:

14260

# uid=12(foo) euid=34(bar) gid=3(baz) \

14261

# egid=5(blat) groups=9(nine),2(two),1(one)

14262

14263

BEGIN \

14264

@{

14265

if ((getline < "/dev/user") < 0) @{

14266

err = "id: no /dev/user support - cannot run"

14267

print err > "/dev/stderr"

14268

exit 1

14269

@}

14270

close("/dev/user")

14271

14272

uid = $1

14273

euid = $2

14274

gid = $3

14275

egid = $4

14276

14277

printf("uid=%d", uid)

14278

pw = getpwuid(uid)

14279

@group

14280

if (pw != "") @{

14281

split(pw, a, ":")

14282

printf("(%s)", a[1])

14283

@}

14284

@end group

14285

14286

if (euid != uid) @{

14287

printf(" euid=%d", euid)

14288

pw = getpwuid(euid)

14289

if (pw != "") @{

14290

split(pw, a, ":")

14291

printf("(%s)", a[1])

14292

@}

14293

@}

14294

14295

printf(" gid=%d", gid)

14296

pw = getgrgid(gid)

14297

if (pw != "") @{

14298

split(pw, a, ":")

14299

printf("(%s)", a[1])

14300

@}

14301

14302

if (egid != gid) @{

14303

printf(" egid=%d", egid)

14304

pw = getgrgid(egid)

14305

if (pw != "") @{

14306

split(pw, a, ":")

14307

printf("(%s)", a[1])

14308

@}

14309

@}

14310

14311

if (NF > 4) @{

14312

printf(" groups=");

14313

for (i = 5; i <= NF; i++) @{

14314

printf("%d", $i)

14315

pw = getgrgid($i)

14316

if (pw != "") @{

14317

split(pw, a, ":")

14318

printf("(%s)", a[1])

14319

@}

14320

if (i < NF)

14321

printf(",")

14322

@}

14323

@}

14324

print ""

14325

@}

14326

@c endfile

14327

@c @end group

14328

@end example
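
The repeated step of splitting a database entry at the colons can be tried
separately. This sketch feeds a hypothetical user database line directly to
@code{awk}, instead of obtaining it from @code{getpwuid}; the entry itself
is made up:

@example
$ echo "arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh" |
> awk '@{ split($0, a, ":"); printf("uid=%d(%s)\n", a[3], a[1]) @}'
@print{} uid=2076(arnold)
@end example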

14329

14330

@c exercise!!!

14331

@ignore

14332

The POSIX version of @code{id} takes arguments that control which

14333

information is printed. Modify this version to accept the same

14334

arguments and perform in the same way.

14335

@end ignore

14336

14337

@node Split Program, Tee Program, Id Program, Clones

14338

@subsection Splitting a Large File Into Pieces

14339

14340

@cindex @code{split} utility

14341

The @code{split} program splits large text files into smaller pieces. By default,

14342

the output files are named @file{xaa}, @file{xab}, and so on. Each file has

14343

1000 lines in it, with the likely exception of the last file. To change the

14344

number of lines in each file, you supply a number on the command line

14345

preceded with a minus, e.g., @samp{-500} for files with 500 lines in them

14346

instead of 1000. To change the name of the output files to something like

14347

@file{myfileaa}, @file{myfileab}, and so on, you supply an additional

14348

argument that specifies the filename.

14349

14350

Here is a version of @code{split} in @code{awk}. It uses the @code{ord} and

14351

@code{chr} functions presented in

14352

@ref{Ordinal Functions, ,Translating Between Characters and Numbers}.

14353

14354

The program first sets its defaults, and then tests to make sure there are

14355

not too many arguments. It then looks at each argument in turn. The

14356

first argument could be a minus followed by a number. If it is, this happens

14357

to look like a negative number, so it is made positive, and that is the

14358

count of lines. The data file name is skipped over, and the final argument

14359

is used as the prefix for the output file names.

14360

14361

@findex split.awk

14362

@example

14363

@c @group

14364

@c file eg/prog/split.awk

14365

# split.awk --- do split in awk

14366

# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain

14367

# May 1993

14368

14369

# usage: split [-num] [file] [outname]

14370

14371

BEGIN \

14372

@{

14373

outfile = "x" # default

14374

count = 1000

14375

if (ARGC > 4)

14376

usage()

14377

14378

i = 1

14379

if (ARGV[i] ~ /^-[0-9]+$/) @{

14380

count = -ARGV[i]

14381

ARGV[i] = ""

14382

i++

14383

@}

14384

# test argv in case reading from stdin instead of file

14385

if (i in ARGV)

14386

i++ # skip data file name

14387

if (i in ARGV) @{

14388

outfile = ARGV[i]

14389

ARGV[i] = ""

14390

@}

14391

14392

s1 = s2 = "a"

14393

out = (outfile s1 s2)

14394

@}

14395

@c endfile

14396

@c @end group

14397

@end example

14398

14399

The next rule does most of the work. @code{tcount} (temporary count) tracks

14400

how many lines have been printed to the output file so far. If it is greater

14401

than @code{count}, it is time to close the current file and start a new one.

14402

@code{s1} and @code{s2} track the current suffixes for the file name. If

14403

they are both @samp{z}, the file is just too big. Otherwise, @code{s1}

14404

moves to the next letter in the alphabet and @code{s2} starts over again at

14405

@samp{a}.

14406

14407

@example

14408

@c @group

14409

@c file eg/prog/split.awk

14410

@{

14411

if (++tcount > count) @{

14412

close(out)

14413

if (s2 == "z") @{

14414

if (s1 == "z") @{

14415

printf("split: %s is too large to split\n", \

14416

FILENAME) > "/dev/stderr"

14417

exit 1

14418

@}

14419

s1 = chr(ord(s1) + 1)

14420

s2 = "a"

14421

@} else

14422

s2 = chr(ord(s2) + 1)

14423

out = (outfile s1 s2)

14424

tcount = 1

14425

@}

14426

print > out

14427

@}

14428

@c endfile

14429

@c @end group

14430

@end example
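
The way the two suffix letters advance can be watched in a stand-alone
sketch. To keep it self-contained, it does not use the @code{ord} and
@code{chr} library functions; instead it looks up the next letter in a
string named @code{letters}. That variable, the starting suffix
@samp{ay}, and the loop count are hypothetical choices made for the
illustration:

@example
BEGIN @{
    letters = "abcdefghijklmnopqrstuvwxyz"
    s1 = "a"; s2 = "y"
    for (k = 1; k <= 3; k++) @{
        print "x" s1 s2
        if (s2 == "z") @{    # second letter wraps around
            s1 = substr(letters, index(letters, s1) + 1, 1)
            s2 = "a"
        @} else
            s2 = substr(letters, index(letters, s2) + 1, 1)
    @}
@}
@end example

@noindent
When run, this prints @samp{xay}, @samp{xaz}, and @samp{xba}, one per line.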

14431

14432

The @code{usage} function simply prints an error message and exits.

14433

14434

@example

14435

@c @group

14436

@c file eg/prog/split.awk

14437

function usage( e)

14438

@{

14439

e = "usage: split [-num] [file] [outname]"

14440

print e > "/dev/stderr"

14441

exit 1

14442

@}

14443

@c endfile

14444

@c @end group

14445

@end example

14446

14447

@noindent

14448

The variable @code{e} is used so that the function

14449

fits nicely on the

14450

@iftex

14451

page.

14452

@end iftex

14453

@ifinfo

14454

screen.

14455

@end ifinfo

14456

14457

This program is a bit sloppy; it relies on @code{awk} to close the last file

14458

for it automatically, instead of doing it in an @code{END} rule.

14459

14460

@node Tee Program, Uniq Program, Split Program, Clones

14461

@subsection Duplicating Output Into Multiple Files

14462

14463

@cindex @code{tee} utility

14464

The @code{tee} program is known as a ``pipe fitting.'' @code{tee} copies

14465

its standard input to its standard output, and also duplicates it to the

14466

files named on the command line. Its usage is:

14467

14468

@example

14469

tee @r{[}-a@r{]} file @dots{}

14470

@end example

14471

14472

The @samp{-a} option tells @code{tee} to append to the named files, instead of

14473

truncating them and starting over.

14474

14475

The @code{BEGIN} rule first makes a copy of all the command line arguments,

14476

into an array named @code{copy}.

14477

@code{ARGV[0]} is not copied, since it is not needed.

14478

@code{tee} cannot use @code{ARGV} directly, since @code{awk} will attempt to

14479

process each file named in @code{ARGV} as input data.

14480

14481

If the first argument is @samp{-a}, then the flag variable

14482

@code{append} is set to true, and both @code{ARGV[1]} and

14483

@code{copy[1]} are deleted. If @code{ARGC} is less than two, then no file

14484

names were supplied, and @code{tee} prints a usage message and exits.

14485

Finally, @code{awk} is forced to read the standard input by setting

14486

@code{ARGV[1]} to @code{"-"}, and @code{ARGC} to two.

14487

14488

@c 2e: the `ARGC--' in the `if (ARGV[1] == "-a")' isn't needed.

14489

14490

@findex tee.awk

14491

@example

14492

@c @group

14493

@c file eg/prog/tee.awk

14494

# tee.awk --- tee in awk

14495

# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain

14496

# May 1993

14497

# Revised December 1995

14498

14499

BEGIN \

14500

@{

14501

for (i = 1; i < ARGC; i++)

14502

copy[i] = ARGV[i]

14503

14504

if (ARGV[1] == "-a") @{

14505

append = 1

14506

delete ARGV[1]

14507

delete copy[1]

14508

ARGC--

14509

@}

14510

if (ARGC < 2) @{

14511

print "usage: tee [-a] file ..." > "/dev/stderr"

14512

exit 1

14513

@}

14514

ARGV[1] = "-"

14515

ARGC = 2

14516

@}

14517

@c endfile

14518

@c @end group

14519

@end example

The single rule does all the work. Since there is no pattern, it is
executed for each line of input. The body of the rule simply prints the
line into each file on the command line, and then to the standard output.

@example
@group
@c file eg/prog/tee.awk
@{
    # moving the if outside the loop makes it run faster
    if (append)
        for (i in copy)
            print >> copy[i]
    else
        for (i in copy)
            print > copy[i]
    print
@}
@c endfile
@end group
@end example

It would have been possible to code the loop this way:

@example
for (i in copy)
    if (append)
        print >> copy[i]
    else
        print > copy[i]
@end example

@noindent
This is more concise, but it is also less efficient. The @samp{if} is
tested for each record and for each output file. By duplicating the loop
body, the @samp{if} is only tested once for each input record. If there are
@var{N} input records and @var{M} output files, the first method only
executes @var{N} @samp{if} statements, while the second would execute
@var{N}@code{*}@var{M} @samp{if} statements.

Finally, the @code{END} rule cleans up, by closing all the output files.

@example
@c @group
@c file eg/prog/tee.awk
END    \
@{
    for (i in copy)
        close(copy[i])
@}
@c endfile
@c @end group
@end example

@node Uniq Program, Wc Program, Tee Program, Clones
@subsection Printing Non-duplicated Lines of Text

@cindex @code{uniq} utility
The @code{uniq} utility reads sorted lines of data on its standard input,
and (by default) removes duplicate lines. In other words, only unique lines
are printed, hence the name. @code{uniq} has a number of options. The usage is:

@example
uniq @r{[}-udc @r{[}-@var{n}@r{]]} @r{[}+@var{n}@r{]} @r{[} @var{input file} @r{[} @var{output file} @r{]]}
@end example

The option meanings are:

@table @code
@item -d
Only print repeated lines.

@item -u
Only print non-repeated lines.

@item -c
Count lines. This option overrides @samp{-d} and @samp{-u}. Both repeated
and non-repeated lines are counted.

@item -@var{n}
Skip @var{n} fields before comparing lines. The definition of fields is the
same as @code{awk}'s default: non-whitespace characters separated by runs of
spaces and/or tabs.

@item +@var{n}
Skip @var{n} characters before comparing lines. Any fields specified with
@samp{-@var{n}} are skipped first.

@item @var{input file}
Data is read from the input file named on the command line, instead of from
the standard input.

@item @var{output file}
The generated output is sent to the named output file, instead of to the
standard output.
@end table

Normally @code{uniq} behaves as if both the @samp{-d} and @samp{-u} options
had been provided.

Here is an @code{awk} implementation of @code{uniq}. It uses the
@code{getopt} library function
(@pxref{Getopt Function, ,Processing Command Line Options}),
and the @code{join} library function
(@pxref{Join Function, ,Merging an Array Into a String}).

The program begins with a @code{usage} function and then a brief outline of
the options and their meanings in a comment.

The @code{BEGIN} rule deals with the command line arguments and options. It
uses a trick to get @code{getopt} to handle options of the form @samp{-25},
treating such an option as the option letter @samp{2} with an argument of
@samp{5}. If indeed two or more digits were supplied (@code{Optarg} looks
like a number), @code{Optarg} is
concatenated with the option digit, and the result is added to zero to make
it into a number. If there is only one digit in the option, then
@code{Optarg} is not needed, and @code{Optind} must be decremented so that
@code{getopt} will process it next time. This code is admittedly a bit
tricky.

If no options were supplied, then the default is taken, to print both
repeated and non-repeated lines. The output file, if provided, is assigned
to @code{outputfile}. Earlier, @code{outputfile} was initialized to the
standard output, @file{/dev/stdout}.

@findex uniq.awk
@example
@c @group
@c file eg/prog/uniq.awk
# uniq.awk --- do uniq in awk
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
# May 1993

function usage(    e)
@{
    e = "Usage: uniq [-udc [-n]] [+n] [ in [ out ]]"
    print e > "/dev/stderr"
    exit 1
@}

# -c    count lines. overrides -d and -u
# -d    only repeated lines
# -u    only non-repeated lines
# -n    skip n fields
# +n    skip n characters, skip fields first

BEGIN    \
@{
    count = 1
    outputfile = "/dev/stdout"
    opts = "udc0:1:2:3:4:5:6:7:8:9:"
    while ((c = getopt(ARGC, ARGV, opts)) != -1) @{
        if (c == "u")
            non_repeated_only++
        else if (c == "d")
            repeated_only++
        else if (c == "c")
            do_count++
        else if (index("0123456789", c) != 0) @{
            # getopt requires args to options
            # this messes us up for things like -5
            if (Optarg ~ /^[0-9]+$/)
                fcount = (c Optarg) + 0
            else @{
                fcount = c + 0
                Optind--
            @}
        @} else
            usage()
    @}

    if (ARGV[Optind] ~ /^\+[0-9]+$/) @{
        charcount = substr(ARGV[Optind], 2) + 0
        Optind++
    @}

    for (i = 1; i < Optind; i++)
        ARGV[i] = ""

    if (repeated_only == 0 && non_repeated_only == 0)
        repeated_only = non_repeated_only = 1

    if (ARGC - Optind == 2) @{
        outputfile = ARGV[ARGC - 1]
        ARGV[ARGC - 1] = ""
    @}
@}
@c endfile
@c @end group
@end example
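
The digit handling can be checked without @code{getopt}. This small
sketch hard-codes the values that @code{getopt} would be expected to
return for the option @samp{-25}; the assignments to @code{c} and
@code{Optarg} are assumptions made for illustration only:

@example
BEGIN @{
    c = "2"             # option letter, as getopt would return it
    Optarg = "5"        # its "argument": the remaining digits
    if (Optarg ~ /^[0-9]+$/)
        fcount = (c Optarg) + 0   # concatenate, then force numeric
    else
        fcount = c + 0            # single digit, Optarg not ours
    print fcount
@}
@end example

@noindent
This prints @samp{25}.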

The following function, @code{are_equal}, compares the current line,
@code{$0}, to the
previous line, @code{last}. It handles skipping fields and characters.

If no field count and no character count were specified, @code{are_equal}
simply returns one or zero depending upon the result of a simple string
comparison of @code{last} and @code{$0}. Otherwise, things get more
complicated.

If fields have to be skipped, each line is broken into an array using
@code{split}
(@pxref{String Functions, ,Built-in Functions for String Manipulation}),
and then the desired fields are joined back into a line using @code{join}.
The joined lines are stored in @code{clast} and @code{cline}.
If no fields are skipped, @code{clast} and @code{cline} are set to
@code{last} and @code{$0} respectively.

Finally, if characters are skipped, @code{substr} is used to strip off the
leading @code{charcount} characters in @code{clast} and @code{cline}. The
two strings are then compared, and @code{are_equal} returns the result.

@example
@c @group
@c file eg/prog/uniq.awk
function are_equal(    n, m, clast, cline, alast, aline)
@{
    if (fcount == 0 && charcount == 0)
        return (last == $0)

    if (fcount > 0) @{
        n = split(last, alast)
        m = split($0, aline)
        clast = join(alast, fcount+1, n)
        cline = join(aline, fcount+1, m)
    @} else @{
        clast = last
        cline = $0
    @}
    if (charcount) @{
        clast = substr(clast, charcount + 1)
        cline = substr(cline, charcount + 1)
    @}

    return (clast == cline)
@}
@c endfile
@c @end group
@end example
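
The two-stage skipping (fields first, then characters) can be tried on
its own. The sketch below uses a hypothetical helper,
@code{skip_prefix}, which is not part of @file{uniq.awk}; it rebuilds
the line with a plain loop instead of the @code{join} library function,
so that it runs without any library files:

@example
# skip_prefix --- hypothetical helper, for illustration only
function skip_prefix(str, nfields, nchars,    a, n, i, out)
@{
    if (nfields > 0) @{
        n = split(str, a)
        out = ""
        # rejoin fields nfields+1 .. n, separated by single spaces
        for (i = nfields + 1; i <= n; i++)
            out = out (i > nfields + 1 ? " " : "") a[i]
    @} else
        out = str
    if (nchars > 0)
        out = substr(out, nchars + 1)
    return out
@}

BEGIN @{
    print skip_prefix("10 apple pie", 1, 0)
    print skip_prefix("10 apple pie", 1, 2)
@}
@end example

@noindent
The first call prints @samp{apple pie}; the second also skips two
characters and prints @samp{ple pie}.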

The following two rules are the body of the program. The first one is
executed only for the very first line of data. It sets @code{last} equal to
@code{$0}, so that subsequent lines of text have something to be compared to.

The second rule does the work. The variable @code{equal} will be one or zero
depending upon the results of @code{are_equal}'s comparison. If @code{uniq}
is counting lines (the @samp{-c} option), then the @code{count} variable is
incremented if the lines are equal. Otherwise the count and the previous
line are printed and @code{count} is reset, since the two lines are not equal.

If @code{uniq} is not counting, @code{count} is still incremented if the
lines are equal. Otherwise, if @code{uniq} is printing repeated lines, and
more than one copy of the line has been seen, or if @code{uniq} is printing
non-repeated lines, and only one copy has been seen, then the line is
printed, and @code{count} is reset.

Finally, similar logic is used in the @code{END} rule to print the final
line of input data.

@example
@c @group
@c file eg/prog/uniq.awk
@group
NR == 1 @{
    last = $0
    next
@}
@end group

@{
    equal = are_equal()

    if (do_count) @{    # overrides -d and -u
        if (equal)
            count++
        else @{
            printf("%4d %s\n", count, last) > outputfile
            last = $0
            count = 1    # reset
        @}
        next
    @}

    if (equal)
        count++
    else @{
        if ((repeated_only && count > 1) ||
            (non_repeated_only && count == 1))
                print last > outputfile
        last = $0
        count = 1
    @}
@}

@group
END @{
    if (do_count)
        printf("%4d %s\n", count, last) > outputfile
    else if ((repeated_only && count > 1) ||
            (non_repeated_only && count == 1))
        print last > outputfile
@}
@end group
@c endfile
@c @end group
@end example

@node Wc Program, , Uniq Program, Clones
@subsection Counting Things

@cindex @code{wc} utility
The @code{wc} (word count) utility counts lines, words, and characters in
one or more input files. Its usage is:

@example
wc @r{[}-lwc@r{]} @r{[} @var{files} @dots{} @r{]}
@end example

If no files are specified on the command line, @code{wc} reads its standard
input. If there are multiple files, it will also print total counts for all
the files. The options and their meanings are:

@table @code
@item -l
Only count lines.

@item -w
Only count words.
A ``word'' is a contiguous sequence of non-whitespace characters, separated
by spaces and/or tabs. Happily, this is the normal way @code{awk} separates
fields in its input data.

@item -c
Only count characters.
@end table

Implementing @code{wc} in @code{awk} is particularly elegant, since
@code{awk} does a lot of the work for us; it splits lines into words (i.e.@:
fields) and counts them, it counts lines (i.e.@: records) for us, and it can
easily tell us how long a line is.

This version uses the @code{getopt} library function
(@pxref{Getopt Function, ,Processing Command Line Options}),
and the file transition functions
(@pxref{Filetrans Function, ,Noting Data File Boundaries}).

This version has one major difference from traditional versions of @code{wc}.
Our version always prints the counts in the order lines, words,
and characters. Traditional versions note the order of the @samp{-l},
@samp{-w}, and @samp{-c} options on the command line, and print the counts
in that order.

The @code{BEGIN} rule does the argument processing.
The variable @code{print_total} will
be true if more than one file was named on the command line.

@findex wc.awk
@example
@c @group
@c file eg/prog/wc.awk
# wc.awk --- count lines, words, characters
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
# May 1993

# Options:
#    -l    only count lines
#    -w    only count words
#    -c    only count characters
#
# Default is to count lines, words, characters

BEGIN @{
    # let getopt print a message about
    # invalid options. we ignore them
    while ((c = getopt(ARGC, ARGV, "lwc")) != -1) @{
        if (c == "l")
            do_lines = 1
        else if (c == "w")
            do_words = 1
        else if (c == "c")
            do_chars = 1
    @}
    for (i = 1; i < Optind; i++)
        ARGV[i] = ""

    # if no options, do all
    if (! do_lines && ! do_words && ! do_chars)
        do_lines = do_words = do_chars = 1

    # i now equals Optind, so ARGC - i is the number of files named
    print_total = (ARGC - i > 1)
@}
@c endfile
@c @end group
@end example

The @code{beginfile} function is simple; it just resets the counts of lines,
words, and characters to zero, and saves the current file name in
@code{fname}.

The @code{endfile} function adds the current file's numbers to the running
totals of lines, words, and characters. It then prints out those numbers
for the file that was just read. It relies on @code{beginfile} to reset the
numbers for the following data file.

@example
@c @group
@c file eg/prog/wc.awk
function beginfile(file)
@{
    chars = lines = words = 0
    fname = FILENAME
@}

function endfile(file)
@{
    tchars += chars
    tlines += lines
    twords += words
@group
    if (do_lines)
        printf "\t%d", lines
@end group
    if (do_words)
        printf "\t%d", words
    if (do_chars)
        printf "\t%d", chars
    printf "\t%s\n", fname
@}
@c endfile
@c @end group
@end example

There is one rule that is executed for each line. It adds the length of the
record to @code{chars}. It has to add one, since the newline character
separating records (the value of @code{RS}) is not part of the record
itself. @code{lines} is incremented for each line read, and @code{words} is
incremented by the value of @code{NF}, the number of ``words'' on this
line.@footnote{Examine the code in
@ref{Filetrans Function, ,Noting Data File Boundaries}.
Why must @code{wc} use a separate @code{lines} variable, instead of using
the value of @code{FNR} in @code{endfile}?}

Finally, the @code{END} rule simply prints the totals for all the files.

@example
@c @group
@c file eg/prog/wc.awk
# do per line
@{
    chars += length($0) + 1    # get newline
    lines++
    words += NF
@}

END @{
    if (print_total) @{
        if (do_lines)
            printf "\t%d", tlines
        if (do_words)
            printf "\t%d", twords
        if (do_chars)
            printf "\t%d", tchars
        print "\ttotal"
    @}
@}
@c endfile
@c @end group
@end example
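
For comparison, the counting itself needs very little code. The
following sketch is a stripped-down variant for a single input stream;
it is not the program above, the name @file{miniwc.awk} is made up, and
it leaves out options, per-file counts, and totals:

@example
# miniwc.awk --- illustrative sketch: lines, words, characters
@{
    chars += length($0) + 1     # add one for the newline
    lines++
    words += NF
@}

END @{ printf "\t%d\t%d\t%d\n", lines, words, chars @}
@end example

@noindent
Given the two input lines @samp{a b} and @samp{c}, it reports 2 lines,
3 words, and 6 characters.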

@node Miscellaneous Programs, , Clones, Sample Programs
@section A Grab Bag of @code{awk} Programs

This section is a large ``grab bag'' of miscellaneous programs.
We hope you find them both interesting and enjoyable.

@menu
* Dupword Program::             Finding duplicated words in a document.
* Alarm Program::               An alarm clock.
* Translate Program::           A program similar to the @code{tr} utility.
* Labels Program::              Printing mailing labels.
* Word Sorting::                A program to produce a word usage count.
* History Sorting::             Eliminating duplicate entries from a history
                                file.
* Extract Program::             Pulling out programs from Texinfo source
                                files.
* Simple Sed::                  A Simple Stream Editor.
* Igawk Program::               A wrapper for @code{awk} that includes files.
@end menu

@node Dupword Program, Alarm Program, Miscellaneous Programs, Miscellaneous Programs
@subsection Finding Duplicated Words in a Document

A common error when writing large amounts of prose is to accidentally
duplicate words. Often you will see this in text as something like ``the
the program does the following @dots{}.'' When the text is on-line, often
the duplicated words occur at the end of one line and the beginning of
another, making them very difficult to spot.
@c as here!

This program, @file{dupword.awk}, scans through a file one line at a time,
and looks for adjacent occurrences of the same word. It also saves the last
word on a line (in the variable @code{prev}) for comparison with the first
word on the next line.

The first statement makes sure that the line is all lower-case, so that,
for example,
``The'' and ``the'' compare equal to each other. The second statement
removes all non-alphanumeric and non-whitespace characters from the line, so
that punctuation does not affect the comparison either. This sometimes
leads to reports of duplicated words that really are different, but this is
unusual.

@findex dupword.awk
@example
@group
@c file eg/prog/dupword.awk
# dupword --- find duplicate words in text
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
# December 1991

@{
    $0 = tolower($0)
    gsub(/[^A-Za-z0-9 \t]/, "");
    # skip the cross-line test on empty lines, where $1 and prev
    # could both be null and so compare equal
    if (NF > 0 && $1 == prev)
        printf("%s:%d: duplicate %s\n",
            FILENAME, FNR, $1)
    for (i = 2; i <= NF; i++)
        if ($i == $(i-1))
            printf("%s:%d: duplicate %s\n",
                FILENAME, FNR, $i)
    prev = $NF
@}
@c endfile
@end group
@end example
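
The effect of the two normalizing statements can be seen on a single
string. This is only a sketch: it applies @code{tolower} and
@code{gsub} to a variable named @code{s}, chosen here for illustration,
rather than to @code{$0}:

@example
BEGIN @{
    s = tolower("The THE, the!")
    gsub(/[^A-Za-z0-9 \t]/, "", s)    # strip punctuation
    print s
@}
@end example

@noindent
This prints @samp{the the the}; case and punctuation no longer hide
the repetition.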

@node Alarm Program, Translate Program, Dupword Program, Miscellaneous Programs
@subsection An Alarm Clock Program

The following program is a simple ``alarm clock'' program.
You give it a time of day, and an optional message. At the given time,
it prints the message on the standard output. In addition, you can give it
the number of times to repeat the message, and also a delay between
repetitions.

This program uses the @code{gettimeofday} function from
@ref{Gettimeofday Function, ,Managing the Time of Day}.

All the work is done in the @code{BEGIN} rule. The first part is argument
checking and setting of defaults; the delay, the count, and the message to
print. If the user supplied a message, but it does not contain the ASCII BEL
character (known as the ``alert'' character, @samp{\a}), then it is added to
the message. (On many systems, printing the ASCII BEL generates some sort
of audible alert. Thus, when the alarm goes off, the system calls attention
to itself, in case the user is not looking at their computer or terminal.)

@findex alarm.awk
@example
@c @group
@c file eg/prog/alarm.awk
# alarm --- set an alarm
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
# May 1993

# usage: alarm time [ "message" [ count [ delay ] ] ]

BEGIN    \
@{
    # Initial argument sanity checking
    usage1 = "usage: alarm time ['message' [count [delay]]]"
    usage2 = sprintf("\t(%s) time ::= hh:mm", ARGV[1])

    if (ARGC < 2) @{
        print usage1 > "/dev/stderr"
        exit 1
    @} else if (ARGC == 5) @{
        delay = ARGV[4] + 0
        count = ARGV[3] + 0
        message = ARGV[2]
    @} else if (ARGC == 4) @{
        count = ARGV[3] + 0
        message = ARGV[2]
    @} else if (ARGC == 3) @{
        message = ARGV[2]
    @} else if (ARGV[1] !~ /[0-9]?[0-9]:[0-9][0-9]/) @{
        print usage1 > "/dev/stderr"
        print usage2 > "/dev/stderr"
        exit 1
    @}

    # set defaults for once we reach the desired time
    if (delay == 0)
        delay = 180    # 3 minutes
    if (count == 0)
        count = 5
@group
    if (message == "")
        message = sprintf("\aIt is now %s!\a", ARGV[1])
    else if (index(message, "\a") == 0)
        message = "\a" message "\a"
@end group
@c endfile
@end example

The next section of code turns the alarm time into hours and minutes,
and converts it if necessary to a 24-hour clock. Then it turns that
time into a count of the seconds since midnight. Next it turns the current
time into a count of seconds since midnight. The difference between the two
is how long to wait before setting off the alarm.

@example
@c @group
@c file eg/prog/alarm.awk
    # split up dest time
    split(ARGV[1], atime, ":")
    hour = atime[1] + 0    # force numeric
    minute = atime[2] + 0  # force numeric

    # get current broken down time
    gettimeofday(now)

    # if time given is 12-hour hours and it's after that
    # hour, e.g., `alarm 5:30' at 9 a.m. means 5:30 p.m.,
    # then add 12 to real hour
    if (hour < 12 && now["hour"] > hour)
        hour += 12

    # set target time in seconds since midnight
    target = (hour * 60 * 60) + (minute * 60)

    # get current time in seconds since midnight
    current = (now["hour"] * 60 * 60) + \
               (now["minute"] * 60) + now["second"]

    # how long to sleep for
    naptime = target - current
    if (naptime <= 0) @{
        print "time is in the past!" > "/dev/stderr"
        exit 1
    @}
@c endfile
@c @end group
@end example
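
The arithmetic can be followed with fixed numbers. The sketch below is
not part of @file{alarm.awk}: instead of calling @code{gettimeofday},
it assumes the current time is exactly 9:00:00 a.m. (the variable
@code{now_hour} is made up for this purpose) and that the alarm was
requested for @samp{5:30}:

@example
BEGIN @{
    split("5:30", atime, ":")
    hour = atime[1] + 0
    minute = atime[2] + 0

    now_hour = 9        # assumed current time: 9:00:00 a.m.

    # 5:30 has already passed today, so it must mean 5:30 p.m.
    if (hour < 12 && now_hour > hour)
        hour += 12

    target = (hour * 60 * 60) + (minute * 60)
    current = now_hour * 60 * 60
    print target - current
@}
@end example

@noindent
This prints @samp{30600}, which is eight and a half hours expressed in
seconds.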

Finally, the program uses the @code{system} function
(@pxref{I/O Functions, ,Built-in Functions for Input/Output})
to call the @code{sleep} utility. The @code{sleep} utility simply pauses
for the given number of seconds. If the exit status is not zero,
the program assumes that @code{sleep} was interrupted, and exits. If
@code{sleep} exited with an OK status (zero), then the program prints the
message in a loop, again using @code{sleep} to delay for however many
seconds are necessary.

@example
@c @group
@c file eg/prog/alarm.awk
    # zzzzzz..... go away if interrupted
    if (system(sprintf("sleep %d", naptime)) != 0)
        exit 1

    # time to notify!
    command = sprintf("sleep %d", delay)
    for (i = 1; i <= count; i++) @{
        print message
        # if sleep command interrupted, go away
        if (system(command) != 0)
            break
    @}

    exit 0
@}
@c endfile
@c @end group
@end example
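
The exit status test can be tried by itself. This is a minimal sketch,
assuming a @code{sleep} utility is available on the system; the
one-second delay is chosen only to keep the demonstration short:

@example
BEGIN @{
    status = system("sleep 1")
    if (status != 0) @{
        # e.g., the user interrupted sleep from the keyboard
        print "sleep was interrupted" > "/dev/stderr"
        exit 1
    @}
    print "slept for one second"
@}
@end example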

@node Translate Program, Labels Program, Alarm Program, Miscellaneous Programs
@subsection Transliterating Characters

The system @code{tr} utility transliterates characters. For example, it is
often used to map upper-case letters into lower-case, for further
processing.

@example
@var{generate data} | tr '[A-Z]' '[a-z]' | @var{process data} @dots{}
@end example

You give @code{tr} two lists of characters enclosed in square brackets.
Usually, the lists are quoted to keep the shell from attempting to do a
filename expansion.@footnote{On older, non-POSIX systems, @code{tr} often
does not require that the lists be enclosed in square brackets and quoted.
This is a feature.} When processing the input, the
first character in the first list is replaced with the first character in the
second list, the second character in the first list is replaced with the
second character in the second list, and so on.
If there are more characters in the ``from'' list than in the ``to'' list,
the last character of the ``to'' list is used for the remaining characters
in the ``from'' list.

Some time ago,
@c early or mid-1989!
a user proposed to us that we add a transliteration function to @code{gawk}.
Being opposed to ``creeping featurism,'' I wrote the following program to
prove that character transliteration could be done with a user-level
function. This program is not as complete as the system @code{tr} utility,
but it will do most of the job.

The @code{translate} program demonstrates one of the few weaknesses of
standard
@code{awk}: dealing with individual characters is very painful, requiring
repeated use of the @code{substr}, @code{index}, and @code{gsub} built-in
functions
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).@footnote{This
program was written before @code{gawk} acquired the ability to
split each character in a string into separate array elements.
How might this ability simplify the program?}

There are two functions. The first, @code{stranslate}, takes three
arguments.

@table @code
@item from
A list of characters to translate from.

@item to
A list of characters to translate to.

@item target
The string to do the translation on.
@end table

Associative arrays make the translation part fairly easy. @code{t_ar} holds
the ``to'' characters, indexed by the ``from'' characters. Then a simple
loop goes through @code{from}, one character at a time. For each character
in @code{from}, if the character appears in @code{target}, @code{gsub}
is used to change it to the corresponding @code{to} character.

The @code{translate} function simply calls @code{stranslate} using @code{$0}
as the target. The main program sets two global variables, @code{FROM} and
@code{TO}, from the command line, and then changes @code{ARGV} so that
@code{awk} will read from the standard input.

Finally, the processing rule simply calls @code{translate} for each record.

@findex translate.awk
@example
@c @group
@c file eg/prog/translate.awk
# translate --- do tr like stuff
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
# August 1989

# bugs: does not handle things like: tr A-Z a-z, it has
# to be spelled out. However, if `to' is shorter than `from',
# the last character in `to' is used for the rest of `from'.

function stranslate(from, to, target,     lf, lt, t_ar, i, c)
@{
    lf = length(from)
    lt = length(to)
    for (i = 1; i <= lt; i++)
        t_ar[substr(from, i, 1)] = substr(to, i, 1)
    if (lt < lf)
        for (; i <= lf; i++)
            t_ar[substr(from, i, 1)] = substr(to, lt, 1)
    for (i = 1; i <= lf; i++) @{
        c = substr(from, i, 1)
        if (index(target, c) > 0)
            gsub(c, t_ar[c], target)
    @}
    return target
@}

@group
function translate(from, to)
@{
    return $0 = stranslate(from, to, $0)
@}
@end group

# main program
BEGIN @{
    if (ARGC < 3) @{
        print "usage: translate from to" > "/dev/stderr"
        exit
    @}
    FROM = ARGV[1]
    TO = ARGV[2]
    ARGC = 2
    ARGV[1] = "-"
@}

@{
    translate(FROM, TO)
    print
@}
@c endfile
@c @end group
@end example
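
One way to use character splitting here is sketched below. It is an
illustration only, not a replacement taken from the @code{gawk}
distribution: the function name @code{stranslate2} is hypothetical, and
the code assumes an @code{awk}, such as @code{gawk}, in which
@code{split} with a null separator puts each character into its own
array element. Each character of the target is looked up once in a
table, so no character is ever treated as a regular expression:

@example
# stranslate2 --- hypothetical variant using per-character split
function stranslate2(from, to, target,    f, t, s, lf, lt, ls, map, i, out)
@{
    lf = split(from, f, "")
    lt = split(to, t, "")
    # pad a short `to' list with its last character
    for (i = 1; i <= lf; i++)
        map[f[i]] = (i <= lt) ? t[i] : t[lt]

    ls = split(target, s, "")
    out = ""
    for (i = 1; i <= ls; i++)
        out = out ((s[i] in map) ? map[s[i]] : s[i])
    return out
@}

BEGIN @{ print stranslate2("abc", "xy", "aabbcc.d") @}
@end example

@noindent
This prints @samp{xxyyyy.d}: @samp{a} becomes @samp{x}, and both
@samp{b} and @samp{c} become @samp{y}.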

While it is possible to do character transliteration in a user-level
function, it is not necessarily efficient, and we started to consider adding
a built-in function. However, shortly after writing this program, we learned
that the System V Release 4 @code{awk} had added the @code{toupper} and
@code{tolower} functions. These functions handle the vast majority of the
cases where character transliteration is necessary, and so we chose to
simply add those functions to @code{gawk} as well, and then leave well
enough alone.

An obvious improvement to this program would be to set up the
@code{t_ar} array only once, in a @code{BEGIN} rule. However, this
assumes that the ``from'' and ``to'' lists
will never change throughout the lifetime of the program.

@node Labels Program, Word Sorting, Translate Program, Miscellaneous Programs
@subsection Printing Mailing Labels

Here is a ``real world''@footnote{``Real world'' is defined as
``a program actually used to get something done.''}
program. This script reads lists of names and
addresses, and generates mailing labels. Each page of labels has 20 labels
on it, two across and ten down. The addresses are guaranteed to be no more
than five lines of data. Each address is separated from the next by a blank
line.

The basic idea is to read 20 labels worth of data. Each line of each label
is stored in the @code{line} array. The single rule takes care of filling
the @code{line} array and printing the page when 20 labels have been read.

The @code{BEGIN} rule simply sets @code{RS} to the empty string, so that
@code{awk} will split records at blank lines
(@pxref{Records, ,How Input is Split into Records}).
It sets @code{MAXLINES} to 100, since @code{MAXLINES} is the maximum number
of lines on the page (20 * 5 = 100).

Most of the work is done in the @code{printpage} function.
The label lines are stored sequentially in the @code{line} array. But they
have to be printed horizontally; @code{line[1]} next to @code{line[6]},
@code{line[2]} next to @code{line[7]}, and so on. Two loops are used to
accomplish this. The outer loop, controlled by @code{i}, steps through
every 10 lines of data; this is each row of labels. The inner loop,
controlled by @code{j}, goes through the lines within the row.
As @code{j} goes from zero to four, @samp{i+j} is the @code{j}'th line in
the row, and @samp{i+j+5} is the entry next to it. The output ends up
looking something like this:

@example
line 1          line 6
line 2          line 7
line 3          line 8
line 4          line 9
line 5          line 10
@end example
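
The index arithmetic can be checked with a short sketch that prints
only the subscripts, for two rows of labels. It is an illustration and
not part of @file{labels.awk}; the upper bound of 20 is an arbitrary
choice for the demonstration:

@example
BEGIN @{
    for (i = 1; i <= 20; i += 10)      # one pass per row of labels
        for (j = 0; j < 5; j++)        # five lines within the row
            printf "line[%d] next to line[%d]\n", i + j, i + j + 5
@}
@end example

@noindent
The first row pairs subscripts 1 through 5 with 6 through 10, and the
second row pairs 11 through 15 with 16 through 20.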

As a final note, at lines 21 and 61, an extra blank line is printed, to keep
the output lined up on the labels. This is dependent on the particular
brand of labels in use when the program was written. You will also note
that there are two blank lines at the top and two blank lines at the bottom.

The @code{END} rule arranges to flush the final page of labels; there may
not have been an even multiple of 20 labels in the data.

15380

15381

@findex labels.awk

15382

@example
@c @group
@c file eg/prog/labels.awk
# labels.awk
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
# June 1992

# Program to print labels. Each label is 5 lines of data
# that may have blank lines. The label sheets have 2
# blank lines at the top and 2 at the bottom.

BEGIN @{ RS = "" ; MAXLINES = 100 @}

function printpage( i, j)
@{
    if (Nlines <= 0)
        return

    printf "\n\n" # header

    for (i = 1; i <= Nlines; i += 10) @{
        if (i == 21 || i == 61)
            print ""
        for (j = 0; j < 5; j++) @{
            if (i + j > MAXLINES)
                break
            printf " %-41s %s\n", line[i+j], line[i+j+5]
        @}
        print ""
    @}

    printf "\n\n" # footer

    for (i in line)
        line[i] = ""
@}

# main rule
@{
    if (Count >= 20) @{
        printpage()
        Count = 0
        Nlines = 0
    @}
    n = split($0, a, "\n")
    for (i = 1; i <= n; i++)
        line[++Nlines] = a[i]
    for (; i <= 5; i++)
        line[++Nlines] = ""
    Count++
@}

END \
@{
    printpage()
@}
@c endfile
@c @end group
@end example

@node Word Sorting, History Sorting, Labels Program, Miscellaneous Programs
@subsection Generating Word Usage Counts

The following @code{awk} program prints
the number of occurrences of each word in its input. It illustrates the
associative nature of @code{awk} arrays by using strings as subscripts. It
also demonstrates the @samp{for @var{x} in @var{array}} construction.
Finally, it shows how @code{awk} can be used in conjunction with other
utility programs to do a useful task of some complexity with a minimum of
effort. Some explanations follow the program listing.

@example
awk '
# Print list of word frequencies
@{
    for (i = 1; i <= NF; i++)
        freq[$i]++
@}

END @{
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
@}'
@end example

The first thing to notice about this program is that it has two rules. The
first rule, because it has an empty pattern, is executed on every line of
the input. It uses @code{awk}'s field-accessing mechanism
(@pxref{Fields, ,Examining Fields}) to pick out the individual words from
the line, and the built-in variable @code{NF} (@pxref{Built-in Variables})
to know how many fields are available.

For each input word, an element of the array @code{freq} is incremented to
reflect that the word has been seen an additional time.

The second rule, because it has the pattern @code{END}, is not executed
until the input has been exhausted. It prints out the contents of the
@code{freq} table that has been built up inside the first action.

This program has several problems that would prevent it from being
useful by itself on real text files:

@itemize @bullet
@item
Words are detected using the @code{awk} convention that fields are
separated by whitespace and that other characters in the input (except
newlines) don't have any special meaning to @code{awk}. This means that
punctuation characters count as part of words.

@item
The @code{awk} language considers upper- and lower-case characters to be
distinct. Therefore, @samp{bartender} and @samp{Bartender} are not treated
as the same word. This is undesirable since, in normal text, words
are capitalized if they begin sentences, and a frequency analyzer should not
be sensitive to capitalization.

@iftex
@page
@end iftex
@item
The output does not come out in any useful order. You're more likely to be
interested in which words occur most frequently, or having an alphabetized
table of how frequently each word occurs.
@end itemize

The way to solve these problems is to use some of the more advanced
features of the @code{awk} language. First, we use @code{tolower} to remove
case distinctions. Next, we use @code{gsub} to remove punctuation
characters. Finally, we use the system @code{sort} utility to process the
output of the @code{awk} script. Here is the new version of
the program:

@findex wordfreq.awk

@example
@c file eg/prog/wordfreq.awk
# Print list of word frequencies
@{
    $0 = tolower($0) # remove case distinctions
    gsub(/[^a-z0-9_ \t]/, "", $0) # remove punctuation
    for (i = 1; i <= NF; i++)
        freq[$i]++
@}
@c endfile

END @{
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
@}
@end example

Assuming we have saved this program in a file named @file{wordfreq.awk},
and that the data is in @file{file1}, the following pipeline

@example
awk -f wordfreq.awk file1 | sort +1 -nr
@end example

@noindent
produces a table of the words appearing in @file{file1} in order of
decreasing frequency.

The @code{awk} program suitably massages the data and produces a word
frequency table, which is not ordered.

The @code{awk} script's output is then sorted by the @code{sort} utility and
printed on the terminal. The options given to @code{sort} in this example
specify to sort using the second field of each input line (skipping one field),
that the sort keys should be treated as numeric quantities (otherwise
@samp{15} would come before @samp{5}), and that the sorting should be done
in descending (reverse) order.
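
The @samp{+1} way of naming a sort key is an obsolescent form that some
newer versions of @code{sort} no longer accept. As a sketch, assuming a
@code{sort} that supports the POSIX @samp{-k} option, the same ordering can
be requested like this:

@example
awk -f wordfreq.awk file1 | sort -k 2nr
@end example

@noindent
Here @samp{-k 2} starts the sort key at the second field, and the @samp{n}
and @samp{r} modifiers ask for a numeric, reversed comparison.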

We could have even done the @code{sort} from within the program, by
changing the @code{END} action to:

@example
@c file eg/prog/wordfreq.awk
END @{
    sort = "sort +1 -nr"
    for (word in freq)
        printf "%s\t%d\n", word, freq[word] | sort
    close(sort)
@}
@c endfile
@end example

You would have to use this way of sorting on systems that do not
have true pipes.

See the general operating system documentation for more information on how
to use the @code{sort} program.

@node History Sorting, Extract Program, Word Sorting, Miscellaneous Programs
@subsection Removing Duplicates from Unsorted Text

The @code{uniq} program
(@pxref{Uniq Program, ,Printing Non-duplicated Lines of Text}),
removes duplicate lines from @emph{sorted} data.

Suppose, however, you need to remove duplicate lines from a data file, but
you wish to preserve the order the lines are in. A good example of
this might be a shell history file. The history file keeps a copy of all
the commands you have entered, and it is not unusual to repeat a command
several times in a row. Occasionally you might wish to compact the history
by removing duplicate entries. Yet it is desirable to maintain the order
of the original commands.

This simple program does the job. It uses two arrays. The @code{data}
array is indexed by the text of each line.
For each line, @code{data[$0]} is incremented.

If a particular line has not
been seen before, then @code{data[$0]} will be zero.
In that case, the text of the line is stored in @code{lines[count]}.
Each element of @code{lines} is a unique command, and the indices of
@code{lines} indicate the order in which those lines were encountered.
The @code{END} rule simply prints out the lines, in order.

@cindex Rakitzis, Byron
@findex histsort.awk

@example
@group
@c file eg/prog/histsort.awk
# histsort.awk --- compact a shell history file
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
# May 1993

# Thanks to Byron Rakitzis for the general idea
@{
    if (data[$0]++ == 0)
        lines[++count] = $0
@}

END @{
    for (i = 1; i <= count; i++)
        print lines[i]
@}
@c endfile
@end group
@end example

This program also provides a foundation for generating other useful
information. For example, using the following @code{print} statement in the
@code{END} rule would indicate how often a particular command was used.

@example
print data[lines[i]], lines[i]
@end example

This works because @code{data[$0]} was incremented each time a line was
seen.
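
Putting the pieces together, a frequency-reporting variant might look like
the following. This is only a sketch; the three input lines are invented
for the illustration.

@example
printf 'ls\ncd /tmp\nls\n' | awk '
@{
    if (data[$0]++ == 0)        # first time this line is seen
        lines[++count] = $0
@}

END @{
    for (i = 1; i <= count; i++)
        print data[lines[i]], lines[i]
@}'
@end example

@noindent
Each command is printed once, in its original order, preceded by the number
of times it appeared: @samp{2 ls} followed by @samp{1 cd /tmp}.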

@node Extract Program, Simple Sed, History Sorting, Miscellaneous Programs
@subsection Extracting Programs from Texinfo Source Files

@iftex
Both this chapter and the previous chapter
(@ref{Library Functions, ,A Library of @code{awk} Functions}),
present a large number of @code{awk} programs.
@end iftex
@ifinfo
The nodes
@ref{Library Functions, ,A Library of @code{awk} Functions},
and @ref{Sample Programs, ,Practical @code{awk} Programs},
are the top level nodes for a large number of @code{awk} programs.
@end ifinfo
If you wish to experiment with these programs, it is tedious to have to type
them in by hand. Here we present a program that can extract parts of a
Texinfo input file into separate files.

This @value{DOCUMENT} is written in Texinfo, the GNU project's document
formatting language. A single Texinfo source file can be used to produce both
printed and on-line documentation.
@iftex
Texinfo is fully documented in @cite{Texinfo---The GNU Documentation Format},
available from the Free Software Foundation.
@end iftex
@ifinfo
The Texinfo language is described fully, starting with
@ref{Top, , Introduction, texi, Texinfo---The GNU Documentation Format}.
@end ifinfo

For our purposes, it is enough to know three things about Texinfo input
files.

@itemize @bullet
@item
The ``at'' symbol, @samp{@@}, is special in Texinfo, much like @samp{\} in C
or @code{awk}. Literal @samp{@@} symbols are represented in Texinfo source
files as @samp{@@@@}.

@item
Comments start with either @samp{@@c} or @samp{@@comment}.
The file extraction program will work by using special comments that start
at the beginning of a line.

@item
Example text that should not be split across a page boundary is bracketed
between lines containing @samp{@@group} and @samp{@@end group} commands.
@end itemize

The following program, @file{extract.awk}, reads through a Texinfo source
file, and does two things, based on the special comments.
Upon seeing @samp{@w{@@c system @dots{}}},
it runs a command, by extracting the command text from the
control line and passing it on to the @code{system} function
(@pxref{I/O Functions, ,Built-in Functions for Input/Output}).
Upon seeing @samp{@@c file @var{filename}}, each subsequent line is sent to
the file @var{filename}, until @samp{@@c endfile} is encountered.
The rules in @file{extract.awk} will match either @samp{@@c} or
@samp{@@comment} by letting the @samp{omment} part be optional.
Lines containing @samp{@@group} and @samp{@@end group} are simply removed.
@file{extract.awk} uses the @code{join} library function
(@pxref{Join Function, ,Merging an Array Into a String}).

The example programs in the on-line Texinfo source for @cite{@value{TITLE}}
(@file{gawk.texi}) have all been bracketed inside @samp{file},
and @samp{endfile} lines. The @code{gawk} distribution uses a copy of
@file{extract.awk} to extract the sample
programs and install many of them in a standard directory, where
@code{gawk} can find them.

@file{extract.awk} begins by setting @code{IGNORECASE} to one, so that
mixed upper-case and lower-case letters in the directives won't matter.

The first rule handles calling @code{system}, checking that a command was
given (@code{NF} is at least three), and also checking that the command
exited with a zero exit status, signifying OK.

@findex extract.awk

@example
@c @group
@c file eg/prog/extract.awk
# extract.awk --- extract files and run programs
# from texinfo files
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
# May 1993

BEGIN @{ IGNORECASE = 1 @}

@group
/^@@c(omment)?[ \t]+system/ \
@{
    if (NF < 3) @{
        e = (FILENAME ":" FNR)
        e = (e ": badly formed `system' line")
        print e > "/dev/stderr"
        next    # nothing to run; skip this line
    @}
    $1 = ""
    $2 = ""
    stat = system($0)
    if (stat != 0) @{
        e = (FILENAME ":" FNR)
        e = (e ": warning: system returned " stat)
        print e > "/dev/stderr"
    @}
@}
@end group
@c endfile
@end example

@noindent
The variable @code{e} is used so that the rule
fits nicely on the
@iftex
page.
@end iftex
@ifinfo
screen.
@end ifinfo

The second rule handles moving data into files. It verifies that a file
name was given in the directive. If the file named is not the current file,
then the current file is closed. This means that an @samp{@@c endfile} was
not given for that file. (We should probably print a diagnostic in this
case, although at the moment we do not.)

The @samp{for} loop does the work. It reads lines using @code{getline}
(@pxref{Getline, ,Explicit Input with @code{getline}}).
For an unexpected end of file, it calls the @code{@w{unexpected_eof}}
function. If the line is an ``endfile'' line, then it breaks out of
the loop.
If the line is an @samp{@@group} or @samp{@@end group} line, then it
ignores it, and goes on to the next line.

Most of the work is in the following few lines. If the line has no @samp{@@}
symbols, it can be printed directly. Otherwise, each leading @samp{@@} must be
stripped off.

To remove the @samp{@@} symbols, the line is split into separate elements of
the array @code{a}, using the @code{split} function
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
Each element of @code{a} that is empty indicates two successive @samp{@@}
symbols in the original line. For each two empty elements (@samp{@@@@} in
the original file), we have to add back in a single @samp{@@} symbol.

When the processing of the array is finished, @code{join} is called with the
value of @code{SUBSEP}, to rejoin the pieces back into a single
line. That line is then printed to the output file.
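
The effect of this processing can be seen on a single line. The following
is a sketch only: the sample line is made up, and the call to @code{join}
is replaced by a plain concatenation loop so that the example is
self-contained (@code{join}, when given @code{SUBSEP}, is taken here to
put nothing between the pieces).

@example
awk 'BEGIN @{
    line = "arnold@@@@gnu.ai.mit.edu"    # as written in Texinfo source
    n = split(line, a, "@@")
    for (i = 2; i <= n; i++) @{
        if (a[i] == "") @{        # empty element: was an @@@@
            a[i] = "@@"
            if (a[i+1] == "")
                i++
        @}
    @}
    out = ""
    for (i = 1; i <= n; i++)     # stands in for join(a, 1, n, SUBSEP)
        out = out a[i]
    print out
@}'
@end example

@noindent
Here @code{split} produces the three elements @code{"arnold"}, @code{""},
and @code{"gnu.ai.mit.edu"}; the empty one is turned back into a single
@samp{@@}, and the program prints @samp{arnold@@gnu.ai.mit.edu}.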

@example
@c @group
@c file eg/prog/extract.awk
/^@@c(omment)?[ \t]+file/ \
@{
@group
    if (NF != 3) @{
        e = (FILENAME ":" FNR ": badly formed `file' line")
        print e > "/dev/stderr"
        next    # no usable file name; skip this line
    @}
@end group
    if ($3 != curfile) @{
        if (curfile != "")
            close(curfile)
        curfile = $3
    @}

    for (;;) @{
        if ((getline line) <= 0)
            unexpected_eof()
        if (line ~ /^@@c(omment)?[ \t]+endfile/)
            break
        else if (line ~ /^@@(end[ \t]+)?group/)
            continue
        if (index(line, "@@") == 0) @{
            print line > curfile
            continue
        @}
        n = split(line, a, "@@")
@group
        # if a[1] == "", means leading @@,
        # don't add one back in.
@end group
        for (i = 2; i <= n; i++) @{
            if (a[i] == "") @{ # was an @@@@
                a[i] = "@@"
                if (a[i+1] == "")
                    i++
            @}
        @}
        print join(a, 1, n, SUBSEP) > curfile
    @}
@}
@c endfile
@c @end group
@end example

An important thing to note is the use of the @samp{>} redirection.
Output done with @samp{>} only opens the file once; it stays open and
subsequent output is appended to the file
(@pxref{Redirection, , Redirecting Output of @code{print} and @code{printf}}).
This allows us to easily mix program text and explanatory prose for the same
sample source file (as has been done here!) without any hassle. The file is
only closed when a new data file name is encountered, or at the end of the
input file.

Finally, the function @code{@w{unexpected_eof}} prints an appropriate
error message and then exits.

The @code{END} rule handles the final cleanup, closing the open file.

@example
@c file eg/prog/extract.awk
@group
function unexpected_eof()
@{
    printf("%s:%d: unexpected EOF or error\n", \
        FILENAME, FNR) > "/dev/stderr"
    exit 1
@}
@end group

END @{
    if (curfile)
        close(curfile)
@}
@c endfile
@end example

@node Simple Sed, Igawk Program, Extract Program, Miscellaneous Programs
@subsection A Simple Stream Editor

@cindex @code{sed} utility
The @code{sed} utility is a ``stream editor,'' a program that reads a
stream of data, makes changes to it, and passes the modified data on.
It is often used to make global changes to a large file, or to a stream
of data generated by a pipeline of commands.

While @code{sed} is a complicated program in its own right, its most common
use is to perform global substitutions in the middle of a pipeline:

@example
command1 < orig.data | sed 's/old/new/g' | command2 > result
@end example

Here, the @samp{s/old/new/g} tells @code{sed} to look for the regexp
@samp{old} on each input line, and replace it with the text @samp{new},
globally (i.e.@: all the occurrences on a line). This is similar to
@code{awk}'s @code{gsub} function
(@pxref{String Functions, , Built-in Functions for String Manipulation}).

The following program, @file{awksed.awk}, accepts at least two command line
arguments; the pattern to look for and the text to replace it with. Any
additional arguments are treated as data file names to process. If none
are provided, the standard input is used.

@cindex Brennan, Michael
@cindex @code{awksed}
@cindex simple stream editor
@cindex stream editor, simple

@example
@c @group
@c file eg/prog/awksed.awk
# awksed.awk --- do s/foo/bar/g using just print
# Thanks to Michael Brennan for the idea

# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
# August 1995

function usage()
@{
    print "usage: awksed pat repl [files...]" > "/dev/stderr"
    exit 1
@}

BEGIN @{
    # validate arguments
    if (ARGC < 3)
        usage()

    RS = ARGV[1]
    ORS = ARGV[2]

    # don't use arguments as files
    ARGV[1] = ARGV[2] = ""
@}

# look ma, no hands!
@{
    if (RT == "")
        printf "%s", $0
    else
        print
@}
@c endfile
@c @end group
@end example

The program relies on @code{gawk}'s ability to have @code{RS} be a regexp
and on the setting of @code{RT} to the actual text that terminated the
record (@pxref{Records, ,How Input is Split into Records}).

The idea is to have @code{RS} be the pattern to look for. @code{gawk}
will automatically set @code{$0} to the text between matches of the pattern.
This is text that we wish to keep, unmodified. Then, by setting @code{ORS}
to the replacement text, a simple @code{print} statement will output the
text we wish to keep, followed by the replacement text.

There is one wrinkle to this scheme: what to do if the last record
doesn't end with text that matches @code{RS}. Using a @code{print}
statement unconditionally prints the replacement text, which is not correct.

However, if the file did not end in text that matches @code{RS}, @code{RT}
will be set to the null string. In this case, we can print @code{$0} using
@code{printf}
(@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}).

The @code{BEGIN} rule handles the setup, checking for the right number
of arguments, and calling @code{usage} if there is a problem. Then it sets
@code{RS} and @code{ORS} from the command line arguments, and sets
@code{ARGV[1]} and @code{ARGV[2]} to the null string, so that they will
not be treated as file names
(@pxref{ARGC and ARGV, , Using @code{ARGC} and @code{ARGV}}).

The @code{usage} function prints an error message and exits.

Finally, the single rule handles the printing scheme outlined above,
using @code{print} or @code{printf} as appropriate, depending upon the
value of @code{RT}.
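
The scheme can be tried without the argument handling. The following is a
sketch, not a replacement for @file{awksed.awk}: it hard-codes the pattern
@samp{old} and the replacement @samp{new}, and it requires @code{gawk},
since it depends on @code{RT}.

@example
echo 'the old old way' | gawk '
BEGIN @{ RS = "old"; ORS = "new" @}
@{
    if (RT == "")        # last record: nothing matched after it
        printf "%s", $0
    else
        print            # prints $0 followed by ORS
@}'
@end example

@noindent
The input is read as three records: @samp{the }, a single space, and
@samp{ way} with its trailing newline. Only the first two are followed by
the replacement text, so the output is @samp{the new new way}.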

@ignore
Exercise, compare the performance of this version with the more
straightforward:

BEGIN {
pat = ARGV[1]
repl = ARGV[2]
ARGV[1] = ARGV[2] = ""
}

{ gsub(pat, repl); print }

Exercise: what are the advantages and disadvantages of this version vs. sed?
Advantage: egrep regexps
speed (?)
Disadvantage: no & in replacement text

Others?
@end ignore

@node Igawk Program, , Simple Sed, Miscellaneous Programs
@subsection An Easy Way to Use Library Functions

Using library functions in @code{awk} can be very beneficial. It
encourages code re-use and the writing of general functions. Programs are
smaller, and therefore clearer.
However, using library functions is only easy when writing @code{awk}
programs; it is painful when running them, requiring multiple @samp{-f}
options. If @code{gawk} is unavailable, then so too is the @code{AWKPATH}
environment variable and the ability to put @code{awk} functions into a
library directory (@pxref{Options, ,Command Line Options}).

It would be nice to be able to write programs like so:

@example
# library functions
@@include getopt.awk
@@include join.awk
@dots{}

# main program
BEGIN @{
    while ((c = getopt(ARGC, ARGV, "a:b:cde")) != -1)
        @dots{}
    @dots{}
@}
@end example

The following program, @file{igawk.sh}, provides this service.
It simulates @code{gawk}'s searching of the @code{AWKPATH} variable,
and also allows @dfn{nested} includes; i.e.@: a file that has been included
with @samp{@@include} can contain further @samp{@@include} statements.
@code{igawk} will make an effort to only include files once, so that nested
includes don't accidentally include a library function twice.

@code{igawk} should behave externally just like @code{gawk}. This means it
should accept all of @code{gawk}'s command line arguments, including the
ability to have multiple source files specified via @samp{-f}, and the
ability to mix command line and library source files.

The program is written using the POSIX Shell (@code{sh}) command language.
The way the program works is as follows:

@enumerate
@item
Loop through the arguments, saving anything that doesn't represent
@code{awk} source code for later, when the expanded program is run.

@item
For any arguments that do represent @code{awk} text, put the arguments into
a temporary file that will be expanded. There are two cases.

@enumerate a
@item
Literal text, provided with @samp{--source} or @samp{--source=}. This
text is just echoed directly. The @code{echo} program will automatically
supply a trailing newline.

@item
File names provided with @samp{-f}. We use a neat trick, and echo
@samp{@@include @var{filename}} into the temporary file. Since the file
inclusion program will work the way @code{gawk} does, this will get the text
of the file included into the program at the correct point.
@end enumerate

@item
Run an @code{awk} program (naturally) over the temporary file to expand
@samp{@@include} statements. The expanded program is placed in a second
temporary file.

@item
Run the expanded program with @code{gawk} and any other original command line
arguments that the user supplied (such as the data file names).
@end enumerate

The initial part of the program turns on shell tracing if the first
argument was @samp{debug}. Otherwise, a shell @code{trap} statement
arranges to clean up any temporary files on program exit or upon an
interrupt.

@c 2e: For the temp file handling, go with Darrel's ig=${TMP:-/tmp}/igs.$$
@c 2e: or something as similar as possible.

The next part loops through all the command line arguments.
There are several cases of interest.

@table @code

@item --
This ends the arguments to @code{igawk}. Anything else should be passed on
to the user's @code{awk} program without being evaluated.

@item -W
This indicates that the next option is specific to @code{gawk}. To make
argument processing easier, the @samp{-W} is prepended to the
remaining arguments and the loop continues. (This is an @code{sh}
programming trick. Don't worry about it if you are not familiar with
@code{sh}.)

@item -v
@itemx -F
These are saved and passed on to @code{gawk}.

@item -f
@itemx --file
@itemx --file=
@itemx -Wfile=
The file name is saved to the temporary file @file{/tmp/ig.s.$$} with an
@samp{@@include} statement.
The @code{sed} utility is used to remove the leading option part of the
argument (e.g., @samp{--file=}).

@item --source
@itemx --source=
@itemx -Wsource=
The source text is echoed into @file{/tmp/ig.s.$$}.

@iftex
@page
@end iftex
@item --version
@itemx -Wversion
@code{igawk} prints its version number, and runs @samp{gawk --version}
to get the @code{gawk} version information, and then exits.

@end table
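
The @samp{-W} case can be tried on its own. The following sketch sets up
two made-up arguments and then repeats the steps that case performs:

@example
set -- -W version datafile      # pretend these were the arguments
shift                           # drop the -W
set -- -W"$@@"                   # glue -W onto the next argument
echo "$1"
echo "$2"
@end example

@noindent
Because @samp{"$@@"} expands to one word per argument, only the first
remaining argument receives the @samp{-W} prefix; this prints
@samp{-Wversion} and then @samp{datafile}.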

If none of @samp{-f}, @samp{--file}, @samp{-Wfile}, @samp{--source},
or @samp{-Wsource}, were supplied, then the first non-option argument
should be the @code{awk} program. If there are no command line
arguments left, @code{igawk} prints an error message and exits.
Otherwise, the first argument is echoed into @file{/tmp/ig.s.$$}.

In any case, after the arguments have been processed,
@file{/tmp/ig.s.$$} contains the complete text of the original @code{awk}
program.

The @samp{$$} in @code{sh} represents the current process ID number.
It is often used in shell programs to generate unique temporary file
names. This allows multiple users to run @code{igawk} without worrying
that the temporary file names will clash.

@cindex @code{sed} utility
Here's the program:

@findex igawk.sh

@example
@c @group
@c file eg/prog/igawk.sh
#! /bin/sh

# igawk --- like gawk but do @@include processing
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
# July 1993

if [ "$1" = debug ]
then
    set -x
    shift
else
    # cleanup on exit, hangup, interrupt, quit, termination
    trap 'rm -f /tmp/ig.[se].$$' 0 1 2 3 15
fi

while [ $# -ne 0 ] # loop over arguments
do
    case $1 in
    --) shift; break;;

    -W) shift
        set -- -W"$@@"
        continue;;

    -[vF]) opts="$opts $1 '$2'"
        shift;;

    -[vF]*) opts="$opts '$1'" ;;

    -f) echo @@include "$2" >> /tmp/ig.s.$$
        shift;;

    -f*) f=`echo "$1" | sed 's/-f//'`
        echo @@include "$f" >> /tmp/ig.s.$$ ;;

    -?file=*) # -Wfile or --file
        f=`echo "$1" | sed 's/-.file=//'`
        echo @@include "$f" >> /tmp/ig.s.$$ ;;

    -?file) # get arg, $2
        echo @@include "$2" >> /tmp/ig.s.$$
        shift;;

    -?source=*) # -Wsource or --source
        t=`echo "$1" | sed 's/-.source=//'`
        echo "$t" >> /tmp/ig.s.$$ ;;

    -?source) # get arg, $2
        echo "$2" >> /tmp/ig.s.$$
        shift;;

    -?version)
        echo igawk: version 1.0 1>&2
        gawk --version
        exit 0 ;;

    -[W-]*) opts="$opts '$1'" ;;

    *) break;;
    esac
    shift
done

if [ ! -s /tmp/ig.s.$$ ]
then
    if [ -z "$1" ]
    then
        echo igawk: no program! 1>&2
        exit 1
    else
        echo "$1" > /tmp/ig.s.$$
        shift
    fi
fi

# at this point, /tmp/ig.s.$$ has the program
@c endfile
@c @end group
@end example

The @code{awk} program to process @samp{@@include} directives reads through
the program, one line at a time using @code{getline}
(@pxref{Getline, ,Explicit Input with @code{getline}}).
The input file names and @samp{@@include} statements are managed using a
stack. As each @samp{@@include} is encountered, the current file name is
``pushed'' onto the stack, and the file named in the @samp{@@include}
directive becomes
the current file name. As each file is finished, the stack is ``popped,''
and the previous input file becomes the current input file again.
The process is started by making the original file the first one on the
stack.

The @code{pathto} function does the work of finding the full path to a
file. It simulates @code{gawk}'s behavior when searching the @code{AWKPATH}
environment variable
(@pxref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}).
If a file name has a @samp{/} in it, no path search
is done. Otherwise, the file name is concatenated with the name of each
directory in the path, and an attempt is made to open the generated file
name. The only way in @code{awk} to test if a file can be read is to go
ahead and try to read it with @code{getline}; that is what @code{pathto}
does. If the file can be read, it is closed, and the file name is
returned.

16234

@ignore

16235

An alternative way to test for the file's existence would be to call

16236

@samp{system("test -r " t)}, which uses the @code{test} utility to

16237

see if the file exists and is readable. The disadvantage to this method

16238

is that it requires creating an extra process, and can thus be slightly

16239

slower.

16240

@end ignore

@example
@c @group
@c file eg/prog/igawk.sh
gawk -- '
# process @@include directives

function pathto(file, i, t, junk)
@{
    if (index(file, "/") != 0)
        return file

    for (i = 1; i <= ndirs; i++) @{
        t = (pathlist[i] "/" file)
        if ((getline junk < t) > 0) @{
            # found it
            close(t)
            return t
        @}
    @}
    return ""
@}
@c endfile
@c @end group
@end example
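
For comparison, the same search can be sketched in the shell, using the
@code{test} utility instead of @code{getline}. This is only an
illustration under stated assumptions: the function name @code{pathto_sh}
and its two-argument interface (file name, then colon-separated path) are
hypothetical and are not part of @code{igawk}. Unlike @code{pathto}, it
accepts an empty file, and shell field splitting does not report a
trailing null element in the path.

@example
# Hypothetical helper, not part of igawk: print the first readable
# match for file name $1 along the colon-separated search path $2.
pathto_sh () (
    file=$1
    case $file in
    */*) echo "$file"; exit 0 ;;    # contains a slash: no path search
    esac
    IFS=:
    set -f                          # no globbing while splitting the path
    for dir in $2
    do
        [ -z "$dir" ] && dir=.      # null element means current directory
        if [ -f "$dir/$file" ] && [ -r "$dir/$file" ]
        then
            echo "$dir/$file"
            exit 0
        fi
    done
    exit 1                          # not found anywhere on the path
)
@end example

@noindent
A call such as @samp{pathto_sh getopt.awk "$AWKPATH"} would print the
full name of the first match, or print nothing and return a non-zero
exit status if there is none.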

The main program is contained inside one @code{BEGIN} rule. The first thing it
does is set up the @code{pathlist} array that @code{pathto} uses. After
splitting the path on @samp{:}, null elements are replaced with @code{"."},
which represents the current directory.

@example
@c @group
@c file eg/prog/igawk.sh
BEGIN @{
    path = ENVIRON["AWKPATH"]
    ndirs = split(path, pathlist, ":")
    for (i = 1; i <= ndirs; i++) @{
        if (pathlist[i] == "")
            pathlist[i] = "."
    @}
@c endfile
@c @end group
@end example

The stack is initialized with @code{ARGV[1]}, which will be @file{/tmp/ig.s.$$}.
The main loop comes next. Input lines are read in succession. Lines that
do not start with @samp{@@include} are printed verbatim.

If the line does start with @samp{@@include}, the file name is in @code{$2}.
@code{pathto} is called to generate the full path. If it cannot, then we
print an error message and continue.

The next thing to check is if the file has been included already. The
@code{processed} array is indexed by the full file name of each included
file, and it tracks this information for us. If the file has been
seen, a warning message is printed. Otherwise, the new file name is
pushed onto the stack and processing continues.

Finally, when @code{getline} encounters the end of the input file, the file
is closed and the stack is popped. When @code{stackptr} is less than zero,
the program is done.

@example
@c @group
@c file eg/prog/igawk.sh
    stackptr = 0
    input[stackptr] = ARGV[1] # ARGV[1] is first file

    for (; stackptr >= 0; stackptr--) @{
        while ((getline < input[stackptr]) > 0) @{
            if (tolower($1) != "@@include") @{
                print
                continue
            @}
            fpath = pathto($2)
            if (fpath == "") @{
                printf("igawk:%s:%d: cannot find %s\n", \
                    input[stackptr], FNR, $2) > "/dev/stderr"
                continue
            @}
@group
            if (! (fpath in processed)) @{
                processed[fpath] = input[stackptr]
                input[++stackptr] = fpath
            @} else
                print $2, "included in", input[stackptr], \
                    "already included in", \
                    processed[fpath] > "/dev/stderr"
        @}
@end group
@group
        close(input[stackptr])
    @}
@}' /tmp/ig.s.$$ > /tmp/ig.e.$$
@end group
@c endfile
@c @end group
@end example

The last step is to call @code{gawk} with the expanded program and the original
options and command line arguments that the user supplied. @code{gawk}'s
exit status is passed back on to @code{igawk}'s calling program.

@c this causes more problems than it solves, so leave it out.
@ignore
The special file @file{/dev/null} is passed as a data file to @code{gawk}
to handle an interesting case. Suppose that the user's program only has
a @code{BEGIN} rule, and there are no data files to read. The program
should exit without reading any data
files. However, suppose that an included library file defines an @code{END}
rule of its own. In this case, @code{gawk} will hang, reading standard
input. In order to avoid this, @file{/dev/null} is explicitly added to the
command line. Reading from @file{/dev/null} always returns an immediate
end of file indication.

@c Hmm. Add /dev/null if $# is 0? Still messes up ARGV. Sigh.
@end ignore

@example
@c @group
@c file eg/prog/igawk.sh
eval gawk -f /tmp/ig.e.$$ $opts -- "$@@"

exit $?
@c endfile
@c @end group
@end example

This version of @code{igawk} represents my third attempt at this program.
There are three key simplifications that made the program work better.

@enumerate
@item
Using @samp{@@include} even for the files named with @samp{-f} makes building
the initial collected @code{awk} program much simpler; all the
@samp{@@include} processing can be done once.

@item
The @code{pathto} function doesn't try to save the line read with
@code{getline} when testing for the file's accessibility. Trying to save
this line for use with the main program complicates things considerably.
@c what problem does this engender though - exercise
@c answer, reading from "-" or /dev/stdin

@item
Using a @code{getline} loop in the @code{BEGIN} rule does it all in one
place. It is not necessary to call out to a separate loop for processing
nested @samp{@@include} statements.
@end enumerate

Also, this program illustrates that it is often worthwhile to combine
@code{sh} and @code{awk} programming. You can usually accomplish
quite a lot without having to resort to low-level programming in C or C++, and it
is frequently easier to do certain kinds of string and argument manipulation
using the shell than it is in @code{awk}.

Finally, @code{igawk} shows that it is not always necessary to add new
features to a program; they can often be layered on top. With @code{igawk},
there is no real reason to build @samp{@@include} processing into
@code{gawk} itself.

As an additional example of this, consider the idea of having two
files in a directory in the search path.

@table @file
@item default.awk
This file would contain a set of default library functions, such
as @code{getopt} and @code{assert}.

@item site.awk
This file would contain library functions that are specific to a site or
installation, i.e.@: locally developed functions.
Having a separate file allows @file{default.awk} to change with
new @code{gawk} releases, without requiring the system administrator to
update it each time by adding the local functions.
@end table

One user
@c Karl Berry, karl@ileaf.com, 10/95
suggested that @code{gawk} be modified to automatically read these files
upon startup. Instead, it would be very simple to modify @code{igawk}
to do this. Since @code{igawk} can process nested @samp{@@include}
directives, @file{default.awk} could simply contain @samp{@@include}
statements for the desired library functions.

@c Exercise: make this change

@node Language History, Gawk Summary, Sample Programs, Top
@chapter The Evolution of the @code{awk} Language

This @value{DOCUMENT} describes the GNU implementation of @code{awk}, which follows
the POSIX specification. Many @code{awk} users are only familiar
with the original @code{awk} implementation in Version 7 Unix.
(This implementation was the basis for @code{awk} in Berkeley Unix,
through 4.3--Reno. The 4.4 release of Berkeley Unix uses @code{gawk} 2.15.2
for its version of @code{awk}.) This chapter briefly describes the
evolution of the @code{awk} language, with cross references to other parts
of the @value{DOCUMENT} where you can find more information.

@menu
* V7/SVR3.1::                   The major changes between V7 and System V
                                Release 3.1.
* SVR4::                        Minor changes between System V Releases 3.1
                                and 4.
* POSIX::                       New features from the POSIX standard.
* BTL::                         New features from the AT&T Bell Laboratories
                                version of @code{awk}.
* POSIX/GNU::                   The extensions in @code{gawk} not in POSIX
                                @code{awk}.
@end menu

@node V7/SVR3.1, SVR4, Language History, Language History
@section Major Changes between V7 and SVR3.1

The @code{awk} language evolved considerably between the release of
Version 7 Unix (1978) and the new version first made generally available in
System V Release 3.1 (1987). This section summarizes the changes, with
cross-references to further details.

@itemize @bullet
@item
The requirement for @samp{;} to separate rules on a line
(@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}).

@item
User-defined functions, and the @code{return} statement
(@pxref{User-defined, ,User-defined Functions}).

@item
The @code{delete} statement (@pxref{Delete, ,The @code{delete} Statement}).

@item
The @code{do}-@code{while} statement
(@pxref{Do Statement, ,The @code{do}-@code{while} Statement}).

@item
The built-in functions @code{atan2}, @code{cos}, @code{sin}, @code{rand} and
@code{srand} (@pxref{Numeric Functions, ,Numeric Built-in Functions}).

@item
The built-in functions @code{gsub}, @code{sub}, and @code{match}
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).

@item
The built-in functions @code{close} and @code{system}
(@pxref{I/O Functions, ,Built-in Functions for Input/Output}).

@item
The @code{ARGC}, @code{ARGV}, @code{FNR}, @code{RLENGTH}, @code{RSTART},
and @code{SUBSEP} built-in variables (@pxref{Built-in Variables}).

@item
The conditional expression using the ternary operator @samp{?:}
(@pxref{Conditional Exp, ,Conditional Expressions}).

@item
The exponentiation operator @samp{^}
(@pxref{Arithmetic Ops, ,Arithmetic Operators}) and its assignment operator
form @samp{^=} (@pxref{Assignment Ops, ,Assignment Expressions}).

@item
C-compatible operator precedence, which breaks some old @code{awk}
programs (@pxref{Precedence, ,Operator Precedence (How Operators Nest)}).

@item
Regexps as the value of @code{FS}
(@pxref{Field Separators, ,Specifying How Fields are Separated}), and as the
third argument to the @code{split} function
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).

@item
Dynamic regexps as operands of the @samp{~} and @samp{!~} operators
(@pxref{Regexp Usage, ,How to Use Regular Expressions}).

@item
The escape sequences @samp{\b}, @samp{\f}, and @samp{\r}
(@pxref{Escape Sequences}).
(Some vendors have updated their old versions of @code{awk} to
recognize @samp{\r}, @samp{\b}, and @samp{\f}, but this is not
something you can rely on.)

@item
Redirection of input for the @code{getline} function
(@pxref{Getline, ,Explicit Input with @code{getline}}).

@item
Multiple @code{BEGIN} and @code{END} rules
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}).

@item
Multi-dimensional arrays
(@pxref{Multi-dimensional, ,Multi-dimensional Arrays}).
@end itemize

@node SVR4, POSIX, V7/SVR3.1, Language History
@section Changes between SVR3.1 and SVR4

@cindex @code{awk} language, V.4 version
The System V Release 4 version of Unix @code{awk} added these features
(some of which originated in @code{gawk}):

@itemize @bullet
@item
The @code{ENVIRON} variable (@pxref{Built-in Variables}).

@item
Multiple @samp{-f} options on the command line
(@pxref{Options, ,Command Line Options}).

@item
The @samp{-v} option for assigning variables before program execution begins
(@pxref{Options, ,Command Line Options}).

@item
The @samp{--} option for terminating command line options.

@item
The @samp{\a}, @samp{\v}, and @samp{\x} escape sequences
(@pxref{Escape Sequences}).

@item
A defined return value for the @code{srand} built-in function
(@pxref{Numeric Functions, ,Numeric Built-in Functions}).

@item
The @code{toupper} and @code{tolower} built-in string functions
for case translation
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).

@item
A cleaner specification for the @samp{%c} format-control letter in the
@code{printf} function
(@pxref{Control Letters, ,Format-Control Letters}).

@item
The ability to dynamically pass the field width and precision (@code{"%*.*d"})
in the argument list of the @code{printf} function
(@pxref{Control Letters, ,Format-Control Letters}).

@item
The use of regexp constants such as @code{/foo/} as expressions, where
they are equivalent to using the matching operator, as in @samp{$0 ~ /foo/}
(@pxref{Using Constant Regexps, ,Using Regular Expression Constants}).
@end itemize

@node POSIX, BTL, SVR4, Language History
@section Changes between SVR4 and POSIX @code{awk}

The POSIX Command Language and Utilities standard for @code{awk}
introduced the following changes into the language:

@itemize @bullet
@item
The use of @samp{-W} for implementation-specific options.

@item
The use of @code{CONVFMT} for controlling the conversion of numbers
to strings (@pxref{Conversion, ,Conversion of Strings and Numbers}).

@item
The concept of a numeric string, and tighter comparison rules to go
with it (@pxref{Typing and Comparison, ,Variable Typing and Comparison Expressions}).

@item
More complete documentation of many of the previously undocumented
features of the language.
@end itemize

The following common extensions are not permitted by the POSIX
standard:

@c IMPORTANT! Keep this list in sync with the one in node Options

@itemize @bullet
@item
@code{\x} escape sequences are not recognized
(@pxref{Escape Sequences}).

@item
The synonym @code{func} for the keyword @code{function} is not
recognized (@pxref{Definition Syntax, ,Function Definition Syntax}).

@item
The operators @samp{**} and @samp{**=} cannot be used in
place of @samp{^} and @samp{^=} (@pxref{Arithmetic Ops, ,Arithmetic Operators},
and also @pxref{Assignment Ops, ,Assignment Expressions}).

@item
Specifying @samp{-Ft} on the command line does not set the value
of @code{FS} to be a single tab character
(@pxref{Field Separators, ,Specifying How Fields are Separated}).

@item
The @code{fflush} built-in function is not supported
(@pxref{I/O Functions, , Built-in Functions for Input/Output}).
@end itemize

@node BTL, POSIX/GNU, POSIX, Language History
@section Extensions in the AT&T Bell Laboratories @code{awk}

@cindex Kernighan, Brian
Brian Kernighan, one of the original designers of Unix @code{awk},
has made his version available via anonymous @code{ftp}
(@pxref{Other Versions, ,Other Freely Available @code{awk} Implementations}).
This section describes extensions in his version of @code{awk} that are
not in POSIX @code{awk}.

@itemize @bullet
@item
The @samp{-mf=@var{NNN}} and @samp{-mr=@var{NNN}} command line options
to set the maximum number of fields, and the maximum
record size, respectively
(@pxref{Options, ,Command Line Options}).

@item
The @code{fflush} built-in function for flushing buffered output
(@pxref{I/O Functions, ,Built-in Functions for Input/Output}).

@ignore
@item
The @code{SYMTAB} array, that allows access to the internal symbol
table of @code{awk}. This feature is not documented, largely because
it is somewhat shakily implemented. For instance, you cannot access arrays
or array elements through it.
@end ignore
@end itemize

@node POSIX/GNU, , BTL, Language History
@section Extensions in @code{gawk} Not in POSIX @code{awk}

@cindex compatibility mode
The GNU implementation, @code{gawk}, adds a number of features.
This section lists them in the order they were added to @code{gawk}.
They can all be disabled with either the @samp{--traditional} or
@samp{--posix} option
(@pxref{Options, ,Command Line Options}).

Version 2.10 of @code{gawk} introduced these features:

@itemize @bullet
@item
The @code{AWKPATH} environment variable for specifying a path search for
the @samp{-f} command line option
(@pxref{Options, ,Command Line Options}).

@item
The @code{IGNORECASE} variable and its effects
(@pxref{Case-sensitivity, ,Case-sensitivity in Matching}).

@item
The @file{/dev/stdin}, @file{/dev/stdout}, @file{/dev/stderr}, and
@file{/dev/fd/@var{n}} file name interpretation
(@pxref{Special Files, ,Special File Names in @code{gawk}}).
@end itemize

Version 2.13 of @code{gawk} introduced these features:

@itemize @bullet
@item
The @code{FIELDWIDTHS} variable and its effects
(@pxref{Constant Size, ,Reading Fixed-width Data}).

@item
The @code{systime} and @code{strftime} built-in functions for obtaining
and printing time stamps
(@pxref{Time Functions, ,Functions for Dealing with Time Stamps}).

@item
The @samp{-W lint} option to provide source code and run time error
and portability checking
(@pxref{Options, ,Command Line Options}).

@item
The @samp{-W compat} option to turn off these extensions
(@pxref{Options, ,Command Line Options}).

@item
The @samp{-W posix} option for full POSIX compliance
(@pxref{Options, ,Command Line Options}).
@end itemize

Version 2.14 of @code{gawk} introduced these features:

@itemize @bullet
@item
The @code{next file} statement for skipping to the next data file
(@pxref{Nextfile Statement, ,The @code{nextfile} Statement}).
@end itemize

Version 2.15 of @code{gawk} introduced these features:

@itemize @bullet
@item
The @code{ARGIND} variable, which tracks the movement of @code{FILENAME}
through @code{ARGV} (@pxref{Built-in Variables}).

@item
The @code{ERRNO} variable, which contains the system error message when
@code{getline} returns @minus{}1, or when @code{close} fails
(@pxref{Built-in Variables}).

@item
The ability to use GNU-style long named options that start with @samp{--}
(@pxref{Options, ,Command Line Options}).

@item
The @samp{--source} option for mixing command line and library
file source code
(@pxref{Options, ,Command Line Options}).

@item
The @file{/dev/pid}, @file{/dev/ppid}, @file{/dev/pgrpid}, and
@file{/dev/user} file name interpretation
(@pxref{Special Files, ,Special File Names in @code{gawk}}).
@end itemize

Version 3.0 of @code{gawk} introduced these features:

@itemize @bullet
@item
The @code{next file} statement became @code{nextfile}
(@pxref{Nextfile Statement, ,The @code{nextfile} Statement}).

@item
The @samp{--lint-old} option to
warn about constructs that are not available in
the original Version 7 Unix version of @code{awk}
(@pxref{V7/SVR3.1, , Major Changes between V7 and SVR3.1}).

@item
The @samp{--traditional} option was added as a better name for
@samp{--compat} (@pxref{Options, ,Command Line Options}).

@item
The ability for @code{FS} to be a null string, and for the third
argument to @code{split} to be the null string
(@pxref{Single Character Fields, , Making Each Character a Separate Field}).

@item
The ability for @code{RS} to be a regexp
(@pxref{Records, , How Input is Split into Records}).

@item
The @code{RT} variable
(@pxref{Records, , How Input is Split into Records}).

@item
The @code{gensub} function for more powerful text manipulation
(@pxref{String Functions, , Built-in Functions for String Manipulation}).

@item
The @code{strftime} function acquired a default time format,
allowing it to be called with no arguments
(@pxref{Time Functions, , Functions for Dealing with Time Stamps}).

@item
Full support for both POSIX and GNU regexps
(@pxref{Regexp, , Regular Expressions}).

@item
The @samp{--re-interval} option to provide interval expressions in regexps
(@pxref{Regexp Operators, , Regular Expression Operators}).

@item
@code{IGNORECASE} changed, now applying to string comparison as well
as regexp operations
(@pxref{Case-sensitivity, ,Case-sensitivity in Matching}).

@item
The @samp{-m} option and the @code{fflush} function from the
Bell Labs research version of @code{awk}
(@pxref{Options, ,Command Line Options}; also
@pxref{I/O Functions, ,Built-in Functions for Input/Output}).

@item
The use of GNU Autoconf to control the configuration process
(@pxref{Quick Installation, , Compiling @code{gawk} for Unix}).

@item
Amiga support
(@pxref{Amiga Installation, ,Installing @code{gawk} on an Amiga}).

@c XXX ADD MORE STUFF HERE

@end itemize

@node Gawk Summary, Installation, Language History, Top
@appendix @code{gawk} Summary

This appendix provides a brief summary of the @code{gawk} command line and the
@code{awk} language. It is designed to serve as a ``quick reference.'' It is
therefore terse, but complete.

@menu
* Command Line Summary::        Recapitulation of the command line.
* Language Summary::            A terse review of the language.
* Variables/Fields::            Variables, fields, and arrays.
* Rules Summary::               Patterns and Actions, and their component
                                parts.
* Actions Summary::             Quick overview of actions.
* Functions Summary::           Defining and calling functions.
* Historical Features::         Some undocumented but supported ``features''.
@end menu

@node Command Line Summary, Language Summary, Gawk Summary, Gawk Summary
@appendixsec Command Line Options Summary

The command line consists of options to @code{gawk} itself, the
@code{awk} program text (if not supplied via the @samp{-f} option), and
values to be made available in the @code{ARGC} and @code{ARGV}
predefined @code{awk} variables:

@example
gawk @r{[@var{POSIX or GNU style options}]} -f @var{source-file} @r{[@code{--}]} @var{file} @dots{}
gawk @r{[@var{POSIX or GNU style options}]} @r{[@code{--}]} '@var{program}' @var{file} @dots{}
@end example

The options that @code{gawk} accepts are:

@table @code
@item -F @var{fs}
@itemx --field-separator @var{fs}
Use @var{fs} for the input field separator (the value of the @code{FS}
predefined variable).

@item -f @var{program-file}
@itemx --file @var{program-file}
Read the @code{awk} program source from the file @var{program-file}, instead
of from the first command line argument.

@item -mf=@var{NNN}
@itemx -mr=@var{NNN}
The @samp{f} flag sets
the maximum number of fields, and the @samp{r} flag sets the maximum
record size. These options are ignored by @code{gawk}, since @code{gawk}
has no predefined limits; they are only for compatibility with the
Bell Labs research version of Unix @code{awk}.

@item -v @var{var}=@var{val}
@itemx --assign @var{var}=@var{val}
Assign the variable @var{var} the value @var{val} before program execution
begins.

@item -W traditional
@itemx -W compat
@itemx --traditional
@itemx --compat
Use compatibility mode, in which @code{gawk} extensions are turned
off.

@item -W copyleft
@itemx -W copyright
@itemx --copyleft
@itemx --copyright
Print the short version of the General Public License on the error
output. This option may disappear in a future version of @code{gawk}.

@item -W help
@itemx -W usage
@itemx --help
@itemx --usage
Print a relatively short summary of the available options on the error output.

@item -W lint
@itemx --lint
Give warnings about dubious or non-portable @code{awk} constructs.

@item -W lint-old
@itemx --lint-old
Warn about constructs that are not available in
the original Version 7 Unix version of @code{awk}.

@item -W posix
@itemx --posix
Use POSIX compatibility mode, in which @code{gawk} extensions
are turned off and additional restrictions apply.

@item -W re-interval
@itemx --re-interval
Allow interval expressions
(@pxref{Regexp Operators, , Regular Expression Operators}),
in regexps.

@item -W source=@var{program-text}
@itemx --source @var{program-text}
Use @var{program-text} as @code{awk} program source code. This option allows
mixing command line source code with source code from files, and is
particularly useful for mixing command line programs with library functions.

@item -W version
@itemx --version
Print version information for this particular copy of @code{gawk} on the error
output.

@item --
Signal the end of options. This is useful to allow further arguments to the
@code{awk} program itself to start with a @samp{-}. This is mainly for
consistency with POSIX argument parsing conventions.
@end table

Any other options are flagged as invalid, but are otherwise ignored.
@xref{Options, ,Command Line Options}, for more details.
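
As a brief illustration of two of the most commonly used options, the
following sketch uses @samp{-F} to set the field separator and @samp{-v}
to assign a variable before the program starts. The input data and the
variable name @code{min} are made up for this example; the command uses
only POSIX features, so any POSIX @code{awk} should behave the same way
as @code{gawk} here.

@example
printf 'root:0\nuser:1000\n' | awk -F: -v min=100 '$2 >= min'
@end example

@noindent
Only the second input line has a second field that is at least 100,
so only @samp{user:1000} is printed.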

@node Language Summary, Variables/Fields, Command Line Summary, Gawk Summary
@appendixsec Language Summary

An @code{awk} program consists of a sequence of zero or more pattern-action
statements and optional function definitions. One or the other of the
pattern and action may be omitted.

@example
@var{pattern} @{ @var{action statements} @}
@var{pattern}
@{ @var{action statements} @}

function @var{name}(@var{parameter list}) @{ @var{action statements} @}
@end example

@code{gawk} first reads the program source from the
@var{program-file}(s), if specified, or from the first non-option
argument on the command line. The @samp{-f} option may be used multiple
times on the command line. @code{gawk} reads the program text from all
the @var{program-file} files, effectively concatenating them in the
order they are specified. This is useful for building libraries of
@code{awk} functions, without having to include them in each new
@code{awk} program that uses them. To use a library function in a file
from a program typed in on the command line, specify
@samp{--source '@var{program}'}, and type your program in between the single
quotes.
@xref{Options, ,Command Line Options}.

The environment variable @code{AWKPATH} specifies a search path to use
when finding source files named with the @samp{-f} option. The default
path, which is
@samp{.:/usr/local/share/awk}@footnote{The path may use a directory
other than @file{/usr/local/share/awk}, depending upon how @code{gawk}
was built and installed.} is used if @code{AWKPATH} is not set.
If a file name given to the @samp{-f} option contains a @samp{/} character,
no path search is performed.
@xref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}.

@code{gawk} compiles the program into an internal form, and then proceeds to
read each file named in the @code{ARGV} array.
The initial values of @code{ARGV} come from the command line arguments.
If there are no files named
on the command line, @code{gawk} reads the standard input.

If a ``file'' named on the command line has the form
@samp{@var{var}=@var{val}}, it is treated as a variable assignment: the
variable @var{var} is assigned the value @var{val}.
If any of the files have a value that is the null string, that
element in the list is skipped.
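
A small sketch of such a command line assignment follows. The variable
name @code{n} and the input data are made up for this example, and the
@samp{-} operand stands for the standard input. The assignment is
performed when @code{awk} reaches it in the argument list, which here
is before any input is read.

@example
printf 'one\ntwo\nthree\n' | awk 'NR == n' n=2 -
@end example

@noindent
This prints @samp{two}, the second input record.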

For each record in the input, @code{gawk} tests to see if it matches any
@var{pattern} in the @code{awk} program. For each pattern that the record
matches, the associated @var{action} is executed.

@node Variables/Fields, Rules Summary, Language Summary, Gawk Summary
@appendixsec Variables and Fields

@code{awk} variables are not declared; they come into existence when they are
first used. Their values are either floating-point numbers or strings.
@code{awk} also has one-dimensional arrays; multiple-dimensional arrays
may be simulated. There are several predefined variables that
@code{awk} sets as a program runs; these are summarized below.

@menu
* Fields Summary::              Input field splitting.
* Built-in Summary::            @code{awk}'s built-in variables.
* Arrays Summary::              Using arrays.
* Data Type Summary::           Values in @code{awk} are numbers or strings.
@end menu

17013

17014

@node Fields Summary, Built-in Summary, Variables/Fields, Variables/Fields

17015

@appendixsubsec Fields

17016

17017

As each input line is read, @code{gawk} splits the line into

17018

@var{fields}, using the value of the @code{FS} variable as the field

17019

separator. If @code{FS} is a single character, fields are separated by

17020

that character. Otherwise, @code{FS} is expected to be a full regular

17021

expression. In the special case that @code{FS} is a single space,

17022

fields are separated by runs of spaces and/or tabs.

17023

If @code{FS} is the null string (@code{""}), then each individual

17024

character in the record becomes a separate field.

17025

Note that the value

17026

of @code{IGNORECASE} (@pxref{Case-sensitivity, ,Case-sensitivity in Matching})

17027

also affects how fields are split when @code{FS} is a regular expression.

17028

17029

Each field in the input line may be referenced by its position, @code{$1},

17030

@code{$2}, and so on. @code{$0} is the whole line. The value of a field may

17031

be assigned to as well. Field numbers need not be constants:

17032

17033

@example
n = 5
print $n
@end example

17037

17038

@noindent

17039

prints the fifth field in the input line. The variable @code{NF} is set to

17040

the total number of fields in the input line.

17041

17042

References to non-existent fields (i.e.@: fields after @code{$NF}) return

17043

the null string. However, assigning to a non-existent field (e.g.,

17044

@code{$(NF+2) = 5}) increases the value of @code{NF}, creates any

17045

intervening fields with the null string as their value, and causes the

17046

value of @code{$0} to be recomputed, with the fields being separated by

17047

the value of @code{OFS}.

17048

@xref{Reading Files, ,Reading Input Files}.
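
As a minimal sketch of the assignment behavior described above, the
following program assigns to a field two past the end of a record.  The
sample record and the choice of @samp{-} for @code{OFS} are illustrative
assumptions, not taken from the main text:

@example
BEGIN @{
    OFS = "-"
    $0 = "a b c"      # split into three fields using the default FS
    $(NF+2) = 5       # creates an empty $4 and sets $5
    print             # prints "a-b-c--5": $0 was rebuilt with OFS
    print NF          # prints 5
@}
@end example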

17049

17050

@node Built-in Summary, Arrays Summary, Fields Summary, Variables/Fields

17051

@appendixsubsec Built-in Variables

17052

17053

@code{gawk}'s built-in variables are:

17054

17055

@table @code

17056

@item ARGC

17057

The number of elements in @code{ARGV}. See below for what is actually

17058

included in @code{ARGV}.

17059

17060

@item ARGIND

17061

The index in @code{ARGV} of the current file being processed.

17062

When @code{gawk} is processing the input data files,

17063

it is always true that @samp{FILENAME == ARGV[ARGIND]}.

17064

17065

@item ARGV

17066

The array of command line arguments. The array is indexed from zero to

17067

@code{ARGC} @minus{} 1. Dynamically changing @code{ARGC} and

17068

the contents of @code{ARGV}

17069

can control the files used for data. A null-valued element in

17070

@code{ARGV} is ignored. @code{ARGV} does not include the options to

17071

@code{awk} or the text of the @code{awk} program itself.

17072

17073

@item CONVFMT

17074

The conversion format to use when converting numbers to strings.

17075

17076

@item FIELDWIDTHS

17077

A space-separated list of numbers giving the widths of the fields in
fixed-width input data.

17078

17079

@item ENVIRON

17080

An array of environment variable values. The array

17081

is indexed by variable name, each element being the value of that

17082

variable. Thus, the environment variable @code{HOME} is

17083

@code{ENVIRON["HOME"]}. One possible value might be @file{/home/arnold}.

17084

17085

Changing this array does not affect the environment seen by programs

17086

which @code{gawk} spawns via redirection or the @code{system} function.

17087

(This may change in a future version of @code{gawk}.)

17088

17089

Some operating systems do not have environment variables.

17090

The @code{ENVIRON} array is empty when running on these systems.

17091

17092

@item ERRNO

17093

The system error message when an error occurs using @code{getline}

17094

or @code{close}.

17095

17096

@item FILENAME

17097

The name of the current input file. If no files are specified on the command

17098

line, the value of @code{FILENAME} is the null string.

17099

17100

@item FNR

17101

The input record number in the current input file.

17102

17103

@item FS

17104

The input field separator, a space by default.

17105

17106

@item IGNORECASE

17107

The case-sensitivity flag for string comparisons and regular expression

17108

operations. If @code{IGNORECASE} has a non-zero value, then pattern

17109

matching in rules, record separating with @code{RS}, field splitting

17110

with @code{FS}, regular expression matching with @samp{~} and

17111

@samp{!~}, and the @code{gensub}, @code{gsub}, @code{index},

17112

@code{match}, @code{split} and @code{sub} built-in functions all

17113

ignore case when doing regular expression operations, and all string

17114

comparisons are done ignoring case.

17115

17116

@item NF

17117

The number of fields in the current input record.

17118

17119

@item NR

17120

The total number of input records seen so far.

17121

17122

@item OFMT

17123

The output format for numbers for the @code{print} statement,

17124

@code{"%.6g"} by default.

17125

17126

@item OFS

17127

The output field separator, a space by default.

17128

17129

@item ORS

17130

The output record separator, by default a newline.

17131

17132

@item RS

17133

The input record separator, by default a newline.

17134

If @code{RS} is set to the null string, then records are separated by

17135

blank lines. When @code{RS} is set to the null string, then the newline

17136

character always acts as a field separator, in addition to whatever value

17137

@code{FS} may have. If @code{RS} is set to a multi-character

17138

string, it denotes a regexp; input text matching the regexp

17139

separates records.

17140

17141

@item RT

17142

The input text that matched the text denoted by @code{RS},

17143

the record separator.

17144

17145

@item RSTART

17146

The index of the first character last matched by @code{match}; zero if no match.

17147

17148

@item RLENGTH

17149

The length of the string last matched by @code{match}; @minus{}1 if no match.

17150

17151

@item SUBSEP

17152

The string used to separate multiple subscripts in array elements, by

17153

default @code{"\034"}.

17154

@end table

17155

17156

@xref{Built-in Variables}, for more information.
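
The following sketch shows a few of these variables in use.  The strings
passed to @code{match} are arbitrary sample data, and the contents of
@code{ARGV} depend on how the program is invoked:

@example
BEGIN @{
    # List the command line arguments; ARGV[0] is typically
    # the name of the command itself.
    for (i = 0; i < ARGC; i++)
        print i, ARGV[i]

    # match sets RSTART and RLENGTH as a side effect.
    match("foobar", /o+/)
    print RSTART, RLENGTH      # prints "2 2"
    match("foobar", /z/)
    print RSTART, RLENGTH      # prints "0 -1"
@}
@end example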

17157

17158

@node Arrays Summary, Data Type Summary, Built-in Summary, Variables/Fields

17159

@appendixsubsec Arrays

17160

17161

Arrays are subscripted with an expression between square brackets

17162

(@samp{[} and @samp{]}). Array subscripts are @emph{always} strings;

17163

numbers are converted to strings as necessary, following the standard

17164

conversion rules

17165

(@pxref{Conversion, ,Conversion of Strings and Numbers}).

17166

17167

If you use multiple expressions separated by commas inside the square

17168

brackets, then the array subscript is a string consisting of the

17169

concatenation of the individual subscript values, converted to strings,

17170

separated by the subscript separator (the value of @code{SUBSEP}).

17171

17172

The special operator @code{in} may be used in a conditional context

17173

to see if an array has an index consisting of a particular value.

17174

17175

@example
if (val in array)
    print array[val]
@end example

17179

17180

If the array has multiple subscripts, use @samp{(i, j, @dots{}) in @var{array}}

17181

to test for existence of an element.

17182

17183

The @code{in} construct may also be used in a @code{for} loop to iterate

17184

over all the elements of an array.

17185

@xref{Scanning an Array, ,Scanning All Elements of an Array}.

17186

17187

You can remove an element from an array using the @code{delete} statement.

17188

17189

You can clear an entire array using @samp{delete @var{array}}.

17190

17191

@xref{Arrays, ,Arrays in @code{awk}}.
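
The points above can be sketched in one short program.  The array name
@code{grid} and its contents are hypothetical, chosen only for
illustration:

@example
BEGIN @{
    grid[1, 2] = "x"               # subscript is "1" SUBSEP "2"
    if ((1, 2) in grid)
        print "found"              # prints "found"
    for (key in grid) @{
        split(key, part, SUBSEP)   # recover the individual subscripts
        print part[1], part[2]     # prints "1 2"
    @}
    delete grid[1, 2]              # remove one element
    if (!((1, 2) in grid))
        print "gone"               # prints "gone"
@}
@end example

@noindent
Because the subscripts are joined into a single string, a subscript
value that itself contains @code{SUBSEP} cannot be split apart reliably
this way.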

17192

17193

@node Data Type Summary, , Arrays Summary, Variables/Fields

17194

@appendixsubsec Data Types

17195

17196

The value of an @code{awk} expression is always either a number

17197

or a string.

17198

17199

Some contexts (such as arithmetic operators) require numeric

17200

values. They convert strings to numbers by interpreting the text

17201

of the string as a number. If the string does not look like a

17202

number, it converts to zero.

17203

17204

Other contexts (such as concatenation) require string values.

17205

They convert numbers to strings by effectively printing them

17206

with @code{sprintf}.

17207

@xref{Conversion, ,Conversion of Strings and Numbers}, for the details.

17208

17209

To force conversion of a string value to a number, simply add zero

17210

to it. If the value you start with is already a number, this

17211

does not change it.

17212

17213

To force conversion of a numeric value to a string, concatenate it with

17214

the null string.

17215

17216

Comparisons are done numerically if both operands are numeric, or if

17217

one is numeric and the other is a numeric string. Otherwise one or

17218

both operands are converted to strings and a string comparison is

17219

performed. Fields, @code{getline} input, @code{FILENAME}, @code{ARGV}

17220

elements, @code{ENVIRON} elements and the elements of an array created

17221

by @code{split} are the only items that can be numeric strings. String
constants, such as @code{"3.1415927"}, are never numeric strings, even
when they look like numbers. The full rules for comparisons are described in

17224

@ref{Typing and Comparison, ,Variable Typing and Comparison Expressions}.

17225

17226

Uninitialized variables have the string value @code{""} (the null, or

17227

empty, string). In contexts where a number is required, this is

17228

equivalent to zero.

17229

17230

@xref{Variables}, for more information on variable naming and initialization;

17231

@pxref{Conversion, ,Conversion of Strings and Numbers}, for more information

17232

on how variable values are interpreted.
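
A short sketch of these conversion and comparison rules follows.  The
variable names are arbitrary, and @code{unset} is simply a variable that
is never assigned:

@example
BEGIN @{
    s = "3.0"
    n = s + 0              # adding zero forces a number
    t = n ""               # concatenating "" forces a string
    print n, t             # prints "3 3"
    print ("10" < "9")     # two string constants: string comparison, prints 1
    print (10 < 9)         # two numbers: numeric comparison, prints 0
    if (unset == 0 && unset == "")
        print "unset acts as both zero and the null string"
@}
@end example

@noindent
The comparisons are parenthesized because a bare @samp{>} in a
@code{print} statement is taken as output redirection; parenthesizing
@samp{<} as well keeps the two lines consistent.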

17233

17234

@node Rules Summary, Actions Summary, Variables/Fields, Gawk Summary

17235

@appendixsec Patterns

17236

17237

@menu

17238

* Pattern Summary:: Quick overview of patterns.

17239

* Regexp Summary:: Quick overview of regular expressions.

17240

@end menu

17241

17242

An @code{awk} program is mostly composed of rules, each consisting of a

17243

pattern followed by an action. The action is enclosed in @samp{@{} and

17244

@samp{@}}. Either the pattern may be missing, or the action may be

17245

missing, but not both. If the pattern is missing, the

17246

action is executed for every input record. A missing action is

17247

equivalent to @samp{@w{@{ print @}}}, which prints the entire line.

17248

17249

@c These paragraphs repeated for both patterns and actions. I don't

17250

@c like this, but I also don't see any way around it. Update both copies

17251

@c if they need fixing.

17252

Comments begin with the @samp{#} character, and continue until the end of the

17253

line. Blank lines may be used to separate statements. Statements normally

17254

end with a newline; however, this is not the case for lines ending in a

17255

@samp{,}, @samp{@{}, @samp{?}, @samp{:}, @samp{&&}, or @samp{||}. Lines

17256

ending in @code{do} or @code{else} also have their statements automatically

17257

continued on the following line. In other cases, a line can be continued by

17258

ending it with a @samp{\}, in which case the newline is ignored.

17259

17260

Multiple statements may be put on one line by separating each one with

17261

a @samp{;}.

17262

This applies to both the statements within the action part of a rule (the

17263

usual case), and to the rule statements.

17264

17265

@xref{Comments, ,Comments in @code{awk} Programs}, for information on

17266

@code{awk}'s commenting convention;

17267

@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}, for a

17268

description of the line continuation mechanism in @code{awk}.
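
The following fragment is a sketch of these layout rules.  The patterns
and variable names are hypothetical:

@example
# A missing action is the same as @{ print @}.
/error/

# A missing pattern matches every input record.
@{ total++ @}

# A condition continued after "&&"; two statements joined by ";".
NF > 0 &&
$1 != "skip"    @{ kept++; last = $0 @}

END @{ print kept + 0, "of", total + 0, "records kept" @}
@end example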

17269

17270

@node Pattern Summary, Regexp Summary, Rules Summary, Rules Summary

17271

@appendixsubsec Pattern Summary

17272

17273

@code{awk} patterns may be one of the following:

17274

17275

@example
/@var{regular expression}/
@var{relational expression}
@var{pattern} && @var{pattern}
@var{pattern} || @var{pattern}
@var{pattern} ? @var{pattern} : @var{pattern}
(@var{pattern})
! @var{pattern}
@var{pattern1}, @var{pattern2}
BEGIN
END
@end example

17287

17288

@code{BEGIN} and @code{END} are two special kinds of patterns that are not

17289

tested against the input. The action parts of all @code{BEGIN} rules are

17290

concatenated as if all the statements had been written in a single @code{BEGIN}

17291

rule. They are executed before any of the input is read. Similarly, all the

17292

@code{END} rules are concatenated, and executed when all the input is exhausted (or

17293

when an @code{exit} statement is executed). @code{BEGIN} and @code{END}

17294

patterns cannot be combined with other patterns in pattern expressions.

17295

@code{BEGIN} and @code{END} rules cannot have missing action parts.

17296

17297

For @code{/@var{regular-expression}/} patterns, the associated statement is

17298

executed for each input record that matches the regular expression. Regular

17299

expressions are summarized below.

17300

17301

A @var{relational expression} may use any of the operators defined below in

17302

the section on actions. These generally test whether certain fields match

17303

certain regular expressions.

17304

17305

The @samp{&&}, @samp{||}, and @samp{!} operators are logical ``and,''

17306

logical ``or,'' and logical ``not,'' respectively, as in C. They do

17307

short-circuit evaluation, also as in C, and are used for combining more

17308

primitive pattern expressions. As in most languages, parentheses may be

17309

used to change the order of evaluation.

17310

17311

The @samp{?:} operator is like the same operator in C. If the first

17312

pattern matches, then the second pattern is matched against the input

17313

record; otherwise, the third is matched. Only one of the second and

17314

third patterns is matched.

17315

17316

The @samp{@var{pattern1}, @var{pattern2}} form of a pattern is called a

17317

range pattern. It matches all input lines starting with a line that

17318

matches @var{pattern1}, and continuing until a line that matches

17319

@var{pattern2}, inclusive. A range pattern cannot be used as an operand

17320

of any of the pattern operators.

17321

17322

@xref{Pattern Overview, ,Pattern Elements}.
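
As an illustration, the sketch below uses each kind of pattern once.  The
marker lines @samp{BEGIN_BLOCK} and @samp{END_BLOCK} and the counter names
are assumptions made up for the example:

@example
BEGIN                            @{ print "start of input" @}

# Range pattern: from a marker line through the next closing marker.
/^BEGIN_BLOCK$/, /^END_BLOCK$/   @{ print NR ": " $0 @}

# Relational expression combined with a negated regexp.
NF > 2 && !/^#/                  @{ wide++ @}

# Conditional pattern: test odd-numbered and even-numbered records
# against different regexps.
NR % 2 ? /odd/ : /even/          @{ hits++ @}

END                              @{ print wide + 0, hits + 0 @}
@end example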

17323

17324

@node Regexp Summary, , Pattern Summary, Rules Summary

17325

@appendixsubsec Regular Expressions

17326

17327

Regular expressions are based on POSIX EREs (extended regular expressions).

17328

The escape sequences allowed in string constants are also valid in

17329

regular expressions (@pxref{Escape Sequences}).

17330

Regexps are composed of characters as follows:

17331

17332

@table @code

17333

@item @var{c}

17334

matches the character @var{c} (assuming @var{c} is none of the characters

17335

listed below).

17336

17337

@item \@var{c}

17338

matches the literal character @var{c}.

17339

17340

@item .

17341

matches any character, @emph{including} newline.

17342

In strict POSIX mode, @samp{.} does not match the @sc{nul}

17343

character, which is a character with all bits equal to zero.

17344

17345

@item ^

17346

matches the beginning of a string.

17347

17348

@item $

17349

matches the end of a string.

17350

17351

@item [@var{abc}@dots{}]

17352

matches any of the characters @var{abc}@dots{} (character list).

17353

17354

@item [[:@var{class}:]]

17355

matches any character in the character class @var{class}. Allowable classes

17356

are @code{alnum}, @code{alpha}, @code{blank}, @code{cntrl},

17357

@code{digit}, @code{graph}, @code{lower}, @code{print}, @code{punct},

17358

@code{space}, @code{upper}, and @code{xdigit}.

17359

17360

@item [[.@var{symbol}.]]

17361

matches the multi-character collating symbol @var{symbol}.

17362

@code{gawk} does not currently support collating symbols.

17363

17364

@item [[=@var{chars}=]]

17365

matches any of the equivalent characters in @var{chars}.

17366

@code{gawk} does not currently support equivalence classes.

17367

17368

@item [^@var{abc}@dots{}]

17369

matches any character except @var{abc}@dots{} and newline (negated

17370

character list).

17371

17372

@item @var{r1}|@var{r2}

17373

matches either @var{r1} or @var{r2} (alternation).

17374

17375

@item @var{r1r2}

17376

matches @var{r1}, and then @var{r2} (concatenation).

17377

17378

@item @var{r}+

17379

matches one or more @var{r}'s.

17380

17381

@item @var{r}*

17382

matches zero or more @var{r}'s.

17383

17384

@item @var{r}?

17385

matches zero or one @var{r}'s.

17386

17387

@item (@var{r})

17388

matches @var{r} (grouping).

17389

17390

@item @var{r}@{@var{n}@}

17391

@itemx @var{r}@{@var{n},@}

17392

@itemx @var{r}@{@var{n},@var{m}@}

17393

matches exactly @var{n}, at least @var{n}, or from @var{n} to @var{m}
occurrences of @var{r}, respectively (interval expressions).

17395

17396

@item \y

17397

matches the empty string at either the beginning or the

17398

end of a word.

17399

17400

@item \B

17401

matches the empty string within a word.

17402

17403

@item \<

17404

matches the empty string at the beginning of a word.

17405

17406

@item \>

17407

matches the empty string at the end of a word.

17408

17409

@item \w

17410

matches any word-constituent character (alphanumeric characters and

17411

the underscore).

17412

17413

@item \W

17414

matches any character that is not word-constituent.

17415

17416

@item \`

17417

matches the empty string at the beginning of a buffer (same as a string

17418

in @code{gawk}).

17419

17420

@item \'

17421

matches the empty string at the end of a buffer.

17422

@end table

17423

17424

The various command line options

17425

control how @code{gawk} interprets characters in regexps.

17426

17427

@c NOTE!!! Keep this in sync with the same table in the regexp chapter!

17428

@table @asis

17429

@item No options

17430

In the default case, @code{gawk} provides all the facilities of

17431

POSIX regexps and the GNU regexp operators described above.

17432

However, interval expressions are not supported.

17433

17434

@item @code{--posix}

17435

Only POSIX regexps are supported; the GNU operators are not special

17436

(e.g., @samp{\w} matches a literal @samp{w}). Interval expressions

17437

are allowed.

17438

17439

@item @code{--traditional}

17440

Traditional Unix @code{awk} regexps are matched. The GNU operators

17441

are not special, interval expressions are not available, and neither

17442

are the POSIX character classes (@code{[[:alnum:]]} and so on).

17443

Characters described by octal and hexadecimal escape sequences are

17444

treated literally, even if they represent regexp metacharacters.

17445

17446

@item @code{--re-interval}

17447

Allow interval expressions in regexps, even if @samp{--traditional}

17448

has been provided.

17449

@end table

17450

17451

@xref{Regexp, ,Regular Expressions}.
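
The following sketch exercises several of the operators listed above
against literal sample strings.  It assumes @code{gawk} is run with no
regexp-related options, so that the GNU operators are available:

@example
BEGIN @{
    if ("id42" ~ /^[[:alpha:]]+[[:digit:]]+$/)
        print "letters, then digits"
    if ("a.c" ~ /^a\.c$/ && "abc" !~ /^a\.c$/)
        print "an escaped dot is literal"
    if ("cat" ~ /^(cat|dog)s?$/)
        print "alternation, grouping, optional s"
    if ("a word" ~ /\<word\>/)
        print "GNU word-boundary operators"
@}
@end example

@noindent
Under @samp{--posix} or @samp{--traditional} the last test behaves
differently, since @samp{\<} and @samp{\>} are then not special.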

17452

17453

@node Actions Summary, Functions Summary, Rules Summary, Gawk Summary

17454

@appendixsec Actions

17455

17456

Action statements are enclosed in braces, @samp{@{} and @samp{@}}.

17457

A missing action statement is equivalent to @samp{@w{@{ print @}}}.

17458

17459

Action statements consist of the usual assignment, conditional, and looping

17460

statements found in most languages. The operators, control statements,

17461

and Input/Output statements available are similar to those in C.

17462

17463

@c These paragraphs repeated for both patterns and actions. I don't

17464

@c like this, but I also don't see any way around it. Update both copies

17465

@c if they need fixing.

17466

Comments begin with the @samp{#} character, and continue until the end of the

17467

line. Blank lines may be used to separate statements. Statements normally

17468

end with a newline; however, this is not the case for lines ending in a

17469

@samp{,}, @samp{@{}, @samp{?}, @samp{:}, @samp{&&}, or @samp{||}. Lines

17470

ending in @code{do} or @code{else} also have their statements automatically

17471

continued on the following line. In other cases, a line can be continued by

17472

ending it with a @samp{\}, in which case the newline is ignored.

17473

17474

Multiple statements may be put on one line by separating each one with

17475

a @samp{;}.

17476

This applies to both the statements within the action part of a rule (the

17477

usual case), and to the rule statements.

17478

17479

@xref{Comments, ,Comments in @code{awk} Programs}, for information on

17480

@code{awk}'s commenting convention;

17481

@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}, for a

17482

description of the line continuation mechanism in @code{awk}.

17483

17484

@menu

17485

* Operator Summary:: @code{awk} operators.

17486

* Control Flow Summary:: The control statements.

17487

* I/O Summary:: The I/O statements.

17488

* Printf Summary:: A summary of @code{printf}.

17489

* Special File Summary:: Special file names interpreted internally.

17490

* Built-in Functions Summary:: Built-in numeric and string functions.

17491

* Time Functions Summary:: Built-in time functions.

17492

* String Constants Summary:: Escape sequences in strings.

17493

@end menu

17494

17495

@node Operator Summary, Control Flow Summary, Actions Summary, Actions Summary

17496

@appendixsubsec Operators

17497

17498

The operators in @code{awk}, in order of decreasing precedence, are:

17499

17500

@table @code

17501

@item (@dots{})

17502

Grouping.

17503

17504

@item $

17505

Field reference.

17506

17507

@item ++ --

17508

Increment and decrement, both prefix and postfix.

17509

17510

@item ^

17511

Exponentiation (@samp{**} may also be used, and @samp{**=} for the assignment

17512

operator, but they are not specified in the POSIX standard).

17513

17514

@item + - !

17515

Unary plus, unary minus, and logical negation.

17516

17517

@item * / %

17518

Multiplication, division, and modulus.

17519

17520

@item + -

17521

Addition and subtraction.

17522

17523

@item @var{space}

17524

String concatenation.

17525

17526

@item < <= > >= != ==

17527

The usual relational operators.

17528

17529

@item ~ !~

17530

Regular expression match, negated match.

17531

17532

@item in

17533

Array membership.

17534

17535

@item &&

17536

Logical ``and''.

17537

17538

@item ||

17539

Logical ``or''.

17540

17541

@item ?:

17542

A conditional expression. This has the form @samp{@var{expr1} ?

17543

@var{expr2} : @var{expr3}}. If @var{expr1} is true, the value of the

17544

expression is @var{expr2}; otherwise it is @var{expr3}. Only one of

17545

@var{expr2} and @var{expr3} is evaluated.

17546

17547

@item = += -= *= /= %= ^=

17548

Assignment. Both absolute assignment (@code{@var{var}=@var{value}})

17549

and operator assignment (the other forms) are supported.

17550

@end table

17551

17552

@xref{Expressions}.
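
A few expressions show how the precedence order above plays out.  The
values are arbitrary:

@example
BEGIN @{
    print -2 ^ 2                       # ^ binds tighter than unary minus: -4
    print 2 + 3 * 4                    # * binds tighter than +: 14
    print 1 " " 2 + 3                  # + binds tighter than concatenation: "1 5"
    x = 5
    print (x > 3 ? "big" : "small")    # prints "big"
    x ^= 2                             # operator assignment
    print x                            # prints 25
@}
@end example

@noindent
The conditional expression is parenthesized because an unparenthesized
@samp{>} in a @code{print} statement is treated as output redirection.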

17553

17554

@node Control Flow Summary, I/O Summary, Operator Summary, Actions Summary

17555

@appendixsubsec Control Statements

17556

17557

The control statements are as follows:

17558

17559

@example
if (@var{condition}) @var{statement} @r{[} else @var{statement} @r{]}
while (@var{condition}) @var{statement}
do @var{statement} while (@var{condition})
for (@var{expr1}; @var{expr2}; @var{expr3}) @var{statement}
for (@var{var} in @var{array}) @var{statement}
break
continue
delete @var{array}[@var{index}]
delete @var{array}
exit @r{[} @var{expression} @r{]}
@{ @var{statements} @}
@end example

17572

17573

@xref{Statements, ,Control Statements in Actions}.
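
The sketch below strings most of these statements together.  The variable
names are arbitrary and the program reads no input:

@example
BEGIN @{
    i = 1
    while (i <= 3) @{
        sum += i
        i++
    @}
    do
        i--
    while (i > 1)
    for (j = 0; j < 5; j++) @{
        if (j == 2)
            continue          # skip 2
        if (j == 4)
            break             # stop before 4
        seen[j] = 1
    @}
    for (k in seen)
        n++
    print sum, i, n           # prints "6 1 3"
    exit 0
@}
@end example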

17574

17575

@node I/O Summary, Printf Summary, Control Flow Summary, Actions Summary

17576

@appendixsubsec I/O Statements

17577

17578

The Input/Output statements are as follows:

17579

17580

@table @code

17581

@item getline

17582

Set @code{$0} from next input record; set @code{NF}, @code{NR}, @code{FNR}.

17583

@xref{Getline, ,Explicit Input with @code{getline}}.

17584

17585

@item getline <@var{file}

17586

Set @code{$0} from next record of @var{file}; set @code{NF}.

17587

17588

@item getline @var{var}

17589

Set @var{var} from next input record; set @code{NR}, @code{FNR}.

17590

17591

@item getline @var{var} <@var{file}

17592

Set @var{var} from next record of @var{file}.

17593

17594

@item @var{command} | getline

17595

Run @var{command}, piping its output into @code{getline}; sets @code{$0},

17596

@code{NF}, @code{NR}.

17597

17598

@item @var{command} | getline @var{var}

17599

Run @var{command}, piping its output into @code{getline}; sets @var{var}.

17600

17601

@item next

17602

Stop processing the current input record. The next input record is read and

17603

processing starts over with the first pattern in the @code{awk} program.

17604

If the end of the input data is reached, the @code{END} rule(s), if any,

17605

are executed.

17606

@xref{Next Statement, ,The @code{next} Statement}.

17607

17608

@item nextfile

17609

Stop processing the current input file. The next input record read comes

17610

from the next input file. @code{FILENAME} is updated, @code{FNR} is set to one,

17611

@code{ARGIND} is incremented,

17612

and processing starts over with the first pattern in the @code{awk} program.

17613

If the end of the input data is reached, the @code{END} rule(s), if any,

17614

are executed.

17615

Earlier versions of @code{gawk} used @samp{next file}; this usage is still

17616

supported, but is considered to be deprecated.

17617

@xref{Nextfile Statement, ,The @code{nextfile} Statement}.

17618

17619

@item print

17620

Prints the current record.

17621

@xref{Printing, ,Printing Output}.

17622

17623

@item print @var{expr-list}

17624

Prints expressions.

17625

17626

@item print @var{expr-list} > @var{file}

17627

Prints expressions to @var{file}. If @var{file} does not exist, it is

17628

created. If it does exist, its contents are deleted the first time the

17629

@code{print} is executed.

17630

17631

@item print @var{expr-list} >> @var{file}

17632

Prints expressions to @var{file}. The previous contents of @var{file}

17633

are retained, and the output of @code{print} is appended to the file.

17634

17635

@item print @var{expr-list} | @var{command}

17636

Prints expressions, sending the output down a pipe to @var{command}.

17637

The pipeline to the command stays open until the @code{close} function

17638

is called.

17639

17640

@item printf @var{fmt, expr-list}

17641

Format and print.

17642

17643

@item printf @var{fmt, expr-list} > @var{file}

17644

Format and print to @var{file}. If @var{file} does not exist, it is

17645

created. If it does exist, its contents are deleted the first time the

17646

@code{printf} is executed.

17647

17648

@item printf @var{fmt, expr-list} >> @var{file}

17649

Format and print to @var{file}. The previous contents of @var{file}

17650

are retained, and the output of @code{printf} is appended to the file.

17651

17652

@item printf @var{fmt, expr-list} | @var{command}

17653

Format and print, sending the output down a pipe to @var{command}.

17654

The pipeline to the command stays open until the @code{close} function

17655

is called.

17656

@end table

17657

17658

@code{getline} returns one when it reads a record, zero on end of file,
and @minus{}1 on an error.

17659

In the event of an error, @code{getline} will set @code{ERRNO} to

17660

the value of a system-dependent string that describes the error.
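
The sketch below combines several of these statements.  The file name
@file{data.txt} is hypothetical, and @code{sort} merely stands in for any
command that reads its standard input:

@example
BEGIN @{
    file = "data.txt"                  # hypothetical input file
    while ((status = (getline line < file)) > 0)
        count++
    if (status < 0)
        print "cannot read " file ": " ERRNO > "/dev/stderr"
    close(file)

    cmd = "sort"                       # assumed to be available
    print "pear"  | cmd
    print "apple" | cmd
    close(cmd)                         # the pipe stays open until closed
    print count + 0, "lines read"
@}
@end example

@noindent
Testing the result with @samp{> 0} matters: a loop written as
@samp{while (getline line < file)} would never terminate if the file
could not be opened, because @minus{}1 counts as true.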

17661

17662

@node Printf Summary, Special File Summary, I/O Summary, Actions Summary

17663

@appendixsubsec @code{printf} Summary

17664

17665

Conversion specifications have the form

17666

@code{%}[@var{flag}][@var{width}][@code{.}@var{prec}]@var{format}.

17667

@c whew!

17668

Items in brackets are optional.

17669

17670

The @code{awk} @code{printf} statement and @code{sprintf} function

17671

accept the following conversion specification formats:

17672

17673

@table @code

17674

@item %c

17675

An ASCII character. If the argument used for @samp{%c} is numeric, it is

17676

treated as a character and printed. Otherwise, the argument is assumed to
be a string, and only the first character of that string is printed.

17678

17679

@item %d

17680

@itemx %i

17681

A decimal number (the integer part).

17682

17683

@item %e

17684

@itemx %E

17685

A floating point number of the form

17686

@samp{@r{[}-@r{]}d.dddddde@r{[}+-@r{]}dd}.

17687

The @samp{%E} format uses @samp{E} instead of @samp{e}.

17688

17689

@item %f

17690

A floating point number of the form

17691

@r{[}@code{-}@r{]}@code{ddd.dddddd}.

17692

17693

@item %g

17694

@itemx %G

17695

Use either the @samp{%e} or @samp{%f} format, whichever produces a shorter

17696

string, with non-significant zeros suppressed.

17697

@samp{%G} will use @samp{%E} instead of @samp{%e}.

17698

17699

@item %o

17700

An unsigned octal number (again, an integer).

17701

17702

@item %s

17703

A character string.

17704

17705

@item %x

17706

@itemx %X

17707

An unsigned hexadecimal number (an integer).

17708

The @samp{%X} format uses @samp{A} through @samp{F} instead of

17709

@samp{a} through @samp{f} for decimal 10 through 15.

17710

17711

@item %%

17712

A single @samp{%} character; no argument is converted.

17713

@end table

17714

17715

There are optional, additional parameters that may lie between the @samp{%}

17716

and the control letter:

17717

17718

@table @code

17719

@item -

17720

The expression should be left-justified within its field.

17721

17722

@item @var{space}

17723

For numeric conversions, prefix positive values with a space, and

17724

negative values with a minus sign.

17725

17726

@item +

17727

The plus sign, used before the width modifier (see below),

17728

says to always supply a sign for numeric conversions, even if the data

17729

to be formatted is positive. The @samp{+} overrides the space modifier.

17730

17731

@item #

17732

Use an ``alternate form'' for certain control letters.

17733

For @samp{o}, supply a leading zero.

17734

For @samp{x}, and @samp{X}, supply a leading @samp{0x} or @samp{0X} for

17735

a non-zero result.

17736

For @samp{e}, @samp{E}, and @samp{f}, the result will always contain a

17737

decimal point.

17738

For @samp{g}, and @samp{G}, trailing zeros are not removed from the result.

17739

17740

@item 0

17741

A leading @samp{0} (zero) acts as a flag that indicates output should be

17742

padded with zeros instead of spaces.

17743

This applies even to non-numeric output formats.

17744

This flag only has an effect when the field width is wider than the

17745

value to be printed.

17746

17747

@item @var{width}

17748

The field should be padded to this width. The field is normally padded

17749

with spaces. If the @samp{0} flag has been used, it is padded with zeros.

17750

17751

@item .@var{prec}

17752

A number that specifies the precision to use when printing.

17753

For the @samp{e}, @samp{E}, and @samp{f} formats, this specifies the

17754

number of digits you want printed to the right of the decimal point.

17755

For the @samp{g}, and @samp{G} formats, it specifies the maximum number

17756

of significant digits. For the @samp{d}, @samp{o}, @samp{i}, @samp{u},

17757

@samp{x}, and @samp{X} formats, it specifies the minimum number of

17758

digits to print. For the @samp{s} format, it specifies the maximum number of

17759

characters from the string that should be printed.

17760

@end table

Either or both of the @var{width} and @var{prec} values may be specified
as @samp{*}. In that case, the particular value is taken from the argument
list.

@xref{Printf, ,Using @code{printf} Statements for Fancier Printing}.
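
As a brief illustration of the modifiers summarized above, the following
sketch prints a few values with different flags, widths, and precisions.
It assumes a POSIX shell with an @code{awk} on the search path; the
function name @code{printf_demo} and the sample values are made up for
this example.

```shell
# Hypothetical demo: each conversion uses one modifier from the table.
# The "|" characters mark field boundaries so that padding is visible.
printf_demo() {
    awk 'BEGIN {
        # left-justify, right-justify, zero-pad, force sign, space flag,
        # two digits after the decimal point, at most three characters
        printf "%-5s|%5s|%05d|%+d|% d|%.2f|%.3s\n", "ab", "cd", 42, 7, 7, 3.14159, "abcdef"
        # alternate forms: leading zero for octal, leading 0x for hexadecimal
        printf "%#o|%#x\n", 8, 255
    }'
}
printf_demo
```

The first line of output should show @samp{ab} padded on the right,
@samp{cd} padded on the left, and @samp{42} padded with zeros to five
digits.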

@node Special File Summary, Built-in Functions Summary, Printf Summary, Actions Summary
@appendixsubsec Special File Names

When doing I/O redirection from either @code{print} or @code{printf} into a
file, or via @code{getline} from a file, @code{gawk} recognizes certain special
file names internally. These file names allow access to open file descriptors
inherited from @code{gawk}'s parent process (usually the shell). The
file names are:

@table @file
@item /dev/stdin
The standard input.

@item /dev/stdout
The standard output.

@item /dev/stderr
The standard error output.

@item /dev/fd/@var{n}
The file denoted by the open file descriptor @var{n}.
@end table

In addition, reading the following files provides process-related information
about the running @code{gawk} program. All returned records are terminated
with a newline.

@table @file
@item /dev/pid
Returns the process ID of the current process.

@item /dev/ppid
Returns the parent process ID of the current process.

@item /dev/pgrpid
Returns the process group ID of the current process.

@item /dev/user
At least four space-separated fields, containing the return values of
the @code{getuid}, @code{geteuid}, @code{getgid}, and @code{getegid}
system calls.
If there are any additional fields, they are the group IDs returned by
the @code{getgroups} system call.
(Multiple groups may not be supported on all systems.)
@end table

@noindent
These file names may also be used on the command line to name data files.
These file names are only recognized internally if you do not
actually have files with these names on your system.

@xref{Special Files, ,Special File Names in @code{gawk}}, for a longer description that
provides the motivation for this feature.
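
A common use of these names is to keep diagnostics separate from normal
output. The sketch below assumes a POSIX shell and either @code{gawk} or a
system that provides a real @file{/dev/stderr}; the function name
@code{report} and the messages are invented for illustration.

```shell
# Hypothetical demo: normal output goes to standard output, while the
# warning is redirected to the special file name /dev/stderr.
report() {
    awk 'BEGIN {
        print "result: 42"
        print "warning: input was empty" > "/dev/stderr"
    }'
}
report
```

Redirecting the two streams separately in the shell (for example,
@samp{report 2>/dev/null}) shows that only the warning travels on the
standard error.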

@node Built-in Functions Summary, Time Functions Summary, Special File Summary, Actions Summary
@appendixsubsec Built-in Functions

@code{awk} provides a number of built-in functions for performing
numeric operations, string-related operations, and I/O-related operations.

The built-in arithmetic functions are:

@table @code
@item atan2(@var{y}, @var{x})
the arctangent of @var{y/x} in radians.

@item cos(@var{expr})
the cosine of @var{expr}, where @var{expr} is in radians.

@item exp(@var{expr})
the exponential function (@code{e ^ @var{expr}}).

@item int(@var{expr})
truncates @var{expr} to an integer.

@item log(@var{expr})
the natural logarithm of @var{expr}.

@item rand()
a random number between zero and one.

@item sin(@var{expr})
the sine of @var{expr}, where @var{expr} is in radians.

@item sqrt(@var{expr})
the square root of @var{expr}.

@item srand(@r{[}@var{expr}@r{]})
use @var{expr} as a new seed for the random number generator. If no @var{expr}
is provided, the time of day is used. The return value is the previous
seed for the random number generator.
@end table
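
The following sketch exercises several of the arithmetic functions listed
above. It assumes a POSIX shell with an @code{awk} on the search path; the
function name @code{arith_demo} and the chosen arguments are illustrative
only.

```shell
# Hypothetical demo of the built-in arithmetic functions.
arith_demo() {
    awk 'BEGIN {
        print int(-3.7)                 # truncates toward zero
        print sqrt(16)
        printf "%.4f\n", atan2(0, -1)   # pi, computed as the arctangent of 0/-1
        printf "%.4f\n", exp(1)         # e
        print log(exp(2))               # log undoes exp
        srand(1); a = rand()
        srand(1); b = rand()
        print (a == b)                  # the same seed restarts the same sequence
    }'
}
arith_demo
```

Seeding with a fixed value, as done here, is useful when a program that
uses @code{rand} must behave reproducibly.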

@iftex
@page
@end iftex
@code{awk} has the following built-in string functions:

@table @code
@item gensub(@var{regex}, @var{subst}, @var{how} @r{[}, @var{target}@r{]})
If @var{how} is a string beginning with @samp{g} or @samp{G}, then
replace each match of @var{regex} in @var{target} with @var{subst}.
Otherwise, replace the @var{how}'th occurrence. If @var{target} is not
supplied, use @code{$0}. The return value is the changed string; the
original @var{target} is not modified. Within @var{subst},
@samp{\@var{n}}, where @var{n} is a digit from one to nine, can be used to
indicate the text that matched the @var{n}'th parenthesized
subexpression.

@item gsub(@var{regex}, @var{subst} @r{[}, @var{target}@r{]})
for each substring matching the regular expression @var{regex} in the string
@var{target}, substitute the string @var{subst}, and return the number of
substitutions. If @var{target} is not supplied, use @code{$0}.

@item index(@var{str}, @var{search})
returns the index of the string @var{search} in the string @var{str}, or
zero if @var{search} is not present.

@item length(@r{[}@var{str}@r{]})
returns the length of the string @var{str}. The length of @code{$0}
is returned if no argument is supplied.

@item match(@var{str}, @var{regex})
returns the position in @var{str} where the regular expression @var{regex}
occurs, or zero if @var{regex} is not present, and sets the values of
@code{RSTART} and @code{RLENGTH}.

@item split(@var{str}, @var{arr} @r{[}, @var{regex}@r{]})
splits the string @var{str} into the array @var{arr} on the regular expression
@var{regex}, and returns the number of elements. If @var{regex} is omitted,
@code{FS} is used instead. @var{regex} can be the null string, causing
each character to be placed into its own array element.
The array @var{arr} is cleared first.

@item sprintf(@var{fmt}, @var{expr-list})
formats @var{expr-list} according to @var{fmt}, and returns the resulting string.

@item sub(@var{regex}, @var{subst} @r{[}, @var{target}@r{]})
just like @code{gsub}, but only the first matching substring is replaced.

@item substr(@var{str}, @var{index} @r{[}, @var{len}@r{]})
returns the @var{len}-character substring of @var{str} starting at @var{index}.
If @var{len} is omitted, the rest of @var{str} is used.

@item tolower(@var{str})
returns a copy of the string @var{str}, with all the upper-case characters in
@var{str} translated to their corresponding lower-case counterparts.
Non-alphabetic characters are left unchanged.

@item toupper(@var{str})
returns a copy of the string @var{str}, with all the lower-case characters in
@var{str} translated to their corresponding upper-case counterparts.
Non-alphabetic characters are left unchanged.
@end table
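
The sketch below shows the return values of several of these string
functions on small inputs. It assumes a POSIX shell with an @code{awk} on
the search path and uses only functions found in POSIX @code{awk}, so
@code{gensub} is left out. The function name @code{string_demo}, the
variable names, and the sample strings are invented for this example.

```shell
# Hypothetical demo of the built-in string functions.
string_demo() {
    awk 'BEGIN {
        print index("foobar", "bar")        # position of the first match
        print length("hello")
        print substr("hello", 2, 3)         # three characters starting at 2
        n = split("a:b:c", parts, ":")      # returns the number of elements
        print n, parts[1], parts[3]
        s = "foo boo"
        count = gsub(/o/, "0", s)           # modifies s, returns the count
        print count, s
        where = match("foobar", /o+/)       # also sets RSTART and RLENGTH
        print where, RSTART, RLENGTH
        print toupper("abc") tolower("XYZ")
    }'
}
string_demo
```

Note that @code{gsub} changes its target in place and returns a count,
whereas @code{substr}, @code{toupper}, and @code{tolower} return new
strings and leave their arguments alone.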

The I/O-related functions are:

@table @code
@item close(@var{expr})
Close the open file or pipe denoted by @var{expr}.

@item fflush(@r{[}@var{expr}@r{]})
Flush any buffered output for the output file or pipe denoted by @var{expr}.
If @var{expr} is omitted, standard output is flushed.
If @var{expr} is the null string (@code{""}), all output buffers are flushed.

@item system(@var{cmd-line})
Execute the command @var{cmd-line}, and return the exit status.
If your operating system does not support @code{system}, calling it will
generate a fatal error.

@samp{system("")} can be used to force @code{awk} to flush any pending
output. This is more portable, but less obvious, than calling @code{fflush}.
@end table
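
The following sketch shows @code{close} and @code{system} working
together. It assumes a POSIX shell with @code{awk} and @code{sort} on the
search path; the function name @code{io_demo} and the data are made up for
illustration.

```shell
# Hypothetical demo: close() waits for the sort pipeline to finish, so
# its sorted output appears before anything printed afterwards.
io_demo() {
    awk 'BEGIN {
        cmd = "sort"
        print "pear"  | cmd
        print "apple" | cmd
        close(cmd)                      # sort now writes its output
        status = system("exit 3")       # run a command, keep its exit status
        print "status:", status
    }'
}
io_demo
```

Without the call to @code{close}, the @code{sort} process would not see
the end of its input until the @code{awk} program exited, and the order of
the output lines could differ.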

@node Time Functions Summary, String Constants Summary, Built-in Functions Summary, Actions Summary
@appendixsubsec Time Functions

The following two functions are available for getting the current
time of day, and for formatting time stamps.

@table @code
@item systime()
returns the current time of day as the number of seconds since a particular
epoch (Midnight, January 1, 1970 UTC, on POSIX systems).

@item strftime(@r{[}@var{format}@r{[}, @var{timestamp}@r{]]})
formats @var{timestamp} according to the specification in @var{format}.
The current time of day is used if no @var{timestamp} is supplied.
A default format equivalent to the output of the @code{date} utility is used if
no @var{format} is supplied.
@xref{Time Functions, ,Functions for Dealing with Time Stamps}, for the
details on the conversion specifiers that @code{strftime} accepts.
@end table

@iftex
@xref{Built-in, ,Built-in Functions}, for a description of all of
@code{awk}'s built-in functions.
@end iftex

@node String Constants Summary, , Time Functions Summary, Actions Summary
@appendixsubsec String Constants

String constants in @code{awk} are sequences of characters enclosed
in double quotes (@code{"}). Within strings, certain @dfn{escape sequences}
are recognized, as in C. These are:

@table @code
@item \\
A literal backslash.

@item \a
The ``alert'' character; usually the ASCII BEL character.

@item \b
Backspace.

@item \f
Formfeed.

@item \n
Newline.

@item \r
Carriage return.

@item \t
Horizontal tab.

@item \v
Vertical tab.

@item \x@var{hex digits}
The character represented by the string of hexadecimal digits following
the @samp{\x}. As in ANSI C, all following hexadecimal digits are
considered part of the escape sequence. E.g., @code{"\x1B"} is a
string containing the ASCII ESC (escape) character. (The @samp{\x}
escape sequence is not in POSIX @code{awk}.)

@item \@var{ddd}
The character represented by the one, two, or three digit sequence of octal
digits. Thus, @code{"\033"} is also a string containing the ASCII ESC
(escape) character.

@item \@var{c}
The literal character @var{c}, if @var{c} is not one of the above.
@end table

The escape sequences may also be used inside constant regular expressions
(e.g., the regexp @code{@w{/[@ \t\f\n\r\v]/}} matches whitespace
characters).

@xref{Escape Sequences}.
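
The following sketch shows a few of these escape sequences inside string
constants. It assumes a POSIX shell with an @code{awk} on the search path
and avoids @samp{\x}, which is not in POSIX @code{awk}. The function name
@code{escape_demo} and the sample strings are illustrative only.

```shell
# Hypothetical demo of escape sequences in awk string constants.
# The awk program is in single quotes, so the shell passes every
# backslash through to awk unchanged.
escape_demo() {
    awk 'BEGIN {
        printf "name\tvalue\n"          # \t is a tab, \n a newline
        print "a\\b"                    # \\ is one literal backslash
        print length("\033[0m")         # octal \033 (ESC) is one character
        print "say \"hi\""              # \" is a literal double quote
    }'
}
escape_demo
```

The third line of output is the length of a four-character string: ESC
followed by @samp{[}, @samp{0}, and @samp{m}.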

@node Functions Summary, Historical Features, Actions Summary, Gawk Summary
@appendixsec User-defined Functions

Functions in @code{awk} are defined as follows:

@example
function @var{name}(@var{parameter list}) @{ @var{statements} @}
@end example

Actual parameters supplied in the function call are used to instantiate
the formal parameters declared in the function. Arrays are passed by
reference; other variables are passed by value.

If there are fewer arguments passed than there are names in @var{parameter list},
the extra names are given the null string as their value. Extra names have the
effect of local variables.

The open-parenthesis in a function call of a user-defined function must
immediately follow the function name, without any intervening white space.
This is to avoid a syntactic ambiguity with the concatenation operator.

The word @code{func} may be used in place of @code{function} (but not in
POSIX @code{awk}).

Use the @code{return} statement to return a value from a function.

@xref{User-defined, ,User-defined Functions}.
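
The rules above can be seen in a short program. This is only a sketch,
assuming a POSIX shell with an @code{awk} on the search path; the names
@code{func_demo}, @code{fill}, @code{bump}, and @code{squares} are
invented for this example.

```shell
# Hypothetical demo of parameter passing in user-defined functions.
func_demo() {
    awk '
    # arr is an array, so changes made here are visible to the caller.
    # i is an extra parameter, so it acts as a local variable.
    function fill(arr, n,    i) {
        for (i = 1; i <= n; i++)
            arr[i] = i * i
    }
    # x is a scalar, so the function works on a copy of the value.
    function bump(x) {
        x = x + 1
        return x
    }
    BEGIN {
        i = "untouched"
        fill(squares, 3)
        v = 10
        r = bump(v)
        print squares[3], v, r, i
    }'
}
func_demo
```

The extra white space before @code{i} in the parameter list is only a
convention that separates the real parameters from the local variables.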

@node Historical Features, , Functions Summary, Gawk Summary
@appendixsec Historical Features

@cindex historical features
There are two features of historical @code{awk} implementations that
@code{gawk} supports.

First, it is possible to call the @code{length} built-in function not only
with no arguments, but even without parentheses!

@example
a = length
@end example

@noindent
is the same as either of

@example
a = length()
a = length($0)
@end example

@noindent
For example:

@example
$ echo abcdef | awk '@{ print length @}'
@print{} 6
@end example

@noindent
This feature is marked as ``deprecated'' in the POSIX standard, and
@code{gawk} will issue a warning about its use if @samp{--lint} is
specified on the command line.
(The ability to use @code{length} this way was actually an accident of the
original Unix @code{awk} implementation. If any built-in function used
@code{$0} as its default argument, it was possible to call that function
without the parentheses. In particular, it was common practice to use
the @code{length} function in this fashion, and this usage was documented
in the @code{awk} manual page.)

The other historical feature is the use of either the @code{break} statement,
or the @code{continue} statement
outside the body of a @code{while}, @code{for}, or @code{do} loop. Traditional
@code{awk} implementations have treated such usage as equivalent to the
@code{next} statement. More recent versions of Unix @code{awk} do not allow
it. @code{gawk} supports this usage if @samp{--traditional} has been
specified.

@xref{Options, ,Command Line Options}, for more information about the
@samp{--lint} and @samp{--traditional} options.

@node Installation, Notes, Gawk Summary, Top
@appendix Installing @code{gawk}

This appendix provides instructions for installing @code{gawk} on the
various platforms that are supported by the developers. The primary
developers support Unix (and one day, GNU), while the other ports were
contributed. The file @file{ACKNOWLEDGMENT} in the @code{gawk}
distribution lists the electronic mail addresses of the people who did
the respective ports, and they are also provided in
@ref{Bugs, , Reporting Problems and Bugs}.

@menu
* Gawk Distribution::           What is in the @code{gawk} distribution.
* Unix Installation::           Installing @code{gawk} under various versions
                                of Unix.
* VMS Installation::            Installing @code{gawk} on VMS.
* PC Installation::             Installing and Compiling @code{gawk} on MS-DOS
                                and OS/2
* Atari Installation::          Installing @code{gawk} on the Atari ST.
* Amiga Installation::          Installing @code{gawk} on an Amiga.
* Bugs::                        Reporting Problems and Bugs.
* Other Versions::              Other freely available @code{awk}
                                implementations.
@end menu

@node Gawk Distribution, Unix Installation, Installation, Installation
@appendixsec The @code{gawk} Distribution

This section describes how to get the @code{gawk}
distribution, how to extract it, and what is in the various files and
subdirectories.

@menu
* Getting::                     How to get the distribution.
* Extracting::                  How to extract the distribution.
* Distribution contents::       What is in the distribution.
@end menu

@node Getting, Extracting, Gawk Distribution, Gawk Distribution
@appendixsubsec Getting the @code{gawk} Distribution
@cindex getting @code{gawk}
@cindex anonymous @code{ftp}
@cindex @code{ftp}, anonymous
@cindex Free Software Foundation
There are three ways you can get GNU software.

@enumerate
@item
You can copy it from someone else who already has it.

@cindex Free Software Foundation
@item
You can order @code{gawk} directly from the Free Software Foundation.
Software distributions are available for Unix, MS-DOS, and VMS, on
tape, CD-ROM, or floppies (MS-DOS only). The address is:

@quotation
Free Software Foundation @*
59 Temple Place---Suite 330 @*
Boston, MA 02111-1307 USA @*
Phone: +1-617-542-5942 @*
Fax (including Japan): +1-617-542-2652 @*
E-mail: @code{gnu@@prep.ai.mit.edu} @*
@end quotation

@noindent
Ordering from the FSF directly contributes to the support of the foundation
and to the production of more free software.

@item
You can get @code{gawk} by using anonymous @code{ftp} to the Internet host
@code{ftp.gnu.ai.mit.edu}, in the directory @file{/pub/gnu}.

Here is a list of alternate @code{ftp} sites from which you can obtain GNU
software. When a site is listed as ``@var{site}@code{:}@var{directory}'' the
@var{directory} indicates the directory where GNU software is kept.
You should use a site that is geographically close to you.

@table @asis
@item Asia:
@table @code
@item cair-archive.kaist.ac.kr:/pub/gnu
@itemx ftp.cs.titech.ac.jp
@itemx ftp.nectec.or.th:/pub/mirrors/gnu
@itemx utsun.s.u-tokyo.ac.jp:/ftpsync/prep
@end table

@item Australia:
@table @code
@item archie.au:/gnu
(@code{archie.oz} or @code{archie.oz.au} for ACSnet)
@end table

@item Africa:
@table @code
@item ftp.sun.ac.za:/pub/gnu
@end table

@item Middle East:
@table @code
@item ftp.technion.ac.il:/pub/unsupported/gnu
@end table

@item Europe:
@table @code
@item archive.eu.net
@itemx ftp.denet.dk
@itemx ftp.eunet.ch
@itemx ftp.funet.fi:/pub/gnu
@itemx ftp.ieunet.ie:pub/gnu
@itemx ftp.informatik.rwth-aachen.de:/pub/gnu
@itemx ftp.informatik.tu-muenchen.de
@itemx ftp.luth.se:/pub/unix/gnu
@itemx ftp.mcc.ac.uk
@itemx ftp.stacken.kth.se
@itemx ftp.sunet.se:/pub/gnu
@itemx ftp.univ-lyon1.fr:pub/gnu
@itemx ftp.win.tue.nl:/pub/gnu
@itemx irisa.irisa.fr:/pub/gnu
@itemx isy.liu.se
@itemx nic.switch.ch:/mirror/gnu
@itemx src.doc.ic.ac.uk:/gnu
@itemx unix.hensa.ac.uk:/pub/uunet/systems/gnu
@end table

@item South America:
@table @code
@item ftp.inf.utfsm.cl:/pub/gnu
@itemx ftp.unicamp.br:/pub/gnu
@end table

@item Western Canada:
@table @code
@item ftp.cs.ubc.ca:/mirror2/gnu
@end table

@item USA:
@table @code
@item col.hp.com:/mirrors/gnu
@itemx f.ms.uky.edu:/pub3/gnu
@itemx ftp.cc.gatech.edu:/pub/gnu
@itemx ftp.cs.columbia.edu:/archives/gnu/prep
@itemx ftp.digex.net:/pub/gnu
@itemx ftp.hawaii.edu:/mirrors/gnu
@itemx ftp.kpc.com:/pub/mirror/gnu
@end table

@iftex
@page
@end iftex
@item USA (continued):
@table @code
@item ftp.uu.net:/systems/gnu
@itemx gatekeeper.dec.com:/pub/GNU
@itemx jaguar.utah.edu:/gnustuff
@itemx labrea.stanford.edu
@itemx mrcnext.cso.uiuc.edu:/pub/gnu
@itemx vixen.cso.uiuc.edu:/gnu
@itemx wuarchive.wustl.edu:/systems/gnu
@end table
@end table
@end enumerate

@node Extracting, Distribution contents, Getting, Gawk Distribution
@appendixsubsec Extracting the Distribution
@code{gawk} is distributed as a @code{tar} file compressed with the
GNU Zip program, @code{gzip}.

Once you have the distribution (for example,
@file{gawk-@value{VERSION}.0.tar.gz}), first use @code{gzip} to expand the
file, and then use @code{tar} to extract it. You can use the following
pipeline to produce the @code{gawk} distribution:

@example
# Under System V, add 'o' to the tar flags
gzip -d -c gawk-@value{VERSION}.0.tar.gz | tar -xvpf -
@end example

@noindent
This will create a directory named @file{gawk-@value{VERSION}.0} in the current
directory.

The distribution file name is of the form
@file{gawk-@var{V}.@var{R}.@var{n}.tar.gz}.
The @var{V} represents the major version of @code{gawk},
the @var{R} represents the current release of version @var{V}, and
the @var{n} represents a @dfn{patch level}, meaning that minor bugs have
been fixed in the release. The current patch level is 0, but when
retrieving distributions, you should get the version with the highest
version, release, and patch level. (Note that release levels greater than
or equal to 90 denote ``beta,'' or non-production software; you may not wish
to retrieve such a version unless you don't mind experimenting.)

If you are not on a Unix system, you will need to make other arrangements
for getting and extracting the @code{gawk} distribution. You should consult
a local expert.

@node Distribution contents, , Extracting, Gawk Distribution
@appendixsubsec Contents of the @code{gawk} Distribution

The @code{gawk} distribution has a number of C source files,
documentation files,
subdirectories and files related to the configuration process
(@pxref{Unix Installation, ,Compiling and Installing @code{gawk} on Unix}),
and several subdirectories related to different, non-Unix,
operating systems.

@table @asis
@item various @samp{.c}, @samp{.y}, and @samp{.h} files
These files are the actual @code{gawk} source code.
@end table

@iftex
@page
@end iftex
@table @file
@item README
@itemx README_d/README.*
Descriptive files: @file{README} for @code{gawk} under Unix, and the
rest for the various hardware and software combinations.

@item INSTALL
A file providing an overview of the configuration and installation process.

@item PORTS
A list of systems to which @code{gawk} has been ported, and which
have successfully run the test suite.

@item ACKNOWLEDGMENT
A list of the people who contributed major parts of the code or documentation.

@item ChangeLog
A detailed list of source code changes as bugs are fixed or improvements made.

@item NEWS
A list of changes to @code{gawk} since the last release or patch.

@item COPYING
The GNU General Public License.

@item FUTURES
A brief list of features and/or changes being contemplated for future
releases, with some indication of the time frame for the feature, based
on its difficulty.

@item LIMITATIONS
A list of those factors that limit @code{gawk}'s performance.
Most of these depend on the hardware or operating system software, and
are not limits in @code{gawk} itself.

@item POSIX.STD
A description of one area where the POSIX standard for @code{awk} is
incorrect, and how @code{gawk} handles the problem.

@item PROBLEMS
A file describing known problems with the current release.

@item doc/gawk.1
The @code{troff} source for a manual page describing @code{gawk}.
This is distributed for the convenience of Unix users.

@item doc/gawk.texi
The Texinfo source file for this @value{DOCUMENT}.
It should be processed with @TeX{} to produce a printed document, and
with @code{makeinfo} to produce an Info file.

@item doc/gawk.info
The generated Info file for this @value{DOCUMENT}.

@item doc/igawk.1
The @code{troff} source for a manual page describing the @code{igawk}
program presented in
@ref{Igawk Program, ,An Easy Way to Use Library Functions}.

@item doc/Makefile.in
The input file used during the configuration process to generate the
actual @file{Makefile} for creating the documentation.

@item Makefile.in
@itemx acconfig.h
@itemx aclocal.m4
@itemx configh.in
@itemx configure.in
@itemx configure
@itemx custom.h
@itemx missing/*
These files and subdirectory are used when configuring @code{gawk}
for various Unix systems. They are explained in detail in
@ref{Unix Installation, ,Compiling and Installing @code{gawk} on Unix}.

@item awklib/extract.awk
@itemx awklib/Makefile.in
The @file{awklib} directory contains a copy of @file{extract.awk}
(@pxref{Extract Program, ,Extracting Programs from Texinfo Source Files}),
which can be used to extract the sample programs from the Texinfo
source file for this @value{DOCUMENT}, and a @file{Makefile.in} file, which
@code{configure} uses to generate a @file{Makefile}.
As part of the process of building @code{gawk}, the library functions from
@ref{Library Functions, , A Library of @code{awk} Functions},
and the @code{igawk} program from
@ref{Igawk Program, , An Easy Way to Use Library Functions},
are extracted into ready-to-use files.
They are installed as part of the installation process.

@item amiga/*
Files needed for building @code{gawk} on an Amiga.
@xref{Amiga Installation, ,Installing @code{gawk} on an Amiga}, for details.

@item atari/*
Files needed for building @code{gawk} on an Atari ST.
@xref{Atari Installation, ,Installing @code{gawk} on the Atari ST}, for details.

@item pc/*
Files needed for building @code{gawk} under MS-DOS and OS/2.
@xref{PC Installation, ,MS-DOS and OS/2 Installation and Compilation}, for details.

@item vms/*
Files needed for building @code{gawk} under VMS.
@xref{VMS Installation, ,How to Compile and Install @code{gawk} on VMS}, for details.

@item test/*
A test suite for @code{gawk}. You can use @samp{make check} from the
top-level @code{gawk}
directory to run your version of @code{gawk} against the test suite.
If @code{gawk} successfully passes @samp{make check} then you can
be confident of a successful port.
@end table

@node Unix Installation, VMS Installation, Gawk Distribution, Installation
@appendixsec Compiling and Installing @code{gawk} on Unix

Usually, you can compile and install @code{gawk} by typing only two
commands. However, if you use an unusual system, you may need
to configure @code{gawk} for your system yourself.

@menu
* Quick Installation::          Compiling @code{gawk} under Unix.
* Configuration Philosophy::    How it's all supposed to work.
@end menu

@node Quick Installation, Configuration Philosophy, Unix Installation, Unix Installation
@appendixsubsec Compiling @code{gawk} for Unix

@cindex installation, Unix
After you have extracted the @code{gawk} distribution, @code{cd}
to @file{gawk-@value{VERSION}.0}. Like most GNU software,
@code{gawk} is configured
automatically for your Unix system by running the @code{configure} program.
This program is a Bourne shell script that was generated automatically using
GNU @code{autoconf}.
@iftex
(The @code{autoconf} software is
described fully in
@cite{Autoconf---Generating Automatic Configuration Scripts},
which is available from the Free Software Foundation.)
@end iftex
@ifinfo
(The @code{autoconf} software is described fully starting with
@ref{Top, , Introduction, autoconf, Autoconf---Generating Automatic Configuration Scripts}.)
@end ifinfo

To configure @code{gawk}, simply run @code{configure}:

@example
sh ./configure
@end example

This produces a @file{Makefile} and @file{config.h} tailored to your system.
The @file{config.h} file describes various facts about your system.
You may wish to edit the @file{Makefile} to
change the @code{CFLAGS} variable, which controls
the command line options that are passed to the C compiler (such as
optimization levels, or compiling for debugging).

Alternatively, you can add your own values for most @code{make}
variables, such as @code{CC} and @code{CFLAGS}, on the command line when
running @code{configure}:

@example
CC=cc CFLAGS=-g sh ./configure
@end example

@noindent
See the file @file{INSTALL} in the @code{gawk} distribution for
all the details.

After you have run @code{configure}, and possibly edited the @file{Makefile},
type:

@example
make
@end example

@noindent
and shortly thereafter, you should have an executable version of @code{gawk}.
That's all there is to it!
(If these steps do not work, please send in a bug report;
@pxref{Bugs, ,Reporting Problems and Bugs}.)

18501

18502

@node Configuration Philosophy, , Quick Installation, Unix Installation

18503

@appendixsubsec The Configuration Process

18504

18505

@cindex configuring @code{gawk}

18506

(This section is of interest only if you know something about using the

18507

C language and the Unix operating system.)

18508

18509

The source code for @code{gawk} generally attempts to adhere to formal

18510

standards wherever possible. This means that @code{gawk} uses library

18511

routines that are specified by the ANSI C standard and by the POSIX

18512

operating system interface standard. When using an ANSI C compiler,

18513

function prototypes are used to help improve the compile-time checking.

18514

18515

Many Unix systems do not support all of either the ANSI or the

18516

POSIX standards. The @file{missing} subdirectory in the @code{gawk}

18517

distribution contains replacement versions of those subroutines that are

18518

most likely to be missing.

18519

18520

The @file{config.h} file that is created by the @code{configure} program
contains definitions that describe features of the particular operating
system where you are attempting to compile @code{gawk}. This file
describes three things: which header files are available, so that
they can be correctly included; which (supposedly) standard functions
are actually available in your C libraries; and other miscellaneous
facts about your variant of Unix. For example, there may not be an
@code{st_blksize} element in the @code{stat} structure. In this case,
@samp{HAVE_ST_BLKSIZE} would be undefined.
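
As a rough illustration of how such a definition is typically consumed
(a sketch only, not code taken from the @code{gawk} sources; the macro
name @code{IO_BLKSIZE} and the fallback value 512 are assumptions made
for this example), C code can test @samp{HAVE_ST_BLKSIZE} and fall back
to a fixed value when the structure member is missing:

@example
#include <sys/types.h>
#include <sys/stat.h>

/* IO_BLKSIZE is a hypothetical name; 512 is an assumed fallback. */
#ifdef HAVE_ST_BLKSIZE
#define IO_BLKSIZE(sb)  ((sb).st_blksize)  /* member exists: use it */
#else
#define IO_BLKSIZE(sb)  512                /* member missing: fixed guess */
#endif
@end example

@noindent
Code written this way compiles on both kinds of systems without further
editing, which is the purpose of recording such facts in @file{config.h}.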

18531

18532

@cindex @code{custom.h} configuration file

18533

It is possible for your C compiler to lie to @code{configure}. It may
do so by not exiting with an error when a library function is not
available. To get around this, you can edit the file @file{custom.h}.
Use an @samp{#ifdef} that is appropriate for your system, and either
@code{#define} any constants that @code{configure} should have defined but
didn't, or @code{#undef} any constants that @code{configure} defined and
should not have. @file{custom.h} is automatically included by
@file{config.h}.
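
A minimal sketch of such an edit follows. The system macro
@code{__hypothetical_os__} is invented for illustration; substitute
whatever symbol your compiler predefines. @samp{HAVE_ST_BLKSIZE} is
used only because it is the constant mentioned earlier in this section;
the constants that actually need correcting depend on your system.

@example
/* custom.h fragment: correct a wrong guess made by configure. */
#ifdef __hypothetical_os__    /* placeholder for your system's macro */
#undef HAVE_ST_BLKSIZE        /* configure defined it, but should not have */
#endif
@end example

@noindent
Keeping the correction inside an @samp{#ifdef} means the same
@file{custom.h} remains harmless on systems where @code{configure}
guessed correctly.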

18541

18542

It is also possible that the @code{configure} program generated by
@code{autoconf} will fail on your system in some other way. If you do
have a problem, note that the file @file{configure.in} is the input for
@code{autoconf}. You may be able to change this file and generate a new
version of @code{configure} that will work on your system.
@xref{Bugs, ,Reporting Problems and Bugs}, for
information on how to report problems in configuring @code{gawk}. The same
mechanism may be used to send in updates to @file{configure.in} and/or
@file{custom.h}.

18552

18553

@node VMS Installation, PC Installation, Unix Installation, Installation

18554

@appendixsec How to Compile and Install @code{gawk} on VMS

18555

18556

@c based on material from Pat Rankin <rankin@eql.caltech.edu>

18557

18558

@cindex installation, vms

18559

This section describes how to compile and install @code{gawk} under VMS.

18560

18561

@menu

18562

* VMS Compilation:: How to compile @code{gawk} under VMS.

18563

* VMS Installation Details:: How to install @code{gawk} under VMS.

18564

* VMS Running:: How to run @code{gawk} under VMS.

18565

* VMS POSIX:: Alternate instructions for VMS POSIX.

18566

@end menu

18567

18568

@node VMS Compilation, VMS Installation Details, VMS Installation, VMS Installation

18569

@appendixsubsec Compiling @code{gawk} on VMS

18570

18571

To compile @code{gawk} under VMS, there is a @code{DCL} command procedure that

18572

will issue all the necessary @code{CC} and @code{LINK} commands, and there is

18573

also a @file{Makefile} for use with the @code{MMS} utility. From the source

18574

directory, use either

18575

18576

@example

18577

$ @@[.VMS]VMSBUILD.COM

18578

@end example

18579

18580

@noindent

18581

or

18582

18583

@example

18584

$ MMS/DESCRIPTION=[.VMS]DESCRIP.MMS GAWK

18585

@end example

18586

18587

Depending upon which C compiler you are using, follow one of the sets

18588

of instructions in this table:

18589

18590

@table @asis

18591

@item VAX C V3.x

18592

Use either @file{vmsbuild.com} or @file{descrip.mms} as is. These use

18593

@code{CC/OPTIMIZE=NOLINE}, which is essential for Version 3.0.

18594

18595

@item VAX C V2.x

18596

You must have Version 2.3 or 2.4; older ones won't work. Edit either

18597

@file{vmsbuild.com} or @file{descrip.mms} according to the comments in them.

18598

For @file{vmsbuild.com}, this just entails removing two @samp{!} delimiters.

18599

Also edit @file{config.h} (which is a copy of file @file{[.config]vms-conf.h})

18600

and comment out or delete the two lines @samp{#define __STDC__ 0} and

18601

@samp{#define VAXC_BUILTINS} near the end.

18602

18603

@item GNU C

18604

Edit @file{vmsbuild.com} or @file{descrip.mms}; the changes are different

18605

from those for VAX C V2.x, but equally straightforward. No changes to

18606

@file{config.h} should be needed.

18607

18608

@item DEC C

18609

Edit @file{vmsbuild.com} or @file{descrip.mms} according to their comments.

18610

No changes to @file{config.h} should be needed.

18611

@end table

18612

18613

@code{gawk} has been tested under VAX/VMS 5.5-1 using VAX C V3.2,
and GNU C 1.40 and 2.3. It should work without modifications for VMS V4.6 and up.

18615

18616

@node VMS Installation Details, VMS Running, VMS Compilation, VMS Installation

18617

@appendixsubsec Installing @code{gawk} on VMS

18618

18619

To install @code{gawk}, all you need is a ``foreign'' command, which is

18620

a @code{DCL} symbol whose value begins with a dollar sign. For example:

18621

18622

@example

18623

$ GAWK :== $disk1:[gnubin]GAWK

18624

@end example

18625

18626

@noindent

18627

(Substitute the actual location of @code{gawk.exe} for

18628

@samp{$disk1:[gnubin]}.) The symbol should be placed in the

18629

@file{login.com} of any user who wishes to run @code{gawk},

18630

so that it will be defined every time the user logs on.

18631

Alternatively, the symbol may be placed in the system-wide

18632

@file{sylogin.com} procedure, which will allow all users

18633

to run @code{gawk}.

18634

18635

Optionally, the help entry can be loaded into a VMS help library:

18636

18637

@example

18638

$ LIBRARY/HELP SYS$HELP:HELPLIB [.VMS]GAWK.HLP

18639

@end example

18640

18641

@noindent

18642

(You may want to substitute a site-specific help library rather than

18643

the standard VMS library @samp{HELPLIB}.) After loading the help text,

18644

18645

@example

18646

$ HELP GAWK

18647

@end example

18648

18649

@noindent

18650

will provide information about both the @code{gawk} implementation and the

18651

@code{awk} programming language.

18652

18653

The logical name @samp{AWK_LIBRARY} can designate a default location
for @code{awk} program files. For the @samp{-f} option, if the specified
filename has no device or directory path information in it, @code{gawk}
looks in the current directory first and then, if the file is not found
there, in the directory specified by the translation of @samp{AWK_LIBRARY}.
If the file is still not found after both directories have been searched,
@code{gawk} appends the suffix @samp{.awk} to the filename and retries
the search. If @samp{AWK_LIBRARY} is not defined, that
portion of the file search fails benignly.

18662

18663

@node VMS Running, VMS POSIX, VMS Installation Details, VMS Installation

18664

@appendixsubsec Running @code{gawk} on VMS

18665

18666

Command line parsing and quoting conventions are significantly different

18667

on VMS, so examples in this @value{DOCUMENT} or from other sources often need minor

18668

changes. They @emph{are} minor though, and all @code{awk} programs

18669

should run correctly.

18670

18671

Here are a couple of trivial tests:

18672

18673

@example

18674

$ gawk -- "BEGIN @{print ""Hello, World!""@}"

18675

$ gawk -"W" version

18676

! could also be -"W version" or "-W version"

18677

@end example

18678

18679

@noindent

18680

Note that upper-case and mixed-case text must be quoted.

18681

18682

The VMS port of @code{gawk} includes a @code{DCL}-style interface in addition

18683

to the original shell-style interface (see the help entry for details).

18684

One side-effect of dual command line parsing is that if there is only a

18685

single parameter (as in the quoted string program above), the command

18686

becomes ambiguous. To work around this, the normally optional @samp{--}

18687

flag is required to force Unix style rather than @code{DCL} parsing. If any

18688

other dash-type options (or multiple parameters such as data files to be

18689

processed) are present, there is no ambiguity and @samp{--} can be omitted.

18690

18691

The default search path when looking for @code{awk} program files specified

18692

by the @samp{-f} option is @code{"SYS$DISK:[],AWK_LIBRARY:"}. The logical

18693

name @samp{AWKPATH} can be used to override this default. The format

18694

of @samp{AWKPATH} is a comma-separated list of directory specifications.

18695

When defining it, the value should be quoted so that it retains a single

18696

translation, and not a multi-translation @code{RMS} searchlist.

18697

18698

@node VMS POSIX, , VMS Running, VMS Installation

18699

@appendixsubsec Building and Using @code{gawk} on VMS POSIX

18700

18701

Ignore the instructions above, although @file{vms/gawk.hlp} should still

18702

be made available in a help library. Make sure that the @code{configure}

18703

script is executable; use @samp{chmod +x}

18704

on it if necessary. Then execute the following commands:

18705

18706

@example

18707

@group

18708

$ POSIX

18709

psx> CC=vms/posix-cc.sh configure

18710

psx> CC=c89 make gawk

18711

@end group

18712

@end example

18713

18714

@noindent

18715

The first command will construct files @file{config.h} and @file{Makefile}

18716

out of templates. The second command will compile and link @code{gawk}.

18717

@ignore

18718

Due to a @code{make} bug in VMS POSIX V1.0 and V1.1,

18719

the file @file{awktab.c} must be given as an explicit target or it will

18720

not be built and the final link step will fail.

18721

@end ignore

18722

Ignore the warning

18723

@code{"Could not find lib m in lib list"}; it is harmless, caused by the

18724

explicit use of @samp{-lm} as a linker option which is not needed

18725

under VMS POSIX. Under V1.1 (but not V1.0) a problem with the @code{yacc}

18726

skeleton @file{/etc/yyparse.c} will cause a compiler warning for

18727

@file{awktab.c}, followed by a linker warning about compilation warnings

18728

in the resulting object module. These warnings can be ignored.

18729

18730

Once built, @code{gawk} will work like any other shell utility. Unlike

18731

the normal VMS port of @code{gawk}, no special command line manipulation is

18732

needed in the VMS POSIX environment.

18733

18734

@c Rewritten by Scott Deifik <scottd@amgen.com>

18735

@c and Darrel Hankerson <hankedr@mail.auburn.edu>

18736

@node PC Installation, Atari Installation, VMS Installation, Installation

18737

@appendixsec MS-DOS and OS/2 Installation and Compilation

18738

18739

@cindex installation, MS-DOS and OS/2

18740

If you have received a binary distribution prepared by the DOS

18741

maintainers, then @code{gawk} and the necessary support files will appear

18742

under the @file{gnu} directory, with executables in @file{gnu/bin},

18743

libraries in @file{gnu/lib/awk}, and manual pages under @file{gnu/man}.

18744

This is designed for easy installation to a @file{/gnu} directory on your

18745

drive, but the files can be installed anywhere provided @code{AWKPATH} is

18746

set properly. Regardless of the installation directory, the first line of

18747

@file{igawk.cmd} and @file{igawk.bat} (in @file{gnu/bin}) may need to be

18748

edited.

18749

18750

The binary distribution will contain a separate file describing the

18751

contents. In particular, it may include more than one version of the

18752

@code{gawk} executable. OS/2 binary distributions may have a

18753

different arrangement, but installation is similar.

18754

18755

The OS/2 and MS-DOS versions of @code{gawk} search for program files as

18756

described in @ref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}.

18757

However, semicolons (rather than colons) separate elements

18758

in the @code{AWKPATH} variable. If @code{AWKPATH} is not set or is empty,

18759

then the default search path is @code{@w{".;c:/lib/awk;c:/gnu/lib/awk"}}.

18760

18761

An @code{sh}-like shell (as opposed to @code{command.com} under MS-DOS

18762

or @code{cmd.exe} under OS/2) may be useful for @code{awk} programming.

18763

Ian Stewartson has written an excellent shell for MS-DOS and OS/2, and a

18764

@code{ksh} clone and GNU Bash are available for OS/2. The file

18765

@file{README_d/README.pc} in the @code{gawk} distribution contains

18766

information on these shells. Users of Stewartson's shell on DOS should

18767

examine its documentation on handling of command-lines. In particular,

18768

the setting for @code{gawk} in the shell configuration may need to be

18769

changed, and the @code{ignoretype} option may also be of interest.

18770

18771

@code{gawk} can be compiled for MS-DOS and OS/2 using the GNU development tools

18772

from DJ Delorie (DJGPP, MS-DOS-only) or Eberhard Mattes (EMX, MS-DOS and OS/2).

18773

Microsoft C can be used to build 16-bit versions for MS-DOS and OS/2. The file

18774

@file{README_d/README.pc} in the @code{gawk} distribution contains additional

18775

notes, and @file{pc/Makefile} contains important notes on compilation options.

18776

18777

To build @code{gawk}, copy the files in the @file{pc} directory to the

18778

directory with the rest of the @code{gawk} sources. The @file{Makefile}

18779

contains a configuration section with comments, and may need to be

18780

edited in order to work with your @code{make} utility.

18781

18782

The @file{Makefile} contains a number of targets for building various MS-DOS

18783

and OS/2 versions. A list of targets will be printed if the @code{make}

18784

command is given without a target. As an example, to build @code{gawk}

18785

using the DJGPP tools, enter @samp{make djgpp}.

18786

18787

Using @code{make} to run the standard tests and to install @code{gawk}

18788

requires additional Unix-like tools, including @code{sh}, @code{sed}, and

18789

@code{cp}. In order to run the tests, the @file{test/*.ok} files may need to

18790

be converted so that they have the usual DOS-style end-of-line markers. Most

18791

of the tests will work properly with Stewartson's shell along with the

18792

companion utilities or appropriate GNU utilities. However, some editing of

18793

@file{test/Makefile} is required. It is recommended that the file

18794

@file{pc/Makefile.tst} be copied to @file{test/Makefile} as a

18795

replacement. Details can be found in @file{README_d/README.pc}.

18796

18797

@node Atari Installation, Amiga Installation, PC Installation, Installation

18798

@appendixsec Installing @code{gawk} on the Atari ST

18799

18800

@c based on material from Michal Jaegermann <michal@gortel.phys.ualberta.ca>

18801

18802

@cindex atari

18803

@cindex installation, atari

18804

There are no substantial differences when installing @code{gawk} on
various Atari models. Compiled @code{gawk} executables do not require
a large amount of memory with most @code{awk} programs and should run on all
Motorola processor based models (referred to below as the ST, even though
that is not strictly accurate).

18809

18810

In order to use @code{gawk}, you need to have a shell, either text or

18811

graphics, that does not map all the characters of a command line to

18812

upper-case. Maintaining case distinction in option flags is very

18813

important (@pxref{Options, ,Command Line Options}).

18814

These days this is the default, and it may only be a problem for some

18815

very old machines. If your system does not preserve the case of option

18816

flags, you will need to upgrade your tools. Support for I/O

18817

redirection is necessary to make it easy to import @code{awk} programs

18818

from other environments. Pipes are nice to have, but not vital.

18819

18820

@menu

18821

* Atari Compiling:: Compiling @code{gawk} on Atari

18822

* Atari Using:: Running @code{gawk} on Atari

18823

@end menu

18824

18825

@node Atari Compiling, Atari Using, Atari Installation, Atari Installation

18826

@appendixsubsec Compiling @code{gawk} on the Atari ST

18827

18828

A proper compilation of @code{gawk} sources when @code{sizeof(int)}

18829

differs from @code{sizeof(void *)} requires an ANSI C compiler. An initial

18830

port was done with @code{gcc}. You may actually prefer executables

18831

where @code{int}s are four bytes wide, but the other variant works as well.

18832

18833

You may need quite a bit of memory when trying to recompile the @code{gawk}

18834

sources, as some source files (@file{regex.c} in particular) are quite

18835

big. If you run out of memory compiling such a file, try reducing the

18836

optimization level for this particular file; this may help.

18837

18838

@cindex Linux

18839

With a reasonable shell (Bash will do), and in particular if you run

18840

Linux, MiNT or a similar operating system, you have a pretty good

18841

chance that the @code{configure} utility will succeed. Otherwise

18842

sample versions of @file{config.h} and @file{Makefile.st} are given in the

18843

@file{atari} subdirectory and can be edited and copied to the

18844

corresponding files in the main source directory. Even if

18845

@code{configure} produced something, it might be advisable to compare

18846

its results with the sample versions and possibly make adjustments.

18847

18848

Some @code{gawk} source code fragments depend on a preprocessor define

18849

@samp{atarist}. This basically assumes the TOS environment with @code{gcc}.

18850

Modify these sections as appropriate if they are not right for your

18851

environment. Also see the remarks about @code{AWKPATH} and @code{envsep} in

18852

@ref{Atari Using, ,Running @code{gawk} on the Atari ST}.

18853

18854

As shipped, the sample @file{config.h} claims that the @code{system}
function is missing from the libraries, which is not true; an
alternative implementation of this function is provided in
@file{atari/system.c}. Depending upon your particular combination of
shell and operating system, you may wish to change the file to indicate
that @code{system} is available.

18860

18861

@node Atari Using, , Atari Compiling, Atari Installation

18862

@appendixsubsec Running @code{gawk} on the Atari ST

18863

18864

An executable version of @code{gawk} should be placed, as usual,

18865

anywhere in your @code{PATH} where your shell can find it.

18866

18867

While executing, @code{gawk} creates a number of temporary files. When

18868

using @code{gcc} libraries for TOS, @code{gawk} looks for either of

18869

the environment variables @code{TEMP} or @code{TMPDIR}, in that order.

18870

If either one is found, its value is assumed to be a directory for

18871

temporary files. This directory must exist, and if you can spare the

18872

memory, it is a good idea to put it on a RAM drive. If neither

18873

@code{TEMP} nor @code{TMPDIR} is found, then @code{gawk} uses the

18874

current directory for its temporary files.

18875

18876

The ST version of @code{gawk} searches for its program files as described in

18877

@ref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}.

18878

The default value for the @code{AWKPATH} variable is taken from

18879

@code{DEFPATH} defined in @file{Makefile}. The sample @code{gcc}/TOS

18880

@file{Makefile} for the ST in the distribution sets @code{DEFPATH} to

18881

@code{@w{".,c:\lib\awk,c:\gnu\lib\awk"}}. The search path can be

18882

modified by explicitly setting @code{AWKPATH} to whatever you wish.

18883

Note that colons cannot be used on the ST to separate elements in the

18884

@code{AWKPATH} variable, since they have another, reserved, meaning.

18885

Instead, you must use a comma to separate elements in the path. When

18886

recompiling, the separating character can be modified by initializing

18887

the @code{envsep} variable in @file{atari/gawkmisc.atr} to another

18888

value.

18889

18890

Although @code{awk} allows great flexibility in doing I/O redirections

18891

from within a program, this facility should be used with care on the ST

18892

running under TOS. In some circumstances the OS routines for file

18893

handle pool processing lose track of certain events, causing the

18894

computer to crash, and requiring a reboot. Often a warm reboot is

18895

sufficient. Fortunately, this happens infrequently, and in rather

18896

esoteric situations. In particular, avoid having one part of an

18897

@code{awk} program using @code{print} statements explicitly redirected

18898

to @code{"/dev/stdout"}, while other @code{print} statements use the

18899

default standard output, and a calling shell has redirected standard

18900

output to a file.

18901

18902

When @code{gawk} is compiled with the ST version of @code{gcc} and its

18903

usual libraries, it will accept both @samp{/} and @samp{\} as path separators.

18904

While this is convenient, it should be remembered that this removes one,

18905

technically valid, character (@samp{/}) from your file names, and that

18906

it may create problems for external programs, called via the @code{system}

18907

function, which may not support this convention. Whenever it is possible

18908

that a file created by @code{gawk} will be used by some other program,

18909

use only backslashes. Also remember that in @code{awk}, backslashes in

18910

strings have to be doubled in order to get literal backslashes

18911

(@pxref{Escape Sequences}).

18912

18913

@node Amiga Installation, Bugs, Atari Installation, Installation

18914

@appendixsec Installing @code{gawk} on an Amiga

18915

18916

@cindex amiga

18917

@cindex installation, amiga

18918

You can install @code{gawk} on an Amiga system using a Unix emulation

18919

environment available via anonymous @code{ftp} from

18920

@code{wuarchive.wustl.edu} in the directory @file{pub/aminet/dev/gcc}.

18921

This includes a shell based on @code{pdksh}. The primary component of

18922

this environment is a Unix emulation library, @file{ixemul.lib}.

18923

@c could really use more background here, who wrote this, etc.

18924

18925

A more complete distribution for the Amiga is available on

18926

the FreshFish CD-ROM from:

18927

18928

@quotation

18929

Amiga Library Services @*

18930

610 North Alma School Road, Suite 18 @*

18931

Chandler, AZ 85224 USA @*

18932

Phone: +1-602-491-0048 @*

18933

FAX: +1-602-491-0048 @*

18934

E-mail: @code{orders@@amigalib.com}

18935

@end quotation

18936

18937

Once you have the distribution, you can configure @code{gawk} simply by

18938

running @code{configure}:

18939

18940

@example

18941

configure -v m68k-cbm-amigados

18942

@end example

18943

18944

Then run @code{make}, and you should be all set!

18945

(If these steps do not work, please send in a bug report;

18946

@pxref{Bugs, ,Reporting Problems and Bugs}.)

18947

18948

@node Bugs, Other Versions, Amiga Installation, Installation

18949

@appendixsec Reporting Problems and Bugs

18950

18951

If you have problems with @code{gawk} or think that you have found a bug,

18952

please report it to the developers; we cannot promise to do anything

18953

but we might well want to fix it.

18954

18955

Before reporting a bug, make sure you have actually found a real bug.

18956

Carefully reread the documentation and see if it really says you can do

18957

what you're trying to do. If it's not clear whether you should be able

18958

to do something or not, report that too; it's a bug in the documentation!

18959

18960

Before reporting a bug or trying to fix it yourself, try to isolate it

18961

to the smallest possible @code{awk} program and input data file that

18962

reproduces the problem. Then send us the program and data file,

18963

some idea of what kind of Unix system you're using, and the exact results

18964

@code{gawk} gave you. Also say what you expected to occur; this will help

18965

us decide whether the problem was really in the documentation.

18966

18967

Once you have a precise problem description, there are two e-mail
addresses to which you can send it.

@table @asis
@item Internet:
@samp{bug-gnu-utils@@prep.ai.mit.edu}

@item UUCP:
@samp{uunet!prep.ai.mit.edu!bug-gnu-utils}
@end table

Please include the version number of @code{gawk} you are using.
You can get this information with the command @samp{gawk --version}.
You should send a carbon copy of your mail to Arnold Robbins, who can
be reached at @samp{arnold@@gnu.ai.mit.edu}.

18983

18984

@cindex @code{comp.lang.awk}

18985

@strong{Important!} Do @emph{not} try to report bugs in @code{gawk} by

18986

posting to the Usenet/Internet newsgroup @code{comp.lang.awk}.

18987

While the @code{gawk} developers do occasionally read this newsgroup,

18988

there is no guarantee that we will see your posting. The steps described

18989

above are the official, recognized ways for reporting bugs.

18990

18991

Non-bug suggestions are always welcome as well. If you have questions

18992

about things that are unclear in the documentation or are just obscure

18993

features, ask Arnold Robbins; he will try to help you out, although he

18994

may not have the time to fix the problem. You can send him electronic

18995

mail at the Internet address above.

18996

18997

If you find bugs in one of the non-Unix ports of @code{gawk}, please send

18998

an electronic mail message to the person who maintains that port. They

18999

are listed below, and also in the @file{README} file in the @code{gawk}

19000

distribution. Information in the @file{README} file should be considered

19001

authoritative if it conflicts with this @value{DOCUMENT}.

19002

19003

The people maintaining the non-Unix ports of @code{gawk} are:

19004

19005

@cindex Deifik, Scott

19006

@cindex Fish, Fred

19007

@cindex Hankerson, Darrel

19008

@cindex Jaegermann, Michal

19009

@cindex Rankin, Pat

19010

@cindex Rommel, Kai Uwe

19011

@table @asis
@item MS-DOS
Scott Deifik, @samp{scottd@@amgen.com}, and
Darrel Hankerson, @samp{hankedr@@mail.auburn.edu}.

@item OS/2
Kai Uwe Rommel, @samp{rommel@@ars.de}.

@item VMS
Pat Rankin, @samp{rankin@@eql.caltech.edu}.

@item Atari ST
Michal Jaegermann, @samp{michal@@gortel.phys.ualberta.ca}.

@item Amiga
Fred Fish, @samp{fnf@@amigalib.com}.
@end table

19028

19029

If your bug is also reproducible under Unix, please send copies of your

19030

report to the general GNU bug list, as well as to Arnold Robbins, at the

19031

addresses listed above.

19032

19033

@node Other Versions, , Bugs, Installation

19034

@appendixsec Other Freely Available @code{awk} Implementations

19035

19036

There are two other freely available @code{awk} implementations.

19037

This section briefly describes where to get them.

19038

19039

@table @asis

19040

@cindex Kernighan, Brian

19041

@cindex anonymous @code{ftp}

19042

@cindex @code{ftp}, anonymous

19043

@item Unix @code{awk}

19044

Brian Kernighan has been able to make his implementation of

19045

@code{awk} freely available. You can get it via anonymous @code{ftp}

19046

to the host @code{@w{netlib.att.com}}. Change directory to

19047

@file{/netlib/research}. Use ``binary'' or ``image'' mode, and

19048

retrieve @file{awk.bundle.Z}.

19049

19050

This is a shell archive that has been compressed with the @code{compress}

19051

utility. It can be uncompressed with either @code{uncompress} or the

19052

GNU @code{gunzip} utility.

19053

19054

This version requires an ANSI C compiler; GCC (the GNU C compiler)

19055

works quite nicely.

19056

19057

@cindex Brennan, Michael

19058

@cindex @code{mawk}

19059

@item @code{mawk}

19060

Michael Brennan has written an independent implementation of @code{awk},

19061

called @code{mawk}. It is available under the GPL

19062

(@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}),

19063

just as @code{gawk} is.

19064

19065

You can get it via anonymous @code{ftp} to the host

19066

@code{@w{oxy.edu}}. Change directory to @file{/public}. Use ``binary''

19067

or ``image'' mode, and retrieve @file{mawk1.2.1.tar.gz} (or the latest

19068

version that is there).

19069

19070

@code{gunzip} may be used to decompress this file. Installation

19071

is similar to @code{gawk}'s

19072

(@pxref{Unix Installation, , Compiling and Installing @code{gawk} on Unix}).

19073

@end table

19074

19075

@node Notes, Glossary, Installation, Top

19076

@appendix Implementation Notes

19077

19078

This appendix contains information mainly of interest to implementors and

19079

maintainers of @code{gawk}. Everything in it applies specifically to

19080

@code{gawk}, and not to other implementations.

19081

19082

@menu

19083

* Compatibility Mode:: How to disable certain @code{gawk} extensions.

19084

* Additions:: Making Additions To @code{gawk}.

19085

* Future Extensions:: New features that may be implemented one day.

19086

* Improvements:: Suggestions for improvements by volunteers.

19087

@end menu

19088

19089

@node Compatibility Mode, Additions, Notes, Notes

19090

@appendixsec Downward Compatibility and Debugging

19091

19092

@xref{POSIX/GNU, ,Extensions in @code{gawk} Not in POSIX @code{awk}},

19093

for a summary of the GNU extensions to the @code{awk} language and program.

19094

All of these features can be turned off by invoking @code{gawk} with the

19095

@samp{--traditional} option, or with the @samp{--posix} option.

19096

19097

If @code{gawk} is compiled for debugging with @samp{-DDEBUG}, then there

19098

is one more option available on the command line:

19099

19100

@table @code

19101

@item -W parsedebug

19102

@itemx --parsedebug

19103

Print out the parse stack information as the program is being parsed.

19104

@end table

19105

19106

This option is intended only for serious @code{gawk} developers,

19107

and not for the casual user. It probably has not even been compiled into

19108

your version of @code{gawk}, since it slows down execution.

19109

19110

@node Additions, Future Extensions, Compatibility Mode, Notes

19111

@appendixsec Making Additions to @code{gawk}

19112

19113

If you should find that you wish to enhance @code{gawk} in a significant

19114

fashion, you are perfectly free to do so. That is the point of having

19115

free software; the source code is available, and you are free to change

19116

it as you wish (@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}).

19117

19118

This section discusses the ways you might wish to change @code{gawk},

19119

and any considerations you should bear in mind.

19120

19121

@menu

19122

* Adding Code:: Adding code to the main body of @code{gawk}.

19123

* New Ports:: Porting @code{gawk} to a new operating system.

19124

@end menu

19125

19126

@node Adding Code, New Ports, Additions, Additions

19127

@appendixsubsec Adding New Features

19128

19129

@cindex adding new features

19130

@cindex features, adding

19131

You are free to add any new features you like to @code{gawk}.

19132

However, if you want your changes to be incorporated into the @code{gawk}

19133

distribution, there are several steps that you need to take in order to

19134

make it possible for me to include your changes.

19135

19136

@enumerate 1

19137

@item

19138

Get the latest version.

19139

It is much easier for me to integrate changes if they are relative to

19140

the most recent distributed version of @code{gawk}. If your version of

19141

@code{gawk} is very old, I may not be able to integrate them at all.

19142

@xref{Getting, ,Getting the @code{gawk} Distribution},

19143

for information on getting the latest version of @code{gawk}.

19144

19145

@item

19146

@iftex

19147

Follow the @cite{GNU Coding Standards}.

19148

@end iftex

19149

@ifinfo

19150

See @inforef{Top, , Version, standards, GNU Coding Standards}.

19151

@end ifinfo

19152

This document describes how GNU software should be written. If you haven't

19153

read it, please do so, preferably @emph{before} starting to modify @code{gawk}.

19154

(The @cite{GNU Coding Standards} are available as part of the Autoconf

19155

distribution, from the FSF.)

@cindex @code{gawk} coding style
@cindex coding style used in @code{gawk}
@item
Use the @code{gawk} coding style.
The C code for @code{gawk} follows the instructions in the
@cite{GNU Coding Standards}, with minor exceptions. The code is formatted
using the traditional ``K&R'' style, particularly as regards the placement
of braces and the use of tabs. In brief, the coding rules for @code{gawk}
are:

@itemize @bullet
@item
Use old style (non-prototype) function headers when defining functions.

@item
Put the name of the function at the beginning of its own line.

@item
Put the return type of the function, even if it is @code{int}, on the
line above the line with the name and arguments of the function.

@item
The declarations for the function arguments should not be indented.

@item
Put spaces around parentheses used in control structures
(@code{if}, @code{while}, @code{for}, @code{do}, @code{switch}
and @code{return}).

@item
Do not put spaces in front of parentheses used in function calls.

@item
Put spaces around all C operators, and after commas in function calls.

@item
Do not use the comma operator to produce multiple side-effects, except
in @code{for} loop initialization and increment parts, and in macro bodies.

@item
Use real tabs for indenting, not spaces.

@item
Use the ``K&R'' brace layout style.

@item
Use comparisons against @code{NULL} and @code{'\0'} in the conditions of
@code{if}, @code{while} and @code{for} statements, and in the @code{case}s
of @code{switch} statements, instead of just the
plain pointer or character value.

@item
Use the @code{TRUE}, @code{FALSE}, and @code{NULL} symbolic constants,
and the character constant @code{'\0'} where appropriate, instead of @code{1}
and @code{0}.

@item
Provide one-line descriptive comments for each function.

@item
Do not use @samp{#elif}. Many older Unix C compilers cannot handle it.
@end itemize

If I have to reformat your code to follow the coding style used in
@code{gawk}, I may not bother.

@item
Be prepared to sign the appropriate paperwork.
In order for the FSF to distribute your changes, you must either place
those changes in the public domain, and submit a signed statement to that
effect, or assign the copyright in your changes to the FSF.
Both of these actions are easy to do, and @emph{many} people have done so
already. If you have questions, please contact me
(@pxref{Bugs, , Reporting Problems and Bugs}),
or @code{gnu@@prep.ai.mit.edu}.

@item
Update the documentation.
Along with your new code, please supply new sections and/or chapters
for this @value{DOCUMENT}. If at all possible, please use real
Texinfo, instead of just supplying unformatted ASCII text (although
even that is better than no documentation at all).
Conventions to be followed in @cite{@value{TITLE}} are provided
after the @samp{@@bye} at the end of the Texinfo source file.
If possible, please update the man page as well.

You will also have to sign paperwork for your documentation changes.

@item
Submit changes as context diffs or unified diffs.
Use @samp{diff -c -r -N} or @samp{diff -u -r -N} to compare
the original @code{gawk} source tree with your version.
(I find context diffs to be more readable, but unified diffs are
more compact.)
I recommend using the GNU version of @code{diff}.
Send the output produced by either run of @code{diff} to me when you
submit your changes.
@xref{Bugs, , Reporting Problems and Bugs}, for the electronic mail
information.

Using this format makes it easy for me to apply your changes to the
master version of the @code{gawk} source code (using @code{patch}).
If I have to apply the changes manually, using a text editor, I may
not do so, particularly if there are lots of changes.
@end enumerate
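As a rough illustration of the coding rules in step 3 above, here is a
minimal sketch of a small program laid out in that style. The function
@code{has_blank} and the definitions of @code{TRUE} and @code{FALSE}
are hypothetical, made up for this example; they are not taken from the
@code{gawk} source. Newer C standards no longer accept old style
function headers, so a current compiler may need to be told to use an
older language mode.

@example
#include <stdio.h>

#define TRUE	1	/* assumed values, for this sketch only */
#define FALSE	0

/* has_blank --- return TRUE if the string contains a space */

int
has_blank(s)
char *s;
@{
	if (s == NULL)
		return (FALSE);

	for (; *s != '\0'; s++) @{
		if (*s == ' ')
			return (TRUE);
	@}
	return (FALSE);
@}

/* main --- try has_blank on a sample string */

int
main()
@{
	printf("%d\n", has_blank("a b"));
	return (0);
@}
@end example

The return type sits on its own line, the argument declaration is not
indented, the control structures have spaces around their parentheses
while the function calls do not, and the tests compare explicitly
against @code{NULL} and @code{'\0'}.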

Although this sounds like a lot of work, please remember that while you
may write the new code, I have to maintain it and support it, and if it
isn't possible for me to do that with a minimum of extra work, then I
probably will not.

@node New Ports, , Adding Code, Additions
@appendixsubsec Porting @code{gawk} to a New Operating System

@cindex porting @code{gawk}
If you wish to port @code{gawk} to a new operating system, there are
several steps to follow.

@enumerate 1
@item
Follow the guidelines in
@ref{Adding Code, ,Adding New Features},
concerning coding style, submission of diffs, and so on.

@item
When doing a port, bear in mind that your code must co-exist peacefully
with the rest of @code{gawk}, and the other ports. Avoid gratuitous
changes to the system-independent parts of the code. If at all possible,
avoid sprinkling @samp{#ifdef}s just for your port throughout the
code.

If the changes needed for a particular system affect too much of the
code, I probably will not accept them. In such a case, you will, of course,
be able to distribute your changes on your own, as long as you comply
with the GPL
(@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}).

@item
A number of the files that come with @code{gawk} are maintained by other
people at the Free Software Foundation. Thus, you should not change them
unless it is for a very good reason. I.e.@: changes are not out of the
question, but changes to these files will be scrutinized extra carefully.
The files are @file{alloca.c}, @file{getopt.h}, @file{getopt.c},
@file{getopt1.c}, @file{regex.h}, @file{regex.c}, @file{dfa.h},
@file{dfa.c}, @file{install-sh}, and @file{mkinstalldirs}.

@item
Be willing to continue to maintain the port.
Non-Unix operating systems are supported by volunteers who maintain
the code needed to compile and run @code{gawk} on their systems. If no one
volunteers to maintain a port, that port becomes unsupported, and it may
be necessary to remove it from the distribution.

@item
Supply an appropriate @file{gawkmisc.???} file.
Each port has its own @file{gawkmisc.???} that implements certain
operating system specific functions. This is cleaner than a plethora of
@samp{#ifdef}s scattered throughout the code. The @file{gawkmisc.c} in
the main source directory includes the appropriate
@file{gawkmisc.???} file from each subdirectory.
Be sure to update it as well.

Each port's @file{gawkmisc.???} file has a suffix reminiscent of the machine
or operating system for the port. Examples are @file{pc/gawkmisc.pc} and
@file{vms/gawkmisc.vms}. The use of separate suffixes, instead of plain
@file{gawkmisc.c}, makes it possible to move files from a port's subdirectory
into the main source directory, without accidentally destroying the real
@file{gawkmisc.c} file. (Currently, this is only an issue for the MS-DOS
and OS/2 ports.)

@item
Supply a @file{Makefile} and any other C source and header files that are
necessary for your operating system. All your code should be in a
separate subdirectory, with a name that is the same as, or reminiscent
of, either your operating system or the computer system. If possible,
try to structure things so that it is not necessary to move files out
of the subdirectory into the main source directory. If that is not
possible, then be sure to avoid using names for your files that
duplicate the names of files in the main source directory.

@item
Update the documentation.
Please write a section (or sections) for this @value{DOCUMENT} describing
the steps needed to install and/or compile
@code{gawk} for your system.

@item
Be prepared to sign the appropriate paperwork.
In order for the FSF to distribute your code, you must either place
your code in the public domain, and submit a signed statement to that
effect, or assign the copyright in your code to the FSF.
@ifinfo
Both of these actions are easy to do, and @emph{many} people have done so
already. If you have questions, please contact me, or
@code{gnu@@prep.ai.mit.edu}.
@end ifinfo
@end enumerate

Following these steps will make it much easier to integrate your changes
into @code{gawk}, and have them co-exist happily with the code for other
operating systems that is already there.

In the code that you supply, and that you maintain, feel free to use a
coding style and brace layout that suits your taste.

@c why should this be needed? sigh
@iftex
@page
@end iftex
@node Future Extensions, Improvements, Additions, Notes
@appendixsec Probable Future Extensions

@ignore

19370

From emory!scalpel.netlabs.com!lwall Tue Oct 31 12:43:17 1995

19371

Return-Path: <emory!scalpel.netlabs.com!lwall>

19372

Message-Id: <9510311732.AA28472@scalpel.netlabs.com>

19373

To: arnold@skeeve.atl.ga.us (Arnold D. Robbins)

19374

Subject: Re: May I quote you?

19375

In-Reply-To: Your message of "Tue, 31 Oct 95 09:11:00 EST."

19376

<m0tAHPQ-00014MC@skeeve.atl.ga.us>

19377

Date: Tue, 31 Oct 95 09:32:46 -0800

19378

From: Larry Wall <emory!scalpel.netlabs.com!lwall>

19379

19380

: Greetings. I am working on the release of gawk 3.0. Part of it will be a

19381

: thoroughly updated manual. One of the sections deals with planned future

19382

: extensions and enhancements. I have the following at the beginning

19383

: of it:

19384

:

19385

: @cindex PERL

19386

: @cindex Wall, Larry

19387

: @display

19388

: @i{AWK is a language similar to PERL, only considerably more elegant.} @*

19389

: Arnold Robbins

19390

: @sp 1

19391

: @i{Hey!} @*

19392

: Larry Wall

19393

: @end display

19394

:

19395

: Before I actually release this for publication, I wanted to get your

19396

: permission to quote you. (Hopefully, in the spirit of much of GNU, the

19397

: implied humor is visible... :-)

19398

19399

I think that would be fine.

19400

19401

Larry

19402

@end ignore

@cindex PERL
@cindex Wall, Larry
@display
@i{AWK is a language similar to PERL, only considerably more elegant.}
Arnold Robbins

@i{Hey!}
Larry Wall
@end display

This section briefly lists extensions and possible improvements
that indicate the directions we are
currently considering for @code{gawk}. The file @file{FUTURES} in the
@code{gawk} distribution lists these extensions as well.

This is a list of probable future changes that will be usable by the
@code{awk} language programmer.

@c these are ordered by likelihood
@table @asis
@item Localization
The GNU project is starting to support multiple languages.
It will at least be possible to make @code{gawk} print its warnings and
error messages in languages other than English.
It may be possible for @code{awk} programs to also use the multiple
language facilities, separate from @code{gawk} itself.

@item Databases
It may be possible to map a GDBM/NDBM/SDBM file into an @code{awk} array.

@item A @code{PROCINFO} Array
The special files that provide process-related information
(@pxref{Special Files, ,Special File Names in @code{gawk}})
may be superseded by a @code{PROCINFO} array that would provide the same
information, in an easier to access fashion.

@item More @code{lint} warnings
There are more things that could be checked for portability.

@item Control of subprocess environment
Changes made in @code{gawk} to the array @code{ENVIRON} may be
propagated to subprocesses run by @code{gawk}.

@ignore
@item @code{RECLEN} variable for fixed length records
Along with @code{FIELDWIDTHS}, this would speed up the processing of
fixed-length records.

@item A @code{restart} keyword
After modifying @code{$0}, @code{restart} would restart the pattern
matching loop, without reading a new record from the input.

@item A @samp{|&} redirection
The @samp{|&} redirection, in place of @samp{|}, would open a two-way
pipeline for communication with a sub-process (via @code{getline} and
@code{print} and @code{printf}).

@item Function valued variables
It would be possible to assign the name of a user-defined or built-in
function to a regular @code{awk} variable, and then call the function
indirectly, by using the regular variable. This would make it possible
to write general purpose sorting and comparing routines, for example,
by simply passing the name of one function into another.

@item A built-in @code{stat} function
The @code{stat} function would provide an easy-to-use hook to the
@code{stat} system call so that @code{awk} programs could determine information
about files.

@item A built-in @code{ftw} function
Combined with function valued variables and the @code{stat} function,
@code{ftw} (file tree walk) would make it easy for an @code{awk} program
to walk an entire file tree.
@end ignore
@end table

This is a list of probable improvements that will make @code{gawk}
perform better.

@table @asis
@item An Improved Version of @code{dfa}
The @code{dfa} pattern matcher from GNU @code{grep} has some
problems. Either a new version or a fixed one will deal with some
important regexp matching issues.

@item Use of @code{mmap}
On systems that support the @code{mmap} system call, its use would provide
much faster file input, and considerably simplified input buffer management.

@item Use of GNU @code{malloc}
The GNU version of @code{malloc} could potentially speed up @code{gawk},
since it relies heavily on the use of dynamic memory allocation.

@item Use of the @code{rx} regexp library
The @code{rx} regular expression library could potentially speed up
all regexp operations that require knowing the exact location of matches.
This includes record termination, field and array splitting,
and the @code{sub}, @code{gsub}, @code{gensub} and @code{match} functions.
@end table

@node Improvements, , Future Extensions, Notes
@appendixsec Suggestions for Improvements

Here are some projects that would-be @code{gawk} hackers might like to take
on. They vary in size from a few days to a few weeks of programming,
depending on which one you choose and how fast a programmer you are. Please
send any improvements you write to the maintainers at the GNU project.
@xref{Adding Code, , Adding New Features},
for guidelines to follow when adding new features to @code{gawk}.
@xref{Bugs, ,Reporting Problems and Bugs}, for information on
contacting the maintainers.

@enumerate
@item
Compilation of @code{awk} programs: @code{gawk} uses a Bison (YACC-like)
parser to convert the script given it into a syntax tree; the syntax
tree is then executed by a simple recursive evaluator. This method incurs
a lot of overhead, since the recursive evaluator performs many procedure
calls to do even the simplest things.

It should be possible for @code{gawk} to convert the script's parse tree
into a C program which the user would then compile, using the normal
C compiler and a special @code{gawk} library to provide all the needed
functions (regexps, fields, associative arrays, type coercion, and so
on).

An easier possibility might be for an intermediate phase of @code{gawk} to
convert the parse tree into a linear byte code form like the one used
in GNU Emacs Lisp. The recursive evaluator would then be replaced by
a straight line byte code interpreter that would be intermediate in speed
between running a compiled program and doing what @code{gawk} does
now.

@item
The programs in the test suite could use documenting in this @value{DOCUMENT}.

@item
See the @file{FUTURES} file for more ideas. Contact us if you would
seriously like to tackle any of the items listed there.
@end enumerate
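To make the byte code idea in the first project more concrete, here is a
minimal sketch of a straight line interpreter for a tiny stack machine.
It is not based on any existing @code{gawk} code: the opcode names, the
@code{run} function, and the stack size are all assumptions made up for
illustration, and the sketch does no checking for stack overflow or
malformed programs. It uses old style function headers, so a current
compiler may need to be told to use an older C language mode.

@example
#include <stdio.h>

/* opcodes for a tiny, hypothetical stack machine */
#define OP_PUSH	0	/* push the next code word as a number */
#define OP_ADD	1	/* pop two values, push their sum */
#define OP_MUL	2	/* pop two values, push their product */
#define OP_HALT	3	/* stop; the result is on top of the stack */

#define STACK_MAX	64	/* arbitrary limit for this sketch */

/* run --- execute a linear byte code program, return its result */

double
run(code)
int *code;
@{
	double stack[STACK_MAX];
	int sp = 0;	/* number of values currently on the stack */
	double a, b;

	for (;;) @{
		switch (*code++) @{
		case OP_PUSH:
			stack[sp++] = (double) *code++;
			break;
		case OP_ADD:
			b = stack[--sp];
			a = stack[--sp];
			stack[sp++] = a + b;
			break;
		case OP_MUL:
			b = stack[--sp];
			a = stack[--sp];
			stack[sp++] = a * b;
			break;
		case OP_HALT:
		default:	/* treat unknown opcodes like OP_HALT */
			return (sp > 0 ? stack[sp - 1] : 0.0);
		@}
	@}
@}

/* main --- run the byte code for the expression 2 + 3 * 4 */

int
main()
@{
	static int prog[] = @{
		OP_PUSH, 2, OP_PUSH, 3, OP_PUSH, 4,
		OP_MUL, OP_ADD, OP_HALT
	@};

	printf("%g\n", run(prog));
	return (0);
@}
@end example

The program prints @samp{14}. A single loop around one @code{switch}
takes the place of the separate recursive procedure call that a tree
walking evaluator would make for each node of the same expression.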

@node Glossary, Copying, Notes, Top
@appendix Glossary

@table @asis
@item Action
A series of @code{awk} statements attached to a rule. If the rule's
pattern matches an input record, @code{awk} executes the
rule's action. Actions are always enclosed in curly braces.
@xref{Action Overview, ,Overview of Actions}.

@item Amazing @code{awk} Assembler
Henry Spencer at the University of Toronto wrote a retargetable assembler
completely as @code{awk} scripts. It is thousands of lines long, including
machine descriptions for several eight-bit microcomputers.
It is a good example of a
program that would have been better written in another language.

@item Amazingly Workable Formatter (@code{awf})
Henry Spencer at the University of Toronto wrote a formatter that accepts
a large subset of the @samp{nroff -ms} and @samp{nroff -man} formatting
commands, using @code{awk} and @code{sh}.

@item ANSI
The American National Standards Institute. This organization produces
many standards, among them the standards for the C and C++ programming
languages.

@item Assignment
An @code{awk} expression that changes the value of some @code{awk}
variable or data object. An object that you can assign to is called an
@dfn{lvalue}. The assigned values are called @dfn{rvalues}.
@xref{Assignment Ops, ,Assignment Expressions}.

@item @code{awk} Language
The language in which @code{awk} programs are written.

@item @code{awk} Program
An @code{awk} program consists of a series of @dfn{patterns} and
@dfn{actions}, collectively known as @dfn{rules}. For each input record
given to the program, the program's rules are all processed in turn.
@code{awk} programs may also contain function definitions.

@item @code{awk} Script
Another name for an @code{awk} program.

@item Bash
The GNU version of the standard shell (the Bourne-Again shell).
See ``Bourne Shell.''

@item BBS
See ``Bulletin Board System.''

@item Boolean Expression
Named after the English mathematician Boole. See ``Logical Expression.''

@item Bourne Shell
The standard shell (@file{/bin/sh}) on Unix and Unix-like systems,
originally written by Stephen R.@: Bourne.
Many shells (Bash, @code{ksh}, @code{pdksh}, @code{zsh}) are
generally upwardly compatible with the Bourne shell.

@item Built-in Function
The @code{awk} language provides built-in functions that perform various
numerical, time stamp related, and string computations. Examples are
@code{sqrt} (for the square root of a number) and @code{substr} (for a
substring of a string). @xref{Built-in, ,Built-in Functions}.

@item Built-in Variable
@code{ARGC}, @code{ARGIND}, @code{ARGV}, @code{CONVFMT}, @code{ENVIRON},
@code{ERRNO}, @code{FIELDWIDTHS}, @code{FILENAME}, @code{FNR}, @code{FS},
@code{IGNORECASE}, @code{NF}, @code{NR}, @code{OFMT}, @code{OFS}, @code{ORS},
@code{RLENGTH}, @code{RSTART}, @code{RS}, @code{RT}, and @code{SUBSEP},
are the variables that have special meaning to @code{awk}.
Changing some of them affects @code{awk}'s running environment.
Several of these variables are specific to @code{gawk}.
@xref{Built-in Variables}.

@item Braces
See ``Curly Braces.''

@item Bulletin Board System
A computer system allowing users to log in and read and/or leave messages
for other users of the system, much like leaving paper notes on a bulletin
board.

@item C
The system programming language that most GNU software is written in. The
@code{awk} programming language has C-like syntax, and this @value{DOCUMENT}
points out similarities between @code{awk} and C when appropriate.

@cindex ISO 8859-1
@cindex ISO Latin-1
@item Character Set
The set of numeric codes used by a computer system to represent the
characters (letters, numbers, punctuation, etc.) of a particular country
or place. The most common character set in use today is ASCII (American
Standard Code for Information Interchange). Many European
countries use an extension of ASCII known as ISO-8859-1 (ISO Latin-1).

@item CHEM
A preprocessor for @code{pic} that reads descriptions of molecules
and produces @code{pic} input for drawing them. It was written in @code{awk}
by Brian Kernighan and Jon Bentley, and is available from
@code{@w{netlib@@research.att.com}}.

@item Compound Statement
A series of @code{awk} statements, enclosed in curly braces. Compound
statements may be nested.
@xref{Statements, ,Control Statements in Actions}.

@item Concatenation
Concatenating two strings means sticking them together, one after another,
giving a new string. For example, the string @samp{foo} concatenated with
the string @samp{bar} gives the string @samp{foobar}.
@xref{Concatenation, ,String Concatenation}.

@item Conditional Expression
An expression using the @samp{?:} ternary operator, such as
@samp{@var{expr1} ? @var{expr2} : @var{expr3}}. The expression
@var{expr1} is evaluated; if the result is true, the value of the whole
expression is the value of @var{expr2}, otherwise the value is
@var{expr3}. In either case, only one of @var{expr2} and @var{expr3}
is evaluated. @xref{Conditional Exp, ,Conditional Expressions}.

@item Comparison Expression
A relation that is either true or false, such as @samp{(a < b)}.
Comparison expressions are used in @code{if}, @code{while}, @code{do},
and @code{for}
statements, and in patterns to select which input records to process.
@xref{Typing and Comparison, ,Variable Typing and Comparison Expressions}.

@item Curly Braces
The characters @samp{@{} and @samp{@}}. Curly braces are used in
@code{awk} for delimiting actions, compound statements, and function
bodies.

@item Dark Corner
An area in the language where specifications often were (or still
are) not clear, leading to unexpected or undesirable behavior.
Such areas are marked in this @value{DOCUMENT} with ``(d.c.)'' in the
text, and are indexed under the heading ``dark corner.''

@item Data Objects
These are numbers and strings of characters. Numbers are converted into
strings and vice versa, as needed.
@xref{Conversion, ,Conversion of Strings and Numbers}.

@item Double Precision
An internal representation of numbers that can have fractional parts.
Double precision numbers keep track of more digits than do single precision
numbers, but operations on them are more expensive. This is the way
@code{awk} stores numeric values. It is the C type @code{double}.

@item Dynamic Regular Expression
A dynamic regular expression is a regular expression written as an
ordinary expression. It could be a string constant, such as
@code{"foo"}, but it may also be an expression whose value can vary.
@xref{Computed Regexps, , Using Dynamic Regexps}.

@item Environment
A collection of strings, of the form @var{name@code{=}val}, that each
program has available to it. Users generally place values into the
environment in order to provide information to various programs. Typical
examples are the environment variables @code{HOME} and @code{PATH}.

@item Empty String
See ``Null String.''

@item Escape Sequences
A special sequence of characters used for describing non-printing
characters, such as @samp{\n} for newline, or @samp{\033} for the ASCII
ESC (escape) character. @xref{Escape Sequences}.

@item Field
When @code{awk} reads an input record, it splits the record into pieces
separated by whitespace (or by a separator regexp which you can
change by setting the built-in variable @code{FS}). Such pieces are
called fields. If the pieces are of fixed length, you can use the built-in
variable @code{FIELDWIDTHS} to describe their lengths.
@xref{Field Separators, ,Specifying How Fields are Separated},
and @ref{Constant Size, , Reading Fixed-width Data}.

@item Floating Point Number
Often referred to in mathematical terms as a ``rational'' number, this is
just a number that can have a fractional part.
See ``Double Precision'' and ``Single Precision.''

@item Format
Format strings are used to control the appearance of output in the
@code{printf} statement. Also, data conversions from numbers to strings
are controlled by the format string contained in the built-in variable
@code{CONVFMT}. @xref{Control Letters, ,Format-Control Letters}.

@item Function
A specialized group of statements used to encapsulate general
or program-specific tasks. @code{awk} has a number of built-in
functions, and also allows you to define your own.
@xref{Built-in, ,Built-in Functions},
and @ref{User-defined, ,User-defined Functions}.

@item FSF
See ``Free Software Foundation.''

@item Free Software Foundation
A non-profit organization dedicated
to the production and distribution of freely distributable software.
It was founded by Richard M.@: Stallman, the author of the original
Emacs editor. GNU Emacs is the most widely used version of Emacs today.

@item @code{gawk}
The GNU implementation of @code{awk}.

@item General Public License
This document describes the terms under which @code{gawk} and its source
code may be distributed. (@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE})

@item GNU
``GNU's not Unix''. An on-going project of the Free Software Foundation
to create a complete, freely distributable, POSIX-compliant computing
environment.

@item GPL
See ``General Public License.''

@item Hexadecimal
Base 16 notation, where the digits are @code{0}-@code{9} and
@code{A}-@code{F}, with @samp{A}
representing 10, @samp{B} representing 11, and so on up to @samp{F} for 15.
Hexadecimal numbers are written in C using a leading @samp{0x},
to indicate their base. Thus, @code{0x12} is 18 (one times 16 plus 2).

@item I/O
Abbreviation for ``Input/Output,'' the act of moving data into and/or
out of a running program.

@item Input Record
A single chunk of data read in by @code{awk}. Usually, an @code{awk} input
record consists of one line of text.
@xref{Records, ,How Input is Split into Records}.

@item Integer
A whole number, i.e.@: a number that does not have a fractional part.

19788

19789

@item Keyword

19790

In the @code{awk} language, a keyword is a word that has special

19791

meaning. Keywords are reserved and may not be used as variable names.

19792

19793

@code{gawk}'s keywords are:

19794

@code{BEGIN},

19795

@code{END},

19796

@code{if},

19797

@code{else},

19798

@code{while},

19799

@code{do@dots{}while},

19800

@code{for},

19801

@code{for@dots{}in},

19802

@code{break},

19803

@code{continue},

19804

@code{delete},

19805

@code{next},

19806

@code{nextfile},

19807

@code{function},

19808

@code{func},

19809

and @code{exit}.

19810

19811

@item Logical Expression

19812

An expression using the operators for logic, AND, OR, and NOT, written

19813

@samp{&&}, @samp{||}, and @samp{!} in @code{awk}. Often called Boolean

19814

expressions, after the mathematician who pioneered this kind of

19815

mathematical logic.

19816

19817

@item Lvalue

19818

An expression that can appear on the left side of an assignment

19819

operator. In most languages, lvalues can be variables or array

19820

elements. In @code{awk}, a field designator can also be used as an

19821

lvalue.

19822

19823

@item Null String

19824

A string with no characters in it. It is represented explicitly in

19825

@code{awk} programs by placing two double-quote characters next to

19826

each other (@code{""}). It can appear in input data by having two successive

19827

occurrences of the field separator appear next to each other.

19828

19829

@item Number

19830

A numeric valued data object. The @code{gawk} implementation uses double

19831

precision floating point to represent numbers.

19832

Very old @code{awk} implementations use single precision floating

19833

point.

19834

19835

@item Octal

19836

Base-eight notation, where the digits are @code{0}-@code{7}.

19837

Octal numbers are written in C using a leading @samp{0},

19838

to indicate their base. Thus, @code{013} is 11 (one times 8 plus 3).

@item Pattern
Patterns tell @code{awk} which input records are interesting to which
rules.

A pattern is an arbitrary conditional expression against which input is
tested. If the condition is satisfied, the pattern is said to @dfn{match}
the input record. A typical pattern might compare the input record against
a regular expression. @xref{Pattern Overview, ,Pattern Elements}.

@item POSIX
The name for a series of standards being developed by the IEEE
that specify a Portable Operating System interface. The ``IX'' denotes
the Unix heritage of these standards. The main standard of interest for
@code{awk} users is
@cite{IEEE Standard for Information Technology, Standard 1003.2-1992,
Portable Operating System Interface (POSIX) Part 2: Shell and Utilities}.
Informally, this standard is often referred to as simply ``P1003.2.''

@item Private
Variables and/or functions that are meant for use exclusively by library
functions, and not for the main @code{awk} program. Special care must be
taken when naming such variables and functions.
@xref{Library Names, , Naming Library Function Global Variables}.

@item Range (of input lines)
A sequence of consecutive lines from the input file. A pattern
can specify ranges of input lines for @code{awk} to process, or it can
specify single lines. @xref{Pattern Overview, ,Pattern Elements}.
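
As an illustration, assuming input whose lines each hold a single letter,
the range pattern below selects everything from the first line starting
with @samp{b} through the next line starting with @samp{d}:

@example
$ printf 'a\nb\nc\nd\ne\n' | awk '/^b/, /^d/'
@print{} b
@print{} c
@print{} d
@end example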

@item Recursion
When a function calls itself, either directly or indirectly.
If this isn't clear, refer to the entry for ``recursion.''
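
The classic illustration is the factorial function. In this sketch, the
name @code{fact} is made up for the example; the function calls itself
directly until its argument reaches one:

@example
function fact(n)
@{
    if (n <= 1)
        return 1
    return n * fact(n - 1)
@}

BEGIN @{ print fact(5) @}
@end example

@noindent
Run as an @code{awk} program, this prints 120.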

@item Redirection
Redirection means performing input from other than the standard input
stream, or output to other than the standard output stream.

You can redirect the output of the @code{print} and @code{printf} statements
to a file or a system command, using the @samp{>}, @samp{>>}, and @samp{|}
operators. You can redirect input to the @code{getline} statement using
the @samp{<} and @samp{|} operators.
@xref{Redirection, ,Redirecting Output of @code{print} and @code{printf}},
and @ref{Getline, ,Explicit Input with @code{getline}}.
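
The following sketch gathers these forms in one place. The file names
@file{out.txt}, @file{log.txt}, and @file{in.txt} are placeholders, and
@code{line} and @code{now} are ordinary variables chosen for the example:

@example
BEGIN @{
    print "first line" > "out.txt"      # output to a file
    print "another line" >> "log.txt"   # append to a file
    print "banana" | "sort"             # output to a command
    getline line < "in.txt"             # input from a file
    "date" | getline now                # input from a command
@}
@end example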

@item Regexp
Short for @dfn{regular expression}. A regexp is a pattern that denotes a
set of strings, possibly an infinite set. For example, the regexp
@samp{R.*xp} matches any string starting with the letter @samp{R}
and ending with the letters @samp{xp}. In @code{awk}, regexps are
used in patterns and in conditional expressions. Regexps may contain
escape sequences. @xref{Regexp, ,Regular Expressions}.
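
Here is a quick check of the example regexp. Note that @code{awk} looks
for a match anywhere in the record, so the anchored form @samp{^R.*xp$}
would be needed to require that the whole record start with @samp{R} and
end with @samp{xp}:

@example
$ echo Regexp | awk '/R.*xp/ @{ print "matches" @}'
@print{} matches
@end example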

@item Regular Expression
See ``regexp.''

@item Regular Expression Constant
A regular expression constant is a regular expression written within
slashes, such as @code{/foo/}. This regular expression is chosen
when you write the @code{awk} program, and cannot be changed during
its execution. @xref{Regexp Usage, ,How to Use Regular Expressions}.
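
By way of contrast, this sketch places a regexp constant next to a regexp
whose text is taken from a variable at run time. The variable name
@code{pat} is an assumption made for the example, and is presumed to be
set elsewhere (for instance, with the @samp{-v} option):

@example
$0 ~ /foo/    @{ print "constant: " $0 @}
$0 ~ pat      @{ print "computed: " $0 @}
@end example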

@item Rule
A segment of an @code{awk} program that specifies how to process single
input records. A rule consists of a @dfn{pattern} and an @dfn{action}.
@code{awk} reads an input record; then, for each rule, if the input record
satisfies the rule's pattern, @code{awk} executes the rule's action.
Otherwise, the rule does nothing for that input record.
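
For example, in this sketch the pattern is @samp{$1 > 10} and the action
is the @code{print} statement; only the second input record satisfies
the pattern:

@example
$ printf '5\n15\n' | awk '$1 > 10 @{ print "big: " $1 @}'
@print{} big: 15
@end example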

@item Rvalue
A value that can appear on the right side of an assignment operator.
In @code{awk}, essentially every expression has a value. These values
are rvalues.

@item @code{sed}
See ``Stream Editor.''

@item Short-Circuit
The nature of the @code{awk} logical operators @samp{&&} and @samp{||}.
If the value of the entire expression can be deduced from evaluating just
the left-hand side of these operators, the right-hand side will not
be evaluated
(@pxref{Boolean Ops, ,Boolean Expressions}).
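
A common use is guarding a division. In this sketch, the left-hand side
@samp{n != 0} is false, so the division on the right-hand side is never
attempted:

@example
$ awk 'BEGIN @{ n = 0
>              if (n != 0 && 10 / n > 1)
>                  print "ratio is large"
>              else
>                  print "division skipped" @}'
@print{} division skipped
@end example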

@item Side Effect
A side effect occurs when an expression has an effect aside from merely
producing a value. Assignment expressions, increment and decrement
expressions, and function calls have side effects.
@xref{Assignment Ops, ,Assignment Expressions}.
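
For instance, in this sketch the expression @samp{i++} both yields a value
(the old value of @code{i}, which is stored in @code{j}) and, as a side
effect, changes @code{i}:

@example
$ awk 'BEGIN @{ i = 5; j = i++; print i, j @}'
@print{} 6 5
@end example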

@item Single Precision
An internal representation of numbers that can have fractional parts.
Single precision numbers keep track of fewer digits than do double precision
numbers, but operations on them are less expensive in terms of CPU time.
This is the type used by some very old versions of @code{awk} to store
numeric values. It is the C type @code{float}.

@item Space
The character generated by hitting the space bar on the keyboard.

@item Special File
A file name interpreted internally by @code{gawk}, instead of being handed
directly to the underlying operating system. For example, @file{/dev/stderr}.
@xref{Special Files, ,Special File Names in @code{gawk}}.
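
A typical use, sketched here with a made-up message, is sending
diagnostics to the standard error stream:

@example
NF == 0  @{ print "warning: empty record " NR > "/dev/stderr" @}
@end example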

@item Stream Editor
A program that reads records from an input stream and processes them one
or more at a time. This is in contrast with batch programs, which may
expect to read their input files in their entirety before starting to do
anything, and with interactive programs, which require input from the
user.

@item String
A datum consisting of a sequence of characters, such as @samp{I am a
string}. Constant strings are written with double-quotes in the
@code{awk} language, and may contain escape sequences.
@xref{Escape Sequences}.
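
As a short illustration, this sketch uses the escape sequences
@samp{\"} and @samp{\n} inside a string constant:

@example
$ awk 'BEGIN @{ print "say \"hi\"\nbye" @}'
@print{} say "hi"
@print{} bye
@end example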

@item Tab
The character generated by hitting the @kbd{TAB} key on the keyboard.
It usually expands to up to eight spaces upon output.

@item Unix
A computer operating system originally developed in the early 1970's at
AT&T Bell Laboratories. It initially became popular in universities around
the world, and later moved into commercial environments as a software
development system and network server system. There are many commercial
versions of Unix, as well as several work-alike systems whose source code
is freely available (such as Linux, NetBSD, and FreeBSD).

@item Whitespace
A sequence of space or tab characters occurring inside an input record or a
string.
@end table

@node Copying, Index, Glossary, Top
@unnumbered GNU GENERAL PUBLIC LICENSE
@center Version 2, June 1991

@display

59 Temple Place --- Suite 330, Boston, MA 02111-1307, USA

Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
@end display

@c fakenode --- for prepinfo
@unnumberedsec Preamble

The licenses for most software are designed to take away your
freedom to share and change it. By contrast, the GNU General Public
License is intended to guarantee your freedom to share and change free
software---to make sure the software is free for all its users. This
General Public License applies to most of the Free Software
Foundation's software and to any other program whose authors commit to
using it. (Some other Free Software Foundation software is covered by
the GNU Library General Public License instead.) You can apply it to
your programs, too.

When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
this service if you wish), that you receive source code or can get it
if you want it, that you can change the software or use pieces of it
in new free programs; and that you know you can do these things.

To protect your rights, we need to make restrictions that forbid
anyone to deny you these rights or to ask you to surrender the rights.
These restrictions translate to certain responsibilities for you if you
distribute copies of the software, or if you modify it.

For example, if you distribute copies of such a program, whether
gratis or for a fee, you must give the recipients all the rights that
you have. You must make sure that they, too, receive or can get the
source code. And you must show them these terms so they know their
rights.

We protect your rights with two steps: (1) copyright the software, and
(2) offer you this license which gives you legal permission to copy,
distribute and/or modify the software.

Also, for each author's protection and ours, we want to make certain
that everyone understands that there is no warranty for this free
software. If the software is modified by someone else and passed on, we
want its recipients to know that what they have is not the original, so
that any problems introduced by others will not reflect on the original
authors' reputations.

Finally, any free program is threatened constantly by software
patents. We wish to avoid the danger that redistributors of a free
program will individually obtain patent licenses, in effect making the
program proprietary. To prevent this, we have made it clear that any
patent must be licensed for everyone's free use or not licensed at all.

The precise terms and conditions for copying, distribution and
modification follow.

@iftex
@c fakenode --- for prepinfo
@unnumberedsec TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
@end iftex
@ifinfo
@center TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
@end ifinfo

@enumerate 0
@item
This License applies to any program or other work which contains
a notice placed by the copyright holder saying it may be distributed
under the terms of this General Public License. The ``Program'', below,
refers to any such program or work, and a ``work based on the Program''
means either the Program or any derivative work under copyright law:
that is to say, a work containing the Program or a portion of it,
either verbatim or with modifications and/or translated into another
language. (Hereinafter, translation is included without limitation in
the term ``modification''.) Each licensee is addressed as ``you''.

Activities other than copying, distribution and modification are not
covered by this License; they are outside its scope. The act of
running the Program is not restricted, and the output from the Program
is covered only if its contents constitute a work based on the
Program (independent of having been made by running the Program).
Whether that is true depends on what the Program does.

@item
You may copy and distribute verbatim copies of the Program's
source code as you receive it, in any medium, provided that you
conspicuously and appropriately publish on each copy an appropriate
copyright notice and disclaimer of warranty; keep intact all the
notices that refer to this License and to the absence of any warranty;
and give any other recipients of the Program a copy of this License
along with the Program.

You may charge a fee for the physical act of transferring a copy, and
you may at your option offer warranty protection in exchange for a fee.

@item
You may modify your copy or copies of the Program or any portion
of it, thus forming a work based on the Program, and copy and
distribute such modifications or work under the terms of Section 1
above, provided that you also meet all of these conditions:

@enumerate a
@item
You must cause the modified files to carry prominent notices
stating that you changed the files and the date of any change.

@item
You must cause any work that you distribute or publish, that in
whole or in part contains or is derived from the Program or any
part thereof, to be licensed as a whole at no charge to all third
parties under the terms of this License.

@item
If the modified program normally reads commands interactively
when run, you must cause it, when started running for such
interactive use in the most ordinary way, to print or display an
announcement including an appropriate copyright notice and a
notice that there is no warranty (or else, saying that you provide
a warranty) and that users may redistribute the program under
these conditions, and telling the user how to view a copy of this
License. (Exception: if the Program itself is interactive but
does not normally print such an announcement, your work based on
the Program is not required to print an announcement.)
@end enumerate

These requirements apply to the modified work as a whole. If
identifiable sections of that work are not derived from the Program,
and can be reasonably considered independent and separate works in
themselves, then this License, and its terms, do not apply to those
sections when you distribute them as separate works. But when you
distribute the same sections as part of a whole which is a work based
on the Program, the distribution of the whole must be on the terms of
this License, whose permissions for other licensees extend to the
entire whole, and thus to each and every part regardless of who wrote it.

Thus, it is not the intent of this section to claim rights or contest
your rights to work written entirely by you; rather, the intent is to
exercise the right to control the distribution of derivative or
collective works based on the Program.

In addition, mere aggregation of another work not based on the Program
with the Program (or with a work based on the Program) on a volume of
a storage or distribution medium does not bring the other work under
the scope of this License.

@item
You may copy and distribute the Program (or a work based on it,
under Section 2) in object code or executable form under the terms of
Sections 1 and 2 above provided that you also do one of the following:

@enumerate a
@item
Accompany it with the complete corresponding machine-readable
source code, which must be distributed under the terms of Sections
1 and 2 above on a medium customarily used for software interchange; or,

@item
Accompany it with a written offer, valid for at least three
years, to give any third party, for a charge no more than your
cost of physically performing source distribution, a complete
machine-readable copy of the corresponding source code, to be
distributed under the terms of Sections 1 and 2 above on a medium
customarily used for software interchange; or,

@item
Accompany it with the information you received as to the offer
to distribute corresponding source code. (This alternative is
allowed only for non-commercial distribution and only if you
received the program in object code or executable form with such
an offer, in accord with Subsection b above.)
@end enumerate

The source code for a work means the preferred form of the work for
making modifications to it. For an executable work, complete source
code means all the source code for all modules it contains, plus any
associated interface definition files, plus the scripts used to
control compilation and installation of the executable. However, as a
special exception, the source code distributed need not include
anything that is normally distributed (in either source or binary
form) with the major components (compiler, kernel, and so on) of the
operating system on which the executable runs, unless that component
itself accompanies the executable.

If distribution of executable or object code is made by offering
access to copy from a designated place, then offering equivalent
access to copy the source code from the same place counts as
distribution of the source code, even though third parties are not
compelled to copy the source along with the object code.

@item
You may not copy, modify, sublicense, or distribute the Program
except as expressly provided under this License. Any attempt
otherwise to copy, modify, sublicense or distribute the Program is
void, and will automatically terminate your rights under this License.
However, parties who have received copies, or rights, from you under
this License will not have their licenses terminated so long as such
parties remain in full compliance.

@item
You are not required to accept this License, since you have not
signed it. However, nothing else grants you permission to modify or
distribute the Program or its derivative works. These actions are
prohibited by law if you do not accept this License. Therefore, by
modifying or distributing the Program (or any work based on the
Program), you indicate your acceptance of this License to do so, and
all its terms and conditions for copying, distributing or modifying
the Program or works based on it.

@item
Each time you redistribute the Program (or any work based on the
Program), the recipient automatically receives a license from the
original licensor to copy, distribute or modify the Program subject to
these terms and conditions. You may not impose any further
restrictions on the recipients' exercise of the rights granted herein.
You are not responsible for enforcing compliance by third parties to
this License.

@item
If, as a consequence of a court judgment or allegation of patent
infringement or for any other reason (not limited to patent issues),
conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot
distribute so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you
may not distribute the Program at all. For example, if a patent
license would not permit royalty-free redistribution of the Program by
all those who receive copies directly or indirectly through you, then
the only way you could satisfy both it and this License would be to
refrain entirely from distribution of the Program.

If any portion of this section is held invalid or unenforceable under
any particular circumstance, the balance of the section is intended to
apply and the section as a whole is intended to apply in other
circumstances.

It is not the purpose of this section to induce you to infringe any
patents or other property right claims or to contest validity of any
such claims; this section has the sole purpose of protecting the
integrity of the free software distribution system, which is
implemented by public license practices. Many people have made
generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that
system; it is up to the author/donor to decide if he or she is willing
to distribute software through any other system and a licensee cannot
impose that choice.

This section is intended to make thoroughly clear what is believed to
be a consequence of the rest of this License.

@item
If the distribution and/or use of the Program is restricted in
certain countries either by patents or by copyrighted interfaces, the
original copyright holder who places the Program under this License
may add an explicit geographical distribution limitation excluding
those countries, so that distribution is permitted only in or among
countries not thus excluded. In such case, this License incorporates
the limitation as if written in the body of this License.

@item
The Free Software Foundation may publish revised and/or new versions
of the General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.

Each version is given a distinguishing version number. If the Program
specifies a version number of this License which applies to it and ``any
later version'', you have the option of following the terms and conditions
either of that version or of any later version published by the Free
Software Foundation. If the Program does not specify a version number of
this License, you may choose any version ever published by the Free Software
Foundation.

@item
If you wish to incorporate parts of the Program into other free
programs whose distribution conditions are different, write to the author
to ask for permission. For software which is copyrighted by the Free
Software Foundation, write to the Free Software Foundation; we sometimes
make exceptions for this. Our decision will be guided by the two goals
of preserving the free status of all derivatives of our free software and
of promoting the sharing and reuse of software generally.

@iftex
@c fakenode --- for prepinfo
@heading NO WARRANTY
@end iftex
@ifinfo
@center NO WARRANTY
@end ifinfo

@item
BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW@. EXCEPT WHEN
OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
PROVIDE THE PROGRAM ``AS IS'' WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE@. THE ENTIRE RISK AS
TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU@. SHOULD THE
PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
REPAIR OR CORRECTION.

@item
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.
@end enumerate

@iftex
@c fakenode --- for prepinfo
@heading END OF TERMS AND CONDITIONS
@end iftex
@ifinfo
@center END OF TERMS AND CONDITIONS
@end ifinfo

@page
@c fakenode --- for prepinfo
@unnumberedsec How to Apply These Terms to Your New Programs

If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.

To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
convey the exclusion of warranty; and each file should have at least
the ``copyright'' line and a pointer to where the full notice is found.

@smallexample
@var{one line to give the program's name and an idea of what it does.}
Copyright (C) 19@var{yy} @var{name of author}

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE@. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place --- Suite 330, Boston, MA 02111-1307, USA.
@end smallexample

Also add information on how to contact you by electronic and paper mail.

If the program is interactive, make it output a short notice like this
when it starts in an interactive mode:

@smallexample
Gnomovision version 69, Copyright (C) 19@var{yy} @var{name of author}
Gnomovision comes with ABSOLUTELY NO WARRANTY; for details
type `show w'. This is free software, and you are welcome
to redistribute it under certain conditions; type `show c'
for details.
@end smallexample

The hypothetical commands @samp{show w} and @samp{show c} should show
the appropriate parts of the General Public License. Of course, the
commands you use may be called something other than @samp{show w} and
@samp{show c}; they could even be mouse-clicks or menu items---whatever
suits your program.

You should also get your employer (if you work as a programmer) or your
school, if any, to sign a ``copyright disclaimer'' for the program, if
necessary. Here is a sample; alter the names:

@smallexample
@group
Yoyodyne, Inc., hereby disclaims all copyright
interest in the program `Gnomovision'
(which makes passes at compilers) written
by James Hacker.

@var{signature of Ty Coon}, 1 April 1989
Ty Coon, President of Vice
@end group
@end smallexample

This General Public License does not permit incorporating your program into
proprietary programs. If your program is a subroutine library, you may
consider it more useful to permit linking proprietary applications with the
library. If this is what you want to do, use the GNU Library General
Public License instead of this License.

@node Index, , Copying, Top
@unnumbered Index
@printindex cp

@summarycontents
@contents
@bye

Unresolved Issues:
------------------
1. From ADR.

Robert J. Chassell points out that awk programs should have some indication
of how to use them. It would be useful to perhaps have a "programming
style" section of the manual that would include this and other tips.

2. The default AWKPATH search path should be configurable via `configure'
The default and how this changes needs to be documented.

Consistency issues:
/.../ regexps are in @code, not @samp
".." strings are in @code, not @samp
no @print before @dots
values of expressions in the text (@code{x} has the value 15),
should be in roman, not @code
Use tab and not TAB
Use ESC and not ESCAPE
Use space and not blank to describe the space bar's character
The term "blank" is thus basically reserved for "blank lines" etc.
The `(d.c.)' should appear inside the closing `.' of a sentence
It should come before (pxref{...})
" " should have an @w{} around it
Use "non-" everywhere
Use @code{ftp} when talking about anonymous ftp
Use upper-case and lower-case, not "upper case" and "lower case"
Use alphanumeric, not alpha-numeric
Use --foo, not -Wfoo when describing long options
Use findex for all programs and functions in the example chapters
Use "Bell Labs" or "AT&T Bell Laboratories", but not
"AT&T Bell Labs".
Use "behavior" instead of "behaviour".
Use "zeros" instead of "zeroes".
Use "Input/Output", not "input/output". Also "I/O", not "i/o".
Use @code{do}, and not @code{do}-@code{while}, except where
actually discussing the do-while.
The words "a", "and", "as", "between", "for", "from", "in", "of",
"on", "that", "the", "to", "with", and "without",
should not be capitalized in @chapter, @section etc.
"Into" and "How" should.
Search for @dfn; make sure important items are also indexed.
"e.g." should always be followed by a comma.
"i.e." should never be followed by a comma, and should be followed
by `@:'.
The numbers zero through ten should be spelled out, except when
talking about file descriptor numbers. > 10 and < 0, it's
ok to use numbers.
In tables, put command line options in @code, while in the text,
put them in @samp.
When using @strong, use "Note:" or "Caution:" with colons and
not exclamation points. Do not surround the paragraphs
with @quotation ... @end quotation.

Date: Wed, 13 Apr 94 15:20:52 -0400
From: rsm@gnu.ai.mit.edu (Richard Stallman)
To: gnu-prog@gnu.ai.mit.edu
Subject: A reminder: no pathnames in GNU

It's a GNU convention to use the term "file name" for the name of a
file, never "pathname". We use the term "path" for search paths,
which are lists of file names. Using it for a single file name as
well is potentially confusing to users.

So please check any documentation you maintain, if you think you might
have used "pathname".

Note that "file name" should be two words when it appears as ordinary
text. It's ok as one word when it's a metasyntactic variable, though.

Suggestions:
------------
Enhance FIELDWIDTHS with some way to indicate "the rest of the record".
E.g., a length of 0 or -1 or something. May be "n"?

Make FIELDWIDTHS be an array?

What if FIELDWIDTHS has invalid values in it?