\input texinfo @c -*-texinfo-*-
@c %**start of header (This is for running Texinfo on a region.)
@settitle AWK Language Programming
@c %**end of header (This is for running Texinfo on a region.)
* Gawk: (gawk.info). A Text Scanning and Processing Language.
@c @set xref-automatic-section-title
@c The following information should be updated here only!
@c This sets the edition of the document, the version of gawk it
@c applies to, and when the document was updated.
@set TITLE AWK Language Programming
@set UPDATE-MONTH January 1996
@set DOCUMENT Info file
@c Some comments on the layout for TeX.
@c 1. Use the texinfo.tex from the gawk distribution. It contains fixes that
@c    are needed to keep the footings for draft mode from appearing.
@c 2. I have done A LOT of work to make this look good. There are `@page' commands
@c    and uses of `@group ... @end group' in a number of places. If you muck
@c    with anything, it's your responsibility not to break the layout.
@c merge the function and variable indexes into the concept index
@c If "finalout" is commented out, the printed output will show
@c black boxes that mark lines that are too long. Thus, it is
@c unwise to comment it out when running a master in case there are
@c overfulls which are deemed okay.
This file documents @code{awk}, a program that you can use to select
particular records in a file and perform operations upon them.

This is Edition @value{EDITION} of @cite{@value{TITLE}},
for the @value{VERSION} version of the GNU implementation of AWK.

Copyright (C) 1989, 1991 - 1996 Free Software Foundation, Inc.

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

Permission is granted to process this file through TeX and print the
results, provided the printed document carries copying permission
notice identical to this one except for the removal of this paragraph
(this paragraph not being relevant to the printed manual).

Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided that the entire
resulting derived work is distributed under the terms of a permission
notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions,
except that this permission notice may be stated in a translation approved
by the Foundation.
@setchapternewpage odd
@subtitle A User's Guide for GNU AWK
@subtitle Edition @value{EDITION}
@subtitle @value{UPDATE-MONTH}
@author Arnold D. Robbins
@author Based on @cite{The GAWK Manual},
@author by Robbins, Close, Rubin, and Stallman
@c Include the Distribution inside the titlepage environment so
@c that headings are turned off. Headings on and off do not work.
@vskip 0pt plus 1filll

The programs and applications presented in this book have been
included for their instructional value. They have been tested with care,
but are not guaranteed for any particular purpose. The publisher does not
offer any warranties or representations, nor does it accept any
liabilities with respect to the programs or applications.

UNIX is a registered trademark of X/Open, Ltd. @*
Microsoft, MS, and MS-DOS are registered trademarks, and Windows is a
trademark of Microsoft Corporation in the United States and other
countries. @*
Atari, 520ST, 1040ST, TT, STE, Mega, and Falcon are registered trademarks
or trademarks of Atari Corporation. @*
DEC, Digital, OpenVMS, ULTRIX, and VMS are trademarks of Digital Equipment
Corporation. @*
``To boldly go where no man has gone before'' is a
Registered Trademark of Paramount Pictures Corporation. @*
@c sorry, i couldn't resist

Copyright @copyright{} 1989, 1991 - 1996 Free Software Foundation, Inc.
This is Edition @value{EDITION} of @cite{@value{TITLE}}, @*
for the @value{VERSION} (or later) version of the GNU implementation of AWK.

Published by the Free Software Foundation @*
59 Temple Place --- Suite 330 @*
Boston, MA 02111-1307 USA @*
Phone: +1-617-542-5942 @*
Fax (including Japan): +1-617-542-2652 @*
Printed copies are available for $25 each. @*
@c this ISBN can change! Check with the FSF office...
@c This one is correct for gawk 3.0 and edition 1.0
ISBN 1-882114-26-4 @*

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided that the entire
resulting derived work is distributed under the terms of a permission
notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions,
except that this permission notice may be stated in a translation approved
by the Foundation.

Cover art by Etienne Suvasa.

@c Thanks to Bob Chassell for directions on doing dedications.
@center @i{To Miriam, for making me complete.}
@center @i{To Chana, for the joy you bring us.}
@center @i{To Rivka, for the exponential increase.}

@evenheading @thispage@ @ @ @b{@thistitle} @| @|
@oddheading @| @| @b{@thischapter}@ @ @ @thispage
@evenfooting @today{} @| @emph{DRAFT!} @| Please Do Not Redistribute
@oddfooting Please Do Not Redistribute @| @emph{DRAFT!} @| @today{}
@node Top, Preface, (dir), (dir)
@top General Introduction
@c Preface or Licensing nodes should come right after the Top
@c node, in `unnumbered' sections, then the chapter, `What is gawk'.

This file documents @code{awk}, a program that you can use to select
particular records in a file and perform operations upon them.

This is Edition @value{EDITION} of @cite{@value{TITLE}}, @*
for the @value{VERSION} version of the GNU implementation @*
of AWK.
* Preface:: What this @value{DOCUMENT} is about; brief
        history and acknowledgements.
* What Is Awk:: What is the @code{awk} language; using this
* Getting Started:: A basic introduction to using @code{awk}. How
        to run an @code{awk} program. Command line
* One-liners:: Short, sample @code{awk} programs.
* Regexp:: All about matching things using regular
* Reading Files:: How to read files and manipulate fields.
* Printing:: How to print using @code{awk}. Describes the
        @code{print} and @code{printf} statements.
        Also describes redirection of output.
* Expressions:: Expressions are the basic building blocks of
* Patterns and Actions:: Overviews of patterns and actions.
* Statements:: The various control statements are described
* Built-in Variables:: Built-in Variables
* Arrays:: The description and use of arrays. Also
        includes array-oriented control statements.
* Built-in:: The built-in functions are summarized here.
* User-defined:: User-defined functions are described in
* Invoking Gawk:: How to run @code{gawk}.
* Library Functions:: A Library of @code{awk} Functions.
* Sample Programs:: Many @code{awk} programs with complete
* Language History:: The evolution of the @code{awk} language.
* Gawk Summary:: @code{gawk} Options and Language Summary.
* Installation:: Installing @code{gawk} under various operating
* Notes:: Something about the implementation of
* Glossary:: An explanation of some unfamiliar terms.
* Copying:: Your right to copy and distribute @code{gawk}.
* Index:: Concept and Variable Index.
* History:: The history of @code{gawk} and @code{awk}.
* Manual History:: Brief history of the GNU project and this
* Acknowledgements:: Acknowledgements.
* This Manual:: Using this @value{DOCUMENT}. Includes sample
        input files that you can use.
* Conventions:: Typographical Conventions.
* Sample Data Files:: Sample data files for use in the @code{awk}
        programs illustrated in this @value{DOCUMENT}.
* Names:: What name to use to find @code{awk}.
* Running gawk:: How to run @code{gawk} programs; includes
* One-shot:: Running a short throw-away @code{awk} program.
* Read Terminal:: Using no input files (input from terminal
* Long:: Putting permanent @code{awk} programs in
* Executable Scripts:: Making self-contained @code{awk} programs.
* Comments:: Adding documentation to @code{gawk} programs.
* Very Simple:: A very simple example.
* Two Rules:: A less simple one-line example with two rules.
* More Complex:: A more complex example.
* Statements/Lines:: Subdividing or combining statements into
* Other Features:: Other Features of @code{awk}.
* When:: When to use @code{gawk} and when to use other
* Regexp Usage:: How to Use Regular Expressions.
* Escape Sequences:: How to write non-printing characters.
* Regexp Operators:: Regular Expression Operators.
* GNU Regexp Operators:: Operators specific to GNU software.
* Case-sensitivity:: How to do case-insensitive matching.
* Leftmost Longest:: How much text matches.
* Computed Regexps:: Using Dynamic Regexps.
* Records:: Controlling how data is split into records.
* Fields:: An introduction to fields.
* Non-Constant Fields:: Non-constant Field Numbers.
* Changing Fields:: Changing the Contents of a Field.
* Field Separators:: The field separator and how to change it.
* Basic Field Splitting:: How fields are split with single characters or
* Regexp Field Splitting:: Using regexps as the field separator.
* Single Character Fields:: Making each character a separate field.
* Command Line Field Separator:: Setting @code{FS} from the command line.
* Field Splitting Summary:: Some final points and a summary table.
* Constant Size:: Reading constant width data.
* Multiple Line:: Reading multi-line records.
* Getline:: Reading files under explicit program control
        using the @code{getline} function.
* Getline Intro:: Introduction to the @code{getline} function.
* Plain Getline:: Using @code{getline} with no arguments.
* Getline/Variable:: Using @code{getline} into a variable.
* Getline/File:: Using @code{getline} from a file.
* Getline/Variable/File:: Using @code{getline} into a variable from a
* Getline/Pipe:: Using @code{getline} from a pipe.
* Getline/Variable/Pipe:: Using @code{getline} into a variable from a
* Getline Summary:: Summary Of @code{getline} Variants.
* Print:: The @code{print} statement.
* Print Examples:: Simple examples of @code{print} statements.
* Output Separators:: The output separators and how to change them.
* OFMT:: Controlling Numeric Output With @code{print}.
* Printf:: The @code{printf} statement.
* Basic Printf:: Syntax of the @code{printf} statement.
* Control Letters:: Format-control letters.
* Format Modifiers:: Format-specification modifiers.
* Printf Examples:: Several examples.
* Redirection:: How to redirect output to multiple files and
* Special Files:: File name interpretation in @code{gawk}.
        @code{gawk} allows access to inherited file
* Close Files And Pipes:: Closing Input and Output Files and Pipes.
* Constants:: String, numeric, and regexp constants.
* Scalar Constants:: Numeric and string constants.
* Regexp Constants:: Regular Expression constants.
* Using Constant Regexps:: When and how to use a regexp constant.
* Variables:: Variables give names to values for later use.
* Using Variables:: Using variables in your programs.
* Assignment Options:: Setting variables on the command line and a
        summary of command line syntax. This is an
        advanced method of input.
* Conversion:: The conversion of strings to numbers and vice
* Arithmetic Ops:: Arithmetic operations (@samp{+}, @samp{-},
* Concatenation:: Concatenating strings.
* Assignment Ops:: Changing the value of a variable or a field.
* Increment Ops:: Incrementing the numeric value of a variable.
* Truth Values:: What is ``true'' and what is ``false''.
* Typing and Comparison:: How variables acquire types, and how this
        affects comparison of numbers and strings with
* Boolean Ops:: Combining comparison expressions using boolean
        operators @samp{||} (``or''), @samp{&&}
        (``and'') and @samp{!} (``not'').
* Conditional Exp:: Conditional expressions select between two
        subexpressions under control of a third
* Function Calls:: A function call is an expression.
* Precedence:: How various operators nest.
* Pattern Overview:: What goes into a pattern.
* Kinds of Patterns:: A list of all kinds of patterns.
* Regexp Patterns:: Using regexps as patterns.
* Expression Patterns:: Any expression can be used as a pattern.
* Ranges:: Pairs of patterns specify record ranges.
* BEGIN/END:: Specifying initialization and cleanup rules.
* Using BEGIN/END:: How and why to use BEGIN/END rules.
* I/O And BEGIN/END:: I/O issues in BEGIN/END rules.
* Empty:: The empty pattern, which matches every record.
* Action Overview:: What goes into an action.
* If Statement:: Conditionally execute some @code{awk}
* While Statement:: Loop until some condition is satisfied.
* Do Statement:: Do specified action while looping until some
        condition is satisfied.
* For Statement:: Another looping statement, that provides
        initialization and increment clauses.
* Break Statement:: Immediately exit the innermost enclosing loop.
* Continue Statement:: Skip to the end of the innermost enclosing
* Next Statement:: Stop processing the current input record.
* Nextfile Statement:: Stop processing the current file.
* Exit Statement:: Stop execution of @code{awk}.
* User-modified:: Built-in variables that you change to control
* Auto-set:: Built-in variables where @code{awk} gives you
* ARGC and ARGV:: Ways to use @code{ARGC} and @code{ARGV}.
* Array Intro:: Introduction to Arrays
* Reference to Elements:: How to examine one element of an array.
* Assigning Elements:: How to change an element of an array.
* Array Example:: Basic Example of an Array
* Scanning an Array:: A variation of the @code{for} statement. It
        loops through the indices of an array's
* Delete:: The @code{delete} statement removes an element
* Numeric Array Subscripts:: How to use numbers as subscripts in
* Uninitialized Subscripts:: Using Uninitialized variables as subscripts.
* Multi-dimensional:: Emulating multi-dimensional arrays in
* Multi-scanning:: Scanning multi-dimensional arrays.
* Calling Built-in:: How to call built-in functions.
* Numeric Functions:: Functions that work with numbers, including
        @code{int}, @code{sin} and @code{rand}.
* String Functions:: Functions for string manipulation, such as
        @code{split}, @code{match}, and
* I/O Functions:: Functions for files and shell commands.
* Time Functions:: Functions for dealing with time stamps.
* Definition Syntax:: How to write definitions and what they mean.
* Function Example:: An example function definition and what it
* Function Caveats:: Things to watch out for.
* Return Statement:: Specifying the value a function returns.
* Options:: Command line options and their meanings.
* Other Arguments:: Input file names and variable assignments.
* AWKPATH Variable:: Searching directories for @code{awk} programs.
* Obsolete:: Obsolete Options and/or features.
* Undocumented:: Undocumented Options and Features.
* Known Bugs:: Known Bugs in @code{gawk}.
* Portability Notes:: What to do if you don't have @code{gawk}.
* Nextfile Function:: Two implementations of a @code{nextfile}
* Assert Function:: A function for assertions in @code{awk}
* Ordinal Functions:: Functions for using characters as numbers and
* Join Function:: A function to join an array into a string.
* Mktime Function:: A function to turn a date into a timestamp.
* Gettimeofday Function:: A function to get formatted times.
* Filetrans Function:: A function for handling data file transitions.
* Getopt Function:: A function for processing command line
* Passwd Functions:: Functions for getting user information.
* Group Functions:: Functions for getting group information.
* Library Names:: How to best name private global variables in
* Clones:: Clones of common utilities.
* Cut Program:: The @code{cut} utility.
* Egrep Program:: The @code{egrep} utility.
* Id Program:: The @code{id} utility.
* Split Program:: The @code{split} utility.
* Tee Program:: The @code{tee} utility.
* Uniq Program:: The @code{uniq} utility.
* Wc Program:: The @code{wc} utility.
* Miscellaneous Programs:: Some interesting @code{awk} programs.
* Dupword Program:: Finding duplicated words in a document.
* Alarm Program:: An alarm clock.
* Translate Program:: A program similar to the @code{tr} utility.
* Labels Program:: Printing mailing labels.
* Word Sorting:: A program to produce a word usage count.
* History Sorting:: Eliminating duplicate entries from a history
* Extract Program:: Pulling out programs from Texinfo source
* Simple Sed:: A Simple Stream Editor.
* Igawk Program:: A wrapper for @code{awk} that includes files.
* V7/SVR3.1:: The major changes between V7 and System V
* SVR4:: Minor changes between System V Releases 3.1
* POSIX:: New features from the POSIX standard.
* BTL:: New features from the AT&T Bell Laboratories
        version of @code{awk}.
* POSIX/GNU:: The extensions in @code{gawk} not in POSIX
* Command Line Summary:: Recapitulation of the command line.
* Language Summary:: A terse review of the language.
* Variables/Fields:: Variables, fields, and arrays.
* Fields Summary:: Input field splitting.
* Built-in Summary:: @code{awk}'s built-in variables.
* Arrays Summary:: Using arrays.
* Data Type Summary:: Values in @code{awk} are numbers or strings.
* Rules Summary:: Patterns and Actions, and their component
* Pattern Summary:: Quick overview of patterns.
* Regexp Summary:: Quick overview of regular expressions.
* Actions Summary:: Quick overview of actions.
* Operator Summary:: @code{awk} operators.
* Control Flow Summary:: The control statements.
* I/O Summary:: The I/O statements.
* Printf Summary:: A summary of @code{printf}.
* Special File Summary:: Special file names interpreted internally.
* Built-in Functions Summary:: Built-in numeric and string functions.
* Time Functions Summary:: Built-in time functions.
* String Constants Summary:: Escape sequences in strings.
* Functions Summary:: Defining and calling functions.
* Historical Features:: Some undocumented but supported ``features''.
* Gawk Distribution:: What is in the @code{gawk} distribution.
* Getting:: How to get the distribution.
* Extracting:: How to extract the distribution.
* Distribution contents:: What is in the distribution.
* Unix Installation:: Installing @code{gawk} under various versions
* Quick Installation:: Compiling @code{gawk} under Unix.
* Configuration Philosophy:: How it's all supposed to work.
* VMS Installation:: Installing @code{gawk} on VMS.
* VMS Compilation:: How to compile @code{gawk} under VMS.
* VMS Installation Details:: How to install @code{gawk} under VMS.
* VMS Running:: How to run @code{gawk} under VMS.
* VMS POSIX:: Alternate instructions for VMS POSIX.
* PC Installation:: Installing and Compiling @code{gawk} on MS-DOS
* Atari Installation:: Installing @code{gawk} on the Atari ST.
* Atari Compiling:: Compiling @code{gawk} on Atari
* Atari Using:: Running @code{gawk} on Atari
* Amiga Installation:: Installing @code{gawk} on an Amiga.
* Bugs:: Reporting Problems and Bugs.
* Other Versions:: Other freely available @code{awk}
* Compatibility Mode:: How to disable certain @code{gawk} extensions.
* Additions:: Making Additions To @code{gawk}.
* Adding Code:: Adding code to the main body of @code{gawk}.
* New Ports:: Porting @code{gawk} to a new operating system.
* Future Extensions:: New features that may be implemented one day.
* Improvements:: Suggestions for improvements by volunteers.
@c dedication for Info file
@center To Miriam, for making me complete.
@center To Chana, for the joy you bring us.
@center To Rivka, for the exponential increase.
@node Preface, What Is Awk, Top, Top
@unnumbered Preface

@c I saw a comment somewhere that the preface should describe the book itself,
@c and the introduction should describe what the book covers.

This @value{DOCUMENT} teaches you about the @code{awk} language and
how you can use it effectively. You should already be familiar with basic
system commands, such as @code{cat} and @code{ls},@footnote{These commands
are available on POSIX-compliant systems, as well as on traditional
Unix-based systems. If you are using some other operating system, you still
need to be familiar with the ideas of I/O redirection and pipes.} and basic shell
facilities, such as Input/Output (I/O) redirection and pipes.

Implementations of the @code{awk} language are available for many different
computing environments. This @value{DOCUMENT}, while describing the @code{awk} language
in general, also describes a particular implementation of @code{awk} called
@code{gawk} (which stands for ``GNU Awk''). @code{gawk} runs on a broad range
of Unix systems, from 80386 PC-based computers up through large-scale
systems such as Crays. @code{gawk} has also been ported to MS-DOS and
OS/2 PCs, Atari and Amiga microcomputers, and VMS.

* History:: The history of @code{gawk} and @code{awk}.
* Manual History:: Brief history of the GNU project and this
* Acknowledgements:: Acknowledgements.
@node History, Manual History, Preface, Preface
@unnumberedsec History of @code{awk} and @code{gawk}

@cindex history of @code{awk}
@cindex Weinberger, Peter
@cindex Kernighan, Brian
@cindex old @code{awk}
@cindex new @code{awk}
The name @code{awk} comes from the initials of its designers: Alfred V.@:
Aho, Peter J.@: Weinberger, and Brian W.@: Kernighan. The original version of
@code{awk} was written in 1977 at AT&T Bell Laboratories.
In 1985 a new version made the programming
language more powerful, introducing user-defined functions, multiple input
streams, and computed regular expressions.
This new version became generally available with Unix System V Release 3.1.
The version in System V Release 4 added some new features and also cleaned
up the behavior in some of the ``dark corners'' of the language.
The specification for @code{awk} in the POSIX Command Language
and Utilities standard further clarified the language based on feedback
from both the @code{gawk} designers and the original Bell Labs @code{awk}
designers.

The GNU implementation, @code{gawk}, was written in 1986 by Paul Rubin
and Jay Fenlason, with advice from Richard Stallman. John Woods
contributed parts of the code as well. In 1988 and 1989, David Trueman, with
help from Arnold Robbins, thoroughly reworked @code{gawk} for compatibility
with the newer @code{awk}. Current development focuses on bug fixes,
performance improvements, standards compliance, and, occasionally, new features.
@node Manual History, Acknowledgements, History, Preface
@unnumberedsec The GNU Project and This Book

@cindex Free Software Foundation
The Free Software Foundation (FSF) is a non-profit organization dedicated
to the production and distribution of freely distributable software.
It was founded by Richard M.@: Stallman, the author of the original
Emacs editor. GNU Emacs is the most widely used version of Emacs today.

The GNU project is an ongoing effort on the part of the Free Software
Foundation to create a complete, freely distributable, POSIX-compliant
computing environment. (GNU stands for ``GNU's not Unix''.)
The FSF uses the ``GNU General Public License'' (or GPL) to ensure that
source code for its software is always available to the end user. A
copy of the GPL is included for your reference
(@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}).
The GPL applies to the C language source code for @code{gawk}.

As of this writing (1995), the only major component of the
GNU environment still uncompleted is the operating system kernel, and
work proceeds apace on that. A shell, an editor (Emacs), highly portable
optimizing C, C++, and Objective-C compilers, a symbolic debugger, and dozens
of large and small utilities (such as @code{gawk})
have all been completed and are freely available.

Until the GNU operating system is released, the FSF recommends the use
of Linux, a freely distributable, Unix-like operating system for 80386
and other systems. There are many books on Linux. One that is freely available
is @cite{Linux Installation and Getting Started}, by Matt Welsh.
Many Linux distributions are available, often in computer stores or
bundled on CD-ROM with books about Linux. Also, the FSF provides a Linux
distribution (``Debian''); contact them for more information.
@xref{Getting, ,Getting the @code{gawk} Distribution}, for the FSF's contact
information.

(There are two other freely available, Unix-like operating systems for
80386 and other systems, NetBSD and FreeBSD. Both are based on the
4.4-Lite Berkeley Software Distribution, and both use recent versions
of @code{gawk} for their versions of @code{awk}.)
The @value{DOCUMENT} you are reading now is actually free. The
information in it is freely available to anyone, the machine-readable
source code for the @value{DOCUMENT} comes with @code{gawk}, and anyone
may take this @value{DOCUMENT} to a copying machine and make as many
copies of it as they like. (Take a moment to check the copying
permissions on the Copyright page.)

If you paid money for this @value{DOCUMENT}, what you actually paid for
was the @value{DOCUMENT}'s nice printing and binding, and the
publisher's associated costs to produce it. We have made an effort to
keep these costs reasonable; most people would prefer a bound book to
over 300 pages of photocopied text that would then have to be held in
a loose-leaf binder (not to mention the time and labor involved in
doing the copying). The same is true of producing this
@value{DOCUMENT} from the machine-readable source; the retail price is
only slightly more than the cost per page of printing it

This @value{DOCUMENT} itself has gone through several previous,
preliminary editions. I started working on a preliminary draft of
@cite{The GAWK Manual}, by Diane Close, Paul Rubin, and Richard
Stallman, in the fall of 1988.
It was around 90 pages long, and barely described the original, ``old''
version of @code{awk}. After substantial revision, the first version of
@cite{The GAWK Manual} to be released was Edition 0.11 Beta in
October of 1989. The manual then underwent more substantial revision
for Edition 0.13 of December 1991.
David Trueman, Pat Rankin, and Michal Jaegermann contributed sections
of the manual for Edition 0.13.
That edition was published by the
FSF as a bound book early in 1992. Since then there have been several
minor revisions, notably Edition 0.14 of November 1992, which was published
by the FSF in January of 1993, and Edition 0.16 of August 1993.

Edition 1.0 of @cite{@value{TITLE}} represents a significant re-working
of @cite{The GAWK Manual}, with much additional material.
The FSF and I agree that I am now the primary author.
I also felt that it needed a more descriptive title.

@cite{@value{TITLE}} will undoubtedly continue to evolve.
An electronic version
comes with the @code{gawk} distribution from the FSF.
If you find an error in this @value{DOCUMENT}, please report it!
@xref{Bugs, ,Reporting Problems and Bugs}, for information on submitting
problem reports electronically, or write to me in care of the FSF.
677
@node Acknowledgements, , Manual History, Preface
678
@unnumberedsec Acknowledgements
680
I would like to acknowledge Richard M.@: Stallman, for his vision of a
681
better world, and for his courage in founding the FSF and starting the
684
The initial draft of @cite{The GAWK Manual} had the following acknowledgements:
687
Many people need to be thanked for their assistance in producing this
688
manual. Jay Fenlason contributed many ideas and sample programs. Richard
689
Mlynarik and Robert Chassell gave helpful comments on drafts of this
690
manual. The paper @cite{A Supplemental Document for @code{awk}} by John W.@:
691
Pierce of the Chemistry Department at UC San Diego, pinpointed several
692
issues relevant both to @code{awk} implementation and to this manual, that
693
would otherwise have escaped us.
696
The following people provided many helpful comments on Edition 0.13 of
697
@cite{The GAWK Manual}: Rick Adams, Michael Brennan, Rich Burridge, Diane Close,
698
Christopher (``Topher'') Eliot, Michael Lijewski, Pat Rankin, Miriam Robbins,
699
and Michal Jaegermann.
701
The following people provided many helpful comments for Edition 1.0 of
702
@cite{@value{TITLE}}: Karl Berry, Michael Brennan, Darrel
703
Hankerson, Michal Jaegermann, Michael Lijewski, and Miriam Robbins.
704
Pat Rankin, Michal Jaegermann, Darrel Hankerson and Scott Deifik
705
updated their respective sections for Edition 1.0.
707
Robert J.@: Chassell provided much valuable advice on
708
the use of Texinfo. He also deserves special thanks for
709
convincing me @emph{not} to title this @value{DOCUMENT}
710
@cite{How To Gawk Politely}.
711
Karl Berry helped significantly with the @TeX{} part of Texinfo.
713
@cindex Trueman, David
714
David Trueman deserves special credit; he has done a yeoman job
715
of evolving @code{gawk} so that it performs well, and without bugs.
716
Although he is no longer involved with @code{gawk},
717
working with him on this project was a significant pleasure.
719
@cindex Deifik, Scott
720
@cindex Hankerson, Darrel
721
@cindex Rommel, Kai Uwe
723
@cindex Jaegermann, Michal
724
Scott Deifik, Darrel Hankerson, Kai Uwe Rommel, Pat Rankin, and Michal
725
Jaegermann (in no particular order) are long time members of the
726
@code{gawk} ``crack portability team.'' Without their hard work and
727
help, @code{gawk} would not be nearly the fine program it is today. It
728
has been and continues to be a pleasure working with this team of fine
731
@cindex Friedl, Jeffrey
732
Jeffrey Friedl provided invaluable help in tracking down a number
733
of last minute problems with regular expressions in @code{gawk} 3.0.
735
@cindex Kernighan, Brian
736
David and I would like to thank Brian Kernighan of Bell Labs for
737
invaluable assistance during the testing and debugging of @code{gawk}, and for
738
help in clarifying numerous points about the language. We could not have
739
done nearly as good a job on either @code{gawk} or its documentation without

I would like to thank Marshall and Elaine Hartholz of Seattle, and Dr.@:
Bert and Rita Schreiber of Detroit for large amounts of quiet vacation
time in their homes, which allowed me to make significant progress on
this @value{DOCUMENT} and on @code{gawk} itself. Phil Hughes of SSC
contributed in a very important way by loaning me his laptop Linux
system, not once, but twice, allowing me to do a lot of work while
away from home.

@cindex Robbins, Miriam
Finally, I must thank my wonderful wife, Miriam, for her patience through
the many versions of this project, for her proof-reading,
and for sharing me with the computer.
I would like to thank my parents for their love, and for the grace with
which they raised and educated me.
I also must acknowledge my gratitude to G-d, for the many opportunities
He has sent my way, as well as for the gifts He has given me with which to
take advantage of those opportunities.

@c Stuff still not covered anywhere:
@c Integer vs. floating point
@c Hex vs. octal vs. decimal
@c Interpreter vs compiler

@node What Is Awk, Getting Started, Preface, Top
@chapter Introduction

If you are like many computer users, you would frequently like to make
changes in various text files wherever certain patterns appear, or
extract data from parts of certain lines while discarding the rest. To
write a program to do this in a language such as C or Pascal is a
time-consuming inconvenience that may take many lines of code. The job
may be easier with @code{awk}.

The @code{awk} utility interprets a special-purpose programming language
that makes it possible to handle simple data-reformatting jobs
with just a few lines of code.

The GNU implementation of @code{awk} is called @code{gawk}; it is fully
upward compatible with the System V Release 4 version of
@code{awk}. @code{gawk} is also upward compatible with the POSIX
specification of the @code{awk} language. This means that all
properly written @code{awk} programs should work with @code{gawk}.
Thus, we usually don't distinguish between @code{gawk} and other @code{awk}
implementations.

@cindex uses of @code{awk}
Using @code{awk} you can:

@itemize @bullet
@item
manage small, personal databases

@item
produce indexes, and perform other document preparation tasks

@item
even experiment with algorithms that can be adapted later to other computer
languages
@end itemize

@menu
* This Manual::                 Using this @value{DOCUMENT}. Includes sample
                                input files that you can use.
* Conventions::                 Typographical Conventions.
* Sample Data Files::           Sample data files for use in the @code{awk}
                                programs illustrated in this @value{DOCUMENT}.
@end menu

@node This Manual, Conventions, What Is Awk, What Is Awk
@section Using This Book
@cindex book, using this
@cindex using this book
@cindex language, @code{awk}
@cindex program, @code{awk}
@cindex @code{awk} language
@cindex @code{awk} program

The term @code{awk} refers to a particular program, and to the language you
use to tell this program what to do. When we need to be careful, we call
the program ``the @code{awk} utility'' and the language ``the @code{awk}
language.'' The term @code{gawk} refers to a version of @code{awk} developed
as part of the GNU project. The purpose of this @value{DOCUMENT} is to explain
both the @code{awk} language and how to run the @code{awk} utility.

The main purpose of the @value{DOCUMENT} is to explain the features
of @code{awk}, as defined in the POSIX standard. It does so in the context
of one particular implementation, @code{gawk}. While doing so, it will also
attempt to describe important differences between @code{gawk} and other
@code{awk} implementations. Finally, any @code{gawk} features that
are not in the POSIX standard for @code{awk} will be noted.

This @value{DOCUMENT} has the difficult task of being both tutorial and reference.
If you are a novice, feel free to skip over details that seem too complex.
You should also ignore the many cross references; they are for the
expert user, and for the on-line Info version of the document.

The term @dfn{@code{awk} program} refers to a program written by you in
the @code{awk} programming language.

@xref{Getting Started, ,Getting Started with @code{awk}}, for the bare
essentials you need to know to start using @code{awk}.

Some useful ``one-liners'' are included to give you a feel for the
@code{awk} language (@pxref{One-liners, ,Useful One Line Programs}).

Many sample @code{awk} programs have been provided for you
(@pxref{Library Functions, ,A Library of @code{awk} Functions}; also
@pxref{Sample Programs, ,Practical @code{awk} Programs}).

The entire @code{awk} language is summarized for quick reference in
@ref{Gawk Summary, ,@code{gawk} Summary}. Look there if you just need
to refresh your memory about a particular feature.

If you find terms that you aren't familiar with, try looking them
up in the glossary (@pxref{Glossary}).

Most of the time, complete @code{awk} programs are used as examples, but in
some of the more advanced sections, only the part of the @code{awk} program
that illustrates the concept being described is shown.

While this @value{DOCUMENT} is aimed principally at people who have not been
exposed to @code{awk}, there is a lot of information here that even the @code{awk}
expert should find useful. In particular, the description of POSIX
@code{awk}, and the example programs in
@ref{Library Functions, ,A Library of @code{awk} Functions}, and
@ref{Sample Programs, ,Practical @code{awk} Programs},
should be of interest.

@c fakenode --- for prepinfo
@unnumberedsubsec Dark Corners
@cindex d.c., see ``dark corner''

Until the POSIX standard (and @cite{The Gawk Manual}),
many features of @code{awk} were either poorly documented, or not
documented at all. Descriptions of such features
(often called ``dark corners'') are noted in this @value{DOCUMENT} with
``(d.c.)''.
They also appear in the index under the heading ``dark corner.''

@node Conventions, Sample Data Files, This Manual, What Is Awk
@section Typographical Conventions

This @value{DOCUMENT} is written using Texinfo, the GNU documentation formatting language.
A single Texinfo source file is used to produce both the printed and on-line
versions of the documentation.

Because of this, the typographical conventions
are slightly different from those in other books you may have read.

This section briefly documents the typographical conventions used in Texinfo.

Examples you would type at the command line are preceded by the common
shell primary and secondary prompts, @samp{$} and @samp{>}.
Output from the command is preceded by the glyph ``@print{}''.
This typically represents the command's standard output.
Error messages, and other output on the command's standard error, are preceded
by the glyph ``@error{}''. For example:

@example
$ echo hi on stdout
@print{} hi on stdout
$ echo hello on stderr 1>&2
@error{} hello on stderr
@end example

In the text, command names appear in @code{this font}, while code segments
appear in the same font and quoted, @samp{like this}. Some things will
be emphasized @emph{like this}, and if a point needs to be made
strongly, it will be done @strong{like this}. The first occurrence of
a new term is usually its @dfn{definition}, and appears in the same
font as the previous occurrence of ``definition'' in this sentence.
File names are indicated like this: @file{/path/to/ourfile}.

Characters that you type at the keyboard look @kbd{like this}. In particular,
there are special characters called ``control characters.'' These are
characters that you type by holding down both the @kbd{CONTROL} key and
another key, at the same time. For example, a @kbd{Control-d} is typed
by first pressing and holding the @kbd{CONTROL} key, next
pressing the @kbd{d} key, and finally releasing both keys.

@node Sample Data Files, , Conventions, What Is Awk
@section Data Files for the Examples

@cindex input file, sample
@cindex sample input file
@cindex @file{BBS-list} file
Many of the examples in this @value{DOCUMENT} take their input from two sample
data files. The first, called @file{BBS-list}, represents a list of
computer bulletin board systems together with information about those systems.
The second data file, called @file{inventory-shipped}, contains
information about shipments on a monthly basis. In both files,
each line is considered to be one @dfn{record}.

In the file @file{BBS-list}, each record contains the name of a computer
bulletin board, its phone number, the board's baud rate(s), and a code for
the number of hours it is operational. An @samp{A} in the last column
means the board operates 24 hours a day. A @samp{B} in the last
column means the board operates only evening and weekend hours. A
@samp{C} means the board operates only on weekends.

@c 2e: Update the baud rates to reflect today's faster modems
@example
@c system mkdir eg/lib
@c system mkdir eg/data
@c system mkdir eg/prog
@c system mkdir eg/misc
@c file eg/data/BBS-list
aardvark     555-5553     1200/300          B
alpo-net     555-3412     2400/1200/300     A
barfly       555-7685     1200/300          A
bites        555-1675     2400/1200/300     A
camelot      555-0542     300               C
core         555-2912     1200/300          C
fooey        555-1234     2400/1200/300     B
foot         555-6699     1200/300          B
macfoo       555-6480     1200/300          A
sdace        555-3430     2400/1200/300     A
sabafoo      555-2127     1200/300          C
@c endfile
@end example

@cindex @file{inventory-shipped} file
The second data file, called @file{inventory-shipped}, represents
information about shipments during the year.
Each record contains the month of the year, the number
of green crates shipped, the number of red boxes shipped, the number of
orange bags shipped, and the number of blue packages shipped,
respectively. There are 16 entries, covering the 12 months of one year
and four months of the next year.

@c file eg/data/inventory-shipped

If you are reading this in GNU Emacs using Info, you can copy the regions
of text showing these sample files into your own test files. This way you
can try out the examples shown in the remainder of this document. You do
this by using the command @kbd{M-x write-region} to copy text from the Info
file into a file for use with @code{awk}
(@xref{Misc File Ops, , Miscellaneous File Operations, emacs, GNU Emacs Manual},
for more information). Using this information, create your own
@file{BBS-list} and @file{inventory-shipped} files, and practice what you
learn in this @value{DOCUMENT}.

If you are using the stand-alone version of Info,
see @ref{Extract Program, ,Extracting Programs from Texinfo Source Files},
for an @code{awk} program that will extract these data files from
@file{gawk.texi}, the Texinfo source file for this Info file.

@node Getting Started, One-liners, What Is Awk, Top
@chapter Getting Started with @code{awk}
@cindex script, definition of
@cindex rule, definition of
@cindex program, definition of
@cindex basic function of @code{awk}

The basic function of @code{awk} is to search files for lines (or other
units of text) that contain certain patterns. When a line matches one
of the patterns, @code{awk} performs specified actions on that line.
@code{awk} keeps processing input lines in this way until the end of the
input files is reached.

@cindex data-driven languages
@cindex procedural languages
@cindex language, data-driven
@cindex language, procedural
Programs in @code{awk} are different from programs in most other languages,
because @code{awk} programs are @dfn{data-driven}; that is, you describe
the data you wish to work with, and then what to do when you find it.
Most other languages are @dfn{procedural}; you have to describe, in great
detail, every step the program is to take. When working with procedural
languages, it is usually much
harder to clearly describe the data your program will process.
For this reason, @code{awk} programs are often refreshingly easy to both
write and read.

@cindex program, definition of
@cindex rule, definition of
When you run @code{awk}, you specify an @code{awk} @dfn{program} that
tells @code{awk} what to do. The program consists of a series of
@dfn{rules}. (It may also contain @dfn{function definitions},
an advanced feature which we will ignore for now.
@xref{User-defined, ,User-defined Functions}.) Each rule specifies one
pattern to search for, and one action to perform when that pattern is found.

Syntactically, a rule consists of a pattern followed by an action. The
action is enclosed in curly braces to separate it from the pattern.
Rules are usually separated by newlines. Therefore, an @code{awk}
program looks like this:

@example
@var{pattern} @{ @var{action} @}
@var{pattern} @{ @var{action} @}
@end example
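
As a minimal sketch of this layout, the following shell fragment runs a
two-rule program on two made-up input lines. The input text, the patterns,
and the variable name @code{out} are illustrative assumptions; they are not
part of the sample data files used elsewhere.

```shell
# Feed two made-up lines to a two-rule awk program.
# Each rule is a pattern followed by an action in curly braces,
# and the two rules are separated by a newline.
out=$(printf 'apple pie\nbanana split\n' | awk '
/apple/  { print "rule 1 matched" }
/banana/ { print "rule 2 matched" }')
echo "$out"
```

Each input line is tested against both patterns, so the first line triggers
only the first rule and the second line triggers only the second.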

@menu
* Names::                       What name to use to find @code{awk}.
* Running gawk::                How to run @code{gawk} programs; includes
                                command line syntax.
* Very Simple::                 A very simple example.
* Two Rules::                   A less simple one-line example with two rules.
* More Complex::                A more complex example.
* Statements/Lines::            Subdividing or combining statements into
                                lines.
* Other Features::              Other Features of @code{awk}.
* When::                        When to use @code{gawk} and when to use other
                                things.
@end menu

@node Names, Running gawk, Getting Started, Getting Started
@section A Rose By Any Other Name

@cindex old @code{awk} vs. new @code{awk}
@cindex new @code{awk} vs. old @code{awk}
The @code{awk} language has evolved over the years. Full details are
provided in @ref{Language History, ,The Evolution of the @code{awk} Language}.
The language described in this @value{DOCUMENT}
is often referred to as ``new @code{awk}.''

Because of this, many systems have multiple
versions of @code{awk}.
Some systems have an @code{awk} utility that implements the
original version of the @code{awk} language, and a @code{nawk} utility
for the new version. Others have an @code{oawk} for the ``old @code{awk}''
language, and plain @code{awk} for the new one. Still others only
have one version, usually the new one.@footnote{Often, these systems
use @code{gawk} for their @code{awk} implementation!}

All in all, this makes it difficult for you to know which version of
@code{awk} you should run when writing your programs. The best advice
we can give here is to check your local documentation. Look for @code{awk},
@code{oawk}, and @code{nawk}, as well as for @code{gawk}. Chances are, you
will have some version of new @code{awk} on your system, and that is what
you should use when running your programs. (Of course, if you're reading
this @value{DOCUMENT}, chances are good that you have @code{gawk}!)

Throughout this @value{DOCUMENT}, whenever we refer to a language feature
that should be available in any complete implementation of POSIX @code{awk},
we simply use the term @code{awk}. When referring to a feature that is
specific to the GNU implementation, we use the term @code{gawk}.

@node Running gawk, Very Simple, Names, Getting Started
@section How to Run @code{awk} Programs

@cindex command line formats
@cindex running @code{awk} programs
There are several ways to run an @code{awk} program. If the program is
short, it is easiest to include it in the command that runs @code{awk},
like this:

@example
awk '@var{program}' @var{input-file1} @var{input-file2} @dots{}
@end example

@noindent
where @var{program} consists of a series of patterns and actions, as
described earlier.
(The reason for the single quotes is described below, in
@ref{One-shot, ,One-shot Throw-away @code{awk} Programs}.)

When the program is long, it is usually more convenient to put it in a file
and run it with a command like this:

@example
awk -f @var{program-file} @var{input-file1} @var{input-file2} @dots{}
@end example

@menu
* One-shot::                    Running a short throw-away @code{awk} program.
* Read Terminal::               Using no input files (input from terminal
                                instead).
* Long::                        Putting permanent @code{awk} programs in
                                files.
* Executable Scripts::          Making self-contained @code{awk} programs.
* Comments::                    Adding documentation to @code{gawk} programs.
@end menu

@node One-shot, Read Terminal, Running gawk, Running gawk
@subsection One-shot Throw-away @code{awk} Programs

Once you are familiar with @code{awk}, you will often type in simple
programs the moment you want to use them. Then you can write the
program as the first argument of the @code{awk} command, like this:

@example
awk '@var{program}' @var{input-file1} @var{input-file2} @dots{}
@end example

@noindent
where @var{program} consists of a series of @var{patterns} and
@var{actions}, as described earlier.

@cindex single quotes, why needed
This command format instructs the @dfn{shell}, or command interpreter,
to start @code{awk} and use the @var{program} to process records in the
input file(s). There are single quotes around @var{program} so that
the shell doesn't interpret any @code{awk} characters as special shell
characters. They also cause the shell to treat all of @var{program} as
a single argument for @code{awk} and allow @var{program} to be more
than one line long.

This format is also useful for running short or medium-sized @code{awk}
programs from shell scripts, because it avoids the need for a separate
file for the @code{awk} program. A self-contained shell script is more
reliable since there are no other files to misplace.

@ref{One-liners, , Useful One Line Programs}, presents several short,
self-contained programs.

As an interesting side point, the command

@example
awk '/foo/' @var{files} @dots{}
@end example

@noindent
is essentially the same as

@cindex @code{egrep}
@example
egrep foo @var{files} @dots{}
@end example
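
One way to check this equivalence on your own system is sketched below.
It assumes a shell with @code{mktemp} and @code{grep} available, uses
@samp{grep -E} as the current spelling of @code{egrep}, and builds a scratch
file from three @file{BBS-list} lines; the variable names are ours.

```shell
# Build a small scratch file (three lines borrowed from BBS-list).
tmp=$(mktemp)
printf '%s\n' \
    'aardvark     555-5553     1200/300          B' \
    'fooey        555-1234     2400/1200/300     B' \
    'macfoo       555-6480     1200/300          A' > "$tmp"

# A pattern with no action prints every matching line.
awk_out=$(awk '/foo/' "$tmp")
# grep -E selects lines with the same extended regular expression.
grep_out=$(grep -E 'foo' "$tmp")

echo "$awk_out"
rm -f "$tmp"
```

Both commands select the @samp{fooey} and @samp{macfoo} lines and skip
@samp{aardvark}.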

@node Read Terminal, Long, One-shot, Running gawk
@subsection Running @code{awk} without Input Files

@cindex standard input
@cindex input, standard
You can also run @code{awk} without any input files. If you type the
command line:

@example
awk '@var{program}'
@end example

@noindent
then @code{awk} applies the @var{program} to the @dfn{standard input},
which usually means whatever you type on the terminal. This continues
until you indicate end-of-file by typing @kbd{Control-d}.
(On other operating systems, the end-of-file character may be different.
For example, on OS/2 and MS-DOS, it is @kbd{Control-z}.)

For example, the following program prints a friendly piece of advice
(from Douglas Adams' @cite{The Hitchhiker's Guide to the Galaxy}),
to keep you from worrying about the complexities of computer programming
(@samp{BEGIN} is a feature we haven't discussed yet).

@example
$ awk "BEGIN @{ print \"Don't Panic!\" @}"
@print{} Don't Panic!
@end example

@cindex quoting, shell
@cindex shell quoting
This program does not read any input. The @samp{\} before each of the
inner double quotes is necessary because of the shell's quoting rules,
in particular because the program mixes both single quotes and double quotes.
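
The sketch below compares two ways of quoting this program in a
POSIX-style shell. The second form, which closes the single quotes,
adds an escaped apostrophe, and reopens them, is a common shell idiom
rather than anything specific to @code{awk}; the variable names
@code{a} and @code{b} are ours.

```shell
# Double-quoted program: each inner double quote needs a backslash.
a=$(awk "BEGIN { print \"Don't Panic!\" }")

# Single-quoted program: the sequence '\'' ends the quoted text,
# supplies a literal apostrophe, and starts quoted text again.
b=$(awk 'BEGIN { print "Don'\''t Panic!" }')

echo "$a"
echo "$b"
```

Either way, @code{awk} receives exactly the same program text as its
first argument.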

This next simple @code{awk} program
emulates the @code{cat} utility; it copies whatever you type at the
keyboard to its standard output. (Why this works is explained shortly.)

@example
$ awk '@{ print @}'
Now is the time for all good men
@print{} Now is the time for all good men
to come to the aid of their country.
@print{} to come to the aid of their country.
Four score and seven years ago, ...
@print{} Four score and seven years ago, ...
What, me worry?
@print{} What, me worry?
@kbd{Control-d}
@end example
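
Standard input need not be the keyboard. As a small sketch, assuming a
POSIX-style shell, the same kind of program can read lines piped from
@code{printf}; the sample text and the variable name @code{out} are
illustrative.

```shell
# With no input files named, awk reads its standard input,
# which here is the output of printf rather than the terminal.
out=$(printf 'Now is the time\nfor all good men\n' | awk '{ print }')
echo "$out"
```

End-of-file arrives when @code{printf} finishes, so no @kbd{Control-d}
is needed.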

@node Long, Executable Scripts, Read Terminal, Running gawk
@subsection Running Long Programs

@cindex running long programs
@cindex @code{-f} option
@cindex program file
@cindex file, @code{awk} program
Sometimes your @code{awk} programs can be very long. In this case it is
more convenient to put the program into a separate file. To tell
@code{awk} to use that file for its program, you type:

@example
awk -f @var{source-file} @var{input-file1} @var{input-file2} @dots{}
@end example

The @samp{-f} instructs the @code{awk} utility to get the @code{awk} program
from the file @var{source-file}. Any file name can be used for
@var{source-file}. For example, you could put the program:

@example
BEGIN @{ print "Don't Panic!" @}
@end example

@noindent
into the file @file{advice}. Then this command:

@example
awk -f advice
@end example

@noindent
does the same thing as this one:

@example
awk "BEGIN @{ print \"Don't Panic!\" @}"
@end example

@cindex quoting, shell
@cindex shell quoting
@noindent
which was explained earlier (@pxref{Read Terminal, ,Running @code{awk} without Input Files}).
Note that you don't usually need single quotes around the file name that you
specify with @samp{-f}, because most file names don't contain any of the shell's
special characters. Notice that in @file{advice}, the @code{awk}
program did not have single quotes around it. The quotes are only needed
for programs that are provided on the @code{awk} command line.
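
The following sketch walks through these steps in a scratch directory.
It assumes a shell with @code{mktemp} available; the directory and the
variable names are ours, and a quoted here-document is used only as a
convenient way to create the file without any shell quoting of the
program text.

```shell
dir=$(mktemp -d)

# Store the program in a file; no quotes are needed around it there.
cat > "$dir/advice" <<'EOF'
BEGIN { print "Don't Panic!" }
EOF

# -f tells awk to read its program from the named file.
out=$(awk -f "$dir/advice")
echo "$out"
rm -rf "$dir"
```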

If you want to identify your @code{awk} program files clearly as such,
you can add the extension @file{.awk} to the file name. This doesn't
affect the execution of the @code{awk} program, but it does make
``housekeeping'' easier.

@node Executable Scripts, Comments, Long, Running gawk
@subsection Executable @code{awk} Programs
@cindex executable scripts
@cindex scripts, executable
@cindex self contained programs
@cindex program, self contained
@cindex @code{#!} (executable scripts)

Once you have learned @code{awk}, you may want to write self-contained
@code{awk} scripts, using the @samp{#!} script mechanism. You can do
this on many Unix systems@footnote{The @samp{#!} mechanism works on
Unix systems derived from Berkeley Unix, System V Release 4, and some System
V Release 3 systems.} (and someday on the GNU system).

For example, you could update the file @file{advice} to look like this:

@example
#! /bin/awk -f

BEGIN @{ print "Don't Panic!" @}
@end example

After making this file executable (with the @code{chmod} utility), you
can simply type @samp{advice}
at the shell, and the system will arrange to run @code{awk}@footnote{The
line beginning with @samp{#!} lists the full file name of an interpreter
to be run, and an optional initial command line argument to pass to that
interpreter. The operating system then runs the interpreter with the given
argument and the full argument list of the executed program. The first argument
in the list is the full file name of the @code{awk} program. The rest of the
argument list will either be options to @code{awk}, or data files,
or both.} as if you had typed @samp{awk -f advice}.

@example
$ advice
@print{} Don't Panic!
@end example

Self-contained @code{awk} scripts are useful when you want to write a
program which users can invoke without their having to know that the program is
written in @code{awk}.
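
The whole procedure might be sketched as follows. Because the location
of @code{awk} differs from system to system, this version asks the shell
for it with @samp{command -v} instead of hard-coding a path; that lookup,
the scratch directory, and the variable names are our assumptions, and
the sketch only works where the @samp{#!} mechanism is supported.

```shell
dir=$(mktemp -d)
awk_path=$(command -v awk)     # full file name of awk on this system

# The first line names the interpreter and its -f option;
# the rest of the file is the awk program itself.
{
    printf '#! %s -f\n' "$awk_path"
    printf '%s\n' 'BEGIN { print "Don'\''t Panic!" }'
} > "$dir/advice"

chmod +x "$dir/advice"         # make the file executable
out=$("$dir/advice")           # run it by name, without typing awk -f
echo "$out"
rm -rf "$dir"
```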

@cindex shell scripts
@cindex scripts, shell
Some older systems do not support the @samp{#!} mechanism. You can get a
similar effect using a regular shell script. It would look something
like this:

@example
: The colon ensures execution by the standard shell.
awk '@var{program}' "$@@"
@end example

Using this technique, it is @emph{vital} to enclose the @var{program} in
single quotes to protect it from interpretation by the shell. If you
omit the quotes, only a shell wizard can predict the results.

The @code{"$@@"} causes the shell to forward all the command line
arguments to the @code{awk} program, without interpretation. The first
line, which starts with a colon, is used so that this shell script will
work even if invoked by a user who uses the C shell. (Not all older systems
obey this convention, but many do.)
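
To see the argument forwarding at work, here is a small sketch of such a
wrapper. The script name @file{show-args} is hypothetical, and the
program uses the built-in variables @code{ARGC} and @code{ARGV}, which
have not been introduced yet, simply to echo back the arguments that
@code{awk} received.

```shell
dir=$(mktemp -d)

# A wrapper in the style described above (the name show-args is made up).
cat > "$dir/show-args" <<'EOF'
: The colon ensures execution by the standard shell.
awk 'BEGIN { for (i = 1; i < ARGC; i++) print i ": " ARGV[i] }' "$@"
EOF

# The quoted "$@" keeps 'two words' together as a single argument.
out=$(sh "$dir/show-args" 'two words' three)
echo "$out"
rm -rf "$dir"
```

The first argument arrives in @code{awk} with its embedded space intact,
which would not be the case with an unquoted @samp{$*}.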
@c Someday: (See @cite{The Bourne Again Shell}, by ??.)

@node Comments, , Executable Scripts, Running gawk
@subsection Comments in @code{awk} Programs
@cindex @code{#} (comment)
@cindex use of comments
@cindex documenting @code{awk} programs
@cindex programs, documenting

A @dfn{comment} is some text that is included in a program for the sake
of human readers; it is not really part of the program. Comments
can explain what the program does, and how it works. Nearly all
programming languages have provisions for comments, because programs are
typically hard to understand without their extra help.

In the @code{awk} language, a comment starts with the sharp sign
character, @samp{#}, and continues to the end of the line.
The @samp{#} does not have to be the first character on the line. The
@code{awk} language ignores the rest of a line following a sharp sign.
For example, we could have put the following into @file{advice}:

@example
# This program prints a nice friendly message. It helps
# keep novice users from being afraid of the computer.
BEGIN    @{ print "Don't Panic!" @}
@end example

You can put comment lines into keyboard-composed throw-away @code{awk}
programs also, but this usually isn't very useful; the purpose of a
comment is to help you or another person understand the program at
a later time.
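
As a brief sketch of both comment positions, the fragment below puts one
comment on a line of its own and another after code on the same line.
The message text and the variable name @code{out} are illustrative.

```shell
# Both comments are ignored by awk; only the print statement runs.
out=$(awk '
# This whole line is a comment.
BEGIN { print "comments are ignored" }   # so is the rest of this line
')
echo "$out"
```

When a program is typed on the command line inside single quotes, an
apostrophe within a comment would end the shell's quoting, which is one
more reason to keep heavily commented programs in files.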

@node Very Simple, Two Rules, Running gawk, Getting Started
@section A Very Simple Example

The following command runs a simple @code{awk} program that searches the
input file @file{BBS-list} for the string of characters: @samp{foo}. (A
string of characters is usually called a @dfn{string}.
The term @dfn{string} is perhaps based on similar usage in English, such
as ``a string of pearls,'' or, ``a string of cars in a train.'')

@example
awk '/foo/ @{ print $0 @}' BBS-list
@end example

@noindent
When lines containing @samp{foo} are found, they are printed, because
@w{@samp{print $0}} means print the current line. (Just @samp{print} by
itself means the same thing, so we could have written that
instead.)

You will notice that slashes, @samp{/}, surround the string @samp{foo}
in the @code{awk} program. The slashes indicate that @samp{foo}
is a pattern to search for. This type of pattern is called a
@dfn{regular expression}, and is covered in more detail later
(@pxref{Regexp, ,Regular Expressions}).
The pattern is allowed to match parts of words.
There are
single quotes around the @code{awk} program so that the shell won't
interpret any of it as special shell characters.

Here is what this program prints:

@example
$ awk '/foo/ @{ print $0 @}' BBS-list
@print{} fooey        555-1234     2400/1200/300     B
@print{} foot         555-6699     1200/300          B
@print{} macfoo       555-6480     1200/300          A
@print{} sabafoo      555-2127     1200/300          C
@end example

@cindex action, default
@cindex pattern, default
@cindex default action
@cindex default pattern
In an @code{awk} rule, either the pattern or the action can be omitted,
but not both. If the pattern is omitted, then the action is performed
for @emph{every} input line. If the action is omitted, the default
action is to print all lines that match the pattern.

@cindex empty action
@cindex action, empty
Thus, we could leave out the action (the @code{print} statement and the curly
braces) in the above example, and the result would be the same: all
lines matching the pattern @samp{foo} would be printed. By comparison,
omitting the @code{print} statement but retaining the curly braces makes an
empty action that does nothing; then no lines would be printed.
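
These three cases can be compared side by side. The sketch below uses
three shortened records (name and phone number only) rather than the
full @file{BBS-list} file; the data and the variable names are
illustrative assumptions.

```shell
data=$(printf 'fooey 555-1234\nbites 555-1675\nmacfoo 555-6480\n')

# Action omitted: the default action prints each matching line.
no_action=$(printf '%s\n' "$data" | awk '/foo/')

# Pattern omitted: the action runs for every input line.
no_pattern=$(printf '%s\n' "$data" | awk '{ print }')

# Empty action: lines match, but nothing is done with them.
empty_action=$(printf '%s\n' "$data" | awk '/foo/ { }')

echo "$no_action"
```

Here @code{no_action} holds the two @samp{foo} lines, @code{no_pattern}
holds all three lines, and @code{empty_action} is empty.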

@node Two Rules, More Complex, Very Simple, Getting Started
@section An Example with Two Rules
@cindex how @code{awk} works

The @code{awk} utility reads the input files one line at a
time. For each line, @code{awk} tries the patterns of each of the rules.
If several patterns match then several actions are run, in the order in
which they appear in the @code{awk} program. If no patterns match, then
no actions are run.

After processing all the rules (perhaps none) that match the line,
@code{awk} reads the next line (however,
@pxref{Next Statement, ,The @code{next} Statement},
and also @pxref{Nextfile Statement, ,The @code{nextfile} Statement}).
This continues until the end of the file is reached.

For example, the @code{awk} program:

@example
/12/  @{ print $0 @}
/21/  @{ print $0 @}
@end example

@noindent
contains two rules. The first rule has the string @samp{12} as the
pattern and @samp{print $0} as the action. The second rule has the
string @samp{21} as the pattern and also has @samp{print $0} as the
action. Each rule's action is enclosed in its own pair of braces.

This @code{awk} program prints every line that contains the string
@samp{12} @emph{or} the string @samp{21}. If a line contains both
strings, it is printed twice, once by each rule.

This is what happens if we run this program on our two sample data files,
@file{BBS-list} and @file{inventory-shipped}, as shown here:

@example
$ awk '/12/ @{ print $0 @}
>      /21/ @{ print $0 @}' BBS-list inventory-shipped
@print{} aardvark     555-5553     1200/300          B
@print{} alpo-net     555-3412     2400/1200/300     A
@print{} barfly       555-7685     1200/300          A
@print{} bites        555-1675     2400/1200/300     A
@print{} core         555-2912     1200/300          C
@print{} fooey        555-1234     2400/1200/300     B
@print{} foot         555-6699     1200/300          B
@print{} macfoo       555-6480     1200/300          A
@print{} sdace        555-3430     2400/1200/300     A
@print{} sabafoo      555-2127     1200/300          C
@print{} sabafoo      555-2127     1200/300          C
@print{} Jan  21  36  64 620
@print{} Apr  21  70  74 514
@end example

@noindent
Note how the line in @file{BBS-list} beginning with @samp{sabafoo}
was printed twice, once for each rule.
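
The double printing can be reproduced without the full data files. The
sketch below pipes three @file{BBS-list} records into the same two-rule
program; choosing just these three records, and the variable name
@code{out}, are our assumptions.

```shell
# aardvark contains "12" only, camelot contains neither string,
# and sabafoo contains both "12" and "21" (in 555-2127).
out=$(printf '%s\n' \
        'aardvark     555-5553     1200/300          B' \
        'camelot      555-0542     300               C' \
        'sabafoo      555-2127     1200/300          C' |
      awk '/12/ { print $0 }
           /21/ { print $0 }')
echo "$out"
```

The @samp{aardvark} line appears once, the @samp{camelot} line not at
all, and the @samp{sabafoo} line twice.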

@node More Complex, Statements/Lines, Two Rules, Getting Started
@section A More Complex Example

@c We have to use ls -lg here to get portable output across Unix systems.
@c The POSIX ls matches this behavior too. Sigh.
Here is an example to give you an idea of what typical @code{awk}
programs do. This example shows how @code{awk} can be used to
summarize, select, and rearrange the output of another utility. It uses
features that haven't been covered yet, so don't worry if you don't
understand all the details.

@example
ls -lg | awk '$6 == "Nov" @{ sum += $5 @}
              END @{ print sum @}'
@end example

@cindex @code{csh}, backslash continuation
@cindex backslash continuation in @code{csh}
This command prints the total number of bytes in all the files in the
current directory that were last modified in November (of any year).
(In the C shell you would need to type a semicolon and then a backslash
at the end of the first line; in a POSIX-compliant shell, such as the
Bourne shell or Bash, the GNU Bourne-Again shell, you can type the example
as shown.)

@c FIXME: how can users tell what shell they are running? Need a footnote
@c or something, but getting into this is a distraction.

The @w{@samp{ls -lg}} part of this example is a system command that gives
you a listing of the files in a directory, including file size and the date
the file was last modified. Its output looks like this:

@example
-rw-r--r--  1 arnold   user   1933 Nov  7 13:05 Makefile
-rw-r--r--  1 arnold   user  10809 Nov  7 13:03 gawk.h
-rw-r--r--  1 arnold   user    983 Apr 13 12:14 gawk.tab.h
-rw-r--r--  1 arnold   user  31869 Jun 15 12:20 gawk.y
-rw-r--r--  1 arnold   user  22414 Nov  7 13:03 gawk1.c
-rw-r--r--  1 arnold   user  37455 Nov  7 13:03 gawk2.c
-rw-r--r--  1 arnold   user  27511 Dec  9 13:07 gawk3.c
-rw-r--r--  1 arnold   user   7989 Nov  7 13:03 gawk4.c
@end example

@noindent
The first field contains read-write permissions, the second field contains
the number of links to the file, and the third field identifies the owner of
the file. The fourth field identifies the group of the file.
The fifth field contains the size of the file in bytes. The
sixth, seventh and eighth fields contain the month, day, and time,
respectively, that the file was last modified. Finally, the ninth field
contains the name of the file.
1590
@cindex automatic initialization
1591
@cindex initialization, automatic
1592
The @samp{$6 == "Nov"} in our @code{awk} program is an expression that
1593
tests whether the sixth field of the output from @w{@samp{ls -lg}}
1594
matches the string @samp{Nov}. Each time a line has the string
1595
@samp{Nov} for its sixth field, the action @samp{sum += $5} is
1596
performed. This adds the fifth field (the file size) to the variable
1597
@code{sum}. As a result, when @code{awk} has finished reading all the
1598
input lines, @code{sum} is the sum of the sizes of files whose
1599
lines matched the pattern. (This works because @code{awk} variables
1600
are automatically initialized to zero.)
1602
After the last line of output from @code{ls} has been processed, the
1603
@code{END} rule is executed, and the value of @code{sum} is
1604
printed. In this example, the value of @code{sum} would be 80600.
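To check the arithmetic without depending on the contents of a real
directory, the sample listing from this section can be piped into the same
program. This is only a sketch: the shell variable names @code{listing} and
@code{total} are chosen for illustration, and the input is the sample
listing, not live @code{ls} output.

```shell
# The sample listing from this section, held in a shell variable.
listing='-rw-r--r--  1 arnold   user   1933 Nov  7 13:05 Makefile
-rw-r--r--  1 arnold   user  10809 Nov  7 13:03 gawk.h
-rw-r--r--  1 arnold   user    983 Apr 13 12:14 gawk.tab.h
-rw-r--r--  1 arnold   user  31869 Jun 15 12:20 gawk.y
-rw-r--r--  1 arnold   user  22414 Nov  7 13:03 gawk1.c
-rw-r--r--  1 arnold   user  37455 Nov  7 13:03 gawk2.c
-rw-r--r--  1 arnold   user  27511 Dec  9 13:07 gawk3.c
-rw-r--r--  1 arnold   user   7989 Nov  7 13:03 gawk4.c'

# Same program as above: add up field 5 for records whose field 6 is "Nov".
total=$(printf '%s\n' "$listing" | awk '$6 == "Nov" { sum += $5 }
                                        END { print sum }')
echo "$total"
```

The five November files contribute 1933 + 10809 + 22414 + 37455 + 7989
bytes, which is the total quoted above.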
1606
These more advanced @code{awk} techniques are covered in later sections
1607
(@pxref{Action Overview, ,Overview of Actions}). Before you can move on to more
1608
advanced @code{awk} programming, you have to know how @code{awk} interprets
1609
your input and displays your output. By manipulating fields and using
1610
@code{print} statements, you can produce some very useful and
impressive-looking reports.
1613
@node Statements/Lines, Other Features, More Complex, Getting Started
1614
@section @code{awk} Statements Versus Lines
1618
Most often, each line in an @code{awk} program is a separate statement or
1619
separate rule, like this:
@example
awk '/12/ @{ print $0 @}
     /21/ @{ print $0 @}' BBS-list inventory-shipped
@end example
1626
However, @code{gawk} will ignore newlines after any of the following:
@example
,    @{    ?    :    ||    &&    do    else
@end example
1633
A newline at any other point is considered the end of the statement.
1634
(Splitting lines after @samp{?} and @samp{:} is a minor @code{gawk}
extension. The @samp{?} and @samp{:} referred to here are the operators of
the three-operand conditional expression described in
@ref{Conditional Exp, ,Conditional Expressions}.)
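As a small sketch of this rule, the following program breaks a pattern
after @samp{&&} and a @code{print} statement after a comma. The input data
and the shell variable @code{out} are made up for illustration; only the
placement of the newlines matters.

```shell
# Neither the newline after && nor the newline after the comma
# ends the statement.
out=$(printf '5 10\n20 3\n7 8\n' | awk '$1 > 1 &&
     $2 > 5 { print $2,
                    $1 }')
echo "$out"
```

Only the first and third records satisfy both conditions, so two lines are
printed, each with its two fields swapped.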
1639
@cindex backslash continuation
1640
@cindex continuation of lines
1641
@cindex line continuation
1642
If you would like to split a single statement into two lines at a point
1643
where a newline would terminate it, you can @dfn{continue} it by ending the
1644
first line with a backslash character, @samp{\}. The backslash must be
1645
the final character on the line to be recognized as a continuation
1646
character. This is allowed absolutely anywhere in the statement, even
1647
in the middle of a string or regular expression. For example:
@example
awk '/This regular expression is too long, so continue it\
 on the next line/ @{ print $1 @}'
@end example
1655
@cindex portability issues
1656
We have generally not used backslash continuation in the sample programs
1657
in this @value{DOCUMENT}. Since in @code{gawk} there is no limit on the
1658
length of a line, it is never strictly necessary; it just makes programs
1659
more readable. For this same reason, as well as for clarity, we have
1660
kept most statements short in the sample programs presented throughout
1661
the @value{DOCUMENT}. Backslash continuation is most useful when your
1662
@code{awk} program is in a separate source file, instead of typed in on
1663
the command line. You should also note that many @code{awk}
1664
implementations are more particular about where you may use backslash
1665
continuation. For example, they may not allow you to split a string
1666
constant using backslash continuation. Thus, for maximal portability of
1667
your @code{awk} programs, it is best not to split your lines in the
1668
middle of a regular expression or a string.
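The portable form of continuation, splitting between tokens and not inside
a string or regexp, might look like the following sketch. It assumes a
POSIX-compliant shell, where a backslash inside single quotes reaches
@code{awk} unchanged; the variable name @code{out} is illustrative.

```shell
# The backslash is the last character on its line, so awk joins the
# two lines into the single statement: total = 1 + 2 + 3
out=$(awk 'BEGIN { total = 1 + 2 \
                         + 3
                   print total }')
echo "$out"
```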
1670
@cindex @code{csh}, backslash continuation
1671
@cindex backslash continuation in @code{csh}
1672
@strong{Caution: backslash continuation does not work as described above
1673
with the C shell.} Continuation with backslash works for @code{awk}
1674
programs in files, and also for one-shot programs @emph{provided} you
1675
are using a POSIX-compliant shell, such as the Bourne shell or Bash, the
1676
GNU Bourne-Again shell. But the C shell (@code{csh}) behaves
1677
differently! There, you must use two backslashes in a row, followed by
1678
a newline. Note also that when using the C shell, @emph{every} newline
1679
in your @code{awk} program must be escaped with a backslash. To illustrate:
1686
@print{} hello, world
1690
Here, the @samp{%} and @samp{?} are the C shell's primary and secondary
1691
prompts, analogous to the standard shell's @samp{$} and @samp{>}.
1693
@code{awk} is a line-oriented language. Each rule's action has to
1694
begin on the same line as the pattern. To have the pattern and action
1695
on separate lines, you @emph{must} use backslash continuation; there
is no other way.
1698
@cindex multiple statements on one line
1699
When @code{awk} statements within one rule are short, you might want to put
1700
more than one of them on a line. You do this by separating the statements
1701
with a semicolon, @samp{;}.
1703
This also applies to the rules themselves.
1704
Thus, the previous program could have been written:
@example
/12/ @{ print $0 @} ; /21/ @{ print $0 @}
@end example
1711
@strong{Note:} the requirement that rules on the same line must be
1712
separated with a semicolon was not in the original @code{awk}
1713
language; it was added for consistency with the treatment of statements
within an action.
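Here is a brief sketch of the one-line form, run on a few made-up records
(the data and the variable name @code{out} are assumptions for
illustration). Because the two rules are independent, a record that matches
both patterns is printed twice.

```shell
# Two rules on one line, separated by a semicolon.
out=$(printf 'a12\nb21\nc33\n121\n' |
      awk '/12/ { print $0 } ; /21/ { print $0 }')
echo "$out"
```

The record @samp{121} contains both @samp{12} and @samp{21}, so it appears
twice in the output; @samp{c33} matches neither rule and is not printed.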
1716
@node Other Features, When, Statements/Lines, Getting Started
1717
@section Other Features of @code{awk}
1719
The @code{awk} language provides a number of predefined, or built-in,
variables that your programs can use to get information from @code{awk}.
There are other variables your program can set to control how @code{awk}
processes your data.
1724
In addition, @code{awk} provides a number of built-in functions for doing
1725
common computational and string-related operations.
1727
As we develop our presentation of the @code{awk} language, we introduce
1728
most of the variables and many of the functions. They are defined
1729
systematically in @ref{Built-in Variables}, and
1730
@ref{Built-in, ,Built-in Functions}.
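As a first taste, this sketch uses two built-in variables, @code{NR} (the
number of the current record) and @code{NF} (the number of fields in it),
together with the built-in function @code{length}. The sample input and the
variable name @code{out} are invented for illustration.

```shell
# For each record, print its number, its field count, and its length.
out=$(printf 'one two three\nfour five\n' |
      awk '{ print NR, NF, length($0) }')
echo "$out"
```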
1732
@node When, , Other Features, Getting Started
1733
@section When to Use @code{awk}
1735
@cindex when to use @code{awk}
1736
@cindex applications of @code{awk}
1737
You might wonder how @code{awk} might be useful for you. Using
1738
utility programs, advanced patterns, field separators, arithmetic
1739
statements, and other selection criteria, you can produce much more
1740
complex output. The @code{awk} language is very useful for producing
1741
reports from large amounts of raw data, such as summarizing information
1742
from the output of other utility programs like @code{ls}.
1743
(@xref{More Complex, ,A More Complex Example}.)
1745
Programs written with @code{awk} are usually much smaller than they would
1746
be in other languages. This makes @code{awk} programs easy to compose and
1747
use. Often, @code{awk} programs can be quickly composed at your terminal,
1748
used once, and thrown away. Since @code{awk} programs are interpreted, you
1749
can avoid the (usually lengthy) compilation part of the typical
1750
edit-compile-test-debug cycle of software development.
1752
Complex programs have been written in @code{awk}, including a complete
1753
retargetable assembler for eight-bit microprocessors (@pxref{Glossary}, for
1754
more information) and a microcode assembler for a special purpose Prolog
1755
computer. However, @code{awk}'s capabilities are strained by tasks of
such complexity.
1758
If you find yourself writing @code{awk} scripts of more than, say, a few
1759
hundred lines, you might consider using a different programming
1760
language. Emacs Lisp is a good choice if you need sophisticated string
1761
or pattern matching capabilities. The shell is also good at string and
1762
pattern matching; in addition, it allows powerful use of the system
1763
utilities. More conventional languages, such as C, C++, and Lisp, offer
1764
better facilities for system programming and for managing the complexity
1765
of large programs. Programs in these languages may require more lines
1766
of source code than the equivalent @code{awk} programs, but they are
1767
easier to maintain and usually run more efficiently.
1769
@node One-liners, Regexp, Getting Started, Top
1770
@chapter Useful One Line Programs
1773
Many useful @code{awk} programs are short, just a line or two. Here is a
1774
collection of useful, short programs to get you started. Some of these
1775
programs contain constructs that haven't been covered yet. The description
1776
of the program will give you a good idea of what is going on, but please
1777
read the rest of the @value{DOCUMENT} to become an @code{awk} expert!
1779
Most of the examples use a data file named @file{data}. This is just a
1780
placeholder; if you were to use these programs yourself, you would substitute
1781
your own file names for @file{data}.
1784
Since you are reading this in Info, each line of the example code is
1785
enclosed in quotes, to represent text that you would type literally.
1786
The examples themselves represent shell commands that use single quotes
1787
to keep the shell from interpreting the contents of the program.
1788
When reading the examples, focus on the text between the open and close
quotes.
1793
@item awk '@{ if (length($0) > max) max = length($0) @}
1794
@itemx @ @ @ @ @ END @{ print max @}' data
1795
This program prints the length of the longest input line.
1797
@item awk 'length($0) > 80' data
1798
This program prints every line that is longer than 80 characters. The sole
1799
rule has a relational expression as its pattern, and has no action (so the
1800
default action, printing the record, is used).
1802
@item expand@ data@ |@ awk@ '@{ if (x < length()) x = length() @}
1803
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ END @{ print "maximum line length is " x @}'
1804
This program prints the length of the longest line in @file{data}. The input
1805
is processed by the @code{expand} program to change tabs into spaces,
1806
so the widths compared are actually the right-margin columns.
1808
@item awk 'NF > 0' data
1809
This program prints every line that has at least one field. This is an
1810
easy way to delete blank lines from a file (or rather, to create a new
1811
file similar to the old file but from which the blank lines have been
deleted).
1814
@c Karl Berry points out that new users probably don't want to see
1815
@c multiple ways to do things, just the `best' way. He's probably
1816
@c right. At some point it might be worth adding something about there
1817
@c often being multiple ways to do things in awk, but for now we'll
1818
@c just take this one out.
1820
@item awk '@{ if (NF > 0) print @}' data
1821
This program also prints every line that has at least one field. Here we
1822
allow the rule to match every line, and then decide in the action whether
to print.
1826
@item awk@ 'BEGIN@ @{@ for (i = 1; i <= 7; i++)
1827
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ print int(101 * rand()) @}'
1828
This program prints seven random numbers from zero to 100, inclusive.
1830
@item ls -lg @var{files} | awk '@{ x += $5 @} ; END @{ print "total bytes: " x @}'
1831
This program prints the total number of bytes used by @var{files}.
1833
@item ls -lg @var{files} | awk '@{ x += $5 @}
1834
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ END @{ print "total K-bytes: " (x + 1023)/1024 @}'
1835
This program prints the total number of kilobytes used by @var{files}.
1837
@item awk -F: '@{ print $1 @}' /etc/passwd | sort
1838
This program prints a sorted list of the login names of all users.
1840
@item awk 'END @{ print NR @}' data
1841
This program counts lines in a file.
1843
@item awk 'NR % 2 == 0' data
This program prints the even-numbered lines in the data file.
If you were to use the expression @samp{NR % 2 == 1} instead,
it would print the odd-numbered lines.
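Several of these one-liners can be tried on a small input to see exactly
which lines they select. The sketch below is illustrative only: it builds
its own five-line input with @code{printf} instead of using a file named
@file{data}, and the shell variable names are made up.

```shell
# Five lines of input; the third line is blank.
input=$(printf 'line 1\nline 2\n\nline 4\nline 5 is the longest\n')

# NR % 2 is nonzero (true) for records 1, 3 and 5: the odd-numbered lines.
odd=$(printf '%s\n' "$input" | awk 'NR % 2 { print NR }' | tr '\n' ' ')

# NR % 2 == 0 is true for records 2 and 4: the even-numbered lines.
even=$(printf '%s\n' "$input" | awk 'NR % 2 == 0 { print NR }' | tr '\n' ' ')

# NF > 0 skips the blank third line, leaving four lines.
nonblank=$(printf '%s\n' "$input" | awk 'NF > 0' | wc -l | tr -d ' ')

# Length of the longest input line.
longest=$(printf '%s\n' "$input" |
          awk '{ if (length($0) > max) max = length($0) }
               END { print max }')

echo "odd: $odd even: $even nonblank: $nonblank longest: $longest"
```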
1849
@node Regexp, Reading Files, One-liners, Top
1850
@chapter Regular Expressions
1851
@cindex pattern, regular expressions
1853
@cindex regular expression
1854
@cindex regular expressions as patterns
1856
A @dfn{regular expression}, or @dfn{regexp}, is a way of describing a
set of strings.
1858
Because regular expressions are such a fundamental part of @code{awk}
1859
programming, their format and use deserve a separate chapter.
1861
A regular expression enclosed in slashes (@samp{/})
1862
is an @code{awk} pattern that matches every input record whose text
1863
belongs to that set.
1865
The simplest regular expression is a sequence of letters, numbers, or
1866
both. Such a regexp matches any string that contains that sequence.
1867
Thus, the regexp @samp{foo} matches any string containing @samp{foo}.
1868
Therefore, the pattern @code{/foo/} matches any input record containing
1869
the three characters @samp{foo}, @emph{anywhere} in the record. Other
1870
kinds of regexps let you specify more complicated classes of strings.
1873
Initially, the examples will be simple. As we explain more about how
1874
regular expressions work, we will present more complicated examples.
1878
* Regexp Usage:: How to Use Regular Expressions.
1879
* Escape Sequences:: How to write non-printing characters.
1880
* Regexp Operators:: Regular Expression Operators.
1881
* GNU Regexp Operators:: Operators specific to GNU software.
1882
* Case-sensitivity:: How to do case-insensitive matching.
1883
* Leftmost Longest:: How much text matches.
1884
* Computed Regexps:: Using Dynamic Regexps.
1887
@node Regexp Usage, Escape Sequences, Regexp, Regexp
1888
@section How to Use Regular Expressions
1890
A regular expression can be used as a pattern by enclosing it in
1891
slashes. Then the regular expression is tested against the
1892
entire text of each record. (Normally, it only needs
1893
to match some part of the text in order to succeed.) For example, this
1894
prints the second field of each record that contains the three
1895
characters @samp{foo} anywhere in it:
1899
$ awk '/foo/ @{ print $2 @}' BBS-list
1907
@cindex regexp matching operators
1908
@cindex string-matching operators
1909
@cindex operators, string-matching
1910
@cindex operators, regexp matching
1911
@cindex regexp match/non-match operators
1912
@cindex @code{~} operator
1913
@cindex @code{!~} operator
1914
Regular expressions can also be used in matching expressions. These
1915
expressions allow you to specify the string to match against; it need
1916
not be the entire current input record. The two operators, @samp{~}
1917
and @samp{!~}, perform regular expression comparisons. Expressions
1918
using these operators can be used as patterns or in @code{if},
1919
@code{while}, @code{for}, and @code{do} statements.
1921
@c adding this xref in TeX screws up the formatting too much
1922
(@xref{Statements, ,Control Statements in Actions}.)
1926
@item @var{exp} ~ /@var{regexp}/
1927
This is true if the expression @var{exp} (taken as a string)
1928
is matched by @var{regexp}. The following example matches, or selects,
1929
all input records with the upper-case letter @samp{J} somewhere in the
first field:
@example
$ awk '$1 ~ /J/' inventory-shipped
@print{} Jan  13  25  15 115
@print{} Jun  31  42  75 492
@print{} Jul  24  34  67 436
@print{} Jan  21  36  64 620
@end example
So does this:

@example
awk '@{ if ($1 ~ /J/) print @}' inventory-shipped
@end example
1948
@item @var{exp} !~ /@var{regexp}/
1949
This is true if the expression @var{exp} (taken as a character string)
1950
is @emph{not} matched by @var{regexp}. The following example matches,
1951
or selects, all input records whose first field @emph{does not} contain
1952
the upper-case letter @samp{J}:
@example
$ awk '$1 !~ /J/' inventory-shipped
@print{} Feb  15  32  24 226
@print{} Mar  15  24  34 228
@print{} Apr  31  52  63 420
@print{} May  16  34  29 208
@dots{}
@end example
1966
@cindex regexp constant
1967
When a regexp is written enclosed in slashes, like @code{/foo/}, we call it
1968
a @dfn{regexp constant}, much like @code{5.27} is a numeric constant, and
1969
@code{"foo"} is a string constant.
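The two matching operators are complementary: every record is selected by
exactly one of them. The following sketch checks this on four records taken
from the examples above; the shell variable names are assumptions made for
illustration.

```shell
# Four records in the style of the inventory-shipped file.
data='Jan 13 25 15 115
Feb 15 32 24 226
Jun 31 42 75 492
May 16 34 29 208'

# First fields that contain an upper-case J ...
with_j=$(printf '%s\n' "$data" | awk '$1 ~ /J/ { print $1 }' | tr '\n' ' ')

# ... and first fields that do not.
without_j=$(printf '%s\n' "$data" | awk '$1 !~ /J/ { print $1 }' | tr '\n' ' ')

echo "match: $with_j no match: $without_j"
```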
1971
@node Escape Sequences, Regexp Operators, Regexp Usage, Regexp
1972
@section Escape Sequences
1974
@cindex escape sequence notation
1975
Some characters cannot be included literally in string constants
1976
(@code{"foo"}) or regexp constants (@code{/foo/}). You represent them
1977
instead with @dfn{escape sequences}, which are character sequences
1978
beginning with a backslash (@samp{\}).
1980
One use of an escape sequence is to include a double-quote character in
1981
a string constant. Since a plain double-quote would end the string, you
1982
must use @samp{\"} to represent an actual double-quote character as a
1983
part of the string. For example:
@example
$ awk 'BEGIN @{ print "He said \"hi!\" to her." @}'
@print{} He said "hi!" to her.
@end example
1990
The backslash character itself is another character that cannot be
1991
included normally; you write @samp{\\} to put one backslash in the
1992
string or regexp. Thus, the string whose contents are the two characters
1993
@samp{"} and @samp{\} must be written @code{"\"\\"}.
1995
Another use of backslash is to represent unprintable characters
1996
such as tab or newline. While there is nothing to stop you from entering most
1997
unprintable characters directly in a string constant or regexp constant,
they may look ugly.
2000
Here is a table of all the escape sequences used in @code{awk}, and
2001
what they represent. Unless noted otherwise, all of these escape
2002
sequences apply to both string constants and regexp constants.
@table @code
@item \\
A literal backslash, @samp{\}.

@cindex @code{awk} language, V.4 version
@item \a
The ``alert'' character, @kbd{Control-g}, ASCII code 7 (BEL).

@item \b
Backspace, @kbd{Control-h}, ASCII code 8 (BS).

@item \f
Formfeed, @kbd{Control-l}, ASCII code 12 (FF).

@item \n
Newline, @kbd{Control-j}, ASCII code 10 (LF).

@item \r
Carriage return, @kbd{Control-m}, ASCII code 13 (CR).

@item \t
Horizontal tab, @kbd{Control-i}, ASCII code 9 (HT).

@cindex @code{awk} language, V.4 version
@item \v
Vertical tab, @kbd{Control-k}, ASCII code 11 (VT).

@item \@var{nnn}
The octal value @var{nnn}, where @var{nnn} are one to three digits
between @samp{0} and @samp{7}. For example, the code for the ASCII ESC
(escape) character is @samp{\033}.

@cindex @code{awk} language, V.4 version
@cindex @code{awk} language, POSIX version
@cindex POSIX @code{awk}
@item \x@var{hh}@dots{}
The hexadecimal value @var{hh}, where @var{hh} are hexadecimal
digits (@samp{0} through @samp{9} and either @samp{A} through @samp{F} or
@samp{a} through @samp{f}). Like the same construct in ANSI C, the escape
sequence continues until the first non-hexadecimal digit is seen. However,
using more than two hexadecimal digits produces undefined results. (The
@samp{\x} escape sequence is not allowed in POSIX @code{awk}.)

@item \/
A literal slash (necessary for regexp constants only).
You use this when you wish to write a regexp
constant that contains a slash. Since the regexp is delimited by
slashes, you need to escape the slash that is part of the pattern,
in order to tell @code{awk} to keep processing the rest of the regexp.

@item \"
A literal double-quote (necessary for string constants only).
You use this when you wish to write a string
constant that contains a double-quote. Since the string is delimited by
double-quotes, you need to escape the quote that is part of the string,
in order to tell @code{awk} to keep processing the rest of the string.
@end table
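The following sketch exercises a few of these escape sequences in a string
constant. It sticks to sequences that POSIX @code{awk} defines, so it does
not use @samp{\x}; the variable name @code{out} is illustrative.

```shell
# \t is a tab, \101 is the octal code for the letter A,
# \" is a double-quote, and \\ is a single backslash.
out=$(awk 'BEGIN { print "a\tb\101\"\\" }')
printf '%s\n' "$out"
```

The program prints @samp{a}, a tab, @samp{bA}, a double-quote, and one
backslash.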
2067
In @code{gawk}, there are additional two-character sequences that begin
2068
with backslash that have special meaning in regexps.
2069
@xref{GNU Regexp Operators, ,Additional Regexp Operators Only in @code{gawk}}.
2071
In a string constant,
2072
what happens if you place a backslash before something that is not one of
2073
the characters listed above? POSIX @code{awk} purposely leaves this case
2074
undefined. There are two choices.
@itemize @bullet
@item
Strip the backslash out. This is what Unix @code{awk} and @code{gawk} both do.
For example, @code{"a\qc"} is the same as @code{"aqc"}.

@item
Leave the backslash alone. Some other @code{awk} implementations do this.
In such implementations, @code{"a\qc"} is the same as if you had typed
@code{"a\\qc"}.
@end itemize
2087
In a regexp, a backslash before any character that is not in the above table,
and not listed in
@ref{GNU Regexp Operators, ,Additional Regexp Operators Only in @code{gawk}},
2090
means that the next character should be taken literally, even if it would
2091
normally be a regexp operator. E.g., @code{/a\+b/} matches the three
2092
characters @samp{a+b}.
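For instance, the escaped plus sign in the following sketch matches only a
literal @samp{+}. The input lines and the variable name @code{out} are made
up for illustration.

```shell
# /a\+b/ matches the three characters a+b; the + is not treated
# as a repetition operator here.
out=$(printf 'a+b\naab\nab\n' | awk '/a\+b/')
echo "$out"
```

Without the backslash, @code{/a+b/} would instead match @samp{aab} and
@samp{ab}, and would not match @samp{a+b}.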
2094
@cindex portability issues
2095
For complete portability, do not use a backslash before any character not
2096
listed in the table above.
2098
Another interesting question arises. Suppose you use an octal or hexadecimal
2099
escape to represent a regexp metacharacter
2100
(@pxref{Regexp Operators, , Regular Expression Operators}).
2101
Does @code{awk} treat the character as a literal character, or as a regexp
operator?
2105
Historically, such characters were taken literally (d.c.).
However, the POSIX standard indicates that they should be treated
as real metacharacters, and this is what @code{gawk} does.
In compatibility mode (@pxref{Options, ,Command Line Options}), though,
@code{gawk} treats the characters represented by octal and hexadecimal
escape sequences literally when used in regexp constants. Thus, in
compatibility mode, @code{/a\52b/} is equivalent to @code{/a\*b/}.
2117
The escape sequences in the table above are always processed first,
2118
for both string constants and regexp constants. This happens very early,
2119
as soon as @code{awk} reads your program.
2122
@code{gawk} processes both regexp constants and dynamic regexps
2123
(@pxref{Computed Regexps, ,Using Dynamic Regexps}),
2124
for the special operators listed in
2125
@ref{GNU Regexp Operators, ,Additional Regexp Operators Only in @code{gawk}}.
2128
A backslash before any other character means to treat that character
literally.
2132
@node Regexp Operators, GNU Regexp Operators, Escape Sequences, Regexp
2133
@section Regular Expression Operators
2134
@cindex metacharacters
2135
@cindex regular expression metacharacters
2136
@cindex regexp operators
2138
You can combine regular expressions with the following characters,
2139
called @dfn{regular expression operators}, or @dfn{metacharacters}, to
2140
increase the power and versatility of regular expressions.
2142
The escape sequences described
2146
in @ref{Escape Sequences},
2147
are valid inside a regexp. They are introduced by a @samp{\}. They
2148
are recognized and converted into the corresponding real characters as
2149
the very first step in processing regexps.
2151
Here is a table of metacharacters. All characters that are not escape
2152
sequences and that are not listed in the table stand for themselves.
2159
@item \
This is used to suppress the special meaning of a character when
matching. For example, @samp{\$}
matches the character @samp{$}.
2169
@cindex anchors in regexps
2170
@cindex regexp, anchors
@item ^
This matches the beginning of a string. For example, @samp{^@@chapter}
matches the @samp{@@chapter} at the beginning of a string, and can be used
2180
to identify chapter beginnings in Texinfo source files.
2181
The @samp{^} is known as an @dfn{anchor}, since it anchors the pattern to
2182
matching only at the beginning of the string.
2184
It is important to realize that @samp{^} does not match the beginning of
2185
a line embedded in a string. In this example the condition is not true:
@example
if ("line1\nLINE 2" ~ /^L/) @dots{}
@end example
@item $
This is similar to @samp{^}, but it matches only at the end of a string.
For example, @samp{p$}
matches a record that ends with a @samp{p}. The @samp{$} is also an anchor,
2201
and also does not match the end of a line embedded in a string. In this
2202
example the condition is not true:
@example
if ("line1\nLINE 2" ~ /1$/) @dots{}
@end example
@item .
The period, or dot, matches any single character,
@emph{including} the newline character. For example, @samp{.P}
matches any single character followed by a @samp{P} in a string. Using
concatenation we can make a regular expression like @samp{U.A}, which
matches any three-character sequence that begins with @samp{U} and ends
with @samp{A}.
2222
@cindex @code{awk} language, POSIX version
2223
@cindex POSIX @code{awk}
2224
In strict POSIX mode (@pxref{Options, ,Command Line Options}),
2225
@samp{.} does not match the @sc{nul}
2226
character, which is a character with all bits equal to zero.
2227
Otherwise, @sc{nul} is just another character. Other versions of @code{awk}
2228
may not be able to match the @sc{nul} character.
2235
@cindex character list
@item [@dots{}]
This is called a @dfn{character list}. It matches any @emph{one} of the
characters that are enclosed in the square brackets. For example,
@samp{[MVX]}
matches any one of the characters @samp{M}, @samp{V}, or @samp{X} in a
string.
2248
Ranges of characters are indicated by using a hyphen between the beginning
2249
and ending characters, and enclosing the whole thing in brackets. For
example, @samp{[0-9]} matches any digit.
2258
Multiple ranges are allowed. E.g., the list @code{@w{[A-Za-z0-9]}} is a
2259
common way to express the idea of ``all alphanumeric characters.''
2261
To include one of the characters @samp{\}, @samp{]}, @samp{-} or @samp{^} in a
2262
character list, put a @samp{\} in front of it. For example, @samp{[d\]]}
matches either @samp{d}, or @samp{]}.
2271
@cindex @code{egrep}
2272
This treatment of @samp{\} in character lists
2273
is compatible with other @code{awk}
2274
implementations, and is also mandated by POSIX.
2275
The regular expressions in @code{awk} are a superset
2276
of the POSIX specification for Extended Regular Expressions (EREs).
2277
POSIX EREs are based on the regular expressions accepted by the
2278
traditional @code{egrep} utility.
2280
@cindex character classes
2281
@cindex @code{awk} language, POSIX version
2282
@cindex POSIX @code{awk}
2283
@dfn{Character classes} are a new feature introduced in the POSIX standard.
2284
A character class is a special notation for describing
2285
lists of characters that have a specific attribute, but where the
2286
actual characters themselves can vary from country to country and/or
2287
from character set to character set. For example, the notion of what
2288
is an alphabetic character differs in the USA and in France.
2290
A character class is only valid in a regexp @emph{inside} the
2291
brackets of a character list. Character classes consist of @samp{[:},
2292
a keyword denoting the class, and @samp{:]}. Here are the character
2293
classes defined by the POSIX standard.
@table @code
@item [:alnum:]
Alphanumeric characters.

@item [:alpha:]
Alphabetic characters.

@item [:blank:]
Space and tab characters.

@item [:cntrl:]
Control characters.

@item [:digit:]
Numeric characters.

@item [:graph:]
Characters that are printable and are also visible.
(A space is printable, but not visible, while an @samp{a} is both.)

@item [:lower:]
Lower-case alphabetic characters.

@item [:print:]
Printable characters (characters that are not control characters).

@item [:punct:]
Punctuation characters (characters that are not letters, digits,
control characters, or space characters).

@item [:space:]
Space characters (such as space, tab, and formfeed, to name a few).

@item [:upper:]
Upper-case alphabetic characters.

@item [:xdigit:]
Characters that are hexadecimal digits.
@end table
2335
For example, before the POSIX standard, to match alphanumeric
2336
characters, you had to write @code{/[A-Za-z0-9]/}. If your
2337
character set had other alphabetic characters in it, this would not
2338
match them. With the POSIX character classes, you can write
2339
@code{/[[:alnum:]]/}, and this will match @emph{all} the alphabetic
2340
and numeric characters in your character set.
2342
@cindex collating elements
2343
Two additional special sequences can appear in character lists.
2344
These apply to non-ASCII character sets, which can have single symbols
2345
(called @dfn{collating elements}) that are represented with more than one
2346
character, as well as several characters that are equivalent for
2347
@dfn{collating}, or sorting, purposes. (E.g., in French, a plain ``e''
2348
and a grave-accented ``@`e'' are equivalent.)
2358
@cindex collating symbols
2359
@item Collating Symbols
2360
A @dfn{collating symbol} is a multi-character collating element enclosed in
2361
@samp{[.} and @samp{.]}. For example, if @samp{ch} is a collating element,
2362
then @code{[[.ch.]]} is a regexp that matches this collating element, while
2363
@code{[ch]} is a regexp that matches either @samp{c} or @samp{h}.
2365
@cindex equivalence classes
2366
@item Equivalence Classes
2367
An @dfn{equivalence class} is a list of equivalent characters enclosed in
2368
@samp{[=} and @samp{=]}.
2370
Thus, @code{[[=e@`e=]]} is a regexp that matches either @samp{e} or @samp{@`e}.
2373
Because Info files use plain ASCII characters, it is not possible to present
2374
a realistic equivalence class example here.
2378
These features are very valuable in non-English speaking locales.
2380
@strong{Caution:} The library functions that @code{gawk} uses for regular
2381
expression matching currently only recognize POSIX character classes;
2382
they do not recognize collating symbols or equivalence classes.
2383
@c maybe one day ...
2385
@cindex complemented character list
2386
@cindex character list, complemented
@item [^ @dots{}]
This is a @dfn{complemented character list}. The first character after
the @samp{[} @emph{must} be a @samp{^}. It matches any character
@emph{except} those in the square brackets. For example, @samp{[^0-9]}
matches any character that is not a digit.
@item |
This is the @dfn{alternation operator}, and it is used to specify
alternatives. For example, @samp{^P|[0-9]}
matches any string that matches either @samp{^P} or @samp{[0-9]}. This
2409
means it matches any string that starts with @samp{P} or contains a digit.
2411
The alternation applies to the largest possible regexps on either side.
2412
In other words, @samp{|} has the lowest precedence of all the regular
2413
expression operators.
@item (@dots{})
Parentheses are used for grouping in regular expressions as in
2417
arithmetic. They can be used to concatenate regular expressions
2418
containing the alternation operator, @samp{|}. For example,
2419
@samp{@@(samp|code)\@{[^@}]+\@}} matches both @samp{@@code@{foo@}} and
2420
@samp{@@samp@{bar@}}. (These are Texinfo formatting control sequences.)
@item *
This symbol means that the preceding regular expression is to be
repeated as many times as necessary to find a match. For example,
@samp{ph*}
applies the @samp{*} symbol to the preceding @samp{h} and looks for matches
2432
of one @samp{p} followed by any number of @samp{h}s. This will also match
2433
just @samp{p} if no @samp{h}s are present.
2435
The @samp{*} repeats the @emph{smallest} possible preceding expression.
2436
(Use parentheses if you wish to repeat a larger expression.) It finds
2437
as many repetitions as possible. For example:
@example
awk '/\(c[ad][ad]*r x\)/ @{ print @}' sample
@end example
2444
prints every record in @file{sample} containing a string of the form
2445
@samp{(car x)}, @samp{(cdr x)}, @samp{(cadr x)}, and so on.
2446
Notice the escaping of the parentheses by preceding them
with backslashes.
@item +
This symbol is similar to @samp{*}, but the preceding expression must be
matched at least once. This means that @samp{wh+y}
would match @samp{why} and @samp{whhy} but not @samp{wy}, whereas
2459
@samp{wh*y} would match all three of these strings. This is a simpler
2460
way of writing the last @samp{*} example:
@example
awk '/\(c[ad]+r x\)/ @{ print @}' sample
@end example
@item ?
This symbol is similar to @samp{*}, but the preceding expression can be
matched either once or not at all. For example, @samp{fe?d}
will match @samp{fed} and @samp{fd}, but nothing else.
2477
@cindex @code{awk} language, POSIX version
2478
@cindex POSIX @code{awk}
2479
@cindex interval expressions
@item @{@var{n}@}
@itemx @{@var{n},@}
@itemx @{@var{n},@var{m}@}
One or two numbers inside braces denote an @dfn{interval expression}.
If there is one number in the braces, the preceding regexp is repeated
@var{n} times.
If there are two numbers separated by a comma, the preceding regexp is
repeated @var{n} to @var{m} times.
If there is one number followed by a comma, then the preceding regexp
is repeated at least @var{n} times.

@table @code
@item wh@{3@}y
matches @samp{whhhy} but not @samp{why} or @samp{whhhhy}.

@item wh@{3,5@}y
matches @samp{whhhy} or @samp{whhhhy} or @samp{whhhhhy}, only.

@item wh@{2,@}y
matches @samp{whhy} or @samp{whhhy}, and so on.
@end table
2502
Interval expressions were not traditionally available in @code{awk}.
2503
As part of the POSIX standard they were added, to make @code{awk}
2504
and @code{egrep} consistent with each other.
2506
However, since old programs may use @samp{@{} and @samp{@}} in regexp
2507
constants, by default @code{gawk} does @emph{not} match interval expressions
2508
in regexps. If either @samp{--posix} or @samp{--re-interval} is specified
2509
(@pxref{Options, , Command Line Options}), then interval expressions
2510
are allowed in regexps.
2513
@cindex precedence, regexp operators
2514
@cindex regexp operators, precedence of
2515
In regular expressions, the @samp{*}, @samp{+}, and @samp{?} operators,
as well as the braces @samp{@{} and @samp{@}}, have
the highest precedence, followed by concatenation, and finally by @samp{|}.
2519
As in arithmetic, parentheses can change how operators are grouped.
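
The repetition operators and their precedence can be checked with a few
one-liners. This is only a sketch: the test words are made up for
illustration, and the @samp{^} and @samp{$} anchors are added so that each
regexp has to match the whole line.

```shell
# "*" allows zero or more h's, "+" needs at least one, "?" allows at most one.
printf '%s\n' wy why whhy | awk '/^wh*y$/'    # prints wy, why, whhy
printf '%s\n' wy why whhy | awk '/^wh+y$/'    # prints why, whhy
printf '%s\n' wy why whhy | awk '/^wh?y$/'    # prints wy, why

# Repetition binds tighter than concatenation:
# "ab+" means a(b+), so parentheses are needed to repeat "ab" as a unit.
printf '%s\n' ab abb abab | awk '/^ab+$/'     # prints ab, abb
printf '%s\n' ab abb abab | awk '/^(ab)+$/'   # prints ab, abab
```

Each command prints only the input lines that match, one per line.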
2521
If @code{gawk} is in compatibility mode
2522
(@pxref{Options, ,Command Line Options}),
2523
character classes and interval expressions are not available in
2524
regular expressions.
2533
The next section discusses the GNU-specific regexp operators, and provides
2534
more detail concerning how command line options affect the way @code{gawk}
2535
interprets the characters in regular expressions.
2537
@node GNU Regexp Operators, Case-sensitivity, Regexp Operators, Regexp
2538
@section Additional Regexp Operators Only in @code{gawk}
2540
@c This section adapted from the regex-0.12 manual
2542
@cindex regexp operators, GNU specific
2543
GNU software that deals with regular expressions provides a number of
2544
additional regexp operators. These operators are described in this
2545
section, and are specific to @code{gawk}; they are not available in other
2546
@code{awk} implementations.
2548
@cindex word, regexp definition of
2549
Most of the additional operators are for dealing with word matching.
2550
For our purposes, a @dfn{word} is a sequence of one or more letters, digits,
2551
or underscores (@samp{_}).
2554
@cindex @code{\w} regexp operator
2556
@item \w
This operator matches any word-constituent character, i.e.@: any
2557
letter, digit, or underscore. Think of it as a short-hand for
2558
@c @w{@code{[A-Za-z0-9_]}} or
2559
@w{@code{[[:alnum:]_]}}.
2561
@cindex @code{\W} regexp operator
2563
@item \W
This operator matches any character that is not word-constituent.
2564
Think of it as a short-hand for
2565
@c @w{@code{[^A-Za-z0-9_]}} or
2566
@w{@code{[^[:alnum:]_]}}.
2568
@cindex @code{\<} regexp operator
2570
@item \<
This operator matches the empty string at the beginning of a word.
For example, @code{/\<away/} matches @samp{away}, but not
@samp{stowaway}.
2574
@cindex @code{\>} regexp operator
2576
@item \>
This operator matches the empty string at the end of a word.
2577
For example, @code{/stow\>/} matches @samp{stow}, but not @samp{stowaway}.
2579
@cindex @code{\y} regexp operator
2580
@cindex word boundaries, matching
2582
@item \y
This operator matches the empty string at either the beginning or the
2583
end of a word (the word boundar@strong{y}). For example, @samp{\yballs?\y}
2584
matches either @samp{ball} or @samp{balls} as a separate word.
2586
@cindex @code{\B} regexp operator
2588
@item \B
This operator matches the empty string within a word. In other words,
2589
@samp{\B} matches the empty string that occurs between two
2590
word-constituent characters. For example,
2591
@code{/\Brat\B/} matches @samp{crate}, but it does not match @samp{dirty rat}.
2592
@samp{\B} is essentially the opposite of @samp{\y}.
2595
There are two other operators that work on buffers. In Emacs, a
2596
@dfn{buffer} is, naturally, an Emacs buffer. For other programs, the
2597
regexp library routines that @code{gawk} uses consider the entire
2598
string to be matched as the buffer.
2600
For @code{awk}, since @samp{^} and @samp{$} always work in terms
2601
of the beginning and end of strings, these operators don't add any
2602
new capabilities. They are provided for compatibility with other GNU
software.
2605
@cindex buffer matching operators
2607
@cindex @code{\`} regexp operator
2609
@item \`
This operator matches the empty string at the
2610
beginning of the buffer.
2612
@cindex @code{\'} regexp operator
2614
@item \'
This operator matches the empty string at the
end of the buffer.
2618
In other GNU software, the word boundary operator is @samp{\b}. However,
2619
that conflicts with the @code{awk} language's definition of @samp{\b}
2620
as backspace, so @code{gawk} uses a different letter.
2622
An alternative method would have been to require two backslashes in the
2623
GNU operators, but this was deemed to be too confusing, and the current
2624
method of using @samp{\y} for the GNU @samp{\b} appears to be the
2625
lesser of two evils.
2627
@c NOTE!!! Keep this in sync with the same table in the summary appendix!
2628
@cindex regexp, effect of command line options
2629
The various command line options
2630
(@pxref{Options, ,Command Line Options})
2631
control how @code{gawk} interprets characters in regexps.
2635
In the default case, @code{gawk} provides all the facilities of
2636
POSIX regexps and the GNU regexp operators described
2641
in @ref{Regexp Operators, ,Regular Expression Operators}.
2643
However, interval expressions are not supported.
2645
@item @code{--posix}
2646
Only POSIX regexps are supported; the GNU operators are not special
(e.g., @samp{\w} matches a literal @samp{w}). Interval expressions
are allowed.
2650
@item @code{--traditional}
2651
Traditional Unix @code{awk} regexps are matched. The GNU operators
2652
are not special, interval expressions are not available, and neither
2653
are the POSIX character classes (@code{[[:alnum:]]} and so on).
2654
Characters described by octal and hexadecimal escape sequences are
2655
treated literally, even if they represent regexp metacharacters.
2657
@item @code{--re-interval}
2658
Allow interval expressions in regexps, even if @samp{--traditional}
has been provided.
2662
@node Case-sensitivity, Leftmost Longest, GNU Regexp Operators, Regexp
2663
@section Case-sensitivity in Matching
2665
@cindex case sensitivity
2666
@cindex ignoring case
2667
Case is normally significant in regular expressions, both when matching
2668
ordinary characters (i.e.@: not metacharacters), and inside character
2669
sets. Thus a @samp{w} in a regular expression matches only a lower-case
2670
@samp{w} and not an upper-case @samp{W}.
2672
The simplest way to do a case-independent match is to use a character
2673
list: @samp{[Ww]}. However, this can be cumbersome if you need to use it
2674
often; and it can make the regular expressions harder to
2675
read. There are two alternatives that you might prefer.
2677
One way to do a case-insensitive match at a particular point in the
2678
program is to convert the data to a single case, using the
2679
@code{tolower} or @code{toupper} built-in string functions (which we
2680
haven't discussed yet;
2681
@pxref{String Functions, ,Built-in Functions for String Manipulation}).
2685
tolower($1) ~ /foo/ @{ @dots{} @}
2689
converts the first field to lower-case before matching against it.
2690
This will work in any POSIX-compliant implementation of @code{awk}.
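
As a small, self-contained illustration of the @code{tolower} approach
(the three input lines are invented for this sketch), the following
prints the second field of every record whose first field contains
@samp{foo} in any mixture of upper and lower case:

```shell
# Fold the first field to lower case, then match it against a
# lower-case regexp; the input data itself is left unchanged.
printf '%s\n' 'Foo 1' 'bar 2' 'FOOD 3' |
awk 'tolower($1) ~ /foo/ { print $2 }'
# prints:
# 1
# 3
```

Only the copy of the field used in the comparison is converted, so
@code{$1} still holds its original spelling inside the action.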
2692
@cindex differences between @code{gawk} and @code{awk}
2693
@cindex @code{~} operator
2694
@cindex @code{!~} operator
2696
Another method, specific to @code{gawk}, is to set the variable
2697
@code{IGNORECASE} to a non-zero value (@pxref{Built-in Variables}).
2698
When @code{IGNORECASE} is not zero, @emph{all} regexp and string
2699
operations ignore case. Changing the value of
2700
@code{IGNORECASE} dynamically controls the case sensitivity of your
2701
program as it runs. Case is significant by default because
2702
@code{IGNORECASE} (like most variables) is initialized to zero.
2706
x = "aB"
if (x ~ /ab/) @dots{} # this test will fail

IGNORECASE = 1
if (x ~ /ab/) @dots{} # now it will succeed
2712
In general, you cannot use @code{IGNORECASE} to make certain rules
2713
case-insensitive and other rules case-sensitive, because there is no way
2714
to set @code{IGNORECASE} just for the pattern of a particular rule.
2716
@ignore
This isn't quite true. Consider:

	IGNORECASE=1 && /foObAr/ { .... }
	IGNORECASE=0 || /foobar/ { .... }

But that's pretty bad style and I don't want to get into it at this
late date.
@end ignore
2724
To do this, you must use character lists or @code{tolower}. However, one
2725
thing you can do only with @code{IGNORECASE} is turn case-sensitivity on
2726
or off dynamically for all the rules at once.
2728
@code{IGNORECASE} can be set on the command line, or in a @code{BEGIN} rule
2729
(@pxref{Other Arguments, ,Other Command Line Arguments}; also
2730
@pxref{Using BEGIN/END, ,Startup and Cleanup Actions}).
2731
Setting @code{IGNORECASE} from the command line is a way to make
2732
a program case-insensitive without having to edit it.
2734
Prior to version 3.0 of @code{gawk}, the value of @code{IGNORECASE}
2735
only affected regexp operations. It did not affect string comparison
2736
with @samp{==}, @samp{!=}, and so on.
2737
Beginning with version 3.0, both regexp and string comparison
2738
operations are affected by @code{IGNORECASE}.
2742
Beginning with version 3.0 of @code{gawk}, the equivalences between upper-case
2743
and lower-case characters are based on the ISO-8859-1 (ISO Latin-1)
2744
character set. This character set is a superset of the traditional 128
2745
ASCII characters, that also provides a number of characters suitable
2746
for use with European languages.
2748
A pure ASCII character set can be used instead if @code{gawk} is compiled
2749
with @samp{-DUSE_PURE_ASCII}.
2752
The value of @code{IGNORECASE} has no effect if @code{gawk} is in
2753
compatibility mode (@pxref{Options, ,Command Line Options}).
2754
Case is always significant in compatibility mode.
2756
@node Leftmost Longest, Computed Regexps, Case-sensitivity, Regexp
2757
@section How Much Text Matches?
2759
@cindex leftmost longest match
2760
@cindex matching, leftmost longest
2761
Consider the following example:
2764
echo aaaabcd | awk '@{ sub(/a+/, "<A>"); print @}'
2767
This example uses the @code{sub} function (which we haven't discussed yet,
2768
@pxref{String Functions, ,Built-in Functions for String Manipulation})
2769
to make a change to the input record. Here, the regexp @code{/a+/}
2770
indicates ``one or more @samp{a} characters,'' and the replacement
text is @samp{<A>}.
2773
The input contains four @samp{a} characters. What will the output be?
2774
In other words, how many is ``one or more''---will @code{awk} match two,
2775
three, or all four @samp{a} characters?
2777
The answer is, @code{awk} (and POSIX) regular expressions always match
2778
the leftmost, @emph{longest} sequence of input characters that can
2779
match. Thus, in this example, all four @samp{a} characters are
2780
replaced with @samp{<A>}.
2783
$ echo aaaabcd | awk '@{ sub(/a+/, "<A>"); print @}'
@print{} <A>bcd
2787
For simple match/no-match tests, this is not so important. But when doing
2788
regexp-based field and record splitting, and
2789
text matching and substitutions with the @code{match}, @code{sub}, @code{gsub},
2790
and @code{gensub} functions, it is very important.
2792
@xref{String Functions, ,Built-in Functions for String Manipulation},
2793
for more information on these functions.
2795
Understanding this principle is also important for regexp-based record
2796
and field splitting (@pxref{Records, ,How Input is Split into Records},
2797
and also @pxref{Field Separators, ,Specifying How Fields are Separated}).
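
The two halves of the rule, ``leftmost'' and ``longest,'' can be seen
separately in the following sketch. The second input string is made up
for illustration; @code{match}, @code{RSTART} and @code{RLENGTH} are the
standard @code{awk} facilities described with the other string functions.

```shell
# Longest: all four a's are replaced, not just the first one.
echo aaaabcd | awk '{ sub(/a+/, "<A>"); print }'    # prints <A>bcd

# Leftmost: /a*/ can match the empty string, and the leftmost place
# where it matches is the very start of the record, before the "x".
echo xaaay | awk '{ sub(/a*/, "<A>"); print }'      # prints <A>xaaay

# match() reports where the match starts and how long it is.
echo aaaabcd | awk '{ match($0, /a+/); print RSTART, RLENGTH }'   # prints 1 4
```

The second command is the one that most often surprises people: the
three @samp{a} characters later in the record are never considered,
because an earlier (empty) match already exists.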
2799
@node Computed Regexps, , Leftmost Longest, Regexp
2800
@section Using Dynamic Regexps
2802
@cindex computed regular expressions
2803
@cindex regular expressions, computed
2804
@cindex dynamic regular expressions
2805
@cindex regexp, dynamic
2806
@cindex @code{~} operator
2807
@cindex @code{!~} operator
2808
The right hand side of a @samp{~} or @samp{!~} operator need not be a
2809
regexp constant (i.e.@: a string of characters between slashes). It may
2810
be any expression. The expression is evaluated, and converted if
2811
necessary to a string; the contents of the string are used as the
2812
regexp. A regexp that is computed in this way is called a @dfn{dynamic
2813
regexp}. For example:
2816
BEGIN @{ identifier_regexp = "[A-Za-z_][A-Za-z_0-9]*" @}
2817
$0 ~ identifier_regexp @{ print @}
2821
sets @code{identifier_regexp} to a regexp that describes @code{awk}
2822
variable names, and tests if the input record matches this regexp.
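
Because the right-hand side may be any expression, a regexp can also be
assembled from smaller pieces at run time. The following is a minimal
sketch; the variable names @code{digit} and @code{pat} and the sample
input are assumptions made for this example only.

```shell
# Build the regexp "^[0-9]+$" by string concatenation, then use it
# as a dynamic regexp on the right-hand side of "~".
printf '%s\n' 42 abc 7up 1996 |
awk 'BEGIN { digit = "[0-9]"; pat = "^" digit "+$" }
     $0 ~ pat { print "number:", $0 }'
# prints:
# number: 42
# number: 1996
```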
2824
@strong{Caution:} When using the @samp{~} and @samp{!~}
2825
operators, there is a difference between a regexp constant
2826
enclosed in slashes, and a string constant enclosed in double quotes.
2827
If you are going to use a string constant, you have to understand that
2828
the string is in essence scanned @emph{twice}; the first time when
2829
@code{awk} reads your program, and the second time when it goes to
2830
match the string on the left-hand side of the operator with the pattern
2831
on the right. This is true of any string valued expression (such as
2832
@code{identifier_regexp} above), not just string constants.
2834
@cindex regexp constants, difference between slashes and quotes
2835
What difference does it make if the string is
2836
scanned twice? The answer has to do with escape sequences, and particularly
2837
with backslashes. To get a backslash into a regular expression inside a
2838
string, you have to type two backslashes.
2840
For example, @code{/\*/} is a regexp constant for a literal @samp{*}.
2841
Only one backslash is needed. To do the same thing with a string,
2842
you would have to type @code{"\\*"}. The first backslash escapes the
2843
second one, so that the string actually contains the
2844
two characters @samp{\} and @samp{*}.
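
The two spellings can be compared side by side. In this sketch (the
input lines are invented), both commands print only the line that
contains a literal @samp{*}:

```shell
# Regexp constant: one backslash makes the "*" literal.
printf '%s\n' 'a*b' 'ab' | awk '$0 ~ /\*/'      # prints a*b

# String constant: two backslashes are needed, because the text is
# scanned once as a string and then again as a regexp.
printf '%s\n' 'a*b' 'ab' | awk '$0 ~ "\\*"'     # prints a*b
```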
2846
@cindex common mistakes
2847
@cindex mistakes, common
2848
@cindex errors, common
2849
Given that you can use both regexp and string constants to describe
2850
regular expressions, which should you use? The answer is ``regexp
2851
constants,'' for several reasons.
2855
String constants are more complicated to write, and
2856
more difficult to read. Using regexp constants makes your programs
2857
less error-prone. Not understanding the difference between the two
2858
kinds of constants is a common source of errors.
2861
It is also more efficient to use regexp constants: @code{awk} can note
2862
that you have supplied a regexp and store it internally in a form that
2863
makes pattern matching more efficient. When using a string constant,
2864
@code{awk} must first convert the string into this internal form, and
2865
then perform the pattern matching.
2868
Using regexp constants is better style; it shows clearly that you
2869
intend a regexp match.
2872
@node Reading Files, Printing, Regexp, Top
2873
@chapter Reading Input Files
2875
@cindex reading files
2877
@cindex standard input
2879
In the typical @code{awk} program, all input is read either from the
2880
standard input (by default the keyboard, but often a pipe from another
2881
command) or from files whose names you specify on the @code{awk} command
2882
line. If you specify input files, @code{awk} reads them in order, reading
2883
all the data from one before going on to the next. The name of the current
2884
input file can be found in the built-in variable @code{FILENAME}
2885
(@pxref{Built-in Variables}).
2887
The input is read in units called @dfn{records}, and processed by the
2888
rules of your program one record at a time.
2889
By default, each record is one line. Each
2890
record is automatically split into chunks called @dfn{fields}.
2891
This makes it more convenient for programs to work on the parts of a record.
2893
On rare occasions you will need to use the @code{getline} command.
2894
The @code{getline} command is valuable, both because it
2895
can do explicit input from any number of files, and because the files
2896
used with it do not have to be named on the @code{awk} command line
2897
(@pxref{Getline, ,Explicit Input with @code{getline}}).
2900
* Records:: Controlling how data is split into records.
2901
* Fields:: An introduction to fields.
2902
* Non-Constant Fields:: Non-constant Field Numbers.
2903
* Changing Fields:: Changing the Contents of a Field.
2904
* Field Separators:: The field separator and how to change it.
2905
* Constant Size:: Reading constant width data.
2906
* Multiple Line:: Reading multi-line records.
2907
* Getline:: Reading files under explicit program control
2908
using the @code{getline} function.
2911
@node Records, Fields, Reading Files, Reading Files
2912
@section How Input is Split into Records
2914
@cindex record separator, @code{RS}
2915
@cindex changing the record separator
2916
@cindex record, definition of
2918
The @code{awk} utility divides the input for your @code{awk}
2919
program into records and fields.
2920
Records are separated by a character called the @dfn{record separator}.
2921
By default, the record separator is the newline character.
2922
This is why records are, by default, single lines.
2923
You can use a different character for the record separator by
2924
assigning the character to the built-in variable @code{RS}.
2926
You can change the value of @code{RS} in the @code{awk} program,
2927
like any other variable, with the
2928
assignment operator, @samp{=} (@pxref{Assignment Ops, ,Assignment Expressions}).
2929
The new record-separator character should be enclosed in quotation marks,
which indicate
a string constant. Often the right time to do this is at the beginning
2932
of execution, before any input has been processed, so that the very
2933
first record will be read with the proper separator. To do this, use
2934
the special @code{BEGIN} pattern
2935
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}). For
example:
2939
awk 'BEGIN @{ RS = "/" @} ; @{ print $0 @}' BBS-list
2943
changes the value of @code{RS} to @code{"/"}, before reading any input.
2944
This is a string whose first character is a slash; as a result, records
2945
are separated by slashes. Then the input file is read, and the second
2946
rule in the @code{awk} program (the action with no pattern) prints each
2947
record. Since each @code{print} statement adds a newline at the end of
2948
its output, the effect of this @code{awk} program is to copy the input
2949
with each slash changed to a newline. Here are the results of running
2950
the program on @file{BBS-list}:
2954
$ awk 'BEGIN @{ RS = "/" @} ; @{ print $0 @}' BBS-list
2955
@print{} aardvark 555-5553 1200
@print{} 300 B
@print{} alpo-net 555-3412 2400
@print{} 1200
@print{} 300 A
@print{} barfly 555-7685 1200
@print{} 300 A
@print{} bites 555-1675 2400
@print{} 1200
@print{} 300 A
@print{} camelot 555-0542 300 C
@print{} core 555-2912 1200
@print{} 300 C
@print{} fooey 555-1234 2400
@print{} 1200
@print{} 300 B
@print{} foot 555-6699 1200
@print{} 300 B
@print{} macfoo 555-6480 1200
@print{} 300 A
@print{} sdace 555-3430 2400
@print{} 1200
@print{} 300 A
@print{} sabafoo 555-2127 1200
@print{} 300 C
@print{}
2985
Note that the entry for the @samp{camelot} BBS is not split.
2986
In the original data file
2987
(@pxref{Sample Data Files, , Data Files for the Examples}),
2988
the line looks like this:
2991
camelot 555-0542 300 C
2995
It only has one baud rate; there are no slashes in the record.
2997
Another way to change the record separator is on the command line,
2998
using the variable-assignment feature
2999
(@pxref{Other Arguments, ,Other Command Line Arguments}).
3002
awk '@{ print $0 @}' RS="/" BBS-list
3006
This sets @code{RS} to @samp{/} before processing @file{BBS-list}.
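
Both ways of setting @code{RS} can be tried on a single line of input
without the data file. This sketch uses one @file{BBS-list} entry typed
inline, prints the record number in front of each record, and uses
@samp{-} to name the standard input explicitly in the second command.

```shell
# Setting RS in a BEGIN rule.
echo 'fooey 555-1234 2400/1200/300 B' |
awk 'BEGIN { RS = "/" } { print NR ": " $0 }'

# Setting RS with a command-line assignment; "-" means standard input.
echo 'fooey 555-1234 2400/1200/300 B' |
awk '{ print NR ": " $0 }' RS="/" -

# Each command prints:
# 1: fooey 555-1234 2400
# 2: 1200
# 3: 300 B
# followed by an empty line, because the third record still contains
# the newline that ended the input line.
```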
3008
Using an unusual character such as @samp{/} for the record separator
3009
produces correct behavior in the vast majority of cases. However,
3010
the following (extreme) pipeline prints a surprising @samp{1}. There
3011
is one field, consisting of a newline. The value of the built-in
3012
variable @code{NF} is the number of fields in the current record.
3015
$ echo | awk 'BEGIN @{ RS = "a" @} ; @{ print NF @}'
@print{} 1
3021
Reaching the end of an input file terminates the current input record,
3022
even if the last character in the file is not the character in @code{RS}.
3025
@cindex empty string
3026
The empty string, @code{""} (a string of no characters), has a special meaning
3027
as the value of @code{RS}: it means that records are separated
3028
by one or more blank lines, and nothing else.
3029
@xref{Multiple Line, ,Multiple-Line Records}, for more details.
3031
If you change the value of @code{RS} in the middle of an @code{awk} run,
3032
the new value is used to delimit subsequent records, but the record
3033
currently being processed (and records already processed) are not
affected.
3037
@cindex record terminator, @code{RT}
3038
@cindex terminator, record
3039
@cindex differences between @code{gawk} and @code{awk}
3040
After the end of the record has been determined, @code{gawk}
3041
sets the variable @code{RT} to the text in the input that matched
@code{RS}.
3044
@cindex regular expressions as record separators
3045
The value of @code{RS} is in fact not limited to a one-character
3046
string. It can be any regular expression
3047
(@pxref{Regexp, ,Regular Expressions}).
3048
In general, each record
3049
ends at the next string that matches the regular expression; the next
3050
record starts at the end of the matching string. This general rule is
3051
actually at work in the usual case, where @code{RS} contains just a
3052
newline: a record ends at the beginning of the next matching string (the
3053
next newline in the input) and the following record starts just after
3054
the end of this string (at the first character of the following line).
3055
The newline, since it matches @code{RS}, is not part of either record.
3057
When @code{RS} is a single character, @code{RT} will
3058
contain the same single character. However, when @code{RS} is a
3059
regular expression, then @code{RT} becomes more useful; it contains
3060
the actual input text that matched the regular expression.
3062
The following example illustrates both of these features.
3063
It sets @code{RS} equal to a regular expression that
3064
matches either a newline, or a series of one or more upper-case letters
3065
with optional leading and/or trailing white space
3066
(@pxref{Regexp, , Regular Expressions}).
3069
$ echo record 1 AAAA record 2 BBBB record 3 |
3070
> gawk 'BEGIN @{ RS = "\n|( *[[:upper:]]+ *)" @}
3071
> @{ print "Record =", $0, "and RT =", RT @}'
3072
@print{} Record = record 1 and RT = AAAA
3073
@print{} Record = record 2 and RT = BBBB
3074
@print{} Record = record 3 and RT =
@print{}
3079
The final line of output has an extra blank line. This is because the
3080
value of @code{RT} is a newline, and then the @code{print} statement
3081
supplies its own terminating newline.
3083
@xref{Simple Sed, ,A Simple Stream Editor}, for a more useful example
3084
of @code{RS} as a regexp and @code{RT}.
3086
@cindex differences between @code{gawk} and @code{awk}
3087
The use of @code{RS} as a regular expression and the @code{RT}
3088
variable are @code{gawk} extensions; they are not available in
compatibility mode
3090
(@pxref{Options, ,Command Line Options}).
3091
In compatibility mode, only the first character of the value of
3092
@code{RS} is used to determine the end of the record.
3094
@cindex number of records, @code{NR}, @code{FNR}
3097
The @code{awk} utility keeps track of the number of records that have
3098
been read so far from the current input file. This value is stored in a
3099
built-in variable called @code{FNR}. It is reset to zero when a new
3100
file is started. Another built-in variable, @code{NR}, is the total
3101
number of input records read so far from all data files. It starts at zero
3102
but is never automatically reset to zero.
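
The difference between @code{FNR} and @code{NR} only shows up with more
than one input file. The following sketch creates two small temporary
files (their names and contents are made up for this example) and prints
both counters for every record:

```shell
# Create two throwaway input files in a temporary directory.
dir=$(mktemp -d)
printf 'a\nb\n'    > "$dir/first"
printf 'c\nd\ne\n' > "$dir/second"

# FNR restarts at 1 for each file; NR keeps counting.
awk '{ print FNR, NR, $0 }' "$dir/first" "$dir/second"
# prints:
# 1 1 a
# 2 2 b
# 1 3 c
# 2 4 d
# 3 5 e

rm -r "$dir"
```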
3104
@node Fields, Non-Constant Fields, Records, Reading Files
3105
@section Examining Fields
3107
@cindex examining fields
3109
@cindex accessing fields
3110
When @code{awk} reads an input record, the record is
3111
automatically separated or @dfn{parsed} by the interpreter into chunks
3112
called @dfn{fields}. By default, fields are separated by whitespace,
3113
like words in a line.
3114
Whitespace in @code{awk} means any string of one or more spaces and/or
3115
tabs; other characters such as newline, formfeed, and so on, that are
3116
considered whitespace by other languages are @emph{not} considered
3117
whitespace by @code{awk}.
3119
The purpose of fields is to make it more convenient for you to refer to
3120
these pieces of the record. You don't have to use them---you can
3121
operate on the whole record if you wish---but fields are what make
3122
simple @code{awk} programs so powerful.
3124
@cindex @code{$} (field operator)
3125
@cindex field operator @code{$}
3126
To refer to a field in an @code{awk} program, you use a dollar-sign,
3127
@samp{$}, followed by the number of the field you want. Thus, @code{$1}
3128
refers to the first field, @code{$2} to the second, and so on. For
3129
example, suppose the following is a line of input:
3132
This seems like a pretty nice example.
3136
Here the first field, or @code{$1}, is @samp{This}; the second field, or
3137
@code{$2}, is @samp{seems}; and so on. Note that the last field,
3138
@code{$7}, is @samp{example.}. Because there is no space between the
3139
@samp{e} and the @samp{.}, the period is considered part of the seventh
field.
3143
@cindex number of fields, @code{NF}
3144
@code{NF} is a built-in variable whose value
3145
is the number of fields in the current record.
3146
@code{awk} updates the value of @code{NF} automatically, each time
a record is read.
3149
No matter how many fields there are, the last field in a record can be
3150
represented by @code{$NF}. So, in the example above, @code{$NF} would
3151
be the same as @code{$7}, which is @samp{example.}. Why this works is
3152
explained below (@pxref{Non-Constant Fields, ,Non-constant Field Numbers}).
3153
If you try to reference a field beyond the last one, such as @code{$8}
3154
when the record has only seven fields, you get the empty string.
3155
@c the empty string acts like 0 in some contexts, but I don't want to
3156
@c get into that here....
3158
@code{$0}, which looks like a reference to the ``zeroth'' field, is
3159
a special case: it represents the whole input record. @code{$0} is
3160
used when you are not interested in fields.
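
The statements above can be checked directly on the sample sentence.
This sketch prints the field count, the first and last fields, and a
field beyond the last one (wrapped in brackets so that the empty string
is visible):

```shell
echo 'This seems like a pretty nice example.' |
awk '{ print NF; print $1; print $NF; print "[" $8 "]" }'
# prints:
# 7
# This
# example.
# []
```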
3162
Here are some more examples:
3166
$ awk '$1 ~ /foo/ @{ print $0 @}' BBS-list
3167
@print{} fooey 555-1234 2400/1200/300 B
3168
@print{} foot 555-6699 1200/300 B
3169
@print{} macfoo 555-6480 1200/300 A
3170
@print{} sabafoo 555-2127 1200/300 C
3175
This example prints each record in the file @file{BBS-list} whose first
3176
field contains the string @samp{foo}. The operator @samp{~} is called a
3177
@dfn{matching operator}
3178
(@pxref{Regexp Usage, , How to Use Regular Expressions});
3179
it tests whether a string (here, the field @code{$1}) matches a given regular
expression.
3182
By contrast, the following example
3183
looks for @samp{foo} in @emph{the entire record} and prints the first
3184
field and the last field for each input record containing a
match.
$ awk '/foo/ @{ print $1, $NF @}' BBS-list
@print{} fooey B
@print{} foot B
@print{} macfoo A
@print{} sabafoo C
3197
@node Non-Constant Fields, Changing Fields, Fields, Reading Files
3198
@section Non-constant Field Numbers
3200
The number of a field does not need to be a constant. Any expression in
3201
the @code{awk} language can be used after a @samp{$} to refer to a
3202
field. The value of the expression specifies the field number. If the
3203
value is a string, rather than a number, it is converted to a number.
3204
Consider this example:
3207
awk '@{ print $NR @}'
3211
Recall that @code{NR} is the number of records read so far: one in the
3212
first record, two in the second, etc. So this example prints the first
3213
field of the first record, the second field of the second record, and so
3214
on. For the twentieth record, field number 20 is printed; most likely,
3215
the record has fewer than 20 fields, so this prints a blank line.
3217
Here is another example of using expressions as field numbers:
3220
awk '@{ print $(2*2) @}' BBS-list
3223
@code{awk} must evaluate the expression @samp{(2*2)} and use
3224
its value as the number of the field to print. The @samp{*} sign
3225
represents multiplication, so the expression @samp{2*2} evaluates to four.
3226
The parentheses are used so that the multiplication is done before the
3227
@samp{$} operation; they are necessary whenever there is a binary
3228
operator in the field-number expression. This example, then, prints the
3229
hours of operation (the fourth field) for every line of the file
3230
@file{BBS-list}. (All of the @code{awk} operators are listed, in
3231
order of decreasing precedence, in
3232
@ref{Precedence, , Operator Precedence (How Operators Nest)}.)
3234
If the field number you compute is zero, you get the entire record.
3235
Thus, @code{$(2-2)} has the same value as @code{$0}. Negative field
3236
numbers are not allowed; trying to reference one will usually terminate
3237
your running @code{awk} program. (The POSIX standard does not define
3238
what happens when you reference a negative field number. @code{gawk}
3239
will notice this and terminate your program. Other @code{awk}
3240
implementations may behave differently.)
3242
As mentioned in @ref{Fields, ,Examining Fields},
3243
the number of fields in the current record is stored in the built-in
3244
variable @code{NF} (also @pxref{Built-in Variables}). The expression
3245
@code{$NF} is not a special feature: it is the direct consequence of
3246
evaluating @code{NF} and using its value as a field number.
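
A short sketch of both points follows: a field number that changes from
record to record, and the effect of the parentheses. The input lines are
invented for illustration.

```shell
# $NR picks field 1 of record 1, field 2 of record 2, and so on.
printf 'a b c\nd e f\ng h i\n' | awk '{ print $NR }'
# prints:
# a
# e
# i

# Parentheses matter: $(NF-1) is the next-to-last field,
# while $NF-1 is the last field minus one.
echo '10 20 30' | awk '{ print $(NF-1), $NF-1 }'
# prints: 20 29
```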
3248
@node Changing Fields, Field Separators, Non-Constant Fields, Reading Files
3249
@section Changing the Contents of a Field
3251
@cindex field, changing contents of
3252
@cindex changing contents of a field
3253
@cindex assignment to fields
3254
You can change the contents of a field as seen by @code{awk} within an
3255
@code{awk} program; this changes what @code{awk} perceives as the
3256
current input record. (The actual input is untouched; @code{awk} @emph{never}
3257
modifies the input file.)
3259
Consider this example and its output:
3263
$ awk '@{ $3 = $2 - 10; print $2, $3 @}' inventory-shipped
@print{} 13 3
@print{} 15 5
@print{} 15 5
@dots{}
3272
The @samp{-} sign represents subtraction, so this program reassigns
3273
field three, @code{$3}, to be the value of field two minus ten,
3274
@samp{$2 - 10}. (@xref{Arithmetic Ops, ,Arithmetic Operators}.)
3275
Then field two, and the new value for field three, are printed.
3277
In order for this to work, the text in field @code{$2} must make sense
3278
as a number; the string of characters must be converted to a number in
3279
order for the computer to do arithmetic on it. The number resulting
3280
from the subtraction is converted back to a string of characters which
3281
then becomes field three.
3282
@xref{Conversion, ,Conversion of Strings and Numbers}.
3284
When you change the value of a field (as perceived by @code{awk}), the
3285
text of the input record is recalculated to contain the new field where
3286
the old one was. Therefore, @code{$0} changes to reflect the altered
3287
field. Thus, this program
3288
prints a copy of the input file, with 10 subtracted from the second
field of each line.
3293
$ awk '@{ $2 = $2 - 10; print $0 @}' inventory-shipped
3294
@print{} Jan 3 25 15 115
3295
@print{} Feb 5 32 24 226
3296
@print{} Mar 5 24 34 228
@dots{}
3301
You can also assign contents to fields that are out of range. For
example:
3305
$ awk '@{ $6 = ($5 + $4 + $3 + $2)
3306
> print $6 @}' inventory-shipped
@print{} 168
@print{} 297
@print{} 301
@dots{}
3314
We've just created @code{$6}, whose value is the sum of fields
3315
@code{$2}, @code{$3}, @code{$4}, and @code{$5}. The @samp{+} sign
3316
represents addition. For the file @file{inventory-shipped}, @code{$6}
3317
represents the total number of parcels shipped for a particular month.
3319
Creating a new field changes @code{awk}'s internal copy of the current
3320
input record---the value of @code{$0}. Thus, if you do @samp{print $0}
3321
after adding a field, the record printed includes the new field, with
3322
the appropriate number of field separators between it and the previously
3325
This recomputation affects and is affected by
@code{NF} (the number of fields; @pxref{Fields, ,Examining Fields}),
and by a feature that has not been discussed yet,
the @dfn{output field separator}, @code{OFS},
which is used to separate the fields (@pxref{Output Separators}).
For example, the value of @code{NF} is set to the number of the highest
field you create.

Note, however, that merely @emph{referencing} an out-of-range field
does @emph{not} change the value of either @code{$0} or @code{NF}.
Referencing an out-of-range field only produces an empty string. For
example:

@example
if ($(NF+1) != "")
    print "can't happen"
else
    print "everything is normal"
@end example

@noindent
should print @samp{everything is normal}, because @code{NF+1} is certain
to be out of range. (@xref{If Statement, ,The @code{if}-@code{else} Statement},
for more information about @code{awk}'s @code{if-else} statements.
@xref{Typing and Comparison, ,Variable Typing and Comparison Expressions}, for more information
about the @samp{!=} operator.)

It is important to note that making an assignment to an existing field
will change the
value of @code{$0}, but will not change the value of @code{NF},
even when you assign the empty string to a field. For example:

@example
$ echo a b c d | awk '@{ OFS = ":"; $2 = ""
>                       print $0; print NF @}'
@print{} a::c:d
@print{} 4
@end example

@noindent
The field is still there; it just has an empty value. You can tell
because there are two colons in a row.

This example shows what happens if you create a new field.

@example
$ echo a b c d | awk '@{ OFS = ":"; $2 = ""; $6 = "new"
>                       print $0; print NF @}'
@print{} a::c:d::new
@print{} 6
@end example

@noindent
The intervening field, @code{$5}, is created with an empty value
(indicated by the second pair of adjacent colons),
and @code{NF} is updated with the value six.

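
The difference between referencing and assigning an out-of-range field
can be checked directly. The following is only a minimal sketch; the
input line and the variable name @code{x} are made up for illustration:

@example
$ echo a b c | awk '@{ x = $(NF+2); print NF
>                     $(NF+2) = "z"; print NF; print $0 @}'
@print{} 3
@print{} 5
@print{} a b c  z
@end example

@noindent
The reference leaves @code{NF} at three. The assignment raises it to
five, and the rebuilt @code{$0} shows two adjacent separators (here the
default @code{OFS}, a single space) around the empty fourth field.
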
@node Field Separators, Constant Size, Changing Fields, Reading Files
@section Specifying How Fields are Separated

This section is rather long; it describes one of the most fundamental
operations in @code{awk}.

@menu
* Basic Field Splitting::        How fields are split with single characters
                                 or simple strings.
* Regexp Field Splitting::       Using regexps as the field separator.
* Single Character Fields::      Making each character a separate field.
* Command Line Field Separator:: Setting @code{FS} from the command line.
* Field Splitting Summary::      Some final points and a summary table.
@end menu

@node Basic Field Splitting, Regexp Field Splitting, Field Separators, Field Separators
@subsection The Basics of Field Separating

@cindex fields, separating
@cindex field separator, @code{FS}
The @dfn{field separator}, which is either a single character or a regular
expression, controls the way @code{awk} splits an input record into fields.
@code{awk} scans the input record for character sequences that
match the separator; the fields themselves are the text between the matches.

In the examples below, we use the bullet symbol ``@bullet{}'' to represent
spaces in the output.

If the field separator is @samp{oo}, then the following line:

@example
moo goo gai pan
@end example

@noindent
would be split into three fields: @samp{m}, @samp{@bullet{}g} and
@samp{@bullet{}gai@bullet{}pan}.
Note the leading spaces in the values of the second and third fields.

@cindex common mistakes
@cindex mistakes, common
@cindex errors, common
The field separator is represented by the built-in variable @code{FS}.
Shell programmers take note! @code{awk} does @emph{not} use the name @code{IFS},
which is used by the POSIX-compatible shells (such as the Bourne shell,
@code{sh}, or the GNU Bourne-Again Shell, Bash).

You can change the value of @code{FS} in the @code{awk} program with the
assignment operator, @samp{=} (@pxref{Assignment Ops, ,Assignment Expressions}).
Often the right time to do this is at the beginning of execution,
before any input has been processed, so that the very first record
will be read with the proper separator. To do this, use the special
@code{BEGIN} pattern
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}).
For example, here we set the value of @code{FS} to the string
@code{","}:

@example
awk 'BEGIN @{ FS = "," @} ; @{ print $2 @}'
@end example

@noindent
Given the input line,

@example
John Q. Smith, 29 Oak St., Walamazoo, MI 42139
@end example

@noindent
this @code{awk} program extracts and prints the string
@samp{@bullet{}29@bullet{}Oak@bullet{}St.}.

@cindex field separator, choice of
@cindex regular expressions as field separators
Sometimes your input data will contain separator characters that don't
separate fields the way you thought they would. For instance, the
person's name in the example we just used might have a title or
suffix attached, such as @samp{John Q. Smith, LXIX}. From input
containing such a name:

@example
John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139
@end example

@noindent
@c careful of an overfull hbox here!
the above program would extract @samp{@bullet{}LXIX}, instead of
@samp{@bullet{}29@bullet{}Oak@bullet{}St.}.
If you were expecting the program to print the
address, you would be surprised. The moral is: choose your data layout and
separator characters carefully to prevent such problems.

As you know, fields are normally separated by whitespace sequences
(spaces and tabs), not by single spaces: two spaces in a row do not
delimit an empty field. The default value of the field separator @code{FS}
is a string containing a single space, @w{@code{" "}}. If this value were
interpreted in the usual way, each space character would separate
fields, so two spaces in a row would make an empty field between them.
The reason this does not happen is that a single space as the value of
@code{FS} is a special case: it is taken to specify the default manner
of delimiting fields.

If @code{FS} is any other single character, such as @code{","}, then
each occurrence of that character separates two fields. Two consecutive
occurrences delimit an empty field. If the character occurs at the
beginning or the end of the line, that too delimits an empty field. The
space character is the only single character which does not follow these
rules.

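
The single-character rules can be seen in a small experiment. This is
only a sketch; the input line is invented for the purpose, with a
leading comma, a doubled comma, and a trailing comma:

@example
$ echo ',a,,b,' | awk -F, '@{ for (i = 1; i <= NF; i++)
>                               print i, "[" $i "]" @}'
@print{} 1 []
@print{} 2 [a]
@print{} 3 []
@print{} 4 [b]
@print{} 5 []
@end example

@noindent
Fields one, three and five are empty: they are delimited by the leading
comma, the two consecutive commas, and the trailing comma, respectively.
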
@node Regexp Field Splitting, Single Character Fields, Basic Field Splitting, Field Separators
@subsection Using Regular Expressions to Separate Fields

The previous subsection
discussed the use of single characters or simple strings as the
value of @code{FS}.
More generally, the value of @code{FS} may be a string containing any
regular expression. In this case, each match in the record for the regular
expression separates fields. For example, the assignment:

@example
FS = ", \t"
@end example

@noindent
makes every area of an input line that consists of a comma followed by a
space and a tab, into a field separator. (@samp{\t}
is an @dfn{escape sequence} that stands for a tab;
@pxref{Escape Sequences},
for the complete list of similar escape sequences.)

For a less trivial example of a regular expression, suppose you want
single spaces to separate fields the way single commas were used above.
You can set @code{FS} to @w{@code{"[@ ]"}} (left bracket, space, right
bracket). This regular expression matches a single space and nothing else
(@pxref{Regexp, ,Regular Expressions}).

There is an important difference between the two cases of @samp{FS = @w{" "}}
(a single space) and @samp{FS = @w{"[ \t]+"}} (left bracket, space, backslash,
``t'', right bracket, plus sign, which is a regular expression
matching one or more spaces or tabs). For both values of @code{FS}, fields
are separated by runs of spaces and/or tabs. However, when the value of
@code{FS} is @w{@code{" "}}, @code{awk} will first strip leading and trailing
whitespace from the record, and then decide where the fields are.

For example, the following pipeline prints @samp{b}:

@example
$ echo ' a b c d ' | awk '@{ print $2 @}'
@print{} b
@end example

@noindent
However, this pipeline prints @samp{a} (note the extra spaces around
each letter):

@example
$ echo ' a b c d ' | awk 'BEGIN @{ FS = "[ \t]+" @}
>                         @{ print $2 @}'
@print{} a
@end example

@noindent
@cindex empty string
In this case, the first field is @dfn{null}, or empty.

The stripping of leading and trailing whitespace also comes into
play whenever @code{$0} is recomputed. For instance, study this pipeline:

@example
$ echo ' a b c d' | awk '@{ print; $2 = $2; print @}'
@print{}  a b c d
@print{} a b c d
@end example

@noindent
The first @code{print} statement prints the record as it was read,
with leading whitespace intact. The assignment to @code{$2} rebuilds
@code{$0} by concatenating @code{$1} through @code{$NF} together,
separated by the value of @code{OFS}. Since the leading whitespace
was ignored when finding @code{$1}, it is not part of the new @code{$0}.
Finally, the last @code{print} statement prints the new @code{$0}.

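
To see how a one-space regexp differs from the default separator,
consider this small sketch. The input, with two spaces between the
letters, is made up for illustration:

@example
$ echo 'a  b' | awk '@{ print NF @}'
@print{} 2
$ echo 'a  b' | awk -F'[ ]' '@{ print NF @}'
@print{} 3
@end example

@noindent
With the default @code{FS}, the run of two spaces is a single separator.
With the regexp @samp{[ ]}, each space separates fields, so an empty
field appears between the two spaces.
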
@node Single Character Fields, Command Line Field Separator, Regexp Field Splitting, Field Separators
@subsection Making Each Character a Separate Field

@cindex differences between @code{gawk} and @code{awk}
@cindex single character fields
There are times when you may want to examine each character
of a record separately. In @code{gawk}, this is easy to do: you
simply assign the null string (@code{""}) to @code{FS}. In this case,
each individual character in the record will become a separate field.

@c extra verbiage due to page boundaries
@example
echo a b | gawk 'BEGIN @{ FS = "" @}
                 @{
                     for (i = 1; i <= NF; i = i + 1)
                         print "Field", i, "is", $i
                 @}'
@end example

The output from this is:

@example
@print{} Field 1 is a
@print{} Field 2 is
@print{} Field 3 is b
@end example

Traditionally, the behavior for @code{FS} equal to @code{""} was not defined.
In this case, Unix @code{awk} would simply treat the entire record
as only having one field (d.c.). In compatibility mode
(@pxref{Options, ,Command Line Options}),
if @code{FS} is the null string, then @code{gawk} will also
behave this way.

@node Command Line Field Separator, Field Splitting Summary, Single Character Fields, Field Separators
@subsection Setting @code{FS} from the Command Line
@cindex @code{-F} option
@cindex field separator, on command line
@cindex command line, setting @code{FS} on

@code{FS} can be set on the command line. You use the @samp{-F} option to
do so. For example:

@example
awk -F, '@var{program}' @var{input-files}
@end example

@noindent
sets @code{FS} to be the @samp{,} character. Notice that the option uses
a capital @samp{F}. Contrast this with @samp{-f}, which specifies a file
containing an @code{awk} program. Case is significant in command line options:
the @samp{-F} and @samp{-f} options have nothing to do with each other.
You can use both options at the same time to set the @code{FS} variable
@emph{and} get an @code{awk} program from a file.

The value used for the argument to @samp{-F} is processed in exactly the
same way as assignments to the built-in variable @code{FS}. This means that
if the field separator contains special characters, they must be escaped
appropriately. For example, to use a @samp{\} as the field separator, you
would type:

@example
awk -F\\\\ '@dots{}' files @dots{}
@end example

@noindent
Since @samp{\} is used for quoting in the shell, @code{awk} will see
@samp{-F\\}. Then @code{awk} processes the @samp{\\} for escape
characters (@pxref{Escape Sequences}), finally yielding
a single @samp{\} to be used for the field separator.

@cindex historical features
As a special case, in compatibility mode
(@pxref{Options, ,Command Line Options}), if the
argument to @samp{-F} is @samp{t}, then @code{FS} is set to the tab
character. This is because if you type @samp{-F\t} at the shell,
without any quotes, the @samp{\} gets deleted, so @code{awk} figures that you
really want your fields to be separated with tabs, and not @samp{t}s.
Use @samp{-v FS="t"} on the command line if you really do want to separate
your fields with @samp{t}s
(@pxref{Options, ,Command Line Options}).

For example, let's use an @code{awk} program file called @file{baud.awk}
that contains the pattern @code{/300/}, and the action @samp{print $1}.
Here is the program:

@example
/300/ @{ print $1 @}
@end example

Let's also set @code{FS} to be the @samp{-} character, and run the
program on the file @file{BBS-list}. The following command prints a
list of the names of the bulletin boards that operate at 300 baud and
the first three digits of their phone numbers:

@example
@c tweaked to make the tex output look better in @smallbook
$ awk -F- -f baud.awk BBS-list
@print{} aardvark 555
@print{} alpo
@dots{}
@print{} camelot 555
@dots{}
@print{} sabafoo 555
@end example

Note the second line of output. In the original file
(@pxref{Sample Data Files, ,Data Files for the Examples}),
the second line looked like this:

@example
alpo-net 555-3412 2400/1200/300 A
@end example

The @samp{-} as part of the system's name was used as the field
separator, instead of the @samp{-} in the phone number that was
originally intended. This demonstrates why you have to be careful in
choosing your field and record separators.

On many Unix systems, each user has a separate entry in the system password
file, one line per user. The information in these lines is separated
by colons. The first field is the user's logon name, and the second is
the user's encrypted password. A password file entry might look like this:

@example
arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh
@end example

The following program searches the system password file, and prints
the entries for users who have no password:

@example
awk -F: '$2 == ""' /etc/passwd
@end example

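
The same test can be tried without touching the real password file.
In this sketch the two input lines, including the user name
@samp{guest}, are invented sample data fed in with @code{printf}:

@example
$ printf 'arnold:xyzzy:2076\nguest::2077\n' |
> awk -F: '$2 == "" @{ print $1 @}'
@print{} guest
@end example

@noindent
Only the second line has an empty second field, so only its logon name
is printed.
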
@node Field Splitting Summary, , Command Line Field Separator, Field Separators
@subsection Field Splitting Summary

@cindex @code{awk} language, POSIX version
@cindex POSIX @code{awk}
According to the POSIX standard, @code{awk} is supposed to behave
as if each record is split into fields at the time that it is read.
In particular, this means that you can change the value of @code{FS}
after a record is read, and the value of the fields (i.e.@: how they were split)
should reflect the old value of @code{FS}, not the new one.

@cindex @code{sed} utility
@cindex stream editor
However, many implementations of @code{awk} do not work this way. Instead,
they defer splitting the fields until a field is actually
referenced. The fields will be split
using the @emph{current} value of @code{FS}! (d.c.)
This behavior can be difficult
to diagnose. The following example illustrates the difference
between the two methods.
(The @code{sed}@footnote{The @code{sed} utility is a ``stream editor.''
Its behavior is also defined by the POSIX standard.}
command prints just the first line of @file{/etc/passwd}.)

@example
sed 1q /etc/passwd | awk '@{ FS = ":" ; print $1 @}'
@end example

@noindent
will usually print

@example
root
@end example

@noindent
on an incorrect implementation of @code{awk}, while @code{gawk}
will print something like

@example
root:nSijPlPhZZwgE:0:0:Root:/:
@end example

The following table summarizes how fields are split, based on the
value of @code{FS}. (@samp{==} means ``is equal to.'')

@table @code
@item FS == " "
Fields are separated by runs of whitespace. Leading and trailing
whitespace are ignored. This is the default.

@item FS == @var{any other single character}
Fields are separated by each occurrence of the character. Multiple
successive occurrences delimit empty fields, as do leading and
trailing occurrences.
The character can even be a regexp metacharacter; it does not need
to be escaped.

@item FS == @var{regexp}
Fields are separated by occurrences of characters that match @var{regexp}.
Leading and trailing matches of @var{regexp} delimit empty fields.

@item FS == ""
Each individual character in the record becomes a separate field.
@end table

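
One way to sidestep the question of when a record is split is to set
@code{FS} before any input is read. The following sketch uses an
invented one-line input rather than the real password file:

@example
$ echo 'root:x:0:0' | awk 'BEGIN @{ FS = ":" @} @{ print $1 @}'
@print{} root
@end example

@noindent
Because the assignment happens in a @code{BEGIN} rule, the separator is
already in effect when the first record is read, so the result does not
depend on whether an implementation splits fields eagerly or lazily.
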
@node Constant Size, Multiple Line, Field Separators, Reading Files
@section Reading Fixed-width Data

(This section discusses an advanced, experimental feature. If you are
a novice @code{awk} user, you may wish to skip it on the first reading.)

@code{gawk} version 2.13 introduced a new facility for dealing with
fixed-width fields with no distinctive field separator. Data of this
nature arises, for example, in the input for old FORTRAN programs where
numbers are run together; or in the output of programs that did not
anticipate the use of their output as input for other programs.

An example of the latter is a table where all the columns are lined up by
the use of a variable number of spaces and @emph{empty fields are just
spaces}. Clearly, @code{awk}'s normal field splitting based on @code{FS}
will not work well in this case. Although a portable @code{awk} program
can use a series of @code{substr} calls on @code{$0}
(@pxref{String Functions, ,Built-in Functions for String Manipulation}),
this is awkward and inefficient for a large number of fields.

The splitting of an input record into fixed-width fields is specified by
assigning a string containing space-separated numbers to the built-in
variable @code{FIELDWIDTHS}. Each number specifies the width of the field
@emph{including} columns between fields. If you want to ignore the columns
between fields, you can specify the width as a separate field that is
subsequently ignored.

The following data is the output of the Unix @code{w} utility. It is useful
to illustrate the use of @code{FIELDWIDTHS}.

@example
 10:06pm  up 21 days, 14:04,  23 users
User     tty       login@@  idle   JCPU   PCPU  what
hzuo     ttyV0     8:58pm            9      5  vi p24.tex
hzang    ttyV3     6:37pm    50                -csh
eklye    ttyV5     9:53pm            7      1  em thes.tex
dportein ttyV6     8:17pm  1:47                -csh
gierd    ttyD3    10:00pm     1                elm
dave     ttyD4     9:47pm            4      4  w
brent    ttyp0    26Jun91  4:46  26:46   4:41  bash
dave     ttyq4    26Jun9115days     46     46  wnewmail
@end example

The following program takes the above input, converts the idle time to
number of seconds and prints out the first two fields and the calculated
idle time. (This program uses a number of @code{awk} features that
haven't been introduced yet.)

@example
BEGIN @{ FIELDWIDTHS = "9 6 10 6 7 7 35" @}
@dots{}
sub(/^ */, "", idle) # strip leading spaces
@dots{}
idle = t[1] * 60 + t[2]
@dots{}
idle *= 24 * 60 * 60
@dots{}
@end example

Here is the result of running the program on the data:

Another (possibly more practical) example of fixed-width input data
would be the input from a deck of balloting cards. In some parts of
the United States, voters mark their choices by punching holes in computer
cards. These cards are then processed to count the votes for any particular
candidate or on any particular issue. Since a voter may choose not to
vote on some issue, any column on the card may be empty. An @code{awk}
program for processing such data could use the @code{FIELDWIDTHS} feature
to simplify reading the data. (Of course, getting @code{gawk} to run on
a system with card readers is another story!)

Exercise: Write a ballot card reading program.

Assigning a value to @code{FS} causes @code{gawk} to return to using
@code{FS} for field splitting. Use @samp{FS = FS} to make this happen,
without having to know the current value of @code{FS}.

This feature is still experimental, and may evolve over time.
Note in particular that @code{gawk} does not attempt to verify
the sanity of the values assigned to @code{FIELDWIDTHS}.

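
As a small, self-contained illustration of @code{FIELDWIDTHS}, here is
a sketch that splits a ten-character record into fields of three, two,
and five characters. The input string and the widths are made up, and
the example requires @code{gawk}, since @code{FIELDWIDTHS} is not
available in other versions of @code{awk}:

@example
$ echo 'abcdefghij' | gawk 'BEGIN @{ FIELDWIDTHS = "3 2 5" @}
>                            @{ print $1; print $2; print $3 @}'
@print{} abc
@print{} de
@print{} fghij
@end example

@noindent
No separator characters are involved at all; the field boundaries are
determined purely by position within the record.
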
@node Multiple Line, Getline, Constant Size, Reading Files
@section Multiple-Line Records

@cindex multiple line records
@cindex input, multiple line records
@cindex reading files, multiple line records
@cindex records, multiple line
In some databases, a single line cannot conveniently hold all the
information in one entry. In such cases, you can use multi-line
records.

The first step in doing this is to choose your data format: when records
are not defined as single lines, how do you want to define them?
What should separate records?

One technique is to use an unusual character or string to separate
records. For example, you could use the formfeed character (written
@samp{\f} in @code{awk}, as in C) to separate them, making each record
a page of the file. To do this, just set the variable @code{RS} to
@code{"\f"} (a string containing the formfeed character). Any
other character could equally well be used, as long as it won't be part
of the data in a record.

Another technique is to have blank lines separate records. By a special
dispensation, an empty string as the value of @code{RS} indicates that
records are separated by one or more blank lines. If you set @code{RS}
to the empty string, a record always ends at the first blank line
encountered. The next record doesn't start until the first non-blank
line that follows---no matter how many blank lines appear in a row, they
are considered one record-separator.

@cindex leftmost longest match
@cindex matching, leftmost longest
You can achieve the same effect as @samp{RS = ""} by assigning the
string @code{"\n\n+"} to @code{RS}. This regexp matches the newline
at the end of the record, and one or more blank lines after the record.
In addition, a regular expression always matches the longest possible
sequence when there is a choice
(@pxref{Leftmost Longest, ,How Much Text Matches?}).
So the next record doesn't start until
the first non-blank line that follows---no matter how many blank lines
appear in a row, they are considered one record-separator.

There is an important difference between @samp{RS = ""} and
@samp{RS = "\n\n+"}. In the first case, leading newlines in the input
data file are ignored, and if a file ends without extra blank lines
after the last record, the final newline is removed from the record.
In the second case, this special processing is not done (d.c.).

Now that the input is separated into records, the second step is to
separate the fields in the record. One way to do this is to divide each
of the lines into fields in the normal manner. This happens by default
as the result of a special feature: when @code{RS} is set to the empty
string, the newline character @emph{always} acts as a field separator.
This is in addition to whatever field separations result from @code{FS}.

The original motivation for this special exception was probably to provide
useful behavior in the default case (i.e.@: @code{FS} is equal
to @w{@code{" "}}). This feature can be a problem if you really don't
want the newline character to separate fields, since there is no way to
prevent it. However, you can work around this by using the @code{split}
function to break up the record manually
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).

Another way to separate fields is to
put each field on a separate line: to do this, just set the
variable @code{FS} to the string @code{"\n"}. (This simple regular
expression matches a single newline.)

A practical example of a data file organized this way might be a mailing
list, where each entry is separated by blank lines. If we have a mailing
list in a file named @file{addresses}, that looks like this:

@example
Jane Doe
123 Main Street
Anywhere, SE 12345-6789

John Smith
456 Tree-lined Avenue
Smallville, MW 98765-4321
@dots{}
@end example

@noindent
A simple program to process this file would look like this:

@example
# addrs.awk --- simple mailing list program

# Records are separated by blank lines.
# Each line is one field.
BEGIN @{ RS = "" ; FS = "\n" @}

@{
    print "Name is:", $1
    print "Address is:", $2
    print "City and State are:", $3
    print ""
@}
@end example

Running the program produces the following output:

@example
$ awk -f addrs.awk addresses
@print{} Name is: Jane Doe
@print{} Address is: 123 Main Street
@print{} City and State are: Anywhere, SE 12345-6789
@print{}
@print{} Name is: John Smith
@print{} Address is: 456 Tree-lined Avenue
@print{} City and State are: Smallville, MW 98765-4321
@print{}
@dots{}
@end example

@xref{Labels Program, ,Printing Mailing Labels}, for a more realistic
program that deals with address lists.

The following table summarizes how records are split, based on the
value of @code{RS}. (@samp{==} means ``is equal to.'')

@table @code
@item RS == "\n"
Records are separated by the newline character (@samp{\n}). In effect,
every line in the data file is a separate record, including blank lines.
This is the default.

@item RS == @var{any single character}
Records are separated by each occurrence of the character. Multiple
successive occurrences delimit empty records.

@item RS == ""
Records are separated by runs of blank lines. The newline character
always serves as a field separator, in addition to whatever value
@code{FS} may have. Leading and trailing newlines in a file are ignored.

@item RS == @var{regexp}
Records are separated by occurrences of characters that match @var{regexp}.
Leading and trailing matches of @var{regexp} delimit empty records.
@end table

In all cases, @code{gawk} sets @code{RT} to the input text that matched the
value specified by @code{RS}.

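
The behavior of @samp{RS = ""} can be observed with a few lines of
made-up input. In this sketch, @code{printf} supplies two leading blank
lines, a two-line record, three blank lines, and another two-line
record:

@example
$ printf '\n\nJane\nDoe\n\n\n\nJohn\nSmith\n' |
> awk 'BEGIN @{ RS = "" @} @{ print NR, NF, $1, $2 @}'
@print{} 1 2 Jane Doe
@print{} 2 2 John Smith
@end example

@noindent
The leading blank lines are ignored, the run of blank lines counts as a
single record separator, and the newline inside each record acts as a
field separator even though @code{FS} still has its default value.
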
@node Getline, , Multiple Line, Reading Files
@section Explicit Input with @code{getline}

@cindex input, explicit
@cindex explicit input
@cindex input, @code{getline} command
@cindex reading files, @code{getline} command
So far we have been getting our input data from @code{awk}'s main
input stream---either the standard input (usually your terminal, sometimes
the output from another program) or from the
files specified on the command line. The @code{awk} language has a
special built-in command called @code{getline} that
can be used to read input under your explicit control.

@menu
* Getline Intro::               Introduction to the @code{getline} function.
* Plain Getline::               Using @code{getline} with no arguments.
* Getline/Variable::            Using @code{getline} into a variable.
* Getline/File::                Using @code{getline} from a file.
* Getline/Variable/File::       Using @code{getline} into a variable from a
                                file.
* Getline/Pipe::                Using @code{getline} from a pipe.
* Getline/Variable/Pipe::       Using @code{getline} into a variable from a
                                pipe.
* Getline Summary::             Summary Of @code{getline} Variants.
@end menu

@node Getline Intro, Plain Getline, Getline, Getline
@subsection Introduction to @code{getline}

This command is used in several different ways, and should @emph{not} be
used by beginners. It is covered here because this is the chapter on input.
The examples that follow the explanation of the @code{getline} command
include material that has not been covered yet. Therefore, come back
and study the @code{getline} command @emph{after} you have reviewed the
rest of this @value{DOCUMENT} and have a good knowledge of how @code{awk} works.

@cindex differences between @code{gawk} and @code{awk}
@cindex @code{getline}, return values
@code{getline} returns one if it finds a record, and zero if the end of the
file is encountered. If there is some error in getting a record, such
as a file that cannot be opened, then @code{getline} returns @minus{}1.
In this case, @code{gawk} sets the variable @code{ERRNO} to a string
describing the error that occurred.

In the following examples, @var{command} stands for a string value that
represents a shell command.

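
The error return value is easy to observe. In this sketch, the file
name @file{/no/such/file} is a made-up path that is assumed not to
exist, and @code{line} is an arbitrary variable name:

@example
$ awk 'BEGIN @{ print (getline line < "/no/such/file") @}'
@print{} -1
@end example

@noindent
Since a failure yields @minus{}1 rather than zero, a loop written as
@samp{while (getline line < file)} would never terminate if the file
could not be opened. Comparing the result explicitly, as in
@samp{while ((getline line < file) > 0)}, avoids that trap.
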
@node Plain Getline, Getline/Variable, Getline Intro, Getline
@subsection Using @code{getline} with No Arguments

The @code{getline} command can be used without arguments to read input
from the current input file. All it does in this case is read the next
input record and split it up into fields. This is useful if you've
finished processing the current record, but you want to do some special
processing @emph{right now} on the next record. Here's an
example:

@example
awk '@{
    if ((t = index($0, "/*")) != 0) @{
        # value will be "" if t is 1
        tmp = substr($0, 1, t - 1)
        u = index(substr($0, t + 2), "*/")
        while (u == 0) @{
            if (getline <= 0) @{
                m = "unexpected EOF or error"
                print m > "/dev/stderr"
                exit
            @}
            t = -1
            u = index($0, "*/")
        @}
        # substr expression will be "" if */
        # occurred at end of line
        $0 = tmp substr($0, t + u + 3)
    @}
    print $0
@}'
@end example

This @code{awk} program deletes all C-style comments, @samp{/* @dots{}
*/}, from the input. By replacing the @samp{print $0} with other
statements, you could perform more complicated processing on the
decommented input, like searching for matches of a regular
expression. This program has a subtle problem---it does not work if one
comment ends and another begins on the same line.

Exercise: Write a program that does handle multiple comments on the line.

This form of the @code{getline} command sets @code{NF} (the number of
fields; @pxref{Fields, ,Examining Fields}), @code{NR} (the number of
records read so far; @pxref{Records, ,How Input is Split into Records}),
@code{FNR} (the number of records read from this input file), and the
value of @code{$0}.

@strong{Note:} the new value of @code{$0} is used in testing
the patterns of any subsequent rules. The original value
of @code{$0} that triggered the rule which executed @code{getline}
is lost.
By contrast, the @code{next} statement reads a new record
but immediately begins processing it normally, starting with the first
rule in the program. @xref{Next Statement, ,The @code{next} Statement}.

@node Getline/Variable, Getline/File, Plain Getline, Getline
@subsection Using @code{getline} Into a Variable

You can use @samp{getline @var{var}} to read the next record from
@code{awk}'s input into the variable @var{var}. No other processing is
done.

For example, suppose the next line is a comment, or a special string,
and you want to read it, without triggering
any rules. This form of @code{getline} allows you to read that line
and store it in a variable so that the main
read-a-line-and-check-each-rule loop of @code{awk} never sees it.

The following example swaps every two lines of input. For example, given:

@example
if ((getline tmp) > 0) @{
@end example

The @code{getline} command used in this way sets only the variables
@code{NR} and @code{FNR} (and of course, @var{var}). The record is not
split into fields, so the values of the fields (including @code{$0}) and
the value of @code{NF} do not change.

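
A complete line-swapping program can be sketched as follows. This is
one possible way to write it, not necessarily the only one; the three
input lines are invented sample data, and @code{tmp} holds the line
read by @code{getline}:

@example
$ printf 'wan\ntew\nfree\n' | awk '@{ if ((getline tmp) > 0) @{
>                                        print tmp; print $0
>                                    @} else
>                                        print $0 @}'
@print{} tew
@print{} wan
@print{} free
@end example

@noindent
For each record read by the main loop, @code{getline} fetches the
following line into @code{tmp} without disturbing @code{$0}, so the two
can be printed in reverse order. When the input has an odd number of
lines, @code{getline} returns zero for the last one and it is printed
unchanged.
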
@node Getline/File, Getline/Variable/File, Getline/Variable, Getline
@subsection Using @code{getline} from a File

@cindex input redirection
@cindex redirection of input
Use @samp{getline < @var{file}} to read
the next record from the file
@var{file}. Here @var{file} is a string-valued expression that
specifies the file name. @samp{< @var{file}} is called a @dfn{redirection}
since it directs input to come from a different place.

For example, the following
fragment reads its input record from the file @file{secondary.input} when it
encounters a first field with a value equal to 10 in the current input
file.

@example
if ($1 == 10)
    getline < "secondary.input"
@end example

Since the main input stream is not used, the values of @code{NR} and
@code{FNR} are not changed. But the record read is split into fields in
the normal manner, so the values of @code{$0} and other fields are
changed. So is the value of @code{NF}.

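
That @code{NR} is left alone can be confirmed with a short sketch. It
reads @file{/etc/passwd} (assumed here to exist, as on most Unix
systems) entirely through a redirected @code{getline} from within a
@code{BEGIN} rule; the counter @code{n} is just an illustrative name:

@example
$ awk 'BEGIN @{ while ((getline < "/etc/passwd") > 0) n++
>              print NR @}'
@print{} 0
@end example

@noindent
Although every line of the file has been read and split into fields,
no record has come from the main input stream, so @code{NR} is still
zero.
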
@node Getline/Variable/File, Getline/Pipe, Getline/File, Getline
@subsection Using @code{getline} Into a Variable from a File

Use @samp{getline @var{var} < @var{file}} to read input
from the file
@var{file} and put it in the variable @var{var}. As above, @var{file}
is a string-valued expression that specifies the file from which to read.

In this version of @code{getline}, none of the built-in variables are
changed, and the record is not split into fields. The only variable
changed is @var{var}.

For example, the following program copies all the input files to the
output, except for records that say @w{@samp{@@include @var{filename}}}.
Such a record is replaced by the contents of the file
@var{filename}.

@example
if (NF == 2 && $1 == "@@include") @{
    while ((getline line < $2) > 0) print line
    close($2)
@} else print
@end example

Note here how the name of the extra input file is not built into
the program; it is taken directly from the data, from the second field on
the @samp{@@include} line.

The @code{close} function is called to ensure that if two identical
@samp{@@include} lines appear in the input, the entire specified file is
included twice.
@xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes}.

One deficiency of this program is that it does not process nested
@samp{@@include} statements
(@samp{@@include} statements in included files)
the way a true macro preprocessor would.
@xref{Igawk Program, ,An Easy Way to Use Library Functions}, for a program
that does handle nested @samp{@@include} statements.

@node Getline/Pipe, Getline/Variable/Pipe, Getline/Variable/File, Getline
4308
@subsection Using @code{getline} from a Pipe
4310
@cindex input pipeline
4311
@cindex pipeline, input
4312
You can pipe the output of a command into @code{getline}, using
4313
@samp{@var{command} | getline}. In
4314
this case, the string @var{command} is run as a shell command and its output
4315
is piped into @code{awk} to be used as input. This form of @code{getline}
4316
reads one record at a time from the pipe.
4318
For example, the following program copies its input to its output, except for
4319
lines that begin with @samp{@@execute}, which are replaced by the output
4320
produced by running the rest of the line as a shell command:
4325
if ($1 == "@@execute") @{
4326
tmp = substr($0, 10)
4327
while ((tmp | getline) > 0)
4337
The @code{close} function is called to ensure that if two identical
4338
@samp{@@execute} lines appear in the input, the command is run for
4340
@xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes}.
4342
@c This example is unrealistic, since you could just use system
4357
the program might produce:
4364
arnold ttyv0 Jul 13 14:22
4365
miriam ttyp0 Jul 13 14:23 (murphy:0)
4366
bill ttyp1 Jul 13 14:23 (murphy:0)
4372
Notice that this program ran the command @code{who} and printed the result.
4373
(If you try this program yourself, you will of course get different results,
4374
showing you who is logged in on your system.)
4376
This variation of @code{getline} splits the record into fields, sets the
4377
value of @code{NF} and recomputes the value of @code{$0}. The values of
4378
@code{NR} and @code{FNR} are not changed.
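
A short sketch of the same point, using @code{echo} as a stand-in
command (the command string and the name @code{cmd} are illustrative
only). It runs inside a @code{BEGIN} rule, so @code{NR} is still zero
after the read:

```shell
out=$(awk 'BEGIN {
    cmd = "echo alpha beta gamma"
    # The output of cmd becomes $0 and is split into fields.
    if ((cmd | getline) > 0)
        print NF, $2, NR
    close(cmd)
}')
printf '%s\n' "$out"
```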
4380
@node Getline/Variable/Pipe, Getline Summary, Getline/Pipe, Getline
4381
@subsection Using @code{getline} Into a Variable from a Pipe
4383
When you use @samp{@var{command} | getline @var{var}}, the
4384
output of the command @var{command} is sent through a pipe to
4385
@code{getline} and into the variable @var{var}. For example, the
4386
following program reads the current date and time into the variable
4387
@code{current_time}, using the @code{date} utility, and then
4393
"date" | getline current_time
4395
print "Report printed on " current_time
4400
In this version of @code{getline}, none of the built-in variables are
4401
changed, and the record is not split into fields.
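
To see that the current record survives such a read, one can try
something like the following sketch. The command string, the sample
input, and the variable name @code{reply} are assumptions made for the
example:

```shell
out=$(printf 'one two\n' | awk '{
    cmd = "echo from-the-pipe"
    # Only reply is set; $0 and NF still describe the input record.
    if ((cmd | getline reply) > 0)
        print $0, NF, reply
    close(cmd)
}')
printf '%s\n' "$out"
```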
4403
@node Getline Summary, , Getline/Variable/Pipe, Getline
4404
@subsection Summary of @code{getline} Variants
4406
With all the forms of @code{getline}, even though @code{$0} and @code{NF}
may be updated, the record is not tested against all the patterns
in the @code{awk} program, in the way that would happen if the record
were read normally by the main processing loop of @code{awk}. However,
the new record is tested against any subsequent rules.
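
The following sketch illustrates the difference, assuming a two-line
input invented for the purpose. The record fetched by @code{getline} is
seen by the rule that follows the @code{getline} call, but not by the
rule that precedes it:

```shell
out=$(printf 'first\nsecond\n' | awk '
/second/ { print "rule 1 saw: " $0 }   # earlier rule: never sees "second"
/first/  { getline }                   # replaces $0 with "second"
/second/ { print "rule 3 saw: " $0 }   # later rule: does see it
')
printf '%s\n' "$out"
```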
4412
@cindex differences between @code{gawk} and @code{awk}
4414
@cindex implementation limits
4415
Many @code{awk} implementations limit the number of pipelines an @code{awk}
4416
program may have open to just one! In @code{gawk}, there is no such limit.
4417
You can open as many pipelines as the underlying operating system will
4420
The following table summarizes the six variants of @code{getline},
4421
listing which built-in variables are set by each one.
4429
sets @code{$0}, @code{NF}, @code{FNR}, and @code{NR}.
4431
@item getline @var{var}
4432
sets @var{var}, @code{FNR}, and @code{NR}.
4434
@item getline < @var{file}
4435
sets @code{$0} and @code{NF}.
4437
@item getline @var{var} < @var{file}
4440
@item @var{command} | getline
4441
sets @code{$0} and @code{NF}.
4443
@item @var{command} | getline @var{var}
4448
@node Printing, Expressions, Reading Files, Top
4449
@chapter Printing Output
4453
One of the most common actions is to @dfn{print}, or output,
4454
some or all of the input. You use the @code{print} statement
4455
for simple output. You use the @code{printf} statement
4456
for fancier formatting. Both are described in this chapter.
4459
* Print:: The @code{print} statement.
4460
* Print Examples:: Simple examples of @code{print} statements.
4461
* Output Separators:: The output separators and how to change them.
4462
* OFMT:: Controlling Numeric Output With @code{print}.
4463
* Printf:: The @code{printf} statement.
4464
* Redirection:: How to redirect output to multiple files and
4466
* Special Files:: File name interpretation in @code{gawk}.
4467
@code{gawk} allows access to inherited file
4469
* Close Files And Pipes:: Closing Input and Output Files and Pipes.
4472
@node Print, Print Examples, Printing, Printing
4473
@section The @code{print} Statement
4474
@cindex @code{print} statement
4476
The @code{print} statement does output with simple, standardized
4477
formatting. You specify only the strings or numbers to be printed, in a
4478
list separated by commas. They are output, separated by single spaces,
4479
followed by a newline. The statement looks like this:
4482
print @var{item1}, @var{item2}, @dots{}
4486
The entire list of items may optionally be enclosed in parentheses. The
4487
parentheses are necessary if any of the item expressions uses the @samp{>}
4488
relational operator; otherwise it could be confused with a redirection
4489
(@pxref{Redirection, ,Redirecting Output of @code{print} and @code{printf}}).
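
As a sketch of why the parentheses matter, compare the two statements
below. The temporary directory is an assumption of the example; it is
there only so that the stray output file, which ends up being named
@file{1}, is created somewhere harmless:

```shell
tmpdir=$(mktemp -d)

# With parentheses, the > is a comparison; its result is printed.
compared=$(awk 'BEGIN { print (2 > 1) }')

# Without them, the > is a redirection: 2 is written to a file named "1".
( cd "$tmpdir" && awk 'BEGIN { print 2 > 1 }' )
redirected=$(cat "$tmpdir/1")

printf 'compared=%s redirected=%s\n' "$compared" "$redirected"
rm -rf "$tmpdir"
```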
4491
The items to be printed can be constant strings or numbers, fields of the
4492
current record (such as @code{$1}), variables, or any @code{awk}
4494
Numeric values are converted to strings, and then printed.
4496
The @code{print} statement is completely general for
4497
computing @emph{what} values to print. However, with two exceptions,
4498
you cannot specify @emph{how} to print them---how many
4499
columns, whether to use exponential notation or not, and so on.
4500
(For the exceptions, @pxref{Output Separators}, and
4501
@ref{OFMT, ,Controlling Numeric Output with @code{print}}.)
4502
For that, you need the @code{printf} statement
4503
(@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}).
4505
The simple statement @samp{print} with no items is equivalent to
4506
@samp{print $0}: it prints the entire current record. To print a blank
4507
line, use @samp{print ""}, where @code{""} is the empty string.
4509
To print a fixed piece of text, use a string constant such as
4510
@w{@code{"Don't Panic"}} as one item. If you forget to use the
4511
double-quote characters, your text will be taken as an @code{awk}
4512
expression, and you will probably get an error. Keep in mind that a
4513
space is printed between any two items.
4515
Each @code{print} statement makes at least one line of output. But it
4516
isn't limited to one line. If an item value is a string that contains a
4517
newline, the newline is output along with the rest of the string. A
4518
single @code{print} can make any number of lines this way.
4520
@node Print Examples, Output Separators, Print, Printing
4521
@section Examples of @code{print} Statements
4523
Here is an example of printing a string that contains embedded newlines
4524
(the @samp{\n} is an escape sequence, used to represent the newline
4525
character; see @ref{Escape Sequences}):
4529
$ awk 'BEGIN @{ print "line one\nline two\nline three" @}'
4536
Here is an example that prints the first two fields of each input record,
4537
with a space between them:
4541
$ awk '@{ print $1, $2 @}' inventory-shipped
4549
@cindex common mistakes
4550
@cindex mistakes, common
4551
@cindex errors, common
4552
A common mistake in using the @code{print} statement is to omit the comma
4553
between two items. This often has the effect of making the items run
4554
together in the output, with no space. The reason for this is that
4555
juxtaposing two string expressions in @code{awk} means to concatenate
4556
them. Here is the same program, without the comma:
4560
$ awk '@{ print $1 $2 @}' inventory-shipped
4568
To someone unfamiliar with the file @file{inventory-shipped}, neither
4569
example's output makes much sense. A heading line at the beginning
4570
would make it clearer. Let's add some headings to our table of months
4571
(@code{$1}) and green crates shipped (@code{$2}). We do this using the
4572
@code{BEGIN} pattern
4573
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns})
4574
to force the headings to be printed only once:
4577
awk 'BEGIN @{ print "Month Crates"
4578
print "----- ------" @}
4579
@{ print $1, $2 @}' inventory-shipped
4583
Did you already guess what happens? When run, the program prints
4598
The headings and the table data don't line up! We can fix this by printing
4599
some spaces between the two fields:
4602
awk 'BEGIN @{ print "Month Crates"
4603
print "----- ------" @}
4604
@{ print $1, " ", $2 @}' inventory-shipped
4607
You can imagine that this way of lining up columns can get pretty
4608
complicated when you have many columns to fix. Counting spaces for two
or three columns can be simple, but with more than that you can get
lost quite easily. This is why the @code{printf} statement was
4611
created (@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing});
4612
one of its specialties is lining up columns of data.
4614
@cindex line continuation
4616
you can continue either a @code{print} or @code{printf} statement simply
4617
by putting a newline after any comma
4618
(@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}).
4620
@node Output Separators, OFMT, Print Examples, Printing
4621
@section Output Separators
4623
@cindex output field separator, @code{OFS}
4624
@cindex output record separator, @code{ORS}
4627
As mentioned previously, a @code{print} statement contains a list
4628
of items, separated by commas. In the output, the items are normally
4629
separated by single spaces. This need not be the case; a
4630
single space is only the default. You can specify any string of
4631
characters to use as the @dfn{output field separator} by setting the
4632
built-in variable @code{OFS}. The initial value of this variable
4633
is the string @w{@code{" "}}, that is, a single space.
4635
The output from an entire @code{print} statement is called an
4636
@dfn{output record}. Each @code{print} statement outputs one output
4637
record and then outputs a string called the @dfn{output record separator}.
4638
The built-in variable @code{ORS} specifies this string. The initial
4639
value of @code{ORS} is the string @code{"\n"}, i.e.@: a newline
4640
character; thus, normally each @code{print} statement makes a separate line.
4642
You can change how output fields and records are separated by assigning
4643
new values to the variables @code{OFS} and/or @code{ORS}. The usual
4644
place to do this is in the @code{BEGIN} rule
4645
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}), so
4646
that it happens before any input is processed. You may also do this
4647
with assignments on the command line, before the names of your input
4648
files, or using the @samp{-v} command line option
4649
(@pxref{Options, ,Command Line Options}).
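
The two command-line alternatives might look like this. It is a sketch
only; the sample data and the temporary file name @file{data} are
invented for the example:

```shell
tmpdir=$(mktemp -d)
printf 'a b\n' > "$tmpdir/data"

# Set OFS with the -v option, before the program starts.
out1=$(awk -v OFS=':' '{ print $1, $2 }' "$tmpdir/data")

# Set OFS with an assignment placed before the input file name.
out2=$(awk '{ print $1, $2 }' OFS='-' "$tmpdir/data")

printf '%s\n%s\n' "$out1" "$out2"
rm -rf "$tmpdir"
```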
4655
awk 'BEGIN @{ print "Month Crates"
4656
print "----- ------" @}
4657
@{ print $1, " ", $2 @}' inventory-shipped
4659
program by using a new value of @code{OFS}.
4662
The following example prints the first and second fields of each input
4663
record separated by a semicolon, with a blank line added after each
4668
$ awk 'BEGIN @{ OFS = ";"; ORS = "\n\n" @}
4669
> @{ print $1, $2 @}' BBS-list
4670
@print{} aardvark;555-5553
4672
@print{} alpo-net;555-3412
4674
@print{} barfly;555-7685
4679
If the value of @code{ORS} does not contain a newline, all your output
4680
will be run together on a single line, unless you output newlines some
4683
@node OFMT, Printf, Output Separators, Printing
4684
@section Controlling Numeric Output with @code{print}
4686
@cindex numeric output format
4687
@cindex format, numeric output
4688
@cindex output format specifier, @code{OFMT}
4689
When you use the @code{print} statement to print numeric values,
4690
@code{awk} internally converts the number to a string of characters,
4691
and prints that string. @code{awk} uses the @code{sprintf} function
4692
to do this conversion
4693
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
4694
For now, it suffices to say that the @code{sprintf}
4695
function accepts a @dfn{format specification} that tells it how to format
4696
numbers (or strings), and that there are a number of different ways in which
4697
numbers can be formatted. The different format specifications are discussed
4699
@ref{Control Letters, , Format-Control Letters}.
4701
The built-in variable @code{OFMT} contains the default format specification
4702
that @code{print} uses with @code{sprintf} when it wants to convert a
4703
number to a string for printing.
4704
The default value of @code{OFMT} is @code{"%.6g"}.
4705
By supplying different format specifications
4706
as the value of @code{OFMT}, you can change how @code{print} will print
4707
your numbers. As a brief example:
4712
> OFMT = "%.0f" # print numbers as integers (rounds)
4720
@cindex @code{awk} language, POSIX version
4721
@cindex POSIX @code{awk}
4722
According to the POSIX standard, @code{awk}'s behavior is undefined
4723
if @code{OFMT} contains anything but a floating point conversion specification
4726
@node Printf, Redirection, OFMT, Printing
4727
@section Using @code{printf} Statements for Fancier Printing
4728
@cindex formatted output
4729
@cindex output, formatted
4731
If you want more precise control over the output format than
4732
@code{print} gives you, use @code{printf}. With @code{printf} you can
4733
specify the width to use for each item, and you can specify various
4734
formatting choices for numbers (such as what radix to use, whether to
4735
print an exponent, whether to print a sign, and how many digits to print
4736
after the decimal point). You do this by supplying a string, called
4737
the @dfn{format string}, which controls how and where to print the other
4741
* Basic Printf:: Syntax of the @code{printf} statement.
4742
* Control Letters:: Format-control letters.
4743
* Format Modifiers:: Format-specification modifiers.
4744
* Printf Examples:: Several examples.
4747
@node Basic Printf, Control Letters, Printf, Printf
4748
@subsection Introduction to the @code{printf} Statement
4750
@cindex @code{printf} statement, syntax of
4751
The @code{printf} statement looks like this:
4754
printf @var{format}, @var{item1}, @var{item2}, @dots{}
4758
The entire list of arguments may optionally be enclosed in parentheses. The
4759
parentheses are necessary if any of the item expressions use the @samp{>}
4760
relational operator; otherwise it could be confused with a redirection
4761
(@pxref{Redirection, ,Redirecting Output of @code{print} and @code{printf}}).
4763
@cindex format string
4764
The difference between @code{printf} and @code{print} is the @var{format}
4765
argument. This is an expression whose value is taken as a string; it
4766
specifies how to output each of the other arguments. It is called
4767
the @dfn{format string}.
4769
The format string is very similar to that in the ANSI C library function
4770
@code{printf}. Most of @var{format} is text to be output verbatim.
4771
Scattered among this text are @dfn{format specifiers}, one per item.
4772
Each format specifier says to output the next item in the argument list
4773
at that place in the format.
4775
The @code{printf} statement does not automatically append a newline to its
4776
output. It outputs only what the format string specifies. So if you want
4777
a newline, you must include one in the format string. The output separator
4778
variables @code{OFS} and @code{ORS} have no effect on @code{printf}
4779
statements. For example:
4784
ORS = "\nOUCH!\n"; OFS = "!"
4785
msg = "Don't Panic!"; printf "%s\n", msg
4790
This program still prints the familiar @samp{Don't Panic!} message.
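
For contrast, here is a sketch that runs @code{print} and @code{printf}
side by side under unusual separator settings. The separator strings
are arbitrary values picked for the demonstration:

```shell
out=$(awk 'BEGIN {
    OFS = "!"; ORS = "<END>\n"
    print "a", "b"               # uses OFS between items and ORS at the end
    printf "%s %s\n", "a", "b"   # uses only what the format string says
}')
printf '%s\n' "$out"
```

Only the first line of output should carry the changed separators.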
4792
@node Control Letters, Format Modifiers, Basic Printf, Printf
4793
@subsection Format-Control Letters
4794
@cindex @code{printf}, format-control characters
4795
@cindex format specifier
4797
A format specifier starts with the character @samp{%} and ends with a
4798
@dfn{format-control letter}; it tells the @code{printf} statement how
4799
to output one item. (If you actually want to output a @samp{%}, write
4800
@samp{%%}.) The format-control letter specifies what kind of value to
4801
print. The rest of the format specifier is made up of optional
4802
@dfn{modifiers} which are parameters to use, such as the field width.
4804
Here is a list of the format-control letters:
4808
This prints a number as an ASCII character. Thus, @samp{printf "%c",
4809
65} outputs the letter @samp{A}. The output for a string value is
4810
the first character of the string.
4817
These are equivalent. They both print a decimal integer.
4818
The @samp{%i} specification is for compatibility with ANSI C.
4822
This prints a number in scientific (exponential) notation.
4826
printf "%4.3e\n", 1950
4830
prints @samp{1.950e+03}, with a total of four significant figures, of
which three follow the decimal point. The @samp{4.3} are modifiers,
4832
discussed below. @samp{%E} uses @samp{E} instead of @samp{e} in the output.
4835
This prints a number in floating point notation.
4839
printf "%4.3f", 1950
4843
prints @samp{1950.000}, with three digits following the decimal point;
the width of four is only a minimum, and this value needs more. The @samp{4.3} are modifiers,
4849
This prints a number in either scientific notation or floating point
4850
notation, whichever uses fewer characters. If the result is printed in
4851
scientific notation, @samp{%G} uses @samp{E} instead of @samp{e}.
4854
This prints an unsigned octal integer.
4855
(In octal, or base-eight notation, the digits run from @samp{0} to @samp{7};
4856
the decimal number eight is represented as @samp{10} in octal.)
4859
This prints a string.
4863
This prints an unsigned hexadecimal integer.
4864
(In hexadecimal, or base-16 notation, the digits are @samp{0} through @samp{9}
4865
and @samp{a} through @samp{f}. The hexadecimal digit @samp{f} represents
4866
the decimal number 15.) @samp{%X} uses the letters @samp{A} through @samp{F}
4867
instead of @samp{a} through @samp{f}.
4870
This isn't really a format-control letter, but it does have a meaning
4871
when used after a @samp{%}: the sequence @samp{%%} outputs one
4872
@samp{%}. It does not consume an argument, and it ignores any
4877
When using the integer format-control letters for values that are outside
4878
the range of a C @code{long} integer, @code{gawk} will switch to the
4879
@samp{%g} format specifier. Other versions of @code{awk} may print
4880
invalid values, or do something else entirely (d.c.).
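
A few of the control letters can be tried together, as in the sketch
below. The values are arbitrary, and the exponent is assumed to be
printed with two digits, as it is on typical POSIX systems:

```shell
out=$(awk 'BEGIN {
    printf "%d %o %x %X\n", 255, 255, 255, 255   # decimal, octal, hexadecimal
    printf "%c%c\n", 65, "hello"                 # character code, then first character
    printf "%.3e %.2f\n", 1950, 3.14159          # exponential and floating point
}')
printf '%s\n' "$out"
```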
4882
@node Format Modifiers, Printf Examples, Control Letters, Printf
4883
@subsection Modifiers for @code{printf} Formats
4885
@cindex @code{printf}, modifiers
4886
@cindex modifiers (in format specifiers)
4887
A format specification can also include @dfn{modifiers} that can control
4888
how much of the item's value is printed and how much space it gets. The
4889
modifiers come between the @samp{%} and the format-control letter.
4890
In the examples below, we use the bullet symbol ``@bullet{}'' to represent
4891
spaces in the output. Here are the possible modifiers, in the order in
4892
which they may appear:
4896
The minus sign, used before the width modifier (see below),
4897
says to left-justify
4898
the argument within its specified width. Normally the argument
4899
is printed right-justified in the specified width. Thus,
4902
printf "%-4s", "foo"
4906
prints @samp{foo@bullet{}}.
4909
For numeric conversions, prefix positive values with a space, and
4910
negative values with a minus sign.
4913
The plus sign, used before the width modifier (see below),
4914
says to always supply a sign for numeric conversions, even if the data
4915
to be formatted is positive. The @samp{+} overrides the space modifier.
4918
Use an ``alternate form'' for certain control letters.
4919
For @samp{%o}, supply a leading zero.
4920
For @samp{%x} and @samp{%X}, supply a leading @samp{0x} or @samp{0X} for
4922
For @samp{%e}, @samp{%E}, and @samp{%f}, the result will always contain a
4924
For @samp{%g} and @samp{%G}, trailing zeros are not removed from the result.
4928
A leading @samp{0} (zero) acts as a flag that indicates output should be
4929
padded with zeros instead of spaces.
4930
This applies even to non-numeric output formats (d.c.).
4931
This flag only has an effect when the field width is wider than the
4932
value to be printed.
4935
This is a number specifying the desired minimum width of a field. Inserting any
4936
number between the @samp{%} sign and the format control character forces the
4937
field to be expanded to this width. The default way to do this is to
4938
pad with spaces on the left. For example,
4945
prints @samp{@bullet{}foo}.
4947
The value of @var{width} is a minimum width, not a maximum. If the item
4948
value requires more than @var{width} characters, it can be as wide as
4952
printf "%4s", "foobar"
4956
prints @samp{foobar}.
4958
Preceding the @var{width} with a minus sign causes the output to be
4959
padded with spaces on the right, instead of on the left.
4962
This is a number that specifies the precision to use when printing.
4963
For the @samp{e}, @samp{E}, and @samp{f} formats, this specifies the
4964
number of digits you want printed to the right of the decimal point.
4965
For the @samp{g} and @samp{G} formats, it specifies the maximum number
4966
of significant digits. For the @samp{d}, @samp{o}, @samp{i}, @samp{u},
4967
@samp{x}, and @samp{X} formats, it specifies the minimum number of
4968
digits to print. For a string, it specifies the maximum number of
4969
characters from the string that should be printed. Thus,
4972
printf "%.4s", "foobar"
4979
The C library @code{printf}'s dynamic @var{width} and @var{prec}
4980
capability (for example, @code{"%*.*s"}) is supported. Instead of
4981
supplying explicit @var{width} and/or @var{prec} values in the format
4982
string, you pass them in the argument list. For example:
4988
printf "%*.*s\n", w, p, s
4992
is exactly equivalent to
5000
Both programs output @samp{@w{@bullet{}@bullet{}abc}}.
5002
Earlier versions of @code{awk} did not support this capability.
5003
If you must use such a version, you may simulate this feature by using
5004
concatenation to build up the format string, like so:
5010
printf "%" w "." p "s\n", s
5014
This is not particularly easy to read, but it does work.
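
The modifiers can be compared side by side in one statement. In this
sketch the brackets and the trailing @samp{|} are added only to make the
padding visible, and the values and the names @code{w} and @code{p} are
arbitrary:

```shell
out=$(awk 'BEGIN {
    # width, left-justify, precision, zero padding, forced sign, space flag
    printf "[%5s][%-5s][%.2s][%05d][%+d][% d]\n", "ab", "ab", "abcdef", 42, 42, 42

    # width and precision supplied by building the format with concatenation
    w = 5; p = 3
    printf "%" w "." p "s|\n", "abcdefg"
}')
printf '%s\n' "$out"
```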
5016
@cindex @code{awk} language, POSIX version
5017
@cindex POSIX @code{awk}
5018
C programmers may be used to supplying additional @samp{l} and @samp{h}
5019
flags in @code{printf} format strings. These are not valid in @code{awk}.
5020
Most @code{awk} implementations silently ignore these flags.
5021
If @samp{--lint} is provided on the command line
5022
(@pxref{Options, ,Command Line Options}),
5023
@code{gawk} will warn about their use. If @samp{--posix} is supplied,
5024
their use is a fatal error.
5026
@node Printf Examples, , Format Modifiers, Printf
5027
@subsection Examples Using @code{printf}
5029
Here is how to use @code{printf} to make an aligned table:
5032
awk '@{ printf "%-10s %s\n", $1, $2 @}' BBS-list
5036
prints the names of bulletin boards (@code{$1}) of the file
5037
@file{BBS-list} as a string of 10 characters, left justified. It also
5038
prints the phone numbers (@code{$2}) afterward on the line. This
5039
produces an aligned two-column table of names and phone numbers:
5043
$ awk '@{ printf "%-10s %s\n", $1, $2 @}' BBS-list
5044
@print{} aardvark 555-5553
5045
@print{} alpo-net 555-3412
5046
@print{} barfly 555-7685
5047
@print{} bites 555-1675
5048
@print{} camelot 555-0542
5049
@print{} core 555-2912
5050
@print{} fooey 555-1234
5051
@print{} foot 555-6699
5052
@print{} macfoo 555-6480
5053
@print{} sdace 555-3430
5054
@print{} sabafoo 555-2127
5058
Did you notice that we did not specify that the phone numbers be printed
5059
as numbers? They had to be printed as strings because the numbers are
5060
separated by a dash.
5061
If we had tried to print the phone numbers as numbers, all we would have
5062
gotten would have been the first three digits, @samp{555}.
5063
This would have been pretty confusing.
5065
We did not specify a width for the phone numbers because they are the
5066
last things on their lines. We don't need to put spaces after them.
5068
We could make our table look even nicer by adding headings to the tops
5069
of the columns. To do this, we use the @code{BEGIN} pattern
5070
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns})
5071
to force the header to be printed only once, at the beginning of
5072
the @code{awk} program:
5076
awk 'BEGIN @{ print "Name Number"
5077
print "---- ------" @}
5078
@{ printf "%-10s %s\n", $1, $2 @}' BBS-list
5082
Did you notice that we mixed @code{print} and @code{printf} statements in
5083
the above example? We could have used just @code{printf} statements to get
5088
awk 'BEGIN @{ printf "%-10s %s\n", "Name", "Number"
5089
printf "%-10s %s\n", "----", "------" @}
5090
@{ printf "%-10s %s\n", $1, $2 @}' BBS-list
5095
By printing each column heading with the same format specification
5096
used for the elements of the column, we have made sure that the headings
5097
are aligned just like the columns.
5099
The fact that the same format specification is used three times can be
5100
emphasized by storing it in a variable, like this:
5104
awk 'BEGIN @{ format = "%-10s %s\n"
5105
printf format, "Name", "Number"
5106
printf format, "----", "------" @}
5107
@{ printf format, $1, $2 @}' BBS-list
5112
See if you can use the @code{printf} statement to line up the headings and
5113
table data for our @file{inventory-shipped} example covered earlier in the
5114
section on the @code{print} statement
5115
(@pxref{Print, ,The @code{print} Statement}).
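
One possible answer is sketched below. It is not the only way to do it.
The three input lines stand in for the start of
@file{inventory-shipped} and are supplied inline so that the command can
be run without the file, and the column widths are an arbitrary choice:

```shell
out=$(printf 'Jan 13 25 15 115\nFeb 15 32 24 226\nMar 15 24 34 228\n' |
awk 'BEGIN { format = "%-5s %6s\n"
             printf format, "Month", "Crates"
             printf format, "-----", "------" }
           { printf format, $1, $2 }')
printf '%s\n' "$out"
```

Because the headings and the data share one format, the crate counts
line up under the right-hand edge of the @samp{Crates} heading.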
5117
@node Redirection, Special Files, Printf, Printing
5118
@section Redirecting Output of @code{print} and @code{printf}
5120
@cindex output redirection
5121
@cindex redirection of output
5122
So far we have been dealing only with output that prints to the standard
5123
output, usually your terminal. Both @code{print} and @code{printf} can
5124
also send their output to other places.
5125
This is called @dfn{redirection}.
5127
A redirection appears after the @code{print} or @code{printf} statement.
5128
Redirections in @code{awk} are written just like redirections in shell
5129
commands, except that they are written inside the @code{awk} program.
5131
There are three forms of output redirection: output to a file,
5132
output appended to a file, and output through a pipe to another
5134
They are all shown for
5135
the @code{print} statement, but they work identically for @code{printf}
5139
@item print @var{items} > @var{output-file}
5140
This type of redirection prints the items into the output file
5141
@var{output-file}. The file name @var{output-file} can be any
5142
expression. Its value is changed to a string and then used as a
5143
file name (@pxref{Expressions}).
5145
When this type of redirection is used, the @var{output-file} is erased
5146
before the first output is written to it. Subsequent writes
5147
to the same @var{output-file} do not
5148
erase @var{output-file}, but append to it. If @var{output-file} does
5149
not exist, then it is created.
5151
For example, here is how an @code{awk} program can write a list of
5152
BBS names to a file @file{name-list} and a list of phone numbers to a
5153
file @file{phone-list}. Each output file contains one name or number
5158
$ awk '@{ print $2 > "phone-list"
5159
> print $1 > "name-list" @}' BBS-list
5175
@item print @var{items} >> @var{output-file}
5176
This type of redirection prints the items into the pre-existing output file
5177
@var{output-file}. The difference between this and the
5178
single-@samp{>} redirection is that the old contents (if any) of
5179
@var{output-file} are not erased. Instead, the @code{awk} output is
5180
appended to the file.
5181
If @var{output-file} does not exist, then it is created.
5183
@cindex pipes for output
5184
@cindex output, piping
5185
@item print @var{items} | @var{command}
5186
It is also possible to send output to another program through a pipe
5188
file. This type of redirection opens a pipe to @var{command} and writes
5189
the values of @var{items} through this pipe, to another process created
5190
to execute @var{command}.
5192
The redirection argument @var{command} is actually an @code{awk}
5193
expression. Its value is converted to a string, whose contents give the
5194
shell command to be run.
5196
For example, this produces two files, one unsorted list of BBS names
5197
and one list sorted in reverse alphabetical order:
5200
awk '@{ print $1 > "names.unsorted"
5201
command = "sort -r > names.sorted"
5202
print $1 | command @}' BBS-list
5205
Here the unsorted list is written with an ordinary redirection while
5206
the sorted list is written by piping through the @code{sort} utility.
5208
This example uses redirection to mail a message to a mailing
5209
list @samp{bug-system}. This might be useful when trouble is encountered
5210
in an @code{awk} script run periodically for system maintenance.
5213
report = "mail bug-system"
5214
print "Awk script failed:", $0 | report
5215
m = ("at record number " FNR " of " FILENAME)
5220
The message is built using string concatenation and saved in the variable
5221
@code{m}. It is then sent down the pipeline to the @code{mail} program.
5223
We call the @code{close} function here because it's a good idea to close
5224
the pipe as soon as all the intended output has been sent to it.
5225
@xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes},
5226
for more information
5227
on this. This example also illustrates the use of a variable to represent
5228
a @var{file} or @var{command}: it is not necessary to always
5229
use a string constant. Using a variable is generally a good idea,
5230
since @code{awk} requires you to spell the string value identically
5234
Redirecting output using @samp{>}, @samp{>>}, or @samp{|} asks the system
5235
to open a file or pipe only if the particular @var{file} or @var{command}
5236
you've specified has not already been written to by your program, or if
5237
it has been closed since it was last written to.
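
The difference between @samp{>} and @samp{>>}, and the fact that a file
is opened only once per run, can be observed with a sketch like the
following. The file names, their initial contents, and the variable
names @code{truncated} and @code{appended} are assumptions of the
example:

```shell
tmpdir=$(mktemp -d)
printf 'stale\n' > "$tmpdir/fresh"
printf 'old\n'   > "$tmpdir/journal"

printf 'a\nb\n' |
awk -v truncated="$tmpdir/fresh" -v appended="$tmpdir/journal" '{
    print $0 > truncated    # emptied once, when first opened; later writes append
    print $0 >> appended    # earlier contents are kept
}'

fresh_contents=$(cat "$tmpdir/fresh")
journal_contents=$(cat "$tmpdir/journal")
printf '%s\n--\n%s\n' "$fresh_contents" "$journal_contents"
rm -rf "$tmpdir"
```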
5239
@cindex differences between @code{gawk} and @code{awk}
5241
@cindex implementation limits
5242
Many @code{awk} implementations limit the number of pipelines an @code{awk}
5243
program may have open to just one! In @code{gawk}, there is no such limit.
5244
You can open as many pipelines as the underlying operating system will
permit.
5247
@node Special Files, Close Files And Pipes , Redirection, Printing
5248
@section Special File Names in @code{gawk}
5249
@cindex standard input
5250
@cindex standard output
5251
@cindex standard error output
5252
@cindex file descriptors
5254
Running programs conventionally have three input and output streams
5255
already available to them for reading and writing. These are known as
5256
the @dfn{standard input}, @dfn{standard output}, and @dfn{standard error
5257
output}. These streams are, by default, connected to your terminal, but
5258
they are often redirected with the shell, via the @samp{<}, @samp{<<},
5259
@samp{>}, @samp{>>}, @samp{>&} and @samp{|} operators. Standard error
5260
is typically used for writing error messages; the reason we have two separate
5261
streams, standard output and standard error, is so that they can be
5262
redirected separately.
5264
@cindex differences between @code{gawk} and @code{awk}
5265
In other implementations of @code{awk}, the only way to write an error
5266
message to standard error in an @code{awk} program is as follows:
5269
print "Serious error detected!" | "cat 1>&2"
5273
This works by opening a pipeline to a shell command which can access the
5274
standard error stream which it inherits from the @code{awk} process.
5275
This is far from elegant, and is also inefficient, since it requires a
5276
separate process. So people writing @code{awk} programs often
5277
neglect to do this. Instead, they send the error messages to the
5278
terminal, like this:
5282
print "Serious error detected!" > "/dev/tty"
5287
This usually has the same effect, but not always: although the
5288
standard error stream is usually the terminal, it can be redirected, and
5289
when that happens, writing to the terminal is not correct. In fact, if
5290
@code{awk} is run from a background job, it may not have a terminal at all.
5291
Then opening @file{/dev/tty} will fail.
5293
@code{gawk} provides special file names for accessing the three standard
5294
streams. When you redirect input or output in @code{gawk}, if the file name
5295
matches one of these special names, then @code{gawk} directly uses the
5296
stream it stands for.
5298
@cindex @file{/dev/stdin}
5299
@cindex @file{/dev/stdout}
5300
@cindex @file{/dev/stderr}
5301
@cindex @file{/dev/fd}
5305
@table @file
@item /dev/stdin
The standard input (file descriptor 0).

@item /dev/stdout
The standard output (file descriptor 1).

@item /dev/stderr
The standard error output (file descriptor 2).

@item /dev/fd/@var{N}
The file associated with file descriptor @var{N}.  Such a file must have
been opened by the program initiating the @code{awk} execution (typically
the shell).  Unless you take special pains in the shell from which
you invoke @code{gawk}, only descriptors 0, 1 and 2 are available.
@end table
5321
The file names @file{/dev/stdin}, @file{/dev/stdout}, and @file{/dev/stderr}
5322
are aliases for @file{/dev/fd/0}, @file{/dev/fd/1}, and @file{/dev/fd/2},
5323
respectively, but they are more self-explanatory.
5325
The proper way to write an error message in a @code{gawk} program
5326
is to use @file{/dev/stderr}, like this:
5329
print "Serious error detected!" > "/dev/stderr"
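
To see that the two streams really are separate, here is a minimal
sketch, assuming a POSIX shell and a system or @code{awk} that
understands @file{/dev/stderr}.  The shell captures standard output and
discards standard error, so only the ordinary message ends up in the
(hypothetical) shell variable @code{out}:

```shell
# Standard output is captured; standard error is thrown away here.
out=$(awk 'BEGIN {
    print "normal output"
    print "Serious error detected!" > "/dev/stderr"
}' 2>/dev/null)
echo "$out"
```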
5332
@code{gawk} also provides special file names that give access to information
5333
about the running @code{gawk} process. Each of these ``files'' provides
5334
a single record of information. To read them more than once, you must
5335
first close them with the @code{close} function
5336
(@pxref{Close Files And Pipes, ,Closing Input and Output Files and Pipes}).
5339
@cindex process information
5340
@cindex @file{/dev/pid}
5341
@cindex @file{/dev/pgrpid}
5342
@cindex @file{/dev/ppid}
5343
@cindex @file{/dev/user}
5347
@table @file
@item /dev/pid
Reading this file returns the process ID of the current process,
in decimal, terminated with a newline.

@item /dev/ppid
Reading this file returns the parent process ID of the current process,
in decimal, terminated with a newline.

@item /dev/pgrpid
Reading this file returns the process group ID of the current process,
in decimal, terminated with a newline.

@item /dev/user
Reading this file returns a single record terminated with a newline.
The fields are separated with spaces.  The fields represent the
following information:

@table @code
@item $1
The return value of the @code{getuid} system call
(the real user ID number).

@item $2
The return value of the @code{geteuid} system call
(the effective user ID number).

@item $3
The return value of the @code{getgid} system call
(the real group ID number).

@item $4
The return value of the @code{getegid} system call
(the effective group ID number).
@end table

If there are any additional fields, they are the group IDs returned by
the @code{getgroups} system call.
(Multiple groups may not be supported on all systems.)
@end table
5387
These special file names may be used on the command line as data
5388
files, as well as for I/O redirections within an @code{awk} program.
5389
They may not be used as source files with the @samp{-f} option.
5391
Recognition of these special file names is disabled if @code{gawk} is in
5392
compatibility mode (@pxref{Options, ,Command Line Options}).
5394
@strong{Caution}: Unless your system actually has a @file{/dev/fd} directory
5395
(or any of the other above listed special files),
5396
the interpretation of these file names is done by @code{gawk} itself.
5397
For example, using @samp{/dev/fd/4} for output will actually write on
5398
file descriptor 4, and not on a new file descriptor that was @code{dup}'ed
5399
from file descriptor 4. Most of the time this does not matter; however, it
5400
is important to @emph{not} close any of the files related to file descriptors
5401
0, 1, and 2.  If you do close one of these files, unpredictable behavior
will result.
5404
The special files that provide process-related information may disappear
5405
in a future version of @code{gawk}.
5406
@xref{Future Extensions, ,Probable Future Extensions}.
5408
@node Close Files And Pipes, , Special Files, Printing
5409
@section Closing Input and Output Files and Pipes
5410
@cindex closing input files and pipes
5411
@cindex closing output files and pipes
5414
If the same file name or the same shell command is used with
@code{getline}
(@pxref{Getline, ,Explicit Input with @code{getline}})
5417
more than once during the execution of an @code{awk}
5418
program, the file is opened (or the command is executed) only the first time.
5419
At that time, the first record of input is read from that file or command.
5420
The next time the same file or command is used in @code{getline}, another
5421
record is read from it, and so on.
5423
Similarly, when a file or pipe is opened for output, the file name or command
associated with
it is remembered by @code{awk} and subsequent writes to the same file or
5426
command are appended to the previous writes. The file or pipe stays
5427
open until @code{awk} exits.
5429
This implies that if you want to start reading the same file again from
5430
the beginning, or if you want to rerun a shell command (rather than
5431
reading more output from the command), you must take special steps.
5432
What you must do is use the @code{close} function, as follows:
5435
@example
close(@var{filename})
@end example

@noindent
or

@example
close(@var{command})
@end example
5445
The argument @var{filename} or @var{command} can be any expression. Its
5446
value must @emph{exactly} match the string that was used to open the file or
5447
start the command (spaces and other ``irrelevant'' characters
5448
included). For example, if you open a pipe with this:
5451
"sort -r names" | getline foo
5455
then you must close it with this:
5458
close("sort -r names")
5461
Once this function call is executed, the next @code{getline} from that
5462
file or command, or the next @code{print} or @code{printf} to that
5463
file or command, will reopen the file or rerun the command.
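
The effect of @code{close} on an input pipe can be sketched as follows.
The command @samp{echo hello} and the variable names are assumptions
chosen for illustration.  The second @code{getline} finds the command's
output exhausted; after @code{close}, the same string runs the command
again:

```shell
out=$(awk 'BEGIN {
    cmd = "echo hello"
    cmd | getline first            # runs the command, reads its only line
    r1 = (cmd | getline again)     # output exhausted: returns 0
    close(cmd)                     # forget the finished command
    r2 = (cmd | getline again)     # command is run afresh: returns 1
    print first, r1, r2, again
}')
echo "$out"
```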
5465
Because the expression that you use to close a file or pipeline must
5466
exactly match the expression used to open the file or run the command,
5467
it is good practice to use a variable to store the file name or command.
5468
The previous example would become

@example
sortcom = "sort -r names"
sortcom | getline foo
@dots{}
close(sortcom)
@end example

@noindent
This helps avoid hard-to-find typographical errors in your @code{awk}
programs.
5481
Here are some reasons why you might need to close an output file:

@itemize @bullet
@item
To write a file and read it back later on in the same @code{awk}
program.  Close the file when you are finished writing it; then
you can start reading it with @code{getline}.

@item
To write numerous files, successively, in the same @code{awk}
program.  If you don't close the files, eventually you may exceed a
system limit on the number of open files in one process.  So close
each one when you are finished writing it.

@item
To make a command finish.  When you redirect output through a pipe,
the command reading the pipe normally continues to try to read input
as long as the pipe is open.  Often this means the command cannot
really do its work until the pipe is closed.  For example, if you
redirect output to the @code{mail} program, the message is not
actually sent until the pipe is closed.

@item
To run the same program a second time, with the same arguments.
This is not the same thing as giving more input to the first run!

For example, suppose you pipe output to the @code{mail} program.  If you
output several lines redirected to this pipe without closing it, they make
a single message of several lines.  By contrast, if you close the pipe
after each line of output, then each line makes a separate message.
@end itemize
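
The first reason in the list above (write a file, then read it back) can
be sketched like this.  The file name @file{scratch} and the scratch
directory are assumptions for the example; the important step is the
@code{close} between writing and reading:

```shell
tmpdir=$(mktemp -d)
out=$(awk -v f="$tmpdir/scratch" 'BEGIN {
    print "first line"  > f
    print "second line" > f
    close(f)                        # finish writing before reading
    while ((getline line < f) > 0)  # read the file back from the start
        n++
    close(f)
    print n, line
}')
echo "$out"
rm -rf "$tmpdir"
```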
5514
@cindex differences between @code{gawk} and @code{awk}
5515
@code{close} returns a value of zero if the close succeeded.
5516
Otherwise, the value will be non-zero.
5517
In this case, @code{gawk} sets the variable @code{ERRNO} to a string
5518
describing the error that occurred.
5520
@cindex differences between @code{gawk} and @code{awk}
5521
@cindex portability issues
5522
If you use more files than the system allows you to have open,
5523
@code{gawk} will attempt to multiplex the available open files among
5524
your data files. @code{gawk}'s ability to do this depends upon the
5525
facilities of your operating system: it may not always work. It is
5526
therefore both good practice and good portability advice to always
5527
use @code{close} on your files when you are done with them.
5529
@node Expressions, Patterns and Actions, Printing, Top
5530
@chapter Expressions
5533
Expressions are the basic building blocks of @code{awk} patterns
5534
and actions. An expression evaluates to a value, which you can print, test,
5535
store in a variable or pass to a function. Additionally, an expression
5536
can assign a new value to a variable or a field, with an assignment operator.
5538
An expression can serve as a pattern or action statement on its own.
Most other kinds of
statements contain one or more expressions which specify data on which to
5541
operate. As in other languages, expressions in @code{awk} include
5542
variables, array references, constants, and function calls, as well as
5543
combinations of these with various operators.
5546
@menu
* Constants::                   String, numeric, and regexp constants.
* Using Constant Regexps::      When and how to use a regexp constant.
* Variables::                   Variables give names to values for later use.
* Conversion::                  The conversion of strings to numbers and vice
                                versa.
* Arithmetic Ops::              Arithmetic operations (@samp{+}, @samp{-},
                                etc.)
* Concatenation::               Concatenating strings.
* Assignment Ops::              Changing the value of a variable or a field.
* Increment Ops::               Incrementing the numeric value of a variable.
* Truth Values::                What is ``true'' and what is ``false''.
* Typing and Comparison::       How variables acquire types, and how this
                                affects comparison of numbers and strings with
                                @samp{<}, etc.
* Boolean Ops::                 Combining comparison expressions using boolean
                                operators @samp{||} (``or''), @samp{&&}
                                (``and'') and @samp{!} (``not'').
* Conditional Exp::             Conditional expressions select between two
                                subexpressions under control of a third
                                subexpression.
* Function Calls::              A function call is an expression.
* Precedence::                  How various operators nest.
@end menu
5570
@node Constants, Using Constant Regexps, Expressions, Expressions
5571
@section Constant Expressions
5572
@cindex constants, types of
5573
@cindex string constants
5575
The simplest type of expression is the @dfn{constant}, which always has
5576
the same value. There are three types of constants: numeric constants,
5577
string constants, and regular expression constants.
5580
* Scalar Constants:: Numeric and string constants.
5581
* Regexp Constants:: Regular Expression constants.
5584
@node Scalar Constants, Regexp Constants, Constants, Constants
5585
@subsection Numeric and String Constants
5587
@cindex numeric constant
5588
@cindex numeric value
5589
A @dfn{numeric constant} stands for a number. This number can be an
5590
integer, a decimal fraction, or a number in scientific (exponential)
5591
notation.@footnote{The internal representation uses double-precision
5592
floating point numbers. If you don't know what that means, then don't
5593
worry about it.} Here are some examples of numeric constants, which all
5594
have the same value:
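
As an illustration, the three spellings used below (one integer, two in
exponential notation) are assumed examples of such constants.  The sketch
compares them inside @code{awk} and prints one for each comparison that
holds:

```shell
out=$(awk 'BEGIN {
    a = (105 == 1.05e+2)    # integer versus scientific notation
    b = (105 == 1050e-1)    # another spelling of the same number
    print a, b
}')
echo "$out"
```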
5602
@cindex differences between @code{gawk} and @code{awk}
A string constant consists of a sequence of characters enclosed in
double-quote marks.  For example, @code{"parrot"}
represents the string whose contents are @samp{parrot}.  Strings in
@code{gawk} can be of any length and they can contain any of the possible
eight-bit ASCII characters including ASCII NUL (character code zero).
Other @code{awk}
implementations may have difficulty with some character codes.
5617
@node Regexp Constants, , Scalar Constants, Constants
5618
@subsection Regular Expression Constants
5620
@cindex @code{~} operator
5621
@cindex @code{!~} operator
5622
A regexp constant is a regular expression description enclosed in
5623
slashes, such as @code{@w{/^beginning and end$/}}. Most regexps used in
5624
@code{awk} programs are constant, but the @samp{~} and @samp{!~}
5625
matching operators can also match computed or ``dynamic'' regexps
5626
(which are just ordinary strings or variables that contain a regexp).
5628
@node Using Constant Regexps, Variables, Constants, Expressions
5629
@section Using Regular Expression Constants
5631
When used on the right hand side of the @samp{~} or @samp{!~}
operators, a regexp constant merely stands for the regexp that is to be
matched.

Regexp constants (such as @code{/foo/}) may be used like simple expressions.
When a
regexp constant appears by itself, it has the same meaning as if it appeared
in a pattern, i.e.@: @samp{($0 ~ /foo/)} (d.c.)
5640
(@pxref{Expression Patterns, ,Expressions as Patterns}).
5641
This means that the two code segments,

@example
if ($0 ~ /barfly/ || $0 ~ /camelot/)
    print "found"
@end example

@noindent
and

@example
if (/barfly/ || /camelot/)
    print "found"
@end example

@noindent
are exactly equivalent.
5659
One rather bizarre consequence of this rule is that the following
boolean expression is valid, but does not do what the user probably
intended:

@example
# note that /foo/ is on the left of the ~
if (/foo/ ~ $1) print "found foo"
@end example
5669
This code is ``obviously'' testing @code{$1} for a match against the regexp
5670
@code{/foo/}. But in fact, the expression @samp{/foo/ ~ $1} actually means
5671
@samp{($0 ~ /foo/) ~ $1}. In other words, first match the input record
5672
against the regexp @code{/foo/}. The result will be either zero or one,
5673
depending upon the success or failure of the match. Then match that result
5674
against the first field in the record.
5676
Since it is unlikely that you would ever really wish to make this kind of
test, @code{gawk} will issue a warning when it sees this construct in
a program.
5680
Another consequence of this rule is that the assignment statement
@samp{matches = /foo/}
will assign either zero or one to the variable @code{matches}, depending
5688
upon the contents of the current input record.
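
A short sketch of this behavior, with two made-up input records: the
first contains @samp{foo} and the second does not, so the variable
receives one and then zero.

```shell
out=$(printf 'foo bar\nbaz\n' |
  awk '{ matches = /foo/    # same as: matches = ($0 ~ /foo/)
         print matches }')
echo "$out"
```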
5690
This feature of the language was never well documented until the
5691
POSIX specification.
5693
@cindex differences between @code{gawk} and @code{awk}
5695
Constant regular expressions are also used as the first argument for
5696
the @code{gensub}, @code{sub} and @code{gsub} functions, and as the
5697
second argument of the @code{match} function
5698
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
5699
Modern implementations of @code{awk}, including @code{gawk}, allow
5700
the third argument of @code{split} to be a regexp constant, while some
5701
older implementations do not (d.c.).
5703
This can lead to confusion when attempting to use regexp constants
as arguments to user defined functions
(@pxref{User-defined, , User-defined Functions}).
For example:

@example
function mysub(pat, repl, str, global)
@{
    if (global)
        gsub(pat, repl, str)
    else
        sub(pat, repl, str)
    return str
@}

@{
    @dots{}
    text = "hi! hi yourself!"
    mysub(/hi/, "howdy", text, 1)
    @dots{}
@}
@end example
5726
In this example, the programmer wishes to pass a regexp constant to the
5727
user-defined function @code{mysub}, which will in turn pass it on to
5728
either @code{sub} or @code{gsub}. However, what really happens is that
5729
the @code{pat} parameter will be either one or zero, depending upon whether
5730
or not @code{$0} matches @code{/hi/}.
5732
As it is unlikely that you would ever really wish to pass a truth value
5733
in this way, @code{gawk} will issue a warning when it sees a regexp
5734
constant used as a parameter to a user-defined function.
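
One way to sidestep the problem, offered here as a sketch rather than
the only remedy, is to pass the pattern as a string so that it reaches
@code{sub} or @code{gsub} as a dynamic regexp.  The function body below
is an assumed reconstruction of @code{mysub} that returns the modified
string:

```shell
out=$(echo "unrelated input" | awk '
function mysub(pat, repl, str, global)
{
    if (global)
        gsub(pat, repl, str)
    else
        sub(pat, repl, str)
    return str
}
{
    text = "hi! hi yourself!"
    # "hi" is a string, so it is not matched against $0 at the call.
    print mysub("hi", "howdy", text, 1)
}')
echo "$out"
```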
5736
@node Variables, Conversion, Using Constant Regexps, Expressions
@section Variables
5739
Variables are ways of storing values at one point in your program for
5740
use later in another part of your program. You can manipulate them
5741
entirely within your program text, and you can also assign values to
5742
them on the @code{awk} command line.
5745
* Using Variables:: Using variables in your programs.
5746
* Assignment Options:: Setting variables on the command line and a
5747
summary of command line syntax. This is an
5748
advanced method of input.
5751
@node Using Variables, Assignment Options, Variables, Variables
5752
@subsection Using Variables in a Program
5754
@cindex variables, user-defined
5755
@cindex user-defined variables
5756
Variables let you give names to values and refer to them later. You have
5757
already seen variables in many of the examples. The name of a variable
5758
must be a sequence of letters, digits and underscores, but it may not begin
5759
with a digit. Case is significant in variable names; @code{a} and @code{A}
5760
are distinct variables.
5762
A variable name is a valid expression by itself; it represents the
5763
variable's current value. Variables are given new values with
5764
@dfn{assignment operators}, @dfn{increment operators} and
5765
@dfn{decrement operators}.
5766
@xref{Assignment Ops, ,Assignment Expressions}.
5768
A few variables have special built-in meanings, such as @code{FS}, the
5769
field separator, and @code{NF}, the number of fields in the current
5770
input record. @xref{Built-in Variables}, for a list of them. These
5771
built-in variables can be used and assigned just like all other
5772
variables, but their values are also used or changed automatically by
5773
@code{awk}.  The names of all built-in variables are entirely upper-case.
5775
Variables in @code{awk} can be assigned either numeric or string
5776
values. By default, variables are initialized to the empty string, which
5777
is zero if converted to a number. There is no need to
5778
``initialize'' each variable explicitly in @code{awk},
5779
the way you would in C and in most other traditional languages.
5781
@node Assignment Options, , Using Variables, Variables
5782
@subsection Assigning Variables on the Command Line
5784
You can set any @code{awk} variable by including a @dfn{variable assignment}
5785
among the arguments on the command line when you invoke @code{awk}
5786
(@pxref{Other Arguments, ,Other Command Line Arguments}).  Such an assignment has
this form:

@example
@var{variable}=@var{text}
@end example
5794
With it, you can set a variable either at the beginning of the
5795
@code{awk} run or in between input files.
5797
If you precede the assignment with the @samp{-v} option, like this:
5800
-v @var{variable}=@var{text}
5804
then the variable is set at the very beginning, before even the
5805
@code{BEGIN} rules are run. The @samp{-v} option and its assignment
5806
must precede all the file name arguments, as well as the program text.
5807
(@xref{Options, ,Command Line Options}, for more information about
5808
the @samp{-v} option.)
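
The difference in timing can be seen with a program that consists of a
@code{BEGIN} rule only.  This is a minimal sketch; the variable name
@code{n} and the shell variables are chosen for illustration.  With
@samp{-v} the value is visible in @code{BEGIN}; as a plain argument it is
not, because the assignment would only be performed when @code{awk}
reaches it while working through its file arguments:

```shell
with_v=$(awk -v n=4 'BEGIN { print "n=" n }')
without_v=$(awk 'BEGIN { print "n=" n }' n=4)
echo "$with_v"
echo "$without_v"
```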
5810
Otherwise, the variable assignment is performed at a time determined by
5811
its position among the input file arguments: after the processing of the
5812
preceding input file argument. For example:
5815
awk '@{ print $n @}' n=4 inventory-shipped n=2 BBS-list
5819
prints the value of field number @code{n} for all input records. Before
5820
the first file is read, the command line sets the variable @code{n}
5821
equal to four. This causes the fourth field to be printed in lines from
5822
the file @file{inventory-shipped}. After the first file has finished,
5823
but before the second file is started, @code{n} is set to two, so that the
5824
second field is printed in lines from @file{BBS-list}.
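
The same mechanism can be tried without the sample data files.  In this
sketch the two small files @file{first} and @file{second} are assumptions
created on the spot; @code{n} is two while the first file is read and one
while the second is read:

```shell
tmpdir=$(mktemp -d)
printf 'a1 a2\nb1 b2\n' > "$tmpdir/first"
printf 'c1 c2\n'        > "$tmpdir/second"
# Each assignment takes effect just before the file that follows it.
out=$(awk '{ print $n }' n=2 "$tmpdir/first" n=1 "$tmpdir/second")
echo "$out"
rm -rf "$tmpdir"
```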
5828
$ awk '@{ print $n @}' n=4 inventory-shipped n=2 BBS-list
5838
Command line arguments are made available for explicit examination by
5839
the @code{awk} program in an array named @code{ARGV}
5840
(@pxref{ARGC and ARGV, ,Using @code{ARGC} and @code{ARGV}}).
5843
@code{awk} processes the values of command line assignments for escape
5844
sequences (d.c.) (@pxref{Escape Sequences}).
5846
@node Conversion, Arithmetic Ops, Variables, Expressions
5847
@section Conversion of Strings and Numbers
5849
@cindex conversion of strings and numbers
5850
Strings are converted to numbers, and numbers to strings, if the context
5851
of the @code{awk} program demands it. For example, if the value of
5852
either @code{foo} or @code{bar} in the expression @samp{foo + bar}
5853
happens to be a string, it is converted to a number before the addition
5854
is performed. If numeric values appear in string concatenation, they
5855
are converted to strings.  Consider this:

@example
two = 2; three = 3
print (two three) + 4
@end example
5863
This prints the (numeric) value 27. The numeric values of
5864
the variables @code{two} and @code{three} are converted to strings and
5865
concatenated together, and the resulting string is converted back to the
5866
number 23, to which four is then added.
5869
@cindex empty string
5870
@cindex type conversion
5871
If, for some reason, you need to force a number to be converted to a
5872
string, concatenate the empty string, @code{""}, with that number.
5873
To force a string to be converted to a number, add zero to that string.
5875
A string is converted to a number by interpreting any numeric prefix
5876
of the string as numerals:
5877
@code{"2.5"} converts to 2.5, @code{"1e3"} converts to 1000, and @code{"25fix"}
5878
has a numeric value of 25.
5879
Strings that can't be interpreted as valid numbers are converted to
zero.
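
These conversions can be checked directly.  The sketch below adds zero
to each string to force a numeric interpretation; the string
@code{"abc"} is an extra case, assumed here to show a string with no
numeric prefix:

```shell
out=$(awk 'BEGIN {
    a = "2.5" + 0      # decimal fraction
    b = "1e3" + 0      # exponential notation
    c = "25fix" + 0    # only the numeric prefix counts
    d = "abc" + 0      # no numeric prefix at all
    print a, b, c, d
}')
echo "$out"
```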
5883
The exact manner in which numbers are converted into strings is controlled
5884
by the @code{awk} built-in variable @code{CONVFMT} (@pxref{Built-in Variables}).
5885
Numbers are converted using the @code{sprintf} function
5886
(@pxref{String Functions, ,Built-in Functions for String Manipulation})
5887
with @code{CONVFMT} as the format
specifier.

@code{CONVFMT}'s default value is @code{"%.6g"}, which prints a value with
at most six significant digits.  For some applications you will want to
5892
change it to specify more precision. Double precision on most modern
5893
machines gives you 16 or 17 decimal digits of precision.
5895
Strange results can happen if you set @code{CONVFMT} to a string that doesn't
5896
tell @code{sprintf} how to format floating point numbers in a useful way.
5897
For example, if you forget the @samp{%} in the format, all numbers will be
5898
converted to the same constant string.
5901
As a special case, if a number is an integer, then the result of converting
5902
it to a string is @emph{always} an integer, no matter what the value of
5903
@code{CONVFMT} may be.  For example, after the assignments
@samp{CONVFMT = "%2.2f"}, @samp{a = 12} and @samp{b = a ""},
@code{b} has the value @code{"12"}, not @code{"12.00"} (d.c.).
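
A runnable sketch of the special case follows.  The second pair of
variables, @code{x} and @code{y}, is an assumed addition that shows
@code{CONVFMT} being applied to a value that is not an integer:

```shell
out=$(awk 'BEGIN {
    CONVFMT = "%2.2f"
    a = 12;   b = a ""    # integer: CONVFMT is not used
    x = 12.5; y = x ""    # non-integer: converted with CONVFMT
    print b, y
}')
echo "$out"
```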
5914
@cindex @code{awk} language, POSIX version
5915
@cindex POSIX @code{awk}
5917
Prior to the POSIX standard, @code{awk} specified that the value
5918
of @code{OFMT} was used for converting numbers to strings. @code{OFMT}
5919
specifies the output format to use when printing numbers with @code{print}.
5920
@code{CONVFMT} was introduced in order to separate the semantics of
5921
conversion from the semantics of printing. Both @code{CONVFMT} and
5922
@code{OFMT} have the same default value: @code{"%.6g"}. In the vast majority
5923
of cases, old @code{awk} programs will not change their behavior.
5924
However, this use of @code{OFMT} is something to keep in mind if you must
5925
port your program to other implementations of @code{awk}; we recommend
5926
that instead of changing your programs, you just port @code{gawk} itself!
5927
@xref{Print, ,The @code{print} Statement},
5928
for more information on the @code{print} statement.
5930
@node Arithmetic Ops, Concatenation, Conversion, Expressions
5931
@section Arithmetic Operators
5932
@cindex arithmetic operators
5933
@cindex operators, arithmetic
5936
@cindex multiplication
5940
@cindex exponentiation
5942
The @code{awk} language uses the common arithmetic operators when
5943
evaluating expressions. All of these arithmetic operators follow normal
5944
precedence rules, and work as you would expect them to.
5946
Here is a file @file{grades} containing a list of student names and
5947
three test scores per student (it's a small class):
5956
This program takes the file @file{grades}, and prints the average
of the scores:
5960
$ awk '@{ sum = $2 + $3 + $4 ; avg = sum / 3
5961
> print $1, avg @}' grades
5964
@print{} Chris 84.3333
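
To try the program without the original data file, here is a sketch
that builds a small @file{grades} file first.  The rows are sample data
assumed for illustration; the @samp{Chris} row is chosen so that its
average agrees with the output line shown above.

```shell
tmpdir=$(mktemp -d)
cat > "$tmpdir/grades" <<'EOF'
Pat   100 97 58
Sandy  84 72 93
Chris  72 92 89
EOF
# Same program as above: sum the three scores, then divide by three.
out=$(awk '{ sum = $2 + $3 + $4 ; avg = sum / 3
             print $1, avg }' "$tmpdir/grades")
echo "$out"
rm -rf "$tmpdir"
```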
5967
This table lists the arithmetic operators in @code{awk}, in order from
5968
highest precedence to lowest:
5970
@c sigh. this seems necessary
5980
@table @code
@cindex @code{awk} language, POSIX version
@cindex POSIX @code{awk}
@item @var{x} ^ @var{y}
@itemx @var{x} ** @var{y}
Exponentiation: @var{x} raised to the @var{y} power.  @samp{2 ^ 3} has
the value eight.  The character sequence @samp{**} is equivalent to
@samp{^}.  (The POSIX standard only specifies the use of @samp{^}
for exponentiation.)

@item - @var{x}
Negation.

@item + @var{x}
Unary plus.  The expression is converted to a number.

@item @var{x} * @var{y}
Multiplication.

@item @var{x} / @var{y}
Division.  Since all numbers in @code{awk} are
floating-point numbers, the result is not rounded to an integer: @samp{3 / 4}
has the value 0.75.

@item @var{x} % @var{y}
@cindex differences between @code{gawk} and @code{awk}
Remainder.  The quotient is rounded toward zero to an integer,
multiplied by @var{y} and this result is subtracted from @var{x}.
This operation is sometimes known as ``trunc-mod.''  The following
relation always holds:

@example
b * int(a / b) + (a % b) == a
@end example

One possibly undesirable effect of this definition of remainder is that
@code{@var{x} % @var{y}} is negative if @var{x} is negative.  Thus,
@samp{-17 % 8} has the value @samp{-1}.

In other @code{awk} implementations, the signedness of the remainder
may be machine dependent.
@c !!! what does posix say?

@item @var{x} + @var{y}
Addition.

@item @var{x} - @var{y}
Subtraction.
@end table
6029
For maximum portability, do not use the @samp{**} operator.
6031
Unary plus and minus have the same precedence,
6032
the multiplication operators all have the same precedence, and
6033
addition and subtraction have the same precedence.
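
The remainder rule and the relation given for it can be checked with a
short sketch.  The operands @minus{}17 and 8 are assumed sample values;
the output shows the remainder, the truth value of the relation, and
@samp{2 ^ 3}:

```shell
out=$(awk 'BEGIN {
    a = -17; b = 8
    r = a % b                          # remainder takes the sign of a
    ok = (b * int(a / b) + r == a)     # the relation from the table
    p = 2 ^ 3
    print r, ok, p
}')
echo "$out"
```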
6035
@node Concatenation, Assignment Ops, Arithmetic Ops, Expressions
6036
@section String Concatenation
6038
@cindex string operators
6039
@cindex operators, string
6040
@cindex concatenation
6041
There is only one string operation: concatenation. It does not have a
6042
specific operator to represent it. Instead, concatenation is performed by
6043
writing expressions next to one another, with no operator. For example:
6047
$ awk '@{ print "Field number one: " $1 @}' BBS-list
6048
@print{} Field number one: aardvark
6049
@print{} Field number one: alpo-net
6054
Without the space in the string constant after the @samp{:}, the line
6055
would run together. For example:
6059
$ awk '@{ print "Field number one:" $1 @}' BBS-list
6060
@print{} Field number one:aardvark
6061
@print{} Field number one:alpo-net
6066
Since string concatenation does not have an explicit operator, it is
6067
often necessary to ensure that it happens where you want it to by
6068
using parentheses to enclose
6069
the items to be concatenated. For example, the
6070
following code fragment does not concatenate @code{file} and @code{name}
as you might expect:

@example
file = "file"
name = "name"
print "something meaningful" > file name
@end example

@noindent
It is necessary to use the following:

@example
print "something meaningful" > (file name)
@end example
6086
We recommend that you use parentheses around concatenation in all but the
6087
most common contexts (such as on the right-hand side of @samp{=}).
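
Here is a sketch of the parenthesized form in use.  The directory
prefix @code{dir} and the pieces @code{"out"} and @code{".txt"} are
assumptions for the example; the parentheses make the whole
concatenation the target of the redirection:

```shell
tmpdir=$(mktemp -d)
awk -v dir="$tmpdir/" 'BEGIN {
    file = "out"; name = ".txt"
    # The redirection target is the concatenation of dir, file and name.
    print "something meaningful" > (dir file name)
}'
content=$(cat "$tmpdir/out.txt")
echo "$content"
rm -rf "$tmpdir"
```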
6089
@node Assignment Ops, Increment Ops, Concatenation, Expressions
6090
@section Assignment Expressions
6091
@cindex assignment operators
6092
@cindex operators, assignment
6093
@cindex expression, assignment
6095
An @dfn{assignment} is an expression that stores a new value into a
6096
variable.  For example, the expression @samp{z = 1} assigns the value one
to the variable @code{z}.
After this expression is executed, the variable @code{z} has the value one.
6104
Whatever old value @code{z} had before the assignment is forgotten.
6106
Assignments can store string values also.  For example, this would store
the value @code{"this food is good"} in the variable @code{message}:

@example
thing = "food"
predicate = "good"
message = "this " thing " is " predicate
@end example

@noindent
(This also illustrates string concatenation.)
6118
The @samp{=} sign is called an @dfn{assignment operator}. It is the
6119
simplest assignment operator because the value of the right-hand
6120
operand is stored unchanged.
6123
Most operators (addition, concatenation, and so on) have no effect
6124
except to compute a value. If you ignore the value, you might as well
6125
not use the operator. An assignment operator is different; it does
6126
produce a value, but even if you ignore the value, the assignment still
6127
makes itself felt through the alteration of the variable. We call this
6128
a @dfn{side effect}.
6132
The left-hand operand of an assignment need not be a variable
6133
(@pxref{Variables}); it can also be a field
6134
(@pxref{Changing Fields, ,Changing the Contents of a Field}) or
6135
an array element (@pxref{Arrays, ,Arrays in @code{awk}}).
6136
These are all called @dfn{lvalues},
6137
which means they can appear on the left-hand side of an assignment operator.
6138
The right-hand operand may be any expression; it produces the new value
6139
which the assignment stores in the specified variable, field or array
6140
element. (Such values are called @dfn{rvalues}).
6142
@cindex types of variables
6143
It is important to note that variables do @emph{not} have permanent types.
6144
The type of a variable is simply the type of whatever value it happens
6145
to hold at the moment. In the following program fragment, the variable
6146
@code{foo} has a numeric value at first, and a string value later on:
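
A minimal sketch of such a fragment, wrapped in a @code{BEGIN} rule so
that it can be run as it stands (the values one and @code{"bar"} are
assumed for illustration):

```shell
out=$(awk 'BEGIN {
    foo = 1         # foo holds a number
    print foo
    foo = "bar"     # now foo holds a string
    print foo
}')
echo "$out"
```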
6156
When the second assignment gives @code{foo} a string value, the fact that
6157
it previously had a numeric value is forgotten.
6159
String values that do not begin with a digit have a numeric value of
6160
zero. After executing this code, the value of @code{foo} is five:
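
One fragment with this effect, given as a sketch with an assumed string
value, is:

```shell
out=$(awk 'BEGIN {
    foo = "a string"    # no leading digit, so its numeric value is zero
    foo = foo + 5       # 0 + 5
    print foo
}')
echo "$out"
```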
6168
(Note that using a variable as a number and then later as a string can
6169
be confusing and is poor programming style. The above examples illustrate how
6170
@code{awk} works, @emph{not} how you should write your own programs!)
6172
An assignment is an expression, so it has a value: the same value that
6173
is assigned. Thus, @samp{z = 1} as an expression has the value one.
6174
One consequence of this is that you can write multiple assignments together:
6181
stores the value zero in all three variables. It does this because the
6182
value of @samp{z = 0}, which is zero, is stored into @code{y}, and then
6183
the value of @samp{y = z = 0}, which is zero, is stored into @code{x}.
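
As a small sketch of this (the helper name @code{chain_demo} and the
variables @code{v} and @code{w} are assumptions made for illustration),
the program below chains three assignments and then uses the value of an
assignment inside a larger expression:

```shell
# Illustrative sketch; chain_demo is a hypothetical helper name.
chain_demo() {
  awk 'BEGIN {
    x = y = z = 0      # groups as x = (y = (z = 0))
    print x, y, z
    v = (w = 7) + 1    # the assignment w = 7 itself has the value 7
    print v
    print w
  }'
}
chain_demo
```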
6185
You can use an assignment anywhere an expression is called for. For
6186
example, it is valid to write @samp{x != (y = 1)} to set @code{y} to one
6187
and then test whether @code{x} differs from one. But this style tends to make
6188
programs hard to read; except in a one-shot program, you should
6189
not use such nesting of assignments.
6191
Aside from @samp{=}, there are several other assignment operators that
6192
do arithmetic with the old value of the variable. For example, the
6193
operator @samp{+=} computes a new value by adding the right-hand value
6194
to the old value of the variable. Thus, the following assignment adds
6195
five to the value of @code{foo}:
6202
This is equivalent to the following:
6209
Use whichever one makes the meaning of your program clearer.
6211
There are situations where using @samp{+=} (or any assignment operator)
6212
is @emph{not} the same as simply repeating the left-hand operand in the
6213
right-hand expression. For example:
6218
# Thanks to Pat Rankin for this example
6224
bar[rand()] = bar[rand()] + 5
6232
The indices of @code{bar} are practically guaranteed to be different, because
6233
@code{rand} will return different values each time it is called.
6234
(Arrays and the @code{rand} function haven't been covered yet.
6235
@xref{Arrays, ,Arrays in @code{awk}},
6236
and see @ref{Numeric Functions, ,Numeric Built-in Functions}, for more information.)
6237
This example illustrates an important fact about the assignment
6238
operators: the left-hand expression is only evaluated @emph{once}.
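
One way to observe this without relying on random numbers is to count
evaluations. The sketch below is our own illustration, not the manual's
example: @code{bump} is a hypothetical user-defined function that
increments a counter each time it is called, so the counter shows how
often the subscript expression was evaluated.

```shell
# Illustrative sketch; once_demo and bump are hypothetical names.
once_demo() {
  awk '
    function bump() { return ++calls }   # counts how often it is called
    BEGIN {
      bar[bump()] += 5                   # the subscript is evaluated once
      print calls
      bar[bump()] = bar[bump()] + 5      # the subscript is evaluated twice
      print calls
    }'
}
once_demo
```

The counter advances by one for the @samp{+=} form and by two for the
spelled-out form, whichever side the implementation evaluates first.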
6240
It is also up to the implementation which expression is evaluated
6241
first, the left-hand one or the right-hand one.
6242
Consider this example:
6250
The value of @code{a[3]} could be either two or four.
6252
Here is a table of the arithmetic assignment operators. In each
6253
case, the right-hand operand is an expression whose value is converted
to a number.
6258
@item @var{lvalue} += @var{increment}
6259
Adds @var{increment} to the value of @var{lvalue} to make the new value
6262
@item @var{lvalue} -= @var{decrement}
6263
Subtracts @var{decrement} from the value of @var{lvalue}.
6265
@item @var{lvalue} *= @var{coefficient}
6266
Multiplies the value of @var{lvalue} by @var{coefficient}.
6268
@item @var{lvalue} /= @var{divisor}
6269
Divides the value of @var{lvalue} by @var{divisor}.
6271
@item @var{lvalue} %= @var{modulus}
6272
Sets @var{lvalue} to its remainder by @var{modulus}.
6274
@cindex @code{awk} language, POSIX version
6275
@cindex POSIX @code{awk}
6276
@item @var{lvalue} ^= @var{power}
6277
@itemx @var{lvalue} **= @var{power}
6278
Raises @var{lvalue} to the power @var{power}.
6279
(Only the @samp{^=} operator is specified by POSIX.)
6283
For maximum portability, do not use the @samp{**=} operator.
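
The operators in the table can be exercised with a short program. This is
a sketch under our own assumptions (the helper name @code{assign_ops_demo}
and the starting value are arbitrary); it uses only the POSIX operators.

```shell
# Illustrative sketch; assign_ops_demo is a hypothetical helper name.
assign_ops_demo() {
  awk 'BEGIN {
    n = 10
    n += 5;  print n    # add 5
    n -= 3;  print n    # subtract 3
    n *= 2;  print n    # multiply by 2
    n /= 4;  print n    # divide by 4
    n %= 4;  print n    # remainder modulo 4
    n ^= 3;  print n    # raise to the third power
  }'
}
assign_ops_demo
```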
6285
@node Increment Ops, Truth Values, Assignment Ops, Expressions
6286
@section Increment and Decrement Operators
6288
@cindex increment operators
6289
@cindex operators, increment
6290
@dfn{Increment} and @dfn{decrement operators} increase or decrease the value of
6291
a variable by one. You could do the same thing with an assignment operator, so
6292
the increment operators add no power to the @code{awk} language; but they
6293
are convenient abbreviations for very common operations.
6295
The operator to add one is written @samp{++}. It can be used to increment
6296
a variable either before or after taking its value.
6298
To pre-increment a variable @var{v}, write @samp{++@var{v}}. This adds
6299
one to the value of @var{v} and that new value is also the value of this
6300
expression. The assignment expression @samp{@var{v} += 1} is completely
equivalent.
6303
Writing the @samp{++} after the variable specifies post-increment. This
6304
increments the variable value just the same; the difference is that the
6305
value of the increment expression itself is the variable's @emph{old}
6306
value. Thus, if @code{foo} has the value four, then the expression @samp{foo++}
6307
has the value four, but it changes the value of @code{foo} to five.
6309
The post-increment @samp{foo++} is nearly equivalent to writing @samp{(foo
6310
+= 1) - 1}. It is not perfectly equivalent because all numbers in
6311
@code{awk} are floating point: in floating point, @samp{foo + 1 - 1} does
6312
not necessarily equal @code{foo}. But the difference is minute as
6313
long as you stick to numbers that are fairly small (less than 10e12).
6315
Any lvalue can be incremented. Fields and array elements are incremented
6316
just like variables. (Use @samp{$(i++)} when you wish to do a field reference
6317
and a variable increment at the same time. The parentheses are necessary
6318
because of the precedence of the field reference operator, @samp{$}.)
6320
@cindex decrement operators
6321
@cindex operators, decrement
6322
The decrement operator @samp{--} works just like @samp{++} except that
6323
it subtracts one instead of adding. Like @samp{++}, it can be used before
6324
the lvalue to pre-decrement or after it to post-decrement.
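
The difference between the prefix and postfix forms shows up only in the
value of the expression. The following sketch (the helper name
@code{incr_demo} and the variables are assumptions chosen for
illustration) captures that value in a second variable each time:

```shell
# Illustrative sketch; incr_demo is a hypothetical helper name.
incr_demo() {
  awk 'BEGIN {
    foo = 4
    a = foo++      # post-increment: a gets the old value
    print a, foo
    b = ++foo      # pre-increment: b gets the new value
    print b, foo
    c = foo--      # post-decrement: c gets the old value
    print c, foo
  }'
}
incr_demo
```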
6326
Here is a summary of increment and decrement expressions.
6330
@item ++@var{lvalue}
6331
This expression increments @var{lvalue} and the new value becomes the
6332
value of the expression.
6334
@item @var{lvalue}++
6335
This expression increments @var{lvalue}, but
6336
the value of the expression is the @emph{old} value of @var{lvalue}.
6338
@item --@var{lvalue}
6339
Like @samp{++@var{lvalue}}, but instead of adding, it subtracts. It
6340
decrements @var{lvalue} and delivers the value that results.
6342
@item @var{lvalue}--
6343
Like @samp{@var{lvalue}++}, but instead of adding, it subtracts. It
6344
decrements @var{lvalue}. The value of the expression is the @emph{old}
6345
value of @var{lvalue}.
6349
@node Truth Values, Typing and Comparison, Increment Ops, Expressions
6350
@section True and False in @code{awk}
6351
@cindex truth values
6352
@cindex logical true
6353
@cindex logical false
6355
Many programming languages have a special representation for the concepts
6356
of ``true'' and ``false.'' Such languages usually use the special
6357
constants @code{true} and @code{false}, or perhaps their upper-case
equivalents.
6361
@cindex empty string
6362
@code{awk} is different. It borrows a very simple concept of true and
6363
false from C. In @code{awk}, any non-zero numeric value, @emph{or} any
6364
non-empty string value is true. Any other value (zero or the null
6365
string, @code{""}) is false. The following program will print @samp{A strange
6366
truth value} three times:
6371
print "A strange truth value"
6372
if ("Four Score And Seven Years Ago")
6373
print "A strange truth value"
6375
print "A strange truth value"
6380
There is a surprising consequence of the ``non-zero or non-null'' rule:
6381
The string constant @code{"0"} is actually true, since it is non-null (d.c.).
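
A quick way to check these rules is to pass several values through a
conditional expression. In this sketch, @code{truth} is a hypothetical
user-defined function of our own, not part of @code{awk}:

```shell
# Illustrative sketch; truth_demo and truth are hypothetical names.
truth_demo() {
  awk '
    function truth(v) { return v ? "true" : "false" }
    BEGIN {
      print truth(0)       # numeric zero
      print truth("")      # null string
      print truth(3.14)    # non-zero number
      print truth("abc")   # non-empty string
      print truth("0")     # string constant "0" is non-null
    }'
}
truth_demo
```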
6383
@node Typing and Comparison, Boolean Ops, Truth Values, Expressions
6384
@section Variable Typing and Comparison Expressions
6385
@cindex comparison expressions
6386
@cindex expression, comparison
6387
@cindex expression, matching
6388
@cindex relational operators
6389
@cindex operators, relational
6390
@cindex regexp match/non-match operators
6391
@cindex variable typing
6392
@cindex types of variables
6394
@c 2e: consider splitting this section into subsections
6396
Unlike variables in many other programming languages, @code{awk} variables do not have a
6397
fixed type. Instead, they can be either a number or a string, depending
6398
upon the value that is assigned to them.
6400
@cindex numeric string
6401
The 1992 POSIX standard introduced
6402
the concept of a @dfn{numeric string}, which is simply a string that looks
6403
like a number, for example, @code{@w{" +2"}}. This concept is used
6404
for determining the type of a variable.
6406
The type of the variable is important, since the types of two variables
6407
determine how they are compared.
6409
In @code{gawk}, variable typing follows these rules.
6413
@enumerate
@item
A numeric literal or the result of a numeric operation has the @var{numeric}
attribute.

@item
A string literal or the result of a string operation has the @var{string}
attribute.

@item
Fields, @code{getline} input, @code{FILENAME}, @code{ARGV} elements,
@code{ENVIRON} elements and the
elements of an array created by @code{split} that are numeric strings
have the @var{strnum} attribute. Otherwise, they have the @var{string}
attribute.
Uninitialized variables also have the @var{strnum} attribute.

@item
Attributes propagate across assignments, but are not changed by
any use.
@end enumerate
6431
@c (Although a use may cause the entity to acquire an additional
6432
@c value such that it has both a numeric and string value -- this leaves the
6433
@c attribute unchanged.)
6434
@c This is important but not relevant
6437
The last rule is particularly important. In the following program,
6438
@code{a} has numeric type, even though it is later used in a string
operation.
6444
b = a " is a cute number"
6449
When two operands are compared, either string comparison or numeric comparison
6450
may be used, depending on the attributes of the operands, according to the
6451
following, symmetric, matrix:
6453
@c thanks to Karl Berry, kb@cs.umb.edu, for major help with TeX tables
6456
\vbox{\bigskip % space above the table (about 1 linespace)
6457
% Because we have vertical rules, we can't let TeX insert interline space
6461
% Define the table template. & separates columns, and \cr ends the
6462
% template (and each row). # is replaced by the text of that entry on
6463
% each row. The template for the first column breaks down like this:
6464
% \strut -- a way to make each line have the height and depth
6465
% of a normal line of type, since we turned off interline spacing.
6466
% \hfil -- infinite glue; has the effect of right-justifying in this case.
6467
% # -- replaced by the text (for instance, `STRNUM', in the last row).
6468
% \quad -- about the width of an `M'. Just separates the columns.
6470
% The second column (\vrule#) is what generates the vertical rule that
6473
% The doubled && before the next entry means `repeat the following
6474
% template as many times as necessary on each line' -- in our case, twice.
6476
% The template itself, \quad#\hfil, left-justifies with a little space before.
6478
\halign{\strut\hfil#\quad&\vrule#&&\quad#\hfil\cr
6479
&&STRING &NUMERIC &STRNUM\cr
6480
% The \omit tells TeX to skip inserting the template for this column on
6481
% this particular row. In this case, we only want a little extra space
6482
% to separate the heading row from the rule below it. the depth 2pt --
6483
% `\vrule depth 2pt' is that little space.
6485
% This is the horizontal rule below the heading. Since it has nothing to
6486
% do with the columns of the table, we use \noalign to get it in there.
6488
% Like above, this time a little more space.
6490
% The remaining rows have nothing special about them.
6491
STRING &&string &string &string\cr
6492
NUMERIC &&string &numeric &numeric\cr
6493
STRNUM &&string &numeric &numeric\cr
6498
        +----------------------------------------------
        |       STRING          NUMERIC         STRNUM
--------+----------------------------------------------
STRING  |       string          string          string
NUMERIC |       string          numeric         numeric
STRNUM  |       string          numeric         numeric
--------+----------------------------------------------
6511
The basic idea is that user input that looks numeric, and @emph{only}
6512
user input, should be treated as numeric, even though it is actually
6513
made of characters, and is therefore also a string.
6515
@dfn{Comparison expressions} compare strings or numbers for
6516
relationships such as equality. They are written using @dfn{relational
6517
operators}, which are a superset of those in C. Here is a table of
them.
6520
@cindex relational operators
6521
@cindex operators, relational
6522
@cindex @code{<} operator
6523
@cindex @code{<=} operator
6524
@cindex @code{>} operator
6525
@cindex @code{>=} operator
6526
@cindex @code{==} operator
6527
@cindex @code{!=} operator
6528
@cindex @code{~} operator
6529
@cindex @code{!~} operator
6530
@cindex @code{in} operator
6533
@item @var{x} < @var{y}
6534
True if @var{x} is less than @var{y}.
6536
@item @var{x} <= @var{y}
6537
True if @var{x} is less than or equal to @var{y}.
6539
@item @var{x} > @var{y}
6540
True if @var{x} is greater than @var{y}.
6542
@item @var{x} >= @var{y}
6543
True if @var{x} is greater than or equal to @var{y}.
6545
@item @var{x} == @var{y}
6546
True if @var{x} is equal to @var{y}.
6548
@item @var{x} != @var{y}
6549
True if @var{x} is not equal to @var{y}.
6551
@item @var{x} ~ @var{y}
6552
True if the string @var{x} matches the regexp denoted by @var{y}.
6554
@item @var{x} !~ @var{y}
6555
True if the string @var{x} does not match the regexp denoted by @var{y}.
6557
@item @var{subscript} in @var{array}
6558
True if the array @var{array} has an element with the subscript @var{subscript}.
6562
Comparison expressions have the value one if true and zero if false.
6564
When comparing operands of mixed types, numeric operands are converted
6565
to strings using the value of @code{CONVFMT}
6566
(@pxref{Conversion, ,Conversion of Strings and Numbers}).
6568
Strings are compared
6569
by comparing the first character of each, then the second character of each,
6570
and so on. Thus @code{"10"} is less than @code{"9"}. If there are two
6571
strings where one is a prefix of the other, the shorter string is less than
6572
the longer one. Thus @code{"abc"} is less than @code{"abcd"}.
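
The character-by-character rule can be checked directly. The sketch below
(with the hypothetical helper name @code{strcmp_demo}) compares string
constants, which always compare as strings, and contrasts them with the
same digits written as numbers:

```shell
# Illustrative sketch; strcmp_demo is a hypothetical helper name.
strcmp_demo() {
  awk 'BEGIN {
    print ("10" < "9")       # string comparison: "1" sorts before "9"
    print (10 < 9)           # numeric comparison
    print ("abc" < "abcd")   # a prefix is less than the longer string
  }'
}
strcmp_demo
```

The parentheses keep @samp{<} from being taken as a redirection in the
@code{print} statement.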
6574
@cindex common mistakes
6575
@cindex mistakes, common
6576
@cindex errors, common
6577
It is very easy to accidentally mistype the @samp{==} operator, and
6578
leave off one of the @samp{=}s. The result is still valid @code{awk}
6579
code, but the program will not do what you mean:
6582
if (a = b) # oops! should be a == b
6589
Unless @code{b} happens to be zero or the null string, the @code{if}
6590
part of the test will always succeed. Because the operators are
6591
so similar, this kind of error is very difficult to spot when
6592
scanning the source code.
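
To see the difference in behavior, compare the two tests in the following
sketch. The helper name @code{typo_demo} and the values of @code{a} and
@code{b} are assumptions made for illustration.

```shell
# Illustrative sketch; typo_demo is a hypothetical helper name.
typo_demo() {
  awk 'BEGIN {
    a = 1; b = 2
    if (a == b) print "equal"; else print "not equal"
    if (a = b) print "assigned, a is now " a   # assignment, not comparison
  }'
}
typo_demo
```

The second test succeeds because the assigned value, two, is non-zero,
and it silently changes @code{a} as a side effect.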
6594
Here are some sample expressions, how @code{gawk} compares them, and what
6595
the result of the comparison is.
6599
numeric comparison (true)
6601
@item "abc" >= "xyz"
6602
string comparison (false)
6605
string comparison (true)
6608
string comparison (true)
6610
@item a = 2; b = "2"
6612
string comparison (true)
6614
@item a = 2; b = " +2"
6616
string comparison (false)
6623
$ echo 1e2 3 | awk '@{ print ($1 < $2) ? "true" : "false" @}'
6629
the result is @samp{false} since both @code{$1} and @code{$2} are numeric
6630
strings and thus both have the @var{strnum} attribute,
6631
dictating a numeric comparison.
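
The sketch below contrasts the three situations described so far: numeric
strings read from input, string constants, and a number compared with a
string constant. The helper name @code{strnum_demo} is hypothetical, and
the extra parentheses around each conditional expression are a precaution
of ours rather than a requirement stated in this manual.

```shell
# Illustrative sketch; strnum_demo is a hypothetical helper name.
strnum_demo() {
  # Fields that look numeric have the strnum attribute: numeric comparison.
  echo 1e2 3 | awk '{ print (($1 < $2) ? "true" : "false") }'
  # String constants have the string attribute: string comparison.
  awk 'BEGIN { print (("1e2" < "3") ? "true" : "false") }'
  # A number and a string constant: string comparison of "2" and " +2".
  awk 'BEGIN { a = 2; b = " +2"; print ((a == b) ? "true" : "false") }'
}
strnum_demo
```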
6633
The purpose of the comparison rules and the use of numeric strings is
6634
to attempt to produce the behavior that is ``least surprising,'' while
6635
still ``doing the right thing.''
6637
@cindex comparisons, string vs. regexp
6638
@cindex string comparison vs. regexp comparison
6639
@cindex regexp comparison vs. string comparison
6640
String comparisons and regular expression comparisons are very different.
6648
has the value of one, or is true, if the variable @code{x}
6649
is precisely @samp{foo}. By contrast,
6656
has the value one if @code{x} contains @samp{foo}, such as
6657
@code{"Oh, what a fool am I!"}.
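
A short sketch of the contrast, using the sample string from the text
(the helper name @code{match_demo} is an assumption of ours):

```shell
# Illustrative sketch; match_demo is a hypothetical helper name.
match_demo() {
  awk 'BEGIN {
    x = "Oh, what a fool am I!"
    print (x == "foo")   # string comparison: x is not exactly "foo"
    print (x ~ /foo/)    # regexp match: x contains "foo"
  }'
}
match_demo
```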
6659
The right-hand operand of the @samp{~} and @samp{!~} operators may be
6660
either a regexp constant (@code{/@dots{}/}), or an ordinary
6661
expression, in which case the value of the expression as a string is used as a
6662
dynamic regexp (@pxref{Regexp Usage, ,How to Use Regular Expressions}; also
6663
@pxref{Computed Regexps, ,Using Dynamic Regexps}).
6665
@cindex regexp as expression
6666
In recent implementations of @code{awk}, a constant regular
6667
expression in slashes by itself is also an expression. The regexp
6668
@code{/@var{regexp}/} is an abbreviation for this comparison expression:
6674
One special place where @code{/foo/} is @emph{not} an abbreviation for
6675
@samp{$0 ~ /foo/} is when it is the right-hand operand of @samp{~} or
@samp{!~}.
6677
@xref{Using Constant Regexps, ,Using Regular Expression Constants},
6678
where this is discussed in more detail.
6680
@c This paragraph has been here since day 1, and has always bothered
6681
@c me, especially since the expression doesn't really make a lot of
6682
@c sense. So, just take it out.
6684
In some contexts it may be necessary to write parentheses around the
6685
regexp to avoid confusing the @code{gawk} parser. For example,
6686
@samp{(/x/ - /y/) > threshold} is not allowed, but @samp{((/x/) - (/y/))
6687
> threshold} parses properly.
6690
@node Boolean Ops, Conditional Exp, Typing and Comparison, Expressions
6691
@section Boolean Expressions
6692
@cindex expression, boolean
6693
@cindex boolean expressions
6694
@cindex operators, boolean
6695
@cindex boolean operators
6696
@cindex logical operations
6697
@cindex operations, logical
6698
@cindex short-circuit operators
6699
@cindex operators, short-circuit
6700
@cindex and operator
6702
@cindex not operator
6703
@cindex @code{&&} operator
6704
@cindex @code{||} operator
6705
@cindex @code{!} operator
6707
A @dfn{boolean expression} is a combination of comparison expressions or
6708
matching expressions, using the boolean operators ``or''
6709
(@samp{||}), ``and'' (@samp{&&}), and ``not'' (@samp{!}), along with
6710
parentheses to control nesting. The truth value of the boolean expression is
6711
computed by combining the truth values of the component expressions.
6712
Boolean expressions are also referred to as @dfn{logical expressions}.
6713
The terms are equivalent.
6715
Boolean expressions can be used wherever comparison and matching
6716
expressions can be used. They can be used in @code{if}, @code{while},
6717
@code{do} and @code{for} statements
6718
(@pxref{Statements, ,Control Statements in Actions}).
6719
They have numeric values (one if true, zero if false), which come into play
6720
if the result of the boolean expression is stored in a variable, or
6723
In addition, every boolean expression is also a valid pattern, so
6724
you can use one as a pattern to control the execution of rules.
6726
Here are descriptions of the three boolean operators, with examples.
6730
@item @var{boolean1} && @var{boolean2}
6731
True if both @var{boolean1} and @var{boolean2} are true. For example,
6732
the following statement prints the current input record if it contains
6733
both @samp{2400} and @samp{foo}.
6736
if ($0 ~ /2400/ && $0 ~ /foo/) print
6739
The subexpression @var{boolean2} is evaluated only if @var{boolean1}
6740
is true. This can make a difference when @var{boolean2} contains
6741
expressions that have side effects: in the case of @samp{$0 ~ /foo/ &&
6742
($2 == bar++)}, the variable @code{bar} is not incremented if there is
6743
no @samp{foo} in the record.
6745
@item @var{boolean1} || @var{boolean2}
6746
True if at least one of @var{boolean1} or @var{boolean2} is true.
6747
For example, the following statement prints all records in the input
6748
that contain @emph{either} @samp{2400} or
6749
@samp{foo}, or both.
6752
if ($0 ~ /2400/ || $0 ~ /foo/) print
6755
The subexpression @var{boolean2} is evaluated only if @var{boolean1}
6756
is false. This can make a difference when @var{boolean2} contains
6757
expressions that have side effects.
6759
@item ! @var{boolean}
6760
True if @var{boolean} is false. For example, the following program prints
6761
all records in the input file @file{BBS-list} that do @emph{not} contain the
string @samp{foo}.
6764
@c A better example would be `if (! (subscript in array)) ...' but we
6765
@c haven't done anything with arrays or `in' yet. Sigh.
6767
awk '@{ if (! ($0 ~ /foo/)) print @}' BBS-list
6772
The @samp{&&} and @samp{||} operators are called @dfn{short-circuit}
6773
operators because of the way they work. Evaluation of the full expression
6774
is ``short-circuited'' if the result can be determined part way through
its evaluation.
6777
@cindex line continuation
6778
You can continue a statement that uses @samp{&&} or @samp{||} simply
6779
by putting a newline after them. But you cannot put a newline in front
6780
of either of these operators without using backslash continuation
6781
(@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}).
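
Both points, short-circuit evaluation and continuation after an operator,
can be seen in one small program. This is a sketch under our own
assumptions: @code{bump} is a hypothetical user-defined function whose
only purpose is to record, in @code{calls}, that it was evaluated.

```shell
# Illustrative sketch; short_circuit_demo and bump are hypothetical names.
short_circuit_demo() {
  awk '
    function bump() { calls++; return 1 }
    BEGIN {
      if (0 && bump()) print "unreachable"
      print calls + 0                 # right operand of && was skipped
      if (1 || bump()) print "or is true"
      print calls + 0                 # right operand of || was skipped
      if (1 &&
          bump()) print calls         # a newline after && is allowed
    }'
}
short_circuit_demo
```

Only the last test evaluates @code{bump}, so the counter stays at zero
until then.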
6783
The actual value of an expression using the @samp{!} operator will be
6784
either one or zero, depending upon the truth value of the expression it
6787
The @samp{!} operator is often useful for changing the sense of a flag
6788
variable from false to true and back again. For example, the following
6789
program is one way to print lines in between special bracketing lines:
6792
$1 == "START" @{ interested = ! interested @}
6793
interested == 1 @{ print @}
6794
$1 == "END" @{ interested = ! interested @}
6798
The variable @code{interested}, like all @code{awk} variables, starts
6799
out initialized to zero, which is also false. When a line is seen whose
6800
first field is @samp{START}, the value of @code{interested} is toggled
6801
to true, using @samp{!}. The next rule prints lines as long as
6802
@code{interested} is true. When a line is seen whose first field is
6803
@samp{END}, @code{interested} is toggled back to false.
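
Running the program on a few lines of sample input shows what it prints.
The sketch below is ours: the helper names and the input are made up, and
the second function is a variant (not the program above) that uses
@code{next} so that the bracketing lines themselves are not printed.

```shell
# Illustrative sketch; both helper names and the sample input are hypothetical.
bracket_demo() {
  printf 'a\nSTART\nb\nc\nEND\nd\n' | awk '
    $1 == "START"   { interested = ! interested }
    interested == 1 { print }
    $1 == "END"     { interested = ! interested }'
}

# Variant: skip the bracketing lines by leaving the rule early with next.
bracket_next_demo() {
  printf 'a\nSTART\nb\nc\nEND\nd\n' | awk '
    $1 == "START" { interested = 1; next }
    $1 == "END"   { interested = 0; next }
    interested    { print }'
}

bracket_demo
bracket_next_demo
```

The first version prints the @samp{START} and @samp{END} lines along with
the lines between them; the variant prints only the lines in between.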
6805
@c We should discuss using `next' in the two rules that toggle the
@c variable, to avoid printing the bracketing lines, but that's more
@c distraction than really needed.
6810
@node Conditional Exp, Function Calls, Boolean Ops, Expressions
6811
@section Conditional Expressions
6812
@cindex conditional expression
6813
@cindex expression, conditional
6815
A @dfn{conditional expression} is a special kind of expression with
6816
three operands. It allows you to use one expression's value to select
6817
one of two other expressions.
6819
The conditional expression is the same as in the C language:
6822
@var{selector} ? @var{if-true-exp} : @var{if-false-exp}
6826
There are three subexpressions. The first, @var{selector}, is always
6827
computed first. If it is ``true'' (not zero and not null) then
6828
@var{if-true-exp} is computed next and its value becomes the value of
6829
the whole expression. Otherwise, @var{if-false-exp} is computed next
6830
and its value becomes the value of the whole expression.
6832
For example, this expression produces the absolute value of @code{x}:
6838
Each time the conditional expression is computed, exactly one of
6839
@var{if-true-exp} and @var{if-false-exp} is computed; the other is ignored.
6840
This is important when the expressions contain side effects. For example,
6841
this conditional expression examines element @code{i} of either array
6842
@code{a} or array @code{b}, and increments @code{i}.
6845
x == y ? a[i++] : b[i++]
6849
This is guaranteed to increment @code{i} exactly once, because each time
6850
only one of the two increment expressions is executed,
6851
and the other is not.
6852
@xref{Arrays, ,Arrays in @code{awk}},
6853
for more information about arrays.
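
The sketch below illustrates both uses of the conditional expression
discussed above. The helper name @code{cond_demo}, the array contents, and
the values of @code{x} and @code{y} are assumptions chosen so that the
result is easy to follow.

```shell
# Illustrative sketch; cond_demo is a hypothetical helper name.
cond_demo() {
  awk 'BEGIN {
    x = -3
    print (x > 0 ? x : -x)           # absolute value of x
    a[0] = "from a"; b[0] = "from b"
    i = 0; y = -3
    v = (x == y ? a[i++] : b[i++])   # only one of the two branches runs
    print v, i                       # i was incremented exactly once
  }'
}
cond_demo
```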
6855
@cindex differences between @code{gawk} and @code{awk}
6856
@cindex line continuation
6857
As a minor @code{gawk} extension,
6858
you can continue a statement that uses @samp{?:} simply
6859
by putting a newline after either character.
6860
However, you cannot put a newline in front
6861
of either character without using backslash continuation
6862
(@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}).
6864
@node Function Calls, Precedence, Conditional Exp, Expressions
6865
@section Function Calls
6866
@cindex function call
6867
@cindex calling a function
6869
A @dfn{function} is a name for a particular calculation. Because it has
6870
a name, you can ask for it by name at any point in the program. For
6871
example, the function @code{sqrt} computes the square root of a number.
6873
A fixed set of functions are @dfn{built-in}, which means they are
6874
available in every @code{awk} program. The @code{sqrt} function is one
6875
of these. @xref{Built-in, ,Built-in Functions}, for a list of built-in
6876
functions and their descriptions. In addition, you can define your own
6877
functions for use in your program.
6878
@xref{User-defined, ,User-defined Functions}, for how to do this.
6880
@cindex arguments in function call
6881
The way to use a function is with a @dfn{function call} expression,
6882
which consists of the function name followed immediately by a list of
6883
@dfn{arguments} in parentheses. The arguments are expressions which
6884
provide the raw materials for the function's calculations.
6885
When there is more than one argument, they are separated by commas. If
6886
there are no arguments, write just @samp{()} after the function name.
6887
Here are some examples:
6890
sqrt(x^2 + y^2) @i{one argument}
6891
atan2(y, x) @i{two arguments}
6892
rand() @i{no arguments}
6895
@strong{Do not put any space between the function name and the
6896
open-parenthesis!} A user-defined function name looks just like the name of
6897
a variable, and space would make the expression look like concatenation
6898
of a variable with an expression inside parentheses. Space before the
6899
parenthesis is harmless with built-in functions, but it is best not to get
6900
into the habit of using space to avoid mistakes with user-defined
functions.
6903
Each function expects a particular number of arguments. For example, the
6904
@code{sqrt} function must be called with a single argument, the number
6905
to take the square root of:
6908
sqrt(@var{argument})
6911
Some of the built-in functions allow you to omit the final argument.
6912
If you do so, they use a reasonable default.
6913
@xref{Built-in, ,Built-in Functions}, for full details. If arguments
6914
are omitted in calls to user-defined functions, then those arguments are
6915
treated as local variables, initialized to the empty string
6916
(@pxref{User-defined, ,User-defined Functions}).
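
As a sketch of these rules (the function @code{hyp} and the helper name
@code{func_demo} are hypothetical, introduced only for this illustration),
the program below calls two built-in functions and one user-defined
function whose third parameter is deliberately omitted by the caller:

```shell
# Illustrative sketch; func_demo and hyp are hypothetical names.
func_demo() {
  awk '
    function hyp(a, b,    tmp) {   # callers omit tmp, so it acts as a local
      tmp = a^2 + b^2
      return sqrt(tmp)
    }
    BEGIN {
      tmp = "untouched"
      print hyp(3, 4)      # two arguments supplied, tmp starts out empty
      print tmp            # the global tmp is unaffected
      print atan2(0, 1)    # built-in function with two arguments
    }'
}
func_demo
```

Note that there is no space between each function name and its
open-parenthesis.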
6918
Like every other expression, the function call has a value, which is
6919
computed by the function based on the arguments you give it. In this
6920
example, the value of @samp{sqrt(@var{argument})} is the square root of
6921
@var{argument}. A function can also have side effects, such as assigning
6922
values to certain variables or doing I/O.
6924
Here is a command to read numbers, one number per line, and print the
6925
square root of each one:
6929
$ awk '@{ print "The square root of", $1, "is", sqrt($1) @}'
6931
@print{} The square root of 1 is 1
6933
@print{} The square root of 3 is 1.73205
6935
@print{} The square root of 5 is 2.23607
6940
@node Precedence, , Function Calls, Expressions
6941
@section Operator Precedence (How Operators Nest)
6943
@cindex operator precedence
6945
@dfn{Operator precedence} determines how operators are grouped, when
6946
different operators appear close by in one expression. For example,
6947
@samp{*} has higher precedence than @samp{+}; thus, @samp{a + b * c}
6948
means to multiply @code{b} and @code{c}, and then add @code{a} to the
6949
product (i.e.@: @samp{a + (b * c)}).
6951
You can overrule the precedence of the operators by using parentheses.
6952
You can think of the precedence rules as saying where the
6953
parentheses are assumed to be if you do not write parentheses yourself. In
6954
fact, it is wise to always use parentheses whenever you have an unusual
6955
combination of operators, because other people who read the program may
6956
not remember what the precedence is in this case. You might forget,
6957
too; then you could make a mistake. Explicit parentheses will help prevent
6960
When operators of equal precedence are used together, the leftmost
6961
operator groups first, except for the assignment, conditional and
6962
exponentiation operators, which group in the opposite order.
6963
Thus, @samp{a - b + c} groups as @samp{(a - b) + c}, and
6964
@samp{a = b = c} groups as @samp{a = (b = c)}.
6966
The precedence of prefix unary operators does not matter as long as only
6967
unary operators are involved, because there is only one way to interpret
6968
them---innermost first. Thus, @samp{$++i} means @samp{$(++i)} and
6969
@samp{++$x} means @samp{++($x)}. However, when another operator follows
6970
the operand, then the precedence of the unary operators can matter.
6971
Thus, @samp{$x^2} means @samp{($x)^2}, but @samp{-x^2} means
6972
@samp{-(x^2)}, because @samp{-} has lower precedence than @samp{^}
6973
while @samp{$} has higher precedence.
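
The grouping rules above can be confirmed by printing a few expressions.
This is a sketch with assumed values; the helper name @code{prec_demo} is
hypothetical.

```shell
# Illustrative sketch; prec_demo is a hypothetical helper name.
prec_demo() {
  awk 'BEGIN {
    print 2 + 3 * 4        # * binds tighter than +
    print ((2 + 3) * 4)    # parentheses overrule precedence
    print 10 - 4 + 1       # groups as (10 - 4) + 1
    print 2 ^ 3 ^ 2        # groups as 2 ^ (3 ^ 2)
    print -2 ^ 2           # groups as -(2 ^ 2)
  }'
  # $ binds tighter than ^, so this squares the first field.
  echo 3 | awk '{ x = 1; print $x^2 }'
}
prec_demo
```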
6975
Here is a table of @code{awk}'s operators, in order from highest
6976
precedence to lowest:
6978
@c use @code in the items, looks better in TeX w/o all the quotes
6987
Increment, decrement.
6989
@cindex @code{awk} language, POSIX version
6990
@cindex POSIX @code{awk}
6992
Exponentiation. These operators group right-to-left.
6993
(The @samp{**} operator is not specified by POSIX.)
6996
Unary plus, minus, logical ``not''.
6999
Multiplication, division, modulus.
7002
Addition, subtraction.
7004
@item @r{Concatenation}
7005
No special token is used to indicate concatenation.
7006
The operands are simply written side by side.
7010
Relational, and redirection.
7011
The relational operators and the redirections have the same precedence
7012
level. Characters such as @samp{>} serve both as relationals and as
7013
redirections; the context distinguishes between the two meanings.
7015
Note that the I/O redirection operators in @code{print} and @code{printf}
7016
statements belong to the statement level, not to expressions. The
7017
redirection does not produce an expression which could be the operand of
7018
another operator. As a result, it does not make sense to use a
7019
redirection operator near another operator of lower precedence, without
7020
parentheses. Such combinations, for example @samp{print foo > a ? b : c},
7021
result in syntax errors.
7022
The correct way to write this statement is @samp{print foo > (a ? b : c)}.
7025
Matching, non-matching.
7037
Conditional. This operator groups right-to-left.
7039
@cindex @code{awk} language, POSIX version
7040
@cindex POSIX @code{awk}
7043
Assignment. These operators group right-to-left.
7044
(The @samp{**=} operator is not specified by POSIX.)
7047
@node Patterns and Actions, Statements, Expressions, Top
7048
@chapter Patterns and Actions
7049
@cindex pattern, definition of
7051
As you have already seen, each @code{awk} rule consists of
7052
a pattern with an associated action. This chapter describes how
7053
you build patterns and actions.
7056
* Pattern Overview:: What goes into a pattern.
7057
* Action Overview:: What goes into an action.
7060
@node Pattern Overview, Action Overview, Patterns and Actions, Patterns and Actions
7061
@section Pattern Elements
7063
Patterns in @code{awk} control the execution of rules: a rule is
7064
executed when its pattern matches the current input record. This
7065
section explains all about how to write patterns.
7068
* Kinds of Patterns:: A list of all kinds of patterns.
7069
* Regexp Patterns:: Using regexps as patterns.
7070
* Expression Patterns:: Any expression can be used as a pattern.
7071
* Ranges:: Pairs of patterns specify record ranges.
7072
* BEGIN/END:: Specifying initialization and cleanup rules.
7073
* Empty:: The empty pattern, which matches every record.
7076
@node Kinds of Patterns, Regexp Patterns, Pattern Overview, Pattern Overview
7077
@subsection Kinds of Patterns
7078
@cindex patterns, types of
7080
Here is a summary of the types of patterns supported in @code{awk}.
7083
@item /@var{regular expression}/
7084
A regular expression as a pattern. It matches when the text of the
7085
input record fits the regular expression.
7086
(@xref{Regexp, ,Regular Expressions}.)
7088
@item @var{expression}
7089
A single expression. It matches when its value
7090
is non-zero (if a number) or non-null (if a string).
7091
(@xref{Expression Patterns, ,Expressions as Patterns}.)
7093
@item @var{pat1}, @var{pat2}
7094
A pair of patterns separated by a comma, specifying a range of records.
7095
The range includes both the initial record that matches @var{pat1}, and
7096
the final record that matches @var{pat2}.
7097
(@xref{Ranges, ,Specifying Record Ranges with Patterns}.)
7101
@item BEGIN
@itemx END
Special patterns for you to supply start-up or clean-up actions for your
@code{awk} program.
7103
(@xref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}.)
7106
@item @var{empty}
The empty pattern matches every input record.
7107
(@xref{Empty, ,The Empty Pattern}.)
7110
@node Regexp Patterns, Expression Patterns, Kinds of Patterns, Pattern Overview
7111
@subsection Regular Expressions as Patterns
7113
We have been using regular expressions as patterns since our early examples.
7114
This kind of pattern is simply a regexp constant in the pattern part of
7115
a rule. Its meaning is @samp{$0 ~ /@var{pattern}/}.
7116
The pattern matches when the input record matches the regexp.
7120
/foo|bar|baz/ @{ buzzwords++ @}
7121
END @{ print buzzwords, "buzzwords seen" @}
7124
@node Expression Patterns, Ranges, Regexp Patterns, Pattern Overview
7125
@subsection Expressions as Patterns
7127
Any @code{awk} expression is valid as an @code{awk} pattern.
7128
The pattern matches if the expression's value is non-zero (if a
7129
number) or non-null (if a string).
7131
The expression is reevaluated each time the rule is tested against a new
7132
input record. If the expression uses fields such as @code{$1}, the
7133
value depends directly on the new input record's text; otherwise, it
7134
depends only on what has happened so far in the execution of the
7135
@code{awk} program, but that may still be useful.
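As a minimal sketch of a pattern that ignores the record's text and depends
only on what the program has done so far, the following command prints every
second record. The input lines and the variable name @code{seen} are made up
for illustration.

```shell
# The pattern looks only at the counter `seen`, never at the fields of
# the current record, yet it still selects records.
result=$(printf 'one\ntwo\nthree\nfour\n' |
    awk '++seen % 2 == 0 { print seen ": " $0 }')
echo "$result"
```

The counter is incremented each time the pattern is evaluated, that is, once
per input record.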
7137
A very common kind of expression used as a pattern is the comparison
7138
expression, using the comparison operators described in
7139
@ref{Typing and Comparison, ,Variable Typing and Comparison Expressions}.
7141
Regexp matching and non-matching are also very common expressions.
7142
The left operand of the @samp{~} and @samp{!~} operators is a string.
7143
The right operand is either a constant regular expression enclosed in
7144
slashes (@code{/@var{regexp}/}), or any expression, whose string value
7145
is used as a dynamic regular expression
7146
(@pxref{Computed Regexps, , Using Dynamic Regexps}).
7148
The following example prints the second field of each input record
7149
whose first field is precisely @samp{foo}.
7152
$ awk '$1 == "foo" @{ print $2 @}' BBS-list
7156
(There is no output, since there is no BBS site named ``foo''.)
7157
Contrast this with the following regular expression match, which would
7158
accept any record with a first field that contains @samp{foo}:
7162
$ awk '$1 ~ /foo/ @{ print $2 @}' BBS-list
7170
Boolean expressions are also commonly used as patterns.
7172
Whether the pattern matches an input record depends on whether its
subexpressions match.
7174
For example, the following command prints all records in
7175
@file{BBS-list} that contain both @samp{2400} and @samp{foo}.
7178
$ awk '/2400/ && /foo/' BBS-list
7179
@print{} fooey 555-1234 2400/1200/300 B
7182
The following command prints all records in
7183
@file{BBS-list} that contain @emph{either} @samp{2400} or @samp{foo}, or
both.
7188
$ awk '/2400/ || /foo/' BBS-list
7189
@print{} alpo-net 555-3412 2400/1200/300 A
7190
@print{} bites 555-1675 2400/1200/300 A
7191
@print{} fooey 555-1234 2400/1200/300 B
7192
@print{} foot 555-6699 1200/300 B
7193
@print{} macfoo 555-6480 1200/300 A
7194
@print{} sdace 555-3430 2400/1200/300 A
7195
@print{} sabafoo 555-2127 1200/300 C
7199
The following command prints all records in
7200
@file{BBS-list} that do @emph{not} contain the string @samp{foo}.
7204
$ awk '! /foo/' BBS-list
7205
@print{} aardvark 555-5553 1200/300 B
7206
@print{} alpo-net 555-3412 2400/1200/300 A
7207
@print{} barfly 555-7685 1200/300 A
7208
@print{} bites 555-1675 2400/1200/300 A
7209
@print{} camelot 555-0542 300 C
7210
@print{} core 555-2912 1200/300 C
7211
@print{} sdace 555-3430 2400/1200/300 A
7215
The subexpressions of a boolean operator in a pattern can be constant regular
7216
expressions, comparisons, or any other @code{awk} expressions. Range
7217
patterns are not expressions, so they cannot appear inside boolean
7218
patterns. Likewise, the special patterns @code{BEGIN} and @code{END},
7219
which never match any input record, are not expressions and cannot
7220
appear inside boolean patterns.
7222
A regexp constant as a pattern is also a special case of an expression
7223
pattern. @code{/foo/} as an expression has the value one if @samp{foo}
7224
appears in the current input record; thus, as a pattern, @code{/foo/}
7225
matches any record containing @samp{foo}.
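The value of a regexp constant used as an expression can be observed directly.
This is a small sketch; the two input lines and the variable name
@code{found} are made up for illustration.

```shell
# Assign the result of the match to a variable and print it:
# 1 when the record contains "foo", 0 otherwise.
result=$(printf 'foo bar\nbaz\n' | awk '{ found = /foo/; print found }')
echo "$result"
```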
7227
@node Ranges, BEGIN/END, Expression Patterns, Pattern Overview
7228
@subsection Specifying Record Ranges with Patterns
7230
@cindex range pattern
7231
@cindex pattern, range
7232
@cindex matching ranges of lines
7233
A @dfn{range pattern} is made of two patterns separated by a comma, of
7234
the form @samp{@var{begpat}, @var{endpat}}. It matches ranges of
7235
consecutive input records. The first pattern, @var{begpat}, controls
7236
where the range begins, and the second one, @var{endpat}, controls where
7237
it ends. For example,
7240
awk '$1 == "on", $1 == "off"'
7244
prints every record between @samp{on}/@samp{off} pairs, inclusive.
7246
A range pattern starts out by matching @var{begpat}
7247
against every input record; when a record matches @var{begpat}, the
7248
range pattern becomes @dfn{turned on}. The range pattern matches this
7249
record. As long as it stays turned on, it automatically matches every
7250
input record read. It also matches @var{endpat} against
7251
every input record; when that succeeds, the range pattern is turned
7252
off again for the following record. Then it goes back to checking
7253
@var{begpat} against each record.
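The turning on and off described above can be sketched with a small run. The
input here is made up for illustration: two @samp{on}/@samp{off} pairs with
other records around them.

```shell
# Records outside the on/off pairs (a, c, e) are not printed; the
# records that turn the range on and off are.
result=$(printf 'a\non\nb\noff\nc\non\nd\noff\ne\n' |
    awk '$1 == "on", $1 == "off"')
echo "$result"
```

After the first @samp{off}, the range pattern goes back to looking for
@samp{on}, which is why the second pair is picked up as well.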
7255
The record that turns on the range pattern and the one that turns it
7256
off both match the range pattern. If you don't want to operate on
7257
these records, you can write @code{if} statements in the rule's action
7258
to distinguish them from the records you are interested in.
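One possible way to write such a test, as a sketch with made-up input, is to
compare the record against the two delimiters inside the action:

```shell
# The range still matches "on" and "off", but the if statement prints
# only the records strictly between them.
result=$(printf 'a\non\nb\noff\nc\n' |
    awk '$1 == "on", $1 == "off" { if ($1 != "on" && $1 != "off") print }')
echo "$result"
```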
7260
It is possible for a pattern to be turned both on and off by the same
7261
record, if the record satisfies both conditions. Then the action is
7262
executed for just that record.
7264
For example, suppose you have text between two identical markers (say
7265
the @samp{%} symbol) that you wish to ignore. You might try to
7266
combine a range pattern that describes the delimited text with the
7267
@code{next} statement
7268
(not discussed yet, @pxref{Next Statement, , The @code{next} Statement}),
7269
which causes @code{awk} to skip any further processing of the current
7270
record and start over again with the next input record. Such a program
looks like this:
7274
/^%$/,/^%$/ @{ next @}
7279
@cindex skipping lines between markers
7280
This program fails because the range pattern is both turned on and turned off
7281
by the first line with just a @samp{%} on it. To accomplish this task, you
7282
must write the program this way, using a flag:
7285
/^%$/ @{ skip = ! skip; next @}
7286
skip == 1 @{ next @} # skip lines with `skip' set
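The flag technique can be tried on a small input. In this sketch the input
lines and the final printing rule are additions for illustration; only the
first two rules come from the program above.

```shell
# Lines between the two % markers are dropped, along with the markers.
# printf needs %% to emit a literal % character.
result=$(printf 'keep 1\n%%\ndrop\n%%\nkeep 2\n' |
    awk '/^%$/     { skip = ! skip; next }
         skip == 1 { next }
                   { print }')
echo "$result"
```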
7289
Note that in a range pattern, the @samp{,} has the lowest precedence
7290
(is evaluated last) of all the operators. Thus, for example, the
7291
following program attempts to combine a range pattern with another,
simpler test.
7295
echo Yes | awk '/1/,/2/ || /Yes/'
7298
The author of this program intended it to mean @samp{(/1/,/2/) || /Yes/}.
7299
However, @code{awk} interprets this as @samp{/1/, (/2/ || /Yes/)}.
7300
This cannot be changed or worked around; range patterns do not combine
7301
with other patterns.
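The grouping can be checked by running the command. This is a sketch; the
second input is made up to show the end pattern at work.

```shell
# The || binds tighter than the comma, so /Yes/ becomes part of the end
# pattern. A lone "Yes" record never turns the range on: no output.
lone=$(echo Yes | awk '/1/,/2/ || /Yes/')
echo "lone Yes record: [$lone]"

# Here "1" turns the range on and "Yes" turns it off again.
ranged=$(printf '1\nx\nYes\ny\n' | awk '/1/,/2/ || /Yes/')
echo "$ranged"
```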
7303
@node BEGIN/END, Empty, Ranges, Pattern Overview
7304
@subsection The @code{BEGIN} and @code{END} Special Patterns
7306
@cindex @code{BEGIN} special pattern
7307
@cindex pattern, @code{BEGIN}
7308
@cindex @code{END} special pattern
7309
@cindex pattern, @code{END}
7310
@code{BEGIN} and @code{END} are special patterns. They are not used to
7311
match input records. Rather, they supply start-up or
7312
clean-up actions for your @code{awk} script.
7315
* Using BEGIN/END:: How and why to use BEGIN/END rules.
7316
* I/O And BEGIN/END:: I/O issues in BEGIN/END rules.
7319
@node Using BEGIN/END, I/O And BEGIN/END, BEGIN/END, BEGIN/END
7320
@subsubsection Startup and Cleanup Actions
7322
A @code{BEGIN} rule is executed, once, before the first input record
7323
has been read. An @code{END} rule is executed, once, after all the
7324
input has been read. For example:
$ awk '
> BEGIN @{ print "Analysis of \"foo\"" @}
> /foo/ @{ ++n @}
> END   @{ print "\"foo\" appears " n " times." @}' BBS-list
7332
@print{} Analysis of "foo"
7333
@print{} "foo" appears 4 times.
7337
This program finds the number of records in the input file @file{BBS-list}
7338
that contain the string @samp{foo}. The @code{BEGIN} rule prints a title
7339
for the report. There is no need to use the @code{BEGIN} rule to
7340
initialize the counter @code{n} to zero, as @code{awk} does this
7341
automatically (@pxref{Variables}).
7343
The second rule increments the variable @code{n} every time a
7344
record containing the pattern @samp{foo} is read. The @code{END} rule
7345
prints the value of @code{n} at the end of the run.
7347
The special patterns @code{BEGIN} and @code{END} cannot be used in ranges
7348
or with boolean operators (indeed, they cannot be used with any operators).
7350
An @code{awk} program may have multiple @code{BEGIN} and/or @code{END}
7351
rules. They are executed in the order they appear, all the @code{BEGIN}
7352
rules at start-up and all the @code{END} rules at termination.
7353
@code{BEGIN} and @code{END} rules may be intermixed with other rules.
7354
This feature was added in the 1987 version of @code{awk}, and is included
7355
in the POSIX standard. The original (1978) version of @code{awk}
7356
required you to put the @code{BEGIN} rule at the beginning of the
7357
program, and the @code{END} rule at the end, and only allowed one of
7358
each. This is no longer required, but it is a good idea in terms of
7359
program organization and readability.
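The execution order can be seen in a short sketch. The rule bodies and the
single input record are made up for illustration.

```shell
# BEGIN and END rules are deliberately interleaved with the main rule.
result=$(printf 'data\n' | awk '
    BEGIN { print "begin 1" }
          { print "record: " $0 }
    END   { print "end 1" }
    BEGIN { print "begin 2" }
    END   { print "end 2" }')
echo "$result"
```

Both @code{BEGIN} rules run before the record is processed, and both
@code{END} rules run afterwards, each group in the order written.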
7361
Multiple @code{BEGIN} and @code{END} rules are useful for writing
7362
library functions, since each library file can have its own @code{BEGIN} and/or
7363
@code{END} rule to do its own initialization and/or cleanup. Note that
7364
the order in which library functions are named on the command line
7365
controls the order in which their @code{BEGIN} and @code{END} rules are
7366
executed. Therefore you have to be careful to write such rules in
7367
library files so that the order in which they are executed doesn't matter.
7368
@xref{Options, ,Command Line Options}, for more information on
7369
using library functions.
7370
@xref{Library Functions, ,A Library of @code{awk} Functions},
7371
for a number of useful library functions.
7374
If an @code{awk} program only has a @code{BEGIN} rule, and no other
7375
rules, then the program exits after the @code{BEGIN} rule has been run.
7376
(The original version of @code{awk} used to keep reading and ignoring input
7377
until end of file was seen.) However, if an @code{END} rule exists,
7378
then the input will be read, even if there are no other rules in
7379
the program. This is necessary in case the @code{END} rule checks the
7380
@code{FNR} and @code{NR} variables (d.c.).
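A small sketch of the difference, using made-up input on standard input:

```shell
# Only a BEGIN rule: the program finishes without needing the input.
begin_only=$(printf 'a\nb\nc\n' | awk 'BEGIN { print "no input needed" }')
echo "$begin_only"

# Only an END rule: the input is read so that NR is meaningful.
end_only=$(printf 'a\nb\nc\n' | awk 'END { print NR }')
echo "$end_only"
```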
7382
@code{BEGIN} and @code{END} rules must have actions; there is no default
7383
action for these rules since there is no current record when they run.
7385
@node I/O And BEGIN/END, , Using BEGIN/END, BEGIN/END
7386
@subsubsection Input/Output from @code{BEGIN} and @code{END} Rules
7388
@cindex I/O from @code{BEGIN} and @code{END}
7389
There are several (sometimes subtle) issues involved when doing I/O
7390
from a @code{BEGIN} or @code{END} rule.
7392
The first has to do with the value of @code{$0} in a @code{BEGIN}
7393
rule. Since @code{BEGIN} rules are executed before any input is read,
7394
there simply is no input record, and therefore no fields, when
7395
executing @code{BEGIN} rules. References to @code{$0} and the fields
7396
yield a null string or zero, depending upon the context. One way
7397
to give @code{$0} a real value is to execute a @code{getline} command
7398
without a variable (@pxref{Getline, ,Explicit Input with @code{getline}}).
7399
Another way is to simply assign a value to it.
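As a sketch of the assignment approach (the string assigned is made up for
illustration), giving @code{$0} a value in a @code{BEGIN} rule causes it to
be split into fields in the usual way:

```shell
# No input is read; $0 gets its value from the assignment alone.
result=$(awk 'BEGIN {
    $0 = "alpha beta gamma"
    print NF ": " $2
}')
echo "$result"
```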
7401
@cindex differences between @code{gawk} and @code{awk}
7402
The second point is similar to the first, but from the other direction.
7403
Inside an @code{END} rule, what is the value of @code{$0} and @code{NF}?
7404
Traditionally, due largely to implementation issues, @code{$0} and
7405
@code{NF} were @emph{undefined} inside an @code{END} rule.
7406
The POSIX standard specified that @code{NF} was available in an @code{END}
7407
rule, containing the number of fields from the last input record.
7408
Due most probably to an oversight, the standard does not say that @code{$0}
7409
is also preserved, although logically one would think that it should be.
7410
In fact, @code{gawk} does preserve the value of @code{$0} for use in
7411
@code{END} rules. Be aware, however, that Unix @code{awk}, and possibly
7412
other implementations, do not.
7414
The third point follows from the first two. What is the meaning of
7415
@samp{print} inside a @code{BEGIN} or @code{END} rule? The meaning is
7416
the same as always, @samp{print $0}. If @code{$0} is the null string,
7417
then this prints an empty line. Many long-time @code{awk} programmers
7418
use @samp{print} in @code{BEGIN} and @code{END} rules, to mean
7419
@samp{@w{print ""}}, relying on @code{$0} being null. While you might
7420
generally get away with this in @code{BEGIN} rules, in @code{gawk} at
7421
least, it is a very bad idea in @code{END} rules. It is also poor
7422
style, since if you want an empty line in the output, you
7423
should say so explicitly in your program.
7425
@node Empty, , BEGIN/END, Pattern Overview
7426
@subsection The Empty Pattern
7428
@cindex empty pattern
7429
@cindex pattern, empty
7430
An empty (i.e.@: non-existent) pattern is considered to match @emph{every}
7431
input record. For example, the program:
7434
awk '@{ print $1 @}' BBS-list
7438
prints the first field of every record.
7440
@node Action Overview, , Pattern Overview, Patterns and Actions
7441
@section Overview of Actions
7442
@cindex action, definition of
7443
@cindex curly braces
7444
@cindex action, curly braces
7445
@cindex action, separating statements
7447
An @code{awk} program or script consists of a series of
7448
rules and function definitions, interspersed. (Functions are
7449
described later. @xref{User-defined, ,User-defined Functions}.)
7451
A rule contains a pattern and an action, either of which (but not
both) may be
7453
omitted. The purpose of the @dfn{action} is to tell @code{awk} what to do
7454
once a match for the pattern is found. Thus, in outline, an @code{awk}
7455
program generally looks like this:
7458
@r{[}@var{pattern}@r{]} @r{[}@{ @var{action} @}@r{]}
7459
@r{[}@var{pattern}@r{]} @r{[}@{ @var{action} @}@r{]}
7461
function @var{name}(@var{args}) @{ @dots{} @}
7465
An action consists of one or more @code{awk} @dfn{statements}, enclosed
7466
in curly braces (@samp{@{} and @samp{@}}). Each statement specifies one
7467
thing to be done. The statements are separated by newlines or
semicolons.
7470
The curly braces around an action must be used even if the action
7471
contains only one statement, or even if it contains no statements at
7472
all. However, if you omit the action entirely, omit the curly braces as
7473
well. An omitted action is equivalent to @samp{@{ print $0 @}}.
7476
/foo/ @{ @} # match foo, do nothing - empty action
7477
/foo/ # match foo, print the record - omitted action
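The difference between the two rules can be observed with a short run. This
is a sketch with made-up input.

```shell
# Empty action: the record matches, but nothing is done with it.
empty=$(printf 'foo 1\nbar 2\n' | awk '/foo/ { }')
echo "empty action:   [$empty]"

# Omitted action: the matching record is printed.
omitted=$(printf 'foo 1\nbar 2\n' | awk '/foo/')
echo "omitted action: [$omitted]"
```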
7480
Here are the kinds of statements supported in @code{awk}:
7484
Expressions, which can call functions or assign values to variables
7485
(@pxref{Expressions}). Executing
7486
this kind of statement simply computes the value of the expression.
7487
This is useful when the expression has side effects
7488
(@pxref{Assignment Ops, ,Assignment Expressions}).
7491
Control statements, which specify the control flow of @code{awk}
7492
programs. The @code{awk} language gives you C-like constructs
7493
(@code{if}, @code{for}, @code{while}, and @code{do}) as well as a few
7494
special ones (@pxref{Statements, ,Control Statements in Actions}).
7497
Compound statements, which consist of one or more statements enclosed in
7498
curly braces. A compound statement is used in order to put several
7499
statements together in the body of an @code{if}, @code{while}, @code{do}
7500
or @code{for} statement.
7503
Input statements, using the @code{getline} command
7504
(@pxref{Getline, ,Explicit Input with @code{getline}}), the @code{next}
7505
statement (@pxref{Next Statement, ,The @code{next} Statement}),
7506
and the @code{nextfile} statement
7507
(@pxref{Nextfile Statement, ,The @code{nextfile} Statement}).
7510
Output statements, @code{print} and @code{printf}.
7511
@xref{Printing, ,Printing Output}.
7514
Deletion statements, for deleting array elements.
7515
@xref{Delete, ,The @code{delete} Statement}.
7519
The next chapter covers control statements in detail.
7522
@node Statements, Built-in Variables, Patterns and Actions, Top
7523
@chapter Control Statements in Actions
7524
@cindex control statement
7526
@dfn{Control statements} such as @code{if}, @code{while}, and so on
7527
control the flow of execution in @code{awk} programs. Most of the
7528
control statements in @code{awk} are patterned on similar statements in
C.
7531
All the control statements start with special keywords such as @code{if}
7532
and @code{while}, to distinguish them from simple expressions.
7534
@cindex compound statement
7535
@cindex statement, compound
7536
Many control statements contain other statements; for example, the
7537
@code{if} statement contains another statement which may or may not be
7538
executed. The contained statement is called the @dfn{body}. If you
7539
want to include more than one statement in the body, group them into a
7540
single @dfn{compound statement} with curly braces, separating them with
7541
newlines or semicolons.
7544
* If Statement:: Conditionally execute some @code{awk}
 statements.
7546
* While Statement:: Loop until some condition is satisfied.
7547
* Do Statement:: Do specified action while looping until some
7548
condition is satisfied.
7549
* For Statement:: Another looping statement, that provides
7550
initialization and increment clauses.
7551
* Break Statement:: Immediately exit the innermost enclosing loop.
7552
* Continue Statement:: Skip to the end of the innermost enclosing
 loop.
7554
* Next Statement:: Stop processing the current input record.
7555
* Nextfile Statement:: Stop processing the current file.
7556
* Exit Statement:: Stop execution of @code{awk}.
7559
@node If Statement, While Statement, Statements, Statements
7560
@section The @code{if}-@code{else} Statement
7562
@cindex @code{if}-@code{else} statement
7563
The @code{if}-@code{else} statement is @code{awk}'s decision-making
7564
statement. It looks like this:
7567
if (@var{condition}) @var{then-body} @r{[}else @var{else-body}@r{]}
7571
The @var{condition} is an expression that controls what the rest of the
7572
statement will do. If @var{condition} is true, @var{then-body} is
7573
executed; otherwise, @var{else-body} is executed.
7574
The @code{else} part of the statement is
7575
optional. The condition is considered false if its value is zero or
7576
the null string, and true otherwise.
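Here is a minimal runnable sketch of an @code{if}-@code{else} statement. The
input numbers and the assignment of @code{$1} to @code{x} are made up so that
the command can be run on its own.

```shell
# Classify each input number as even or odd.
result=$(printf '4\n7\n' | awk '{
    x = $1
    if (x % 2 == 0)
        print "x is even"
    else
        print "x is odd"
}')
echo "$result"
```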
7587
In this example, if the expression @samp{x % 2 == 0} is true (that is,
7588
the value of @code{x} is evenly divisible by two), then the first @code{print}
7589
statement is executed, otherwise the second @code{print} statement is
executed.
7592
If the @code{else} appears on the same line as @var{then-body}, and
7593
@var{then-body} is not a compound statement (i.e.@: not surrounded by
7594
curly braces), then a semicolon must separate @var{then-body} from
7595
@code{else}. To illustrate this, let's rewrite the previous example:
7598
if (x % 2 == 0) print "x is even"; else
        print "x is odd"
7603
If you forget the @samp{;}, @code{awk} won't be able to interpret the
7604
statement, and you will get a syntax error.
7606
We would not actually write this example this way, because a human
7607
reader might fail to see the @code{else} if it were not the first thing
on its line.
7610
@node While Statement, Do Statement, If Statement, Statements
7611
@section The @code{while} Statement
7612
@cindex @code{while} statement
7614
@cindex body of a loop
7616
In programming, a @dfn{loop} means a part of a program that can
7617
be executed two or more times in succession.
7619
The @code{while} statement is the simplest looping statement in
7620
@code{awk}. It repeatedly executes a statement as long as a condition is
7621
true. It looks like this:
7624
while (@var{condition})
  @var{body}
7629
Here @var{body} is a statement that we call the @dfn{body} of the loop,
7630
and @var{condition} is an expression that controls how long the loop
keeps running.
7633
The first thing the @code{while} statement does is test @var{condition}.
7634
If @var{condition} is true, it executes the statement @var{body}.
7636
(The @var{condition} is true when the value
7637
is not zero and not a null string.)
7639
After @var{body} has been executed,
7640
@var{condition} is tested again, and if it is still true, @var{body} is
7641
executed again. This process repeats until @var{condition} is no longer
7642
true. If @var{condition} is initially false, the body of the loop is
7643
never executed, and @code{awk} continues with the statement following
the loop.
7646
This example prints the first three fields of each record, one per line.
7654
awk '@{ i = 1
       while (i <= 3) @{
           print $i
           i++
       @}
@}' inventory-shipped
7658
Here the body of the loop is a compound statement enclosed in braces,
7659
containing two statements.
7661
The loop works like this: first, the value of @code{i} is set to one.
7662
Then, the @code{while} tests whether @code{i} is less than or equal to
7663
three. This is true when @code{i} equals one, so the @code{i}-th
7664
field is printed. Then the @samp{i++} increments the value of @code{i}
7665
and the loop repeats. The loop terminates when @code{i} reaches four.
7667
As you can see, a newline is not required between the condition and the
7668
body; but using one makes the program clearer unless the body is a
7669
compound statement or is very simple. The newline after the open-brace
7670
that begins the compound statement is not required either, but the
7671
program would be harder to read without it.
7673
@node Do Statement, For Statement, While Statement, Statements
7674
@section The @code{do}-@code{while} Statement
7676
The @code{do} loop is a variation of the @code{while} looping statement.
7677
The @code{do} loop executes the @var{body} once, and then repeats @var{body}
7678
as long as @var{condition} is true. It looks like this:
7683
do
  @var{body}
while (@var{condition})
7686
Even if @var{condition} is false at the start, @var{body} is executed at
7687
least once (and only once, unless executing @var{body} makes
7688
@var{condition} true). Contrast this with the corresponding
7689
@code{while} statement:
7692
while (@var{condition})
  @var{body}
7697
This statement does not execute @var{body} even once if @var{condition}
7698
is false to begin with.
7700
Here is an example of a @code{do} statement:
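The following is a sketch of such a loop, written to agree with the
description that follows. The one-record input is made up so that the command
can be run on its own.

```shell
# The body runs once before the condition is first tested, then repeats
# while i <= 10, for ten passes in all.
result=$(printf 'hello\n' | awk '{
    i = 1
    do {
        print $0
        i++
    } while (i <= 10)
}')
echo "$result"
```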
7712
This program prints each input record ten times. It isn't a very
7713
realistic example, since in this case an ordinary @code{while} would do
7714
just as well. But this reflects actual experience; there is only
7715
occasionally a real use for a @code{do} statement.
7717
@node For Statement, Break Statement, Do Statement, Statements
7718
@section The @code{for} Statement
7719
@cindex @code{for} statement
7721
The @code{for} statement makes it more convenient to count iterations of a
7722
loop. The general form of the @code{for} statement looks like this:
7725
for (@var{initialization}; @var{condition}; @var{increment})
  @var{body}
7730
The @var{initialization}, @var{condition} and @var{increment} parts are
7731
arbitrary @code{awk} expressions, and @var{body} stands for any
7732
@code{awk} statement.
7734
The @code{for} statement starts by executing @var{initialization}.
7736
Then, as long as @var{condition} is true, it repeatedly executes @var{body} and then
7737
@var{increment}. Typically @var{initialization} sets a variable to
7738
either zero or one, @var{increment} adds one to it, and @var{condition}
7739
compares it against the desired number of iterations.
7741
Here is an example of a @code{for} statement:
7745
awk '@{ for (i = 1; i <= 3; i++)
          print $i
@}' inventory-shipped
7752
This prints the first three fields of each input record, one field per
line.
7755
You cannot set more than one variable in the
7756
@var{initialization} part unless you use a multiple assignment statement
7757
such as @samp{x = y = 0}, which is possible only if all the initial values
7758
are equal. (But you can initialize additional variables by writing
7759
their assignments as separate statements preceding the @code{for} loop.)
7761
The same is true of the @var{increment} part; to increment additional
7762
variables, you must write separate statements at the end of the loop.
7763
The C compound expression, using C's comma operator, would be useful in
7764
this context, but it is not supported in @code{awk}.
7766
Most often, @var{increment} is an increment expression, as in the
7767
example above. But this is not required; it can be any expression
7768
whatever. For example, this statement prints all the powers of two
7769
between one and 100:
7772
for (i = 1; i <= 100; i *= 2)
  print i
7776
Any of the three expressions in the parentheses following the @code{for} may
7777
be omitted if there is nothing to be done there. Thus, @w{@samp{for (; x
7778
> 0;)}} is equivalent to @w{@samp{while (x > 0)}}. If the
7779
@var{condition} is omitted, it is treated as true, effectively
yielding an @dfn{infinite loop} (i.e.@: a loop that will never
terminate).
7783
In most cases, a @code{for} loop is an abbreviation for a @code{while}
7784
loop, as shown here:
7787
@var{initialization}
7788
while (@var{condition}) @{
  @var{body}
  @var{increment}
@}
7795
The only exception is when the @code{continue} statement
7796
(@pxref{Continue Statement, ,The @code{continue} Statement}) is used
7797
inside the loop; changing a @code{for} statement to a @code{while}
7798
statement in this way can change the effect of the @code{continue}
7799
statement inside the loop.
7801
There is an alternate version of the @code{for} loop, for iterating over
7802
all the indices of an array:
7806
for (i in array)
    @var{do something with} array[i]
7810
@xref{Scanning an Array, ,Scanning All Elements of an Array},
7811
for more information on this version of the @code{for} loop.
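As a brief sketch of this form of the loop, the following command counts how
often each word appears in a made-up input. The array name @code{count} is
chosen for illustration. The output is piped through @code{sort} because the
order in which the indices are visited should not be relied upon.

```shell
# Tally the first field of every record, then visit every index.
result=$(printf 'b\na\nb\nc\nb\n' | awk '
    { count[$1]++ }
    END { for (word in count) print word, count[word] }' | sort)
echo "$result"
```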
7813
The @code{awk} language has a @code{for} statement in addition to a
7814
@code{while} statement because often a @code{for} loop is both less work to
7815
type and more natural to think of. Counting the number of iterations is
7816
very common in loops. It can be easier to think of this counting as part
7817
of looping rather than as something to do inside the loop.
7819
The next section has more complicated examples of @code{for} loops.
7821
@node Break Statement, Continue Statement, For Statement, Statements
7822
@section The @code{break} Statement
7823
@cindex @code{break} statement
7824
@cindex loops, exiting
7826
The @code{break} statement jumps out of the innermost @code{for},
7827
@code{while}, or @code{do} loop that encloses it. The
7828
following example finds the smallest divisor of any integer, and also
7829
identifies prime numbers:
7832
awk '# find smallest divisor of num
     @{ num = $1
       for (div = 2; div*div <= num; div++)
         if (num % div == 0)
           break
       if (num % div == 0)
         printf "Smallest divisor of %d is %d\n", num, div
       else
         printf "%d is prime\n", num
     @}'
7844
When the remainder is zero in the first @code{if} statement, @code{awk}
7845
immediately @dfn{breaks out} of the containing @code{for} loop. This means
7846
that @code{awk} proceeds immediately to the statement following the loop
7847
and continues processing. (This is very different from the @code{exit}
7848
statement, which stops the entire @code{awk} program.
7849
@xref{Exit Statement, ,The @code{exit} Statement}.)
7851
Here is another program equivalent to the previous one. It illustrates how
7852
the @var{condition} of a @code{for} or @code{while} could just as well be
7853
replaced with a @code{break} inside an @code{if}:
7857
awk '# find smallest divisor of num
     @{ num = $1
       for (div = 2; ; div++) @{
         if (num % div == 0) @{
           printf "Smallest divisor of %d is %d\n", num, div
           break
         @}
         if (div*div > num) @{
           printf "%d is prime\n", num
           break
         @}
       @}
@}'
7873
@cindex @code{break}, outside of loops
7874
@cindex historical features
7875
@cindex @code{awk} language, POSIX version
7876
@cindex POSIX @code{awk}
7878
As described above, the @code{break} statement has no meaning when
7879
used outside the body of a loop. However, although it was never documented,
7880
historical implementations of @code{awk} have treated the @code{break}
7881
statement outside of a loop as if it were a @code{next} statement
7882
(@pxref{Next Statement, ,The @code{next} Statement}).
7883
Recent versions of Unix @code{awk} no longer allow this usage.
7884
@code{gawk} will support this use of @code{break} only if @samp{--traditional}
7885
has been specified on the command line
7886
(@pxref{Options, ,Command Line Options}).
7887
Otherwise, it will be treated as an error, since the POSIX standard
7888
specifies that @code{break} should only be used inside the body of a
loop.
7891
@node Continue Statement, Next Statement, Break Statement, Statements
7892
@section The @code{continue} Statement
7894
@cindex @code{continue} statement
7895
The @code{continue} statement, like @code{break}, is used only inside
7896
@code{for}, @code{while}, and @code{do} loops. It skips
7897
over the rest of the loop body, causing the next cycle around the loop
7898
to begin immediately. Contrast this with @code{break}, which jumps out
7899
of the loop altogether.
7901
@c The point of this program was to illustrate the use of continue with
7902
@c a while loop. But Karl Berry points out that that is done adequately
7903
@c below, and that this example is very un-awk-like. So for now, we'll
7906
In Texinfo source files, text that the author wishes to ignore can be
7907
enclosed between lines that start with @samp{@@ignore} and end with
7908
@samp{@@end ignore}. Here is a program that strips out lines between
7909
@samp{@@ignore} and @samp{@@end ignore} pairs.
7913
BEGIN @{
    while (getline > 0) @{
        if (/^@@ignore/)
            ignoring = 1
        else if (/^@@end[ \t]+ignore/) @{
            ignoring = 0
            continue
        @}
        if (ignoring)
            continue
        print
    @}
@}
7927
When an @samp{@@ignore} is seen, the @code{ignoring} flag is set to one (true).
7928
When @samp{@@end ignore} is seen, the flag is reset to zero (false). As long
7929
as the flag is true, the input record is not printed, because the
7930
@code{continue} restarts the @code{while} loop, skipping over the @code{print}
statement.
7934
@c How could this program be written to make better use of the awk language?
7937
The @code{continue} statement in a @code{for} loop directs @code{awk} to
7938
skip the rest of the body of the loop, and resume execution with the
7939
increment-expression of the @code{for} statement. The following program
7940
illustrates this fact:
7944
awk 'BEGIN @{
     for (x = 0; x <= 20; x++) @{
         if (x == 5)
             continue
         printf "%d ", x
     @}
     print ""
@}'
7954
This program prints all the numbers from zero to 20, except for five, for
7955
which the @code{printf} is skipped. Since the increment @samp{x++}
7956
is not skipped, @code{x} does not remain stuck at five. Contrast the
7957
@code{for} loop above with this @code{while} loop:
7973
This program loops forever once @code{x} gets to five.
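The stuck counter can be observed safely with a variant that gives up after a
few passes. In this sketch the @code{passes} counter and its limit of three
are additions for illustration only; without them the loop would not
terminate.

```shell
# In a while loop, continue jumps straight back to the condition, so the
# x++ at the bottom of the body is skipped and x stays at five.
result=$(awk 'BEGIN {
    x = 0
    while (x <= 20) {
        if (x == 5) {
            if (++passes >= 3) {   # guard added for this sketch only
                print "\nx is stuck at " x
                break
            }
            continue               # skips the x++ below
        }
        printf "%d ", x
        x++
    }
}')
echo "$result"
```

Moving the increment ahead of the @code{continue}, or using a @code{for}
loop, avoids the problem.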
7975
@cindex @code{continue}, outside of loops
7976
@cindex historical features
7977
@cindex @code{awk} language, POSIX version
7978
@cindex POSIX @code{awk}
7980
As described above, the @code{continue} statement has no meaning when
7981
used outside the body of a loop. However, although it was never documented,
7982
historical implementations of @code{awk} have treated the @code{continue}
7983
statement outside of a loop as if it were a @code{next} statement
7984
(@pxref{Next Statement, ,The @code{next} Statement}).
7985
Recent versions of Unix @code{awk} no longer allow this usage.
7986
@code{gawk} will support this use of @code{continue} only if
7987
@samp{--traditional} has been specified on the command line
7988
(@pxref{Options, ,Command Line Options}).
7989
Otherwise, it will be treated as an error, since the POSIX standard
7990
specifies that @code{continue} should only be used inside the body of a
7993
@node Next Statement, Nextfile Statement, Continue Statement, Statements
7994
@section The @code{next} Statement
7995
@cindex @code{next} statement
7997
The @code{next} statement forces @code{awk} to immediately stop processing
7998
the current record and go on to the next record. This means that no
7999
further rules are executed for the current record. The rest of the
8000
current rule's action is not executed either.
8002
Contrast this with the effect of the @code{getline} function
8003
(@pxref{Getline, ,Explicit Input with @code{getline}}). That too causes
8004
@code{awk} to read the next record immediately, but it does not alter the
8005
flow of control in any way. So the rest of the current action executes
8006
with a new input record.
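
The difference can be seen in a small sketch. It is wrapped in shell
commands so that it can be run as is; the shell variables
@code{with_next} and @code{with_getline} and the three-line input are
made up for the illustration.

```shell
# next: the rest of the action and all later rules are skipped for record 1.
with_next=$(printf 'one\ntwo\nthree\n' | awk '
NR == 1 { next; print "never printed" }
        { print NR ": " $0 }')

# getline: the same action keeps running, now with the second record in $0.
with_getline=$(printf 'one\ntwo\nthree\n' | awk '
NR == 1 { getline; print "after getline: " $0 }
        { print NR ": " $0 }')

printf '%s\n\n%s\n' "$with_next" "$with_getline"
```

With @code{next}, the first record never reaches the second rule. With
@code{getline}, the first action continues with the second record, and the
second rule then sees that same record as well.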
8008
At the highest level, @code{awk} program execution is a loop that reads
8009
an input record and then tests each rule's pattern against it. If you
8010
think of this loop as a @code{for} statement whose body contains the
8011
rules, then the @code{next} statement is analogous to a @code{continue}
8012
statement: it skips to the end of the body of this implicit loop, and
8013
executes the increment (which reads another record).
8015
For example, if your @code{awk} program works only on records with four
8016
fields, and you don't want it to fail when given bad input, you might
8017
use this rule near the beginning of the program:
8022
err = sprintf("%s:%d: skipped: NF != 4\n", FILENAME, FNR)
8023
print err > "/dev/stderr"
8030
so that the following rules will not see the bad record. The error
8031
message is redirected to the standard error output stream, as error
8032
messages should be. @xref{Special Files, ,Special File Names in @code{gawk}}.
8034
@cindex @code{awk} language, POSIX version
8035
@cindex POSIX @code{awk}
8036
According to the POSIX standard, the behavior is undefined if
8037
the @code{next} statement is used in a @code{BEGIN} or @code{END} rule.
8038
@code{gawk} will treat it as a syntax error.
8039
Although POSIX permits it,
8040
some other @code{awk} implementations don't allow the @code{next}
8041
statement inside function bodies
8042
(@pxref{User-defined, ,User-defined Functions}).
8043
Just as with any other @code{next} statement, a @code{next} inside a
8044
function body reads the next record and starts processing it with the
8045
first rule in the program.
8047
If the @code{next} statement causes the end of the input to be reached,
8048
then the code in any @code{END} rules will be executed.
8049
@xref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}.
8051
@node Nextfile Statement, Exit Statement, Next Statement, Statements
8052
@section The @code{nextfile} Statement
8053
@cindex @code{nextfile} statement
8054
@cindex differences between @code{gawk} and @code{awk}
8056
@code{gawk} provides the @code{nextfile} statement,
8057
which is similar to the @code{next} statement.
8058
However, instead of abandoning processing of the current record, the
8059
@code{nextfile} statement instructs @code{gawk} to stop processing the
8062
Upon execution of the @code{nextfile} statement, @code{FILENAME} is
8063
updated to the name of the next data file listed on the command line,
8064
@code{FNR} is reset to one, @code{ARGIND} is incremented, and processing
8065
starts over with the first rule in the program. @xref{Built-in Variables}.
8067
If the @code{nextfile} statement causes the end of the input to be reached,
8068
then the code in any @code{END} rules will be executed.
8069
@xref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}.
8071
The @code{nextfile} statement is a @code{gawk} extension; it is not
8072
(currently) available in any other @code{awk} implementation.
8073
@xref{Nextfile Function, ,Implementing @code{nextfile} as a Function},
8074
for a user-defined function you can use to simulate the @code{nextfile}
8077
The @code{nextfile} statement is useful if you have many data
8078
files to process, and you expect that you
8079
would not want to process every record in every file.
8080
Normally, in order to move on to
8081
the next data file, you would have to continue scanning the unwanted
8082
records. The @code{nextfile} statement accomplishes this much more
8085
@cindex @code{next file} statement
8086
@strong{Caution:} Versions of @code{gawk} prior to 3.0 used two
8087
words (@samp{next file}) for the @code{nextfile} statement. This was
8088
changed in 3.0 to one word, since the treatment of @samp{file} was
8089
inconsistent. When it appeared after @code{next}, it was a keyword.
8090
Otherwise, it was a regular identifier. The old usage is still
8091
accepted. However, @code{gawk} will generate a warning message, and
8092
support for @code{next file} will be discontinued in a
8093
future version of @code{gawk}.
8095
@node Exit Statement, , Nextfile Statement, Statements
8096
@section The @code{exit} Statement
8098
@cindex @code{exit} statement
8099
The @code{exit} statement causes @code{awk} to immediately stop
8100
executing the current rule and to stop processing input; any remaining input
8101
is ignored. It looks like this:
8104
exit @r{[}@var{return code}@r{]}
8107
If an @code{exit} statement is executed from a @code{BEGIN} rule, the
8108
program stops processing everything immediately. No input records are
8109
read. However, if an @code{END} rule is present, it is executed
8110
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}).
8112
If @code{exit} is used as part of an @code{END} rule, it causes
8113
the program to stop immediately.
8115
An @code{exit} statement that is not part
8116
of a @code{BEGIN} or @code{END} rule stops the execution of any further
8117
automatic rules for the current record, skips reading any remaining input
8118
records, and executes
8119
the @code{END} rule if there is one.
8121
If you do not want the @code{END} rule to do its job in this case, you
8122
can set a variable to non-zero before the @code{exit} statement, and check
8123
that variable in the @code{END} rule.
8124
@xref{Assert Function, ,Assertions},
8125
for an example that does this.
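
A minimal sketch of this idiom follows. The variable name @code{failed},
the status value two, and the sample input are assumptions made for the
illustration; the shell wrapper only captures the output and the exit
status so that they can be inspected.

```shell
status=0
out=$(printf 'ok\nbad\nok\n' | awk '
/bad/ { failed = 1; exit 2 }   # leave the main loop; END still runs
END {
    if (failed)
        exit 2                 # skip the normal END work, keep the status
    print "all records were fine"
}') || status=$?
printf 'output: "%s", status: %s\n' "$out" "$status"
```

Here the @code{END} rule checks the flag first, so its summary line is
printed only when no record triggered the early @code{exit}.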
8128
If an argument is supplied to @code{exit}, its value is used as the exit
8129
status code for the @code{awk} process. If no argument is supplied,
8130
@code{exit} returns status zero (success). In the case where an argument
8131
is supplied to a first @code{exit} statement, and then @code{exit} is
8132
called a second time with no argument, the previously supplied exit value
8135
For example, let's say you've discovered an error condition you really
8136
don't know how to handle. Conventionally, programs report this by
8137
exiting with a non-zero status. Your @code{awk} program can do this
8138
using an @code{exit} statement with a non-zero argument. Here is an
8144
if (("date" | getline date_now) <= 0) @{
8145
print "Can't get system date" > "/dev/stderr"
8148
print "current date is", date_now
8154
@node Built-in Variables, Arrays, Statements, Top
8155
@chapter Built-in Variables
8156
@cindex built-in variables
8158
Most @code{awk} variables are available for you to use for your own
8159
purposes; they never change except when your program assigns values to
8160
them, and never affect anything except when your program examines them.
8161
However, a few variables in @code{awk} have special built-in meanings.
8162
Some of them @code{awk} examines automatically, so that they enable you
8163
to tell @code{awk} how to do certain things. Others are set
8164
automatically by @code{awk}, so that they carry information from the
8165
internal workings of @code{awk} to your program.
8167
This chapter documents all the built-in variables of @code{gawk}. Most
8168
of them are also documented in the chapters describing their areas of
8172
* User-modified:: Built-in variables that you change to control
8174
* Auto-set:: Built-in variables where @code{awk} gives you
8176
* ARGC and ARGV:: Ways to use @code{ARGC} and @code{ARGV}.
8179
@node User-modified, Auto-set, Built-in Variables, Built-in Variables
8180
@section Built-in Variables that Control @code{awk}
8181
@cindex built-in variables, user modifiable
8183
This is an alphabetical list of the variables which you can change to
8184
control how @code{awk} does certain things. Those variables that are
8185
specific to @code{gawk} are marked with an asterisk, @samp{*}.
8189
@cindex @code{awk} language, POSIX version
8190
@cindex POSIX @code{awk}
8192
This string controls conversion of numbers to
8193
strings (@pxref{Conversion, ,Conversion of Strings and Numbers}).
8194
It works by being passed, in effect, as the first argument to the
8195
@code{sprintf} function
8196
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
8197
Its default value is @code{"%.6g"}.
8198
@code{CONVFMT} was introduced by the POSIX standard.
8202
This is a space-separated list of columns that tells @code{gawk}
8203
how to split input with fixed, columnar boundaries. It is an
8204
experimental feature. Assigning to @code{FIELDWIDTHS}
8205
overrides the use of @code{FS} for field splitting.
8206
@xref{Constant Size, ,Reading Fixed-width Data}, for more information.
8208
If @code{gawk} is in compatibility mode
8209
(@pxref{Options, ,Command Line Options}), then @code{FIELDWIDTHS}
8210
has no special meaning, and field splitting operations are done based
8211
exclusively on the value of @code{FS}.
8215
@code{FS} is the input field separator
8216
(@pxref{Field Separators, ,Specifying How Fields are Separated}).
8217
The value is a single-character string or a multi-character regular
8218
expression that matches the separations between fields in an input
8219
record. If the value is the null string (@code{""}), then each
8220
character in the record becomes a separate field.
8222
The default value is @w{@code{" "}}, a string consisting of a single
8223
space. As a special exception, this value means that any
8224
sequence of spaces and tabs is a single separator. It also causes
8225
spaces and tabs at the beginning and end of a record to be ignored.
8227
You can set the value of @code{FS} on the command line using the
8231
awk -F, '@var{program}' @var{input-files}
8234
If @code{gawk} is using @code{FIELDWIDTHS} for field-splitting,
8235
assigning a value to @code{FS} will cause @code{gawk} to return to
8236
the normal, @code{FS}-based, field splitting. An easy way to do this
8237
is to simply say @samp{FS = FS}, perhaps with an explanatory comment.
8241
If @code{IGNORECASE} is non-zero or non-null, then all string comparisons,
8242
and all regular expression matching are case-independent. Thus, regexp
8243
matching with @samp{~} and @samp{!~}, and the @code{gensub},
8244
@code{gsub}, @code{index}, @code{match}, @code{split} and @code{sub}
8245
functions, record termination with @code{RS}, and field splitting with
8246
@code{FS} all ignore case when doing their particular regexp operations.
8247
@xref{Case-sensitivity, ,Case-sensitivity in Matching}.
8249
If @code{gawk} is in compatibility mode
8250
(@pxref{Options, ,Command Line Options}),
8251
then @code{IGNORECASE} has no special meaning, and string
8252
and regexp operations are always case-sensitive.
8256
This string controls conversion of numbers to
8257
strings (@pxref{Conversion, ,Conversion of Strings and Numbers}) for
8258
printing with the @code{print} statement. It works by being passed, in
8259
effect, as the first argument to the @code{sprintf} function
8260
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
8261
Its default value is @code{"%.6g"}. Earlier versions of @code{awk}
8262
also used @code{OFMT} to specify the format for converting numbers to
8263
strings in general expressions; this is now done by @code{CONVFMT}.
8267
This is the output field separator (@pxref{Output Separators}). It is
8268
output between the fields output by a @code{print} statement. Its
8269
default value is @w{@code{" "}}, a string consisting of a single space.
8273
This is the output record separator. It is output at the end of every
8274
@code{print} statement. Its default value is @code{"\n"}.
8275
(@xref{Output Separators}.)
8279
This is @code{awk}'s input record separator. Its default value is a string
8280
containing a single newline character, which means that an input record
8281
consists of a single line of text.
8282
It can also be the null string, in which case records are separated by
8283
runs of blank lines, or a regexp, in which case records are separated by
8284
matches of the regexp in the input text.
8285
(@xref{Records, ,How Input is Split into Records}.)
8289
@code{SUBSEP} is the subscript separator. It has the default value of
8290
@code{"\034"}, and is used to separate the parts of the indices of a
8291
multi-dimensional array. Thus, the expression @code{@w{foo["A", "B"]}}
8292
really accesses @code{foo["A\034B"]}
8293
(@pxref{Multi-dimensional, ,Multi-dimensional Arrays}).
8296
@node Auto-set, ARGC and ARGV, User-modified, Built-in Variables
8297
@section Built-in Variables that Convey Information
8298
@cindex built-in variables, convey information
8300
This is an alphabetical list of the variables that are set
8301
automatically by @code{awk} on certain occasions in order to provide
8302
information to your program. Those variables that are specific to
8303
@code{gawk} are marked with an asterisk, @samp{*}.
8310
The command-line arguments available to @code{awk} programs are stored in
8311
an array called @code{ARGV}. @code{ARGC} is the number of command-line
8312
arguments present. @xref{Other Arguments, ,Other Command Line Arguments}.
8313
Unlike most @code{awk} arrays,
8314
@code{ARGV} is indexed from zero to @code{ARGC} @minus{} 1. For example:
8319
> for (i = 0; i < ARGC; i++)
8321
> @}' inventory-shipped BBS-list
8323
@print{} inventory-shipped
8329
In this example, @code{ARGV[0]} contains @code{"awk"}, @code{ARGV[1]}
8330
contains @code{"inventory-shipped"}, and @code{ARGV[2]} contains
8331
@code{"BBS-list"}. The value of @code{ARGC} is three, one more than the
8332
index of the last element in @code{ARGV}, since the elements are numbered
8335
The names @code{ARGC} and @code{ARGV}, as well as the convention of indexing
8336
the array from zero to @code{ARGC} @minus{} 1, are derived from the C language's
8337
method of accessing command line arguments.
8338
@xref{ARGC and ARGV, , Using @code{ARGC} and @code{ARGV}}, for information
8339
about how @code{awk} uses these variables.
8343
The index in @code{ARGV} of the current file being processed.
8344
Every time @code{gawk} opens a new data file for processing, it sets
8345
@code{ARGIND} to the index in @code{ARGV} of the file name.
8346
When @code{gawk} is processing the input files, it is always
8347
true that @samp{FILENAME == ARGV[ARGIND]}.
8349
This variable is useful in file processing; it allows you to tell how far
8350
along you are in the list of data files, and to distinguish between
8351
successive instances of the same filename on the command line.
8353
While you can change the value of @code{ARGIND} within your @code{awk}
8354
program, @code{gawk} will automatically set it to a new value when the
8355
next file is opened.
8357
This variable is a @code{gawk} extension. In other @code{awk} implementations,
8358
or if @code{gawk} is in compatibility mode
8359
(@pxref{Options, ,Command Line Options}),
8364
An associative array that contains the values of the environment. The array
8365
indices are the environment variable names; the values are the values of
8366
the particular environment variables. For example,
8367
@code{ENVIRON["HOME"]} might be @file{/home/arnold}. Changing this array
8368
does not affect the environment passed on to any programs that
8369
@code{awk} may spawn via redirection or the @code{system} function.
8370
(In a future version of @code{gawk}, it may do so.)
8372
Some operating systems may not have environment variables.
8373
On such systems, the @code{ENVIRON} array is empty (except for
8374
@w{@code{ENVIRON["AWKPATH"]}}).
8378
If a system error occurs while doing a redirection for @code{getline},
8379
during a read for @code{getline}, or during a @code{close} operation,
8380
then @code{ERRNO} will contain a string describing the error.
8382
This variable is a @code{gawk} extension. In other @code{awk} implementations,
8383
or if @code{gawk} is in compatibility mode
8384
(@pxref{Options, ,Command Line Options}),
8390
This is the name of the file that @code{awk} is currently reading.
8391
When no data files are listed on the command line, @code{awk} reads
8392
from the standard input, and @code{FILENAME} is set to @code{"-"}.
8393
@code{FILENAME} is changed each time a new file is read
8394
(@pxref{Reading Files, ,Reading Input Files}).
8395
Inside a @code{BEGIN} rule, the value of @code{FILENAME} is
8396
@code{""}, since there are no input files being processed
8397
yet.@footnote{Some early implementations of Unix @code{awk} initialized
8398
@code{FILENAME} to @code{"-"}, even if there were data files to be
8399
processed. This behavior was incorrect, and should not be relied
8400
upon in your programs.} (d.c.)
8404
@code{FNR} is the current record number in the current file. @code{FNR} is
8405
incremented each time a new record is read
8406
(@pxref{Getline, ,Explicit Input with @code{getline}}). It is reinitialized
8407
to zero each time a new input file is started.
8411
@code{NF} is the number of fields in the current input record.
8412
@code{NF} is set each time a new record is read, when a new field is
8413
created, or when @code{$0} changes (@pxref{Fields, ,Examining Fields}).
8417
This is the number of input records @code{awk} has processed since
8418
the beginning of the program's execution
8419
(@pxref{Records, ,How Input is Split into Records}).
8420
@code{NR} is set each time a new record is read.
8424
@code{RLENGTH} is the length of the substring matched by the
8425
@code{match} function
8426
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
8427
@code{RLENGTH} is set by invoking the @code{match} function. Its value
8428
is the length of the matched string, or @minus{}1 if no match was found.
8432
@code{RSTART} is the start-index in characters of the substring matched by the
8433
@code{match} function
8434
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
8435
@code{RSTART} is set by invoking the @code{match} function. Its value
8436
is the position in the string where the matched substring starts, or zero
8437
if no match was found.
8441
@code{RT} is set each time a record is read. It contains the input text
8442
that matched the text denoted by @code{RS}, the record separator.
8444
This variable is a @code{gawk} extension. In other @code{awk} implementations,
8445
or if @code{gawk} is in compatibility mode
8446
(@pxref{Options, ,Command Line Options}),
8451
A side note about @code{NR} and @code{FNR}.
8452
@code{awk} simply increments both of these variables
8453
each time it reads a record, instead of setting them to the absolute
8454
value of the number of records read. This means that your program can
8455
change these variables, and their new values will be incremented for
8456
each record (d.c.). For example:
8463
> 4' | awk 'NR == 2 @{ NR = 17 @}
8473
Before @code{FNR} was added to the @code{awk} language
8474
(@pxref{V7/SVR3.1, ,Major Changes between V7 and SVR3.1}),
8475
many @code{awk} programs used this feature to track the number of
8476
records in a file by resetting @code{NR} to zero when @code{FILENAME}
8479
@node ARGC and ARGV, , Auto-set, Built-in Variables
8480
@section Using @code{ARGC} and @code{ARGV}
8482
In @ref{Auto-set, , Built-in Variables that Convey Information},
8483
you saw this program describing the information contained in @code{ARGC}
8489
> for (i = 0; i < ARGC; i++)
8491
> @}' inventory-shipped BBS-list
8493
@print{} inventory-shipped
8499
In this example, @code{ARGV[0]} contains @code{"awk"}, @code{ARGV[1]}
8500
contains @code{"inventory-shipped"}, and @code{ARGV[2]} contains
8503
Notice that the @code{awk} program is not entered in @code{ARGV}. The
8504
other special command line options, with their arguments, are also not
8505
entered. But variable assignments on the command line @emph{are}
8506
treated as arguments, and do show up in the @code{ARGV} array.
8508
Your program can alter @code{ARGC} and the elements of @code{ARGV}.
8509
Each time @code{awk} reaches the end of an input file, it uses the next
8510
element of @code{ARGV} as the name of the next input file. By storing a
8511
different string there, your program can change which files are read.
8512
You can use @code{"-"} to represent the standard input. By storing
8513
additional elements and incrementing @code{ARGC} you can cause
8514
additional files to be read.
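
As a rough sketch of adding an input file from within the program, the
following commands create two small files and let the @code{BEGIN} rule
append the second one to @code{ARGV}. The file names, the shell variables
@code{dir} and @code{out}, and the @code{awk} variable @code{extra} are
hypothetical names chosen for this example.

```shell
# Sketch: append a second data file to ARGV from the BEGIN rule.
dir=$(mktemp -d)
printf 'from a\n' > "$dir/a.txt"
printf 'from b\n' > "$dir/b.txt"

out=$(awk -v extra="$dir/b.txt" '
BEGIN { ARGV[ARGC++] = extra }   # store the name, then count it in ARGC
      { print }' "$dir/a.txt")
printf '%s\n' "$out"
rm -rf "$dir"
```

Only @file{a.txt} is named on the command line, yet the records of both
files are printed, because @code{awk} consults @code{ARGV} and @code{ARGC}
again when it finishes the first file.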
8516
If you decrease the value of @code{ARGC}, that eliminates input files
8517
from the end of the list. By recording the old value of @code{ARGC}
8518
elsewhere, your program can treat the eliminated arguments as
8519
something other than file names.
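
The following sketch treats the last command line argument as a line limit
rather than as a file name. It saves the argument itself (instead of the
old value of @code{ARGC}) before decrementing @code{ARGC}; the variable
name @code{limit} and the sample input are assumptions made for the
illustration.

```shell
# Sketch: the last argument is a line limit, not a file name.
out=$(printf 'l1\nl2\nl3\n' | awk '
BEGIN {
    limit = ARGV[ARGC - 1] + 0   # remember the value before dropping it
    ARGC--                       # awk no longer treats it as an input file
}
NR <= limit' 2)
printf '%s\n' "$out"
```

After the decrement no file names remain, so @code{awk} reads the standard
input and prints only the first two records.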
8521
To eliminate a file from the middle of the list, store the null string
8522
(@code{""}) into @code{ARGV} in place of the file's name. As a
8523
special feature, @code{awk} ignores file names that have been
8524
replaced with the null string.
8525
You may also use the @code{delete} statement to remove elements from
8526
@code{ARGV} (@pxref{Delete, ,The @code{delete} Statement}).
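
A short sketch of the null-string technique is shown here. The three file
names and the shell variables @code{dir} and @code{out} are made up for
the example.

```shell
# Sketch: skip the middle file by blanking its ARGV entry.
dir=$(mktemp -d)
printf 'first\n'  > "$dir/one.txt"
printf 'second\n' > "$dir/two.txt"
printf 'third\n'  > "$dir/three.txt"

out=$(awk 'BEGIN { ARGV[2] = "" }   # drop the middle file name
           { print }' "$dir/one.txt" "$dir/two.txt" "$dir/three.txt")
printf '%s\n' "$out"
rm -rf "$dir"
```

The second file is never opened, so only the records of the first and
third files are printed.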
8528
All of these actions are typically done from the @code{BEGIN} rule,
8529
before actual processing of the input begins.
8530
@xref{Split Program, ,Splitting a Large File Into Pieces}, and see
8531
@ref{Tee Program, ,Duplicating Output Into Multiple Files}, for an example
8532
of each way of removing elements from @code{ARGV}.
8534
The following fragment processes @code{ARGV} in order to examine, and
8535
then remove, command line options.
8540
for (i = 1; i < ARGC; i++) @{
8541
if (ARGV[i] == "-v")
8543
else if (ARGV[i] == "-d")
8547
else if (ARGV[i] ~ /^-./) @{
8548
e = sprintf("%s: unrecognized option -- %c",
8549
ARGV[0], substr(ARGV[i], 2, 1))
8550
print e > "/dev/stderr"
8559
@node Arrays, Built-in, Built-in Variables, Top
8560
@chapter Arrays in @code{awk}
8562
An @dfn{array} is a table of values, called @dfn{elements}. The
8563
elements of an array are distinguished by their indices. @dfn{Indices}
8564
may be either numbers or strings. @code{awk} maintains a single set
8565
of names that may be used for naming variables, arrays and functions
8566
(@pxref{User-defined, ,User-defined Functions}).
8567
Thus, you cannot have a variable and an array with the same name in the
8568
same @code{awk} program.
8571
* Array Intro:: Introduction to Arrays
8572
* Reference to Elements:: How to examine one element of an array.
8573
* Assigning Elements:: How to change an element of an array.
8574
* Array Example:: Basic Example of an Array
8575
* Scanning an Array:: A variation of the @code{for} statement. It
8576
loops through the indices of an array's
8578
* Delete:: The @code{delete} statement removes an element
8580
* Numeric Array Subscripts:: How to use numbers as subscripts in
8582
* Uninitialized Subscripts:: Using uninitialized variables as subscripts.
8583
* Multi-dimensional:: Emulating multi-dimensional arrays in
8585
* Multi-scanning:: Scanning multi-dimensional arrays.
8588
@node Array Intro, Reference to Elements, Arrays, Arrays
8589
@section Introduction to Arrays
8592
The @code{awk} language provides one-dimensional @dfn{arrays} for storing groups
8593
of related strings or numbers.
8595
Every @code{awk} array must have a name. Array names have the same
8596
syntax as variable names; any valid variable name would also be a valid
8597
array name. But you cannot use one name in both ways (as an array and
8598
as a variable) in one @code{awk} program.
8600
Arrays in @code{awk} superficially resemble arrays in other programming
8601
languages; but there are fundamental differences. In @code{awk}, you
8602
don't need to specify the size of an array before you start to use it.
8603
Additionally, any number or string in @code{awk} may be used as an
8604
array index, not just consecutive integers.
8606
In most other languages, you have to @dfn{declare} an array and specify
8607
how many elements or components it contains. In such languages, the
8608
declaration causes a contiguous block of memory to be allocated for that
8609
many elements. An index in the array usually must be a non-negative integer; for
8610
example, the index zero specifies the first element in the array, which is
8611
actually stored at the beginning of the block of memory. Index one
8612
specifies the second element, which is stored in memory right after the
8613
first element, and so on. It is impossible to add more elements to the
8614
array, because it has room for only as many elements as you declared.
8615
(Some languages allow arbitrary starting and ending indices,
8616
e.g., @samp{15 .. 27}, but the size of the array is still fixed when
8617
the array is declared.)
8619
A contiguous array of four elements might look like this,
8620
conceptually, if the element values are eight, @code{"foo"},
8624
@c from Karl Berry, much thanks for the help.
8626
\bigskip % space above the table (about 1 linespace)
8628
\newdimen\width \width = 1.5cm
8629
\newdimen\hwidth \hwidth = 4\width \advance\hwidth by 2pt % 5 * 0.4pt
8631
\halign{\strut\hfil\ignorespaces#&&\vrule#&\hbox to\width{\hfil#\unskip\hfil}\cr
8632
\noalign{\hrule width\hwidth}
8633
&&{\tt 8} &&{\tt "foo"} &&{\tt ""} &&{\tt 30} &&\quad value\cr
8634
\noalign{\hrule width\hwidth}
8635
\noalign{\smallskip}
8636
&\omit&0&\omit &1 &\omit&2 &\omit&3 &\omit&\quad index\cr
8643
+---------+---------+--------+---------+
8644
| 8 | "foo" | "" | 30 | @r{value}
8645
+---------+---------+--------+---------+
8651
Only the values are stored; the indices are implicit from the order of
8652
the values. Eight is the value at index zero, because eight appears in the
8653
position with zero elements before it.
8655
@cindex arrays, definition of
8656
@cindex associative arrays
8657
@cindex arrays, associative
8658
Arrays in @code{awk} are different: they are @dfn{associative}. This means
8659
that each array is a collection of pairs: an index, and its corresponding
8660
array element value:
8663
@r{Element} 4 @r{Value} 30
8664
@r{Element} 2 @r{Value} "foo"
8665
@r{Element} 1 @r{Value} 8
8666
@r{Element} 3 @r{Value} ""
8670
We have shown the pairs in jumbled order because their order is irrelevant.
8672
One advantage of associative arrays is that new pairs can be added
8673
at any time. For example, suppose we add to the above array a tenth element
8674
whose value is @w{@code{"number ten"}}. The result is this:
8677
@r{Element} 10 @r{Value} "number ten"
8678
@r{Element} 4 @r{Value} 30
8679
@r{Element} 2 @r{Value} "foo"
8680
@r{Element} 1 @r{Value} 8
8681
@r{Element} 3 @r{Value} ""
8685
@cindex sparse arrays
8686
@cindex arrays, sparse
8687
Now the array is @dfn{sparse}, which just means some indices are missing:
8688
it has elements 1--4 and 10, but doesn't have elements 5, 6, 7, 8, or 9.
8689
@c ok, I should spell out the above, but ...
8691
Another consequence of associative arrays is that the indices don't
8692
have to be positive integers. Any number, or even a string, can be
8693
an index. For example, here is an array which translates words from
8694
English into French:
8697
@r{Element} "dog" @r{Value} "chien"
8698
@r{Element} "cat" @r{Value} "chat"
8699
@r{Element} "one" @r{Value} "un"
8700
@r{Element} 1 @r{Value} "un"
8704
Here we decided to translate the number one in both spelled-out and
8705
numeric form---thus illustrating that a single array can have both
8706
numbers and strings as indices.
8707
(In fact, array subscripts are always strings; this is discussed
8709
@ref{Numeric Array Subscripts, ,Using Numbers to Subscript Arrays}.)
8711
When @code{awk} creates an array for you, e.g., with the @code{split}
8713
that array's indices are consecutive integers starting at one.
8714
(@xref{String Functions, ,Built-in Functions for String Manipulation}.)
8716
@node Reference to Elements, Assigning Elements, Array Intro, Arrays
8717
@section Referring to an Array Element
8718
@cindex array reference
8719
@cindex element of array
8720
@cindex reference to array
8722
The principal way of using an array is to refer to one of its elements.
8723
An array reference is an expression which looks like this:
8726
@var{array}[@var{index}]
8730
Here, @var{array} is the name of an array. The expression @var{index} is
8731
the index of the element of the array that you want.
8733
The value of the array reference is the current value of that array
8734
element. For example, @code{foo[4.3]} is an expression for the element
8735
of array @code{foo} at index @samp{4.3}.
8737
If you refer to an array element that has no recorded value, the value
8738
of the reference is @code{""}, the null string. This includes elements
8739
to which you have not assigned any value, and elements that have been
8740
deleted (@pxref{Delete, ,The @code{delete} Statement}). Such a reference
8741
automatically creates that array element, with the null string as its value.
8742
(In some cases, this is unfortunate, because it might waste memory inside
8745
@cindex arrays, presence of elements
8746
@cindex arrays, the @code{in} operator
8747
You can find out if an element exists in an array at a certain index with
8751
@var{index} in @var{array}
8755
This expression tests whether or not the particular index exists,
8756
without the side effect of creating that element if it is not present.
8757
The expression has the value one (true) if @code{@var{array}[@var{index}]}
8758
exists, and zero (false) if it does not exist.
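
The side effect can be observed by counting the elements of an array
before and after each kind of test. This is only a sketch; the array name
@code{freq} and the counter @code{n} are hypothetical.

```shell
# Sketch: the in operator does not create elements, a plain reference does.
out=$(awk 'BEGIN {
    if (2 in freq) print "unexpected"
    n = 0; for (k in freq) n++
    print "after the in test:", n

    if (freq[2] != "") print "unexpected"   # this reference creates freq[2]
    n = 0; for (k in freq) n++
    print "after a plain reference:", n
}')
printf '%s\n' "$out"
```

The array is still empty after the @code{in} test, but holds one element
(with the null string as its value) after the plain reference.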
8760
For example, to test whether the array @code{frequencies} contains the
8761
index @samp{2}, you could write this statement:
8764
if (2 in frequencies)
8765
print "Subscript 2 is present."
8768
Note that this is @emph{not} a test of whether or not the array
8769
@code{frequencies} contains an element whose @emph{value} is two.
8770
(There is no way to do that except to scan all the elements.) Also, this
8771
@emph{does not} create @code{frequencies[2]}, while the following
8772
(incorrect) alternative would do so:
8775
if (frequencies[2] != "")
8776
print "Subscript 2 is present."
8779
@node Assigning Elements, Array Example, Reference to Elements, Arrays
8780
@section Assigning Array Elements
8781
@cindex array assignment
8782
@cindex element assignment
8784
Array elements are lvalues: they can be assigned values just like
8785
@code{awk} variables:
8788
@var{array}[@var{subscript}] = @var{value}
8792
Here @var{array} is the name of your array. The expression
8793
@var{subscript} is the index of the element of the array that you want
8794
to assign a value. The expression @var{value} is the value you are
8795
assigning to that element of the array.
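
For instance, the following sketch assigns to elements with both string
and numeric subscripts. The array name @code{price} and its contents are
invented for the example.

```shell
# Sketch: array elements are assigned like ordinary variables.
out=$(awk 'BEGIN {
    price["apple"] = 3        # create an element with a string index
    price["apple"] += 2       # any assignment operator may be used
    price[7] = "seven"        # numeric index, string value
    print price["apple"], price[7]
}')
printf '%s\n' "$out"
```

The first element ends up holding the number five, and the second holds a
string.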
8797
@node Array Example, Scanning an Array, Assigning Elements, Arrays
8798
@section Basic Array Example
8800
The following program takes a list of lines, each beginning with a line
8801
number, and prints them out in order of line number. The line numbers are
8802
not in order, however, when they are first read: they are scrambled. This
8803
program sorts the lines by making an array using the line numbers as
8804
subscripts. It then prints out the lines in sorted order of their numbers.
8805
It is a very simple program, and gets confused if it encounters repeated
8806
numbers, gaps, or lines that don't begin with a number.
8809
@c file eg/misc/arraymax.awk
8817
for (x = 1; x <= max; x++)
8823
The first rule keeps track of the largest line number seen so far;
8824
it also stores each line into the array @code{arr}, at an index that
8825
is the line's number.
8827
The second rule runs after all the input has been read, to print out all the lines.
8830
When this program is run with the following input:
8834
@c file eg/misc/arraymax.data
8836
2 Who are you? The new number two!
8837
4 . . . And four on the floor
8838
1 Who is number one?
8848
1 Who is number one?
8849
2 Who are you? The new number two!
8851
4 . . . And four on the floor
8855
If a line number is repeated, the last line with a given number overrides the others.
8858
Gaps in the line numbers can be handled with an easy improvement to the
8859
program's @code{END} rule:
8863
for (x = 1; x <= max; x++)
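Put together, the whole program with this improvement might look like the
sketch below. The first rule is reconstructed from the description given
earlier, so treat it as an assumption rather than the exact distributed
file; the shell variable @code{sorted} and the three-line test input (which
has a gap at line number 3) are chosen only for the illustration.

```shell
sorted=$(printf '%s\n' \
    '2 Who are you? The new number two!' \
    '4 . . . And four on the floor' \
    '1 Who is number one?' |
awk '
# Remember the largest line number seen and store each line under its number.
{
    if ($1 > max)
        max = $1
    arr[$1] = $0
}
# Print the stored lines in numeric order, skipping numbers never seen.
END {
    for (x = 1; x <= max; x++)
        if (x in arr)
            print arr[x]
}')
printf '%s\n' "$sorted"
```

The @code{in} test keeps the loop from printing an empty line for the
missing number, and it also avoids creating @code{arr[3]} as a side effect.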
8869
@node Scanning an Array, Delete, Array Example, Arrays
8870
@section Scanning All Elements of an Array
8871
@cindex @code{for (x in @dots{})}
8872
@cindex arrays, special @code{for} statement
8873
@cindex scanning an array
8875
In programs that use arrays, you often need a loop that executes
8876
once for each element of an array. In other languages, where arrays are
8877
contiguous and indices are limited to positive integers, this is easy: you can
8879
find all the valid indices by counting from the lowest index
8880
up to the highest. This
8881
technique won't do the job in @code{awk}, since any number or string
8882
can be an array index. So @code{awk} has a special kind of @code{for}
8883
statement for scanning an array:
8886
for (@var{var} in @var{array})
8891
This loop executes @var{body} once for each index in @var{array} that your
8892
program has previously used, with the
8893
variable @var{var} set to that index.
8895
Here is a program that uses this form of the @code{for} statement. The
8896
first rule scans the input records and notes which words appear (at
8897
least once) in the input, by storing a one into the array @code{used} with
8898
the word as index. The second rule scans the elements of @code{used} to
8899
find all the distinct words that appear in the input. It prints each
8900
word that is more than 10 characters long, and also prints the number of
8901
such words. @xref{String Functions, ,Built-in Functions for String Manipulation}, for more information
8902
on the built-in function @code{length}.
8905
# Record a 1 for each word that is used at least once.
8907
for (i = 1; i <= NF; i++)
8911
# Find number of distinct words more than 10 characters long.
8914
if (length(x) > 10) @{
8918
print num_long_words, "words longer than 10 characters"
8923
@xref{Word Sorting, ,Generating Word Usage Counts},
8924
for a more detailed example of this type.
8926
The order in which elements of the array are accessed by this statement
8927
is determined by the internal arrangement of the array elements within
8928
@code{awk} and cannot be controlled or changed. This can lead to
8929
problems if new elements are added to @var{array} by statements in
8930
the loop body; you cannot predict whether or not the @code{for} loop will
8931
reach them. Similarly, changing @var{var} inside the loop may produce
8932
strange results. It is best to avoid such things.
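When the output order matters, one common approach is to let the scanning
loop print the indices in whatever order it finds them and to sort the
result afterwards. The sketch below assumes the external @code{sort} utility
is available; the array name @code{used} follows the example above, and the
sample input and the shell variable @code{words} are invented for the
illustration.

```shell
words=$(printf '%s\n' 'pear apple fig' 'apple fig' | awk '
# Record a 1 for each word that is used at least once.
{
    for (i = 1; i <= NF; i++)
        used[$i] = 1
}
# The scan order is unspecified, so the ordering is left to sort.
END {
    for (word in used)
        print word
}' | sort)
printf '%s\n' "$words"
```

Each distinct word is printed once, and the pipeline yields the same order
no matter how a particular @code{awk} arranges the array internally.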
8934
@node Delete, Numeric Array Subscripts, Scanning an Array, Arrays
8935
@section The @code{delete} Statement
8936
@cindex @code{delete} statement
8937
@cindex deleting elements of arrays
8938
@cindex removing elements of arrays
8939
@cindex arrays, deleting an element
8941
You can remove an individual element of an array using the @code{delete} statement:
8945
delete @var{array}[@var{index}]
8948
Once you have deleted an array element, you can no longer obtain any
8949
value the element once had. It is as if you had never referred
8950
to it and had never given it any value.
8952
Here is an example of deleting elements in an array:
8955
for (i in frequencies)
8956
delete frequencies[i]
8960
This example removes all the elements from the array @code{frequencies}.
8962
If you delete an element, a subsequent @code{for} statement to scan the array
8963
will not report that element, and the @code{in} operator to check for
8964
the presence of that element will return zero (i.e.@: false):
8969
print "This will never be printed"
8972
It is important to note that deleting an element is @emph{not} the
8973
same as assigning it a null value (the empty string, @code{""}).
8978
print "This is printed, even though foo[4] is empty"
8981
It is not an error to delete an element that does not exist.
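The three points above (deletion, assigning the null string, and deleting a
missing element) can be seen side by side in a small sketch. The array name
@code{foo} follows the text; the index @code{"absent"} and the variables
@code{gone}, @code{kept} and @code{result} are hypothetical.

```shell
result=$(awk 'BEGIN {
    foo[4] = "x"
    foo[5] = "y"
    delete foo[4]         # foo[4] no longer exists
    foo[5] = ""           # foo[5] still exists, with a null value
    delete foo["absent"]  # deleting a missing element is not an error
    gone = (4 in foo)
    kept = (5 in foo)
    print gone, kept
}')
echo "$result"
```

This should print @samp{0 1}: only the deleted element fails the @code{in}
test.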
8983
@cindex arrays, deleting entire contents
8984
@cindex deleting entire arrays
8985
@cindex differences between @code{gawk} and @code{awk}
8986
You can delete all the elements of an array with a single statement,
8987
by leaving off the subscript in the @code{delete} statement.
8993
This ability is a @code{gawk} extension; it is not available in
8994
compatibility mode (@pxref{Options, ,Command Line Options}).
8996
Using this version of the @code{delete} statement is about three times
8997
more efficient than the equivalent loop that deletes each element one at a time.
9000
@cindex portability issues
9001
The following statement provides a portable, but non-obvious way to clear out an array:
9004
@cindex Brennan, Michael
9007
# thanks to Michael Brennan for pointing this out
9012
The @code{split} function
9013
(@pxref{String Functions, ,Built-in Functions for String Manipulation})
9014
clears out the target array first. This call asks it to split
9015
apart the null string. Since there is no data to split out, the
9016
function simply clears the array and then returns.
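A quick way to convince yourself that the array really is empty afterwards
is to count its elements with a scanning loop. This is a minimal sketch; the
names @code{data}, @code{n} and @code{result} are illustrative only.

```shell
result=$(awk 'BEGIN {
    data["a"] = 1
    data["b"] = 2
    split("", data)       # portable way to empty the array
    n = 0
    for (k in data)
        n++
    print n
}')
echo "$result"
```

The count printed should be @samp{0}.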
9018
@node Numeric Array Subscripts, Uninitialized Subscripts, Delete, Arrays
9019
@section Using Numbers to Subscript Arrays
9021
An important aspect of arrays to remember is that @emph{array subscripts
9022
are always strings}. If you use a numeric value as a subscript,
9023
it will be converted to a string value before it is used for subscripting
9024
(@pxref{Conversion, ,Conversion of Strings and Numbers}).
9026
@cindex conversions, during subscripting
9027
@cindex numbers, used as subscripts
9029
This means that the value of the built-in variable @code{CONVFMT} can potentially
9030
affect how your program accesses elements of an array. For example:
9038
printf "%s is in data\n", xyz
9040
printf "%s is not in data\n", xyz
9045
This prints @samp{12.15 is not in data}. The first statement gives
9046
@code{xyz} a numeric value. Assigning to
9047
@code{data[xyz]} subscripts @code{data} with the string value @code{"12.153"}
9048
(using the default conversion value of @code{CONVFMT}, @code{"%.6g"}),
9049
and assigns one to @code{data["12.153"]}. The program then changes
9050
the value of @code{CONVFMT}. The test @samp{(xyz in data)} generates a new
9051
string value from @code{xyz}, this time @code{"12.15"}, since the value of
9052
@code{CONVFMT} allows only two digits after the decimal point. This test fails,
9053
since @code{"12.15"} is a different string from @code{"12.153"}.
9055
According to the rules for conversions
9056
(@pxref{Conversion, ,Conversion of Strings and Numbers}), integer
9057
values are always converted to strings as integers, no matter what the
9058
value of @code{CONVFMT} may happen to be. So the usual case of:
9061
for (i = 1; i <= maxsub; i++)
9062
@i{do something with} array[i]
9066
will work, no matter what the value of @code{CONVFMT}.
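The contrast between integer and non-integer subscripts can be sketched as
follows. The value @code{"%2.2f"} for @code{CONVFMT} and all variable names
here are assumptions made for the illustration. A second variable is used
for each lookup so that the conversion to a string is certain to happen
after @code{CONVFMT} has changed.

```shell
result=$(awk 'BEGIN {
    stored = 12.153
    whole = 12
    data[stored] = 1        # stored under "12.153" (default CONVFMT is "%.6g")
    data[whole] = 1         # stored under "12"
    CONVFMT = "%2.2f"
    probe = 12.153
    count = 12
    found_frac = (probe in data)    # looks up "12.15"
    found_whole = (count in data)   # still looks up "12"
    print found_frac, found_whole
}')
echo "$result"
```

This should print @samp{0 1}: the non-integer value no longer finds its
element, while the integer value is unaffected.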
9068
As with many things in @code{awk}, most of the time array subscripts work
as you would expect. But it is useful to know the precise rules,
since they can sometimes have a subtle effect on your programs.
9073
@node Uninitialized Subscripts, Multi-dimensional, Numeric Array Subscripts, Arrays
9074
@section Using Uninitialized Variables as Subscripts
9076
@cindex uninitialized variables, as array subscripts
9077
@cindex array subscripts, uninitialized variables
9078
Suppose you want to print your input data in reverse order.
9079
A reasonable attempt at a program to do so (with some test
9080
data) might look like this:
9085
> line 3' | awk '@{ l[lines] = $0; ++lines @}
9087
> for (i = lines-1; i >= 0; --i)
9094
Unfortunately, the very first line of input data did not come out in the output.
9097
At first glance, this program should have worked. The variable @code{lines}
9098
is uninitialized, and uninitialized variables have the numeric value zero.
9099
So, the value of @code{l[0]} should have been printed.
9101
The issue here is that subscripts for @code{awk} arrays are @strong{always}
9102
strings. And uninitialized variables, when used as strings, have the
9103
value @code{""}, not zero. Thus, @samp{line 1} ended up stored in @code{l[""]}.
9106
The following version of the program works correctly:
9109
@{ l[lines++] = $0 @}
9111
for (i = lines - 1; i >= 0; --i)
9116
Here, the @samp{++} forces @code{lines} to be numeric, thus making
9117
the ``old value'' numeric zero, which is then converted to @code{"0"}
9118
as the array subscript.
9120
@cindex null string, as array subscript
9122
As we have just seen, even though it is somewhat unusual, the null string
9123
(@code{""}) is a valid array subscript (d.c.). If @samp{--lint} is provided
9124
on the command line (@pxref{Options, ,Command Line Options}),
9125
@code{gawk} will warn about the use of the null string as a subscript.
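The following sketch shows where the first line goes when an uninitialized
variable is used as a subscript. The names @code{l} and @code{lines} follow
the program above; @code{null_sub}, @code{zero_sub} and @code{result} are
made up for the illustration.

```shell
result=$(awk 'BEGIN {
    l[lines] = "line 1"      # lines is uninitialized, so the subscript is ""
    null_sub = ("" in l)
    zero_sub = (0 in l)
    print null_sub, zero_sub
}')
echo "$result"
```

This should print @samp{1 0}: the element lives under the null string, and
no element with subscript @code{"0"} exists.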
9127
@node Multi-dimensional, Multi-scanning, Uninitialized Subscripts, Arrays
9128
@section Multi-dimensional Arrays
9130
@cindex subscripts in arrays
9131
@cindex arrays, multi-dimensional subscripts
9132
@cindex multi-dimensional subscripts
9133
A multi-dimensional array is an array in which an element is identified
9134
by a sequence of indices, instead of a single index. For example, a
9135
two-dimensional array requires two indices. The usual way (in most
9136
languages, including @code{awk}) to refer to an element of a
9137
two-dimensional array named @code{grid} is with
9138
@code{grid[@var{x},@var{y}]}.
9141
Multi-dimensional arrays are supported in @code{awk} through
9142
concatenation of indices into one string. What happens is that
9143
@code{awk} converts the indices into strings
9144
(@pxref{Conversion, ,Conversion of Strings and Numbers}) and
9145
concatenates them together, with a separator between them. This creates
9146
a single string that describes the values of the separate indices. The
9147
combined string is used as a single index into an ordinary,
9148
one-dimensional array. The separator used is the value of the built-in
9149
variable @code{SUBSEP}.
9151
For example, suppose we evaluate the expression @samp{foo[5,12] = "value"}
9152
when the value of @code{SUBSEP} is @code{"@@"}. The numbers five and 12 are
9153
converted to strings and
9154
concatenated with an @samp{@@} between them, yielding @code{"5@@12"}; thus,
9155
the array element @code{foo["5@@12"]} is set to @code{"value"}.
9157
Once the element's value is stored, @code{awk} has no record of whether
9158
it was stored with a single index or a sequence of indices. The two
9159
expressions @samp{foo[5,12]} and @w{@samp{foo[5 SUBSEP 12]}} are always equivalent.
9162
The default value of @code{SUBSEP} is the string @code{"\034"},
9163
which contains a non-printing character that is unlikely to appear in an
9164
@code{awk} program or in most input data.
9166
The usefulness of choosing an unlikely character comes from the fact
9167
that index values that contain a string matching @code{SUBSEP} lead to
9168
combined strings that are ambiguous. Suppose that @code{SUBSEP} were
9169
@code{"@@"}; then @w{@samp{foo["a@@b", "c"]}} and @w{@samp{foo["a",
9170
"b@@c"]}} would be indistinguishable because both would actually be
9171
stored as @samp{foo["a@@b@@c"]}.
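Both the equivalence and the ambiguity can be observed in a few lines. This
is a sketch only: it sets @code{SUBSEP} to an at-sign as in the discussion
above, and the variables @code{same}, @code{clash} and @code{result} are
hypothetical.

```shell
result=$(awk 'BEGIN {
    SUBSEP = "@"
    foo[5,12] = "value"
    same = ("5@12" in foo)             # one combined string index
    foo["a@b", "c"] = 1
    clash = (("a", "b@c") in foo)      # both spell the index a@b@c
    print same, clash, foo[5 SUBSEP 12]
}')
echo "$result"
```

This should print @samp{1 1 value}, which is why the default separator is a
character that rarely occurs in real data.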
9173
You can test whether a particular index-sequence exists in a
9174
``multi-dimensional'' array with the same operator @samp{in} used for single
9175
dimensional arrays. Instead of a single index as the left-hand operand,
9176
write the whole sequence of indices, separated by commas, in parentheses:
9180
(@var{subscript1}, @var{subscript2}, @dots{}) in @var{array}
9183
The following example treats its input as a two-dimensional array of
9184
fields; it rotates this array 90 degrees clockwise and prints the
9185
result. It assumes that all lines have the same number of elements.
9194
for (x = 1; x <= NF; x++)
9201
for (x = 1; x <= max_nf; x++) @{
9202
for (y = max_nr; y >= 1; --y)
9203
printf("%s ", vector[x, y])
9211
When given the input:
9236
@node Multi-scanning, , Multi-dimensional, Arrays
9237
@section Scanning Multi-dimensional Arrays
9239
There is no special @code{for} statement for scanning a
9240
``multi-dimensional'' array; there cannot be one, because in truth there
9241
are no multi-dimensional arrays or elements; there is only a
9242
multi-dimensional @emph{way of accessing} an array.
9244
However, if your program has an array that is always accessed as
9245
multi-dimensional, you can get the effect of scanning it by combining
9246
the scanning @code{for} statement
9247
(@pxref{Scanning an Array, ,Scanning All Elements of an Array}) with the
9248
@code{split} built-in function
9249
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
9253
for (combined in array) @{
9254
split(combined, separate, SUBSEP)
9260
This sets @code{combined} to
9261
each concatenated, combined index in the array, and splits it
9262
into the individual indices by breaking it apart where the value of
9263
@code{SUBSEP} appears. The split-out indices become the elements of
9264
the array @code{separate}.
9266
Thus, suppose you have previously stored a value in @code{array[1, "foo"]};
9267
then an element with index @code{"1\034foo"} exists in
9268
@code{array}. (Recall that the default value of @code{SUBSEP} is
9269
the character with code 034.) Sooner or later the @code{for} statement
9270
will find that index and do an iteration with @code{combined} set to
9271
@code{"1\034foo"}. Then the @code{split} function is called as
9275
split("1\034foo", separate, "\034")
9279
The result of this is to set @code{separate[1]} to @code{"1"} and
9280
@code{separate[2]} to @code{"foo"}. Presto, the original sequence of
9281
separate indices has been recovered.
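The walk-through above can be run as a complete program. The sketch below
uses the names @code{array}, @code{combined} and @code{separate} from the
text; the stored value @code{"bar"} and the variables @code{n} and
@code{result} are invented for the illustration, and @code{SUBSEP} is left
at its default.

```shell
result=$(awk 'BEGIN {
    array[1, "foo"] = "bar"
    for (combined in array) {
        # Break the combined index apart at each occurrence of SUBSEP.
        n = split(combined, separate, SUBSEP)
        print n, separate[1], separate[2], array[combined]
    }
}')
echo "$result"
```

With a single element stored, the loop body runs once and should print
@samp{2 1 foo bar}.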
9283
@node Built-in, User-defined, Arrays, Top
9284
@chapter Built-in Functions
9286
@c 2e: USE TEXINFO-2 FUNCTION DEFINITION STUFF!!!!!!!!!!!!!
9287
@cindex built-in functions
9288
@dfn{Built-in} functions are functions that are always available for
9289
your @code{awk} program to call. This chapter defines all the built-in
9290
functions in @code{awk}; some of them are mentioned in other sections,
9291
but they are summarized here for your convenience. (You can also define
9292
new functions yourself. @xref{User-defined, ,User-defined Functions}.)
9295
* Calling Built-in:: How to call built-in functions.
9296
* Numeric Functions:: Functions that work with numbers, including
9297
@code{int}, @code{sin} and @code{rand}.
9298
* String Functions:: Functions for string manipulation, such as
9299
@code{split}, @code{match}, and
9301
* I/O Functions:: Functions for files and shell commands.
9302
* Time Functions:: Functions for dealing with time stamps.
9305
@node Calling Built-in, Numeric Functions, Built-in, Built-in
9306
@section Calling Built-in Functions
9308
To call a built-in function, write the name of the function followed
9309
by arguments in parentheses. For example, @samp{atan2(y + z, 1)}
9310
is a call to the function @code{atan2}, with two arguments.
9312
Whitespace is ignored between the built-in function name and the
9313
open-parenthesis, but we recommend that you avoid using whitespace
9314
there. User-defined functions do not permit whitespace in this way, and
9315
you will find it easier to avoid mistakes by following a simple
9316
convention which always works: no whitespace after a function name.
9318
@cindex differences between @code{gawk} and @code{awk}
9319
Each built-in function accepts a certain number of arguments.
9320
In some cases, arguments can be omitted. The defaults for omitted
9321
arguments vary from function to function and are described under the
9322
individual functions. In some @code{awk} implementations, extra
9323
arguments given to built-in functions are ignored. However, in @code{gawk},
9324
it is a fatal error to give extra arguments to a built-in function.
9326
When a function is called, expressions that create the function's actual
9327
parameters are evaluated completely before the function call is performed.
9328
For example, in the code fragment:
9336
the variable @code{i} is set to five before @code{sqrt} is called
9337
with a value of four for its actual parameter.
9339
@cindex evaluation, order of
9340
@cindex order of evaluation
9341
The order of evaluation of the expressions used for the function's
9342
parameters is undefined. Thus, you should not write programs that
9343
assume that parameters are evaluated from left to right or from
9344
right to left. For example,
9348
j = atan2(i++, i *= 2)
9351
If the order of evaluation is left to right, then @code{i} first becomes
9352
six, and then 12, and @code{atan2} is called with the two arguments six
9353
and 12. But if the order of evaluation is right to left, @code{i}
9354
first becomes 10, and then 11, and @code{atan2} is called with the
9355
two arguments 11 and 10.
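One way to avoid depending on the order of evaluation is to perform the
side effects in separate statements before the call. The sketch below
assumes that @code{i} starts out as five, as the discussion above implies,
and that the left-to-right result is the one wanted; the variables @code{y},
@code{same} and @code{result} are hypothetical.

```shell
result=$(awk 'BEGIN {
    i = 5
    y = i++            # y is 5, i is now 6
    i *= 2             # i is now 12
    j = atan2(y, i)    # both arguments are fixed before the call
    same = (j == atan2(5, 12))
    print y, i, same
}')
echo "$result"
```

Written this way, every @code{awk} implementation should print
@samp{5 12 1}.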
9357
@node Numeric Functions, String Functions, Calling Built-in, Built-in
9358
@section Numeric Built-in Functions
9360
Here is a full list of built-in functions that work with numbers.
9361
Optional parameters are enclosed in square brackets (``['' and ``]'').
9366
This returns the nearest integer to @var{x} that lies between @var{x} and zero;
that is, @var{x} is truncated toward zero.
9369
For example, @code{int(3)} is three, @code{int(3.9)} is three, @code{int(-3.9)}
9370
is @minus{}3, and @code{int(-3)} is @minus{}3 as well.
9374
This gives you the positive square root of @var{x}. It reports an error
9375
if @var{x} is negative. For example, @code{sqrt(4)} is two.
9379
This gives you the exponential of @var{x} (@code{e ^ @var{x}}), or reports
9380
an error if @var{x} is out of range. The range of values @var{x} can have
9381
depends on your machine's floating point representation.
9385
This gives you the natural logarithm of @var{x}, if @var{x} is positive;
9386
otherwise, it reports an error.
9390
This gives you the sine of @var{x}, with @var{x} in radians.
9394
This gives you the cosine of @var{x}, with @var{x} in radians.
9396
@item atan2(@var{y}, @var{x})
9398
This gives you the arctangent of @code{@var{y} / @var{x}} in radians.
9402
This gives you a random number. The values of @code{rand} are
uniformly distributed between zero and one.
The value can be zero, but it is never one.

Often you want random integers instead. Here is a user-defined function
you can use to obtain a random non-negative integer less than @var{n}:

function randint(n) @{
     return int(n * rand())
@}

The multiplication produces a random real number greater than or equal to
zero and less than @code{n}. We then make it an integer (using @code{int})
between zero and @code{n} @minus{} 1, inclusive.
9420
Here is an example where a similar function is used to produce
9421
random integers between one and @var{n}. This program
9422
prints a new random number for each input record.
9427
# Function to roll a simulated die.
9428
function roll(n) @{ return 1 + int(rand() * n) @}
9432
# Roll 3 six-sided dice and
9433
# print total number of points.
9435
printf("%d points\n",
9436
roll(6)+roll(6)+roll(6))
9441
@cindex seed for random numbers
9442
@cindex random numbers, seed of
9443
@comment MAWK uses a different seed each time.
9444
@strong{Caution:} In most @code{awk} implementations, including @code{gawk},
9445
@code{rand} starts generating numbers from the same
9446
starting number, or @dfn{seed}, each time you run @code{awk}. Thus,
9447
a program will generate the same results each time you run it.
9448
The numbers are random within one @code{awk} run, but predictable
9449
from run to run. This is convenient for debugging, but if you want
9450
a program to do different things each time it is used, you must change
9451
the seed to a value that will be different in each run. To do this, use @code{srand}.
9454
@item srand(@r{[}@var{x}@r{]})
9456
The function @code{srand} sets the starting point, or seed,
9457
for generating random numbers to the value @var{x}.
9459
Each seed value leads to a particular sequence of random
9460
numbers.@footnote{Computer generated random numbers really are not truly
9461
random. They are technically known as ``pseudo-random.'' This means
9462
that while the numbers in a sequence appear to be random, you can in
9463
fact generate the same sequence of random numbers over and over again.}
9464
Thus, if you set the seed to the same value a second time, you will get
9465
the same sequence of random numbers again.
9467
If you omit the argument @var{x}, as in @code{srand()}, then the current
9468
date and time of day are used for a seed. This is the way to get random
9469
numbers that are unpredictable from one run to the next.
9471
The return value of @code{srand} is the previous seed. This makes it
9472
easy to keep track of the seeds for use in consistently reproducing
9473
sequences of random numbers.
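The effect of reusing a seed can be checked with a short sketch. The seed
value 42 is arbitrary, and the variable names are made up for the
illustration. The actual numbers produced differ between @code{awk}
implementations, so only their relationship is printed.

```shell
result=$(awk 'BEGIN {
    srand(42); first = rand()
    srand(42); again = rand()     # same seed, same sequence
    repeatable = (first == again)
    in_range = (first >= 0 && first < 1)
    print repeatable, in_range
}')
echo "$result"
```

This should print @samp{1 1}: the two values are identical, and they lie in
the range that @code{rand} is documented to return.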
9476
@node String Functions, I/O Functions, Numeric Functions, Built-in
9477
@section Built-in Functions for String Manipulation
9479
The functions in this section look at or change the text of one or more strings.
9481
Optional parameters are enclosed in square brackets (``['' and ``]'').
9484
@item index(@var{in}, @var{find})
9486
This searches the string @var{in} for the first occurrence of the string
9487
@var{find}, and returns the position in characters where that occurrence
9488
begins in the string @var{in}. For example:
9491
$ awk 'BEGIN @{ print index("peanut", "an") @}'
9496
If @var{find} is not found, @code{index} returns zero.
9497
(Remember that string indices in @code{awk} start at one.)
9499
@item length(@r{[}@var{string}@r{]})
9501
This gives you the number of characters in @var{string}. If
9502
@var{string} is a number, the length of the digit string representing
9503
that number is returned. For example, @code{length("abcde")} is five. By
9504
contrast, @code{length(15 * 35)} works out to three. How? Well, 15 * 35 =
9505
525, and 525 is then converted to the string @code{"525"}, which has three characters.
9508
If no argument is supplied, @code{length} returns the length of @code{$0}.
9510
@cindex historical features
9511
@cindex portability issues
9512
@cindex @code{awk} language, POSIX version
9513
@cindex POSIX @code{awk}
9514
In older versions of @code{awk}, you could call the @code{length} function
9515
without any parentheses. Doing so is marked as ``deprecated'' in the
9516
POSIX standard. This means that while you can do this in your
9517
programs, it is a feature that can eventually be removed from a future
9518
version of the standard. Therefore, for maximal portability of your
9519
@code{awk} programs, you should always supply the parentheses.
9521
@item match(@var{string}, @var{regexp})
9523
The @code{match} function searches the string, @var{string}, for the
9524
longest, leftmost substring matched by the regular expression,
9525
@var{regexp}. It returns the character position, or @dfn{index}, of
9526
where that substring begins (one, if it starts at the beginning of
9527
@var{string}). If no match is found, it returns zero.
9531
The @code{match} function sets the built-in variable @code{RSTART} to
9532
the index. It also sets the built-in variable @code{RLENGTH} to the
9533
length in characters of the matched substring. If no match is found,
9534
@code{RSTART} is set to zero, and @code{RLENGTH} to @minus{}1.
9540
@c file eg/misc/findpat.sh
9545
where = match($0, regex)
9547
print "Match of", regex, "found at", \
9556
This program looks for lines that match the regular expression stored in
9557
the variable @code{regex}. This regular expression can be changed. If the
9558
first word on a line is @samp{FIND}, @code{regex} is changed to be the
9559
second word on that line. Therefore, given:
9562
@c file eg/misc/findpat.data
9565
but not very quickly
9568
This line is property of Reality Engineering Co.
9577
Match of ru+n found at 12 in My program runs
9578
Match of Melvin found at 1 in Melvin was here.
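@code{RSTART} and @code{RLENGTH} are often combined with @code{substr} to
pull out the text that matched. The following sketch reuses one regexp and
one input line from the example above; the variable names @code{s} and
@code{result} are hypothetical.

```shell
result=$(awk 'BEGIN {
    s = "My program runs"
    if (match(s, /ru+n/))
        print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
    match(s, /xyz/)               # no match this time
    print RSTART, RLENGTH
}')
printf '%s\n' "$result"
```

The first line of output should be @samp{12 3 run}, and the second
@samp{0 -1}, the values set when no match is found.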
9581
@item split(@var{string}, @var{array} @r{[}, @var{fieldsep}@r{]})
9583
This divides @var{string} into pieces separated by @var{fieldsep},
9584
and stores the pieces in @var{array}. The first piece is stored in
9585
@code{@var{array}[1]}, the second piece in @code{@var{array}[2]}, and so
9586
forth. The string value of the third argument, @var{fieldsep}, is
9587
a regexp describing where to split @var{string} (much as @code{FS} can
9588
be a regexp describing where to split input records). If
9589
the @var{fieldsep} is omitted, the value of @code{FS} is used.
9590
@code{split} returns the number of elements created.
9592
The @code{split} function splits strings into pieces in a
9593
manner similar to the way input lines are split into fields. For example:
9596
split("cul-de-sac", a, "-")
9600
splits the string @samp{cul-de-sac} into three fields using @samp{-} as the
9601
separator. It sets the contents of the array @code{a} as follows:
9610
The value returned by this call to @code{split} is three.
9612
As with input field-splitting, when the value of @var{fieldsep} is
9613
@w{@code{" "}}, leading and trailing whitespace is ignored, and the elements
9614
are separated by runs of whitespace.
9616
@cindex differences between @code{gawk} and @code{awk}
9617
Also as with input field-splitting, if @var{fieldsep} is the null string, each
9618
individual character in the string is split into its own array element.
9619
(This is a @code{gawk}-specific extension.)
9622
Recent implementations of @code{awk}, including @code{gawk}, allow
9623
the third argument to be a regexp constant (@code{/abc/}), as well as a
9624
string (d.c.). The POSIX standard allows this as well.
9626
Before splitting the string, @code{split} deletes any previously existing
9627
elements in the array @var{array} (d.c.).
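The return value and the clearing of old elements can both be seen in a
small sketch based on the @samp{cul-de-sac} example. The index
@code{"stale"} and the variables @code{n}, @code{gone} and @code{result} are
invented for the illustration.

```shell
result=$(awk 'BEGIN {
    a["stale"] = "old"
    n = split("cul-de-sac", a, "-")
    gone = !("stale" in a)        # the old element was deleted first
    print n, a[1], a[2], a[3], gone
}')
echo "$result"
```

This should print @samp{3 cul de sac 1}.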
9629
@item sprintf(@var{format}, @var{expression1},@dots{})
9631
This returns (without printing) the string that @code{printf} would
9632
have printed out with the same arguments
9633
(@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}).
9637
sprintf("pi = %.2f (approx.)", 22/7)
9641
returns the string @w{@code{"pi = 3.14 (approx.)"}}.
9644
2e: For sub, gsub, and gensub, either here or in the "how much matches"
9645
section, we need some explanation that it is possible to match the
9646
null string when using closures like *. E.g.,
9648
$ echo abc | awk '{ gsub(/m*/, "X"); print }'
9651
Although this makes a certain amount of sense, it can be very
9655
@item sub(@var{regexp}, @var{replacement} @r{[}, @var{target}@r{]})
9657
The @code{sub} function alters the value of @var{target}.
9658
It searches this value, which is treated as a string, for the
9659
leftmost longest substring matched by the regular expression, @var{regexp},
9660
extending this match as far as possible. Then the entire string is
9661
changed by replacing the matched text with @var{replacement}.
9662
The modified string becomes the new value of @var{target}.
9664
This function is peculiar because @var{target} is not simply
9665
used to compute a value, and not just any expression will do: it
9666
must be a variable, field or array element, so that @code{sub} can
9667
store a modified value there. If this argument is omitted, then the
9668
default is to use and alter @code{$0}.
9673
str = "water, water, everywhere"
9674
sub(/at/, "ith", str)
9678
sets @code{str} to @w{@code{"wither, water, everywhere"}}, by replacing the
9679
leftmost, longest occurrence of @samp{at} with @samp{ith}.
9681
The @code{sub} function returns the number of substitutions made (either one or zero).
9684
If the special character @samp{&} appears in @var{replacement}, it
9685
stands for the precise substring that was matched by @var{regexp}. (If
9686
the regexp can match more than one string, then this precise substring
9687
may vary.) For example:
9690
awk '@{ sub(/candidate/, "& and his wife"); print @}'
9694
changes the first occurrence of @samp{candidate} to @samp{candidate
9695
and his wife} on each input line.
9697
Here is another example:
9702
sub(/a+/, "c&c", str)
9709
This shows how @samp{&} can represent a non-constant string, and also
9710
illustrates the ``leftmost, longest'' rule in regexp matching
9711
(@pxref{Leftmost Longest, ,How Much Text Matches?}).
9713
The effect of this special character (@samp{&}) can be turned off by putting a
9714
backslash before it in the string. As usual, to insert one backslash in
9715
the string, you must write two backslashes. Therefore, write @samp{\\&}
9716
in a string constant to include a literal @samp{&} in the replacement.
9717
For example, here is how to replace the first @samp{|} on each line with an @samp{&}:
9721
awk '@{ sub(/\|/, "\\&"); print @}'
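The two uses of the ampersand, special and literal, can be compared on one
input line. This is a sketch; the sample text and the shell variable
@code{result} are made up for the illustration.

```shell
result=$(echo 'the candidate spoke | loudly' | awk '{
    sub(/candidate/, "& and his wife")   # & stands for the matched text
    sub(/\|/, "\\&")                     # \\& yields a literal ampersand
    print
}')
echo "$result"
```

The output should be @samp{the candidate and his wife spoke & loudly}.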
9724
@strong{Note:} As mentioned above, the third argument to @code{sub} must
9725
be a variable, field or array reference.
9726
Some versions of @code{awk} allow the third argument to
9727
be an expression which is not an lvalue. In such a case, @code{sub}
9728
would still search for the pattern and return zero or one, but the result of
9729
the substitution (if any) would be thrown away because there is no place
9730
to put it. Such versions of @code{awk} accept expressions like
9734
sub(/USA/, "United States", "the USA and Canada")
9738
This is considered erroneous in @code{gawk}.
9740
@item gsub(@var{regexp}, @var{replacement} @r{[}, @var{target}@r{]})
9742
This is similar to the @code{sub} function, except @code{gsub} replaces
9743
@emph{all} of the longest, leftmost, @emph{non-overlapping} matching
9744
substrings it can find. The @samp{g} in @code{gsub} stands for
9745
``global,'' which means replace everywhere. For example:
9748
awk '@{ gsub(/Britain/, "United Kingdom"); print @}'
9752
replaces all occurrences of the string @samp{Britain} with @samp{United
9753
Kingdom} for all input records.
9755
The @code{gsub} function returns the number of substitutions made. If
9756
the variable to be searched and altered, @var{target}, is
9757
omitted, then the entire input record, @code{$0}, is used.
9759
As in @code{sub}, the characters @samp{&} and @samp{\} are special,
9760
and the third argument must be an lvalue.
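Because @code{gsub} returns a count, it can also be used to find out how
many replacements took place. In this sketch the input line and the
variables @code{n} and @code{result} are hypothetical.

```shell
result=$(echo 'Britain and Britain' | awk '{
    n = gsub(/Britain/, "United Kingdom")   # alters $0, returns the count
    print n ": " $0
}')
echo "$result"
```

This should print @samp{2: United Kingdom and United Kingdom}.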
9764
@item gensub(@var{regexp}, @var{replacement}, @var{how} @r{[}, @var{target}@r{]})
9766
@code{gensub} is a general substitution function. Like @code{sub} and
9767
@code{gsub}, it searches the target string @var{target} for matches of
9768
the regular expression @var{regexp}. Unlike @code{sub} and
9769
@code{gsub}, the modified string is returned as the result of the
9770
function, and the original target string is @emph{not} changed. If
9771
@var{how} is a string beginning with @samp{g} or @samp{G}, then it
9772
replaces all matches of @var{regexp} with @var{replacement}.
9773
Otherwise, @var{how} is a number indicating which match of @var{regexp}
9774
to replace. If no @var{target} is supplied, @code{$0} is used instead.
9776
@code{gensub} provides an additional feature that is not available
9777
in @code{sub} or @code{gsub}: the ability to specify components of
9778
a regexp in the replacement text. This is done by using parentheses
9779
in the regexp to mark the components, and then specifying @samp{\@var{n}}
9780
in the replacement text, where @var{n} is a digit from one to nine.
9788
> b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
9796
As described above for @code{sub}, you must type two backslashes in order
9797
to get one into the string.
9799
In the replacement text, the sequence @samp{\0} represents the entire
9800
matched text, as does the character @samp{&}.
9802
This example shows how you can use the third argument to control
9803
which match of the regexp should be changed.
9806
$ echo a b c a b c |
9807
> gawk '@{ print gensub(/a/, "AA", 2) @}'
9808
@print{} a b c AA b c
9811
In this case, @code{$0} is used as the default target string.
9812
@code{gensub} returns the new string as its result, which is
9813
passed directly to @code{print} for printing.
9815
If the @var{how} argument is a string that does not begin with @samp{g} or
9816
@samp{G}, or if it is a number that is less than zero, only one
9817
substitution is performed.
9819
@cindex differences between @code{gawk} and @code{awk}
9820
@code{gensub} is a @code{gawk} extension; it is not available
9821
in compatibility mode (@pxref{Options, ,Command Line Options}).
9823
@item substr(@var{string}, @var{start} @r{[}, @var{length}@r{]})
9825
This returns a @var{length}-character-long substring of @var{string},
9826
starting at character number @var{start}. The first character of a
9827
string is character number one. For example,
9828
@code{substr("washington", 5, 3)} returns @code{"ing"}.
9830
If @var{length} is not present, this function returns the whole suffix of
9831
@var{string} that begins at character number @var{start}. For example,
9832
@code{substr("washington", 5)} returns @code{"ington"}. The whole
9833
suffix is also returned
9834
if @var{length} is greater than the number of characters remaining
9835
in the string, counting from character number @var{start}.
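
The three cases just described can be tried side by side; the variable
name @code{s} is only for illustration:

@example
$ awk 'BEGIN @{ s = "washington"
>              print substr(s, 5, 3)
>              print substr(s, 5)
>              print substr(s, 5, 100) @}'
@print{} ing
@print{} ington
@print{} ington
@end example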
9837
@cindex case conversion
9838
@cindex conversion of case
9839
@item tolower(@var{string})
9841
This returns a copy of @var{string}, with each upper-case character
9842
in the string replaced with its corresponding lower-case character.
9843
Non-alphabetic characters are left unchanged. For example,
9844
@code{tolower("MiXeD cAsE 123")} returns @code{"mixed case 123"}.
9846
@item toupper(@var{string})
9848
This returns a copy of @var{string}, with each lower-case character
9849
in the string replaced with its corresponding upper-case character.
9850
Non-alphabetic characters are left unchanged. For example,
9851
@code{toupper("MiXeD cAsE 123")} returns @code{"MIXED CASE 123"}.
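
One plausible use of these functions is comparing text without regard
to case.  This sketch, with made-up input, counts the fields that spell
@samp{yes} in any mixture of case:

@example
$ echo "Yes YES yEs no" |
> awk '@{ for (i = 1; i <= NF; i++) if (tolower($i) == "yes") n++; print n @}'
@print{} 3
@end example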
9854
@c fakenode --- for prepinfo
9855
@subheading More About @samp{\} and @samp{&} with @code{sub}, @code{gsub} and @code{gensub}
9857
@cindex escape processing, @code{sub} et al.
9858
When using @code{sub}, @code{gsub} or @code{gensub}, and trying to get literal
9859
backslashes and ampersands into the replacement text, you need to remember
9860
that there are several levels of @dfn{escape processing} going on.
9862
First, there is the @dfn{lexical} level, which is when @code{awk} reads
9863
your program, and builds an internal copy of your program that can
9866
Then there is the run-time level, when @code{awk} actually scans the
9867
replacement string to determine what to generate.
9869
At both levels, @code{awk} looks for a defined set of characters that
9870
can come after a backslash. At the lexical level, it looks for the
9871
escape sequences listed in @ref{Escape Sequences}.
9872
Thus, for every @samp{\} that @code{awk} will process at the run-time
9873
level, you type two @samp{\}s at the lexical level.
9874
When a character that is not valid for an escape sequence follows the
9875
@samp{\}, Unix @code{awk} and @code{gawk} both simply remove the initial
9876
@samp{\}, and put the following character into the string. Thus, for
9877
example, @code{"a\qb"} is treated as @code{"aqb"}.
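
To see the lexical level on its own, here is a minimal sketch in which
two backslashes typed in a string constant yield a single backslash in
the stored string (so the string is three characters long):

@example
$ awk 'BEGIN @{ s = "a\\b"; print length(s), s @}'
@print{} 3 a\b
@end example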
9879
At the run-time level, the various functions handle sequences of
9880
@samp{\} and @samp{&} differently. The situation is (sadly) somewhat complex.
9882
Historically, the @code{sub} and @code{gsub} functions treated the two
9883
character sequence @samp{\&} specially; this sequence was replaced in
9884
the generated text with a single @samp{&}. Any other @samp{\} within
9885
the @var{replacement} string that did not precede an @samp{&} was passed
9886
through unchanged. To illustrate with a table:
9888
@c Thanks to Karl Berry for help with the TeX stuff.
9891
% This table has lots of &'s and \'s, so unspecialize them.
9892
\catcode`\& = \other \catcode`\\ = \other
9893
% But then we need character for escape and tab.
9895
@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr
9896
You type!@code{sub} sees!@code{sub} generates@cr
9897
@hrulefill!@hrulefill!@hrulefill@cr
9898
@code{\&}! @code{&}!the matched text@cr
9899
@code{\\&}! @code{\&}!a literal @samp{&}@cr
9900
@code{\\\&}! @code{\&}!a literal @samp{&}@cr
9901
@code{\\\\&}! @code{\\&}!a literal @samp{\&}@cr
9902
@code{\\\\\&}! @code{\\&}!a literal @samp{\&}@cr
9903
@code{\\\\\\&}! @code{\\\&}!a literal @samp{\\&}@cr
9904
@code{\\q}! @code{\q}!a literal @samp{\q}@cr
9910
You type @code{sub} sees @code{sub} generates
9911
-------- ---------- ---------------
9912
@code{\&} @code{&} the matched text
9913
@code{\\&} @code{\&} a literal @samp{&}
9914
@code{\\\&} @code{\&} a literal @samp{&}
9915
@code{\\\\&} @code{\\&} a literal @samp{\&}
9916
@code{\\\\\&} @code{\\&} a literal @samp{\&}
9917
@code{\\\\\\&} @code{\\\&} a literal @samp{\\&}
9918
@code{\\q} @code{\q} a literal @samp{\q}
9923
This table shows both the lexical level processing, where
9924
an odd number of backslashes becomes an even number at the run time level,
9925
and the run-time processing done by @code{sub}.
9926
(For the sake of simplicity, the rest of the tables below only show the
9927
case of even numbers of @samp{\}s entered at the lexical level.)
9929
The problem with the historical approach is that there is no way to get
9930
a literal @samp{\} followed by the matched text.
9932
@cindex @code{awk} language, POSIX version
9933
@cindex POSIX @code{awk}
9934
The 1992 POSIX standard attempted to fix this problem. The standard
9935
says that @code{sub} and @code{gsub} look for either a @samp{\} or an @samp{&}
9936
after the @samp{\}. If either one follows a @samp{\}, that character is
9937
output literally. The interpretation of @samp{\} and @samp{&} then becomes
9940
@c thanks to Karl Berry for formatting this table
9943
% This table has lots of &'s and \'s, so unspecialize them.
9944
\catcode`\& = \other \catcode`\\ = \other
9945
% But then we need character for escape and tab.
9947
@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr
9948
You type!@code{sub} sees!@code{sub} generates@cr
9949
@hrulefill!@hrulefill!@hrulefill@cr
9950
@code{&}! @code{&}!the matched text@cr
9951
@code{\\&}! @code{\&}!a literal @samp{&}@cr
9952
@code{\\\\&}! @code{\\&}!a literal @samp{\}, then the matched text@cr
9953
@code{\\\\\\&}! @code{\\\&}!a literal @samp{\&}@cr
9959
You type @code{sub} sees @code{sub} generates
9960
-------- ---------- ---------------
9961
@code{&} @code{&} the matched text
9962
@code{\\&} @code{\&} a literal @samp{&}
9963
@code{\\\\&} @code{\\&} a literal @samp{\}, then the matched text
9964
@code{\\\\\\&} @code{\\\&} a literal @samp{\&}
9969
This would appear to solve the problem.
9970
Unfortunately, the phrasing of the standard is unusual. It
9971
says, in effect, that @samp{\} turns off the special meaning of any
9972
following character, but that for anything other than @samp{\} and @samp{&},
9973
such special meaning is undefined. This wording leads to two problems.
9977
Backslashes must now be doubled in the @var{replacement} string, breaking
9978
historical @code{awk} programs.
9981
To make sure that an @code{awk} program is portable, @emph{every} character
9982
in the @var{replacement} string must be preceded with a
9983
backslash.@footnote{This consequence was certainly unintended.}
9984
@c I can say that, 'cause I was involved in making this change
9987
The POSIX standard is under revision.@footnote{As of December 1995,
9988
with final approval and publication hopefully sometime in 1996.}
9989
Because of the above problems, proposed text for the revised standard
9990
reverts to rules that correspond more closely to the original existing
9991
practice. The proposed rules have special cases that make it possible
9992
to produce a @samp{\} preceding the matched text.
9996
% This table has lots of &'s and \'s, so unspecialize them.
9997
\catcode`\& = \other \catcode`\\ = \other
9998
% But then we need character for escape and tab.
10000
@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr
10001
You type!@code{sub} sees!@code{sub} generates@cr
10002
@hrulefill!@hrulefill!@hrulefill@cr
10003
@code{\\\\\\&}! @code{\\\&}!a literal @samp{\&}@cr
10004
@code{\\\\&}! @code{\\&}!a literal @samp{\}, followed by the matched text@cr
10005
@code{\\&}! @code{\&}!a literal @samp{&}@cr
10006
@code{\\q}! @code{\q}!a literal @samp{\q}@cr
10012
You type @code{sub} sees @code{sub} generates
10013
-------- ---------- ---------------
10014
@code{\\\\\\&} @code{\\\&} a literal @samp{\&}
10015
@code{\\\\&} @code{\\&} a literal @samp{\}, followed by the matched text
10016
@code{\\&} @code{\&} a literal @samp{&}
10017
@code{\\q} @code{\q} a literal @samp{\q}
10021
In a nutshell, at the run-time level, there are now three special sequences
10022
of characters, @samp{\\\&}, @samp{\\&} and @samp{\&}, whereas historically,
10023
there was only one. However, as in the historical case, any @samp{\} that
10024
is not part of one of these three sequences is not special, and appears
10025
in the output literally.
10027
@code{gawk} 3.0 follows these proposed POSIX rules for @code{sub} and
10029
@c As much as we think it's a lousy idea. You win some, you lose some. Sigh.
10030
Whether these proposed rules will actually become codified into the
10031
standard is unknown at this point. Subsequent @code{gawk} releases will
10032
track the standard and implement whatever the final version specifies;
10033
this @value{DOCUMENT} will be updated as well.
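
As a rough illustration of the proposed rules, assuming a @code{gawk}
that implements them, the sketch below applies three different
replacement strings to the same made-up input.  Other @code{awk}
implementations, particularly ones following the historical rules, may
print something different for the last case.

@example
$ echo cat |
> gawk '@{ s = $0; sub(/a/, "[&]", s); print s
>         s = $0; sub(/a/, "[\\&]", s); print s
>         s = $0; sub(/a/, "[\\\\&]", s); print s @}'
@print{} c[a]t
@print{} c[&]t
@print{} c[\a]t
@end example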
10035
The rules for @code{gensub} are considerably simpler. At the run-time
10036
level, whenever @code{gawk} sees a @samp{\}, if the following character
10037
is a digit, then the text that matched the corresponding parenthesized
10038
subexpression is placed in the generated output. Otherwise,
10039
no matter what the character after the @samp{\} is, that character will
10040
appear in the generated text, and the @samp{\} will not.
10044
% This table has lots of &'s and \'s, so unspecialize them.
10045
\catcode`\& = \other \catcode`\\ = \other
10046
% But then we need character for escape and tab.
10048
@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr
10049
You type!@code{gensub} sees!@code{gensub} generates@cr
10050
@hrulefill!@hrulefill!@hrulefill@cr
10051
@code{&}! @code{&}!the matched text@cr
10052
@code{\\&}! @code{\&}!a literal @samp{&}@cr
10053
@code{\\\\}! @code{\\}!a literal @samp{\}@cr
10054
@code{\\\\&}! @code{\\&}!a literal @samp{\}, then the matched text@cr
10055
@code{\\\\\\&}! @code{\\\&}!a literal @samp{\&}@cr
10056
@code{\\q}! @code{\q}!a literal @samp{q}@cr
10062
You type @code{gensub} sees @code{gensub} generates
10063
-------- ------------- ------------------
10064
@code{&} @code{&} the matched text
10065
@code{\\&} @code{\&} a literal @samp{&}
10066
@code{\\\\} @code{\\} a literal @samp{\}
10067
@code{\\\\&} @code{\\&} a literal @samp{\}, then the matched text
10068
@code{\\\\\\&} @code{\\\&} a literal @samp{\&}
10069
@code{\\q} @code{\q} a literal @samp{q}
10073
Because of the complexity of the lexical and run-time level processing,
10074
and the special cases for @code{sub} and @code{gsub},
10075
we recommend using @code{gawk} and @code{gensub} whenever you have
10076
to do substitutions.
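
For instance, a minimal sketch (with made-up input) of the two most
common needs, a literal @samp{&} and a literal @samp{\} placed before
the matched text, might look like this with @code{gensub}:

@example
$ echo a+b |
> gawk '@{ print gensub(/\+/, "\\&", "g")
>         print gensub(/\+/, "\\\\&", "g") @}'
@print{} a&b
@print{} a\+b
@end example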
10078
@node I/O Functions, Time Functions, String Functions, Built-in
10079
@section Built-in Functions for Input/Output
10081
The following functions are related to Input/Output (I/O).
10082
Optional parameters are enclosed in square brackets (``['' and ``]'').
10085
@item close(@var{filename})
10087
Close the file @var{filename}, for input or output. The argument may
10088
alternatively be a shell command that was used for redirecting to or
10089
from a pipe; then the pipe is closed.
10090
@xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes},
10091
for more information.
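
As a sketch of closing a pipe, the following sends two lines to
@code{sort} and closes the pipe before printing anything else.
Closing waits for @code{sort} to finish, so its output should appear
before the final line; without the call to @code{close}, the order is
not something to rely on.

@example
$ awk 'BEGIN @{ print "b" | "sort"; print "a" | "sort"
>              close("sort")
>              print "done" @}'
@print{} a
@print{} b
@print{} done
@end example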
10093
@item fflush(@r{[}@var{filename}@r{]})
10095
@cindex portability issues
10096
@cindex flushing buffers
10097
@cindex buffers, flushing
10098
@cindex buffering output
10099
@cindex output, buffering
10100
Flush any buffered output associated with @var{filename}, which is either a
10101
file opened for writing, or a shell command for redirecting output to
10104
Many utility programs will @dfn{buffer} their output; they save information
10105
to be written to a disk file or terminal in memory, until there is enough
10106
for it to be worthwhile to send the data to the output device.
10107
This is often more efficient than writing
10108
every little bit of information as soon as it is ready. However, sometimes
10109
it is necessary to force a program to @dfn{flush} its buffers; that is,
10110
write the information to its destination, even if a buffer is not full.
10111
This is the purpose of the @code{fflush} function; @code{gawk}, too,
10112
buffers its output, and the @code{fflush} function can be used to force
10113
@code{gawk} to flush its buffers.
10115
@code{fflush} is a recent (1994) addition to the Bell Labs research
10116
version of @code{awk}; it is not part of the POSIX standard, and will
10117
not be available if @samp{--posix} has been specified on the command
10118
line (@pxref{Options, ,Command Line Options}).
10120
@code{gawk} extends the @code{fflush} function in two ways. The first
10121
is to allow no argument at all. In this case, the buffer for the
10122
standard output is flushed. The second way is to allow the null string
10123
(@w{@code{""}}) as the argument. In this case, the buffers for
10124
@emph{all} open output files and pipes are flushed.
10126
@code{fflush} returns zero if the buffer was successfully flushed,
10127
and nonzero otherwise.
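
Here is a minimal sketch of the no-argument form, which relies on the
@code{gawk} extension described above; the variable @code{r} is only
for illustration, and a status of zero is assumed for a successful
flush:

@example
$ gawk 'BEGIN @{ printf "progress: "; r = fflush(); print "flushed, status " r @}'
@print{} progress: flushed, status 0
@end example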
10129
@item system(@var{command})
10131
@cindex interaction, @code{awk} and other programs
10132
The @code{system} function allows the user to execute operating system commands
10133
and then return to the @code{awk} program. The @code{system} function
10134
executes the command given by the string @var{command}. It returns, as
10135
its value, the status returned by the command that was executed.
10137
For example, if the following fragment of code is put in your @code{awk}
10142
system("date | mail -s 'awk run done' root")
10147
the system administrator will be sent mail when the @code{awk} program
10148
finishes processing input and begins its end-of-input processing.
10150
Note that redirecting @code{print} or @code{printf} into a pipe is often
10151
enough to accomplish your task. However, if your @code{awk}
10152
program is interactive, @code{system} is useful for cranking up large
10153
self-contained programs, such as a shell or an editor.
10155
Some operating systems cannot implement the @code{system} function.
10156
@code{system} causes a fatal error if it is not supported.
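
On a system with a POSIX-style shell, the return value can be checked
with a sketch such as this one, in which the command simply exits with
a status of three:

@example
$ awk 'BEGIN @{ r = system("exit 3"); print "status", r @}'
@print{} status 3
@end example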
10159
@c fakenode --- for prepinfo
10160
@subheading Controlling Output Buffering with @code{system}
10161
@cindex flushing buffers
10162
@cindex buffers, flushing
10163
@cindex buffering output
10164
@cindex output, buffering
10166
The @code{fflush} function provides explicit control over output buffering for
10167
individual files and pipes. However, its use is not portable to many other
10168
@code{awk} implementations. An alternative method to flush output
10169
buffers is by calling @code{system} with a null string as its argument:
10172
system("") # flush output
10176
@code{gawk} treats this use of the @code{system} function as a special
10177
case, and is smart enough not to run a shell (or other command
10178
interpreter) with the empty command. Therefore, with @code{gawk}, this
10179
idiom is not only useful, it is efficient. While this method should work
10180
with other @code{awk} implementations, it will not necessarily avoid
10181
starting an unnecessary shell. (Other implementations may only
10182
flush the buffer associated with the standard output, and not necessarily
10183
all buffered output.)
10185
If you think about what a programmer expects, it makes sense that
10186
@code{system} should flush any pending output. The following program:
10190
print "first print"
10191
system("echo system echo")
10192
print "second print"
10214
If @code{awk} did not flush its buffers before calling @code{system}, the
10215
latter (undesirable) output is what you would see.
10217
@node Time Functions, , I/O Functions, Built-in
10218
@section Functions for Dealing with Time Stamps
10221
@cindex time of day
10222
A common use for @code{awk} programs is the processing of log files
10223
containing time stamp information, indicating when a
10224
particular log record was written. Many programs log their time stamp
10225
in the form returned by the @code{time} system call, which is the
10226
number of seconds since a particular epoch. On POSIX systems,
10227
it is the number of seconds since Midnight, January 1, 1970, UTC.
10229
In order to make it easier to process such log files, and to produce
10230
useful reports, @code{gawk} provides two functions for working with time
10231
stamps. Both of these are @code{gawk} extensions; they are not specified
10232
in the POSIX standard, nor are they in any other known version
10235
Optional parameters are enclosed in square brackets (``['' and ``]'').
10240
This function returns the current time as the number of seconds since
10241
the system epoch. On POSIX systems, this is the number of seconds
10242
since Midnight, January 1, 1970, UTC. It may be a different number on
10245
@item strftime(@r{[}@var{format} @r{[}, @var{timestamp}@r{]]})
10247
This function returns a string. It is similar to the function of the
10248
same name in ANSI C. The time specified by @var{timestamp} is used to
10249
produce a string, based on the contents of the @var{format} string.
10250
The @var{timestamp} is in the same format as the value returned by the
10251
@code{systime} function. If no @var{timestamp} argument is supplied,
10252
@code{gawk} will use the current time of day as the time stamp.
10253
If no @var{format} argument is supplied, @code{strftime} uses
10254
@code{@w{"%a %b %d %H:%M:%S %Z %Y"}}. This format string produces
10255
output (almost) equivalent to that of the @code{date} utility.
10256
(Versions of @code{gawk} prior to 3.0 require the @var{format} argument.)
10259
The @code{systime} function allows you to compare a time stamp from a
10260
log file with the current time of day. In particular, it is easy to
10261
determine how long ago a particular record was logged. It also allows
10262
you to produce log records using the ``seconds since the epoch'' format.
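
For example, the age of a record can be computed by simple subtraction.
In this sketch the variable @code{logged} stands in for a time stamp
read from a log file; here it is simply faked as 90 seconds before the
current time:

@example
$ gawk 'BEGIN @{ now = systime(); logged = now - 90
>               print "logged", now - logged, "seconds ago" @}'
@print{} logged 90 seconds ago
@end example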
10264
The @code{strftime} function allows you to easily turn a time stamp
10265
into human-readable information. It is similar in nature to the @code{sprintf}
10267
(@pxref{String Functions, ,Built-in Functions for String Manipulation}),
10268
in that it copies non-format specification characters verbatim to the
10269
returned string, while substituting date and time values for format
10270
specifications in the @var{format} string.
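
The following sketch formats time stamp zero, assuming a POSIX system
(where the epoch is as described above) and a shell that lets you set
@code{TZ} for a single command, so that the result does not depend on
the local time zone:

@example
$ TZ=UTC0 gawk 'BEGIN @{ print strftime("Epoch began on %Y-%m-%d at %H:%M.", 0) @}'
@print{} Epoch began on 1970-01-01 at 00:00.
@end example

The ordinary characters are copied as is; only the @samp{%} sequences
are replaced.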
10272
@code{strftime} is guaranteed by the ANSI C standard to support
10273
the following date format specifications:
10277
The locale's abbreviated weekday name.
10280
The locale's full weekday name.
10283
The locale's abbreviated month name.
10286
The locale's full month name.
10289
The locale's ``appropriate'' date and time representation.
10292
The day of the month as a decimal number (01--31).
10295
The hour (24-hour clock) as a decimal number (00--23).
10298
The hour (12-hour clock) as a decimal number (01--12).
10301
The day of the year as a decimal number (001--366).
10304
The month as a decimal number (01--12).
10307
The minute as a decimal number (00--59).
10310
The locale's equivalent of the AM/PM designations associated
10311
with a 12-hour clock.
10314
The second as a decimal number (00--61).@footnote{Occasionally there are
10315
minutes in a year with one or two leap seconds, which is why the
10316
seconds can go up to 61.}
10319
The week number of the year (the first Sunday as the first day of week one)
10320
as a decimal number (00--53).
10323
The weekday as a decimal number (0--6). Sunday is day zero.
10326
The week number of the year (the first Monday as the first day of week one)
10327
as a decimal number (00--53).
10330
The locale's ``appropriate'' date representation.
10333
The locale's ``appropriate'' time representation.
10336
The year without century as a decimal number (00--99).
10339
The year with century as a decimal number (e.g., 1995).
10342
The time zone name or abbreviation, or no characters if
10343
no time zone is determinable.
10346
A literal @samp{%}.
10349
If a conversion specifier is not one of the above, the behavior is
10350
undefined.@footnote{This is because ANSI C leaves the
10351
behavior of the C version of @code{strftime} undefined, and @code{gawk}
10352
will use the system's version of @code{strftime} if it's there.
10353
Typically, the conversion specifier will either not appear in the
10354
returned string, or it will appear literally.}
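
As a sketch using only specifications that do not depend on the locale,
the time stamp 90061 (one day, one hour, one minute and one second past
the POSIX epoch) can be formatted like this, again assuming @code{TZ}
can be set for a single command:

@example
$ TZ=UTC0 gawk 'BEGIN @{ print strftime("%j %H:%M:%S %y %%", 90061) @}'
@print{} 002 01:01:01 70 %
@end example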
10356
@cindex locale, definition of
10357
Informally, a @dfn{locale} is the geographic place in which a program
10358
is meant to run. For example, a common way to abbreviate the date
10359
September 4, 1991 in the United States would be ``9/4/91''.
10360
In many countries in Europe, however, it would be abbreviated ``4.9.91''.
10361
Thus, the @samp{%x} specification in a @code{"US"} locale might produce
10362
@samp{9/4/91}, while in a @code{"EUROPE"} locale, it might produce
10363
@samp{4.9.91}. The ANSI C standard defines a default @code{"C"}
10364
locale, which is an environment that is typical of what most C programmers
10367
A public-domain C version of @code{strftime} is supplied with @code{gawk}
10368
for systems that are not yet fully ANSI-compliant. If that version is
10369
used to compile @code{gawk} (@pxref{Installation, ,Installing @code{gawk}}),
10370
then the following additional format specifications are available:
10374
Equivalent to specifying @samp{%m/%d/%y}.
10377
The day of the month, padded with a space if it is only one digit.
10380
Equivalent to @samp{%b}, above.
10383
A newline character (ASCII LF).
10386
Equivalent to specifying @samp{%I:%M:%S %p}.
10389
Equivalent to specifying @samp{%H:%M}.
10392
Equivalent to specifying @samp{%H:%M:%S}.
10398
The hour (24-hour clock) as a decimal number (0-23).
10399
Single digit numbers are padded with a space.
10402
The hour (12-hour clock) as a decimal number (1-12).
10403
Single digit numbers are padded with a space.
10406
The century, as a number between 00 and 99.
10409
The weekday as a decimal number
10414
The week number of the year (the first Monday as the first
10415
day of week one) as a decimal number (01--53).
10416
The method for determining the week number is as specified by ISO 8601
10417
(to wit: if the week containing January 1 has four or more days in the
10418
new year, then it is week one, otherwise it is week 53 of the previous year
10419
and the next week is week one).
10422
The year with century of the ISO week number, as a decimal number.
10424
For example, January 1, 1993, is in week 53 of 1992. Thus, the year
10425
of its ISO week number is 1992, even though its year is 1993.
10426
Similarly, December 31, 1973, is in week 1 of 1974. Thus, the year
10427
of its ISO week number is 1974, even though its year is 1973.
10430
The year without century of the ISO week number, as a decimal number (00--99).
10432
@item %Ec %EC %Ex %Ey %EY %Od %Oe %OH %OI
10433
@itemx %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy
10434
These are ``alternate representations'' for the specifications
10435
that use only the second letter (@samp{%c}, @samp{%C}, and so on).
10436
They are recognized, but their normal representations are
10437
used.@footnote{If you don't understand any of this, don't worry about
10438
it; these facilities are meant to make it easier to ``internationalize''
10440
(These facilitate compliance with the POSIX @code{date} utility.)
10443
The date in VMS format (e.g., 20-JUN-1991).
10448
The timezone offset in a +HHMM format (e.g., the format necessary to
10449
produce RFC-822/RFC-1036 date headers).
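
The ISO 8601 week-numbering example given earlier (January 1, 1993) can
be checked with a sketch like the following.  It assumes that the
@code{strftime} used by your @code{gawk} supports @samp{%G} and
@samp{%V}, and it uses 725846400, the POSIX time stamp for midnight UTC
on that date:

@example
$ TZ=UTC0 gawk 'BEGIN @{ print strftime("%Y-%m-%d %G %V", 725846400) @}'
@print{} 1993-01-01 1992 53
@end example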
10452
This example is an @code{awk} implementation of the POSIX
10453
@code{date} utility. Normally, the @code{date} utility prints the
10454
current date and time of day in a well-known format. However, if you
10455
provide an argument to it that begins with a @samp{+}, @code{date}
10456
will copy non-format specifier characters to the standard output, and
10457
will interpret the current time according to the format specifiers in
10458
the string. For example:
10461
$ date '+Today is %A, %B %d, %Y.'
10462
@print{} Today is Thursday, July 11, 1991.
10465
Here is the @code{gawk} version of the @code{date} utility.
10466
It has a shell ``wrapper'', to handle the @samp{-u} option,
10467
which requires that @code{date} run as if the time zone
10474
# date --- approximate the P1003.2 'date' command
10477
-u) TZ=GMT0 # use UTC
10485
format = "%a %b %d %H:%M:%S %Z %Y"
10492
else if (ARGC == 2) @{
10494
if (format ~ /^\+/)
10495
format = substr(format, 2) # remove leading +
10497
print strftime(format)
10503
@node User-defined, Invoking Gawk, Built-in, Top
10504
@chapter User-defined Functions
10506
@cindex user-defined functions
10507
@cindex functions, user-defined
10508
Complicated @code{awk} programs can often be simplified by defining
10509
your own functions. User-defined functions can be called just like
10510
built-in ones (@pxref{Function Calls}), but it is up to you to define
10511
them---to tell @code{awk} what they should do.
10514
* Definition Syntax:: How to write definitions and what they mean.
10515
* Function Example:: An example function definition and what it
10517
* Function Caveats:: Things to watch out for.
10518
* Return Statement:: Specifying the value a function returns.
10521
@node Definition Syntax, Function Example, User-defined, User-defined
10522
@section Function Definition Syntax
10523
@cindex defining functions
10524
@cindex function definition
10526
Definitions of functions can appear anywhere between the rules of an
10527
@code{awk} program. Thus, the general form of an @code{awk} program is
10528
extended to include sequences of rules @emph{and} user-defined function
10530
There is no need in @code{awk} to put the definition of a function
10531
before all uses of the function. This is because @code{awk} reads the
10532
entire program before starting to execute any of it.
10534
The definition of a function named @var{name} looks like this:
10537
function @var{name}(@var{parameter-list})
10539
@var{body-of-function}
10543
@cindex names, use of
10546
@var{name} is the name of the function to be defined. A valid function
10547
name is like a valid variable name: a sequence of letters, digits and
10548
underscores, not starting with a digit.
10549
Within a single @code{awk} program, any particular name can only be
10550
used as a variable, array or function.
10552
@var{parameter-list} is a list of the function's arguments and local
10553
variable names, separated by commas. When the function is called,
10554
the argument names are used to hold the argument values given in
10555
the call. The local variables are initialized to the empty string.
10556
A function cannot have two parameters with the same name.
10558
The @var{body-of-function} consists of @code{awk} statements. It is the
10559
most important part of the definition, because it says what the function
10560
should actually @emph{do}. The argument names exist to give the body a
10561
way to talk about the arguments; local variables, to give the body
10562
places to keep temporary values.
10564
Argument names are not distinguished syntactically from local variable
10565
names; instead, the number of arguments supplied when the function is
10566
called determines how many argument variables there are. Thus, if three
10567
argument values are given, the first three names in @var{parameter-list}
10568
are arguments, and the rest are local variables.
10570
It follows that if the number of arguments is not the same in all calls
10571
to the function, some of the names in @var{parameter-list} may be
10572
arguments on some occasions and local variables on others. Another
10573
way to think of this is that omitted arguments default to the
10576
Usually when you write a function you know how many names you intend to
10577
use for arguments and how many you intend to use as local variables. It is
10578
conventional to place some extra space between the arguments and
10579
the local variables, to document how your function is supposed to be used.
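
Here is a small sketch of that convention, using a hypothetical function
@code{sum3}.  The extra space before @code{total} marks it as a local
variable; when the function is called with only two arguments, @code{c}
is treated as zero in the arithmetic:

@example
$ awk 'function sum3(a, b, c,    total) @{ total = a + b + c; return total @}
>      BEGIN @{ print sum3(1, 2, 3); print sum3(1, 2) @}'
@print{} 6
@print{} 3
@end example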
10581
@cindex variable shadowing
10582
During execution of the function body, the arguments and local variable
10583
values hide or @dfn{shadow} any variables of the same names used in the
10584
rest of the program. The shadowed variables are not accessible in the
10585
function definition, because there is no way to name them while their
10586
names have been taken away for the local variables. All other variables
10587
used in the @code{awk} program can be referenced or set normally in the
10590
The arguments and local variables last only as long as the function body
10591
is executing. Once the body finishes, you can once again access the
10592
variables that were shadowed while the function was running.
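
A minimal sketch of shadowing, with a hypothetical function @code{f}
whose parameter has the same name as a global variable:

@example
$ awk 'function f(x) @{ x = "local"; return x @}
>      BEGIN @{ x = "global"; print f("arg"); print x @}'
@print{} local
@print{} global
@end example

The assignment inside @code{f} changes only the parameter; the global
@code{x} keeps its value.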
10594
@cindex recursive function
10595
@cindex function, recursive
10596
The function body can contain expressions which call functions. They
10597
can even call this function, either directly or by way of another
10598
function. When this happens, we say the function is @dfn{recursive}.
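
As a brief sketch, a hypothetical function @code{fact} can compute a
factorial by calling itself directly:

@example
$ awk 'function fact(n) @{ return (n <= 1) ? 1 : n * fact(n - 1) @}
>      BEGIN @{ print fact(5) @}'
@print{} 120
@end example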
10600
@cindex @code{awk} language, POSIX version
10601
@cindex POSIX @code{awk}
10602
In many @code{awk} implementations, including @code{gawk},
10603
the keyword @code{function} may be
10604
abbreviated @code{func}. However, POSIX only specifies the use of
10605
the keyword @code{function}. This actually has some practical implications.
10606
If @code{gawk} is in POSIX-compatibility mode
10607
(@pxref{Options, ,Command Line Options}), then the following
10608
statement will @emph{not} define a function:
10611
func foo() @{ a = sqrt($1) ; print a @}
10615
Instead it defines a rule that, for each record, concatenates the value
10616
of the variable @samp{func} with the return value of the function @samp{foo}.
10617
If the resulting string is non-null, the action is executed.
10618
This is probably not what was desired. (@code{awk} accepts this input as
10619
syntactically valid, since functions may be used before they are defined
10620
in @code{awk} programs.)
10622
@cindex portability issues
10623
To ensure that your @code{awk} programs are portable, always use the
10624
keyword @code{function} when defining a function.
10626
@node Function Example, Function Caveats, Definition Syntax, User-defined
10627
@section Function Definition Examples
10629
Here is an example of a user-defined function, called @code{myprint}, that
10630
takes a number and prints it in a specific format.
10633
function myprint(num)
10635
printf "%6.3g\n", num
10640
To illustrate, here is an @code{awk} rule which uses our @code{myprint}
10644
$3 > 0 @{ myprint($3) @}
10648
This program prints, in our special format, all the third fields that
10649
contain a positive number in our input. Therefore, when given:
10653
9.10 11.12 -13.14 15.16
10654
17.18 19.20 21.22 23.24
10658
this program, using our function to format the results, prints:
10665
This function deletes all the elements in an array.
10668
function delarray(a, i)
10675
When working with arrays, it is often necessary to delete all the elements
10676
in an array and start over with a new list of elements
10677
(@pxref{Delete, ,The @code{delete} Statement}).
10679
to repeat this loop everywhere in your program that you need to clear out
10680
an array, your program can just call @code{delarray}.
10682
Here is an example of a recursive function. It takes a string
10683
as an input parameter, and returns the string in backwards order.
10686
function rev(str, start)
10691
return (substr(str, start, 1) rev(str, start - 1))
10695
If this function is in a file named @file{rev.awk}, we can test it this way:
10699
$ echo "Don't Panic!" |
10700
> gawk --source '@{ print rev($0, length($0)) @}' -f rev.awk
10701
@print{} !cinaP t'noD
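
For readers who want to try this, a complete @file{rev.awk} might look
like the sketch below. The test for @code{start == 0}, which ends the
recursion, is an assumption about how the function terminates; the
@code{return} statement is the one shown above.

@example
# rev: return str reversed; call as rev(str, length(str))
function rev(str, start)
@{
    if (start == 0)
        return ""    # assumed base case: nothing left to reverse

    return (substr(str, start, 1) rev(str, start - 1))
@}
@end example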
10704
Here is an example that uses the built-in function @code{strftime}.
10705
(@xref{Time Functions, ,Functions for Dealing with Time Stamps},
10706
for more information on @code{strftime}.)
10707
The C @code{ctime} function takes a timestamp and returns it in a string,
10708
formatted in a well-known fashion. Here is an @code{awk} version:
10711
@c file eg/lib/ctime.awk
10715
# awk version of C ctime(3) function
10717
function ctime(ts, format)
10719
format = "%a %b %d %H:%M:%S %Z %Y"
10721
ts = systime() # use current time as default
10722
return strftime(format, ts)
10728
@node Function Caveats, Return Statement, Function Example, User-defined
10729
@section Calling User-defined Functions
10731
@cindex call by value
10732
@cindex call by reference
10733
@cindex calling a function
10734
@cindex function call
10735
@dfn{Calling a function} means causing the function to run and do its job.
10736
A function call is an expression, and its value is the value returned by the function.
10739
A function call consists of the function name followed by the arguments
10740
in parentheses. What you write in the call for the arguments are
10741
@code{awk} expressions; each time the call is executed, these
10742
expressions are evaluated, and the values are the actual arguments. For
10743
example, here is a call to @code{foo} with three arguments (the first
10744
being a string concatenation):
10747
foo(x y, "lose", 4 * z)
10750
@strong{Caution:} whitespace characters (spaces and tabs) are not allowed
10751
between the function name and the open-parenthesis of the argument list.
10752
If you write whitespace by mistake, @code{awk} might think that you mean
10753
to concatenate a variable with an expression in parentheses. However, it
10754
notices that you used a function name and not a variable name, and reports an error.
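
The following sketch shows the argument expressions being evaluated
before the call takes place. The function @code{show} is hypothetical
and simply prints what it receives.

@example
function show(a, b, c)
@{
    print a
    print b
    print c
@}

BEGIN @{
    x = "A"; y = "B"; z = 3
    show(x y, "lose", 4 * z)    # no space between `show' and `('
@}
@end example

This prints @samp{AB}, @samp{lose}, and @samp{12}, one per line.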
10757
@cindex call by value
10758
When a function is called, it is given a @emph{copy} of the values of
10759
its arguments. This is known as @dfn{call by value}. The caller may use
10760
a variable as the expression for the argument, but the called function
10761
does not know this: it only knows what value the argument had. For
10762
example, if you write this code:
10770
then you should not think of the argument to @code{myfunc} as being
10771
``the variable @code{foo}.'' Instead, think of the argument as the
10772
string value, @code{"bar"}.
10774
If the function @code{myfunc} alters the values of its local variables,
10775
this has no effect on any other variables. Thus, if @code{myfunc} does this:
10780
function myfunc(str)
10790
to change its first argument variable @code{str}, this @emph{does not}
10791
change the value of @code{foo} in the caller. The role of @code{foo} in
10792
calling @code{myfunc} ended when its value, @code{"bar"}, was computed.
10793
If @code{str} also exists outside of @code{myfunc}, the function body
10794
cannot alter this outer value, because it is shadowed during the
10795
execution of @code{myfunc} and cannot be seen or changed from there.
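
Put together as a complete program, the situation might look like the
following sketch. The assignment of @code{"zzz"} and the @code{print}
statements are illustrative choices made for this example.

@example
function myfunc(str)
@{
    print str       # the value passed by the caller
    str = "zzz"     # changes only the local copy
    print str
@}

BEGIN @{
    foo = "bar"
    myfunc(foo)
    print foo       # the caller's variable is unchanged
@}
@end example

This prints @samp{bar}, @samp{zzz}, and then @samp{bar} again.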
10797
@cindex call by reference
10798
However, when arrays are the parameters to functions, they are @emph{not}
10799
copied. Instead, the array itself is made available for direct manipulation
10800
by the function. This is usually called @dfn{call by reference}.
10801
Changes made to an array parameter inside the body of a function @emph{are}
10802
visible outside that function.
10804
This can be @strong{very} dangerous if you do not watch what you are
10805
doing. For example:
10813
function changeit(array, ind, nvalue)
10815
array[ind] = nvalue
10819
a[1] = 1; a[2] = 2; a[3] = 3
10820
changeit(a, 2, "two")
10821
printf "a[1] = %s, a[2] = %s, a[3] = %s\n", a[1], a[2], a[3]
10827
This program prints @samp{a[1] = 1, a[2] = two, a[3] = 3}, because
10828
@code{changeit} stores @code{"two"} in the second element of @code{a}.
10830
@cindex undefined functions
10831
@cindex functions, undefined
10832
Some @code{awk} implementations allow you to call a function that
10833
has not been defined, and only report a problem at run-time when the
10834
program actually tries to call the function. For example:
10844
function bar() @{ @dots{} @}
10845
# note that `foo' is not defined
10850
Since the @samp{if} statement will never be true, it is not really a
10851
problem that @code{foo} has not been defined. Usually though, it is a
10852
problem if a program calls an undefined function.
10855
At one point, I had @code{gawk} dying on this, but later decided that this might
10856
break old programs and/or test suites.
10859
If @samp{--lint} has been specified
10860
(@pxref{Options, ,Command Line Options}),
10861
@code{gawk} will report about calls to undefined functions.
10863
@node Return Statement, , Function Caveats, User-defined
10864
@section The @code{return} Statement
10865
@cindex @code{return} statement
10867
The body of a user-defined function can contain a @code{return} statement.
10868
This statement returns control to the rest of the @code{awk} program. It
10869
can also be used to return a value for use in the rest of the @code{awk}
10870
program. It looks like this:
10873
return @r{[}@var{expression}@r{]}
10876
The @var{expression} part is optional. If it is omitted, then the returned
10877
value is undefined and, therefore, unpredictable.
10879
A @code{return} statement with no value expression is assumed at the end of
10880
every function definition. So if control reaches the end of the function
10881
body, then the function returns an unpredictable value. @code{awk}
10882
will @emph{not} warn you if you use the return value of such a function.
10884
Sometimes, you want to write a function for what it does, not for
10885
what it returns. Such a function corresponds to a @code{void} function
10886
in C or to a @code{procedure} in Pascal. Thus, it may be appropriate to not
10887
return any value; you should simply bear in mind that if you use the return
10888
value of such a function, you do so at your own risk.
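
The two styles can be sketched side by side. Both function names are
made up for this illustration.

@example
# returns a value; the call can be used in an expression
function twice(x)
@{
    return 2 * x
@}

# called only for its effect; it has no meaningful return value
function greet(name)
@{
    print "hello, " name
@}

BEGIN @{
    print twice(21)
    greet("world")
@}
@end example

This prints @samp{42} followed by @samp{hello, world}.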
10890
Here is an example of a user-defined function that returns a value
10891
for the largest number among the elements of an array:
10895
function maxelt(vec, i, ret)
10898
if (ret == "" || vec[i] > ret)
10907
You call @code{maxelt} with one argument, which is an array name. The local
10908
variables @code{i} and @code{ret} are not intended to be arguments;
10909
while there is nothing to stop you from passing two or three arguments
10910
to @code{maxelt}, the results would be strange. The extra space before
10911
@code{i} in the function parameter list indicates that @code{i} and
10912
@code{ret} are not supposed to be arguments. This is a convention that
10913
you should follow when you define functions.
10915
Here is a program that uses our @code{maxelt} function. It loads an
10916
array, calls @code{maxelt}, and then reports the maximum number in that array:
10922
function maxelt(vec, i, ret)
10925
if (ret == "" || vec[i] > ret)
10933
# Load all fields of each record into nums.
10935
for(i = 1; i <= NF; i++)
10945
Given the following input:
10951
256 291 1396 2962 100
10958
our program tells us (predictably) that @code{99385} is the largest number in our array.
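
For reference, the pieces of this program can be assembled as in the
following sketch. The loop in @code{maxelt}, the @samp{nums[NR, i]}
subscripts, and the @code{END} rule are a reconstruction and should be
treated as assumptions rather than as the definitive program.

@example
function maxelt(vec,   i, ret)
@{
    for (i in vec) @{
        if (ret == "" || vec[i] > ret)
            ret = vec[i]
    @}
    return ret
@}

# Load all fields of each record into nums.
@{
    for (i = 1; i <= NF; i++)
        nums[NR, i] = $i
@}

END @{
    print maxelt(nums)
@}
@end example

Given the two input lines @samp{12 47 33} and @samp{25 19 41}, this
sketch prints @samp{47}.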
10961
@node Invoking Gawk, Library Functions, User-defined, Top
10962
@chapter Running @code{awk}
10963
@cindex command line
10964
@cindex invocation of @code{gawk}
10965
@cindex arguments, command line
10966
@cindex options, command line
10967
@cindex long options
10968
@cindex options, long
10970
There are two ways to run @code{awk}: with an explicit program, or with
10971
one or more program files. Here are templates for both of them; items
10972
enclosed in @samp{@r{[}@dots{}@r{]}} in these templates are optional.
10974
Besides traditional one-letter POSIX-style options, @code{gawk} also
10975
supports GNU long options.
10978
awk @r{[@var{options}]} -f progfile @r{[@code{--}]} @var{file} @dots{}
10979
awk @r{[@var{options}]} @r{[@code{--}]} '@var{program}' @var{file} @dots{}
10982
@cindex empty program
10983
@cindex dark corner
10984
It is possible to invoke @code{awk} with an empty program:
10987
$ awk '' datafile1 datafile2
10991
Doing so makes little sense though; @code{awk} will simply exit
10992
silently when given an empty program (d.c.). If @samp{--lint} has
10993
been specified on the command line, @code{gawk} will issue a
10994
warning that the program is empty.
10997
* Options:: Command line options and their meanings.
10998
* Other Arguments:: Input file names and variable assignments.
10999
* AWKPATH Variable:: Searching directories for @code{awk} programs.
11000
* Obsolete:: Obsolete Options and/or features.
11001
* Undocumented:: Undocumented Options and Features.
11002
* Known Bugs:: Known Bugs in @code{gawk}.
11005
@node Options, Other Arguments, Invoking Gawk, Invoking Gawk
11006
@section Command Line Options
11008
Options begin with a dash, and consist of a single character.
11009
GNU style long options consist of two dashes and a keyword.
11010
The keyword can be abbreviated, as long as the abbreviation allows the option
11011
to be uniquely identified. If the option takes an argument, then the
11012
keyword is either immediately followed by an equals sign (@samp{=}) and the
11013
argument's value, or the keyword and the argument's value are separated
11014
by whitespace. For brevity, the discussion below only refers to the
11015
traditional short options; however, the long and short options are
11016
interchangeable in all contexts.
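
As a sketch of these forms, the three commands below should be
equivalent in @code{gawk}. The abbreviation @samp{--field-sep} is chosen
for this example; it works only as long as no other long option begins
the same way.

@example
$ echo a:b | gawk -F: '@{ print $1 @}'
@print{} a
$ echo a:b | gawk --field-separator=: '@{ print $1 @}'
@print{} a
$ echo a:b | gawk --field-sep : '@{ print $1 @}'
@print{} a
@end example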
11018
Each long option for @code{gawk} has a corresponding
11019
POSIX-style option. The options and their meanings are as follows:
11023
@item -F @var{fs}
@itemx --field-separator @var{fs}
11024
@cindex @code{-F} option
11025
@cindex @code{--field-separator} option
11026
Sets the @code{FS} variable to @var{fs}
11027
(@pxref{Field Separators, ,Specifying How Fields are Separated}).
11029
@item -f @var{source-file}
11030
@itemx --file @var{source-file}
11031
@cindex @code{-f} option
11032
@cindex @code{--file} option
11033
Indicates that the @code{awk} program is to be found in @var{source-file}
11034
instead of in the first non-option argument.
11036
@item -v @var{var}=@var{val}
11037
@itemx --assign @var{var}=@var{val}
11038
@cindex @code{-v} option
11039
@cindex @code{--assign} option
11040
Sets the variable @var{var} to the value @var{val} @strong{before}
11041
execution of the program begins. Such variable values are available
11042
inside the @code{BEGIN} rule
11043
(@pxref{Other Arguments, ,Other Command Line Arguments}).
11045
The @samp{-v} option can only set one variable, but you can use
11046
it more than once, setting another variable each time, like this:
11047
@samp{awk @w{-v foo=1} @w{-v bar=2} @dots{}}.
11049
@item -mf=@var{NNN}
11050
@itemx -mr=@var{NNN}
11051
Set various memory limits to the value @var{NNN}. The @samp{f} flag sets
11052
the maximum number of fields, and the @samp{r} flag sets the maximum
11053
record size. These two flags and the @samp{-m} option are from the
11054
Bell Labs research version of Unix @code{awk}. They are provided
11055
for compatibility, but otherwise ignored by
11056
@code{gawk}, since @code{gawk} has no predefined limits.
11058
@item -W @var{gawk-opt}
11059
@cindex @code{-W} option
11060
Following the POSIX standard, options that are implementation
11061
specific are supplied as arguments to the @samp{-W} option. With @code{gawk},
11062
these arguments may be separated by commas, or quoted and separated by
11063
whitespace. Case is ignored when processing these options. These options
11064
also have corresponding GNU style long options.
11068
Signals the end of the command line options. The following arguments
11069
are not treated as options even if they begin with @samp{-}. This
11070
interpretation of @samp{--} follows the POSIX argument parsing conventions.
11073
This is useful if you have file names that start with @samp{-},
11074
or in shell scripts, if you have file names that will be specified
11075
by the user which could start with @samp{-}.
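
A short sketch of the first case follows. The file names
@file{first.awk} and @file{-data.txt} are invented for this example;
without the @samp{--}, the name @file{-data.txt} would be taken for a
group of options.

@example
$ cat first.awk
@print{} @{ print $1 @}
$ awk -f first.awk -- -data.txt
@end example

This prints the first field of each line of @file{-data.txt}.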
11078
The following @code{gawk}-specific options are available:
11081
@item -W traditional
11083
@itemx --traditional
11085
@cindex @code{--compat} option
11086
@cindex @code{--traditional} option
11087
@cindex compatibility mode
11088
Specifies @dfn{compatibility mode}, in which the GNU extensions to
11089
the @code{awk} language are disabled, so that @code{gawk} behaves just
11090
like the Bell Labs research version of Unix @code{awk}.
11091
@samp{--traditional} is the preferred form of this option.
11092
@xref{POSIX/GNU, ,Extensions in @code{gawk} Not in POSIX @code{awk}},
11093
which summarizes the extensions. Also see
11094
@ref{Compatibility Mode, ,Downward Compatibility and Debugging}.
11097
@itemx -W copyright
11100
@cindex @code{--copyleft} option
11101
@cindex @code{--copyright} option
11102
Print the short version of the General Public License.
11103
This option may disappear in a future version of @code{gawk}.
11109
@cindex @code{--help} option
11110
@cindex @code{--usage} option
11111
Print a ``usage'' message summarizing the short and long style options
11112
that @code{gawk} accepts, and then exit.
11116
@cindex @code{--lint} option
11117
Warn about constructs that are dubious or non-portable to
11118
other @code{awk} implementations.
11119
Some warnings are issued when @code{gawk} first reads your program. Others
11120
are issued at run-time, as your program executes.
11124
@cindex @code{--lint-old} option
11125
Warn about constructs that are not available in
11126
the original Version 7 Unix version of @code{awk}
11127
(@pxref{V7/SVR3.1, , Major Changes between V7 and SVR3.1}).
11131
@cindex @code{--posix} option
11133
Operate in strict POSIX mode. This disables all @code{gawk}
11134
extensions (just like @samp{--traditional}), and adds the following additional restrictions:
11137
@c IMPORTANT! Keep this list in sync with the one in node POSIX
11141
@code{\x} escape sequences are not recognized
11142
(@pxref{Escape Sequences}).
11145
The synonym @code{func} for the keyword @code{function} is not
11146
recognized (@pxref{Definition Syntax, ,Function Definition Syntax}).
11149
The operators @samp{**} and @samp{**=} cannot be used in
11150
place of @samp{^} and @samp{^=} (@pxref{Arithmetic Ops, ,Arithmetic Operators},
11151
and also @pxref{Assignment Ops, ,Assignment Expressions}).
11154
Specifying @samp{-Ft} on the command line does not set the value
11155
of @code{FS} to be a single tab character
11156
(@pxref{Field Separators, ,Specifying How Fields are Separated}).
11159
The @code{fflush} built-in function is not supported
11160
(@pxref{I/O Functions, , Built-in Functions for Input/Output}).
11163
If you supply both @samp{--traditional} and @samp{--posix} on the
11164
command line, @samp{--posix} will take precedence. @code{gawk}
11165
will also issue a warning if both options are supplied.
11167
@item -W re-interval
11168
@itemx --re-interval
11169
Allow interval expressions
11170
(@pxref{Regexp Operators, , Regular Expression Operators}) in regexps.
11172
Because interval expressions were traditionally not available in @code{awk},
11173
@code{gawk} does not provide them by default. This prevents old @code{awk}
11174
programs from breaking.
11176
@item -W source @var{program-text}
11177
@itemx --source @var{program-text}
11178
@cindex @code{--source} option
11179
Program source code is taken from the @var{program-text}. This option
11180
allows you to mix source code in files with source
11181
code that you enter on the command line. This is particularly useful
11182
when you have library functions that you wish to use from your command line
11183
programs (@pxref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}).
11187
@cindex @code{--version} option
11188
Prints version information for this particular copy of @code{gawk}.
11189
This allows you to determine if your copy of @code{gawk} is up to date
11190
with respect to whatever the Free Software Foundation is currently distributing.
11192
It is also useful for bug reports
11193
(@pxref{Bugs, , Reporting Problems and Bugs}).
11196
Any other options are flagged as invalid with a warning message, but
11197
are otherwise ignored.
11199
In compatibility mode, as a special case, if the value of @var{fs} supplied
11200
to the @samp{-F} option is @samp{t}, then @code{FS} is set to the tab
11201
character (@code{"\t"}). This is only true for @samp{--traditional}, and not for @samp{--posix}
11203
(@pxref{Field Separators, ,Specifying How Fields are Separated}).
11205
The @samp{-f} option may be used more than once on the command line.
11206
If it is, @code{awk} reads its program source from all of the named files, as
11207
if they had been concatenated together into one big file. This is
11208
useful for creating libraries of @code{awk} functions. Useful functions
11209
can be written once, and then retrieved from a standard place, instead
11210
of having to be included into each individual program.
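
As a minimal sketch, suppose a library file and a main program are kept
separately. The file names @file{lib.awk} and @file{main.awk} and the
function @code{twice} are hypothetical.

@example
$ cat lib.awk
@print{} function twice(x) @{ return 2 * x @}
$ cat main.awk
@print{} BEGIN @{ print twice(21) @}
$ awk -f lib.awk -f main.awk
@print{} 42
@end example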
11212
You can type in a program at the terminal and still use library functions,
11213
by specifying @samp{-f /dev/tty}. @code{awk} will read a file from the terminal
11214
to use as part of the @code{awk} program. After typing your program,
11215
type @kbd{Control-d} (the end-of-file character) to terminate it.
11216
(You may also use @samp{-f -} to read program source from the standard
11217
input, but then you will not be able to also use the standard input as a source of data.)
11220
Because it is clumsy using the standard @code{awk} mechanisms to mix source
11221
file and command line @code{awk} programs, @code{gawk} provides the
11222
@samp{--source} option. This does not require you to pre-empt the standard
11223
input for your source code, and allows you to easily mix command line
11224
and library source code
11225
(@pxref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}).
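
The following sketch mixes the two kinds of source. It assumes a
hypothetical library file @file{lib.awk} in the current directory that
defines a function @code{twice}; neither name comes from the
@code{gawk} distribution.

@example
$ cat lib.awk
@print{} function twice(x) @{ return 2 * x @}
$ gawk -f lib.awk --source 'BEGIN @{ print twice(21) @}'
@print{} 42
@end example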
11227
If no @samp{-f} or @samp{--source} option is specified, then @code{gawk}
11228
will use the first non-option command line argument as the text of the
11229
program source code.
11231
@cindex @code{POSIXLY_CORRECT} environment variable
11232
@cindex environment variable, @code{POSIXLY_CORRECT}
11233
If the environment variable @code{POSIXLY_CORRECT} exists,
11234
then @code{gawk} will behave in strict POSIX mode, exactly as if
11235
you had supplied the @samp{--posix} command line option.
11236
Many GNU programs look for this environment variable to turn on
11237
strict POSIX mode. If you supply @samp{--lint} on the command line,
11238
and @code{gawk} turns on POSIX mode because of @code{POSIXLY_CORRECT},
11239
then it will print a warning message indicating that POSIX mode is in effect.
11242
You would typically set this variable in your shell's startup file.
11243
For a Bourne compatible shell (such as Bash), you would add these
11244
lines to the @file{.profile} file in your home directory.
11248
POSIXLY_CORRECT=true
11249
export POSIXLY_CORRECT
11253
For a @code{csh} compatible shell,@footnote{Not recommended.}
11254
you would add this line to the @file{.login} file in your home directory.
11257
setenv POSIXLY_CORRECT true
11260
@node Other Arguments, AWKPATH Variable, Options, Invoking Gawk
11261
@section Other Command Line Arguments
11263
Any additional arguments on the command line are normally treated as
11264
input files to be processed in the order specified. However, an
11265
argument that has the form @code{@var{var}=@var{value}}, assigns
11266
the value @var{value} to the variable @var{var}; it does not specify a file at all.
11271
All these arguments are made available to your @code{awk} program in the
11272
@code{ARGV} array (@pxref{Built-in Variables}). Command line options
11273
and the program text (if present) are omitted from @code{ARGV}.
11274
All other arguments, including variable assignments, are
11275
included. As each element of @code{ARGV} is processed, @code{gawk}
11276
sets the variable @code{ARGIND} to the index in @code{ARGV} of the current element.
11279
The distinction between file name arguments and variable-assignment
11280
arguments is made when @code{awk} is about to open the next input file.
11281
At that point in execution, it checks the ``file name'' to see whether
11282
it is really a variable assignment; if so, @code{awk} sets the variable
11283
instead of reading a file.
11285
Therefore, the variables actually receive the given values after all
11286
previously specified files have been read. In particular, the values of
11287
variables assigned in this fashion are @emph{not} available inside a @code{BEGIN} rule
11289
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}),
11290
since such rules are run before @code{awk} begins scanning the argument list.
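
The timing can be seen in a small sketch. The file names
@file{one.txt} and @file{two.txt} and the variable @code{n} are
invented for this example.

@example
$ cat one.txt
@print{} a
$ cat two.txt
@print{} b
$ awk 'BEGIN @{ print "n=" n @}
>            @{ print n, $0 @}' n=1 one.txt n=2 two.txt
@print{} n=
@print{} 1 a
@print{} 2 b
@end example

The @code{BEGIN} rule sees no value for @code{n}, while each file is
read with the value assigned just before it on the command line.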
11292
@cindex dark corner
11293
The variable values given on the command line are processed for escape
11294
sequences (d.c.) (@pxref{Escape Sequences}).
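
For instance, in the sketch below the two characters @samp{\n} in the
assigned value are turned into a newline. The variable name @code{x} is
arbitrary, and @samp{-} names the standard input.

@example
$ echo | awk '@{ print x @}' 'x=one\ntwo' -
@print{} one
@print{} two
@end example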
11296
In some earlier implementations of @code{awk}, when a variable assignment
11297
occurred before any file names, the assignment would happen @emph{before}
11298
the @code{BEGIN} rule was executed. @code{awk}'s behavior was thus
11299
inconsistent; some command line assignments were available inside the
11300
@code{BEGIN} rule, while others were not. However,
11301
some applications came to depend
11302
upon this ``feature.'' When @code{awk} was changed to be more consistent,
11303
the @samp{-v} option was added to accommodate applications that depended
11304
upon the old behavior.
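
The difference can be sketched with a program that consists only of a
@code{BEGIN} rule. The variable name @code{x} is arbitrary.

@example
$ awk 'BEGIN @{ print "x=" x @}' x=1
@print{} x=
$ awk -v x=1 'BEGIN @{ print "x=" x @}'
@print{} x=1
@end example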
11306
The variable assignment feature is most useful for assigning to variables
11307
such as @code{RS}, @code{OFS}, and @code{ORS}, which control input and
11308
output formats, before scanning the data files. It is also useful for
11309
controlling state if multiple passes are needed over a data file. For example:
11312
@cindex multiple passes over data
11313
@cindex passes, multiple
11315
awk 'pass == 1 @{ @var{pass 1 stuff} @}
11316
pass == 2 @{ @var{pass 2 stuff} @}' pass=1 mydata pass=2 mydata
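
As one concrete sketch of this pattern, the first pass below adds up a
column and the second pass reports each value as a percentage of the
total. The contents of @file{mydata} and the output wording are
assumptions made for this example.

@example
$ cat mydata
@print{} 1
@print{} 3
$ awk 'pass == 1 @{ total += $1 @}
>      pass == 2 @{ printf "%d is %d%%\n", $1, 100 * $1 / total @}' \
>      pass=1 mydata pass=2 mydata
@print{} 1 is 25%
@print{} 3 is 75%
@end example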
11319
Given the variable assignment feature, the @samp{-F} option for setting
11320
the value of @code{FS} is not
11321
strictly necessary. It remains for historical compatibility.
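
A brief sketch of the equivalence: both commands below split the input
on colons. In the second, @samp{-} names the standard input, and the
assignment takes effect before that input is read.

@example
$ echo a:b:c | awk -F: '@{ print $2 @}'
@print{} b
$ echo a:b:c | awk '@{ print $2 @}' FS=: -
@print{} b
@end example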
11323
@node AWKPATH Variable, Obsolete, Other Arguments, Invoking Gawk
11324
@section The @code{AWKPATH} Environment Variable
11325
@cindex @code{AWKPATH} environment variable
11326
@cindex environment variable, @code{AWKPATH}
11327
@cindex search path
11328
@cindex directory search
11329
@cindex path, search
11330
@cindex differences between @code{gawk} and @code{awk}
11332
The previous section described how @code{awk} program files can be named
11333
on the command line with the @samp{-f} option. In most @code{awk}
11334
implementations, you must supply a precise path name for each program
11335
file, unless the file is in the current directory.
11337
@cindex search path, for source files
11338
But in @code{gawk}, if the file name supplied to the @samp{-f} option
11339
does not contain a @samp{/}, then @code{gawk} searches a list of
11340
directories (called the @dfn{search path}), one by one, looking for a
11341
file with the specified name.
11343
The search path is a string consisting of directory names
11344
separated by colons. @code{gawk} gets its search path from the
11345
@code{AWKPATH} environment variable. If that variable does not exist,
11346
@code{gawk} uses a default path, which is
11347
@samp{.:/usr/local/share/awk}.@footnote{Your version of @code{gawk}
11348
may use a directory that is different from @file{/usr/local/share/awk}; it
11349
will depend upon how @code{gawk} was built and installed. The actual
11350
directory will be the value of @samp{$(datadir)} generated when
11351
@code{gawk} was configured. You probably don't need to worry about this
11352
though.} (Programs written for use by
11353
system administrators should use an @code{AWKPATH} variable that
11354
does not include the current directory, @file{.}.)
11356
The search path feature is particularly useful for building up libraries
11357
of useful @code{awk} functions. The library files can be placed in a
11358
standard directory that is in the default path, and then specified on
11359
the command line with a short file name. Otherwise, the full file name
11360
would have to be typed for each file.
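
The following sketch uses Bourne shell syntax to set @code{AWKPATH} for
a single command. The directory @file{/tmp/awklib}, the file
@file{lib.awk}, and the function @code{twice} that it is assumed to
define are all hypothetical.

@example
$ cat /tmp/awklib/lib.awk
@print{} function twice(x) @{ return 2 * x @}
$ AWKPATH=/tmp/awklib gawk -f lib.awk --source 'BEGIN @{ print twice(21) @}'
@print{} 42
@end example

Because @samp{lib.awk} contains no @samp{/}, @code{gawk} looks for it in
the directories named by @code{AWKPATH}.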
11362
By using both the @samp{--source} and @samp{-f} options, your command line
11363
@code{awk} programs can use facilities in @code{awk} library files.
11364
@xref{Library Functions, , A Library of @code{awk} Functions}.
11366
Path searching is not done if @code{gawk} is in compatibility mode.
11367
This is true for both @samp{--traditional} and @samp{--posix}.
11368
@xref{Options, ,Command Line Options}.
11370
@strong{Note:} if you want files in the current directory to be found,
11371
you must include the current directory in the path, either by including
11372
@file{.} explicitly in the path, or by writing a null entry in the
11373
path. (A null entry is indicated by starting or ending the path with a
11374
colon, or by placing two colons next to each other (@samp{::}).) If the
11375
current directory is not included in the path, then files cannot be
11376
found in the current directory. This path search mechanism is identical to the shell's.
11378
@c someday, @cite{The Bourne Again Shell}....
11380
Starting with version 3.0, if @code{AWKPATH} is not defined in the
11381
environment, @code{gawk} will place its default search path into
11382
@code{ENVIRON["AWKPATH"]}. This makes it easy to determine
11383
the actual search path @code{gawk} will use.
11385
@node Obsolete, Undocumented, AWKPATH Variable, Invoking Gawk
11386
@section Obsolete Options and/or Features
11388
@cindex deprecated options
11389
@cindex obsolete options
11390
@cindex deprecated features
11391
@cindex obsolete features
11392
This section describes features and/or command line options from
11393
previous releases of @code{gawk} that are either not available in the
11394
current version, or that are still supported but deprecated (meaning that
11395
they will @emph{not} be in the next release).
11397
@c update this section for each release!
11399
For version @value{VERSION} of @code{gawk}, there are no command line options
11400
or other deprecated features from the previous version of @code{gawk}.
11407
This section is thus essentially a place holder,
11408
in case some option becomes obsolete in a future version of @code{gawk}.
11411
@c This is pretty old news...
11412
The public-domain version of @code{strftime} that is distributed with
11413
@code{gawk} changed for the 2.14 release. The @samp{%V} conversion specifier
11414
that used to generate the date in VMS format was changed to @samp{%v}.
11415
This is because the POSIX standard for the @code{date} utility now
11416
specifies a @samp{%V} conversion specifier.
11417
@xref{Time Functions, ,Functions for Dealing with Time Stamps}, for details.
11420
@node Undocumented, Known Bugs, Obsolete, Invoking Gawk
11421
@section Undocumented Options and Features
11422
@cindex undocumented features
11424
This section intentionally left blank.
11426
@c Read The Source, Luke!
11429
@c If these came out in the Info file or TeX document, then they wouldn't
11430
@c be undocumented, would they?
11432
@code{gawk} has one undocumented option:
11437
Print the message @code{"awk: bailing out near line 1"} and dump core.
11438
This option was inspired by the common behavior of very early versions of
11439
Unix @code{awk}, and by a T-shirt.
11442
Early versions of @code{awk} used to not require any separator (either
11443
a newline or @samp{;}) between the rules in @code{awk} programs. Thus,
11444
it was common to see one-line programs like:
11447
awk '@{ sum += $1 @} END @{ print sum @}'
11450
@code{gawk} actually supports this, but it is purposely undocumented
11451
since it is considered bad style. The correct way to write such a program is to separate the rules with a semicolon or with a newline:
11455
awk '@{ sum += $1 @} ; END @{ print sum @}'
11462
awk '@{ sum += $1 @}
11463
END @{ print sum @}' data
11467
@xref{Statements/Lines, ,@code{awk} Statements Versus Lines}, for a fuller explanation.
11472
@node Known Bugs, , Undocumented, Invoking Gawk
11473
@section Known Bugs in @code{gawk}
11474
@cindex bugs, known in @code{gawk}
11479
The @samp{-F} option for changing the value of @code{FS}
11480
(@pxref{Options, ,Command Line Options})
11481
is not necessary given the command line variable
11482
assignment feature; it remains only for backwards compatibility.
11485
If your system actually has support for @file{/dev/fd} and the
11486
associated @file{/dev/stdin}, @file{/dev/stdout}, and
11487
@file{/dev/stderr} files, you may get different output from @code{gawk}
11488
than you would get on a system without those files. When @code{gawk}
11489
interprets these files internally, it synchronizes output to the
11490
standard output with output to @file{/dev/stdout}, while on a system
11491
with those files, the output is actually to different open files
11492
(@pxref{Special Files, ,Special File Names in @code{gawk}}).
11495
Syntactically invalid single character programs tend to overflow
11496
the parse stack, generating a rather unhelpful message. Such programs
11497
are surprisingly difficult to diagnose in the completely general case,
11498
and the effort to do so really is not worth it.
11501
The word ``GNU'' is incorrectly capitalized in at least one
11502
file in the source code.
11505
@node Library Functions, Sample Programs, Invoking Gawk, Top
11506
@chapter A Library of @code{awk} Functions
11508
@c 2e: USE TEXINFO-2 FUNCTION DEFINITION STUFF!!!!!!!!!!!!!
11509
This chapter presents a library of useful @code{awk} functions. The
11510
sample programs presented later
11511
(@pxref{Sample Programs, ,Practical @code{awk} Programs})
11512
use these functions.
11513
The functions are presented here in a progression from simple to complex.
11515
@ref{Extract Program, ,Extracting Programs from Texinfo Source Files},
11516
presents a program that you can use to extract the source code for
11517
these example library functions and programs from the Texinfo source
11518
for this @value{DOCUMENT}.
11519
(This has already been done as part of the @code{gawk} distribution.)
11521
If you have written one or more useful, general purpose @code{awk} functions,
11522
and would like to contribute them for a subsequent edition of this @value{DOCUMENT},
11523
please contact the author. @xref{Bugs, ,Reporting Problems and Bugs},
11524
for information on doing this. Don't just send code, as you will be
11525
required to either place your code in the public domain,
11526
publish it under the GPL (@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}),
11527
or assign the copyright in it to the Free Software Foundation.
11530
* Portability Notes:: What to do if you don't have @code{gawk}.
11531
* Nextfile Function:: Two implementations of a @code{nextfile} function.
11533
* Assert Function:: A function for assertions in @code{awk} programs.
11535
* Ordinal Functions:: Functions for using characters as numbers and vice versa.
11537
* Join Function:: A function to join an array into a string.
11538
* Mktime Function:: A function to turn a date into a timestamp.
11539
* Gettimeofday Function:: A function to get formatted times.
11540
* Filetrans Function:: A function for handling data file transitions.
11541
* Getopt Function:: A function for processing command line arguments.
11543
* Passwd Functions:: Functions for getting user information.
11544
* Group Functions:: Functions for getting group information.
11545
* Library Names:: How to best name private global variables in library functions.
11549
@node Portability Notes, Nextfile Function, Library Functions, Library Functions
11550
@section Simulating @code{gawk}-specific Features
11551
@cindex portability issues
11553
The programs in this chapter and in
11554
@ref{Sample Programs, ,Practical @code{awk} Programs},
11555
freely use features that are specific to @code{gawk}.
11556
This section briefly discusses how you can rewrite these programs for
11557
different implementations of @code{awk}.
11559
Diagnostic error messages are sent to @file{/dev/stderr}.
11560
Use @samp{| "cat 1>&2"} instead of @samp{> "/dev/stderr"}, if your system
11561
does not have a @file{/dev/stderr}, or if you cannot use @code{gawk}.
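
As a minimal sketch of the substitution just described, the following shell
function runs a small @code{awk} program that sends a diagnostic through
@samp{| "cat 1>&2"}. The name @code{warn_demo} and the message text are made
up for this illustration; the sketch assumes only a POSIX shell, an
@code{awk}, and the @code{cat} utility.

```shell
# warn_demo is a hypothetical name, not part of the gawk distribution.
warn_demo() {
    awk '
    BEGIN {
        # portable stand-in for:  print "..." > "/dev/stderr"
        print "warning: input looks odd" | "cat 1>&2"
        close("cat 1>&2")      # flush the diagnostic before going on
        print "normal output"
    }'
}
```

Running @code{warn_demo 2>/dev/null} should leave only the line written to
standard output, which is a quick way to confirm that the diagnostic really
went to standard error.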
11563
A number of programs use @code{nextfile}
11564
(@pxref{Nextfile Statement, ,The @code{nextfile} Statement}),
11565
to skip any remaining input in the input file.
11566
@ref{Nextfile Function, ,Implementing @code{nextfile} as a Function},
11567
shows you how to write a function that will do the same thing.
11569
Finally, some of the programs choose to ignore upper-case and lower-case
11570
distinctions in their input. They do this by assigning one to @code{IGNORECASE}.
11571
You can achieve the same effect by adding the following rule to the
11572
beginning of the program:
11576
@{ $0 = tolower($0) @}
11580
Also, verify that all regexp and string constants used in
11581
comparisons only use lower-case letters.
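
The following sketch shows the @code{tolower} rule in use. It counts records
that mention @samp{error} in any mix of case, without relying on
@code{IGNORECASE}. The function name @code{icase_demo} and the pattern are
assumptions chosen for the example.

```shell
# icase_demo is a hypothetical name used only for this illustration.
icase_demo() {
    awk '
    { $0 = tolower($0) }     # fold every record to lower case first
    /error/ { n++ }          # the regexp itself must be lower case
    END { print n + 0 }      # n + 0 prints 0 when nothing matched
    '
}
```

For example, @code{printf 'ERROR one\nok\nError two\n' | icase_demo} counts
two matching records. Note that the rule modifies @code{$0}, so later rules
see the lower-case text, not the original.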
11583
@node Nextfile Function, Assert Function, Portability Notes, Library Functions
11584
@section Implementing @code{nextfile} as a Function
11586
@cindex skipping input files
11587
@cindex input files, skipping
11588
The @code{nextfile} statement presented in
11589
@ref{Nextfile Statement, ,The @code{nextfile} Statement},
11590
is a @code{gawk}-specific extension. It is not available in other
11591
implementations of @code{awk}. This section shows two versions of a
11592
@code{nextfile} function that you can use to simulate @code{gawk}'s
11593
@code{nextfile} statement if you cannot use @code{gawk}.
11595
Here is a first attempt at writing a @code{nextfile} function.
11599
# nextfile --- skip remaining records in current file
11601
# this should be read in before the "main" awk program
11603
function nextfile() @{ _abandon_ = FILENAME; next @}
11605
_abandon_ == FILENAME @{ next @}
11609
This file should be included before the main program, because it supplies
11610
a rule that must be executed first. This rule compares the current data
11611
file's name (which is always in the @code{FILENAME} variable) to a private
11612
variable named @code{_abandon_}. If the file name matches, then the action
11613
part of the rule executes a @code{next} statement, to go on to the next
11614
record. (The use of @samp{_} in the variable name is a convention.
11615
It is discussed more fully in
11616
@ref{Library Names, , Naming Library Function Global Variables}.)
11618
The use of the @code{next} statement effectively creates a loop that reads
11619
all the records from the current data file.
11620
Eventually, the end of the file is reached, and
11621
a new data file is opened, changing the value of @code{FILENAME}.
11622
Once this happens, the comparison of @code{_abandon_} to @code{FILENAME}
11623
fails, and execution continues with the first rule of the ``real'' program.
11625
The @code{nextfile} function itself simply sets the value of @code{_abandon_}
11626
and then executes a @code{next} statement to start the loop
11627
going.@footnote{Some implementations of @code{awk} do not allow you to
11628
execute @code{next} from within a function body. Some other work-around
11629
will be necessary if you use such a version.}
11630
@c mawk is what we're talking about.
11632
This initial version has a subtle problem. What happens if the same data
11633
file is listed @emph{twice} on the command line, one right after the other,
11634
or even with just a variable assignment between the two occurrences of
the file name?
@c @findex nextfile
@c do it this way, since all the indices are merged
@cindex @code{nextfile} function
In such a case,
this code will skip right through the file a second time, even though
11642
it should stop when it gets to the end of the first occurrence.
11643
Here is a second version of @code{nextfile} that remedies this problem.
11647
@c file eg/lib/nextfile.awk
11648
# nextfile --- skip remaining records in current file
11649
# correctly handle successive occurrences of the same file
11650
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
11653
# this should be read in before the "main" awk program
11655
function nextfile() @{ _abandon_ = FILENAME; next @}
11657
_abandon_ == FILENAME @{
    if (FNR == 1)
        _abandon_ = ""
    else
        next
@}
11667
The @code{nextfile} function has not changed. It sets @code{_abandon_}
11668
equal to the current file name and then executes a @code{next} statement.
11669
The @code{next} statement reads the next record and increments @code{FNR},
11670
so @code{FNR} is guaranteed to have a value of at least two.
11671
However, if @code{nextfile} is called for the last record in the file,
11672
then @code{awk} will close the current data file and move on to the next
11673
one. Upon doing so, @code{FILENAME} will be set to the name of the new file,
11674
and @code{FNR} will be reset to one. If this next file is the same as
11675
the previous one, @code{_abandon_} will still be equal to @code{FILENAME}.
11676
However, @code{FNR} will be equal to one, telling us that this is a new
11677
occurrence of the file, and not the one we were reading when the
11678
@code{nextfile} function was executed. In that case, @code{_abandon_}
11679
is reset to the empty string, so that further executions of this rule
11680
will fail (until the next time that @code{nextfile} is called).
11682
If @code{FNR} is not one, then we are still in the original data file,
11683
and the program executes a @code{next} statement to skip through it.
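
The behavior described above can be tried out with a small sketch. It prints
only the first record of each file occurrence and abandons the rest. To stay
runnable on as many implementations as possible, it sets @code{_abandon_}
inline instead of calling a function: an @code{awk} that already has a
@code{nextfile} statement may not accept a function of that name, and some
versions reject @code{next} inside a function body. The shell function name
@code{first_record_demo} is made up for the example.

```shell
# first_record_demo is a hypothetical name used only for this illustration.
first_record_demo() {
    awk '
    # Skip rule: must come before the "main" rules.
    _abandon_ == FILENAME {
        if (FNR == 1)
            _abandon_ = ""   # same name, but a new occurrence of the file
        else
            next             # still inside the abandoned occurrence
    }

    # "Main" program: print one record, then abandon the rest of the file.
    { print FILENAME ": " $0; _abandon_ = FILENAME; next }
    ' "$@"
}
```

Naming the same data file twice on the command line should print its first
record twice, once per occurrence. Without the @code{FNR == 1} test, the
second occurrence would be skipped entirely.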
11685
An important question to ask at this point is: ``Given that the
11686
functionality of @code{nextfile} can be provided with a library file,
11687
why is it built into @code{gawk}?'' Adding
11688
features for little reason leads to larger, slower programs that are
11689
harder to maintain.
11691
The answer is that building @code{nextfile} into @code{gawk} provides
11692
significant gains in efficiency. If the @code{nextfile} function is executed
11693
at the beginning of a large data file, @code{awk} still has to scan the entire
11694
file, splitting it up into records, just to skip over it. The built-in
11695
@code{nextfile} can simply close the file immediately and proceed to the
11696
next one, saving a lot of time. This is particularly important in
11697
@code{awk}, since @code{awk} programs are generally I/O bound (i.e.@:
11698
they spend most of their time doing input and output, instead of performing computations).
11701
@node Assert Function, Ordinal Functions, Nextfile Function, Library Functions
11702
@section Assertions
11705
@cindex @code{assert}, C version
11706
When writing large programs, it is often useful to be able to know
11707
that a condition or set of conditions is true. Before proceeding with a
11708
particular computation, you make a statement about what you believe to be
11709
the case. Such a statement is known as an
11710
``assertion.'' The C language provides an @code{<assert.h>} header file
11711
and corresponding @code{assert} macro that the programmer can use to make
11712
assertions. If an assertion fails, the @code{assert} macro arranges to
11713
print a diagnostic message describing the condition that should have
11714
been true but was not, and then it kills the program. In C, using
11715
@code{assert} looks like this:
11718
#include <assert.h>
11720
int myfunc(int a, double b)
11722
assert(a <= 5 && b >= 17);
11727
If the assertion failed, the program would print a message similar to
11731
prog.c:5: assertion failed: a <= 5 && b >= 17
11735
The ANSI C language makes it possible to turn the condition into a string for use
11736
in printing the diagnostic message. This is not possible in @code{awk}, so
11737
this @code{assert} function also requires a string version of the condition
11738
that is being tested.
11742
@c file eg/lib/assert.awk
11743
# assert --- assert that a condition is true. Otherwise exit.
11744
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
11747
function assert(condition, string)
11749
if (! condition) @{
11750
printf("%s:%d: assertion failed: %s\n",
11751
FILENAME, FNR, string) > "/dev/stderr"
11765
The @code{assert} function tests the @code{condition} parameter. If it
11766
is false, it prints a message to standard error, using the @code{string}
11767
parameter to describe the failed condition. It then sets the variable
11768
@code{_assert_exit} to one, and executes the @code{exit} statement.
11769
The @code{exit} statement jumps to the @code{END} rule. If the @code{END}
11770
rule finds @code{_assert_exit} to be true, then it exits immediately.
11772
The purpose of the @code{END} rule with its test is to
11773
keep any other @code{END} rules from running. When an assertion fails, the
11774
program should exit immediately.
11775
If no assertions fail, then @code{_assert_exit} will still be
11776
false when the @code{END} rule is run normally, and the rest of the
11777
program's @code{END} rules will execute.
11778
For all of this to work correctly, @file{assert.awk} must be the
11779
first source file read by @code{awk}.
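
The interaction between @code{assert}, @code{_assert_exit}, and the two
@code{END} rules can be sketched as follows. This is an illustration of the
mechanism described above, not a copy of @file{assert.awk}: it writes the
diagnostic through @samp{| "cat 1>&2"} for portability, and the ``main''
program (a check that the first field is at most five) and the name
@code{assert_demo} are assumptions made for the example.

```shell
# assert_demo is a hypothetical name used only for this illustration.
assert_demo() {
    awk '
    function assert(condition, string) {
        if (! condition) {
            printf("%s:%d: assertion failed: %s\n",
                   FILENAME, FNR, string) | "cat 1>&2"
            _assert_exit = 1
            exit 1
        }
    }

    # Library END rule: comes first, so it can stop the later END rules.
    END { if (_assert_exit) exit 1 }

    # "Main" program.
    { assert($1 <= 5, "$1 <= 5") }
    END { print "all records ok" }
    '
}
```

With input such as @code{printf '3\n9\n' | assert_demo}, the second record
fails the assertion, the exit status is one, and the final @code{END} rule
never prints its message.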
11781
You would use this function in your programs this way:
11784
function myfunc(a, b)
11786
assert(a <= 5 && b >= 17, "a <= 5 && b >= 17")
11792
If the assertion failed, you would see a message like this:
11795
mydata:1357: assertion failed: a <= 5 && b >= 17
11798
There is a problem with this version of @code{assert} that it may not
11799
be possible to work around. An @code{END} rule is automatically added
11800
to the program calling @code{assert}. Normally, if a program consists
11801
of just a @code{BEGIN} rule, the input files and/or standard input are
11802
not read. However, now that the program has an @code{END} rule, @code{awk}
11803
will attempt to read the input data files, or standard input
11804
(@pxref{Using BEGIN/END, , Startup and Cleanup Actions}),
11805
most likely causing the program to hang, waiting for input.
11807
@cindex backslash continuation
11808
Just a note on programming style. You may have noticed that the @code{END}
11809
rule uses backslash continuation, with the open brace on a line by
11810
itself. This is so that it more closely resembles the way functions
11811
are written. Many of the examples
11813
in this chapter and the next one
11815
use this style. You can decide for yourself if you like writing
11816
your @code{BEGIN} and @code{END} rules this way, or not.
11819
@node Ordinal Functions, Join Function, Assert Function, Library Functions
11820
@section Translating Between Characters and Numbers
11822
@cindex numeric character values
11823
@cindex values of characters as numbers
11824
One commercial implementation of @code{awk} supplies a built-in function,
11825
@code{ord}, which takes a character and returns the numeric value for that
11826
character in the machine's character set. If the string passed to
11827
@code{ord} has more than one character, only the first one is used.
11829
The inverse of this function is @code{chr} (from the function of the same
11830
name in Pascal), which takes a number and returns the corresponding character.
11832
Both functions can be written very nicely in @code{awk}; there is no real
11833
reason to build them into the @code{awk} interpreter.
11839
@c file eg/lib/ord.awk
11840
# ord.awk --- do ord and chr
11842
# Global identifiers:
11843
# _ord_: numerical values indexed by characters
11844
# _ord_init: function to initialize _ord_
11847
# arnold@@gnu.ai.mit.edu
11850
# 20 July, 1992, revised
11852
BEGIN @{ _ord_init() @}
11857
@c file eg/lib/ord.awk
11858
function _ord_init( low, high, i, t)
11860
low = sprintf("%c", 7) # BEL is ascii 7
11861
if (low == "\a") @{ # regular ascii
11864
@} else if (sprintf("%c", 128 + 7) == "\a") @{
11865
# ascii, mark parity
11868
@} else @{ # ebcdic(!)
11873
for (i = low; i <= high; i++) @{
11874
t = sprintf("%c", i)
11882
@cindex character sets
11883
@cindex character encodings
11886
@cindex mark parity
11887
Some explanation of the numbers used by @code{chr} is worthwhile.
11888
The most prominent character set in use today is ASCII. Although an
11889
eight-bit byte can hold 256 distinct values (from zero to 255), ASCII only
11890
defines characters that use the values from zero to 127.@footnote{ASCII
11891
has been extended in many countries to use the values from 128 to 255
11892
for country-specific characters. If your system uses these extensions,
11893
you can simplify @code{_ord_init} to simply loop from zero to 255.}
11894
At least one computer manufacturer that we know of
11896
uses ASCII, but with mark parity, meaning that the leftmost bit in the byte
11897
is always one. What this means is that on those systems, characters
11898
have numeric values from 128 to 255.
11899
Finally, large mainframe systems use the EBCDIC character set, which
11900
uses all 256 values.
11901
While there are other character sets in use on some older systems,
11902
they are not really worth worrying about.
11906
@c file eg/lib/ord.awk
11907
function ord(str, c)
11909
# only first character is of interest
11910
c = substr(str, 1, 1)
11917
@c file eg/lib/ord.awk
11920
# force c to be numeric by adding 0
11921
return sprintf("%c", c + 0)
11927
@c file eg/lib/ord.awk
11928
#### test code ####
11932
# printf("enter a character: ")
11933
# if (getline var <= 0)
11935
# printf("ord(%s) = %d\n", var, ord(var))
11942
An obvious improvement to these functions would be to move the code for the
11943
@code{@w{_ord_init}} function into the body of the @code{BEGIN} rule. It was
11944
written this way initially for ease of development.
11946
There is a ``test program'' in a @code{BEGIN} rule, for testing the
11947
function. It is commented out for production use.
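
As a short usage sketch, the following program builds a lookup table for the
plain ASCII range only (values 1 through 127, an assumption that sidesteps
the mark parity and EBCDIC cases handled by @code{_ord_init}) and then calls
simplified versions of @code{ord} and @code{chr}. The shell function name
@code{ord_demo} is made up for the example.

```shell
# ord_demo is a hypothetical name used only for this illustration.
ord_demo() {
    awk '
    BEGIN {
        # Assumes plain ASCII: map each character to its numeric value.
        for (i = 1; i <= 127; i++)
            _ord_[sprintf("%c", i)] = i

        print ord("A"), ord("abc"), chr(97)
    }

    # Only the first character of str is of interest.
    function ord(str) { return _ord_[substr(str, 1, 1)] }

    # Adding 0 forces c to be numeric, so %c yields a character.
    function chr(c)   { return sprintf("%c", c + 0) }
    '
}
```

On an ASCII system this prints @samp{65 97 a}: the value of @samp{A}, the
value of the first character of @samp{abc}, and the character whose value
is 97.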
11949
@node Join Function, Mktime Function, Ordinal Functions, Library Functions
11950
@section Merging an Array Into a String
11952
@cindex merging strings
11953
When doing string processing, it is often useful to be able to join
11954
all the strings in an array into one long string. The following function,
11955
@code{join}, accomplishes this task. It is used later in several of
11956
the application programs
11957
(@pxref{Sample Programs, ,Practical @code{awk} Programs}).
11959
Good function design is important; this function needs to be general, but it
11960
should also have a reasonable default behavior. It is called with an array
11961
and the beginning and ending indices of the elements in the array to be
11962
merged. This assumes that the array indices are numeric---a reasonable
11963
assumption since the array was likely created with @code{split}
11964
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
11969
@c file eg/lib/join.awk
11970
# join.awk --- join an array into a string
11971
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
11974
function join(array, start, end, sep, result, i)
11978
else if (sep == SUBSEP) # magic value
11980
result = array[start]
11981
for (i = start + 1; i <= end; i++)
11982
result = result sep array[i]
11989
An optional additional argument is the separator to use when joining the
11990
strings back together. If the caller supplies a non-empty value,
11991
@code{join} uses it. If it is not supplied, it will have a null
11992
value. In this case, @code{join} uses a single blank as a default
11993
separator for the strings. If the value is equal to @code{SUBSEP},
11994
then @code{join} joins the strings with no separator between them.
11995
@code{SUBSEP} serves as a ``magic'' value to indicate that there should
11996
be no separation between the component strings.
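
The three separator cases can be seen in a short sketch. The @code{join}
function below follows the description above (blank by default, the caller's
value if supplied, nothing at all for @code{SUBSEP}). The sample strings and
the shell function name @code{join_demo} are assumptions made for the
example.

```shell
# join_demo is a hypothetical name used only for this illustration.
join_demo() {
    awk '
    function join(array, start, end, sep,    result, i) {
        if (sep == "")
            sep = " "            # default: a single blank
        else if (sep == SUBSEP)  # magic value
            sep = ""             # no separator at all
        result = array[start]
        for (i = start + 1; i <= end; i++)
            result = result sep array[i]
        return result
    }

    BEGIN {
        n = split("a b c", parts, " ")
        print join(parts, 1, n)          # default separator
        print join(parts, 1, n, "-")     # caller-supplied separator
        print join(parts, 2, n, SUBSEP)  # elements 2..n, no separator
    }'
}
```

This prints @samp{a b c}, @samp{a-b-c}, and @samp{bc}, one per line.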
11998
It would be nice if @code{awk} had an assignment operator for concatenation.
11999
The lack of an explicit operator for concatenation makes string operations
12000
more difficult than they really need to be.
12002
@node Mktime Function, Gettimeofday Function, Join Function, Library Functions
12003
@section Turning Dates Into Timestamps
12005
The @code{systime} function built in to @code{gawk}
12006
returns the current time of day as
12007
a timestamp in ``seconds since the Epoch.'' This timestamp
12008
can be converted into a printable date of almost infinitely variable
12009
format using the built-in @code{strftime} function.
12010
(For more information on @code{systime} and @code{strftime},
12011
@pxref{Time Functions, ,Functions for Dealing with Time Stamps}.)
12013
@cindex converting dates to timestamps
12014
@cindex dates, converting to timestamps
12015
@cindex timestamps, converting from dates
12016
An interesting but difficult problem is to convert a readable representation
12017
of a date back into a timestamp. The ANSI C library provides a @code{mktime}
12018
function that does the basic job, converting a canonical representation of a
12019
date into a timestamp.
12021
It would appear at first glance that @code{gawk} would have to supply a
12022
@code{mktime} built-in function that was simply a ``hook'' to the C language
12023
version. In fact though, @code{mktime} can be implemented entirely in @code{awk}.
12026
Here is a version of @code{mktime} for @code{awk}. It takes a simple
12027
representation of the date and time, and converts it into a timestamp.
12029
The code is presented here intermixed with explanatory prose. In
12030
@ref{Extract Program, ,Extracting Programs from Texinfo Source Files},
12031
you will see how the Texinfo source file for this @value{DOCUMENT}
12032
can be processed to extract the code into a single source file.
12034
The program begins with a descriptive comment and a @code{BEGIN} rule
12035
that initializes a table @code{_tm_months}. This table is a two-dimensional
12036
array that has the lengths of the months. The first index is zero for
12037
regular years, and one for leap years. The values are the same for all the
12038
months in both kinds of years, except for February; thus the use of multiple assignments.
12043
@c file eg/lib/mktime.awk
12044
# mktime.awk --- convert a canonical date representation
12046
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
12051
# Initialize table of month lengths
12052
_tm_months[0,1] = _tm_months[1,1] = 31
12053
_tm_months[0,2] = 28; _tm_months[1,2] = 29
12054
_tm_months[0,3] = _tm_months[1,3] = 31
12055
_tm_months[0,4] = _tm_months[1,4] = 30
12056
_tm_months[0,5] = _tm_months[1,5] = 31
12057
_tm_months[0,6] = _tm_months[1,6] = 30
12058
_tm_months[0,7] = _tm_months[1,7] = 31
12059
_tm_months[0,8] = _tm_months[1,8] = 31
12060
_tm_months[0,9] = _tm_months[1,9] = 30
12061
_tm_months[0,10] = _tm_months[1,10] = 31
12062
_tm_months[0,11] = _tm_months[1,11] = 30
12063
_tm_months[0,12] = _tm_months[1,12] = 31
12069
The benefit of merging multiple @code{BEGIN} rules
12070
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns})
12071
is particularly clear when writing library files. Functions in library
12072
files can cleanly initialize their own private data and also provide clean-up
12073
actions in private @code{END} rules.
12075
The next function is a simple one that computes whether a given year is or
12076
is not a leap year. If a year is evenly divisible by four, but not evenly
12077
divisible by 100, or if it is evenly divisible by 400, then it is a leap
12078
year. Thus, 1904 was a leap year, 1900 was not, but 2000 will be.
12079
@c Change this after the year 2000 to ``2000 was'' (:-)
12084
@c file eg/lib/mktime.awk
12085
# decide if a year is a leap year
12086
function _tm_isleap(year, ret)
12088
ret = (year % 4 == 0 && year % 100 != 0) ||
12097
This function is only used a few times in this file, and its computation
12098
could have been written @dfn{in-line} (at the point where it's used).
12099
Making it a separate function made the original development easier, and also
12100
avoids the possibility of typing errors when duplicating the code in multiple places.
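
The leap year rule can be checked against the three years mentioned above
with a small sketch. The function body is written from the rule as stated
(divisible by four but not by 100, or divisible by 400); the shell function
name @code{isleap_demo} is made up for the example.

```shell
# isleap_demo is a hypothetical name used only for this illustration.
isleap_demo() {
    awk '
    # Returns 1 for a leap year, 0 otherwise.
    function _tm_isleap(year) {
        return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0)
    }

    BEGIN { print _tm_isleap(1904), _tm_isleap(1900), _tm_isleap(2000) }
    '
}
```

The output is @samp{1 0 1}, matching the statement that 1904 and 2000 are
leap years and 1900 is not.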
12103
The next function is more interesting. It does most of the work of
12104
generating a timestamp, which is converting a date and time into some number
12105
of seconds since the Epoch. The caller passes an array (rather
12106
imaginatively named @code{a}) containing six
12107
values: the year including century, the month as a number between one and 12,
12108
the day of the month, the hour as a number between zero and 23, the minute in
12109
the hour, and the seconds within the minute.
12111
The function uses several local variables to precompute the number of
12112
seconds in an hour, seconds in a day, and seconds in a year. Often,
12113
similar C code simply writes out the expression in-line, expecting the
12114
compiler to do @dfn{constant folding}. E.g., most C compilers would
12115
turn @samp{60 * 60} into @samp{3600} at compile time, instead of recomputing
12116
it every time at run time. Precomputing these values makes the
12117
function more efficient.
12122
@c file eg/lib/mktime.awk
12123
# convert a date into seconds
12124
function _tm_addup(a, total, yearsecs, daysecs,
12128
daysecs = 24 * hoursecs
12129
yearsecs = 365 * daysecs
12131
total = (a[1] - 1970) * yearsecs
12134
# extra day for leap years
12135
for (i = 1970; i < a[1]; i++)
12141
j = _tm_isleap(a[1])
12142
for (i = 1; i < a[2]; i++)
12143
total += _tm_months[j, i] * daysecs
12146
total += (a[3] - 1) * daysecs
12147
total += a[4] * hoursecs
12157
The function starts with a first approximation of all the seconds between
12158
Midnight, January 1, 1970,@footnote{This is the Epoch on POSIX systems.
12159
It may be different on other systems.} and the beginning of the current
12160
year. It then goes through all those years, and for every leap year,
12161
adds an additional day's worth of seconds.
12163
The variable @code{j} holds either one or zero, if the current year is or is not a leap year.
12165
For every month in the current year prior to the current month, it adds
12166
the number of seconds in the month, using the appropriate entry in the
12167
@code{_tm_months} array.
12169
Finally, it adds in the seconds for the number of days prior to the current
12170
day, and the number of hours, minutes, and seconds in the current day.
12172
The result is a count of seconds since January 1, 1970. This value is not
12173
yet what is needed though. The reason why is described shortly.
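
To make the arithmetic concrete, here is a self-contained sketch of the same
steps applied to the date @samp{1993 5 23 18 23 12}. It is written from the
description above, not copied from @file{mktime.awk}; the names
@code{isleap}, @code{addup}, @code{months}, and @code{addup_demo} are
assumptions made for the example, and the month table is built with
@code{split} merely to keep the sketch short.

```shell
# addup_demo is a hypothetical name used only for this illustration.
addup_demo() {
    awk '
    function isleap(y) {
        return (y % 4 == 0 && y % 100 != 0) || (y % 400 == 0)
    }

    # Count seconds from the start of 1970 to the date in a[1..6],
    # treating the date as UTC (no time zone compensation here).
    function addup(a,    total, hoursecs, daysecs, yearsecs, i, j) {
        hoursecs = 60 * 60
        daysecs = 24 * hoursecs
        yearsecs = 365 * daysecs

        total = (a[1] - 1970) * yearsecs
        for (i = 1970; i < a[1]; i++)    # extra day for each leap year
            if (isleap(i))
                total += daysecs

        j = isleap(a[1])
        for (i = 1; i < a[2]; i++)       # whole months before this one
            total += months[j, i] * daysecs

        total += (a[3] - 1) * daysecs    # whole days before this one
        total += a[4] * hoursecs
        total += a[5] * 60
        total += a[6]
        return total
    }

    BEGIN {
        split("31 28 31 30 31 30 31 31 30 31 30 31", len, " ")
        for (i = 1; i <= 12; i++)
            months[0, i] = months[1, i] = len[i]
        months[1, 2] = 29

        split("1993 5 23 18 23 12", d, " ")
        printf "%d\n", addup(d)
    }'
}
```

Working through the terms by hand gives 23 years (725328000 seconds), six
leap days (518400), 120 days for January through April (10368000), 22 days
(1900800), 18 hours (64800), 23 minutes (1380), and 12 seconds, for a total
of 738181392.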
12175
The main @code{mktime} function takes a single character string argument.
12176
This string is a representation of a date and time in a ``canonical''
12177
(fixed) form. This string should be
12178
@code{"@var{year} @var{month} @var{day} @var{hour} @var{minute} @var{second}"}.
12183
@c file eg/lib/mktime.awk
12184
# mktime --- convert a date into seconds,
12185
# compensate for time zone
12187
function mktime(str, res1, res2, a, b, i, j, t, diff)
12189
i = split(str, a, " ") # don't rely on FS
12201
a[2] < 1 || a[2] > 12 ||
12202
a[3] < 1 || a[3] > 31 ||
12203
a[4] < 0 || a[4] > 23 ||
12204
a[5] < 0 || a[5] > 59 ||
12205
a[6] < 0 || a[6] > 61 )
12209
res1 = _tm_addup(a)
12210
t = strftime("%Y %m %d %H %M %S", res1)
12213
printf("(%s) -> (%s)\n", str, t) > "/dev/stderr"
12216
res2 = _tm_addup(b)
12221
printf("diff = %d seconds\n", diff) > "/dev/stderr"
12231
The function first splits the string into an array, using spaces and tabs as
12232
separators. If there are not six elements in the array, it returns an
12233
error, signaled as the value @minus{}1.
12234
Next, it forces each element of the array to be numeric, by adding zero to it.
12235
The following @samp{if} statement then makes sure that each element is
12236
within an allowable range. (This checking could be extended further, e.g.,
12237
to make sure that the day of the month is within the correct range for the
12238
particular month supplied.) All of this is essentially preliminary set-up
12239
and error checking.
12241
Recall that @code{_tm_addup} generated a value in seconds since Midnight,
12242
January 1, 1970. This value is not directly usable as the result we want,
12243
@emph{since the calculation does not account for the local timezone}. In other
12244
words, the value represents the count in seconds since the Epoch, but only
12245
for UTC (Coordinated Universal Time). If the local timezone is east or west
12246
of UTC, then some number of hours should be either added to, or subtracted from
12247
the resulting timestamp.
12249
For example, 6:23 p.m. in Atlanta, Georgia (USA), is normally five hours west
12250
of (behind) UTC. It is only four hours behind UTC if daylight savings time is in effect.
12252
If you are calling @code{mktime} in Atlanta, with the argument
12253
@code{@w{"1993 5 23 18 23 12"}}, the result from @code{_tm_addup} will be
12254
for 6:23 p.m. UTC, which is only 2:23 p.m. in Atlanta. It is necessary to
12255
add another four hours' worth of seconds to the result.
12257
How can @code{mktime} determine how far away it is from UTC? This is
12258
surprisingly easy. The returned timestamp represents the time passed to
12259
@code{mktime} @emph{as UTC}. This timestamp can be fed back to
12260
@code{strftime}, which will format it as a @emph{local} time; i.e.@: as
12261
if it already had the UTC difference added in to it. This is done by
12262
giving @code{@w{"%Y %m %d %H %M %S"}} to @code{strftime} as the format
12263
argument. It returns the computed timestamp in the original string
12264
format. The result represents a time that accounts for the UTC
12265
difference. When the new time is converted back to a timestamp, the
12266
difference between the two timestamps is the difference (in seconds)
12267
between the local timezone and UTC. This difference is then added back
12268
to the original result. An example demonstrating this is presented below.
12270
Finally, there is a ``main'' program for testing the function.
12274
@c file eg/lib/mktime.awk
12277
printf "Enter date as yyyy mm dd hh mm ss: "
12278
getline _tm_test_date
12280
t = mktime(_tm_test_date)
12281
r = strftime("%Y %m %d %H %M %S", t)
12282
printf "Got back (%s)\n", r
12289
The entire program uses two variables that can be set on the command
12290
line to control debugging output and to enable the test in the final
12291
@code{BEGIN} rule. Here is the result of a test run. (Note that debugging
12292
output is to standard error, and test output is to standard output.)
12296
$ gawk -f mktime.awk -v _tm_test=1 -v _tm_debug=1
12297
@print{} Enter date as yyyy mm dd hh mm ss: 1993 5 23 15 35 10
12298
@error{} (1993 5 23 15 35 10) -> (1993 05 23 11 35 10)
12299
@error{} diff = 14400 seconds
12300
@print{} Got back (1993 05 23 15 35 10)
12304
The time entered was 3:35 p.m. (15:35 on a 24-hour clock), on May 23, 1993.
12306
The first line of debugging output shows the resulting time as UTC---four hours ahead of
12307
the local time zone. The second line shows that the difference is 14400
12308
seconds, which is four hours. (The difference is only four hours, since
12309
daylight savings time is in effect during May.)
12310
The final line of test output shows that the timezone compensation
12311
algorithm works; the returned time is the same as the entered time.
12313
This program does not solve the general problem of turning an arbitrary date
12314
representation into a timestamp. That problem is very involved. However,
12315
the @code{mktime} function provides a foundation upon which to build. Other
12316
software can convert month names into numeric months, and AM/PM times into
12317
24-hour clocks, to generate the ``canonical'' format that @code{mktime} requires.
12320
@node Gettimeofday Function, Filetrans Function, Mktime Function, Library Functions
12321
@section Managing the Time of Day
12323
@cindex formatted timestamps
12324
@cindex timestamps, formatted
12325
The @code{systime} and @code{strftime} functions described in
12326
@ref{Time Functions, ,Functions for Dealing with Time Stamps},
12327
provide the minimum functionality necessary for dealing with the time of day
12328
in human readable form. While @code{strftime} is extensive, the control
12329
formats are not necessarily easy to remember or intuitively obvious when reading a program.
12332
The following function, @code{gettimeofday}, populates a user-supplied array
12333
with pre-formatted time information. It returns a string with the current
12334
time formatted in the same way as the @code{date} utility.
12336
@findex gettimeofday
12339
@c file eg/lib/gettime.awk
12340
# gettimeofday --- get the time of day in a usable format
12341
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain, May 1993
12343
# Returns a string in the format of output of date(1)
12344
# Populates the array argument time with individual values:
12345
# time["second"] -- seconds (0 - 59)
12346
# time["minute"] -- minutes (0 - 59)
12347
# time["hour"] -- hours (0 - 23)
12348
# time["althour"] -- hours (1 - 12)
12349
# time["monthday"] -- day of month (1 - 31)
12350
# time["month"] -- month of year (1 - 12)
12351
# time["monthname"] -- name of the month
12352
# time["shortmonth"] -- short name of the month
12353
# time["year"] -- year within century (0 - 99)
12354
# time["fullyear"] -- year with century (19xx or 20xx)
12355
# time["weekday"] -- day of week (Sunday = 0)
12356
# time["altweekday"] -- day of week (Monday = 1)
12357
# time["weeknum"] -- week number, Sunday first day
12358
# time["altweeknum"] -- week number, Monday first day
12359
# time["dayname"] -- name of weekday
12360
# time["shortdayname"] -- short name of weekday
12361
# time["yearday"] -- day of year (1 - 366)
12362
# time["timezone"] -- abbreviation of timezone name
12363
# time["ampm"] -- AM or PM designation
12366
function gettimeofday(time, ret, now, i)
12368
# get time once, avoids unnecessary system calls
12371
# return date(1)-style output
12372
ret = strftime("%a %b %d %H:%M:%S %Z %Y", now)
12374
# clear out target array
12380
# fill in values, force numeric values to be
12381
# numeric by adding 0
12382
time["second"] = strftime("%S", now) + 0
12383
time["minute"] = strftime("%M", now) + 0
12384
time["hour"] = strftime("%H", now) + 0
12385
time["althour"] = strftime("%I", now) + 0
12386
time["monthday"] = strftime("%d", now) + 0
12387
time["month"] = strftime("%m", now) + 0
12388
time["monthname"] = strftime("%B", now)
12389
time["shortmonth"] = strftime("%b", now)
12390
time["year"] = strftime("%y", now) + 0
12391
time["fullyear"] = strftime("%Y", now) + 0
12392
time["weekday"] = strftime("%w", now) + 0
12393
time["altweekday"] = strftime("%u", now) + 0
12394
time["dayname"] = strftime("%A", now)
12395
time["shortdayname"] = strftime("%a", now)
12396
time["yearday"] = strftime("%j", now) + 0
12397
time["timezone"] = strftime("%Z", now)
12398
time["ampm"] = strftime("%p", now)
12399
time["weeknum"] = strftime("%U", now) + 0
12400
time["altweeknum"] = strftime("%W", now) + 0
12408
The string indices are easier to use and read than the various formats
12409
required by @code{strftime}. The @code{alarm} program presented in
12410
@ref{Alarm Program, ,An Alarm Clock Program},
12411
uses this function.
12414
The @code{gettimeofday} function is presented above as it was written. A
12415
more general design for this function would have allowed the user to supply
12416
an optional timestamp value that would have been used instead of the current time.
12419
@node Filetrans Function, Getopt Function, Gettimeofday Function, Library Functions
@section Noting Data File Boundaries

@cindex per file initialization and clean-up
The @code{BEGIN} and @code{END} rules are each executed exactly once, at
the beginning and end respectively of your @code{awk} program
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}).
We (the @code{gawk} authors) once had a user who mistakenly thought that the
@code{BEGIN} rule was executed at the beginning of each data file and the
@code{END} rule was executed at the end of each data file. When informed
that this was not the case, the user requested that we add new special
patterns to @code{gawk}, named @code{BEGIN_FILE} and @code{END_FILE}, that
would have the desired behavior. He even supplied us the code to do so.

However, after a little thought, I came up with the following library program.
It arranges to call two user-supplied functions, @code{beginfile} and
@code{endfile}, at the beginning and end of each data file.
Besides solving the problem in only nine(!) lines of code, it does so
@emph{portably}; this will work with any implementation of @code{awk}.

# Give the user a hook for filename transitions

# The user must supply functions beginfile() and endfile()
# that each take the name of the file being started or
# finished, respectively.

# Arnold Robbins, arnold@@gnu.ai.mit.edu, January 1992

FILENAME != _oldfilename \
@{
    if (_oldfilename != "")
        endfile(_oldfilename)
    _oldfilename = FILENAME
    beginfile(FILENAME)
@}

END   @{ endfile(FILENAME) @}

This file must be loaded before the user's ``main'' program, so that the
rule it supplies will be executed first.

This rule relies on @code{awk}'s @code{FILENAME} variable that
automatically changes for each new data file. The current file name is
saved in a private variable, @code{_oldfilename}. If @code{FILENAME} does
not equal @code{_oldfilename}, then a new data file is being processed, and
it is necessary to call @code{endfile} for the old file. Since
@code{endfile} should only be called if a file has been processed, the
program first checks to make sure that @code{_oldfilename} is not the null
string. The program then assigns the current file name to
@code{_oldfilename}, and calls @code{beginfile} for the file.
Since, like all @code{awk} variables, @code{_oldfilename} will be
initialized to the null string, this rule executes correctly even for the

The program also supplies an @code{END} rule, to do the final processing for
the last file. Since this @code{END} rule comes before any @code{END} rules
supplied in the ``main'' program, @code{endfile} will be called first. Once
again the value of multiple @code{BEGIN} and @code{END} rules should be clear.

This version has the same problem as the first version of @code{nextfile}
(@pxref{Nextfile Function, ,Implementing @code{nextfile} as a Function}).
If the same data file occurs twice in a row on the command line, then
@code{endfile} and @code{beginfile} will not be executed at the end of the
first pass and at the beginning of the second pass.
This version solves the problem.

@c file eg/lib/ftrans.awk
# ftrans.awk --- handle data file transitions

# user supplies beginfile() and endfile() functions

# Arnold Robbins, arnold@@gnu.ai.mit.edu. November 1992

FNR == 1 @{
    if (_filename_ != "")
        endfile(_filename_)
    _filename_ = FILENAME
    beginfile(FILENAME)
@}

END   @{ endfile(_filename_) @}

In @ref{Wc Program, ,Counting Things},
you will see how this library function can be used, and
how it simplifies writing the main program.
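
As a rough illustration of the two hooks in use, the following sketch wraps
a small @code{awk} program in a shell function that reports the number of
lines in each data file. The names @code{count_per_file} and @code{nlines}
are hypothetical. The transition rule is written inline instead of being
loaded from a library file, and it is assumed to fire on the first record of
each file.

```shell
# Hypothetical demo: per-file line counts built on beginfile()/endfile().
count_per_file() {
    awk '
    # Transition rule, assumed to run on the first record of every file.
    FNR == 1 {
        if (_filename_ != "")
            endfile(_filename_)
        _filename_ = FILENAME
        beginfile(FILENAME)
    }
    END { endfile(_filename_) }

    # User-supplied hooks: reset the counter, then report it.
    function beginfile(name) { nlines = 0 }
    function endfile(name)   { print name ": " nlines }

    # Main program: count every record of the current file.
    { nlines++ }
    ' "$@"
}
```

Calling @code{count_per_file} with the same file name twice in a row should
report that file twice, since the sketch tests the record number within the
file and not the file name. An empty data file produces no records, so this
sketch never reports it.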

@node Getopt Function, Passwd Functions, Filetrans Function, Library Functions
@section Processing Command Line Options

@cindex @code{getopt}, C version
@cindex processing arguments
@cindex argument processing
Most utilities on POSIX-compatible systems take options or ``switches'' on
the command line that can be used to change the way a program behaves.
@code{awk} is an example of such a program
(@pxref{Options, ,Command Line Options}).
Often, options take @dfn{arguments}, data that the program needs to
correctly obey the command line option. For example, @code{awk}'s
@samp{-F} option requires a string to use as the field separator.
The first occurrence on the command line of either @samp{--} or a
string that does not begin with @samp{-} ends the options.

Most Unix systems provide a C function named @code{getopt} for processing
command line arguments. The programmer provides a string describing the one
letter options. If an option requires an argument, it is followed in the
string by a colon. @code{getopt} is also passed the
count and values of the command line arguments, and is called in a loop.
@code{getopt} processes the command line arguments for option letters.
Each time around the loop, it returns a single character representing the
next option letter that it found, or @samp{?} if it found an invalid option.
When it returns @minus{}1, there are no options left on the command line.

When using @code{getopt}, options that do not take arguments can be
grouped together. Furthermore, options that take arguments require that the
argument be present. The argument can immediately follow the option letter,
or it can be a separate command line argument.

Given a hypothetical program that takes
three command line options, @samp{-a}, @samp{-b}, and @samp{-c}, and
@samp{-b} requires an argument, all of the following are valid ways of
invoking the program:

prog -a -b foo -c data1 data2 data3
prog -ac -bfoo -- data1 data2 data3
prog -acbfoo data1 data2 data3

Notice that when the argument is grouped with its option, the rest of
the command line argument is considered to be the option's argument.
In the above example, @samp{-acbfoo} indicates that all of the
@samp{-a}, @samp{-b}, and @samp{-c} options were supplied,
and that @samp{foo} is the argument to the @samp{-b} option.
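
The grouping rule can be sketched in a few lines of @code{awk}. The shell
function @code{explain_group} below is hypothetical and deliberately
simplified: it looks at one grouped argument only, so it does not handle an
option argument that is supplied as a separate command line argument.

```shell
# Hypothetical demo: explain_group OPTSTRING ARG
# Walks one grouped argument such as -acbfoo, one character at a time.
explain_group() {
    awk -v opts="$1" -v arg="$2" 'BEGIN {
        # Position 1 holds the leading "-", so start at position 2.
        for (i = 2; i <= length(arg); i++) {
            c = substr(arg, i, 1)
            p = index(opts, c)
            if (p == 0) {
                print c ": invalid"
                continue
            }
            if (substr(opts, p + 1, 1) == ":") {
                # The rest of the string is the argument to this option.
                print c ": argument <" substr(arg, i + 1) ">"
                break
            }
            print c ": flag"
        }
    }'
}
```

For instance, @code{explain_group "ab:c" "-acbfoo"} should report @samp{a}
and @samp{c} as plain flags and @samp{foo} as the argument to @samp{b}.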

@code{getopt} provides four external variables that the programmer can use.

@table @code
@item optind
The index in the argument value array (@code{argv}) where the first
non-option command line argument can be found.

@item optarg
The string value of the argument to an option.

@item opterr
Usually @code{getopt} prints an error message when it finds an invalid
option. Setting @code{opterr} to zero disables this feature. (An
application might wish to print its own error message.)

@item optopt
The letter representing the command line option.
While not usually documented, most versions supply this variable.
@end table

The following C fragment shows how @code{getopt} might process command line
arguments for @code{awk}.

main(int argc, char *argv[])
    /* print our own message */
    while ((c = getopt(argc, argv, "v:f:F:W:")) != -1) @{
        case 'f':    /* file */
        case 'F':    /* field separator */
        case 'v':    /* variable assignment */
        case 'W':    /* extension */

As a side point, @code{gawk} actually uses the GNU @code{getopt_long}
function to process both normal and GNU-style long options
(@pxref{Options, ,Command Line Options}).

The abstraction provided by @code{getopt} is very useful, and would be quite
handy in @code{awk} programs as well. Here is an @code{awk} version of
@code{getopt}. This function highlights one of the greatest weaknesses in
@code{awk}, which is that it is very poor at manipulating single characters.
Repeated calls to @code{substr} are necessary for accessing individual
characters (@pxref{String Functions, ,Built-in Functions for String Manipulation}).

The discussion walks through the code a bit at a time.

@c file eg/lib/getopt.awk
# getopt --- do C library getopt(3) function in awk

# arnold@@gnu.ai.mit.edu

# Initial version: March, 1991
# Revised: May, 1993

# External variables:
#    Optind -- index of ARGV for first non-option argument
#    Optarg -- string value of argument to current option
#    Opterr -- if non-zero, print our own diagnostic
#    Optopt -- current option letter

#    -1     at end of options
#    ?      for unrecognized option
#    <c>    a character representing the current option

#    _opti  index in multi-flag option, e.g., -abc

The function starts out with some documentation: who wrote the code,
and when it was revised, followed by a list of the global variables it uses,
what the return values are and what they mean, and any global variables that
are ``private'' to this library function. Such documentation is essential
for any program, and particularly for library functions.

@c file eg/lib/getopt.awk
function getopt(argc, argv, options,    optl, thisopt, i)
    optl = length(options)
    if (optl == 0)        # no options given
    if (argv[Optind] == "--") @{  # all done
    @} else if (argv[Optind] !~ /^-[^: \t\n\f\r\v\b]/) @{

The function first checks that it was indeed called with a string of options
(the @code{options} parameter). If @code{options} has a zero length,
@code{getopt} immediately returns @minus{}1.

The next thing to check for is the end of the options. A @samp{--} ends the
command line options, as does any command line argument that does not begin
with a @samp{-}. @code{Optind} is used to step through the array of command
line arguments; it retains its value across calls to @code{getopt}, since it
is a global variable.

The regexp used, @code{@w{/^-[^: \t\n\f\r\v\b]/}}, is
perhaps a bit of overkill; it checks for a @samp{-} followed by anything
that is not whitespace and not a colon.
If the current command line argument does not match this pattern,
it is not an option, and it ends option processing.

@c file eg/lib/getopt.awk
    thisopt = substr(argv[Optind], _opti, 1)
    i = index(options, thisopt)
        printf("%c -- invalid option\n",
                    thisopt) > "/dev/stderr"
        if (_opti >= length(argv[Optind])) @{

The @code{_opti} variable tracks the position in the current command line
argument (@code{argv[Optind]}). In the case that multiple options were
grouped together with one @samp{-} (e.g., @samp{-abx}), it is necessary
to return them to the user one at a time.

If @code{_opti} is equal to zero, it is set to two, the index in the string
of the next character to look at (we skip the @samp{-}, which is at position
one). The variable @code{thisopt} holds the character, obtained with
@code{substr}. It is saved in @code{Optopt} for the main program to use.

If @code{thisopt} is not in the @code{options} string, then it is an
invalid option. If @code{Opterr} is non-zero, @code{getopt} prints an error
message on the standard error that is similar to the message from the C
version of @code{getopt}.

Since the option is invalid, it is necessary to skip it and move on to the
next option character. If @code{_opti} is greater than or equal to the
length of the current command line argument, then it is necessary to move on
to the next one, so @code{Optind} is incremented and @code{_opti} is reset
to zero. Otherwise, @code{Optind} is left alone and @code{_opti} is merely

In any case, since the option was invalid, @code{getopt} returns @samp{?}.
The main program can examine @code{Optopt} if it needs to know what the
invalid option letter actually was.

@c file eg/lib/getopt.awk
    if (substr(options, i + 1, 1) == ":") @{
        # get option argument
        if (length(substr(argv[Optind], _opti + 1)) > 0)
            Optarg = substr(argv[Optind], _opti + 1)
            Optarg = argv[++Optind]

If the option requires an argument, the option letter is followed by a colon
in the @code{options} string. If there are remaining characters in the
current command line argument (@code{argv[Optind]}), then the rest of that
string is assigned to @code{Optarg}. Otherwise, the next command line
argument is used (@samp{-xFOO} vs. @samp{@w{-x FOO}}). In either case,
@code{_opti} is reset to zero, since there are no more characters left to
examine in the current command line argument.

@c file eg/lib/getopt.awk
    if (_opti == 0 || _opti >= length(argv[Optind])) @{

Finally, if @code{_opti} is either zero or greater than or equal to the
length of the current command line argument, it means this element in
@code{argv} is through being processed, so @code{Optind} is incremented to
point to the next element in @code{argv}. If neither condition is true, then
only @code{_opti} is incremented, so that the next option letter can be
processed on the next call to @code{getopt}.

@c file eg/lib/getopt.awk
    Opterr = 1    # default is to diagnose
    Optind = 1    # skip ARGV[0]

    if (_getopt_test) @{
        while ((_go_c = getopt(ARGC, ARGV, "ab:cd")) != -1)
            printf("c = <%c>, optarg = <%s>\n",
        printf("non-option arguments:\n")
        for (; Optind < ARGC; Optind++)
            printf("\tARGV[%d] = <%s>\n",
                                    Optind, ARGV[Optind])

The @code{BEGIN} rule initializes both @code{Opterr} and @code{Optind} to one.
@code{Opterr} is set to one, since the default behavior is for @code{getopt}
to print a diagnostic message upon seeing an invalid option. @code{Optind}
is set to one, since there's no reason to look at the program name, which is

The rest of the @code{BEGIN} rule is a simple test program. Here is the
result of two sample runs of the test program.

$ awk -f getopt.awk -v _getopt_test=1 -- -a -cbARG bax -x
@print{} c = <a>, optarg = <>
@print{} c = <c>, optarg = <>
@print{} c = <b>, optarg = <ARG>
@print{} non-option arguments:
@print{}         ARGV[3] = <bax>
@print{}         ARGV[4] = <-x>

$ awk -f getopt.awk -v _getopt_test=1 -- -a -x -- xyz abc
@print{} c = <a>, optarg = <>
@error{} x -- invalid option
@print{} c = <?>, optarg = <>
@print{} non-option arguments:
@print{}         ARGV[4] = <xyz>
@print{}         ARGV[5] = <abc>

The first @samp{--} terminates the arguments to @code{awk}, so that it does
not try to interpret the @samp{-a} etc. as its own options.

Several of the sample programs presented in
@ref{Sample Programs, ,Practical @code{awk} Programs},
use @code{getopt} to process their arguments.

@node Passwd Functions, Group Functions, Getopt Function, Library Functions
@section Reading the User Database

@cindex @file{/dev/user}
The @file{/dev/user} special file
(@pxref{Special Files, ,Special File Names in @code{gawk}})
provides access to the current user's real and effective user and group id
numbers, and if available, the user's supplementary group set.
However, since these are numbers, they do not provide very useful
information to the average user. There needs to be some way to find the
user information associated with the user and group numbers. This
section presents a suite of functions for retrieving information from the
user database. @xref{Group Functions, ,Reading the Group Database},
for a similar suite that retrieves information from the group database.

@cindex @code{getpwent}, C version
@cindex user information
@cindex login information
@cindex account information
@cindex password file
The POSIX standard does not define the file where user information is
kept. Instead, it provides the @code{<pwd.h>} header file
and several C language subroutines for obtaining user information.
The primary function is @code{getpwent}, for ``get password entry.''
The ``password'' comes from the original user database file,
@file{/etc/passwd}, which kept user information, along with the
encrypted passwords (hence the name).

While an @code{awk} program could simply read @file{/etc/passwd} directly
(the format is well known), because of the way password
files are handled on networked systems,
this file may not contain complete information about the system's set of users.

@cindex @code{pwcat} program
To be sure of being
able to produce a readable, complete version of the user database, it is
necessary to write a small C program that calls @code{getpwent}.
@code{getpwent} is defined to return a pointer to a @code{struct passwd}.
Each time it is called, it returns the next entry in the database.
When there are no more entries, it returns @code{NULL}, the null pointer.
When this happens, the C program should call @code{endpwent} to close the

Here is @code{pwcat}, a C program that ``cats'' the password database.

@c file eg/lib/pwcat.c
 * Generate a printable version of the password database
 * arnold@@gnu.ai.mit.edu
    while ((p = getpwent()) != NULL)
        printf("%s:%s:%d:%d:%s:%s:%s\n",
            p->pw_name, p->pw_passwd, p->pw_uid,
            p->pw_gid, p->pw_gecos, p->pw_dir, p->pw_shell);

If you don't understand C, don't worry about it.
The output from @code{pwcat} is the user database, in the traditional
@file{/etc/passwd} format of colon-separated fields. The fields are:

The user's login name.

@item Encrypted password
The user's encrypted password. This may not be available on some systems.

The user's numeric user-id number.

The user's numeric group-id number.

The user's full name, and perhaps other information associated with the

@item Home directory
The user's login, or ``home'' directory (familiar to shell programmers as

The program that will be run when the user logs in. This is usually a
shell, such as Bash (the GNU Bourne-Again Shell).

Here are a few lines representative of @code{pwcat}'s output.

@print{} root:3Ov02d5VaUPB6:0:1:Operator:/:/bin/sh
@print{} nobody:*:65534:65534::/:
@print{} daemon:*:1:1::/:
@print{} sys:*:2:2::/:/bin/csh
@print{} bin:*:3:3::/bin:
@print{} arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh
@print{} miriam:yxaay:112:10:Miriam Robbins:/home/miriam:/bin/sh

With that introduction, here is a group of functions for getting user
information. There are several functions here, corresponding to the C
functions of the same name.

@c file eg/lib/passwdawk.in
# passwd.awk --- access password file information
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain

    # tailor this to suit your system
    _pw_awklib = "/usr/local/libexec/awk/"

function _pw_init(    oldfs, oldrs, olddol0, pwcat)
    pwcat = _pw_awklib "pwcat"
    while ((pwcat | getline) > 0) @{
        _pw_byname[$1] = $0
        _pw_bycount[++_pw_total] = $0

The @code{BEGIN} rule sets a private variable to the directory where
@code{pwcat} is stored. Since it is used to help out an @code{awk} library
routine, we have chosen to put it in @file{/usr/local/libexec/awk}.
You might want it to be in a different directory on your system.

The function @code{_pw_init} keeps three copies of the user information
in three associative arrays. The arrays are indexed by user name
(@code{_pw_byname}), by user-id number (@code{_pw_byuid}), and by order of
occurrence (@code{_pw_bycount}).

The variable @code{_pw_inited} is used for efficiency; @code{_pw_init} only
needs to be called once.

Since this function uses @code{getline} to read information from
@code{pwcat}, it first saves the values of @code{FS}, @code{RS}, and
@code{$0}. Doing so is necessary, since these functions could be called
from anywhere within a user's program, and the user may have his or her
own values for @code{FS} and @code{RS}.
(One open problem: this does not account for a user program that has
@code{FIELDWIDTHS} in use.)

The main part of the function uses a loop to read database lines, split
the line into fields, and then store the line into each array as necessary.
When the loop is done, @code{@w{_pw_init}} cleans up by closing the pipeline,
setting @code{@w{_pw_inited}} to one, and restoring @code{FS}, @code{RS}, and
@code{$0}. The use of @code{@w{_pw_count}} will be explained below.
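
The save-and-restore idiom can be tried in isolation. In the sketch below,
the @code{load} function is a hypothetical stand-in for @code{_pw_init}, and
a pair of @code{echo} commands stands in for @code{pwcat}. The function reads
colon-separated lines while the main program is in the middle of processing
a whitespace-separated record.

```shell
# Hypothetical demo of saving and restoring FS, RS, and $0 around getline.
demo_save_restore() {
    awk '
    function load(cmd,    oldfs, oldrs, olddol0) {
        oldfs = FS; oldrs = RS; olddol0 = $0   # save the caller state
        FS = ":"; RS = "\n"
        while ((cmd | getline) > 0)            # each read replaces $0
            names[++total] = $1
        close(cmd)
        FS = oldfs; RS = oldrs                 # restore separators first,
        $0 = olddol0                           # so $0 is re-split with them
    }

    # Call the library-style function from the middle of the main program.
    NR == 1 { load("echo root:x:0; echo bin:x:1") }

    # The second field of the current record must still be intact.
    { print $2, total }
    '
}
```

Feeding the line @samp{alpha beta} to @code{demo_save_restore} should print
@samp{beta 2}. Note that the order of the restoring assignments matters:
assigning to @code{$0} splits the record again using the value of @code{FS}
in effect at that moment.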

@c file eg/lib/passwdawk.in
function getpwnam(name)
    if (name in _pw_byname)
        return _pw_byname[name]

The @code{getpwnam} function takes a user name as a string argument. If that
user is in the database, it returns the appropriate line. Otherwise it
returns the null string.
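
Since the value returned is a whole database line, the caller usually has to
pull it apart with @code{split}. Here is a minimal sketch. The shell function
@code{pw_field} is hypothetical, and it reads the record from its standard
input instead of calling @code{getpwnam}, so that it can be run on its own.

```shell
# Hypothetical demo: pw_field N, with one passwd-style record on stdin.
pw_field() {
    awk -v n="$1" '{
        # Break the colon-separated record into the array f.
        nf = split($0, f, ":")
        if (n >= 1 && n <= nf)
            print f[n]
    }'
}
```

With the sample line for @samp{arnold} shown earlier as input,
@code{pw_field 5} should print the full name field and @code{pw_field 3} the
numeric user-id.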

@c file eg/lib/passwdawk.in
function getpwuid(uid)
    if (uid in _pw_byuid)
        return _pw_byuid[uid]

The @code{getpwuid} function takes a user-id number argument. If that
user number is in the database, it returns the appropriate line. Otherwise it
returns the null string.

@c file eg/lib/passwdawk.in
function getpwent()
    if (_pw_count < _pw_total)
        return _pw_bycount[++_pw_count]

The @code{getpwent} function simply steps through the database, one entry at
a time. It uses @code{_pw_count} to track its current position in the
@code{_pw_bycount} array.

@c file eg/lib/passwdawk.in
function endpwent()

The @code{@w{endpwent}} function resets @code{@w{_pw_count}} to zero, so that
subsequent calls to @code{getpwent} will start over again.
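
The interplay between the stepping function and the reset function can be
seen in a stripped-down sketch. All of the names here (@code{nextent},
@code{endent}, @code{_ents}, @code{_cur}, and @code{_total}) are
hypothetical, and the ``database'' is just two strings stored by hand.

```shell
# Hypothetical demo of a cursor over an array with a reset function.
demo_cursor() {
    awk '
    function nextent() {
        if (_cur < _total)
            return _ents[++_cur]   # advance, then return that entry
        return ""                  # past the end: null string
    }
    function endent() { _cur = 0 } # rewind to the beginning

    BEGIN {
        _ents[++_total] = "root"
        _ents[++_total] = "daemon"
        while ((e = nextent()) != "")
            print e
        endent()
        print nextent()            # starts over with the first entry
    }'
}
```

Running @code{demo_cursor} should print both entries in order, followed by
the first entry again after the reset.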

A conscious design decision in this suite is that each subroutine calls
@code{@w{_pw_init}} to initialize the database arrays. The overhead of running
a separate process to generate the user database, and the I/O to scan it,
will only be incurred if the user's main program actually calls one of these
functions. If this library file is loaded along with a user's program, but
none of the routines are ever called, then there is no extra run-time overhead.
(The alternative would be to move the body of @code{@w{_pw_init}} into a
@code{BEGIN} rule, which would always run @code{pwcat}. This simplifies the
code but runs an extra process that may never be needed.)

In turn, calling @code{_pw_init} is not too expensive, since the
@code{_pw_inited} variable keeps the program from reading the data more than
once. If you are worried about squeezing every last cycle out of your
@code{awk} program, the check of @code{_pw_inited} could be moved out of
@code{_pw_init} and duplicated in all the other functions. In practice,
this is not necessary, since most @code{awk} programs are I/O bound, and it
would clutter up the code.
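
The effect of the guard variable can be demonstrated with a small sketch.
The names @code{_demo_init}, @code{_demo_inited}, @code{_demo_loads},
@code{_demo_table}, and @code{lookup} are hypothetical; the counter
@code{_demo_loads} merely stands in for the expensive step of running
@code{pwcat}.

```shell
# Hypothetical demo: lazy, one-time initialization behind a guard variable.
demo_lazy_init() {
    awk '
    function _demo_init() {
        if (_demo_inited)          # already loaded: nothing to do
            return
        _demo_loads++              # stands in for running pwcat
        _demo_table["a"] = "alpha"
        _demo_inited = 1
    }
    function lookup(key) {
        _demo_init()               # every public function calls this
        if (key in _demo_table)
            return _demo_table[key]
        return ""
    }
    BEGIN {
        print lookup("a")
        print lookup("a")
        print _demo_loads          # the load ran only once
    }'
}
```

Both lookups should succeed, and the final line of output should show that
the initialization body ran a single time.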

The @code{id} program in @ref{Id Program, ,Printing Out User Information},
uses these functions.

@node Group Functions, Library Names, Passwd Functions, Library Functions
@section Reading the Group Database

@cindex @code{getgrent}, C version
@cindex group information
@cindex account information

Much of the discussion presented in
@ref{Passwd Functions, ,Reading the User Database},
applies to the group database as well. Although there has traditionally
been a well known file, @file{/etc/group}, in a well known format, the POSIX
standard only provides a set of C library routines
(@code{<grp.h>} and @code{getgrent})
for accessing the information.
Even though this file may exist, it likely does not have
complete information. Therefore, as with the user database, it is necessary
to have a small C program that generates the group database as its output.

@cindex @code{grcat} program
Here is @code{grcat}, a C program that ``cats'' the group database.

@c file eg/lib/grcat.c
 * Generate a printable version of the group database
 * Arnold Robbins, arnold@@gnu.ai.mit.edu
    while ((g = getgrent()) != NULL) @{
        printf("%s:%s:%d:", g->gr_name, g->gr_passwd,
        for (i = 0; g->gr_mem[i] != NULL; i++) @{
            printf("%s", g->gr_mem[i]);
            if (g->gr_mem[i+1] != NULL)

Each line in the group database represents one group. The fields are
separated with colons, and represent the following information.

The name of the group.

@item Group Password
The encrypted group password. In practice, this field is never used. It is
usually empty, or set to @samp{*}.

@item Group ID Number
The numeric group-id number. This number should be unique within the file.

@item Group Member List
A comma-separated list of user names. These users are members of the group.
Most Unix systems allow users to be members of several groups
simultaneously. If your system does, then reading @file{/dev/user} will
return those group-id numbers in @code{$5} through @code{$NF}.
(Note that @file{/dev/user} is a @code{gawk} extension;
@pxref{Special Files, ,Special File Names in @code{gawk}}.)

Here is what running @code{grcat} might produce:

@print{} wheel:*:0:arnold
@print{} nogroup:*:65534:
@print{} daemon:*:1:
@print{} staff:*:10:arnold,miriam,andy
@print{} other:*:20:

Here are the functions for obtaining information from the group database.
There are several, modeled after the C library functions of the same names.

@c file eg/lib/groupawk.in
# group.awk --- functions for dealing with the group file
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain

    # Change to suit your system
    _gr_awklib = "/usr/local/libexec/awk/"

@c file eg/lib/groupawk.in
function _gr_init(    oldfs, oldrs, olddol0, grcat, n, a, i)
    grcat = _gr_awklib "grcat"
    while ((grcat | getline) > 0) @{
        if ($1 in _gr_byname)
            _gr_byname[$1] = _gr_byname[$1] "," $4
            _gr_byname[$1] = $0
        if ($3 in _gr_bygid)
            _gr_bygid[$3] = _gr_bygid[$3] "," $4

        n = split($4, a, "[ \t]*,[ \t]*")
        for (i = 1; i <= n; i++)
            if (a[i] in _gr_groupsbyuser)
                _gr_groupsbyuser[a[i]] = \
                    _gr_groupsbyuser[a[i]] " " $1
                _gr_groupsbyuser[a[i]] = $1

        _gr_bycount[++_gr_count] = $0

The @code{BEGIN} rule sets a private variable to the directory where
@code{grcat} is stored. Since it is used to help out an @code{awk} library
routine, we have chosen to put it in @file{/usr/local/libexec/awk}. You might
want it to be in a different directory on your system.

These routines follow the same general outline as the user database routines
(@pxref{Passwd Functions, ,Reading the User Database}).
The @code{@w{_gr_inited}} variable is used to
ensure that the database is scanned no more than once.
The @code{@w{_gr_init}} function first saves @code{FS}, @code{RS}, and
@code{$0}, and then sets @code{FS} and @code{RS} to the correct values for
scanning the group information.

The group information is stored in several associative arrays.
The arrays are indexed by group name (@code{@w{_gr_byname}}), by group-id number
(@code{@w{_gr_bygid}}), and by position in the database (@code{@w{_gr_bycount}}).
There is an additional array indexed by user name (@code{@w{_gr_groupsbyuser}}),
each element of which is a space-separated list of the groups that the user
belongs to.

Unlike the user database, it is possible to have multiple records in the
database for the same group. This is common when a group has a large number
of members. Such a pair of entries might look like:

tvpeople:*:101:johny,jay,arsenio
tvpeople:*:101:david,conan,tom,joan

For this reason, @code{_gr_init} looks to see if a group name or
group-id number has already been seen. If it has, then the user names are
simply concatenated onto the previous list of users. (There is actually a
subtle problem with the code presented above. Suppose that
the first time there were no names. This code adds the names with
a leading comma. It also doesn't check that there is a @code{$4}.)
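
One possible way around both problems is sketched below. This is not the
library's code: the shell function @code{merge_groups} and the arrays
@code{members} and @code{order} are hypothetical, and the sketch keeps only
the member list for each group instead of the whole record.

```shell
# Hypothetical demo: merge repeated group records without a stray comma.
# Reads group-file style lines on stdin, prints "group:member,member,...".
merge_groups() {
    awk -F: '
    {
        if ($1 in members) {
            # Append only when this record actually lists members.
            if ($4 != "") {
                if (members[$1] == "")
                    members[$1] = $4
                else
                    members[$1] = members[$1] "," $4
            }
        } else {
            members[$1] = $4       # null when the field is missing
            order[++n] = $1        # remember first-seen order
        }
    }
    END {
        for (i = 1; i <= n; i++)
            print order[i] ":" members[order[i]]
    }'
}
```

A first record with an empty member list followed by a second record with
names should yield a list that does not begin with a comma.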

Finally, @code{_gr_init} closes the pipeline to @code{grcat}, restores
@code{FS}, @code{RS}, and @code{$0}, initializes @code{_gr_count} to zero
(it is used later), and makes @code{_gr_inited} non-zero.

@c file eg/lib/groupawk.in
function getgrnam(group)
    if (group in _gr_byname)
        return _gr_byname[group]

The @code{getgrnam} function takes a group name as its argument, and if that
group exists, it is returned. Otherwise, @code{getgrnam} returns the null

@c file eg/lib/groupawk.in
function getgrgid(gid)
    if (gid in _gr_bygid)
        return _gr_bygid[gid]

The @code{getgrgid} function is similar; it takes a numeric group-id, and
looks up the information associated with that group-id.

@c file eg/lib/groupawk.in
function getgruser(user)
    if (user in _gr_groupsbyuser)
        return _gr_groupsbyuser[user]

The @code{getgruser} function does not have a C counterpart. It takes a
user name, and returns the list of groups that have the user as a member.
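
The same answer can be computed directly from group-file style input, which
may help in seeing what the @code{_gr_groupsbyuser} array holds. The shell
function @code{groups_of} below is a hypothetical sketch that scans its
standard input instead of calling the library.

```shell
# Hypothetical demo: groups_of USER, with group-file style lines on stdin.
groups_of() {
    awk -F: -v user="$1" '
    {
        # The fourth field is a comma-separated list of member names.
        n = split($4, a, ",")
        for (i = 1; i <= n; i++)
            if (a[i] == user) {
                if (list == "")
                    list = $1
                else
                    list = list " " $1
            }
    }
    END { print list }'
}
```

Given the sample @code{grcat} output shown earlier, @code{groups_of arnold}
should print @samp{wheel staff}.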

@c file eg/lib/groupawk.in
function getgrent()
    if (++_gr_count in _gr_bycount)
        return _gr_bycount[_gr_count]

The @code{getgrent} function steps through the database one entry at a time.
It uses @code{_gr_count} to track its position in the list.

@c file eg/lib/groupawk.in
function endgrent()

@code{endgrent} resets @code{_gr_count} to zero so that @code{getgrent} can

As with the user database routines, each function calls @code{_gr_init} to
initialize the arrays. Doing so only incurs the extra overhead of running
@code{grcat} if these functions are used (as opposed to moving the body of
@code{_gr_init} into a @code{BEGIN} rule).

Most of the work is in scanning the database and building the various
associative arrays. The functions that the user calls are themselves very
simple, relying on @code{awk}'s associative arrays to do work.

The @code{id} program in @ref{Id Program, ,Printing Out User Information},
uses these functions.
13479
@node Library Names, , Group Functions, Library Functions
13480
@section Naming Library Function Global Variables
13482
@cindex namespace issues in @code{awk}
13483
@cindex documenting @code{awk} programs
13484
@cindex programs, documenting
13485
Due to the way the @code{awk} language evolved, variables are either
13486
@dfn{global} (usable by the entire program), or @dfn{local} (usable just by
13487
a specific function). There is no intermediate state analogous to
13488
@code{static} variables in C.
13490
Library functions often need to have global variables that they can use to
13491
preserve state information between calls to the function. For example,
13492
@code{getopt}'s variable @code{_opti}
13493
(@pxref{Getopt Function, ,Processing Command Line Options}),
13494
and the @code{_tm_months} array used by @code{mktime}
13495
(@pxref{Mktime Function, ,Turning Dates Into Timestamps}).
13496
Such variables are called @dfn{private}, since the only functions that need to
13497
use them are the ones in the library.
13499
When writing a library function, you should try to choose names for your
13500
private variables so that they will not conflict with any variables used by
13501
either another library function or a user's main program. For example, a
13502
name like @samp{i} or @samp{j} is not a good choice, since user programs
13503
often use variable names like these for their own purposes.
13505
The example programs shown in this chapter all start the names of their
13506
private variables with an underscore (@samp{_}). Users generally don't use
13507
leading underscores in their variable names, so this convention immediately
13508
decreases the chances that the variable name will be accidentally shared
13509
with the user's program.
13511
In addition, several of the library functions use a prefix that helps
13512
indicate what function or set of functions uses the variables. For example,
13513
@code{_tm_months} in @code{mktime}
13514
(@pxref{Mktime Function, ,Turning Dates Into Timestamps}), and
13515
@code{_pw_byname} in the user data base routines
13516
(@pxref{Passwd Functions, ,Reading the User Database}).
13517
This convention is recommended, since it even further decreases the chance
13518
of inadvertent conflict among variable names.
13519
Note that this convention can be used equally well both for variable names
13520
and for private function names too.
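As a minimal sketch of the prefix convention, the following shell function runs a small @code{awk} program in which a library function keeps private state between calls. The names @code{nextid} and @code{_nid_last} are hypothetical and are not part of the library in this chapter.

```shell
# Sketch only: nextid() and _nid_last are illustrative names.
nextid_demo() {
    awk '
    # _nid_last is private to nextid(): the leading underscore and the
    # "nid" prefix keep it away from names a main program would pick.
    function nextid() {
        return ++_nid_last
    }
    BEGIN {
        i = 10                # an ordinary user variable
        a = nextid()
        b = nextid()
        print a, b, i         # the library state never touches i
    }'
}
nextid_demo
```

The state survives from one call to the next because @code{_nid_last} is global, while the prefix makes an accidental clash with the user's program unlikely.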
13522
While I could have re-written all the library routines to use this
13523
convention, I did not do so, in order to show how my own @code{awk}
13524
programming style has evolved, and to provide some basis for this
13527
As a final note on variable naming, if a function makes global variables
available for use by a main program, it is a good convention to start those
variables' names with a capital letter, as is done for
@code{getopt}'s @code{Opterr} and @code{Optind} variables
(@pxref{Getopt Function, ,Processing Command Line Options}).
13532
The leading capital letter indicates that it is global, while the fact that
13533
the variable name is not all capital letters indicates that the variable is
13534
not one of @code{awk}'s built-in variables, like @code{FS}.
13536
It is also important that @emph{all} variables in library functions
13537
that do not need to save state are in fact declared local. If this is
13538
not done, the variable could accidentally be used in the user's program,
13539
leading to bugs that are very difficult to track down.
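The following sketch shows one way such a bug can appear. It is an illustration under assumed names (@code{sum_bad} and @code{sum_good} are hypothetical): the first function forgets to declare its loop counter @code{i} as a local, so it silently changes the caller's @code{i}.

```shell
# Sketch only: sum_bad() and sum_good() are illustrative names.
leak_demo() {
    awk '
    # i is NOT in the parameter list, so it is a global variable.
    function sum_bad(n,    total) {
        total = 0
        for (i = 1; i <= n; i++)
            total += i
        return total
    }
    # i is declared as an extra parameter, so it is local.
    function sum_good(n,    total, i) {
        total = 0
        for (i = 1; i <= n; i++)
            total += i
        return total
    }
    BEGIN {
        for (i = 1; i <= 3; i++) {    # the caller also uses i
            sum_bad(2)                # leaves the global i set to 3
            bad++
        }
        for (i = 1; i <= 3; i++) {
            sum_good(2)               # the caller i is untouched
            good++
        }
        print bad, good               # loop passes actually executed
    }'
}
leak_demo
```

With the leaky function the caller's loop body runs only once instead of three times, because @code{i} comes back from the call already past the loop limit.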
13542
function lib_func(x, y, l1, l2)
13545
@var{use variable} some_var # some_var could be local
13546
@dots{} # but is not by oversight
13551
A different convention, common in the Tcl community, is to use a single
13552
associative array to hold the values needed by the library function(s), or
13553
``package.'' This significantly decreases the number of actual global names
13554
in use. For example, the functions described in
13555
@ref{Passwd Functions, , Reading the User Database},
13556
might have used @code{@w{PW_data["inited"]}}, @code{@w{PW_data["awklib"]}},
@code{@w{PW_data["total"]}}, and @code{@w{PW_data["count"]}}, instead of
@code{@w{_pw_inited}}, @code{@w{_pw_awklib}}, @code{@w{_pw_total}},
and @code{@w{_pw_count}}.
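A minimal sketch of the single-array convention follows. The ``package'' here is a hypothetical stack (the names @code{push}, @code{pop}, and @code{Stack_data} are assumptions made for the example); all of its state lives in one global array.

```shell
# Sketch only: push(), pop() and Stack_data are illustrative names.
pkgarray_demo() {
    awk '
    # All package state is kept in the single global array Stack_data:
    # Stack_data["count"] holds the depth, Stack_data[1..count] the items.
    function push(val,    n) {
        n = ++Stack_data["count"]
        Stack_data[n] = val
    }
    function pop(    n, val) {
        n = Stack_data["count"]
        if (n < 1)
            return ""
        val = Stack_data[n]
        delete Stack_data[n]
        Stack_data["count"] = n - 1
        return val
    }
    BEGIN {
        push("a"); push("b")
        x = pop(); y = pop()
        print x, y, Stack_data["count"]
    }'
}
pkgarray_demo
```

Only one global name, @code{Stack_data}, is visible to the user's program, no matter how many values the package needs to remember.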
13561
The conventions presented in this section are exactly that, conventions. You
13562
are not required to write your programs this way; we merely recommend that
13565
@node Sample Programs, Language History, Library Functions, Top
13566
@chapter Practical @code{awk} Programs
13568
This chapter presents a potpourri of @code{awk} programs for your reading
13571
There are two sections. The first presents @code{awk}
13572
versions of several common POSIX utilities.
13573
The second is a grab-bag of interesting programs.
13576
Many of these programs use the library functions presented in
13577
@ref{Library Functions, ,A Library of @code{awk} Functions}.
13580
* Clones:: Clones of common utilities.
13581
* Miscellaneous Programs:: Some interesting @code{awk} programs.
13584
@node Clones, Miscellaneous Programs, Sample Programs, Sample Programs
13585
@section Re-inventing Wheels for Fun and Profit
13587
This section presents a number of POSIX utilities that are implemented in
13588
@code{awk}. Re-inventing these programs in @code{awk} is often enjoyable,
13589
since the algorithms can be very clearly expressed, and usually the code is
13590
very concise and simple. This is true because @code{awk} does so much for you.
13592
It should be noted that these programs are not necessarily intended to
13593
replace the installed versions on your system. Instead, their
13594
purpose is to illustrate @code{awk} language programming for ``real world''
13597
The programs are presented in alphabetical order.
13600
* Cut Program:: The @code{cut} utility.
13601
* Egrep Program:: The @code{egrep} utility.
13602
* Id Program:: The @code{id} utility.
13603
* Split Program:: The @code{split} utility.
13604
* Tee Program:: The @code{tee} utility.
13605
* Uniq Program:: The @code{uniq} utility.
13606
* Wc Program:: The @code{wc} utility.
13609
@node Cut Program, Egrep Program, Clones, Clones
13610
@subsection Cutting Out Fields and Columns
13612
@cindex @code{cut} utility
13613
The @code{cut} utility selects, or ``cuts,'' either characters or fields
from its standard
input and sends them to its standard output. @code{cut} can cut out either
13616
a list of characters, or a list of fields. By default, fields are separated
13617
by tabs, but you may supply a command line option to change the field
13618
@dfn{delimiter}, i.e.@: the field separator character. @code{cut}'s definition
13619
of fields is less general than @code{awk}'s.
13621
A common use of @code{cut} might be to pull out just the login name of
13622
logged-on users from the output of @code{who}. For example, the following
13623
pipeline generates a sorted, unique list of the logged-on users:
13626
who | cut -c1-8 | sort | uniq
13629
The options for @code{cut} are:
13632
@item -c @var{list}
13633
Use @var{list} as the list of characters to cut out. Items within the list
13634
may be separated by commas, and ranges of characters can be separated with
13635
dashes. The list @samp{1-8,15,22-35} specifies characters one through
13636
eight, 15, and 22 through 35.
13638
@item -f @var{list}
13639
Use @var{list} as the list of fields to cut out.
13641
@item -d @var{delim}
13642
Use @var{delim} as the field separator character instead of the tab
13646
Suppress printing of lines that do not contain the field delimiter.
13649
The @code{awk} implementation of @code{cut} uses the @code{getopt} library
13650
function (@pxref{Getopt Function, ,Processing Command Line Options}),
13651
and the @code{join} library function
13652
(@pxref{Join Function, ,Merging an Array Into a String}).
13654
The program begins with a comment describing the options and a @code{usage}
13655
function which prints out a usage message and exits. @code{usage} is called
13656
if invalid arguments are supplied.
13661
@c file eg/prog/cut.awk
13662
# cut.awk --- implement cut in awk
13663
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
13667
# -f list Cut fields
13668
# -d c Field delimiter character
13669
# -c list Cut characters
13671
# -s Suppress lines without the delimiter character
13673
function usage( e1, e2)
13675
e1 = "usage: cut [-f list] [-d c] [-s] [files...]"
13676
e2 = "usage: cut [-c list] [files...]"
13677
print e1 > "/dev/stderr"
13678
print e2 > "/dev/stderr"
13686
The variables @code{e1} and @code{e2} are used so that the function
13695
Next comes a @code{BEGIN} rule that parses the command line options.
13696
It sets @code{FS} to a single tab character, since that is @code{cut}'s
13697
default field separator. The output field separator is also set to be the
13698
same as the input field separator. Then @code{getopt} is used to step
13699
through the command line options. One or the other of the variables
13700
@code{by_fields} or @code{by_chars} is set to true, to indicate that
13701
processing should be done by fields or by characters respectively.
13702
When cutting by characters, the output field separator is set to the null
13707
@c file eg/prog/cut.awk
13710
FS = "\t" # default
13712
while ((c = getopt(ARGC, ARGV, "sf:c:d:")) != -1) @{
13716
@} else if (c == "c") @{
13720
@} else if (c == "d") @{
13721
if (length(Optarg) > 1) @{
13722
printf("Using first character of %s" \
13723
" for delimiter\n", Optarg) > "/dev/stderr"
13724
Optarg = substr(Optarg, 1, 1)
13728
if (FS == " ") # defeat awk semantics
13730
@} else if (c == "s")
13736
for (i = 1; i < Optind; i++)
13742
Special care is taken when the field delimiter is a space. Using
13743
@code{@w{" "}} (a single space) for the value of @code{FS} is
13744
incorrect---@code{awk} would
13745
separate fields with runs of spaces and/or tabs, and we want them to be
13746
separated with individual spaces. Also, note that after @code{getopt} is
13747
through, we have to clear out all the elements of @code{ARGV} from one up to,
but not including, @code{Optind}, so that @code{awk} will not try to process the command line
13749
options as file names.
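The difference between the two kinds of splitting can be seen with a short experiment. This is a sketch, not part of @file{cut.awk}; the helper name @code{count_fields} is hypothetical, and the bracket expression @code{"[ ]"} is one common way to make each individual space a separator.

```shell
# Count the fields in a line that contains two consecutive spaces.
count_fields() {    # $1 is the value to assign to FS
    printf 'a  b\n' | awk -v FS="$1" '{ print NF }'
}
count_fields ' '      # a single space: runs of blanks act as one separator
count_fields '[ ]'    # a regexp: every individual space separates fields
```

The first call reports two fields and the second reports three, since the regexp form yields an empty field between the two spaces.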
13751
After dealing with the command line options, the program verifies that the
13752
options make sense. Only one or the other of @samp{-c} and @samp{-f} should
13753
be used, and both require a field list. Then either @code{set_fieldlist} or
13754
@code{set_charlist} is called to pull apart the list of fields or
13759
@c file eg/prog/cut.awk
13760
if (by_fields && by_chars)
13763
if (by_fields == 0 && by_chars == 0)
13764
by_fields = 1 # default
13766
if (fieldlist == "") @{
13767
print "cut: needs list for -c or -f" > "/dev/stderr"
13781
Here is @code{set_fieldlist}. It first splits the field list apart
13782
at the commas, into an array. Then, for each element of the array, it
13783
looks to see if it is actually a range, and if so splits it apart. The range
13784
is verified to make sure the first number is smaller than the second.
13785
Each number in the list is added to the @code{flist} array, which simply
13786
lists the fields that will be printed.
13787
Normal field splitting is used; the program lets @code{awk}
handle the job of doing the field splitting.
13793
@c file eg/prog/cut.awk
13794
function set_fieldlist( n, m, i, j, k, f, g)
13796
n = split(fieldlist, f, ",")
13797
j = 1 # index in flist
13798
for (i = 1; i <= n; i++) @{
13799
if (index(f[i], "-") != 0) @{ # a range
13800
m = split(f[i], g, "-")
13801
if (m != 2 || g[1] >= g[2]) @{
13802
printf("bad field list: %s\n",
13803
f[i]) > "/dev/stderr"
13806
for (k = g[1]; k <= g[2]; k++)
13817
The @code{set_charlist} function is more complicated than @code{set_fieldlist}.
13818
The idea here is to use @code{gawk}'s @code{FIELDWIDTHS} variable
13819
(@pxref{Constant Size, ,Reading Fixed-width Data}),
13820
which describes constant width input. When using a character list, that is
13821
exactly what we have.
13823
Setting up @code{FIELDWIDTHS} is more complicated than simply listing the
13824
fields that need to be printed. We have to keep track of the fields to be
13825
printed, and also the intervening characters that have to be skipped.
13826
For example, suppose you wanted characters one through eight, 15, and
13827
22 through 35. You would use @samp{-c 1-8,15,22-35}. The necessary value
13828
for @code{FIELDWIDTHS} would be @code{@w{"8 6 1 6 14"}}. This gives us five
13829
fields, and what should be printed are @code{$1}, @code{$3}, and @code{$5}.
13830
The intermediate fields are ``filler,'' stuff in between the desired data.
13832
@code{flist} lists the fields to be printed, and @code{t} tracks the
13833
complete field list, including filler fields.
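The arithmetic can be checked with the following stand-alone sketch. It is a simplified illustration rather than the function from @file{cut.awk}: the helper name @code{widths_demo} and its variable names are assumptions, it does no error checking, and it only builds the width string, so it runs in any POSIX @code{awk}.

```shell
# Sketch only: build a FIELDWIDTHS-style string from a character list.
widths_demo() {    # $1 is a character list such as 1-8,15,22-35
    awk -v list="$1" '
    BEGIN {
        n = split(list, f, ",")
        last = 0      # last character position accounted for so far
        out = ""      # the width string being built
        keep = ""     # numbers of the fields that hold wanted data
        field = 0     # total fields so far, filler included
        for (i = 1; i <= n; i++) {
            if (split(f[i], g, "-") == 2) {    # a range
                lo = g[1] + 0
                hi = g[2] + 0
            } else                             # a single position
                lo = hi = f[i] + 0
            if (lo - last - 1 > 0) {           # filler before this item
                out = out (out == "" ? "" : " ") (lo - last - 1)
                field++
            }
            out = out (out == "" ? "" : " ") (hi - lo + 1)
            field++
            keep = keep (keep == "" ? "" : " ") field
            last = hi
        }
        print out
        print keep
    }'
}
widths_demo 1-8,15,22-35
```

For the list @samp{1-8,15,22-35} the first output line is the width string and the second lists fields one, three, and five, matching the description above.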
13837
@c file eg/prog/cut.awk
13838
function set_charlist( field, i, j, f, g, t,
13841
field = 1 # count total fields
13842
n = split(fieldlist, f, ",")
13843
j = 1 # index in flist
13844
for (i = 1; i <= n; i++) @{
13845
if (index(f[i], "-") != 0) @{ # range
13846
m = split(f[i], g, "-")
13847
if (m != 2 || g[1] >= g[2]) @{
13848
printf("bad character list: %s\n",
13849
f[i]) > "/dev/stderr"
13852
len = g[2] - g[1] + 1
13853
if (g[1] > 1) # compute length of filler
13854
filler = g[1] - last - 1
13858
t[field++] = filler
13859
t[field++] = len # length of field
13861
flist[j++] = field - 1
13864
filler = f[i] - last - 1
13868
t[field++] = filler
13871
flist[j++] = field - 1
13875
FIELDWIDTHS = join(t, 1, field - 1)
13882
Here is the rule that actually processes the data. If the @samp{-s} option
13883
was given, then @code{suppress} will be true. The first @code{if} statement
13884
makes sure that the input record does have the field separator. If
13885
@code{cut} is processing fields, @code{suppress} is true, and the field
13886
separator character is not in the record, then the record is skipped.
13888
If the record is valid, then at this point, @code{gawk} has split the data
13889
into fields, either using the character in @code{FS} or using fixed-length
13890
fields and @code{FIELDWIDTHS}. The loop goes through the list of fields
13891
that should be printed. If the corresponding field has data in it, it is
13892
printed. If the next field also has data, then the separator character is
13893
written out in between the fields.
13895
@c 2e: Could use `index($0, FS) != 0' instead of `$0 !~ FS', below
13899
@c file eg/prog/cut.awk
13901
if (by_fields && suppress && $0 !~ FS)
13904
for (i = 1; i <= nfields; i++) @{
13905
if ($flist[i] != "") @{
13906
printf "%s", $flist[i]
13907
if (i < nfields && $flist[i+1] != "")
13917
This version of @code{cut} relies on @code{gawk}'s @code{FIELDWIDTHS}
13918
variable to do the character-based cutting. While it would be possible in
13919
other @code{awk} implementations to use @code{substr}
13920
(@pxref{String Functions, ,Built-in Functions for String Manipulation}),
13921
it would also be extremely painful to do so.
13922
The @code{FIELDWIDTHS} variable supplies an elegant solution to the problem
13923
of picking the input line apart by characters.
13925
@node Egrep Program, Id Program, Cut Program, Clones
13926
@subsection Searching for Regular Expressions in Files
13928
@cindex @code{egrep} utility
13929
The @code{egrep} utility searches files for patterns. It uses regular
13930
expressions that are almost identical to those available in @code{awk}
13931
(@pxref{Regexp Constants, ,Regular Expression Constants}). It is used this way:
13934
egrep @r{[} @var{options} @r{]} '@var{pattern}' @var{files} @dots{}
13937
The @var{pattern} is a regexp.
13938
In typical usage, the regexp is quoted to prevent the shell from expanding
13939
any of the special characters as file name wildcards.
13940
Normally, @code{egrep} prints the
13941
lines that matched. If multiple file names are provided on the command
13942
line, each output line is preceded by the name of the file and a colon.
13948
Print out a count of the lines that matched the pattern, instead of the
13952
Be silent. No output is produced, and the exit value indicates whether
13953
or not the pattern was matched.
13956
Invert the sense of the test. @code{egrep} prints the lines that do
13957
@emph{not} match the pattern, and exits successfully if the pattern was not
13961
Ignore case distinctions in both the pattern and the input data.
13964
Only print the names of the files that matched, not the lines that matched.
13966
@item -e @var{pattern}
13967
Use @var{pattern} as the regexp to match. The purpose of the @samp{-e}
13968
option is to allow patterns that start with a @samp{-}.
13971
This version uses the @code{getopt} library function
13972
(@pxref{Getopt Function, ,Processing Command Line Options}),
13973
and the file transition library program
13974
(@pxref{Filetrans Function, ,Noting Data File Boundaries}).
13976
The program begins with a descriptive comment, and then a @code{BEGIN} rule
13977
that processes the command line arguments with @code{getopt}. The @samp{-i}
13978
(ignore case) option is particularly easy with @code{gawk}; we just use the
13979
@code{IGNORECASE} built-in variable
13980
(@pxref{Built-in Variables}).
13985
@c file eg/prog/egrep.awk
13986
# egrep.awk --- simulate egrep in awk
13987
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
13991
# -c count of lines
13992
# -s silent - use exit value
13993
# -v invert test, success if no match
13995
# -l print filenames only
13996
# -e argument is pattern
13999
while ((c = getopt(ARGC, ARGV, "ce:svil")) != -1) @{
14019
Next comes the code that handles the @code{egrep} specific behavior. If no
14020
pattern was supplied with @samp{-e}, the first non-option on the command
14021
line is used. The @code{awk} command line arguments up to @code{ARGV[Optind]}
14022
are cleared, so that @code{awk} won't try to process them as files. If no
14023
files were specified, the standard input is used, and if multiple files were
14024
specified, we make sure to note this so that the file names can precede the
14025
matched lines in the output.
14027
The last two lines are commented out, since they are not needed in
14028
@code{gawk}. They should be uncommented if you have to use another version
14033
@c file eg/prog/egrep.awk
14035
pattern = ARGV[Optind++]
14037
for (i = 1; i < Optind; i++)
14039
if (Optind >= ARGC) @{
14042
@} else if (ARGC - Optind > 1)
14046
# pattern = tolower(pattern)
14052
The next set of lines should be uncommented if you are not using
14053
@code{gawk}. This rule translates all the characters in the input line
14054
into lower-case if the @samp{-i} option was specified. The rule is
14055
commented out since it is not necessary with @code{gawk}.
14056
@c bug: if a match happens, we output the translated line, not the original
14060
@c file eg/prog/egrep.awk
14069
The @code{beginfile} function is called by the rule in @file{ftrans.awk}
14070
when each new file is processed. In this case, it is very simple; all it
14071
does is initialize a variable @code{fcount} to zero. @code{fcount} tracks
14072
how many lines in the current file matched the pattern.
14076
@c file eg/prog/egrep.awk
14077
function beginfile(junk)
14085
The @code{endfile} function is called after each file has been processed.
14086
It is used only when the user wants a count of the number of lines that
14087
matched. @code{no_print} will be true only if the exit status is desired.
14088
@code{count_only} will be true if line counts are desired. @code{egrep}
14089
will therefore only print line counts if printing and counting are enabled.
14090
The output format must be adjusted depending upon the number of files to be
14091
processed. Finally, @code{fcount} is added to @code{total}, so that we
14092
know how many lines altogether matched the pattern.
14096
@c file eg/prog/egrep.awk
14097
function endfile(file)
14099
if (! no_print && count_only)
14101
print file ":" fcount
14111
This rule does most of the work of matching lines. The variable
14112
@code{matches} will be true if the line matched the pattern. If the user
14113
wants lines that did not match, the sense of the @code{matches} is inverted
14114
using the @samp{!} operator. @code{fcount} is incremented with the value of
14115
@code{matches}, which will be either one or zero, depending upon a
14116
successful or unsuccessful match. If the line did not match, the
14117
@code{next} statement just moves on to the next record.
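The counting and inversion logic can be tried in isolation with a sketch such as the one below. The helper name @code{count_matches} and the variable @code{invert} are assumptions made for this example; the real program sets its flags from the command line options.

```shell
# Sketch only: count matching (or non-matching) lines on standard input.
count_matches() {    # $1 = regexp, $2 = 1 to invert the test, else 0
    awk -v pattern="$1" -v invert="$2" '
    {
        matches = ($0 ~ pattern)     # 1 if the line matched, else 0
        if (invert + 0)
            matches = ! matches
        fcount += matches            # adds 1 or 0
    }
    END { print fcount + 0 }'
}
printf 'apple\nbanana\ncherry\n' | count_matches 'an' 0
printf 'apple\nbanana\ncherry\n' | count_matches 'an' 1
```

Because a comparison yields one or zero, adding @code{matches} to the counter needs no separate @code{if} statement.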
14119
There are several optimizations for performance in the following few lines
14120
of code. If the user only wants exit status (@code{no_print} is true), and
14121
we don't have to count lines, then it is enough to know that one line in
14122
this file matched, and we can skip on to the next file with @code{nextfile}.
14123
Along similar lines, if we are only printing file names, and we
14124
don't need to count lines, we can print the file name, and then skip to the
14125
next file with @code{nextfile}.
14127
Finally, each line is printed, with a leading filename and colon if
14131
2e: note, probably better to recode the last few lines as
14132
if (! count_only) @{
14136
if (filenames_only) @{
14142
print FILENAME ":" $0
14150
@c file eg/prog/egrep.awk
14152
matches = ($0 ~ pattern)
14154
matches = ! matches
14156
fcount += matches # 1 or 0
14161
if (no_print && ! count_only)
14164
if (filenames_only && ! count_only) @{
14169
if (do_filenames && ! count_only)
14170
print FILENAME ":" $0
14171
else if (! count_only)
14178
@c @strong{Exercise}: rearrange the code inside @samp{if (! count_only)}.
14180
The @code{END} rule takes care of producing the correct exit status. If
14181
there were no matches, the exit status is one, otherwise it is zero.
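A stripped-down sketch of this exit status convention is shown here. It assumes a hypothetical helper named @code{has_match} and leaves out all of the option handling.

```shell
# Sketch only: succeed (status 0) if any input line matches the regexp.
has_match() {    # $1 = regexp; data is read from standard input
    awk -v pattern="$1" '
    $0 ~ pattern { total++ }
    END {
        if (total == 0)
            exit 1
        exit 0
    }'
}
if printf 'a\nb\n' | has_match 'b'; then echo "matched"; else echo "no match"; fi
if printf 'a\nb\n' | has_match 'z'; then echo "matched"; else echo "no match"; fi
```

Since the status follows the usual shell convention, the function can be used directly as the condition of an @code{if} statement.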
14185
@c file eg/prog/egrep.awk
14196
The @code{usage} function prints a usage message in case of invalid options
14201
@c file eg/prog/egrep.awk
14204
e = "Usage: egrep [-csvil] [-e pat] [files ...]"
14205
print e > "/dev/stderr"
14212
The variable @code{e} is used so that the function fits nicely
14213
on the printed page.
14215
@node Id Program, Split Program, Egrep Program, Clones
14216
@subsection Printing Out User Information
14218
@cindex @code{id} utility
14219
The @code{id} utility lists a user's real and effective user-id numbers,
14220
real and effective group-id numbers, and the user's group set, if any.
14221
@code{id} will only print the effective user-id and group-id if they are
14222
different from the real ones. If possible, @code{id} will also supply the
14223
corresponding user and group names. The output might look like this:
14227
@print{} uid=2076(arnold) gid=10(staff) groups=10(staff),4(tty)
14230
This information is exactly what is provided by @code{gawk}'s
14231
@file{/dev/user} special file (@pxref{Special Files, ,Special File Names in @code{gawk}}).
14232
However, the @code{id} utility provides a more palatable output than just a
14235
Here is a simple version of @code{id} written in @code{awk}.
14236
It uses the user database library functions
14237
(@pxref{Passwd Functions, ,Reading the User Database}),
14238
and the group database library functions
14239
(@pxref{Group Functions, ,Reading the Group Database}).
14241
The program is fairly straightforward. All the work is done in the
14242
@code{BEGIN} rule. The user and group id numbers are obtained from
14243
@file{/dev/user}. If there is no support for @file{/dev/user}, the program
14246
The code is repetitive. The entry in the user database for the real user-id
14247
number is split into parts at the @samp{:}. The name is the first field.
14248
Similar code is used for the effective user-id number, and the group
14254
@c file eg/prog/id.awk
14255
# id.awk --- implement id in awk
14256
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
14260
# uid=12(foo) euid=34(bar) gid=3(baz) \
14261
# egid=5(blat) groups=9(nine),2(two),1(one)
14265
if ((getline < "/dev/user") < 0) @{
14266
err = "id: no /dev/user support - cannot run"
14267
print err > "/dev/stderr"
14277
printf("uid=%d", uid)
14282
printf("(%s)", a[1])
14286
if (euid != uid) @{
14287
printf(" euid=%d", euid)
14288
pw = getpwuid(euid)
14291
printf("(%s)", a[1])
14295
printf(" gid=%d", gid)
14299
printf("(%s)", a[1])
14302
if (egid != gid) @{
14303
printf(" egid=%d", egid)
14304
pw = getgrgid(egid)
14307
printf("(%s)", a[1])
14312
printf(" groups=");
14313
for (i = 5; i <= NF; i++) @{
14318
printf("(%s)", a[1])
14332
The POSIX version of @code{id} takes arguments that control which
14333
information is printed. Modify this version to accept the same
14334
arguments and perform in the same way.
14337
@node Split Program, Tee Program, Id Program, Clones
14338
@subsection Splitting a Large File Into Pieces
14340
@cindex @code{split} utility
14341
The @code{split} program splits large text files into smaller pieces. By default,
14342
the output files are named @file{xaa}, @file{xab}, and so on. Each file has
14343
1000 lines in it, with the likely exception of the last file. To change the
14344
number of lines in each file, you supply a number on the command line
14345
preceded with a minus, e.g., @samp{-500} for files with 500 lines in them
14346
instead of 1000. To change the name of the output files to something like
14347
@file{myfileaa}, @file{myfileab}, and so on, you supply an additional
14348
argument that specifies the filename.
14350
Here is a version of @code{split} in @code{awk}. It uses the @code{ord} and
14351
@code{chr} functions presented in
14352
@ref{Ordinal Functions, ,Translating Between Characters and Numbers}.
14354
The program first sets its defaults, and then tests to make sure there are
14355
not too many arguments. It then looks at each argument in turn. The
14356
first argument could be a minus followed by a number. If it is, this happens
14357
to look like a negative number, so it is made positive, and that is the
14358
count of lines. The data file name is skipped over, and the final argument
14359
is used as the prefix for the output file names.
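The treatment of the first argument can be sketched on its own. The helper name @code{line_count} is hypothetical, and the argument is passed in with @samp{-v} here instead of being read from @code{ARGV}.

```shell
# Sketch only: turn an argument such as -500 into a line count.
line_count() {    # $1 = first command line argument
    awk -v arg="$1" '
    BEGIN {
        count = 1000                 # default
        if (arg ~ /^-[0-9]+$/)
            count = -arg             # "-500" is numerically -500
        print count
    }'
}
line_count -500
line_count data.txt
```

The first call prints 500; the second argument does not look like a negative number, so the default of 1000 is kept.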
14364
@c file eg/prog/split.awk
14365
# split.awk --- do split in awk
14366
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
14369
# usage: split [-num] [file] [outname]
14373
outfile = "x" # default
14379
if (ARGV[i] ~ /^-[0-9]+$/) @{
14384
# test argv in case reading from stdin instead of file
14386
i++ # skip data file name
14393
out = (outfile s1 s2)
14399
The next rule does most of the work. @code{tcount} (temporary count) tracks
14400
how many lines have been printed to the output file so far. If it is greater
14401
than @code{count}, it is time to close the current file and start a new one.
14402
@code{s1} and @code{s2} track the current suffixes for the file name. If
14403
they are both @samp{z}, the file is just too big. Otherwise, @code{s1}
14404
moves to the next letter in the alphabet and @code{s2} starts over again at
14409
@c file eg/prog/split.awk
14411
if (++tcount > count) @{
14415
printf("split: %s is too large to split\n", \
14416
FILENAME) > "/dev/stderr"
14419
s1 = chr(ord(s1) + 1)
14422
s2 = chr(ord(s2) + 1)
14423
out = (outfile s1 s2)
14432
The @code{usage} function simply prints an error message and exits.
14436
@c file eg/prog/split.awk
14439
e = "usage: split [-num] [file] [outname]"
14440
print e > "/dev/stderr"
14448
The variable @code{e} is used so that the function
14457
This program is a bit sloppy; it relies on @code{awk} to close the last file
14458
for it automatically, instead of doing it in an @code{END} rule.
14460
@node Tee Program, Uniq Program, Split Program, Clones
14461
@subsection Duplicating Output Into Multiple Files
14463
@cindex @code{tee} utility
14464
The @code{tee} program is known as a ``pipe fitting.'' @code{tee} copies
14465
its standard input to its standard output, and also duplicates it to the
14466
files named on the command line. Its usage is:
14469
tee @r{[}-a@r{]} file @dots{}
14472
The @samp{-a} option tells @code{tee} to append to the named files, instead of
14473
truncating them and starting over.
14475
The @code{BEGIN} rule first makes a copy of all the command line arguments,
14476
into an array named @code{copy}.
14477
@code{ARGV[0]} is not copied, since it is not needed.
14478
@code{tee} cannot use @code{ARGV} directly, since @code{awk} will attempt to
14479
process each file named in @code{ARGV} as input data.
14481
If the first argument is @samp{-a}, then the flag variable
14482
@code{append} is set to true, and both @code{ARGV[1]} and
14483
@code{copy[1]} are deleted. If @code{ARGC} is less than two, then no file
14484
names were supplied, and @code{tee} prints a usage message and exits.
14485
Finally, @code{awk} is forced to read the standard input by setting
14486
@code{ARGV[1]} to @code{"-"}, and @code{ARGC} to two.
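The argument handling described above can be sketched as follows. This is a simplified illustration, not the program from @file{tee.awk}: the function name @code{tee_sketch} is hypothetical and the @samp{-a} option and the usage message are left out.

```shell
# Sketch only: copy standard input to standard output and to each file.
tee_sketch() {    # usage: tee_sketch file ...
    awk '
    BEGIN {
        for (i = 1; i < ARGC; i++)
            copy[i] = ARGV[i]     # remember the output file names
        n = ARGC - 1
        ARGV[1] = "-"             # make awk read the standard input
        ARGC = 2                  # and ignore the other arguments
    }
    {
        for (i = 1; i <= n; i++)
            print > copy[i]
        print
    }
    END {
        for (i = 1; i <= n; i++)
            close(copy[i])
    }' "$@"
}
dir=$(mktemp -d)
printf 'one\ntwo\n' | tee_sketch "$dir/a" "$dir/b"
cat "$dir/b"
rm -r "$dir"
```

Without the changes to @code{ARGV} and @code{ARGC}, @code{awk} would try to read the named files as input instead of writing to them.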
14488
@c 2e: the `ARGC--' in the `if (ARGV[1] == "-a")' isn't needed.
14493
@c file eg/prog/tee.awk
14494
# tee.awk --- tee in awk
14495
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
14497
# Revised December 1995
14501
for (i = 1; i < ARGC; i++)
14504
if (ARGV[1] == "-a") @{
14511
print "usage: tee [-a] file ..." > "/dev/stderr"
14521
The single rule does all the work. Since there is no pattern, it is
14522
executed for each line of input. The body of the rule simply prints the
14523
line into each file on the command line, and then to the standard output.
14527
@c file eg/prog/tee.awk
14529
# moving the if outside the loop makes it run faster
14542
It would have been possible to code the loop this way:
14553
This is more concise, but it is also less efficient. The @samp{if} is
14554
tested for each record and for each output file. By duplicating the loop
14555
body, the @samp{if} is only tested once for each input record. If there are
14556
@var{N} input records and @var{M} output files, the first method only
14557
executes @var{N} @samp{if} statements, while the second would execute
14558
@var{N}@code{*}@var{M} @samp{if} statements.
14560
Finally, the @code{END} rule cleans up, by closing all the output files.
14564
@c file eg/prog/tee.awk
14574
@node Uniq Program, Wc Program, Tee Program, Clones
14575
@subsection Printing Non-duplicated Lines of Text
14577
@cindex @code{uniq} utility
14578
The @code{uniq} utility reads sorted lines of data on its standard input,
14579
and (by default) removes duplicate lines. In other words, only unique lines
14580
are printed, hence the name. @code{uniq} has a number of options. The usage is:
14583
uniq @r{[}-udc @r{[}-@var{n}@r{]]} @r{[}+@var{n}@r{]} @r{[} @var{input file} @r{[} @var{output file} @r{]]}
14586
The option meanings are:
14590
Only print repeated lines.
14593
Only print non-repeated lines.
14596
Count lines. This option overrides @samp{-d} and @samp{-u}. Both repeated
14597
and non-repeated lines are counted.
14600
Skip @var{n} fields before comparing lines. The definition of fields is the
14601
same as @code{awk}'s default: non-whitespace characters separated by runs of
14602
spaces and/or tabs.
14605
Skip @var{n} characters before comparing lines. Any fields specified with
14606
@samp{-@var{n}} are skipped first.
14608
@item @var{input file}
14609
Data is read from the input file named on the command line, instead of from
14610
the standard input.
14612
@item @var{output file}
14613
The generated output is sent to the named output file, instead of to the
14617
Normally @code{uniq} behaves as if both the @samp{-d} and @samp{-u} options
14620
Here is an @code{awk} implementation of @code{uniq}. It uses the
14621
@code{getopt} library function
14622
(@pxref{Getopt Function, ,Processing Command Line Options}),
14623
and the @code{join} library function
14624
(@pxref{Join Function, ,Merging an Array Into a String}).
14626
The program begins with a @code{usage} function and then a brief outline of
14627
the options and their meanings in a comment.
14629
The @code{BEGIN} rule deals with the command line arguments and options. It
14630
uses a trick to get @code{getopt} to handle options of the form @samp{-25},
14631
treating such an option as the option letter @samp{2} with an argument of
14632
@samp{5}. If indeed two or more digits were supplied (@code{Optarg} looks
14633
like a number), @code{Optarg} is
14634
concatenated with the option digit, and then the result is added to zero to make
14635
it into a number. If there is only one digit in the option, then
14636
@code{Optarg} is not needed, and @code{Optind} must be decremented so that
14637
@code{getopt} will process it next time. This code is admittedly a bit
14640
If no options were supplied, then the default is taken, to print both
14641
repeated and non-repeated lines. The output file, if provided, is assigned
14642
to @code{outputfile}. Earlier, @code{outputfile} was initialized to the
14643
standard output, @file{/dev/stdout}.
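The digit-option trick described earlier comes down to one concatenation and one addition. The sketch below shows just that step; the helper name @code{field_count} is hypothetical, and the values that @code{getopt} would supply are passed in by hand.

```shell
# Sketch only: rebuild a number such as 25 from the option letter "2"
# and the option argument "5".
field_count() {    # $1 = option digit, $2 = Optarg as getopt would set it
    awk -v c="$1" -v Optarg="$2" '
    BEGIN {
        if (Optarg ~ /^[0-9]+$/)
            fcount = (c Optarg) + 0   # "2" "5" becomes 25
        else
            fcount = c + 0            # a lone digit; Optarg is something else
        print fcount
    }'
}
field_count 2 5
field_count 5 data.txt
```

In the second call the supposed argument is really the next command line word, which is why the real program must also back up @code{Optind} in that case.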
14648
@c file eg/prog/uniq.awk
14649
# uniq.awk --- do uniq in awk
14650
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
14655
e = "Usage: uniq [-udc [-n]] [+n] [ in [ out ]]"
14656
print e > "/dev/stderr"
14660
# -c count lines. overrides -d and -u
14661
# -d only repeated lines
14662
# -u only non-repeated lines
14664
# +n skip n characters, skip fields first
14669
outputfile = "/dev/stdout"
14670
opts = "udc0:1:2:3:4:5:6:7:8:9:"
14671
while ((c = getopt(ARGC, ARGV, opts)) != -1) @{
14673
non_repeated_only++
14678
else if (index("0123456789", c) != 0) @{
14679
# getopt requires args to options
14680
# this messes us up for things like -5
14681
if (Optarg ~ /^[0-9]+$/)
14682
fcount = (c Optarg) + 0
14691
if (ARGV[Optind] ~ /^\+[0-9]+$/) @{
14692
charcount = substr(ARGV[Optind], 2) + 0
14696
for (i = 1; i < Optind; i++)
14699
if (repeated_only == 0 && non_repeated_only == 0)
14700
repeated_only = non_repeated_only = 1
14702
if (ARGC - Optind == 2) @{
14703
outputfile = ARGV[ARGC - 1]
14704
ARGV[ARGC - 1] = ""
14711
The following function, @code{are_equal}, compares the current line,
@code{$0}, to the
previous line, @code{last}. It handles skipping fields and characters.
14715
If no field count and no character count were specified, @code{are_equal}
14716
simply returns one or zero depending upon the result of a simple string
14717
comparison of @code{last} and @code{$0}. Otherwise, things get more
14720
If fields have to be skipped, each line is broken into an array using
14722
(@pxref{String Functions, ,Built-in Functions for String Manipulation}),
14723
and then the desired fields are joined back into a line using @code{join}.
14724
The joined lines are stored in @code{clast} and @code{cline}.
14725
If no fields are skipped, @code{clast} and @code{cline} are set to
14726
@code{last} and @code{$0} respectively.
14728
Finally, if characters are skipped, @code{substr} is used to strip off the
14729
leading @code{charcount} characters in @code{clast} and @code{cline}. The
14730
two strings are then compared, and @code{are_equal} returns the result.
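
As a rough, self-contained illustration of the same steps, the sketch
below takes both lines and both counts as parameters instead of using
the globals @code{last}, @code{fcount}, and @code{charcount}. The
name @code{skip_equal} and the helper @code{rest} are hypothetical;
@code{rest} stands in for the @code{join} library function and always
rejoins fields with a single space.

@example
# rest --- rejoin fields start..n of array a (stand-in for join)
function rest(a, start, n,    i, s)
@{
    s = ""
    for (i = start; i <= n; i++)
        s = s (i > start ? " " : "") a[i]
    return s
@}

# skip_equal --- compare two lines, ignoring leading fields and chars
function skip_equal(prev, cur, nfields, nchars,    n, m, ap, ac)
@{
    if (nfields > 0) @{
        n = split(prev, ap)
        m = split(cur, ac)
        prev = rest(ap, nfields + 1, n)
        cur = rest(ac, nfields + 1, m)
    @}
    if (nchars > 0) @{
        prev = substr(prev, nchars + 1)
        cur = substr(cur, nchars + 1)
    @}
    return (prev == cur)
@}

BEGIN @{
    print skip_equal("1 apple pie", "2 apple pie", 1, 0)    # prints 1
    print skip_equal("xapple", "yapricot", 0, 1)            # prints 0
@}
@end example
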
14734
@c file eg/prog/uniq.awk
14735
function are_equal( n, m, clast, cline, alast, aline)
14737
if (fcount == 0 && charcount == 0)
14738
return (last == $0)
14741
n = split(last, alast)
14742
m = split($0, aline)
14743
clast = join(alast, fcount+1, n)
14744
cline = join(aline, fcount+1, m)
14750
clast = substr(clast, charcount + 1)
14751
cline = substr(cline, charcount + 1)
14754
return (clast == cline)
14760
The following two rules are the body of the program. The first one is
14761
executed only for the very first line of data. It sets @code{last} equal to
14762
@code{$0}, so that subsequent lines of text have something to be compared to.
14764
The second rule does the work. The variable @code{equal} will be one or zero
depending upon the results of @code{are_equal}'s comparison. If @code{uniq}
is counting lines (the @samp{-c} option), then the @code{count} variable is
incremented if the lines are equal. Otherwise the previous line is printed
along with its count, and @code{count} is
reset, since the two lines are not equal.
14770
If @code{uniq} is not counting, @code{count} is still incremented if the lines are
equal. Otherwise, if @code{uniq} is printing only repeated lines and more than
one line has been seen, or if @code{uniq} is printing only non-repeated lines
and only one line has been seen, then the line is printed, and @code{count}
is reset.
14776
Finally, similar logic is used in the @code{END} rule to print the final
14777
line of input data.
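
Stripped of option handling, the counting case reduces to a few lines.
The following sketch is an illustration only: it compares whole lines
as strings rather than calling @code{are_equal}, always writes to the
standard output, and behaves roughly like @samp{uniq -c}.

@example
# Count runs of identical adjacent lines
NR == 1 @{
    last = $0 ""      # the "" forces string comparison below
    count = 1
    next
@}

@{
    if ($0 == last)
        count++
    else @{
        printf("%4d %s\n", count, last)
        last = $0 ""
        count = 1
    @}
@}

END @{
    if (NR > 0)       # print the final run, if there was any input
        printf("%4d %s\n", count, last)
@}
@end example
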
14781
@c file eg/prog/uniq.awk
14790
equal = are_equal()
14792
if (do_count) @{ # overrides -d and -u
14796
printf("%4d %s\n", count, last) > outputfile
14806
if ((repeated_only && count > 1) ||
14807
(non_repeated_only && count == 1))
14808
print last > outputfile
14817
printf("%4d %s\n", count, last) > outputfile
14818
else if ((repeated_only && count > 1) ||
14819
(non_repeated_only && count == 1))
14820
print last > outputfile
14827
@node Wc Program, , Uniq Program, Clones
14828
@subsection Counting Things
14830
@cindex @code{wc} utility
14831
The @code{wc} (word count) utility counts lines, words, and characters in
14832
one or more input files. Its usage is:
14835
wc @r{[}-lwc@r{]} @r{[} @var{files} @dots{} @r{]}
14838
If no files are specified on the command line, @code{wc} reads its standard
14839
input. If there are multiple files, it will also print total counts for all
14840
the files. The options and their meanings are:
14848
A ``word'' is a contiguous sequence of non-whitespace characters, separated
14849
by spaces and/or tabs. Happily, this is the normal way @code{awk} separates
14850
fields in its input data.
14853
Only count characters.
14856
Implementing @code{wc} in @code{awk} is particularly elegant, since
14857
@code{awk} does a lot of the work for us; it splits lines into words (i.e.@:
14858
fields) and counts them, it counts lines (i.e.@: records) for us, and it can
14859
easily tell us how long a line is.
14861
This version uses the @code{getopt} library function
14862
(@pxref{Getopt Function, ,Processing Command Line Options}),
14863
and the file transition functions
14864
(@pxref{Filetrans Function, ,Noting Data File Boundaries}).
14866
This version has one major difference from traditional versions of @code{wc}.
14867
Our version always prints the counts in the order lines, words,
14868
and characters. Traditional versions note the order of the @samp{-l},
14869
@samp{-w}, and @samp{-c} options on the command line, and print the counts
in that order.
14872
The @code{BEGIN} rule does the argument processing.
14873
The variable @code{print_total} will
14874
be true if more than one file was named on the command line.
14879
@c file eg/prog/wc.awk
14880
# wc.awk --- count lines, words, characters
14881
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
14885
# -l only count lines
14886
# -w only count words
14887
# -c only count characters
14889
# Default is to count lines, words, characters
14892
# let getopt print a message about
14893
# invalid options. we ignore them
14894
while ((c = getopt(ARGC, ARGV, "lwc")) != -1) @{
14902
for (i = 1; i < Optind; i++)
14905
# if no options, do all
14906
if (! do_lines && ! do_words && ! do_chars)
14907
do_lines = do_words = do_chars = 1
14909
print_total = (ARGC - i > 2)
14915
The @code{beginfile} function is simple; it just resets the counts of lines,
14916
words, and characters to zero, and saves the current file name in
@code{fname}.
14919
The @code{endfile} function adds the current file's numbers to the running
14920
totals of lines, words, and characters. It then prints out those numbers
14921
for the file that was just read. It relies on @code{beginfile} to reset the
14922
numbers for the following data file.
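
To experiment with per-file counts without the library, the file
boundaries can be detected directly. This sketch is only a stand-in
for the @code{beginfile} and @code{endfile} mechanism, under several
assumptions: it treats @samp{FNR == 1} as the start of a new file,
counts only lines, and uses a hypothetical function named
@code{report}. It does not notice empty files.

@example
# report --- hypothetical helper: print one file's line count
function report(name)
@{
    printf "\t%d\t%s\n", lines, name
    tlines += lines
@}

FNR == 1 && NR != 1 @{ report(fname) @}     # the previous file has ended
FNR == 1            @{ lines = 0; fname = FILENAME @}
                    @{ lines++ @}

END @{
    if (NR > 0)
        report(fname)
    printf "\t%d\ttotal\n", tlines
@}
@end example
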
14926
@c file eg/prog/wc.awk
14927
function beginfile(file)
14929
chars = lines = words = 0
14933
function endfile(file)
14940
printf "\t%d", lines
14943
printf "\t%d", words
14945
printf "\t%d", chars
14946
printf "\t%s\n", fname
14952
There is one rule that is executed for each line. It adds the length of the
14953
record to @code{chars}. It has to add one, since the newline character
14954
separating records (the value of @code{RS}) is not part of the record
14955
itself. @code{lines} is incremented for each line read, and @code{words} is
14956
incremented by the value of @code{NF}, the number of ``words'' on this
14957
line.@footnote{Examine the code in
14958
@ref{Filetrans Function, ,Noting Data File Boundaries}.
14959
Why must @code{wc} use a separate @code{lines} variable, instead of using
14960
the value of @code{FNR} in @code{endfile}?}
14962
Finally, the @code{END} rule simply prints the totals for all the files.
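
For a single input stream, the counting itself fits in two rules. The
sketch below is a simplification under that assumption: there is no
option processing and no per-file reporting.

@example
# Count lines, words, and characters of all input taken together
@{
    chars += length($0) + 1    # the newline is not part of $0
    words += NF
    lines++
@}

END @{
    printf "\t%d\t%d\t%d\n", lines, words, chars
@}
@end example

@noindent
Given the two input lines @samp{hello world} and @samp{foo}, this
prints the counts 2, 3, and 16.
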
14966
@c file eg/prog/wc.awk
14969
chars += length($0) + 1 # get newline
14975
if (print_total) @{
14977
printf "\t%d", tlines
14979
printf "\t%d", twords
14981
printf "\t%d", tchars
14989
@node Miscellaneous Programs, , Clones, Sample Programs
14990
@section A Grab Bag of @code{awk} Programs
14992
This section is a large ``grab bag'' of miscellaneous programs.
14993
We hope you find them both interesting and enjoyable.
14996
* Dupword Program:: Finding duplicated words in a document.
14997
* Alarm Program:: An alarm clock.
14998
* Translate Program:: A program similar to the @code{tr} utility.
14999
* Labels Program:: Printing mailing labels.
15000
* Word Sorting:: A program to produce a word usage count.
15001
* History Sorting:: Eliminating duplicate entries from a history
15003
* Extract Program:: Pulling out programs from Texinfo source
15005
* Simple Sed:: A Simple Stream Editor.
15006
* Igawk Program:: A wrapper for @code{awk} that includes files.
15009
@node Dupword Program, Alarm Program, Miscellaneous Programs, Miscellaneous Programs
15010
@subsection Finding Duplicated Words in a Document
15012
A common error when writing large amounts of prose is to accidentally
15013
duplicate words. Often you will see this in text as something like ``the
15014
the program does the following @dots{}.'' When the text is on-line, often
15015
the duplicated words occur at the end of one line and the beginning of
15016
another, making them very difficult to spot.
15019
This program, @file{dupword.awk}, scans through a file one line at a time,
15020
and looks for adjacent occurrences of the same word. It also saves the last
15021
word on a line (in the variable @code{prev}) for comparison with the first
15022
word on the next line.
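
The idea can be sketched in a handful of lines. This is an
illustration, not necessarily identical to @file{dupword.awk}: the
@samp{NF > 0} test and the hypothetical helper @code{same}, which
forces a string comparison, are additions made for this example.

@example
# same --- hypothetical helper: compare two words as strings
function same(a, b)
@{
    return (a "") == (b "")
@}

@{
    $0 = tolower($0)               # ignore case
    gsub(/[^a-z0-9 \t]/, "")       # drop punctuation; fields are re-split

    if (NF > 0 && same($1, prev))  # first word vs. end of previous line
        printf("%s:%d: duplicate %s\n", FILENAME, FNR, $1)
    for (i = 2; i <= NF; i++)
        if (same($i, $(i-1)))
            printf("%s:%d: duplicate %s\n", FILENAME, FNR, $i)
    prev = $NF                     # empty when the line is blank
@}
@end example
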
15024
The first statement makes sure that the line is all lower-case, so that, for example,
``The'' and ``the'' compare equal to each other. The second statement
15027
removes all non-alphanumeric and non-whitespace characters from the line, so
15028
that punctuation does not affect the comparison either. This sometimes
15029
leads to reports of duplicated words that really are different, but this is
15032
@findex dupword.awk
15035
@c file eg/prog/dupword.awk
15036
# dupword --- find duplicate words in text
15037
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
15042
gsub(/[^A-Za-z0-9 \t]/, "");
15044
printf("%s:%d: duplicate %s\n",
15046
for (i = 2; i <= NF; i++)
15048
printf("%s:%d: duplicate %s\n",
15056
@node Alarm Program, Translate Program, Dupword Program, Miscellaneous Programs
15057
@subsection An Alarm Clock Program
15059
The following program is a simple ``alarm clock'' program.
15060
You give it a time of day, and an optional message. At the given time,
15061
it prints the message on the standard output. In addition, you can give it
15062
the number of times to repeat the message, and also a delay between
repetitions.
15065
This program uses the @code{gettimeofday} function from
15066
@ref{Gettimeofday Function, ,Managing the Time of Day}.
15068
All the work is done in the @code{BEGIN} rule. The first part is argument
15069
checking and setting of defaults; the delay, the count, and the message to
15070
print. If the user supplied a message, but it does not contain the ASCII BEL
15071
character (known as the ``alert'' character, @samp{\a}), then it is added to
15072
the message. (On many systems, printing the ASCII BEL generates some sort
15073
of audible alert. Thus, when the alarm goes off, the system calls attention
15074
to itself, in case the user is not looking at their computer or terminal.)
15079
@c file eg/prog/alarm.awk
15080
# alarm --- set an alarm
15081
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
15084
# usage: alarm time [ "message" [ count [ delay ] ] ]
15088
# Initial argument sanity checking
15089
usage1 = "usage: alarm time ['message' [count [delay]]]"
15090
usage2 = sprintf("\t(%s) time ::= hh:mm", ARGV[1])
15093
print usage1 > "/dev/stderr"
print usage2 > "/dev/stderr"
15095
@} else if (ARGC == 5) @{
15096
delay = ARGV[4] + 0
15097
count = ARGV[3] + 0
15099
@} else if (ARGC == 4) @{
15100
count = ARGV[3] + 0
15102
@} else if (ARGC == 3) @{
15104
@} else if (ARGV[1] !~ /[0-9]?[0-9]:[0-9][0-9]/) @{
15105
print usage1 > "/dev/stderr"
15106
print usage2 > "/dev/stderr"
15110
# set defaults for once we reach the desired time
15112
delay = 180 # 3 minutes
15117
message = sprintf("\aIt is now %s!\a", ARGV[1])
15118
else if (index(message, "\a") == 0)
15119
message = "\a" message "\a"
15124
The next section of code turns the alarm time into hours and minutes,
15125
and converts it if necessary to a 24-hour clock. Then it turns that
15126
time into a count of the seconds since midnight. Next it turns the current
15127
time into a count of seconds since midnight. The difference between the two
15128
is how long to wait before setting off the alarm.
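
The arithmetic is easy to check by hand. In this sketch the current
time is passed in as plain numbers rather than obtained from
@code{gettimeofday}; the function name @code{seconds_until} and its
parameters are hypothetical.

@example
# seconds_until --- seconds from now (nh:nm:ns) until hh:mm
function seconds_until(hh, mm, nh, nm, ns,    target, current)
@{
    if (hh < 12 && nh > hh)    # e.g., 5:30 given at 9 a.m. means 17:30
        hh += 12
    target = (hh * 60 * 60) + (mm * 60)
    current = (nh * 60 * 60) + (nm * 60) + ns
    return target - current
@}

BEGIN @{
    print seconds_until(5, 30, 9, 0, 0)    # prints 30600 (8.5 hours)
    print seconds_until(8, 0, 9, 0, 0)     # 20:00 minus 9:00, prints 39600
@}
@end example

@noindent
A result that is zero or negative corresponds to the ``time is in the
past'' case that the program rejects.
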
15132
@c file eg/prog/alarm.awk
15133
# split up dest time
15134
split(ARGV[1], atime, ":")
15135
hour = atime[1] + 0 # force numeric
15136
minute = atime[2] + 0 # force numeric
15138
# get current broken down time
15141
# if time given is 12-hour hours and it's after that
15142
# hour, e.g., `alarm 5:30' at 9 a.m. means 5:30 p.m.,
15143
# then add 12 to real hour
15144
if (hour < 12 && now["hour"] > hour)
15147
# set target time in seconds since midnight
15148
target = (hour * 60 * 60) + (minute * 60)
15150
# get current time in seconds since midnight
15151
current = (now["hour"] * 60 * 60) + \
15152
(now["minute"] * 60) + now["second"]
15154
# how long to sleep for
15155
naptime = target - current
15156
if (naptime <= 0) @{
15157
print "time is in the past!" > "/dev/stderr"
15164
Finally, the program uses the @code{system} function
15165
(@pxref{I/O Functions, ,Built-in Functions for Input/Output})
15166
to call the @code{sleep} utility. The @code{sleep} utility simply pauses
15167
for the given number of seconds. If the exit status is not zero,
15168
the program assumes that @code{sleep} was interrupted, and exits. If
15169
@code{sleep} exited with an OK status (zero), then the program prints the
15170
message in a loop, again using @code{sleep} to delay for however many
15171
seconds are necessary.
15175
@c file eg/prog/alarm.awk
15176
# zzzzzz..... go away if interrupted
15177
if (system(sprintf("sleep %d", naptime)) != 0)
15181
command = sprintf("sleep %d", delay)
15182
for (i = 1; i <= count; i++) @{
15184
# if sleep command interrupted, go away
15185
if (system(command) != 0)
15195
@node Translate Program, Labels Program, Alarm Program, Miscellaneous Programs
15196
@subsection Transliterating Characters
15198
The system @code{tr} utility transliterates characters. For example, it is
15199
often used to map upper-case letters into lower-case, for further
processing.
15203
@var{generate data} | tr '[A-Z]' '[a-z]' | @var{process data} @dots{}
15206
You give @code{tr} two lists of characters enclosed in square brackets.
15207
Usually, the lists are quoted to keep the shell from attempting to do a
15208
filename expansion.@footnote{On older, non-POSIX systems, @code{tr} often
15209
does not require that the lists be enclosed in square brackets and quoted.
15210
This is a feature.} When processing the input, the
15211
first character in the first list is replaced with the first character in the
15212
second list, the second character in the first list is replaced with the
15213
second character in the second list, and so on.
15214
If there are more characters in the ``from'' list than in the ``to'' list,
15215
the last character of the ``to'' list is used for the remaining characters
15216
in the ``from'' list.
15219
@c early or mid-1989!
15220
Some time ago, a user proposed to us that we add a transliteration function to @code{gawk}.
15221
Being opposed to ``creeping featurism,'' I wrote the following program to
15222
prove that character transliteration could be done with a user-level
15223
function. This program is not as complete as the system @code{tr} utility,
15224
but it will do most of the job.
15226
The @code{translate} program demonstrates one of the few weaknesses of
15228
@code{awk}: dealing with individual characters is very painful, requiring
15229
repeated use of the @code{substr}, @code{index}, and @code{gsub} built-in
functions
15231
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).@footnote{This
15232
program was written before @code{gawk} acquired the ability to
15233
split each character in a string into separate array elements.
15234
How might this ability simplify the program?}
15236
There are two functions. The first, @code{stranslate}, takes three
arguments:
15241
A list of characters to translate from.
15244
A list of characters to translate to.
15247
The string to do the translation on.
15250
Associative arrays make the translation part fairly easy. @code{t_ar} holds
15251
the ``to'' characters, indexed by the ``from'' characters. Then a simple
15252
loop goes through @code{from}, one character at a time. For each character
15253
in @code{from}, if the character appears in @code{target}, @code{gsub}
15254
is used to change it to the corresponding @code{to} character.
15256
The @code{translate} function simply calls @code{stranslate} using @code{$0}
15257
as the target. The main program sets two global variables, @code{FROM} and
15258
@code{TO}, from the command line, and then changes @code{ARGV} so that
15259
@code{awk} will read from the standard input.
15261
Finally, the processing rule simply calls @code{translate} for each record.
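
A variation on the same idea avoids @code{gsub} altogether. The
following sketch is not the @code{stranslate} function; it is an
alternative that builds the result one character at a time, so that a
character is never translated twice and regexp metacharacters in the
``from'' list need no special care. The name @code{tr_chars} is
hypothetical.

@example
# tr_chars --- map each character of target through from -> to
function tr_chars(from, to, target,    lf, lt, map, i, c, out)
@{
    lf = length(from)
    lt = length(to)
    for (i = 1; i <= lf; i++)      # short `to': reuse its last character
        map[substr(from, i, 1)] = substr(to, (i <= lt ? i : lt), 1)

    out = ""
    lf = length(target)
    for (i = 1; i <= lf; i++) @{
        c = substr(target, i, 1)
        out = out ((c in map) ? map[c] : c)
    @}
    return out
@}

BEGIN @{
    print tr_chars("abc", "xy", "aabbcc")    # prints xxyyyy
    print tr_chars("ab", "ba", "abba")       # prints baab
@}
@end example
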
15263
@findex translate.awk
15266
@c file eg/prog/translate.awk
15267
# translate --- do tr like stuff
15268
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
15271
# bugs: does not handle things like: tr A-Z a-z, it has
15272
# to be spelled out. However, if `to' is shorter than `from',
15273
# the last character in `to' is used for the rest of `from'.
15275
function stranslate(from, to, target, lf, lt, t_ar, i, c)
15279
for (i = 1; i <= lt; i++)
15280
t_ar[substr(from, i, 1)] = substr(to, i, 1)
15282
for (; i <= lf; i++)
15283
t_ar[substr(from, i, 1)] = substr(to, lt, 1)
15284
for (i = 1; i <= lf; i++) @{
15285
c = substr(from, i, 1)
15286
if (index(target, c) > 0)
15287
gsub(c, t_ar[c], target)
15293
function translate(from, to)
15295
return $0 = stranslate(from, to, $0)
15302
print "usage: translate from to" > "/dev/stderr"
15312
translate(FROM, TO)
15319
While it is possible to do character transliteration in a user-level
15320
function, it is not necessarily efficient, and we started to consider adding
15321
a built-in function. However, shortly after writing this program, we learned
15322
that the System V Release 4 @code{awk} had added the @code{toupper} and
15323
@code{tolower} functions. These functions handle the vast majority of the
15324
cases where character transliteration is necessary, and so we chose to
15325
simply add those functions to @code{gawk} as well, and then leave well
enough alone.
15328
An obvious improvement to this program would be to set up the
15329
@code{t_ar} array only once, in a @code{BEGIN} rule. However, this
15330
assumes that the ``from'' and ``to'' lists
15331
will never change throughout the lifetime of the program.
15333
@node Labels Program, Word Sorting, Translate Program, Miscellaneous Programs
15334
@subsection Printing Mailing Labels
15336
Here is a ``real world''@footnote{``Real world'' is defined as
15337
``a program actually used to get something done.''}
15338
program. This script reads lists of names and
15339
addresses, and generates mailing labels. Each page of labels has 20 labels
15340
on it, two across and ten down. The addresses are guaranteed to be no more
15341
than five lines of data. Each address is separated from the next by a blank
line.
15344
The basic idea is to read 20 labels worth of data. Each line of each label
15345
is stored in the @code{line} array. The single rule takes care of filling
15346
the @code{line} array and printing the page when 20 labels have been read.
15348
The @code{BEGIN} rule simply sets @code{RS} to the empty string, so that
15349
@code{awk} will split records at blank lines
15350
(@pxref{Records, ,How Input is Split into Records}).
15351
It sets @code{MAXLINES} to 100, since @code{MAXLINES} is the maximum number
15352
of lines on the page (20 * 5 = 100).
15354
Most of the work is done in the @code{printpage} function.
15355
The label lines are stored sequentially in the @code{line} array. But they
15356
have to be printed horizontally; @code{line[1]} next to @code{line[6]},
15357
@code{line[2]} next to @code{line[7]}, and so on. Two loops are used to
15358
accomplish this. The outer loop, controlled by @code{i}, steps through
15359
every 10 lines of data; this is each row of labels. The inner loop,
15360
controlled by @code{j}, goes through the lines within the row.
15361
As @code{j} goes from zero to four, @samp{i+j} is the @code{j}'th line in
15362
the row, and @samp{i+j+5} is the entry next to it. The output ends up
15363
looking something like this:
15373
As a final note, at lines 21 and 61, an extra blank line is printed, to keep
15374
the output lined up on the labels. This is dependent on the particular
15375
brand of labels in use when the program was written. You will also note
15376
that there are two blank lines at the top and two blank lines at the bottom.
15378
The @code{END} rule arranges to flush the final page of labels; there may
15379
not have been an even multiple of 20 labels in the data.
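
The index arithmetic is the only subtle part. The sketch below prints
a single row of two labels from made-up data, pairing
@code{line[i+j]} with @code{line[i+j+5]}; the column width of 10 is
chosen only to keep the example narrow.

@example
BEGIN @{
    # two hypothetical labels of five lines each
    for (k = 1; k <= 5; k++) @{
        line[k] = "left " k
        line[k + 5] = "right " k
    @}

    i = 1                              # first row of labels
    for (j = 0; j < 5; j++)
        printf "%-10s %s\n", line[i+j], line[i+j+5]
@}
@end example

@noindent
Each output line shows @samp{left @var{n}} beside @samp{right @var{n}}.
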
15384
@c file eg/prog/labels.awk
15386
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
15389
# Program to print labels. Each label is 5 lines of data
15390
# that may have blank lines. The label sheets have 2
15391
# blank lines at the top and 2 at the bottom.
15393
BEGIN @{ RS = "" ; MAXLINES = 100 @}
15395
function printpage( i, j)
15400
printf "\n\n" # header
15402
for (i = 1; i <= Nlines; i += 10) @{
15403
if (i == 21 || i == 61)
15405
for (j = 0; j < 5; j++) @{
15406
if (i + j > MAXLINES)
15408
printf " %-41s %s\n", line[i+j], line[i+j+5]
15413
printf "\n\n" # footer
15421
if (Count >= 20) @{
15426
n = split($0, a, "\n")
15427
for (i = 1; i <= n; i++)
15428
line[++Nlines] = a[i]
15429
for (; i <= 5; i++)
15430
line[++Nlines] = ""
15442
@node Word Sorting, History Sorting, Labels Program, Miscellaneous Programs
15443
@subsection Generating Word Usage Counts
15445
The following @code{awk} program prints
15446
the number of occurrences of each word in its input. It illustrates the
15447
associative nature of @code{awk} arrays by using strings as subscripts. It
15448
also demonstrates the @samp{for @var{x} in @var{array}} construction.
15449
Finally, it shows how @code{awk} can be used in conjunction with other
15450
utility programs to do a useful task of some complexity with a minimum of
15451
effort. Some explanations follow the program listing.
15455
# Print list of word frequencies
15457
for (i = 1; i <= NF; i++)
15463
printf "%s\t%d\n", word, freq[word]
15467
The first thing to notice about this program is that it has two rules. The
15468
first rule, because it has an empty pattern, is executed on every line of
15469
the input. It uses @code{awk}'s field-accessing mechanism
15470
(@pxref{Fields, ,Examining Fields}) to pick out the individual words from
15471
the line, and the built-in variable @code{NF} (@pxref{Built-in Variables})
15472
to know how many fields are available.
15474
For each input word, an element of the array @code{freq} is incremented to
15475
reflect that the word has been seen an additional time.
15477
The second rule, because it has the pattern @code{END}, is not executed
15478
until the input has been exhausted. It prints out the contents of the
15479
@code{freq} table that has been built up inside the first action.
15481
This program has several problems that would prevent it from being
15482
useful by itself on real text files:
15486
Words are detected using the @code{awk} convention that fields are
15487
separated by whitespace and that other characters in the input (except
15488
newlines) don't have any special meaning to @code{awk}. This means that
15489
punctuation characters count as part of words.
15492
The @code{awk} language considers upper- and lower-case characters to be
15493
distinct. Therefore, @samp{bartender} and @samp{Bartender} are not treated
15494
as the same word. This is undesirable since, in normal text, words
15495
are capitalized if they begin sentences, and a frequency analyzer should not
15496
be sensitive to capitalization.
15502
The output does not come out in any useful order. You're more likely to be
15503
interested in which words occur most frequently, or in having an alphabetized
15504
table of how frequently each word occurs.
15507
The way to solve these problems is to use some of the more advanced
15508
features of the @code{awk} language. First, we use @code{tolower} to remove
15509
case distinctions. Next, we use @code{gsub} to remove punctuation
15510
characters. Finally, we use the system @code{sort} utility to process the
15511
output of the @code{awk} script. Here is the new version of
the program:
15514
@findex wordfreq.awk
15516
@c file eg/prog/wordfreq.awk
15517
# Print list of word frequencies
15519
$0 = tolower($0) # remove case distinctions
15520
gsub(/[^a-z0-9_ \t]/, "", $0) # remove punctuation
15521
for (i = 1; i <= NF; i++)
15528
printf "%s\t%d\n", word, freq[word]
15532
Assuming we have saved this program in a file named @file{wordfreq.awk},
15533
and that the data is in @file{file1}, the following pipeline
15536
awk -f wordfreq.awk file1 | sort +1 -nr
15540
produces a table of the words appearing in @file{file1} in order of
15541
decreasing frequency.
15543
The @code{awk} program suitably massages the data and produces a word
15544
frequency table, which is not ordered.
15546
The @code{awk} script's output is then sorted by the @code{sort} utility and
15547
printed on the terminal. The options given to @code{sort} in this example
15548
specify to sort using the second field of each input line (skipping one field),
15549
that the sort keys should be treated as numeric quantities (otherwise
15550
@samp{15} would come before @samp{5}), and that the sorting should be done
15551
in descending (reverse) order.
15553
We could have even done the @code{sort} from within the program, by
15554
changing the @code{END} action to:
15557
@c file eg/prog/wordfreq.awk
15559
sort = "sort +1 -nr"
15561
printf "%s\t%d\n", word, freq[word] | sort
15567
You would have to use this way of sorting on systems that do not
15570
See the general operating system documentation for more information on how
15571
to use the @code{sort} program.
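
Some current versions of @code{sort} may no longer accept the
@samp{+1} form of key specification. Assuming a @code{sort} that
follows POSIX, the key can be written as @samp{-k 2nr} instead. The
following sketch shows the whole program in that form; closing the
pipe in the @code{END} rule is an addition that makes the program wait
for @code{sort} to finish.

@example
# Print list of word frequencies, most frequent first
@{
    $0 = tolower($0)                  # remove case distinctions
    gsub(/[^a-z0-9_ \t]/, "", $0)     # remove punctuation
    for (i = 1; i <= NF; i++)
        freq[$i]++
@}

END @{
    sort = "sort -k 2nr"              # field 2, numeric, reversed
    for (word in freq)
        printf "%s\t%d\n", word, freq[word] | sort
    close(sort)
@}
@end example
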
15573
@node History Sorting, Extract Program, Word Sorting, Miscellaneous Programs
15574
@subsection Removing Duplicates from Unsorted Text
15576
The @code{uniq} program
15577
(@pxref{Uniq Program, ,Printing Non-duplicated Lines of Text})
15578
removes duplicate lines from @emph{sorted} data.
15580
Suppose, however, that you need to remove duplicate lines from a data file, but
you wish to preserve the order the lines are in. A good example of
15582
this might be a shell history file. The history file keeps a copy of all
15583
the commands you have entered, and it is not unusual to repeat a command
15584
several times in a row. Occasionally you might wish to compact the history
15585
by removing duplicate entries. Yet it is desirable to maintain the order
15586
of the original commands.
15588
This simple program does the job. It uses two arrays. The @code{data}
15589
array is indexed by the text of each line.
15590
For each line, @code{data[$0]} is incremented.
15592
If a particular line has not
15593
been seen before, then @code{data[$0]} will be zero.
15594
In that case, the text of the line is stored in @code{lines[count]}.
15595
Each element of @code{lines} is a unique command, and the indices of
15596
@code{lines} indicate the order in which those lines were encountered.
15597
The @code{END} rule simply prints out the lines, in order.
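
The same two arrays also make it easy to show how often each line
occurred. The sketch below is an illustration that prints each unique
line once, in its original order, preceded by its count.

@example
@{
    if (data[$0]++ == 0)        # first time this line has been seen
        lines[++count] = $0
@}

END @{
    for (i = 1; i <= count; i++)
        print data[lines[i]], lines[i]
@}
@end example

@noindent
For the input lines @samp{ls}, @samp{cd}, and @samp{ls}, the output
is @samp{2 ls} followed by @samp{1 cd}.
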
15599
@cindex Rakitzis, Byron
15600
@findex histsort.awk
15603
@c file eg/prog/histsort.awk
15604
# histsort.awk --- compact a shell history file
15605
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
15608
# Thanks to Byron Rakitzis for the general idea
15610
if (data[$0]++ == 0)
15611
lines[++count] = $0
15615
for (i = 1; i <= count; i++)
15622
This program also provides a foundation for generating other useful
15623
information. For example, using the following @code{print} statement in the
15624
@code{END} rule would indicate how often a particular command was used.
15627
print data[lines[i]], lines[i]
15630
This works because @code{data[$0]} was incremented each time a line was
seen.
15633
@node Extract Program, Simple Sed, History Sorting, Miscellaneous Programs
15634
@subsection Extracting Programs from Texinfo Source Files
15637
Both this chapter and the previous chapter
15638
(@ref{Library Functions, ,A Library of @code{awk} Functions}),
15639
present a large number of @code{awk} programs.
15643
@ref{Library Functions, ,A Library of @code{awk} Functions},
15644
and @ref{Sample Programs, ,Practical @code{awk} Programs},
15645
are the top level nodes for a large number of @code{awk} programs.
15647
If you wish to experiment with these programs, it is tedious to have to type
15648
them in by hand. Here we present a program that can extract parts of a
15649
Texinfo input file into separate files.
15651
This @value{DOCUMENT} is written in Texinfo, the GNU project's document
15652
formatting language. A single Texinfo source file can be used to produce both
15653
printed and on-line documentation.
15655
Texinfo is fully documented in @cite{Texinfo---The GNU Documentation Format},
15656
available from the Free Software Foundation.
15659
The Texinfo language is described fully, starting with
15660
@ref{Top, , Introduction, texi, Texinfo---The GNU Documentation Format}.
15663
For our purposes, it is enough to know three things about Texinfo input
files:
15668
The ``at'' symbol, @samp{@@}, is special in Texinfo, much like @samp{\} in C
15669
or @code{awk}. Literal @samp{@@} symbols are represented in Texinfo source
15670
files as @samp{@@@@}.
15673
Comments start with either @samp{@@c} or @samp{@@comment}.
15674
The file extraction program will work by using special comments that start
15675
at the beginning of a line.
15678
Example text that should not be split across a page boundary is bracketed
15679
between lines containing @samp{@@group} and @samp{@@end group} commands.
15682
The following program, @file{extract.awk}, reads through a Texinfo source
15683
file, and does two things, based on the special comments.
15684
Upon seeing @samp{@w{@@c system @dots{}}},
15685
it runs a command, by extracting the command text from the
15686
control line and passing it on to the @code{system} function
15687
(@pxref{I/O Functions, ,Built-in Functions for Input/Output}).
15688
Upon seeing @samp{@@c file @var{filename}}, each subsequent line is sent to
15689
the file @var{filename}, until @samp{@@c endfile} is encountered.
15690
The rules in @file{extract.awk} will match either @samp{@@c} or
15691
@samp{@@comment} by letting the @samp{omment} part be optional.
15692
Lines containing @samp{@@group} and @samp{@@end group} are simply removed.
15693
@file{extract.awk} uses the @code{join} library function
15694
(@pxref{Join Function, ,Merging an Array Into a String}).
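
The directive handling can be pictured as a small state machine. This
sketch is a deliberate simplification, not the structure of
@file{extract.awk}: it tracks the current file in a variable instead
of reading lines with @code{getline}, prints to the standard output
with the file name as a prefix instead of writing files, and leaves
@samp{@@} symbols untouched.

@example
BEGIN @{ IGNORECASE = 1 @}    # gawk-specific; has no effect elsewhere

/^@@c(omment)?[ \t]+endfile/ @{ curfile = ""; next @}
/^@@c(omment)?[ \t]+file/    @{ if (NF >= 3) curfile = $3; next @}
/^@@(end[ \t]+)?group/       @{ next @}
curfile != ""                @{ print curfile ": " $0 @}
@end example

@noindent
Lines outside any @samp{file} and @samp{endfile} pair match none of
the rules and are silently skipped.
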
15696
The example programs in the on-line Texinfo source for @cite{@value{TITLE}}
15697
(@file{gawk.texi}) have all been bracketed inside @samp{file},
15698
and @samp{endfile} lines. The @code{gawk} distribution uses a copy of
15699
@file{extract.awk} to extract the sample
15700
programs and install many of them in a standard directory, where
15701
@code{gawk} can find them.
15703
@file{extract.awk} begins by setting @code{IGNORECASE} to one, so that
15704
mixed upper-case and lower-case letters in the directives won't matter.
15706
The first rule handles calling @code{system}, checking that a command was
15707
given (@code{NF} is at least three), and also checking that the command
15708
exited with a zero exit status, signifying OK.
15710
@findex extract.awk
15713
@c file eg/prog/extract.awk
15714
# extract.awk --- extract files and run programs
15715
# from texinfo files
15716
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
15719
BEGIN @{ IGNORECASE = 1 @}
15722
/^@@c(omment)?[ \t]+system/ \
15725
e = (FILENAME ":" FNR)
15726
e = (e ": badly formed `system' line")
15727
print e > "/dev/stderr"
15734
e = (FILENAME ":" FNR)
15735
e = (e ": warning: system returned " stat)
15736
print e > "/dev/stderr"
15744
The variable @code{e} is used so that the function
15753
The second rule handles moving data into files. It verifies that a file
15754
name was given in the directive. If the file named is not the current file,
15755
then the current file is closed. This means that an @samp{@@c endfile} was
15756
not given for that file. (We should probably print a diagnostic in this
15757
case, although at the moment we do not.)
15759
The @samp{for} loop does the work. It reads lines using @code{getline}
15760
(@pxref{Getline, ,Explicit Input with @code{getline}}).
15761
For an unexpected end of file, it calls the @code{@w{unexpected_eof}}
15762
function. If the line is an ``endfile'' line, then it breaks out of
the loop.
15764
If the line is an @samp{@@group} or @samp{@@end group} line, then it
15765
ignores it, and goes on to the next line.
15767
Most of the work is in the following few lines. If the line has no @samp{@@}
15768
symbols, it can be printed directly. Otherwise, each leading @samp{@@} must be
stripped off.
15771
To remove the @samp{@@} symbols, the line is split into separate elements of
15772
the array @code{a}, using the @code{split} function
15773
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
15774
Each element of @code{a} that is empty indicates two successive @samp{@@}
15775
symbols in the original line. For each two empty elements (@samp{@@@@} in
15776
the original file), we have to add back in a single @samp{@@} symbol.
15778
When the processing of the array is finished, @code{join} is called with the
15779
value of @code{SUBSEP}, to rejoin the pieces back into a single
15780
line. That line is then printed to the output file.
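The splitting technique described above can be tried in isolation.
The following is a minimal sketch, not the code from @file{extract.awk}:
the shell function name @code{unescape_at} is invented for illustration,
and the pieces are concatenated in a loop instead of calling the
@code{join} library function.

```shell
# unescape_at is a hypothetical helper: it removes single "@" characters
# and turns each "@@" pair into one "@".
unescape_at() {
    printf '%s\n' "$1" | awk '
    {
        n = split($0, a, "@")
        # a[1] is the text before the first "@"; leave it alone.
        for (i = 2; i <= n; i++) {
            if (a[i] == "") {          # empty piece: part of an "@@" pair
                a[i] = "@"
                if (i < n && a[i+1] == "")
                    i++                # skip the second half of "@@@@"
            }
        }
        out = ""
        for (i = 1; i <= n; i++)       # rejoin with no separator
            out = out a[i]
        print out
    }'
}
```

For instance, @samp{unescape_at 'a@@@@b @@@{x@@@}'} should print
@samp{a@@b @{x@}}.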
15784
@c file eg/prog/extract.awk
15785
/^@@c(omment)?[ \t]+file/ \
15789
e = (FILENAME ":" FNR ": badly formed `file' line")
15790
print e > "/dev/stderr"
15794
if ($3 != curfile) @{
15801
if ((getline line) <= 0)
15803
if (line ~ /^@@c(omment)?[ \t]+endfile/)
15805
else if (line ~ /^@@(end[ \t]+)?group/)
15807
if (index(line, "@@") == 0) @{
15808
print line > curfile
15811
n = split(line, a, "@@")
15813
# if a[1] == "", means leading @@,
15814
# don't add one back in.
15816
for (i = 2; i <= n; i++) @{
15817
if (a[i] == "") @{ # was an @@@@
15823
print join(a, 1, n, SUBSEP) > curfile
15830
An important thing to note is the use of the @samp{>} redirection.
15831
Output done with @samp{>} opens the file only once; it stays open and
15832
subsequent output is appended to the file
15833
(@pxref{Redirection, , Redirecting Output of @code{print} and @code{printf}}).
15834
This allows us to easily mix program text and explanatory prose for the same
15835
sample source file (as has been done here!) without any hassle. The file is
15836
only closed when a new data file name is encountered, or at the end of the input file.
15839
Finally, the function @code{@w{unexpected_eof}} prints an appropriate
15840
error message and then exits.
15842
The @code{END} rule handles the final cleanup, closing the open file.
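The behavior of the @samp{>} redirection mentioned above is easy to check
with a short experiment. This is only a sketch; the variable names
@code{out} and @code{listing} and the scratch file name are assumptions
made for the example.

```shell
# Sketch: two print statements redirected with ">" to the same name.
out=${TMPDIR:-/tmp}/redirect-demo.$$
awk -v out="$out" 'BEGIN {
    print "first line"  > out    # opens (and truncates) the file
    print "second line" > out    # same open file: output is appended
    close(out)
}'
listing=$(cat "$out")
rm -f "$out"
```

After this runs, @code{listing} holds both lines, which shows that the
second @code{print} did not truncate the file again.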
15845
@c file eg/prog/extract.awk
15847
function unexpected_eof()
15849
printf("%s:%d: unexpected EOF or error\n", \
15850
FILENAME, FNR) > "/dev/stderr"
15862
@node Simple Sed, Igawk Program, Extract Program, Miscellaneous Programs
15863
@subsection A Simple Stream Editor
15865
@cindex @code{sed} utility
15866
The @code{sed} utility is a ``stream editor,'' a program that reads a
15867
stream of data, makes changes to it, and passes the modified data on.
15868
It is often used to make global changes to a large file, or to a stream
15869
of data generated by a pipeline of commands.
15871
While @code{sed} is a complicated program in its own right, its most common
15872
use is to perform global substitutions in the middle of a pipeline:
15875
command1 < orig.data | sed 's/old/new/g' | command2 > result
15878
Here, the @samp{s/old/new/g} tells @code{sed} to look for the regexp
15879
@samp{old} on each input line, and replace it with the text @samp{new},
15880
globally (i.e.@: all the occurrences on a line). This is similar to
15881
@code{awk}'s @code{gsub} function
15882
(@pxref{String Functions, , Built-in Functions for String Manipulation}).
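As a point of comparison, the same global substitution can be sketched
with @code{gsub} directly. The helper name @code{replace_all} is
hypothetical and is not part of the program presented in this section.

```shell
# replace_all PATTERN REPLACEMENT: filter standard input, replacing every
# match of PATTERN on each line (a rough analogue of sed 's/old/new/g').
replace_all() {
    awk -v pat="$1" -v repl="$2" '{ gsub(pat, repl); print }'
}
```

For example, @samp{echo "the old way" | replace_all old new} prints
@samp{the new way}. Note that @code{gsub} treats @samp{&} in the
replacement text as the matched text, and that values passed with
@samp{-v} have their backslash escape sequences processed.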
15884
The following program, @file{awksed.awk}, accepts at least two command line
15885
arguments: the pattern to look for and the text to replace it with. Any
15886
additional arguments are treated as data file names to process. If none
15887
are provided, the standard input is used.
15889
@cindex Brennan, Michael
15890
@cindex @code{awksed}
15891
@cindex simple stream editor
15892
@cindex stream editor, simple
15895
@c file eg/prog/awksed.awk
15896
# awksed.awk --- do s/foo/bar/g using just print
15897
# Thanks to Michael Brennan for the idea
15899
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
15904
print "usage: awksed pat repl [files...]" > "/dev/stderr"
15909
# validate arguments
15916
# don't use arguments as files
15917
ARGV[1] = ARGV[2] = ""
15920
# look ma, no hands!
15931
The program relies on @code{gawk}'s ability to have @code{RS} be a regexp
15932
and on the setting of @code{RT} to the actual text that terminated the
15933
record (@pxref{Records, ,How Input is Split into Records}).
15935
The idea is to have @code{RS} be the pattern to look for. @code{gawk}
15936
will automatically set @code{$0} to the text between matches of the pattern.
15937
This is text that we wish to keep, unmodified. Then, by setting @code{ORS}
15938
to the replacement text, a simple @code{print} statement will output the
15939
text we wish to keep, followed by the replacement text.
15941
There is one wrinkle to this scheme: what should happen if the last record
15942
doesn't end with text that matches @code{RS}? Using a @code{print}
15943
statement unconditionally prints the replacement text, which is not correct.
15945
However, if the file did not end in text that matches @code{RS}, @code{RT}
15946
will be set to the null string. In this case, we can print @code{$0} using @code{printf}
15948
(@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}).
15950
The @code{BEGIN} rule handles the setup, checking for the right number
15951
of arguments, and calling @code{usage} if there is a problem. Then it sets
15952
@code{RS} and @code{ORS} from the command line arguments, and sets
15953
@code{ARGV[1]} and @code{ARGV[2]} to the null string, so that they will
15954
not be treated as file names
15955
(@pxref{ARGC and ARGV, , Using @code{ARGC} and @code{ARGV}}).
15957
The @code{usage} function prints an error message and exits.
15959
Finally, the single rule handles the printing scheme outlined above,
15960
using @code{print} or @code{printf} as appropriate, depending upon the
15961
value of @code{RT}.
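The step of clearing @code{ARGV} elements, described above, works in any
POSIX @code{awk} and can be tried separately. In this sketch the function
name @code{prefix_lines} and the variable @code{tag} are invented for the
example.

```shell
# prefix_lines TAG [files...]: the first argument is data for the awk
# program itself, so it is cleared from ARGV before input is read.
prefix_lines() {
    awk 'BEGIN { tag = ARGV[1]; ARGV[1] = "" }
         { print tag ": " $0 }' "$@"
}
```

With @samp{printf 'a\nb\n' | prefix_lines T}, the null element is skipped,
no file names remain, and the standard input is read.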
15964
Exercise: compare the performance of this version with the more
15970
ARGV[1] = ARGV[2] = ""
15973
{ gsub(pat, repl); print }
15975
Exercise: what are the advantages and disadvantages of this version vs. sed?
15976
Advantage: egrep regexps
15978
Disadvantage: no & in replacement text
15983
@node Igawk Program, , Simple Sed, Miscellaneous Programs
15984
@subsection An Easy Way to Use Library Functions
15986
Using library functions in @code{awk} can be very beneficial. It
15987
encourages code re-use and the writing of general functions. Programs are
15988
smaller, and therefore clearer.
15989
However, using library functions is only easy when writing @code{awk}
15990
programs; it is painful when running them, requiring multiple @samp{-f}
15991
options. If @code{gawk} is unavailable, then so too is the @code{AWKPATH}
15992
environment variable and the ability to put @code{awk} functions into a
15993
library directory (@pxref{Options, ,Command Line Options}).
15995
It would be nice to be able to write programs like so:
15998
# library functions
15999
@@include getopt.awk
16005
while ((c = getopt(ARGC, ARGV, "a:b:cde")) != -1)
16011
The following program, @file{igawk.sh}, provides this service.
16012
It simulates @code{gawk}'s searching of the @code{AWKPATH} variable,
16013
and also allows @dfn{nested} includes; i.e.@: a file that has been included
16014
with @samp{@@include} can contain further @samp{@@include} statements.
16015
@code{igawk} makes an effort to include each file only once, so that nested
16016
includes don't accidentally include a library function twice.
16018
@code{igawk} should behave externally just like @code{gawk}. This means it
16019
should accept all of @code{gawk}'s command line arguments, including the
16020
ability to have multiple source files specified via @samp{-f}, and the
16021
ability to mix command line and library source files.
16023
The program is written using the POSIX Shell (@code{sh}) command language.
16024
The way the program works is as follows:
16028
Loop through the arguments, saving anything that doesn't represent
16029
@code{awk} source code for later, when the expanded program is run.
16032
For any arguments that do represent @code{awk} text, put the arguments into
16033
a temporary file that will be expanded. There are two cases.
16037
Literal text, provided with @samp{--source} or @samp{--source=}. This
16038
text is just echoed directly. The @code{echo} program will automatically
16039
supply a trailing newline.
16042
File names provided with @samp{-f}. We use a neat trick, and echo
16043
@samp{@@include @var{filename}} into the temporary file. Since the file
16044
inclusion program will work the way @code{gawk} does, this will get the text
16045
of the file included into the program at the correct point.
16049
Run an @code{awk} program (naturally) over the temporary file to expand
16050
@samp{@@include} statements. The expanded program is placed in a second temporary file.
16054
Run the expanded program with @code{gawk} and any other original command line
16055
arguments that the user supplied (such as the data file names).
16058
The initial part of the program turns on shell tracing if the first
16059
argument was @samp{debug}. Otherwise, a shell @code{trap} statement
16060
arranges to clean up any temporary files on program exit or upon an interrupt.
16063
@c 2e: For the temp file handling, go with Darrel's ig=${TMP:-/tmp}/igs.$$
16064
@c 2e: or something as similar as possible.
16066
The next part loops through all the command line arguments.
16067
There are several cases of interest.
16071
This ends the arguments to @code{igawk}. Anything else should be passed on
16072
to the user's @code{awk} program without being evaluated.
16075
This indicates that the next option is specific to @code{gawk}. To make
16076
argument processing easier, the @samp{-W} is prepended to the
16077
remaining arguments and the loop continues. (This is an @code{sh}
16078
programming trick. Don't worry about it if you are not familiar with @code{sh}.)
16083
These are saved and passed on to @code{gawk}.
16089
The file name is saved to the temporary file @file{/tmp/ig.s.$$} with an
16090
@samp{@@include} statement.
16091
The @code{sed} utility is used to remove the leading option part of the
16092
argument (e.g., @samp{--file=}).
16097
The source text is echoed into @file{/tmp/ig.s.$$}.
16105
@code{igawk} prints its version number, and runs @samp{gawk --version}
16106
to get the @code{gawk} version information, and then exits.
16109
If none of @samp{-f}, @samp{--file}, @samp{-Wfile}, @samp{--source},
16110
or @samp{-Wsource}, were supplied, then the first non-option argument
16111
should be the @code{awk} program. If there are no command line
16112
arguments left, @code{igawk} prints an error message and exits.
16113
Otherwise, the first argument is echoed into @file{/tmp/ig.s.$$}.
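Two of the cases above, the @samp{-W} trick and the use of @code{sed} to
strip an option prefix, can be exercised on their own. The sketch below is
not the @code{igawk} argument loop: the function name @code{show_files} is
made up, and it only prints what it recognizes.

```shell
# show_files prints the file name carried by each --file=NAME or
# -Wfile=NAME option and labels everything else.
show_files() {
    while [ $# -ne 0 ]
    do
        case $1 in
        -W)     shift
                set -- -W"$@"    # glue -W onto the next argument
                continue ;;      # (assumes an argument follows -W)
        -?file=*)                # matches -Wfile=NAME and --file=NAME
                printf '%s\n' "$1" | sed 's/-.file=//' ;;
        *)      printf 'other: %s\n' "$1" ;;
        esac
        shift
    done
}
```

Calling @samp{show_files -W file=a.awk --file=b.awk data} prints
@samp{a.awk}, @samp{b.awk}, and @samp{other: data}, one per line.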
16115
In any case, after the arguments have been processed,
16116
@file{/tmp/ig.s.$$} contains the complete text of the original @code{awk} program.
16119
The @samp{$$} in @code{sh} represents the current process ID number.
16120
It is often used in shell programs to generate unique temporary file
16121
names. This allows multiple users to run @code{igawk} without worrying
16122
that the temporary file names will clash.
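The combination of @samp{$$} and @code{trap} can be seen in a few lines.
This is a sketch under stated assumptions: the file name @file{demo.s.$$},
the variable @code{scratch}, and the @code{TMPDIR} fallback are choices
made for this example only.

```shell
# Sketch: a per-process scratch file that is removed when the script ends.
scratch=${TMPDIR:-/tmp}/demo.s.$$
# clean up on exit, hangup, interrupt, quit, termination
trap 'rm -f "$scratch"' 0 1 2 3 15
echo "BEGIN { print 42 }" > "$scratch"
```

Where the @code{mktemp} utility is available, it avoids the predictable
names that a process ID produces.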
16124
@cindex @code{sed} utility
16125
Here's the program:
16130
@c file eg/prog/igawk.sh
16133
# igawk --- like gawk but do @@include processing
16134
# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain
16137
if [ "$1" = debug ]
16142
# cleanup on exit, hangup, interrupt, quit, termination
16143
trap 'rm -f /tmp/ig.[se].$$' 0 1 2 3 15
16146
while [ $# -ne 0 ] # loop over arguments
16155
-[vF]) opts="$opts $1 '$2'"
16158
-[vF]*) opts="$opts '$1'" ;;
16160
-f) echo @@include "$2" >> /tmp/ig.s.$$
16163
-f*) f=`echo "$1" | sed 's/-f//'`
16164
echo @@include "$f" >> /tmp/ig.s.$$ ;;
16166
-?file=*) # -Wfile or --file
16167
f=`echo "$1" | sed 's/-.file=//'`
16168
echo @@include "$f" >> /tmp/ig.s.$$ ;;
16170
-?file) # get arg, $2
16171
echo @@include "$2" >> /tmp/ig.s.$$
16174
-?source=*) # -Wsource or --source
16175
t=`echo "$1" | sed 's/-.source=//'`
16176
echo "$t" >> /tmp/ig.s.$$ ;;
16178
-?source) # get arg, $2
16179
echo "$2" >> /tmp/ig.s.$$
16183
echo igawk: version 1.0 1>&2
16187
-[W-]*) opts="$opts '$1'" ;;
16194
if [ ! -s /tmp/ig.s.$$ ]
16198
echo igawk: no program! 1>&2
16201
echo "$1" > /tmp/ig.s.$$
16206
# at this point, /tmp/ig.s.$$ has the program
16211
The @code{awk} program to process @samp{@@include} directives reads through
16212
the program, one line at a time using @code{getline}
16213
(@pxref{Getline, ,Explicit Input with @code{getline}}).
16214
The input file names and @samp{@@include} statements are managed using a
16215
stack. As each @samp{@@include} is encountered, the current file name is
16216
``pushed'' onto the stack, and the file named in the @samp{@@include} becomes
16218
the current file name. As each file is finished, the stack is ``popped,''
16219
and the previous input file becomes the current input file again.
16220
The process is started by making the original file the first one on the stack.
16223
The @code{pathto} function does the work of finding the full path to a
16224
file. It simulates @code{gawk}'s behavior when searching the @code{AWKPATH}
16225
environment variable
16226
(@pxref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}).
16227
If a file name has a @samp{/} in it, no path search
16228
is done. Otherwise, the file name is concatenated with the name of each
16229
directory in the path, and an attempt is made to open the generated file
16230
name. The only way in @code{awk} to test if a file can be read is to go
16231
ahead and try to read it with @code{getline}; that is what @code{pathto}
16232
does. If the file can be read, it is closed, and the file name is returned.
16235
An alternative way to test for the file's existence would be to call
16236
@samp{system("test -r " t)}, which uses the @code{test} utility to
16237
see if the file exists and is readable. The disadvantage to this method
16238
is that it requires creating an extra process, and can thus be slightly slower.
16244
@c file eg/prog/igawk.sh
16246
# process @@include directives
16248
function pathto(file, i, t, junk)
16250
if (index(file, "/") != 0)
16253
for (i = 1; i <= ndirs; i++) @{
16254
t = (pathlist[i] "/" file)
16255
if ((getline junk < t) > 0) @{
16267
The main program is contained inside one @code{BEGIN} rule. The first thing it
16268
does is set up the @code{pathlist} array that @code{pathto} uses. After
16269
splitting the path on @samp{:}, null elements are replaced with @code{"."},
16270
which represents the current directory.
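The path handling just described can be tried outside of @code{igawk}.
The following is a self-contained sketch; the wrapper name
@code{find_in_path} and the variable @code{target} are hypothetical, and
the search path is passed as an argument instead of being taken from
@code{ENVIRON["AWKPATH"]}.

```shell
# find_in_path FILE PATH: print the first readable PATH-relative name.
find_in_path() {
    awk -v target="$1" -v path="$2" '
    function pathto(file,    i, t, junk)
    {
        if (index(file, "/") != 0)
            return file             # has a slash: no search
        for (i = 1; i <= ndirs; i++) {
            t = (pathlist[i] "/" file)
            if ((getline junk < t) > 0) {
                close(t)            # readable: stop looking
                return t
            }
        }
        return ""
    }
    BEGIN {
        ndirs = split(path, pathlist, ":")
        for (i = 1; i <= ndirs; i++)
            if (pathlist[i] == "")
                pathlist[i] = "."   # null element means current directory
        print pathto(target)
    }'
}
```

Because the test reads one line with @code{getline}, a zero-length file is
reported as not found; that is a limitation of this approach.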
16274
@c file eg/prog/igawk.sh
16276
path = ENVIRON["AWKPATH"]
16277
ndirs = split(path, pathlist, ":")
16278
for (i = 1; i <= ndirs; i++) @{
16279
if (pathlist[i] == "")
16286
The stack is initialized with @code{ARGV[1]}, which will be @file{/tmp/ig.s.$$}.
16287
The main loop comes next. Input lines are read in succession. Lines that
16288
do not start with @samp{@@include} are printed verbatim.
16290
If the line does start with @samp{@@include}, the file name is in @code{$2}.
16291
@code{pathto} is called to generate the full path. If it could not, then we
16292
print an error message and continue.
16294
The next thing to check is if the file has been included already. The
16295
@code{processed} array is indexed by the full file name of each included
16296
file, and it tracks this information for us. If the file has been
16297
seen, a warning message is printed. Otherwise, the new file name is
16298
pushed onto the stack and processing continues.
16300
Finally, when @code{getline} encounters the end of the input file, the file
16301
is closed and the stack is popped. When @code{stackptr} is less than zero,
16302
the program is done.
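The stack discipline described above can be seen in a reduced form.
This sketch is an assumption-laden simplification, not the @code{igawk}
code: the function name @code{expand_includes} is invented, no path search
is done (the name after @samp{@@include} is used as given), and a missing
file is silently skipped.

```shell
# expand_includes FILE: print FILE with @include lines replaced by the
# contents of the named files; each file is included at most once.
expand_includes() {
    awk '
    BEGIN {
        stackptr = 0
        input[stackptr] = ARGV[1]            # first file on the stack
        for (; stackptr >= 0; stackptr--) {
            while ((getline < input[stackptr]) > 0) {
                if (tolower($1) != "@include") {
                    print
                    continue
                }
                fpath = $2                   # no path search in this sketch
                if (! (fpath in processed)) {
                    processed[fpath] = input[stackptr]
                    input[++stackptr] = fpath    # push; read it next
                } else
                    print fpath, "already included" > "/dev/stderr"
            }
            close(input[stackptr])           # finished: pop the stack
        }
    }' "$1"
}
```

When an included file ends, the loop pops back to the including file,
which is still open, so reading resumes on the line after the
@samp{@@include}.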
16306
@c file eg/prog/igawk.sh
16308
input[stackptr] = ARGV[1] # ARGV[1] is first file
16310
for (; stackptr >= 0; stackptr--) @{
16311
while ((getline < input[stackptr]) > 0) @{
16312
if (tolower($1) != "@@include") @{
16317
if (fpath == "") @{
16318
printf("igawk:%s:%d: cannot find %s\n", \
16319
input[stackptr], FNR, $2) > "/dev/stderr"
16323
if (! (fpath in processed)) @{
16324
processed[fpath] = input[stackptr]
16325
input[++stackptr] = fpath
16327
print $2, "included in", input[stackptr], \
16328
"already included in", \
16329
processed[fpath] > "/dev/stderr"
16333
close(input[stackptr])
16335
@}' /tmp/ig.s.$$ > /tmp/ig.e.$$
16341
The last step is to call @code{gawk} with the expanded program and the original
16342
options and command line arguments that the user supplied. @code{gawk}'s
16343
exit status is passed back on to @code{igawk}'s calling program.
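The reason for @code{eval} in this last step is that the saved options
carry their own single quotes, which must be parsed a second time. The
following sketch illustrates the idea with plain @code{awk}; the variable
@code{msg} and the function name @code{run_with_opts} are assumptions made
for the example.

```shell
# Build an option string the way the argument loop does: the value is
# wrapped in single quotes so that it survives until eval re-parses it.
opts=""
opts="$opts -v 'msg=hello world'"

run_with_opts() {
    # eval re-parses the quotes stored inside $opts; its exit status
    # is the exit status of the awk command it runs.
    eval awk $opts "'BEGIN { print msg }'"
}
```

Running @code{run_with_opts} prints @samp{hello world}. Without
@code{eval}, the quote characters would be passed to @code{awk} literally
and the value would be split at the space.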
16345
@c this causes more problems than it solves, so leave it out.
16347
The special file @file{/dev/null} is passed as a data file to @code{gawk}
16348
to handle an interesting case. Suppose that the user's program only has
16349
a @code{BEGIN} rule, and there are no data files to read. The program should exit without reading any data
16350
files. However, suppose that an included library file defines an @code{END}
16351
rule of its own. In this case, @code{gawk} will hang, reading standard
16352
input. In order to avoid this, @file{/dev/null} is explicitly added to the
16353
command line. Reading from @file{/dev/null} always returns an immediate
16354
end of file indication.
16356
@c Hmm. Add /dev/null if $# is 0? Still messes up ARGV. Sigh.
16361
@c file eg/prog/igawk.sh
16362
eval gawk -f /tmp/ig.e.$$ $opts -- "$@@"
16369
This version of @code{igawk} represents my third attempt at this program.
16370
There are three key simplifications that made the program work better.
16374
Using @samp{@@include} even for the files named with @samp{-f} makes building
16375
the initial collected @code{awk} program much simpler; all the
16376
@samp{@@include} processing can be done once.
16379
The @code{pathto} function doesn't try to save the line read with
16380
@code{getline} when testing for the file's accessibility. Trying to save
16381
this line for use with the main program complicates things considerably.
16382
@c what problem does this engender though - exercise
16383
@c answer, reading from "-" or /dev/stdin
16386
Using a @code{getline} loop in the @code{BEGIN} rule does it all in one
16387
place. It is not necessary to call out to a separate loop for processing
16388
nested @samp{@@include} statements.
16391
Also, this program illustrates that it is often worthwhile to combine
16392
@code{sh} and @code{awk} programming. You can usually accomplish
16393
quite a lot, without having to resort to low-level programming in C or C++, and it
16394
is frequently easier to do certain kinds of string and argument manipulation
16395
using the shell than it is in @code{awk}.
16397
Finally, @code{igawk} shows that it is not always necessary to add new
16398
features to a program; they can often be layered on top. With @code{igawk},
16399
there is no real reason to build @samp{@@include} processing into
16400
@code{gawk} itself.
16402
As an additional example of this, consider the idea of having two
16403
files in a directory in the search path.
16407
This file would contain a set of default library functions, such
16408
as @code{getopt} and @code{assert}.
16411
This file would contain library functions that are specific to a site or
16412
installation, i.e.@: locally developed functions.
16413
Having a separate file allows @file{default.awk} to change with
16414
new @code{gawk} releases, without requiring the system administrator to
16415
update it each time by adding the local functions.
16419
@c Karl Berry, karl@ileaf.com, 10/95
16420
One user suggested that @code{gawk} be modified to automatically read these files
16421
upon startup. Instead, it would be very simple to modify @code{igawk}
16422
to do this. Since @code{igawk} can process nested @samp{@@include}
16423
directives, @file{default.awk} could simply contain @samp{@@include}
16424
statements for the desired library functions.
16426
@c Exercise: make this change
16428
@node Language History, Gawk Summary, Sample Programs, Top
16429
@chapter The Evolution of the @code{awk} Language
16431
This @value{DOCUMENT} describes the GNU implementation of @code{awk}, which follows
16432
the POSIX specification. Many @code{awk} users are only familiar
16433
with the original @code{awk} implementation in Version 7 Unix.
16434
(This implementation was the basis for @code{awk} in Berkeley Unix,
16435
through 4.3--Reno. The 4.4 release of Berkeley Unix uses @code{gawk} 2.15.2
16436
for its version of @code{awk}.) This chapter briefly describes the
16437
evolution of the @code{awk} language, with cross references to other parts
16438
of the @value{DOCUMENT} where you can find more information.
16441
* V7/SVR3.1:: The major changes between V7 and System V
16443
* SVR4:: Minor changes between System V Releases 3.1
16445
* POSIX:: New features from the POSIX standard.
16446
* BTL:: New features from the AT&T Bell Laboratories
16447
version of @code{awk}.
16448
* POSIX/GNU:: The extensions in @code{gawk} not in POSIX
16452
@node V7/SVR3.1, SVR4, Language History, Language History
16453
@section Major Changes between V7 and SVR3.1
16455
The @code{awk} language evolved considerably between the release of
16456
Version 7 Unix (1978) and the new version first made generally available in
16457
System V Release 3.1 (1987). This section summarizes the changes, with
16458
cross-references to further details.
16462
The requirement for @samp{;} to separate rules on a line
16463
(@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}).
16466
User-defined functions, and the @code{return} statement
16467
(@pxref{User-defined, ,User-defined Functions}).
16470
The @code{delete} statement (@pxref{Delete, ,The @code{delete} Statement}).
16473
The @code{do}-@code{while} statement
16474
(@pxref{Do Statement, ,The @code{do}-@code{while} Statement}).
16477
The built-in functions @code{atan2}, @code{cos}, @code{sin}, @code{rand} and
16478
@code{srand} (@pxref{Numeric Functions, ,Numeric Built-in Functions}).
16481
The built-in functions @code{gsub}, @code{sub}, and @code{match}
16482
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
16485
The built-in functions @code{close} and @code{system}
16486
(@pxref{I/O Functions, ,Built-in Functions for Input/Output}).
16489
The @code{ARGC}, @code{ARGV}, @code{FNR}, @code{RLENGTH}, @code{RSTART},
16490
and @code{SUBSEP} built-in variables (@pxref{Built-in Variables}).
16493
The conditional expression using the ternary operator @samp{?:}
16494
(@pxref{Conditional Exp, ,Conditional Expressions}).
16497
The exponentiation operator @samp{^}
16498
(@pxref{Arithmetic Ops, ,Arithmetic Operators}) and its assignment operator
16499
form @samp{^=} (@pxref{Assignment Ops, ,Assignment Expressions}).
16502
C-compatible operator precedence, which breaks some old @code{awk}
16503
programs (@pxref{Precedence, ,Operator Precedence (How Operators Nest)}).
16506
Regexps as the value of @code{FS}
16507
(@pxref{Field Separators, ,Specifying How Fields are Separated}), and as the
16508
third argument to the @code{split} function
16509
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
16512
Dynamic regexps as operands of the @samp{~} and @samp{!~} operators
16513
(@pxref{Regexp Usage, ,How to Use Regular Expressions}).
16516
The escape sequences @samp{\b}, @samp{\f}, and @samp{\r}
16517
(@pxref{Escape Sequences}).
16518
(Some vendors have updated their old versions of @code{awk} to
16519
recognize @samp{\r}, @samp{\b}, and @samp{\f}, but this is not
16520
something you can rely on.)
16523
Redirection of input for the @code{getline} function
16524
(@pxref{Getline, ,Explicit Input with @code{getline}}).
16527
Multiple @code{BEGIN} and @code{END} rules
16528
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}).
16531
Multi-dimensional arrays
16532
(@pxref{Multi-dimensional, ,Multi-dimensional Arrays}).
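A short program can exercise several of the features listed above at once.
This is only an illustrative sketch; the names @code{new_awk_demo},
@code{cube}, and @code{kind} are invented for the example.

```shell
# new_awk_demo uses a user-defined function, the ^ operator,
# a do-while loop, and a conditional expression.
new_awk_demo() {
    awk '
    function cube(x) { return x ^ 3 }
    BEGIN {
        i = 1
        do {
            kind = (i % 2 == 0) ? "even" : "odd"
            printf "%d %s %d\n", i, kind, cube(i)
            i++
        } while (i <= 3)
    }'
}
```

None of these constructs is accepted by the original Version 7
@code{awk}.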
16535
@node SVR4, POSIX, V7/SVR3.1, Language History
16536
@section Changes between SVR3.1 and SVR4
16538
@cindex @code{awk} language, V.4 version
16539
The System V Release 4 version of Unix @code{awk} added these features
16540
(some of which originated in @code{gawk}):
16544
The @code{ENVIRON} variable (@pxref{Built-in Variables}).
16547
Multiple @samp{-f} options on the command line
16548
(@pxref{Options, ,Command Line Options}).
16551
The @samp{-v} option for assigning variables before program execution begins
16552
(@pxref{Options, ,Command Line Options}).
16555
The @samp{--} option for terminating command line options.
16558
The @samp{\a}, @samp{\v}, and @samp{\x} escape sequences
16559
(@pxref{Escape Sequences}).
16562
A defined return value for the @code{srand} built-in function
16563
(@pxref{Numeric Functions, ,Numeric Built-in Functions}).
16566
The @code{toupper} and @code{tolower} built-in string functions
16567
for case translation
16568
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
16571
A cleaner specification for the @samp{%c} format-control letter in the
16572
@code{printf} function
16573
(@pxref{Control Letters, ,Format-Control Letters}).
16576
The ability to dynamically pass the field width and precision (@code{"%*.*d"})
16577
in the argument list of the @code{printf} function
16578
(@pxref{Control Letters, ,Format-Control Letters}).
16581
The use of regexp constants such as @code{/foo/} as expressions, where
16582
they are equivalent to using the matching operator, as in @samp{$0 ~ /foo/}
16583
(@pxref{Using Constant Regexps, ,Using Regular Expression Constants}).
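Three of the additions listed above fit in a one-line program. In this
sketch, the environment variable @code{GREETING}, the variable
@code{name}, and the function name @code{svr4_demo} are assumptions made
for illustration.

```shell
# svr4_demo combines ENVIRON, the -v option, and toupper.
svr4_demo() {
    GREETING=hello awk -v name=world 'BEGIN {
        print toupper(ENVIRON["GREETING"]) ", " name
    }'
}
```

Running @code{svr4_demo} prints @samp{HELLO, world}.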
16586
@node POSIX, BTL, SVR4, Language History
16587
@section Changes between SVR4 and POSIX @code{awk}
16589
The POSIX Command Language and Utilities standard for @code{awk}
16590
introduced the following changes into the language:
16594
The use of @samp{-W} for implementation-specific options.
16597
The use of @code{CONVFMT} for controlling the conversion of numbers
16598
to strings (@pxref{Conversion, ,Conversion of Strings and Numbers}).
16601
The concept of a numeric string, and tighter comparison rules to go
16602
with it (@pxref{Typing and Comparison, ,Variable Typing and Comparison Expressions}).
16605
More complete documentation of many of the previously undocumented
16606
features of the language.
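The role of @code{CONVFMT} can be shown with a small sketch. The function
name @code{convfmt_demo} and the variables @code{x} and @code{s} are
invented for the example.

```shell
# convfmt_demo: CONVFMT governs number-to-string conversion,
# while print continues to use OFMT.
convfmt_demo() {
    awk 'BEGIN {
        CONVFMT = "%.2f"
        x = 3.14159
        s = x ""        # conversion by concatenation uses CONVFMT
        print s
        print x         # print uses OFMT, which is unchanged
    }'
}
```

In a POSIX-conforming @code{awk}, this prints @samp{3.14} followed by
@samp{3.14159}.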
16609
The following common extensions are not permitted by the POSIX standard:
16612
@c IMPORTANT! Keep this list in sync with the one in node Options
16616
@code{\x} escape sequences are not recognized
16617
(@pxref{Escape Sequences}).
16620
The synonym @code{func} for the keyword @code{function} is not
16621
recognized (@pxref{Definition Syntax, ,Function Definition Syntax}).
16624
The operators @samp{**} and @samp{**=} cannot be used in
16625
place of @samp{^} and @samp{^=} (@pxref{Arithmetic Ops, ,Arithmetic Operators},
16626
and also @pxref{Assignment Ops, ,Assignment Expressions}).
16629
Specifying @samp{-Ft} on the command line does not set the value
16630
of @code{FS} to be a single tab character
16631
(@pxref{Field Separators, ,Specifying How Fields are Separated}).
16634
The @code{fflush} built-in function is not supported
16635
(@pxref{I/O Functions, , Built-in Functions for Input/Output}).
16638
@node BTL, POSIX/GNU, POSIX, Language History
16639
@section Extensions in the AT&T Bell Laboratories @code{awk}
16641
@cindex Kernighan, Brian
16642
Brian Kernighan, one of the original designers of Unix @code{awk},
16643
has made his version available via anonymous @code{ftp}
16644
(@pxref{Other Versions, ,Other Freely Available @code{awk} Implementations}).
16645
This section describes extensions in his version of @code{awk} that are
16646
not in POSIX @code{awk}.
16650
The @samp{-mf=@var{NNN}} and @samp{-mr=@var{NNN}} command line options
16651
to set the maximum number of fields, and the maximum
16652
record size, respectively
16653
(@pxref{Options, ,Command Line Options}).
16656
The @code{fflush} built-in function for flushing buffered output
16657
(@pxref{I/O Functions, ,Built-in Functions for Input/Output}).
16661
The @code{SYMTAB} array, which allows access to the internal symbol
16662
table of @code{awk}. This feature is not documented, largely because
16663
it is somewhat shakily implemented. For instance, you cannot access arrays
16664
or array elements through it.
16668
@node POSIX/GNU, , BTL, Language History
16669
@section Extensions in @code{gawk} Not in POSIX @code{awk}
16671
@cindex compatibility mode
16672
The GNU implementation, @code{gawk}, adds a number of features.
16673
This section lists them in the order they were added to @code{gawk}.
16674
They can all be disabled with either the @samp{--traditional} or
16675
@samp{--posix} options
16676
(@pxref{Options, ,Command Line Options}).
16678
Version 2.10 of @code{gawk} introduced these features:
16682
The @code{AWKPATH} environment variable for specifying a path search for
16683
the @samp{-f} command line option
16684
(@pxref{Options, ,Command Line Options}).
16687
The @code{IGNORECASE} variable and its effects
16688
(@pxref{Case-sensitivity, ,Case-sensitivity in Matching}).
16691
The @file{/dev/stdin}, @file{/dev/stdout}, @file{/dev/stderr}, and
16692
@file{/dev/fd/@var{n}} file name interpretation
16693
(@pxref{Special Files, ,Special File Names in @code{gawk}}).
16696
Version 2.13 of @code{gawk} introduced these features:
16700
The @code{FIELDWIDTHS} variable and its effects
16701
(@pxref{Constant Size, ,Reading Fixed-width Data}).
16704
The @code{systime} and @code{strftime} built-in functions for obtaining
16705
and printing time stamps
16706
(@pxref{Time Functions, ,Functions for Dealing with Time Stamps}).
16709
The @samp{-W lint} option to provide source code and run time error
16710
and portability checking
16711
(@pxref{Options, ,Command Line Options}).
16714
The @samp{-W compat} option to turn off these extensions
16715
(@pxref{Options, ,Command Line Options}).
16718
The @samp{-W posix} option for full POSIX compliance
16719
(@pxref{Options, ,Command Line Options}).
16722
Version 2.14 of @code{gawk} introduced these features:
16726
The @code{next file} statement for skipping to the next data file
16727
(@pxref{Nextfile Statement, ,The @code{nextfile} Statement}).
16730
Version 2.15 of @code{gawk} introduced these features:
16734
The @code{ARGIND} variable, which tracks the movement of @code{FILENAME}
16735
through @code{ARGV} (@pxref{Built-in Variables}).
16738
The @code{ERRNO} variable, which contains the system error message when
16739
@code{getline} returns @minus{}1, or when @code{close} fails
16740
(@pxref{Built-in Variables}).
16743
The ability to use GNU-style long named options that start with @samp{--}
16744
(@pxref{Options, ,Command Line Options}).
16747
The @samp{--source} option for mixing command line and library source code
16749
(@pxref{Options, ,Command Line Options}).
16752
The @file{/dev/pid}, @file{/dev/ppid}, @file{/dev/pgrpid}, and
16753
@file{/dev/user} file name interpretation
16754
(@pxref{Special Files, ,Special File Names in @code{gawk}}).
16757
Version 3.0 of @code{gawk} introduced these features:
16761
The @code{next file} statement became @code{nextfile}
16762
(@pxref{Nextfile Statement, ,The @code{nextfile} Statement}).
16765
The @samp{--lint-old} option to
16766
warn about constructs that are not available in
16767
the original Version 7 Unix version of @code{awk}
16768
(@pxref{V7/SVR3.1, , Major Changes between V7 and SVR3.1}).
16771
The @samp{--traditional} option was added as a better name for
16772
@samp{--compat} (@pxref{Options, ,Command Line Options}).
16775
The ability for @code{FS} to be a null string, and for the third
16776
argument to @code{split} to be the null string
16777
(@pxref{Single Character Fields, , Making Each Character a Separate Field}).
16780
The ability for @code{RS} to be a regexp
16781
(@pxref{Records, , How Input is Split into Records}).
16784
The @code{RT} variable
16785
(@pxref{Records, , How Input is Split into Records}).
16788
The @code{gensub} function for more powerful text manipulation
16789
(@pxref{String Functions, , Built-in Functions for String Manipulation}).
16792
The @code{strftime} function acquired a default time format,
16793
allowing it to be called with no arguments
16794
(@pxref{Time Functions, , Functions for Dealing with Time Stamps}).
16797
Full support for both POSIX and GNU regexps
16798
(@pxref{Regexp, , Regular Expressions}).
16801
The @samp{--re-interval} option to provide interval expressions in regexps
16802
(@pxref{Regexp Operators, , Regular Expression Operators}).
16805
@code{IGNORECASE} changed, now applying to string comparison as well
16806
as regexp operations
16807
(@pxref{Case-sensitivity, ,Case-sensitivity in Matching}).
16810
The @samp{-m} option and the @code{fflush} function from the
16811
Bell Labs research version of @code{awk}
16812
(@pxref{Options, ,Command Line Options}; also
16813
@pxref{I/O Functions, ,Built-in Functions for Input/Output}).
16816
The use of GNU Autoconf to control the configuration process
16817
(@pxref{Quick Installation, , Compiling @code{gawk} for Unix}).
16821
(@pxref{Amiga Installation, ,Installing @code{gawk} on an Amiga}).
16823
@c XXX ADD MORE STUFF HERE
16827
@node Gawk Summary, Installation, Language History, Top
16828
@appendix @code{gawk} Summary
16830
This appendix provides a brief summary of the @code{gawk} command line and the
16831
@code{awk} language. It is designed to serve as a ``quick reference.'' It is
16832
therefore terse, but complete.
16835
* Command Line Summary:: Recapitulation of the command line.
16836
* Language Summary:: A terse review of the language.
16837
* Variables/Fields:: Variables, fields, and arrays.
16838
* Rules Summary:: Patterns and Actions, and their component parts.
16840
* Actions Summary:: Quick overview of actions.
16841
* Functions Summary:: Defining and calling functions.
16842
* Historical Features:: Some undocumented but supported ``features''.
16845
@node Command Line Summary, Language Summary, Gawk Summary, Gawk Summary
16846
@appendixsec Command Line Options Summary
16848
The command line consists of options to @code{gawk} itself, the
16849
@code{awk} program text (if not supplied via the @samp{-f} option), and
16850
values to be made available in the @code{ARGC} and @code{ARGV}
16851
predefined @code{awk} variables:
16854
gawk @r{[@var{POSIX or GNU style options}]} -f @var{source-file} @r{[@code{--}]} @var{file} @dots{}
16855
gawk @r{[@var{POSIX or GNU style options}]} @r{[@code{--}]} '@var{program}' @var{file} @dots{}
16858
The options that @code{gawk} accepts are:
16862
@itemx --field-separator @var{fs}
16863
Use @var{fs} for the input field separator (the value of the @code{FS}
16864
predefined variable).
16866
@item -f @var{program-file}
16867
@itemx --file @var{program-file}
16868
Read the @code{awk} program source from the file @var{program-file}, instead
16869
of from the first command line argument.
16871
@item -mf=@var{NNN}
16872
@itemx -mr=@var{NNN}
16873
The @samp{f} flag sets
16874
the maximum number of fields, and the @samp{r} flag sets the maximum
16875
record size. These options are ignored by @code{gawk}, since @code{gawk}
16876
has no predefined limits; they are only for compatibility with the
16877
Bell Labs research version of Unix @code{awk}.
16879
@item -v @var{var}=@var{val}
16880
@itemx --assign @var{var}=@var{val}
16881
Assign the variable @var{var} the value @var{val} before program execution begins.
16884
@item -W traditional
16886
@itemx --traditional
16888
Use compatibility mode, in which @code{gawk} extensions are turned off.
16892
@itemx -W copyright
16895
Print the short version of the General Public License on the error
16896
output. This option may disappear in a future version of @code{gawk}.
16902
Print a relatively short summary of the available options on the error output.
16906
Give warnings about dubious or non-portable @code{awk} constructs.
16910
Warn about constructs that are not available in
16911
the original Version 7 Unix version of @code{awk}.
16915
Use POSIX compatibility mode, in which @code{gawk} extensions
16916
are turned off and additional restrictions apply.
16918
@item -W re-interval
16919
@itemx --re-interval
16920
Allow interval expressions
16921
(@pxref{Regexp Operators, , Regular Expression Operators}),
even if @samp{--traditional} has been specified.
16924
@item -W source=@var{program-text}
16925
@itemx --source @var{program-text}
16926
Use @var{program-text} as @code{awk} program source code. This option allows
16927
mixing command line source code with source code from files, and is
16928
particularly useful for mixing command line programs with library functions.
16932
Print version information for this particular copy of @code{gawk} on the error output.
16936
Signal the end of options. This is useful to allow further arguments to the
16937
@code{awk} program itself to start with a @samp{-}. This is mainly for
16938
consistency with POSIX argument parsing conventions.
16941
Any other options are flagged as invalid, but are otherwise ignored.
16942
@xref{Options, ,Command Line Options}, for more details.
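
As a small illustrative sketch of the @samp{-F} and @samp{-v} options
summarized above, the following assumes that a POSIX-compatible @code{awk}
is installed as @file{awk}. The shell variable @code{out} and the
@code{awk} variable @code{n} are names chosen only for this example.

```shell
# Split records on ':' instead of whitespace (-F), and preset the
# variable n before the program starts running (-v).
out=$(echo 'a:b:c' | awk -F: -v n=2 '{ print $n }')
echo "$out"
```

Because @code{n} is set with @samp{-v}, it is already available when the
first record is read, so @code{$n} selects the second field.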
16944
@node Language Summary, Variables/Fields, Command Line Summary, Gawk Summary
16945
@appendixsec Language Summary
16947
An @code{awk} program consists of a sequence of zero or more pattern-action
16948
statements and optional function definitions. One or the other of the
16949
pattern and action may be omitted.
16952
@var{pattern} @{ @var{action statements} @}
16954
@{ @var{action statements} @}
16956
function @var{name}(@var{parameter list}) @{ @var{action statements} @}
16959
@code{gawk} first reads the program source from the
16960
@var{program-file}(s), if specified, or from the first non-option
16961
argument on the command line. The @samp{-f} option may be used multiple
16962
times on the command line. @code{gawk} reads the program text from all
16963
the @var{program-file} files, effectively concatenating them in the
16964
order they are specified. This is useful for building libraries of
16965
@code{awk} functions, without having to include them in each new
16966
@code{awk} program that uses them. To use a library function in a file
16967
from a program typed in on the command line, specify
16968
@samp{--source '@var{program}'}, and type your program in between the single quote marks.
16970
@xref{Options, ,Command Line Options}.
16972
The environment variable @code{AWKPATH} specifies a search path to use
16973
when finding source files named with the @samp{-f} option. The default
16975
@samp{.:/usr/local/share/awk}@footnote{The path may use a directory
16976
other than @file{/usr/local/share/awk}, depending upon how @code{gawk}
16977
was built and installed.} is used if @code{AWKPATH} is not set.
16978
If a file name given to the @samp{-f} option contains a @samp{/} character,
16979
no path search is performed.
16980
@xref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}.
16982
@code{gawk} compiles the program into an internal form, and then proceeds to
16983
read each file named in the @code{ARGV} array.
16984
The initial values of @code{ARGV} come from the command line arguments.
16985
If there are no files named
16986
on the command line, @code{gawk} reads the standard input.
16988
If a ``file'' named on the command line has the form
16989
@samp{@var{var}=@var{val}}, it is treated as a variable assignment: the
16990
variable @var{var} is assigned the value @var{val}.
16991
If any of the files have a value that is the null string, that
16992
element in the list is skipped.
16994
For each record in the input, @code{gawk} tests to see if it matches any
16995
@var{pattern} in the @code{awk} program. For each pattern that the record
16996
matches, the associated @var{action} is executed.
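
The treatment of @samp{@var{var}=@var{val}} operands described above can
be sketched as follows. This is only an illustration: the scratch file
created with @code{mktemp} and the names @code{data}, @code{out}, and
@code{n} are assumptions made for the example, and a POSIX-compatible
@code{awk} is assumed.

```shell
# Create a scratch data file holding a single record.
data=$(mktemp)
echo 'x' > "$data"

# 'n=1' and 'n=2' have the form var=val, so awk performs them as
# assignments when it reaches them in ARGV instead of opening files.
out=$(awk '{ print n, $0 }' n=1 "$data" n=2 "$data")
echo "$out"

rm -f "$data"
```

The same file is read twice, and each pass sees the value of @code{n}
assigned just before it.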
16998
@node Variables/Fields, Rules Summary, Language Summary, Gawk Summary
16999
@appendixsec Variables and Fields
17001
@code{awk} variables are not declared; they come into existence when they are
17002
first used. Their values are either floating-point numbers or strings.
17003
@code{awk} also has one-dimensional arrays; multiple-dimensional arrays
17004
may be simulated. There are several predefined variables that
17005
@code{awk} sets as a program runs; these are summarized below.
17008
* Fields Summary:: Input field splitting.
17009
* Built-in Summary:: @code{awk}'s built-in variables.
17010
* Arrays Summary:: Using arrays.
17011
* Data Type Summary:: Values in @code{awk} are numbers or strings.
17014
@node Fields Summary, Built-in Summary, Variables/Fields, Variables/Fields
17015
@appendixsubsec Fields
17017
As each input line is read, @code{gawk} splits the line into
17018
@var{fields}, using the value of the @code{FS} variable as the field
17019
separator. If @code{FS} is a single character, fields are separated by
17020
that character. Otherwise, @code{FS} is expected to be a full regular
17021
expression. In the special case that @code{FS} is a single space,
17022
fields are separated by runs of spaces and/or tabs.
17023
If @code{FS} is the null string (@code{""}), then each individual
17024
character in the record becomes a separate field.
17025
Note that the value
17026
of @code{IGNORECASE} (@pxref{Case-sensitivity, ,Case-sensitivity in Matching})
17027
also affects how fields are split when @code{FS} is a regular expression.
17029
Each field in the input line may be referenced by its position, @code{$1},
17030
@code{$2}, and so on. @code{$0} is the whole line. The value of a field may
17031
be assigned to as well. Field numbers need not be constants:
17039
prints the fifth field in the input line. The variable @code{NF} is set to
17040
the total number of fields in the input line.
17042
References to non-existent fields (i.e.@: fields after @code{$NF}) return
17043
the null string. However, assigning to a non-existent field (e.g.,
17044
@code{$(NF+2) = 5}) increases the value of @code{NF}, creates any
17045
intervening fields with the null string as their value, and causes the
17046
value of @code{$0} to be recomputed, with the fields being separated by
17047
the value of @code{OFS}.
17048
@xref{Reading Files, ,Reading Input Files}.
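
The effect of assigning to a non-existent field can be seen in a short
sketch such as the one below. It assumes a POSIX-compatible @code{awk};
the choice of @samp{-} for @code{OFS} and the shell variable @code{out}
are made only to keep the rebuilt record easy to read.

```shell
# Assigning to $(NF+2) grows the record from 3 to 5 fields; field 4 is
# created empty, and $0 is rebuilt with OFS ("-") between the fields.
out=$(echo 'a b c' | awk 'BEGIN { OFS = "-" } { $(NF + 2) = 5; print NF; print }')
echo "$out"
```

The two adjacent separators in the second output line mark the empty
intervening field.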
17050
@node Built-in Summary, Arrays Summary, Fields Summary, Variables/Fields
17051
@appendixsubsec Built-in Variables
17053
@code{gawk}'s built-in variables are:
17057
The number of elements in @code{ARGV}. See below for what is actually
17058
included in @code{ARGV}.
17061
The index in @code{ARGV} of the current file being processed.
17062
When @code{gawk} is processing the input data files,
17063
it is always true that @samp{FILENAME == ARGV[ARGIND]}.
17066
The array of command line arguments. The array is indexed from zero to
17067
@code{ARGC} @minus{} 1. Dynamically changing @code{ARGC} and
17068
the contents of @code{ARGV}
17069
can control the files used for data. A null-valued element in
17070
@code{ARGV} is ignored. @code{ARGV} does not include the options to
17071
@code{awk} or the text of the @code{awk} program itself.
17074
The conversion format to use when converting numbers to strings.
17077
A space separated list of numbers describing the fixed-width input data.
17080
An array of environment variable values. The array
17081
is indexed by variable name, each element being the value of that
17082
variable. Thus, the environment variable @code{HOME} is
17083
@code{ENVIRON["HOME"]}. One possible value might be @file{/home/arnold}.
17085
Changing this array does not affect the environment seen by programs
17086
which @code{gawk} spawns via redirection or the @code{system} function.
17087
(This may change in a future version of @code{gawk}.)
17089
Some operating systems do not have environment variables.
17090
The @code{ENVIRON} array is empty when running on these systems.
17093
The system error message when an error occurs using @code{getline}
or @code{close}.
17097
The name of the current input file. If no files are specified on the command
17098
line, the value of @code{FILENAME} is the null string.
17101
The input record number in the current input file.
17104
The input field separator, a space by default.
17107
The case-sensitivity flag for string comparisons and regular expression
17108
operations. If @code{IGNORECASE} has a non-zero value, then pattern
17109
matching in rules, record separating with @code{RS}, field splitting
17110
with @code{FS}, regular expression matching with @samp{~} and
17111
@samp{!~}, and the @code{gensub}, @code{gsub}, @code{index},
17112
@code{match}, @code{split} and @code{sub} built-in functions all
17113
ignore case when doing regular expression operations, and all string
17114
comparisons are done ignoring case.
17117
The number of fields in the current input record.
17120
The total number of input records seen so far.
17123
The output format for numbers for the @code{print} statement,
17124
@code{"%.6g"} by default.
17127
The output field separator, a space by default.
17130
The output record separator, by default a newline.
17133
The input record separator, by default a newline.
17134
If @code{RS} is set to the null string, then records are separated by
17135
blank lines. When @code{RS} is set to the null string, then the newline
17136
character always acts as a field separator, in addition to whatever value
17137
@code{FS} may have. If @code{RS} is set to a multi-character
17138
string, it denotes a regexp; input text matching the regexp
separates records.
17142
The input text that matched the text denoted by @code{RS},
17143
the record separator.
17146
The index of the first character last matched by @code{match}; zero if no match.
17149
The length of the string last matched by @code{match}; @minus{}1 if no match.
17152
The string used to separate multiple subscripts in array elements, by
17153
default @code{"\034"}.
17156
@xref{Built-in Variables}, for more information.
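
A few of these variables can be observed together in a minimal sketch
like the following, which assumes a POSIX-compatible @code{awk}. Only
variables defined by POSIX are used; the shell variable @code{out} is an
illustrative name.

```shell
# NR counts records, NF counts fields in the current record, and OFS is
# placed between print arguments.  The assignment $1 = $1 forces $0 to
# be rebuilt using OFS.
out=$(printf 'a b\nc d e\n' | awk 'BEGIN { OFS = "-" } { $1 = $1; print NR, NF, $0 }')
echo "$out"
```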
17158
@node Arrays Summary, Data Type Summary, Built-in Summary, Variables/Fields
17159
@appendixsubsec Arrays
17161
Arrays are subscripted with an expression between square brackets
17162
(@samp{[} and @samp{]}). Array subscripts are @emph{always} strings;
17163
numbers are converted to strings as necessary, following the standard
conversion rules
17165
(@pxref{Conversion, ,Conversion of Strings and Numbers}).
17167
If you use multiple expressions separated by commas inside the square
17168
brackets, then the array subscript is a string consisting of the
17169
concatenation of the individual subscript values, converted to strings,
17170
separated by the subscript separator (the value of @code{SUBSEP}).
17172
The special operator @code{in} may be used in a conditional context
17173
to see if an array has an index consisting of a particular value.
17180
If the array has multiple subscripts, use @samp{(i, j, @dots{}) in @var{array}}
17181
to test for existence of an element.
17183
The @code{in} construct may also be used in a @code{for} loop to iterate
17184
over all the elements of an array.
17185
@xref{Scanning an Array, ,Scanning All Elements of an Array}.
17187
You can remove an element from an array using the @code{delete} statement.
17189
You can clear an entire array using @samp{delete @var{array}}.
17191
@xref{Arrays, ,Arrays in @code{awk}}.
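
The subscript rules above can be exercised with a short sketch. It
assumes a POSIX-compatible @code{awk}; the array name @code{grid} and the
variables @code{k}, @code{n}, and @code{out} are hypothetical names used
only here.

```shell
out=$(awk 'BEGIN {
    grid[1, 2] = "x"                       # index is the string 1 SUBSEP 2
    if ((1, 2) in grid) print "present"
    if (!((2, 1) in grid)) print "absent"  # testing does not create it
    if ((1 SUBSEP 2) in grid) print "same key"
    delete grid[1, 2]
    n = 0
    for (k in grid) n++                    # count the remaining elements
    print n
}')
echo "$out"
```

Using @code{in} for the membership tests matters here: referring to
@code{grid[2, 1]} directly would create that element as a side effect.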
17193
@node Data Type Summary, , Arrays Summary, Variables/Fields
17194
@appendixsubsec Data Types
17196
The value of an @code{awk} expression is always either a number
or a string.
17199
Some contexts (such as arithmetic operators) require numeric
17200
values. They convert strings to numbers by interpreting the text
17201
of the string as a number. If the string does not look like a
17202
number, it converts to zero.
17204
Other contexts (such as concatenation) require string values.
17205
They convert numbers to strings by effectively printing them
17206
with @code{sprintf}.
17207
@xref{Conversion, ,Conversion of Strings and Numbers}, for the details.
17209
To force conversion of a string value to a number, simply add zero
17210
to it. If the value you start with is already a number, this
17211
does not change it.
17213
To force conversion of a numeric value to a string, concatenate it with
the null string.
17216
Comparisons are done numerically if both operands are numeric, or if
17217
one is numeric and the other is a numeric string. Otherwise one or
17218
both operands are converted to strings and a string comparison is
17219
performed. Fields, @code{getline} input, @code{FILENAME}, @code{ARGV}
17220
elements, @code{ENVIRON} elements and the elements of an array created
17221
by @code{split} are the only items that can be numeric strings. String
17222
constants, such as @code{"3.1415927"}, are not numeric strings; they are
17223
string constants. The full rules for comparisons are described in
17224
@ref{Typing and Comparison, ,Variable Typing and Comparison Expressions}.
17226
Uninitialized variables have the string value @code{""} (the null, or
17227
empty, string). In contexts where a number is required, this is
17228
equivalent to zero.
17230
@xref{Variables}, for more information on variable naming and initialization;
17231
@pxref{Conversion, ,Conversion of Strings and Numbers}, for more information
17232
on how variable values are interpreted.
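
The comparison and conversion rules above can be illustrated with a
brief sketch, assuming a POSIX-compatible @code{awk}. The input record
@samp{10 9} and the shell variable @code{out} are chosen only for the
example; each result is printed as 1 (true) or 0 (false).

```shell
out=$(echo '10 9' | awk '{
    print ($1 < $2)                  # numeric strings: 10 < 9 is false
    print ("10" < "9")               # string constants: "10" sorts first
    print (($1 "") < ($2 ""))        # forced to strings by concatenation
    print (("10" + 0) < ("9" + 0))   # forced to numbers by adding zero
}')
echo "$out"
```

The parentheses around each comparison are needed so that @samp{<} is
not taken as an input redirection or confused with @code{print}'s
@samp{>} output redirection.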
17234
@node Rules Summary, Actions Summary, Variables/Fields, Gawk Summary
17235
@appendixsec Patterns
17238
* Pattern Summary:: Quick overview of patterns.
17239
* Regexp Summary:: Quick overview of regular expressions.
17242
An @code{awk} program is mostly composed of rules, each consisting of a
17243
pattern followed by an action. The action is enclosed in @samp{@{} and
17244
@samp{@}}. Either the pattern may be missing, or the action may be
17245
missing, but not both. If the pattern is missing, the
17246
action is executed for every input record. A missing action is
17247
equivalent to @samp{@w{@{ print @}}}, which prints the entire line.
17249
@c These paragraphs repeated for both patterns and actions. I don't
17250
@c like this, but I also don't see any way around it. Update both copies
17251
@c if they need fixing.
17252
Comments begin with the @samp{#} character, and continue until the end of the
17253
line. Blank lines may be used to separate statements. Statements normally
17254
end with a newline; however, this is not the case for lines ending in a
17255
@samp{,}, @samp{@{}, @samp{?}, @samp{:}, @samp{&&}, or @samp{||}. Lines
17256
ending in @code{do} or @code{else} also have their statements automatically
17257
continued on the following line. In other cases, a line can be continued by
17258
ending it with a @samp{\}, in which case the newline is ignored.
17260
Multiple statements may be put on one line by separating each one with
a @samp{;}.
17262
This applies to both the statements within the action part of a rule (the
17263
usual case), and to the rule statements.
17265
@xref{Comments, ,Comments in @code{awk} Programs}, for information on
17266
@code{awk}'s commenting convention;
17267
@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}, for a
17268
description of the line continuation mechanism in @code{awk}.
17270
@node Pattern Summary, Regexp Summary, Rules Summary, Rules Summary
17271
@appendixsubsec Pattern Summary
17273
@code{awk} patterns may be one of the following:
17276
/@var{regular expression}/
17277
@var{relational expression}
17278
@var{pattern} && @var{pattern}
17279
@var{pattern} || @var{pattern}
17280
@var{pattern} ? @var{pattern} : @var{pattern}
17283
@var{pattern1}, @var{pattern2}
17288
@code{BEGIN} and @code{END} are two special kinds of patterns that are not
17289
tested against the input. The action parts of all @code{BEGIN} rules are
17290
concatenated as if all the statements had been written in a single @code{BEGIN}
17291
rule. They are executed before any of the input is read. Similarly, all the
17292
@code{END} rules are concatenated, and executed when all the input is exhausted (or
17293
when an @code{exit} statement is executed). @code{BEGIN} and @code{END}
17294
patterns cannot be combined with other patterns in pattern expressions.
17295
@code{BEGIN} and @code{END} rules cannot have missing action parts.
17297
For @code{/@var{regular-expression}/} patterns, the associated statement is
17298
executed for each input record that matches the regular expression. Regular
17299
expressions are summarized below.
17301
A @var{relational expression} may use any of the operators defined below in
17302
the section on actions. These generally test whether certain fields match
17303
certain regular expressions.
17305
The @samp{&&}, @samp{||}, and @samp{!} operators are logical ``and,''
17306
logical ``or,'' and logical ``not,'' respectively, as in C. They do
17307
short-circuit evaluation, also as in C, and are used for combining more
17308
primitive pattern expressions. As in most languages, parentheses may be
17309
used to change the order of evaluation.
17311
The @samp{?:} operator is like the same operator in C. If the first
17312
pattern matches, then the second pattern is matched against the input
17313
record; otherwise, the third is matched. Only one of the second and
17314
third patterns is matched.
17316
The @samp{@var{pattern1}, @var{pattern2}} form of a pattern is called a
17317
range pattern. It matches all input lines starting with a line that
17318
matches @var{pattern1}, and continuing until a line that matches
17319
@var{pattern2}, inclusive. A range pattern cannot be used as an operand
17320
of any of the pattern operators.
17322
@xref{Pattern Overview, ,Pattern Elements}.
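
As a sketch of how @code{BEGIN}, @code{END}, and a range pattern
interact, consider the following, which assumes a POSIX-compatible
@code{awk}. The sample input and the shell variable @code{out} are
made up for the illustration.

```shell
out=$(printf 'a\nstart\nb\nend\nc\n' | awk '
BEGIN          { print "begin" }                  # before any input is read
/start/, /end/ { print NR ": " $0 }               # range pattern, inclusive
END            { print "records read: " NR }      # after input is exhausted
')
echo "$out"
```

Only records 2 through 4 fall inside the range, and both end points are
printed.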
17324
@node Regexp Summary, , Pattern Summary, Rules Summary
17325
@appendixsubsec Regular Expressions
17327
Regular expressions are based on POSIX EREs (extended regular expressions).
17328
The escape sequences allowed in string constants are also valid in
17329
regular expressions (@pxref{Escape Sequences}).
17330
Regexps are composed of characters as follows:
17334
matches the character @var{c} (assuming @var{c} is none of the characters
listed below).
17338
matches the literal character @var{c}.
17341
matches any character, @emph{including} newline.
17342
In strict POSIX mode, @samp{.} does not match the @sc{nul}
17343
character, which is a character with all bits equal to zero.
17346
matches the beginning of a string.
17349
matches the end of a string.
17351
@item [@var{abc}@dots{}]
17352
matches any of the characters @var{abc}@dots{} (character list).
17354
@item [[:@var{class}:]]
17355
matches any character in the character class @var{class}. Allowable classes
17356
are @code{alnum}, @code{alpha}, @code{blank}, @code{cntrl},
17357
@code{digit}, @code{graph}, @code{lower}, @code{print}, @code{punct},
17358
@code{space}, @code{upper}, and @code{xdigit}.
17360
@item [[.@var{symbol}.]]
17361
matches the multi-character collating symbol @var{symbol}.
17362
@code{gawk} does not currently support collating symbols.
17364
@item [[=@var{chars}=]]
17365
matches any of the equivalent characters in @var{chars}.
17366
@code{gawk} does not currently support equivalence classes.
17368
@item [^@var{abc}@dots{}]
17369
matches any character except @var{abc}@dots{} and newline (negated
character list).
17372
@item @var{r1}|@var{r2}
17373
matches either @var{r1} or @var{r2} (alternation).
17376
matches @var{r1}, and then @var{r2} (concatenation).
17379
matches one or more @var{r}'s.
17382
matches zero or more @var{r}'s.
17385
matches zero or one @var{r}'s.
17388
matches @var{r} (grouping).
17390
@item @var{r}@{@var{n}@}
17391
@itemx @var{r}@{@var{n},@}
17392
@itemx @var{r}@{@var{n},@var{m}@}
17393
matches exactly @var{n}, @var{n} or more, or @var{n} to @var{m}
17394
occurrences of @var{r} (interval expressions).
17397
matches the empty string at either the beginning or the
end of a word.
17401
matches the empty string within a word.
17404
matches the empty string at the beginning of a word.
17407
matches the empty string at the end of a word.
17410
matches any word-constituent character (alphanumeric characters and
the underscore).
17414
matches any character that is not word-constituent.
17417
matches the empty string at the beginning of a buffer (same as a string
in @code{gawk}).
17421
matches the empty string at the end of a buffer.
17424
The various command line options
17425
control how @code{gawk} interprets characters in regexps.
17427
@c NOTE!!! Keep this in sync with the same table in the regexp chapter!
17430
In the default case, @code{gawk} provides all the facilities of
17431
POSIX regexps and the GNU regexp operators described above.
17432
However, interval expressions are not supported.
17434
@item @code{--posix}
17435
Only POSIX regexps are supported; the GNU operators are not special
17436
(e.g., @samp{\w} matches a literal @samp{w}). Interval expressions
are allowed.
17439
@item @code{--traditional}
17440
Traditional Unix @code{awk} regexps are matched. The GNU operators
17441
are not special, interval expressions are not available, and neither
17442
are the POSIX character classes (@code{[[:alnum:]]} and so on).
17443
Characters described by octal and hexadecimal escape sequences are
17444
treated literally, even if they represent regexp metacharacters.
17446
@item @code{--re-interval}
17447
Allow interval expressions in regexps, even if @samp{--traditional}
has been specified.
17451
@xref{Regexp, ,Regular Expressions}.
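
A few of the POSIX operators listed above (anchors, @samp{?}, grouping,
and alternation) are combined in the sketch below. It deliberately
avoids the GNU-specific operators and interval expressions so that it
should behave the same in any POSIX-compatible @code{awk}; the word list
and the shell variable @code{out} are invented for the example.

```shell
out=$(printf 'color\ncolour\ncolr\ncats\ndog\nbird\n' | awk '
/^colou?r$/     { print "spelling: " $0 }   # "u" is optional
/^(cat|dog)s?$/ { print "pet: " $0 }        # grouped alternation
')
echo "$out"
```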
17453
@node Actions Summary, Functions Summary, Rules Summary, Gawk Summary
17454
@appendixsec Actions
17456
Action statements are enclosed in braces, @samp{@{} and @samp{@}}.
17457
A missing action statement is equivalent to @samp{@w{@{ print @}}}.
17459
Action statements consist of the usual assignment, conditional, and looping
17460
statements found in most languages. The operators, control statements,
17461
and Input/Output statements available are similar to those in C.
17463
@c These paragraphs repeated for both patterns and actions. I don't
17464
@c like this, but I also don't see any way around it. Update both copies
17465
@c if they need fixing.
17466
Comments begin with the @samp{#} character, and continue until the end of the
17467
line. Blank lines may be used to separate statements. Statements normally
17468
end with a newline; however, this is not the case for lines ending in a
17469
@samp{,}, @samp{@{}, @samp{?}, @samp{:}, @samp{&&}, or @samp{||}. Lines
17470
ending in @code{do} or @code{else} also have their statements automatically
17471
continued on the following line. In other cases, a line can be continued by
17472
ending it with a @samp{\}, in which case the newline is ignored.
17474
Multiple statements may be put on one line by separating each one with
a @samp{;}.
17476
This applies to both the statements within the action part of a rule (the
17477
usual case), and to the rule statements.
17479
@xref{Comments, ,Comments in @code{awk} Programs}, for information on
17480
@code{awk}'s commenting convention;
17481
@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}, for a
17482
description of the line continuation mechanism in @code{awk}.
17485
* Operator Summary:: @code{awk} operators.
17486
* Control Flow Summary:: The control statements.
17487
* I/O Summary:: The I/O statements.
17488
* Printf Summary:: A summary of @code{printf}.
17489
* Special File Summary:: Special file names interpreted internally.
17490
* Built-in Functions Summary:: Built-in numeric and string functions.
17491
* Time Functions Summary:: Built-in time functions.
17492
* String Constants Summary:: Escape sequences in strings.
17495
@node Operator Summary, Control Flow Summary, Actions Summary, Actions Summary
17496
@appendixsubsec Operators
17498
The operators in @code{awk}, in order of decreasing precedence, are:
17508
Increment and decrement, both prefix and postfix.
17511
Exponentiation (@samp{**} may also be used, and @samp{**=} for the assignment
17512
operator, but they are not specified in the POSIX standard).
17515
Unary plus, unary minus, and logical negation.
17518
Multiplication, division, and modulus.
17521
Addition and subtraction.
17524
String concatenation.
17526
@item < <= > >= != ==
17527
The usual relational operators.
17530
Regular expression match, negated match.
17542
A conditional expression. This has the form @samp{@var{expr1} ?
17543
@var{expr2} : @var{expr3}}. If @var{expr1} is true, the value of the
17544
expression is @var{expr2}; otherwise it is @var{expr3}. Only one of
17545
@var{expr2} and @var{expr3} is evaluated.
17547
@item = += -= *= /= %= ^=
17548
Assignment. Both absolute assignment (@code{@var{var}=@var{value}})
17549
and operator assignment (the other forms) are supported.
17552
@xref{Expressions}.
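
The precedence ordering above has some consequences that are easy to
overlook. The following sketch, assuming a POSIX-compatible @code{awk},
prints three expressions whose values depend on it; the shell variable
@code{out} is an illustrative name.

```shell
out=$(awk 'BEGIN {
    print 1 + 2 * 3      # multiplication before addition
    print -2 ^ 2         # exponentiation before unary minus: -(2 ^ 2)
    print 1 " " 2 + 3    # addition before concatenation: "1" " " 5
}')
echo "$out"
```

When in doubt, parentheses make the intended grouping explicit and cost
nothing.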
17554
@node Control Flow Summary, I/O Summary, Operator Summary, Actions Summary
17555
@appendixsubsec Control Statements
17557
The control statements are as follows:
17560
if (@var{condition}) @var{statement} @r{[} else @var{statement} @r{]}
17561
while (@var{condition}) @var{statement}
17562
do @var{statement} while (@var{condition})
17563
for (@var{expr1}; @var{expr2}; @var{expr3}) @var{statement}
17564
for (@var{var} in @var{array}) @var{statement}
17567
delete @var{array}[@var{index}]
17569
exit @r{[} @var{expression} @r{]}
17570
@{ @var{statements} @}
17573
@xref{Statements, ,Control Statements in Actions}.
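
Several of these statements appear together in the sketch below, which
counts words in its input. It assumes a POSIX-compatible @code{awk}; the
names @code{count}, @code{word}, @code{total}, and @code{out} are
hypothetical. Since @samp{for (@var{var} in @var{array})} visits indices
in an unspecified order, the example only sums the elements and does not
depend on that order.

```shell
out=$(printf 'a b a\nc a b\n' | awk '
{
    for (i = 1; i <= NF; i++)     # C-style loop over the fields
        count[$i]++
}
END {
    total = 0
    for (word in count)           # every index, in no particular order
        total += count[word]
    i = 0
    do {                          # the body always runs at least once
        i++
    } while (i < 3)
    if ("a" in count)
        print "a:", count["a"]
    else
        print "no a"
    print "total:", total, "i:", i
}')
echo "$out"
```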
17575
@node I/O Summary, Printf Summary, Control Flow Summary, Actions Summary
17576
@appendixsubsec I/O Statements
17578
The Input/Output statements are as follows:
17582
Set @code{$0} from next input record; set @code{NF}, @code{NR}, @code{FNR}.
17583
@xref{Getline, ,Explicit Input with @code{getline}}.
17585
@item getline <@var{file}
17586
Set @code{$0} from next record of @var{file}; set @code{NF}.
17588
@item getline @var{var}
17589
Set @var{var} from next input record; set @code{NR}, @code{FNR}.
17591
@item getline @var{var} <@var{file}
17592
Set @var{var} from next record of @var{file}.
17594
@item @var{command} | getline
17595
Run @var{command}, piping its output into @code{getline}; sets @code{$0},
17596
@code{NF}, @code{NR}.
17598
@item @var{command} | getline @code{var}
17599
Run @var{command}, piping its output into @code{getline}; sets @var{var}.
17602
Stop processing the current input record. The next input record is read and
17603
processing starts over with the first pattern in the @code{awk} program.
17604
If the end of the input data is reached, the @code{END} rule(s), if any,
are executed.
17606
@xref{Next Statement, ,The @code{next} Statement}.
17609
Stop processing the current input file. The next input record read comes
17610
from the next input file. @code{FILENAME} is updated, @code{FNR} is set to one,
17611
@code{ARGIND} is incremented,
17612
and processing starts over with the first pattern in the @code{awk} program.
17613
If the end of the input data is reached, the @code{END} rule(s), if any,
are executed.
17615
Earlier versions of @code{gawk} used @samp{next file}; this usage is still
17616
supported, but is considered to be deprecated.
17617
@xref{Nextfile Statement, ,The @code{nextfile} Statement}.
17620
Prints the current record.
17621
@xref{Printing, ,Printing Output}.
17623
@item print @var{expr-list}
17624
Prints expressions.
17626
@item print @var{expr-list} > @var{file}
17627
Prints expressions to @var{file}. If @var{file} does not exist, it is
17628
created. If it does exist, its contents are deleted the first time the
17629
@code{print} is executed.
17631
@item print @var{expr-list} >> @var{file}
17632
Prints expressions to @var{file}. The previous contents of @var{file}
17633
are retained, and the output of @code{print} is appended to the file.
17635
@item print @var{expr-list} | @var{command}
17636
Prints expressions, sending the output down a pipe to @var{command}.
17637
The pipeline to the command stays open until the @code{close} function
is called.
17640
@item printf @var{fmt, expr-list}
17643
@item printf @var{fmt, expr-list} > file
17644
Format and print to @var{file}. If @var{file} does not exist, it is
17645
created. If it does exist, its contents are deleted the first time the
17646
@code{printf} is executed.
17648
@item printf @var{fmt, expr-list} >> @var{file}
17649
Format and print to @var{file}. The previous contents of @var{file}
17650
are retained, and the output of @code{printf} is appended to the file.
17652
@item printf @var{fmt, expr-list} | @var{command}
17653
Format and print, sending the output down a pipe to @var{command}.
17654
The pipeline to the command stays open until the @code{close} function
is called.
17658
@code{getline} returns zero on end of file, and @minus{}1 on an error.
17659
In the event of an error, @code{getline} will set @code{ERRNO} to
17660
the value of a system-dependent string that describes the error.
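
Because @code{getline} distinguishes end of file (zero) from an error
(@minus{}1), loops should test for a result greater than zero. The
sketch below shows the usual idiom, assuming a POSIX-compatible
@code{awk}. The command string, the variable names, and the path
@file{/nonexistent/file} (assumed not to exist) are all made up for the
example; @code{ERRNO} is not printed since it is specific to
@code{gawk}.

```shell
out=$(awk 'BEGIN {
    cmd = "echo one; echo two"
    while ((cmd | getline line) > 0)   # 1 per record read, 0 at end
        n++
    close(cmd)                         # shut down the pipe when done
    print n
    # A file that cannot be opened yields -1, not 0.
    print (getline line < "/nonexistent/file")
}')
echo "$out"
```

Writing the loop as @samp{while ((cmd | getline line) > 0)} instead of
testing only for a non-zero result avoids an endless loop when the
return value is @minus{}1.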
17662
@node Printf Summary, Special File Summary, I/O Summary, Actions Summary
17663
@appendixsubsec @code{printf} Summary
17665
Conversion specifications have the form
17666
@code{%}[@var{flag}][@var{width}][@code{.}@var{prec}]@var{format}.
17668
Items in brackets are optional.
17670
The @code{awk} @code{printf} statement and @code{sprintf} function
17671
accept the following conversion specification formats:
17675
An ASCII character. If the argument used for @samp{%c} is numeric, it is
17676
treated as a character and printed. Otherwise, the argument is assumed to
17677
be a string, and only the first character of that string is printed.
17681
A decimal number (the integer part).
17685
A floating point number of the form
17686
@samp{@r{[}-@r{]}d.dddddde@r{[}+-@r{]}dd}.
17687
The @samp{%E} format uses @samp{E} instead of @samp{e}.
17690
A floating point number of the form
17691
@r{[}@code{-}@r{]}@code{ddd.dddddd}.
17695
Use either the @samp{%e} or @samp{%f} formats, whichever produces a shorter
17696
string, with non-significant zeros suppressed.
17697
@samp{%G} will use @samp{%E} instead of @samp{%e}.
17700
An unsigned octal number (again, an integer).
17703
A character string.
17707
An unsigned hexadecimal number (an integer).
17708
The @samp{%X} format uses @samp{A} through @samp{F} instead of
17709
@samp{a} through @samp{f} for decimal 10 through 15.
17712
A single @samp{%} character; no argument is converted.
17715
There are optional, additional parameters that may lie between the @samp{%}
17716
and the control letter:
17720
The expression should be left-justified within its field.
17723
For numeric conversions, prefix positive values with a space, and
17724
negative values with a minus sign.
17727
The plus sign, used before the width modifier (see below),
17728
says to always supply a sign for numeric conversions, even if the data
17729
to be formatted is positive. The @samp{+} overrides the space modifier.
17732
Use an ``alternate form'' for certain control letters.
17733
For @samp{o}, supply a leading zero.
17734
For @samp{x} and @samp{X}, supply a leading @samp{0x} or @samp{0X} for
a non-zero result.
For @samp{e}, @samp{E}, and @samp{f}, the result will always contain a
decimal point.
For @samp{g} and @samp{G}, trailing zeros are not removed from the result.
17741
A leading @samp{0} (zero) acts as a flag that indicates output should be
17742
padded with zeros instead of spaces.
17743
This applies even to non-numeric output formats.
17744
This flag only has an effect when the field width is wider than the
17745
value to be printed.
17748
The field should be padded to this width. The field is normally padded
17749
with spaces. If the @samp{0} flag has been used, it is padded with zeros.
17752
A number that specifies the precision to use when printing.
17753
For the @samp{e}, @samp{E}, and @samp{f} formats, this specifies the
17754
number of digits you want printed to the right of the decimal point.
17755
For the @samp{g} and @samp{G} formats, it specifies the maximum number
17756
of significant digits. For the @samp{d}, @samp{o}, @samp{i}, @samp{u},
17757
@samp{x}, and @samp{X} formats, it specifies the minimum number of
17758
digits to print. For the @samp{s} format, it specifies the maximum number of
17759
characters from the string that should be printed.
17762
Either or both of the @var{width} and @var{prec} values may be specified
17763
as @samp{*}. In that case, the particular value is taken from the argument list.
17766
@xref{Printf, ,Using @code{printf} Statements for Fancier Printing}.
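
A few of the conversions, flags, and modifiers described above are easier to judge when seen side by side. This is a small sketch under stated assumptions: the shell variable names are invented for the example, the values are arbitrary, and an ASCII-compatible character set is assumed for the @samp{%c} case.

```shell
# %c with a numeric argument prints the character with that code;
# with a string argument it prints only the first character.
chars=$(awk 'BEGIN { printf "%c%c\n", 65, "hello" }')

# Width, '-' (left-justify), and precision; '|' marks the field edges.
fields=$(awk 'BEGIN { printf "|%5d|%-5d|%.2f|%.3s|\n", 42, 42, 3.14159, "abcdef" }')

# '0' pads with zeros, '+' forces a sign, '#' asks for the alternate form.
flags=$(awk 'BEGIN { printf "%05d %+d %#o %#x\n", 42, 42, 8, 255 }')

# '*' takes the width and the precision from the argument list.
star=$(awk 'BEGIN { printf "|%*.*f|\n", 8, 3, 3.14159 }')
```

Printing the four variables with @code{echo} shows each result on its own line; the vertical bars make the padding visible.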
17768
@node Special File Summary, Built-in Functions Summary, Printf Summary, Actions Summary
17769
@appendixsubsec Special File Names
17771
When doing I/O redirection from either @code{print} or @code{printf} into a
17772
file, or via @code{getline} from a file, @code{gawk} recognizes certain special
17773
file names internally. These file names allow access to open file descriptors
17774
inherited from @code{gawk}'s parent process (usually the shell). The
17779
The standard input.
17782
The standard output.
17785
The standard error output.
17787
@item /dev/fd/@var{n}
17788
The file denoted by the open file descriptor @var{n}.
17791
In addition, reading the following files provides process-related information
17792
about the running @code{gawk} program. All returned records are terminated
17797
Returns the process ID of the current process.
17800
Returns the parent process ID of the current process.
17803
Returns the process group ID of the current process.
17806
At least four space-separated fields, containing the return values of
17807
the @code{getuid}, @code{geteuid}, @code{getgid}, and @code{getegid}
system calls.
If there are any additional fields, they are the group IDs returned by
the @code{getgroups} system call.
17811
(Multiple groups may not be supported on all systems.)
17815
These file names may also be used on the command line to name data files.
17816
These file names are only recognized internally if you do not
17817
actually have files with these names on your system.
17819
@xref{Special Files, ,Special File Names in @code{gawk}}, for a longer description that
17820
provides the motivation for this feature.
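
As an illustration of the special file names, the sketch below sends a diagnostic to the standard error and the record itself to the standard output. The message text, the temporary file, and the shell variable names are assumptions made for the example; the shell captures the two streams separately only to show where each line went.

```shell
# Collect standard error in a scratch file and standard output in $out.
err_file=$(mktemp)
out=$(echo "input line" | awk '{
    print "warning: saw a record" > "/dev/stderr"
    print $0 > "/dev/stdout"
}' 2>"$err_file")
err=$(cat "$err_file")
rm -f "$err_file"
```

Writing diagnostics to @file{/dev/stderr} this way keeps them out of any pipeline that consumes the program's normal output.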
17822
@node Built-in Functions Summary, Time Functions Summary, Special File Summary, Actions Summary
17823
@appendixsubsec Built-in Functions
17825
@code{awk} provides a number of built-in functions for performing
17826
numeric operations, string-related operations, and I/O-related operations.
17828
The built-in arithmetic functions are:
17831
@item atan2(@var{y}, @var{x})
17832
the arctangent of @var{y/x} in radians.
17834
@item cos(@var{expr})
17835
the cosine of @var{expr}, which is in radians.
17837
@item exp(@var{expr})
17838
the exponential function (@code{e ^ @var{expr}}).
17840
@item int(@var{expr})
17841
truncates to integer.
17843
@item log(@var{expr})
17844
the natural logarithm of @var{expr}.
17847
a random number between zero and one.
17849
@item sin(@var{expr})
17850
the sine of @var{expr}, which is in radians.
17852
@item sqrt(@var{expr})
17853
the square root of @var{expr}.
17855
@item srand(@r{[}@var{expr}@r{]})
17856
use @var{expr} as a new seed for the random number generator. If no @var{expr}
17857
is provided, the time of day is used. The return value is the previous
17858
seed for the random number generator.
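
The behavior of a few of the arithmetic functions can be checked with one-line programs. This is only a sketch: the seed value 42 and the shell variable names are arbitrary choices made for the example.

```shell
# int() truncates toward zero; it does not round.
trunc=$(awk 'BEGIN { print int(3.9), int(-3.9) }')

# atan2(0, -1) is pi.
pi=$(awk 'BEGIN { printf "%.5f\n", atan2(0, -1) }')

# srand(expr) returns the previous seed, and the same seed
# reproduces the same value from rand().
same=$(awk 'BEGIN {
    srand(42); a = rand()
    prev = srand(42); b = rand()
    r = (a == b && prev == 42 && a >= 0 && a < 1) ? "repeatable" : "different"
    print r
}')
```

Seeding explicitly, as done here, is useful when a program using @code{rand} has to produce the same results on every run, for example in a test.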
17864
@code{awk} has the following built-in string functions:
17867
@item gensub(@var{regex}, @var{subst}, @var{how} @r{[}, @var{target}@r{]})
17868
If @var{how} is a string beginning with @samp{g} or @samp{G}, then
17869
replace each match of @var{regex} in @var{target} with @var{subst}.
17870
Otherwise, replace the @var{how}'th occurrence. If @var{target} is not
17871
supplied, use @code{$0}. The return value is the changed string; the
17872
original @var{target} is not modified. Within @var{subst},
17873
@samp{\@var{n}}, where @var{n} is a digit from one to nine, can be used to
17874
indicate the text that matched the @var{n}'th parenthesized subexpression.
17877
@item gsub(@var{regex}, @var{subst} @r{[}, @var{target}@r{]})
17878
for each substring matching the regular expression @var{regex} in the string
17879
@var{target}, substitute the string @var{subst}, and return the number of
17880
substitutions. If @var{target} is not supplied, use @code{$0}.
17882
@item index(@var{str}, @var{search})
17883
returns the index of the string @var{search} in the string @var{str}, or
zero if @var{search} is not present.
17887
@item length(@r{[}@var{str}@r{]})
17888
returns the length of the string @var{str}. The length of @code{$0}
17889
is returned if no argument is supplied.
17891
@item match(@var{str}, @var{regex})
17892
returns the position in @var{str} where the regular expression @var{regex}
17893
occurs, or zero if @var{regex} is not present, and sets the values of
17894
@code{RSTART} and @code{RLENGTH}.
17896
@item split(@var{str}, @var{arr} @r{[}, @var{regex}@r{]})
17897
splits the string @var{str} into the array @var{arr} on the regular expression
17898
@var{regex}, and returns the number of elements. If @var{regex} is omitted,
17899
@code{FS} is used instead. @var{regex} can be the null string, causing
17900
each character to be placed into its own array element.
17901
The array @var{arr} is cleared first.
17903
@item sprintf(@var{fmt}, @var{expr-list})
17904
formats @var{expr-list} according to @var{fmt}, and returns the resulting string.
17906
@item sub(@var{regex}, @var{subst} @r{[}, @var{target}@r{]})
17907
just like @code{gsub}, but only the first matching substring is replaced.
17909
@item substr(@var{str}, @var{index} @r{[}, @var{len}@r{]})
17910
returns the @var{len}-character substring of @var{str} starting at @var{index}.
17911
If @var{len} is omitted, the rest of @var{str} is used.
17913
@item tolower(@var{str})
17914
returns a copy of the string @var{str}, with all the upper-case characters in
17915
@var{str} translated to their corresponding lower-case counterparts.
17916
Non-alphabetic characters are left unchanged.
17918
@item toupper(@var{str})
17919
returns a copy of the string @var{str}, with all the lower-case characters in
17920
@var{str} translated to their corresponding upper-case counterparts.
17921
Non-alphabetic characters are left unchanged.
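
The return values of the string functions are the part most often misremembered, so here is a short sketch that prints them. It uses only functions that are not specific to @code{gawk} (so @code{gensub} is left out); the sample strings and the variable names are assumptions made for the example.

```shell
strings=$(awk 'BEGIN {
    s = "hello world"

    # gsub() edits its target in place and returns the number of changes.
    t = s
    n = gsub(/o/, "0", t)
    print n, t

    # index() and match() return 1-based positions, or zero when not found.
    print index(s, "world"), index(s, "xyz")
    pos = match(s, /w[a-z]+/)
    print pos, RSTART, RLENGTH

    # split() returns the number of elements placed in the array.
    count = split("a:b:c", parts, ":")
    print count, parts[1], parts[3]

    # substr() is 1-based; omitting the length takes the rest of the string.
    print substr(s, 1, 5), substr(s, 7)
    print toupper(s), length(s)
}')
```

Note that @code{gsub} is applied to the copy @code{t}, so the original string @code{s} is still intact for the later calls.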
17924
The I/O-related functions are:
17927
@item close(@var{expr})
17928
Close the open file or pipe denoted by @var{expr}.
17930
@item fflush(@r{[}@var{expr}@r{]})
17931
Flush any buffered output for the output file or pipe denoted by @var{expr}.
17932
If @var{expr} is omitted, standard output is flushed.
17933
If @var{expr} is the null string (@code{""}), all output buffers are flushed.
17935
@item system(@var{cmd-line})
17936
Execute the command @var{cmd-line}, and return the exit status.
17937
If your operating system does not support @code{system}, calling it will
17938
generate a fatal error.
17940
@samp{system("")} can be used to force @code{awk} to flush any pending
17941
output. This is more portable, but less obvious, than calling @code{fflush}.
17944
@node Time Functions Summary, String Constants Summary, Built-in Functions Summary, Actions Summary
17945
@appendixsubsec Time Functions
17947
The following two functions are available for getting the current
17948
time of day, and for formatting time stamps.
17952
returns the current time of day as the number of seconds since a particular
17953
epoch (Midnight, January 1, 1970 UTC, on POSIX systems).
17955
@item strftime(@r{[}@var{format}@r{[}, @var{timestamp}@r{]]})
17956
formats @var{timestamp} according to the specification in @var{format}.
17957
The current time of day is used if no @var{timestamp} is supplied.
17958
A default format equivalent to the output of the @code{date} utility is used if
17959
no @var{format} is supplied.
17960
@xref{Time Functions, ,Functions for Dealing with Time Stamps}, for the
17961
details on the conversion specifiers that @code{strftime} accepts.
17965
@xref{Built-in, ,Built-in Functions}, for a description of all of
17966
@code{awk}'s built-in functions.
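
The two time functions can be exercised as follows. Because they are specific to @code{gawk}, this sketch assumes the program is installed under the name @code{gawk} and does nothing otherwise; the time stamp 86400, the format string, and the shell variable names are arbitrary choices for illustration.

```shell
# systime() and strftime() are gawk extensions, so ask for gawk by name
# and do nothing when it is not installed.
if command -v gawk >/dev/null 2>&1; then
    # Format a fixed time stamp: 86400 seconds after the epoch.
    # TZ=UTC0 keeps the result independent of the local time zone.
    stamp=$(TZ=UTC0 gawk 'BEGIN { print strftime("%Y-%m-%d %H:%M:%S", 86400) }')

    # systime() returns the current time as a count of seconds.
    now_ok=$(gawk 'BEGIN { now = systime(); r = (now > 86400) ? "yes" : "no"; print r }')
fi
```

Passing the value of @code{systime()} as the second argument of @code{strftime} formats the current time instead of a fixed one.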
17969
@node String Constants Summary, , Time Functions Summary, Actions Summary
17970
@appendixsubsec String Constants
17972
String constants in @code{awk} are sequences of characters enclosed
17973
in double quotes (@code{"}). Within strings, certain @dfn{escape sequences}
17974
are recognized, as in C. These are:
17978
A literal backslash.
17981
The ``alert'' character; usually the ASCII BEL character.
18001
@item \x@var{hex digits}
18002
The character represented by the string of hexadecimal digits following
18003
the @samp{\x}. As in ANSI C, all following hexadecimal digits are
18004
considered part of the escape sequence. E.g., @code{"\x1B"} is a
18005
string containing the ASCII ESC (escape) character. (The @samp{\x}
18006
escape sequence is not in POSIX @code{awk}.)
18009
The character represented by the one, two, or three digit sequence of octal
18010
digits. Thus, @code{"\033"} is also a string containing the ASCII ESC
18011
(escape) character.
18014
The literal character @var{c}, if @var{c} is not one of the above.
18017
The escape sequences may also be used inside constant regular expressions
18018
(e.g., the regexp @code{@w{/[@ \t\f\n\r\v]/}} matches whitespace characters).
18021
@xref{Escape Sequences}.
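
The effect of the escape sequences can be seen by measuring the strings they produce. This is a small sketch; the sample strings and the shell variable name are assumptions made for the example, and it avoids the @samp{\x} form because that escape is not in POSIX @code{awk}.

```shell
report=$(awk 'BEGIN {
    print length("\033[0m")        # ESC, "[", "0", "m"
    print length("a\\b")           # a, one backslash, b
    r = ("x\ty" ~ /^x[ \t]y$/) ? "tab matched" : "no match"
    print r                        # the same escape works in a regexp
    print "say \"hi\""             # an escaped double quote
}')
```

Each escape sequence counts as a single character, which is why the first string has length four and the second has length three.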
18023
@node Functions Summary, Historical Features, Actions Summary, Gawk Summary
18024
@appendixsec User-defined Functions
18026
Functions in @code{awk} are defined as follows:
18029
function @var{name}(@var{parameter list}) @{ @var{statements} @}
18032
Actual parameters supplied in the function call are used to instantiate
18033
the formal parameters declared in the function. Arrays are passed by
18034
reference; other variables are passed by value.
18036
If there are fewer arguments passed than there are names in @var{parameter list},
18037
the extra names are given the null string as their value. Extra names have the
18038
effect of local variables.
18040
The open-parenthesis in a function call of a user-defined function must
18041
immediately follow the function name, without any intervening white space.
18042
This is to avoid a syntactic ambiguity with the concatenation operator.
18044
The word @code{func} may be used in place of @code{function} (but not in
18047
Use the @code{return} statement to return a value from a function.
18049
@xref{User-defined, ,User-defined Functions}.
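
The rules above (extra parameter names acting as local variables, arrays passed by reference, other values passed by value) can be seen together in one short program. This is a sketch only: the function names @code{sum} and @code{modify}, the data, and the shell variable name are hypothetical and were chosen for the example.

```shell
result=$(awk '
# The extra parameters i and total are never passed by callers, so they
# act as local variables. Extra spaces set them apart by convention.
function sum(arr, n,    i, total) {
    for (i = 1; i <= n; i++)
        total += arr[i]
    return total
}

# Arrays are passed by reference, so the change to arr is visible to the
# caller. The scalar x is passed by value, so the caller copy is unchanged.
function modify(arr, x) {
    arr[1] = 100
    x = 100
}

BEGIN {
    n = split("1 2 3", nums, " ")
    v = 5
    modify(nums, v)
    print sum(nums, n), v, (i == "" ? "i untouched" : "i leaked")
}')
```

The global variable @code{i} is still unset after the call, which shows that the loop counter inside @code{sum} really was local to the function.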
18051
@node Historical Features, , Functions Summary, Gawk Summary
18052
@appendixsec Historical Features
18054
@cindex historical features
18055
There are two features of historical @code{awk} implementations that
18056
@code{gawk} supports.
18058
First, it is possible to call the @code{length} built-in function not only
18059
with no arguments, but even without parentheses!
18066
is the same as either of
18077
$ echo abcdef | awk '@{ print length @}'
18082
This feature is marked as ``deprecated'' in the POSIX standard, and
18083
@code{gawk} will issue a warning about its use if @samp{--lint} is
18084
specified on the command line.
18085
(The ability to use @code{length} this way was actually an accident of the
18086
original Unix @code{awk} implementation. If any built-in function used
18087
@code{$0} as its default argument, it was possible to call that function
18088
without the parentheses. In particular, it was common practice to use
18089
the @code{length} function in this fashion, and this usage was documented
18090
in the @code{awk} manual page.)
18092
The other historical feature is the use of either the @code{break} statement,
18093
or the @code{continue} statement
18094
outside the body of a @code{while}, @code{for}, or @code{do} loop. Traditional
18095
@code{awk} implementations have treated such usage as equivalent to the
18096
@code{next} statement. More recent versions of Unix @code{awk} do not allow
18097
it. @code{gawk} supports this usage if @samp{--traditional} has been specified.
18100
@xref{Options, ,Command Line Options}, for more information about the
18101
@samp{--traditional} and @samp{--lint} options.
18103
@node Installation, Notes, Gawk Summary, Top
18104
@appendix Installing @code{gawk}
18106
This appendix provides instructions for installing @code{gawk} on the
18107
various platforms that are supported by the developers. The primary
18108
developers support Unix (and one day, GNU), while the other ports were
18109
contributed. The file @file{ACKNOWLEDGMENT} in the @code{gawk}
18110
distribution lists the electronic mail addresses of the people who did
18111
the respective ports, and they are also provided in
18112
@ref{Bugs, , Reporting Problems and Bugs}.
18115
* Gawk Distribution:: What is in the @code{gawk} distribution.
18116
* Unix Installation:: Installing @code{gawk} under various versions
18118
* VMS Installation:: Installing @code{gawk} on VMS.
18119
* PC Installation:: Installing and Compiling @code{gawk} on MS-DOS
18121
* Atari Installation:: Installing @code{gawk} on the Atari ST.
18122
* Amiga Installation:: Installing @code{gawk} on an Amiga.
18123
* Bugs:: Reporting Problems and Bugs.
18124
* Other Versions:: Other freely available @code{awk}
18128
@node Gawk Distribution, Unix Installation, Installation, Installation
18129
@appendixsec The @code{gawk} Distribution
18131
This section first describes how to get the @code{gawk}
18132
distribution, how to extract it, and then what is in the various files and subdirectories.
18136
* Getting:: How to get the distribution.
18137
* Extracting:: How to extract the distribution.
18138
* Distribution contents:: What is in the distribution.
18141
@node Getting, Extracting, Gawk Distribution, Gawk Distribution
18142
@appendixsubsec Getting the @code{gawk} Distribution
18143
@cindex getting @code{gawk}
18144
@cindex anonymous @code{ftp}
18145
@cindex @code{ftp}, anonymous
18146
@cindex Free Software Foundation
18147
There are three ways you can get GNU software.
18151
You can copy it from someone else who already has it.
18153
@cindex Free Software Foundation
18155
You can order @code{gawk} directly from the Free Software Foundation.
18156
Software distributions are available for Unix, MS-DOS, and VMS, on
18157
tape, CD-ROM, or floppies (MS-DOS only). The address is:
18160
Free Software Foundation @*
18161
59 Temple Place---Suite 330 @*
18162
Boston, MA 02111-1307 USA @*
18163
Phone: +1-617-542-5942 @*
18164
Fax (including Japan): +1-617-542-2652 @*
18165
E-mail: @code{gnu@@prep.ai.mit.edu} @*
18169
Ordering from the FSF directly contributes to the support of the foundation
18170
and to the production of more free software.
18173
You can get @code{gawk} by using anonymous @code{ftp} to the Internet host
18174
@code{ftp.gnu.ai.mit.edu}, in the directory @file{/pub/gnu}.
18176
Here is a list of alternate @code{ftp} sites from which you can obtain GNU
18177
software. When a site is listed as ``@var{site}@code{:}@var{directory}'' the
18178
@var{directory} indicates the directory where GNU software is kept.
18179
You should use a site that is geographically close to you.
18184
@item cair-archive.kaist.ac.kr:/pub/gnu
18185
@itemx ftp.cs.titech.ac.jp
18186
@itemx ftp.nectec.or.th:/pub/mirrors/gnu
18187
@itemx utsun.s.u-tokyo.ac.jp:/ftpsync/prep
18192
@item archie.au:/gnu
18193
(@code{archie.oz} or @code{archie.oz.au} for ACSnet)
18198
@item ftp.sun.ac.za:/pub/gnu
18203
@item ftp.technion.ac.il:/pub/unsupported/gnu
18208
@item archive.eu.net
18209
@itemx ftp.denet.dk
18210
@itemx ftp.eunet.ch
18211
@itemx ftp.funet.fi:/pub/gnu
18212
@itemx ftp.ieunet.ie:pub/gnu
18213
@itemx ftp.informatik.rwth-aachen.de:/pub/gnu
18214
@itemx ftp.informatik.tu-muenchen.de
18215
@itemx ftp.luth.se:/pub/unix/gnu
18216
@itemx ftp.mcc.ac.uk
18217
@itemx ftp.stacken.kth.se
18218
@itemx ftp.sunet.se:/pub/gnu
18219
@itemx ftp.univ-lyon1.fr:pub/gnu
18220
@itemx ftp.win.tue.nl:/pub/gnu
18221
@itemx irisa.irisa.fr:/pub/gnu
18223
@itemx nic.switch.ch:/mirror/gnu
18224
@itemx src.doc.ic.ac.uk:/gnu
18225
@itemx unix.hensa.ac.uk:/pub/uunet/systems/gnu
18228
@item South America:
18230
@item ftp.inf.utfsm.cl:/pub/gnu
18231
@itemx ftp.unicamp.br:/pub/gnu
18234
@item Western Canada:
18236
@item ftp.cs.ubc.ca:/mirror2/gnu
18241
@item col.hp.com:/mirrors/gnu
18242
@itemx f.ms.uky.edu:/pub3/gnu
18243
@itemx ftp.cc.gatech.edu:/pub/gnu
18244
@itemx ftp.cs.columbia.edu:/archives/gnu/prep
18245
@itemx ftp.digex.net:/pub/gnu
18246
@itemx ftp.hawaii.edu:/mirrors/gnu
18247
@itemx ftp.kpc.com:/pub/mirror/gnu
18253
@item USA (continued):
18255
@itemx ftp.uu.net:/systems/gnu
18256
@itemx gatekeeper.dec.com:/pub/GNU
18257
@itemx jaguar.utah.edu:/gnustuff
18258
@itemx labrea.stanford.edu
18259
@itemx mrcnext.cso.uiuc.edu:/pub/gnu
18260
@itemx vixen.cso.uiuc.edu:/gnu
18261
@itemx wuarchive.wustl.edu:/systems/gnu
18266
@node Extracting, Distribution contents, Getting, Gawk Distribution
18267
@appendixsubsec Extracting the Distribution
18268
@code{gawk} is distributed as a @code{tar} file compressed with the
18269
GNU Zip program, @code{gzip}.
18271
Once you have the distribution (for example,
18272
@file{gawk-@value{VERSION}.0.tar.gz}), first use @code{gzip} to expand the
18273
file, and then use @code{tar} to extract it. You can use the following
18274
pipeline to produce the @code{gawk} distribution:
18277
# Under System V, add 'o' to the tar flags
18278
gzip -d -c gawk-@value{VERSION}.0.tar.gz | tar -xvpf -
18282
This will create a directory named @file{gawk-@value{VERSION}.0} in the current directory.
18285
The distribution file name is of the form
18286
@file{gawk-@var{V}.@var{R}.@var{n}.tar.gz}.
18287
The @var{V} represents the major version of @code{gawk},
18288
the @var{R} represents the current release of version @var{V}, and
18289
the @var{n} represents a @dfn{patch level}, meaning that minor bugs have
18290
been fixed in the release. The current patch level is 0, but when
18291
retrieving distributions, you should get the version with the highest
18292
version, release, and patch level. (Note that release levels greater than
18293
or equal to 90 denote ``beta,'' or non-production software; you may not wish
18294
to retrieve such a version unless you don't mind experimenting.)
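
The naming scheme lends itself to a quick check with @code{awk} itself. The sketch below splits a file name into its three parts and applies the ``release 90 or above means beta'' rule; the file name @file{gawk-3.0.1.tar.gz} is a made-up example, not a statement about which versions exist, and the shell variable name is likewise an assumption.

```shell
parsed=$(echo "gawk-3.0.1.tar.gz" | awk '{
    name = $0
    sub(/^gawk-/, "", name)        # drop the program name
    sub(/\.tar\.gz$/, "", name)    # drop the archive suffix
    split(name, part, /\./)        # part[1]=V, part[2]=R, part[3]=n
    kind = (part[2] >= 90) ? "beta" : "production"
    printf "version=%s release=%s patch=%s (%s)\n", part[1], part[2], part[3], kind
}')
```

A loop over a directory listing could use the same program to skip beta distributions when choosing which file to retrieve.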
18296
If you are not on a Unix system, you will need to make other arrangements
18297
for getting and extracting the @code{gawk} distribution. You should consult
18300
@node Distribution contents, , Extracting, Gawk Distribution
18301
@appendixsubsec Contents of the @code{gawk} Distribution
18303
The @code{gawk} distribution has a number of C source files,
18304
documentation files,
18305
subdirectories and files related to the configuration process
18306
(@pxref{Unix Installation, ,Compiling and Installing @code{gawk} on Unix}),
18307
and several subdirectories related to different, non-Unix, operating systems.
18311
@item various @samp{.c}, @samp{.y}, and @samp{.h} files
18312
These files are the actual @code{gawk} source code.
18320
@itemx README_d/README.*
18321
Descriptive files: @file{README} for @code{gawk} under Unix, and the
18322
rest for the various hardware and software combinations.
18325
A file providing an overview of the configuration and installation process.
18328
A list of systems to which @code{gawk} has been ported, and which
18329
have successfully run the test suite.
18331
@item ACKNOWLEDGMENT
18332
A list of the people who contributed major parts of the code or documentation.
18335
A detailed list of source code changes as bugs are fixed or improvements made.
18338
A list of changes to @code{gawk} since the last release or patch.
18341
The GNU General Public License.
18344
A brief list of features and/or changes being contemplated for future
18345
releases, with some indication of the time frame for the feature, based
18349
A list of those factors that limit @code{gawk}'s performance.
18350
Most of these depend on the hardware or operating system software, and
18351
are not limits in @code{gawk} itself.
18354
A description of one area where the POSIX standard for @code{awk} is
18355
incorrect, and how @code{gawk} handles the problem.
18358
A file describing known problems with the current release.
18361
The @code{troff} source for a manual page describing @code{gawk}.
18362
This is distributed for the convenience of Unix users.
18364
@item doc/gawk.texi
18365
The Texinfo source file for this @value{DOCUMENT}.
18366
It should be processed with @TeX{} to produce a printed document, and
18367
with @code{makeinfo} to produce an Info file.
18369
@item doc/gawk.info
18370
The generated Info file for this @value{DOCUMENT}.
18373
The @code{troff} source for a manual page describing the @code{igawk}
18374
program presented in
18375
@ref{Igawk Program, ,An Easy Way to Use Library Functions}.
18377
@item doc/Makefile.in
18378
The input file used during the configuration process to generate the
18379
actual @file{Makefile} for creating the documentation.
18385
@itemx configure.in
18389
These files and subdirectory are used when configuring @code{gawk}
18390
for various Unix systems. They are explained in detail in
18391
@ref{Unix Installation, ,Compiling and Installing @code{gawk} on Unix}.
18393
@item awklib/extract.awk
18394
@itemx awklib/Makefile.in
18395
The @file{awklib} directory contains a copy of @file{extract.awk}
18396
(@pxref{Extract Program, ,Extracting Programs from Texinfo Source Files}),
18397
which can be used to extract the sample programs from the Texinfo
18398
source file for this @value{DOCUMENT}, and a @file{Makefile.in} file, which
18399
@code{configure} uses to generate a @file{Makefile}.
18400
As part of the process of building @code{gawk}, the library functions from
18401
@ref{Library Functions, , A Library of @code{awk} Functions},
18402
and the @code{igawk} program from
18403
@ref{Igawk Program, , An Easy Way to Use Library Functions},
18404
are extracted into ready-to-use files.
18405
They are installed as part of the installation process.
18408
Files needed for building @code{gawk} on an Amiga.
18409
@xref{Amiga Installation, ,Installing @code{gawk} on an Amiga}, for details.
18412
Files needed for building @code{gawk} on an Atari ST.
18413
@xref{Atari Installation, ,Installing @code{gawk} on the Atari ST}, for details.
18416
Files needed for building @code{gawk} under MS-DOS and OS/2.
18417
@xref{PC Installation, ,MS-DOS and OS/2 Installation and Compilation}, for details.
18420
Files needed for building @code{gawk} under VMS.
18421
@xref{VMS Installation, ,How to Compile and Install @code{gawk} on VMS}, for details.
18425
@code{gawk}. You can use @samp{make check} from the top level @code{gawk}
18426
directory to run your version of @code{gawk} against the test suite.
18427
If @code{gawk} successfully passes @samp{make check} then you can
18428
be confident of a successful port.
18431
@node Unix Installation, VMS Installation, Gawk Distribution, Installation
18432
@appendixsec Compiling and Installing @code{gawk} on Unix
18434
Usually, you can compile and install @code{gawk} by typing only two
18435
commands. However, if you do use an unusual system, you may need
18436
to configure @code{gawk} for your system yourself.
18439
* Quick Installation:: Compiling @code{gawk} under Unix.
18440
* Configuration Philosophy:: How it's all supposed to work.
18443
@node Quick Installation, Configuration Philosophy, Unix Installation, Unix Installation
18444
@appendixsubsec Compiling @code{gawk} for Unix
18446
@cindex installation, unix
18447
After you have extracted the @code{gawk} distribution, @code{cd}
18448
to @file{gawk-@value{VERSION}.0}. Like most GNU software,
18449
@code{gawk} is configured
18450
automatically for your Unix system by running the @code{configure} program.
18451
This program is a Bourne shell script that was generated automatically using
18452
GNU @code{autoconf}.
18454
(The @code{autoconf} software is described fully in
18456
@cite{Autoconf---Generating Automatic Configuration Scripts},
18457
which is available from the Free Software Foundation.)
18460
(The @code{autoconf} software is described fully starting with
18461
@ref{Top, , Introduction, autoconf, Autoconf---Generating Automatic Configuration Scripts}.)
18464
To configure @code{gawk}, simply run @code{configure}:
18470
This produces a @file{Makefile} and @file{config.h} tailored to your system.
18471
The @file{config.h} file describes various facts about your system.
18472
You may wish to edit the @file{Makefile} to
18473
change the @code{CFLAGS} variable, which controls
18474
the command line options that are passed to the C compiler (such as
18475
optimization levels, or compiling for debugging).
18477
Alternatively, you can add your own values for most @code{make}
18478
variables, such as @code{CC} and @code{CFLAGS}, on the command line when
18479
running @code{configure}:
18482
CC=cc CFLAGS=-g sh ./configure
18486
See the file @file{INSTALL} in the @code{gawk} distribution for details.
18489
After you have run @code{configure}, and possibly edited the @file{Makefile},
18497
and shortly thereafter, you should have an executable version of @code{gawk}.
18498
That's all there is to it!
18499
(If these steps do not work, please send in a bug report;
18500
@pxref{Bugs, ,Reporting Problems and Bugs}.)
18502
@node Configuration Philosophy, , Quick Installation, Unix Installation
18503
@appendixsubsec The Configuration Process
18505
@cindex configuring @code{gawk}
18506
(This section is of interest only if you know something about using the
18507
C language and the Unix operating system.)
18509
The source code for @code{gawk} generally attempts to adhere to formal
18510
standards wherever possible. This means that @code{gawk} uses library
18511
routines that are specified by the ANSI C standard and by the POSIX
18512
operating system interface standard. When using an ANSI C compiler,
18513
function prototypes are used to help improve the compile-time checking.
18515
Many Unix systems do not support all of either the ANSI or the
18516
POSIX standards. The @file{missing} subdirectory in the @code{gawk}
18517
distribution contains replacement versions of those subroutines that are
18518
most likely to be missing.
18520
The @file{config.h} file that is created by the @code{configure} program
18521
contains definitions that describe features of the particular operating
18522
system where you are attempting to compile @code{gawk}. The three things
18523
described by this file are what header files are available, so that
18524
they can be correctly included,
18525
what (supposedly) standard functions are actually available in your C libraries, and
18527
other miscellaneous facts about your
18528
variant of Unix. For example, there may not be an @code{st_blksize}
18529
element in the @code{stat} structure. In this case @samp{HAVE_ST_BLKSIZE}
18530
would be undefined.
18532
@cindex @code{custom.h} configuration file
18533
It is possible for your C compiler to lie to @code{configure}. It may
18534
do so by not exiting with an error when a library function is not
18535
available. To get around this, you can edit the file @file{custom.h}.
18536
Use an @samp{#ifdef} that is appropriate for your system, and either
18537
@code{#define} any constants that @code{configure} should have defined but
18538
didn't, or @code{#undef} any constants that @code{configure} defined and
18539
should not have. @file{custom.h} is automatically included by @file{config.h}.
18542
It is also possible that the @code{configure} program generated by
@code{autoconf}
will not work on your system in some other fashion. If you do have a problem,
the file
@file{configure.in} is the input for @code{autoconf}. You may be able to
18547
change this file, and generate a new version of @code{configure} that will
18548
work on your system. @xref{Bugs, ,Reporting Problems and Bugs}, for
18549
information on how to report problems in configuring @code{gawk}. The same
18550
mechanism may be used to send in updates to @file{configure.in} and/or
18553
@node VMS Installation, PC Installation, Unix Installation, Installation
18554
@appendixsec How to Compile and Install @code{gawk} on VMS
18556
@c based on material from Pat Rankin <rankin@eql.caltech.edu>
18558
@cindex installation, vms
18559
This section describes how to compile and install @code{gawk} under VMS.
18562
* VMS Compilation:: How to compile @code{gawk} under VMS.
18563
* VMS Installation Details:: How to install @code{gawk} under VMS.
18564
* VMS Running:: How to run @code{gawk} under VMS.
18565
* VMS POSIX:: Alternate instructions for VMS POSIX.
18568
@node VMS Compilation, VMS Installation Details, VMS Installation, VMS Installation
18569
@appendixsubsec Compiling @code{gawk} on VMS
18571
To compile @code{gawk} under VMS, there is a @code{DCL} command procedure that
18572
will issue all the necessary @code{CC} and @code{LINK} commands, and there is
18573
also a @file{Makefile} for use with the @code{MMS} utility. From the source
18574
directory, use either
18577
$ @@[.VMS]VMSBUILD.COM
18584
$ MMS/DESCRIPTION=[.VMS]DESCRIP.MMS GAWK
18587
Depending upon which C compiler you are using, follow one of the sets
18588
of instructions in this table:
18592
Use either @file{vmsbuild.com} or @file{descrip.mms} as is. These use
18593
@code{CC/OPTIMIZE=NOLINE}, which is essential for Version 3.0.
18596
You must have Version 2.3 or 2.4; older ones won't work. Edit either
18597
@file{vmsbuild.com} or @file{descrip.mms} according to the comments in them.
18598
For @file{vmsbuild.com}, this just entails removing two @samp{!} delimiters.
18599
Also edit @file{config.h} (which is a copy of file @file{[.config]vms-conf.h})
18600
and comment out or delete the two lines @samp{#define __STDC__ 0} and
18601
@samp{#define VAXC_BUILTINS} near the end.
18604
Edit @file{vmsbuild.com} or @file{descrip.mms}; the changes are different
18605
from those for VAX C V2.x, but equally straightforward. No changes to
18606
@file{config.h} should be needed.
18609
Edit @file{vmsbuild.com} or @file{descrip.mms} according to their comments.
18610
No changes to @file{config.h} should be needed.
18613
@code{gawk} has been tested under VAX/VMS 5.5-1 using VAX C V3.2,
18614
GNU C 1.40 and 2.3. It should work without modifications for VMS V4.6 and up.
18616
@node VMS Installation Details, VMS Running, VMS Compilation, VMS Installation
18617
@appendixsubsec Installing @code{gawk} on VMS
18619
To install @code{gawk}, all you need is a ``foreign'' command, which is
18620
a @code{DCL} symbol whose value begins with a dollar sign. For example:
18623
$ GAWK :== $disk1:[gnubin]GAWK
18627
(Substitute the actual location of @code{gawk.exe} for
18628
@samp{$disk1:[gnubin]}.) The symbol should be placed in the
18629
@file{login.com} of any user who wishes to run @code{gawk},
18630
so that it will be defined every time the user logs on.
18631
Alternatively, the symbol may be placed in the system-wide
18632
@file{sylogin.com} procedure, which will allow all users
18633
to run @code{gawk}.
18635
Optionally, the help entry can be loaded into a VMS help library:
18638
$ LIBRARY/HELP SYS$HELP:HELPLIB [.VMS]GAWK.HLP
18642
(You may want to substitute a site-specific help library rather than
18643
the standard VMS library @samp{HELPLIB}.) After loading the help text,
18650
will provide information about both the @code{gawk} implementation and the
18651
@code{awk} programming language.
18653
The logical name @samp{AWK_LIBRARY} can designate a default location
18654
for @code{awk} program files. For the @samp{-f} option, if the specified
18655
filename has no device or directory path information in it, @code{gawk}
18656
will look in the current directory first, then in the directory specified
18657
by the translation of @samp{AWK_LIBRARY} if the file was not found.
18658
If the file is still not found after both directories have been searched,
@code{gawk} appends the suffix @samp{.awk} to the filename and retries
the search. If @samp{AWK_LIBRARY} is not defined, that
18661
portion of the file search will fail benignly.
18663
@node VMS Running, VMS POSIX, VMS Installation Details, VMS Installation
18664
@appendixsubsec Running @code{gawk} on VMS
18666
Command line parsing and quoting conventions are significantly different
18667
on VMS, so examples in this @value{DOCUMENT} or from other sources often need minor
18668
changes. They @emph{are} minor though, and all @code{awk} programs
18669
should run correctly.
18671
Here are a couple of trivial tests:
18674
$ gawk -- "BEGIN @{print ""Hello, World!""@}"
18675
$ gawk -"W" version
18676
! could also be -"W version" or "-W version"
18680
Note that upper-case and mixed-case text must be quoted.
18682
The VMS port of @code{gawk} includes a @code{DCL}-style interface in addition
18683
to the original shell-style interface (see the help entry for details).
18684
One side-effect of dual command line parsing is that if there is only a
18685
single parameter (as in the quoted string program above), the command
18686
becomes ambiguous. To work around this, the normally optional @samp{--}
18687
flag is required to force Unix style rather than @code{DCL} parsing. If any
18688
other dash-type options (or multiple parameters such as data files to be
18689
processed) are present, there is no ambiguity and @samp{--} can be omitted.
18691
The default search path when looking for @code{awk} program files specified
18692
by the @samp{-f} option is @code{"SYS$DISK:[],AWK_LIBRARY:"}. The logical
18693
name @samp{AWKPATH} can be used to override this default. The format
18694
of @samp{AWKPATH} is a comma-separated list of directory specifications.
18695
When defining it, the value should be quoted so that it retains a single
18696
translation, and not a multi-translation @code{RMS} searchlist.
18698
@node VMS POSIX, , VMS Running, VMS Installation
18699
@appendixsubsec Building and Using @code{gawk} on VMS POSIX
18701
Ignore the instructions above, although @file{vms/gawk.hlp} should still
18702
be made available in a help library. Make sure that the @code{configure}
18703
script is executable; use @samp{chmod +x}
18704
on it if necessary. Then execute the following commands:
18709
psx> CC=vms/posix-cc.sh configure
18710
psx> CC=c89 make gawk
18715
The first command will construct files @file{config.h} and @file{Makefile}
18716
out of templates. The second command will compile and link @code{gawk}.
18718
Due to a @code{make} bug in VMS POSIX V1.0 and V1.1,
18719
the file @file{awktab.c} must be given as an explicit target or it will
18720
not be built and the final link step will fail.
18723
@code{"Could not find lib m in lib list"}; it is harmless, caused by the
18724
explicit use of @samp{-lm} as a linker option which is not needed
18725
under VMS POSIX. Under V1.1 (but not V1.0) a problem with the @code{yacc}
18726
skeleton @file{/etc/yyparse.c} will cause a compiler warning for
18727
@file{awktab.c}, followed by a linker warning about compilation warnings
18728
in the resulting object module. These warnings can be ignored.
18730
Once built, @code{gawk} will work like any other shell utility. Unlike
18731
the normal VMS port of @code{gawk}, no special command line manipulation is
18732
needed in the VMS POSIX environment.
18734
@c Rewritten by Scott Deifik <scottd@amgen.com>
18735
@c and Darrel Hankerson <hankedr@mail.auburn.edu>
18736
@node PC Installation, Atari Installation, VMS Installation, Installation
18737
@appendixsec MS-DOS and OS/2 Installation and Compilation
18739
@cindex installation, MS-DOS and OS/2
18740
If you have received a binary distribution prepared by the DOS
18741
maintainers, then @code{gawk} and the necessary support files will appear
18742
under the @file{gnu} directory, with executables in @file{gnu/bin},
18743
libraries in @file{gnu/lib/awk}, and manual pages under @file{gnu/man}.
18744
This is designed for easy installation to a @file{/gnu} directory on your
18745
drive, but the files can be installed anywhere provided @code{AWKPATH} is
18746
set properly. Regardless of the installation directory, the first line of
18747
@file{igawk.cmd} and @file{igawk.bat} (in @file{gnu/bin}) may need to be
18750
The binary distribution will contain a separate file describing the
18751
contents. In particular, it may include more than one version of the
18752
@code{gawk} executable. OS/2 binary distributions may have a
18753
different arrangement, but installation is similar.
18755
The OS/2 and MS-DOS versions of @code{gawk} search for program files as
18756
described in @ref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}.
18757
However, semicolons (rather than colons) separate elements
18758
in the @code{AWKPATH} variable. If @code{AWKPATH} is not set or is empty,
18759
then the default search path is @code{@w{".;c:/lib/awk;c:/gnu/lib/awk"}}.
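
The choice of a semicolon matters because a colon already appears in
drive specifications such as @samp{c:}. As a rough illustration, a search
path can be taken apart one element at a time with the separator passed
as a parameter. The function @code{next_path_element} is hypothetical
and is not taken from the @code{gawk} sources.

```c
#include <stddef.h>
#include <string.h>

/*
 * next_path_element --- copy the first element of path into buf
 *
 * Returns a pointer to the remainder of the path, or NULL if the
 * element just copied was the last one. An element longer than
 * size - 1 characters is truncated.
 */

const char *
next_path_element(const char *path, char sep, char *buf, size_t size)
{
	const char *end;
	size_t len;

	if (path == NULL || buf == NULL || size == 0)
		return NULL;

	end = strchr(path, sep);
	len = (end != NULL) ? (size_t) (end - path) : strlen(path);
	if (len >= size)
		len = size - 1;
	memcpy(buf, path, len);
	buf[len] = '\0';

	return (end != NULL) ? end + 1 : NULL;
}
```

Called repeatedly on the default path with @code{';'} as the separator,
the sketch yields @samp{.}, @samp{c:/lib/awk} and @samp{c:/gnu/lib/awk};
the colons after the drive letters are left untouched.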
18761
An @code{sh}-like shell (as opposed to @code{command.com} under MS-DOS
18762
or @code{cmd.exe} under OS/2) may be useful for @code{awk} programming.
18763
Ian Stewartson has written an excellent shell for MS-DOS and OS/2, and a
18764
@code{ksh} clone and GNU Bash are available for OS/2. The file
18765
@file{README_d/README.pc} in the @code{gawk} distribution contains
18766
information on these shells. Users of Stewartson's shell on DOS should
18767
examine its documentation on handling of command-lines. In particular,
18768
the setting for @code{gawk} in the shell configuration may need to be
18769
changed, and the @code{ignoretype} option may also be of interest.
18771
@code{gawk} can be compiled for MS-DOS and OS/2 using the GNU development tools
18772
from DJ Delorie (DJGPP, MS-DOS-only) or Eberhard Mattes (EMX, MS-DOS and OS/2).
18773
Microsoft C can be used to build 16-bit versions for MS-DOS and OS/2. The file
18774
@file{README_d/README.pc} in the @code{gawk} distribution contains additional
18775
notes, and @file{pc/Makefile} contains important notes on compilation options.
18777
To build @code{gawk}, copy the files in the @file{pc} directory to the
18778
directory with the rest of the @code{gawk} sources. The @file{Makefile}
18779
contains a configuration section with comments, and may need to be
18780
edited in order to work with your @code{make} utility.
18782
The @file{Makefile} contains a number of targets for building various MS-DOS
18783
and OS/2 versions. A list of targets will be printed if the @code{make}
18784
command is given without a target. As an example, to build @code{gawk}
18785
using the DJGPP tools, enter @samp{make djgpp}.
18787
Using @code{make} to run the standard tests and to install @code{gawk}
18788
requires additional Unix-like tools, including @code{sh}, @code{sed}, and
18789
@code{cp}. In order to run the tests, the @file{test/*.ok} files may need to
18790
be converted so that they have the usual DOS-style end-of-line markers. Most
18791
of the tests will work properly with Stewartson's shell along with the
18792
companion utilities or appropriate GNU utilities. However, some editing of
18793
@file{test/Makefile} is required. It is recommended that the file
18794
@file{pc/Makefile.tst} be copied to @file{test/Makefile} as a
18795
replacement. Details can be found in @file{README_d/README.pc}.
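
If no conversion utility is at hand, the end-of-line conversion mentioned
above is simple enough to write directly. The following is a minimal
sketch, not part of the @code{gawk} distribution; the function name
@code{lf_to_crlf} and its buffer-based interface are assumptions made for
the example.

```c
#include <stddef.h>

/*
 * lf_to_crlf --- copy src to dst, expanding each bare LF to CR LF
 *
 * A line feed that already follows a carriage return is copied as is,
 * so converting a file twice does no harm. Returns the number of
 * bytes written, or (size_t) -1 if dst is too small.
 */

size_t
lf_to_crlf(const char *src, size_t len, char *dst, size_t size)
{
	size_t i;
	size_t out = 0;

	for (i = 0; i < len; i++) {
		int bare_lf = (src[i] == '\n' && (i == 0 || src[i - 1] != '\r'));

		if (out + (bare_lf ? 2 : 1) > size)
			return (size_t) -1;
		if (bare_lf)
			dst[out++] = '\r';
		dst[out++] = src[i];
	}
	return out;
}
```

A program built around this function should read and write the files in
binary mode, since a text-mode stream on MS-DOS or OS/2 may translate
line endings on its own.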
18797
@node Atari Installation, Amiga Installation, PC Installation, Installation
18798
@appendixsec Installing @code{gawk} on the Atari ST
18800
@c based on material from Michal Jaegermann <michal@gortel.phys.ualberta.ca>
18803
@cindex installation, atari
18804
There are no substantial differences when installing @code{gawk} on
18805
various Atari models. Compiled @code{gawk} executables do not require
18806
a large amount of memory with most @code{awk} programs and should run on all
18807
Motorola processor based models (called further ST, even if that is not
18810
In order to use @code{gawk}, you need to have a shell, either text or
18811
graphics, that does not map all the characters of a command line to
18812
upper-case. Maintaining case distinction in option flags is very
18813
important (@pxref{Options, ,Command Line Options}).
18814
These days this is the default, and it may only be a problem for some
18815
very old machines. If your system does not preserve the case of option
18816
flags, you will need to upgrade your tools. Support for I/O
18817
redirection is necessary to make it easy to import @code{awk} programs
18818
from other environments. Pipes are nice to have, but not vital.
18821
* Atari Compiling:: Compiling @code{gawk} on Atari
18822
* Atari Using:: Running @code{gawk} on Atari
18825
@node Atari Compiling, Atari Using, Atari Installation, Atari Installation
18826
@appendixsubsec Compiling @code{gawk} on the Atari ST
18828
A proper compilation of @code{gawk} sources when @code{sizeof(int)}
18829
differs from @code{sizeof(void *)} requires an ANSI C compiler. An initial
18830
port was done with @code{gcc}. You may actually prefer executables
18831
where @code{int}s are four bytes wide, but the other variant works as well.
18833
You may need quite a bit of memory when trying to recompile the @code{gawk}
18834
sources, as some source files (@file{regex.c} in particular) are quite
18835
big. If you run out of memory compiling such a file, try reducing the
18836
optimization level for this particular file; this may help.
18839
With a reasonable shell (Bash will do), and in particular if you run
18840
Linux, MiNT or a similar operating system, you have a pretty good
18841
chance that the @code{configure} utility will succeed. Otherwise
18842
sample versions of @file{config.h} and @file{Makefile.st} are given in the
18843
@file{atari} subdirectory and can be edited and copied to the
18844
corresponding files in the main source directory. Even if
18845
@code{configure} produced something, it might be advisable to compare
18846
its results with the sample versions and possibly make adjustments.
18848
Some @code{gawk} source code fragments depend on a preprocessor define
18849
@samp{atarist}. This basically assumes the TOS environment with @code{gcc}.
18850
Modify these sections as appropriate if they are not right for your
18851
environment. Also see the remarks about @code{AWKPATH} and @code{envsep} in
18852
@ref{Atari Using, ,Running @code{gawk} on the Atari ST}.
18854
As shipped, the sample @file{config.h} claims that the @code{system}
18855
function is missing from the libraries, which is not true, and an
18856
alternative implementation of this function is provided in
18857
@file{atari/system.c}. Depending upon your particular combination of
18858
shell and operating system, you may wish to change the file to indicate
18859
that @code{system} is available.
18861
@node Atari Using, , Atari Compiling, Atari Installation
18862
@appendixsubsec Running @code{gawk} on the Atari ST
18864
An executable version of @code{gawk} should be placed, as usual,
18865
anywhere in your @code{PATH} where your shell can find it.
18867
While executing, @code{gawk} creates a number of temporary files. When
18868
using @code{gcc} libraries for TOS, @code{gawk} looks for either of
18869
the environment variables @code{TEMP} or @code{TMPDIR}, in that order.
18870
If either one is found, its value is assumed to be a directory for
18871
temporary files. This directory must exist, and if you can spare the
18872
memory, it is a good idea to put it on a RAM drive. If neither
18873
@code{TEMP} nor @code{TMPDIR} is found, then @code{gawk} uses the
18874
current directory for its temporary files.
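
The order of preference just described can be written down as a short C
sketch. It is an illustration only; the function names are hypothetical,
and representing the current directory as @code{"."} is an assumption of
the example rather than something taken from the @code{gcc} libraries
for TOS.

```c
#include <stdlib.h>

/* choose_tmpdir --- pick a directory from the values of TEMP and TMPDIR */

const char *
choose_tmpdir(const char *temp, const char *tmpdir)
{
	/* a variable that is set but empty counts as found in this sketch */
	if (temp != NULL)
		return temp;
	if (tmpdir != NULL)
		return tmpdir;
	return ".";	/* neither is set: use the current directory */
}

/* tmpdir_from_environment --- apply the same rule to the environment */

const char *
tmpdir_from_environment(void)
{
	return choose_tmpdir(getenv("TEMP"), getenv("TMPDIR"));
}
```

Keeping the decision in @code{choose_tmpdir} separate from the calls to
@code{getenv} makes the rule easy to check without changing the
environment.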
18876
The ST version of @code{gawk} searches for its program files as described in
18877
@ref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}.
18878
The default value for the @code{AWKPATH} variable is taken from
18879
@code{DEFPATH} defined in @file{Makefile}. The sample @code{gcc}/TOS
18880
@file{Makefile} for the ST in the distribution sets @code{DEFPATH} to
18881
@code{@w{".,c:\lib\awk,c:\gnu\lib\awk"}}. The search path can be
18882
modified by explicitly setting @code{AWKPATH} to whatever you wish.
18883
Note that colons cannot be used on the ST to separate elements in the
18884
@code{AWKPATH} variable, since they have another, reserved, meaning.
18885
Instead, you must use a comma to separate elements in the path. When
18886
recompiling, the separating character can be modified by initializing
18887
the @code{envsep} variable in @file{atari/gawkmisc.atr} to another
18890
Although @code{awk} allows great flexibility in doing I/O redirections
18891
from within a program, this facility should be used with care on the ST
18892
running under TOS. In some circumstances the OS routines for file
18893
handle pool processing lose track of certain events, causing the
18894
computer to crash, and requiring a reboot. Often a warm reboot is
18895
sufficient. Fortunately, this happens infrequently, and in rather
18896
esoteric situations. In particular, avoid having one part of an
18897
@code{awk} program using @code{print} statements explicitly redirected
18898
to @code{"/dev/stdout"}, while other @code{print} statements use the
18899
default standard output, and a calling shell has redirected standard
18902
When @code{gawk} is compiled with the ST version of @code{gcc} and its
18903
usual libraries, it will accept both @samp{/} and @samp{\} as path separators.
18904
While this is convenient, it should be remembered that this removes one,
18905
technically valid, character (@samp{/}) from your file names, and that
18906
it may create problems for external programs, called via the @code{system}
18907
function, which may not support this convention. Whenever it is possible
18908
that a file created by @code{gawk} will be used by some other program,
18909
use only backslashes. Also remember that in @code{awk}, backslashes in
18910
strings have to be doubled in order to get literal backslashes
18911
(@pxref{Escape Sequences}).
18913
@node Amiga Installation, Bugs, Atari Installation, Installation
18914
@appendixsec Installing @code{gawk} on an Amiga
18917
@cindex installation, amiga
18918
You can install @code{gawk} on an Amiga system using a Unix emulation
18919
environment available via anonymous @code{ftp} from
18920
@code{wuarchive.wustl.edu} in the directory @file{pub/aminet/dev/gcc}.
18921
This includes a shell based on @code{pdksh}. The primary component of
18922
this environment is a Unix emulation library, @file{ixemul.lib}.
18923
@c could really use more background here, who wrote this, etc.
18925
A more complete distribution for the Amiga is available on
18926
the FreshFish CD-ROM from:
18929
Amiga Library Services @*
18930
610 North Alma School Road, Suite 18 @*
18931
Chandler, AZ 85224 USA @*
18932
Phone: +1-602-491-0048 @*
18933
FAX: +1-602-491-0048 @*
18934
E-mail: @code{orders@@amigalib.com}
18937
Once you have the distribution, you can configure @code{gawk} simply by
18938
running @code{configure}:
18941
configure -v m68k-cbm-amigados
18944
Then run @code{make}, and you should be all set!
18945
(If these steps do not work, please send in a bug report;
18946
@pxref{Bugs, ,Reporting Problems and Bugs}.)
18948
@node Bugs, Other Versions, Amiga Installation, Installation
18949
@appendixsec Reporting Problems and Bugs
18951
If you have problems with @code{gawk} or think that you have found a bug,
18952
please report it to the developers; we cannot promise to do anything
18953
but we might well want to fix it.
18955
Before reporting a bug, make sure you have actually found a real bug.
18956
Carefully reread the documentation and see if it really says you can do
18957
what you're trying to do. If it's not clear whether you should be able
18958
to do something or not, report that too; it's a bug in the documentation!
18960
Before reporting a bug or trying to fix it yourself, try to isolate it
18961
to the smallest possible @code{awk} program and input data file that
18962
reproduces the problem. Then send us the program and data file,
18963
some idea of what kind of Unix system you're using, and the exact results
18964
@code{gawk} gave you. Also say what you expected to occur; this will help
18965
us decide whether the problem was really in the documentation.
18967
Once you have a precise problem, there are two e-mail addresses you
18972
@samp{bug-gnu-utils@@prep.ai.mit.edu}
18975
@samp{uunet!prep.ai.mit.edu!bug-gnu-utils}
18979
version number of @code{gawk} you are using. You can get this information
18980
with the command @samp{gawk --version}.
18981
You should send a carbon copy of your mail to Arnold Robbins, who can
18982
be reached at @samp{arnold@@gnu.ai.mit.edu}.
18984
@cindex @code{comp.lang.awk}
18985
@strong{Important!} Do @emph{not} try to report bugs in @code{gawk} by
18986
posting to the Usenet/Internet newsgroup @code{comp.lang.awk}.
18987
While the @code{gawk} developers do occasionally read this newsgroup,
18988
there is no guarantee that we will see your posting. The steps described
18989
above are the official, recognized ways for reporting bugs.
18991
Non-bug suggestions are always welcome as well. If you have questions
18992
about things that are unclear in the documentation or are just obscure
18993
features, ask Arnold Robbins; he will try to help you out, although he
18994
may not have the time to fix the problem. You can send him electronic
18995
mail at the Internet address above.
18997
If you find bugs in one of the non-Unix ports of @code{gawk}, please send
18998
an electronic mail message to the person who maintains that port. They
18999
are listed below, and also in the @file{README} file in the @code{gawk}
19000
distribution. Information in the @file{README} file should be considered
19001
authoritative if it conflicts with this @value{DOCUMENT}.
19003
The people maintaining the non-Unix ports of @code{gawk} are:
19005
@cindex Deifik, Scott
19007
@cindex Hankerson, Darrel
19008
@cindex Jaegermann, Michal
19009
@cindex Rankin, Pat
19010
@cindex Rommel, Kai Uwe
19013
Scott Deifik, @samp{scottd@@amgen.com}, and
19014
Darrel Hankerson, @samp{hankedr@@mail.auburn.edu}.
19017
Kai Uwe Rommel, @samp{rommel@@ars.de}.
19020
Pat Rankin, @samp{rankin@@eql.caltech.edu}.
19023
Michal Jaegermann, @samp{michal@@gortel.phys.ualberta.ca}.
19026
Fred Fish, @samp{fnf@@amigalib.com}.
19029
If your bug is also reproducible under Unix, please send copies of your
19030
report to the general GNU bug list, as well as to Arnold Robbins, at the
19031
addresses listed above.
19033
@node Other Versions, , Bugs, Installation
19034
@appendixsec Other Freely Available @code{awk} Implementations
19036
There are two other freely available @code{awk} implementations.
19037
This section briefly describes where to get them.
19040
@cindex Kernighan, Brian
19041
@cindex anonymous @code{ftp}
19042
@cindex @code{ftp}, anonymous
19043
@item Unix @code{awk}
19044
Brian Kernighan has been able to make his implementation of
19045
@code{awk} freely available. You can get it via anonymous @code{ftp}
19046
to the host @code{@w{netlib.att.com}}. Change directory to
19047
@file{/netlib/research}. Use ``binary'' or ``image'' mode, and
19048
retrieve @file{awk.bundle.Z}.
19050
This is a shell archive that has been compressed with the @code{compress}
19051
utility. It can be uncompressed with either @code{uncompress} or the
19052
GNU @code{gunzip} utility.
19054
This version requires an ANSI C compiler; GCC (the GNU C compiler)
19055
works quite nicely.
19057
@cindex Brennan, Michael
19058
@cindex @code{mawk}
19060
Michael Brennan has written an independent implementation of @code{awk},
19061
called @code{mawk}. It is available under the GPL
19062
(@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}),
19063
just as @code{gawk} is.
19065
You can get it via anonymous @code{ftp} to the host
19066
@code{@w{oxy.edu}}. Change directory to @file{/public}. Use ``binary''
19067
or ``image'' mode, and retrieve @file{mawk1.2.1.tar.gz} (or the latest
19068
version that is there).
19070
@code{gunzip} may be used to decompress this file. Installation
19071
is similar to @code{gawk}'s
19072
(@pxref{Unix Installation, , Compiling and Installing @code{gawk} on Unix}).
19075
@node Notes, Glossary, Installation, Top
19076
@appendix Implementation Notes
19078
This appendix contains information mainly of interest to implementors and
19079
maintainers of @code{gawk}. Everything in it applies specifically to
19080
@code{gawk}, and not to other implementations.
19083
* Compatibility Mode:: How to disable certain @code{gawk} extensions.
19084
* Additions:: Making Additions To @code{gawk}.
19085
* Future Extensions:: New features that may be implemented one day.
19086
* Improvements:: Suggestions for improvements by volunteers.
19089
@node Compatibility Mode, Additions, Notes, Notes
19090
@appendixsec Downward Compatibility and Debugging
19092
@xref{POSIX/GNU, ,Extensions in @code{gawk} Not in POSIX @code{awk}},
19093
for a summary of the GNU extensions to the @code{awk} language and program.
19094
All of these features can be turned off by invoking @code{gawk} with the
19095
@samp{--traditional} option, or with the @samp{--posix} option.
19097
If @code{gawk} is compiled for debugging with @samp{-DDEBUG}, then there
19098
is one more option available on the command line:
19101
@item -W parsedebug
19102
@itemx --parsedebug
19103
Print out the parse stack information as the program is being parsed.
19106
This option is intended only for serious @code{gawk} developers,
19107
and not for the casual user. It probably has not even been compiled into
19108
your version of @code{gawk}, since it slows down execution.
19110
@node Additions, Future Extensions, Compatibility Mode, Notes
19111
@appendixsec Making Additions to @code{gawk}
19113
If you should find that you wish to enhance @code{gawk} in a significant
19114
fashion, you are perfectly free to do so. That is the point of having
19115
free software; the source code is available, and you are free to change
19116
it as you wish (@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}).
19118
This section discusses the ways you might wish to change @code{gawk},
19119
and any considerations you should bear in mind.
19122
* Adding Code:: Adding code to the main body of @code{gawk}.
19123
* New Ports:: Porting @code{gawk} to a new operating system.
19126
@node Adding Code, New Ports, Additions, Additions
19127
@appendixsubsec Adding New Features
19129
@cindex adding new features
19130
@cindex features, adding
19131
You are free to add any new features you like to @code{gawk}.
19132
However, if you want your changes to be incorporated into the @code{gawk}
19133
distribution, there are several steps that you need to take in order to
19134
make it possible for me to include your changes.
19138
Get the latest version.
19139
It is much easier for me to integrate changes if they are relative to
19140
the most recent distributed version of @code{gawk}. If your version of
19141
@code{gawk} is very old, I may not be able to integrate them at all.
19142
@xref{Getting, ,Getting the @code{gawk} Distribution},
19143
for information on getting the latest version of @code{gawk}.
19147
Follow the @cite{GNU Coding Standards}.
19150
See @inforef{Top, , Version, standards, GNU Coding Standards}.
19152
This document describes how GNU software should be written. If you haven't
19153
read it, please do so, preferably @emph{before} starting to modify @code{gawk}.
19154
(The @cite{GNU Coding Standards} are available as part of the Autoconf
19155
distribution, from the FSF.)
19157
@cindex @code{gawk} coding style
19158
@cindex coding style used in @code{gawk}
19160
Use the @code{gawk} coding style.
19161
The C code for @code{gawk} follows the instructions in the
19162
@cite{GNU Coding Standards}, with minor exceptions. The code is formatted
19163
using the traditional ``K&R'' style, particularly as regards the placement
19164
of braces and the use of tabs. In brief, the coding rules for @code{gawk}
19169
Use old style (non-prototype) function headers when defining functions.
19172
Put the name of the function at the beginning of its own line.
19175
Put the return type of the function, even if it is @code{int}, on the
19176
line above the line with the name and arguments of the function.
19179
The declarations for the function arguments should not be indented.
19182
Put spaces around parentheses used in control structures
19183
(@code{if}, @code{while}, @code{for}, @code{do}, @code{switch}
19184
and @code{return}).
19187
Do not put spaces in front of parentheses used in function calls.
19190
Put spaces around all C operators, and after commas in function calls.
19193
Do not use the comma operator to produce multiple side-effects, except
19194
in @code{for} loop initialization and increment parts, and in macro bodies.
19197
Use real tabs for indenting, not spaces.
19200
Use the ``K&R'' brace layout style.
19203
Use comparisons against @code{NULL} and @code{'\0'} in the conditions of
19204
@code{if}, @code{while} and @code{for} statements, and in the @code{case}s
19205
of @code{switch} statements, instead of just the
19206
plain pointer or character value.
19209
Use the @code{TRUE}, @code{FALSE}, and @code{NULL} symbolic constants,
19210
and the character constant @code{'\0'} where appropriate, instead of @code{1}
19214
Provide one-line descriptive comments for each function.
19217
Do not use @samp{#elif}. Many older Unix C compilers cannot handle it.
19220
If I have to reformat your code to follow the coding style used in
19221
@code{gawk}, I may not bother.
19224
Be prepared to sign the appropriate paperwork.
19225
In order for the FSF to distribute your changes, you must either place
19226
those changes in the public domain, and submit a signed statement to that
19227
effect, or assign the copyright in your changes to the FSF.
19228
Both of these actions are easy to do, and @emph{many} people have done so
19229
already. If you have questions, please contact me
19230
(@pxref{Bugs, , Reporting Problems and Bugs}),
19231
or @code{gnu@@prep.ai.mit.edu}.
19234
Update the documentation.
19235
Along with your new code, please supply new sections and/or chapters
19236
for this @value{DOCUMENT}. If at all possible, please use real
19237
Texinfo, instead of just supplying unformatted ASCII text (although
19238
even that is better than no documentation at all).
19239
Conventions to be followed in @cite{@value{TITLE}} are provided
19240
after the @samp{@@bye} at the end of the Texinfo source file.
19241
If possible, please update the man page as well.
19243
You will also have to sign paperwork for your documentation changes.
19246
Submit changes as context diffs or unified diffs.
19247
Use @samp{diff -c -r -N} or @samp{diff -u -r -N} to compare
19248
the original @code{gawk} source tree with your version.
19249
(I find context diffs to be more readable, but unified diffs are
19251
I recommend using the GNU version of @code{diff}.
19252
Send the output produced by either run of @code{diff} to me when you
19253
submit your changes.
19254
@xref{Bugs, , Reporting Problems and Bugs}, for the electronic mail
19257
Using this format makes it easy for me to apply your changes to the
19258
master version of the @code{gawk} source code (using @code{patch}).
19259
If I have to apply the changes manually, using a text editor, I may
19260
not do so, particularly if there are lots of changes.
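
Several of the coding rules listed earlier in this section can be seen
together in a short sketch. The function @code{count_words} is invented
for the purpose and does not come from the @code{gawk} sources, and
@code{TRUE} and @code{FALSE} are defined locally only so that the
example stands alone. One deliberate departure: the header is written in
prototype form so that current compilers accept it, whereas the rules
ask for an old style header, shown in the comment.

```c
#include <stddef.h>

/* defined here only to keep the sketch self-contained */
#define TRUE	1
#define FALSE	0

/*
 * The rules ask for an old style header, which would read:
 *
 *	int
 *	count_words(str)
 *	const char *str;
 *	{
 */

/* count_words --- return the number of blank-separated words in str */

int
count_words(const char *str)
{
	int count = 0;
	int in_word = FALSE;
	const char *cp;

	if (str == NULL)
		return 0;

	for (cp = str; *cp != '\0'; cp++) {
		if (*cp == ' ' || *cp == '\t' || *cp == '\n')
			in_word = FALSE;
		else if (! in_word) {
			in_word = TRUE;
			count++;
		}
	}
	return count;
}
```

Note the return type on its own line, the function name at the start of
the next line, the explicit comparisons against @code{NULL} and
@code{'\0'}, the space after @code{if} and @code{for}, and the use of
real tabs for indentation.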
19263
Although this sounds like a lot of work, please remember that while you
19264
may write the new code, I have to maintain it and support it, and if it
19265
isn't possible for me to do that with a minimum of extra work, then I
19268
@node New Ports, , Adding Code, Additions
19269
@appendixsubsec Porting @code{gawk} to a New Operating System
19271
@cindex porting @code{gawk}
19272
If you wish to port @code{gawk} to a new operating system, there are
19273
several steps to follow.
19277
Follow the guidelines in
19278
@ref{Adding Code, ,Adding New Features},
19279
concerning coding style, submission of diffs, and so on.
19282
When doing a port, bear in mind that your code must co-exist peacefully
19283
with the rest of @code{gawk}, and the other ports. Avoid gratuitous
19284
changes to the system-independent parts of the code. If at all possible,
19285
avoid sprinkling @samp{#ifdef}s just for your port throughout the
19288
If the changes needed for a particular system affect too much of the
19289
code, I probably will not accept them. In such a case, you will, of course,
19290
be able to distribute your changes on your own, as long as you comply
19292
(@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}).
19295
A number of the files that come with @code{gawk} are maintained by other
19296
people at the Free Software Foundation. Thus, you should not change them
19297
unless it is for a very good reason. That is, changes are not out of the
19298
question, but changes to these files will be scrutinized extra carefully.
19299
The files are @file{alloca.c}, @file{getopt.h}, @file{getopt.c},
19300
@file{getopt1.c}, @file{regex.h}, @file{regex.c}, @file{dfa.h},
19301
@file{dfa.c}, @file{install-sh}, and @file{mkinstalldirs}.
19304
Be willing to continue to maintain the port.
19305
Non-Unix operating systems are supported by volunteers who maintain
19306
the code needed to compile and run @code{gawk} on their systems. If no one
19307
volunteers to maintain a port, that port becomes unsupported, and it may
19308
be necessary to remove it from the distribution.
19311
Supply an appropriate @file{gawkmisc.???} file.
19312
Each port has its own @file{gawkmisc.???} that implements certain
19313
operating system specific functions. This is cleaner than a plethora of
19314
@samp{#ifdef}s scattered throughout the code. The @file{gawkmisc.c} in
19315
the main source directory includes the appropriate
19316
@file{gawkmisc.???} file from each subdirectory.
19317
Be sure to update it as well.
19319
Each port's @file{gawkmisc.???} file has a suffix reminiscent of the machine
19320
or operating system for the port. For example, @file{pc/gawkmisc.pc} and
19321
@file{vms/gawkmisc.vms}. The use of separate suffixes, instead of plain
19322
@file{gawkmisc.c}, makes it possible to move files from a port's subdirectory
19323
into the main subdirectory, without accidentally destroying the real
19324
@file{gawkmisc.c} file. (Currently, this is only an issue for the MS-DOS
19328
Supply a @file{Makefile} and any other C source and header files that are
19329
necessary for your operating system. All your code should be in a
19330
separate subdirectory, with a name that is the same as, or reminiscent
19331
of, either your operating system or the computer system. If possible,
19332
try to structure things so that it is not necessary to move files out
19333
of the subdirectory into the main source directory. If that is not
19334
possible, then be sure to avoid using names for your files that
19335
duplicate the names of files in the main source directory.
19338
Update the documentation.
19339
Please write a section (or sections) for this @value{DOCUMENT} describing the
steps needed to compile and install @code{gawk} on your system.
19344
Be prepared to sign the appropriate paperwork.
19345
In order for the FSF to distribute your code, you must either place
19346
your code in the public domain, and submit a signed statement to that
19347
effect, or assign the copyright in your code to the FSF.
19349
Both of these actions are easy to do, and @emph{many} people have done so
19350
already. If you have questions, please contact me, or
19351
@code{gnu@@prep.ai.mit.edu}.
19355
Following these steps will make it much easier to integrate your changes
19356
into @code{gawk}, and have them co-exist happily with the code for other
19357
operating systems that is already there.
19359
In the code that you supply, and that you maintain, feel free to use a
19360
coding style and brace layout that suits your taste.
19362
@c why should this be needed? sigh
19366
@node Future Extensions, Improvements, Additions, Notes
19367
@appendixsec Probable Future Extensions
19405
@cindex Wall, Larry
19407
@i{AWK is a language similar to PERL, only considerably more elegant.}
19414
This section briefly lists extensions and possible improvements
19415
that indicate the directions we are
19416
currently considering for @code{gawk}. The file @file{FUTURES} in the
19417
@code{gawk} distribution lists these extensions as well.
19419
This is a list of probable future changes that will be usable by the
19420
@code{awk} language programmer.
19422
@c these are ordered by likelihood
19425
The GNU project is starting to support multiple languages.
19426
It will at least be possible to make @code{gawk} print its warnings and
19427
error messages in languages other than English.
19428
It may be possible for @code{awk} programs to also use the multiple
19429
language facilities, separate from @code{gawk} itself.
19432
It may be possible to map a GDBM/NDBM/SDBM file into an @code{awk} array.
19434
@item A @code{PROCINFO} Array
19435
The special files that provide process-related information
19436
(@pxref{Special Files, ,Special File Names in @code{gawk}})
19437
may be superseded by a @code{PROCINFO} array that would provide the same
19438
information, in an easier to access fashion.
19440
@item More @code{lint} warnings
19441
There are more things that could be checked for portability.
19443
@item Control of subprocess environment
19444
Changes made in @code{gawk} to the array @code{ENVIRON} may be
19445
propagated to subprocesses run by @code{gawk}.
19448
@item @code{RECLEN} variable for fixed length records
19449
Along with @code{FIELDWIDTHS}, this would speed up the processing of
19450
fixed-length records.
19452
@item A @code{restart} keyword
19453
After modifying @code{$0}, @code{restart} would restart the pattern
19454
matching loop, without reading a new record from the input.
19456
@item A @samp{|&} redirection
19457
The @samp{|&} redirection, in place of @samp{|}, would open a two-way
19458
pipeline for communication with a sub-process (via @code{getline} and
19459
@code{print} and @code{printf}).
19461
@item Function valued variables
19462
It would be possible to assign the name of a user-defined or built-in
19463
function to a regular @code{awk} variable, and then call the function
19464
indirectly, by using the regular variable. This would make it possible
19465
to write general purpose sorting and comparing routines, for example,
19466
by simply passing the name of one function into another.
19468
@item A built-in @code{stat} function
19469
The @code{stat} function would provide an easy-to-use hook to the
19470
@code{stat} system call so that @code{awk} programs could determine information
19473
@item A built-in @code{ftw} function
19474
Combined with function valued variables and the @code{stat} function,
19475
@code{ftw} (file tree walk) would make it easy for an @code{awk} program
19476
to walk an entire file tree.
19480
This is a list of probable improvements that will make @code{gawk}
19484
@item An Improved Version of @code{dfa}
19485
The @code{dfa} pattern matcher from GNU @code{grep} has some
19486
problems. Either a new version or a fixed one will deal with some
19487
important regexp matching issues.
19489
@item Use of @code{mmap}
19490
On systems that support the @code{mmap} system call, its use would provide
19491
much faster file input, and considerably simplified input buffer management.
19493
@item Use of GNU @code{malloc}
19494
The GNU version of @code{malloc} could potentially speed up @code{gawk},
19495
since it relies heavily on the use of dynamic memory allocation.
19497
@item Use of the @code{rx} regexp library
19498
The @code{rx} regular expression library could potentially speed up
19499
all regexp operations that require knowing the exact location of matches.
19500
This includes record termination, field and array splitting,
19501
and the @code{sub}, @code{gsub}, @code{gensub} and @code{match} functions.
19504
@node Improvements, , Future Extensions, Notes
19505
@appendixsec Suggestions for Improvements
19507
Here are some projects that would-be @code{gawk} hackers might like to take
19508
on. They vary in size from a few days to a few weeks of programming,
19509
depending on which one you choose and how fast a programmer you are. Please
19510
send any improvements you write to the maintainers at the GNU project.
19511
@xref{Adding Code, , Adding New Features},
19512
for guidelines to follow when adding new features to @code{gawk}.
19513
@xref{Bugs, ,Reporting Problems and Bugs}, for information on
19514
contacting the maintainers.
19518
Compilation of @code{awk} programs: @code{gawk} uses a Bison (YACC-like)
19519
parser to convert the script given it into a syntax tree; the syntax
19520
tree is then executed by a simple recursive evaluator. This method incurs
19521
a lot of overhead, since the recursive evaluator performs many procedure
19522
calls to do even the simplest things.
19524
It should be possible for @code{gawk} to convert the script's parse tree
19525
into a C program which the user would then compile, using the normal
19526
C compiler and a special @code{gawk} library to provide all the needed
19527
functions (regexps, fields, associative arrays, type coercion, and so
19530
An easier possibility might be for an intermediate phase of @code{awk} to
19531
convert the parse tree into a linear byte code form like the one used
19532
in GNU Emacs Lisp. The recursive evaluator would then be replaced by
19533
a straight line byte code interpreter that would be intermediate in speed
19534
between running a compiled program and doing what @code{gawk} does
19538
The programs in the test suite could use documenting in this @value{DOCUMENT}.
19541
See the @file{FUTURES} file for more ideas. Contact us if you would
19542
seriously like to tackle any of the items listed there.
19545
@node Glossary, Copying, Notes, Top
19550
A series of @code{awk} statements attached to a rule. If the rule's
19551
pattern matches an input record, @code{awk} executes the
19552
rule's action. Actions are always enclosed in curly braces.
19553
@xref{Action Overview, ,Overview of Actions}.
19555
@item Amazing @code{awk} Assembler
19556
Henry Spencer at the University of Toronto wrote a retargetable assembler
19557
completely as @code{awk} scripts. It is thousands of lines long, including
19558
machine descriptions for several eight-bit microcomputers.
19559
It is a good example of a
19560
program that would have been better written in another language.
19562
@item Amazingly Workable Formatter (@code{awf})
19563
Henry Spencer at the University of Toronto wrote a formatter that accepts
19564
a large subset of the @samp{nroff -ms} and @samp{nroff -man} formatting
19565
commands, using @code{awk} and @code{sh}.
19568
The American National Standards Institute. This organization produces
19569
many standards, among them the standards for the C and C++ programming
19573
An @code{awk} expression that changes the value of some @code{awk}
19574
variable or data object. An object that you can assign to is called an
19575
@dfn{lvalue}. The assigned values are called @dfn{rvalues}.
19576
@xref{Assignment Ops, ,Assignment Expressions}.
19578
@item @code{awk} Language
19579
The language in which @code{awk} programs are written.
19581
@item @code{awk} Program
19582
An @code{awk} program consists of a series of @dfn{patterns} and
19583
@dfn{actions}, collectively known as @dfn{rules}. For each input record
19584
given to the program, the program's rules are all processed in turn.
19585
@code{awk} programs may also contain function definitions.
19587
@item @code{awk} Script
19588
Another name for an @code{awk} program.
19591
The GNU version of the standard shell (the Bourne-Again shell).
19592
See ``Bourne Shell.''
19595
See ``Bulletin Board System.''
19597
@item Boolean Expression
19598
Named after the English mathematician Boole. See ``Logical Expression.''
19601
The standard shell (@file{/bin/sh}) on Unix and Unix-like systems,
19602
originally written by Steven R.@: Bourne.
19603
Many shells (Bash, @code{ksh}, @code{pdksh}, @code{zsh}) are
19604
generally upwardly compatible with the Bourne shell.
19606
@item Built-in Function
19607
The @code{awk} language provides built-in functions that perform various
19608
numerical, time stamp related, and string computations. Examples are
19609
@code{sqrt} (for the square root of a number) and @code{substr} (for a
19610
substring of a string). @xref{Built-in, ,Built-in Functions}.
19612
@item Built-in Variable
19613
@code{ARGC}, @code{ARGIND}, @code{ARGV}, @code{CONVFMT}, @code{ENVIRON},
19614
@code{ERRNO}, @code{FIELDWIDTHS}, @code{FILENAME}, @code{FNR}, @code{FS},
19615
@code{IGNORECASE}, @code{NF}, @code{NR}, @code{OFMT}, @code{OFS}, @code{ORS},
19616
@code{RLENGTH}, @code{RSTART}, @code{RS}, @code{RT}, and @code{SUBSEP},
19617
are the variables that have special meaning to @code{awk}.
19618
Changing some of them affects @code{awk}'s running environment.
19619
Several of these variables are specific to @code{gawk}.
19620
@xref{Built-in Variables}.
19623
See ``Curly Braces.''
19625
@item Bulletin Board System
19626
A computer system allowing users to log in and read and/or leave messages
19627
for other users of the system, much like leaving paper notes on a bulletin
19631
The system programming language that most GNU software is written in. The
19632
@code{awk} programming language has C-like syntax, and this @value{DOCUMENT}
19633
points out similarities between @code{awk} and C when appropriate.
19636
@cindex ISO Latin-1
19637
@item Character Set
19638
The set of numeric codes used by a computer system to represent the
19639
characters (letters, numbers, punctuation, etc.) of a particular country
19640
or place. The most common character set in use today is ASCII (American
19641
Standard Code for Information Interchange). Many European
19642
countries use an extension of ASCII known as ISO-8859-1 (ISO Latin-1).
19645
A preprocessor for @code{pic} that reads descriptions of molecules
19646
and produces @code{pic} input for drawing them. It was written in @code{awk}
19647
by Brian Kernighan and Jon Bentley, and is available from
19648
@code{@w{netlib@@research.att.com}}.
19650
@item Compound Statement
19651
A series of @code{awk} statements, enclosed in curly braces. Compound
19652
statements may be nested.
19653
@xref{Statements, ,Control Statements in Actions}.
19655
@item Concatenation
19656
Concatenating two strings means sticking them together, one after another,
19657
giving a new string. For example, the string @samp{foo} concatenated with
19658
the string @samp{bar} gives the string @samp{foobar}.
19659
@xref{Concatenation, ,String Concatenation}.
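As a minimal sketch of this entry, the following shell fragment runs a
short @code{awk} program that concatenates two string constants simply by
writing them next to each other. The wrapper name @code{concat_demo} is
hypothetical and exists only for this illustration.

```shell
# concat_demo is a hypothetical wrapper name chosen for this sketch.
concat_demo() {
    awk 'BEGIN {
        s = "foo" "bar"       # adjacent strings are concatenated
        print s, length(s)    # the new string and its length
    }'
}
concat_demo
```

Run as shown, this should print @samp{foobar 6}.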
19661
@item Conditional Expression
19662
An expression using the @samp{?:} ternary operator, such as
19663
@samp{@var{expr1} ? @var{expr2} : @var{expr3}}. The expression
19664
@var{expr1} is evaluated; if the result is true, the value of the whole
19665
expression is the value of @var{expr2}, otherwise the value is
19666
@var{expr3}. In either case, only one of @var{expr2} and @var{expr3}
19667
is evaluated. @xref{Conditional Exp, ,Conditional Expressions}.
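The following sketch illustrates the point that only one of the two
alternatives is evaluated. It assumes nothing beyond a standard
@code{awk}; the wrapper name @code{conditional_demo} and the counter
variables @code{a} and @code{b} are made up for the illustration.

```shell
# conditional_demo is a hypothetical wrapper name chosen for this sketch.
conditional_demo() {
    awk 'BEGIN {
        a = 0; b = 0
        x = (1 > 0) ? ++a : ++b   # condition is true, so only ++a runs
        print x, a, b             # b is still zero afterwards
    }'
}
conditional_demo
```

Run as shown, this should print @samp{1 1 0}: the increment of @code{b}
never happens.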
19669
@item Comparison Expression
19670
A relation that is either true or false, such as @samp{(a < b)}.
19671
Comparison expressions are used in @code{if}, @code{while}, @code{do},
19673
statements, and in patterns to select which input records to process.
19674
@xref{Typing and Comparison, ,Variable Typing and Comparison Expressions}.
19677
The characters @samp{@{} and @samp{@}}. Curly braces are used in
19678
@code{awk} for delimiting actions, compound statements, and function
19682
An area in the language where specifications often were (or still
19683
are) not clear, leading to unexpected or undesirable behavior.
19684
Such areas are marked in this @value{DOCUMENT} with ``(d.c.)'' in the
19685
text, and are indexed under the heading ``dark corner.''
19688
These are numbers and strings of characters. Numbers are converted into
19689
strings and vice versa, as needed.
19690
@xref{Conversion, ,Conversion of Strings and Numbers}.
19692
@item Double Precision
19693
An internal representation of numbers that can have fractional parts.
19694
Double precision numbers keep track of more digits than do single precision
19695
numbers, but operations on them are more expensive. This is the way
19696
@code{awk} stores numeric values. It is the C type @code{double}.
19698
@item Dynamic Regular Expression
19699
A dynamic regular expression is a regular expression written as an
19700
ordinary expression. It could be a string constant, such as
19701
@code{"foo"}, but it may also be an expression whose value can vary.
19702
@xref{Computed Regexps, , Using Dynamic Regexps}.
19705
A collection of strings, of the form @var{name@code{=}val}, that each
19706
program has available to it. Users generally place values into the
19707
environment in order to provide information to various programs. Typical
19708
examples are the environment variables @code{HOME} and @code{PATH}.
19711
See ``Null String.''
19713
@item Escape Sequences
19714
A special sequence of characters used for describing non-printing
19715
characters, such as @samp{\n} for newline, or @samp{\033} for the ASCII
19716
ESC (escape) character. @xref{Escape Sequences}.
19719
When @code{awk} reads an input record, it splits the record into pieces
19720
separated by whitespace (or by a separator regexp which you can
19721
change by setting the built-in variable @code{FS}). Such pieces are
19722
called fields. If the pieces are of fixed length, you can use the built-in
19723
variable @code{FIELDWIDTHS} to describe their lengths.
19724
@xref{Field Separators, ,Specifying How Fields are Separated},
19726
@xref{Constant Size, , Reading Fixed-width Data}.
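As a rough illustration of field splitting, the sketch below feeds two
made-up records to @code{awk}: one split on whitespace (the default), and
one split on colons by setting @code{FS} through the @samp{-F} option.
The wrapper name @code{field_demo} and the sample data are assumptions of
this example. @code{FIELDWIDTHS} is not shown, since it is specific to
@code{gawk}.

```shell
# field_demo is a hypothetical name; the sample records are made up.
field_demo() {
    # Default splitting: runs of spaces and tabs separate the fields,
    # and leading whitespace is ignored.
    printf '  one   two\tthree\n' | awk '{ print NF, $2 }'
    # FS set to a colon with -F; two successive separators yield an
    # empty (null string) second field.
    printf 'alice::1001\n' |
        awk -F: '{ print NF, $1, (($2 == "") ? "null" : $2), $3 }'
}
field_demo
```

The first record should be reported as having three fields, and the
second record as having three fields with a null second field.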
19728
@item Floating Point Number
19729
Often referred to in mathematical terms as a ``rational'' number, this is
19730
just a number that can have a fractional part.
19731
See ``Double Precision'' and ``Single Precision.''
19734
Format strings are used to control the appearance of output in the
19735
@code{printf} statement. Also, data conversions from numbers to strings
19736
are controlled by the format string contained in the built-in variable
19737
@code{CONVFMT}. @xref{Control Letters, ,Format-Control Letters}.
19740
A specialized group of statements used to encapsulate general
19741
or program-specific tasks. @code{awk} has a number of built-in
19742
functions, and also allows you to define your own.
19743
@xref{Built-in, ,Built-in Functions},
19744
and @ref{User-defined, ,User-defined Functions}.
19747
See ``Free Software Foundation.''
19749
@item Free Software Foundation
19750
A non-profit organization dedicated
19751
to the production and distribution of freely distributable software.
19752
It was founded by Richard M.@: Stallman, the author of the original
19753
Emacs editor. GNU Emacs is the most widely used version of Emacs today.
19756
The GNU implementation of @code{awk}.
19758
@item General Public License
19759
This document describes the terms under which @code{gawk} and its source
19760
code may be distributed. (@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE})
19763
``GNU's not Unix''. An on-going project of the Free Software Foundation
19764
to create a complete, freely distributable, POSIX-compliant computing
19768
See ``General Public License.''
19771
Base 16 notation, where the digits are @code{0}-@code{9} and
19772
@code{A}-@code{F}, with @samp{A}
19773
representing 10, @samp{B} representing 11, and so on up to @samp{F} for 15.
19774
Hexadecimal numbers are written in C using a leading @samp{0x},
19775
to indicate their base. Thus, @code{0x12} is 18 (one times 16 plus 2).
19778
Abbreviation for ``Input/Output,'' the act of moving data into and/or
19779
out of a running program.
19782
A single chunk of data read in by @code{awk}. Usually, an @code{awk} input
19783
record consists of one line of text.
19784
@xref{Records, ,How Input is Split into Records}.
19787
A whole number, i.e.@: a number that does not have a fractional part.
19790
In the @code{awk} language, a keyword is a word that has special
19791
meaning. Keywords are reserved and may not be used as variable names.
19793
@code{gawk}'s keywords are:
19799
@code{do@dots{}while},
19801
@code{for@dots{}in},
19811
@item Logical Expression
19812
An expression using the operators for logic, AND, OR, and NOT, written
19813
@samp{&&}, @samp{||}, and @samp{!} in @code{awk}. Often called Boolean
19814
expressions, after the mathematician who pioneered this kind of
19815
mathematical logic.
19818
An expression that can appear on the left side of an assignment
19819
operator. In most languages, lvalues can be variables or array
19820
elements. In @code{awk}, a field designator can also be used as an
19824
A string with no characters in it. It is represented explicitly in
19825
@code{awk} programs by placing two double-quote characters next to
19826
each other (@code{""}). It can appear in input data by having two successive
19827
occurrences of the field separator appear next to each other.
19830
A numeric valued data object. The @code{gawk} implementation uses double
19831
precision floating point to represent numbers.
19832
Very old @code{awk} implementations use single precision floating
19836
Base-eight notation, where the digits are @code{0}-@code{7}.
19837
Octal numbers are written in C using a leading @samp{0},
19838
to indicate their base. Thus, @code{013} is 11 (one times 8 plus 3).
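The arithmetic in the hexadecimal and octal entries of this glossary can
be checked with a short sketch. It uses only the @samp{%x}, @samp{%o}, and
@samp{%d} formats of @code{printf}; the wrapper name @code{radix_demo} is
hypothetical.

```shell
# radix_demo is a hypothetical wrapper name chosen for this sketch.
radix_demo() {
    awk 'BEGIN {
        # Decimal 18 printed in base 16, decimal 11 printed in base 8.
        printf "%x %o\n", 18, 11
        # The positional arithmetic spelled out: 0x12 and 013.
        printf "%d %d\n", 1*16 + 2, 1*8 + 3
    }'
}
radix_demo
```

The first output line should read @samp{12 13} and the second
@samp{18 11}.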
19841
Patterns tell @code{awk} which input records are interesting to which
19844
A pattern is an arbitrary conditional expression against which input is
19845
tested. If the condition is satisfied, the pattern is said to @dfn{match}
19846
the input record. A typical pattern might compare the input record against
19847
a regular expression. @xref{Pattern Overview, ,Pattern Elements}.
19850
The name for a series of standards being developed by the IEEE
19851
that specify a Portable Operating System interface. The ``IX'' denotes
19852
the Unix heritage of these standards. The main standard of interest for
19853
@code{awk} users is
19854
@cite{IEEE Standard for Information Technology, Standard 1003.2-1992,
19855
Portable Operating System Interface (POSIX) Part 2: Shell and Utilities}.
19856
Informally, this standard is often referred to as simply ``P1003.2.''
19859
Variables and/or functions that are meant for use exclusively by library
19860
functions, and not for the main @code{awk} program. Special care must be
19861
taken when naming such variables and functions.
19862
@xref{Library Names, , Naming Library Function Global Variables}.
19864
@item Range (of input lines)
19865
A sequence of consecutive lines from the input file. A pattern
19866
can specify ranges of input lines for @code{awk} to process, or it can
19867
specify single lines. @xref{Pattern Overview, ,Pattern Elements}.
19870
When a function calls itself, either directly or indirectly.
19871
If this isn't clear, refer to the entry for ``recursion.''
19874
Redirection means performing input from other than the standard input
19875
stream, or output to other than the standard output stream.
19877
You can redirect the output of the @code{print} and @code{printf} statements
19878
to a file or a system command, using the @samp{>}, @samp{>>}, and @samp{|}
19879
operators. You can redirect input to the @code{getline} statement using
19880
the @samp{<} and @samp{|} operators.
19881
@xref{Redirection, ,Redirecting Output of @code{print} and @code{printf}},
19882
and @ref{Getline, ,Explicit Input with @code{getline}}.
19885
Short for @dfn{regular expression}. A regexp is a pattern that denotes a
19886
set of strings, possibly an infinite set. For example, the regexp
19887
@samp{R.*xp} matches any string starting with the letter @samp{R}
19888
and ending with the letters @samp{xp}. In @code{awk}, regexps are
19889
used in patterns and in conditional expressions. Regexps may contain
19890
escape sequences. @xref{Regexp, ,Regular Expressions}.
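The sketch below tries the regexp from this entry against a few made-up
input lines. Note that in @code{awk} a regexp matches if it matches
anywhere in the record, so the second call adds the anchors @samp{^} and
@samp{$} to require that the whole record start with @samp{R} and end
with @samp{xp}. The helper name @code{match_lines} and the sample lines
are assumptions of this example. The regexp is passed in as a string and
used with the @samp{~} operator.

```shell
# match_lines is a hypothetical helper: it prints those sample lines
# that match the regexp given as its first argument.
match_lines() {
    printf 'Regexp\nRxp\nregexp\nRegexps\n' | awk -v re="$1" '$0 ~ re'
}
match_lines 'R.*xp'      # may match anywhere within the record
match_lines '^R.*xp$'    # anchored: the whole record must match
```

The unanchored form should also select @samp{Regexps}, while the anchored
form should select only @samp{Regexp} and @samp{Rxp}.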
19892
@item Regular Expression
19895
@item Regular Expression Constant
19896
A regular expression constant is a regular expression written within
19897
slashes, such as @code{/foo/}. This regular expression is chosen
19898
when you write the @code{awk} program, and cannot be changed during
19899
its execution. @xref{Regexp Usage, ,How to Use Regular Expressions}.
19902
A segment of an @code{awk} program that specifies how to process single
19903
input records. A rule consists of a @dfn{pattern} and an @dfn{action}.
19904
@code{awk} reads an input record; then, for each rule, if the input record
19905
satisfies the rule's pattern, @code{awk} executes the rule's action.
19906
Otherwise, the rule does nothing for that input record.
19909
A value that can appear on the right side of an assignment operator.
19910
In @code{awk}, essentially every expression has a value. These values
19914
See ``Stream Editor.''
19916
@item Short-Circuit
19917
The nature of the @code{awk} logical operators @samp{&&} and @samp{||}.
19918
If the value of the entire expression can be deduced from evaluating just
19919
the left-hand side of these operators, the right-hand side will not
19921
(@pxref{Boolean Ops, ,Boolean Expressions}).
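A small sketch can make the short-circuit behavior visible by placing an
increment, which has a side effect, on the right-hand side of each
operator. The wrapper name @code{short_circuit_demo} and the variable
names are hypothetical.

```shell
# short_circuit_demo is a hypothetical wrapper name for this sketch.
short_circuit_demo() {
    awk 'BEGIN {
        n = 0
        r1 = (0 && ++n)   # left side false: ++n is skipped
        r2 = (1 || ++n)   # left side true:  ++n is skipped
        r3 = (1 && ++n)   # left side true:  ++n has to be evaluated
        print r1, r2, r3, n
    }'
}
short_circuit_demo
```

Run as shown, this should print @samp{0 1 1 1}: the counter @code{n} is
incremented only once, by the third expression.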
19924
A side effect occurs when an expression has an effect aside from merely
19925
producing a value. Assignment expressions, increment and decrement
19926
expressions and function calls have side effects.
19927
@xref{Assignment Ops, ,Assignment Expressions}.
19929
@item Single Precision
19930
An internal representation of numbers that can have fractional parts.
19931
Single precision numbers keep track of fewer digits than do double precision
19932
numbers, but operations on them are less expensive in terms of CPU time.
19933
This is the type used by some very old versions of @code{awk} to store
19934
numeric values. It is the C type @code{float}.
19937
The character generated by hitting the space bar on the keyboard.
19940
A file name interpreted internally by @code{gawk}, instead of being handed
19941
directly to the underlying operating system. For example, @file{/dev/stderr}.
19942
@xref{Special Files, ,Special File Names in @code{gawk}}.
19944
@item Stream Editor
19945
A program that reads records from an input stream and processes them one
19946
or more at a time. This is in contrast with batch programs, which may
19947
expect to read their input files in entirety before starting to do
19948
anything, and with interactive programs, which require input from the
19952
A datum consisting of a sequence of characters, such as @samp{I am a
19953
string}. Constant strings are written with double-quotes in the
19954
@code{awk} language, and may contain escape sequences.
19955
@xref{Escape Sequences}.
19958
The character generated by hitting the @kbd{TAB} key on the keyboard.
19959
It usually expands to up to eight spaces upon output.
19962
A computer operating system originally developed in the early 1970's at
19963
AT&T Bell Laboratories. It initially became popular in universities around
19964
the world, and later moved into commercial environments as a software
19965
development system and network server system. There are many commercial
19966
versions of Unix, as well as several work-alike systems whose source code
19967
is freely available (such as Linux, NetBSD, and FreeBSD).
19970
A sequence of space or tab characters occurring inside an input record or a
19974
@node Copying, Index, Glossary, Top
19975
@unnumbered GNU GENERAL PUBLIC LICENSE
19976
@center Version 2, June 1991
19979
Copyright @copyright{} 1989, 1991 Free Software Foundation, Inc.
19980
59 Temple Place --- Suite 330, Boston, MA 02111-1307, USA
19982
Everyone is permitted to copy and distribute verbatim copies
19983
of this license document, but changing it is not allowed.
19986
@c fakenode --- for prepinfo
19987
@unnumberedsec Preamble
19989
The licenses for most software are designed to take away your
19990
freedom to share and change it. By contrast, the GNU General Public
19991
License is intended to guarantee your freedom to share and change free
19992
software---to make sure the software is free for all its users. This
19993
General Public License applies to most of the Free Software
19994
Foundation's software and to any other program whose authors commit to
19995
using it. (Some other Free Software Foundation software is covered by
19996
the GNU Library General Public License instead.) You can apply it to
19997
your programs, too.
19999
When we speak of free software, we are referring to freedom, not
20000
price. Our General Public Licenses are designed to make sure that you
20001
have the freedom to distribute copies of free software (and charge for
20002
this service if you wish), that you receive source code or can get it
20003
if you want it, that you can change the software or use pieces of it
20004
in new free programs; and that you know you can do these things.
20006
To protect your rights, we need to make restrictions that forbid
20007
anyone to deny you these rights or to ask you to surrender the rights.
20008
These restrictions translate to certain responsibilities for you if you
20009
distribute copies of the software, or if you modify it.
20011
For example, if you distribute copies of such a program, whether
20012
gratis or for a fee, you must give the recipients all the rights that
20013
you have. You must make sure that they, too, receive or can get the
20014
source code. And you must show them these terms so they know their
20017
We protect your rights with two steps: (1) copyright the software, and
20018
(2) offer you this license which gives you legal permission to copy,
20019
distribute and/or modify the software.
20021
Also, for each author's protection and ours, we want to make certain
20022
that everyone understands that there is no warranty for this free
20023
software. If the software is modified by someone else and passed on, we
20024
want its recipients to know that what they have is not the original, so
20025
that any problems introduced by others will not reflect on the original
20026
authors' reputations.
20028
Finally, any free program is threatened constantly by software
20029
patents. We wish to avoid the danger that redistributors of a free
20030
program will individually obtain patent licenses, in effect making the
20031
program proprietary. To prevent this, we have made it clear that any
20032
patent must be licensed for everyone's free use or not licensed at all.
20034
The precise terms and conditions for copying, distribution and
20035
modification follow.
20038
@c fakenode --- for prepinfo
20039
@unnumberedsec TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
20042
@center TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
20047
This License applies to any program or other work which contains
20048
a notice placed by the copyright holder saying it may be distributed
20049
under the terms of this General Public License. The ``Program'', below,
20050
refers to any such program or work, and a ``work based on the Program''
20051
means either the Program or any derivative work under copyright law:
20052
that is to say, a work containing the Program or a portion of it,
20053
either verbatim or with modifications and/or translated into another
20054
language. (Hereinafter, translation is included without limitation in
20055
the term ``modification''.) Each licensee is addressed as ``you''.
20057
Activities other than copying, distribution and modification are not
20058
covered by this License; they are outside its scope. The act of
20059
running the Program is not restricted, and the output from the Program
20060
is covered only if its contents constitute a work based on the
20061
Program (independent of having been made by running the Program).
20062
Whether that is true depends on what the Program does.
20065
You may copy and distribute verbatim copies of the Program's
20066
source code as you receive it, in any medium, provided that you
20067
conspicuously and appropriately publish on each copy an appropriate
20068
copyright notice and disclaimer of warranty; keep intact all the
notices that refer to this License and to the absence of any warranty;
and give any other recipients of the Program a copy of this License
along with the Program.
You may charge a fee for the physical act of transferring a copy, and
you may at your option offer warranty protection in exchange for a fee.
You may modify your copy or copies of the Program or any portion
of it, thus forming a work based on the Program, and copy and
distribute such modifications or work under the terms of Section 1
above, provided that you also meet all of these conditions:
You must cause the modified files to carry prominent notices
stating that you changed the files and the date of any change.
You must cause any work that you distribute or publish, that in
whole or in part contains or is derived from the Program or any
part thereof, to be licensed as a whole at no charge to all third
parties under the terms of this License.
If the modified program normally reads commands interactively
when run, you must cause it, when started running for such
interactive use in the most ordinary way, to print or display an
announcement including an appropriate copyright notice and a
notice that there is no warranty (or else, saying that you provide
a warranty) and that users may redistribute the program under
these conditions, and telling the user how to view a copy of this
License. (Exception: if the Program itself is interactive but
does not normally print such an announcement, your work based on
the Program is not required to print an announcement.)
These requirements apply to the modified work as a whole. If
identifiable sections of that work are not derived from the Program,
and can be reasonably considered independent and separate works in
themselves, then this License, and its terms, do not apply to those
sections when you distribute them as separate works. But when you
distribute the same sections as part of a whole which is a work based
on the Program, the distribution of the whole must be on the terms of
this License, whose permissions for other licensees extend to the
entire whole, and thus to each and every part regardless of who wrote it.
Thus, it is not the intent of this section to claim rights or contest
your rights to work written entirely by you; rather, the intent is to
exercise the right to control the distribution of derivative or
collective works based on the Program.
In addition, mere aggregation of another work not based on the Program
with the Program (or with a work based on the Program) on a volume of
a storage or distribution medium does not bring the other work under
the scope of this License.
You may copy and distribute the Program (or a work based on it,
under Section 2) in object code or executable form under the terms of
Sections 1 and 2 above provided that you also do one of the following:
Accompany it with the complete corresponding machine-readable
source code, which must be distributed under the terms of Sections
1 and 2 above on a medium customarily used for software interchange; or,
Accompany it with a written offer, valid for at least three
years, to give any third party, for a charge no more than your
cost of physically performing source distribution, a complete
machine-readable copy of the corresponding source code, to be
distributed under the terms of Sections 1 and 2 above on a medium
customarily used for software interchange; or,
Accompany it with the information you received as to the offer
to distribute corresponding source code. (This alternative is
allowed only for non-commercial distribution and only if you
received the program in object code or executable form with such
an offer, in accord with Subsection b above.)
The source code for a work means the preferred form of the work for
making modifications to it. For an executable work, complete source
code means all the source code for all modules it contains, plus any
associated interface definition files, plus the scripts used to
control compilation and installation of the executable. However, as a
special exception, the source code distributed need not include
anything that is normally distributed (in either source or binary
form) with the major components (compiler, kernel, and so on) of the
operating system on which the executable runs, unless that component
itself accompanies the executable.
If distribution of executable or object code is made by offering
access to copy from a designated place, then offering equivalent
access to copy the source code from the same place counts as
distribution of the source code, even though third parties are not
compelled to copy the source along with the object code.
You may not copy, modify, sublicense, or distribute the Program
except as expressly provided under this License. Any attempt
otherwise to copy, modify, sublicense or distribute the Program is
void, and will automatically terminate your rights under this License.
However, parties who have received copies, or rights, from you under
this License will not have their licenses terminated so long as such
parties remain in full compliance.
You are not required to accept this License, since you have not
signed it. However, nothing else grants you permission to modify or
distribute the Program or its derivative works. These actions are
prohibited by law if you do not accept this License. Therefore, by
modifying or distributing the Program (or any work based on the
Program), you indicate your acceptance of this License to do so, and
all its terms and conditions for copying, distributing or modifying
the Program or works based on it.
Each time you redistribute the Program (or any work based on the
Program), the recipient automatically receives a license from the
original licensor to copy, distribute or modify the Program subject to
these terms and conditions. You may not impose any further
restrictions on the recipients' exercise of the rights granted herein.
You are not responsible for enforcing compliance by third parties to
this License.
If, as a consequence of a court judgment or allegation of patent
infringement or for any other reason (not limited to patent issues),
conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot
distribute so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you
may not distribute the Program at all. For example, if a patent
license would not permit royalty-free redistribution of the Program by
all those who receive copies directly or indirectly through you, then
the only way you could satisfy both it and this License would be to
refrain entirely from distribution of the Program.
If any portion of this section is held invalid or unenforceable under
any particular circumstance, the balance of the section is intended to
apply and the section as a whole is intended to apply in other
circumstances.
It is not the purpose of this section to induce you to infringe any
patents or other property right claims or to contest validity of any
such claims; this section has the sole purpose of protecting the
integrity of the free software distribution system, which is
implemented by public license practices. Many people have made
generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that
system; it is up to the author/donor to decide if he or she is willing
to distribute software through any other system and a licensee cannot
impose that choice.
This section is intended to make thoroughly clear what is believed to
be a consequence of the rest of this License.
If the distribution and/or use of the Program is restricted in
certain countries either by patents or by copyrighted interfaces, the
original copyright holder who places the Program under this License
may add an explicit geographical distribution limitation excluding
those countries, so that distribution is permitted only in or among
countries not thus excluded. In such case, this License incorporates
the limitation as if written in the body of this License.
The Free Software Foundation may publish revised and/or new versions
of the General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the Program
specifies a version number of this License which applies to it and ``any
later version'', you have the option of following the terms and conditions
either of that version or of any later version published by the Free
Software Foundation. If the Program does not specify a version number of
this License, you may choose any version ever published by the Free Software
Foundation.
If you wish to incorporate parts of the Program into other free
programs whose distribution conditions are different, write to the author
to ask for permission. For software which is copyrighted by the Free
Software Foundation, write to the Free Software Foundation; we sometimes
make exceptions for this. Our decision will be guided by the two goals
of preserving the free status of all derivatives of our free software and
of promoting the sharing and reuse of software generally.
@c fakenode --- for prepinfo
@heading NO WARRANTY
@center NO WARRANTY
BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW@. EXCEPT WHEN
OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
PROVIDE THE PROGRAM ``AS IS'' WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE@. THE ENTIRE RISK AS
TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU@. SHOULD THE
PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
REPAIR OR CORRECTION.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.
@c fakenode --- for prepinfo
@heading END OF TERMS AND CONDITIONS
@center END OF TERMS AND CONDITIONS
@c fakenode --- for prepinfo
@unnumberedsec How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
convey the exclusion of warranty; and each file should have at least
the ``copyright'' line and a pointer to where the full notice is found.
@var{one line to give the program's name and an idea of what it does.}
Copyright (C) 19@var{yy} @var{name of author}
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE@. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place --- Suite 330, Boston, MA 02111-1307, USA.
Also add information on how to contact you by electronic and paper mail.
If the program is interactive, make it output a short notice like this
when it starts in an interactive mode:
Gnomovision version 69, Copyright (C) 19@var{yy} @var{name of author}
Gnomovision comes with ABSOLUTELY NO WARRANTY; for details
type `show w'. This is free software, and you are welcome
to redistribute it under certain conditions; type `show c'
for details.
The hypothetical commands @samp{show w} and @samp{show c} should show
the appropriate parts of the General Public License. Of course, the
commands you use may be called something other than @samp{show w} and
@samp{show c}; they could even be mouse-clicks or menu items---whatever
suits your program.
You should also get your employer (if you work as a programmer) or your
school, if any, to sign a ``copyright disclaimer'' for the program, if
necessary. Here is a sample; alter the names:
Yoyodyne, Inc., hereby disclaims all copyright
interest in the program `Gnomovision'
(which makes passes at compilers) written
by James Hacker.
@var{signature of Ty Coon}, 1 April 1989
Ty Coon, President of Vice
This General Public License does not permit incorporating your program into
proprietary programs. If your program is a subroutine library, you may
consider it more useful to permit linking proprietary applications with the
library. If this is what you want to do, use the GNU Library General
Public License instead of this License.
@node Index, , Copying, Top
Robert J. Chassell points out that awk programs should have some indication
of how to use them. It would be useful to perhaps have a "programming
style" section of the manual that would include this and other tips.
2. The default AWKPATH search path should be configurable via `configure'.
The default and how this changes needs to be documented.
Consistency issues:
/.../ regexps are in @code, not @samp
".." strings are in @code, not @samp
no @print before @dots
values of expressions in the text (@code{x} has the value 15),
should be in roman, not @code
Use tab and not TAB
Use ESC and not ESCAPE
Use space and not blank to describe the space bar's character
The term "blank" is thus basically reserved for "blank lines" etc.
The `(d.c.)' should appear inside the closing `.' of a sentence
It should come before (pxref{...})
" " should have an @w{} around it
Use "non-" everywhere
Use @code{ftp} when talking about anonymous ftp
Use upper-case and lower-case, not "upper case" and "lower case"
Use alphanumeric, not alpha-numeric
Use --foo, not -Wfoo when describing long options
Use findex for all programs and functions in the example chapters
Use "Bell Labs" or "AT&T Bell Laboratories", but not
Use "behavior" instead of "behaviour".
Use "zeros" instead of "zeroes".
Use "Input/Output", not "input/output". Also "I/O", not "i/o".
Use @code{do}, and not @code{do}-@code{while}, except where
actually discussing the do-while.
The words "a", "and", "as", "between", "for", "from", "in", "of",
"on", "that", "the", "to", "with", and "without",
should not be capitalized in @chapter, @section etc.
"Into" and "How" should.
Search for @dfn; make sure important items are also indexed.
"e.g." should always be followed by a comma.
"i.e." should never be followed by a comma, and should be followed
The numbers zero through ten should be spelled out, except when
talking about file descriptor numbers. > 10 and < 0, it's
ok to use numbers.
In tables, put command line options in @code, while in the text,
When using @strong, use "Note:" or "Caution:" with colons and
not exclamation points. Do not surround the paragraphs
with @quotation ... @end quotation.
Date: Wed, 13 Apr 94 15:20:52 -0400
From: rms@gnu.ai.mit.edu (Richard Stallman)
To: gnu-prog@gnu.ai.mit.edu
Subject: A reminder: no pathnames in GNU
It's a GNU convention to use the term "file name" for the name of a
file, never "pathname". We use the term "path" for search paths,
which are lists of file names. Using it for a single file name as
well is potentially confusing to users.
So please check any documentation you maintain, if you think you might
have used "pathname".
Note that "file name" should be two words when it appears as ordinary
text. It's ok as one word when it's a metasyntactic variable, though.
Enhance FIELDWIDTHS with some way to indicate "the rest of the record".
E.g., a length of 0 or -1 or something. Maybe "n"?
Make FIELDWIDTHS be an array?
What if FIELDWIDTHS has invalid values in it?