This is Info file gawk.info, produced by Makeinfo-1.54 from the input
file gawk.texinfo.

This file documents `awk', a program that you can use to select
particular records in a file and perform operations upon them.

This is Edition 0.15 of `The GAWK Manual',
for the 2.15 version of the GNU implementation
of AWK.

Copyright (C) 1989, 1991, 1992, 1993 Free Software Foundation, Inc.

Permission is granted to make and distribute verbatim copies of this
manual provided the copyright notice and this permission notice are
preserved on all copies.

Permission is granted to copy and distribute modified versions of
this manual under the conditions for verbatim copying, provided that
the entire resulting derived work is distributed under the terms of a
permission notice identical to this one.

Permission is granted to copy and distribute translations of this
manual into another language, under the above conditions for modified
versions, except that this permission notice may be stated in a
translation approved by the Foundation.

File: gawk.info, Node: Statements/Lines, Next: When, Prev: Comments, Up: Getting Started

`awk' Statements versus Lines
=============================

Most often, each line in an `awk' program is a separate statement or
separate rule, like this:

     awk '/12/ { print $0 }
          /21/ { print $0 }' BBS-list inventory-shipped

But sometimes statements can be more than one line, and lines can
contain several statements.  You can split a statement into multiple
lines by inserting a newline after any of the following:

     ,    {    ?    :    ||    &&    do    else

A newline at any other point is considered the end of the statement.
(Splitting lines after `?' and `:' is a minor `gawk' extension.  The
`?' and `:' referred to here are those of the three-operand conditional
expression described in *Note Conditional Expressions: Conditional Exp.)

If you would like to split a single statement into two lines at a
point where a newline would terminate it, you can "continue" it by
ending the first line with a backslash character, `\'.  This is allowed
absolutely anywhere in the statement, even in the middle of a string or
regular expression.  For example:

     awk '/This program is too long, so continue it\
      on the next line/ { print $1 }'

We have generally not used backslash continuation in the sample
programs in this manual.  Since in `gawk' there is no limit on the
length of a line, it is never strictly necessary; it just makes
programs prettier.  We have preferred to make them even more pretty by
keeping the statements short.  Backslash continuation is most useful
when your `awk' program is in a separate source file, instead of typed
in on the command line.  You should also note that many `awk'
implementations are more picky about where you may use backslash
continuation.  For maximal portability of your `awk' programs, it is
best not to split your lines in the middle of a regular expression or a
string.

*Warning: backslash continuation does not work as described above
with the C shell.*  Continuation with backslash works for `awk'
programs in files, and also for one-shot programs *provided* you are
using a POSIX-compliant shell, such as the Bourne shell or the
Bourne-again shell.  But the C shell used on Berkeley Unix behaves
differently!  There, you must use two backslashes in a row, followed by
a newline.

When `awk' statements within one rule are short, you might want to
put more than one of them on a line.  You do this by separating the
statements with a semicolon, `;'.  This also applies to the rules
themselves.  Thus, the previous program could have been written:

     /12/ { print $0 } ; /21/ { print $0 }

*Note:* the requirement that rules on the same line must be separated
with a semicolon is a recent change in the `awk' language; it was done
for consistency with the treatment of statements within an action.

File: gawk.info, Node: When, Prev: Statements/Lines, Up: Getting Started

When to Use `awk'
=================

You might wonder how `awk' could be useful for you.  Using additional
utility programs, more advanced patterns, field separators, arithmetic
statements, and other selection criteria, you can produce much more
complex output.  The `awk' language is very useful for producing
reports from large amounts of raw data, such as summarizing information
from the output of other utility programs like `ls'.  (*Note A More
Complex Example: More Complex.)

Programs written with `awk' are usually much smaller than they would
be in other languages.  This makes `awk' programs easy to compose and
use.  Often `awk' programs can be quickly composed at your terminal,
used once, and thrown away.  Since `awk' programs are interpreted, you
can avoid the usually lengthy edit-compile-test-debug cycle of software
development.

Complex programs have been written in `awk', including a complete
retargetable assembler for 8-bit microprocessors (*note Glossary::., for
more information) and a microcode assembler for a special purpose Prolog
computer.  However, `awk''s capabilities are strained by tasks of such
complexity.

If you find yourself writing `awk' scripts of more than, say, a few
hundred lines, you might consider using a different programming
language.  Emacs Lisp is a good choice if you need sophisticated string
or pattern matching capabilities.  The shell is also good at string and
pattern matching; in addition, it allows powerful use of the system
utilities.  More conventional languages, such as C, C++, and Lisp, offer
better facilities for system programming and for managing the complexity
of large programs.  Programs in these languages may require more lines
of source code than the equivalent `awk' programs, but they are easier
to maintain and usually run more efficiently.

File: gawk.info, Node: Reading Files, Next: Printing, Prev: Getting Started, Up: Top

Reading Input Files
*******************

In the typical `awk' program, all input is read either from the
standard input (by default the keyboard, but often a pipe from another
command) or from files whose names you specify on the `awk' command
line.  If you specify input files, `awk' reads them in order, reading
all the data from one before going on to the next.  The name of the
current input file can be found in the built-in variable `FILENAME'
(*note Built-in Variables::.).

The input is read in units called records, and processed by the
rules one record at a time.  By default, each record is one line.  Each
record is split automatically into fields, to make it more convenient
for a rule to work on its parts.

On rare occasions you will need to use the `getline' command, which
can do explicit input from any number of files (*note Explicit Input
with `getline': Getline.).

* Menu:

* Records::               Controlling how data is split into records.
* Fields::                An introduction to fields.
* Non-Constant Fields::   Non-constant Field Numbers.
* Changing Fields::       Changing the Contents of a Field.
* Field Separators::      The field separator and how to change it.
* Constant Size::         Reading constant width data.
* Multiple Line::         Reading multi-line records.
* Getline::               Reading files under explicit program control
                          using the `getline' function.
* Close Input::           Closing an input file (so you can read from
                          the beginning once more).

File: gawk.info, Node: Records, Next: Fields, Prev: Reading Files, Up: Reading Files

How Input is Split into Records
===============================

The `awk' language divides its input into records and fields.
Records are separated by a character called the "record separator".  By
default, the record separator is the newline character, defining a
record to be a single line of text.

Sometimes you may want to use a different character to separate your
records.  You can use a different character by changing the built-in
variable `RS'.  The value of `RS' is a string that says how to separate
records; the default value is `"\n"', the string containing just a
newline character.  This is why records are, by default, single lines.

`RS' can have any string as its value, but only the first character
of the string is used as the record separator.  The other characters are
ignored.  `RS' is exceptional in this regard; `awk' uses the full value
of all its other built-in variables.

You can change the value of `RS' in the `awk' program with the
assignment operator, `=' (*note Assignment Expressions: Assignment
Ops.).  The new record-separator character should be enclosed in
quotation marks to make a string constant.  Often the right time to do
this is at the beginning of execution, before any input has been
processed, so that the very first record will be read with the proper
separator.  To do this, use the special `BEGIN' pattern (*note `BEGIN'
and `END' Special Patterns: BEGIN/END.).  For example:

     awk 'BEGIN { RS = "/" } ; { print $0 }' BBS-list

changes the value of `RS' to `"/"', before reading any input.  This is
a string whose first character is a slash; as a result, records are
separated by slashes.  Then the input file is read, and the second rule
in the `awk' program (the action with no pattern) prints each record.
Since each `print' statement adds a newline at the end of its output,
the effect of this `awk' program is to copy the input with each slash
changed to a newline.

Another way to change the record separator is on the command line,
using the variable-assignment feature (*note Invoking `awk': Command
Line.):

     awk '{ print $0 }' RS="/" BBS-list

This sets `RS' to `/' before processing `BBS-list'.

Reaching the end of an input file terminates the current input
record, even if the last character in the file is not the character in
`RS'.

The empty string, `""' (a string of no characters), has a special
meaning as the value of `RS': it means that records are separated only
by blank lines.  *Note Multiple-Line Records: Multiple Line, for more
details.

The `awk' utility keeps track of the number of records that have
been read so far from the current input file.  This value is stored in a
built-in variable called `FNR'.  It is reset to zero when a new file is
started.  Another built-in variable, `NR', is the total number of input
records read so far from all files.  It starts at zero but is never
automatically reset to zero.
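
To see the difference between the two counters, you can run a short
program of this form on two input files (the file names here are just
placeholders, not part of the manual's sample data):

     awk '{ print FILENAME, FNR, NR }' file1 file2

When `awk' starts reading the second file, `FNR' starts over from 1,
while `NR' keeps counting across both files.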

If you change the value of `RS' in the middle of an `awk' run, the
new value is used to delimit subsequent records, but the record
currently being processed (and records already processed) are not
affected.

File: gawk.info, Node: Fields, Next: Non-Constant Fields, Prev: Records, Up: Reading Files

Examining Fields
================

When `awk' reads an input record, the record is automatically
separated or "parsed" by the interpreter into chunks called "fields".
By default, fields are separated by whitespace, like words in a line.
Whitespace in `awk' means any string of one or more spaces and/or tabs;
other characters such as newline, formfeed, and so on, that are
considered whitespace by other languages are *not* considered
whitespace by `awk'.

The purpose of fields is to make it more convenient for you to refer
to these pieces of the record.  You don't have to use them--you can
operate on the whole record if you wish--but fields are what make
simple `awk' programs so powerful.

To refer to a field in an `awk' program, you use a dollar-sign, `$',
followed by the number of the field you want.  Thus, `$1' refers to the
first field, `$2' to the second, and so on.  For example, suppose the
following is a line of input:

     This seems like a pretty nice example.

Here the first field, or `$1', is `This'; the second field, or `$2', is
`seems'; and so on.  Note that the last field, `$7', is `example.'.
Because there is no space between the `e' and the `.', the period is
considered part of the seventh field.

No matter how many fields there are, the last field in a record can
be represented by `$NF'.  So, in the example above, `$NF' would be the
same as `$7', which is `example.'.  Why this works is explained below
(*note Non-constant Field Numbers: Non-Constant Fields.).  If you try
to refer to a field beyond the last one, such as `$8' when the record
has only 7 fields, you get the empty string.

Plain `NF', with no `$', is a built-in variable whose value is the
number of fields in the current record.
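
For instance, here is a quick check you can run yourself (this line is
fed in with `echo' rather than taken from the manual's sample files);
with the example line above as input, it prints `7':

     echo 'This seems like a pretty nice example.' | awk '{ print NF }'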

`$0', which looks like an attempt to refer to the zeroth field, is a
special case: it represents the whole input record.  This is what you
would use if you weren't interested in fields.

Here are some more examples:

     awk '$1 ~ /foo/ { print $0 }' BBS-list

This example prints each record in the file `BBS-list' whose first
field contains the string `foo'.  The operator `~' is called a
"matching operator" (*note Comparison Expressions: Comparison Ops.); it
tests whether a string (here, the field `$1') matches a given regular
expression.

By contrast, the following example:

     awk '/foo/ { print $1, $NF }' BBS-list

looks for `foo' in *the entire record* and prints the first field and
the last field for each input record containing a match.

File: gawk.info, Node: Non-Constant Fields, Next: Changing Fields, Prev: Fields, Up: Reading Files

Non-constant Field Numbers
==========================

The number of a field does not need to be a constant.  Any
expression in the `awk' language can be used after a `$' to refer to a
field.  The value of the expression specifies the field number.  If the
value is a string, rather than a number, it is converted to a number.
Consider this example:

     awk '{ print $NR }'

Recall that `NR' is the number of records read so far: 1 in the first
record, 2 in the second, etc.  So this example prints the first field
of the first record, the second field of the second record, and so on.
For the twentieth record, field number 20 is printed; most likely, the
record has fewer than 20 fields, so this prints a blank line.

Here is another example of using expressions as field numbers:

     awk '{ print $(2*2) }' BBS-list

The `awk' language must evaluate the expression `(2*2)' and use its
value as the number of the field to print.  The `*' sign represents
multiplication, so the expression `2*2' evaluates to 4.  The
parentheses are used so that the multiplication is done before the `$'
operation; they are necessary whenever there is a binary operator in
the field-number expression.  This example, then, prints the hours of
operation (the fourth field) for every line of the file `BBS-list'.

If the field number you compute is zero, you get the entire record.
Thus, `$(2-2)' has the same value as `$0'.  Negative field numbers are
not allowed.

The number of fields in the current record is stored in the built-in
variable `NF' (*note Built-in Variables::.).  The expression `$NF' is
not a special feature: it is the direct consequence of evaluating `NF'
and using its value as a field number.
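
As a further small illustration (the input line here is made up), an
expression such as `NF-1' selects the next-to-last field in just the
same way:

     echo 'pat 555-3412 2400' | awk '{ print $NF, $(NF-1) }'

This prints `2400 555-3412'.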

File: gawk.info, Node: Changing Fields, Next: Field Separators, Prev: Non-Constant Fields, Up: Reading Files

Changing the Contents of a Field
================================

You can change the contents of a field as seen by `awk' within an
`awk' program; this changes what `awk' perceives as the current input
record.  (The actual input is untouched: `awk' never modifies the input
file.)

Consider this example:

     awk '{ $3 = $2 - 10; print $2, $3 }' inventory-shipped

The `-' sign represents subtraction, so this program reassigns field
three, `$3', to be the value of field two minus ten, `$2 - 10'.  (*Note
Arithmetic Operators: Arithmetic Ops.)  Then field two, and the new
value for field three, are printed.

In order for this to work, the text in field `$2' must make sense as
a number; the string of characters must be converted to a number in
order for the computer to do arithmetic on it.  The number resulting
from the subtraction is converted back to a string of characters which
then becomes field three.  *Note Conversion of Strings and Numbers:
Conversion.

When you change the value of a field (as perceived by `awk'), the
text of the input record is recalculated to contain the new field where
the old one was.  Therefore, `$0' changes to reflect the altered field.
Thus,

     awk '{ $2 = $2 - 10; print $0 }' inventory-shipped

prints a copy of the input file, with 10 subtracted from the second
field of each line.

You can also assign contents to fields that are out of range.  For
example:

     awk '{ $6 = ($5 + $4 + $3 + $2) ; print $6 }' inventory-shipped

We've just created `$6', whose value is the sum of fields `$2', `$3',
`$4', and `$5'.  The `+' sign represents addition.  For the file
`inventory-shipped', `$6' represents the total number of parcels
shipped for a particular month.

Creating a new field changes the internal `awk' copy of the current
input record--the value of `$0'.  Thus, if you do `print $0' after
adding a field, the record printed includes the new field, with the
appropriate number of field separators between it and the previously
existing fields.

This recomputation affects and is affected by several features not
yet discussed, in particular, the "output field separator", `OFS',
which is used to separate the fields (*note Output Separators::.), and
`NF' (the number of fields; *note Examining Fields: Fields.).  For
example, the value of `NF' is set to the number of the highest field
you create.
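
As a small check of this behavior (the input is supplied with `echo'
rather than taken from the manual's sample files), assigning to `$6' on
a four-field record sets `NF' to 6 and rebuilds `$0' with the extra
separators:

     echo a b c d | awk '{ $6 = "new"; print NF; print $0 }'

This prints `6' and then `a b c d  new'; the two spaces before `new'
mark the empty fifth field.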

Note, however, that merely *referencing* an out-of-range field does
*not* change the value of either `$0' or `NF'.  Referencing an
out-of-range field merely produces a null string.  For example:

     if ($(NF+1) != "")
          print "can't happen"
     else
          print "everything is normal"

should print `everything is normal', because `NF+1' is certain to be
out of range.  (*Note The `if' Statement: If Statement, for more
information about `awk''s `if-else' statements.)

It is important to note that assigning to a field will change the
value of `$0', but will not change the value of `NF', even when you
assign the null string to a field.  For example:

     echo a b c d | awk '{ OFS = ":"; $2 = "" ; print ; print NF }'

prints

     a::c:d
     4

The field is still there, it just has an empty value.  You can tell
because there are two colons in a row.

File: gawk.info, Node: Field Separators, Next: Constant Size, Prev: Changing Fields, Up: Reading Files

Specifying how Fields are Separated
===================================

(This section is rather long; it describes one of the most
fundamental operations in `awk'.  If you are a novice with `awk', we
recommend that you re-read this section after you have studied the
section on regular expressions, *Note Regular Expressions as Patterns:
Regexp.)

The way `awk' splits an input record into fields is controlled by
the "field separator", which is a single character or a regular
expression.  `awk' scans the input record for matches for the
separator; the fields themselves are the text between the matches.  For
example, if the field separator is `oo', then the following line:

     moo goo gai pan

would be split into three fields: `m', ` g' and ` gai pan'.

The field separator is represented by the built-in variable `FS'.
Shell programmers take note!  `awk' does not use the name `IFS' which
is used by the shell.

You can change the value of `FS' in the `awk' program with the
assignment operator, `=' (*note Assignment Expressions: Assignment
Ops.).  Often the right time to do this is at the beginning of
execution, before any input has been processed, so that the very first
record will be read with the proper separator.  To do this, use the
special `BEGIN' pattern (*note `BEGIN' and `END' Special Patterns:
BEGIN/END.).  For example, here we set the value of `FS' to the string
`","':

     awk 'BEGIN { FS = "," } ; { print $2 }'

Given the input line,

     John Q. Smith, 29 Oak St., Walamazoo, MI 42139

this `awk' program extracts the string ` 29 Oak St.'.

Sometimes your input data will contain separator characters that
don't separate fields the way you thought they would.  For instance, the
person's name in the example we've been using might have a title or
suffix attached, such as `John Q. Smith, LXIX'.  From input containing

     John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139

the previous sample program would extract ` LXIX', instead of ` 29 Oak
St.'.  If you were expecting the program to print the address, you
would be surprised.  So choose your data layout and separator
characters carefully to prevent such problems.

As you know, by default, fields are separated by whitespace sequences
(spaces and tabs), not by single spaces: two spaces in a row do not
delimit an empty field.  The default value of the field separator is a
string `" "' containing a single space.  If this value were interpreted
in the usual way, each space character would separate fields, so two
spaces in a row would make an empty field between them.  The reason
this does not happen is that a single space as the value of `FS' is a
special case: it is taken to specify the default manner of delimiting
fields.

If `FS' is any other single character, such as `","', then each
occurrence of that character separates two fields.  Two consecutive
occurrences delimit an empty field.  If the character occurs at the
beginning or the end of the line, that too delimits an empty field.  The
space character is the only single character which does not follow these
rules.

More generally, the value of `FS' may be a string containing any
regular expression.  Then each match in the record for the regular
expression separates fields.  For example, the assignment:

     FS = ", \t"

makes every area of an input line that consists of a comma followed by a
space and a tab, into a field separator.  (`\t' stands for a tab.)

For a less trivial example of a regular expression, suppose you want
single spaces to separate fields the way single commas were used above.
You can set `FS' to `"[ ]"'.  This regular expression matches a single
space and nothing else.
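
As a quick illustration of the difference (this line is fed in with
`echo' rather than taken from a sample file), two spaces in a row now
delimit an empty field between them:

     echo 'a  b' | awk 'BEGIN { FS = "[ ]" } ; { print NF }'

This prints `3', because the second field is the empty string between
the two spaces; with the default value of `FS' it would print `2'.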

`FS' can be set on the command line.  You use the `-F' argument to
do so.  For example:

     awk -F, 'PROGRAM' INPUT-FILES

sets `FS' to be the `,' character.  Notice that the argument uses a
capital `F'.  Contrast this with `-f', which specifies a file
containing an `awk' program.  Case is significant in command options:
the `-F' and `-f' options have nothing to do with each other.  You can
use both options at the same time to set the `FS' argument *and* get an
`awk' program from a file.
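
For instance, a command along these lines (the program file name here
is made up) sets the field separator to a colon and reads the program
from a file in a single step:

     awk -F: -f listusers.awk /etc/passwd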

The value used for the argument to `-F' is processed in exactly the
same way as assignments to the built-in variable `FS'.  This means that
if the field separator contains special characters, they must be escaped
appropriately.  For example, to use a `\' as the field separator, you
would have to use:

     awk -F\\\\ '...' files ...

Since `\' is used for quoting in the shell, `awk' will see `-F\\'.
Then `awk' processes the `\\' for escape characters (*note Constant
Expressions: Constants.), finally yielding a single `\' to be used for
the field separator.

As a special case, in compatibility mode (*note Invoking `awk':
Command Line.), if the argument to `-F' is `t', then `FS' is set to the
tab character.  (This is because if you type `-F\t', without the quotes,
at the shell, the `\' gets deleted, so `awk' figures that you really
want your fields to be separated with tabs, and not `t's.  Use `-v
FS="t"' on the command line if you really do want to separate your
fields with `t's.)

For example, let's use an `awk' program file called `baud.awk' that
contains the pattern `/300/', and the action `print $1'.  Here is the
program:

     /300/   { print $1 }

Let's also set `FS' to be the `-' character, and run the program on
the file `BBS-list'.  The following command prints a list of the names
of the bulletin boards that operate at 300 baud and the first three
digits of their phone numbers:

     awk -F- -f baud.awk BBS-list

It produces this output:

     aardvark     555
     alpo
     barfly       555
     bites        555
     camelot      555
     core         555
     fooey        555
     foot         555
     macfoo       555
     sdace        555
     sabafoo      555

Note the second line of output.  If you check the original file, you
will see that the second line looked like this:

     alpo-net     555-3412     2400/1200/300     A

The `-' as part of the system's name was used as the field
separator, instead of the `-' in the phone number that was originally
intended.  This demonstrates why you have to be careful in choosing
your field and record separators.

The following program searches the system password file, and prints
the entries for users who have no password:

     awk -F: '$2 == ""' /etc/passwd

Here we use the `-F' option on the command line to set the field
separator.  Note that fields in `/etc/passwd' are separated by colons.
The second field represents a user's encrypted password, but if the
field is empty, that user has no password.

According to the POSIX standard, `awk' is supposed to behave as if
each record is split into fields at the time that it is read.  In
particular, this means that you can change the value of `FS' after a
record is read, but before any of the fields are referenced.  The value
of the fields (i.e. how they were split) should reflect the old value
of `FS', not the new one.

However, many implementations of `awk' do not do this.  Instead,
they defer splitting the fields until a field reference actually
happens, using the *current* value of `FS'!  This behavior can be
difficult to diagnose.  The following example illustrates the results of
the two methods.  (The `sed' command prints just the first line of
`/etc/passwd'.)

     sed 1q /etc/passwd | awk '{ FS = ":" ; print $1 }'

will usually print

     root

on an incorrect implementation of `awk', while `gawk' will print
something like

     root:nSijPlPhZZwgE:0:0:Root:/:

There is an important difference between the two cases of `FS = " "'
(a single blank) and `FS = "[ \t]+"' (which is a regular expression
matching one or more blanks or tabs).  For both values of `FS', fields
are separated by runs of blanks and/or tabs.  However, when the value of
`FS' is `" "', `awk' will strip leading and trailing whitespace from
the record, and then decide where the fields are.

For example, the following expression prints `b':

     echo ' a b c d ' | awk '{ print $2 }'

However, the following prints `a':

     echo ' a b c d ' | awk 'BEGIN { FS = "[ \t]+" } ; { print $2 }'

In this case, the first field is null.

The stripping of leading and trailing whitespace also comes into
play whenever `$0' is recomputed.  For instance, this pipeline

     echo ' a b c d' | awk '{ print; $2 = $2; print }'

produces this output:

      a b c d
     a b c d

The first `print' statement prints the record as it was read, with
leading whitespace intact.  The assignment to `$2' rebuilds `$0' by
concatenating `$1' through `$NF' together, separated by the value of
`OFS'.  Since the leading whitespace was ignored when finding `$1', it
is not part of the new `$0'.  Finally, the last `print' statement
prints the new `$0'.

The following table summarizes how fields are split, based on the
value of `FS'.

`FS == " "'
     Fields are separated by runs of whitespace.  Leading and trailing
     whitespace are ignored.  This is the default.

`FS == ANY SINGLE CHARACTER'
     Fields are separated by each occurrence of the character.  Multiple
     successive occurrences delimit empty fields, as do leading and
     trailing occurrences.

`FS == REGEXP'
     Fields are separated by occurrences of characters that match
     REGEXP.  Leading and trailing matches of REGEXP delimit empty
     fields.

File: gawk.info, Node: Constant Size, Next: Multiple Line, Prev: Field Separators, Up: Reading Files

Reading Fixed-width Data
========================

(This section discusses an advanced, experimental feature.  If you
are a novice `awk' user, you may wish to skip it on the first reading.)

`gawk' 2.13 introduced a new facility for dealing with fixed-width
fields with no distinctive field separator.  Data of this nature arises
typically in one of at least two ways: the input for old FORTRAN
programs where numbers are run together, and the output of programs
that did not anticipate the use of their output as input for other
programs.

An example of the latter is a table where all the columns are lined
up by the use of a variable number of spaces and *empty fields are just
spaces*.  Clearly, `awk''s normal field splitting based on `FS' will
not work well in this case.  (Although a portable `awk' program can use
a series of `substr' calls on `$0', this is awkward and inefficient for
a large number of fields.)

The splitting of an input record into fixed-width fields is
specified by assigning a string containing space-separated numbers to
the built-in variable `FIELDWIDTHS'.  Each number specifies the width
of the field *including* columns between fields.  If you want to ignore
the columns between fields, you can specify the width as a separate
field that is subsequently ignored.

The following data is the output of the `w' utility.  It is useful
to illustrate the use of `FIELDWIDTHS'.

      10:06pm  up 21 days, 14:04,  23 users
     User     tty       login  idle   JCPU   PCPU  what
     hzuo     ttyV0     8:58pm            9      5  vi p24.tex
     hzang    ttyV3     6:37pm    50                -csh
     eklye    ttyV5     9:53pm            7      1  em thes.tex
     dportein ttyV6     8:17pm  1:47                -csh
     gierd    ttyD3    10:00pm     1                elm
     dave     ttyD4     9:47pm            4      4  w
     brent    ttyp0    26Jun91  4:46  26:46   4:41  bash
     dave     ttyq4    26Jun9115days     46     46  wnewmail

The following program takes the above input, converts the idle time
to number of seconds and prints out the first two fields and the
calculated idle time.  (This program uses a number of `awk' features
that haven't been introduced yet.)

     BEGIN  { FIELDWIDTHS = "9 6 10 6 7 7 35" }
     NR > 2 {
         idle = $4
         sub(/^ */, "", idle)   # strip leading spaces
         if (idle == "")
             idle = 0
         if (idle ~ /:/) { split(idle, t, ":"); idle = t[1] * 60 + t[2] }
         if (idle ~ /days/) { idle *= 24 * 60 * 60 }

         print $1, $2, idle
     }

Here is the result of running the program on the data:

     hzuo      ttyV0  0
     hzang     ttyV3  50
     eklye     ttyV5  0
     dportein  ttyV6  107
     gierd     ttyD3  1
     dave      ttyD4  0
     brent     ttyp0  286
     dave      ttyq4  1296000

Another (possibly more practical) example of fixed-width input data
would be the input from a deck of balloting cards.  In some parts of
the United States, voters make their choices by punching holes in
computer cards.  These cards are then processed to count the votes for
any particular candidate or on any particular issue.  Since a voter may
choose not to vote on some issue, any column on the card may be empty.
An `awk' program for processing such data could use the `FIELDWIDTHS'
feature to simplify reading the data.
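
As a rough sketch of how such a program might begin (the card layout
here is invented: five single-column issues, where a punched position
reads as `x' and an unpunched one as a space), it could simply count
the non-blank columns:

     BEGIN { FIELDWIDTHS = "1 1 1 1 1" }
           { for (i = 1; i <= 5; i++)
                 if ($i == "x")
                     votes[i]++ }
     END   { for (i = 1; i <= 5; i++)
                 print "issue", i, "votes", votes[i] }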

This feature is still experimental, and will likely evolve over time.

File: gawk.info, Node: Multiple Line, Next: Getline, Prev: Constant Size, Up: Reading Files

Multiple-Line Records
=====================

In some data bases, a single line cannot conveniently hold all the
information in one entry.  In such cases, you can use multi-line
records.

The first step in doing this is to choose your data format: when
records are not defined as single lines, how do you want to define them?
What should separate records?

One technique is to use an unusual character or string to separate
records.  For example, you could use the formfeed character (written
`\f' in `awk', as in C) to separate them, making each record a page of
the file.  To do this, just set the variable `RS' to `"\f"' (a string
containing the formfeed character).  Any other character could equally
well be used, as long as it won't be part of the data in a record.

Another technique is to have blank lines separate records.  By a
special dispensation, a null string as the value of `RS' indicates that
records are separated by one or more blank lines.  If you set `RS' to
the null string, a record always ends at the first blank line
encountered.  And the next record doesn't start until the first nonblank
line that follows--no matter how many blank lines appear in a row, they
are considered one record-separator.  (End of file is also considered a
record separator.)

The second step is to separate the fields in the record.  One way to
do this is to put each field on a separate line: to do this, just set
the variable `FS' to the string `"\n"'.  (This simple regular
expression matches a single newline.)
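
For example, a record-per-paragraph address list could be read with a
program along these lines (the data file name is made up), printing
just the first line of each entry:

     awk 'BEGIN { RS = "" ; FS = "\n" } ; { print $1 }' addresses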

Another way to separate fields is to divide each of the lines into
fields in the normal manner.  This happens by default as a result of a
special feature: when `RS' is set to the null string, the newline
character *always* acts as a field separator.  This is in addition to
whatever field separations result from `FS'.

The original motivation for this special exception was probably so
that you get useful behavior in the default case (i.e., `FS == " "').
This feature can be a problem if you really don't want the newline
character to separate fields, since there is no way to prevent it.
However, you can work around this by using the `split' function to
break up the record manually (*note Built-in Functions for String
Manipulation: String Functions.).

File: gawk.info, Node: Getline, Next: Close Input, Prev: Multiple Line, Up: Reading Files

Explicit Input with `getline'
=============================

So far we have been getting our input files from `awk''s main input
stream--either the standard input (usually your terminal) or the files
specified on the command line.  The `awk' language has a special
built-in command called `getline' that can be used to read input under
your explicit control.

This command is quite complex and should *not* be used by beginners.
It is covered here because this is the chapter on input.  The examples
that follow the explanation of the `getline' command include material
that has not been covered yet.  Therefore, come back and study the
`getline' command *after* you have reviewed the rest of this manual and
have a good knowledge of how `awk' works.

`getline' returns 1 if it finds a record, and 0 if the end of the
file is encountered.  If there is some error in getting a record, such
as a file that cannot be opened, then `getline' returns -1.  In this
case, `gawk' sets the variable `ERRNO' to a string describing the error
that occurred.

In the following examples, COMMAND stands for a string value that
represents a shell command.

`getline'
     The `getline' command can be used without arguments to read input
     from the current input file.  All it does in this case is read the
     next input record and split it up into fields.  This is useful if
     you've finished processing the current record, but you want to do
     some special processing *right now* on the next record.  Here's an
     example:

          {
               if (t = index($0, "/*")) {
                    if (t > 1)
                         tmp = substr($0, 1, t - 1)
                    else
                         tmp = ""
                    u = index(substr($0, t + 2), "*/")
                    while (u == 0) {
                         # the comment continues on the next line
                         if (getline <= 0)
                              exit
                         t = -1
                         u = index($0, "*/")
                    }
                    if (u <= length($0) - 2)
                         $0 = tmp substr($0, t + u + 3)
                    else
                         $0 = tmp
               }
               print $0
          }

     This `awk' program deletes all C-style comments, `/* ... */',
     from the input.  By replacing the `print $0' with other
     statements, you could perform more complicated processing on the
     decommented input, like searching for matches of a regular
     expression.  (This program has a subtle problem--can you spot it?)

     This form of the `getline' command sets `NF' (the number of
     fields; *note Examining Fields: Fields.), `NR' (the number of
     records read so far; *note How Input is Split into Records:
     Records.), `FNR' (the number of records read from this input
     file), and the value of `$0'.

     *Note:* the new value of `$0' is used in testing the patterns of
     any subsequent rules.  The original value of `$0' that triggered
     the rule which executed `getline' is lost.  By contrast, the
     `next' statement reads a new record but immediately begins
     processing it normally, starting with the first rule in the
     program.  *Note The `next' Statement: Next Statement.

`getline VAR'
     This form of `getline' reads a record into the variable VAR.  This
     is useful when you want your program to read the next record from
     the current input file, but you don't want to subject the record
     to the normal input processing.

     For example, suppose the next line is a comment, or a special
     string, and you want to read it, but you must make certain that it
     won't trigger any rules.  This version of `getline' allows you to
     read that line and store it in a variable so that the main
     read-a-line-and-check-each-rule loop of `awk' never sees it.

     The following example swaps every two lines of input.  For
     example, given:

          wan
          tew
          free
          phore

     it outputs:

          tew
          wan
          phore
          free

     Here's the program:

          {
               if ((getline tmp) > 0) {
                    print tmp
                    print $0
               } else
                    print $0
          }

     The `getline' function used in this way sets only the variables
     `NR' and `FNR' (and of course, VAR).  The record is not split into
     fields, so the values of the fields (including `$0') and the value
     of `NF' do not change.

`getline < FILE'
     This form of the `getline' function takes its input from the file
     FILE.  Here FILE is a string-valued expression that specifies the
     file name.  `< FILE' is called a "redirection" since it directs
     input to come from a different place.

     This form is useful if you want to read your input from a
     particular file, instead of from the main input stream.  For
     example, the following program reads its input record from the
     file `foo.input' when it encounters a first field with a value
     equal to 10 in the current input file.

          {
               if ($1 == 10) {
                    getline < "foo.input"
                    print
               } else
                    print
          }

     Since the main input stream is not used, the values of `NR' and
     `FNR' are not changed.  But the record read is split into fields in
     the normal manner, so the values of `$0' and other fields are
     changed.  So is the value of `NF'.

     This does not cause the record to be tested against all the
     patterns in the `awk' program, in the way that would happen if the
     record were read normally by the main processing loop of `awk'.
     However the new record is tested against any subsequent rules,
     just as when `getline' is used without a redirection.

`getline VAR < FILE'
     This form of the `getline' function takes its input from the file
     FILE and puts it in the variable VAR.  As above, FILE is a
     string-valued expression that specifies the file from which to
     read.

     In this version of `getline', none of the built-in variables are
     changed, and the record is not split into fields.  The only
     variable changed is VAR.

     For example, the following program copies all the input files to
     the output, except for records that say `@include FILENAME'.  Such
     a record is replaced by the contents of the file FILENAME.

          {
               if (NF == 2 && $1 == "@include") {
                    while ((getline line < $2) > 0)
                         print line
                    close($2)
               } else
                    print
          }

     Note here how the name of the extra input file is not built into
     the program; it is taken from the data, from the second field on
     the `@include' line.

     The `close' function is called to ensure that if two identical
     `@include' lines appear in the input, the entire specified file is
     included twice.  *Note Closing Input Files and Pipes: Close Input.

     One deficiency of this program is that it does not process nested
     `@include' statements the way a true macro preprocessor would.

`COMMAND | getline'
You can "pipe" the output of a command into `getline'. A pipe is
984
simply a way to link the output of one program to the input of
985
another. In this case, the string COMMAND is run as a shell
986
command and its output is piped into `awk' to be used as input.
987
This form of `getline' reads one record from the pipe.
989
For example, the following program copies input to output, except
990
for lines that begin with `@execute', which are replaced by the
991
output produced by running the rest of the line as a shell command:
994
if ($1 == "@execute") {
996
while ((tmp | getline) > 0)
1003
The `close' function is called to ensure that if two identical
1004
`@execute' lines appear in the input, the command is run for each
1005
one. *Note Closing Input Files and Pipes: Close Input.
1015

     the program might produce:

          foo
          hack     ttyv0   Jul 13 14:22
          hack     ttyp0   Jul 13 14:23     (gnu:0)
          hack     ttyp1   Jul 13 14:23     (gnu:0)
          hack     ttyp2   Jul 13 14:23     (gnu:0)
          hack     ttyp3   Jul 13 14:23     (gnu:0)
          bar

     Notice that this program ran the command `who' and printed the
     result.  (If you try this program yourself, you will get different
     results, showing you who is logged in on your system.)

     This variation of `getline' splits the record into fields, sets the
     value of `NF' and recomputes the value of `$0'.  The values of
     `NR' and `FNR' are not changed.

`COMMAND | getline VAR'
     The output of the command COMMAND is sent through a pipe to
     `getline' and into the variable VAR.  For example, the following
     program reads the current date and time into the variable
     `current_time', using the `date' utility, and then prints it.

          BEGIN {
               "date" | getline current_time
               close("date")
               print "Report printed on " current_time
          }

     In this version of `getline', none of the built-in variables are
     changed, and the record is not split into fields.

File: gawk.info, Node: Close Input, Prev: Getline, Up: Reading Files

Closing Input Files and Pipes
=============================

If the same file name or the same shell command is used with
`getline' more than once during the execution of an `awk' program, the
file is opened (or the command is executed) only the first time.  At
that time, the first record of input is read from that file or command.
The next time the same file or command is used in `getline', another
record is read from it, and so on.

This implies that if you want to start reading the same file again
from the beginning, or if you want to rerun a shell command (rather than
reading more output from the command), you must take special steps.
What you must do is use the `close' function, as follows:

     close(FILENAME)

or

     close(COMMAND)

The argument FILENAME or COMMAND can be any expression.  Its value
must exactly equal the string that was used to open the file or start
the command--for example, if you open a pipe with this:

     "sort -r names" | getline foo

then you must close it with this:

     close("sort -r names")

Once this function call is executed, the next `getline' from that
file or command will reopen the file or rerun the command.
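
As a small sketch of why you might want this (the file name is made
up), the following rule re-reads all of `names.txt' for every record of
the main input; without the call to `close', the second and later
records would find `names.txt' already at end of file:

     {
          while ((getline line < "names.txt") > 0)
               print line
          close("names.txt")
     }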

`close' returns a value of zero if the close succeeded.  Otherwise,
the value will be non-zero.  In this case, `gawk' sets the variable
`ERRNO' to a string describing the error that occurred.

File: gawk.info, Node: Printing, Next: One-liners, Prev: Reading Files, Up: Top

Printing Output
***************

One of the most common things that actions do is to output or "print"
some or all of the input.  For simple output, use the `print'
statement.  For fancier formatting use the `printf' statement.  Both
are described in this chapter.

* Menu:

* Print::                 The `print' statement.
* Print Examples::        Simple examples of `print' statements.
* Output Separators::     The output separators and how to change them.
* OFMT::                  Controlling Numeric Output With `print'.
* Printf::                The `printf' statement.
* Redirection::           How to redirect output to multiple
                          files and pipes.
* Special Files::         File name interpretation in `gawk'.
                          `gawk' allows access to
                          inherited file descriptors.

File: gawk.info, Node: Print, Next: Print Examples, Prev: Printing, Up: Printing

The `print' Statement
=====================

The `print' statement does output with simple, standardized
formatting.  You specify only the strings or numbers to be printed, in a
list separated by commas.  They are output, separated by single spaces,
followed by a newline.  The statement looks like this:

     print ITEM1, ITEM2, ...

The entire list of items may optionally be enclosed in parentheses.  The
parentheses are necessary if any of the item expressions uses a
relational operator; otherwise it could be confused with a redirection
(*note Redirecting Output of `print' and `printf': Redirection.).  The
relational operators are `==', `!=', `<', `>', `>=', `<=', `~' and `!~'
(*note Comparison Expressions: Comparison Ops.).
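
As a quick illustration of the ambiguity (the output file name here is
invented), the parentheses make the difference between comparing and
redirecting:

     awk 'BEGIN { print (1 > 2) }'        # prints `0', the comparison result
     awk 'BEGIN { print 1 > "tmpfile" }'  # writes `1' to the file `tmpfile'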

The items printed can be constant strings or numbers, fields of the
current record (such as `$1'), variables, or any `awk' expressions.
The `print' statement is completely general for computing *what* values
to print.  With two exceptions, you cannot specify *how* to print
them--how many columns, whether to use exponential notation or not, and
so on.  (*Note Output Separators::, and *Note Controlling Numeric
Output with `print': OFMT.)  For that, you need the `printf' statement
(*note Using `printf' Statements for Fancier Printing: Printf.).

The simple statement `print' with no items is equivalent to `print
$0': it prints the entire current record.  To print a blank line, use
`print ""', where `""' is the null, or empty, string.

To print a fixed piece of text, use a string constant such as
`"Hello there"' as one item.  If you forget to use the double-quote
characters, your text will be taken as an `awk' expression, and you
will probably get an error.  Keep in mind that a space is printed
between any two items.

Most often, each `print' statement makes one line of output.  But it
isn't limited to one line.  If an item value is a string that contains a
newline, the newline is output along with the rest of the string.  A
single `print' can make any number of lines this way.

File: gawk.info, Node: Print Examples, Next: Output Separators, Prev: Print, Up: Printing

Examples of `print' Statements
==============================

Here is an example of printing a string that contains embedded
newlines:

     awk 'BEGIN { print "line one\nline two\nline three" }'

produces output like this:

     line one
     line two
     line three

Here is an example that prints the first two fields of each input
record, with a space between them:

     awk '{ print $1, $2 }' inventory-shipped

Its output looks like this:

     Jan 13
     Feb 15
     Mar 15
     ...

A common mistake in using the `print' statement is to omit the comma
between two items.  This often has the effect of making the items run
together in the output, with no space.  The reason for this is that
juxtaposing two string expressions in `awk' means to concatenate them.
For example, without the comma:

     awk '{ print $1 $2 }' inventory-shipped

prints:

     Jan13
     Feb15
     Mar15
     ...

Neither example's output makes much sense to someone unfamiliar with
the file `inventory-shipped'.  A heading line at the beginning would
make it clearer.  Let's add some headings to our table of months (`$1')
and green crates shipped (`$2').  We do this using the `BEGIN' pattern
(*note `BEGIN' and `END' Special Patterns: BEGIN/END.) to force the
headings to be printed only once:

     awk 'BEGIN { print "Month Crates"
                  print "----- ------" }
                { print $1, $2 }' inventory-shipped

Did you already guess what happens?  This program prints the following:

     Month Crates
     ----- ------
     Jan 13
     Feb 15
     Mar 15
     ...

The headings and the table data don't line up!  We can fix this by
printing some spaces between the two fields:

     awk 'BEGIN { print "Month Crates"
                  print "----- ------" }
                { print $1, "     ", $2 }' inventory-shipped

You can imagine that this way of lining up columns can get pretty
complicated when you have many columns to fix.  Counting spaces for two
or three columns can be simple, but more than this and you can get
"lost" quite easily.  This is why the `printf' statement was created
(*note Using `printf' Statements for Fancier Printing: Printf.); one of
its specialties is lining up columns of data.
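
As a taste of what is coming (this is not one of this section's own
examples), the same report lines up neatly when each field is printed
with an explicit width:

     awk 'BEGIN { printf "%-6s %s\n", "Month", "Crates"
                  printf "%-6s %s\n", "-----", "------" }
                { printf "%-6s %d\n", $1, $2 }' inventory-shipped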