This is Info file gawk.info, produced by Makeinfo-1.54 from the input
file gawk.texi.

This file documents `awk', a program that you can use to select
particular records in a file and perform operations upon them.

This is Edition 0.15 of `The GAWK Manual',
for the 2.15 version of the GNU implementation
of AWK.

Permission is granted to make and distribute verbatim copies of this
manual provided the copyright notice and this permission notice are
preserved on all copies.

Permission is granted to copy and distribute modified versions of
this manual under the conditions for verbatim copying, provided that
the entire resulting derived work is distributed under the terms of a
permission notice identical to this one.

Permission is granted to copy and distribute translations of this
manual into another language, under the above conditions for modified
versions, except that this permission notice may be stated in a
translation approved by the Foundation.

File: gawk.info, Node: For Statement, Next: Break Statement, Prev: Do Statement, Up: Statements

The `for' Statement
===================

The `for' statement makes it more convenient to count iterations of a
loop. The general form of the `for' statement looks like this:

     for (INITIALIZATION; CONDITION; INCREMENT)
       BODY

This statement starts by executing INITIALIZATION. Then, as long as
CONDITION is true, it repeatedly executes BODY and then INCREMENT.
Typically INITIALIZATION sets a variable to either zero or one,
INCREMENT adds 1 to it, and CONDITION compares it against the desired
number of iterations.

Here is an example of a `for' statement:

     awk '{ for (i = 1; i <= 3; i++)
               print $i
     }'

This prints the first three fields of each input record, one field per
line.

In the `for' statement, BODY stands for any statement, but
INITIALIZATION, CONDITION and INCREMENT are just expressions. You
cannot set more than one variable in the INITIALIZATION part unless you
use a multiple assignment statement such as `x = y = 0', which is
possible only if all the initial values are equal. (But you can
initialize additional variables by writing their assignments as
separate statements preceding the `for' loop.)

The same is true of the INCREMENT part; to increment additional
variables, you must write separate statements at the end of the loop.
The C compound expression, using C's comma operator, would be useful in
this context, but it is not supported in `awk'.
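
As a minimal sketch of the workaround just described, the following
loop steps two variables at once. The names `lo' and `hi' and the
shell variable `out' are illustrative only; they do not come from the
manual.

```shell
# awk has no comma operator, so the second counter (hi) is initialized
# before the loop and stepped at the end of the loop body.
out=$(awk 'BEGIN {
    hi = 5                        # extra initialization, before the loop
    for (lo = 1; lo < hi; lo++) {
        print lo "-" hi
        hi--                      # extra step, at the end of the body
    }
}')
printf '%s\n' "$out"
```

The loop stops as soon as the two counters meet, so only the pairs
with `lo' less than `hi' are printed.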

Most often, INCREMENT is an increment expression, as in the example
above. But this is not required; it can be any expression whatever.
For example, this statement prints all the powers of 2 between 1 and
100:

     for (i = 1; i <= 100; i *= 2)
       print i

Any of the three expressions in the parentheses following the `for'
may be omitted if there is nothing to be done there. Thus,
`for (;x > 0;)' is equivalent to `while (x > 0)'. If the CONDITION is
omitted, it is treated as TRUE, effectively yielding an "infinite loop"
(i.e., a loop that will never terminate).

In most cases, a `for' loop is an abbreviation for a `while' loop,
as shown here:

     INITIALIZATION
     while (CONDITION) {
       BODY
       INCREMENT
     }

The only exception is when the `continue' statement (*note The
`continue' Statement: Continue Statement.) is used inside the loop;
changing a `for' statement to a `while' statement in this way can
change the effect of the `continue' statement inside the loop.

There is an alternate version of the `for' loop, for iterating over
all the indices of an array:

     for (i in array)
         DO SOMETHING WITH array[i]

*Note Arrays in `awk': Arrays, for more information on this version of
the `for' loop.

The `awk' language has a `for' statement in addition to a `while'
statement because often a `for' loop is both less work to type and more
natural to think of. Counting the number of iterations is very common
in loops. It can be easier to think of this counting as part of
looping rather than as something to do inside the loop.

The next section has more complicated examples of `for' loops.

File: gawk.info, Node: Break Statement, Next: Continue Statement, Prev: For Statement, Up: Statements

The `break' Statement
=====================

The `break' statement jumps out of the innermost `for', `while', or
`do'-`while' loop that encloses it. The following example finds the
smallest divisor of any integer, and also identifies prime numbers:

     awk '# find smallest divisor of num
          { num = $1
            for (div = 2; div*div <= num; div++)
              if (num % div == 0)
                break
            if (num % div == 0)
              printf "Smallest divisor of %d is %d\n", num, div
            else
              printf "%d is prime\n", num }'

When the remainder is zero in the first `if' statement, `awk'
immediately "breaks out" of the containing `for' loop. This means that
`awk' proceeds immediately to the statement following the loop and
continues processing. (This is very different from the `exit'
statement which stops the entire `awk' program. *Note The `exit'
Statement: Exit Statement.)

Here is another program equivalent to the previous one. It
illustrates how the CONDITION of a `for' or `while' could just as well
be replaced with a `break' inside an `if':

     awk '# find smallest divisor of num
          { num = $1
            for (div = 2; ; div++) {
              if (num % div == 0) {
                printf "Smallest divisor of %d is %d\n", num, div
                break
              }
              if (div*div > num) {
                printf "%d is prime\n", num
                break
              }
            }
     }'

File: gawk.info, Node: Continue Statement, Next: Next Statement, Prev: Break Statement, Up: Statements

The `continue' Statement
========================

The `continue' statement, like `break', is used only inside `for',
`while', and `do'-`while' loops. It skips over the rest of the loop
body, causing the next cycle around the loop to begin immediately.
Contrast this with `break', which jumps out of the loop altogether.
Here is an example:

     # print names that don't contain the string "ignore"

     # first, save the text of each line
     { names[NR] = $0 }

     # print what we're interested in
     END {
        for (x in names) {
            if (names[x] ~ /ignore/)
                continue
            print names[x]
        }
     }

If one of the input records contains the string `ignore', this
example skips the print statement for that record, and continues back to
the first statement in the loop.

This is not a practical example of `continue', since it would be
just as easy to write the loop like this:

     for (x in names)
       if (names[x] !~ /ignore/)
         print names[x]

The `continue' statement in a `for' loop directs `awk' to skip the
rest of the body of the loop, and resume execution with the
increment-expression of the `for' statement. The following program
illustrates this fact:

     awk 'BEGIN {
          for (x = 0; x <= 20; x++) {
              if (x == 5)
                  continue
              printf ("%d ", x)
          }
          print ""
     }'

This program prints all the numbers from 0 to 20, except for 5, for
which the `printf' is skipped. Since the increment `x++' is not
skipped, `x' does not remain stuck at 5. Contrast the `for' loop above
with the `while' loop:

     awk 'BEGIN {
          x = 0
          while (x <= 20) {
              if (x == 5)
                  continue
              printf ("%d ", x)
              x++
          }
          print ""
     }'

This program loops forever once `x' gets to 5.

As described above, the `continue' statement has no meaning when
used outside the body of a loop. However, although it was never
documented, historical implementations of `awk' have treated the
`continue' statement outside of a loop as if it were a `next' statement
(*note The `next' Statement: Next Statement.). By default, `gawk'
silently supports this usage. However, if `-W posix' has been
specified on the command line (*note Invoking `awk': Command Line.), it
will be treated as an error, since the POSIX standard specifies that
`continue' should only be used inside the body of a loop.

File: gawk.info, Node: Next Statement, Next: Next File Statement, Prev: Continue Statement, Up: Statements

The `next' Statement
====================

The `next' statement forces `awk' to immediately stop processing the
current record and go on to the next record. This means that no
further rules are executed for the current record. The rest of the
current rule's action is not executed either.

Contrast this with the effect of the `getline' function (*note
Explicit Input with `getline': Getline.). That too causes `awk' to
read the next record immediately, but it does not alter the flow of
control in any way. So the rest of the current action executes with a
new input record.
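
The contrast can be seen in a small sketch that runs the same input
through two programs differing only in the statement used. The input
text and the shell variable names (`input', `with_next',
`with_getline') are made up for illustration.

```shell
input=$(printf 'a\nskip\nb\nc')

# next: the record "skip" is dropped and no later rule ever sees it.
with_next=$(printf '%s\n' "$input" | awk '
    /skip/ { next }
    { print "second rule sees: " $0 }')

# getline: the record "b" replaces "skip", the same action keeps
# running, and the second rule is then applied to "b" as well.
with_getline=$(printf '%s\n' "$input" | awk '
    /skip/ { getline; print "same action, new record: " $0 }
    { print "second rule sees: " $0 }')

printf '%s\n---\n%s\n' "$with_next" "$with_getline"
```

With `next' the output has three lines, one each for `a', `b' and
`c'. With `getline' there is an extra line, printed by the first
rule's action after the new record has been read.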

At the highest level, `awk' program execution is a loop that reads
an input record and then tests each rule's pattern against it. If you
think of this loop as a `for' statement whose body contains the rules,
then the `next' statement is analogous to a `continue' statement: it
skips to the end of the body of this implicit loop, and executes the
increment (which reads another record).

For example, if your `awk' program works only on records with four
fields, and you don't want it to fail when given bad input, you might
use this rule near the beginning of the program:

     NF != 4 {
       printf("line %d skipped: doesn't have 4 fields\n", FNR) > "/dev/stderr"
       next
     }

so that the following rules will not see the bad record. The error
message is redirected to the standard error output stream, as error
messages should be. *Note Standard I/O Streams: Special Files.

According to the POSIX standard, the behavior is undefined if the
`next' statement is used in a `BEGIN' or `END' rule. `gawk' will treat
it as a syntax error.

If the `next' statement causes the end of the input to be reached,
then the code in the `END' rules, if any, will be executed. *Note
`BEGIN' and `END' Special Patterns: BEGIN/END.

File: gawk.info, Node: Next File Statement, Next: Exit Statement, Prev: Next Statement, Up: Statements

The `next file' Statement
=========================

The `next file' statement is similar to the `next' statement.
However, instead of abandoning processing of the current record, the
`next file' statement instructs `awk' to stop processing the current
data file.

Upon execution of the `next file' statement, `FILENAME' is updated
to the name of the next data file listed on the command line, `FNR' is
reset to 1, and processing starts over with the first rule in the
program. *Note Built-in Variables::.

If the `next file' statement causes the end of the input to be
reached, then the code in the `END' rules, if any, will be executed.
*Note `BEGIN' and `END' Special Patterns: BEGIN/END.

The `next file' statement is a `gawk' extension; it is not
(currently) available in any other `awk' implementation. You can
simulate its behavior by creating a library file named `nextfile.awk',
with the following contents. (This sample program uses user-defined
functions, a feature that has not been presented yet. *Note
User-defined Functions: User-defined, for more information.)

     # nextfile --- function to skip remaining records in current file

     # this should be read in before the "main" awk program

     function nextfile() { _abandon_ = FILENAME; next }

     _abandon_ == FILENAME && FNR > 1 { next }
     _abandon_ == FILENAME && FNR == 1 { _abandon_ = "" }

The `nextfile' function simply sets a "private" variable(1) to the
name of the current data file, and then retrieves the next record.
Since this file is read before the main `awk' program, the rules that
follow the function definition will be executed before the rules in
the main program. The first rule continues to skip records as long as
the name of the input file has not changed, and this is not the first
record in the file. This rule is sufficient most of the time. But
what if the *same* data file is named twice in a row on the command
line? This rule would not process the data file the second time. The
second rule catches this case: If the data file name is what was being
skipped, but `FNR' is 1, then this is the second time the file is being
processed, and it should not be skipped.

The `next file' statement would be useful if you have many data
files to process, and due to the nature of the data, you expect that you
would not want to process every record in the file. In order to move
on to the next data file, you would have to continue scanning the
unwanted records (as described above). The `next file' statement
accomplishes this much more efficiently.

---------- Footnotes ----------

(1) Since all variables in `awk' are global, this program uses the
common practice of prefixing the variable name with an underscore. In
fact, it also suffixes the variable name with an underscore, as extra
insurance against using a variable name that might be used in some
other library file.

File: gawk.info, Node: Exit Statement, Prev: Next File Statement, Up: Statements

The `exit' Statement
====================

The `exit' statement causes `awk' to immediately stop executing the
current rule and to stop processing input; any remaining input is
ignored.

If an `exit' statement is executed from a `BEGIN' rule the program
stops processing everything immediately. No input records are read.
However, if an `END' rule is present, it is executed (*note `BEGIN' and
`END' Special Patterns: BEGIN/END.).

If `exit' is used as part of an `END' rule, it causes the program to
stop immediately.

An `exit' statement that is part of an ordinary rule (that is, not
part of a `BEGIN' or `END' rule) stops the execution of any further
automatic rules, but the `END' rule is executed if there is one. If
you do not want the `END' rule to do its job in this case, you can set
a variable to nonzero before the `exit' statement, and check that
variable in the `END' rule.
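
The flag technique might be sketched as follows. The shell function
name `sum_or_abort', the variable `failed' and the sample input are
all hypothetical; the program simply adds up numbers and gives up at
the first record that is not a number.

```shell
sum_or_abort() {
    awk '
        $1 !~ /^[0-9]+$/ {
            failed = 1      # tell the END rule that we gave up
            exit
        }
        { sum += $1 }
        END {
            if (failed)
                exit        # leave without printing the report
            print "sum:", sum
        }'
}

good=$(printf '1\n2\n4\n' | sum_or_abort)
bad=$(printf '1\n2\nbad\n4\n' | sum_or_abort)
printf 'good: [%s]  bad: [%s]\n' "$good" "$bad"
```

Without the test of `failed', the second run would still print a
(partial) sum, because the `END' rule runs after the `exit' in the
ordinary rule.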

If an argument is supplied to `exit', its value is used as the exit
status code for the `awk' process. If no argument is supplied, `exit'
returns status zero (success).

For example, let's say you've discovered an error condition you
really don't know how to handle. Conventionally, programs report this
by exiting with a nonzero status. Your `awk' program can do this using
an `exit' statement with a nonzero argument. Here's an example of this:

     BEGIN {
            if (("date" | getline date_now) <= 0) {
              print "Can't get system date" > "/dev/stderr"
              exit 4
            }
     }
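
From the calling shell, the status given to `exit' can be picked up in
the usual way. This is only a small sketch; the value 4 is borrowed
from the example above, and the shell variable `status' is an
illustrative name.

```shell
status=0
# The "|| status=$?" form records the status without tripping
# shells that run with "set -e".
awk 'BEGIN { exit 4 }' || status=$?
echo "awk exited with status $status"
```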

File: gawk.info, Node: Arrays, Next: Built-in, Prev: Statements, Up: Top

Arrays in `awk'
***************

An "array" is a table of values, called "elements". The elements of
an array are distinguished by their indices. "Indices" may be either
numbers or strings. Each array has a name, which looks like a variable
name, but must not be in use as a variable name in the same `awk'
program.

* Menu:

* Array Intro::                 Introduction to Arrays
* Reference to Elements::       How to examine one element of an array.
* Assigning Elements::          How to change an element of an array.
* Array Example::               Basic Example of an Array
* Scanning an Array::           A variation of the `for' statement.
                                It loops through the indices of
                                an array's existing elements.
* Delete::                      The `delete' statement removes
                                an element from an array.
* Numeric Array Subscripts::    How to use numbers as subscripts in `awk'.
* Multi-dimensional::           Emulating multi-dimensional arrays in `awk'.
* Multi-scanning::              Scanning multi-dimensional arrays.

File: gawk.info, Node: Array Intro, Next: Reference to Elements, Prev: Arrays, Up: Arrays

Introduction to Arrays
======================

The `awk' language has one-dimensional "arrays" for storing groups
of related strings or numbers.

Every `awk' array must have a name. Array names have the same
syntax as variable names; any valid variable name would also be a valid
array name. But you cannot use one name in both ways (as an array and
as a variable) in one `awk' program.

Arrays in `awk' superficially resemble arrays in other programming
languages; but there are fundamental differences. In `awk', you don't
need to specify the size of an array before you start to use it.
Additionally, any number or string in `awk' may be used as an array
index.

In most other languages, you have to "declare" an array and specify
how many elements or components it contains. In such languages, the
declaration causes a contiguous block of memory to be allocated for that
many elements. An index in the array must be a nonnegative integer; for
example, the index 0 specifies the first element in the array, which is
actually stored at the beginning of the block of memory. Index 1
specifies the second element, which is stored in memory right after the
first element, and so on. It is impossible to add more elements to the
array, because it has room for only as many elements as you declared.

A contiguous array of four elements might look like this,
conceptually, if the element values are `8', `"foo"', `""' and `30':

     +---------+---------+--------+---------+
     |    8    |  "foo"  |   ""   |    30   |    value
     +---------+---------+--------+---------+
          0         1         2         3        index

Only the values are stored; the indices are implicit from the order of
the values. `8' is the value at index 0, because `8' appears in the
position with 0 elements before it.

Arrays in `awk' are different: they are "associative". This means
that each array is a collection of pairs: an index, and its
corresponding array element value:

     Element 4     Value 30
     Element 2     Value "foo"
     Element 1     Value 8
     Element 3     Value ""

We have shown the pairs in jumbled order because their order is
irrelevant.

One advantage of an associative array is that new pairs can be added
at any time. For example, suppose we add to the above array a tenth
element whose value is `"number ten"'. The result is this:

     Element 10    Value "number ten"
     Element 4     Value 30
     Element 2     Value "foo"
     Element 1     Value 8
     Element 3     Value ""

Now the array is "sparse" (i.e., some indices are missing): it has
elements 1-4 and 10, but doesn't have elements 5, 6, 7, 8, or 9.

Another consequence of associative arrays is that the indices don't
have to be positive integers. Any number, or even a string, can be an
index. For example, here is an array which translates words from
English into French:

     Element "dog" Value "chien"
     Element "cat" Value "chat"
     Element "one" Value "un"
     Element 1     Value "un"

Here we decided to translate the number 1 in both spelled-out and
numeric form--thus illustrating that a single array can have both
numbers and strings as indices.

When `awk' creates an array for you, e.g., with the `split' built-in
function, that array's indices are consecutive integers starting at 1.
(*Note Built-in Functions for String Manipulation: String Functions.)

File: gawk.info, Node: Reference to Elements, Next: Assigning Elements, Prev: Array Intro, Up: Arrays

Referring to an Array Element
=============================

The principal way of using an array is to refer to one of its
elements. An array reference is an expression which looks like this:

     ARRAY[INDEX]

Here, ARRAY is the name of an array. The expression INDEX is the index
of the element of the array that you want.

The value of the array reference is the current value of that array
element. For example, `foo[4.3]' is an expression for the element of
array `foo' at index 4.3.

If you refer to an array element that has no recorded value, the
value of the reference is `""', the null string. This includes elements
to which you have not assigned any value, and elements that have been
deleted (*note The `delete' Statement: Delete.). Such a reference
automatically creates that array element, with the null string as its
value. (In some cases, this is unfortunate, because it might waste
memory inside `awk').

You can find out if an element exists in an array at a certain index
with the expression:

     INDEX in ARRAY

This expression tests whether or not the particular index exists,
without the side effect of creating that element if it is not present.
The expression has the value 1 (true) if `ARRAY[INDEX]' exists, and 0
(false) if it does not exist.

For example, to test whether the array `frequencies' contains the
index `"2"', you could write this statement:

     if ("2" in frequencies) print "Subscript \"2\" is present."

Note that this is *not* a test of whether or not the array
`frequencies' contains an element whose *value* is `"2"'. (There is no
way to do that except to scan all the elements.) Also, this *does not*
create `frequencies["2"]', while the following (incorrect) alternative
would do so:

     if (frequencies["2"] != "") print "Subscript \"2\" is present."
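
The side effect can be observed by counting the elements after each
kind of test. This is a sketch only; the counter `n', the loop
variable `k' and the shell variable `out' are illustrative names.

```shell
out=$(awk 'BEGIN {
    # The in operator only looks; it does not create the element.
    if ("2" in frequencies) print "present"
    n = 0; for (k in frequencies) n++
    print "elements after the in test:", n

    # A plain reference creates the element with a null value.
    if (frequencies["2"] != "") print "non-empty"
    n = 0; for (k in frequencies) n++
    print "elements after the reference:", n
}')
printf '%s\n' "$out"
```

The first count is 0 and the second is 1, even though no value was
ever assigned to `frequencies["2"]'.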

File: gawk.info, Node: Assigning Elements, Next: Array Example, Prev: Reference to Elements, Up: Arrays

Assigning Array Elements
========================

Array elements are lvalues: they can be assigned values just like
`awk' variables:

     ARRAY[SUBSCRIPT] = VALUE

Here ARRAY is the name of your array. The expression SUBSCRIPT is the
index of the element of the array that you want to assign a value. The
expression VALUE is the value you are assigning to that element of the
array.
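
A brief sketch of such assignments follows. The array name `price'
and its contents are invented for the example.

```shell
out=$(awk 'BEGIN {
    price["apple"] = 3                      # string subscript
    price[7] = "seven"                      # numeric subscript, string value
    price["apple"] = price["apple"] + 1     # an element works like a variable
    price["apple"]++                        # so increment applies too
    print price["apple"], price[7]
}')
printf '%s\n' "$out"
```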

File: gawk.info, Node: Array Example, Next: Scanning an Array, Prev: Assigning Elements, Up: Arrays

Basic Example of an Array
=========================

The following program takes a list of lines, each beginning with a
line number, and prints them out in order of line number. The line
numbers are not in order, however, when they are first read: they are
scrambled. This program sorts the lines by making an array using the
line numbers as subscripts. It then prints out the lines in sorted
order of their numbers. It is a very simple program, and gets confused
if it encounters repeated numbers, gaps, or lines that don't begin with
a number.

     {
       if ($1 > max)
         max = $1
       arr[$1] = $0
     }

     END {
       for (x = 1; x <= max; x++)
         print arr[x]
     }

The first rule keeps track of the largest line number seen so far;
it also stores each line into the array `arr', at an index that is the
line's number.

The second rule runs after all the input has been read, to print out
all the lines.

When this program is run with the following input:

     5 I am the Five man
     2 Who are you? The new number two!
     4 . . . And four on the floor
     1 Who is number one?
     3 I three you.

its output is this:

     1 Who is number one?
     2 Who are you? The new number two!
     3 I three you.
     4 . . . And four on the floor
     5 I am the Five man

If a line number is repeated, the last line with a given number
overrides the others.

Gaps in the line numbers can be handled with an easy improvement to
the program's `END' rule:

     END {
       for (x = 1; x <= max; x++)
         if (x in arr)
           print arr[x]
     }

File: gawk.info, Node: Scanning an Array, Next: Delete, Prev: Array Example, Up: Arrays

Scanning all Elements of an Array
=================================

In programs that use arrays, often you need a loop that executes
once for each element of an array. In other languages, where arrays are
contiguous and indices are limited to nonnegative integers, this is easy:
the largest index is one less than the length of the array, and you can
find all the valid indices by counting from zero up to that value. This
technique won't do the job in `awk', since any number or string may be
an array index. So `awk' has a special kind of `for' statement for
scanning an array:

     for (VAR in ARRAY)
       BODY

This loop executes BODY once for each different value that your program
has previously used as an index in ARRAY, with the variable VAR set to
that index.

Here is a program that uses this form of the `for' statement. The
first rule scans the input records and notes which words appear (at
least once) in the input, by storing a 1 into the array `used' with the
word as index. The second rule scans the elements of `used' to find
all the distinct words that appear in the input. It prints each word
that is more than 10 characters long, and also prints the number of
such words. *Note Built-in Functions: Built-in, for more information
on the built-in function `length'.

     # Record a 1 for each word that is used at least once.
     {
       for (i = 1; i <= NF; i++)
         used[$i] = 1
     }

     # Find number of distinct words more than 10 characters long.
     END {
       for (x in used)
         if (length(x) > 10) {
           ++num_long_words
           print x
         }
       print num_long_words, "words longer than 10 characters"
     }

*Note Sample Program::, for a more detailed example of this type.

The order in which elements of the array are accessed by this
statement is determined by the internal arrangement of the array
elements within `awk' and cannot be controlled or changed. This can
lead to problems if new elements are added to ARRAY by statements in
BODY; you cannot predict whether or not the `for' loop will reach them.
Similarly, changing VAR inside the loop can produce strange results.
It is best to avoid such things.
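
Because the scanning order cannot be controlled from inside `awk', a
program whose output must come out in a predictable order has to
arrange that itself. One possible approach, sketched here under the
assumption that the standard `sort' utility is available, is to pipe
the output through it. The array name `count' and the sample text are
made up.

```shell
out=$(printf 'the cat saw the dog\nthe dog ran\n' | awk '
    # Count every word; the index is the word itself.
    { for (i = 1; i <= NF; i++) count[$i]++ }

    # The scanning order here is whatever awk happens to use ...
    END { for (w in count) print w, count[w] }
' | LC_ALL=C sort)      # ... so sort imposes a fixed order afterwards
printf '%s\n' "$out"
```

The loop body neither adds elements to `count' nor changes `w', so it
stays clear of the problems described above.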

File: gawk.info, Node: Delete, Next: Numeric Array Subscripts, Prev: Scanning an Array, Up: Arrays

The `delete' Statement
======================

You can remove an individual element of an array using the `delete'
statement:

     delete ARRAY[INDEX]

Deleting an element discards it completely; it is as if you had never
referred to it and had never given it any value. You can no longer
obtain any value the element once had.

Here is an example of deleting elements in an array:

     for (i in frequencies)
       delete frequencies[i]

This example removes all the elements from the array `frequencies'.

If you delete an element, a subsequent `for' statement to scan the
array will not report that element, and the `in' operator to check for
the presence of that element will return 0:

     delete foo[4]
     if (4 in foo)
       print "This will never be printed"

It is not an error to delete an element which does not exist.
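
These points can be checked together in one short sketch. The
contents of `foo' and the index 99 are arbitrary choices for the
illustration.

```shell
out=$(awk 'BEGIN {
    foo[4] = "four"; foo[5] = "five"
    delete foo[4]
    delete foo[99]          # never existed; deleting it is harmless
    if (4 in foo) print "4 is still present"
    for (k in foo) print k, foo[k]
}')
printf '%s\n' "$out"
```

Only the element at index 5 is left for the scanning loop to report.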

File: gawk.info, Node: Numeric Array Subscripts, Next: Multi-dimensional, Prev: Delete, Up: Arrays

Using Numbers to Subscript Arrays
=================================

An important aspect of arrays to remember is that array subscripts
are *always* strings. If you use a numeric value as a subscript, it
will be converted to a string value before it is used for subscripting
(*note Conversion of Strings and Numbers: Conversion.).

This means that the value of `CONVFMT' can potentially affect
how your program accesses elements of an array. For example:

     a = b = 12.153
     data[a] = 1
     CONVFMT = "%2.2f"
     if (b in data)
         printf "%s is in data", b
     else
         printf "%s is not in data", b

should print `12.15 is not in data'. The first statement gives both
`a' and `b' the same numeric value. Assigning to `data[a]' first gives
`a' the string value `"12.153"' (using the default conversion value of
`CONVFMT', `"%.6g"'), and then assigns 1 to `data["12.153"]'. The
program then changes the value of `CONVFMT'. The test `(b in data)'
forces `b' to be converted to a string, this time `"12.15"', since the
new value of `CONVFMT' allows only two digits after the decimal point.
This test fails, since `"12.15"' is a different string from `"12.153"'.

According to the rules for conversions (*note Conversion of Strings
and Numbers: Conversion.), integer values are always converted to
strings as integers, no matter what the value of `CONVFMT' may happen
to be. So the usual case of

     for (i = 1; i <= maxsub; i++)
         do something with array[i]

will work, no matter what the value of `CONVFMT'.

Like many things in `awk', the majority of the time things work as
you would expect them to work. But it is useful to have a precise
knowledge of the actual rules, since sometimes they can have a subtle
effect on your programs.
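
The difference between integer and non-integer subscripts can be
sketched as below. The element values and the variables `has_int',
`has_short' and `has_full' are illustrative; the `CONVFMT' setting is
the one used in the example above.

```shell
out=$(awk 'BEGIN {
    CONVFMT = "%2.2f"
    data[12] = "integer"          # integers ignore CONVFMT: index "12"
    data[12.153] = "fraction"     # converted with CONVFMT: index "12.15"
    has_int   = ("12" in data)
    has_short = ("12.15" in data)
    has_full  = ("12.153" in data)
    print has_int, has_short, has_full
}')
printf '%s\n' "$out"
```

The last test is false because no element was ever stored under the
string `"12.153"'.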

File: gawk.info, Node: Multi-dimensional, Next: Multi-scanning, Prev: Numeric Array Subscripts, Up: Arrays

Multi-dimensional Arrays
========================

A multi-dimensional array is an array in which an element is
identified by a sequence of indices, not a single index. For example, a
two-dimensional array requires two indices. The usual way (in most
languages, including `awk') to refer to an element of a two-dimensional
array named `grid' is with `grid[X,Y]'.

Multi-dimensional arrays are supported in `awk' through
concatenation of indices into one string. What happens is that `awk'
converts the indices into strings (*note Conversion of Strings and
Numbers: Conversion.) and concatenates them together, with a separator
between them. This creates a single string that describes the values
of the separate indices. The combined string is used as a single index
into an ordinary, one-dimensional array. The separator used is the
value of the built-in variable `SUBSEP'.

For example, suppose we evaluate the expression `foo[5,12]="value"'
when the value of `SUBSEP' is `"@"'. The numbers 5 and 12 are
converted to strings and concatenated with an `@' between them,
yielding `"5@12"'; thus, the array element `foo["5@12"]' is set to
`"value"'.

Once the element's value is stored, `awk' has no record of whether
it was stored with a single index or a sequence of indices. The two
expressions `foo[5,12]' and `foo[5 SUBSEP 12]' always have the same
value.

790

791
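
The equivalence can be checked directly. This is only a sketch; it
sets `SUBSEP' to `"@"' as in the example above, and the shell variable
`result' is just a convenient place to hold the output:

```shell
result=$(awk 'BEGIN {
    SUBSEP = "@"
    foo[5, 12] = "value"
    # All three subscripts below name the same array element.
    print foo[5, 12], foo["5@12"], foo[5 SUBSEP 12]
}')
echo "$result"
```

Each of the three references should yield the same stored string.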

The default value of `SUBSEP' is the string `"\034"', which contains
a nonprinting character that is unlikely to appear in an `awk' program
or in the input data.

The usefulness of choosing an unlikely character comes from the fact
that index values that contain a string matching `SUBSEP' lead to
combined strings that are ambiguous. Suppose that `SUBSEP' were `"@"';
then `foo["a@b", "c"]' and `foo["a", "b@c"]' would be indistinguishable
because both would actually be stored as `foo["a@b@c"]'. Because the
default `SUBSEP' is `"\034"', such confusion can arise only when an
index contains the character with octal ASCII code 034, which is a rare
event.
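
The collision is easy to reproduce. In this sketch (the values
`"first"' and `"second"' and the counter `n' are made up for
illustration) two assignments that look different end up in a single
element:

```shell
result=$(awk 'BEGIN {
    SUBSEP = "@"
    foo["a@b", "c"] = "first"
    foo["a", "b@c"] = "second"     # same combined index, so this overwrites
    n = 0
    for (k in foo)
        n++                        # count the elements actually stored
    print n, foo["a@b@c"]
}')
echo "$result"
```

Only one element exists afterwards, and it holds the value assigned
last.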

You can test whether a particular index-sequence exists in a
"multi-dimensional" array with the same operator `in' used for single
dimensional arrays. Instead of a single index as the left-hand operand,
write the whole sequence of indices, separated by commas, in
parentheses:

     (SUBSCRIPT1, SUBSCRIPT2, ...) in ARRAY
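
As a minimal sketch of this test (the array `grid' and the variables
`a' and `b' are illustrative names only):

```shell
result=$(awk 'BEGIN {
    grid[1, 2] = "x"
    # The parenthesized index list is the left-hand operand of "in".
    a = ((1, 2) in grid) ? "present" : "absent"
    b = ((2, 1) in grid) ? "present" : "absent"
    print a, b
}')
echo "$result"
```

The order of the indices matters: `(1, 2)' and `(2, 1)' combine into
different strings.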

The following example treats its input as a two-dimensional array of
fields; it rotates this array 90 degrees clockwise and prints the
result. It assumes that all lines have the same number of elements.

     awk '{
          if (max_nf < NF)
               max_nf = NF
          max_nr = NR
          for (x = 1; x <= NF; x++)
               vector[x, NR] = $x
     }

     END {
          for (x = 1; x <= max_nf; x++) {
               for (y = max_nr; y >= 1; --y)
                    printf("%s ", vector[x, y])
               printf("\n")
          }
     }'

When given the input:

     1 2 3 4 5 6
     2 3 4 5 6 1
     3 4 5 6 1 2
     4 5 6 1 2 3

it produces:

     4 3 2 1
     5 4 3 2
     6 5 4 3
     1 6 5 4
     2 1 6 5
     3 2 1 6

File: gawk.info, Node: Multi-scanning, Prev: Multi-dimensional, Up: Arrays

Scanning Multi-dimensional Arrays
=================================

There is no special `for' statement for scanning a
"multi-dimensional" array; there cannot be one, because in truth there
are no multi-dimensional arrays or elements; there is only a
multi-dimensional *way of accessing* an array.

However, if your program has an array that is always accessed as
multi-dimensional, you can get the effect of scanning it by combining
the scanning `for' statement (*note Scanning all Elements of an Array:
Scanning an Array.) with the `split' built-in function (*note Built-in
Functions for String Manipulation: String Functions.). It works like
this:

     for (combined in ARRAY) {
       split(combined, separate, SUBSEP)
       ...
     }

This finds each concatenated, combined index in the array, and splits it
into the individual indices by breaking it apart where the value of
`SUBSEP' appears. The split-out indices become the elements of the
array `separate'.

Thus, suppose you have previously stored a value in `ARRAY[1, "foo"]';
then an element with index `"1\034foo"' exists in ARRAY. (Recall that
the default value of `SUBSEP' contains the character with code 034.)
Sooner or later the `for' statement will find that index and do an
iteration with `combined' set to `"1\034foo"'. Then the `split'
function is called as follows:

     split("1\034foo", separate, "\034")

The result of this is to set `separate[1]' to `"1"' and `separate[2]'
to `"foo"'. Presto, the original sequence of separate indices has been
recovered.
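
A complete, if contrived, version of this scanning loop might look like
the following sketch. The array names `cell' and `names' and the stored
values are invented for the example. Because the order in which `for'
visits elements is not specified, the program accumulates a total
rather than printing inside the loop:

```shell
result=$(awk 'BEGIN {
    cell[1, "foo"] = 10
    cell[2, "bar"] = 20
    for (combined in cell) {
        split(combined, separate, SUBSEP)       # recover the two indices
        total += separate[1] * cell[combined]   # first index times stored value
        names[separate[2]] = 1                  # remember each second index
    }
    print total, ("foo" in names), ("bar" in names)
}')
echo "$result"
```

The total is 1 * 10 + 2 * 20, and both second indices are recovered
intact.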

File: gawk.info, Node: Built-in, Next: User-defined, Prev: Arrays, Up: Top

Built-in Functions
******************

"Built-in" functions are functions that are always available for
your `awk' program to call. This chapter defines all the built-in
functions in `awk'; some of them are mentioned in other sections, but
they are summarized here for your convenience. (You can also define
new functions yourself. *Note User-defined Functions: User-defined.)

* Menu:

* Calling Built-in::            How to call built-in functions.
* Numeric Functions::           Functions that work with numbers,
                                including `int', `sin' and `rand'.
* String Functions::            Functions for string manipulation,
                                such as `split', `match', and `sprintf'.
* I/O Functions::               Functions for files and shell commands.
* Time Functions::              Functions for dealing with time stamps.

File: gawk.info, Node: Calling Built-in, Next: Numeric Functions, Prev: Built-in, Up: Built-in

Calling Built-in Functions
==========================

To call a built-in function, write the name of the function followed
by arguments in parentheses. For example, `atan2(y + z, 1)' is a call
to the function `atan2', with two arguments.

Whitespace is ignored between the built-in function name and the
open-parenthesis, but we recommend that you avoid using whitespace
there. User-defined functions do not permit whitespace in this way, and
you will find it easier to avoid mistakes by following a simple
convention which always works: no whitespace after a function name.
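
The following sketch shows both spellings of a built-in call side by
side. It is meant only to illustrate that the second form is accepted
for built-in functions; the recommended style remains the first:

```shell
result=$(awk 'BEGIN {
    # Same call twice; the second form is legal for built-ins but discouraged.
    print sqrt(16), sqrt (16)
}')
echo "$result"
```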

Each built-in function accepts a certain number of arguments. In
most cases, any extra arguments given to built-in functions are
ignored. The defaults for omitted arguments vary from function to
function and are described under the individual functions.

When a function is called, expressions that create the function's
actual parameters are evaluated completely before the function call is
performed. For example, in the code fragment:

     i = 4
     j = sqrt(i++)

the variable `i' is set to 5 before `sqrt' is called with a value of 4
for its actual parameter.
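
Wrapped in a `BEGIN' rule so that it can be run without input, the
fragment above can be checked like this (the shell variable `result' is
only there to capture the output):

```shell
result=$(awk 'BEGIN {
    i = 4
    j = sqrt(i++)     # sqrt receives 4; i has already become 5
    print i, j
}')
echo "$result"
```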

File: gawk.info, Node: Numeric Functions, Next: String Functions, Prev: Calling Built-in, Up: Built-in

Numeric Built-in Functions
==========================

Here is a full list of built-in functions that work with numbers:

`int(X)'
     This gives you the integer part of X, truncated toward 0. This
     produces the nearest integer to X, located between X and 0.

     For example, `int(3)' is 3, `int(3.9)' is 3, `int(-3.9)' is -3,
     and `int(-3)' is -3 as well.

`sqrt(X)'
     This gives you the positive square root of X. It reports an error
     if X is negative. Thus, `sqrt(4)' is 2.

`exp(X)'
     This gives you the exponential of X, or reports an error if X is
     out of range. The range of values X can have depends on your
     machine's floating point representation.

`log(X)'
     This gives you the natural logarithm of X, if X is positive;
     otherwise, it reports an error.

`sin(X)'
     This gives you the sine of X, with X in radians.

`cos(X)'
     This gives you the cosine of X, with X in radians.

`atan2(Y, X)'
     This gives you the arctangent of `Y / X' in radians.

`rand()'
     This gives you a random number. The values of `rand' are
     uniformly-distributed between 0 and 1. The value is never 0 and
     never 1.

     Often you want random integers instead. Here is a user-defined
     function you can use to obtain a random nonnegative integer less
     than N:

          function randint(n) {
               return int(n * rand())
          }

     The multiplication produces a random real number greater than 0
     and less than N. We then make it an integer (using `int') between
     0 and `N - 1'.

     Here is an example where a similar function is used to produce
     random integers between 1 and N. Note that this program will
     print a new random number for each input record.

          awk '
          # Function to roll a simulated die.
          function roll(n) { return 1 + int(rand() * n) }

          # Roll 3 six-sided dice and print total number of points.
          {
                printf("%d points\n", roll(6)+roll(6)+roll(6))
          }'

     *Note:* `rand' starts generating numbers from the same point, or
     "seed", each time you run `awk'. This means that a program will
     produce the same results each time you run it. The numbers are
     random within one `awk' run, but predictable from run to run.
     This is convenient for debugging, but if you want a program to do
     different things each time it is used, you must change the seed to
     a value that will be different in each run. To do this, use
     `srand'.

`srand(X)'
     The function `srand' sets the starting point, or "seed", for
     generating random numbers to the value X.

     Each seed value leads to a particular sequence of "random" numbers.
     Thus, if you set the seed to the same value a second time, you
     will get the same sequence of "random" numbers again.

     If you omit the argument X, as in `srand()', then the current date
     and time of day are used for a seed. This is the way to get random
     numbers that differ from one run to the next.

     The return value of `srand' is the previous seed. This makes it
     easy to keep track of the seeds for use in consistently reproducing
     sequences of random numbers.
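
Several of the numeric functions above can be exercised together. The
sketch below reuses the `randint' function shown under `rand'; the seed
value 42 and the variable names are arbitrary choices for the example,
and `atan2(0, -1)' is used as a common idiom for obtaining pi:

```shell
result=$(awk '
function randint(n) { return int(n * rand()) }

BEGIN {
    pi = atan2(0, -1)                      # arctangent idiom for pi
    srand(42); first = randint(1000)       # seed, then draw one number
    srand(42); second = randint(1000)      # same seed, so same draw
    verdict = (first == second) ? "same" : "different"
    # int truncates toward zero for both positive and negative values.
    print int(3.9), int(-3.9), sprintf("%.4f", pi), verdict
}')
echo "$result"
```

Setting the seed to the same value twice should reproduce the same
"random" number, which is what the last field reports.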

File: gawk.info, Node: String Functions, Next: I/O Functions, Prev: Numeric Functions, Up: Built-in

Built-in Functions for String Manipulation
==========================================

The functions in this section look at or change the text of one or
more strings.

`index(IN, FIND)'
     This searches the string IN for the first occurrence of the string
     FIND, and returns the position in characters where that occurrence
     begins in the string IN. For example:

          awk 'BEGIN { print index("peanut", "an") }'

     prints `3'. If FIND is not found, `index' returns 0. (Remember
     that string indices in `awk' start at 1.)

`length(STRING)'
     This gives you the number of characters in STRING. If STRING is a
     number, the length of the digit string representing that number is
     returned. For example, `length("abcde")' is 5. By contrast,
     `length(15 * 35)' works out to 3. How? Well, 15 * 35 = 525, and
     525 is then converted to the string `"525"', which has three
     characters.

     If no argument is supplied, `length' returns the length of `$0'.

     In older versions of `awk', you could call the `length' function
     without any parentheses. Doing so is marked as "deprecated" in the
     POSIX standard. This means that while you can do this in your
     programs, it is a feature that can eventually be removed from a
     future version of the standard. Therefore, for maximal
     portability of your `awk' programs you should always supply the
     parentheses.

`match(STRING, REGEXP)'
     The `match' function searches the string, STRING, for the longest,
     leftmost substring matched by the regular expression, REGEXP. It
     returns the character position, or "index", of where that
     substring begins (1, if it starts at the beginning of STRING). If
     no match is found, it returns 0.

     The `match' function sets the built-in variable `RSTART' to the
     index. It also sets the built-in variable `RLENGTH' to the length
     in characters of the matched substring. If no match is found,
     `RSTART' is set to 0, and `RLENGTH' to -1.

     For example:

          awk '{
                 if ($1 == "FIND")
                   regex = $2
                 else {
                   where = match($0, regex)
                   if (where)
                     print "Match of", regex, "found at", where, "in", $0
                 }
          }'

     This program looks for lines that match the regular expression
     stored in the variable `regex'. This regular expression can be
     changed. If the first word on a line is `FIND', `regex' is
     changed to be the second word on that line. Therefore, given:

          FIND fo*bar
          My program was a foobar
          But none of it would doobar
          FIND Melvin
          JF+KM
          This line is property of The Reality Engineering Co.
          This file created by Melvin.

     `awk' prints:

          Match of fo*bar found at 18 in My program was a foobar
          Match of Melvin found at 22 in This file created by Melvin.

`split(STRING, ARRAY, FIELDSEP)'
     This divides STRING into pieces separated by FIELDSEP, and stores
     the pieces in ARRAY. The first piece is stored in `ARRAY[1]', the
     second piece in `ARRAY[2]', and so forth. The string value of the
     third argument, FIELDSEP, is a regexp describing where to split
     STRING (much as `FS' can be a regexp describing where to split
     input records). If the FIELDSEP is omitted, the value of `FS' is
     used. `split' returns the number of elements created.

     The `split' function, then, splits strings into pieces in a manner
     similar to the way input lines are split into fields. For example:

          split("auto-da-fe", a, "-")

     splits the string `auto-da-fe' into three fields using `-' as the
     separator. It sets the contents of the array `a' as follows:

          a[1] = "auto"
          a[2] = "da"
          a[3] = "fe"

     The value returned by this call to `split' is 3.

     As with input field-splitting, when the value of FIELDSEP is
     `" "', leading and trailing whitespace is ignored, and the elements
     are separated by runs of whitespace.

`sprintf(FORMAT, EXPRESSION1,...)'
     This returns (without printing) the string that `printf' would
     have printed out with the same arguments (*note Using `printf'
     Statements for Fancier Printing: Printf.). For example:

          sprintf("pi = %.2f (approx.)", 22/7)

     returns the string `"pi = 3.14 (approx.)"'.

`sub(REGEXP, REPLACEMENT, TARGET)'
     The `sub' function alters the value of TARGET. It searches this
     value, which should be a string, for the leftmost substring
     matched by the regular expression, REGEXP, extending this match as
     far as possible. Then the entire string is changed by replacing
     the matched text with REPLACEMENT. The modified string becomes
     the new value of TARGET.

     This function is peculiar because TARGET is not simply used to
     compute a value, and not just any expression will do: it must be a
     variable, field or array reference, so that `sub' can store a
     modified value there. If this argument is omitted, then the
     default is to use and alter `$0'.

     For example:

          str = "water, water, everywhere"
          sub(/at/, "ith", str)

     sets `str' to `"wither, water, everywhere"', by replacing the
     leftmost, longest occurrence of `at' with `ith'.

     The `sub' function returns the number of substitutions made (either
     one or zero).

     If the special character `&' appears in REPLACEMENT, it stands for
     the precise substring that was matched by REGEXP. (If the regexp
     can match more than one string, then this precise substring may
     vary.) For example:

          awk '{ sub(/candidate/, "& and his wife"); print }'

     changes the first occurrence of `candidate' to `candidate and his
     wife' on each input line.

     Here is another example:

          awk 'BEGIN {
                  str = "daabaaa"
                  sub(/a+/, "c&c", str)
                  print str
          }'

     prints `dcaacbaaa'. This shows how `&' can represent a non-constant
     string, and also illustrates the "leftmost, longest" rule.

     The effect of this special character (`&') can be turned off by
     putting a backslash before it in the string. As usual, to insert
     one backslash in the string, you must write two backslashes.
     Therefore, write `\\&' in a string constant to include a literal
     `&' in the replacement. For example, here is how to replace the
     first `|' on each line with an `&':

          awk '{ sub(/\|/, "\\&"); print }'

     *Note:* as mentioned above, the third argument to `sub' must be an
     lvalue. Some versions of `awk' allow the third argument to be an
     expression which is not an lvalue. In such a case, `sub' would
     still search for the pattern and return 0 or 1, but the result of
     the substitution (if any) would be thrown away because there is no
     place to put it. Such versions of `awk' accept expressions like
     this:

          sub(/USA/, "United States", "the USA and Canada")

     But that is considered erroneous in `gawk'.

`gsub(REGEXP, REPLACEMENT, TARGET)'
     This is similar to the `sub' function, except `gsub' replaces
     *all* of the longest, leftmost, *nonoverlapping* matching
     substrings it can find. The `g' in `gsub' stands for "global,"
     which means replace everywhere. For example:

          awk '{ gsub(/Britain/, "United Kingdom"); print }'

     replaces all occurrences of the string `Britain' with `United
     Kingdom' for all input records.

     The `gsub' function returns the number of substitutions made. If
     the variable to be searched and altered, TARGET, is omitted, then
     the entire input record, `$0', is used.

     As in `sub', the characters `&' and `\' are special, and the third
     argument must be an lvalue.

`substr(STRING, START, LENGTH)'
     This returns a LENGTH-character-long substring of STRING, starting
     at character number START. The first character of a string is
     character number one. For example, `substr("washington", 5, 3)'
     returns `"ing"'.

     If LENGTH is not present, this function returns the whole suffix of
     STRING that begins at character number START. For example,
     `substr("washington", 5)' returns `"ington"'. This is also the
     case if LENGTH is greater than the number of characters remaining
     in the string, counting from character number START.

`tolower(STRING)'
     This returns a copy of STRING, with each upper-case character in
     the string replaced with its corresponding lower-case character.
     Nonalphabetic characters are left unchanged. For example,
     `tolower("MiXeD cAsE 123")' returns `"mixed case 123"'.

`toupper(STRING)'
     This returns a copy of STRING, with each lower-case character in
     the string replaced with its corresponding upper-case character.
     Nonalphabetic characters are left unchanged. For example,
     `toupper("MiXeD cAsE 123")' returns `"MIXED CASE 123"'.
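
To see several of these string functions working on one value, here is
a short sketch. The sample string is borrowed from the `sub' example
above; the variable names `s', `n', `count' and `parts' and the
bracketed replacement text are invented for illustration:

```shell
result=$(awk 'BEGIN {
    s = "water, water, everywhere"
    print index(s, "ever"), length(s)       # position of "ever", total length
    match(s, /wa+t/)
    print RSTART, RLENGTH                   # start and length of the match
    n = gsub(/water/, "[&]", s)             # & stands for the matched text
    print n, s
    count = split(s, parts, ", ")           # pieces separated by comma-space
    print count, toupper(substr(parts[1], 2, 5))
}')
echo "$result"
```

The first line of output gives the position of `ever' and the length
of the string; the second gives `RSTART' and `RLENGTH'; the third shows
that `gsub' made two substitutions; the last shows the number of pieces
and the first piece with its brackets removed and converted to upper
case.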