.TH REGEX 3 "25 Sept 1997"
.\" one other place knows this name:  the SEE ALSO section
.SH NAME
regcomp, regexec, regerror, regfree \- regular-expression library
.SH SYNOPSIS
.ft B
#include <sys/types.h>
.br
#include <regex.h>
.HP 10
int regcomp(regex_t\ *preg, const\ char\ *pattern, int\ cflags);
.HP
int\ regexec(const\ regex_t\ *preg, const\ char\ *string,
size_t\ nmatch, regmatch_t\ pmatch[], int\ eflags);
.HP
size_t\ regerror(int\ errcode, const\ regex_t\ *preg,
char\ *errbuf, size_t\ errbuf_size);
.HP
void\ regfree(regex_t\ *preg);
.ft
.SH DESCRIPTION
These routines implement POSIX 1003.2 regular expressions (``RE''s);
see
.IR re_format (7).
.I Regcomp
compiles an RE written as a string into an internal form,
.I regexec
matches that internal form against a string and reports results,
.I regerror
transforms error codes from either into human-readable messages,
and
.I regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.PP
The header
.I <regex.h>
declares two structure types,
.I regex_t
and
.I regmatch_t,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.I regoff_t,
and a number of constants with names starting with ``REG_''.
.PP
.I Regcomp
compiles the regular expression contained in the
.I pattern
string,
subject to the flags in
.I cflags,
and places the results in the
.I regex_t
structure pointed to by
.I preg.
.I Cflags
is the bitwise OR of zero or more of the following flags:
.IP REG_EXTENDED \w'REG_EXTENDED'u+2n
Compile modern (``extended'') REs,
rather than the obsolete (``basic'') REs that
are the default.
.IP REG_BASIC
This is a synonym for 0,
provided as a counterpart to REG_EXTENDED to improve readability.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
.IP REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the ``RE'' is a literal string.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
REG_EXTENDED and REG_NOSPEC may not be used
in the same call to
.IR regcomp .
.IP REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.IR re_format (7).
.IP REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.IP REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
`[^' bracket expressions and `.' never match newline,
a `^' anchor matches the null string after any newline in the string
in addition to its normal function,
and the `$' anchor matches the null string before any newline in the
string in addition to its normal function.
.IP REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.I re_endp
member of the structure pointed to by
.IR preg .
The
.I re_endp
member is of type
.IR const\ char\ * .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
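.PP
The effect of the compile-time flags can be observed with a small yes/no matcher.
The following is a minimal sketch, not part of this library:
the helper name re_matches is hypothetical,
and only the POSIX flags are exercised
(the REG_BASIC, REG_NOSPEC and REG_PEND extensions are left out so the sketch stays portable).
.nf
```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper (not part of the library): compile `pattern`
 * with the caller's cflags, adding REG_NOSUB because only success
 * or failure is wanted.  Returns 1 on match, 0 on no match, and
 * -1 if compilation or matching reports an error.
 */
static int re_matches(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```
.fi
.PP
With this helper, re_matches("hello", REG_EXTENDED | REG_ICASE, "Say HELLO")
would be expected to return 1,
while an anchored RE such as `^b$' should match the second line of a
two-line string only when REG_NEWLINE is added to the flags.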
.PP
When successful,
.I regcomp
returns 0 and fills in the structure pointed to by
.IR preg .
One member of that structure
(other than
.IR re_endp )
is publicized:
.IR re_nsub ,
of type
.IR size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
REG_NOSUB flag was used).
If
.I regcomp
fails, it returns a non-zero error code;
see DIAGNOSTICS.
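.PP
As an illustration of the compile step and the re_nsub member,
here is a minimal sketch.
The helper name re_count_groups is hypothetical;
it assumes an extended RE compiled without REG_NOSUB,
so that re_nsub is defined.
.nf
```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: compile an extended RE and report how many
 * parenthesized subexpressions it contains, or -1 if regcomp fails.
 */
static long re_count_groups(const char *pattern)
{
    regex_t re;
    long n;

    assert(pattern != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;              /* no valid compiled RE to free */
    n = (long)re.re_nsub;       /* defined: REG_NOSUB was not used */
    regfree(&re);               /* release the internal form */
    return n;
}
```
.fi
.PP
For the RE `(a)(b(c))' the helper would be expected to return 3,
and for an RE with an unbalanced parenthesis such as `(a' it returns -1.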
.PP
.I Regexec
matches the compiled RE pointed to by
.I preg
against the
.I string,
subject to the flags in
.I eflags,
and reports results using
.I nmatch,
.I pmatch,
and the returned value.
The RE must have been compiled by a previous invocation of
.IR regcomp .
The compiled form is not altered during execution of
.IR regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.PP
By default,
the NUL-terminated string pointed to by
.I string
is considered to be the text of an entire line,
with the NUL indicating the end of the line.
(That is,
any other end-of-line marker is considered to have been removed
and replaced by the NUL.)
The
.I eflags
argument is the bitwise OR of zero or more of the following flags:
.IP REG_NOTBOL \w'REG_STARTEND'u+2n
The first character of
the string
is not the beginning of a line, so the `^' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_NOTEOL
The NUL terminating
the string
does not end a line, so the `$' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_STARTEND
The string is considered to start at
\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR
and to have a terminating NUL located at
\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR
(there need not actually be a NUL at that location),
regardless of the value of
.IR nmatch .
See below for the definition of
.IR pmatch
and
.IR nmatch .
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Note that a non-zero \fIrm_so\fR does not imply REG_NOTBOL;
REG_STARTEND affects only the location of the string,
not how it is matched.
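.PP
A common use of REG_NOTBOL is resuming a search partway through a line.
The sketch below counts successive matches of an extended RE in one string;
the helper name re_count_matches is hypothetical,
and stepping one character past an empty match is a design choice of this
sketch rather than something this manual prescribes.
.nf
```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: count non-overlapping matches of `pattern`
 * in `text`, scanning left to right.  Returns -1 if regcomp fails.
 */
static int re_count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[1];
    const char *p = text;
    int eflags = 0;
    int count = 0;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    while (regexec(&re, p, 1, m, eflags) == 0) {
        count++;
        p += m[0].rm_eo;            /* resume just past this match */
        if (m[0].rm_so == m[0].rm_eo) {
            if (*p == 0)
                break;              /* empty match at end of string */
            p++;                    /* step forward to make progress */
        }
        eflags = REG_NOTBOL;        /* p no longer starts the line */
    }
    regfree(&re);
    return count;
}
```
.fi
.PP
Because every call after the first passes REG_NOTBOL,
the RE `^a' is counted once in the string `aaa' rather than three times.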
.PP
See
.IR re_format (7)
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.IR string .
.PP
Normally,
.I regexec
returns 0 for success and the non-zero code REG_NOMATCH for failure.
Other non-zero error codes may be returned in exceptional situations;
see DIAGNOSTICS.
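.PP
Since a non-zero return is not always REG_NOMATCH,
callers may want to keep the three outcomes apart.
A minimal sketch follows;
the enum and the helper name re_classify are hypothetical.
.nf
```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/* Hypothetical three-way result, not part of the library. */
enum re_outcome { RE_MATCHED, RE_NOT_MATCHED, RE_FAILED };

/* Run an already compiled RE and separate "no match" from errors. */
static enum re_outcome re_classify(const regex_t *preg, const char *text)
{
    assert(preg != NULL && text != NULL);
    switch (regexec(preg, text, 0, NULL, 0)) {
    case 0:
        return RE_MATCHED;
    case REG_NOMATCH:
        return RE_NOT_MATCHED;
    default:
        return RE_FAILED;       /* some other code; see DIAGNOSTICS */
    }
}
```
.fi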
.PP
If REG_NOSUB was specified in the compilation of the RE,
or if
.I nmatch
is 0,
.I regexec
ignores the
.I pmatch
argument (but see below for the case where REG_STARTEND is specified).
Otherwise,
.I pmatch
points to an array of
.I nmatch
structures of type
.IR regmatch_t .
Such a structure has at least the members
.I rm_so
and
.IR rm_eo ,
both of type
.I regoff_t
(a signed arithmetic type at least as large as an
.I off_t
and a
.IR ssize_t ),
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.I string
argument given to
.IR regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.PP
The 0th member of the
.I pmatch
array is filled in to indicate what substring of
.I string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.I i
reports subexpression
.IR i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array\(emcorresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
.I rm_so
and
.I rm_eo
set to \-1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as an example in particular, that when the RE `(b*)+' matches `bbb',
the parenthesized subexpression matches the three `b's and then
an infinite number of empty strings following the last `b',
so the reported substring is one of the empties.)
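.PP
The offset convention described above can be sketched as follows.
The helper name re_capture and the fixed limit of 10 entries are
assumptions of this sketch, not part of the library;
it copies the text of one subexpression out of the matched string
and treats an entry of -1 as ``unused''.
.nf
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <regex.h>

enum { RE_MAX_GROUPS = 10 };    /* arbitrary limit for this sketch */

/*
 * Hypothetical helper: match `text` against the extended RE `pattern`
 * and copy what entry `idx` of pmatch covers into `out`, truncating
 * if needed.  Returns the number of bytes copied, or -1 if there is
 * no match, the entry is unused, or an argument is out of range.
 */
static int re_capture(const char *pattern, const char *text,
                      size_t idx, char *out, size_t outsz)
{
    regex_t re;
    regmatch_t pm[RE_MAX_GROUPS];
    int len = -1;

    assert(pattern != NULL && text != NULL && out != NULL);
    if (idx >= RE_MAX_GROUPS || outsz == 0)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, RE_MAX_GROUPS, pm, 0) == 0
        && pm[idx].rm_so != -1) {
        /* rm_eo is the offset just past the end of the substring */
        len = (int)(pm[idx].rm_eo - pm[idx].rm_so);
        if ((size_t)len >= outsz)
            len = (int)outsz - 1;
        memcpy(out, text + pm[idx].rm_so, (size_t)len);
        out[len] = 0;
    }
    regfree(&re);
    return len;
}
```
.fi
.PP
For the RE `([a-z]+)=([0-9]+)' and the string `key=42',
entry 0 covers the whole of `key=42', entry 1 covers `key' and entry 2 covers `42'.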
.PP
If REG_STARTEND is specified,
.I pmatch
must point to at least one
.I regmatch_t
(even if
.I nmatch
is 0 or REG_NOSUB was specified),
to hold the input offsets for REG_STARTEND.
Use for output is still entirely controlled by
.IR nmatch ;
if
.I nmatch
is 0 or REG_NOSUB was specified,
the value of
.IR pmatch [0]
will not be changed by a successful
.IR regexec .
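.PP
The input use of pmatch[0] under REG_STARTEND might look like the sketch below.
The helper name re_match_range is hypothetical.
Because REG_STARTEND is an extension,
the sketch falls back to copying the range on systems whose <regex.h>
does not define it;
it also assumes the RE was compiled without REG_NOSUB,
and it avoids anchors, whose behavior at a non-zero starting offset
may differ between implementations.
.nf
```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: match `re` against text[start..end) only.
 * On success returns 0 and stores offsets measured from `text`
 * in *out; otherwise returns the non-zero regexec code.
 */
static int re_match_range(const regex_t *re, const char *text,
                          size_t start, size_t end, regmatch_t *out)
{
    regmatch_t pm[1];
    int rc;

    assert(re != NULL && text != NULL && out != NULL && start <= end);
#ifdef REG_STARTEND
    pm[0].rm_so = (regoff_t)start;  /* input: where the string starts */
    pm[0].rm_eo = (regoff_t)end;    /* input: where its NUL would be */
    rc = regexec(re, text, 1, pm, REG_STARTEND);
#else
    {
        /* Fallback: copy the range so that it is NUL-terminated. */
        size_t len = end - start;
        char *copy = malloc(len + 1);

        if (copy == NULL)
            return REG_ESPACE;
        memcpy(copy, text + start, len);
        copy[len] = 0;
        rc = regexec(re, copy, 1, pm, 0);
        free(copy);
        if (rc == 0) {              /* rebase offsets onto `text` */
            pm[0].rm_so += (regoff_t)start;
            pm[0].rm_eo += (regoff_t)start;
        }
    }
#endif
    if (rc == 0)
        *out = pm[0];
    return rc;
}
```
.fi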
.PP
.I Regerror
maps a non-zero
.I errcode
from either
.I regcomp
or
.I regexec
to a human-readable, printable message.
If
.I preg
is non-NULL,
the error code should have arisen from use of
the
.I regex_t
pointed to by
.IR preg ,
and if the error code came from
.IR regcomp ,
it should have been the result from the most recent
.I regcomp
using that
.IR regex_t .
.RI ( Regerror
may be able to supply a more detailed message using information
from the
.IR regex_t .)
.I Regerror
places the NUL-terminated message into the buffer pointed to by
.IR errbuf ,
limiting the length (including the NUL) to at most
.I errbuf_size
bytes.
If the whole message won't fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If
.I errbuf_size
is 0,
.I errbuf
is ignored but the return value is still correct.
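.PP
The size-query behavior lends itself to a two-call pattern:
ask for the required size with a zero-length buffer,
allocate, then fetch the message.
A minimal sketch, with the hypothetical helper name re_error_message:
.nf
```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: return a freshly allocated copy of the
 * message for `errcode`, or NULL if malloc fails.  The caller
 * must free() the result.
 */
static char *re_error_message(int errcode, const regex_t *preg)
{
    size_t need;
    char *buf;

    assert(errcode != 0);
    /* With errbuf_size 0 the buffer is ignored; only the size returns. */
    need = regerror(errcode, preg, NULL, 0);
    buf = malloc(need);
    if (buf != NULL)
        regerror(errcode, preg, buf, need);     /* now the whole text fits */
    return buf;
}
```
.fi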
.PP
If the
.I errcode
given to
.I regerror
is first ORed with REG_ITOA,
the ``message'' that results is the printable name of the error code,
e.g. ``REG_NOMATCH'',
rather than an explanation thereof.
If
.I errcode
is REG_ATOI,
then
.I preg
shall be non-NULL and the
.I re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.I errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
REG_ITOA and REG_ATOI are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.PP
.I Regfree
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.IR preg .
The remaining
.I regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.I regexec
or
.I regerror
is undefined.
.PP
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.SH IMPLEMENTATION CHOICES
There are a number of decisions that 1003.2 leaves up to the implementor,
either by explicitly saying ``undefined'' or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.PP
See
.IR re_format (7)
for a discussion of the definition of case-independent matching.
.PP
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See BUGS for one short RE using them
that will run almost any system out of memory.
.PP
A backslashed character other than one specifically given a magic meaning
by 1003.2 (such magic meanings occur only in obsolete [``basic''] REs)
is taken as an ordinary character.
.PP
Any unmatched [ is a REG_EBRACK error.
.PP
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.PP
RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
.PP
A repetition operator (?, *, +, or bounds) cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow `^' or `|'.
.PP
`|' cannot appear first or last in a (sub)expression or after another `|',
i.e. an operand of `|' cannot be an empty subexpression.
An empty parenthesized subexpression, `()', is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.PP
A `{' followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A `{' \fInot\fR followed by a digit is considered an ordinary character.
.PP
`^' and `$' beginning and ending subexpressions in obsolete (``basic'')
REs are anchors, not ordinary characters.
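.PP
Because the choices above are specific to this implementation,
portable software cannot assume that another regcomp makes the same ones.
One hedged way to find out is to probe the local library at run time,
as in this minimal sketch (the helper name re_accepts is hypothetical):
.nf
```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: report whether the local regcomp accepts
 * `pattern` under `cflags`.  Returns 1 if it compiles, 0 if not.
 */
static int re_accepts(const char *pattern, int cflags)
{
    regex_t re;

    assert(pattern != NULL);
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return 0;
    regfree(&re);
    return 1;
}
```
.fi
.PP
For example, re_accepts("a||b", REG_EXTENDED) shows whether an empty
operand of `|' is rejected, as it is here, or tolerated by the library in use.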
.SH SEE ALSO
grep(1), re_format(7)
.PP
POSIX 1003.2, sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.SH DIAGNOSTICS
Non-zero error codes from
.I regcomp
and
.I regexec
include the following:
.PP
.nf
.ta \w'REG_ECOLLATE'u+3n
REG_NOMATCH	regexec() failed to match
REG_BADPAT	invalid regular expression
REG_ECOLLATE	invalid collating element
REG_ECTYPE	invalid character class
REG_EESCAPE	\e applied to unescapable character
REG_ESUBREG	invalid backreference number
REG_EBRACK	brackets [ ] not balanced
REG_EPAREN	parentheses ( ) not balanced
REG_EBRACE	braces { } not balanced
REG_BADBR	invalid repetition count(s) in { }
REG_ERANGE	invalid character range in [ ]
REG_ESPACE	ran out of memory
REG_BADRPT	?, *, or + operand invalid
REG_EMPTY	empty (sub)expression
REG_ASSERT	``can't happen''\(emyou found a bug
REG_INVARG	invalid argument, e.g. negative-length string
.fi
.SH HISTORY
Written by Henry Spencer,
henry@zoo.toronto.edu.
.SH BUGS
This is an alpha release with known defects.
Please report problems.
.PP
There is one known functionality bug.
The implementation of internationalization is incomplete:
the locale is always assumed to be the default one of 1003.2,
and only the collating elements etc. of that locale are available.
.PP
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.PP
.I Regexec
performance is poor.
This will improve with later releases.
.I Nmatch
exceeding 0 is expensive;
.I nmatch
exceeding 1 is worse.
.I Regexec
is largely insensitive to RE complexity \fIexcept\fR that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.PP
.I Regcomp
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
`((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
will (eventually) run almost any existing machine out of swap space.
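.PP
A rough estimate shows why nesting is so costly.
Assume, as a simplification that this manual does not state,
that a bounded repetition with upper bound n expands to about n copies
of its operand;
nested bounds then multiply.
The sketch below (the helper name re_expanded_copies is hypothetical)
does the arithmetic and does not check for overflow.
.nf
```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical estimate: number of copies of the innermost operand
 * after expanding `depth` nested bounded repetitions, where upper[i]
 * is the upper bound at nesting level i.
 */
static unsigned long long re_expanded_copies(const unsigned *upper,
                                             size_t depth)
{
    unsigned long long copies = 1;
    size_t i;

    assert(upper != NULL || depth == 0);
    for (i = 0; i < depth; i++)
        copies *= upper[i];     /* each level multiplies the count */
    return copies;
}
```
.fi
.PP
For five nested bounds of 100 the estimate is 100 to the fifth power,
that is, 10,000,000,000 copies of `a'.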
.PP
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.PP
Due to a mistake in 1003.2, things like `a)b' are legal REs because `)' is
a special character only in the presence of a previous unmatched `('.
This can't be fixed until the spec is fixed.
.PP
The standard's definition of back references is vague.
For example, does
`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
Until the standard is clarified,
behavior in such cases should not be relied on.
.PP
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.