% manual page source format generated by PolyglotMan v3.0.9,
% available from http://polyglotman.sourceforge.net/

\section{Syntax of the builtin regular expression library}\label{wxresyn}

A {\it regular expression} describes strings of characters. It's a
pattern that matches certain strings and doesn't match others.

\helpref{wxRegEx}{wxregex}

\subsection{Different Flavors of REs}\label{differentflavors}

\helpref{Syntax of the builtin regular expression library}{wxresyn}

Regular expressions (``RE''s), as defined by POSIX, come in two
flavors: {\it extended} REs (``EREs'') and {\it basic} REs (``BREs''). EREs are roughly those
of the traditional {\it egrep}, while BREs are roughly those of the traditional
{\it ed}. This implementation adds a third flavor, {\it advanced} REs (``AREs''), basically
EREs with some significant extensions.

This manual page primarily describes
AREs. BREs mostly exist for backward compatibility in some old programs;
they will be discussed at the \helpref{end}{wxresynbre}. POSIX EREs are almost an exact subset
of AREs. Features of AREs that are not present in EREs will be indicated.

\subsection{Regular Expression Syntax}\label{resyntax}

\helpref{Syntax of the builtin regular expression library}{wxresyn}

These regular expressions are implemented using
the package written by Henry Spencer, based on the 1003.2 spec and some
(not quite all) of the Perl5 extensions (thanks, Henry!). Much of the description
of regular expressions below is copied verbatim from his manual entry.

An ARE is one or more {\it branches}, separated by `{\bf $|$}', matching anything that matches
any of the branches.

A branch is zero or more {\it constraints} or {\it quantified
atoms}, concatenated. It matches a match for the first, followed by a match
for the second, etc.; an empty branch matches the empty string.

A quantified atom is an {\it atom} possibly followed by a single {\it quantifier}. Without a quantifier,
it matches a match for the atom. The quantifiers, and what a so-quantified
atom matches, are:

\begin{twocollist}\twocolwidtha{4cm}
\twocolitem{{\bf *}}{a sequence of 0 or more matches of the atom}
\twocolitem{{\bf +}}{a sequence of 1 or more matches of the atom}
\twocolitem{{\bf ?}}{a sequence of 0 or 1 matches of the atom}
\twocolitem{{\bf \{m\}}}{a sequence of exactly {\it m} matches of the atom}
\twocolitem{{\bf \{m,\}}}{a sequence of {\it m} or more matches of the atom}
\twocolitem{{\bf \{m,n\}}}{a sequence of {\it m} through {\it n} (inclusive)
matches of the atom; {\it m} may not exceed {\it n}}
\twocolitem{{\bf *? +? ?? \{m\}? \{m,\}? \{m,n\}?}}{{\it non-greedy} quantifiers,
which match the same possibilities, but prefer the
smallest number rather than the largest number of matches (see \helpref{Matching}{wxresynmatching})}
\end{twocollist}

The forms using {\bf \{} and {\bf \}} are known as {\it bound}s. The numbers {\it m} and {\it n} are unsigned
decimal integers with permissible values from 0 to 255 inclusive.
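
As a minimal sketch of the quantifiers and bounds, the following uses Python's {\tt re} module.
That module is a different engine from the library documented here and is assumed only to
share this part of the syntax, so the results are illustrative rather than authoritative.

\begin{verbatim}
import re

# Bounds: {m}, {m,} and {m,n} count repetitions of the preceding atom.
assert re.fullmatch(r"ab{2}", "abb")
assert re.fullmatch(r"ab{2}", "abbb") is None
assert re.fullmatch(r"ab{2,}", "abbbbb")
assert re.fullmatch(r"ab{1,3}", "abbb")
assert re.fullmatch(r"ab{1,3}", "abbbb") is None

# Normal quantifiers prefer the largest number of matches,
# non-greedy quantifiers the smallest.
greedy = re.match(r"a+", "aaaa").group()
lazy = re.match(r"a+?", "aaaa").group()
bounded_lazy = re.match(r"a{2,3}?", "aaaa").group()
print(greedy, lazy, bounded_lazy)
\end{verbatim}

The greedy form consumes all four characters, while the two non-greedy forms stop at the
smallest count their bound allows.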

An atom is one of:

\begin{twocollist}\twocolwidtha{4cm}
\twocolitem{{\bf (re)}}{(where {\it re} is any regular expression) matches a match for
{\it re}, with the match noted for possible reporting}
\twocolitem{{\bf (?:re)}}{as previous, but
does no reporting (a ``non-capturing'' set of parentheses)}
\twocolitem{{\bf ()}}{matches an empty
string, noted for possible reporting}
\twocolitem{{\bf (?:)}}{matches an empty string, without reporting}
\twocolitem{{\bf $[chars]$}}{a {\it bracket expression}, matching any one of the {\it chars}
(see \helpref{Bracket Expressions}{wxresynbracket} for more detail)}
\twocolitem{{\bf .}}{matches any single character}
\twocolitem{{\bf $\backslash$k}}{(where {\it k} is a non-alphanumeric character)
matches that character taken as an ordinary character, e.g. $\backslash\backslash$ matches a backslash
character}
\twocolitem{{\bf $\backslash$c}}{where {\it c} is alphanumeric (possibly followed by other characters),
an {\it escape} (AREs only), see \helpref{Escapes}{wxresynescapes} below}
\twocolitem{{\bf \{}}{when followed by a character
other than a digit, matches the left-brace character `{\bf \{}'; when followed by
a digit, it is the beginning of a {\it bound} (see above)}
\twocolitem{{\bf x}}{where {\it x} is a single
character with no other significance, matches that character.}
\end{twocollist}

A {\it constraint} matches an empty string when specific conditions are met. A constraint may
not be followed by a quantifier. The simple constraints are as follows;
some more constraints are described later, under \helpref{Escapes}{wxresynescapes}.

\begin{twocollist}\twocolwidtha{4cm}
\twocolitem{{\bf \caret}}{matches at the beginning of a line}
\twocolitem{{\bf \$}}{matches at the end of a line}
\twocolitem{{\bf (?=re)}}{{\it positive lookahead}
(AREs only), matches at any point where a substring matching {\it re} begins}
\twocolitem{{\bf (?!re)}}{{\it negative lookahead} (AREs only),
matches at any point where no substring matching {\it re} begins}
\end{twocollist}

The lookahead constraints may not contain back references
(see later), and all parentheses within them are considered non-capturing.

An RE may not end with `{\bf $\backslash$}'.
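
The difference between capturing and non-capturing parentheses, and the effect of the two
lookahead constraints, can be sketched as follows. Python's {\tt re} module is used as a
stand-in engine (an assumption; it is not the library described here, but it accepts the
same notation for these constructs).

\begin{verbatim}
import re

# (re) is noted for reporting, (?:re) groups without reporting.
match = re.fullmatch(r"(ab)(?:cd)(ef)", "abcdef")
groups = match.groups()

text = "foobar foobaz"

# Positive lookahead: "foo" only where a substring matching "bar" begins next.
pos = [hit.start() for hit in re.finditer(r"foo(?=bar)", text)]

# Negative lookahead: "foo" only where no substring matching "bar" begins next.
neg = [hit.start() for hit in re.finditer(r"foo(?!bar)", text)]

print(groups, pos, neg)
\end{verbatim}

Only two groups are reported because the middle pair of parentheses is non-capturing, and
neither lookahead consumes any of the text it inspects.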

\subsection{Bracket Expressions}\label{wxresynbracket}

\helpref{Syntax of the builtin regular expression library}{wxresyn}

A {\it bracket expression} is a list
of characters enclosed in `{\bf $[]$}'. It normally matches any single character from
the list (but see below). If the list begins with `{\bf \caret}', it matches any single
character (but see below) {\it not} from the rest of the list.

If two characters
in the list are separated by `{\bf -}', this is shorthand for the full {\it range} of
characters between those two (inclusive) in the collating sequence, e.g.
{\bf $[0-9]$} in ASCII matches any decimal digit. Two ranges may not share an endpoint,
so e.g. {\bf a-c-e} is illegal. Ranges are very collating-sequence-dependent, and portable
programs should avoid relying on them.

To include a literal {\bf $]$} or {\bf -} in the
list, the simplest method is to enclose it in {\bf $[.$} and {\bf $.]$} to make it a collating
element (see below). Alternatively, make it the first character (following
a possible `{\bf \caret}'), or (AREs only) precede it with `{\bf $\backslash$}'.
Alternatively, for `{\bf -}', make
it the last character, or the second endpoint of a range. To use a literal
{\bf -} as the first endpoint of a range, make it a collating element or (AREs
only) precede it with `{\bf $\backslash$}'. With the exception of these, some combinations using
{\bf $[$} (see next paragraphs), and escapes, all other special characters lose
their special significance within a bracket expression.
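
The rules for ranges, complemented lists and literal {\bf $]$} and {\bf -} can be tried out with
the short sketch below. It uses Python's {\tt re} module as a stand-in engine (an assumption;
it is not the library described here and does not support the {\bf $[.$} {\bf $.]$} form).

\begin{verbatim}
import re

digits = re.findall(r"[0-9]", "a1b22")          # a range
non_vowels = re.findall(r"[^aeiou]", "queue")   # a complemented list

# A literal ']' placed first (after an optional '^') is an ordinary member.
assert re.fullmatch(r"[]a]", "]")
assert re.fullmatch(r"[^]a]", "b")
assert re.fullmatch(r"[^]a]", "]") is None

# A literal '-' placed last is an ordinary member, not a range operator.
assert re.fullmatch(r"[a-]", "-")

# Preceding ']' or '-' with a backslash also works.
assert re.fullmatch(r"[a\]\-]", "]")
assert re.fullmatch(r"[a\]\-]", "-")

# Other special characters lose their significance inside brackets.
assert re.fullmatch(r"[.*+]", "+")
assert re.fullmatch(r"[.*+]", "x") is None

print(digits, non_vowels)
\end{verbatim}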

Within a bracket
expression, a collating element (a character, a multi-character sequence
that collates as if it were a single character, or a collating-sequence
name for either) enclosed in {\bf $[.$} and {\bf $.]$} stands for the
sequence of characters of that collating element.

{\it wxWidgets}: Currently no multi-character collating elements are defined.
So in {\bf $[.X.]$}, {\it X} can either be a single character literal or
the name of a character. For example, {\bf $[[.0.]-[.9.]]$} and
{\bf $[[.zero.]-[.nine.]]$} are equivalent, and both mean the same as
{\bf $[0-9]$}.
See \helpref{Character Names}{wxresynchars}.
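
To make the equivalence concrete, the sketch below rewrites each {\bf $[.X.]$} into the
character it stands for. The helper {\tt expand\_collating\_elements} and its small name table
are hypothetical and exist only for illustration; the table holds a handful of entries copied
from the character-name list in this document, and the library itself resolves names internally.

\begin{verbatim}
import re

# A few entries copied from the character-name table in this document.
CHARACTER_NAMES = {
    "zero": "0", "nine": "9", "hyphen": "-", "period": ".",
    "left-square-bracket": "[", "right-square-bracket": "]",
}

def expand_collating_elements(bracket_expression):
    """Replace each [.X.] by the single character it stands for."""
    def resolve(match):
        name = match.group(1)
        char = name if len(name) == 1 else CHARACTER_NAMES[name]
        # Backslash-escape characters that stay special inside brackets.
        return "\\" + char if char in r"\]^-[" else char
    return re.sub(r"\[\.(.+?)\.\]", resolve, bracket_expression)

print(expand_collating_elements("[[.zero.]-[.nine.]]"))
print(expand_collating_elements("[[.0.]-[.9.]]"))
print(expand_collating_elements("[a[.hyphen.]z]"))
\end{verbatim}

The first two calls both print {\tt [0-9]}; the third yields a list containing a literal
hyphen rather than a range.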

%The sequence is a single element of the bracket
%expression's list. A bracket expression in a locale that has multi-character
%collating elements can thus match more than one character. So (insidiously),
%a bracket expression that starts with {\bf \caret} can match multi-character collating
%elements even if none of them appear in the bracket expression! ({\it Note:}
%Tcl currently has no multi-character collating elements. This information
%is only for illustration.)

%For example, assume the collating sequence includes
%a {\bf ch} multi-character collating element. Then the RE {\bf $[[.ch.]]*c$} (zero or more
% {\bf ch}'s followed by {\bf c}) matches the first five characters of `{\bf chchcc}'. Also, the
%RE {\bf $[^c]b$} matches all of `{\bf chb}' (because {\bf $[^c]$} matches the multi-character {\bf ch}).

Within a bracket expression, a collating element enclosed in {\bf $[=$} and {\bf $=]$}
is an equivalence class, standing for the sequences of characters of all
collating elements equivalent to that one, including itself.
%(If there are
%no other equivalent collating elements, the treatment is as if the enclosing
%delimiters were `{\bf $[.$}' and `{\bf $.]$}'.) For example, if {\bf o}
%and {\bf \caret} are the members of an
%equivalence class, then `{\bf $[[$=o=$]]$}', `{\bf $[[$=\caret=$]]$}',
%and `{\bf $[o^]$}' are all synonymous.
An equivalence class may not be an endpoint of a range.

%({\it Note:} Tcl currently
%implements only the Unicode locale. It doesn't define any equivalence classes.
%The examples above are just illustrations.)

{\it wxWidgets}: Currently no equivalence classes are defined, so
{\bf $[=X=]$} stands for just the single character {\it X}.
{\it X} can either be a single character literal or the name of a character,
see \helpref{Character Names}{wxresynchars}.

Within a bracket expression,
the name of a {\it character class} enclosed in {\bf $[:$} and {\bf $:]$} stands for the list
of all characters (not all collating elements!) belonging to that class.
Standard character classes are:

\begin{twocollist}\twocolwidtha{3cm}
\twocolitem{{\bf alpha}}{A letter.}
\twocolitem{{\bf upper}}{An upper-case letter.}
\twocolitem{{\bf lower}}{A lower-case letter.}
\twocolitem{{\bf digit}}{A decimal digit.}
\twocolitem{{\bf xdigit}}{A hexadecimal digit.}
\twocolitem{{\bf alnum}}{An alphanumeric (letter or digit).}
\twocolitem{{\bf print}}{An alphanumeric (same as alnum).}
\twocolitem{{\bf blank}}{A space or tab character.}
\twocolitem{{\bf space}}{A character producing white space in displayed text.}
\twocolitem{{\bf punct}}{A punctuation character.}
\twocolitem{{\bf graph}}{A character with a visible representation.}
\twocolitem{{\bf cntrl}}{A control character.}
\end{twocollist}

%A locale may provide others. (Note that the current Tcl
%implementation has only one locale: the Unicode locale.)
A character class may not be used as an endpoint of a range.

{\it wxWidgets}: In a non-Unicode build, these character classifications depend on the
current locale, and correspond to the values returned by the ANSI C 'is'
functions: isalpha, isupper, etc. In Unicode mode they are based on
Unicode classifications, and are not affected by the current locale.

There are two special cases of bracket expressions:
the bracket expressions {\bf $[[:$<$:]]$} and {\bf $[[:$>$:]]$} are constraints, matching empty
strings at the beginning and end of a word respectively. A word is defined
as a sequence of word characters that is neither preceded nor followed
by word characters. A word character is an {\it alnum} character or an underscore
({\bf \_}). These special bracket expressions are deprecated; users of AREs should
use constraint escapes instead (see \helpref{Escapes}{wxresynescapes} below).

\subsection{Escapes}\label{wxresynescapes}

\helpref{Syntax of the builtin regular expression library}{wxresyn}

Escapes (AREs only),
which begin with a {\bf $\backslash$} followed by an alphanumeric character, come in several
varieties: character entry, class shorthands, constraint escapes, and back
references. A {\bf $\backslash$} followed by an alphanumeric character but not constituting
a valid escape is illegal in AREs. In EREs, there are no escapes: outside
a bracket expression, a {\bf $\backslash$} followed by an alphanumeric character merely stands
for that character as an ordinary character, and inside a bracket expression,
{\bf $\backslash$} is an ordinary character. (The latter is the one actual incompatibility
between EREs and AREs.)

Character-entry escapes (AREs only) exist to make
it easier to specify non-printing and otherwise inconvenient characters
in REs:

\begin{twocollist}\twocolwidtha{4cm}
\twocolitem{{\bf $\backslash$a}}{alert (bell) character, as in C}
\twocolitem{{\bf $\backslash$b}}{backspace, as in C}
\twocolitem{{\bf $\backslash$B}}{synonym
for {\bf $\backslash$} to help reduce backslash doubling in some applications where there
are multiple levels of backslash processing}
\twocolitem{{\bf $\backslash$c{\it X}}}{(where {\it X} is any character)
the character whose low-order 5 bits are the same as those of {\it X}, and whose
other bits are all zero}
\twocolitem{{\bf $\backslash$e}}{the character whose collating-sequence name is
`{\bf ESC}', or failing that, the character with octal value 033}
\twocolitem{{\bf $\backslash$f}}{formfeed, as in C}
\twocolitem{{\bf $\backslash$n}}{newline, as in C}
\twocolitem{{\bf $\backslash$r}}{carriage return, as in C}
\twocolitem{{\bf $\backslash$t}}{horizontal tab, as in C}
\twocolitem{{\bf $\backslash$u{\it wxyz}}}{(where {\it wxyz} is exactly four hexadecimal digits)
the Unicode
character {\bf U+{\it wxyz}} in the local byte ordering}
\twocolitem{{\bf $\backslash$U{\it stuvwxyz}}}{(where {\it stuvwxyz} is
exactly eight hexadecimal digits) reserved for a somewhat-hypothetical Unicode
extension to 32 bits}
\twocolitem{{\bf $\backslash$v}}{vertical tab, as in C}
\twocolitem{{\bf $\backslash$x{\it hhh}}}{(where
{\it hhh} is any sequence of hexadecimal digits) the character whose hexadecimal
value is {\bf 0x{\it hhh}} (a single character no matter how many hexadecimal digits
are used)}
\twocolitem{{\bf $\backslash$0}}{the character whose value is {\bf 0}}
\twocolitem{{\bf $\backslash${\it xy}}}{(where {\it xy} is exactly two
octal digits, and is not a {\it back reference} (see below)) the character whose
octal value is {\bf 0{\it xy}}}
\twocolitem{{\bf $\backslash${\it xyz}}}{(where {\it xyz} is exactly three octal digits, and is
not a back reference (see below))
the character whose octal value is {\bf 0{\it xyz}}}
\end{twocollist}

Hexadecimal digits are `{\bf 0}'-`{\bf 9}', `{\bf a}'-`{\bf f}', and `{\bf A}'-`{\bf F}'. Octal
digits are `{\bf 0}'-`{\bf 7}'.

The character-entry
escapes are always taken as ordinary characters. For example, {\bf $\backslash$135} is {\bf ]} in
ASCII, but {\bf $\backslash$135} does not terminate a bracket expression. Beware, however,
that some applications (e.g., C compilers) interpret such sequences themselves
before the regular-expression package gets to see them, which may require
doubling (quadrupling, etc.) the `{\bf $\backslash$}'.
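
The arithmetic behind {\bf $\backslash$c{\it X}} and a few of the other character-entry escapes
can be checked with the sketch below. The helper {\tt control\_char} is hypothetical, and
Python's {\tt re} module serves only as a stand-in engine (an assumption; it is not the library
described here, and it spells only some of these escapes the same way).

\begin{verbatim}
import re

def control_char(x):
    # \cX: keep the low-order 5 bits of X and clear all other bits.
    return chr(ord(x) & 0x1F)

assert control_char("G") == "\x07"    # BEL, the alert character
assert control_char("[") == "\x1b"    # ESC, octal 033

# Escapes that Python's re module writes the same way.
assert re.fullmatch(r"\t", "\t")       # horizontal tab
assert re.fullmatch(r"\u0041", "A")    # exactly four hexadecimal digits
assert re.fullmatch(r"\101", "A")      # exactly three octal digits

# Octal 135 is ']' in ASCII, yet it does not terminate the bracket expression.
assert re.fullmatch(r"[\135]", "]")

# The host language may process backslashes first: without a raw string
# the backslash must be doubled to reach the regular-expression engine.
assert "\\t" == r"\t"
\end{verbatim}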

Class-shorthand escapes (AREs only) provide
shorthands for certain commonly-used character classes:

\begin{twocollist}\twocolwidtha{4cm}
\twocolitem{{\bf $\backslash$d}}{{\bf $[[:digit:]]$}}
\twocolitem{{\bf $\backslash$s}}{{\bf $[[:space:]]$}}
\twocolitem{{\bf $\backslash$w}}{{\bf $[[:alnum:]\_]$} (note underscore)}
\twocolitem{{\bf $\backslash$D}}{{\bf $[^[:digit:]]$}}
\twocolitem{{\bf $\backslash$S}}{{\bf $[^[:space:]]$}}
\twocolitem{{\bf $\backslash$W}}{{\bf $[^[:alnum:]\_]$} (note underscore)}
\end{twocollist}

Within bracket expressions, `{\bf $\backslash$d}', `{\bf $\backslash$s}', and
`{\bf $\backslash$w}' lose their outer brackets, and `{\bf $\backslash$D}',
`{\bf $\backslash$S}', and `{\bf $\backslash$W}' are illegal. (So, for example,
{\bf $[$a-c$\backslash$d$]$} is equivalent to {\bf $[a-c[:digit:]]$}.
Also, {\bf $[$a-c$\backslash$D$]$}, which is equivalent to
{\bf $[a-c^[:digit:]]$}, is illegal.)
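
A brief sketch of the class shorthands on a small sample string follows. It uses Python's
{\tt re} module as a stand-in engine (an assumption; it is not the library described here, and
unlike AREs it also accepts the complemented shorthands inside bracket expressions, which is
why none are used that way below).

\begin{verbatim}
import re

text = "id_7 = 42;"
words = re.findall(r"\w+", text)       # alphanumerics and underscore
digits = re.findall(r"\d", text)       # decimal digits
gaps = re.findall(r"\s", text)         # white space
non_words = re.findall(r"\W", text)    # everything \w rejects

# Inside a bracket expression, \d contributes the digits to the list.
mixed = re.findall(r"[a-c\d]", "abcd19")

print(words, digits, gaps, non_words, mixed)
\end{verbatim}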

A constraint escape (AREs only) is a constraint,
matching the empty string if specific conditions are met, written as an
escape:

\begin{twocollist}\twocolwidtha{4cm}
\twocolitem{{\bf $\backslash$A}}{matches only at the beginning of the string
(see \helpref{Matching}{wxresynmatching}, below,
for how this differs from `{\bf \caret}')}
\twocolitem{{\bf $\backslash$m}}{matches only at the beginning of a word}
\twocolitem{{\bf $\backslash$M}}{matches only at the end of a word}
\twocolitem{{\bf $\backslash$y}}{matches only at the beginning or end of a word}
\twocolitem{{\bf $\backslash$Y}}{matches only at a point that is not the beginning or end of
a word}
\twocolitem{{\bf $\backslash$Z}}{matches only at the end of the string
(see \helpref{Matching}{wxresynmatching}, below, for
how this differs from `{\bf \$}')}
\twocolitem{{\bf $\backslash${\it m}}}{(where {\it m} is a nonzero digit) a {\it back reference},
see below}
\twocolitem{{\bf $\backslash${\it mnn}}}{(where {\it m} is a nonzero digit, and {\it nn} is some more digits,
and the decimal value {\it mnn} is not greater than the number of closing capturing
parentheses seen so far) a {\it back reference}, see below}
\end{twocollist}

A word is defined
as in the specification of {\bf $[[:$<$:]]$} and {\bf $[[:$>$:]]$} above. Constraint escapes are
illegal within bracket expressions.
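
The positions at which the word constraints match can be visualised with the sketch below.
Python's {\tt re} module, used here as a stand-in engine, has no {\bf $\backslash$m},
{\bf $\backslash$M}, {\bf $\backslash$y} or {\bf $\backslash$Y}; the four patterns are therefore
hypothetical approximations built from Python's own word-boundary escapes, chosen only to show
where each constraint would apply.

\begin{verbatim}
import re

# Approximations in Python notation, not ARE syntax:
WORD_START = r"\b(?=\w)"     # plays the role of \m
WORD_END = r"\b(?<=\w)"      # plays the role of \M
WORD_EDGE = r"\b"            # plays the role of \y
NOT_EDGE = r"\B"             # plays the role of \Y

text = "foo bar_1"
starts = [hit.start() for hit in re.finditer(WORD_START, text)]
ends = [hit.start() for hit in re.finditer(WORD_END, text)]
edges = [hit.start() for hit in re.finditer(WORD_EDGE, text)]
inside = [hit.start() for hit in re.finditer(NOT_EDGE, text)]
print(starts, ends, edges, inside)

# \A and \Z match only at the very beginning and end of the string.
assert re.search(r"\Afoo", text)
assert re.search(r"\Abar", text) is None
assert re.search(r"bar_1\Z", text)
\end{verbatim}

Because the underscore counts as a word character, {\tt bar\_1} is a single word and its only
edges are at offsets 4 and 9.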

A back reference (AREs only) matches
the same string matched by the parenthesized subexpression specified by
the number, so that (e.g.) {\bf ($[bc]$)$\backslash$1} matches {\bf bb} or {\bf cc} but not `{\bf bc}'.
The subexpression
must entirely precede the back reference in the RE. Subexpressions are numbered
in the order of their leading parentheses. Non-capturing parentheses do not
define subexpressions.

There is an inherent historical ambiguity between
octal character-entry escapes and back references, which is resolved by
heuristics, as hinted at above. A leading zero always indicates an octal
escape. A single non-zero digit, not followed by another digit, is always
taken as a back reference. A multi-digit sequence not starting with a zero
is taken as a back reference if it comes after a suitable subexpression
(i.e. the number is in the legal range for a back reference), and otherwise
is taken as octal.
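
The back-reference example and the disambiguation heuristic can be sketched as follows. The
function {\tt classify\_digit\_escape} is a hypothetical helper that merely restates the rules
of the preceding paragraph; it is not taken from the library, and Python's {\tt re} module is
used only as a stand-in engine for the first three checks.

\begin{verbatim}
import re

# A back reference must repeat the text matched by its subexpression.
assert re.fullmatch(r"([bc])\1", "bb")
assert re.fullmatch(r"([bc])\1", "cc")
assert re.fullmatch(r"([bc])\1", "bc") is None

def classify_digit_escape(digits, groups_so_far):
    """Decide how a backslash followed by `digits` is interpreted.

    groups_so_far is the number of capturing parentheses closed so far.
    """
    if digits.startswith("0"):
        return "octal"              # a leading zero is always octal
    if len(digits) == 1:
        return "back reference"     # a single non-zero digit
    if int(digits) <= groups_so_far:
        return "back reference"     # within the legal range
    return "octal"

print(classify_digit_escape("012", 3))
print(classify_digit_escape("2", 0))
print(classify_digit_escape("12", 12))
print(classify_digit_escape("12", 3))
\end{verbatim}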

\subsection{Metasyntax}\label{remetasyntax}

\helpref{Syntax of the builtin regular expression library}{wxresyn}

In addition to the main syntax described above,
there are some special forms and miscellaneous syntactic facilities available.

Normally the flavor of RE being used is specified by application-dependent
means. However, this can be overridden by a {\it director}. If an RE of any flavor
begins with `{\bf ***:}', the rest of the RE is an ARE. If an RE of any flavor begins
with `{\bf ***=}', the rest of the RE is taken to be a literal string, with all
characters considered ordinary characters.
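
How a director prefix selects the interpretation of the remaining text can be sketched with
a small helper. The function {\tt split\_director} and the label {\tt default} are
hypothetical names used for illustration; the library performs this step internally.

\begin{verbatim}
def split_director(pattern):
    """Return (flavor, rest) for an RE that may begin with a director.

    'default' stands for whatever flavor the application selected.
    """
    if pattern.startswith("***:"):
        return "ARE", pattern[4:]
    if pattern.startswith("***="):
        return "literal", pattern[4:]
    return "default", pattern

print(split_director("***:a+b"))
print(split_director("***=a+b"))
print(split_director("a+b"))
\end{verbatim}

In the second call the remaining {\tt a+b} would be matched character for character, with the
plus sign treated as an ordinary character.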

An ARE may begin with {\it embedded options}: a sequence {\bf (?xyz)}
(where {\it xyz} is one or more alphabetic characters)
specifies options affecting the rest of the RE. These supplement, and can
override, any options specified by the application. The available option
letters are:

\begin{twocollist}\twocolwidtha{4cm}
\twocolitem{{\bf b}}{rest of RE is a BRE}
\twocolitem{{\bf c}}{case-sensitive matching (usual default)}
\twocolitem{{\bf e}}{rest of RE is an ERE}
\twocolitem{{\bf i}}{case-insensitive matching (see \helpref{Matching}{wxresynmatching}, below)}
\twocolitem{{\bf m}}{historical synonym for {\bf n}}
\twocolitem{{\bf n}}{newline-sensitive matching (see \helpref{Matching}{wxresynmatching}, below)}
\twocolitem{{\bf p}}{partial newline-sensitive matching (see \helpref{Matching}{wxresynmatching}, below)}
\twocolitem{{\bf q}}{rest of RE
is a literal (``quoted'') string, all ordinary characters}
\twocolitem{{\bf s}}{non-newline-sensitive matching (usual default)}
\twocolitem{{\bf t}}{tight syntax (usual default; see below)}
\twocolitem{{\bf w}}{inverse
partial newline-sensitive (``weird'') matching (see \helpref{Matching}{wxresynmatching}, below)}
\twocolitem{{\bf x}}{expanded syntax (see below)}
\end{twocollist}

Embedded options take effect at the {\bf )} terminating the
sequence. They are available only at the start of an ARE, and may not be
used later within it.
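
A short sketch of embedded options follows, using Python's {\tt re} module as a stand-in engine
(an assumption; it is not the library described here). Only the letters {\bf i} and {\bf x} are
shown, since those two have the same meaning in both; several of the other letters in the table
either do not exist in Python or mean something different there.

\begin{verbatim}
import re

# (?i) at the very start switches on case-insensitive matching.
insensitive = bool(re.fullmatch(r"(?i)hello", "HeLLo"))
sensitive = bool(re.fullmatch(r"hello", "HeLLo"))

# Several option letters can be combined in one sequence.
pattern = r"(?ix) h e l l o   # spaces and this comment are ignored"
combined = bool(re.fullmatch(pattern, "HELLO"))

print(insensitive, sensitive, combined)
\end{verbatim}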

In addition to the usual ({\it tight}) RE syntax, in which
all characters are significant, there is an {\it expanded} syntax, available
%in all flavors of RE with the {\bf -expanded} switch, or
in AREs with the embedded
{\bf x} option. In the expanded syntax, white-space characters are ignored and
all characters between a {\bf \#} and the following newline (or the end of the
RE) are ignored, permitting paragraphing and commenting a complex RE. There
are three exceptions to that basic rule:

\begin{itemize}
\item
a white-space character or `{\bf \#}' preceded
by `{\bf $\backslash$}' is retained
\item
white space or `{\bf \#}' within a bracket expression is retained
\item
white space and comments are illegal within multi-character symbols like
the ARE `{\bf (?:}' or the BRE `{\bf $\backslash$(}'
\end{itemize}

Expanded-syntax white-space characters are blank,
tab, newline, and any character that belongs to the {\it space} character class.

Finally, in an ARE, outside bracket expressions, the sequence `{\bf (?\#ttt)}' (where
{\it ttt} is any text not containing a `{\bf )}') is a comment, completely ignored. Again,
this is not allowed between the characters of multi-character symbols like
`{\bf (?:}'. Such comments are more a historical artifact than a useful facility,
and their use is deprecated; use the expanded syntax instead.

None of these
metasyntax extensions is available if the application (or an initial {\bf ***=}
director) has specified that the user's input be treated as a literal string
rather than as an RE.
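
The expanded syntax and its first two exceptions can be seen in the sketch below. It uses
Python's {\tt re} module as a stand-in engine (an assumption; it is not the library described
here, although its {\bf x} option follows the same basic rule). The sample pattern and strings
are invented for illustration.

\begin{verbatim}
import re

phone = r"""(?x)
    (\d{3})    # exchange: three digits
    [ -]       # separator; the space inside the brackets is retained
    (\d{4})    # line number: four digits
"""
exchange, line = re.fullmatch(phone, "555-1234").groups()
assert re.fullmatch(phone, "555 1234")

# White space or '#' preceded by a backslash is retained as well.
assert re.fullmatch(r"(?x) a \  b \# c ", "a b#c")

# The (?#ttt) comment form does not require the expanded syntax.
assert re.fullmatch(r"a(?#ignored)b", "ab")

print(exchange, line)
\end{verbatim}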

\subsection{Matching}\label{wxresynmatching}

\helpref{Syntax of the builtin regular expression library}{wxresyn}

In the event that an RE could match more than
one substring of a given string, the RE matches the one starting earliest
in the string. If the RE could match more than one substring starting at
that point, its choice is determined by its {\it preference}: either the longest
substring, or the shortest.

Most atoms, and all constraints, have no preference.
A parenthesized RE has the same preference (possibly none) as the RE. A
quantified atom with quantifier {\bf \{m\}} or {\bf \{m\}?} has the same preference (possibly
none) as the atom itself. A quantified atom with other normal quantifiers
(including {\bf \{m,n\}} with {\it m} equal to {\it n}) prefers longest match. A quantified
atom with other non-greedy quantifiers (including {\bf \{m,n\}?} with {\it m} equal to
{\it n}) prefers shortest match. A branch has the same preference as the first
quantified atom in it which has a preference. An RE consisting of two or
more branches connected by the {\bf $|$} operator prefers longest match.

Subject to the constraints imposed by the rules for matching the whole RE, subexpressions
also match the longest or shortest possible substrings, based on their
preferences, with subexpressions starting earlier in the RE taking priority
over ones starting later. Note that outer subexpressions thus take priority
over their component subexpressions.

Note that the quantifiers {\bf \{1,1\}} and
{\bf \{1,1\}?} can be used to force longest and shortest preference, respectively,
on a subexpression or a whole RE.

Match lengths are measured in characters,
not collating elements. An empty string is considered longer than no match
at all. For example, {\bf bb*} matches the three middle characters
of `{\bf abbbc}', {\bf (week$|$wee)(night$|$knights)}
matches all ten characters of `{\bf weeknights}', when {\bf (.*).*} is matched against
{\bf abc} the parenthesized subexpression matches all three characters, and when
{\bf (a*)*} is matched against {\bf bc} both the whole RE and the parenthesized subexpression
match an empty string.
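
The ``earliest start, then longest'' rule can be illustrated with a deliberately naive search.
The function {\tt leftmost\_longest} is a hypothetical brute-force helper, not how the library
works internally; it uses Python's {\tt re} module only to ask whether a given substring can be
matched as a whole. Python's own search is a first-match engine, which makes the contrast
visible.

\begin{verbatim}
import re

def leftmost_longest(pattern, text):
    """Earliest start first, then the longest substring the RE can match."""
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):
        for end in range(len(text), start - 1, -1):
            if compiled.fullmatch(text, start, end):
                return text[start:end]
    return None

alternation = r"(week|wee)(night|knights)"
first_match = re.search(alternation, "weeknights").group()
longest = leftmost_longest(alternation, "weeknights")
middle = leftmost_longest(r"bb*", "abbbc")
empty = leftmost_longest(r"(a*)*", "bc")
print(first_match, longest, middle, repr(empty))
\end{verbatim}

The first-match search stops at the nine characters of {\tt weeknight}, whereas the
longest-match rule described above reports all ten characters of {\tt weeknights}.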

If case-independent matching is specified, the effect
is much as if all case distinctions had vanished from the alphabet. When
an alphabetic that exists in multiple cases appears as an ordinary character
outside a bracket expression, it is effectively transformed into a bracket
expression containing both cases, so that {\bf x} becomes `{\bf $[xX]$}'. When it appears
inside a bracket expression, all case counterparts of it are added to the
bracket expression, so that {\bf $[x]$} becomes {\bf $[xX]$} and {\bf $[^x]$} becomes `{\bf $[^xX]$}'.
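
The transformation can be checked case by case with the sketch below, which uses Python's
{\tt re} module as a stand-in engine (an assumption; it is not the library described here, but
its case-insensitive option behaves the same way for these patterns).

\begin{verbatim}
import re

# Outside brackets, x behaves like [xX].
plain = bool(re.fullmatch(r"(?i)x", "X"))
explicit = bool(re.fullmatch(r"[xX]", "X"))

# Inside brackets the case counterparts are added to the list,
# so [^x] behaves like [^xX] and rejects both cases.
listed = bool(re.fullmatch(r"(?i)[x]", "X"))
complemented = bool(re.fullmatch(r"(?i)[^x]", "X"))
other = bool(re.fullmatch(r"(?i)[^x]", "y"))

print(plain, explicit, listed, complemented, other)
\end{verbatim}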

If newline-sensitive
matching is specified, {\bf .} and bracket expressions using {\bf \caret} will never match
the newline character (so that matches will never cross newlines unless
the RE explicitly arranges it) and {\bf \caret} and {\bf \$} will match the empty string after
and before a newline respectively, in addition to matching at beginning
and end of string respectively. ARE {\bf $\backslash$A} and {\bf $\backslash$Z} continue to match beginning
or end of string {\it only}.

If partial newline-sensitive matching is specified,
this affects {\bf .} and bracket expressions as with newline-sensitive matching,
but not {\bf \caret} and `{\bf \$}'.

If inverse partial newline-sensitive matching is specified,
this affects {\bf \caret} and {\bf \$} as with newline-sensitive matching, but not {\bf .} and bracket
expressions. This isn't very useful but is provided for symmetry.
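
The two halves of newline sensitivity (the anchors on one side, {\bf .} on the other) can be
observed separately in the sketch below. It uses Python's {\tt re} module as a stand-in engine,
and this is an assumption with real caveats: Python's {\tt (?m)} governs only {\bf \caret} and
{\bf \$}, its treatment of {\bf .} is controlled by a separate flag with a different default,
and complemented bracket expressions are not affected at all. The flags below are therefore
not the option letters described in this document.

\begin{verbatim}
import re

text = "one\ntwo"

# With line-oriented anchors, ^ and $ also match after and before a newline.
line_starts = re.findall(r"(?m)^\w+", text)
line_ends = re.findall(r"(?m)\w+$", text)

# \A and \Z still match at the beginning and end of the string only.
string_start = re.findall(r"(?m)\A\w+", text)
string_end = re.findall(r"(?m)\w+\Z", text)

# When '.' may not match a newline, a match cannot cross lines.
no_cross = re.search(r"one.two", text)
crossing = re.search(r"one.two", text, re.DOTALL)

print(line_starts, line_ends, string_start, string_end)
print(no_cross is None, crossing is not None)
\end{verbatim}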

\subsection{Limits And Compatibility}\label{relimits}

\helpref{Syntax of the builtin regular expression library}{wxresyn}

No particular limit is imposed on the length of REs. Programs
intended to be highly portable should not employ REs longer than 256 bytes,
as a POSIX-compliant implementation can refuse to accept such REs.

The only
feature of AREs that is actually incompatible with POSIX EREs is that {\bf $\backslash$}
does not lose its special significance inside bracket expressions. All other
ARE features use syntax which is illegal or has undefined or unspecified
effects in POSIX EREs; the {\bf ***} syntax of directors likewise is outside
the POSIX syntax for both BREs and EREs.

Many of the ARE extensions are
borrowed from Perl, but some have been changed to clean them up, and a
few Perl extensions are not present. Incompatibilities of note include `{\bf $\backslash$b}',
`{\bf $\backslash$B}', the lack of special treatment for a trailing newline, the addition of
complemented bracket expressions to the things affected by newline-sensitive
matching, the restrictions on parentheses and back references in lookahead
constraints, and the longest/shortest-match (rather than first-match) matching
semantics.

The matching rules for REs containing both normal and non-greedy
quantifiers have changed since early beta-test versions of this package.
(The new rules are much simpler and cleaner, but don't work as hard at guessing
the user's real intentions.)

Henry Spencer's original 1986 {\it regexp} package, still in widespread use,
%(e.g., in pre-8.1 releases of Tcl),
implemented an early version of today's EREs. There are four incompatibilities between {\it regexp}'s
near-EREs (`RREs' for short) and AREs. In roughly increasing order of significance:

\begin{itemize}
\item In AREs, {\bf $\backslash$} followed by an alphanumeric character is either an escape or
an error, while in RREs, it was just another way of writing the alphanumeric.
This should not be a problem because there was no reason to write such
a sequence.

\item {\bf \{} followed by a digit in an ARE is the beginning of
a bound, while in RREs, {\bf \{} was always an ordinary character. Such sequences
should be rare, and will often result in an error because following characters
will not look like a valid bound.

\item In AREs, {\bf $\backslash$} remains a special character
within `{\bf $[]$}', so a literal {\bf $\backslash$} within {\bf $[]$} must be
written `{\bf $\backslash\backslash$}'. {\bf $\backslash\backslash$} also gives a literal
{\bf $\backslash$} within {\bf $[]$} in RREs, but only truly paranoid programmers routinely doubled
the backslash.

\item AREs report the longest/shortest match for the RE, rather
than the first found in a specified search order. This may affect some RREs
which were written in the expectation that the first match would be reported.
(The careful crafting of RREs to optimize the search order for fast matching
is obsolete (AREs examine all possible matches in parallel, and their performance
is largely insensitive to their complexity) but cases where the search
order was exploited to deliberately find a match which was {\it not} the longest/shortest
will need rewriting.)
\end{itemize}

\subsection{Basic Regular Expressions}\label{wxresynbre}

\helpref{Syntax of the builtin regular expression library}{wxresyn}

BREs differ from EREs in
several respects. `{\bf $|$}', `{\bf +}', and {\bf ?} are ordinary characters and there is no equivalent
for their functionality. The delimiters for bounds
are {\bf $\backslash$\{} and `{\bf $\backslash$\}}', with {\bf \{} and
{\bf \}} by themselves ordinary characters. The parentheses for nested subexpressions
are {\bf $\backslash$(} and `{\bf $\backslash$)}', with {\bf (} and {\bf )} by themselves
ordinary characters. {\bf \caret} is an ordinary
character except at the beginning of the RE or the beginning of a parenthesized
subexpression, {\bf \$} is an ordinary character except at the end of the RE or
the end of a parenthesized subexpression, and {\bf *} is an ordinary character
if it appears at the beginning of the RE or the beginning of a parenthesized
subexpression (after a possible leading `{\bf \caret}'). Finally, single-digit back references
are available, and {\bf $\backslash<$} and {\bf $\backslash>$} are synonyms
for {\bf $[[:<:]]$} and {\bf $[[:>:]]$} respectively;
no other escapes are available.

\subsection{Regular Expression Character Names}\label{wxresynchars}

\helpref{Syntax of the builtin regular expression library}{wxresyn}

Note that the character names are case sensitive.

\begin{twocollist}
\twocolitem{NUL}{'$\backslash$0'}
\twocolitem{SOH}{'$\backslash$001'}
\twocolitem{STX}{'$\backslash$002'}
\twocolitem{ETX}{'$\backslash$003'}
\twocolitem{EOT}{'$\backslash$004'}
\twocolitem{ENQ}{'$\backslash$005'}
\twocolitem{ACK}{'$\backslash$006'}
\twocolitem{BEL}{'$\backslash$007'}
\twocolitem{alert}{'$\backslash$007'}
\twocolitem{BS}{'$\backslash$010'}
\twocolitem{backspace}{'$\backslash$b'}
\twocolitem{HT}{'$\backslash$011'}
\twocolitem{tab}{'$\backslash$t'}
\twocolitem{LF}{'$\backslash$012'}
\twocolitem{newline}{'$\backslash$n'}
\twocolitem{VT}{'$\backslash$013'}
\twocolitem{vertical-tab}{'$\backslash$v'}
\twocolitem{FF}{'$\backslash$014'}
\twocolitem{form-feed}{'$\backslash$f'}
\twocolitem{CR}{'$\backslash$015'}
\twocolitem{carriage-return}{'$\backslash$r'}
\twocolitem{SO}{'$\backslash$016'}
\twocolitem{SI}{'$\backslash$017'}
\twocolitem{DLE}{'$\backslash$020'}
\twocolitem{DC1}{'$\backslash$021'}
\twocolitem{DC2}{'$\backslash$022'}
\twocolitem{DC3}{'$\backslash$023'}
\twocolitem{DC4}{'$\backslash$024'}
\twocolitem{NAK}{'$\backslash$025'}
\twocolitem{SYN}{'$\backslash$026'}
\twocolitem{ETB}{'$\backslash$027'}
\twocolitem{CAN}{'$\backslash$030'}
\twocolitem{EM}{'$\backslash$031'}
\twocolitem{SUB}{'$\backslash$032'}
\twocolitem{ESC}{'$\backslash$033'}
\twocolitem{IS4}{'$\backslash$034'}
\twocolitem{FS}{'$\backslash$034'}
\twocolitem{IS3}{'$\backslash$035'}
\twocolitem{GS}{'$\backslash$035'}
\twocolitem{IS2}{'$\backslash$036'}
\twocolitem{RS}{'$\backslash$036'}
\twocolitem{IS1}{'$\backslash$037'}
\twocolitem{US}{'$\backslash$037'}
\twocolitem{space}{' '}
\twocolitem{exclamation-mark}{'!'}
\twocolitem{quotation-mark}{'"'}
\twocolitem{number-sign}{'\#'}
\twocolitem{dollar-sign}{'\$'}
\twocolitem{percent-sign}{'\%'}
\twocolitem{ampersand}{'\&'}
\twocolitem{apostrophe}{'$\backslash$''}
\twocolitem{left-parenthesis}{'('}
\twocolitem{right-parenthesis}{')'}
\twocolitem{asterisk}{'*'}
\twocolitem{plus-sign}{'+'}
\twocolitem{comma}{','}
\twocolitem{hyphen}{'-'}
\twocolitem{hyphen-minus}{'-'}
\twocolitem{period}{'.'}
\twocolitem{full-stop}{'.'}
\twocolitem{slash}{'/'}
\twocolitem{solidus}{'/'}
\twocolitem{zero}{'0'}
\twocolitem{one}{'1'}
\twocolitem{two}{'2'}
\twocolitem{three}{'3'}
\twocolitem{four}{'4'}
\twocolitem{five}{'5'}
\twocolitem{six}{'6'}
\twocolitem{seven}{'7'}
\twocolitem{eight}{'8'}
\twocolitem{nine}{'9'}
\twocolitem{colon}{':'}
\twocolitem{semicolon}{';'}
\twocolitem{less-than-sign}{'<'}
\twocolitem{equals-sign}{'='}
\twocolitem{greater-than-sign}{'>'}
\twocolitem{question-mark}{'?'}
\twocolitem{commercial-at}{'@'}
\twocolitem{left-square-bracket}{'$[$'}
\twocolitem{backslash}{'$\backslash$'}
\twocolitem{reverse-solidus}{'$\backslash$'}
\twocolitem{right-square-bracket}{'$]$'}
\twocolitem{circumflex}{'\caret'}
\twocolitem{circumflex-accent}{'\caret'}
\twocolitem{underscore}{'\_'}
\twocolitem{low-line}{'\_'}
\twocolitem{grave-accent}{'`'}
\twocolitem{left-brace}{'\{'}
\twocolitem{left-curly-bracket}{'\{'}
\twocolitem{vertical-line}{'$|$'}
\twocolitem{right-brace}{'\}'}
\twocolitem{right-curly-bracket}{'\}'}
\twocolitem{tilde}{'\destruct{}'}
\twocolitem{DEL}{'$\backslash$177'}
\end{twocollist}