1
<?xml version="1.0" encoding="latin1" ?>
2
<!DOCTYPE erlref SYSTEM "erlref.dtd">
9
<holder>Ericsson AB, All Rights Reserved</holder>
12
The contents of this file are subject to the Erlang Public License,
13
Version 1.1, (the "License"); you may not use this file except in
14
compliance with the License. You should have received a copy of the
15
Erlang Public License along with this software. If not, it can be
16
retrieved on line at http://www.erlang.org/.
18
Software distributed under the License is distributed on an "AS IS"
19
basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See
20
the License for the specific language governing rights and limitations
23
The Initial Developer of the Original Code is Ericsson AB.
27
<prepared>Patrik Nyblom</prepared>
28
<responsible>Kenneth Lundin</responsible>
32
<date>2008-05-27</date>
37
<modulesummary>Perl like regular expressions for Erlang</modulesummary>
40
<p>This module contains functions for regular expression
41
matching for strings and binaries.</p>
43
<p>The regular expression syntax and semantics resemble that of
44
Perl. This library in many ways replaces the old regexp library
45
written purely in Erlang, as it has a richer syntax as well as
46
many more options. The library is also faster than the
47
older regexp implementation.</p>
49
<p>Although the library's matching algorithms are currently based
50
on the PCRE library, it is not to be viewed as an Erlang to PCRE
51
mapping. Only parts of the PCRE library is interfaced and the re
52
library in some ways extend PCRE. The PCRE documentation contains
53
many parts of no interest to the Erlang programmer, why only the
54
relevant part of the documentation is included here. There should
55
bee no need to go directly to the PCRE library documentation.</p>
58
<p>The Erlang literal syntax for strings give special
59
meaning to the "\\" (backslash) character. To literally write
60
a regular expression or a replacement string containing a
61
backslash in your code or in the shell, two backslashes have to be written:
68
<title>DATA TYPES</title>
70
iodata() = iolist() | binary()
71
iolist() = [char() | binary() | iolist()]
72
- a binary is allowed as the tail of the list</code>
74
mp() = Opaque datatype containing a compiled regular expression.</code>
78
<name>compile(Regexp) -> {ok, MP} | {error, ErrSpec}</name>
79
<fsummary>Compile a regular expression into a match program</fsummary>
81
<v>Regexp = iodata()</v>
84
<p>The same as <c>compile(Regexp,[])</c></p>
88
<name>compile(Regexp,Options) -> {ok, MP} | {error, ErrSpec}</name>
89
<fsummary>Compile a regular expression into a match program</fsummary>
91
<v>Regexp = iodata()</v>
92
<v>Options = [ Option ]</v>
93
<v>Option = anchored | caseless | dollar_endonly | dotall | extended | firstline | multiline | no_auto_capture | dupnames | ungreedy | {newline, NLSpec}</v>
94
<v>NLSpec = cr | crlf | lf | anycrlf
95
<!-- XXX Unicode | any | bsr_anycrlf -->
98
<v>ErrSpec = {ErrString, Position}</v>
99
<v>ErrString = string()</v>
100
<v>Position = int()</v>
103
<p>This function compiles a regular expression with the syntax
104
described below into an internal format to be used later as a
105
parameter to the run/2,3 functions.</p>
106
<p>Compiling the regular expression before matching is useful if
107
the same expression is to be used in matching against multiple
108
subjects during the program's lifetime. Compiling once and
109
executing many times is far more efficient than compiling each
110
time one wants to match.</p>
111
<p>The options have the following meanings:</p>
113
<tag><c>anchored</c></tag>
114
<item>The pattern is forced to be "anchored", that is, it is constrained to match only at the first matching point in the string that is being searched (the "subject string"). This effect can also be achieved by appropriate constructs in the pattern itself.</item>
115
<tag><c>caseless</c></tag>
116
<item>Letters in the pattern match both upper and lower case letters. It is equivalent to Perl's /i option, and it can be changed within a pattern by a (?i) option setting. Uppercase and lowercase letters are defined as in the ISO-8859-1 character set.</item>
117
<tag><c>dollar_endonly</c></tag>
118
<item>A dollar metacharacter in the pattern matches only at the end of the subject string. Without this option, a dollar also matches immediately before a newline at the end of the string (but not before any other newlines). The <c>dollar_endonly</c> option is ignored if <c>multiline</c> is given. There is no equivalent option in Perl, and no way to set it within a pattern.</item>
119
<tag><c>dotall</c></tag>
120
<item>A dot maturate in the pattern matches all characters, including those that indicate newline. Without it, a dot does not match when the current position is at a newline. This option is equivalent to Perl's /s option, and it can be changed within a pattern by a (?s) option setting. A negative class such as [^a] always matches newline characters, independent of the setting of this option.</item>
121
<tag><c>extended</c></tag>
122
<item>Whitespace data characters in the pattern are ignored except when escaped or inside a character class. Whitespace does not include the VT character (ASCII 11). In addition, characters between an unescaped # outside a character class and the next newline, inclusive, are also ignored. This is equivalent to Perl's /x option, and it can be changed within a pattern by a (?x) option setting.
124
This option makes it possible to include comments inside complicated patterns. Note, however, that this applies only to data characters. Whitespace characters may never appear within special character sequences in a pattern, for example within the sequence <c>(?(</c> which introduces a conditional subpattern.</item>
125
<tag><c>firstline</c></tag>
126
<item>An unanchored pattern is required to match before or at the first newline in the subject string, though the matched text may continue over the newline.</item>
127
<tag><c>multiline</c></tag>
128
<item><p>By default, PCRE treats the subject string as consisting of a single line of characters (even if it actually contains newlines). The "start of line" metacharacter (^) matches only at the start of the string, while the "end of line" metacharacter ($) matches only at the end of the string, or before a terminating newline (unless <c>dollar_endonly</c> is given). This is the same as Perl.</p>
130
<p>When <c>multiline</c> it is given, the "start of line" and "end of line" constructs match immediately following or immediately before internal newlines in the subject string, respectively, as well as at the very start and end. This is equivalent to Perl's /m option, and it can be changed within a pattern by a (?m) option setting. If there are no newlines in a subject string, or no occurrences of ^ or $ in a pattern, setting <c>multiline</c> has no effect.</p> </item>
131
<tag><c>no_auto_capture</c></tag>
132
<item>Disables the use of numbered capturing parentheses in the pattern. Any opening parenthesis that is not followed by ? behaves as if it were followed by ?: but named parentheses can still be used for capturing (and they acquire numbers in the usual way). There is no equivalent of this option in Perl.
134
<tag><c>dupnames</c></tag>
135
<item>Names used to identify capturing subpatterns need not be unique. This can be helpful for certain types of pattern when it is known that only one instance of the named subpattern can ever be matched. There are more details of named subpatterns below</item>
136
<tag><c>ungreedy</c></tag>
137
<item>This option inverts the "greediness" of the quantifiers so that they are not greedy by default, but become greedy if followed by "?". It is not compatible with Perl. It can also be set by a (?U) option setting within the pattern.</item>
138
<tag><c>{newline, NLSpec}</c></tag>
140
<p>Override the default definition of a newline in the subject string, which is LF (ASCII 10) in Erlang.</p>
143
<item>Newline is indicated by a single character CR (ASCII 13)</item>
145
<item>Newline is indicated by a single character LF (ASCII 10), the default</item>
146
<tag><c>crlf</c></tag>
147
<item>Newline is indicated by the two-character CRLF (ASCII 13 followed by ASCII 10) sequence.</item>
148
<tag><c>anycrlf</c></tag>
149
<item>Any of the three preceding sequences should be recognized.</item>
151
<tag><c>any</c></tag>
152
<item><p>...</p></item>
153
<tag><c>bsr_anycrlf</c></tag>
154
<item><p>...</p></item>
163
<name>run(Subject,RE) -> {match, Captured} | nomatch</name>
164
<fsummary>Match a subject against regular expression and capture subpatterns</fsummary>
166
<v>Subject = iodata()</v>
167
<v>RE = mp() | iodata()</v>
168
<v>Captured = [ CaptureData ]</v>
169
<v>CaptureData = {int(),int()}</v>
172
<p>The same as <c>run(Subject,RE,[])</c>.</p>
176
<name>run(Subject,RE,Options) -> {match, Captured} | match | nomatch</name>
177
<fsummary>Match a subject against regular expression and capture subpatterns</fsummary>
179
<v>Subject = iodata()</v>
180
<v>RE = mp() | iodata()</v>
181
<v>Options = [ Option ]</v>
182
<v>Option = anchored | global | notbol | noteol | notempty | {offset, int()} | {newline, NLSpec} | {capture, ValueSpec} | {capture, ValueSpec, Type} | CompileOpt</v>
183
<v>Type = index | list | binary</v>
184
<v>ValueSpec = all | all_but_first | first | ValueList</v>
185
<v>ValueList = [ ValueID ]</v>
186
<v>ValueID = int() | string() | atom()</v>
187
<v>CompileOpt = see compile/2 above</v>
188
<v>NLSpec = cr | crlf | lf | anycrlf
193
<v>Captured = [ CaptureData ] | [ [ CaptureData ] ... ]</v>
194
<v>CaptureData = {int(),int()} | string() | binary()</v>
198
<p>Executes a regexp matching, returning <c>match/{match,
199
Captured}</c> or <c>nomatch</c>. The regular expression can be
200
given either as <c>iodata()</c> in which case it is
201
automatically compiled (as by <c>re:compile/2</c>) and executed,
202
or as a pre compiled <c>mp()</c> in which case it is executed
203
against the subject directly.</p>
205
<p>When compilation is involved, the exception <c>badarg</c> is thrown if
206
a compilation error occurs. To locate the error in the regular
207
expression, use the function <c>re:compile/2</c> to get more information.</p>
209
<p>If the regular expression is previously compiled, the option
210
list can only contain the options <c>anchored</c>,
211
<c>global</c>, <c>notbol</c>, <c>noteol</c>,
212
<c>notempty</c>, <c>{offset, int()}</c>, <c>{newline,
213
NLSpec}</c> and <c>{capture, ValueSpec}/{capture, ValueSpec,
214
Type}</c>. Otherwise all options valid for the
215
<c>re:compile/2</c> function are allowed as well. Options
216
allowed both for compilation and execution of a match, namely
217
<c>anchored</c> and <c>{newline, NLSpec}</c>, will affect both
218
the compilation and execution if present together with a non
219
pre-compiled regular expression.</p>
221
<p>The <c>{capture, ValueSpec}/{capture, ValueSpec, Type}</c>
222
defines what to return from the function upon successful
223
matching. The <c>capture</c> tuple may contain both a
224
value specification telling which of the captured
225
substrings are to be returned, and a type specification, telling
226
how captured substrings are to be returned (as index tuples,
227
lists or binaries). The <c>capture</c> option makes the function
228
quite flexible and powerful. The different options are described
231
<p>If the capture options describe that no substring capturing
232
at all is to be done (<c>{capture, none}</c>), the function will
233
return the single atom <c>match</c> upon successful matching,
235
<c>{match, ValueList}</c> is returned. Disabling capturing can
236
be done either by specifying <c>none</c> or an empty list as
237
<c>ValueSpec</c>.</p>
239
<p>A description of all the options relevant for execution follows:</p>
242
<tag><c>anchored</c></tag>
244
<item>Limits <c>re:run/3</c> to matching at the first matching
245
position. If a pattern was compiled with <c>anchored</c>, or
246
turned out to be anchored by virtue of its contents, it cannot
247
be made unachored at matching time, hence there is no
248
<c>unanchored</c> option.</item>
250
<tag><c>global</c></tag>
253
<p>Implements global (repetitive) search as the <c>g</c> flag in
254
i.e. Perl. Each match found is returned as a separate
255
<c>list()</c> containing the specific match as well as any
256
matching subexpressions (or as specified by the <c>capture
257
option</c>). The <c>Captured</c> part of the return value will
258
hence be a <c>list()</c> of <c>list()</c>'s when this
261
<p>When the regular expression matches an empty string, the
262
behaviour might seem non-intuitive, why the behaviour requites
263
some clarifying. With the global option, <c>re:run/3</c>
264
handles empty matches in the same way as Perl, meaning that a
265
match at any point giving an empty string (with length 0) will
266
be retried with the options
267
<c>[anchored, notempty]</c> as well. If that
268
search gives a result of length > 0, the result is included.
271
<code> re:run("cat","(|at)",[global]).</code>
273
<p>The matching will be performed as following:</p>
275
<tag>At offset <c>0</c></tag>
276
<item>The regexp <c>(|at)</c> will first match at the initial
277
position of the string <c>cat</c>, giving the result set
278
<c>[{0,0},{0,0}]</c> (the second <c>{0,0}</c> is due to the
279
subexpression marked by the parentheses). As the length of the
280
match is 0, we don't advance to the next position yet.</item>
281
<tag>At offset <c>0</c> with <c>[anchored, notempty]</c></tag>
282
<item> The search is retried
283
with the options <c>[anchored, notempty]</c> at the same
284
position, which does not give any interesting result of longer
285
length, why the search position is now advanced to the next
286
character (<c>a</c>).</item>
287
<tag>At offset <c>1</c></tag>
288
<item>Now the search results in
289
<c>[{1,0},{1,0}]</c> meaning this search will also be repeated
290
with the extra options.</item>
291
<tag>At offset <c>1</c> with <c>[anchored, notempty]</c></tag>
292
<item>Now the <c>ab</c> alternative
293
is found and the result will be [{1,2},{1,2}]. The result is
294
added to the list of results and the position in the
295
search string is advanced two steps.</item>
296
<tag>At offset <c>3</c></tag>
297
<item>The search now once again
298
matches the empty string, giving <c>[{3,0},{3,0}]</c>.</item>
299
<tag>At offset <c>1</c> with <c>[anchored, notempty]</c></tag>
300
<item>This will give no result of length > 0 and we are at
301
the last position, so the global search is complete.</item>
303
<p>The result of the call is:</p>
305
<code> {match,[[{0,0},{0,0}],[{1,0},{1,0}],[{1,2},{1,2}],[{3,0},{3,0}]]}</code>
308
<tag><c>notempty</c></tag>
310
<p>An empty string is not considered to be a valid match if this
311
option is given. If there are alternatives in the pattern, they
312
are tried. If all the alternatives match the empty string, the
313
entire match fails. For example, if the pattern</p>
315
<p>is applied to a string not beginning with "a" or "b", it
316
matches the empty string at the start of the subject. With
317
<c>notempty</c> given, this match is not valid, so re:run/3 searches
318
further into the string for occurrences of "a" or "b".</p>
320
<p>Perl has no direct equivalent of <c>notempty</c>, but it does
321
make a special case of a pattern match of the empty string
322
within its split() function, and when using the /g modifier. It
323
is possible to emulate Perl's behavior after matching a null
324
string by first trying the match again at the same offset with
325
<c>notempty</c> and <c>anchored</c>, and then if that fails by
326
advancing the starting offset (see below) and trying an ordinary
329
<tag><c>notbol</c></tag>
331
<item>This option specifies that the first character of the subject
332
string is not the beginning of a line, so the circumflex
333
metacharacter should not match before it. Setting this without
334
<c>multiline</c> (at compile time) causes circumflex never to
335
match. This option affects only the behavior of the circumflex
336
metacharacter. It does not affect \A.</item>
338
<tag><c>noteol</c></tag>
340
<item>This option specifies that the end of the subject string
341
is not the end of a line, so the dollar metacharacter should not
342
match it nor (except in multiline mode) a newline immediately
343
before it. Setting this without <c>multiline</c> (at compile time)
344
causes dollar never to match. This option affects only the
345
behavior of the dollar metacharacter. It does not affect \Z or
348
<tag><c>{offset, int()}</c></tag>
350
<item>Start matching at the offset (position) given in the
351
subject string. The offset is zero-based, so that the default is
352
<c>{offset,0}</c> (all of the subject string).</item>
354
<tag><c>{newline, NLSpec}</c></tag>
356
<p>Override the default definition of a newline in the subject string, which is LF (ASCII 10) in Erlang.</p>
359
<item>Newline is indicated by a single character CR (ASCII 13)</item>
361
<item>Newline is indicated by a single character LF (ASCII 10), the default</item>
362
<tag><c>crlf</c></tag>
363
<item>Newline is indicated by the two-character CRLF (ASCII 13 followed by ASCII 10) sequence.</item>
364
<tag><c>anycrlf</c></tag>
365
<item>Any of the three preceding sequences should be recognized.</item>
367
<tag><c>any</c></tag>
368
<item><p>...</p></item>
369
<tag><c>bsr_anycrlf</c></tag>
370
<item><p>...</p></item>
375
<tag><c>{capture, ValueSpec}</c>/<c>{capture, ValueSpec, Type}</c></tag>
378
<p>Specifies which captured substrings are returned and in what
380
<c>re:run/3</c> captures all of the matching part of the
381
substring as well as all capturing subpatterns (all of the
382
pattern is automatically captured). The default return type is
383
(zero-based) indexes of the captured parts of the string, given as
384
<c>{Offset,Length}</c> pairs (the <c>index</c> <c>Type</c> of
387
<p>As an example of the default behavior, the following call:</p>
389
<code> re:run("ABCabcdABC","abcd",[]).</code>
391
<p>returns, as first and only captured string the matching part of the subject ("abcd" in the middle) as a index pair <c>{3,4}</c>, where character positions are zero based, just as in offsets. The return value of the call above would then be:</p>
392
<code> {match,[{3,4}]}</code>
393
<p>Another (and quite common) case is where the regular expression matches all of the subject, as in:</p>
394
<code> re:run("ABCabcdABC",".*abcd.*",[]).</code>
395
<p>where the return value correspondingly will point out all of the string, beginning at index 0 and being 10 characters long:</p>
396
<code> {match,[{0,10}]}</code>
398
<p>If the regular expression contains capturing subpatterns,
399
like in the following case:</p>
401
<code> re:run("ABCabcdABC",".*(abcd).*",[]).</code>
403
<p>all of the matched subject is captured, as
404
well as the captured substrings:</p>
406
<code> {match,[{0,10},{3,4}]}</code>
408
<p>the complete matching pattern always giving the first return value in the
409
list and the rest of the subpatterns being added in the
410
order they occurred in the regular expression.</p>
412
<p>The capture tuple is built up as follows:</p>
414
<tag><c>ValueSpec</c></tag>
415
<item><p>Specifies which captured (sub)patterns are to be returned. The ValueSpec can either be an atom describing a predefined set of return values, or a list containing either the indexes or the names of specific subpatterns to return.</p>
416
<p>The predefined sets of subpatterns are:</p>
418
<tag><c>all</c></tag>
419
<item>All captured subpatterns including the complete matching string. This is the default.</item>
420
<tag><c>first</c></tag>
421
<item>Only the first captured subpattern, which is always the complete matching part of the subject. All explicitly captured subpatterns are discarded.</item>
422
<tag><c>all_but_first</c></tag>
423
<item>All but the first matching subpattern, i.e. all explicitly captured subpatterns, but not the complete matching part of the subject string. This is useful if the regular expression as a whole matches a large part of the subject, but the part you're interested in is in an explicitly captured subpattern. If the return type is <c>list</c> or <c>binary</c>, not returning subpatterns you're not interested in is a good way to optimize.</item>
424
<tag><c>none</c></tag>
425
<item>Do not return matching subpatterns at all, yielding the single atom <c>match</c> as the return value of the function when matching successfully instead of the <c>{match, list()}</c> return. Specifying an empty list gives the same behavior.</item>
427
<p>The value list is a list of indexes for the subpatterns to return, where index 0 is for all of the pattern, and 1 is for the first explicit capturing subpattern in the regular expression, and so forth. When using named captured subpatterns (see below) in the regular expression, one can use <c>atom()</c>'s or <c>string()</c>'s to specify the subpatterns to be returned. This deserves an example, consider the following regular expression:</p>
428
<code> ".*(abcd).*"</code>
429
<p>matched against the string ""ABCabcdABC", capturing only the "abcd" part (the first explicit subpattern):</p>
430
<code> re:run("ABCabcdABC",".*(abcd).*",[{capture,[1]}]).</code>
431
<p>The call will yield the following result:</p>
432
<code> {match,[{3,4}]}</code>
433
<p>as the first explicitly captured subpattern is "(abcd)", matching "abcd" in the subject, at (zero-based) position 3, of length 4.</p>
434
<p>Now consider the same regular expression, but with the subpattern explicitly named 'FOO':</p>
435
<code> ".*(?<FOO>abcd).*"</code>
436
<p>With this expression, we could still give the index of the subpattern with the following call:</p>
437
<code> re:run("ABCabcdABC",".*(?<FOO>abcd).*",[{capture,[1]}]).</code>
438
<p>giving the same result as before. But as the subpattern is named, we can also give its name in the value list:</p>
439
<code> re:run("ABCabcdABC",".*(?<FOO>abcd).*",[{capture,['FOO']}]).</code>
440
<p>which would yield the same result as the earlier examples, namely:</p>
441
<code> {match,[{3,4}]}</code>
443
<p>The values list might specify indexes or names not present in
444
the regular expression, in which case the return values vary
445
depending on the type. If the type is <c>index</c>, the tuple
446
<c>{-1,0}</c> is returned for values having no corresponding
447
subpattern in the regexp, but for the other types
448
(<c>binary</c> and <c>list</c>), the values are the empty binary
449
or list respectively.</p>
452
<tag><c>Type</c></tag>
453
<item><p>Optionally specifies how captured substrings are to be returned. If omitted, the default of <c>index</c> is used. The <c>Type</c> can be one of the following:</p>
455
<tag><c>index</c></tag>
456
<item>Return captured substrings as pairs of byte indexes into the subject string and length of the matching string in the subject (as if the subject string was flattened with iolist_to_binary prior to matching). This is the default.</item>
457
<tag><c>list</c></tag>
458
<item>Return matching substrings as lists of characters (Erlang <c>string()</c>'s).</item>
459
<tag><c>binary</c></tag>
460
<item>Return matching substrings as binaries.</item>
464
<p>In general, subpatterns that got assigned no value in the match are returned as the tuple <c>{-1,0}</c> when <c>type</c> is <c>index</c>. Unasigned subpatterns are returned as the empty binary or list respectively for other return types. Consider the regular expression:</p>
465
<code> ".*((?<FOO>abdd)|a(..d)).*"</code>
466
<p>There are three explicitly capturing subpatterns, where the opening parenthesis position determines the order in the result, hence <c>((?<FOO>abdd)|a(..d))</c> is subpattern index 1, <c>(?<FOO>abdd)</c> is subpattern index 2 and <c>(..d)</c> is subpattern index 3. When matched against the following string:</p>
467
<code> "ABCabcdABC"</code>
468
<p>the subpattern at index 2 won't match, as "abdd" is not present in the string, but the complete pattern matches (due to the alternative <c>a(..d)</c>. The subpattern at index 2 is therefore unassigned and the default return value will be:</p>
469
<code> {match,[{0,10},{3,4},{-1,0},{4,3}]}</code>
470
<p>Setting the capture <c>Type</c> to <c>binary</c> would give the following:</p>
471
<code> {match,[<<"ABCabcdABC">>,<<"abcd">>,<<>>,<<"bcd">>]}</code>
472
<p>where the empty binary (<c><<>></c>) represents the unassigned subpattern. In the <c>binary</c> case, some information about the matching is therefore lost, the <c><<>></c> might just as well be an empty string captured.</p>
473
<p>If differentiation between empty matches and non existing subpatterns is necessary, use the <c>type</c> <c>index</c>
474
and do the conversion to the final type in Erlang code.</p>
476
<p>When the option <c>global</c> is given, the <c>capture</c>
477
specification affects each match separately, so that:</p>
479
<code> re:run("cacb","c(a|b)",[global,{capture,[1],list}]).</code>
481
<p>gives the result:</p>
483
<code> {match,[["a"],["b"]]}</code>
487
<p>The options solely affecting the compilation step are described in the <c>re:compile/2</c> function.</p>
491
<name>replace(Subject,RE,Replacement) -> iodata() </name>
492
<fsummary>Match a subject against regular expression and replace matching elements with Replacement</fsummary>
494
<v>Subject = iodata()</v>
495
<v>RE = mp() | iodata()</v>
496
<v>Replacement = iodata()</v>
499
<p>The same as <c>replace(Subject,RE,Replacement,[])</c>.</p>
503
<name>replace(Subject,RE,Replacement, Options) -> iodata() | binary() | list()</name>
504
<fsummary>Match a subject against regular expression and replace matching elements with Replacement</fsummary>
506
<v>Subject = iodata()</v>
507
<v>RE = mp() | iodata()</v>
508
<v>Replacement = iodata()</v>
509
<v>Options = [ Option ]</v>
510
<v>Option = anchored | global | notbol | noteol | notempty | {offset, int()} | {newline, NLSpec} | {return, ReturnType} | CompileOpt</v>
511
<v>ReturnType = iodata | list | binary</v>
512
<v>CompileOpt = see compile/2 above</v>
513
<v>NLSpec = cr | crlf | lf | anycrlf
520
<p>Replaces the matched part of the <c>Subject</c> string with the content of <c>Replacement</c>.</p>
521
<p>Options are given as to the <c>re:run/3</c> function except that the <c>capture</c> option of <c>re:run/3</c> is not allowed.
522
Instead a <c>{return, ReturnType}</c> is present. The default return type is <c>iodata</c>, constructed in a
523
way to minimize copying. The <c>iodata</c> result can be used directly in many i/o-operations. If a flat <c>list()</c> is
524
desired, specify <c>{return, list}</c> and if a binary is preferred, specify <c>{return, binary}</c>.</p>
525
<p>The replacement string can contain the special character
526
<c>&</c>, which inserts the whole matching expression in the
527
result, and the special sequence <c>\\</c>N (where N is an
528
integer > 0), resulting in the subexpression number N will be
529
inserted in the result. If no subexpression with that number is
530
generated by the regular expression, nothing is inserted.</p>
531
<p>To insert an <c>&</c> or <c>\\</c> in the result, precede it
532
with a <c>\\</c>. Note that Erlang already gives a special
533
meaning to <c>\\</c> in literal strings, why a single <c>\\</c>
534
has to be written as <c>"\\\\"</c> and therefore a double <c>\\</c>
535
as <c>"\\\\\\\\"</c>. Example:</p>
536
<code> re:replace("abcd","c","[&]",[{return,list}]).</code>
538
<code> "ab[c]d"</code>
540
<code> re:replace("abcd","c","[\\\&]",[{return,list}]).</code>
542
<code> "ab[&]d"</code>
543
<p>As with <c>re:run/3</c>, compilation errors raise the <c>badarg</c>
544
exception, <c>re:compile/2</c> can be used to get more information
549
<name>split(Subject,RE) -> SplitList</name>
550
<fsummary>Split a string by tokens specified as a regular expression</fsummary>
552
<v>Subject = iodata()</v>
553
<v>RE = mp() | iodata()</v>
554
<v>SplitList = [ iodata() ]</v>
557
<p>The same as <c>split(Subject,RE,[])</c>.</p>
562
<name>split(Subject,RE,Options) -> SplitList</name>
563
<fsummary>Split a string by tokens specified as a regular expression</fsummary>
565
<v>Subject = iodata()</v>
566
<v>RE = mp() | iodata()</v>
567
<v>Options = [ Option ]</v>
568
<v>Option = anchored | global | notbol | noteol | notempty | {offset, int()} | {newline, NLSpec} | {return, ReturnType} | {parts, NumParts} | group | trim | CompileOpt</v>
569
<v>NumParts = int() | infinity</v>
570
<v>ReturnType = iodata | list | binary</v>
571
<v>CompileOpt = see compile/2 above</v>
572
<v>NLSpec = cr | crlf | lf | anycrlf
577
<v>SplitList = [ RetData ] | [ GroupedRetData ]</v>
578
<v>GroupedRetData = [ RetData ]</v>
579
<v>RetData = iodata() | binary() | list()</v>
582
<p>This function splits the input into parts by finding tokens
583
according to the regular expression supplied.</p>
585
<p>The splitting is done basically by running a global regexp match and
586
dividing the initial string wherever a match occurs. The matching part
587
of the string is removed from the output.</p>
589
<p>The result is given as a list of "strings", the
590
preferred datatype given in the <c>return</c> option (default iodata).</p>
591
<p>If subexpressions are given in the regular expression, the
592
matching subexpressions are returned in the resulting list as
593
well. An example:</p>
595
<code> re:split("Erlang","[ln]",[{return,list}]).</code>
597
<p>will yield the result:</p>
599
<code> ["Er","a","g"]</code>
603
<code> re:split("Erlang","([ln])",[{return,list}]).</code>
607
<code> ["Er","l","a","n","g"]</code>
609
<p>The text matching the subexpression (marked by the parantheses
611
inserted in the result list where it was found. In effect this means
612
that concatenating the result of a split where the whole regexp is a
613
single subexpression (as in the example above) will always result in
614
the original string.</p>
616
<p>As there is no matching subexpression for the last part in
617
the example (the "g"), there is nothing inserted after
618
that. To make the group of strings and the parts matching the
619
subexpressions more obvious, one might use the <c>group</c>
620
option, which groups together the part of the subject string with the
621
parts matching the subexpressions when the string was split:</p>
623
<code> re:split("Erlang","([ln])",[{return,list},group]).</code>
627
<code> [["Er","l"],["a","n"],["g"]]</code>
629
<p>Here the regular expression matched first the "l",
630
causing "Er" to be the first part in the result. When
631
the regular expression matched, the (only) subexpression was
632
bound to the "l", why the "l" is inserted
633
in the group together with "Er". The next match is of
634
the "n", making "a" the next part to be
635
returned. As the subexpression is bound to the substring
636
"n" in this case, the "n" is inserted into
637
this group. The last group consists of the rest of the string,
638
as no more matches are found.</p>
641
<p>By default, all parts of the string, including the empty
642
strings are returned from the function. As an example:</p>
644
<code> re:split("Erlang","[lg]",[{return,list}]).</code>
646
<p>The result will be:</p>
648
<code> ["Er","an",[]]</code>
650
<p>as the matching of the "g" in the end of the string
651
leaves an empty rest which is also returned. This behaviour
652
differs from the default behaviour of the split function in
653
Perl, where empty strings at the end are by default removed. To
655
"trimming" default behavior of Perl, specify
656
<c>trim</c> as an option:</p>
658
<code> re:split("Erlang","[lg]",[{return,list},trim]).</code>
660
<p>The result will be:</p>
662
<code> ["Er","an"]</code>
664
<p>The "trim" option in effect says; "give me as
665
many parts as possible except the empty ones", which might
666
be useful in some circumstances. You can also specify how many
667
parts you want, by specifying <c>{parts,</c>N<c>}</c>:</p>
669
<code> re:split("Erlang","[lg]",[{return,list},{parts,2}]).</code>
671
<p>This will give:</p>
673
<code> ["Er","ang"]</code>
675
<p>Note that the last part is "ang", not
676
"an", as we only specified splitting into two parts,
677
and the splitting stops when enough parts are given, why the
678
result differs from that of <c>trim</c>.</p>
680
<p>More than three parts are not possible with this indata, why</p>
682
<code> re:split("Erlang","[lg]",[{return,list},{parts,4}]).</code>
684
<p>will give the same result as the default, which is to be
685
viewed as "an infinite number of parts".</p>
687
<p>Specifying <c>0</c> as the number of parts gives the same
688
effect as the option <c>trim</c>. If subexpressions are
689
captured, empty subexpression matches at the end are also
690
stripped from the result if <c>trim</c> or <c>{parts,0}</c> is
693
<p>If you are familiar with Perl, the <c>trim</c>
694
behaviour corresponds exactly to the Perl default, the
695
<c>{parts,N}</c> where N is a positive integer corresponds
696
exactly to the Perl behaviour with a positive numerical third
697
parameter and the default behaviour of <c>re:split/3</c> corresponds
698
to that when the Perl routine is given a negative integer as the
701
<p>Summary of options not previously described for the <c>re:run/3</c> function:</p>
703
<tag>{return,ReturnType}</tag>
704
<item><p>Specifies how the parts of the original string are presented in the result list. The possible types are:</p>
707
<item>The variant of <c>iodata()</c> that gives the least copying of data with the current implementation (often a binary, but don't depend on it).</item>
709
<item>All parts returned as binaries.</item>
711
<item>All parts returned as lists of characters ("strings").</item>
717
<p>Groups together the part of the string with
718
the parts of the string matching the subexpressions of the
720
<p>The return value from the function will in this case be a
721
<c>list()</c> of <c>list()</c>'s. Each sublist begins with the
722
string picked out of the subject string, followed by the parts
723
matching each of the subexpressions in order of occurence in the
724
regular expression.</p>
730
<p>Specifies the number of parts the subject string is to be
733
<p>The number of parts should be a positive integer for a specific maximum on the
734
number of parts and <c>infinity</c> for the maximum number of
735
parts possible (the default). Specifying <c>{parts,0}</c> gives as many parts as
736
possible disregarding empty parts at the end, the same as
737
specifying <c>trim</c></p>
742
<p>Specifies that empty parts at the end of the result list are
743
to be disregarded. The same as specifying <c>{parts,0}</c>. This
744
corresponds to the default behaviour of the <c>split</c>
745
built in function in Perl.</p>
755
<title>PERL LIKE REGULAR EXPRESSIONS SYNTAX</title>
756
<p>The following sections contain reference material for the
757
regular expressions used by this module. The regular expression
758
reference is taken from the PCRE documentation, but converted as
760
<p>The documentation is altered where appropriate and where the re
761
module behaves differently than the PCRE library.</p>
764
<section><title>PCRE regular expression details</title>
766
<p>The syntax and semantics of the regular expressions that are supported by PCRE
767
are described in detail below. Perl's regular expressions are described in its own documentation, and
768
regular expressions in general are covered in a number of books, some of which
769
have copious examples. Jeffrey Friedl's "Mastering Regular Expressions",
770
published by O'Reilly, covers regular expressions in great detail. This
771
description of PCRE's regular expressions is intended as reference material.</p>
772
<p>The reference material is divided into the following sections:</p>
774
<item><seealso marker="#sect1">Newline conventions</seealso></item>
775
<item><seealso marker="#sect2">Characters and metacharacters</seealso></item>
776
<item><seealso marker="#sect3">Backslash</seealso></item>
777
<item><seealso marker="#sect4">Circumflex and dollar</seealso></item>
778
<item><seealso marker="#sect5">Full stop (period, dot)</seealso></item>
780
<item><seealso marker="#sect6">Matching a single byte</seealso></item>
782
<item><seealso marker="#sect7">Square brackets and character classes</seealso></item>
783
<item><seealso marker="#sect8">Posix character classes</seealso></item>
784
<item><seealso marker="#sect9">Vertical bar</seealso></item>
785
<item><seealso marker="#sect10">Internal option setting</seealso></item>
786
<item><seealso marker="#sect11">Subpatterns</seealso></item>
787
<item><seealso marker="#sect12">Duplicate subpattern numbers</seealso></item>
788
<item><seealso marker="#sect13">Named subpatterns</seealso></item>
789
<item><seealso marker="#sect14">Repetition</seealso></item>
790
<item><seealso marker="#sect15">Atomic grouping and possessive quantifiers</seealso></item>
791
<item><seealso marker="#sect16">Back references</seealso></item>
792
<item><seealso marker="#sect17">Assertions</seealso></item>
793
<item><seealso marker="#sect18">Conditional subpatterns</seealso></item>
794
<item><seealso marker="#sect19">Comments</seealso></item>
795
<item><seealso marker="#sect20">Recursive patterns</seealso></item>
796
<item><seealso marker="#sect21">Subpatterns as subroutines</seealso></item>
798
<item><seealso marker="#sect22">Callouts</seealso></item>
800
<item><seealso marker="#sect23">Backtracking control</seealso></item>
806
<section><marker id="sect1"></marker><title>Newline conventions</title>
815
different conventions for indicating line breaks in
816
strings: a single CR (carriage return) character, a single LF (linefeed)
817
character, the two-character sequence CRLF
819
, any of the three preceding, or any
820
Unicode newline sequence.</p>
823
or any of the three preceding.</p>
825
<p>It is also possible to specify a newline convention by starting a pattern
826
string with one of the following five sequences:</p>
829
<tag>(*CR)</tag> <item>carriage return</item>
830
<tag>(*LF)</tag> <item>linefeed</item>
831
<tag>(*CRLF)</tag> <item>carriage return, followed by linefeed</item>
832
<tag>(*ANYCRLF)</tag> <item>any of the three above</item>
834
<tag>(*ANY)</tag> <item>all Unicode newline sequences</item>
838
<p>These override the default and the options given to <c>re:compile/2</c>. For
839
example, the pattern:</p>
845
<p>changes the convention to CR. That pattern matches "a\\nb" because LF is no
846
longer a newline. Note that these special settings, which are not
847
Perl-compatible, are recognized only at the very start of a pattern, and that
848
they must be in upper case. If more than one of them is present, the last one
852
<p>The newline convention does not affect what the \\R escape sequence matches. By
853
default, this is any Unicode newline sequence, for Perl compatibility. However,
854
this can be changed; see the description of \\R in the section entitled
858
below. A change of \\R setting can be combined with a change of newline
865
<section><marker id="sect2"></marker><title>Characters and metacharacters</title>
868
<p>A regular expression is a pattern that is matched against a subject
869
string from left to right. Most characters stand for themselves in a
870
pattern, and match the corresponding characters in the subject. As a
871
trivial example, the pattern</p>
874
<p> The quick brown fox</p>
877
<p>matches a portion of a subject string that is identical to
878
itself. When caseless matching is specified (the <c>caseless</c>
879
option), letters are matched independently of case.</p>
881
<p>The power of regular expressions comes from the ability to include
882
alternatives and repetitions in the pattern. These are encoded in the
883
pattern by the use of <em>metacharacters</em>, which do not stand for
884
themselves but instead are interpreted in some special way.</p>
886
<p>There are two different sets of metacharacters: those that are recognized
887
anywhere in the pattern except within square brackets, and those that are
888
recognized within square brackets. Outside square brackets, the metacharacters
892
<tag>\\</tag> <item>general escape character with several uses</item>
893
<tag>^</tag> <item>assert start of string (or line, in multiline mode)</item>
894
<tag>$</tag> <item>assert end of string (or line, in multiline mode)</item>
895
<tag>.</tag> <item>match any character except newline (by default)</item>
896
<tag>[</tag> <item>start character class definition</item>
897
<tag>|</tag> <item>start of alternative branch</item>
898
<tag>(</tag> <item>start subpattern</item>
899
<tag>)</tag> <item>end subpattern</item>
900
<tag>?</tag> <item>extends the meaning of (,
901
also 0 or 1 quantifier,
902
also quantifier minimizer</item>
903
<tag>*</tag> <item>0 or more quantifier</item>
904
<tag>+</tag> <item>1 or more quantifier,
905
also "possessive quantifier"</item>
906
<tag>{</tag> <item>start min/max quantifier</item>
909
<p>Part of a pattern that is in square brackets is called a "character class". In
910
a character class the only metacharacters are:</p>
913
<tag>\\</tag> <item>general escape character</item>
914
<tag>^</tag> <item>negate the class, but only if the first character</item>
915
<tag>-</tag> <item>indicates character range</item>
916
<tag>[</tag> <item>POSIX character class (only if followed by POSIX
918
<tag>]</tag> <item>terminates the character class</item>
921
<p>The following sections describe the use of each of the metacharacters.</p>
926
<section><marker id="sect3"></marker><title>Backslash</title>
929
<p>The backslash character has several uses. Firstly, if it is followed by a
930
non-alphanumeric character, it takes away any special meaning that character
931
may have. This use of backslash as an escape character applies both inside and
932
outside character classes.</p>
934
<p>For example, if you want to match a * character, you write \\* in the pattern.
935
This escaping action applies whether or not the following character would
936
otherwise be interpreted as a metacharacter, so it is always safe to precede a
937
non-alphanumeric with backslash to specify that it stands for itself. In
938
particular, if you want to match a backslash, you write \\\\.</p>
940
<p>If a pattern is compiled with the <c>extended</c> option, whitespace in the
941
pattern (other than in a character class) and characters between a # outside
942
a character class and the next newline are ignored. An escaping backslash can
943
be used to include a whitespace or # character as part of the pattern.</p>
945
<p>If you want to remove the special meaning from a sequence of characters, you
946
can do so by putting them between \\Q and \\E. This is different from Perl in
947
that $ and @ are handled as literals in \\Q...\\E sequences in PCRE, whereas in
948
Perl, $ and @ cause variable interpolation. Note the following examples:</p>
950
Pattern PCRE matches Perl matches
952
\\Qabc$xyz\\E abc$xyz abc followed by the contents of $xyz
953
\\Qabc\\$xyz\\E abc\\$xyz abc\\$xyz
954
\\Qabc\\E\\$\\Qxyz\\E abc$xyz abc$xyz</code>
957
<p>The \\Q...\\E sequence is recognized both inside and outside character classes.</p>
960
<p><em>Non-printing characters</em></p>
962
<p>A second use of backslash provides a way of encoding non-printing characters
963
in patterns in a visible manner. There is no restriction on the appearance of
964
non-printing characters, apart from the binary zero that terminates a pattern,
965
but when a pattern is being prepared by text editing, it is usually easier to
966
use one of the following escape sequences than the binary character it
970
<tag>\\a</tag> <item>alarm, that is, the BEL character (hex 07)</item>
971
<tag>\\cx</tag> <item>"control-x", where x is any character</item>
972
<tag>\\e </tag> <item>escape (hex 1B)</item>
973
<tag>\\f</tag> <item>formfeed (hex 0C)</item>
974
<tag>\\n</tag> <item>linefeed (hex 0A)</item>
975
<tag>\\r</tag> <item>carriage return (hex 0D)</item>
976
<tag>\\t </tag> <item>tab (hex 09)</item>
977
<tag>\\ddd</tag> <item>character with octal code ddd, or backreference</item>
978
<tag>\\xhh </tag> <item>character with hex code hh</item>
979
<tag>\\x{hhh..}</tag> <item>character with hex code hhh..</item>
982
<p>The precise effect of \\cx is as follows: if x is a lower case letter, it
983
is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
984
Thus \\cz becomes hex 1A, but \\c{ becomes hex 3B, while \\c; becomes hex
987
<p>After \\x, from zero to two hexadecimal digits are read (letters can be in
988
upper or lower case). Any number of hexadecimal digits may appear between \\x{
989
and }, but the value of the character code must be less than 256.</p>
993
mode, and less than 2**31 in UTF-8 mode. That is, the maximum value in
994
hexadecimal is 7FFFFFFF. Note that this is bigger than the largest Unicode code
995
point, which is 10FFFF.</p>
998
<p>If characters other than hexadecimal digits appear between \\x{ and }, or if
999
there is no terminating }, this form of escape is not recognized. Instead, the
1000
initial \\x will be interpreted as a basic hexadecimal escape, with no
1001
following digits, giving a character whose value is zero.</p>
1003
<p>Characters whose value is less than 256 can be defined by either of the two
1004
syntaxes for \\x. There is no difference in the way they are handled. For
1005
example, \\xdc is exactly the same as \\x{dc}.</p>
1007
<p>After \\0 up to two further octal digits are read. If there are fewer than two
1008
digits, just those that are present are used. Thus the sequence \\0\\x\\07
1009
specifies two binary zeros followed by a BEL character (code value 7). Make
1010
sure you supply two digits after the initial zero if the pattern character that
1011
follows is itself an octal digit.</p>
1013
<p>The handling of a backslash followed by a digit other than 0 is complicated.
1014
Outside a character class, PCRE reads it and any following digits as a decimal
1015
number. If the number is less than 10, or if there have been at least that many
1016
previous capturing left parentheses in the expression, the entire sequence is
1017
taken as a <em>back reference</em>. A description of how this works is given
1018
later, following the discussion of parenthesized subpatterns.</p>
1021
<p>Inside a character class, or if the decimal number is greater than 9 and there
1022
have not been that many capturing subpatterns, PCRE re-reads up to three octal
1023
digits following the backslash, and uses them to generate a data character. Any
1024
subsequent digits stand for themselves.
1026
character specified in octal must be less than \\400.
1028
In non-UTF-8 mode, the value of a
1029
character specified in octal must be less than \\400. In UTF-8 mode, values up
1030
to \\777 are permitted.
1035
<tag>\\040</tag> <item>is another way of writing a space</item>
1037
<tag>\\40</tag> <item>is the same, provided there are fewer than 40
1038
previous capturing subpatterns</item>
1039
<tag>\\7</tag> <item>is always a back reference</item>
1041
<tag>\\11</tag> <item> might be a back reference, or another way of
1042
writing a tab</item>
1043
<tag>\\011</tag> <item>is always a tab</item>
1044
<tag>\\0113</tag> <item>is a tab followed by the character "3"</item>
1046
<tag>\\113</tag> <item>might be a back reference, otherwise the
1047
character with octal code 113</item>
1049
<tag>\\377</tag> <item>might be a back reference, otherwise
1050
the byte consisting entirely of 1 bits</item>
1052
<tag>\\81</tag> <item>is either a back reference, or a binary zero
1053
followed by the two characters "8" and "1"</item>
1056
<p>Note that octal values of 100 or greater must not be introduced by
1057
a leading zero, because no more than three octal digits are ever
1060
<p>All the sequences that define a single character value can be used
1061
both inside and outside character classes. In addition, inside a
1062
character class, the sequence \\b is interpreted as the backspace
1063
character (hex 08), and the sequences \\R and \\X are interpreted as
1064
the characters "R" and "X", respectively. Outside a character class,
1065
these sequences have different meanings (see below).</p>
1067
<p><em>Absolute and relative back references</em></p>
1069
<p>The sequence \\g followed by an unsigned or a negative number,
1070
optionally enclosed in braces, is an absolute or relative back
1071
reference. A named back reference can be coded as \\g{name}. Back
1072
references are discussed later, following the discussion of
1073
parenthesized subpatterns.</p>
1075
<p><em>Generic character types</em></p>
1077
<p>Another use of backslash is for specifying generic character types. The
1078
following are always recognized:</p>
1081
<tag>\\d</tag> <item>any decimal digit</item>
1082
<tag>\\D</tag> <item>any character that is not a decimal digit</item>
1083
<tag>\\h</tag> <item>any horizontal whitespace character</item>
1084
<tag>\\H</tag> <item>any character that is not a horizontal whitespace character</item>
1085
<tag>\\s</tag> <item>any whitespace character</item>
1086
<tag>\\S</tag> <item>any character that is not a whitespace character</item>
1087
<tag>\\v</tag> <item>any vertical whitespace character</item>
1088
<tag>\\V</tag> <item>any character that is not a vertical whitespace character</item>
1089
<tag>\\w</tag> <item>any "word" character</item>
1090
<tag>\\W</tag> <item>any "non-word" character</item>
1093
<p>Each pair of escape sequences partitions the complete set of characters into
1094
two disjoint sets. Any given character matches one, and only one, of each pair.</p>
1096
<p>These character type sequences can appear both inside and outside character
1097
classes. They each match one character of the appropriate type. If the current
1098
matching point is at the end of the subject string, all of them fail, since
1099
there is no character to match.</p>
1101
<p>For compatibility with Perl, \\s does not match the VT character (code 11).
1102
This makes it different from the the POSIX "space" class. The \\s characters
1103
are HT (9), LF (10), FF (12), CR (13), and space (32). If "use locale;" is
1104
included in a Perl script, \\s may match the VT character. In PCRE, it never
1108
<p>In UTF-8 mode, characters with values greater than 128 never match \\d, \\s, or
1109
\\w, and always match \\D, \\S, and \\W. This is true even when Unicode
1110
character property support is available. These sequences retain their original
1111
meanings from before UTF-8 support was available, mainly for efficiency
1114
<p>The sequences \\h, \\H, \\v, and \\V are Perl 5.10 features. In contrast to the
1115
other sequences, these do match certain high-valued codepoints in UTF-8 mode.
1116
The horizontal space characters are:</p>
1118
<p> U+0009 Horizontal tab
1120
U+00A0 Non-break space
1121
U+1680 Ogham space mark
1122
U+180E Mongolian vowel separator
1127
U+2004 Three-per-em space
1128
U+2005 Four-per-em space
1129
U+2006 Six-per-em space
1131
U+2008 Punctuation space
1134
U+202F Narrow no-break space
1135
U+205F Medium mathematical space
1136
U+3000 Ideographic space</p>
1138
<p>The vertical space characters are:</p>
1143
U+000D Carriage return
1145
U+2028 Line separator
1146
U+2029 Paragraph separator</p>
1150
<p>A "word" character is an underscore or any character less than 256 that is a
1151
letter or digit. The definition of letters and digits is controlled by PCRE's
1152
low-valued character tables, which are always ISO-8859-1.</p>
1154
<p><em>Newline sequences</em></p>
1157
<p>Outside a character class, by default, the escape sequence \\R matches any
1158
Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \\R is
1159
equivalent to the following:</p>
1164
<p>The \\R sequence is equivalent to the following:</p>
1166
<quote><p> (?>\\r\\n|\\n|\\x0b|\\f|\\r|\\x85)</p></quote>
1168
<p>This is an example of an "atomic group", details of which are given below.</p>
1170
<p>This particular group matches either the two-character sequence CR followed by
1171
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
1172
U+000B), FF (formfeed, U+000C), CR (carriage return, U+000D), or NEL (next
1173
line, U+0085). The two-character sequence is treated as a single unit that
1174
cannot be split.</p>
1176
<p>In UTF-8 mode, two additional characters whose codepoints are greater than 255
1177
are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029).
1178
Unicode character property support is not needed for these characters to be
1183
<p>It is possible to restrict \\R to match only CR, LF, or CRLF (instead of the
1184
complete set of Unicode line endings) by setting the option <c>bsr_anycrlf</c>
1185
either at compile time or when the pattern is matched. (BSR is an abbrevation
1186
for "backslash R".) This can be made the default when PCRE is built; if this is
1187
the case, the other behaviour can be requested via the <c>bsr_unicode</c> option.
1188
It is also possible to specify these settings by starting a pattern string with
1189
one of the following sequences:</p>
1191
<p> (*BSR_ANYCRLF) CR, LF, or CRLF only
1192
(*BSR_UNICODE) any Unicode newline sequence</p>
1194
<p>These override the default and the options given to <c>re:compile/2</c>, but
1195
they can be overridden by options given to <c>re:run/3</c>. Note that these
1196
special settings, which are not Perl-compatible, are recognized only at the
1197
very start of a pattern, and that they must be in upper case. If more than one
1198
of them is present, the last one is used. They can be combined with a change of
1199
newline convention, for example, a pattern can start with:</p>
1201
<p> (*ANY)(*BSR_ANYCRLF)</p>
1204
<p>Inside a character class, \\R matches the letter "R".</p>
1208
<p><em>Unicode character properties</em></p>
1210
<p>When PCRE is built with Unicode character property support, three additional
1211
escape sequences that match characters with specific properties are available.
1212
When not in UTF-8 mode, these sequences are of course limited to testing
1213
characters whose codepoints are less than 256, but they do work in this mode.
1214
The extra escape sequences are:</p>
1216
<p> \\p{<em>xx</em>} a character with the <em>xx</em> property
1217
\\P{<em>xx</em>} a character without the <em>xx</em> property
1218
\\X an extended Unicode sequence</p>
1220
<p>The property names represented by <em>xx</em> above are limited to the Unicode
1221
script names, the general category properties, and "Any", which matches any
1222
character (including newline). Other properties such as "InMusicalSymbols" are
1223
not currently supported by PCRE. Note that \\P{Any} does not match any
1224
characters, so always causes a match failure.</p>
1226
<p>Sets of Unicode characters are defined as belonging to certain scripts. A
1227
character from one of these sets can be matched using a script name. For
1233
<p>Those that are not part of an identified script are lumped together as
1234
"Common". The current list of scripts is:</p>
1244
Canadian_Aboriginal,
1303
<p>Each character has exactly one general category property, specified by a
1304
two-letter abbreviation. For compatibility with Perl, negation can be specified
1305
by including a circumflex between the opening brace and the property name. For
1306
example, \\p{^Lu} is the same as \\P{Lu}.</p>
1308
<p>If only one letter is specified with \\p or \\P, it includes all the general
1309
category properties that start with that letter. In this case, in the absence
1310
of negation, the curly brackets in the escape sequence are optional; these two
1311
examples have the same effect:</p>
1316
<p>The following general category property codes are supported:</p>
1326
Ll Lower case letter
1329
Lt Title case letter
1330
Lu Upper case letter</p>
1335
Mn Non-spacing mark</p>
1343
Pc Connector punctuation
1345
Pe Close punctuation
1346
Pf Final punctuation
1347
Pi Initial punctuation
1348
Po Other punctuation
1349
Ps Open punctuation</p>
1354
Sm Mathematical symbol
1359
Zp Paragraph separator
1360
Zs Space separator</p>
1362
<p>The special property L& is also supported: it matches a character that has
1363
the Lu, Ll, or Lt property, in other words, a letter that is not classified as
1364
a modifier or "other".</p>
1366
<p>The Cs (Surrogate) property applies only to characters in the range U+D800 to
1367
U+DFFF. Such characters are not valid in UTF-8 strings (see RFC 3629) and so
1368
cannot be tested by PCRE, unless UTF-8 validity checking has been turned off
1369
(see the discussion of <c>no_utf8_check</c> in the
1373
<p>The long synonyms for these properties that Perl supports (such as \\p{Letter})
1374
are not supported by PCRE, nor is it permitted to prefix any of these
1375
properties with "Is".</p>
1377
<p>No character that is in the Unicode table has the Cn (unassigned) property.
1378
Instead, this property is assumed for any code point that is not in the
1381
<p>Specifying caseless matching does not affect these escape sequences. For
1382
example, \\p{Lu} always matches only upper case letters.</p>
1384
<p>The \\X escape matches any number of Unicode characters that form an extended
1385
Unicode sequence. \\X is equivalent to</p>
1387
<quote><p> (?>\\PM\\pM*)</p></quote>
1389
<p>That is, it matches a character without the "mark" property, followed by zero
1390
or more characters with the "mark" property, and treats the sequence as an
1393
Characters with the "mark" property are typically accents that affect the
1394
preceding character. None of them have codepoints less than 256, so in
1395
non-UTF-8 mode \\X matches any one character.</p>
1397
<p>Matching characters by Unicode property is not fast, because PCRE has to search
1398
a structure that contains data for over fifteen thousand characters. That is
1399
why the traditional escape sequences such as \\d and \\w do not use Unicode
1400
properties in PCRE.</p>
1405
<p><em>Resetting the match start</em></p>
1407
<p>The escape sequence \\K, which is a Perl 5.10 feature, causes any previously
1408
matched characters not to be included in the final matched sequence. For
1409
example, the pattern:</p>
1411
<quote><p> foo\\Kbar</p></quote>
1413
<p>matches "foobar", but reports that it has matched "bar". This feature is
1414
similar to a lookbehind assertion
1415
<!-- HTML <a href="#lookbehind"> -->
1419
However, in this case, the part of the subject before the real match does not
1420
have to be of fixed length, as lookbehind assertions do. The use of \\K does
1421
not interfere with the setting of
1422
captured substrings.
1423
For example, when the pattern</p>
1425
<quote><p> (foo)\\Kbar</p></quote>
1427
<p>matches "foobar", the first substring is still set to "foo".</p>
1429
<p><em>Simple assertions</em></p>
1431
<p>The final use of backslash is for certain simple assertions. An
1432
assertion specifies a condition that has to be met at a particular
1433
point in a match, without consuming any characters from the subject
1434
string. The use of subpatterns for more complicated assertions is
1435
described below. The backslashed assertions are:</p>
1438
<tag>\\b</tag> <item>matches at a word boundary</item>
1439
<tag>\\B</tag> <item>matches when not at a word boundary</item>
1440
<tag>\\A</tag> <item>matches at the start of the subject</item>
1441
<tag>\\Z</tag> <item>matches at the end of the subject
1442
also matches before a newline at the end of
1444
<tag>\\z</tag> <item>matches only at the end of the subject</item>
1445
<tag>\\G</tag> <item>matches at the first matching position in the
1449
<p>These assertions may not appear in character classes (but note that \\b has a
1450
different meaning, namely the backspace character, inside a character class).</p>
1452
<p>A word boundary is a position in the subject string where the current character
1453
and the previous character do not both match \\w or \\W (i.e. one matches
1454
\\w and the other matches \\W), or the start or end of the string if the
1455
first or last character matches \\w, respectively.</p>
1457
<p>The \\A, \\Z, and \\z assertions differ from the traditional circumflex and
1458
dollar (described in the next section) in that they only ever match at the very
1459
start and end of the subject string, whatever options are set. Thus, they are
1460
independent of multiline mode. These three assertions are not affected by the
1461
<c>notbol</c> or <c>noteol</c> options, which affect only the behaviour of the
1462
circumflex and dollar metacharacters. However, if the <em>startoffset</em>
1463
argument of <c>re:run/3</c> is non-zero, indicating that matching is to start
1464
at a point other than the beginning of the subject, \\A can never match. The
1465
difference between \\Z and \\z is that \\Z matches before a newline at the end
1466
of the string as well as at the very end, whereas \\z matches only at the end.</p>
1468
<p>The \\G assertion is true only when the current matching position is at the
1469
start point of the match, as specified by the <em>startoffset</em> argument of
1470
<c>re:run/3</c>. It differs from \\A when the value of <em>startoffset</em> is
1471
non-zero. By calling <c>re:run/3</c> multiple times with appropriate
1472
arguments, you can mimic Perl's /g option, and it is in this kind of
1473
implementation where \\G can be useful.</p>
1475
<p>Note, however, that PCRE's interpretation of \\G, as the start of the current
1476
match, is subtly different from Perl's, which defines it as the end of the
1477
previous match. In Perl, these can be different when the previously matched
1478
string was empty. Because PCRE does just one match at a time, it cannot
1479
reproduce this behaviour.</p>
1481
<p>If all the alternatives of a pattern begin with \\G, the expression is anchored
1482
to the starting match position, and the "anchored" flag is set in the compiled
1483
regular expression.</p>
1487
<section><marker id="sect4"></marker><title>Circumflex and dollar</title>
1489
<p>Outside a character class, in the default matching mode, the circumflex
1490
character is an assertion that is true only if the current matching point is
1491
at the start of the subject string. If the <em>startoffset</em> argument of
1492
<c>re:run/3</c> is non-zero, circumflex can never match if the <c>multiline</c>
1493
option is unset. Inside a character class, circumflex has an entirely different
1494
meaning (see below).</p>
1496
<p>Circumflex need not be the first character of the pattern if a number of
1497
alternatives are involved, but it should be the first thing in each alternative
1498
in which it appears if the pattern is ever to match that branch. If all
1499
possible alternatives start with a circumflex, that is, if the pattern is
1500
constrained to match only at the start of the subject, it is said to be an
1501
"anchored" pattern. (There are also other constructs that can cause a pattern
1502
to be anchored.)</p>
1504
<p>A dollar character is an assertion that is true only if the current matching
1505
point is at the end of the subject string, or immediately before a newline
1506
at the end of the string (by default). Dollar need not be the last character of
1507
the pattern if a number of alternatives are involved, but it should be the last
1508
item in any branch in which it appears. Dollar has no special meaning in a
1509
character class.</p>
1511
<p>The meaning of dollar can be changed so that it matches only at the
1512
very end of the string, by setting the <c>dollar_endonly</c> option at
1513
compile time. This does not affect the \\Z assertion.</p>
1515
<p>The meanings of the circumflex and dollar characters are changed if the
1516
<c>multiline</c> option is set. When this is the case, a circumflex matches
1517
immediately after internal newlines as well as at the start of the subject
1518
string. It does not match after a newline that ends the string. A dollar
1519
matches before any newlines in the string, as well as at the very end, when
1520
<c>multiline</c> is set. When newline is specified as the two-character
1521
sequence CRLF, isolated CR and LF characters do not indicate newlines.</p>
1523
<p>For example, the pattern /^abc$/ matches the subject string
1524
"def\\nabc" (where \\n represents a newline) in multiline mode, but
1525
not otherwise. Consequently, patterns that are anchored in single line
1526
mode because all branches start with ^ are not anchored in multiline
1527
mode, and a match for circumflex is possible when the
1528
<em>startoffset</em> argument of <c>re:run/3</c> is non-zero. The
1529
<c>dollar_endonly</c> option is ignored if <c>multiline</c> is set.</p>
1531
<p>Note that the sequences \\A, \\Z, and \\z can be used to match the start and
1532
end of the subject in both modes, and if all branches of a pattern start with
1533
\\A it is always anchored, whether or not <c>multiline</c> is set.</p>
1538
<section><marker id="sect5"></marker><title>Full stop (period, dot)</title>
1540
<p>Outside a character class, a dot in the pattern matches any one character in
1541
the subject string except (by default) a character that signifies the end of a
1544
In UTF-8 mode, the matched character may be more than one byte long.
1548
<p>When a line ending is defined as a single character, dot never matches that
1549
character; when the two-character sequence CRLF is used, dot does not match CR
1550
if it is immediately followed by LF, but otherwise it matches all characters
1551
(including isolated CRs and LFs).
1553
When any Unicode line endings are being
1554
recognized, dot does not match CR or LF or any of the other line ending
1559
<p>The behaviour of dot with regard to newlines can be changed. If
1560
the <c>dotall</c> option is set, a dot matches any one character,
1561
without exception. If the two-character sequence CRLF is present in
1562
the subject string, it takes two dots to match it.</p>
1564
<p>The handling of dot is entirely independent of the handling of
1565
circumflex and dollar, the only relationship being that they both
1566
involve newlines. Dot has no special meaning in a character class.</p>
1571
<section><marker id="sect6"><title>Matching a single byte</title></marker>
1573
<p>Outside a character class, the escape sequence \\C matches any one byte, both
1574
in and out of UTF-8 mode. Unlike a dot, it always matches any line-ending
1575
characters. The feature is provided in Perl in order to match individual bytes
1576
in UTF-8 mode. Because it breaks up UTF-8 characters into individual bytes,
1577
what remains in the string may be a malformed UTF-8 string. For this reason,
1578
the \\C escape sequence is best avoided.</p>
1580
<p>PCRE does not allow \\C to appear in lookbehind assertions (described below),
1582
because in UTF-8 mode this would make it impossible to calculate the length of
1588
<section><marker id="sect7"></marker><title>Square brackets and character classes</title>
1590
<p>An opening square bracket introduces a character class, terminated
1591
by a closing square bracket. A closing square bracket on its own is
1592
not special. If a closing square bracket is required as a member of
1593
the class, it should be the first data character in the class (after
1594
an initial circumflex, if present) or escaped with a backslash.</p>
1596
<p>A character class matches a single character in the subject.
1599
UTF-8 mode, the character may occupy more than one byte.
1602
character must be in the set of characters defined by the class,
1603
unless the first character in the class definition is a circumflex, in
1604
which case the subject character must not be in the set defined by the
1605
class. If a circumflex is actually required as a member of the class,
1606
ensure it is not the first character, or escape it with a
1609
<p>For example, the character class [aeiou] matches any lower case vowel, while
1610
[^aeiou] matches any character that is not a lower case vowel. Note that a
1611
circumflex is just a convenient notation for specifying the characters that
1612
are in the class by enumerating those that are not. A class that starts with a
1613
circumflex is not an assertion: it still consumes a character from the subject
1614
string, and therefore it fails if the current pointer is at the end of the
1618
<p>In UTF-8 mode, characters with values greater than 255 can be included in a
1619
class as a literal string of bytes, or by using the \\x{ escaping mechanism.</p>
1622
<p>When caseless matching is set, any letters in a class represent both their
1623
upper case and lower case versions, so for example, a caseless [aeiou] matches
1624
"A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a
1625
caseful version would.
1627
In UTF-8 mode, PCRE always understands the concept of
1628
case for characters whose values are less than 128, so caseless matching is
1629
always possible. For characters with higher values, the concept of case is
1630
supported if PCRE is compiled with Unicode property support, but not otherwise.
1631
If you want to use caseless matching for characters 128 and above, you must
1632
ensure that PCRE is compiled with Unicode property support as well as with
1636
Caselessness is always in regard to the ISO-8859-1 character set in Erlang.
1639
<p>Characters that might indicate line breaks are never treated in any
1640
special way when matching character classes, whatever line-ending
1641
sequence is in use, and whatever setting of the <c>dotall</c> and
1642
<c>multiline</c> options is used. A class such as [^a] always matches
1643
one of these characters.</p>
1645
<p>The minus (hyphen) character can be used to specify a range of
1646
characters in a character class. For example, [d-m] matches any letter
1647
between d and m, inclusive. If a minus character is required in a
1648
class, it must be escaped with a backslash or appear in a position
1649
where it cannot be interpreted as indicating a range, typically as the
1650
first or last character in the class.</p>
1652
<p>It is not possible to have the literal character "]" as the end
1653
character of a range. A pattern such as [W-]46] is interpreted as a
1654
class of two characters ("W" and "-") followed by a literal string
1655
"46]", so it would match "W46]" or "-46]". However, if the "]" is
1656
escaped with a backslash it is interpreted as the end of range, so
1657
[W-\\]46] is interpreted as a class containing a range followed by two
1658
other characters. The octal or hexadecimal representation of "]" can
1659
also be used to end a range.</p>
1661
<p>Ranges operate in the collating sequence of character values. They can also be
1662
used for characters specified numerically, for example [\\000-\\037].
1665
mode, ranges can include characters whose values are greater than 255, for
1666
example [\\x{100}-\\x{2ff}].
1670
<p>If a range that includes letters is used when caseless matching is set, it
1671
matches the letters in either case. For example, [W-c] is equivalent to
1672
[][\\\\^_`wxyzabc], matched caselessly
1674
, and in non-UTF-8 mode, if character
1675
tables for a French locale are in use, [\\xc8-\\xcb] matches accented E
1676
characters in both cases. In UTF-8 mode, PCRE supports the concept of case for
1677
characters with values greater than 128 only when it is compiled with Unicode
1682
<p>The character types \\d, \\D, \\p, \\P, \\s, \\S, \\w, and \\W may
1683
also appear in a character class, and add the characters that they
1684
match to the class. For example, [\\dABCDEF] matches any hexadecimal
1685
digit. A circumflex can conveniently be used with the upper case
1686
character types to specify a more restricted set of characters than
1687
the matching lower case type. For example, the class [^\\W_] matches
1688
any letter or digit, but not underscore.</p>
1690
<p>The only metacharacters that are recognized in character classes
1691
are backslash, hyphen (only where it can be interpreted as specifying
1692
a range), circumflex (only at the start), opening square bracket (only
1693
when it can be interpreted as introducing a POSIX class name - see the
1694
next section), and the terminating closing square bracket. However,
1695
escaping other non-alphanumeric characters does no harm.</p>
1699
<section><marker id="sect8"></marker><title>Posix character classes</title>
1701
<p>Perl supports the POSIX notation for character classes. This uses names
1702
enclosed by [: and :] within the enclosing square brackets. PCRE also supports
1703
this notation. For example,</p>
1705
<quote><p> [01[:alpha:]%]</p></quote>
1707
<p>matches "0", "1", any alphabetic character, or "%". The supported class names
1711
<tag>alnum</tag> <item>letters and digits</item>
1712
<tag>alpha</tag> <item>letters</item>
1713
<tag>ascii</tag> <item>character codes 0 - 127</item>
1714
<tag>blank</tag> <item>space or tab only</item>
1715
<tag>cntrl</tag> <item>control characters</item>
1716
<tag>digit</tag> <item>decimal digits (same as \\d)</item>
1717
<tag>graph</tag> <item>printing characters, excluding space</item>
1718
<tag>lower</tag> <item>lower case letters</item>
1719
<tag>print</tag> <item>printing characters, including space</item>
1720
<tag>punct</tag> <item>printing characters, excluding letters and digits</item>
1721
<tag>space</tag> <item>whitespace (not quite the same as \\s)</item>
1722
<tag>upper</tag> <item>upper case letters</item>
1723
<tag>word</tag> <item>"word" characters (same as \\w)</item>
1724
<tag>xdigit</tag> <item>hexadecimal digits</item>
1727
<p>The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
1728
space (32). Notice that this list includes the VT character (code 11). This
1729
makes "space" different to \\s, which does not include VT (for Perl
1732
<p>The name "word" is a Perl extension, and "blank" is a GNU extension
1733
from Perl 5.8. Another Perl extension is negation, which is indicated
1734
by a ^ character after the colon. For example,</p>
1736
<quote><p> [12[:^digit:]]</p></quote>
1738
<p>matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX
1739
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
1740
supported, and an error is given if they are encountered.</p>
1743
<p>In UTF-8 mode, characters with values greater than 128 do not match any of
1744
the POSIX character classes.</p>
1750
<section><marker id="sect9"></marker><title>Vertical bar</title>
1752
<p>Vertical bar characters are used to separate alternative
1753
patterns. For example, the pattern</p>
1755
<quote><p> gilbert|sullivan</p></quote>
1757
<p>matches either "gilbert" or "sullivan". Any number of alternatives
1758
may appear, and an empty alternative is permitted (matching the empty
1759
string). The matching process tries each alternative in turn, from
1760
left to right, and the first one that succeeds is used. If the
1761
alternatives are within a subpattern (defined below), "succeeds" means
1762
matching the rest of the main pattern as well as the alternative in
1767
<section><marker id="sect10"></marker><title>Internal option setting</title>
1769
<p>The settings of the <c>caseless</c>, <c>multiline</c>, <c>dotall</c>, and
1770
<c>extended</c> options (which are Perl-compatible) can be changed from within
1771
the pattern by a sequence of Perl option letters enclosed between "(?" and ")".
1772
The option letters are</p>
1775
<tag>i</tag> <item>for <c>caseless</c></item>
1776
<tag>m</tag> <item>for <c>multiline</c></item>
1777
<tag>s</tag> <item>for <c>dotall</c></item>
1778
<tag>x</tag> <item>for <c>extended</c></item>
1781
<p>For example, (?im) sets caseless, multiline matching. It is also possible to
1782
unset these options by preceding the letter with a hyphen, and a combined
1783
setting and unsetting such as (?im-sx), which sets <c>caseless</c> and
1784
<c>multiline</c> while unsetting <c>dotall</c> and <c>extended</c>, is also
1785
permitted. If a letter appears both before and after the hyphen, the option is
1788
<p>The PCRE-specific options <c>dupnames</c>, <c>ungreedy</c>, and
1789
<c>extra</c> can be changed in the same way as the Perl-compatible
1790
options by using the characters J, U and X respectively.</p>
1792
<p>When an option change occurs at top level (that is, not inside subpattern
1793
parentheses), the change applies to the remainder of the pattern that follows.
1794
If the change is placed right at the start of a pattern, PCRE extracts it into
1796
<!-- XXX C Interface
1797
(and it will therefore show up in data extracted by the
1798
<c>pcre_fullinfo()</c> function).
1802
<p>An option change within a subpattern (see below for a description
1803
of subpatterns) affects only that part of the current pattern that
1806
<quote><p> (a(?i)b)c</p></quote>
1808
<p>matches abc and aBc and no other strings (assuming <c>caseless</c>
1809
is not used). By this means, options can be made to have different
1810
settings in different parts of the pattern. Any changes made in one
1811
alternative do carry on into subsequent branches within the same
1812
subpattern. For example,</p>
1814
<quote><p> (a(?i)b|c)</p></quote>
1816
<p>matches "ab", "aB", "c", and "C", even though when matching "C" the first
1817
branch is abandoned before the option setting. This is because the effects of
1818
option settings happen at compile time. There would be some very weird
1819
behaviour otherwise.</p>
1821
<p><em>Note:</em> There are other PCRE-specific options that can be set by the
1822
application when the compile or match functions are called. In some cases the
1823
pattern can contain special leading sequences to override what the application
1824
has set or what has been defaulted. Details are given in the section entitled
1825
"Newline sequences" above.</p>
1830
<section><marker id="sect11"></marker><title>Subpatterns</title>
1832
<p>Subpatterns are delimited by parentheses (round brackets), which
1833
can be nested. Turning part of a pattern into a subpattern does two
1836
<p>1. It localizes a set of alternatives. For example, the pattern</p>
1838
<quote><p> cat(aract|erpillar|)</p></quote>
1840
<p>matches one of the words "cat", "cataract", or "caterpillar". Without the
1841
parentheses, it would match "cataract", "erpillar" or an empty string.</p>
1843
<p>2. It sets up the subpattern as a capturing subpattern. This means that, when
1844
the complete pattern matches, that portion of the subject string that matched the
1845
subpattern is passed back to the caller via the return value of
1846
<c>re:run/3</c>. Opening parentheses are counted from left to right (starting
1847
from 1) to obtain numbers for the capturing subpatterns.</p>
1849
<p>For example, if the string "the red king" is matched against the pattern</p>
1851
<quote><p> the ((red|white) (king|queen))</p></quote>
1853
<p>the captured substrings are "red king", "red", and "king", and are numbered 1,
1854
2, and 3, respectively.</p>
1856
<p>The fact that plain parentheses fulfil two functions is not always helpful.
1857
There are often times when a grouping subpattern is required without a
1858
capturing requirement. If an opening parenthesis is followed by a question mark
1859
and a colon, the subpattern does not do any capturing, and is not counted when
1860
computing the number of any subsequent capturing subpatterns. For example, if
1861
the string "the white queen" is matched against the pattern</p>
1863
<quote><p> the ((?:red|white) (king|queen))</p></quote>
1865
<p>the captured substrings are "white queen" and "queen", and are numbered 1 and
1866
2. The maximum number of capturing subpatterns is 65535.</p>
1868
<p>As a convenient shorthand, if any option settings are required at the start of
1869
a non-capturing subpattern, the option letters may appear between the "?" and
1870
the ":". Thus the two patterns</p>
1873
<item>(?i:saturday|sunday)</item>
1874
<item>(?:(?i)saturday|sunday)</item>
1877
<p>match exactly the same set of strings. Because alternative branches are tried
1878
from left to right, and options are not reset until the end of the subpattern
1879
is reached, an option setting in one branch does affect subsequent branches, so
1880
the above patterns match "SUNDAY" as well as "Saturday".</p>
1885
<section><marker id="sect12"></marker><title>Duplicate subpattern numbers</title>
1887
<p>Perl 5.10 introduced a feature whereby each alternative in a subpattern uses
1888
the same numbers for its capturing parentheses. Such a subpattern starts with
1889
(?| and is itself a non-capturing subpattern. For example, consider this
1892
<quote><p> (?|(Sat)ur|(Sun))day</p></quote>
1894
<p>Because the two alternatives are inside a (?| group, both sets of capturing
1895
parentheses are numbered one. Thus, when the pattern matches, you can look
1896
at captured substring number one, whichever alternative matched. This construct
1897
is useful when you want to capture part, but not all, of one of a number of
1898
alternatives. Inside a (?| group, parentheses are numbered as usual, but the
1899
number is reset at the start of each branch. The numbers of any capturing
1900
buffers that follow the subpattern start after the highest number used in any
1901
branch. The following example is taken from the Perl documentation.
1902
The numbers underneath show in which buffer the captured content will be
1906
# before ---------------branch-reset----------- after
1907
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
1908
# 1 2 2 3 2 3 4</code>
1910
<p>A backreference or a recursive call to a numbered subpattern always
1911
refers to the first one in the pattern with the given number.</p>
1913
<p>An alternative approach to using this "branch reset" feature is to use
1914
duplicate named subpatterns, as described in the next section.</p>
1918
<section><marker id="sect13"></marker><title>Named subpatterns</title>
1920
<p>Identifying capturing parentheses by number is simple, but it can be very hard
1921
to keep track of the numbers in complicated regular expressions. Furthermore,
1922
if an expression is modified, the numbers may change. To help with this
1923
difficulty, PCRE supports the naming of subpatterns. This feature was not
1924
added to Perl until release 5.10. Python had the feature earlier, and PCRE
1925
introduced it at release 4.0, using the Python syntax. PCRE now supports both
1926
the Perl and the Python syntax.</p>
1928
<p>In PCRE, a subpattern can be named in one of three ways:
1929
(?<name>...) or (?'name'...) as in Perl, or (?P<name>...)
1930
as in Python. References to capturing parentheses from other parts of
1931
the pattern, such as backreferences, recursion, and conditions, can be
1932
made by name as well as by number.</p>
1934
<p>Names consist of up to 32 alphanumeric characters and underscores. Named
1935
capturing parentheses are still allocated numbers as well as names, exactly as
1936
if the names were not present.
1937
<!-- XXX C Interface
1938
The PCRE API provides function calls for
1939
extracting the name-to-number translation table from a compiled pattern. There
1940
is also a convenience function for extracting a captured substring by name.
1942
The <c>capture</c> specification to <c>re:run/3</c> can use named values if they are present in the regular expression.
1945
<p>By default, a name must be unique within a pattern, but it is possible to relax
1946
this constraint by setting the <c>dupnames</c> option at compile time. This can
1947
be useful for patterns where only one instance of the named parentheses can
1948
match. Suppose you want to match the name of a weekday, either as a 3-letter
1949
abbreviation or as the full name, and in both cases you want to extract the
1950
abbreviation. This pattern (ignoring the line breaks) does the job:</p>
1953
(?<DN>Mon|Fri|Sun)(?:day)?|
1954
(?<DN>Tue)(?:sday)?|
1955
(?<DN>Wed)(?:nesday)?|
1956
(?<DN>Thu)(?:rsday)?|
1957
(?<DN>Sat)(?:urday)?</code>
1959
<p>There are five capturing substrings, but only one is ever set after a match.
1960
(An alternative way of solving this problem is to use a "branch reset"
1961
subpattern, as described in the previous section.)</p>
1963
<!-- XXX C Interface
1965
<p>The convenience function for extracting the data by name returns the substring
1966
for the first (and in this example, the only) subpattern of that name that
1967
matched. This saves searching to find which numbered subpattern it was. If you
1968
make a reference to a non-unique named subpattern from elsewhere in the
1969
pattern, the one that corresponds to the lowest number is used. For further
1970
details of the interfaces for handling named subpatterns, see the
1976
<p>In case of capturing named subpatterns which are not unique, the first occurence is returned from <c>re:exec/3</c>, if the name is specified int the <c>values</c> part of the <c>capture</c> statement.</p>
1980
<section><marker id="sect14"></marker><title>Repetition</title>
1982
<p>Repetition is specified by quantifiers, which can follow any of the
1983
following items:</p>
1986
<item>a literal data character</item>
1987
<item>the dot metacharacter</item>
1988
<item>the \\C escape sequence</item>
1989
<item>the \\X escape sequence
1991
(in UTF-8 mode with Unicode properties)
1994
<item>the \\R escape sequence</item>
1995
<item>an escape such as \\d that matches a single character</item>
1996
<item>a character class</item>
1997
<item>a back reference (see next section)</item>
1998
<item>a parenthesized subpattern (unless it is an assertion)</item>
2001
<p>The general repetition quantifier specifies a minimum and maximum number of
2002
permitted matches, by giving the two numbers in curly brackets (braces),
2003
separated by a comma. The numbers must be less than 65536, and the first must
2004
be less than or equal to the second. For example:</p>
2006
<quote><p> z{2,4}</p></quote>
2008
<p>matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special
2009
character. If the second number is omitted, but the comma is present, there is
2010
no upper limit; if the second number and the comma are both omitted, the
2011
quantifier specifies an exact number of required matches. Thus</p>
2013
<quote><p> [aeiou]{3,}</p></quote>
2015
<p>matches at least 3 successive vowels, but may match many more, while</p>
2017
<quote><p> \\d{8}</p></quote>
2019
<p>matches exactly 8 digits. An opening curly bracket that appears in a position
2020
where a quantifier is not allowed, or one that does not match the syntax of a
2021
quantifier, is taken as a literal character. For example, {,6} is not a
2022
quantifier, but a literal string of four characters.</p>
2025
<p>In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to individual
2026
bytes. Thus, for example, \\x{100}{2} matches two UTF-8 characters, each of
2027
which is represented by a two-byte sequence. Similarly, when Unicode property
2028
support is available, \\X{3} matches three Unicode extended sequences, each of
2029
which may be several bytes long (and they may be of different lengths).</p>
2032
<p>The quantifier {0} is permitted, causing the expression to behave as if the
2033
previous item and the quantifier were not present.</p>
2035
<p>For convenience, the three most common quantifiers have single-character
2039
<tag>*</tag> <item>is equivalent to {0,}</item>
2040
<tag>+</tag> <item>is equivalent to {1,}</item>
2041
<tag>?</tag> <item>is equivalent to {0,1}</item>
2044
<p>It is possible to construct infinite loops by following a
2045
subpattern that can match no characters with a quantifier that has no
2046
upper limit, for example:</p>
2048
<quote><p> (a?)*</p></quote>
2050
<p>Earlier versions of Perl and PCRE used to give an error at compile time for
2051
such patterns. However, because there are cases where this can be useful, such
2052
patterns are now accepted, but if any repetition of the subpattern does in fact
2053
match no characters, the loop is forcibly broken.</p>
2055
<p>By default, the quantifiers are "greedy", that is, they match as much as
2056
possible (up to the maximum number of permitted times), without causing the
2057
rest of the pattern to fail. The classic example of where this gives problems
2058
is in trying to match comments in C programs. These appear between /* and */
2059
and within the comment, individual * and / characters may appear. An attempt to
2060
match C comments by applying the pattern</p>
2062
<quote><p> /\\*.*\\*/</p></quote>
2064
<p>to the string</p>
2066
<quote><p> /* first comment */ not comment /* second comment */</p></quote>
2068
<p>fails, because it matches the entire string owing to the greediness of the .*
2071
<p>However, if a quantifier is followed by a question mark, it ceases to be
2072
greedy, and instead matches the minimum number of times possible, so the
2075
<quote><p> /\\*.*?\\*/</p></quote>
2077
<p>does the right thing with the C comments. The meaning of the various
2078
quantifiers is not otherwise changed, just the preferred number of matches.
2079
Do not confuse this use of question mark with its use as a quantifier in its
2080
own right. Because it has two uses, it can sometimes appear doubled, as in</p>
2082
<quote><p> \\d??\\d</p></quote>
2084
<p>which matches one digit by preference, but can match two if that is the only
2085
way the rest of the pattern matches.</p>
2087
<p>If the <c>ungreedy</c> option is set (an option that is not available in Perl),
2088
the quantifiers are not greedy by default, but individual ones can be made
2089
greedy by following them with a question mark. In other words, it inverts the
2090
default behaviour.</p>
2092
<p>When a parenthesized subpattern is quantified with a minimum repeat count that
2093
is greater than 1 or with a limited maximum, more memory is required for the
2094
compiled pattern, in proportion to the size of the minimum or maximum.</p>
2096
<p>If a pattern starts with .* or .{0,} and the <c>dotall</c> option (equivalent
2097
to Perl's /s) is set, thus allowing the dot to match newlines, the pattern is
2098
implicitly anchored, because whatever follows will be tried against every
2099
character position in the subject string, so there is no point in retrying the
2100
overall match at any position after the first. PCRE normally treats such a
2101
pattern as though it were preceded by \\A.</p>
2103
<p>In cases where it is known that the subject string contains no newlines, it is
2104
worth setting <c>dotall</c> in order to obtain this optimization, or
2105
alternatively using ^ to indicate anchoring explicitly.</p>
2107
<p>However, there is one situation where the optimization cannot be used. When .*
2108
is inside capturing parentheses that are the subject of a backreference
2109
elsewhere in the pattern, a match at the start may fail where a later one
2110
succeeds. Consider, for example:</p>
2112
<quote><p> (.*)abc\\1</p></quote>
2114
<p>If the subject is "xyz123abc123" the match point is the fourth character. For
2115
this reason, such a pattern is not implicitly anchored.</p>
2117
<p>When a capturing subpattern is repeated, the value captured is the substring
2118
that matched the final iteration. For example, after</p>
2120
<quote><p> (tweedle[dume]{3}\\s*)+</p></quote>
2122
<p>has matched "tweedledum tweedledee" the value of the captured substring is
2123
"tweedledee". However, if there are nested capturing subpatterns, the
2124
corresponding captured values may have been set in previous iterations. For
2127
<quote><p> /(a|(b))+/</p></quote>
2129
<p>matches "aba" the value of the second captured substring is "b".</p>
2134
<section><marker id="sect15"></marker><title>Atomic grouping and possessive quantifiers</title>
2136
<p>With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
2137
repetition, failure of what follows normally causes the repeated item to be
2138
re-evaluated to see if a different number of repeats allows the rest of the
2139
pattern to match. Sometimes it is useful to prevent this, either to change the
2140
nature of the match, or to cause it fail earlier than it otherwise might, when
2141
the author of the pattern knows there is no point in carrying on.</p>
2143
<p>Consider, for example, the pattern \\d+foo when applied to the subject line</p>
2145
<quote><p> 123456bar</p></quote>
2147
<p>After matching all 6 digits and then failing to match "foo", the normal
2148
action of the matcher is to try again with only 5 digits matching the \\d+
2149
item, and then with 4, and so on, before ultimately failing. "Atomic grouping"
2150
(a term taken from Jeffrey Friedl's book) provides the means for specifying
2151
that once a subpattern has matched, it is not to be re-evaluated in this way.</p>
2153
<p>If we use atomic grouping for the previous example, the matcher gives up
2154
immediately on failing to match "foo" the first time. The notation is a kind of
2155
special parenthesis, starting with (?> as in this example:</p>
2157
<quote><p> (?>\\d+)foo</p></quote>
2159
<p>This kind of parenthesis "locks up" the part of the pattern it contains once
2160
it has matched, and a failure further into the pattern is prevented from
2161
backtracking into it. Backtracking past it to previous items, however, works as
2164
<p>An alternative description is that a subpattern of this type matches the string
2165
of characters that an identical standalone pattern would match, if anchored at
2166
the current point in the subject string.</p>
2168
<p>Atomic grouping subpatterns are not capturing subpatterns. Simple cases such as
2169
the above example can be thought of as a maximizing repeat that must swallow
2170
everything it can. So, while both \\d+ and \\d+? are prepared to adjust the
2171
number of digits they match in order to make the rest of the pattern match,
2172
(?>\\d+) can only match an entire sequence of digits.</p>
2174
<p>Atomic groups in general can of course contain arbitrarily complicated
2175
subpatterns, and can be nested. However, when the subpattern for an atomic
2176
group is just a single repeated item, as in the example above, a simpler
2177
notation, called a "possessive quantifier" can be used. This consists of an
2178
additional + character following a quantifier. Using this notation, the
2179
previous example can be rewritten as</p>
2181
<quote><p> \\d++foo</p></quote>
2183
<p>Note that a possessive quantifier can be used with an entire group, for
2186
<quote><p> (abc|xyz){2,3}+</p></quote>
2188
<p>Possessive quantifiers are always greedy; the setting of the <c>ungreedy</c>
2189
option is ignored. They are a convenient notation for the simpler forms of
2190
atomic group. However, there is no difference in the meaning of a possessive
2191
quantifier and the equivalent atomic group, though there may be a performance
2192
difference; possessive quantifiers should be slightly faster.</p>
2194
<p>The possessive quantifier syntax is an extension to the Perl 5.8 syntax.
2195
Jeffrey Friedl originated the idea (and the name) in the first edition of his
2196
book. Mike McCloskey liked it, so implemented it when he built Sun's Java
2197
package, and PCRE copied it from there. It ultimately found its way into Perl
2198
at release 5.10.</p>
2200
<p>PCRE has an optimization that automatically "possessifies" certain simple
2201
pattern constructs. For example, the sequence A+B is treated as A++B because
2202
there is no point in backtracking into a sequence of A's when B must follow.</p>
2204
<p>When a pattern contains an unlimited repeat inside a subpattern that can itself
2205
be repeated an unlimited number of times, the use of an atomic group is the
2206
only way to avoid some failing matches taking a very long time indeed. The
2209
<quote><p> (\\D+|<\\d+>)*[!?]</p></quote>
2211
<p>matches an unlimited number of substrings that either consist of non-digits, or
2212
digits enclosed in <>, followed by either ! or ?. When it matches, it runs
2213
quickly. However, if it is applied to</p>
2215
<quote><p> aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</p></quote>
2217
<p>it takes a long time before reporting failure. This is because the string can
2218
be divided between the internal \\D+ repeat and the external * repeat in a
2219
large number of ways, and all have to be tried. (The example uses [!?] rather
2220
than a single character at the end, because both PCRE and Perl have an
2221
optimization that allows for fast failure when a single character is used. They
2222
remember the last single character that is required for a match, and fail early
2223
if it is not present in the string.) If the pattern is changed so that it uses
2224
an atomic group, like this:</p>
2226
<quote><p> ((?>\\D+)|<\\d+>)*[!?]</p></quote>
2228
<p>sequences of non-digits cannot be broken, and failure happens quickly.</p>
2232
<section><marker id="sect16"></marker><title>Back references</title>
2234
<p>Outside a character class, a backslash followed by a digit greater than 0 (and
2235
possibly further digits) is a back reference to a capturing subpattern earlier
2236
(that is, to its left) in the pattern, provided there have been that many
2237
previous capturing left parentheses.</p>
2239
<p>However, if the decimal number following the backslash is less than 10, it is
2240
always taken as a back reference, and causes an error only if there are not
2241
that many capturing left parentheses in the entire pattern. In other words, the
2242
parentheses that are referenced need not be to the left of the reference for
2243
numbers less than 10. A "forward back reference" of this type can make sense
2244
when a repetition is involved and the subpattern to the right has participated
2245
in an earlier iteration.</p>
2247
<p>It is not possible to have a numerical "forward back reference" to
2248
a subpattern whose number is 10 or more using this syntax because a
2249
sequence such as \\50 is interpreted as a character defined in
2250
octal. See the subsection entitled "Non-printing characters" above for
2251
further details of the handling of digits following a backslash. There
2252
is no such problem when named parentheses are used. A back reference
2253
to any subpattern is possible using named parentheses (see below).</p>
2255
<p>Another way of avoiding the ambiguity inherent in the use of digits
2256
following a backslash is to use the \\g escape sequence, which is a
2257
feature introduced in Perl 5.10. This escape must be followed by an
2258
unsigned number or a negative number, optionally enclosed in
2259
braces. These examples are all identical:</p>
2262
<item>(ring), \\1</item>
2263
<item>(ring), \\g1</item>
2264
<item>(ring), \\g{1}</item>
2267
<p>An unsigned number specifies an absolute reference without the
2268
ambiguity that is present in the older syntax. It is also useful when
2269
literal digits follow the reference. A negative number is a relative
2270
reference. Consider this example:</p>
2272
<quote><p> (abc(def)ghi)\\g{-1}</p></quote>
2274
<p>The sequence \\g{-1} is a reference to the most recently started capturing
2275
subpattern before \\g, that is, is it equivalent to \\2. Similarly, \\g{-2}
2276
would be equivalent to \\1. The use of relative references can be helpful in
2277
long patterns, and also in patterns that are created by joining together
2278
fragments that contain references within themselves.</p>
2280
<p>A back reference matches whatever actually matched the capturing
2281
subpattern in the current subject string, rather than anything
2282
matching the subpattern itself (see "Subpatterns as subroutines" below
2283
for a way of doing that). So the pattern</p>
2285
<quote><p> (sens|respons)e and \\1ibility</p></quote>
2287
<p>matches "sense and sensibility" and "response and responsibility", but not
2288
"sense and responsibility". If caseful matching is in force at the time of the
2289
back reference, the case of letters is relevant. For example,</p>
2291
<quote><p> ((?i)rah)\\s+\\1</p></quote>
2293
<p>matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
2294
capturing subpattern is matched caselessly.</p>
2296
<p>There are several different ways of writing back references to named
2297
subpatterns. The .NET syntax \\k{name} and the Perl syntax \\k<name> or
2298
\\k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's unified
2299
back reference syntax, in which \\g can be used for both numeric and named
2300
references, is also supported. We could rewrite the above example in any of
2301
the following ways:</p>
2304
<item>(?<p1>(?i)rah)\\s+\\k<p1></item>
2305
<item>(?'p1'(?i)rah)\\s+\\k{p1}</item>
2306
<item>(?P<p1>(?i)rah)\\s+(?P=p1)</item>
2307
<item>(?<p1>(?i)rah)\\s+\\g{p1}</item>
2310
<p>A subpattern that is referenced by name may appear in the pattern before or
2311
after the reference.</p>
2313
<p>There may be more than one back reference to the same subpattern. If a
2314
subpattern has not actually been used in a particular match, any back
2315
references to it always fail. For example, the pattern</p>
2317
<quote><p> (a|(bc))\\2</p></quote>
2319
<p>always fails if it starts to match "a" rather than "bc". Because
2320
there may be many capturing parentheses in a pattern, all digits
2321
following the backslash are taken as part of a potential back
2322
reference number. If the pattern continues with a digit character,
2323
some delimiter must be used to terminate the back reference. If the
2324
<c>extended</c> option is set, this can be whitespace. Otherwise an
2325
empty comment (see "Comments" below) can be used.</p>
2327
<p>A back reference that occurs inside the parentheses to which it refers fails
2328
when the subpattern is first used, so, for example, (a\\1) never matches.
2329
However, such references can be useful inside repeated subpatterns. For
2330
example, the pattern</p>
2332
<quote><p> (a|b\\1)+</p></quote>
2334
<p>matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
2335
the subpattern, the back reference matches the character string corresponding
2336
to the previous iteration. In order for this to work, the pattern must be such
2337
that the first iteration does not need to match the back reference. This can be
2338
done using alternation, as in the example above, or by a quantifier with a
2339
minimum of zero.</p>
2343
<section><marker id="sect17"></marker><title>Assertions</title>
2345
<p>An assertion is a test on the characters following or preceding the current
2346
matching point that does not actually consume any characters. The simple
2347
assertions coded as \\b, \\B, \\A, \\G, \\Z, \\z, ^ and $ are described
2351
<p>More complicated assertions are coded as subpatterns. There are two kinds:
2352
those that look ahead of the current position in the subject string, and those
2353
that look behind it. An assertion subpattern is matched in the normal way,
2354
except that it does not cause the current matching position to be changed.</p>
2356
<p>Assertion subpatterns are not capturing subpatterns, and may not be repeated,
2357
because it makes no sense to assert the same thing several times. If any kind
2358
of assertion contains capturing subpatterns within it, these are counted for
2359
the purposes of numbering the capturing subpatterns in the whole pattern.
2360
However, substring capturing is carried out only for positive assertions,
2361
because it does not make sense for negative assertions.</p>
2363
<p><em>Lookahead assertions</em></p>
2365
<p>Lookahead assertions start with (?= for positive assertions and (?! for
2366
negative assertions. For example,</p>
2368
<quote><p> \\w+(?=;)</p></quote>
2370
<p>matches a word followed by a semicolon, but does not include the semicolon in
2373
<quote><p> foo(?!bar)</p></quote>
2375
<p>matches any occurrence of "foo" that is not followed by "bar". Note that the
2376
apparently similar pattern</p>
2378
<quote><p> (?!foo)bar</p></quote>
2380
<p>does not find an occurrence of "bar" that is preceded by something other than
2381
"foo"; it finds any occurrence of "bar" whatsoever, because the assertion
2382
(?!foo) is always true when the next three characters are "bar". A
2383
lookbehind assertion is needed to achieve the other effect.</p>
2385
<p>If you want to force a matching failure at some point in a pattern, the most
2386
convenient way to do it is with (?!) because an empty string always matches, so
2387
an assertion that requires there not to be an empty string must always fail.</p>
2390
<p><em>Lookbehind assertions</em></p>
2392
<p>Lookbehind assertions start with (?<= for positive assertions and (?<! for
2393
negative assertions. For example,</p>
2395
<quote><p> (?<!foo)bar</p></quote>
2397
<p>does find an occurrence of "bar" that is not preceded by "foo". The contents of
2398
a lookbehind assertion are restricted such that all the strings it matches must
2399
have a fixed length. However, if there are several top-level alternatives, they
2400
do not all have to have the same fixed length. Thus</p>
2402
<quote><p> (?<=bullock|donkey)</p></quote>
2404
<p>is permitted, but</p>
2406
<quote><p> (?<!dogs?|cats?)</p></quote>
2408
<p>causes an error at compile time. Branches that match different length strings
2409
are permitted only at the top level of a lookbehind assertion. This is an
2410
extension compared with Perl (at least for 5.8), which requires all branches to
2411
match the same length of string. An assertion such as</p>
2413
<quote><p> (?<=ab(c|de))</p></quote>
2415
<p>is not permitted, because its single top-level branch can match two different
2416
lengths, but it is acceptable if rewritten to use two top-level branches:</p>
2418
<quote><p> (?<=abc|abde)</p></quote>
2420
<p>In some cases, the Perl 5.10 escape sequence \\K (see above) can be
2421
used instead of a lookbehind assertion; this is not restricted to a
2424
<p>The implementation of lookbehind assertions is, for each alternative, to
2425
temporarily move the current position back by the fixed length and then try to
2426
match. If there are insufficient characters before the current position, the
2427
assertion fails.</p>
2430
<p>PCRE does not allow the \\C escape (which matches a single byte in UTF-8 mode)
2431
to appear in lookbehind assertions, because it makes it impossible to calculate
2432
the length of the lookbehind. The \\X and \\R escapes, which can match
2433
different numbers of bytes, are also not permitted.</p>
2436
<p>Possessive quantifiers can be used in conjunction with lookbehind assertions to
2437
specify efficient matching at the end of the subject string. Consider a simple
2440
<quote><p> abcd$</p></quote>
2442
<p>when applied to a long string that does not match. Because matching proceeds
2443
from left to right, PCRE will look for each "a" in the subject and then see if
2444
what follows matches the rest of the pattern. If the pattern is specified as</p>
2446
<quote><p> ^.*abcd$</p></quote>
2448
<p>the initial .* matches the entire string at first, but when this fails (because
2449
there is no following "a"), it backtracks to match all but the last character,
2450
then all but the last two characters, and so on. Once again the search for "a"
2451
covers the entire string, from right to left, so we are no better off. However,
2452
if the pattern is written as</p>
2454
<quote><p> ^.*+(?<=abcd)</p></quote>
2456
<p>there can be no backtracking for the .*+ item; it can match only the entire
2457
string. The subsequent lookbehind assertion does a single test on the last four
2458
characters. If it fails, the match fails immediately. For long strings, this
2459
approach makes a significant difference to the processing time.</p>
2461
<p><em>Using multiple assertions</em></p>
2463
<p>Several assertions (of any sort) may occur in succession. For example,</p>
2465
<quote><p> (?<=\\d{3})(?<!999)foo</p></quote>
2467
<p>matches "foo" preceded by three digits that are not "999". Notice
2468
that each of the assertions is applied independently at the same point
2469
in the subject string. First there is a check that the previous three
2470
characters are all digits, and then there is a check that the same
2471
three characters are not "999". This pattern does <em>not</em> match
2472
"foo" preceded by six characters, the first of which are digits and
2473
the last three of which are not "999". For example, it doesn't match
2474
"123abcfoo". A pattern to do that is</p>
2476
<quote><p> (?<=\\d{3}...)(?<!999)foo</p></quote>
2478
<p>This time the first assertion looks at the preceding six
2479
characters, checking that the first three are digits, and then the
2480
second assertion checks that the preceding three characters are not
2483
<p>Assertions can be nested in any combination. For example,</p>
2485
<quote><p> (?<=(?<!foo)bar)baz</p></quote>
2487
<p>matches an occurrence of "baz" that is preceded by "bar" which in
2488
turn is not preceded by "foo", while</p>
2490
<quote><p> (?<=\\d{3}(?!999)...)foo</p></quote>
2492
<p>is another pattern that matches "foo" preceded by three digits and any three
2493
characters that are not "999".</p>
2497
<section><marker id="sect18"></marker><title>Conditional subpatterns</title>
2499
<p>It is possible to cause the matching process to obey a subpattern
2500
conditionally or to choose between two alternative subpatterns, depending on
2501
the result of an assertion, or whether a previous capturing subpattern matched
2502
or not. The two possible forms of conditional subpattern are</p>
2505
<item>(?(condition)yes-pattern)</item>
2506
<item>(?(condition)yes-pattern|no-pattern)</item>
2509
<p>If the condition is satisfied, the yes-pattern is used; otherwise the
2510
no-pattern (if present) is used. If there are more than two alternatives in the
2511
subpattern, a compile-time error occurs.</p>
2513
<p>There are four kinds of condition: references to subpatterns, references to
2514
recursion, a pseudo-condition called DEFINE, and assertions.</p>
2517
<p><em>Checking for a used subpattern by number</em></p>
2519
<p>If the text between the parentheses consists of a sequence of
2520
digits, the condition is true if the capturing subpattern of that
2521
number has previously matched. An alternative notation is to precede
2522
the digits with a plus or minus sign. In this case, the subpattern
2523
number is relative rather than absolute. The most recently opened
2524
parentheses can be referenced by (?(-1), the next most recent by
2525
(?(-2), and so on. In looping constructs it can also make sense to
2526
refer to subsequent groups with constructs such as (?(+2).</p>
2528
<p>Consider the following pattern, which contains non-significant
2529
whitespace to make it more readable (assume the <c>extended</c>
2530
option) and to divide it into three parts for ease of discussion:</p>
2532
<quote><p> ( \\( )? [^()]+ (?(1) \\) )</p></quote>
2534
<p>The first part matches an optional opening parenthesis, and if that
2535
character is present, sets it as the first captured substring. The second part
2536
matches one or more characters that are not parentheses. The third part is a
2537
conditional subpattern that tests whether the first set of parentheses matched
2538
or not. If they did, that is, if subject started with an opening parenthesis,
2539
the condition is true, and so the yes-pattern is executed and a closing
2540
parenthesis is required. Otherwise, since no-pattern is not present, the
2541
subpattern matches nothing. In other words, this pattern matches a sequence of
2542
non-parentheses, optionally enclosed in parentheses.</p>
2544
<p>If you were embedding this pattern in a larger one, you could use a relative
2547
<quote><p> ...other stuff... ( \\( )? [^()]+ (?(-1) \\) ) ...</p></quote>
2549
<p>This makes the fragment independent of the parentheses in the larger pattern.</p>
2551
<p><em>Checking for a used subpattern by name</em></p>
2553
<p>Perl uses the syntax (?(<name>)...) or (?('name')...) to test
2554
for a used subpattern by name. For compatibility with earlier versions
2555
of PCRE, which had this facility before Perl, the syntax (?(name)...)
2556
is also recognized. However, there is a possible ambiguity with this
2557
syntax, because subpattern names may consist entirely of digits. PCRE
2558
looks first for a named subpattern; if it cannot find one and the name
2559
consists entirely of digits, PCRE looks for a subpattern of that
2560
number, which must be greater than zero. Using subpattern names that
2561
consist entirely of digits is not recommended.</p>
2563
<p>Rewriting the above example to use a named subpattern gives this:</p>
2565
<quote><p> (?<OPEN> \\( )? [^()]+ (?(<OPEN>) \\) )</p></quote>
2567
<p><em>Checking for pattern recursion</em></p>
2569
<p>If the condition is the string (R), and there is no subpattern with
2570
the name R, the condition is true if a recursive call to the whole
2571
pattern or any subpattern has been made. If digits or a name preceded
2572
by ampersand follow the letter R, for example:</p>
2574
<quote><p> (?(R3)...) or (?(R&name)...)</p></quote>
2576
<p>the condition is true if the most recent recursion is into the
2577
subpattern whose number or name is given. This condition does not
2578
check the entire recursion stack.</p>
2580
<p>At "top level", all these recursion test conditions are false. Recursive
2581
patterns are described below.</p>
2583
<p><em>Defining subpatterns for use by reference only</em></p>
2585
<p>If the condition is the string (DEFINE), and there is no subpattern with the
2586
name DEFINE, the condition is always false. In this case, there may be only one
2587
alternative in the subpattern. It is always skipped if control reaches this
2588
point in the pattern; the idea of DEFINE is that it can be used to define
2589
"subroutines" that can be referenced from elsewhere. (The use of "subroutines"
2590
is described below.) For example, a pattern to match an IPv4 address could be
2591
written like this (ignore whitespace and line breaks):</p>
2593
<quote><p> (?(DEFINE) (?<byte> 2[0-4]\\d | 25[0-5] | 1\\d\\d | [1-9]?\\d) )
2594
\\b (?&byte) (\\.(?&byte)){3} \\b</p></quote>
2596
<p>The first part of the pattern is a DEFINE group inside which a another group
2597
named "byte" is defined. This matches an individual component of an IPv4
2598
address (a number less than 256). When matching takes place, this part of the
2599
pattern is skipped because DEFINE acts like a false condition.</p>
2601
<p>The rest of the pattern uses references to the named group to match the four
2602
dot-separated components of an IPv4 address, insisting on a word boundary at
2605
<p><em>Assertion conditions</em></p>
2607
<p>If the condition is not in any of the above formats, it must be an
2608
assertion. This may be a positive or negative lookahead or lookbehind
2609
assertion. Consider this pattern, again containing non-significant
2610
whitespace, and with the two alternatives on the second line:</p>
2614
\\d{2}-[a-z]{3}-\\d{2} | \\d{2}-\\d{2}-\\d{2} )</code>
2616
<p>The condition is a positive lookahead assertion that matches an optional
2617
sequence of non-letters followed by a letter. In other words, it tests for the
2618
presence of at least one letter in the subject. If a letter is found, the
2619
subject is matched against the first alternative; otherwise it is matched
2620
against the second. This pattern matches strings in one of the two forms
2621
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.</p>
2626
<section><marker id="sect19"></marker><title>Comments</title>
2628
<p>The sequence (?# marks the start of a comment that continues up to the next
2629
closing parenthesis. Nested parentheses are not permitted. The characters
2630
that make up a comment play no part in the pattern matching at all.</p>
2632
<p>If the <c>extended</c> option is set, an unescaped # character outside a
2633
character class introduces a comment that continues to immediately after the
2634
next newline in the pattern.</p>
2639
<section><marker id="sect20"></marker><title>Recursive patterns</title>
2641
<p>Consider the problem of matching a string in parentheses, allowing for
2642
unlimited nested parentheses. Without the use of recursion, the best that can
2643
be done is to use a pattern that matches up to some fixed depth of nesting. It
2644
is not possible to handle an arbitrary nesting depth.</p>
2646
<p>For some time, Perl has provided a facility that allows regular
2647
expressions to recurse (amongst other things). It does this by
2648
interpolating Perl code in the expression at run time, and the code
2649
can refer to the expression itself. A Perl pattern using code
2650
interpolation to solve the parentheses problem can be created like
2653
<quote><p> $re = qr{\\( (?: (?>[^()]+) | (?p{$re}) )* \\)}x;</p></quote>
2655
<p>The (?p{...}) item interpolates Perl code at run time, and in this
2656
case refers recursively to the pattern in which it appears.</p>
2658
<p>Obviously, PCRE cannot support the interpolation of Perl code. Instead, it
2659
supports special syntax for recursion of the entire pattern, and also for
2660
individual subpattern recursion. After its introduction in PCRE and Python,
2661
this kind of recursion was introduced into Perl at release 5.10.</p>
2663
<p>A special item that consists of (? followed by a number greater
2664
than zero and a closing parenthesis is a recursive call of the
2665
subpattern of the given number, provided that it occurs inside that
2666
subpattern. (If not, it is a "subroutine" call, which is described in
2667
the next section.) The special item (?R) or (?0) is a recursive call
2668
of the entire regular expression.</p>
2670
<p>In PCRE (like Python, but unlike Perl), a recursive subpattern call
2671
is always treated as an atomic group. That is, once it has matched
2672
some of the subject string, it is never re-entered, even if it
2673
contains untried alternatives and there is a subsequent matching
2676
<p>This PCRE pattern solves the nested parentheses problem (assume the
2677
<c>extended</c> option is set so that whitespace is ignored):</p>
2679
<quote><p> \\( ( (?>[^()]+) | (?R) )* \\)</p></quote>
2681
<p>First it matches an opening parenthesis. Then it matches any number
2682
of substrings which can either be a sequence of non-parentheses, or a
2683
recursive match of the pattern itself (that is, a correctly
2684
parenthesized substring). Finally there is a closing parenthesis.</p>
2686
<p>If this were part of a larger pattern, you would not want to
2687
recurse the entire pattern, so instead you could use this:</p>
2689
<quote><p> ( \\( ( (?>[^()]+) | (?1) )* \\) )</p></quote>
2691
<p>We have put the pattern into parentheses, and caused the recursion
2692
to refer to them instead of the whole pattern.</p>
2694
<p>In a larger pattern, keeping track of parenthesis numbers can be
2695
tricky. This is made easier by the use of relative references. (A Perl
2696
5.10 feature.) Instead of (?1) in the pattern above you can write
2697
(?-2) to refer to the second most recently opened parentheses
2698
preceding the recursion. In other words, a negative number counts
2699
capturing parentheses leftwards from the point at which it is
2702
<p>It is also possible to refer to subsequently opened parentheses, by
2703
writing references such as (?+2). However, these cannot be recursive
2704
because the reference is not inside the parentheses that are
2705
referenced. They are always "subroutine" calls, as described in the
2708
<p>An alternative approach is to use named parentheses instead. The
2709
Perl syntax for this is (?&name); PCRE's earlier syntax
2710
(?P>name) is also supported. We could rewrite the above example as
2713
<quote><p> (?<pn> \\( ( (?>[^()]+) | (?&pn) )* \\) )</p></quote>
2715
<p>If there is more than one subpattern with the same name, the earliest one is
2718
<p>This particular example pattern that we have been looking at contains nested
2719
unlimited repeats, and so the use of atomic grouping for matching strings of
2720
non-parentheses is important when applying the pattern to strings that do not
2721
match. For example, when this pattern is applied to</p>
2723
<quote><p> (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()</p></quote>
2725
<p>it yields "no match" quickly. However, if atomic grouping is not used,
2726
the match runs for a very long time indeed because there are so many different
2727
ways the + and * repeats can carve up the subject, and all have to be tested
2728
before failure can be reported.</p>
2730
<p>At the end of a match, the values set for any capturing subpatterns are those
2731
from the outermost level of the recursion at which the subpattern value is set.
2733
<!-- XXX C Interface
2734
If you want to obtain intermediate values, a callout function can be used (see
2737
<em>pcrecallout</em>
2741
If the pattern above is matched against</p>
2743
<quote><p> (ab(cd)ef)</p></quote>
2745
<p>the value for the capturing parentheses is "ef", which is the last value taken
2746
on at the top level. If additional parentheses are added, giving</p>
2749
\\( ( ( (?>[^()]+) | (?R) )* ) \\)
2753
<p>the string they capture is "ab(cd)ef", the contents of the top level
2755
<!-- XXX C interface
2756
If there are more than 15 capturing parentheses in a pattern, PCRE
2757
has to obtain extra memory to store data during a recursion, which it does by
2758
using <em>pcre_malloc</em>, freeing it via <em>pcre_free</em> afterwards. If no
2759
memory can be obtained, the match fails with the <c>error_nomemory</c> error.</p>
2763
<p>Do not confuse the (?R) item with the condition (R), which tests
2764
for recursion. Consider this pattern, which matches text in angle
2765
brackets, allowing for arbitrary nesting. Only digits are allowed in
2766
nested brackets (that is, when recursing), whereas any characters are
2767
permitted at the outer level.</p>
2769
<quote><p> < (?: (?(R) \\d++ | [^<>]*+) | (?R)) * ></p></quote>
2771
<p>In this pattern, (?(R) is the start of a conditional subpattern,
2772
with two different alternatives for the recursive and non-recursive
2773
cases. The (?R) item is the actual recursive call.</p>
2777
<section><marker id="sect21"></marker><title>Subpatterns as subroutines</title>
2779
<p>If the syntax for a recursive subpattern reference (either by number or by
2780
name) is used outside the parentheses to which it refers, it operates like a
2781
subroutine in a programming language. The "called" subpattern may be defined
2782
before or after the reference. A numbered reference can be absolute or
2783
relative, as in these examples:</p>
2786
<item>(...(absolute)...)...(?2)...</item>
2787
<item>(...(relative)...)...(?-1)...</item>
2788
<item>(...(?+1)...(relative)...</item>
2791
<p>An earlier example pointed out that the pattern</p>
2793
<quote><p> (sens|respons)e and \\1ibility</p></quote>
2795
<p>matches "sense and sensibility" and "response and responsibility", but not
2796
"sense and responsibility". If instead the pattern</p>
2798
<quote><p> (sens|respons)e and (?1)ibility</p></quote>
2800
<p>is used, it does match "sense and responsibility" as well as the other two
2801
strings. Another example is given in the discussion of DEFINE above.</p>
2803
<p>Like recursive subpatterns, a "subroutine" call is always treated
2804
as an atomic group. That is, once it has matched some of the subject
2805
string, it is never re-entered, even if it contains untried
2806
alternatives and there is a subsequent matching failure.</p>
2808
<p>When a subpattern is used as a subroutine, processing options such as
2809
case-independence are fixed when the subpattern is defined. They cannot be
2810
changed for different calls. For example, consider this pattern:</p>
2812
<quote><p> (abc)(?i:(?-1))</p></quote>
2814
<p>It matches "abcabc". It does not match "abcABC" because the change of
2815
processing option does not affect the called subpattern.</p>
2820
<!-- XXX C interface
2822
<section> <marker id="sect22"><title>Callouts</title></marker>
2824
<p>Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
2825
code to be obeyed in the middle of matching a regular expression. This makes it
2826
possible, amongst other things, to extract different substrings that match the
2827
same pair of parentheses when there is a repetition.</p>
2829
<p>PCRE provides a similar feature, but of course it cannot obey arbitrary Perl
2830
code. The feature is called "callout". The caller of PCRE provides an external
2831
function by putting its entry point in the global variable <em>pcre_callout</em>.
2832
By default, this variable contains NULL, which disables all calling out.</p>
2834
<p>Within a regular expression, (?C) indicates the points at which the external
2835
function is to be called. If you want to identify different callout points, you
2836
can put a number less than 256 after the letter C. The default value is zero.
2837
For example, this pattern has two callout points:</p>
2839
<quote><p> (?C1)abc(?C2)def</p></quote>
2842
<p>If the <c>AUTO_CALLOUT</c> flag is passed to <c>re:compile/2</c>, callouts are
2843
automatically installed before each item in the pattern. They are all numbered
2846
<p>During matching, when PCRE reaches a callout point (and <em>pcre_callout</em> is
2847
set), the external function is called. It is provided with the number of the
2848
callout, the position in the pattern, and, optionally, one item of data
2849
originally supplied by the caller of <c>re:run/3</c>. The callout function
2850
may cause matching to proceed, to backtrack, or to fail altogether. A complete
2851
description of the interface to the callout function is given in the
2852
<em>pcrecallout</em>
2859
<section><marker id="sect23"></marker><title>Backtracking control</title>
2861
<p>Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which
2862
are described in the Perl documentation as "experimental and subject to change
2863
or removal in a future version of Perl". It goes on to say: "Their usage in
2864
production code should be noted to avoid problems during upgrades." The same
2865
remarks apply to the PCRE features described in this section.</p>
2867
<!-- XXX C interface
2868
<p>Since these verbs are specifically related to backtracking, they can be used
2869
only when the pattern is to be matched using <c>re:run/3</c>, which uses a
2870
backtracking algorithm. They cause an error if encountered by
2871
<c>pcre_dfa_exec()</c>.</p>
2874
<p>The new verbs make use of what was previously invalid syntax: an opening
2875
parenthesis followed by an asterisk. In Perl, they are generally of the form
2876
(*VERB:ARG) but PCRE does not support the use of arguments, so its general
2877
form is just (*VERB). Any number of these verbs may occur in a pattern. There
2881
<p><em>Verbs that act immediately</em></p>
2883
<p>The following verbs act as soon as they are encountered:</p>
2885
<quote><p> (*ACCEPT)</p></quote>
2887
<p>This verb causes the match to end successfully, skipping the remainder of the
2888
pattern. When inside a recursion, only the innermost pattern is ended
2889
immediately. PCRE differs from Perl in what happens if the (*ACCEPT) is inside
2890
capturing parentheses. In Perl, the data so far is captured: in PCRE no data is
2891
captured. For example:</p>
2893
<quote><p> A(A|B(*ACCEPT)|C)D</p></quote>
2895
<p>This matches "AB", "AAD", or "ACD", but when it matches "AB", no data is
2898
<quote><p> (*FAIL) or (*F)</p></quote>
2900
<p>This verb causes the match to fail, forcing backtracking to occur. It is
2901
equivalent to (?!) but easier to read. The Perl documentation notes that it is
2902
probably useful only when combined with (?{}) or (??{}). Those are, of course,
2903
Perl features that are not present in PCRE. The nearest equivalent is the
2904
callout feature, as for example in this pattern:</p>
2906
<quote><p> a+(?C)(*FAIL)</p></quote>
2908
<p>A match with the string "aaaa" always fails, but the callout is taken before
2909
each backtrack happens (in this example, 10 times).</p>
2913
<p><em>Verbs that act after backtracking</em></p>
2915
<p>The following verbs do nothing when they are encountered. Matching continues
2916
with what follows, but if there is no subsequent match, a failure is forced.
2917
The verbs differ in exactly what kind of failure occurs.</p>
2919
<quote><p> (*COMMIT)</p></quote>
2921
<p>This verb causes the whole match to fail outright if the rest of the pattern
2922
does not match. Even if the pattern is unanchored, no further attempts to find
2923
a match by advancing the start point take place. Once (*COMMIT) has been
2924
passed, <c>re:run/3</c> is committed to finding a match at the current
2925
starting point, or not at all. For example:</p>
2927
<quote><p> a+(*COMMIT)b</p></quote>
2929
<p>This matches "xxaab" but not "aacaab". It can be thought of as a kind of
2930
dynamic anchor, or "I've started, so I must finish."</p>
2932
<quote><p> (*PRUNE)</p></quote>
2934
<p>This verb causes the match to fail at the current position if the rest of the
2935
pattern does not match. If the pattern is unanchored, the normal "bumpalong"
2936
advance to the next starting character then happens. Backtracking can occur as
2937
usual to the left of (*PRUNE), or when matching to the right of (*PRUNE), but
2938
if there is no match to the right, backtracking cannot cross (*PRUNE).
2939
In simple cases, the use of (*PRUNE) is just an alternative to an atomic
2940
group or possessive quantifier, but there are some uses of (*PRUNE) that cannot
2941
be expressed in any other way.</p>
2943
<quote><p> (*SKIP)</p></quote>
2945
<p>This verb is like (*PRUNE), except that if the pattern is unanchored, the
2946
"bumpalong" advance is not to the next character, but to the position in the
2947
subject where (*SKIP) was encountered. (*SKIP) signifies that whatever text
2948
was matched leading up to it cannot be part of a successful match. Consider:</p>
2950
<quote><p> a+(*SKIP)b</p></quote>
2952
<p>If the subject is "aaaac...", after the first match attempt fails (starting at
2953
the first character in the string), the starting point skips on to start the
2954
next attempt at "c". Note that a possessive quantifer does not have the same
2955
effect in this example; although it would suppress backtracking during the
2956
first match attempt, the second attempt would start at the second character
2957
instead of skipping on to "c".</p>
2959
<quote><p> (*THEN)</p></quote>
2961
<p>This verb causes a skip to the next alternation if the rest of the pattern does
2962
not match. That is, it cancels pending backtracking, but only within the
2963
current alternation. Its name comes from the observation that it can be used
2964
for a pattern-based if-then-else block:</p>
2966
<quote><p> ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...</p></quote>
2968
<p>If the COND1 pattern matches, FOO is tried (and possibly further items after
2969
the end of the group if FOO succeeds); on failure the matcher skips to the
2970
second alternative and tries COND2, without backtracking into COND1. If (*THEN)
2971
is used outside of any alternation, it acts exactly like (*PRUNE).</p>