1
This file contains the PCRE man page that describes the regular expressions
2
supported by PCRE version 4.5. Note that not all of the features are relevant
1
This file contains the PCRE man page that describes the regular expressions
2
supported by PCRE version 6.0. Note that not all of the features are relevant
3
3
in the context of Exim. In particular, the version of PCRE that is compiled
4
4
with Exim does not include UTF-8 support, there is no mechanism for changing
5
5
the options with which the PCRE functions are called, and features such as
6
6
callout are not accessible.
7
7
-----------------------------------------------------------------------------
14
12
PCRE - Perl-compatible regular expressions
16
15
PCRE REGULAR EXPRESSION DETAILS
18
17
The syntax and semantics of the regular expressions supported by PCRE
19
18
are described below. Regular expressions are also described in the Perl
20
documentation and in a number of other books, some of which have copi-
21
ous examples. Jeffrey Friedl's "Mastering Regular Expressions", pub-
22
lished by O'Reilly, covers them in great detail. The description here
23
is intended as reference documentation.
25
The basic operation of PCRE is on strings of bytes. However, there is
26
also support for UTF-8 character strings. To use this support you must
27
build PCRE to include UTF-8 support, and then call pcre_compile() with
28
the PCRE_UTF8 option. How this affects the pattern matching is men-
29
tioned in several places below. There is also a summary of UTF-8 fea-
30
tures in the section on UTF-8 support in the main pcre page.
32
A regular expression is a pattern that is matched against a subject
33
string from left to right. Most characters stand for themselves in a
34
pattern, and match the corresponding characters in the subject. As a
19
documentation and in a number of books, some of which have copious
20
examples. Jeffrey Friedl's "Mastering Regular Expressions", published
21
by O'Reilly, covers regular expressions in great detail. This descrip-
22
tion of PCRE's regular expressions is intended as reference material.
24
The original operation of PCRE was on strings of one-byte characters.
25
However, there is now also support for UTF-8 character strings. To use
26
this, you must build PCRE to include UTF-8 support, and then call
27
pcre_compile() with the PCRE_UTF8 option. How this affects pattern
28
matching is mentioned in several places below. There is also a summary
29
of UTF-8 features in the section on UTF-8 support in the main pcre
32
The remainder of this document discusses the patterns that are sup-
33
ported by PCRE when its main matching function, pcre_exec(), is used.
34
From release 6.0, PCRE offers a second matching function,
35
pcre_dfa_exec(), which matches using a different algorithm that is not
36
Perl-compatible. The advantages and disadvantages of the alternative
37
function, and how it differs from the normal function, are discussed in
38
the pcrematching page.
40
A regular expression is a pattern that is matched against a subject
41
string from left to right. Most characters stand for themselves in a
42
pattern, and match the corresponding characters in the subject. As a
35
43
trivial example, the pattern
37
45
The quick brown fox
39
matches a portion of a subject string that is identical to itself. The
40
power of regular expressions comes from the ability to include alterna-
41
tives and repetitions in the pattern. These are encoded in the pattern
42
by the use of meta-characters, which do not stand for themselves but
43
instead are interpreted in some special way.
45
There are two different sets of meta-characters: those that are recog-
47
matches a portion of a subject string that is identical to itself. When
48
caseless matching is specified (the PCRE_CASELESS option), letters are
49
matched independently of case. In UTF-8 mode, PCRE always understands
50
the concept of case for characters whose values are less than 128, so
51
caseless matching is always possible. For characters with higher val-
52
ues, the concept of case is supported if PCRE is compiled with Unicode
53
property support, but not otherwise. If you want to use caseless
54
matching for characters 128 and above, you must ensure that PCRE is
55
compiled with Unicode property support as well as with UTF-8 support.
57
The power of regular expressions comes from the ability to include
58
alternatives and repetitions in the pattern. These are encoded in the
59
pattern by the use of metacharacters, which do not stand for themselves
60
but instead are interpreted in some special way.
62
There are two different sets of metacharacters: those that are recog-
46
63
nized anywhere in the pattern except within square brackets, and those
47
64
that are recognized in square brackets. Outside square brackets, the
48
meta-characters are as follows:
65
metacharacters are as follows:
50
67
\ general escape character with several uses
51
68
^ assert start of string (or line, in multiline mode)
204
228
\W any "non-word" character
206
230
Each pair of escape sequences partitions the complete set of characters
207
into two disjoint sets. Any given character matches one, and only one,
231
into two disjoint sets. Any given character matches one, and only one,
210
In UTF-8 mode, characters with values greater than 255 never match \d,
211
\s, or \w, and always match \D, \S, and \W.
213
For compatibility with Perl, \s does not match the VT character (code
214
11). This makes it different from the POSIX "space" class. The \s
215
characters are HT (9), LF (10), FF (12), CR (13), and space (32).
217
A "word" character is any letter or digit or the underscore character,
218
that is, any character which can be part of a Perl "word". The defini-
219
tion of letters and digits is controlled by PCRE's character tables,
220
and may vary if locale-specific matching is taking place (see "Locale
221
support" in the pcreapi page). For example, in the "fr" (French)
222
locale, some character codes greater than 128 are used for accented
223
letters, and these are matched by \w.
225
234
These character type sequences can appear both inside and outside char-
226
235
acter classes. They each match one character of the appropriate type.
227
236
If the current matching point is at the end of the subject string, all
228
237
of them fail, since there is no character to match.
239
For compatibility with Perl, \s does not match the VT character (code
240
11). This makes it different from the POSIX "space" class. The \s
241
characters are HT (9), LF (10), FF (12), CR (13), and space (32).
243
A "word" character is an underscore or any character less than 256 that
244
is a letter or digit. The definition of letters and digits is con-
245
trolled by PCRE's low-valued character tables, and may vary if locale-
246
specific matching is taking place (see "Locale support" in the pcreapi
247
page). For example, in the "fr_FR" (French) locale, some character
248
codes greater than 128 are used for accented letters, and these are
251
In UTF-8 mode, characters with values greater than 128 never match \d,
252
\s, or \w, and always match \D, \S, and \W. This is true even when Uni-
253
code character property support is available.
255
Unicode character properties
257
When PCRE is built with Unicode character property support, three addi-
258
tional escape sequences to match generic character types are available
259
when UTF-8 mode is selected. They are:
261
\p{xx} a character with the xx property
262
\P{xx} a character without the xx property
263
\X an extended Unicode sequence
265
The property names represented by xx above are limited to the Unicode
266
general category properties. Each character has exactly one such prop-
267
erty, specified by a two-letter abbreviation. For compatibility with
268
Perl, negation can be specified by including a circumflex between the
269
opening brace and the property name. For example, \p{^Lu} is the same
272
If only one letter is specified with \p or \P, it includes all the
273
properties that start with that letter. In this case, in the absence of
274
negation, the curly brackets in the escape sequence are optional; these
275
two examples have the same effect:
280
The following property codes are supported:
307
Pc Connector punctuation
311
Pi Initial punctuation
318
Sm Mathematical symbol
323
Zp Paragraph separator
326
Extended properties such as "Greek" or "InMusicalSymbols" are not sup-
329
Specifying caseless matching does not affect these escape sequences.
330
For example, \p{Lu} always matches only upper case letters.
332
The \X escape matches any number of Unicode characters that form an
333
extended Unicode sequence. \X is equivalent to
337
That is, it matches a character without the "mark" property, followed
338
by zero or more characters with the "mark" property, and treats the
339
sequence as an atomic group (see below). Characters with the "mark"
340
property are typically accents that affect the preceding character.
342
Matching characters by Unicode property is not fast, because PCRE has
343
to search a structure that contains data for over fifteen thousand
344
characters. That is why the traditional escape sequences such as \d and
345
\w do not use Unicode properties in PCRE.
230
349
The fourth use of backslash is for certain simple assertions. An asser-
231
tion specifies a condition that has to be met at a particular point in
232
a match, without consuming any characters from the subject string. The
233
use of subpatterns for more complicated assertions is described below.
234
The backslashed assertions are
350
tion specifies a condition that has to be met at a particular point in
351
a match, without consuming any characters from the subject string. The
352
use of subpatterns for more complicated assertions is described below.
353
The backslashed assertions are:
236
355
\b matches at a word boundary
237
356
\B matches when not at a word boundary
371
491
For example, the character class [aeiou] matches any lower case vowel,
372
492
while [^aeiou] matches any character that is not a lower case vowel.
373
493
Note that a circumflex is just a convenient notation for specifying the
374
characters which are in the class by enumerating those that are not. It
375
is not an assertion: it still consumes a character from the subject
376
string, and fails if the current pointer is at the end of the string.
494
characters that are in the class by enumerating those that are not. A
495
class that starts with a circumflex is not an assertion: it still con-
496
sumes a character from the subject string, and therefore it fails if
497
the current pointer is at the end of the string.
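The behaviour of a negated class can be sketched as follows. The example
uses Python's re module as a stand-in for PCRE, which is an assumption made
purely for illustration: Python is not PCRE, although it treats these simple
classes in the same way.

```python
import re

lower_vowel = re.compile(r"[aeiou]")
not_lower_vowel = re.compile(r"[^aeiou]")

# A negated class matches any single character outside the listed set.
vowel_hits = [c for c in "aEx" if lower_vowel.fullmatch(c)]
non_vowel_hits = [c for c in "aEx" if not_lower_vowel.fullmatch(c)]

# It is not an assertion: it needs a character to consume, so it
# fails when the matching point is already at the end of the subject.
at_end = re.fullmatch(r"x[^aeiou]", "x")
```

Here vowel_hits holds only "a", non_vowel_hits holds "E" and "x", and at_end
is None because nothing is left for the class to consume.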
378
In UTF-8 mode, characters with values greater than 255 can be included
379
in a class as a literal string of bytes, or by using the \x{ escaping
499
In UTF-8 mode, characters with values greater than 255 can be included
500
in a class as a literal string of bytes, or by using the \x{ escaping
382
When caseless matching is set, any letters in a class represent both
383
their upper case and lower case versions, so for example, a caseless
384
[aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
385
match "A", whereas a caseful version would. PCRE does not support the
386
concept of case for characters with values greater than 255.
503
When caseless matching is set, any letters in a class represent both
504
their upper case and lower case versions, so for example, a caseless
505
[aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
506
match "A", whereas a caseful version would. In UTF-8 mode, PCRE always
507
understands the concept of case for characters whose values are less
508
than 128, so caseless matching is always possible. For characters with
509
higher values, the concept of case is supported if PCRE is compiled
510
with Unicode property support, but not otherwise. If you want to use
511
caseless matching for characters 128 and above, you must ensure that
512
PCRE is compiled with Unicode property support as well as with UTF-8
388
The newline character is never treated in any special way in character
389
classes, whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE
515
The newline character is never treated in any special way in character
516
classes, whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE
390
517
options is. A class such as [^a] will always match a newline.
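A minimal sketch of caseless classes and of newline handling inside a class,
again using Python's re module as an assumed stand-in for PCRE. The flags
re.IGNORECASE, re.DOTALL and re.MULTILINE are Python's counterparts of
PCRE_CASELESS, PCRE_DOTALL and PCRE_MULTILINE; they are not PCRE names.

```python
import re

caseless_vowel = re.compile(r"[aeiou]", re.IGNORECASE)
caseless_non_vowel = re.compile(r"[^aeiou]", re.IGNORECASE)
caseful_non_vowel = re.compile(r"[^aeiou]")

matches_upper_a = caseless_vowel.fullmatch("A") is not None
caseless_rejects_a = caseless_non_vowel.fullmatch("A") is None
caseful_accepts_a = caseful_non_vowel.fullmatch("A") is not None

# Newline is an ordinary character inside a class, whatever the
# DOTALL or MULTILINE settings are.
newline_ok = all(
    re.fullmatch(r"[^a]", "\n", flags) is not None
    for flags in (0, re.DOTALL, re.MULTILINE)
)
```

All four results are True under these assumptions.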
392
The minus (hyphen) character can be used to specify a range of charac-
393
ters in a character class. For example, [d-m] matches any letter
394
between d and m, inclusive. If a minus character is required in a
395
class, it must be escaped with a backslash or appear in a position
396
where it cannot be interpreted as indicating a range, typically as the
519
The minus (hyphen) character can be used to specify a range of charac-
520
ters in a character class. For example, [d-m] matches any letter
521
between d and m, inclusive. If a minus character is required in a
522
class, it must be escaped with a backslash or appear in a position
523
where it cannot be interpreted as indicating a range, typically as the
397
524
first or last character in the class.
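The rules for the hyphen can be tried out with a short sketch. It uses
Python's re module as an assumed stand-in for PCRE; the class syntax shown
here behaves the same way in both.

```python
import re

d_to_m = re.compile(r"[d-m]")
in_range = [c for c in "cdmn-" if d_to_m.fullmatch(c)]

# Three ways to get a literal hyphen: escaped, first, or last.
literal_hyphen_classes = [r"[a\-z]", r"[-az]", r"[az-]"]
hyphen_results = [
    [c for c in "a-mz" if re.fullmatch(cls, c)]
    for cls in literal_hyphen_classes
]
```

in_range contains "d" and "m" only, and each of the three classes accepts
exactly "a", "-" and "z".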
399
526
It is not possible to have the literal character "]" as the end charac-
400
ter of a range. A pattern such as [W-]46] is interpreted as a class of
401
two characters ("W" and "-") followed by a literal string "46]", so it
402
would match "W46]" or "-46]". However, if the "]" is escaped with a
403
backslash it is interpreted as the end of range, so [W-\]46] is inter-
404
preted as a single class containing a range followed by two separate
405
characters. The octal or hexadecimal representation of "]" can also be
527
ter of a range. A pattern such as [W-]46] is interpreted as a class of
528
two characters ("W" and "-") followed by a literal string "46]", so it
529
would match "W46]" or "-46]". However, if the "]" is escaped with a
530
backslash it is interpreted as the end of range, so [W-\]46] is inter-
531
preted as a class containing a range followed by two other characters.
532
The octal or hexadecimal representation of "]" can also be used to end
408
Ranges operate in the collating sequence of character values. They can
409
also be used for characters specified numerically, for example
410
[\000-\037]. In UTF-8 mode, ranges can include characters whose values
535
Ranges operate in the collating sequence of character values. They can
536
also be used for characters specified numerically, for example
537
[\000-\037]. In UTF-8 mode, ranges can include characters whose values
411
538
are greater than 255, for example [\x{100}-\x{2ff}].
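The range rules above can be illustrated with a small sketch in Python's re
module, used here as an assumed stand-in for PCRE. Python does not accept
the \x{...} notation, so \u0100-\u02ff is substituted for the wide range;
that spelling is Python's, not PCRE's.

```python
import re

# An unescaped "]" ends the class: [W-] followed by the literal "46]".
unescaped = re.compile(r"[W-]46]")
unescaped_hits = [
    s for s in ("W46]", "-46]", "X", "4") if unescaped.fullmatch(s)
]

# Escaped, "]" becomes the end of the range W..] plus "4" and "6".
escaped = re.compile(r"[W-\]46]")
escaped_hits = [c for c in "WXZ[]46V-" if escaped.fullmatch(c)]

# Ranges over numerically specified characters.
control = re.compile(r"[\000-\037]")
wide = re.compile(r"[\u0100-\u02ff]")
control_ok = control.fullmatch("\t") is not None
wide_ok = wide.fullmatch("\u0101") is not None
wide_rejects_ascii = wide.fullmatch("a") is None
```

The unescaped form accepts only the strings "W46]" and "-46]", while the
escaped form accepts the single characters W, X, Z, [, ], 4 and 6.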
413
540
If a range that includes letters is used when caseless matching is set,
414
541
it matches the letters in either case. For example, [W-c] is equivalent
415
to [][\^_`wxyzabc], matched caselessly, and if character tables for the
416
"fr" locale are in use, [\xc8-\xcb] matches accented E characters in
419
The character types \d, \D, \s, \S, \w, and \W may also appear in a
420
character class, and add the characters that they match to the class.
421
For example, [\dABCDEF] matches any hexadecimal digit. A circumflex can
422
conveniently be used with the upper case character types to specify a
423
more restricted set of characters than the matching lower case type.
424
For example, the class [^\W_] matches any letter or digit, but not
427
All non-alphameric characters other than \, -, ^ (at the start) and the
428
terminating ] are non-special in character classes, but it does no harm
542
to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if
543
character tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches
544
accented E characters in both cases. In UTF-8 mode, PCRE supports the
545
concept of case for characters with values greater than 128 only when
546
it is compiled with Unicode property support.
548
The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
549
in a character class, and add the characters that they match to the
550
class. For example, [\dABCDEF] matches any hexadecimal digit. A circum-
551
flex can conveniently be used with the upper case character types to
552
specify a more restricted set of characters than the matching lower
553
case type. For example, the class [^\W_] matches any letter or digit,
556
The only metacharacters that are recognized in character classes are
557
backslash, hyphen (only where it can be interpreted as specifying a
558
range), circumflex (only at the start), opening square bracket (only
559
when it can be interpreted as introducing a POSIX class name - see the
560
next section), and the terminating closing square bracket. However,
561
escaping other non-alphanumeric characters does no harm.
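A sketch of character types inside classes, using Python's re module as an
assumed stand-in for PCRE. The re.ASCII flag is a Python detail: it keeps
\d and \w to the ASCII range, which is closer to the byte-oriented behaviour
described here than Python's Unicode-aware default.

```python
import re

hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)

hex_hits = [c for c in "09AFGa" if hex_digit.fullmatch(c)]
alnum_hits = [c for c in "aZ5_-" if letter_or_digit.fullmatch(c)]

# Escaping a non-alphanumeric character that is not special is harmless.
escaped_plus_ok = re.fullmatch(r"[\.\+]", "+") is not None
```

hex_hits keeps "0", "9", "A" and "F"; alnum_hits keeps "a", "Z" and "5",
leaving out the underscore and the hyphen.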
432
564
POSIX CHARACTER CLASSES
434
Perl supports the POSIX notation for character classes, which uses
435
names enclosed by [: and :] within the enclosing square brackets. PCRE
436
also supports this notation. For example,
566
Perl supports the POSIX notation for character classes. This uses names
567
enclosed by [: and :] within the enclosing square brackets. PCRE also
568
supports this notation. For example,
832
970
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
834
972
it takes a long time before reporting failure. This is because the
835
string can be divided between the two repeats in a large number of
836
ways, and all have to be tried. (The example used [!?] rather than a
837
single character at the end, because both PCRE and Perl have an opti-
838
mization that allows for fast failure when a single character is used.
839
They remember the last single character that is required for a match,
840
and fail early if it is not present in the string.) If the pattern is
973
string can be divided between the internal \D+ repeat and the external
974
* repeat in a large number of ways, and all have to be tried. (The
975
example uses [!?] rather than a single character at the end, because
976
both PCRE and Perl have an optimization that allows for fast failure
977
when a single character is used. They remember the last single charac-
978
ter that is required for a match, and fail early if it is not present
979
in the string.) If the pattern is changed so that it uses an atomic
843
982
((?>\D+)|<\d+>)*[!?]
845
sequences of non-digits cannot be broken, and failure happens quickly.
984
sequences of non-digits cannot be broken, and failure happens quickly.
850
989
Outside a character class, a backslash followed by a digit greater than
851
990
0 (and possibly further digits) is a back reference to a capturing sub-
852
pattern earlier (that is, to its left) in the pattern, provided there
991
pattern earlier (that is, to its left) in the pattern, provided there
853
992
have been that many previous capturing left parentheses.
855
994
However, if the decimal number following the backslash is less than 10,
856
it is always taken as a back reference, and causes an error only if
857
there are not that many capturing left parentheses in the entire pat-
858
tern. In other words, the parentheses that are referenced need not be
859
to the left of the reference for numbers less than 10. See the section
860
entitled "Backslash" above for further details of the handling of dig-
861
its following a backslash.
995
it is always taken as a back reference, and causes an error only if
996
there are not that many capturing left parentheses in the entire pat-
997
tern. In other words, the parentheses that are referenced need not be
998
to the left of the reference for numbers less than 10. See the subsec-
999
tion entitled "Non-printing characters" above for further details of
1000
the handling of digits following a backslash.
863
A back reference matches whatever actually matched the capturing sub-
864
pattern in the current subject string, rather than anything matching
1002
A back reference matches whatever actually matched the capturing sub-
1003
pattern in the current subject string, rather than anything matching
865
1004
the subpattern itself (see "Subpatterns as subroutines" below for a way
866
1005
of doing that). So the pattern
868
1007
(sens|respons)e and \1ibility
870
matches "sense and sensibility" and "response and responsibility", but
871
not "sense and responsibility". If caseful matching is in force at the
872
time of the back reference, the case of letters is relevant. For exam-
1009
matches "sense and sensibility" and "response and responsibility", but
1010
not "sense and responsibility". If caseful matching is in force at the
1011
time of the back reference, the case of letters is relevant. For exam-
877
matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
1016
matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
878
1017
original capturing subpattern is matched caselessly.
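The two back reference examples can be sketched in Python's re module, used
here as an assumed stand-in for PCRE. One difference is worth stating
plainly: Python does not accept an option setting such as (?i) in the middle
of a pattern, so the sketch uses Python's scoped form (?i:...) to make only
the capturing subpattern caseless.

```python
import re

pair = re.compile(r"(sens|respons)e and \1ibility")
subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]
pair_hits = [s for s in subjects if pair.fullmatch(s)]

# The group is matched caselessly, but the reference \1 stays caseful.
rah = re.compile(r"((?i:rah))\s+\1")
rah_hits = [s for s in ("rah rah", "RAH RAH", "RAH rah") if rah.fullmatch(s)]
```

pair_hits holds the first two subjects only, and rah_hits holds "rah rah"
and "RAH RAH" but not "RAH rah".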
880
Back references to named subpatterns use the Python syntax (?P=name).
1019
Back references to named subpatterns use the Python syntax (?P=name).
881
1020
We could rewrite the above example as follows:
883
1022
(?P<p1>(?i)rah)\s+(?P=p1)
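A named back reference can be sketched in Python's re module, from which the
(?P=name) syntax is borrowed. This is an illustration under assumptions, not
PCRE itself: in Python the group is defined with (?P<name>...), and the
caseless setting has to be written in the scoped form (?i:...).

```python
import re

named = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")
named_hits = [
    s for s in ("rah rah", "RAH RAH", "RAH rah") if named.fullmatch(s)
]

# The captured text is also available by name after a match.
captured = named.fullmatch("RAH RAH").group("p1")
```

named_hits holds "rah rah" and "RAH RAH", and captured is "RAH".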
885
There may be more than one back reference to the same subpattern. If a
886
subpattern has not actually been used in a particular match, any back
1024
There may be more than one back reference to the same subpattern. If a
1025
subpattern has not actually been used in a particular match, any back
887
1026
references to it always fail. For example, the pattern
891
always fails if it starts to match "a" rather than "bc". Because there
892
may be many capturing parentheses in a pattern, all digits following
893
the backslash are taken as part of a potential back reference number.
1030
always fails if it starts to match "a" rather than "bc". Because there
1031
may be many capturing parentheses in a pattern, all digits following
1032
the backslash are taken as part of a potential back reference number.
894
1033
If the pattern continues with a digit character, some delimiter must be
895
used to terminate the back reference. If the PCRE_EXTENDED option is
896
set, this can be whitespace. Otherwise an empty comment can be used.
1034
used to terminate the back reference. If the PCRE_EXTENDED option is
1035
set, this can be whitespace. Otherwise an empty comment (see "Com-
1036
ments" below) can be used.
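Both ways of terminating a back reference before a literal digit can be
sketched in Python's re module, used as an assumed stand-in for PCRE. The
re.VERBOSE flag is Python's counterpart of PCRE_EXTENDED.

```python
import re

# Group 1, a reference to it, then the literal digit "1":
# an empty comment separates the reference from the digit.
with_comment = re.compile(r"(a)\1(?#)1")

# With extended (verbose) syntax, whitespace does the same job.
with_space = re.compile(r"(a)\1 1", re.VERBOSE)

delimiter_hits = [
    p.fullmatch("aa1") is not None for p in (with_comment, with_space)
]
```

Both patterns match the subject "aa1".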
898
1038
A back reference that occurs inside the parentheses to which it refers
899
1039
fails when the subpattern is first used, so, for example, (a\1) never
985
1139
PCRE does not allow the \C escape (which matches a single byte in UTF-8
986
1140
mode) to appear in lookbehind assertions, because it makes it impossi-
987
ble to calculate the length of the lookbehind.
1141
ble to calculate the length of the lookbehind. The \X escape, which can
1142
match different numbers of bytes, is also not permitted.
989
Atomic groups can be used in conjunction with lookbehind assertions to
1144
Atomic groups can be used in conjunction with lookbehind assertions to
990
1145
specify efficient matching at the end of the subject string. Consider a
991
1146
simple pattern such as
995
when applied to a long string that does not match. Because matching
1150
when applied to a long string that does not match. Because matching
996
1151
proceeds from left to right, PCRE will look for each "a" in the subject
997
and then see if what follows matches the rest of the pattern. If the
1152
and then see if what follows matches the rest of the pattern. If the
998
1153
pattern is specified as
1002
the initial .* matches the entire string at first, but when this fails
1157
the initial .* matches the entire string at first, but when this fails
1003
1158
(because there is no following "a"), it backtracks to match all but the
1004
last character, then all but the last two characters, and so on. Once
1005
again the search for "a" covers the entire string, from right to left,
1159
last character, then all but the last two characters, and so on. Once
1160
again the search for "a" covers the entire string, from right to left,
1006
1161
so we are no better off. However, if the pattern is written as
1008
1163
^(?>.*)(?<=abcd)
1165
or, equivalently, using the possessive quantifier syntax,
1014
there can be no backtracking for the .* item; it can match only the
1015
entire string. The subsequent lookbehind assertion does a single test
1016
on the last four characters. If it fails, the match fails immediately.
1017
For long strings, this approach makes a significant difference to the
1169
there can be no backtracking for the .* item; it can match only the
1170
entire string. The subsequent lookbehind assertion does a single test
1171
on the last four characters. If it fails, the match fails immediately.
1172
For long strings, this approach makes a significant difference to the
1018
1173
processing time.
1175
Using multiple assertions
1020
1177
Several assertions (of any sort) may occur in succession. For example,
1022
1179
(?<=\d{3})(?<!999)foo
1024
matches "foo" preceded by three digits that are not "999". Notice that
1025
each of the assertions is applied independently at the same point in
1026
the subject string. First there is a check that the previous three
1027
characters are all digits, and then there is a check that the same
1181
matches "foo" preceded by three digits that are not "999". Notice that
1182
each of the assertions is applied independently at the same point in
1183
the subject string. First there is a check that the previous three
1184
characters are all digits, and then there is a check that the same
1028
1185
three characters are not "999". This pattern does not match "foo" pre-
1029
ceded by six characters, the first of which are digits and the last
1030
three of which are not "999". For example, it doesn't match "123abc-
1186
ceded by six characters, the first of which are digits and the last
1187
three of which are not "999". For example, it doesn't match "123abc-
1031
1188
foo". A pattern to do that is
1033
1190
(?<=\d{3}...)(?<!999)foo
1035
This time the first assertion looks at the preceding six characters,
1192
This time the first assertion looks at the preceding six characters,
1036
1193
checking that the first three are digits, and then the second assertion
1037
1194
checks that the preceding three characters are not "999".
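The two patterns above can be checked with a short sketch in Python's re
module, used as an assumed stand-in for PCRE. Both libraries accept these
fixed-length lookbehind assertions.

```python
import re

three_digits_not_999 = re.compile(r"(?<=\d{3})(?<!999)foo")
first_hits = [
    s for s in ("123foo", "999foo", "123abcfoo")
    if three_digits_not_999.search(s)
]

# The first assertion now looks back six characters; the second
# still inspects only the three characters just before "foo".
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")
second_hits = [
    s for s in ("123abcfoo", "123999foo", "abc123foo")
    if six_back.search(s)
]
```

first_hits holds only "123foo" and second_hits holds only "123abcfoo".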
1041
1198
(?<=(?<!foo)bar)baz
1043
matches an occurrence of "baz" that is preceded by "bar" which in turn
1200
matches an occurrence of "baz" that is preceded by "bar" which in turn
1044
1201
is not preceded by "foo", while
1046
1203
(?<=\d{3}(?!999)...)foo
1048
is another pattern which matches "foo" preceded by three digits and any
1205
is another pattern that matches "foo" preceded by three digits and any
1049
1206
three characters that are not "999".
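Nested assertions can be sketched in the same way. Python's re module is
used as an assumed stand-in for PCRE; it accepts an assertion inside a
lookbehind as long as the lookbehind as a whole has a fixed length.

```python
import re

bar_not_after_foo = re.compile(r"(?<=(?<!foo)bar)baz")
nested_hits = [
    s for s in ("xyzbarbaz", "foobarbaz", "barbaz")
    if bar_not_after_foo.search(s)
]

digits_then_not_999 = re.compile(r"(?<=\d{3}(?!999)...)foo")
lookahead_hits = [
    s for s in ("123abcfoo", "123999foo")
    if digits_then_not_999.search(s)
]
```

nested_hits holds "xyzbarbaz" and "barbaz", and lookahead_hits holds only
"123abcfoo".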
1051
Assertion subpatterns are not capturing subpatterns, and may not be
1052
repeated, because it makes no sense to assert the same thing several
1053
times. If any kind of assertion contains capturing subpatterns within
1054
it, these are counted for the purposes of numbering the capturing sub-
1055
patterns in the whole pattern. However, substring capturing is carried
1056
out only for positive assertions, because it does not make sense for
1057
negative assertions.
1060
1209
CONDITIONAL SUBPATTERNS
1062
It is possible to cause the matching process to obey a subpattern con-
1063
ditionally or to choose between two alternative subpatterns, depending
1064
on the result of an assertion, or whether a previous capturing
1065
subpattern matched or not. The two possible forms of conditional sub-
1211
It is possible to cause the matching process to obey a subpattern con-
1212
ditionally or to choose between two alternative subpatterns, depending
1213
on the result of an assertion, or whether a previous capturing subpat-
1214
tern matched or not. The two possible forms of conditional subpattern
1068
1217
(?(condition)yes-pattern)
1069
1218
(?(condition)yes-pattern|no-pattern)
1071
If the condition is satisfied, the yes-pattern is used; otherwise the
1072
no-pattern (if present) is used. If there are more than two alterna-
1220
If the condition is satisfied, the yes-pattern is used; otherwise the
1221
no-pattern (if present) is used. If there are more than two alterna-
1073
1222
tives in the subpattern, a compile-time error occurs.
1075
1224
There are three kinds of condition. If the text between the parentheses
1076
consists of a sequence of digits, the condition is satisfied if the
1077
capturing subpattern of that number has previously matched. The number
1078
must be greater than zero. Consider the following pattern, which con-
1079
tains non-significant white space to make it more readable (assume the
1080
PCRE_EXTENDED option) and to divide it into three parts for ease of
1225
consists of a sequence of digits, the condition is satisfied if the
1226
capturing subpattern of that number has previously matched. The number
1227
must be greater than zero. Consider the following pattern, which con-
1228
tains non-significant white space to make it more readable (assume the
1229
PCRE_EXTENDED option) and to divide it into three parts for ease of
1083
1232
( \( )? [^()]+ (?(1) \) )
1085
The first part matches an optional opening parenthesis, and if that
1234
The first part matches an optional opening parenthesis, and if that
1086
1235
character is present, sets it as the first captured substring. The sec-
1087
ond part matches one or more characters that are not parentheses. The
1236
ond part matches one or more characters that are not parentheses. The
1088
1237
third part is a conditional subpattern that tests whether the first set
1089
1238
of parentheses matched or not. If they did, that is, if the subject started
1090
1239
with an opening parenthesis, the condition is true, and so the yes-pat-
1091
tern is executed and a closing parenthesis is required. Otherwise,
1092
since no-pattern is not present, the subpattern matches nothing. In
1093
other words, this pattern matches a sequence of non-parentheses,
1240
tern is executed and a closing parenthesis is required. Otherwise,
1241
since no-pattern is not present, the subpattern matches nothing. In
1242
other words, this pattern matches a sequence of non-parentheses,
1094
1243
optionally enclosed in parentheses.
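The conditional pattern above can be tried with a small sketch in Python's
re module, used as an assumed stand-in for PCRE. Python supports the
(?(1)...) form of condition, and re.VERBOSE plays the role of PCRE_EXTENDED
so that the spaces in the pattern are ignored.

```python
import re

optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

subjects = ["(abc)", "abc", "(abc", "abc)"]
balanced = [s for s in subjects if optional_parens.fullmatch(s)]
```

Only "(abc)" and "abc" match as whole strings; an opening parenthesis
without its partner, or a stray closing one, is rejected.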
1096
1245
If the condition is the string (R), it is satisfied if a recursive call
1097
to the pattern or subpattern has been made. At "top level", the condi-
1098
tion is false. This is a PCRE extension. Recursive patterns are
1246
to the pattern or subpattern has been made. At "top level", the condi-
1247
tion is false. This is a PCRE extension. Recursive patterns are
1099
1248
described in the next section.
1101
If the condition is not a sequence of digits or (R), it must be an
1102
assertion. This may be a positive or negative lookahead or lookbehind
1103
assertion. Consider this pattern, again containing non-significant
1250
If the condition is not a sequence of digits or (R), it must be an
1251
assertion. This may be a positive or negative lookahead or lookbehind
1252
assertion. Consider this pattern, again containing non-significant
1104
1253
white space, and with the two alternatives on the second line:
1106
1255
(?(?=[^a-z]*[a-z])
1107
1256
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )

The condition is a positive lookahead assertion that matches an
optional sequence of non-letters followed by a letter. In other words,
it tests for the presence of at least one letter in the subject. If a
letter is found, the subject is matched against the first alternative;
otherwise it is matched against the second. This pattern matches
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
letters and dd are digits.
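
Python's standard re module does not accept an assertion as a condition, so the sketch below is an emulation and not PCRE syntax. It relies on the fact that a lookahead condition with two alternatives selects the first alternative when the assertion succeeds and the second when it fails, which can be written as a positive lookahead on one branch and the matching negative lookahead on the other. The names are hypothetical.

```python
import re

# Emulation of (?(?=A) X | Y) as (?=A) X | (?!A) Y, where A tests for
# the presence of at least one lower-case letter in what follows.
DATE_LIKE = re.compile(
    r"""
    (?: (?=[^a-z]*[a-z]) \d{2}-[a-z]{3}-\d{2}   # a letter is ahead: dd-aaa-dd
      | (?![^a-z]*[a-z]) \d{2}-\d{2}-\d{2}      # no letter ahead:   dd-dd-dd
    )
    """,
    re.VERBOSE,
)


def is_date_like(subject: str) -> bool:
    """Return True if the whole subject has one of the two forms."""
    return DATE_LIKE.fullmatch(subject) is not None


if __name__ == "__main__":
    for text in ("12-abc-34", "12-34-56", "12-ab-34", "12-34-5a"):
        print(repr(text), is_date_like(text))
```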

COMMENTS

The sequence (?# marks the start of a comment that continues up to the
next closing parenthesis. Nested parentheses are not permitted. The
characters that make up a comment play no part in the pattern matching
at all.

If the PCRE_EXTENDED option is set, an unescaped # character outside a
character class introduces a comment that continues up to the next
newline character in the pattern.
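
Both comment forms can be illustrated with Python's standard re module, used here only as a stand-in: it accepts (?#...) comments, and its re.VERBOSE flag is assumed to play the role that PCRE_EXTENDED plays in PCRE. The variable names are hypothetical.

```python
import re

# Inline comments: everything from "(?#" to the next ")" is ignored.
inline = re.compile(r"\d{2}(?# day)-[a-z]{3}(?# month)-\d{2}(?# year)")

# Extended mode: an unescaped "#" outside a character class starts a
# comment that runs to the end of the line.
extended = re.compile(
    r"""
    \d{2}       # day
    -[a-z]{3}   # month
    -\d{2}      # year
    """,
    re.VERBOSE,
)

# Inside a character class, "#" is an ordinary character, even in
# extended mode; only the trailing "# ..." here is a comment.
hash_number = re.compile(r"[#] \d+  # a literal hash, then digits", re.VERBOSE)

if __name__ == "__main__":
    print(bool(inline.fullmatch("12-abc-34")))
    print(bool(extended.fullmatch("12-abc-34")))
    print(bool(hash_number.fullmatch("#42")))
```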

RECURSIVE PATTERNS

Consider the problem of matching a string in parentheses, allowing for
unlimited nested parentheses. Without the use of recursion, the best
that can be done is to use a pattern that matches up to some fixed
depth of nesting. It is not possible to handle an arbitrary nesting
depth. Perl provides a facility that allows regular expressions to
recurse (amongst other things). It does this by interpolating Perl code
in the expression at run time, and the code can refer to the expression
itself. A Perl pattern to solve the parentheses problem can be created
like this:

  $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

The (?p{...}) item interpolates Perl code at run time, and in this case
refers recursively to the pattern in which it appears. Obviously, PCRE
cannot support the interpolation of Perl code. Instead, it supports
some special syntax for recursion of the entire pattern, and also for
individual subpattern recursion.

The special item that consists of (? followed by a number greater than
zero and a closing parenthesis is a recursive call of the subpattern of
the given number, provided that it occurs inside that subpattern. (If
not, it is a "subroutine" call, which is described in the next
section.) The special item (?R) is a recursive call of the entire
regular expression.

For example, this PCRE pattern solves the nested parentheses problem
(assume the PCRE_EXTENDED option is set so that white space is
ignored):

  \( ( (?>[^()]+) | (?R) )* \)

First it matches an opening parenthesis. Then it matches any number of
substrings, each of which can be either a sequence of non-parentheses
or a recursive match of the pattern itself (that is, a correctly
parenthesized substring). Finally there is a closing parenthesis.
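
Python's standard re module has no recursion item, so the following is only a hand-written sketch of the three steps just described, not PCRE's implementation. The function name and its return convention (end offset, or -1 on failure) are assumptions made for the illustration.

```python
def match_parens(subject: str, pos: int = 0) -> int:
    """Match one balanced parenthesized group starting at pos.

    Returns the offset just past the closing parenthesis, or -1 if no
    correctly parenthesized substring starts at pos.
    """
    # Step 1: an opening parenthesis.
    if pos >= len(subject) or subject[pos] != "(":
        return -1
    pos += 1
    # Step 2: any number of substrings, each either a run of
    # non-parentheses or a recursive match of the whole construct.
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            # Step 3: the closing parenthesis.
            return pos + 1
        if ch == "(":
            end = match_parens(subject, pos)
            if end == -1:
                return -1
            pos = end
        else:
            # Consume the whole run at once and never give any of it
            # back, which is the role the atomic group plays in the
            # pattern.
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
    return -1  # ran out of subject before the closing parenthesis


if __name__ == "__main__":
    print(match_parens("(ab(cd)ef)"))
    print(match_parens("(a(b)"))
```

Because each run of non-parentheses is consumed in one step, an unbalanced subject is rejected after a single pass over it.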

If this were part of a larger pattern, you would not want to recurse
the entire pattern, so instead you could use this:

  ( \( ( (?>[^()]+) | (?1) )* \) )

We have put the pattern into parentheses, and caused the recursion to
refer to them instead of the whole pattern. In a larger pattern,
keeping track of parenthesis numbers can be tricky. It may be more
convenient to use named parentheses instead. For this, PCRE uses
(?P>name), which is an extension to the Python syntax that PCRE uses
for named parentheses (Perl does not provide named parentheses). We
could rewrite the above example as follows:

  (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )

This particular example pattern contains nested unlimited repeats, and
so the use of atomic grouping for matching strings of non-parentheses
is important when applying the pattern to strings that do not match.
For example, when this pattern is applied to

  (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

it yields "no match" quickly. However, if atomic grouping is not used,
the match runs for a very long time indeed because there are so many
different ways the + and * repeats can carve up the subject, and all
have to be tested before failure can be reported.

At the end of a match, the values set for any capturing subpatterns are
those from the outermost level of the recursion at which the subpattern
value is set. If you want to obtain intermediate values, a callout
function can be used (see below and the pcrecallout documentation). If
the pattern above is matched against

  (ab(cd)ef)

the value for the capturing parentheses is "ef", which is the last
value taken on at the top level. If additional parentheses are added,
giving

  \( ( ( (?>[^()]+) | (?R) )* ) \)

the string they capture is "ab(cd)ef", the contents of the top level
parentheses. If there are more than 15 capturing parentheses in a
pattern, PCRE has to obtain extra memory to store data during a
recursion, which it does by using pcre_malloc, freeing it via pcre_free
afterwards. If no memory can be obtained, the match fails with the
PCRE_ERROR_NOMEMORY error.

Do not confuse the (?R) item with the condition (R), which tests for
recursion. Consider this pattern, which matches text in angle
brackets, allowing for arbitrary nesting. Only digits are allowed in
nested brackets (that is, when recursing), whereas any characters are
permitted at the outer level.

  < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >

In this pattern, (?(R) is the start of a conditional subpattern, with
two different alternatives for the recursive and non-recursive cases.
The (?R) item is the actual recursive call.
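
The difference between the (R) condition and the (?R) call may be easier to see in a hand-written sketch. This is plain Python, not PCRE, and the function name and the "recursing" parameter are hypothetical: the recursive call corresponds to (?R), and the flag it passes down corresponds to what the (R) condition tests.

```python
ASCII_DIGITS = "0123456789"


def match_angle(subject: str, pos: int = 0, recursing: bool = False) -> int:
    """Match one angle-bracketed group starting at pos.

    Returns the offset just past the closing ">", or -1 on failure.
    At the outer level any characters other than "<" and ">" are
    accepted; inside nested brackets only digits are accepted.
    """
    if pos >= len(subject) or subject[pos] != "<":
        return -1
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ">":
            return pos + 1
        if ch == "<":
            # The recursive call itself, like the (?R) item.
            end = match_angle(subject, pos, recursing=True)
            if end == -1:
                return -1
            pos = end
        elif recursing and ch not in ASCII_DIGITS:
            # The test for recursion, like the (R) condition: only
            # digits are allowed when recursing.
            return -1
        else:
            pos += 1
    return -1


if __name__ == "__main__":
    print(match_angle("<abc<123>def>"))
    print(match_angle("<abc<12a>def>"))
```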

SUBPATTERNS AS SUBROUTINES

If the syntax for a recursive subpattern reference (either by number or
by name) is used outside the parentheses to which it refers, it
operates like a subroutine in a programming language. An earlier
example pointed out that the pattern

  (sens|respons)e and \1ibility

matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If instead the pattern

  (sens|respons)e and (?1)ibility

is used, it does match "sense and responsibility" as well as the other
two strings. Such references must, however, follow the subpattern to
which they refer.
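
The contrast between a back reference and a subroutine call can be sketched with Python's standard re module. That module supports \1 but has no (?1) item, so the second pattern below only emulates the subroutine call by textually repeating the subpattern; this is an assumption-laden illustration, and the names are hypothetical.

```python
import re

ALTERNATIVES = r"sens|respons"  # hypothetical helper holding the subpattern text

# Back reference: \1 must match the same text that group 1 matched.
backref = re.compile(rf"({ALTERNATIVES})e and \1ibility")

# Emulated subroutine call: the subpattern is applied again, so either
# alternative may match the second time, whatever matched the first.
subroutine_like = re.compile(rf"({ALTERNATIVES})e and (?:{ALTERNATIVES})ibility")

if __name__ == "__main__":
    for text in (
        "sense and sensibility",
        "response and responsibility",
        "sense and responsibility",
    ):
        print(
            repr(text),
            bool(backref.fullmatch(text)),
            bool(subroutine_like.fullmatch(text)),
        )
```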

CALLOUTS

Perl has a feature whereby using the sequence (?{...}) causes arbitrary
Perl code to be obeyed in the middle of matching a regular expression.
This makes it possible, amongst other things, to extract different
substrings that match the same pair of parentheses when there is a
repetition.

PCRE provides a similar feature, but of course it cannot obey arbitrary
Perl code. The feature is called "callout". The caller of PCRE provides
an external function by putting its entry point in the global variable
pcre_callout. By default, this variable contains NULL, which disables
all calling out.

Within a regular expression, (?C) indicates the points at which the
external function is to be called. If you want to identify different
callout points, you can put a number less than 256 after the letter C.
The default value is zero. For example, this pattern has two callout
points:

  (?C1)abc(?C2)def

If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
automatically installed before each item in the pattern. They are all
numbered 255.

During matching, when PCRE reaches a callout point (and pcre_callout is
set), the external function is called. It is provided with the number
of the callout, the position in the pattern, and, optionally, one item
of data originally supplied by the caller of pcre_exec(). The callout
function may cause matching to proceed, to backtrack, or to fail
altogether. A complete description of the interface to the callout
function is given in the pcrecallout documentation.

Last updated: 28 February 2005
Copyright (c) 1997-2005 University of Cambridge.