367
367
There are two new general option names, PCRE_UTF16 and
368
368
PCRE_NO_UTF16_CHECK, which correspond to PCRE_UTF8 and
369
369
PCRE_NO_UTF8_CHECK in the 8-bit library. In fact, these new options
370
define the same bits in the options word.
370
define the same bits in the options word. There is a discussion about
371
the validity of UTF-16 strings in the pcreunicode page.
372
For the pcre16_config() function there is an option PCRE_CONFIG_UTF16
373
that returns 1 if UTF-16 support is configured, otherwise 0. If this
374
option is given to pcre_config(), or if the PCRE_CONFIG_UTF8 option is
373
For the pcre16_config() function there is an option PCRE_CONFIG_UTF16
374
that returns 1 if UTF-16 support is configured, otherwise 0. If this
375
option is given to pcre_config(), or if the PCRE_CONFIG_UTF8 option is
375
376
given to pcre16_config(), the result is the PCRE_ERROR_BADOPTION error.
380
In 16-bit mode, when PCRE_UTF16 is not set, character values are
381
In 16-bit mode, when PCRE_UTF16 is not set, character values are
381
382
treated in the same way as in 8-bit, non-UTF-8 mode, except, of course,
382
that they can range from 0 to 0xffff instead of 0 to 0xff. Character
383
types for characters less than 0xff can therefore be influenced by the
384
locale in the same way as before. Characters greater than 0xff have
383
that they can range from 0 to 0xffff instead of 0 to 0xff. Character
384
types for characters less than 0xff can therefore be influenced by the
385
locale in the same way as before. Characters greater than 0xff have
385
386
only one case, and no "type" (such as letter or digit).
387
In UTF-16 mode, the character code is Unicode, in the range 0 to
388
0x10ffff, with the exception of values in the range 0xd800 to 0xdfff
389
because those are "surrogate" values that are used in pairs to encode
388
In UTF-16 mode, the character code is Unicode, in the range 0 to
389
0x10ffff, with the exception of values in the range 0xd800 to 0xdfff
390
because those are "surrogate" values that are used in pairs to encode
390
391
values greater than 0xffff.
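
As an illustration of how a surrogate pair encodes a value greater than 0xffff, here is a minimal sketch in C. It is not PCRE's own code; the function name utf16_decode_pair and the 0xFFFFFFFF failure value are assumptions made for this example.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the PCRE API.
   Returns the code point encoded by a high/low surrogate pair, or
   0xFFFFFFFF if the two 16-bit units do not form a valid pair. */
static uint32_t utf16_decode_pair(uint16_t hi, uint16_t lo)
{
  if (hi < 0xd800 || hi > 0xdbff) return 0xFFFFFFFFu;  /* not a high surrogate */
  if (lo < 0xdc00 || lo > 0xdfff) return 0xFFFFFFFFu;  /* not a low surrogate */

  /* Each surrogate carries 10 bits; the pair is offset by 0x10000. */
  return 0x10000u +
         (((uint32_t)(hi - 0xd800) << 10) | (uint32_t)(lo - 0xdc00));
}
```

The lowest pair (0xd800, 0xdc00) yields 0x10000 and the highest pair (0xdbff, 0xdfff) yields 0x10ffff, which matches the range given above.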
392
A UTF-16 string can indicate its endianness by a special code known as a
393
A UTF-16 string can indicate its endianness by a special code known as a
393
394
byte-order mark (BOM). The PCRE functions do not handle this, expecting
394
strings to be in host byte order. A utility function called
395
pcre16_utf16_to_host_byte_order() is provided to help with this (see
395
strings to be in host byte order. A utility function called
396
pcre16_utf16_to_host_byte_order() is provided to help with this (see
401
The errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16 corre-
402
spond to their 8-bit counterparts. The error PCRE_ERROR_BADMODE is
403
given when a compiled pattern is passed to a function that processes
404
patterns in the other mode, for example, if a pattern compiled with
402
The errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16 corre-
403
spond to their 8-bit counterparts. The error PCRE_ERROR_BADMODE is
404
given when a compiled pattern is passed to a function that processes
405
patterns in the other mode, for example, if a pattern compiled with
405
406
pcre_compile() is passed to pcre16_exec().
407
There are new error codes whose names begin with PCRE_UTF16_ERR for
408
invalid UTF-16 strings, corresponding to the PCRE_UTF8_ERR codes for
409
UTF-8 strings that are described in the section entitled "Reason codes
410
for invalid UTF-8 strings" in the main pcreapi page. The UTF-16 errors
408
There are new error codes whose names begin with PCRE_UTF16_ERR for
409
invalid UTF-16 strings, corresponding to the PCRE_UTF8_ERR codes for
410
UTF-8 strings that are described in the section entitled "Reason codes
411
for invalid UTF-8 strings" in the main pcreapi page. The UTF-16 errors
413
414
PCRE_UTF16_ERR1 Missing low surrogate at end of string
1981
1990
which to start matching. (In 16-bit mode, the bitmap is used for 16-bit
1982
1991
values less than 256.)
1984
These two optimizations apply to both pcre_exec() and pcre_dfa_exec().
1985
However, they are not used by pcre_exec() if pcre_study() is called
1986
with the PCRE_STUDY_JIT_COMPILE option, and just-in-time compiling is
1987
successful. The optimizations can be disabled by setting the
1988
PCRE_NO_START_OPTIMIZE option when calling pcre_exec() or
1989
pcre_dfa_exec(). You might want to do this if your pattern contains
1990
callouts or (*MARK) (which cannot be handled by the JIT compiler), and
1991
you want to make use of these facilities in cases where matching fails.
1992
See the discussion of PCRE_NO_START_OPTIMIZE below.
1993
These two optimizations apply to both pcre_exec() and pcre_dfa_exec(),
1994
and the information is also used by the JIT compiler. The optimiza-
1995
tions can be disabled by setting the PCRE_NO_START_OPTIMIZE option when
1996
calling pcre_exec() or pcre_dfa_exec(), but if this is done, JIT execu-
1997
tion is also disabled. You might want to do this if your pattern con-
1998
tains callouts or (*MARK) and you want to make use of these facilities
1999
in cases where matching fails. See the discussion of
2000
PCRE_NO_START_OPTIMIZE below.
1997
PCRE handles caseless matching, and determines whether characters are
1998
letters, digits, or whatever, by reference to a set of tables, indexed
1999
by character value. When running in UTF-8 mode, this applies only to
2000
characters with codes less than 128. By default, higher-valued codes
2005
PCRE handles caseless matching, and determines whether characters are
2006
letters, digits, or whatever, by reference to a set of tables, indexed
2007
by character value. When running in UTF-8 mode, this applies only to
2008
characters with codes less than 128. By default, higher-valued codes
2001
2009
never match escapes such as \w or \d, but they can be tested with \p if
2002
PCRE is built with Unicode character property support. Alternatively,
2003
the PCRE_UCP option can be set at compile time; this causes \w and
2010
PCRE is built with Unicode character property support. Alternatively,
2011
the PCRE_UCP option can be set at compile time; this causes \w and
2004
2012
friends to use Unicode property support instead of built-in tables. The
2005
2013
use of locales with Unicode is discouraged. If you are handling charac-
2006
ters with codes greater than 128, you should either use UTF-8 and Uni-
2014
ters with codes greater than 128, you should either use UTF-8 and Uni-
2007
2015
code, or use locales, but not try to mix the two.
2009
PCRE contains an internal set of tables that are used when the final
2010
argument of pcre_compile() is NULL. These are sufficient for many
2017
PCRE contains an internal set of tables that are used when the final
2018
argument of pcre_compile() is NULL. These are sufficient for many
2011
2019
applications. Normally, the internal tables recognize only ASCII char-
2012
2020
acters. However, when PCRE is built, it is possible to cause the inter-
2013
2021
nal tables to be rebuilt in the default "C" locale of the local system,
2014
2022
which may cause them to be different.
2016
The internal tables can always be overridden by tables supplied by the
2024
The internal tables can always be overridden by tables supplied by the
2017
2025
application that calls PCRE. These may be created in a different locale
2018
from the default. As more and more applications change to using Uni-
2026
from the default. As more and more applications change to using Uni-
2019
2027
code, the need for this locale support is expected to die away.
2021
External tables are built by calling the pcre_maketables() function,
2022
which has no arguments, in the relevant locale. The result can then be
2023
passed to pcre_compile() or pcre_exec() as often as necessary. For
2024
example, to build and use tables that are appropriate for the French
2025
locale (where accented characters with values greater than 128 are
2029
External tables are built by calling the pcre_maketables() function,
2030
which has no arguments, in the relevant locale. The result can then be
2031
passed to pcre_compile() or pcre_exec() as often as necessary. For
2032
example, to build and use tables that are appropriate for the French
2033
locale (where accented characters with values greater than 128 are
2026
2034
treated as letters), the following code could be used:
2028
2036
setlocale(LC_CTYPE, "fr_FR");
2029
2037
tables = pcre_maketables();
2030
2038
re = pcre_compile(..., tables);
2032
The locale name "fr_FR" is used on Linux and other Unix-like systems;
2040
The locale name "fr_FR" is used on Linux and other Unix-like systems;
2033
2041
if you are using Windows, the name for the French locale is "french".
2035
When pcre_maketables() runs, the tables are built in memory that is
2036
obtained via pcre_malloc. It is the caller's responsibility to ensure
2037
that the memory containing the tables remains available for as long as
2043
When pcre_maketables() runs, the tables are built in memory that is
2044
obtained via pcre_malloc. It is the caller's responsibility to ensure
2045
that the memory containing the tables remains available for as long as
2040
2048
The pointer that is passed to pcre_compile() is saved with the compiled
2041
pattern, and the same tables are used via this pointer by pcre_study()
2049
pattern, and the same tables are used via this pointer by pcre_study()
2042
2050
and normally also by pcre_exec(). Thus, by default, for any single pat-
2043
2051
tern, compilation, studying and matching all happen in the same locale,
2044
2052
but different patterns can be compiled in different locales.
2046
It is possible to pass a table pointer or NULL (indicating the use of
2047
the internal tables) to pcre_exec(). Although not intended for this
2048
purpose, this facility could be used to match a pattern in a different
2054
It is possible to pass a table pointer or NULL (indicating the use of
2055
the internal tables) to pcre_exec(). Although not intended for this
2056
purpose, this facility could be used to match a pattern in a different
2049
2057
locale from the one in which it was compiled. Passing table pointers at
2050
2058
run time is discussed below in the section on matching a pattern.
2087
2095
PCRE_INFO_SIZE, /* what is required */
2088
2096
&length); /* where to put the data */
2090
The possible values for the third argument are defined in pcre.h, and
2098
The possible values for the third argument are defined in pcre.h, and
2091
2099
are as follows:
2093
2101
PCRE_INFO_BACKREFMAX
2095
Return the number of the highest back reference in the pattern. The
2096
fourth argument should point to an int variable. Zero is returned if
2103
Return the number of the highest back reference in the pattern. The
2104
fourth argument should point to an int variable. Zero is returned if
2097
2105
there are no back references.
2099
2107
PCRE_INFO_CAPTURECOUNT
2101
Return the number of capturing subpatterns in the pattern. The fourth
2109
Return the number of capturing subpatterns in the pattern. The fourth
2102
2110
argument should point to an int variable.
2104
2112
PCRE_INFO_DEFAULT_TABLES
2106
Return a pointer to the internal default character tables within PCRE.
2107
The fourth argument should point to an unsigned char * variable. This
2114
Return a pointer to the internal default character tables within PCRE.
2115
The fourth argument should point to an unsigned char * variable. This
2108
2116
information call is provided for internal use by the pcre_study() func-
2109
tion. External callers can cause PCRE to use its internal tables by
2117
tion. External callers can cause PCRE to use its internal tables by
2110
2118
passing a NULL table pointer.
2112
2120
PCRE_INFO_FIRSTBYTE
2114
2122
Return information about the first data unit of any matched string, for
2115
a non-anchored pattern. (The name of this option refers to the 8-bit
2116
library, where data units are bytes.) The fourth argument should point
2123
a non-anchored pattern. (The name of this option refers to the 8-bit
2124
library, where data units are bytes.) The fourth argument should point
2117
2125
to an int variable.
2119
If there is a fixed first value, for example, the letter "c" from a
2120
pattern such as (cat|cow|coyote), its value is returned. In the 8-bit
2121
library, the value is always less than 256; in the 16-bit library the
2127
If there is a fixed first value, for example, the letter "c" from a
2128
pattern such as (cat|cow|coyote), its value is returned. In the 8-bit
2129
library, the value is always less than 256; in the 16-bit library the
2122
2130
value can be up to 0xffff.
2124
2132
If there is no fixed first value, and if either
2126
(a) the pattern was compiled with the PCRE_MULTILINE option, and every
2134
(a) the pattern was compiled with the PCRE_MULTILINE option, and every
2127
2135
branch starts with "^", or
2129
2137
(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
2130
2138
set (if it were set, the pattern would be anchored),
2132
-1 is returned, indicating that the pattern matches only at the start
2133
of a subject string or after any newline within the string. Otherwise
2140
-1 is returned, indicating that the pattern matches only at the start
2141
of a subject string or after any newline within the string. Otherwise
2134
2142
-2 is returned. For anchored patterns, -2 is returned.
2136
2144
PCRE_INFO_FIRSTTABLE
2138
If the pattern was studied, and this resulted in the construction of a
2139
256-bit table indicating a fixed set of values for the first data unit
2140
in any matching string, a pointer to the table is returned. Otherwise
2141
NULL is returned. The fourth argument should point to an unsigned char
2146
If the pattern was studied, and this resulted in the construction of a
2147
256-bit table indicating a fixed set of values for the first data unit
2148
in any matching string, a pointer to the table is returned. Otherwise
2149
NULL is returned. The fourth argument should point to an unsigned char
2144
2152
PCRE_INFO_HASCRORLF
2146
Return 1 if the pattern contains any explicit matches for CR or LF
2147
characters, otherwise 0. The fourth argument should point to an int
2148
variable. An explicit match is either a literal CR or LF character, or
2154
Return 1 if the pattern contains any explicit matches for CR or LF
2155
characters, otherwise 0. The fourth argument should point to an int
2156
variable. An explicit match is either a literal CR or LF character, or
2151
2159
PCRE_INFO_JCHANGED
2153
Return 1 if the (?J) or (?-J) option setting is used in the pattern,
2154
otherwise 0. The fourth argument should point to an int variable. (?J)
2161
Return 1 if the (?J) or (?-J) option setting is used in the pattern,
2162
otherwise 0. The fourth argument should point to an int variable. (?J)
2155
2163
and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
2159
Return 1 if the pattern was studied with the PCRE_STUDY_JIT_COMPILE
2160
option, and just-in-time compiling was successful. The fourth argument
2161
should point to an int variable. A return value of 0 means that JIT
2162
support is not available in this version of PCRE, or that the pattern
2163
was not studied with the PCRE_STUDY_JIT_COMPILE option, or that the JIT
2164
compiler could not handle this particular pattern. See the pcrejit doc-
2165
umentation for details of what can and cannot be handled.
2167
Return 1 if the pattern was studied with one of the JIT options, and
2168
just-in-time compiling was successful. The fourth argument should point
2169
to an int variable. A return value of 0 means that JIT support is not
2170
available in this version of PCRE, or that the pattern was not studied
2171
with a JIT option, or that the JIT compiler could not handle this par-
2172
ticular pattern. See the pcrejit documentation for details of what can
2173
and cannot be handled.
2167
2175
PCRE_INFO_JITSIZE
2169
If the pattern was successfully studied with the PCRE_STUDY_JIT_COMPILE
2170
option, return the size of the JIT compiled code, otherwise return
2171
zero. The fourth argument should point to a size_t variable.
2177
If the pattern was successfully studied with a JIT option, return the
2178
size of the JIT compiled code, otherwise return zero. The fourth argu-
2179
ment should point to a size_t variable.
2173
2181
PCRE_INFO_LASTLITERAL
2175
Return the value of the rightmost literal data unit that must exist in
2176
any matched string, other than at its start, if such a value has been
2183
Return the value of the rightmost literal data unit that must exist in
2184
any matched string, other than at its start, if such a value has been
2177
2185
recorded. The fourth argument should point to an int variable. If there
2178
2186
is no such value, -1 is returned. For anchored patterns, a last literal
2179
value is recorded only if it follows something of variable length. For
2187
value is recorded only if it follows something of variable length. For
2180
2188
example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
2181
2189
/^a\dz\d/ the returned value is -1.
2191
PCRE_INFO_MAXLOOKBEHIND
2193
Return the number of characters (NB not bytes) in the longest lookbe-
2194
hind assertion in the pattern. Note that the simple assertions \b and
2195
\B require a one-character lookbehind. This information is useful when
2196
doing multi-segment matching using the partial matching facilities.
2183
2198
PCRE_INFO_MINLENGTH
2185
2200
If the pattern was studied and a minimum length for matching subject
2643
2660
When PCRE_UTF8 is set at compile time, the validity of the subject as a
2644
2661
UTF-8 string is automatically checked when pcre_exec() is subsequently
2645
called. The value of startoffset is also checked to ensure that it
2646
points to the start of a UTF-8 character. There is a discussion about
2647
the validity of UTF-8 strings in the pcreunicode page. If an invalid
2648
sequence of bytes is found, pcre_exec() returns the error
2662
called. The entire string is checked before any other processing takes
2663
place. The value of startoffset is also checked to ensure that it
2664
points to the start of a UTF-8 character. There is a discussion about
2665
the validity of UTF-8 strings in the pcreunicode page. If an invalid
2666
sequence of bytes is found, pcre_exec() returns the error
2649
2667
PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is a
2650
2668
truncated character at the end of the subject, PCRE_ERROR_SHORTUTF8. In
2651
both cases, information about the precise nature of the error may also
2652
be returned (see the descriptions of these errors in the section enti-
2653
tled Error return values from pcre_exec() below). If startoffset con-
2669
both cases, information about the precise nature of the error may also
2670
be returned (see the descriptions of these errors in the section enti-
2671
tled Error return values from pcre_exec() below). If startoffset con-
2654
2672
tains a value that does not point to the start of a UTF-8 character (or
2655
2673
to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned.
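
A caller that sets PCRE_NO_UTF8_CHECK may want to check startoffset itself. The sketch below is one way to do that, assuming the subject is already known to be valid UTF-8. The function name utf8_offset_is_char_start is hypothetical, and this is not the check that PCRE performs internally.

```c
/* Hypothetical helper, not part of the PCRE API.
   Returns 1 if offset is the end of the subject or does not land on a
   UTF-8 continuation byte (bit pattern 10xxxxxx). Returns 0 otherwise,
   including for offsets outside the subject. Assumes the subject itself
   is valid UTF-8. */
static int utf8_offset_is_char_start(const char *subject, int length,
                                     int offset)
{
  if (offset < 0 || offset > length) return 0;
  if (offset == length) return 1;               /* end of subject is allowed */
  return (((unsigned char)subject[offset]) & 0xc0) != 0x80;
}
```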
2657
If you already know that your subject is valid, and you want to skip
2658
these checks for performance reasons, you can set the
2659
PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to
2660
do this for the second and subsequent calls to pcre_exec() if you are
2661
making repeated calls to find all the matches in a single subject
2662
string. However, you should be sure that the value of startoffset
2663
points to the start of a character (or the end of the subject). When
2675
If you already know that your subject is valid, and you want to skip
2676
these checks for performance reasons, you can set the
2677
PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to
2678
do this for the second and subsequent calls to pcre_exec() if you are
2679
making repeated calls to find all the matches in a single subject
2680
string. However, you should be sure that the value of startoffset
2681
points to the start of a character (or the end of the subject). When
2664
2682
PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid string as a
2665
subject or an invalid value of startoffset is undefined. Your program
2683
subject or an invalid value of startoffset is undefined. Your program
2668
2686
PCRE_PARTIAL_HARD
2669
2687
PCRE_PARTIAL_SOFT
2671
These options turn on the partial matching feature. For backwards com-
2672
patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
2673
match occurs if the end of the subject string is reached successfully,
2674
but there are not enough subject characters to complete the match. If
2689
These options turn on the partial matching feature. For backwards com-
2690
patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
2691
match occurs if the end of the subject string is reached successfully,
2692
but there are not enough subject characters to complete the match. If
2675
2693
this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,
2676
matching continues by testing any remaining alternatives. Only if no
2677
complete match can be found is PCRE_ERROR_PARTIAL returned instead of
2678
PCRE_ERROR_NOMATCH. In other words, PCRE_PARTIAL_SOFT says that the
2679
caller is prepared to handle a partial match, but only if no complete
2694
matching continues by testing any remaining alternatives. Only if no
2695
complete match can be found is PCRE_ERROR_PARTIAL returned instead of
2696
PCRE_ERROR_NOMATCH. In other words, PCRE_PARTIAL_SOFT says that the
2697
caller is prepared to handle a partial match, but only if no complete
2680
2698
match can be found.
2682
If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this
2683
case, if a partial match is found, pcre_exec() immediately returns
2684
PCRE_ERROR_PARTIAL, without considering any other alternatives. In
2685
other words, when PCRE_PARTIAL_HARD is set, a partial match is consid-
2700
If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this
2701
case, if a partial match is found, pcre_exec() immediately returns
2702
PCRE_ERROR_PARTIAL, without considering any other alternatives. In
2703
other words, when PCRE_PARTIAL_HARD is set, a partial match is consid-
2686
2704
ered to be more important than an alternative complete match.
2688
In both cases, the portion of the string that was inspected when the
2706
In both cases, the portion of the string that was inspected when the
2689
2707
partial match was found is set as the first matching string. There is a
2690
more detailed discussion of partial and multi-segment matching, with
2708
more detailed discussion of partial and multi-segment matching, with
2691
2709
examples, in the pcrepartial documentation.
2693
2711
The string to be matched by pcre_exec()
2695
The subject string is passed to pcre_exec() as a pointer in subject, a
2696
length in bytes in length, and a starting byte offset in startoffset.
2697
If this is negative or greater than the length of the subject,
2698
pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting offset is
2699
zero, the search for a match starts at the beginning of the subject,
2713
The subject string is passed to pcre_exec() as a pointer in subject, a
2714
length in bytes in length, and a starting byte offset in startoffset.
2715
If this is negative or greater than the length of the subject,
2716
pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting offset is
2717
zero, the search for a match starts at the beginning of the subject,
2700
2718
and this is by far the most common case. In UTF-8 mode, the byte offset
2701
must point to the start of a UTF-8 character (or the end of the sub-
2702
ject). Unlike the pattern string, the subject may contain binary zero
2719
must point to the start of a UTF-8 character (or the end of the sub-
2720
ject). Unlike the pattern string, the subject may contain binary zero
2705
A non-zero starting offset is useful when searching for another match
2706
in the same subject by calling pcre_exec() again after a previous suc-
2707
cess. Setting startoffset differs from just passing over a shortened
2708
string and setting PCRE_NOTBOL in the case of a pattern that begins
2723
A non-zero starting offset is useful when searching for another match
2724
in the same subject by calling pcre_exec() again after a previous suc-
2725
cess. Setting startoffset differs from just passing over a shortened
2726
string and setting PCRE_NOTBOL in the case of a pattern that begins
2709
2727
with any kind of lookbehind. For example, consider the pattern
2713
which finds occurrences of "iss" in the middle of words. (\B matches
2714
only if the current position in the subject is not a word boundary.)
2715
When applied to the string "Mississipi" the first call to pcre_exec()
2716
finds the first occurrence. If pcre_exec() is called again with just
2717
the remainder of the subject, namely "issipi", it does not match,
2731
which finds occurrences of "iss" in the middle of words. (\B matches
2732
only if the current position in the subject is not a word boundary.)
2733
When applied to the string "Mississipi" the first call to pcre_exec()
2734
finds the first occurrence. If pcre_exec() is called again with just
2735
the remainder of the subject, namely "issipi", it does not match,
2718
2736
because \B is always false at the start of the subject, which is deemed
2719
to be a word boundary. However, if pcre_exec() is passed the entire
2737
to be a word boundary. However, if pcre_exec() is passed the entire
2720
2738
string again, but with startoffset set to 4, it finds the second occur-
2721
rence of "iss" because it is able to look behind the starting point to
2739
rence of "iss" because it is able to look behind the starting point to
2722
2740
discover that it is preceded by a letter.
2724
Finding all the matches in a subject is tricky when the pattern can
2742
Finding all the matches in a subject is tricky when the pattern can
2725
2743
match an empty string. It is possible to emulate Perl's /g behaviour by
2726
first trying the match again at the same offset, with the
2727
PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED options, and then if that
2728
fails, advancing the starting offset and trying an ordinary match
2744
first trying the match again at the same offset, with the
2745
PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED options, and then if that
2746
fails, advancing the starting offset and trying an ordinary match
2729
2747
again. There is some code that demonstrates how to do this in the pcre-
2730
2748
demo sample program. In the most general case, you have to check to see
2731
if the newline convention recognizes CRLF as a newline, and if so, and
2749
if the newline convention recognizes CRLF as a newline, and if so, and
2732
2750
the current character is CR followed by LF, advance the starting offset
2733
2751
by two characters instead of one.
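
The offset adjustment described above can be sketched as follows. This is only an illustration and not the pcredemo code. The function name advance_after_empty_match is hypothetical, and the crlf_is_newline flag is assumed to have been worked out by the caller from the newline convention in force. The sketch handles single-byte characters only; in UTF-8 mode the step would have to cover a whole character.

```c
/* Hypothetical helper, not part of the PCRE API.
   Call this after the retry at start_offset with PCRE_NOTEMPTY_ATSTART and
   PCRE_ANCHORED has failed. It returns the offset at which to attempt the
   next ordinary match. Single-byte characters only. */
static int advance_after_empty_match(const char *subject, int length,
                                     int start_offset, int crlf_is_newline)
{
  if (crlf_is_newline &&
      start_offset < length - 1 &&
      subject[start_offset] == '\r' &&
      subject[start_offset + 1] == '\n')
    return start_offset + 2;   /* step over CR and LF together */

  return start_offset + 1;     /* step over one character */
}
```

If the returned offset is greater than the subject length, there is nothing left to search and the caller's loop should stop.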
2735
If a non-zero starting offset is passed when the pattern is anchored,
2753
If a non-zero starting offset is passed when the pattern is anchored,
2736
2754
one attempt to match at the given offset is made. This can only succeed
2737
if the pattern does not require the match to be at the start of the
2755
if the pattern does not require the match to be at the start of the
2740
2758
How pcre_exec() returns captured substrings
2742
In general, a pattern matches a certain portion of the subject, and in
2743
addition, further substrings from the subject may be picked out by
2744
parts of the pattern. Following the usage in Jeffrey Friedl's book,
2745
this is called "capturing" in what follows, and the phrase "capturing
2746
subpattern" is used for a fragment of a pattern that picks out a sub-
2747
string. PCRE supports several other kinds of parenthesized subpattern
2760
In general, a pattern matches a certain portion of the subject, and in
2761
addition, further substrings from the subject may be picked out by
2762
parts of the pattern. Following the usage in Jeffrey Friedl's book,
2763
this is called "capturing" in what follows, and the phrase "capturing
2764
subpattern" is used for a fragment of a pattern that picks out a sub-
2765
string. PCRE supports several other kinds of parenthesized subpattern
2748
2766
that do not cause substrings to be captured.
2750
2768
Captured substrings are returned to the caller via a vector of integers
2751
whose address is passed in ovector. The number of elements in the vec-
2752
tor is passed in ovecsize, which must be a non-negative number. Note:
2769
whose address is passed in ovector. The number of elements in the vec-
2770
tor is passed in ovecsize, which must be a non-negative number. Note:
2753
2771
this argument is NOT the size of ovector in bytes.
2755
The first two-thirds of the vector is used to pass back captured sub-
2756
strings, each substring using a pair of integers. The remaining third
2757
of the vector is used as workspace by pcre_exec() while matching cap-
2758
turing subpatterns, and is not available for passing back information.
2759
The number passed in ovecsize should always be a multiple of three. If
2773
The first two-thirds of the vector is used to pass back captured sub-
2774
strings, each substring using a pair of integers. The remaining third
2775
of the vector is used as workspace by pcre_exec() while matching cap-
2776
turing subpatterns, and is not available for passing back information.
2777
The number passed in ovecsize should always be a multiple of three. If
2760
2778
it is not, it is rounded down.
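
The sizing rule can be written out as a small sketch. Both function names are hypothetical. The capture count is assumed to come from the caller, for example from a PCRE_INFO_CAPTURECOUNT query.

```c
/* Hypothetical helpers, not part of the PCRE API. */

/* Number of int elements needed in ovector to receive the whole match
   plus capture_count captured substrings: two ints per pair for the
   offsets, plus one int per pair of workspace. */
static int ovector_elements_for(int capture_count)
{
  return (capture_count + 1) * 3;
}

/* Number of elements that are used when ovecsize is not a multiple of
   three: the value is rounded down. */
static int ovector_usable_elements(int ovecsize)
{
  return ovecsize - (ovecsize % 3);
}
```

For a pattern with two capturing subpatterns this gives 9 elements, of which the first 6 carry offsets. An ovecsize of 10 is treated as 9.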
2762
When a match is successful, information about captured substrings is
2763
returned in pairs of integers, starting at the beginning of ovector,
2764
and continuing up to two-thirds of its length at the most. The first
2765
element of each pair is set to the byte offset of the first character
2766
in a substring, and the second is set to the byte offset of the first
2767
character after the end of a substring. Note: these values are always
2780
When a match is successful, information about captured substrings is
2781
returned in pairs of integers, starting at the beginning of ovector,
2782
and continuing up to two-thirds of its length at the most. The first
2783
element of each pair is set to the byte offset of the first character
2784
in a substring, and the second is set to the byte offset of the first
2785
character after the end of a substring. Note: these values are always
2768
2786
byte offsets, even in UTF-8 mode. They are not character counts.
2770
The first pair of integers, ovector[0] and ovector[1], identify the
2771
portion of the subject string matched by the entire pattern. The next
2772
pair is used for the first capturing subpattern, and so on. The value
2788
The first pair of integers, ovector[0] and ovector[1], identify the
2789
portion of the subject string matched by the entire pattern. The next
2790
pair is used for the first capturing subpattern, and so on. The value
2773
2791
returned by pcre_exec() is one more than the highest numbered pair that
2774
has been set. For example, if two substrings have been captured, the
2775
returned value is 3. If there are no capturing subpatterns, the return
2792
has been set. For example, if two substrings have been captured, the
2793
returned value is 3. If there are no capturing subpatterns, the return
2776
2794
value from a successful match is 1, indicating that just the first pair
2777
2795
of offsets has been set.
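
To make the offset arithmetic concrete, the following sketch copies one substring out of the subject using the pairs in ovector. The function name copy_captured is hypothetical. For real use, PCRE provides pcre_copy_substring(), which does this job; the sketch only shows how the pairs and the return value relate.

```c
#include <string.h>

/* Hypothetical helper, not part of the PCRE API.
   rc is the positive value returned by a successful pcre_exec() call.
   Copies substring n (0 = the whole match) into buf as a C string and
   returns its length in bytes, or -1 if pair n was not set or buf is
   too small. */
static int copy_captured(const char *subject, const int *ovector, int rc,
                         int n, char *buf, int bufsize)
{
  int start, len;

  if (n < 0 || n >= rc) return -1;     /* beyond the highest numbered pair */
  start = ovector[2 * n];              /* byte offset of first character */
  len = ovector[2 * n + 1] - start;    /* end offset minus start offset */
  if (start < 0 || len < 0 || len >= bufsize) return -1;

  memcpy(buf, subject + start, (size_t)len);
  buf[len] = '\0';
  return len;
}
```

With the subject "hello world", a return value of 3 and ovector holding 0, 11, 0, 5, 6, 11, substring 2 is "world" and a request for substring 3 fails.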
If a capturing subpattern is matched repeatedly, it is the last portion
of the string that it matched that is returned.

If the vector is too small to hold all the captured substring offsets,
it is used as far as possible (up to two-thirds of its length), and the
function returns a value of zero. If neither the actual string matched
nor any captured substrings are of interest, pcre_exec() may be called
with ovector passed as NULL and ovecsize as zero. However, if the
pattern contains back references and the ovector is not big enough to
remember the related substrings, PCRE has to get additional memory for
use during matching. Thus it is usually advisable to supply an ovector
of reasonable size.

There are some cases where zero is returned (indicating vector
overflow) when in fact the vector is exactly the right size for the
final match. For example, consider the pattern

If a vector of 6 elements (allowing for only 1 captured substring) is
given with subject string "abd", pcre_exec() will try to set the second
captured string, thereby recording a vector overflow, before failing to
match "c" and backing up to try the second alternative. The zero
return, however, does correctly indicate that the maximum number of
slots (namely 2) have been filled. In similar cases where there is
temporary overflow, but the final number of used slots is actually less
than the maximum, a non-zero value is returned.

The pcre_fullinfo() function can be used to find out how many capturing
subpatterns there are in a compiled pattern. The smallest size for
ovector that will allow for n captured substrings, in addition to the
offsets of the substring matched by the whole pattern, is (n+1)*3.
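
The sizing rule can be written down directly. This is only a sketch:
ovector_size_for() and alloc_ovector() are hypothetical helper names,
and n is assumed to be the capture count already obtained from
pcre_fullinfo().

```c
#include <assert.h>
#include <stdlib.h>

/* Smallest ovector size (in ints) that holds the whole-match pair plus
 * n captured substrings: (n+1)*3. Two-thirds of it carries offsets;
 * the remaining third is workspace for pcre_exec(). */
static int ovector_size_for(int n)
{
    assert(n >= 0);
    return (n + 1) * 3;
}

/* Allocate a vector of that size. Returns NULL on failure and stores
 * the element count in *ovecsize for passing on to pcre_exec(). */
static int *alloc_ovector(int n, int *ovecsize)
{
    *ovecsize = ovector_size_for(n);
    return malloc((size_t)*ovecsize * sizeof(int));
}
```

With n = 2 the vector has 9 elements, of which the first 6 can receive
offsets.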
It is possible for capturing subpattern number n+1 to match some part
of the subject when subpattern n has not been used at all. For example,
if the string "abc" is matched against the pattern (a|(z))(bc) the
return from the function is 4, and subpatterns 1 and 3 are matched, but
2 is not. When this happens, both values in the offset pairs
corresponding to unused subpatterns are set to -1.

Offset values that correspond to unused subpatterns at the end of the
expression are also set to -1. For example, if the string "abc" is
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
matched. The return from the function is 2, because the highest used
capturing subpattern number is 1, and the offsets for the second and
third capturing subpatterns (assuming the vector is large enough, of
course) are set to -1.
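
A caller can test for the -1 markers before using a pair. The function
capture_is_set() below is a hypothetical helper, not a PCRE function,
and the vector contents in the usage note are written out by hand from
the (a|(z))(bc) example rather than produced by a real match.

```c
/* Returns 1 if capture i holds real offsets, 0 if it is unset (-1) or
 * if i lies outside the part of the vector that carries offsets.
 * Only the first two-thirds of ovector hold offset pairs, so at most
 * ovecsize/3 pairs are available. The caller should only ask about
 * capture numbers that exist in the pattern, because other elements
 * are left unchanged by pcre_exec(). */
static int capture_is_set(const int *ovector, int ovecsize, int i)
{
    int pairs = ovecsize / 3;
    if (i < 0 || i >= pairs)
        return 0;
    return ovector[2 * i] >= 0 && ovector[2 * i + 1] >= 0;
}
```

For "abc" matched against (a|(z))(bc), the first eight elements would be
0, 3, 0, 1, -1, -1, 1, 3, so the helper reports captures 1 and 3 as set
and capture 2 as unset.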
Note: Elements in the first two-thirds of ovector that do not
correspond to capturing parentheses in the pattern are never changed.
That is, if a pattern contains n capturing parentheses, no more than
ovector[0] to ovector[2n+1] are set by pcre_exec(). The other elements
(in the first two-thirds) retain whatever values they previously had.

Some convenience functions are provided for extracting the captured
substrings as separate strings. These are described below.

Error return values from pcre_exec()

If pcre_exec() fails, it returns a negative number. The following are
defined in the header file:

PCRE_ERROR_NOMATCH (-1)

PCRE_ERROR_BADMAGIC (-4)

PCRE stores a 4-byte "magic number" at the start of the compiled code,
to catch the case when it is passed a junk pointer and to detect when a
pattern that was compiled in an environment of one endianness is run in
an environment with the other endianness. This is the error that PCRE
gives when the magic number is not present.

PCRE_ERROR_UNKNOWN_OPCODE (-5)

While running the pattern match, an unknown item was encountered in the
compiled pattern. This error could be caused by a bug in PCRE or by
overwriting of the compiled pattern.

PCRE_ERROR_NOMEMORY (-6)

If a pattern contains back references, but the ovector that is passed
to pcre_exec() is not big enough to remember the referenced substrings,
PCRE gets a block of memory at the start of matching to use for this
purpose. If the call via pcre_malloc() fails, this error is given. The
memory is automatically freed at the end of matching.

This error is also given if pcre_stack_malloc() fails in pcre_exec().
This can happen only when PCRE has been compiled with --disable-stack-

PCRE_ERROR_NOSUBSTRING (-7)

This error is used by the pcre_copy_substring(), pcre_get_substring(),
and pcre_get_substring_list() functions (see below). It is never
returned by pcre_exec().

PCRE_ERROR_MATCHLIMIT (-8)

The backtracking limit, as specified by the match_limit field in a
pcre_extra structure (or defaulted) was reached. See the description

PCRE_ERROR_CALLOUT (-9)

This error is never generated by pcre_exec() itself. It is provided for
use by callout functions that want to yield a distinctive error code.
See the pcrecallout documentation for details.

PCRE_ERROR_BADUTF8 (-10)

A string that contains an invalid UTF-8 byte sequence was passed as a
subject, and the PCRE_NO_UTF8_CHECK option was not set. If the size of
the output vector (ovecsize) is at least 2, the byte offset to the
start of the invalid UTF-8 character is placed in the first element,
and a reason code is placed in the second element. The reason codes are
listed in the following section. For backward compatibility, if
PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 character
at the end of the subject (reason codes 1 to 5), PCRE_ERROR_SHORTUTF8
is returned instead of PCRE_ERROR_BADUTF8.
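
Reading the failure details back out of the vector might look like the
following sketch. The function utf8_failure_info() and the ERR_ names
are hypothetical; the two numeric values are copied from this page, and
a real program would use the PCRE_ERROR_ names from pcre.h instead.

```c
#include <stddef.h>

/* Numeric values as listed on this page (assumed stand-ins for the
 * PCRE_ERROR_BADUTF8 and PCRE_ERROR_SHORTUTF8 macros). */
enum { ERR_BADUTF8 = -10, ERR_SHORTUTF8 = -25 };

/* If rc reports an invalid or truncated UTF-8 subject and the vector
 * has at least 2 elements, store the byte offset of the bad character
 * and the reason code, and return 1. Otherwise return 0 and leave
 * *offset and *reason untouched. */
static int utf8_failure_info(int rc, const int *ovector, int ovecsize,
                             int *offset, int *reason)
{
    if (rc != ERR_BADUTF8 && rc != ERR_SHORTUTF8)
        return 0;
    if (ovector == NULL || ovecsize < 2)
        return 0;
    *offset = ovector[0];
    *reason = ovector[1];
    return 1;
}
```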

PCRE_ERROR_BADUTF8_OFFSET (-11)

The UTF-8 byte sequence that was passed as a subject was checked and
found to be valid (the PCRE_NO_UTF8_CHECK option was not set), but the
value of startoffset did not point to the beginning of a UTF-8
character or the end of the subject.

PCRE_ERROR_PARTIAL (-12)

The subject string did not match, but it did match partially. See the
pcrepartial documentation for details of partial matching.

PCRE_ERROR_BADPARTIAL (-13)

This code is no longer in use. It was formerly returned when the
PCRE_PARTIAL option was used with a compiled pattern containing items
that were not supported for partial matching. From release 8.00
onwards, there are no restrictions on partial matching.

PCRE_ERROR_INTERNAL (-14)

An unexpected internal error has occurred. This error could be caused
by a bug in PCRE or by overwriting of the compiled pattern.

PCRE_ERROR_BADCOUNT (-15)

PCRE_ERROR_SHORTUTF8 (-25)

This error is returned instead of PCRE_ERROR_BADUTF8 when the subject
string ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD
option is set. Information about the failure is returned as for
PCRE_ERROR_BADUTF8. It is in fact sufficient to detect this case, but
this special error code for PCRE_PARTIAL_HARD precedes the
implementation of returned information; it is retained for backwards
compatibility.

PCRE_ERROR_RECURSELOOP (-26)

This error is returned when pcre_exec() detects a recursion loop within
the pattern. Specifically, it means that either the whole pattern or a
subpattern has been called recursively for the second time at the same
position in the subject string. Some simple patterns that might do this
are detected and faulted at compile time, but more complicated cases,
in particular mutual recursions between two different subpatterns,
cannot be detected until run time.

PCRE_ERROR_JIT_STACKLIMIT (-27)

This error is returned when a pattern that was successfully studied
using a JIT compile option is being matched, but the memory available
for the just-in-time processing stack is not large enough. See the
pcrejit documentation for more details.

PCRE_ERROR_BADMODE (-28)

This error is given if a pattern that was compiled by the 8-bit library
is passed to a 16-bit library function, or vice versa.

PCRE_ERROR_BADENDIANNESS (-29)

This error is given if a pattern that was compiled and saved is
reloaded on a host with different endianness. The utility function
pcre_pattern_to_host_byte_order() can be used to convert such a pattern
so that it runs on the new host.

Error numbers -16 to -20, -22, and -30 are not used by pcre_exec().
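
For diagnostics it can be handy to turn a negative return value into its
name. The sketch below covers only the codes listed in this part of the
page; exec_error_name() is a hypothetical helper, and the literal
numbers are used only so that the example stands alone (a real program
would switch on the PCRE_ERROR_ macros from pcre.h).

```c
/* Map a negative pcre_exec() return value to the name given on this
 * page. Codes that are not listed here fall through to a generic
 * string rather than a guessed name. */
static const char *exec_error_name(int rc)
{
    switch (rc) {
    case -1:  return "PCRE_ERROR_NOMATCH";
    case -4:  return "PCRE_ERROR_BADMAGIC";
    case -5:  return "PCRE_ERROR_UNKNOWN_OPCODE";
    case -6:  return "PCRE_ERROR_NOMEMORY";
    case -7:  return "PCRE_ERROR_NOSUBSTRING";
    case -8:  return "PCRE_ERROR_MATCHLIMIT";
    case -9:  return "PCRE_ERROR_CALLOUT";
    case -10: return "PCRE_ERROR_BADUTF8";
    case -11: return "PCRE_ERROR_BADUTF8_OFFSET";
    case -12: return "PCRE_ERROR_PARTIAL";
    case -13: return "PCRE_ERROR_BADPARTIAL";
    case -14: return "PCRE_ERROR_INTERNAL";
    case -15: return "PCRE_ERROR_BADCOUNT";
    case -25: return "PCRE_ERROR_SHORTUTF8";
    case -26: return "PCRE_ERROR_RECURSELOOP";
    case -27: return "PCRE_ERROR_JIT_STACKLIMIT";
    case -28: return "PCRE_ERROR_BADMODE";
    case -29: return "PCRE_ERROR_BADENDIANNESS";
    default:  return "unlisted error code";
    }
}
```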

Reason codes for invalid UTF-8 strings

This section applies only to the 8-bit library. The corresponding
information for the 16-bit library is given in the pcre16 page.

When pcre_exec() returns either PCRE_ERROR_BADUTF8 or
PCRE_ERROR_SHORTUTF8, and the size of the output vector (ovecsize) is
at least 2, the offset of the start of the invalid UTF-8 character is
placed in the first output vector element (ovector[0]) and a reason
code is placed in the second element (ovector[1]). The reason codes are
given names in the pcre.h header file:

int pcre_get_substring_list(const char *subject,
     int *ovector, int stringcount, const char ***listptr);

Captured substrings can be accessed directly by using the offsets
returned by pcre_exec() in ovector. For convenience, the functions
pcre_copy_substring(), pcre_get_substring(), and
pcre_get_substring_list() are provided for extracting captured
substrings as new, separate, zero-terminated strings. These functions
identify substrings by number. The next section describes functions for
extracting named substrings.

A substring that contains a binary zero is correctly extracted and has
a further zero added on the end, but the result is not, of course, a C
string. However, you can process such a string by referring to the
length that is returned by pcre_copy_substring() and
pcre_get_substring(). Unfortunately, the interface to
pcre_get_substring_list() is not adequate for handling strings
containing binary zeros, because the end of the final string is not
independently indicated.

The first three arguments are the same for all three of these
functions: subject is the subject string that has just been
successfully matched, ovector is a pointer to the vector of integer
offsets that was passed to pcre_exec(), and stringcount is the number
of substrings that were captured by the match, including the substring
that matched the entire regular expression. This is the value returned
by pcre_exec() if it is greater than zero. If pcre_exec() returned
zero, indicating that it ran out of space in ovector, the value passed
as stringcount should be the number of elements in the vector divided
by three.
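
The rule for choosing stringcount can be captured in a few lines. The
name stringcount_for() is a hypothetical helper, not part of PCRE.

```c
/* Value to pass as stringcount to the substring extraction functions,
 * given the pcre_exec() return value rc and the ovector size:
 *   rc > 0   use rc itself;
 *   rc == 0  the vector overflowed, so use ovecsize / 3;
 *   rc < 0   the match failed, so there is nothing to extract. */
static int stringcount_for(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;
    if (rc == 0)
        return ovecsize / 3;
    return 0;
}
```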

The functions pcre_copy_substring() and pcre_get_substring() extract a
single substring, whose number is given as stringnumber. A value of
zero extracts the substring that matched the entire pattern, whereas
higher values extract the captured substrings. For
pcre_copy_substring(), the string is placed in buffer, whose length is
given by buffersize, while for pcre_get_substring() a new block of
memory is obtained via pcre_malloc, and its address is returned via
stringptr. The yield of the function is the length of the string, not
including the terminating zero, or one of these error codes:

PCRE_ERROR_NOMEMORY (-6)

The buffer was too small for pcre_copy_substring(), or the attempt to
get memory failed for pcre_get_substring().

PCRE_ERROR_NOSUBSTRING (-7)

There is no substring whose number is stringnumber.
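
To make the contract concrete, here is a sketch of a function that
behaves the way pcre_copy_substring() is described on this page. It is
not the library's implementation; copy_capture() is a hypothetical
name, and the literals -6 and -7 stand in for PCRE_ERROR_NOMEMORY and
PCRE_ERROR_NOSUBSTRING so that the example needs no PCRE header.

```c
#include <string.h>

/* Copy capture number stringnumber out of subject into buffer as a
 * zero-terminated string. Returns the length without the terminating
 * zero, -7 if there is no such capture, or -6 if the buffer is too
 * small. An unset capture (offsets of -1) yields an empty string. */
static int copy_capture(const char *subject, const int *ovector,
                        int stringcount, int stringnumber,
                        char *buffer, int buffersize)
{
    if (stringnumber < 0 || stringnumber >= stringcount)
        return -7;
    int start = ovector[2 * stringnumber];
    int end = ovector[2 * stringnumber + 1];
    int len = (start < 0) ? 0 : end - start;
    if (buffersize < len + 1)
        return -6;
    if (len > 0)
        memcpy(buffer, subject + start, (size_t)len);
    buffer[len] = '\0';
    return len;
}
```

Because the length is returned separately, a caller can still process a
substring that contains binary zeros, as long as it relies on the
return value and not on strlen().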

The pcre_get_substring_list() function extracts all available
substrings and builds a list of pointers to them. All this is done in a
single block of memory that is obtained via pcre_malloc. The address of
the memory block is returned via listptr, which is also the start of
the list of string pointers. The end of the list is marked by a NULL
pointer. The yield of the function is zero if all went well, or the

PCRE_ERROR_NOMEMORY (-6)

if the attempt to get the memory block failed.

When any of these functions encounter a substring that is unset, which
can happen when capturing subpattern number n+1 matches some part of
the subject, but subpattern n has not been used at all, they return an
empty string. This can be distinguished from a genuine zero-length
substring by inspecting the appropriate offset in ovector, which is
negative for unset substrings.

The two convenience functions pcre_free_substring() and
pcre_free_substring_list() can be used to free the memory returned by a
previous call of pcre_get_substring() or pcre_get_substring_list(),
respectively. They do nothing more than call the function pointed to by
pcre_free, which of course could be called directly from a C program.
However, PCRE is used in some situations where it is linked via a
special interface to another programming language that cannot use
pcre_free directly; it is for these cases that the functions are
provided.

int pcre_get_stringtable_entries(const pcre *code,
     const char *name, char **first, char **last);

When a pattern is compiled with the PCRE_DUPNAMES option, names for
subpatterns are not required to be unique. (Duplicate names are always
allowed for subpatterns with the same number, created by using the (?|
feature. Indeed, if such subpatterns are named, they are required to
use the same names.)

Normally, patterns with duplicate names are such that in any one match,
only one of the named subpatterns participates. An example is shown in
the pcrepattern documentation.

When duplicates are present, pcre_copy_named_substring() and
pcre_get_named_substring() return the first substring corresponding to
the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING
(-7) is returned; no data is returned. The pcre_get_stringnumber()
function returns one of the numbers that are associated with the name,
but it is not defined which it is.

If you want to get full details of all captured substrings for a given
name, you must use the pcre_get_stringtable_entries() function. The
first argument is the compiled pattern, and the second is the name. The
third and fourth are pointers to variables which are updated by the
function. After it has run, they point to the first and last entries in
the name-to-number table for the given name. The function itself
returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
there are none. The format of the table is described in the section
entitled Information about a pattern above. Given all the relevant
entries for the name, you can extract each of their numbers, and hence
the captured data, if any.
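
Walking the entries between first and last could be sketched as below.
This assumes the 8-bit table layout from the section on pattern
information, which is not part of this excerpt: each entry starts with
the subpattern number in two bytes, most significant first, followed by
the zero-terminated name. collect_group_numbers() is a hypothetical
helper, and entry_size is the value returned by
pcre_get_stringtable_entries().

```c
/* Collect the subpattern numbers of all table entries from first to
 * last inclusive. Entries are entry_size bytes apart. At most max_out
 * numbers are stored; the count stored is returned.
 * Assumed layout per entry: 2-byte number (MSB first), then the name. */
static int collect_group_numbers(const char *first, const char *last,
                                 int entry_size, int *out, int max_out)
{
    int n = 0;
    if (entry_size <= 0)
        return 0;
    for (const char *p = first; p <= last && n < max_out; p += entry_size) {
        const unsigned char *u = (const unsigned char *)p;
        out[n++] = (u[0] << 8) | u[1];
    }
    return n;
}
```

Each number returned can then be used as an index into ovector, or
passed to pcre_get_substring(), to find which of the duplicates took
part in the match.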


FINDING ALL POSSIBLE MATCHES

The traditional matching function uses a similar algorithm to Perl,
which stops when it finds the first match, starting at a given point in
the subject. If you want to find all possible matches, or the longest
possible match, consider using the alternative matching function (see
below) instead. If you cannot use the alternative function, but still
need to find all possible matches, you can kludge it up by making use
of the callout facility, which is described in the pcrecallout
documentation.

What you have to do is to insert a callout right at the end of the
pattern. When your callout function is called, extract and save the
current matched substring. Then return 1, which forces pcre_exec() to
backtrack and try other alternatives. Ultimately, when it runs out of
matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.


OBTAINING AN ESTIMATE OF STACK USAGE

Matching certain patterns using pcre_exec() can use a lot of process
stack, which in certain environments can be rather limited in size.
Some users find it helpful to have an estimate of the amount of stack
that is used by pcre_exec(), to help them set recursion limits, as
described in the pcrestack documentation. The estimate that is output
by pcretest when called with the -m and -C options is obtained by
calling pcre_exec with the values NULL, NULL, NULL, -999, and -999 for
its first five arguments.

Normally, if its first argument is NULL, pcre_exec() immediately
returns the negative error code PCRE_ERROR_NULL, but with this special
combination of arguments, it returns instead a negative number whose
absolute value is the approximate stack frame size in bytes. (A
negative number is used so that it is clear that no match has
happened.) The value is approximate because in some cases, recursive
calls to pcre_exec() occur when there are one or two additional
variables on the stack.

If PCRE has been compiled to use the heap instead of the stack for
recursion, the value returned is the size of each block that is
obtained from the heap.

Option bits for pcre_dfa_exec()

The unused bits of the options argument for pcre_dfa_exec() must be
zero. The only bits that may be set are PCRE_ANCHORED,
PCRE_NEWLINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_BSR_ANYCRLF,
PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD,
PCRE_PARTIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the
last four of these are exactly the same as for pcre_exec(), so their
description is not repeated here.

PCRE_PARTIAL_HARD
PCRE_PARTIAL_SOFT

These have the same general effect as they do for pcre_exec(), but the
details are slightly different. When PCRE_PARTIAL_HARD is set for
pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of the
subject is reached and there is still at least one matching possibility
that requires additional characters. This happens even if some complete
matches have also been found. When PCRE_PARTIAL_SOFT is set, the return
code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
of the subject is reached, there have been no complete matches, but
there is still at least one matching possibility. The portion of the
string that was inspected when the longest partial match was found is
set as the first matching string in both cases. There is a more
detailed discussion of partial and multi-segment matching, with
examples, in the pcrepartial documentation.

PCRE_DFA_SHORTEST

Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to
stop as soon as it has found one match. Because of the way the
alternative algorithm works, this is necessarily the shortest possible
match at the first possible matching point in the subject string.

PCRE_DFA_RESTART

When pcre_dfa_exec() returns a partial match, it is possible to call it
again, with additional subject characters, and have it continue with
the same match. The PCRE_DFA_RESTART option requests this action; when
it is set, the workspace and wscount arguments must reference the same
vector as before because data about the match so far is left in them
after a partial match. There is more discussion of this facility in the
pcrepartial documentation.

Successful returns from pcre_dfa_exec()

When pcre_dfa_exec() succeeds, it may have matched more than one
substring in the subject. Note, however, that all the matches from one
run of the function start at the same point in the subject. The shorter
matches are all initial substrings of the longer matches. For example,

<something> <something else>
<something> <something else> <something further>

On success, the yield of the function is a number greater than zero,
which is the number of matched substrings. The substrings themselves
are returned in ovector. Each string uses two elements; the first is
the offset to the start, and the second is the offset to the end. In
fact, all the strings have the same start offset. (Space could have
been saved by giving this only once, but it was decided to retain some
compatibility with the way pcre_exec() returns data, even though the
meaning of the strings is different.)

The strings are returned in reverse order of length; that is, the
longest matching string is given first. If there were too many matches
to fit into ovector, the yield of the function is zero, and the vector
is filled with the longest matches. Unlike pcre_exec(), pcre_dfa_exec()
can use the entire ovector for returning matched strings.
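
The layout of ovector described above can be read with a small helper.
The following is only a sketch in plain C that does not call PCRE
itself: the function name dfa_match_copy and its parameters are
illustrative assumptions, rc stands for the value returned by the
matching call, and ovecsize is the number of int elements in ovector.

```c
#include <stddef.h>
#include <string.h>

/* Copy the i-th matched substring (0 = longest) into buf.
   A positive rc is the number of offset pairs. An rc of zero means the
   vector was too small and has been filled completely, so ovecsize / 2
   pairs are valid. Returns the substring length, or -1 if rc is an
   error, i is out of range, or buf is too small.
   (Hypothetical helper, not part of the PCRE API.) */
int dfa_match_copy(const char *subject, const int *ovector,
                   int rc, int ovecsize, int i,
                   char *buf, size_t buflen)
{
    int count, start, end, len;

    if (rc < 0)
        return -1;
    count = (rc == 0) ? ovecsize / 2 : rc;
    if (i < 0 || i >= count)
        return -1;

    start = ovector[2 * i];      /* offset to the start of the match */
    end = ovector[2 * i + 1];    /* offset to the end of the match */
    len = end - start;
    if (len < 0 || (size_t)len + 1 > buflen)
        return -1;

    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

Calling the helper with i running from 0 upwards visits the matches
from longest to shortest, which is the order in which they are stored.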

Error returns from pcre_dfa_exec()

The pcre_dfa_exec() function returns a negative number when it fails.
Many of the errors are the same as for pcre_exec(), and these are
described above. There are in addition the following errors that are
specific to pcre_dfa_exec():

PCRE_ERROR_DFA_UITEM (-16)

This return is given if pcre_dfa_exec() encounters an item in the
pattern that it does not support, for instance, the use of \C or a back
reference.

PCRE_ERROR_DFA_UCOND (-17)

This return is given if pcre_dfa_exec() encounters a condition item
that uses a back reference for the condition, or a test for recursion
in a specific group. These are not supported.

PCRE_ERROR_DFA_UMLIMIT (-18)

This return is given if pcre_dfa_exec() is called with an extra block
that contains a setting of the match_limit or match_limit_recursion
fields. This is not supported (these fields are meaningless for DFA
matching).

PCRE_ERROR_DFA_WSSIZE (-19)

This return is given if pcre_dfa_exec() runs out of space in the
workspace vector.

PCRE_ERROR_DFA_RECURSE (-20)

When a recursive subpattern is processed, the matching function calls
itself recursively, using private vectors for ovector and workspace.
This error is given if the output vector is not large enough. This
should be extremely rare, as a vector of size 1000 is used.

PCRE_ERROR_DFA_BADRESTART (-30)

When pcre_dfa_exec() is called with the PCRE_DFA_RESTART option, some
plausibility checks are made on the contents of the workspace, which
should contain data about the previous partial match. If any of these
checks fail, this error is given.
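
For logging, an application might translate these return values into
short messages. The sketch below is self-contained and uses only the
numbers listed above; the EX_ names and the dfa_error_text function are
illustrative assumptions, and real code would include pcre.h and use
the PCRE_ERROR_DFA_* macros instead.

```c
#include <stddef.h>

/* Local copies of the error numbers given in the text. The EX_ prefix
   marks them as illustrative stand-ins for the macros in pcre.h. */
enum {
    EX_DFA_UITEM      = -16,
    EX_DFA_UCOND      = -17,
    EX_DFA_UMLIMIT    = -18,
    EX_DFA_WSSIZE     = -19,
    EX_DFA_RECURSE    = -20,
    EX_DFA_BADRESTART = -30
};

/* Return a short description for a DFA-specific error code, or NULL
   if rc is not one of the codes listed above.
   (Hypothetical helper, not part of the PCRE API.) */
const char *dfa_error_text(int rc)
{
    switch (rc) {
    case EX_DFA_UITEM:      return "unsupported item in pattern";
    case EX_DFA_UCOND:      return "unsupported condition item";
    case EX_DFA_UMLIMIT:    return "match limit fields set in extra block";
    case EX_DFA_WSSIZE:     return "workspace vector too small";
    case EX_DFA_RECURSE:    return "internal recursion vector too small";
    case EX_DFA_BADRESTART: return "workspace failed restart checks";
    default:                return NULL;
    }
}
```

A NULL result signals that the code is either a success value or one of
the errors shared with pcre_exec(), which are handled elsewhere.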

Those that are not part of an identified script are lumped together as
"Common". The current list of scripts is:

Arabic, Armenian, Avestan, Balinese, Bamum, Batak, Bengali, Bopomofo,
Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Chakma,
Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret,
Devanagari, Egyptian_Hieroglyphs, Ethiopic, Georgian, Glagolitic,
Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew,
Hiragana, Imperial_Aramaic, Inherited, Inscriptional_Pahlavi,
Inscriptional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li,
Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian,
Lydian, Malayalam, Mandaic, Meetei_Mayek, Meroitic_Cursive,
Meroitic_Hieroglyphs, Miao, Mongolian, Myanmar, New_Tai_Lue, Nko,
Ogham, Old_Italic, Old_Persian, Old_South_Arabian, Old_Turkic,
Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic,
Samaritan, Saurashtra, Sharada, Shavian, Sinhala, Sora_Sompeng,
Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham,
Tai_Viet, Takri, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh,
Ugaritic, Vai,

Each character has exactly one Unicode general category property,
specified by a two-letter abbreviation. For compatibility with Perl, nega-

closing square bracket. A closing square bracket on its own is not
special by default. However, if the PCRE_JAVASCRIPT_COMPAT option is
set, a lone closing square bracket causes a compile-time error. If a
closing square bracket is required as a member of the class, it should
be the first data character in the class (after an initial circumflex,
if present) or escaped with a backslash.

A character class matches a single character in the subject. In a UTF
mode, the character may be more than one data unit long. A matched
character must be in the set of characters defined by the class, unless
the first character in the class definition is a circumflex, in which
case the subject character must not be in the set defined by the class.
If a circumflex is actually required as a member of the class, ensure
it is not the first character, or escape it with a backslash.

For example, the character class [aeiou] matches any lower case vowel,
while [^aeiou] matches any character that is not a lower case vowel.
Note that a circumflex is just a convenient notation for specifying the
characters that are in the class by enumerating those that are not. A
class that starts with a circumflex is not an assertion; it still
consumes a character from the subject string, and therefore it fails if
the current pointer is at the end of the string.

In UTF-8 (UTF-16) mode, characters with values greater than 255
(0xffff) can be included in a class as a literal string of data units,
or by using the \x{ escaping mechanism.

When caseless matching is set, any letters in a class represent both
their upper case and lower case versions, so for example, a caseless
[aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
match "A", whereas a caseful version would. In a UTF mode, PCRE always
understands the concept of case for characters whose values are less
than 128, so caseless matching is always possible. For characters with
higher values, the concept of case is supported if PCRE is compiled
with Unicode property support, but not otherwise. If you want to use
caseless matching in a UTF mode for characters 128 and above, you must
ensure that PCRE is compiled with Unicode property support as well as
with UTF support.

Characters that might indicate line breaks are never treated in any
special way when matching character classes, whatever line-ending
sequence is in use, and whatever setting of the PCRE_DOTALL and
PCRE_MULTILINE options is used. A class such as [^a] always matches one
of these characters.

The minus (hyphen) character can be used to specify a range of
characters in a character class. For example, [d-m] matches any letter
between d and m, inclusive. If a minus character is required in a
class, it must be escaped with a backslash or appear in a position
where it cannot be interpreted as indicating a range, typically as the
first or last character in the class.

It is not possible to have the literal character "]" as the end
character of a range. A pattern such as [W-]46] is interpreted as a
class of two characters ("W" and "-") followed by a literal string
"46]", so it would match "W46]" or "-46]". However, if the "]" is
escaped with a backslash it is interpreted as the end of range, so
[W-\]46] is interpreted as a class containing a range followed by two
other characters. The octal or hexadecimal representation of "]" can
also be used to end a range.

Ranges operate in the collating sequence of character values. They can
also be used for characters specified numerically, for example
[\000-\037]. Ranges can include any characters that are valid for the

If a range that includes letters is used when caseless matching is set,
it matches the letters in either case. For example, [W-c] is equivalent
to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if
character tables for a French locale are in use, [\xc8-\xcb] matches
accented E characters in both cases. In UTF modes, PCRE supports the
concept of case for characters with values greater than 128 only when
it is compiled with Unicode property support.

The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
\w, and \W may appear in a character class, and add the characters that
they match to the class. For example, [\dABCDEF] matches any
hexadecimal digit. In UTF modes, the PCRE_UCP option affects the
meanings of \d, \s, \w and their upper case partners, just as it does
when they appear outside a character class, as described in the section
entitled "Generic character types" above. The escape sequence \b has a
different meaning inside a character class; it matches the backspace
character. The sequences \B, \N, \R, and \X are not special inside a
character class. Like any other unrecognized escape sequences, they are
treated as the literal characters "B", "N", "R", and "X" by default,
but cause an error if the PCRE_EXTRA option is set.

A circumflex can conveniently be used with the upper case character
types to specify a more restricted set of characters than the matching
lower case type. For example, the class [^\W_] matches any letter or
digit, but not underscore, whereas [\w] includes underscore. A positive
character class should be read as "something OR something OR ..." and a
negative class as "NOT something AND NOT something AND NOT ...".

The only metacharacters that are recognized in character classes are
backslash, hyphen (only where it can be interpreted as specifying a
range), circumflex (only at the start), opening square bracket (only
when it can be interpreted as introducing a POSIX class name - see the
next section), and the terminating closing square bracket. However,
escaping other non-alphanumeric characters does no harm.


POSIX CHARACTER CLASSES

Perl supports the POSIX notation for character classes. This uses names
enclosed by [: and :] within the enclosing square brackets. PCRE also
supports this notation. For example,

For example, (?im) sets caseless, multiline matching. It is also
possible to unset these options by preceding the letter with a hyphen,
and a combined setting and unsetting such as (?im-sx), which sets
PCRE_CASELESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and
PCRE_EXTENDED, is also permitted. If a letter appears both before and
after the hyphen, the option is unset.
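
The set and unset rule can be modelled in a few lines. This is a
sketch of the rule as stated here, not of PCRE's own parser: the EX_
bit values are arbitrary stand-ins for the real option bits, the name
apply_option_letters is an assumption, and only the four letters i, m,
s and x from the example are handled.

```c
#include <stddef.h>

/* Arbitrary illustrative bit values, not the real PCRE option bits. */
enum {
    EX_CASELESS  = 1,   /* i */
    EX_MULTILINE = 2,   /* m */
    EX_DOTALL    = 4,   /* s */
    EX_EXTENDED  = 8    /* x */
};

/* Apply letters such as "im-sx" to *options. Letters before the hyphen
   are set and letters after it are unset; unsetting is applied last,
   so a letter that appears on both sides ends up unset.
   Returns 0 on success, or -1 on an unknown letter (in which case
   *options is left unchanged).
   (Hypothetical helper, not part of the PCRE API.) */
int apply_option_letters(const char *letters, unsigned int *options)
{
    unsigned int set = 0, unset = 0;
    int after_hyphen = 0;
    const char *p;

    for (p = letters; *p != '\0'; p++) {
        unsigned int bit;
        switch (*p) {
        case '-': after_hyphen = 1; continue;
        case 'i': bit = EX_CASELESS; break;
        case 'm': bit = EX_MULTILINE; break;
        case 's': bit = EX_DOTALL; break;
        case 'x': bit = EX_EXTENDED; break;
        default:  return -1;
        }
        if (after_hyphen)
            unset |= bit;
        else
            set |= bit;
    }
    *options = (*options | set) & ~unset;
    return 0;
}
```

Starting from options in which only the dotall bit is set, the string
"im-sx" leaves exactly the caseless and multiline bits set, and the
string "i-i" leaves the caseless bit clear.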

The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
can be changed in the same way as the Perl-compatible options by using
the characters J, U and X respectively.

When one of these option changes occurs at top level (that is, not
inside subpattern parentheses), the change applies to the remainder of
the pattern that follows. If the change is placed right at the start of
a pattern, PCRE extracts it into the global options (and it will
therefore show up in data extracted by the pcre_fullinfo() function).

An option change within a subpattern (see below for a description of
subpatterns) affects only that part of the subpattern that follows it,

matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
used). By this means, options can be made to have different settings
in different parts of the pattern. Any changes made in one alternative
do carry on into subsequent branches within the same subpattern. For

matches "ab", "aB", "c", and "C", even though when matching "C" the
first branch is abandoned before the option setting. This is because
the effects of option settings happen at compile time. There would be
some very weird behaviour otherwise.

Note: There are other PCRE-specific options that can be set by the
application when the compiling or matching functions are called. In
some cases the pattern can contain special leading sequences such as
(*CRLF) to override what the application has set or what has been
defaulted. Details are given in the section entitled "Newline
sequences" above. There are also the (*UTF8), (*UTF16), and (*UCP)
leading sequences that can be used to set UTF and Unicode property
modes; they are equivalent to setting the PCRE_UTF8, PCRE_UTF16, and
the PCRE_UCP options, respectively.

/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
# 1           2         2  3        2     3     4

A back reference to a numbered subpattern uses the most recent value
that is set for that number by any subpattern. The following pattern
matches "abcabc" or "defdef":

/(?|(abc)|(def))\1/

In contrast, a subroutine call to a numbered subpattern always refers
to the first one in the pattern with the given number. The following
pattern matches "abcabc" or "defabc":

/(?|(abc)|(def))(?1)/

If a condition test for a subpattern's having matched refers to a
non-unique number, the test is true if any of the subpatterns of that
number have matched.

An alternative approach to using this "branch reset" feature is to use
duplicate named subpatterns, as described in the next section.


NAMED SUBPATTERNS

Identifying capturing parentheses by number is simple, but it can be
very hard to keep track of the numbers in complicated regular
expressions. Furthermore, if an expression is modified, the numbers may
change. To help with this difficulty, PCRE supports the naming of
subpatterns. This feature was not added to Perl until release 5.10.
Python had the feature earlier, and PCRE introduced it at release 4.0,
using the Python syntax. PCRE now supports both the Perl and the Python
syntax. Perl allows identically numbered subpatterns to have different
names, but PCRE does not.

In PCRE, a subpattern can be named in one of three ways: (?<name>...)
or (?'name'...) as in Perl, or (?P<name>...) as in Python. References
to capturing parentheses from other parts of the pattern, such as back
references, recursion, and conditions, can be made by name as well as
by number.

Names consist of up to 32 alphanumeric characters and underscores.
Named capturing parentheses are still allocated numbers as well as
names, exactly as if the names were not present. The PCRE API provides
function calls for extracting the name-to-number translation table from
a compiled pattern. There is also a convenience function for extracting
a captured substring by name.
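
An application that builds patterns from user-supplied names may want
to check them against the naming rule before compiling. The sketch
below encodes only the rule stated above (at most 32 characters, each
alphanumeric or an underscore) and additionally assumes that an empty
name is invalid; PCRE itself may apply further restrictions, and the
function name is illustrative.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if name is 1 to 32 characters long and every character is
   alphanumeric or an underscore, otherwise 0. Rejecting the empty
   string is an assumption of this sketch.
   (Hypothetical helper, not part of the PCRE API.) */
int is_plausible_group_name(const char *name)
{
    size_t len = strlen(name);
    size_t i;

    if (len == 0 || len > 32)
        return 0;
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)name[i];
        if (!isalnum(c) && c != '_')
            return 0;
    }
    return 1;
}
```

Because isalnum() depends on the current locale, the result for
characters above 127 can vary; in the default "C" locale only ASCII
letters and digits are accepted.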

By default, a name must be unique within a pattern, but it is possible
to relax this constraint by setting the PCRE_DUPNAMES option at compile
time. (Duplicate names are also always permitted for subpatterns with
the same number, set up as described in the previous section.)
Duplicate names can be useful for patterns where only one instance of
the named parentheses can match. Suppose you want to match the name of
a weekday, either as a 3-letter abbreviation or as the full name, and
in both cases you want to extract the abbreviation. This pattern
(ignoring the line breaks) does the job:


ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
repetition, failure of what follows normally causes the repeated item
to be re-evaluated to see if a different number of repeats allows the
rest of the pattern to match. Sometimes it is useful to prevent this,
either to change the nature of the match, or to cause it to fail
earlier than it otherwise might, when the author of the pattern knows
there is no point in carrying on.

Consider, for example, the pattern \d+foo when applied to the subject

After matching all 6 digits and then failing to match "foo", the normal
action of the matcher is to try again with only 5 digits matching the
\d+ item, and then with 4, and so on, before ultimately failing.
"Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
the means for specifying that once a subpattern has matched, it is not
to be re-evaluated in this way.

If we use atomic grouping for the previous example, the matcher gives
up immediately on failing to match "foo" the first time. The notation
is a kind of special parenthesis, starting with (?> as in this example:

This kind of parenthesis "locks up" the part of the pattern it
contains once it has matched, and a failure further into the pattern is
prevented from backtracking into it. Backtracking past it to previous
items, however, works as normal.

An alternative description is that a subpattern of this type matches
the string of characters that an identical standalone pattern would
match, if anchored at the current point in the subject string.

Atomic grouping subpatterns are not capturing subpatterns. Simple cases
such as the above example can be thought of as a maximizing repeat that
must swallow everything it can. So, while both \d+ and \d+? are
prepared to adjust the number of digits they match in order to make the
rest of the pattern match, (?>\d+) can only match an entire sequence of
digits.

Atomic groups in general can of course contain arbitrarily complicated
subpatterns, and can be nested. However, when the subpattern for an
atomic group is just a single repeated item, as in the example above, a
simpler notation, called a "possessive quantifier" can be used. This
consists of an additional + character following a quantifier. Using
this notation, the previous example can be rewritten as

(?P<p1>(?i)rah)\s+(?P=p1)
(?<p1>(?i)rah)\s+\g{p1}

A subpattern that is referenced by name may appear in the pattern
before or after the reference.

There may be more than one back reference to the same subpattern. If a
subpattern has not actually been used in a particular match, any back
references to it always fail by default. For example, the pattern

always fails if it starts to match "a" rather than "bc". However, if
the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer-
ence to an unset value matches an empty string.

Because there may be many capturing parentheses in a pattern, all dig-
its following a backslash are taken as part of a potential back refer-
ence number. If the pattern continues with a digit character, some
delimiter must be used to terminate the back reference. If the
PCRE_EXTENDED option is set, this can be white space. Otherwise, the
\g{ syntax or an empty comment (see "Comments" below) can be used.

Recursive back references

A back reference that occurs inside the parentheses to which it refers
fails when the subpattern is first used, so, for example, (a\1) never
matches. However, such references can be useful inside repeated sub-
patterns. For example, the pattern

matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
ation of the subpattern, the back reference matches the character
string corresponding to the previous iteration. In order for this to
work, the pattern must be such that the first iteration does not need
to match the back reference. This can be done using alternation, as in
the example above, or by a quantifier with a minimum of zero.

Back references of this type cause the group that they reference to be
treated as an atomic group. Once the whole group has been matched, a
subsequent matching failure cannot cause backtracking into the middle
of the group.

An assertion is a test on the characters following or preceding the
current matching point that does not actually consume any characters.
The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
described above.

More complicated assertions are coded as subpatterns. There are two
kinds: those that look ahead of the current position in the subject
string, and those that look behind it. An assertion subpattern is
matched in the normal way, except that it does not cause the current
matching position to be changed.

Assertion subpatterns are not capturing subpatterns. If such an asser-
tion contains capturing subpatterns within it, these are counted for
the purposes of numbering the capturing subpatterns in the whole pat-
tern. However, substring capturing is carried out only for positive
assertions, because it does not make sense for negative assertions.

For compatibility with Perl, assertion subpatterns may be repeated;
though it makes no sense to assert the same thing several times, the
side effect of capturing parentheses may occasionally be useful. In
practice, there are only three cases:

(1) If the quantifier is {0}, the assertion is never obeyed during
matching. However, it may contain internal capturing parenthesized
groups that are called from elsewhere via the subroutine mechanism.

(2) If the quantifier is {0,n} where n is greater than zero, it is
treated as if it were {0,1}. At run time, the rest of the pattern match
is tried with and without the assertion, the order depending on the
greediness of the quantifier.

(3) If the minimum repetition is greater than zero, the quantifier is
ignored. The assertion is obeyed just once when encountered during
matching.

Lookahead assertions

(?<!dogs?|cats?)

causes an error at compile time. Branches that match different length
strings are permitted only at the top level of a lookbehind assertion.
This is an extension compared with Perl, which requires all branches to
match the same length of string. An assertion such as

is not permitted, because its single top-level branch can match two
different lengths, but it is acceptable to PCRE if rewritten to use two
top-level branches:

In some cases, the escape sequence \K (see above) can be used instead
of a lookbehind assertion to get round the fixed-length restriction.

The implementation of lookbehind assertions is, for each alternative,
to temporarily move the current position back by the fixed length and
then try to match. If there are insufficient characters before the cur-
rent position, the assertion fails.

In a UTF mode, PCRE does not allow the \C escape (which matches a sin-
gle data unit even in a UTF mode) to appear in lookbehind assertions,
because it makes it impossible to calculate the length of the lookbe-
hind. The \X and \R escapes, which can match different numbers of data
units, are also not permitted.

"Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
lookbehinds, as long as the subpattern matches a fixed-length string.
Recursion, however, is not supported.

Possessive quantifiers can be used in conjunction with lookbehind
assertions to specify efficient matching of fixed-length strings at the
end of subject strings. Consider a simple pattern such as

when applied to a long string that does not match. Because matching
proceeds from left to right, PCRE will look for each "a" in the subject
and then see if what follows matches the rest of the pattern. If the
pattern is specified as

the initial .* matches the entire string at first, but when this fails
(because there is no following "a"), it backtracks to match all but the
last character, then all but the last two characters, and so on. Once
again the search for "a" covers the entire string, from right to left,
so we are no better off. However, if the pattern is written as

there can be no backtracking for the .*+ item; it can match only the
entire string. The subsequent lookbehind assertion does a single test
on the last four characters. If it fails, the match fails immediately.
For long strings, this approach makes a significant difference to the
processing time.

Using multiple assertions

(?(1) (A|B|C) | (D | (?(2)E|F) | E) )

There are four kinds of condition: references to subpatterns, refer-
ences to recursion, a pseudo-condition called DEFINE, and assertions.

Checking for a used subpattern by number

If the text between the parentheses consists of a sequence of digits,
the condition is true if a capturing subpattern of that number has pre-
viously matched. If there is more than one capturing subpattern with
the same number (see the earlier section about duplicate subpattern
numbers), the condition is true if any of them have matched. An alter-
native notation is to precede the digits with a plus or minus sign. In
this case, the subpattern number is relative rather than absolute. The
most recently opened parentheses can be referenced by (?(-1), the next
most recent by (?(-2), and so on. Inside loops it can also make sense
to refer to subsequent groups. The next parentheses to be opened can be
referenced as (?(+1), and so on. (The value zero in any of these forms
is not used; it provokes a compile-time error.)

Consider the following pattern, which contains non-significant white
space to make it more readable (assume the PCRE_EXTENDED option) and to
divide it into three parts for ease of discussion:

( \( )? [^()]+ (?(1) \) )

The first part matches an optional opening parenthesis, and if that
character is present, sets it as the first captured substring. The sec-
ond part matches one or more characters that are not parentheses. The
third part is a conditional subpattern that tests whether or not the
first set of parentheses matched. If they did, that is, if the subject
started with an opening parenthesis, the condition is true, and so the
yes-pattern is executed and a closing parenthesis is required. Other-
wise, since no-pattern is not present, the subpattern matches nothing.
In other words, this pattern matches a sequence of non-parentheses,
optionally enclosed in parentheses.
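
As a rough illustration only, the pattern can be tried out with
Python's re module, which accepts the same (?(1)...) syntax. Using re
as a stand-in for PCRE is an assumption of this sketch, and the helper
name is_optionally_parenthesized is hypothetical:

```python
import re

# Python's re module is used as a stand-in for PCRE here (an assumption:
# the two engines are expected to agree on this particular construct).
# re.VERBOSE plays the role of PCRE_EXTENDED.
OPTIONAL_PARENS = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)


def is_optionally_parenthesized(text):
    """Return True if the whole of text matches the pattern."""
    return OPTIONAL_PARENS.fullmatch(text) is not None
```

With this helper, "abc" and "(abc)" are accepted, while "(abc" and
"abc)" are rejected because the closing parenthesis is required exactly
when the opening one was captured.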

If you were embedding this pattern in a larger one, you could use a
relative reference:

...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...

This makes the fragment independent of the parentheses in the larger
pattern.

Checking for a used subpattern by name

Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
used subpattern by name. For compatibility with earlier versions of
PCRE, which had this facility before Perl, the syntax (?(name)...) is
also recognized. However, there is a possible ambiguity with this syn-
tax, because subpattern names may consist entirely of digits. PCRE
looks first for a named subpattern; if it cannot find one and the name
consists entirely of digits, PCRE looks for a subpattern of that num-
ber, which must be greater than zero. Using subpattern names that con-
sist entirely of digits is not recommended.

Rewriting the above example to use a named subpattern gives this:

(?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )

If the name used in a condition of this kind is a duplicate, the test
is applied to all subpatterns of the same name, and is true if any one
of them has matched.

Checking for pattern recursion

If the condition is the string (R), and there is no subpattern with the
name R, the condition is true if a recursive call to the whole pattern
or any subpattern has been made. If digits or a name preceded by amper-
sand follow the letter R, for example:

the condition is true if the most recent recursion is into a subpattern
whose number or name is given. This condition does not check the entire
recursion stack. If the name used in a condition of this kind is a
duplicate, the test is applied to all subpatterns of the same name, and
is true if any one of them is the most recent recursion.

At "top level", all these recursion test conditions are false. The
syntax for recursive patterns is described below.

Defining subpatterns for use by reference only

If the condition is the string (DEFINE), and there is no subpattern
with the name DEFINE, the condition is always false. In this case,
there may be only one alternative in the subpattern. It is always
skipped if control reaches this point in the pattern; the idea of
DEFINE is that it can be used to define subroutines that can be refer-
enced from elsewhere. (The use of subroutines is described below.) For
example, a pattern to match an IPv4 address such as "192.168.23.245"
could be written like this (ignore white space and line breaks):

(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
\b (?&byte) (\.(?&byte)){3} \b

The first part of the pattern is a DEFINE group inside which another
group named "byte" is defined. This matches an individual component of
an IPv4 address (a number less than 256). When matching takes place,
this part of the pattern is skipped because DEFINE acts like a false
condition. The rest of the pattern uses references to the named group
to match the four dot-separated components of an IPv4 address, insist-
ing on a word boundary at each end.
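
The effect of the IPv4 pattern can be approximated in Python, whose re
module has neither (?(DEFINE)...) nor (?&name). The sketch below
therefore substitutes the "byte" subpattern textually, which is a
different technique from PCRE's subroutine calls and is shown only to
make the matching behaviour concrete. The names BYTE, IPV4 and
find_ipv4 are hypothetical, and non-capturing groups are used where the
original captures:

```python
import re

# Emulate the DEFINE'd "byte" subroutine by pasting its text into the
# main pattern. This is an illustrative assumption, not how PCRE
# implements DEFINE.
BYTE = r"(?: 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d )"

# re.VERBOSE ignores the white space; re.ASCII keeps \d and \b ASCII-only.
IPV4 = re.compile(rf"\b {BYTE} (?: \. {BYTE} ){{3}} \b",
                  re.VERBOSE | re.ASCII)


def find_ipv4(text):
    """Return the first IPv4-looking substring of text, or None."""
    match = IPV4.search(text)
    return match.group(0) if match else None
```

For example, find_ipv4("host 192.168.23.245 up") picks out the address,
whereas a string such as "999.1.1.1" yields None because no component
sequence of four valid bytes sits between word boundaries.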

Assertion conditions

If the condition is not in any of the above formats, it must be an
assertion. This may be a positive or negative lookahead or lookbehind
assertion. Consider this pattern, again containing non-significant
white space, and with the two alternatives on the second line:

(?(?=[^a-z]*[a-z])
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )

The condition is a positive lookahead assertion that matches an
optional sequence of non-letters followed by a letter. In other words,
it tests for the presence of at least one letter in the subject. If a
letter is found, the subject is matched against the first alternative;
otherwise it is matched against the second. This pattern matches
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
letters and dd are digits.
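
Python's re module does not accept an assertion as a condition, but the
same selection logic can be sketched by guarding each alternative with
the lookahead or its negation. This rewriting is an assumption made for
illustration (it is not PCRE syntax), and the names HAS_LETTER and
DATE_LIKE are hypothetical:

```python
import re

HAS_LETTER = r"[^a-z]*[a-z]"

# Emulate (?(?=...) yes | no) as (?=...)yes | (?!...)no.
DATE_LIKE = re.compile(
    rf"""
    (?: (?={HAS_LETTER}) \d{{2}}-[a-z]{{3}}-\d{{2}}   # a letter lies ahead
      | (?!{HAS_LETTER}) \d{{2}}-\d{{2}}-\d{{2}}      # no letter lies ahead
    )
    """,
    re.VERBOSE | re.ASCII,
)
```

Under these assumptions DATE_LIKE.fullmatch accepts "12-abc-34" and
"12-34-56", and rejects "12-34-5a", where the letter selects the first
alternative and that alternative then fails.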

refers recursively to the pattern in which it appears.

Obviously, PCRE cannot support the interpolation of Perl code. Instead,
it supports special syntax for recursion of the entire pattern, and
also for individual subpattern recursion. After its introduction in
PCRE and Python, this kind of recursion was subsequently introduced
into Perl at release 5.10.

A special item that consists of (? followed by a number greater than
zero and a closing parenthesis is a recursive subroutine call of the
subpattern of the given number, provided that it occurs inside that
subpattern. (If not, it is a non-recursive subroutine call, which is
described in the next section.) The special item (?R) or (?0) is a
recursive call of the entire regular expression.

This PCRE pattern solves the nested parentheses problem (assume the
PCRE_EXTENDED option is set so that white space is ignored):

\( ( [^()]++ | (?R) )* \)

First it matches an opening parenthesis. Then it matches any number of
substrings which can either be a sequence of non-parentheses, or a
recursive match of the pattern itself (that is, a correctly parenthe-
sized substring). Finally there is a closing parenthesis. Note the use
of a possessive quantifier to avoid backtracking into sequences of non-
parentheses.

If this were part of a larger pattern, you would not want to recurse
the entire pattern, so instead you could use this:

( \( ( [^()]++ | (?1) )* \) )

We have put the pattern into parentheses, and caused the recursion to
refer to them instead of the whole pattern.
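
To make the recursion concrete, here is a hand-written Python function
that mirrors the structure of the nested-parentheses pattern. It is a
sketch under stated assumptions, not PCRE's implementation: Python's
built-in re module has no (?R) or (?1), so plain recursion is used
instead, and the name match_parens is hypothetical:

```python
def match_parens(subject, start=0):
    """Return the index just past a balanced group starting at start.

    Mirrors the pattern  ( [^()]++ | (?R) )*  wrapped in literal
    parentheses. Returns None when no balanced group begins at
    subject[start].
    """
    if start >= len(subject) or subject[start] != "(":
        return None
    pos = start + 1
    while pos < len(subject):
        char = subject[pos]
        if char == ")":
            return pos + 1              # the final closing parenthesis
        if char == "(":
            inner_end = match_parens(subject, pos)   # the recursive call
            if inner_end is None:
                return None
            pos = inner_end
        else:
            pos += 1                    # part of a run of non-parentheses
    return None                         # subject ended before the ")"
```

Because each run of non-parentheses is consumed once and never
revisited, the function fails quickly on unbalanced input, which is the
same property the possessive quantifier gives the pattern.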

In a larger pattern, keeping track of parenthesis numbers can be
tricky. This is made easier by the use of relative references. Instead
of (?1) in the pattern above you can write (?-2) to refer to the second
most recently opened parentheses preceding the recursion. In other
words, a negative number counts capturing parentheses leftwards from
the point at which it is encountered.

It is also possible to refer to subsequently opened parentheses, by
writing references such as (?+2). However, these cannot be recursive
because the reference is not inside the parentheses that are refer-
enced. They are always non-recursive subroutine calls, as described in
the next section.

An alternative approach is to use named parentheses instead. The Perl
syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also
supported. We could rewrite the above example as follows:

(?<pn> \( ( [^()]++ | (?&pn) )* \) )

If there is more than one subpattern with the same name, the earliest
one is used.

This particular example pattern that we have been looking at contains
nested unlimited repeats, and so the use of a possessive quantifier for
matching strings of non-parentheses is important when applying the pat-
tern to strings that do not match. For example, when this pattern is
applied to

(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

it yields "no match" quickly. However, if a possessive quantifier is
not used, the match runs for a very long time indeed because there are
so many different ways the + and * repeats can carve up the subject,
and all have to be tested before failure can be reported.

At the end of a match, the values of capturing parentheses are those
from the outermost level. If you want to obtain intermediate values, a
callout function can be used (see below and the pcrecallout documenta-
tion). If the pattern above is matched against

the value for the inner capturing parentheses (numbered 2) is "ef",
which is the last value taken on at the top level. If a capturing sub-
pattern is not matched at the top level, its final captured value is
unset, even if it was (temporarily) set at a deeper level during the
matching process.

If there are more than 15 capturing parentheses in a pattern, PCRE has
to obtain extra memory to store data during a recursion, which it does
by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.

Do not confuse the (?R) item with the condition (R), which tests for
recursion. Consider this pattern, which matches text in angle brack-
ets, allowing for arbitrary nesting. Only digits are allowed in nested
brackets (that is, when recursing), whereas any characters are permit-
ted at the outer level.

< (?: (?(R) \d++ | [^<>]*+) | (?R)) * >

In this pattern, (?(R) is the start of a conditional subpattern, with
two different alternatives for the recursive and non-recursive cases.
The (?R) item is the actual recursive call.
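
The distinction between the (?R) call and the (R) test can be sketched
in Python with an explicit flag. This is an illustrative assumption
rather than PCRE's mechanism: the hypothetical function match_angle
calls itself where the pattern has (?R), and its recursing parameter
plays the part of the (R) condition:

```python
ASCII_DIGITS = "0123456789"


def match_angle(subject, start=0, recursing=False):
    """Return the index just past a bracketed group starting at start.

    Any characters other than angle brackets are allowed at the outer
    level; only digits are allowed once recursing is True. Returns None
    when no valid group begins at subject[start].
    """
    if start >= len(subject) or subject[start] != "<":
        return None
    pos = start + 1
    while pos < len(subject):
        char = subject[pos]
        if char == ">":
            return pos + 1
        if char == "<":
            # Corresponds to (?R): the nested call runs with recursing=True.
            inner_end = match_angle(subject, pos, recursing=True)
            if inner_end is None:
                return None
            pos = inner_end
        elif recursing and char not in ASCII_DIGITS:
            return None         # the (R) branch permits digits only
        else:
            pos += 1
    return None
```

Under these assumptions "<abc<123>hij>" is accepted in full, while
"<abc<1x3>hij>" is rejected because of the letter inside the nested
brackets.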
5959
5997
Differences in recursion processing between PCRE and Perl

Recursion processing in PCRE differs from Perl in two important ways.
In PCRE (like Python, but unlike Perl), a recursive subpattern call is
always treated as an atomic group. That is, once it has matched some of
the subject string, it is never re-entered, even if it contains untried
alternatives and there is a subsequent matching failure. This can be
illustrated by the following pattern, which purports to match a
palindromic string that contains an odd number of characters (for
example, "a", "aba", "abcba", "abcdcba"):

  ^(.|(.)(?1)\2)$

The idea is that it either matches a single character, or two identical
characters surrounding a sub-palindrome. In Perl, this pattern works;
in PCRE it does not if the subject string is longer than three
characters. Consider the subject string "abcba":

At the top level, the first character is matched, but as it is not at
the end of the string, the first alternative fails; the second
alternative is taken and the recursion kicks in. The recursive call to
subpattern 1 successfully matches the next character ("b"). (Note that
the beginning and end of line tests are not part of the recursion).

Back at the top level, the next character ("c") is compared with what
subpattern 2 matched, which was "a". This fails. Because the recursion
is treated as an atomic group, there are now no backtracking points,
and so the entire match fails. (Perl is able, at this point, to
re-enter the recursion and try the second alternative.) However, if the
pattern is written with the alternatives in the other order, things are
different:

  ^((.)(?1)\2|.)$

This time, the recursing alternative is tried first, and continues to
recurse until it runs out of characters, at which point the recursion
fails. But this time we do have another alternative to try at the
higher level. That is the big difference: in the previous case the
remaining alternative is at a deeper recursion level, which PCRE cannot
use.

To change the pattern so that it matches all palindromic strings, not
just those with an odd number of characters, it is tempting to change
the pattern to this:

  ^((.)(?1)\2|.?)$

Again, this works in Perl, but not in PCRE, and for the same reason.
When a deeper recursion has matched a single character, it cannot be
entered again in order to match an empty string. The solution is to
separate the two cases, and write out the odd and even cases as
alternatives at the higher level:

  ^(?:((.)(?1)\2|)|((.)(?3)\4|.))

If you want to match typical palindromic phrases, the pattern has to
ignore all non-word characters, which can be done like this:

  ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$

If run with the PCRE_CASELESS option, this pattern matches phrases such
as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
Perl. Note the use of the possessive quantifier *+ to avoid
backtracking into sequences of non-word characters. Without this, PCRE
takes a great deal longer (ten times or more) to match typical phrases,
and Perl takes so long that you think it has gone into a loop.
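
As a cross-check on what the phrase pattern is intended to accept, here
is a minimal sketch in plain C that does not use PCRE at all. It assumes
ASCII input and treats letters, digits, and underscore as word
characters. The names is_word_char and is_palindromic_phrase are
invented for this illustration.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A word character in the ASCII sense of \w: letter, digit, underscore. */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Returns 1 if the word characters of s read the same in both
   directions when case is ignored, otherwise 0. */
int is_palindromic_phrase(const char *s)
{
    size_t i = 0;            /* index of the next character from the left */
    size_t j = strlen(s);    /* one past the next character from the right */

    while (i < j) {
        unsigned char left = (unsigned char)s[i];
        unsigned char right = (unsigned char)s[j - 1];

        if (!is_word_char(left)) { i++; continue; }    /* skip non-word */
        if (!is_word_char(right)) { j--; continue; }   /* skip non-word */
        if (tolower(left) != tolower(right)) return 0;
        i++;
        j--;
    }
    return 1;
}
```

A function like this can serve as an oracle when testing the regular
expression against sample phrases.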

WARNING: The palindrome-matching patterns above work only if the
subject string does not start with a palindrome that is shorter than
the entire string. For example, although "abcba" is correctly matched,
if the subject is "ababa", PCRE finds the palindrome "aba" at the
start, then fails at top level because the end of the string does not
follow. Once again, it cannot jump back into the recursion to try other
alternatives, so the entire match fails.
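
The effect of atomic recursion on the two odd-length patterns can be
modelled in a few lines of C. This is only a hand-written sketch of the
matching order described in this section, not PCRE's implementation.
All function names are hypothetical. Each "atomic" helper returns only
the first way in which the recursive call can match, which is what
makes the call atomic.

```c
#include <stddef.h>
#include <string.h>

/* Recursive call (?1) for ^(.|(.)(?1)\2)$ at offset pos. The first
   alternative (one character) succeeds whenever a character is
   available, and an atomic call never offers anything else.
   Returns the end offset of the match, or -1 for no match. */
static long single_first_atomic(size_t len, size_t pos)
{
    return pos < len ? (long)pos + 1 : -1;
}

/* Top level of ^(.|(.)(?1)\2)$ */
int match_single_first(const char *s)
{
    size_t len = strlen(s);
    long end;

    if (len == 1) return 1;                /* first alternative, then $ */
    if (len == 0) return 0;
    end = single_first_atomic(len, 1);     /* (.) took s[0]; now (?1) */
    return end >= 0 && (size_t)end < len   /* a character is left for \2 */
        && s[end] == s[0]                  /* \2 */
        && (size_t)end + 1 == len;         /* $ */
}

/* Recursive call (?1) for ^((.)(?1)\2|.)$ at offset pos. */
static long recursion_first_atomic(const char *s, size_t len, size_t pos)
{
    long end;

    if (pos >= len) return -1;             /* both alternatives need a character */

    /* First alternative: (.)(?1)\2 */
    end = recursion_first_atomic(s, len, pos + 1);
    if (end >= 0 && (size_t)end < len && s[end] == s[pos])
        return end + 1;

    /* Second alternative: a single character */
    return (long)pos + 1;
}

/* Top level of ^((.)(?1)\2|.)$ */
int match_recursion_first(const char *s)
{
    size_t len = strlen(s);
    long end;

    if (len == 0) return 0;
    end = recursion_first_atomic(s, len, 1);
    if (end >= 0 && (size_t)end < len && s[end] == s[0]
        && (size_t)end + 1 == len)
        return 1;                          /* first alternative, then $ */
    return len == 1;                       /* second alternative, then $ */
}
```

In this model, match_single_first() rejects "abcba" while
match_recursion_first() accepts it, and both reject "ababa" because the
inner call settles on "aba" and cannot be re-entered.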

The second way in which PCRE and Perl differ in their recursion
processing is in the handling of captured values. In Perl, when a
subpattern is called recursively or as a subroutine (see the next
section), it has no access to any values that were captured outside the
recursion, whereas in PCRE these values can be referenced. Consider
this pattern:

  ^(.)(\1|a(?2))

In PCRE, this pattern matches "bab". The first capturing parentheses
match "b", then in the second group, when the back reference \1 fails
to match "b", the second alternative matches "a" and then recurses. In
the recursion, \1 does now match "b" and so the whole match succeeds.
In Perl, the pattern fails to match because inside the recursive call
\1 cannot access the externally set value.

SUBPATTERNS AS SUBROUTINES

If the syntax for a recursive subpattern call (either by number or by
name) is used outside the parentheses to which it refers, it operates
like a subroutine in a programming language. The called subpattern may
be defined before or after the reference. A numbered reference can be
absolute or relative, as in these examples:

  (...(absolute)...)...(?2)...
  (...(relative)...)...(?-1)...
  (...(?+1)...(relative)...

An earlier example pointed out that the pattern

  (sens|respons)e and \1ibility

matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If instead the pattern

  (sens|respons)e and (?1)ibility

is used, it does match "sense and responsibility" as well as the other
two strings. Another example is given in the discussion of DEFINE
above.

All subroutine calls, whether recursive or not, are always treated as
atomic groups. That is, once a subroutine has matched some of the
subject string, it is never re-entered, even if it contains untried
alternatives and there is a subsequent matching failure. Any capturing
parentheses that are set during the subroutine call revert to their
previous values afterwards.

Processing options such as case-independence are fixed when a
subpattern is defined, so if it is used as a subroutine, such options
cannot be changed for different calls. For example, consider this
pattern:

  (abc)(?i:(?-1))

It matches "abcabc". It does not match "abcABC" because the change of
processing option does not affect the called subpattern.

ONIGURUMA SUBROUTINE SYNTAX

For compatibility with Oniguruma, the non-Perl syntax \g followed by a
name or a number enclosed either in angle brackets or single quotes is
an alternative syntax for referencing a subpattern as a subroutine,
possibly recursively. Here are two of the examples used above,
rewritten using this syntax:

  (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
  (sens|respons)e and \g'1'ibility

PCRE supports an extension to Oniguruma: if a number is preceded by a
plus or a minus sign it is taken as a relative reference. For example:

  (abc)(?i:\g<-1>)

Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
synonymous. The former is a back reference; the latter is a subroutine
call.


CALLOUTS

Perl has a feature whereby using the sequence (?{...}) causes arbitrary
Perl code to be obeyed in the middle of matching a regular expression.
This makes it possible, amongst other things, to extract different
substrings that match the same pair of parentheses when there is a
repetition.

PCRE provides a similar feature, but of course it cannot obey arbitrary
Perl code. The feature is called "callout". The caller of PCRE provides
an external function by putting its entry point in the global variable
pcre_callout (8-bit library) or pcre16_callout (16-bit library). By
default, this variable contains NULL, which disables all calling out.

Within a regular expression, (?C) indicates the points at which the
external function is to be called. If you want to identify different
callout points, you can put a number less than 256 after the letter C.
The default value is zero. For example, this pattern has two callout
points:

  (?C1)abc(?C2)def

If the PCRE_AUTO_CALLOUT flag is passed to a compiling function,
callouts are automatically installed before each item in the pattern.
They are all numbered 255.

During matching, when PCRE reaches a callout point, the external
function is called. It is provided with the number of the callout, the
position in the pattern, and, optionally, one item of data originally
supplied by the caller of the matching function. The callout function
may cause matching to proceed, to backtrack, or to fail altogether. A
complete description of the interface to the callout function is given
in the pcrecallout documentation.

BACKTRACKING CONTROL

Perl 5.10 introduced a number of "Special Backtracking Control Verbs",
which are described in the Perl documentation as "experimental and
subject to change or removal in a future version of Perl". It goes on
to say: "Their usage in production code should be noted to avoid
problems during upgrades." The same remarks apply to the PCRE features
described in this section.

Since these verbs are specifically related to backtracking, most of
them can be used only when the pattern is to be matched using one of
the traditional matching functions, which use a backtracking algorithm.
With the exception of (*FAIL), which behaves like a failing negative
assertion, they cause an error if encountered by a DFA matching
function.

If any of these verbs are used in an assertion or in a subpattern that
is called as a subroutine (whether or not recursively), their effect is
confined to that subpattern; it does not extend to the surrounding
pattern, with one exception: the name from a (*MARK), (*PRUNE), or
(*THEN) that is encountered in a successful positive assertion is
passed back when a match succeeds (compare capturing parentheses in
assertions). Note that such subpatterns are processed as anchored at
the point where they are tested. Note also that Perl's treatment of
subroutines and assertions is different in some cases.

The new verbs make use of what was previously invalid syntax: an
opening parenthesis followed by an asterisk. They are generally of the
form (*VERB) or (*VERB:NAME). Some may take either form, with differing
behaviour, depending on whether or not an argument is present. A name
is any sequence of characters that does not include a closing
parenthesis. The maximum length of a name is 255 in the 8-bit library
and 65535 in the 16-bit library. If the name is empty, that is, if the
closing parenthesis immediately follows the colon, the effect is as if
the colon were not there. Any number of these verbs may occur in a
pattern.

Optimizations that affect backtracking verbs

PCRE contains some optimizations that are used to speed up matching by
running some checks at the start of each match attempt. For example, it

SCRIPT NAMES FOR \p AND \P

Arabic, Armenian, Avestan, Balinese, Bamum, Batak, Bengali, Bopomofo,
Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Chakma,
Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret,
Devanagari, Egyptian_Hieroglyphs, Ethiopic, Georgian, Glagolitic,
Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew,
Hiragana, Imperial_Aramaic, Inherited, Inscriptional_Pahlavi,
Inscriptional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li,
Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian,
Lydian, Malayalam, Mandaic, Meetei_Mayek, Meroitic_Cursive,
Meroitic_Hieroglyphs, Miao, Mongolian, Myanmar, New_Tai_Lue, Nko,
Ogham, Old_Italic, Old_Persian, Old_South_Arabian, Old_Turkic,
Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic,
Samaritan, Saurashtra, Sharada, Shavian, Sinhala, Sora_Sompeng,
Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham,
Tai_Viet, Takri, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh,
Ugaritic, Vai,

CHARACTER CLASSES

When you set the PCRE_UTF8 flag, the byte strings passed as patterns
and subjects are (by default) checked for validity on entry to the
relevant functions. The entire string is checked before any other
processing takes place. From release 7.3 of PCRE, the check is
according to the rules of RFC 3629, which are themselves derived from
the Unicode specification. Earlier releases of PCRE followed the rules
of RFC 2279, which allows the full range of 31-bit values (0 to
0x7FFFFFFF). The current check allows only values in the range U+0 to
U+10FFFF, excluding U+D800 to U+DFFF.
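
The rules of the current check can be sketched as a stand-alone C
function. This is an illustration written for this text, not the code
that PCRE uses, and the name first_invalid_utf8 is hypothetical. It
rejects sequences that are truncated or overlong, lead bytes that exist
only in RFC 2279, values above U+10FFFF, and values in the range U+D800
to U+DFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the offset of the first byte of the first invalid character
   in s (len bytes), or -1 if the whole string is valid UTF-8 under the
   RFC 3629 rules. */
long first_invalid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = s[i];
        size_t extra;      /* number of continuation bytes expected */
        uint32_t cp;       /* code point being assembled */
        uint32_t min;      /* smallest value allowed for this length */

        if (b < 0x80) { i++; continue; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else return (long)i;   /* stray continuation or 5/6-byte lead byte */

        if (len - i <= extra) return (long)i;          /* truncated */

        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return (long)i;
            cp = (cp << 6) | (uint32_t)(s[i + k] & 0x3F);
        }

        if (cp < min) return (long)i;                  /* overlong form */
        if (cp > 0x10FFFF) return (long)i;             /* beyond Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) return (long)i;  /* surrogate */

        i += extra + 1;
    }
    return -1;
}
```

Running such a check once in the application is one way to be sure that
a string is valid before it is handed on.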

The excluded code points are the "Surrogate Area" of Unicode. They are
reserved for use by UTF-16, where they are used in pairs to encode
codepoints with values greater than 0xFFFF. The code points that are
encoded by UTF-16 pairs are available independently in the UTF-8
encoding. (In other words, the whole surrogate thing is a fudge for
UTF-16 which unfortunately messes up UTF-8.)
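
The pairing arithmetic is short enough to show directly. The following
sketch uses the standard UTF-16 formula; the function names
to_surrogate_pair and from_surrogate_pair are made up for this example
and are not part of PCRE.

```c
#include <stdint.h>

/* Splits a code point in the range 0x10000 to 0x10FFFF into a UTF-16
   surrogate pair. Returns 0 on success, or -1 if cp is outside that
   range and therefore needs no pair (or is not a valid code point). */
int to_surrogate_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    uint32_t v;

    if (cp < 0x10000 || cp > 0x10FFFF) return -1;
    v = cp - 0x10000;                           /* 20 significant bits */
    *high = (uint16_t)(0xD800 + (v >> 10));     /* top 10 bits */
    *low = (uint16_t)(0xDC00 + (v & 0x3FF));    /* bottom 10 bits */
    return 0;
}

/* Combines a high surrogate (0xD800 to 0xDBFF) and a low surrogate
   (0xDC00 to 0xDFFF) into the code point they encode. The caller is
   assumed to have checked the ranges. */
uint32_t from_surrogate_pair(uint16_t high, uint16_t low)
{
    return 0x10000 + ((((uint32_t)high - 0xD800) << 10)
                      | ((uint32_t)low - 0xDC00));
}
```

For example, U+1F600 becomes the pair 0xD83D 0xDE00, whereas in UTF-8
the same code point is a single four-byte character.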

If an invalid UTF-8 string is passed to PCRE, an error return is given.
At compile time, the only additional information is the offset to the
first byte of the failing character. The run-time functions pcre_exec()
and pcre_dfa_exec() also pass back this information, as well as a more
detailed reason code if the caller has provided memory in which to do
this.

In some situations, you may already know that your strings are valid,
and therefore want to skip these checks in order to improve
performance, for example in the case of a long subject string that is
being scanned repeatedly with different patterns. If you set the
PCRE_NO_UTF8_CHECK flag at compile time or at run time, PCRE assumes
that the pattern or subject it is given (respectively) contains only
valid UTF-8 codes. In this case, it does not diagnose an invalid UTF-8
string.

If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set,
what happens depends on why the string is invalid. If the string
conforms to the "old" definition of UTF-8 (RFC 2279), it is processed
as a string of characters in the range 0 to 0x7FFFFFFF by
pcre_dfa_exec() and the interpreted version of pcre_exec(). In other
words, apart from the initial validity test, these functions (when in
UTF-8 mode) handle strings according to the more liberal rules of RFC
2279. However, the just-in-time (JIT) optimization for pcre_exec()
supports only RFC 3629. If you are using JIT optimization, or if the
string does not even conform to RFC 2279, the result is undefined. Your
program may crash.

If you want to process strings of values in the full range 0 to
0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you can
set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in
this situation, you will have to apply your own validity check, and
avoid the use of JIT optimization.

Validity of UTF-16 strings

When you set the PCRE_UTF16 flag, the strings of 16-bit data units that
are passed as patterns and subjects are (by default) checked for
validity on entry to the relevant functions. Values other than those in
the surrogate range U+D800 to U+DFFF are independent code points.
Values in the surrogate range must be used in pairs in the correct
manner.

If an invalid UTF-16 string is passed to PCRE, an error return is
given. At compile time, the only additional information is the offset
to the first data unit of the failing character. The run-time functions
pcre16_exec() and pcre16_dfa_exec() also pass back this information, as
well as a more detailed reason code if the caller has provided memory
in which to do this.
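
The pairing rule can be sketched in C as follows. This is an
illustrative check written for this text, not PCRE's own code, and the
name first_invalid_utf16 is hypothetical. It takes "used in pairs in
the correct manner" to mean that a high surrogate (0xD800 to 0xDBFF)
must be followed immediately by a low surrogate (0xDC00 to 0xDFFF), and
that a low surrogate may not appear anywhere else.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the offset, in 16-bit data units, of the first invalid
   character in s (len units), or -1 if the string is valid UTF-16. */
long first_invalid_utf16(const uint16_t *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        uint16_t u = s[i];

        if (u >= 0xD800 && u <= 0xDBFF) {
            /* High surrogate: a low surrogate must follow. */
            if (i + 1 >= len || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return (long)i;
            i += 2;
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            /* Low surrogate with no high surrogate before it. */
            return (long)i;
        } else {
            i++;    /* independent code point */
        }
    }
    return -1;
}
```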

In some situations, you may already know that your strings are valid,
and therefore want to skip these checks in order to improve
performance. If you set the PCRE_NO_UTF16_CHECK flag at compile time or
at run time, PCRE assumes that the pattern or subject it is given
(respectively) contains only valid UTF-16 sequences. In this case, it
does not diagnose an invalid UTF-16 string.

General comments about UTF modes

1. Codepoints less than 256 can be specified by either braced or
unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3).
Larger values have to use braced sequences.

2. Octal numbers up to \777 are recognized, and in UTF-8 mode, they
match two-byte characters for values greater than \177.

3. Repeat quantifiers apply to complete UTF characters, not to
individual data units, for example: \x{100}{3}.

4. The dot metacharacter matches one UTF character instead of a single
data unit.

5. The escape sequence \C can be used to match a single byte in UTF-8
mode, or a single 16-bit data unit in UTF-16 mode, but its use can lead
to some strange effects because it breaks up multi-unit characters (see
the description of \C in the pcrepattern documentation). The use of \C
is not supported in the alternative matching function
pcre[16]_dfa_exec(), nor is it supported in UTF mode by the JIT
optimization of pcre[16]_exec(). If JIT optimization is requested for a
UTF pattern that contains \C, it will not succeed, and so the matching
will be carried out by the normal interpretive function.

6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
test characters of any code value, but, by default, the characters that
PCRE recognizes as digits, spaces, or word characters remain the same
set as in non-UTF mode, all with values less than 256. This remains
true even when PCRE is built to include Unicode property support,
because to do otherwise would slow down PCRE in many common cases. Note
in particular that this applies to \b and \B, because they are defined
in terms of \w and \W. If you really want to test for a wider sense of,
say, "digit", you can use explicit Unicode property tests such as
\p{Nd}. Alternatively, if you set the PCRE_UCP option, the way that the
character escapes work is changed so that Unicode properties are used
to determine which characters match. There are more details in the
section on generic character types in the pcrepattern documentation.

7. Similarly, characters that match the POSIX named character classes
are all low-valued characters, unless the PCRE_UCP option is set.

8. However, the horizontal and vertical white space matching escapes
(\h, \H, \v, and \V) do match all the appropriate Unicode characters,
whether or not PCRE_UCP is set.

9. Case-insensitive matching applies only to characters whose values
are less than 128, unless PCRE is built with Unicode property support.
Even when Unicode property support is available, PCRE still uses its
own character tables when checking the case of low-valued characters,
so as not to degrade performance. The Unicode property information is
used only for characters with higher values. Furthermore, PCRE supports
case-insensitive matching only when there is a one-to-one mapping
between a letter's cases. There are a small number of many-to-one
mappings in Unicode; these are not supported by PCRE.
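
To make items 2 and 3 concrete, the following sketch encodes a code
point as UTF-8 and reports how many bytes it occupies. It is a plain C
illustration using the standard encoding rules, with a hypothetical
function name, and it is not taken from PCRE. It shows that octal
values from \200 to \777 need two bytes, and that \x{100} is a two-byte
character, so \x{100}{3} covers six bytes of a UTF-8 subject.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes cp as UTF-8 into out, which must have room for 4 bytes.
   Returns the number of bytes written, or 0 if cp is above 0x10FFFF.
   Surrogate values are not rejected here. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```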

Use a one-line callback function
  return thread_local_var

All the functions described in this section do nothing if JIT is not
available, and pcre_assign_jit_stack() does nothing unless the extra
argument is non-NULL and points to a pcre_extra block that is the
result of a successful study with PCRE_STUDY_JIT_COMPILE etc.


(1) Why do we need JIT stacks?

PCRE (and JIT) is a recursive, depth-first engine, so it needs a stack
where the local data of the current node is pushed before checking its
child nodes. Allocating real machine stack on some platforms is
difficult. For example, the stack chain needs to be updated every time
we extend the stack on PowerPC. Although this is possible, the overhead
of updating it decreases performance. So we do the recursion in memory.

(2) Why don't we simply allocate blocks of memory with malloc()?

Modern operating systems have a nice feature: they can reserve an
address space instead of allocating memory. We can safely allocate
memory pages inside this address space, so the stack can grow without
moving memory data (this is important because of pointers). Thus we can
allocate 1M address space, and use only a single memory page (usually
4K) if that is enough. However, we can still grow up to 1M anytime if
needed.
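
The reserve-then-commit idea described above can be sketched in C. This
is only an illustration, assuming a POSIX system with mmap() and
mprotect(); it is not how PCRE's JIT stack is implemented, and the
names region, region_reserve, region_grow, and region_free are
hypothetical.

```c
#define _DEFAULT_SOURCE            /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical growable region; not a PCRE data structure. */
typedef struct {
    unsigned char *base;  /* start of the reserved address range */
    size_t reserved;      /* bytes of address space reserved */
    size_t committed;     /* bytes currently readable and writable */
} region;

/* Reserve address space only; no page is usable yet. 0 on success. */
static int region_reserve(region *r, size_t max_bytes)
{
    void *p = mmap(NULL, max_bytes, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    r->base = p;
    r->reserved = max_bytes;
    r->committed = 0;
    return 0;
}

/* Make the first `bytes` usable, rounded up to whole pages.
   The base address never changes, so existing pointers stay valid. */
static int region_grow(region *r, size_t bytes)
{
    assert(r->base != NULL);
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t want = (bytes + page - 1) / page * page;
    if (want > r->reserved)
        return -1;                 /* cannot exceed the reservation */
    if (want <= r->committed)
        return 0;                  /* already large enough */
    if (mprotect(r->base, want, PROT_READ | PROT_WRITE) != 0)
        return -1;
    r->committed = want;
    return 0;
}

/* Give the whole address range back to the operating system. */
static void region_free(region *r)
{
    munmap(r->base, r->reserved);
    r->base = NULL;
    r->reserved = r->committed = 0;
}
```

Unlike growing a malloc() block with realloc(), growing this region
never moves the data, which is the property the answer above relies on.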

(3) Who "owns" a JIT stack?

The owner of the stack is the user program, not the JIT studied pattern
or anything else. The user program must ensure that if a stack is used
by pcre_exec() (that is, it is assigned to the pattern currently
running), that stack is not used by any other thread (to avoid
overwriting the same memory area). The best practice for multithreaded
programs is to allocate a stack for each thread, and return this stack
through the JIT callback function.

(4) When should a JIT stack be freed?

You can free a JIT stack at any time, as long as it will not be used by
pcre_exec() again. When you assign the stack to a pattern, only a
pointer is set. There is no reference counting or any other magic. You
can free the patterns and stacks in any order, anytime. Just do not
call pcre_exec() with a pattern pointing to an already freed stack, as
that will cause SEGFAULT. (Also, do not free a stack currently used by
pcre_exec() in another thread.) You can also replace the stack for a
pattern at any time. You can even free the previous stack before
assigning a replacement.

(5) Should I allocate/free a stack every time before/after calling
pcre_exec()?

No, because this is too costly in terms of resources. However, you
could implement some clever idea that releases the stack if it is not
used for, let's say, two minutes. The JIT callback can help to achieve
this without keeping a list of the currently JIT studied patterns.
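
One possible shape for such an idle-release scheme is sketched below.
It is an assumption-laden illustration, not part of PCRE: malloc() and
free() stand in for pcre_jit_stack_alloc() and pcre_jit_stack_free() so
that the sketch compiles without the library, and the names
stack_cache, stack_callback, and stack_release_if_idle are
hypothetical. The callback has the same shape as a JIT callback (it
takes a user data pointer and returns the stack to use), so the pattern
never has to be told that the stack was released and re-created.

```c
#include <stdlib.h>
#include <time.h>

/* Hypothetical per-thread cache. In real code `stack` would be a
   pcre_jit_stack pointer. */
typedef struct {
    void *stack;       /* NULL while no stack is allocated */
    time_t last_used;  /* time of the most recent callback */
} stack_cache;

/* Callback: allocate the stack lazily and record when it was used. */
static void *stack_callback(void *arg)
{
    stack_cache *c = arg;
    if (c->stack == NULL)
        c->stack = malloc(32 * 1024);  /* stand-in for the JIT alloc */
    c->last_used = time(NULL);
    return c->stack;
}

/* Call periodically from the owning thread, never while a match is
   running. Returns 1 if the stack was released, otherwise 0. */
static int stack_release_if_idle(stack_cache *c, time_t now,
                                 double max_idle_seconds)
{
    if (c->stack == NULL)
        return 0;
    if (difftime(now, c->last_used) < max_idle_seconds)
        return 0;
    free(c->stack);                    /* stand-in for the JIT free */
    c->stack = NULL;
    return 1;
}
```

Because the callback is consulted on each match, the next match after a
release simply allocates a fresh stack.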

(6) OK, the stack is for long term memory allocation. But what happens
if a pattern causes stack overflow with a stack of 1M? Is that 1M kept
until the stack is freed?

Especially on embedded systems, it might be a good idea to release
memory sometimes without freeing the stack. There is no API for this at
the moment. Probably a function call which returns with the currently
allocated memory for any stack and another which allows releasing
memory (shrinking the stack) would be a good idea if someone needs this.

(7) This is too much of a headache. Isn't there any better solution for
JIT stack handling?

No, thanks to Windows. If POSIX threads were used everywhere, we could
throw out this complicated API.

This is a single-threaded example that specifies a JIT stack without
using a callback.