~ubuntu-branches/ubuntu/trusty/pcre3/trusty : revision 38

  There are two new general option names, PCRE_UTF16 and
  PCRE_NO_UTF16_CHECK, which correspond to PCRE_UTF8 and
  PCRE_NO_UTF8_CHECK in the 8-bit library. In fact, these new options
- define the same bits in the options word.
+ define the same bits in the options word. There is a discussion about
+ the validity of UTF-16 strings in the pcreunicode page.

  For the pcre16_config() function there is an option PCRE_CONFIG_UTF16
  that returns 1 if UTF-16 support is configured, otherwise 0. If this
  option is given to pcre_config(), or if the PCRE_CONFIG_UTF8 option is
  given to pcre16_config(), the result is the PCRE_ERROR_BADOPTION error.

  CHARACTER CODES

  In 16-bit mode, when PCRE_UTF16 is not set, character values are
  treated in the same way as in 8-bit, non UTF-8 mode, except, of course,
  that they can range from 0 to 0xffff instead of 0 to 0xff. Character
  types for characters less than 0xff can therefore be influenced by the
  locale in the same way as before. Characters greater than 0xff have
  only one case, and no "type" (such as letter or digit).

  In UTF-16 mode, the character code is Unicode, in the range 0 to
  0x10ffff, with the exception of values in the range 0xd800 to 0xdfff
  because those are "surrogate" values that are used in pairs to encode
  values greater than 0xffff.
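The way surrogate pairs carry values above 0xffff can be sketched in a few lines of C. This is an illustration only: the function name utf16_encode is hypothetical and is not part of the PCRE API. It follows the standard UTF-16 rule of subtracting 0x10000 and splitting the remaining 20 bits across a high and a low surrogate.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a PCRE function.
   Encode one Unicode code point as UTF-16 code units.
   Returns the number of units written (1 or 2), or 0 if the value is
   not encodable (a surrogate value, or greater than 0x10ffff). */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
        return 0;
    if (cp <= 0xffff) {
        out[0] = (uint16_t)cp;                   /* one unit is enough */
        return 1;
    }
    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xd800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xdc00 | (cp & 0x3ff));  /* low surrogate */
    return 2;
}
```

A caller would pass a two-element buffer and use the return value to know how many units to append to the subject string.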

  A UTF-16 string can indicate its endianness by a special code known as a
  byte-order mark (BOM). The PCRE functions do not handle this, expecting
  strings to be in host byte order. A utility function called
  pcre16_utf16_to_host_byte_order() is provided to help with this (see
  above).
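To make the byte-order issue concrete, here is a minimal sketch of what a BOM check involves. It does not call pcre16_utf16_to_host_byte_order(), whose prototype is not shown in this excerpt; the helper utf16_fix_byte_order is a hypothetical stand-in written in plain C. It assumes that a BOM read in the wrong byte order appears as 0xfffe.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE.
   If the buffer starts with a byte-swapped BOM (0xfffe), swap the two
   bytes of every unit in place so that the string is in host byte order.
   Returns 1 if a swap was performed, 0 otherwise. */
int utf16_fix_byte_order(uint16_t *s, size_t len)
{
    size_t i;
    if (len == 0 || s[0] != 0xfffe)
        return 0;      /* no BOM, or BOM already in host order */
    for (i = 0; i < len; i++)
        s[i] = (uint16_t)((s[i] << 8) | (s[i] >> 8));
    return 1;
}
```

After the call the first unit, if it was a BOM, reads as 0xfeff; whether to keep or skip it before matching is left to the caller in this sketch.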

  ERROR NAMES

  The errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16
  correspond to their 8-bit counterparts. The error PCRE_ERROR_BADMODE is
  given when a compiled pattern is passed to a function that processes
  patterns in the other mode, for example, if a pattern compiled with
  pcre_compile() is passed to pcre16_exec().

  There are new error codes whose names begin with PCRE_UTF16_ERR for
  invalid UTF-16 strings, corresponding to the PCRE_UTF8_ERR codes for
  UTF-8 strings that are described in the section entitled "Reason codes
  for invalid UTF-8 strings" in the main pcreapi page. The UTF-16 errors
  are:

    PCRE_UTF16_ERR1  Missing low surrogate at end of string

  ERROR TEXTS

  If there is an error while compiling a pattern, the error text that is
  passed back by pcre16_compile() or pcre16_compile2() is still an 8-bit
  character string, zero-terminated.

  CALLOUTS

  The subject and mark fields in the callout block that is passed to a
  callout function point to 16-bit vectors.

  TESTING

  The pcretest program continues to operate with 8-bit input and output
  files, but it can be used for testing the 16-bit library. If it is run
  with the command line option -16, patterns and subject strings are
  converted from 8-bit to 16-bit before being passed to PCRE, and the
  16-bit library functions are used instead of the 8-bit ones. Returned
  16-bit strings are converted to 8-bit for output. If the 8-bit library
  was not compiled, pcretest defaults to 16-bit and the -16 option is
  ignored.

  When PCRE is being built, the RunTest script that is called by "make
  check" uses the pcretest -C option to discover which of the 8-bit and
  16-bit libraries has been built, and runs the tests appropriately.

  NOT SUPPORTED IN 16-BIT MODE

  Not all the features of the 8-bit library are available with the 16-bit
  library. The C++ and POSIX wrapper functions support only the 8-bit
  library, and the pcregrep program is at present 8-bit only.

  REVISION

- Last updated: 08 January 2012
+ Last updated: 14 April 2012

  ------------------------------------------------------------------------------

  tern compiling functions.

  If you set --enable-utf when compiling in an EBCDIC environment, PCRE
- expects its input to be either ASCII or UTF-8 (depending on the runtime
+ expects its input to be either ASCII or UTF-8 (depending on the run-time
  option). It is not possible to support both EBCDIC and UTF-8 codes in
  the same version of the library. Consequently, --enable-utf and
  --enable-ebcdic are mutually exclusive.

  to the configure command, the distributed tables are no longer used.
  Instead, a program called dftables is compiled and run. This outputs
  the source for a new set of tables, created in the default locale of your
- C runtime system. (This method of replacing the tables does not work if
+ C run-time system. (This method of replacing the tables does not work if
  you are cross compiling, because dftables is run on the local host. If
  you need to create alternative tables when cross compiling, you will
  have to do so "by hand".)

  feed) character, the two-character sequence CRLF, any of the three
  preceding, or any Unicode newline sequence. The Unicode newline sequences
  are the three just mentioned, plus the single characters VT (vertical
- tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS (line
+ tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
  separator, U+2028), and PS (paragraph separator, U+2029).

  Each of the first three conventions is used by at least one operating

  different parts of the pattern, the contents of the options argument
  specify their settings at the start of compilation and execution. The
  PCRE_ANCHORED, PCRE_BSR_xxx, PCRE_NEWLINE_xxx, PCRE_NO_UTF8_CHECK, and
- PCRE_NO_START_OPT options can be set at the time of matching as well as
- at compile time.
+ PCRE_NO_START_OPTIMIZE options can be set at the time of matching as
+ well as at compile time.

  If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise,
  if compilation of a pattern fails, pcre_compile() returns NULL, and

  PCRE_EXTENDED

- If this bit is set, whitespace data characters in the pattern are
- totally ignored except when escaped or inside a character class.
- Whitespace does not include the VT character (code 11). In addition,
+ If this bit is set, white space data characters in the pattern are
+ totally ignored except when escaped or inside a character class.
+ White space does not include the VT character (code 11). In addition,
  characters between an unescaped # outside a character class and the
  next newline, inclusive, are also ignored. This is equivalent to Perl's /x

  This option makes it possible to include comments inside complicated
  patterns. Note, however, that this applies only to data characters.
- Whitespace characters may never appear within special character
+ White space characters may never appear within special character
  sequences in a pattern, for example within the sequence (?( that
  introduces a conditional subpattern.

  that any of the three preceding sequences should be recognized. Setting
  PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be
  recognized. The Unicode newline sequences are the three just mentioned,
- plus the single characters VT (vertical tab, U+000B), FF (formfeed,
+ plus the single characters VT (vertical tab, U+000B), FF (form feed,
  U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
  (paragraph separator, U+2029). For the 8-bit library, the last two are
  recognized only in UTF-8 mode.
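The set of sequences that PCRE_NEWLINE_ANY recognizes can be written down as a small classifier. The sketch below is not PCRE's own code; the function name newline_any_length is hypothetical, and it assumes the subject has already been decoded into an array of code points, which sidesteps the 8-bit versus UTF-8 distinction mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a PCRE function.
   Returns the length, in code points, of the Unicode newline sequence
   that starts at s[0], or 0 if s[0] does not start one. */
size_t newline_any_length(const uint32_t *s, size_t len)
{
    if (len == 0)
        return 0;
    switch (s[0]) {
    case 0x000d:                  /* CR, or CRLF when LF follows */
        return (len > 1 && s[1] == 0x000a) ? 2 : 1;
    case 0x000a:                  /* LF */
    case 0x000b:                  /* VT */
    case 0x000c:                  /* FF */
    case 0x0085:                  /* NEL */
    case 0x2028:                  /* LS */
    case 0x2029:                  /* PS */
        return 1;
    default:
        return 0;
    }
}
```

Treating CRLF as one two-unit sequence rather than two separate newlines is the detail that matters most when counting lines.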

  cause an error.

  The only time that a line break in a pattern is specially recognized
- when compiling is when PCRE_EXTENDED is set. CR and LF are whitespace
+ when compiling is when PCRE_EXTENDED is set. CR and LF are white space
  characters, and so are ignored in this mode. Also, an unescaped #
  outside a character class indicates a comment that lasts until after the
  next line break sequence. In other circumstances, line break sequences

  72  too many forward references
  73  disallowed Unicode code point (>= 0xd800 && <= 0xdfff)
  74  invalid UTF-16 string (specifically UTF-16)
+ 75  name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
+ 76  character value in \u.... sequence is too large

  The numbers 32 and 10000 in errors 48 and 49 are defaults; different
  values may be used if the limits were changed when PCRE was built.

  wants to pass any of the other fields to pcre_exec() or
  pcre_dfa_exec(), it must set up its own pcre_extra block.

- The second argument of pcre_study() contains option bits. There is only
- one option: PCRE_STUDY_JIT_COMPILE. If this is set, and the
- just-in-time compiler is available, the pattern is further compiled into
- machine code that executes much faster than the pcre_exec() matching
- function. If the just-in-time compiler is not available, this option is
- ignored. All other bits in the options argument must be zero.
+ The second argument of pcre_study() contains option bits. There are
+ three options:
+
+   PCRE_STUDY_JIT_COMPILE
+   PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
+   PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
+
+ If any of these are set, and the just-in-time compiler is available,
+ the pattern is further compiled into machine code that executes much
+ faster than the pcre_exec() interpretive matching function. If the
+ just-in-time compiler is not available, these options are ignored. All
+ other bits in the options argument must be zero.

  JIT compilation is a heavyweight optimization. It can take some time
  for patterns to be analyzed, and for one-off matches and simple pat-

  the study data by calling pcre_free_study(). This function was added to
  the API for release 8.20. For earlier versions, the memory could be
  freed with pcre_free(), just like the pattern itself. This will still
- work in cases where PCRE_STUDY_JIT_COMPILE is not used, but it is
- advisable to change to the new function when convenient.
+ work in cases where JIT optimization is not used, but it is advisable
+ to change to the new function when convenient.

  This is a typical way in which pcre_study() is used (except that in a
  real application there should be tests for errors):

  which to start matching. (In 16-bit mode, the bitmap is used for 16-bit
  values less than 256.)

- These two optimizations apply to both pcre_exec() and pcre_dfa_exec().
- However, they are not used by pcre_exec() if pcre_study() is called
- with the PCRE_STUDY_JIT_COMPILE option, and just-in-time compiling is
- successful. The optimizations can be disabled by setting the
- PCRE_NO_START_OPTIMIZE option when calling pcre_exec() or
- pcre_dfa_exec(). You might want to do this if your pattern contains
- callouts or (*MARK) (which cannot be handled by the JIT compiler), and
- you want to make use of these facilities in cases where matching fails.
- See the discussion of PCRE_NO_START_OPTIMIZE below.
+ These two optimizations apply to both pcre_exec() and pcre_dfa_exec(),
+ and the information is also used by the JIT compiler. The optimizations
+ can be disabled by setting the PCRE_NO_START_OPTIMIZE option when
+ calling pcre_exec() or pcre_dfa_exec(), but if this is done, JIT
+ execution is also disabled. You might want to do this if your pattern
+ contains callouts or (*MARK) and you want to make use of these
+ facilities in cases where matching fails. See the discussion of
+ PCRE_NO_START_OPTIMIZE below.

  LOCALE SUPPORT

  PCRE handles caseless matching, and determines whether characters are
  letters, digits, or whatever, by reference to a set of tables, indexed
  by character value. When running in UTF-8 mode, this applies only to
  characters with codes less than 128. By default, higher-valued codes
  never match escapes such as \w or \d, but they can be tested with \p if
  PCRE is built with Unicode character property support. Alternatively,
  the PCRE_UCP option can be set at compile time; this causes \w and
  friends to use Unicode property support instead of built-in tables. The
  use of locales with Unicode is discouraged. If you are handling
  characters with codes greater than 128, you should either use UTF-8 and
  Unicode, or use locales, but not try to mix the two.

  PCRE contains an internal set of tables that are used when the final
  argument of pcre_compile() is NULL. These are sufficient for many
  applications. Normally, the internal tables recognize only ASCII
  characters. However, when PCRE is built, it is possible to cause the
  internal tables to be rebuilt in the default "C" locale of the local
  system, which may cause them to be different.

  The internal tables can always be overridden by tables supplied by the
  application that calls PCRE. These may be created in a different locale
  from the default. As more and more applications change to using
  Unicode, the need for this locale support is expected to die away.

  External tables are built by calling the pcre_maketables() function,
  which has no arguments, in the relevant locale. The result can then be
  passed to pcre_compile() or pcre_exec() as often as necessary. For
  example, to build and use tables that are appropriate for the French
  locale (where accented characters with values greater than 128 are
  treated as letters), the following code could be used:

    setlocale(LC_CTYPE, "fr_FR");
    tables = pcre_maketables();
    re = pcre_compile(..., tables);

  The locale name "fr_FR" is used on Linux and other Unix-like systems;
  if you are using Windows, the name for the French locale is "french".

  When pcre_maketables() runs, the tables are built in memory that is
  obtained via pcre_malloc. It is the caller's responsibility to ensure
  that the memory containing the tables remains available for as long as
  it is needed.

  The pointer that is passed to pcre_compile() is saved with the compiled
  pattern, and the same tables are used via this pointer by pcre_study()
  and normally also by pcre_exec(). Thus, by default, for any single
  pattern, compilation, studying and matching all happen in the same
  locale, but different patterns can be compiled in different locales.

  It is possible to pass a table pointer or NULL (indicating the use of
  the internal tables) to pcre_exec(). Although not intended for this
  purpose, this facility could be used to match a pattern in a different
  locale from the one in which it was compiled. Passing table pointers at
  run time is discussed below in the section on matching a pattern.

  int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
       int what, void *where);

  The pcre_fullinfo() function returns information about a compiled
  pattern. It replaces the pcre_info() function, which was removed from the
  library at version 8.30, after more than 10 years of obsolescence.

  The first argument for pcre_fullinfo() is a pointer to the compiled
  pattern. The second argument is the result of pcre_study(), or NULL if
  the pattern was not studied. The third argument specifies which piece
  of information is required, and the fourth argument is a pointer to a
  variable to receive the data. The yield of the function is zero for
  success, or one of the following negative numbers:

    PCRE_ERROR_NULL           the argument code was NULL

                              endianness
    PCRE_ERROR_BADOPTION      the value of what was invalid

  The "magic number" is placed at the start of each compiled pattern as
  a simple check against passing an arbitrary memory pointer. The
  endianness error can occur if a compiled pattern is saved and reloaded
  on a different host. Here is a typical call of pcre_fullinfo(), to obtain
  the length of the compiled pattern:

    int rc;

      PCRE_INFO_SIZE,   /* what is required */
      &length);         /* where to put the data */

  The possible values for the third argument are defined in pcre.h, and
  are as follows:

  PCRE_INFO_BACKREFMAX

  Return the number of the highest back reference in the pattern. The
  fourth argument should point to an int variable. Zero is returned if
  there are no back references.

  PCRE_INFO_CAPTURECOUNT

  Return the number of capturing subpatterns in the pattern. The fourth
  argument should point to an int variable.

  PCRE_INFO_DEFAULT_TABLES

  Return a pointer to the internal default character tables within PCRE.
  The fourth argument should point to an unsigned char * variable. This
  information call is provided for internal use by the pcre_study()
  function. External callers can cause PCRE to use its internal tables by
  passing a NULL table pointer.

  PCRE_INFO_FIRSTBYTE

  Return information about the first data unit of any matched string, for
  a non-anchored pattern. (The name of this option refers to the 8-bit
  library, where data units are bytes.) The fourth argument should point
  to an int variable.

  If there is a fixed first value, for example, the letter "c" from a
  pattern such as (cat|cow|coyote), its value is returned. In the 8-bit
  library, the value is always less than 256; in the 16-bit library the
  value can be up to 0xffff.

  If there is no fixed first value, and if either

  (a) the pattern was compiled with the PCRE_MULTILINE option, and every
  branch starts with "^", or

  (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
  set (if it were set, the pattern would be anchored),

  -1 is returned, indicating that the pattern matches only at the start
  of a subject string or after any newline within the string. Otherwise
  -2 is returned. For anchored patterns, -2 is returned.
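The three kinds of value described above can be told apart with a small helper. This is only a sketch: describe_first_unit is a hypothetical name, and the int it receives is assumed to be the value that a PCRE_INFO_FIRSTBYTE query stored through the fourth argument.

```c
/* Hypothetical helper, not a PCRE function.
   Classify the value stored by a PCRE_INFO_FIRSTBYTE query. */
const char *describe_first_unit(int value)
{
    if (value >= 0)
        return "fixed first data unit";
    if (value == -1)
        return "matches only at subject start or after a newline";
    if (value == -2)
        return "no first-unit information, or pattern is anchored";
    return "unexpected value";
}
```

Checking for a non-negative value first keeps the fixed-unit case, which can be as large as 0xffff in the 16-bit library, separate from the two negative markers.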

  PCRE_INFO_FIRSTTABLE

  If the pattern was studied, and this resulted in the construction of a
  256-bit table indicating a fixed set of values for the first data unit
  in any matching string, a pointer to the table is returned. Otherwise
  NULL is returned. The fourth argument should point to an unsigned char
  * variable.
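A 256-bit table occupies 32 bytes, one bit per possible first data unit. The sketch below shows one way to set and test such bits. The function names are hypothetical, and the bit layout (byte c/8, bit c&7) is an assumption made for the illustration rather than something stated in this text. Values of 256 or more, which can occur in the 16-bit library, fall outside the table, so this sketch does not rule them out.

```c
/* Hypothetical helpers, not PCRE functions. The bit layout is assumed. */

/* Mark data unit c (0..255) as a possible first unit. */
void first_table_set(unsigned char table[32], unsigned int c)
{
    table[c / 8] |= (unsigned char)(1u << (c & 7));
}

/* Return 1 if data unit c may start a match according to the table.
   Units above 255 are not covered by the table and are reported as 1. */
int first_table_allows(const unsigned char table[32], unsigned int c)
{
    if (c > 255)
        return 1;
    return (table[c / 8] >> (c & 7)) & 1;
}
```

A matcher can use a test like this to skip subject positions whose first unit cannot begin a match.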

  PCRE_INFO_HASCRORLF

  Return 1 if the pattern contains any explicit matches for CR or LF
  characters, otherwise 0. The fourth argument should point to an int
  variable. An explicit match is either a literal CR or LF character, or
  \r or \n.

  PCRE_INFO_JCHANGED

  Return 1 if the (?J) or (?-J) option setting is used in the pattern,
  otherwise 0. The fourth argument should point to an int variable. (?J)
  and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.

  PCRE_INFO_JIT

- Return 1 if the pattern was studied with the PCRE_STUDY_JIT_COMPILE
- option, and just-in-time compiling was successful. The fourth argument
- should point to an int variable. A return value of 0 means that JIT
- support is not available in this version of PCRE, or that the pattern
- was not studied with the PCRE_STUDY_JIT_COMPILE option, or that the JIT
- compiler could not handle this particular pattern. See the pcrejit
- documentation for details of what can and cannot be handled.
+ Return 1 if the pattern was studied with one of the JIT options, and
+ just-in-time compiling was successful. The fourth argument should point
+ to an int variable. A return value of 0 means that JIT support is not
+ available in this version of PCRE, or that the pattern was not studied
+ with a JIT option, or that the JIT compiler could not handle this
+ particular pattern. See the pcrejit documentation for details of what
+ can and cannot be handled.

  PCRE_INFO_JITSIZE

- If the pattern was successfully studied with the PCRE_STUDY_JIT_COMPILE
- option, return the size of the JIT compiled code, otherwise return
- zero. The fourth argument should point to a size_t variable.
+ If the pattern was successfully studied with a JIT option, return the
+ size of the JIT compiled code, otherwise return zero. The fourth
+ argument should point to a size_t variable.

  PCRE_INFO_LASTLITERAL

  Return the value of the rightmost literal data unit that must exist in
  any matched string, other than at its start, if such a value has been
  recorded. The fourth argument should point to an int variable. If there
  is no such value, -1 is returned. For anchored patterns, a last literal
  value is recorded only if it follows something of variable length. For
  example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
  /^a\dz\d/ the returned value is -1.

+ PCRE_INFO_MAXLOOKBEHIND
+
+ Return the number of characters (NB not bytes) in the longest
+ lookbehind assertion in the pattern. Note that the simple assertions \b
+ and \B require a one-character lookbehind. This information is useful
+ when doing multi-segment matching using the partial matching facilities.
+
  PCRE_INFO_MINLENGTH

  If the pattern was studied and a minimum length for matching subject

  In the 16-bit version of this structure, the mark field has type
  "PCRE_UCHAR16 **".

- The flags field is a bitmap that specifies which of the other fields
- are set. The flag bits are:
+ The flags field is used to specify which of the other fields are set.
+ The flag bits are:

-   PCRE_EXTRA_STUDY_DATA
+   PCRE_EXTRA_CALLOUT_DATA
    PCRE_EXTRA_EXECUTABLE_JIT
+   PCRE_EXTRA_MARK
    PCRE_EXTRA_MATCH_LIMIT
    PCRE_EXTRA_MATCH_LIMIT_RECURSION
-   PCRE_EXTRA_CALLOUT_DATA
+   PCRE_EXTRA_STUDY_DATA
    PCRE_EXTRA_TABLES
-   PCRE_EXTRA_MARK

  Other flag bits should be set to zero. The study_data field and
  sometimes the executable_jit field are set in the pcre_extra block that
  is returned by pcre_study(), together with the appropriate flag bits. You
  should not set these yourself, but you may add to the block by setting
- the other fields and their corresponding flag bits.
+ other fields and their corresponding flag bits.
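The rule that a caller may add fields to a studied block, but must leave the study-related fields and bits alone, comes down to OR-ing in one flag at a time. The sketch below uses a stand-in structure so that it compiles without pcre.h. The names demo_extra, DEMO_EXTRA_STUDY_DATA, DEMO_EXTRA_MATCH_LIMIT and demo_set_match_limit are all hypothetical, and the flag values are chosen only for the illustration.

```c
/* Hypothetical stand-ins for the pcre_extra block and two of its flag
   bits; the real definitions and values are in pcre.h. */
#define DEMO_EXTRA_STUDY_DATA   0x0001
#define DEMO_EXTRA_MATCH_LIMIT  0x0002

struct demo_extra {
    unsigned long flags;        /* which of the other fields are set */
    void *study_data;           /* filled in by the study step */
    unsigned long match_limit;  /* supplied by the caller */
};

/* Add a match limit to an existing block without disturbing the fields
   and flag bits that the study step already set. */
void demo_set_match_limit(struct demo_extra *e, unsigned long limit)
{
    e->match_limit = limit;
    e->flags |= DEMO_EXTRA_MATCH_LIMIT;   /* OR in, never overwrite */
}
```

Assigning to flags with = instead of |= would clear the study bit, which is the mistake the text warns against.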

2402

2417

2403

2418

The match_limit field provides a means of preventing PCRE from using up

2404

2419

a vast amount of resources when running patterns that are not going to

2414

2429

zero for each position in the subject string.

2415

2430

2416

2431

When pcre_exec() is called with a pattern that was successfully studied

2432

with a JIT option, the way that the matching is executed is entirely

2433

different. However, there is still the possibility of runaway matching

2434

that goes on for a very long time, and so the match_limit value is also

2435

used in this case (but in a different way) to limit how long the match-

2436

ing can continue.

2422

2437

2423

2438

The default value for the limit can be set when PCRE is built; the

2424

2439

default default is 10 million, which handles all but the most extreme

2436

2451

Limiting the recursion depth limits the amount of machine stack that

2437

2452

can be used, or, when PCRE has been compiled to use memory on the heap

2438

2453

instead of the stack, the amount of heap memory that can be used. This

2454

limit is not relevant, and is ignored, when matching is done using JIT

2455

compiled code.

2441

2456

2442

2457

The default value for match_limit_recursion can be set when PCRE is

2443

2458

built; the default default is the same value as the default for

2477

2492

The unused bits of the options argument for pcre_exec() must be zero.

2478

2493

The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx,

2479

2494

PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,

2495

PCRE_NO_START_OPTIMIZE, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD, and

2496

PCRE_PARTIAL_SOFT.

2482

2497

2498

If the pattern was successfully studied with one of the just-in-time

2499

(JIT) compile options, the only supported options for JIT execution are

2500

PCRE_NO_UTF8_CHECK, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,

2501

PCRE_NOTEMPTY_ATSTART, PCRE_PARTIAL_HARD, and PCRE_PARTIAL_SOFT. If an

2502

unsupported option is used, JIT execution is disabled and the normal

2503

interpretive code in pcre_exec() is run.

2489

2504

2490

2505

PCRE_ANCHORED

2491

2506

2608

2623

where the result is "no match", the callouts do occur, and that items

2609

2624

such as (*COMMIT) and (*MARK) are considered at every possible starting

2610

2625

position in the subject string. If PCRE_NO_START_OPTIMIZE is set at

2626

compile time, it cannot be unset at matching time. The use of

2627

PCRE_NO_START_OPTIMIZE disables JIT execution; when it is set, matching

2628

is always done interpretively.

2612

2629

2613

2630

Setting PCRE_NO_START_OPTIMIZE can change the outcome of a matching

2614

2631

operation. Consider the pattern

2642

2659

2643

2660

When PCRE_UTF8 is set at compile time, the validity of the subject as a

2644

2661

UTF-8 string is automatically checked when pcre_exec() is subsequently

2662

called. The entire string is checked before any other processing takes

2663

place. The value of startoffset is also checked to ensure that it

2664

points to the start of a UTF-8 character. There is a discussion about

2665

the validity of UTF-8 strings in the pcreunicode page. If an invalid

2666

sequence of bytes is found, pcre_exec() returns the error

2649

2667

PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is a

2650

2668

truncated character at the end of the subject, PCRE_ERROR_SHORTUTF8. In

2669

both cases, information about the precise nature of the error may also

2670

be returned (see the descriptions of these errors in the section enti-

2671

tled Error return values from pcre_exec() below). If startoffset con-

2654

2672

tains a value that does not point to the start of a UTF-8 character (or

2655

2673

to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned.
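The startoffset requirement can be checked by the caller before
pcre_exec() is called. The following is a minimal sketch, not code from
PCRE; the helper name is_utf8_char_start is hypothetical. It relies only
on the UTF-8 rule that continuation bytes have the form 10xxxxxx, and it
assumes that the subject is otherwise valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper (not part of PCRE): returns 1 if offset is a
   position at which a UTF-8 character can start, or is the end of the
   subject; returns 0 otherwise. Assumes the subject is valid UTF-8. */
int is_utf8_char_start(const char *subject, size_t length, size_t offset)
{
    if (offset > length) return 0;   /* outside the subject */
    if (offset == length) return 1;  /* the end of the subject is allowed */

    /* Continuation bytes look like 10xxxxxx; any other byte starts a
       character. */
    return (((unsigned char)subject[offset]) & 0xC0) != 0x80;
}
```

An offset for which this helper returns 0 is one that pcre_exec() would
reject with PCRE_ERROR_BADUTF8_OFFSET when the checks are enabled.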

2656

2674

2675

If you already know that your subject is valid, and you want to skip

2676

these checks for performance reasons, you can set the

2677

PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to

2678

do this for the second and subsequent calls to pcre_exec() if you are

2679

making repeated calls to find all the matches in a single subject

2680

string. However, you should be sure that the value of startoffset

2681

points to the start of a character (or the end of the subject). When

2664

2682

PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid string as a

2665

subject or an invalid value of startoffset is undefined. Your program

2683

subject or an invalid value of startoffset is undefined. Your program

2666

2684

may crash.

2667

2685

2668

2686

PCRE_PARTIAL_HARD

2669

2687

PCRE_PARTIAL_SOFT

2670

2688

2689

These options turn on the partial matching feature. For backwards com-

2690

patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial

2691

match occurs if the end of the subject string is reached successfully,

2692

but there are not enough subject characters to complete the match. If

2675

2693

this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,

2694

matching continues by testing any remaining alternatives. Only if no

2695

complete match can be found is PCRE_ERROR_PARTIAL returned instead of

2696

PCRE_ERROR_NOMATCH. In other words, PCRE_PARTIAL_SOFT says that the

2697

caller is prepared to handle a partial match, but only if no complete

2680

2698

match can be found.

2681

2699

2682

If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this
case, if a partial match is found, pcre_exec() immediately returns
PCRE_ERROR_PARTIAL, without considering any other alternatives. In
other words, when PCRE_PARTIAL_HARD is set, a partial match is
considered to be more important than an alternative complete match.

2687

2705

2688

In both cases, the portion of the string that was inspected when the

2706

In both cases, the portion of the string that was inspected when the

2689

2707

partial match was found is set as the first matching string. There is a

2690

more detailed discussion of partial and multi-segment matching, with

2708

more detailed discussion of partial and multi-segment matching, with

2691

2709

examples, in the pcrepartial documentation.

2692

2710

2693

2711

The string to be matched by pcre_exec()

2694

2712

2713

The subject string is passed to pcre_exec() as a pointer in subject, a

2714

length in bytes in length, and a starting byte offset in startoffset.

2715

If this is negative or greater than the length of the subject,

2716

pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting offset is

2717

zero, the search for a match starts at the beginning of the subject,

2700

2718

and this is by far the most common case. In UTF-8 mode, the byte offset

2719

must point to the start of a UTF-8 character (or the end of the sub-

2720

ject). Unlike the pattern string, the subject may contain binary zero

2703

2721

bytes.

2704

2722

2723

A non-zero starting offset is useful when searching for another match

2724

in the same subject by calling pcre_exec() again after a previous suc-

2725

cess. Setting startoffset differs from just passing over a shortened

2726

string and setting PCRE_NOTBOL in the case of a pattern that begins

2709

2727

with any kind of lookbehind. For example, consider the pattern

2710

2728

2711

2729

\Biss\B

2712

2730

2731

which finds occurrences of "iss" in the middle of words. (\B matches

2732

only if the current position in the subject is not a word boundary.)

2733

When applied to the string "Mississipi" the first call to pcre_exec()

2734

finds the first occurrence. If pcre_exec() is called again with just

2735

the remainder of the subject, namely "issipi", it does not match,

2718

2736

because \B is always false at the start of the subject, which is deemed

2719

to be a word boundary. However, if pcre_exec() is passed the entire

2737

to be a word boundary. However, if pcre_exec() is passed the entire

2720

2738

string again, but with startoffset set to 4, it finds the second occur-

2721

rence of "iss" because it is able to look behind the starting point to

2739

rence of "iss" because it is able to look behind the starting point to

2722

2740

discover that it is preceded by a letter.

2723

2741

2724

Finding all the matches in a subject is tricky when the pattern can

2742

Finding all the matches in a subject is tricky when the pattern can

2725

2743

match an empty string. It is possible to emulate Perl's /g behaviour by

2744

first trying the match again at the same offset, with the

2745

PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED options, and then if that

2746

fails, advancing the starting offset and trying an ordinary match

2729

2747

again. There is some code that demonstrates how to do this in the pcre-

2730

2748

demo sample program. In the most general case, you have to check to see

2731

if the newline convention recognizes CRLF as a newline, and if so, and

2749

if the newline convention recognizes CRLF as a newline, and if so, and

2732

2750

the current character is CR followed by LF, advance the starting offset

2733

2751

by two characters instead of one.
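The offset adjustment described above can be sketched as follows. This
is an illustration only, not the code from pcredemo; the function name
advance_start_offset and its parameters are hypothetical. The caller is
assumed to have already determined whether the newline convention
recognizes CRLF and whether the pattern is in UTF-8 mode. The UTF-8 step
is included because startoffset must point to the start of a character.

```c
#include <stddef.h>

/* Hypothetical helper: called after an empty match at offset, once the
   retry with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED has failed.
   Returns the offset at which the next ordinary match should start. */
size_t advance_start_offset(const char *subject, size_t length,
                            size_t offset, int crlf_is_newline, int utf8)
{
    if (offset >= length) return length;   /* nothing left to skip */

    /* Treat CR followed by LF as a single newline when CRLF is recognized. */
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;

    offset++;                              /* advance by one byte */

    /* In UTF-8 mode, skip continuation bytes (10xxxxxx) so that the new
       offset points to the start of a character. */
    if (utf8) {
        while (offset < length &&
               (((unsigned char)subject[offset]) & 0xC0) == 0x80)
            offset++;
    }
    return offset;
}
```

If the returned offset equals the subject length, the caller may still
make one final attempt there, since the end of the subject is a valid
starting offset.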

2734

2752

2735

If a non-zero starting offset is passed when the pattern is anchored,

2753

If a non-zero starting offset is passed when the pattern is anchored,

2736

2754

one attempt to match at the given offset is made. This can only succeed

2737

if the pattern does not require the match to be at the start of the

2755

if the pattern does not require the match to be at the start of the

2738

2756

subject.

2739

2757

2740

2758

How pcre_exec() returns captured substrings

2741

2759

2760

In general, a pattern matches a certain portion of the subject, and in

2761

addition, further substrings from the subject may be picked out by

2762

parts of the pattern. Following the usage in Jeffrey Friedl's book,

2763

this is called "capturing" in what follows, and the phrase "capturing

2764

subpattern" is used for a fragment of a pattern that picks out a sub-

2765

string. PCRE supports several other kinds of parenthesized subpattern

2748

2766

that do not cause substrings to be captured.

2749

2767

2750

2768

Captured substrings are returned to the caller via a vector of integers

2751

whose address is passed in ovector. The number of elements in the vec-

2752

tor is passed in ovecsize, which must be a non-negative number. Note:

2769

whose address is passed in ovector. The number of elements in the vec-

2770

tor is passed in ovecsize, which must be a non-negative number. Note:

2753

2771

this argument is NOT the size of ovector in bytes.

2754

2772

2773

The first two-thirds of the vector is used to pass back captured sub-

2774

strings, each substring using a pair of integers. The remaining third

2775

of the vector is used as workspace by pcre_exec() while matching cap-

2776

turing subpatterns, and is not available for passing back information.

2777

The number passed in ovecsize should always be a multiple of three. If

2760

2778

it is not, it is rounded down.

2761

2779

2780

When a match is successful, information about captured substrings is

2781

returned in pairs of integers, starting at the beginning of ovector,

2782

and continuing up to two-thirds of its length at the most. The first

2783

element of each pair is set to the byte offset of the first character

2784

in a substring, and the second is set to the byte offset of the first

2785

character after the end of a substring. Note: these values are always

2768

2786

byte offsets, even in UTF-8 mode. They are not character counts.

2769

2787

2770

The first pair of integers, ovector[0] and ovector[1], identify the

2771

portion of the subject string matched by the entire pattern. The next

2772

pair is used for the first capturing subpattern, and so on. The value

2788

The first pair of integers, ovector[0] and ovector[1], identify the

2789

portion of the subject string matched by the entire pattern. The next

2790

pair is used for the first capturing subpattern, and so on. The value

2773

2791

returned by pcre_exec() is one more than the highest numbered pair that

2774

has been set. For example, if two substrings have been captured, the

2775

returned value is 3. If there are no capturing subpatterns, the return

2792

has been set. For example, if two substrings have been captured, the

2793

returned value is 3. If there are no capturing subpatterns, the return

2776

2794

value from a successful match is 1, indicating that just the first pair

2777

2795

of offsets has been set.

2778

2796

2779

2797

If a capturing subpattern is matched repeatedly, it is the last portion

2780

2798

of the string that it matched that is returned.

2781

2799

2782

If the vector is too small to hold all the captured substring offsets,
it is used as far as possible (up to two-thirds of its length), and the
function returns a value of zero. If neither the actual string matched
nor any captured substrings are of interest, pcre_exec() may be called
with ovector passed as NULL and ovecsize as zero. However, if the
pattern contains back references and the ovector is not big enough to
remember the related substrings, PCRE has to get additional memory for
use during matching. Thus it is usually advisable to supply an ovector
of reasonable size.

2791

2809

2792

There are some cases where zero is returned (indicating vector over-

2793

flow) when in fact the vector is exactly the right size for the final

2810

There are some cases where zero is returned (indicating vector over-

2811

flow) when in fact the vector is exactly the right size for the final

2794

2812

match. For example, consider the pattern

2795

2813

2796

2814

(a)(?:(b)c|bd)

2797

2815

2798

If a vector of 6 elements (allowing for only 1 captured substring) is

2816

If a vector of 6 elements (allowing for only 1 captured substring) is

2799

2817

given with subject string "abd", pcre_exec() will try to set the second

2800

2818

captured string, thereby recording a vector overflow, before failing to

2801

match "c" and backing up to try the second alternative. The zero

2802

return, however, does correctly indicate that the maximum number of

2819

match "c" and backing up to try the second alternative. The zero

2820

return, however, does correctly indicate that the maximum number of

2803

2821

slots (namely 2) have been filled. In similar cases where there is tem-

2804

porary overflow, but the final number of used slots is actually less

2822

porary overflow, but the final number of used slots is actually less

2805

2823

than the maximum, a non-zero value is returned.

2806

2824

2807

2825

The pcre_fullinfo() function can be used to find out how many capturing

2808

subpatterns there are in a compiled pattern. The smallest size for

2809

ovector that will allow for n captured substrings, in addition to the

2826

subpatterns there are in a compiled pattern. The smallest size for

2827

ovector that will allow for n captured substrings, in addition to the

2810

2828

offsets of the substring matched by the whole pattern, is (n+1)*3.
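The sizing rules for ovector can be written down as simple arithmetic.
The three function names below are hypothetical and are not part of the
PCRE API; the sketch assumes that the arguments are non-negative.

```c
/* Smallest ovecsize that holds the whole-pattern offsets plus
   n_captures captured substrings: (n+1)*3. */
int min_ovecsize(int n_captures)
{
    return (n_captures + 1) * 3;
}

/* The size that is actually used: a value that is not a multiple of
   three is rounded down. */
int effective_ovecsize(int ovecsize)
{
    return ovecsize - (ovecsize % 3);
}

/* Number of offset pairs that can be passed back, including the pair
   for the whole match. Only the first two-thirds of the vector hold
   pairs, and each pair needs two integers, so this is ovecsize / 3. */
int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

For instance, a vector of 6 elements gives two usable pairs, that is,
the whole match and one captured substring, which agrees with the
example above.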

2811

2829

2812

It is possible for capturing subpattern number n+1 to match some part

2830

It is possible for capturing subpattern number n+1 to match some part

2813

2831

of the subject when subpattern n has not been used at all. For example,

2814

if the string "abc" is matched against the pattern (a|(z))(bc) the

2832

if the string "abc" is matched against the pattern (a|(z))(bc) the

2815

2833

return from the function is 4, and subpatterns 1 and 3 are matched, but

2816

2 is not. When this happens, both values in the offset pairs corre-

2834

2 is not. When this happens, both values in the offset pairs corre-

2817

2835

sponding to unused subpatterns are set to -1.
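A caller that reads ovector directly therefore has to test for -1 before
using a pair. The following sketch shows one way to do this; the helper
copy_group is hypothetical, and it assumes that rc is a positive return
value from a successful call. The convenience functions mentioned later
in this document serve the same purpose.

```c
#include <string.h>

/* Hypothetical helper: copies captured substring n into buf as a
   zero-terminated string. Returns its length, -1 if the substring is
   not set (or n is out of range), or -2 if buf is too small. */
int copy_group(const char *subject, const int *ovector, int rc,
               int n, char *buf, int bufsize)
{
    int start, end, len;

    if (n < 0 || n >= rc) return -1;      /* beyond the highest pair set */

    start = ovector[2 * n];               /* byte offset of first character */
    end = ovector[2 * n + 1];             /* byte offset just past the end */
    if (start < 0 || end < start) return -1;   /* unused subpattern */

    len = end - start;
    if (len >= bufsize) return -2;        /* no room for the terminator */

    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

With the subject "abc" and the pattern (a|(z))(bc), the offsets are
{0,3, 0,1, -1,-1, 1,3} and the return value is 4, so substrings 1 and 3
can be copied while substring 2 is reported as unset.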

2818

2836

2819

Offset values that correspond to unused subpatterns at the end of the
expression are also set to -1. For example, if the string "abc" is
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
matched. The return from the function is 2, because the highest used
capturing subpattern number is 1, and the offsets for the second
and third capturing subpatterns (assuming the vector is large enough,
of course) are set to -1.

2826

2844

2845

Note: Elements in the first two-thirds of ovector that do not corre-

2846

spond to capturing parentheses in the pattern are never changed. That

2847

is, if a pattern contains n capturing parentheses, no more than ovec-

2848

tor[0] to ovector[2n+1] are set by pcre_exec(). The other elements (in

2831

2849

the first two-thirds) retain whatever values they previously had.

2832

2850

2833

Some convenience functions are provided for extracting the captured

2851

Some convenience functions are provided for extracting the captured

2834

2852

substrings as separate strings. These are described below.

2835

2853

2836

2854

Error return values from pcre_exec()

2837

2855

2838

If pcre_exec() fails, it returns a negative number. The following are

2856

If pcre_exec() fails, it returns a negative number. The following are

2839

2857

defined in the header file:

2840

2858

2841

2859

PCRE_ERROR_NOMATCH (-1)

2844

2862

2845

2863

PCRE_ERROR_NULL (-2)

2846

2864

2847

Either code or subject was passed as NULL, or ovector was NULL and

2865

Either code or subject was passed as NULL, or ovector was NULL and

2848

2866

ovecsize was not zero.

2849

2867

2850

2868

PCRE_ERROR_BADOPTION (-3)

2853

2871

2854

2872

PCRE_ERROR_BADMAGIC (-4)

2855

2873

2856

PCRE stores a 4-byte "magic number" at the start of the compiled code,

2874

PCRE stores a 4-byte "magic number" at the start of the compiled code,

2857

2875

to catch the case when it is passed a junk pointer and to detect when a

2858

2876

pattern that was compiled in an environment of one endianness is run in

2859

an environment with the other endianness. This is the error that PCRE

2877

an environment with the other endianness. This is the error that PCRE

2860

2878

gives when the magic number is not present.

2861

2879

2862

2880

PCRE_ERROR_UNKNOWN_OPCODE (-5)

2863

2881

2864

2882

While running the pattern match, an unknown item was encountered in the

2865

compiled pattern. This error could be caused by a bug in PCRE or by

2883

compiled pattern. This error could be caused by a bug in PCRE or by

2866

2884

overwriting of the compiled pattern.

2867

2885

2868

2886

PCRE_ERROR_NOMEMORY (-6)

2869

2887

2870

If a pattern contains back references, but the ovector that is passed

2888

If a pattern contains back references, but the ovector that is passed

2871

2889

to pcre_exec() is not big enough to remember the referenced substrings,

2872

PCRE gets a block of memory at the start of matching to use for this

2873

purpose. If the call via pcre_malloc() fails, this error is given. The

2890

PCRE gets a block of memory at the start of matching to use for this

2891

purpose. If the call via pcre_malloc() fails, this error is given. The

2874

2892

memory is automatically freed at the end of matching.

2875

2893

2876

This error is also given if pcre_stack_malloc() fails in pcre_exec().

2877

This can happen only when PCRE has been compiled with --disable-stack-

2894

This error is also given if pcre_stack_malloc() fails in pcre_exec().

2895

This can happen only when PCRE has been compiled with --disable-stack-

2878

2896

for-recursion.

2879

2897

2880

2898

PCRE_ERROR_NOSUBSTRING (-7)

2881

2899

2882

This error is used by the pcre_copy_substring(), pcre_get_substring(),

2900

This error is used by the pcre_copy_substring(), pcre_get_substring(),

2883

2901

and pcre_get_substring_list() functions (see below). It is never

2884

2902

returned by pcre_exec().

2885

2903

2886

2904

PCRE_ERROR_MATCHLIMIT (-8)

2887

2905

2888

The backtracking limit, as specified by the match_limit field in a

2889

pcre_extra structure (or defaulted) was reached. See the description

2906

The backtracking limit, as specified by the match_limit field in a

2907

pcre_extra structure (or defaulted) was reached. See the description

2890

2908

above.

2891

2909

2892

2910

PCRE_ERROR_CALLOUT (-9)

2893

2911

2894

2912

This error is never generated by pcre_exec() itself. It is provided for

2895

use by callout functions that want to yield a distinctive error code.

2913

use by callout functions that want to yield a distinctive error code.

2896

2914

See the pcrecallout documentation for details.

2897

2915

2898

2916

PCRE_ERROR_BADUTF8 (-10)

2899

2917

2900

A string that contains an invalid UTF-8 byte sequence was passed as a
subject, and the PCRE_NO_UTF8_CHECK option was not set. If the size of
the output vector (ovecsize) is at least 2, the byte offset to the
start of the invalid UTF-8 character is placed in the first element,
and a reason code is placed in the second element. The reason codes
are listed in the following section. For backward compatibility, if
PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8
character at the end of the subject (reason codes 1 to 5),
PCRE_ERROR_SHORTUTF8 is returned instead of PCRE_ERROR_BADUTF8.

2909

2927

2910

2928

PCRE_ERROR_BADUTF8_OFFSET (-11)

2911

2929

2930

The UTF-8 byte sequence that was passed as a subject was checked and

2931

found to be valid (the PCRE_NO_UTF8_CHECK option was not set), but the

2932

value of startoffset did not point to the beginning of a UTF-8 charac-

2915

2933

ter or the end of the subject.

2916

2934

2917

2935

PCRE_ERROR_PARTIAL (-12)

2918

2936

2919

The subject string did not match, but it did match partially. See the

2937

The subject string did not match, but it did match partially. See the

2920

2938

pcrepartial documentation for details of partial matching.

2921

2939

2922

2940

PCRE_ERROR_BADPARTIAL (-13)

2923

2941

2942

This code is no longer in use. It was formerly returned when the

2943

PCRE_PARTIAL option was used with a compiled pattern containing items

2944

that were not supported for partial matching. From release 8.00

2927

2945

onwards, there are no restrictions on partial matching.

2928

2946

2929

2947

PCRE_ERROR_INTERNAL (-14)

2930

2948

2931

An unexpected internal error has occurred. This error could be caused

2949

An unexpected internal error has occurred. This error could be caused

2932

2950

by a bug in PCRE or by overwriting of the compiled pattern.

2933

2951

2934

2952

PCRE_ERROR_BADCOUNT (-15)

2938

2956

PCRE_ERROR_RECURSIONLIMIT (-21)

2939

2957

2940

2958

The internal recursion limit, as specified by the match_limit_recursion

2941

field in a pcre_extra structure (or defaulted) was reached. See the

2959

field in a pcre_extra structure (or defaulted) was reached. See the

2942

2960

description above.

2943

2961

2944

2962

PCRE_ERROR_BADNEWLINE (-23)

2952

2970

2953

2971

PCRE_ERROR_SHORTUTF8 (-25)

2954

2972

2973

This error is returned instead of PCRE_ERROR_BADUTF8 when the subject

2974

string ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD

2975

option is set. Information about the failure is returned as for

2976

PCRE_ERROR_BADUTF8. It is in fact sufficient to detect this case, but

2977

this special error code for PCRE_PARTIAL_HARD precedes the implementa-

2978

tion of returned information; it is retained for backwards compatibil-

2961

2979

ity.

2962

2980

2963

2981

PCRE_ERROR_RECURSELOOP (-26)

2964

2982

2965

2983

This error is returned when pcre_exec() detects a recursion loop within

2966

the pattern. Specifically, it means that either the whole pattern or a

2967

subpattern has been called recursively for the second time at the same

2984

the pattern. Specifically, it means that either the whole pattern or a

2985

subpattern has been called recursively for the second time at the same

2968

2986

position in the subject string. Some simple patterns that might do this

2969

are detected and faulted at compile time, but more complicated cases,

2987

are detected and faulted at compile time, but more complicated cases,

2970

2988

in particular mutual recursions between two different subpatterns, can-

2971

2989

not be detected until run time.

2972

2990

2973

2991

PCRE_ERROR_JIT_STACKLIMIT (-27)

2974

2992

2975

This error is returned when a pattern that was successfully studied

2976

using the PCRE_STUDY_JIT_COMPILE option is being matched, but the mem-

2977

ory available for the just-in-time processing stack is not large

2978

enough. See the pcrejit documentation for more details.

2993

This error is returned when a pattern that was successfully studied

2994

using a JIT compile option is being matched, but the memory available

2995

for the just-in-time processing stack is not large enough. See the

2996

pcrejit documentation for more details.

2979

2997

2980

PCRE_ERROR_BADMODE (-28)

2998

PCRE_ERROR_BADMODE (-28)

2981

2999

2982

3000

This error is given if a pattern that was compiled by the 8-bit library

2983

3001

is passed to a 16-bit library function, or vice versa.

2984

3002

2985

PCRE_ERROR_BADENDIANNESS (-29)

3003

PCRE_ERROR_BADENDIANNESS (-29)

2986

3004

2987

This error is given if a pattern that was compiled and saved is

2988

reloaded on a host with different endianness. The utility function

3005

This error is given if a pattern that was compiled and saved is

3006

reloaded on a host with different endianness. The utility function

2989

3007

pcre_pattern_to_host_byte_order() can be used to convert such a pattern

2990

3008

so that it runs on the new host.

2991

3009

2992

Error numbers -16 to -20 and -22 are not used by pcre_exec().

3010

Error numbers -16 to -20, -22, and -30 are not used by pcre_exec().
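
When logging a failed match it can help to turn the numeric return value back into a name. The following is only a sketch: exec_error_name() is a hypothetical helper, not part of PCRE, and it covers just the codes listed in this part of the section. Real code would normally compare against the macros in pcre.h; literal values are used here only so that the sketch has no library dependency.

```c
#include <stddef.h>

/* Hypothetical helper (not a PCRE function): map some pcre_exec() error
   numbers to their names. The values are those given in the text above. */
const char *exec_error_name(int rc)
{
    switch (rc) {
    case -13: return "PCRE_ERROR_BADPARTIAL";     /* no longer in use */
    case -14: return "PCRE_ERROR_INTERNAL";
    case -15: return "PCRE_ERROR_BADCOUNT";
    case -21: return "PCRE_ERROR_RECURSIONLIMIT";
    case -23: return "PCRE_ERROR_BADNEWLINE";
    case -25: return "PCRE_ERROR_SHORTUTF8";
    case -26: return "PCRE_ERROR_RECURSELOOP";
    case -27: return "PCRE_ERROR_JIT_STACKLIMIT";
    case -28: return "PCRE_ERROR_BADMODE";
    case -29: return "PCRE_ERROR_BADENDIANNESS";
    default:  return NULL;  /* not one of the codes covered by this sketch */
    }
}
```

A NULL result means only that the sketch does not know the code, not that the value is invalid.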

2993

3011

2994

3012

Reason codes for invalid UTF-8 strings

2995

3013

2996

This section applies only to the 8-bit library. The corresponding

3014

This section applies only to the 8-bit library. The corresponding

2997

3015

information for the 16-bit library is given in the pcre16 page.

2998

3016

2999

3017

When pcre_exec() returns either PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORT-

3000

UTF8, and the size of the output vector (ovecsize) is at least 2, the

3001

offset of the start of the invalid UTF-8 character is placed in the

3018

UTF8, and the size of the output vector (ovecsize) is at least 2, the

3019

offset of the start of the invalid UTF-8 character is placed in the

3002

3020

first output vector element (ovector[0]) and a reason code is placed in

3003

the second element (ovector[1]). The reason codes are given names in

3021

the second element (ovector[1]). The reason codes are given names in

3004

3022

the pcre.h header file:

3005

3023

3006

3024

PCRE_UTF8_ERR1

3009

3027

PCRE_UTF8_ERR4

3010

3028

PCRE_UTF8_ERR5

3011

3029

3012

The string ends with a truncated UTF-8 character; the code specifies

3013

how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8

3014

characters to be no longer than 4 bytes, the encoding scheme (origi-

3015

nally defined by RFC 2279) allows for up to 6 bytes, and this is

3030

The string ends with a truncated UTF-8 character; the code specifies

3031

how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8

3032

characters to be no longer than 4 bytes, the encoding scheme (origi-

3033

nally defined by RFC 2279) allows for up to 6 bytes, and this is

3016

3034

checked first; hence the possibility of 4 or 5 missing bytes.

3017

3035

3018

3036

PCRE_UTF8_ERR6

3022

3040

PCRE_UTF8_ERR10

3023

3041

3024

3042

The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of

3025

the character do not have the binary value 0b10 (that is, either the

3043

the character do not have the binary value 0b10 (that is, either the

3026

3044

most significant bit is 0, or the next bit is 1).

3027

3045

3028

3046

PCRE_UTF8_ERR11

3029

3047

PCRE_UTF8_ERR12

3030

3048

3031

A character that is valid by the RFC 2279 rules is either 5 or 6 bytes

3049

A character that is valid by the RFC 2279 rules is either 5 or 6 bytes

3032

3050

long; these code points are excluded by RFC 3629.

3033

3051

3034

3052

PCRE_UTF8_ERR13

3035

3053

3036

A 4-byte character has a value greater than 0x10ffff; these code points

3054

A 4-byte character has a value greater than 0x10ffff; these code points

3037

3055

are excluded by RFC 3629.

3038

3056

3039

3057

PCRE_UTF8_ERR14

3040

3058

3041

A 3-byte character has a value in the range 0xd800 to 0xdfff; this

3042

range of code points is reserved by RFC 3629 for use with UTF-16, and

3059

A 3-byte character has a value in the range 0xd800 to 0xdfff; this

3060

range of code points is reserved by RFC 3629 for use with UTF-16, and

3043

3061

so are excluded from UTF-8.

3044

3062

3045

3063

PCRE_UTF8_ERR15

3048

3066

PCRE_UTF8_ERR18

3049

3067

PCRE_UTF8_ERR19

3050

3068

3051

A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes

3052

for a value that can be represented by fewer bytes, which is invalid.

3053

For example, the two bytes 0xc0, 0xae give the value 0x2e, whose cor-

3069

A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes

3070

for a value that can be represented by fewer bytes, which is invalid.

3071

For example, the two bytes 0xc0, 0xae give the value 0x2e, whose cor-

3054

3072

rect coding uses just one byte.
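
The arithmetic behind the 0xc0, 0xae example can be checked with a few lines of C. This is an illustrative sketch only; the function names are hypothetical and this is not how PCRE itself performs the check. A 2-byte sequence carries five payload bits in the first byte and six in the second.

```c
/* Decode a 2-byte UTF-8 sequence without validating it:
   110xxxxx 10yyyyyy  ->  xxxxxyyyyyy */
unsigned decode_utf8_2byte(unsigned char b0, unsigned char b1)
{
    return ((unsigned)(b0 & 0x1f) << 6) | (unsigned)(b1 & 0x3f);
}

/* A 2-byte sequence is overlong when its value would fit in a single
   byte, that is, when the value is less than 0x80. */
int is_overlong_2byte(unsigned char b0, unsigned char b1)
{
    return decode_utf8_2byte(b0, b1) < 0x80;
}
```

For the bytes 0xc0, 0xae the first byte contributes no payload bits and the second contributes 0x2e, so the decoded value is 0x2e and the sequence is overlong.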

3055

3073

3056

3074

PCRE_UTF8_ERR20

3057

3075

3058

3076

The two most significant bits of the first byte of a character have the

3059

binary value 0b10 (that is, the most significant bit is 1 and the sec-

3060

ond is 0). Such a byte can only validly occur as the second or subse-

3077

binary value 0b10 (that is, the most significant bit is 1 and the sec-

3078

ond is 0). Such a byte can only validly occur as the second or subse-

3061

3079

quent byte of a multi-byte character.

3062

3080

3063

3081

PCRE_UTF8_ERR21

3064

3082

3065

The first byte of a character has the value 0xfe or 0xff. These values

3083

The first byte of a character has the value 0xfe or 0xff. These values

3066

3084

can never occur in a valid UTF-8 string.
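
Several of the reason codes above depend only on the bit pattern of a single byte. The sketch below classifies a first byte under the original RFC 2279 scheme (up to 6 bytes), which is the scheme the text says is checked first. The function names and the two negative return values are hypothetical; they are not the PCRE_UTF8_ERRn values from pcre.h.

```c
enum {
    LEAD_IS_CONTINUATION = -1,  /* 10xxxxxx: valid only as a 2nd or later byte */
    LEAD_NEVER_VALID     = -2   /* 0xfe or 0xff */
};

/* Return the total length (1 to 6) announced by a first byte, or one of
   the negative values above if the byte cannot start a character. */
int utf8_lead_length(unsigned char b)
{
    if (b < 0x80) return 1;                     /* 0xxxxxxx */
    if (b < 0xc0) return LEAD_IS_CONTINUATION;  /* 10xxxxxx */
    if (b < 0xe0) return 2;                     /* 110xxxxx */
    if (b < 0xf0) return 3;                     /* 1110xxxx */
    if (b < 0xf8) return 4;                     /* 11110xxx */
    if (b < 0xfc) return 5;                     /* 111110xx */
    if (b < 0xfe) return 6;                     /* 1111110x */
    return LEAD_NEVER_VALID;                    /* 0xfe, 0xff */
}

/* Second and subsequent bytes must have 0b10 as their top two bits. */
int utf8_is_continuation(unsigned char b)
{
    return (b & 0xc0) == 0x80;
}
```

If a subject ends immediately after a first byte for which this returns 6, five bytes are missing, which is why a truncation count of up to 5 is possible.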

3067

3085

3068

3086

3079

3097

int pcre_get_substring_list(const char *subject,

3080

3098

int *ovector, int stringcount, const char ***listptr);

3081

3099

3082

Captured substrings can be accessed directly by using the offsets

3083

returned by pcre_exec() in ovector. For convenience, the functions

3100

Captured substrings can be accessed directly by using the offsets

3101

returned by pcre_exec() in ovector. For convenience, the functions

3084

3102

pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub-

3085

string_list() are provided for extracting captured substrings as new,

3086

separate, zero-terminated strings. These functions identify substrings

3087

by number. The next section describes functions for extracting named

3103

string_list() are provided for extracting captured substrings as new,

3104

separate, zero-terminated strings. These functions identify substrings

3105

by number. The next section describes functions for extracting named

3088

3106

substrings.

3089

3107

3090

A substring that contains a binary zero is correctly extracted and has

3091

a further zero added on the end, but the result is not, of course, a C

3092

string. However, you can process such a string by referring to the

3093

length that is returned by pcre_copy_substring() and pcre_get_sub-

3108

A substring that contains a binary zero is correctly extracted and has

3109

a further zero added on the end, but the result is not, of course, a C

3110

string. However, you can process such a string by referring to the

3111

length that is returned by pcre_copy_substring() and pcre_get_sub-

3094

3112

string(). Unfortunately, the interface to pcre_get_substring_list() is

3095

not adequate for handling strings containing binary zeros, because the

3113

not adequate for handling strings containing binary zeros, because the

3096

3114

end of the final string is not independently indicated.

3097

3115

3098

The first three arguments are the same for all three of these func-

3099

tions: subject is the subject string that has just been successfully

3116

The first three arguments are the same for all three of these func-

3117

tions: subject is the subject string that has just been successfully

3100

3118

matched, ovector is a pointer to the vector of integer offsets that was

3101

3119

passed to pcre_exec(), and stringcount is the number of substrings that

3102

were captured by the match, including the substring that matched the

3120

were captured by the match, including the substring that matched the

3103

3121

entire regular expression. This is the value returned by pcre_exec() if

3104

it is greater than zero. If pcre_exec() returned zero, indicating that

3105

it ran out of space in ovector, the value passed as stringcount should

3122

it is greater than zero. If pcre_exec() returned zero, indicating that

3123

it ran out of space in ovector, the value passed as stringcount should

3106

3124

be the number of elements in the vector divided by three.

3107

3125

3108

The functions pcre_copy_substring() and pcre_get_substring() extract a

3109

single substring, whose number is given as stringnumber. A value of

3110

zero extracts the substring that matched the entire pattern, whereas

3111

higher values extract the captured substrings. For pcre_copy_sub-

3112

string(), the string is placed in buffer, whose length is given by

3113

buffersize, while for pcre_get_substring() a new block of memory is

3114

obtained via pcre_malloc, and its address is returned via stringptr.

3115

The yield of the function is the length of the string, not including

3126

The functions pcre_copy_substring() and pcre_get_substring() extract a

3127

single substring, whose number is given as stringnumber. A value of

3128

zero extracts the substring that matched the entire pattern, whereas

3129

higher values extract the captured substrings. For pcre_copy_sub-

3130

string(), the string is placed in buffer, whose length is given by

3131

buffersize, while for pcre_get_substring() a new block of memory is

3132

obtained via pcre_malloc, and its address is returned via stringptr.

3133

The yield of the function is the length of the string, not including

3116

3134

the terminating zero, or one of these error codes:

3117

3135

3118

3136

PCRE_ERROR_NOMEMORY (-6)

3119

3137

3120

The buffer was too small for pcre_copy_substring(), or the attempt to

3138

The buffer was too small for pcre_copy_substring(), or the attempt to

3121

3139

get memory failed for pcre_get_substring().

3122

3140

3123

3141

PCRE_ERROR_NOSUBSTRING (-7)

3124

3142

3125

3143

There is no substring whose number is stringnumber.

3126

3144

3127

The pcre_get_substring_list() function extracts all available sub-

3128

strings and builds a list of pointers to them. All this is done in a

3145

The pcre_get_substring_list() function extracts all available sub-

3146

strings and builds a list of pointers to them. All this is done in a

3129

3147

single block of memory that is obtained via pcre_malloc. The address of

3130

the memory block is returned via listptr, which is also the start of

3131

the list of string pointers. The end of the list is marked by a NULL

3132

pointer. The yield of the function is zero if all went well, or the

3148

the memory block is returned via listptr, which is also the start of

3149

the list of string pointers. The end of the list is marked by a NULL

3150

pointer. The yield of the function is zero if all went well, or the

3133

3151

error code

3134

3152

3135

3153

PCRE_ERROR_NOMEMORY (-6)

3136

3154

3137

3155

if the attempt to get the memory block failed.

3138

3156

3139

When any of these functions encounter a substring that is unset, which

3140

can happen when capturing subpattern number n+1 matches some part of

3141

the subject, but subpattern n has not been used at all, they return an

3157

When any of these functions encounter a substring that is unset, which

3158

can happen when capturing subpattern number n+1 matches some part of

3159

the subject, but subpattern n has not been used at all, they return an

3142

3160

empty string. This can be distinguished from a genuine zero-length sub-

3143

string by inspecting the appropriate offset in ovector, which is nega-

3161

string by inspecting the appropriate offset in ovector, which is nega-

3144

3162

tive for unset substrings.
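
To make the role of ovector concrete, here is a sketch of extracting substring number n by hand. It assumes the usual pcre_exec() layout, in which ovector[2*n] and ovector[2*n+1] hold the start and end offsets of substring n. All the sketch_ names are hypothetical, and the code only imitates the documented behaviour of pcre_copy_substring(); it is not the library's implementation.

```c
#include <string.h>

enum { SKETCH_NOMEMORY = -6, SKETCH_NOSUBSTRING = -7 };

/* Value to pass as stringcount: the pcre_exec() result, or one third of
   the vector size if pcre_exec() returned 0. The caller is assumed to
   have checked already that rc is not negative. */
int sketch_stringcount(int rc, int ovecsize)
{
    return rc == 0 ? ovecsize / 3 : rc;
}

/* An unset substring has a negative offset in ovector. */
int sketch_is_unset(const int *ovector, int n)
{
    return ovector[2 * n] < 0;
}

/* Copy substring n into buffer as a zero-terminated string and return
   its length, or a negative error code. */
int sketch_copy_substring(const char *subject, const int *ovector,
                          int stringcount, int n,
                          char *buffer, int buffersize)
{
    int start, len;

    if (n < 0 || n >= stringcount) return SKETCH_NOSUBSTRING;

    start = ovector[2 * n];
    len = (start < 0) ? 0 : ovector[2 * n + 1] - start;  /* unset -> empty */

    if (buffersize < len + 1) return SKETCH_NOMEMORY;    /* room for the zero */
    if (len > 0) memcpy(buffer, subject + start, (size_t)len);
    buffer[len] = '\0';
    return len;
}
```

Because an unset substring and a genuine zero-length substring both yield length 0, a caller that cares about the difference checks sketch_is_unset() first.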

3145

3163

3146

The two convenience functions pcre_free_substring() and pcre_free_sub-

3147

string_list() can be used to free the memory returned by a previous

3164

The two convenience functions pcre_free_substring() and pcre_free_sub-

3165

string_list() can be used to free the memory returned by a previous

3148

3166

call of pcre_get_substring() or pcre_get_substring_list(), respec-

3149

tively. They do nothing more than call the function pointed to by

3150

pcre_free, which of course could be called directly from a C program.

3151

However, PCRE is used in some situations where it is linked via a spe-

3152

cial interface to another programming language that cannot use

3153

pcre_free directly; it is for these cases that the functions are pro-

3167

tively. They do nothing more than call the function pointed to by

3168

pcre_free, which of course could be called directly from a C program.

3169

However, PCRE is used in some situations where it is linked via a spe-

3170

cial interface to another programming language that cannot use

3171

pcre_free directly; it is for these cases that the functions are pro-

3154

3172

vided.

3155

3173

3156

3174

3169

3187

int stringcount, const char *stringname,

3170

3188

const char **stringptr);

3171

3189

3172

To extract a substring by name, you first have to find the associated

3190

To extract a substring by name, you first have to find the associated

3173

3191

number. For example, for this pattern

3174

3192

3175

3193

(a+)b(?<xxx>\d+)...

3178

3196

be unique (PCRE_DUPNAMES was not set), you can find the number from the

3179

3197

name by calling pcre_get_stringnumber(). The first argument is the com-

3180

3198

piled pattern, and the second is the name. The yield of the function is

3181

the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no

3199

the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no

3182

3200

subpattern of that name.

3183

3201

3184

3202

Given the number, you can extract the substring directly, or use one of

3185

3203

the functions described in the previous section. For convenience, there

3186

3204

are also two functions that do the whole job.

3187

3205

3188

Most of the arguments of pcre_copy_named_substring() and

3189

pcre_get_named_substring() are the same as those for the similarly

3190

named functions that extract by number. As these are described in the

3191

previous section, they are not re-described here. There are just two

3206

Most of the arguments of pcre_copy_named_substring() and

3207

pcre_get_named_substring() are the same as those for the similarly

3208

named functions that extract by number. As these are described in the

3209

previous section, they are not re-described here. There are just two

3192

3210

differences:

3193

3211

3194

First, instead of a substring number, a substring name is given. Sec-

3212

First, instead of a substring number, a substring name is given. Sec-

3195

3213

ond, there is an extra argument, given at the start, which is a pointer

3196

to the compiled pattern. This is needed in order to gain access to the

3214

to the compiled pattern. This is needed in order to gain access to the

3197

3215

name-to-number translation table.

3198

3216

3199

These functions call pcre_get_stringnumber(), and if it succeeds, they

3200

then call pcre_copy_substring() or pcre_get_substring(), as appropri-

3201

ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the

3217

These functions call pcre_get_stringnumber(), and if it succeeds, they

3218

then call pcre_copy_substring() or pcre_get_substring(), as appropri-

3219

ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the

3202

3220

behaviour may not be what you want (see the next section).

3203

3221

3204

3222

Warning: If the pattern uses the (?| feature to set up multiple subpat-

3205

terns with the same number, as described in the section on duplicate

3206

subpattern numbers in the pcrepattern page, you cannot use names to

3207

distinguish the different subpatterns, because names are not included

3208

in the compiled code. The matching process uses only numbers. For this

3209

reason, the use of different names for subpatterns of the same number

3223

terns with the same number, as described in the section on duplicate

3224

subpattern numbers in the pcrepattern page, you cannot use names to

3225

distinguish the different subpatterns, because names are not included

3226

in the compiled code. The matching process uses only numbers. For this

3227

reason, the use of different names for subpatterns of the same number

3210

3228

causes an error at compile time.

3211

3229

3212

3230

3215

3233

int pcre_get_stringtable_entries(const pcre *code,

3216

3234

const char *name, char **first, char **last);

3217

3235

3218

When a pattern is compiled with the PCRE_DUPNAMES option, names for

3219

subpatterns are not required to be unique. (Duplicate names are always

3220

allowed for subpatterns with the same number, created by using the (?|

3221

feature. Indeed, if such subpatterns are named, they are required to

3236

When a pattern is compiled with the PCRE_DUPNAMES option, names for

3237

subpatterns are not required to be unique. (Duplicate names are always

3238

allowed for subpatterns with the same number, created by using the (?|

3239

feature. Indeed, if such subpatterns are named, they are required to

3222

3240

use the same names.)

3223

3241

3224

3242

Normally, patterns with duplicate names are such that in any one match,

3225

only one of the named subpatterns participates. An example is shown in

3243

only one of the named subpatterns participates. An example is shown in

3226

3244

the pcrepattern documentation.

3227

3245

3228

When duplicates are present, pcre_copy_named_substring() and

3229

pcre_get_named_substring() return the first substring corresponding to

3230

the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING

3231

(-7) is returned; no data is returned. The pcre_get_stringnumber()

3232

function returns one of the numbers that are associated with the name,

3246

When duplicates are present, pcre_copy_named_substring() and

3247

pcre_get_named_substring() return the first substring corresponding to

3248

the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING

3249

(-7) is returned; no data is returned. The pcre_get_stringnumber()

3250

function returns one of the numbers that are associated with the name,

3233

3251

but it is not defined which it is.

3234

3252

3235

If you want to get full details of all captured substrings for a given

3236

name, you must use the pcre_get_stringtable_entries() function. The

3253

If you want to get full details of all captured substrings for a given

3254

name, you must use the pcre_get_stringtable_entries() function. The

3237

3255

first argument is the compiled pattern, and the second is the name. The

3238

third and fourth are pointers to variables which are updated by the

3256

third and fourth are pointers to variables which are updated by the

3239

3257

function. After it has run, they point to the first and last entries in

3240

the name-to-number table for the given name. The function itself

3241

returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if

3242

there are none. The format of the table is described in the section

3243

entitled Information about a pattern above. Given all the rele-

3244

vant entries for the name, you can extract each of their numbers, and

3258

the name-to-number table for the given name. The function itself

3259

returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if

3260

there are none. The format of the table is described in the section

3261

entitled Information about a pattern above. Given all the rele-

3262

vant entries for the name, you can extract each of their numbers, and

3245

3263

hence the captured data, if any.
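
The last step, turning the range of table entries into subpattern numbers, might look like the sketch below. It assumes the 8-bit library's name table layout, in which each fixed-size entry starts with the subpattern number in two bytes, most significant byte first, followed by the zero-terminated name. The function sketch_group_numbers() is hypothetical; first, last and entrysize stand for the two pointers set by pcre_get_stringtable_entries() and the value it returns.

```c
/* Collect the subpattern numbers of all entries from first to last
   (inclusive), stepping entrysize bytes at a time. At most max numbers
   are stored; the return value is how many were stored. */
int sketch_group_numbers(const char *first, const char *last,
                         int entrysize, int *numbers, int max)
{
    int count = 0;
    const char *entry;

    for (entry = first; entry <= last && count < max; entry += entrysize) {
        const unsigned char *e = (const unsigned char *)entry;
        numbers[count++] = (e[0] << 8) | e[1];  /* big-endian 16-bit number */
    }
    return count;
}
```

Each number found this way can then be used as an index into ovector (offsets 2*n and 2*n+1) to see whether that particular subpattern was set in the match.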

3246

3264

3247

3265

3248

3266

FINDING ALL POSSIBLE MATCHES

3249

3267

3250

The traditional matching function uses a similar algorithm to Perl,

3268

The traditional matching function uses a similar algorithm to Perl,

3251

3269

which stops when it finds the first match, starting at a given point in

3252

the subject. If you want to find all possible matches, or the longest

3253

possible match, consider using the alternative matching function (see

3254

below) instead. If you cannot use the alternative function, but still

3255

need to find all possible matches, you can kludge it up by making use

3270

the subject. If you want to find all possible matches, or the longest

3271

possible match, consider using the alternative matching function (see

3272

below) instead. If you cannot use the alternative function, but still

3273

need to find all possible matches, you can kludge it up by making use

3256

3274

of the callout facility, which is described in the pcrecallout documen-

3257

3275

tation.

3258

3276

3259

3277

What you have to do is to insert a callout right at the end of the pat-

3260

tern. When your callout function is called, extract and save the cur-

3261

rent matched substring. Then return 1, which forces pcre_exec() to

3262

backtrack and try other alternatives. Ultimately, when it runs out of

3278

tern. When your callout function is called, extract and save the cur-

3279

rent matched substring. Then return 1, which forces pcre_exec() to

3280

backtrack and try other alternatives. Ultimately, when it runs out of

3263

3281

matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.

3264

3282

3265

3283

3266

3284

OBTAINING AN ESTIMATE OF STACK USAGE

3267

3285

3268

Matching certain patterns using pcre_exec() can use a lot of process

3269

stack, which in certain environments can be rather limited in size.

3270

Some users find it helpful to have an estimate of the amount of stack

3271

that is used by pcre_exec(), to help them set recursion limits, as

3272

described in the pcrestack documentation. The estimate that is output

3286

Matching certain patterns using pcre_exec() can use a lot of process

3287

stack, which in certain environments can be rather limited in size.

3288

Some users find it helpful to have an estimate of the amount of stack

3289

that is used by pcre_exec(), to help them set recursion limits, as

3290

described in the pcrestack documentation. The estimate that is output

3273

3291

by pcretest when called with the -m and -C options is obtained by call-

3274

ing pcre_exec with the values NULL, NULL, NULL, -999, and -999 for its

3292

ing pcre_exec with the values NULL, NULL, NULL, -999, and -999 for its

3275

3293

first five arguments.

3276

3294

3277

Normally, if its first argument is NULL, pcre_exec() immediately

3278

returns the negative error code PCRE_ERROR_NULL, but with this special

3279

combination of arguments, it returns instead a negative number whose

3280

absolute value is the approximate stack frame size in bytes. (A nega-

3281

tive number is used so that it is clear that no match has happened.)

3282

The value is approximate because in some cases, recursive calls to

3295

Normally, if its first argument is NULL, pcre_exec() immediately

3296

returns the negative error code PCRE_ERROR_NULL, but with this special

3297

combination of arguments, it returns instead a negative number whose

3298

absolute value is the approximate stack frame size in bytes. (A nega-

3299

tive number is used so that it is clear that no match has happened.)

3300

The value is approximate because in some cases, recursive calls to

3283

3301

pcre_exec() occur when there are one or two additional variables on the

3284

3302

stack.

3285

3303

3286

If PCRE has been compiled to use the heap instead of the stack for

3287

recursion, the value returned is the size of each block that is

3304

If PCRE has been compiled to use the heap instead of the stack for

3305

recursion, the value returned is the size of each block that is

3288

3306

obtained from the heap.

3289

3307

3290

3308

3295

3313

int options, int *ovector, int ovecsize,

3296

3314

int *workspace, int wscount);

3297

3315

3298

The function pcre_dfa_exec() is called to match a subject string

3299

against a compiled pattern, using a matching algorithm that scans the

3300

subject string just once, and does not backtrack. This has different

3301

characteristics to the normal algorithm, and is not compatible with

3302

Perl. Some of the features of PCRE patterns are not supported. Never-

3303

theless, there are times when this kind of matching can be useful. For

3304

a discussion of the two matching algorithms, and a list of features

3305

that pcre_dfa_exec() does not support, see the pcrematching documenta-

3316

The function pcre_dfa_exec() is called to match a subject string

3317

against a compiled pattern, using a matching algorithm that scans the

3318

subject string just once, and does not backtrack. This has different

3319

characteristics to the normal algorithm, and is not compatible with

3320

Perl. Some of the features of PCRE patterns are not supported. Never-

3321

theless, there are times when this kind of matching can be useful. For

3322

a discussion of the two matching algorithms, and a list of features

3323

that pcre_dfa_exec() does not support, see the pcrematching documenta-

3306

3324

tion.

3307

3325

3308

The arguments for the pcre_dfa_exec() function are the same as for

3326

The arguments for the pcre_dfa_exec() function are the same as for

3309

3327

pcre_exec(), plus two extras. The ovector argument is used in a differ-

3310

ent way, and this is described below. The other common arguments are

3311

used in the same way as for pcre_exec(), so their description is not

3328

ent way, and this is described below. The other common arguments are

3329

used in the same way as for pcre_exec(), so their description is not

3312

3330

repeated here.

3313

3331

3314

The two additional arguments provide workspace for the function. The

3315

workspace vector should contain at least 20 elements. It is used for

3332

The two additional arguments provide workspace for the function. The

3333

workspace vector should contain at least 20 elements. It is used for

3316

3334

keeping track of multiple paths through the pattern tree. More

3317

workspace will be needed for patterns and subjects where there are a

3335

workspace will be needed for patterns and subjects where there are a

3318

3336

lot of potential matches.

3319

3337

3320

3338

Here is an example of a simple call to pcre_dfa_exec():

3336

3354

3337

3355

Option bits for pcre_dfa_exec()

3338

3356

3339

The unused bits of the options argument for pcre_dfa_exec() must be

3340

zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW-

3357

The unused bits of the options argument for pcre_dfa_exec() must be

3358

zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW-

3341

3359

LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,

3342

PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_BSR_ANYCRLF,

3343

PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD, PCRE_PAR-

3344

TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last

3345

four of these are exactly the same as for pcre_exec(), so their

3360

PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_BSR_ANYCRLF,

3361

PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD, PCRE_PAR-

3362

TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last

3363

four of these are exactly the same as for pcre_exec(), so their

3346

3364

description is not repeated here.

3347

3365

3348

3366

PCRE_PARTIAL_HARD

3349

3367

PCRE_PARTIAL_SOFT

3350

3368

3351

These have the same general effect as they do for pcre_exec(), but the

3352

details are slightly different. When PCRE_PARTIAL_HARD is set for

3353

pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of the sub-

3354

ject is reached and there is still at least one matching possibility

3369

These have the same general effect as they do for pcre_exec(), but the

3370

details are slightly different. When PCRE_PARTIAL_HARD is set for

3371

pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of the sub-

3372

ject is reached and there is still at least one matching possibility

3355

3373

that requires additional characters. This happens even if some complete

3356

3374

matches have also been found. When PCRE_PARTIAL_SOFT is set, the return

3357

3375

code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end

3358

of the subject is reached, there have been no complete matches, but

3359

there is still at least one matching possibility. The portion of the

3360

string that was inspected when the longest partial match was found is

3361

set as the first matching string in both cases. There is a more

3362

detailed discussion of partial and multi-segment matching, with exam-

3376

of the subject is reached, there have been no complete matches, but

3377

there is still at least one matching possibility. The portion of the

3378

string that was inspected when the longest partial match was found is

3379

set as the first matching string in both cases. There is a more

3380

detailed discussion of partial and multi-segment matching, with exam-

3363

3381

ples, in the pcrepartial documentation.

3364

3382

3365

3383

PCRE_DFA_SHORTEST

3366

3384

3367

Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to

3385

Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to

3368

3386

stop as soon as it has found one match. Because of the way the alterna-

3369

tive algorithm works, this is necessarily the shortest possible match

3387

tive algorithm works, this is necessarily the shortest possible match

3370

3388

at the first possible matching point in the subject string.

3371

3389

3372

3390

PCRE_DFA_RESTART

3373

3391

3374

3392

When pcre_dfa_exec() returns a partial match, it is possible to call it
again, with additional subject characters, and have it continue with
the same match. The PCRE_DFA_RESTART option requests this action; when
it is set, the workspace and wscount arguments must reference the same
vector as before because data about the match so far is left in them
after a partial match. There is more discussion of this facility in the
pcrepartial documentation.

3381

3399

3382

3400

Successful returns from pcre_dfa_exec()

3383

3401

3384

When pcre_dfa_exec() succeeds, it may have matched more than one
substring in the subject. Note, however, that all the matches from one
run of the function start at the same point in the subject. The shorter
matches are all initial substrings of the longer matches. For example,
if the pattern

3389

3407

3390

3408

<.*>

3399

3417

3400

3418

3401

3419

3402

On success, the yield of the function is a number greater than zero,
which is the number of matched substrings. The substrings themselves
are returned in ovector. Each string uses two elements; the first is
the offset to the start, and the second is the offset to the end. In
fact, all the strings have the same start offset. (Space could have
been saved by giving this only once, but it was decided to retain some
compatibility with the way pcre_exec() returns data, even though the
meaning of the strings is different.)

3410

3428

3411

3429

The strings are returned in reverse order of length; that is, the
longest matching string is given first. If there were too many matches
to fit into ovector, the yield of the function is zero, and the vector
is filled with the longest matches. Unlike pcre_exec(), pcre_dfa_exec()
can use the entire ovector for returning matched strings.
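
The shape of a successful result can be illustrated outside PCRE. The
sketch below is not pcre_dfa_exec(): it assumes Python's re module as a
stand-in, and all_matches_at_first_start() is a hypothetical helper
written for this illustration. It lists every match that begins at the
first matching point as (start, end) offset pairs, longest first, which
is the ordering described above. The subject string is also made up for
the example.

```python
import re


def all_matches_at_first_start(pattern, subject):
    """Hypothetical helper that mimics the layout of a DFA match result.

    Returns (start, end) offset pairs for every match that begins at the
    first matching point in the subject, ordered longest first.
    """
    compiled = re.compile(pattern)
    first = compiled.search(subject)
    if first is None:
        return []  # loosely corresponds to PCRE_ERROR_NOMATCH
    start = first.start()
    pairs = []
    # Try candidate end offsets from the longest to the shortest.
    for end in range(len(subject), start - 1, -1):
        if compiled.fullmatch(subject, start, end):
            pairs.append((start, end))
    return pairs


subject = "x <a> <b> <c> y"
pairs = all_matches_at_first_start(r"<.*>", subject)
matched = [subject[s:e] for s, e in pairs]
print(pairs)
print(matched)
```

Every pair shares the same start offset, and each shorter string is an
initial substring of the longer ones.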

3416

3434

3417

3435

Error returns from pcre_dfa_exec()

3418

3436

3419

The pcre_dfa_exec() function returns a negative number when it fails.
Many of the errors are the same as for pcre_exec(), and these are
described above. There are in addition the following errors that are
specific to pcre_dfa_exec():

3423

3441

3424

3442

PCRE_ERROR_DFA_UITEM (-16)

3425

3443

3426

This return is given if pcre_dfa_exec() encounters an item in the
pattern that it does not support, for instance, the use of \C or a back
reference.

3429

3447

3430

3448

PCRE_ERROR_DFA_UCOND (-17)

3431

3449

3432

This return is given if pcre_dfa_exec() encounters a condition item
that uses a back reference for the condition, or a test for recursion
in a specific group. These are not supported.

3435

3453

3436

3454

PCRE_ERROR_DFA_UMLIMIT (-18)

3437

3455

3438

This return is given if pcre_dfa_exec() is called with an extra block
that contains a setting of the match_limit or match_limit_recursion
fields. This is not supported (these fields are meaningless for DFA
matching).

3442

3460

3443

3461

PCRE_ERROR_DFA_WSSIZE (-19)

3444

3462

3445

This return is given if pcre_dfa_exec() runs out of space in the
workspace vector.

3447

3465

3448

3466

PCRE_ERROR_DFA_RECURSE (-20)

3449

3467

3450

When a recursive subpattern is processed, the matching function calls
itself recursively, using private vectors for ovector and workspace.
This error is given if the output vector is not large enough. This
should be extremely rare, as a vector of size 1000 is used.

3454

3472

3473

PCRE_ERROR_DFA_BADRESTART (-30)

3474

3475

When pcre_dfa_exec() is called with the PCRE_DFA_RESTART option, some
plausibility checks are made on the contents of the workspace, which
should contain data about the previous partial match. If any of these
checks fail, this error is given.
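
As a quick reference, the return values described above can be gathered
into a lookup table. This is only a sketch in Python, not part of the
PCRE API: the numeric codes are the ones listed in this section, while
DFA_ERRORS and describe_dfa_return() are hypothetical names chosen for
the illustration.

```python
# Error codes specific to pcre_dfa_exec(), as listed in the text above.
DFA_ERRORS = {
    -16: "PCRE_ERROR_DFA_UITEM",
    -17: "PCRE_ERROR_DFA_UCOND",
    -18: "PCRE_ERROR_DFA_UMLIMIT",
    -19: "PCRE_ERROR_DFA_WSSIZE",
    -20: "PCRE_ERROR_DFA_RECURSE",
    -30: "PCRE_ERROR_DFA_BADRESTART",
}


def describe_dfa_return(rc):
    """Hypothetical helper: turn a return value into a short description."""
    if rc > 0:
        return f"{rc} matched substring(s) in ovector"
    if rc == 0:
        return "too many matches for ovector; longest matches returned"
    # Negative values shared with pcre_exec() are not listed here.
    return DFA_ERRORS.get(rc, f"error shared with pcre_exec() ({rc})")


for rc in (3, 0, -19, -30):
    print(rc, describe_dfa_return(rc))
```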

3479

3455

3480

3456

3481

SEE ALSO

3457

3482

3469

3494

3470

3495

REVISION

3471

3496

3472

Last updated: 21 January 2012

3497

Last updated: 17 June 2012

3473

3498

3474

3499

------------------------------------------------------------------------------

3475

3500

3761

3786

There is a discussion that explains these differences in more detail in

3762

3787

the section on recursion differences from Perl in the pcrepattern page.

3763

3788

3764

11. If (*THEN) is present in a group that is called as a subroutine,

3765

its action is limited to that group, even if the group does not contain

3766

any | characters.

3789

11. If any of the backtracking control verbs are used in an assertion

3790

or in a subpattern that is called as a subroutine (whether or not

3791

recursively), their effect is confined to that subpattern; it does not

3792

extend to the surrounding pattern. This is not always the case in Perl.

3793

In particular, if (*THEN) is present in a group that is called as a

3794

subroutine, its action is limited to that group, even if the group does

3795

not contain any | characters. There is one exception to this: the name

3796

from a (*MARK), (*PRUNE), or (*THEN) that is encountered in a
successful positive assertion is passed back when a match succeeds (compare

3798

capturing parentheses in assertions). Note that such subpatterns are

3799

processed as anchored at the point where they are tested.

3767

3800

3768

3801

12. There are some differences that are concerned with the settings of

3769

3802

captured strings when part of a pattern is repeated. For example,

3783

3816

3784

3817

14. Perl recognizes comments in some places that PCRE does not, for

3785

3818

example, between the ( and ? at the start of a subpattern. If the /x

3786

modifier is set, Perl allows whitespace between ( and ? but PCRE never

3819

modifier is set, Perl allows white space between ( and ? but PCRE never

3787

3820

does, even if the PCRE_EXTENDED option is set.

3788

3821

3789

3822

15. PCRE provides some extensions to the Perl regular expression facil-

3843

3876

3844

3877

REVISION

3845

3878

3846

Last updated: 08 January 2012

3879

Last updated: 01 June 2012

3847

3880

3848

3881

------------------------------------------------------------------------------

3849

3882

4029

4062

after a backslash. All other characters (in particular, those whose

4030

4063

codepoints are greater than 127) are treated as literals.

4031

4064

4032

If a pattern is compiled with the PCRE_EXTENDED option, whitespace in

4065

If a pattern is compiled with the PCRE_EXTENDED option, white space in

4033

4066

the pattern (other than in a character class) and characters between a

4034

4067

# outside a character class and the next newline are ignored. An escap-

4035

ing backslash can be used to include a whitespace or # character as

4068

ing backslash can be used to include a white space or # character as

4036

4069

part of the pattern.

4037

4070

4038

4071

If you want to remove the special meaning from a sequence of charac-

4067

4100

\a alarm, that is, the BEL character (hex 07)

4068

4101

\cx "control-x", where x is any ASCII character

4069

4102

\e escape (hex 1B)

4070

\f formfeed (hex 0C)

4103

\f form feed (hex 0C)

4071

4104

\n linefeed (hex 0A)

4072

4105

\r carriage return (hex 0D)

4073

4106

\t tab (hex 09)

4109

4142

its. Otherwise, it matches a literal "x" character. In JavaScript

4110

4143

mode, support for code points greater than 256 is provided by \u, which

4111

4144

must be followed by four hexadecimal digits; otherwise it matches a

4112

literal "u" character.

4145

literal "u" character. Character codes specified by \u in JavaScript

4146

mode are constrained in the same way as those specified by \x in
non-JavaScript mode.

4113

4148

4114

4149

Characters whose value is less than 256 can be defined by either of the

4115

4150

two syntaxes for \x (or by \u in JavaScript mode). There is no differ-

4196

4231

4197

4232

\d any decimal digit

4198

4233

\D any character that is not a decimal digit

4199

\h any horizontal whitespace character

4200

\H any character that is not a horizontal whitespace character

4201

\s any whitespace character

4202

\S any character that is not a whitespace character

4203

\v any vertical whitespace character

4204

\V any character that is not a vertical whitespace character

4234

\h any horizontal white space character

4235

\H any character that is not a horizontal white space character

4236

\s any white space character

4237

\S any character that is not a white space character

4238

\v any vertical white space character

4239

\V any character that is not a vertical white space character

4205

4240

\w any "word" character

4206

4241

\W any "non-word" character

4207

4242

4281

4316

4282

4317

U+000A Linefeed

4283

4318

U+000B Vertical tab

4284

U+000C Formfeed

4319

U+000C Form feed

4285

4320

U+000D Carriage return

4286

4321

U+0085 Next line

4287

4322

U+2028 Line separator

4301

4336

This is an example of an "atomic group", details of which are given

4302

4337

below. This particular group matches either the two-character sequence

4303

4338

CR followed by LF, or one of the single characters LF (linefeed,

4304

U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage

4305

return, U+000D), or NEL (next line, U+0085). The two-character sequence

4306

is treated as a single unit that cannot be split.

4339

U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car-

4340

riage return, U+000D), or NEL (next line, U+0085). The two-character

4341

sequence is treated as a single unit that cannot be split.

4307

4342

4308

4343

In other modes, two additional characters whose codepoints are greater

4309

4344

than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-

4366

4401

Those that are not part of an identified script are lumped together as

4367

4402

"Common". The current list of scripts is:

4368

4403

4369

Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille,

4370

Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common,

4371

Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Egyp-

4372

tian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek,

4373

Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Impe-

4374

rial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian,

4375

Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao,

4376

Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian, Lydian, Malayalam,

4377

Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic,

4378

Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki, Oriya, Osmanya,

4379

Phags_Pa, Phoenician, Rejang, Runic, Samaritan, Saurashtra, Shavian,

4380

Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le,

4381

Tai_Tham, Tai_Viet, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh,

4382

Ugaritic, Vai, Yi.

4404

Arabic, Armenian, Avestan, Balinese, Bamum, Batak, Bengali, Bopomofo,

4405

Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Chakma,

4406

Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret,

4407

Devanagari, Egyptian_Hieroglyphs, Ethiopic, Georgian, Glagolitic,

4408

Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira-

4409

gana, Imperial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscrip-

4410

tional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li,

4411

Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian,

4412

Lydian, Malayalam, Mandaic, Meetei_Mayek, Meroitic_Cursive,

4413

Meroitic_Hieroglyphs, Miao, Mongolian, Myanmar, New_Tai_Lue, Nko,

4414

Ogham, Old_Italic, Old_Persian, Old_South_Arabian, Old_Turkic,

4415

Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Samari-

4416

tan, Saurashtra, Sharada, Shavian, Sinhala, Sora_Sompeng, Sundanese,

4417

Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet,

4418

Takri, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai,

4419

Yi.

4383

4420

4384

4421

Each character has exactly one Unicode general category property, spec-

4385

4422

ified by a two-letter abbreviation. For compatibility with Perl, nega-

4501

4538

4502

4539

Xan matches characters that have either the L (letter) or the N (num-

4503

4540

ber) property. Xps matches the characters tab, linefeed, vertical tab,

4504

formfeed, or carriage return, and any other character that has the Z

4541

form feed, or carriage return, and any other character that has the Z

4505

4542

(separator) property. Xsp is the same as Xps, except that vertical tab

4506

4543

is excluded. Xwd matches the same characters as Xan, plus underscore.

4507

4544

4681

4718

means that the rest of the string may start with a malformed UTF char-

4682

4719

acter. This has undefined results, because PCRE assumes that it is

4683

4720

dealing with valid UTF strings (and by default it checks this at the

4684

start of processing unless the PCRE_NO_UTF8_CHECK option is used).

4721

start of processing unless the PCRE_NO_UTF8_CHECK or

4722

PCRE_NO_UTF16_CHECK option is used).

4685

4723

4686

PCRE does not allow \C to appear in lookbehind assertions (described
below) in a UTF mode, because this would make it impossible to
calculate the length of the lookbehind.

4689

4727

4690

4728

In general, the \C escape sequence is best avoided. However, one way of
using it that avoids the problem of malformed UTF characters is to use
a lookahead to check the length of the next character, as in this
pattern, which could be used with a UTF-8 string (ignore white space
and line breaks):

4695

4733

4696

4734

(?| (?=[\x00-\x7f])(\C) |

4698

4736

(?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |

4699

4737

(?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))

4700

4738

4701

A group that starts with (?| resets the capturing parentheses numbers
in each alternative (see "Duplicate Subpattern Numbers" below). The
assertions at the start of each branch check the next UTF-8 character
for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
character's individual bytes are then captured by the appropriate
number of groups.
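
The code point ranges that the lookaheads test correspond to the number
of bytes in the UTF-8 encoding. The Python sketch below is an
illustration only, not PCRE code: utf8_length() is a hypothetical
helper, and the range boundaries come from the definition of UTF-8. The
result is checked against Python's own encoder for a few sample code
points.

```python
def utf8_length(code_point):
    """Hypothetical helper: number of bytes UTF-8 uses for a code point."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


# One sample from each range; surrogates (0xd800-0xdfff) are avoided
# because they are not valid characters on their own.
samples = [0x41, 0xE9, 0x20AC, 0x1F600]
lengths = [utf8_length(cp) for cp in samples]
encoded = [len(chr(cp).encode("utf-8")) for cp in samples]
print(lengths, encoded)
```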

4707

4745

4708

4746

4712

4750

closing square bracket. A closing square bracket on its own is not
special by default. However, if the PCRE_JAVASCRIPT_COMPAT option is
set, a lone closing square bracket causes a compile-time error. If a
closing square bracket is required as a member of the class, it should
be the first data character in the class (after an initial circumflex,
if present) or escaped with a backslash.

4718

4756

4719

A character class matches a single character in the subject. In a UTF
mode, the character may be more than one data unit long. A matched
character must be in the set of characters defined by the class, unless
the first character in the class definition is a circumflex, in which
case the subject character must not be in the set defined by the class.
If a circumflex is actually required as a member of the class, ensure
it is not the first character, or escape it with a backslash.

4726

4764

4727

For example, the character class [aeiou] matches any lower case vowel,
while [^aeiou] matches any character that is not a lower case vowel.
Note that a circumflex is just a convenient notation for specifying the
characters that are in the class by enumerating those that are not. A
class that starts with a circumflex is not an assertion; it still
consumes a character from the subject string, and therefore it fails if
the current pointer is at the end of the string.
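
The point that a negated class still consumes a character can be tried
out directly. The sketch below assumes Python's re module as a stand-in
for PCRE; for these simple classes the two are expected to behave the
same way, but this is an illustration, not a PCRE test.

```python
import re

vowel = re.compile(r"[aeiou]")
non_vowel = re.compile(r"[^aeiou]")

# A positive class matches one of the listed characters.
vowel_hits = [ch for ch in "aexz" if vowel.fullmatch(ch)]

# A negated class matches one character that is not listed.
non_vowel_hits = [ch for ch in "aexz" if non_vowel.fullmatch(ch)]

# A negated class is not an assertion: with nothing left to consume
# after the "a", the match fails.
at_end = re.fullmatch(r"a[^aeiou]", "a")

print(vowel_hits, non_vowel_hits, at_end)
```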

4734

4772

4735

In UTF-8 (UTF-16) mode, characters with values greater than 255
(0xffff) can be included in a class as a literal string of data units,
or by using the \x{ escaping mechanism.

4738

4776

4739

When caseless matching is set, any letters in a class represent both
their upper case and lower case versions, so for example, a caseless
[aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
match "A", whereas a caseful version would. In a UTF mode, PCRE always
understands the concept of case for characters whose values are less
than 128, so caseless matching is always possible. For characters with
higher values, the concept of case is supported if PCRE is compiled
with Unicode property support, but not otherwise. If you want to use
caseless matching in a UTF mode for characters 128 and above, you must
ensure that PCRE is compiled with Unicode property support as well as
with UTF support.

4750

4788

4751

Characters that might indicate line breaks are never treated in any
special way when matching character classes, whatever line-ending
sequence is in use, and whatever setting of the PCRE_DOTALL and
PCRE_MULTILINE options is used. A class such as [^a] always matches one
of these characters.

4756

4794

4757

The minus (hyphen) character can be used to specify a range of
characters in a character class. For example, [d-m] matches any letter
between d and m, inclusive. If a minus character is required in a
class, it must be escaped with a backslash or appear in a position
where it cannot be interpreted as indicating a range, typically as the
first or last character in the class.

4763

4801

4764

4802

It is not possible to have the literal character "]" as the end
character of a range. A pattern such as [W-]46] is interpreted as a
class of two characters ("W" and "-") followed by a literal string
"46]", so it would match "W46]" or "-46]". However, if the "]" is
escaped with a backslash it is interpreted as the end of range, so
[W-\]46] is interpreted as a class containing a range followed by two
other characters. The octal or hexadecimal representation of "]" can
also be used to end a range.
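
The two readings of "]" after a hyphen can be compared side by side.
The sketch below assumes Python's re module as a stand-in; it parses
these two patterns in the way described above, but it is not PCRE and
the variable names are made up for the example.

```python
import re

# Class of "W" and "-", followed by the literal string "46]".
class_then_literal = re.compile(r"[W-]46]")

# Range from "W" to "]", plus the characters "4" and "6".
range_to_bracket = re.compile(r"[W-\]46]")

literal_hits = [s for s in ("W46]", "-46]", "X46]")
                if class_then_literal.fullmatch(s)]
range_hits = [ch for ch in "WZ]46-5" if range_to_bracket.fullmatch(ch)]

print(literal_hits)
print(range_hits)
```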

4772

4810

4773

Ranges operate in the collating sequence of character values. They can
also be used for characters specified numerically, for example
[\000-\037]. Ranges can include any characters that are valid for the
current mode.

4777

4815

4778

4816

If a range that includes letters is used when caseless matching is set,
it matches the letters in either case. For example, [W-c] is equivalent
to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if
character tables for a French locale are in use, [\xc8-\xcb] matches
accented E characters in both cases. In UTF modes, PCRE supports the
concept of case for characters with values greater than 127 only when
it is compiled with Unicode property support.
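
The effect of caseless matching on a range that spans both letters and
punctuation can be checked for ASCII letters. The sketch below assumes
Python's re module with re.IGNORECASE as a stand-in for PCRE_CASELESS;
it is an illustration rather than a PCRE test.

```python
import re
import string

caseless_range = re.compile(r"[W-c]", re.IGNORECASE)

# Collect the ASCII letters that the caseless range accepts.
accepted = "".join(
    ch for ch in string.ascii_letters if caseless_range.fullmatch(ch)
)

# The punctuation that lies between "Z" and "a" is matched as well.
accepts_backquote = caseless_range.fullmatch("`") is not None

print(accepted, accepts_backquote)
```

The letters w, x, y, z, a, b, and c are accepted in either case, which
agrees with the equivalent class given above.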

4785

4823

4786

The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
\w, and \W may appear in a character class, and add the characters that
they match to the class. For example, [\dABCDEF] matches any
hexadecimal digit. In UTF modes, the PCRE_UCP option affects the
meanings of \d, \s, \w and their upper case partners, just as it does
when they appear outside a character class, as described in the section
entitled "Generic character types" above. The escape sequence \b has a
different meaning inside a character class; it matches the backspace
character. The sequences \B, \N, \R, and \X are not special inside a
character class. Like any other unrecognized escape sequences, they are
treated as the literal characters "B", "N", "R", and "X" by default,
but cause an error if the PCRE_EXTRA option is set.

4798

4836

4799

A circumflex can conveniently be used with the upper case character
types to specify a more restricted set of characters than the matching
lower case type. For example, the class [^\W_] matches any letter or
digit, but not underscore, whereas [\w] includes underscore. A positive
character class should be read as "something OR something OR ..." and a
negative class as "NOT something AND NOT something AND NOT ...".
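
The "NOT something AND NOT something" reading of [^\W_] can be checked
character by character. The sketch below assumes Python's re module as
a stand-in, with re.ASCII so that \w is limited to ASCII letters,
digits, and underscore, roughly as in PCRE without PCRE_UCP. It is an
illustration only.

```python
import re

letter_or_digit = re.compile(r"[^\W_]", re.ASCII)
word_char = re.compile(r"[\w]", re.ASCII)
upper_hex = re.compile(r"[\dABCDEF]", re.ASCII)

candidates = "a5_-"
no_underscore = [ch for ch in candidates if letter_or_digit.fullmatch(ch)]
with_underscore = [ch for ch in candidates if word_char.fullmatch(ch)]

# \d inside a class adds the digits to the listed letters.
hex_hits = [ch for ch in "7CGc" if upper_hex.fullmatch(ch)]

print(no_underscore, with_underscore, hex_hits)
```

Note that [\dABCDEF] as written accepts only the upper case letters A
to F in addition to the digits.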

4805

4843

4806

The only metacharacters that are recognized in character classes are
backslash, hyphen (only where it can be interpreted as specifying a
range), circumflex (only at the start), opening square bracket (only
when it can be interpreted as introducing a POSIX class name - see the
next section), and the terminating closing square bracket. However,
escaping other non-alphanumeric characters does no harm.

4812

4850

4813

4851

4814

4852

POSIX CHARACTER CLASSES

4815

4853

4816

4854

Perl supports the POSIX notation for character classes. This uses names
enclosed by [: and :] within the enclosing square brackets. PCRE also
supports this notation. For example,

4819

4857

4820

4858

[01[:alpha:]%]

4837

4875

word "word" characters (same as \w)

4838

4876

xdigit hexadecimal digits

4839

4877

4840

The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
and space (32). Notice that this list includes the VT character (code
11). This makes "space" different to \s, which does not include VT (for
Perl compatibility).

4844

4882

4845

The name "word" is a Perl extension, and "blank" is a GNU extension
from Perl 5.8. Another Perl extension is negation, which is indicated
by a ^ character after the colon. For example,

4848

4886

4849

4887

[12[:^digit:]]

4850

4888

4851

matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
these are not supported, and an error is given if they are encountered.

4854

4892

4855

By default, in UTF modes, characters with values greater than 127 do
not match any of the POSIX character classes. However, if the PCRE_UCP
option is passed to pcre_compile(), some of the classes are changed so
that Unicode character properties are used. This is achieved by
replacing the POSIX classes by other sequences, as follows:

4860

4898

4867

4905

[:upper:] becomes \p{Lu}

4868

4906

[:word:] becomes \p{Xwd}

4869

4907

4870

Negated versions, such as [:^alpha:] use \P instead of \p. The other
POSIX classes are unchanged, and match only characters with code points
less than 128.

4873

4911

4874

4912

4875

4913

VERTICAL BAR

4876

4914

4877

Vertical bar characters are used to separate alternative patterns. For
example, the pattern

4879

4917

4880

4918

gilbert|sullivan

4881

4919

4920

matches either "gilbert" or "sullivan". Any number of alternatives may

4921

appear, and an empty alternative is permitted (matching the empty

4884

4922

string). The matching process tries each alternative in turn, from left

4923

to right, and the first one that succeeds is used. If the alternatives

4924

are within a subpattern (defined below), "succeeds" means matching the

4887

4925

rest of the main pattern as well as the alternative in the subpattern.
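
As an illustration only, the left-to-right rule can be observed with Python's re module. This is a different engine from PCRE, but it treats plain alternation in the same way. The helper name first_match is an assumption made for this sketch.

```python
import re

def first_match(pattern, subject):
    """Return the text matched at the start of subject, or None."""
    m = re.match(pattern, subject)
    return m.group(0) if m else None

# Either alternative can match.
print(first_match(r"gilbert|sullivan", "sullivan"))

# Alternatives are tried from left to right, so "a" is used even though
# the second alternative would have matched more of the subject.
print(first_match(r"a|ab", "ab"))

# An empty alternative matches the empty string.
print(repr(first_match(r"gilbert|", "sullivan")))

# Inside a subpattern, "succeeds" includes the rest of the pattern:
# "a" alone cannot be followed by "c" here, so "ab" is used instead.
print(first_match(r"(a|ab)c", "abc"))
```

The last call shows why the order of alternatives matters less inside a group that is followed by more pattern: the matcher backtracks into the group until the remainder also matches.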

4888

4926

4889

4927

4890

4928

INTERNAL OPTION SETTING

4891

4929

4930

The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and

4931

PCRE_EXTENDED options (which are Perl-compatible) can be changed from

4932

within the pattern by a sequence of Perl option letters enclosed

4895

4933

between "(?" and ")". The option letters are

4896

4934

4897

4935

i for PCRE_CASELESS

4901

4939

4902

4940

For example, (?im) sets caseless, multiline matching. It is also possi-

4903

4941

ble to unset these options by preceding the letter with a hyphen, and a

4942

combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-

4943

LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,

4944

is also permitted. If a letter appears both before and after the

4907

4945

hyphen, the option is unset.

4908

4946

4947

The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA

4948

can be changed in the same way as the Perl-compatible options by using

4911

4949

the characters J, U and X respectively.

4912

4950

4951

When one of these option changes occurs at top level (that is, not

4952

inside subpattern parentheses), the change applies to the remainder of

4915

4953

the pattern that follows. If the change is placed right at the start of

4916

4954

a pattern, PCRE extracts it into the global options (and it will there-

4917

4955

fore show up in data extracted by the pcre_fullinfo() function).

4918

4956

4957

An option change within a subpattern (see below for a description of

4958

subpatterns) affects only that part of the subpattern that follows it,

4921

4959

so

4922

4960

4923

4961

(a(?i)b)c

4924

4962

4925

4963

matches abc and aBc and no other strings (assuming PCRE_CASELESS is not

4964

used). By this means, options can be made to have different settings

4965

in different parts of the pattern. Any changes made in one alternative

4966

do carry on into subsequent branches within the same subpattern. For

4929

4967

example,

4930

4968

4931

4969

(a(?i)b|c)

4932

4970

4971

matches "ab", "aB", "c", and "C", even though when matching "C" the

4972

first branch is abandoned before the option setting. This is because

4973

the effects of option settings happen at compile time. There would be

4936

4974

some very weird behaviour otherwise.
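
The following sketch illustrates only the first example above, where a caseless setting applies to part of a subpattern. It uses Python's re module, which is a different engine from PCRE and does not accept an option change such as (?i) in the middle of a pattern. The scoped form (?i:...) is therefore substituted, and the (a(?i)b|c) case is not reproduced. The helper name accepts is an assumption.

```python
import re

def accepts(pattern, subject):
    """True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None

# Only the "b" is matched caselessly; "a" and "c" stay case-sensitive.
SCOPED = r"(a(?i:b))c"

for subject in ("abc", "aBc", "Abc", "abC"):
    print(subject, accepts(SCOPED, subject))
```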

4937

4975

4976

Note: There are other PCRE-specific options that can be set by the

4977

application when the compiling or matching functions are called. In

4978

some cases the pattern can contain special leading sequences such as

4979

(*CRLF) to override what the application has set or what has been

4980

defaulted. Details are given in the section entitled "Newline

4981

sequences" above. There are also the (*UTF8), (*UTF16), and (*UCP)

4982

leading sequences that can be used to set UTF and Unicode property

4983

modes; they are equivalent to setting the PCRE_UTF8, PCRE_UTF16, and

4946

4984

the PCRE_UCP options, respectively.

4947

4985

4948

4986

4955

4993

4956

4994

cat(aract|erpillar|)

4957

4995

4996

matches "cataract", "caterpillar", or "cat". Without the parentheses,

4959

4997

it would match "cataract", "erpillar" or an empty string.

4960

4998

4999

2. It sets up the subpattern as a capturing subpattern. This means

5000

that, when the whole pattern matches, that portion of the subject

4963

5001

string that matched the subpattern is passed back to the caller via the

5002

ovector argument of the matching function. (This applies only to the

5003

traditional matching functions; the DFA matching functions do not sup-

4966

5004

port capturing.)

4967

5005

4968

5006

Opening parentheses are counted from left to right (starting from 1) to

5007

obtain numbers for the capturing subpatterns. For example, if the

4970

5008

string "the red king" is matched against the pattern

4971

5009

4972

5010

the ((red|white) (king|queen))

4974

5012

the captured substrings are "red king", "red", and "king", and are num-

4975

5013

bered 1, 2, and 3, respectively.

4976

5014

5015

The fact that plain parentheses fulfil two functions is not always

5016

helpful. There are often times when a grouping subpattern is required

5017

without a capturing requirement. If an opening parenthesis is followed

5018

by a question mark and a colon, the subpattern does not do any captur-

5019

ing, and is not counted when computing the number of any subsequent

5020

capturing subpatterns. For example, if the string "the white queen" is

4983

5021

matched against the pattern

4984

5022

4985

5023

the ((?:red|white) (king|queen))

4987

5025

the captured substrings are "white queen" and "queen", and are numbered

4988

5026

1 and 2. The maximum number of capturing subpatterns is 65535.
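
The numbering rules can be checked with a short sketch. It uses Python's re module rather than the PCRE API, on the assumption that the two agree for these simple patterns. The helper name captures is hypothetical.

```python
import re

def captures(pattern, subject):
    """Return the captured substrings in group-number order, or None."""
    m = re.search(pattern, subject)
    return m.groups() if m else None

# Three capturing groups, numbered by the position of their opening parenthesis.
print(captures(r"the ((red|white) (king|queen))", "the red king"))

# (?:...) groups without capturing, so only two numbered groups remain.
print(captures(r"the ((?:red|white) (king|queen))", "the white queen"))
```

The tuple returned by groups() starts at group 1, so its first element corresponds to captured substring number 1 in the text.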

4989

5027

5028

As a convenient shorthand, if any option settings are required at the

5029

start of a non-capturing subpattern, the option letters may appear

4992

5030

between the "?" and the ":". Thus the two patterns

4993

5031

4994

5032

(?i:saturday|sunday)

4995

5033

(?:(?i)saturday|sunday)

4996

5034

4997

5035

match exactly the same set of strings. Because alternative branches are

5036

tried from left to right, and options are not reset until the end of

5037

the subpattern is reached, an option setting in one branch does affect

5038

subsequent branches, so the above patterns match "SUNDAY" as well as

5001

5039

"Saturday".

5002

5040

5003

5041

5004

5042

DUPLICATE SUBPATTERN NUMBERS

5005

5043

5006

5044

Perl 5.10 introduced a feature whereby each alternative in a subpattern

5045

uses the same numbers for its capturing parentheses. Such a subpattern

5046

starts with (?| and is itself a non-capturing subpattern. For example,

5009

5047

consider this pattern:

5010

5048

5011

5049

(?|(Sat)ur|(Sun))day

5012

5050

5051

Because the two alternatives are inside a (?| group, both sets of cap-

5052

turing parentheses are numbered one. Thus, when the pattern matches,

5053

you can look at captured substring number one, whichever alternative

5054

matched. This construct is useful when you want to capture part, but

5017

5055

not all, of one of a number of alternatives. Inside a (?| group, paren-

5056

theses are numbered as usual, but the number is reset at the start of

5057

each branch. The numbers of any capturing parentheses that follow the

5058

subpattern start after the highest number used in any branch. The fol-

5021

5059

lowing example is taken from the Perl documentation. The numbers under-

5022

5060

neath show in which buffer the captured content will be stored.

5023

5061

5025

5063

/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x

5026

5064

# 1           2         2  3        2     3     4

5027

5065

5066

A back reference to a numbered subpattern uses the most recent value

5067

that is set for that number by any subpattern. The following pattern

5030

5068

matches "abcabc" or "defdef":

5031

5069

5032

5070

/(?|(abc)|(def))\1/

5033

5071

5072

In contrast, a subroutine call to a numbered subpattern always refers

5073

to the first one in the pattern with the given number. The following

5036

5074

pattern matches "abcabc" or "defabc":

5037

5075

5038

5076

/(?|(abc)|(def))(?1)/

5039

5077

5078

If a condition test for a subpattern's having matched refers to a non-

5079

unique number, the test is true if any of the subpatterns of that num-

5042

5080

ber have matched.

5043

5081

5082

An alternative approach to using this "branch reset" feature is to use

5045

5083

duplicate named subpatterns, as described in the next section.

5046

5084

5047

5085

5048

5086

NAMED SUBPATTERNS

5049

5087

5088

Identifying capturing parentheses by number is simple, but it can be

5089

very hard to keep track of the numbers in complicated regular expres-

5090

sions. Furthermore, if an expression is modified, the numbers may

5091

change. To help with this difficulty, PCRE supports the naming of sub-

5054

5092

patterns. This feature was not added to Perl until release 5.10. Python

5093

had the feature earlier, and PCRE introduced it at release 4.0, using

5094

the Python syntax. PCRE now supports both the Perl and the Python syn-

5095

tax. Perl allows identically numbered subpatterns to have different

5058

5096

names, but PCRE does not.

5059

5097

5098

In PCRE, a subpattern can be named in one of three ways: (?<name>...)

5099

or (?'name'...) as in Perl, or (?P<name>...) as in Python. References

5100

to capturing parentheses from other parts of the pattern, such as back

5101

references, recursion, and conditions, can be made by name as well as

5064

5102

by number.

5065

5103

5104

Names consist of up to 32 alphanumeric characters and underscores.

5105

Named capturing parentheses are still allocated numbers as well as

5106

names, exactly as if the names were not present. The PCRE API provides

5069

5107

function calls for extracting the name-to-number translation table from

5070

5108

a compiled pattern. There is also a convenience function for extracting

5071

5109

a captured substring by name.
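
As a rough analogy, not a use of the PCRE API, Python's re module exposes the same two ideas for patterns written in the (?P<name>...) syntax: a name-to-number table (the groupindex attribute) and extraction of a captured substring by name. The pattern and the names year and month are invented for this sketch.

```python
import re

# Two named groups; they still receive the numbers 1 and 2.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# Name-to-number translation table for the compiled pattern.
name_table = dict(date.groupindex)
print(name_table)

m = date.match("2012-02")
# A named group can be read by name or by its number.
print(m.group("year"), m.group(1))
print(m.group("month"), m.group(2))
```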

5072

5110

5111

By default, a name must be unique within a pattern, but it is possible

5074

5112

to relax this constraint by setting the PCRE_DUPNAMES option at compile

5113

time. (Duplicate names are also always permitted for subpatterns with

5114

the same number, set up as described in the previous section.) Dupli-

5115

cate names can be useful for patterns where only one instance of the

5116

named parentheses can match. Suppose you want to match the name of a

5117

weekday, either as a 3-letter abbreviation or as the full name, and in

5080

5118

both cases you want to extract the abbreviation. This pattern (ignoring

5081

5119

the line breaks) does the job:

5082

5120

5086

5124

(?<DN>Thu)(?:rsday)?|

5087

5125

(?<DN>Sat)(?:urday)?

5088

5126

5127

There are five capturing substrings, but only one is ever set after a

5090

5128

match. (An alternative way of solving this problem is to use a "branch

5091

5129

reset" subpattern, as described in the previous section.)

5092

5130

5131

The convenience function for extracting the data by name returns the

5132

substring for the first (and in this example, the only) subpattern of

5133

that name that matched. This saves searching to find which numbered

5096

5134

subpattern it was.

5097

5135

5136

If you make a back reference to a non-unique named subpattern from

5137

elsewhere in the pattern, the one that corresponds to the first occur-

5100

5138

rence of the name is used. In the absence of duplicate numbers (see the

5139

previous section) this is the one with the lowest number. If you use a

5140

named reference in a condition test (see the section about conditions

5141

below), either to check whether a subpattern has matched, or to check

5142

for recursion, all subpatterns with the same name are tested. If the

5143

condition is true for any one of them, the overall condition is true.

5106

5144

This is the same behaviour as testing by number. For further details of

5107

5145

the interfaces for handling named subpatterns, see the pcreapi documen-

5108

5146

tation.

5109

5147

5110

5148

Warning: You cannot use different names to distinguish between two sub-

5149

patterns with the same number because PCRE uses only the numbers when

5112

5150

matching. For this reason, an error is given at compile time if differ-

5151

ent names are given to subpatterns with the same number. However, you

5152

can give the same name to subpatterns with the same number, even when

5115

5153

PCRE_DUPNAMES is not set.

5116

5154

5117

5155

5118

5156

REPETITION

5119

5157

5158

Repetition is specified by quantifiers, which can follow any of the

5121

5159

following items:

5122

5160

5123

5161

a literal data character

5131

5169

a parenthesized subpattern (including assertions)

5132

5170

a subroutine call to a subpattern (recursive or otherwise)

5133

5171

5172

The general repetition quantifier specifies a minimum and maximum num-

5173

ber of permitted matches, by giving the two numbers in curly brackets

5174

(braces), separated by a comma. The numbers must be less than 65536,

5137

5175

and the first must be less than or equal to the second. For example:

5138

5176

5139

5177

z{2,4}

5140

5178

5179

matches "zz", "zzz", or "zzzz". A closing brace on its own is not a

5180

special character. If the second number is omitted, but the comma is

5181

present, there is no upper limit; if the second number and the comma

5182

are both omitted, the quantifier specifies an exact number of required

5145

5183

matches. Thus

5146

5184

5147

5185

[aeiou]{3,}

5150

5188

5151

5189

\d{8}

5152

5190

5191

matches exactly 8 digits. An opening curly bracket that appears in a

5192

position where a quantifier is not allowed, or one that does not match

5193

the syntax of a quantifier, is taken as a literal character. For exam-

5156

5194

ple, {,6} is not a quantifier, but a literal string of four characters.
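
The brace quantifiers above can be exercised with a small sketch. It uses Python's re module as a stand-in and assumes it agrees with PCRE on these forms. One known difference is that Python treats {,6} as {0,6}, so that case is deliberately left out. The helper name full is hypothetical.

```python
import re

def full(pattern, subject):
    """True if the pattern matches the entire subject."""
    return re.fullmatch(pattern, subject) is not None

# {2,4}: between two and four repetitions.
print([full(r"z{2,4}", "z" * n) for n in range(1, 6)])

# {3,}: three or more repetitions, no upper limit.
print(full(r"[aeiou]{3,}", "ae"), full(r"[aeiou]{3,}", "aeiouaeiou"))

# {8}: exactly eight repetitions.
print(full(r"\d{8}", "2012020"), full(r"\d{8}", "20120204"))

# A brace that does not form a valid quantifier is a literal character.
print(full(r"a{x}", "a{x}"))
```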

5157

5195

5158

5196

In UTF modes, quantifiers apply to characters rather than to individual

5197

data units. Thus, for example, \x{100}{2} matches two characters, each

5160

5198

of which is represented by a two-byte sequence in a UTF-8 string. Simi-

5199

larly, \X{3} matches three Unicode extended sequences, each of which

5162

5200

may be several data units long (and they may be of different lengths).

5163

5201

5164

5202

The quantifier {0} is permitted, causing the expression to behave as if

5165

5203

the previous item and the quantifier were not present. This may be use-

5204

ful for subpatterns that are referenced as subroutines from elsewhere

5167

5205

in the pattern (but see also the section entitled "Defining subpatterns

5206

for use by reference only" below). Items other than subpatterns that

5169

5207

have a {0} quantifier are omitted from the compiled pattern.

5170

5208

5209

For convenience, the three most common quantifiers have single-charac-

5172

5210

ter abbreviations:

5173

5211

5174

5212

* is equivalent to {0,}

5175

5213

+ is equivalent to {1,}

5176

5214

? is equivalent to {0,1}

5177

5215

5216

It is possible to construct infinite loops by following a subpattern

5179

5217

that can match no characters with a quantifier that has no upper limit,

5180

5218

for example:

5181

5219

5182

5220

(a?)*

5183

5221

5184

5222

Earlier versions of Perl and PCRE used to give an error at compile time

5223

for such patterns. However, because there are cases where this can be

5224

useful, such patterns are now accepted, but if any repetition of the

5225

subpattern does in fact match no characters, the loop is forcibly bro-

5188

5226

ken.

5189

5227

5228

By default, the quantifiers are "greedy", that is, they match as much

5229

as possible (up to the maximum number of permitted times), without

5230

causing the rest of the pattern to fail. The classic example of where

5193

5231

this gives problems is in trying to match comments in C programs. These

5232

appear between /* and */ and within the comment, individual * and /

5233

characters may appear. An attempt to match C comments by applying the

5196

5234

pattern

5197

5235

5198

5236

/\*.*\*/

5201

5239

5202

5240

/* first comment */ not comment /* second comment */

5203

5241

5242

fails, because it matches the entire string owing to the greediness of

5205

5243

the .* item.

5206

5244

5245

However, if a quantifier is followed by a question mark, it ceases to

5208

5246

be greedy, and instead matches the minimum number of times possible, so

5209

5247

the pattern

5210

5248

5211

5249

/\*.*?\*/

5212

5250

5251

does the right thing with the C comments. The meaning of the various

5252

quantifiers is not otherwise changed, just the preferred number of

5253

matches. Do not confuse this use of question mark with its use as a

5254

quantifier in its own right. Because it has two uses, it can sometimes

5217

5255

appear doubled, as in

5218

5256

5219

5257

\d??\d

5221

5259

which matches one digit by preference, but can match two if that is the

5222

5260

only way the rest of the pattern matches.
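
The difference between greedy and minimizing quantifiers can be seen with the C comment example. The sketch below uses Python's re module, on the assumption that it behaves like PCRE for these patterns. The variable names are invented for illustration.

```python
import re

text = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the end and backs off only as far as the last "*/".
greedy = re.findall(r"/\*.*\*/", text)

# Minimizing: .*? stops at the first "*/" that lets the pattern match.
lazy = re.findall(r"/\*.*?\*/", text)

print(greedy)
print(lazy)

# A doubled question mark: an optional digit that prefers to match nothing.
one_digit = re.match(r"\d??\d", "12").group(0)
# Forcing the whole subject to match makes the optional digit take part.
two_digits = re.fullmatch(r"\d??\d", "12").group(0)
print(one_digit, two_digits)
```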

5223

5261

5262

If the PCRE_UNGREEDY option is set (an option that is not available in

5263

Perl), the quantifiers are not greedy by default, but individual ones

5264

can be made greedy by following them with a question mark. In other

5227

5265

words, it inverts the default behaviour.

5228

5266

5267

When a parenthesized subpattern is quantified with a minimum repeat

5268

count that is greater than 1 or with a limited maximum, more memory is

5269

required for the compiled pattern, in proportion to the size of the

5232

5270

minimum or maximum.

5233

5271

5234

5272

If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-

5273

alent to Perl's /s) is set, thus allowing the dot to match newlines,

5274

the pattern is implicitly anchored, because whatever follows will be

5275

tried against every character position in the subject string, so there

5276

is no point in retrying the overall match at any position after the

5277

first. PCRE normally treats such a pattern as though it were preceded

5240

5278

by \A.

5241

5279

5280

In cases where it is known that the subject string contains no new-

5281

lines, it is worth setting PCRE_DOTALL in order to obtain this opti-

5244

5282

mization, or alternatively using ^ to indicate anchoring explicitly.

5245

5283

5284

However, there is one situation where the optimization cannot be used.

5247

5285

When .* is inside capturing parentheses that are the subject of a back

5248

5286

reference elsewhere in the pattern, a match at the start may fail where

5249

5287

a later one succeeds. Consider, for example:

5250

5288

5251

5289

(.*)abc\1

5252

5290

5291

If the subject is "xyz123abc123" the match point is the fourth charac-

5254

5292

ter. For this reason, such a pattern is not implicitly anchored.
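
The back reference case can be reproduced with a short sketch. It uses Python's re module as a stand-in for PCRE, and it shows only where the match is found, not the anchoring optimization itself.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

# Attempts at offsets 0, 1 and 2 fail because the text captured by (.*)
# is not repeated after "abc". The first success is at offset 3, which is
# the fourth character of the subject.
print(m.start(), m.group(0), m.group(1))
```

If the pattern were treated as anchored, only the attempt at offset 0 would be made and the match would be missed.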

5255

5293

5256

5294

When a capturing subpattern is repeated, the value captured is the sub-

5259

5297

(tweedle[dume]{3}\s*)+

5260

5298

5261

5299

has matched "tweedledum tweedledee" the value of the captured substring

5300

is "tweedledee". However, if there are nested capturing subpatterns,

5301

the corresponding captured values may have been set in previous itera-

5264

5302

tions. For example, after

5265

5303

5266

5304

/(a|(b))+/

5270

5308

5271

5309

ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

5272

5310

5311

With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")

5312

repetition, failure of what follows normally causes the repeated item

5313

to be re-evaluated to see if a different number of repeats allows the

5314

rest of the pattern to match. Sometimes it is useful to prevent this,

5315

either to change the nature of the match, or to cause it to fail earlier

5316

than it otherwise might, when the author of the pattern knows there is

5279

5317

no point in carrying on.

5280

5318

5319

Consider, for example, the pattern \d+foo when applied to the subject

5282

5320

line

5283

5321

5284

5322

123456bar

5285

5323

5286

5324

After matching all 6 digits and then failing to match "foo", the normal

5325

action of the matcher is to try again with only 5 digits matching the

5326

\d+ item, and then with 4, and so on, before ultimately failing.

5327

"Atomic grouping" (a term taken from Jeffrey Friedl's book) provides

5328

the means for specifying that once a subpattern has matched, it is not

5291

5329

to be re-evaluated in this way.

If we use atomic grouping for the previous example, the matcher gives
up immediately on failing to match "foo" the first time. The notation
is a kind of special parenthesis, starting with (?> as in this example:

  (?>\d+)foo

This kind of parenthesis "locks up" the part of the pattern it
contains once it has matched, and a failure further into the pattern is
prevented from backtracking into it. Backtracking past it to previous
items, however, works as normal.

An alternative description is that a subpattern of this type matches
the string of characters that an identical standalone pattern would
match, if anchored at the current point in the subject string.

Atomic grouping subpatterns are not capturing subpatterns. Simple cases
such as the above example can be thought of as a maximizing repeat that
must swallow everything it can. So, while both \d+ and \d+? are
prepared to adjust the number of digits they match in order to make the
rest of the pattern match, (?>\d+) can only match an entire sequence of
digits.

Atomic groups in general can of course contain arbitrarily complicated
subpatterns, and can be nested. However, when the subpattern for an
atomic group is just a single repeated item, as in the example above, a
simpler notation, called a "possessive quantifier" can be used. This
consists of an additional + character following a quantifier. Using
this notation, the previous example can be rewritten as

  \d++foo

  (abc|xyz){2,3}+

Possessive quantifiers are always greedy; the setting of the
PCRE_UNGREEDY option is ignored. They are a convenient notation for the
simpler forms of atomic group. However, there is no difference in the
meaning of a possessive quantifier and the equivalent atomic group,
though there may be a performance difference; possessive quantifiers
should be slightly faster.

The possessive quantifier syntax is an extension to the Perl 5.8
syntax. Jeffrey Friedl originated the idea (and the name) in the first
edition of his book. Mike McCloskey liked it, so implemented it when he
built Sun's Java package, and PCRE copied it from there. It ultimately
found its way into Perl at release 5.10.

PCRE has an optimization that automatically "possessifies" certain
simple pattern constructs. For example, the sequence A+B is treated as
A++B because there is no point in backtracking into a sequence of A's
when B must follow.

When a pattern contains an unlimited repeat inside a subpattern that
can itself be repeated an unlimited number of times, the use of an
atomic group is the only way to avoid some failing matches taking a
very long time indeed. The pattern

  (\D+|<\d+>)*[!?]

matches an unlimited number of substrings that either consist of
non-digits, or digits enclosed in <>, followed by either ! or ?. When it
matches, it runs quickly. However, if it is applied to

  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

it takes a long time before reporting failure. This is because the
string can be divided between the internal \D+ repeat and the external
* repeat in a large number of ways, and all have to be tried. (The
example uses [!?] rather than a single character at the end, because
both PCRE and Perl have an optimization that allows for fast failure
when a single character is used. They remember the last single
character that is required for a match, and fail early if it is not
present in the string.) If the pattern is changed so that it uses an
atomic group, like this:

  ((?>\D+)|<\d+>)*[!?]

Outside a character class, a backslash followed by a digit greater than
0 (and possibly further digits) is a back reference to a capturing
subpattern earlier (that is, to its left) in the pattern, provided there
have been that many previous capturing left parentheses.

However, if the decimal number following the backslash is less than 10,
it is always taken as a back reference, and causes an error only if
there are not that many capturing left parentheses in the entire
pattern. In other words, the parentheses that are referenced need not be
to the left of the reference for numbers less than 10. A "forward back
reference" of this type can make sense when a repetition is involved
and the subpattern to the right has participated in an earlier
iteration.

It is not possible to have a numerical "forward back reference" to a
subpattern whose number is 10 or more using this syntax because a
sequence such as \50 is interpreted as a character defined in octal.
See the subsection entitled "Non-printing characters" above for further
details of the handling of digits following a backslash. There is no
such problem when named parentheses are used. A back reference to any
subpattern is possible using named parentheses (see below).

Another way of avoiding the ambiguity inherent in the use of digits
following a backslash is to use the \g escape sequence. This escape
must be followed by an unsigned number or a negative number, optionally
enclosed in braces. These examples are all identical:

  (ring), \g1
  (ring), \g{1}

An unsigned number specifies an absolute reference without the
ambiguity that is present in the older syntax. It is also useful when
literal digits follow the reference. A negative number is a relative
reference. Consider this example:

The sequence \g{-1} is a reference to the most recently started
capturing subpattern before \g, that is, it is equivalent to \2 in this
example. Similarly, \g{-2} would be equivalent to \1. The use of
relative references can be helpful in long patterns, and also in
patterns that are created by joining together fragments that contain
references within themselves.

A back reference matches whatever actually matched the capturing
subpattern in the current subject string, rather than anything matching
the subpattern itself (see "Subpatterns as subroutines" below for a way
of doing that). So the pattern

  (sens|respons)e and \1ibility

matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If caseful matching is in force at the
time of the back reference, the case of letters is relevant. For
example,

  ((?i)rah)\s+\1

matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
original capturing subpattern is matched caselessly.
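Both behaviours can be checked with a short script. This is a sketch
under stated assumptions, not PCRE itself: it uses Python's re module,
and because recent Python versions reject an option setting such as (?i)
in the middle of a pattern, the second pattern uses the scoped form
(?i:...) instead. The helper name is hypothetical.

```python
import re

# \1 must repeat the text that group 1 actually matched in this subject.
sensible = re.compile(r"(sens|respons)e and \1ibility")

# Group 1 is matched caselessly through the scoped flag (?i:...), but \1
# lies outside that scope, so the repeated text must have the same case.
cheer = re.compile(r"((?i:rah))\s+\1")


def full(pattern, subject):
    """Return True if the whole subject matches the compiled pattern."""
    return pattern.fullmatch(subject) is not None
```

Here full(sensible, "sense and responsibility") is false, and
full(cheer, "RAH rah") is false although "RAH" alone satisfies the
caseless group.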

There are several different ways of writing back references to named
subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
\k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's
unified back reference syntax, in which \g can be used for both numeric
and named references, is also supported. We could rewrite the above
example in any of the following ways:

  (?<p1>(?i)rah)\s+\k<p1>
  (?P<p1>(?i)rah)\s+(?P=p1)
  (?<p1>(?i)rah)\s+\g{p1}

A subpattern that is referenced by name may appear in the pattern
before or after the reference.
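Of the notations listed, the Python form can be tried directly in
Python's own re module. The following is a minimal sketch that assumes
Python as a stand-in for PCRE; it drops the (?i) setting for simplicity,
and the variable names are illustrative only. Python's re does not accept
the \k or \g forms in a pattern.

```python
import re

# (?P<p1>...) names the group; (?P=p1) refers back to what it matched.
named = re.compile(r"(?P<p1>rah)\s+(?P=p1)")

hit = named.search("say rah   rah now")
miss = named.search("rah RAH")
```

After this runs, hit.group("p1") holds the text captured by the named
group, and miss is None because the second word differs in case.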

There may be more than one back reference to the same subpattern. If a
subpattern has not actually been used in a particular match, any back
references to it always fail by default. For example, the pattern

  (a|(bc))\2

always fails if it starts to match "a" rather than "bc". However, if
the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back
reference to an unset value matches an empty string.
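The default behaviour (a reference to an unset group fails) can be
observed with Python's re module, used here as an assumed stand-in for
PCRE. Python has no equivalent of PCRE_JAVASCRIPT_COMPAT, so only the
default case is shown; first_match is a hypothetical helper.

```python
import re

# Group 2 is set only when the second alternative, (bc), is taken.
unset_ref = re.compile(r"(a|(bc))\2")


def first_match(subject):
    """Return the text of the first match in subject, or None."""
    found = unset_ref.search(subject)
    return found.group(0) if found else None
```

For the subject "aa" every attempt takes the "a" branch, leaves group 2
unset, and so fails at \2.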

Because there may be many capturing parentheses in a pattern, all
digits following a backslash are taken as part of a potential back
reference number. If the pattern continues with a digit character, some
delimiter must be used to terminate the back reference. If the
PCRE_EXTENDED option is set, this can be white space. Otherwise, the
\g{ syntax or an empty comment (see "Comments" below) can be used.
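The same two delimiters work in Python's re module, which is used below
purely as an assumed analogue: re.VERBOSE plays the role of
PCRE_EXTENDED, and (?#) is an empty comment. The \g{ form is not
available in Python patterns. The helper name is hypothetical.

```python
import re

# In verbose mode the space ends the reference: \1, then a literal 0.
spaced = re.compile(r"(a)\1 0", re.VERBOSE)

# Without verbose mode, an empty comment separates \1 from the 0.
commented = re.compile(r"(a)\1(?#)0")


def compiles(pattern):
    """Return True if Python's re accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False
```

Without a delimiter, (a)\10 is read as a reference to group 10, which
does not exist in this pattern, so compiles(r"(a)\10") is false.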

Recursive back references

A back reference that occurs inside the parentheses to which it refers
fails when the subpattern is first used, so, for example, (a\1) never
matches. However, such references can be useful inside repeated
subpatterns. For example, the pattern

  (a|b\1)+

matches any number of "a"s and also "aba", "ababbaa" etc. At each
iteration of the subpattern, the back reference matches the character
string corresponding to the previous iteration. In order for this to
work, the pattern must be such that the first iteration does not need
to match the back reference. This can be done using alternation, as in
the example above, or by a quantifier with a minimum of zero.

Back references of this type cause the group that they reference to be
treated as an atomic group. Once the whole group has been matched, a
subsequent matching failure cannot cause backtracking into the middle
of the group.


ASSERTIONS

An assertion is a test on the characters following or preceding the
current matching point that does not actually consume any characters.
The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
described above.

More complicated assertions are coded as subpatterns. There are two
kinds: those that look ahead of the current position in the subject
string, and those that look behind it. An assertion subpattern is
matched in the normal way, except that it does not cause the current
matching position to be changed.

Assertion subpatterns are not capturing subpatterns. If such an
assertion contains capturing subpatterns within it, these are counted
for the purposes of numbering the capturing subpatterns in the whole
pattern. However, substring capturing is carried out only for positive
assertions, because it does not make sense for negative assertions.

For compatibility with Perl, assertion subpatterns may be repeated;
though it makes no sense to assert the same thing several times, the
side effect of capturing parentheses may occasionally be useful. In
practice, there are only three cases:

(1) If the quantifier is {0}, the assertion is never obeyed during
matching. However, it may contain internal capturing parenthesized
groups that are called from elsewhere via the subroutine mechanism.

(2) If the quantifier is {0,n} where n is greater than zero, it is
treated as if it were {0,1}. At run time, the rest of the pattern match
is tried with and without the assertion, the order depending on the
greediness of the quantifier.

(3) If the minimum repetition is greater than zero, the quantifier is
ignored. The assertion is obeyed just once when encountered during
matching.

Lookahead assertions

  \w+(?=;)

matches a word followed by a semicolon, but does not include the
semicolon in the match, and

  foo(?!bar)

matches any occurrence of "foo" that is not followed by "bar". Note
that the apparently similar pattern

  (?!foo)bar

does not find an occurrence of "bar" that is preceded by something
other than "foo"; it finds any occurrence of "bar" whatsoever, because
the assertion (?!foo) is always true when the next three characters are
"bar". A lookbehind assertion is needed to achieve the other effect.
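The three lookahead patterns above behave the same way in Python's re
module, which is assumed here as a convenient stand-in for PCRE. The
subject strings and variable names are made up for the illustration.

```python
import re

# Words that are immediately followed by a semicolon; the semicolon
# itself is not part of the match.
words_before_semicolon = re.findall(r"\w+(?=;)", "alpha; beta gamma;")

# Start offsets of "foo" that is not followed by "bar".
foo_not_before_bar = [
    m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")
]

# (?!foo)bar finds every "bar", including the one preceded by "foo",
# because the text at the point of the assertion is "bar", never "foo".
every_bar = [m.start() for m in re.finditer(r"(?!foo)bar", "foobar xbar")]
```

The last list contains the offset of the "bar" inside "foobar" as well
as the one in "xbar", which is the pitfall described in the text.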

If you want to force a matching failure at some point in a pattern, the
most convenient way to do it is with (?!) because an empty string
always matches, so an assertion that requires there not to be an empty
string must always fail. The backtracking control verb (*FAIL) or (*F)
is a synonym for (?!).
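The empty negative lookahead can be tried in Python's re module, assumed
here as a stand-in for PCRE. The (*FAIL) verb is not available in
Python, so only (?!) is shown; the names are illustrative.

```python
import re

# An empty negative lookahead can never succeed, at any position.
always_fails = re.compile(r"(?!)")

# Forcing failure in one branch makes the matcher fall through to the
# next alternative.
fallback = re.search(r"a(?!)|b", "ab")
```

In the second pattern the "a" branch is abandoned at (?!), so the match
that is finally reported comes from the "b" branch.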

Lookbehind assertions

Lookbehind assertions start with (?<= for positive assertions and (?<!
for negative assertions. For example,

  (?<!foo)bar

does find an occurrence of "bar" that is not preceded by "foo". The
contents of a lookbehind assertion are restricted such that all the
strings it matches must have a fixed length. However, if there are
several top-level alternatives, they do not all have to have the same
fixed length. Thus

  (?<=bullock|donkey)

is permitted, but

  (?<!dogs?|cats?)

causes an error at compile time. Branches that match different length
strings are permitted only at the top level of a lookbehind assertion.
This is an extension compared with Perl, which requires all branches to
match the same length of string. An assertion such as

  (?<=ab(c|de))

is not permitted, because its single top-level branch can match two
different lengths, but it is acceptable to PCRE if rewritten to use two
top-level branches:

  (?<=abc|abde)

In some cases, the escape sequence \K (see above) can be used instead
of a lookbehind assertion to get round the fixed-length restriction.
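For comparison, the sketch below tries these lookbehinds in Python's re
module. This is an assumption made for illustration, and Python is
stricter than the PCRE behaviour described here: it requires every
alternative in a lookbehind to have the same length, so (?<=abc|abde) is
rejected there and has to be written as two separate assertions. The
helper name is hypothetical.

```python
import re

# Offsets of "bar" whose three preceding characters are not "foo".
bar_not_after_foo = [
    m.start() for m in re.finditer(r"(?<!foo)bar", "foobar xbar")
]


def compiles(pattern):
    """Return True if Python's re accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Python-only workaround: one fixed-length lookbehind per alternative.
split_lookbehind = re.compile(r"(?:(?<=abc)|(?<=abde))x")
```

Both compiles(r"(?<=ab(c|de))x") and compiles(r"(?<=abc|abde)x") are
false in Python, whereas PCRE accepts the second form.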

The implementation of lookbehind assertions is, for each alternative,
to temporarily move the current position back by the fixed length and
then try to match. If there are insufficient characters before the
current position, the assertion fails.
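The procedure just described can be sketched in a few lines of Python.
This is a simplified model and not PCRE's actual code: it assumes each
alternative is a literal string, and the function name and parameters
are hypothetical.

```python
def lookbehind(subject, pos, alternatives):
    """Model a positive lookbehind over fixed-length literal alternatives.

    For each alternative, step back from pos by that alternative's
    length and compare. An alternative that needs more characters than
    precede pos cannot match.
    """
    for alt in alternatives:
        start = pos - len(alt)
        if start < 0:
            continue  # insufficient characters before the current position
        if subject.startswith(alt, start):
            return True
    return False
```

For example, lookbehind("abdeX", 4, ["abc", "abde"]) steps back three
characters for the first alternative, fails, then steps back four for
the second and succeeds.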

In a UTF mode, PCRE does not allow the \C escape (which matches a
single data unit even in a UTF mode) to appear in lookbehind assertions,
because it makes it impossible to calculate the length of the
lookbehind. The \X and \R escapes, which can match different numbers of
data units, are also not permitted.

"Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
lookbehinds, as long as the subpattern matches a fixed-length string.
Recursion, however, is not supported.

Possessive quantifiers can be used in conjunction with lookbehind
assertions to specify efficient matching of fixed-length strings at the
end of subject strings. Consider a simple pattern such as

  abcd$

when applied to a long string that does not match. Because matching
proceeds from left to right, PCRE will look for each "a" in the subject
and then see if what follows matches the rest of the pattern. If the
pattern is specified as

  ^.*abcd$

the initial .* matches the entire string at first, but when this fails
(because there is no following "a"), it backtracks to match all but the
last character, then all but the last two characters, and so on. Once
again the search for "a" covers the entire string, from right to left,
so we are no better off. However, if the pattern is written as

  ^.*+(?<=abcd)

there can be no backtracking for the .*+ item; it can match only the
entire string. The subsequent lookbehind assertion does a single test
on the last four characters. If it fails, the match fails immediately.
For long strings, this approach makes a significant difference to the
processing time.

Using multiple assertions

  (?<=\d{3})(?<!999)foo

matches "foo" preceded by three digits that are not "999". Notice that
each of the assertions is applied independently at the same point in
the subject string. First there is a check that the previous three
characters are all digits, and then there is a check that the same
three characters are not "999". This pattern does not match "foo"
preceded by six characters, the first of which are digits and the last
three of which are not "999". For example, it doesn't match
"123abcfoo". A pattern to do that is

  (?<=\d{3}...)(?<!999)foo

This time the first assertion looks at the preceding six characters,
checking that the first three are digits, and then the second assertion
checks that the preceding three characters are not "999".
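Both patterns can be exercised as written in Python's re module, which
is assumed here as a stand-in for PCRE; the variable names are
illustrative only.

```python
import re

# Both lookbehinds test the same three characters before "foo".
three_digits_not_999 = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first lookbehind covers six characters (three digits, then any
# three); the second still tests only the last three.
six_before = re.compile(r"(?<=\d{3}...)(?<!999)foo")


def found(pattern, subject):
    """Return True if the compiled pattern matches anywhere in subject."""
    return pattern.search(subject) is not None
```

The subject "123abcfoo" separates the two: the first pattern rejects it
because "abc" is not three digits, while the second accepts it.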

  (?<=(?<!foo)bar)baz

matches an occurrence of "baz" that is preceded by "bar" which in turn
is not preceded by "foo", while

  (?<=\d{3}(?!999)...)foo

is another pattern that matches "foo" preceded by three digits and any
three characters that are not "999".


CONDITIONAL SUBPATTERNS

It is possible to cause the matching process to obey a subpattern
conditionally or to choose between two alternative subpatterns,
depending on the result of an assertion, or whether a specific capturing
subpattern has already been matched. The two possible forms of
conditional subpattern are:

  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

If the condition is satisfied, the yes-pattern is used; otherwise the
no-pattern (if present) is used. If there are more than two
alternatives in the subpattern, a compile-time error occurs. Each of the
two alternatives may itself contain nested subpatterns of any form,
including conditional subpatterns; the restriction to two alternatives
applies only at the level of the condition. This pattern fragment is an
example:

  (?(1) (A|B|C) | (D | (?(2)E|F) | E) )

There are four kinds of condition: references to subpatterns,
references to recursion, a pseudo-condition called DEFINE, and
assertions.

Checking for a used subpattern by number

If the text between the parentheses consists of a sequence of digits,
the condition is true if a capturing subpattern of that number has
previously matched. If there is more than one capturing subpattern with
the same number (see the earlier section about duplicate subpattern
numbers), the condition is true if any of them have matched. An
alternative notation is to precede the digits with a plus or minus sign.
In this case, the subpattern number is relative rather than absolute.
The most recently opened parentheses can be referenced by (?(-1), the
next most recent by (?(-2), and so on. Inside loops it can also make
sense to refer to subsequent groups. The next parentheses to be opened
can be referenced as (?(+1), and so on. (The value zero in any of these
forms is not used; it provokes a compile-time error.)

Consider the following pattern, which contains non-significant white
space to make it more readable (assume the PCRE_EXTENDED option) and to
divide it into three parts for ease of discussion:

( \( )? [^()]+ (?(1) \) )

The first part matches an optional opening parenthesis, and if that
character is present, sets it as the first captured substring. The sec-
ond part matches one or more characters that are not parentheses. The
third part is a conditional subpattern that tests whether or not the
first set of parentheses matched. If they did, that is, if the subject
started with an opening parenthesis, the condition is true, and so the
yes-pattern is executed and a closing parenthesis is required. Other-
wise, since no-pattern is not present, the subpattern matches nothing.
In other words, this pattern matches a sequence of non-parentheses,
optionally enclosed in parentheses.
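
The behaviour described above can be tried out with Python's re module. That module is not PCRE, but it accepts the same numbered-condition syntax (?(1)...), and re.VERBOSE plays the role of PCRE_EXTENDED. This is only a minimal sketch; the names PAREN_RE and matches_whole are illustrative and are not part of PCRE.

```python
import re

# The three-part pattern from the text; white space is ignored in VERBOSE mode.
PAREN_RE = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)


def matches_whole(subject):
    """Return True if the entire subject matches the pattern."""
    return PAREN_RE.fullmatch(subject) is not None


if __name__ == "__main__":
    for s in ["abc", "(abc)", "(abc", "abc)"]:
        print(repr(s), matches_whole(s))
```

A closing parenthesis is demanded only when the opening one was captured, so an unbalanced subject is rejected in either direction.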

If you were embedding this pattern in a larger one, you could use a
relative reference:

...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...

This makes the fragment independent of the parentheses in the larger
pattern.

Checking for a used subpattern by name

Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
used subpattern by name. For compatibility with earlier versions of
PCRE, which had this facility before Perl, the syntax (?(name)...) is
also recognized. However, there is a possible ambiguity with this syn-
tax, because subpattern names may consist entirely of digits. PCRE
looks first for a named subpattern; if it cannot find one and the name
consists entirely of digits, PCRE looks for a subpattern of that num-
ber, which must be greater than zero. Using subpattern names that con-
sist entirely of digits is not recommended.

Rewriting the above example to use a named subpattern gives this:

(?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )

If the name used in a condition of this kind is a duplicate, the test
is applied to all subpatterns of the same name, and is true if any one
of them has matched.
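
As a rough illustration of a named condition, the same test can be written for Python's re module. Note the assumptions: Python spells the named group (?P<OPEN>...) and accepts only the (?(OPEN)...) form of the condition, not the Perl forms (?(<OPEN>)...) or (?('OPEN')...). The helper name open_paren_captured is made up for this sketch.

```python
import re

# Named group plus a condition that tests it by name (Python spelling).
NAMED_RE = re.compile(r"(?P<OPEN> \( )? [^()]+ (?(OPEN) \) )", re.VERBOSE)


def open_paren_captured(subject):
    """Return True/False for whether OPEN was set, or None if there is no match."""
    m = NAMED_RE.fullmatch(subject)
    if m is None:
        return None
    return m.group("OPEN") is not None
```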

Checking for pattern recursion

If the condition is the string (R), and there is no subpattern with the
name R, the condition is true if a recursive call to the whole pattern
or any subpattern has been made. If digits or a name preceded by amper-
sand follow the letter R, for example:

(?(R3)...) or (?(R&name)...)

the condition is true if the most recent recursion is into a subpattern
whose number or name is given. This condition does not check the entire
recursion stack. If the name used in a condition of this kind is a
duplicate, the test is applied to all subpatterns of the same name, and
is true if any one of them is the most recent recursion.

At "top level", all these recursion test conditions are false. The
syntax for recursive patterns is described below.

Defining subpatterns for use by reference only

If the condition is the string (DEFINE), and there is no subpattern
with the name DEFINE, the condition is always false. In this case,
there may be only one alternative in the subpattern. It is always
skipped if control reaches this point in the pattern; the idea of
DEFINE is that it can be used to define subroutines that can be refer-
enced from elsewhere. (The use of subroutines is described below.) For
example, a pattern to match an IPv4 address such as "192.168.23.245"
could be written like this (ignore white space and line breaks):

(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
\b (?&byte) (\.(?&byte)){3} \b

The first part of the pattern is a DEFINE group inside which another
group named "byte" is defined. This matches an individual component of
an IPv4 address (a number less than 256). When matching takes place,
this part of the pattern is skipped because DEFINE acts like a false
condition. The rest of the pattern uses references to the named group
to match the four dot-separated components of an IPv4 address, insist-
ing on a word boundary at each end.
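
Python's re module has neither (?(DEFINE)...) nor (?&name), so the sketch below only approximates the idea: the "byte" subpattern is written once as a string and spliced into the main pattern as text. The names BYTE, IPV4_RE and find_ipv4 are assumptions made for this example.

```python
import re

# The "byte" component, written once, as in the DEFINE group above.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Four dot-separated components with a word boundary at each end.
IPV4_RE = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")


def find_ipv4(text):
    """Return the first IPv4-looking substring of text, or None."""
    m = IPV4_RE.search(text)
    return m.group(0) if m else None
```

Splicing text gives the same matching result for this pattern, but unlike a real DEFINE group it repeats the subpattern in the compiled expression.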

Assertion conditions

If the condition is not in any of the above formats, it must be an
assertion. This may be a positive or negative lookahead or lookbehind
assertion. Consider this pattern, again containing non-significant
white space, and with the two alternatives on the second line:

(?(?=[^a-z]*[a-z])
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )

The condition is a positive lookahead assertion that matches an
optional sequence of non-letters followed by a letter. In other words,
it tests for the presence of at least one letter in the subject. If a
letter is found, the subject is matched against the first alternative;
otherwise it is matched against the second. This pattern matches
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
letters and dd are digits.
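
Python's re module does not accept an assertion as a condition, so the following is only an emulation of the logic, applied at the start of the subject: test the lookahead first, then try whichever alternative it selects. All names here are hypothetical.

```python
import re

HAS_LETTER = re.compile(r"[^a-z]*[a-z]")            # the lookahead condition
WITH_LETTERS = re.compile(r"\d{2}-[a-z]{3}-\d{2}")  # yes-pattern
DIGITS_ONLY = re.compile(r"\d{2}-\d{2}-\d{2}")      # no-pattern


def match_conditional(subject):
    """Emulate the conditional at position 0; return the matched text or None."""
    chosen = WITH_LETTERS if HAS_LETTER.match(subject) else DIGITS_ONLY
    m = chosen.match(subject)
    return m.group(0) if m else None
```

Because the condition looks ahead through the rest of the subject, a letter anywhere later selects the first alternative, so "12-34-56 x" is not matched here even though it begins with dd-dd-dd.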

COMMENTS

There are two ways of including comments in patterns that are processed
by PCRE. In both cases, the start of the comment must not be in a char-
acter class, nor in the middle of any other sequence of related charac-
ters such as (?: or a subpattern name or number. The characters that
make up a comment play no part in the pattern matching.

The sequence (?# marks the start of a comment that continues up to the
next closing parenthesis. Nested parentheses are not permitted. If the
PCRE_EXTENDED option is set, an unescaped # character also introduces a
comment, which in this case continues to immediately after the next
newline character or character sequence in the pattern. Which charac-
ters are interpreted as newlines is controlled by the options passed to
a compiling function or by a special sequence at the start of the pat-
tern, as described in the section entitled "Newline conventions" above.
Note that the end of this type of comment is a literal newline sequence
in the pattern; escape sequences that happen to represent a newline do
not count. For example, consider this pattern when PCRE_EXTENDED is
set, and the default newline convention is in force:

abc #comment \n still comment

On encountering the # character, pcre_compile() skips along, looking
for a newline in the pattern. The sequence \n is still literal at this
stage, so it does not terminate the comment. Only an actual character
with the code value 0x0a (the default newline) does so.
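
The distinction between an escape sequence and a real newline can be seen in a short experiment. The sketch uses Python's re module with re.VERBOSE, which is assumed here to stand in for PCRE_EXTENDED; it is not a call into PCRE itself, and the names ESCAPED and LITERAL are illustrative.

```python
import re

# Backslash followed by "n" written in the pattern text (raw string):
# the comment swallows it, so the pattern is just "abc".
ESCAPED = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# A real newline character (0x0a) in the pattern text ends the comment,
# so "def" is part of the pattern again.
LITERAL = re.compile("abc #comment \n def", re.VERBOSE)
```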

RECURSIVE PATTERNS

Consider the problem of matching a string in parentheses, allowing for
unlimited nested parentheses. Without the use of recursion, the best
that can be done is to use a pattern that matches up to some fixed
depth of nesting. It is not possible to handle an arbitrary nesting
depth.

For some time, Perl has provided a facility that allows regular expres-
sions to recurse (amongst other things). It does this by interpolating
Perl code in the expression at run time, and the code can refer to the
expression itself. A Perl pattern using code interpolation to solve the
parentheses problem can be created like this:

$re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

The (?p{...}) item interpolates Perl code at run time, and in this case
refers recursively to the pattern in which it appears.

Obviously, PCRE cannot support the interpolation of Perl code. Instead,
it supports special syntax for recursion of the entire pattern, and
also for individual subpattern recursion. After its introduction in
PCRE and Python, this kind of recursion was subsequently introduced
into Perl at release 5.10.

A special item that consists of (? followed by a number greater than
zero and a closing parenthesis is a recursive subroutine call of the
subpattern of the given number, provided that it occurs inside that
subpattern. (If not, it is a non-recursive subroutine call, which is
described in the next section.) The special item (?R) or (?0) is a
recursive call of the entire regular expression.

This PCRE pattern solves the nested parentheses problem (assume the
PCRE_EXTENDED option is set so that white space is ignored):

\( ( [^()]++ | (?R) )* \)

First it matches an opening parenthesis. Then it matches any number of
substrings which can either be a sequence of non-parentheses, or a
recursive match of the pattern itself (that is, a correctly parenthe-
sized substring). Finally there is a closing parenthesis. Note the use
of a possessive quantifier to avoid backtracking into sequences of non-
parentheses.
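
Python's standard re module has no (?R), so the steps just described are sketched below as a small hand-written recursive function. It mirrors what the pattern matches, not how PCRE implements recursion; the name match_parens is an assumption made for this example.

```python
def match_parens(subject, pos=0):
    """Return the index just past a balanced group starting at pos, or None."""
    # The pattern starts with an opening parenthesis.
    if pos >= len(subject) or subject[pos] != "(":
        return None
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            return pos + 1                     # closing parenthesis: done
        if ch == "(":
            pos = match_parens(subject, pos)   # the (?R) step
            if pos is None:
                return None                    # nested group never closed
        else:
            pos += 1                           # part of a run of non-parentheses
    return None                                # input ended before ")"
```

For example, match_parens("(ab(cd)ef)") consumes the whole string, whereas an unclosed subject such as "(a(b)" yields None.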

If this were part of a larger pattern, you would not want to recurse
the entire pattern, so instead you could use this:

( \( ( [^()]++ | (?1) )* \) )

We have put the pattern into parentheses, and caused the recursion to
refer to them instead of the whole pattern.

In a larger pattern, keeping track of parenthesis numbers can be
tricky. This is made easier by the use of relative references. Instead
of (?1) in the pattern above you can write (?-2) to refer to the second
most recently opened parentheses preceding the recursion. In other
words, a negative number counts capturing parentheses leftwards from
the point at which it is encountered.

It is also possible to refer to subsequently opened parentheses, by
writing references such as (?+2). However, these cannot be recursive
because the reference is not inside the parentheses that are refer-
enced. They are always non-recursive subroutine calls, as described in
the next section.

An alternative approach is to use named parentheses instead. The Perl
syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also
supported. We could rewrite the above example as follows:

(?<pn> \( ( [^()]++ | (?&pn) )* \) )

If there is more than one subpattern with the same name, the earliest
one is used.

This particular example pattern that we have been looking at contains
nested unlimited repeats, and so the use of a possessive quantifier for
matching strings of non-parentheses is important when applying the pat-
tern to strings that do not match. For example, when this pattern is
applied to

(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

it yields "no match" quickly. However, if a possessive quantifier is
not used, the match runs for a very long time indeed because there are
so many different ways the + and * repeats can carve up the subject,
and all have to be tested before failure can be reported.

At the end of a match, the values of capturing parentheses are those
from the outermost level. If you want to obtain intermediate values, a
callout function can be used (see below and the pcrecallout documenta-
tion). If the pattern above is matched against

(ab(cd)ef)

the value for the inner capturing parentheses (numbered 2) is "ef",
which is the last value taken on at the top level. If a capturing sub-
pattern is not matched at the top level, its final captured value is
unset, even if it was (temporarily) set at a deeper level during the
matching process.

If there are more than 15 capturing parentheses in a pattern, PCRE has
to obtain extra memory to store data during a recursion, which it does
by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.

Do not confuse the (?R) item with the condition (R), which tests for
recursion. Consider this pattern, which matches text in angle brack-
ets, allowing for arbitrary nesting. Only digits are allowed in nested
brackets (that is, when recursing), whereas any characters are permit-
ted at the outer level.

< (?: (?(R) \d++ | [^<>]*+) | (?R)) * >

In this pattern, (?(R) is the start of a conditional subpattern, with
two different alternatives for the recursive and non-recursive cases.
The (?R) item is the actual recursive call.

Differences in recursion processing between PCRE and Perl

Recursion processing in PCRE differs from Perl in two important ways.
In PCRE (like Python, but unlike Perl), a recursive subpattern call is
always treated as an atomic group. That is, once it has matched some of
the subject string, it is never re-entered, even if it contains untried
alternatives and there is a subsequent matching failure. This can be
illustrated by the following pattern, which purports to match a palin-
dromic string that contains an odd number of characters (for example,
"a", "aba", "abcba", "abcdcba"):

^(.|(.)(?1)\2)$

The idea is that it either matches a single character, or two identical
characters surrounding a sub-palindrome. In Perl, this pattern works;
in PCRE it does not if the subject string is longer than three charac-
ters. Consider the subject string "abcba":

At the top level, the first character is matched, but as it is not at
the end of the string, the first alternative fails; the second alterna-
tive is taken and the recursion kicks in. The recursive call to subpat-
tern 1 successfully matches the next character ("b"). (Note that the
beginning and end of line tests are not part of the recursion).

Back at the top level, the next character ("c") is compared with what
subpattern 2 matched, which was "a". This fails. Because the recursion
is treated as an atomic group, there are now no backtracking points,
and so the entire match fails. (Perl is able, at this point, to re-
enter the recursion and try the second alternative.) However, if the
pattern is written with the alternatives in the other order, things are
different:

^((.)(?1)\2|.)$

This time, the recursing alternative is tried first, and continues to
recurse until it runs out of characters, at which point the recursion
fails. But this time we do have another alternative to try at the
higher level. That is the big difference: in the previous case the
remaining alternative is at a deeper recursion level, which PCRE cannot
use.

To change the pattern so that it matches all palindromic strings, not
just those with an odd number of characters, it is tempting to change
the pattern to this:

^((.)(?1)\2|.?)$

Again, this works in Perl, but not in PCRE, and for the same reason.
When a deeper recursion has matched a single character, it cannot be
entered again in order to match an empty string. The solution is to
separate the two cases, and write out the odd and even cases as alter-
natives at the higher level:

^(?:((.)(?1)\2|)|((.)(?3)\4|.))

If you want to match typical palindromic phrases, the pattern has to
ignore all non-word characters, which can be done like this:

^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$

If run with the PCRE_CASELESS option, this pattern matches phrases such
as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
Perl. Note the use of the possessive quantifier *+ to avoid backtrack-
ing into sequences of non-word characters. Without this, PCRE takes a
great deal longer (ten times or more) to match typical phrases, and
Perl takes so long that you think it has gone into a loop.

WARNING: The palindrome-matching patterns above work only if the sub-
ject string does not start with a palindrome that is shorter than the
entire string. For example, although "abcba" is correctly matched, if
the subject is "ababa", PCRE finds the palindrome "aba" at the start,
then fails at top level because the end of the string does not follow.
Once again, it cannot jump back into the recursion to try other alter-
natives, so the entire match fails.

The second way in which PCRE and Perl differ in their recursion pro-
cessing is in the handling of captured values. In Perl, when a subpat-
tern is called recursively or as a subroutine (see the next section),
it has no access to any values that were captured outside the recur-
sion, whereas in PCRE these values can be referenced. Consider this
pattern:

^(.)(\1|a(?2))

In PCRE, this pattern matches "bab". The first capturing parentheses
match "b", then in the second group, when the back reference \1 fails
to match "b", the second alternative matches "a" and then recurses. In
the recursion, \1 does now match "b" and so the whole match succeeds.
In Perl, the pattern fails to match because inside the recursive call
\1 cannot access the externally set value.

6049

6087

6050

6088

6051

6089

SUBPATTERNS AS SUBROUTINES

If the syntax for a recursive subpattern call (either by number or by
name) is used outside the parentheses to which it refers, it operates
like a subroutine in a programming language. The called subpattern may
be defined before or after the reference. A numbered reference can be
absolute or relative, as in these examples:

(...(absolute)...)...(?2)...

(sens|respons)e and \1ibility

matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If instead the pattern

(sens|respons)e and (?1)ibility

is used, it does match "sense and responsibility" as well as the other
two strings. Another example is given in the discussion of DEFINE
above.

All subroutine calls, whether recursive or not, are always treated as
atomic groups. That is, once a subroutine has matched some of the sub-
ject string, it is never re-entered, even if it contains untried alter-
natives and there is a subsequent matching failure. Any capturing
parentheses that are set during the subroutine call revert to their
previous values afterwards.

Processing options such as case-independence are fixed when a subpat-
tern is defined, so if it is used as a subroutine, such options cannot
be changed for different calls. For example, consider this pattern:

(abc)(?i:(?-1))

It matches "abcabc". It does not match "abcABC" because the change of
processing option does not affect the called subpattern.


ONIGURUMA SUBROUTINE SYNTAX

For compatibility with Oniguruma, the non-Perl syntax \g followed by a
name or a number enclosed either in angle brackets or single quotes, is
an alternative syntax for referencing a subpattern as a subroutine,
possibly recursively. Here are two of the examples used above, rewrit-
ten using this syntax:

(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
(sens|respons)e and \g'1'ibility

PCRE supports an extension to Oniguruma: if a number is preceded by a
plus or a minus sign it is taken as a relative reference. For example:

(abc)(?i:\g<-1>)

Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
synonymous. The former is a back reference; the latter is a subroutine
call.


CALLOUTS

Perl has a feature whereby using the sequence (?{...}) causes arbitrary
Perl code to be obeyed in the middle of matching a regular expression.
This makes it possible, amongst other things, to extract different sub-
strings that match the same pair of parentheses when there is a repeti-
tion.

PCRE provides a similar feature, but of course it cannot obey arbitrary
Perl code. The feature is called "callout". The caller of PCRE provides
an external function by putting its entry point in the global variable
pcre_callout (8-bit library) or pcre16_callout (16-bit library). By
default, this variable contains NULL, which disables all calling out.

Within a regular expression, (?C) indicates the points at which the
external function is to be called. If you want to identify different
callout points, you can put a number less than 256 after the letter C.
The default value is zero. For example, this pattern has two callout
points:

(?C1)abc(?C2)def

If the PCRE_AUTO_CALLOUT flag is passed to a compiling function, call-
outs are automatically installed before each item in the pattern. They
are all numbered 255.

During matching, when PCRE reaches a callout point, the external func-
tion is called. It is provided with the number of the callout, the
position in the pattern, and, optionally, one item of data originally
supplied by the caller of the matching function. The callout function
may cause matching to proceed, to backtrack, or to fail altogether. A
complete description of the interface to the callout function is given
in the pcrecallout documentation.


BACKTRACKING CONTROL

Perl 5.10 introduced a number of "Special Backtracking Control Verbs",
which are described in the Perl documentation as "experimental and sub-
ject to change or removal in a future version of Perl". It goes on to
say: "Their usage in production code should be noted to avoid problems
during upgrades." The same remarks apply to the PCRE features described
in this section.

Since these verbs are specifically related to backtracking, most of
them can be used only when the pattern is to be matched using one of
the traditional matching functions, which use a backtracking algorithm.
With the exception of (*FAIL), which behaves like a failing negative
assertion, they cause an error if encountered by a DFA matching func-
tion.

If any of these verbs are used in an assertion or in a subpattern that
is called as a subroutine (whether or not recursively), their effect is
confined to that subpattern; it does not extend to the surrounding pat-
tern, with one exception: the name from a (*MARK), (*PRUNE), or (*THEN)
that is encountered in a successful positive assertion is passed back
when a match succeeds (compare capturing parentheses in assertions).
Note that such subpatterns are processed as anchored at the point where
they are tested. Note also that Perl's treatment of subroutines and
assertions is different in some cases.

The new verbs make use of what was previously invalid syntax: an open-
ing parenthesis followed by an asterisk. They are generally of the form
(*VERB) or (*VERB:NAME). Some may take either form, with differing be-
haviour, depending on whether or not an argument is present. A name is
any sequence of characters that does not include a closing parenthesis.
The maximum length of a name is 255 in the 8-bit library and 65535 in
the 16-bit library. If the name is empty, that is, if the closing
parenthesis immediately follows the colon, the effect is as if the
colon were not there. Any number of these verbs may occur in a pattern.

Optimizations that affect backtracking verbs

PCRE contains some optimizations that are used to speed up matching by
running some checks at the start of each match attempt. For example, it

course, be processed. You can suppress the start-of-match optimizations
by setting the PCRE_NO_START_OPTIMIZE option when calling pcre_com-
pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT).
There is more discussion of this option in the section entitled "Option
bits for pcre_exec()" in the pcreapi documentation.

Experiments with Perl suggest that it too has similar optimizations,
sometimes leading to anomalous results.

No match, mark = B

Note that in this unanchored example the mark is retained from the
match attempt that started at the letter "X" in the subject. Subsequent
match attempts starting at "P" and then with an empty string do not get
as far as the (*MARK) item, but nevertheless do not reset it.

If you are interested in (*MARK) values after failed matches, you
should probably set the PCRE_NO_START_OPTIMIZE option (see above) to
ensure that the match is always attempted.


Verbs that act after backtracking


REVISION

Last updated: 17 June 2012

------------------------------------------------------------------------------

\a         alarm, that is, the BEL character (hex 07)
\cx        "control-x", where x is any ASCII character
\e         escape (hex 1B)
\f         form feed (hex 0C)
\n         newline (hex 0A)
\r         carriage return (hex 0D)
\t         tab (hex 09)

\C         one data unit, even in UTF mode (best avoided)
\d         a decimal digit
\D         a character that is not a decimal digit
\h         a horizontal white space character
\H         a character that is not a horizontal white space character
\N         a character that is not a newline
\p{xx}     a character with the xx property
\P{xx}     a character without the xx property
\R         a newline sequence
\s         a white space character
\S         a character that is not a white space character
\v         a vertical white space character
\V         a character that is not a vertical white space character
\w         a "word" character
\W         a "non-word" character
\X         an extended Unicode sequence


SCRIPT NAMES FOR \p AND \P

Arabic, Armenian, Avestan, Balinese, Bamum, Batak, Bengali, Bopomofo,
Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Chakma,
Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret,
Devanagari, Egyptian_Hieroglyphs, Ethiopic, Georgian, Glagolitic,
Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira-
gana, Imperial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscrip-
tional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li,
Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian,
Lydian, Malayalam, Mandaic, Meetei_Mayek, Meroitic_Cursive,
Meroitic_Hieroglyphs, Miao, Mongolian, Myanmar, New_Tai_Lue, Nko,
Ogham, Old_Italic, Old_Persian, Old_South_Arabian, Old_Turkic,
Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Samari-
tan, Saurashtra, Sharada, Shavian, Sinhala, Sora_Sompeng, Sundanese,
Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet,
Takri, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai,
Yi.


CHARACTER CLASSES

lower      lower case letter
print      printing, including space
punct      printing, excluding alphanumeric
space      white space
upper      upper case letter
word       same as \w
xdigit     hexadecimal digit

When you set the PCRE_UTF8 flag, the byte strings passed as patterns
and subjects are (by default) checked for validity on entry to the rel-
evant functions. The entire string is checked before any other process-
ing takes place. From release 7.3 of PCRE, the check is according to
the rules of RFC 3629, which are themselves derived from the Unicode
specification. Earlier releases of PCRE followed the rules of RFC 2279,
which allows the full range of 31-bit values (0 to 0x7FFFFFFF). The
current check allows only values in the range U+0 to U+10FFFF, exclud-
ing U+D800 to U+DFFF.
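
The RFC 3629 rules described above can be sketched in C as follows. This is an illustrative sketch only, not PCRE's own checking code: the function name utf8_first_invalid() is hypothetical, and the return convention (the offset of the first byte of the failing character, or -1 for a valid string) is an assumption chosen to mirror the offset that PCRE is described as reporting.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
   Returns -1 if s[0..len-1] is valid UTF-8 under RFC 3629, otherwise the
   offset of the first byte of the first invalid character. */
long utf8_first_invalid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t extra;            /* number of continuation bytes expected */
        unsigned long cp, min;   /* decoded value, smallest non-overlong value */

        if (c < 0x80) { i++; continue; }     /* single-byte character */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return (long)i;     /* stray continuation byte, or an RFC 2279
                                    5-byte or 6-byte lead byte */

        if (len - i <= extra) return (long)i;          /* truncated character */

        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return (long)i;
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        if (cp < min) return (long)i;                      /* overlong form */
        if (cp > 0x10FFFF) return (long)i;                 /* above U+10FFFF */
        if (cp >= 0xD800 && cp <= 0xDFFF) return (long)i;  /* surrogate */

        i += extra + 1;
    }
    return -1;
}
```

A caller might run such a check once on a long subject string and, if it passes, skip the library's own check on later matches.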

The excluded code points are the "Surrogate Area" of Unicode. They are
reserved for use by UTF-16, where they are used in pairs to encode
codepoints with values greater than 0xFFFF. The code points that are
encoded by UTF-16 pairs are available independently in the UTF-8 encod-
ing. (In other words, the whole surrogate thing is a fudge for UTF-16
which unfortunately messes up UTF-8.)

If an invalid UTF-8 string is passed to PCRE, an error return is given.
At compile time, the only additional information is the offset to the
first byte of the failing character. The run-time functions pcre_exec()
and pcre_dfa_exec() also pass back this information, as well as a more
detailed reason code if the caller has provided memory in which to do
this.

In some situations, you may already know that your strings are valid,
and therefore want to skip these checks in order to improve perfor-
mance, for example in the case of a long subject string that is being
scanned repeatedly with different patterns. If you set the
PCRE_NO_UTF8_CHECK flag at compile time or at run time, PCRE assumes
that the pattern or subject it is given (respectively) contains only
valid UTF-8 codes. In this case, it does not diagnose an invalid UTF-8
string.

If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set,
what happens depends on why the string is invalid. If the string con-
forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a
string of characters in the range 0 to 0x7FFFFFFF by pcre_dfa_exec()
and the interpreted version of pcre_exec(). In other words, apart from
the initial validity test, these functions (when in UTF-8 mode) handle
strings according to the more liberal rules of RFC 2279. However, the
just-in-time (JIT) optimization for pcre_exec() supports only RFC 3629.
If you are using JIT optimization, or if the string does not even con-
form to RFC 2279, the result is undefined. Your program may crash.

If you want to process strings of values in the full range 0 to
0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you can
set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in
this situation, you will have to apply your own validity check, and
avoid the use of JIT optimization.

Validity of UTF-16 strings

When you set the PCRE_UTF16 flag, the strings of 16-bit data units that
are passed as patterns and subjects are (by default) checked for valid-
ity on entry to the relevant functions. Values other than those in the
surrogate range U+D800 to U+DFFF are independent code points. Values in
the surrogate range must be used in pairs in the correct manner.
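
The pairing rule can be illustrated with a short C sketch. It assumes the standard UTF-16 layout, in which a high surrogate (0xD800 to 0xDBFF) must be followed immediately by a low surrogate (0xDC00 to 0xDFFF). Both function names are hypothetical and are not part of the PCRE API; the return convention (the offset of the first failing data unit, or -1) is likewise an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the PCRE API.
   Returns -1 if s[0..len-1] is well-formed UTF-16, otherwise the offset
   (in 16-bit data units) of the first unit of the failing character. */
long utf16_first_invalid(const uint16_t *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        uint16_t u = s[i];

        if (u < 0xD800 || u > 0xDFFF) { i++; continue; }  /* independent code point */
        if (u >= 0xDC00) return (long)i;        /* low surrogate with no high before it */
        if (i + 1 >= len) return (long)i;       /* high surrogate at end of string */
        if (s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
            return (long)i;                     /* high surrogate not followed by low */
        i += 2;                                 /* a correctly formed pair */
    }
    return -1;
}

/* Combines a valid surrogate pair into the code point it encodes
   (always greater than 0xFFFF). */
uint32_t utf16_pair_to_code_point(uint16_t high, uint16_t low)
{
    return 0x10000u + (((uint32_t)(high - 0xD800) << 10) |
                       (uint32_t)(low - 0xDC00));
}
```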

If an invalid UTF-16 string is passed to PCRE, an error return is
given. At compile time, the only additional information is the offset
to the first data unit of the failing character. The run-time functions
pcre16_exec() and pcre16_dfa_exec() also pass back this information, as
well as a more detailed reason code if the caller has provided memory
in which to do this.

In some situations, you may already know that your strings are valid,
and therefore want to skip these checks in order to improve perfor-
mance. If you set the PCRE_NO_UTF16_CHECK flag at compile time or at
run time, PCRE assumes that the pattern or subject it is given (respec-
tively) contains only valid UTF-16 sequences. In this case, it does not
diagnose an invalid UTF-16 string.

General comments about UTF modes

1. Codepoints less than 256 can be specified by either braced or
unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3).
Larger values have to use braced sequences.

2. Octal numbers up to \777 are recognized, and in UTF-8 mode, they
match two-byte characters for values greater than \177.

3. Repeat quantifiers apply to complete UTF characters, not to individ-
ual data units, for example: \x{100}{3}.

4. The dot metacharacter matches one UTF character instead of a single
data unit.

5. The escape sequence \C can be used to match a single byte in UTF-8
mode, or a single 16-bit data unit in UTF-16 mode, but its use can lead
to some strange effects because it breaks up multi-unit characters (see
the description of \C in the pcrepattern documentation). The use of \C
is not supported in the alternative matching function
pcre[16]_dfa_exec(), nor is it supported in UTF mode by the JIT opti-
mization of pcre[16]_exec(). If JIT optimization is requested for a UTF
pattern that contains \C, it will not succeed, and so the matching will
be carried out by the normal interpretive function.

6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
test characters of any code value, but, by default, the characters that
PCRE recognizes as digits, spaces, or word characters remain the same
set as in non-UTF mode, all with values less than 256. This remains
true even when PCRE is built to include Unicode property support,
because to do otherwise would slow down PCRE in many common cases. Note
in particular that this applies to \b and \B, because they are defined
in terms of \w and \W. If you really want to test for a wider sense of,
say, "digit", you can use explicit Unicode property tests such as
\p{Nd}. Alternatively, if you set the PCRE_UCP option, the way that the
character escapes work is changed so that Unicode properties are used
to determine which characters match. There are more details in the sec-
tion on generic character types in the pcrepattern documentation.

7. Similarly, characters that match the POSIX named character classes
are all low-valued characters, unless the PCRE_UCP option is set.

8. However, the horizontal and vertical white space matching escapes
(\h, \H, \v, and \V) do match all the appropriate Unicode characters,
whether or not PCRE_UCP is set.

9. Case-insensitive matching applies only to characters whose values
are less than 128, unless PCRE is built with Unicode property support.
Even when Unicode property support is available, PCRE still uses its
own character tables when checking the case of low-valued characters,
so as not to degrade performance. The Unicode property information is
used only for characters with higher values. Furthermore, PCRE supports
case-insensitive matching only when there is a one-to-one mapping
between a letter's cases. There are a small number of many-to-one map-
pings in Unicode; these are not supported by PCRE.
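
To make the distinction between characters and data units in the list above concrete, here is a small C sketch of standard UTF-8 encoding. The function name utf8_encode() is hypothetical and is not part of the PCRE API. It shows why octal escapes from \200 to \777 occupy two bytes in UTF-8 mode, and why a quantified item such as \x{100}{3} has to consume three two-byte characters rather than three bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the PCRE API.
   Encodes one code point as UTF-8 under RFC 3629 rules. Returns the
   number of bytes written to out, or 0 if cp is a surrogate or is
   greater than 0x10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                    /* up to \177: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* \200 to \3777: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogate area */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For instance, octal 777 is 0x1FF, which this sketch encodes as the two bytes 0xC7 0xBF, and U+0100 becomes 0xC4 0x80.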


REVISION

Last updated: 14 April 2012

------------------------------------------------------------------------------

MIPS 32-bit
Power PC 32-bit and 64-bit

If --enable-jit is set on an unsupported platform, compilation fails.

A program that is linked with PCRE 8.20 or later can tell if JIT sup-
port is available by calling pcre_config() with the PCRE_CONFIG_JIT
option. The result is 1 when JIT is available, and 0 otherwise. How-
ever, a simple program does not need to check this in order to use JIT.
The API is implemented in a way that falls back to the interpretive
code if JIT is not available.

If your program may sometimes be linked with versions of PCRE that are

pcre_exec().

(2) Use pcre_free_study() to free the pcre_extra block when it is
no longer needed, instead of just freeing it yourself. This
ensures that any JIT data is also freed.

For a program that may be linked with pre-8.20 versions of PCRE, you

pcre_free(study_ptr);
#endif

PCRE_STUDY_JIT_COMPILE requests the JIT compiler to generate code for
complete matches. If you want to run partial matches using the
PCRE_PARTIAL_HARD or PCRE_PARTIAL_SOFT options of pcre_exec(), you
should set one or both of the following options in addition to, or
instead of, PCRE_STUDY_JIT_COMPILE when you call pcre_study():

PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE

The JIT compiler generates different optimized code for each of the
three modes (normal, soft partial, hard partial). When pcre_exec() is
called, the appropriate code is run if it is available. Otherwise, the
pattern is matched using interpretive code.

In some circumstances you may need to call additional functions. These
are described in the section entitled "Controlling the JIT stack"
below.

If JIT support is not available, PCRE_STUDY_JIT_COMPILE etc. are
ignored, and no JIT data is created. Otherwise, the compiled pattern is
passed to the JIT compiler, which turns it into machine code that exe-
cutes much faster than the normal interpretive code. When pcre_exec()
is passed a pcre_extra block containing a pointer to JIT code of the
appropriate mode (normal or hard/soft partial), it obeys that code
instead of running the interpreter. The result is identical, but the
compiled JIT code runs much faster.

There are some pcre_exec() options that are not supported for JIT exe-
cution. There are also some pattern items that JIT cannot handle.
Details are given below. In both cases, execution automatically falls
back to the interpretive code. If you want to know whether JIT was
actually used for a particular match, you should arrange for a JIT
callback function to be set up as described in the section entitled
"Controlling the JIT stack" below, even if you do not need to supply a

non-default JIT stack. Such a callback function is called whenever JIT

7206

code is about to be obeyed. If the execution options are not right for

7207

JIT execution, the callback function is not obeyed.

7137

7208

7138

7209

If the JIT compiler finds an unsupported item, no JIT data is gener-

7139

7210

ated. You can find out if JIT execution is available after studying a

7140

7211

pattern by calling pcre_fullinfo() with the PCRE_INFO_JIT option. A

7141

7212

result of 1 means that JIT compilation was successful. A result of 0

7142

7213

means that JIT support is not available, or the pattern was not studied

7143

with PCRE_STUDY_JIT_COMPILE, or the JIT compiler was not able to handle

7144

the pattern.

7214

with PCRE_STUDY_JIT_COMPILE etc., or the JIT compiler was not able to

7215

handle the pattern.

7145

7216

7146

7217

Once a pattern has been studied, with or without JIT, it can be used as

7147

7218

many times as you like for matching different subject strings.

7150

7221

UNSUPPORTED OPTIONS AND PATTERN ITEMS

7151

7222

7152

7223

The only pcre_exec() options that are supported for JIT execution are

7153

PCRE_NO_UTF8_CHECK, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and

7154

PCRE_NOTEMPTY_ATSTART. Note in particular that partial matching is not

7155

supported.

7224

PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK, PCRE_NOTBOL, PCRE_NOTEOL,

7225

PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART, PCRE_PARTIAL_HARD, and PCRE_PAR-

7226

TIAL_SOFT.

7156

7227

7157

7228

The unsupported pattern items are:

7158

7229

7159

7230

\C match a single byte; not supported in UTF-8 mode

7160

7231

(?Cn) callouts

7161

(*COMMIT) )

7162

(*MARK) )

7163

(*PRUNE) ) the backtracking control verbs

7164

(*SKIP) )

7232

(*PRUNE) )

7233

(*SKIP) ) backtracking control verbs

7165

7234

(*THEN) )

7166

7235

7167

7236

Support for some of these may be added in future.

7228

7297

void *data

7229

7298

7230

7299

The extra argument must be the result of studying a pattern with

7231

PCRE_STUDY_JIT_COMPILE. There are three cases for the values of the

7300

PCRE_STUDY_JIT_COMPILE etc. There are three cases for the values of the

7232

7301

other two options:

7233

7302

7234

7303

(1) If callback is NULL and data is NULL, an internal 32K block

7237

7306

(2) If callback is NULL and data is not NULL, data must be

7238

7307

a valid JIT stack, the result of calling pcre_jit_stack_alloc().

7239

7308

7240

(3) If callback not NULL, it must point to a function that is called

7241

with data as an argument at the start of matching, in order to

7242

set up a JIT stack. If the result is NULL, the internal 32K stack

7243

is used; otherwise the return value must be a valid JIT stack,

7244

the result of calling pcre_jit_stack_alloc().

7245

7246

You may safely assign the same JIT stack to more than one pattern, as

7247

long as they are all matched sequentially in the same thread. In a mul-

7248

tithread application, each thread must use its own JIT stack.

7249

7250

Strictly speaking, even more is allowed. You can assign the same stack

7251

to any number of patterns as long as they are not used for matching by

7252

multiple threads at the same time. For example, you can assign the same

7253

stack to all compiled patterns, and use a global mutex in the callback

7254

to wait until the stack is available for use. However, this is an inef-

7255

ficient solution, and not recommended.

7256

7257

This is a suggestion for how a typical multithreaded program might

7258

operate:

7309

(3) If callback is not NULL, it must point to a function that is

7310

called with data as an argument at the start of matching, in

7311

order to set up a JIT stack. If the return from the callback

7312

function is NULL, the internal 32K stack is used; otherwise the

7313

return value must be a valid JIT stack, the result of calling

7314

pcre_jit_stack_alloc().

7315

7316

A callback function is obeyed whenever JIT code is about to be run; it

7317

is not obeyed when pcre_exec() is called with options that are incom-

7318

patible for JIT execution. A callback function can therefore be used to

7319

determine whether a match operation was executed by JIT or by the

7320

interpreter.

7321

7322

You may safely use the same JIT stack for more than one pattern (either

7323

by assigning directly or by callback), as long as the patterns are all

7324

matched sequentially in the same thread. In a multithread application,

7325

if you do not specify a JIT stack, or if you assign or pass back NULL

7326

from a callback, that is thread-safe, because each thread has its own

7327

machine stack. However, if you assign or pass back a non-NULL JIT

7328

stack, this must be a different stack for each thread so that the

7329

application is thread-safe.

7330

7331

Strictly speaking, even more is allowed. You can assign the same non-

7332

NULL stack to any number of patterns as long as they are not used for

7333

matching by multiple threads at the same time. For example, you can

7334

assign the same stack to all compiled patterns, and use a global mutex

7335

in the callback to wait until the stack is available for use. However,

7336

this is an inefficient solution, and not recommended.

7337

7338

This is a suggestion for how a multithreaded program that needs to set

7339

up non-default JIT stacks might operate:

7259

7340

7260

7341

During thread initialization

7261

7342

thread_local_var = pcre_jit_stack_alloc(...)

7266

7347

Use a one-line callback function

7267

7348

return thread_local_var

7268

7349

7269

All the functions described in this section do nothing if JIT is not

7270

available, and pcre_assign_jit_stack() does nothing unless the extra

7271

argument is non-NULL and points to a pcre_extra block that is the

7272

result of a successful study with PCRE_STUDY_JIT_COMPILE.

7350

All the functions described in this section do nothing if JIT is not

7351

available, and pcre_assign_jit_stack() does nothing unless the extra

7352

argument is non-NULL and points to a pcre_extra block that is the

7353

result of a successful study with PCRE_STUDY_JIT_COMPILE etc.

7273

7354

7274

7355

7275

7356

JIT STACK FAQ

7276

7357

7277

7358

(1) Why do we need JIT stacks?

7278

7359

7279

PCRE (and JIT) is a recursive, depth-first engine, so it needs a stack

7280

where the local data of the current node is pushed before checking its

7360

PCRE (and JIT) is a recursive, depth-first engine, so it needs a stack

7361

where the local data of the current node is pushed before checking its

7281

7362

child nodes. Allocating real machine stack on some platforms is diffi-

7282

7363

cult. For example, the stack chain needs to be updated every time if we

7283

extend the stack on PowerPC. Although it is possible, its updating

7364

extend the stack on PowerPC. Although it is possible, its updating

7284

7365

time overhead decreases performance. So we do the recursion in memory.

7285

7366

7286

7367

(2) Why don't we simply allocate blocks of memory with malloc()?

7287

7368

7288

Modern operating systems have a nice feature: they can reserve an

7369

Modern operating systems have a nice feature: they can reserve an

7289

7370

address space instead of allocating memory. We can safely allocate mem-

7290

ory pages inside this address space, so the stack could grow without

7371

ory pages inside this address space, so the stack could grow without

7291

7372

moving memory data (this is important because of pointers). Thus we can

7292

allocate 1M address space, and use only a single memory page (usually

7293

4K) if that is enough. However, we can still grow up to 1M anytime if

7373

allocate 1M address space, and use only a single memory page (usually

7374

4K) if that is enough. However, we can still grow up to 1M anytime if

7294

7375

needed.

7295

7376

7296

7377

(3) Who "owns" a JIT stack?

7297

7378

7298

7379

The owner of the stack is the user program, not the JIT studied pattern

7299

or anything else. The user program must ensure that if a stack is used

7300

by pcre_exec(), (that is, it is assigned to the pattern currently run-

7380

or anything else. The user program must ensure that if a stack is used

7381

by pcre_exec(), (that is, it is assigned to the pattern currently run-

7301

7382

ning), that stack must not be used by any other threads (to avoid over-

7302

7383

writing the same memory area). The best practice for multithreaded pro-

7303

grams is to allocate a stack for each thread, and return this stack

7384

grams is to allocate a stack for each thread, and return this stack

7304

7385

through the JIT callback function.

7305

7386

7306

7387

(4) When should a JIT stack be freed?

7307

7388

7308

7389

You can free a JIT stack at any time, as long as it will not be used by

7309

pcre_exec() again. When you assign the stack to a pattern, only a

7310

pointer is set. There is no reference counting or any other magic. You

7311

can free the patterns and stacks in any order, anytime. Just do not

7312

call pcre_exec() with a pattern pointing to an already freed stack, as

7313

that will cause SEGFAULT. (Also, do not free a stack currently used by

7314

pcre_exec() in another thread). You can also replace the stack for a

7315

pattern at any time. You can even free the previous stack before

7390

pcre_exec() again. When you assign the stack to a pattern, only a

7391

pointer is set. There is no reference counting or any other magic. You

7392

can free the patterns and stacks in any order, anytime. Just do not

7393

call pcre_exec() with a pattern pointing to an already freed stack, as

7394

that will cause SEGFAULT. (Also, do not free a stack currently used by

7395

pcre_exec() in another thread). You can also replace the stack for a

7396

pattern at any time. You can even free the previous stack before

7316

7397

assigning a replacement.

7317

7398

7318

(5) Should I allocate/free a stack every time before/after calling

7399

(5) Should I allocate/free a stack every time before/after calling

7319

7400

pcre_exec()?

7320

7401

7321

No, because this is too costly in terms of resources. However, you

7322

could implement some clever idea which releases the stack if it is not

7402

No, because this is too costly in terms of resources. However, you

7403

could implement some clever idea which releases the stack if it is not

7323

7404

used in, let's say, two minutes. The JIT callback can help to achieve this

7324

7405

without keeping a list of the currently JIT studied patterns.

7325

7406

7326

(6) OK, the stack is for long term memory allocation. But what happens

7327

if a pattern causes stack overflow with a stack of 1M? Is that 1M kept

7407

(6) OK, the stack is for long term memory allocation. But what happens

7408

if a pattern causes stack overflow with a stack of 1M? Is that 1M kept

7328

7409

until the stack is freed?

7329

7410

7330

Especially on embedded systems, it might be a good idea to release mem-

7331

ory sometimes without freeing the stack. There is no API for this at

7332

the moment. Probably a function call which returns with the currently

7333

allocated memory for any stack and another which allows releasing mem-

7411

Especially on embedded systems, it might be a good idea to release mem-

7412

ory sometimes without freeing the stack. There is no API for this at

7413

the moment. Probably a function call which returns with the currently

7414

allocated memory for any stack and another which allows releasing mem-

7334

7415

ory (shrinking the stack) would be a good idea if someone needs this.

7335

7416

7336

7417

(7) This is too much of a headache. Isn't there any better solution for

7337

7418

JIT stack handling?

7338

7419

7339

No, thanks to Windows. If POSIX threads were used everywhere, we could

7420

No, thanks to Windows. If POSIX threads were used everywhere, we could

7340

7421

throw out this complicated API.

7341

7422

7342

7423

7343

7424

EXAMPLE CODE

7344

7425

7345

This is a single-threaded example that specifies a JIT stack without

7426

This is a single-threaded example that specifies a JIT stack without

7346

7427

using a callback.

7347

7428

7348

7429

int rc;

7378

7459

7379

7460

REVISION

7380

7461

7381

Last updated: 08 January 2012

7462

Last updated: 04 May 2012

7382

7463

7383

7464

------------------------------------------------------------------------------

7384

7465

7422

7503

matching function. If both options are set, PCRE_PARTIAL_HARD takes

7423

7504

precedence.

7424

7505

7425

Setting a partial matching option disables the use of any just-in-time

7426

code that was set up by studying the compiled pattern with the

7427

PCRE_STUDY_JIT_COMPILE option. It also disables two of PCRE's standard

7428

optimizations. PCRE remembers the last literal data unit in a pattern,

7429

and abandons matching immediately if it is not present in the subject

7506

If you want to use partial matching with just-in-time optimized code,

7507

you must call pcre_study() or pcre16_study() with one or both of these

7508

options:

7509

7510

PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE

7511

PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE

7512

7513

PCRE_STUDY_JIT_COMPILE should also be set if you are going to run non-

7514

partial matches on the same pattern. If the appropriate JIT study mode

7515

has not been set for a match, the interpretive matching code is used.

7516

7517

Setting a partial matching option disables two of PCRE's standard opti-

7518

mizations. PCRE remembers the last literal data unit in a pattern, and

7519

abandons matching immediately if it is not present in the subject

7430

7520

string. This optimization cannot be used for a subject string that

7431

7521

might match only partially. If the pattern was studied, PCRE knows the

7432

7522

minimum length of a matching string, and does not bother to run the

7682

7772

7683

7773

At this stage, an application could discard the text preceding "23ja",

7684

7774

add on text from the next segment, and call the matching function

7685

again. Unlike the DFA matching functions the entire matching string

7775

again. Unlike the DFA matching functions, the entire matching string

7686

7776

must always be available, and the complete matching process occurs for

7687

7777

each call, so more memory and more processing time are needed.

7688

7778

7690

7780

with \b or \B, the string that is returned for a partial match includes

7691

7781

characters that precede the partially matched string itself, because

7692

7782

these must be retained when adding on more characters for a subsequent

7693

matching attempt.

7783

matching attempt. However, in some cases you may need to retain even

7784

earlier characters, as discussed in the next section.
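
The buffer handling between calls might be sketched as follows. This helper is not part of PCRE: the name retain_partial_tail() and its parameters are hypothetical, and it counts bytes, which equals characters only in 8-bit, non-UTF-8 mode. The match_start argument stands for the first offset returned for the partial match, and keep_before for any extra characters an application chooses to retain before it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Moves the part of the subject that must be kept to the front of
   `buf`, so that the next segment can be appended after it.
     buf          buffer holding `len` bytes of subject text
     match_start  offset at which the partial match began
     keep_before  extra bytes to retain before match_start
   Returns the number of bytes retained. */
static size_t retain_partial_tail(char *buf, size_t len,
                                  size_t match_start, size_t keep_before)
{
    size_t from, kept;

    assert(buf != NULL && match_start <= len);

    /* Near the start of the subject fewer bytes may exist; keep all. */
    from = (keep_before >= match_start) ? 0 : match_start - keep_before;
    kept = len - from;
    memmove(buf, buf + from, kept);   /* regions may overlap */
    return kept;
}
```

For instance, with a 16-byte buffer whose partial match "23ja" starts at offset 12, calling the helper with keep_before set to 0 leaves just those 4 bytes at the front of the buffer, ready for the next segment to be appended.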

7694

7785

7695

7786

7696

7787

ISSUES WITH MULTI-SEGMENT MATCHING

7699

7790

whichever matching function is used.

7700

7791

7701

7792

1. If the pattern contains a test for the beginning of a line, you need

7702

to pass the PCRE_NOTBOL option when the subject string for any call

7703

does not start at the beginning of a line. There is also a PCRE_NOTEOL

7793

to pass the PCRE_NOTBOL option when the subject string for any call

7794

does not start at the beginning of a line. There is also a PCRE_NOTEOL

7704

7795

option, but in practice when doing multi-segment matching you should be

7705

7796

using PCRE_PARTIAL_HARD, which includes the effect of PCRE_NOTEOL.

7706

7797

7707

2. Lookbehind assertions at the start of a pattern are catered for in

7708

the offsets that are returned for a partial match. However, in theory,

7709

a lookbehind assertion later in the pattern could require even earlier

7710

characters to be inspected, and it might not have been reached when a

7711

partial match occurs. This is probably an extremely unlikely case; you

7712

could guard against it to a certain extent by always including extra

7713

characters at the start.

7714

7715

3. Matching a subject string that is split into multiple segments may

7798

2. Lookbehind assertions that have already been obeyed are catered for

7799

in the offsets that are returned for a partial match. However a lookbe-

7800

hind assertion later in the pattern could require even earlier charac-

7801

ters to be inspected. You can handle this case by using the

7802

PCRE_INFO_MAXLOOKBEHIND option of the pcre_fullinfo() or

7803

pcre16_fullinfo() functions to obtain the length of the largest lookbe-

7804

hind in the pattern. This length is given in characters, not bytes. If

7805

you always retain at least that many characters before the partially

7806

matched string, all should be well. (Of course, near the start of the

7807

subject, fewer characters may be present; in that case all characters

7808

should be retained.)

7809

7810

3. Because a partial match must always contain at least one character,

7811

what might be considered a partial match of an empty string actually

7812

gives a "no match" result. For example:

7813

7814

re> /c(?<=abc)x/

7815

data> ab\P

7816

No match

7817

7818

If the next segment begins "cx", a match should be found, but this will

7819

only happen if characters from the previous segment are retained. For

7820

this reason, a "no match" result should be interpreted as "partial

7821

match of an empty string" when the pattern contains lookbehinds.

7822

7823

4. Matching a subject string that is split into multiple segments may

7716

7824

not always produce exactly the same result as matching over one single

7717

7825

long string, especially when PCRE_PARTIAL_SOFT is used. The section

7718

7826

"Partial Matching and Word Boundaries" above describes an issue that

7756

7864

data> gsb\R\P\P\D

7757

7865

Partial match: gsb

7758

7866

7759

4. Patterns that contain alternatives at the top level which do not all

7867

5. Patterns that contain alternatives at the top level which do not all

7760

7868

start with the same pattern item may not work as expected when

7761

7869

PCRE_DFA_RESTART is used. For example, consider this pattern:

7762

7870

7801

7909

7802

7910

REVISION

7803

7911

7804

Last updated: 21 January 2012

7912

Last updated: 24 February 2012

7805

7913

7806

7914

------------------------------------------------------------------------------

7807

7915

8551

8659

PCRE_DOTALL dot matches newlines /s

8552

8660

PCRE_DOLLAR_ENDONLY $ matches only at end N/A

8553

8661

PCRE_EXTRA strict escape parsing N/A

8554

PCRE_EXTENDED ignore whitespaces /x

8662

PCRE_EXTENDED ignore white spaces /x

8555

8663

PCRE_UTF8 handles UTF8 chars built-in

8556

8664

PCRE_UNGREEDY reverses * and *? N/A

8557

8665

PCRE_NO_AUTO_CAPTURE disables capturing parens N/A (*)

8839

8947

The maximum length of name for a named subpattern is 32 characters, and

8840

8948

the maximum number of named subpatterns is 10000.

8841

8949

8950

The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or

8951

(*THEN) verb is 255 for the 8-bit library and 65535 for the 16-bit

8952

library.

8953

8842

8954

The maximum length of a subject string is the largest positive number

8843

8955

that an integer variable can hold. However, when using the traditional

8844

8956

matching function, PCRE uses recursion to handle subpatterns and indef-

8856

8968

8857

8969

REVISION

8858

8970

8859

Last updated: 08 January 2012

8971

Last updated: 04 May 2012

8860

8972

8861

8973

------------------------------------------------------------------------------

8862

8974