~ubuntu-branches/debian/squeeze/erlang/squeeze

<item>Specifies that \\R is to match only the cr, lf or crlf sequences, not the Unicode-specific newline characters (overrides the compilation option).</item>

387

<item>Specifies that \R is to match only the cr, lf or crlf sequences, not the Unicode-specific newline characters (overrides the compilation option).</item>

388

<tag><c>bsr_unicode</c></tag>

389

<item>Specifies that \\R is to match all the Unicode newline characters (including crlf etc.; this is the default). Overrides the compilation option.</item>

389

<item>Specifies that \R is to match all the Unicode newline characters (including crlf etc.; this is the default). Overrides the compilation option.</item>

390

391

<tag><c>{capture, ValueSpec}</c>/<c>{capture, ValueSpec, Type}</c></tag>

392

<item>

471

<tag><c>index</c></tag>

472

<item>Return captured substrings as pairs of byte indexes into the subject string and length of the matching string in the subject (as if the subject string was flattened with <c>iolist_to_binary/1</c> or <c>unicode:characters_to_binary/2</c> prior to matching). Note that the <c>unicode</c> option results in byte-oriented indexes in a (possibly imagined) UTF-8 encoded binary. A byte index tuple <c>{0,2}</c> might therefore represent one or two characters when <c>unicode</c> is in effect. This might seem counter-intuitive, but has been deemed the most effective and useful way to do it. Returning lists instead might result in simpler code if that is desired. This return type is the default.</item>

473

474

<item>Return matching substrings as lists of characters (Erlang <c>string()</c>'s). If the <c>unicode</c> option is used in combination with the \\C sequence in the regular expression, a captured subpattern can contain bytes that are not valid UTF-8 (\\C matches bytes regardless of character encoding). In that case the <c>list</c> capturing may result in the same types of tuples that <c>unicode:characters_to_list/2</c> can return, namely three-tuples with the tag <c>incomplete</c> or <c>error</c>, the successfully converted characters and the invalid UTF-8 tail of the conversion as a binary. The best strategy is to avoid using the\\C sequence when capturing lists.</item>

474

<item>Return matching substrings as lists of characters (Erlang <c>string()</c>'s). If the <c>unicode</c> option is used in combination with the \C sequence in the regular expression, a captured subpattern can contain bytes that are not valid UTF-8 (\C matches bytes regardless of character encoding). In that case the <c>list</c> capturing may result in the same types of tuples that <c>unicode:characters_to_list/2</c> can return, namely three-tuples with the tag <c>incomplete</c> or <c>error</c>, the successfully converted characters and the invalid UTF-8 tail of the conversion as a binary. The best strategy is to avoid using the \C sequence when capturing lists.</item>

475

<tag><c>binary</c></tag>

476

<item>Return matching substrings as binaries. If the <c>unicode</c> option is used, these binaries are in UTF-8. If the \\C sequence is used together with <c>unicode</c> the binaries may be invalid UTF-8.</item>

476

<item>Return matching substrings as binaries. If the <c>unicode</c> option is used, these binaries are in UTF-8. If the \C sequence is used together with <c>unicode</c> the binaries may be invalid UTF-8.</item>

477

</taglist>

478

</item>

479

</taglist>
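A minimal sketch of the three capture return types described above. The module name `capture_demo`, the function names and the sample subject are illustrative assumptions, not part of the documented API; only `re:run/3` and its `capture` and `unicode` options come from the text.

```erlang
-module(capture_demo).
-export([as_index/0, as_list/0, as_binary/0, unicode_index/0]).

%% index (the default type): {ByteOffset, Length} pairs, whole match first,
%% then one pair per capturing subpattern.
as_index() ->
    re:run("abcd", "b(c)", [{capture, all, index}]).

%% list: the same captures returned as Erlang strings.
as_list() ->
    re:run("abcd", "b(c)", [{capture, all, list}]).

%% binary: the same captures returned as binaries.
as_binary() ->
    re:run("abcd", "b(c)", [{capture, all, binary}]).

%% With unicode, offsets count bytes of the UTF-8 encoding. Code point 229
%% (a with ring above) occupies two bytes, so "b" is reported at byte 2.
unicode_index() ->
    re:run([229, $b], "b", [unicode]).
```

Calling `capture_demo:unicode_index()` illustrates why a byte index does not necessarily equal a character index when `unicode` is in effect.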

544

545

The replacement string can contain the special character

546

<c>&</c>, which inserts the whole matching expression in the

547

result, and the special sequence <c>\\</c>N (where N is an

547

result, and the special sequence <c>\</c>N (where N is an

548

integer > 0), which causes subexpression number N to be

549

inserted in the result. If no subexpression with that number is

550

generated by the regular expression, nothing is inserted.

551

To insert an <c>&</c> or <c>\\</c> in the result, precede it

552

with a <c>\\</c>. Note that Erlang already gives a special

553

meaning to <c>\\</c> in literal strings, which is why a single <c>\\</c>

554

has to be written as <c>"\\\\"</c> and therefore a double <c>\\</c>

555

as <c>"\\\\\\\\"</c>. Example:

551

To insert an <c>&</c> or <c>\</c> in the result, precede it

552

with a <c>\</c>. Note that Erlang already gives a special

553

meaning to <c>\</c> in literal strings, which is why a single <c>\</c>

554

has to be written as <c>"\\"</c> and therefore a double <c>\</c>

555

as <c>"\\\\"</c>. Example:

556

<code> re:replace("abcd","c","[&]",[{return,list}]).</code>

557

gives

558

559

while

560

<code> re:replace("abcd","c","[\\\&]",[{return,list}]).</code>

560

<code> re:replace("abcd","c","[\\&]",[{return,list}]).</code>

561

gives

562

563
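The replacement rules above can be sketched as follows. The module name `replace_demo` and the third example (swapping two subexpressions) are illustrative additions; the first two calls are the ones quoted in the text, written as they would appear in Erlang source, where each backslash intended for `re` is doubled.

```erlang
-module(replace_demo).
-export([whole_match/0, literal_ampersand/0, subexpressions/0]).

%% "&" inserts the whole matching expression.
whole_match() ->
    re:replace("abcd", "c", "[&]", [{return, list}]).

%% "\&" (written "\\&" in an Erlang string) inserts a literal ampersand.
literal_ampersand() ->
    re:replace("abcd", "c", "[\\&]", [{return, list}]).

%% "\N" inserts subexpression number N; here the two groups are swapped.
subexpressions() ->
    re:replace("abcd", "(b)(c)", "\\2\\1", [{return, list}]).
```

Under these assumptions the three functions should return `"ab[c]d"`, `"ab[&]d"` and `"acbd"` respectively.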

As with <c>re:run/3</c>, compilation errors raise the <c>badarg</c>

592

<v>NLSpec = cr | crlf | lf | anycrlf | any </v>

593

<v>SplitList = [ RetData ] | [ GroupedRetData ]</v>

594

<v>GroupedRetData = [ RetData ]</v>

595

<v>RetData = iodata() charlist() | binary() | list()</v>

595

<v>RetData = iodata() | charlist() | binary() | list()</v>

596

</type>

597

<desc>

598

This function splits the input into parts by finding tokens

852

(*CR)a.b

853

</quote>

854

855

changes the convention to CR. That pattern matches "a\\nb" because LF is no

855

changes the convention to CR. That pattern matches "a\nb" because LF is no

856

longer a newline. Note that these special settings, which are not

857

Perl-compatible, are recognized only at the very start of a pattern, and that

858

they must be in upper case. If more than one of them is present, the last one

859

is used.

860

861

The newline convention does not affect what the \\R escape sequence matches. By

861

The newline convention does not affect what the \R escape sequence matches. By

862

default, this is any Unicode newline sequence, for Perl compatibility. However,

863

this can be changed; see the description of \\R in the section entitled

863

this can be changed; see the description of \R in the section entitled

864

865

"Newline sequences"

866

867

below. A change of \\R setting can be combined with a change of newline

867

below. A change of \R setting can be combined with a change of newline

868

convention.

869

870

</section>

897

are as follows:

898

899

900

<tag>\\</tag> <item>general escape character with several uses</item>

900

<tag>\</tag> <item>general escape character with several uses</item>

901

<tag>^</tag> <item>assert start of string (or line, in multiline mode)</item>

902

<tag>$</tag> <item>assert end of string (or line, in multiline mode)</item>

903

<tag>.</tag> <item>match any character except newline (by default)</item>

918

a character class the only metacharacters are:

919

920

921

<tag>\\</tag> <item>general escape character</item>

921

<tag>\</tag> <item>general escape character</item>

922

<tag>^</tag> <item>negate the class, but only if the first character</item>

923

<tag>-</tag> <item>indicates character range</item>

924

<tag>[</tag> <item>POSIX character class (only if followed by POSIX

939

may have. This use of backslash as an escape character applies both inside and

940

outside character classes.

941

942

For example, if you want to match a * character, you write \\* in the pattern.

942

For example, if you want to match a * character, you write \* in the pattern.

943

This escaping action applies whether or not the following character would

944

otherwise be interpreted as a metacharacter, so it is always safe to precede a

945

non-alphanumeric with backslash to specify that it stands for itself. In

946

particular, if you want to match a backslash, you write \\\\.

946

particular, if you want to match a backslash, you write \\.

947

948

If a pattern is compiled with the <c>extended</c> option, whitespace in the

949

pattern (other than in a character class) and characters between a # outside

951

be used to include a whitespace or # character as part of the pattern.

952

953

If you want to remove the special meaning from a sequence of characters, you

954

can do so by putting them between \\Q and \\E. This is different from Perl in

955

that $ and @ are handled as literals in \\Q...\\E sequences in PCRE, whereas in

954

can do so by putting them between \Q and \E. This is different from Perl in

955

that $ and @ are handled as literals in \Q...\E sequences in PCRE, whereas in

956

Perl, $ and @ cause variable interpolation. Note the following examples:

957

958

  Pattern            PCRE matches   Perl matches

959

960

  \\Qabc$xyz\\E          abc$xyz       abc followed by the contents of $xyz

961

  \\Qabc\\$xyz\\E        abc\\$xyz     abc\\$xyz

962

  \\Qabc\\E\\$\\Qxyz\\E  abc$xyz       abc$xyz</code>

963

964

965

The \\Q...\\E sequence is recognized both inside and outside character classes.

960

  \Qabc$xyz\E        abc$xyz        abc followed by the contents of $xyz

961

  \Qabc\$xyz\E       abc\$xyz       abc\$xyz

962

  \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz</code>

963

964

965

The \Q...\E sequence is recognized both inside and outside character classes.
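As a small sketch of the quoting behaviour, the following compares a pattern wrapped in \Q...\E with the same text used unquoted. The module name `quote_demo` and the sample subject are assumptions made for illustration.

```erlang
-module(quote_demo).
-export([quoted/0, unquoted/0]).

%% Inside \Q...\E the dollar sign is an ordinary character, so the
%% pattern matches the subject literally.
quoted() ->
    re:run("abc$xyz", "\\Qabc$xyz\\E", [{capture, all, list}]).

%% Without quoting, "$" asserts end of subject, so nothing can follow it
%% and the pattern cannot match this subject.
unquoted() ->
    re:run("abc$xyz", "abc$xyz", [{capture, none}]).
```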

966

967

968

Non-printing characters

975

represents:

976

977

978

<tag>\\a</tag> <item>alarm, that is, the BEL character (hex 07)</item>

979

<tag>\\cx</tag> <item>"control-x", where x is any character</item>

980

<tag>\\e </tag> <item>escape (hex 1B)</item>

981

<tag>\\f</tag> <item>formfeed (hex 0C)</item>

982

<tag>\\n</tag> <item>linefeed (hex 0A)</item>

983

<tag>\\r</tag> <item>carriage return (hex 0D)</item>

984

985

<tag>\\ddd</tag> <item>character with octal code ddd, or backreference</item>

986

<tag>\\xhh </tag> <item>character with hex code hh</item>

987

<tag>\\x{hhh..}</tag> <item>character with hex code hhh..</item>

978

<tag>\a</tag> <item>alarm, that is, the BEL character (hex 07)</item>

979

<tag>\cx</tag> <item>"control-x", where x is any character</item>

980

<tag>\e </tag> <item>escape (hex 1B)</item>

981

<tag>\f</tag> <item>formfeed (hex 0C)</item>

982

<tag>\n</tag> <item>linefeed (hex 0A)</item>

983

<tag>\r</tag> <item>carriage return (hex 0D)</item>

984

985

<tag>\ddd</tag> <item>character with octal code ddd, or backreference</item>

986

<tag>\xhh </tag> <item>character with hex code hh</item>

987

<tag>\x{hhh..}</tag> <item>character with hex code hhh..</item>

988

</taglist>

989

990

The precise effect of \\cx is as follows: if x is a lower case letter, it

990

The precise effect of \cx is as follows: if x is a lower case letter, it

991

is converted to upper case. Then bit 6 of the character (hex 40) is inverted.

992

Thus \\cz becomes hex 1A, but \\c{ becomes hex 3B, while \\c; becomes hex

992

Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c; becomes hex

993

7B.

994

995

After \\x, from zero to two hexadecimal digits are read (letters can be in

996

upper or lower case). Any number of hexadecimal digits may appear between \\x{

995

After \x, from zero to two hexadecimal digits are read (letters can be in

996

upper or lower case). Any number of hexadecimal digits may appear between \x{

997

and }, but the value of the character code must be less than 256 in non-UTF-8

998

mode, and less than 2**31 in UTF-8 mode. That is, the maximum value in

999

hexadecimal is 7FFFFFFF. Note that this is bigger than the largest Unicode code

1000

point, which is 10FFFF.

1001

1002

If characters other than hexadecimal digits appear between \\x{ and }, or if

1002

If characters other than hexadecimal digits appear between \x{ and }, or if

1003

there is no terminating }, this form of escape is not recognized. Instead, the

1004

initial \\x will be interpreted as a basic hexadecimal escape, with no

1004

initial \x will be interpreted as a basic hexadecimal escape, with no

1005

following digits, giving a character whose value is zero.

1006

1007

Characters whose value is less than 256 can be defined by either of the two

1008

syntaxes for \\x. There is no difference in the way they are handled. For

1009

example, \\xdc is exactly the same as \\x{dc}.

1008

syntaxes for \x. There is no difference in the way they are handled. For

1009

example, \xdc is exactly the same as \x{dc}.

1010

1011

After \\0 up to two further octal digits are read. If there are fewer than two

1012

digits, just those that are present are used. Thus the sequence \\0\\x\\07

1011

After \0 up to two further octal digits are read. If there are fewer than two

1012

digits, just those that are present are used. Thus the sequence \0\x\07

1013

specifies two binary zeros followed by a BEL character (code value 7). Make

1014

sure you supply two digits after the initial zero if the pattern character that

1015

follows is itself an octal digit.

1027

digits following the backslash, and uses them to generate a data character. Any

1028

subsequent digits stand for themselves.

1029

The value of a

1030

character specified in octal must be less than \\400.

1030

character specified in octal must be less than \400.

1031

In non-UTF-8 mode, the value of a

1032

character specified in octal must be less than \\400. In UTF-8 mode, values up

1033

to \\777 are permitted.

1032

character specified in octal must be less than \400. In UTF-8 mode, values up

1033

to \777 are permitted.

1034

1035

For example:

1036

1037

1038

<tag>\\040</tag> <item>is another way of writing a space</item>

1038

<tag>\040</tag> <item>is another way of writing a space</item>

1039

1040

<tag>\\40</tag> <item>is the same, provided there are fewer than 40

1040

<tag>\40</tag> <item>is the same, provided there are fewer than 40

1041

previous capturing subpatterns</item>

1042

<tag>\\7</tag> <item>is always a back reference</item>

1042

<tag>\7</tag> <item>is always a back reference</item>

1043

1044

<tag>\\11</tag> <item> might be a back reference, or another way of

1044

<tag>\11</tag> <item> might be a back reference, or another way of

1045

writing a tab</item>

1046

<tag>\\011</tag> <item>is always a tab</item>

1047

<tag>\\0113</tag> <item>is a tab followed by the character "3"</item>

1046

<tag>\011</tag> <item>is always a tab</item>

1047

<tag>\0113</tag> <item>is a tab followed by the character "3"</item>

1048

1049

<tag>\\113</tag> <item>might be a back reference, otherwise the

1049

<tag>\113</tag> <item>might be a back reference, otherwise the

1050

character with octal code 113</item>

1051

1052

<tag>\\377</tag> <item>might be a back reference, otherwise

1052

<tag>\377</tag> <item>might be a back reference, otherwise

1053

the byte consisting entirely of 1 bits</item>

1054

1055

<tag>\\81</tag> <item>is either a back reference, or a binary zero

1055

<tag>\81</tag> <item>is either a back reference, or a binary zero

1056

followed by the two characters "8" and "1"</item>

1057

</taglist>
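Two of the octal forms listed above can be checked with a short sketch. The module name `octal_demo` and the subjects are illustrative assumptions; the patterns are anchored with ^ and $ so that the whole subject has to be consumed.

```erlang
-module(octal_demo).
-export([space/0, tab_then_three/0]).

%% \040 is another way of writing a space.
space() ->
    re:run(" ", "^\\040$", [{capture, none}]).

%% \0113 is read as \011 (a tab) followed by the literal character "3".
tab_then_three() ->
    re:run("\t3", "^\\0113$", [{capture, none}]).
```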

1058

1062

1063

All the sequences that define a single character value can be used

1064

both inside and outside character classes. In addition, inside a

1065

character class, the sequence \\b is interpreted as the backspace

1066

character (hex 08), and the sequences \\R and \\X are interpreted as

1065

character class, the sequence \b is interpreted as the backspace

1066

character (hex 08), and the sequences \R and \X are interpreted as

1067

the characters "R" and "X", respectively. Outside a character class,

1068

these sequences have different meanings (see below).

1069

1070

Absolute and relative back references

1071

1072

The sequence \\g followed by an unsigned or a negative number,

1072

The sequence \g followed by an unsigned or a negative number,

1073

optionally enclosed in braces, is an absolute or relative back

1074

reference. A named back reference can be coded as \\g{name}. Back

1074

reference. A named back reference can be coded as \g{name}. Back

1075

references are discussed later, following the discussion of

1076

parenthesized subpatterns.

1077

1081

following are always recognized:

1082

1083

1084

<tag>\\d</tag> <item>any decimal digit</item>

1085

<tag>\\D</tag> <item>any character that is not a decimal digit</item>

1086

<tag>\\h</tag> <item>any horizontal whitespace character</item>

1087

<tag>\\H</tag> <item>any character that is not a horizontal whitespace character</item>

1088

<tag>\\s</tag> <item>any whitespace character</item>

1089

<tag>\\S</tag> <item>any character that is not a whitespace character</item>

1090

<tag>\\v</tag> <item>any vertical whitespace character</item>

1091

<tag>\\V</tag> <item>any character that is not a vertical whitespace character</item>

1092

<tag>\\w</tag> <item>any "word" character</item>

1093

<tag>\\W</tag> <item>any "non-word" character</item>

1084

<tag>\d</tag> <item>any decimal digit</item>

1085

<tag>\D</tag> <item>any character that is not a decimal digit</item>

1086

<tag>\h</tag> <item>any horizontal whitespace character</item>

1087

<tag>\H</tag> <item>any character that is not a horizontal whitespace character</item>

1088

<tag>\s</tag> <item>any whitespace character</item>

1089

<tag>\S</tag> <item>any character that is not a whitespace character</item>

1090

<tag>\v</tag> <item>any vertical whitespace character</item>

1091

<tag>\V</tag> <item>any character that is not a vertical whitespace character</item>

1092

<tag>\w</tag> <item>any "word" character</item>

1093

<tag>\W</tag> <item>any "non-word" character</item>

1094

</taglist>

1095

1096

Each pair of escape sequences partitions the complete set of characters into

1101

matching point is at the end of the subject string, all of them fail, since

1102

there is no character to match.

1103

1104

For compatibility with Perl, \\s does not match the VT character (code 11).

1105

This makes it different from the POSIX "space" class. The \\s characters

1104

For compatibility with Perl, \s does not match the VT character (code 11).

1105

This makes it different from the POSIX "space" class. The \s characters

1106

are HT (9), LF (10), FF (12), CR (13), and space (32). If "use locale;" is

1107

included in a Perl script, \\s may match the VT character. In PCRE, it never

1107

included in a Perl script, \s may match the VT character. In PCRE, it never

1108

does.

1109

1110

In UTF-8 mode, characters with values greater than 128 never match \\d, \\s, or

1111

\\w, and always match \\D, \\S, and \\W. This is true even when Unicode

1110

In UTF-8 mode, characters with values greater than 128 never match \d, \s, or

1111

\w, and always match \D, \S, and \W. This is true even when Unicode

1112

character property support is available. These sequences retain their original

1113

meanings from before UTF-8 support was available, mainly for efficiency

1114

reasons.

1115

1116

The sequences \\h, \\H, \\v, and \\V are Perl 5.10 features. In contrast to the

1116

The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to the

1117

other sequences, these do match certain high-valued codepoints in UTF-8 mode.

1118

The horizontal space characters are:

1119

1157

1158

Newline sequences

1159

1160

Outside a character class, by default, the escape sequence \\R matches any

1161

Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \\R is

1160

Outside a character class, by default, the escape sequence \R matches any

1161

Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \R is

1162

equivalent to the following:

1163

1164

1164

1165

1166

This is an example of an "atomic group", details of which are given below.

1167

1177

recognized.

1178

1179

1180

It is possible to restrict \\R to match only CR, LF, or CRLF (instead of the

1180

It is possible to restrict \R to match only CR, LF, or CRLF (instead of the

1181

complete set of Unicode line endings) by setting the option <c>bsr_anycrlf</c>

1182

either at compile time or when the pattern is matched. (BSR is an abbreviation

1183

for "backslash R".) This can be made the default when PCRE is built; if this is

1197

1198

(*ANY)(*BSR_ANYCRLF)

1199

1200

Inside a character class, \\R matches the letter "R".

1200

Inside a character class, \R matches the letter "R".
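The effect of restricting \R can be sketched like this, assuming the default setting is `bsr_unicode` as stated in the option list. The module name `bsr_demo` and the subjects are illustrative; a vertical tab (hex 0B) is used because it is part of the full newline set but is not CR, LF or CRLF.

```erlang
-module(bsr_demo).
-export([default_vt/0, anycrlf_vt/0, anycrlf_crlf/0]).

%% By default \R accepts a vertical tab as a newline sequence.
default_vt() ->
    re:run("a\vb", "a\\Rb", [{capture, none}]).

%% With bsr_anycrlf only CR, LF and CRLF qualify, so this fails.
anycrlf_vt() ->
    re:run("a\vb", "a\\Rb", [bsr_anycrlf, {capture, none}]).

%% CRLF is still matched as a single newline sequence.
anycrlf_crlf() ->
    re:run("a\r\nb", "a\\Rb", [bsr_anycrlf, {capture, none}]).
```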

1201

1202

1203

Unicode character properties

1208

characters whose codepoints are less than 256, but they do work in this mode.

1209

The extra escape sequences are:

1210

1211

\\p{xx} a character with the xx property

1212

\\P{xx} a character without the xx property

1213

\\X an extended Unicode sequence

1211

\p{xx} a character with the xx property

1212

\P{xx} a character without the xx property

1213

\X an extended Unicode sequence

1214

1215

The property names represented by xx above are limited to the Unicode

1216

script names, the general category properties, and "Any", which matches any

1217

character (including newline). Other properties such as "InMusicalSymbols" are

1218

not currently supported by PCRE. Note that \\P{Any} does not match any

1218

not currently supported by PCRE. Note that \P{Any} does not match any

1219

characters, so always causes a match failure.

1220

1221

Sets of Unicode characters are defined as belonging to certain scripts. A

1222

character from one of these sets can be matched using a script name. For

1223

example:

1224

1225

\\p{Greek}

1226

\\P{Han}

1225

\p{Greek}

1226

\P{Han}

1227

1228

Those that are not part of an identified script are lumped together as

1229

"Common". The current list of scripts is:

1300

Each character has exactly one general category property, specified by a

1301

two-letter abbreviation. For compatibility with Perl, negation can be specified

1302

by including a circumflex between the opening brace and the property name. For

1303

example, \\p{^Lu} is the same as \\P{Lu}.

1303

example, \p{^Lu} is the same as \P{Lu}.

1304

1305

If only one letter is specified with \\p or \\P, it includes all the general

1305

If only one letter is specified with \p or \P, it includes all the general

1306

category properties that start with that letter. In this case, in the absence

1307

of negation, the curly brackets in the escape sequence are optional; these two

1308

examples have the same effect:

1309

1310

1311

1310

1311

1312

1313

The following general category property codes are supported:

1314

1382

pcreapi

1383

page).

1384

1385

The long synonyms for these properties that Perl supports (such as \\p{Letter})

1385

The long synonyms for these properties that Perl supports (such as \p{Letter})

1386

are not supported by PCRE, nor is it permitted to prefix any of these

1387

properties with "Is".

1388

1391

Unicode table.

1392

1393

Specifying caseless matching does not affect these escape sequences. For

1394

example, \\p{Lu} always matches only upper case letters.

1395

1396

The \\X escape matches any number of Unicode characters that form an extended

1397

Unicode sequence. \\X is equivalent to

1398

1399

1394

example, \p{Lu} always matches only upper case letters.

1395

1396

The \X escape matches any number of Unicode characters that form an extended

1397

Unicode sequence. \X is equivalent to

1398

1399

1400

1401

That is, it matches a character without the "mark" property, followed by zero

1402

or more characters with the "mark" property, and treats the sequence as an

1404

(see below).

1405

Characters with the "mark" property are typically accents that affect the

1406

preceding character. None of them have codepoints less than 256, so in

1407

non-UTF-8 mode \\X matches any one character.

1407

non-UTF-8 mode \X matches any one character.

1408

1409

Matching characters by Unicode property is not fast, because PCRE has to search

1410

a structure that contains data for over fifteen thousand characters. That is

1411

why the traditional escape sequences such as \\d and \\w do not use Unicode

1411

why the traditional escape sequences such as \d and \w do not use Unicode

1412

properties in PCRE.

1413

1414

Resetting the match start

1415

1416

The escape sequence \\K, which is a Perl 5.10 feature, causes any previously

1416

The escape sequence \K, which is a Perl 5.10 feature, causes any previously

1417

matched characters not to be included in the final matched sequence. For

1418

example, the pattern:

1419

1420

1420

1421

1422

matches "foobar", but reports that it has matched "bar". This feature is

1423

similar to a lookbehind assertion

1426

(described below).

1427

1428

However, in this case, the part of the subject before the real match does not

1429

have to be of fixed length, as lookbehind assertions do. The use of \\K does

1429

have to be of fixed length, as lookbehind assertions do. The use of \K does

1430

not interfere with the setting of

1431

captured substrings.

1432

For example, when the pattern

1433

1434

1434

1435

1436

matches "foobar", the first substring is still set to "foo".
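A hedged illustration of \K, using patterns chosen to fit the "foobar" description above (the patterns `foo\Kbar` and `(foo)\Kbar`, and the module name `reset_demo`, are assumptions made for this sketch):

```erlang
-module(reset_demo).
-export([plain/0, with_group/0]).

%% "foo" must be present, but only "bar" is reported as the match.
plain() ->
    re:run("foobar", "foo\\Kbar", [{capture, all, list}]).

%% The capturing group is unaffected by \K: it still holds "foo".
with_group() ->
    re:run("foobar", "(foo)\\Kbar", [{capture, all, list}]).
```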

1437

1444

described below. The backslashed assertions are:

1445

1446

1447

<tag>\\b</tag> <item>matches at a word boundary</item>

1448

<tag>\\B</tag> <item>matches when not at a word boundary</item>

1449

<tag>\\A</tag> <item>matches at the start of the subject</item>

1450

<tag>\\Z</tag> <item>matches at the end of the subject

1447

<tag>\b</tag> <item>matches at a word boundary</item>

1448

<tag>\B</tag> <item>matches when not at a word boundary</item>

1449

<tag>\A</tag> <item>matches at the start of the subject</item>

1450

<tag>\Z</tag> <item>matches at the end of the subject

1451

and also before a newline at the end of

1452

the subject</item>

1453

<tag>\\z</tag> <item>matches only at the end of the subject</item>

1454

<tag>\\G</tag> <item>matches at the first matching position in the

1453

<tag>\z</tag> <item>matches only at the end of the subject</item>

1454

<tag>\G</tag> <item>matches at the first matching position in the

1455

subject</item>

1456

</taglist>

1457

1458

These assertions may not appear in character classes (but note that \\b has a

1458

These assertions may not appear in character classes (but note that \b has a

1459

different meaning, namely the backspace character, inside a character class).

1460

1461

A word boundary is a position in the subject string where the current character

1462

and the previous character do not both match \\w or \\W (i.e. one matches

1463

\\w and the other matches \\W), or the start or end of the string if the

1464

first or last character matches \\w, respectively.

1462

and the previous character do not both match \w or \W (i.e. one matches

1463

\w and the other matches \W), or the start or end of the string if the

1464

first or last character matches \w, respectively.

1465

1466

The \\A, \\Z, and \\z assertions differ from the traditional circumflex and

1466

The \A, \Z, and \z assertions differ from the traditional circumflex and

1467

dollar (described in the next section) in that they only ever match at the very

1468

start and end of the subject string, whatever options are set. Thus, they are

1469

independent of multiline mode. These three assertions are not affected by the

1470

<c>notbol</c> or <c>noteol</c> options, which affect only the behaviour of the

1471

circumflex and dollar metacharacters. However, if the startoffset

1472

argument of <c>re:run/3</c> is non-zero, indicating that matching is to start

1473

at a point other than the beginning of the subject, \\A can never match. The

1474

difference between \\Z and \\z is that \\Z matches before a newline at the end

1475

of the string as well as at the very end, whereas \\z matches only at the end.

1473

at a point other than the beginning of the subject, \A can never match. The

1474

difference between \Z and \z is that \Z matches before a newline at the end

1475

of the string as well as at the very end, whereas \z matches only at the end.
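The difference between \Z and \z can be seen on a subject that ends with a newline. This is a minimal sketch; the module name `end_demo` and the subjects are illustrative.

```erlang
-module(end_demo).
-export([upper_z/0, lower_z/0, lower_z_no_newline/0]).

%% \Z also matches just before a newline at the end of the subject.
upper_z() ->
    re:run("abc\n", "abc\\Z", [{capture, none}]).

%% \z matches only at the very end, so the trailing newline prevents it.
lower_z() ->
    re:run("abc\n", "abc\\z", [{capture, none}]).

%% Without the trailing newline \z matches as well.
lower_z_no_newline() ->
    re:run("abc", "abc\\z", [{capture, none}]).
```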

1476

1477

The \\G assertion is true only when the current matching position is at the

1477

The \G assertion is true only when the current matching position is at the

1478

start point of the match, as specified by the startoffset argument of

1479

<c>re:run/3</c>. It differs from \\A when the value of startoffset is

1479

<c>re:run/3</c>. It differs from \A when the value of startoffset is

1480

non-zero. By calling <c>re:run/3</c> multiple times with appropriate

1481

arguments, you can mimic Perl's /g option, and it is in this kind of

1482

implementation where \\G can be useful.

1482

implementation where \G can be useful.
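The following sketch contrasts \A and \G when matching starts at a non-zero offset, here given through the `{offset, N}` option of `re:run/3`. The module name `offset_demo` and the subject are assumptions for illustration.

```erlang
-module(offset_demo).
-export([a_with_offset/0, g_with_offset/0]).

%% \A refers to the true start of the subject, so it cannot match when
%% matching begins at offset 1.
a_with_offset() ->
    re:run("xabc", "\\Aabc", [{offset, 1}, {capture, none}]).

%% \G is true at the position where matching starts, so it succeeds.
g_with_offset() ->
    re:run("xabc", "\\Gabc", [{offset, 1}, {capture, all, index}]).
```

A loop that feeds the end of each match back in as the next offset is one way to approximate Perl's /g behaviour with a \G-anchored pattern.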

1483

1484

Note, however, that PCRE's interpretation of \\G, as the start of the current

1484

Note, however, that PCRE's interpretation of \G, as the start of the current

1485

match, is subtly different from Perl's, which defines it as the end of the

1486

previous match. In Perl, these can be different when the previously matched

1487

string was empty. Because PCRE does just one match at a time, it cannot

1488

reproduce this behaviour.

1489

1490

If all the alternatives of a pattern begin with \\G, the expression is anchored

1490

If all the alternatives of a pattern begin with \G, the expression is anchored

1491

to the starting match position, and the "anchored" flag is set in the compiled

1492

regular expression.

1493

1519

1520

The meaning of dollar can be changed so that it matches only at the

1521

very end of the string, by setting the <c>dollar_endonly</c> option at

1522

compile time. This does not affect the \\Z assertion.

1522

compile time. This does not affect the \Z assertion.

1523

1524

The meanings of the circumflex and dollar characters are changed if the

1525

<c>multiline</c> option is set. When this is the case, a circumflex matches

1530

sequence CRLF, isolated CR and LF characters do not indicate newlines.

1531

1532

For example, the pattern /^abc$/ matches the subject string

1533

"def\\nabc" (where \\n represents a newline) in multiline mode, but

1533

"def\nabc" (where \n represents a newline) in multiline mode, but

1534

not otherwise. Consequently, patterns that are anchored in single line

1535

mode because all branches start with ^ are not anchored in multiline

1536

mode, and a match for circumflex is possible when the

1537

startoffset argument of <c>re:run/3</c> is non-zero. The

1538

<c>dollar_endonly</c> option is ignored if <c>multiline</c> is set.
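The /^abc$/ example above can be reproduced as a short sketch, assuming the default newline convention (LF). The module name `multiline_demo` is illustrative.

```erlang
-module(multiline_demo).
-export([multi/0, single/0]).

%% In multiline mode ^ and $ also match around internal newlines.
multi() ->
    re:run("def\nabc", "^abc$", [multiline, {capture, none}]).

%% Otherwise ^ matches only at the start of the subject.
single() ->
    re:run("def\nabc", "^abc$", [{capture, none}]).
```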

1539

1540

Note that the sequences \\A, \\Z, and \\z can be used to match the start and

1540

Note that the sequences \A, \Z, and \z can be used to match the start and

1541

end of the subject in both modes, and if all branches of a pattern start with

1542

\\A it is always anchored, whether or not <c>multiline</c> is set.

1542

\A it is always anchored, whether or not <c>multiline</c> is set.

1543

1544

1545

</section>

1574

1575

<section><marker id="sect6"></marker><title>Matching a single byte</title>

1576

1577

Outside a character class, the escape sequence \\C matches any one byte, both

1577

Outside a character class, the escape sequence \C matches any one byte, both

1578

in and out of UTF-8 mode. Unlike a dot, it always matches any line-ending

1579

characters. The feature is provided in Perl in order to match individual bytes

1580

in UTF-8 mode. Because it breaks up UTF-8 characters into individual bytes,

1581

what remains in the string may be a malformed UTF-8 string. For this reason,

1582

the \\C escape sequence is best avoided.

1582

the \C escape sequence is best avoided.

1583

1584

PCRE does not allow \\C to appear in lookbehind assertions (described below),

1584

PCRE does not allow \C to appear in lookbehind assertions (described below),

1585

because in UTF-8 mode this would make it impossible to calculate the length of

1586

the lookbehind.

1587

string.

In UTF-8 mode, characters with values greater than 255 can be included in a
class as a literal string of bytes, or by using the \x{ escaping mechanism.

When caseless matching is set, any letters in a class represent both their
upper case and lower case versions, so for example, a caseless [aeiou] matches

class of two characters ("W" and "-") followed by a literal string
"46]", so it would match "W46]" or "-46]". However, if the "]" is
escaped with a backslash it is interpreted as the end of range, so
[W-\]46] is interpreted as a class containing a range followed by two
other characters. The octal or hexadecimal representation of "]" can
also be used to end a range.

Ranges operate in the collating sequence of character values. They can also be
used for characters specified numerically, for example [\000-\037]. In UTF-8
mode, ranges can include characters whose values are greater than 255, for
example [\x{100}-\x{2ff}].

If a range that includes letters is used when caseless matching is set, it
matches the letters in either case. For example, [W-c] is equivalent to
[][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if character
tables for a French locale are in use, [\xc8-\xcb] matches accented E
characters in both cases. In UTF-8 mode, PCRE supports the concept of case for
characters with values greater than 128 only when it is compiled with Unicode
property support.

The character types \d, \D, \p, \P, \s, \S, \w, and \W may
also appear in a character class, and add the characters that they
match to the class. For example, [\dABCDEF] matches any hexadecimal
digit. A circumflex can conveniently be used with the upper case
character types to specify a more restricted set of characters than
the matching lower case type. For example, the class [^\W_] matches
any letter or digit, but not underscore.
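
The two class examples above can be tried out as follows. This is a small sketch under stated assumptions: the module name <c>class_demo</c>, the function names and the subject strings are invented for illustration, and the backslashes are doubled only because the patterns are written as Erlang string literals.

```erlang
-module(class_demo).
-export([hex_digits/0, alnum_only/0]).

%% Illustrative module; names and subjects are assumptions for this example.

%% [\dABCDEF]+ picks up a run of digits and upper case letters A to F.
hex_digits() ->
    re:run("x=7F;", "[\\dABCDEF]+", [{capture, first, list}]).

%% [^\W_]+ matches letters and digits but stops at an underscore.
alnum_only() ->
    re:run("__a1_b", "[^\\W_]+", [{capture, first, list}]).
```

With these subjects, <c>hex_digits()</c> returns <c>{match,["7F"]}</c> and <c>alnum_only()</c> returns <c>{match,["a1"]}</c>.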


The only metacharacters that are recognized in character classes

<tag>ascii</tag> <item>character codes 0 - 127</item>
<tag>blank</tag> <item>space or tab only</item>
<tag>cntrl</tag> <item>control characters</item>
<tag>digit</tag> <item>decimal digits (same as \d)</item>
<tag>graph</tag> <item>printing characters, excluding space</item>
<tag>lower</tag> <item>lower case letters</item>
<tag>print</tag> <item>printing characters, including space</item>
<tag>punct</tag> <item>printing characters, excluding letters and digits</item>
<tag>space</tag> <item>whitespace (not quite the same as \s)</item>
<tag>upper</tag> <item>upper case letters</item>
<tag>word</tag> <item>"word" characters (same as \w)</item>
<tag>xdigit</tag> <item>hexadecimal digits</item>
</taglist>

The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
space (32). Notice that this list includes the VT character (code 11). This
makes "space" different to \s, which does not include VT (for Perl
compatibility).

The name "word" is a Perl extension, and "blank" is a GNU extension

abbreviation or as the full name, and in both cases you want to extract the
abbreviation. This pattern (ignoring the line breaks) does the job:

(?<DN>Mon|Fri|Sun)(?:day)?|
(?<DN>Tue)(?:sday)?|
(?<DN>Wed)(?:nesday)?|

<list>
<item>a literal data character</item>
<item>the dot metacharacter</item>
<item>the \C escape sequence</item>
<item>the \X escape sequence
(in UTF-8 mode with Unicode properties)
</item>
<item>the \R escape sequence</item>
<item>an escape such as \d that matches a single character</item>
<item>a character class</item>
<item>a back reference (see next section)</item>
<item>a parenthesized subpattern (unless it is an assertion)</item>

matches at least 3 successive vowels, but may match many more, while

matches exactly 8 digits. An opening curly bracket that appears in a position
where a quantifier is not allowed, or one that does not match the syntax of a

quantifier, but a literal string of four characters.

In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to individual
bytes. Thus, for example, \x{100}{2} matches two UTF-8 characters, each of
which is represented by a two-byte sequence. Similarly, when Unicode property
support is available, \X{3} matches three Unicode extended sequences, each of
which may be several bytes long (and they may be of different lengths).
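
As a rough illustration of a quantifier applying to characters rather than bytes, the sketch below runs \x{100}{2} with the <c>unicode</c> option. The module name <c>utf8_quantifier_demo</c>, the function name and the subject are assumptions made for this example.

```erlang
-module(utf8_quantifier_demo).
-export([two_chars/0]).

%% Illustrative module; names and subject are assumptions for this example.

%% With the unicode option, {2} repeats the character U+0100, not a byte.
%% The returned index pair counts bytes of the UTF-8 encoded subject.
two_chars() ->
    Subject = [16#100, 16#100, $x],
    re:run(Subject, "\\x{100}{2}", [unicode]).
```

The expected result is <c>{match,[{0,4}]}</c>: two characters were matched, but the length is reported as four bytes because each U+0100 occupies two bytes in UTF-8.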


The quantifier {0} is permitted, causing the expression to behave as if the

and within the comment, individual * and / characters may appear. An attempt to
match C comments by applying the pattern

to the string

greedy, and instead matches the minimum number of times possible, so the
pattern

does the right thing with the C comments. The meaning of the various
quantifiers is not otherwise changed, just the preferred number of matches.
Do not confuse this use of question mark with its use as a quantifier in its
own right. Because it has two uses, it can sometimes appear doubled, as in

which matches one digit by preference, but can match two if that is the only
way the rest of the pattern matches.
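
The greedy and non-greedy behaviour can be compared with a short sketch. The patterns and the subject string used here are assumptions chosen to fit the description (a C comment pattern with .* and with .*?, and a doubled question mark after \d), as are the module name <c>lazy_demo</c> and its function names.

```erlang
-module(lazy_demo).
-export([greedy/0, lazy/0, doubled/0]).

%% Illustrative module; patterns, names and subject are assumptions.

subject() ->
    "/* first comment */  not comment  /* second comment */".

%% Greedy .* runs to the last "*/" in the subject.
greedy() ->
    re:run(subject(), "/\\*.*\\*/", [{capture, first, list}]).

%% Non-greedy .*? stops at the first "*/".
lazy() ->
    re:run(subject(), "/\\*.*?\\*/", [{capture, first, list}]).

%% \d??\d prefers one digit, but takes two when the rest of the
%% pattern (here a literal "x") cannot match otherwise.
doubled() ->
    {re:run("12x", "\\d??\\d", [{capture, first, list}]),
     re:run("12x", "\\d??\\dx", [{capture, first, list}])}.
```

In this sketch <c>greedy()</c> returns the whole subject, <c>lazy()</c> returns only <c>"/* first comment */"</c>, and <c>doubled()</c> returns <c>{{match,["1"]}, {match,["12x"]}}</c>.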

implicitly anchored, because whatever follows will be tried against every
character position in the subject string, so there is no point in retrying the
overall match at any position after the first. PCRE normally treats such a
pattern as though it were preceded by \A.

In cases where it is known that the subject string contains no newlines, it is
worth setting <c>dotall</c> in order to obtain this optimization, or

elsewhere in the pattern, a match at the start may fail where a later one
succeeds. Consider, for example:

If the subject is "xyz123abc123" the match point is the fourth character. For
this reason, such a pattern is not implicitly anchored.

When a capturing subpattern is repeated, the value captured is the substring
that matched the final iteration. For example, after

<quote> (tweedle[dume]{3}\s*)+</quote>

has matched "tweedledum tweedledee" the value of the captured substring is
"tweedledee". However, if there are nested capturing subpatterns, the

nature of the match, or to cause it to fail earlier than it otherwise might, when
the author of the pattern knows there is no point in carrying on.

Consider, for example, the pattern \d+foo when applied to the subject line

After matching all 6 digits and then failing to match "foo", the normal
action of the matcher is to try again with only 5 digits matching the \d+
item, and then with 4, and so on, before ultimately failing. "Atomic grouping"
(a term taken from Jeffrey Friedl's book) provides the means for specifying
that once a subpattern has matched, it is not to be re-evaluated in this way.

immediately on failing to match "foo" the first time. The notation is a kind of
special parenthesis, starting with (?> as in this example:

This kind of parenthesis "locks up" the part of the pattern it contains once
it has matched, and a failure further into the pattern is prevented from

Atomic grouping subpatterns are not capturing subpatterns. Simple cases such as
the above example can be thought of as a maximizing repeat that must swallow
everything it can. So, while both \d+ and \d+? are prepared to adjust the
number of digits they match in order to make the rest of the pattern match,
(?>\d+) can only match an entire sequence of digits.
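
The effect of an atomic group is easiest to see with a pattern that only succeeds if the repeat gives back a digit. The following is a sketch, not a definitive example; the module name <c>atomic_demo</c>, the function names, the subject and the trailing literal "6" are assumptions chosen for illustration.

```erlang
-module(atomic_demo).
-export([backtracking/0, atomic/0]).

%% Illustrative module; names, subject and patterns are assumptions.

%% \d+ first takes all six digits, then gives one back so that "6" can match.
backtracking() ->
    re:run("123456", "\\d+6").

%% (?>\d+) keeps every digit it has taken, so the trailing "6" can never match.
atomic() ->
    re:run("123456", "(?>\\d+)6").
```

Here <c>backtracking()</c> returns <c>{match,[{0,6}]}</c>, while <c>atomic()</c> returns <c>nomatch</c>.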

Atomic groups in general can of course contain arbitrarily complicated
subpatterns, and can be nested. However, when the subpattern for an atomic

additional + character following a quantifier. Using this notation, the
previous example can be rewritten as

Note that a possessive quantifier can be used with an entire group, for
example:

only way to avoid some failing matches taking a very long time indeed. The
pattern

matches an unlimited number of substrings that either consist of non-digits, or
digits enclosed in <>, followed by either ! or ?. When it matches, it runs

<quote> aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</quote>

it takes a long time before reporting failure. This is because the string can
be divided between the internal \D+ repeat and the external * repeat in a
large number of ways, and all have to be tried. (The example uses [!?] rather
than a single character at the end, because both PCRE and Perl have an
optimization that allows for fast failure when a single character is used. They

if it is not present in the string.) If the pattern is changed so that it uses
an atomic group, like this:

sequences of non-digits cannot be broken, and failure happens quickly.

It is not possible to have a numerical "forward back reference" to
a subpattern whose number is 10 or more using this syntax because a
sequence such as \50 is interpreted as a character defined in
octal. See the subsection entitled "Non-printing characters" above for
further details of the handling of digits following a backslash. There
is no such problem when named parentheses are used. A back reference
to any subpattern is possible using named parentheses (see below).

Another way of avoiding the ambiguity inherent in the use of digits
following a backslash is to use the \g escape sequence, which is a
feature introduced in Perl 5.10. This escape must be followed by an
unsigned number or a negative number, optionally enclosed in
braces. These examples are all identical:

<list>

</list>

An unsigned number specifies an absolute reference without the

literal digits follow the reference. A negative number is a relative
reference. Consider this example:

The sequence \g{-1} is a reference to the most recently started capturing
subpattern before \g, that is, it is equivalent to \2. Similarly, \g{-2}
would be equivalent to \1. The use of relative references can be helpful in
long patterns, and also in patterns that are created by joining together
fragments that contain references within themselves.
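
A relative reference and its absolute equivalent can be compared as follows. This is a sketch under assumptions: the pattern (abc(def)ghi) followed by a reference is chosen so that \g{-1} and \g{2} name the same group, and the module name <c>relative_ref_demo</c>, function names and subject are invented for the example.

```erlang
-module(relative_ref_demo).
-export([relative/0, absolute/0]).

%% Illustrative module; names, subject and pattern are assumptions.

%% \g{-1} refers to the most recently started group before it, here (def).
relative() ->
    re:run("abcdefghidef", "(abc(def)ghi)\\g{-1}", [{capture, all, list}]).

%% \g{2} names the same group by its absolute number.
absolute() ->
    re:run("abcdefghidef", "(abc(def)ghi)\\g{2}", [{capture, all, list}]).
```

Both functions return <c>{match,["abcdefghidef","abcdefghi","def"]}</c>, which shows that the two references are interchangeable in this pattern.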

matching the subpattern itself (see "Subpatterns as subroutines" below
for a way of doing that). So the pattern

<quote> (sens|respons)e and \1ibility</quote>

matches "sense and sensibility" and "response and responsibility", but not
"sense and responsibility". If caseful matching is in force at the time of the
back reference, the case of letters is relevant. For example,

matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
capturing subpattern is matched caselessly.
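
Both back reference behaviours described above can be exercised with a small sketch. The first pattern is the one quoted in the text; the second, ((?i)rah)\s+\1, is an assumption chosen to fit the description of a caselessly matched group followed by a caseful reference. The module name <c>backref_demo</c> and its function names are likewise illustrative.

```erlang
-module(backref_demo).
-export([same_word/1, same_case/1]).

%% Illustrative module; names and the second pattern are assumptions.

%% \1 must repeat the text that group 1 actually matched.
same_word(Subject) ->
    re:run(Subject, "(sens|respons)e and \\1ibility", [{capture, none}]).

%% The group matches caselessly, but the reference compares case exactly.
same_case(Subject) ->
    re:run(Subject, "((?i)rah)\\s+\\1", [{capture, none}]).
```

For instance, <c>same_word("sense and responsibility")</c> returns <c>nomatch</c>, and <c>same_case("RAH rah")</c> returns <c>nomatch</c> while <c>same_case("RAH RAH")</c> returns <c>match</c>.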

There are several different ways of writing back references to named
subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
\k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's unified
back reference syntax, in which \g can be used for both numeric and named
references, is also supported. We could rewrite the above example in any of
the following ways:

<list>

</list>

A subpattern that is referenced by name may appear in the pattern before or

subpattern has not actually been used in a particular match, any back
references to it always fail. For example, the pattern

always fails if it starts to match "a" rather than "bc". Because
there may be many capturing parentheses in a pattern, all digits

empty comment (see "Comments" below) can be used.

A back reference that occurs inside the parentheses to which it refers fails
when the subpattern is first used, so, for example, (a\1) never matches.
However, such references can be useful inside repeated subpatterns. For
example, the pattern

matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
the subpattern, the back reference matches the character string corresponding

An assertion is a test on the characters following or preceding the current
matching point that does not actually consume any characters. The simple
assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described
above.

Lookahead assertions start with (?= for positive assertions and (?! for
negative assertions. For example,

matches a word followed by a semicolon, but does not include the semicolon in
the match, and

In some cases, the Perl 5.10 escape sequence \K (see above) can be
used instead of a lookbehind assertion; this is not restricted to a
fixed length.

match. If there are insufficient characters before the current position, the
assertion fails.

PCRE does not allow the \C escape (which matches a single byte in UTF-8 mode)
to appear in lookbehind assertions, because it makes it impossible to calculate
the length of the lookbehind. The \X and \R escapes, which can match
different numbers of bytes, are also not permitted.

Possessive quantifiers can be used in conjunction with lookbehind assertions to

Several assertions (of any sort) may occur in succession. For example,

matches "foo" preceded by three digits that are not "999". Notice
that each of the assertions is applied independently at the same point

the last three of which are not "999". For example, it doesn't match
"123abcfoo". A pattern to do that is

This time the first assertion looks at the preceding six
characters, checking that the first three are digits, and then the

matches an occurrence of "baz" that is preceded by "bar" which in
turn is not preceded by "foo", while

is another pattern that matches "foo" preceded by three digits and any three
characters that are not "999".
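
Successive lookbehind assertions can be tried out with the sketch below. The two patterns are assumptions reconstructed from the descriptions in this section (three digits that are not "999", and six preceding characters of which the first three are digits and the last three are not "999"); the module name <c>assertion_demo</c> and its function names are also illustrative.

```erlang
-module(assertion_demo).
-export([not_999/1, first_three_of_six/1]).

%% Illustrative module; names and patterns are assumptions.

%% Both assertions inspect the same three characters before "foo".
not_999(Subject) ->
    re:run(Subject, "(?<=\\d{3})(?<!999)foo", [{capture, none}]).

%% The first assertion looks back six characters: three digits, then any three.
%% The second still rejects "999" immediately before "foo".
first_three_of_six(Subject) ->
    re:run(Subject, "(?<=\\d{3}...)(?<!999)foo", [{capture, none}]).
```

With these definitions, <c>not_999("123abcfoo")</c> returns <c>nomatch</c>, whereas <c>first_three_of_six("123abcfoo")</c> returns <c>match</c>.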

whitespace to make it more readable (assume the <c>extended</c>
option) and to divide it into three parts for ease of discussion:

The first part matches an optional opening parenthesis, and if that
character is present, sets it as the first captured substring. The second part

If you were embedding this pattern in a larger one, you could use a relative
reference:

<quote> ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...</quote>

This makes the fragment independent of the parentheses in the larger pattern.
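
A conditional subpattern of this kind can be checked with a short sketch. The pattern below uses the absolute condition (?(1) ...) and adds ^ and $ anchors so that unbalanced input is rejected outright; the anchors, the module name <c>conditional_demo</c> and the function name are assumptions made for illustration.

```erlang
-module(conditional_demo).
-export([balanced_or_bare/1]).

%% Illustrative module; names and the added anchors are assumptions.

%% Group 1 captures an optional "(". The condition (?(1) ...) demands a
%% closing ")" only if group 1 took part in the match.
balanced_or_bare(Subject) ->
    re:run(Subject, "^ ( \\( )? [^()]+ (?(1) \\) ) $",
           [extended, {capture, none}]).
```

Under these assumptions, <c>"(abc)"</c> and <c>"abc"</c> both give <c>match</c>, while <c>"(abc"</c> and <c>"abc)"</c> give <c>nomatch</c>.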

Rewriting the above example to use a named subpattern gives this:

Checking for pattern recursion

is described below.) For example, a pattern to match an IPv4 address could be
written like this (ignore whitespace and line breaks):

<quote> (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
\b (?&byte) (\.(?&byte)){3} \b</quote>

The first part of the pattern is a DEFINE group inside which another group
named "byte" is defined. This matches an individual component of an IPv4

assertion. Consider this pattern, again containing non-significant
whitespace, and with the two alternatives on the second line:

(?(?=[^a-z]*[a-z])
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )</code>

The condition is a positive lookahead assertion that matches an optional
sequence of non-letters followed by a letter. In other words, it tests for the

interpolation to solve the parentheses problem can be created like
this:

The (?p{...}) item interpolates Perl code at run time, and in this
case refers recursively to the pattern in which it appears.

This PCRE pattern solves the nested parentheses problem (assume the
<c>extended</c> option is set so that whitespace is ignored):

First it matches an opening parenthesis. Then it matches any number
of substrings which can either be a sequence of non-parentheses, or a

If this were part of a larger pattern, you would not want to
recurse the entire pattern, so instead you could use this:

We have put the pattern into parentheses, and caused the recursion
to refer to them instead of the whole pattern.
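
Recursion into the whole pattern and into a single group can be compared with the following sketch. The patterns are assumptions written to match the description in this section (an opening parenthesis, any number of non-parenthesis runs or nested groups, a closing parenthesis); the module name <c>recursion_demo</c>, the function names and the added anchors in the second pattern are illustrative.

```erlang
-module(recursion_demo).
-export([whole_pattern/1, group_only/1]).

%% Illustrative module; names and patterns are assumptions.

%% (?R) recurses into the entire pattern.
whole_pattern(Subject) ->
    re:run(Subject, "\\( ( (?>[^()]+) | (?R) )* \\)",
           [extended, {capture, first, list}]).

%% (?1) recurses into group 1 only, so the surrounding anchors
%% are not repeated at each nesting level.
group_only(Subject) ->
    re:run(Subject, "^ ( \\( ( (?>[^()]+) | (?1) )* \\) ) $",
           [extended, {capture, none}]).
```

For example, <c>whole_pattern("x(ab(cd)ef)y")</c> returns <c>{match,["(ab(cd)ef)"]}</c>, and <c>group_only("(ab(cd)ef")</c> returns <c>nomatch</c> because the outer parenthesis is never closed.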

(?P>name) is also supported. We could rewrite the above example as
follows:

If there is more than one subpattern with the same name, the earliest one is
used.

on at the top level. If additional parentheses are added, giving

\( ( ( (?>[^()]+) | (?R) )* ) \)
   ^                        ^
   ^                        ^</code>

nested brackets (that is, when recursing), whereas any characters are
permitted at the outer level.

In this pattern, (?(R) is the start of a conditional subpattern,
with two different alternatives for the recursive and non-recursive

An earlier example pointed out that the pattern

<quote> (sens|respons)e and \1ibility</quote>

matches "sense and sensibility" and "response and responsibility", but not
"sense and responsibility". If instead the pattern
