386
386
<tag><c>bsr_anycrlf</c></tag>
387
<item>Specifies that \\R is to match only the cr, lf or crlf sequences, not the Unicode-specific newline characters (overrides the compilation option).</item>
387
<item>Specifies that \R is to match only the cr, lf or crlf sequences, not the Unicode-specific newline characters (overrides the compilation option).</item>
388
388
<tag><c>bsr_unicode</c></tag>
389
<item>Specifies that \\R is to match all the Unicode newline characters (including crlf etc; this is the default). Overrides the compilation option.</item>
389
<item>Specifies that \R is to match all the Unicode newline characters (including crlf etc; this is the default). Overrides the compilation option.</item>
391
391
<tag><c>{capture, ValueSpec}</c>/<c>{capture, ValueSpec, Type}</c></tag>
471
471
<tag><c>index</c></tag>
472
472
<item>Return captured substrings as pairs of byte indexes into the subject string and length of the matching string in the subject (as if the subject string were flattened with <c>iolist_to_binary/1</c> or <c>unicode:characters_to_binary/2</c> prior to matching). Note that the <c>unicode</c> option results in <em>byte-oriented</em> indexes in a (possibly imagined) <em>UTF-8 encoded</em> binary. A byte index tuple <c>{0,2}</c> might therefore represent one or two characters when <c>unicode</c> is in effect. This might seem counter-intuitive, but has been deemed the most effective and useful way to do it. Returning lists instead might result in simpler code if that is desired. This return type is the default.</item>
473
473
<tag><c>list</c></tag>
474
<item>Return matching substrings as lists of characters (Erlang <c>string()</c>s). If the <c>unicode</c> option is used in combination with the \\C sequence in the regular expression, a captured subpattern can contain bytes that are not valid UTF-8 (\\C matches bytes regardless of character encoding). In that case the <c>list</c> capturing may result in the same types of tuples that <c>unicode:characters_to_list/2</c> can return, namely three-tuples with the tag <c>incomplete</c> or <c>error</c>, the successfully converted characters and the invalid UTF-8 tail of the conversion as a binary. The best strategy is to avoid using the \\C sequence when capturing lists.</item>
474
<item>Return matching substrings as lists of characters (Erlang <c>string()</c>s). If the <c>unicode</c> option is used in combination with the \C sequence in the regular expression, a captured subpattern can contain bytes that are not valid UTF-8 (\C matches bytes regardless of character encoding). In that case the <c>list</c> capturing may result in the same types of tuples that <c>unicode:characters_to_list/2</c> can return, namely three-tuples with the tag <c>incomplete</c> or <c>error</c>, the successfully converted characters and the invalid UTF-8 tail of the conversion as a binary. The best strategy is to avoid using the \C sequence when capturing lists.</item>
475
475
<tag><c>binary</c></tag>
476
<item>Return matching substrings as binaries. If the <c>unicode</c> option is used, these binaries are in UTF-8. If the \\C sequence is used together with <c>unicode</c>, the binaries may be invalid UTF-8.</item>
476
<item>Return matching substrings as binaries. If the <c>unicode</c> option is used, these binaries are in UTF-8. If the \C sequence is used together with <c>unicode</c>, the binaries may be invalid UTF-8.</item>
545
545
<p>The replacement string can contain the special character
546
546
<c>&</c>, which inserts the whole matching expression in the
547
result, and the special sequence <c>\\</c>N (where N is an
547
result, and the special sequence <c>\</c>N (where N is an
548
548
integer > 0), which causes subexpression number N to be
549
549
inserted in the result. If no subexpression with that number is
550
550
generated by the regular expression, nothing is inserted.</p>
551
<p>To insert an <c>&</c> or <c>\\</c> in the result, precede it
552
with a <c>\\</c>. Note that Erlang already gives a special
553
meaning to <c>\\</c> in literal strings, so a single <c>\\</c>
554
has to be written as <c>"\\\\"</c> and therefore a double <c>\\</c>
555
as <c>"\\\\\\\\"</c>. Example:</p>
551
<p>To insert an <c>&</c> or <c>\</c> in the result, precede it
552
with a <c>\</c>. Note that Erlang already gives a special
553
meaning to <c>\</c> in literal strings, so a single <c>\</c>
554
has to be written as <c>"\\"</c> and therefore a double <c>\</c>
555
as <c>"\\\\"</c>. Example:</p>
556
556
<code> re:replace("abcd","c","[&]",[{return,list}]).</code>
558
558
<code> "ab[c]d"</code>
560
<code> re:replace("abcd","c","[\\\&]",[{return,list}]).</code>
560
<code> re:replace("abcd","c","[\\&]",[{return,list}]).</code>
562
562
<code> "ab[&]d"</code>
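<p>The <c>\</c>N form can be sketched in the same way. The following shell session is only an illustration (the subject and pattern are invented for this purpose); it swaps two captured subexpressions, with each backslash doubled because the replacement is written as an Erlang string literal:</p>
<code type="none">
1> re:replace("abcd", "(b)(c)", "[\\2\\1]", [{return,list}]).
"a[cb]d"</code>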
563
563
<p>As with <c>re:run/3</c>, compilation errors raise the <c>badarg</c>
855
<p>changes the convention to CR. That pattern matches "a\\nb" because LF is no
855
<p>changes the convention to CR. That pattern matches "a\nb" because LF is no
856
856
longer a newline. Note that these special settings, which are not
857
857
Perl-compatible, are recognized only at the very start of a pattern, and that
858
858
they must be in upper case. If more than one of them is present, the last one
861
<p>The newline convention does not affect what the \\R escape sequence matches. By
861
<p>The newline convention does not affect what the \R escape sequence matches. By
862
862
default, this is any Unicode newline sequence, for Perl compatibility. However,
863
this can be changed; see the description of \\R in the section entitled
863
this can be changed; see the description of \R in the section entitled
865
865
"Newline sequences"
867
below. A change of \\R setting can be combined with a change of newline
867
below. A change of \R setting can be combined with a change of newline
939
939
may have. This use of backslash as an escape character applies both inside and
940
940
outside character classes.</p>
942
<p>For example, if you want to match a * character, you write \\* in the pattern.
942
<p>For example, if you want to match a * character, you write \* in the pattern.
943
943
This escaping action applies whether or not the following character would
944
944
otherwise be interpreted as a metacharacter, so it is always safe to precede a
945
945
non-alphanumeric with backslash to specify that it stands for itself. In
946
particular, if you want to match a backslash, you write \\\\.</p>
946
particular, if you want to match a backslash, you write \\.</p>
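<p>When the pattern is written as an Erlang string literal, each backslash that is meant for the regular expression must itself be doubled. A minimal sketch in the shell, with subjects chosen only for illustration:</p>
<code type="none">
1> re:run("a*b", "a\\*b", [{capture,all,list}]).
{match,["a*b"]}
2> re:run("a\\b", "\\\\", [{capture,all,list}]).
{match,["\\"]}</code>
<p>In the second call the subject is the three characters a, backslash, b, and the pattern seen by the regular expression engine is two backslashes, which match the single backslash in the subject.</p>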
948
948
<p>If a pattern is compiled with the <c>extended</c> option, whitespace in the
949
949
pattern (other than in a character class) and characters between a # outside
951
951
be used to include a whitespace or # character as part of the pattern.</p>
953
953
<p>If you want to remove the special meaning from a sequence of characters, you
954
can do so by putting them between \\Q and \\E. This is different from Perl in
955
that $ and @ are handled as literals in \\Q...\\E sequences in PCRE, whereas in
954
can do so by putting them between \Q and \E. This is different from Perl in
955
that $ and @ are handled as literals in \Q...\E sequences in PCRE, whereas in
956
956
Perl, $ and @ cause variable interpolation. Note the following examples:</p>
957
957
<code type="none">
958
958
Pattern PCRE matches Perl matches
960
\\Qabc$xyz\\E abc$xyz abc followed by the contents of $xyz
961
\\Qabc\\$xyz\\E abc\\$xyz abc\\$xyz
962
\\Qabc\\E\\$\\Qxyz\\E abc$xyz abc$xyz</code>
965
<p>The \\Q...\\E sequence is recognized both inside and outside character classes.</p>
960
\Qabc$xyz\E abc$xyz abc followed by the contents of $xyz
961
\Qabc\$xyz\E abc\$xyz abc\$xyz
962
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz</code>
965
<p>The \Q...\E sequence is recognized both inside and outside character classes.</p>
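<p>As an illustrative sketch (the subject string is an assumption made for the example), quoting with \Q...\E makes the dollar sign literal, whereas the unquoted pattern treats it as an end-of-subject assertion and fails:</p>
<code type="none">
1> re:run("abc$xyz", "\\Qabc$xyz\\E", [{capture,all,list}]).
{match,["abc$xyz"]}
2> re:run("abc$xyz", "abc$xyz", [{capture,all,list}]).
nomatch</code>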
968
968
<p><em>Non-printing characters</em></p>
978
<tag>\\a</tag> <item>alarm, that is, the BEL character (hex 07)</item>
979
<tag>\\cx</tag> <item>"control-x", where x is any character</item>
980
<tag>\\e </tag> <item>escape (hex 1B)</item>
981
<tag>\\f</tag> <item>formfeed (hex 0C)</item>
982
<tag>\\n</tag> <item>linefeed (hex 0A)</item>
983
<tag>\\r</tag> <item>carriage return (hex 0D)</item>
984
<tag>\\t </tag> <item>tab (hex 09)</item>
985
<tag>\\ddd</tag> <item>character with octal code ddd, or backreference</item>
986
<tag>\\xhh </tag> <item>character with hex code hh</item>
987
<tag>\\x{hhh..}</tag> <item>character with hex code hhh..</item>
978
<tag>\a</tag> <item>alarm, that is, the BEL character (hex 07)</item>
979
<tag>\cx</tag> <item>"control-x", where x is any character</item>
980
<tag>\e </tag> <item>escape (hex 1B)</item>
981
<tag>\f</tag> <item>formfeed (hex 0C)</item>
982
<tag>\n</tag> <item>linefeed (hex 0A)</item>
983
<tag>\r</tag> <item>carriage return (hex 0D)</item>
984
<tag>\t </tag> <item>tab (hex 09)</item>
985
<tag>\ddd</tag> <item>character with octal code ddd, or backreference</item>
986
<tag>\xhh </tag> <item>character with hex code hh</item>
987
<tag>\x{hhh..}</tag> <item>character with hex code hhh..</item>
990
<p>The precise effect of \\cx is as follows: if x is a lower case letter, it
990
<p>The precise effect of \cx is as follows: if x is a lower case letter, it
991
991
is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
992
Thus \\cz becomes hex 1A, but \\c{ becomes hex 3B, while \\c; becomes hex
992
Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c; becomes hex
995
<p>After \\x, from zero to two hexadecimal digits are read (letters can be in
996
upper or lower case). Any number of hexadecimal digits may appear between \\x{
995
<p>After \x, from zero to two hexadecimal digits are read (letters can be in
996
upper or lower case). Any number of hexadecimal digits may appear between \x{
997
997
and }, but the value of the character code must be less than 256 in non-UTF-8
998
998
mode, and less than 2**31 in UTF-8 mode. That is, the maximum value in
999
999
hexadecimal is 7FFFFFFF. Note that this is bigger than the largest Unicode code
1000
1000
point, which is 10FFFF.</p>
1002
<p>If characters other than hexadecimal digits appear between \\x{ and }, or if
1002
<p>If characters other than hexadecimal digits appear between \x{ and }, or if
1003
1003
there is no terminating }, this form of escape is not recognized. Instead, the
1004
initial \\x will be interpreted as a basic hexadecimal escape, with no
1004
initial \x will be interpreted as a basic hexadecimal escape, with no
1005
1005
following digits, giving a character whose value is zero.</p>
1007
1007
<p>Characters whose value is less than 256 can be defined by either of the two
1008
syntaxes for \\x. There is no difference in the way they are handled. For
1009
example, \\xdc is exactly the same as \\x{dc}.</p>
1008
syntaxes for \x. There is no difference in the way they are handled. For
1009
example, \xdc is exactly the same as \x{dc}.</p>
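<p>The equivalence of the two syntaxes can be checked in the shell. This is a small sketch that uses the character "A" (hex 41) rather than the value from the text above, so that the subject is plain ASCII:</p>
<code type="none">
1> re:run("A", "\\x41").
{match,[{0,1}]}
2> re:run("A", "\\x{41}").
{match,[{0,1}]}</code>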
1011
<p>After \\0 up to two further octal digits are read. If there are fewer than two
1012
digits, just those that are present are used. Thus the sequence \\0\\x\\07
1011
<p>After \0 up to two further octal digits are read. If there are fewer than two
1012
digits, just those that are present are used. Thus the sequence \0\x\07
1013
1013
specifies two binary zeros followed by a BEL character (code value 7). Make
1014
1014
sure you supply two digits after the initial zero if the pattern character that
1015
1015
follows is itself an octal digit.</p>
1027
1027
digits following the backslash, and uses them to generate a data character. Any
1028
1028
subsequent digits stand for themselves.
1030
character specified in octal must be less than \\400.
1030
character specified in octal must be less than \400.
1031
1031
In non-UTF-8 mode, the value of a
1032
character specified in octal must be less than \\400. In UTF-8 mode, values up
1033
to \\777 are permitted.
1032
character specified in octal must be less than \400. In UTF-8 mode, values up
1033
to \777 are permitted.
1035
1035
For example:</p>
1038
<tag>\\040</tag> <item>is another way of writing a space</item>
1038
<tag>\040</tag> <item>is another way of writing a space</item>
1040
<tag>\\40</tag> <item>is the same, provided there are fewer than 40
1040
<tag>\40</tag> <item>is the same, provided there are fewer than 40
1041
1041
previous capturing subpatterns</item>
1042
<tag>\\7</tag> <item>is always a back reference</item>
1042
<tag>\7</tag> <item>is always a back reference</item>
1044
<tag>\\11</tag> <item> might be a back reference, or another way of
1044
<tag>\11</tag> <item> might be a back reference, or another way of
1045
1045
writing a tab</item>
1046
<tag>\\011</tag> <item>is always a tab</item>
1047
<tag>\\0113</tag> <item>is a tab followed by the character "3"</item>
1046
<tag>\011</tag> <item>is always a tab</item>
1047
<tag>\0113</tag> <item>is a tab followed by the character "3"</item>
1049
<tag>\\113</tag> <item>might be a back reference, otherwise the
1049
<tag>\113</tag> <item>might be a back reference, otherwise the
1050
1050
character with octal code 113</item>
1052
<tag>\\377</tag> <item>might be a back reference, otherwise
1052
<tag>\377</tag> <item>might be a back reference, otherwise
1053
1053
the byte consisting entirely of 1 bits</item>
1055
<tag>\\81</tag> <item>is either a back reference, or a binary zero
1055
<tag>\81</tag> <item>is either a back reference, or a binary zero
1056
1056
followed by the two characters "8" and "1"</item>
1063
1063
<p>All the sequences that define a single character value can be used
1064
1064
both inside and outside character classes. In addition, inside a
1065
character class, the sequence \\b is interpreted as the backspace
1066
character (hex 08), and the sequences \\R and \\X are interpreted as
1065
character class, the sequence \b is interpreted as the backspace
1066
character (hex 08), and the sequences \R and \X are interpreted as
1067
1067
the characters "R" and "X", respectively. Outside a character class,
1068
1068
these sequences have different meanings (see below).</p>
1070
1070
<p><em>Absolute and relative back references</em></p>
1072
<p>The sequence \\g followed by an unsigned or a negative number,
1072
<p>The sequence \g followed by an unsigned or a negative number,
1073
1073
optionally enclosed in braces, is an absolute or relative back
1074
reference. A named back reference can be coded as \\g{name}. Back
1074
reference. A named back reference can be coded as \g{name}. Back
1075
1075
references are discussed later, following the discussion of
1076
1076
parenthesized subpatterns.</p>
1081
1081
following are always recognized:</p>
1084
<tag>\\d</tag> <item>any decimal digit</item>
1085
<tag>\\D</tag> <item>any character that is not a decimal digit</item>
1086
<tag>\\h</tag> <item>any horizontal whitespace character</item>
1087
<tag>\\H</tag> <item>any character that is not a horizontal whitespace character</item>
1088
<tag>\\s</tag> <item>any whitespace character</item>
1089
<tag>\\S</tag> <item>any character that is not a whitespace character</item>
1090
<tag>\\v</tag> <item>any vertical whitespace character</item>
1091
<tag>\\V</tag> <item>any character that is not a vertical whitespace character</item>
1092
<tag>\\w</tag> <item>any "word" character</item>
1093
<tag>\\W</tag> <item>any "non-word" character</item>
1084
<tag>\d</tag> <item>any decimal digit</item>
1085
<tag>\D</tag> <item>any character that is not a decimal digit</item>
1086
<tag>\h</tag> <item>any horizontal whitespace character</item>
1087
<tag>\H</tag> <item>any character that is not a horizontal whitespace character</item>
1088
<tag>\s</tag> <item>any whitespace character</item>
1089
<tag>\S</tag> <item>any character that is not a whitespace character</item>
1090
<tag>\v</tag> <item>any vertical whitespace character</item>
1091
<tag>\V</tag> <item>any character that is not a vertical whitespace character</item>
1092
<tag>\w</tag> <item>any "word" character</item>
1093
<tag>\W</tag> <item>any "non-word" character</item>
1096
1096
<p>Each pair of escape sequences partitions the complete set of characters into
1101
1101
matching point is at the end of the subject string, all of them fail, since
1102
1102
there is no character to match.</p>
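<p>A short shell sketch of one such pair, \d and \D, using an invented subject. The last call shows that a character type fails on an empty subject, because there is no character to match:</p>
<code type="none">
1> re:run("abc123", "\\d+", [{capture,all,list}]).
{match,["123"]}
2> re:run("abc123", "\\D+", [{capture,all,list}]).
{match,["abc"]}
3> re:run("", "\\D").
nomatch</code>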
1104
<p>For compatibility with Perl, \\s does not match the VT character (code 11).
1105
This makes it different from the POSIX "space" class. The \\s characters
1104
<p>For compatibility with Perl, \s does not match the VT character (code 11).
1105
This makes it different from the POSIX "space" class. The \s characters
1106
1106
are HT (9), LF (10), FF (12), CR (13), and space (32). If "use locale;" is
1107
included in a Perl script, \\s may match the VT character. In PCRE, it never
1107
included in a Perl script, \s may match the VT character. In PCRE, it never
1110
<p>In UTF-8 mode, characters with values greater than 128 never match \\d, \\s, or
1111
\\w, and always match \\D, \\S, and \\W. This is true even when Unicode
1110
<p>In UTF-8 mode, characters with values greater than 128 never match \d, \s, or
1111
\w, and always match \D, \S, and \W. This is true even when Unicode
1112
1112
character property support is available. These sequences retain their original
1113
1113
meanings from before UTF-8 support was available, mainly for efficiency
1116
<p>The sequences \\h, \\H, \\v, and \\V are Perl 5.10 features. In contrast to the
1116
<p>The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to the
1117
1117
other sequences, these do match certain high-valued codepoints in UTF-8 mode.
1118
1118
The horizontal space characters are:</p>
1208
1208
characters whose codepoints are less than 256, but they do work in this mode.
1209
1209
The extra escape sequences are:</p>
1211
<p> \\p{<em>xx</em>} a character with the <em>xx</em> property
1212
\\P{<em>xx</em>} a character without the <em>xx</em> property
1213
\\X an extended Unicode sequence</p>
1211
<p> \p{<em>xx</em>} a character with the <em>xx</em> property
1212
\P{<em>xx</em>} a character without the <em>xx</em> property
1213
\X an extended Unicode sequence</p>
1215
1215
<p>The property names represented by <em>xx</em> above are limited to the Unicode
1216
1216
script names, the general category properties, and "Any", which matches any
1217
1217
character (including newline). Other properties such as "InMusicalSymbols" are
1218
not currently supported by PCRE. Note that \\P{Any} does not match any
1218
not currently supported by PCRE. Note that \P{Any} does not match any
1219
1219
characters, so always causes a match failure.</p>
1221
1221
<p>Sets of Unicode characters are defined as belonging to certain scripts. A
1222
1222
character from one of these sets can be matched using a script name. For
1228
1228
<p>Those that are not part of an identified script are lumped together as
1229
1229
"Common". The current list of scripts is:</p>
1300
1300
<p>Each character has exactly one general category property, specified by a
1301
1301
two-letter abbreviation. For compatibility with Perl, negation can be specified
1302
1302
by including a circumflex between the opening brace and the property name. For
1303
example, \\p{^Lu} is the same as \\P{Lu}.</p>
1303
example, \p{^Lu} is the same as \P{Lu}.</p>
1305
<p>If only one letter is specified with \\p or \\P, it includes all the general
1305
<p>If only one letter is specified with \p or \P, it includes all the general
1306
1306
category properties that start with that letter. In this case, in the absence
1307
1307
of negation, the curly brackets in the escape sequence are optional; these two
1308
1308
examples have the same effect:</p>
1310
<list><item>\\p{L}</item>
1311
<item>\\pL</item></list>
1310
<list><item>\p{L}</item>
1311
<item>\pL</item></list>
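<p>The two spellings can be compared in the shell. This sketch assumes that the <c>unicode</c> option is given and uses an ASCII subject chosen only for illustration:</p>
<code type="none">
1> re:run("Abc", "\\p{L}+", [unicode, {capture,all,list}]).
{match,["Abc"]}
2> re:run("Abc", "\\pL+", [unicode, {capture,all,list}]).
{match,["Abc"]}</code>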
1313
1313
<p>The following general category property codes are supported:</p>
1391
1391
Unicode table.</p>
1393
1393
<p>Specifying caseless matching does not affect these escape sequences. For
1394
example, \\p{Lu} always matches only upper case letters.</p>
1396
<p>The \\X escape matches any number of Unicode characters that form an extended
1397
Unicode sequence. \\X is equivalent to</p>
1399
<quote><p> (?>\\PM\\pM*)</p></quote>
1394
example, \p{Lu} always matches only upper case letters.</p>
1396
<p>The \X escape matches any number of Unicode characters that form an extended
1397
Unicode sequence. \X is equivalent to</p>
1399
<quote><p> (?>\PM\pM*)</p></quote>
1401
1401
<p>That is, it matches a character without the "mark" property, followed by zero
1402
1402
or more characters with the "mark" property, and treats the sequence as an
1405
1405
Characters with the "mark" property are typically accents that affect the
1406
1406
preceding character. None of them have codepoints less than 256, so in
1407
non-UTF-8 mode \\X matches any one character.</p>
1407
non-UTF-8 mode \X matches any one character.</p>
1409
1409
<p>Matching characters by Unicode property is not fast, because PCRE has to search
1410
1410
a structure that contains data for over fifteen thousand characters. That is
1411
why the traditional escape sequences such as \\d and \\w do not use Unicode
1411
why the traditional escape sequences such as \d and \w do not use Unicode
1412
1412
properties in PCRE.</p>
1414
1414
<p><em>Resetting the match start</em></p>
1416
<p>The escape sequence \\K, which is a Perl 5.10 feature, causes any previously
1416
<p>The escape sequence \K, which is a Perl 5.10 feature, causes any previously
1417
1417
matched characters not to be included in the final matched sequence. For
1418
1418
example, the pattern:</p>
1420
<quote><p> foo\\Kbar</p></quote>
1420
<quote><p> foo\Kbar</p></quote>
1422
1422
<p>matches "foobar", but reports that it has matched "bar". This feature is
1423
1423
similar to a lookbehind assertion
1444
1444
described below. The backslashed assertions are:</p>
1447
<tag>\\b</tag> <item>matches at a word boundary</item>
1448
<tag>\\B</tag> <item>matches when not at a word boundary</item>
1449
<tag>\\A</tag> <item>matches at the start of the subject</item>
1450
<tag>\\Z</tag> <item>matches at the end of the subject
1447
<tag>\b</tag> <item>matches at a word boundary</item>
1448
<tag>\B</tag> <item>matches when not at a word boundary</item>
1449
<tag>\A</tag> <item>matches at the start of the subject</item>
1450
<tag>\Z</tag> <item>matches at the end of the subject
1451
1451
also matches before a newline at the end of
1452
1452
the subject</item>
1453
<tag>\\z</tag> <item>matches only at the end of the subject</item>
1454
<tag>\\G</tag> <item>matches at the first matching position in the
1453
<tag>\z</tag> <item>matches only at the end of the subject</item>
1454
<tag>\G</tag> <item>matches at the first matching position in the
1458
<p>These assertions may not appear in character classes (but note that \\b has a
1458
<p>These assertions may not appear in character classes (but note that \b has a
1459
1459
different meaning, namely the backspace character, inside a character class).</p>
1461
1461
<p>A word boundary is a position in the subject string where the current character
1462
and the previous character do not both match \\w or \\W (i.e. one matches
1463
\\w and the other matches \\W), or the start or end of the string if the
1464
first or last character matches \\w, respectively.</p>
1462
and the previous character do not both match \w or \W (i.e. one matches
1463
\w and the other matches \W), or the start or end of the string if the
1464
first or last character matches \w, respectively.</p>
1466
<p>The \\A, \\Z, and \\z assertions differ from the traditional circumflex and
1466
<p>The \A, \Z, and \z assertions differ from the traditional circumflex and
1467
1467
dollar (described in the next section) in that they only ever match at the very
1468
1468
start and end of the subject string, whatever options are set. Thus, they are
1469
1469
independent of multiline mode. These three assertions are not affected by the
1470
1470
<c>notbol</c> or <c>noteol</c> options, which affect only the behaviour of the
1471
1471
circumflex and dollar metacharacters. However, if the <em>startoffset</em>
1472
1472
argument of <c>re:run/3</c> is non-zero, indicating that matching is to start
1473
at a point other than the beginning of the subject, \\A can never match. The
1474
difference between \\Z and \\z is that \\Z matches before a newline at the end
1475
of the string as well as at the very end, whereas \\z matches only at the end.</p>
1473
at a point other than the beginning of the subject, \A can never match. The
1474
difference between \Z and \z is that \Z matches before a newline at the end
1475
of the string as well as at the very end, whereas \z matches only at the end.</p>
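<p>The difference between \Z and \z can be sketched with a subject that ends in a newline (the subject is an assumption made for the example):</p>
<code type="none">
1> re:run("abc\n", "abc\\Z").
{match,[{0,3}]}
2> re:run("abc\n", "abc\\z").
nomatch
3> re:run("abc\n", "abc\\n\\z").
{match,[{0,4}]}</code>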
1477
<p>The \\G assertion is true only when the current matching position is at the
1477
<p>The \G assertion is true only when the current matching position is at the
1478
1478
start point of the match, as specified by the <em>startoffset</em> argument of
1479
<c>re:run/3</c>. It differs from \\A when the value of <em>startoffset</em> is
1479
<c>re:run/3</c>. It differs from \A when the value of <em>startoffset</em> is
1480
1480
non-zero. By calling <c>re:run/3</c> multiple times with appropriate
1481
1481
arguments, you can mimic Perl's /g option, and it is in this kind of
1482
implementation where \\G can be useful.</p>
1482
implementation where \G can be useful.</p>
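<p>A minimal sketch of the difference between \G and \A, assuming that the start offset is supplied through the <c>{offset, N}</c> option of <c>re:run/3</c> and using an invented subject:</p>
<code type="none">
1> re:run("aab", "\\Ga", [{offset,1}]).
{match,[{1,1}]}
2> re:run("aab", "\\Gb", [{offset,1}]).
nomatch
3> re:run("aab", "\\Aa", [{offset,1}]).
nomatch</code>
<p>In the second call the "b" at position 2 is not found, because \G is true only at the start offset. In the third call \A cannot match, because matching starts after the beginning of the subject.</p>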
1484
<p>Note, however, that PCRE's interpretation of \\G, as the start of the current
1484
<p>Note, however, that PCRE's interpretation of \G, as the start of the current
1485
1485
match, is subtly different from Perl's, which defines it as the end of the
1486
1486
previous match. In Perl, these can be different when the previously matched
1487
1487
string was empty. Because PCRE does just one match at a time, it cannot
1488
1488
reproduce this behaviour.</p>
1490
<p>If all the alternatives of a pattern begin with \\G, the expression is anchored
1490
<p>If all the alternatives of a pattern begin with \G, the expression is anchored
1491
1491
to the starting match position, and the "anchored" flag is set in the compiled
1492
1492
regular expression.</p>
1530
1530
sequence CRLF, isolated CR and LF characters do not indicate newlines.</p>
1532
1532
<p>For example, the pattern /^abc$/ matches the subject string
1533
"def\\nabc" (where \\n represents a newline) in multiline mode, but
1533
"def\nabc" (where \n represents a newline) in multiline mode, but
1534
1534
not otherwise. Consequently, patterns that are anchored in single line
1535
1535
mode because all branches start with ^ are not anchored in multiline
1536
1536
mode, and a match for circumflex is possible when the
1537
1537
<em>startoffset</em> argument of <c>re:run/3</c> is non-zero. The
1538
1538
<c>dollar_endonly</c> option is ignored if <c>multiline</c> is set.</p>
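<p>The example above can be tried in the shell. In this sketch the subject is written as an Erlang string, so <c>\n</c> is a real linefeed character:</p>
<code type="none">
1> re:run("def\nabc", "^abc$", [multiline]).
{match,[{4,3}]}
2> re:run("def\nabc", "^abc$").
nomatch</code>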
1540
<p>Note that the sequences \\A, \\Z, and \\z can be used to match the start and
1540
<p>Note that the sequences \A, \Z, and \z can be used to match the start and
1541
1541
end of the subject in both modes, and if all branches of a pattern start with
1542
\\A it is always anchored, whether or not <c>multiline</c> is set.</p>
1542
\A it is always anchored, whether or not <c>multiline</c> is set.</p>
1575
1575
<section><marker id="sect6"></marker><title>Matching a single byte</title>
1577
<p>Outside a character class, the escape sequence \\C matches any one byte, both
1577
<p>Outside a character class, the escape sequence \C matches any one byte, both
1578
1578
in and out of UTF-8 mode. Unlike a dot, it always matches any line-ending
1579
1579
characters. The feature is provided in Perl in order to match individual bytes
1580
1580
in UTF-8 mode. Because it breaks up UTF-8 characters into individual bytes,
1581
1581
what remains in the string may be a malformed UTF-8 string. For this reason,
1582
the \\C escape sequence is best avoided.</p>
1582
the \C escape sequence is best avoided.</p>
1584
<p>PCRE does not allow \\C to appear in lookbehind assertions (described below),
1584
<p>PCRE does not allow \C to appear in lookbehind assertions (described below),
1585
1585
because in UTF-8 mode this would make it impossible to calculate the length of
1586
1586
the lookbehind.</p>
1648
1648
class of two characters ("W" and "-") followed by a literal string
1649
1649
"46]", so it would match "W46]" or "-46]". However, if the "]" is
1650
1650
escaped with a backslash it is interpreted as the end of range, so
1651
[W-\\]46] is interpreted as a class containing a range followed by two
1651
[W-\]46] is interpreted as a class containing a range followed by two
1652
1652
other characters. The octal or hexadecimal representation of "]" can
1653
1653
also be used to end a range.</p>
1655
1655
<p>Ranges operate in the collating sequence of character values. They can also be
1656
used for characters specified numerically, for example [\\000-\\037].
1656
used for characters specified numerically, for example [\000-\037].
1658
1658
mode, ranges can include characters whose values are greater than 255, for
1659
example [\\x{100}-\\x{2ff}].
1659
example [\x{100}-\x{2ff}].
1662
1662
<p>If a range that includes letters is used when caseless matching is set, it
1663
1663
matches the letters in either case. For example, [W-c] is equivalent to
1664
[][\\\\^_`wxyzabc], matched caselessly
1664
[][\\^_`wxyzabc], matched caselessly
1665
1665
, and in non-UTF-8 mode, if character
1666
tables for a French locale are in use, [\\xc8-\\xcb] matches accented E
1666
tables for a French locale are in use, [\xc8-\xcb] matches accented E
1667
1667
characters in both cases. In UTF-8 mode, PCRE supports the concept of case for
1668
1668
characters with values greater than 128 only when it is compiled with Unicode
1669
1669
property support.</p>
1671
<p>The character types \\d, \\D, \\p, \\P, \\s, \\S, \\w, and \\W may
1671
<p>The character types \d, \D, \p, \P, \s, \S, \w, and \W may
1672
1672
also appear in a character class, and add the characters that they
1673
match to the class. For example, [\\dABCDEF] matches any hexadecimal
1673
match to the class. For example, [\dABCDEF] matches any hexadecimal
1674
1674
digit. A circumflex can conveniently be used with the upper case
1675
1675
character types to specify a more restricted set of characters than
1676
the matching lower case type. For example, the class [^\\W_] matches
1676
the matching lower case type. For example, the class [^\W_] matches
1677
1677
any letter or digit, but not underscore.</p>
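<p>Both classes mentioned above can be sketched in the shell; the subjects are invented for the example:</p>
<code type="none">
1> re:run("_a1", "[^\\W_]+", [{capture,all,list}]).
{match,["a1"]}
2> re:run("FE09", "^[\\dABCDEF]+$").
{match,[{0,4}]}</code>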
1679
1679
<p>The only metacharacters that are recognized in character classes
1702
1702
<tag>ascii</tag> <item>character codes 0 - 127</item>
1703
1703
<tag>blank</tag> <item>space or tab only</item>
1704
1704
<tag>cntrl</tag> <item>control characters</item>
1705
<tag>digit</tag> <item>decimal digits (same as \\d)</item>
1705
<tag>digit</tag> <item>decimal digits (same as \d)</item>
1706
1706
<tag>graph</tag> <item>printing characters, excluding space</item>
1707
1707
<tag>lower</tag> <item>lower case letters</item>
1708
1708
<tag>print</tag> <item>printing characters, including space</item>
1709
1709
<tag>punct</tag> <item>printing characters, excluding letters and digits</item>
1710
<tag>space</tag> <item>whitespace (not quite the same as \\s)</item>
1710
<tag>space</tag> <item>whitespace (not quite the same as \s)</item>
1711
1711
<tag>upper</tag> <item>upper case letters</item>
1712
<tag>word</tag> <item>"word" characters (same as \\w)</item>
1712
<tag>word</tag> <item>"word" characters (same as \w)</item>
1713
1713
<tag>xdigit</tag> <item>hexadecimal digits</item>
1716
1716
<p>The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
1717
1717
space (32). Notice that this list includes the VT character (code 11). This
1718
makes "space" different to \\s, which does not include VT (for Perl
1718
makes "space" different to \s, which does not include VT (for Perl
1719
1719
compatibility).</p>
1721
1721
<p>The name "word" is a Perl extension, and "blank" is a GNU extension
1973
1973
<item>a literal data character</item>
1974
1974
<item>the dot metacharacter</item>
1975
<item>the \\C escape sequence</item>
1976
<item>the \\X escape sequence
1975
<item>the \C escape sequence</item>
1976
<item>the \X escape sequence
1977
1977
(in UTF-8 mode with Unicode properties)
1979
<item>the \\R escape sequence</item>
1980
<item>an escape such as \\d that matches a single character</item>
1979
<item>the \R escape sequence</item>
1980
<item>an escape such as \d that matches a single character</item>
1981
1981
<item>a character class</item>
1982
1982
<item>a back reference (see next section)</item>
1983
1983
<item>a parenthesized subpattern (unless it is an assertion)</item>
2007
2007
quantifier, but a literal string of four characters.</p>
2009
2009
<p>In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to individual
2010
bytes. Thus, for example, \\x{100}{2} matches two UTF-8 characters, each of
2010
bytes. Thus, for example, \x{100}{2} matches two UTF-8 characters, each of
2011
2011
which is represented by a two-byte sequence. Similarly, when Unicode property
2012
support is available, \\X{3} matches three Unicode extended sequences, each of
2012
support is available, \X{3} matches three Unicode extended sequences, each of
2013
2013
which may be several bytes long (and they may be of different lengths).</p>
2015
2015
<p>The quantifier {0} is permitted, causing the expression to behave as if the
2055
2055
greedy, and instead matches the minimum number of times possible, so the
2058
<quote><p> /\\*.*?\\*/</p></quote>
2058
<quote><p> /\*.*?\*/</p></quote>
2060
2060
<p>does the right thing with the C comments. The meaning of the various
2061
2061
quantifiers is not otherwise changed, just the preferred number of matches.
2062
2062
Do not confuse this use of question mark with its use as a quantifier in its
2063
2063
own right. Because it has two uses, it can sometimes appear doubled, as in</p>
2065
<quote><p> \\d??\\d</p></quote>
2065
<quote><p> \d??\d</p></quote>
2067
2067
<p>which matches one digit by preference, but can match two if that is the only
2068
2068
way the rest of the pattern matches.</p>
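<p>A short sketch of both uses of the question mark, again with Python's <c>re</c> module as an assumed stand-in engine (greedy and minimizing quantifiers behave the same way there). The subject strings are made up for the illustration.</p>

```python
import re

subject = "/* first */ x = 1; /* second */"

# Greedy: .* runs to the end of the subject and backs off only as far as
# the last "*/", so both comments and the code between them are swallowed.
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing: .*? stops at the first "*/" that lets the pattern match.
lazy = re.findall(r"/\*.*?\*/", subject)

# Doubled question mark: \d?? is an optional digit that prefers to match
# nothing, so one digit is matched when nothing forces a second one.
one_digit = re.match(r"\d??\d", "12").group()

# When the whole subject must be consumed, \d?? gives in and takes a digit.
two_digits = re.fullmatch(r"\d??\d", "12").group()

print(greedy, lazy, one_digit, two_digits)
```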
2123
2123
nature of the match, or to cause it to fail earlier than it otherwise might, when
2124
2124
the author of the pattern knows there is no point in carrying on.</p>
2126
<p>Consider, for example, the pattern \\d+foo when applied to the subject line</p>
2126
<p>Consider, for example, the pattern \d+foo when applied to the subject line</p>
2128
2128
<quote><p> 123456bar</p></quote>
2130
2130
<p>After matching all 6 digits and then failing to match "foo", the normal
2131
action of the matcher is to try again with only 5 digits matching the \\d+
2131
action of the matcher is to try again with only 5 digits matching the \d+
2132
2132
item, and then with 4, and so on, before ultimately failing. "Atomic grouping"
2133
2133
(a term taken from Jeffrey Friedl's book) provides the means for specifying
2134
2134
that once a subpattern has matched, it is not to be re-evaluated in this way.</p>
2151
2151
<p>Atomic grouping subpatterns are not capturing subpatterns. Simple cases such as
2152
2152
the above example can be thought of as a maximizing repeat that must swallow
2153
everything it can. So, while both \\d+ and \\d+? are prepared to adjust the
2153
everything it can. So, while both \d+ and \d+? are prepared to adjust the
2154
2154
number of digits they match in order to make the rest of the pattern match,
2155
(?>\\d+) can only match an entire sequence of digits.</p>
2155
(?>\d+) can only match an entire sequence of digits.</p>
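<p>The difference between an ordinary repeat and an atomic one can be sketched as follows. This is an illustration under stated assumptions, not the engine described here: it uses Python's <c>re</c> module, and because older Python versions do not accept the (?>...) syntax (Python 3.11 and later do), the atomic group is emulated with a capturing lookahead followed by a back reference, a substitute technique that gives the same "no giving back" behaviour.</p>

```python
import re

subject = "123456"

# Ordinary greedy repeat: \d+ first takes all six digits, then gives one
# back so that the final \d has something to match.
backtracking = re.match(r"\d+\d", subject)

# Emulated atomic group: the lookahead captures the longest run of digits,
# and \1 then consumes exactly that run. A lookahead is not re-entered once
# it has succeeded, so no digit can be handed back to the final \d.
atomic_like = re.match(r"(?=(\d+))\1\d", subject)

print(backtracking.group() if backtracking else None)
print(atomic_like)
```

<p>The first pattern matches the whole subject; the second fails, which is the behaviour the text describes for (?>\d+) followed by something that needs one of the digits.</p>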
2157
2157
<p>Atomic groups in general can of course contain arbitrarily complicated
2158
2158
subpatterns, and can be nested. However, when the subpattern for an atomic
2230
2230
<p>It is not possible to have a numerical "forward back reference" to
2231
2231
a subpattern whose number is 10 or more using this syntax because a
2232
sequence such as \\50 is interpreted as a character defined in
2232
sequence such as \50 is interpreted as a character defined in
2233
2233
octal. See the subsection entitled "Non-printing characters" above for
2234
2234
further details of the handling of digits following a backslash. There
2235
2235
is no such problem when named parentheses are used. A back reference
2236
2236
to any subpattern is possible using named parentheses (see below).</p>
2238
2238
<p>Another way of avoiding the ambiguity inherent in the use of digits
2239
following a backslash is to use the \\g escape sequence, which is a
2239
following a backslash is to use the \g escape sequence, which is a
2240
2240
feature introduced in Perl 5.10. This escape must be followed by an
2241
2241
unsigned number or a negative number, optionally enclosed in
2242
2242
braces. These examples are all identical:</p>
2245
<item>(ring), \\1</item>
2246
<item>(ring), \\g1</item>
2247
<item>(ring), \\g{1}</item>
2245
<item>(ring), \1</item>
2246
<item>(ring), \g1</item>
2247
<item>(ring), \g{1}</item>
2250
2250
<p>An unsigned number specifies an absolute reference without the
2252
2252
literal digits follow the reference. A negative number is a relative
2253
2253
reference. Consider this example:</p>
2255
<quote><p> (abc(def)ghi)\\g{-1}</p></quote>
2255
<quote><p> (abc(def)ghi)\g{-1}</p></quote>
2257
<p>The sequence \\g{-1} is a reference to the most recently started capturing
2258
subpattern before \\g, that is, is it equivalent to \\2. Similarly, \\g{-2}
2259
would be equivalent to \\1. The use of relative references can be helpful in
2257
<p>The sequence \g{-1} is a reference to the most recently started capturing
2258
subpattern before \g, that is, it is equivalent to \2. Similarly, \g{-2}
2259
would be equivalent to \1. The use of relative references can be helpful in
2260
2260
long patterns, and also in patterns that are created by joining together
2261
2261
fragments that contain references within themselves.</p>
2265
2265
matching the subpattern itself (see "Subpatterns as subroutines" below
2266
2266
for a way of doing that). So the pattern</p>
2268
<quote><p> (sens|respons)e and \\1ibility</p></quote>
2268
<quote><p> (sens|respons)e and \1ibility</p></quote>
2270
2270
<p>matches "sense and sensibility" and "response and responsibility", but not
2271
2271
"sense and responsibility". If caseful matching is in force at the time of the
2272
2272
back reference, the case of letters is relevant. For example,</p>
2274
<quote><p> ((?i)rah)\\s+\\1</p></quote>
2274
<quote><p> ((?i)rah)\s+\1</p></quote>
2276
2276
<p>matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
2277
2277
capturing subpattern is matched caselessly.</p>
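<p>Both back reference examples can be checked with a small sketch. It assumes Python's <c>re</c> module as a stand-in engine. One adjustment is needed: recent Python versions reject an option setting such as (?i) in the middle of a pattern, so the scoped form (?i:rah) is used inside the capturing parentheses instead. The helper names are hypothetical.</p>

```python
import re

# \1 must repeat the text that group 1 actually matched, not merely match
# the same subpattern again.
pair = re.compile(r"(sens|respons)e and \1ibility")

same_1 = pair.fullmatch("sense and sensibility") is not None
same_2 = pair.fullmatch("response and responsibility") is not None
mixed = pair.fullmatch("sense and responsibility") is not None

# Group 1 is matched caselessly, but the back reference sits outside the
# scope of the option, so it compares the captured text casefully.
rah = re.compile(r"((?i:rah))\s+\1")

lower = rah.fullmatch("rah rah") is not None
upper = rah.fullmatch("RAH RAH") is not None
mixed_case = rah.fullmatch("RAH rah") is not None

print(same_1, same_2, mixed, lower, upper, mixed_case)
```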
2279
2279
<p>There are several different ways of writing back references to named
2280
subpatterns. The .NET syntax \\k{name} and the Perl syntax \\k<name> or
2281
\\k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's unified
2282
back reference syntax, in which \\g can be used for both numeric and named
2280
subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
2281
\k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's unified
2282
back reference syntax, in which \g can be used for both numeric and named
2283
2283
references, is also supported. We could rewrite the above example in any of
2284
2284
the following ways:</p>
2287
<item>(?<p1>(?i)rah)\\s+\\k<p1></item>
2288
<item>(?'p1'(?i)rah)\\s+\\k{p1}</item>
2289
<item>(?P<p1>(?i)rah)\\s+(?P=p1)</item>
2290
<item>(?<p1>(?i)rah)\\s+\\g{p1}</item>
2287
<item>(?<p1>(?i)rah)\s+\k<p1></item>
2288
<item>(?'p1'(?i)rah)\s+\k{p1}</item>
2289
<item>(?P<p1>(?i)rah)\s+(?P=p1)</item>
2290
<item>(?<p1>(?i)rah)\s+\g{p1}</item>
2293
2293
<p>A subpattern that is referenced by name may appear in the pattern before or
2308
2308
empty comment (see "Comments" below) can be used.</p>
2310
2310
<p>A back reference that occurs inside the parentheses to which it refers fails
2311
when the subpattern is first used, so, for example, (a\\1) never matches.
2311
when the subpattern is first used, so, for example, (a\1) never matches.
2312
2312
However, such references can be useful inside repeated subpatterns. For
2313
2313
example, the pattern</p>
2315
<quote><p> (a|b\\1)+</p></quote>
2315
<quote><p> (a|b\1)+</p></quote>
2317
2317
<p>matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
2318
2318
the subpattern, the back reference matches the character string corresponding