1
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
5
>Pattern Matching</TITLE
8
CONTENT="Modular DocBook HTML Stylesheet Version 1.79"><LINK
10
HREF="mailto:pgsql-docs@postgresql.org"><LINK
12
TITLE="PostgreSQL 9.1beta1 Documentation"
13
HREF="index.html"><LINK
15
TITLE="Functions and Operators"
16
HREF="functions.html"><LINK
18
TITLE="Bit String Functions and Operators"
19
HREF="functions-bitstring.html"><LINK
21
TITLE="Data Type Formatting Functions"
22
HREF="functions-formatting.html"><LINK
25
HREF="stylesheet.css"><META
26
HTTP-EQUIV="Content-Type"
27
CONTENT="text/html; charset=ISO-8859-1"><META
29
CONTENT="2011-04-27T21:20:33"></HEAD
35
SUMMARY="Header navigation table"
47
>PostgreSQL 9.1beta1 Documentation</A
56
TITLE="Bit String Functions and Operators"
57
HREF="functions-bitstring.html"
66
TITLE="Functions and Operators"
74
>Chapter 9. Functions and Operators</TD
80
TITLE="Functions and Operators"
89
TITLE="Data Type Formatting Functions"
90
HREF="functions-formatting.html"
104
NAME="FUNCTIONS-MATCHING"
105
>9.7. Pattern Matching</A
108
> There are three separate approaches to pattern matching provided
124
SQL:1999), and <ACRONYM
128
expressions. Aside from the basic <SPAN
130
>"does this string match
132
> operators, functions are available to extract
133
or replace matching substrings and to split a string at matching
143
> If you have pattern matching needs that go beyond this,
144
consider writing a user-defined function in Perl or Tcl.
153
NAME="FUNCTIONS-LIKE"
203
> expression returns true if the
209
> matches the supplied
223
> returns true, and vice versa.
224
An equivalent expression is
247
> does not contain percent
248
signs or underscores, then the pattern only represents the string
249
itself; in that case <CODE
253
equals operator. An underscore (<TT
262
> stands for (matches) any single
263
character; a percent sign (<TT
266
>) matches any sequence
267
of zero or more characters.
272
CLASS="PROGRAMLISTING"
274
CLASS="LINEANNOTATION"
278
CLASS="LINEANNOTATION"
282
CLASS="LINEANNOTATION"
286
CLASS="LINEANNOTATION"
295
> pattern matching always covers the entire
296
string. Therefore, to match a sequence anywhere within a string, the
297
pattern must start and end with a percent sign.
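<P>A minimal sketch of the whole-string rule described above. The literal values are made up for illustration, and the results in the comments follow from the wildcard definitions given earlier in this section:</P>
<PRE CLASS="PROGRAMLISTING">SELECT 'abcdef' LIKE 'cd';      -- false: the pattern must cover the entire string
SELECT 'abcdef' LIKE '%cd%';    -- true: percent signs absorb the text on both sides
SELECT 'abcdef' LIKE 'abc%';    -- true: only the tail of the string is left open</PRE>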
300
> To match a literal underscore or percent sign without matching
301
other characters, the respective character in
308
preceded by the escape character. The default escape
309
character is the backslash, but a different one can be selected by
313
> clause. To match the escape
314
character itself, write two escape characters.
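<P>The following sketch illustrates the escape rules above. It assumes <TT CLASS="LITERAL">#</TT> as an arbitrarily chosen escape character, which sidesteps the separate question of how backslashes are written in string literals:</P>
<PRE CLASS="PROGRAMLISTING">SELECT '100%' LIKE '100#%' ESCAPE '#';   -- true: #% matches a literal percent sign
SELECT '100x' LIKE '100#%' ESCAPE '#';   -- false: the percent sign is no longer a wildcard
SELECT 'a_b'  LIKE 'a#_b'  ESCAPE '#';   -- true: #_ matches a literal underscore
SELECT 'a#b'  LIKE 'a##b'  ESCAPE '#';   -- true: two escape characters match one literal #</PRE>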
317
> Note that the backslash already has a special meaning in string literals,
318
so to write a pattern constant that contains a backslash you must write two
319
backslashes in an SQL statement (assuming escape string syntax is used, see
321
HREF="sql-syntax-lexical.html#SQL-SYNTAX-STRINGS"
323
>). Thus, writing a pattern that
324
actually matches a literal backslash means writing four backslashes in the
325
statement. You can avoid this by selecting a different escape character
329
>; then a backslash is not special to
333
> anymore. (But backslash is still special to the
334
string literal parser, so you still need two of them to match a backslash.)
337
> It's also possible to select no escape character by writing
341
>. This effectively disables the
342
escape mechanism, which makes it impossible to turn off the
343
special meaning of underscore and percent signs in the pattern.
349
> can be used instead of
353
> to make the match case-insensitive according
354
to the active locale. This is not in the <ACRONYM
393
>, respectively. All of these operators are
405
NAME="FUNCTIONS-SIMILARTO-REGEXP"
409
> Regular Expressions</A
455
> operator returns true or
456
false depending on whether its pattern matches the given string.
457
It is similar to <CODE
461
interprets the pattern using the SQL standard's definition of a
462
regular expression. SQL regular expressions are a curious cross
466
> notation and common regular
477
operator succeeds only if its pattern matches the entire string;
478
this is unlike common regular expression behavior where the pattern
479
can match any part of the string.
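<P>As a rough illustration of this difference, the sketch below applies the same one-character pattern with both operators. The strings are arbitrary examples:</P>
<PRE CLASS="PROGRAMLISTING">SELECT 'abc' SIMILAR TO 'b';      -- false: the pattern must match the entire string
SELECT 'abc' ~ 'b';               -- true: a POSIX pattern can match any part
SELECT 'abc' SIMILAR TO '%b%';    -- true: percent signs cover the surrounding text</PRE>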
494
> as wildcard characters denoting
495
any single character and any string, respectively (these are
506
> In addition to these facilities borrowed from <CODE
513
> supports these pattern-matching
514
metacharacters borrowed from POSIX regular expressions:
524
> denotes alternation (either of two alternatives).
532
> denotes repetition of the previous item zero
541
> denotes repetition of the previous item one
550
> denotes repetition of the previous item zero
568
of the previous item exactly <TT
590
of the previous item <TT
620
denotes repetition of the previous item at least <TT
639
> can be used to group items into
640
a single logical item.
645
> A bracket expression <TT
648
> specifies a character
649
class, just as in POSIX regular expressions.
655
Notice that the period (<TT
658
>) is not a metacharacter
668
>, a backslash disables the special meaning
669
of any of these metacharacters; or a different escape character can
670
be specified with <TT
678
CLASS="PROGRAMLISTING"
679
>'abc' SIMILAR TO 'abc' <I
680
CLASS="LINEANNOTATION"
683
'abc' SIMILAR TO 'a' <I
684
CLASS="LINEANNOTATION"
687
'abc' SIMILAR TO '%(b|d)%' <I
688
CLASS="LINEANNOTATION"
691
'abc' SIMILAR TO '(b|c)%' <I
692
CLASS="LINEANNOTATION"
701
> function with three parameters,
723
extraction of a substring that matches an SQL
724
regular expression pattern. As with <TT
728
specified pattern must match the entire data string, or else the
729
function fails and returns null. To indicate the part of the
730
pattern that should be returned on success, the pattern must contain
731
two occurrences of the escape character followed by a double quote
736
The text matching the portion of the pattern
737
between these markers is returned.
740
> Some examples, with <TT
743
> delimiting the return string:
745
CLASS="PROGRAMLISTING"
746
>substring('foobar' from '%#"o_b#"%' for '#') <I
747
CLASS="LINEANNOTATION"
750
substring('foobar' from '#"o_b#"%' for '#') <I
751
CLASS="LINEANNOTATION"
762
NAME="FUNCTIONS-POSIX-REGEXP"
766
> Regular Expressions</A
770
HREF="functions-matching.html#FUNCTIONS-POSIX-TABLE"
772
> lists the available
773
operators for pattern matching using POSIX regular expressions.
778
NAME="FUNCTIONS-POSIX-TABLE"
782
>Table 9-11. Regular Expression Match Operators</B
787
><COL><COL><COL><THEAD
805
>Matches regular expression, case sensitive</TD
809
>'thomas' ~ '.*thomas.*'</TT
819
>Matches regular expression, case insensitive</TD
823
>'thomas' ~* '.*Thomas.*'</TT
833
>Does not match regular expression, case sensitive</TD
837
>'thomas' !~ '.*Thomas.*'</TT
847
>Does not match regular expression, case insensitive</TD
851
>'thomas' !~* '.*vadim.*'</TT
861
> regular expressions provide a more
862
powerful means for pattern matching than the <CODE
870
Many Unix tools such as <TT
881
matching language that is similar to the one described here.
884
> A regular expression is a character sequence that is an
885
abbreviated definition of a set of strings (a <I
889
>). A string is said to match a regular expression
890
if it is a member of the regular set described by the regular
891
expression. As with <CODE
894
>, pattern characters
895
match string characters exactly unless they are special characters
896
in the regular expression language — but regular expressions use
897
different special characters than <CODE
905
regular expression is allowed to match anywhere within a string, unless
906
the regular expression is explicitly anchored to the beginning or
912
CLASS="PROGRAMLISTING"
914
CLASS="LINEANNOTATION"
918
CLASS="LINEANNOTATION"
922
CLASS="LINEANNOTATION"
926
CLASS="LINEANNOTATION"
935
> pattern language is described in much
936
greater detail below.
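<P>A few additional queries, offered as a sketch, show how anchoring and the case-insensitive and negated operators from the table above interact. The strings are arbitrary examples:</P>
<PRE CLASS="PROGRAMLISTING">SELECT 'abc' ~ 'b';        -- true: an unanchored pattern can match anywhere
SELECT 'abc' ~ '^b';       -- false: ^ anchors the match to the start of the string
SELECT 'abc' ~ '^a.c$';    -- true: anchored at both ends, . matches any single character
SELECT 'ABC' ~* 'abc';     -- true: case-insensitive match
SELECT 'ABC' !~ 'abc';     -- true: no case-sensitive match exists</PRE>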
942
> function with two parameters,
957
>, provides extraction of a
959
that matches a POSIX regular expression pattern. It returns null if
960
there is no match, otherwise the portion of the text that matched the
961
pattern. But if the pattern contains any parentheses, the portion
962
of the text that matched the first parenthesized subexpression (the
963
one whose left parenthesis comes first) is
964
returned. You can put parentheses around the whole expression
965
if you want to use parentheses within it without triggering this
966
exception. If you need parentheses in the pattern before the
967
subexpression you want to extract, see the non-capturing parentheses
973
CLASS="PROGRAMLISTING"
974
>substring('foobar' from 'o.b') <I
975
CLASS="LINEANNOTATION"
978
substring('foobar' from 'o(.)b') <I
979
CLASS="LINEANNOTATION"
987
>regexp_replace</CODE
988
> function provides substitution of
989
new text for substrings that match POSIX regular expression patterns.
993
>regexp_replace</CODE
1025
> string is returned unchanged if
1026
there is no match to the <TT
1037
> string is returned with the
1043
> string substituted for the matching
1049
> string can contain
1064
through 9, to indicate that the source substring matching the
1070
>'th parenthesized subexpression of the pattern should be
1071
inserted, and it can contain <TT
1074
> to indicate that the
1075
substring matching the entire pattern should be inserted. Write
1079
> if you need to put a literal backslash in the replacement
1080
text. (As always, remember to double backslashes written in literal
1081
constant strings, assuming escape string syntax is used.)
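<P>A minimal sketch of the whole-match reference, assuming escape string syntax (<TT CLASS="LITERAL">E'...'</TT>) so that the backslash doubling is unambiguous. The input string is the one used in the examples of this section:</P>
<PRE CLASS="PROGRAMLISTING">-- \&amp; inserts the substring that matched the entire pattern
SELECT regexp_replace('foobarbaz', 'b..', E'[\\&amp;]');        -- foo[bar]baz
SELECT regexp_replace('foobarbaz', 'b..', E'[\\&amp;]', 'g');   -- foo[bar][baz]</PRE>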
1087
> parameter is an optional text
1088
string containing zero or more single-letter flags that change the
1089
function's behavior. Flag <TT
1092
> specifies case-insensitive
1093
matching, while flag <TT
1096
> specifies replacement of each matching
1097
substring rather than only the first one. Other supported flags are
1099
HREF="functions-matching.html#POSIX-EMBEDDED-OPTIONS-TABLE"
1106
CLASS="PROGRAMLISTING"
1107
>regexp_replace('foobarbaz', 'b..', 'X')
1109
CLASS="LINEANNOTATION"
1112
regexp_replace('foobarbaz', 'b..', 'X', 'g')
1114
CLASS="LINEANNOTATION"
1117
regexp_replace('foobarbaz', 'b(..)', E'X\\1Y', 'g')
1119
CLASS="LINEANNOTATION"
1127
>regexp_matches</CODE
1128
> function returns a text array of
1129
all of the captured substrings resulting from matching a POSIX
1130
regular expression pattern. It has the syntax
1133
>regexp_matches</CODE
1154
The function can return no rows, one row, or multiple rows (see
1158
> flag below). If the <TT
1164
does not match, the function returns no rows. If the pattern
1165
contains no parenthesized subexpressions, then each row
1166
returned is a single-element text array containing the substring
1167
matching the whole pattern. If the pattern contains parenthesized
1168
subexpressions, the function returns a text array whose
1174
>'th element is the substring matching the
1180
>'th parenthesized subexpression of the pattern
1183
>"non-capturing"</SPAN
1184
> parentheses; see below for
1191
> parameter is an optional text
1192
string containing zero or more single-letter flags that change the
1193
function's behavior. Flag <TT
1196
> causes the function to find
1197
each match in the string, not only the first one, and return a row for
1198
each such match. Other supported
1199
flags are described in <A
1200
HREF="functions-matching.html#POSIX-EMBEDDED-OPTIONS-TABLE"
1207
CLASS="PROGRAMLISTING"
1208
>SELECT regexp_matches('foobarbequebaz', '(bar)(beque)');
1214
SELECT regexp_matches('foobarbequebazilbarfbonk', '(b[^b]+)(b[^b]+)', 'g');
1221
SELECT regexp_matches('foobarbequebaz', 'barbeque');
1229
> It is possible to force <CODE
1231
>regexp_matches()</CODE
1233
return one row by using a sub-select; this is particularly useful
1237
> target list when you want all rows
1238
returned, even non-matching ones:
1240
CLASS="PROGRAMLISTING"
1241
>SELECT col1, (SELECT regexp_matches(col2, '(bar)(beque)')) FROM tab;</PRE
1247
>regexp_split_to_table</CODE
1248
> function splits a string using a POSIX
1249
regular expression pattern as a delimiter. It has the syntax
1252
>regexp_split_to_table</CODE
1273
If there is no match to the <TT
1278
>, the function returns the
1284
>. If there is at least one match, for each match it returns
1285
the text from the end of the last match (or the beginning of the string)
1286
to the beginning of the match. When there are no more matches, it
1287
returns the text from the end of the last match to the end of the string.
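<P>The splitting rule above can be sketched with a small delimiter pattern. The input strings are invented for illustration:</P>
<PRE CLASS="PROGRAMLISTING">-- Two delimiter matches produce three rows: a, b, c
SELECT regexp_split_to_table('a,b;c', '[,;]');

-- No delimiter match: a single row containing the whole string, abc
SELECT regexp_split_to_table('abc', ',');</PRE>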
1293
> parameter is an optional text string containing
1294
zero or more single-letter flags that change the function's behavior.
1297
>regexp_split_to_table</CODE
1298
> supports the flags described in
1300
HREF="functions-matching.html#POSIX-EMBEDDED-OPTIONS-TABLE"
1307
>regexp_split_to_array</CODE
1308
> function behaves the same as
1311
>regexp_split_to_table</CODE
1312
>, except that <CODE
1314
>regexp_split_to_array</CODE
1316
returns its result as an array of <TT
1319
>. It has the syntax
1322
>regexp_split_to_array</CODE
1343
The parameters are the same as for <CODE
1345
>regexp_split_to_table</CODE
1351
CLASS="PROGRAMLISTING"
1352
> SELECT foo FROM regexp_split_to_table('the quick brown fox jumped over the lazy dog', E'\\s+') AS foo;
1366
SELECT regexp_split_to_array('the quick brown fox jumped over the lazy dog', E'\\s+');
1367
regexp_split_to_array
1368
------------------------------------------------
1369
{the,quick,brown,fox,jumped,over,the,lazy,dog}
1372
SELECT foo FROM regexp_split_to_table('the quick brown fox', E'\\s*') AS foo;
1395
> As the last example demonstrates, the regexp split functions ignore
1396
zero-length matches that occur at the start or end of the string
1397
or immediately after a previous match. This is contrary to the strict
1398
definition of regexp matching that is implemented by
1401
>regexp_matches</CODE
1402
>, but is usually the most convenient behavior
1403
in practice. Other software systems such as Perl use similar definitions.
1410
NAME="POSIX-SYNTAX-DETAILS"
1411
>9.7.3.1. Regular Expression Details</A
1417
>'s regular expressions are implemented
1418
using a software package written by Henry Spencer. Much of
1419
the description of regular expressions below is copied verbatim from his
1423
> Regular expressions (<ACRONYM
1430
> 1003.2, come in two forms:
1441
(roughly those of <TT
1455
(roughly those of <TT
1462
> supports both forms, and
1463
also implements some extensions
1464
that are not in the POSIX standard, but have become widely used
1465
due to their availability in programming languages such as Perl and Tcl.
1469
>s using these non-POSIX extensions are called
1480
in this documentation. AREs are almost an exact superset of EREs,
1481
but BREs have several notational incompatibilities (as well as being
1483
We first describe the ARE and ERE forms, noting features that apply
1484
only to AREs, and then describe how BREs differ.
1496
> always initially presumes that a regular
1497
expression follows the ARE rules. However, the more limited ERE or
1498
BRE rules can be chosen by prepending an <I
1502
to the RE pattern, as described in <A
1503
HREF="functions-matching.html#POSIX-METASYNTAX"
1506
This can be useful for compatibility with applications that expect
1507
exactly the <ACRONYM
1515
> A regular expression is defined as one or more
1523
>. It matches anything that matches one of the
1527
> A branch is zero or more <I
1529
>quantified atoms</I
1535
It matches a match for the first, followed by a match for the second, and so on;
1536
an empty branch matches the empty string.
1539
> A quantified atom is an <I
1547
Without a quantifier, it matches a match for the atom.
1548
With a quantifier, it can match some number of matches of the atom.
1552
> can be any of the possibilities
1554
HREF="functions-matching.html#POSIX-ATOMS-TABLE"
1557
The possible quantifiers and their meanings are shown in
1559
HREF="functions-matching.html#POSIX-QUANTIFIERS-TABLE"
1567
> matches an empty string, but matches only when
1568
specific conditions are met. A constraint can be used where an atom
1569
could be used, except it cannot be followed by a quantifier.
1570
The simple constraints are shown in
1572
HREF="functions-matching.html#POSIX-CONSTRAINTS-TABLE"
1575
some more constraints are described later.
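<P>As a hedged illustration of these terms, the sketch below combines two branches, a quantified atom, and the simple constraints <TT CLASS="LITERAL">^</TT> and <TT CLASS="LITERAL">$</TT>. The strings are arbitrary examples:</P>
<PRE CLASS="PROGRAMLISTING">SELECT 'dog' ~ '^(cat|dog)$';   -- true: the second branch matches
SELECT 'aaa' ~ '^a{2,3}$';      -- true: the atom a repeated between 2 and 3 times
SELECT 'a'   ~ '^a{2,3}$';      -- false: too few repetitions of the atom</PRE>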
1580
NAME="POSIX-ATOMS-TABLE"
1584
>Table 9-12. Regular Expression Atoms</B
1618
> is any regular expression)
1625
>, with the match noted for possible reporting </TD
1642
> as above, but the match is not noted for reporting
1645
>"non-capturing"</SPAN
1646
> set of parentheses)
1656
> matches any single character </TD
1675
>bracket expression</I
1677
matching any one of the <TT
1684
HREF="functions-matching.html#POSIX-BRACKET-EXPRESSIONS"
1686
> for more detail) </TD
1705
> is a non-alphanumeric character)
1706
matches that character taken as an ordinary character,
1710
> matches a backslash character </TD
1730
(possibly followed by other characters)
1735
HREF="functions-matching.html#POSIX-ESCAPE-SEQUENCES"
1738
(AREs only; in EREs and BREs, this matches <TT
1752
> when followed by a character other than a digit,
1753
matches the left-brace character <TT
1757
when followed by a digit, it is the beginning of a
1779
> is a single character with no other
1780
significance, matches that character </TD
1786
> An RE cannot end with <TT
1798
> Remember that the backslash (<TT
1801
>) already has a special
1806
To write a pattern constant that contains a backslash,
1807
you must write two backslashes in the statement, assuming escape
1808
string syntax is used (see <A
1809
HREF="sql-syntax-lexical.html#SQL-SYNTAX-STRINGS"
1818
NAME="POSIX-QUANTIFIERS-TABLE"
1822
>Table 9-13. Regular Expression Quantifiers</B
1843
> a sequence of 0 or more matches of the atom </TD
1852
> a sequence of 1 or more matches of the atom </TD
1861
> a sequence of 0 or 1 matches of the atom </TD
1878
> a sequence of exactly <TT
1883
> matches of the atom </TD
1905
> or more matches of the atom </TD
1941
(inclusive) matches of the atom; <TT
1961
> non-greedy version of <TT
1973
> non-greedy version of <TT
1985
> non-greedy version of <TT
2005
> non-greedy version of <TT
2033
> non-greedy version of <TT
2069
> non-greedy version of <TT
2094
> The forms using <TT
2120
> within a bound are
2121
unsigned decimal integers with permissible values from 0 to 255 inclusive.
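<P>A short sketch of how normal and non-greedy quantifiers from the table above differ, using the two-parameter <TT CLASS="LITERAL">substring</TT> function to show which text was matched. The input strings are arbitrary:</P>
<PRE CLASS="PROGRAMLISTING">SELECT substring('aaaa' from 'a{2,3}');     -- aaa: prefers the largest number of matches
SELECT substring('aaaa' from 'a{2,3}?');    -- aa: prefers the smallest number of matches
SELECT substring('xaaab' from 'a+');        -- aaa
SELECT substring('xaaab' from 'a+?');       -- a</PRE>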
2127
> quantifiers (available in AREs only) match the
2128
same possibilities as their corresponding normal (<I
2132
counterparts, but prefer the smallest number rather than the largest
2135
HREF="functions-matching.html#POSIX-MATCHING-RULES"
2146
> A quantifier cannot immediately follow another quantifier, e.g.,
2152
begin an expression or subexpression or follow
2166
NAME="POSIX-CONSTRAINTS-TABLE"
2170
>Table 9-14. Regular Expression Constraints</B
2191
> matches at the beginning of the string </TD
2200
> matches at the end of the string </TD
2219
>positive lookahead</I
2220
> matches at any point
2221
where a substring matching <TT
2246
>negative lookahead</I
2247
> matches at any point
2248
where no substring matching <TT
2260
> Lookahead constraints cannot contain <I
2265
HREF="functions-matching.html#POSIX-ESCAPE-SEQUENCES"
2268
and all parentheses within them are considered non-capturing.
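<P>The lookahead constraints can be sketched as follows. Because a constraint matches an empty string, the text it inspects is not part of the match, which the last query makes visible. The strings are invented for illustration:</P>
<PRE CLASS="PROGRAMLISTING">SELECT 'foobar' ~ 'foo(?=bar)';    -- true: foo is followed by bar
SELECT 'foobaz' ~ 'foo(?=bar)';    -- false
SELECT 'foobaz' ~ 'foo(?!bar)';    -- true: foo is not followed by bar
SELECT 'foobar' ~ 'foo(?!bar)';    -- false

-- Only foo is replaced; the lookahead text bar is left in place
SELECT regexp_replace('foobar', 'foo(?=bar)', 'X');   -- Xbar</PRE>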
2276
NAME="POSIX-BRACKET-EXPRESSIONS"
2277
>9.7.3.2. Bracket Expressions</A
2282
>bracket expression</I
2284
characters enclosed in <TT
2287
>. It normally matches
2288
any single character from the list (but see below). If the list
2292
>, it matches any single character
2299
> from the rest of the list.
2301
in the list are separated by <TT
2305
shorthand for the full range of characters between those two
2306
(inclusive) in the collating sequence,
2314
any decimal digit. It is illegal for two ranges to share an
2319
collating-sequence-dependent, so portable programs should avoid
2323
> To include a literal <TT
2326
> in the list, make it the
2327
first character (after <TT
2330
>, if that is used). To
2331
include a literal <TT
2334
>, make it the first or last
2335
character, or the second endpoint of a range. To use a literal
2339
> as the first endpoint of a range, enclose it
2347
collating element (see below). With the exception of these characters,
2348
some combinations using <TT
2352
(see next paragraphs), and escapes (AREs only), all other special
2353
characters lose their special significance within a bracket expression.
2357
> is not special when following
2358
ERE or BRE rules, though it is special (as introducing an escape)
2362
> Within a bracket expression, a collating element (a character, a
2363
multiple-character sequence that collates as if it were a single
2364
character, or a collating-sequence name for either) enclosed in
2372
sequence of characters of that collating element. The sequence is
2373
treated as a single element of the bracket expression's list. This
2375
expression containing a multiple-character collating element to
2376
match more than one character, e.g., if the collating sequence
2380
> collating element, then the RE
2384
> matches the first five characters of
2400
> currently does not support multi-character collating
2401
elements. This information describes possible future behavior.
2406
> Within a bracket expression, a collating element enclosed in
2417
>, standing for the sequences of characters of all collating
2418
elements equivalent to that one, including itself. (If there are
2419
no other equivalent collating elements, the treatment is as if the
2420
enclosing delimiters were <TT
2427
>.) For example, if <TT
2434
> are the members of an equivalence class, then
2445
> are all synonymous. An equivalence class
2446
cannot be an endpoint of a range.
2449
> Within a bracket expression, the name of a character class
2457
for the list of all characters belonging to that class. Standard
2458
character class names are: <TT
2500
>. These stand for the character classes
2503
CLASS="CITEREFENTRY"
2505
CLASS="REFENTRYTITLE"
2509
A locale can provide others. A character class cannot be used as
2510
an endpoint of a range.
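<P>A few bracket expressions, as a sketch of the rules above. The strings are arbitrary, and the character-class results assume plain ASCII letters and digits:</P>
<PRE CLASS="PROGRAMLISTING">SELECT 'a1' ~ '^[[:alpha:]][[:digit:]]$';      -- true: character class names inside brackets
SELECT 'b'  ~ '^[^a-c]$';                      -- false: b lies in the excluded range
SELECT ']'  ~ '^[]a]$';                        -- true: ] is literal when it comes first in the list
SELECT substring('key=value' from '[^=]+');    -- key: everything up to the first = sign</PRE>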
2513
> There are two special cases of bracket expressions: the bracket
2522
matching empty strings at the beginning
2523
and end of a word respectively. A word is defined as a sequence
2524
of word characters that is neither preceded nor followed by word
2525
characters. A word character is an <TT
2531
CLASS="CITEREFENTRY"
2533
CLASS="REFENTRYTITLE"
2537
or an underscore. This is an extension, compatible with but not
2538
specified by <ACRONYM
2541
> 1003.2, and should be used with
2542
caution in software intended to be portable to other systems.
2543
The constraint escapes described below are usually preferable; they
2544
are no more standard, but are easier to type.
2552
NAME="POSIX-ESCAPE-SEQUENCES"
2553
>9.7.3.3. Regular Expression Escapes</A
2559
> are special sequences beginning with <TT
2563
followed by an alphanumeric character. Escapes come in several varieties:
2564
character entry, class shorthands, constraint escapes, and back references.
2568
> followed by an alphanumeric character but not constituting
2569
a valid escape is illegal in AREs.
2570
In EREs, there are no escapes: outside a bracket expression,
2574
> followed by an alphanumeric character merely stands for
2575
that character as an ordinary character, and inside a bracket expression,
2579
> is an ordinary character.
2580
(The latter is the one actual incompatibility between EREs and AREs.)
2585
>Character-entry escapes</I
2586
> exist to make it easier to specify
2587
non-printing and other inconvenient characters in REs. They are
2589
HREF="functions-matching.html#POSIX-CHARACTER-ENTRY-ESCAPES-TABLE"
2596
>Class-shorthand escapes</I
2597
> provide shorthands for certain
2598
commonly-used character classes. They are
2600
HREF="functions-matching.html#POSIX-CLASS-SHORTHAND-ESCAPES-TABLE"
2607
>constraint escape</I
2609
matching the empty string if specific conditions are met,
2610
written as an escape. They are
2612
HREF="functions-matching.html#POSIX-CONSTRAINT-ESCAPES-TABLE"
2629
same string matched by the previous parenthesized subexpression specified
2637
HREF="functions-matching.html#POSIX-CONSTRAINT-BACKREF-TABLE"
2657
The subexpression must entirely precede the back reference in the RE.
2658
Subexpressions are numbered in the order of their leading parentheses.
2659
Non-capturing parentheses do not define subexpressions.
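<P>A minimal sketch of back references, assuming escape string syntax (<TT CLASS="LITERAL">E'...'</TT>) so that each backslash is written twice. The last query illustrates that non-capturing parentheses are skipped when subexpressions are numbered:</P>
<PRE CLASS="PROGRAMLISTING">SELECT 'abab' ~ E'^(ab)\\1$';           -- true: \1 must repeat the text matched by (ab)
SELECT 'abcd' ~ E'^(ab)\\1$';           -- false: cd does not repeat ab
SELECT 'aaa'  ~ E'^(?:x|a)(a)\\1$';     -- true: \1 refers to (a), not to the (?:...) group</PRE>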
2668
> Keep in mind that an escape's leading <TT
2672
doubled when entering the pattern as an SQL string constant. For example:
2674
CLASS="PROGRAMLISTING"
2675
>'123' ~ E'^\\d{3}' <I
2676
CLASS="LINEANNOTATION"
2686
NAME="POSIX-CHARACTER-ENTRY-ESCAPES-TABLE"
2690
>Table 9-15. Regular Expression Character-entry Escapes</B
2711
> alert (bell) character, as in C </TD
2720
> backspace, as in C </TD
2729
> synonym for backslash (<TT
2732
>) to help reduce the need for backslash
2752
> is any character) the character whose
2753
low-order 5 bits are the same as those of
2759
>, and whose other bits are all zero </TD
2768
> the character whose collating-sequence name
2773
or failing that, the character with octal value 033 </TD
2782
> form feed, as in C </TD
2791
> newline, as in C </TD
2800
> carriage return, as in C </TD
2809
> horizontal tab, as in C </TD
2828
> is exactly four hexadecimal digits)
2829
the UTF16 (Unicode, 16-bit) character <TT
2838
in the local byte ordering </TD
2857
> is exactly eight hexadecimal
2859
reserved for a hypothetical Unicode extension to 32 bits
2869
> vertical tab, as in C </TD
2888
> is any sequence of hexadecimal
2890
the character whose hexadecimal value is
2900
(a single character no matter how many hexadecimal digits are used)
2910
> the character whose value is <TT
2913
> (the null byte)</TD
2932
> is exactly two octal digits,
2937
the character whose octal value is
2965
> is exactly three octal digits,
2970
the character whose octal value is
2985
> Hexadecimal digits are <TT
3005
Octal digits are <TT
3014
> The character-entry escapes are always taken as ordinary characters.
3025
> does not terminate a bracket expression.
3030
NAME="POSIX-CLASS-SHORTHAND-ESCAPES-TABLE"
3034
>Table 9-16. Regular Expression Class-shorthand Escapes</B
3083
(note underscore is included) </TD
3120
(note underscore is included) </TD
3126
> Within bracket expressions, <TT
3136
> lose their outer brackets,
3147
(So, for example, <TT
3158
>, which is equivalent to
3161
>[a-c^[:digit:]]</TT
3167
NAME="POSIX-CONSTRAINT-ESCAPES-TABLE"
3171
>Table 9-17. Regular Expression Constraint Escapes</B
3192
> matches only at the beginning of the string
3194
HREF="functions-matching.html#POSIX-MATCHING-RULES"
3196
> for how this differs from
3209
> matches only at the beginning of a word </TD
3218
> matches only at the end of a word </TD
3227
> matches only at the beginning or end of a word </TD
3236
> matches only at a point that is not the beginning or end of a
3246
> matches only at the end of the string
3248
HREF="functions-matching.html#POSIX-MATCHING-RULES"
3250
> for how this differs from
3260
> A word is defined as in the specification of
3268
Constraint escapes are illegal within bracket expressions.
3273
NAME="POSIX-CONSTRAINT-BACKREF-TABLE"
3277
>Table 9-18. Regular Expression Back References</B
3308
> is a nonzero digit)
3309
a back reference to the <TT
3314
>'th subexpression </TD
3333
> is a nonzero digit, and
3339
> is some more digits, and the decimal value
3345
> is not greater than the number of closing capturing
3346
parentheses seen so far)
3347
a back reference to the <TT
3352
>'th subexpression </TD
3364
> There is an inherent ambiguity between octal character-entry
3365
escapes and back references, which is resolved by the following heuristics,
3367
A leading zero always indicates an octal escape.
3368
A single non-zero digit, not followed by another digit,
3369
is always taken as a back reference.
3370
A multi-digit sequence not starting with a zero is taken as a back
3371
reference if it comes after a suitable subexpression
3372
(i.e., the number is in the legal range for a back reference),
3373
and otherwise is taken as octal.
3383
NAME="POSIX-METASYNTAX"
3384
>9.7.3.4. Regular Expression Metasyntax</A
3387
> In addition to the main syntax described above, there are some special
3388
forms and miscellaneous syntactic facilities available.
3391
> An RE can begin with one of two special <I
3395
If an RE begins with <TT
3399
the rest of the RE is taken as an ARE. (This normally has no effect in
3403
>, since REs are assumed to be AREs;
3404
but it does have an effect if ERE or BRE mode had been specified by
3410
> parameter to a regex function.)
3411
If an RE begins with <TT
3415
the rest of the RE is taken to be a literal string,
3416
with all characters considered ordinary characters.
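<P>The effect of the two directors can be sketched as follows. The strings are arbitrary examples:</P>
<PRE CLASS="PROGRAMLISTING">SELECT 'abc' ~ 'a.c';        -- true: . is a metacharacter
SELECT 'abc' ~ '***=a.c';    -- false: the rest of the RE is the literal string a.c
SELECT 'a.c' ~ '***=a.c';    -- true
SELECT 'abc' ~ '***:a.c';    -- true: the rest of the RE is taken as an ARE</PRE>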
3419
> An ARE can begin with <I
3421
>embedded options</I
3440
> is one or more alphabetic characters)
3441
specifies options affecting the rest of the RE.
3442
These options override any previously determined options —
3443
in particular, they can override the case-sensitivity behavior implied by
3444
a regex operator, or the <TT
3449
> parameter to a regex
3451
The available option letters are
3453
HREF="functions-matching.html#POSIX-EMBEDDED-OPTIONS-TABLE"
3456
Note that these same option letters are used in the <TT
3462
parameters of regex functions.
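<P>As a hedged illustration of embedded options overriding the operator's case-sensitivity, using option letters from the table of embedded-option letters in this section:</P>
<PRE CLASS="PROGRAMLISTING">SELECT 'ABC' ~  '(?i)abc';    -- true: i requests case-insensitive matching despite the ~ operator
SELECT 'abc' ~* '(?c)ABC';    -- false: c requests case-sensitive matching despite the ~* operator</PRE>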
3467
NAME="POSIX-EMBEDDED-OPTIONS-TABLE"
3471
>Table 9-19. ARE Embedded-option Letters</B
3492
> rest of RE is a BRE </TD
3501
> case-sensitive matching (overrides operator type) </TD
3510
> rest of RE is an ERE </TD
3519
> case-insensitive matching (see
3521
HREF="functions-matching.html#POSIX-MATCHING-RULES"
3523
>) (overrides operator type) </TD
3532
> historical synonym for <TT
3544
> newline-sensitive matching (see
3546
HREF="functions-matching.html#POSIX-MATCHING-RULES"
3557
> partial newline-sensitive matching (see
3559
HREF="functions-matching.html#POSIX-MATCHING-RULES"
3570
> rest of RE is a literal (<SPAN
3573
>) string, all ordinary
3583
> non-newline-sensitive matching (default) </TD
3592
> tight syntax (default; see below) </TD
3601
> inverse partial newline-sensitive (<SPAN
3606
HREF="functions-matching.html#POSIX-MATCHING-RULES"
3617
> expanded syntax (see below) </TD
3623
> Embedded options take effect at the <TT
3626
> terminating the sequence.
3627
They can appear only at the start of an ARE (after the
3634
> In addition to the usual (<I
3637
>) RE syntax, in which all
3638
characters are significant, there is an <I
3642
available by specifying the embedded <TT
3646
In the expanded syntax,
3647
white-space characters in the RE are ignored, as are
3648
all characters between a <TT
3652
and the following newline (or the end of the RE). This
3653
permits paragraphing and commenting a complex RE.
3654
There are three exceptions to that basic rule:
3661
> a white-space character or <TT
3673
> white space or <TT
3676
> within a bracket expression is retained
3681
> white space and comments cannot appear within multi-character symbols,
3691
For this purpose, white-space characters are blank, tab, newline, and
3692
any character that belongs to the <TT
3700
> Finally, in an ARE, outside bracket expressions, the sequence
3718
> is any text not containing a <TT
3722
is a comment, completely ignored.
3723
Again, this is not allowed between the characters of
3724
multi-character symbols, like <TT
3728
Such comments are more a historical artifact than a useful facility,
3729
and their use is deprecated; use the expanded syntax instead.
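<P>A sketch of the expanded syntax and of the older in-line comment form. The pattern and its comments are invented for illustration; in the first query the white space, the line break, and the text after each <TT CLASS="LITERAL">#</TT> are ignored by the regular expression engine:</P>
<PRE CLASS="PROGRAMLISTING">-- Equivalent to the compact pattern ^[a-z]+[0-9]+$
SELECT 'abc123' ~ '(?x) ^ [a-z]+   # leading letters
                        [0-9]+ $   # trailing digits';    -- true

-- The deprecated in-line comment form
SELECT 'abc' ~ 'a(?#a comment)bc';                        -- true</PRE>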
3738
> of these metasyntax extensions is available if
3743
has specified that the user's input be treated as a literal string
3744
rather than as an RE.
3752
NAME="POSIX-MATCHING-RULES"
3753
>9.7.3.5. Regular Expression Matching Rules</A
3756
> In the event that an RE could match more than one substring of a given
3757
string, the RE matches the one starting earliest in the string.
3758
If the RE could match more than one substring starting at that point,
3759
either the longest possible match or the shortest possible match will
3760
be taken, depending on whether the RE is <I
3770
> Whether an RE is greedy or not is determined by the following rules:
3776
> Most atoms, and all constraints, have no greediness attribute (because
3777
they cannot match variable amounts of text anyway).
3782
> Adding parentheses around an RE does not change its greediness.
3787
> A quantified atom with a fixed-repetition quantifier
3813
has the same greediness (possibly none) as the atom itself.
3818
> A quantified atom with other normal quantifiers (including
3850
is greedy (prefers longest match).
3855
> A quantified atom with a non-greedy quantifier (including
3887
is non-greedy (prefers shortest match).
3892
> A branch — that is, an RE that has no top-level
3896
> operator — has the same greediness as the first
3897
quantified atom in it that has a greediness attribute.
3902
> An RE consisting of two or more branches connected by the
3906
> operator is always greedy.
3913
> The above rules associate greediness attributes not only with individual
3914
quantified atoms, but with branches and entire REs that contain quantified
3915
atoms. What that means is that the matching is done in such a way that
3916
the branch, or whole RE, matches the longest or shortest possible
3923
>. Once the length of the entire match
3924
is determined, the part of it that matches any particular subexpression
3925
is determined on the basis of the greediness attribute of that
3926
subexpression, with subexpressions starting earlier in the RE taking
3927
priority over ones starting later.
3930
> An example of what this means:
3933
>SELECT SUBSTRING('XY1234Z', 'Y*([0-9]{1,3})');
3935
CLASS="LINEANNOTATION"
3938
CLASS="COMPUTEROUTPUT"
3941
SELECT SUBSTRING('XY1234Z', 'Y*?([0-9]{1,3})');
3943
CLASS="LINEANNOTATION"
3946
CLASS="COMPUTEROUTPUT"
3950
In the first case, the RE as a whole is greedy because <TT
3954
is greedy. It can match beginning at the <TT
3958
the longest possible string starting there, i.e., <TT
3962
The output is the parenthesized part of that, or <TT
3966
In the second case, the RE as a whole is non-greedy because <TT
3970
is non-greedy. It can match beginning at the <TT
3974
the shortest possible string starting there, i.e., <TT
3978
The subexpression <TT
3981
> is greedy but it cannot change
3982
the decision as to the overall match length; so it is forced to match
3989
> In short, when an RE contains both greedy and non-greedy subexpressions,
3990
the total match length is either as long as possible or as short as
3991
possible, according to the attribute assigned to the whole RE. The
3992
attributes assigned to the subexpressions only affect how much of that
3993
match they are allowed to <SPAN
3996
> relative to each other.
3999
> The quantifiers <TT
4006
can be used to force greediness or non-greediness, respectively,
4007
on a subexpression or a whole RE.
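<P>A hedged sketch of forcing greediness on a whole RE, assuming the forcing quantifier is written <TT CLASS="LITERAL">{1,1}</TT>, that <TT CLASS="LITERAL">standard_conforming_strings</TT> is on so the backslashes reach the regular expression engine unchanged, and an arbitrary sample string:</P>
<PRE CLASS="PROGRAMLISTING">-- By the rules above this RE is non-greedy as a whole, because its
-- first quantified atom with a greediness attribute, (.*?), is non-greedy.
SELECT regexp_matches('abc01234xyz', '(.*?)(\d+)(.*)');

-- Wrapping the RE in a non-capturing group quantified by {1,1} makes the
-- RE as a whole greedy; the inner subexpressions keep their own attributes
-- and only divide up the overall match among themselves.
SELECT regexp_matches('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');</PRE>
<P>Comparing the two results shows how the attribute of the whole RE sets the total match length, while the subexpression attributes decide how that match is shared out.</P>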
4010
> Match lengths are measured in characters, not collating elements.
4011
An empty string is considered longer than no match at all.
4017
matches the three middle characters of <TT
4023
>(week|wee)(night|knights)</TT
4025
matches all ten characters of <TT
4033
is matched against <TT
4036
> the parenthesized subexpression
4037
matches all three characters; and when
4041
> is matched against <TT
4045
both the whole RE and the parenthesized
4046
subexpression match an empty string.
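<P>The longest-match rule for <TT CLASS="LITERAL">(week|wee)(night|knights)</TT> can be checked with a query along these lines (a sketch using <CODE CLASS="FUNCTION">regexp_matches</CODE> to show what each parenthesized subexpression captured):</P>
<PRE CLASS="PROGRAMLISTING">-- The RE has top-level alternation inside its groups and is greedy, so the
-- overall match must cover all ten characters of 'weeknights'. That is only
-- possible when the first group takes 'wee' and the second takes 'knights'.
SELECT regexp_matches('weeknights', '(week|wee)(night|knights)');
-- expected: {wee,knights}</PRE>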
4049
> If case-independent matching is specified,
4050
the effect is much as if all case distinctions had vanished from the
4052
When an alphabetic character that exists in multiple cases appears as an
4053
ordinary character outside a bracket expression, it is effectively
4054
transformed into a bracket expression containing both cases,
4062
When it appears inside a bracket expression, all case counterparts
4063
of it are added to the bracket expression, e.g.,
4080
> If newline-sensitive matching is specified, <TT
4084
and bracket expressions using <TT
4088
will never match the newline character
4089
(so that matches will never cross newlines unless the RE
4090
explicitly arranges it)
4098
will match the empty string after and before a newline
4099
respectively, in addition to matching at beginning and end of string
4101
But the ARE escapes <TT
4108
continue to match beginning or end of string <SPAN
4117
> If partial newline-sensitive matching is specified,
4121
> and bracket expressions
4122
as with newline-sensitive matching, but not <TT
4132
> If inverse partial newline-sensitive matching is specified,
4140
as with newline-sensitive matching, but not <TT
4144
and bracket expressions.
4145
This isn't very useful but is provided for symmetry.
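<P>A short sketch of the difference between the default and newline-sensitive matching, assuming the embedded <TT CLASS="LITERAL">n</TT> option selects newline-sensitive matching and using an escape string constant so that <TT CLASS="LITERAL">\n</TT> is a newline:</P>
<PRE CLASS="PROGRAMLISTING">-- Default (non-newline-sensitive) matching.
SELECT E'foo\nbar' ~ '^bar';         -- false: ^ matches only at the start of the string
SELECT E'foo\nbar' ~ 'foo.bar';      -- true: . matches the newline

-- Newline-sensitive matching.
SELECT E'foo\nbar' ~ '(?n)^bar';     -- true: ^ also matches just after a newline
SELECT E'foo\nbar' ~ '(?n)foo.bar';  -- false: . no longer matches the newline</PRE>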
4153
NAME="POSIX-LIMITS-COMPATIBILITY"
4154
>9.7.3.6. Limits and Compatibility</A
4157
> No particular limit is imposed on the length of REs in this
4158
implementation. However,
4159
programs intended to be highly portable should not employ REs longer
4161
as a POSIX-compliant implementation can refuse to accept such REs.
4164
> The only feature of AREs that is actually incompatible with
4165
POSIX EREs is that <TT
4168
> does not lose its special
4169
significance inside bracket expressions.
4170
All other ARE features use syntax that is illegal or has
4171
undefined or unspecified effects in POSIX EREs;
4175
> syntax of directors likewise is outside the POSIX
4176
syntax for both BREs and EREs.
4179
> Many of the ARE extensions are borrowed from Perl, but some have
4180
been changed to clean them up, and a few Perl extensions are not present.
4181
Incompatibilities of note include <TT
4188
the lack of special treatment for a trailing newline,
4189
the addition of complemented bracket expressions to the things
4190
affected by newline-sensitive matching,
4191
the restrictions on parentheses and back references in lookahead
4192
constraints, and the longest/shortest-match (rather than first-match)
4196
> Two significant incompatibilities exist between AREs and the ERE syntax
4197
recognized by pre-7.4 releases of <SPAN
4210
> followed by an alphanumeric character is either
4211
an escape or an error, while in previous releases, it was just another
4212
way of writing the alphanumeric.
4213
This should not be much of a problem because there was no reason to
4214
write such a sequence in earlier releases.
4222
> remains a special character within
4230
expression must be written <TT
4245
NAME="POSIX-BASIC-REGEXES"
4246
>9.7.3.7. Basic Regular Expressions</A
4249
> BREs differ from EREs in several respects.
4260
are ordinary characters and there is no equivalent
4261
for their functionality.
4262
The delimiters for bounds are
4277
by themselves ordinary characters.
4278
The parentheses for nested subexpressions are
4292
> by themselves ordinary characters.
4296
> is an ordinary character except at the beginning of the
4297
RE or the beginning of a parenthesized subexpression,
4301
> is an ordinary character except at the end of the
4302
RE or the end of a parenthesized subexpression,
4306
> is an ordinary character if it appears at the beginning
4307
of the RE or the beginning of a parenthesized subexpression
4308
(after a possible leading <TT
4312
Finally, single-digit back references are available, and
4328
respectively; no other escapes are available in BREs.
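<P>The BRE rules above can be tried out with queries such as the following. This is only a sketch: it assumes the embedded <TT CLASS="LITERAL">b</TT> option is used to have the rest of the RE treated as a BRE, and that <TT CLASS="LITERAL">standard_conforming_strings</TT> is on so the backslashes are passed through unchanged.</P>
<PRE CLASS="PROGRAMLISTING">-- + is an ordinary character in a BRE, so this looks for the literal text a+b.
SELECT 'a+b' ~ '(?b)a+b';

-- Bounds are delimited by \{ and \} in a BRE.
SELECT 'aab' ~ '(?b)a\{2\}b';

-- Subexpressions use \( and \), and \1 is a single-digit back reference.
SELECT 'abab' ~ '(?b)\(ab\)\1';</PRE>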
4338
SUMMARY="Footer navigation table"
4349
HREF="functions-bitstring.html"
4367
HREF="functions-formatting.html"
4377
>Bit String Functions and Operators</TD
4383
HREF="functions.html"
4391
>Data Type Formatting Functions</TD