:mod:`re` --- Regular expression operations
===========================================

.. module:: re
   :synopsis: Regular expression operations.
.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>

This module provides regular expression matching operations similar to
those found in Perl. Both patterns and strings to be searched can be
Unicode strings as well as 8-bit strings.


Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
their special meaning. This collides with Python's usage of the same
character for the same purpose in string literals; for example, to match
a literal backslash, one might have to write ``'\\\\'`` as the pattern
string, because the regular expression must be ``\\``, and each
backslash must be expressed as ``\\`` inside a regular Python string
literal.

The solution is to use Python's raw string notation for regular expression
patterns; backslashes are not handled in any special way in a string literal
prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
newline. Usually patterns will be expressed in Python code using this raw
string notation.

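
The difference can be checked directly. The following is a minimal sketch
that assumes only the standard :mod:`re` module; the variable names and the
sample path are illustrative::

   import re

   # A raw literal keeps the backslash; a regular literal turns it into
   # a newline character.
   raw_len = len(r"\n")       # backslash followed by 'n'
   cooked_len = len("\n")     # a single newline

   # Both spellings build the same two-character pattern, which matches
   # one literal backslash in the searched string.
   same_pattern = (r"\\" == "\\\\")
   m = re.search(r"\\", "C:\\temp")
   backslash_pos = m.start()  # index of the backslash in the sample path

The raw form is usually easier to read because the pattern appears in the
source exactly as the regular expression parser will see it.
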

Most regular expression operations are available both as module-level
functions and as :class:`RegexObject` methods. The functions are shortcuts
that don't require you to compile a regex object first, but they lack some
fine-tuning parameters.


Regular Expression Syntax
-------------------------

A regular expression (or RE) specifies a set of strings that matches it; the
functions in this module let you check if a particular string matches a given
regular expression (or if a given regular expression matches a particular
string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*. This holds unless *A* or *B* contain low precedence
operations, boundary conditions between *A* and *B*, or numbered group
references. Thus, complex expressions can easily be constructed from simpler
primitive expressions like the ones described here. For details of the theory
and implementation of regular expressions, consult the Friedl book referenced
below, or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows. For further
information and a gentler presentation, consult the :ref:`regex-howto`.


Regular expressions can contain both special and ordinary characters. Most
ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
expressions; they simply match themselves. You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``. (In the rest of this
section, we'll write RE's in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted. Regular
expression pattern strings may not contain null bytes, but can specify
the null byte using the ``\number`` notation, e.g., ``'\x00'``.

The special characters are:


``'.'``
   (Dot.) In the default mode, this matches any character except a newline. If
   the :const:`DOTALL` flag has been specified, this matches any character
   including a newline.

``'^'``
   (Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
   matches immediately after each newline.

``'$'``
   Matches the end of the string or just before the newline at the end of the
   string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
   matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
   only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
   matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
   a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
   the newline, and one at the end of the string.

``'*'``
   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
   many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
   by any number of 'b's.

``'+'``
   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
   ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
   match just 'a'.

``'?'``
   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
   ``ab?`` will match either 'a' or 'ab'.


``*?``, ``+?``, ``??``
   The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
   as much text as possible. Sometimes this behaviour isn't desired; if the RE
   ``<.*>`` is matched against ``'<H1>title</H1>'``, it will match the entire
   string, and not just ``'<H1>'``. Adding ``'?'`` after the qualifier makes it
   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
   characters as possible will be matched. Using ``.*?`` in the previous
   expression will match only ``'<H1>'``.

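
   The greedy and minimal forms can be compared side by side. This is a small
   sketch using the string from the paragraph above; the variable names are
   illustrative::

      import re

      text = '<H1>title</H1>'

      # Greedy: .* runs to the last '>' in the string.
      greedy = re.match(r'<.*>', text).group(0)

      # Minimal: .*? stops at the first '>' that lets the match succeed.
      minimal = re.match(r'<.*?>', text).group(0)
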

``{m}``
   Specifies that exactly *m* copies of the previous RE should be matched; fewer
   matches cause the entire RE not to match. For example, ``a{6}`` will match
   exactly six ``'a'`` characters, but not five.

``{m,n}``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as many repetitions as possible. For example,
   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting *m* specifies a
   lower bound of zero, and omitting *n* specifies an infinite upper bound. As an
   example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
   followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
   modifier would be confused with the previously described form.

``{m,n}?``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as *few* repetitions as possible. This is the
   non-greedy version of the previous qualifier. For example, on the
   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
   while ``a{3,5}?`` will only match 3 characters.

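
   The counted qualifiers can be tried out as follows. This is an illustrative
   sketch built from the examples in the text; the variable names are not part
   of the module::

      import re

      six_as = 'aaaaaa'

      # Greedy form takes as many repetitions as the upper bound allows.
      greedy = re.match(r'a{3,5}', six_as).group(0)

      # Non-greedy form stops as soon as the lower bound is satisfied.
      lazy = re.match(r'a{3,5}?', six_as).group(0)

      # An open upper bound: at least four 'a' characters, then 'b'.
      too_short = re.match(r'a{4,}b', 'aaab')
      long_enough = re.match(r'a{4,}b', 'aaaab').group(0)
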

``'\'``
   Either escapes special characters (permitting you to match characters like
   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
   sequences are discussed below.

   If you're not using a raw string to express the pattern, remember that Python
   also uses the backslash as an escape sequence in string literals; if the escape
   sequence isn't recognized by Python's parser, the backslash and subsequent
   character are included in the resulting string. However, if Python would
   recognize the resulting sequence, the backslash should be repeated twice. This
   is complicated and hard to understand, so it's highly recommended that you use
   raw strings for all but the simplest expressions.


``[]``
   Used to indicate a set of characters. In a set:

   * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
     ``'m'``, or ``'k'``.

   * Ranges of characters can be indicated by giving two characters and separating
     them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
     ``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
     ``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
     ``[a\-z]``) or if it's placed as the first or last character (e.g. ``[a-]``),
     it will match a literal ``'-'``.

   * Special characters lose their special meaning inside sets. For example,
     ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
     ``'*'``, or ``')'``.

   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
     inside a set, although the characters they match depend on whether
     :const:`LOCALE` or :const:`UNICODE` mode is in force.

   * Characters that are not within a range can be matched by :dfn:`complementing`
     the set. If the first character of the set is ``'^'``, all the characters
     that are *not* in the set will be matched. For example, ``[^5]`` will match
     any character except ``'5'``, and ``[^^]`` will match any character except
     ``'^'``. ``^`` has no special meaning if it's not the first character in
     the set.

   * To match a literal ``']'`` inside a set, precede it with a backslash, or
     place it at the beginning of the set. For example, both ``[()[\]{}]`` and
     ``[]()[{}]`` will match any of the six bracket characters, including a
     parenthesis.

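
   The set rules above can be exercised with a few short calls. This is a
   sketch with made-up sample strings; the variable names are illustrative::

      import re

      # Ranges: any hexadecimal digit.
      hex_digits = re.findall(r'[0-9A-Fa-f]', 'x1fG')

      # Special characters are literal inside a set.
      literal_specials = re.findall(r'[(+*)]', 'a+(b*c)')

      # A leading '^' complements the set.
      not_five = re.findall(r'[^5]', '1556')

      # A backslash lets ']' appear inside the set.
      brackets = re.findall(r'[()[\]{}]', 'f(x)[0]')
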

``'|'``
   ``A|B``, where A and B can be arbitrary REs, creates a regular expression that
   will match either A or B. An arbitrary number of REs can be separated by the
   ``'|'`` in this way. This can be used inside groups (see below) as well. As
   the target string is scanned, REs separated by ``'|'`` are tried from left to
   right. When one pattern completely matches, that branch is accepted. This means
   that once ``A`` matches, ``B`` will not be tested further, even if it would
   produce a longer overall match. In other words, the ``'|'`` operator is never
   greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
   character class, as in ``[|]``.

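
   Because branches are tried from left to right, their order can change the
   result. A short sketch (sample strings and variable names are illustrative)::

      import re

      # The first branch that matches wins, even if a later one is longer.
      first_branch = re.match(r'a|ab', 'ab').group(0)

      # Putting the longer alternative first lets it be tried first.
      reordered = re.match(r'ab|a', 'ab').group(0)

      # A literal '|' written as a one-character set.
      literal_bar = re.findall(r'[|]', 'a|b|c')
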

``(...)``
   Matches whatever regular expression is inside the parentheses, and indicates the
   start and end of a group; the contents of a group can be retrieved after a match
   has been performed, and can be matched later in the string with the ``\number``
   special sequence, described below. To match the literals ``'('`` or ``')'``,
   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.

``(?...)``
   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
   otherwise). The first character after the ``'?'`` determines what the meaning
   and further syntax of the construct is. Extensions usually do not create a new
   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
   currently supported extensions.

``(?iLmsux)``
   (One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
   ``'u'``, ``'x'``.) The group matches the empty string; the letters
   set the corresponding flags: :const:`re.I` (ignore case),
   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
   :const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
   and :const:`re.X` (verbose), for the entire regular expression. (The
   flags are described in :ref:`contents-of-module-re`.) This
   is useful if you wish to include the flags as part of the regular
   expression, instead of passing a *flag* argument to the
   :func:`re.compile` function.

   Note that the ``(?x)`` flag changes how the expression is parsed. It should be
   used first in the expression string, or after one or more whitespace characters.
   If there are non-whitespace characters before the flag, the results are
   undefined.

``(?:...)``
   A non-capturing version of regular parentheses. Matches whatever regular
   expression is inside the parentheses, but the substring matched by the group
   *cannot* be retrieved after performing a match or referenced later in the
   pattern.


``(?P<name>...)``
   Similar to regular parentheses, but the substring matched by the group is
   accessible via the symbolic group name *name*. Group names must be valid
   Python identifiers, and each group name must be defined only once within a
   regular expression. A symbolic group is also a numbered group, just as if
   the group were not named.

   Named groups can be referenced in three contexts. If the pattern is
   ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
   single or double quotes):

   +---------------------------------------+----------------------------------+
   | Context of reference to group "quote" | Ways to reference it             |
   +=======================================+==================================+
   | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
   | when processing match object ``m``    | * ``m.group('quote')``           |
   |                                       | * ``m.end('quote')`` (etc.)      |
   +---------------------------------------+----------------------------------+
   | in a string passed to the ``repl``    | * ``\g<quote>``                  |
   | argument of ``re.sub()``              | * ``\g<1>``                      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+

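
   The three contexts can be seen together in a short sketch. The sample
   strings and variable names below are illustrative, not part of the module::

      import re

      # Context 1: the pattern refers back to its own named group.
      pattern = re.compile(r'''(?P<quote>['"]).*?(?P=quote)''')

      # Context 2: the match object exposes the group by name or by number.
      m = pattern.search('say "hi" now')
      whole = m.group(0)
      by_name = m.group('quote')
      by_number = m.group(1)
      end_of_quote = m.end('quote')   # index just past the opening quote

      # Context 3: the replacement string refers to the group by name.
      replaced = pattern.sub(r'<\g<quote>>', "a 'b' c")
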

``(?P=name)``
   A backreference to a named group; it matches whatever text was matched by the
   earlier group named *name*.

``(?#...)``
   A comment; the contents of the parentheses are simply ignored.

``(?=...)``
   Matches if ``...`` matches next, but doesn't consume any of the string. This is
   called a lookahead assertion. For example, ``Isaac (?=Asimov)`` will match
   ``'Isaac '`` only if it's followed by ``'Asimov'``.

``(?!...)``
   Matches if ``...`` doesn't match next. This is a negative lookahead assertion.
   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
   followed by ``'Asimov'``.

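
   Both lookahead forms check the following text without making it part of the
   match. A minimal sketch using the examples above (the second sample string
   is an assumption added for contrast)::

      import re

      # Positive lookahead: 'Asimov' must follow but is not consumed.
      followed = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov')
      matched_text = followed.group(0)

      # Negative lookahead: fails when 'Asimov' follows ...
      rejected = re.search(r'Isaac (?!Asimov)', 'Isaac Asimov')

      # ... and succeeds when something else does.
      accepted = re.search(r'Isaac (?!Asimov)', 'Isaac Newton').group(0)
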

``(?<=...)``
   Matches if the current position in the string is preceded by a match for ``...``
   that ends at the current position. This is called a :dfn:`positive lookbehind
   assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
   lookbehind will back up 3 characters and check if the contained pattern matches.
   The contained pattern must only match strings of some fixed length, meaning that
   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Group
   references are not supported even if they match strings of some fixed length.
   Note that
   patterns which start with positive lookbehind assertions will not match at the
   beginning of the string being searched; you will most likely want to use the
   :func:`search` function rather than the :func:`match` function:

      >>> m = re.search('(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen:

      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'

``(?<!...)``
   Matches if the current position in the string is not preceded by a match for
   ``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
   positive lookbehind assertions, the contained pattern must only match strings of
   some fixed length and shouldn't contain group references.
   Patterns which start with negative lookbehind assertions may
   match at the beginning of the string being searched.


``(?(id/name)yes-pattern|no-pattern)``
   Will try to match with ``yes-pattern`` if the group with given *id* or *name*
   exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional and
   can be omitted. For example, ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor email
   matching pattern, which will match with ``'<user@host.com>'`` as well as
   ``'user@host.com'``, but not with ``'<user@host.com'``.

   .. versionadded:: 2.4

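
   The conditional construct can be checked against the three strings mentioned
   above. This is a sketch; the names ``email`` and the result variables are
   illustrative::

      import re

      email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)')

      # Group 1 matched '<', so the closing '>' is required and found.
      bracketed = email.match('<user@host.com>').group(0)

      # Group 1 did not take part, so no '>' is required.
      bare = email.match('user@host.com').group(0)

      # An opening '<' without its partner is rejected by match() ...
      unbalanced = email.match('<user@host.com')

      # ... although search() can still find the bare address inside it.
      searched = email.search('<user@host.com').group(0)

   The last line is a reminder that the condition only constrains a match
   attempt that actually used group 1; anchoring the pattern is needed if the
   whole string must be balanced.
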

The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not on the list, then the resulting RE will match
the second character. For example, ``\$`` matches the character ``'$'``.

``\number``
   Matches the contents of the group of the same number. Groups are numbered
   starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
   but not ``'thethe'`` (note the space after the group). This special sequence
   can only be used to match one of the first 99 groups. If the first digit of
   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
   a group match, but as the character with octal value *number*. Inside the
   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
   characters.

``\A``
   Matches only at the start of the string.

``\b``
   Matches the empty string, but only at the beginning or end of a word. A word is
   defined as a sequence of alphanumeric or underscore characters, so the end of a
   word is indicated by whitespace or a non-alphanumeric, non-underscore character.
   Note that formally, ``\b`` is defined as the boundary between a ``\w`` and
   a ``\W`` character (or vice versa), or between ``\w`` and the beginning/end
   of the string, so the precise set of characters deemed to be alphanumeric
   depends on the values of the ``UNICODE`` and ``LOCALE`` flags.
   For example, ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
   ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
   Inside a character range, ``\b`` represents the backspace character, for
   compatibility with Python's string literals.

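
   The word-boundary examples can be verified with a short sketch (the list
   names are illustrative)::

      import re

      word = re.compile(r'\bfoo\b')

      # Strings in which 'foo' stands alone as a word.
      hits = [s for s in ['foo', 'foo.', '(foo)', 'bar foo baz']
              if word.search(s)]

      # Strings in which 'foo' runs into another word character.
      misses = [s for s in ['foobar', 'foo3'] if word.search(s)]

      # Inside a set, \b means the backspace character instead.
      backspace = re.match(r'[\b]', '\x08') is not None
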

``\B``
   Matches the empty string, but only when it is *not* at the beginning or end of a
   word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``, ``'py2'``,
   but not ``'py'``, ``'py.'``, or ``'py!'``.
   ``\B`` is just the opposite of ``\b``, so is also subject to the settings
   of ``LOCALE`` and ``UNICODE``.

``\d``
   When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
   is equivalent to the set ``[0-9]``. With :const:`UNICODE`, it will match
   whatever is classified as a decimal digit in the Unicode character properties
   database.

``\D``
   When the :const:`UNICODE` flag is not specified, matches any non-digit
   character; this is equivalent to the set ``[^0-9]``. With :const:`UNICODE`, it
   will match anything other than characters marked as digits in the Unicode
   character properties database.

``\s``
   When the :const:`UNICODE` flag is not specified, it matches any whitespace
   character; this is equivalent to the set ``[ \t\n\r\f\v]``. The
   :const:`LOCALE` flag has no extra effect on matching of the space.
   If :const:`UNICODE` is set, this will match the characters ``[ \t\n\r\f\v]``
   plus whatever is classified as space in the Unicode character properties
   database.

``\S``
   When the :const:`UNICODE` flag is not specified, matches any non-whitespace
   character; this is equivalent to the set ``[^ \t\n\r\f\v]``. The
   :const:`LOCALE` flag has no extra effect on non-whitespace match. If
   :const:`UNICODE` is set, then any character not marked as space in the
   Unicode character properties database is matched.

``\w``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any alphanumeric character and the underscore; this is equivalent to the set
   ``[a-zA-Z0-9_]``. With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
   whatever characters are defined as alphanumeric for the current locale. If
   :const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
   is classified as alphanumeric in the Unicode character properties database.

``\W``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
   With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
   not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
   this will match any character that is neither in ``[0-9_]`` nor classified as
   alphanumeric in the Unicode character properties database.

``\Z``
   Matches only at the end of the string.

If both the :const:`LOCALE` and :const:`UNICODE` flags are included for a
particular sequence, then the :const:`LOCALE` flag takes effect first, followed
by :const:`UNICODE`.


Most of the standard escapes supported by Python string literals are also
accepted by the regular expression parser::

   \a      \b      \f      \n
   \r      \t      \v      \x
   \\

(Note that ``\b`` is used to represent word boundaries, and means "backspace"
only inside character classes.)

Octal escapes are included in a limited form: If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape. Otherwise, it is
a group reference. As for string literals, octal escapes are always at most
three digits in length.

.. seealso::

   Mastering Regular Expressions
      Book on regular expressions by Jeffrey Friedl, published by O'Reilly. The
      second edition of the book no longer covers Python at all, but the first
      edition covered writing good regular expression patterns in great detail.


.. _contents-of-module-re:

Module Contents
---------------

The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full featured methods for compiled
regular expressions. Most non-trivial applications always use the compiled
form.


.. function:: compile(pattern, flags=0)

   Compile a regular expression pattern into a regular expression object, which
   can be used for matching using its :func:`~RegexObject.match` and
   :func:`~RegexObject.search` methods, described below.

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the following variables, combined using bitwise OR (the
   ``|`` operator).

   The sequence ::

      prog = re.compile(pattern)
      result = prog.match(string)

   is equivalent to ::

      result = re.match(pattern, string)

   but using :func:`re.compile` and saving the resulting regular expression
   object for reuse is more efficient when the expression will be used several
   times in a single program.

   .. note::

      The compiled versions of the most recent patterns passed to
      :func:`re.match`, :func:`re.search` or :func:`re.compile` are cached, so
      programs that use only a few regular expressions at a time needn't worry
      about compiling regular expressions.


.. data:: DEBUG

   Display debug information about compiled expression.


.. data:: I
          IGNORECASE

   Perform case-insensitive matching; expressions like ``[A-Z]`` will match
   lowercase letters, too. This is not affected by the current locale.


.. data:: L
          LOCALE

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
   current locale.


.. data:: M
          MULTILINE

   When specified, the pattern character ``'^'`` matches at the beginning of the
   string and at the beginning of each line (immediately following each newline);
   and the pattern character ``'$'`` matches at the end of the string and at the
   end of each line (immediately preceding each newline). By default, ``'^'``
   matches only at the beginning of the string, and ``'$'`` only at the end of the
   string and immediately before the newline (if any) at the end of the string.

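
   The effect of this flag on ``'^'`` and ``'$'`` can be seen in a small sketch,
   reusing the ``'foo1\nfoo2\n'`` string from the syntax section (the variable
   names are illustrative)::

      import re

      text = 'foo1\nfoo2\n'

      # Default: '$' only matches at the very end (or before a final newline).
      default = re.search(r'foo.$', text).group(0)

      # MULTILINE: '$' also matches before every newline.
      multiline = re.search(r'foo.$', text, re.M).group(0)

      # MULTILINE: '^' also matches right after every newline.
      line_starts = re.findall(r'^\w+', text, re.MULTILINE)
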

.. data:: S
          DOTALL

   Make the ``'.'`` special character match any character at all, including a
   newline; without this flag, ``'.'`` will match anything *except* a newline.


.. data:: U
          UNICODE

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` dependent
   on the Unicode character properties database.

   .. versionadded:: 2.0


.. data:: X
          VERBOSE

   This flag allows you to write regular expressions that look nicer and are
   more readable by allowing you to visually separate logical sections of the
   pattern and add comments. Whitespace within the pattern is ignored, except
   when in a character class or when preceded by an unescaped backslash.
   When a line contains a ``#`` that is not in a character class and is not
   preceded by an unescaped backslash, all characters from the leftmost such
   ``#`` through the end of the line are ignored.

   This means that the two following regular expression objects that match a
   decimal number are functionally equal::

      a = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
      b = re.compile(r"\d+\.\d*")


.. function:: search(pattern, string, flags=0)

   Scan through *string* looking for the first location where the regular expression
   *pattern* produces a match, and return a corresponding :class:`MatchObject`
   instance. Return ``None`` if no position in the string matches the pattern; note
   that this is different from finding a zero-length match at some point in the
   string.


.. function:: match(pattern, string, flags=0)

   If zero or more characters at the beginning of *string* match the regular
   expression *pattern*, return a corresponding :class:`MatchObject` instance.
   Return ``None`` if the string does not match the pattern; note that this is
   different from a zero-length match.

   Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
   at the beginning of the string and not at the beginning of each line.

   If you want to locate a match anywhere in *string*, use :func:`search`
   instead (see also :ref:`search-vs-match`).


.. function:: split(pattern, string, maxsplit=0, flags=0)

   Split *string* by the occurrences of *pattern*. If capturing parentheses are
   used in *pattern*, then the text of all groups in the pattern is also returned
   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
   splits occur, and the remainder of the string is returned as the final element
   of the list. (Incompatibility note: in the original Python 1.5 release,
   *maxsplit* was ignored. This has been fixed in later releases.)

      >>> re.split(r'\W+', 'Words, words, words.')
      ['Words', 'words', 'words', '']
      >>> re.split(r'(\W+)', 'Words, words, words.')
      ['Words', ', ', 'words', ', ', 'words', '.', '']
      >>> re.split(r'\W+', 'Words, words, words.', 1)
      ['Words', 'words, words.']
      >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
      ['0', '3', '9']

   If there are capturing groups in the separator and it matches at the start of
   the string, the result will start with an empty string. The same holds for
   the end of the string:

      >>> re.split(r'(\W+)', '...words, words...')
      ['', '...', 'words', ', ', 'words', '...', '']

   That way, separator components are always found at the same relative
   indices within the result list (e.g., if there's one capturing group
   in the separator, the 0th, the 2nd and so forth).

   Note that *split* will never split a string on an empty pattern match.
   For example:

      >>> re.split('x*', 'foo')
      ['foo']
      >>> re.split("(?m)^$", "foo\n\nbar\n")
      ['foo\n\nbar\n']

   .. versionchanged:: 2.7
      Added the optional flags argument.


.. function:: findall(pattern, string, flags=0)

   Return all non-overlapping matches of *pattern* in *string*, as a list of
   strings. The *string* is scanned left-to-right, and matches are returned in
   the order found. If one or more groups are present in the pattern, return a
   list of groups; this will be a list of tuples if the pattern has more than
   one group. Empty matches are included in the result unless they touch the
   beginning of another match.

   .. versionadded:: 1.5.2

   .. versionchanged:: 2.4
      Added the optional flags argument.

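
   How the shape of the result depends on the number of groups can be seen in
   a short sketch (the sample string and variable names are illustrative)::

      import re

      text = '1-2, 30-40'

      # No groups: a list of whole matches.
      no_group = re.findall(r'\d+-\d+', text)

      # One group: a list of that group's text.
      one_group = re.findall(r'(\d+)-\d+', text)

      # Several groups: a list of tuples, one tuple per match.
      two_groups = re.findall(r'(\d+)-(\d+)', text)
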

.. function:: finditer(pattern, string, flags=0)

   Return an :term:`iterator` yielding :class:`MatchObject` instances over all
   non-overlapping matches for the RE *pattern* in *string*. The *string* is
   scanned left-to-right, and matches are returned in the order found. Empty
   matches are included in the result unless they touch the beginning of another
   match.

   .. versionadded:: 2.2

   .. versionchanged:: 2.4
      Added the optional flags argument.


.. function:: sub(pattern, repl, string, count=0, flags=0)

   Return the string obtained by replacing the leftmost non-overlapping occurrences
   of *pattern* in *string* by the replacement *repl*. If the pattern isn't found,
   *string* is returned unchanged. *repl* can be a string or a function; if it is
   a string, any backslash escapes in it are processed. That is, ``\n`` is
   converted to a single newline character, ``\r`` is converted to a carriage return, and
   so forth. Unknown escapes such as ``\j`` are left alone. Backreferences, such
   as ``\6``, are replaced with the substring matched by group 6 in the pattern.
   For example:

      >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
      ...        r'static PyObject*\npy_\1(void)\n{',
      ...        'def myfunc():')
      'static PyObject*\npy_myfunc(void)\n{'

   If *repl* is a function, it is called for every non-overlapping occurrence of
   *pattern*. The function takes a single match object argument, and returns the
   replacement string. For example:

      >>> def dashrepl(matchobj):
      ...     if matchobj.group(0) == '-': return ' '
      ...     else: return '-'
      >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
      'pro--gram files'
      >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
      'Baked Beans & Spam'

   The pattern may be a string or an RE object.

   The optional argument *count* is the maximum number of pattern occurrences to be
   replaced; *count* must be a non-negative integer. If omitted or zero, all
   occurrences will be replaced. Empty matches for the pattern are replaced only
   when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
   ``'-a-b-c-'``.

   In string-type *repl* arguments, in addition to the character escapes and
   backreferences described above,
   ``\g<name>`` will use the substring matched by the group named ``name``, as
   defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
   group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
   in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
   reference to group 20, not a reference to group 2 followed by the literal
   character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
   substring matched by the RE.

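
   The ``\g<...>`` forms can be compared in a brief sketch. The ``date``
   pattern, sample strings and variable names are illustrative assumptions::

      import re

      date = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})')

      # Reference groups by name or by number; both give the same result.
      by_name = date.sub(r'\g<month>/\g<year>', '2010-07')
      by_number = date.sub(r'\g<2>/\g<1>', '2010-07')

      # \g<2>0 is group 2 followed by a literal '0', not group 20.
      unambiguous = re.sub(r'(a)(b)', r'\g<2>0', 'ab')

      # \g<0> stands for the whole match.
      whole_match = re.sub(r'\d+', r'[\g<0>]', 'n=42')
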

   .. versionchanged:: 2.7
      Added the optional flags argument.


.. function:: subn(pattern, repl, string, count=0, flags=0)

   Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
   number_of_subs_made)``.

   .. versionchanged:: 2.7
      Added the optional flags argument.


.. function:: escape(string)

   Return *string* with all non-alphanumerics backslashed; this is useful if you
   want to match an arbitrary literal string that may have regular expression
   metacharacters in it.


.. function:: purge()

   Clear the regular expression cache.


.. exception:: error

   Exception raised when a string passed to one of the functions here is not a
   valid regular expression (for example, it might contain unmatched parentheses)
   or when some other error occurs during compilation or matching. It is never an
   error if a string contains no match for a pattern.

708
Regular Expression Objects
709
--------------------------
711
.. class:: RegexObject
713
The :class:`RegexObject` class supports the following methods and attributes:
715
.. method:: RegexObject.search(string[, pos[, endpos]])
717
Scan through *string* looking for a location where this regular expression
718
produces a match, and return a corresponding :class:`MatchObject` instance.
719
Return ``None`` if no position in the string matches the pattern; note that this
720
is different from finding a zero-length match at some point in the string.
722
The optional second parameter *pos* gives an index in the string where the
723
search is to start; it defaults to ``0``. This is not completely equivalent to
724
slicing the string; the ``'^'`` pattern character matches at the real beginning
725
of the string and at positions just after a newline, but not necessarily at the
726
index where the search is to start.
728
The optional parameter *endpos* limits how far the string will be searched; it
729
will be as if the string is *endpos* characters long, so only the characters
730
from *pos* to ``endpos - 1`` will be searched for a match. If *endpos* is less
731
than *pos*, no match will be found, otherwise, if *rx* is a compiled regular
732
expression object, ``rx.search(string, 0, 50)`` is equivalent to
733
``rx.search(string[:50], 0)``.
735
>>> pattern = re.compile("d")
736
>>> pattern.search("dog") # Match at index 0
737
<_sre.SRE_Match object at ...>
738
>>> pattern.search("dog", 1) # No match; search doesn't include the "d"

.. method:: RegexObject.match(string[, pos[, endpos]])

If zero or more characters at the *beginning* of *string* match this regular
expression, return a corresponding :class:`MatchObject` instance. Return
``None`` if the string does not match the pattern; note that this is different
from a zero-length match.

The optional *pos* and *endpos* parameters have the same meaning as for the
:meth:`~RegexObject.search` method.

>>> pattern = re.compile("o")
>>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
>>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
<_sre.SRE_Match object at ...>

If you want to locate a match anywhere in *string*, use
:meth:`~RegexObject.search` instead (see also :ref:`search-vs-match`).

.. method:: RegexObject.split(string, maxsplit=0)

Identical to the :func:`split` function, using the compiled pattern.
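
As a minimal sketch (the delimiter pattern and input string are illustrative
assumptions), a compiled pattern can be reused for several splits, with
*maxsplit* limiting how many splits are made:

```python
import re

comma = re.compile(r"\s*,\s*")
data = "a , b,c ,d"

# Split on every comma, swallowing the surrounding whitespace.
all_fields = comma.split(data)

# Split at most twice; the remainder of the string is kept as the last item.
first_two = comma.split(data, maxsplit=2)
```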

.. method:: RegexObject.findall(string[, pos[, endpos]])

Similar to the :func:`findall` function, using the compiled pattern, but
also accepts optional *pos* and *endpos* parameters that limit the search
region in the same way as for :meth:`~RegexObject.match`.

.. method:: RegexObject.finditer(string[, pos[, endpos]])

Similar to the :func:`finditer` function, using the compiled pattern, but
also accepts optional *pos* and *endpos* parameters that limit the search
region in the same way as for :meth:`~RegexObject.match`.
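
The following sketch, using a made-up sample string, shows *pos* and *endpos*
restricting both methods to a window of the string. Note that the reported
positions still refer to the whole string, not to the window:

```python
import re

word = re.compile(r"\w+")
text = "one two three four"

# Only consider text[4:13], which is "two three".
words = word.findall(text, 4, 13)

# finditer() yields match objects, so positions are available as well.
spans = [m.span() for m in word.finditer(text, 4, 13)]
```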

.. method:: RegexObject.sub(repl, string, count=0)

Identical to the :func:`sub` function, using the compiled pattern.

.. method:: RegexObject.subn(repl, string, count=0)

Identical to the :func:`subn` function, using the compiled pattern.
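
A brief sketch of both methods on a compiled pattern; the whitespace pattern
and the input are assumptions chosen for illustration:

```python
import re

spaces = re.compile(r"\s+")
text = "a  b   c    d"

# Replace every run of whitespace with a single space.
collapsed = spaces.sub(" ", text)

# count limits the number of replacements.
first_only = spaces.sub(" ", text, count=1)

# subn() also reports how many replacements were made.
result, replacements = spaces.subn(" ", text)
```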

.. attribute:: RegexObject.flags

The regex matching flags. This is a combination of the flags given to
:func:`.compile` and any ``(?...)`` inline flags in the pattern.

.. attribute:: RegexObject.groups

The number of capturing groups in the pattern.

.. attribute:: RegexObject.groupindex

A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
numbers. The dictionary is empty if no symbolic groups were used in the
pattern.

.. attribute:: RegexObject.pattern

The pattern string from which the RE object was compiled.
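
The attributes above can be inspected on any compiled pattern. In this sketch
the date-like pattern is an illustrative assumption; the flag values are tested
with bit masks because the interpreter may set additional flags of its own:

```python
import re

source = r"(?i)(?P<year>\d{4})-(?P<month>\d{2})(-\d{2})?"
date = re.compile(source, re.MULTILINE)

# flags combines the compile() argument with the inline (?i) flag.
has_ignorecase = bool(date.flags & re.IGNORECASE)
has_multiline = bool(date.flags & re.MULTILINE)

# Three capturing groups in total, two of them named.
group_count = date.groups
names = dict(date.groupindex)

# The original pattern string is kept unchanged.
same_source = date.pattern == source
```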

.. class:: MatchObject

Match objects always have a boolean value of ``True``.
Since :meth:`~RegexObject.match` and :meth:`~RegexObject.search` return ``None``
when there is no match, you can test whether there was a match with a simple
``if`` statement::

   match = re.search(pattern, string)
   if match:
       process(match)  # process() is a placeholder for your own handling code

Match objects support the following methods and attributes:

.. method:: MatchObject.expand(template)

Return the string obtained by doing backslash substitution on the template
string *template*, as done by the :meth:`~RegexObject.sub` method. Escapes
such as ``\n`` are converted to the appropriate characters, and numeric
backreferences (``\1``, ``\2``) and named backreferences (``\g<1>``,
``\g<name>``) are replaced by the contents of the corresponding group.
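
A small sketch of template expansion; the group names ``first`` and ``last``
are chosen here purely for illustration:

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

# Numeric backreferences.
by_number = m.expand(r"\2, \1")

# Named backreferences.
by_name = m.expand(r"\g<last>, \g<first>")

# The two-character escape \n in the template becomes a real newline.
two_lines = m.expand(r"\1\n\2")
```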

.. method:: MatchObject.group([group1, ...])

Returns one or more subgroups of the match. If there is a single argument, the
result is a single string; if there are multiple arguments, the result is a
tuple with one item per argument. Without arguments, *group1* defaults to zero
(the whole match is returned). If a *groupN* argument is zero, the corresponding
return value is the entire matching string; if it is in the inclusive range
[1..99], it is the string matching the corresponding parenthesized group. If a
group number is negative or larger than the number of groups defined in the
pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
part of the pattern that did not match, the corresponding result is ``None``.
If a group is contained in a part of the pattern that matched multiple times,
the last match is returned.

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0)       # The entire match
'Isaac Newton'
>>> m.group(1)       # The first parenthesized subgroup.
'Isaac'
>>> m.group(2)       # The second parenthesized subgroup.
'Newton'
>>> m.group(1, 2)    # Multiple arguments give us a tuple.
('Isaac', 'Newton')

If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
arguments may also be strings identifying groups by their group name. If a
string argument is not used as a group name in the pattern, an :exc:`IndexError`
exception is raised.

A moderately complicated example:

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.group('first_name')
'Malcolm'
>>> m.group('last_name')
'Reynolds'

Named groups can also be referred to by their index:

>>> m.group(1)
'Malcolm'
>>> m.group(2)
'Reynolds'

If a group matches multiple times, only the last match is accessible:

>>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
>>> m.group(1)                        # Returns only the last match.
'c3'

.. method:: MatchObject.groups([default])

Return a tuple containing all the subgroups of the match, from 1 up to however
many groups are in the pattern. The *default* argument is used for groups that
did not participate in the match; it defaults to ``None``. (Incompatibility
note: in the original Python 1.5 release, if the tuple was one element long, a
string would be returned instead. In later versions (from 1.5.1 on), a
singleton tuple is returned in such cases.)

For example:

>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
>>> m.groups()
('24', '1632')

If we make the decimal place and everything after it optional, not all groups
might participate in the match. These groups will default to ``None`` unless
the *default* argument is given:

>>> m = re.match(r"(\d+)\.?(\d+)?", "24")
>>> m.groups()      # Second group defaults to None.
('24', None)
>>> m.groups('0')   # Now, the second group defaults to '0'.
('24', '0')

.. method:: MatchObject.groupdict([default])

Return a dictionary containing all the *named* subgroups of the match, keyed by
the subgroup name. The *default* argument is used for groups that did not
participate in the match; it defaults to ``None``. For example:

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.groupdict()
{'first_name': 'Malcolm', 'last_name': 'Reynolds'}
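
The effect of the *default* argument can be sketched with a pattern whose
second named group is optional. The group names ``int`` and ``frac`` are
assumptions made for this illustration:

```python
import re

m = re.match(r"(?P<int>\d+)(?:\.(?P<frac>\d+))?", "24")

# The "frac" group did not participate in the match.
without_default = m.groupdict()

# Supplying a default replaces None for non-participating groups.
with_default = m.groupdict("0")
```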

.. method:: MatchObject.start([group])
            MatchObject.end([group])

Return the indices of the start and end of the substring matched by *group*;
*group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
*group* exists but did not contribute to the match. For a match object *m*, and
a group *g* that did contribute to the match, the substring matched by group *g*
(equivalent to ``m.group(g)``) is ::

   m.string[m.start(g):m.end(g)]

Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
null string. For example, after ``m = re.search('b(c?)', 'cba')``,
``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
2, and ``m.start(2)`` raises an :exc:`IndexError` exception.

An example that will remove *remove_this* from email addresses:

>>> email = "tony@tiremove_thisger.net"
>>> m = re.search("remove_this", email)
>>> email[:m.start()] + email[m.end():]
'tony@tiger.net'

.. method:: MatchObject.span([group])

For :class:`MatchObject` *m*, return the 2-tuple ``(m.start(group),
m.end(group))``. Note that if *group* did not contribute to the match, this is
``(-1, -1)``. *group* defaults to zero, the entire match.
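
As a sketch of how :meth:`~MatchObject.start`, :meth:`~MatchObject.end` and
:meth:`~MatchObject.span` relate, consider a hypothetical pattern whose second
group is optional and does not take part in the match:

```python
import re

m = re.search(r"(\d+)-(\d+)?", "tel 555-")

whole = m.span()        # span of the entire match
first = m.span(1)       # span of the group that matched "555"
missing = m.span(2)     # group 2 exists but did not contribute

# Slicing the original string with start()/end() recovers the group text.
recovered = m.string[m.start(1):m.end(1)]
```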

.. attribute:: MatchObject.pos

The value of *pos* which was passed to the :meth:`~RegexObject.search` or
:meth:`~RegexObject.match` method of the :class:`RegexObject`. This is the
index into the string at which the RE engine started looking for a match.

.. attribute:: MatchObject.endpos

The value of *endpos* which was passed to the :meth:`~RegexObject.search` or
:meth:`~RegexObject.match` method of the :class:`RegexObject`. This is the
index into the string beyond which the RE engine will not go.

.. attribute:: MatchObject.lastindex

The integer index of the last matched capturing group, or ``None`` if no group
was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
string.

.. attribute:: MatchObject.lastgroup

The name of the last matched capturing group, or ``None`` if the group didn't
have a name, or if no group was matched at all.
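
The following sketch checks the cases described above. The alternation pattern
with the group names ``word`` and ``number`` is an illustrative assumption:

```python
import re

# Cases taken from the description of lastindex.
nested = re.match(r"((a)(b))", "ab").lastindex
sequential = re.match(r"(a)(b)", "ab").lastindex
no_groups = re.match(r"ab", "ab").lastindex

# With named alternatives, lastgroup tells which branch matched.
m = re.match(r"(?P<word>[a-z]+)|(?P<number>\d+)", "42")
branch = m.lastgroup

# An unnamed last group yields None for lastgroup.
unnamed = re.match(r"(a)(b)?", "a").lastgroup
```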

.. attribute:: MatchObject.re

The regular expression object whose :meth:`~RegexObject.match` or
:meth:`~RegexObject.search` method produced this :class:`MatchObject`
instance.

.. attribute:: MatchObject.string

The string passed to :meth:`~RegexObject.match` or
:meth:`~RegexObject.search`.
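
A match object remembers how it was produced. This minimal sketch, with an
arbitrary sample string, reads back the search window, the pattern object and
the subject string:

```python
import re

rx = re.compile(r"\d+")
subject = "ab 123 cd 456"

# Search only subject[7:13], which skips the first number.
m = rx.search(subject, 7, 13)

window = (m.pos, m.endpos)
found = m.group()
same_pattern = m.re is rx
same_string = m.string is subject
```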

In this example, we'll use the following helper function to display match
objects a little more gracefully:

def displaymatch(match):
    # Return None for a failed match so that nothing is displayed.
    if match is None:
        return None
    return '<Match: %r, groups=%r>' % (match.group(), match.groups())

Suppose you are writing a poker program where a player's hand is represented as
a 5-character string with each character representing a card, "a" for ace, "k"
for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
representing the card with that value.

To see if a given string is a valid hand, one could do the following:

>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
>>> displaymatch(valid.match("akt5q"))  # Valid.
"<Match: 'akt5q', groups=()>"
>>> displaymatch(valid.match("akt5e"))  # Invalid.
>>> displaymatch(valid.match("akt"))    # Invalid.
>>> displaymatch(valid.match("727ak"))  # Valid.
"<Match: '727ak', groups=()>"

That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
To match this with a regular expression, one could use backreferences as such:

>>> pair = re.compile(r".*(.).*\1")
>>> displaymatch(pair.match("717ak"))     # Pair of 7s.
"<Match: '717', groups=('7',)>"
>>> displaymatch(pair.match("718ak"))     # No pairs.
>>> displaymatch(pair.match("354aa"))     # Pair of aces.
"<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the
:meth:`~MatchObject.group` method of :class:`MatchObject` in the following
manner:

>>> pair.match("717ak").group(1)
'7'

# Error because pair.match() returns None, which doesn't have a group() method:
>>> pair.match("718ak").group(1)
Traceback (most recent call last):
  File "<pyshell#23>", line 1, in <module>
    pair.match("718ak").group(1)
AttributeError: 'NoneType' object has no attribute 'group'

>>> pair.match("354aa").group(1)
'a'

.. index:: single: scanf()

Python does not currently have an equivalent to :c:func:`scanf`. Regular
expressions are generally more powerful, though also more verbose, than
:c:func:`scanf` format strings. The table below offers some more-or-less
equivalent mappings between :c:func:`scanf` format tokens and regular
expressions.

+--------------------------------+---------------------------------------------+
| :c:func:`scanf` Token          | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+
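
As a hedged sketch of using one of these mappings, the floating-point
expression from the table can pull numbers out of free text. The sample input
is an assumption, and the conversion to :class:`float` is left to Python rather
than to the regular expression:

```python
import re

# The %e/%E/%f/%g expression from the table above.
float_re = re.compile(r"[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?")

sample = "x=1.5 y=-2e3 z=.25"

# group(0) is the whole number; the inner groups are only for structure.
values = [float(m.group(0)) for m in float_re.finditer(sample)]
```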

To extract the filename and numbers from a string like ::

   /usr/sbin/sendmail - 0 errors, 4 warnings

you would use a :c:func:`scanf` format like ::

   %s - %d errors, %d warnings

The equivalent regular expression would be ::

   (\S+) - (\d+) errors, (\d+) warnings
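
Applied with :func:`re.match`, that expression might be used as follows. The
variable names are illustrative, and the explicit :func:`int` conversions stand
in for what ``%d`` would do in C:

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)

# Groups are always strings; convert the counts explicitly.
filename = m.group(1)
error_count = int(m.group(2))
warning_count = int(m.group(3))
```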

.. _search-vs-match:

search() vs. match()
^^^^^^^^^^^^^^^^^^^^

.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

Python offers two different primitive operations based on regular expressions:
:func:`re.match` checks for a match only at the beginning of the string, while
:func:`re.search` checks for a match anywhere in the string (this is what Perl
does by default).

For example::

   >>> re.match("c", "abcdef")  # No match
   >>> re.search("c", "abcdef") # Match
   <_sre.SRE_Match object at ...>

Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::

   >>> re.match("c", "abcdef")  # No match
   >>> re.search("^c", "abcdef") # No match
   >>> re.search("^a", "abcdef")  # Match
   <_sre.SRE_Match object at ...>

Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
beginning of the string, whereas using :func:`search` with a regular expression
beginning with ``'^'`` will match at the beginning of each line.

>>> re.match('X', 'A\nB\nX', re.MULTILINE)  # No match
>>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
<_sre.SRE_Match object at ...>
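
The same distinction can be seen with :func:`findall`, sketched here on a
made-up three-line string:

```python
import re

text = "A1\nB2\nX3"

# In MULTILINE mode '^' matches at the start of every line.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# Without the flag, '^' matches only at the start of the string.
string_start = re.findall(r"^\w+", text)

# match() is unaffected by MULTILINE: it still looks only at index 0.
at_start = re.match(r"X", text, re.MULTILINE)
```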

:func:`split` splits a string into a list of substrings, using the passed
pattern as the delimiter. The function is invaluable for converting textual
data into data structures that can be easily read and modified by Python, as
demonstrated in the following example that creates a phonebook.

First, here is the input. Normally it would come from a file; here we use
triple-quoted string syntax:

>>> text = """Ross McFluff: 834.345.1254 155 Elm Street
...
... Ronald Heathmore: 892.345.3428 436 Finley Avenue
... Frank Burger: 925.541.7625 662 South Dogwood Way
...
...
... Heather Albrecht: 548.326.4584 919 Park Place"""

The entries are separated by one or more newlines. Now we convert the string
into a list with each nonempty line having its own entry:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> entries = re.split("\n+", text)
   >>> entries
   ['Ross McFluff: 834.345.1254 155 Elm Street',
   'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
   'Frank Burger: 925.541.7625 662 South Dogwood Way',
   'Heather Albrecht: 548.326.4584 919 Park Place']

Finally, split each entry into a list with first name, last name, telephone
number, and address. We use the ``maxsplit`` parameter of :func:`split`
because the address contains spaces, which are also our splitting pattern:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 3) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

The ``:?`` pattern matches the colon after the last name, so that it does not
occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
house number from the street name:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 4) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
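
Going one step further, the split fields could be collected into a dictionary
keyed by full name. This is only a sketch: the ``phonebook`` structure and the
choice of key are assumptions, not part of the example above:

```python
import re

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""

phonebook = {}
for entry in re.split("\n+", text):
    # At most three splits, so the address keeps its internal spaces.
    first, last, phone, address = re.split(":? ", entry, maxsplit=3)
    phonebook[first + " " + last] = (phone, address)
```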

:func:`sub` replaces every occurrence of a pattern with a string or the
result of a function. This example demonstrates using :func:`sub` with
a function to "munge" text, or randomize the order of all the characters
in each word of a sentence except for the first and last characters::

   >>> import random
   >>> def repl(m):
   ...   inner_word = list(m.group(2))
   ...   random.shuffle(inner_word)
   ...   return m.group(1) + "".join(inner_word) + m.group(3)
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
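
Because the shuffled output differs on every run, a deterministic variant may
be easier to follow. The sketch below swaps the random shuffle for a plain
reversal of the inner characters; ``reverse_inner`` is a hypothetical helper
written for this illustration:

```python
import re

def reverse_inner(m):
    # Keep the first and last characters; reverse everything between them.
    return m.group(1) + m.group(2)[::-1] + m.group(3)

# Words shorter than three characters do not match and are left alone.
munged = re.sub(r"(\w)(\w+)(\w)", reverse_inner, "hello to the world")
```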

:func:`findall` matches *all* occurrences of a pattern, not just the first
one as :func:`search` does. For example, a writer who wanted to find all of
the adverbs in some text might use :func:`findall` in the following manner:

>>> text = "He was carefully disguised but captured quickly by police."
>>> re.findall(r"\w+ly", text)
['carefully', 'quickly']

Finding all Adverbs and their Positions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If one wants more information about all matches of a pattern than the matched
text, :func:`finditer` is useful as it provides instances of
:class:`MatchObject` instead of strings. Continuing with the previous example,
a writer who wanted to find all of the adverbs *and their positions* in some
text would use :func:`finditer` in the following manner:

>>> text = "He was carefully disguised but captured quickly by police."
>>> for m in re.finditer(r"\w+ly", text):
...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
07-16: carefully
40-47: quickly

Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
every backslash (``'\'``) in a regular expression would have to be prefixed with
another one to escape it. For example, the two following lines of code are
functionally identical:

>>> re.match(r"\W(.)\1\W", " ff ")
<_sre.SRE_Match object at ...>
>>> re.match("\\W(.)\\1\\W", " ff ")
<_sre.SRE_Match object at ...>

When one wants to match a literal backslash, it must be escaped in the regular
expression. With raw string notation, this means ``r"\\"``. Without raw string
notation, one must use ``"\\\\"``, making the following lines of code
functionally identical:

>>> re.match(r"\\", r"\\")
<_sre.SRE_Match object at ...>
>>> re.match("\\\\", r"\\")
<_sre.SRE_Match object at ...>
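
The equivalence can also be checked directly on the string literals, without
involving the regular expression engine. A minimal sketch:

```python
import re

# A raw string keeps the backslash; a regular string interprets the escape.
raw_length = len(r"\n")        # backslash followed by "n"
cooked_length = len("\n")      # a single newline character

# Both spellings produce the same two-character pattern string.
same_pattern = r"\\" == "\\\\"

# That pattern matches one literal backslash.
matched = re.match(r"\\", "\\").group()
```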