~ubuntu-branches/ubuntu/hardy/exim4/hardy-proposed

Committer: Bazaar Package Importer
Author(s): Marc Haber
Date: 2005-07-02 06:08:34 UTC
mfrom: (1.1.1 upstream)
Revision ID: james.westby@ubuntu.com-20050702060834-qk17pd52kb9nt3bj

Tags: 4.52-1

http://bugs.debian.org/315775

* new upstream version 4.51. (mh)
  * adapt 70_remove_exim-users_references
  * remove 37_gnutlsparams
  * adapt 36_pcre
  * adapt 31_eximmanpage
* fix package priorities to have them in sync with override again. (mh)
* Fix error in nb (Norwegian) translation.
  Thanks to Helge Hafting. (mh). Closes: #315775
* Standards-Version: 3.6.2, no changes needed. (mh)

files added:
OS/Makefile-GNUkFreeBSD

OS/Makefile-GNUkNetBSD

OS/os.h-GNUkFreeBSD

OS/os.h-GNUkNetBSD

debian/config-custom/debian/install

debian/linda

debian/linda-overrides

debian/linda/overrides

debian/linda/overrides/exim4-daemon-heavy

debian/linda/overrides/exim4-daemon-light

debian/lintian

debian/lintian/overrides

debian/lintian/overrides/exim4-config

debian/lintian/overrides/exim4-daemon-heavy

debian/lintian/overrides/exim4-daemon-light

debian/patches/60_convert4r4.dpatch

debian/patches/70_remove_exim-users_references.dpatch

debian/po/et.po

debian/po/gl.po

debian/po/tl.po

debian/po/vi.po

doc/experimental-spec.txt

src/auths/cyrus_sasl.c

src/auths/cyrus_sasl.h

src/bmi_spam.c

src/bmi_spam.h

src/demime.c

src/demime.h

src/dk.c

src/dk.h

src/lookups/lf_quote.c

src/lookups/spf.c

src/lookups/spf.h

src/malware.c

src/mime.c

src/mime.h

src/pcre/pcre_compile.c

src/pcre/pcre_config.c

src/pcre/pcre_exec.c

src/pcre/pcre_fullinfo.c

src/pcre/pcre_get.c

src/pcre/pcre_globals.c

src/pcre/pcre_internal.h

src/pcre/pcre_maketables.c

src/pcre/pcre_printint.c

src/pcre/pcre_study.c

src/pcre/pcre_tables.c

src/pcre/pcre_try_flipped.c

src/pcre/pcre_version.c

src/pcre/ucp.h

src/regex.c

src/spam.c

src/spam.h

src/spf.c

src/spf.h

src/spool_mbox.c

src/srs.c

src/srs.h

util/README

util/mkcdb.pl

files removed:
OS/Makefile-Linux-libc5

OS/os.c-Linux-libc5

OS/os.h-Linux-libc5

debian/README.TLS

debian/config-custom/debian/files

debian/debconf/30_exim4-config_example_check_rcpt

debian/exim4-config-medium

debian/exim4-config-medium/debian

debian/exim4-config-medium/debian/changelog

debian/exim4-config-medium/debian/compat

debian/exim4-config-medium/debian/config

debian/exim4-config-medium/debian/config/30_exim4-config_example_check_rcpt

debian/exim4-config-medium/debian/config/conf.d

debian/exim4-config-medium/debian/config/conf.d/30_exim4-config-medium_example_check_rcpt

debian/exim4-config-medium/debian/config/conf.d/conf.d

debian/exim4-config-medium/debian/config/conf.d/conf.d/acl

debian/exim4-config-medium/debian/config/conf.d/conf.d/acl/00_exim4-config-medium_header

debian/exim4-config-medium/debian/config/conf.d/conf.d/acl/20_exim4-config-medium_whitelist_local_deny

debian/exim4-config-medium/debian/config/conf.d/conf.d/acl/30_exim4-config-medium_check_rcpt

debian/exim4-config-medium/debian/config/conf.d/conf.d/acl/40_exim4-config-medium_check_data

debian/exim4-config-medium/debian/config/conf.d/conf.d/auth

debian/exim4-config-medium/debian/config/conf.d/conf.d/auth/00_exim4-config-medium_header

debian/exim4-config-medium/debian/config/conf.d/conf.d/auth/30_exim4-config-medium_examples

debian/exim4-config-medium/debian/config/conf.d/conf.d/main

debian/exim4-config-medium/debian/config/conf.d/conf.d/main/01_exim4-config-medium_listmacrosdefs

debian/exim4-config-medium/debian/config/conf.d/conf.d/main/02_exim4-config-medium_options

debian/exim4-config-medium/debian/config/conf.d/conf.d/main/03_exim4-config-medium_tlsoptions

debian/exim4-config-medium/debian/config/conf.d/conf.d/retry

debian/exim4-config-medium/debian/config/conf.d/conf.d/retry/00_exim4-config-medium_header

debian/exim4-config-medium/debian/config/conf.d/conf.d/retry/30_exim4-config-medium

debian/exim4-config-medium/debian/config/conf.d/conf.d/rewrite

debian/exim4-config-medium/debian/config/conf.d/conf.d/rewrite/00_exim4-config-medium_header

debian/exim4-config-medium/debian/config/conf.d/conf.d/rewrite/31_exim4-config-medium_rewriting

debian/exim4-config-medium/debian/config/conf.d/conf.d/router

debian/exim4-config-medium/debian/config/conf.d/conf.d/router/00_exim4-config-medium_header

debian/exim4-config-medium/debian/config/conf.d/conf.d/router/100_exim4-config-medium_domain_literal

debian/exim4-config-medium/debian/config/conf.d/conf.d/router/200_exim4-config-medium_primary

debian/exim4-config-medium/debian/config/conf.d/conf.d/router/300_exim4-config-medium_real_local

debian/exim4-config-medium/debian/config/conf.d/conf.d/router/400_exim4-config-medium_system_aliases

debian/exim4-config-medium/debian/config/conf.d/conf.d/router/500_exim4-config-medium_hubuser

debian/exim4-config-medium/debian/config/conf.d/conf.d/router/600_exim4-config-medium_userforward

debian/exim4-config-medium/debian/config/conf.d/conf.d/router/700_exim4-config-medium_procmail

debian/exim4-config-medium/debian/config/conf.d/conf.d/router/800_exim4-config-medium_maildrop

debian/exim4-config-medium/debian/config/conf.d/conf.d/router/900_exim4-config-medium_local_user

debian/exim4-config-medium/debian/config/conf.d/conf.d/router/mmm_mail4root

debian/exim4-config-medium/debian/config/conf.d/conf.d/transport

debian/exim4-config-medium/debian/config/conf.d/conf.d/transport/00_exim4-config-medium_header

debian/exim4-config-medium/debian/config/conf.d/conf.d/transport/30_exim4-config-medium_address_file

debian/exim4-config-medium/debian/config/conf.d/conf.d/transport/30_exim4-config-medium_address_pipe

debian/exim4-config-medium/debian/config/conf.d/conf.d/transport/30_exim4-config-medium_address_reply

debian/exim4-config-medium/debian/config/conf.d/conf.d/transport/30_exim4-config-medium_mail_spool

debian/exim4-config-medium/debian/config/conf.d/conf.d/transport/30_exim4-config-medium_maildir_home

debian/exim4-config-medium/debian/config/conf.d/conf.d/transport/30_exim4-config-medium_maildrop_pipe

debian/exim4-config-medium/debian/config/conf.d/conf.d/transport/30_exim4-config-medium_procmail_pipe

debian/exim4-config-medium/debian/config/conf.d/conf.d/transport/30_exim4-config-medium_remote_smtp

debian/exim4-config-medium/debian/config/conf.d/conf.d/transport/35_exim4-config-medium_address_directory

debian/exim4-config-medium/debian/config/conf.d/default_acl

debian/exim4-config-medium/debian/config/default_acl

debian/exim4-config-medium/debian/config/update-exim4.conf

debian/exim4-config-medium/debian/control

debian/exim4-config-medium/debian/copyright

debian/exim4-config-medium/debian/email-addresses

debian/exim4-config-medium/debian/exim4-config-medium.dirs

debian/exim4-config-medium/debian/exim4-config-medium.manpages

debian/exim4-config-medium/debian/exim4-config-medium.postinst

debian/exim4-config-medium/debian/exim4-config-medium.postrm

debian/exim4-config-medium/debian/ip-up.d

debian/exim4-config-medium/debian/manpages

debian/exim4-config-medium/debian/manpages/update-exim4.conf.8

debian/exim4-config-medium/debian/manpages/update-exim4defaults.8

debian/exim4-config-medium/debian/rules

debian/exim4-config-medium/debian/update-exim4.conf.conf

debian/exim4-config-medium/debian/update-exim4defaults

debian/exim4-config-simple

debian/exim4-config-simple/debian

debian/exim4-config-simple/debian/changelog

debian/exim4-config-simple/debian/compat

debian/exim4-config-simple/debian/control

debian/exim4-config-simple/debian/copyright

debian/exim4-config-simple/debian/debconf

debian/exim4-config-simple/debian/debconf/update-exim4.conf

debian/exim4-config-simple/debian/exim4-config-simple.dirs

debian/exim4-config-simple/debian/exim4-config-simple.manpages

debian/exim4-config-simple/debian/exim4-config-simple.postinst

debian/exim4-config-simple/debian/exim4-config-simple.postrm

debian/exim4-config-simple/debian/exim4.conf.defaults

debian/exim4-config-simple/debian/exim4.conf.source

debian/exim4-config-simple/debian/manpages

debian/exim4-config-simple/debian/manpages/update-exim4.conf.8

debian/exim4-config-simple/debian/rules

debian/patches/10_daemon_close_fds.dpatch

debian/patches/60_upstream_fixes.dpatch

debian/patches/61_queryprogramrouter.dpatch

debian/patches/62_statvfs.dpatch

debian/patches/63_nomorecrashongnutlserror.dpatch

debian/patches/64_pipeliningfixup.dpatch

debian/patches/65_tidydb-splitspool.dpatch

debian/patches/66_can2005-0021_can2005-0022.dpatch

debian/patches/exiscan.patch

files modified:
ACKNOWLEDGMENTS

CHANGES

LICENCE

Makefile

NOTICE

OS/Makefile-AIX

OS/Makefile-BSDI

OS/Makefile-Base

OS/Makefile-CYGWIN

OS/Makefile-DGUX

OS/Makefile-Darwin

OS/Makefile-Default

OS/Makefile-FreeBSD

OS/Makefile-GNU

OS/Makefile-HI-OSF

OS/Makefile-HI-UX

OS/Makefile-HP-UX

OS/Makefile-HP-UX-9

OS/Makefile-IRIX

OS/Makefile-IRIX6

OS/Makefile-IRIX632

OS/Makefile-IRIX65

OS/Makefile-Linux

OS/Makefile-NetBSD

OS/Makefile-NetBSD-a.out

OS/Makefile-OSF1

OS/Makefile-OpenBSD

OS/Makefile-OpenUNIX

OS/Makefile-QNX

OS/Makefile-SCO

OS/Makefile-SCO_SV

OS/Makefile-SunOS4

OS/Makefile-SunOS5

OS/Makefile-SunOS5-hal

OS/Makefile-ULTRIX

OS/Makefile-UNIX_SV

OS/Makefile-USG

OS/Makefile-Unixware7

OS/Makefile-mips

OS/eximon.conf-Default

OS/os.Configuring

OS/os.c-GNU

OS/os.c-HI-OSF

OS/os.c-IRIX

OS/os.c-IRIX6

OS/os.c-IRIX632

OS/os.c-IRIX65

OS/os.c-Linux

OS/os.c-OSF1

OS/os.c-cygwin

OS/os.h-AIX

OS/os.h-BSDI

OS/os.h-DGUX

OS/os.h-Darwin

OS/os.h-FreeBSD

OS/os.h-GNU

OS/os.h-HI-OSF

OS/os.h-HI-UX

OS/os.h-HP-UX

OS/os.h-HP-UX-9

OS/os.h-IRIX

OS/os.h-IRIX6

OS/os.h-IRIX632

OS/os.h-IRIX65

OS/os.h-Linux

OS/os.h-NetBSD

OS/os.h-NetBSD-a.out

OS/os.h-OSF1

OS/os.h-OpenBSD

OS/os.h-OpenUNIX

OS/os.h-QNX

OS/os.h-SCO

OS/os.h-SCO_SV

OS/os.h-SunOS4

OS/os.h-SunOS5

OS/os.h-SunOS5-hal

OS/os.h-ULTRIX

OS/os.h-UNIX_SV

OS/os.h-USG

OS/os.h-Unixware7

OS/os.h-cygwin

OS/os.h-mips

README

README.UPDATING

debian/EDITME.exim4-heavy.diff

debian/EDITME.exim4-light.diff

debian/EDITME.eximon.diff

debian/README.Debian

debian/README.Debian-accountname

debian/README.Debian.UUCP

debian/README.Debian.xinetd

debian/README.SMTP-AUTH

debian/README.system_aliases

debian/TODO

debian/changelog

debian/config-custom/create-custom-config-package

debian/config-custom/debian/rules

debian/control

debian/create-custom-package

debian/debconf/conf.d/acl/20_exim4-config_whitelist_local_deny

debian/debconf/conf.d/acl/30_exim4-config_check_rcpt

debian/debconf/conf.d/acl/40_exim4-config_check_data

debian/debconf/conf.d/auth/30_exim4-config_examples

debian/debconf/conf.d/main/01_exim4-config_listmacrosdefs

debian/debconf/conf.d/main/02_exim4-config_options

debian/debconf/conf.d/main/03_exim4-config_tlsoptions

debian/debconf/conf.d/rewrite/31_exim4-config_rewriting

debian/debconf/conf.d/router/100_exim4-config_domain_literal

debian/debconf/conf.d/router/300_exim4-config_real_local

debian/debconf/conf.d/router/400_exim4-config_system_aliases

debian/debconf/conf.d/router/500_exim4-config_hubuser

debian/debconf/conf.d/transport/30_exim4-config_maildir_home

debian/debconf/default_acl

debian/debconf/update-exim4.conf

debian/debconf/update-exim4.conf.template

debian/exim-gencert

debian/exim4-base.config

debian/exim4-base.cron.daily

debian/exim4-base.docs

debian/exim4-base.init

debian/exim4-base.postinst

debian/exim4-base.postrm

debian/exim4-base.templates

debian/exim4-config.NEWS

debian/exim4-config.config

debian/exim4-config.docs

debian/exim4-config.install

debian/exim4-config.postinst

debian/exim4-config.postrm

debian/exim4-config.templates

debian/exim4-config.templates.master

debian/exim4-daemon-custom.links

debian/exim4-daemon-heavy.install

debian/exim4-daemon-heavy.links

debian/exim4-daemon-light.install

debian/exim4-daemon-light.links

debian/exim4-daemon-light.postinst

debian/exim4-daemon-light.prerm

debian/ip-up.d

debian/manpages/exiwhat.8

debian/manpages/update-exim4.conf.8

debian/manpages/update-exim4.conf.template.8

debian/patches/00list

debian/patches/31_eximmanpage.dpatch

debian/patches/36_pcre.dpatch

debian/po/ar.po

debian/po/bg.po

debian/po/bs.po

debian/po/ca.po

debian/po/cs.po

debian/po/cy.po

debian/po/da.po

debian/po/de.po

debian/po/el.po

debian/po/es.po

debian/po/eu.po

debian/po/fi.po

debian/po/fr.po

debian/po/he.po

debian/po/hr.po

debian/po/hu.po

debian/po/id.po

debian/po/it.po

debian/po/ja.po

debian/po/ko.po

debian/po/lt.po

debian/po/mk.po

debian/po/nb.po

debian/po/nl.po

debian/po/nn.po

debian/po/pl.po

debian/po/pt.po

debian/po/pt_BR.po

debian/po/ro.po

debian/po/ru.po

debian/po/sk.po

debian/po/sl.po

debian/po/sq.po

debian/po/sv.po

debian/po/templates.pot

debian/po/tr.po

debian/po/uk.po

debian/po/zh_CN.po

debian/po/zh_TW.po

debian/rules

debian/script

debian/update-exim4defaults

doc/ChangeLog

doc/Exim3.upgrade

doc/Exim4.upgrade

doc/NewStuff

doc/OptionLists.txt

doc/README

doc/README.SIEVE

doc/dbm.discuss.txt

doc/exim.8

doc/filter.txt

doc/pcrepattern.txt

doc/pcretest.txt

doc/spec.txt

exim_monitor/EDITME

exim_monitor/em_StripChart.c

exim_monitor/em_TextPop.c

exim_monitor/em_globals.c

exim_monitor/em_hdr.h

exim_monitor/em_init.c

exim_monitor/em_log.c

exim_monitor/em_main.c

exim_monitor/em_menu.c

exim_monitor/em_queue.c

exim_monitor/em_strip.c

exim_monitor/em_text.c

exim_monitor/em_version.c

exim_monitor/em_xs.c

scripts/Configure

scripts/Configure-Makefile

scripts/Configure-config.h

scripts/Configure-eximon

scripts/Configure-os.c

scripts/Configure-os.h

scripts/MakeLinks

scripts/arch-type

scripts/exim_install

scripts/newer

scripts/os-type

src/EDITME

src/acl.c

src/aliases.default

src/auths/Makefile

src/auths/README

src/auths/auth-spa.c

src/auths/auth-spa.h

src/auths/b64decode.c

src/auths/b64encode.c

src/auths/call_pam.c

src/auths/call_pwcheck.c

src/auths/call_radius.c

src/auths/cram_md5.c

src/auths/cram_md5.h

src/auths/get_data.c

src/auths/get_no64_data.c

src/auths/md5.c

src/auths/plaintext.c

src/auths/plaintext.h

src/auths/pwcheck.c

src/auths/pwcheck.h

src/auths/sha1.c

src/auths/spa.c

src/auths/spa.h

src/auths/xtextdecode.c

src/auths/xtextencode.c

src/buildconfig.c

src/child.c

src/config.h.defaults

src/configure.default

src/convert4r3.src

src/convert4r4.src

src/crypt16.c

src/daemon.c

src/dbfn.c

src/dbfunctions.h

src/dbstuff.h

src/debug.c

src/deliver.c

src/directory.c

src/dns.c

src/drtables.c

src/dummies.c

src/enq.c

src/exicyclog.src

src/exigrep.src

src/exim.c

src/exim.h

src/exim_checkaccess.src

src/exim_dbmbuild.c

src/exim_dbutil.c

src/exim_lock.c

src/eximon.src

src/eximstats.src

src/exinext.src

src/exipick.src

src/exiqgrep.src

src/exiqsumm.src

src/exiwhat.src

src/expand.c

src/filter.c

src/filtertest.c

src/functions.h

src/globals.c

src/globals.h

src/header.c

src/host.c

src/ip.c

src/local_scan.c

src/local_scan.h

src/log.c

src/lookups/Makefile

src/lookups/README

src/lookups/cdb.c

src/lookups/cdb.h

src/lookups/dbmdb.c

src/lookups/dbmdb.h

src/lookups/dnsdb.c

src/lookups/dnsdb.h

src/lookups/dsearch.c

src/lookups/dsearch.h

src/lookups/ibase.c

src/lookups/ibase.h

src/lookups/ldap.c

src/lookups/ldap.h

src/lookups/lf_check_file.c

src/lookups/lf_functions.h

src/lookups/lsearch.c

src/lookups/lsearch.h

src/lookups/mysql.c

src/lookups/mysql.h

src/lookups/nis.c

src/lookups/nis.h

src/lookups/nisplus.c

src/lookups/nisplus.h

src/lookups/oracle.c

src/lookups/oracle.h

src/lookups/passwd.c

src/lookups/passwd.h

src/lookups/pgsql.c

src/lookups/pgsql.h

src/lookups/testdb.c

src/lookups/testdb.h

src/lookups/whoson.c

src/lookups/whoson.h

src/lss.c

src/macros.h

src/match.c

src/moan.c

src/mytypes.h

src/os.c

src/osfunctions.h

src/parse.c

src/pcre/ChangeLog

src/pcre/LICENCE

src/pcre/Makefile

src/pcre/README

src/pcre/config.h

src/pcre/dftables.c

src/pcre/get.c

src/pcre/internal.h

src/pcre/maketables.c

src/pcre/pcre.c

src/pcre/pcre.h

src/pcre/pcretest.c

src/pcre/printint.c

src/pcre/study.c

src/perl.c

src/queue.c

src/rda.c

src/readconf.c

src/receive.c

src/retry.c

src/rewrite.c

src/rfc2047.c

src/route.c

src/routers/Makefile

src/routers/README

src/routers/accept.c

src/routers/accept.h

src/routers/dnslookup.c

src/routers/dnslookup.h

src/routers/ipliteral.c

src/routers/ipliteral.h

src/routers/iplookup.c

src/routers/iplookup.h

src/routers/manualroute.c

src/routers/manualroute.h

src/routers/queryprogram.c

src/routers/queryprogram.h

src/routers/redirect.c

src/routers/redirect.h

src/routers/rf_change_domain.c

src/routers/rf_expand_data.c

src/routers/rf_functions.h

src/routers/rf_get_errors_address.c

src/routers/rf_get_munge_headers.c

src/routers/rf_get_transport.c

src/routers/rf_get_ugid.c

src/routers/rf_lookup_hostlist.c

src/routers/rf_queue_add.c

src/routers/rf_self_action.c

src/routers/rf_set_ugid.c

src/search.c

src/sieve.c

src/smtp_in.c

src/smtp_out.c

src/spool_in.c

src/spool_out.c

src/store.c

src/store.h

src/string.c

src/structs.h

src/tls-gnu.c

src/tls-openssl.c

src/tls.c

src/tod.c

src/transport-filter.src

src/transport.c

src/transports/Makefile

src/transports/README

src/transports/appendfile.c

src/transports/appendfile.h

src/transports/autoreply.c

src/transports/autoreply.h

src/transports/lmtp.c

src/transports/lmtp.h

src/transports/pipe.c

src/transports/pipe.h

src/transports/smtp.c

src/transports/smtp.h

src/transports/tf_maildir.c

src/transports/tf_maildir.h

src/tree.c

src/verify.c

src/version.c

util/cramtest.pl

util/logargs.sh

util/unknownuser.sh


doc/pcrepattern.txt

This file contains the PCRE man page that describes the regular expressions

supported by PCRE version 4.5. Note that not all of the features are relevant

This file contains the PCRE man page that describes the regular expressions

supported by PCRE version 6.0. Note that not all of the features are relevant

in the context of Exim. In particular, the version of PCRE that is compiled

with Exim does not include UTF-8 support, there is no mechanism for changing

the options with which the PCRE functions are called, and features such as

callout are not accessible.

-----------------------------------------------------------------------------

PCRE(3) PCRE(3)

NAME

PCRE - Perl-compatible regular expressions

PCRE REGULAR EXPRESSION DETAILS

The syntax and semantics of the regular expressions supported by PCRE

are described below. Regular expressions are also described in the Perl

documentation and in a number of other books, some of which have copi-

ous examples. Jeffrey Friedl's "Mastering Regular Expressions", pub-

lished by O'Reilly, covers them in great detail. The description here

is intended as reference documentation.

The basic operation of PCRE is on strings of bytes. However, there is

also support for UTF-8 character strings. To use this support you must

build PCRE to include UTF-8 support, and then call pcre_compile() with

the PCRE_UTF8 option. How this affects the pattern matching is men-

tioned in several places below. There is also a summary of UTF-8 fea-

tures in the section on UTF-8 support in the main pcre page.

A regular expression is a pattern that is matched against a subject

string from left to right. Most characters stand for themselves in a

pattern, and match the corresponding characters in the subject. As a

documentation and in a number of books, some of which have copious

examples. Jeffrey Friedl's "Mastering Regular Expressions", published

by O'Reilly, covers regular expressions in great detail. This descrip-

tion of PCRE's regular expressions is intended as reference material.

The original operation of PCRE was on strings of one-byte characters.

However, there is now also support for UTF-8 character strings. To use

this, you must build PCRE to include UTF-8 support, and then call

pcre_compile() with the PCRE_UTF8 option. How this affects pattern

matching is mentioned in several places below. There is also a summary

of UTF-8 features in the section on UTF-8 support in the main pcre

page.

The remainder of this document discusses the patterns that are sup-

ported by PCRE when its main matching function, pcre_exec(), is used.

From release 6.0, PCRE offers a second matching function,

pcre_dfa_exec(), which matches using a different algorithm that is not

Perl-compatible. The advantages and disadvantages of the alternative

function, and how it differs from the normal function, are discussed in

the pcrematching page.

A regular expression is a pattern that is matched against a subject

string from left to right. Most characters stand for themselves in a

pattern, and match the corresponding characters in the subject. As a

trivial example, the pattern

The quick brown fox

matches a portion of a subject string that is identical to itself. The

power of regular expressions comes from the ability to include alterna-

tives and repetitions in the pattern. These are encoded in the pattern

by the use of meta-characters, which do not stand for themselves but

instead are interpreted in some special way.

There are two different sets of meta-characters: those that are recog-

matches a portion of a subject string that is identical to itself. When

caseless matching is specified (the PCRE_CASELESS option), letters are

matched independently of case. In UTF-8 mode, PCRE always understands

the concept of case for characters whose values are less than 128, so

caseless matching is always possible. For characters with higher val-

ues, the concept of case is supported if PCRE is compiled with Unicode

property support, but not otherwise. If you want to use caseless

matching for characters 128 and above, you must ensure that PCRE is

compiled with Unicode property support as well as with UTF-8 support.

The power of regular expressions comes from the ability to include

alternatives and repetitions in the pattern. These are encoded in the

pattern by the use of metacharacters, which do not stand for themselves

but instead are interpreted in some special way.

There are two different sets of metacharacters: those that are recog-

nized anywhere in the pattern except within square brackets, and those

that are recognized in square brackets. Outside square brackets, the

meta-characters are as follows:

metacharacters are as follows:

\ general escape character with several uses

^ assert start of string (or line, in multiline mode)

{ start min/max quantifier

Part of a pattern that is in square brackets is called a "character

class". In a character class the only meta-characters are:

class". In a character class the only metacharacters are:

\ general escape character

^ negate the class, but only if the first character

syntax)

] terminates the character class

The following sections describe the use of each of the meta-characters.

The following sections describe the use of each of the metacharacters.

BACKSLASH

The backslash character has several uses. Firstly, if it is followed by

a non-alphameric character, it takes away any special meaning that

a non-alphanumeric character, it takes away any special meaning that

character may have. This use of backslash as an escape character

applies both inside and outside character classes.

For example, if you want to match a * character, you write \* in the

pattern. This escaping action applies whether or not the following

character would otherwise be interpreted as a meta-character, so it is

always safe to precede a non-alphameric with backslash to specify that

it stands for itself. In particular, if you want to match a backslash,

you write \\.

character would otherwise be interpreted as a metacharacter, so it is

always safe to precede a non-alphanumeric with backslash to specify

that it stands for itself. In particular, if you want to match a back-

slash, you write \\.
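
The escaping rule described above can be tried in any Perl-style regex engine. The sketch below uses Python's standard `re` module as a stand-in, which is an assumption made for illustration: Exim links against PCRE, and the two engines differ in other respects. The sample strings are made up.

```python
import re

# A backslash before a non-alphanumeric character makes it literal.
star = re.compile(r"\*")        # matches a literal asterisk
backslash = re.compile(r"\\")   # matches a literal backslash

assert star.search("a*b").group() == "*"
assert backslash.search("C:\\tmp").group() == "\\"

# Escaping is harmless even when the character is not a metacharacter.
assert re.fullmatch(r"\@", "@") is not None

# re.escape() applies the same idea to every character of a string.
literal = re.escape("1+1=2?")
assert re.fullmatch(literal, "1+1=2?") is not None
```

Escaping a whole string with `re.escape()` is the usual choice when the text to match comes from outside the program and must never be read as a pattern.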

If a pattern is compiled with the PCRE_EXTENDED option, whitespace in

the pattern (other than in a character class) and characters between a

The \Q...\E sequence is recognized both inside and outside character

classes.

Non-printing characters

A second use of backslash provides a way of encoding non-printing char-

acters in patterns in a visible manner. There is no restriction on the

appearance of non-printing characters, apart from the binary zero that

must be less than 2**31 (that is, the maximum hexadecimal value is

7FFFFFFF). If characters other than hexadecimal digits appear between

\x{ and }, or if there is no terminating }, this form of escape is not

recognized. Instead, the initial \x will be interpreted as a basic hex-

adecimal escape, with no following digits, giving a byte whose value is

zero.

recognized. Instead, the initial \x will be interpreted as a basic

hexadecimal escape, with no following digits, giving a character whose

value is zero.

Characters whose value is less than 256 can be defined by either of the

two syntaxes for \x when PCRE is in UTF-8 mode. There is no difference

there are fewer than two digits, just those that are present are used.

Thus the sequence \0\x\07 specifies two binary zeros followed by a BEL

character (code value 7). Make sure you supply two digits after the

initial zero if the character that follows is itself an octal digit.

initial zero if the pattern character that follows is itself an octal

digit.

The handling of a backslash followed by a digit other than 0 is compli-

cated. Outside a character class, PCRE reads it and any following dig-

its as a decimal number. If the number is less than 10, or if there

have been at least that many previous capturing left parentheses in the

expression, the entire sequence is taken as a back reference. A

description of how this works is given later, following the discussion

of parenthesized subpatterns.

Inside a character class, or if the decimal number is greater than 9

and there have not been that many capturing subpatterns, PCRE re-reads

up to three octal digits following the backslash, and generates a sin-

gle byte from the least significant 8 bits of the value. Any subsequent

digits stand for themselves. For example:

  \81    is either a back reference, or a binary zero

         followed by the two characters "8" and "1"

Note that octal values of 100 or greater must not be introduced by a

leading zero, because no more than three octal digits are ever read.

All the sequences that define a single byte value or a single UTF-8

character (in UTF-8 mode) can be used both inside and outside character

classes. In addition, inside a character class, the sequence \b is

interpreted as the backspace character (hex 08). Outside a character

class it has a different meaning (see below).

The third use of backslash is for specifying generic character types:

classes. In addition, inside a character class, the sequence \b is

interpreted as the backspace character (hex 08), and the sequence \X is

interpreted as the character "X". Outside a character class, these

sequences have different meanings (see below).

Generic character types

The third use of backslash is for specifying generic character types.

The following are always recognized:

  \d     any decimal digit

  \D     any character that is not a decimal digit

  \W     any "non-word" character

Each pair of escape sequences partitions the complete set of characters

into two disjoint sets. Any given character matches one, and only one,

of each pair.
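
The partition property stated above can be checked mechanically. The following is a minimal sketch in Python's standard `re` module, used here only as a stand-in for PCRE (an assumption; the helper name `matches_exactly_one` is hypothetical). The `re.ASCII` flag keeps the classes byte-oriented, roughly like PCRE without UTF-8 mode, although Python's `\s` also accepts VT, which PCRE's does not.

```python
import re

# Each escape and its complement, as (positive, negated) pairs.
PAIRS = [(r"\d", r"\D"), (r"\s", r"\S"), (r"\w", r"\W")]

def matches_exactly_one(ch, pos, neg):
    """True if ch matches exactly one of the two escapes in a pair."""
    hit_pos = re.fullmatch(pos, ch, re.ASCII) is not None
    hit_neg = re.fullmatch(neg, ch, re.ASCII) is not None
    return hit_pos != hit_neg

# Check every one-byte character value against every pair.
all_ok = all(
    matches_exactly_one(chr(code), pos, neg)
    for code in range(256)
    for pos, neg in PAIRS
)
assert all_ok
```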

In UTF-8 mode, characters with values greater than 255 never match \d,

\s, or \w, and always match \D, \S, and \W.

For compatibility with Perl, \s does not match the VT character (code

11). This makes it different from the the POSIX "space" class. The \s

characters are HT (9), LF (10), FF (12), CR (13), and space (32).

A "word" character is any letter or digit or the underscore character,

that is, any character which can be part of a Perl "word". The defini-

tion of letters and digits is controlled by PCRE's character tables,

and may vary if locale- specific matching is taking place (see "Locale

support" in the pcreapi page). For example, in the "fr" (French)

locale, some character codes greater than 128 are used for accented

letters, and these are matched by \w.

These character type sequences can appear both inside and outside char-

acter classes. They each match one character of the appropriate type.

If the current matching point is at the end of the subject string, all

of them fail, since there is no character to match.

For compatibility with Perl, \s does not match the VT character (code

11). This makes it different from the the POSIX "space" class. The \s

characters are HT (9), LF (10), FF (12), CR (13), and space (32).
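
The five-character set listed above is easy to spell out as an explicit class. The sketch below does so in Python, whose own `\s` does accept VT, so the contrast with the set described here is visible. The names `PCRE_SPACE` and `pcre_style_space` are hypothetical, and the class only emulates the documented set; it does not call PCRE.

```python
import re

# The characters the text lists for \s: HT, LF, FF, CR and space.
PCRE_SPACE = re.compile(r"[\t\n\f\r ]")

def pcre_style_space(ch):
    """True if ch is one of the five characters listed for \\s."""
    return PCRE_SPACE.fullmatch(ch) is not None

VT = "\x0b"  # vertical tab, code 11

assert re.fullmatch(r"\s", VT) is not None   # Python's \s accepts VT
assert not pcre_style_space(VT)              # the documented set does not
assert all(pcre_style_space(chr(c)) for c in (9, 10, 12, 13, 32))
```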

A "word" character is an underscore or any character less than 256 that

is a letter or digit. The definition of letters and digits is con-

trolled by PCRE's low-valued character tables, and may vary if locale-

specific matching is taking place (see "Locale support" in the pcreapi

page). For example, in the "fr_FR" (French) locale, some character

codes greater than 128 are used for accented letters, and these are

matched by \w.

In UTF-8 mode, characters with values greater than 128 never match \d,

\s, or \w, and always match \D, \S, and \W. This is true even when Uni-

code character property support is available.

Unicode character properties

When PCRE is built with Unicode character property support, three addi-

tional escape sequences to match generic character types are available

when UTF-8 mode is selected. They are:

  \p{xx}   a character with the xx property

  \P{xx}   a character without the xx property

  \X       an extended Unicode sequence

The property names represented by xx above are limited to the Unicode

general category properties. Each character has exactly one such prop-

erty, specified by a two-letter abbreviation. For compatibility with

Perl, negation can be specified by including a circumflex between the

opening brace and the property name. For example, \p{^Lu} is the same

as \P{Lu}.

If only one letter is specified with \p or \P, it includes all the

properties that start with that letter. In this case, in the absence of

negation, the curly brackets in the escape sequence are optional; these

two examples have the same effect:

  \p{L}

  \pL

279

280

The following property codes are supported:

281

282

C Other

283

Cc Control

284

Cf Format

285

Cn Unassigned

286

Co Private use

287

Cs Surrogate

288

289

L Letter

290

Ll Lower case letter

291

Lm Modifier letter

292

Lo Other letter

293

Lt Title case letter

294

Lu Upper case letter

295

296

M Mark

297

Mc Spacing mark

298

Me Enclosing mark

299

Mn Non-spacing mark

300

301

N Number

302

Nd Decimal number

303

Nl Letter number

304

No Other number

305

306

P Punctuation

307

Pc Connector punctuation

308

Pd Dash punctuation

309

Pe Close punctuation

310

Pf Final punctuation

311

Pi Initial punctuation

312

Po Other punctuation

313

Ps Open punctuation

314

315

S Symbol

316

Sc Currency symbol

317

Sk Modifier symbol

318

Sm Mathematical symbol

319

So Other symbol

320

321

Z Separator

322

Zl Line separator

323

Zp Paragraph separator

324

Zs Space separator

325

326

Extended properties such as "Greek" or "InMusicalSymbols" are not sup-

327

ported by PCRE.

328

329

Specifying caseless matching does not affect these escape sequences.

330

For example, \p{Lu} always matches only upper case letters.
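As a rough illustration of the general category codes listed above, the following Python sketch uses the standard unicodedata module rather than PCRE itself. The helper has_property is hypothetical; it only mimics the test that \p{xx} and \P{xx} perform and says nothing about how PCRE implements the lookup.

```python
import unicodedata

def has_property(ch, code):
    """Return True if ch's general category starts with code (e.g. 'L' or 'Lu')."""
    return unicodedata.category(ch).startswith(code)

sample = "aBcD1_"

# Like \p{Lu}: upper case letters only
upper = [c for c in sample if has_property(c, "Lu")]

# Like \pL: a one-letter code covers every category that starts with that letter
letters = [c for c in sample if has_property(c, "L")]

# Like \P{Lu} (or \p{^Lu}): everything that is not an upper case letter
not_upper = [c for c in sample if not has_property(c, "Lu")]
```

Here "1" falls in category Nd and "_" in category Pc, so neither counts as a letter.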

331

332

The \X escape matches any number of Unicode characters that form an

333

extended Unicode sequence. \X is equivalent to

334

335

(?>\PM\pM*)

336

337

That is, it matches a character without the "mark" property, followed

338

by zero or more characters with the "mark" property, and treats the

339

sequence as an atomic group (see below). Characters with the "mark"

340

property are typically accents that affect the preceding character.
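The grouping that \X performs can be approximated outside PCRE. The sketch below is an assumption-laden emulation in Python using unicodedata; the function name extended_sequences is hypothetical. One simplification: a mark at the very start of the text is kept as its own item, whereas \PM\pM* could not begin a match there.

```python
import unicodedata

def extended_sequences(text):
    """Split text into runs of one non-mark character plus any following marks."""
    sequences = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and sequences:
            sequences[-1] += ch      # a mark attaches to the preceding sequence
        else:
            sequences.append(ch)     # anything else starts a new sequence
    return sequences

# "e" followed by U+0301 (combining acute accent, category Mn), then "x"
parts = extended_sequences("e\u0301x")
```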

341

342

Matching characters by Unicode property is not fast, because PCRE has

343

to search a structure that contains data for over fifteen thousand

344

characters. That is why the traditional escape sequences such as \d and

345

\w do not use Unicode properties in PCRE.

346

347

Simple assertions

348

230

349

The fourth use of backslash is for certain simple assertions. An asser-

231

tion specifies a condition that has to be met at a particular point in

232

a match, without consuming any characters from the subject string. The

233

use of subpatterns for more complicated assertions is described below.

234

The backslashed assertions are

350

tion specifies a condition that has to be met at a particular point in

351

a match, without consuming any characters from the subject string. The

352

use of subpatterns for more complicated assertions is described below.

353

The backslashed assertions are:

235

354

236

355

\b matches at a word boundary

237

356

\B matches when not at a word boundary

240

359

\z matches at end of subject

241

360

\G matches at first matching position in subject

242

361

243

These assertions may not appear in character classes (but note that \b

362

These assertions may not appear in character classes (but note that \b

244

363

has a different meaning, namely the backspace character, inside a char-

245

364

acter class).

246

365

247

A word boundary is a position in the subject string where the current

248

character and the previous character do not both match \w or \W (i.e.

249

one matches \w and the other matches \W), or the start or end of the

366

A word boundary is a position in the subject string where the current

367

character and the previous character do not both match \w or \W (i.e.

368

one matches \w and the other matches \W), or the start or end of the

250

369

string if the first or last character matches \w, respectively.
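The word boundary rule can be tried out with Python's re module, which is not PCRE but treats \b and \B the same way for this simple ASCII case. The subject string is an arbitrary example.

```python
import re

text = "cat concat cat_ cat"

# \bcat\b matches "cat" only where both sides are a word boundary
starts = [m.start() for m in re.finditer(r"\bcat\b", text)]

# \Bcat matches "cat" only where it is preceded by a word character
inner = [m.start() for m in re.finditer(r"\Bcat", text)]
```

The occurrence in "cat_" is skipped by \bcat\b because underscore is a word character, so there is no boundary after the "t".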

251

370

252

The \A, \Z, and \z assertions differ from the traditional circumflex

253

and dollar (described below) in that they only ever match at the very

254

start and end of the subject string, whatever options are set. Thus,

255

they are independent of multiline mode.

256

257

They are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options. If the

258

startoffset argument of pcre_exec() is non-zero, indicating that match-

259

ing is to start at a point other than the beginning of the subject, \A

260

can never match. The difference between \Z and \z is that \Z matches

261

before a newline that is the last character of the string as well as at

262

the end of the string, whereas \z matches only at the end.

371

The \A, \Z, and \z assertions differ from the traditional circumflex

372

and dollar (described in the next section) in that they only ever match

373

at the very start and end of the subject string, whatever options are

374

set. Thus, they are independent of multiline mode. These three asser-

375

tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which

376

affect only the behaviour of the circumflex and dollar metacharacters.

377

However, if the startoffset argument of pcre_exec() is non-zero, indi-

378

cating that matching is to start at a point other than the beginning of

379

the subject, \A can never match. The difference between \Z and \z is

380

that \Z matches before a newline that is the last character of the

381

string as well as at the end of the string, whereas \z matches only at

382

the end.

263

383

264

384

The \G assertion is true only when the current matching position is at

265

385

the start point of the match, as specified by the startoffset argument

282

402

CIRCUMFLEX AND DOLLAR

283

403

284

404

Outside a character class, in the default matching mode, the circumflex

285

character is an assertion which is true only if the current matching

405

character is an assertion that is true only if the current matching

286

406

point is at the start of the subject string. If the startoffset argu-

287

407

ment of pcre_exec() is non-zero, circumflex can never match if the

288

408

PCRE_MULTILINE option is unset. Inside a character class, circumflex

296

416

ject, it is said to be an "anchored" pattern. (There are also other

297

417

constructs that can cause a pattern to be anchored.)

298

418

299

A dollar character is an assertion which is true only if the current

419

A dollar character is an assertion that is true only if the current

300

420

matching point is at the end of the subject string, or immediately

301

421

before a newline character that is the last character in the string (by

302

422

default). Dollar need not be the last character of the pattern if a

313

433

ately after and immediately before an internal newline character,

314

434

respectively, in addition to matching at the start and end of the sub-

315

435

ject string. For example, the pattern /^abc$/ matches the subject

316

string "def\nabc" in multiline mode, but not otherwise. Consequently,

317

patterns that are anchored in single line mode because all branches

318

start with ^ are not anchored in multiline mode, and a match for cir-

319

cumflex is possible when the startoffset argument of pcre_exec() is

320

non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE

321

is set.

436

string "def\nabc" (where \n represents a newline character) in multi-

437

line mode, but not otherwise. Consequently, patterns that are anchored

438

in single line mode because all branches start with ^ are not anchored

439

in multiline mode, and a match for circumflex is possible when the

440

startoffset argument of pcre_exec() is non-zero. The PCRE_DOL-

441

LAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
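The /^abc$/ example above can be reproduced with Python's re module, whose re.MULTILINE flag plays the role of PCRE_MULTILINE here. This is a sketch in a different regex engine, not a call into PCRE.

```python
import re

subject = "def\nabc"

# Default mode: ^ and $ match only at the start and end of the subject
single = re.search(r"^abc$", subject)

# Multiline mode: ^ and $ also match after and before an internal newline
multi = re.search(r"^abc$", subject, re.MULTILINE)

# \A ignores multiline mode and still matches only at the very start
anchored = re.search(r"\Aabc", subject, re.MULTILINE)
```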

322

442

323

443

Note that the sequences \A, \Z, and \z can be used to match the start

324

444

and end of the subject in both modes, and if all branches of a pattern

331

451

Outside a character class, a dot in the pattern matches any one charac-

332

452

ter in the subject, including a non-printing character, but not (by

333

453

default) newline. In UTF-8 mode, a dot matches any UTF-8 character,

334

which might be more than one byte long, except (by default) for new-

335

line. If the PCRE_DOTALL option is set, dots match newlines as well.

336

The handling of dot is entirely independent of the handling of circum-

337

flex and dollar, the only relationship being that they both involve

338

newline characters. Dot has no special meaning in a character class.

454

which might be more than one byte long, except (by default) newline. If

455

the PCRE_DOTALL option is set, dots match newlines as well. The han-

456

dling of dot is entirely independent of the handling of circumflex and

457

dollar, the only relationship being that they both involve newline

458

characters. Dot has no special meaning in a character class.
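A short Python sketch of the dot behaviour follows. It uses the re module, where re.DOTALL corresponds to PCRE_DOTALL; the subject strings are made up for the example.

```python
import re

subject = "a\nb"

no_dotall = re.fullmatch(r"a.b", subject)                # None: dot skips newline
with_dotall = re.fullmatch(r"a.b", subject, re.DOTALL)   # dot matches newline too

# Inside a character class, dot is an ordinary character
literal_dot = re.fullmatch(r"a[.]b", "a.b")
other_char = re.fullmatch(r"a[.]b", "axb")               # None
```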

339

459

340

460

341

461

MATCHING A SINGLE BYTE

342

462

343

463

Outside a character class, the escape sequence \C matches any one byte,

344

both in and out of UTF-8 mode. Unlike a dot, it always matches a new-

345

line. The feature is provided in Perl in order to match individual

346

bytes in UTF-8 mode. Because it breaks up UTF-8 characters into indi-

347

vidual bytes, what remains in the string may be a malformed UTF-8

348

string. For this reason it is best avoided.

349

350

PCRE does not allow \C to appear in lookbehind assertions (see below),

351

because in UTF-8 mode it makes it impossible to calculate the length of

352

the lookbehind.

353

354

355

SQUARE BRACKETS

464

both in and out of UTF-8 mode. Unlike a dot, it can match a newline.

465

The feature is provided in Perl in order to match individual bytes in

466

UTF-8 mode. Because it breaks up UTF-8 characters into individual

467

bytes, what remains in the string may be a malformed UTF-8 string. For

468

this reason, the \C escape sequence is best avoided.

469

470

PCRE does not allow \C to appear in lookbehind assertions (described

471

below), because in UTF-8 mode this would make it impossible to calcu-

472

late the length of the lookbehind.

473

474

475

SQUARE BRACKETS AND CHARACTER CLASSES

356

476

357

477

An opening square bracket introduces a character class, terminated by a

358

478

closing square bracket. A closing square bracket on its own is not spe-

371

491

For example, the character class [aeiou] matches any lower case vowel,

372

492

while [^aeiou] matches any character that is not a lower case vowel.

373

493

Note that a circumflex is just a convenient notation for specifying the

374

characters which are in the class by enumerating those that are not. It

375

is not an assertion: it still consumes a character from the subject

376

string, and fails if the current pointer is at the end of the string.

494

characters that are in the class by enumerating those that are not. A

495

class that starts with a circumflex is not an assertion: it still con-

496

sumes a character from the subject string, and therefore it fails if

497

the current pointer is at the end of the string.

377

498

378

In UTF-8 mode, characters with values greater than 255 can be included

379

in a class as a literal string of bytes, or by using the \x{ escaping

499

In UTF-8 mode, characters with values greater than 255 can be included

500

in a class as a literal string of bytes, or by using the \x{ escaping

380

501

mechanism.

381

502

382

When caseless matching is set, any letters in a class represent both

383

their upper case and lower case versions, so for example, a caseless

384

[aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not

385

match "A", whereas a caseful version would. PCRE does not support the

386

concept of case for characters with values greater than 255.

503

When caseless matching is set, any letters in a class represent both

504

their upper case and lower case versions, so for example, a caseless

505

[aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not

506

match "A", whereas a caseful version would. In UTF-8 mode, PCRE always

507

understands the concept of case for characters whose values are less

508

than 128, so caseless matching is always possible. For characters with

509

higher values, the concept of case is supported if PCRE is compiled

510

with Unicode property support, but not otherwise. If you want to use

511

caseless matching for characters 128 and above, you must ensure that

512

PCRE is compiled with Unicode property support as well as with UTF-8

513

support.

387

514

388

The newline character is never treated in any special way in character

389

classes, whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE

515

The newline character is never treated in any special way in character

516

classes, whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE

390

517

options is. A class such as [^a] will always match a newline.
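The class behaviour described above (negation, caseless matching, and newline handling) can be checked with Python's re module. This is only a sketch in another engine; re.IGNORECASE stands in for PCRE's caseless option.

```python
import re

lower_vowel = re.fullmatch(r"[aeiou]", "e")       # matches
non_vowel = re.fullmatch(r"[^aeiou]", "x")        # matches
at_end = re.fullmatch(r"[^aeiou]", "")            # None: a class must consume a character

caseless_upper = re.fullmatch(r"[aeiou]", "A", re.IGNORECASE)      # matches
caseless_negated = re.fullmatch(r"[^aeiou]", "A", re.IGNORECASE)   # None
caseful_negated = re.fullmatch(r"[^aeiou]", "A")                   # matches

newline = re.fullmatch(r"[^a]", "\n")             # matches: newline is not special
```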

391

518

392

The minus (hyphen) character can be used to specify a range of charac-

393

ters in a character class. For example, [d-m] matches any letter

394

between d and m, inclusive. If a minus character is required in a

395

class, it must be escaped with a backslash or appear in a position

396

where it cannot be interpreted as indicating a range, typically as the

519

The minus (hyphen) character can be used to specify a range of charac-

520

ters in a character class. For example, [d-m] matches any letter

521

between d and m, inclusive. If a minus character is required in a

522

class, it must be escaped with a backslash or appear in a position

523

where it cannot be interpreted as indicating a range, typically as the

397

524

first or last character in the class.

398

525

399

526

It is not possible to have the literal character "]" as the end charac-

400

ter of a range. A pattern such as [W-]46] is interpreted as a class of

401

two characters ("W" and "-") followed by a literal string "46]", so it

402

would match "W46]" or "-46]". However, if the "]" is escaped with a

403

backslash it is interpreted as the end of range, so [W-\]46] is inter-

404

preted as a single class containing a range followed by two separate

405

characters. The octal or hexadecimal representation of "]" can also be

406

used to end a range.

527

ter of a range. A pattern such as [W-]46] is interpreted as a class of

528

two characters ("W" and "-") followed by a literal string "46]", so it

529

would match "W46]" or "-46]". However, if the "]" is escaped with a

530

backslash it is interpreted as the end of range, so [W-\]46] is inter-

531

preted as a class containing a range followed by two other characters.

532

The octal or hexadecimal representation of "]" can also be used to end

533

a range.
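The two readings of "]" after a hyphen can be compared in Python's re module, which parses these particular classes the same way. The variable names are illustrative only.

```python
import re

# [W-]46]: a class of "W" and "-", followed by the literal string "46]"
two_char_class = re.compile(r"[W-]46]")

# [W-\]46]: one class holding the range W..] plus the characters "4" and "6"
range_class = re.compile(r"[W-\]46]")

# The hexadecimal form of "]" (\x5d) can end the range as well
hex_range_class = re.compile(r"[W-\x5d46]")

literal_tail = [s for s in ("W46]", "-46]", "X46]") if two_char_class.fullmatch(s)]
in_range = [c for c in "WZ]46-" if range_class.fullmatch(c)]
```

The hyphen is rejected by range_class because it lies outside the range W..] and is not listed separately.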

407

534

408

Ranges operate in the collating sequence of character values. They can

409

also be used for characters specified numerically, for example

410

[\000-\037]. In UTF-8 mode, ranges can include characters whose values

535

Ranges operate in the collating sequence of character values. They can

536

also be used for characters specified numerically, for example

537

[\000-\037]. In UTF-8 mode, ranges can include characters whose values

411

538

are greater than 255, for example [\x{100}-\x{2ff}].

412

539

413

540

If a range that includes letters is used when caseless matching is set,

414

541

it matches the letters in either case. For example, [W-c] is equivalent

415

to [][\^_`wxyzabc], matched caselessly, and if character tables for the

416

"fr" locale are in use, [\xc8-\xcb] matches accented E characters in

417

both cases.

418

419

The character types \d, \D, \s, \S, \w, and \W may also appear in a

420

character class, and add the characters that they match to the class.

421

For example, [\dABCDEF] matches any hexadecimal digit. A circumflex can

422

conveniently be used with the upper case character types to specify a

423

more restricted set of characters than the matching lower case type.

424

For example, the class [^\W_] matches any letter or digit, but not

425

underscore.

426

427

All non-alphameric characters other than \, -, ^ (at the start) and the

428

terminating ] are non-special in character classes, but it does no harm

429

if they are escaped.

542

to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if

543

character tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches

544

accented E characters in both cases. In UTF-8 mode, PCRE supports the

545

concept of case for characters with values greater than 128 only when

546

it is compiled with Unicode property support.

547

548

The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear

549

in a character class, and add the characters that they match to the

550

class. For example, [\dABCDEF] matches any hexadecimal digit. A circum-

551

flex can conveniently be used with the upper case character types to

552

specify a more restricted set of characters than the matching lower

553

case type. For example, the class [^\W_] matches any letter or digit,

554

but not underscore.
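The two class examples above can be exercised in Python. The re.ASCII flag is an assumption made for this sketch so that \d and \W are limited to ASCII, which is closer to PCRE's default tables than Python's Unicode-aware default.

```python
import re

hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)
letter_or_digit = re.compile(r"[^\W_]+", re.ASCII)

hex_ok = all(hex_digit.fullmatch(c) for c in "0123456789ABCDEF")
hex_bad = hex_digit.fullmatch("G")                  # None

# Underscore is excluded, so it splits the first word in two
words = letter_or_digit.findall("foo_bar42 baz")
```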

555

556

The only metacharacters that are recognized in character classes are

557

backslash, hyphen (only where it can be interpreted as specifying a

558

range), circumflex (only at the start), opening square bracket (only

559

when it can be interpreted as introducing a POSIX class name - see the

560

next section), and the terminating closing square bracket. However,

561

escaping other non-alphanumeric characters does no harm.

430

562

431

563

432

564

POSIX CHARACTER CLASSES

433

565

434

Perl supports the POSIX notation for character classes, which uses

435

names enclosed by [: and :] within the enclosing square brackets. PCRE

436

also supports this notation. For example,

566

Perl supports the POSIX notation for character classes. This uses names

567

enclosed by [: and :] within the enclosing square brackets. PCRE also

568

supports this notation. For example,

437

569

438

570

[01[:alpha:]%]

439

571

470

602

POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but

471

603

these are not supported, and an error is given if they are encountered.

472

604

473

In UTF-8 mode, characters with values greater than 255 do not match any

605

In UTF-8 mode, characters with values greater than 128 do not match any

474

606

of the POSIX character classes.

475

607

476

608

537

669

in the same way as the Perl-compatible options by using the characters

538

670

U and X respectively. The (?X) flag setting is special in that it must

539

671

always occur earlier in the pattern than any of the additional features

540

it turns on, even when it is at top level. It is best put at the start.

672

it turns on, even when it is at top level. It is best to put it at the

673

start.

541

674

542

675

543

676

SUBPATTERNS

544

677

545

678

Subpatterns are delimited by parentheses (round brackets), which can be

546

nested. Marking part of a pattern as a subpattern does two things:

679

nested. Turning part of a pattern into a subpattern does two things:

547

680

548

681

1. It localizes a set of alternatives. For example, the pattern

549

682

553

686

the parentheses, it would match "cataract", "erpillar" or the empty

554

687

string.

555

688

556

2. It sets up the subpattern as a capturing subpattern (as defined

557

above). When the whole pattern matches, that portion of the subject

689

2. It sets up the subpattern as a capturing subpattern. This means

690

that, when the whole pattern matches, that portion of the subject

558

691

string that matched the subpattern is passed back to the caller via the

559

692

ovector argument of pcre_exec(). Opening parentheses are counted from

560

left to right (starting from 1) to obtain the numbers of the capturing

693

left to right (starting from 1) to obtain numbers for the capturing

561

694

subpatterns.

562

695

563

696

For example, if the string "the red king" is matched against the pat-

602

735

Identifying capturing parentheses by number is simple, but it can be

603

736

very hard to keep track of the numbers in complicated regular expres-

604

737

sions. Furthermore, if an expression is modified, the numbers may

605

change. To help with the difficulty, PCRE supports the naming of sub-

738

change. To help with this difficulty, PCRE supports the naming of sub-

606

739

patterns, something that Perl does not provide. The Python syntax

607

740

(?P<name>...) is used. Names consist of alphanumeric characters and

608

741

underscores, and must be unique within a pattern.

609

742

610

743

Named capturing parentheses are still allocated numbers as well as

611

744

names. The PCRE API provides function calls for extracting the name-to-

612

number translation table from a compiled pattern. For further details

613

see the pcreapi documentation.

745

number translation table from a compiled pattern. There is also a con-

746

venience function for extracting a captured substring by name. For fur-

747

ther details see the pcreapi documentation.
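Python's re module uses the same (?P<name>...) syntax, so it can illustrate how names and numbers coexist. This sketch does not call the PCRE API; groupindex is Python's own name-to-number table, and the date pattern is invented for the example.

```python
import re

# Two named subpatterns
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# Named groups are still numbered from 1, left to right
table = dict(date.groupindex)

m = date.fullmatch("2005-07")
by_name = m.group("month")
by_number = m.group(table["month"])
```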

614

748

615

749

616

750

REPETITION

617

751

618

Repetition is specified by quantifiers, which can follow any of the

752

Repetition is specified by quantifiers, which can follow any of the

619

753

following items:

620

754

621

755

a literal data character

622

756

the . metacharacter

623

757

the \C escape sequence

624

escapes such as \d that match single characters

758

the \X escape sequence (in UTF-8 mode with Unicode properties)

759

an escape such as \d that matches a single character

625

760

a character class

626

761

a back reference (see next section)

627

762

a parenthesized subpattern (unless it is an assertion)

628

763

629

The general repetition quantifier specifies a minimum and maximum num-

630

ber of permitted matches, by giving the two numbers in curly brackets

631

(braces), separated by a comma. The numbers must be less than 65536,

764

The general repetition quantifier specifies a minimum and maximum num-

765

ber of permitted matches, by giving the two numbers in curly brackets

766

(braces), separated by a comma. The numbers must be less than 65536,

632

767

and the first must be less than or equal to the second. For example:

633

768

634

769

z{2,4}

635

770

636

matches "zz", "zzz", or "zzzz". A closing brace on its own is not a

637

special character. If the second number is omitted, but the comma is

638

present, there is no upper limit; if the second number and the comma

639

are both omitted, the quantifier specifies an exact number of required

771

matches "zz", "zzz", or "zzzz". A closing brace on its own is not a

772

special character. If the second number is omitted, but the comma is

773

present, there is no upper limit; if the second number and the comma

774

are both omitted, the quantifier specifies an exact number of required

640

775

matches. Thus

641

776

642

777

[aeiou]{3,}

645

780

646

781

\d{8}

647

782

648

matches exactly 8 digits. An opening curly bracket that appears in a

649

position where a quantifier is not allowed, or one that does not match

650

the syntax of a quantifier, is taken as a literal character. For exam-

783

matches exactly 8 digits. An opening curly bracket that appears in a

784

position where a quantifier is not allowed, or one that does not match

785

the syntax of a quantifier, is taken as a literal character. For exam-

651

786

ple, {,6} is not a quantifier, but a literal string of four characters.

652

787

653

In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to

788

In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to

654

789

individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-

655

acters, each of which is represented by a two-byte sequence.

790

acters, each of which is represented by a two-byte sequence. Similarly,

791

when Unicode property support is available, \X{3} matches three Unicode

792

extended sequences, each of which may be several bytes long (and they

793

may be of different lengths).
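The quantifier examples can be tried in Python's re module. Two caveats for this sketch: Python writes the character as \u0100 rather than \x{100}, and Python treats {,6} as a quantifier, so that case is not shown.

```python
import re

bounded = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz") if re.fullmatch(r"z{2,4}", s)]
exact = re.fullmatch(r"\d{8}", "20050702")

# The quantifier counts characters, not bytes: U+0100 takes two bytes in UTF-8
subject = "\u0100\u0100"
two_chars = re.fullmatch(r"\u0100{2}", subject)
byte_length = len(subject.encode("utf-8"))
```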

656

794

657

795

The quantifier {0} is permitted, causing the expression to behave as if

658

796

the previous item and the quantifier were not present.

680

818

as possible (up to the maximum number of permitted times), without

681

819

causing the rest of the pattern to fail. The classic example of where

682

820

this gives problems is in trying to match comments in C programs. These

683

appear between the sequences /* and */ and within the sequence, indi-

684

vidual * and / characters may appear. An attempt to match C comments by

685

applying the pattern

821

appear between /* and */ and within the comment, individual * and /

822

characters may appear. An attempt to match C comments by applying the

823

pattern

686

824

687

825

/\*.*\*/

688

826

689

827

to the string

690

828

691

/* first command */ not comment /* second comment */

829

/* first comment */ not comment /* second comment */

692

830

693

831

fails, because it matches the entire string owing to the greediness of

694

832

the .* item.
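The greedy match on the comment string above is easy to reproduce in Python's re module. For contrast, the sketch also shows the same pattern with a question mark after the quantifier, which makes it match as little as possible.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last */ in the string
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing: .*? stops at the first */ it can
lazy = re.findall(r"/\*.*?\*/", subject)
```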

716

854

words, it inverts the default behaviour.

717

855

718

856

When a parenthesized subpattern is quantified with a minimum repeat

719

count that is greater than 1 or with a limited maximum, more store is

857

count that is greater than 1 or with a limited maximum, more memory is

720

858

required for the compiled pattern, in proportion to the size of the

721

859

minimum or maximum.

722

860

807

945

consists of an additional + character following a quantifier. Using

808

946

this notation, the previous example can be rewritten as

809

947

810

\d++bar

948

\d++foo

811

949

812

950

Possessive quantifiers are always greedy; the setting of the

813

951

PCRE_UNGREEDY option is ignored. They are a convenient notation for the

832

970

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

833

971

834

972

it takes a long time before reporting failure. This is because the

835

string can be divided between the two repeats in a large number of

836

ways, and all have to be tried. (The example used [!?] rather than a

837

single character at the end, because both PCRE and Perl have an opti-

838

mization that allows for fast failure when a single character is used.

839

They remember the last single character that is required for a match,

840

and fail early if it is not present in the string.) If the pattern is

841

changed to

973

string can be divided between the internal \D+ repeat and the external

974

* repeat in a large number of ways, and all have to be tried. (The

975

example uses [!?] rather than a single character at the end, because

976

both PCRE and Perl have an optimization that allows for fast failure

977

when a single character is used. They remember the last single charac-

978

ter that is required for a match, and fail early if it is not present

979

in the string.) If the pattern is changed so that it uses an atomic

980

group, like this:

842

981

843

982

((?>\D+)|<\d+>)*[!?]

844

983

845

sequences of non-digits cannot be broken, and failure happens quickly.

984

sequences of non-digits cannot be broken, and failure happens quickly.

846

985

847

986

848

987

BACK REFERENCES

849

988

850

989

Outside a character class, a backslash followed by a digit greater than

851

990

0 (and possibly further digits) is a back reference to a capturing sub-

852

pattern earlier (that is, to its left) in the pattern, provided there

991

pattern earlier (that is, to its left) in the pattern, provided there

853

992

have been that many previous capturing left parentheses.

854

993

855

994

However, if the decimal number following the backslash is less than 10,

856

it is always taken as a back reference, and causes an error only if

857

there are not that many capturing left parentheses in the entire pat-

858

tern. In other words, the parentheses that are referenced need not be

859

to the left of the reference for numbers less than 10. See the section

860

entitled "Backslash" above for further details of the handling of dig-

861

its following a backslash.

995

it is always taken as a back reference, and causes an error only if

996

there are not that many capturing left parentheses in the entire pat-

997

tern. In other words, the parentheses that are referenced need not be

998

to the left of the reference for numbers less than 10. See the subsec-

999

tion entitled "Non-printing characters" above for further details of

1000

the handling of digits following a backslash.

862

1001

863

A back reference matches whatever actually matched the capturing sub-

864

pattern in the current subject string, rather than anything matching

1002

A back reference matches whatever actually matched the capturing sub-

1003

pattern in the current subject string, rather than anything matching

865

1004

the subpattern itself (see "Subpatterns as subroutines" below for a way

866

1005

of doing that). So the pattern

867

1006

868

1007

(sens|respons)e and \1ibility

869

1008

870

matches "sense and sensibility" and "response and responsibility", but

871

not "sense and responsibility". If caseful matching is in force at the

872

time of the back reference, the case of letters is relevant. For exam-

1009

matches "sense and sensibility" and "response and responsibility", but

1010

not "sense and responsibility". If caseful matching is in force at the

1011

time of the back reference, the case of letters is relevant. For exam-

873

1012

ple,

874

1013

875

1014

((?i)rah)\s+\1

876

1015

877

matches "rah rah" and "RAH RAH", but not "RAH rah", even though the

1016

matches "rah rah" and "RAH RAH", but not "RAH rah", even though the

878

1017

original capturing subpattern is matched caselessly.

879

1018

880

Back references to named subpatterns use the Python syntax (?P=name).

1019

Back references to named subpatterns use the Python syntax (?P=name).

881

1020

We could rewrite the above example as follows:

882

1021

883

1022

(?P<p1>(?i)rah)\s+(?P=p1)
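Numbered and named back references can both be tried in Python's re module, which shares the \1 and (?P=name) syntax. The inline (?i) setting is left out of this sketch because Python handles inline option settings differently from PCRE.

```python
import re

numbered = re.compile(r"(sens|respons)e and \1ibility")
named = re.compile(r"(?P<p1>rah)\s+(?P=p1)")

same_word = numbered.fullmatch("sense and sensibility")
mixed_words = numbered.fullmatch("sense and responsibility")   # None

same_case = named.fullmatch("rah rah")
other_case = named.fullmatch("rah RAH")                        # None
```

The back reference repeats the text that was captured, not the subpattern, which is why "sense and responsibility" is rejected.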

884

1023

885

There may be more than one back reference to the same subpattern. If a

886

subpattern has not actually been used in a particular match, any back

1024

There may be more than one back reference to the same subpattern. If a

1025

subpattern has not actually been used in a particular match, any back

887

1026

references to it always fail. For example, the pattern

888

1027

889

1028

(a|(bc))\2

890

1029

891

always fails if it starts to match "a" rather than "bc". Because there

892

may be many capturing parentheses in a pattern, all digits following

893

the backslash are taken as part of a potential back reference number.

1030

always fails if it starts to match "a" rather than "bc". Because there

1031

may be many capturing parentheses in a pattern, all digits following

1032

the backslash are taken as part of a potential back reference number.

894

1033

If the pattern continues with a digit character, some delimiter must be

895

used to terminate the back reference. If the PCRE_EXTENDED option is

896

set, this can be whitespace. Otherwise an empty comment can be used.

1034

used to terminate the back reference. If the PCRE_EXTENDED option is

1035

set, this can be whitespace. Otherwise an empty comment (see "Com-

1036

ments" below) can be used.
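The (a|(bc))\2 case behaves the same way in Python's re module, which also fails a back reference to a group that took no part in the match. This is a sketch in another engine, not a PCRE call.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 is set only when the "bc" alternative is taken
via_bc = pattern.fullmatch("bcbc")

# When "a" is taken, group 2 is unset and \2 cannot match
via_a = pattern.search("aa")
```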

897

1037

898

1038

A back reference that occurs inside the parentheses to which it refers

899

1039

fails when the subpattern is first used, so, for example, (a\1) never

915

1055

An assertion is a test on the characters following or preceding the

916

1056

current matching point that does not actually consume any characters.

917

1057

The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are

918

described above. More complicated assertions are coded as subpatterns.

919

There are two kinds: those that look ahead of the current position in

920

the subject string, and those that look behind it.

921

922

An assertion subpattern is matched in the normal way, except that it

923

does not cause the current matching position to be changed. Lookahead

924

assertions start with (?= for positive assertions and (?! for negative

925

assertions. For example,

1058

described above.

1059

1060

More complicated assertions are coded as subpatterns. There are two

1061

kinds: those that look ahead of the current position in the subject

1062

string, and those that look behind it. An assertion subpattern is

1063

matched in the normal way, except that it does not cause the current

1064

matching position to be changed.

1065

1066

Assertion subpatterns are not capturing subpatterns, and may not be

1067

repeated, because it makes no sense to assert the same thing several

1068

times. If any kind of assertion contains capturing subpatterns within

1069

it, these are counted for the purposes of numbering the capturing sub-

1070

patterns in the whole pattern. However, substring capturing is carried

1071

out only for positive assertions, because it does not make sense for

1072

negative assertions.

1073

1074

Lookahead assertions

1075

1076

Lookahead assertions start with (?= for positive assertions and (?! for

1077

negative assertions. For example,

926

1078

927

1079

\w+(?=;)

928

1080

939

1091

does not find an occurrence of "bar" that is preceded by something

940

1092

other than "foo"; it finds any occurrence of "bar" whatsoever, because

941

1093

the assertion (?!foo) is always true when the next three characters are

942

"bar". A lookbehind assertion is needed to achieve this effect.

1094

"bar". A lookbehind assertion is needed to achieve the other effect.

943

1095

944

1096

If you want to force a matching failure at some point in a pattern, the

945

1097

most convenient way to do it is with (?!) because an empty string

946

1098

always matches, so an assertion that requires there not to be an empty

947

1099

string must always fail.
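The lookahead behaviour described above can be checked with a short sketch in Python, whose re module shares the (?=, (?! and (?<! syntax. The pattern (?!foo)bar is inferred from the surrounding prose, (?<!foo)bar is the lookbehind alternative it alludes to, and the subjects are made up for the example.

```python
import re

# Matches a word only when a semicolon follows; the semicolon is not consumed.
WORD_BEFORE_SEMICOLON = re.compile(r"\w+(?=;)")

# The lookahead tests the same three characters that "bar" then matches,
# so it can never reject anything.
LOOKAHEAD_BAR = re.compile(r"(?!foo)bar")

# The lookbehind tests the three characters before "bar" instead.
LOOKBEHIND_BAR = re.compile(r"(?<!foo)bar")

# An empty negative lookahead forces a failure at that point.
ALWAYS_FAIL = re.compile(r"(?!)")

if __name__ == "__main__":
    print(WORD_BEFORE_SEMICOLON.search("stop;").group())
    print(LOOKAHEAD_BAR.search("foobar"))
    print(LOOKBEHIND_BAR.search("foobar"))
    print(ALWAYS_FAIL.search("anything"))
```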

948

1100

1101

Lookbehind assertions

1102

949

1103

Lookbehind assertions start with (?<= for positive assertions and (?<!

950

1104

for negative assertions. For example,

951

1105

984

1138

985

1139

PCRE does not allow the \C escape (which matches a single byte in UTF-8

986

1140

mode) to appear in lookbehind assertions, because it makes it impossi-

987

ble to calculate the length of the lookbehind.

1141

ble to calculate the length of the lookbehind. The \X escape, which can

1142

match different numbers of bytes, is also not permitted.

988

1143

989

Atomic groups can be used in conjunction with lookbehind assertions to

1144

Atomic groups can be used in conjunction with lookbehind assertions to

990

1145

specify efficient matching at the end of the subject string. Consider a

991

1146

simple pattern such as

992

1147

993

1148

abcd$

994

1149

995

when applied to a long string that does not match. Because matching

1150

when applied to a long string that does not match. Because matching

996

1151

proceeds from left to right, PCRE will look for each "a" in the subject

997

and then see if what follows matches the rest of the pattern. If the

1152

and then see if what follows matches the rest of the pattern. If the

998

1153

pattern is specified as

999

1154

1000

1155

^.*abcd$

1001

1156

1002

the initial .* matches the entire string at first, but when this fails

1157

the initial .* matches the entire string at first, but when this fails

1003

1158

(because there is no following "a"), it backtracks to match all but the

1004

last character, then all but the last two characters, and so on. Once

1005

again the search for "a" covers the entire string, from right to left,

1159

last character, then all but the last two characters, and so on. Once

1160

again the search for "a" covers the entire string, from right to left,

1006

1161

so we are no better off. However, if the pattern is written as

1007

1162

1008

1163

^(?>.*)(?<=abcd)

1009

1164

1010

or, equivalently,

1165

or, equivalently, using the possessive quantifier syntax,

1011

1166

1012

1167

^.*+(?<=abcd)

1013

1168

1014

there can be no backtracking for the .* item; it can match only the

1015

entire string. The subsequent lookbehind assertion does a single test

1016

on the last four characters. If it fails, the match fails immediately.

1017

For long strings, this approach makes a significant difference to the

1169

there can be no backtracking for the .* item; it can match only the

1170

entire string. The subsequent lookbehind assertion does a single test

1171

on the last four characters. If it fails, the match fails immediately.

1172

For long strings, this approach makes a significant difference to the

1018

1173

processing time.

1019

1174

1175

Using multiple assertions

1176

1020

1177

Several assertions (of any sort) may occur in succession. For example,

1021

1178

1022

1179

(?<=\d{3})(?<!999)foo

1023

1180

1024

matches "foo" preceded by three digits that are not "999". Notice that

1025

each of the assertions is applied independently at the same point in

1026

the subject string. First there is a check that the previous three

1027

characters are all digits, and then there is a check that the same

1181

matches "foo" preceded by three digits that are not "999". Notice that

1182

each of the assertions is applied independently at the same point in

1183

the subject string. First there is a check that the previous three

1184

characters are all digits, and then there is a check that the same

1028

1185

three characters are not "999". This pattern does not match "foo" pre-

1029

ceded by six characters, the first of which are digits and the last

1030

three of which are not "999". For example, it doesn't match "123abc-

1186

ceded by six characters, the first of which are digits and the last

1187

three of which are not "999". For example, it doesn't match "123abc-

1031

1188

foo". A pattern to do that is

1032

1189

1033

1190

(?<=\d{3}...)(?<!999)foo

1034

1191

1035

This time the first assertion looks at the preceding six characters,

1192

This time the first assertion looks at the preceding six characters,

1036

1193

checking that the first three are digits, and then the second assertion

1037

1194

checks that the preceding three characters are not "999".
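A small Python sketch makes the difference between the two patterns concrete. Python's re module accepts both because each lookbehind has a fixed length; the helper name and test subjects are assumptions for illustration.

```python
import re

# Both assertions inspect the same three characters before "foo".
ADJACENT = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion now looks six characters back (three digits plus
# any three characters); the second still inspects the last three.
SIX_BACK = re.compile(r"(?<=\d{3}...)(?<!999)foo")


def found(pattern, subject):
    """Return True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None


if __name__ == "__main__":
    for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
        print(subject, found(ADJACENT, subject), found(SIX_BACK, subject))
```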

1038

1195

1040

1197

1041

1198

(?<=(?<!foo)bar)baz

1042

1199

1043

matches an occurrence of "baz" that is preceded by "bar" which in turn

1200

matches an occurrence of "baz" that is preceded by "bar" which in turn

1044

1201

is not preceded by "foo", while

1045

1202

1046

1203

(?<=\d{3}(?!999)...)foo

1047

1204

1048

is another pattern which matches "foo" preceded by three digits and any

1205

is another pattern that matches "foo" preceded by three digits and any

1049

1206

three characters that are not "999".
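The nested forms can be exercised the same way. This is a sketch using Python's re module, which accepts assertions nested inside a fixed-length lookbehind; the helper name and subjects are hypothetical.

```python
import re

# "baz" preceded by "bar", where that "bar" is not itself preceded by "foo".
NESTED_BEHIND = re.compile(r"(?<=(?<!foo)bar)baz")

# "foo" preceded by three digits and then three characters that are not "999";
# the lookahead is applied inside the lookbehind, just after the digits.
NESTED_AHEAD = re.compile(r"(?<=\d{3}(?!999)...)foo")


def found(pattern, subject):
    """Return True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None


if __name__ == "__main__":
    print(found(NESTED_BEHIND, "barbaz"), found(NESTED_BEHIND, "foobarbaz"))
    print(found(NESTED_AHEAD, "123abcfoo"), found(NESTED_AHEAD, "123999foo"))
```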

1050

1207

1051

Assertion subpatterns are not capturing subpatterns, and may not be

1052

repeated, because it makes no sense to assert the same thing several

1053

times. If any kind of assertion contains capturing subpatterns within

1054

it, these are counted for the purposes of numbering the capturing sub-

1055

patterns in the whole pattern. However, substring capturing is carried

1056

out only for positive assertions, because it does not make sense for

1057

negative assertions.

1058

1059

1208

1060

1209

CONDITIONAL SUBPATTERNS

1061

1210

1062

It is possible to cause the matching process to obey a subpattern con-

1063

ditionally or to choose between two alternative subpatterns, depending

1064

on the result of an assertion, or whether a previous capturing

1065

subpattern matched or not. The two possible forms of conditional sub-

1066

pattern are

1211

It is possible to cause the matching process to obey a subpattern con-

1212

ditionally or to choose between two alternative subpatterns, depending

1213

on the result of an assertion, or whether a previous capturing subpat-

1214

tern matched or not. The two possible forms of conditional subpattern

1215

are

1067

1216

1068

1217

(?(condition)yes-pattern)

1069

1218

(?(condition)yes-pattern|no-pattern)

1070

1219

1071

If the condition is satisfied, the yes-pattern is used; otherwise the

1072

no-pattern (if present) is used. If there are more than two alterna-

1220

If the condition is satisfied, the yes-pattern is used; otherwise the

1221

no-pattern (if present) is used. If there are more than two alterna-

1073

1222

tives in the subpattern, a compile-time error occurs.

1074

1223

1075

1224

There are three kinds of condition. If the text between the parentheses

1076

consists of a sequence of digits, the condition is satisfied if the

1077

capturing subpattern of that number has previously matched. The number

1078

must be greater than zero. Consider the following pattern, which con-

1079

tains non-significant white space to make it more readable (assume the

1080

PCRE_EXTENDED option) and to divide it into three parts for ease of

1225

consists of a sequence of digits, the condition is satisfied if the

1226

capturing subpattern of that number has previously matched. The number

1227

must be greater than zero. Consider the following pattern, which con-

1228

tains non-significant white space to make it more readable (assume the

1229

PCRE_EXTENDED option) and to divide it into three parts for ease of

1081

1230

discussion:

1082

1231

1083

1232

( \( )? [^()]+ (?(1) \) )

1084

1233

1085

The first part matches an optional opening parenthesis, and if that

1234

The first part matches an optional opening parenthesis, and if that

1086

1235

character is present, sets it as the first captured substring. The sec-

1087

ond part matches one or more characters that are not parentheses. The

1236

ond part matches one or more characters that are not parentheses. The

1088

1237

third part is a conditional subpattern that tests whether the first set

1089

1238

of parentheses matched or not. If they did, that is, if the subject started

1090

1239

with an opening parenthesis, the condition is true, and so the yes-pat-

1091

tern is executed and a closing parenthesis is required. Otherwise,

1092

since no-pattern is not present, the subpattern matches nothing. In

1093

other words, this pattern matches a sequence of non-parentheses,

1240

tern is executed and a closing parenthesis is required. Otherwise,

1241

since no-pattern is not present, the subpattern matches nothing. In

1242

other words, this pattern matches a sequence of non-parentheses,

1094

1243

optionally enclosed in parentheses.
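The group-number condition works the same way in Python's re module, so the three-part pattern can be tried directly. In this sketch re.VERBOSE plays the role of PCRE_EXTENDED, and the function name is invented for the example.

```python
import re

# Part 1: optional "(" captured as group 1.
# Part 2: one or more characters that are not parentheses.
# Part 3: if group 1 was set, require a closing ")"; otherwise match nothing.
OPTIONAL_PARENS = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)


def is_optionally_parenthesized(subject):
    """Return True if the whole subject is text with optional enclosing parens."""
    return OPTIONAL_PARENS.fullmatch(subject) is not None


if __name__ == "__main__":
    for subject in ("abc", "(abc)", "(abc", "abc)"):
        print(subject, is_optionally_parenthesized(subject))
```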

1095

1244

1096

1245

If the condition is the string (R), it is satisfied if a recursive call

1097

to the pattern or subpattern has been made. At "top level", the condi-

1098

tion is false. This is a PCRE extension. Recursive patterns are

1246

to the pattern or subpattern has been made. At "top level", the condi-

1247

tion is false. This is a PCRE extension. Recursive patterns are

1099

1248

described in the next section.

1100

1249

1101

If the condition is not a sequence of digits or (R), it must be an

1102

assertion. This may be a positive or negative lookahead or lookbehind

1103

assertion. Consider this pattern, again containing non-significant

1250

If the condition is not a sequence of digits or (R), it must be an

1251

assertion. This may be a positive or negative lookahead or lookbehind

1252

assertion. Consider this pattern, again containing non-significant

1104

1253

white space, and with the two alternatives on the second line:

1105

1254

1106

1255

(?(?=[^a-z]*[a-z])

1107

1256

\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )

1108

1257

1109

The condition is a positive lookahead assertion that matches an

1110

optional sequence of non-letters followed by a letter. In other words,

1111

it tests for the presence of at least one letter in the subject. If a

1112

letter is found, the subject is matched against the first alternative;

1113

otherwise it is matched against the second. This pattern matches

1114

strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are

1258

The condition is a positive lookahead assertion that matches an

1259

optional sequence of non-letters followed by a letter. In other words,

1260

it tests for the presence of at least one letter in the subject. If a

1261

letter is found, the subject is matched against the first alternative;

1262

otherwise it is matched against the second. This pattern matches

1263

strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are

1115

1264

letters and dd are digits.

1116

1265

1117

1266

1118

1267

COMMENTS

1119

1268

1120

The sequence (?# marks the start of a comment which continues up to the

1121

next closing parenthesis. Nested parentheses are not permitted. The

1122

characters that make up a comment play no part in the pattern matching

1269

The sequence (?# marks the start of a comment that continues up to the

1270

next closing parenthesis. Nested parentheses are not permitted. The

1271

characters that make up a comment play no part in the pattern matching

1123

1272

at all.

1124

1273

1125

If the PCRE_EXTENDED option is set, an unescaped # character outside a

1274

If the PCRE_EXTENDED option is set, an unescaped # character outside a

1126

1275

character class introduces a comment that continues up to the next new-

1127

1276

line character in the pattern.
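Both comment forms can be sketched with Python's re module, where re.VERBOSE is assumed to stand in for PCRE_EXTENDED. The pattern text is made up for the example.

```python
import re

# (?#...) comment: everything up to the next ")" is ignored.
INLINE_COMMENT = re.compile(r"abc(?#matches the first half)def")

# In extended mode an unescaped "#" outside a character class starts a
# comment that runs to the end of the line.
EXTENDED_COMMENT = re.compile(
    r"""
    abc   # first half
    def   # second half
    """,
    re.VERBOSE,
)

# Inside a character class "#" is an ordinary character, even in extended mode.
LITERAL_HASH = re.compile(r"a[#]b", re.VERBOSE)

if __name__ == "__main__":
    print(INLINE_COMMENT.fullmatch("abcdef") is not None)
    print(EXTENDED_COMMENT.fullmatch("abcdef") is not None)
    print(LITERAL_HASH.fullmatch("a#b") is not None)
```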

1128

1277

1129

1278

1130

1279

RECURSIVE PATTERNS

1131

1280

1132

Consider the problem of matching a string in parentheses, allowing for

1133

unlimited nested parentheses. Without the use of recursion, the best

1134

that can be done is to use a pattern that matches up to some fixed

1135

depth of nesting. It is not possible to handle an arbitrary nesting

1136

depth. Perl has provided an experimental facility that allows regular

1137

expressions to recurse (amongst other things). It does this by interpo-

1138

lating Perl code in the expression at run time, and the code can refer

1139

to the expression itself. A Perl pattern to solve the parentheses prob-

1140

lem can be created like this:

1281

Consider the problem of matching a string in parentheses, allowing for

1282

unlimited nested parentheses. Without the use of recursion, the best

1283

that can be done is to use a pattern that matches up to some fixed

1284

depth of nesting. It is not possible to handle an arbitrary nesting

1285

depth. Perl provides a facility that allows regular expressions to

1286

recurse (amongst other things). It does this by interpolating Perl code

1287

in the expression at run time, and the code can refer to the expression

1288

itself. A Perl pattern to solve the parentheses problem can be created

1289

like this:

1141

1290

1142

1291

$re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

1143

1292

1144

1293

The (?p{...}) item interpolates Perl code at run time, and in this case

1145

refers recursively to the pattern in which it appears. Obviously, PCRE

1146

cannot support the interpolation of Perl code. Instead, it supports

1147

some special syntax for recursion of the entire pattern, and also for

1294

refers recursively to the pattern in which it appears. Obviously, PCRE

1295

cannot support the interpolation of Perl code. Instead, it supports

1296

some special syntax for recursion of the entire pattern, and also for

1148

1297

individual subpattern recursion.

1149

1298

1150

The special item that consists of (? followed by a number greater than

1299

The special item that consists of (? followed by a number greater than

1151

1300

zero and a closing parenthesis is a recursive call of the subpattern of

1152

the given number, provided that it occurs inside that subpattern. (If

1153

not, it is a "subroutine" call, which is described in the next sec-

1154

tion.) The special item (?R) is a recursive call of the entire regular

1301

the given number, provided that it occurs inside that subpattern. (If

1302

not, it is a "subroutine" call, which is described in the next sec-

1303

tion.) The special item (?R) is a recursive call of the entire regular

1155

1304

expression.

1156

1305

1157

For example, this PCRE pattern solves the nested parentheses problem

1158

(assume the PCRE_EXTENDED option is set so that white space is

1306

For example, this PCRE pattern solves the nested parentheses problem

1307

(assume the PCRE_EXTENDED option is set so that white space is

1159

1308

ignored):

1160

1309

1161

1310

\( ( (?>[^()]+) | (?R) )* \)

1162

1311

1163

First it matches an opening parenthesis. Then it matches any number of

1164

substrings which can either be a sequence of non-parentheses, or a

1165

recursive match of the pattern itself (that is a correctly parenthe-

1312

First it matches an opening parenthesis. Then it matches any number of

1313

substrings which can either be a sequence of non-parentheses, or a

1314

recursive match of the pattern itself (that is a correctly parenthe-

1166

1315

sized substring). Finally there is a closing parenthesis.
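Python's standard re module has no (?R) item, so the following is not a use of the PCRE syntax. It is a hand-written recursive function, offered only as a sketch of what the recursive pattern does: an opening parenthesis, then any number of non-parenthesis runs or nested recursive matches, then a closing parenthesis. The function name and return convention are assumptions.

```python
def match_parens(subject, pos=0):
    """Return the index just past a balanced "(...)" starting at pos, or None."""
    if pos >= len(subject) or subject[pos] != "(":
        return None  # the pattern must start with an opening parenthesis
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            return pos + 1  # the final closing parenthesis
        if ch == "(":
            # Counterpart of (?R): match a nested parenthesized substring.
            end = match_parens(subject, pos)
            if end is None:
                return None
            pos = end
        else:
            # Counterpart of (?>[^()]+): consume a whole run of
            # non-parentheses and never give any of it back.
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
    return None  # ran out of subject before the closing parenthesis


if __name__ == "__main__":
    print(match_parens("(ab(cd)ef)"))
    print(match_parens("(aaaa()"))
```

Because each run of non-parentheses is consumed in one step, an unbalanced subject is rejected after a single pass over it.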

1167

1316

1168

If this were part of a larger pattern, you would not want to recurse

1317

If this were part of a larger pattern, you would not want to recurse

1169

1318

the entire pattern, so instead you could use this:

1170

1319

1171

1320

( \( ( (?>[^()]+) | (?1) )* \) )

1172

1321

1173

We have put the pattern into parentheses, and caused the recursion to

1174

refer to them instead of the whole pattern. In a larger pattern, keep-

1175

ing track of parenthesis numbers can be tricky. It may be more conve-

1176

nient to use named parentheses instead. For this, PCRE uses (?P>name),

1177

which is an extension to the Python syntax that PCRE uses for named

1322

We have put the pattern into parentheses, and caused the recursion to

1323

refer to them instead of the whole pattern. In a larger pattern, keep-

1324

ing track of parenthesis numbers can be tricky. It may be more conve-

1325

nient to use named parentheses instead. For this, PCRE uses (?P>name),

1326

which is an extension to the Python syntax that PCRE uses for named

1178

1327

parentheses (Perl does not provide named parentheses). We could rewrite

1179

1328

the above example as follows:

1180

1329

1181

1330

(?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )

1182

1331

1183

This particular example pattern contains nested unlimited repeats, and

1184

so the use of atomic grouping for matching strings of non-parentheses

1185

is important when applying the pattern to strings that do not match.

1332

This particular example pattern contains nested unlimited repeats, and

1333

so the use of atomic grouping for matching strings of non-parentheses

1334

is important when applying the pattern to strings that do not match.

1186

1335

For example, when this pattern is applied to

1187

1336

1188

1337

(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

1189

1338

1190

it yields "no match" quickly. However, if atomic grouping is not used,

1191

the match runs for a very long time indeed because there are so many

1192

different ways the + and * repeats can carve up the subject, and all

1339

it yields "no match" quickly. However, if atomic grouping is not used,

1340

the match runs for a very long time indeed because there are so many

1341

different ways the + and * repeats can carve up the subject, and all

1193

1342

have to be tested before failure can be reported.

1194

1343

1195

1344

At the end of a match, the values set for any capturing subpatterns are

1196

1345

those from the outermost level of the recursion at which the subpattern

1197

value is set. If you want to obtain intermediate values, a callout

1198

function can be used (see below and the pcrecallout documentation). If

1199

the pattern above is matched against

1346

value is set. If you want to obtain intermediate values, a callout

1347

function can be used (see the next section and the pcrecallout documen-

1348

tation). If the pattern above is matched against

1200

1349

1201

1350

(ab(cd)ef)

1202

1351

1203

the value for the capturing parentheses is "ef", which is the last

1204

value taken on at the top level. If additional parentheses are added,

1352

the value for the capturing parentheses is "ef", which is the last

1353

value taken on at the top level. If additional parentheses are added,

1205

1354

giving

1206

1355

1207

1356

\( ( ( (?>[^()]+) | (?R) )* ) \)

1208

1357

   ^                        ^

1209

1358

   ^                        ^

1210

1359

1211

the string they capture is "ab(cd)ef", the contents of the top level

1212

parentheses. If there are more than 15 capturing parentheses in a pat-

1360

the string they capture is "ab(cd)ef", the contents of the top level

1361

parentheses. If there are more than 15 capturing parentheses in a pat-

1213

1362

tern, PCRE has to obtain extra memory to store data during a recursion,

1214

which it does by using pcre_malloc, freeing it via pcre_free after-

1215

wards. If no memory can be obtained, the match fails with the

1363

which it does by using pcre_malloc, freeing it via pcre_free after-

1364

wards. If no memory can be obtained, the match fails with the

1216

1365

PCRE_ERROR_NOMEMORY error.

1217

1366

1218

Do not confuse the (?R) item with the condition (R), which tests for

1219

recursion. Consider this pattern, which matches text in angle brack-

1220

ets, allowing for arbitrary nesting. Only digits are allowed in nested

1221

brackets (that is, when recursing), whereas any characters are permit-

1367

Do not confuse the (?R) item with the condition (R), which tests for

1368

recursion. Consider this pattern, which matches text in angle brack-

1369

ets, allowing for arbitrary nesting. Only digits are allowed in nested

1370

brackets (that is, when recursing), whereas any characters are permit-

1222

1371

ted at the outer level.

1223

1372

1224

1373

< (?: (?(R) \d++ | [^<>]*+) | (?R)) * >

1225

1374

1226

In this pattern, (?(R) is the start of a conditional subpattern, with

1227

two different alternatives for the recursive and non-recursive cases.

1375

In this pattern, (?(R) is the start of a conditional subpattern, with

1376

two different alternatives for the recursive and non-recursive cases.

1228

1377

The (?R) item is the actual recursive call.

1229

1378

1230

1379

1231

1380

SUBPATTERNS AS SUBROUTINES

1232

1381

1233

1382

If the syntax for a recursive subpattern reference (either by number or

1234

by name) is used outside the parentheses to which it refers, it oper-

1235

ates like a subroutine in a programming language. An earlier example

1383

by name) is used outside the parentheses to which it refers, it oper-

1384

ates like a subroutine in a programming language. An earlier example

1236

1385

pointed out that the pattern

1237

1386

1238

1387

(sens|respons)e and \1ibility

1239

1388

1240

matches "sense and sensibility" and "response and responsibility", but

1389

matches "sense and sensibility" and "response and responsibility", but

1241

1390

not "sense and responsibility". If instead the pattern

1242

1391

1243

1392

(sens|respons)e and (?1)ibility

1244

1393

1245

is used, it does match "sense and responsibility" as well as the other

1246

two strings. Such references must, however, follow the subpattern to

1394

is used, it does match "sense and responsibility" as well as the other

1395

two strings. Such references must, however, follow the subpattern to

1247

1396

which they refer.
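The contrast between a back reference and a subroutine call can be sketched in Python. Python's re module does not support (?1), so the subroutine call is emulated here by reusing the subpattern text when the pattern string is built. That is a substitute technique, not the PCRE feature, and the variable names are invented.

```python
import re

# Text of the subpattern that the PCRE example refers to as group 1.
STEM = r"sens|respons"

# Back reference: \1 must repeat the exact text that group 1 captured.
BACKREF = re.compile(rf"({STEM})e and \1ibility")

# Emulated subroutine call: the subpattern itself is applied again, so
# either alternative may match the second time.
SUBROUTINE_LIKE = re.compile(rf"({STEM})e and (?:{STEM})ibility")

if __name__ == "__main__":
    for subject in (
        "sense and sensibility",
        "response and responsibility",
        "sense and responsibility",
    ):
        print(
            subject,
            BACKREF.fullmatch(subject) is not None,
            SUBROUTINE_LIKE.fullmatch(subject) is not None,
        )
```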

1248

1397

1249

1398

1250

1399

CALLOUTS

1251

1400

1252

1401

Perl has a feature whereby using the sequence (?{...}) causes arbitrary

1253

Perl code to be obeyed in the middle of matching a regular expression.

1402

Perl code to be obeyed in the middle of matching a regular expression.

1254

1403

This makes it possible, amongst other things, to extract different sub-

1255

1404

strings that match the same pair of parentheses when there is a repeti-

1256

1405

tion.

1257

1406

1258

1407

PCRE provides a similar feature, but of course it cannot obey arbitrary

1259

1408

Perl code. The feature is called "callout". The caller of PCRE provides

1260

an external function by putting its entry point in the global variable

1261

pcre_callout. By default, this variable contains NULL, which disables

1409

an external function by putting its entry point in the global variable

1410

pcre_callout. By default, this variable contains NULL, which disables

1262

1411

all calling out.

1263

1412

1264

Within a regular expression, (?C) indicates the points at which the

1265

external function is to be called. If you want to identify different

1266

callout points, you can put a number less than 256 after the letter C.

1267

The default value is zero. For example, this pattern has two callout

1413

Within a regular expression, (?C) indicates the points at which the

1414

external function is to be called. If you want to identify different

1415

callout points, you can put a number less than 256 after the letter C.

1416

The default value is zero. For example, this pattern has two callout

1268

1417

points:

1269

1418

1270

1419

(?C1)abc(?C2)def

1271

1420

1421

If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are

1422

automatically installed before each item in the pattern. They are all

1423

numbered 255.

1424

1272

1425

During matching, when PCRE reaches a callout point (and pcre_callout is

1273

set), the external function is called. It is provided with the number

1274

of the callout, and, optionally, one item of data originally supplied

1275

by the caller of pcre_exec(). The callout function may cause matching

1276

to backtrack, or to fail altogether. A complete description of the

1277

interface to the callout function is given in the pcrecallout documen-

1278

tation.

1426

set), the external function is called. It is provided with the number

1427

of the callout, the position in the pattern, and, optionally, one item

1428

of data originally supplied by the caller of pcre_exec(). The callout

1429

function may cause matching to proceed, to backtrack, or to fail alto-

1430

gether. A complete description of the interface to the callout function

1431

is given in the pcrecallout documentation.

1279

1432

1280

Last updated: 03 February 2003

1281

1433

Last updated: 28 February 2005

1434
