1
/* tag: Tom Lord Tue Dec 4 14:57:16 2001 (posix-regexps.doc)
3
/************************************************************************
6
* |Posix| |Posix.2| |1003.2| |ANSI/IEEE 1003.2| |ISO/IEDC 994502|
7
* |regular expression notation|
11
* The *Posix.2* standard (*ISO/IEC 994502: 1993 (ANSI/IEEE Std 1003.2
12
* - 1992*)), section *2.8*) specifies the syntax and semantics of a
13
* ``Regular Expression Notation''. Appendix ``B.5'' defines a ``C Binding
14
* for RE Matching'' which includes the functions `regcomp' (to compile
15
* a Posix regexp), `regexec' (to search for a match), `regerror' (to
16
* translate a regexp error code into a string), and `regfree' (to
17
* release storage associated with a compiled regexp).
19
* The Hackerlab C library provides "Rx", an implementation of
20
* Posix.2 regexp functions (with some extensions).
22
* This chapter describes that interface. An appendix to this manual
23
* contains an introduction to Posix regexps. (See xref:"An
24
* Introduction to Posix Regexps".) If you are unfamiliar with
25
* regexps, reading that appendix before reading this chapter may be
28
* This chapter begins with documentation for the standard Posix
29
* functions (and some closely related non-standard extensions). If
30
* you are looking for programmer reference manual, see xref:"Posix
33
* The Posix standard for regexps is precise. On the other hand, it
34
* is often implemented incorrectly and almost never implemented
35
* completely. A discussion of the relation between the Posix
36
* standard and Rx can be found in xref:"Rx and Conformance to the
39
* Obtaining good performance from regexp matchers is sometimes
40
* complicated: they are easy to understand and use in many common
41
* situations, but they require careful attention in applications that
42
* use complicated regexps and in applications that use regexp
43
* matching heavily. Some general advice can be found in xref:"Rx
44
* and the Performance of Regexp Pattern Matching".
46
* Finally, if you are performance tuning a regexp-intensive
47
* application, you'll need to understand the non-standard interfaces
48
* in xref:"Tuning Posix Regexp and XML Regular Expression Performance".
56
/*(include-documentation "posix.c")
62
/************************************************************************
63
*(h1 "Rx and Conformance to the Posix Standard")
65
* Posix specifies the behavior of the C regexp functions quite
66
* precisely. Rx attempts to honor this specification to the
67
* greatest extent possible in a portable implementation.
68
* There are two areas of conformance that are worthy of note:
69
* the question of what is matched by a given regexp, and the
70
* question of how regexp matching interacts with Posix locales.
72
* The question of ``what is matched'' is worthy of note because
73
* obtaining correct behavior has proven to be extremely difficult --
74
* few implementations succeed. The implementation of Rx has been
75
* carefully developed and tested with correctness in mind.
77
* The question of how regexps and locales interact is worthy of note
78
* because it is impossible to completely implement the behavior
79
* specified by Posix in a portable implementation (i.e., in an
80
* implementation that is not intimately familiar with the
81
* non-standard internals of a particular implementation of the
82
* standard C library).
86
/************************************************************************
87
*(h2 "What is Matched by a Posix Regexp")
89
* Posix requires that when `regexec' reports the position of a
90
* matching substring, it must report the first-occurring (``leftmost'')
91
* match. Of the possible first-occuring matches, the longest match
92
* must be returned. This is called the "left-most longest rule".
94
* Posix requires that when `regexec' reports the position of a
95
* substring matched by a parenthesized subexpression, it must report
96
* the last substring matched by that subexpression. If one
97
* parenthesized expression (the ``inner expression'') is enclosed
98
* in another (the ``outer expression'') and the inner expression did
99
* not participate in the last match of the outer expression, then
100
* no substring match may be reported for the inner expression.
102
* Finally, Posix requires that when regexec determines what
103
* each subpattern matched (regardless of whether the subpattern is
104
* surrounded by parentheses), if there is more than one possibility,
105
* regexec must choose the possibility that first maximizes the length
106
* of substrings matched by the outermost subpatterns, from
107
* left-to-right, and then recursively apply the same rule to
108
* inner subpatterns. However, this rule is subordinate to the
109
* left-most longest rule: if an earlier-occuring or longer overall
110
* match can be obtained by returning a non-maximal match for some
111
* subpattern, `regexec' must return that earlier or longer match.
113
* This combination of constraints completely determines the return
114
* value of `regexec' and describes the behavior of Rx. Many other
115
* implementations do not conform to the standard in this regard --
116
* in exceptional situations, compatibility issues may arise when
117
* switching between regexp implementations.
120
/************************************************************************
121
*(h2 "Rx and Posix Locales")
123
* Posix requires that a character set containing a multi-character
124
* collating element be treated specially. For example, if the
125
* character sequence `ch' is a collating element, then the regexp:
129
* will match `ch'. On the other hand, if `ch' is not a collating
130
* element, the same expression is illegal. Similarly, an expression
135
* should match all collating elements which collate between ``a'' and
136
* ``z'' (inclusively), including multi-character collating elements
137
* that collate between ``a'' and ``z''.
139
* Unfortunately, Posix provides no portable mechanism for determining
140
* which sequences of characters are multi-character collating
141
* elements, and which are not. Consequently, Rx operates as if
142
* multi-character collating elements did not exist.
144
* Posix also defines a character set construct called an ``equivalence
147
* [[=<X>=]] where <X> is a collating element
149
* An equivalence class stands for the set of all collating elements
150
* having the same primary collation weight as `<X>'. Unfortunately,
151
* Posix provides no portable mechanism for determining the primary
152
* collation weight of any collating element. Consequently, Rx
153
* implements the equivalence class construct by returning an error
154
* from `regcomp' whenever it is used.
156
* Posix requires that in a character set, a range of characters such
161
* includes all characters that collate between `a' and `z' in the
162
* current locale. Some people argue that this behavior is confusing:
163
* that character ranges should be based on the encoding values of
164
* characters -- not on the rules of collation. Because of
165
* differences in collation, Posix advises that character ranges are a
166
* non-portable construct: portable programs should not use them at
169
* Rx conforms to Posix by using collation order to interpret
170
* character ranges (with the exception that Rx always behaves as if
171
* there are no multi-character collating elements). Using the C
172
* locale when calling `regcomp' and `regexec' ensures that character
173
* ranges will be interpreted in a way consistent with the ASCII
178
/************************************************************************
179
*(h1 "Rx and the Performance of Regexp Pattern Matching")
182
* The performance (speed and memory use) of any Posix regexp matcher
183
* (including Rx) is a complicated subject. Programmers who want to use
184
* regexps will benefit by understanding the issues, at least in
185
* broad outline, so that they can avoid pitfalls, so they can make
186
* the best possible use of a particular implementation (Rx, in this
187
* case), and so they know where to delve deeper when performance
188
* issues become particularly important.
190
* Traditionally, many programmers use regexps as if they were always
191
* computationally *inexpensive*. This is naive. Some uses of
192
* regexps are inexpensive, others are intractable. Many fall
193
* somewhere in the middle. Which uses fall into which cases varies
194
* between implementations.
199
/************************************************************************
200
*(h2 "Posix Regexp Performance in General")
202
* This section describes the performance of Posix regexp matching
203
* in general -- it is not specific to Rx.
205
* Posix regexp pattern matching, in its full generality, is a
206
* computationally expensive process. In some cases, involving only
207
* moderately sized regexps, it is intractably expensive, regardless of
208
* what implementation is being used. Thus, one should never write
209
* programs with the assumption that `regexec' will always return
210
* normally in a reasonable amount of time. (Rx includes non-standard
211
* functionality which can be used to interrupt a call to `regexec'
212
* after a time-out. See xref:"Escaping Long-Running Matches".)
214
* On the other hand, for many very simple regexps, Posix regexp
215
* matching is very inexpensive, again, (nearly) regardless of
216
* implementation. For example, if a regexp is simply a string of
217
* literal characters, searching for a match is almost certain to be
220
* Implementations of Posix regexps often differ in the set of
221
* optimizations they provide. Simplistic implementations, containing
222
* few optimizations, perform well for small and simple regexps, but
223
* poorly in many other cases. Sophisticated implementations can
224
* perform very well on even large regexps if those regexps are true
225
* regular expressions or are nearly true regular expressions. (For
226
* an explanation of the distinction between regexps in general and
227
* ``true regular expressions'', see xref:"An Introduction to Posix
230
* Implementations of Posix regexps often differ in correctness, and
231
* this has a bearing on performance. Several popular implementations
232
* sometimes give incorrect results. The bugs that cause those errors
233
* also improve the performance of the same matchers on some patterns
234
* for which they give correct results. Thus, programmers choosing an
235
* implementation are sometimes faced with the uncomfortable trade-off
236
* between the best performance bench-mark results, and the best
237
* correctness testing results. In such situations, an important
238
* question is the relevance of the tests: do the bench-mark tests
239
* accurately reflect regexp usage in the target application? What
240
* are the risks of using an incorrect matcher in the application?
241
* Consider whether the better performance of buggy matchers on some
242
* expressions is offset by their considerably worse performance on
243
* other expressions: it is not the case that the buggy
244
* implementations are always faster.
248
/************************************************************************
249
*(h2 "Posix Regexp Performance in Rx")
251
* This section describes the performance of Posix regexp matching
254
* Rx is designed to give excellent performance over the widest
255
* possible range of regexps (including many large, complicated
256
* regexps), but to never sacrifice correctness. While Rx is at least
257
* competitive with most most implementations on most regexps (and
258
* is sometimes much faster), there are some regexps for which Rx
259
* is much slower than other implementations. Often, this difference
260
* can be attributed to the bugs in other implementations which
261
* speed up some cases while getting other cases wrong. This is
262
* something to keep in mind when comparing Rx to other implementations.
264
* When a trade-off is necessary between memory use and speed, Rx is
265
* designed to allow programmers to choose how much memory to use and
266
* to provide programmers with the tools necessary to tune memory use
267
* for best possible performance. Rx can operate usefully (though
268
* comparatively slowly) with as little as 10-20K of dynamically
269
* allocated memory. As a rule of thumb, Rx's default of
270
* approximately 2MB is suitable even for applications that use
271
* regexps fairly heavily. (See xref:"Tuning Posix Regexp and
272
* XML Regular Expression Performance".)
274
* Rx contains optimizations targetted for regexps which are true
275
* regular expressions. Rx converts true regular expressions to
276
* deterministic automata and can compare such expressions to a
277
* potentially matching string very quickly. This optimization makes
278
* it practical to use even quite large regular expressions. For more
279
* information, see the the appendix xref:"Data Sheet
280
* for the Hackerlab Rx Posix Regexp Functions".
282
* Sometimes regexps which are not true regular expressions can be
283
* matched as quickly as if they were true regular expressions. If a
284
* regexp is not a regular expression only because it begins with a
285
* left anchor (`^') and/or ends with a right anchor (`$'), Rx can
286
* match the expression as quickly as a true regular expression.
287
* If, in addition, the regexp contains parenthesized subexpressions,
288
* Rx can use regular expression optimizations if, either, the
289
* `REG_NOSUB' flag is passed to `regcomp', or the `nmatch' parameter
290
* to `regexec' is less than or equal to 1 (i.e., if `regexec' is not
291
* required to report the positions of substrings matched by
292
* parenthesized subexpressions).
294
* If a regexp is not a regular expression because it contains
295
* backreferences (`\n') or counted iterations (`RE{n,m}'), Rx's DFA
296
* optimizations do not apply in their full generality. Such
297
* regexps run the greatest risk of being slow.
299
* The Rx implementation of `regcomp' supports a non-standard flag,
300
* `REG_DFA_ONLY', which can be used to disable all regexp constructs
301
* that are forbidden in true regular expressions. See
304
* When regexps are being used for lexical analysis, good performance can
305
* often be achieved by using true regular expressions in combination
306
* with the non-standard regexp operator `[[:cut n:]]'. See xref:"The
311
/*(include-documentation "../rx/escape.c")