76
76
of states approximately proportional to the length of the regexp.
78
78
* The NFA is then optimized into a "compact NFA" representation, which is
79
basically the same data but without fields that are not going to be needed
80
at runtime. We do a little bit of cleanup too, such as removing
81
unreachable states that might be created as a result of the rather naive
82
transformation done by initial parsing. The cNFA representation is what
83
is passed from regcomp to regexec.
79
basically the same idea but without fields that are not going to be needed
80
at runtime. It is simplified too: the compact format only allows "plain"
81
and "LACON" arc types. The cNFA representation is what is passed from
85
84
* Unlike traditional NFA-based regex engines, we do not execute directly
86
85
from the NFA representation, as that would require backtracking and so be
139
138
each match their part of the string (and although this specific case can
140
139
only succeed when the division is at the middle, the code does not know
141
140
that, nor would it be true in general). However, we can first run the DFA
142
and quickly reject any input that doesn't contain two a's and some number
143
of b's and c's. If the DFA doesn't match, there is no need to recurse to
144
the two child nodes for each possible string division point. In many
145
cases, this prefiltering makes the search run much faster than a pure NFA
146
engine could do. It is this behavior that justifies using the phrase
147
"hybrid DFA/NFA engine" to describe Spencer's library.
141
and quickly reject any input that doesn't start with an "a" and contain
142
one more "a" plus some number of b's and c's. If the DFA doesn't match,
143
there is no need to recurse to the two child nodes for each possible
144
string division point. In many cases, this prefiltering makes the search
145
run much faster than a pure NFA engine could do. It is this behavior that
146
justifies using the phrase "hybrid DFA/NFA engine" to describe Spencer's
150
150
Colors and colormapping
296
296
full expansion of their contents at parse time. This would mean that we'd
297
297
have to be ready to call iswalpha() at runtime, but if that only happens
298
298
for high-code-value characters, it shouldn't be a big performance hit.
301
Detailed semantics of an NFA
302
----------------------------
304
When trying to read dumped-out NFAs, it's helpful to know these facts:
306
State 0 (additionally marked with "@" in dumpnfa's output) is always the
307
goal state, and state 1 (additionally marked with ">") is the start state.
308
(The code refers to these as the post state and pre state respectively.)
310
The possible arc types are:
312
PLAIN arcs, which specify matching of any character of a given "color"
313
(see above). These are dumped as "[color_number]->to_state".
315
EMPTY arcs, which specify a no-op transition to another state. These
316
are dumped as "->to_state".
318
AHEAD constraints, which represent a "next character must be of this
319
color" constraint. AHEAD differs from a PLAIN arc in that the input
320
character is not consumed when crossing the arc. These are dumped as
321
">color_number>->to_state".
323
BEHIND constraints, which represent a "previous character must be of
324
this color" constraint, which likewise consumes no input. These are
325
dumped as "<color_number<->to_state".
327
'^' arcs, which specify a beginning-of-input constraint. These are
328
dumped as "^0->to_state" or "^1->to_state" for beginning-of-string and
329
beginning-of-line constraints respectively.
331
'$' arcs, which specify an end-of-input constraint. These are dumped
332
as "$0->to_state" or "$1->to_state" for end-of-string and end-of-line
333
constraints respectively.
335
LACON constraints, which represent "(?=re)" and "(?!re)" constraints,
336
i.e. the input starting at this point must match (or not match) a
337
given sub-RE, but the matching input is not consumed. These are
338
dumped as ":subtree_number:->to_state".
340
If you see anything else (especially any question marks) in the display of
341
an arc, it's dumpnfa() trying to tell you that there's something fishy
342
about the arc; see the source code.
344
The regex executor can only handle PLAIN and LACON transitions. The regex
345
optimize() function is responsible for transforming the parser's output
346
to get rid of all the other arc types. In particular, ^ and $ arcs that
347
are not dropped as impossible will always end up adjacent to the pre or
348
post state respectively, and then will be converted into PLAIN arcs that
349
mention the special "colors" for BOS, BOL, EOS, or EOL.
351
To decide whether a thus-transformed NFA matches a given substring of the
352
input string, the executor essentially follows these rules:
353
1. Start the NFA "looking at" the character *before* the given substring,
354
or if the substring is at the start of the input, prepend an imaginary BOS
356
2. Run the NFA until it has consumed the character *after* the given
357
substring, or an imaginary following EOS character if the substring is at
358
the end of the input.
359
3. If the NFA is (or can be) in the goal state at this point, it matches.
361
So one can mentally execute an untransformed NFA by taking ^ and $ as
362
ordinary constraints that match at start and end of input; but plain
363
arcs out of the start state should be taken as matches for the character
364
before the target substring, and similarly, plain arcs leading to the
365
post state are matches for the character after the target substring.
366
This definition is necessary to support regexes that begin or end with
367
constraints such as \m and \M, which imply requirements on the adjacent
368
character if any. NFAs for simple unanchored patterns will usually have
369
pre-state outarcs for all possible character colors as well as BOS and
370
BOL, and post-state inarcs for all possible character colors as well as
371
EOS and EOL, so that the executor's behavior will work.