1
Regular Expression Scanner Generator
2
====================================
4
Notes in the scanner File Format
5
--------------------------------
7
The file format used is based on the GNU flex table file format
8
(--tables-file option; see Table File Format in the flex info pages and
9
the flex sources for documentation). The magic number used in the header
10
is set to 0x1B5E783D insted of 0xF13C57B1 though, which is meant to
11
indicate that the file format logically is not the same: the YY_ID_CHK
12
(check) and YY_ID_DEF (default) tables are used differently.
14
Flex uses state compression to store only the differences between states
15
for states that are similar. The amount of compresion influences the parse
18
The following two states could be stored as in the tables outlined
21
States and transitions on specific characters to next states
22
------------------------------------------------------------
23
1: ('a' => 2, 'b' => 3, 'c' => 4)
24
2: ('a' => 2, 'b' => 3, 'd' => 5)
26
Flex-like table format
27
----------------------
28
index: (default, base)
29
0: ( 0, 0) <== dummy state (nonmatching)
34
0: ( 0, 0) <== unused entry
35
( 0, 1) <== ord('a') identical entries
39
( 0, 1) <== (255 - ord('c')) identical entries
43
Here, state 2 is described as ('c' => 0, 'd' => 5), and everything else
44
as in state 1. The matching algorithm is as follows.
46
Flex-like scanner algorithm
47
---------------------------
48
/* current state is in <state>, input character <c> */
49
while (check[base[state] + c] != state)
50
state = default[state];
52
/* continue with the next input character */
54
This state compression algorithm performs well, except when there are
55
many inverted or wildcard matches ("[^x]", "."). Each input character
56
may cause several iterations in the while loop.
59
We will have many inverted character classes ("[^/]") that wouldn't
60
compress very well. Therefore, the regexp matcher uses no state
61
compression, and uses the check and default tables differently. The
62
above states could be stored as follows:
67
index: (default, base)
68
0: ( 0, 0) <== dummy state (nonmatching)
73
0: ( 0, 0) <== unused entry
74
( 0, 0) <== ord('a') identical, unused entries
80
3+'c': ( 0, 0) <== entry is unused
82
( 0, 0) <== (255 - ord('d')) identical, unused entries
84
All the entries with 0 in check (except the first entry, which is
85
deliberately reserved) are still available for other states that
88
Regexp scanner algorithm
89
------------------------
90
/* current state is in <state>, matching character <c> */
91
if (check[base[state] + c] == state)
94
state = default[state];
95
/* continue with the next input character */
97
This representation and algorithm allows states which match more
98
characters than they do not match to be represented as their inverse.
99
For example, a third state that accepts everything other than 'a' can
100
be added to the tables as one entry in (default, base) and one entry in
105
3: ('a' => 0, everything else => 5)
109
index: (default, base)
110
0: ( 0, 0) <== dummy state (nonmatching)
116
0: ( 0, 0) <== unused entry
117
( 0, 0) <== ord('a') identical, unused entries
123
3+'c': ( 0, 0) <== entry is unused
126
( 0, 0) <== (255 - ord('a')) identical, unused entries
128
While the current code does not implement any form of state compression,
129
the flex state compression representation could be combined by
130
remembering (in a bit per state, for example) which default entries
131
refer to inverted matches, and which refer to parent states.