This is a description of the mararc parser that the Deadwood project
uses. This is a rewrite of ParseMaraRc.c.

We use a finite state machine to parse a mararc file. Here are the
various classes of characters that we need to test for:

A alpha: The letters A through Z, a through z, and the _ character
B alphanum: The letters A-Z, a-z, the _ character, and the numbers 0-9
Y alphastart: The letters A through Z and a through z
[ leftbrace: The [ character
Q quote: The " character
] rightbrace: The ] character
D dname: The letters A-Z, a-z, the - character, the numbers 0-9, the
         '.' character, and the '_' character
S dname_start: The letters A-Z, a-z, and the numbers 0-9
. dot: The '.' character
= equals: The = symbol
N number: The numbers 0-9
H hash: The # character
I in_string: Any printable ASCII character except for the #, ", and
R carriage return: The \r non-printable ASCII character
T newline: The \n non-printable ASCII character
W whitespace: The space (' ') or tab character
X any: Any printable ASCII character and the tab character
{ left curly brace: The { character
} right curly brace: The } character
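
The class tests listed above can be sketched as a single C function. This is
only an illustration: class_match() and its helpers are hypothetical names,
not Deadwood's actual functions. Two assumptions are made: "printable ASCII"
is taken to mean space (32) through tilde (126), and the I class follows the
short form given in the letter summary ("pASCII except # and "").

```c
/* Sketch only: these names are hypothetical, not Deadwood's API. */

static int is_letter(int c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

static int is_digit(int c)
{
    return c >= '0' && c <= '9';
}

/* Assumption: printable ASCII is space (32) through tilde (126). */
static int is_printable(int c)
{
    return c >= ' ' && c <= '~';
}

/* Return 1 if character c belongs to the class named by cls, else 0. */
int class_match(int cls, int c)
{
    switch (cls) {
    case 'A': return is_letter(c) || c == '_';
    case 'B': return is_letter(c) || is_digit(c) || c == '_';
    case 'Y': return is_letter(c);
    case 'D': return is_letter(c) || is_digit(c) ||
                     c == '-' || c == '.' || c == '_';
    case 'S': return is_letter(c) || is_digit(c);
    case 'N': return is_digit(c);
    case 'Q': return c == '"';
    case 'H': return c == '#';
    case 'I': return is_printable(c) && c != '#' && c != '"';
    case 'R': return c == '\r';
    case 'T': return c == '\n';
    case 'W': return c == ' ' || c == '\t';
    case 'X': return is_printable(c) || c == '\t';
    default:  return c == cls; /* literal classes such as [ ] . = { } */
    }
}
```

Folding the single-character classes into the default case keeps the switch
short: any class character without a letter shortcut simply matches itself.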

So, we have 13 character classes with letter shortcuts. We have
ten multi-character classes, and the " and # characters get letter
shortcuts (": because otherwise we would need ugly \" sequences
in the quoted state machine definition; #: so that we can potentially
add comments to state machine definitions).

We also have the following actions:

1: Add the character we are looking at to variable 1
2: Add the character we are looking at to variable 2
3: Add the character we are looking at to variable 3
4: Add the character we are looking at to variable 4
5: Add the character we are looking at to variable 5
6: Add the character we are looking at to variable 6
7: Add the character we are looking at to variable 7
8: Return a fatal error stating that leading whitespace is not allowed
;: End the processing of the current line successfully

The variables are as follows:

1: The mararc parameter
2: The dictionary index
3: The mararc string value
4: The mararc numeric value
5: If this is set, then we append instead of assigning
6: If this is set, then we initialize the specified dictionary variable
7: The filename to read and parse as a dwood2rc file

If no new state is specified for a given character class in a given
state, we halt processing with an error.

We also have 51 states, represented by lower-case letters. The initial
state is state 'a'. If the first letter of the state is 'x', the state
representation uses two lower-case letters (such as 'xa' or 'xp').

Instructions for the state machine are as follows:

<state name>: <character class><action (optional)><new state>

Again, here are the character classes using letters:

A: A-Za-z_     B: A-Za-z_0-9     D: -A-Za-z0-9._
H: #           I: pASCII except # and "
N: 0-9         Q: "              S: A-Za-z0-9
T: \n          W: [ \t]          X: pASCII, hi-bit, and \t

And here is the specified state machine for mararc processing. This
state machine is run for each line in the mararc file.

Start of line:                        a:  Hb Y1c Wxb Rxp T;
In comment:                           b:  Xb Rxp T;
Reading mararc parameter:             c:  B1c Wd =e [f +g (y
Whitespace after mararc parameter:    d:  Wd =e [f +g
Equal sign:                           e:  We N4h Qi {6w
Numeric mararc parameter:             h:  N4h Wk Hb Rxp T;
Quote beginning mararc parameter:     i:  I3m
End of line:                          k:  Wk Hb Rxp T;
In mararc parameter:                  m:  I3m Qk
Quote beginning dictionary index:     n:  .2o S2p -2p
Dot as dictionary index:              o:  Qq
Dictionary index:                     p:  D2p Qq
Quote at end of dictionary index:     q:  Wq ]r
Right brace ending dictionary index:  r:  Wr =s +t
Equal sign before dictionary value:   s:  Ws Qu
Plus sign before dictionary value:    t:  =5s
Quote beginning dictionary value:     u:  I3v
In dictionary value:                  v:  I3v Qk
At left curly brace:                  w:  }k
At carriage return:                   xp: T;
In filename for execfile:             z:  I7z Qxa
Quote after execfile filename:        xa: )k
After whitespace in line:             xb: Hb Wxb Y8xb Rxp T;
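
To make the table concrete, here is a rough C sketch that interprets a subset
of the states above directly from their text form and runs them over one line.
It is not Deadwood's implementation: run_line(), moves_for(), struct rule and
VAR_MAX are all hypothetical, the I class follows the short summary form
("pASCII except # and "), and a jump to a state missing from the subset is
simply treated as an error.

```c
#include <string.h>

#define VAR_MAX 64 /* hypothetical size limit for each variable */

static int letter(int c) { return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'); }
static int digit(int c) { return c >= '0' && c <= '9'; }

/* Classes needed by the states below; anything else matches literally. */
static int class_match(int cls, int c)
{
    switch (cls) {
    case 'B': return letter(c) || digit(c) || c == '_';
    case 'Y': return letter(c);
    case 'N': return digit(c);
    case 'Q': return c == '"';
    case 'H': return c == '#';
    case 'I': return c >= ' ' && c <= '~' && c != '#' && c != '"';
    case 'R': return c == '\r';
    case 'T': return c == '\n';
    case 'W': return c == ' ' || c == '\t';
    case 'X': return (c >= ' ' && c <= '~') || c == '\t';
    default:  return c == cls;
    }
}

struct rule { const char *state; const char *moves; };

/* A subset of the state table, copied verbatim. */
static const struct rule machine[] = {
    {"a", "Hb Y1c Wxb Rxp T;"},  {"b", "Xb Rxp T;"},
    {"c", "B1c Wd =e [f +g (y"}, {"d", "Wd =e [f +g"},
    {"e", "We N4h Qi {6w"},      {"h", "N4h Wk Hb Rxp T;"},
    {"i", "I3m"},                {"k", "Wk Hb Rxp T;"},
    {"m", "I3m Qk"},             {"xp", "T;"},
    {"xb", "Hb Wxb Y8xb Rxp T;"},
};

static const char *moves_for(const char *state)
{
    size_t i;
    for (i = 0; i < sizeof machine / sizeof machine[0]; i++)
        if (strcmp(machine[i].state, state) == 0)
            return machine[i].moves;
    return NULL; /* state not in this subset */
}

static void append(char *var, int c)
{
    size_t len = strlen(var);
    if (len + 1 < VAR_MAX) {
        var[len] = (char)c;
        var[len + 1] = '\0';
    }
}

/* Run the machine over one line: 0 when ';' is reached, -1 on error. */
int run_line(const char *line, char vars[8][VAR_MAX])
{
    char state[3] = "a";
    int i;
    for (i = 0; i < 8; i++)
        vars[i][0] = '\0';
    for (; *line != '\0'; line++) {
        const char *m = moves_for(state);
        int c = (unsigned char)*line, matched = 0;
        while (m != NULL && *m != '\0' && !matched) {
            int cls = *m++, action = 0;
            char next[3] = "";
            if (*m == ';') {
                action = 10;
                m++;
            } else {
                if (digit(*m))
                    action = *m++ - '0';
                next[0] = *m++;
                if (next[0] == 'x')
                    next[1] = *m++;
            }
            if (*m == ' ')
                m++;
            if (!class_match(cls, c))
                continue;
            matched = 1;
            if (action == 10)
                return 0;
            if (action == 8)
                return -1; /* leading whitespace is fatal */
            if (action >= 1 && action <= 7)
                append(vars[action], c);
            strcpy(state, next);
        }
        if (!matched)
            return -1; /* no transition for this character */
    }
    return -1; /* ran out of input before ';' */
}
```

With this sketch, a line such as `example_param = 20` followed by a newline
leaves "example_param" in variable 1 and "20" in variable 4; the parameter
name is made up for the example.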
Once a line is processed, we then look at the value of variable 1 (the
mararc parameter).

Should variable 1 be a known normal mararc parameter we support, we store
the value of variable 3 in the parameter indexed by variable 1.

Note that, to make life easier for the initial version of Deadwood, we
will only support a dictionary index of ".". This is temporary, so we
can more quickly get a very basic forwarding DNS server written.

When tokenizing the state machine, the state is converted from its
lower-case letter name to a number from 0 (for 'a') to 51 (for 'xz'). Each
<class><action><new state> is stored as three 8-bit numbers:

* The pattern, which is the literal character. For example, pattern 'A' is
tokenized as the number 65 ('A' in ASCII)

* Action, which is a number from 0 to 10. 0 indicates "no action"; 1-9
indicate actions #1-9. Action #10 means "terminate reading line with

* New state: This is the converted state number ('a' becomes 0; 'z' becomes
25; 'xz' becomes 51; note that 'x' [23] isn't used)
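
The three-number encoding above can be sketched in C as follows. The names
triple, state_number() and tokenize_triple() are hypothetical. The sketch
assumes that 'xa' through 'xz' map to 26 through 51 (consistent with 'z'
being 25 and 'xz' being 51), that the ';' action is stored as action 10, and
that the new state is left at 0 when the action is ';'.

```c
#include <stdint.h>

/* Hypothetical container for one <class><action><new state> entry. */
typedef struct {
    uint8_t pattern;   /* literal class character, e.g. 'A' is 65 */
    uint8_t action;    /* 0 = no action, 1-9 = numbered action, 10 = ';' */
    uint8_t new_state; /* converted state number */
} triple;

/*
 * Convert the state name at s to its number.
 * Returns the number of characters consumed, or 0 if s is not a state name.
 */
int state_number(const char *s, uint8_t *out)
{
    if (s[0] < 'a' || s[0] > 'z')
        return 0;
    if (s[0] == 'x') { /* two-letter state: 'xa' is 26 ... 'xz' is 51 */
        if (s[1] < 'a' || s[1] > 'z')
            return 0;
        *out = (uint8_t)(26 + (s[1] - 'a'));
        return 2;
    }
    *out = (uint8_t)(s[0] - 'a');
    return 1;
}

/*
 * Tokenize one entry such as "Y1c", "Wxb" or "T;".
 * Returns the number of characters consumed, or 0 on a malformed entry.
 */
int tokenize_triple(const char *s, triple *t)
{
    int n = 1, used;
    if (s[0] == '\0' || s[0] == ' ')
        return 0;
    t->pattern = (uint8_t)s[0];
    t->action = 0;
    t->new_state = 0;
    if (s[1] == ';') { /* assumption: ';' is stored as action 10 */
        t->action = 10;
        return 2;
    }
    if (s[1] >= '1' && s[1] <= '9') {
        t->action = (uint8_t)(s[1] - '0');
        n = 2;
    }
    used = state_number(s + n, &t->new_state);
    return used == 0 ? 0 : n + used;
}
```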

A given state can only have seven different patterns (this can be expanded
by changing DWM_MAX_PATTERNS in DwMararc.h).
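
One way to picture the per-state limit is a fixed-size table per state, as in
the C sketch below. Only the name DWM_MAX_PATTERNS and the value seven come
from the text; the state_table layout and add_pattern() are hypothetical and
may not match what DwMararc.h actually declares.

```c
#include <stdint.h>

#define DWM_MAX_PATTERNS 7 /* name and value taken from the text */

/* Hypothetical layout: parallel arrays of tokenized entries for one state. */
typedef struct {
    uint8_t pattern[DWM_MAX_PATTERNS];
    uint8_t action[DWM_MAX_PATTERNS];
    uint8_t new_state[DWM_MAX_PATTERNS];
    int count; /* number of entries in use */
} state_table;

/* Add one tokenized entry; returns 0 on success, -1 if the state is full. */
int add_pattern(state_table *st, uint8_t pattern, uint8_t action,
                uint8_t new_state)
{
    if (st->count >= DWM_MAX_PATTERNS)
        return -1;
    st->pattern[st->count] = pattern;
    st->action[st->count] = action;
    st->new_state[st->count] = new_state;
    st->count++;
    return 0;
}
```

The longest state in the table shown in this document (state 'c') has six
entries, so it fits within this limit.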