10
10
<author>Lars Dölle, Heike Manns
11
11
<htmlurl url="mailto:lars.doelle@on-line.de" name="lars.doelle@on-line.de">
12
12
<htmlurl url="mailto:heike@free-it.org" name="heike@free-it.org">
13
<date>Version 1.8, 7 November 2010
13
<date>Version 2.0.1, 7 July 2012
15
15
Styx is a scanner and parser generator designed to address some
16
16
shortcomings of the traditional lex/yacc combination.
174
174
be weird. Examples are languages that have no context-free grammar (like C,
175
175
which comes with the context-dependent "typedef" construction).
177
This decision was made because Styx has been used for internal language
177
This decision was originally made because Styx has been used for internal language
178
178
design, therefore it did not have to cope with all the oddities out. Now that
179
179
we have released Styx to public use, this design decision may cause troubles
180
180
to other people, although one can work around these restrictions.
182
With the introduction of "dynamic" tokens in version 2.0.1 Styx is able to handle
183
context-dependent constructs like C's "typedef".
184
Beside that users will be able to work around ambiguities in the syntax of programming
185
languages. One example for such an ambiguity is the "prefixexp" and "functioncall" definition
186
in Lua's syntax. (see also <url url="http://www.lua.org/manual/5.1/manual.html#2.5"
187
name="Lua reference manual 5.1, chapter 2.5.8">)
188
Reduce-reduce conflicts can be solved with the help of explicit reduction rules.
182
190
<sect1>Comparison to the lex/yacc combination
202
210
It only means that when one does weird things, the implementation may
203
211
become weird in these places and the support less handy.
205
More substantial restrictions are both in the parser and scanner. The parser
206
is LALR(1), which means that a look-ahead of only one token is supported.
213
More substantial restrictions are in the parser and - originally - the scanner.
214
The parser is LALR(1), which means that a look-ahead of only one token is supported.
207
215
Likely the scanner, in versions prior to 1.7, does a look-ahead of only one character.
208
216
This later restriction has been introduced to guarantee an effective scanner, which is
209
217
linear by the length of the string, while lex provides an arbitrary look
214
222
Additionally, lex allows to switch between different sets of tokens (languages)
215
223
to be able to do even more weird things. Since version 1.6 the Styx scanner is able
216
to handle different token sets, too. Like the lex/yacc combination, the scanner and
217
parser are separated within Styx (also much more integrated), so one can plug-in any
218
other, even a hand-crafted scanner, if nothing else helps.
224
to handle different token sets, too.
226
Like the lex/yacc combination, the scanner and parser are separated within Styx
227
(also much more integrated), so one can plug-in any other, even a hand-crafted scanner,
228
if nothing else helps.
219
229
The disadvantage of doing so is again, that the many supports that Styx offers for
220
230
the canonical design do not apply anymore. One has to write additional code, about
221
231
to the amount that people applying the lex/yacc combination are already used to.
567
577
<item>An optional <tt>Regular Grammar</tt> section defining the tokens.
568
578
<item>An optional <tt>Context Free Grammar</tt> section, which, tautologically,
569
579
is the section where the context free grammar is defined.
580
<item>Optional rules after the grammar definition to explicitly solve
581
reduce-reduce conflicts, introduced in styx version 2.0.1.
571
583
Note: Since version 1.3 the preprocessing facility <em>#include</em> can be used
572
584
to modularize the specification in order to re-use parts of them.
711
724
The reserved words are "Language", "Regular", "Grammar", "Context", "Free",
712
725
"let", "tok", "ign", "com", "ica", "ind", "lan", "InGroup", "ExGroup", "Group",
713
"other", "start" and "err".
726
"other", "start", "err" and "reduce".
714
727
Further, "cons", "nil" and words starting with "ign" have a special meaning
715
728
when used as production names.
716
729
Since version 1.8 the production names "none" and "some" also have a special meaning
729
742
different regular languages.
730
743
( see Example05 for a demonstration )
745
<label id="dynamic tokens">
746
Version 2.0.1 introduces "dynamic" tokens. They are declared to the parser with the
747
QlxDfn-production "defd" and initially unknown to the scanner, i.e. have no regular expressions.
748
With the help of a <ref id="dtok" name="special member production"> token values are assigned to
749
them during the parse process.
750
( see Example08 for a demonstration )
740
760
let QlxDfn ; Qlx-Definition
741
761
:defn : QlxCat QlxOpt QlxGrp0 Ide QlxGrp1 "=" ExpQuot ; regular expression definition
762
:defd : "tok" "<" Ide ">" ; dynamic token
742
763
:igrp : "InGroup" Ide ; inclusive group definition
743
764
:xgrp : "ExGroup" Ide ; exclusive group definition
765
!tgrp : "ExGroup" Ide "[" "tok" "]" ; and ignore keywords in group (version >= 2.0.1)
744
766
:mgrp : "Group" Ide = Ids ; group list definition, introduced in version 1.7
746
768
let Ids ; Group list members
974
996
Since version 1.8 Styx supports an EBNF-like notation for list ans optional members.
975
997
(see the member syntax below and <ref id="example02" name="Example02"> for an application)
1000
With the help of the member production "dtok" it is possible to introduce an as token X parsed value V as dynamic
1001
token Y, where X, Y are denoted by the left and right token identifiers. From that position within the parse process
1002
the scanner always reports V as Y. (styx version >= 2.0.1)
977
1004
The non-terminal names defined have to be unique within the scope of the source and disjunctive from the names
978
1005
of the regular sets defined in the previous section. The production names have to be unique within each
979
1006
non-terminal definition.
1102
<sect2>Explicit rules to solve reduce-reduce conflicts
1104
Often language definitions are ambigous w.r.t LALR(1) parsing. As stated above one example is the
1105
function call and expression syntax of <em>Lua</em>.
1106
(see the styx grammar lua.sty which is part of Example07 as an application)
1109
let Conflicts ; Conflict rules
1114
let Conflict ; Conflict rule (according diagnostic output)
1115
:defn : "Context" State "." Token ":"
1118
let State ; Conflict state
1119
:nat : Nat ; state index
1120
:seq : Seq ; state symbol (keyword)
1121
:ide : Ide ; state symbol (token, nonterminal)
1123
let Token ; Conflict token
1132
let Rules ; Conflicting reduce productions, explicitly ordered
1137
:red : "reduce" Ide "." Ide ; Conflicting reduce production
1072
1140
<sect2>Normalizing Productions
1074
Some of the identifiers for the production names are reserved for normalization. These are "cons",
1075
"nil" and "ign(0-9)+". Beside other keywords used in the Styx grammar, you are otherwise free to
1142
Some of the identifiers for the production names are reserved for normalization. These are "cons(0-9)*",
1143
"nil(0-9)*" and "ign(0-9)+". Beside other keywords used in the Styx grammar, you are otherwise free to
1076
1144
chose these names. The mentioned identifiers serve it's duty as indications of how to make up the
1077
1145
depth grammar. A separate section is devoted to this topic. See below.
1156
1224
A production is well-formed if it belongs to one of the following groups:
1159
<item><tt>let X :nil :</tt> where the production contains no (identifier)
1161
<item><tt>let X :cons : Y Z</tt> and contains exactly two (identifier) members.
1162
<item><tt>let X :ign# : Y</tt> where the production name starts with "ign" and
1163
continues with a natural number (that's what the
1164
"#" indicates) and the production contains exactly
1165
one (identifier) member.
1166
<item><tt>let X :name : X1 .. Xn</tt> where "name" is not "nil", "cons" or
1167
starts with "ign". No restriction apply
1168
to the production members.
1227
<item><tt>let X :nil#*:</tt> where the production name starts with "nil", optionally
1228
followed by a natural number, and the production contains
1229
no (identifier) members.
1230
<item><tt>let X :cons#*: Y Z</tt> where the production name is "cons", optionally numbered,
1231
and contains exactly two (identifier) members.
1232
<item><tt>let X :ign#+ : Y</tt> where the production name starts with "ign"i, followed by
1233
a natural number and the production contains exactly
1234
one (identifier) member.
1235
<item><tt>let X :name : X1 .. Xn</tt> where "name" does not start with "nil", "cons" or "ign".
1236
No restriction apply to the production members.
1170
1238
<item><em>EBNF extension (styx version >= 1.8):</em>
1171
<item><tt>let X :name : X1 .. Xn</tt> where "name" is not "nil", "cons", "none", "some" or
1172
starts with "ign". No restriction apply
1173
to the production members.
1239
<item><tt>let X :name : X1 .. Xn</tt> where "name" is not "none", "some" or
1240
starts with "nil", "cons", "ign".
1241
No restriction apply to the production members.
1174
1242
<item><tt>let X :none :</tt> where the production contains no (identifier) members.
1175
1243
<item><tt>let X :some : Y</tt>where the production contains exactly one (identifier) member.
1179
1247
This grouping serves two purposes. The first two groups will be used to derive list-like productions,
1180
while the "ign" production are used to define identity productions. The later typically occur with
1248
while the "ign" production is used to define identity productions. The later typically occur with
1181
1249
expressions that have different levels binding strength or when likely classes of productions are excluded
1182
or included into certain contexts. When producing the abstract grammar, we consider these non-terminal to
1250
or included into certain contexts. When producing the abstract grammar, we consider these non-terminals to
1185
As examples, see the definitions of Exp, Exp1-2 in the introduction and Exp, Exp1-4 in the
1186
lexical Styx grammar itself. For lists, the context free grammar of Styx (Dfns, Prds, Mbrs) are proper
1253
As examples, see the definitions of Exp, Exp1-2 in the introduction and Exp, ExpDyck, ExpQuot, Exp0-4 in
1254
the lexical Styx grammar itself. For lists, the context free grammar of Styx (Dfns, Prds, Mbrs) are proper
1189
1257
The last two groups will be used to derive option-like productions.
1191
1259
<sect1>An induced congruence relation
1193
1261
We get rid of the superficial distinction between the different non-terminals by means of a congruence
1194
relation over the non-terminal names induced by the special production names "cons","nil" and "ign".
1262
relation over the non-terminal names induced by the special production names "cons#*","nil#*" and "ign#+".
1196
1264
The congruence relation is defined as follows:
1199
1267
<item>X <=> Y --> Y <=> X
1200
1268
<item>X <=> Y && Y <=> Z --> X <=> Z
1201
<item>let X :ign#: Y --> X <=> Y
1202
<item>let X :cons: Y Z --> X <=> Z
1269
<item>let X :ign#+: Y --> X <=> Y
1270
<item>let X :cons#*: Y Z --> X <=> Z
1203
1271
<item>X <=> Y && let X :id: X1 .. Xn && let Y :id: Y1 .. Yn && 1 <= i <= n --> Xi <=> Yi
1242
1310
<item>X <=> Y --> Type(X) = Type(Y), where Type(Z) = { terminal, nonterminal }
1243
1311
<item>let X :id: X1 .. Xm && let Y :id: Y1 .. Yn && X <=> Y --> m = n
1244
<item>(let X :nil: || let X :cons: A B) -->
1245
not exists P,prod: P <=> X && let P :prod: c && prod not in { ign, nil, cons }
1312
<item>(let X :nil#*: || let X :cons#*: A B) -->
1313
not exists P,prod: P <=> X && let P :prod: c && prod not in { ign#+, nil#*, cons#* }
1247
1315
<item><em>EBNF extension (styx version >= 1.8):</em>
1248
1316
<item>(let X^ :none: a || let X^ :some: b)
1249
--> not exists P: P = let X^ :id: c && id =/= { ign#, none, some }
1317
--> not exists P: P = let X^ :id: c && id =/= { ign#+, none, some }
1250
1318
<item>(let X^ :none: a)
1251
1319
--> exists P: P = let X^ :some: b
1252
1320
<item>(let X^ :some: a)
1260
1328
After all this preparation and conditions, we can finally convert the concrete grammar to a signature.
1262
To do this, we map all non-terminals NT which does not have list-productions (those named "cons" or "nil")
1330
To do this, we map all non-terminals NT which does not have list-productions (those named "cons#*" or "nil#*")
1263
1331
or - optionally - option-productions (those named "none" or "some") to their representative names NT^.
1264
1332
Likely, all terminal names T are mapped to their representatives T^.
1265
1333
Collectively, these form the types of the algebra.
1269
1337
the (non-)terminal names to types. The difference is, that we have to cope with list-production and - optionally -
1270
1338
option-production, which have been omitted earlier.
1271
1339
|X| is X^ if we have a non-list and - optionally - non-terminal or a terminal X.
1272
If X is a non-terminal with a production "let A :cons: B C" and X <=> A, |X| is List(|B|).
1340
If X is a non-terminal with a production "let A :cons#*: B C" and X <=> A, |X| is List(|B|).
1273
1341
(If we only have nil-productions, the translation is List(void)).
1274
1342
Optionally: If X is a non-terminal with a production "let A :some: B" and X <=> A, |X| is Option(|B|).
1297
styx = Start_Source(Source)
1299
Source = root(OptNat, Ide, QlxDfn*, OptCfg)
1304
QlxDfn = mgrp(Ide, Ide*);
1306
defn(QlxCat, QlxOpt, QlxGrp, Ide, QlxGrp, Exp);
1326
Exp = conc(Exp, Exp);
1331
epat(Exp, Ide, Exp);
1336
spat(Exp, Set, Exp);
1337
dyck(Exp, Exp, Exp);
1350
Dfn = defn(Cat, DfnOpt, Ide, Prd*)
1362
Prd = prod(Lay, Ide, Mbr*)
1365
klst0(Seq*, Mbr, Seq*, Seq*);
1367
klst1(Seq*, Mbr, Seq*, Seq*);
1368
opt(Seq*, Mbr, Seq*);
1365
styx = Start_Source(Source)
1367
Source = root(OptNat, Ide, QlxDfn*, OptCfg)
1370
cfg(Dfn*, Conflict*)
1373
defn(QlxCat, QlxOpt, QlxGrp, Ide, QlxGrp, Exp);
1396
Exp = conc(Exp, Exp);
1401
dyck(Exp, Exp, Exp);
1406
epat(Exp, Ide, Exp);
1417
Limit = range(Nat, OptNat);
1420
Dfn = defn(Cat, DfnOpt, Ide, Prd*)
1432
Prd = prod(Lay, Ide, Mbr*)
1434
Mbr = opt(Seq*, Mbr, Seq*);
1436
klst1(Seq*, Mbr, Seq*, Seq*);
1439
klst0(Seq*, Mbr, Seq*, Seq*);
1442
Conflict = defn(State, Token, Rule*)
1451
Rule = red(Ide, Ide)
2652
2734
<sect1>Unicode support
2654
The Styx scanner & parser generator should be able to deal with
2736
The Styx scanner & parser generator is able to deal with
2655
2737
unicode based language definitions and scan streams. Since version 1.5
2656
each of the released programs support unicode. This capability was added
2657
later and isn't yet tested very well.
2738
each of the released programs support unicode.
2659
2740
First you have to design the proper grammar. Styx itself doesn't
2660
2741
accept unicode specifications. You define <ref id="unicode literals" name="unicode tokens and keywords">
3034
3115
with the <ref id="styx program" name="'styx' program">. The C++ runtime scanner and parse table classes provide
3035
3116
methods to import the exported tables.
3037
Note, that there are some restrictions regarding the Styx grammar specification. The following features won't be
3038
supported by the C++ runtime system:
3118
Note, that there are some restrictions regarding the Styx grammar specification. For now, the following features
3119
won't be supported by the C++ runtime system:
3040
3121
<item><ref id="indended languages" name="Indended languages">
3041
3122
<item><ref id="embedded languages" name="Embedded languages">
3123
<item><ref id="dynamic tokens" name="Dynamic tokens">
3044
3126
Further, there won't be a generated C++ interface to the abstract syntax tree. You have to use the generic C++
3054
3136
<sect2>C# Runtime system
3056
3138
The source distribution of version 1.7.6 comes with a C# runtime library.
3057
For now, the C# library contains the module 'StyxScanner.cs' for the construction of scanners. In order to use it with a Styx grammar,
3058
you first have to specify the grammar and export the scanner with the <ref id="styx program" name="'styx' program">.
3139
For now, the C# library contains the module 'StyxScanner.cs' for the construction of scanners. In order to use it with
3140
a Styx grammar, you first have to specify the grammar and export the scanner with the <ref id="styx program" name="'styx' program">.
3059
3141
The C# runtime scanner classes provide methods to import the exported scanner table.
3061
3143
Similar to the basic Styx system the C# runtime system provides a scanner test program 'StyxScannerTest',
3089
3171
and context-dependent grammar (many do), cannot be
3090
3172
parsed without extra tricks. Take the "typedef"
3091
3173
declaration of C together with the application of
3092
the type name in type denotations as example. Likely
3093
examples are languages with definable operator
3174
the type name in type denotations as example.
3175
Likely examples are languages with definable operator
3094
3176
precedences as Algol98 or Prolog for instance.
3095
Styx provides a few hooks to cope with stuff like this.
3177
Styx provides a few hooks to cope with stuff like this,
3178
since version 2.0.1 the usage of "dynamic" tokens to
3179
solve the first mentioned problem.
3098
3182
<sect1>Intensive grammar abstractions