<TITLE>Snowball Manual</TITLE></HEAD>
<TABLE WIDTH=75% ALIGN=CENTER COLS=1>
<H1 ALIGN=CENTER>Snowball Manual</H1>
<TR><TD BGCOLOR="#FCFCe0">
<BR> <H2>Links to resources</H2>
<DL><DD><TABLE CELLPADDING=0>
<TR><TD><A HREF="../q/use.html"> Using Snowball</A>
<TR><TD><A HREF="../porter/stemmer.html"> Porter stemmer - a case study</A>
<BR> <H2>Snowball definition</H2>
Snowball is a small string-handling language, and its name was chosen as a
tribute to SNOBOL (Farber 1964, Griswold 1968), with which it shares the
concept of string patterns delivering signals that are used to control the
<BR> <H2>Data types</H2>
The basic data types handled by Snowball are strings of characters, signed
integers, and boolean truth values, or more simply <I>strings</I>, <I>integers</I>
and <I>booleans</I>. At present Snowball's characters are 8-bit characters (ASCII and its 8-bit extensions), which is
sufficient for presenting the algorithms, but at some point Unicode
characters will need to be supported.
<BR> <H2>Names</H2>
A name in Snowball is a letter followed by zero or more letters, digits
and underlines. A name can be of type <I>string</I>, <I>integer</I>, <I>boolean</I>,
<I>routine</I>, <I>external</I> or <I>grouping</I>. All names must be declared. A
declaration has the form
where symbol <TT>T</TT> is one of <TT>string</TT>, <TT>integer</TT> etc, and the region in
brackets contains a list of names separated by whitespace. For example,
Step_1a Step_1b Step_1c Step_2 Step_3 Step_4 Step_5a Step_5b
groupings ( v v_WXY v_LSZ )
<TT>p1</TT> and <TT>p2</TT> are integers, <TT>Y_found</TT> is boolean, and so on. Snowball is quite
strict about the declarations, so all the names go in the same name space,
no name may be declared twice, all used names must be declared, no two
routine definitions can have the same name, etc. Names declared and
subsequently not used are merely reported in a warning message. A name may
not be one of the reserved words of Snowball.
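As a minimal sketch of the naming rule above, the following Python check accepts a letter followed by letters, digits and underlines. The function name is_valid_name is illustrative only, and the reserved-word check is left out because the list of reserved words is not given at this point.

```python
import re

# A name is a letter followed by zero or more letters, digits and underlines.
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

def is_valid_name(text):
    """Return True if text has the shape of a Snowball name."""
    return NAME_RE.fullmatch(text) is not None

print(is_valid_name("Step_1a"))
print(is_valid_name("1st_step"))
```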
<BR> <H2>Literals</H2>
A literal integer is a digit sequence, and is always interpreted as
decimal. A literal string is written between single quotes, for example,
A string may be preceded by the word <TT>hex</TT>, in which case the contents are
interpreted as being written out in hexadecimal notation, e.g.
hex '0D0A' /* carriage return, line feed */
hex '0d0a' /* lower case also allowed */
hex '0D 0A' /* spaces ignored */
But <TT>hex 'd0a'</TT> would be an error: the number of enclosed symbols must be even.
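The hex rule can be illustrated in Python. This is only a sketch of the decoding described above, not the Snowball compiler's own code; the helper name decode_hex_literal is an assumption made for the example.

```python
def decode_hex_literal(contents):
    """Decode the contents of a hex '...' literal into bytes."""
    digits = contents.replace(" ", "")   # spaces are ignored
    if len(digits) % 2 != 0:
        raise ValueError("hex literal needs an even number of digits")
    return bytes.fromhex(digits)         # upper and lower case both accepted

print(decode_hex_literal("0D 0A"))
```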
Strings may include special <I>string macro</I> sequences, to handle unusual
character combinations.
String macros may be defined in the form <TT>stringdef m 'S'</TT>, where <TT>'S'</TT> is a
string, and <TT>m</TT> a sequence of one or more printing characters terminated by whitespace.
Two special <I>insert characters</I> are defined by the directive
<TT>stringescapes AB</TT>, where <TT>A</TT> and <TT>B</TT> are printing characters, and <TT>A</TT> is not
single quote. (<TT>B</TT> may equal <TT>A</TT>, but then <TT>A</TT> itself can never be escaped.) For
A subsequent occurrence of the same directive redefines the insert characters.
Thereafter, <TT>[m]</TT> inside a string causes <TT>S</TT> to be substituted in place of <TT>m</TT>.
Immediately after the stringescapes directive, <TT>[']</TT> will substitute <TT>'</TT> and
<TT>[[]</TT> will substitute <TT>[</TT>, although macros <TT>'</TT> and <TT>[</TT> may subsequently be
redefined. A further feature is that <TT>[<I>W</I>]</TT> inside a string, where <TT><I>W</I></TT> is a
sequence of whitespace characters including one or more newlines, is
ignored. This enables long strings to be written over a number of lines.
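The substitution rules above can be modelled with a short Python function. It is a sketch under stated assumptions: the insert characters are taken to be [ and ], the macro table is a plain dictionary, and the name expand is hypothetical.

```python
def expand(text, macros, left="[", right="]"):
    """Expand [m] sequences in text using the macros table."""
    table = {"'": "'", left: left}     # defaults right after stringescapes
    table.update(macros)
    out, i = [], 0
    while i < len(text):
        if text[i] != left:
            out.append(text[i])
            i += 1
            continue
        j = text.index(right, i + 1)   # end of the macro name
        name = text[i + 1:j]
        if name.strip() == "" and "\n" in name:
            pass                       # whitespace with a newline is ignored
        else:
            out.append(table[name])
        i = j + 1
    return "".join(out)

print(expand("it[']s [[]x]", {}))
```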
/* special Spanish characters (in MS-DOS Latin I, i.e. code page 850) */
stringdef a' hex 'A0' // a-acute
stringdef e' hex '82' // e-acute
stringdef i' hex 'A1' // i-acute
stringdef o' hex 'A2' // o-acute
stringdef u' hex 'A3' // u-acute
stringdef u" hex '81' // u-diaeresis
stringdef n~ hex 'A4' // n-tilde
/* and in the next string we define all the characters in Spanish
used to represent vowels */
define v 'aeiou[a'][e'][i'][o'][u'][u"]'
<BR> <H2>Routines</H2>
A routine definition has the form
where <TT>R</TT> is the routine name and <TT>C</TT> is a command, or bracketed group of
commands. So a routine is defined as a sequence of zero or more commands.
Snowball routines do not (at present) take parameters. For example,
define Step_5b as ( // this defines Step_5b
['l'] // three commands here: [, 'l' and ]
R2 'l' // two commands, R2 and 'l'
delete // delete is one command
define R1 as $p1 <= cursor
/* R1 is defined as the single command "$p1 <= cursor" */
A routine is called simply by using its name, <TT>R</TT>, as a command.
<BR> <H2>Commands and signals</H2>
The flow of control in Snowball is arranged by the implicit use of
<I>signals</I>, rather than the explicit use of constructs like the <TT>if</TT>,
<TT>then</TT>, <TT>break</TT> of C. The scheme is designed for handling strings, but is
perhaps easier to introduce using integers. Suppose <TT>x</TT>, <TT>y</TT>, <TT>z</TT> ... are
integers. The command
sets <TT>x</TT> to 1. The command
tests if <TT>x</TT> is greater than zero. Both commands give a signal <B><I>t</I></B> or <B><I>f</I></B>,
(<I>true</I> or <I>false</I>), but while the second command gives <B><I>t</I></B> if <TT>x</TT> is greater
than zero and <B><I>f</I></B> otherwise, the first command always gives <B><I>t</I></B>. In Snowball,
every command gives a <B><I>t</I></B> or <B><I>f</I></B> signal. A sequence of commands can be turned
into a single command by putting them in a list surrounded by round brackets:
( C<SUB>1</SUB> C<SUB>2</SUB> C<SUB>3</SUB> ... C<SUB>i</SUB> C<SUB>i+1</SUB> ... )
When this is obeyed, <TT>C<SUB>i+1</SUB></TT> will be obeyed if each of the preceding <TT>C<SUB>1</SUB></TT> ...
<TT>C<SUB>i</SUB></TT> gives <B><I>t</I></B>, but as soon as a <TT>C<SUB>i</SUB></TT> gives <B><I>f</I></B>, the subsequent <TT>C<SUB>i+1</SUB> C<SUB>i+2</SUB></TT> ...
are ignored, and the whole sequence gives signal <B><I>f</I></B>. If all the <TT>C<SUB>i</SUB></TT> give <B><I>t</I></B>,
however, the bracketed command sequence also gives <B><I>t</I></B>. So,
sets <TT>y</TT> to 1 if <TT>x</TT> is greater than zero. If <TT>x</TT> is less than or equal to zero
the two commands give <B><I>f</I></B>.
If <TT>C<SUB>1</SUB></TT> and <TT>C<SUB>2</SUB></TT> are commands, we can build up the larger commands,
<DT><TT>C<SUB>1</SUB> or C<SUB>2</SUB></TT>
<DD>- Do <TT>C<SUB>1</SUB></TT>. If it gives <B><I>t</I></B> ignore <TT>C<SUB>2</SUB></TT>, otherwise do <TT>C<SUB>2</SUB></TT>. The resulting
signal is <B><I>t</I></B> if and only if <TT>C<SUB>1</SUB></TT> or <TT>C<SUB>2</SUB></TT> gave <B><I>t</I></B>.
<DT><TT>C<SUB>1</SUB> and C<SUB>2</SUB></TT>
<DD>- Do <TT>C<SUB>1</SUB></TT>. If it gives <B><I>f</I></B> ignore <TT>C<SUB>2</SUB></TT>, otherwise do <TT>C<SUB>2</SUB></TT>. The resulting
signal is <B><I>t</I></B> if and only if <TT>C<SUB>1</SUB></TT> and <TT>C<SUB>2</SUB></TT> gave <B><I>t</I></B>.
<DT><TT>not C</TT>
<DD>- Do <TT>C</TT>. The resulting signal is <B><I>t</I></B> if <TT>C</TT> gave <B><I>f</I></B>, otherwise <B><I>f</I></B>.
<DT><TT>try C</TT>
<DD>- Do <TT>C</TT>. The resulting signal is <B><I>t</I></B> whatever the signal of <TT>C</TT>.
<DT><TT>fail C</TT>
<DD>- Do <TT>C</TT>. The resulting signal is <B><I>f</I></B> whatever the signal of <TT>C</TT>.
<DT><TT>($x > 0 $y = 1) or ($y = 0)</TT>
<DD>- sets <TT>y</TT> to 1 if <TT>x</TT> is greater than zero, otherwise to zero.
<DT><TT>try( ($x > 0) and ($z > 0) $y = 1)</TT>
<DD>- sets <TT>y</TT> to 1 if both <TT>x</TT> and <TT>z</TT> are greater than 0, and gives <B><I>t</I></B>.
This last example is the same as
try($x > 0 $z > 0 $y = 1)
so that <TT>and</TT> seems unnecessary here. But we will see that <TT>and</TT> has a
particular significance in string commands.
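The signal scheme for integer commands can be sketched in Python by treating each command as a function that returns True for signal t and False for signal f. The helper names (assign, greater, seq, or_, and_, not_, try_, fail) are hypothetical and exist only for this illustration.

```python
# Each command takes the variable table and returns True (t) or False (f).
def assign(name, value):
    def run(env):
        env[name] = value
        return True                  # an assignment always gives t
    return run

def greater(name, value):
    return lambda env: env[name] > value

def seq(*commands):                  # ( C1 C2 ... ): stop at the first f
    return lambda env: all(c(env) for c in commands)

def or_(c1, c2):
    return lambda env: c1(env) or c2(env)

def and_(c1, c2):
    return lambda env: c1(env) and c2(env)

def not_(c):
    return lambda env: not c(env)

def try_(c):                         # do C, then give t whatever happened
    def run(env):
        c(env)
        return True
    return run

def fail(c):                         # do C, then give f whatever happened
    def run(env):
        c(env)
        return False
    return run

# ($x > 0 $y = 1) or ($y = 0)
set_y = or_(seq(greater("x", 0), assign("y", 1)), assign("y", 0))
env = {"x": -3, "y": 7}
signal = set_y(env)
print(signal, env["y"])
```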
When a ‘monadic’ construct like <TT>not</TT>, <TT>try</TT> or <TT>fail</TT> is not followed by a
round bracket, the construct applies to the shortest following valid command.
try not $x < 1 $z > 0
is equivalent to
try ( not ( $x < 1 ) ) $z > 0
because <TT>$x < 1</TT> is the shortest valid command following <TT>not</TT>, and then
<TT>not $x < 1</TT> is the shortest valid command following <TT>try</TT>.
The ‘dyadic’ constructs like <TT>and</TT> and <TT>or</TT> must sit in a bracketed list
of commands anyway, for example,
( C<SUB>1</SUB> C<SUB>2</SUB> and C<SUB>3</SUB> C<SUB>4</SUB> or C<SUB>5</SUB> )
And then in this case <TT>C<SUB>2</SUB></TT> and <TT>C<SUB>3</SUB></TT> are connected by the <TT>and</TT>; <TT>C<SUB>4</SUB></TT> and <TT>C<SUB>5</SUB></TT> are
connected by the <TT>or</TT>. So
$x > 0 not $y > 0 or not $z > 0 $t > 0
is equivalent to
$x > 0 ((not ($y > 0)) or (not ($z > 0))) $t > 0
<TT>and</TT> and <TT>or</TT> are equally binding, and bind from left to right,
so <TT>C<SUB>1</SUB> or C<SUB>2</SUB> and C<SUB>3</SUB></TT> means <TT>(C<SUB>1</SUB> or C<SUB>2</SUB>) and C<SUB>3</SUB></TT> etc.
<BR> <H2>AEs and integer commands</H2>
An AE (arithmetic expression) consists of integer names and literal
numbers connected by dyadic <TT>+</TT>, <TT>-</TT>, <TT>*</TT> and <TT>/</TT>, and monadic <TT>-</TT>, with the same
binding powers and semantics as C. An integer command has the form
<TT>$X <I>op</I> AE</TT>
where <TT>X</TT> is an integer name and <I>op</I> is one of the six tests <TT>==</TT>, <TT>!=</TT>, <TT>>=</TT>, <TT>></TT>,
<TT><=</TT>, <TT><</TT>, or five assignments <TT>=</TT>, <TT>+=</TT>, <TT>-=</TT>, <TT>*=</TT>, <TT>/=</TT>. Again, the meanings are the same as in C.
As well as integer names and literal numbers, the following may be used in AEs:
<DL><DD><TABLE CELLPADDING=0>
<TR><TD><TT>minint</TT> <TD></TD><TD> - the smallest negative number
<TR><TD><TT>maxint</TT> <TD></TD><TD> - the largest positive number
<TR><TD><TT>sizeof s</TT> <TD></TD><TD> - the number of characters in <TT>s</TT>, where <TT>s</TT> is the name of a string
<TR><TD><TT>cursor</TT> <TD></TD><TD> - the current value of the string <I>cursor</I>
<TR><TD><TT>limit</TT> <TD></TD><TD> - the current value of the string <I>limit</I>
<TR><TD><TT>size</TT> <TD></TD><TD> - the size of the string, in characters
The <I>cursor</I> and <I>limit</I> concepts are explained below.
Examples of integer commands are,
$p1 <= cursor // signal is f if the cursor is before position p1
$p1 = limit // set p1 to the string limit
<BR> <H2>String commands</H2>
If <TT>s</TT> is a string name, a string command has the form
where <TT>C</TT> is a command that operates on the string. Strings can be processed
left-to-right or right-to-left, but we will describe only the
left-to-right case for now. The string has a <I>cursor</I>, which we will
denote by <B><I>c</I></B>, and a limit point, or <I>limit</I>, which we will denote by <B><I>l</I></B>. <B><I>c</I></B>
advances towards <B><I>l</I></B> in the course of a string command, but the various
constructs <TT>and</TT>, <TT>or</TT>, <TT>not</TT> etc have side-effects which keep moving it
backwards. Initially <B><I>c</I></B> is at the start and <B><I>l</I></B> the end of the string. For example,
'a|n|i|m|a|d|v|e|r|s|i|o|n'
<B><I>c</I></B>, and <B><I>l</I></B>, mark the boundaries between characters, and not
characters themselves. The characters between <B><I>c</I></B> and <B><I>l</I></B> will be denoted by <B><I>c:l</I></B>.
If <TT>C</TT> gives <B><I>t</I></B>, the cursor <B><I>c</I></B> will have a new, well-defined value. But if <TT>C</TT>
gives <B><I>f</I></B>, <B><I>c</I></B> is undefined. Its later value will in fact be determined by the
outer context of commands in which <TT>C</TT> came to be obeyed, not by <TT>C</TT> itself.
Here is a list of the commands that can be used to operate on strings.
<H4>a) Setting a value</H4>
<DT><TT>= S</TT>
<DD>where <TT>S</TT> is the name of a string or a literal string. <B><I>c:l</I></B> is set equal
to <TT>S</TT>, and <B><I>l</I></B> is adjusted to point to the end of the copied string. The
signal is <B><I>t</I></B>. For example,
$x = 'animadversion' /* literal string */
$y = x /* string name */
<H4>b) Basic tests</H4>
<DT><TT>S</TT>
<DD>here and below, <TT>S</TT> is the name of a string or a literal string. If <B><I>c:l</I></B>
begins with the substring <TT>S</TT>, <B><I>c</I></B> is repositioned to the end of this
substring, and the signal is <B><I>t</I></B>. Otherwise the signal is <B><I>f</I></B>. For example,
$x 'anim' /* gives t, assuming the string is 'animadversion' */
$x ('anim' 'ad' 'vers')
<DT><TT>C<SUB>1</SUB> or C<SUB>2</SUB></TT>
<DD>This is like the case for integers described above, but the extra
touch is that if <TT>C<SUB>1</SUB></TT> gives <B><I>f</I></B>, <B><I>c</I></B> is set back to its old position after
<TT>C<SUB>1</SUB></TT> has given <B><I>f</I></B> and before <TT>C<SUB>2</SUB></TT> is tried, so that the test takes place on
the same point in the string. So we have
$x ('anim' /* signal t */
'ation' /* signal f */
( 'an' /* signal t - from the beginning */
<DT><TT>true</TT>, <TT>false</TT>
<DD><TT>true</TT> is a dummy command that generates signal <B><I>t</I></B>. <TT>false</TT> generates
signal <B><I>f</I></B>. They are sometimes useful for emphasis,
define start_off as true // nothing to do
define exception_list as false // put in among(...) list later
<TT>true</TT> is equivalent to <TT>()</TT>
<DT><TT>C<SUB>1</SUB> and C<SUB>2</SUB></TT>
<DD>And similarly <B><I>c</I></B> is set back to its old position after <TT>C<SUB>1</SUB></TT> has given <B><I>t</I></B>
and before <TT>C<SUB>2</SUB></TT> is tried. So,
$x 'anim' and 'an' /* signal t */
$x ('anim' 'an') /* signal f, since 'an' and 'ad' mis-match */
<DT><TT>not C</TT>, <TT>try C</TT>
<DD>These are like the integer tests, with the added feature that <B><I>c</I></B> is set
back to its old position after an <B><I>f</I></B> signal is turned into <B><I>t</I></B>. So,
$x (not 'animation' not 'immersion')
/* both tests are done at the start of the string */
$x (try 'animus' try 'an'
<DL><DD><TABLE CELLPADDING=0>
<TR><TD> <TT>try C</TT> <TD></TD><TD> is equivalent to <TD></TD><TD> <TT>C or true</TT>
<DT><TT>test C</TT>
<DD>This does command <TT>C</TT> but without advancing <B><I>c</I></B>. Its signal is the same as
the signal of <TT>C</TT>, but following signal <B><I>t</I></B>, <B><I>c</I></B> is set back to its old value.
<DL><DD><TABLE CELLPADDING=0>
<TR><TD> <TT>test C</TT> <TD></TD><TD> is equivalent to <TD></TD><TD> <TT>not not C</TT>
<TR><TD> <TT>test C<SUB>1</SUB> C<SUB>2</SUB></TT> <TD></TD><TD> is equivalent to <TD></TD><TD> <TT>C<SUB>1</SUB> and C<SUB>2</SUB></TT>
<DT><TT>fail C</TT>
<DD>This does <TT>C</TT> and gives signal <B><I>f</I></B>. It is equivalent to <TT>C false</TT>. Like
<TT>false</TT> it is useful, but only rarely.
<DT><TT>do C</TT>
<DD>This does <TT>C</TT>, puts <B><I>c</I></B> back to its old value and gives signal <B><I>t</I></B>. It is
very useful as a way of suppressing the side effect of <B><I>f</I></B> signals and
<DL><DD><TABLE CELLPADDING=0>
<TR><TD> <TT>do C</TT> <TD></TD><TD> is equivalent to <TD></TD><TD> <TT>try test C</TT>
<TR><TD> <TD></TD><TD> or <TD></TD><TD> <TT>test try C</TT>
<DT><TT>goto C</TT>
<DD><B><I>c</I></B> is moved right until obeying <TT>C</TT> gives <B><I>t</I></B>. But if <B><I>c</I></B> cannot be moved
right because it is at <B><I>l</I></B> the signal is <B><I>f</I></B>. <B><I>c</I></B> is set back to the position
it had before the last obeying of <TT>C</TT>, so the effect is to leave <B><I>c</I></B> before
the pattern which matched against <TT>C</TT>.
$x goto 'ad' /* positions c after 'anim' */
$x goto 'ax' /* signal f */
<DT><TT>gopast C</TT>
<DD>Like goto, but <B><I>c</I></B> is not set back, so the effect is to leave <B><I>c</I></B> after
the pattern which matched against <TT>C</TT>.
$x gopast 'ad' /* positions c after 'animad' */
<DT><TT>repeat C</TT>
<DD><TT>C</TT> is repeated until it gives <B><I>f</I></B>. When this happens <B><I>c</I></B> is set back to the
position it had before the last repetition of <TT>C</TT>, and <TT>repeat C</TT> gives
signal <B><I>t</I></B>. For example,
$x repeat gopast 'a' /* position c after the last 'a' */
<DT><TT>loop AE C</TT>
<DD>This is like <TT>C C ... C</TT> written out AE times, where AE is an arithmetic
expression. For example,
$x loop 2 gopast ('a' or 'e' or 'i' or 'o' or 'u')
/* position c after the second vowel */
The equivalent expression in C has the shape,
for (i = 0; i < limit; i++) C;
<DT><TT>atleast AE C</TT>
<DD>This is equivalent to <TT>loop AE C repeat C</TT>.
<DT><TT>hop AE</TT>
<DD>moves <B><I>c</I></B> AE character positions towards <B><I>l</I></B>, but if AE is negative, or if
there are fewer than AE characters between <B><I>c</I></B> and <B><I>l</I></B> the signal is <B><I>f</I></B>.
For example, <TT>test hop 3</TT>
tests that <B><I>c:l</I></B> contains more than 2 characters.
<DT><TT>next</TT>
<DD>is equivalent to <TT>hop 1</TT>.
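The cursor behaviour of the basic tests can be modelled in Python. This is a sketch under stated assumptions, not the compiler's implementation: a string is held in a small State object with fields c and l, each command is a function returning True (t) or False (f), and all helper names are hypothetical.

```python
class State:
    def __init__(self, text):
        self.text, self.c, self.l = text, 0, len(text)

def lit(s):                          # literal test: match s at c
    def run(st):
        if st.text.startswith(s, st.c, st.l):
            st.c += len(s)
            return True
        return False
    return run

def seq(*commands):
    return lambda st: all(c(st) for c in commands)

def or_(c1, c2):
    def run(st):
        old = st.c
        if c1(st):
            return True
        st.c = old                   # restore c before C2 is tried
        return c2(st)
    return run

def and_(c1, c2):
    def run(st):
        old = st.c
        if not c1(st):
            return False
        st.c = old                   # C2 starts from the same point
        return c2(st)
    return run

def not_(c):
    def run(st):
        old = st.c
        if c(st):
            return False
        st.c = old
        return True
    return run

def _scan(c, stay_after):
    def run(st):
        while True:
            old = st.c
            if c(st):
                if not stay_after:
                    st.c = old       # goto leaves c before the match
                return True
            if old >= st.l:
                return False         # c cannot move right any further
            st.c = old + 1
    return run

def goto(c):
    return _scan(c, False)

def gopast(c):
    return _scan(c, True)

def repeat(c):
    def run(st):
        while True:
            old = st.c
            if not c(st):
                st.c = old           # undo the failed last repetition
                return True
    return run

st = State("animadversion")
print(goto(lit("ad"))(st), st.c)
```

Note that repeat in this sketch loops forever if its command gives t without moving c, so it should only be given commands that advance the cursor.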
<H4>c) Moving text about</H4>
We have seen in (a) that <TT>$x = y</TT>, when <TT>x</TT> and <TT>y</TT> are strings, sets <B><I>c:l</I></B> of <TT>x</TT>
to the value of <TT>y</TT>. Conversely
<TT>$x => y</TT>
sets the value of <TT>y</TT> to the <B><I>c:l</I></B> region of <TT>x</TT>.
A more delicate mechanism for pushing text around is to define a substring,
or <I>slice</I> of the string being tested. Then
<DT><TT>[</TT>
<DD>sets the left-end of the slice to <B><I>c</I></B>,
<DT><TT>]</TT>
<DD>sets the right-end of the slice to <B><I>c</I></B>,
<DT><TT>-> s</TT>
<DD>moves the slice to variable <TT>s</TT>,
<DT><TT><- S</TT>
<DD>replaces the slice with variable (or literal) <TT>S</TT>.
/* assume x holds 'animadversion' */
$x ( [ // '[animadversion' - [ set as indicated
// '[anima|dversion' - c is marked by '|'
] // '[anima]dversion' - ] set as indicated
For any string, the slice ends should be assumed to be unset until they are
set with the two commands <TT>[</TT>, <TT>]</TT>. Thereafter the slice ends will retain
the same values until altered.
<DT><TT>delete</TT>
<DD>is equivalent to <TT><- ''</TT>
This next example deletes all vowels in x,
define vowel as ('a' or 'e' or 'i' or 'o' or 'u')
$ x repeat ( gopast([vowel]) delete )
As this example shows, the slice markers <TT>[</TT> and <TT>]</TT> often appear as
pairs in a bracketed style, which makes for easy reading of the Snowball
scripts. But it must be remembered that, unusually in a computer
programming language, they are not true brackets.
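The vowel-deleting loop can be traced with a Python sketch that keeps the cursor and the two slice ends as plain integers. The function name delete_vowels and the variables bra and ket are assumptions made for the illustration.

```python
def delete_vowels(text, vowels="aeiou"):
    """Mimic `repeat ( gopast([vowel]) delete )` on a Python string."""
    c = 0                                # the cursor
    while True:
        # gopast([vowel]): move right until a vowel is found at c
        while c < len(text) and text[c] not in vowels:
            c += 1
        if c == len(text):
            return text                  # gopast gave f, so repeat stops
        bra, ket = c, c + 1              # [ and ] on either side of the vowel
        text = text[:bra] + text[ket:]   # delete, that is <- ''
        c = bra                          # c sits where the slice was

print(delete_vowels("animadversion"))
```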
More simply, text can be inserted at <B><I>c</I></B>.
<DT><TT>insert S</TT>
<DD>insert variable or literal <TT>S</TT> before <B><I>c</I></B>, moving <B><I>c</I></B> to the right of the
insert. <TT><+</TT> is a synonym for <TT>insert</TT>.
<DT><TT>attach S</TT>
<DD>the same, but leave <B><I>c</I></B> at the left of the insert.
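The difference between the two commands is only where the cursor ends up, as this small Python sketch shows. Both functions are hypothetical helpers that return the new text together with the new cursor position.

```python
def insert(text, c, s):
    """insert S: put s before c and leave c to the right of it."""
    return text[:c] + s + text[c:], c + len(s)

def attach(text, c, s):
    """attach S: put s at c and leave c to the left of it."""
    return text[:c] + s + text[c:], c

print(insert("animadversion", 4, "XY"))
print(attach("animadversion", 4, "XY"))
```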
<H4>d) Marks</H4>
The cursor, <B><I>c</I></B>, (and the limit, <B><I>l</I></B>) can be thought of as having a numeric
value, from zero upwards:
| a | n | i | m | a | d | v | e | r | s | i | o | n |
0 1 2 3 4 5 6 7 8 9 10 11 12 13
It is these numeric values of <B><I>c</I></B> and <B><I>l</I></B> which are accessible through
<TT>cursor</TT> and <TT>limit</TT> in arithmetic expressions.
<DT><TT>setmark X</TT>
<DD>sets <TT>X</TT> to the current value of <B><I>c</I></B>, where <TT>X</TT> is an integer variable.
<DT><TT>tomark AE</TT>
<DD>moves <B><I>c</I></B> forward to the position given by AE,
<DT><TT>atmark AE</TT>
<DD>tests if <B><I>c</I></B> is at position AE (<B><I>t</I></B> or <B><I>f</I></B> signal).
In the case of <TT>tomark AE</TT>, a similar fail condition occurs as with <TT>hop AE</TT>.
If <B><I>c</I></B> is already beyond AE, or if position <B><I>l</I></B> is before position AE, the
signal is <B><I>f</I></B>.
In the stemming algorithms, certain regions of the word are defined by
setting marks, and later the failure condition of <TT>tomark</TT> is used to see if
<B><I>c</I></B> is inside a particular region.
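The fail conditions of tomark in the forward direction can be written out as a Python sketch. The function names are hypothetical, and returning c unchanged on failure is an assumption of the sketch, since the value of c after an f signal is left to the surrounding context.

```python
def tomark(c, l, mark):
    """Return (signal, new c) for `tomark mark` in forward mode."""
    if c > mark or l < mark:
        return False, c      # c already beyond the mark, or l before it
    return True, mark

def atmark(c, mark):
    """Return the signal of `atmark mark`."""
    return c == mark

print(tomark(2, 13, 5))
```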
Two other commands put <B><I>c</I></B> at <B><I>l</I></B>, and test if <B><I>c</I></B> is at <B><I>l</I></B>,
<DT><TT>tolimit</TT>
<DD>moves <B><I>c</I></B> forward to <B><I>l</I></B> (signal <B><I>t</I></B> always),
<DT><TT>atlimit</TT>
<DD>tests if <B><I>c</I></B> is at <B><I>l</I></B> (<B><I>t</I></B> or <B><I>f</I></B> signal).
<H4>e) Changing <B><I>l</I></B></H4>
In this account of string commands we see <B><I>c</I></B> moving right towards <B><I>l</I></B>, while
<B><I>l</I></B> stays fixed at the end. In fact <B><I>l</I></B> can be reset to a new position between
<B><I>c</I></B> and its old position, to act as a shorter barrier for the movement of <B><I>c</I></B>.
<DT><TT>setlimit C<SUB>1</SUB> for C<SUB>2</SUB></TT>
<DD><TT>C<SUB>1</SUB></TT> is obeyed, and if it gives <B><I>t</I></B> the final value of <B><I>c</I></B> becomes the new
position of <B><I>l</I></B>. <B><I>c</I></B> is then set back to its old value before <TT>C<SUB>1</SUB></TT> was
obeyed, and <TT>C<SUB>2</SUB></TT> is obeyed. Finally <B><I>l</I></B> is set back to its old position.
The signal is <B><I>f</I></B> if either <TT>C<SUB>1</SUB></TT> or <TT>C<SUB>2</SUB></TT> gives <B><I>f</I></B>, otherwise <B><I>t</I></B>.
$x ( setlimit goto 's' // 'animadver}sion' new l as marked '}'
for // below, '|' marks c after each goto
( goto 'a' and // '|animadver}sion'
goto 'e' and // 'animadv|er}sion'
goto 'i' and // 'an|imadver}sion'
This checks that x has characters ‘a’, ‘e’ and ‘i’ before the first ‘s’.
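The same check can be sketched in Python, with the temporary limit modelled by slicing the string. The function name setlimit_example is hypothetical, and the sketch covers only this one example rather than setlimit in general.

```python
def setlimit_example(text):
    """Sketch of: setlimit goto 's' for (goto 'a' and goto 'e' and goto 'i')."""
    new_l = text.find("s")       # goto 's' leaves c just before the first 's'
    if new_l < 0:
        return False             # C1 gave f, so the whole command gives f
    region = text[:new_l]        # c is back at the start, l is at new_l
    # each goto starts from the same c because `and` restores it
    return all(ch in region for ch in "aei")

print(setlimit_example("animadversion"))
```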
<H4>f) Backward processing</H4>
String commands have been described with <B><I>c</I></B> to the left of <B><I>l</I></B> and moving
right. But the process can be reversed.
<DT><TT>backwards C</TT>
<DD><B><I>c</I></B> and <B><I>l</I></B> are swapped over, and <B><I>c</I></B> moves left towards <B><I>l</I></B>. <TT>C</TT> is obeyed, the
signal given by <TT>C</TT> becomes the signal of <TT>backwards C</TT>, and <B><I>c</I></B> and <B><I>l</I></B> are
swapped back to their old values (except that <B><I>l</I></B> may have been adjusted
because of deletions and insertions). <TT>C</TT> cannot contain another
<TT>backwards</TT> command.
<DT><TT>reverse C</TT>
<DD>A similar idea, but here <B><I>c</I></B> simply moves left instead of moving right,
with the beginning of the string as the limit, <B><I>l</I></B>. <TT>C</TT> can contain other
<TT>reverse</TT> commands, but it cannot contain commands to do deletions or
insertions - it must be used for testing only. (Without this
restriction Snowball's semantics would become very untidy.)
Forward and backward processing are entirely symmetric, except that forward
processing is the default direction, and literal strings are always
written out forwards, even when they are being tested backwards. So the
following are equivalent,
'ani' 'mad' 'version' atlimit
'version' 'mad' 'ani' atlimit
If a routine is defined for backwards mode processing, it must be included
inside a <TT>backwardmode(...)</TT> declaration.
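The symmetry between the two directions can be sketched in Python: forward matching consumes literals from the left, backward matching consumes them from the right, and in both cases the literals themselves are written forwards. The two function names are hypothetical.

```python
def match_forwards(text, literals):
    c = 0                            # c starts at the left end
    for s in literals:
        if not text.startswith(s, c):
            return False
        c += len(s)
    return c == len(text)            # atlimit

def match_backwards(text, literals):
    c = len(text)                    # c starts at the right end
    for s in literals:
        if not text.endswith(s, 0, c):
            return False
        c -= len(s)
    return c == 0                    # atlimit, with l at the start

print(match_forwards("animadversion", ["ani", "mad", "version"]))
print(match_backwards("animadversion", ["version", "mad", "ani"]))
```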
<H4>g) <TT>substring among</TT></H4>
The use of <TT>substring among</TT> is central to the implementation of the
stemming algorithms. It is like a case switch on strings. In its simpler form,
substring among('S<SUB>1</SUB>' 'S<SUB>2</SUB>' 'S<SUB>3</SUB>' ...)
searches for the longest matching substring <TT>'S<SUB>1</SUB>'</TT> or <TT>'S<SUB>2</SUB>'</TT> or <TT>'S<SUB>3</SUB>'</TT> ... from
position <B><I>c</I></B>. (The <TT>'S<SUB>i</SUB>'</TT> must all be different.) So this has the same effect as
('S<SUB>1</SUB>' or 'S<SUB>2</SUB>' or 'S<SUB>3</SUB>' ...)
- so long as the <TT>'S<SUB>i</SUB>'</TT> are written out in decreasing order of length.
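The equivalence can be checked with a Python sketch: a longest-match search gives the same answer as trying the strings one after another in decreasing order of length. Both function names are hypothetical, and the limit l is ignored for brevity.

```python
def among(text, c, strings):
    """Return the longest string in strings that matches text at c, or None."""
    best = None
    for s in strings:
        if text.startswith(s, c) and (best is None or len(s) > len(best)):
            best = s
    return best

def ordered_or(text, c, strings):
    """The 'S1' or 'S2' or ... form: first match wins, longest tried first."""
    for s in sorted(strings, key=len, reverse=True):
        if text.startswith(s, c):
            return s
    return None

print(among("animadversion", 0, ["a", "an", "anim", "animus"]))
```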
<TT>substring</TT> may be omitted, in which case it is attached to its following
<TT>among</TT>. So <TT>among(...)</TT>
without a preceding <TT>substring</TT> is equivalent to
(substring among(...))
<TT>substring</TT> may also be detached from its <TT>among</TT>, although it must
precede it textually in the same routine in which the <TT>among</TT> appears.
The general form of <TT>substring ... among</TT> is,
among( 'S<SUB>11</SUB>' 'S<SUB>12</SUB>' ... (C<SUB>1</SUB>)
'S<SUB>21</SUB>' 'S<SUB>22</SUB>' ... (C<SUB>2</SUB>)
'S<SUB>n1</SUB>' 'S<SUB>n2</SUB>' ... (C<SUB>n</SUB>)
Obeying <TT>substring</TT> searches for a longest match among the <TT>'S<SUB>ij</SUB>'</TT>. The
signal from <TT>substring</TT> is <B><I>t</I></B> if a match is found, otherwise <B><I>f</I></B>. When the
<TT>among</TT> comes to be obeyed, the <TT>C<SUB>i</SUB></TT> corresponding to the matched <TT>'S<SUB>ij</SUB>'</TT> is
obeyed, and its signal becomes the signal of the <TT>among</TT> command.
<TT>substring/among</TT> pairs must match up textually inside each routine
definition. But there is no problem with an <TT>among</TT> containing other
<TT>substring/among</TT> pairs, and <TT>substring</TT> is optional before <TT>among</TT> anyway.
The essential constraint is that two <TT>substring</TT>s must be separated by an
<TT>among</TT>, and each <TT>substring</TT> must be followed by an <TT>among</TT>.
The effect of obeying <TT>among</TT> when the preceding <TT>substring</TT> is not obeyed
is undefined. This would happen for example here,
try($x != 617 substring)
among(...) // substring is bypassed in the exceptional case where x == 617
The significance of separating the <TT>substring</TT> from the <TT>among</TT> is to allow
them to work in different contexts. For example,
setlimit tomark L for substring
among( 'S<SUB>11</SUB>' 'S<SUB>12</SUB>' ... (C<SUB>1</SUB>)
'S<SUB>n1</SUB>' 'S<SUB>n2</SUB>' ... (C<SUB>n</SUB>)
Here the test for the longest <TT>'S<SUB>ij</SUB>'</TT> is constrained to the region between <B><I>c</I></B>
and the mark point given by integer <TT>L</TT>. But the commands <TT>C<SUB>i</SUB></TT> operate outside
this limit. Another example is
among( 'S<SUB>11</SUB>' 'S<SUB>12</SUB>' ... (C<SUB>1</SUB>)
'S<SUB>n1</SUB>' 'S<SUB>n2</SUB>' ... (C<SUB>n</SUB>)
The substring test is in the opposite direction in the string to the
direction of the commands <TT>C<SUB>i</SUB></TT>.
The last <TT>(C<SUB>n</SUB>)</TT> may be omitted, in which case <TT>(true)</TT> is assumed.
Another possible abbreviation is that when <TT>substring</TT> is omitted, a
among( 'S<SUB>11</SUB>' 'S<SUB>12</SUB>' ... (C C<SUB>1</SUB>)
'S<SUB>21</SUB>' 'S<SUB>22</SUB>' ... (C C<SUB>2</SUB>)
'S<SUB>n1</SUB>' 'S<SUB>n2</SUB>' ... (C C<SUB>n</SUB>)
'S<SUB>11</SUB>' 'S<SUB>12</SUB>' ... (C<SUB>1</SUB>)
'S<SUB>21</SUB>' 'S<SUB>22</SUB>' ... (C<SUB>2</SUB>)
'S<SUB>n1</SUB>' 'S<SUB>n2</SUB>' ... (C<SUB>n</SUB>)
and this is just equivalent to
among( 'S<SUB>11</SUB>' 'S<SUB>12</SUB>' ... (C<SUB>1</SUB>)
'S<SUB>21</SUB>' 'S<SUB>22</SUB>' ... (C<SUB>2</SUB>)
'S<SUB>n1</SUB>' 'S<SUB>n2</SUB>' ... (C<SUB>n</SUB>)
<BR> <H2>Booleans</H2>
<TT>set B</TT> and <TT>unset B</TT> set <TT>B</TT> to true and false respectively, where <TT>B</TT> is a
boolean name. <TT>B</TT> as a command gives a signal <B><I>t</I></B> if it is set true, <B><I>f</I></B>
otherwise. For example,
booleans ( Y_found ) // declare the boolean
unset Y_found // unset it
do ( ['y'] <-'Y' set Y_found )
/* if c:l begins 'y' replace it by 'Y' and set Y_found */
do repeat(goto (v ['y']) <-'Y' set Y_found)
/* repeatedly move down the string looking for v 'y' and
replacing 'y' with 'Y'. Whenever the replacement takes
place set Y_found. v is a test for a vowel, defined as
a grouping (see below). */
/* Y_found means there are some letters Y in the string.
Later we can use this to trigger a conversion back to
do (Y_found repeat(goto (['Y']) <- 'y'))
<BR> <H2>Groupings</H2>
A grouping brings characters together and enables them to be looked for
If <TT>G</TT> is declared as a grouping, it can be defined by
define G G<SUB>1</SUB> <I>op</I> G<SUB>2</SUB> <I>op</I> G<SUB>3</SUB> ...
where <I>op</I> is <TT>+</TT> or <TT>-</TT>, and <TT>G<SUB>1</SUB></TT>, <TT>G<SUB>2</SUB></TT>, <TT>G<SUB>3</SUB></TT> are literal strings, or groupings that
have already been defined. (There can be zero or more of these additional
<I>op</I> components). For example,
define capital_letter 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
define small_letter 'abcdefghijklmnopqrstuvwxyz'
define letter capital_letter + small_letter
define vowel 'aeiou' + 'AEIOU'
define consonant letter - vowel
define digit '0123456789'
define alphanumeric letter + digit
Once <TT>G</TT> is defined, it can be used as a command, and is equivalent to a test
'ch1' or 'ch2' or ...
where <TT>ch1</TT>, <TT>ch2</TT> ... list all the characters in the grouping.
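Groupings behave much like sets of characters, which the following Python sketch makes explicit: + becomes set union and - becomes set difference. The function match_grouping is a hypothetical helper that returns the signal together with the new cursor position.

```python
capital_letter = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
small_letter = set("abcdefghijklmnopqrstuvwxyz")
letter = capital_letter | small_letter       # letter = capital + small
vowel = set("aeiou") | set("AEIOU")
consonant = letter - vowel                   # consonant = letter - vowel

def match_grouping(text, c, grouping):
    """Test a grouping at c: return (signal, new c)."""
    if c < len(text) and text[c] in grouping:
        return True, c + 1                   # one character is consumed
    return False, c

print(match_grouping("animadversion", 0, vowel))
```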
<TT>non G</TT> is the converse test, and matches any character except the
characters of <TT>G</TT>. Note that <TT>non G</TT> is not the same as <TT>not G</TT>, in fact
non G is equivalent to (not G next)
<TT>non</TT> may be optionally followed by hyphen, so one may write
<BR> <H2>A Snowball program</H2>
A complete program consists of a sequence of declarations followed by a
sequence of definitions of groupings and routines. Routines which are
implicitly defined as operating on <B><I>c:l</I></B> from right to left must be included
in a <TT>backwardmode(...)</TT> declaration.
A Snowball program is called up via a simple
<A HREF="../q/use.html">API</A>
through its externals. For example,
externals ( stem1 stem2 )
define stem1 as ( ... /* stem1 commands */ )
define stem2 as ( ... /* stem2 commands */ )
The API also allows a current string to be defined, and this becomes the
<B><I>c:l</I></B> string for the external routine to work on. Its final value is the
result handed back through the API.
The strings, integers and booleans are accessible from any point in the
program, and exist throughout the running of the Snowball program. They are
therefore like static declarations in C.
<BR> <H2>Comments, and other whitespace fillers</H2>
At a deeper level, a program is a sequence of <I>tokens</I>, interspersed with
whitespace. Names, reserved words, literal numbers and strings are all
tokens. Various symbols, made up of non-alphanumerics, are also tokens.
A name, reserved word or number is terminated by the first character that
cannot form part of it. A symbol is recognised as the longest sequence of
characters that forms a valid symbol. So <TT>+=-</TT> is two symbols, <TT>+=</TT> and
<TT>-</TT>, because <TT>+=</TT> is a valid symbol in the language while <TT>+=-</TT> is not.
Whitespace separates tokens but is otherwise ignored. This of course is
Anywhere that whitespace can occur, there may also occur:
(a) Comments, in the usual multi-line <TT>/* .... */</TT> or single line
<TT>// ...</TT> format.
(b) Get directives. These are like <TT>#include</TT> commands in C, and have the form
<TT>get 'S'</TT>, where <TT>'S'</TT> is a literal string. For example,
get '/home/martin/snowball/main-hdr' // include the file contents
(c) <TT>stringescapes XY</TT> where <TT>X</TT> and <TT>Y</TT> are any two printing characters.
(d) <TT>stringdef m 'S'</TT> where <TT>m</TT> is a sequence of characters not including
whitespace and terminated with whitespace, and <TT>'S'</TT> is a literal string.
This completes the definition of Snowball.
<TR><TD BGCOLOR="#e0e0FC">
<BR> <H2>Snowball syntax</H2>
<BR> <H2>Appendix 1 - Snowball syntax</H2>
<TT>||</TT> is used for alternatives, <TT>[<I>X</I>]</TT> means that <I>X</I> is optional, and <TT>[<I>X</I>]*</TT>
means that <I>X</I> is repeated zero or more times. Meta-symbols are defined on
the left. <TT><char></TT> means any character.
The definition of literal strings does not allow for the escaping
conventions. The command <TT>?</TT> is a debugging aid.
<letter> ::= a || b || ... || z || A || B || ... || Z
<digit> ::= 0 || 1 || ... || 9
<name> ::= <letter> [ <letter> || <digit> || _ ]*
<s_name> ::= <name>
<i_name> ::= <name>
<b_name> ::= <name>
<r_name> ::= <name>
<g_name> ::= <name>
<hexdigit> ::= <digit> || a || b || c || d || e || f ||
A || B || C || D || E || F
<literal string>::= '[<char>]*' || hex '[<hexdigit>]*'
<number> ::= <digit> [ <digit> ]*
S ::= <s_name> || <literal string>
G ::= <g_name> || <literal string>
<declaration> ::= strings ( [<s_name>]* ) ||
integers ( [<i_name>]* ) ||
booleans ( [<b_name>]* ) ||
routines ( [<r_name>]* ) ||
externals ( [<r_name>]* ) ||
groupings ( [<g_name>]* )
<r_definition> ::= define <r_name> as C
<g_definition> ::= G || <g_definition> + G || <g_definition> - G
AE + AE || AE - AE || AE * AE || AE / AE || - AE ||
maxint || minint || cursor || limit || size ||
sizeof <s_name> || <i_name> || <number>
<i_command> ::= $ <i_name> = AE ||
$ <i_name> += AE || $ <i_name> -= AE ||
$ <i_name> *= AE || $ <i_name> /= AE ||
$ <i_name> == AE || $ <i_name> != AE ||
$ <i_name> > AE || $ <i_name> >= AE ||
$ <i_name> < AE || $ <i_name> <= AE ||
<s_command> ::= $ <s_name> C
<i_command> || <s_command> || C or C || C and C ||
not C || test C || try C || do C || fail C ||
goto C || gopast C || repeat C || loop AE C ||
atleast AE C || S || = S || insert S || attach S ||
<- S || delete || hop AE || next ||
=> <s_name> || [ || ] || -> <s_name> ||
setmark <i_name> || tomark AE || atmark AE ||
tolimit || atlimit || setlimit C for C ||
backwards C || reverse C ||
substring || among ( [<literal string> || (C)]* ) ||
set <b_name> || unset <b_name> || <b_name> ||
<r_name> || <g_name> || non [-] <g_name> ||
P ::= [P]* || <declaration> ||
<r_definition> || <g_definition> ||
synonyms: <+ for insert