The stemming algorithm
Letters in Romanian include the following accented forms:
-
ă â î ş ţ
The following letters are vowels:
-
a ă â e i î o u
RV and R1 are first set up in the standard way, but then they are adjusted so that the
region before them contains at least 3 letters.
RV is the region after the first vowel, or the end of the word
if it contains no vowel.
R1 is the region after the first non-vowel following a vowel, or the end of
the word if there is no such non-vowel.
(See note on R1 and R2.)
For example:
c o n v i n g ă t o r
|<--- RV -->|
|<--- R1 --->|
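The region definitions above can be sketched in Python as follows. This is a minimal illustration, not the reference implementation: the function name `mark_regions` and the returned offsets `pV` and `p1` are borrowed from the Snowball listing further down, and the handling of words shorter than 3 letters (both regions empty) follows that listing rather than the prose.

```python
VOWELS = set("aăâeiîou")

def mark_regions(word):
    """Return (pV, p1), the offsets at which RV and R1 start."""
    n = len(word)
    pV = p1 = n                      # default: both regions are empty

    # RV starts after the first vowel.
    i = 0
    while i < n and word[i] not in VOWELS:
        i += 1
    if i < n:
        pV = i + 1
        # R1 starts after the first non-vowel that follows that vowel.
        j = pV
        while j < n and word[j] in VOWELS:
            j += 1
        if j < n:
            p1 = j + 1

    # Adjust so that at least 3 letters precede each region
    # (assumption: for shorter words both regions stay empty).
    floor = min(3, n)
    return max(pV, floor), max(p1, floor)
```

The regions themselves are then `word[pV:]` and `word[p1:]`.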
Words are assumed to be in lower case.
Also, words that contain hyphens (caută-mă) must be separated into the constituent words (caută, mă).
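As a small sketch of this preprocessing, assuming the input is a single token that may contain hyphens (the helper name `split_for_stemming` is made up for illustration):

```python
def split_for_stemming(token):
    """Lower-case a token and split it at hyphens into constituent words."""
    return [part for part in token.lower().split("-") if part]
```

Each word in the returned list would then be stemmed separately.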
Always do step 1.
Step 1: Verb suffixes in the non-personal moods
-
Search for the longest among the following suffixes in R1, and if found, delete.
-
are ere ire âre
at ut s t it ât
ind ând
indu ându
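A possible Python rendering of step 1 is sketched below. It assumes the R1 offset `p1` has already been computed, and it follows the Snowball listing in matching the longest suffix first and only then checking that it lies in R1 (no fallback to a shorter suffix). The function name and the `(word, removed)` return convention are illustrative choices, not part of the original description.

```python
STEP1_SUFFIXES = (
    "are", "ere", "ire", "âre",           # infinitive
    "at", "ut", "s", "t", "it", "ât",     # participle
    "ind", "ând", "indu", "ându",         # gerund
)

def step1(word, p1):
    """Return (new_word, removed) after trying the step 1 suffixes."""
    # Try longer suffixes first so the longest match wins.
    for suffix in sorted(STEP1_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            start = len(word) - len(suffix)
            if start >= p1:               # suffix lies entirely in R1
                return word[:start], True
            return word, False            # longest match is outside R1
    return word, False
```

For instance, with `p1 = 3` the word cântare loses its infinitive ending are.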
Do step 2 if no ending was removed by step 1.
Step 2: Verb suffixes in the tenses of the indicative mood
-
Search for the longest among the following suffixes, and perform the action indicated.
-
ez ezi ează esc eşti eşte
ăsc ăşti ăşte
am ai au eam eai ea eau iam
iai ia iau eaţi iaţi
âi
ase use ise âse
aşi arăm arăţi ară
uşi urăm urăţi ură
işi irăm irăţi iră
âşi ârăm ârăţi âră
asem aseşi aserăm aserăţi aseră
usem useşi userăm userăţi useră
isem iseşi iserăm iserăţi iseră
âsem âseşi âserăm âserăţi
âseră
-
if preceded by a consonant or u, delete if in R1
(the preceding consonant or u need not be in R1)
-
ăm em im âm aţi eţi
iţi âţi sei se sese seşi
serăm serăţi seră sesem seseşi
seserăm seserăţi seseră
-
delete if in R1
-
ui
-
if preceded by a consonant other than l, delete if in R1
(the preceding consonant need not be in R1)
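The three suffix groups of step 2 and their conditions can be sketched in Python like this. It is an illustration under assumptions: `p1` is the R1 offset computed earlier, the search for the longest suffix is confined to R1 as in the Snowball listing, and the second group lists sei and se separately, also as in that listing. The names `GROUP_A`, `GROUP_B`, `GROUP_C` and `step2` are invented for the sketch.

```python
VOWELS = set("aăâeiîou")

# Deleted only when preceded by a consonant or u.
GROUP_A = """ez ezi ează esc eşti eşte ăsc ăşti ăşte
am ai au eam eai ea eau iam iai ia iau eaţi iaţi âi
ase use ise âse
aşi arăm arăţi ară uşi urăm urăţi ură
işi irăm irăţi iră âşi ârăm ârăţi âră
asem aseşi aserăm aserăţi aseră usem useşi userăm userăţi useră
isem iseşi iserăm iserăţi iseră âsem âseşi âserăm âserăţi âseră""".split()

# Deleted unconditionally.
GROUP_B = """ăm em im âm aţi eţi iţi âţi sei se sese seşi
serăm serăţi seră sesem seseşi seserăm seserăţi seseră""".split()

# Deleted only when preceded by a consonant other than l.
GROUP_C = ["ui"]

def step2(word, p1):
    """Return (new_word, removed) after trying the step 2 suffixes."""
    r1 = word[p1:]
    candidates = [s for s in GROUP_A + GROUP_B + GROUP_C if r1.endswith(s)]
    if not candidates:
        return word, False
    suffix = max(candidates, key=len)          # longest suffix inside R1
    stem = word[: len(word) - len(suffix)]
    prev = stem[-1:]                           # may lie outside R1
    if suffix in GROUP_B:
        ok = True
    elif suffix in GROUP_A:
        ok = prev != "" and (prev not in VOWELS or prev == "u")
    else:
        ok = prev != "" and prev not in VOWELS and prev != "l"
    return (stem, True) if ok else (word, False)
```

The check on l keeps the article ending lui (step 3) from being consumed here as ui.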
Do each of steps 3 and 4 if no ending was removed by steps 1 or 2.
Step 3: Article suffixes (for noun and adjective)
-
Search for the longest among the following suffixes, and perform the action indicated.
-
ul l a ua ia eaua lui
lor o ule
-
delete if in RV
-
le
-
if preceded by a vowel, delete if in RV
(the preceding vowel need not be in RV)
-
i
-
if not preceded by ur, delete if in RV
(the preceding ur need not be in RV)
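Step 3 might be sketched in Python as below, assuming `pV` is the RV offset computed earlier and that the longest suffix is searched for inside RV, as the Snowball listing does. The function name and return convention are illustrative only.

```python
VOWELS = set("aăâeiîou")
UNCONDITIONAL = ["ul", "l", "a", "ua", "ia", "eaua", "lui", "lor", "o", "ule"]

def step3(word, pV):
    """Return (new_word, removed) after trying the article suffixes."""
    rv = word[pV:]
    candidates = [s for s in UNCONDITIONAL + ["le", "i"] if rv.endswith(s)]
    if not candidates:
        return word, False
    suffix = max(candidates, key=len)          # longest suffix inside RV
    stem = word[: len(word) - len(suffix)]
    if suffix == "le":
        ok = stem[-1:] in VOWELS               # must follow a vowel
    elif suffix == "i":
        ok = not stem.endswith("ur")           # leave plural "uri" to step 4
    else:
        ok = True
    return (stem, True) if ok else (word, False)
```

With `pV = 3`, omului loses lui, while timpuri is left untouched so that step 4 can remove uri.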
Step 4: Plural suffixes (for noun and adjective)
-
Search for the longest among the following suffixes, and perform the action indicated.
-
i uri e
-
delete if in RV
-
le
-
if preceded by a vowel, delete if in RV
(the preceding vowel need not be in RV)
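Step 4 has the same shape as step 3 with a shorter suffix list. The sketch below makes the same assumptions (a precomputed RV offset `pV`, longest match confined to RV) and uses hypothetical names. When le is the longest match but does not follow a vowel, nothing is removed; there is no fallback to the shorter suffix e.

```python
VOWELS = set("aăâeiîou")

def step4(word, pV):
    """Return (new_word, removed) after trying the plural suffixes."""
    rv = word[pV:]
    candidates = [s for s in ("i", "uri", "e", "le") if rv.endswith(s)]
    if not candidates:
        return word, False
    suffix = max(candidates, key=len)          # longest suffix inside RV
    stem = word[: len(word) - len(suffix)]
    if suffix == "le" and stem[-1:] not in VOWELS:
        return word, False                     # "le" must follow a vowel
    return stem, True
```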
Always do step 5.
Step 5: Residual suffixes
-
Search for the longest among the following suffixes in RV, and if found, delete.
-
a e i u î ă â
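The final step and the order in which the steps fire can be summarised in a short Python sketch. It is only an outline: `step5` assumes a precomputed RV offset `pV`, and `run_steps` takes steps 1 to 4 as caller-supplied functions with the hypothetical signature `(word, offset) -> (new_word, removed)`, so the block stays self-contained.

```python
RESIDUAL = set("aeiuîăâ")

def step5(word, pV):
    """Delete one residual final letter if it lies in RV."""
    if word and word[-1] in RESIDUAL and len(word) - 1 >= pV:
        return word[:-1]
    return word

def run_steps(word, pV, p1, step1, step2, step3, step4):
    """Apply the five steps in the order the text prescribes."""
    word, removed = step1(word, p1)            # always do step 1
    if not removed:
        word, removed = step2(word, p1)        # only if step 1 removed nothing
    if not removed:
        word, _ = step3(word, pV)              # steps 3 and 4 are both tried
        word, _ = step4(word, pV)
    return step5(word, pV)                     # always do step 5
```

Note that `pV` and `p1` are computed once, before step 1, and are not recomputed as suffixes are removed.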
The same algorithm in Snowball
-
stringescapes {}
/* special characters (Unicode) */
stringdef ab hex '103' // a-breve
stringdef a^ hex 'E2' // a-circumflex
stringdef i^ hex 'EE' // i-circumflex
stringdef s, hex '15F' // s-cedilla
stringdef t, hex '163' // t-cedilla
routines ( mark_regions RV R1
           verb_non_personal_moods
           verb_conjugation
           definite_article
           number_plural
           residual_suffix
)
externals ( stem )
integers ( pV p1 x )
groupings ( v )
define v 'aeiou{ab}{a^}{i^}'
define mark_regions as (
    $pV = limit
    $p1 = limit
    test(hop 3 setmark x)
    gopast v setmark pV
    try($pV < x $pV = x) // at least 3
    gopast non-v setmark p1
    try($p1 < x $p1 = x) // at least 3
)
backwardmode (
    define RV as $pV <= cursor
    define R1 as $p1 <= cursor
    define verb_non_personal_moods as (
        [substring] R1 among (
            'are' 'ere' 'ire' '{a^}re' //infinitive
            'at' 'ut' 's' 't' 'it' '{a^}t' //participle
            'ind' '{a^}nd' //gerund
            'indu' '{a^}ndu'
                (delete)
        )
    )
    define verb_conjugation as (
        setlimit tomark p1 for ([substring])
        among (
            'ez' 'ezi' 'eaz{ab}' 'esc' 'e{s,}ti'
            'e{s,}te' '{ab}sc' '{ab}{s,}ti' '{ab}{s,}te' //present
            'am' 'ai' 'au' 'eam' 'eai' 'ea' 'eau'
            'iam' 'iai' 'ia' 'iau' 'ea{t,}i' 'ia{t,}i' //imperfect
            '{a^}i'
            'a{s,}i' 'ar{ab}m' 'ar{ab}{t,}i' 'ar{ab}'
            'u{s,}i' 'ur{ab}m' 'ur{ab}{t,}i' 'ur{ab}'
            'i{s,}i' 'ir{ab}m' 'ir{ab}{t,}i' 'ir{ab}'
            '{a^}{s,}i' '{a^}r{ab}m' '{a^}r{ab}{t,}i'
            '{a^}r{ab}' //simple perfect
            'ase' 'use' 'ise' '{a^}se'
            'asem' 'ase{s,}i' 'aser{ab}m' 'aser{ab}{t,}i' 'aser{ab}'
            'usem' 'use{s,}i' 'user{ab}m' 'user{ab}{t,}i' 'user{ab}'
            'isem' 'ise{s,}i' 'iser{ab}m' 'iser{ab}{t,}i' 'iser{ab}'
            '{a^}sem' '{a^}se{s,}i' '{a^}ser{ab}m' '{a^}ser{ab}{t,}i'
            '{a^}ser{ab}' //pluperfect
                (test (non-v or 'u') delete)
            '{ab}m' 'em' 'im' '{a^}m' 'a{t,}i' 'e{t,}i'
            'i{t,}i' '{a^}{t,}i' //present
            'se{s,}i' 'ser{ab}m' 'ser{ab}{t,}i' 'ser{ab}'
            'sei' 'se' //simple perfect
            'sesem' 'sese{s,}i' 'seser{ab}m' 'seser{ab}{t,}i' 'seser{ab}'
            'sese' //pluperfect
                (delete)
            'ui' //simple perfect
                (test non-v not 'l' delete) //not match 'lui' (article)
        )
    )
    define definite_article as (
        setlimit tomark pV for ([substring])
        among (
            'ul' 'l' 'a' 'ua' 'ia' 'eaua'
            'lui' 'lor' //genitive + dative
            'o' 'ule' //vocative
                (delete)
            'le'
                (test v delete)
            'i'
                (not 'ur' delete) //not match uri (plural)
        )
    )
    define number_plural as (
        setlimit tomark pV for ([substring])
        among (
            'i' 'uri' 'e'
                (delete)
            'le'
                (test v delete)
        )
    )
    define residual_suffix as (
        [substring] RV among (
            'a' 'e' 'i' 'u' '{i^}' '{ab}' '{a^}'
            // 'o'
                (delete)
        )
    )
)
define stem as (
    do mark_regions
    backwards (
        do (
            verb_non_personal_moods or verb_conjugation or
            ( try definite_article and number_plural )
        )
        do residual_suffix
    )
)