Romanian stemming algorithm


 

Links to resources

Snowball main page
The stemmer in Snowball (Unicode encoding)
The ANSI C stemmer
— and its header
Sample Romanian vocabulary
Its stemmed equivalent
Vocabulary + stemmed equivalent
Tar-gzipped file of all of the above

Romanian stop word list


The sample Romanian vocabulary, its stemmed equivalent and the stop word list should be viewable in your browser if you select the UTF-8 character set. You can get to this in Microsoft's Internet Explorer via View/Encoding/More, and in Netscape via View/Character Set.



Here is a sample of Romanian vocabulary, with the stemmed forms that will be generated with this algorithm.

word              stem          word              stem
abandon       =>  abandon       sa            =>  sa
abandona      =>  abandon       şa            =>  şa
abandonăm     =>  abandon
abandonare    =>  abandon       sabatic       =>  sabatic
abandonarea   =>  abandonar     sabia         =>  sab
abandonării   =>  abandonăr     sabie         =>  sab
abandonat     =>  abandon       săbii         =>  săb
abandonată    =>  abandonat     şablon        =>  şablon
abandonate    =>  abandonat     sabota        =>  sabot
abandonaţi    =>  abandon       sabotaj       =>  sabotaj
abandonează   =>  abandon       sabotajul     =>  sabotaj
abandoneze    =>  abandonez     sabotare      =>  sabot
abandonând    =>  abandonând    sabotarea     =>  sabotar
abandonul     =>  abandon       sabotat       =>  sabot
abandonuri    =>  abandon       saboteze      =>  sabotez
abandonurile  =>  abandon       sabotori      =>  sabotor
abanos        =>  abano         sac           =>  sac
abateri       =>  abater        şacali        =>  şacal
abaterile     =>  abater        sacerdoţi     =>  sacerdoţ
abaterilor    =>  abater        saci          =>  sac
abaţia        =>  abaţ          sacii         =>  sac
abaţie        =>  abaţ          sacoşe        =>  sacoş
abatoare      =>  abato         sacră         =>  sacr
abator        =>  abator        sacrală       =>  sacral
abătut        =>  abăt          sacralitate   =>  sacralitat
abdică        =>  abdic         sacralizare   =>  sacraliz
abdicare      =>  abdic         sacralizarea  =>  sacralizar
abdicarea     =>  abdicar       sacralizării  =>  sacralizăr
abdicări      =>  abdicăr       sacralizată   =>  sacralizat
abdicării     =>  abdicăr       sacramental   =>  sacrament
abdicările    =>  abdicăr       sacrific      =>  sacrific
abdice        =>  abdic         sacrifica     =>  sacrific
abdominale    =>  abdomin       sacrificăm    =>  sacrific
abecedarul    =>  abecedar      sacrificarea  =>  sacrificar
abera         =>  aber          sacrificării  =>  sacrificăr
aberant       =>  aberan        sacrificat    =>  sacrific
aberantă      =>  aberant       sacrificată   =>  sacrificat
aberante      =>  aberant       sacrificate   =>  sacrificat
aberantele    =>  aberant       sacrificaţi   =>  sacrific
aberantului   =>  aberant       sacrifice     =>  sacrific



 

The stemming algorithm

Letters in Romanian include the following accented forms:
ă   â   î   ş   ţ
The following letters are vowels:
a   ă   â   e   i   î   o   u
RV is the region after the first vowel, or the end of the word if it contains no vowel. R1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel. (See note on R1 and R2.) Both regions are first set up in this standard way and are then adjusted, where necessary, so that the region before each of them contains at least 3 letters.

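For example, in convingător the first vowel is the o, so RV would ordinarily begin after co. The adjustment moves it forward until three letters precede it, so RV and R1 both begin after con:
   c o n v i n g ă t o r
        |<---   RV  --->|
        |<---   R1  --->|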
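
The region set-up, with the three-letter adjustment applied as in the mark_regions routine of the Snowball listing, can be sketched in Python as follows. This is a minimal sketch and not the reference implementation: the name mark_regions is borrowed from the listing, returning the two offsets as a pair is an assumption made here, and the input is assumed to be lower-case text in precomposed (NFC) form.

```python
VOWELS = set("aăâeiîou")


def mark_regions(word):
    """Return (pV, p1): the offsets at which RV and R1 begin."""
    n = len(word)
    if n < 3:
        return n, n                  # too short: both regions are empty
    i = 0
    while i < n and word[i] not in VOWELS:
        i += 1                       # skip to the first vowel
    if i == n:
        return n, n                  # no vowel at all
    pv = max(i + 1, 3)               # RV starts after the first vowel, but not before offset 3
    j = i + 1
    while j < n and word[j] in VOWELS:
        j += 1                       # skip to the first non-vowel after that vowel
    p1 = max(j + 1, 3) if j < n else n
    return pv, p1


print(mark_regions("convingător"))   # (3, 3)
print(mark_regions("strada"))        # (4, 5)
```

In "strada" three consonants precede the first vowel, so neither offset needs adjusting. In "convingător" RV would begin at offset 2 and is moved to 3.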
       
Words are assumed to be in lower case. Also, words that contain hyphens (caută-mă) must be separated into the constituent words (caută, mă).
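
A minimal sketch of this preparation step, assuming whitespace-separated input. The helper name split_tokens is hypothetical and is not part of the Snowball distribution.

```python
def split_tokens(text):
    """Lower-case the text and split hyphenated forms into their constituent words."""
    words = []
    for token in text.lower().split():
        # "caută-mă" becomes "caută" and "mă"; empty pieces are dropped
        words.extend(part for part in token.split("-") if part)
    return words


print(split_tokens("Caută-mă"))   # ['caută', 'mă']
```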


Always do step 1.

Step 1: Verb suffixes in the non-personal moods
Search for the longest among the following suffixes in R1, and if found, delete.

are   ere   ire   âre   at   ut   s   t   it   ât   ind   ând   indu   ându
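
Step 1 might be sketched in Python as below. The sketch follows the Snowball listing, which finds the longest matching suffix first and only then tests whether it lies in R1. The function name step1, the (word, removed) return value and the p1 offset argument are assumptions made for illustration.

```python
STEP1_SUFFIXES = ("are", "ere", "ire", "âre", "at", "ut", "s", "t",
                  "it", "ât", "ind", "ând", "indu", "ându")


def step1(word, p1):
    """Return (new_word, removed); p1 is the offset at which R1 starts."""
    matches = [s for s in STEP1_SUFFIXES if word.endswith(s)]
    if not matches:
        return word, False
    suffix = max(matches, key=len)      # longest match at the end of the word
    start = len(word) - len(suffix)
    if start < p1:                      # the suffix must lie wholly inside R1
        return word, False
    return word[:start], True


print(step1("abandonare", 3))   # ('abandon', True)
print(step1("abandona", 3))     # ('abandona', False)
```

The second element of the result tells the caller whether step 2 still has to be tried.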

Do step 2 if no ending was removed by step 1.

Step 2: Verb suffixes in the tenses of the indicative mood
Search for the longest among the following suffixes, and perform the action indicated.

ez   ezi   ează   esc   eşti   eşte   ăsc   ăşti   ăşte   am   ai   au   eam   eai   ea   eau   iam  iai   ia   iau   eaţi   iaţi   âi   ase   use   ise   âse   aşi   arăm   arăţi   ară   uşi   urăm   urăţi   ură   işi   irăm   irăţi   iră   âşi   ârăm   ârăţi   âră   asem   aseşi   aserăm   aserăţi   aseră   usem   useşi   userăm   userăţi   useră   isem   iseşi   iserăm   iserăţi   iseră   âsem   âseşi   âserăm   âserăţi   âseră
if preceded by a consonant or u, delete if in R1
(the preceding consonant or u need not be in R1)
ăm   em   im   âm   aţi   eţi   iţi   âţi   sei   se   sese   seşi   serăm   serăţi   seră   sesem   seseşi   seserăm   seserăţi   seseră
delete if in R1
ui
if preceded by a consonant other than l, delete if in R1
(the preceding consonant need not be in R1)
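
The three groups of step 2 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the names GROUP_A, GROUP_B and step2 are hypothetical, the simple perfect and pluperfect endings of the first group are generated from the thematic vowels a, u, i, â instead of being typed out, and 'sei' and 'se' are taken as two separate suffixes, as in the Snowball listing.

```python
VOWELS = set("aăâeiîou")

# Group A: deleted only when preceded by a consonant or u.
GROUP_A = (
    "ez ezi ează esc eşti eşte ăsc ăşti ăşte "
    "am ai au eam eai ea eau iam iai ia iau eaţi iaţi âi"
).split() + [
    theme + ending
    for theme in "auiâ"
    for ending in ("şi", "răm", "răţi", "ră",
                   "se", "sem", "seşi", "serăm", "serăţi", "seră")
]
# Group B: deleted unconditionally.
GROUP_B = (
    "ăm em im âm aţi eţi iţi âţi sei se seşi serăm serăţi seră "
    "sese sesem seseşi seserăm seserăţi seseră"
).split()


def step2(word, p1):
    """Return (new_word, removed); p1 is the offset at which R1 starts."""
    region = word[p1:]                       # the suffix search is confined to R1
    matches = [s for s in GROUP_A + GROUP_B + ["ui"] if region.endswith(s)]
    if not matches:
        return word, False
    suffix = max(matches, key=len)
    head = word[:len(word) - len(suffix)]
    prev = head[-1:]                         # preceding letter, need not be in R1
    consonant = prev != "" and prev not in VOWELS
    if suffix in GROUP_B:
        ok = True
    elif suffix == "ui":
        ok = consonant and prev != "l"       # avoids the article ending 'lui'
    else:
        ok = consonant or prev == "u"
    return (head, True) if ok else (word, False)


print(step2("abandonarea", 3))   # ('abandonar', True)
print(step2("aberantului", 3))   # ('aberantului', False)
```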

Do both step 3 and step 4, in that order, if no ending was removed by step 1 or step 2.

Step 3: Article suffixes (for noun and adjective)
Search for the longest among the following suffixes, and perform the action indicated.

ul   l   a   ua   ia   eaua   lui   lor   o   ule
delete if in RV
le
if preceded by a vowel, delete if in RV
(the preceding vowel need not be in RV)
i
if not preceded by ur, delete if in RV
(the preceding ur need not be in RV)

Step 4: Plural suffixes (for noun and adjective)
Search for the longest among the following suffixes, and perform the action indicated.

i   uri   e
delete if in RV
le
if preceded by a vowel, delete if in RV
(the preceding vowel need not be in RV)
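
Steps 3 and 4 run one after the other, so a word can lose an article and then a plural ending. A minimal Python sketch is given below. The function names and the pv offset argument are assumptions made for illustration, and the suffix search is confined to RV as in the Snowball listing.

```python
VOWELS = set("aăâeiîou")

STEP3_SUFFIXES = ("ul", "l", "a", "ua", "ia", "eaua", "lui", "lor", "o", "ule", "le", "i")
STEP4_SUFFIXES = ("i", "uri", "e", "le")


def longest_in_region(word, suffixes, start):
    """Longest listed suffix lying wholly at or after offset start, else ''."""
    region = word[start:]
    return max((s for s in suffixes if region.endswith(s)), key=len, default="")


def step3(word, pv):
    """Article suffixes; pv is the offset at which RV starts."""
    suffix = longest_in_region(word, STEP3_SUFFIXES, pv)
    if not suffix:
        return word
    head = word[:len(word) - len(suffix)]
    if suffix == "le" and head[-1:] not in VOWELS:
        return word                      # 'le' needs a preceding vowel
    if suffix == "i" and head.endswith("ur"):
        return word                      # leave 'uri' for the plural step
    return head


def step4(word, pv):
    """Plural suffixes; pv is the offset at which RV starts."""
    suffix = longest_in_region(word, STEP4_SUFFIXES, pv)
    if not suffix:
        return word
    head = word[:len(word) - len(suffix)]
    if suffix == "le" and head[-1:] not in VOWELS:
        return word
    return head


def steps_3_and_4(word, pv):
    return step4(step3(word, pv), pv)


print(step3("abandonurile", 3))          # abandonuri
print(steps_3_and_4("abandonurile", 3))  # abandon
```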

Always do step 5.

Step 5: Residual suffixes
Search for the longest among the following suffixes in RV, and if found, delete.

a   e   i   u   î   ă   â


 

The same algorithm in Snowball


stringescapes {}

/* special characters (Unicode) */
   
stringdef ab    hex '103'                                // a-breve
stringdef a^    hex 'E2'                                 // a-circumflex
stringdef i^    hex 'EE'                                 // i-circumflex
stringdef s,    hex '15F'                                // s-cedilla
stringdef t,    hex '163'                                // t-cedilla


routines ( mark_regions RV R1
           verb_non_personal_moods
           verb_conjugation
           definite_article
           number_plural
           residual_suffix
)

externals ( stem )

integers ( pV p1 x )

groupings ( v )

define v 'aeiou{ab}{a^}{i^}'


define mark_regions as (

    $pV = limit
    $p1 = limit

    test(hop 3 setmark x)
    
    gopast v  setmark pV 
    try($pV < x  $pV = x)                                // at least 3
    gopast non-v setmark p1
    try($p1 < x  $p1 = x)                                // at least 3
)


backwardmode (

    define RV as $pV <= cursor
    define R1 as $p1 <= cursor

    define verb_non_personal_moods as (
        [substring] R1 among (
            'are' 'ere' 'ire' '{a^}re'                   //infinitive
            'at' 'ut' 's' 't' 'it' '{a^}t'               //participle
            'ind' '{a^}nd'                               //gerund 
            'indu' '{a^}ndu'
                (delete)
        )
    )

    define verb_conjugation as (
        setlimit tomark p1 for ([substring])
        among (
            'ez' 'ezi' 'eaz{ab}' 'esc' 'e{s,}ti' 
            'e{s,}te' '{ab}sc' '{ab}{s,}ti' '{ab}{s,}te' //present
            'am' 'ai' 'au' 'eam' 'eai' 'ea' 'eau' 
            'iam' 'iai' 'ia' 'iau' 'ea{t,}i' 'ia{t,}i'   //imperfect  
            '{a^}i'
            'a{s,}i' 'ar{ab}m' 'ar{ab}{t,}i' 'ar{ab}'
            'u{s,}i' 'ur{ab}m' 'ur{ab}{t,}i' 'ur{ab}'
            'i{s,}i' 'ir{ab}m' 'ir{ab}{t,}i' 'ir{ab}'
            '{a^}{s,}i' '{a^}r{ab}m' '{a^}r{ab}{t,}i'
            '{a^}r{ab}'                                  //simple perfect
            'ase' 'use' 'ise' '{a^}se'
            'asem' 'ase{s,}i' 'aser{ab}m' 'aser{ab}{t,}i' 'aser{ab}'
            'usem' 'use{s,}i' 'user{ab}m' 'user{ab}{t,}i' 'user{ab}'
            'isem' 'ise{s,}i' 'iser{ab}m' 'iser{ab}{t,}i' 'iser{ab}'
            '{a^}sem' '{a^}se{s,}i' '{a^}ser{ab}m' '{a^}ser{ab}{t,}i'
            '{a^}ser{ab}'                                //pluperfect
                (test (non-v or 'u') delete)
            '{ab}m' 'em' 'im' '{a^}m' 'a{t,}i' 'e{t,}i' 
            'i{t,}i' '{a^}{t,}i'                         //present
            'se{s,}i' 'ser{ab}m' 'ser{ab}{t,}i' 'ser{ab}'
            'sei' 'se'                                   //simple perfect
            'sesem' 'sese{s,}i' 'seser{ab}m' 'seser{ab}{t,}i' 'seser{ab}'
            'sese'                                       //pluperfect            
                (delete)
            'ui'                                         //simple perfect
                (test non-v not 'l' delete)              //not match 'lui' (article)
        )
    )

    define definite_article as (
        setlimit tomark pV for ([substring])
        among (
            'ul' 'l' 'a' 'ua' 'ia' 'eaua'
            'lui' 'lor'                                  //genitive + dative
            'o' 'ule'                                    //vocative
                (delete)
            'le'
                (test v delete)
            'i'
                (not 'ur' delete)                        //not match uri (plural)
        )
    )

    define number_plural as (
        setlimit tomark pV for ([substring])
        among (
            'i' 'uri' 'e'
                (delete)
            'le'
                (test v delete)   
        )
    )

    define residual_suffix as (
        [substring] RV among (
            'a' 'e' 'i' 'u' '{i^}' '{ab}' '{a^}'
            // 'o' 
               (delete)
        )
    )
)

define stem as (
    do mark_regions
    backwards (
        do (
             verb_non_personal_moods or verb_conjugation or
             ( try definite_article and number_plural )
        )
        do residual_suffix
    )
)
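
For readers without a Snowball compiler, the control flow of the listing above can be approximated in plain Python. What follows is a sketch under stated assumptions, not the reference implementation: all helper names are hypothetical, the input is assumed to be a single lower-case word in precomposed (NFC) form using the cedilla letters ş and ţ, and where the prose and the listing could be read differently the listing is followed. It has been checked only against the sample pairs shown after the block.

```python
VOWELS = set("aăâeiîou")

# Suffix tables transcribed from the Snowball listing.
STEP1 = "are ere ire âre at ut s t it ât ind ând indu ându".split()
STEP2_A = (
    "ez ezi ează esc eşti eşte ăsc ăşti ăşte "
    "am ai au eam eai ea eau iam iai ia iau eaţi iaţi âi"
).split() + [
    theme + ending
    for theme in "auiâ"
    for ending in ("şi", "răm", "răţi", "ră",
                   "se", "sem", "seşi", "serăm", "serăţi", "seră")
]
STEP2_B = (
    "ăm em im âm aţi eţi iţi âţi sei se seşi serăm serăţi seră "
    "sese sesem seseşi seserăm seserăţi seseră"
).split()
STEP3 = "ul l a ua ia eaua lui lor o ule".split()
STEP4 = "i uri e".split()
STEP5 = "a e i u î ă â".split()


def mark_regions(word):
    """Return (pV, p1), the offsets at which RV and R1 start."""
    n = len(word)
    if n < 3:
        return n, n
    i = 0
    while i < n and word[i] not in VOWELS:
        i += 1
    if i == n:
        return n, n
    pv = max(i + 1, 3)
    j = i + 1
    while j < n and word[j] in VOWELS:
        j += 1
    return pv, (max(j + 1, 3) if j < n else n)


def longest(word, suffixes, start=0):
    """Longest listed suffix of word beginning at or after start, else ''."""
    hits = [s for s in suffixes
            if word.endswith(s) and len(word) - len(s) >= start]
    return max(hits, key=len, default="")


def stem(word):
    pv, p1 = mark_regions(word)

    def step1(w):                       # returns None when nothing is removed
        s = longest(w, STEP1)
        return w[:-len(s)] if s and len(w) - len(s) >= p1 else None

    def step2(w):                       # returns None when nothing is removed
        s = longest(w, STEP2_A + STEP2_B + ["ui"], p1)
        if not s:
            return None
        head = w[:-len(s)]
        prev = head[-1:]
        consonant = prev != "" and prev not in VOWELS
        if s in STEP2_B:
            return head
        if s == "ui":
            return head if consonant and prev != "l" else None
        return head if consonant or prev == "u" else None

    def step3(w):
        s = longest(w, STEP3 + ["le", "i"], pv)
        head = w[:-len(s)] if s else w
        if s == "le" and head[-1:] not in VOWELS:
            return w
        if s == "i" and head.endswith("ur"):
            return w
        return head

    def step4(w):
        s = longest(w, STEP4 + ["le"], pv)
        head = w[:-len(s)] if s else w
        if s == "le" and head[-1:] not in VOWELS:
            return w
        return head

    def step5(w):
        s = longest(w, STEP5)
        return w[:-len(s)] if s and len(w) - len(s) >= pv else w

    out = step1(word)
    if out is None:
        out = step2(word)
    if out is None:
        out = step4(step3(word))        # steps 3 and 4 are both attempted
    return step5(out)                   # step 5 always runs
```

For instance, stem("abandonurile") gives "abandon", stem("sacramental") gives "sacrament" and stem("abaţie") gives "abaţ", in agreement with the sample vocabulary.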