Romanian stemming algorithm


 

Links to resources

Snowball main page
The Romanian stemmer in Snowball (UTF-8 encoding)
The ANSI C stemmer
— and its header
Sample Romanian vocabulary
Its stemmed equivalent
Vocabulary + stemmed equivalent

Romanian stop word list




Here is a sample of Romanian vocabulary, with the stemmed forms that will be generated with this algorithm.

word            => stem          word         => stem
abandonat       => abandon       baftă        => baft
abreviative     => abrevi        baloane      => baloan
aburului        => abur          bandaj       => band
accelerații     => acceler       basm         => basm
accentuată      => accentu       bestiale     => best
accidentală     => accident      bețivancă    => bețivanc
accidentologie  => accidentolog  biciclică    => bicicl
aclinice        => aclin         bifazat      => bifaz
acordata        => acord         bilingvă     => bilingv
actualizabil    => actual        binecuvântez => binecuvânt
acupuncturist   => acupunctur    bipede       => biped
acuzare         => acuz          blamată      => blam
adaptativ       => adapt         blestemat    => blestem
administratoare => administr     blesteme     => blestem
adorabila       => ador          blond        => blond
adormitori      => adorm         bobine       => bobin
aeraj           => aer           boemă        => boem
afundat         => afund         bolșevism    => bolșev
ajunez          => ajun          bombați      => bomb
ajunși          => ajunș         bombăni      => bombăn
ajustabil       => ajust         bondoacă     => bondoac
ajutorare       => ajutor        boreala      => boreal
alarmez         => alarm         boxez        => box
albăstrită      => albăstr       breaza       => breaz
alegorizata     => alegoriz      britanica    => britan
alunecuș        => alunec        brodeza      => brodez
ambițioasă      => ambiț         brumării     => brum
ameliorativa    => amelior       brunată      => brun
amendabila      => amend         brândușică   => brânduș
amenințător     => ameninț       brădet       => brădet
amăgit          => amăg          brăzdui      => brăzd
amăgita         => amăg          brăzdătura   => brăzdătur
anexat          => anex          bucătării    => bucăt
aniversez       => anivers       bufnita      => bufn
aplecator       => aplec         bujor        => bujor
apologistă      => apolog        buletine     => buletin
apucați         => apuc          bunătăți     => bunătăț
arcași          => arc           buzna        => buzn
arendași        => arend         bâjbâitură   => bâjbâit
argintar        => argint        bârsană      => bârs
arhicunoscuta   => arhicunosc    bădădăi      => bădăd
armonia         => armon         băietan      => băiet
armonice        => armon         băieți       => băieț
bacteriform     => bacteriform   bălsămat     => bălsăm


 

The stemming algorithm



Romanian is a Romance language with a rich inflectional and derivational system. For example:
(1) determination of the noun:
the article is attached to the end of common nouns (enclitic), but precedes proper nouns:
masa - the table
Ii dau elevului o carte. - I give the pupil a book.
Ii dau lui Marius o carte. - I give Marius a book.
if the adjective precedes the noun, it takes the article of the noun:
frumosul copil0 - the beautiful child
(2) in traditional grammar the verb has 4 inflectional classes, determined by the infinitive suffix:
a mânca - eat
a bea - drink
a merge - walk, go
a iubi - love
a urî - hate
However, some grammars describe 12 inflectional classes of the Romanian verb.
(3) adjectives fall into four inflectional classes: those with 4, 3 or 2 endings, and invariable adjectives.

Inflected words can be as short as 3 letters (oul - the egg, ace - needles) or much longer (imparatescului - of/to the royal).

The Romanian derivational system is very complex and multi-layered. A word can take one derivational suffix or several, and so be derived twice or three times over. For example, a lexical paradigm of the word stabil is:
stabil - stable
stabili - stabilize
stabilit - established
stabilire - stabilisation
stabilibil - stabilizable
stabiliza - stabilize
stabilizat - stabilized
stabilizant - stabilizing
stabilizare - stabilization
stabilizator - stabilizing


However, this poses a problem for the algorithm: some letter sequences are homonymous, and can be either a suffix or part of the root itself, e.g.:
-al: cal - horse (part of the root), industrial - industrial (suffix)
-em: verbal inflectional suffix, 1st pers. pl. (facem - we do, mergem - we go), vs. sistem - system, barem - limit, where -em is part of the root
-esc: verbal inflectional suffix, 1st pers. sg. and 3rd pers. pl. (eu iubesc - I love, ei iubesc - they love), or derivational suffix forming adjectives and adverbs (imparatesc - royal/ly)


Our stemming program aims to remove all morphological suffixes and most derivational suffixes. To deal with the problem caused by homonymy, we create a list of exceptions. The list contains uninflected words, such as adverbs and conjunctions. Generally, these are stopwords: devreme - early, apoi - then, căci - as, deci - therefore. Only a few examples of each case are included in the list, and their number could be increased. As a further measure against homonymy, some words containing a sequence identical to a suffix are treated as exceptions to the usual setting of p1, the left-hand boundary of R1: coral - coral reef, moral - moral, marar - dill, declar - declare.
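The two kinds of exception can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: the names UNINFLECTED, R1_EXCEPTIONS, r1_start and suffix_in_r1 are hypothetical, the word lists contain just the examples quoted above, and placing p1 at the end of the listed word is one reading of how the Snowball source sets the mark.

```python
# Hypothetical sketch of the two exception mechanisms described above.
VOWELS = set("aeiouy\u0103\u00e2\u00ee")  # a e i o u y ă â î

# Uninflected words that are returned unchanged (examples from the text only).
UNINFLECTED = {"devreme", "apoi", "c\u0103ci", "deci"}

# Words whose final letters look like a suffix but belong to the root.
R1_EXCEPTIONS = ("coral", "moral", "marar", "declar")


def r1_start(word):
    """Return p1, the index where R1 begins."""
    # Assumption: for a listed word, p1 is placed at the end of that word,
    # so the suffix-like sequence inside it falls outside R1.
    for prefix in R1_EXCEPTIONS:
        if word.startswith(prefix):
            return len(prefix)
    # Standard rule: R1 starts after the first non-vowel following a vowel.
    n = len(word)
    i = 0
    while i < n and word[i] not in VOWELS:
        i += 1
    if i == n:
        return n
    i += 1
    while i < n and word[i] in VOWELS:
        i += 1
    return n if i == n else i + 1


def suffix_in_r1(word, suffix):
    """True if the word ends with the suffix and the suffix lies inside R1."""
    return word.endswith(suffix) and len(word) - len(suffix) >= r1_start(word)


print(suffix_in_r1("industrial", "al"))  # the suffix -al is removable here
print(suffix_in_r1("coral", "al"))       # protected by the exception list
```

Without the exception, the standard rule would put p1 at index 3 in coral, and the final -al would count as being in R1.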
 

Definition of the Romanian stemmer

Letters in Romanian include the following accented forms, shown with the string escapes used for them in the Snowball source: ă - {ab}, â - {a^}, î - {i^}, ș - {s,}, ț - {t,}. Define a vowel as one of
a   e   i   o   u   y   ă   â   î
We define the regions R1 and R2 as they are defined for English. (See note on R1 and R2.) Only words of more than two letters are stemmed.
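As a rough illustration, the region definitions can be sketched in Python. The function names next_mark and regions are hypothetical. The sketch applies only the standard rule (R1 starts after the first non-vowel that follows a vowel, and R2 is the same rule applied inside R1) and ignores the exception words.

```python
# Hypothetical sketch of the standard R1/R2 computation.
VOWELS = set("aeiouy\u0103\u00e2\u00ee")  # a e i o u y ă â î


def next_mark(word, start):
    """Index just past the first non-vowel that follows a vowel,
    scanning from `start`; len(word) if there is no such position."""
    n = len(word)
    i = start
    # Go past the first vowel.
    while i < n and word[i] not in VOWELS:
        i += 1
    if i == n:
        return n
    i += 1
    # Go past the first non-vowel after it.
    while i < n and word[i] in VOWELS:
        i += 1
    return n if i == n else i + 1


def regions(word):
    """Return (p1, p2): R1 is word[p1:], R2 is word[p2:]."""
    p1 = next_mark(word, 0)
    p2 = next_mark(word, p1)
    return p1, p2


for w in ("abandonat", "stabil", "oul"):
    p1, p2 = regions(w)
    print(w, "R1 =", repr(w[p1:]), "R2 =", repr(w[p2:]))
```

For abandonat this gives R1 = andonat and R2 = donat, while for the three-letter word oul both regions are empty.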

Remove morphological suffixes:
Search for the longest among the following suffixes and perform the action indicated:

ului uri urile urilor ul
delete
le
if preceded by a or o, replace with l (this fixes mis-stemming of words such as oale - pots); otherwise delete
lelor elor ilor elui lor ele ile ei i ii e a {ab}
delete
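The step above can be sketched in Python as follows. This is a minimal sketch that follows the suffix lists as written in this description, which differ slightly from the lists in the Snowball source. The function name is hypothetical, and the later steps of the stemmer are not applied.

```python
# Hypothetical sketch of the morphological-suffix step (no region conditions).
A_BREVE = "\u0103"  # ă, written {ab} in the suffix lists

DELETE_SUFFIXES = [
    "ului", "uri", "urile", "urilor", "ul",
    "lelor", "elor", "ilor", "elui", "lor", "ele", "ile",
    "ei", "i", "ii", "e", "a", A_BREVE,
]
ALL_SUFFIXES = DELETE_SUFFIXES + ["le"]


def remove_morphological_suffix(word):
    """Apply the action of the longest matching suffix, if any."""
    matches = [s for s in ALL_SUFFIXES if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    stem = word[: len(word) - len(suffix)]
    if suffix == "le" and stem.endswith(("a", "o")):
        return stem + "l"  # oale -> oal rather than oa
    return stem


for w in ("aburului", "oale", "masa", "buletine"):
    print(w, "->", remove_morphological_suffix(w))
```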
Remove verb suffixes:
Search for the longest among the following suffixes and perform the action indicated:

eaz{ab} eaza ezi ez z{ab}
if in R1 or R2 delete
esc e{s,}ti e{s,}te im i{t,}i
if in R1 or R2 delete
ai a{s,}i i{s,}i am {ab}m em au r{ab}m ea u
if in R1 or R2 delete
{t,}i
if preceded by a, e, i, u and in R1 delete
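A possible Python sketch of the verb-suffix step is shown below. It uses the suffix list from this description rather than the longer list in the Snowball source, and the names are hypothetical. Since R2 always lies inside R1, the condition "in R1 or R2" is checked simply as "in R1". The exception words are ignored when computing R1.

```python
# Hypothetical sketch of the verb-suffix step.
VOWELS = set("aeiouy\u0103\u00e2\u00ee")  # a e i o u y ă â î
A = "\u0103"  # ă  ({ab})
S = "\u015f"  # ş  ({s,}, hex 15F in the Snowball source)
T = "\u0163"  # ţ  ({t,}, hex 163 in the Snowball source)

PLAIN_SUFFIXES = [
    "eaz" + A, "eaza", "ezi", "ez", "z" + A,
    "esc", "e" + S + "ti", "e" + S + "te", "im", "i" + T + "i",
    "ai", "a" + S + "i", "i" + S + "i", "am", A + "m", "em", "au",
    "r" + A + "m", "ea", "u",
]
TI = T + "i"  # extra condition: must be preceded by a, e, i or u


def r1_start(word):
    """Standard rule: R1 starts after the first non-vowel following a vowel."""
    n = len(word)
    i = 0
    while i < n and word[i] not in VOWELS:
        i += 1
    if i == n:
        return n
    i += 1
    while i < n and word[i] in VOWELS:
        i += 1
    return n if i == n else i + 1


def remove_verb_suffix(word):
    matches = [s for s in PLAIN_SUFFIXES + [TI] if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)  # longest match decides the action
    cut = len(word) - len(suffix)
    if cut < r1_start(word):
        return word  # suffix is not inside R1
    if suffix == TI and (cut == 0 or word[cut - 1] not in "aeiu"):
        return word
    return word[:cut]


for w in ("alarmez", "facem", "beau", "apuca" + TI):
    print(w, "->", remove_verb_suffix(w))
```

In beau the ending -au is left alone because R1 is empty, which is the kind of protection the region conditions are meant to give short words.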
Remove derivational suffixes (this is done in two steps):
Search for the longest among the following suffixes and perform the action indicated:
(1):

ism, ist, i{s,}t
if in R1 or R2 and preceded by a consonant, delete
iz, ant, {ab}r, ar
if in R2 and preceded by a consonant, delete
tor, toar, abil, ibil
delete if in R1 or R2
ime, esc
delete
n{t,}
if in R2 replace with nt
(2):

{ab}r, ar
if in R1 or R2 and preceded by a consonant, delete
anie, icel, giu, eal, {ab}tat
if in R2 and preceded by a consonant, delete
ulte{t,}, u{t,}, uc, u{s,}, el, oi
if in R1, delete
ir, im, i{s,}, iz, iv, aj, an, ac
if in R1 or R2 and preceded by a consonant, delete
ic
if in R1 or R2 delete
er
if in R1 or R2 and preceded by consonant or i, delete
os, o{s,}, oas
if preceded by u, j, i, r or p in R1, delete
ant, ean, liv, al
if in R1 or R2, delete
{s,}or, {s,}oar
if in R1 or R2 and preceded by vowel, delete
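The rules of step (1) lend themselves to a table-driven implementation. The Python sketch below is one way to express them. The rule table, the names, and the reading of "in R1 or R2" as "in R1" (R2 lies inside R1) are assumptions of this sketch, and step (2) could be written the same way with its own table.

```python
# Hypothetical table-driven sketch of derivational step (1).
VOWELS = set("aeiouy\u0103\u00e2\u00ee")  # a e i o u y ă â î
A, S, T = "\u0103", "\u015f", "\u0163"    # ă ş ţ

# (suffixes, required region or None, must follow a consonant, replacement)
RULES = [
    (("ism", "ist", "i" + S + "t"), "R1", True, ""),
    (("iz", "ant", A + "r", "ar"), "R2", True, ""),
    (("tor", "toar", "abil", "ibil"), "R1", False, ""),
    (("ime", "esc"), None, False, ""),
    (("n" + T,), "R2", False, "nt"),
]


def next_mark(word, start):
    """Index just past the first non-vowel that follows a vowel."""
    n = len(word)
    i = start
    while i < n and word[i] not in VOWELS:
        i += 1
    if i == n:
        return n
    i += 1
    while i < n and word[i] in VOWELS:
        i += 1
    return n if i == n else i + 1


def deriv_step1(word):
    p1 = next_mark(word, 0)
    p2 = next_mark(word, p1)
    best = None  # rule entry of the longest matching suffix
    for suffixes, region, needs_consonant, replacement in RULES:
        for s in suffixes:
            if word.endswith(s) and (best is None or len(s) > len(best[0])):
                best = (s, region, needs_consonant, replacement)
    if best is None:
        return word
    suffix, region, needs_consonant, replacement = best
    cut = len(word) - len(suffix)
    if region == "R1" and cut < p1:
        return word
    if region == "R2" and cut < p2:
        return word
    if needs_consonant and (cut == 0 or word[cut - 1] in VOWELS):
        return word
    return word[:cut] + replacement


for w in ("acupuncturist", "ajustabil", "argintar", "dar"):
    print(w, "->", deriv_step1(w))
```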
Remove residual suffixes (vowels that appear through derivation, e.g. muncitor - worker, or participle suffixes):
Search for the longest among the following suffixes and perform the action indicated:

a, {ab}, e, u, i
if in R1 or R2 and preceded by consonant, delete
at, a{t,}, ut, u{t,}, it, i{t,}
if in R1 or R2 delete
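The residual step can be sketched in the same style. As before, this is a hedged Python sketch with hypothetical names. It checks "in R1 or R2" as "in R1", ignores the exception words, and is applied here to whole words rather than to the output of the earlier steps.

```python
# Hypothetical sketch of the residual-suffix step.
VOWELS = set("aeiouy\u0103\u00e2\u00ee")  # a e i o u y ă â î
A, T = "\u0103", "\u0163"                 # ă ţ

VOWEL_ENDINGS = ["a", A, "e", "u", "i"]   # need a preceding consonant
PARTICIPLE_ENDINGS = ["at", "a" + T, "ut", "u" + T, "it", "i" + T]


def r1_start(word):
    """Standard rule: R1 starts after the first non-vowel following a vowel."""
    n = len(word)
    i = 0
    while i < n and word[i] not in VOWELS:
        i += 1
    if i == n:
        return n
    i += 1
    while i < n and word[i] in VOWELS:
        i += 1
    return n if i == n else i + 1


def remove_residual_suffix(word):
    matches = [s for s in VOWEL_ENDINGS + PARTICIPLE_ENDINGS if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    cut = len(word) - len(suffix)
    if cut < r1_start(word):
        return word  # not inside R1
    if suffix in VOWEL_ENDINGS and (cut == 0 or word[cut - 1] in VOWELS):
        return word  # a final vowel is only removed after a consonant
    return word[:cut]


for w in ("abandonat", "masa", "armonia", "blond"):
    print(w, "->", remove_residual_suffix(w))
```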

 

The same algorithm in Snowball


/*
    Department of Computational Linguistics,
    Ruprecht-Karls-University of Heidelberg

    Marina Stegarescu: mstegare@hotmail.com
    Doina Gliga: doina_gliga@yahoo.co.uk
    Erwin Glockner: eglockner@hotmail.com

    2006.07.15
*/

routines (
    mark_regions
    R1 R2
    morhological_suffixes
    deriv_suffixes1
    deriv_suffixes2
    exception1
    verb_suffix
    residual_suffix
)

externals ( stem )

integers ( p1 p2 )

groupings ( v )

stringescapes {}

stringdef a^   hex 'E2'   // â
stringdef ab   hex '103'  // ă
stringdef i^   hex 'EE'   // î
stringdef s,   hex '15F'  // ş
stringdef t,   hex '163'  // ţ

define v 'aeiouy{a^}{ab}{i^}'

define mark_regions as (

    $p1 = limit
    $p2 = limit

    do (
        among (
            'coral' 'moral' 'social' 'canal' 'final' 'papagal' 'special'
            'tractor' 'abator' 'marar' 'declar' 'suf{ab}r' 'polonic' 'voinic'
            'paravan' 'simultan' 'decan' 'decal' 'tiran' 'caracter' 'tiner'
            'acoper' 'descoper' 'sufer' 'numer' 'orator' 'autor' 'exprim'
            'prim' 'ultim' 'optim' 'victim' 'antonim' 'sinonim' 'adjectiv'
            'conjunctiv' 'subjonctiv' 'substantiv' 'pozitiv' 'recidiv'
            'infinitiv' 'complet' 'absolut' 'debut' 'debit' 'miros' 'dantel'
            'nuvel' 'tutel' 'model' 'cercel' 'savant' 'ambulant' 'aparat'
            'ar{ab}t' 'specific' 'critic' 'oribil' 'probabil' 'bine' 'feroce'
            'atroce'
            // ... extensions possible here ...
        ) or (gopast v gopast non-v) setmark p1
        gopast v gopast non-v setmark p2
    )
)

define exception1 as (

    [substring] atlimit among (
        'cea'   (<- 'ce')
        'cel'   (<- 'ce')
        'cei'   (<- 'ce')
        'celui' (<- 'ce')
        'celei' (<- 'ce')
        'celor' (<- 'ce')
        'destul' 'astfel' 'altfel' 'asupra' 'deasupra' 'asemenea' 'afar{ab}'
        'mai' 'nici' 'aici' 'apoi' 'musai' 'baremi' 'uneori' 'altminteri'
        'deseori' 'numai' '{i^}nt{a^}i' 'p{a^}n{ab}' 'dup{ab}' 'noi' 'voi'
        'imi' 'i{t,}i' 'i{s,}i' 'cine' 'care' 'cui' 'ori' 'acest' 'pentru'
        'sau' 'c{ab}tre' 'despre' 'spre' 'dinspre' 'dintre' 'printre'
        '{i^}ntre' 'devreme' 'aproape' 'departe' 'bine' 'feroce' 'atroce'
        'exprim' 'prim' 'ultim' 'optim' 'victim' 'antonim' 'sinonim' 'fonem'
        'extrem' 'poem' 'suprem'
    )
)

backwardmode (

    define R1 as $p1 <= cursor
    define R2 as $p2 <= cursor

    define morhological_suffixes as (
        [substring] among (
            'ului' 'uri' 'urile' 'urilor' 'ul'
                (delete)
            'le'
                ( ('a' or 'o' <- 'l') or delete )
            'lui' 'lor' 'elor' 'ilor' 'ele' 'ile' 'ei' 'i' 'ii' 'e' 'a' '{ab}'
                (delete)
        )
    )

    define deriv_suffixes1 as (
        [substring] among (
            'ism' 'ist' 'i{s,}t'
                (R2 or R1 non-v delete)
            'iz' 'ant' '{ab}r' 'ar'
                (R2 non-v delete)
            'tor' 'toar' 'abil' 'ibil'
                (R1 or R2 delete)
            'ime' 'esc'
                (delete)
            'n{t,}'
                (R2 <- 'nt')
        )
    )

    define deriv_suffixes2 as (
        [substring] among (
            '{ab}r' 'ar'
                (R1 or R2 non-v delete)
            'anie' 'icel' 'giu' 'eal' '{ab}tat'
                (R2 non-v delete)
            'ulte{t,}' 'u{t,}' 'uc' 'u{s,}' 'el' 'oi'
                (R1 delete)
            'ir' 'im' 'i{s,}' 'iz' 'iv' 'aj' 'an' 'ac'
                (R1 or R2 non-v delete)
            'ic'
                (R2 or R1 delete)
            'er'
                (R1 or R2 non-v or 'i' delete)
            'os' 'o{s,}' 'oas'
                (('u' or 'i' or 'j' or 'r' or 'p') R1 delete)
            'ant' 'ean' 'liv' 'al'
                (R1 or R2 delete)
            '{s,}or' '{s,}oar'
                (R1 or R2 v delete)
        )
    )

    define verb_suffix as (
        [substring] among (
            'eaz{ab}' 'eaza' 'ezi' 'ez' 'z{ab}'
                (R1 or R2 delete)
            'esc' 'easc{ab}' 'e{s,}ti' 'e{s,}te' 'im' 'i{t,}i'
                (R1 or R2 delete)
            'ai' 'a{s,}i' 'i{s,}i' 'am' '{ab}m' 'em' 'au' 'r{ab}m' 'ea' 'u'
                (R1 or R2 delete)
            '{t,}i'
                (R1 'a' 'e' 'i' 'u' delete)
            'se' 'sei' 'se{s,}i' 'ser{ab}m' 'ser{ab}{t,}i' 'ser{ab}' 'r{ab}'
                (R1 or R2 delete)
            'ind' '{i^}nd' '{a^}nd'
                (R1 or R2 delete)
        )
    )

    define residual_suffix as (
        [substring] among (
            'a' '{ab}' 'e' 'u' 'i'
                (R1 or R2 non-v delete)
            'at' 'a{t,}' 'it' 'i{t,}' 'ut' 'u{t,}'
                (R2 or R1 delete)
        )
    )
)

define stem as (
    exception1 or
    not hop 3 or
    (
        do mark_regions
        backwards (
            do (verb_suffix or morhological_suffixes)
            do deriv_suffixes1
            do deriv_suffixes2
            residual_suffix
        )
    )
)