The stemming algorithm
Romanian is a Romance language with a rich inflectional and derivational system. For example:
(1) determination of the noun:
- the article is enclitic (attached to the end of the word) for common nouns, but it precedes proper nouns:
    masa - the table
    Ii dau elevului o carte. - I give the pupil a book.
    Ii dau lui Marius o carte. - I give Marius a book.
- if the adjective precedes the noun, it takes the article of the noun:
    frumosul copil - the beautiful child
(2) in traditional grammar the verb has 4 inflectional classes, established according to the infinitive suffix:
    a mânca - eat
    a bea - drink
    a merge - walk, go
    a iubi - love
    a urî - hate
However, some grammars speak of 12 or ... inflectional classes of the Romanian verb.
(3) adjectives fall into four inflectional classes: those with 4, 3 or 2 endings, and invariable adjectives.
Inflected words can be as short as 3 letters (oul - the egg, ace - needles) or much longer (imparatescului - of/to the royal).
The Romanian derivational system is very complex and multi-layered. A word can take one derivational suffix or several, and so be derived twice or three times. For example, the lexical paradigm of the word stabil is:
stabil - stable
stabili - establish
stabilit - established
stabilire - establishment
stabilibil - establishable
stabiliza - stabilize
stabilizat - stabilized
stabilizant - stabilizing
stabilizare - stabilization
stabilizator - stabilizing
This raises a problem for the algorithm: some letter sequences are ambiguous, and can be either a suffix or part of the root itself, e.g.:
    -al: part of the root in cal - horse, but a suffix in industrial - industrial
    -em: inflectional suffix of the verb, 1st pers. pl. (facem - we do, mergem - we go), but part of the root in sistem - system, barem - limit
    -esc: inflectional suffix of the verb, 1st pers. sg. and 3rd pers. pl. (eu iubesc - I love, ei iubesc - they love), or a derivational suffix forming adjectives and adverbs (imparatesc - royal/royally)
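The ambiguity is easy to reproduce with a deliberately naive stripper. The sketch below is illustrative only: `naive_strip` is a hypothetical helper, not part of the stemmer, and it removes -al, -em or -esc wherever they end a word, with no check on where the suffix sits.

```python
# Suffixes taken from the examples above; naive_strip is a hypothetical helper.
AMBIGUOUS_SUFFIXES = ("al", "em", "esc")


def naive_strip(word):
    """Remove the longest listed suffix, with no check on where it occurs."""
    matches = [s for s in AMBIGUOUS_SUFFIXES if word.endswith(s)]
    if not matches:
        return word
    longest = max(matches, key=len)
    return word[: -len(longest)]


for word in ("industrial", "cal", "facem", "sistem", "iubesc"):
    print(word, "->", naive_strip(word))
```

The verb forms come out as intended (facem gives fac, iubesc gives iub), but cal is reduced to c and sistem to sist, which is why the suffix rules need extra conditions and exception lists.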
Our stemmer aims to remove all inflectional (morphological) suffixes and most derivational suffixes.
To deal with the ambiguity described above we use a list of exceptions. It contains uninflected words such as adverbs and conjunctions, most of which are stopwords: devreme - early, apoi - then, căci - as, deci - therefore.
The list holds only a few examples of each case, and could be extended.
As a further measure, some words that contain a sequence identical to a suffix are treated as exceptions to the usual setting of p1, the left boundary of R1: coral - coral reef, moral - moral, marar - dill, declar - declare.
Definition of the Romanian stemmer
Letters in Romanian include the following accented forms, shown with the names used for them in the suffix lists and the Snowball source below:
    ă - {ab},  â - {a^},  î - {i^},  ș - {s,},  ț - {t,}
Define a vowel as one of
    a   e   i   o   u   y   ă   â   î
The regions R1 and R2 are defined in the same way as in the English stemmer.
(See the note on R1 and R2.)
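A minimal Python sketch of the region marking, assuming the usual Snowball definition (R1 starts after the first non-vowel that follows a vowel, R2 is found the same way starting from R1) and assuming precomposed characters for ă, â and î. The names `mark_regions`, `go_past_vowel_then_non_vowel` and `P1_EXCEPTIONS` are chosen for this illustration, and the exception tuple holds only the four words quoted earlier, not the full list.

```python
VOWELS = set("aeiouyăâî")

# A few of the words treated as exceptions to the usual setting of p1.
P1_EXCEPTIONS = ("coral", "moral", "marar", "declar")


def go_past_vowel_then_non_vowel(word, start):
    """Return the position just after the first non-vowel that follows a
    vowel, searching from `start`; return len(word) if there is none."""
    i = start
    while i < len(word) and word[i] not in VOWELS:
        i += 1  # skip to the first vowel
    i += 1  # step past that vowel
    while i < len(word) and word[i] in VOWELS:
        i += 1  # skip to the next non-vowel
    return min(i + 1, len(word))


def mark_regions(word):
    """Return (p1, p2): R1 is word[p1:] and R2 is word[p2:]."""
    p1 = go_past_vowel_then_non_vowel(word, 0)
    for prefix in P1_EXCEPTIONS:
        if word.startswith(prefix):
            p1 = len(prefix)  # R1 begins only after the listed word
            break
    p2 = go_past_vowel_then_non_vowel(word, p1)
    return p1, p2


for word in ("stabilizare", "coral", "coralul"):
    p1, p2 = mark_regions(word)
    print(word, "R1 =", repr(word[p1:]), "R2 =", repr(word[p2:]))
```

For stabilizare this gives R1 = ilizare and R2 = izare. For coral the usual rule would put p1 at position 3 and leave -al inside R1; with the exception, p1 moves to the end of the word and the -al can no longer be removed by a rule that requires R1.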
Only words of more than two letters are stemmed.
Remove morphological suffixes:
Search for the longest among the following suffixes, and perform the action indicated:
- ului   uri   urile   urilor   ul
    delete
- le
    if preceded by a or o, replace with l (this fixes the mis-stemming of oale - pots); otherwise delete
- lelor   elor   ilor   elui   lor   ele   ile   ei   i   ii   e   a   {ab}
    delete
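The step above can be sketched in Python as follows. This is only an illustration of "longest suffix wins, then apply its action", using the suffix lists exactly as given in the text; the function and constant names are assumptions of this sketch.

```python
# Suffix lists copied from the description above ({ab} written as ă).
DELETE_SUFFIXES = (
    "ului", "uri", "urile", "urilor", "ul",
    "lelor", "elor", "ilor", "elui", "lor", "ele", "ile",
    "ei", "i", "ii", "e", "a", "ă",
)
LE_SUFFIX = "le"


def remove_morphological_suffix(word):
    """Strip the longest matching suffix; 'le' becomes 'l' after a or o."""
    candidates = DELETE_SUFFIXES + (LE_SUFFIX,)
    matches = [s for s in candidates if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    stem = word[: -len(suffix)]
    if suffix == LE_SUFFIX and stem.endswith(("a", "o")):
        return stem + "l"  # oale -> oal rather than oa
    return stem


for word in ("oale", "mesele", "copiilor", "masa"):
    print(word, "->", remove_morphological_suffix(word))
```

Because the longest suffix is chosen first, mesele loses -ele rather than just -e, while oale matches -le and keeps its l.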
Remove verb suffixes:
Search for the longest among the following suffixes, and perform the action indicated:
- eaz{ab}   eaza   ezi   ez   z{ab}
    if in R1 or R2, delete
- esc   e{s,}ti   e{s,}te   im   i{t,}i
    if in R1 or R2, delete
- ai   a{s,}i   i{s,}i   am   {ab}m   em   au   r{ab}m   ea   u
    if in R1 or R2, delete
- {t,}i
    if in R1 and preceded by a, e, i or u, delete
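A hedged sketch of how the region condition works for this step. It assumes p1 (the start of R1) has already been computed and is passed in; since R2 always lies inside R1, "in R1 or R2" reduces to "starts at or after p1". The names are hypothetical, and ş and ţ are written with the cedilla code points (U+015F, U+0163) used by the Snowball source.

```python
# Suffixes deleted whenever they lie in R1 or R2 (from the lists above).
REGION_ONLY = (
    "ează", "eaza", "ezi", "ez", "ză",
    "esc", "eşti", "eşte", "im", "iţi",
    "ai", "aşi", "işi", "am", "ăm", "em", "au", "răm", "ea", "u",
)
TI_SUFFIX = "ţi"  # also needs a preceding a, e, i or u


def remove_verb_suffix(word, p1):
    """Delete the longest verb suffix if it starts at or after p1."""
    matches = [s for s in REGION_ONLY + (TI_SUFFIX,) if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    start = len(word) - len(suffix)
    if start < p1:
        return word  # suffix is not inside R1, so not inside R2 either
    if suffix == TI_SUFFIX and word[start - 1:start] not in ("a", "e", "i", "u"):
        return word
    return word[:start]


print(remove_verb_suffix("facem", 3))
print(remove_verb_suffix("poem", 4))
print(remove_verb_suffix("lucraţi", 3))
```

With p1 = 3, facem loses -em. In poem the first non-vowel after a vowel is the final m, so p1 = 4, the -em lies outside R1, and the word is left alone.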
Remove derivational suffixes (this is done in two steps):
In each step, search for the longest among the following suffixes, and perform the action indicated:
(1):
- ism   ist   i{s,}t
    if in R1 or R2 and preceded by a consonant, delete
- iz   ant   {ab}r   ar
    if in R2 and preceded by a consonant, delete
- tor   toar   abil   ibil
    if in R1 or R2, delete
- ime   esc
    delete
- n{t,}
    if in R2, replace with nt
(2):
- {ab}r   ar
    if in R1 or R2 and preceded by a consonant, delete
- anie   icel   giu   eal   {ab}tat
    if in R2 and preceded by a consonant, delete
- ulte{t,}   u{t,}   uc   u{s,}   el   oi
    if in R1, delete
- ir   im   i{s,}   iz   iv   aj   an   ac
    if in R1 or R2 and preceded by a consonant, delete
- ic
    if in R1 or R2, delete
- er
    if in R1 or R2 and preceded by a consonant or i, delete
- os   o{s,}   oas
    if preceded by u, j, i, r or p in R1, delete
- ant   ean   liv   al
    if in R1 or R2, delete
- {s,}or   {s,}oar
    if in R1 or R2 and preceded by a vowel, delete
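The rules of step (1) lend themselves to a table-driven sketch. The Python below is an illustration under stated assumptions, not the reference implementation: p1 and p2 are taken as already computed, the rule table and function names are invented for this example, and, as in the Snowball `among` construct, only the longest matching suffix is considered, so the word is left unchanged if that suffix fails its condition.

```python
VOWELS = set("aeiouyăâî")

# (suffixes, region, needs_consonant_before, replacement)
# region "R1" means "in R1 or R2", "R2" means "in R2", None means no test.
STEP1_RULES = [
    (("ism", "ist", "işt"), "R1", True, ""),
    (("iz", "ant", "ăr", "ar"), "R2", True, ""),
    (("tor", "toar", "abil", "ibil"), "R1", False, ""),
    (("ime", "esc"), None, False, ""),
    (("nţ",), "R2", False, "nt"),
]


def derivational_step1(word, p1, p2):
    """Apply the rule of the longest matching suffix, if its test passes."""
    best = None
    for suffixes, region, needs_consonant, replacement in STEP1_RULES:
        for s in suffixes:
            if word.endswith(s) and (best is None or len(s) > len(best[0])):
                best = (s, region, needs_consonant, replacement)
    if best is None:
        return word
    suffix, region, needs_consonant, replacement = best
    start = len(word) - len(suffix)
    if region == "R1" and start < p1:
        return word
    if region == "R2" and start < p2:
        return word
    if needs_consonant and (start == 0 or word[start - 1] in VOWELS):
        return word
    return word[:start] + replacement


print(derivational_step1("socialism", 3, 6))
print(derivational_step1("muncitor", 3, 6))
print(derivational_step1("mar", 3, 3))
```

With these region values socialism becomes social and muncitor becomes munci, while mar is untouched because its -ar lies outside R2. Step (2) could reuse the same function with a second table, extended with the extra conditions it needs (a specific preceding letter, or a preceding vowel).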
Remove residual suffixes (vowels which appear through derivation, e.g. muncitor - worker, and participle suffixes):
Search for the longest among the following suffixes, and perform the action indicated:
- a   {ab}   e   u   i
    if in R1 or R2 and preceded by a consonant, delete
- at   a{t,}   ut   u{t,}   it   i{t,}
    if in R1 or R2, delete
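As a small illustration of this last step, the sketch below applies the two residual rules to a word whose p1 is supplied by the caller. The names are hypothetical, and "in R1 or R2" is again tested as "starts at or after p1".

```python
VOWELS = set("aeiouyăâî")
VOWEL_ENDINGS = ("a", "ă", "e", "u", "i")
PARTICIPLE_ENDINGS = ("at", "aţ", "ut", "uţ", "it", "iţ")


def remove_residual_suffix(word, p1):
    """Delete a final vowel after a consonant, or a participle ending."""
    matches = [s for s in VOWEL_ENDINGS + PARTICIPLE_ENDINGS if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    start = len(word) - len(suffix)
    if start < p1:
        return word  # outside R1, hence outside R2 as well
    if suffix in VOWEL_ENDINGS and (start == 0 or word[start - 1] in VOWELS):
        return word  # a final vowel is kept unless a consonant precedes it
    return word[:start]


print(remove_residual_suffix("munci", 3))
print(remove_residual_suffix("stabilizat", 4))
print(remove_residual_suffix("idee", 2))
```

Here munci, which is what remains of muncitor once -tor has been removed, is reduced to munc, and stabilizat loses its participle ending. The final e of idee is kept because it follows another vowel.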
The same algorithm in Snowball
/*
Department of Computational Linguistics,
Ruprecht-Karls-University of Heidelberg
Marina Stegarescu: mstegare@hotmail.com
Doina Gliga: doina_gliga@yahoo.co.uk
Erwin Glockner: eglockner@hotmail.com
2006.07.15
*/
routines (
    mark_regions
    R1 R2
    morhological_suffixes
    deriv_suffixes1
    deriv_suffixes2
    exception1
    verb_suffix
    residual_suffix
)

externals ( stem )

integers ( p1 p2 )

groupings ( v )

stringescapes {}

stringdef a^   hex 'E2'   // â
stringdef ab   hex '103'  // ă
stringdef i^   hex 'EE'   // î
stringdef s,   hex '15F'  // ş
stringdef t,   hex '163'  // ţ

define v 'aeiouy{a^}{ab}{i^}'
define mark_regions as (

    $p1 = limit
    $p2 = limit

    do (
        // exceptions to the usual setting of p1: if the word begins with
        // one of these strings, p1 is set at the end of the match
        among (
            'coral' 'moral' 'social' 'canal' 'final' 'papagal' 'special' 'tractor' 'abator'
            'marar' 'declar' 'suf{ab}r'
            'polonic' 'voinic'
            'paravan' 'simultan' 'decan' 'decal' 'tiran'
            'caracter' 'tiner' 'acoper' 'descoper' 'sufer' 'numer'
            'orator' 'autor'
            'exprim' 'prim' 'ultim' 'optim' 'victim' 'antonim' 'sinonim'
            'adjectiv' 'conjunctiv' 'subjonctiv' 'substantiv' 'pozitiv' 'recidiv' 'infinitiv'
            'complet' 'absolut' 'debut' 'debit'
            'miros'
            'dantel' 'nuvel' 'tutel' 'model' 'cercel'
            'savant' 'ambulant'
            'aparat' 'ar{ab}t'
            'specific' 'critic'
            'oribil' 'probabil'
            'bine' 'feroce' 'atroce'
            // ... extensions possible here ...
        ) or (gopast v gopast non-v)
        setmark p1
        gopast v gopast non-v setmark p2
    )
)
define exception1 as (
    [substring] atlimit among (
        'cea'   (<- 'ce')
        'cel'   (<- 'ce')
        'cei'   (<- 'ce')
        'celui' (<- 'ce')
        'celei' (<- 'ce')
        'celor' (<- 'ce')
        'destul' 'astfel' 'altfel'
        'asupra' 'deasupra' 'asemenea' 'afar{ab}'
        'mai' 'nici' 'aici' 'apoi' 'musai' 'baremi' 'uneori' 'altminteri' 'deseori' 'numai'
        '{i^}nt{a^}i' 'p{a^}n{ab}' 'dup{ab}'
        'noi' 'voi' 'imi' 'i{t,}i' 'i{s,}i' 'cine' 'care'
        'cui' 'ori'
        'acest' 'pentru' 'sau'
        'c{ab}tre' 'despre' 'spre' 'dinspre' 'dintre' 'printre' '{i^}ntre' 'devreme' 'aproape' 'departe'
        'bine' 'feroce' 'atroce'
        'exprim' 'prim' 'ultim' 'optim' 'victim' 'antonim' 'sinonim'
        'fonem' 'extrem' 'poem' 'suprem'
    )
)
backwardmode (

    define R1 as $p1 <= cursor
    define R2 as $p2 <= cursor

    define morhological_suffixes as (
        [substring] among (
            'ului' 'uri' 'urile' 'urilor' 'ul'
                (delete)
            'le'
                (('a' or 'o' <- 'l') or delete)
            'lui' 'lor' 'elor' 'ilor' 'ele' 'ile' 'ei' 'i' 'ii' 'e' 'a' '{ab}'
                (delete)
        )
    )

    define deriv_suffixes1 as (
        [substring] among (
            'ism' 'ist' 'i{s,}t'
                (R2 or R1 non-v delete)
            'iz' 'ant' '{ab}r' 'ar'
                (R2 non-v delete)
            'tor' 'toar' 'abil' 'ibil'
                (R1 or R2 delete)
            'ime' 'esc'
                (delete)
            'n{t,}'
                (R2 <- 'nt')
        )
    )

    define deriv_suffixes2 as (
        [substring] among (
            '{ab}r' 'ar'
                (R1 or R2 non-v delete)
            'anie' 'icel' 'giu' 'eal' '{ab}tat'
                (R2 non-v delete)
            'ulte{t,}' 'u{t,}' 'uc' 'u{s,}' 'el' 'oi'
                (R1 delete)
            'ir' 'im' 'i{s,}' 'iz' 'iv' 'aj' 'an' 'ac'
                (R1 or R2 non-v delete)
            'ic'
                (R2 or R1 delete)
            'er'
                (R1 or R2 non-v or 'i' delete)
            'os' 'o{s,}' 'oas'
                (('u' or 'i' or 'j' or 'r' or 'p') R1 delete)
            'ant' 'ean' 'liv' 'al'
                (R1 or R2 delete)
            '{s,}or' '{s,}oar'
                (R1 or R2 v delete)
        )
    )
    define verb_suffix as (
        [substring] among (
            'eaz{ab}' 'eaza' 'ezi' 'ez' 'z{ab}'
                (R1 or R2 delete)
            'esc' 'easc{ab}' 'e{s,}ti' 'e{s,}te' 'im' 'i{t,}i'
                (R1 or R2 delete)
            'ai' 'a{s,}i' 'i{s,}i' 'am' '{ab}m' 'em'
            'au' 'r{ab}m' 'ea' 'u'
                (R1 or R2 delete)
            '{t,}i'
                // in R1 and preceded by a, e, i or u
                (R1 ('a' or 'e' or 'i' or 'u') delete)
            'se' 'sei' 'se{s,}i' 'ser{ab}m' 'ser{ab}{t,}i' 'ser{ab}' 'r{ab}'
                (R1 or R2 delete)
            'ind' '{i^}nd' '{a^}nd'
                (R1 or R2 delete)
        )
    )
    define residual_suffix as (
        [substring] among (
            'a' '{ab}' 'e' 'u' 'i'
                (R1 or R2 non-v delete)
            'at' 'a{t,}' 'it' 'i{t,}' 'ut' 'u{t,}'
                (R2 or R1 delete)
        )
    )
)
define stem as (
    exception1 or
    not hop 3 or (
        do mark_regions
        backwards (
            do (verb_suffix or morhological_suffixes)
            do deriv_suffixes1
            do deriv_suffixes2
            residual_suffix
        )
    )
)