Danish stemming algorithm


 

Links to resources

Snowball main page
The stemmer in Snowball
The ANSI C stemmer
- and its header
Sample Danish vocabulary
Its stemmed equivalent
Vocabulary + stemmed equivalent in two columns
Tar-gzipped file of all of the above

Danish stop word list
The stemmer in Snowball - MS DOS Latin I encodings
Scandinavian language stemmers


Here is a sample of Danish vocabulary, with the stemmed forms that will be generated with this algorithm.

word             stem             word             stem
indtage          indtag           underste         underst
indtagelse       indtag           undersåtter      undersåt
indtager         indtag           undersåtters     undersåt
indtages         indtag           undersøg         undersøg
indtaget         indtag           undersøge        undersøg
indtil           indtil           undersøgelse     undersøg
indtog           indtog           undersøgelsen    undersøg
indtraf          indtraf          undersøger       undersøg
indtryk          indtryk          undersøgt        undersøg
indtræde         indtræd          undersøgte       undersøg
indtræder        indtræd          undertryk        undertryk
indtræffe        indtræf          undertrykke      undertryk
indtræffer       indtræf          undertrykkelse   undertryk
indtrængende     indtræng         undertrykker     undertryk
indtægt          indtæg           undertrykkere    undertryk
indtægter        indtæg           undertrykkeren   undertryk
indvandrede      indvandred       undertrykkerens  undertryk
indvandret       indvandr         undertrykkeres   undertryk
indvender        indvend          undertrykkes     undertryk
indvendig        indvend          undertrykt       undertryk
indvendige       indvend          undertrykte      undertryk
indvendigt       indvend          undertryktes     undertryk
indvending       indvending       undertvang       undertvang
indvendingerne   indvending       undertvunget     undertvung
indvie           indvi            undertvungne     undertvungn
indviede         indvied          undervejs        undervej
indvielse        indvi            underverdenen    underverden
indvielsen       indvi            undervise        undervis
indvielsesløfte  indvielsesløft   underviser       undervis
indvielsestid    indvielsestid    undervises       undervis
indvier          indvi            undervisning     undervisning
indvies          indvi            undervisningen   undervisning
indviet          indvi            undervist        undervist
indvikle         indvikl          underviste       undervist
indvikler        indvikl          underværk        underværk
indvolde         indvold          underværker      underværk
indvoldene       indvold          undevise         undevis
indvortes        indvort          undeviste        undevist
indånde          indånd           undfange         undfang
indåndede        indånded         undfanged        undfanged



 

The stemming algorithm

The Danish alphabet includes the following additional letters:
æ   å   ø
The following letters are vowels:
a   e   i   o   u   y   æ   å   ø
A consonant is defined as a non-vowel.

R2 is not used: R1 is defined in the same way as in the German stemmer. (See the note on R1 and R2.)
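
As a rough illustration, the vowel set and the R1 computation might be sketched in Python as follows. The rule is read off the mark_regions routine in the Snowball source on this page: R1 starts after the first non-vowel that follows a vowel, and is then moved right if necessary so that at least three letters precede it. The names VOWELS and r1_start are illustrative assumptions, not part of Snowball, and the input is assumed to be lower case.

```python
# Illustrative sketch only; names are not part of the Snowball distribution.
VOWELS = set("aeiouyæåø")


def r1_start(word: str) -> int:
    """Return the index at which R1 begins; R1 is word[r1_start(word):]."""
    p1 = len(word)  # default: R1 is empty
    for i, ch in enumerate(word):
        if ch in VOWELS:
            # Look for the first non-vowel after the first vowel.
            for j in range(i + 1, len(word)):
                if word[j] not in VOWELS:
                    p1 = j + 1
                    break
            break
    # Adjust so that the part of the word before R1 has at least 3 letters.
    return max(p1, 3)


print(r1_start("friskt"))       # 4, so R1 is "kt"
print(r1_start("bestemmelse"))  # 3, so R1 is "temmelse"
```

For very short words the returned index can exceed the word length; slicing with it then simply yields an empty R1.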

Define a valid s-ending as one of
a   b   c   d   f   g   h   j   k   l   m   n   o   p   r   t   v   y   z   å

Do each of steps 1, 2, 3 and 4.

Step 1:
Search for the longest among the following suffixes in R1, and perform the action indicated.

(a) hed   ethed   ered   e   erede   ende   erende   ene   erne   ere   en   heden   eren   er   heder   erer   heds   es   endes   erendes   enes   ernes   eres   ens   hedens   erens   ers   ets   erets   et   eret
delete

(b) s
delete if preceded by a valid s-ending

(Of course, the letter of the valid s-ending is not necessarily in R1: only the s itself must be.)
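
Step 1 could be sketched in Python along the following lines. This is a minimal sketch, not the reference implementation: the function name step1 and the convention of passing the start index of R1 as the integer r1 are assumptions made for illustration.

```python
# Illustrative sketch only; step1 and its r1 argument are hypothetical names.
STEP1_SUFFIXES = (
    "hed ethed ered e erede ende erende ene erne ere en heden eren er "
    "heder erer heds es endes erendes enes ernes eres ens hedens erens "
    "ers ets erets et eret"
).split()
VALID_S_ENDING = set("abcdfghjklmnoprtvyzå")


def step1(word: str, r1: int) -> str:
    """Apply step 1; r1 is the index at which R1 starts."""
    region = word[r1:]
    # Every listed suffix (including the bare "s") that lies wholly in R1.
    candidates = [s for s in STEP1_SUFFIXES + ["s"] if region.endswith(s)]
    if not candidates:
        return word
    suffix = max(candidates, key=len)  # longest match wins
    if suffix != "s":
        return word[: -len(suffix)]    # group (a): delete
    # Group (b): the letter before the s may lie outside R1.
    if len(word) > 1 and word[-2] in VALID_S_ENDING:
        return word[:-1]
    return word


print(step1("bestemmelse", 3))  # bestemmels
print(step1("undervejs", 3))    # undervej
print(step1("paradis", 3))      # paradis (i is not a valid s-ending)
```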
Step 2:
Search for one of the following suffixes in R1, and if found delete the last letter.

gd   dt   gt   kt

(For example, friskt -> frisk)
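
A minimal Python sketch of step 2, under the same assumptions as before (the function name step2 and the integer r1 giving the start of R1 are illustrative, not part of Snowball):

```python
# Illustrative sketch only; step2 and its r1 argument are hypothetical names.
CONSONANT_PAIRS = ("gd", "dt", "gt", "kt")


def step2(word: str, r1: int) -> str:
    """Drop the last letter if R1 ends with gd, dt, gt or kt."""
    if word[r1:].endswith(CONSONANT_PAIRS):
        return word[:-1]
    return word


print(step2("friskt", 4))  # frisk
print(step2("rigt", 3))    # rigt (only the t is in R1, so gt does not match)
```

The second call shows why the region matters: the pair must lie wholly inside R1 for the step to fire.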
Step 3:
If the word ends igst, remove the final st.
Search for the longest among the following suffixes in R1, and perform the action indicated.

(a) ig   lig   elig   els
delete, and then repeat step 2

(b) løst
replace with løs
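
Step 3 might be sketched in Python as below. The sketch carries its own copy of step 2, since (a) repeats it. As the Snowball source on this page suggests, the igst test is applied to the whole word rather than to R1. The names step2, step3 and r1 are illustrative assumptions.

```python
# Illustrative sketch only; function names and the r1 argument are hypothetical.
def step2(word: str, r1: int) -> str:
    """Drop the last letter if R1 ends with gd, dt, gt or kt."""
    return word[:-1] if word[r1:].endswith(("gd", "dt", "gt", "kt")) else word


def step3(word: str, r1: int) -> str:
    """Apply step 3; r1 is the index at which R1 starts."""
    if word.endswith("igst"):
        word = word[:-2]                 # remove the final st
    region = word[r1:]
    if region.endswith("løst"):
        return word[:-1]                 # (b): løst -> løs
    for suffix in ("elig", "lig", "ig", "els"):  # longest candidates first
        if region.endswith(suffix):
            # (a): delete, then repeat step 2
            return step2(word[: -len(suffix)], r1)
    return word


print(step3("bestemmels", 3))   # bestemm
print(step3("uforsigtig", 3))   # uforsig (ig deleted, then gt -> g by step 2)
print(step3("arbejdsløst", 3))  # arbejdsløs
```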
Step 4: undouble
If the word ends with a double consonant in R1, remove one of the consonants.

(For example, bestemmelse -> bestemmels (step 1) -> bestemm (step 3a) -> bestem in this step.)
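
A possible Python sketch of the undoubling step follows. It takes the reading suggested by the Snowball source on this page, where only the final consonant has to lie in R1 and the matching letter before it may fall outside; that reading, and the names step4 and r1, are assumptions of this sketch.

```python
# Illustrative sketch only; step4 and its r1 argument are hypothetical names.
VOWELS = set("aeiouyæåø")


def step4(word: str, r1: int) -> str:
    """Remove one letter of a final double consonant whose last letter is in R1."""
    if (
        len(word) > r1                 # the last letter lies in R1
        and len(word) >= 2
        and word[-1] not in VOWELS     # it is a consonant
        and word[-1] == word[-2]       # and it is doubled
    ):
        return word[:-1]
    return word


print(step4("bestemm", 3))    # bestem
print(step4("undersått", 3))  # undersåt
print(step4("indtag", 3))     # indtag
```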

 

The same algorithm in Snowball


routines (
    mark_regions
    main_suffix
    consonant_pair
    other_suffix
    undouble
)

externals ( stem )

strings ( ch )

integers ( p1 )

groupings ( v s_ending )

stringescapes {}

/* special characters (in ISO Latin I) */

stringdef ae   hex 'E6'
stringdef ao   hex 'E5'
stringdef o/   hex 'F8'

define v 'aeiouy{ae}{ao}{o/}'

define s_ending 'abcdfghjklmnoprtvyz{ao}'

define mark_regions as (

    $p1 = limit

    goto v  gopast non-v  setmark p1
    try ( $p1 < 3  $p1 = 3 )
)

backwardmode (

    define main_suffix as (
        setlimit tomark p1 for ([substring])
        among(
            'hed' 'ethed' 'ered' 'e' 'erede' 'ende' 'erende' 'ene' 'erne' 'ere'
            'en' 'heden' 'eren' 'er' 'heder' 'erer' 'heds' 'es' 'endes'
            'erendes' 'enes' 'ernes' 'eres' 'ens' 'hedens' 'erens' 'ers' 'ets'
            'erets' 'et' 'eret'
                (delete)
            's'
                (s_ending delete)
        )
    )

    define consonant_pair as (
        test (
            setlimit tomark p1 for ([substring])
            among(
                'gd' // significant in the call from other_suffix
                'dt' 'gt' 'kt'
            )
        )
        next] delete
    )

    define other_suffix as (
        do ( ['st'] 'ig' delete )
        setlimit tomark p1 for ([substring])
        among(
            'ig' 'lig' 'elig' 'els'
                (delete do consonant_pair)
            'l{o/}st'
                (<-'l{o/}s')
        )
    )

    define undouble as (
        setlimit tomark p1 for ([non-v] ->ch)
        ch
        delete
    )
)

define stem as (

    do mark_regions
    backwards (
        do main_suffix
        do consonant_pair
        do other_suffix
        do undouble
    )
)
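
For readers who do not read Snowball, the program above might be transliterated into Python roughly as follows. This is an unofficial sketch that reuses the routine names from the Snowball source. The helper longest is hypothetical, and the input is assumed to be lower-case Unicode text rather than ISO Latin I bytes.

```python
# Unofficial Python transliteration of the Snowball program; a sketch only.
VOWELS = set("aeiouyæåø")
S_ENDING = set("abcdfghjklmnoprtvyzå")
MAIN = (
    "hed ethed ered e erede ende erende ene erne ere en heden eren er "
    "heder erer heds es endes erendes enes ernes eres ens hedens erens "
    "ers ets erets et eret s"
).split()
PAIRS = ("gd", "dt", "gt", "kt")
OTHER = ("ig", "lig", "elig", "els", "løst")


def longest(word, p1, suffixes):
    """Hypothetical helper: longest listed suffix wholly inside word[p1:], or ''."""
    region = word[p1:]
    return max((s for s in suffixes if region.endswith(s)), key=len, default="")


def mark_regions(word):
    p1 = len(word)
    for i, ch in enumerate(word):
        if ch in VOWELS:
            for j in range(i + 1, len(word)):
                if word[j] not in VOWELS:
                    p1 = j + 1
                    break
            break
    return max(p1, 3)


def consonant_pair(word, p1):
    return word[:-1] if word[p1:].endswith(PAIRS) else word


def stem(word):
    p1 = mark_regions(word)
    # main_suffix
    s = longest(word, p1, MAIN)
    if s == "s":
        if word[-2] in S_ENDING:
            word = word[:-1]
    elif s:
        word = word[: -len(s)]
    word = consonant_pair(word, p1)
    # other_suffix
    if word.endswith("igst"):
        word = word[:-2]
    s = longest(word, p1, OTHER)
    if s == "løst":
        word = word[:-1]
    elif s:
        word = consonant_pair(word[: -len(s)], p1)
    # undouble: last letter in R1, a non-vowel, and equal to the one before it
    if len(word) > p1 and word[-1] not in VOWELS and word[-1] == word[-2]:
        word = word[:-1]
    return word


for w in ("indtagelse", "undertrykkerens", "bestemmelse", "friskt"):
    print(w, "=>", stem(w))
```

Run on the sample vocabulary, the sketch should give the stems listed there, for example indtagelse => indtag and undertrykkerens => undertryk.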