# $RCSfile: Stem.pm,v $ $Revision: 1.2 $ $Date: 1999/06/16 17:45:28 $ $Author: snowhare $
#######################################################################
# Initial POD Documentation
#######################################################################
Lingua::Stem - Stemming of words
use Lingua::Stem qw(stem);
my $stemmed_words_anon_array = stem(@words);
or for the OO inclined,
my $stemmer = Lingua::Stem->new(-locale => 'EN-UK');
$stemmer->stem_caching({ -level => 2 });
my $stemmed_words_anon_array = $stemmer->stem(@words);
This routine applies stemming algorithms to its parameters,
returning the stemmed words as appropriate to the selected
locale.
You can import some or all of the class methods.
use Lingua::Stem qw (stem clear_stem_cache stem_caching
add_exceptions delete_exceptions
get_exceptions set_locale get_locale
:all :locale :exceptions :stem :caching);
:all - imports stem add_exceptions delete_exceptions get_exceptions
:caching - imports stem_caching clear_stem_cache
:locale - imports set_locale get_locale
:exceptions - imports add_exceptions delete_exceptions get_exceptions
Currently supported locales are:
EN - English (also EN-US and EN-UK)
RU - Russian (also RU-RU and RU-RU.KOI8-R)
If you have the memory and lots of stemming to do,
I B<strongly> suggest using cache level 2 and processing
lists in 'big chunks' (long lists) for best performance.
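
As a rough illustration of this advice, the sketch below feeds a long word list to a stemmer in large batches rather than one word at a time. The helper name stem_in_batches and the batch size are assumptions made for this example and are not part of Lingua::Stem. The helper relies only on the object having a stem method that takes a list of words and returns an array reference, as shown in the synopsis.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Stem): stem a long word list in
# large batches. $stemmer is any object whose stem() method takes a list of
# words and returns a reference to an array of stems.
sub stem_in_batches {
    my ($stemmer, $batch_size, $words) = @_;
    die "batch size must be positive\n" unless $batch_size > 0;

    my @stems;
    for (my $start = 0; $start < @$words; $start += $batch_size) {
        my $end = $start + $batch_size - 1;
        $end = $#$words if $end > $#$words;

        # One call per batch instead of one call per word.
        push @stems, @{ $stemmer->stem(@{$words}[$start .. $end]) };
    }
    return \@stems;
}

# Possible use with this module (assumes Lingua::Stem is installed):
#
#   my $stemmer = Lingua::Stem->new({ -locale => 'en' });
#   $stemmer->stem_caching({ -level => 2 });
#   my $stems = stem_in_batches($stemmer, 3000, \@words);
```

If the whole list fits comfortably in memory, passing it to stem in a single call is simpler still; the helper is only useful when the words arrive in a stream or the list is too large to hold at once.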
Some benchmarks using Lingua::Stem 0.80 and Lingua::Stem::Snowball 0.7
give an idea of how batching and caching affect performance. The dataset was
3000 randomly selected words from the Snowball English voc.txt list repeated
100 times (300000 words total, with 3000 unique words). The tests were performed
on a Linux machine with a 1.7 GHz Athlon processor running Perl 5.8.5. There
are no tests listed for Lingua::Stem::Snowball using caching because it doesn't have
caching.
Lingua::Stem::Snowball, one word at a time, no caching: 26741 words/second
Lingua::Stem::Snowball, 3000 word batches, no caching: 169710 words/second
Lingua::Stem::Snowball, one batch, no caching: 155797 words/second
Lingua::Stem, one word at a time, no caching: 13509 words/second
Lingua::Stem, 3000 word batches, no caching: 32765 words/second
Lingua::Stem, one batch, no caching: 34847 words/second
Lingua::Stem, one word at a time, cache level 2: 25736 words/second
Lingua::Stem, 3000 word batches, cache level 2: 194384 words/second
Lingua::Stem, one batch, cache level 2: 216602 words/second
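
To see why caching pays off on data shaped like this benchmark set, here is a minimal sketch that puts a plain hash cache in front of a toy stemmer and counts how often the underlying routine actually runs. The names toy_stem and cached_stem are invented for this illustration; the toy stemmer merely strips one trailing "s", and nothing here is meant to describe how Lingua::Stem implements its own cache.

```perl
use strict;
use warnings;

my $real_calls = 0;

# Toy stand-in for a real stemmer: lower-cases the word and strips one
# trailing "s". It exists only so the cache has something to wrap.
sub toy_stem {
    my ($word) = @_;
    $real_calls++;
    (my $stem = lc $word) =~ s/s\z//;
    return $stem;
}

# Hash-based cache in front of the toy stemmer.
my %cache;
sub cached_stem {
    my ($word) = @_;
    $cache{$word} = toy_stem($word) unless exists $cache{$word};
    return $cache{$word};
}

# Same shape as the benchmark data: 3000 unique words repeated 100 times.
my @unique = map { "word${_}s" } 1 .. 3000;
my @words  = (@unique) x 100;

my @stems = map { cached_stem($_) } @words;

printf "%d words stemmed, %d calls to the underlying stemmer\n",
    scalar(@words), $real_calls;
# prints "300000 words stemmed, 3000 calls to the underlying stemmer"
```

With 3000 unique words in 300000, only one lookup in a hundred reaches the underlying stemmer, which is the kind of input where a cache helps most.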
0.81 2004.07.26 - Minor documentation tweak. No functional change.
0.80 2004.07.25 - Added 'RU', 'RU_RU', 'RU_RU.KOI-8' locale.
Added support for Lingua::Stem::Ru to
Makefile.PL and autoloader.
Added documentation stressing use of caching
and batches for performance. Added support
for '_' as a separator in the locale strings.
Added example benchmark script. Expanded copyright
0.70 2004.04.26 - Added FR locale and documentation fixes
0.61 2003.09.28 - Documentation fixes. No functional changes.
0.60 2003.04.05 - Added more locales by wrapping various stemming
implementations. Documented currently supported
0.50 2000.09.14 - Fixed major implementation error. Starting with
version 0.30 I forgot to include rulesets 2,3 and 4
for Porter's algorithm. The resulting stemming results
were very poor. Thanks go to <csyap@netfision.com>
for bringing the problem to my attention.
Unfortunately, the fix inherently generates *different*
stemming results than 0.30 and 0.40 did. If you
need identically broken output - use locale 'en-broken'.
0.40 2000.08.25 - Added stem caching support as an option. This
can provide a large speedup to the operation
of the stemmer. Caching is turned off by default
to maximize compatibility with previous versions.
0.30 1999.06.24 - Replaced core of 'En' stemmers with code from
Jim Richardson <jimr@maths.usyd.edu.au>
Aliased 'en-us' and 'en-uk' to 'en'
Fixed 'SYNOPSIS' to correct return value
type for stemmed words (SYNOPSIS error spotted
by <Arved_37@chebucto.ns.ca>)
0.20 1999.06.15 - Changed to '.pm' module, moved into Lingua:: namespace,
added OO interface, optionalized the export of routines
into the caller's namespace, added named parameter
initialization, stemming exceptions, autoloaded
locale support and isolated case flattening to
localized stemmers to prevent i18n problems later.
Input and output text are assumed to be in UTF8
encoding (no operational impact right now, but
will be important when extending the module to
#######################################################################
#######################################################################
use Lingua::Stem::AutoLoader;
use vars qw (@ISA @EXPORT_OK %EXPORT_TAGS @EXPORT $VERSION);
@ISA = qw (Exporter);
@EXPORT_OK = qw (stem clear_stem_cache stem_caching add_exceptions delete_exceptions get_exceptions set_locale get_locale);
%EXPORT_TAGS = ( 'all' => [qw (stem stem_caching clear_stem_cache add_exceptions delete_exceptions get_exceptions set_locale get_locale)],
$Lingua::Stem::VERSION = '0.82';
@Lingua::Stem::ISA = qw (Exporter);
@Lingua::Stem::EXPORT = ();
@Lingua::Stem::EXPORT_OK = qw (stem stem_in_place clear_stem_cache stem_caching add_exceptions delete_exceptions get_exceptions set_locale get_locale);
%Lingua::Stem::EXPORT_TAGS = ( 'all' => [qw (stem stem_in_place stem_caching clear_stem_cache add_exceptions delete_exceptions get_exceptions set_locale get_locale)],
'stem' => [qw (stem)],
'stem_in_place' => [qw (stem_in_place)],
'caching' => [qw (stem_caching clear_stem_cache)],
'locale' => [qw (set_locale get_locale)],
'exceptions' => [qw (add_exceptions delete_exceptions get_exceptions)],
-stemmer => \&Lingua::Stem::En::stem,
-stem_in_place => \&Lingua::Stem::En::stem,
-stem_caching => \&Lingua::Stem::En::stem_caching,
-clear_stem_cache => \&Lingua::Stem::En::clear_stem_cache,
-known_locales => {
'da' => { -stemmer => \&Lingua::Stem::Da::stem,
-stem_caching => \&Lingua::Stem::Da::stem_caching,
-clear_stem_cache => \&Lingua::Stem::Da::clear_stem_cache,
-stem_in_place => sub { require Carp; Carp::croak("'stem_in_place' not available for 'da' locale"); },
'de' => { -stemmer => \&Lingua::Stem::De::stem,
-stem_caching => \&Lingua::Stem::De::stem_caching,
-clear_stem_cache => \&Lingua::Stem::De::clear_stem_cache,
-stem_in_place => sub { require Carp; Carp::croak("'stem_in_place' not available for 'de' locale"); },
'en' => { -stemmer => \&Lingua::Stem::En::stem,
-stem_caching => \&Lingua::Stem::En::stem_caching,
-clear_stem_cache => \&Lingua::Stem::En::clear_stem_cache,
-stem_in_place => \&Lingua::Stem::En::stem,
'en_us' => { -stemmer => \&Lingua::Stem::En::stem,
-stem_caching => \&Lingua::Stem::En::stem_caching,
-clear_stem_cache => \&Lingua::Stem::En::clear_stem_cache,
-stem_in_place => \&Lingua::Stem::En::stem,
'en-us' => { -stemmer => \&Lingua::Stem::En::stem,
-stem_caching => \&Lingua::Stem::En::stem_caching,
-clear_stem_cache => \&Lingua::Stem::En::clear_stem_cache,
-stem_in_place => \&Lingua::Stem::En::stem,
'en_uk' => { -stemmer => \&Lingua::Stem::En::stem,
-stem_caching => \&Lingua::Stem::En::stem_caching,
-clear_stem_cache => \&Lingua::Stem::En::clear_stem_cache,
-stem_in_place => \&Lingua::Stem::En::stem,
'en-uk' => { -stemmer => \&Lingua::Stem::En::stem,
-stem_caching => \&Lingua::Stem::En::stem_caching,
-clear_stem_cache => \&Lingua::Stem::En::clear_stem_cache,
-stem_in_place => \&Lingua::Stem::En::stem,
'en-broken' => { -stemmer => \&Lingua::Stem::En_Broken::stem,
-stem_caching => \&Lingua::Stem::En_Broken::stem_caching,
-clear_stem_cache => \&Lingua::Stem::En_Broken::clear_stem_cache,
-stem_in_place => sub { require Carp; Carp::croak("'stem_in_place' not available for 'en-broken' locale"); },
'fr' => { -stemmer => \&Lingua::Stem::Fr::stem,
-stem_caching => \&Lingua::Stem::Fr::stem_caching,
-clear_stem_cache => \&Lingua::Stem::Fr::clear_stem_cache,
-stem_in_place => sub { require Carp; Carp::croak("'stem_in_place' not available for 'fr' locale"); },
'gl' => { -stemmer => \&Lingua::Stem::Gl::stem,
-stem_caching => \&Lingua::Stem::Gl::stem_caching,
-clear_stem_cache => \&Lingua::Stem::Gl::clear_stem_cache,
-stem_in_place => sub { require Carp; Carp::croak("'stem_in_place' not available for 'gl' locale"); },
'it' => { -stemmer => \&Lingua::Stem::It::stem,
-stem_caching => \&Lingua::Stem::It::stem_caching,
-clear_stem_cache => \&Lingua::Stem::It::clear_stem_cache,
-stem_in_place => sub { require Carp; Carp::croak("'stem_in_place' not available for 'it' locale"); },
'no' => { -stemmer => \&Lingua::Stem::No::stem,
-stem_caching => \&Lingua::Stem::No::stem_caching,
-clear_stem_cache => \&Lingua::Stem::No::clear_stem_cache,
-stem_in_place => sub { require Carp; Carp::croak("'stem_in_place' not available for 'no' locale"); },
'pt' => { -stemmer => \&Lingua::Stem::Pt::stem,
-stem_caching => \&Lingua::Stem::Pt::stem_caching,
-clear_stem_cache => \&Lingua::Stem::Pt::clear_stem_cache,
-stem_in_place => sub { require Carp; Carp::croak("'stem_in_place' not available for 'pt' locale"); },
'sv' => { -stemmer => \&Lingua::Stem::Sv::stem,
-stem_caching => \&Lingua::Stem::Sv::stem_caching,
-clear_stem_cache => \&Lingua::Stem::Sv::clear_stem_cache,
-stem_in_place => sub { require Carp; Carp::croak("'stem_in_place' not available for 'sv' locale"); },
'ru' => { -stemmer => \&Lingua::Stem::Ru::stem,
-stem_caching => \&Lingua::Stem::Ru::stem_caching,
-clear_stem_cache => \&Lingua::Stem::Ru::clear_stem_cache,
-stem_in_place => sub { require Carp; Carp::croak("'stem_in_place' not available for 'ru' locale"); },
-stemmer => \&Lingua::Stem::Ru::stem,
-stem_caching => \&Lingua::Stem::Ru::stem_caching,
-clear_stem_cache => \&Lingua::Stem::Ru::clear_stem_cache,
-stem_in_place => sub { require Carp; Carp::croak("'stem_in_place' not available for 'ru_ru' locale"); },
-stemmer => \&Lingua::Stem::Ru::stem,
-stem_caching => \&Lingua::Stem::Ru::stem_caching,
-clear_stem_cache => \&Lingua::Stem::Ru::clear_stem_cache,
-stem_in_place => sub { require Carp; Carp::croak("'stem_in_place' not available for 'ru-ru' locale"); },
'ru-ru.koi8-r' => {
-stemmer => \&Lingua::Stem::Ru::stem,
-stem_caching => \&Lingua::Stem::Ru::stem_caching,
-clear_stem_cache => \&Lingua::Stem::Ru::clear_stem_cache,
-stem_in_place => sub { require Carp; Carp::croak("'stem_in_place' not available for 'ru-ru.koi8-r' locale"); },
'ru_ru.koi8-r' => {
-stemmer => \&Lingua::Stem::Ru::stem,
-stem_caching => \&Lingua::Stem::Ru::stem_caching,
-clear_stem_cache => \&Lingua::Stem::Ru::clear_stem_cache,
-stem_in_place => sub { require Carp; Carp::croak("'stem_in_place' not available for 'ru_ru.koi8-r' locale"); },
#######################################################################
#######################################################################
#######################################################################
Returns a new instance of a Lingua::Stem object, optionally selecting
the locale to be used for stemming.
# By default the locale is en
$us_stemmer = Lingua::Stem->new;
$us_stemmer->stem_caching({ -level => 2 });
# Overriding the default for a specific instance
$uk_stemmer = Lingua::Stem->new({ -locale => 'en-uk' });
# Overriding the default for a specific instance and changing the default
$uk_stemmer = Lingua::Stem->new({ -default_locale => 'en-uk' });
my $proto = shift;
my $class = ref ($proto) || $proto || __PACKAGE__;
my $package = __PACKAGE__;
my $proto_ref = ref($proto);
my $self = bless {},$class;
# Set the defaults
%{$self->{'Lingua::Stem'}->{-exceptions}} = %{$defaults->{-exceptions}};
$self->{'Lingua::Stem'}->{-locale} = $defaults->{-locale};
$self->{'Lingua::Stem'}->{-stemmer} = $defaults->{-stemmer};
$self->{'Lingua::Stem'}->{-stem_in_place} = $defaults->{-stem_in_place};
$self->{'Lingua::Stem'}->{-stem_caching} = $defaults->{-stem_caching};
$self->{'Lingua::Stem'}->{-clear_stem_cache} = $defaults->{-clear_stem_cache};
&$stem_caching_sub(@_);
#######################################################################
# Terminal POD Documentation
#######################################################################
It started with the 'Text::Stem' module, which has been adapted into
a more general framework, moved into the more language-oriented
'Lingua' namespace, and reorganized to support an OOP interface
as well as to switch the core 'En' locale stemmers.
Version 0.40 added a cache for stemmed words. This can provide up
to a severalfold performance improvement.
The module is organized so that extending it to any number
of languages should be direct and simple.
Case flattening is a function of the language, so the 'exceptions'
methods have to be used appropriately to the language. For 'En'
family stemming, use only lower-case words for exceptions.
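
One way to follow this rule is to normalize an exceptions map before handing it to add_exceptions. The sketch below is only an illustration: the helper name lowercase_exception_keys is invented here, and it assumes that the lookup key is the part that must be lower case, leaving the replacement values untouched.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Stem): return a copy of an
# exceptions map whose keys are lower case. Values are left untouched,
# which is an assumption of this sketch.
sub lowercase_exception_keys {
    my ($exceptions) = @_;
    my %flattened;
    for my $word (keys %$exceptions) {
        $flattened{ lc $word } = $exceptions->{$word};
    }
    return \%flattened;
}

# Possible use with this module (assumes Lingua::Stem is installed):
#
#   my $stemmer = Lingua::Stem->new({ -locale => 'en' });
#   $stemmer->add_exceptions(lowercase_exception_keys({ 'Emily' => 'emily' }));
```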
Benjamin Franz <snowhare@nihongo.org>
Jim Richardson <imr@maths.usyd.edu.au>
Jim Richardson <imr@maths.usyd.edu.au>
Ulrich Pfeifer <pfeifer@ls6.informatik.uni-dortmund.de>
Aldo Calpini <dada@perl.it>
Ask Solem Hoel <ask@unixmonks.net>
Dennis Haney <davh@davh.dk>
Sébastien Darribere-Pleyt <sebastien.darribere@lefute.com>
Aleksandr Guidrevitch <pillgrim@mail.ru>
Lingua::Stem::En Lingua::Stem::Da
Lingua::Stem::De Lingua::Stem::Gl Lingua::Stem::No
Lingua::Stem::Pt Lingua::Stem::Sv Lingua::Stem::It
Lingua::Stem::Fr Lingua::Stem::Ru Text::German
Lingua::PT::Stemmer Lingua::GL::Stemmer Lingua::Stem::Snowball::No
Lingua::Stem::Snowball::Se Lingua::Stem::Snowball::Da Lingua::Stem::Snowball::Sv
Lingua::Stemmer::GL Lingua::Stem::Snowball
http://snowball.tartarus.org
Freerun Technologies, Inc (Freerun),
Jim Richardson, University of Sydney <imr@maths.usyd.edu.au>
and Benjamin Franz <snowhare@nihongo.org>. All rights reserved.
Text::German was written and is copyrighted by Ulrich Pfeifer.
Lingua::Stem::Snowball::Da was written and is copyrighted by
Dennis Haney and Ask Solem Hoel.
Lingua::Stem::It was written and is copyrighted by Aldo Calpini.
Lingua::Stem::Snowball::No, Lingua::Stem::Snowball::Se, Lingua::Stem::Snowball::Sv were
written and are copyrighted by Ask Solem Hoel.
Lingua::Stemmer::GL and Lingua::PT::Stemmer were written and are copyrighted by Xern.
Lingua::Stem::Fr was written and is copyrighted by Aldo Calpini and Sébastien Darribere-Pleyt.
Lingua::Stem::Ru was written and is copyrighted by Aleksandr Guidrevitch.
This software may be freely copied and distributed under the same
terms and conditions as Perl.
Add more languages. Extend regression tests. Add support for the
Lingua::Stem::Snowball family of stemmers as an alternative core stemming