Module Stemmer :: Class Stemmer

Type Stemmer

object --+
         |
        Stemmer


An instance of a stemming algorithm.

The algorithm has internal state, so an instance must not be used concurrently; that is, only a single thread should access the instance at any given time.
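
One way to respect this restriction is to serialize all calls through a lock. The following is a minimal sketch, not part of this module: the LockedStemmer name is hypothetical, and it assumes only that the wrapped object exposes stemWord() and stemWords().

```python
import threading


class LockedStemmer(object):
    """Serialize access to a stemmer that is not safe for concurrent use.

    `stemmer` is any object exposing stemWord() and stemWords().
    This wrapper is an illustration and is not provided by the module.
    """

    def __init__(self, stemmer):
        self._stemmer = stemmer
        self._lock = threading.Lock()

    def stemWord(self, word):
        # Only one thread at a time may enter the underlying stemmer.
        with self._lock:
            return self._stemmer.stemWord(word)

    def stemWords(self, words):
        # Consume the input before taking the lock, so that a slow
        # iterator does not hold the lock longer than necessary.
        words = list(words)
        with self._lock:
            return self._stemmer.stemWords(words)
```

An alternative with no lock contention is to create a separate Stemmer instance for each thread, at the cost of one cache per thread.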

When creating a Stemmer object, there is one required argument: the name of the algorithm to use in the new stemmer. A list of the valid algorithm names may be obtained by calling the algorithms() function in this module. Alternatively, the two- or three-letter ISO 639 code of a language may be passed in place of an algorithm name to select the appropriate stemming algorithm for that language.
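
As an illustration of constructing stemmers by name, the sketch below builds one instance per algorithm name or language code. The build_stemmers helper is hypothetical; the constructor is passed in as `factory` so that the sketch does not depend on how the module is imported.

```python
def build_stemmers(factory, names):
    """Create one stemmer per algorithm name or ISO 639 code.

    `factory` is the stemmer constructor and is called with the name
    as its single required argument.
    """
    stemmers = {}
    for name in names:
        # Each name gets its own instance, and therefore its own state.
        stemmers[name] = factory(name)
    return stemmers
```

Assuming the module has been imported as Stemmer, a call such as build_stemmers(Stemmer.Stemmer, ['en', 'fr']) would give a dictionary keyed by language code, and build_stemmers(Stemmer.Stemmer, Stemmer.algorithms()) would give one stemmer per available algorithm.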

A second optional argument to the constructor for Stemmer is the size of cache to use. The cache implemented in this module is not terribly efficient, but benchmarks show that it approximately doubles performance for typical text processing operations, without too much memory overhead. The cache may be disabled by passing a size of 0. The default size (10000 words) is probably appropriate in most situations. In pathological cases (for example, when no word is presented to the stemming algorithm more than once, so the cache is useless), the cache can severely damage performance.
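
To judge whether a particular input is close to the pathological case, it can help to measure how often words repeat. The sketch below is an assumption-laden estimate, not part of the module: the repeat_fraction name is hypothetical, and the figure it returns is only an upper bound on the hit rate of any cache large enough to hold every distinct word.

```python
def repeat_fraction(words):
    """Return the fraction of words that repeat an earlier word.

    A value near 0.0 means almost no word occurs twice, so a cache
    cannot help; a value near 1.0 means most lookups could be served
    from a sufficiently large cache.
    """
    seen = set()
    total = 0
    repeats = 0
    for word in words:
        total += 1
        if word in seen:
            repeats += 1
        else:
            seen.add(word)
    # float() keeps the division exact on older Python versions too.
    return float(repeats) / total if total else 0.0
```

If the fraction for a representative sample is close to zero, passing a cache size of 0 as the second constructor argument is worth testing.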

The "benchmark.py" script supplied with the PyStemmer distribution can be used to test the performance of the stemming algorithms with various cache sizes.


Method Summary
  __init__(...)
x.__init__(...) initializes x; see x.__class__.__doc__ for signature
  __new__(T, S, ...)
T.__new__(S, ...) -> a new object with type S, a subtype of T
  stemWord(...)
Stem a word.
  stemWords(...)
Stem a list of words.
  __purgeCache(...)
    Inherited from object
  __delattr__(...)
x.__delattr__('name') <==> del x.name
  __getattribute__(...)
x.__getattribute__('name') <==> x.name
  __hash__(x)
x.__hash__() <==> hash(x)
  __reduce__(...)
helper for pickle
  __reduce_ex__(...)
helper for pickle
  __repr__(x)
x.__repr__() <==> repr(x)
  __setattr__(...)
x.__setattr__('name', value) <==> x.name = value
  __str__(x)
x.__str__() <==> str(x)

Class Variable Summary
getset_descriptor maxCacheSize = <attribute 'maxCacheSize' of 'Stemmer.Ste...

Method Details

__init__(...)
(Constructor)

x.__init__(...) initializes x; see x.__class__.__doc__ for signature

Overrides:
__builtin__.object.__init__

__new__(T, S, ...)

T.__new__(S, ...) -> a new object with type S, a subtype of T
Returns:
a new object with type S, a subtype of T
Overrides:
__builtin__.object.__new__

stemWord(...)

Stem a word.

This takes a single argument, word, which should be either a UTF-8 encoded string or a unicode object.

The result is the stemmed form of the word. If the word supplied was a unicode object, the result will be a unicode object; if the word supplied was a string, the result will be a UTF-8 encoded string.
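
Because the result type follows the argument type, callers that mix both kinds of input may want a single output type. The following sketch shows one way to normalize the result to text; the stem_as_text helper is hypothetical and assumes only the stemWord() behaviour described above.

```python
def stem_as_text(stemmer, word):
    """Stem a word and always return text rather than encoded bytes."""
    result = stemmer.stemWord(word)
    # A UTF-8 encoded argument yields a UTF-8 encoded result, so decode
    # it; a unicode argument already yields a unicode result.
    if isinstance(result, bytes) and not isinstance(result, type(u"")):
        result = result.decode("utf-8")
    return result
```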

stemWords(...)

Stem a list of words.

This takes a single argument, words, which must be a sequence, iterator, generator or similar.

The entries in words should be either UTF-8 encoded strings or unicode objects.

The result is a list of the stemmed forms of the words. For each entry, if the word supplied was a unicode object, the stemmed form will be a unicode object; if the word supplied was a string, the stemmed form will be a UTF-8 encoded string.
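
Since any iterable is accepted, tokens can be passed straight from a generator. The sketch below is one possible use, not part of the module: the stem_text helper is hypothetical, whitespace splitting stands in for real tokenization, and lowercasing is included on the assumption that the stemming algorithms expect lowercase input.

```python
def stem_text(stemmer, text):
    """Split text on whitespace and stem every token in a single call."""
    # A generator is enough here; no intermediate list of tokens is built.
    tokens = (token.lower() for token in text.split())
    return stemmer.stemWords(tokens)
```

Assuming the module has been imported as Stemmer, stem_text(Stemmer.Stemmer('english'), text) returns one stemmed form per whitespace-separated token, in the original order.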


Class Variable Details

maxCacheSize

Type:
getset_descriptor
Value:
<attribute 'maxCacheSize' of 'Stemmer.Stemmer' objects>                

Generated by Epydoc 2.1 on Sun Jun 11 15:38:38 2006 http://epydoc.sf.net