class Thesaurus(object):
"""Represents the WordNet synonym database, either loaded into memory
from the wn_s.pl Prolog file, or stored on disk in a Whoosh index.
This class allows you to parse the prolog file "wn_s.pl" from the WordNet prolog
download into an object suitable for looking up synonyms and performing query
http://wordnetcode.princeton.edu/3.0/WNprolog-3.0.tar.gz
To load a Thesaurus object from the wn_s.pl file...
>>> t = Thesaurus.from_filename("wn_s.pl")
To save the in-memory Thesaurus to a Whoosh index...
>>> from whoosh.filedb.filestore import FileStorage
>>> fs = FileStorage("index")
>>> t.to_storage(fs)
To load a Thesaurus object from a Whoosh index...
>>> t = Thesaurus.from_storage(fs)
The Thesaurus object is thus usable in two ways:
* Parse the wn_s.pl file into memory (Thesaurus.from_*) and then look up
synonyms in memory. This has a startup cost for parsing the file, and uses
quite a bit of memory to store two large dictionaries; however, synonym
look-ups are very fast.
* Parse the wn_s.pl file into memory (Thesaurus.from_filename) then save it to
an index (to_storage). From then on, open the thesaurus from the saved
index (Thesaurus.from_storage). Storing the index has a large one-time cost,
but after that, opening the Thesaurus is faster than re-parsing the file,
although synonym look-ups are slightly slower.
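As a rough illustration of the in-memory approach, the following sketch parses
wn_s.pl-style lines into two dictionaries and looks synonyms up through them.
It is not this class's implementation: the names parse_synonym_file and
lookup_synonyms are hypothetical, and the sample lines use invented synset IDs
rather than real WordNet data.

```python
import io
import re
from collections import defaultdict

# Matches wn_s.pl-style lines such as: s(900000001,1,'hail',v,1,0).
S_LINE = re.compile(r"s\((\d+),\d+,'(.*)',[a-z],\d+,\d+\)\.")


def parse_synonym_file(fileobj):
    """Build word -> synset IDs and synset ID -> words dictionaries."""
    word_to_synsets = defaultdict(list)
    synset_to_words = defaultdict(list)
    for line in fileobj:
        match = S_LINE.match(line.strip())
        if match is None:
            continue  # skip anything that is not an s(...) clause
        synset_id, word = match.group(1), match.group(2)
        # Prolog escapes a single quote inside an atom by doubling it.
        word = word.replace("''", "'").lower()
        word_to_synsets[word].append(synset_id)
        synset_to_words[synset_id].append(word)
    return word_to_synsets, synset_to_words


def lookup_synonyms(word, word_to_synsets, synset_to_words):
    """Return the sorted words that share a synset with the given word."""
    word = word.lower()
    found = set()
    for synset_id in word_to_synsets.get(word, ()):
        found.update(synset_to_words[synset_id])
    found.discard(word)
    return sorted(found)


# Invented sample data (assumption): the synset IDs below are not real.
SAMPLE = """\
s(900000001,1,'hail',v,1,0).
s(900000001,2,'acclaim',v,1,0).
s(900000001,3,'herald',v,2,0).
s(900000002,1,'hail',v,2,0).
s(900000002,2,'come',v,3,0).
"""

w2s, s2w = parse_synonym_file(io.StringIO(SAMPLE))
print(lookup_synonyms("hail", w2s, s2w))  # ['acclaim', 'come', 'herald']
```

Keeping both mappings in memory is what makes look-ups fast here, at the cost
of holding every word and synset ID in the two dictionaries.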
Here are timings for various tasks on my (fast) Windows machine, which might
give an idea of relative costs for in-memory vs. on-disk.
================================================ ================
Task                                             Approx. time (s)
================================================ ================
Look up synonyms for "light" (in memory)         0.0011
Look up synonyms for "light" (loaded from disk)  0.0028
================================================ ================
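Timings of this kind can be measured with the standard timeit module. A minimal
sketch, assuming a plain dictionary as a stand-in for a loaded thesaurus (the
time_lookup helper and the sample entry are hypothetical):

```python
import timeit


def time_lookup(lookup, word, repeat=5, number=1000):
    """Return the best average time in seconds for one lookup(word) call."""
    runs = timeit.repeat(lambda: lookup(word), repeat=repeat, number=number)
    return min(runs) / number


# Hypothetical stand-in data; a real measurement would pass the synonyms
# method of a loaded Thesaurus object instead of this dictionary look-up.
stand_in = {"light": ["example-synonym"]}
seconds = time_lookup(lambda w: stand_in.get(w, []), "light")
print("%.7f s per lookup" % seconds)
```

Taking the minimum over several repeats reduces noise from other processes, so
the in-memory and on-disk figures are easier to compare.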
Basically, if you can afford spending the memory necessary to parse the
Thesaurus and then cache it, it's faster. Otherwise, use an on-disk index.
def from_file(cls, fileobj):
"""Creates a Thesaurus object from the given file-like object, which should
contain the WordNet wn_s.pl file.
>>> f = open("wn_s.pl")
>>> t = Thesaurus.from_file(f)
>>> t.synonyms("hail")
def from_filename(cls, filename):
"""Creates a Thesaurus object from the file with the given filename, which
should be the WordNet wn_s.pl file.
>>> t = Thesaurus.from_filename("wn_s.pl")
>>> t.synonyms("hail")
['acclaim', 'come', 'herald']
def from_storage(cls, storage, indexname="THES"):
"""Creates a Thesaurus object from the given storage object,
which should contain an index created by Thesaurus.to_storage().
>>> from whoosh.filedb.filestore import FileStorage
>>> fs = FileStorage("index")
>>> t = Thesaurus.from_storage(fs)
>>> t.synonyms("hail")
['acclaim', 'come', 'herald']
:param storage: A :class:`whoosh.store.Storage` object from
which to load the index.
:param indexname: A name for the index. This allows you to
def to_storage(self, storage, indexname="THES"):
"""Creates an index in the given storage object from the
synonyms loaded from a WordNet file.
>>> from whoosh.filedb.filestore import FileStorage
>>> fs = FileStorage("index")
>>> t = Thesaurus.from_filename("wn_s.pl")
>>> t.to_storage(fs)
:param storage: A :class:`whoosh.store.Storage` object in
which to save the index.
:param indexname: A name for the index. This allows you to
def synonyms(self, word):
"""Returns a list of synonyms for the given word.
>>> thesaurus.synonyms("hail")
['acclaim', 'come', 'herald']