Text indexes combine an inverted index and a lexicon to support text
indexing and searching. A text index can be created without passing
any arguments:

>>> from zope.index.text.textindex import TextIndex
>>> index = TextIndex()

By default, it uses an "Okapi" inverted index and a lexicon with a
pipeline consisting of a simple word splitter, a case normalizer,
and a stop-word remover.
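
The three pipeline stages named above can be pictured with a minimal pure-Python sketch. It is not the zope.index implementation: the function names and the tiny `STOP_WORDS` set are hypothetical stand-ins chosen for illustration, and the real lexicon uses its own splitter and stop-word list.

```python
import re

# Hypothetical stop-word list for illustration only; the real lexicon
# ships its own, larger list.
STOP_WORDS = {"the", "and", "over", "a", "an", "to"}


def split_words(text):
    """Split text into runs of word characters (Unicode-aware)."""
    return re.findall(r"\w+", text)


def normalize_case(words):
    """Lower-case every word so that 'The' and 'the' compare equal."""
    return [w.lower() for w in words]


def remove_stop_words(words):
    """Drop words that carry little search value."""
    return [w for w in words if w not in STOP_WORDS]


def run_pipeline(text):
    """Feed the text through the splitter, then each later stage in order."""
    words = split_words(text)
    for stage in (normalize_case, remove_stop_words):
        words = stage(words)
    return words


terms = run_pipeline("The quick brown fox jumps over the lazy dog")
print(terms)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```

The order of the stages matters in this sketch: case is normalized before stop words are removed, so a capitalized "The" is still dropped.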

We index text using the `index_doc` method:

>>> index.index_doc(1, u"the quick brown fox jumps over the lazy dog")
>>> index.index_doc(2,
... u"the brown fox and the yellow fox don't need the retriever")
>>> index.index_doc(3, u"""
... The Conservation Pledge
... =======================
...
... I give my pledge, as an American, to save, and faithfully
... to defend from waste, the natural resources of my Country;
... it's soils, minerals, forests, waters and wildlife.
... """)
>>> index.index_doc(4, u"Fran\xe7ois")
>>> word = (
... u"\N{GREEK SMALL LETTER DELTA}"
... u"\N{GREEK SMALL LETTER EPSILON}"
... u"\N{GREEK SMALL LETTER LAMDA}"
... u"\N{GREEK SMALL LETTER TAU}"
... u"\N{GREEK SMALL LETTER ALPHA}"
... )
>>> index.index_doc(5, word + u"\N{EM DASH}\N{GREEK SMALL LETTER ALPHA}")
>>> index.index_doc(6, u"""
... What we have here, is a failure to communicate.
... """)
>>> index.index_doc(7, u"""
... Hold on to your butts!
... """)
>>> index.index_doc(8, u"""
... The Zen of Python, by Tim Peters
...
... Beautiful is better than ugly.
... Explicit is better than implicit.
... Simple is better than complex.
... Complex is better than complicated.
... Flat is better than nested.
... Sparse is better than dense.
... Readability counts.
... Special cases aren't special enough to break the rules.
... Although practicality beats purity.
... Errors should never pass silently.
... Unless explicitly silenced.
... In the face of ambiguity, refuse the temptation to guess.
... There should be one-- and preferably only one --obvious way to do it.
... Although that way may not be obvious at first unless you're Dutch.
... Now is better than never.
... Although never is often better than *right* now.
... If the implementation is hard to explain, it's a bad idea.
... If the implementation is easy to explain, it may be a good idea.
... Namespaces are one honking great idea -- let's do more of those!
... """)

Then we can search using the `apply` method, which takes a search
string:

>>> [(k, "%.4f" % v) for (k, v) in index.apply(u'brown fox').items()]
[(1, '0.6153'), (2, '0.6734')]

>>> [(k, "%.4f" % v) for (k, v) in index.apply(u'quick fox').items()]

>>> [(k, "%.4f" % v) for (k, v) in index.apply(u'brown python').items()]

>>> [(k, "%.4f" % v) for (k, v) in index.apply(u'dalmatian').items()]

>>> [(k, "%.4f" % v) for (k, v) in index.apply(u'brown or python').items()]
[(1, '0.2602'), (2, '0.2529'), (8, '0.0934')]

>>> [(k, "%.4f" % v) for (k, v) in index.apply(u'butts').items()]

The outputs are mappings from document ids to float scores. Items
with higher scores are more relevant.
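
Because the result is keyed by document id rather than ordered by relevance, callers usually sort it themselves. The sketch below uses a plain `dict` as a stand-in for the mapping returned by `apply`, with the rounded scores from the 'brown or python' example; the `rank` helper is a hypothetical name, not part of zope.index.

```python
# Stand-in for the mapping returned by index.apply(); the values are the
# rounded scores shown for the 'brown or python' query.
results = {1: 0.2602, 2: 0.2529, 8: 0.0934}


def rank(results, limit=None):
    """Return (docid, score) pairs, most relevant first."""
    ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)
    return ranked if limit is None else ranked[:limit]


best = rank(results, limit=2)
print(best)  # [(1, 0.2602), (2, 0.2529)]
```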

We can use Unicode characters in search strings.

>>> [(k, "%.4f" % v) for (k, v) in index.apply(u"Fran\xe7ois").items()]

>>> [(k, "%.4f" % v) for (k, v) in index.apply(word).items()]

We can use globbing in search strings.

>>> [(k, "%.3f" % v) for (k, v) in index.apply('fo*').items()]
[(1, '2.179'), (2, '2.651'), (3, '2.041')]
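
One way to picture a glob search is as an expansion of the pattern against the known vocabulary, followed by a union of the matching words' documents. The sketch below assumes a hypothetical `postings` dict (word to document ids) built by hand from the sample documents; it uses the standard `fnmatch` module and says nothing about how zope.index expands patterns or computes scores.

```python
import fnmatch

# Hypothetical word -> document-id postings, hand-built from a few words
# in the sample documents above.
postings = {
    "fox": {1, 2},
    "forests": {3},
    "brown": {1, 2},
    "dog": {1},
}


def expand_glob(pattern, vocabulary):
    """Return the vocabulary words matching the glob, in sorted order."""
    return sorted(w for w in vocabulary if fnmatch.fnmatchcase(w, pattern))


def docs_matching(pattern, postings):
    """Union the document ids of every word the pattern expands to."""
    docs = set()
    for word in expand_glob(pattern, postings):
        docs |= postings[word]
    return sorted(docs)


print(expand_glob("fo*", postings))    # ['forests', 'fox']
print(docs_matching("fo*", postings))  # [1, 2, 3]
```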

Text indexes support basic statistics:

>>> index.documentCount()
>>> index.wordCount()

If we index the same document twice, once with an empty value (here an
empty list) and then with a normal value, it should still work:

>>> index2 = TextIndex()
>>> index2.index_doc(1, [])
>>> index2.index_doc(1, ["Zorro"])
>>> [(k, "%.4f" % v) for (k, v) in index2.apply("Zorro").items()]

The first time we index a document, the `_totaldoclen` of the
underlying index is updated.

>>> index = TextIndex()
>>> index.index._totaldoclen()
>>> index.index_doc(100, u"a new funky value")
>>> index.index._totaldoclen()

If we index it a second time, the underlying index length should not
change.

>>> index.index_doc(100, u"a new funky value")
>>> index.index._totaldoclen()

But if we change it, the length changes too.

>>> index.index_doc(100, u"an even newer funky value")
>>> index.index._totaldoclen()

The same applies to `unindex_doc`: if an object that is not indexed
is unindexed, no index should change state.

>>> index.unindex_doc(100)
>>> index.index._totaldoclen()

>>> index.unindex_doc(100)
>>> index.index._totaldoclen()
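
The bookkeeping described in this section can be modelled with a small toy class. This is only a sketch of the invariant (reindexing the same content leaves the total unchanged, and unindexing an unknown id is a no-op); the `LengthTracker` class and its attributes are hypothetical and do not mirror the internals of the Okapi index.

```python
class LengthTracker:
    """Toy model of per-document lengths plus a running total."""

    def __init__(self):
        self._doclen = {}  # docid -> number of indexed words
        self._total = 0

    def index_doc(self, docid, words):
        # Subtract any previous length first, so reindexing identical
        # content leaves the total where it was.
        self._total -= self._doclen.get(docid, 0)
        self._doclen[docid] = len(words)
        self._total += len(words)

    def unindex_doc(self, docid):
        # Unknown ids contribute a length of 0, so nothing changes.
        self._total -= self._doclen.pop(docid, 0)

    def total_doc_length(self):
        return self._total


tracker = LengthTracker()
tracker.index_doc(100, ["funky", "value"])
tracker.index_doc(100, ["funky", "value"])  # same content, same total
print(tracker.total_doc_length())           # 2
tracker.index_doc(100, ["even", "funkier", "value"])
print(tracker.total_doc_length())           # 3
tracker.unindex_doc(100)
tracker.unindex_doc(100)                    # second call is a no-op
print(tracker.total_doc_length())           # 0
```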