1
==========================================================================
2
This module is deprecated in 7.4 version of PostgreSQL and will be
3
obsoleted in 8.0. Please, use new tsearch2 contrib module.
4
==========================================================================
6
Tsearch contrib module contains implementation of new data type txtidx -
7
a searchable data type (textual) with indexed access.
9
All work was done by Teodor Sigaev (teodor@stack.net) and Oleg Bartunov
15
Space usage and using CLUSTER command documented
17
Fix works with 'bad' queries
19
Use parser of OpenFTS v0.33.
23
This is a first step of our work on integration of OpenFTS
24
full text search engine (http://openfts.sourceforge.net/) into
25
PostgreSQL. It's based on our recent development of GiST
26
(Generalized Search Tree) for PostgreSQL 7.2 (see our GiST page
27
at http://www.sai.msu.su/~megera/postgres/gist/ for info about GiST)
28
and will works only for PostgreSQL version 7.2 and later.
30
We didn't try to implement a full-featured search engine with
31
stable interfaces but rather experiment with various approaches.
32
There are many issues remains (most of them just not documented or
33
implemented) but we'd like to present a working prototype
34
of full text search engine fully integrated into PostgreSQL to
35
collect user's feedback and recommendations.
49
psql DATABASE < tsearch.sql (from contrib/tsearch)
54
This module provides an implementation of a new data type 'txtidx' which is
55
a string of a space separated "words". "Words" with spaces
56
should be enclosed in apostrophes and apostrophes inside a "word" should be
59
This is quite different from OpenFTS approach which uses array
60
of integers (ID of lexems) and requires storing of lexem-id pairs in database.
61
One of the prominent benefit of this new approach is that it's possible now
62
to perform full text search in a 'natural' way.
71
insert into foo values ( 'the are' );
73
insert into foo values ( 'the\\ are' );
74
Words with apostrophe:
75
insert into foo values ( 'value\'s this' );
76
Complex word with apostrophe:
77
insert into foo values ( 'value\'s this we \'PostgreSQL site\'' );
79
select * from foo where titleidx @@ '\'PostgreSQL site\' | this';
80
select * from foo where titleidx @@ 'value\'s | this';
81
select * from foo where titleidx @@ '(the|this)&!we';
83
test=# select 'two words'::txtidx;
89
test=# select 'single\\ word'::txtidx;
98
The basic idea of this data type is to use it for full text search inside
99
database. If you have a 'text' column title and corresponding column
100
titleidx of type 'txtidx', which contains the same information from
101
text column, then search on title could be replaced by
102
searching on titleidx which would be fast because of indexed access.
104
As a real life example consider database with table 'titles' containing
105
titles of mailing list postings in column 'title':
107
create table titles (
112
Suppose, you already have a lot of titles and want to do full text search
115
First, you need to install contrib/tsearch module (see INSTALLATION and USAGE).
116
Add column 'titleidx' of type txtidx, containing space separated words from
117
title. It's possible to use function txt2txtidx(title) to fill 'titleidx'
118
column (see notice 1):
120
-- add titleidx column of type txtidx
121
alter table titles add titleidx txtidx;
122
update titles set titleidx=txt2txtidx(title);
124
Create index on titleidx:
125
create index t_idx on titles using gist(titleidx);
127
and now you can search all titles with words 'patch' and 'gist':
128
select title from titles where titleidx ## 'patch&gist';
130
Here, ## is a new operation defined for type 'txtidx' which could use index
131
(if exists) built on titleidx. This operator uses morphology to
133
## 'patches&gist' will find titles with 'patch' and 'gist' also.
134
If you want to provide query as is, use operator @@ instead:
135
select title from titles where titleidx @@ 'patch&gist';
136
but remember, that function txt2txtidx does uses morphology, so you need
137
to fill column 'titleidx' using some another way. We hope in future releases
138
provide more consistent and convenient interfaces.
140
Query could contains boolean operators &,|,!,() with their usual meaning,
141
for example: 'patch&gist&!cvs', 'patch|cvs'.
142
Each operation ( ##, @@ ) requires appropriate query type -
146
To see what query actually will be used :
148
test=# select 'patches&gist'::mquery_txt;
154
test=# select 'patches&gist'::query_txt;
160
Notice the difference !
162
You could use trigger to be sure column 'titleidx' is consistent
163
with any changes in column 'title':
165
create trigger txtidxupdate before update or insert on titles
166
for each row execute procedure tsearch(titleidx, title);
168
This trigger uses the same parser, dictionaries as function
169
txt2txtidx (see notice 1).
170
Current syntax allows creating trigger for several columns
171
you want to be searchable:
173
create trigger txtidxupdate before update or insert on titles
174
for each row execute procedure tsearch(titleidx, title1, title2,... );
176
Use function txtidxsize(titleidx) to get the number of "words" in column
177
titleidx. To get total number of words in table titles:
179
test=# select sum(txtidxsize(titleidx)) from titles;
188
function txt2txtidx and trigger use parser, dictionaries coming with
189
this contrib module on default. Parser is mostly the same as in OpenFTS and
190
dictionaries are simple stemmers (sort of Lovin's stemmer which uses a
191
longest match algorithm.) for english and russian languages. There is a perl
192
script makedict/makedict.pl, which could be used to create specific
193
dictionaries from files with endings and stop-words.
194
Example files for english and russian languages are available
195
from http://www.sai.msu.su/~megera/postgres/gist/tsearch/.
196
Run script without parameters to see information about arguments and options.
200
./makedict.pl -l LOCALNAME -e FILEENDINGS -s FILESTOPWORD \
201
-o ../dict/YOURDICT.dct
203
Another options of makedict.pl:
204
-f do not execute tolower for any char
205
-a function of checking stopword will be work after lemmatize,
208
You need to edit dict.h to use your dictionary and, probably,
209
morph.c to change mapdict array.
212
make clean; make; make install
215
txtidx doesn't preserve words ordering (this is not critical for searching)
216
for performance reason, for example:
218
test=# select 'page two'::txtidx;
225
Indexed access provided by txtidx data type isn't always good
226
because of internal data structure we use (RD-Tree). Particularly,
227
queries like '!gist' will be slower than just a sequential scan,
228
because for such queries RD-Tree doesn't provides selectivity on internal
229
nodes and all checks should be processed at leaf nodes, i.t. scan of
230
full index. You may play with function query_tree to see how effective
233
test=# select querytree( 'patch&gist'::query_txt );
239
This is an example of "good" query - index will effective for both words.
241
test=# select querytree( 'patch&!gist'::query_txt );
247
This means that index is effective only to search word 'patch' and resulted
248
rows will be checked against '!gist'.
250
test=# select querytree( 'patch|!gist'::query_txt );
256
test=# select querytree( '!gist'::query_txt );
262
These two queries will be processed by scanning of full index !
266
Following selects produce the same result
268
select title from titles where titleidx @@ 'patch&gist';
269
select title from titles where titleidx @@ 'patch' and titleidx @@ 'gist';
271
but the former will be more effective, because of internal optimization
277
Better configurability (as in OpenFTS)
278
User's interfaces to parser, dictionaries ...
284
We use test collection in our experiments which contains 377905
285
titles from various mailing lists stored in our mailware
288
All runs were performed on IBM ThinkPad T21 notebook with
289
PIII 733 Mhz, 256 RAM, 20 Gb HDD, Linux 2.2.19, postgresql 7.2.dev
290
We didn't do extensive benchmarking and all
291
numbers provide for illustration. Actual performance
292
is strongly depends on many factors (query, collection, dictionaries
295
Collection is available for download from
296
http://www.sai.msu.su/~megera/postgres/gist/tsearch/mw_titles.gz
297
(377905 titles from postgresql mailing lists, about 3Mb).
299
0. install contrib/tsearch module
301
2. psql test < tsearch.sql (from contrib/tsearch)
302
3. zcat mw_titles.gz | psql test
303
(it will creates table, copy test data and creates index)
305
Database contains one table:
309
Column | Type | Modifiers
310
----------+------------------------+-----------
311
title | character varying(256) |
315
Index was created as:
316
create index t_idx on titles using gist(titleidx);
317
(notice: this operation takes about 14 minutes on my notebook)
319
Typical select looks like:
320
select title from titles where titleidx @@ 'patch&gist';
322
Total number of lexems in collection : 1917182
324
1. We trust index - we consider index is exact and no
325
checking against tuples is necessary.
327
update pg_amop set amopreqcheck = false where amopclaid =
328
(select oid from pg_opclass where opcname = 'gist_txtidx_ops');
331
1: titleidx @@ 'patch&gist' 0.000u 0.000s 0m0.054s 0.00%
332
2: titleidx @@ 'patch&gist' 0.020u 0.000s 0m0.045s 44.82%
333
3: titleidx @@ 'patch&gist' 0.000u 0.000s 0m0.044s 0.00%
334
using gist indices (morph)
335
1: titleidx ## 'patch&gist' 0.000u 0.010s 0m0.046s 21.62%
336
2: titleidx ## 'patch&gist' 0.010u 0.010s 0m0.046s 43.47%
337
3: titleidx ## 'patch&gist' 0.000u 0.000s 0m0.046s 0.00%
339
1: titleidx @@ 'patch&gist' 0.000u 0.010s 0m1.601s 0.62%
340
2: titleidx @@ 'patch&gist' 0.000u 0.000s 0m1.607s 0.00%
341
3: titleidx @@ 'patch&gist' 0.010u 0.000s 0m1.607s 0.62%
343
1: title ~* 'gist' and title ~* 'patch' 0.010u 0.000s 0m9.206s 0.10%
344
2: title ~* 'gist' and title ~* 'patch' 0.000u 0.010s 0m9.205s 0.10%
345
3: title ~* 'gist' and title ~* 'patch' 0.010u 0.000s 0m9.208s 0.10%
347
2. Need to check results against tuples to avoid possible hash collision.
349
update pg_amop set amopreqcheck = true where amopclaid =
350
(select oid from pg_opclass where opcname = 'gist_txtidx_ops');
353
1: titleidx @@ 'patch&gist' 0.010u 0.000s 0m0.052s 19.26%
354
2: titleidx @@ 'patch&gist' 0.000u 0.000s 0m0.045s 0.00%
355
3: titleidx @@ 'patch&gist' 0.010u 0.000s 0m0.045s 22.39%
356
using gist indices (morph)
357
1: titleidx ## 'patch&gist' 0.000u 0.000s 0m0.046s 0.00%
358
2: titleidx ## 'patch&gist' 0.000u 0.010s 0m0.046s 21.75%
359
3: titleidx ## 'patch&gist' 0.020u 0.000s 0m0.047s 42.13%
361
There are no visible difference between these 2 cases but your
367
1. The size of txtidx column should be lesser than size of corresponding column.
368
Below some real numbers from test database (link above).
370
a) After loading data
372
-rw------- 1 postgres users 23191552 Aug 29 14:08 53016937
373
-rw------- 1 postgres users 81059840 Aug 29 14:08 52639027
375
Table titles (52639027) occupies 80Mb, index on txtidx column (53016937)
376
occupies 22Mb. Use contrib/oid2name to get mappings from oid to names.
379
test=# select title into titles_tmp from titles;
382
I got size of table 'titles' without txtidx field
384
-rw------- 1 postgres users 30105600 Aug 29 14:14 53016938
386
So, txtidx column itself occupies about 50Mb.
388
b) after running 'vacuum full analyze' I got:
390
-rw------- 1 postgres users 30105600 Aug 29 14:26 53016938
391
-rw------- 1 postgres users 36880384 Aug 29 14:26 53016937
392
-rw------- 1 postgres users 51494912 Aug 29 14:26 52639027
394
53016938 = titles_tmp
396
So, actual size of 'txtidx' field is 20 Mb ! "quod erat demonstrandum"
398
2. CLUSTER command is highly recommended if you need fast searching.
401
test=# cluster t_idx on titles;
403
BUT ! In 7.2 CLUSTER command forgets about other indices and permissions,
404
so you need be carefull and rebuild these indices and restore permissions
405
after clustering. Also, clustering isn't dynamic, so you'd need to
406
use CLUSTER from time to time. In 7.3 CLUSTER command should works
411
-rw------- 1 postgres users 23404544 Aug 29 14:59 53394850
412
-rw------- 1 postgres users 30105600 Aug 29 14:26 53016938
413
-rw------- 1 postgres users 50995200 Aug 29 14:45 53394845
414
pg@zen:/usr/local/pgsql/data/base/52638986$ oid2name -d test
415
All tables from database "test":
416
---------------------------------
419
53016938 = titles_tmp