<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<title>tsearch-v2-intro</title>
<h2>Tsearch2 - Introduction</h2>
<p>An <a href="
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html">
[Online version]</a> of this document is available.</p>
<p>The tsearch2 module is available to add as an extension to the
PostgreSQL database to allow for Full Text Indexing. This document
is an introduction to installing, configuring, using and
maintaining the database with the tsearch2 module activated.</p>
<p>Please note that the tsearch2 module is fully incompatible with the old
tsearch module, which is deprecated in 7.4 and will be obsoleted in
8.0.</p>
<h3>USING TSEARCH2 AND POSTGRESQL FOR A WEB BASED SEARCH
ENGINE</h3>
<p>This documentation is provided as a short guide on how to
quickly get up and running with tsearch2 and PostgreSQL, for those
who want to implement a full text indexed based search engine. It
is not meant to be a complete in-depth guide into the full ins and
outs of the contrib/tsearch2 module, and is primarily aimed at
beginners who want to speed up searching of large text fields, or
those migrating from other database systems such as MS-SQL.</p>
<p>The README.tsearch2 file included in the contrib/tsearch2
directory contains a brief overview and history behind tsearch.
This can also be found online <a href="
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/">[right
here]</a>.</p>
<p>Further in-depth documentation, such as a full function
reference and user guide, can be found online at the <a href="
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/">[tsearch
documentation home]</a>.</p>
<h3>ACKNOWLEDGEMENTS</h3>
<p>Robert John Shepherd originally wrote this documentation for the
previous version of the tsearch module (v1) included with the postgres
release. I took his documentation and updated it to comply with the
tsearch2 modifications.</p>
<p>Robert's original acknowledgements:</p>
<p>"Thanks to Oleg Bartunov for taking the time to answer many of
my questions regarding this module, and also to Teodor Sigaev for
clearing up the process of making your own dictionaries. Plus of
course a big thanks to the pair of them for writing this module in
the first place!"</p>
<p>I would also like to extend my thanks to the developers, and
Oleg Bartunov for all of his direction and help with the new
features of tsearch2.</p>
<h3>OVERVIEW</h3>
<p>MS-SQL provides a full text indexing (FTI) system which enables
the fast searching of text based fields, very useful for websites
(and other applications) that require a results set based on key
words. PostgreSQL ships with a contributed module called tsearch2,
which implements a special type of index that can also be used for
full text indexing. Furthermore, unlike MS' offering, which
requires regular incremental rebuilds of the text indexes
themselves, tsearch2 indexes are always up-to-date and keeping them
so induces very little overhead.</p>
<p>Before we get into the details, it is recommended that you have
installed and tested PostgreSQL, are reasonably familiar with
databases and the SQL query language, and also understand the basics of
connecting to PostgreSQL from the local shell. This document isn't
intended for the complete PostgreSQL newbie, but anyone with a
reasonable grasp of the basics should be able to follow it.</p>
<p>Starting with PostgreSQL version 7.4, tsearch2 is now included in
the contrib directory with the PostgreSQL sources. contrib/tsearch2
is where you will find everything needed to install and use
tsearch2. Please note that tsearch2 will also work with PostgreSQL
version 7.3.x, but it is not the module included with that source
distribution. You will have to download the module separately and
install it in the same fashion.</p>
<h3>INSTALLATION NOTES</h3>
<p>I installed the tsearch2 module to a PostgreSQL 7.3 database
from the contrib directory without squashing the original (old)
tsearch module. What I did was move the module's tsearch src
directory into the contrib tree under the name tsearchV2.</p>
<p>Step one is to download the tsearch V2 module :</p>
<p><a href="
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/">[http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/]</a>
(check Development History for latest stable version !)</p>
<pre>
tar -zxvf tsearch-v2.tar.gz
mv tsearch2 $PGSQL_SRC/contrib/
cd $PGSQL_SRC/contrib/tsearch2
</pre>
<p>If you are installing from PostgreSQL version 7.4 or higher, you
can skip those steps and just change to the contrib/tsearch2
directory in the source tree and continue from there.</p>
<p>As of May 9, 2004 there is a source patch available for
tsearch2. The patch provides changes to the pg_ts_ configuration
tables to allow for easy dump and restore of a database containing
tsearch2. The patch is available here : <a href="
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/regprocedure_7.4.patch.gz">
[http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/regprocedure_7.4.patch.gz]</a></p>
<p>To apply this patch, download the mentioned file and place it in
your PostgreSQL source tree ($PGSQL_SRC). This patch is not
required for tsearch2 to work. I would, however, highly recommend it
as it makes the backup and restore procedures very simple.</p>
<pre>
gunzip regprocedure_7.4.patch.gz
patch -b -p1 < regprocedure_7.4.patch
</pre>
<p>If you have a working version of tsearch2 in your database, you
do not need to re-install the tsearch2 module. Just apply the patch
and run make. This patch only affects the tsearch2.sql file. You
can run the SQL script found <a href="
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/regprocedure_update.sql">
[right here]</a>. This script will make the modifications found in
the patch, and update the fields from the existing data. From this
point on, you can dump and restore the database in a normal
fashion. Without this patch, you must follow the instructions later
in this document for backup and restore.</p>
<p>This patch is only needed for tsearch2 in PostgreSQL versions
7.3.x and 7.4.x. The patch has been applied to the sources for
version 8.0.</p>
<p>When you have your source tree for tsearch2 ready, you can
continue with the regular building and installation process:</p>
<pre>
gmake
gmake install
gmake installcheck
</pre>
<p>That is pretty much all you have to do, unless of course you get
errors. However, if you get those, you had better go check with the
mailing lists over at <a href="
http://www.postgresql.org">http://www.postgresql.org</a> or <a href="
http://openfts.sourceforge.net/">http://openfts.sourceforge.net/</a>,
since it has never failed for me.</p>
<p>If you ever need to revert this patch and go back to the
unpatched version of tsearch2, it is simple if you followed the
above patch command. The -b option creates a backup of the original
file, so we can just copy it back.</p>
<pre>
cd $PGSQL_SRC/contrib/tsearch2
cp tsearch.sql.in.orig tsearch.sql.in
</pre>
<p>If you need the patched version again, just follow the patch
instructions again.</p>
<p>The directory in contrib/ and the directory from the archive
are both called tsearch2. Tsearch2 is completely incompatible with the
previous version of tsearch. This means that both versions can be
installed into a single database, and migration to the new version may
be done gradually.</p>
<p>NOTE: the previous version of tsearch found in the
contrib/tsearch directory is deprecated. Although it is still
available and included within PostgreSQL version 7.4, it will be
removed in version 8.0.</p>
<h3>ADDING TSEARCH2 FUNCTIONALITY TO A DATABASE</h3>
<p>We should create a database to use as an example for the
remainder of this file. We can call the database "ftstest". You can
create it from the command line like this:</p>
<pre>
createdb ftstest
</pre>
<p>If you thought installation was easy, this next bit is even
easier. Change to the PGSQL_SRC/contrib/tsearch2 directory and
type:</p>
<pre>
psql ftstest < tsearch2.sql
</pre>
<p>The file "tsearch2.sql" holds all the wonderful little goodies
you need to do full text indexing. It defines numerous functions
and operators, and creates the needed tables in the database. There
will be 4 new tables created after running the tsearch2.sql file :
pg_ts_dict, pg_ts_parser, pg_ts_cfg, pg_ts_cfgmap.</p>
<p>You can check out the tables if you like:</p>
<pre>
ftstest=# \d
        List of relations
 Schema |     Name     | Type  |  Owner
--------+--------------+-------+----------
 public | pg_ts_cfg    | table | kopciuch
 public | pg_ts_cfgmap | table | kopciuch
 public | pg_ts_dict   | table | kopciuch
 public | pg_ts_parser | table | kopciuch
(4 rows)
</pre>
<p>You may need to grant permissions on the pg_ts_dict, pg_ts_parser,
pg_ts_cfg, and pg_ts_cfgmap tables to let non-superusers work with
tsearch2. GRANT SELECT should be enough for search-only access.</p>
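<p>For example, a minimal grant for a read-only role might look like the
following sketch (the role name "webuser" is made up for illustration; use
whatever role your application connects as):</p>
<pre>
-- "webuser" is a hypothetical role used only for this example
GRANT SELECT ON pg_ts_cfg, pg_ts_cfgmap, pg_ts_dict, pg_ts_parser TO webuser;
</pre>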
<h3>TYPES AND FUNCTIONS PROVIDED BY TSEARCH2</h3>
<p>The first thing we can do is try out some of the types that are
provided for us. Let's look at the tsvector type provided for
us:</p>
<pre>
SELECT 'Our first string used today'::tsvector;
               tsvector
---------------------------------------
 'Our' 'used' 'first' 'today' 'string'
(1 row)
</pre>
<p>The results are the words used within our string. Notice they
are not in any particular order. The tsvector type returns a string
of space separated words.</p>
<pre>
SELECT 'Our first string used today first string'::tsvector;
                   tsvector
-----------------------------------------------
 'Our' 'used' 'first' 'today' 'string'
(1 row)
</pre>
<p>Notice the result string contains each unique word only once ('first' and
'string' only appear once in the tsvector value). Which of course
makes sense if you are searching the full text ... you only need to
know each unique word in the text.</p>
<p>Those examples were just casting a text field to that of type
tsvector. Let's check out one of the new functions created by the
tsearch2 module.</p>
<p>The function to_tsvector has 3 possible signatures:</p>
<pre>
to_tsvector(oid, text);
to_tsvector(text, text);
to_tsvector(text);
</pre>
<p>We will use the second method using two text fields. The
overloaded methods provide us with a way to specify the way the
searchable text is broken up into words (the stemming process). Right
now we will specify the 'default' configuration. See the section on
TSEARCH2 CONFIGURATION to learn more about this.</p>
<pre>
SELECT to_tsvector('default',
'Our first string used today first string');
                 to_tsvector
--------------------------------------------
 'use':4 'first':2,6 'today':5 'string':3,7
(1 row)
</pre>
<p>The result returned from this function is of type tsvector. The
results came about by this reasoning: all of the words in the text
passed in are stemmed, or not used because they are stop words
defined in our configuration. Each lower case morphed word is
returned with all of its positions in the text.</p>
<p>In this case the word "Our" is a stop word in the default
configuration. That means it will not be included in the result.
The word "first" is found at positions 2 and 6 (although "Our" is a
stop word, its position is maintained). The word positioning is
maintained exactly as in the original string. The word "used" is
morphed to the word "use" based on the default configuration for
word stemming, and is found at position 4. The rest of the results
follow the same logic. Just a reminder again ... the 'word' positions
in the output do not appear in any kind of order (i.e.
'use':4 appears first).</p>
<p>If you want to view the output of the tsvector fields without
their positions, you can do so with the function
"strip(tsvector)".</p>
<pre>
SELECT strip(to_tsvector('default',
'Our first string used today first string'));
             strip
--------------------------------
 'use' 'first' 'today' 'string'
(1 row)
</pre>
<p>If you wish to know the number of unique words returned in the
tsvector you can do so by using the function "length(tsvector)":</p>
<pre>
SELECT length(to_tsvector('default',
'Our first string used today first string'));
 length
--------
      4
(1 row)
</pre>
<p>Let's take a look at the function to_tsquery. It also has 3
signatures which follow the same rationale as the to_tsvector
function:</p>
<pre>
to_tsquery(oid, text);
to_tsquery(text, text);
to_tsquery(text);
</pre>
<p>Let's try using the function with a single word :</p>
<pre>
SELECT to_tsquery('default', 'word');
 to_tsquery
------------
 'word'
(1 row)
</pre>
<p>I call the function the same way I would a to_tsvector function,
specifying the 'default' configuration for morphing, and the result
is the stemmed output 'word'.</p>
<p>Let's attempt to use the function with a string of multiple
words:</p>
<pre>
SELECT to_tsquery('default', 'this is many words');
ERROR:  Syntax error
</pre>
<p>The function can not accept a space separated string. The
intention of the to_tsquery function is to return a value of type
"tsquery" used for searching a tsvector field. What we need to do
is search for one to many words with some kind of logic (for now,
a simple boolean OR):</p>
<pre>
SELECT to_tsquery('default', 'searching|sentence');
      to_tsquery
----------------------
 'search' | 'sentenc'
(1 row)
</pre>
<p>Notice that the words are separated by the boolean logic "OR";
the text could contain the boolean operators &, |, ! and () with their
normal meanings.</p>
<p>You can not use words defined as being a stop word in your
configuration. The function will not fail ... you will just get no
result, and a NOTICE like this:</p>
<pre>
SELECT to_tsquery('default', 'a|is&not|!the');
NOTICE:  Query contains only stopword(s)
or doesn't contain lexem(s), ignored
</pre>
<p>That is a beginning to using the types and functions defined in
the tsearch2 module. There are numerous other functions that I have
not touched on. You can read through the tsearch2.sql file built
when compiling to get more familiar with what is included.</p>
<h3>INDEXING FIELDS IN A TABLE</h3>
<p>The next stage is to add a full text index to an existing table.
In this example we already have a table defined as follows:</p>
<pre>
CREATE TABLE tblMessages
(
    intIndex   int4,
    strTopic   varchar(100),
    strMessage varchar(2000)
);
</pre>
<p>We are assuming there are several rows with some kind of data in
them. Any data will do; just do several inserts with test strings
for a topic and a message. Here is some test data I inserted (yes,
I know it's completely useless stuff ;-) but it will serve our
purpose right now):</p>
<pre>
INSERT INTO tblMessages
   VALUES ('1', 'Testing Topic', 'Testing message data input');
INSERT INTO tblMessages
   VALUES ('2', 'Movie', 'Breakfast at Tiffany\'s');
INSERT INTO tblMessages
   VALUES ('3', 'Famous Author', 'Stephen King');
INSERT INTO tblMessages
   VALUES ('4', 'Political Topic',
           'Nelson Mandella is released from prison');
INSERT INTO tblMessages
   VALUES ('5', 'Nursery rhyme phrase',
           'Little jack horner sat in a corner');
INSERT INTO tblMessages
   VALUES ('6', 'Gettysburg address quotation',
           'Four score and seven years ago'
           ' our fathers brought forth on this'
           ' continent a new nation, conceived in'
           ' liberty and dedicated to the proposition'
           ' that all men are created equal');
INSERT INTO tblMessages
   VALUES ('7', 'Classic Rock Bands',
           'Led Zeppelin Grateful Dead and The Sex Pistols');
INSERT INTO tblMessages
   VALUES ('8', 'My birth address',
           '18 Sommervile road, Regina, Saskatchewan');
INSERT INTO tblMessages
   VALUES ('9', 'Joke', 'knock knock : who\'s there?'
           ' I will not finish this joke');
INSERT INTO tblMessages
   VALUES ('10', 'Computer information',
           'My computer is a pentium III 400 mHz'
           ' with 192 megabytes of RAM');
</pre>
<p>The next stage is to create a special text index which we will
use for FTI, so we can search our table of messages for words or a
phrase. We do this using the SQL command:</p>
<pre>
ALTER TABLE tblMessages ADD COLUMN idxFTI tsvector;
</pre>
<p>Note that unlike traditional indexes, this is actually a new
field in the same table, which is then used (through the magic of
the tsearch2 operators and functions) by a special index we will
create in a moment.</p>
<p>The general rule for the initial insertion of data will follow
these four steps:</p>
<pre>
1. insert the data into the table
2. vacuum full analyze
3. create the index
4. vacuum full analyze
</pre>
<p>The data can be updated into the table, and the vacuum full analyze
will reclaim unused space. The index can be created on the table
after the data has been inserted. Having the index created prior to
the update will slow down the process. It can be done in that
manner; this way is just more efficient. After the index has been
created on the table, vacuum full analyze is run again to update
postgres's statistics (ie having the index take effect).</p>
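<p>As a condensed sketch, the four steps for the tblMessages example amount
to the following sequence (each statement is explained individually in the
rest of this section):</p>
<pre>
-- 1. populate the tsvector column
UPDATE tblMessages SET idxFTI=to_tsvector('default', strMessage);
-- 2. reclaim space and refresh statistics
VACUUM FULL ANALYZE;
-- 3. build the GiST index
CREATE INDEX idxFTI_idx ON tblMessages USING gist(idxFTI);
-- 4. refresh statistics so the planner uses the new index
VACUUM FULL ANALYZE;
</pre>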
<pre>
UPDATE tblMessages SET idxFTI=to_tsvector('default', strMessage);
</pre>
<p>Note that this only inserts the field strMessage as a tsvector,
so if you want to also add strTopic to the information stored, you
should instead do the following, which effectively concatenates the
two fields into one before being inserted into the table:</p>
<pre>
UPDATE tblMessages
   SET idxFTI=to_tsvector('default',coalesce(strTopic,'') ||' '|| coalesce(strMessage,''));
</pre>
<p><strong>Using the coalesce function makes sure this
concatenation also works with NULL fields.</strong></p>
<p>We need to create the index on the column idxFTI. Keep in mind
that the database will update the index when some action is taken.
In this case we _need_ the index (the whole point of Full Text
INDEXING ;-)), so don't worry about any indexing overhead. We will
create an index based on the gist function. GiST stands for
Generalized Search Tree.</p>
<pre>
CREATE INDEX idxFTI_idx ON tblMessages USING gist(idxFTI);
</pre>
<p>After you have converted all of your data and indexed the
column, you can select some rows to see what actually happened. I
will not display output here but you can play around yourselves and
see what happened.</p>
<p>The last thing to do is set up a trigger so every time a row in
this table is changed, the text index is automatically updated.
This is easily done using:</p>
<pre>
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
FOR EACH ROW EXECUTE PROCEDURE tsearch2(idxFTI, strMessage);
</pre>
<p>Or if you are indexing both strMessage and strTopic you should
instead use:</p>
<pre>
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
FOR EACH ROW EXECUTE PROCEDURE
tsearch2(idxFTI, strTopic, strMessage);
</pre>
<p>Before you ask, the tsearch2 function accepts multiple fields as
arguments so there is no need to concatenate the two into one like
we did before.</p>
<p>If you want to do something specific with columns, you may write
your very own trigger function using plpgsql or other procedural
languages (but not SQL, unfortunately) and use it instead of the
<em>tsearch2</em> trigger.</p>
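<p>A minimal sketch of such a custom trigger function follows, assuming the
same tblMessages columns as above (the function name fn_update_idxfti is
made up for illustration, and the body simply replicates the concatenation
behaviour shown earlier):</p>
<pre>
-- Hypothetical example of a hand-written plpgsql trigger function.
-- (Drop the existing tsvectorupdate trigger first if it is present.)
CREATE FUNCTION fn_update_idxfti() RETURNS trigger AS '
BEGIN
    NEW.idxFTI := to_tsvector(''default'',
        coalesce(NEW.strTopic,'''') || '' '' ||
        coalesce(NEW.strMessage,''''));
    RETURN NEW;
END;
' LANGUAGE plpgsql;

CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
FOR EACH ROW EXECUTE PROCEDURE fn_update_idxfti();
</pre>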
<p>You could, however, call other stored procedures from within the
tsearch2 function. Let's say we want to create a function to remove
certain characters (like the @ symbol) from all text.</p>
<pre>
CREATE FUNCTION dropatsymbol(text)
RETURNS text AS 'select replace($1, \'@\', \' \');' LANGUAGE SQL;
</pre>
<p>Now we can use this function within the tsearch2 function on the
trigger:</p>
<pre>
DROP TRIGGER tsvectorupdate ON tblmessages;
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
FOR EACH ROW EXECUTE PROCEDURE tsearch2(idxFTI, dropatsymbol, strMessage);
INSERT INTO tblmessages VALUES (69, 'Attempt for dropatsymbol', 'Test@test.com');
</pre>
<p>If at this point you receive an error stating: ERROR: Can't find
tsearch config by locale</p>
<p>Do not worry. You have done nothing wrong, and tsearch2 is not
broken. All that has happened here is that the configuration is
set up to use a configuration based on the locale of the server. All
you have to do is change your default configuration, or add a new
one for your specific locale. See the section on TSEARCH2
CONFIGURATION.</p>
<pre>
SELECT * FROM tblmessages WHERE intindex = 69;
 intindex |         strtopic         |  strmessage   |        idxfti
----------+--------------------------+---------------+-----------------------
       69 | Attempt for dropatsymbol | Test@test.com | 'test':1 'test.com':2
(1 row)
</pre>
<p>Notice that the string content was passed through the stored
procedure dropatsymbol. The '@' character was replaced with a
single space ... and the output from the procedure was then stored
in the tsvector column.</p>
<p>This could be useful for removing other characters from indexed
text, or any kind of preprocessing needed to be done on the text
prior to insertion into the index.</p>
<h3>QUERYING A TABLE</h3>
<p>There are some examples in the README.tsearch2 file for querying
a table. One major difference between tsearch and tsearch2 is that the
operator ## is no longer available. Only the operator @@ is
defined, using the type tsvector on one side and tsquery on the
other.</p>
<p>Let's search the indexed data for the word "Test". I indexed
based on the concatenation of the strTopic and the
strMessage fields:</p>
<pre>
SELECT intindex, strtopic FROM tblmessages
   WHERE idxfti @@ 'test'::tsquery;
 intindex |   strtopic
----------+---------------
        1 | Testing Topic
(1 row)
</pre>
<p>The only result that matched was the row with a topic "Testing
Topic". Notice that the word I searched for was all lowercase. Let's
see what happens when I query for uppercase "Test".</p>
<pre>
SELECT intindex, strtopic FROM tblmessages
   WHERE idxfti @@ 'Test'::tsquery;
 intindex | strtopic
----------+----------
(0 rows)
</pre>
<p>We get zero rows returned. The reason is that when the text
was inserted, it was morphed to my default configuration (because
of the call to to_tsvector in the UPDATE statement). If there had been
no morphing done, and the tsvector field(s) contained the word
'Test', a match would have been found.</p>
<p>Most likely the best way to query the field is to use the
to_tsquery function on the right hand side of the @@ operator, like
this:</p>
<pre>
SELECT intindex, strtopic FROM tblmessages
   WHERE idxfti @@ to_tsquery('default', 'Test | Zeppelin');
 intindex |      strtopic
----------+--------------------
        1 | Testing Topic
        7 | Classic Rock Bands
(2 rows)
</pre>
<p>That query searched for all instances of "Test" OR "Zeppelin".
It returned two rows: the "Testing Topic" row, and the "Classic
Rock Bands" row. The to_tsquery function performed the correct
morphology upon the parameters, and searched the tsvector field
accordingly.</p>
<p>The last example here relates to searching for a phrase, for
example "minority report". This poses a problem with regard to
tsearch2, as it doesn't index phrases, only words. But there is a
way around which doesn't appear to have a significant impact on
query time, and that is to use a query such as the following:</p>
<pre>
SELECT intindex, strTopic FROM tblmessages
   WHERE idxfti @@ to_tsquery('default', 'gettysburg & address')
   AND strMessage ~* '.*men are created equal.*';
 intindex |           strtopic
----------+------------------------------
        6 | Gettysburg address quotation
(1 row)

SELECT intindex, strTopic FROM tblmessages
   WHERE idxfti @@ to_tsquery('default', 'gettysburg & address')
   AND strMessage ~* '.*something that does not exist.*';
 intindex | strtopic
----------+----------
(0 rows)
</pre>
<p>Of course if you're indexing both strTopic and strMessage, and
want to search for this phrase on both, then you will have to get
out the brackets and extend this query a little more.</p>
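<p>A sketch of that extended form, assuming the same table and phrase as
above, might look like this:</p>
<pre>
-- Sketch: match the phrase in either column, still narrowing
-- the candidate rows with the full text index first.
SELECT intindex, strTopic FROM tblmessages
   WHERE idxfti @@ to_tsquery('default', 'gettysburg & address')
   AND (strTopic ~* '.*men are created equal.*'
        OR strMessage ~* '.*men are created equal.*');
</pre>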
<h3>TSEARCH2 CONFIGURATION</h3>
<p>Some words such as "and", "the", and "who" are automatically not
indexed, since they belong to a pre-existing dictionary of "Stop
Words" which tsearch2 does not perform indexing on. If someone
needs to search for "The Who" in your database, they are going to
have a tough time coming up with any results, since both words are
ignored in the indexes. But there is a solution.</p>
<p>Let's say we want to add a word into the stop word list for
english stemming. We could edit the file
'/usr/local/pgsql/share/english.stop' and add a word to the list.
I edited mine to exclude my name from indexing:</p>
<pre>
- Edit /usr/local/pgsql/share/english.stop
- Add 'andy' to the list
</pre>
<p>When you connect to the database, the dict_init procedure is run
during initialization, and in my configuration it will read the
stop words from the file I just edited. If you were connected to
the DB while editing the stop words, you will need to end the
current session and re-connect. When you re-connect to the
database, 'andy' is no longer indexed:</p>
<pre>
SELECT to_tsvector('default', 'Andy');
 to_tsvector
-------------

(1 row)
</pre>
<p>Originally I would get the result :</p>
<pre>
SELECT to_tsvector('default', 'Andy');
 to_tsvector
-------------
 'andi':1
(1 row)
</pre>
<p>But since I added it as a stop word, it is now ignored during
indexing. The stop word added was used in the dictionary "en_stem".
If I were to use a different configuration such as 'simple', the
results would be different. There are no stop words for the simple
dictionary. It will just convert to lower case, and index every
unique word:</p>
<pre>
SELECT to_tsvector('simple', 'Andy andy The the in out');
             to_tsvector
-------------------------------------
 'in':5 'out':6 'the':3,4 'andy':1,2
(1 row)
</pre>
<p>All this talk about which configuration to use is leading us
into the actual configuration of tsearch2. In the examples in this
document the configuration has always been specified when using the
tsearch2 functions:</p>
<pre>
SELECT to_tsvector('default', 'Testing the default config');
SELECT to_tsvector('simple', 'Example of simple Config');
</pre>
<p>The pg_ts_cfg table holds each configuration you can use with
the tsearch2 functions. As you can see, the ts_name column contains
both the 'default' configuration, which is based on the 'C' locale, and the
'simple' configuration, which is not based on any locale.</p>
<pre>
SELECT * from pg_ts_cfg;
     ts_name     | prs_name |    locale
-----------------+----------+--------------
 default         | default  | C
 default_russian | default  | ru_RU.KOI8-R
 simple          | default  |
(3 rows)
</pre>
<p>Each row in the pg_ts_cfg table contains the name of the
tsearch2 configuration, the name of the parser to use, and the
locale mapped to the configuration. There is only one parser to
choose from in the table pg_ts_parser, called 'default'. More parsers
could be written, but for our needs we will use the default.</p>
<p>There are 3 configurations installed by tsearch2 initially. If
your locale is set to 'en_US' for example (like my laptop), then as
you can see there is currently no configuration mapped to
that locale. You can either set up a new configuration or just use
one that already exists. If I do not specify which configuration to
use in the to_tsvector function, I receive the following error:</p>
<pre>
SELECT to_tsvector('learning tsearch is like going to school');
ERROR:  Can't find tsearch config by locale
</pre>
<p>We will create a new configuration for use with the server
encoding 'en_US'. The first step is to add a new configuration into
the pg_ts_cfg table. We will call the configuration
'default_english', with the default parser, and use the locale
'en_US':</p>
<pre>
INSERT INTO pg_ts_cfg (ts_name, prs_name, locale)
   VALUES ('default_english', 'default', 'en_US');
</pre>
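<p>The new configuration also needs rows in pg_ts_cfgmap telling the parser
which dictionaries to use for each token type. One way to sketch this (an
assumption on my part, rather than the steps the rest of this section walks
through) is to simply copy the mappings of the existing 'default'
configuration:</p>
<pre>
-- Sketch: reuse the 'default' token-to-dictionary mappings
-- for the new 'default_english' configuration.
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
   SELECT 'default_english', tok_alias, dict_name
   FROM pg_ts_cfgmap
   WHERE ts_name = 'default';
</pre>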
<p>We have only declared that there is a configuration called
'default_english'. We need to set the configuration of how
'default_english' will work. The next step is creating a new
dictionary to use. The configuration of the dictionary is
completely different in tsearch2. In the prior version, to make
changes you would have to re-compile your changes into
tsearch.so. All of the configuration has now been moved into the
system tables created by executing the SQL code from
tsearch2.sql.</p>
<p>Let's take a first look at the pg_ts_dict table:</p>
<pre>
ftstest=# \d pg_ts_dict
           Table "public.pg_ts_dict"
     Column      |     Type     | Modifiers
-----------------+--------------+-----------
 dict_name       | text         | not null
 dict_init       | regprocedure |
 dict_initoption | text         |
 dict_lexize     | regprocedure | not null
 dict_comment    | text         |
Indexes: pg_ts_dict_pkey primary key btree (dict_name)
</pre>
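<p>To see which dictionaries your installation ships with before changing
anything, you can simply inspect the table (the exact rows depend on your
installation):</p>
<pre>
SELECT dict_name, dict_initoption FROM pg_ts_dict;
</pre>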
<p>The dict_name column is the name of the dictionary, for example
'simple', 'en_stem' or 'ru_stem'. The dict_init column is a text
representation of a stored procedure to run for initialization of
that dictionary, for example 'snb_en_init(text)' or
'snb_ru_init(text)'. The initial configuration of tsearch2 had the
dict_init and dict_lexize columns as type oid. The patch mentioned
in the Installation Notes changes these types to regprocedure. The
data inserted or updated can still be the oid of the stored
procedure; the representation is just different. This makes backup
and restore procedures much easier for tsearch2. The dict_initoption
column is used for options passed to the init function of the
stored procedure. In the cases of 'en_stem' or 'ru_stem' it is a
path to a stopword file for that dictionary, for example
'/usr/local/pgsql/share/english.stop'. This is, however, dictated by
the dictionary; ISpell dictionaries may require different options.
The dict_lexize column is another reference to a stored procedure, the
function used to lexize, for example 'snb_lexize(internal,
internal, integer)'. The dict_comment column is just a comment.</p>
<p>Next we will configure the use of a new dictionary based on
ISpell. We will assume you have ISpell installed on your machine
(in /usr/local/lib).</p>
<p>There has been some confusion in the past as to which files are
used from ISpell. ISpell operates using a hash file. This is a
binary file created by the ISpell command line utility "buildhash".
This utility accepts a file containing the words from the
dictionary, and the affixes file, and the output is the hash file.
The default installation of ISpell installs the english hash file
english.hash, which is the exact same file as american.hash. ISpell
uses this as the fallback dictionary.</p>
<p>This hash file is not what tsearch2 requires as the ISpell
interface. The file(s) needed are those used to create the hash.
Tsearch uses the dictionary words for morphology, so the word listing is
needed, not spellchecking. Regardless, these files are included in
the ISpell sources, and you can use them to integrate into
tsearch2. This is not complicated, but is not very obvious to begin
with. The tsearch2 ISpell interface needs only the listing of
dictionary words; it will parse and load those words, and use the
ISpell dictionary for lexem processing.</p>
<p>I found the ISpell make system to be very finicky. Their
documentation actually states this to be the case, so I just did
things the command line way. In the ISpell source tree under
languages/english there are several files. For a
complete description, please read the ISpell README. Basically for
the english dictionary there is the option to create the small,
medium, large and extra large dictionaries. The medium dictionary
is recommended. If the make system is configured correctly, it
would build and install the english.hash file from the medium size
dictionary. Since we are only concerned with the dictionary word
listing ... it can be created from the /languages/english directory
with the following command:</p>
<pre>
sort -u -t/ +0f -1 +0 -T /usr/tmp -o english.med english.0 english.1
</pre>
<p>This will create a file called english.med. You can copy this
file to wherever you like; I placed mine in /usr/local/lib so it
coincides with the ISpell hash files. You can now add the tsearch2
configuration entry for the ISpell English dictionary. We will also
continue to use the English stop word file that was installed for
the en_stem dictionary; you could use a different one if you like.
The ISpell configuration is based on the "ispell_template"
dictionary installed by default with tsearch2. We will use the OIDs
of the stored procedures from the row where dict_name =
'ispell_template'.</p>
INSERT INTO pg_ts_dict
    (SELECT 'en_ispell',
            dict_init,
            'DictFile="/usr/local/lib/english.med",'
            'AffFile="/usr/local/lib/english.aff",'
            'StopFile="/usr/local/pgsql/share/contrib/english.stop"',
            dict_lexize
     FROM pg_ts_dict
     WHERE dict_name = 'ispell_template');
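To confirm the new row was created and carries the intended init
options, you can query pg_ts_dict directly. This is a sketch,
assuming the column names of the stock tsearch2 schema:

```sql
-- Sketch: confirm the 'en_ispell' row exists alongside the template
-- row it was copied from, and inspect its init options.
SELECT dict_name, dict_initoption
FROM pg_ts_dict
WHERE dict_name IN ('en_ispell', 'ispell_template');
```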
<p>Now that we have a dictionary, we can specify its use in a query
to get a lexeme. For this we will use the lexize function, which
takes the name of the dictionary to use as an argument, just as the
other tsearch2 functions do. You will need to stop your psql
session and start it again in order for this modification to take
effect.</p>
SELECT lexize('en_ispell', 'program');
<p>If you wanted to always use the ISpell English dictionary you
have installed, you can configure tsearch2 to always use a specific
dictionary:</p>
SELECT set_curdict('en_ispell');
<p>Lexize is meant to turn a word into a lexeme. It is possible to
receive more than one lexeme for a single word.</p>
SELECT lexize('en_ispell', 'conditionally');
           lexize
-----------------------------
 {conditionally,conditional}
<p>The lexize function is not meant to take a full string as an
argument. If you pass in an entire sentence, it attempts to find
that entire sentence in the dictionary. Since the dictionary
contains only words, you will receive an empty result:</p>
SELECT lexize('en_ispell', 'This is a sentence to lexize');
<p>Likewise, if you lexize a word that is not in the dictionary,
you will receive an empty result. This makes sense because the word
"tsearch" is not in the English dictionary. You can create your own
additions to the dictionary if you like; this may be useful for
scientific or technical glossaries that need to be indexed.</p>
SELECT lexize('en_ispell', 'tsearch');
 lexize
--------

(1 row)
<p>This is not to say that "tsearch" will be ignored when adding
text to the tsvector index column. This will be explained in
greater detail with the table pg_ts_cfgmap.</p>
<p>Next we need to set up the configuration that maps dictionary
use to the lexeme parsings. This is done by altering the
pg_ts_cfgmap table. We will insert several rows, specifying that
the new dictionary we installed and configured should be used for
lexizing within tsearch2. There are several token types for which
we want to force the use of the ISpell dictionary.</p>
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
VALUES ('default_english', 'lhword', '{en_ispell,en_stem}');
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
VALUES ('default_english', 'lpart_hword', '{en_ispell,en_stem}');
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
VALUES ('default_english', 'lword', '{en_ispell,en_stem}');
<p>We have just inserted 3 records into the configuration mapping,
specifying that the token types "lhword", "lpart_hword" and "lword"
are to be stemmed using the 'en_ispell' dictionary we added into
pg_ts_dict, when using the configuration 'default_english' which
we added to pg_ts_cfg.</p>
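You can verify the mappings at any point by querying pg_ts_cfgmap
directly. A sketch, using the stock tsearch2 column names:

```sql
-- Sketch: show which dictionaries each token type maps to for the
-- 'default_english' configuration.
SELECT tok_alias, dict_name
FROM pg_ts_cfgmap
WHERE ts_name = 'default_english'
ORDER BY tok_alias;
```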
<p>There are several other token types used that we do not need to
process with the ISpell dictionary. For these we can simply insert
values using the 'simple' dictionary.</p>
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'url', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'host', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'sfloat', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'uri', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'int', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'float', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'email', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'word', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'hword', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'nlword', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'nlpart_hword', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'part_hword', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'nlhword', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'file', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'uint', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'version', '{simple}');
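Once all the rows are in, a quick sanity check can confirm that no
token type in the configuration was left without a dictionary
mapping. A sketch:

```sql
-- Sketch: list any token types in 'default_english' whose dictionary
-- mapping is still missing; this should return zero rows when the
-- configuration is complete.
SELECT tok_alias
FROM pg_ts_cfgmap
WHERE ts_name = 'default_english'
  AND dict_name IS NULL;
```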
<p>Our addition of a configuration for 'default_english' is now
complete. We have successfully created a new tsearch2
configuration, and at the same time we have set the new
configuration to be the default for the en_US locale.</p>
SELECT to_tsvector('default_english',
    'learning tsearch is like going to school');
                   to_tsvector
--------------------------------------------------
 'go':5 'like':4 'learn':1 'school':7 'tsearch':2

SELECT to_tsvector('learning tsearch is like going to school');
                   to_tsvector
--------------------------------------------------
 'go':5 'like':4 'learn':1 'school':7 'tsearch':2
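With the configuration working, an actual search combines a stored
tsvector column with to_tsquery via the @@ operator. A hedged
sketch, assuming a hypothetical table "docs" with a tsvector column
"idxfti" (both names are illustrative, not from this document):

```sql
-- Sketch: full text search against a hypothetical table "docs" whose
-- tsvector column "idxfti" was populated with to_tsvector().
SELECT title
FROM docs
WHERE idxfti @@ to_tsquery('default_english', 'learn & school');
```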
<p>Notice here that words like "tsearch" are still parsed and
indexed in the tsvector column. A lexeme is returned for the word
because in the configuration mapping table we specified that words
should first be looked up in the 'en_ispell' dictionary, with
'en_stem' as the fallback. Therefore a lexeme is not returned from
en_ispell, but one is returned from en_stem, and it is added to
the tsvector.</p>
SELECT to_tsvector('learning tsearch is like going to computer school');
                                to_tsvector
---------------------------------------------------------------------------
 'go':5 'like':4 'learn':1 'school':8 'compute':7 'tsearch':2 'computer':7
<p>Notice in this last example I added the word "computer" to the
text to be converted into a tsvector. Because we have set up our
default configuration to use the ISpell English dictionary, the
words are lexized, and "computer" returns 2 lexemes at the same
position: 'compute':7 and 'computer':7 are now both indexed for
the word.</p>
<p>You can create additional dictionary lists, or use the extra
large dictionary from ISpell. You can read through the ISpell
documents and source tree to make modifications as you see
fit.</p>
<p>If you already have a configuration set for the locale and you
are changing to your new dictionary configuration, you will have
to set the locale on the old configuration to NULL. If the old
configuration used the 'C' locale, we would do this:</p>
UPDATE pg_ts_cfg SET locale=NULL WHERE locale = 'C';
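After the update you can verify which configuration now owns which
locale. A sketch over the stock pg_ts_cfg columns:

```sql
-- Sketch: each locale should map to at most one configuration;
-- 'default_english' should now be the only row for your locale.
SELECT ts_name, prs_name, locale
FROM pg_ts_cfg
ORDER BY ts_name;
```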
<p>That about wraps up the configuration of tsearch2. There is
much more you can do with the tables provided; this was just an
introduction to get things working quickly.</p>
<h3>ADDING NEW DICTIONARIES TO TSEARCH2</h3>
<p>To aid in the addition of new dictionaries to the tsearch2
module, you can use another module in combination with tsearch2.
The gendict module is included in the tsearch2 distribution and is
available in the gendict/ subdirectory.</p>
<p>I will not go into detail about installation and instructions
on how to use gendict to its fullest extent right now. You can
read the README.gendict ... it has all of the instructions and
information you will need.</p>
<h3>BACKING UP AND RESTORING DATABASES THAT FEATURE TSEARCH2</h3>
<p><strong>Never rely on anyone else's instructions to back up and
restore a database system; always develop and understand your own
methodology, and test it numerous times before you need to do it
for real.</strong></p>
<p>The backup and restore procedure has changed over time, and
this is not meant to be the bible for tsearch2 backup and restore.
Please read all sections so you have a complete understanding of
some backup and restore issues, test your own procedures, and do
not rely on these instructions alone.</p>
<p>If you come across issues in your own procedures, please feel
free to bring the question up on the Open-FTS and PostgreSQL
mailing lists.</p>
<h3>ORIGINAL BACKUP PROCEDURES</h3>
<p>Originally, tsearch2 had problems when using the pg_dump and/or
pg_dumpall utilities. The problem lies in the original use of OIDs
for column types. Since OIDs are not consistent across pg_dumps,
when you reload the data values into the pg_ts_dict table, for
example, those OIDs no longer point to anything. You would then
end up trying to use a "broken" tsearch2 installation.</p>
<p>The solution was to back up a database using the tsearch2
module in small unique parts, and then load them in the correct
order. You would have to edit the schema and remove the tsearch2
stored procedure references from the SQL file. You would have to
load your global objects, then the tsearch2 objects. You had to
re-create the tsearch2 module before restoring your schema so no
conflicts would arise. Then you could restore your data (all
schemas and types needed for the data were now available).</p>
<p><strong>The original backup instructions were as
follows:</strong></p>
<p>1) Back up any global database objects such as users and groups
(this step is usually only necessary when you will be restoring to
a fresh PostgreSQL installation):</p>
pg_dumpall -g > GLOBALobjects.sql
<p>2) Back up the full database schema using pg_dump:</p>
pg_dump -s DATABASE > DATABASEschema.sql
<p>3) Back up the full database using pg_dump:</p>
pg_dump -Fc DATABASE > DATABASEdata.tar
<p><strong>The original restore procedures were as
follows:</strong></p>
<p>1) Create the blank database.</p>
<p>2) Restore any global database objects such as users and groups
(this step is usually only necessary when you will be restoring to
a fresh PostgreSQL installation):</p>
psql DATABASE < GLOBALobjects.sql
<p>3) Create the tsearch2 objects, functions and operators:</p>
psql DATABASE < tsearch2.sql
<p>4) Edit the backed-up database schema and delete all SQL
commands which create tsearch2-related functions, operators and
data types, BUT NOT fields in table definitions that use the
tsvector type. If you're not sure what these are, they are the
ones listed in tsearch2.sql. Then restore the edited schema to the
database:</p>
psql DATABASE < DATABASEschema.sql
<p>5) Restore the data for the database:</p>
pg_restore -N -a -d DATABASE DATABASEdata.tar
<p>If you get any errors in step 4, it will most likely be because
you forgot to remove an object that was created in tsearch2.sql.
Any errors in step 5 will mean the database schema was probably
restored incorrectly.</p>
<p><strong>Issues with this procedure</strong></p>
<p>As I mentioned before, it is vital that you test out your own
backup and restore procedures. These procedures were originally
adapted from this document's original author, Robert John
Shepherd. They make use of the pg_dump custom archive
functionality. I am not that familiar with the formatting output
of pg_dump and the use of pg_restore; I have always had the luxury
of using text files (everything is DATABASE.sql).</p>
<p>One issue not foreseen in the case of using a binary dump is
when you have added more than the default tsearch2
configurations. Upon reload, the data will fail to restore due to
duplicate primary keys. If you load the tsearch2 module and then
delete the data it loaded into the configuration tables, the data
will restore, but the configurations will be incorrect because you
cannot remove the data using OID references from the custom
archive.</p>
<p>It would be very simple to fix this problem if the data were
not in an archive format. I do believe all of your data would have
been restored properly and you could get things working fairly
easily: all one would have to do is create the configurations as
in the tsearch2.sql file, and then create your custom
configurations afterwards.</p>
<p>I have read in the pg_dump man page that if the tar archive
format is used, it is possible to limit which data is restored
using pg_restore. If anyone has more experience with pg_dump
archives and pg_restore, please feel free to test and contribute
your procedure(s).</p>
<h3>CURRENT BACKUP AND RESTORE PROCEDURES</h3>
<p>Currently, a backup and restore of a database using the
tsearch2 module can be quite simple if you have applied the patch
mentioned in the installation instructions prior to tsearch2
installation. This patch removes the use of the OID column: the
text representation of the stored procedures is dumped with the
data, and the restoration of the data works seamlessly.</p>
<p>1) To back up the database:</p>
pg_dump DATABASE > DATABASE.sql
<p>2) To restore the database:</p>
psql -d DATABASE -f DATABASE.sql
<p>This procedure is now like any normal backup and restore
procedure. I cannot say whether this has been proven using the
pg_dump archive formats and restoring with pg_restore, but in
theory there should be no problems with any format after the patch
is applied.</p>
<p>This restoration procedure should not be an issue with the
patch applied; as of version 8.0 of PostgreSQL the patch is part
of the release, and only versions 7.3 and 7.4 are affected. You
can avoid any trouble by applying the patch prior to installation,
or by running the provided SQL script against the live database
before backup and restore is done.</p>