<emphasis>Parsing documents into <firstterm>tokens</></emphasis>. It is
useful to identify various classes of tokens, e.g., numbers, words,
complex words, email addresses, so that they can be processed
differently. In principle token classes depend on the specific
application, but for most purposes it is adequate to use a predefined
The above are all simple text search examples. As mentioned before, full
text search functionality includes the ability to do many more things:
skip indexing certain words (stop words), process synonyms, and use
sophisticated parsing, e.g., parse based on more than just white space.
This functionality is controlled by <firstterm>text search
configurations</>. <productname>PostgreSQL</> comes with predefined
configurations for many languages, and you can easily create your own
Text search parsers and templates are built from low-level C functions;
therefore C programming ability is required to develop new ones, and
superuser privileges to install one into a database. (There are examples
of add-on parsers and templates in the <filename>contrib/</> area of the
<productname>PostgreSQL</> distribution.) Since dictionaries and
<title>Searching a Table</title>

It is possible to do a full text search without an index. A simple query
to print the <structname>title</> of each row that contains the word
<literal>friend</> in its <structfield>body</> field is:
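One way to write it (a sketch, assuming the <structname>pgweb</structname>
example table with <structfield>title</> and <structfield>body</> columns
used elsewhere in this chapter):

<programlisting>
SELECT title
FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');
</programlisting>

This will also find related words such as <literal>friends</> and
<literal>friendly</>, since all of them reduce to the same normalized
lexeme.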
WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create & table')
ORDER BY last_mod_date DESC
LIMIT 10;
</programlisting>
For clarity we omitted the <function>coalesce</function> function calls
which would be needed to find rows that contain <literal>NULL</literal>
in one of the two fields.
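With those calls added, the query could look like this (a sketch using the
same <structname>pgweb</structname> table):

<programlisting>
SELECT title
FROM pgweb
WHERE to_tsvector(coalesce(title,'') || ' ' || coalesce(body,''))
      @@ to_tsquery('create & table')
ORDER BY last_mod_date DESC
LIMIT 10;
</programlisting>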
recording which configuration was used for each index entry. This
would be useful, for example, if the document collection contained
documents in different languages. Again,
queries that wish to use the index must be phrased to match, e.g.,
<literal>WHERE to_tsvector(config_name, body) @@ 'a & b'</>.
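Such an index could be declared like this (a sketch, assuming each row has
a <structfield>config_name</structfield> column identifying the text search
configuration that applies to it):

<programlisting>
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(config_name, body));
</programlisting>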
<programlisting>
ALTER TABLE pgweb ADD COLUMN textsearchable_index_col tsvector;
UPDATE pgweb SET textsearchable_index_col =
     to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,''));
</programlisting>

Then we create a <acronym>GIN</acronym> index to speed up the search:
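For example (using the column created above; the index name is arbitrary):

<programlisting>
CREATE INDEX textsearch_idx ON pgweb USING gin(textsearchable_index_col);
</programlisting>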
document, and how important the part of the document where they occur is.
However, the concept of relevancy is vague and very application-specific.
Different applications might require additional information for ranking,
e.g., document modification time. The built-in ranking functions are only
examples. You can write your own ranking functions and/or combine their
results with additional factors to fit your specific needs.
ts_rank_cd(<optional> <replaceable class="PARAMETER">weights</replaceable> <type>float4[]</>, </optional> <replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>,
           <replaceable class="PARAMETER">query</replaceable> <type>tsquery</> <optional>, <replaceable class="PARAMETER">normalization</replaceable> <type>integer</> </optional>) returns <type>float4</>
</programlisting>

Typically weights are used to mark words from special areas of the
document, like the title or an initial abstract, so they can be
treated with more or less importance than words in the document body.
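The <replaceable>weights</replaceable> argument, if given, is a
four-element array of weights for the labels <literal>D</>,
<literal>C</>, <literal>B</>, and <literal>A</>, in that order; the
defaults are <literal>{0.1, 0.2, 0.4, 1.0}</>. A sketch that doubles the
weight of <literal>A</>-labeled words, using the
<structname>apod</structname> table from the examples below:

<programlisting>
SELECT title, ts_rank_cd('{0.1, 0.2, 0.4, 2.0}', textsearch, query) AS rank
FROM apod, to_tsquery('neutrino') query
WHERE query @@ textsearch;
</programlisting>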
Since a longer document has a greater chance of containing a query term,
it is reasonable to take into account document size, e.g., a hundred-word
document with five instances of a search word is probably more relevant
than a thousand-word document with five instances. Both ranking functions
take an integer <replaceable>normalization</replaceable> option that
SELECT title, ts_rank_cd(textsearch, query) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC LIMIT 10;

-----------------------------------------------+----------
 Neutrinos in the Sun                          |      3.1
SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC LIMIT 10;

-----------------------------------------------+-------------------
 Neutrinos in the Sun                          | 0.756097569485493
Ranking can be expensive since it requires consulting the
<type>tsvector</type> of each matching document, which can be I/O bound and
therefore slow. Unfortunately, it is almost impossible to avoid since
practical queries often result in a large number of matches.
<function>ts_headline</function> accepts a document along
with a query, and returns an excerpt of
the document in which terms from the query are highlighted. The
configuration to be used to parse the document can be specified by
<replaceable>config</replaceable>; if <replaceable>config</replaceable>
<itemizedlist spacing="compact" mark="bullet">
<literal>StartSel</>, <literal>StopSel</literal>: the strings to delimit
query words appearing in the document, to distinguish
them from other excerpted words. You must double-quote these strings
if they contain spaces or commas.
<literal>ShortWord</literal>: words of this length or less will be
dropped at the start and end of a headline. The default
value of three eliminates common English articles.
<literal>HighlightAll</literal>: Boolean flag; if
<literal>true</literal> the whole document will be used as the
headline, ignoring the preceding three parameters.
<literal>MaxFragments</literal>: maximum number of text excerpts
or fragments to display. The default value of zero selects a
non-fragment-oriented headline generation method. A value greater than
zero selects fragment-based headline generation. This method
finds text fragments with as many query words as possible and
stretches those fragments around the query words. As a result,
query words are close to the middle of each fragment and have words on
each side. Each fragment will be of at most <literal>MaxWords</>, and
words of length <literal>ShortWord</> or less are dropped at the start
and end of each fragment. If not all query words are found in the
document, then a single fragment of the first <literal>MinWords</>
in the document will be displayed.
<literal>FragmentDelimiter</literal>: When more than one fragment is
displayed, the fragments will be separated by this string.
</itemizedlist>

Any unspecified options receive these defaults:

<programlisting>
StartSel=<b>, StopSel=</b>,
MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE,
MaxFragments=0, FragmentDelimiter=" ... "
</programlisting>
<programlisting>
SELECT ts_headline('english',
  'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
  to_tsquery('query & similarity'));

------------------------------------------------------------
 containing given <b>query</b> terms
 and return them in order of their <b>similarity</b> to the
 <b>query</b>.
SELECT ts_headline('english',
  'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
  to_tsquery('query & similarity'),
  'StartSel = <, StopSel = >');

-------------------------------------------------------
 containing given <query> terms
 and return them in order of their <similarity> to the
 <query>.
</programlisting>
<function>setweight</> returns a copy of the input vector in which every
position has been labeled with the given <replaceable>weight</>, either
<literal>A</literal>, <literal>B</literal>, <literal>C</literal>, or
<literal>D</literal>. (<literal>D</literal> is the default for new
The <function>ts_rewrite</function> family of functions searches a
given <type>tsquery</> for occurrences of a target
subquery, and replaces each occurrence with a
substitute subquery. In essence this operation is a
<type>tsquery</>-specific version of substring replacement.
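For instance, using the simple form of <function>ts_rewrite</> that takes
the target and substitute queries directly:

<programlisting>
SELECT ts_rewrite('a & b'::tsquery, 'a'::tsquery, 'c'::tsquery);
 ts_rewrite
------------
 'b' & 'c'
</programlisting>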
A target and substitute combination can be
We can change the rewriting rules just by updating the table:

<programlisting>
UPDATE aliases
SET s = to_tsquery('supernovae|sn & !nebulae')
WHERE t = to_tsquery('supernovae');

SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases');
Rewriting can be slow when there are many rewriting rules, since it
checks every rule for a possible match. To filter out obvious non-candidate
rules we can use the containment operators for the <type>tsquery</type>
type. In the example below, we select only those rules which might match
the original query:
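A sketch of such a filtered rewrite, using the
<structname>aliases</structname> table from above (with target column
<structfield>t</structfield> and substitute column
<structfield>s</structfield>):

<programlisting>
SELECT ts_rewrite('a & b'::tsquery,
                  'SELECT t, s FROM aliases WHERE ''a & b''::tsquery @> t');
</programlisting>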
A limitation of built-in triggers is that they treat all the
input columns alike. To process columns differently — for
example, to weigh title differently from body — it is necessary
to write a custom trigger. Here is an example using
<application>PL/pgSQL</application> as the trigger language:
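A sketch of such a trigger (the table and column names
<structname>messages</structname>, <structfield>title</structfield>,
<structfield>body</structfield>, and <structfield>tsv</structfield> are
assumptions for the example):

<programlisting>
CREATE FUNCTION messages_trigger() RETURNS trigger AS $$
begin
  -- give title words a higher weight than body words
  new.tsv :=
     setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'A') ||
     setweight(to_tsvector('pg_catalog.english', coalesce(new.body,'')), 'D');
  return new;
end
$$ LANGUAGE plpgsql;

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
    ON messages FOR EACH ROW EXECUTE PROCEDURE messages_trigger();
</programlisting>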
ts_stat(<replaceable class="PARAMETER">sqlquery</replaceable> <type>text</>, <optional> <replaceable class="PARAMETER">weights</replaceable> <type>text</>,
        </optional> OUT <replaceable class="PARAMETER">word</replaceable> <type>text</>, OUT <replaceable class="PARAMETER">ndoc</replaceable> <type>integer</>,
        OUT <replaceable class="PARAMETER">nentry</replaceable> <type>integer</>) returns <type>setof record</>
<replaceable>sqlquery</replaceable> is a text value containing an SQL
query which must return a single <type>tsvector</type> column.
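For example, to find the ten most frequent words in a document collection
(a sketch, assuming a table <structname>apod</structname> with a
<type>tsvector</type> column <structfield>vector</structfield>):

<programlisting>
SELECT * FROM ts_stat('SELECT vector FROM apod')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
</programlisting>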
<function>ts_stat</> executes the query and returns statistics about
each distinct lexeme (word) contained in the <type>tsvector</type>
by the parser, each dictionary in the list is consulted in turn,
until some dictionary recognizes it as a known word. If it is identified
as a stop word, or if no dictionary recognizes the token, it will be
discarded and not indexed or searched.
The general rule for configuring a list of dictionaries
is to place first the most narrow, most specific dictionary, then the more
general dictionaries, finishing with a very general dictionary, like
ALTER TEXT SEARCH CONFIGURATION english
    ALTER MAPPING FOR asciiword
    WITH my_synonym, english_stem;

SELECT * FROM ts_debug('english', 'Paris');
 alias | description | token | dictionaries | dictionary | lexemes
ALTER TEXT SEARCH CONFIGURATION russian
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH thesaurus_astro, english_stem;
</programlisting>

Now we can see how it works.
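For instance, if the thesaurus file maps the phrase
<literal>supernovae stars</literal> to <literal>sn</literal>, a query
would have the whole phrase substituted (a sketch, assuming a
configuration with this mapping is the session default):

<programlisting>
SELECT plainto_tsquery('supernova star');
 plainto_tsquery
-----------------
 'sn'
</programlisting>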
<firstterm>morphological dictionaries</>, which can normalize many
different linguistic forms of a word into the same lexeme. For example,
an English <application>Ispell</> dictionary can match all declensions and
conjugations of the search term <literal>bank</literal>, e.g.,
<literal>banking</>, <literal>banked</>, <literal>banks</>,
<literal>banks'</>, and <literal>bank's</>.
2564
Ispell dictionaries support splitting compound words.
2565
This is a nice feature and
2566
<productname>PostgreSQL</productname> supports it.
2583
Ispell dictionaries support splitting compound words;
2567
2585
Notice that the affix file should specify a special flag using the
2568
2586
<literal>compoundwords controlled</literal> statement that marks dictionary
2569
2587
words that can participate in compound formation:
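For example, a line along these lines in the affix file (the flag letter
<literal>z</literal> is arbitrary, but must match the flag attached to the
relevant dictionary entries):

<programlisting>
compoundwords  controlled z
</programlisting>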
<title><application>Snowball</> Dictionary</title>

The <application>Snowball</> dictionary template is based on a project
by Martin Porter, inventor of the popular Porter's stemming algorithm
for the English language. Snowball now provides stemming algorithms for
many languages (see the <ulink url="http://snowball.tartarus.org">Snowball
site</ulink> for more information). Each algorithm understands how to
The behavior of a custom text search configuration can easily become
confusing. The functions described
in this section are useful for testing text search objects. You can
test a complete configuration, or test parsers and dictionaries separately.
ts_parse(<replaceable class="PARAMETER">parser_name</replaceable> <type>text</>, <replaceable class="PARAMETER">document</replaceable> <type>text</>,
         OUT <replaceable class="PARAMETER">tokid</> <type>integer</>, OUT <replaceable class="PARAMETER">token</> <type>text</>) returns <type>setof record</>
ts_parse(<replaceable class="PARAMETER">parser_oid</replaceable> <type>oid</>, <replaceable class="PARAMETER">document</replaceable> <type>text</>,
         OUT <replaceable class="PARAMETER">tokid</> <type>integer</>, OUT <replaceable class="PARAMETER">token</> <type>text</>) returns <type>setof record</>
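For example, parsing a short string with the <literal>default</literal>
parser (the token IDs and classifications shown depend on the parser):

<programlisting>
SELECT * FROM ts_parse('default', '123 - a number');
 tokid | token
-------+--------
    22 | 123
    12 |
    12 | -
     1 | a
    12 |
     1 | number
</programlisting>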
ts_token_type(<replaceable class="PARAMETER">parser_name</> <type>text</>, OUT <replaceable class="PARAMETER">tokid</> <type>integer</>,
              OUT <replaceable class="PARAMETER">alias</> <type>text</>, OUT <replaceable class="PARAMETER">description</> <type>text</>) returns <type>setof record</>
ts_token_type(<replaceable class="PARAMETER">parser_oid</> <type>oid</>, OUT <replaceable class="PARAMETER">tokid</> <type>integer</>,
              OUT <replaceable class="PARAMETER">alias</> <type>text</>, OUT <replaceable class="PARAMETER">description</> <type>text</>) returns <type>setof record</>
There are two kinds of indexes that can be used to speed up full text
searches.
Note that indexes are not mandatory for full text searching, but in
cases where a column is searched on a regular basis, an index is
usually desirable.
to check the actual table row to eliminate such false matches.
(<productname>PostgreSQL</productname> does this automatically when needed.)
GiST indexes are lossy because each document is represented in the
index using a fixed-length signature. The signature is generated by hashing
each word into a random bit in an n-bit string, with all these bits OR-ed
together to produce an n-bit document signature. When two words hash to
the same bit position there will be a false match. If all words in
Lossiness causes performance degradation due to unnecessary fetches of table
records that turn out to be false matches. Since random access to table
records is slow, this limits the usefulness of GiST indexes. The
likelihood of false matches depends on several factors, in particular the
The optional parameter <literal>PATTERN</literal> can be the name of
a text search object, optionally schema-qualified. If
<literal>PATTERN</literal> is omitted then information about all
visible objects will be displayed. <literal>PATTERN</literal> can be a
Text search configuration setup is completely different now.
Instead of manually inserting rows into configuration tables,
search is configured through the specialized SQL commands shown
earlier in this chapter. There is no automated
support for converting an existing custom configuration for 8.3;
you're on your own here.