<emphasis>Parsing documents into <firstterm>tokens</></emphasis>. It is
useful to identify various classes of tokens, e.g., numbers, words,
complex words, email addresses, so that they can be processed
differently. In principle token classes depend on the specific
application, but for most purposes it is adequate to use a predefined
The above are all simple text search examples. As mentioned before, full
text search functionality includes the ability to do many more things:
skip indexing certain words (stop words), process synonyms, and use
sophisticated parsing, e.g., parse based on more than just white space.
This functionality is controlled by <firstterm>text search
configurations</>. <productname>PostgreSQL</> comes with predefined
configurations for many languages, and you can easily create your own
Text search parsers and templates are built from low-level C functions;
therefore C programming ability is required to develop new ones, and
superuser privileges to install one into a database. (There are examples
of add-on parsers and templates in the <filename>contrib/</> area of the
<productname>PostgreSQL</> distribution.) Since dictionaries and
<title>Searching a Table</title>

It is possible to do a full text search without an index. A simple query
to print the <structname>title</> of each row that contains the word
<literal>friend</> in its <structfield>body</> field is:
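One way to write it (a sketch, assuming the <structname>pgweb</structname>
example table with <structfield>title</> and <structfield>body</> columns
used elsewhere in this chapter):

<programlisting>
SELECT title
FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');
</programlisting>

This will also find related words such as <literal>friends</> and
<literal>friendly</>, since all of them reduce to the same normalized
lexeme.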
WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create & table')
ORDER BY last_mod_date DESC
LIMIT 10;
</programlisting>
For clarity we omitted the <function>coalesce</function> function calls
which would be needed to find rows that contain <literal>NULL</literal>
in one of the two fields.
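With those calls added, the query could look like this (a sketch using the
same <structname>pgweb</structname> table):

<programlisting>
SELECT title
FROM pgweb
WHERE to_tsvector(coalesce(title,'') || ' ' || coalesce(body,''))
      @@ to_tsquery('create & table')
ORDER BY last_mod_date DESC
LIMIT 10;
</programlisting>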
recording which configuration was used for each index entry. This
would be useful, for example, if the document collection contained
documents in different languages. Again,
queries that wish to use the index must be phrased to match, e.g.,
<literal>WHERE to_tsvector(config_name, body) @@ 'a & b'</>.
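Such an index could be declared like this (a sketch, assuming each row has
a <structfield>config_name</structfield> column identifying the text search
configuration that applies to it):

<programlisting>
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(config_name, body));
</programlisting>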
<programlisting>
ALTER TABLE pgweb ADD COLUMN textsearchable_index_col tsvector;
UPDATE pgweb SET textsearchable_index_col =
     to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,''));
</programlisting>

Then we create a <acronym>GIN</acronym> index to speed up the search:
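For example (using the column created above; the index name is arbitrary):

<programlisting>
CREATE INDEX textsearch_idx ON pgweb USING gin(textsearchable_index_col);
</programlisting>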
document, and how important the part of the document where they occur is.
However, the concept of relevancy is vague and very application-specific.
Different applications might require additional information for ranking,
e.g., document modification time. The built-in ranking functions are only
examples. You can write your own ranking functions and/or combine their
results with additional factors to fit your specific needs.
ts_rank_cd(<optional> <replaceable class="PARAMETER">weights</replaceable> <type>float4[]</>, </optional> <replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>,
           <replaceable class="PARAMETER">query</replaceable> <type>tsquery</> <optional>, <replaceable class="PARAMETER">normalization</replaceable> <type>integer</> </optional>) returns <type>float4</>
</programlisting>

Typically weights are used to mark words from special areas of the
document, like the title or an initial abstract, so they can be
treated with more or less importance than words in the document body.
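The <replaceable>weights</replaceable> argument, if given, is a
four-element array of weights for the labels <literal>D</>,
<literal>C</>, <literal>B</>, and <literal>A</>, in that order; the
defaults are <literal>{0.1, 0.2, 0.4, 1.0}</>. A sketch that doubles the
weight of <literal>A</>-labeled words, using the
<structname>apod</structname> table from the examples below:

<programlisting>
SELECT title, ts_rank_cd('{0.1, 0.2, 0.4, 2.0}', textsearch, query) AS rank
FROM apod, to_tsquery('neutrino') query
WHERE query @@ textsearch;
</programlisting>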
Since a longer document has a greater chance of containing a query term,
it is reasonable to take into account document size, e.g., a hundred-word
document with five instances of a search word is probably more relevant
than a thousand-word document with five instances. Both ranking functions
take an integer <replaceable>normalization</replaceable> option that
SELECT title, ts_rank_cd(textsearch, query) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC LIMIT 10;

-----------------------------------------------+----------
 Neutrinos in the Sun                          |      3.1
SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC LIMIT 10;

-----------------------------------------------+-------------------
 Neutrinos in the Sun                          | 0.756097569485493
Ranking can be expensive since it requires consulting the
<type>tsvector</type> of each matching document, which can be I/O bound and
therefore slow. Unfortunately, it is almost impossible to avoid since
practical queries often result in a large number of matches.
<function>ts_headline</function> accepts a document along
with a query, and returns an excerpt of
the document in which terms from the query are highlighted. The
configuration to be used to parse the document can be specified by
<replaceable>config</replaceable>; if <replaceable>config</replaceable>
<itemizedlist spacing="compact" mark="bullet">
<literal>StartSel</>, <literal>StopSel</literal>: the strings to delimit
query words appearing in the document, to distinguish
them from other excerpted words. You must double-quote these strings
if they contain spaces or commas.
<literal>ShortWord</literal>: words of this length or less will be
dropped at the start and end of a headline. The default
value of three eliminates common English articles.
<literal>HighlightAll</literal>: Boolean flag; if
<literal>true</literal> the whole document will be used as the
headline, ignoring the preceding three parameters.
<literal>MaxFragments</literal>: maximum number of text excerpts
or fragments to display. The default value of zero selects a
non-fragment-oriented headline generation method. A value greater than
zero selects fragment-based headline generation. This method
finds text fragments with as many query words as possible and
stretches those fragments around the query words. As a result,
query words are close to the middle of each fragment and have words on
each side. Each fragment will be of at most <literal>MaxWords</>, and
words of length <literal>ShortWord</> or less are dropped at the start
and end of each fragment. If not all query words are found in the
document, then a single fragment of the first <literal>MinWords</>
in the document will be displayed.
<literal>FragmentDelimiter</literal>: When more than one fragment is
displayed, the fragments will be separated by this string.
</itemizedlist>

Any unspecified options receive these defaults:

<programlisting>
StartSel=<b>, StopSel=</b>,
MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE,
MaxFragments=0, FragmentDelimiter=" ... "
</programlisting>
<programlisting>
SELECT ts_headline('english',
  'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
  to_tsquery('query & similarity'));

------------------------------------------------------------
 containing given <b>query</b> terms
 and return them in order of their <b>similarity</b> to the
 <b>query</b>.
SELECT ts_headline('english',
  'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
  to_tsquery('query & similarity'),
  'StartSel = <, StopSel = >');

-------------------------------------------------------
 containing given <query> terms
 and return them in order of their <similarity> to the
 <query>.
</programlisting>
<function>setweight</> returns a copy of the input vector in which every
position has been labeled with the given <replaceable>weight</>, either
<literal>A</literal>, <literal>B</literal>, <literal>C</literal>, or
<literal>D</literal>. (<literal>D</literal> is the default for new
The <function>ts_rewrite</function> family of functions searches a
given <type>tsquery</> for occurrences of a target
subquery, and replaces each occurrence with a
substitute subquery. In essence this operation is a
<type>tsquery</>-specific version of substring replacement.
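For instance, using the simple form of <function>ts_rewrite</> that takes
the target and substitute queries directly:

<programlisting>
SELECT ts_rewrite('a & b'::tsquery, 'a'::tsquery, 'c'::tsquery);
 ts_rewrite
------------
 'b' & 'c'
</programlisting>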
A target and substitute combination can be
We can change the rewriting rules just by updating the table:

<programlisting>
UPDATE aliases
SET s = to_tsquery('supernovae|sn & !nebulae')
WHERE t = to_tsquery('supernovae');

SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases');
Rewriting can be slow when there are many rewriting rules, since it
checks every rule for a possible match. To filter out obvious non-candidate
rules we can use the containment operators for the <type>tsquery</type>
type. In the example below, we select only those rules which might match
the original query:
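A sketch of such a filtered rewrite, using the
<structname>aliases</structname> table from above (with target column
<structfield>t</structfield> and substitute column
<structfield>s</structfield>):

<programlisting>
SELECT ts_rewrite('a & b'::tsquery,
                  'SELECT t, s FROM aliases WHERE ''a & b''::tsquery @> t');
</programlisting>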
A limitation of built-in triggers is that they treat all the
input columns alike. To process columns differently — for
example, to weigh title differently from body — it is necessary
to write a custom trigger. Here is an example using
<application>PL/pgSQL</application> as the trigger language:
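A sketch of such a trigger (the table and column names
<structname>messages</structname>, <structfield>title</structfield>,
<structfield>body</structfield>, and <structfield>tsv</structfield> are
assumptions for the example):

<programlisting>
CREATE FUNCTION messages_trigger() RETURNS trigger AS $$
begin
  -- give title words a higher weight than body words
  new.tsv :=
     setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'A') ||
     setweight(to_tsvector('pg_catalog.english', coalesce(new.body,'')), 'D');
  return new;
end
$$ LANGUAGE plpgsql;

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
    ON messages FOR EACH ROW EXECUTE PROCEDURE messages_trigger();
</programlisting>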
ts_stat(<replaceable class="PARAMETER">sqlquery</replaceable> <type>text</>, <optional> <replaceable class="PARAMETER">weights</replaceable> <type>text</>,
        </optional> OUT <replaceable class="PARAMETER">word</replaceable> <type>text</>, OUT <replaceable class="PARAMETER">ndoc</replaceable> <type>integer</>,
        OUT <replaceable class="PARAMETER">nentry</replaceable> <type>integer</>) returns <type>setof record</>
<replaceable>sqlquery</replaceable> is a text value containing an SQL
query which must return a single <type>tsvector</type> column.
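For example, to find the ten most frequent words in a document collection
(a sketch, assuming a table <structname>apod</structname> with a
<type>tsvector</type> column <structfield>vector</structfield>):

<programlisting>
SELECT * FROM ts_stat('SELECT vector FROM apod')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
</programlisting>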
<function>ts_stat</> executes the query and returns statistics about
each distinct lexeme (word) contained in the <type>tsvector</type>
by the parser, each dictionary in the list is consulted in turn,
until some dictionary recognizes it as a known word. If it is identified
as a stop word, or if no dictionary recognizes the token, it will be
discarded and not indexed or searched.
The general rule for configuring a list of dictionaries
is to place first the most narrow, most specific dictionary, then the more
general dictionaries, finishing with a very general dictionary, like
ALTER TEXT SEARCH CONFIGURATION english
    ALTER MAPPING FOR asciiword
    WITH my_synonym, english_stem;

SELECT * FROM ts_debug('english', 'Paris');
 alias | description | token | dictionaries | dictionary | lexemes
ALTER TEXT SEARCH CONFIGURATION russian
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH thesaurus_astro, english_stem;
</programlisting>

Now we can see how it works.
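For instance, if the thesaurus file maps the phrase
<literal>supernovae stars</literal> to <literal>sn</literal>, a query
would have the whole phrase substituted (a sketch, assuming a
configuration with this mapping is the session default):

<programlisting>
SELECT plainto_tsquery('supernova star');
 plainto_tsquery
-----------------
 'sn'
</programlisting>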
<firstterm>morphological dictionaries</>, which can normalize many
different linguistic forms of a word into the same lexeme. For example,
an English <application>Ispell</> dictionary can match all declensions and
conjugations of the search term <literal>bank</literal>, e.g.,
<literal>banking</>, <literal>banked</>, <literal>banks</>,
<literal>banks'</>, and <literal>bank's</>.
2564
Ispell dictionaries support splitting compound words.
2565
This is a nice feature and
2566
<productname>PostgreSQL</productname> supports it.
2583
Ispell dictionaries support splitting compound words;
2567
2585
Notice that the affix file should specify a special flag using the
2568
2586
<literal>compoundwords controlled</literal> statement that marks dictionary
2569
2587
words that can participate in compound formation:
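For example, a line along these lines in the affix file (the flag letter
<literal>z</literal> is arbitrary, but must match the flag attached to the
relevant dictionary entries):

<programlisting>
compoundwords  controlled z
</programlisting>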
<title><application>Snowball</> Dictionary</title>

The <application>Snowball</> dictionary template is based on a project
by Martin Porter, inventor of the popular Porter's stemming algorithm
for the English language. Snowball now provides stemming algorithms for
many languages (see the <ulink url="http://snowball.tartarus.org">Snowball
site</ulink> for more information). Each algorithm understands how to
The behavior of a custom text search configuration can easily become
confusing. The functions described
in this section are useful for testing text search objects. You can
test a complete configuration, or test parsers and dictionaries separately.
ts_parse(<replaceable class="PARAMETER">parser_name</replaceable> <type>text</>, <replaceable class="PARAMETER">document</replaceable> <type>text</>,
         OUT <replaceable class="PARAMETER">tokid</> <type>integer</>, OUT <replaceable class="PARAMETER">token</> <type>text</>) returns <type>setof record</>
ts_parse(<replaceable class="PARAMETER">parser_oid</replaceable> <type>oid</>, <replaceable class="PARAMETER">document</replaceable> <type>text</>,
         OUT <replaceable class="PARAMETER">tokid</> <type>integer</>, OUT <replaceable class="PARAMETER">token</> <type>text</>) returns <type>setof record</>
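For example, parsing a short string with the <literal>default</literal>
parser (the token IDs and classifications shown depend on the parser):

<programlisting>
SELECT * FROM ts_parse('default', '123 - a number');
 tokid | token
-------+--------
    22 | 123
    12 |
    12 | -
     1 | a
    12 |
     1 | number
</programlisting>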
ts_token_type(<replaceable class="PARAMETER">parser_name</> <type>text</>, OUT <replaceable class="PARAMETER">tokid</> <type>integer</>,
              OUT <replaceable class="PARAMETER">alias</> <type>text</>, OUT <replaceable class="PARAMETER">description</> <type>text</>) returns <type>setof record</>
ts_token_type(<replaceable class="PARAMETER">parser_oid</> <type>oid</>, OUT <replaceable class="PARAMETER">tokid</> <type>integer</>,
              OUT <replaceable class="PARAMETER">alias</> <type>text</>, OUT <replaceable class="PARAMETER">description</> <type>text</>) returns <type>setof record</>
There are two kinds of indexes that can be used to speed up full text
searches.
Note that indexes are not mandatory for full text searching, but in
cases where a column is searched on a regular basis, an index is
usually desirable.
to check the actual table row to eliminate such false matches.
(<productname>PostgreSQL</productname> does this automatically when needed.)
GiST indexes are lossy because each document is represented in the
index using a fixed-length signature. The signature is generated by hashing
each word into a random bit in an n-bit string, with all these bits OR-ed
together to produce an n-bit document signature. When two words hash to
the same bit position there will be a false match. If all words in
Lossiness causes performance degradation due to unnecessary fetches of table
records that turn out to be false matches. Since random access to table
records is slow, this limits the usefulness of GiST indexes. The
likelihood of false matches depends on several factors, in particular the
The optional parameter <literal>PATTERN</literal> can be the name of
a text search object, optionally schema-qualified. If
<literal>PATTERN</literal> is omitted then information about all
visible objects will be displayed. <literal>PATTERN</literal> can be a
Text search configuration setup is completely different now.
Instead of manually inserting rows into configuration tables,
search is configured through the specialized SQL commands shown
earlier in this chapter. There is no automated
support for converting an existing custom configuration for 8.3;
you're on your own here.