<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<title>tsearch-v2-intro</title>
<h2>Tsearch2 - Introduction</h2>
<p>An <a href="
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html">
[Online version]</a> of this document is available.</p>
<p>The tsearch2 module is available to add as an extension to the
PostgreSQL database to allow for Full Text Indexing. This document
is an introduction to installing, configuring, using and
maintaining the database with the tsearch2 module activated.</p>
<p>Please note that the tsearch2 module is fully incompatible with the old
tsearch module, which is deprecated in 7.4 and will be obsoleted in
8.0.</p>
<h3>USING TSEARCH2 AND POSTGRESQL FOR A WEB BASED SEARCH
ENGINE</h3>
<p>This documentation is provided as a short guide on how to
quickly get up and running with tsearch2 and PostgreSQL, for those
who want to implement a full text indexed based search engine. It
is not meant to be a complete in-depth guide into the full ins and
outs of the contrib/tsearch2 module, and is primarily aimed at
beginners who want to speed up searching of large text fields, or
those migrating from other database systems such as MS-SQL.</p>
<p>The README.tsearch2 file included in the contrib/tsearch2
directory contains a brief overview and history behind tsearch.
This can also be found online <a href="
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/">[right
here]</a>.</p>
<p>Further in-depth documentation, such as a full function
reference and user guide, can be found online at the <a href="
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/">[tsearch
documentation home]</a>.</p>
<h3>ACKNOWLEDGEMENTS</h3>
<p>Robert John Shepherd originally wrote this documentation for the
previous version of the tsearch module (v1) included with the postgres
release. I took his documentation and updated it to comply with the
tsearch2 modifications.</p>
<p>Robert's original acknowledgements:</p>
<p>"Thanks to Oleg Bartunov for taking the time to answer many of
my questions regarding this module, and also to Teodor Sigaev for
clearing up the process of making your own dictionaries. Plus of
course a big thanks to the pair of them for writing this module in
the first place!"</p>
<p>I would also like to extend my thanks to the developers, and
Oleg Bartunov for all of his direction and help with the new
features of tsearch2.</p>
<h3>OVERVIEW</h3>
<p>MS-SQL provides a full text indexing (FTI) system which enables
the fast searching of text based fields, very useful for websites
(and other applications) that require a results set based on key
words. PostgreSQL ships with a contributed module called tsearch2,
which implements a special type of index that can also be used for
full text indexing. Furthermore, unlike MS' offering, which
requires regular incremental rebuilds of the text indexes
themselves, tsearch2 indexes are always up-to-date and keeping them
so induces very little overhead.</p>
<p>Before we get into the details, it is recommended that you have
installed and tested PostgreSQL, are reasonably familiar with
databases and the SQL query language, and also understand the basics of
connecting to PostgreSQL from the local shell. This document isn't
intended for the complete PostgreSQL newbie, but anyone with a
reasonable grasp of the basics should be able to follow it.</p>
<p>Starting with PostgreSQL version 7.4, tsearch2 is now included in
the contrib directory with the PostgreSQL sources. contrib/tsearch2
is where you will find everything needed to install and use
tsearch2. Please note that tsearch2 will also work with PostgreSQL
version 7.3.x, but it is not the module included with that source
distribution. You will have to download the module separately and
install it in the same fashion.</p>
<h3>INSTALLATION NOTES</h3>
<p>I installed the tsearch2 module to a PostgreSQL 7.3 database
from the contrib directory without squashing the original (old)
tsearch module. What I did was move the module's tsearch src
directory into the contrib tree under the name tsearchV2.</p>
<p>Step one is to download the tsearch V2 module :</p>
<p><a href="
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/">[http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/]</a>
(check Development History for latest stable version !)</p>
<pre>
tar -zxvf tsearch-v2.tar.gz
mv tsearch2 $PGSQL_SRC/contrib/
cd $PGSQL_SRC/contrib/tsearch2
</pre>
<p>If you are installing from PostgreSQL version 7.4 or higher, you
can skip those steps and just change to the contrib/tsearch2
directory in the source tree and continue from there.</p>
<p>As of May 9, 2004 there is a source patch available for
tsearch2. The patch provides changes to the pg_ts_ configuration
tables to allow for easy dump and restore of a database containing
tsearch2. The patch is available here : <a href="
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/regprocedure_7.4.patch.gz">
[http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/regprocedure_7.4.patch.gz]</a></p>
<p>To apply this patch, download the mentioned file and place it in
your PostgreSQL source tree ($PGSQL_SRC). This patch is not
required for tsearch2 to work. I would, however, highly recommend it
as it makes the backup and restore procedures very simple.</p>
<pre>
gunzip regprocedure_7.4.patch.gz
patch -b -p1 < regprocedure_7.4.patch
</pre>
<p>If you have a working version of tsearch2 in your database, you
do not need to re-install the tsearch2 module. Just apply the patch
and run make. This patch only affects the tsearch2.sql file. You
can run the SQL script found <a href="
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/regprocedure_update.sql">
[right here]</a>. This script will make the modifications found in
the patch, and update the fields from the existing data. From this
point on, you can dump and restore the database in a normal
fashion. Without this patch, you must follow the instructions later
in this document for backup and restore.</p>
<p>This patch is only needed for tsearch2 in PostgreSQL versions
7.3.x and 7.4.x. The patch has been applied to the sources for
version 8.0.</p>
<p>When you have your source tree for tsearch2 ready, you can
continue with the regular building and installation process:</p>
<pre>
gmake
gmake install
gmake installcheck
</pre>
<p>That is pretty much all you have to do, unless of course you get
errors. However, if you get those, you had better go check with the
mailing lists over at <a href="
http://www.postgresql.org">http://www.postgresql.org</a> or <a href="
http://openfts.sourceforge.net/">http://openfts.sourceforge.net/</a>,
since it has never failed for me.</p>
<p>If you ever need to revert this patch and go back to the
unpatched version of tsearch2, it is simple if you followed the
above patch command. The -b option creates a backup of the original
file, so we can just copy it back.</p>
<pre>
cd $PGSQL_SRC/contrib/tsearch2
cp tsearch.sql.in.orig tsearch.sql.in
</pre>
<p>If you need the patched version again, just follow the patch
instructions again.</p>
<p>The directory in contrib/ and the directory from the archive
are both called tsearch2. Tsearch2 is completely incompatible with the
previous version of tsearch. This means that both versions can be
installed into a single database, and migration to the new version may
be done gradually.</p>
<p>NOTE: the previous version of tsearch found in the
contrib/tsearch directory is deprecated. Although it is still
available and included within PostgreSQL version 7.4, it will be
removed in version 8.0.</p>
<h3>ADDING TSEARCH2 FUNCTIONALITY TO A DATABASE</h3>
<p>We should create a database to use as an example for the
remainder of this file. We can call the database "ftstest". You can
create it from the command line like this:</p>
<pre>
createdb ftstest
</pre>
<p>If you thought installation was easy, this next bit is even
easier. Change to the PGSQL_SRC/contrib/tsearch2 directory and
type:</p>
<pre>
psql ftstest < tsearch2.sql
</pre>
<p>The file "tsearch2.sql" holds all the wonderful little goodies
you need to do full text indexing. It defines numerous functions
and operators, and creates the needed tables in the database. There
will be 4 new tables created after running the tsearch2.sql file :
pg_ts_dict, pg_ts_parser, pg_ts_cfg, pg_ts_cfgmap.</p>
<p>You can check out the tables if you like:</p>
<pre>
ftstest=# \d
        List of relations
 Schema |     Name     | Type  |  Owner
--------+--------------+-------+----------
 public | pg_ts_cfg    | table | kopciuch
 public | pg_ts_cfgmap | table | kopciuch
 public | pg_ts_dict   | table | kopciuch
 public | pg_ts_parser | table | kopciuch
(4 rows)
</pre>
<p>You may need to grant permissions on the pg_ts_dict, pg_ts_parser,
pg_ts_cfg, and pg_ts_cfgmap tables to let non-superusers work with
tsearch2. GRANT SELECT should be enough for search-only access.</p>
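<p>For example, a minimal grant for a read-only role might look like the
following sketch (the role name "webuser" is made up for illustration; use
whatever role your application connects as):</p>
<pre>
-- "webuser" is a hypothetical role used only for this example
GRANT SELECT ON pg_ts_cfg, pg_ts_cfgmap, pg_ts_dict, pg_ts_parser TO webuser;
</pre>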
<h3>TYPES AND FUNCTIONS PROVIDED BY TSEARCH2</h3>
<p>The first thing we can do is try out some of the types that are
provided for us. Let's look at the tsvector type provided for
us:</p>
<pre>
SELECT 'Our first string used today'::tsvector;
               tsvector
---------------------------------------
 'Our' 'used' 'first' 'today' 'string'
(1 row)
</pre>
<p>The results are the words used within our string. Notice they
are not in any particular order. The tsvector type returns a string
of space separated words.</p>
<pre>
SELECT 'Our first string used today first string'::tsvector;
                   tsvector
-----------------------------------------------
 'Our' 'used' 'first' 'today' 'string'
(1 row)
</pre>
<p>Notice the result string contains each unique word only once ('first' and
'string' only appear once in the tsvector value). Which of course
makes sense if you are searching the full text ... you only need to
know each unique word in the text.</p>
<p>Those examples were just casting a text field to that of type
tsvector. Let's check out one of the new functions created by the
tsearch2 module.</p>
<p>The function to_tsvector has 3 possible signatures:</p>
<pre>
to_tsvector(oid, text);
to_tsvector(text, text);
to_tsvector(text);
</pre>
<p>We will use the second method using two text fields. The
overloaded methods provide us with a way to specify the way the
searchable text is broken up into words (the stemming process). Right
now we will specify the 'default' configuration. See the section on
TSEARCH2 CONFIGURATION to learn more about this.</p>
<pre>
SELECT to_tsvector('default',
'Our first string used today first string');
                 to_tsvector
--------------------------------------------
 'use':4 'first':2,6 'today':5 'string':3,7
(1 row)
</pre>
<p>The result returned from this function is of type tsvector. The
results came about by this reasoning: all of the words in the text
passed in are stemmed, or not used because they are stop words
defined in our configuration. Each lower case morphed word is
returned with all of its positions in the text.</p>
<p>In this case the word "Our" is a stop word in the default
configuration. That means it will not be included in the result.
The word "first" is found at positions 2 and 6 (although "Our" is a
stop word, its position is maintained). The word positioning is
maintained exactly as in the original string. The word "used" is
morphed to the word "use" based on the default configuration for
word stemming, and is found at position 4. The rest of the results
follow the same logic. Just a reminder again ... the 'word' positions
in the output do not appear in any kind of order (i.e.
'use':4 appears first).</p>
<p>If you want to view the output of the tsvector fields without
their positions, you can do so with the function
"strip(tsvector)".</p>
<pre>
SELECT strip(to_tsvector('default',
'Our first string used today first string'));
             strip
--------------------------------
 'use' 'first' 'today' 'string'
(1 row)
</pre>
<p>If you wish to know the number of unique words returned in the
tsvector you can do so by using the function "length(tsvector)":</p>
<pre>
SELECT length(to_tsvector('default',
'Our first string used today first string'));
 length
--------
      4
(1 row)
</pre>
<p>Let's take a look at the function to_tsquery. It also has 3
signatures which follow the same rationale as the to_tsvector
function:</p>
<pre>
to_tsquery(oid, text);
to_tsquery(text, text);
to_tsquery(text);
</pre>
<p>Let's try using the function with a single word :</p>
<pre>
SELECT to_tsquery('default', 'word');
 to_tsquery
------------
 'word'
(1 row)
</pre>
<p>I call the function the same way I would a to_tsvector function,
specifying the 'default' configuration for morphing, and the result
is the stemmed output 'word'.</p>
<p>Let's attempt to use the function with a string of multiple
words:</p>
<pre>
SELECT to_tsquery('default', 'this is many words');
ERROR:  Syntax error
</pre>
<p>The function can not accept a space separated string. The
intention of the to_tsquery function is to return a value of type
"tsquery" used for searching a tsvector field. What we need to do
is search for one to many words with some kind of logic (for now,
a simple boolean OR):</p>
<pre>
SELECT to_tsquery('default', 'searching|sentence');
      to_tsquery
----------------------
 'search' | 'sentenc'
(1 row)
</pre>
<p>Notice that the words are separated by the boolean logic "OR";
the text could contain the boolean operators &, |, ! and () with their
normal meanings.</p>
<p>You can not use words defined as being a stop word in your
configuration. The function will not fail ... you will just get no
result, and a NOTICE like this:</p>
<pre>
SELECT to_tsquery('default', 'a|is&not|!the');
NOTICE:  Query contains only stopword(s)
or doesn't contain lexem(s), ignored
</pre>
<p>That is a beginning to using the types and functions defined in
the tsearch2 module. There are numerous other functions that I have
not touched on. You can read through the tsearch2.sql file built
when compiling to get more familiar with what is included.</p>
<h3>INDEXING FIELDS IN A TABLE</h3>
<p>The next stage is to add a full text index to an existing table.
In this example we already have a table defined as follows:</p>
<pre>
CREATE TABLE tblMessages
(
    intIndex   int4,
    strTopic   varchar(100),
    strMessage varchar(2000)
);
</pre>
<p>We are assuming there are several rows with some kind of data in
them. Any data will do; just do several inserts with test strings
for a topic and a message. Here is some test data I inserted (yes,
I know it's completely useless stuff ;-) but it will serve our
purpose right now):</p>
<pre>
INSERT INTO tblMessages
   VALUES ('1', 'Testing Topic', 'Testing message data input');
INSERT INTO tblMessages
   VALUES ('2', 'Movie', 'Breakfast at Tiffany\'s');
INSERT INTO tblMessages
   VALUES ('3', 'Famous Author', 'Stephen King');
INSERT INTO tblMessages
   VALUES ('4', 'Political Topic',
           'Nelson Mandella is released from prison');
INSERT INTO tblMessages
   VALUES ('5', 'Nursery rhyme phrase',
           'Little jack horner sat in a corner');
INSERT INTO tblMessages
   VALUES ('6', 'Gettysburg address quotation',
           'Four score and seven years ago'
           ' our fathers brought forth on this'
           ' continent a new nation, conceived in'
           ' liberty and dedicated to the proposition'
           ' that all men are created equal');
INSERT INTO tblMessages
   VALUES ('7', 'Classic Rock Bands',
           'Led Zeppelin Grateful Dead and The Sex Pistols');
INSERT INTO tblMessages
   VALUES ('8', 'My birth address',
           '18 Sommervile road, Regina, Saskatchewan');
INSERT INTO tblMessages
   VALUES ('9', 'Joke', 'knock knock : who\'s there?'
           ' I will not finish this joke');
INSERT INTO tblMessages
   VALUES ('10', 'Computer information',
           'My computer is a pentium III 400 mHz'
           ' with 192 megabytes of RAM');
</pre>
<p>The next stage is to create a special text index which we will
use for FTI, so we can search our table of messages for words or a
phrase. We do this using the SQL command:</p>
<pre>
ALTER TABLE tblMessages ADD COLUMN idxFTI tsvector;
</pre>
<p>Note that unlike traditional indexes, this is actually a new
field in the same table, which is then used (through the magic of
the tsearch2 operators and functions) by a special index we will
create in a moment.</p>
<p>The general rule for the initial insertion of data will follow
these four steps:</p>
<pre>
1. insert the data into the table
2. vacuum full analyze
3. create the index
4. vacuum full analyze
</pre>
<p>The data can be updated into the table, and the vacuum full analyze
will reclaim unused space. The index can be created on the table
after the data has been inserted. Having the index created prior to
the update will slow down the process. It can be done in that
manner; this way is just more efficient. After the index has been
created on the table, vacuum full analyze is run again to update
postgres's statistics (ie having the index take effect).</p>
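<p>As a condensed sketch, the four steps for the tblMessages example amount
to the following sequence (each statement is explained individually in the
rest of this section):</p>
<pre>
-- 1. populate the tsvector column
UPDATE tblMessages SET idxFTI=to_tsvector('default', strMessage);
-- 2. reclaim space and refresh statistics
VACUUM FULL ANALYZE;
-- 3. build the GiST index
CREATE INDEX idxFTI_idx ON tblMessages USING gist(idxFTI);
-- 4. refresh statistics so the planner uses the new index
VACUUM FULL ANALYZE;
</pre>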
<pre>
UPDATE tblMessages SET idxFTI=to_tsvector('default', strMessage);
</pre>
<p>Note that this only inserts the field strMessage as a tsvector,
so if you want to also add strTopic to the information stored, you
should instead do the following, which effectively concatenates the
two fields into one before being inserted into the table:</p>
<pre>
UPDATE tblMessages
   SET idxFTI=to_tsvector('default',coalesce(strTopic,'') ||' '|| coalesce(strMessage,''));
</pre>
<p><strong>Using the coalesce function makes sure this
concatenation also works with NULL fields.</strong></p>
<p>We need to create the index on the column idxFTI. Keep in mind
that the database will update the index when some action is taken.
In this case we _need_ the index (the whole point of Full Text
INDEXING ;-)), so don't worry about any indexing overhead. We will
create an index based on the gist function. GiST stands for
Generalized Search Tree.</p>
<pre>
CREATE INDEX idxFTI_idx ON tblMessages USING gist(idxFTI);
</pre>
<p>After you have converted all of your data and indexed the
column, you can select some rows to see what actually happened. I
will not display output here but you can play around yourselves and
see what happened.</p>
<p>The last thing to do is set up a trigger so every time a row in
this table is changed, the text index is automatically updated.
This is easily done using:</p>
<pre>
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
FOR EACH ROW EXECUTE PROCEDURE tsearch2(idxFTI, strMessage);
</pre>
<p>Or if you are indexing both strMessage and strTopic you should
instead use:</p>
<pre>
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
FOR EACH ROW EXECUTE PROCEDURE
tsearch2(idxFTI, strTopic, strMessage);
</pre>
<p>Before you ask, the tsearch2 function accepts multiple fields as
arguments so there is no need to concatenate the two into one like
we did before.</p>
<p>If you want to do something specific with columns, you may write
your very own trigger function using plpgsql or other procedural
languages (but not SQL, unfortunately) and use it instead of the
<em>tsearch2</em> trigger.</p>
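<p>A minimal sketch of such a custom trigger function follows, assuming the
same tblMessages columns as above (the function name fn_update_idxfti is
made up for illustration, and the body simply replicates the concatenation
behaviour shown earlier):</p>
<pre>
-- Hypothetical example of a hand-written plpgsql trigger function.
-- (Drop the existing tsvectorupdate trigger first if it is present.)
CREATE FUNCTION fn_update_idxfti() RETURNS trigger AS '
BEGIN
    NEW.idxFTI := to_tsvector(''default'',
        coalesce(NEW.strTopic,'''') || '' '' ||
        coalesce(NEW.strMessage,''''));
    RETURN NEW;
END;
' LANGUAGE plpgsql;

CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
FOR EACH ROW EXECUTE PROCEDURE fn_update_idxfti();
</pre>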
<p>You could, however, call other stored procedures from within the
tsearch2 function. Let's say we want to create a function to remove
certain characters (like the @ symbol) from all text.</p>
<pre>
CREATE FUNCTION dropatsymbol(text)
RETURNS text AS 'select replace($1, \'@\', \' \');' LANGUAGE SQL;
</pre>
<p>Now we can use this function within the tsearch2 function on the
trigger:</p>
<pre>
DROP TRIGGER tsvectorupdate ON tblmessages;
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
FOR EACH ROW EXECUTE PROCEDURE tsearch2(idxFTI, dropatsymbol, strMessage);
INSERT INTO tblmessages VALUES (69, 'Attempt for dropatsymbol', 'Test@test.com');
</pre>
<p>If at this point you receive an error stating: ERROR: Can't find
tsearch config by locale</p>
<p>Do not worry. You have done nothing wrong, and tsearch2 is not
broken. All that has happened here is that the configuration is
set up to use a configuration based on the locale of the server. All
you have to do is change your default configuration, or add a new
one for your specific locale. See the section on TSEARCH2
CONFIGURATION.</p>
<pre>
SELECT * FROM tblmessages WHERE intindex = 69;
 intindex |         strtopic         |  strmessage   |        idxfti
----------+--------------------------+---------------+-----------------------
       69 | Attempt for dropatsymbol | Test@test.com | 'test':1 'test.com':2
(1 row)
</pre>
<p>Notice that the string content was passed through the stored
procedure dropatsymbol. The '@' character was replaced with a
single space ... and the output from the procedure was then stored
in the tsvector column.</p>
<p>This could be useful for removing other characters from indexed
text, or any kind of preprocessing needed to be done on the text
prior to insertion into the index.</p>
<h3>QUERYING A TABLE</h3>
<p>There are some examples in the README.tsearch2 file for querying
a table. One major difference between tsearch and tsearch2 is that the
operator ## is no longer available. Only the operator @@ is
defined, using the type tsvector on one side and tsquery on the
other.</p>
<p>Let's search the indexed data for the word "Test". I indexed
based on the concatenation of the strTopic and the
strMessage fields:</p>
<pre>
SELECT intindex, strtopic FROM tblmessages
   WHERE idxfti @@ 'test'::tsquery;
 intindex |   strtopic
----------+---------------
        1 | Testing Topic
(1 row)
</pre>
<p>The only result that matched was the row with a topic "Testing
Topic". Notice that the word I searched for was all lowercase. Let's
see what happens when I query for uppercase "Test".</p>
<pre>
SELECT intindex, strtopic FROM tblmessages
   WHERE idxfti @@ 'Test'::tsquery;
 intindex | strtopic
----------+----------
(0 rows)
</pre>
<p>We get zero rows returned. The reason is that when the text
was inserted, it was morphed to my default configuration (because
of the call to to_tsvector in the UPDATE statement). If there had been
no morphing done, and the tsvector field(s) contained the word
'Test', a match would have been found.</p>
<p>Most likely the best way to query the field is to use the
to_tsquery function on the right hand side of the @@ operator, like
this:</p>
<pre>
SELECT intindex, strtopic FROM tblmessages
   WHERE idxfti @@ to_tsquery('default', 'Test | Zeppelin');
 intindex |      strtopic
----------+--------------------
        1 | Testing Topic
        7 | Classic Rock Bands
(2 rows)
</pre>
<p>That query searched for all instances of "Test" OR "Zeppelin".
It returned two rows: the "Testing Topic" row, and the "Classic
Rock Bands" row. The to_tsquery function performed the correct
morphology upon the parameters, and searched the tsvector field
accordingly.</p>
<p>The last example here relates to searching for a phrase, for
example "minority report". This poses a problem with regard to
tsearch2, as it doesn't index phrases, only words. But there is a
way around which doesn't appear to have a significant impact on
query time, and that is to use a query such as the following:</p>
<pre>
SELECT intindex, strTopic FROM tblmessages
   WHERE idxfti @@ to_tsquery('default', 'gettysburg & address')
   AND strMessage ~* '.*men are created equal.*';
 intindex |           strtopic
----------+------------------------------
        6 | Gettysburg address quotation
(1 row)

SELECT intindex, strTopic FROM tblmessages
   WHERE idxfti @@ to_tsquery('default', 'gettysburg & address')
   AND strMessage ~* '.*something that does not exist.*';
 intindex | strtopic
----------+----------
(0 rows)
</pre>
<p>Of course if you're indexing both strTopic and strMessage, and
want to search for this phrase on both, then you will have to get
out the brackets and extend this query a little more.</p>
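<p>A sketch of that extended form, assuming the same table and phrase as
above, might look like this:</p>
<pre>
-- Sketch: match the phrase in either column, still narrowing
-- the candidate rows with the full text index first.
SELECT intindex, strTopic FROM tblmessages
   WHERE idxfti @@ to_tsquery('default', 'gettysburg & address')
   AND (strTopic ~* '.*men are created equal.*'
        OR strMessage ~* '.*men are created equal.*');
</pre>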
<h3>TSEARCH2 CONFIGURATION</h3>
<p>Some words such as "and", "the", and "who" are automatically not
indexed, since they belong to a pre-existing dictionary of "Stop
Words" which tsearch2 does not perform indexing on. If someone
needs to search for "The Who" in your database, they are going to
have a tough time coming up with any results, since both words are
ignored in the indexes. But there is a solution.</p>
<p>Let's say we want to add a word into the stop word list for
english stemming. We could edit the file
'/usr/local/pgsql/share/english.stop' and add a word to the list.
I edited mine to exclude my name from indexing:</p>
<pre>
- Edit /usr/local/pgsql/share/english.stop
- Add 'andy' to the list
</pre>
<p>When you connect to the database, the dict_init procedure is run
during initialization, and in my configuration it will read the
stop words from the file I just edited. If you were connected to
the DB while editing the stop words, you will need to end the
current session and re-connect. When you re-connect to the
database, 'andy' is no longer indexed:</p>
<pre>
SELECT to_tsvector('default', 'Andy');
 to_tsvector
-------------

(1 row)
</pre>
<p>Originally I would get the result :</p>
<pre>
SELECT to_tsvector('default', 'Andy');
 to_tsvector
-------------
 'andi':1
(1 row)
</pre>
<p>But since I added it as a stop word, it is now ignored during
indexing. The stop word added was used in the dictionary "en_stem".
If I were to use a different configuration such as 'simple', the
results would be different. There are no stop words for the simple
dictionary. It will just convert to lower case, and index every
unique word:</p>
<pre>
SELECT to_tsvector('simple', 'Andy andy The the in out');
             to_tsvector
-------------------------------------
 'in':5 'out':6 'the':3,4 'andy':1,2
(1 row)
</pre>
<p>All this talk about which configuration to use is leading us
into the actual configuration of tsearch2. In the examples in this
document the configuration has always been specified when using the
tsearch2 functions:</p>
<pre>
SELECT to_tsvector('default', 'Testing the default config');
SELECT to_tsvector('simple', 'Example of simple Config');
</pre>
<p>The pg_ts_cfg table holds each configuration you can use with
the tsearch2 functions. As you can see, the ts_name column contains
both the 'default' configuration, which is based on the 'C' locale, and the
'simple' configuration, which is not based on any locale.</p>
<pre>
SELECT * from pg_ts_cfg;
     ts_name     | prs_name |    locale
-----------------+----------+--------------
 default         | default  | C
 default_russian | default  | ru_RU.KOI8-R
 simple          | default  |
(3 rows)
</pre>
<p>Each row in the pg_ts_cfg table contains the name of the
tsearch2 configuration, the name of the parser to use, and the
locale mapped to the configuration. There is only one parser to
choose from in the table pg_ts_parser, called 'default'. More parsers
could be written, but for our needs we will use the default.</p>
<p>There are 3 configurations installed by tsearch2 initially. If
your locale is set to 'en_US' for example (like my laptop), then as
you can see there is currently no configuration mapped to
that locale. You can either set up a new configuration or just use
one that already exists. If I do not specify which configuration to
use in the to_tsvector function, I receive the following error:</p>
<pre>
SELECT to_tsvector('learning tsearch is like going to school');
ERROR:  Can't find tsearch config by locale
</pre>
<p>We will create a new configuration for use with the server
encoding 'en_US'. The first step is to add a new configuration into
the pg_ts_cfg table. We will call the configuration
'default_english', with the default parser, and use the locale
'en_US':</p>
<pre>
INSERT INTO pg_ts_cfg (ts_name, prs_name, locale)
   VALUES ('default_english', 'default', 'en_US');
</pre>
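<p>The new configuration also needs rows in pg_ts_cfgmap telling the parser
which dictionaries to use for each token type. One way to sketch this (an
assumption on my part, rather than the steps the rest of this section walks
through) is to simply copy the mappings of the existing 'default'
configuration:</p>
<pre>
-- Sketch: reuse the 'default' token-to-dictionary mappings
-- for the new 'default_english' configuration.
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
   SELECT 'default_english', tok_alias, dict_name
   FROM pg_ts_cfgmap
   WHERE ts_name = 'default';
</pre>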
<p>We have only declared that there is a configuration called
'default_english'. We need to set the configuration of how
'default_english' will work. The next step is creating a new
dictionary to use. The configuration of the dictionary is
completely different in tsearch2. In the prior version, to make
changes you would have to re-compile your changes into
tsearch.so. All of the configuration has now been moved into the
system tables created by executing the SQL code from
tsearch2.sql.</p>
<p>Let's take a first look at the pg_ts_dict table:</p>
<pre>
ftstest=# \d pg_ts_dict
           Table "public.pg_ts_dict"
     Column      |     Type     | Modifiers
-----------------+--------------+-----------
 dict_name       | text         | not null
 dict_init       | regprocedure |
 dict_initoption | text         |
 dict_lexize     | regprocedure | not null
 dict_comment    | text         |
Indexes: pg_ts_dict_pkey primary key btree (dict_name)
</pre>
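<p>To see which dictionaries your installation ships with before changing
anything, you can simply inspect the table (the exact rows depend on your
installation):</p>
<pre>
SELECT dict_name, dict_initoption FROM pg_ts_dict;
</pre>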
<p>The dict_name column is the name of the dictionary, for example
'simple', 'en_stem' or 'ru_stem'. The dict_init column is a text
representation of a stored procedure to run for initialization of
that dictionary, for example 'snb_en_init(text)' or
'snb_ru_init(text)'. The initial configuration of tsearch2 had the
dict_init and dict_lexize columns as type oid. The patch mentioned
in the Installation Notes changes these types to regprocedure. The
data inserted or updated can still be the oid of the stored
procedure; the representation is just different. This makes backup
and restore procedures much easier for tsearch2. The dict_initoption
column is used for options passed to the init function of the
stored procedure. In the cases of 'en_stem' or 'ru_stem' it is a
path to a stopword file for that dictionary, for example
'/usr/local/pgsql/share/english.stop'. This is, however, dictated by
the dictionary; ISpell dictionaries may require different options.
The dict_lexize column is another reference to a stored procedure, the
function used to lexize, for example 'snb_lexize(internal,
internal, integer)'. The dict_comment column is just a comment.</p>
<p>Next we will configure the use of a new dictionary based on
ISpell. We will assume you have ISpell installed on your machine
(in /usr/local/lib).</p>
<p>There has been some confusion in the past as to which files are
used from ISpell. ISpell operates using a hash file. This is a
binary file created by the ISpell command line utility "buildhash".
This utility accepts a file containing the words from the
dictionary, and the affixes file, and the output is the hash file.
The default installation of ISpell installs the english hash file
english.hash, which is the exact same file as american.hash. ISpell
uses this as the fallback dictionary.</p>
<p>This hash file is not what tsearch2 requires as the ISpell
interface. The file(s) needed are those used to create the hash.
Tsearch uses the dictionary words for morphology, so the word listing is
needed, not spellchecking. Regardless, these files are included in
the ISpell sources, and you can use them to integrate into
tsearch2. This is not complicated, but is not very obvious to begin
with. The tsearch2 ISpell interface needs only the listing of
dictionary words; it will parse and load those words, and use the
ISpell dictionary for lexem processing.</p>
<p>I found the ISpell make system to be very finicky. Their
documentation actually states this to be the case, so I just did
things the command line way. In the ISpell source tree under
languages/english there are several files. For a
complete description, please read the ISpell README. Basically for
the english dictionary there is the option to create the small,
medium, large and extra large dictionaries. The medium dictionary
is recommended. If the make system is configured correctly, it
would build and install the english.hash file from the medium size
dictionary. Since we are only concerned with the dictionary word
listing ... it can be created from the /languages/english directory
with the following command:</p>
<pre>
sort -u -t/ +0f -1 +0 -T /usr/tmp -o english.med english.0 english.1
</pre>
<p>This will create a file called english.med. You can copy this
file to wherever you like; I placed mine in /usr/local/lib so it
coincides with the ISpell hash files. You can now add the tsearch2
configuration entry for the ISpell English dictionary. We will also
continue to use the English stop word file that was installed for
the en_stem dictionary; you could use a different one if you like.
The ISpell configuration is based on the "ispell_template"
dictionary installed by default with tsearch2. We will use the OIDs
of the stored procedures from the row where dict_name =
'ispell_template'.</p>
INSERT INTO pg_ts_dict
    (SELECT 'en_ispell',
            dict_init,
            'DictFile="/usr/local/lib/english.med",'
            'AffFile="/usr/local/lib/english.aff",'
            'StopFile="/usr/local/pgsql/share/contrib/english.stop"',
            dict_lexize
     FROM pg_ts_dict
     WHERE dict_name = 'ispell_template');
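To confirm the new row was created and carries the intended init
options, you can query pg_ts_dict directly. This is a sketch,
assuming the column names of the stock tsearch2 schema:

```sql
-- Sketch: confirm the 'en_ispell' row exists alongside the template
-- row it was copied from, and inspect its init options.
SELECT dict_name, dict_initoption
FROM pg_ts_dict
WHERE dict_name IN ('en_ispell', 'ispell_template');
```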
<p>Now that we have a dictionary, we can specify its use in a query
to get a lexeme. For this we will use the lexize function, which
takes the name of the dictionary to use as an argument, just as the
other tsearch2 functions do. You will need to stop your psql
session and start it again in order for this modification to take
effect.</p>
SELECT lexize('en_ispell', 'program');
<p>If you wanted to always use the ISpell English dictionary you
have installed, you can configure tsearch2 to always use a specific
dictionary:</p>
SELECT set_curdict('en_ispell');
<p>Lexize is meant to turn a word into a lexeme. It is possible to
receive more than one lexeme for a single word.</p>
SELECT lexize('en_ispell', 'conditionally');
           lexize
-----------------------------
 {conditionally,conditional}
<p>The lexize function is not meant to take a full string as an
argument. If you pass in an entire sentence, it attempts to find
that entire sentence in the dictionary. Since the dictionary
contains only words, you will receive an empty result:</p>
SELECT lexize('en_ispell', 'This is a sentence to lexize');
<p>Likewise, if you lexize a word that is not in the dictionary,
you will receive an empty result. This makes sense because the word
"tsearch" is not in the English dictionary. You can create your own
additions to the dictionary if you like; this may be useful for
scientific or technical glossaries that need to be indexed.</p>
SELECT lexize('en_ispell', 'tsearch');
 lexize
--------

(1 row)
<p>This is not to say that "tsearch" will be ignored when adding
text to the tsvector index column. This will be explained in
greater detail with the table pg_ts_cfgmap.</p>
<p>Next we need to set up the configuration that maps dictionary
use to the lexeme parsings. This is done by altering the
pg_ts_cfgmap table. We will insert several rows, specifying that
the new dictionary we installed and configured should be used for
lexizing within tsearch2. There are several token types for which
we want to force the use of the ISpell dictionary.</p>
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
VALUES ('default_english', 'lhword', '{en_ispell,en_stem}');
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
VALUES ('default_english', 'lpart_hword', '{en_ispell,en_stem}');
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
VALUES ('default_english', 'lword', '{en_ispell,en_stem}');
<p>We have just inserted 3 records into the configuration mapping,
specifying that the token types "lhword", "lpart_hword" and "lword"
are to be stemmed using the 'en_ispell' dictionary we added into
pg_ts_dict, when using the configuration 'default_english' which
we added to pg_ts_cfg.</p>
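You can verify the mappings at any point by querying pg_ts_cfgmap
directly. A sketch, using the stock tsearch2 column names:

```sql
-- Sketch: show which dictionaries each token type maps to for the
-- 'default_english' configuration.
SELECT tok_alias, dict_name
FROM pg_ts_cfgmap
WHERE ts_name = 'default_english'
ORDER BY tok_alias;
```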
<p>There are several other token types used that we do not need to
process with the ISpell dictionary. For these we can simply insert
values using the 'simple' dictionary.</p>
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'url', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'host', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'sfloat', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'uri', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'int', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'float', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'email', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'word', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'hword', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'nlword', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'nlpart_hword', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'part_hword', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'nlhword', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'file', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'uint', '{simple}');
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'version', '{simple}');
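Once all the rows are in, a quick sanity check can confirm that no
token type in the configuration was left without a dictionary
mapping. A sketch:

```sql
-- Sketch: list any token types in 'default_english' whose dictionary
-- mapping is still missing; this should return zero rows when the
-- configuration is complete.
SELECT tok_alias
FROM pg_ts_cfgmap
WHERE ts_name = 'default_english'
  AND dict_name IS NULL;
```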
<p>Our addition of a configuration for 'default_english' is now
complete. We have successfully created a new tsearch2
configuration, and at the same time we have set the new
configuration to be the default for the en_US locale.</p>
SELECT to_tsvector('default_english',
    'learning tsearch is like going to school');
                   to_tsvector
--------------------------------------------------
 'go':5 'like':4 'learn':1 'school':7 'tsearch':2

SELECT to_tsvector('learning tsearch is like going to school');
                   to_tsvector
--------------------------------------------------
 'go':5 'like':4 'learn':1 'school':7 'tsearch':2
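With the configuration working, an actual search combines a stored
tsvector column with to_tsquery via the @@ operator. A hedged
sketch, assuming a hypothetical table "docs" with a tsvector column
"idxfti" (both names are illustrative, not from this document):

```sql
-- Sketch: full text search against a hypothetical table "docs" whose
-- tsvector column "idxfti" was populated with to_tsvector().
SELECT title
FROM docs
WHERE idxfti @@ to_tsquery('default_english', 'learn & school');
```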
<p>Notice here that words like "tsearch" are still parsed and
indexed in the tsvector column. A lexeme is returned for the word
because in the configuration mapping table we specified that words
should first be looked up in the 'en_ispell' dictionary, with
'en_stem' as the fallback. Therefore a lexeme is not returned from
en_ispell, but one is returned from en_stem, and it is added to
the tsvector.</p>
SELECT to_tsvector('learning tsearch is like going to computer school');
                                to_tsvector
---------------------------------------------------------------------------
 'go':5 'like':4 'learn':1 'school':8 'compute':7 'tsearch':2 'computer':7
<p>Notice in this last example I added the word "computer" to the
text to be converted into a tsvector. Because we have set up our
default configuration to use the ISpell English dictionary, the
words are lexized, and "computer" returns 2 lexemes at the same
position: 'compute':7 and 'computer':7 are now both indexed for
the word.</p>
<p>You can create additional dictionary lists, or use the extra
large dictionary from ISpell. You can read through the ISpell
documents and source tree to make modifications as you see
fit.</p>
<p>If you already have a configuration set for the locale and you
are changing to your new dictionary configuration, you will have
to set the locale on the old configuration to NULL. If the old
configuration used the 'C' locale, we would do this:</p>
UPDATE pg_ts_cfg SET locale=NULL WHERE locale = 'C';
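After the update you can verify which configuration now owns which
locale. A sketch over the stock pg_ts_cfg columns:

```sql
-- Sketch: each locale should map to at most one configuration;
-- 'default_english' should now be the only row for your locale.
SELECT ts_name, prs_name, locale
FROM pg_ts_cfg
ORDER BY ts_name;
```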
<p>That about wraps up the configuration of tsearch2. There is
much more you can do with the tables provided; this was just an
introduction to get things working quickly.</p>
<h3>ADDING NEW DICTIONARIES TO TSEARCH2</h3>
<p>To aid in the addition of new dictionaries to the tsearch2
module, you can use another module in combination with tsearch2.
The gendict module is included in the tsearch2 distribution and is
available in the gendict/ subdirectory.</p>
<p>I will not go into detail about installation and instructions
on how to use gendict to its fullest extent right now. You can
read the README.gendict ... it has all of the instructions and
information you will need.</p>
<h3>BACKING UP AND RESTORING DATABASES THAT FEATURE TSEARCH2</h3>
<p><strong>Never rely on anyone else's instructions to back up and
restore a database system; always develop and understand your own
methodology, and test it numerous times before you need to do it
for real.</strong></p>
<p>The backup and restore procedure has changed over time, and
this is not meant to be the bible for tsearch2 backup and restore.
Please read all sections so you have a complete understanding of
some backup and restore issues, test your own procedures, and do
not rely on these instructions alone.</p>
<p>If you come across issues in your own procedures, please feel
free to bring the question up on the Open-FTS and PostgreSQL
mailing lists.</p>
<h3>ORIGINAL BACKUP PROCEDURES</h3>
<p>Originally, tsearch2 had problems when using the pg_dump and/or
pg_dumpall utilities. The problem lies in the original use of OIDs
for column types. Since OIDs are not consistent across pg_dumps,
when you reload the data values into the pg_ts_dict table, for
example, those OIDs no longer point to anything. You would then
end up trying to use a "broken" tsearch2 installation.</p>
<p>The solution was to back up a database using the tsearch2
module in small unique parts, and then load them in the correct
order. You would have to edit the schema and remove the tsearch2
stored procedure references from the SQL file. You would have to
load your global objects, then the tsearch2 objects. You had to
re-create the tsearch2 module before restoring your schema so no
conflicts would arise. Then you could restore your data (all
schemas and types needed for the data were now available).</p>
<p><strong>The original backup instructions were as
follows:</strong></p>
<p>1) Back up any global database objects such as users and groups
(this step is usually only necessary when you will be restoring to
a fresh PostgreSQL installation):</p>
pg_dumpall -g > GLOBALobjects.sql
<p>2) Back up the full database schema using pg_dump:</p>
pg_dump -s DATABASE > DATABASEschema.sql
<p>3) Back up the full database using pg_dump:</p>
pg_dump -Fc DATABASE > DATABASEdata.tar
<p><strong>The original restore procedures were as
follows:</strong></p>
<p>1) Create the blank database.</p>
<p>2) Restore any global database objects such as users and groups
(this step is usually only necessary when you will be restoring to
a fresh PostgreSQL installation):</p>
psql DATABASE < GLOBALobjects.sql
<p>3) Create the tsearch2 objects, functions and operators:</p>
psql DATABASE < tsearch2.sql
<p>4) Edit the backed-up database schema and delete all SQL
commands which create tsearch2-related functions, operators and
data types, BUT NOT fields in table definitions that use the
tsvector type. If you're not sure what these are, they are the
ones listed in tsearch2.sql. Then restore the edited schema to the
database:</p>
psql DATABASE < DATABASEschema.sql
<p>5) Restore the data for the database:</p>
pg_restore -N -a -d DATABASE DATABASEdata.tar
<p>If you get any errors in step 4, it will most likely be because
you forgot to remove an object that was created in tsearch2.sql.
Any errors in step 5 will mean the database schema was probably
restored incorrectly.</p>
<p><strong>Issues with this procedure</strong></p>
<p>As I mentioned before, it is vital that you test out your own
backup and restore procedures. These procedures were originally
adapted from this document's original author, Robert John
Shepherd. They make use of the pg_dump custom archive
functionality. I am not that familiar with the formatting output
of pg_dump and the use of pg_restore; I have always had the luxury
of using text files (everything is DATABASE.sql).</p>
<p>One issue not foreseen in the case of using a binary dump is
when you have added more than the default tsearch2
configurations. Upon reload, the data will fail to restore due to
duplicate primary keys. If you load the tsearch2 module and then
delete the data it loaded into the configuration tables, the data
will restore, but the configurations will be incorrect because you
cannot remove the data using OID references from the custom
archive.</p>
<p>It would be very simple to fix this problem if the data were
not in an archive format. I do believe all of your data would have
been restored properly and you could get things working fairly
easily: all one would have to do is create the configurations as
in the tsearch2.sql file, and then create your custom
configurations afterwards.</p>
<p>I have read in the pg_dump man page that if the tar archive
format is used, it is possible to limit which data is restored
using pg_restore. If anyone has more experience with pg_dump
archives and pg_restore, please feel free to test and contribute
your procedure(s).</p>
<h3>CURRENT BACKUP AND RESTORE PROCEDURES</h3>
<p>Currently, a backup and restore of a database using the
tsearch2 module can be quite simple if you have applied the patch
mentioned in the installation instructions prior to tsearch2
installation. This patch removes the use of the OID column: the
text representation of the stored procedures is dumped with the
data, and the restoration of the data works seamlessly.</p>
<p>1) To back up the database:</p>
pg_dump DATABASE > DATABASE.sql
<p>2) To restore the database:</p>
psql -d DATABASE -f DATABASE.sql
<p>This procedure is now like any normal backup and restore
procedure. I cannot say whether this has been proven using the
pg_dump archive formats and restoring with pg_restore, but in
theory there should be no problems with any format after the patch
is applied.</p>
<p>This restoration procedure should not be an issue with the
patch applied; as of version 8.0 of PostgreSQL the patch is part
of the release, and only versions 7.3 and 7.4 are affected. You
can avoid any trouble by applying the patch prior to installation,
or by running the provided SQL script against the live database
before backup and restore is done.</p>