~ubuntu-branches/ubuntu/warty/swish-e/warty

Viewing changes to pod/SWISH-CONFIG.pod

Committer: Bazaar Package Importer
Author(s): Ludovic Drolez
Date: 2004-03-11 08:41:07 UTC
mfrom: (1.1.1 upstream)
Revision ID: james.westby@ubuntu.com-20040311084107-7vp0mu82blq1qjvo

Tags: 2.4.1-3

http://bugs.debian.org/237332

Oops ! A comment was not removed to disable interactive compilation.
Closes: Bug#237332

files added:
INSTALL

Makefile.am

README

README.cvs

aclocal.m4

conf/Makefile.am

conf/Makefile.in

conf/README

conf/example1.config

conf/example2.config

conf/example3.config

conf/example4.config

conf/example5.config

conf/example6.config

conf/example7.config

conf/example8.config

conf/example9.config

conf/example9.pl

config

config/acinclude.m4

config/compile

config/config.guess

config/config.sub

config/depcomp

config/install-sh

config/ltmain.sh

config/missing

config/mkinstalldirs

debian/README.Debian

debian/compat

debian/files

debian/po

debian/po/POTFILES.in

debian/po/fr.po

debian/po/templates.pot

debian/postinst

debian/postrm

debian/prerm

debian/swish-e-dev.files

debian/swish-e.config

debian/swish-e.dirs

debian/swish-e.doc-base

debian/swish-e.substvars

debian/swish-e.templates

debian/watch

doc/Makefile.in

doc/Pod

doc/Pod/HtmlPsPdf

doc/Pod/HtmlPsPdf.pm

doc/Pod/HtmlPsPdf/Book.pm

doc/Pod/HtmlPsPdf/Chapter.pm

doc/Pod/HtmlPsPdf/Common.pm

doc/Pod/HtmlPsPdf/Config.pm

doc/Pod/HtmlPsPdf/Html.pm

doc/Pod/HtmlPsPdf/RunTime.pm

doc/Version.pm

doc/bin

doc/bin/build

doc/bin/html2ps

doc/bin/pod2hpp

doc/conf

doc/conf/html2ps-global.conf

doc/conf/html2ps-slides.conf

doc/conf/html2ps.conf

doc/conf/html2ps.html

doc/conf/hyphen.tex

doc/split.conf

doc/swish.conf

doc/tmpl

doc/tmpl/index.tmpl

doc/tmpl/indexps.tmpl

doc/tmpl/page.tmpl

doc/tmpl/pageps.tmpl

doc/tmpl/splitpage.tmpl

example

example/Makefile.am

example/Makefile.in

example/README

example/SWISH-Stemmer-0.05.tar.gz

example/modules

example/modules/SWISH

example/modules/SWISH/DateRanges.pm

example/modules/SWISH/DefaultHighlight.pm

example/modules/SWISH/ParseQuery.pm

example/modules/SWISH/PhraseHighlight.pm

example/modules/SWISH/SimpleHighlight.pm

example/modules/SWISH/TemplateDefault.pm

example/modules/SWISH/TemplateDumper.pm

example/modules/SWISH/TemplateFrame.pm

example/modules/SWISH/TemplateHTMLTemplate.pm

example/modules/SWISH/TemplateToolkit.pm

example/search.cgi.in

example/search.tt

example/swish.cgi.in

example/swish.gif

example/swish.tmpl

filter-bin/Makefile.am

filter-bin/Makefile.in

filter-bin/_pdf2html.pl

filter-bin/swish_filter.pl.in

filters

filters/Makefile.am

filters/Makefile.in

filters/README

filters/SWISH

filters/SWISH/Filter.pm.in

filters/SWISH/Filters

filters/SWISH/Filters/Doc2txt.pm

filters/SWISH/Filters/ID3toHTML.pm

filters/SWISH/Filters/Pdf2HTML.pm

filters/SWISH/Filters/XLtoHTML.pm

filters/SWISH/Makefile.am

filters/SWISH/Makefile.in

filters/swish-filter-test.in

html

html/.htaccess

html/.swishcgi.conf

html/API.html

html/CHANGES.html

html/Filter.html

html/INSTALL.html

html/Makefile.am

html/Makefile.in

html/README.html

html/SWISH-3.0.html

html/SWISH-BUGS.html

html/SWISH-CONFIG.html

html/SWISH-FAQ.html

html/SWISH-LIBRARY.html

html/SWISH-RUN.html

html/SWISH-SEARCH.html

html/images

html/images/dotrule1.gif

html/images/swish.gif

html/images/swish2.gif

html/images/swish2b.gif

html/images/swishbanner1.gif

html/index.html

html/index_long.html

html/search.html

html/searchdoc.html

html/spider.html

html/split.pl

html/style.css

html/swish.conf

html/swish.html

html/timestamp

man/Makefile.am

man/Makefile.in

man/SWISH-CONFIG.1

man/SWISH-FAQ.1

man/SWISH-LIBRARY.1

man/SWISH-RUN.1

man/swish-e.1

perl

perl/API.pm

perl/API.xs

perl/Changes

perl/MANIFEST

perl/Makefile

perl/Makefile.PL

perl/Makefile.old

perl/README

perl/t

perl/t/first.html

perl/t/second.html

perl/t/test.conf

perl/t/test.t

perl/t/third.html

perl/typemap

pod/.htaccess

pod/CHANGES.pod

pod/INSTALL.pod

pod/README.pod

pod/SWISH-3.0.pod

pod/SWISH-BUGS.pod

pod/SWISH-CONFIG.pod

pod/SWISH-FAQ.pod

pod/SWISH-LIBRARY.pod

pod/SWISH-RUN.pod

pod/SWISH-SEARCH.pod

pod/images

pod/images/dotrule1.gif

pod/images/swish.gif

pod/images/swish2.gif

pod/images/swish2b.gif

pod/images/swishbanner1.gif

pod/search.cgi

pod/style.css

pod/swish-e.pod

prog-bin

prog-bin/DirTree.pl.in

prog-bin/Makefile.am

prog-bin/Makefile.in

prog-bin/MySQL.pl

prog-bin/README

prog-bin/SwishSpiderConfig.pl

prog-bin/doc2txt.pm

prog-bin/file.pl

prog-bin/index_hypermail.pl

prog-bin/pdf2html.pm

prog-bin/pdf2xml.pm

prog-bin/spider.pl.in

rpm/swish-e.spec.in

rpm/swish-e.xpm

src/Makefile.am

src/acconfig.h.in

src/array.c

src/array.h

src/bash.c

src/bash.h

src/btree.c

src/btree.h

src/compress.c

src/compress.h

src/date_time.c

src/date_time.h

src/db.h

src/db_native.c

src/db_native.h

src/db_read.c

src/db_write.c

src/docprop_write.c

src/double_metaphone.c

src/double_metaphone.h

src/dump.c

src/dump.h

src/entities.c

src/entities.h

src/expat

src/expat/COPYING

src/expat/Makefile.am

src/expat/Makefile.in

src/expat/expat.dsw

src/expat/xmlparse

src/expat/xmlparse.c

src/expat/xmlparse/xmlparse.c

src/expat/xmlparse/xmlparse.dsp

src/expat/xmlparse/xmlparse.h

src/expat/xmlrole.c

src/expat/xmltok

src/expat/xmltok.c

src/expat/xmltok/ascii.h

src/expat/xmltok/asciitab.h

src/expat/xmltok/dllmain.c

src/expat/xmltok/iasciitab.h

src/expat/xmltok/latin1tab.h

src/expat/xmltok/nametab.h

src/expat/xmltok/utf8tab.h

src/expat/xmltok/xmldef.h

src/expat/xmltok/xmlrole.c

src/expat/xmltok/xmlrole.h

src/expat/xmltok/xmltok.c

src/expat/xmltok/xmltok.dsp

src/expat/xmltok/xmltok.h

src/expat/xmltok/xmltok_impl.c

src/expat/xmltok/xmltok_impl.h

src/expat/xmltok/xmltok_ns.c

src/extprog.c

src/extprog.h

src/fhash.c

src/fhash.h

src/filter.c

src/filter.h

src/getruntime.c

src/getruntime.h

src/headers.c

src/headers.h

src/html.c

src/html.h

src/keychar_out.c

src/keychar_out.h

src/libtest.c

src/metanames.c

src/metanames.h

src/parse_conffile.c

src/parse_conffile.h

src/parser.c

src/parser.h

src/pre_sort.c

src/proplimit.c

src/proplimit.h

src/ramdisk.c

src/ramdisk.h

src/rank.c

src/rank.h

src/replace

src/replace/Makefile.am

src/replace/Makefile.in

src/replace/dummy.c

src/replace/mkstemp.c

src/replace/mkstemp.h

src/replace/vsnprintf.c

src/result_output.c

src/result_output.h

src/result_sort.c

src/result_sort.h

src/snowball

src/snowball/Makefile.am

src/snowball/Makefile.in

src/snowball/api.c

src/snowball/api.h

src/snowball/header.h

src/snowball/stem_de.c

src/snowball/stem_de.h

src/snowball/stem_dk.c

src/snowball/stem_dk.h

src/snowball/stem_en1.c

src/snowball/stem_en1.h

src/snowball/stem_en2.c

src/snowball/stem_en2.h

src/snowball/stem_es.c

src/snowball/stem_es.h

src/snowball/stem_fi.c

src/snowball/stem_fi.h

src/snowball/stem_fr.c

src/snowball/stem_fr.h

src/snowball/stem_it.c

src/snowball/stem_it.h

src/snowball/stem_nl.c

src/snowball/stem_nl.h

src/snowball/stem_no.c

src/snowball/stem_no.h

src/snowball/stem_pt.c

src/snowball/stem_pt.h

src/snowball/stem_ru.c

src/snowball/stem_ru.h

src/snowball/stem_se.c

src/snowball/stem_se.h

src/snowball/utilities.c

src/swish-e.h

src/swish2.c

src/swish_qsort.c

src/swish_qsort.h

src/swish_words.c

src/swish_words.h

src/swregex.c

src/swregex.h

src/swstring.c

src/swstring.h

src/sys.h

src/txt.c

src/txt.h

src/vms

src/vms/acconfig.h_vms

src/vms/build_swish-e.com

src/vms/config.h

src/vms/descrip_axp.mms

src/vms/descrip_libxml2.mms

src/vms/descrip_vax.mms

src/vms/libtest.opt

src/vms/readme_vms.txt

src/vms/regex.c

src/vms/regex.h

src/vms/regexpr.h

src/vms/swish.opt

src/win32/build.sh

src/win32/dist.sh

src/win32/installer.nsi

src/win32/libswishe.dsp

src/win32/swishe.dsp

src/win32/swishe.dsw

src/worddata.c

src/worddata.h

src/xml.c

src/xml.h

tests/Makefile.am

tests/Makefile.in

tests/check_index

tests/check_metasearch

tests/check_search

tests/common.sh

tests/test.txt

tests/test.xml

files removed:
README-2.0

README-FILTERS

README-SWISH-E

conf/user.config

doc/swish-e.txt

filter-bin/_docfilter.sh

filter-bin/_pdffilter.sh

filters.html

src/Makefile.old

src/configure

src/configure.in

src/string.c

src/string.h

src/win32/regex.c

src/win32/regex.h

swish-e.man

tests/test.index

files modified:
Makefile.in

configure

configure.in

debian/changelog

debian/control

debian/copyright

debian/rules

filter-bin/README

filter-bin/_binfilter.sh

src/Makefile.in

src/check.c

src/check.h

src/config.h

src/docprop.c

src/docprop.h

src/error.c

src/error.h

src/file.c

src/file.h

src/fs.c

src/fs.h

src/hash.c

src/hash.h

src/http.c

src/http.h

src/httpserver.c

src/httpserver.h

src/index.c

src/index.h

src/list.c

src/list.h

src/mem.c

src/mem.h

src/merge.c

src/merge.h

src/methods.c

src/search.c

src/search.h

src/soundex.c

src/soundex.h

src/stemmer.c

src/stemmer.h

src/swish.c

src/swish.h

src/swishspider

src/win32/dirent.c

src/win32/dirent.h

tests/test.config

tests/test.html

tests/test_meta.html

tests/test_meta2.html

tests/test_phrase.html

pod/SWISH-CONFIG.pod

=head1 NAME

SWISH-CONFIG - Configuration File Directives

=head1 Swish-e CONFIGURATION FILE

Which files Swish-e indexes, how they are indexed, and where the index
is written can all be controlled by a configuration file.

The configuration file is a text file composed of comments, blank
lines, and B<configuration directives>. The order of the directives
is not important. Some directives may be used more than once in the
configuration file, while others can only be used once (i.e. a later
occurrence overwrites an earlier one). Directive names are not case
sensitive -- you may use upper, lower, or mixed case.

A comment is any line that begins with a "#".

    # This is a comment

Directives may take more than one parameter. Enclose single parameters
that include whitespace in quotes (single or double). Inside quotes
the backslash escapes the next character.

    ReplaceRules append "foo bar"   <- define "foo bar" as a single parameter

If you need to include a quote character in the value, either use a
backslash to escape it or enclose it in quotes of the other type.

For example, under Unix you can use quotes to include white space in a
single parameter. Here, single quotes protect path names (%p) that might
have white space embedded (this also protects against
shell expansion or metacharacters):

    FileFilter .foo foofilter "'%p'"   <- parameter passed through the shell in single quotes
    FileFilter .foo foofilter '"%p"'   <- Windows uses double quotes
    FileFilter .foo foofilter '\'%p\'' <- silly example

Backslashes also have special meaning in regular expressions.

    FileFilterMatch pdftotext "'%p' -" /\.pdf$/

This says that the dot is a real dot (instead of matching any character).
If you place the regular expression in quotes then you must use
double backslashes.

    FileFilterMatch pdftotext "'%p' -" "/\\.pdf$/"

Swish-e will convert the double backslash into a single backslash before
passing the parameter to the regular expression compiler.
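
The quoting rules described above can be sketched in a few lines of Python. This is only an illustration of the rules as stated in this section, not Swish-e's actual parser. The function name C<split_params> is hypothetical, and treating a backslash outside of quotes as an ordinary character is an assumption based on the unquoted C</\.pdf$/> example.

```python
def split_params(line):
    """Split a directive line into parameters using the quoting rules above."""
    params = []
    current = []
    in_param = False   # True once the current parameter has started
    quote = None       # the open quote character, or None outside quotes
    i = 0
    while i < len(line):
        ch = line[i]
        if quote:
            if ch == "\\" and i + 1 < len(line):
                # Inside quotes the backslash escapes the next character.
                current.append(line[i + 1])
                i += 1
            elif ch == quote:
                quote = None
            else:
                current.append(ch)
        elif ch in "'\"":
            quote = ch
            in_param = True
        elif ch.isspace():
            if in_param:
                params.append("".join(current))
                current, in_param = [], False
        else:
            # Assumption: outside quotes a backslash is kept as-is.
            current.append(ch)
            in_param = True
        i += 1
    if quote:
        raise ValueError("unterminated quote")
    if in_param:
        params.append("".join(current))
    return params


# The quoted and unquoted forms of the regular expression yield the same value.
quoted = split_params(r'''FileFilterMatch pdftotext "'%p' -" "/\\.pdf$/"''')
unquoted = split_params(r'''FileFilterMatch pdftotext "'%p' -" /\.pdf$/''')
print(quoted == unquoted)
```

Under these assumptions both calls return the same four parameters, which matches the statement that the double backslash is reduced to a single one before the expression is compiled.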

Commented example configuration files are included in the F<conf>
directory of the Swish-e distribution.

Some command line arguments can override directives specified in the
configuration file. Please see also L<SWISH-RUN|SWISH-RUN> for
instructions on running Swish-e, and the L<SWISH-SEARCH|SWISH-SEARCH>
page for information and examples on how to search your index.

The configuration file is specified to Swish-e by the C<-c> switch.
For example,

    swish-e -c myconfig.conf

You may also split your directives up into different configuration files.
This allows you to have a master configuration file used for many
different indexes, and smaller configuration files for each separate
index. You can specify the different configuration files when running
from the command line with the C<-c> switch (see L<SWISH-RUN|SWISH-RUN>),
or you may include other configuration files with the B<IncludeConfigFile>
directive below.

Typically, in a configuration file the directives are grouped together in
some logical order -- that is, directives that control the source of the
documents would be grouped together first, and directives that control
how each document is filtered or how its words are indexed would form
another group. (The directives listed below are grouped in this order.)

The configuration file directives are listed below in these groups:

=over 4

=item *

L<Administrative Headers Directives|/"Administrative Headers Directives">
-- You may add administrative information to the header of the index file.

=item *

L<Document Source Directives|/"Document Source Directives"> -- Directives
for selecting the source documents and the location of the index file.

=item *

L<Document Contents Directives|/"Document Contents Directives"> --
Directives that control how a document's content is indexed.

=item *

L<Directives for the File Access method only|/"Directives for the File
Access method only"> -- These directives are only applicable to the File
Access indexing method.

=item *

L<Directives for the HTTP Access Method Only|/"Directives for the HTTP
Access Method Only"> -- Likewise, these only apply to the HTTP Access
method.

=item *

L<Directives for the prog Access Method Only|/"Directives for the prog
Access Method Only"> -- These only apply to the prog Access method.

=item *

L<Document Filter Directives|/"Document Filter Directives"> -- This is
a special section that describes using document filters with Swish-e.

=back

=head2 Alphabetical Listing of Directives

=over 4

=item *

L<AbsoluteLinks|/"item_AbsoluteLinks"> [yes|NO]

=item *

L<BeginCharacters|/"item_BeginCharacters"> *string of characters*

=item *

L<BumpPositionCounterCharacters|/"item_BumpPositionCounterCharacters"> *string*

=item *

L<Buzzwords|/"item_Buzzwords"> [*list of buzzwords*|File: path]

=item *

L<ConvertHTMLEntities|/"item_ConvertHTMLEntities"> [YES|no]

=item *

L<DefaultContents|/"item_DefaultContents"> [TXT|HTML|XML|TXT2|HTML2|XML2|TXT*|HTML*|XML*]

=item *

L<Delay|/"item_Delay"> *seconds*

=item *

L<DontBumpPositionOnEndTags|/"item_DontBumpPositionOnEndTags"> *list of names*

=item *

L<DontBumpPositionOnStartTags|/"item_DontBumpPositionOnStartTags"> *list of names*

=item *

L<EnableAltSearchSyntax|/"item_EnableAltSearchSyntax"> [yes|NO]

=item *

L<EndCharacters|/"item_EndCharacters"> *string of characters*

=item *

L<EquivalentServer|/"item_EquivalentServer"> *server alias*

=item *

=item *

L<FileFilter|/"item_FileFilter"> *suffix* *program* [options]

=item *

L<FileFilterMatch|/"item_FileFilterMatch"> *program* *options* *regex* [*regex* ...]

=item *

L<FileInfoCompression|/"item_FileInfoCompression"> [yes|NO]

=item *

L<FileMatch|/"item_FileMatch"> [contains|is|regex] *regular expression*

=item *

L<FileRules|/"item_FileRules"> [contains|is|regex] *regular expression*

=item *

=item *

L<FollowSymLinks|/"item_FollowSymLinks"> [yes|NO]

=item *

L<HTMLLinksMetaName|/"item_HTMLLinksMetaName"> *metaname*

=item *

L<IgnoreFirstChar|/"item_IgnoreFirstChar"> *string of characters*

=item *

L<IgnoreLastChar|/"item_IgnoreLastChar"> *string of characters*

=item *

L<IgnoreLimit|/"item_IgnoreLimit"> *integer integer*

=item *

L<IgnoreMetaTags|/"item_IgnoreMetaTags"> *list of names*

=item *

L<IgnoreNumberChars|/"item_IgnoreNumberChars"> *list of characters*

=item *

L<IgnoreTotalWordCountWhenRanking|/"item_IgnoreTotalWordCountWhenRanking"> [YES|no]

=item *

L<IgnoreWords|/"item_IgnoreWords"> [*list of stop words*|File: path]

=item *

L<ImageLinksMetaName|/"item_ImageLinksMetaName"> *metaname*

=item *

L<IncludeConfigFile|/"item_IncludeConfigFile">

=item *

L<IndexAdmin|/"item_IndexAdmin"> *text*

=item *

L<IndexAltTagMetaName|/"item_IndexAltTagMetaName"> *tagname*|as-text

=item *

L<IndexComments|/"item_IndexComments"> [yes|NO]

=item *

=item *

L<IndexDescription|/"item_IndexDescription"> *text*

=item *

L<IndexDir|/"item_IndexDir"> [URL|directories or files]

=item *

L<IndexFile|/"item_IndexFile"> *path*

=item *

L<IndexName|/"item_IndexName"> *text*

=item *

L<IndexOnly|/"item_IndexOnly"> *list of file suffixes*

=item *

L<IndexPointer|/"item_IndexPointer"> *text*

=item *

L<IndexReport|/"item_IndexReport"> [0|1|2|3]

=item *

L<MaxDepth|/"item_MaxDepth"> *integer*

=item *

L<MaxWordLimit|/"item_MaxWordLimit"> *integer*

=item *

L<MetaNameAlias|/"item_MetaNameAlias"> *meta name* *list of aliases*

=item *

L<MetaNames|/"item_MetaNames"> *list of names*

=item *

L<MinWordLimit|/"item_MinWordLimit"> *integer*

=item *

L<NoContents|/"item_NoContents"> *list of file suffixes*

=item *

L<obeyRobotsNoIndex|/"item_obeyRobotsNoIndex"> [yes|NO]

=item *

L<ParserWarnLevel|/"item_ParserWarnLevel"> [0|1|2|3]

=item *

L<PreSortedIndex|/"item_PreSortedIndex"> *list of property names*

=item *

L<PropCompressionLevel|/"item_PropCompressionLevel"> [0-9]

=item *

L<PropertyNameAlias|/"item_PropertyNameAlias"> *property name* *list of aliases*

=item *

L<PropertyNames|/"item_PropertyNames"> *list of meta names*

=item *

L<PropertyNamesCompareCase|/"item_PropertyNamesCompareCase"> *list of meta names*

=item *

L<PropertyNamesIgnoreCase|/"item_PropertyNamesIgnoreCase"> *list of meta names*

=item *

L<PropertyNamesNoStripChars|/"item_PropertyNamesNoStripChars"> *list of meta names*

=item *

L<PropertyNamesDate|/"item_PropertyNamesDate"> *list of meta names*

=item *

L<PropertyNamesNumeric|/"item_PropertyNamesNumeric"> *list of meta names*

=item *

L<PropertyNamesMaxLength|/"item_PropertyNamesMaxLength"> integer *list of meta names*

=item *

L<PropertyNamesSortKeyLength|/"item_PropertyNamesSortKeyLength"> integer *list of meta names*

=item *

=item *

L<ResultExtFormatName|/"item_ResultExtFormatName"> name -x format string

=item *

L<SpiderDirectory|/"item_SpiderDirectory"> *path*

=item *

L<StoreDescription|/"item_StoreDescription"> [XML <tag>|HTML <meta>|TXT size]

=item *

L<SwishProgParameters|/"item_SwishProgParameters"> *list of parameters*

=item *

L<SwishSearchDefaultRule|/"item_SwishSearchDefaultRule"> [<AND-WORD>|<or-word>]

=item *

L<SwishSearchOperators|/"item_SwishSearchOperators"> <and-word> <or-word> <not-word>

=item *

L<TmpDir|/"item_TmpDir"> *path*

=item *

L<TranslateCharacters|/"item_TranslateCharacters"> [*string1 string2*|:ascii7:]

=item *

L<TruncateDocSize|/"item_TruncateDocSize"> *number of characters*

=item *

L<UndefinedMetaTags|/"item_UndefinedMetaTags"> [error|ignore|INDEX|auto]

=item *

=item *

L<UseStemming|/"item_UseStemming"> [yes|NO]

=item *

L<UseSoundex|/"item_UseSoundex"> [yes|NO]

=item *

L<UseWords|/"item_UseWords"> [*list of words*|File: path]

=item *

L<WordCharacters|/"item_WordCharacters"> *string of characters*

=item *

L<XMLClassAttributes|/"item_XMLClassAttributes"> *list of XML attribute names*

=back

=head2 Directives that Control Swish

These configuration directives control the general behavior of Swish-e.

=over 4

=item IncludeConfigFile *path to config file*

This directive can be used to include configuration directives located
in another file.

    IncludeConfigFile /usr/local/swish/conf/site_config.config

=item IndexReport [0|1|2|3]

This sets how detailed the reporting is while indexing. You can specify
a number from 0 to 3: 0 is totally silent, 3 is the most verbose. The default
is 1.

This may be overridden from the command line via the C<-v> switch (see
L<SWISH-RUN|SWISH-RUN>).

=item ParserWarnLevel [0|1|2|3]

Sets the error level when using the libxml2 parser for XML and HTML.
libxml2 will point out structural errors in your documents.

    0 = no report
    1 = fatal errors
    2 = errors
    3 = warnings

The exception is that UTF-8 to Latin-1 conversion errors are reported at
level 1, because words may be indexed incorrectly in these cases.

Note that unlike other errors generated by Swish-e, these errors are
sent to stderr.

=item IndexFile *path*

IndexFile specifies the location of the generated index file. If not
specified, Swish-e will create the file F<index.swish-e> in the current
directory.

    IndexFile /usr/local/swish/site.index

=item obeyRobotsNoIndex [yes|NO]

When enabled, Swish-e will not index any HTML file that contains:

The default is to ignore these meta tags and index the document.
This tag is described at http://www.robotstxt.org/wc/exclusion.html.

Note: This feature is only available with the libxml2 HTML parser.

Also, if you are using the libxml2 parser (HTML2 and XML2) then you can use the following
comments in your documents to prevent indexing:

and/or these may be used also:

For example, these are very helpful to prevent indexing of common headers, footers, and menus.

=back

B<NOTE>: The following items are currently not available. These items
require Swish-e to parse the configuration file while searching.

=over 4

=item EnableAltSearchSyntax [yes|NO]

B<NOTE>: This item is currently not available.

Enable alternate search syntax. This allows a basic search syntax like
that of "Altavista(c)", "Lycos(c)", etc. This means a search
query can contain "+" and "-" as syntax operators.

Example:

    swish-e -w "+word1 +word2 -word3 word4 word5"
    "+" = following word has to be in all found documents
    "-" = following word may not be in any document found
    " " = following word will be searched in documents
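
The meaning of the "+" and "-" prefixes can be illustrated with a small Python sketch. It only classifies the words of a query string in the way described for C<EnableAltSearchSyntax>. It is not Swish-e code, the function name C<parse_alt_query> is hypothetical, and treating a lone "+" or "-" as an ordinary word is an assumption.

```python
def parse_alt_query(query):
    """Sort the words of a "+word -word word" query into three groups."""
    required, excluded, optional = [], [], []
    for token in query.split():
        if token.startswith("+") and len(token) > 1:
            # "+": the word has to be in all found documents.
            required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            # "-": the word may not be in any found document.
            excluded.append(token[1:])
        else:
            # No prefix: the word is simply searched for.
            optional.append(token)
    return required, excluded, optional


required, excluded, optional = parse_alt_query("+word1 +word2 -word3 word4 word5")
print(required, excluded, optional)
```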

=item SwishSearchOperators <and-word> <or-word> <not-word>

B<NOTE>: This item is currently not available.

Using this config directive you can change the boolean search operators of
Swish-e, e.g. to adapt these to your language.
The default is: AND OR NOT

Example (German):

    SwishSearchOperators UND ODER NICHT

=item SwishSearchDefaultRule [<AND-WORD>|<or-word>]

B<NOTE>: This item is currently not available.

C<SwishSearchDefaultRule> defines the default Boolean operator to use
if none is specified between words or phrases. The default is C<AND>.

The word you specify must match one of the available
C<SwishSearchOperators>.

Example:

    SwishSearchOperators UND ODER NICHT
    # Make it act like a web search engine
    SwishSearchDefaultRule ODER

=item ResultExtFormatName name -x format string

B<NOTE>: This item is currently not available.

The output of Swish-e can be defined by specifying a format string with
the C<-x> command line argument. Using C<ResultExtFormatName> you can
assign a predefined format string to a name.

Examples:

    ResultExtFormatName moreinfo "%c|%r|%t|%p|<author>|<publishyear>\n"

Then when searching you can specify the format string's name:

    swish-e ... -x moreinfo ...

See the C<-x> switch in L<SWISH-RUN|SWISH-RUN> for more information
about output formats.

=back

=head2 Administrative Headers Directives

Swish-e stores configuration information in the header of the index file.
This information can be retrieved while searching or by functions in
the Swish-e C library. There are a number of fields available for your
own use. None of these fields are required:

=over 4

=item IndexName *text*

=item IndexDescription *text*

=item IndexPointer *text*

=item IndexAdmin *text*

These variables specify information that goes into index files to help
users and administrators. IndexName should be the name of your index,
like a book title. IndexDescription is a short description of the index
or a URL pointing to a more full description. IndexPointer should be
a pointer to the original information, most likely a URL. IndexAdmin
should be the name of the index maintainer and can include name and email
information. These values should not be more than 70 or so characters
and should be contained in quotes. Note that the automatically generated
date in index files is in D/M/Y and 24-hour format.

Examples:

    IndexName "Linux Documentation"
    IndexDescription "This is an index of /usr/doc on our Linux machine."
    IndexPointer http://localhost/swish/linux/index.html
    IndexAdmin webmaster

=back

=head2 Document Source Directives

These directives control I<what> documents are indexed and I<how>
they are accessed. See also L<Directives for the File Access method
only|/"Directives for the File Access method only"> and L<Directives for
the HTTP Access Method Only|/"Directives for the HTTP Access Method Only">
for directives that are specific to those access methods.

=over 4

=item IndexDir [directories or files|URL|external program]

IndexDir defines the source of the documents for Swish-e. Swish-e
currently supports three file access methods: B<File system>, B<HTTP>
(also called B<spidering>), and B<prog> for reading files from an
external program.

The C<-S> command line argument is used to select the file access method.

    swish-e -c swish.config -S fs   - file system
    swish-e -c swish.config -S http - internal http spider
    swish-e -c swish.config -S prog - external program of any type

For the B<fs> method of access B<IndexDir> is a space-separated
list of files and directories to index. Use a forward slash as the path
separator in MS Windows.

For the B<http> method the B<IndexDir> setting is a list of space-separated
URLs.

For the B<prog> method the B<IndexDir> setting is a list of space-separated
programs to run (which generate documents for swish to index).

You may specify more than one B<IndexDir> directive.

Any sub-directories of any listed directory will also be indexed.

Note: While I<processing> directories, Swish-e will ignore any files
or directories that begin with a dot ("."). You may index files
or directories that begin with a dot by specifying their name with
C<IndexDir> or C<-i>.

Examples:

    # Index this directory and any subdirectories
    IndexDir /usr/local/home/http

    # Index the docs directory in current directory
    IndexDir ./docs

    # Index these files in the current directory
    IndexDir ./index.html ./page1.html ./page2.html
    # and index this directory, too
    IndexDir ../public_html
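
The file system behavior described above (sub-directories are followed, entries that begin with a dot are skipped unless they are named explicitly) can be sketched in Python as follows. This is a rough model of the documented behavior only, not how Swish-e is implemented. The function name C<files_to_index> is hypothetical, and the sorted traversal order is an assumption made to keep the output predictable.

```python
import os


def files_to_index(paths):
    """Yield files under IndexDir-style paths, skipping dot entries."""
    for path in paths:
        if os.path.isfile(path):
            # Named explicitly, so it is yielded even if it begins with a dot.
            yield path
            continue
        # A directory named explicitly is walked even if it begins with a dot.
        for root, dirs, files in os.walk(path):
            # Prune dot directories in place so os.walk does not descend into them.
            dirs[:] = sorted(d for d in dirs if not d.startswith("."))
            for name in sorted(files):
                if not name.startswith("."):
                    yield os.path.join(root, name)


if __name__ == "__main__":
    # Hypothetical paths, mirroring the IndexDir examples above.
    for found in files_to_index(["./docs", "./index.html"]):
        print(found)
```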

For the B<HTTP> method of access specify the URLs from which
you want the spidering to begin.

Example:

    IndexDir http://www.my-site.com/index.html
    IndexDir http://localhost/index.html

Obviously, using the B<HTTP> method to index is B<much> slower than
indexing local files. Be aware that some sites do not appreciate
spidering and may block your IP address. You may wish to contact the
remote site before spidering their web site. More information about
spidering can be found in L<Directives for the HTTP Access Method
Only|/"Directives for the HTTP Access Method Only"> below.

For the L<prog|SWISH-RUN/"item_prog"> method of access B<IndexDir>
specifies the path to the program(s) to execute. The external program
must correctly format the documents being passed back to Swish-e.
Examples of external programs are provided in the F<prog-bin> directory.

    IndexDir ./myprogram.pl

See L<prog|SWISH-RUN/"item_prog"> for details.

Note: Not all directives work with all methods.

=item NoContents *list of file suffixes*

Files with these suffixes will B<not> have their contents indexed,
but will have their path name (file name) indexed instead.

If the file's type is HTML or HTML2 (as set by C<IndexContents> or
C<DefaultContents>) then the file will be parsed for an HTML title and
that title will be indexed. Note that you must set the file's type with
C<IndexContents> or C<DefaultContents>:
C<.html> and C<.htm> are NOT type HTML by default. For example:

    IndexContents HTML* .htm .html

If a title is found, it will still be checked for C<FileRules title>, and the file will be
skipped if a match is found. See C<FileRules>.

If the file's type is not HTML, or it is HTML and no title is found,
then the file's path will be indexed.

For example, this will allow searching by image file name.

    NoContents .gif .xbm .au .mov .mpg .pdf .ps

Note: Using this directive will B<not> cause files with those suffixes
to be indexed. That is, if you use C<IndexOnly> to limit the types of
files that are indexed, then you must specify in C<IndexOnly> the same
suffixes listed in C<NoContents>.

This does B<not> work:

    # Wrong!
    IndexOnly .htm .html
    NoContents .gif .xbm .au .mov .mpg .pdf .ps

A C<-S prog> program may set the C<No-Contents:> header
to enable this feature for a specific document (although it would be
smarter for the C<-S prog> program to simply only send the pathname or
title to be indexed).
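
The decision described for C<NoContents> can be summarized in a short Python sketch. It models only the rules stated in this item: the C<FileRules title> check and the C<IndexOnly> interaction are left out, the function name C<text_to_index> and its parameters are hypothetical, and case-sensitive suffix matching is an assumption.

```python
def text_to_index(path, doc_type, title, no_contents_suffixes):
    """Return what is indexed for a file covered by NoContents, else None."""
    if not path.endswith(tuple(no_contents_suffixes)):
        # Not listed in NoContents: the contents are indexed as usual.
        return None
    if doc_type in ("HTML", "HTML2") and title:
        # HTML files with a title have that title indexed.
        return title
    # Any other type, or HTML without a title: index the path name.
    return path


suffixes = [".gif", ".xbm", ".au", ".mov", ".mpg", ".pdf", ".ps"]
print(text_to_index("images/logo.gif", "TXT", None, suffixes))
```

The example call prints the path itself, which is what makes searching by image file name possible.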

741

742

=item ReplaceRules [replace|remove|prepend|append|regex]

743

744

ReplaceRules allows you to make changes to file pathnames before

745

they're indexed. These changed file names or URLs will be returned in

746

search results.

747

748

For example, you may index your files locally (with the File system

749

indexing method), yet return a URL in search results. This directive can

750

be used to map the file names to their respective URLs on your web server.

751

752

There are five operations you can specify: B<replace>, B<append>,

753

B<remove>, B<prepend>, and B<regex>. They will parse the pathname in the

754

order you've typed these commands.

755

756

This directive uses C library regex.h regular expressions.

757

758

replace "the string you want replaced" "what to change it to"

759

remove "a string to remove"

760

prepend "a string to add before the result"

761

append "a string to add after the result"

762

regex "/search string/replace string/options"

763

764

Remember, quotes are needed if an expression contains white space,

765

and backslashes have special meaning.

766

767

Regex is an Extended Regular Expression. The first character found is

768

the delimiter (but it's not smart enough to use matched chars such as [],

769

(), and {}).

770

771

The B<replace> string may use substitution variables:

772

773

$0 the entire matched (sub)string

774

$1-$9 returns patterns captured in "(" ")" pairs

775

$` the string before the matched pattern

776

$' the string after the matched pattern

777

778

The B<options> change the behavior of the expression:

779

780

i ignore the case when matching

781

g repeat the substitution for the entire pattern

782

783

Examples:

784

785

ReplaceRules replace testdir/ anotherdir/

786

ReplaceRules replace [a-z_0-9]*_m.*\.html index.html

787

788

ReplaceRules remove testdir/

789

790

ReplaceRules prepend http://localhost/

791

ReplaceRules append .html

792

793

ReplaceRules regex !^/web/([^/]+)/!http://$1.domain.com/!

794

replaces a file path:

795

/web/search/foo/index.html

796

with

797

http://search.domain.com/foo/index.html

798

799

ReplaceRules regex #^#http://localhost/www#

800

ReplaceRules prepend http://localhost/www (same thing)

801

802

# Remove all extensions from C source files

803

ReplaceRules remove .c # ERROR! That "." is *any char*

804

ReplaceRules remove \.c # much better...

805

806

ReplaceRules remove "\\.c" # if in quotes you need double-backslash!

807

ReplaceRules remove "\.c" # ERROR! "\." -> "." and is *any char*
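To experiment with a C<regex> rule before indexing, the substitution can be approximated in a few lines of Python. This is a sketch under stated assumptions: Python's C<re> module is not the C library regex.h (in particular, regex.h picks the longest match), the helper name C<apply_regex_rule> is hypothetical, and the C<$`> and C<$'> variables are not handled.

```python
import re

def apply_regex_rule(rule, path):
    """Apply a rule written as <delim>search<delim>replace<delim>options."""
    delim = rule[0]                      # the first character is the delimiter
    parts = rule[1:].split(delim)
    if len(parts) != 3:
        raise ValueError("expected search, replace and options parts")
    pattern, replacement, options = parts
    flags = re.IGNORECASE if "i" in options else 0
    count = 0 if "g" in options else 1   # 0 means "replace every match"

    def expand(match):
        # Expand $0..$9 using the groups captured by this match.
        return re.sub(r"\$(\d)",
                      lambda var: match.group(int(var.group(1))) or "",
                      replacement)

    return re.sub(pattern, expand, path, count=count, flags=flags)

print(apply_regex_rule("!^/web/([^/]+)/!http://$1.domain.com/!",
                       "/web/search/foo/index.html"))
print(apply_regex_rule("#^#http://localhost/www#", "/docs/a.html"))
```

Using C<[^/]+> rather than C<.+> for the captured directory keeps the capture from running past the first slash.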

808

809

810

=item IndexContents [TXT|HTML|XML|TXT2|HTML2|XML2|TXT*|HTML*|XML*] *file extensions*

811

812

The C<IndexContents> directive assigns one of Swish-e's document parsers

813

to a document, based on its extension. Swish-e currently knows how

814

to parse TXT, HTML, and XML documents.

815

816

The XML2, HTML2, and TXT2 parsers are currently only available when

817

Swish-e is configured to use libxml2.

818

819

You may use XML*, HTML*, and TXT* to select the parser automatically.

820

If libxml2 is installed then it will be used to parse the content. Otherwise,

821

Swish-e's internal parsers will be used.

822

823

Documents that are not assigned a parser with C<IndexContents> will, by

824

default, use the HTML2 parser if libxml2 is installed, otherwise will

825

use Swish-e's internal HTML parser. The C<DefaultContents> directive may be

826

used to assign a parser to documents that do not match a file extension

827

defined with the C<IndexContents> directive.

828

829

Example:

830

831

IndexContents HTML* .htm .html .shtml

832

IndexContents TXT* .txt .log .text

833

IndexContents XML* .xml

834

835

HTML* is the default type for all files, unless otherwise specified

836

(and this default can be changed by the B<DefaultContents> directive).

837

Swish-e parses titles from HTML files, if available, and keeps track

838

of the context of the text for context searching (see C<-t> in

839

L<SWISH-RUN|SWISH-RUN>).
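The selection rules above (extension lookup, a fallback default, and the C<*> forms that pick libxml2 when it is available) can be summarised in a small model. This is a hedged sketch only; C<parser_for>, the C<rules> dictionary, and the C<have_libxml2> flag are names invented for illustration.

```python
import os

def parser_for(path, index_contents, default="HTML*", have_libxml2=False):
    """Pick a parser name for a path from extension rules (illustrative)."""
    suffix = os.path.splitext(path)[1]
    kind = index_contents.get(suffix, default)
    # A trailing "*" means: use the libxml2 parser if present,
    # otherwise fall back to the internal parser of the same type.
    if kind.endswith("*"):
        kind = kind[:-1] + ("2" if have_libxml2 else "")
    return kind

rules = {".htm": "HTML*", ".html": "HTML*", ".txt": "TXT*", ".xml": "XML*"}
print(parser_for("notes.txt", rules))
print(parser_for("notes.txt", rules, have_libxml2=True))
print(parser_for("README", rules, have_libxml2=True))
```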

840

841

If using filters (with the C<FileFilter> directive)

842

to convert documents you should include those extensions,

843

too. For example, if using a filter to convert .pdf to .html, you need

844

to tell Swish-e that .pdf should be indexed by the internal HTML parser:

845

846

FileFilter .pdf pdf2html

847

IndexContents HTML .pdf

848

849

See also L<Document Filter Directives|/"Document Filter Directives">.

850

851

B<Note:> Some of this may be changed in the future to use content-types

852

instead of file extensions. See L<SWISH-3.0|SWISH-3.0>.

853

854

=item DefaultContents [TXT|HTML|XML|TXT2|HTML2|XML2|TXT*|HTML*|XML*]

855

856

This sets the default parser for documents that are not specified in

857

B<IndexContents>. If not specified the default is HTML.

858

859

The XML2, HTML2, and TXT2 parsers are currently only available when

860

Swish-e is configured to use libxml2.

861

862

You may use XML*, HTML*, and TXT* to select the parser automatically.

863

If libxml2 is installed then it will be used to parse the content. Otherwise,

864

Swish-e's internal parsers will be used.

865

866

867

Example:

868

869

DefaultContents HTML

870

871

The C<DefaultContents> directive I<should> be used when spidering,

872

as HTML files may be returned without a file extension (such as when

873

requesting a directory and the default index.html is returned).

874

875

876

=item FileInfoCompression [yes|NO]

877

878

** This directive is currently not supported **

879

880

Setting B<FileInfoCompression> to C<yes> will compress the index file to

881

save disk space. This may result in longer indexing times. The default

882

is C<no>.

883

884

Also see the C<-e> switch in L<SWISH-RUN|SWISH-RUN> for saving RAM

885

during indexing.

886

887

888

=back

889

890

=head2 Document Contents Directives

891

892

These directives control what information is extracted from your source

893

documents, and how that information is made available during searching.

894

895

=over 4

896

897

=item ConvertHTMLEntities [YES|no]

898

899

ASCII I<entities> can be converted automatically while indexing documents

900

of type HTML (not for HTML2).

901

For performance reasons you may wish to set this to C<no>

902

if your documents do not contain HTML entities. The default is C<yes>.

903

904

If C<ConvertHTMLEntities> is set C<no> the entities will be indexed

905

without conversion.

906

907

B<NOTE:> Entities within XML files and files parsed with libxml2 (HTML2) are

908

converted regardless of this setting.

909

910

=item MetaNames *list of names*

911

912

META names are a way to define "fields" in your XML and HTML documents.

913

You can use the META names in your queries to limit the search to just

914

the words contained in that META name of your document. For example,

915

you might have a META tagged field in your documents called C<subjects>

916

and then you can search your documents for the word "foo" but only return

917

documents where "foo" is within the C<subjects> META tag.

918

919

swish-e -w subjects=foo

920

921

(See also the C<-t> switch in L<SWISH-RUN|SWISH-RUN> for information

922

about I<context> searching in HTML documents.)

923

924

The B<MetaNames> directive is a space separated list. For example:

925

926

MetaNames meta1 meta2 keywords subjects

927

928

You may also use L<UndefinedMetaTags|/"item_UndefinedMetaTags"> to specify

929

automatic extraction of meta names from your HTML and XML documents,

930

and also to ignore indexing content of meta tags.

931

932

META tags can have two formats in your B<HTML> source documents:

933

<META NAME="meta1" CONTENT="some content">

935

936

and (if using the HTML2/libxml2 parser)

937

938

<meta1>

939

some content

940

</meta1>

941

942

But this second version is invalid HTML, and will generate a warning if

943

ParserWarningLevel is set (libxml2 only).

944

945

And in B<XML> documents, use the format:

946

947

<meta1>

948

Some Content

949

</meta1>

950

951

Then you can limit your search to just META B<meta1> like this:

952

953

swish-e -w 'meta1=(apples or oranges)'

954

955

You may nest the XML and the start/end tag versions:

956

<keywords>
<tag1>
some content
</tag1>
<tag2>
some other content
</tag2>
</keywords>

965

966

Then you can search in both tag1 and tag2 with:

967

968

swish-e -w 'keywords=(query words)'

969

970

Swish-e indexes all text as some metaname. The default is

971

C<swishdefault>, so these two queries are the same:

972

973

swish-e -w foo

974

swish-e -w swishdefault=foo

975

976

When indexing HTML Swish-e indexes the HTML title as default text, so

977

when searching Swish-e will find matches in both the HTML body and the

978

HTML title. Swish also, by default, indexes content of meta tags. So:

979

980

swish-e -w foo

981

982

will find "foo" in the body, the title, or any meta tags.

983

984

Currently, there's no way to prevent Swish-e from indexing

985

the title contents along with the body contents, but see

986

L<UndefinedMetaTags|/"item_UndefinedMetaTags"> for how to control the

987

indexing of meta tags.

988

989

If you would like to search just the title text, you may use:

990

991

MetaNames swishtitle

992

993

This will index the title text separately under the built-in swish

994

internal meta name "swishtitle". You may then search like

995

996

swish-e -w foo -- search for "foo" in title, body (and undefined meta tags)

997

swish-e -w swishtitle=foo -- search for "foo" in title only

998

999

In addition to swishtitle, you can limit searches to documents' path with:

1000

1001

MetaNames swishdocpath

1002

1003

Then to search for "foo" but also limit searches to documents that include

1004

"manual" or "tutorial" in their path:

1005

1006

swish-e -w 'foo swishdocpath=(manual or tutorial)'

1007

1008

See also L<ExtractPath|/"item_ExtractPath">.

1009

1010

1011

=item MetaNameAlias *meta name* *list of aliases*

1012

1013

MetaNameAlias assigns aliases for a meta name. For example, if your

1014

documents contain meta tags "description", "summary", and "overview"

1015

that all give a summary of your documents you could do this:

1016

1017

MetaNames summary

1018

MetaNameAlias summary description overview

1019

1020

Then all three tags will get indexed as meta tag "summary". You can

1021

then search all the fields as:

1022

1023

-w summary=foo

1024

1025

The aliases work at search time, too. So these will also limit the search

1026

to the "summary" meta name.

1027

1028

-w description=foo

1029

-w overview=foo

1030

1031

=item MetaNamesRank integer *list of meta names*

1032

1033

* Not implemented yet *

1034

1035

You can assign a bias to metanames that will affect how ranking is

1036

calculated. The range of values is from -10 to +10, with zero being

1037

no bias.

1038

1039

MetaNamesRank 4 subject

1040

MetaNamesRank 3 swishdefault

1041

MetaNamesRank 2 author publisher

1042

MetaNamesRank -5 wrongwords

1043

1044

This feature is not implemented yet

1045

1046

=item HTMLLinksMetaName *metaname*

1047

1048

Allows indexing of HTML links. Normally, HTML links (href tags) are

1049

not indexed by Swish-e. This directive defines a metaname, and links

1050

will be indexed under this meta name.

1051

1052

Example:

1053

1054

HTMLLinksMetaName links

1055

1056

Now, to limit searches to files with a link to "home.html" do this:

1057

1058

-w links='"home.html"'

1059

1060

The double quotes force a phrase search.

1061

1062

To make Swish-e index links as normal text, you may use:

1063

1064

HTMLLinksMetaName swishdefault

1065

1066

This feature is only available with the libxml2 HTML parser.

1067

1068

=item ImageLinksMetaName *metaname*

1069

1070

Allows indexing of image links under a metaname. Normally, image URLs

1071

are not indexed.

1072

1073

Example:

1074

1075

ImageLinksMetaName images

1076

1077

Now, if you would like to find pages that include a nice image of a beach:

1078

1079

-w images='beach'

1080

1081

To make Swish-e index image links as normal text, you may use:

1082

1083

ImageLinksMetaName swishdefault

1084

1085

This feature is only available with the libxml2 HTML parser.

1086

1087

1088

=item IndexAltTagMetaName *tagname*|as-text

1089

1090

Allows indexing of the ALT text of <IMG> tags. Specify either a tag name which will be

1091

used as a metaname, or the special text "as-text" which says to index the ALT text as

1092

if it were plain text at the current location.

1093

1094

For example, by specifying a tag name:

1095

1096

IndexAltTagMetaName bar

1097

1098

would make this markup:

1099

1100

<foo>

<img src="..." alt="Alt text here">

1102

</foo>

1103

1104

appear like

1105

1106

<foo>

<bar>Alt text here</bar>

1108

</foo>

1109

1110

Then the normal rules (C<MetaNames> and C<PropertyNames>) apply to how that text is indexed.

1111

1112

If you use the special tag "as-text" then

1113

1114

<foo>

<img src="..." alt="Alt text here">

1116

</foo>

1117

1118

simply becomes

1119

1120

<foo>

1121

Alt text here

1122

</foo>

1123

1124

This feature is only available when using the libxml2 parser (HTML2 and XML2).

1125

1126

1127

=item AbsoluteLinks [yes|NO]

1128

1129

If this is set true then Swish-e will attempt to convert relative URIs

1130

extracted from HTML documents for use with C<HTMLLinksMetaName> and

1131

C<ImageLinksMetaName> into absolute URIs. Swish-e will use any <BASE>

1132

tag found in the document, otherwise it will use the file's pathname.

1133

The pathname used will be the pathname *after* C<ReplaceRules> has been

1134

applied to the document's pathname.

1135

1136

For example, say you wish to index image links under the metaname

1137

"images".

1138

1139

ImageLinksMetaName images

1140

1141

If an image is located in http://localhost/vacations/france/index.html

1142

and C<AbsoluteLinks> is set to no, then an image within that document:

1143

<img src="beach.jpeg">

1145

1146

will only index "beach.jpeg".

1147

1148

But, if you want more detail when searching, you can enable

1149

C<AbsoluteLinks> and Swish-e will index

1150

"http://localhost/vacations/france/beach.jpeg". You can then look for

1151

images of beaches, but only in France:

1152

1153

-w images=(beach and france)

1154

1155

This also means you can search for any images within France:

1156

1157

-w images=(france)

1158

1159

This feature is only available with the libxml2 HTML parser.

1160

1161

=item UndefinedMetaTags [error|ignore|INDEX|auto]

1162

1163

This directive defines the behavior of Swish-e during indexing when a

1164

meta name is found but is B<not> listed in B<MetaNames>. There are

1165

four choices:

1166

1167

1168

=over 2

1169

1170

=item error

1171

1172

If a meta name is found that is not listed in B<MetaNames>

1173

then indexing will be halted and an error reported.

1174

1175

=item ignore

1176

1177

The contents of the meta tag are ignored and B<not> indexed

1178

unless a metaname has been defined with the C<MetaNames> directive.

1179

1180

=item index

1181

1182

The contents of the meta tag are indexed, but placed in the

1183

main index unless there's an enclosing metatag already in force. This

1184

is the default.

1185

1186

=item auto

1187

1188

This method creates meta tags automatically for HTML meta names

1189

and XML elements. Using this is the same as specifying all the meta

1190

names explicitly in a B<MetaNames> directive.

1191

1192

=back
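The four C<UndefinedMetaTags> settings can be read as a small decision function. The sketch below is an assumption-laden model, not Swish-e source: C<resolve_metaname> and its parameters are hypothetical, and "main index" is represented by the default meta name C<swishdefault>.

```python
def resolve_metaname(tag, metanames, mode="index", enclosing="swishdefault"):
    """Return the meta name the tag's text is indexed under, or None."""
    if tag in metanames:
        return tag                       # explicitly listed in MetaNames
    if mode == "error":
        raise ValueError("undefined meta name: " + tag)
    if mode == "ignore":
        return None                      # contents are not indexed
    if mode == "auto":
        return tag                       # behaves as if listed in MetaNames
    # mode == "index": fall back to whatever meta name is already in force
    return enclosing

defined = {"subjects"}
print(resolve_metaname("subjects", defined, mode="ignore"))
print(resolve_metaname("author", defined, mode="index"))
print(resolve_metaname("author", defined, mode="auto"))
```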

1193

1194

=item UndefinedXMLAttributes [DISABLE|error|ignore|index|auto]

1195

1196

This is similar to C<UndefinedMetaTags>, but only applies to XML documents (parsed with libxml2).

1197

This allows indexing of attribute content, and provides a way to index the content under a

1198

metaname. For example, C<UndefinedXMLAttributes> can make

1199

<person age="23">

1201

John Doe

1202

</person>

1203

1204

look like the following to swish:

1205

<person>

1207

<person.age>

23

1209

</person.age>

1210

John Doe

1211

</person>

1212

1213

What happens to the text "23" will depend on the setting of C<UndefinedXMLAttributes>:

1214

1215

=over 2

1216

1217

=item disable

1218

1219

XML attributes are not parsed and not indexed. This is the default.

1220

1221

=item error

1222

1223

If the concatenated meta name (e.g. person.age) is not listed in

1224

B<MetaNames> then indexing will be halted and an error reported.

1225

1226

=item ignore

1227

1228

The contents of the meta tag are ignored and B<not> indexed unless a

1229

metaname has been defined with the C<MetaNames> directive.

1230

1231

=item index

1232

1233

The contents of the meta tag are indexed, but placed in the main index

1234

unless there's an enclosing metatag already in force.

1235

1236

=item auto

1237

1238

This method will create meta tags from the combined element and attributes

1239

(and XML class name). This option should be used with caution as it can

1240

generate a lot of metaname entries.

1241

1242

See also the example under C<XMLClassAttributes> below.

1243

1244

1245

=back

1246

1247

=item XMLClassAttributes *list of XML attribute names*

1248

1249

Combines an XML class name with the element name to make up a metaname.

1250

For example:

1251

1252

XMLClassAttributes class

1253

<person class="first">

1255

John

1256

</person>

<person class="last">

1258

Doe

1259

</person>

1260

1261

Will appear to Swish-e as:

1262

<person>

1264

<person.first>

1265

John

1266

</person.first>

1267

</person>

1268

1269

<person.last>

1270

Doe

1271

</person.last>

1272

</person>

1273

1274

How the data is indexed depends on C<MetaNames> and C<UndefinedMetaTags>.

1275

1276

Here's an example using the following configuration which combines the

1277

two directives C<XMLClassAttributes> and C<UndefinedXMLAttributes>.

1278

1279

XMLClassAttributes class

1280

UndefinedMetaTags auto

1281

UndefinedXMLAttributes auto

1282

IndexContents XML2 .xml

1283

1284

The source XML file looks like:

1285

1286

<xml>

<person class="student" phone="555-1212" age="102">

1288

John

1289

</person>

<person greeting="howdy">Bill</person>

1291

</xml>

1292

1293

Swish-e parses as:

1294

1295

./swish-e -c 2 -i 1.xml -T parsed_tags parsed_text -v 0

1296

Indexing Data Source: "File-System"

1297

1298

<xml> (MetaName)

1299

1300

<person> (MetaName)

1301

<person.student> (MetaName)

1302

<person.student.phone> (MetaName)

1303

555-1212

1304

</person.student.phone>

1305

<person.student.age> (MetaName)

1306

102

1307

</person.student.age>

1308

John

1309

</person>

1310

1311

<person> (MetaName)

1312

<person.greeting> (MetaName)

1313

howdy

1314

</person.greeting>

1315

Bill

1316

</person>

1317

1318

</xml>

1319

Indexing done!

1320

1321

One thing to note is that the first <person> block finds a class name

1322

"student" so all metanames that are created from attributes use the

1323

combined name "person.student". The second <person> block doesn't contain

1324

a "class" so, the attribute name is combined directly with the element

1325

name (e.g. "person.greeting").
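The naming scheme in the parsed output above can be reproduced with a short script, which may help when predicting how many meta names the C<auto> setting will create. This is a sketch that assumes the naming rules are exactly as described in this section; C<metanames> and C<SAMPLE> are hypothetical names, and the standard library XML parser stands in for libxml2.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<xml>
  <person class="student" phone="555-1212" age="102">John</person>
  <person greeting="howdy">Bill</person>
</xml>"""

def metanames(xml_text, class_attrs=("class",)):
    """List meta names in first-seen order for elements and attributes."""
    names = []

    def add(name):
        if name not in names:
            names.append(name)

    for elem in ET.fromstring(xml_text).iter():
        add(elem.tag)
        base = elem.tag
        # An XMLClassAttributes attribute extends the element name.
        for attr in class_attrs:
            if attr in elem.attrib:
                base = elem.tag + "." + elem.attrib[attr]
                add(base)
        # Every other attribute becomes <base>.<attribute>.
        for attr in elem.attrib:
            if attr not in class_attrs:
                add(base + "." + attr)
    return names

for name in metanames(SAMPLE):
    print(name)
```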

1326

1327

=item ExtractPath *metaname* [replace|remove|prepend|append|regex]

1328

1329

This directive can be used to index extracted parts of a document's path.

1330

A common use would be to limit searches to specific areas of your

1331

file tree.

1332

1333

The extracted string will be indexed under the specified meta name.

1334

1335

See C<ReplaceRules> for a description of the various pattern replacement

1336

methods, but you will use the I<regex> method.

1337

1338

For example, say your file system (or web tree) was organized into departments:

1339

1340

/web/sales/foo...

1341

/web/parts/foo...

1342

/web/accounting/foo...

1343

1344

And you wanted a way to limit searches to just documents under "sales".

1345

1346

ExtractPath department regex !^/web/([^/]+)/.*$!$1!

1347

1348

Which says, extract out the department name (as substring $1) and index

1349

it as meta name C<department>. Then to limit a search to the sales

1350

department:

1351

1352

swish-e -w foo AND department=sales

1353

1354

Note that the C<regex> method uses a substitution pattern, so to index

1355

only a sub-string match the I<entire> document path in the regular

1356

expression, as shown above. Otherwise any part that is not matched will

1357

end up in the substitution pattern.

1358

1359

See the C<ExtractPathDefault> option for a way to set a value if no

1360

patterns match.

1361

1362

Although unlikely, you may use more than one C<ExtractPath> directive.

1363

More than one directive of the I<same> meta name will operate successively

1364

(in order listed in the configuration file) on the path. This allows

1365

you to use regular expressions on the results of the previous pattern

1366

substitution (as if piping the output from one expression to the pattern

1367

of the next).

1368

1369

ExtractPath foo regex !^(...).+$!$1!

1370

ExtractPath foo regex !^.+(.)$!$1!

1371

1372

So, the third letter is indexed as meta name "foo" if both patterns match.

1373

1374

ExtractPath foo regex !^X(...).+$!$1!

1375

ExtractPath foo regex !^.+(.)$!$1!

1376

1377

Now (note the "X"), if the first pattern doesn't match, the last character

1378

of the path name is indexed. You must be clear on this behavior if you

1379

are using more than one C<ExtractPath> directive with the same metaname.
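The successive behaviour is easy to misjudge, so here is a small simulation of the two rule pairs above. It is a sketch only: it uses Python's C<re> module instead of regex.h, writes the replacement as C<\1> rather than C<$1>, and the helper C<extract> is a hypothetical name.

```python
import re

def extract(path, rules):
    """Apply (pattern, replacement) rules in order to the running value.

    Returns None when no rule matched at all.
    """
    value, matched = path, False
    for pattern, replacement in rules:
        new_value, hits = re.subn(pattern, replacement, value, count=1)
        if hits:
            value, matched = new_value, True
    return value if matched else None

path = "/web/sales/index.html"
both_match = [(r"^(...).+$", r"\1"), (r"^.+(.)$", r"\1")]
first_fails = [(r"^X(...).+$", r"\1"), (r"^.+(.)$", r"\1")]

print(extract(path, both_match))   # third character of the path
print(extract(path, first_fails))  # last character of the path
```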

1380

1381

The document path operated on is the real path swish used to access

1382

the document. That is, the C<ReplaceRules> directive has no effect on

1383

the path used with C<ExtractPath>.

1384

1385

The full path is used for each meta name if more than one C<ExtractPath>

1386

directive is used. That is, changes to the path used in C<ExtractPath

1387

foo> do not affect the path used by C<ExtractPath bar>.

1388

1389

=item ExtractPathDefault *metaname* default_value

1390

1391

This can be used with C<ExtractPath> to set a default string to index

1392

under the given metaname if none of the C<ExtractPath> patterns match.

1393

1394

For example, say you want to index each document with a metaname

1395

"department" based on the following path examples:

1396

1397

/web/sales/foo...

1398

/web/parts/foo...

1399

/web/accounting/foo...

1400

1401

But you are also indexing documents that do not follow that pattern and you want to search those

1402

separately, too.

1403

1404

ExtractPath department regex !^/web/([^/]+)/.*$!$1!

1405

ExtractPathDefault department other

1406

1407

Now, you may search like this:

1408

1409

-w foo department=(sales) - limit searches to the sales documents

1410

-w foo department=(parts) - limit searches to the parts documents

1411

-w foo department=(accounting) - limit searches to the accounting documents

1412

-w foo department=(other) - everything but sales, parts, and accounting.

1413

1414

This basically is a shortcut for:

1415

1416

-w foo not department=(sales or parts or accounting)

1417

1418

but you don't need to keep track of what was extracted.
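The combined effect of C<ExtractPath> and C<ExtractPathDefault> in this example amounts to "use the captured directory, otherwise a fixed value". A minimal sketch, assuming Python's C<re> behaves like regex.h for this simple pattern; the function name C<department> is illustrative.

```python
import re

# Same pattern as the ExtractPath example, in Python syntax.
DEPARTMENT = re.compile(r"^/web/([^/]+)/.*$")

def department(path, default="other"):
    """Return the value that would be indexed under 'department'."""
    match = DEPARTMENT.match(path)
    return match.group(1) if match else default

for p in ("/web/sales/foo.html", "/web/parts/a/b.html", "/home/docs/readme.txt"):
    print(p, "->", department(p))
```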

1419

1420

=item PropertyNames *list of meta names*

1421

1422

=item PropertyNamesCompareCase *list of meta names*

1423

1424

=item PropertyNamesIgnoreCase *list of meta names*

1425

1426

Swish-e allows you to specify certain META tags that can be used as

1427

B<document properties>. The contents of any META tag that has been

1428

identified as a document property can be returned as part of the search

1429

results along with the rank, file name, title, and document size (see

1430

the C<-p> and C<-x> switches in L<SWISH-RUN|SWISH-RUN>).

1431

1432

Properties are useful for returning additional data from documents in

1433

search results -- this saves the effort of reading and parsing the source

1434

files while reading Swish-e search results, and is especially useful

1435

when the source documents are no longer available or slow to access

1436

(e.g. over http).

1437

1438

Another feature of properties is that Swish-e can use the PropertyNames

1439

for sorting the search results (see the C<-s> switch).

1440

1441

PropertyNames author subjects

1442

1443

Two variations are available. C<PropertyNamesCompareCase> and

1444

C<PropertyNamesIgnoreCase>. These tell Swish-e to either ignore or

1445

compare case when sorting results. The default for C<PropertyNames>

1446

is to ignore the case.

1447

1448

PropertyNamesIgnoreCase subject

1449

PropertyNamesCompareCase keyword

1450

1451

The defaults for "internal" properties are:

1452

1453

swishtitle -- ignore the case

1454

swishdocpath -- compare case

1455

swishdescription -- compare case

1456

1457

These can be overridden with C<PropertyNamesCompareCase> and

1458

C<PropertyNamesIgnoreCase>.

1459

1460

PropertyNamesCompareCase swishtitle

1461

1462

Use of PropertyNames will increase the size of your index files,

1463

sometimes significantly. Properties will be compressed if Swish-e is

1464

compiled with zlib as described in the L<INSTALL|INSTALL> manual page.

1465

1466

If Swish-e finds more than one property of the same name in a document

1467

the property's contents will be concatenated for strings, and a warning

1468

is issued for numeric (or date) properties.

1469

1470

=item PropertyNamesNoStripChars

1471

1472

PropertyNamesNoStripChars specifies that the listed properties should not

1473

have strings of low ASCII characters replaced with a space character.

1474

Properties will be stored as found in the document.

1475

1476

When printing properties with the swish-e binary newlines are replaced with

1477

a space character. Use the swish-e library (or SWISH::API perl module) to

1478

fetch properties without newlines replaced.

1479

1480

1481

=item PropertyNamesNumeric

1482

1483

This directive is similar to C<PropertyNames>, but it flags the property

1484

as being a string of digits (integer value) that will be stored as binary data instead

1485

of a string. This allows sorting with C<-s> and limiting with C<-L>

1486

to sort and limit the property correctly.

1487

1488

Swish-e uses C<strtoul(3)> to convert the string into an unsigned long

1489

integer. Therefore, only positive integers can be stored.

1490

1491

Future versions of Swish-e may be able to store different property types

1492

(such as negative integers and real numbers). This directive may change

1493

in future releases of Swish.

1494

1495

=item PropertyNamesDate

1496

1497

This directive is exactly like C<PropertyNamesNumeric>, but it also

1498

flags the number as a machine timestamp (seconds since Epoch), and

1499

will print a formatted date when returning this property. See C<-x>

1500

in L<SWISH-RUN|SWISH-RUN>.

1501

1502

Swish-e will not parse dates when indexing; you must use a timestamp.
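Since the value must already be a timestamp, dates usually need converting before they are written into the document (for example by a filter or a C<-S prog> program). A minimal sketch of such a conversion, assuming the dates are in C<YYYY-MM-DD> form and should be taken as midnight UTC; C<to_epoch> is a hypothetical helper.

```python
from datetime import datetime, timezone

def to_epoch(date_text):
    """Convert 'YYYY-MM-DD' (midnight UTC) to whole seconds since the Epoch."""
    parsed = datetime.strptime(date_text, "%Y-%m-%d")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())

# The resulting integer is what would be stored as the property value.
print(to_epoch("2004-03-11"))
```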

1503

1504

=item PropertyNameAlias *property name* *list of aliases*

1505

1506

This allows aliases for a property name. For example, if you are indexing

1507

HTML files, plus XML files that are written in English, German, and

1508

Spanish and thus use the tags "title", "titel", and "título" you can use:

1509

1510

PropertyNameAlias swishtitle title titel título titulo

1511

1512

Note that "swishtitle" is the built-in property used to store the title of

1513

a document, and therefore you do not need to specify it as a PropertyName

1514

before use.

1515

1516

=item PropertyNamesMaxLength integer *list of meta names*

1517

1518

This option will set the max length of the text stored in a property.

1519

You must specify a number between 0 and the max integer size on your

1520

platform, and a list of properties. The properties specified must not

1521

be aliases.

1522

1523

If any of the property names do not exist they will be created (e.g. you

1524

do not need to define the property with PropertyNames first).

1525

1526

In general, this feature will only be useful when parsing HTML or XML

1527

with the libxml2 parser.

1528

1529

For example:

1530

1531

PropertyNamesMaxLength 1000 swishdescription

1532

PropertyNameAlias swishdescription body

1533

1534

Is somewhat like

1535

1536

StoreDescription HTML <body> 1000

1537

StoreDescription XML <body> 1000

1538

StoreDescription HTML2 <body> 1000

1539

StoreDescription XML2 <body> 1000

1540

1541

but StoreDescription allows setting the tag for each parser type.

1542

1543

PropertyNamesMaxLength 1000 headings

1544

PropertyNameAlias headings h1 h2 h3 h4

1545

1546

collects all the heading text into a single property called "headings", not

1547

to exceed 1000 characters.

1548

1549

=item PropertyNamesSortKeyLength integer *list of meta names*

1550

1551

Sets the length of the string used when sorting.

1552

The default is 100 characters. The -T metanames debugging option will

1553

list the current values for an index.

1554

1555

This setting is used when sorting during indexing, and perhaps when sorting

1556

while searching. It also affects the order when limiting to a range of values

1557

with the -L option.

1558

1559

=item PreSortedIndex *list of property names*

1560

1561

By default Swish-e generates presorted tables while indexing for each

1562

property name. This allows faster sorting when generating results.

1563

On large document collections this presorting may add to the indexing

1564

time, and also adds to the total size of the index. This directive can

1565

be used to customize exactly which properties will be presorted.

1566

1567

If C<PreSortedIndex> is I<not> present in the config file (default

1568

action), all the properties will be presorted at indexing time. If it

1569

is present without any parameter, no properties will be presorted.

1570

Otherwise, only the property names specified will be presorted.
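The three cases (directive absent, present with no arguments, present with a list) can be stated compactly. The function below is only a model of the rule as described here; C<presorted> is a hypothetical name, and C<None> stands for "directive not present".

```python
def presorted(properties, directive=None):
    """Return the properties that get a presorted table.

    directive is None when PreSortedIndex is absent from the config,
    an empty list when it is present without parameters, or the list
    of property names given to it.
    """
    if directive is None:
        return list(properties)          # default: presort everything
    return [name for name in properties if name in directive]

props = ["title", "age", "time"]
print(presorted(props))
print(presorted(props, []))
print(presorted(props, ["title"]))
```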

1571

1572

For example, if you only wish to sort results by a property called

1573

C<title>:

1574

1575

PropertyNames title age time

1576

PreSortedIndex title

1577

1578

1579

=item StoreDescription [XML <tag> size|HTML <meta> size|TXT size]

1580

1581

B<StoreDescription> allows you to store a document description in the

1582

index file. This description can be returned in your search results

1583

when the C<-x> switch is used to include the I<swishdescription> for

1584

extended results, or by using C<-p swishdescription>.

1585

1586

The document type (XML, HTML and TXT) must match the document type currently being indexed

1587

as set by C<IndexContents> or C<DefaultContents>. See those directives for possible values.

1588

A common problem is using C<StoreDescription> yet not setting the document's type with

1589

C<IndexContents> or C<DefaultContents>. Another problem is different types:

1590

1591

IndexContents HTML2 .html

1592

StoreDescription HTML <body>

1593

1594

Then .html documents are assigned a type of HTML2 (and parsed by the libxml2 parser), but the

1595

description will not be stored since it is type HTML instead of HTML2.

1596

1597

For text documents you specify the type TXT (or TXT2 or TXT*) and the number of I<characters> to capture.

1598

1599

StoreDescription TXT 20

1600

1601

The above stores only the first twenty characters from the text file in the Swish-e index

1602

file.

1603

1604

For HTML, and XML file types, specify the tag to use for the

1605

description, and optionally the number of characters to capture. If not

1606

specified, the entire contents of the tag are captured.

1607

1608

StoreDescription HTML <body> 20000

1609

StoreDescription XML <desc> 40

1610

1611

Again, note that documents must be assigned a document type with C<IndexContents>

1612

or C<DefaultContents> to use this feature.

1613

1614

Swish-e will compress the descriptions (or any other large property)

1615

if compiled to use zlib (see L<INSTALL|INSTALL>). This is recommended when using

1616

StoreDescription and a large number of documents. Compression of 30% to 50% is

1617

not uncommon with HTML files.

1618

1619

=item PropCompressionLevel [0-9]

1620

1621

This directive sets the compression level used when storing properties

1622

to disk. A setting of zero is no compression, and a setting of nine is

1623

the most compression.

1624

1625

The default depends on the default setting compiled with zlib, but is

1626

typically six.

1627

This option is useful when using C<StoreDescription> to store a large
amount of text in properties (or if using C<PropertyNames> with large
property sizes).

1631

1632

Properties must be over a value defined in F<config.h> (100 is the

1633

default) before compression will be attempted. Swish-e will never store

1634

the results of the compression if the compressed data is larger than

1635

the original data.
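The size threshold and the "never store a larger result" rule described above can be sketched in a few lines. This is only an illustration written in Python with its standard C<zlib> module, not Swish-e's own code; the names C<MIN_COMPRESS_SIZE> and C<pack_property> are hypothetical, and the threshold of 100 is simply the default quoted above.

```python
import zlib

# Assumed threshold for illustration; in Swish-e the real value is set in config.h.
MIN_COMPRESS_SIZE = 100


def pack_property(data, level=6):
    """Return (stored_bytes, is_compressed) for one property value.

    level follows the PropCompressionLevel range: 0 disables compression,
    9 compresses the most.
    """
    # Small properties, or level 0, are stored as they are.
    if level == 0 or len(data) <= MIN_COMPRESS_SIZE:
        return data, False
    packed = zlib.compress(data, level)
    # Keep the original bytes if compression did not make them smaller.
    if len(packed) >= len(data):
        return data, False
    return packed, True


if __name__ == "__main__":
    description = b"Swish-e indexes documents. " * 40
    stored, compressed = pack_property(description, level=9)
    print(len(description), len(stored), compressed)
```

A long, repetitive description shrinks considerably, while a short title or data without repetition is stored unchanged.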

1636

1637

This option is only available when Swish-e is compiled with zlib support.

1638

1639

1640

=item TruncateDocSize *number of characters*

1641

TruncateDocSize limits how much of each document Swish-e reads while
indexing, including output returned by filters. If a document is larger
than the specified size, only that many bytes are read and the rest of
the document is ignored.

1646

1647

Example:

1648

1649

TruncateDocSize 10000000

1650

1651

The default is zero, which means read all data.

1652

1653

Warning: If you use TruncateDocSize, use it with care! TruncateDocSize
is only a safety belt, for example to limit filter output when accessing
databases, or to limit "runaway" filters. Truncating document input may
destroy document structure for Swish-e (e.g. Swish-e may miss closing
tags for XML or HTML documents).
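To make the warning concrete, here is a small Python sketch of what a byte limit does to a structured document. It is an illustration only; C<read_document> is a hypothetical helper, not part of Swish-e.

```python
def read_document(data, truncate_doc_size=0):
    """Return the bytes an indexer would see; 0 means read everything."""
    if truncate_doc_size > 0:
        return data[:truncate_doc_size]
    return data


if __name__ == "__main__":
    html = b"<html><body><p>Hello world</p></body></html>"
    clipped = read_document(html, 20)
    print(clipped)
    # The closing tags are gone, so a parser sees an unterminated document.
    print(b"</html>" in clipped)
```

With a limit of 20 bytes the closing C<< </body> >> and C<< </html> >> tags are never read, which is exactly the kind of damaged structure the warning describes.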

1659

1660

TruncateDocSize does not currently work with the C<prog> input source

1661

method.

1662

1663

=item FuzzyIndexingMode NONE|Stemming|Soundex|Metaphone|DoubleMetaphone

1664

1665

Selects the type of index to create. Only one type of index may be created.

1666

It's a good idea to create both a normal index and a fuzzy index and
allow your search interface to select which index to use. Many people
find the fuzzy searches to be too fuzzy.

1670

1671

The available fuzzy indexing options can be displayed by running

1672

1673

swish-e -T LIST_FUZZY_MODES

1674

1675

Available options include:

1676

1677

=over 4

1678

1679

=item None

1680

1681

Words are stored in the index without any conversion. This is the default.

1682

1683

=item Stemming_*

1684

This option uses one of the installed Snowball stemmers (http://snowball.tartarus.org/).

1686

1687

The installed stemmers can be viewed by running

1688

1689

swish-e -T LIST_FUZZY_MODES

1690

1691

For example, to use the Spanish stemming module:

1692

1693

FuzzyIndexingMode Stemming_es

1694

1695

1696

=item Stem or Stemming_en

1697

1698

Selects the legacy Swish-e English stemmer.

1699

This is deprecated in favor of the Snowball English stemmers (Stemming_en1, Stemming_en2).

1701

Future versions of Swish-e will likely use the Stemming_en2 stemmer by default.

1702

1703

Words are converted using the Porter stemming algorithm.

1704

1705

From: http://www.tartarus.org/~martin/PorterStemmer/

1706

The Porter stemming algorithm (or "Porter stemmer") is a
process for removing the commoner morphological and inflexional
endings from words in English. Its main use is as part of a
term normalisation process that is usually done when setting up
Information Retrieval systems.

1712

1713

1714

This will help a search for "running" to also find "run" and "runs", for example.

1715

1716

The stemming function does not convert words to their root, rather

1717

programmatically removes endings on words in an attempt to make similar

1718

words with different endings stem to the same string of characters.

1719

It's not a perfect system, and searches on stemmed indexes often return

1720

curious results. For example, two entirely different words may stem to

1721

the same word.

1722

1723

Stemming also can be confusing when used with a wildcard (truncation).

1724

For example, you might expect to find the word "running" by searching for

1725

"runn*". But this fails when using a stemmed index, as "running" stems to

1726

"run", yet searching for "runn*" looks for words that start with "runn".

1727
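The wildcard problem described above can be reproduced with a toy model. The stemmer below is a deliberately crude suffix stripper written for this illustration; it is B<not> the Porter or Snowball algorithm, and C<toy_stem>, C<build_index> and C<search> are hypothetical names.

```python
def toy_stem(word):
    """Crude suffix stripper for illustration only (not the Porter algorithm)."""
    w = word.lower()
    for suffix in ("ing", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            break
    # Undouble a trailing consonant: "runn" -> "run".
    if len(w) >= 4 and w[-1] == w[-2] and w[-1] not in "aeiou":
        w = w[:-1]
    return w


def build_index(words):
    """A stemmed index stores only the stems."""
    return {toy_stem(w) for w in words}


def search(index, query):
    """Wildcard queries compare the raw prefix; other queries are stemmed first."""
    if query.endswith("*"):
        prefix = query[:-1].lower()
        return sorted(term for term in index if term.startswith(prefix))
    stem = toy_stem(query)
    return sorted(term for term in index if term == stem)


if __name__ == "__main__":
    index = build_index(["running", "runs", "run"])
    print(index)
    print(search(index, "running"))
    print(search(index, "runn*"))
```

All three words collapse to the single index entry "run", so the query "running" matches, while the prefix "runn" matches nothing because no stored term starts with it.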

1728

=item Soundex

1729

Soundex was developed so that records for people with similar-sounding
names could be found more readily. It was patented in 1918 and was later
used to index U.S. census records from 1880 onward. A Soundex code
encodes a surname by the way it sounds rather than by its spelling.
Surnames that sound similar, like Smith and Smyth, are filed together
under the same Soundex code. This is mostly useful for US English.

1735

1736

Soundex should not be used to search for sound-alike words. Metaphone

1737

would be more appropriate for generic sound matching of words. Soundex

1738

should only be used where you need to search multiple documents for

1739

proper names which sound similar. This is primarily used for indexing

1740

genealogical records. This may be useful for indexing other collections

1741

of data consisting mostly of names. Many common name variations are

1742

matched by Soundex. The only notable exception is the first letter of

1743

the name. The first letter is not matched for sound.
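For readers unfamiliar with the coding, the following Python sketch implements the commonly published American Soundex rules. It is meant as an illustration of the idea and is not taken from the Swish-e source, whose implementation may differ in detail.

```python
# Letter groups of the commonly published American Soundex rules.
_GROUPS = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
           ("L", "4"), ("MN", "5"), ("R", "6"))
CODES = {letter: digit for letters, digit in _GROUPS for letter in letters}


def soundex(name):
    """Return the four-character Soundex code of a name ("" if it has no letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = CODES.get(ch, "")
        if code and code != previous:
            digits.append(code)
        previous = code  # a vowel resets the "same code" check
    return (first + "".join(digits) + "000")[:4]


if __name__ == "__main__":
    for surname in ("Smith", "Smyth", "Cathy", "Kathy"):
        print(surname, soundex(surname))
```

Smith and Smyth receive the same code, while Cathy and Kathy do not, because the first letter is kept as written rather than coded by sound.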

1744

1745

=item Metaphone and DoubleMetaphone

1746

1747

Words are transformed into a short series of letters representing the sound of the word (in English).

1748

Metaphone algorithms are often used for looking up mis-spelled words in dictionary programs.

1749

1750

From: http://aspell.sourceforge.net/metaphone/

1751

1752

Lawrence Philips' Metaphone Algorithm is an algorithm which returns

1753

the rough approximation of how an English word sounds.

1754

1755

The C<DoubleMetaphone> mode will sometimes generate two different metaphones for the same word.

1756

This is supposed to be useful when a word may be pronounced more than one way.

1757

1758

A metaphone index should give results somewhere in between Soundex and Stemming.

1759

1760

=back

1761

1762

=item UseStemming [yes|NO]

1763

Set to yes to apply the word stemming algorithm during indexing;
otherwise set to no.

1765

1766

UseStemming no

1767

UseStemming yes

1768

1769

When UseStemming is set to C<yes> every word is stemmed before placing

1770

it in to the index.

1771

This option is deprecated. It has been superseded by C<FuzzyIndexingMode>.

1773

1774

=item UseSoundex [yes|NO]

1775

1776

When UseSoundex is set to C<yes> every word is converted to a Soundex

1777

code before placing it in to the index.

1778

This option is deprecated. It has been superseded by C<FuzzyIndexingMode>.

1780

1781

=item IgnoreTotalWordCountWhenRanking [YES|no]

1782

Set to yes to ignore the total number of words in the file when
calculating ranking. This often gives better results with merged indexes
and small files. The default is yes.

1785

1786

IgnoreTotalWordCountWhenRanking no

1787

1788

The default was changed from no to yes in version 2.2.

1789

1790

=item MinWordLimit *integer*

1791

Set the minimum length of a word. Shorter words will not be indexed.

1793

The default is 1 (as defined in F<src/config.h>).

1794

1795

MinWordLimit 5

1796

1797

=item MaxWordLimit *integer*

1798

Set the maximum length of an indexable word. Longer words will not
be indexed. The default is 40 (as defined in F<src/config.h>).

1801

1802

=item WordCharacters *string of characters*

1803

1804

=item IgnoreFirstChar *string of characters*

1805

1806

=item IgnoreLastChar *string of characters*

1807

1808

=item BeginCharacters *string of characters*

1809

=item EndCharacters *string of characters*

1811

1812

1813

These settings define what a word consists of to the Swish-e indexing engine.

1814

Compiled in defaults are in F<src/config.h>.

1815

When indexing, Swish-e uses B<WordCharacters> to split the document
into words. A word is any string of non-blank characters that contains
only the characters listed in WordCharacters. If a string of characters
includes a character that is not in WordCharacters then the string is
split into two or more separate words.

1821

1822

For example:

1823

1824

WordCharacters abde

1825

1826

Would turn "abcde" into two words "ab" and "de".

1827

1828

Next, of these words, any characters defined in B<IgnoreFirstChar> are

1829

stripped off the start of the word, and B<IgnoreLastChar> characters

1830

are stripped off the end of the word. This allows, for example,

1831

periods within a word (www.slashdot.com), but not at the end of

1832

a word. Characters in IgnoreFirstChar and IgnoreLastChar must be in

1833

WordCharacters.

1834

1835

Finally, the resulting words MUST begin with one of the characters

1836

listed in B<BeginCharacters> and end with one of the characters listed in

1837

B<EndCharacters>. BeginCharacters and EndCharacters must be a subset of

1838

the characters in WordCharacters. Often, WordCharacters, BeginCharacters

1839

and EndCharacters will all be the same.

1840

1841

Note that the same process applies to the query while searching.

1842

1843

Getting these settings correct will take careful consideration and

1844

practice. It's helpful to create an index of a single test file, and

1845

then look at the words that are placed in the index (see the C<-v 4>,

1846

C<-D> and C<-k> searching switches).

1847

1848

Currently there is only support for eight-bit characters.

1849

1850

Example:

1851

1852

WordCharacters .abcdefghijklmnopqrstuvwxyz

1853

BeginCharacters abcdefghijklmnopqrstuvwxyz

1854

EndCharacters abcdefghijklmnopqrstuvwxyz

1855

IgnoreFirstChar .

1856

IgnoreLastChar .

1857

1858

So the string

1859

1860

Please visit http://www.example.com/path/to/file.html.

1861

1862

will be indexed as the following words:

1863

please
visit
http
www.example.com
path
to
file.html

1871

1872

Which means that you can search for C<www.example.com> as a single word,

1873

but searching for just C<example> will not find the document.
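The three steps described in this section (split on WordCharacters, strip IgnoreFirstChar and IgnoreLastChar, then check BeginCharacters and EndCharacters) can be modelled in a few lines of Python. This is a sketch of the documented behaviour, not the Swish-e tokenizer itself; C<split_words> is a hypothetical name, and lowercasing the input is an assumption made so that the result matches the word list shown above.

```python
def split_words(text, word_chars, begin_chars, end_chars,
                ignore_first="", ignore_last=""):
    """Split text into indexable words following the three documented steps."""
    allowed = set(word_chars)
    candidates, current = [], []
    # Step 1: any character outside WordCharacters ends the current word.
    for ch in text.lower() + " ":  # the trailing space flushes the last word
        if ch in allowed:
            current.append(ch)
        elif current:
            candidates.append("".join(current))
            current = []
    words = []
    for word in candidates:
        # Step 2: strip IgnoreFirstChar / IgnoreLastChar characters.
        word = word.lstrip(ignore_first).rstrip(ignore_last)
        # Step 3: the word must start and end with permitted characters.
        if word and word[0] in begin_chars and word[-1] in end_chars:
            words.append(word)
    return words


if __name__ == "__main__":
    letters = "abcdefghijklmnopqrstuvwxyz"
    print(split_words("Please visit http://www.example.com/path/to/file.html.",
                      "." + letters, letters, letters, ".", "."))
    print(split_words("abcde", "abde", "abde", "abde"))
```

Running it on the example string reproduces the word list above, and the C<WordCharacters abde> case yields "ab" and "de".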

1874

1875

Note: when indexing HTML documents HTML entities are converted to their

1876

character equivalents before being processed with these directives.

1877

This is a change from previous versions of Swish-e where you were

1878

required to include the characters C<0123456789&#;> to index entities.

1879

See also L<ConvertHTMLEntities|/"item_ConvertHTMLEntities">

1880

1881

=item Buzzwords [*list of buzzwords*|File: path]

1882

1883

The Buzzwords option allows you to specify words that will be indexed

1884

regardless of WordCharacters, BeginCharacters, EndCharacters, stemming,

1885

soundex and many of the other checks done on words while indexing.

1886

1887

Buzzwords are case insensitive.

1888

1889

Buzzwords should be separated by spaces and may span multiple directives.

1890

If the special format C<File:filename> is used then the Buzzwords will

1891

be read from an external file during indexing.

1892

1893

Examples:

1894

1895

Buzzwords C++ TCP/IP

1896

1897

Buzzwords File: ./buzzwords.lst

1898

1899

If a Buzzword contains search operator characters they must be backslashed

1900

when searching. For example:

1901

1902

Buzzwords C++ TCP/IP web=http

1903

1904

./swish-e -w 'web\=http'

1905

1906

Buzzwords are found by splitting the text on whitespace, removing

1907

C<IgnoreFirstChar> and C<IgnoreLastChar> characters from the word,

1908

and then comparing with the list of C<Buzzwords>. Therefore, if

1909

adding C<Buzzwords> to an index you will probably want to define

1910

C<IgnoreFirstChar> and C<IgnoreLastChar> settings.
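The matching procedure just described (split on whitespace, strip the ignore characters, compare with the list) looks roughly like this in Python. It is a sketch for illustration; C<find_buzzwords> is a hypothetical name and the sample sentence is made up.

```python
def find_buzzwords(text, buzzwords, ignore_first="", ignore_last=""):
    """Return the buzzwords found in text, in order of appearance (lowercased)."""
    wanted = {b.lower() for b in buzzwords}  # Buzzwords are case insensitive
    found = []
    for token in text.split():  # split on whitespace only
        word = token.lstrip(ignore_first).rstrip(ignore_last).lower()
        if word in wanted:
            found.append(word)
    return found


if __name__ == "__main__":
    sentence = "We use C++, Perl and TCP/IP."
    print(find_buzzwords(sentence, ["C++", "TCP/IP"], ignore_last=".,"))
    print(find_buzzwords(sentence, ["C++", "TCP/IP"]))
```

Without the ignore settings the tokens "C++," and "TCP/IP." keep their trailing punctuation and no buzzword is recognized, which is why defining C<IgnoreFirstChar> and C<IgnoreLastChar> matters here.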

1911

1912

Note: Buzzwords specific settings for C<IgnoreFirstChar> and

1913

C<IgnoreLastChar> may be used in the future.

1914

1915

1916

=item IgnoreWords [*list of stop words*|File: path]

1917

1918

The IgnoreWords option allows you to specify words to ignore, called

1919

I<stopwords>. The default is to not use any stopwords.

1920

1921

Words should be separated by spaces and may span multiple directives.

1922

If the special format C<File:filename> is used then the stop words will

1923

be read from an external file during indexing.

1924

1925

In previous versions of Swish-e you could use the directive

1926

1927

IgnoreWords swishdefault - obsolete!

1928

1929

to include a default list of compiled in stopwords. This keyword is no

1930

longer supported.

1931

1932

Examples:

1933

1934

IgnoreWords www http a an the of and or

1935

1936

IgnoreWords File: ./stopwords.de

1937

1938

=item UseWords [*list of words*|File: path]

1939

1940

UseWords defines the words that Swish-e will index. B<Only> the words

1941

listed will be indexed.

1942

1943

You can specify a list of words following the directive (you may specify

1944

more than one C<UseWords> directive in a config file), and/or use the

1945

C<File:> form to specify a path to a file containing the words:

1946

UseWords perl python pascal fortran basic cobol php

1948

UseWords File: /path/to/my/wordlist

1949

1950

Please drop the Swish-e list a note if you actually use this feature.

1951

It may be removed from future versions.

1952

1953

=item IgnoreLimit *integer integer*

1954

1955

This automatically omits words that appear too often in the files (these

1956

words are called stopwords). Specify a whole percentage and a number,

1957

such as "80 256". This omits words that occur in over 80% of the files

1958

and appear in over 256 files. Comment out to turn off auto-stopwording.

1959

1960

IgnoreLimit 50 1000

1961

Swish-e must do extra processing to adjust the entire index when this
feature is used. Instead of relying on this feature, it is recommended
that you decide which words are stopwords and add them to B<IgnoreWords>
in your configuration file. To do this, use IgnoreLimit one time and
note the stop words that are found while indexing. Add this list to
IgnoreWords, and then remove IgnoreLimit from the configuration file.
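The two IgnoreLimit thresholds combine as a logical "and": a word becomes a stopword only if it exceeds both the percentage and the file count. The following Python sketch shows that rule on a tiny collection; C<auto_stopwords> is a hypothetical helper, not Swish-e code.

```python
def auto_stopwords(docs, percent, min_files):
    """Return words found in more than `percent` % of docs AND in more than `min_files` docs.

    docs is a list of word lists, one list per file.
    """
    total = len(docs)
    file_counts = {}
    for words in docs:
        for word in set(words):  # count each file at most once per word
            file_counts[word] = file_counts.get(word, 0) + 1
    return sorted(word for word, n in file_counts.items()
                  if n > min_files and n * 100 > percent * total)


if __name__ == "__main__":
    files = [["the", "cat"], ["the", "dog"], ["the", "cat", "sat"],
             ["the", "bird"], ["a", "fish"]]
    # Equivalent in spirit to "IgnoreLimit 50 3" on this tiny collection.
    print(auto_stopwords(files, 50, 3))
```

The printed list is what you would copy into an C<IgnoreWords> directive before removing C<IgnoreLimit> again.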

1968

1969

=item IgnoreMetaTags *list of names*

1970

1971

C<IgnoreMetaTags> defines a list of metatags to ignore while indexing

1972

XML files (and HTML files if using libxml2 for parsing HTML). All text

1973

within the tags will be ignored -- both for indexing (C<MetaNames>)

1974

and properties (C<PropertyNames>). To still parse properties, yet do

1975

not index the text, see L<UndefinedMetaTags|/"item_UndefinedMetaTags">.

1976

1977

This option is useful to avoid indexing specific data from a file.

1978

For example:

1979

<person>
  <first_name>
    William
  </first_name>
  <last_name>
    Shakespeare
  </last_name>
  <updated_date>
    April 25, 1999
  </updated_date>
</person>

1989

1990

In the above example you might B<not> want to index the updated date,

1991

and therefore prevent finding this record by searching

1992

1993

-w 'person=(April)'

1994

1995

This is solved by:

1996

1997

IgnoreMetaTags updated_date

1998

1999

2000

See also L<UndefinedMetaTags|/"item_UndefinedMetaTags">.

2001

2002

=item IgnoreNumberChars *list of characters*

2003

2004

Experimental Feature

2005

2006

This experimental feature can be used to define a set of characters

2007

that describe a number. If a word is found to contain only those

2008

characters it will not be indexed. The characters listed must be part

2009

of C<WordCharacters> settings. In other words, the "word" checked is

2010

a word that Swish-e would otherwise index.

2011

2012

For example,

2013

2014

IgnoreNumberChars 0123456789$.,

2015

2016

Then Swish-e would not index the following:

2017

2018

123

2019

123,456.78

2020

$123.45

2021

2022

You might be tempted to avoid indexing hex numbers with:

2023

2024

IgnoreNumberChars 0123456789abcdef

2025

2026

which will not index 0D31, but will also not index the word "bad".
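The check itself is simple: a word is skipped when every one of its characters is in the configured set. A Python sketch follows, with the hypothetical name C<is_ignored_number> and the assumption that words have already been lowercased before the test.

```python
def is_ignored_number(word, number_chars):
    """True if the word consists only of characters listed in IgnoreNumberChars."""
    return bool(word) and all(ch in number_chars for ch in word)


if __name__ == "__main__":
    money = "0123456789$.,"
    for word in ("123", "123,456.78", "$123.45", "abc123"):
        print(word, is_ignored_number(word, money))

    hexdigits = "0123456789abcdef"
    for word in ("0d31", "bad", "good"):
        print(word, is_ignored_number(word, hexdigits))
```

The second loop shows the side effect mentioned above: with the hex character set, "bad" is rejected along with "0d31", while "good" is still indexed.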

2027

2028

This is an experimental feature that may change in future versions.

2029

One possible change is to use regular expressions instead.

2030

2031

2032

=item IndexComments [NO|yes]

2033

This option allows the user to decide whether to index the contents of
HTML comments. The default is no. Set to yes if comment indexing is
required.

2036

2037

IndexComments yes

2038

2039

Note: This is a change in the default behavior prior to version 2.2.

2040

2041

=item TranslateCharacters [*string1 string2*|:ascii7:]

2042

2043

The TranslateCharacters directive maps the characters in string1 to the

2044

characters listed in string2.

2045

2046

For example:

2047

# This will index a_b as a-b and ámo as amo
TranslateCharacters _á -a

2050

C<TranslateCharacters :ascii7:> is a predefined set of characters that
will translate eight bit characters to ascii7 characters. Using the
:ascii7: rule will translate "Ääç" to "aac". This means: searching
"Çelik", "çelik" or "celik" will all match the same word.

2055

2056

TranslateCharacters is done early in the indexing process, after

2057

converting HTML entities but before splitting the input text into words

2058

based on B<WordCharacters>. So characters you are translating I<from>

2059

do not need to be listed in word characters.

2060

2061

The same character translations take place when searching.
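A character-for-character mapping of this kind corresponds to a translation table. The Python sketch below uses the built-in C<str.maketrans> to mimic a two-string TranslateCharacters setting; C<make_translator> is a hypothetical name, and the accented character in the sample is only an assumed example.

```python
def make_translator(string1, string2):
    """Map the i-th character of string1 to the i-th character of string2."""
    if len(string1) != len(string2):
        raise ValueError("string1 and string2 must have the same length")
    return str.maketrans(string1, string2)


if __name__ == "__main__":
    # Underscore becomes a hyphen, a-acute becomes a plain "a".
    table = make_translator("_\u00e1", "-a")
    print("a_b \u00e1mo".translate(table))
```

Because the same table is applied to the query, a search typed with or without the accent reaches the same index entry.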

2062

2063

=item BumpPositionCounterCharacters *string*

2064

2065

When indexing Swish-e assigns a word position to each word. This enables

2066

phrase searching. There may be cases where you would like to prevent

2067

phrase matching. The BumpPositionCounterCharacters directive allows

2068

you to specify a set of characters that when found in the text will

2069

increment the word position -- effectively preventing phrase matches

2070

across that character.

2071

2072

For example, if you have a tag:

2073

<subjects>
computer programming | apple computers
</subjects>

2077

2078

You might want to prevent matching "programming apple" in that meta name.

2079

2080

BumpPositionCounterCharacters |

2081

2082

There is no default, and you may list a string of characters.
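The effect on phrase searching can be seen in a small model in which each word receives a position and a phrase matches only when its words occupy consecutive positions. This is a simplified Python sketch, not Swish-e's indexer; C<index_positions> and C<phrase_matches> are hypothetical names, and the tokenizer simply keeps alphanumeric characters.

```python
def index_positions(text, bump_chars=""):
    """Return (word, position) pairs; a bump character leaves a gap in the numbering."""
    positions = []
    pos = 0
    for token in text.lower().split():
        if any(ch in bump_chars for ch in token):
            pos += 1  # extra step: neighbours are no longer adjacent
        word = "".join(ch for ch in token if ch.isalnum())
        if word:
            pos += 1
            positions.append((word, pos))
    return positions


def phrase_matches(positions, phrase):
    """True if the words of phrase occur at consecutive positions."""
    words = phrase.lower().split()
    lookup = {}
    for word, pos in positions:
        lookup.setdefault(word, set()).add(pos)
    return any(
        all(start + i in lookup.get(w, ()) for i, w in enumerate(words))
        for start in lookup.get(words[0], ())
    )


if __name__ == "__main__":
    text = "computer programming | apple computers"
    plain = index_positions(text)
    bumped = index_positions(text, bump_chars="|")
    print(phrase_matches(plain, "programming apple"))
    print(phrase_matches(bumped, "programming apple"))
    print(phrase_matches(bumped, "apple computers"))
```

With the bump character configured, "programming" and "apple" are two positions apart, so the unwanted phrase no longer matches while "apple computers" still does.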

2083

2084

=item DontBumpPositionOnEndTags *list of names*

2085

2086

=item DontBumpPositionOnStartTags *list of names*

2087

Since metatags are typically separate data fields, the word position
counter is automatically bumped between metatags (actually, it is bumped
when a start tag is found and again when an end tag is found). This
prevents matching a phrase that spans more than one metaname.
C<DontBumpPositionOnEndTags> and C<DontBumpPositionOnStartTags> disable
this feature for the listed metanames.

2094

2095

For example,

2096

<person>
  <first_name>
    William
  </first_name>
  <last_name>
    Shakespeare
  </last_name>
  <updated_date>
    April 25, 1999
  </updated_date>
</person>

2108

2109

In the configuration file:

2110

2111

DontBumpPositionOnEndTags first_name

2112

DontBumpPositionOnStartTags last_name

2113

2114

This configuration allows this phrase search

2115

2116

-w 'person=("william shakespeare")'

2117

2118

but this phrase search will fail

2119

2120

-w 'person=("shakespeare april")'

2121

2122

2123

2124

=back

2125

2126

2127

=head2 Directives for the File Access method only

2128

2129

Some directives have different uses depending on the source of the

2130

documents. These directives are only valid when using the B<File system>

2131

method of indexing.

2132

2133

=over 4

2134

2135

=item IndexOnly *list of file suffixes*

2136

2137

This directive specifies the allowable file suffixes (extensions) while

2138

indexing. The default is to index all files specified in B<IndexDir>.

2139

2140

# Only index .html .htm and .q files

2141

IndexOnly .html .htm .q

2142

C<IndexOnly> checks that the file name ends in the characters listed.
It does not check "extensions" as such. C<IndexOnly> is tested right
before C<FileRules> is processed.

2146

2147

=item FollowSymLinks [yes|NO]

2148

2149

Put "yes" to follow symbolic links in indexing, else "no". Default is no.

2150

2151

FollowSymLinks no

2152

FollowSymLinks yes

2153

Note that when set to C<no> extra stat(2) system calls must be made for
each file. For a large number of files you may see a small reduction in
indexing time by setting this to C<yes>.

2157

2158

See also the C<-l> switch in L<SWISH-RUN|SWISH-RUN>.

2159

2160

=item FileRules [type] [contains|is|regex] *regular expression*

2161

2162

=item FileMatch [type] [contains|is|regex] *regular expression*

2163

2164

FileRules and FileMatch are used to, respectively, exclude and include

2165

files and directories to index. Since, by default, Swish-e indexes all

2166

files and recurses all directories (but see also C<FollowSymLinks>) you

2167

will typically only use C<FileRules> to exclude files or directories.

2168

C<FileMatch> is useful in a few cases, for example, to override the

2169

behavior of C<IndexOnly>. Some examples are included below.

2170

2171

Except for C<FileRules title ...>, this feature is only available for

2172

file access method (-S fs), which is the default indexing mode. Also,

2173

any pathname modification with C<ReplaceRules> happens after the check

2174

for C<FileRules>. (It's unlikely that you would exclude files with

2175

C<FileRules> based on text you added with C<ReplaceRules>!)

2176

2177

The regular expression is a C regex.h extended regular expression.

2178

You may supply more than one regular expression per line, or use

2179

separate directives. Preceding the regular expression with the word

2180

"not" negates the match.

2181

2182

The regular expression is compared against B<[type]> as described below.

2183

2184

For historical reasons, you can specify C<contains> or C<is>. C<is>

2185

simply forces the regular expression to match at the start and end

2186

of the string (by internally prepending "^" and appending "$" to the

2187

regular expression).

2188

2189

The C<regex> option requires delimiter characters:

2190

2191

FileRules title regex /^private/i

2192

The only advantages of C<regex> are that it allows case-insensitive
matches (the trailing C<i> flag) and that the patterns look like Perl
regular expressions, if you prefer that style. The opening and closing
delimiter must be the same character; the paired delimiters (), {},
and [] are not currently supported (for no good reason other than
laziness).

2197

2198

Use quotes (" or ') around a pattern if it contains any white space.

2199

Note that the backslash character becomes the escape character within

2200

quotes.

2201

2202

For example, these sets generate the same regular expressions.

2203

2204

FileRules title is hello

2205

FileRules title contains ^hello$

2206

FileRules title regex /^hello$/
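How the three forms reduce to one pattern can be sketched as follows. The sketch uses Python's C<re> module, whose syntax differs in places from the C regex.h extended syntax that Swish-e uses, so it only models the anchoring and the delimiter handling; C<compile_rule> is a hypothetical name.

```python
import re


def compile_rule(kind, pattern):
    """Turn an is / contains / regex pattern into a compiled expression."""
    if kind == "is":
        # "is" anchors the pattern at both ends.
        return re.compile("^" + pattern + "$")
    if kind == "contains":
        return re.compile(pattern)
    if kind == "regex":
        # The first character is the delimiter; anything after the last
        # delimiter is read as flags ("i" means ignore case).
        delim = pattern[0]
        body, sep, flags = pattern[1:].rpartition(delim)
        if not sep:
            raise ValueError("missing closing delimiter")
        return re.compile(body, re.IGNORECASE if "i" in flags else 0)
    raise ValueError("unknown match kind: " + kind)


if __name__ == "__main__":
    for kind, pattern in (("is", "hello"),
                          ("contains", "^hello$"),
                          ("regex", "/^hello$/")):
        print(kind, compile_rule(kind, pattern).pattern)
    print(bool(compile_rule("regex", "/^private/i").search("Private notes")))
```

All three spellings end up as the same anchored expression, and only the C<regex> form can carry the case-insensitive flag.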

2207

2208

These all need quotes due to the included space character

2209

2210

FileRules title is "hello there"

2211

FileRules title contains "^hello there$"

2212

FileRules title regex "!^hello there$!"

2213

2214

These show how the backslash must be doubled inside of quotes.

2215

Swish-e converts a double-backslash into a single backslash, and then

2216

passes that single onto the regular expression compiler.

2217

2218

FileRules filename regex /\.pdf/

2219

FileRules filename regex "/\\.pdf/"

2220

2221

FileRules filename regex !hello\\there! # need double for real backslash

2222

FileRules filename regex "!hello\\\\there!" # need double-double inside of quotes

2223

2224

2225

B<Matching Types>

2226

The following types of match strings may be supplied:

2228

2229

FileRules pathname

2230

FileRules dirname

2231

FileRules filename

2232

FileRules directory

2233

FileRules title

2234

2235

FileMatch pathname

2236

FileMatch filename

2237

FileMatch dirname

2238

FileMatch directory

2239

2240

B<pathname> matches the regular expression against the current pathname.

2241

The pathname may or may not be absolute depending on what you supplied

2242

to C<IndexDir>.

2243

2244

Example:

2245

2246

# Don't index paths that contain private or hidden

2247

FileRules pathname contains (private|hidden)

2248

2249

# Same thing

2250

FileRules pathname regex /(private|hidden)/

2251

2252

# Don't index exe files

2253

FileRules pathname contains \.exe$

2254

2255

B<dirname> and B<filename> split the path name by the last delimiter

2256

character into a directory name, and a file name. Then these are compared

2257

against the patterns supplied. Directory names do B<not> have a trailing

2258

slash. All path names use the forward slash as a delimiter within Swish-e.

2259

2260

Example:

2261

2262

# Same as last example - don't index *.exe files.

2263

FileRules filename contains \.exe$

2264

# Don't index any file called test.html

2266

FileRules filename contains ^test\.html$

2267

2268

# Same thing

2269

FileRules filename is test\.html

2270

2271

# Don't index any directories that contain "old" (/usr/local/myold/docs)

2272

FileRules dirname contains old

2273

2274

# Don't index any directories that contain the path segment "old" (/usr/local/old/foo)

2275

FileRules dirname contains /old/

2276

2277

# Index only .htm, .html, plus any all-digit file names

2278

IndexOnly .htm .html

FileMatch filename contains ^[0-9]+$

2280

2281

# Same as previous, but maybe a little slower

2282

FileRules filename regex not !\.(htm|html)$!

FileMatch filename contains ^[0-9]+$

2284

2285

Swish-e checks these settings in the order of C<pathname>, C<dirname>, and

2286

C<filename>, and C<FileMatch> patterns are checked before C<FileRules>,

2287

in general. This allows you to exclude most files with C<FileRules>,

2288

yet allow in a few special cases with C<FileMatch>. For example:

2289

2290

# Exclude all files of .exe, .bin, and .bat

2291

FileRules filename contains \.(exe|bin|bat)$

2292

# But, let these two in

2293

FileMatch filename is baseball\.bat incoming_mail\.bin

2294

2295

# Same, but as a single pattern

2296

FileMatch filename is (baseball\.bat|incoming_mail\.bin)

2297

2298

The C<directory> type is somewhat unique. When Swish-e recurses into a

2299

directory it will compare all the I<files> in the directory with the

2300

pattern and then decide if that entire directory should or should not

2301

be indexed (or recursed). Note that you are matching against file names

2302

in a directory -- and some of those names may be directory names.

2303

2304

A C<FileRules directory> match will cause Swish-e to ignore all files and

2305

sub-directories in the current directory.

2306

Warning: A match with C<FileMatch directory> says to index B<everything>
in the I<current> directory and B<ignore> any FileRules for this directory.

2309

2310

2311

Example:

2312

2313

# Don't index any directories (and sub directories) that contain

2314

# a file (or sub-directory) called "index.skip"

2315

FileRules directory contains ^index\.skip$

2316

2317

# Don't index directories that contain a .htaccess file.

2318

FileRules directory contains ^\.htaccess

2319

2320

Note: While I<processing> directories, Swish-e will ignore any files

2321

or directories that begin with a dot ("."). You may index files

2322

or directories that begin with a dot by specifying their name with

2323

C<IndexDir> or C<-i>.

2324

2325

C<title> checks for a pattern match in an HTML title.

2326

2327

Example:

2328

2329

FileRules title contains construction example pointers

2330

2331

# This example says to ignore case

2332

FileRules title regex "/^Internal document/i"

2333

2334

Note: C<FileRules title> works for any input method (fs, prog, or http)

2335

that is parsed as HTML, and where a title was found in the document.

2336

2337

In case all this seems a bit confusing, processing a directory happens

2338

in the following order.

2339

2340

First the directory name is checked:

2341

2342

FileRules dirname - reject entire directory if matches

2343

2344

Next the directory is scanned and each file name (which might be the

2345

name of a sub-directory) is checked:

2346

2347

FileRules directory - reject entire dir if *any* files match

2348

FileMatch directory - accept entire dir if *any* files match

2349

2350

Then, unless C<FileMatch directory> matched, each file is tested with

2351

FileMatch. A match says to index the file without further testing

2352

(i.e. overrides FileRules and IndexOnly):

2353

2354

FileMatch pathname \

2355

FileMatch dirname - file is accepted if any match

2356

FileMatch filename /

2357

2358

otherwise

2359

2360

IndexOnly - file is checked for the correct file extension

2361

2362

FileRules pathname \

2363

FileRules dirname - file is rejected if any match

2364

FileRules filename /

2365

2366

finally, the file is indexed.
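The per-file part of this order (FileMatch first, then IndexOnly, then FileRules) can be written as a short decision function. The Python sketch below models only that part; the directory-level checks are left out, Python's C<re> stands in for the C regex library, and C<should_index> is a hypothetical name.

```python
import re


def should_index(pathname, file_match=(), index_only=(), file_rules=()):
    """Decide whether one file is indexed.

    file_match and file_rules are lists of (type, pattern) pairs where type
    is "pathname", "dirname" or "filename"; index_only is a list of suffixes.
    """
    dirname, _, filename = pathname.rpartition("/")
    parts = {"pathname": pathname, "dirname": dirname, "filename": filename}

    # 1. Any FileMatch hit accepts the file without further testing.
    if any(re.search(pattern, parts[kind]) for kind, pattern in file_match):
        return True
    # 2. IndexOnly: the file name must end in one of the listed suffixes.
    if index_only and not filename.endswith(tuple(index_only)):
        return False
    # 3. Any FileRules hit rejects the file.
    if any(re.search(pattern, parts[kind]) for kind, pattern in file_rules):
        return False
    # 4. Otherwise the file is indexed.
    return True


if __name__ == "__main__":
    rules = [("filename", r"\.(exe|bin|bat)$")]
    match = [("filename", r"^(baseball\.bat|incoming_mail\.bin)$")]
    for path in ("docs/baseball.bat", "docs/setup.bat", "docs/readme.txt"):
        print(path, should_index(path, file_match=match, file_rules=rules))
```

The example mirrors the C<.exe>, C<.bin> and C<.bat> configuration shown earlier: F<baseball.bat> is let in by FileMatch even though FileRules would otherwise reject it.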

2367

2368

Files (not directories) listed with C<IndexDir> or C<-i> are processed

2369

in a similar way:

2370

2371

FileMatch pathname \

2372

FileMatch dirname - file is accepted if any match

2373

FileMatch filename /

2374

2375

otherwise, the file is rejected if it doesn't have the correct extension

2376

or a FileRules matches.

2377

2378

IndexOnly - file is checked for the correct file extension

2379

2380

FileRules pathname \

2381

FileRules dirname - file is rejected if any match

2382

FileRules filename /

2383

2384

Note: If things are not indexing as you expect, create a directory

2385

with some test files and use the C<-T regex> trace option to see how

2386

file names are checked. Start with very simple tests!

2387

2388

2389

=back

2390

2391

=head2 Directives for the HTTP Access Method Only

2392

2393

The HTTP Access method is enabled by the "-S http" switch when indexing. It works by

2394

running a Perl program called SwishSpider which fetches documents from a web server.

2395

By default, only text files (content-type of "text/*") are indexed with
the HTTP Access Method. Other document types (e.g. PDF or MSWord) may be
indexed as well if they can be converted: the SwishSpider will attempt
to use the SWISH::Filter module (included with the Swish-e distribution)
to convert documents into a format that Swish-e can index.

2400

2401

Note: The -S prog method of spidering (using spider.pl) can be a replacement for the -S http method.

2402

It offers more configuration options and better spidering speed.

2403

2404

These directives below are available when using the HTTP Access Method of indexing.

2405

2406

=over 4

2407

2408

=item MaxDepth *integer*

2409

2410

MaxDepth defines how many links the spider should follow before stopping.

2411

A value of 0 configures the spider to traverse all links. The default

2412

is MaxDepth 0.

2413

2414

MaxDepth 5

2415

2416

Note: The default was changed from 5 to 0 in release 2.4.0

2417

2418

=item Delay *seconds*

2419

2420

The number of seconds to wait between issuing requests to a server.

2421

This setting allows for more friendly spidering of remote sites.

2422

The default is 5 seconds.

2423

2424

Delay 1

2425

2426

Note: The default was changed from 60 to 5 seconds in release 2.4.0

=item TmpDir *path*

The location of a writable temp directory on your system. The HTTP
access method tells the Perl helper to place its files in this location,
and the C<-e> switch causes Swish-e to use this directory while indexing.
There is no default.

    TmpDir /tmp/swish

If this directory does not exist or is not writable, Swish-e will fail
with an error during indexing.

Note: the environment variables C<TMPDIR>, C<TMP>, and C<TEMP>
(checked in that order) will B<override> this setting.
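The override order described in the note can be written out as a small lookup. This is a sketch of the rule as stated here, not Swish-e's own code; the function name C<effective_tmpdir> is invented for the example, and treating an empty variable as unset is an assumption.

```python
import os


def effective_tmpdir(config_tmpdir, environ=os.environ):
    """Return the temp directory that wins under the documented rule."""
    # The environment is consulted first, in this order.
    for name in ("TMPDIR", "TMP", "TEMP"):
        value = environ.get(name)
        if value:                  # assumption: an empty variable counts as unset
            return value
    # No environment variable set: fall back to the TmpDir directive.
    return config_tmpdir
```

If a C<TmpDir> setting seems to be ignored, checking these three variables in the indexing environment is a reasonable first step.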

=item SpiderDirectory *path*

The location of the Perl helper script called F<swishspider>. If you
use a relative directory, it is relative to your current directory when you run
Swish-e, not to the directory that Swish-e is in.
The default is the location where swishspider was installed.
Normally this does not need to be set.

    SpiderDirectory /usr/local/swish

=item EquivalentServer *server alias*

The same site is often referred to by different names.
A common example is that http://www.some-server.com and
http://some-server.com are the same. Each line should list
all the method/names that should be considered equivalent. Multiple
EquivalentServer directives may be used. Each directive defines its
own set of equivalent servers.

    EquivalentServer http://library.berkeley.edu http://www.lib.berkeley.edu
    EquivalentServer http://sunsite.berkeley.edu:2000 http://sunsite.berkeley.edu

=back
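One way to picture what C<EquivalentServer> achieves is as a rewrite of every alias to a single name, so that the same page is not indexed twice. The sketch below is an illustration only. The function names are invented, and choosing the first server on each line as the canonical name is an assumption that this document does not state.

```python
def build_aliases(directive_lines):
    """Map every listed server to the first server on its line."""
    aliases = {}
    for line in directive_lines:
        servers = [s.rstrip("/") for s in line.split()]
        for server in servers:
            aliases[server] = servers[0]   # assumption: first name is canonical
    return aliases


def canonical(url, aliases):
    """Rewrite url so that equivalent servers share one name."""
    # Try longer prefixes first so "host:2000" is not mistaken for "host".
    for prefix in sorted(aliases, key=len, reverse=True):
        if url == prefix or url.startswith(prefix + "/"):
            return aliases[prefix] + url[len(prefix):]
    return url
```

Using the two example directives above, C<http://www.lib.berkeley.edu/help.html> and C<http://library.berkeley.edu/help.html> map to the same string, so a spider keeping a set of visited URLs would fetch the page once.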

=head2 Directives for the prog Access Method Only

This section details the directives that are only available for the
"prog" document source feature of Swish-e. The "prog" access method runs
an external program that "feeds" documents to Swish-e. This allows indexing
and filtering of documents from any source.

See L<prog - general purpose access method|SWISH-RUN/"item_prog"> in
the SWISH-RUN man page for more information.

A number of example programs for use with the "prog" access method are
provided in the F<prog-bin> directory. Please see those examples if you
have questions about implementing a "prog" input program.

=over 4

=item SwishProgParameters *list of parameters*

This is a list of parameters that will be sent to the external program
when running with the "prog" document source method.

    SwishProgParameters /path/to/config hello there
    IndexDir /path/to/program.pl

Then, when you run:

    swish-e -c config -S prog

Swish-e will execute C</path/to/program.pl> and pass C</path/to/config
hello there> as three command line arguments to the program. This
directive makes it easy to pass settings from the Swish-e configuration
file to the external program.

For example, the C<spider.pl> program (included in the C<prog-bin>
directory) uses the C<SwishProgParameters> to specify what file to read
for configuration information.

    SwishProgParameters spider.config
    IndexDir ./spider.pl

The C<spider.pl> program also has a default action so you can avoid
using a configuration file:

    SwishProgParameters default http://www.swish-e.org/ http://some.other.site/
    IndexDir ./spider.pl

The spider program will then use default settings for spidering those sites.

Swish-e can read documents from standard input, so another way to run an external program
with parameters is:

    ./spider.pl spider.conf | ./swish-e -S prog -i stdin

=back
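To make the idea of an external program "feeding" documents more concrete, here is a minimal sketch of such a program. It turns each parameter it receives (for example from C<SwishProgParameters>) into one tiny document. It assumes the header format described in the SWISH-RUN page, where each document is preceded by C<Path-Name> and C<Content-Length> headers and a blank line; consult that page for the authoritative list of headers. The names C<format_document>, C<main>, and the F<param-N.txt> path names are invented for this example.

```python
import sys


def format_document(path_name, content):
    """Return one document as bytes: headers, a blank line, then the body."""
    body = content.encode("utf-8")
    # Content-Length counts bytes, not characters, so encode before measuring.
    header = "Path-Name: %s\nContent-Length: %d\n\n" % (path_name, len(body))
    return header.encode("utf-8") + body


def main(argv):
    # argv[1:] holds whatever was listed in SwishProgParameters
    # (or given on the command line when piping into swish-e).
    for n, word in enumerate(argv[1:], 1):
        text = "Parameter %d is %s\n" % (n, word)
        sys.stdout.buffer.write(format_document("param-%d.txt" % n, text))


if __name__ == "__main__":
    main(sys.argv)
```

Such a script could be named in C<IndexDir>, or its output piped into C<swish-e -S prog -i stdin> as shown above.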

B<Notes when using MS Windows>

You should use unix style path separators to specify your external
program. Swish-e will convert forward slashes to backslashes before
calling the external program. This is only true for the program name
specified with C<IndexDir> or the C<-i> command line option.

In addition, Swish-e will make sure the program specified actually exists,
which means you need to use the full name of the program.

For example, to run the perl spider program F<spider.pl> you would need
a Swish-e configuration file such as:

    IndexDir e:/perl/bin/perl.exe
    SwishProgParameters prog-bin/spider.pl default http://swish-e.org

and run indexing with the command:

    swish-e -c swish.cfg -S prog -v 9

The C<IndexDir> directive tells Swish-e the name of the program to run.
Under unix you can just specify the name of the script, since unix will
figure out the program from the first line of the script.

The C<SwishProgParameters> are the parameters passed to the program
specified by C<IndexDir> (perl.exe in this case). The first parameter
is the perl script to run (F<prog-bin/spider.pl>). Perl passes the rest
of the parameters directly to the perl script. The second parameter,
C<default>, tells the F<spider.pl> program to use default settings for
spidering (or you could specify a spider config file; see C<perldoc
spider.pl> for details). Lastly, the URL is passed into the spider
program.

=head2 Document Filter Directives

Internally, Swish-e knows how to parse only text, HTML, and XML documents.
With "filters" you can index other types of documents. For example,
if all your web pages are in gzip format a filter can uncompress these
on the fly for indexing.

You may wish to read the Swish-e FAQ question on filtering before continuing here.
L<How Do I filter documents?|SWISH-FAQ/"How Do I filter documents?">

There are two suggested methods for filtering.

=head3 Filtering with SWISH::Filter

The Swish-e distribution includes a Perl module called SWISH::Filter and individual
filters located in the F<filters> directory. This system uses plug-in filters to
extend the types of documents that Swish-e can index. The plug-in filters do not
actually do the filtering, but rather provide a standard interface for accessing programs that
can filter or convert documents. The programs that do the filtering are not part of
the Swish-e distribution; they must be downloaded and installed separately.

The advantage of this method is that new filtering methods can be installed easily.

This system is designed to work with the -S http and -S prog methods, but may also be used
with the C<FileFilter> feature and the -S fs indexing method. See
F<$prefix/share/doc/swish-e/examples/filter-bin/swish_filter.pl> for
an example.

See the F<filters/README> file for more information.

=head3 Filtering with the FileFilter feature

A filter is an external program that Swish-e executes while processing
a document of a given type. Swish-e will execute the filter program
for each file that matches the file suffix (extension) set in a
B<FileFilter> directive or a pattern set in a B<FileFilterMatch> directive.
B<FileFilterMatch> matches using regular expressions and is described below.

Filters may be used with any type of input method (i.e. -S fs, -S http, or -S prog).

By B<default>, Swish-e calls the external program with the following arguments:

=over 4

=item $0

the name of the filter program

=item $1

the physical path name of the file to read. This may be a temporary
file location if indexing by the http method.

=item $2

When indexing under the file system this will be the same as $1 (the
path to the source file), but when indexing under the http method this
will be the URL of the source document.

=back

Swish-e can also pass other parameters to the filter program. These
parameters can be defined using the B<FileFilter> or B<FileFilterMatch>
directives. See Filter Options below.

The filter program must open the file, process its contents, and return
it to Swish-e by printing to STDOUT.
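The contract described above (read the file named by the first argument, print the converted text to STDOUT) can be sketched as a small filter that uncompresses gzip files, echoing the gzip example from the start of this section. This is an illustration, not one of the filters shipped in F<filter-bin>; the function name C<filter_file> is invented, and the script ignores the second argument.

```python
import gzip
import sys


def filter_file(work_path, out):
    """Uncompress the gzip file at work_path and write the result to out."""
    with gzip.open(work_path, "rb") as src:
        # Copy in chunks so that large documents are not held in memory.
        for chunk in iter(lambda: src.read(65536), b""):
            out.write(chunk)


if __name__ == "__main__" and len(sys.argv) >= 2:
    # sys.argv[1] is the work file to read ($1 above).
    # sys.argv[2], when present, is the document path or URL ($2); unused here.
    filter_file(sys.argv[1], sys.stdout.buffer)
```

Because an interpreter is started for every matching file, a script like this carries a per-document startup cost.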

Note that running a filter for every file can add a significant amount of time to the indexing
process if your external program is a perl or shell script. If you
have many files to filter you should consider writing your filter in C
instead of a shell or perl script, or using the "prog" Access Method.

=over 4

=item FilterDir *path-to-directory*

This is the path to a directory where the filter programs are stored.
Swish-e looks in this directory to find the filter specified in the
B<FileFilter> directive. If this directive is omitted, you have to
specify the full path to the filter script on each FileFilter directive.

This feature does B<not> apply to the C<FileFilterMatch> directive.

Example:

    FilterDir /usr/local/swish/filters

=item FileFilter *suffix* "filter-prog" ["filter-options"]

This maps a file suffix (extension) to a filter program. If I<filter-prog>
starts with a directory delimiter (absolute path), Swish-e doesn't use
the FilterDir settings, but uses the given I<filter-prog> path directly.

Filter options:

Filter options are a string passed as arguments to the I<filter-prog>.
Filter options can contain variables, which Swish-e replaces. If you omit
I<filter-options>, Swish-e will pass the default arguments
listed above.

Default: "'%p' '%P'"

Which means: pass the "work file path" and the "document file path" to the filter (each quoted).

Variables in filter options:

    %% = %
    %P = Full document pathname (e.g. URL, or path on filesystem)
    %p = Full pathname to work file (maybe a tmpfile or the real document path on filesystem)
    %F = Filename stripped from full document pathname
    %f = Filename stripped from "work" pathname
    %D = Directoryname stripped from full document pathname
    %d = Directoryname stripped from full "work" pathname

Examples of strings passed:

    %P = document pathname: http://myserver/path1/mydoc.txt
    %p = work pathname: /tmp/tmp.1234.mydoc.txt
    %F = mydoc.txt
    %f = tmp.1234.mydoc.txt
    %D = http://myserver/path1
    %d = /tmp
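The substitution rules can be checked against the example strings above with a short sketch. This only mimics the documented variables and is not Swish-e's implementation; the function name C<expand> is invented, and leaving an unknown C<%x> sequence untouched is an assumption.

```python
import re


def expand(options, work_path, doc_path):
    """Replace the documented %-variables in a filter-options string."""
    doc_dir, _, doc_file = doc_path.rpartition("/")
    work_dir, _, work_file = work_path.rpartition("/")
    values = {
        "%": "%",
        "P": doc_path, "p": work_path,
        "F": doc_file, "f": work_file,
        "D": doc_dir, "d": work_dir,
    }
    # Assumption: sequences that are not documented variables are left as-is.
    return re.sub(r"%(.)", lambda m: values.get(m.group(1), m.group(0)), options)
```

With the work path F</tmp/tmp.1234.mydoc.txt> and the document path C<http://myserver/path1/mydoc.txt>, the default options C<'%p' '%P'> expand to the two quoted paths, and C<%F>, C<%f>, C<%D> and C<%d> expand to the values listed in the examples.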

Important hint for security:

When using variable substitution, use quotes to ensure filename integrity.

    e.g. "'%f'" --> 'file name with spaces.doc'

If you omit the quotes, your system security may be compromised, or
filtering may not work for these files.
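The effect of the quotes can be seen by splitting a command line the way a POSIX shell would. The sketch below uses Python's C<shlex> module, which follows POSIX shell word-splitting rules. It assumes the filter command line is interpreted by such a shell, and the C<catdoc> command is only a placeholder borrowed from the examples in this section.

```python
import shlex

filename = "file name with spaces.doc"

# Without quotes the name falls apart into four separate arguments.
unquoted = shlex.split("catdoc %s" % filename)

# With single quotes the filter receives the name as one argument.
quoted = shlex.split("catdoc '%s'" % filename)

# A name that itself contains a single quote defeats simple quoting;
# shlex.quote shows what a fully escaped form looks like.
tricky = shlex.split("catdoc " + shlex.quote("it's here.doc"))
```

Here C<unquoted> has five elements, while C<quoted> and C<tricky> each have two: the command and one intact file name.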

B<Notes when using MS Windows>

Windows uses double quotes to escape shell metacharacters, so reverse
the quotes in the examples above. e.g.:

    '"%f"' --> "file name with spaces.doc"

You can specify the filter program using forward slashes (unix style).
Swish-e will convert the slashes to backslashes before running your program.

    FileFilter .mydoc c:/some/path/mydocfilter.exe '-d "%d" -example -url "%P" "%f"'

Examples of filters:

    FileFilter .doc /usr/local/bin/catdoc "-s8859-1 -d8859-1 '%p'"
    FileFilter .pdf pdftotext "'%p' -"
    FileFilter .html.gz gzip "-c '%p'"
    FileFilter .mydoc "/some/path/mydocfilter" "-d '%d' -example -url '%P' '%f'"

The above examples are running a I<binary> filter program. For more
complicated filtering needs you may use a scripting language such as
Perl or a shell script. Here are some examples of calling a shell and
perl script:

    FileFilter .pdf pdf2html.sh
    FileFilter .ps ghostscript-filter.pl

Using a scripting language (or any language that has a large startup
cost) can B<greatly increase the indexing time>. For small indexing
jobs, this may not be an issue, but for large collections of files that
require processing by a scripting language, you may be better off using
the C<-S prog> access method where the script will only be compiled once,
instead of for each document.

Filters are probably easier to write than a C<-S prog> program. Which you
decide to use depends on your requirements. Examples of filter scripts
can be found in the F<filter-bin> directory, and examples of C<-S prog>
programs can be found in the F<prog-bin> directory.

=item FileFilterMatch *filter-prog* *filter-options* *regex* [*regex* ...]

This is similar to C<FileFilter> except that it uses regular expressions to
match against the file name. *filter-prog* is the path to the program.
Unlike C<FileFilter> this does B<not> use the C<FilterDir> option.
Also unlike C<FileFilter> you B<must> specify the *filter-options*.

Examples:

    FileFilterMatch ./pdftotext "'%p' -" /\.pdf$/

Note that this will also match a file called ".pdf", so you may want to use
something that requires a filename that has more than just an extension.
For example:

    FileFilterMatch ./pdftotext "'%p' -" /.\.pdf$/

To specify more than one extension:

    FileFilterMatch ./check_title.pl "%p" /\.html$/ /\.htm$/

Or a few ways to do the same thing:

    FileFilterMatch ./check_title.pl %p /\.(html|htm)$/
    FileFilterMatch ./check_title.pl %p /\.html?$/

And to ignore case:

    FileFilterMatch ./check_title.pl %p /\.html?$/i

You may also precede an expression with "not" to negate the regular expressions
that follow. For example, to match files that do not have an extension:

    FileFilterMatch ./convert "%p %P" not /\..+$/

=back
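The C<FileFilterMatch> patterns shown above are simple enough to try out before indexing. The sketch below uses Python's C<re> module as a stand-in, which is an assumption: Swish-e uses its own regular expression library, so only basic patterns like these should be expected to behave the same. The helper name C<matches> and its C<ignore_case> and C<negate> parameters are invented to mirror the trailing C<i> and the leading "not".

```python
import re


def matches(pattern, filename, ignore_case=False, negate=False):
    """Report whether filename is selected by pattern.

    ignore_case mirrors a trailing "i" on the expression,
    negate mirrors a leading "not" in the directive.
    """
    flags = re.IGNORECASE if ignore_case else 0
    found = re.search(pattern, filename, flags) is not None
    return found != negate
```

For example, C<matches(r"\.pdf$", ".pdf")> is true while C<matches(r".\.pdf$", ".pdf")> is false, which is the difference between the first two examples above.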

=head1 Document Info

$Id: SWISH-CONFIG.pod,v 1.74.2.1 2003/12/17 23:59:03 whmoseley Exp $
