3
SWISH-CONFIG - Configuration File Directives
5
=head1 Swish-e CONFIGURATION FILE
7
What files Swish-e indexes, how they are indexed, and where the index
is written can all be controlled by a configuration file.

The configuration file is a text file composed of comments, blank
lines, and B<configuration directives>. The order of the directives
is not important. Some directives may be used more than once in the
configuration file, while others can only be used once (i.e. additional
directives will overwrite preceding directives). Case of the directive
is not important -- you may use upper, lower, or mixed case.

A comment is any line that begins with a "#".
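
As a minimal sketch of these rules (the paths are illustrative, reusing
ones that appear elsewhere in this document), a configuration file might
mix comments, blank lines, and directives written in any case:

    # Where the documents live
    IndexDir /usr/local/home/http

    # Directive names are not case sensitive
    indexfile /usr/local/swish/site.index
    INDEXREPORT 2
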
21
Directives may take more than one parameter. Enclose single parameters
22
that include whitespace in quotes (single or double). Inside of quotes
23
the backslash escapes the next character.
25
ReplaceRules append "foo bar" <- define "foo bar" as a single parameter
27
If you need to include a quote character in the value either use a
28
backslash to escape it, or enclose it in quotes of the other type.
30
For example, under unix you can use quotes to include white space in a
31
single parameter. Here, to protect against path names (%p) that might
32
have white space embedded use single quotes (this also protects against
33
shell expansion or metacharacters):
35
    FileFilter .foo foofilter "'%p'"    <- parameter passed through the shell in single quotes
    FileFilter .foo foofilter '"%p"'    <- windows uses double-quotes
    FileFilter .foo foofilter '\'%p\''  <- silly example
40
Backslashes also have special meaning in regular expressions.
42
FileFilterMatch pdftotext "'%p' -" /\.pdf$/
44
This says that the dot is a real dot (instead of matching any character).
If you place the regular expression in quotes then you must use
a double backslash:

    FileFilterMatch pdftotext "'%p' -" "/\\.pdf$/"
50
Swish-e will convert the double backslash into a single backslash before
51
passing the parameter to the regular expression compiler.
53
Commented example configuration files are included in the F<conf>
54
directory of the Swish-e distribution.
56
Some command line arguments can override directives specified in the
configuration file. Please see also the L<SWISH-RUN|SWISH-RUN> page for
instructions on running Swish-e, and the L<SWISH-SEARCH|SWISH-SEARCH>
page for information and examples on how to search your index.
61
The configuration file is specified to Swish-e by the C<-c> switch.
64
swish-e -c myconfig.conf
66
You may also split your directives up into different configuration files.
This allows you to have a master configuration file used for many
different indexes, and smaller configuration files for each separate
index. You can specify the different configuration files when running
from the command line with the C<-c> switch (see L<SWISH-RUN|SWISH-RUN>),
or you may include other configuration files with the B<IncludeConfigFile>
directive.

Typically, in a configuration file the directives are grouped together in
some logical order -- that is, directives that control the source of the
documents would be grouped together first, and directives that control
how each document is filtered or how its words are indexed would be in
another group. (The directives listed below are grouped in this order.)
80
The configuration file directives are listed below in these groups:
86
L<Administrative Headers Directives|/"Administrative Headers Directives">
87
-- You may add administrative information to the header of the index file.
91
L<Document Source Directives|/"Document Source Directives"> -- Directives
92
for selecting the source documents and the location of the index file.
96
L<Document Contents Directives|/"Document Contents Directives"> --
97
Directives that control how a document content is indexed.
101
L<Directives for the File Access method only|/"Directives for the File
102
Access method only"> -- These directives are only applicable to the File
103
Access indexing method.
107
L<Directives for the HTTP Access Method Only|/"Directives for the HTTP
Access Method Only"> -- Likewise, these only apply to the HTTP Access
method.
113
L<Directives for the prog Access Method Only|/"Directives for the prog
114
Access Method Only"> -- These only apply to the prog Access method.
118
L<Document Filter Directives|/"Document Filter Directives"> -- This is
119
a special section that describes using document filters with Swish-e.
123
=head2 Alphabetical Listing of Directives
129
L<AbsoluteLinks|/"item_AbsoluteLinks"> [yes|NO]
133
L<BeginCharacters|/"item_BeginCharacters"> *string of characters*
137
L<BumpPositionCounterCharacters|/"item_BumpPositionCounterCharacters"> *string*
141
L<Buzzwords|/"item_Buzzwords"> [*list of buzzwords*|File: path]
146
L<ConvertHTMLEntities|/"item_ConvertHTMLEntities"> [YES|no]
150
L<DefaultContents|/"item_DefaultContents"> [TXT|HTML|XML|TXT2|HTML2|XML2|TXT*|HTML*|XML*]
154
L<Delay|/"item_Delay"> *seconds*
158
L<DontBumpPositionOnEndTags|/"item_DontBumpPositionOnEndTags"> *list of names*
162
L<DontBumpPositionOnStartTags|/"item_DontBumpPositionOnStartTags"> *list of names*
166
L<EnableAltSearchSyntax|/"item_EnableAltSearchSyntax"> [yes|NO]
170
L<EndCharacters|/"item_EndCharacters"> *string of characters*
174
L<EquivalentServer|/"item_EquivalentServer"> *server alias*
178
L<ExtractPath|/"item_ExtractPath"> *metaname* [replace|remove|prepend|append|regex]
182
L<FileFilter|/"item_FileFilter"> *suffix* *program* [options]
186
L<FileFilterMatch|/"item_FileFilterMatch"> *program* *options* *regex* [*regex* ...]
190
L<FileInfoCompression|/"item_FileInfoCompression"> [yes|NO]
194
L<FileMatch|/"item_FileMatch"> [contains|is|regex] *regular expression*
198
L<FileRules|/"item_FileRules"> [contains|is|regex] *regular expression*
202
L<FuzzyIndexingMode|/"item_FuzzyIndexingMode"> [NONE|Stemming|Soundex|Metaphone|DoubleMetaphone]
206
L<FollowSymLinks|/"item_FollowSymLinks"> [yes|NO]
210
L<HTMLLinksMetaName|/"item_HTMLLinksMetaName"> *metaname*
214
L<IgnoreFirstChar|/"item_IgnoreFirstChar"> *string of characters*
218
L<IgnoreLastChar|/"item_IgnoreLastChar"> *string of characters*
222
L<IgnoreLimit|/"item_IgnoreLimit"> *integer integer*
226
L<IgnoreMetaTags|/"item_IgnoreMetaTags"> *list of names*
230
L<IgnoreNumberChars|/"item_IgnoreNumberChars"> *list of characters*
234
L<IgnoreTotalWordCountWhenRanking|/"item_IgnoreTotalWordCountWhenRanking"> [YES|no]
238
L<IgnoreWords|/"item_IgnoreWords"> [*list of stop words*|File: path]
242
L<ImageLinksMetaName|/"item_ImageLinksMetaName"> *metaname*
246
L<IncludeConfigFile|/"item_IncludeConfigFile">
250
L<IndexAdmin|/"item_IndexAdmin"> *text*
254
L<IndexAltTagMetaName|/"item_IndexAltTagMetaName"> *tagname*|as-text
258
L<IndexComments|/"item_IndexComments"> [yes|NO]
262
L<IndexContents|/"item_IndexContents"> [TXT|HTML|XML|TXT2|HTML2|XML2|TXT*|HTML*|XML*] *file extensions*
266
L<IndexDescription|/"item_IndexDescription"> *text*
270
L<IndexDir|/"item_IndexDir"> [URL|directories or files]
274
L<IndexFile|/"item_IndexFile"> *path*
278
L<IndexName|/"item_IndexName"> *text*
282
L<IndexOnly|/"item_IndexOnly"> *list of file suffixes*
286
L<IndexPointer|/"item_IndexPointer"> *text*
290
L<IndexReport|/"item_IndexReport"> [0|1|2|3]
294
L<MaxDepth|/"item_MaxDepth"> *integer*
298
L<MaxWordLimit|/"item_MaxWordLimit"> *integer*
302
L<MetaNameAlias|/"item_MetaNameAlias"> *meta name* *list of aliases*
306
L<MetaNames|/"item_MetaNames"> *list of names*
310
L<MinWordLimit|/"item_MinWordLimit"> *integer*
314
L<NoContents|/"item_NoContents"> *list of file suffixes*
318
L<obeyRobotsNoIndex|/"item_obeyRobotsNoIndex"> [yes|NO]
322
L<ParserWarnLevel|/"item_ParserWarnLevel"> [0|1|2|3]
326
L<PreSortedIndex|/"item_PreSortedIndex"> *list of property names*
330
L<PropCompressionLevel|/"item_PropCompressionLevel"> [0-9]
334
L<PropertyNameAlias|/"item_PropertyNameAlias"> *property name* *list of aliases*
338
L<PropertyNames|/"item_PropertyNames"> *list of meta names*
342
L<PropertyNamesCompareCase|/"item_PropertyNamesCompareCase"> *list of meta names*
346
L<PropertyNamesIgnoreCase|/"item_PropertyNamesIgnoreCase"> *list of meta names*
350
L<PropertyNamesNoStripChars|/"item_PropertyNamesNoStripChars"> *list of meta names*
354
L<PropertyNamesDate|/"item_PropertyNamesDate"> *list of meta names*
358
L<PropertyNamesNumeric|/"item_PropertyNamesNumeric"> *list of meta names*
362
L<PropertyNamesMaxLength|/"item_PropertyNamesMaxLength"> integer *list of meta names*
366
L<PropertyNamesSortKeyLength|/"item_PropertyNamesSortKeyLength"> integer *list of meta names*
370
L<ReplaceRules|/"item_ReplaceRules"> [replace|remove|prepend|append|regex]
374
L<ResultExtFormatName|/"item_ResultExtFormatName"> name -x format string
378
L<SpiderDirectory|/"item_SpiderDirectory"> *path*
382
L<StoreDescription|/"item_StoreDescription"> [XML <tag>|HTML <meta>|TXT size]
386
L<SwishProgParameters|/"item_SwishProgParameters"> *list of parameters*
390
L<SwishSearchDefaultRule|/"item_SwishSearchDefaultRule"> [<AND-WORD>|<or-word>]
394
L<SwishSearchOperators|/"item_SwishSearchOperators"> <and-word> <or-word> <not-word>
398
L<TmpDir|/"item_TmpDir"> *path*
402
L<TranslateCharacters|/"item_TranslateCharacters"> [*string1 string2*|:ascii7:]
406
L<TruncateDocSize|/"item_TruncateDocSize"> *number of characters*
411
L<UndefinedMetaTags|/"item_UndefinedMetaTags"> [error|ignore|INDEX|auto]
415
L<UndefinedXMLAttributes|/"item_UndefinedXMLAttributes"> [DISABLE|error|ignore|index|auto]
419
L<UseStemming|/"item_UseStemming"> [yes|NO]
423
L<UseSoundex|/"item_UseSoundex"> [yes|NO]
427
L<UseWords|/"item_UseWords"> [*list of words*|File: path]
431
L<WordCharacters|/"item_WordCharacters"> *string of characters*
435
L<XMLClassAttributes|/"item_XMLClassAttributes"> *list of XML attribute names*
439
=head2 Directives that Control Swish
441
These configuration directives control the general behavior of Swish-e.
445
=item IncludeConfigFile *path to config file*
447
This directive can be used to include configuration directives located
in another file.
450
IncludeConfigFile /usr/local/swish/conf/site_config.config
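
As a sketch of the master/per-index arrangement described earlier, a
small per-index file could pull in the shared settings and then add its
own. The choice of which directives go in which file is only an
assumption for illustration:

    # Shared settings used by every index
    IncludeConfigFile /usr/local/swish/conf/site_config.config

    # Settings specific to this index
    IndexDir /usr/local/home/http
    IndexFile /usr/local/swish/site.index
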
452
=item IndexReport [0|1|2|3]
454
This is how detailed you want reporting while indexing. You can specify
455
numbers 0 to 3. 0 is totally silent, 3 is the most verbose. The default
458
This may be overridden from the command line via the C<-v> switch (see
459
L<SWISH-RUN|SWISH-RUN>).
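
For instance, assuming the F<myconfig.conf> file name used earlier, the
configuration file could ask for the most verbose report:

    IndexReport 3

and a single run could still be silenced from the command line:

    swish-e -c myconfig.conf -v 0
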
461
=item ParserWarnLevel [0|1|2|3]
463
Sets the error level when using the libxml2 parser for XML and HTML.
464
libxml2 will point out structural errors in your documents.
471
The exception to this is UTF-8 to Latin-1 conversion errors are reported at
472
level 1. This is because words may be indexed incorrectly in these cases.
474
Note that unlike other errors generated by Swish-e, these errors are
477
=item IndexFile *path*
479
C<IndexFile> specifies the location of the generated index file. If not
specified, Swish-e will create the file F<index.swish-e> in the current
directory.
483
IndexFile /usr/local/swish/site.index
485
=item obeyRobotsNoIndex [yes|NO]
487
When enabled, Swish-e will not index any HTML file that contains:
489
<meta name="robots" content="noindex">
491
The default is to ignore these meta tags and index the document.
492
This tag is described at http://www.robotstxt.org/wc/exclusion.html.
494
Note: This feature is only available with the libxml2 HTML parser.
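
A minimal sketch, assuming your HTML files use the C<.html> and C<.htm>
extensions and that Swish-e was built with libxml2:

    # Use the libxml2 HTML parser, which this feature requires
    IndexContents HTML2 .html .htm
    obeyRobotsNoIndex yes
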
496
Also, if you are using the libxml2 parser (HTML2 and XML2) then you can use the following
497
comments in your documents to prevent indexing:
499
<!-- SwishCommand noindex -->
500
<!-- SwishCommand index -->
502
and/or these may be used also:
507
For example, these are very helpful to prevent indexing of common headers, footers, and menus.
512
B<NOTE>: The following items are currently not available. These items
require Swish-e to parse the configuration file while searching.
518
=item EnableAltSearchSyntax [yes|NO]
520
B<NOTE>: The following item is currently not available.
522
Enable alternate search syntax. Allows the usage of a basic
523
"Altavista(c)", "Lycos(c)", etc. like search syntax. This means a search
524
query can contain "+" and "-" as syntax parameter.
528
swish-e -w "+word1 +word2 -word3 word4 word5"
529
"+" = following word has to be in all found documents
530
"-" = following word may not be in any document found
531
" " = following word will be searched in documents
533
=item SwishSearchOperators <and-word> <or-word> <not-word>
535
B<NOTE>: The following item is currently not available.
537
Using this config directive you can change the boolean search operators of
538
Swish-e, e.g. to adapt these to your language.
539
The default is: AND OR NOT
543
SwishSearchOperators UND ODER NICHT
545
=item SwishSearchDefaultRule [<AND-WORD>|<or-word>]
547
B<NOTE>: The following item is currently not available.
549
C<SwishSearchDefaultRule> defines the default Boolean operator to use
550
if none is specified between words or phrases. The default is C<AND>.
552
The word you specify must match one of the available
553
C<SwishSearchOperators>.
557
SwishSearchOperators UND ODER NICHT
558
# Make it act like a web search engine
559
SwishSearchDefaultRule ODER
561
=item ResultExtFormatName name -x format string
563
B<NOTE>: The following item is currently not available.
565
The output of Swish-e can be defined by specifying a format string with
566
the C<-x> command line argument. Using C<ResultExtFormatName> you can
567
assign a predefined format string to a name.
571
ResultExtFormatName moreinfo "%c|%r|%t|%p|<author>|<publishyear>\n"
573
Then when searching you can specify the format string's name
575
swish-e ... -x moreinfo ...
577
See the C<-x> switch in L<SWISH-RUN|SWISH-RUN> for more information
578
about output formats.
583
=head2 Administrative Headers Directives
585
Swish-e stores configuration information in the header of the index file.
586
This information can be retrieved while searching or by functions in
587
the Swish-e C library. There are a number of fields available for your
588
own use. None of these fields are required:
592
=item IndexName *text*
594
=item IndexDescription *text*
596
=item IndexPointer *text*
598
=item IndexAdmin *text*
600
These variables specify information that goes into index files to help
601
users and administrators. IndexName should be the name of your index,
602
like a book title. IndexDescription is a short description of the index
603
or a URL pointing to a more full description. IndexPointer should be
604
a pointer to the original information, most likely a URL. IndexAdmin
605
should be the name of the index maintainer and can include name and email
606
information. These values should not be more than 70 or so characters
607
and should be contained in quotes. Note that the automatically generated
608
date in index files is in D/M/Y and 24-hour format.
612
IndexName "Linux Documentation"
613
IndexDescription "This is an index of /usr/doc on our Linux machine."
614
IndexPointer http://localhost/swish/linux/index.html
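
The example above leaves out C<IndexAdmin>. A sketch of that field,
using a made-up maintainer name and address:

    IndexAdmin "Jane Doe, jane@example.com"
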
620
=head2 Document Source Directives
622
These directives control I<what> documents are indexed and I<how>
623
they are accessed. See also L<Directives for the File Access method
624
only|/"Directives for the File Access method only"> and L<Directives for
625
the HTTP Access Method Only|/"Directives for the HTTP Access Method Only">
626
for directives that are specific to those access methods.
631
=item IndexDir [directories or files|URL|external program]
633
IndexDir defines the source of the documents for Swish-e. Swish-e
currently supports three file access methods: B<File system>, B<HTTP>
(also called B<spidering>), and B<prog> for reading files from an
external program.
638
The C<-S> command line argument is used to select the file access method.
640
swish-e -c swish.config -S fs - file system
641
swish-e -c swish.config -S http - internal http spider
642
swish-e -c swish.config -S prog - external program of any type
644
For the B<fs> method of access B<IndexDir> is a space-separated
645
list of files and directories to index. Use a forward slash as the path
646
separator in MS Windows.
648
For the B<http> method the B<IndexDir> setting is a list of space-separated
URLs.
651
For the B<prog> method the B<IndexDir> setting is a list of space-separated
652
programs to run (which generate documents for swish to index).
654
You may specify more than one B<IndexDir> directive.
656
Any sub-directories of any listed directory will also be indexed.
658
Note: While I<processing> directories, Swish-e will ignore any files
659
or directories that begin with a dot ("."). You may index files
660
or directories that begin with a dot by specifying their name with
661
C<IndexDir> or C<-i>.
665
# Index this directory and any subdirectories
666
IndexDir /usr/local/home/http
668
# Index the docs directory in current directory
671
# Index these files in the current directory
672
IndexDir ./index.html ./page1.html ./page2.html
673
# and index this directory, too
674
IndexDir ../public_html
676
For the B<HTTP> method of access specify the URLs from which
you want the spidering to begin.
681
IndexDir http://www.my-site.com/index.html
682
IndexDir http://localhost/index.html
684
Obviously, using the B<HTTP> method to index is B<much> slower than
685
indexing local files. Be well aware that some sites do not appreciate
686
spidering and may block your IP address. You may wish to contact the
687
remote site before spidering their web site. More information about
688
spidering can be found in L<Directives for the HTTP Access Method
689
Only|/"Directives for the HTTP Access Method Only"> below.
691
For the L<prog|SWISH-RUN/"item_prog"> method of access B<IndexDir>
692
specifies the path to the program(s) to execute. The external program
693
must correctly format the documents being passed back to Swish-e.
694
Examples of external programs are provided in the F<prog-bin> directory.
696
IndexDir ./myprogram.pl
698
See L<prog|SWISH-RUN/"item_prog"> for details.
701
Note: Not all directives work with all methods.
703
=item NoContents *list of file suffixes*
705
Files with these suffixes will B<not> have their contents indexed,
706
but will have their path name (file name) indexed instead.
708
If the file's type is HTML or HTML2 (as set by C<IndexContents> or
709
C<DefaultContents>) then the file will be parsed for an HTML title and
710
that title will be indexed. Note that you must set the file's type with
711
C<IndexContents> or C<DefaultContents>:
712
C<.html> and C<.htm> are NOT type HTML by default. For example:
714
IndexContents HTML* .htm .html
716
If a title is found, it will still be checked for C<FileRules title>, and the file will be
717
skipped if a match is found. See C<FileRules>.
719
If the file's type is not HTML, or it is HTML and no title is found,
720
then the file's path will be indexed.
722
For example, this will allow searching by image file name.
724
NoContents .gif .xbm .au .mov .mpg .pdf .ps
726
Note: Using this directive will B<not> cause files with those suffixes
727
to be indexed. That is, if you use C<IndexOnly> to limit the types of
728
files that are indexed, then you must specify in C<IndexOnly> the same
729
suffixes listed in C<NoContents>.
731
This does B<not> work when C<IndexOnly> is also used and does not list these suffixes:
735
NoContents .gif .xbm .au .mov .mpg .pdf .ps
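
By contrast, a combination along these lines should index the listed
files by path name. This is a sketch; the C<.htm> and C<.html> suffixes
are only illustrative:

    # Every suffix in NoContents is also listed in IndexOnly
    IndexOnly .htm .html .gif .xbm .au .mov .mpg .pdf .ps
    NoContents .gif .xbm .au .mov .mpg .pdf .ps
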
737
A C<-S prog> program may set the C<No-Contents:> header
738
to enable this feature for a specific document (although it would be
739
smarter for the C<-S prog> program to simply only send the pathname or
742
=item ReplaceRules [replace|remove|prepend|append|regex]
744
ReplaceRules allows you to make changes to file pathnames before
745
they're indexed. These changed file names or URLs will be returned in
748
For example, you may index your files locally (with the File system
749
indexing method), yet return a URL in search results. This directive can
750
be used to map the file names to their respective URLs on your web server.
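
A sketch of such a mapping, assuming the documents under
F</usr/local/home/http> are served from the root of C<http://localhost>
(both names are illustrative):

    IndexDir /usr/local/home/http
    ReplaceRules replace /usr/local/home/http http://localhost

With these rules a file such as F</usr/local/home/http/docs/index.html>
would be expected to appear in results as
C<http://localhost/docs/index.html>.
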
752
There are five operations you can specify: B<replace>, B<append>,
B<remove>, B<prepend>, and B<regex>. They will parse the pathname in the
order you've typed these commands.
756
This directive uses C library regex.h regular expressions.
758
replace "the string you want replaced" "what to change it to"
759
remove "a string to remove"
760
prepend "a string to add before the result"
761
append "a string to add after the result"
762
regex "/search string/replace string/options"
764
Remember, quotes are needed if an expression contains white space,
765
and backslashes have special meaning.
767
Regex is an Extended Regular Expression. The first character found is
768
the delimiter (but it's not smart enough to use matched chars such as [],
771
The B<replace> string may use substitution variables:
773
$0 the entire matched (sub)string
774
$1-$9 returns patterns captured in "(" ")" pairs
775
$` the string before the matched pattern
776
$' the string after the matched pattern
778
The B<options> change the behavior of expression:
780
i ignore the case when matching
781
g repeat the substitution for the entire pattern
785
ReplaceRules replace testdir/ anotherdir/
786
ReplaceRules replace [a-z_0-9]*_m.*\.html index.html
788
ReplaceRules remove testdir/
790
ReplaceRules prepend http://localhost/
791
ReplaceRules append .html
793
ReplaceRules regex !^/web/(.+)/!http://$1.domain.com/!
794
replaces a file path:
795
/web/search/foo/index.html
797
http://search.domain.com/foo/index.html
799
ReplaceRules regex #^#http://localhost/www#
800
ReplaceRules prepend http://localhost/www (same thing)
802
# Remove all extensions from C source files
803
ReplaceRules remove .c # ERROR! That "." is *any char*
804
ReplaceRules remove \.c # much better...
806
ReplaceRules remove "\\.c" # if in quotes you need double-backslash!
807
ReplaceRules remove "\.c" # ERROR! "\." -> "." and is *any char*
810
=item IndexContents [TXT|HTML|XML|TXT2|HTML2|XML2|TXT*|HTML*|XML*] *file extensions*
812
The C<IndexContents> directive assigns one of Swish-e's document parsers
813
to a document, based on its extension. Swish-e currently knows how
814
to parse TXT, HTML, and XML documents.
816
The XML2, HTML2, and TXT2 parsers are currently only available when
817
Swish-e is configured to use libxml2.
819
You may use XML*, HTML*, and TXT* to select the parser automatically.
820
If libxml2 is installed then it will be used to parse the content. Otherwise,
821
Swish-e's internal parsers will be used.
823
Documents that are not assigned a parser with C<IndexContents> will, by
824
default, use the HTML2 parser if libxml2 is installed, otherwise will
825
use Swish-e's internal HTML parser. The C<DefaultContents> directive may be
826
used to assign a parser to documents that do not match a file extension
827
defined with the C<IndexContents> directive.
831
IndexContents HTML* .htm .html .shtml
832
IndexContents TXT* .txt .log .text
833
IndexContents XML* .xml
835
HTML* is the default type for all files, unless otherwise specified
(and this default can be changed by the B<DefaultContents> directive).
837
Swish-e parses titles from HTML files, if available, and keeps track
838
of the context of the text for context searching (see C<-t> in
839
L<SWISH-RUN|SWISH-RUN>).
841
If using filters (with the C<FileFilter> directive)
842
to convert documents you should include those extensions,
843
too. For example, if using a filter to convert .pdf to .html, you need
844
to tell Swish-e that .pdf should be indexed by the internal HTML parser:
846
    FileFilter .pdf pdf2html
    IndexContents HTML .pdf
849
See also L<Document Filter Directives|/"Document Filter Directives">.
851
B<Note:> Some of this may be changed in the future to use content-types
852
instead of file extensions. See L<SWISH-3.0|SWISH-3.0>.
854
=item DefaultContents [TXT|HTML|XML|TXT2|HTML2|XML2|TXT*|HTML*|XML*]
856
This sets the default parser for documents that are not specified in
857
B<IndexContents>. If not specified the default is HTML.
859
The XML2, HTML2, and TXT2 parsers are currently only available when
860
Swish-e is configured to use libxml2.
862
You may use XML*, HTML*, and TXT* to select the parser automatically.
863
If libxml2 is installed then it will be used to parse the content. Otherwise,
864
Swish-e's internal parsers will be used.
871
The C<DefaultContents> directive I<should> be used when spidering,
872
as HTML files may be returned without a file extension (such as when
873
requesting a directory and the default index.html is returned).
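
For example, a spidering configuration might assign parsers by extension
where one exists and fall back to HTML for everything else. This is a
sketch; the extension list is an assumption:

    IndexContents TXT* .txt .log .text
    DefaultContents HTML*
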
876
=item FileInfoCompression [yes|NO]
878
** This directive is currently not supported **
880
Setting B<FileInfoCompression> to C<yes> will compress the index file to
881
save disk space. This may result in longer indexing times. The default
884
Also see the C<-e> switch in L<SWISH-RUN|SWISH-RUN> for saving RAM
890
=head2 Document Contents Directives
892
These directives control what information is extracted from your source
893
documents, and how that information is made available during searching.
897
=item ConvertHTMLEntities [YES|no]
899
ASCII I<entities> can be converted automatically while indexing documents
900
of type HTML (not for HTML2).
901
For performance reasons you may wish to set this to C<no>
902
if your documents do not contain HTML entities. The default is C<yes>.
904
If C<ConvertHTMLEntities> is set C<no> the entities will be indexed
907
B<NOTE:> Entities within XML files and files parsed with libxml2 (HTML2) are
908
converted regardless of this setting.
910
=item MetaNames *list of names*
912
META names are a way to define "fields" in your XML and HTML documents.
913
You can use the META names in your queries to limit the search to just
914
the words contained in that META name of your document. For example,
915
you might have a META tagged field in your documents called C<subjects>
916
and then you can search your documents for the word "foo" but only return
917
documents where "foo" is within the C<subjects> META tag.
919
swish-e -w subjects=foo
921
(See also the C<-t> switch in L<SWISH-RUN|SWISH-RUN> for information
922
about I<context> searching in HTML documents.)
924
The B<MetaNames> directive is a space separated list. For example:
926
MetaNames meta1 meta2 keywords subjects
928
You may also use L<UndefinedMetaTags|/"item_UndefinedMetaTags"> to specify
929
automatic extraction of meta names from your HTML and XML documents,
930
and also to ignore indexing content of meta tags.
932
META tags can have two formats in your B<HTML> source documents:
934
<META NAME="meta1" CONTENT="some content">
936
and (if using the HTML2/libxml2 parser)
942
But this second version is invalid HTML, and will generate a warning if
C<ParserWarnLevel> is set (libxml2 only).
945
And in B<XML> documents, use the format:
951
Then you can limit your search to just META B<meta1> like this:
953
swish-e -w 'meta1=(apples or oranges)'
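
As a sketch of the start/end tag form mentioned above, assuming C<meta1>
is listed in C<MetaNames>, the content could be marked up as:

    <meta1>
        some content
    </meta1>
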
955
You may nest the XML and the start/end tag versions:
966
Then you can search in both tag1 and tag2 with:
968
swish-e -w 'keywords=(query words)'
970
Swish-e indexes all text as some metaname. The default is
C<swishdefault>, so these two queries are the same:

    swish-e -w foo
    swish-e -w swishdefault=foo
976
When indexing HTML Swish-e indexes the HTML title as default text, so
when searching Swish-e will find matches in both the HTML body and the
HTML title. Swish-e also, by default, indexes the content of meta tags.
So a search for "foo" with no meta name will find "foo" in the body,
the title, or any meta tags.
984
Currently, there's no way to prevent Swish-e from indexing
985
the title contents along with the body contents, but see
986
L<UndefinedMetaTags|/"item_UndefinedMetaTags"> for how to control the
987
indexing of meta tags.
989
If you would like to search just the title text, you may list
C<swishtitle> in the C<MetaNames> directive.
993
This will index the title text separately under the built-in swish
994
internal meta name "swishtitle". You may then search like
996
swish-e -w foo -- search for "foo" in title, body (and undefined meta tags)
997
swish-e -w swishtitle=foo -- search for "foo" in title only
999
In addition to swishtitle, you can limit searches to documents' path with:
1001
MetaNames swishdocpath
1003
Then to search for "foo" but also limit searches to documents that include
1004
"manual" or "tutorial" in their path:
1006
    swish-e -w 'foo swishdocpath=(manual or tutorial)'
1008
See also L<ExtractPath|/"item_ExtractPath">.
1011
=item MetaNameAlias *meta name* *list of aliases*
1013
MetaNameAlias assigns aliases for a meta name. For example, if your
1014
documents contain meta tags "description", "summary", and "overview"
1015
that all give a summary of your documents you could do this:
1018
MetaNameAlias summary description overview
1020
Then all three tags will get indexed as meta tag "summary". You can
then search all the fields under the single meta name "summary".

The aliases work at search time, too, so a search on "description" or
"overview" will also be limited to the "summary" meta name.
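
Assuming the alias definition above, each of the following queries
would be limited to the "summary" meta name (the search word "foo" is
only a placeholder):

    swish-e -w summary=foo
    swish-e -w description=foo
    swish-e -w overview=foo
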
1031
=item MetaNamesRank integer *list of meta names*
1033
* Not implemented yet *
1035
You can assign a bias to metanames that will affect how ranking is
1036
calculated. The range of values is from -10 to +10, with zero being
1039
MetaNamesRank 4 subject
1040
MetaNamesRank 3 swishdefault
1041
MetaNamesRank 2 author publisher
1042
MetaNamesRank -5 wrongwords
1044
This feature is not implemented yet
1046
=item HTMLLinksMetaName *metaname*
1048
Allows indexing of HTML links. Normally, HTML links (href tags) are
1049
not indexed by Swish-e. This directive defines a metaname, and links
1050
will be indexed under this meta name.
1054
HTMLLinksMetaName links
1056
Now, to limit searches to files with a link to "home.html" do this:
1058
-w links='"home.html"'
1060
The double quotes force a phrase search.
1062
To make Swish-e index links as normal text, you may use:
1064
HTMLLinksMetaName swishdefault
1066
This feature is only available with the libxml2 HTML parser.
1068
=item ImageLinksMetaName *metaname*
1070
Allows indexing of image links under a metaname. Normally, image URLs
1075
    ImageLinksMetaName images
1077
Now, if you would like to find pages that include a nice image of a beach:
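
A query along these lines might be used, assuming image links are
indexed under a metaname called C<images> and that the image URL
contains the word "beach":

    swish-e -w images=beach
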
1081
To make Swish-e index links as normal text, you may use:
1083
ImageLinksMetaName swishdefault
1085
This feature is only available with the libxml2 HTML parser.
1088
=item IndexAltTagMetaName *tagname*|as-text
1090
Allows indexing of the ALT text of <IMG> tags. Specify either a tag name, which will be
used as a metaname, or the special text "as-text", which says to index the ALT text as
if it were plain text at the current location.
1094
For example, by specifying a tag name:
1096
IndexAltTagMetaName bar
1098
would make this markup:

    <img src="/someimage.png" alt="Alt text here">

be indexed as if it were:

    <bar>Alt text here</bar>
1110
Then the normal rules (C<MetaNames> and C<PropertyNames>) apply to how that text is indexed.
1112
If you use the special tag "as-text" then the ALT text of

    <img src="/someimage.png" alt="Alt text here">

is indexed as plain text at that position in the document.
1124
This feature is only available when using the libxml2 parser (HTML2 and XML2).
1127
=item AbsoluteLinks [yes|NO]
1129
If this is set true then Swish-e will attempt to convert relative URIs
1130
extracted from HTML documents for use with C<HTMLLinksMetaName> and
1131
C<ImageLinksMetaName> into absolute URIs. Swish-e will use any <BASE>
1132
tag found in the document, otherwise it will use the file's pathname.
1133
The pathname used will be the pathname *after* C<ReplaceRules> has been
1134
applied to the document's pathname.
1136
For example, say you wish to index image links under the metaname
"images":
1139
ImageLinksMetaName images
1141
If an image is located in http://localhost/vacations/france/index.html
1142
and C<AbsoluteLinks> is set to no, then an image within that document:
1144
<img src="beach.jpeg">
1146
will only index "beach.jpeg".
1148
But, if you want more detail when searching, you can enable
1149
C<AbsoluteLinks> and Swish-e will index
1150
"http://localhost/vacations/france/beach.jpeg". You can then look for
1151
images of beaches, but only in France:
1153
    -w 'images=(beach and france)'
1155
This also means you can search for any images within France:
1159
This feature is only available with the libxml2 HTML parser.
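
Putting the pieces above together, a configuration along these lines
should index absolute image URIs. This is a sketch only: the F<.html>
suffix is an assumption, and C<images> is the metaname used in the
example above.

    # The libxml2 HTML parser is needed for link indexing
    IndexContents HTML2 .html

    # Index <img src="..."> values under the metaname "images"
    ImageLinksMetaName images

    # Expand relative src values using <BASE> or the document's pathname
    AbsoluteLinks yes

Because the pathname is taken after C<ReplaceRules> has been applied, any
rule that rewrites a file system path into a URL also determines the
prefix of the indexed image URIs.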

=item UndefinedMetaTags [error|ignore|INDEX|auto]

This directive defines the behavior of Swish-e during indexing when a
meta name is found but is B<not> listed in B<MetaNames>. There are
four choices:

=over 4

=item error

If a meta name is found that is not listed in B<MetaNames>
then indexing will be halted and an error reported.

=item ignore

The contents of the meta tag are ignored and B<not> indexed
unless a metaname has been defined with the C<MetaNames> directive.

=item index

The contents of the meta tag are indexed, but placed in the
main index unless there's an enclosing metatag already in force. This
is the default.

=item auto

This method creates meta names automatically for HTML meta names
and XML elements. Using this is the same as specifying all the meta
names explicitly in a B<MetaNames> directive.

=back
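
For instance, to index only a known set of meta names and silently skip
everything else, something like the following could be used. The names
C<author> and C<keywords> are placeholders, not names that Swish-e
defines.

    # Only these meta names are indexed as metanames
    MetaNames author keywords

    # Text inside any other meta tag is not indexed
    UndefinedMetaTags ignore

While developing a configuration, C<UndefinedMetaTags error> may be a
useful alternative, since indexing then halts at the first meta name that
is not listed.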

=item UndefinedXMLAttributes [DISABLE|error|index|auto]

This is similar to C<UndefinedMetaTags>, but only applies to XML documents (parsed with libxml2).
This allows indexing of attribute content, and provides a way to index the content under a
metaname. For example, C<UndefinedXMLAttributes> can make

    <person age="23">
    </person>

look like the following to Swish-e:

    <person>
        <person.age>23</person.age>
    </person>

What happens to the text "23" will depend on the setting of C<UndefinedXMLAttributes>:

=over 4

=item disable

XML attributes are not parsed and not indexed. This is the default.

=item error

If the concatenated meta name (e.g. person.age) is not listed in
B<MetaNames> then indexing will be halted and an error reported.

=item ignore

The contents of the meta tag are ignored and B<not> indexed unless a
metaname has been defined with the C<MetaNames> directive.

=item index

The contents of the meta tag are indexed, but placed in the main index
unless there's an enclosing metatag already in force.

=item auto

This method will create meta tags from the combined element and attributes
(and XML class name). This option should be used with caution as it can
generate a lot of metaname entries.

=back

See also the example below under C<XMLClassAttributes>.

=item XMLClassAttributes *list of XML attribute names*

Combines an XML class name with the element name to make up a metaname.

    XMLClassAttributes class

    <person class="first">
    ...
    <person class="last">
    ...

Will appear to Swish-e as:

    <person.first>
    ...
    <person.last>
    ...

How the data is indexed depends on C<MetaNames> and C<UndefinedMetaTags>.

Here's an example using the following configuration which combines the
two directives C<XMLClassAttributes> and C<UndefinedXMLAttributes>.

    XMLClassAttributes class
    UndefinedMetaTags auto
    UndefinedXMLAttributes auto
    IndexContents XML2 .xml

The source XML file looks like:

    <person class="student" phone="555-1212" age="102">
    ...
    <person greeting="howdy">Bill</person>
    ...

    ./swish-e -c 2 -i 1.xml -T parsed_tags parsed_text -v 0
    Indexing Data Source: "File-System"
    ...
    <person.student> (MetaName)
    <person.student.phone> (MetaName)
    ...
    </person.student.phone>
    <person.student.age> (MetaName)
    ...
    </person.student.age>
    ...
    <person.greeting> (MetaName)
    ...

One thing to note is that the first <person> block finds a class name
"student" so all metanames that are created from attributes use the
combined name "person.student". The second <person> block doesn't contain
a "class", so the attribute name is combined directly with the element
name (e.g. "person.greeting").

=item ExtractPath *metaname* [replace|remove|prepend|append|regex]

This directive can be used to index extracted parts of a document's path.
A common use would be to limit searches to specific areas of your
file system or web tree.

The extracted string will be indexed under the specified meta name.

See C<ReplaceRules> for a description of the various pattern replacement
methods, but you will typically use the I<regex> method.

For example, say your file system (or web tree) was organized into departments:

    /web/sales/foo...
    /web/parts/foo...
    /web/accounting/foo...

And you wanted a way to limit searches to just documents under "sales".

    ExtractPath department regex !^/web/([^/]+)/.*$!$1!

Which says, extract out the department name (as substring $1) and index
it as meta name C<department>. Then to limit a search to the sales
department:

    swish-e -w foo AND department=sales

Note that the C<regex> method uses a substitution pattern, so to index
only a sub-string, match the I<entire> document path in the regular
expression, as shown above. Otherwise any part that is not matched will
end up in the result of the substitution.

See the C<ExtractPathDefault> option for a way to set a value if no
pattern matches.

Although unlikely, you may use more than one C<ExtractPath> directive.
More than one directive of the I<same> meta name will operate successively
(in the order listed in the configuration file) on the path. This allows
you to use regular expressions on the results of the previous pattern
substitution (as if piping the output from one expression to the next
pattern).

    ExtractPath foo regex !^(...).+$!$1!
    ExtractPath foo regex !^.+(.)$!$1!

So, the third letter is indexed as meta name "foo" if both patterns match.

    ExtractPath foo regex !^X(...).+$!$1!
    ExtractPath foo regex !^.+(.)$!$1!

Now (note the "X"), if the first pattern doesn't match, the last character
of the path name is indexed. You must be clear on this behavior if you
are using more than one C<ExtractPath> directive with the same metaname.

The document path operated on is the real path Swish-e used to access
the document. That is, the C<ReplaceRules> directive has no effect on
the path used with C<ExtractPath>.

The full path is used for each meta name if more than one C<ExtractPath>
directive is used. That is, changes to the path used in C<ExtractPath foo>
do not affect the path used by C<ExtractPath bar>.

=item ExtractPathDefault *metaname* default_value

This can be used with C<ExtractPath> to set a default string to index
under the given metaname if none of the C<ExtractPath> patterns match.

For example, say you want to index each document with a metaname
"department" based on the following path examples:

    /web/sales/foo...
    /web/parts/foo...
    /web/accounting/foo...

But you are also indexing documents that do not follow that pattern and you want to search those
as well.

    ExtractPath department regex !^/web/([^/]+)/.*$!$1!
    ExtractPathDefault department other

Now, you may search like this:

    -w foo department=(sales)       - limit searches to the sales documents
    -w foo department=(parts)       - limit searches to the parts documents
    -w foo department=(accounting)  - limit searches to the accounting documents
    -w foo department=(other)       - everything but sales, parts, and accounting.

This basically is a shortcut for:

    -w foo not department=(sales or parts or accounting)

but you don't need to keep track of what was extracted.

=item PropertyNames *list of meta names*

=item PropertyNamesCompareCase *list of meta names*

=item PropertyNamesIgnoreCase *list of meta names*

Swish-e allows you to specify certain META tags that can be used as
B<document properties>. The contents of any META tag that has been
identified as a document property can be returned as part of the search
results along with the rank, file name, title, and document size (see
the C<-p> and C<-x> switches in L<SWISH-RUN|SWISH-RUN>).

Properties are useful for returning additional data from documents in
search results -- this saves the effort of reading and parsing the source
files while reading Swish-e search results, and is especially useful
when the source documents are no longer available or slow to access.

Another feature of properties is that Swish-e can use the PropertyNames
for sorting the search results (see the C<-s> switch).

    PropertyNames author subjects

Two variations are available: C<PropertyNamesCompareCase> and
C<PropertyNamesIgnoreCase>. These tell Swish-e to either compare or
ignore case when sorting results. The default for C<PropertyNames>
is to ignore the case.

    PropertyNamesIgnoreCase subject
    PropertyNamesCompareCase keyword

The defaults for "internal" properties are:

    swishtitle       -- ignore the case
    swishdocpath     -- compare case
    swishdescription -- compare case

These can be overridden with C<PropertyNamesCompareCase> and
C<PropertyNamesIgnoreCase>.

    PropertyNamesCompareCase swishtitle

Use of PropertyNames will increase the size of your index files,
sometimes significantly. Properties will be compressed if Swish-e is
compiled with zlib as described in the L<INSTALL|INSTALL> manual page.

If Swish-e finds more than one property of the same name in a document
the property's contents will be concatenated for strings, and a warning
issued for numeric (or date) properties.
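
As a brief illustration using the C<author> and C<subjects> properties
from the example above, the stored values can be requested or used for
sorting at search time. The query word C<foo> is a placeholder; see
L<SWISH-RUN|SWISH-RUN> for the exact output format of each switch.

    # Return the author and subjects properties with each result
    swish-e -w foo -p author subjects

    # Sort results by the author property, ascending
    swish-e -w foo -s author asc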

=item PropertyNamesNoStripChars

PropertyNamesNoStripChars specifies that the listed properties should not
have strings of low ASCII characters replaced with a space character.
Properties will be stored as found in the document.

When printing properties with the swish-e binary, newlines are replaced with
a space character. Use the swish-e library (or SWISH::API perl module) to
fetch properties without newlines replaced.

=item PropertyNamesNumeric

This directive is similar to C<PropertyNames>, but it flags the property
as being a string of digits (integer value) that will be stored as binary data instead
of a string. This allows the C<-s> and C<-L> switches to sort and limit
on the property numerically rather than as text.

Swish-e uses C<strtoul(3)> to convert the string into an unsigned long
integer. Therefore, only non-negative integers can be stored.

Future versions of Swish-e may be able to store different property types
(such as negative integers and real numbers). This directive may change
in future releases of Swish-e.
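
As a sketch, assume documents carry a meta tag named C<price> (a
hypothetical name) holding a non-negative integer:

    # Store "price" as an unsigned integer rather than a string
    PropertyNamesNumeric price

With the value stored numerically, sorting compares numbers rather than
strings, so 9 sorts before 10. Searches might then look like the
following, where C<foo> and the range 10 to 100 are placeholders (see
L<SWISH-RUN|SWISH-RUN> for the details of C<-s> and C<-L>):

    swish-e -w foo -s price asc
    swish-e -w foo -L price 10 100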

=item PropertyNamesDate

This directive is exactly like C<PropertyNamesNumeric>, but it also
flags the number as a machine timestamp (seconds since Epoch), and
will print a formatted date when returning this property. See C<-x>
in L<SWISH-RUN|SWISH-RUN>.

Swish-e will not parse dates when indexing; you must use a timestamp.
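
For example, assuming a hypothetical meta name C<published>, the
configuration only needs to name the property:

    PropertyNamesDate published

The source document must then already express the date as seconds since
the Epoch. In the HTML fragment below, 1041379200 corresponds to
2003-01-01 00:00:00 UTC:

    <meta name="published" content="1041379200">

A value such as "2003-01-01" would not be usable here, since Swish-e does
not parse date strings.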

=item PropertyNameAlias *property name* *list of aliases*

This allows aliases for a property name. For example, if you are indexing
HTML files, plus XML files that are written in English, German, and
Spanish and thus use the tags "title", "titel", and "título" you can use:

    PropertyNameAlias swishtitle title titel título titulo

Note that "swishtitle" is the built-in property used to store the title of
a document, and therefore you do not need to specify it as a PropertyName

=item PropertyNamesMaxLength integer *list of meta names*

This option will set the max length of the text stored in a property.
You must specify a number between 0 and the max integer size on your
platform, and a list of properties. The properties specified must not

If any of the property names do not exist they will be created (e.g. you
do not need to define the property with PropertyNames first).

In general, this feature will only be useful when parsing HTML or XML
with the libxml2 parser.

    PropertyNamesMaxLength 1000 swishdescription
    PropertyNameAlias swishdescription body

is similar to:

    StoreDescription HTML <body> 1000
    StoreDescription XML <body> 1000
    StoreDescription HTML2 <body> 1000
    StoreDescription XML2 <body> 1000

but StoreDescription allows setting the tag for each parser type.

    PropertyNamesMaxLength 1000 headings
    PropertyNameAlias headings h1 h2 h3 h4

collects all the heading text into a single property called "headings", not
to exceed 1000 characters.

=item PropertyNamesSortKeyLength integer *list of meta names*

Sets the length of the string used when sorting.
The default is 100 characters. The C<-T metanames> debugging option will
list the current values for an index.

This setting is used when sorting during indexing, and perhaps when sorting
while searching. It also affects the order when limiting to a range of values.

=item PreSortedIndex *list of property names*

By default Swish-e generates presorted tables while indexing for each
property name. This allows faster sorting when generating results.
On large document collections this presorting may add to the indexing
time, and also adds to the total size of the index. This directive can
be used to customize exactly which properties will be presorted.

If C<PreSortedIndex> is I<not> present in the config file (default
action), all the properties will be presorted at indexing time. If it
is present without any parameter, no properties will be presorted.
Otherwise, only the property names specified will be presorted.

For example, if you only wish to sort results by a property called
"title":

    PropertyNames title age time
    PreSortedIndex title

=item StoreDescription [XML <tag> size|HTML <meta> size|TXT size]

B<StoreDescription> allows you to store a document description in the
index file. This description can be returned in your search results
when the C<-x> switch is used to include the I<swishdescription> for
extended results, or by using C<-p swishdescription>.

The document type (XML, HTML and TXT) must match the document type currently being indexed
as set by C<IndexContents> or C<DefaultContents>. See those directives for possible values.
A common problem is using C<StoreDescription> yet not setting the document's type with
C<IndexContents> or C<DefaultContents>. Another problem is a mismatch between the types:

    IndexContents HTML2 .html
    StoreDescription HTML <body>

Then .html documents are assigned a type of HTML2 (and parsed by the libxml2 parser), but the
description will not be stored since C<StoreDescription> names type HTML instead of HTML2.

For text documents you specify the type TXT (or TXT2 or TXT*) and the number of I<characters> to capture.

    StoreDescription TXT 20

The above stores only the first twenty characters from the text file in the Swish-e index.

For HTML and XML file types, specify the tag to use for the
description, and optionally the number of characters to capture. If the
size is not specified, Swish-e will capture the entire contents of the tag.

    StoreDescription HTML <body> 20000
    StoreDescription XML <desc> 40

Again, note that documents must be assigned a document type with C<IndexContents>
or C<DefaultContents> to use this feature.

Swish-e will compress the descriptions (or any other large property)
if compiled to use zlib (see L<INSTALL|INSTALL>). This is recommended when using
StoreDescription and a large number of documents. Compression of 30% to 50% is
not uncommon with HTML files.
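
A sketch of a configuration that avoids the type mismatch described
above follows. The F<.html> and F<.txt> suffixes and the size of 200
characters are assumptions made for the example.

    # .html files are parsed by the libxml2 parser, so the type is HTML2
    IndexContents HTML2 .html
    StoreDescription HTML2 <body> 200

    # Plain text files: store the first 200 characters
    IndexContents TXT .txt
    StoreDescription TXT 200

The stored text can then be requested with C<-p swishdescription> when
searching. The point of the sketch is that each C<StoreDescription> line
names exactly the type assigned by the matching C<IndexContents> line.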

=item PropCompressionLevel [0-9]

This directive sets the compression level used when storing properties
to disk. A setting of zero is no compression, and a setting of nine is
the most compression.

The default depends on the default setting compiled with zlib, but is
typically six.

This option is useful when using C<StoreDescription> to store a large
amount of text in properties (or if using C<PropertyNames> with large
properties).

Properties must be over a value defined in F<config.h> (100 is the
default) before compression will be attempted. Swish-e will never store
the results of the compression if the compressed data is larger than
the original data.

This option is only available when Swish-e is compiled with zlib support.
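
For example, an index that stores long descriptions might trade indexing
time for a smaller property file. The values below are illustrative only,
and the fragment assumes .html files have been assigned the HTML2 type:

    # Keep up to 20000 characters of body text per document
    StoreDescription HTML2 <body> 20000

    # Favor the smallest stored size over indexing speed
    PropCompressionLevel 9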

=item TruncateDocSize *number of characters*

TruncateDocSize limits the size of a document while indexing documents
and/or using filters. This config directive truncates the number of
bytes read from a document to the specified size. This means: if a document
is larger, only the specified number of bytes is read.

    TruncateDocSize 10000000

The default is zero, which means read all data.

Warning: If you use TruncateDocSize, use it with care! TruncateDocSize
is a safety belt only, to limit e.g. filter output, when accessing
databases, or to limit "runaway" filters. Truncating doc input may
destroy document structures for Swish-e (e.g. Swish-e may miss closing
tags for XML or HTML documents).

TruncateDocSize does not currently work with the C<prog> input source.

=item FuzzyIndexingMode NONE|Stemming|Soundex|Metaphone|DoubleMetaphone

Selects the type of index to create. Only one type of index may be created.

It's a good idea to create both a normal index and a fuzzy index and
allow your search interface to select which index to use. Many people find the
fuzzy searches to be too fuzzy.

The available fuzzy indexing options can be displayed by running

    swish-e -T LIST_FUZZY_MODES

Available options include:

=over 4

=item None

Words are stored in the index without any conversion. This is the default.

=item Stemming_*

This option uses one of the installed Snowball stemmers (http://snowball.tartarus.org/).

The installed stemmers can be viewed by running

    swish-e -T LIST_FUZZY_MODES

For example, to use the Spanish stemming module:

    FuzzyIndexingMode Stemming_es

=item Stem or Stemming_en

Selects the legacy Swish-e English stemmer.

This is deprecated in favor of the Snowball English stemmers (Stemming_en1, Stemming_en2).
Future versions of Swish-e will likely use the Stemming_en2 stemmer by default.

Words are converted using the Porter stemming algorithm.

From: http://www.tartarus.org/~martin/PorterStemmer/

    The Porter stemming algorithm (or "Porter stemmer") is a
    process for removing the commoner morphological and inflexional
    endings from words in English. Its main use is as part of a
    term normalisation process that is usually done when setting up
    Information Retrieval systems.

This will help a search for "running" to also find "run" and "runs", for example.

The stemming function does not convert words to their root; rather, it
programmatically removes endings on words in an attempt to make similar
words with different endings stem to the same string of characters.
It's not a perfect system, and searches on stemmed indexes often return
curious results. For example, two entirely different words may stem to
the same string.

Stemming also can be confusing when used with a wildcard (truncation).
For example, you might expect to find the word "running" by searching for
"runn*". But this fails when using a stemmed index, as "running" stems to
"run", yet searching for "runn*" looks for words that start with "runn".

=item Soundex

Soundex was developed in the 1880s so records for people with similar
sounding names could be found more readily. Soundex is a coded surname
based on the way a surname sounds rather than spelling. Surnames that
sound similar, like Smith and Smyth, are filed together under the same
Soundex code. This is mostly useful for US English.

Soundex should not be used to search for sound-alike words. Metaphone
would be more appropriate for generic sound matching of words. Soundex
should only be used where you need to search multiple documents for
proper names which sound similar. This is primarily used for indexing
genealogical records. This may be useful for indexing other collections
of data consisting mostly of names. Many common name variations are
matched by Soundex. The only notable exception is the first letter of
the name. The first letter is not matched for sound.

=item Metaphone and DoubleMetaphone

Words are transformed into a short series of letters representing the sound of the word (in English).
Metaphone algorithms are often used for looking up misspelled words in dictionary programs.

From: http://aspell.sourceforge.net/metaphone/

    Lawrence Philips' Metaphone Algorithm is an algorithm which returns
    the rough approximation of how an English word sounds.

The C<DoubleMetaphone> mode will sometimes generate two different metaphones for the same word.
This is supposed to be useful when a word may be pronounced more than one way.

A metaphone index should give results somewhere in between Soundex and Stemming.

=back
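
One way to follow the advice above about keeping both a normal and a
fuzzy index is to use two configuration files. This is a sketch only:
the file names and the F</web> directory are hypothetical, and
C<IndexDir> and C<IndexFile> are assumed to be set as described
elsewhere in this document.

    # normal.conf
    IndexDir  /web
    IndexFile ./index.normal
    FuzzyIndexingMode NONE

    # stemmed.conf
    IndexDir  /web
    IndexFile ./index.stemmed
    FuzzyIndexingMode Stemming_en2

Each configuration is indexed separately, and the search interface then
chooses an index with C<-f>:

    swish-e -c normal.conf
    swish-e -c stemmed.conf
    swish-e -f ./index.stemmed -w running

Keeping the two indexes separate follows from the rule that only one
type of index may be created per run.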

=item UseStemming [yes|NO]

Put yes to apply the word stemming algorithm during indexing, else no.

When UseStemming is set to C<yes> every word is stemmed before placing
it in to the index.

This option is deprecated. It has been superseded by C<FuzzyIndexingMode>.

=item UseSoundex [yes|NO]

When UseSoundex is set to C<yes> every word is converted to a Soundex
code before placing it in to the index.

This option is deprecated. It has been superseded by C<FuzzyIndexingMode>.

=item IgnoreTotalWordCountWhenRanking [YES|no]

Put yes to ignore the total number of words in the file when calculating
ranking. Often better with merges and small files. Default is yes.

    IgnoreTotalWordCountWhenRanking no

The default was changed from no to yes in version 2.2.

=item MinWordLimit *integer*

Set the minimum length of a word. Shorter words will not be indexed.
The default is 1 (as defined in F<src/config.h>).

=item MaxWordLimit *integer*

Set the maximum length of an indexable word. Longer words will not
be indexed. The default is 40 (as defined in F<src/config.h>).
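
For example, to skip one- and two-letter words as well as anything
longer than 30 characters (both values are arbitrary and chosen only for
illustration):

    # Words shorter than 3 characters are not indexed
    MinWordLimit 3

    # Words longer than 30 characters are not indexed
    MaxWordLimit 30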

=item WordCharacters *string of characters*

=item IgnoreFirstChar *string of characters*

=item IgnoreLastChar *string of characters*

=item BeginCharacters *string of characters*

=item EndCharacters *string of characters*

These settings define what a word consists of to the Swish-e indexing engine.
Compiled in defaults are in F<src/config.h>.

When indexing, Swish-e uses B<WordCharacters> to split up the document
into words. Words are defined by any string of non-blank characters
that contain only the characters listed in WordCharacters. If a string
of characters includes a character that is not in WordCharacters then
the word will be split into two or more separate words.

For example:

    WordCharacters abde

Would turn "abcde" into two words "ab" and "de".

Next, of these words, any characters defined in B<IgnoreFirstChar> are
stripped off the start of the word, and B<IgnoreLastChar> characters
are stripped off the end of the word. This allows, for example,
periods within a word (www.slashdot.com), but not at the end of
a word. Characters in IgnoreFirstChar and IgnoreLastChar must be in
WordCharacters.

Finally, the resulting words MUST begin with one of the characters
listed in B<BeginCharacters> and end with one of the characters listed in
B<EndCharacters>. BeginCharacters and EndCharacters must be a subset of
the characters in WordCharacters. Often, WordCharacters, BeginCharacters
and EndCharacters will all be the same.

Note that the same process applies to the query while searching.

Getting these settings correct will take careful consideration and
practice. It's helpful to create an index of a single test file, and
then look at the words that are placed in the index (see the C<-v 4>,
C<-D> and C<-k> searching switches).

Currently there is only support for eight-bit characters.

    WordCharacters  .abcdefghijklmnopqrstuvwxyz
    BeginCharacters abcdefghijklmnopqrstuvwxyz
    EndCharacters   abcdefghijklmnopqrstuvwxyz

    Please visit http://www.example.com/path/to/file.html.

will be indexed as the following words:

    please
    visit
    http
    www.example.com
    path
    to
    file.html

Which means that you can search for C<www.example.com> as a single word,
but searching for just C<example> will not find the document.

Note: when indexing HTML documents, HTML entities are converted to their
character equivalents before being processed with these directives.
This is a change from previous versions of Swish-e where you were
required to include the characters C<0123456789&#;> to index entities.
See also L<ConvertHTMLEntities|/"item_ConvertHTMLEntities">

=item Buzzwords [*list of buzzwords*|File: path]

The Buzzwords option allows you to specify words that will be indexed
regardless of WordCharacters, BeginCharacters, EndCharacters, stemming,
soundex and many of the other checks done on words while indexing.

Buzzwords are case insensitive.

Buzzwords should be separated by spaces and may span multiple directives.
If the special format C<File:filename> is used then the Buzzwords will
be read from an external file during indexing.

    Buzzwords C++ TCP/IP

    Buzzwords File: ./buzzwords.lst

If a Buzzword contains search operator characters they must be backslashed
when searching. For example:

    Buzzwords C++ TCP/IP web=http

    ./swish-e -w 'web\=http'

Buzzwords are found by splitting the text on whitespace, removing
C<IgnoreFirstChar> and C<IgnoreLastChar> characters from the word,
and then comparing with the list of C<Buzzwords>. Therefore, if
adding C<Buzzwords> to an index you will probably want to define
C<IgnoreFirstChar> and C<IgnoreLastChar> settings.

Note: Buzzwords specific settings for C<IgnoreFirstChar> and
C<IgnoreLastChar> may be used in the future.

=item IgnoreWords [*list of stop words*|File: path]

The IgnoreWords option allows you to specify words to ignore, called
I<stopwords>. The default is to not use any stopwords.

Words should be separated by spaces and may span multiple directives.
If the special format C<File:filename> is used then the stop words will
be read from an external file during indexing.

In previous versions of Swish-e you could use the directive

    IgnoreWords swishdefault - obsolete!

to include a default list of compiled in stopwords. This keyword is no
longer supported.

    IgnoreWords www http a an the of and or

    IgnoreWords File: ./stopwords.de

=item UseWords [*list of words*|File: path]

UseWords defines the words that Swish-e will index. B<Only> the words
listed will be indexed.

You can specify a list of words following the directive (you may specify
more than one C<UseWords> directive in a config file), and/or use the
C<File:> form to specify a path to a file containing the words:

    UseWords perl python pascal fortran basic cobol php
    UseWords File: /path/to/my/wordlist

Please drop the Swish-e list a note if you actually use this feature.
It may be removed from future versions.

=item IgnoreLimit *integer integer*

This automatically omits words that appear too often in the files (these
words are called stopwords). Specify a whole percentage and a number,
such as "80 256". This omits words that occur in over 80% of the files
and appear in over 256 files. Comment out to turn off auto-stopwording.

Swish-e must do extra processing to adjust the entire index when this
feature is used. Instead of using this feature, it is recommended
that you decide what words are stopwords and add them to B<IgnoreWords>
in your configuration file. To do this, use IgnoreLimit one time and
note the stop words that are found while indexing. Add this list to
IgnoreWords, and then remove IgnoreLimit from the configuration file.
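
The two-step workflow described above might look like this. The stop
words shown in the second step are placeholders for whatever words
Swish-e actually reports on your documents.

    # Step 1: index once with automatic stopword detection
    IgnoreLimit 80 256

    # Step 2: replace the IgnoreLimit line with the words that were reported
    IgnoreWords the of and

After the second step the extra index adjustment is no longer needed,
because the stopwords are known before indexing starts.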

=item IgnoreMetaTags *list of names*

C<IgnoreMetaTags> defines a list of metatags to ignore while indexing
XML files (and HTML files if using libxml2 for parsing HTML). All text
within the tags will be ignored -- both for indexing (C<MetaNames>)
and properties (C<PropertyNames>). To still parse properties, yet
not index the text, see L<UndefinedMetaTags|/"item_UndefinedMetaTags">.

This option is useful to avoid indexing specific data from a file.

    ...
    </first_name> <last_name>
    ...
    </last_name> <updated_date>
    ...

In the above example you might B<not> want to index the updated date,
and therefore prevent finding this record by searching

    IgnoreMetaTags updated_date

See also L<UndefinedMetaTags|/"item_UndefinedMetaTags">.

=item IgnoreNumberChars *list of characters*

Experimental Feature

This experimental feature can be used to define a set of characters
that describe a number. If a word is found to contain only those
characters it will not be indexed. The characters listed must be part
of C<WordCharacters> settings. In other words, the "word" checked is
a word that Swish-e would otherwise index.

    IgnoreNumberChars 0123456789$.,

Then Swish-e would not index the following:

You might be tempted to avoid indexing hex numbers with:

    IgnoreNumberChars 0123456789abcdef

which will not index 0D31, but will also not index the word "bad".

This is an experimental feature that may change in future versions.
One possible change is to use regular expressions instead.

=item IndexComments [NO|yes]

This option lets you decide whether to index the contents of HTML
comments. Default is no. Set to yes if comment indexing is required.

Note: This is a change from the default behavior prior to version 2.2.

=item TranslateCharacters [*string1 string2*|:ascii7:]

The TranslateCharacters directive maps the characters in string1 to the
characters listed in string2.

    # This will index a_b as a-b and ámo as amo
    TranslateCharacters _á -a

C<TranslateCharacters :ascii7:> is a predefined set of characters that
will translate eight bit characters to ascii7 characters. Using the
:ascii7: rule will translate "Ääç" to "aac". This means: searching
"Çelik", "çelik" or "celik" will all match the same word.

TranslateCharacters is done early in the indexing process, after
converting HTML entities but before splitting the input text into words
based on B<WordCharacters>. So characters you are translating I<from>
do not need to be listed in word characters.

The same character translations take place when searching.

=item BumpPositionCounterCharacters *string*

When indexing, Swish-e assigns a word position to each word. This enables
phrase searching. There may be cases where you would like to prevent
phrase matching. The BumpPositionCounterCharacters directive allows
you to specify a set of characters that when found in the text will
increment the word position -- effectively preventing phrase matches
across that character.

For example, if you have a tag:

    computer programming | apple computers

You might want to prevent matching "programming apple" in that meta name.

    BumpPositionCounterCharacters |

There is no default, and you may list a string of characters.

=item DontBumpPositionOnEndTags *list of names*

=item DontBumpPositionOnStartTags *list of names*

Since metatags are typically separate data fields, the word position
counter is automatically bumped between metatags (actually, bumped when a
start tag is found and when an end tag is found). This prevents matching
a phrase that spans more than one metaname. C<DontBumpPositionOnEndTags>
and C<DontBumpPositionOnStartTags> disable this feature for the listed
tags.

In the configuration file:

    DontBumpPositionOnEndTags first_name
    DontBumpPositionOnStartTags last_name

This configuration allows this phrase search

    -w 'person=("william shakespeare")'

but this phrase search will fail

    -w 'person=("shakespeare april")'
2127
=head2 Directives for the File Access method only
2129
Some directives have different uses depending on the source of the
2130
documents. These directives are only valid when using the B<File system>
2135
=item IndexOnly *list of file suffixes*
2137
This directive specifies the allowable file suffixes (extensions) while
2138
indexing. The default is to index all files specified in B<IndexDir>.
2140
# Only index .html .htm and .q files
2141
IndexOnly .html .htm .q
2143
C<IndexOnly> checks that the file name ends in the characters listed. It does
2144
not check "extensions". C<IndexOnly> is tested right before C<FileRules>
2147
=item FollowSymLinks [yes|NO]
2149
Set to "yes" to follow symbolic links while indexing, or "no" to skip them. The default is "no".
2154
Note that when set to C<no> extra stat(2) system calls must be made for
2155
each file. For a large number of files you may see a small reduction in
2156
indexing time by setting this to C<yes>.
2158
See also the C<-l> switch in L<SWISH-RUN|SWISH-RUN>.
2160
=item FileRules [type] [contains|is|regex] *regular expression*
2162
=item FileMatch [type] [contains|is|regex] *regular expression*
2164
FileRules and FileMatch are used to, respectively, exclude and include
2165
files and directories to index. Since, by default, Swish-e indexes all
2166
files and recurses all directories (but see also C<FollowSymLinks>) you
2167
will typically only use C<FileRules> to exclude files or directories.
2168
C<FileMatch> is useful in a few cases, for example, to override the
2169
behavior of C<IndexOnly>. Some examples are included below.
2171
Except for C<FileRules title ...>, this feature is only available for the
2172
file access method (-S fs), which is the default indexing mode. Also,
2173
any pathname modification with C<ReplaceRules> happens after the check
2174
for C<FileRules>. (It's unlikely that you would exclude files with
2175
C<FileRules> based on text you added with C<ReplaceRules>!)
2177
The regular expression is a C regex.h extended regular expression.
2178
You may supply more than one regular expression per line, or use
2179
separate directives. Preceding the regular expression with the word
2180
"not" negates the match.
2182
The regular expression is compared against B<[type]> as described below.
2184
For historical reasons, you can specify C<contains> or C<is>. C<is>
2185
simply forces the regular expression to match at the start and end
2186
of the string (by internally prepending "^" and appending "$" to the
2187
regular expression).
2189
The C<regex> option requires delimiter characters:
2191
FileRules title regex /^private/i
2193
The only advantage of C<regex> is if you want to do case insensitive
2194
matches, or simply like your regular expressions to look like perl
2195
regular expressions. You must use matching delimiters; (), {}, and [],
2196
are not currently supported for no good reason other than laziness.
2198
Use quotes (" or ') around a pattern if it contains any white space.
2199
Note that the backslash character becomes the escape character within
2202
For example, these sets generate the same regular expressions.
2204
FileRules title is hello
2205
FileRules title contains ^hello$
2206
FileRules title regex /^hello$/
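The equivalence of the three forms can be sketched as follows. This is a model under stated assumptions, not Swish-e's code: C<compile_rule> is a hypothetical name, Python's C<re> module stands in for the C regex.h compiler (the two dialects differ in details), and only the C<i> flag shown in this section is handled.

```python
import re


def compile_rule(match_type, pattern):
    """Build a compiled pattern the way the three match types are described."""
    if match_type == "is":
        # "is" anchors the expression at both ends.
        return re.compile("^" + pattern + "$")
    if match_type == "contains":
        return re.compile(pattern)
    if match_type == "regex":
        # First character is the delimiter; anything after the
        # closing delimiter is treated as flags.
        delimiter = pattern[0]
        end = pattern.rfind(delimiter)
        if end == 0:
            raise ValueError("missing closing delimiter: " + pattern)
        body, flags = pattern[1:end], pattern[end + 1:]
        return re.compile(body, re.IGNORECASE if "i" in flags else 0)
    raise ValueError("unknown match type: " + match_type)


same = [
    compile_rule("is", "hello"),
    compile_rule("contains", "^hello$"),
    compile_rule("regex", "/^hello$/"),
]
```

All three entries in C<same> carry the pattern C<^hello$>, mirroring the three directives above.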
2208
These all need quotes due to the included space character
2210
FileRules title is "hello there"
2211
FileRules title contains "^hello there$"
2212
FileRules title regex "!^hello there$!"
2214
These show how the backslash must be doubled inside of quotes.
2215
Swish-e converts a double-backslash into a single backslash, and then
2216
passes that single onto the regular expression compiler.
2218
FileRules filename regex /\.pdf/
2219
FileRules filename regex "/\\.pdf/"
2221
FileRules filename regex !hello\\there! # need double for real backslash
2222
FileRules filename regex "!hello\\\\there!" # need double-double inside of quotes
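The unquoting step can be modeled in a few lines. The sketch below assumes only the rule stated earlier in this document (inside quotes the backslash escapes the next character); C<parse_parameter> is a hypothetical name and the real parser may differ in edge cases.

```python
def parse_parameter(raw):
    """Return the value of one directive parameter.

    Quoted parameters lose their quotes, and inside them a backslash
    escapes the next character. Unquoted parameters pass through as-is.
    """
    if len(raw) >= 2 and raw[0] in "'\"" and raw[-1] == raw[0]:
        out = []
        chars = iter(raw[1:-1])
        for ch in chars:
            if ch == "\\":
                # Keep the escaped character, drop the backslash itself.
                out.append(next(chars, "\\"))
            else:
                out.append(ch)
        return "".join(out)
    return raw


# Raw strings keep the backslashes exactly as typed in the config file.
bare_dot = parse_parameter(r'/\.pdf/')
quoted_dot = parse_parameter(r'"/\\.pdf/"')
bare_slash = parse_parameter(r'!hello\\there!')
quoted_slash = parse_parameter(r'"!hello\\\\there!"')
```

In this model each bare and quoted pair produces the same string, which is what then reaches the regular expression compiler.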
2227
The following types of match strings may be supplied:
2240
B<pathname> matches the regular expression against the current pathname.
2241
The pathname may or may not be absolute depending on what you supplied
2246
# Don't index paths that contain private or hidden
2247
FileRules pathname contains (private|hidden)
2250
FileRules pathname regex /(private|hidden)/
2252
# Don't index exe files
2253
FileRules pathname contains \.exe$
2255
B<dirname> and B<filename> split the path name by the last delimiter
2256
character into a directory name, and a file name. Then these are compared
2257
against the patterns supplied. Directory names do B<not> have a trailing
2258
slash. All path names use the forward slash as a delimiter within Swish-e.
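As an illustration, splitting on the last forward slash can be sketched in Python. C<split_path> is a hypothetical helper, and the sample paths are made up for the example.

```python
def split_path(pathname):
    """Split on the last "/" into (dirname, filename).

    The directory part carries no trailing slash.
    """
    dirname, _, filename = pathname.rpartition("/")
    return dirname, filename


dirname, filename = split_path("/usr/local/old/foo/test.html")
# dirname is "/usr/local/old/foo", filename is "test.html"
```

A C<dirname> pattern is then compared against the first value and a C<filename> pattern against the second.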
2262
# Same as last example - don't index *.exe files.
2263
FileRules filename contains \.exe$
2265
# Don't index any file called test.html
2266
FileRules filename contains ^test\.html$
2269
FileRules filename is test\.html
2271
# Don't index any directories that contain "old" (/usr/local/myold/docs)
2272
FileRules dirname contains old
2274
# Don't index any directories that contain the path segment "old" (/usr/local/old/foo)
2275
FileRules dirname contains /old/
2277
# Index only .htm, .html, plus any all-digit file names
2278
IndexOnly .htm .html
2279
FileMatch filename contains ^[0-9]+$
2281
# Same as previous, but maybe a little slower
2282
FileRules filename regex not !\.(htm|html)$!
2283
FileMatch filename contains ^[0-9]+$
2285
Swish-e checks these settings in the order of C<pathname>, C<dirname>, and
2286
C<filename>, and C<FileMatch> patterns are checked before C<FileRules>,
2287
in general. This allows you to exclude most files with C<FileRules>,
2288
yet allow in a few special cases with C<FileMatch>. For example:
2290
# Exclude all files of .exe, .bin, and .bat
2291
FileRules filename contains \.(exe|bin|bat)$
2292
# But, let these two in
2293
FileMatch filename is baseball\.bat incoming_mail\.bin
2295
# Same, but as a single pattern
2296
FileMatch filename is (baseball\.bat|incoming_mail\.bin)
2298
The C<directory> type is somewhat unique. When Swish-e recurses into a
2299
directory it will compare all the I<files> in the directory with the
2300
pattern and then decide if that entire directory should or should not
2301
be indexed (or recursed). Note that you are matching against file names
2302
in a directory -- and some of those names may be directory names.
2304
A C<FileRules directory> match will cause Swish-e to ignore all files and
2305
sub-directories in the current directory.
2307
Warning: A match with C<FileMatch directory> says to index B<everything>
2308
in the *current* directory and B<ignore> any FileRules for this directory.
2313
# Don't index any directories (and sub directories) that contain
2314
# a file (or sub-directory) called "index.skip"
2315
FileRules directory contains ^index\.skip$
2317
# Don't index directories that contain a .htaccess file.
2318
FileRules directory contains ^\.htaccess
2320
Note: While I<processing> directories, Swish-e will ignore any files
2321
or directories that begin with a dot ("."). You may index files
2322
or directories that begin with a dot by specifying their name with
2323
C<IndexDir> or C<-i>.
2325
C<title> checks for a pattern match in an HTML title.
2329
FileRules title contains construction example pointers
2331
# This example says to ignore case
2332
FileRules title regex "/^Internal document/i"
2334
Note: C<FileRules title> works for any input method (fs, prog, or http)
2335
that is parsed as HTML, and where a title was found in the document.
2337
In case all this seems a bit confusing, processing a directory happens
2338
in the following order.
2340
First the directory name is checked:
2342
FileRules dirname - reject entire directory if matches
2344
Next the directory is scanned and each file name (which might be the
2345
name of a sub-directory) is checked:
2347
FileRules directory - reject entire dir if *any* files match
2348
FileMatch directory - accept entire dir if *any* files match
2350
Then, unless C<FileMatch directory> matched, each file is tested with
2351
FileMatch. A match says to index the file without further testing
2352
(i.e. overrides FileRules and IndexOnly):
2354
FileMatch pathname \
2355
FileMatch dirname - file is accepted if any match
2356
FileMatch filename /
2360
IndexOnly - file is checked for the correct file extension
2362
FileRules pathname \
2363
FileRules dirname - file is rejected if any match
2364
FileRules filename /
2366
finally, the file is indexed.
2368
Files (not directories) listed with C<IndexDir> or C<-i> are processed
2371
FileMatch pathname \
2372
FileMatch dirname - file is accepted if any match
2373
FileMatch filename /
2375
otherwise, the file is rejected if it doesn't have the correct extension
2376
or a FileRules matches.
2378
IndexOnly - file is checked for the correct file extension
2380
FileRules pathname \
2381
FileRules dirname - file is rejected if any match
2382
FileRules filename /
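The per-file order listed above can be summarized as a small decision function. This is a sketch of the documented order only, not Swish-e's implementation: C<should_index> and its parameters are hypothetical names, rules are given as (type, pattern) pairs, and Python's C<re> stands in for the C regex.h compiler.

```python
import re


def should_index(pathname, file_match=(), index_only=(), file_rules=()):
    """Apply FileMatch, then IndexOnly, then FileRules to one file."""
    dirname, _, filename = pathname.rpartition("/")
    subject = {"pathname": pathname, "dirname": dirname, "filename": filename}

    def any_match(rules):
        return any(re.search(pattern, subject[kind]) for kind, pattern in rules)

    if any_match(file_match):
        return True   # accepted without further testing
    if index_only and not pathname.endswith(tuple(index_only)):
        return False  # wrong suffix
    if any_match(file_rules):
        return False  # explicitly excluded
    return True


# Exclude .exe, .bin and .bat files, but let two named files in.
rules = [("filename", r"\.(exe|bin|bat)$")]
exceptions = [("filename", r"^(baseball\.bat|incoming_mail\.bin)$")]
```

With these settings C<should_index("docs/baseball.bat", exceptions, (), rules)> accepts the file, while C<docs/setup.bat> is rejected by the C<FileRules> pattern.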
2384
Note: If things are not indexing as you expect, create a directory
2385
with some test files and use the C<-T regex> trace option to see how
2386
file names are checked. Start with very simple tests!
2391
=head2 Directives for the HTTP Access Method Only
2393
The HTTP Access method is enabled by the "-S http" switch when indexing. It works by
2394
running a Perl program called SwishSpider which fetches documents from a web server.
2396
Only text files (content-type of "text/*") are indexed with the HTTP Access Method.
2397
Other document types (e.g. PDF or MSWord) may be indexed as well. The SwishSpider will
2398
attempt to make use of the SWISH::Filter module (included with the Swish-e distribution) to
2399
convert documents into a format that Swish-e can index.
2401
Note: The -S prog method of spidering (using spider.pl) can be a replacement for the -S http method.
2402
It offers more configuration options and better spidering speed.
2404
These directives below are available when using the HTTP Access Method of indexing.
2408
=item MaxDepth *integer*
2410
MaxDepth defines how many links the spider should follow before stopping.
2411
A value of 0 configures the spider to traverse all links. The default
2416
Note: The default was changed from 5 to 0 in release 2.4.0
2418
=item Delay *seconds*
2420
The number of seconds to wait between issuing requests to a server.
2421
This setting allows for more friendly spidering of remote sites.
2422
The default is 5 seconds.
2426
Note: The default was changed from 60 to 5 seconds in release 2.4.0
2430
The location of a writable temp directory on your system. The HTTP
2431
access method tells the Perl helper to place its files in this location,
2432
and the C<-e> switch causes Swish-e to use this directory while indexing.
2433
There is no default.
2437
If this directory does not exist or is not writable Swish-e will fail
2438
with an error during indexing.
2440
Note, the environment variables of C<TMPDIR>, C<TMP>, and C<TEMP>
2441
(in that order) will B<override> this setting.
2443
=item SpiderDirectory *path*
2445
The location of the Perl helper script called F<swishspider>. If you
2446
use a relative directory, it is relative to your directory when you run
2447
Swish-e, not to the directory that Swish-e is in.
2448
The default is the location where swishspider was installed.
2449
Normally this does not need to be set.
2451
SpiderDirectory /usr/local/swish
2453
=item EquivalentServer *server alias*
2455
Often the same site may be referred to by different names.
2456
A common example is that often http://www.some-server.com and
2457
http://some-server.com are the same. Each line should have a list of
2458
all the method/names that should be considered equivalent. Multiple
2459
EquivalentServer directives may be used. Each directive defines its
2460
own set of equivalent servers.
2462
EquivalentServer http://library.berkeley.edu http://www.lib.berkeley.edu
2463
EquivalentServer http://sunsite.berkeley.edu:2000 http://sunsite.berkeley.edu
2467
=head2 Directives for the prog Access Method Only
2469
This section details the directives that are only available for the
2470
"prog" document source feature of Swish-e. The "prog" access method runs
2471
an external program that "feeds" documents to Swish-e. This allows indexing
2472
and filtering of documents from any source.
2474
See L<prog - general purpose access method|SWISH-RUN/"item_prog"> in
2475
the SWISH-RUN man page for more information.
2478
A number of example programs for use with the "prog" access method are
2479
provided in the F<prog-bin> directory. Please see those examples if you
2480
have questions about implementing a "prog" input program.
2484
=item SwishProgParameters *list of parameters*
2486
This is a list of parameters that will be sent to the external program
2487
when running with the "prog" document source method.
2489
SwishProgParameters /path/to/config hello there
2490
IndexDir /path/to/program.pl
2494
swish-e -c config -S prog
2496
Swish-e will execute C</path/to/program.pl> and pass C</path/to/config
2497
hello there> as three command line arguments to the program. This
2498
directive makes it easy to pass settings from the Swish-e configuration
2499
file to the external program.
2501
For example, the C<spider.pl> program (included in the C<prog-bin>
2502
directory) uses the C<SwishProgParameters> to specify what file to read
2503
for configuration information.
2505
SwishProgParameters spider.config
2506
IndexDir ./spider.pl
2508
The C<spider.pl> program also has a default action so you can avoid
2509
using a configuration file:
2511
SwishProgParameters default http://www.swishe.org/ http://some.other.site/
2512
IndexDir ./spider.pl
2514
And the spider program will use default settings for spidering those sites.
2516
Swish-e can read documents from standard input, so another way to run an external program
2519
./spider.pl spider.conf | ./swish-e -S prog -i stdin
2523
B<Notes when using MS Windows>
2525
You should use unix style path separators to specify your external
2526
program. Swish will convert forward slashes to backslashes before
2527
calling the external program. This is only true for the program name
2528
specified with C<IndexDir> or the C<-i> command line option.
2530
In addition, Swish-e will make sure the program specified actually exists,
2531
which means you need to use the full name of the program.
2533
For example, to run the perl spider program F<spider.pl> you would need
2534
a Swish-e configuration file such as:
2536
IndexDir e:/perl/bin/perl.exe
2537
SwishProgParameters prog-bin/spider.pl default http://swish-e.org
2539
and run indexing with the command:
2541
swish-e -c swish.cfg -S prog -v 9
2543
The C<IndexDir> command tells Swish-e the name of the program to run.
2544
Under unix you can just specify the name of the script, since unix will
2545
figure out the program from the first line of the script.
2547
The C<SwishProgParameters> are the parameters passed to the program
2548
specified by C<IndexDir> (perl.exe in this case). The first parameter
2549
is the perl script to run (F<prog-bin/spider.pl>). Perl passes the rest
2550
of the parameters directly to the perl script. The second parameter
2551
F<default> tells the F<spider.pl> program to use default settings for
2552
spidering (or you could specify a spider config file -- see C<perldoc
2553
spider.pl> for details), and lastly, the URL is passed into the spider
2557
=head2 Document Filter Directives
2559
Internally, Swish-e knows how to parse only text, HTML, and XML documents.
2560
With "filters" you can index other types of documents. For example,
2561
if all your web pages are in gzip format a filter can uncompress these
2562
on the fly for indexing.
2564
You may wish to read the Swish-e FAQ question on filtering before continuing here.
2565
L<How Do I filter documents?|SWISH-FAQ/"How Do I filter documents?">
2567
There are two suggested methods for filtering.
2569
=head3 Filtering with SWISH::Filter
2571
The Swish-e distribution includes a Perl module called SWISH::Filter and individual
2572
filters located in the F<filters> directory. This system uses plug-in filters to
2573
extend the types of documents that Swish-e can index. The plug-in filters do not
2574
actually do the filtering, but rather provide a standard interface for accessing programs that
2575
can filter or convert documents. The programs that do the filtering are not part of
2576
the Swish-e distribution; they must be downloaded and installed separately.
2578
The advantage of this method is that new filtering methods can be installed easily.
2580
This system is designed to work with the -S http and -S prog methods, but may also be used
2581
with the C<FileFilter> feature and -S fs indexing method. See
2582
F<$prefix/share/doc/swish-e/examples/filter-bin/swish_filter.pl> for
2585
See the F<filters/README> file for more information.
2587
=head3 Filtering with the FileFilter feature
2589
A filter is an external program that Swish-e executes while processing
2590
a document of a given type. Swish-e will execute the filter program
2591
for each file that matches the file suffix (extension) set in the
2592
B<FileFilter> or B<FileFilterMatch> directives. B<FileFilterMatch>
2593
matches using regular expressions and is described below.
2595
Filters may be used with any type of input method (i.e. -S fs, -S http, or -S prog).
2598
Swish-e calls the external program passing as B<default> arguments:
2604
the name of the filter program
2608
the physical path name of the file to read. This may be a temporary
2609
file location if indexing by the http method.
2613
When indexing under the file system this will be the same as $1 (the
2614
path to the source file), but when indexing under the http method this
2615
will be the URL of the source document.
2619
Swish-e can also pass other parameters to the filter program. These
2620
parameters can be defined using the B<FileFilter> or B<FileFilterMatch>
2621
directives. See Filter Options below.
2623
The filter program must open the file, process its contents, and return
2624
it to Swish-e by printing to STDOUT.
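A minimal filter following that contract might look like the sketch below. It is an assumption-laden example rather than a filter shipped with Swish-e: the names C<convert> and C<main> are hypothetical, the conversion shown is a plain gunzip, and the script expects the work file path as its first argument.

```python
import gzip
import sys


def convert(data):
    """Turn the raw file bytes into something Swish-e can parse.

    Here the conversion is simply decompressing gzip data.
    """
    return gzip.decompress(data)


def main(argv):
    """Read the work file named in argv[1] and print the result to STDOUT."""
    if len(argv) < 2:
        sys.stderr.write("usage: filter WORK_PATH [DOCUMENT_PATH]\n")
        return 1
    with open(argv[1], "rb") as handle:
        sys.stdout.buffer.write(convert(handle.read()))
    return 0
```

To run it as a standalone script, end the file with C<sys.exit(main(sys.argv))>.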
2626
Note that this can add a significant amount of time to the indexing
2627
process if your external program is a perl or shell script. If you
2628
have many files to filter you should consider writing your filter in C
2629
instead of a shell or perl script, or using the "prog" Access Method.
2633
=item FilterDir *path-to-directory*
2635
This is the path to a directory where the filter programs are stored.
2636
Swish-e looks in this directory to find the filter specified in the
2637
B<FileFilter> directive. If this directive is omitted, you have to
2638
specify the full path to the filter script in each FileFilter directive.
2640
This feature does *not* apply to the C<FileFilterMatch> directive.
2644
FilterDir /usr/local/swish/filters
2646
=item FileFilter *suffix* "filter-prog" ["filter-options"]
2648
This maps a file suffix (extension) to a filter program. If I<filter-prog>
2649
starts with a directory delimiter (absolute path), Swish-e doesn't use
2650
the FilterDir settings, but uses the given I<filter-prog> path directly.
2654
Filter options are a string passed as arguments to the I<filter-prog>.
2655
Filter options can contain variables, replaced by Swish-e. If you omit
2656
I<filter-options> Swish-e will use default parameters for the options
2659
Default: "'%p' '%P'"
2660
Which means: pass "workfile path" and "documentfile path" to filter (each quoted).
2662
Variables in filter options:
2665
%P = Full document pathname (e.g. URL, or path on filesystem)
2666
%p = Full pathname to work file (maybe a tmpfile or the real document path on filesystem)
2667
%F = Filename stripped from full document pathname
2668
%f = Filename stripped from "work" pathname
2669
%D = Directoryname stripped from full document pathname
2670
%d = Directoryname stripped from full "work" pathname
2672
Examples of strings passed:
2674
%P = document pathname: http://myserver/path1/mydoc.txt
2675
%p = work pathname: /tmp/tmp.1234.mydoc.txt
2677
%f = tmp.1234.mydoc.txt
2678
%D = http://myserver/path1
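The substitution can be sketched as follows. C<expand_filter_options> is a hypothetical name and the code only models the variable table above; it assumes the file and directory parts are obtained by splitting on the last forward slash.

```python
import re


def expand_filter_options(options, doc_path, work_path):
    """Replace %P %p %F %f %D %d in a filter-options string."""
    doc_dir, _, doc_file = doc_path.rpartition("/")
    work_dir, _, work_file = work_path.rpartition("/")
    values = {
        "P": doc_path, "p": work_path,
        "F": doc_file, "f": work_file,
        "D": doc_dir, "d": work_dir,
    }
    # A function replacement keeps any backslashes in the paths literal.
    return re.sub(r"%([PpFfDd])", lambda m: values[m.group(1)], options)


doc = "http://myserver/path1/mydoc.txt"
work = "/tmp/tmp.1234.mydoc.txt"
default_args = expand_filter_options("'%p' '%P'", doc, work)
```

With the sample paths above, C<default_args> holds the quoted work path followed by the quoted document URL.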
2681
Important hint for security:
2683
When using variable substitution, use quotes to ensure filename integrity.
2685
e.g. "'%f'" --> 'file name with spaces.doc'.
2687
If you don't use this, your system security may be compromised, or
2688
filtering may not work for these files.
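The difference the quotes make can be seen by modeling how a POSIX shell splits a command line into words. The sketch uses Python's standard C<shlex> module as a stand-in for the shell; C<myfilter> is a made-up program name.

```python
import shlex

name = "file name with spaces.doc"

# Without quotes the shell sees four separate arguments.
unquoted = shlex.split("myfilter " + name)

# With single quotes the file name stays one argument.
quoted = shlex.split("myfilter '" + name + "'")
```

Here C<unquoted> has five words (the program plus four fragments of the name), while C<quoted> has two. A file name that itself contains a single quote would still need extra care.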
2690
B<Notes when using MS Windows>
2692
Windows uses double quotes to escape shell metacharacters, so reverse
2693
the quotes in the examples above. e.g.:
2695
'"%f"' --> "file name with spaces.doc"
2697
You can specify the filter program using forward slashes (unix style).
2698
Swish will convert the slashes to backslashes before running your program.
2700
FileFilter .mydoc c:/some/path/mydocfilter.exe '-d "%d" -example -url "%P" "%f"'
2703
Examples of filters:
2705
FileFilter .doc /usr/local/bin/catdoc "-s8859-1 -d8859-1 '%p'"
2706
FileFilter .pdf pdftotext "'%p' -"
2707
FileFilter .html.gz gzip "-c '%p'"
2708
FileFilter .mydoc "/some/path/mydocfilter" "-d '%d' -example -url '%P' '%f'"
2710
The above examples are running a I<binary> filter program. For more
2711
complicated filtering needs you may use a scripting language such as
2712
Perl or a shell script. Here are some examples of calling a shell and
2715
FileFilter .pdf pdf2html.sh
2716
FileFilter .ps ghostscript-filter.pl
2718
Using a scripting language (or any language that has a large startup
2719
cost) can B<greatly increase the indexing time>. For small indexing
2720
jobs, this may not be an issue, but for large collections of files that
2721
require processing by a scripting language, you may be better off using
2722
the C<-S prog> access method where the script will only be compiled once,
2723
instead of for each document.
2725
Filters are probably easier to write than a C<-S prog> program. Which you
2726
decide to use depends on your requirements. Examples of filter scripts
2727
can be found in the F<filter-bin> directory, and examples of C<-S prog>
2728
programs can be found in the F<prog-bin> directory.
2730
=item FileFilterMatch *filter-prog* *filter-options* *regex* [*regex* ...]
2732
This is similar to C<FileFilter> except that it uses regular expressions to
2733
match against the file name. *filter-prog* is the path to the program.
2734
Unlike C<FileFilter> this does B<not> use the C<FilterDir> option.
2735
Also unlike C<FileFilter> you B<must> specify the *filter-options*.
2739
FileFilterMatch ./pdftotext "'%p' -" /\.pdf$/
2741
Note that this will also match a file called ".pdf", so you may want to use
2742
something that requires a filename that has more than just an extension.
2745
FileFilterMatch ./pdftotext "'%p' -" /.\.pdf$/
2747
To specify more than one extension:
2749
FileFilterMatch ./check_title.pl "%p" /\.html$/ /\.htm$/
2751
Or a few ways to do the same thing:
2753
FileFilterMatch ./check_title.pl %p /\.(html|htm)$/
2754
FileFilterMatch ./check_title.pl %p /\.html?$/
2758
FileFilterMatch ./check_title.pl %p /\.html?$/i
2760
You may also precede an expression with "not" to negate the regular expressions
2761
that follow. For example, to match files that do not have an extension:
2763
FileFilterMatch ./convert "%p %P" not /\..+$/
2767
=head1 Document Info
2769
$Id: SWISH-CONFIG.pod,v 1.74.2.1 2003/12/17 23:59:03 whmoseley Exp $