<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<title>SWISH-Enhanced: INSTALL - Swish-e Installation Instructions </title>
<link href="./style.css" rel=stylesheet type="text/css" title="refstyle">
<a href="http://swish-e.org"><img border=0 src="images/swish.gif" alt="Swish-E Logo"></a><br>
<img src="images/swishbanner1.gif"><br>
<img src="images/dotrule1.gif"><br>
INSTALL - Swish-e Installation Instructions
<a href="./README.html">Prev</a> |
<a href="./index.html">Contents</a> |
<a href="./CHANGES.html">Next</a>
<P><B>Table of Contents:</B></P>
<LI><A HREF="#OVERVIEW">OVERVIEW</A>
<LI><A HREF="#Windows_Users">Windows Users</A>
<LI><A HREF="#SYSTEM_REQUIREMENTS">SYSTEM REQUIREMENTS</A>
<LI><A HREF="#Software_Requirements">Software Requirements</A>
<LI><A HREF="#Optional_but_Recommended_Packages">Optional but Recommended Packages</A>
<LI><A HREF="#INSTALLATION">INSTALLATION</A>
<LI><A HREF="#Building_Swish_e">Building Swish-e</A>
<LI><A HREF="#Building_a_Debian_Package">Building a Debian Package</A>
<LI><A HREF="#What_s_installed">What's installed</A>
<LI><A HREF="#Documentation">Documentation</A>
<LI><A HREF="#The_Swish_e_documentation_as_man_1_pages">The Swish-e documentation as man(1) pages</A>
<LI><A HREF="#Join_the_Swish_e_discussion_list">Join the Swish-e discussion list</A>
<LI><A HREF="#QUESTIONS_AND_TROUBLESHOOTING">QUESTIONS AND TROUBLESHOOTING</A>
<LI><A HREF="#When_posting_please_provide_the_following_information_">When posting please provide the following information:</A>
<LI><A HREF="#ADDITIONAL_INSTALLATION_OPTIONS">ADDITIONAL INSTALLATION OPTIONS</A>
<LI><A HREF="#The_SWISH_API_Perl_Module">The SWISH::API Perl Module</A>
<LI><A HREF="#Creating_PDF_and_Postscript_documentation">Creating PDF and Postscript documentation</A>
<LI><A HREF="#GENERAL_CONFIGURATION_AND_USAGE">GENERAL CONFIGURATION AND USAGE</A>
<LI><A HREF="#Introduction_to_Indexing_and_Searching">Introduction to Indexing and Searching</A>
<LI><A HREF="#Metanames_and_Properties">Metanames and Properties</A>
<LI><A HREF="#Getting_Started_With_Swish_e">Getting Started With Swish-e</A>
<LI><A HREF="#Step_1_Create_a_Configuration_File">Step 1: Create a Configuration File</A>
<LI><A HREF="#Step_2_Index_your_Files">Step 2: Index your Files</A>
<LI><A HREF="#Step_3_Search">Step 3: Search</A>
<LI><A HREF="#Phrase_Searching">Phrase Searching</A>
<LI><A HREF="#Boolean_Searching">Boolean Searching</A>
<LI><A HREF="#Context_Searching">Context Searching</A>
<LI><A HREF="#META_Tags">META Tags</A>
<LI><A HREF="#Spidering_and_Searching_with_a_Web_form_">Spidering and Searching with a Web form.</A>
<LI><A HREF="#Indexing_Other_Types_of_Documents_Filtering">Indexing Other Types of Documents - Filtering</A>
<LI><A HREF="#Filtering_Overview">Filtering Overview</A>
<LI><A HREF="#Filtering_Examples">Filtering Examples</A>
<LI><A HREF="#Document_Info">Document Info</A>

[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]

<H1><A NAME="OVERVIEW">OVERVIEW</A></H1>
This document describes how to download, build and install Swish-e from
source. It also gives a basic overview of using Swish-e to index documents,
with pointers to more advanced examples.

This document also provides instructions on how to get help installing and
using Swish-e (and the important information you should provide when asking
for help). Please read these instructions before requesting help on the
Swish-e discussion list. See <A HREF="#QUESTIONS_AND_TROUBLESHOOTING">QUESTIONS AND TROUBLESHOOTING</A>.

Although building from source is recommended, some OS distributions (e.g.
Debian) provide pre-compiled binaries. Check with your distribution for
available packages. Build from source if your distribution does not offer
the current version of swish-e.

Also, please read the Swish-e FAQ (<A HREF="././SWISH-FAQ.html">SWISH-FAQ</A>), which answers many frequently asked questions.

Swish-e knows how to index HTML, XML, and plain text documents. Helper
applications and other tools are used to convert documents such as PDF or
MS Word into a format that swish-e can index. These additional applications
and tools (listed below) must be installed separately. The process of
converting documents is called ``filtering.''

NOTE: Swish-e version 2.4.0 installs many more files than earlier versions
when running ``make install''. Be aware that the Swish-e documentation may
thus include errors about where files are located. Please notify the swish-e
discussion list of any documentation errors.

[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]

<H2><A NAME="Windows_Users">Windows Users</A></H2>
A Windows binary version is available as a separate download from the
Swish-e site (http://swish-e.org). Many of the installation instructions
below will not apply to Windows users; the Windows version is pre-compiled
and includes libxml2, zlib, xpdf and catdoc.

A number of Perl modules may also be needed. These can be installed with
ActiveState's PPM utility.
<td bgcolor="#eeeeee" width="1">
<pre> libwww-perl - the LWP modules (for spidering)
HTML-Tagset - used by web spider
HTML-Parser - used by web spider
MIME-Types - used for filtering documents when not spidering
HTML-Template - formatting output from swish.cgi (optional)
HTML-FillInForm (if HTML-Template is used)</pre>

[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]

<H1><A NAME="SYSTEM_REQUIREMENTS">SYSTEM REQUIREMENTS</A></H1>
Swish-e makes use of a number of libraries and tools that are not
distributed with Swish-e. Some libraries need to be installed before
building Swish-e from source, while other tools can be installed at any
time. See below for details.

[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]

<H2><A NAME="Software_Requirements">Software Requirements</A></H2>
Swish-e is written in C and up to this time has been tested on a number of
platforms, including Sun/Solaris, Dec Alpha, BSD, Linux, OS X, and Open

The GNU C compiler, gcc, and GNU make are strongly recommended.

[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]

<H2><A NAME="Optional_but_Recommended_Packages">Optional but Recommended Packages</A></H2>
Most of these packages are available as easily installable packages. Check
your operating system vendor, or install from source. Most are very common
packages that may already be installed on your computer.

As noted below, some packages need to be installed before building Swish-e
from source, while others may be added after Swish-e is installed.

<P><LI><STRONG><A NAME="item_Libxml2">Libxml2</A></STRONG>
Libxml2 is very strongly recommended. It is used for parsing both HTML and
XML files. Swish-e can be built and installed without libxml2, but the HTML
parser built into swish-e is not as accurate as libxml2.
<td bgcolor="#eeeeee" width="1">
<pre> <A HREF="http://xmlsoft.org/">http://xmlsoft.org/</A></pre>
For swish-e to use libxml2 it must be installed before building swish-e.

<P><LI><STRONG><A NAME="item_Zlib">Zlib Compression</A></STRONG>
The Zlib compression library is commonly installed on most systems and is
recommended for use with Swish-e. Zlib is used for compressing text stored
in the swish-e index.
<td bgcolor="#eeeeee" width="1">
<pre> <A HREF="http://www.gzip.org/zlib/">http://www.gzip.org/zlib/</A></pre>
Zlib must be installed before building swish-e.

<P><LI><STRONG><A NAME="item_Perl">Perl Modules</A></STRONG>
Although Swish-e is a compiled C program, many support features use Perl.
For example, the web spiders are written in Perl, and modules to help with
filtering documents are also written in Perl.

The following Perl modules may be required. Check your current Perl
installation as many may already be installed.
<td bgcolor="#eeeeee" width="1">
MIME::Types (optional)</pre>
Note that installing Bundle::LWP with the CPAN module
<td bgcolor="#eeeeee" width="1">
<pre> perl -MCPAN -e 'install Bundle::LWP'</pre>
will install many of the above.

If you wish to use HTML-Template with swish.cgi to generate output:
<td bgcolor="#eeeeee" width="1">
HTML::FillInForm</pre>
If you wish to use Template-Toolkit with swish.cgi to generate output:
<td bgcolor="#eeeeee" width="1">
Questions about installing these modules may be sent to the swish-e
The ``search.cgi'' example script requires both Template-Toolkit and

<P><LI><STRONG><A NAME="item_Indexing">Indexing PDF Documents</A></STRONG>
Indexing PDF files requires the xpdf package, which most operating systems
provide as an installable package.
<td bgcolor="#eeeeee" width="1">
<pre> <A HREF="http://www.foolabs.com/xpdf/">http://www.foolabs.com/xpdf/</A></pre>
Xpdf may be added after swish-e is installed.

<P><LI><STRONG><A NAME="item_Indexing">Indexing MS Word Documents</A></STRONG>
Indexing MS Word documents requires the Catdoc program.
<td bgcolor="#eeeeee" width="1">
<pre> <A HREF="http://www.45.free.net/~vitus/ice/catdoc">http://www.45.free.net/~vitus/ice/catdoc</A></pre>
Catdoc may be added after swish-e is installed.

<P><LI><STRONG><A NAME="item_Indexing">Indexing MP3 ID3 Tags</A></STRONG>
Indexing MP3 ID3 Tags requires the MP3::Tag Perl module. See <A
HREF="http://search.cpan.org">http://search.cpan.org</A>. MP3::Tag may be
installed after swish-e is installed.

<P><LI><STRONG><A NAME="item_Indexing">Indexing MS Excel Files</A></STRONG>
Indexing MS Excel files is supported by the following Perl modules, also
available at <A HREF="http://search.cpan.org">http://search.cpan.org</A>:
<td bgcolor="#eeeeee" width="1">
<pre> Spreadsheet::ParseExcel
These Perl modules may be installed after swish-e is installed.
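
To see which of the Perl modules mentioned above are already present on your system, a loop like the following may help. This is only a sketch: the module list is an illustration drawn from the packages named in this section (edit it to match the features you plan to use), and it assumes that perl is in your PATH.

```shell
# Report which Perl modules are already installed.
# The module list is illustrative; adjust it to the features you need.
module_report=""
for module in LWP HTML::Parser HTML::Tagset MIME::Types MP3::Tag Spreadsheet::ParseExcel; do
    # "perl -MModule -e 1" exits non-zero when the module cannot be loaded.
    if perl -M"$module" -e 1 >/dev/null 2>&1; then
        status="installed"
    else
        status="missing"
    fi
    module_report="${module_report}${module}: ${status}
"
done
printf '%s' "$module_report"
```

Modules reported as missing can then be installed from your operating system's packages or with the CPAN module.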

[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]

<H1><A NAME="INSTALLATION">INSTALLATION</A></H1>
Here are brief installation instructions that should work in most cases.
Following this section are more detailed instructions and examples.

[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]

<H2><A NAME="Building_Swish_e">Building Swish-e</A></H2>
Download swish-e using your favorite web browser or a utility like wget,
lynx, or lwp-download. Unpack and build using the following steps:

Note: ``swish-e-2.4.0'' is used as an example. Download the most current
version available and adjust the commands below! Also, if running Debian,
see notes below on building a .deb package.

The ``$'' symbol indicates steps run as an unprivileged user. The ``#''
indicates steps run as the superuser (root).
<td bgcolor="#eeeeee" width="1">
<pre> $ wget <A HREF="http://swish-e.org/Download/swish-e-2.4.0.tar.gz">http://swish-e.org/Download/swish-e-2.4.0.tar.gz</A>
$ gzip -dc swish-e-2.4.0.tar.gz | tar xof -
$ cd swish-e-2.4.0 (this directory will depend on the version of Swish-e)</pre>
<td bgcolor="#eeeeee" width="1">
==================</pre>
<td bgcolor="#eeeeee" width="1">
<pre> $ su root (or use sudo)
(enter password)</pre>
<td bgcolor="#eeeeee" width="1">
<STRONG>IMPORTANT:</STRONG> Once installed do not run swish-e as the superuser (root) -- root is only
required during the installation step when installing into system

Here's another installation example. This might be used if you do not have
root access or you wish to install swish someplace other than /usr/local.

This example also shows building Swish-e in a ``build'' directory separate
from the source files. This is the recommended way to build Swish-e,
but requires GNU Make. Without GNU Make you will likely need to build from
within the source directory as shown in the previous example.
<td bgcolor="#eeeeee" width="1">
<pre> $ tar zxof swish-e-2.4.0.tar.gz (GNU tar with "z" option)
Note that the current directory is not where Swish-e was unpacked.

Swish-e uses a <EM>configure</EM> script. <EM>configure</EM> has many options, but reasonable and standard defaults. Running
<td bgcolor="#eeeeee" width="1">
<pre> $ ../swish-e-2.4.0/configure --help</pre>
will display the options. Two options are of common interest: --prefix sets
the top-level installation directory, and --disable-shared will link
swish-e statically, which may be needed on some platforms (Solaris 2.6

Note: On some platforms (e.g. Solaris) zlib is installed in /usr/local/lib,
but the linker does not use that path for run-time linkages. Swish will
build correctly but make check will fail. The solution is to set the
<td bgcolor="#eeeeee" width="1">
<pre> LDFLAGS=-R/usr/local/lib</pre>
before running configure.
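
As a rough sketch, the setting can be exported in the shell session that will run configure. The configure path and the --prefix value below are simply the ones used in the examples in this document; adjust them to your own layout.

```shell
# Make the run-time linker search /usr/local/lib (the path from the note above).
LDFLAGS="-R/usr/local/lib"
export LDFLAGS

# Run configure only if the script is where this document's examples put it.
configure_script="../swish-e-2.4.0/configure"
if [ -x "$configure_script" ]; then
    "$configure_script" --prefix="$HOME/swish-e"
else
    echo "configure not found at $configure_script; adjust the path" >&2
fi
```

Because the variable is exported, it stays in effect for later commands in the same shell session.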

Now configure and build Swish-e:
<td bgcolor="#eeeeee" width="1">
<pre> $ ../swish-e-2.4.0/configure --prefix=$HOME/swish-e
$ make >/dev/null (redirect output to only see warnings and errors)
==================</pre>
<td bgcolor="#eeeeee" width="1">
$ $HOME/swish-e/bin/swish-e -V
In this case you would likely want to add $HOME/swish-e/bin to your shell's

[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]

<H2><A NAME="Building_a_Debian_Package">Building a Debian Package</A></H2>
The Swish-e distribution includes the files required to build a Debian
<td bgcolor="#eeeeee" width="1">
<pre> $ tar zxof swish-e-2.4.0.tar.gz (GNU tar with "z" option)
$ fakeroot debian/rules binary
dpkg-deb: building package `swish-e' in `../swish-e_2.4.0-0_i386.deb'.
# dpkg -i ../swish-e_2.4.0-0_i386.deb</pre>

[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]

<H2><A NAME="What_s_installed">What's installed</A></H2>
Swish installs a number of files. By default all files are installed below
/usr/local, but this can be changed by setting --prefix when running
<EM>configure</EM> (as shown above). Individual paths may also be set. Run
<EM>configure --help</EM> for details.
<td bgcolor="#eeeeee" width="1">
<pre> $prefix/bin/swish-e The swish-e binary program
$prefix/share/doc/swish-e/ Full documentation and examples
$prefix/lib/libswish-e The swish-e C library
$prefix/include/swish-e.h The library header file
$prefix/man/man1/ Documentation as manual pages
$prefix/lib/swish-e/ Helper programs (spider.pl, swishspider, swish.cgi)
$prefix/lib/swish-e/perl/ Perl helper modules</pre>
Note that the Perl modules are <EM>not</EM> installed in the system Perl library. Swish-e and the Perl scripts that
require the modules know where to find the modules, but the <EM>perldoc</EM> program used for reading documentation does not. This can be corrected by
adding $prefix/lib/swish-e and $prefix/lib/swish-e/perl to the PERL5LIB
environment variable.
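
For a Bourne-style shell, setting the variable might look like the sketch below. It assumes Swish-e was configured with --prefix=$HOME/swish-e, as in the earlier example; substitute your own prefix.

```shell
# Assumption: Swish-e was configured with --prefix=$HOME/swish-e.
prefix="$HOME/swish-e"
swish_perl_dirs="$prefix/lib/swish-e:$prefix/lib/swish-e/perl"

# Prepend the Swish-e directories, keeping any existing PERL5LIB value.
PERL5LIB="${swish_perl_dirs}${PERL5LIB:+:$PERL5LIB}"
export PERL5LIB
echo "PERL5LIB=$PERL5LIB"
```

Placing these lines in your shell start-up file makes the setting permanent, so that perldoc can locate the Swish-e modules in later sessions.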

[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]

<H2><A NAME="Documentation">Documentation</A></H2>
Documentation can be found in the $prefix/share/doc/swish-e directory.
Documentation can also be read on-line at the Swish-e web site:
<td bgcolor="#eeeeee" width="1">
<pre> <A HREF="http://swish-e.org/">http://swish-e.org/</A></pre>

[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]

<H2><A NAME="The_Swish_e_documentation_as_man_1_pages">The Swish-e documentation as man(1) pages</A></H2>
Running ``make install'' installs some of the Swish-e documentation as man
pages. The following man pages are installed:
<td bgcolor="#eeeeee" width="1">
SWISH-LIBRARY(1)</pre>
The man pages are installed in the system man directory. ./configure
determines this directory, and you can override it by passing the
--mandir option when running ./configure:
<td bgcolor="#eeeeee" width="1">
<pre> ./configure --mandir=/usr/local/doc/man</pre>
Information on running ./configure can be found by typing:
<td bgcolor="#eeeeee" width="1">
<pre> ./configure --help</pre>

[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]

<H2><A NAME="Join_the_Swish_e_discussion_list">Join the Swish-e discussion list</A></H2>
The final step when installing Swish-e is to join the Swish-e discussion

The Swish-e discussion list is the place to ask questions about installing
and using Swish-e, see or post bug fixes or security announcements, and a
place where <STRONG>you</STRONG> can offer help to others. Please do not contact the developers directly.

The list is typically <EM>very low traffic</EM>, so it won't overload your inbox. Please take time to subscribe. See <A
HREF="http://Swish-e.org">http://Swish-e.org</A>.

If you are using Swish-e on a public site, please let the list know so it
can be added to the list of sites that use Swish-e!

Please review the next section before posting a question to the Swish-e

[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]

<H1><A NAME="QUESTIONS_AND_TROUBLESHOOTING">QUESTIONS AND TROUBLESHOOTING</A></H1>
Support for installation, configuration and usage is available via the
Swish-e discussion list. Visit <A
HREF="http://swish-e.org">http://swish-e.org</A> for information. Do not
contact developers directly for help -- always post your question to the

It's very important to provide the right information when asking for help.

Please search the Swish-e list archive before posting a question, and check
the <A HREF="././SWISH-FAQ.html">SWISH-FAQ</A> to see if your question has already been answered.

Before posting, use the tools available to narrow down the problem.

Swish-e has the -T, -v, and -k switches that may help resolve issues. These
switches are described on the <A HREF="././SWISH-RUN.html">SWISH-RUN</A> page. For example, if you cannot find a document by a keyword that you
believe should be indexed, try indexing just that single file, and use the
-T INDEXED_WORDS option to see if the word is actually being indexed. First
try without any changes to default settings:
<td bgcolor="#eeeeee" width="1">
<pre> swish-e -i testdoc.html -T indexed_words | less</pre>
If that works then add in your configuration file:
<td bgcolor="#eeeeee" width="1">
<pre> swish-e -i testdoc.html -c swish.conf -T indexed_words | less</pre>
If that still isn't working as you expect, try to reduce the test document
to a very small example. This will be very helpful when asking for help.
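
A reduced test case can be as small as the sketch below. The file contents and the temporary directory are made up for illustration, and the -f switch (which names the index file; see SWISH-RUN) is used only to keep the test index out of your working directory.

```shell
# Build a throwaway test document in a temporary directory.
workdir=$(mktemp -d)
testdoc="$workdir/testdoc.html"
cat > "$testdoc" <<'EOF'
<html>
<head><title>Test document</title></head>
<body><p>apache</p></body>
</html>
EOF

# Index only that file and list the words Swish-e indexes.
if command -v swish-e >/dev/null 2>&1; then
    swish-e -i "$testdoc" -f "$workdir/index.swish-e" -T indexed_words
else
    echo "swish-e not found in PATH; skipping the indexing step" >&2
fi
```

A document this small is easy to paste into a message to the discussion list along with the exact command and its output.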

Another useful tool is the -H9 switch, which displays full headers in search
results. Look at the ``Parsed Words'' header to see what words swish-e is

[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]

<H2><A NAME="When_posting_please_provide_the_following_information_">When posting please provide the following information:</A></H2>
The exact version of Swish-e that you are using. Running Swish-e with the
<CODE>-V</CODE> switch will print the version number. Also, supply the output from
<CODE>uname -a</CODE> or similar command that identifies the operating system you are running on.
If you are running an old version of swish be prepared for a response to
your question of ``upgrade.''

A summary of the problem. This should include the commands issued (e.g. for
indexing or searching) and their output, and why you don't think it's
working correctly. Please cut-n-paste the exact commands and their output
instead of retyping to avoid errors.

Include a copy of the configuration file you are using, if any. Swish-e has
reasonable defaults so in many cases you can run it without using a
configuration file. But, if you need to use a configuration file,
<STRONG>reduce it down</STRONG> to the absolute minimum number of commands required to demonstrate your
problem. Again, cut-n-paste.

A small copy of a source document that demonstrates the problem.

If you are having problems spidering a web server, use lwp-download or wget
to copy the file locally to make sure you can index the document using the
file system method. This will help determine if the problem is with
spidering or with indexing.

If you expect help with spidering, don't post fake URLs, as it makes it
impossible to test. If you don't want to expose your web page to the people
on the Swish-e list, find some other site to test spidering on. If that
works, but you still cannot spider your own site then post your real URL if
you want help, or make a test document available via some other source.

If you are having trouble building Swish-e please cut-n-paste the output
from make (or from ./configure if that's where the problem is).
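
The version and operating system details listed above can be collected in one step with something like the following sketch. The report file is a temporary file chosen for illustration; paste its contents into your message.

```shell
# Collect the version and platform details requested above.
report=$(mktemp)
{
    echo "== Swish-e version (swish-e -V) =="
    if command -v swish-e >/dev/null 2>&1; then
        swish-e -V
    else
        echo "swish-e not found in PATH"
    fi
    echo "== Operating system (uname -a) =="
    uname -a
} > "$report" 2>&1
cat "$report"
```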

The key is to provide enough information so that others may reproduce the

[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]

<H1><A NAME="ADDITIONAL_INSTALLATION_OPTIONS">ADDITIONAL INSTALLATION OPTIONS</A></H1>
These steps are not required for normal use of Swish-e.

[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]

<H2><A NAME="The_SWISH_API_Perl_Module">The SWISH::API Perl Module</A></H2>
The Swish-e distribution includes a module that provides a Perl interface
to the Swish-e C library. This module provides a way to search a Swish-e
index without running the swish-e program. Searching an index will be many
times faster when running under a persistent environment such as
Apache/mod_perl with the SWISH::API module.

See the <EM>perl/README</EM> file for information on installing and using the SWISH::API Perl module.

[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]

<H2><A NAME="Creating_PDF_and_Postscript_documentation">Creating PDF and Postscript documentation</A></H2>
The Swish-e documentation in HTML format was created with Pod::HtmlPsPdf, a
package of Perl modules written and/or modified by Stas Bekman to automate
the conversion of documents in pod format (see perldoc perlpod) to HTML,
Postscript, and PDF. A slightly modified version of this package is
included with the Swish-e distribution and used for building the HTML.

If your system has the <STRONG>necessary tools</STRONG> to build Postscript and the converter ps2pdf installed, you may be able to
build the Postscript and PDF versions of the documentation. After you have
run ./configure, type from the <EM>doc</EM> directory of the distribution:
<td bgcolor="#eeeeee" width="1">
<pre> make pdf</pre>
And with any luck you will end up with these two files in the top-level
<td bgcolor="#eeeeee" width="1">
<pre> swish-e_documentation.pdf
swish-e_documentation.ps</pre>
Most people find reading the documentation in HTML most convenient.

[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]

<H1><A NAME="GENERAL_CONFIGURATION_AND_USAGE">GENERAL CONFIGURATION AND USAGE</A></H1>
This section should give you a basic overview of indexing and searching
with <STRONG>Swish-e</STRONG>. Other examples can be found in the <EM>conf</EM> directory which will step you through a number of different configurations.
Also, please review the <A HREF="././SWISH-FAQ.html">SWISH-FAQ</A>.

Swish-e is a command line program. The program is controlled by passing
switches on the command line. A configuration file may be used, but often
is not required. Swish-e does not include a graphical user interface. There
are example CGI scripts provided in the distribution, but they require
additional setup to use.

[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]

<H2><A NAME="Introduction_to_Indexing_and_Searching">Introduction to Indexing and Searching</A></H2>
Swish-e can index files on the local file system. For example, running:
<td bgcolor="#eeeeee" width="1">
<pre> swish-e -i /var/www/htdocs</pre>
will index <EM>all</EM> files in the /var/www/htdocs directory. You may specify one or more files
or directories with the -i option. By default this will create an index
(which is made up of more than one file) in the current directory called <EM>index.swish-e</EM>.

Then to search the resulting index for a given word:
<td bgcolor="#eeeeee" width="1">
<pre> swish-e -w apache</pre>
This will find the word ``apache'' in the body or title of the indexed

As mentioned above, Swish-e will index all files in a directory unless
instructed otherwise. So if /var/www/htdocs contains non-HTML files, you will
need a configuration file to limit the files that Swish-e indexes. Create a
file called ``swish.conf'':
<td bgcolor="#eeeeee" width="1">
<pre> # Example configuration file</pre>
<td bgcolor="#eeeeee" width="1">
<pre> # Tell swish what to index (same as -i switch above)
IndexDir /var/www/htdocs</pre>
<td bgcolor="#eeeeee" width="1">
<pre> # Only index HTML and text files
IndexOnly .htm .html .txt</pre>
<td bgcolor="#eeeeee" width="1">
<pre> # Tell swish that .txt files are to use the text parser.
IndexContents TXT* .txt</pre>
<td bgcolor="#eeeeee" width="1">
<pre> # Otherwise, use the HTML parser
DefaultContents HTML*</pre>
Save that as ``swish.conf'' and reindex:
<td bgcolor="#eeeeee" width="1">
<pre> swish-e -c swish.conf</pre>
The Swish-e configuration settings are described in the
<A HREF="././SWISH-CONFIG.html">SWISH-CONFIG</A> manual page. Order of statements in the configuration file is typically not
important, although some statements depend on previously set statements.
There are many possible settings. Good advice is to use as few settings as
possible when first starting out with Swish-e.
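
The configuration fragments above can also be written to a file and tried from the shell in one go, roughly as sketched here. The temporary directory and the -f switch (which names the index file; see SWISH-RUN) are choices made for this illustration, and the indexing step runs only if swish-e and /var/www/htdocs are both present.

```shell
# Write the example configuration to a scratch directory.
workdir=$(mktemp -d)
conf="$workdir/swish.conf"
cat > "$conf" <<'EOF'
# Example configuration file
IndexDir /var/www/htdocs
IndexOnly .htm .html .txt
IndexContents TXT* .txt
DefaultContents HTML*
EOF

# Index with the configuration file, then search the new index.
if command -v swish-e >/dev/null 2>&1 && [ -d /var/www/htdocs ]; then
    swish-e -c "$conf" -f "$workdir/index.swish-e"
    swish-e -w apache -f "$workdir/index.swish-e"
else
    echo "swish-e or /var/www/htdocs not available; wrote $conf only" >&2
fi
```

Starting from a file this short makes it easier to add one setting at a time and see its effect.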

The runtime options (switches) are described in the <A HREF="././SWISH-RUN.html">SWISH-RUN</A>
manual page. You may also see a summary of options by running:
<td bgcolor="#eeeeee" width="1">
<pre> swish-e -h</pre>
Swish-e has two other methods of reading input. One method uses a Perl
helper script and the LWP Perl library to spider remote web sites:
<td bgcolor="#eeeeee" width="1">
<pre> swish-e -S http -i <A HREF="http://localhost/index.html">http://localhost/index.html</A> -v2</pre>
This will spider the web server running on the local host. The <CODE>-S</CODE> option defines the input source method to be ``http'', <CODE>-i</CODE> specifies the URL to spider, and <CODE>-v</CODE> sets the verbose level to two. There are a number of configuration options
specific to the <CODE>-S</CODE>
http input source. See <A HREF="././SWISH-CONFIG.html">SWISH-CONFIG</A>. Note that only files of Content-Type text/* will be indexed.

The <CODE>-S http</CODE> method is deprecated in favor of the next input method.

The other method is a general purpose input method where Swish-e reads
input from a program that produces documents in a special format. The
program might read and format data stored in a database, or parse and
format messages in a mailing list archive, or run a program that spiders
web sites like the previous method.

The Swish-e distribution includes a spider program that uses this method of
input. This spider program is much more configurable and feature-rich than
the previous -S http method.

To duplicate the previous example create a configuration file called
<td bgcolor="#eeeeee" width="1">
<pre> # Example for spidering
# Use the "spider.pl" program included with Swish-e
IndexDir spider.pl</pre>
<td bgcolor="#eeeeee" width="1">
<pre> # Define what site to index
SwishProgParameters default <A HREF="http://localhost/index.html">http://localhost/index.html</A></pre>
Then create the index using the command:
<td bgcolor="#eeeeee" width="1">
<pre> swish-e -S prog -c swish2.conf</pre>
This says to use the <CODE>-S prog</CODE> input source method. Note that in this case the IndexDir setting does not
list a file or directory to index, but names a program to run. This program,
spider.pl, does the work of fetching the documents from the web server and
passing them to Swish-e for indexing.

The SwishProgParameters option is a special feature that allows passing
command line parameters to the program specified with IndexDir. In this
case it passes the word ``default'', which tells spider.pl to use default
settings, and the URL to spider.
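
Put together, the spidering setup might be scripted as in the sketch below. The RUN_SPIDER variable is a hypothetical guard invented for this example so that no web server is contacted unless you ask for it; the temporary directory and the -f switch (index file name) are likewise illustrative choices.

```shell
# Write the spidering configuration to a scratch directory.
workdir=$(mktemp -d)
conf="$workdir/swish2.conf"
cat > "$conf" <<'EOF'
# Use the "spider.pl" program included with Swish-e
IndexDir spider.pl
# Pass "default" (use spider.pl's default settings) and the URL to spider
SwishProgParameters default http://localhost/index.html
EOF

# RUN_SPIDER is a hypothetical switch for this sketch, not a Swish-e setting.
if [ "${RUN_SPIDER:-no}" = "yes" ] && command -v swish-e >/dev/null 2>&1; then
    swish-e -S prog -c "$conf" -f "$workdir/index.swish-e"
else
    echo "Set RUN_SPIDER=yes (with swish-e in PATH) to spider using $conf" >&2
fi
```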

Running a script under Windows requires specifying the interpreter (e.g.
perl.exe) and then using SwishProgParameters to specify the script and the
script's parameters. See <EM>Notes when using -S prog on MS Windows</EM> on the
<A HREF="././SWISH-RUN.html">SWISH-RUN</A> page.

The advantage of the <CODE>-S prog</CODE> method of spidering (over the previous <CODE>-S http</CODE> method) is that the Perl code is only compiled once instead of for every
document fetched from the web server. In addition it is a much more
advanced spider with many, many features. Still, as used here, spider.pl
will automatically index PDF or MS Word documents if (when) Xpdf and Catdoc

A special form of the <CODE>-S prog</CODE> input source method is:
<td bgcolor="#eeeeee" width="1">
<pre> ./myprog --option | swish-e -S prog -i stdin -c config</pre>
This allows running Swish-e from a program (instead of running the external
program from Swish-e). Thus, this also can be done:
<td bgcolor="#eeeeee" width="1">
<pre> ./myprog --option > outfile
swish-e -S prog -i stdin -c config < outfile</pre>
<td bgcolor="#eeeeee" width="1">
<pre> ./myprog --option > outfile
cat outfile | swish-e -S prog -i stdin -c config</pre>
One final note about the <CODE>-S prog</CODE> input source method. The program specified with -i or IndexDir needs to be
an absolute path. The exception is when the program is installed in the
``libexecdir'' directory and then a plain program name may be specified (as
in the example showing spider.pl above).
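
To make the idea of a -S prog program more concrete, here is a small sketch of a shell function standing in for ``myprog''. The special format itself is defined on the SWISH-RUN page; this sketch assumes a Path-Name header and a Content-Length header (the body size in bytes), then a blank line, then the document body. Treat those header names as an assumption to verify against SWISH-RUN for your version. The document paths and contents are invented.

```shell
# emit_doc PATH BODY
# Print one document in the header-plus-body layout assumed above.
emit_doc() {
    doc_path=$1
    doc_body=$2
    # Content-Length is the size of the body in bytes.
    doc_length=$(printf '%s' "$doc_body" | wc -c | tr -d ' ')
    printf 'Path-Name: %s\n' "$doc_path"
    printf 'Content-Length: %s\n' "$doc_length"
    printf '\n'
    printf '%s' "$doc_body"
}

workdir=$(mktemp -d)
outfile="$workdir/outfile"
{
    emit_doc "doc1.html" "<html><head><title>One</title></head><body>apache</body></html>"
    emit_doc "doc2.html" "<html><head><title>Two</title></head><body>swish</body></html>"
} > "$outfile"

# Feed the generated stream to Swish-e on standard input.
if command -v swish-e >/dev/null 2>&1; then
    swish-e -S prog -i stdin -f "$workdir/index.swish-e" < "$outfile"
else
    echo "swish-e not found in PATH; wrote $outfile only" >&2
fi
```

A real program would typically read its documents from a database or an archive rather than from literal strings.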

All three input source methods are described in more detail on the <A HREF="././SWISH-RUN.html">SWISH-RUN</A> page.

[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]

<H2><A NAME="Metanames_and_Properties">Metanames and Properties</A></H2>
There are two key Swish-e concepts that you need to be familiar with:
Metanames and Properties.

<P><LI><STRONG><A NAME="item_Metanames">Metanames</A></STRONG>
Swish-e creates a reverse index. Just like an index in a book, you look up
a word and it lists the pages (or documents) where that word can be found.

Swish-e can create multiple index tables within the same index file. For
example, you might want to create an index of just words in HTML titles so
searches can be limited to just titles. Or you might have descriptive words
in a meta tag called ``keywords'' you would like to search.

Some database systems might call these different ``fields'' or ``columns'',
but swish-e calls them <EM>MetaNames</EM> (as a result of first indexing HTML meta tags).

To find documents with ``foo'' in their title you might run:
<td bgcolor="#eeeeee" width="1">
<pre> swish-e -w swishtitle=foo</pre>
<td bgcolor="#eeeeee" width="1">
<pre> swish-e -w 'swishtitle=(foo or bar) or swishdefault=(baz)'</pre>
The Metaname ``swishdefault'' is the name used by Swish-e if no other name
is specified. The following two searches are the same:
<td bgcolor="#eeeeee" width="1">
<pre> swish-e -w foo
swish-e -w swishdefault=foo</pre>
When indexing HTML documents Swish-e indexes words in the body and title
under the Metaname ``swishdefault''.
1497
<P><LI><STRONG><A NAME="item_Properties">Properties</A></STRONG>
1499
Swish-e search results are a list of files (internally, Swish-e uses
file numbers). Data can be associated with each file number when indexing.
For example, by default Swish-e associates the file's name, title, last
modified date, and size with the file number, and these items can be printed
in search results. In Swish-e this associated data is called a file's
<EM>Properties</EM>. Properties can be any data you wish to associate with a document; even
the entire text of the document can be stored in the index. What data is
stored as a Property is controlled by the <EM>PropertyNames</EM> configuration directive (among others).
1509
What properties are printed with search results depends on the -x or -p
1510
switches. By default Swish-e returns the rank, path/URL, title and file
1511
size in bytes for each result.
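The difference between a Metaname and a Property can be sketched with a short script. This is only an illustration: the META tag name <CODE>subjects</CODE>, the file names, and the temporary directory are assumptions, and the script merely writes a configuration file unless swish-e is found on the PATH.

```shell
#!/bin/sh
# Sketch: make a hypothetical "subjects" META tag both searchable (MetaNames)
# and printable with search results (PropertyNames).
workdir=$(mktemp -d)
conf="$workdir/props.conf"
index="$workdir/props.index"

# Write the configuration one directive per line.
printf '%s\n' \
    'IndexDir .' \
    'IndexOnly .html' \
    "IndexFile $index" \
    'MetaNames subjects' \
    'PropertyNames subjects' \
    > "$conf"

if command -v swish-e >/dev/null 2>&1; then
    # Index, then search within "subjects" and print the stored property.
    swish-e -c "$conf"
    swish-e -f "$index" -w 'subjects=foo' -p subjects
else
    echo "swish-e not found; configuration written to $conf"
fi
```

Listing the same name under both directives is a choice made for this sketch; a name may be a Metaname without being a Property, and the reverse.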
1515
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
1517
<H2><A NAME="Getting_Started_With_Swish_e">Getting Started With Swish-e</A></H2>
1519
Swish-e reads a configuration file (see <A HREF="././SWISH-CONFIG.html">SWISH-CONFIG</A>) for directives that control what and how Swish-e indexes files. Swish-e
1520
is also controlled by command line arguments (see
1521
<A HREF="././SWISH-RUN.html">SWISH-RUN</A>). Many of the command line arguments have equivalent configuration
1522
directives (e.g. -i and IndexDir).
1525
Swish-e does not require a configuration file, but most people need to
1526
change the default behavior by placing settings in a configuration file.
1529
To try the examples below you may change to the <EM>tests</EM> subdirectory of the distribution. The tests use the *.html files in
this directory when creating the test index. You may wish to review these
*.html files to get an idea of the various native file formats that Swish-e
supports.
1535
You may also use your own test documents. It's recommended to use small
1536
test documents when first using Swish-e.
1539
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
1541
<H2><A NAME="Step_1_Create_a_Configuration_File">Step 1: Create a Configuration File</A></H2>
1543
The configuration file controls what and how Swish-e indexes. The
1544
configuration file consists of directives, comments, and blank lines. The
1545
configuration file can be any name you like.
1548
This example will work with the documents in the <EM>tests</EM> directory. You may wish to review the <EM>tests/test.config</EM> configuration file used for the <CODE>make test</CODE> tests.
1551
For example, a simple configuration file (<EM>swish-e.conf</EM>):
1558
<td bgcolor="#eeeeee" width="1">
1563
<pre> # Example Swish-e Configuration file</pre>
1573
<td bgcolor="#eeeeee" width="1">
1578
<pre> # Define *what* to index
1579
# IndexDir can point to directories and/or files
1580
# Here it's pointing to the current directory
1581
# Swish-e will also recurse into sub-directories.
1592
<td bgcolor="#eeeeee" width="1">
1597
<pre> # But only index the .html files
1598
IndexOnly .html</pre>
1608
<td bgcolor="#eeeeee" width="1">
1613
<pre> # Show basic info while indexing
1620
And that's a simple configuration file. It says to index all the .html
1621
files in the current directory and sub-directories, if any, and provide
1622
some basic output while indexing.
1625
As mentioned above, the complete list of configuration file directives
is described in <A HREF="././SWISH-CONFIG.html">SWISH-CONFIG</A>.
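The fragments in this step can be assembled into one file. The script below is a sketch that writes them to <EM>swish-e.conf</EM> in the current directory; the value <CODE>IndexReport 1</CODE> for the "basic info" setting is an assumption and should be checked against SWISH-CONFIG.

```shell
#!/bin/sh
# Write the example configuration to swish-e.conf (or to the file named
# by the first argument, if one is given).
conf=${1:-swish-e.conf}

printf '%s\n' \
    '# Example Swish-e Configuration file' \
    '' \
    '# Define *what* to index: the current directory, recursively' \
    'IndexDir .' \
    '' \
    '# But only index the .html files' \
    'IndexOnly .html' \
    '' \
    '# Show basic info while indexing (assumed value)' \
    'IndexReport 1' \
    > "$conf"

# Print the result so it can be reviewed before indexing.
cat "$conf"
```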
1629
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
1631
<H2><A NAME="Step_2_Index_your_Files">Step 2: Index your Files</A></H2>
1633
Run Swish-e using the <CODE>-c</CODE> switch to specify the name of the configuration file.
1640
<td bgcolor="#eeeeee" width="1">
1645
<pre> swish-e -c swish-e.conf</pre>
1655
<td bgcolor="#eeeeee" width="1">
1660
<pre> Indexing Data Source: "File-System"
1661
Indexing "."
1662
Removing very common words...
1664
Writing main index...
1666
Sorting 55 words alphabetically
1668
Writing index entries ...
1669
Writing word text: Complete
1670
Writing word hash: Complete
1671
Writing word data: Complete
1672
55 unique words indexed.
1673
4 properties sorted.
1674
5 files indexed. 1252 total bytes. 140 total words.
1675
Elapsed time: 00:00:00 CPU time: 00:00:00
1676
Indexing done!</pre>
1682
This created the index file <EM>index.swish-e</EM>. This is the default index file name unless the <STRONG>IndexFile</STRONG> directive is specified in the configuration file:
1689
<td bgcolor="#eeeeee" width="1">
1694
<pre> IndexFile ./website.index</pre>
1700
You may use the -f switch to specify an index file at indexing time. The -f
option overrides an IndexFile setting in the configuration file.
1704
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
1706
<H2><A NAME="Step_3_Search">Step 3: Search</A></H2>
1708
You specify your search terms with the <CODE>-w</CODE> switch. For example, to find the files that contain the word <STRONG>sample</STRONG> you would issue the command:
1715
<td bgcolor="#eeeeee" width="1">
1720
<pre> swish-e -w sample</pre>
1726
This example assumes that you are in the <EM>tests</EM> directory. Swish-e responds to that command with the following:
1733
<td bgcolor="#eeeeee" width="1">
1738
<pre> swish-e -w sample</pre>
1748
<td bgcolor="#eeeeee" width="1">
1753
<pre> # SWISH format: 2.4.0
1754
# Search words: sample
1756
# Search time: 0.000 seconds
1757
# Run time: 0.005 seconds
1758
1000 ./test_xml.html "If you are seeing this, the METATAG XML search was successful!" 159
1759
1000 ./test.html "If you are seeing this, the test was successful!" 437
1766
So the word <STRONG>sample</STRONG> was found in two documents. The first number shown is the relevance or rank
1767
of the search term, followed by the file containing the search term, the
1768
title of the document, and finally the length of the document.
1771
The period (``.'') alone at the end marks the end of results.
1774
Much more information may be retrieved while searching by using the <CODE>-x</CODE> and <CODE>-H</CODE> switches (see <A HREF="././SWISH-RUN.html">SWISH-RUN</A>) and by using Document Properties (see <A HREF="././SWISH-CONFIG.html">SWISH-CONFIG</A>).
1777
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
1779
<H2><A NAME="Phrase_Searching">Phrase Searching</A></H2>
1781
To search for a phrase in a document use double-quotes to delimit your
1782
search terms. (The default phrase delimiter is set in src/swish.h.)
1785
You must protect the quotes from the shell.
1788
For example, under Unix:
1795
<td bgcolor="#eeeeee" width="1">
1800
<pre> swish-e -w '"this is a phrase" or (this and that)'
1801
swish-e -w 'meta1=("this is a phrase") or (this and that)'</pre>
1807
Or under the Windows <EM>command.com</EM> shell:
1814
<td bgcolor="#eeeeee" width="1">
1819
<pre> swish-e -w \"this is a phrase\" or (this and that)</pre>
1825
The phrase delimiter can be set with the <CODE>-P</CODE> switch.
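One way to keep the quoting manageable in a Unix script is to build the query in a variable first. The sketch below prints the exact string that would be handed to <CODE>-w</CODE>; it only calls swish-e if the binary is found on the PATH, and it assumes an index already exists in the current directory.

```shell
#!/bin/sh
# Single quotes keep the double quotes and parentheses away from the shell.
query='"this is a phrase" or (this and that)'

# Show exactly what swish-e would receive as the -w argument.
printf 'query: %s\n' "$query"

if command -v swish-e >/dev/null 2>&1; then
    # Double quotes around the variable pass the query as one argument.
    swish-e -w "$query"
fi
```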
1828
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
1830
<H2><A NAME="Boolean_Searching">Boolean Searching</A></H2>
1832
You can use the Boolean operators <STRONG>and</STRONG>, <STRONG>or</STRONG>, or <STRONG>not</STRONG> in searching. Without these Boolean operators, Swish-e will assume you're <STRONG>and</STRONG>ing the words together.
1835
Here are some examples:
1842
<td bgcolor="#eeeeee" width="1">
1847
<pre> swish-e -w 'apples oranges'
1848
swish-e -w 'apples and oranges' ( Same thing )</pre>
1858
<td bgcolor="#eeeeee" width="1">
1863
<pre> swish-e -w 'apples or oranges'</pre>
1873
<td bgcolor="#eeeeee" width="1">
1878
<pre> swish-e -w 'apples or oranges not juice' -f myIndex </pre>
1884
retrieves first the files that contain either of the words ``apples'' or
``oranges''; then among those the ones that do not contain the word
``juice''.
1889
A few others to ponder:
1896
<td bgcolor="#eeeeee" width="1">
1901
<pre> swish-e -w 'apples and oranges or pears'
1902
swish-e -w '(apples and oranges) or pears' ( Same thing )
1903
swish-e -w 'apples and (oranges or pears)' ( Not the same thing )</pre>
1909
Swish processes the query left to right.
1912
See <A HREF="././SWISH-SEARCH.html">SWISH-SEARCH</A> for more information.
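The left-to-right rule can be made visible by adding the implied parentheses. The helper below is hypothetical and not part of Swish-e; it is a sketch that assumes a flat query of words joined by <STRONG>and</STRONG> or <STRONG>or</STRONG>, with no <STRONG>not</STRONG> and no existing parentheses.

```shell
#!/bin/sh
# group_left_to_right: hypothetical helper that shows how a flat query of
# the form "word op word op word ..." groups when read left to right.
group_left_to_right() {
    result=$1
    shift
    # Consume one operator and one word per pass, wrapping what came before.
    while [ "$#" -ge 2 ]; do
        result="($result $1 $2)"
        shift 2
    done
    printf '%s\n' "$result"
}

grouped=$(group_left_to_right apples and oranges or pears)
echo "$grouped"
```

The grouped form can be passed to <CODE>-w</CODE> inside single quotes when the intended grouping should be stated explicitly rather than left to evaluation order.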
1915
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
1917
<H2><A NAME="Context_Searching">Context Searching</A></H2>
1919
The <CODE>-t</CODE> option in the search command line allows you to search for words that exist
1920
only in specific HTML tags. Each character in the string you specify in the
1921
argument to this option represents a different tag in which the word is
1922
searched; that is you can use any combinations of the following characters:
1929
<td bgcolor="#eeeeee" width="1">
1934
<pre> H search in all &lt;HEAD&gt; tags
B search in the &lt;BODY&gt; tags
t search in &lt;TITLE&gt; tags
h is &lt;H1&gt; to &lt;H6&gt; (header) tags
e is emphasized tags (this may be &lt;B&gt;, &lt;I&gt;, &lt;EM&gt;, or &lt;STRONG&gt;)
c is HTML comment tags (&lt;!-- ... --&gt;)</pre>
1952
<td bgcolor="#eeeeee" width="1">
1957
<pre> # Find only documents with the word "linux" in the &lt;TITLE&gt; tags.
1958
swish-e -w linux -t t</pre>
1968
<td bgcolor="#eeeeee" width="1">
1973
<pre> # Find the word "apple" in titles or comments
1974
swish-e -w apple -t tc</pre>
1980
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
1982
<H2><A NAME="META_Tags">META Tags</A></H2>
1984
As mentioned above, Metanames are a way to define ``fields'' in your
1985
documents. You can use the Metanames in your queries to limit the search to
1986
just the words contained in that META name of your document. For example,
1987
you might have a META tagged field in your documents called
1988
<CODE>subjects</CODE> and then you can search your documents for the word ``foo'' but only return
1989
documents where ``foo'' is within the <CODE>subjects</CODE> META tag.
1992
Document <EM>Properties</EM> are somewhat related: Properties allow the content of a META tag in a
1993
source document to be stored within the index, and that text to be returned
1994
along with search results.
1997
META tags can have two formats in your documents.
2004
<td bgcolor="#eeeeee" width="1">
2009
<pre> &lt;META NAME="keyName" CONTENT="some Content"&gt;</pre>
2022
<td bgcolor="#eeeeee" width="1">
2027
<pre> &lt;keyName&gt;
&lt;/keyName&gt;</pre>
2035
If using libxml, you can optionally use a non-html tag as a metaname:
2042
<td bgcolor="#eeeeee" width="1">
2059
This, of course, is invalid HTML.
2062
To continue with our sample <EM>swish-e.conf</EM> file, add the following lines:
2069
<td bgcolor="#eeeeee" width="1">
2074
<pre> # Define META tags
2075
MetaNames meta1 meta2 meta3</pre>
2081
Reindex to include the changes:
2088
<td bgcolor="#eeeeee" width="1">
2093
<pre> swish-e -c swish-e.conf</pre>
2099
Now search, but this time limit your search to META tag ``meta1'':
2106
<td bgcolor="#eeeeee" width="1">
2111
<pre> swish-e -w 'meta1=metatest1'</pre>
2117
Again, please see <A HREF="././SWISH-RUN.html">SWISH-RUN</A> and <A HREF="././SWISH-CONFIG.html">SWISH-CONFIG</A>
2118
for complete documentation of the various indexing and searching options.
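The META tag steps above can be tried end to end on a throwaway document. The script is a sketch: the sample HTML file and the temporary directory are assumptions made for illustration (the <EM>tests</EM> directory ships its own documents), and indexing is skipped when swish-e is not on the PATH.

```shell
#!/bin/sh
# Build a one-document test site with a "meta1" META tag, then index it.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample document carrying the META tag to be searched.
printf '%s\n' \
    '<html><head><title>Meta test</title>' \
    '<meta name="meta1" content="metatest1">' \
    '</head><body>Sample body text.</body></html>' \
    > sample.html

# Configuration: index .html files here and declare the META names.
printf '%s\n' \
    'IndexDir .' \
    'IndexOnly .html' \
    'MetaNames meta1 meta2 meta3' \
    > swish-e.conf

if command -v swish-e >/dev/null 2>&1; then
    swish-e -c swish-e.conf
    swish-e -w 'meta1=metatest1'
else
    echo "swish-e not found; files written to $workdir"
fi
```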
2121
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
2123
<H2><A NAME="Spidering_and_Searching_with_a_Web_form_">Spidering and Searching with a Web form.</A></H2>
2125
This example demonstrates how to spider a web site and set up the included
2126
CGI script to provide a web-based search page. This example uses Perl
2127
programs included in the Swish-e distribution: <EM>spider.pl</EM> will be used for reading files from the web server, and <EM>swish.cgi</EM> will provide the web search form and display results.
2130
As an example we will index the Apache Web Server documentation installed
2131
on the local computer at <A
2132
HREF="http://localhost/apache_docs/index.html">http://localhost/apache_docs/index.html</A>
2136
<P><LI><STRONG><A NAME="item_Make_a_Working_Directory">Make a Working Directory</A></STRONG>
2138
Create a directory to store the Swish-e configuration and the Swish-e index.
2146
<td bgcolor="#eeeeee" width="1">
2151
<pre> ~$ mkdir web_index
2158
<P><LI><STRONG><A NAME="item_Create_a_Swish_e_Configuration_file">Create a Swish-e Configuration file</A></STRONG>
2164
<td bgcolor="#eeeeee" width="1">
2169
<pre> ~/web_index$ cat swish.conf
2170
# Swish-e config to index the Apache documentation
2172
# Use spider.pl for indexing (location of spider.pl set at installation time)
2173
IndexDir spider.pl</pre>
2183
<td bgcolor="#eeeeee" width="1">
2188
<pre> # Use spider.pl's default configuration and specify the URL to spider
2189
SwishProgParameters default <A HREF="http://localhost/apache_docs/index.html">http://localhost/apache_docs/index.html</A></pre>
2199
<td bgcolor="#eeeeee" width="1">
2204
<pre> # Allow extra searching by title, path
2205
Metanames swishtitle swishdocpath</pre>
2215
<td bgcolor="#eeeeee" width="1">
2220
<pre> # Set StoreDescription for each parser
2221
# to display context with search results
2222
StoreDescription TXT* 10000
2223
StoreDescription HTML* &lt;body&gt; 10000</pre>
2228
<P><LI><STRONG><A NAME="item_Generate_the_Index">Generate the Index</A></STRONG>
2230
Now run swish-e to create the index:
2237
<td bgcolor="#eeeeee" width="1">
2242
<pre> ~/web_index$ swish-e -S prog -c swish.conf </pre>
2252
<td bgcolor="#eeeeee" width="1">
2257
<pre> Indexing Data Source: "External-Program"
2258
Indexing "spider.pl"
2259
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'</pre>
2269
<td bgcolor="#eeeeee" width="1">
2274
<pre> Summary for: <A HREF="http://localhost/apache_docs/index.html">http://localhost/apache_docs/index.html</A>
2275
Duplicates: 4,188 (349.0/sec)
2276
Off-site links: 276 (23.0/sec)
2277
Skipped: 1 (0.1/sec)
2278
Total Bytes: 2,090,125 (174177.1/sec)
2279
Total Docs: 147 (12.2/sec)
2280
Unique URLs: 149 (12.4/sec)
2281
Removing very common words...
2283
Writing main index...
2285
Sorting 7736 words alphabetically
2287
Writing index entries ...
2288
Writing word text: Complete
2289
Writing word hash: Complete
2290
Writing word data: Complete
2291
7736 unique words indexed.
2292
5 properties sorted.
2293
147 files indexed. 2090125 total bytes. 200783 total words.
2294
Elapsed time: 00:00:13 CPU time: 00:00:02
2295
Indexing done!</pre>
2301
The above output is actually a mix of output from both swish-e and from
2302
spider.pl. Spider.pl reports the ``Summary for: <A
2303
HREF="http://localhost/apache_docs/index.html">http://localhost/apache_docs/index.html</A>''.
2307
Also note that swish-e knows to find spider.pl at
2308
/usr/local/lib/swish-e/spider.pl. The script installation directory (called
2309
libexecdir) is set at configure time. You can see your setting by running
2317
<td bgcolor="#eeeeee" width="1">
2322
<pre> ~/web_index$ swish-e -h | grep libexecdir
2323
Scripts and Modules at: (libexecdir) = /usr/local/lib/swish-e</pre>
2329
This directory will be needed when setting up the CGI script in the next
2333
Finally, verify that the index can be searched from the command line:
2340
<td bgcolor="#eeeeee" width="1">
2345
<pre> ~/web_index$ swish-e -w installing -m3
2346
# SWISH format: 2.4.0
2347
# Search words: installing
2348
# Removed stopwords:
2349
# Number of hits: 17
2350
# Search time: 0.018 seconds
2351
# Run time: 0.050 seconds
2352
1000 <A HREF="http://localhost/apache_docs/install.html">http://localhost/apache_docs/install.html</A> "Compiling and Installing Apache" 17960
2353
718 <A HREF="http://localhost/apache_docs/install-tpf.html">http://localhost/apache_docs/install-tpf.html</A> "Installing Apache on TPF" 25734
2354
680 <A HREF="http://localhost/apache_docs/windows.html">http://localhost/apache_docs/windows.html</A> "Using Apache with Microsoft Windows" 27165
2361
Now try limiting the search to the title:
2368
<td bgcolor="#eeeeee" width="1">
2373
<pre> ~/web_index$ swish-e -w swishtitle=installing -m3
2374
# SWISH format: 2.3.5
2375
# Search words: swishtitle=installing
2376
# Removed stopwords:
2378
# Search time: 0.018 seconds
2379
# Run time: 0.048 seconds
2380
1000 <A HREF="http://localhost/apache_docs/install-tpf.html">http://localhost/apache_docs/install-tpf.html</A> "Installing Apache on TPF" 25734
2381
1000 <A HREF="http://localhost/apache_docs/install.html">http://localhost/apache_docs/install.html</A> "Compiling and Installing Apache" 17960
2388
Note that the above can also be done using the -t option:
2395
<td bgcolor="#eeeeee" width="1">
2400
<pre> ~/web_index$ swish-e -w installing -m3 -tH</pre>
2405
<P><LI><STRONG><A NAME="item_Setup_the_CGI_script">Setup the CGI script</A></STRONG>
2407
Swish-e does not include a web server, therefore you must use your locally
2408
installed web server. Apache is highly recommended, of course.
2411
Locate your web server's CGI directory. This may be a cgi-bin directory in
your home directory or a central cgi-bin directory set up by the web server
administrator. Once located, copy the swish.cgi script into the cgi-bin
directory.
2417
Where CGI scripts can be located depends completely on the web server used
2418
and how it is configured. See your web server's documentation or your
2419
site's administrator for additional information.
2422
This example will use a site cgi-bin directory located at /usr/lib/cgi-bin.
2423
Copy the swish.cgi script into the cgi-bin directory. Again, we will need
2424
the location of the libexecdir directory:
2431
<td bgcolor="#eeeeee" width="1">
2436
<pre> ~/web_index$ swish-e -h | grep libexecdir
2437
Scripts and Modules at: (libexecdir) = /usr/local/lib/swish-e</pre>
2447
<td bgcolor="#eeeeee" width="1">
2452
<pre> ~/web_index$ cd /usr/lib/cgi-bin
2453
/usr/lib/cgi-bin$ su
2455
/usr/lib/cgi-bin# cp /usr/local/lib/swish-e/swish.cgi .</pre>
2461
If your operating system supports symbolic links, AND your web server
2462
allows programs to be symbolic links, then you may wish to create a link to
2463
the swish.cgi program instead.
2470
<td bgcolor="#eeeeee" width="1">
2475
<pre> /usr/lib/cgi-bin# ln -s /usr/local/lib/swish-e/swish.cgi</pre>
2481
We need to tell the swish.cgi script where to look for the index created in
2482
the previous step. It's also recommended to enter the path to the swish-e
2483
binary, otherwise the swish.cgi script will look for the binary in the
2484
PATH, and that may change when running under the CGI environment.
2487
Here's the configuration file:
2494
<td bgcolor="#eeeeee" width="1">
2499
<pre> /usr/lib/cgi-bin# cat .swishcgi.conf
2501
title => 'Search Apache Documentation',
2502
swish_binary => '/usr/local/bin/swish-e',
2503
swish_index => '/home/moseley/web_index/index.swish-e',
2510
Now, test the script from the command line as a normal user:
2517
<td bgcolor="#eeeeee" width="1">
2522
<pre> /usr/lib/cgi-bin# exit
2533
<td bgcolor="#eeeeee" width="1">
2538
<pre> /usr/lib/cgi-bin$ ./swish.cgi | head
2539
Content-Type: text/html; charset=ISO-8859-1</pre>
2549
<td bgcolor="#eeeeee" width="1">
2554
<pre> &lt;!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"&gt;
2558
Search Apache Documentation
2567
Notice that the CGI script returns the HTTP header (Content-Type) and the
2568
body of the web page, just like a well-behaved CGI script should.
2571
Now test using the web server (this step depends on the location of your
2572
cgi-bin directory). This example uses the ``GET'' command that is part of
2573
the LWP Perl library, but any web browser can run this test.
2580
<td bgcolor="#eeeeee" width="1">
2585
<pre> /usr/lib/cgi-bin$ GET <A HREF="http://localhost/cgi-bin/swish.cgi">http://localhost/cgi-bin/swish.cgi</A> | head
2586
&lt;!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"&gt;
2590
Search Apache Documentation
2600
The script reports errors to stderr, so consult the web server's error log
2601
if problems occur. The message ``Service currently unavailable'' reported
2602
by running swish.cgi typically indicates a configuration error, and the
2603
exact problem will be listed in the web server's error log.
2606
Detailed instructions on using the <EM>swish.cgi</EM> script and debugging tips can be found by running:
2613
<td bgcolor="#eeeeee" width="1">
2618
<pre> $ perldoc swish.cgi</pre>
2624
while in the cgi-bin directory where swish.cgi was copied.
2627
The spider program <EM>spider.pl</EM> also has a large number of configuration options.
2630
Documentation is also available in the directory $prefix/share/doc/swish-e
2631
or at <A HREF="http://swish-e.org">http://swish-e.org</A>.
2634
Note: Also check out the search.cgi script found at the same location as
2635
the swish.cgi script. This is more of a skeleton script for those that want
2636
to create a custom search script.
2640
Now you are ready to search.
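The libexecdir lookup used in the steps above can also be scripted rather than read by eye. The helper name below is hypothetical, and the sketch assumes the help output keeps the ``(libexecdir) = '' format shown earlier; it is demonstrated on that sample line, and the real help output is only read if swish-e is found on the PATH.

```shell
#!/bin/sh
# parse_libexecdir: hypothetical helper that extracts the directory from
# the "(libexecdir) = ..." line printed by "swish-e -h".
parse_libexecdir() {
    sed -n 's/^.*(libexecdir) = //p'
}

# Demonstrate on the sample line shown in this document.
sample='Scripts and Modules at: (libexecdir) = /usr/local/lib/swish-e'
libexecdir=$(printf '%s\n' "$sample" | parse_libexecdir)
echo "sample swish.cgi location: $libexecdir/swish.cgi"

# With swish-e installed, read the real setting instead.
if command -v swish-e >/dev/null 2>&1; then
    real_libexecdir=$(swish-e -h | parse_libexecdir)
    echo "installed swish.cgi location: $real_libexecdir/swish.cgi"
fi
```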
2643
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
2645
<H1><A NAME="Indexing_Other_Types_of_Documents_Filtering">Indexing Other Types of Documents - Filtering</A></H1>
2647
Swish-e can only index HTML, XML and text documents. In order to index
2648
other documents such as PDF or MS Word documents you must use a utility to
2649
convert or ``filter'' those documents.
2652
How documents are filtered with Swish-e has changed over time. This has
2653
resulted in a bit of confusion. It's also a somewhat complex process as
2654
different programs need to communicate with each other.
2657
You may wish to read the Swish-e FAQ question on filtering before
2659
<A HREF="././SWISH-FAQ.html#How_Do_I_filter_documents_">How Do I filter documents?</A>
2664
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
2666
<H2><A NAME="Filtering_Overview">Filtering Overview</A></H2>
2668
There are two ways to filter documents with Swish-e. Both are described in
the <A HREF="././SWISH-CONFIG.html">SWISH-CONFIG</A> man page: the FileFilter directive and the SWISH::Filter
Perl module.
2673
The FileFilter directive is a general purpose method of filtering. It
allows running an external program for each document processed (based on
file extension). The external programs open an input file,
convert as needed, and write their output to standard output.
2679
Previous versions of Swish-e (before 2.4.0) used a collection of filter
programs for converting files such as PDF or MS Word documents. The
external programs call other programs to do the work of filtering (e.g.
pdftotext to extract the contents from PDF files). Although these filter
programs are still included with the Swish-e distribution as examples, it
is recommended to use the SWISH::Filter method instead.
2687
One disadvantage of using FileFilter is that the filter is called once for
2688
every document that needs to be filtered. This can slow down the indexing process.
2692
The SWISH::Filter Perl module works very much like the old system and uses
2693
the same helper programs. But, it provides a single interface for filtering
2694
all types of documents. The advantage of SWISH::Filter is that it is built
2695
into the program used for spidering web sites (spider.pl), so all that's
2696
required is installing the filter programs that do the actual work of
2697
filtering (e.g. catdoc, xpdf). (The Windows binary includes some of the
2701
But, Swish-e will not use SWISH::Filter by default when using the file
2702
system method of indexing. To use SWISH::Filter when indexing by file
2703
system method (-S fs) you can use a FileFilter directive with the
2704
``swish_filter.pl'' filter (which is just a program that uses
2705
SWISH::Filter), or use the -S prog method of indexing and use the
2706
DirTree.pl program for fetching documents. DirTree.pl is included with the
2707
Swish-e distribution and is designed to work with SWISH::Filter. Using
DirTree.pl will likely be a faster way to index since the SWISH::Filter set
of modules does not need to be compiled for every document that needs to be
filtered.
2713
See the contents of swish_filter.pl and DirTree.pl for specifics on their
2717
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
2719
<H2><A NAME="Filtering_Examples">Filtering Examples</A></H2>
2721
The FileFilter directive can be used in your config file to convert
2722
documents based on their extension. This is the old way of filtering, but
2723
provides an easy way to add filters to swish-e.
2733
<td bgcolor="#eeeeee" width="1">
2738
<pre> FileFilter .pdf pdftotext "'%p' -"
2739
IndexContents TXT* .pdf</pre>
2745
will cause all .pdf files to be filtered through the pdftotext program
2746
(part of the Xpdf package), and to parse the resulting output from
2747
pdftotext with the text (``TXT'') parser.
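Because FileFilter depends on an external helper, it can be worth checking for the helper before indexing. The sketch below checks for pdftotext and writes the two directives from the example into a configuration file; the file location and the <CODE>IndexDir</CODE> and <CODE>IndexOnly</CODE> lines are assumptions added to make the file usable on its own.

```shell
#!/bin/sh
# Check for the pdftotext helper, then write the filter configuration.
conf=${TMPDIR:-/tmp}/pdf-filter.conf

if command -v pdftotext >/dev/null 2>&1; then
    helper_status=found
else
    helper_status=missing
fi
echo "pdftotext: $helper_status"

# The FileFilter line keeps its inner quoting: '%p' stands for the file path.
printf '%s\n' \
    'IndexDir .' \
    'IndexOnly .pdf' \
    "FileFilter .pdf pdftotext \"'%p' -\"" \
    'IndexContents TXT* .pdf' \
    > "$conf"

echo "configuration written to $conf"
```

If the helper is reported missing, installing the Xpdf package before running swish-e with this configuration avoids a filter failure on the first PDF file.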
2750
The other way to filter documents is to use a -S prog program and convert
2751
the documents before passing them onto Swish-e.
2754
For example, spider.pl makes use of the Perl module included with the
2755
Swish-e distribution called SWISH::Filter. SWISH::Filter is passed a
2756
document and the document's content type and then looks for modules and
2757
utilities to convert the document into one of the types that Swish-e can
2761
Swish-e comes ready to index PDF, MS Word, MP3 ID3 tags, and MS Excel file
2762
types. But these filters need extra modules or tools to do the actual
2766
For example, the Swish-e distribution includes a module called
2767
SWISH::Filter::Pdf2HTML that uses the pdftotext and pdfinfo utilities
2768
provided by the Xpdf package.
2771
This means that if you are using spider.pl to spider your web site and you
2772
wish to index PDF documents, all that is needed is to install the Xpdf
2773
package and Swish-e (with the help of spider.pl) will begin indexing your
2777
Ok, so what does all that mean? For a very simple site you should be able
2785
<td bgcolor="#eeeeee" width="1">
2790
<pre> $ /usr/local/lib/swish-e/spider.pl default <A HREF="http://localhost/">http://localhost/</A> | swish-e -S prog -i stdin</pre>
2796
which is running the spider with default spider settings, indexing the Web
2797
server on localhost and piping its output into Swish-e using default
2798
indexing settings. Documents will be filtered automatically if you have the
2799
required helper applications installed.
2802
Most people will not want to just use the default settings (for one thing
2803
the spider will take a while because its default is to delay a few seconds
2804
between every request). So, read the documentation for spider.pl to learn
2805
how to use a spider config file. And also read <A HREF="././SWISH-CONFIG.html">SWISH-CONFIG</A> to learn about what configuration options can be used with Swish-e.
2808
The SWISH::Filter documentation provides more details on filtering and
2809
hints for debugging problems when filtering.
2812
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
2814
<H1><A NAME="Document_Info">Document Info</A></H1>
2816
$Id: INSTALL.pod,v 1.36.2.2 2003/12/18 00:10:06 whmoseley Exp $
2821
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
2827
<div class="navbar">
2828
<a href="./README.html">Prev</a> |
2829
<a href="./index.html">Contents</a> |
2830
<a href="./CHANGES.html">Next</a>
2835
<IMG ALT="" WIDTH="470" HEIGHT="10" SRC="images/dotrule1.gif"></P>
2838
<div class="footer">
2839
<BR>SWISH-E is distributed with <B>no warranty</B> under the terms of the
2840
<A HREF="http://www.fsf.org/copyleft/gpl.html">GNU Public License</A>,<BR>
2841
Free Software Foundation, Inc.,
2842
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA<BR>
2843
Public questions may be posted to
2844
the <A HREF="http://swish-e.org/Discussion/">SWISH-E Discussion</A>.