<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- ***** GENERATED FILE *** DO NOT EDIT DIRECTLY - any changes will be LOST ****** -->
<!-- swish-e.org mockup based on http://www.oswd.org/design/1773/prosimii/index2.html -->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en-US">
<meta http-equiv="content-type" content="application/xhtml+xml; charset=iso-8859-1" />
<meta name="author" content="haran" />
<meta name="generator" content="haran" />
<link rel="stylesheet" type="text/css" href="./swish.css" media="screen" title="swish css" />
<link rel="Last" href="./filter.html" />
<link rel="Prev" href="./readme.html" />
<link rel="Up" href="./index.html" />
<link rel="Next" href="./changes.html" />
<link rel="Start" href="./index.html" />
<link rel="First" href="./readme.html" />
<title>Swish-e :: INSTALL - Swish-e Installation Instructions</title>
<!-- For non-visual user agents: -->
<div id="top"><a href="#main-copy" class="doNotDisplay doNotPrint">Skip to main content.</a></div>
<!-- ##### Header ##### -->
<div class="superHeader">
<span>Related Sites:</span>
<a href="http://swishewiki.org/" title="swishe wiki">swish-e wiki</a> |
<a href="http://www.xmlsoft.org/" title="libxml2 home page">libxml2</a> |
<a href="http://www.zlib.net/" title="zlib home page">zlib</a> |
<a href="http://www.foolabs.com/xpdf/" title="xpdf home page">xpdf</a> |
<a href="http://cvs.sourceforge.net/viewcvs.py/swishe/" title="View CVS at SourceForge">CVS @ SourceForge</a>
<div class="midHeader">
<h1 class="headerTitle" lang="la">Swish-e</h1>
<div class="headerSubTitle">Simple Web Indexing System for Humans - Enhanced</div>
<br class="doNotDisplay doNotPrint" />
<div class="headerLinks">
<span class="doNotDisplay">Tools:</span>
<!-- don't know what platform, so link to download page -->
<a href="http://swish-e.org/download/index.html">download latest version</a>
<h1>INSTALL - Swish-e Installation Instructions</h1>
Swish-e version 2.4.5
<h2>Table of Contents</h2>
<a href="#overview">OVERVIEW</a>
<a href="#upgrading_from_previous_versions_of_swish_e">Upgrading from previous versions of Swish-e</a>
<a href="#windows_users">Windows Users</a>
<a href="#building_from_cvs">Building from CVS</a>
<a href="#system_requirements">SYSTEM REQUIREMENTS</a>
<a href="#software_requirements">Software Requirements</a>
<a href="#optional_but_recommended_packages">Optional but Recommended Packages</a>
<a href="#installation">INSTALLATION</a>
<a href="#building_swish_e">Building Swish-e</a>
<a href="#installing_without_root_access">Installing without root access</a>
<a href="#run_time_paths">Run-time paths</a>
<a href="#building_a_debian_package">Building a Debian Package</a>
<a href="#what_s_installed">What's installed</a>
<a href="#documentation">Documentation</a>
<a href="#the_swish_e_documentation_as_man_1_pages">The Swish-e documentation as man(1) pages</a>
<a href="#join_the_swish_e_discussion_list">Join the Swish-e discussion list</a>
<a href="#questions_and_troubleshooting">QUESTIONS AND TROUBLESHOOTING</a>
<a href="#when_posting_please_provide_the_following_information_">When posting, please provide the following information:</a>
<a href="#additional_installation_options">ADDITIONAL INSTALLATION OPTIONS</a>
<a href="#the_swish_api_perl_module">The SWISH::API Perl Module</a>
<a href="#general_configuration_and_usage">GENERAL CONFIGURATION AND USAGE</a>
<a href="#introduction_to_indexing_and_searching">Introduction to Indexing and Searching</a>
<a href="#metanames_and_properties">Metanames and Properties</a>
<a href="#getting_started_with_swish_e">Getting Started With Swish-e</a>
<a href="#step_1_create_a_configuration_file">Step 1: Create a Configuration File</a>
<a href="#step_2_index_your_files">Step 2: Index your Files</a>
<a href="#step_3_search">Step 3: Search</a>
<a href="#phrase_searching">Phrase Searching</a>
<a href="#boolean_searching">Boolean Searching</a>
<a href="#context_searching">Context Searching</a>
<a href="#meta_tags">META Tags</a>
<a href="#spidering_and_searching_with_a_web_form_">Spidering and Searching with a Web form.</a>
<a href="#indexing_other_types_of_documents_filtering">Indexing Other Types of Documents - Filtering</a>
<a href="#filtering_overview">Filtering Overview</a>
<a href="#filtering_examples">Filtering Examples</a>
<a href="#document_info">Document Info</a>
<div class="sub-section">
<h1><a name="overview"></a>OVERVIEW</h1>
<p>This document describes how to download, build,
and install Swish-e from source.
A basic overview of using Swish-e to index documents follows,
with pointers to other, more advanced examples.</p>
<p>This document also provides instructions
on how to get help installing and using Swish-e
(and the important information you should provide when asking for help).
Please read these instructions <b>before requesting help</b>
on the Swish-e discussion list.
See <a href="#questions_and_troubleshooting">"QUESTIONS AND TROUBLESHOOTING"</a>.</p>
<p>Although building from source is recommended,
some OS distributions (e.g., Debian) provide pre-compiled binaries.
Check with your distribution for available packages.
Build from source
if your distribution does not offer the current version of Swish-e.</p>
<p>Also, please read the Swish-e FAQ (<a href="swish-faq.html">SWISH-FAQ</a>),
as it answers many frequently-asked questions.</p>
<p>Swish-e knows how to index HTML, XML, and plain text documents.
Helper applications and other tools are used to convert documents
such as PDF or MS Word into a format that Swish-e can index.
These additional applications and tools (listed below)
must be installed separately.
The process of converting documents is called "filtering".</p>
<p>NOTE: Swish-e version 2.4.0 installs a lot more files
than earlier versions when running "make install".
Be aware that the Swish-e documentation may thus include errors
about where files are located.
Please notify the Swish-e discussion list of any documentation errors.</p>
<div class="sub-section">
<h2><a name="upgrading_from_previous_versions_of_swish_e"></a>Upgrading from previous versions of Swish-e</h2>
<p>If you are upgrading from a previous version of Swish-e,
read the <a href="changes.html">CHANGES</a> page first.
The Swish-e index format may have changed
and existing indexes may not work with the newer version of Swish-e.</p>
<p>If you have existing indexes,
you may need to re-index your data
before running the "make install" step described below.
Swish-e may be run from the build directory after compiling,
but before installation.</p>
<div class="sub-section">
<h2><a name="windows_users"></a>Windows Users</h2>
<p>A Windows binary version is available
as a separate download from the Swish-e site (<a href="http://swish-e.org">http://swish-e.org</a>).
Many of the installation instructions below will not apply to Windows users;
the Windows version is pre-compiled
and includes <i>libxml2</i>, <i>zlib</i>, <i>xpdf</i>, and <i>catdoc</i>.</p>
<p>A number of Perl modules may also be needed.
These can be installed with ActiveState's PPM utility.</p>
<pre class="pre-section"> libwww-perl - the LWP modules (for spidering)
 HTML-Tagset - used by web spider
 HTML-Parser - used by web spider
 MIME-Types - used for filtering documents when not spidering
 HTML-Template - formatting output from swish.cgi (optional)
 HTML-FillInForm (if HTML-Template is used)</pre>
<div class="sub-section">
<h2><a name="building_from_cvs"></a>Building from CVS</h2>
<p>Please refer to the <i>README.cvs</i> file found in the documentation directory
<i>$prefix/share/doc/swish-e</i>.</p>
<div class="sub-section">
<h1><a name="system_requirements"></a>SYSTEM REQUIREMENTS</h1>
<p>Swish-e makes use of a number of libraries and tools
that are not distributed with Swish-e.
Some libraries need to be installed before building Swish-e from source;
other tools can be installed at any time.
See below for details.</p>
<div class="sub-section">
<h2><a name="software_requirements"></a>Software Requirements</h2>
<p>Swish-e is written in C.
It has been tested on a number of platforms,
including Sun/Solaris, DEC Alpha, BSD, Linux, Mac OS X, and OpenVMS.</p>
<p>The GNU C compiler (gcc) and GNU make are strongly recommended.
Repeat: you will find life easier if you use the GNU tools.</p>
<div class="sub-section">
<h2><a name="optional_but_recommended_packages"></a>Optional but Recommended Packages</h2>
<p>Most of the packages listed below are available
as easily installable packages.
Check with your operating system vendor or install them from source.
Most are very common packages that may already be installed on your computer.</p>
<p>Some packages need to be installed before building Swish-e from source,
while others may be added after Swish-e is installed.</p>
<li><a name="item_libxml2"></a><a name="libxml2"></a><b>Libxml2</b>
<p><i>libxml2</i> is very strongly recommended.
It is used for parsing both HTML and XML files.
Swish-e can be built and installed without <i>libxml2</i>,
but the HTML parser that is built into Swish-e
is not as accurate as <i>libxml2</i>.</p>
<pre class="pre-section"> http://xmlsoft.org/</pre>
<p><i>libxml2</i> must be installed before Swish-e is built,
or it will not be used.</p>
<p>If <i>libxml2</i> is installed in a non-standard location
(e.g., <i>libxml2</i> is built with <code>--prefix $HOME/local</code>),
make sure that you add the <code>bin</code> directory to your <code>$PATH</code>
before building Swish-e.
Swish-e's configure script uses a program
created by <i>libxml2</i> (<code>xml2-config</code>)
to find the location of <i>libxml2</i>.
Use <code>which xml2-config</code> to verify
that the program can be found where expected.</p>
<li><a name="item_zlib"></a><a name="zlib"></a><b>Zlib Compression</b>
<p>The <i>Zlib</i> compression library is commonly installed on most systems
and is recommended for use with Swish-e.
<i>Zlib</i> is used for compressing text stored in the Swish-e index.</p>
<pre class="pre-section"> http://www.gzip.org/zlib/</pre>
<p><i>Zlib</i> must be installed before building Swish-e.</p>
<li><a name="item_perl"></a><a name="perl"></a><b>Perl Modules</b>
<p>Although Swish-e is a compiled C program,
many support features use Perl.
For example, both the web spiders
and modules to help with filtering documents are written in Perl.</p>
<p>The following Perl modules may be required.
Check your current Perl installation,
as many may already be installed.</p>
<pre class="pre-section"> LWP
 MIME::Types (optional)</pre>
<p>Note that installing <code>Bundle::LWP</code> with the CPAN module</p>
<pre class="pre-section"> perl -MCPAN -e 'install Bundle::LWP'</pre>
<p>will install many of the above modules.</p>
<p>If you wish to use <code>HTML-Template</code> with <code>swish.cgi</code>
to generate output, install:</p>
<pre class="pre-section"> HTML::Template
 HTML::FillInForm</pre>
<p>If you wish to use <code>Template-Toolkit</code> with <code>swish.cgi</code>
to generate output, install:</p>
<pre class="pre-section"> Template</pre>
<p>Questions about installing these modules
may be sent to the Swish-e discussion list.</p>
<p>The <code>search.cgi</code> example script
requires both <code>Template-Toolkit</code> and <code>HTML::FillInForm</code>.</p>
<li><a name="item_indexing"></a><a name="indexing"></a><b>Indexing PDF Documents</b>
<p>Indexing PDF files requires the <code>xpdf</code> package.
This is a common package,
available with most operating systems
and often provided as an add-on package.</p>
<pre class="pre-section"> http://www.foolabs.com/xpdf/</pre>
<p>Xpdf may be added after Swish-e is installed.</p>
<li><a name="item_indexing"></a><a name="indexing"></a><b>Indexing MS Word Documents</b>
<p>Indexing MS Word documents requires the <i>Catdoc</i> program.</p>
<pre class="pre-section"> http://www.45.free.net/~vitus/ice/catdoc</pre>
<p><i>Catdoc</i> may be added after Swish-e is installed.</p>
<li><a name="item_indexing"></a><a name="indexing"></a><b>Indexing MP3 ID3 Tags</b>
<p>Indexing MP3 ID3 Tags requires the <code>MP3::Tag</code> Perl module.
See <a href="http://search.cpan.org">http://search.cpan.org</a>.
<code>MP3::Tag</code> may be installed after Swish-e is installed.</p>
<li><a name="item_indexing"></a><a name="indexing"></a><b>Indexing MS Excel Files</b>
<p>Indexing MS Excel files is supported by the following Perl modules,
also available at <a href="http://search.cpan.org">http://search.cpan.org</a>.</p>
<pre class="pre-section"> Spreadsheet::ParseExcel</pre>
<p>These Perl modules may be installed after Swish-e is installed.</p>
<div class="sub-section">
<h1><a name="installation"></a>INSTALLATION</h1>
<p>Here are brief installation instructions that should work in most cases.
Following this section are more detailed instructions and examples.</p>
<div class="sub-section">
<h2><a name="building_swish_e"></a>Building Swish-e</h2>
<p>Download Swish-e using your favorite web browser
or a utility such as <code>wget</code>, <code>lynx</code>, or <code>lwp-download</code>.
Unpack and build the distribution, using the following steps:</p>
<p>Note: "swish-e-2.4.0" is used as an example.
Download the most current available version
and adjust the commands below!
Also, if you are running Debian,
see the notes below on building a <code>.deb</code> package
from the Swish-e source package.</p>
<p>Pay careful attention to the "prompt" character used
on the following command lines.
A "$" prompt indicates steps run as an unprivileged user.
A "#" indicates steps run as the superuser (root).</p>
<pre class="pre-section"> $ wget http://swish-e.org/Download/swish-e-2.4.0.tar.gz
 $ gzip -dc swish-e-2.4.0.tar.gz | tar xof -
 $ cd swish-e-2.4.0 (this directory will depend on the version of Swish-e)
 $ su root (or use sudo)</pre>
<p>Once Swish-e is installed, do not run it as the superuser (root) --
root is only required during the installation step,
when installing into system directories.
Please do not break this rule.</p>
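<p>The configure-and-compile part of the build can be sketched as a small shell function.
This is a minimal sketch, not the project's official procedure:
the function name <code>build_swish_e</code> and its arguments are hypothetical,
and it deliberately stops before installation.
Only <code>configure</code>, <code>--prefix</code>, and <code>make</code> are taken from this document.</p>

```shell
#!/bin/sh
# Hypothetical helper: configure and compile an unpacked Swish-e source tree.
# Usage: build_swish_e SRCDIR [PREFIX]
build_swish_e() {
    srcdir=$1
    prefix=${2:-/usr/local}    # /usr/local is the default prefix named in this document
    if [ ! -x "$srcdir/configure" ]; then
        echo "no configure script in $srcdir" >&2
        return 1
    fi
    # Run in a subshell so the caller's working directory is left unchanged.
    (
        cd "$srcdir" &&
        ./configure --prefix="$prefix" &&
        make
    )
}
```

<p>For example, <code>build_swish_e swish-e-2.4.0</code> would build in the unpacked source directory.
Running "make install" afterwards needs root only when the prefix is a system directory.</p>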
<p>If you are upgrading from an older version of Swish-e,
be sure to review the <a href="changes.html">CHANGES</a> file.
Old index files may not be compatible with newer versions of Swish-e.
After building Swish-e (but before running "make install"),
Swish-e can be run from the build directory:</p>
<pre class="pre-section"> $ src/swish-e -V</pre>
<p>To minimize downtime,
create new index files before running "make install",
by using Swish-e from the build directory.
Then, copy the index files to the live location and run "make install":</p>
<pre class="pre-section"> $ src/swish-e -c /path/to/config -f index.new</pre>
<p>Keep in mind that the location you index from
may affect the paths stored in the index file.</p>
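<p>The reindex-then-swap step might be scripted along the following lines.
This is a sketch under stated assumptions:
the function name <code>swap_index</code> and all paths are hypothetical,
and it assumes that Swish-e writes a companion property file
named after the index with a <code>.prop</code> suffix.
Check which files your version actually creates before relying on it.</p>

```shell
#!/bin/sh
# Hypothetical helper: move a freshly built index over the live one.
# Usage: swap_index NEW_INDEX LIVE_INDEX
swap_index() {
    new=$1
    live=$2
    if [ ! -f "$new" ]; then
        echo "new index $new not found" >&2
        return 1
    fi
    # Assumption: the property file sits next to the index as NAME.prop.
    if [ -f "$new.prop" ]; then
        mv -f "$new.prop" "$live.prop" || return 1
    fi
    mv -f "$new" "$live"
}

# Example (paths are placeholders), run from the build directory:
#   src/swish-e -c /path/to/config -f index.new &&
#       swap_index index.new /path/to/live/index.swish-e
```

<p>Moving files within one file system is close to instantaneous,
which is why the new index is built first and swapped in afterwards.</p>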
<div class="sub-section">
<h2><a name="installing_without_root_access"></a>Installing without root access</h2>
<p>Here's another installation example.
This might be used if you do not have root access
or you wish to install Swish-e someplace other than <code>/usr/local</code>.</p>
<p>This example also shows building Swish-e in a "build" directory
that is separate from where the source files are located.
This is the recommended way to build Swish-e,
but it requires GNU Make.
Without GNU Make,
you will likely need to build from within the source directory,
as shown in the previous example.</p>
<pre class="pre-section"> $ tar zxof swish-e-2.4.0.tar.gz (GNU tar with "z" option)
 $ mkdir build
 $ cd build</pre>
<p>Note that the current directory is not where Swish-e was unpacked.</p>
<p>Swish-e uses a <i>configure</i> script.
<i>configure</i> has many options,
but it uses reasonable and standard defaults.
Running:</p>
<pre class="pre-section"> $ ../swish-e-2.4.0/configure --help</pre>
<p>will display the options.</p>
<p>Two options are of common interest:
<code>--prefix</code> sets the top-level installation directory;
<code>--disable-shared</code> will link Swish-e statically,
which may be needed on some platforms (Solaris 2.6, perhaps).</p>
<p>Platforms may require varying link instructions
when libraries are installed in non-standard locations.
Swish-e uses the GNU <i>autoconf</i> tools for building the package.
<i>autoconf</i> is good at building and testing,
but still requires you to provide information appropriate for your platform.
This may mean reading the manual page for your compiler and linker
to see how to specify non-standard file locations.</p>
<p>For most Unix-type platforms,
you can use <code>LDFLAGS</code> and <code>CPPFLAGS</code> environment variables
to specify paths to "include" (header) files
and to libraries that are not in standard locations.</p>
<p>In this example, we do not have root access.
We have installed <i>libxml2</i> and <i>libz</i> in <code>$HOME/local</code>.
Swish-e will also be installed in <code>$HOME/local</code>
(by using the <code>--prefix</code> setting).</p>
<p>In this setup,
you would need to add <code>$HOME/local/bin</code>
to the start of your shell's <code>$PATH</code> setting.
This is required because <i>libxml2</i> installs a program
that is used when running the configure script.
Before running configure, type:</p>
<pre class="pre-section"> $ which xml2-config</pre>
<p>It should list <code>$HOME/local/bin/xml2-config</code>.</p>
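<p>The same check can be scripted, so that a build stops early when the wrong <code>xml2-config</code> would be picked up.
This is a minimal sketch: the function name <code>check_xml2_config</code> is hypothetical,
and it uses the shell's built-in <code>command -v</code> in place of <code>which</code>.</p>

```shell
#!/bin/sh
# Hypothetical helper: confirm that the first xml2-config on $PATH lives
# under the expected prefix (for example $HOME/local).
# Usage: check_xml2_config EXPECTED_PREFIX
check_xml2_config() {
    expected=$1
    found=$(command -v xml2-config) || {
        echo "xml2-config not found on PATH" >&2
        return 1
    }
    case $found in
        "$expected"/bin/xml2-config)
            echo "ok: $found"
            ;;
        *)
            echo "found $found, expected $expected/bin/xml2-config" >&2
            return 1
            ;;
    esac
}
```

<p>A typical call would be <code>check_xml2_config "$HOME/local"</code>, run just before <i>configure</i>.</p>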
<p>Now run <i>configure</i> (remember, we are in a separate "build" directory):</p>
<pre class="pre-section"> $ ../swish-e-2.4.0/configure \
     --prefix=$HOME/local \
     CPPFLAGS=-I$HOME/local/include \
     LDFLAGS="-R$HOME/local/lib -L$HOME/local/lib"
 $ make >/dev/null (redirect output to only see warnings and errors)
 $ $HOME/local/bin/swish-e -V</pre>
<p>Note the use of double quotes in the <code>LDFLAGS</code> line above.
This allows <code>$HOME</code> to be expanded within the text string.</p>
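<p>The effect of the quoting can be seen directly in the shell.
The variable names <code>expanded</code> and <code>literal</code> below are illustrative only.</p>

```shell
#!/bin/sh
# Double quotes: the shell expands $HOME before configure sees the value.
expanded="-R$HOME/local/lib -L$HOME/local/lib"
# Single quotes: the text is passed through literally, "$HOME" and all.
literal='-R$HOME/local/lib -L$HOME/local/lib'

printf 'double quotes: %s\n' "$expanded"
printf 'single quotes: %s\n' "$literal"
```

<p>In both cases the quotes also keep the two options together as a single argument,
which is what allows them to be assigned to <code>LDFLAGS</code> in one piece.</p>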
<div class="sub-section">
<h2><a name="run_time_paths"></a>Run-time paths</h2>
<p>The <code>-R</code> option says to add a specified path (or paths)
to those that are used to find shared libraries at run time.
These paths are stored in the Swish-e binary.
When Swish-e runs,
it will look in these directories for shared libraries.</p>
<p>Some platforms may not support the <code>-R</code> option.
In that case,
set the <code>LD_RUN_PATH</code> environment variable <b>before</b> running make.</p>
<p>Some systems, such as Red Hat,
do not look in <code>/usr/local/lib</code> for libraries.
On such systems,
you can either use <code>-R</code>, as above, when building Swish-e
or add <code>/usr/local/lib</code> to <code>/etc/ld.so.conf</code> and run <i>ldconfig</i> as root.</p>
<p>If all else fails,
you may need to actually read the man pages for your platform.</p>
<div class="sub-section">
<h2><a name="building_a_debian_package"></a>Building a Debian Package</h2>
<p>The Swish-e distribution includes the files required
to build a Debian package.</p>
<pre class="pre-section"> $ tar zxof swish-e-2.4.0.tar.gz (GNU tar with "z" option)
 $ fakeroot debian/rules binary
 dpkg-deb: building package `swish-e' in `../swish-e_2.4.0-0_i386.deb'.
 # dpkg -i ../swish-e_2.4.0-0_i386.deb</pre>
<div class="sub-section">
<h2><a name="what_s_installed"></a>What's installed</h2>
<p>Swish-e installs a number of files.
By default, all files are installed below <code>/usr/local</code>,
but this can be changed by setting <code>--prefix</code>
when running <i>configure</i> (as shown above).
Individual paths may also be set.
Run <code>configure --help</code> for details.</p>
<pre class="pre-section"> $prefix/bin/swish-e          The Swish-e binary program
 $prefix/share/doc/swish-e/   Full documentation and examples
 $prefix/lib/libswish-e       The Swish-e C library
 $prefix/include/swish-e.h    The library header file
 $prefix/man/man1/            Documentation as manual pages
 $prefix/lib/swish-e/         Helper programs (spider.pl, swishspider, swish.cgi)
 $prefix/lib/swish-e/perl/    Perl helper modules</pre>
<p>Note that the Perl modules are <i>not</i> installed in the system Perl library.
Swish-e and the Perl scripts that require the modules
know where to find the modules,
but the <i>perldoc</i> program (used for reading documentation) does not.
This can be corrected by adding <code>$prefix/lib/swish-e</code>
and <code>$prefix/lib/swish-e/perl</code> to the <code>PERL5LIB</code> environment variable.</p>
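<p>One way to extend <code>PERL5LIB</code> without adding duplicate entries is sketched below.
The helper name <code>prepend_dir</code> is hypothetical,
and the fallback prefix of <code>/usr/local</code> is an assumption taken from the default install location;
adjust it to match your <code>--prefix</code>.</p>

```shell
#!/bin/sh
# Hypothetical helper: print LIST with DIR prepended, unless DIR is already in it.
# Usage: prepend_dir DIR LIST    (LIST is colon-separated and may be empty)
prepend_dir() {
    dir=$1
    list=$2
    case ":$list:" in
        *":$dir:"*) printf '%s\n' "$list" ;;
        *) printf '%s\n' "${dir}${list:+:$list}" ;;
    esac
}

# Assumption: Swish-e was installed under this prefix.
prefix=${prefix:-/usr/local}
PERL5LIB=$(prepend_dir "$prefix/lib/swish-e/perl" "${PERL5LIB:-}")
PERL5LIB=$(prepend_dir "$prefix/lib/swish-e" "$PERL5LIB")
export PERL5LIB
```

<p>Placed in a shell start-up file, lines like these let <i>perldoc</i> locate the helper modules in later sessions.</p>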
<div class="sub-section">
<h2><a name="documentation"></a>Documentation</h2>
<p>Documentation can be found in the <code>$prefix/share/doc/swish-e</code> directory.
Documentation is in HTML format at <code>$prefix/share/doc/swish-e/html</code> and
can also be read on-line at the Swish-e web site:</p>
<pre class="pre-section"> http://swish-e.org/</pre>
<div class="sub-section">
<h2><a name="the_swish_e_documentation_as_man_1_pages"></a>The Swish-e documentation as man(1) pages</h2>
<p>Running "make install" installs some of the Swish-e documentation as man pages.
The following man pages are installed:</p>
<pre class="pre-section"> SWISH-FAQ(1)
 SWISH-LIBRARY(1)</pre>
<p>The man pages are installed, by default, in the system man directory.
This directory is determined when <i>configure</i> is run;
it can be set by passing a directory name to <i>configure</i>.</p>
<pre class="pre-section"> ./configure --mandir=/usr/local/doc/man</pre>
<p>The man directory is specified relative to the <code>--prefix</code> setting.
If you use <code>--prefix</code>,
you do not normally need to also specify <code>--mandir</code>.</p>
<p>Information on running <i>configure</i> can be found by typing:</p>
<pre class="pre-section"> ./configure --help</pre>
<div class="sub-section">
<h2><a name="join_the_swish_e_discussion_list"></a>Join the Swish-e discussion list</h2>
<p>The final step, when installing Swish-e,
is to join the Swish-e discussion list.</p>
<p>The Swish-e discussion list is the place
to ask questions about installing and using Swish-e,
see or post bug fixes or security announcements,
and offer help to others.
Please do not contact the developers directly.</p>
<p>The list is typically <i>very low traffic</i>,
so it won't overload your inbox.
Please take the time to subscribe.
See <a href="http://Swish-e.org">http://Swish-e.org</a>.</p>
<p>If you are using Swish-e on a public site,
please let the list know,
so that your URL can be added to the list of sites that use Swish-e!</p>
<p>Please review the next section
before posting questions to the Swish-e list.</p>
<div class="sub-section">
<h1><a name="questions_and_troubleshooting"></a>QUESTIONS AND TROUBLESHOOTING</h1>
<p>Support for installation, configuration, and usage
is available via the Swish-e discussion list.
Visit <a href="http://swish-e.org">http://swish-e.org</a> for information.
Do not contact developers directly for help --
always post your question to the list.</p>
<p>It's very important to provide the right information
when asking for help.</p>
<p>Please search the Swish-e list archive before posting a question.
Also, check the <a href="swish-faq.html">SWISH-FAQ</a>
to see if your question has already been asked and answered.</p>
<p>Before posting, use the available tools to narrow down the problem.</p>
<p>Swish-e has several switches
(e.g., <code>-T</code>, <code>-v</code>, and <code>-k</code>)
that may help you resolve issues.
These switches are described on the <a href="swish-run.html">SWISH-RUN</a> page.
For example, if you cannot find a document by a keyword
that you believe should be indexed,
try indexing just that single file
and use the <code>-T INDEXED_WORDS</code> option
to see if the word is actually being indexed.
First, try it without any changes to default settings:</p>
<pre class="pre-section"> swish-e -i testdoc.html -T indexed_words | less</pre>
<p>If that works, add in your configuration file:</p>
<pre class="pre-section"> swish-e -i testdoc.html -c swish.conf -T indexed_words | less</pre>
<p>If it still isn't working as you expect,
try to reduce the test document to a very small example.
This will be very helpful to the people reading your post
when you ask for help.</p>
<p>Another useful trick is to use <code>-H9</code> when searching,
to display full headers in search results.
Look at the "Parsed Words" header
to see what words Swish-e is searching for.</p>
<div class="sub-section">
<h2><a name="when_posting_please_provide_the_following_information_"></a>When posting, please provide the following information:</h2>
<p>Use these guidelines when asking for help.
The most important tip is to provide the <b>least</b> amount of information
that can be used to reproduce your problem.
Do not paraphrase output -- copy-and-paste --
but trim text that is not necessary.</p>
<p>The exact version of Swish-e that you are using.
Running Swish-e with the <code>-V</code> switch will print the version number.
Also, supply the output from <code>uname -a</code> or a similar command
that identifies the operating system you are running on.
If you are running an old version of Swish-e,
be prepared for a response of "upgrade" to your question.</p>
<p>A summary of the problem.
This should include the commands issued
(e.g., for indexing or searching) and their output,
along with an explanation of why you don't think it's working correctly.
Please copy-and-paste the exact commands and their output,
instead of retyping, to avoid errors.</p>
<p>Include a copy of the configuration file you are using, if any.
Swish-e has reasonable defaults,
so in many cases you can run it without using a configuration file.
But, if you need to use a configuration file,
<b>reduce it down</b> to the absolute minimum number of commands
that is required to demonstrate your problem.
Again, copy-and-paste.</p>
<p>A small copy of a source document that demonstrates the problem.</p>
<p>If you are having problems spidering a web server,
use lwp-download or wget to copy the file locally,
then make sure you can index the document using the file system method.
This will help you determine if the problem
is with spidering or indexing.</p>
<p>If you expect help with spidering,
don't post fake URLs, as it makes it impossible to test.
If you don't want to expose your web page to the people on the Swish-e list,
find some other site to test spidering on.
If that works, but you still cannot spider your own site,
you may need to request help from others.
If so, you must post your real URL
or make a test document available via some other source.</p>
<p>If you are having trouble building Swish-e,
please copy-and-paste the output from make
(or from <code>./configure</code>, if that's where the problem is).</p>
<p>The key is to provide enough information
so that others may reproduce the problem.</p>
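<p>Gathering the version and platform details can be scripted, so that nothing is retyped by hand.
This is a minimal sketch: the function name <code>swish_report</code> and the section headings it prints are hypothetical;
only <code>swish-e -V</code> and <code>uname -a</code> come from the guidelines above.</p>

```shell
#!/bin/sh
# Hypothetical helper: gather the version and platform details requested above.
# Usage: swish_report [PATH_TO_SWISH_E]
swish_report() {
    swish=${1:-swish-e}
    echo "== Swish-e version =="
    if command -v "$swish" >/dev/null 2>&1; then
        "$swish" -V
    else
        echo "($swish not found on PATH)"
    fi
    echo "== Operating system =="
    uname -a
}
```

<p>Running <code>swish_report &gt; report.txt</code> gives a file whose contents can be pasted into a post,
followed by the exact commands, their output, and the reduced configuration file.</p>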
<div class="sub-section">
<h1><a name="additional_installation_options"></a>ADDITIONAL INSTALLATION OPTIONS</h1>
<p>These steps are not required for normal use of Swish-e.</p>
<div class="sub-section">
<h2><a name="the_swish_api_perl_module"></a>The SWISH::API Perl Module</h2>
<p>The Swish-e distribution includes a module
that provides a Perl interface to the Swish-e C library.
This module provides a way to search a Swish-e index
without running the Swish-e program.
Searching an index will be many times faster
when running under a persistent environment
such as Apache/mod_perl with the <code>SWISH::API</code> module.</p>
<p>See the <i>perl/README</i> file for information
on installing and using the <code>SWISH::API</code> Perl module.</p>
<div class="sub-section">
<h1><a name="general_configuration_and_usage"></a>GENERAL CONFIGURATION AND USAGE</h1>
<p>This section should give you a basic overview
of indexing and searching with <b>Swish-e</b>.
Other examples can be found in the <code>conf</code> directory;
these will step you through a number of different configurations.
Also, please review the <a href="swish-faq.html">SWISH-FAQ</a>.</p>
<p>Swish-e is a command-line program.
The program is controlled by passing switches on the command line.
A configuration file may be used,
but often is not required.
Swish-e does not include a graphical user interface.
Example CGI scripts are provided in the distribution,
but they require additional setup to use.</p>
1065
<div class="sub-section">
1067
<h2><a name="introduction_to_indexing_and_searching"></a>Introduction to Indexing and Searching</h2>
1069
<p>Swish-e can index files that are located on the local file system.
1070
For example, running:</p>
1071
<pre class="pre-section"> swish-e -i /var/www/htdocs</pre>
1072
<p>will index <i>all</i> files in the <code>/var/www/htdocs</code> directory.
1073
You may specify one or more files or directories with the <code>-i</code> option.
1074
By default, this will create an index called <code>index.swish-e</code>
1075
in the current directory.</p>
1076
<p>To search the resulting index for a given word, try:</p>
1077
<pre class="pre-section"> swish-e -w apache</pre>
1078
<p>This will find the word "apache" in the body or title
1079
of the indexed documents.</p>
1080
<p>As mentioned above,
1081
Swish-e will index all files in a directory,
1082
unless instructed otherwise.
1083
So, if <code>/var/www/htdocs</code> contains non-HTML files,
1084
you will need a configuration file to limit the files that Swish-e indexes.
1085
Create a file called <code>swish.conf</code>:</p>
1086
<pre class="pre-section"> # Example configuration file

# Tell Swish-e what to index (same as -i switch above)
IndexDir /var/www/htdocs

# Only index HTML and text files
IndexOnly .htm .html .txt

# Tell Swish-e that .txt files are to use the text parser.
IndexContents TXT* .txt

# Otherwise, use the HTML parser
DefaultContents HTML*

# Ask libxml2 to report any parsing errors and warnings or
# any UTF-8 to 8859-1 conversion errors
ParserWarnLevel 9</pre>
1103
<p>After saving the configuration file, reindex:</p>
1104
<pre class="pre-section"> swish-e -c swish.conf</pre>
1105
<p>The Swish-e configuration settings are described
1106
in the <a href="swish-config.html">SWISH-CONFIG</a> manual page.
1107
The order of statements in the configuration file is typically not important,
1108
although some statements depend on previously set statements.
1109
There are many possible settings.
1110
Good advice is to use as few settings as possible
1111
when first starting out with Swish-e.</p>
1112
<p>The runtime options (switches) are described
1113
in the <a href="swish-run.html">SWISH-RUN</a> manual page.
1114
You may also see a summary of options by running:</p>
1115
<pre class="pre-section"> swish-e -h</pre>
1116
<p>Swish-e has two other methods for reading input files.
1117
One method uses a Perl helper script and the LWP Perl library
1118
to spider remote web sites:</p>
1119
<pre class="pre-section"> swish-e -S http -i http://localhost/index.html -v2</pre>
1120
<p>This will spider the web server running on the local host.
1121
The <code>-S</code> option defines the input source method to be "http",
1122
<code>-i</code> specifies the URL to spider,
1123
and <code>-v</code> sets the verbose level to two.
1124
There are a number of configuration options
1125
that are specific to the <code>-S</code> http input source.
1126
See <a href="swish-config.html">SWISH-CONFIG</a>.
1127
Note that only files of <code>Content-Type text/*</code> will be indexed.</p>
1128
<p>The <code>-S http</code> method is deprecated, however,
1129
in favor of a variation on the following input method.</p>
1130
<p>There is a general-purpose input method
1131
wherein Swish-e reads input from a program
1132
that produces documents in a special format.
1133
The program might read and format data stored in a database,
1134
or parse and format messages in a mailing list archive,
1135
or run a program that spiders web sites (like the previous method).</p>
1136
<p>The Swish-e distribution includes a spider program
1137
that uses this method of input.
1138
This spider program is much more configurable and feature-rich
1139
than the previous (<code>-S http</code>) method.</p>
1140
<p>To duplicate the previous example,
1141
create a configuration file called <code>swish2.conf</code>:</p>
1142
<pre class="pre-section"> # Example for spidering
# Use the "spider.pl" program included with Swish-e
IndexDir spider.pl

# Define what site to index
SwishProgParameters default http://localhost/index.html</pre>
1148
<p>Then, create the index using the command:</p>
1149
<pre class="pre-section"> swish-e -S prog -c swish2.conf</pre>
1150
<p>This says to use the <code>-S prog</code> input source method.
1151
Note that, in this case,
1152
the <code>IndexDir</code> setting does not specify a file or directory to index,
1153
but a program name to be run.
1154
This program, <code>spider.pl</code>,
1155
does the work of fetching the documents from the web server
1156
and passing them to Swish-e for indexing.</p>
1157
<p>The <code>SwishProgParameters</code> option is a special feature
1158
that allows passing command-line parameters
1159
to the program specified with <code>IndexDir</code>.
1160
In this case, we are passing the word <code>default</code>
1161
(which tells <code>spider.pl</code> to use default settings)
1162
and the URL to spider.</p>
1163
<p>Running a script under Windows requires specifying the interpreter
1164
(e.g., <code>perl.exe</code>)
1165
and then using <code>SwishProgParameters</code>
1166
to specify the script and the script's parameters.
1167
See <i>Notes when using <code>-S prog</code> on MS Windows</i>
1168
on the <a href="swish-run.html">SWISH-RUN</a> page.</p>
1169
<p>The advantage of the <code>-S prog</code> method of spidering
1170
(over the previous <code>-S http</code> method)
1171
is that the Perl code is only compiled once
1172
instead of once for every document fetched from the web server.
1173
In addition, <code>spider.pl</code> is a much more capable spider with many more features.
Even with the default settings used here,
it will automatically index PDF and MS Word documents
when Xpdf and Catdoc are installed.</p>
1177
<p>A special form of the <code>-S prog</code> input source method is:</p>
1178
<pre class="pre-section"> ./myprog --option | swish-e -S prog -i stdin -c config</pre>
1179
<p>This allows running Swish-e from a program
1180
(instead of running the external program from Swish-e).
1181
So, this also can be done as:</p>
1182
<pre class="pre-section"> ./myprog --option > outfile
1183
swish-e -S prog -i stdin -c config < outfile</pre>
1185
<pre class="pre-section"> ./myprog --option > outfile
1186
cat outfile | swish-e -S prog -i stdin -c config</pre>
1187
<p>One final note about the <code>-S prog</code> input source method.
1188
The program specified with <code>-i</code> or <code>IndexDir</code> needs to be an absolute path.
1189
The exception is when the program is installed in the <code>libexecdir</code> directory.
1190
Then, a plain program name may be specified
1191
(as in the example showing <code>spider.pl</code>, above).</p>
1192
<p>All three input source methods are described in more detail
1193
on the <a href="swish-run.html">SWISH-RUN</a> page.</p>
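<p>As a rough illustration of the <code>-S prog</code> input format mentioned above,
the following sketch shows what a minimal <code>myprog</code> might look like.
It assumes the header layout documented in SWISH-RUN:
a <code>Path-Name</code> header, a <code>Content-Length</code> header giving the exact byte count,
a blank line, and then the document itself.
The function name <code>emit_doc</code> and the two sample documents are made up for this example.</p>

```shell
#!/bin/sh
# emit_doc PATH BODY
# Print one document in the header-plus-body layout read by -S prog.
# (emit_doc is a hypothetical helper, not part of Swish-e.)
emit_doc() {
    doc_path=$1
    doc_body=$2
    # Content-Length must be the exact number of bytes in the body.
    doc_len=$(printf '%s' "$doc_body" | wc -c | tr -d ' ')
    printf 'Path-Name: %s\n' "$doc_path"
    printf 'Content-Length: %s\n' "$doc_len"
    printf '\n'
    printf '%s' "$doc_body"
}

# Two made-up documents, written back to back on standard output.
emit_doc "doc1.txt" "apples and oranges"
emit_doc "doc2.txt" "pears"
```

<p>Saved as an executable file named <code>myprog</code>,
the output could then be piped into <code>swish-e -S prog -i stdin</code>, as in the examples above.
No newline is added after the body, because the reader is expected to consume exactly <code>Content-Length</code> bytes.</p>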
1197
<div class="sub-section">
1199
<h2><a name="metanames_and_properties"></a>Metanames and Properties</h2>
1201
<p>There are two key Swish-e concepts
1202
that you need to be familiar with:
1203
Metanames and Properties.</p>
1205
<li><a name="item_metanames"></a><a name="metanames"></a><b>Metanames</b>
1206
<p>Swish-e creates a reverse (i.e., inverted) index.
1207
Just like an index in a book,
1208
you look up a word and it lists the pages (or documents)
1209
where that word can be found.</p>
1210
<p>Swish-e can create multiple index tables within the same index file.
For example,
you might want to create an index that only contains words in HTML titles,
1213
so that searches can be limited to title text.
1214
Or, you might have descriptive words
1215
that you would like to search,
1216
stored in a meta tag called "keywords".</p>
1217
<p>Some database systems might call these different "fields" or "columns",
1218
but Swish-e calls them <i>MetaNames</i>
1219
(the name dates from when Swish-e first indexed HTML "meta" tags).</p>
1220
<p>To find documents containing "foo" in their titles, you might run:</p>
1221
<pre class="pre-section"> swish-e -w swishtitle=foo</pre>
1222
<p>or, a more advanced example:</p>
1223
<pre class="pre-section"> swish-e -w 'swishtitle=(foo or bar) or swishdefault=(baz)'</pre>
1224
<p>The Metaname "swishdefault" is the name that is used by Swish-e
1225
if no other name is specified.
1226
The following two searches are thus equivalent:</p>
1227
<pre class="pre-section"> swish-e -w foo
1228
swish-e -w swishdefault=foo</pre>
1229
<p>When indexing HTML documents,
1230
Swish-e indexes words in the body and title
1231
under the Metaname "swishdefault".</p>
1233
<li><a name="item_properties"></a><a name="properties"></a><b>Properties</b>
1234
<p>Swish-e's search result is a list of files --
1235
actually, Swish-e uses file numbers internally.
1236
Data can be associated with each file number when indexing.
1237
For example, by default Swish-e associates the file's name, title,
1238
last modified date, and size with the file number.
1239
These items can be printed in search results.</p>
1240
<p>In Swish-e, this associated data is called a file's <i>Properties</i>.
1241
Properties can be any data you wish to associate with a document --
1242
in fact, the entire text of the document can be stored in the index.
1243
What data is stored as a Property is controlled by the <i>PropertyNames</i>
1244
(and other) configuration directives.</p>
1245
<p>What properties are printed with search results
1246
depends on the <code>-x</code> or <code>-p</code> switches.
By default,
Swish-e returns the rank, path/URL, title, and file size in bytes
1249
for each result.</p>
1255
<div class="sub-section">
1257
<h2><a name="getting_started_with_swish_e"></a>Getting Started With Swish-e</h2>
1259
<p>Swish-e reads a configuration file (see <a href="swish-config.html">SWISH-CONFIG</a>)
1260
for directives that control whether and how Swish-e indexes files.
1261
Swish-e is also controlled by command-line arguments
1262
(see <a href="swish-run.html">SWISH-RUN</a>).
1263
Many of the command-line arguments
1264
have equivalent configuration directives (e.g., <code>-i</code> and <code>IndexDir</code>).</p>
1265
<p>Swish-e does not require a configuration file,
1266
but most people change its default behavior
1267
by placing settings in a configuration file.</p>
1268
<p>To try the examples below,
1269
go to the <code>tests</code> subdirectory of the distribution.
1270
The tests will use the <code>*.html</code> files in this directory
1271
when creating the test index.
1272
You may wish to review these <code>*.html</code> files
1273
to get an idea of the various native file formats that Swish-e supports.</p>
1274
<p>You may also use your own test documents.
1275
It's recommended to use small test documents when first using Swish-e.</p>
1279
<div class="sub-section">
1281
<h2><a name="step_1_create_a_configuration_file"></a>Step 1: Create a Configuration File</h2>
1283
<p>The configuration file controls what and how Swish-e indexes. The
1284
configuration file consists of directives, comments, and blank lines.
1285
The configuration file can have any name you like.</p>
1286
<p>This example will work with the documents in the <i>tests</i> directory.
1287
You may wish to review the <i>tests/test.config</i> configuration file used
1288
for the <code>make test</code> tests.</p>
1289
<p>For example, a simple configuration file (<i>swish-e.conf</i>):</p>
1290
<pre class="pre-section"> # Example Swish-e Configuration file

# Define *what* to index
# IndexDir can point to directories and/or files
# Here it's pointing to the current directory
# Swish-e will also recurse into sub-directories.
IndexDir .

# But only index the .html files
IndexOnly .html

# Show basic info while indexing
IndexReport 1</pre>
1303
<p>And that's a simple configuration file.
1304
It says to index all the <code>.html</code> files
1305
in the current directory and sub-directories, if any,
1306
and provide some basic output while indexing.</p>
1307
<p>As mentioned above,
1308
the complete list of all configuration file directives
1309
is detailed in <a href="swish-config.html">SWISH-CONFIG</a>.</p>
1313
<div class="sub-section">
1315
<h2><a name="step_2_index_your_files"></a>Step 2: Index your Files</h2>
1318
<p>Run Swish-e,
using the <code>-c</code> switch to specify the name of the configuration file.</p>
1319
<pre class="pre-section"> swish-e -c swish-e.conf
Indexing Data Source: "File-System"
Removing very common words...
Writing main index...
Sorting 55 words alphabetically
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
55 unique words indexed.
4 properties sorted.
5 files indexed. 1252 total bytes. 140 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!</pre>
1338
<p>This created the index file <code>index.swish-e</code>.
1339
This is the default index file name,
1340
unless the <b>IndexFile</b> directive is specified in the configuration file:</p>
1341
<pre class="pre-section"> IndexFile ./website.index</pre>
1342
<p>You may use the <code>-f</code> switch to specify an index file at indexing time.
1343
The <code>-f</code> option overrides any <code>IndexFile</code> setting
1344
that may be in the configuration file.</p>
1348
<div class="sub-section">
1350
<h2><a name="step_3_search"></a>Step 3: Search</h2>
1352
<p>You specify your search terms with the <code>-w</code> switch.
1353
For example, to find the files that contain the word <code>sample</code>,
1354
you would issue the command:</p>
1355
<pre class="pre-section"> swish-e -w sample</pre>
1356
<p>This example assumes that you are in the <code>tests</code> directory.
1357
Swish-e returns the following, in response to this command:</p>
1358
<pre class="pre-section"> swish-e -w sample
# SWISH format: 2.4.0
# Search words: sample
# Number of hits: 2
# Search time: 0.000 seconds
# Run time: 0.005 seconds
1000 ./test_xml.html "If you are seeing this, the METATAG XML search was successful!" 159
1000 ./test.html "If you are seeing this, the test was successful!" 437
.</pre>
1368
<p>So, the word <code>sample</code> was found in two documents.
1369
The first number shown is the relevance (or rank) of the search term,
1370
followed by the file containing the search term,
1371
the title of the document,
1372
and finally, the length of the document (in bytes).</p>
1373
<p>The period ("."), sitting alone at the end,
1374
marks the end of the search results.</p>
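<p>Because each result line has a fixed layout (rank, path, quoted title, size),
it is fairly easy to post-process.
The sketch below is one possible way to split the default output into fields with <code>sed</code>;
the function name <code>parse_results</code> and the <code>|</code>-separated output are choices made for this example only,
and the pattern assumes the default result format described above.</p>

```shell
#!/bin/sh
# parse_results: read Swish-e's default result output on standard input
# and print one "rank|size|path|title" record per hit.
# (parse_results is a hypothetical helper, not part of Swish-e.)
parse_results() {
    # Header lines start with "#" and the end marker is a lone ".";
    # neither matches the pattern, so "sed -n" silently drops them.
    sed -n 's/^\([0-9][0-9]*\) \(.*\) "\(.*\)" \([0-9][0-9]*\)$/\1|\4|\2|\3/p'
}

# Sample lines copied from the search output shown above.
parse_results <<'EOF'
# Search words: sample
1000 ./test.html "If you are seeing this, the test was successful!" 437
.
EOF
```

<p>In real use, the function would read from a search,
for example <code>swish-e -w sample | parse_results</code>.
A title that itself contains <code>|</code> would need a different separator.</p>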
1375
<p>Much more information may be retrieved while searching,
1376
by using the <code>-x</code> and <code>-H</code> switches (see <a href="swish-run.html">SWISH-RUN</a>)
1377
and by using Document Properties (see <a href="swish-config.html">SWISH-CONFIG</a>).</p>
1381
<div class="sub-section">
1383
<h2><a name="phrase_searching"></a>Phrase Searching</h2>
1385
<p>To search for a phrase in a document,
1386
use double-quotes to delimit your search terms.
1387
(The default phrase delimiter is set in <code>src/swish.h</code>.)</p>
1388
<p>You must protect the quotes from the shell.</p>
1389
<p>For example, under Unix:</p>
1390
<pre class="pre-section"> swish-e -w '"this is a phrase" or (this and that)'
1391
swish-e -w 'meta1=("this is a phrase") or (this and that)'</pre>
1392
<p>Or, under the Windows <code>command.com</code> shell:</p>
1393
<pre class="pre-section"> swish-e -w \"this is a phrase\" or (this and that)</pre>
1394
<p>The phrase delimiter can be set with the <code>-P</code> switch.</p>
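<p>When a quoted query does not behave as expected,
it can help to see exactly what the shell passes along.
The sketch below uses a small stand-in function in place of the Swish-e binary
to print each argument on its own line;
<code>show_args</code> is a made-up name for this example, and the query is the one from the Unix example above.</p>

```shell
#!/bin/sh
# show_args: stand-in for swish-e that prints each argument it receives.
# (show_args is a hypothetical helper used only to inspect quoting.)
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Single quotes keep the inner double quotes and the parentheses intact.
query='"this is a phrase" or (this and that)'

# Quoted expansion: the whole query arrives as one argument after -w.
show_args -w "$query"
```

<p>If the second line of output shows the complete query, including its double quotes,
the same quoting should hand Swish-e the phrase unchanged.</p>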
1398
<div class="sub-section">
1400
<h2><a name="boolean_searching"></a>Boolean Searching</h2>
1402
<p>You can use the Boolean operators <b>and</b>, <b>or</b>, or <b>not</b> in searching.
1403
Without these Boolean operators,
1404
Swish-e will assume you're <b>and</b>ing the words together.</p>
1405
<p>Here are some examples:</p>
1406
<pre class="pre-section"> swish-e -w 'apples oranges'
1407
swish-e -w 'apples and oranges' ( Same thing )
1409
swish-e -w 'apples or oranges'
1411
swish-e -w 'apples or oranges not juice' -f myIndex </pre>
1412
<p>The last example first retrieves the files that contain either "apples" or "oranges";
then, among those, it selects the ones that do not contain the word "juice".</p>
1414
<p>A few other examples to ponder:</p>
1415
<pre class="pre-section"> swish-e -w 'apples and oranges or pears'
1416
swish-e -w '(apples and oranges) or pears' ( Same thing )
1417
swish-e -w 'apples and (oranges or pears)' ( Not the same thing )</pre>
1418
<p>Swish-e processes the query left to right.</p>
1419
<p>See <a href="swish-search.html">SWISH-SEARCH</a> for more information.</p>
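<p>To make the left-to-right rule concrete,
the following sketch adds explicit parentheses to a flat query.
It is only a toy model:
it assumes single words joined by explicit <code>and</code> or <code>or</code> operators,
it is not Swish-e's own parser,
and the function name <code>group_left</code> is made up for this example.</p>

```shell
#!/bin/sh
# group_left QUERY
# Print QUERY with parentheses that show left-to-right grouping.
# Expects the form: word (and|or) word (and|or) word ...
# (group_left is a hypothetical helper, not part of Swish-e.)
group_left() {
    printf '%s\n' "$1" | awk '{
        expr = $1
        # Fold each "operator word" pair into the expression built so far.
        for (i = 2; i + 1 <= NF; i += 2) {
            expr = "(" expr " " $i " " $(i + 1) ")"
        }
        print expr
    }'
}

group_left 'apples and oranges or pears'
```

<p>For the query above this prints <code>((apples and oranges) or pears)</code>,
which matches the second of the three examples rather than the third.</p>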
1423
<div class="sub-section">
1425
<h2><a name="context_searching"></a>Context Searching</h2>
1427
<p>The <code>-t</code> option in the search command line
1428
allows you to search for words that exist only in specific HTML tags.
1429
This option takes a string of characters as its argument.
1430
Each character represents a different tag in which the word is searched;
1431
that is, you can use any combination of the following characters:</p>
1432
<pre class="pre-section"> H search in all <HEAD> tags
1433
B search in the <BODY> tags
1434
t search in <TITLE> tags
1435
h is <H1> to <H6> (header) tags
1436
e is emphasized tags (this may be <B>, <I>, <EM>, or <STRONG>)
1437
c is HTML comment tags (<!-- ... -->)</pre>
1439
<pre class="pre-section"> # Find only documents with the word "linux" in the <TITLE> tags.
1440
swish-e -w linux -t t
1442
# Find the word "apple" in titles or comments
1443
swish-e -w apple -t tc</pre>
1447
<div class="sub-section">
1449
<h2><a name="meta_tags"></a>META Tags</h2>
1451
<p>As mentioned above,
1452
Metanames are a way to define "fields" in your documents.
1453
You can use the Metanames in your queries to limit the search
1454
to just the words contained in that META name of your document.
For example,
you might have a META-tagged field called <code>subjects</code> in your documents.
1457
This would let you search your documents for the word "foo",
1458
but only return documents where "foo" is within the <code>subjects</code> META tag.</p>
1459
<p>Document <i>Properties</i> are somewhat related:
1460
Properties allow the content of a META tag in a source document
1461
to be stored within the index,
1462
and that text to be returned along with search results.</p>
1463
<p>META tags can have two formats in your documents.</p>
1464
<pre class="pre-section"> <META NAME="keyName" CONTENT="some Content"></pre>
1465
<p>And in XML format</p>
1466
<pre class="pre-section"> <keyName>
some Content
</keyName></pre>
1469
<p>If using <i>libxml</i>, you can optionally use a non-HTML tag as a metaname:</p>
1470
<pre class="pre-section"> <html>
1477
<p>This, of course, is invalid HTML.</p>
1478
<p>To continue with our sample <code>swish-e.conf</code> file,
1479
add the following lines:</p>
1480
<pre class="pre-section"> # Define META tags
1481
MetaNames meta1 meta2 meta3</pre>
1482
<p>Reindex to include the changes:</p>
1483
<pre class="pre-section"> swish-e -c swish-e.conf</pre>
1484
<p>Now search, but this time limit your search to META tag <code>meta1</code>:</p>
1485
<pre class="pre-section"> swish-e -w 'meta1=metatest1'</pre>
1486
<p>Again, please see <a href="swish-run.html">SWISH-RUN</a> and <a href="swish-config.html">SWISH-CONFIG</a>
1487
for complete documentation of the various indexing and searching options.</p>
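<p>The steps above can be tried in isolation with a throwaway directory.
The sketch below writes one small HTML document carrying a <code>meta1</code> tag
and a configuration file that uses only directives shown in this document,
then indexes and searches if a <code>swish-e</code> binary is found on the <code>PATH</code>.
The file names and document contents are made up for this example.</p>

```shell
#!/bin/sh
# Build a throwaway directory holding one HTML document and a config file.
workdir=$(mktemp -d)

# A made-up test document with a META tag named "meta1".
cat > "$workdir/doc.html" <<'EOF'
<html>
<head>
<title>Meta tag test</title>
<meta name="meta1" content="metatest1">
</head>
<body>Body text</body>
</html>
EOF

# Configuration using only directives that appear in this document.
cat > "$workdir/swish-e.conf" <<EOF
IndexDir $workdir
IndexOnly .html
MetaNames meta1 meta2 meta3
IndexFile $workdir/index.swish-e
EOF

# Index and search only when the swish-e binary is available.
if command -v swish-e >/dev/null 2>&1; then
    swish-e -c "$workdir/swish-e.conf" &&
        swish-e -f "$workdir/index.swish-e" -w 'meta1=metatest1'
else
    echo "swish-e not found; files written to $workdir" >&2
fi
```

<p>Keeping the index inside the temporary directory (via <code>IndexFile</code> and <code>-f</code>)
avoids overwriting an existing <code>index.swish-e</code> in the current directory.</p>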
1491
<div class="sub-section">
1493
<h2><a name="spidering_and_searching_with_a_web_form_"></a>Spidering and Searching with a Web form.</h2>
1495
<p>This example demonstrates how to spider a web site
1496
and set up the included CGI script to provide a web-based search page.
1497
This example uses Perl programs that are included in the Swish-e distribution:
1498
<i>spider.pl</i> will be used for reading files from the web server;
1499
<i>swish.cgi</i> will provide the web search form and display results.</p>
1501
<p>In this example,
we will index the Apache Web Server documentation,
1502
installed on the local computer at <a href="http://localhost/apache_docs/index.html">http://localhost/apache_docs/index.html</a>.</p>
1504
<li><a name="item_1"></a><a name="1"></a><b>1 Make a Working Directory</b>
1505
<p>Create a directory to store the Swish-e configuration and the Swish-e index.</p>
1506
<pre class="pre-section"> ~$ mkdir web_index
~$ cd web_index</pre>
1510
<li><a name="item_2"></a><a name="2"></a><b>2 Create a Swish-e Configuration file</b>
1511
<pre class="pre-section"> ~/web_index$ cat swish.conf
# Swish-e config to index the Apache documentation

# Use spider.pl for indexing (location of spider.pl set at installation time)
IndexDir spider.pl

# Use spider.pl's default configuration and specify the URL to spider
SwishProgParameters default http://localhost/apache_docs/index.html

# Allow extra searching by title, path
MetaNames swishtitle swishdocpath

# Set StoreDescription for each parser
# to display context with search results
StoreDescription TXT* 10000
StoreDescription HTML* <body> 10000</pre>
1528
<li><a name="item_3"></a><a name="3"></a><b>3 Generate the Index</b>
1529
<p>Now, run Swish-e to create the index:</p>
1530
<pre class="pre-section"> ~/web_index$ swish-e -S prog -c swish.conf
Indexing Data Source: "External-Program"
Indexing "spider.pl"
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'

Summary for: http://localhost/apache_docs/index.html
Duplicates: 4,188 (349.0/sec)
Off-site links: 276 (23.0/sec)
Skipped: 1 (0.1/sec)
Total Bytes: 2,090,125 (174177.1/sec)
Total Docs: 147 (12.2/sec)
Unique URLs: 149 (12.4/sec)
Removing very common words...
Writing main index...
Sorting 7736 words alphabetically
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
7736 unique words indexed.
5 properties sorted.
147 files indexed. 2090125 total bytes. 200783 total words.
Elapsed time: 00:00:13 CPU time: 00:00:02
Indexing done!</pre>
1558
<p>The above output is actually a mix of output from both Swish-e
1559
and <code>spider.pl</code>.
1560
<code>spider.pl</code> reports the
1561
"Summary for: <a href="http://localhost/apache_docs/index.html">http://localhost/apache_docs/index.html</a>".</p>
1562
<p>Also note that Swish-e knows to find <code>spider.pl</code>
1563
at <code>/usr/local/lib/swish-e/spider.pl</code>.
1564
The script installation directory (called <code>libexecdir</code>)
1565
is set at configure time.
1566
You can see your setting by running <code>swish-e -h</code>:</p>
1567
<pre class="pre-section"> ~/web_index$ swish-e -h | grep libexecdir
1568
Scripts and Modules at: (libexecdir) = /usr/local/lib/swish-e</pre>
1569
<p>This directory will be needed in the next step,
1570
when setting up the CGI script.</p>
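<p>If the directory is needed in a script rather than on screen,
the path can be cut out of the help text.
The sketch below assumes the help line has the form shown above;
the function name <code>libexecdir_from_help</code> is made up for this example.</p>

```shell
#!/bin/sh
# libexecdir_from_help: read "swish-e -h" output on standard input and
# print the directory that follows "(libexecdir) = ".
# (libexecdir_from_help is a hypothetical helper, not part of Swish-e.)
libexecdir_from_help() {
    sed -n 's/^.*(libexecdir) = \(.*\)$/\1/p'
}

# Sample line copied from the output shown above.
printf '%s\n' ' Scripts and Modules at: (libexecdir) = /usr/local/lib/swish-e' |
    libexecdir_from_help
```

<p>With Swish-e installed, <code>libexecdir=$(swish-e -h | libexecdir_from_help)</code>
would store the path in a variable for the copy or link commands in the next step.</p>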
1571
<p>Finally, verify that the index can be searched from the command line:</p>
1572
<pre class="pre-section"> ~/web_index$ swish-e -w installing -m3
# SWISH format: 2.4.0
# Search words: installing
# Removed stopwords:
# Number of hits: 17
# Search time: 0.018 seconds
# Run time: 0.050 seconds
1000 http://localhost/apache_docs/install.html "Compiling and Installing Apache" 17960
718 http://localhost/apache_docs/install-tpf.html "Installing Apache on TPF" 25734
680 http://localhost/apache_docs/windows.html "Using Apache with Microsoft Windows" 27165
.</pre>
1583
<p>Now, try limiting the search to the title:</p>
1584
<pre class="pre-section"> ~/web_index$ swish-e -w swishtitle=installing -m3
# SWISH format: 2.4.0
# Search words: swishtitle=installing
# Removed stopwords:
# Number of hits: 2
# Search time: 0.018 seconds
# Run time: 0.048 seconds
1000 http://localhost/apache_docs/install-tpf.html "Installing Apache on TPF" 25734
1000 http://localhost/apache_docs/install.html "Compiling and Installing Apache" 17960
.</pre>
1594
<p>Note that the above can also be done using the <code>-t</code> option:</p>
1595
<pre class="pre-section"> ~/web_index$ swish-e -w installing -m3 -tH</pre>
1597
<li><a name="item_4"></a><a name="4"></a><b>4 Set up the CGI script</b>
1598
<p>Swish-e does not include a web server.
1599
So, you must use your locally installed web server.
1600
Apache is highly recommended, of course.</p>
1601
<p>Locate your web server's CGI directory.
1602
This may be a <code>cgi-bin</code> directory in your home directory
1603
or a central <code>cgi-bin</code> directory set up by the web server administrator.
1604
Once this is located,
1605
copy the <code>swish.cgi</code> script into the <code>cgi-bin</code> directory.</p>
1606
<p>Where CGI scripts can be located
1607
depends completely on the web server that is being used
1608
and how it has been configured.
1609
See your web server's documentation or your site's administrator
1610
for additional information.</p>
1611
<p>This example will use a site <code>cgi-bin</code> directory,
1612
located at <code>/usr/lib/cgi-bin</code>.
1613
Copy the <code>swish.cgi</code> script into the <code>cgi-bin</code> directory.
1614
Again, we will need the location of the <code>libexecdir</code> directory:</p>
1615
<pre class="pre-section"> ~/web_index$ swish-e -h | grep libexecdir
1616
Scripts and Modules at: (libexecdir) = /usr/local/lib/swish-e
1618
~/web_index$ cd /usr/lib/cgi-bin
1619
/usr/lib/cgi-bin$ su
1621
/usr/lib/cgi-bin# cp /usr/local/lib/swish-e/swish.cgi .</pre>
1622
<p>If your operating system supports symbolic links
1623
<b>and</b> your web server allows programs to be symbolic links,
1624
then you may wish to create a link to the <code>swish.cgi</code> program, instead.</p>
1625
<pre class="pre-section"> /usr/lib/cgi-bin# ln -s /usr/local/lib/swish-e/swish.cgi</pre>
1626
<p>We need to tell the <code>swish.cgi</code> script where to look
1627
for the index created in the previous step.
1628
It's also recommended to enter the path to the swish-e binary.
1629
Otherwise, the <code>swish.cgi</code> script will look for the binary in the <code>PATH</code>,
1630
and that may change when running under the CGI environment.</p>
1631
<p>Here's the configuration file:</p>
1632
<pre class="pre-section"> /usr/lib/cgi-bin# cat .swishcgi.conf
return {
title => 'Search Apache Documentation',
swish_binary => '/usr/local/bin/swish-e',
swish_index => '/home/moseley/web_index/index.swish-e',
}</pre>
1638
<p>Now, test the script from the command line (as a normal user!):</p>
1639
<pre class="pre-section"> /usr/lib/cgi-bin# exit
/usr/lib/cgi-bin$ ./swish.cgi | head
Content-Type: text/html; charset=ISO-8859-1

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
Search Apache Documentation</pre>
1653
<p>Notice that the CGI script returns the HTTP header (Content-Type)
1654
and the body of the web page,
1655
just as a well-behaved CGI script should.</p>
1656
<p>Now, test using the web server
1657
(this step depends on the location of your <code>cgi-bin</code> directory).
1658
This example uses the "GET" command that is part of the LWP Perl library,
1659
but any web browser can run this test.</p>
1660
<pre class="pre-section"> /usr/lib/cgi-bin$ GET http://localhost/cgi-bin/swish.cgi | head
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
Search Apache Documentation</pre>
1670
<p>The script reports errors to stderr,
1671
so consult the web server's error log if problems occur.
1672
The message "Service currently unavailable",
1673
reported by running <code>swish.cgi</code>,
1674
typically indicates a configuration error;
1675
the exact problem will be listed in the web server's error log.</p>
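<p>Before looking at the web server, the command-line test can also be automated.
The sketch below checks that the first line of the script's output is a <code>Content-Type</code> header,
as in the command-line run shown earlier.
The function name <code>has_content_type_header</code> is made up for this example,
and the match is deliberately simple (it is case-sensitive and looks only at the first line).</p>

```shell
#!/bin/sh
# has_content_type_header: succeed if the first line on standard input
# starts with "Content-Type:".
# (has_content_type_header is a hypothetical helper, not part of Swish-e.)
has_content_type_header() {
    # Accept a final line without a trailing newline; fail on empty input.
    IFS= read -r first_line || [ -n "$first_line" ] || return 1
    case $first_line in
        Content-Type:*) return 0 ;;
        *) return 1 ;;
    esac
}

# Sample input modelled on the swish.cgi output shown earlier.
if printf 'Content-Type: text/html; charset=ISO-8859-1\n\n<!DOCTYPE HTML>\n' |
    has_content_type_header
then
    echo "header looks fine"
else
    echo "no Content-Type header on the first line"
fi
```

<p>From the <code>cgi-bin</code> directory this would be run as <code>./swish.cgi | has_content_type_header</code>.
It does not apply to the <code>GET</code> test, whose output above begins with the page body.</p>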
1676
<p>Detailed instructions on using the <code>swish.cgi</code> script
1677
and debugging tips can be found by running:</p>
1678
<pre class="pre-section"> $ perldoc swish.cgi</pre>
1679
<p>while in the <code>cgi-bin</code> directory where <code>swish.cgi</code> was copied. </p>
1680
<p>The spider program <code>spider.pl</code> also has a large number
1681
of configuration options.</p>
1682
<p>Documentation is also available
1683
in the directory <code>$prefix/share/doc/swish-e</code> or at
1684
<a href="http://swish-e.org">http://swish-e.org</a>.</p>
1685
<p>Note: Also check out the <code>search.cgi</code> script,
1686
found at the same location as the <code>swish.cgi</code> script.
1687
This is more of a skeleton script,
1688
for those that want to create a custom search script.</p>
1691
<p>Now you are ready to search.</p>
1695
<div class="sub-section">
1697
<h1><a name="indexing_other_types_of_documents_filtering"></a>Indexing Other Types of Documents - Filtering</h1>
1699
<p>Swish-e can only index HTML, XML, and text documents.
1700
In order to index other documents,
1701
such as PDF or MS Word documents,
1702
you must use a utility to convert or "filter" those documents.</p>
1703
<p>How documents are filtered with Swish-e has changed over time.
1704
This has resulted in a bit of confusion.
1705
It's also a somewhat complex process,
1706
as different programs need to communicate with each other.</p>
1707
<p>You may wish to read the Swish-e FAQ question on filtering,
1708
before continuing here.
1709
<a href="swish-faq.html#how_do_i_filter_documents_">How do I filter documents?</a></p>
1713
<div class="sub-section">
1715
<h2><a name="filtering_overview"></a>Filtering Overview</h2>
1717
<p>There are two ways to filter documents with Swish-e.
1718
Both are described in the <a href="swish-config.html">SWISH-CONFIG</a> man page.
1719
One uses the <code>FileFilter</code> directive;
the other uses the <code>SWISH::Filter</code> Perl module.</p>
1721
<p>The <code>FileFilter</code> directive is a general-purpose method of filtering.
1722
It allows running of an external program for each document processed
1723
(based on file extension),
1724
and requires one or more external programs.
1725
These programs open an input file, convert as needed,
1726
and write their output to standard output. </p>
1727
<p>Previous versions of Swish-e (before 2.4.0)
1728
used a collection of filter programs
1729
for converting files such as PDF or MS Word documents.
1730
These filter programs call other programs to do the work of filtering
1731
(e.g. <i>pdftotext</i> to extract the contents from PDF files).
1732
Although these filter programs are still included
1733
with the Swish-e distribution as examples,
1734
it is recommended to use the <code>SWISH::Filter</code> method, instead.</p>
1735
<p>One disadvantage of using <code>FileFilter</code>
1736
is that the filter program is run once
1737
for every document that needs to be filtered.
1738
This can slow down the indexing process <b>substantially</b>.</p>
1739
<p>The <code>SWISH::Filter</code> Perl module works very much like the old system
1740
and uses the same helper programs.
1741
Conveniently, however,
1742
it provides a single interface for filtering all types of documents.
1743
The primary advantage of <code>SWISH::Filter</code>
1744
is that it is built into the program used for spidering web sites (spider.pl),
1745
so all that's required is installing the filter programs that do the
1746
actual work of filtering (e.g. <i>catdoc</i>, <i>xpdf</i>).
1747
(The Windows binary includes some of the filter programs.)</p>
1748
<p>But, Swish-e will not use <code>SWISH::Filter</code> by default
1749
when using the file system method of indexing.
1750
To use <code>SWISH::Filter</code> when indexing by file system method (-S fs),
1751
you can use a <code>FileFilter</code> directive with the <code>swish_filter.pl</code> filter
1752
(which is just a program that uses <code>SWISH::Filter</code>)
1753
or use the <code>-S prog</code> method of indexing
1754
and use the <code>DirTree.pl</code> program for fetching documents.</p>
1755
<p><code>DirTree.pl</code> is included with the Swish-e distribution
1756
and is designed to work with <code>SWISH::Filter</code>.
1757
Using DirTree.pl will likely be a faster way to index,
1758
since the <code>SWISH::Filter</code> set of modules
1759
does not need to be compiled for every document
1760
that needs to be filtered.</p>
1761
<p>See the contents of <code>swish_filter.pl</code> and <code>DirTree.pl</code>
1762
for specifics on their use.</p>
1766
<div class="sub-section">
1768
<h2><a name="filtering_examples"></a>Filtering Examples</h2>
1770
<p>The <code>FileFilter</code> directive can be used in your config file
1771
to convert documents, based on their extensions.
1772
This is the old way of filtering,
1773
but provides an easy way to add filters to Swish-e.</p>
<p>For example:</p>
<pre class="pre-section"> FileFilter .pdf pdftotext "'%p' -"
 IndexContents TXT* .pdf</pre>
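<p>will cause all <code>.pdf</code> files to be filtered through the <i>pdftotext</i> program
(part of the <i>Xpdf</i> package),
and the resulting output (from <i>pdftotext</i>)
to be parsed with the text ("TXT") parser.
Here <code>%p</code> is replaced with the path of the file being indexed,
and the trailing <code>-</code> tells <i>pdftotext</i> to write to standard output.</p>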
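<p>The same pattern should carry over to other helper programs
that write plain text to standard output.
A minimal sketch for MS Word files,
assuming <i>catdoc</i> is installed and on your <code>PATH</code>
and that it prints the converted text to standard output when given a file name:</p>
<pre class="pre-section"> # Assumption: catdoc is installed and writes the converted text to stdout.
 FileFilter .doc catdoc "'%p'"
 IndexContents TXT* .doc</pre>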
<p>The other way to filter documents
is to use a <code>-S prog</code> program
and convert the documents before passing them on to Swish-e.</p>
<p>For example, <code>spider.pl</code> makes use of the <code>SWISH::Filter</code> Perl module,
included with the Swish-e distribution.
<code>SWISH::Filter</code> is passed a document and the document's content type;
it looks for modules and utilities to convert the document
into one of the types that Swish-e can index.</p>
<p>Swish-e comes ready to index PDF, MS Word, MP3 ID3 tags,
and MS Excel file types.
However, these filters need extra modules or tools to do the actual conversion.</p>
<p>For example, the Swish-e distribution includes a module called
<code>SWISH::Filter::Pdf2HTML</code>
that uses the <i>pdftotext</i> and <i>pdfinfo</i> utilities
provided by the <i>Xpdf</i> package.</p>
<p>This means that if you are using <code>spider.pl</code>
to spider your web site
and you wish to index PDF documents,
all that is needed is to install the <i>Xpdf</i> package;
Swish-e (with the help of <code>spider.pl</code>) will then begin indexing your PDF files.</p>
<p>OK, so what does all that mean?
For a very simple site,
you should be able to run this:</p>
<pre class="pre-section"> $ /usr/local/lib/swish-e/spider.pl default http://localhost/ | swish-e -S prog -i stdin</pre>
<p>This runs the spider with the default spider settings,
indexes the web server on localhost,
and pipes the spider's output into Swish-e (using the default indexing settings).
Documents will be filtered automatically
if you have the required helper applications installed.</p>
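<p>The same run can probably also be expressed in a config file
rather than as a shell pipeline.
The following is a sketch only:
the file name <code>swish.conf</code> is hypothetical,
and it assumes that <code>spider.pl</code> is installed at the path shown
and accepts the same arguments when Swish-e starts it.</p>
<pre class="pre-section"> # swish.conf (hypothetical file name)
 # With -S prog, IndexDir names the program that supplies the documents.
 IndexDir /usr/local/lib/swish-e/spider.pl
 # Arguments handed to spider.pl: default spider settings, then the start URL.
 SwishProgParameters default http://localhost/</pre>
<p>With that file in place, <code>swish-e -S prog -c swish.conf</code>
would let Swish-e start the spider itself instead of reading from <code>stdin</code>.</p>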
<p>Most people will not want to use just the default settings
(for one thing, the spider will take a while
because its default is to delay a few seconds between requests).
So read the documentation for <code>spider.pl</code>
to learn how to use a spider config file.
Also read <a href="swish-config.html">SWISH-CONFIG</a>
to learn which configuration options can be used with Swish-e.</p>
<p>The <code>SWISH::Filter</code> documentation provides more details on filtering
and hints for debugging problems when filtering.</p>
<div class="sub-section">
<h1><a name="document_info"></a>Document Info</h1>
<p>$Id: INSTALL.pod,v 1.42 2005/02/13 22:57:07 whmoseley Exp $</p>
<!-- ##### Footer ##### -->
<span class="doNotPrint">
Swish-e is distributed with <strong>no warranty</strong> under the terms of the <br />
<a href='/license.html'>Swish-e License</a>.<br />
Questions may be posted to the
<a href="http://swish-e.org/discuss.html" title="email list and list archive">Swish-e Discussion list</a>.
<SCRIPT type='text/javascript' language='JavaScript'
src='http://www.ohloh.net/projects/3196;badge_js'></SCRIPT>
<strong>URI »</strong> http://swish-e.org/
• <strong>Updated »</strong> Sun, 13 Feb 2005 22:57:07 UTC