## Run this program with the -man option for documentation ##

# This is set to where Swish-e's "make install" installed the helper modules.
use lib qw( @@perlmoduledir@@ );

use File::Find; # for recursing a directory tree
#--------------- User Configuration Section ------------------------
# File extensions for files that are text
# even though SWISH::Filter thinks they might be binary

my @not_binary_extensions = qw/

# Subroutine to validate file names: return true if file is ok to process
# or false to skip the file.
# The first parameter passed in is the

return 1; # return true to process

return 1; # return true to process this directory

#-------------------- End User Config ------------------------------------
my $extensions = join '|', map { quotemeta } @not_binary_extensions;
my $textre = qr/($extensions)$/;

GetOptions( \%options,

pod2usage( -verbose => 2 ) if $options{man};

if ( $options{path} ) {
    print '@@perlmoduledir@@',"\n";

pod2usage("Must supply at least one directory") unless @ARGV;

$ENV{FILTER_DEBUG} = 1 if $options{debug};

# See perldoc File::Find for information on following symbolic links
# and other important topics.
use constant DEBUG => 0;

# Try to load the filter module
eval { require SWISH::Filter };
# Avoid "my ... unless": a declaration with a statement modifier is unreliable
my $filter = $@ ? undef : SWISH::Filter->new;

no_chdir => 1, # 5.6 feature
follow => $options{follow_symlinks},
my $path = $File::Find::name;

if ( !check_dir( $path ) ) {
    $File::Find::prune = 1;
    warn "Skipped dir [$path] by user function check_dir()\n"
        if $options{verbose} || $options{debug};

warn "$File::Find::name is not readable\n";

my $mtime = (stat _ )[9];

if ( !check_path( $path ) ) {
    warn "Skipped path [$path] by user function check_path()\n"
        if $options{verbose} || $options{debug};
my $doc = $filter->convert(

if ( $options{no_skip} ) {
    process_file( $path, $mtime );

warn "Failed [$path] SWISH::Filter->convert failed.\n"
    if $options{verbose};

if ( $doc->is_binary && $path !~ /$textre/ ) { # ignore "binary" files (not text/* mime type)
    warn "Skipping [$path] due to content type: " . $doc->content_type .": may be binary\n"
        if $options{verbose};

my $bytes = output_document( $path, $doc->fetch_doc, $mtime, $doc->swish_parser_type );

if ( $options{verbose} ) {
    print STDERR "Indexed [$path] ",
        ($doc->was_filtered ? "(Was filtered) " : "(Not filtered) "),
        $doc->content_type . " ",
        ($doc->swish_parser_type || '(parser unspecified)'),

# Otherwise, fetch document manually
process_file( $path, $mtime );
my ( $path, $mtime ) = @_;

# Three-argument open so characters in $path are never read as an open mode
unless ( open FH, '<', $path ) {
    warn "Failed to open '$path': $!\n";

my $bytes = output_document( $path, \$content, $mtime );

if ( $options{verbose} ) {
    print STDERR "Indexed [$path] (not processed with SWISH::Filter) ($bytes bytes)\n";
sub output_document {
    my ( $path, $content_ref, $mtime, $parser_type ) = @_;

    # Get the length of the content - have to worry about multi-byte content
    # ugly and maybe expensive, but perhaps more portable than "use bytes"
    my $bytecount = length pack 'C0a*', $$content_ref;

    my $header = "Path-Name: $path\nContent-Length: $bytecount\nLast-Mtime: $mtime\n";
    $header .= "Document-Type: $parser_type\n" if $parser_type;

    print $header . "\n" . $$content_ref;
DirTree.pl - program to fetch local documents for Swish-e

    DirTree.pl [options] directory <directory...> | swish-e -S prog -i stdin

    -verbose    Display processing info
    -debug      Enable debugging (including SWISH::Filter debugging)
    -man        Display documentation
    -path       Display the lib path set at installation
    -no_skip    Process documents even if filtering fails
    -symlinks   Follow symbolic links. Default is to NOT follow symlinks
DirTree.pl is an example Perl script that can be used with Swish-e to
fetch documents from the local file system. It works somewhat like
Swish-e's default -S fs input method (reading from the file system).
DirTree.pl will attempt to load the SWISH::Filter module for use in filtering
documents (e.g. PDF or MS Word).

DirTree.pl is a thin wrapper around Perl's File::Find module. Before modifying
this script for your own use please read the documentation for File::Find:

IMPORTANT: By default DirTree.pl will attempt to index all files in the
directories and sub-directories supplied. It's expected that you will
customize this script for your own needs.
When using -S prog, many of the swish-e config file features for selecting
or excluding files will have no effect.
It's expected that checks on files will be added to the DirTree.pl program.
This is much more powerful and allows more control, but requires more work

There are two skeleton functions at the top of DirTree.pl that can be modified
for filtering what gets indexed: check_path() and check_dir(). Both are passed
the path or directory name as their only parameter. Return FALSE to skip
the given path or directory.
    # Skip all .wav files.
    return if $path =~ /\.wav$/;    # return false if the path ends in .wav
    return 1;                       # otherwise return true

    # Skip all directories that start with a dot (hidden dirs)
    return ! m[^\.];                # return true if does not start with a dot

These functions are called for each file or directory processed. The File::Find module also
provides a preprocess option where all the files and directories in a directory are
passed in as a list to a subroutine. This list can be filtered and passed back to
File::Find. This would be useful if, say, you wanted to skip a directory if a file
"noindex" existed in the directory. See perldoc File::Find for details.

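As a rough illustration of the preprocess option, the sketch below skips the contents of any directory that holds a file named "noindex". The helper names skip_noindex_dirs() and collect_files() are hypothetical and are not part of DirTree.pl. File::Find documents preprocess as a no-op when follow is in effect, so this approach would only apply when symbolic links are not being followed.

```perl
use strict;
use warnings;
use File::Find;
use File::Spec;

# Hypothetical preprocess callback: File::Find passes in the entries of the
# directory named in $File::Find::dir and visits only what is returned.
sub skip_noindex_dirs {
    my @entries = @_;
    my $marker = File::Spec->catfile( $File::Find::dir, 'noindex' );
    return () if -e $marker;    # marker present: visit nothing in this directory
    return @entries;
}

# Hypothetical driver: collect the plain files that would be indexed.
sub collect_files {
    my @dirs = @_;
    my @found;
    find(
        {
            wanted => sub {
                push @found, $File::Find::name if -f $File::Find::name;
            },
            preprocess => \&skip_noindex_dirs,
            no_chdir   => 1,    # same setting DirTree.pl uses
        },
        @dirs,
    );
    my @sorted = sort @found;
    return @sorted;
}
```

The directory holding the marker is still passed to the wanted function itself; only its entries are dropped.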
Filtering is the process of converting a document that swish-e cannot index into
a document that swish-e can index.

The SWISH::Filter module is used for filtering documents. SWISH::Filter is
part of the Swish-e distribution and was installed at the same time Swish-e was
installed. SWISH::Filter uses "helper" programs to do the actual filtering.
For example, to filter PDF files you would need to have the Xpdf package
installed (included with the Windows version of Swish-e). When SWISH::Filter
is first loaded it determines which filters are available.

SWISH::Filter uses the MIME::Types module to convert a file name into a MIME
type (e.g. .doc => application/msword) and that type is used to determine what
filter to use, if any. Filters convert the document to a new MIME type (e.g. the
MS Word filter might convert the document to text/html or text/plain).
After filtering, this program (DirTree.pl) checks to see if the file is a binary
file. This is a very simple test that only looks for "text/" at the start
of the MIME type. Clearly, this is incorrect for many MIME types. For example,
if you were indexing Perl scripts of type "application/x-perl" this program
would think the file was binary and not index it.

At the top of the program is a list of file endings that tell DirTree.pl that
they should be indexed even if their MIME type does not start with "text/".
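
The check described above can be sketched roughly as follows. The extension list here is illustrative only (the real entries are whatever is configured at the top of the program), and looks_like_text() is a hypothetical helper that stands in for the is_binary test combined with the extension match.

```perl
use strict;
use warnings;

# Illustrative entries only; not the list shipped with DirTree.pl.
my @not_binary_extensions = qw/ .pl .pm .sh /;

# Same construction the program uses: escape each ending, then anchor at the end.
my $extensions = join '|', map { quotemeta } @not_binary_extensions;
my $textre     = qr/($extensions)$/;

# Hypothetical helper: true if the document should be indexed as text.
sub looks_like_text {
    my ( $path, $content_type ) = @_;
    return 1 if $content_type =~ m{^text/};    # the simple MIME type test
    return $path =~ $textre ? 1 : 0;           # fall back to the file ending
}
```

With this list, a file such as lib/Foo.pm reported as "application/x-perl" would be indexed, while an unlisted ending with a non-text MIME type would be skipped.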

Another problem is some files will not map to a MIME type. The best solution
is to add the file ending and MIME type to your mime.types file. But, if you just
want to index any file that does not have a MIME type use the -no_skip option.

To use the SWISH::Filter module you will need the helper applications installed.
Check with your OS packages or Google for sources.

    PDF conversion requires the Xpdf package
    MS Word conversion requires the Catdoc package

The Windows version of Swish-e includes Xpdf and Catdoc packages.

For content type matching install the Perl MIME::Types module.

A few options may be passed to DirTree.pl:

Produces information about each file as it is processed.

Enables detailed debugging. SWISH::Filter debugging is also enabled.

When set, documents that fail processing with SWISH::Filter will still
be processed. Typically this means documents where a content-type could not be determined.
Make sure you have the MIME::Types module installed.

When specified, DirTree.pl will recurse into directories that are symbolic links.
The default is to NOT recurse into symbolic links. This option sets the "follow"
option in the File::Find module.

May not work well on multi-byte input files.

In order to work on Windows (where two chars are used to terminate lines)
this program reads the ENTIRE file into memory so that an accurate byte count
can be made. Therefore, it's probably a good idea not to index files that are too big.
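
A minimal sketch of why the whole file is read first: the Content-Length header must state the exact number of bytes that follow it. The helpers slurp_raw() and format_document() are hypothetical and only mirror the header layout the program prints. Opening with the :raw layer is an assumption made here so that line endings are not translated and length() counts bytes; DirTree.pl itself computes the count with pack.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file as bytes, with no CRLF translation.
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Failed to open '$path': $!\n";
    local $/;                   # slurp mode: read everything in one go
    my $content = <$fh>;
    close $fh;
    return \$content;
}

# Hypothetical helper: build the headers and body for one document.
sub format_document {
    my ( $path, $content_ref, $mtime ) = @_;
    my $bytecount = length $$content_ref;    # bytes, because of :raw above
    my $header = "Path-Name: $path\nContent-Length: $bytecount\nLast-Mtime: $mtime\n";
    return $header . "\n" . $$content_ref;
}
```

A caller would print the returned string; on Windows it would presumably also need binmode on STDOUT so that the bytes written match the stated length.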

Contact the Swish-e discussion list. See: