<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<title>SWISH-Enhanced: spider.pl - Example Perl program to spider web servers</title>
<link href="./style.css" rel=stylesheet type="text/css" title="refstyle">
<a href="http://swish-e.org"><img border=0 src="images/swish.gif" alt="Swish-E Logo"></a><br>
<img src="images/swishbanner1.gif"><br>
<img src="images/dotrule1.gif"><br>
spider.pl - Example Perl program to spider web servers
<div class="navbar">
<a href="./search.html">Prev</a> |
<a href="./index.html">Contents</a> |
<a href="./Filter.html">Next</a>
</div>
<P><B>Table of Contents:</B></P>
33
<LI><A HREF="#SYNOPSIS">SYNOPSIS</A>
34
<LI><A HREF="#DESCRIPTION">DESCRIPTION</A>
37
<LI><A HREF="#Robots_Exclusion_Rules_and_being_nice">Robots Exclusion Rules and being nice</A>
38
<LI><A HREF="#Duplicate_Documents">Duplicate Documents</A>
39
<LI><A HREF="#Broken_relative_links">Broken relative links</A>
40
<LI><A HREF="#Compression">Compression</A>
43
<LI><A HREF="#REQUIREMENTS">REQUIREMENTS</A>
44
<LI><A HREF="#CONFIGURATION_FILE">CONFIGURATION FILE</A>
45
<LI><A HREF="#CONFIGURATION_OPTIONS">CONFIGURATION OPTIONS</A>
46
<LI><A HREF="#CALLBACK_FUNCTIONS">CALLBACK FUNCTIONS</A>
49
<LI><A HREF="#More_on_setting_flags">More on setting flags</A>
52
<LI><A HREF="#SIGNALS">SIGNALS</A>
53
<LI><A HREF="#COPYRIGHT">COPYRIGHT</A>
54
<LI><A HREF="#SUPPORT">SUPPORT</A>
61
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
65
<H1><A NAME="SYNOPSIS">SYNOPSIS</A></H1>
<P>In your swish-e configuration file (e.g. <CODE>swish.config</CODE>):

<pre> SwishProgParameters spider.config
 # other swish-e settings</pre>
<P>In the spider configuration file (e.g. <CODE>spider.config</CODE>):

<pre> @servers = (
     {
         base_url => '<A HREF="http://myserver.com/">http://myserver.com/</A>',
         email    => 'me@myself.com',
         # other spider settings described below
     },
 );</pre>
<P>Then begin indexing:

<pre> swish-e -S prog -c swish.config</pre>

<P>Note: When running on some versions of Windows (e.g. Win ME and Win 98 SE)
you may need to index using the command:

<pre> perl spider.pl | swish-e -S prog -c swish.conf -i stdin</pre>

<P>This pipes the output of the spider directly into swish.
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<H1><A NAME="DESCRIPTION">DESCRIPTION</A></H1>

<P>This is a swish-e ``prog'' document source program for spidering web
servers. It can be used instead of the <A HREF="#item_http">http</A> method for spidering with swish.

<P>The spider typically uses a configuration file that lists the
<CODE>URL(s)</CODE> to spider, and configuration parameters that control
the behavior of the spider. In addition, you may define <EM>callback</EM> Perl functions in the configuration file that can dynamically change the
behavior of the spider based on the URL, HTTP response headers, or the content
of the fetched document. These callback functions can also be used to
filter or convert documents (e.g. PDF, gzip, MS Word) into a format that
swish-e can parse. Some examples are provided.
<P>You define ``servers'' to spider, set a few parameters, create callback
routines, and start indexing as the synopsis above shows. The spider
requires its own configuration file (unless you want the default values).
This is NOT the same configuration file that swish-e uses.

<P>The example configuration file <CODE>SwishSpiderConfig.pl</CODE> is included in the <CODE>prog-bin</CODE> directory along with this script. Please use it only as an example, as it
contains more settings than you probably want to use. Start with a tiny
config file and add settings as required by your situation.

<P>The available configuration parameters are discussed below.

<P>If all that sounds confusing, you can run the spider with default
settings. In fact, you can run the spider without using swish just to make
sure it works. Just run
<pre> ./spider.pl default <A HREF="http://someserver.com/sometestdoc.html">http://someserver.com/sometestdoc.html</A></pre>

<P>You should see <EM>sometestdoc.html</EM> dumped to your screen. Be ready to kill the script if the file you request
contains links, as the output from the fetched pages will be displayed. In that case

<pre> ./spider.pl default <A HREF="http://someserver.com/sometestdoc.html">http://someserver.com/sometestdoc.html</A> > output.file</pre>

might be more friendly.
<P>If the first parameter passed to the spider is the word ``default'' (as in
the preceding example) then the spider uses the default parameters, and
the following <CODE>parameter(s)</CODE> are expected to be
<CODE>URL(s)</CODE> to spider. Otherwise, the first parameter is considered
to be the name of the configuration file (as described below). When using <CODE>-S prog</CODE>, the swish-e configuration setting <A HREF="#item_SwishProgParameters">SwishProgParameters</A> is used to pass parameters to the program specified with <A HREF="#item_IndexDir">IndexDir</A> or the <CODE>-i</CODE> switch.

<P>If you do not specify any parameters the program will look for the file
<pre> SwishSpiderConfig.pl</pre>

in the current directory.

<P>The spider requires Perl's LWP library and a few other reasonably
common modules. Most well-maintained systems should have these modules
installed. If not, start here:
<pre> <A HREF="http://search.cpan.org/search?dist=libwww-perl">http://search.cpan.org/search?dist=libwww-perl</A>
 <A HREF="http://search.cpan.org/search?dist=HTML-Parser">http://search.cpan.org/search?dist=HTML-Parser</A></pre>

<P>See more below in <CODE>REQUIREMENTS</CODE>. It's a good idea to check that you are running a current version of these
modules.
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<H2><A NAME="Robots_Exclusion_Rules_and_being_nice">Robots Exclusion Rules and being nice</A></H2>

<P>By default, this script will not spider files blocked by <EM>robots.txt</EM>. In addition, the script will check for &lt;meta name=&quot;robots&quot;&gt;
tags, which allow finer control over what files are indexed and/or
spidered. See <A HREF="http://www.robotstxt.org/wc/exclusion.html">http://www.robotstxt.org/wc/exclusion.html</A>
for the robots exclusion rules.

<P>This spider provides an extension to the &lt;meta&gt; tag exclusion, by adding a
<CODE>NOCONTENTS</CODE> attribute. This attribute turns on the <CODE>no_contents</CODE> setting, which asks swish-e to index only the document's title (or file
name if no title is found).
<P>For example, the tag

<pre> &lt;META NAME="ROBOTS" CONTENT="NOCONTENTS, NOFOLLOW"&gt;</pre>

says to just index the document's title, but don't index its contents, and
don't follow any links within the document. Granted, it's unlikely that
this feature will ever be used...
<P>If you are indexing your own site, and know what you are doing, you can
disable robot exclusion with the <A HREF="#item_ignore_robots_file">ignore_robots_file</A> configuration parameter, described below. This disables both <EM>robots.txt</EM> and the meta tag parsing. You may disable just the meta tag parsing by
using <CODE>ignore_robots_headers</CODE>.
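<P>For example, a minimal sketch of a server hash with robot exclusion disabled (the host is hypothetical; both keys are the options just mentioned and described below):

<pre> my %server = (
     base_url           => '<A HREF="http://myserver.com/">http://myserver.com/</A>',
     email              => 'me@myself.com',
     ignore_robots_file => 1,  # skip both robots.txt and meta robots tags
     # ignore_robots_headers => 1,  # or: honor robots.txt, ignore meta tags only
 );</pre>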
<P>This script only spiders one file at a time, so the load on the web server is
not that great. And with libwww-perl-5.53_91, HTTP/1.1 keep-alive requests
can reduce the load on the server even more (and potentially reduce
spidering time considerably!).
<P>Still, discuss spidering with a site's administrator before beginning. Use
the <A HREF="#item_delay_sec">delay_sec</A> setting to adjust how fast the spider fetches documents. Consider running a second
web server with a limited number of children if you really want to fine-tune
the resources used by spidering.
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<H2><A NAME="Duplicate_Documents">Duplicate Documents</A></H2>

<P>The spider program keeps track of URLs visited, so a document is only
indexed one time.

<P>The Digest::MD5 module can be used to create a ``fingerprint'' of every
page indexed, and this fingerprint is used in a hash to find duplicate
pages. For example, MD5 will prevent indexing these as two different pages:
<pre> <A HREF="http://localhost/path/to/some/index.html">http://localhost/path/to/some/index.html</A>
 <A HREF="http://localhost/path/to/some/">http://localhost/path/to/some/</A></pre>

<P>But note that this may have side effects you don't want. If you want this
file indexed under this URL:
<pre> <A HREF="http://localhost/important.html">http://localhost/important.html</A></pre>

but the spider happens to find the exact same content in this file first:

<pre> <A HREF="http://localhost/developement/test/todo/maybeimportant.html">http://localhost/developement/test/todo/maybeimportant.html</A></pre>

then only that URL will be indexed.
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<H2><A NAME="Broken_relative_links">Broken relative links</A></H2>

<P>Sometimes web page authors use too many <CODE>/../</CODE> segments in relative URLs, which reference documents above the document
root. Some web servers, such as Apache, will return a 400 Bad Request when
a document above the root is requested. Other web servers, such as Microsoft
IIS/5.0, will try to ``correct'' these errors. This correction will lead to
loops when spidering.
<P>The spider can fix these above-root links by placing the following in your
spider configuration file:

<pre> remove_leading_dots => 1,</pre>

<P>It is not on by default so that the spider can report the broken links (as
400 errors on sane web servers).
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<H2><A NAME="Compression">Compression</A></H2>

<P>If the Perl module Compress::Zlib is installed, the spider will send the

<pre> Accept-Encoding: gzip</pre>

header and uncompress the document if the server returns the header

<pre> Content-Encoding: gzip</pre>
<P>MD5 checksums are computed on the compressed data.

<P>MD5 may slow down indexing a tiny bit, so test with and without it if speed is
an issue (which it probably isn't, since you are spidering in the first
place). This feature will also use more memory.
<P>Note: the ``prog'' document source in swish bypasses many swish-e
configuration settings. For example, you cannot use the <A HREF="#item_IndexOnly">IndexOnly</A> directive with the ``prog'' document source. This is by design, to limit the
overhead when using an external program to provide documents to swish;
after all, with ``prog'', if you don't want to index a file, then don't
give it to swish to index in the first place.

<P>So, for spidering, if you do not wish to index images, for example, you
will need to filter either by the URL or by the content-type returned from
the web server. See <A HREF="#CALLBACK_FUNCTIONS">CALLBACK FUNCTIONS</A> below for more information.
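<P>For example, here is a minimal sketch of a <A HREF="#item_test_response">test_response</A> callback (the callback interface is described below) that skips anything the server reports as an image:

<pre> test_response => sub {
     my ( $uri, $server, $response ) = @_;
     return 0 if $response->content_type =~ m[^image/];  # skip images
     return 1;
 },</pre>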
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<H1><A NAME="REQUIREMENTS">REQUIREMENTS</A></H1>

<P>Perl 5 (hopefully at least 5.00503) or later.

<P>You must have the LWP Bundle on your computer. Load Bundle::LWP via the
CPAN.pm shell, or download libwww-perl-x.xx from CPAN (or via ActiveState's
ppm utility). Also required is the HTML-Parser-x.xx bundle of modules,
also from CPAN (and from ActiveState for Windows).
<pre> <A HREF="http://search.cpan.org/search?dist=libwww-perl">http://search.cpan.org/search?dist=libwww-perl</A>
 <A HREF="http://search.cpan.org/search?dist=HTML-Parser">http://search.cpan.org/search?dist=HTML-Parser</A></pre>

<P>You will also need Digest::MD5 if you wish to use the MD5 feature.
HTML::Tagset is also required. Other modules may be required (for example,
the pod2xml.pm module has its own requirements -- see perldoc pod2xml for
details).
<P>The spider.pl script, like everyone else, expects perl to live in
/usr/local/bin. If this is not the case then either add a symlink at
/usr/local/bin/perl pointing to where perl is installed or modify the
shebang (#!) line at the top of the spider.pl program.
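<P>For example, assuming perl actually lives in /usr/bin:

<pre> ln -s /usr/bin/perl /usr/local/bin/perl</pre>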
<P>Note that the libwww-perl package does not support SSL (Secure Sockets
Layer) (https) by default. See <EM>README.SSL</EM> included in the libwww-perl package for information on installing SSL
support.
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<H1><A NAME="CONFIGURATION_FILE">CONFIGURATION FILE</A></H1>

<P>Configuration is not very fancy. The spider.pl program simply does a
<CODE>do "path";</CODE> to read in the parameters and create the callback subroutines. The <CODE>path</CODE> is the first parameter passed to the spider script, which is set by the
Swish-e configuration setting <A HREF="#item_SwishProgParameters">SwishProgParameters</A>.

<P>For example, if in your swish-e configuration file you have
<pre> SwishProgParameters /path/to/config.pl
 IndexDir /home/moseley/swish-e/prog-bin/spider.pl</pre>

and then run swish as

<pre> swish-e -c swish.config -S prog</pre>

swish will run <CODE>/home/moseley/swish-e/prog-bin/spider.pl</CODE> and the spider.pl program will receive as its first parameter <CODE>/path/to/config.pl</CODE>, which it will read to get the spider configuration settings. If <A HREF="#item_SwishProgParameters">SwishProgParameters</A> is not set, the program will try to use <CODE>SwishSpiderConfig.pl</CODE> by default.
<P>There is a special case of:

<pre> SwishProgParameters default <A HREF="http://www.mysite/index.html">http://www.mysite/index.html</A> ...</pre>

where default parameters are used. This will only index documents of type
<CODE>text/html</CODE> or <CODE>text/plain</CODE>, and will skip any file with an extension that matches the pattern:

<pre> /\.(?:gif|jpeg|png)$/i</pre>
<P>This can be useful for indexing just your web documents, but you will
probably want finer control over your spidering by using a configuration
file.

<P>The configuration file must set a global variable <CODE>@servers</CODE> (in package main). Each element in <CODE>@servers</CODE> is a reference to a hash. The elements of the hash are described next. More
than one server hash may be defined -- each server will be spidered in the
order listed in <CODE>@servers</CODE>, although currently a <EM>global</EM> hash is used to prevent spidering the same URL twice.
<pre> my %serverA = (
     base_url   => '<A HREF="http://swish-e.org/">http://swish-e.org/</A>',
     same_hosts => [ qw/www.swish-e.org/ ],
     email      => 'my@email.address',
 );
 my %serverB = (
     ...
 );
 @servers = ( \%serverA, \%serverB, );</pre>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<H1><A NAME="CONFIGURATION_OPTIONS">CONFIGURATION OPTIONS</A></H1>

<P>This describes the required and optional keys in the server configuration
hash, in random order...
<P><DT><STRONG><A NAME="item_base_url">base_url</A></STRONG><DD>

This required setting is the starting URL for spidering.

<P>Typically, you will just list one URL for the base_url. You may specify
more than one URL as a reference to a list:
<pre> base_url => [qw! <A HREF="http://swish-e.org/">http://swish-e.org/</A> <A HREF="http://othersite.org/other/index.html">http://othersite.org/other/index.html</A> !],</pre>

<P>You may specify a username and password:

<pre> base_url => '<A HREF="http://user:pass@swish-e.org/index.html">http://user:pass@swish-e.org/index.html</A>',</pre>
but you may find that to be a security issue. If a URL is protected by
Basic Authentication you will be prompted for a username and password. This
might be a slightly safer way to go.

<P>The parameter <A HREF="#item_max_wait_time">max_wait_time</A> controls how long to wait for user entry before skipping the current URL.

<P>See also <A HREF="#item_credentials">credentials</A> below.
<P><DT><STRONG><A NAME="item_same_hosts">same_hosts</A></STRONG><DD>

This optional key sets equivalent <STRONG>authority</STRONG> <CODE>name(s)</CODE> for the site you are spidering. For example, if your
site is <CODE>www.mysite.edu</CODE> but can also be reached as
<CODE>mysite.edu</CODE> (with or without <CODE>www</CODE>) and also <CODE>web.mysite.edu</CODE>, then:
<pre> $serverA{base_url}   = '<A HREF="http://www.mysite.edu/index.html">http://www.mysite.edu/index.html</A>';
 $serverA{same_hosts} = ['mysite.edu', 'web.mysite.edu'];</pre>

<P>Now, if a link is found while spidering such as

<pre> <A HREF="http://web.mysite.edu/path/to/file.html">http://web.mysite.edu/path/to/file.html</A></pre>

it will be considered on the same site, and will actually be spidered and
indexed as:

<pre> <A HREF="http://www.mysite.edu/path/to/file.html">http://www.mysite.edu/path/to/file.html</A></pre>
<P>Note: This should probably be called <STRONG>same_host_port</STRONG> because it compares the URI <CODE>host:port</CODE>
against the list of host names in <A HREF="#item_same_hosts">same_hosts</A>. So, if you specify a port in your base_url you will want to include the port
in the list of hosts in <A HREF="#item_same_hosts">same_hosts</A>:
<pre> my %serverA = (
     base_url   => '<A HREF="http://sunsite.berkeley.edu:4444/">http://sunsite.berkeley.edu:4444/</A>',
     same_hosts => [ qw/www.sunsite.berkeley.edu:4444/ ],
     email      => 'my@email.address',
 );</pre>
<P><DT><STRONG><A NAME="item_email">email</A></STRONG><DD>

This required key sets the email address for the spider. Set this to your
own email address.

<P><DT><STRONG><A NAME="item_agent">agent</A></STRONG><DD>

This optional key sets the name of the spider.
<P><DT><STRONG><A NAME="item_link_tags">link_tags</A></STRONG><DD>

This optional key is a reference to an array of tags. Only links found in
these tags will be extracted. The default is to only extract links from <CODE>a</CODE> tags.

<P>For example, to extract links from <CODE>a</CODE> tags and from <CODE>frame</CODE> tags:
<pre> my %serverA = (
     base_url   => '<A HREF="http://sunsite.berkeley.edu:4444/">http://sunsite.berkeley.edu:4444/</A>',
     same_hosts => [ qw/www.sunsite.berkeley.edu:4444/ ],
     email      => 'my@email.address',
     link_tags  => [qw/ a frame /],
 );</pre>
<P><DT><STRONG><A NAME="item_delay_sec">delay_sec</A></STRONG><DD>

This optional key sets the delay in seconds to wait between requests. See
the LWP::RobotUA man page for more information. The default is 5 seconds.
Set to zero for no delay.

<P>When using the keep_alive feature (recommended) the delay will be used only
when the previous request returned a ``Connection: closed'' header.

<P>Note: A common recommendation is to use a delay of one minute between
requests. For example, one minute is the default used in the LWP::RobotUA
module.
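<P>For example (the values are only illustrations):

<pre> delay_sec => 60,   # be nice: one minute between requests
 # delay_sec => 0,  # no delay -- only reasonable on your own server</pre>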
<P><DT><STRONG><A NAME="item_delay_min">delay_min (deprecated)</A></STRONG><DD>

Sets the delay to wait between requests in minutes. If both delay_sec and
delay_min are defined, delay_sec will be used.
<P><DT><STRONG><A NAME="item_max_wait_time">max_wait_time</A></STRONG><DD>

This setting is the number of seconds to wait for data to be returned from
the request. Data is returned in chunks to the spider, and the timer is
reset each time a new chunk is reported. Therefore, documents (requests)
that take longer than this setting should not be aborted as long as some
data is received every max_wait_time seconds. The default is 30 seconds.

<P>NOTE: This option has no effect on Windows.
<P><DT><STRONG><A NAME="item_max_time">max_time</A></STRONG><DD>

This optional key sets the maximum number of minutes to spider. Spidering for this
host will stop after <A HREF="#item_max_time">max_time</A> minutes, and move on to the next server, if any. The default is to not limit
spidering time.
<P><DT><STRONG><A NAME="item_max_files">max_files</A></STRONG><DD>

This optional key sets the maximum number of files to spider before aborting.
The default is to not limit by number of files. This is the number of
requests made to the remote server, not the total number of files to index
(see <A HREF="#item_max_indexed">max_indexed</A>). This count is displayed at the end of indexing as <CODE>Unique URLs</CODE>.

<P>This feature can (and perhaps should) be used when spidering a web site
where dynamic content may generate unique URLs, to prevent run-away
spidering.
<P><DT><STRONG><A NAME="item_max_indexed">max_indexed</A></STRONG><DD>

This optional key sets the maximum number of files that will be indexed. The
default is to not limit. This is the number of files sent to swish for
indexing (and is reported by <CODE>Total Docs</CODE> when spidering ends).
<P><DT><STRONG><A NAME="item_max_size">max_size</A></STRONG><DD>

This optional key sets the maximum size of a file read from the web server.
This <STRONG>defaults</STRONG> to 5,000,000 bytes. If the size is exceeded, the resource is skipped and a
message is written to STDERR if the DEBUG_SKIPPED debug flag is set.
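<P>As a sketch, these limits might be combined in a server hash like this (values are illustrative only):

<pre> max_time    => 60,         # stop spidering this server after an hour
 max_files   => 5000,       # at most 5000 requests to the server
 max_indexed => 2000,       # send at most 2000 documents to swish
 max_size    => 5_000_000,  # the default maximum document size</pre>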
<P><DT><STRONG><A NAME="item_keep_alive">keep_alive</A></STRONG><DD>

This optional parameter will enable keep-alive requests. This can
dramatically speed up spidering and reduce the load on the server being
spidered. The default is to not use keep-alives, although enabling it will
probably be the right thing to do.

<P>To get the most out of keep-alives, you may want to set up your web server
to allow a lot of requests per single connection (e.g. MaxKeepAliveRequests
on Apache). Apache's default is 100, which should be good.

<P>When a connection is not closed the spider does not wait the ``delay_sec''
time when making the next request. In other words, there is no delay in
requesting documents while the connection is open.

<P>Note: try to filter as many documents as possible <STRONG>before</STRONG> making the request to the server. In other words, use <A HREF="#item_test_url">test_url</A> to look for files ending in <CODE>.html</CODE> instead of using <A HREF="#item_test_response">test_response</A> to look for a content type of <CODE>text/html</CODE> if possible. Do note that aborting a request from <A HREF="#item_test_response">test_response</A> will break the current keep-alive connection.

<P>Note: you must have at least libwww-perl-5.53_90 installed to use this
feature.
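<P>For example:

<pre> keep_alive => 1,  # reuse the connection between requests</pre>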
<P><DT><STRONG><A NAME="item_skip">skip</A></STRONG><DD>

This optional key can be used to skip the current server. Its only purpose
is to make it easy to disable a server in a configuration file.
<P><DT><STRONG><A NAME="item_debug">debug</A></STRONG><DD>

Set this to a number to display different amounts of info while spidering.
Writes info to STDERR. Zero/undefined means no debug output.

<P>The following constants are defined for debugging. They may be or'ed
together to get the individual debugging of your choice.

<P>Here are the levels:
<pre> DEBUG_ERRORS   general program errors (not used at this time)
 DEBUG_URL      print out every URL processed
 DEBUG_HEADERS  prints the response headers
 DEBUG_FAILED   failed to return a 200
 DEBUG_SKIPPED  didn't index for some reason
 DEBUG_INFO     more verbose
 DEBUG_LINKS    prints links as they are extracted</pre>
<P>For example, to display the URLs processed, failed, and skipped, use:

<pre> debug => DEBUG_URL | DEBUG_FAILED | DEBUG_SKIPPED,</pre>

<P>To display the returned headers:

<pre> debug => DEBUG_HEADERS,</pre>
<P>You can easily run the spider without swish for debugging purposes:

<pre> ./spider.pl test.config > spider.out</pre>

<P>You will see debugging info as it runs, and the fetched documents will
be saved in the <CODE>spider.out</CODE> file.
<P>Debugging can also be set by an environment variable when running swish.
This will override any setting in the configuration file. Set the variable
SPIDER_DEBUG when running the spider. You can specify any of the above
debugging options, separated by commas.

<P>For example, with a Bourne-type shell:

<pre> SPIDER_DEBUG=url,links</pre>
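<P>Since the environment is inherited by the spider when swish runs it, a full command might look like this (a sketch; the file names are hypothetical):

<pre> SPIDER_DEBUG=url,links swish-e -S prog -c swish.config</pre>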
<P><DT><STRONG><A NAME="item_quiet">quiet</A></STRONG><DD>

If this is true then normal, non-error messages will be suppressed. Quiet
mode can also be set by setting the environment variable SPIDER_QUIET to
a true value:

<pre> SPIDER_QUIET=1</pre>
<P><DT><STRONG><A NAME="item_max_depth">max_depth</A></STRONG><DD>

The <A HREF="#item_max_depth">max_depth</A> parameter can be used to limit how deeply to recurse a web site. The depth
is just a count of levels of web pages descended, and is not related to the
number of path elements in a URL.

<P>A max_depth of zero says to only spider the page listed as the <A HREF="#item_base_url">base_url</A>. A max_depth of one will spider the <A HREF="#item_base_url">base_url</A> page, plus all links on that page, and no more. The default is to not limit
the depth.
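<P>For example (an illustrative value):

<pre> max_depth => 1,  # the base_url page and the pages it links to, no deeper</pre>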
<P><DT><STRONG><A NAME="item_ignore_robots_file">ignore_robots_file</A></STRONG><DD>

If this is set to true then the robots.txt file will not be checked when
spidering this server. Don't use this option unless you know what you are
doing.
<P><DT><STRONG><A NAME="item_use_cookies">use_cookies</A></STRONG><DD>

If this is set then a ``cookie jar'' will be maintained while spidering.
Some (poorly written ;) sites require cookies to be enabled on clients.

<P>This requires the HTTP::Cookies module.
<P><DT><STRONG><A NAME="item_use_md5">use_md5</A></STRONG><DD>

If this setting is true, then an MD5 digest ``fingerprint'' will be made
from the content of every spidered document. This digest number will be
used as a hash key to prevent indexing the same content more than once.
This is helpful if different URLs generate the same content.

<P>An obvious example: these two documents will only be indexed one time:

<pre> <A HREF="http://localhost/path/to/index.html">http://localhost/path/to/index.html</A>
 <A HREF="http://localhost/path/to/">http://localhost/path/to/</A></pre>

<P>This option requires the Digest::MD5 module. Spidering with this option
might be a tiny bit slower.
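<P>For example:

<pre> use_md5 => 1,  # skip documents whose content has already been seen</pre>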
<P><DT><STRONG><A NAME="item_validate_links">validate_links</A></STRONG><DD>

Just a hack. If you set this true, the spider will do HEAD requests on all
links (e.g. off-site links), just to make sure that all your links work.
<P><DT><STRONG><A NAME="item_credentials">credentials</A></STRONG><DD>

You may specify a username and password to be used automatically when
a server requests authorization:

<pre> credentials => 'username:password',</pre>

<P>A username and password supplied in a URL will override this setting.
<P><DT><STRONG><A NAME="item_credential_timeout">credential_timeout</A></STRONG><DD>

Sets the number of seconds to wait for user input when prompted for a
username or password. The default is 30 seconds.
<P><DT><STRONG><A NAME="item_remove_leading_dots">remove_leading_dots</A></STRONG><DD>

Removes leading dots from URLs that might reference documents above the
document root. The default is to not remove the dots.
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<H1><A NAME="CALLBACK_FUNCTIONS">CALLBACK FUNCTIONS</A></H1>

<P>Three callback functions can be defined in your parameter hash. These
optional settings are <EM>callback</EM> subroutines that are called while processing URLs.

<P>A little Perl discussion is in order:

<P>In Perl, a scalar variable can contain a reference to a subroutine. The
config example above shows that the configuration parameters are stored in
a Perl <EM>hash</EM>.
<pre> my %serverA = (
     base_url   => '<A HREF="http://sunsite.berkeley.edu:4444/">http://sunsite.berkeley.edu:4444/</A>',
     same_hosts => [ qw/www.sunsite.berkeley.edu:4444/ ],
     email      => 'my@email.address',
     link_tags  => [qw/ a frame /],
 );</pre>
<P>There are two ways to add a reference to a subroutine to this hash. The subroutine can be named:

<pre> sub foo { return 1; }

 my %serverA = (
     base_url   => '<A HREF="http://sunsite.berkeley.edu:4444/">http://sunsite.berkeley.edu:4444/</A>',
     same_hosts => [ qw/www.sunsite.berkeley.edu:4444/ ],
     email      => 'my@email.address',
     link_tags  => [qw/ a frame /],
     test_url   => \&foo,  # a reference to a named subroutine
 );</pre>
<P>Or the subroutine can be coded right in place:

<pre> my %serverA = (
     base_url   => '<A HREF="http://sunsite.berkeley.edu:4444/">http://sunsite.berkeley.edu:4444/</A>',
     same_hosts => [ qw/www.sunsite.berkeley.edu:4444/ ],
     email      => 'my@email.address',
     link_tags  => [qw/ a frame /],
     test_url   => sub { return 1; },
 );</pre>
<P>The above example is not very useful, as it just creates a user callback
function that always returns a true value (the number 1). But it's just an
example.

<P>The function calls are wrapped in an eval, so calling die (or doing
something that dies) will just cause that URL to be skipped. If you really
want to stop processing you need to set $server->{abort} in your
subroutine (or send a kill -HUP to the spider).
<P>The first two parameters passed are a URI object (giving access to the
current URL) and a reference to the current server hash. The <CODE>server</CODE> hash is just a global hash for holding data, and is useful for setting flags
as described below.

<P>Other parameters may also be passed in, depending on the callback function,
as described below. In Perl, parameters are passed in an array called
<CODE>@_</CODE>. The first element (first parameter) of that array is $_[0], the
second is $_[1], and so on. Depending on how complicated your function is,
you may wish to shift your parameters off of the <CODE>@_</CODE> list to
make working with them easier. See the examples below.
<P>To make use of these routines you need to understand when they are called,
and what changes you can make in your routines. Each routine deals with a
given step, and returning false from your routine will stop processing for
that URL.
<P><DT><STRONG><A NAME="item_test_url">test_url</A></STRONG><DD>

<A HREF="#item_test_url">test_url</A> allows you to skip processing of URLs based on the URL alone, before the request
to the server is made. This function is called for the <A HREF="#item_base_url">base_url</A> links (links you define in the spider configuration file) and for every
link extracted from a fetched web page.

<P>This function is a good place to skip links that you are not interested in
following. For example, if you know there's no point in requesting images
then you can exclude them like:
<pre> test_url => sub {
     my $uri = shift;
     return 0 if $uri->path =~ /\.(gif|jpeg|png)$/;
     return 1;
 },</pre>

<P>Or, to write it another way:

<pre> test_url => sub { $_[0]->path !~ /\.(gif|jpeg|png)$/ },</pre>
<P>Another use is with a web server where path names
are NOT case sensitive (e.g. Windows). You can normalize all links in this
situation using something like:

<pre> test_url => sub {
     my $uri = shift;
     return 0 if $uri->path =~ /\.(gif|jpeg|png)$/;

     $uri->path( lc $uri->path );  # make all path names lowercase
     return 1;
 },</pre>
<P>The important thing about <A HREF="#item_test_url">test_url</A> (compared to the other callback functions) is that it is called while <EM>extracting</EM> links, not while actually fetching that page from the web server. Returning
false from <A HREF="#item_test_url">test_url</A> simply says not to add the URL to the list of links to spider.

<P>You may set a flag in the server hash (second parameter) to tell the spider
to abort processing:
<pre> test_url => sub {
     my $server = $_[1];
     $server->{abort}++ if $_[0]->path =~ /foo\.html/;
     return 1;
 },</pre>

<P>You cannot use the server flags

<pre> no_contents
 no_index
 no_spider</pre>

in <A HREF="#item_test_url">test_url</A>. This is discussed below.
<P><DT><STRONG><A NAME="item_test_response">test_response</A></STRONG><DD>

This function allows you to filter based on the response from the remote
server (such as by content-type). This function is called while the web
page is being fetched from the remote server, typically after just enough
data has been returned to read the response headers from the web server.
<P>The spider requests a document in ``chunks'' of 4096 bytes. 4096 is only a
suggestion of how many bytes to return in each chunk. The <A HREF="#item_test_response">test_response</A> routine is called only when the first chunk is received. This allows
ignoring (aborting) the reading of a very large file, for example, without
having to read the entire file. Although not of much use, a reference to this
chunk is passed as the fourth parameter.
<P>Web servers use a Content-Type: header to define the type of data returned
from the server. On a web server you could have a .jpeg file be a web page
-- file extensions may not always indicate the type of the file. The third
parameter ($_[2]) passed is a reference to an HTTP::Response object.

<P>For example, to index only true HTML (text/html) pages:
<pre> test_response => sub {
     my $content_type = $_[2]->content_type;
     return $content_type =~ m!text/html!;
 },</pre>
<P>You can also set flags in the server hash (the second parameter) to control
processing:

<pre> no_contents -- index only the title (or file name), and not the contents
 no_index    -- do not index this file, but continue to spider if HTML
 no_spider   -- index, but do not spider this file for links to follow
 abort       -- stop spidering any more files</pre>
<P>For example, to avoid indexing the contents of ``private.html'', yet still
follow any links in that file:

<pre> test_response => sub {
     my $server = $_[1];
     $server->{no_index}++ if $_[0]->path =~ /private\.html$/;
     return 1;
 },</pre>

<P>Note: Do not modify the URI object in this callback function.
<P><DT><STRONG><A NAME="item_filter_content">filter_content</A></STRONG><DD>

This callback function is called right before sending the content to swish.
Like the other callback functions, returning false will cause the URL to be
skipped. Setting the <CODE>abort</CODE> server flag and returning false will abort spidering.

<P>You can also set the <CODE>no_contents</CODE> flag.

<P>This callback function is passed four parameters: the URI object, the server
hash, the HTTP::Response object, and a reference to the content.
<P>You can modify the content as needed. For example, you might not like
uppercase:

<pre> filter_content => sub {
     my $content_ref = $_[3];

     $$content_ref = lc $$content_ref;
     return 1;
 },</pre>
<P>A more reasonable example would be converting PDF or MS Word documents for
parsing by swish. Examples of this are provided in the <EM>prog-bin</EM> directory of the swish-e distribution.
<P>You may also modify the URI object to change the path name passed to swish
for indexing:

<pre> filter_content => sub {
     my $uri = $_[0];
     $uri->host('www.other.host');
     return 1;
 },</pre>
<P>Swish-e's ReplaceRules feature can also be used for modifying the path name
indexed.

<P>Here's a bit more advanced example of indexing text/html and PDF files only:
<pre> use pdf2xml;  # included example pdf converter module

 $server{filter_content} = sub {
     my ( $uri, $server, $response, $content_ref ) = @_;

     return 1 if $response->content_type eq 'text/html';
     return 0 unless $response->content_type eq 'application/pdf';

     # for logging counts
     $server->{counts}{'PDF transformed'}++;

     $$content_ref = ${pdf2xml( $content_ref )};
     return 1;
 };</pre>
<P>Note: Swish-e now includes a method of filtering based on the SWISH::Filter
Perl modules. See the SwishSpiderConfig.pl file for an example of how to use
SWISH::Filter in a filter_content callback function.

<P>Note that you can create your own counters to display in the summary list
when spidering is finished by adding a value to the hash pointed to by <CODE>$server->{counts}</CODE>:
<pre> test_response => sub {
     my $server = $_[1];
     if ( $_[0]->path =~ /private\.html$/ ) {
         $server->{no_index}++;
         $server->{counts}{'Private Files'}++;
     }
     return 1;
 },</pre>
<P>Each callback function <STRONG>must</STRONG> return true to continue processing the URL. Returning false will cause
processing of <EM>the current</EM> URL to be skipped.
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<H2><A NAME="More_on_setting_flags">More on setting flags</A></H2>

<P>Swish (not this spider) has a configuration directive <A HREF="#item_NoContents">NoContents</A> that will instruct swish to index only the title (or file name), and not
the contents. This is often used when indexing binary files such as image
files, but can also be used with HTML files to index only the document
titles.
<P>As shown above, you can turn this feature on for specific documents by
setting a flag in the server hash passed into the <A HREF="#item_test_response">test_response</A> or <A HREF="#item_filter_content">filter_content</A> subroutines. For example, in your configuration file you might have the <A HREF="#item_test_response">test_response</A> callback set as:
<pre> test_response => sub {
     my ( $uri, $server, $response ) = @_;
     # tell swish not to index the contents if this is of type image
     $server->{no_contents} = $response->content_type =~ m[^image/];
     return 1;  # ok to index and spider this document
 },</pre>
<P>The entire contents of the resource are still read from the web server, and
passed on to swish, but swish will also be passed a <A HREF="#item_No_Contents">No-Contents</A> header which tells swish to enable the NoContents feature for this document
only.

<P>Note: Swish will index the path name only when <A HREF="#item_NoContents">NoContents</A> is set, unless the document's type (as set by the swish configuration
settings <A HREF="#item_IndexContents">IndexContents</A> or <A HREF="#item_DefaultContents">DefaultContents</A>) is HTML <EM>and</EM> a title is found in the HTML document.
<P>Note: In most cases you probably would not want to send a large binary file
to swish just to have it ignored. Therefore, it would be smart to use a <A HREF="#item_filter_content">filter_content</A> callback routine to replace the contents with a single character (you cannot
use the empty string at this time).
<P>A similar flag may be set to prevent indexing a document at all, but still
allow spidering. In general, if you want to completely skip spidering a file
you return false from one of the callback routines (<A HREF="#item_test_url">test_url</A>, <A HREF="#item_test_response">test_response</A>, or <A HREF="#item_filter_content">filter_content</A>). Returning false from any of those three callbacks will stop processing
of that file, and the file will <STRONG>not</STRONG> be spidered.
<P>But there may be some cases where you still want to spider (extract links)
yet not index the file. An example might be where you wish to index only
PDF files, but you still need to spider all HTML files to find the links to
those PDF files:
<pre> $server{test_response} = sub {
     my ( $uri, $server, $response ) = @_;
     $server->{no_index} = $response->content_type ne 'application/pdf';
     return 1;  # ok to spider, but don't index
 };</pre>
<P>So, the difference between <CODE>no_contents</CODE> and <CODE>no_index</CODE> is that <CODE>no_contents</CODE> will still index the file name, just not the contents. <CODE>no_index</CODE> will still spider the file (if it's <CODE>text/html</CODE>) but the file will not be processed by swish at all.

<P><STRONG>Note:</STRONG> If <CODE>no_index</CODE> is set in a <A HREF="#item_test_response">test_response</A> callback function then the document <EM>will not be filtered</EM>. That is, your <A HREF="#item_filter_content">filter_content</A>
callback function will not be called.
<P>The <CODE>no_spider</CODE> flag can be set to avoid spidering an HTML file. The file will still be
indexed unless
<CODE>no_index</CODE> is also set. But if you want to neither index nor spider a file, simply return
false from one of the three callback functions.
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<H1><A NAME="SIGNALS">SIGNALS</A></H1>

<P>Sending a SIGHUP to the running spider will cause it to stop spidering.
This is a good way to abort spidering but still let swish index the documents
fetched so far.
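<P>For example, from another shell (the process ID shown is hypothetical; find the real one with ps):

<pre> kill -HUP 12345</pre>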
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<H1><A NAME="COPYRIGHT">COPYRIGHT</A></H1>

<P>Copyright 2001 Bill Moseley

<P>This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<H1><A NAME="SUPPORT">SUPPORT</A></H1>

<P>Send all questions to the SWISH-E discussion list. See
<A HREF="http://sunsite.berkeley.edu/SWISH-E/">http://sunsite.berkeley.edu/SWISH-E/</A>.
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<div class="navbar">
<a href="./search.html">Prev</a> |
<a href="./index.html">Contents</a> |
<a href="./Filter.html">Next</a>
</div>

<IMG ALT="" WIDTH="470" HEIGHT="10" SRC="images/dotrule1.gif">
<div class="footer">
1835
<BR>SWISH-E is distributed with <B>no warranty</B> under the terms of the
1836
<A HREF="http://www.fsf.org/copyleft/gpl.html">GNU Public License</A>,<BR>
1837
Free Software Foundation, Inc.,
1838
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA<BR>
1839
Public questions may be posted to
1840
the <A HREF="http://swish-e.org/Discussion/">SWISH-E Discussion</A>.