~ubuntu-branches/ubuntu/trusty/recoll/trusty

Viewing changes to doc/user/usermanual.sgml

Committer: Bazaar Package Importer
Author(s): Kartik Mistry
Date: 2008-11-13 21:18:15 UTC
mfrom: (1.1.7 upstream) (4.1.3 sid)
Revision ID: james.westby@ubuntu.com-20081113211815-2hxp996xj5hyjh08

Tags: 1.11.0-1

http://bugs.debian.org/500690

http://bugs.debian.org/502427

http://bugs.debian.org/505376

* New upstream release:
  + Remembers missing filters in first run (Closes: #500690)
* debian/control:
  + Added libimage-exiftool-perl as Suggests (Closes: #502427)
  + Added Python as recommended due to filters/rclpython script,
    although it's not necessary, as the script will be installed only
    when Python is present
* debian/patches:
  + Refreshed patch for gcc 4.4 FTBFS (Closes: #505376)
* debian/copyright:
  + Updated for newly added filter and image files

files added:
doc/user/rcl.program.api.html

doc/user/rcl.program.fields.html

doc/user/rcl.program.filters.html

doc/user/rcl.program.html

doc/user/rcl.program.python.html

doc/user/rclq.py

doc/user/recoll.txt

filters/rclpurple

filters/rclpython

filters/rcltext

python/recoll/pyrecoll.cpp

python/recoll/setup.py

python/samples/rcldlkp.py

python/samples/rclmbox.py

python/samples/recollq.py

python/samples/recollqsd.py

python/xesam/xesam-recoll-service

qt4gui/ktrace.out

qtgui/mtpics/pidgin.png

qtgui/mtpics/text-x-python.png

query/filtseq.cpp

query/filtseq.h

rcldb/rcldoc.cpp

sampleconf/fields

utils/fileudi.cpp

utils/fileudi.h

files removed:
lib/fileudi.dep.stamp

files modified:
ChangeLog

INSTALL

Makefile.in

README

VERSION

common/rclconfig.cpp

common/rclconfig.h

configure

configure.ac

debian/changelog

debian/control

debian/copyright

debian/patches/02_gcc-snapshot-missing-headers-fix.dpatch

debian/rules

doc/man/recollindex.1

doc/user/HTML.manifest

doc/user/index.html

doc/user/rcl.extending.html

doc/user/rcl.indexing.html

doc/user/rcl.install.building.html

doc/user/rcl.install.config.html

doc/user/rcl.install.external.html

doc/user/rcl.install.html

doc/user/rcl.kicker-applet.html

doc/user/rcl.search.custom.html

doc/user/rcl.search.lang.html

doc/user/usermanual.html

doc/user/usermanual.html-text

doc/user/usermanual.sgml

doc/user/usermanual.txt

filters/rclabw

filters/rcldjvu

filters/rcldvi

filters/rclflac

filters/rclid3

filters/rclimg

filters/rclkwd

filters/rclogg

filters/rclopxml

filters/rclppt

filters/rclsoff

filters/rclsvg

filters/rclxls

index/indexer.cpp

index/indexer.h

index/recollindex.cpp

internfile/Filter.h

internfile/internfile.cpp

internfile/internfile.h

internfile/mh_exec.cpp

internfile/mh_exec.h

internfile/mh_html.cpp

internfile/mh_html.h

internfile/mh_mail.cpp

internfile/mh_mail.h

internfile/mh_mbox.cpp

internfile/mh_mbox.h

internfile/mh_text.h

internfile/mh_unknown.h

internfile/mimehandler.cpp

internfile/mimehandler.h

internfile/myhtmlparse.cpp

lib/Makefile

lib/mkMake

mk/localdefs.in

qt4gui/uifrom3

qtgui/advsearch_w.cpp

qtgui/confgui/confguiindex.cpp

qtgui/guiutils.cpp

qtgui/guiutils.h

qtgui/i18n/recoll_de.qm

qtgui/i18n/recoll_de.ts

qtgui/i18n/recoll_fr.qm

qtgui/i18n/recoll_fr.ts

qtgui/i18n/recoll_it.qm

qtgui/i18n/recoll_it.ts

qtgui/i18n/recoll_ru.qm

qtgui/i18n/recoll_ru.ts

qtgui/i18n/recoll_tr.qm

qtgui/i18n/recoll_tr.ts

qtgui/i18n/recoll_uk.qm

qtgui/i18n/recoll_uk.ts

qtgui/i18n/recoll_xx.ts

qtgui/idxthread.cpp

qtgui/idxthread.h

qtgui/main.cpp

qtgui/mtpics/README

qtgui/plaintorich.cpp

qtgui/plaintorich.h

qtgui/preview_w.cpp

qtgui/preview_w.h

qtgui/rclmain.ui

qtgui/rclmain_w.cpp

qtgui/rclmain_w.h

qtgui/reslist.cpp

qtgui/reslist.h

qtgui/sort_w.cpp

qtgui/ssearch_w.cpp

qtgui/uiprefs.ui

qtgui/uiprefs_w.cpp

query/docseq.cpp

query/docseq.h

query/docseqdb.cpp

query/docseqdb.h

query/docseqhist.cpp

query/docseqhist.h

query/recollq.cpp

query/sortseq.cpp

query/sortseq.h

query/wasastringtoquery.cpp

query/wasastringtoquery.h

query/wasatorcl.cpp

rcldb/pathhash.cpp

rcldb/rcldb.cpp

rcldb/rcldb.h

rcldb/rcldb_p.h

rcldb/rcldoc.h

rcldb/rclquery.cpp

rcldb/rclquery.h

rcldb/searchdata.cpp

rcldb/searchdata.h

rcldb/stemdb.cpp

recollinstall.in

sampleconf/mimeconf

sampleconf/mimemap

sampleconf/mimeview

utils/Makefile

utils/base64.cpp

utils/base64.h

utils/execmd.cpp

utils/refcntr.h

utils/smallut.cpp

utils/smallut.h

utils/transcode.cpp

utils/transcode.h

doc/user/usermanual.sgml

Dockes</holder>

</copyright>

<releaseinfo>$Id: usermanual.sgml,v 1.63 2008/05/07 06:14:14 dockes Exp $</releaseinfo>

<releaseinfo>$Id: usermanual.sgml,v 1.68 2008/10/13 07:57:12 dockes Exp $</releaseinfo>

<para>This document introduces full text search notions

228

</para>

229

230

<para>&RCL; indexing processes plain text, HTML, openoffice

231

and e-mail files internally. Other types (ie: postscript, pdf,

232

ms-word, rtf) need external applications for preprocessing. The

233

list is in the <link linkend="rcl.install.external">

234

installation</link> section.</para>

231

and e-mail files internally.</para>

232

233

<para>Other file types (e.g. postscript, pdf, ms-word, rtf...)

234

need external applications for preprocessing. The list is in the

235

236

section. After every indexing operation, &RCL; updates a list of

237

commands that would be needed for indexing existing file

238

types. This list can be displayed from the

239

<command>recoll</command> <guilabel>File</guilabel> menu. It is

240

stored in the <filename>missing</filename> text file

241

inside the configuration directory.</para>

235

242

236

243

<para>Without further configuration, &RCL; will index all

237

244

appropriate files from your home directory, with a reasonable

834

841

simple search entry when the search mode selector is set to

835

842

<guilabel>Query Language</guilabel>.</para>

836

843

844

<para>The language is roughly based on the <ulink

845

url="http://www.xesam.org/main/XesamUserSearchLanguage95">

846

Xesam</ulink> user search language specification.</para>

847

837

848

<para>Here follows a sample request that we are going to

838

849

explain:</para>

850

839

851

840

852

author:"john doe" Beatles OR Lennon Live OR Unplugged -potatoes

841

853

</programlisting>

851

863

<replaceable>unplugged</replaceable> but not

852

864

<replaceable>potatoes</replaceable> (in any part of the document).</para>

853

865

866

<para>An element is composed of an optional field specification,

867

and a value, separated by a colon. Example:

868

<replaceable>Beatles</replaceable>,

869

<replaceable>author:balzac</replaceable>,

870

<replaceable>dc:title:grandet</replaceable> </para>

871

872

<para>The colon, if present, means "contains". Xesam defines other

873

relations, which are not supported for now.</para>
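
The element syntax described above can be illustrated with a short Python sketch. It is not Recoll's parser: the function name split_element is hypothetical, and the rule "the field is everything before the last colon outside double quotes" is an assumption chosen so that dc:title:grandet yields the field dc:title.

```python
def split_element(element):
    """Split one query element into (field, value).

    Assumption for this sketch: the field is everything before the last
    colon that sits outside double quotes. The real tokenizer may differ.
    """
    quote = element.find('"')
    # Only look for the separating colon before any quoted phrase.
    head = element if quote == -1 else element[:quote]
    colon = head.rfind(":")
    if colon == -1:
        return None, element          # no field specification
    return element[:colon], element[colon + 1:]


for sample in ("Beatles", "author:balzac", "dc:title:grandet", 'author:"john doe"'):
    print(sample, "->", split_element(sample))
```

An element without a colon comes back with a field of None, which corresponds to a search in any part of the document.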

874

854

875

<para>All elements in the search entry are normally combined

855

876

with an implicit AND. It is possible to specify that elements be

856

877

OR'ed instead, as in <replaceable>Beatles</replaceable>

870

891

<replaceable>word3</replaceable>. Do not enter explicit

871

892

parentheses; they are not supported for now.</para>

872

893

873

<para>An entry preceded by a <literal>-</literal> specifies a

874

term that should <emphasis>not</emphasis> appear.</para>

875

876

<para>The first element in the above exemple,

877

<literal>author:"john doe"</literal> is a phrase search limited

878

to a specific field. Phrase searches are specified as usual by

879

enclosing the words in double quotes. The field specification

880

appears before the colon (of course this is not limited to

881

phrases, <literal>author:Balzac</literal> would be ok

882

too). &RCL; currently manages the following fields:</para>

894

<para>An element preceded by a <literal>-</literal> specifies a

895

term that should <emphasis>not</emphasis> appear. Pure negative

896

queries are forbidden.</para>

897

898

<para>As usual, words inside quotes define a phrase

899

(the order of words is significant), so that

900

<replaceable>title:"prejudice pride"</replaceable> is not the same as

901

<replaceable>title:prejudice title:pride</replaceable>, and is

902

unlikely to find a result.</para>

903

904

<para>&RCL; currently manages the following default fields:</para>

883

905

884

906

<listitem><para><literal>title</literal>,

885

907

<literal>subject</literal> or <literal>caption</literal> are

889

911

<listitem><para><literal>author</literal> or

890

912

<literal>from</literal> for searching the document originators.</para>

891

913

</listitem>

914

<listitem><para><literal>recipient</literal> or

915

<literal>to</literal> for searching the document recipients.</para>

916

</listitem>

892

917

<listitem><para><literal>keyword</literal> for searching the

893

document specified keywords (few documents actually have any).</para>

918

document-specified keywords (few documents actually have any).</para>

894

919

</listitem>

895

</itemizedlist>

896

897

<para>As of release 1.9, the filters have the possibility to

898

create other fields with arbitrary names. No standard filters

899

use this possibility yet.</para>

900

901

<para>There are two other elements which may be specified

902

through the field syntax, but are somewhat special:</para>

903

904

<listitem><para><literal>ext</literal> for specifying the file

920

<listitem><para><literal>filename</literal> for the document's

921

file name.</para></listitem>

922

<listitem><para><literal>ext</literal> specifies the file

905

923

name extension (Ex: <literal>ext:html</literal>)</para>

906

924

</listitem>

907

<listitem><para><literal>dir</literal> for specifying the file

908

location (Ex:

925

</itemizedlist>

926

927

<para>The field syntax also supports a few field-like, but

928

special, criteria:</para>

929

930

<listitem><para><literal>dir</literal> for filtering the

931

results on file location (Ex:

909

932

<literal>dir:/home/me/somedir</literal>). Please note

910

933

that this is quite inefficient, that it may produce very

911

934

slow searches, and that it may be worthwhile in some

912

935

cases to set up separate databases instead.</para>

913

936

</listitem>

914

<listitem><para><literal>mime</literal> for specifying the

937

938

<listitem><para><literal>mime</literal> or

939

<literal>format</literal> for specifying the

915

940

mime type. This one is quite special because you can specify

916

941

several values which will be OR'ed (the normal default for the

917

942

language is AND). Ex: <literal>mime:text/plain

920

945

<literal>mime</literal> specification is not supported and

921

946

will produce strange results.</para>

922

947

</listitem>

948

949

<listitem><para><literal>type</literal> or

950

<literal>rclcat</literal> for specifying the category (as in

951

text/media/presentation/etc.). The classification of mime

952

types in categories is defined in the &RCL; configuration

953

(<filename>mimeconf</filename>), and can be modified or

954

extended. The default category names are those which permit

955

filtering results in the main GUI screen. Categories are OR'ed

956

like mime types above.</para>

957

</listitem>

958

923

959

</itemizedlist>

960

961

<para>The document filters used while indexing have the

962

possibility to create other fields with arbitrary names, and

963

aliases may be defined in the configuration, so that the exact

964

field search possibilities may be different for you if someone

965

took care of the customisation.</para>

966

924

967

<para>The query language is currently the only way to use the

925

968

&RCL; field search capability.</para>

926

969

927

970

<para>Words inside phrases and capitalized words are not

928

971

stem-expanded. Wildcards may be used anywhere inside a term.

929

972

Specifying a wildcard on the left of a term can produce a very

930

slow search.</para>

973

slow search (or even an incorrect one if the expansion is

974

truncated because of excessive size).</para>

931

975

932

976

<para>You can use the <literal>show query</literal> link at the

933

977

top of the result list to check the exact query which was

934

978

finally executed by Xapian.</para>

935

979

980

<para>Most Xesam phrase modifiers are unsupported, except for

981

<literal>l</literal> (small ell) to disable stemming, and

982

<literal>p</literal> to turn a phrase into a NEAR (unordered)

983

search. Example: <replaceable>"prejudice pride"p</replaceable></para>

984

936

985

</sect1>

937

986

938

987

1526

1575

<para>Your main database (the one the current configuration

1527

1576

indexes to), is always implicitly active. If this is not

1528

1577

desirable, you can set up your configuration so that it indexes,

1529

for example, an empty directory.</para>

1530

1531

</sect1>

1578

for example, an empty directory. An alternative indexer may also

1579

need to implement a way of purging stale data from the index.

1580

</para>

1581

1582

</sect1>

1583

1584

</chapter>

1585

1586

1587

<title>Programming interface</title>

1588

1589

<para>&RCL; has an Application Programming Interface, usable both

1590

for indexing and searching, currently accessible from the

1591

<application>Python</application> language.</para>

1592

1593

<para>Another less radical way to extend the application is to

1594

write filters for new types of documents.</para>

1595

1596

<para>The processing of metadata attributes for documents

1597

(<literal>fields</literal>) is highly configurable.</para>

1598

1599

1600

<title>Writing a document filter</title>

1601

1602

<para>&RCL; filters are executable programs which

1603

translate from a specific format (e.g.

1604

<application>openoffice</application>,

1605

<application>acrobat</application>, etc.) to the &RCL;

1606

indexing input format, which may be

1607

<literal>text/plain</literal> or

1608

1609

1610

<para>&RCL; filters are usually shell-scripts, but this is in

1611

no way necessary. These programs are extremely simple and most

1612

of the difficulty lies in extracting the text from the native

1613

format, not outputting what is expected by &RCL;. Happily

1614

enough, most document formats already have translators or text

1615

extractors which handle the difficult part and can be called

1616

from the filter. In some cases the output of the translating

1617

program is appropriate, and no intermediate shell-script is

1618

needed.</para>

1619

1620

<para>Filters are called with a single argument which is the

1621

source file name. They should output the result to stdout.</para>

1622

1623

<para>The <literal>RECOLL_FILTER_FORPREVIEW</literal>

1624

environment variable (values <literal>yes</literal>,

1625

<literal>no</literal>) tells the filter if the operation is

1626

for indexing or previewing. Some filters use this to output a

1627

slightly different format. This is not essential.</para>
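
The calling convention above (one file name argument, result on stdout, RECOLL_FILTER_FORPREVIEW set to yes or no) could look roughly like this in Python. This is only a sketch of a text/plain filter for files that are already plain text: the function names are hypothetical, and collapsing whitespace when indexing is an arbitrary choice made to show the environment variable in use.

```python
import os
import sys


def for_preview(environ=os.environ):
    """True when the filter is run for previewing rather than indexing."""
    return environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"


def run_filter(argv, out=sys.stdout, environ=os.environ):
    """Read the file named by the single argument and write its text to out."""
    if len(argv) != 2:
        sys.stderr.write("usage: filter <source file>\n")
        return 1
    with open(argv[1], "r", encoding="utf-8", errors="replace") as src:
        text = src.read()
    if not for_preview(environ):
        # Hypothetical choice: layout does not matter when indexing,
        # so collapse runs of whitespace; keep it intact for previews.
        text = " ".join(text.split())
    out.write(text)
    return 0
```

A real script would end with sys.exit(run_filter(sys.argv)) and would call an external extractor instead of reading the file directly.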

1628

1629

<para>The association of file types to filters is performed in

1630

the <filename>mimeconf</filename> file. A sample:</para>

1631

1632

1633

[index]

1634

application/msword = exec antiword -t -i 1 -m UTF-8;\

1635

mimetype=text/plain;charset=utf-8

1636

1637

application/ogg = exec rclogg

1638

1639

text/rtf = exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html

1640

</programlisting>

1641

1642

<para>The fragment specifies that:</para>

1643

1644

1645

1646

<listitem><para><literal>application/msword</literal> files

1647

are processed by executing the <command>antiword</command>

1648

program, which outputs

1649

<literal>text/plain</literal> encoded in

1650

1651

</listitem>

1652

1653

<listitem><para><literal>application/ogg</literal> files are

1654

processed by the <command>rclogg</command> script, with

1655

default output type (<literal>text/html</literal>, with

1656

encoding specified in the header, or <literal>utf-8</literal>

1657

by default).</para>

1658

</listitem>

1659

1660

<listitem><para><literal>text/rtf</literal> is processed by

1661

<command>unrtf</command>, which outputs

1662

<literal>text/html</literal>. The

1663

<literal>iso-8859-1</literal> encoding is specified because it

1664

is not the <literal>utf-8</literal> default, and not output by

1665

<command>unrtf</command> in the HTML header section.</para>

1666

</listitem>

1667

</itemizedlist>
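
To make the structure of these entries concrete, the following Python sketch splits the right-hand side of one entry into a command and its attributes. It is not how &RCL; reads the file: parse_filter_spec is a hypothetical helper, continuation lines are assumed to be already joined, and the naive split on semicolons would break on quoted arguments.

```python
def parse_filter_spec(value):
    """Split the right-hand side of a mimeconf [index] entry.

    Returns (command argument list, attributes dict). Illustrative only.
    """
    parts = [p.strip() for p in value.split(";")]
    words = parts[0].split()
    if not words or words[0] != "exec":
        raise ValueError("expected an 'exec' filter specification")
    attrs = {}
    for part in parts[1:]:
        if part:
            key, _, val = part.partition("=")
            attrs[key.strip()] = val.strip()
    # The text above gives text/html as the default output type.
    attrs.setdefault("mimetype", "text/html")
    return words[1:], attrs


print(parse_filter_spec("exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html"))
```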

1668

1669

<para>The easiest way to write a new filter is probably to start

1670

from an existing one.</para>

1671

1672

<para>Filters which output <literal>text/plain</literal> text

1673

are generally simpler, but they cannot specify the character set

1674

and other metadata, so they are limited to cases where these

1675

elements are not needed.</para>

1676

1677

1678

1679

<title>Filter HTML output</title>

1680

1681

<para>The output HTML could be very minimal like the following

1682

example:</para>

1683

1684

1685

1686

&lt/head>

1687

<body>some text content</body></html>

1688

</programlisting>

1689

1690

<para>You should take care to escape some

1691

characters inside

1692

the text by transforming them into appropriate

1693

entities. "<literal>&</literal>" should be transformed into

1694

"<literal>&amp;</literal>", "<literal><</literal>"

1695

should be transformed into

1696

"<literal>&lt;</literal>". This is not always properly

1697

done by translating programs which output HTML, and of

1698

course never by those which output plain text.</para>

1699

1700

<para>The character set needs to be specified in the

1701

header. It does not need to be UTF-8 (&RCL; will take care

1702

of translating it), but it must be accurate for good

1703

results.</para>
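
As an illustration of the escaping and character set points above, here is a small Python sketch that wraps extracted text in a minimal HTML document. The function name to_filter_html is hypothetical, and declaring the character set through an http-equiv meta tag is the usual HTML convention, assumed here rather than taken from the &RCL; sources.

```python
import html


def to_filter_html(text, title="", charset="UTF-8"):
    """Wrap extracted text in a minimal HTML document for the indexer."""
    # html.escape turns "&" into "&amp;" and "<" into "&lt;" (and ">" into "&gt;").
    body = html.escape(text, quote=False)
    head = '<meta http-equiv="Content-Type" content="text/html;charset=%s">' % charset
    if title:
        head += "<title>%s</title>" % html.escape(title, quote=False)
    return "<html><head>%s</head>\n<body>%s</body></html>\n" % (head, body)


print(to_filter_html("AT&T < 3 offices", title="R&D notes"))
```

The charset argument must describe the bytes the filter really writes, since the declared value is what the conversion relies on.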

1704

1705

<para>&RCL; will also make use of other header fields if

1706

they are present: <literal>title</literal>,

1707

<literal>description</literal>,

1708

<literal>keywords</literal>.</para>

1709

1710

<para>Filters also have the possibility to "invent" field

1711

names. These should be output as meta tags:</para>

1712

1713

1714

1715

</programlisting>

1716

1717

<para> See the following section for details about configuring

1718

how field data is processed by the indexer.</para>

1719

1720

</sect2>

1721

1722

</sect1>

1723

1724

1725

<title>Field data processing configuration</title>

1726

1727

<para><literal>Fields</literal> are named pieces of information

1728

in or about documents, like <literal>title</literal>,

1729

<literal>author</literal>, <literal>abstract</literal>.</para>

1730

1731

<para>The field values for documents can appear in several ways

1732

during indexing: either output by filters as

1733

<literal>meta</literal> fields in the HTML header section, or

1734

added as attributes of the <literal>Doc</literal> object when

1735

using the API, or synthesized internally by &RCL;.</para>

1736

1737

<para>The &RCL; query language allows searching for text in a

1738

specific field.</para>

1739

1740

<para>&RCL; defines a number of default fields. Additional

1741

ones can be output by filters, and described in the

1742

<filename>fields</filename> configuration file.</para>

1743

1744

<para>Fields can be:</para>

1745

1746

1747

<listitem><para><literal>indexed</literal>, meaning that their

1748

terms are separately stored in inverted lists (with a specific

1749

prefix), and that a field-specific search is possible.</para>

1750

</listitem>

1751

1752

<listitem><para><literal>stored</literal>, meaning that their

1753

value is recorded in the index data record for the document,

1754

and can be returned and displayed with search results.</para>

1755

</listitem>

1756

1757

</itemizedlist>

1758

1759

<para>A field can be indexed, stored, or both.</para>

1760

1761

<para>A field becomes indexed by having a prefix defined in

1762

the <literal>[prefixes]</literal> section of the

1763

<filename>fields</filename> file. See the comments in there for

1764

details.</para>

1765

1766

<para>A field becomes stored by appearing in

1767

the <literal>[stored]</literal> section of the

1768

<filename>fields</filename> file.</para>

1769

1770

</sect1>

1771

1772

1773

1774

1775

1776

1777

<title>Interface elements</title>

1778

1779

<para>A few elements in the interface are specific and need

1780

an explanation.</para>

1781

1782

1783

1784

1785

<term>udi</term> <listitem><para>A udi (unique document

1786

identifier) identifies a document. Because of limitations

1787

inside the index engine, it is restricted in length (to

1788

200 bytes), which is why a regular URI cannot be used. The

1789

structure and contents of the udi are defined by the

1790

application and opaque to the index engine. For example,

1791

the internal file system indexer uses the complete

1792

document path (file path + internal path), truncated to

1793

length, the suppressed part being replaced by a hash

1794

value.</para> </listitem>

1795

</varlistentry>

1796

1797

1798

<term>ipath</term>

1799

1800

<listitem><para>This data value (set as a field in the Doc

1801

object) is stored, along with the URL, but not indexed by

1802

&RCL;. Its contents are not interpreted, and its use is up

1803

to the application. For example, the &RCL; internal file

1804

system indexer stores the part of the document access path

1805

internal to the container file (<literal>ipath</literal> in

1806

this case is a list of subdocument sequential numbers). url

1807

and ipath are returned in every search result and permit

1808

access to the original document.</para>

1809

</listitem>

1810

</varlistentry>

1811

1812

1813

<term>Stored and indexed fields</term>

1814

1815

<listitem><para>The <filename>fields</filename> file inside

1816

the &RCL; configuration defines which document fields are

1817

either "indexed" (searchable), "stored" (retrievable with

1818

search results), or both.</para>

1819

</listitem>

1820

</varlistentry>

1821

1822

</variablelist>
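
The udi construction described for the file system indexer (full document path, truncated, with the suppressed part replaced by a hash) can be sketched as follows. This is not the code from utils/fileudi.cpp: the "|" separator, the use of MD5, and the name make_udi are all assumptions made for the example.

```python
import hashlib

UDI_MAX_BYTES = 200  # length limit quoted in the udi entry above


def make_udi(file_path, ipath="", max_bytes=UDI_MAX_BYTES):
    """Build a bounded-length identifier from a file path and internal path."""
    full = (file_path + "|" + ipath).encode("utf-8")
    if len(full) <= max_bytes:
        return full
    keep = max_bytes - 32  # an MD5 hex digest is 32 bytes long
    # Replace the suppressed tail with a hash so distinct paths stay distinct.
    digest = hashlib.md5(full[keep:]).hexdigest().encode("ascii")
    return full[:keep] + digest


print(make_udi("/home/me/mail/inbox", "3:2"))
print(len(make_udi("/very/long" * 50, "1")))
```

Short paths are kept readable, and only paths beyond the limit pay the cost of an opaque suffix.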

1823

1824

<para>Data for an external indexer should be stored in a

1825

separate index, not the one for the &RCL; internal file system

1826

indexer (except if the latter is not used at all). The reason

1827

is that the main document indexer purge pass would remove all

1828

the other indexer's documents, as they were not seen during

1829

indexing. The main indexer documents would also probably be a

1830

problem for the external indexer purge operation.</para>

1831

1832

</sect2>

1833

1834

1835

<title>Python interface</title>

1836

1837

1838

<title>Introduction</title>

1839

1840

<para>&RCL; versions after 1.11 define a Python programming

1841

interface, both for searching and indexing.</para>

1842

1843

<para>The Python interface is not built by default and can be

1844

found in the source package, under python/recoll. The

1845

directory contains the usual <filename>setup.py</filename>

1846

script which you can use to build and install the

1847

module:

1848

1849

1850

<userinput>cd recoll-xxx/python/recoll</userinput>

1851

<userinput>python setup.py build</userinput>

1852

<userinput>python setup.py install</userinput>

1853

</screen>

1854

</para>

1855

1856

</sect3>

1857

1858

1859

1860

<title>Interface manual</title>

1861

1862

1863

NAME

1864

recoll - This is an interface to the Recoll full text indexer.

1865

1866

FILE

1867

/usr/local/lib/python2.5/site-packages/recoll.so

1868

1869

CLASSES

1870

1871

Doc

1872

Query

1873

SearchData

1874

1875

class Db(__builtin__.object)

1876

| Db([confdir=None], [extra_dbs=None], [writable = False])

1877

1878

| A Db object holds a connection to a Recoll index. Use the connect()

1879

| function to create one.

1880

| confdir specifies a Recoll configuration directory (default:

1881

| $RECOLL_CONFDIR or ~/.recoll).

1882

| extra_dbs is a list of external databases (xapian directories)

1883

| writable decides if we can index new data through this connection

1884

1885

| Methods defined here:

1886

1887

1888

| addOrUpdate(...)

1889

| addOrUpdate(udi, doc, parent_udi=None) -> None

1890

| Add or update index data for a given document

1891

| The udi string must define a unique id for the document. It is not

1892

| interpreted inside Recoll

1893

| doc is a Doc object

1894

| if parent_udi is set, this is a unique identifier for the

1895

| top-level container (ie mbox file)

1896

1897

| delete(...)

1898

| delete(udi) -> Bool.

1899

| Purge index from all data for udi. If udi matches a container

1900

| document, purge all subdocs (docs with a parent_udi matching udi).

1901

1902

| makeDocAbstract(...)

1903

| makeDocAbstract(Doc, Query) -> string

1904

| Build and return 'keyword-in-context' abstract for document

1905

| and query.

1906

1907

| needUpdate(...)

1908

| needUpdate(udi, sig) -> Bool.

1909

| Check if the index is up to date for the document defined by udi,

1910

| having the current signature sig.

1911

1912

| purge(...)

1913

| purge() -> Bool.

1914

| Delete all documents that were not touched during the just finished

1915

| indexing pass (since open-for-write). These are the documents for

1916

| which the needUpdate() call was not performed, indicating that they no

1917

| longer exist in the primary storage system.

1918

1919

| query(...)

1920

| query() -> Query. Return a new, blank query object for this index.

1921

1922

| setAbstractParams(...)

1923

| setAbstractParams(maxchars, contextwords).

1924

| Set the parameters used to build 'keyword-in-context' abstracts

1925

1926

| ----------------------------------------------------------------------

1927

| Data and other attributes defined here:

1928

1929

1930

class Doc(__builtin__.object)

1931

| Doc()

1932

1933

| A Doc object contains index data for a given document.

1934

| The data is extracted from the index when searching, or set by the

1935

| indexer program when updating. The Doc object has no useful methods but

1936

| many attributes to be read or set by its user. It matches exactly the

1937

| Rcl::Doc c++ object. Some of the attributes are predefined, but,

1938

| especially when indexing, others can be set, the name of which will be

1939

| processed as field names by the indexing configuration.

1940

| Inputs can be specified as unicode or strings.

1941

| Outputs are unicode objects.

1942

| All dates are specified as unix timestamps, printed as strings

1943

| Predefined attributes (index/query/both):

1944

| text (index): document plain text

1945

| url (both)

1946

| fbytes (both) optional file size in bytes

1947

| filename (both)

1948

| fmtime (both) optional file modification date. Unix time printed

1949

| as string

1950

| dbytes (both) document text bytes

1951

| dmtime (both) document creation/modification date

1952

| ipath (both) value private to the app.: internal access path

1953

| inside file

1954

| mtype (both) mime type for original document

1955

| mtime (query) dmtime if set else fmtime

1956

| origcharset (both) charset the text was converted from

1957

| size (query) dbytes if set, else fbytes

1958

| sig (both) app-defined file modification signature.

1959

| For up to date checks

1960

| relevancyrating (query)

1961

| abstract (both)

1962

| author (both)

1963

| title (both)

1964

| keywords (both)

1965

1966

| Methods defined here:

1967

1968

1969

| ----------------------------------------------------------------------

1970

| Data and other attributes defined here:

1971

1972

1973

class Query(__builtin__.object)

1974

| Recoll Query objects are used to execute index searches.

1975

| They must be created by the Db.query() method.

1976

1977

| Methods defined here:

1978

1979

1980

| execute(...)

1981

| execute(query_string, stemming=1|0)

1982

1983

| Starts a search for query_string, a Recoll search language string

1984

| (mostly Xesam-compatible).

1985

| The query can be a simple list of terms (and'ed by default), or more

1986

| complicated with field specs etc. See the Recoll manual.

1987

1988

| executesd(...)

1989

| executesd(SearchData)

1990

1991

| Starts a search for the query defined by the SearchData object.

1992

1993

| fetchone(...)

1994

| fetchone(None) -> Doc

1995

1996

| Fetches the next Doc object in the current search results.

1997

1998

| sortby(...)

1999

| sortby(field=fieldname, ascending=true)

2000

| Sort results by 'fieldname', in ascending or descending order.

2001

| Only one field can be used, no subsorts for now.

2002

| Must be called before executing the search

2003

2004

| ----------------------------------------------------------------------

2005

| Data descriptors defined here:

2006

2007

| next

2008

| Next index to be fetched from results. Normally increments after

2009

| each fetchone() call, but can be set/reset before the call to effect

2010

| seeking. Starts at 0

2011

2012

| ----------------------------------------------------------------------

2013

| Data and other attributes defined here:

2014

2015

2016

class SearchData(__builtin__.object)

2017

| SearchData()

2018

2019

| A SearchData object describes a query. It has a number of global

2020

| parameters and a chain of search clauses.

2021

2022

| Methods defined here:

2023

2024

2025

| addclause(...)

2026

2027

| qstring=string, slack=int, field=string, stemming=1|0,

2028

| subSearch=SearchData)

2029

| Adds a simple clause to the SearchData And/Or chain, or a subquery

2030

| defined by another SearchData object

2031

2032

| ----------------------------------------------------------------------

2033

| Data and other attributes defined here:

2034

2035

2036

FUNCTIONS

2037

connect(...)

2038

connect([confdir=None], [extra_dbs=None], [writable = False])

2039

-> Db.

2040

2041

Connects to a Recoll database and returns a Db object.

2042

confdir specifies a Recoll configuration directory

2043

(the default is built like for any Recoll program).

2044

extra_dbs is a list of external databases (xapian directories)

2045

writable decides if we can index new data through this connection

2046

2047

2048

</literalLayout>

2049

</sect3>

2050

2051

2052

<title>Example code</title>

2053

2054

<para>The following sample would query the index with a user

2055

language string. See the <filename>python/samples</filename>

2056

directory inside the &RCL; source for other examples.</para>

2057

2058

2059

#!/usr/bin/env python

import recoll

# Connect to the default index and configure abstract generation.
db = recoll.connect()
db.setAbstractParams(maxchars=80, contextwords=2)

query = db.query()
nres = query.execute("some user question")
print "Result count: ", nres
# Show at most the first five results.
if nres > 5:
    nres = 5
while query.next >= 0 and query.next < nres:
    doc = query.fetchone()
    print query.next
    for k in ("title", "size"):
        print k, ":", getattr(doc, k).encode('utf-8')
    # Keyword-in-context abstract for this document and query.
    abs = db.makeDocAbstract(doc, query).encode('utf-8')
    print abs

2078

2079

2080

2081

2082

</programlisting>

2083

2084

</sect3>

2085

2086

</sect2>

1532

2087

1533

2088

</chapter>

1534

2089

1582

2137

1583

2138

<title>Supporting packages</title>

1584

2139

1585

<para>&RCL; uses external applications to index some file

2140

<para>&RCL; uses external applications to index some file

1586

2141

types. You need to install them for the file types that you wish to

1587

2142

have indexed (these are run-time dependencies. None is needed for

1588

building &RCL;):</para>

2143

building &RCL;).</para>

2144

2145

<para>After an indexing pass, the commands that were found

2146

missing can be displayed from the <command>recoll</command>

2147

<guilabel>File</guilabel> menu. The list is stored in the

2148

<filename>missing</filename> text file inside the configuration

2149

directory.</para>
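<para>As a rough illustration only (this helper is not part of &RCL;), the list could also be read from a script. The function name <literal>read_missing_helpers</literal> is hypothetical, the <filename>~/.recoll</filename> fallback is an assumption about where the configuration directory lives, and because the layout of the file is not described here, the sketch simply returns its non-empty lines unparsed.</para>

```python
import os


def read_missing_helpers(confdir=None):
    """Return the non-empty lines of the 'missing' file, or [] if absent.

    confdir: Recoll configuration directory. The ~/.recoll fallback is an
    assumption of this sketch; pass the directory explicitly if yours differs.
    """
    if confdir is None:
        confdir = os.path.expanduser("~/.recoll")
    path = os.path.join(confdir, "missing")
    if not os.path.exists(path):
        return []
    # The file format is not specified in the text, so lines are kept as-is.
    with open(path, encoding="utf-8", errors="replace") as f:
        return [line.rstrip("\n") for line in f if line.strip()]
```

<para>Calling <literal>read_missing_helpers()</literal> after an indexing pass and printing each returned line would give roughly the same information as the menu entry.</para>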

2150

2151

<para>A list of common file types which need external

2152

commands:</para>

1589

2153

1590

2154

1591

2155

2311

2875

be an executable program or script which exists inside

2312

2876

<filename>/usr/[local/]share/recoll/filters</filename>. It

2313

2877

will be given a file name as argument and should output the

2314

text contents in html format on the standard output.</para>

2878

text contents on the standard output.</para>

2315

2879

2316

<para>You can find more details about writing a &RCL; filter

2317

in the <link linkend="rcl.extending.filters">section about

2318

writing filters</link></para>

2880

<para>The <link linkend="rcl.program.filters">filter

2881

programming</link> section describes in more detail how to

2882

write a filter.</para>

2319

2883

</sect3>

2320

2884

2321

2885

</sect2>

2331

2895

add a small &RCL; launcher to the KDE panel.</para>

2332

2896

2333

2897

<para>The applet is not automatically built with the main &RCL;

2334

programs. To build it, you need to unpack the &RCL; source

2335

code, then go to the <filename>kde/recoll_applet/</filename>

2336

directory, and type the usual

2337

<userinput>configure;make;make install</userinput>.</para>

2898

programs, nor is it included with the main source distribution

2899

(because the KDE build boilerplate makes it relatively big). You

2900

can download its source from the recoll.org download page. Use

2901

the omnipotent <userinput>configure;make;make

2902

install</userinput> incantation to build and install.</para>

2338

2903

2339

2904

<para>You can then add the applet to the panel by right-clicking

2340

2905

the panel and choosing the <guilabel>Add applet</guilabel>

2343

2908

<para>The <literal>recoll_applet</literal> has a small text

2344

2909

window where you can type a &RCL; query (in query language

2345

2910

form), and an icon which can be used to restrict the search to

2346

certain types of files.</para>

2347

</sect1>

2348

2349

2350

2351

<title>Extending &RCL;</title>

2352

2353

2354

<title>Writing a document filter</title>

2355

2356

<para>&RCL; filters are executable programs which

2357

translate from a specific format (e.g.

2358

<application>openoffice</application>,

2359

<application>acrobat</application>, etc.) to the &RCL;

2360

indexing input format, which was chosen to be HTML.</para>

2361

2362

<para>&RCL; filters are usually shell scripts, but this is in

2363

no way necessary. These programs are extremely simple and most

2364

of the difficulty lies in extracting the text from the native

2365

format, not outputting what is expected by &RCL;. Happily

2366

enough, most document formats already have translators or text

2367

extractors which handle the difficult part and can be called

2368

from the filter.</para>

2369

2370

<para>Filters are called with a single argument which is the

2371

source file name. They should output the result to stdout.</para>

2372

2373

<para>The <literal>RECOLL_FILTER_FORPREVIEW</literal>

2374

environment variable (values <literal>yes</literal>,

2375

<literal>no</literal>) tells the filter if the operation is

2376

for indexing or previewing. Some filters use this to output a

2377

slightly different format. This is not essential.</para>
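<para>A filter written in Python might test the variable as sketched below. The helper name <literal>is_preview_run</literal> is hypothetical, and treating an unset variable as <literal>no</literal> is an assumption of this sketch, not documented behaviour.</para>

```python
import os


def is_preview_run(environ=None):
    """Return True when RECOLL_FILTER_FORPREVIEW is 'yes', else False.

    An unset variable is treated like 'no' (an assumption of this sketch).
    """
    if environ is None:
        environ = os.environ
    return environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"
```

<para>Accepting the environment as a parameter keeps the check easy to exercise without modifying the real process environment.</para>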

2378

2379

<para>The output HTML can be very minimal, as in the following

2380

example:</para>

2381

2382

2383

2384

</head>

2385

<body>some text content</body></html>

2386

</programlisting>

2387

2388

<para>You should take care to escape some characters inside

2389

the text by transforming them into appropriate

2390

entities. "<literal>&</literal>" should be transformed into

2391

"<literal>&amp;</literal>", "<literal><</literal>"

2392

should be transformed into "<literal>&lt;</literal>".</para>

2393

2394

<para>The character set needs to be specified in the

2395

header. It does not need to be UTF-8 (&RCL; will take care

2396

of translating it), but it must be accurate for good

2397

results.</para>
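<para>The points above (single file name argument, HTML on standard output, entity escaping, declared character set) can be combined in a small sketch. This is not one of the filters shipped with &RCL;: the function names are hypothetical, the input is assumed to be plain UTF-8 text that needs no real extraction step, and the exact form of the meta line is an assumption.</para>

```python
import html
import sys


def to_filter_html(text, charset="UTF-8"):
    """Wrap already-extracted text in a minimal HTML document.

    html.escape(quote=False) replaces '&', '<' and '>' with entities.
    """
    return (
        "<html><head>\n"
        '<meta http-equiv="Content-Type" '
        'content="text/html;charset=%s">\n'
        "</head>\n"
        "<body>%s</body></html>\n" % (charset, html.escape(text, quote=False))
    )


def main(argv):
    # A filter receives the source file name as its single argument.
    if len(argv) != 2:
        sys.stderr.write("usage: %s FILE\n" % argv[0])
        return 1
    # Assumption: the input is UTF-8 text; a real filter would call a
    # format-specific extractor here instead of reading the file directly.
    with open(argv[1], encoding="utf-8", errors="replace") as f:
        sys.stdout.write(to_filter_html(f.read()))
    return 0
```

<para>A script built on this sketch would end with <literal>sys.exit(main(sys.argv))</literal>. The declared character set must match what is actually written, so a filter emitting another encoding should pass it as <literal>charset</literal>.</para>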

2398

2399

<para>&RCL; will also make use of other header fields if

2400

they are present: <literal>title</literal>,

2401

<literal>description</literal>,

2402

<literal>keywords</literal>.</para>

2403

2404

<para>As of &RCL; release 1.9, filters also have the

2405

ability to define their own field names. These should be output as

2406

meta tags:</para>

2407

2408

2409

2410

</programlisting>

2411

2412

<para>In this case, a correspondence between field name and

2413

&XAP; prefix should also be added to the

2414

<filename>mimeconf</filename> file. See the existing entries

2415

for inspiration. The field can then be used inside the query

2416

language to narrow searches.</para>
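<para>As a hedged sketch, a Python filter could generate such meta tags from a dictionary. The helper name <literal>field_meta_tags</literal> and the field name <literal>somefield</literal> are hypothetical, and the name/content attribute layout is assumed to be the form &RCL; expects. The matching <filename>mimeconf</filename> entry is not shown.</para>

```python
import html


def field_meta_tags(fields):
    """Render a dict of field name -> value as HTML meta tags, one per line.

    Names and values are escaped (quotes included) so they stay valid
    inside attribute values. Tags are emitted in sorted name order.
    """
    return "\n".join(
        '<meta name="%s" content="%s">'
        % (html.escape(name, quote=True), html.escape(value, quote=True))
        for name, value in sorted(fields.items())
    )
```

<para>For example, <literal>field_meta_tags({"somefield": "Some textual data"})</literal> yields a single meta tag that a filter would write inside the HTML header.</para>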

2417

2418

<para>The easiest way to write a new filter is probably to start

2419

from an existing one.</para>

2420

2421

2422

</sect2>

2423

2911

certain types of files. It is quite primitive, and launches a

2912

new recoll GUI instance every time (even if it is already

2913

running). You may find it useful anyway.</para>

2424

2914

</sect1>

2425

2915

2426

2916

</chapter>
