<para>Your main database (the one the current configuration
indexes to) is always implicitly active. If this is not
desirable, you can set up your configuration so that it indexes,
for example, an empty directory.</para>
An alternative indexer may also
need to implement a way of purging the index from stale data,
<chapter id="rcl.program">
<title>Programming interface</title>

<para>&RCL; has an Application Programming Interface, usable both
for indexing and searching, currently accessible from the
<application>Python</application> language.</para>

<para>Another less radical way to extend the application is to
write filters for new types of documents.</para>

<para>The processing of metadata attributes for documents
(<literal>fields</literal>) is highly configurable.</para>

<sect1 id="rcl.program.filters">
<title>Writing a document filter</title>

<para>&RCL; filters are executable programs which
translate from a specific format (e.g.
<application>openoffice</application>,
<application>acrobat</application>, etc.) to the &RCL;
indexing input format, which may be
<literal>text/plain</literal> or
<literal>text/html</literal>.</para>

<para>&RCL; filters are usually shell scripts, but this is in
no way necessary. These programs are extremely simple and most
of the difficulty lies in extracting the text from the native
format, not in outputting what &RCL; expects. Fortunately,
most document formats already have translators or text
extractors which handle the difficult part and can be called
from the filter. In some cases the output of the translating
program is appropriate, and no intermediate shell script is
necessary.</para>

<para>Filters are called with a single argument which is the
source file name. They should output the result to stdout.</para>

<para>The <literal>RECOLL_FILTER_FORPREVIEW</literal>
environment variable (values <literal>yes</literal>,
<literal>no</literal>) tells the filter whether the operation is
for indexing or previewing. Some filters use this to output a
slightly different format. Handling it is optional.</para>
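
<para>The calling convention described above (one file name argument,
result on stdout, optional use of
<literal>RECOLL_FILTER_FORPREVIEW</literal>) could be sketched as
follows. This is only an illustration for a hypothetical UTF-8 text
format: the function names and the preview behaviour (visible line
breaks) are assumptions, not part of &RCL;.</para>

```python
import html
import os
import sys


def is_preview(environ=None):
    """Return True when RECOLL_FILTER_FORPREVIEW is set to 'yes'."""
    environ = os.environ if environ is None else environ
    return environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"


def convert(path, preview=False):
    """Read a UTF-8 text file and return a minimal HTML document."""
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    # Escape '&', '<' and '>' so that the text is valid HTML content.
    body = html.escape(text, quote=False)
    if preview:
        # Assumption: a preview benefits from visible line breaks.
        body = body.replace("\n", "<br>\n")
    return (
        "<html><head>\n"
        '<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">\n'
        "</head>\n"
        "<body>" + body + "</body></html>\n"
    )


def main(argv):
    """Filter entry point: argv[1] is the source file name."""
    if len(argv) != 2:
        sys.stderr.write("usage: %s FILE\n" % argv[0])
        return 1
    sys.stdout.write(convert(argv[1], preview=is_preview()))
    return 0

# A real filter script would end with: sys.exit(main(sys.argv))
```

<para>Keeping the conversion in a function separate from the entry
point makes it possible to exercise the filter logic without going
through the command line.</para>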

<para>The association of file types to filters is performed in
the <filename>mimeconf</filename> file. A sample:</para>

<programlisting>
application/msword = exec antiword -t -i 1 -m UTF-8;\
    mimetype=text/plain;charset=utf-8

application/ogg = exec rclogg

text/rtf = exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html
</programlisting>

<para>The fragment specifies that:</para>

<itemizedlist>
<listitem><para><literal>application/msword</literal> files
are processed by executing the <command>antiword</command>
program, which outputs
<literal>text/plain</literal> encoded in
<literal>utf-8</literal>.</para>
</listitem>

<listitem><para><literal>application/ogg</literal> files are
processed by the <command>rclogg</command> script, with
default output type (<literal>text/html</literal>, with
encoding specified in the header, or <literal>utf-8</literal>
by default).</para>
</listitem>

<listitem><para><literal>text/rtf</literal> is processed by
<command>unrtf</command>, which outputs
<literal>text/html</literal>. The
<literal>iso-8859-1</literal> encoding is specified because it
is not the <literal>utf-8</literal> default, and not output by
<command>unrtf</command> in the HTML header section.</para>
</listitem>
</itemizedlist>

<para>The easiest way to write a new filter is probably to start
from an existing one.</para>

<para>Filters which output <literal>text/plain</literal> text
are generally simpler, but they cannot specify the character set
and other metadata, so they are limited to cases where these
elements are not needed.</para>

<sect2 id="rcl.program.filters.html">
<title>Filter HTML output</title>

<para>The output HTML could be very minimal like the following
example:</para>

<programlisting><html><head>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
</head>
<body>some text content</body></html>
</programlisting>

<para>You should take care to escape some characters inside
the text by transforming them into appropriate
entities. "<literal>&</literal>" should be transformed into
"<literal>&amp;</literal>", "<literal><</literal>"
should be transformed into
"<literal>&lt;</literal>". This is not always properly
done by translating programs which output HTML, and of
course never by those which output plain text.</para>

<para>The character set needs to be specified in the
header. It does not need to be UTF-8 (&RCL; will take care
of translating it), but it must be accurate for good
results.</para>

<para>&RCL; will also make use of other header fields if
they are present: <literal>title</literal>,
<literal>description</literal>,
<literal>keywords</literal>.</para>

<para>Filters can also "invent" field
names. These should be output as meta tags:</para>

<programlisting>
<meta name="somefield" content="Some textual data" />
</programlisting>

<para>See the following section for details about configuring
how field data is processed by the indexer.</para>
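
<para>The output rules in this section (declared character set,
escaped text, optional title and extra meta fields) could be
gathered in one helper. The sketch below is a possible shape for such
a helper, not code taken from an existing &RCL; filter: the
<literal>render_filter_html</literal> name and its parameters are
invented for the illustration.</para>

```python
import html


def render_filter_html(text, title=None, fields=None, charset="UTF-8"):
    """Return a minimal HTML document suitable as filter output.

    fields is an optional dict mapping extra field names to values;
    each entry is emitted as a meta tag in the header section.
    """
    head = ['<meta http-equiv="Content-Type" '
            'content="text/html;charset=%s">' % charset]
    if title is not None:
        head.append("<title>%s</title>" % html.escape(title, quote=False))
    for name, value in sorted((fields or {}).items()):
        # Attribute values also need their double quotes escaped.
        head.append('<meta name="%s" content="%s" />'
                    % (html.escape(name), html.escape(value)))
    return ("<html><head>\n" + "\n".join(head) + "\n</head>\n"
            "<body>" + html.escape(text, quote=False) + "</body></html>\n")
```

<para>The declared character set must match the encoding actually
used when the returned string is written to stdout.</para>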

<sect1 id="rcl.program.fields">
<title>Field data processing configuration</title>

<para><literal>Fields</literal> are named pieces of information
in or about documents, like <literal>title</literal>,
<literal>author</literal>, <literal>abstract</literal>.</para>

<para>The field values for documents can appear in several ways
during indexing: either output by filters as
<literal>meta</literal> fields in the HTML header section, or
added as attributes of the <literal>Doc</literal> object when
using the API, or synthesized internally by &RCL;.</para>

<para>The &RCL; query language allows searching for text in a
specific field.</para>

<para>&RCL; defines a number of default fields. Additional
ones can be output by filters, and described in the
<filename>fields</filename> configuration file.</para>

<para>Fields can be:</para>

<itemizedlist>
<listitem><para><literal>indexed</literal>, meaning that their
terms are separately stored in inverted lists (with a specific
prefix), and that a field-specific search is possible.</para>
</listitem>

<listitem><para><literal>stored</literal>, meaning that their
value is recorded in the index data record for the document,
and can be returned and displayed with search results.</para>
</listitem>
</itemizedlist>

<para>A field can be indexed, stored, or both.</para>

<para>A field becomes indexed by having a prefix defined in
the <literal>[prefixes]</literal> section of the
<filename>fields</filename> file. See the comments in there for
details.</para>

<para>A field becomes stored by appearing in
the <literal>[stored]</literal> section of the
<filename>fields</filename> file.</para>

<sect1 id="rcl.program.api">

<sect2 id="rcl.program.api.elements">
<title>Interface elements</title>

<para>A few elements in the interface are specific and need
an explanation.</para>

<variablelist>
<varlistentry>
<term>udi</term> <listitem><para>A udi (unique document
identifier) identifies a document. Because of limitations
inside the index engine, it is restricted in length (to
200 bytes), which is why a regular URI cannot be used. The
structure and contents of the udi are defined by the
application and opaque to the index engine. For example,
the internal file system indexer uses the complete
document path (file path + internal path), truncated to
length, the suppressed part being replaced by a hash
value.</para> </listitem>
</varlistentry>

<varlistentry>
<term>ipath</term>
<listitem><para>This data value (set as a field in the Doc
object) is stored, along with the URL, but not indexed by
&RCL;. Its contents are not interpreted, and its use is up
to the application. For example, the &RCL; internal file
system indexer stores the part of the document access path
internal to the container file (<literal>ipath</literal> in
this case is a list of subdocument sequential numbers). url
and ipath are returned in every search result and permit
access to the original document.</para>
</listitem>
</varlistentry>

<varlistentry>
<term>Stored and indexed fields</term>

<listitem><para>The <filename>fields</filename> file inside
the &RCL; configuration defines which document fields are
either "indexed" (searchable), "stored" (retrievable with
search results), or both.</para>
</listitem>
</varlistentry>
</variablelist>

<para>Data for an external indexer should be stored in a
separate index, not the one for the &RCL; internal file system
indexer (except if the latter is not used at all). The reason
is that the main document indexer purge pass would remove all
the other indexer's documents, as they were not seen during
indexing. The main indexer documents would also probably be a
problem for the external indexer purge operation.</para>
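
<para>Since the udi format is left to the application, an external
indexer has to choose its own scheme within the 200 byte limit
mentioned earlier in this section. The sketch below shows one way to
do it, in the spirit of the description given for the file system
indexer (truncation, with a hash standing in for the dropped part).
The <literal>|</literal> separator and the MD5 digest are assumptions
made for the example, not the scheme actually used by &RCL;.</para>

```python
import hashlib

UDI_MAX_BYTES = 200  # length limit quoted in the text


def make_udi(file_path, ipath="", max_bytes=UDI_MAX_BYTES):
    """Build a bounded-length identifier from a document access path.

    Illustrative only: the separator and the hash function are
    assumptions. The result is a byte string, because the limit is
    expressed in bytes and truncation may split a multibyte character.
    """
    raw = (file_path + "|" + ipath).encode("utf-8")
    if len(raw) <= max_bytes:
        return raw
    # Room left for the path once the 32-character digest is appended.
    keep = max_bytes - 32
    # Replace the suppressed tail by a hash of that tail.
    digest = hashlib.md5(raw[keep:]).hexdigest().encode("ascii")
    return raw[:keep] + digest
```

<para>Two long paths which differ only in their suppressed part still
yield different identifiers, as long as the hash values do not
collide.</para>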

<sect2 id="rcl.program.api.python">
<title>Python interface</title>

<sect3 id="rcl.program.python.intro">
<title>Introduction</title>

<para>&RCL; versions after 1.11 define a Python programming
interface, both for searching and indexing.</para>

<para>The Python interface is not built by default and can be
found in the source package, under <filename>python/recoll</filename>. The
directory contains the usual <filename>setup.py</filename>
script which you can use to build and install the
module:</para>

<screen>
<userinput>cd recoll-xxx/python/recoll</userinput>
<userinput>python setup.py build</userinput>
<userinput>python setup.py install</userinput>
</screen>

<sect3 id="rcl.program.python.manual">
<title>Interface manual</title>

recoll - This is an interface to the Recoll full text indexer.

/usr/local/lib/python2.5/site-packages/recoll.so

class Db(__builtin__.object)
 |  Db([confdir=None], [extra_dbs=None], [writable = False])
 |
 |  A Db object holds a connection to a Recoll index. Use the connect()
 |  function to create one.
 |  confdir specifies a Recoll configuration directory (default:
 |  $RECOLL_CONFDIR or ~/.recoll).
 |  extra_dbs is a list of external databases (xapian directories)
 |  writable decides if we can index new data through this connection
 |
 |  Methods defined here:
 |
 |  addOrUpdate(udi, doc, parent_udi=None) -> None
 |      Add or update index data for a given document
 |      The udi string must define a unique id for the document. It is not
 |      interpreted inside Recoll
 |      doc is a Doc object
 |      if parent_udi is set, this is a unique identifier for the
 |      top-level container (ie mbox file)
 |
 |  delete(udi) -> Bool.
 |      Purge index from all data for udi. If udi matches a container
 |      document, purge all subdocs (docs with a parent_udi matching udi).
 |
 |  makeDocAbstract(...)
 |      makeDocAbstract(Doc, Query) -> string
 |      Build and return 'keyword-in-context' abstract for document
 |
 |  needUpdate(udi, sig) -> Bool.
 |      Check if the index is up to date for the document defined by udi,
 |      having the current signature sig.
 |
 |  purge() -> Bool.
 |      Delete all documents that were not touched during the just finished
 |      indexing pass (since open-for-write). These are the documents for
 |      which the needUpdate() call was not performed, indicating that they
 |      no longer exist in the primary storage system.
 |
 |  query() -> Query. Return a new, blank query object for this index.
 |
 |  setAbstractParams(...)
 |      setAbstractParams(maxchars, contextwords).
 |      Set the parameters used to build 'keyword-in-context' abstracts
 |
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:

class Doc(__builtin__.object)
 |  A Doc object contains index data for a given document.
 |  The data is extracted from the index when searching, or set by the
 |  indexer program when updating. The Doc object has no useful methods but
 |  many attributes to be read or set by its user. It matches exactly the
 |  Rcl::Doc c++ object. Some of the attributes are predefined, but,
 |  especially when indexing, others can be set, the name of which will be
 |  processed as field names by the indexing configuration.
 |  Inputs can be specified as unicode or strings.
 |  Outputs are unicode objects.
 |  All dates are specified as unix timestamps, printed as strings
 |  Predefined attributes (index/query/both):
 |   text (index): document plain text
 |   fbytes (both) optional file size in bytes
 |   fmtime (both) optional file modification date. Unix time printed
 |       as string
 |   dbytes (both) document text bytes
 |   dmtime (both) document creation/modification date
 |   ipath (both) value private to the app.: internal access path
 |   mtype (both) mime type for original document
 |   mtime (query) dmtime if set else fmtime
 |   origcharset (both) charset the text was converted from
 |   size (query) dbytes if set, else fbytes
 |   sig (both) app-defined file modification signature.
 |       For up to date checks
 |   relevancyrating (query)
 |
 |  Methods defined here:
 |
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:

class Query(__builtin__.object)
 |  Recoll Query objects are used to execute index searches.
 |  They must be created by the Db.query() method.
 |
 |  Methods defined here:
 |
 |  execute(query_string, stemming=1|0)
 |      Starts a search for query_string, a Recoll search language string
 |      (mostly Xesam-compatible).
 |      The query can be a simple list of terms (and'ed by default), or more
 |      complicated with field specs etc. See the Recoll manual.
 |
 |  executesd(SearchData)
 |      Starts a search for the query defined by the SearchData object.
 |
 |  fetchone(None) -> Doc
 |      Fetches the next Doc object in the current search results.
 |
 |  sortby(field=fieldname, ascending=true)
 |      Sort results by 'fieldname', in ascending or descending order.
 |      Only one field can be used, no subsorts for now.
 |      Must be called before executing the search
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |
 |  next
 |      Next index to be fetched from results. Normally increments after
 |      each fetchone() call, but can be set/reset before the call to effect
 |      seeking. Starts at 0
 |
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:

class SearchData(__builtin__.object)
 |  A SearchData object describes a query. It has a number of global
 |  parameters and a chain of search clauses.
 |
 |  Methods defined here:
 |
 |  addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub',
 |            qstring=string, slack=int, field=string, stemming=1|0,
 |            subSearch=SearchData)
 |      Adds a simple clause to the SearchData And/Or chain, or a subquery
 |      defined by another SearchData object
 |
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:

connect([confdir=None], [extra_dbs=None], [writable = False])
    Connects to a Recoll database and returns a Db object.
    confdir specifies a Recoll configuration directory
    (the default is built like for any Recoll program).
    extra_dbs is a list of external databases (xapian directories)
    writable decides if we can index new data through this connection

<sect3 id="rcl.program.python.examples">
<title>Example code</title>

<para>The following sample queries the index with a user
language string. See the <filename>python/samples</filename>
directory inside the &RCL; source for other examples.</para>

<programlisting>
#!/usr/bin/env python
# Python 2 sample, matching the interface documented above.

import recoll

db = recoll.connect()
db.setAbstractParams(maxchars=80, contextwords=2)

# Query objects are created by Db.query()
query = db.query()
nres = query.execute("some user question")
print "Result count: ", nres

# query.next is the index of the next result to be fetched
while query.next >= 0 and query.next < nres:
    doc = query.fetchone()
    # Outputs are unicode objects: encode them before printing
    for k in ("title", "size"):
        print k, ":", getattr(doc, k).encode('utf-8')
    abs = db.makeDocAbstract(doc, query).encode('utf-8')
    print abs
</programlisting>
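
<para>On the indexing side, the interface manual describes a protocol
built on <literal>needUpdate()</literal>,
<literal>addOrUpdate()</literal> and <literal>purge()</literal>. The
following toy model may help to see how these calls fit together in
one indexing pass. It does not use the <literal>recoll</literal>
module: <literal>ToyIndex</literal> and
<literal>index_pass</literal> are invented stand-ins, and
<literal>addOrUpdate</literal> is simplified to take a signature and
raw data instead of a <literal>Doc</literal> object.</para>

```python
class ToyIndex:
    """In-memory stand-in mimicking the update protocol described for Db."""

    def __init__(self):
        self.docs = {}      # udi -> (sig, data)
        self.seen = set()   # udis checked during the current pass

    def needUpdate(self, udi, sig):
        # Checking a document also records that it still exists.
        self.seen.add(udi)
        stored = self.docs.get(udi)
        return stored is None or stored[0] != sig

    def addOrUpdate(self, udi, sig, data):
        self.seen.add(udi)
        self.docs[udi] = (sig, data)

    def purge(self):
        # Remove every document which was not checked during this pass.
        stale = [udi for udi in self.docs if udi not in self.seen]
        for udi in stale:
            del self.docs[udi]
        self.seen.clear()
        return stale


def index_pass(index, source):
    """Run one pass; source maps udi -> (sig, data) for existing documents.

    Returns the list of updated udis and the list of purged udis.
    """
    updated = []
    for udi, (sig, data) in sorted(source.items()):
        if index.needUpdate(udi, sig):
            index.addOrUpdate(udi, sig, data)
            updated.append(udi)
    return updated, index.purge()
```

<para>The model also shows why two indexers should not share an
index: a purge run by one of them removes every document it did not
check itself during its own pass.</para>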

<para>The <literal>recoll_applet</literal> has a small text
window where you can type a &RCL; query (in query language
form), and an icon which can be used to restrict the search to
certain types of files.</para>

<sect1 id="rcl.extending">
<title>Extending &RCL;</title>

<sect2 id="rcl.extending.filters">
<title>Writing a document filter</title>

<para>&RCL; filters are executable programs which
translate from a specific format (e.g.
<application>openoffice</application>,
<application>acrobat</application>, etc.) to the &RCL;
indexing input format, which was chosen to be HTML.</para>

<para>&RCL; filters are usually shell scripts, but this is in
no way necessary. These programs are extremely simple and most
of the difficulty lies in extracting the text from the native
format, not in outputting what &RCL; expects. Fortunately,
most document formats already have translators or text
extractors which handle the difficult part and can be called
from the filter.</para>

<para>Filters are called with a single argument which is the
source file name. They should output the result to stdout.</para>

<para>The <literal>RECOLL_FILTER_FORPREVIEW</literal>
environment variable (values <literal>yes</literal>,
<literal>no</literal>) tells the filter whether the operation is
for indexing or previewing. Some filters use this to output a
slightly different format. Handling it is optional.</para>

<para>The output HTML could be very minimal like the following
example:</para>

<programlisting><html><head>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
</head>
<body>some text content</body></html>
</programlisting>

<para>You should take care to escape some characters inside
the text by transforming them into appropriate
entities. "<literal>&</literal>" should be transformed into
"<literal>&amp;</literal>", "<literal><</literal>"
should be transformed into "<literal>&lt;</literal>".</para>

<para>The character set needs to be specified in the
header. It does not need to be UTF-8 (&RCL; will take care
of translating it), but it must be accurate for good
results.</para>

<para>&RCL; will also make use of other header fields if
they are present: <literal>title</literal>,
<literal>description</literal>,
<literal>keywords</literal>.</para>

<para>As of &RCL; release 1.9, filters can also
"invent" field names. These should be output as
meta tags:</para>

<programlisting>
<meta name="somefield" content="Some textual data" />
</programlisting>

<para>In this case, a correspondence between field name and
&XAP; prefix should also be added to the
<filename>mimeconf</filename> file. See the existing entries
for inspiration. The field can then be used inside the query
language to narrow searches.</para>

<para>The easiest way to write a new filter is probably to start
from an existing one.</para>

<para>The <literal>recoll_applet</literal> is quite primitive, and launches a
new recoll GUI instance every time (even if one is already
running). You may find it useful anyway.</para>