Indexing is the process by which the set of documents is analyzed and the data entered into the database. Recoll indexing is normally incremental: documents will only be processed if they have been modified. On the first execution, all documents will need processing. A full index build can be forced later by specifying an option to the indexing command (recollindex -z).
Recoll indexing can be performed with two different methods:
Periodic (or Batch) indexing: indexing takes place at discrete times, by executing the recollindex command. The typical usage is to have a nightly indexing run programmed into your cron file.
Real time indexing: indexing takes place as soon as a file is created or changed. recollindex runs as a daemon and uses a file system alteration monitor such as inotify, Fam or Gamin to detect file changes.
The choice between the two methods is mostly a matter of preference, and they can be combined by setting up multiple indexes (e.g. use periodic indexing on a big documentation directory, and real time indexing on a small home directory). Keep in mind that monitoring a big file system tree can consume significant system resources.
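As a rough sketch of the two methods: the snippet below holds a crontab line for a nightly run and a small wrapper that starts the monitoring daemon. The 03:00 schedule and the names cron_line and start_monitor are illustrative assumptions, and -m is assumed to be the recollindex option that selects monitor mode on your version (check recollindex's usage message).

```shell
# Periodic indexing: a crontab line (install it with `crontab -e`) that runs
# one incremental indexing pass every night at 03:00. The time is an example.
cron_line='0 3 * * * recollindex'
printf '%s\n' "$cron_line"

# Real time indexing: start recollindex as a daemon that watches the file
# system for changes. Defined as a function so that nothing is started until
# you call it, for example from your session startup script.
start_monitor() {
    recollindex -m
}
```

Calling start_monitor once per login session should be enough, since the daemon keeps running in the background.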
Recoll knows about quite a few different document types. The parameters for document type recognition and processing are set in configuration files.
Most file types, like HTML or word processing files, only hold one document. Some file types, like email folders or zip archives, can hold many individually indexed documents, which may themselves be compound. Such hierarchies can go quite deep, and Recoll can process, for example, an ms-word document stored as an attachment to an email message inside an email folder archived in a zip file...
Recoll indexing processes plain text, HTML, OpenDocument (Open/LibreOffice), email formats, and a few others internally.
Other file types (e.g. postscript, pdf, ms-word, rtf ...) need external applications for preprocessing. The list is in the installation section. After every indexing operation, Recoll updates a list of commands that would be needed for indexing existing file types. This list can be displayed by selecting the menu option File->Show Missing Helpers in the recoll GUI. It is stored in the text file named missing inside the configuration directory.
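The same list can be read from a terminal. This is a minimal sketch which assumes that the configuration directory is ~/.recoll unless the RECOLL_CONFDIR environment variable points elsewhere; the variable names confdir and missing_file are only illustrative.

```shell
# Locate the configuration directory (assumed default: ~/.recoll).
confdir="${RECOLL_CONFDIR:-$HOME/.recoll}"
missing_file="$confdir/missing"

# Show the helper commands recorded by the last indexing pass, if any.
if [ -s "$missing_file" ]; then
    cat "$missing_file"
else
    echo "No missing helpers recorded in $missing_file"
fi
```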
Without further configuration, Recoll will index all appropriate files from your home directory, with a reasonable set of defaults.
In some cases, it may be useful to index different areas of the file system to separate databases. You can do this by using multiple configuration directories, each indexing a file system area to a specific database. See the section about using multiple databases for more information on multiple configurations and indexes.
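A second configuration might be set up along these lines. This is only a sketch: the directory name ~/.recoll-docs and the indexed path /usr/share/doc are hypothetical examples, and it assumes that the topdirs parameter in recoll.conf selects the areas to index and that the -c option tells recollindex which configuration directory to use.

```shell
# Hypothetical second configuration directory (the name is an assumption).
docs_conf="$HOME/.recoll-docs"
mkdir -p "$docs_conf"

# Minimal configuration: index only the documentation tree (example path).
# Note that this overwrites any recoll.conf already present in that directory.
printf 'topdirs = %s\n' "/usr/share/doc" > "$docs_conf/recoll.conf"

# Build or update the index belonging to this configuration.
# The step is skipped if recollindex is not installed.
if command -v recollindex >/dev/null 2>&1; then
    recollindex -c "$docs_conf"
fi
```

With no other settings, the resulting database should end up inside the new configuration directory, separate from the default index.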
In the rare case where the index becomes corrupted (which can signal itself by weird search results or crashes), the index files need to be erased before restarting a clean indexing pass. Just delete the xapiandb directory (see next section), or, alternatively, start the next recollindex with the -z option, which will reset the database before indexing.
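Both recovery routes can be sketched as shell functions. The location of xapiandb inside ~/.recoll is an assumed default (it may differ if the configuration sets another database directory), and the function names are illustrative. Nothing is deleted until you call one of them.

```shell
# Assumed default locations; adjust if your configuration differs.
confdir="${RECOLL_CONFDIR:-$HOME/.recoll}"
dbdir="$confdir/xapiandb"

# Route 1: let recollindex reset the database before indexing.
reset_by_option() {
    recollindex -z
}

# Route 2: erase the index files by hand, then run a clean indexing pass.
reset_by_delete() {
    rm -rf "$dbdir" && recollindex
}
```

Either function leaves the configuration files untouched, and only the index data is rebuilt.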