1.3. Recoll overview

Recoll uses the Xapian information retrieval library as its storage and retrieval engine. Xapian is a very mature package using a sophisticated probabilistic ranking model. Recoll provides the mechanisms and interface to get data into and out of the system.

In practice, Xapian works by remembering where terms appear in your document files. The acquisition process is called indexing.

The resulting index can be big (roughly the size of the original document set), but it is not a document archive. Recoll can only display documents that still exist at the place from which they were indexed. (Actually, there is a way to reconstruct a document from the information in the index, but the result is not nice, as all formatting, punctuation and capitalization are lost).

Recoll stores all internal data in Unicode UTF-8 format, and it can index files with different character sets, encodings, and languages into the same index. It has input filters for many document types.

Stemming is the process by which Recoll reduces words to their radicals so that searching does not depend, for example, on a word being singular or plural (floor, floors), or on a verb tense (flooring, floored). Because the mechanisms used for stemming depend on the specific grammatical rules for each language, there is a separate stemmer module for most common languages where stemming makes sense. Storing documents written in different languages in the same index is possible, and commonly done. In this situation, you can specify several stemming languages for the index. Recoll stores the unstemmed versions of terms in the main index and uses auxiliary databases for term expansion (one for each stemming language), which means that you can switch stemming languages between searches, or add a language without needing a full reindex. Recoll currently makes no attempt at automatic language recognition, which means that the stemmer will sometimes be applied to terms from other languages with potentially strange results. In practise, even if this introduces possibilities of confusion, this approach has been proven quite useful, and, awaiting the addition of an automatic language recognition module to Recoll, it is much less cumbersome than separating your documents according to what language they are written in.

Recoll has many parameters which define exactly what to index, and how to classify and decode the source documents. These are kept in configuration files. A default configuration is copied into a standard location (usually something like /usr/[local/]share/recoll/examples) during installation. The default values set by the configuration files in this directory may be overridden by values that you set inside your personal configuration, found by default in the .recoll sub-directory of your home directory. The default configuration will index your home directory with default parameters and should be sufficient for giving Recoll a try, but you may want to adjust it later, which can be done either by editing the text files or by using configuration menus in the recoll GUI

The indexing process is started automatically the first time you execute the recoll GUI. Indexing can also be performed by executing the recollindex command.

Searches are usually performed inside the recoll GUI, which has many options to help you find what you are looking for. However, there are other ways to perform Recoll searches: mostly a command line interface, a Python programming interface, a KDE KIO slave module, and a Ubuntu Unity Lens module.