Strigi is a fast and light desktop search engine. It can handle a large range of file formats such as emails, office documents, media files, and file archives. It can index files that are embedded in other files. This means email attachments and files in zip files are searchable as if they were normal files on your hard disk.

Strigi is normally run as a background daemon that can be accessed by many other programs at once. In addition to the daemon, Strigi comes with powerful replacements for the popular Unix commands 'find' and 'grep'. These are called 'deepfind' and 'deepgrep' and can search inside files just like the Strigi daemon can.

For my personal use I've written the jstreams classes, which make it easy to read nested files. These have proven very fast and have been included in the CLucene C++ search engine. These classes would also be a cool extension to the KIO plugins, allowing the user to browse e.g. the files in a zip file that is stored in an email attachment. Another use would be to write a crawler that can gather information from all files in the filesystem even if they are hidden in emails or archives. I intended to add this feature to Kat, but because of the slowdown in the Kat project the latest Kat development version is not complete and does not build.

So I developed a small daemon that can index information using the new crawler. I've now reached a point where the crawler is very stable and fast. How fast exactly depends on your system. It comes complete with a simple GUI to control the daemon and to search. I've named the thing Strigi, because I hope it grows into a Kat.

Here are the main features of Strigi:

- very small memory footprint
- no hammering of the system
- pluggable backend: currently CLucene and Hyper Estraier; SQLite3 and Xapian are in the works
- communication between the daemon and search programs over an abstract interface. This is currently a simple socket, but implementing D-Bus is a possibility. There's a small Perl program in the code as an example of how to query. This is so easy that any KDE app could implement it.
- simple interface for implementing plugins that extract information. We'll try to reuse the Kat plugins, although native plugins will have a large speed advantage
- calculation of SHA-1 for every file crawled (allows fast finding of duplicates)

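Storing a SHA-1 digest per file is what makes duplicate detection cheap: files with the same digest are, for practical purposes, identical. As a rough illustration only (this is Python, not Strigi's C++ code, and the names sha1_of_file and find_duplicates are made up for this sketch), grouping files by digest could look like this:

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha1_of_file(path, chunk_size=65536):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group regular files under root by SHA-1 and keep groups of two or more."""
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_digest[sha1_of_file(path)].append(path)
    return {digest: paths for digest, paths in by_digest.items() if len(paths) > 1}
```

Calling find_duplicates on a directory returns a dict that maps each repeated digest to the list of paths sharing it.
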
Requirements:

- CLucene >= 0.9.16 (http://clucene.sf.net)
- CMake >= 2.4.2 (http://www.cmake.org)
- ZLib >= 1.2.3 (http://www.zlib.net)
- BZip2 >= 1.0.3 (http://www.bzip.org)
- LibXml2 (http://xmlsoft.org/)

Optional:

- Qt4 >= 4.2 (for a graphical interface)
- D-Bus (http://www.freedesktop.org/wiki/Software/dbus): optional, but turned on by default
- Linux kernel >= 2.6.13 (for inotify support)
- FAM or Gamin for FAM file system monitoring support (Gamin is recommended)
- log4cxx >= 0.9.7 (http://logging.apache.org/log4cxx/) for advanced logging features

If you are upgrading from a version prior to 0.3.9, you can convert your old configuration files using a simple command-line program called strigiconfupdater. Just type 'strigiconfupdater' to see the help message explaining how it works.

How to obtain and build Strigi from SVN?
-----------------------------------------------

Execute these commands:

svn co svn://anonsvn.kde.org/home/kde/trunk/kdesupport/strigi
cmake -DCMAKE_BUILD_TYPE=DEBUG ..

The '..' argument assumes that cmake is run from a build directory created inside the checkout.

Some possible cmake options:

-DCMAKE_INSTALL_PREFIX=${HOME}/testinstall
    install Strigi in a custom directory
-DCMAKE_INCLUDE_PATH=${HOME}/testinstall/include
    include a custom include directory
-DCMAKE_LIBRARY_PATH=${HOME}/testinstall/lib
    include a custom library directory
-DENABLE_INOTIFY:BOOL=ON
    enable inotify support; requires kernel >= 2.6.13 with inotify support enabled

    enable FAM support; requires FAM or Gamin (Gamin is recommended) installed
-DENABLE_LOG4CXX:BOOL=ON
    enable log4cxx support; provides advanced logging features using the log4cxx library

    use D-Bus for communication instead of the socket-based communication
-DLIB_DESTINATION=lib64
    use this if you have a 64-bit system with a separate directory for 64-bit libraries

You can't enable inotify and polling at the same time; you have to choose one of them.
If you want to use the GUI, you need Qt >= 4.1.2 installed. On Debian and Kubuntu, you can install it with 'sudo apt-get install libqt4-*'.
If the cmake call still cannot find Qt4, you can call cmake like this:
QTDIR=/usr/lib/qt4 PATH=$QTDIR/bin:$PATH cmake ..

Strigi can currently use two different backends, with two more in the works. Install at least CLucene or Hyper Estraier.

++ CLucene http://clucene.sf.net/
++ Hyper Estraier http://hyperestraier.sourceforge.net/
+ Sqlite3 http://sqlite.org/
+ Xapian http://xapian.org/

You need CLucene 0.9.16 or later. It can be found here:
http://sourceforge.net/project/showfiles.php?group_id=80013

Start Strigi by running 'strigiclient', then choose a backend and press 'Start daemon'. Now you can configure directories to index and start indexing.

Here's what's in the different subdirectories:

A collection of stream classes that are inspired by java.io.InputStream. These
classes can be nicely nested so that you can transform streams or read
substreams that represent a nested file. E.g. ZipStreamProvider takes a
stream as input and gives out substreams with the contents of the files in
the zipfile/zipstream.

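To give a feel for the idea of nested substreams, here is a small sketch in Python using the standard zipfile module. It is not the jstreams API: walk_nested is a hypothetical helper, and unlike a true stream class it reads each entry fully into memory for simplicity.

```python
import io
import zipfile


def walk_nested(stream, prefix=""):
    """Yield (path, data) for every file in a zip stream, descending into
    zip archives that are stored inside it."""
    with zipfile.ZipFile(stream) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            path = prefix + info.filename
            data = archive.read(info)
            inner = io.BytesIO(data)
            if zipfile.is_zipfile(inner):
                # The entry is itself an archive: treat it as a new stream.
                inner.seek(0)
                yield from walk_nested(inner, path + "/")
            else:
                yield path, data
```

Each yielded path reads like a normal file path, with the enclosing archives as leading components, which is the "searchable as if they were normal files" effect in miniature.
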
If you want to crawl nested files, you need a special crawler that can work on
files in different levels at the same time. This is what StreamIndexer does.
It takes a stream as an input and passes it through two types of analyzers:
ThroughStreamAnalyzers and one EndStreamAnalyzer. One ThroughStreamAnalyzer
can e.g. calculate sha1 or md5 from a stream and another one can extract URLs
or email addresses. An EndStreamAnalyzer is an analyzer that 'consumes' the
stream. Usually, these split up a stream into its substreams and pass these
into the indexer again. I hope to write a plugin mechanism for these
analyzers. Maybe I'll just add wrappers around other efforts such as
kio-plugins and libextractor. These usually don't like streams very much, but

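The through/end analyzer flow can be sketched roughly as follows. This is an illustration in Python, not Strigi's implementation: the names Sha1Analyzer, EmailAnalyzer, zip_end_analyzer and index_stream are invented for the example, zip is the only container handled, and the data is buffered rather than truly streamed to keep the sketch short.

```python
import hashlib
import io
import re
import zipfile


class Sha1Analyzer:
    """Through analyzer: hashes the bytes as they pass by."""

    def __init__(self):
        self._digest = hashlib.sha1()

    def feed(self, chunk):
        self._digest.update(chunk)

    def result(self):
        return {"sha1": self._digest.hexdigest()}


class EmailAnalyzer:
    """Through analyzer: collects strings that look like email addresses."""

    _pattern = re.compile(rb"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

    def __init__(self):
        self._buffer = bytearray()

    def feed(self, chunk):
        self._buffer.extend(chunk)  # buffered so matches can span chunks

    def result(self):
        found = set(self._pattern.findall(bytes(self._buffer)))
        return {"emails": sorted(match.decode() for match in found)}


def zip_end_analyzer(data):
    """End analyzer: if data is a zip archive, yield (name, bytes) substreams."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return
    with zipfile.ZipFile(buffer) as archive:
        for info in archive.infolist():
            if not info.is_dir():
                yield info.filename, archive.read(info)


def index_stream(name, stream, results, chunk_size=4096):
    """Feed the stream to the through analyzers, then let the end analyzer
    split it and index every substream recursively."""
    through = [Sha1Analyzer(), EmailAnalyzer()]
    consumed = bytearray()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        for analyzer in through:
            analyzer.feed(chunk)
        consumed.extend(chunk)
    fields = {}
    for analyzer in through:
        fields.update(analyzer.result())
    results[name] = fields
    for child, data in zip_end_analyzer(bytes(consumed)):
        index_stream(name + "/" + child, io.BytesIO(data), results)
    return results
```

Every stream, whether top-level or nested, passes through the same set of through analyzers, so a file inside an archive ends up with its own digest and extracted addresses.
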
All information for a document is stored in an Indexable document. This
calls an IndexWriter to actually store the information. An IndexReader allows
one to read from an index and to query it. Handling of concurrency and
resources of the particular index implementation is done by an IndexManager.
These are all abstract classes that can be implemented for different types of
indexes, e.g. CLucene, SQLite or Xapian.

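The division of labour between these classes might look something like the following sketch. It is written in Python for brevity and is only an assumption about the general shape: the method names (add, query, writer, reader, set_field, finish) and the in-memory backend are hypothetical and are not taken from Strigi's headers.

```python
import threading
from abc import ABC, abstractmethod


class IndexWriter(ABC):
    @abstractmethod
    def add(self, path, fields):
        """Store the fields of one document under its path."""


class IndexReader(ABC):
    @abstractmethod
    def query(self, term):
        """Return the paths of the documents that match term."""


class IndexManager(ABC):
    """Owns the index resources and hands out readers and writers."""

    @abstractmethod
    def writer(self):
        """Return an IndexWriter for this index."""

    @abstractmethod
    def reader(self):
        """Return an IndexReader for this index."""


class Indexable:
    """Collects the fields of one document, then passes them to a writer."""

    def __init__(self, path, writer):
        self.path = path
        self._writer = writer
        self._fields = {}

    def set_field(self, name, value):
        self._fields[name] = value

    def finish(self):
        self._writer.add(self.path, self._fields)


class MemoryIndexManager(IndexManager):
    """Toy backend: a dict guarded by a lock (stands in for a real index)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.documents = {}

    def writer(self):
        return MemoryIndexWriter(self)

    def reader(self):
        return MemoryIndexReader(self)


class MemoryIndexWriter(IndexWriter):
    def __init__(self, manager):
        self._manager = manager

    def add(self, path, fields):
        with self._manager.lock:
            self._manager.documents[path] = dict(fields)


class MemoryIndexReader(IndexReader):
    def __init__(self, manager):
        self._manager = manager

    def query(self, term):
        needle = term.lower()
        with self._manager.lock:
            return sorted(
                path
                for path, fields in self._manager.documents.items()
                if any(needle in str(value).lower() for value in fields.values())
            )
```

Because the crawler only talks to the abstract classes, swapping the toy backend for another one would not change the indexing code.
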
Code to run a daemon for handling indexing and client requests. Also code to
handle re-indexing a directory. Will have code for filtering out directories
and selecting which plugins to use for which files. Maybe add code for
merging different IndexReaders for querying multiple databases.

Implementations of IndexManager, IndexReader and IndexWriter.

Yeah, the original project. It has glue between jstreams and Qt4
QAbstractFileEngine. This allows you to let Qt4 read an arbitrarily deeply

Qt4 file dialog that uses libarchivereader.