<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta content="Apache Forrest" name="Generator">
<meta name="Forrest-version" content="0.9">
<meta name="Forrest-skin-name" content="lucene">
Apache Lucene - Basic Demo Sources Walk-through
<link type="text/css" href="skin/basic.css" rel="stylesheet">
<link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet">
<link media="print" type="text/css" href="skin/print.css" rel="stylesheet">
<link type="text/css" href="skin/profile.css" rel="stylesheet">
<script src="skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="skin/fontsize.js" language="javascript" type="text/javascript"></script>
<link rel="shortcut icon" href="images/favicon.ico">
<body onload="init()">
<script type="text/javascript">ndeSetTextSize();</script>
<a class="selected" href="index.html">Lucene 3.5 Documentation</a>
Apache Lucene - Basic Demo Sources Walk-through
<div id="minitoc-area">
<a href="#About the Code">About the Code</a>
<a href="#Location of the source">Location of the source</a>
<a href="#IndexFiles">IndexFiles</a>
<a href="#Searching Files">Searching Files</a>
<a name="N10013"></a><a name="About the Code"></a>
<h2 class="boxed">About the Code</h2>
<div class="section">
In this section we walk through the sources behind the command-line Lucene demo: where to find them,
their parts and their function. This section is intended for Java developers wishing to understand
how to use Lucene in their applications.
<a name="N1001C"></a><a name="Location of the source"></a>
<h2 class="boxed">Location of the source</h2>
<div class="section">
NOTE: to examine the sources, you need to download and extract a source release of
Lucene (lucene-{version}-src.zip).

Relative to the directory created when you extracted Lucene, you
should see a directory called <span class="codefrag">lucene/contrib/demo/</span>. This is the root for the Lucene
demo. Under this directory is <span class="codefrag">src/java/org/apache/lucene/demo/</span>. This is where all
the Java sources for the demo live.

Within this directory you should see the <span class="codefrag">IndexFiles.java</span> class we executed earlier.
Bring it up in <span class="codefrag">vi</span> or your editor of choice and let's take a look at it.
<a name="N10037"></a><a name="IndexFiles"></a>
<h2 class="boxed">IndexFiles</h2>
<div class="section">
As we discussed in the previous walk-through, the <a href="api/contrib-demo/org/apache/lucene/demo/IndexFiles.html">IndexFiles</a> class creates a Lucene
index. Let's take a look at how it does this.

The <span class="codefrag">main()</span> method parses the command-line parameters, then in preparation for
instantiating <a href="api/core/org/apache/lucene/index/IndexWriter.html">IndexWriter</a>, opens a
<a href="api/core/org/apache/lucene/store/Directory.html">Directory</a> and instantiates
<a href="api/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html">StandardAnalyzer</a> and
<a href="api/core/org/apache/lucene/index/IndexWriterConfig.html">IndexWriterConfig</a>.

The value of the <span class="codefrag">-index</span> command-line parameter is the name of the filesystem directory
where all index information should be stored. If <span class="codefrag">IndexFiles</span> is invoked with a
relative path in the <span class="codefrag">-index</span> command-line parameter, or without the <span class="codefrag">-index</span>
command-line parameter (in which case the default relative index path "<span class="codefrag">index</span>"
is used), the index path will be created as a subdirectory of the current working directory
if it does not already exist. On some platforms, the index path may be created in a different
directory (such as the user's home directory).

The <span class="codefrag">-docs</span> command-line parameter value is the location of the directory containing
the files to be indexed.

The <span class="codefrag">-update</span> command-line parameter tells <span class="codefrag">IndexFiles</span> not to delete the
index if it already exists. When <span class="codefrag">-update</span> is not given, <span class="codefrag">IndexFiles</span> will
first wipe the slate clean before indexing any documents.
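
To make the three parameters concrete, here is a minimal sketch of how such options could be parsed. It is not the demo's actual code: the DemoArgs class, its fields and the indexDirectory() helper are hypothetical names introduced for illustration. Only the option names and the default index path "index" come from the description above.

```java
import java.io.File;

/** Hypothetical holder for the three options described above; not part of the demo. */
final class DemoArgs {
    final String indexPath; // value of -index, defaults to "index"
    final String docsPath;  // value of -docs, null when the option is absent
    final boolean update;   // true when -update is present

    private DemoArgs(String indexPath, String docsPath, boolean update) {
        this.indexPath = indexPath;
        this.docsPath = docsPath;
        this.update = update;
    }

    /** Scans the arguments once; unknown arguments are ignored in this sketch. */
    static DemoArgs parse(String[] args) {
        String indexPath = "index";
        String docsPath = null;
        boolean update = false;
        for (int i = 0; i < args.length; i++) {
            if ("-index".equals(args[i]) && i + 1 < args.length) {
                indexPath = args[++i];
            } else if ("-docs".equals(args[i]) && i + 1 < args.length) {
                docsPath = args[++i];
            } else if ("-update".equals(args[i])) {
                update = true;
            }
        }
        return new DemoArgs(indexPath, docsPath, update);
    }

    /** A relative index path is resolved against the current working directory. */
    File indexDirectory() {
        return new File(indexPath).getAbsoluteFile();
    }
}
```

With this sketch, parsing the arguments "-docs src -update" leaves the index path at its default and sets the update flag.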

Lucene <a href="api/core/org/apache/lucene/store/Directory.html">Directory</a>s are used by the
<span class="codefrag">IndexWriter</span> to store information in the index. In addition to the
<a href="api/core/org/apache/lucene/store/FSDirectory.html">FSDirectory</a> implementation we are using,
there are several other <span class="codefrag">Directory</span> subclasses that can write to RAM, to databases, etc.

Lucene <a href="api/core/org/apache/lucene/analysis/Analyzer.html">Analyzer</a>s are processing pipelines
that break up text into indexed tokens, a.k.a. terms, and optionally perform other operations on these
tokens, e.g. downcasing, synonym insertion, filtering out unwanted tokens, etc. The <span class="codefrag">Analyzer</span>
we are using is <span class="codefrag">StandardAnalyzer</span>, which creates tokens using the Word Break rules from the
Unicode Text Segmentation algorithm specified in <a href="http://unicode.org/reports/tr29/">Unicode
Standard Annex #29</a>; converts tokens to lowercase; and then filters out stopwords. Stopwords are
common language words such as articles (a, an, the, etc.) and other tokens that may have less value for
searching. Analysis rules differ from language to language, so you should use the
analyzer appropriate to each. Lucene currently provides Analyzers for a number of different languages (see
<a href="api/all/org/apache/lucene/analysis/">lucene/contrib/analyzers/common/src/java/org/apache/lucene/analysis</a>).
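
The three stages described above (tokenize, lowercase, drop stopwords) can be sketched with nothing but the Java standard library. This is a toy stand-in, not StandardAnalyzer: the ToyAnalyzer class and its stopword list are assumptions made for illustration, and the tokenizer simply splits on anything that is not a letter or digit rather than applying the UAX #29 Word Break rules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analysis pipeline; hypothetical class, much cruder than StandardAnalyzer. */
final class ToyAnalyzer {
    // Assumed stopword list, chosen only for this example.
    private static final Set<String> STOPWORDS =
        new HashSet<String>(Arrays.asList("a", "an", "and", "the"));

    /** Returns the terms that would be indexed for the given text. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<String>();
        // Stage 1: tokenize by splitting on runs of non-letter, non-digit characters.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // a leading separator produces one empty token
            }
            // Stage 2: lowercase, independent of the default locale.
            String term = token.toLowerCase(Locale.ROOT);
            // Stage 3: drop stopwords.
            if (!STOPWORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }
}
```

For the input "The Quick brown Fox and a Dog", this sketch yields the terms quick, brown, fox and dog.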

The <span class="codefrag">IndexWriterConfig</span> instance holds all configuration for <span class="codefrag">IndexWriter</span>. For
example, the <span class="codefrag">OpenMode</span> is set here according to whether the <span class="codefrag">-update</span>
command-line parameter was given.
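
The decision itself is small enough to sketch. The enum below is a local stand-in so that the example compiles without Lucene on the classpath; OpenModeChoice and forUpdateFlag() are hypothetical names, and pairing CREATE with the "wipe the slate clean" case is how we read the description of -update above.

```java
/** Hypothetical helper showing how the -update flag could select an open mode. */
final class OpenModeChoice {
    /** Local stand-in for the open modes discussed in the text. */
    enum Mode { CREATE, CREATE_OR_APPEND }

    /**
     * Without -update the index is rebuilt from scratch (CREATE);
     * with -update an existing index is kept and extended (CREATE_OR_APPEND).
     */
    static Mode forUpdateFlag(boolean update) {
        return update ? Mode.CREATE_OR_APPEND : Mode.CREATE;
    }
}
```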

Looking further down in the file, after <span class="codefrag">IndexWriter</span> is instantiated, you should see the
<span class="codefrag">indexDocs()</span> code. This recursive method crawls the directories and creates
<a href="api/core/org/apache/lucene/document/Document.html">Document</a> objects. The
<span class="codefrag">Document</span> is simply a data object to represent the text content from the file as well as
its last-modified time and location. These instances are added to the <span class="codefrag">IndexWriter</span>. If
the <span class="codefrag">-update</span> command-line parameter is given, the <span class="codefrag">IndexWriter</span>
<span class="codefrag">OpenMode</span> will be set to <span class="codefrag">OpenMode.CREATE_OR_APPEND</span>, and rather than
adding documents to the index, the <span class="codefrag">IndexWriter</span> will <strong>update</strong> them
in the index by attempting to find an already-indexed document with the same identifier (in our
case, the file path serves as the identifier); deleting it from the index if it exists; and then
adding the new document to the index.
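
The crawl and the update-by-identifier behaviour can be sketched without Lucene. In the toy below, a map keyed by file path stands in for the index, so storing a document under an existing path replaces the earlier one, which mirrors the delete-then-add sequence described above. ToyIndexer, Doc and docsByPath are hypothetical names, and reading every file as UTF-8 is an assumption of the sketch.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy stand-in for the indexing loop; a map plays the role of the index. */
final class ToyIndexer {
    /** Plain data object: file location, last-modified time and text content. */
    static final class Doc {
        final String path;
        final long modified;
        final String contents;

        Doc(String path, long modified, String contents) {
            this.path = path;
            this.modified = modified;
            this.contents = contents;
        }
    }

    // The file path is the identifier, so there is at most one Doc per path.
    final Map<String, Doc> docsByPath = new LinkedHashMap<String, Doc>();

    /** Recursively visits a file or directory and indexes every readable file. */
    void indexDocs(File file) throws IOException {
        if (!file.canRead()) {
            return; // skip unreadable entries
        }
        if (file.isDirectory()) {
            File[] children = file.listFiles();
            if (children != null) { // null signals an I/O error while listing
                for (File child : children) {
                    indexDocs(child);
                }
            }
        } else {
            String contents =
                new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
            Doc doc = new Doc(file.getPath(), file.lastModified(), contents);
            // put() drops any earlier Doc with the same path, then stores the new one.
            docsByPath.put(doc.path, doc);
        }
    }
}
```

Running indexDocs() twice over the same directory leaves one entry per file, each holding the most recently read content.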
<a name="N100DB"></a><a name="Searching Files"></a>
<h2 class="boxed">Searching Files</h2>
<div class="section">
The <a href="api/contrib-demo/org/apache/lucene/demo/SearchFiles.html">SearchFiles</a> class is
quite simple. It primarily collaborates with an
<a href="api/core/org/apache/lucene/search/IndexSearcher.html">IndexSearcher</a>,
<a href="api/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html">StandardAnalyzer</a> (which is used in the
<a href="api/contrib-demo/org/apache/lucene/demo/IndexFiles.html">IndexFiles</a> class as well)
and a <a href="api/core/org/apache/lucene/queryParser/QueryParser.html">QueryParser</a>. The
query parser is constructed with an analyzer so that your query text is interpreted in the same way the
documents are: finding word boundaries, downcasing, and removing stopwords such as
'a', 'an' and 'the'. The
<a href="api/core/org/apache/lucene/queryParser/QueryParser.html">QueryParser</a> produces a
<a href="api/core/org/apache/lucene/search/Query.html">Query</a> object, which is passed
to the searcher. Note that it's also possible to programmatically construct a rich
<a href="api/core/org/apache/lucene/search/Query.html">Query</a> object without using the query
parser. The query parser just enables decoding the <a href="queryparsersyntax.html">Lucene query
syntax</a> into the corresponding <a href="api/core/org/apache/lucene/search/Query.html">Query</a>
object.
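
Why the same analysis must be applied on both sides can be shown with a toy matcher. This is not QueryParser and it understands none of the Lucene query syntax: ToyQueryMatcher, its stopword list and its match-any-term rule are assumptions made for illustration. The point is only that query text and document text pass through the same analyze() step before terms are compared.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy matcher: analyzes query and document identically, then compares terms. */
final class ToyQueryMatcher {
    // Assumed stopword list, chosen only for this example.
    private static final Set<String> STOPWORDS =
        new HashSet<String>(Arrays.asList("a", "an", "the"));

    /** Shared analysis step: split on non-alphanumerics, lowercase, drop stopwords. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<String>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            String term = token.toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !STOPWORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    /** True when at least one analyzed query term occurs in the analyzed document. */
    static boolean matches(String queryText, String documentText) {
        Set<String> documentTerms = new HashSet<String>(analyze(documentText));
        for (String term : analyze(queryText)) {
            if (documentTerms.contains(term)) {
                return true;
            }
        }
        return false; // also the result when the query contains only stopwords
    }
}
```

Because both sides are lowercased, the query "The FOX" matches the document "A quick brown fox." in this sketch, while a query consisting only of "the" matches nothing.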

<span class="codefrag">SearchFiles</span> uses the <span class="codefrag">IndexSearcher.search(query,n)</span> method, which returns a
<a href="api/core/org/apache/lucene/search/TopDocs.html">TopDocs</a> holding at most <span class="codefrag">n</span> hits.
The results are printed in pages, sorted by score (i.e. relevance).
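
Selecting the best n hits and cutting them into pages can be sketched as follows. Lucene does the scoring and selection internally; the ToyTopDocs class, its Hit type and the page() helper are hypothetical names used only to illustrate "at most n hits, sorted by score, shown page by page".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Toy illustration of top-n selection and paging; not the Lucene TopDocs class. */
final class ToyTopDocs {
    /** A scored hit: the document's path and its relevance score. */
    static final class Hit {
        final String path;
        final float score;

        Hit(String path, float score) {
            this.path = path;
            this.score = score;
        }
    }

    /** Returns at most n hits (n >= 0), highest score first; the input is not modified. */
    static List<Hit> topN(List<Hit> hits, int n) {
        List<Hit> sorted = new ArrayList<Hit>(hits);
        Collections.sort(sorted, new Comparator<Hit>() {
            @Override
            public int compare(Hit left, Hit right) {
                return Float.compare(right.score, left.score); // descending
            }
        });
        return new ArrayList<Hit>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    /** Returns the zero-based page of the given size, or an empty list past the end. */
    static List<Hit> page(List<Hit> top, int pageIndex, int hitsPerPage) {
        int from = Math.min(pageIndex * hitsPerPage, top.size());
        int to = Math.min(from + hitsPerPage, top.size());
        return new ArrayList<Hit>(top.subList(from, to));
    }
}
```

Asking for a page beyond the last hit returns an empty list, which is a convenient signal to stop paging.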
<div class="clearboth"> </div>
<div class="copyright">
Copyright &copy; 2006 <a href="http://www.apache.org/licenses/">The Apache Software Foundation.</a>