<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta content="Apache Forrest" name="Generator">
<meta name="Forrest-version" content="0.9">
<meta name="Forrest-skin-name" content="lucene">
<title>Apache Lucene - Scoring</title>
<link type="text/css" href="skin/basic.css" rel="stylesheet">
<link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet">
<link media="print" type="text/css" href="skin/print.css" rel="stylesheet">
<link type="text/css" href="skin/profile.css" rel="stylesheet">
<script src="skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="skin/fontsize.js" language="javascript" type="text/javascript"></script>
<link rel="shortcut icon" href="images/favicon.ico">
<body onload="init()">
<script type="text/javascript">ndeSetTextSize();</script>
<a class="selected" href="index.html">Lucene 3.5 Documentation</a>
<h1>Apache Lucene - Scoring</h1>
<div id="minitoc-area">
<ul>
<li><a href="#Introduction">Introduction</a></li>
<li><a href="#Scoring">Scoring</a>
<ul>
<li><a href="#Fields and Documents">Fields and Documents</a></li>
<li><a href="#Score Boosting">Score Boosting</a></li>
<li><a href="#Understanding the Scoring Formula">Understanding the Scoring Formula</a></li>
<li><a href="#The Big Picture">The Big Picture</a></li>
<li><a href="#Query Classes">Query Classes</a></li>
<li><a href="#Changing Similarity">Changing Similarity</a></li>
</ul>
</li>
<li><a href="#Changing your Scoring -- Expert Level">Changing your Scoring -- Expert Level</a></li>
<li><a href="#Appendix">Appendix</a>
<ul>
<li><a href="#Algorithm">Algorithm</a></li>
</ul>
</li>
</ul>
</div>
<a name="N10013"></a><a name="Introduction"></a>
<h2 class="boxed">Introduction</h2>
<div class="section">
<p>Lucene scoring is the heart of why we all love Lucene. It is blazingly fast and it hides almost all of the complexity from the user.
In a nutshell, it works. At least, that is, until it doesn't work, or doesn't work as one would expect it to
work. Then we are left digging into Lucene internals or asking for help on java-user@lucene.apache.org to figure out why a document with five of our query terms
scores lower than a different document with only one of the query terms.</p>
<p>While this document won't answer your specific scoring issues, it will, hopefully, point you to the places that can
help you figure out the what and why of Lucene scoring.</p>
<p>Lucene scoring uses a combination of the
<a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model (VSM) of Information
Retrieval</a> and the <a href="http://en.wikipedia.org/wiki/Standard_Boolean_model">Boolean model</a>
to determine how relevant a given Document is to a user's query. In general, the idea behind the VSM is that the more
times a query term appears in a document relative to
the number of times the term appears in all the documents in the collection, the more relevant that
document is to the query. Lucene uses the Boolean model to first narrow down the documents that need to
be scored, based on the boolean logic in the Query specification. Lucene also adds some
capabilities and refinements onto this model to support boolean and fuzzy searching, but it
essentially remains a VSM-based system at heart.
For some valuable references on VSM and IR in general, refer to the
<a href="http://wiki.apache.org/lucene-java/InformationRetrieval">Lucene Wiki IR references</a>.
</p>
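<p>The VSM intuition can be made concrete with a small sketch. The <code>tf</code> and <code>idf</code> functions below follow the defaults documented for DefaultSimilarity (square root of the term frequency, and 1 + ln(numDocs / (docFreq + 1))). The class name <code>TfIdfSketch</code>, the <code>weight</code> helper and the sample numbers are hypothetical, and the real score includes further factors such as boosts and norms.</p>

```java
final class TfIdfSketch {
    // Term-frequency factor: grows with the number of occurrences, but sub-linearly.
    static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    // Inverse-document-frequency factor: rare terms weigh more than common ones.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Simplified weight of one term in one document (hypothetical helper;
    // boosts, norms, coord and queryNorm are deliberately left out).
    static float weight(int freq, int docFreq, int numDocs) {
        return tf(freq) * idf(docFreq, numDocs);
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        // Same in-document frequency, but the first term is rare and the second is common.
        System.out.println("rare term:   " + weight(3, 9, numDocs));
        System.out.println("common term: " + weight(3, 899, numDocs));
    }
}
```

<p>With equal in-document frequency, the term that occurs in fewer documents of the collection contributes the larger weight.</p>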
<p>The rest of this document will cover <a href="#Scoring">Scoring</a> basics and how to change your
<a href="api/core/org/apache/lucene/search/Similarity.html">Similarity</a>. Next it will cover ways you can
customize the Lucene internals in <a href="#Changing your Scoring -- Expert Level">Changing your Scoring
-- Expert Level</a>, which gives details on implementing your own
<a href="api/core/org/apache/lucene/search/Query.html">Query</a> class and related functionality. Finally, we
will finish up with some reference material in the <a href="#Appendix">Appendix</a>.
</p>
</div>
<a name="N10045"></a><a name="Scoring"></a>
<h2 class="boxed">Scoring</h2>
<div class="section">
<p>Scoring is very much dependent on the way documents are indexed,
so it is important to understand indexing (see the
<a href="gettingstarted.html">Apache Lucene - Getting Started Guide</a>
and the Lucene
<a href="fileformats.html">file formats</a>
before continuing on with this section). It is also assumed that readers know how to use the
<a href="api/core/org/apache/lucene/search/Searcher.html#explain(Query query, int doc)">Searcher.explain(Query query, int doc)</a> functionality,
which can go a long way in informing why a score is returned.
</p>
<a name="N10059"></a><a name="Fields and Documents"></a>
<h3 class="boxed">Fields and Documents</h3>
<p>In Lucene, the objects we are scoring are
<a href="api/core/org/apache/lucene/document/Document.html">Documents</a>. A Document is a collection
of
<a href="api/core/org/apache/lucene/document/Field.html">Fields</a>. Each Field has semantics about how
it is created and stored (for example, tokenized, untokenized, raw data or compressed). It is important to
note that Lucene scoring works on Fields and then combines the results to return Documents. This is
important because two Documents with the exact same content, but one having the content in two Fields
and the other in one Field, will return different scores for the same query due to length normalization
(assuming the
<a href="api/core/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a>
on the Fields).
</p>
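<p>A minimal sketch of why the field layout matters, assuming length normalization of the form 1 / sqrt(number of terms in the field), which is what DefaultSimilarity documents. The class name <code>LengthNormSketch</code> and the term counts are made up for illustration, and boosts are ignored.</p>

```java
final class LengthNormSketch {
    // Length normalization in the style of DefaultSimilarity:
    // 1 / sqrt(number of terms in the field).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // Document A: all 8 terms live in a single field.
        float oneField = lengthNorm(8);
        // Document B: the same 8 terms are split evenly across two fields,
        // so a match is normalized against only the 4 terms of its own field.
        float splitField = lengthNorm(4);
        System.out.println("norm for the single 8-term field: " + oneField);
        System.out.println("norm for one of two 4-term fields: " + splitField);
    }
}
```

<p>Because the norm is computed per Field, the same matching term is weighted differently in the two Documents even though their overall content is identical.</p>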
<a name="N1006E"></a><a name="Score Boosting"></a>
<h3 class="boxed">Score Boosting</h3>
<p>Lucene allows influencing search results by "boosting" at more than one level:</p>
<ul>
<li>
<b>Document level boosting</b>
- while indexing - by calling
<a href="api/core/org/apache/lucene/document/Document.html#setBoost(float)">document.setBoost()</a>
before a document is added to the index.
</li>
<li>
<b>Document's Field level boosting</b>
- while indexing - by calling
<a href="api/core/org/apache/lucene/document/Fieldable.html#setBoost(float)">field.setBoost()</a>
before adding a field to the document (and before adding the document to the index).
</li>
<li>
<b>Query level boosting</b>
- during search, by setting a boost on a query clause, calling
<a href="api/core/org/apache/lucene/search/Query.html#setBoost(float)">Query.setBoost()</a>.
</li>
</ul>
<p>Indexing time boosts are preprocessed for storage efficiency and written to
the directory (when writing the document) in a single byte (!) as follows:
For each field of a document, all boosts of that field
(i.e. all boosts under the same field name in that doc) are multiplied.
The result is multiplied by the boost of the document,
and also multiplied by a "field length norm" value
that represents the length of that field in that doc
(so shorter fields are automatically boosted up).
The result is encoded as a single byte
(with some precision loss of course) and stored in the directory.
The similarity object in effect at indexing time computes the length-norm of the field.
</p>
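<p>The multiplication described above can be sketched as follows. This is an illustration only: the class and method names are hypothetical, the length norm is assumed to be 1 / sqrt(numTerms) as in DefaultSimilarity, and the final single-byte encoding step is left out.</p>

```java
final class NormCompositionSketch {
    // Product of every boost set on the same-named fields of one document.
    static float combinedFieldBoost(float... fieldBoosts) {
        float product = 1.0f;
        for (float boost : fieldBoosts) {
            product *= boost;
        }
        return product;
    }

    // norm = document boost * combined field boost * length norm.
    // The length norm 1 / sqrt(numTerms) is an assumption of this sketch.
    static float norm(float docBoost, int numTerms, float... fieldBoosts) {
        float lengthNorm = (float) (1.0 / Math.sqrt(numTerms));
        return docBoost * combinedFieldBoost(fieldBoosts) * lengthNorm;
    }

    public static void main(String[] args) {
        // Two instances of one field name boosted 2.0 and 1.5,
        // an unboosted document, and 4 terms in the field overall.
        System.out.println(norm(1.0f, 4, 2.0f, 1.5f));
    }
}
```

<p>The float produced here is the value that is subsequently squeezed into one byte, which is where the precision loss comes from.</p>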
<p>This composition of the 1-byte representation of norms
(that is, the indexing time multiplication of field boosts, doc boost and field-length-norm)
is nicely described in
<a href="api/core/org/apache/lucene/document/Fieldable.html#setBoost(float)">Fieldable.setBoost()</a>.
</p>
<p>Encoding and decoding of the resulting float norm in a single byte are done by the
static methods of the class Similarity:
<a href="api/core/org/apache/lucene/search/Similarity.html#encodeNorm(float)">encodeNorm()</a> and
<a href="api/core/org/apache/lucene/search/Similarity.html#decodeNorm(byte)">decodeNorm()</a>.
Due to loss of precision, it is not guaranteed that decode(encode(x)) = x,
e.g. decode(encode(0.89)) = 0.75.
At scoring (search) time, this norm is brought into the score of the document
as <b>norm(t, d)</b>, as shown by the formula in
<a href="api/core/org/apache/lucene/search/Similarity.html">Similarity</a>.
</p>
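<p>To see why a round trip through a single byte cannot be exact, here is a sketch of a lossy float-to-byte packing that keeps only the high-order bits of the float. The class name <code>NormByteSketch</code> and the bit layout are assumptions made for illustration. This is not a copy of Similarity.encodeNorm(), and the values it produces need not match Lucene's.</p>

```java
final class NormByteSketch {
    // Offset that re-centres the float exponent so that the useful range
    // of norms fits into 8 bits (an assumed layout, chosen for this sketch).
    private static final int EXP_OFFSET = (63 - 15) << 3;

    // Pack a float into one byte by keeping only its high-order bits.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> 21; // drop the 21 low-order mantissa bits
        if (small <= EXP_OFFSET) {
            // zero and negative values map to 0, underflow maps to the smallest non-zero byte
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= EXP_OFFSET + 0x100) {
            return (byte) -1; // overflow maps to the largest representable value
        }
        return (byte) (small - EXP_OFFSET);
    }

    // Expand the byte back into a float; the dropped mantissa bits come back as zeros.
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << 21;
        bits += EXP_OFFSET << 21;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float original = 0.89f;
        float roundTripped = decode(encode(original));
        System.out.println(original + " -> " + roundTripped);
    }
}
```

<p>Any two floats that differ only in the discarded bits share one byte, so nearby norm values become indistinguishable at search time.</p>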
<a name="N100B1"></a><a name="Understanding the Scoring Formula"></a>
<h3 class="boxed">Understanding the Scoring Formula</h3>
<p>
The scoring formula is described in the
<a href="api/core/org/apache/lucene/search/Similarity.html">Similarity</a> class. Please take the time to study this formula, as it contains much of the information about how the
basics of Lucene scoring work, especially for the
<a href="api/core/org/apache/lucene/search/TermQuery.html">TermQuery</a>.
</p>
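<p>For orientation, the practical scoring function given in the Similarity javadocs has roughly the following shape. Here boost(t) stands for the search-time boost of term t, and norm(t, d) is the index-time norm discussed under Score Boosting. The javadocs remain the authoritative statement of each factor.</p>

```latex
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot
  \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^{2} \cdot
  \mathrm{boost}(t) \cdot \mathrm{norm}(t,d) \Big)
```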
<a name="N100C2"></a><a name="The Big Picture"></a>
<h3 class="boxed">The Big Picture</h3>
<p>OK, so the tf-idf formula and the
<a href="api/core/org/apache/lucene/search/Similarity.html">Similarity</a>
are great for understanding the basics of Lucene scoring, but what really drives Lucene scoring is
the use of, and interaction between, the
<a href="api/core/org/apache/lucene/search/Query.html">Query</a> classes, as created by each application in
response to a user's information need.
</p>
<p>In this regard, Lucene offers a wide variety of <a href="api/core/org/apache/lucene/search/Query.html">Query</a> implementations, most of which are in the
<a href="api/core/org/apache/lucene/search/package-summary.html">org.apache.lucene.search</a> package.
These implementations can be combined in a wide variety of ways to provide complex querying
capabilities along with
information about where matches took place in the document collection. The <a href="#Query Classes">Query</a>
section below
highlights some of the more important Query classes. For information on the other ones, see the
<a href="api/core/org/apache/lucene/search/package-summary.html">package summary</a>. For details on implementing
your own Query class, see <a href="#Changing your Scoring -- Expert Level">Changing your Scoring --
Expert Level</a> below.
</p>
<p>Once a Query has been created and submitted to the
<a href="api/core/org/apache/lucene/search/IndexSearcher.html">IndexSearcher</a>, the scoring process
begins. (See the <a href="#Appendix">Appendix</a> Algorithm section for more notes on the process.) After some infrastructure setup,
control finally passes to the <a href="api/core/org/apache/lucene/search/Weight.html">Weight</a> implementation and its
<a href="api/core/org/apache/lucene/search/Scorer.html">Scorer</a> instance. In the case of any type of
<a href="api/core/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a>, scoring is handled by the
<a href="http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/search/BooleanQuery.java?view=log">BooleanWeight2</a>
(link goes to ViewVC BooleanQuery java code, which contains the BooleanWeight2 inner class) or
<a href="http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/search/BooleanQuery.java?view=log">BooleanWeight</a>
(link goes to ViewVC BooleanQuery java code, which contains the BooleanWeight inner class).
</p>
<p>
Assuming the use of the BooleanWeight2, a
BooleanScorer2 is created by bringing together
all of the
<a href="api/core/org/apache/lucene/search/Scorer.html">Scorer</a>s from the sub-clauses of the BooleanQuery.
When the BooleanScorer2 is asked to score, it delegates its work to an internal Scorer based on the type
of clauses in the Query. This internal Scorer essentially loops over the sub-scorers and sums the scores
provided by each scorer while factoring in the coord() score.
<!-- Do we want to fill in the details of the counting sum scorer, disjunction scorer, etc.? -->
</p>
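<p>The "sum the sub-scores and factor in coord()" step can be sketched as below. This is a simplified illustration, not the BooleanScorer2 code: the class name <code>CoordSumSketch</code> and its method signatures are hypothetical, and coord is assumed to be the fraction of clauses that matched, as documented for DefaultSimilarity.</p>

```java
final class CoordSumSketch {
    // coord factor in the style of DefaultSimilarity:
    // the fraction of query clauses that matched the document.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // matchedScores holds the score of each sub-scorer that matched the document;
    // totalClauses is the number of scoring clauses in the query.
    static float score(float[] matchedScores, int totalClauses) {
        float sum = 0f;
        for (float subScore : matchedScores) {
            sum += subScore;
        }
        return sum * coord(matchedScores.length, totalClauses);
    }

    public static void main(String[] args) {
        // Two of four clauses matched, contributing 1.0 and 0.5.
        System.out.println(score(new float[] {1.0f, 0.5f}, 4));
    }
}
```

<p>The coord factor rewards documents that match more of the query's clauses, independently of how high the individual sub-scores are.</p>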
<a name="N10112"></a><a name="Query Classes"></a>
<h3 class="boxed">Query Classes</h3>
<p>For information on the Query classes, refer to the
<a href="api/core/org/apache/lucene/search/package-summary.html#query">search package javadocs</a>.
</p>
<a name="N1011F"></a><a name="Changing Similarity"></a>
<h3 class="boxed">Changing Similarity</h3>
<p>One of the ways of changing the scoring characteristics of Lucene is to change the similarity factors. For information on
how to do this, see the
<a href="api/core/org/apache/lucene/search/package-summary.html#changingSimilarity">search package javadocs</a>.
</p>
</div>
<a name="N1012C"></a><a name="Changing your Scoring -- Expert Level"></a>
<h2 class="boxed">Changing your Scoring -- Expert Level</h2>
<div class="section">
<p>At a much deeper level, you can affect scoring by implementing your own Query classes (and related scoring classes). To learn more
about how to do this, refer to the
<a href="api/core/org/apache/lucene/search/package-summary.html#scoring">search package javadocs</a>.
</p>
</div>
<a name="N10139"></a><a name="Appendix"></a>
<h2 class="boxed">Appendix</h2>
<div class="section">
<a name="N1013E"></a><a name="Algorithm"></a>
<h3 class="boxed">Algorithm</h3>
<p>This section is mostly notes on stepping through the Scoring process and serves as
fertilizer for the earlier sections.</p>
<p>In the typical search application, a
<a href="api/core/org/apache/lucene/search/Query.html">Query</a>
is passed to the
<a href="api/core/org/apache/lucene/search/Searcher.html">Searcher</a>,
beginning the scoring process.
</p>
<p>Once inside the Searcher, a
<a href="api/core/org/apache/lucene/search/Collector.html">Collector</a>
is used for the scoring and sorting of the search results.
These important objects are involved in a search:
</p>
<ol>
<li>The
<a href="api/core/org/apache/lucene/search/Weight.html">Weight</a>
object of the Query. The Weight object is an internal representation of the Query that
allows the Query to be reused by the Searcher.
</li>
<li>The Searcher that initiated the call.</li>
<li>A
<a href="api/core/org/apache/lucene/search/Filter.html">Filter</a>
for limiting the result set. Note, the Filter may be null.
</li>
<li>A
<a href="api/core/org/apache/lucene/search/Sort.html">Sort</a>
object for specifying how to sort the results if the standard score based sort method is not
desired.
</li>
</ol>
<p>Assuming we are not sorting (since sorting doesn't
affect the raw Lucene score),
we call one of the search methods of the Searcher, passing in the
<a href="api/core/org/apache/lucene/search/Weight.html">Weight</a>
object created by Searcher.createWeight(Query), the
<a href="api/core/org/apache/lucene/search/Filter.html">Filter</a>
and the number of results we want. This method
returns a
<a href="api/core/org/apache/lucene/search/TopDocs.html">TopDocs</a>
object, which is an internal collection of search results.
The Searcher creates a
<a href="api/core/org/apache/lucene/search/TopScoreDocCollector.html">TopScoreDocCollector</a>
and passes it along with the Weight and Filter to another expert search method (for more on the
<a href="api/core/org/apache/lucene/search/Collector.html">Collector</a>
mechanism, see
<a href="api/core/org/apache/lucene/search/Searcher.html">Searcher</a>).
The TopScoreDocCollector uses a
<a href="api/core/org/apache/lucene/util/PriorityQueue.html">PriorityQueue</a>
to collect the top results for the search.
</p>
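<p>The idea of collecting only the top results with a priority queue can be sketched as follows. The sketch uses <code>java.util.PriorityQueue</code> as a stand-in for Lucene's own PriorityQueue class, and <code>TopKSketch</code>, <code>Hit</code>, <code>collect</code> and <code>topHits</code> are hypothetical names, not the TopScoreDocCollector API.</p>

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

final class TopKSketch {
    // A (doc id, score) pair; a hypothetical stand-in for a search hit.
    static final class Hit implements Comparable<Hit> {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }

        // Lower score sorts first; on equal scores the higher doc id sorts first,
        // so it is the one evicted when the queue overflows.
        public int compareTo(Hit other) {
            int byScore = Float.compare(score, other.score);
            return byScore != 0 ? byScore : Integer.compare(other.doc, doc);
        }
    }

    private final int k;
    // Min-heap: the weakest of the retained hits is always at the head.
    private final PriorityQueue<Hit> heap = new PriorityQueue<Hit>();

    TopKSketch(int k) {
        this.k = k;
    }

    // Called once per matching document, in the spirit of a collector callback.
    void collect(int doc, float score) {
        heap.add(new Hit(doc, score));
        if (heap.size() > k) {
            heap.poll(); // drop the current weakest hit
        }
    }

    // Retained hits, best first.
    List<Hit> topHits() {
        List<Hit> hits = new ArrayList<Hit>(heap);
        Collections.sort(hits, Collections.<Hit>reverseOrder());
        return hits;
    }

    public static void main(String[] args) {
        TopKSketch top = new TopKSketch(2);
        top.collect(0, 0.3f);
        top.collect(1, 0.9f);
        top.collect(2, 0.1f);
        top.collect(3, 0.5f);
        for (Hit hit : top.topHits()) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
    }
}
```

<p>Keeping at most k hits in the queue means memory stays bounded by the number of requested results rather than by the number of matching documents.</p>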
<p>If a Filter is being used, some initial setup is done to determine which docs to include. Otherwise,
we ask the Weight for
a
<a href="api/core/org/apache/lucene/search/Scorer.html">Scorer</a>
for the
<a href="api/core/org/apache/lucene/index/IndexReader.html">IndexReader</a>
of the current searcher and we proceed by
calling the score method on the
<a href="api/core/org/apache/lucene/search/Scorer.html">Scorer</a>.
</p>
<p>At last, we are actually going to score some documents. The score method takes in the Collector
(most likely the TopScoreDocCollector or TopFieldCollector) and does its business.
Of course, here is where things get involved. The
<a href="api/core/org/apache/lucene/search/Scorer.html">Scorer</a>
that is returned by the
<a href="api/core/org/apache/lucene/search/Weight.html">Weight</a>
object depends on what type of Query was submitted. In most real world applications with multiple
query terms, the
<a href="api/core/org/apache/lucene/search/Scorer.html">Scorer</a>
is going to be a
<a href="http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/search/BooleanScorer2.java?view=log">BooleanScorer2</a>
(see the section on customizing your scoring for info on changing this).
</p>
<p>Assuming a BooleanScorer2 scorer, we first initialize the Coordinator, which is used to apply the
coord() factor. We then
get an internal Scorer based on the required, optional and prohibited parts of the query.
Using this internal Scorer, the BooleanScorer2 then proceeds
into a while loop based on the Scorer#next() method. The next() method advances to the next document
matching the query. This is an
abstract method in the Scorer class and is thus overridden by all derived
implementations. <!-- DOUBLE CHECK THIS -->If you have a simple OR query,
your internal Scorer is most likely a DisjunctionSumScorer, which essentially combines the scores
from the sub-scorers of the OR'd terms.</p>
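<p>The loop described above, advancing to the next matching document and summing what the sub-scorers report for it, can be sketched for a plain OR query as follows. This is an illustration under simplifying assumptions, not the DisjunctionSumScorer code: each sub-scorer is modelled as a sorted array of doc ids with a parallel array of scores, and <code>DisjunctionSketch</code> and <code>scoreAll</code> are hypothetical names.</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DisjunctionSketch {
    // docs[i] holds the ascending doc ids matched by sub-scorer i;
    // scores[i][j] is that sub-scorer's score for docs[i][j].
    // Returns the summed score per matching document, in doc id order.
    static Map<Integer, Float> scoreAll(int[][] docs, float[][] scores) {
        Map<Integer, Float> result = new LinkedHashMap<Integer, Float>();
        int[] pos = new int[docs.length]; // cursor into each sub-scorer's list
        while (true) {
            // "next()": the smallest doc id that some sub-scorer has not yet consumed.
            int next = Integer.MAX_VALUE;
            for (int i = 0; i < docs.length; i++) {
                if (pos[i] < docs[i].length && docs[i][pos[i]] < next) {
                    next = docs[i][pos[i]];
                }
            }
            if (next == Integer.MAX_VALUE) {
                break; // every sub-scorer is exhausted
            }
            // Sum the scores of all sub-scorers positioned on this document.
            float sum = 0f;
            for (int i = 0; i < docs.length; i++) {
                if (pos[i] < docs[i].length && docs[i][pos[i]] == next) {
                    sum += scores[i][pos[i]];
                    pos[i]++; // move this sub-scorer past the current document
                }
            }
            result.put(next, sum);
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] docs = {{1, 3}, {3, 5}};
        float[][] scores = {{1.0f, 2.0f}, {0.5f, 0.25f}};
        System.out.println(scoreAll(docs, scores));
    }
}
```

<p>A document matched by several OR'd terms accumulates the contribution of each of them, while a document matched by a single term receives only that term's score.</p>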
</div>
<div class="clearboth"> </div>
<div class="copyright">
2006 <a href="http://www.apache.org/licenses/">The Apache Software Foundation.</a>
</div>