
work. Then we are left digging into Lucene internals or asking for help on java-user@lucene.apache.org to figure out why a document with five of our query terms

scores lower than a different document with only one of the query terms. </p>

<p>While this document won't answer your specific scoring issues, it will, hopefully, point you to the places that can

help you figure out the what and why of Lucene scoring.</p>

<p>Lucene scoring uses a combination of the

<a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model (VSM) of Information

Retrieval</a> and the <a href="http://en.wikipedia.org/wiki/Standard_Boolean_model">Boolean model</a>

to determine

how relevant a given Document is to a User's query. In general, the idea behind the VSM is the more

times a query term appears in a document relative to

the number of times the term appears in all the documents in the collection, the more relevant that

document is to the query. It uses the Boolean model to first narrow down the documents that need to

be scored based on the use of boolean logic in the Query specification. Lucene also adds some

capabilities and refinements onto this model to support boolean and fuzzy searching, but it

essentially remains a VSM based system at the heart.

For some valuable references on VSM and IR in general refer to the

<a href="http://wiki.apache.org/lucene-java/InformationRetrieval">Lucene Wiki IR references</a>.

</p>

<p>The rest of this document will cover <a href="#Scoring">Scoring</a> basics and how to change your

<a href="api/core/org/apache/lucene/search/Similarity.html">Similarity</a>. Next it will cover ways you can

customize the Lucene internals in <a href="#Changing your Scoring -- Expert Level">Changing your Scoring

-- Expert Level</a> which gives details on implementing your own

<a href="api/core/org/apache/lucene/search/Query.html">Query</a> class and related functionality. Finally, we

will finish up with some reference material in the <a href="#Appendix">Appendix</a>.

</p>

</section>

<section id="Scoring"><title>Scoring</title>

<p>Scoring is very much dependent on the way documents are indexed,

so it is important to understand indexing (see

<a href="gettingstarted.html">Apache Lucene - Getting Started Guide</a>

and the Lucene

<a href="fileformats.html">file formats</a>

before continuing on with this section.) It is also assumed that readers know how to use the

<a href="api/core/org/apache/lucene/search/Searcher.html#explain(Query query, int doc)">Searcher.explain(Query query, int doc)</a> functionality,

which can go a long way in informing why a score is returned.

</p>

<section id="Fields and Documents"><title>Fields and Documents</title>

<p>In Lucene, the objects we are scoring are

<a href="api/core/org/apache/lucene/document/Document.html">Documents</a>. A Document is a collection of

<a href="api/core/org/apache/lucene/document/Field.html">Fields</a>. Each Field has semantics about how

it is created and stored (i.e. tokenized, untokenized, raw data, compressed, etc.) It is important to

note that Lucene scoring works on Fields and then combines the results to return Documents. This is

important because two Documents with the exact same content, but one having the content in two Fields

and the other in one Field will return different scores for the same query due to length normalization

(assuming the

<a href="api/core/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a>

on the Fields).

</p>
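<p>To make the length normalization effect concrete, here is a minimal sketch. It assumes the commonly documented DefaultSimilarity behaviour of a length norm equal to 1/sqrt(number of terms in the field); the class and method names are ours, chosen for illustration, and are not part of Lucene.</p>

```java
// Illustrative sketch only, not Lucene code. Assumes lengthNorm = 1/sqrt(numTerms).
final class LengthNormSketch {

    /** Length norm for a field that holds numTerms tokens (assumed formula). */
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        // Document A: 100 tokens indexed into a single field.
        double oneField = lengthNorm(100);
        // Document B: the same 100 tokens split evenly across two fields.
        double eachOfTwoFields = lengthNorm(50);

        System.out.println("one 100-term field:      " + oneField);
        System.out.println("each of two 50-term fields: " + eachOfTwoFields);
        // A match in one of the shorter fields is weighted more heavily,
        // so the two documents can score differently for the same query.
    }
}
```

<p>The same text therefore carries a different norm depending on how it is split across Fields, which is one reason identical content can produce different scores.</p>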

</section>

<section id="Score Boosting"><title>Score Boosting</title>

<p>Lucene allows influencing search results by "boosting" at more than one level:

<ul>

<li><b>Document level boosting</b>

- while indexing - by calling

<a href="api/core/org/apache/lucene/document/Document.html#setBoost(float)">document.setBoost()</a>

before a document is added to the index.

</li>

<li><b>Document's Field level boosting</b>

- while indexing - by calling

<a href="api/core/org/apache/lucene/document/Fieldable.html#setBoost(float)">field.setBoost()</a>

before adding a field to the document (and before adding the document to the index).

</li>

<li><b>Query level boosting</b>

- during search, by setting a boost on a query clause, calling

<a href="api/core/org/apache/lucene/search/Query.html#setBoost(float)">Query.setBoost()</a>.

</li>

</ul>

</p>

<p>Indexing time boosts are preprocessed for storage efficiency and written to

the directory (when writing the document) in a single byte (!) as follows:

For each field of a document, all boosts of that field

(i.e. all boosts under the same field name in that doc) are multiplied.

The result is multiplied by the boost of the document,

and also multiplied by a "field length norm" value

that represents the length of that field in that doc

(so shorter fields are automatically boosted up).

The result is encoded as a single byte

(with some precision loss of course) and stored in the directory.

The similarity object in effect at indexing computes the length-norm of the field.

</p>
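<p>The multiplication described above can be sketched as follows. This is a hedged illustration, not the indexing code itself: the method name <code>fieldNorm</code> and its parameters are hypothetical, and the length norm is assumed to be 1/sqrt(number of terms), as in the default similarity.</p>

```java
// Illustrative sketch only, not Lucene code.
final class IndexNormSketch {

    /**
     * Combines the document boost, every boost set under one field name,
     * and an assumed 1/sqrt(numTerms) length norm into a single float.
     */
    static float fieldNorm(float docBoost, float[] fieldBoosts, int numTerms) {
        float product = docBoost;
        for (float boost : fieldBoosts) {
            product *= boost;               // all boosts under the same field name
        }
        float lengthNorm = (float) (1.0 / Math.sqrt(numTerms));
        return product * lengthNorm;        // this value is what gets squeezed into one byte
    }

    public static void main(String[] args) {
        // Document boost 2.0, two field instances boosted 1.5 and 2.0, 16 terms in total.
        float norm = fieldNorm(2.0f, new float[] {1.5f, 2.0f}, 16);
        System.out.println(norm); // 2.0 * 1.5 * 2.0 * 0.25 = 1.5
    }
}
```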

<p>This composition of 1-byte representation of norms
(that is, indexing time multiplication of field boosts &amp; doc boost &amp; field-length-norm)
is nicely described in
<a href="api/core/org/apache/lucene/document/Fieldable.html#setBoost(float)">Fieldable.setBoost()</a>.
</p>

<p>Encoding and decoding of the resulting float norm in a single byte are done by the
static methods of the class Similarity:
<a href="api/core/org/apache/lucene/search/Similarity.html#encodeNorm(float)">encodeNorm()</a> and
<a href="api/core/org/apache/lucene/search/Similarity.html#decodeNorm(byte)">decodeNorm()</a>.
Due to loss of precision, it is not guaranteed that decode(encode(x)) = x,
e.g. decode(encode(0.89)) = 0.875.
At scoring (search) time, this norm is brought into the score of the document
as <b>norm(t, d)</b>, as shown by the formula in
<a href="api/core/org/apache/lucene/search/Similarity.html">Similarity</a>.
</p>
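<p>For readers who want to see the precision loss for themselves, the following is a small stand-alone sketch of a one-byte float encoding. The bit layout is modelled on our understanding of the scheme Lucene uses for norms, so treat the constants as assumptions; the class <code>NormByteSketch</code> and its methods are hypothetical names, not the Similarity API.</p>

```java
// Illustrative sketch only, not Lucene code. The constants are assumptions.
final class NormByteSketch {

    private static final int ZERO = (63 - 15) << 3; // offset of the smallest representable exponent

    /** Squeezes a non-negative float into one byte by keeping only its top bits. */
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> 21;                 // drop the 21 low-order mantissa bits
        if (small <= ZERO) {
            // zero and negatives map to 0; positive underflow maps to the smallest non-zero byte
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ZERO + 0x100) {
            return (byte) -1;                   // overflow maps to the largest byte
        }
        return (byte) (small - ZERO);
    }

    /** Expands a byte produced by encode() back into a float. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << 21;
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float norm = (float) (1.0 / Math.sqrt(10)); // length norm of a 10-term field, about 0.316
        System.out.println(norm + " -> " + decode(encode(norm))); // decodes to 0.3125
        System.out.println(1.0f + " -> " + decode(encode(1.0f)));  // 1.0 survives exactly
    }
}
```

<p>Because only a handful of mantissa bits survive, many nearby norms collapse onto the same byte, which is why small differences in field length often have no effect on the score.</p>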


</section>


<section id="Understanding the Scoring Formula"><title>Understanding the Scoring Formula</title>

<p>
This scoring formula is described in the
<a href="api/core/org/apache/lucene/search/Similarity.html">Similarity</a> class. Please take the time to study this formula, as it contains much of the information about how the
basics of Lucene scoring work, especially the
<a href="api/core/org/apache/lucene/search/TermQuery.html">TermQuery</a>.
</p>
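<p>As a rough companion to the formula, the sketch below computes the per-term part of a score under the assumptions usually documented for the default similarity: tf is the square root of the term frequency and idf is 1 + ln(numDocs / (docFreq + 1)). The coord and queryNorm factors are deliberately left out, and all names are hypothetical rather than Lucene API.</p>

```java
// Illustrative sketch only, not Lucene code. Formulas are assumed defaults.
final class TfIdfSketch {

    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    /** Contribution of one term: tf * idf^2 * boost * norm. */
    static double termScore(double freq, int docFreq, int numDocs, double boost, double norm) {
        double idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf * boost * norm;
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        // A rare term (in 1 document) occurring once in a short field (norm 0.5).
        double rare = termScore(1, 1, numDocs, 1.0, 0.5);
        // A common term (in 499 documents) occurring four times in a long field (norm 0.125).
        double common = termScore(4, 499, numDocs, 1.0, 0.125);
        System.out.println("rare term:   " + rare);
        System.out.println("common term: " + common);
    }
}
```

<p>With these inputs the single rare term contributes far more than the repeated common one, which hints at how a document matching one query term can outscore a document matching several.</p>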


</section>


<section id="The Big Picture"><title>The Big Picture</title>

<p>OK, so the tf-idf formula and the
<a href="api/core/org/apache/lucene/search/Similarity.html">Similarity</a>
are great for understanding the basics of Lucene scoring, but what really drives Lucene scoring is
the use of, and interactions between, the
<a href="api/core/org/apache/lucene/search/Query.html">Query</a> classes, as created by each application in
response to a user's information need.
</p>

<p>In this regard, Lucene offers a wide variety of <a href="api/core/org/apache/lucene/search/Query.html">Query</a> implementations, most of which are in the
<a href="api/core/org/apache/lucene/search/package-summary.html">org.apache.lucene.search</a> package.
These implementations can be combined in a wide variety of ways to provide complex querying
capabilities along with
information about where matches took place in the document collection. The <a href="#Query Classes">Query</a>
section below
highlights some of the more important Query classes. For information on the other ones, see the
<a href="api/core/org/apache/lucene/search/package-summary.html">package summary</a>. For details on implementing
your own Query class, see <a href="#Changing your Scoring -- Expert Level">Changing your Scoring --
Expert Level</a> below.
</p>

<p>Once a Query has been created and submitted to the
<a href="api/core/org/apache/lucene/search/IndexSearcher.html">IndexSearcher</a>, the scoring process
begins. (See the <a href="#Appendix">Appendix</a> Algorithm section for more notes on the process.) After some infrastructure setup,
control finally passes to the <a href="api/core/org/apache/lucene/search/Weight.html">Weight</a> implementation and its
<a href="api/core/org/apache/lucene/search/Scorer.html">Scorer</a> instance. In the case of any type of
<a href="api/core/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a>, scoring is handled by the
<a href="http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/search/BooleanQuery.java?view=log">BooleanWeight2</a>
(link goes to ViewVC BooleanQuery java code which contains the BooleanWeight2 inner class) or
<a href="http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/search/BooleanQuery.java?view=log">BooleanWeight</a>
(link goes to ViewVC BooleanQuery java code, which contains the BooleanWeight inner class).
</p>

<p>
Assuming the use of the BooleanWeight2, a
BooleanScorer2 is created by bringing together
all of the
<a href="api/core/org/apache/lucene/search/Scorer.html">Scorer</a>s from the sub-clauses of the BooleanQuery.
When the BooleanScorer2 is asked to score it delegates its work to an internal Scorer based on the type
of clauses in the Query. This internal Scorer essentially loops over the sub scorers and sums the scores
provided by each scorer while factoring in the coord() score.
</p>
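<p>The "sum the sub scores, then factor in coord()" step can be pictured with the following sketch. It assumes the default coord() of matched clauses divided by total clauses; <code>CoordSumSketch</code> and its methods are hypothetical names for illustration and do not reproduce BooleanScorer2.</p>

```java
// Illustrative sketch only, not Lucene code. Assumes coord = overlap / maxOverlap.
final class CoordSumSketch {

    /** Fraction of the query's clauses that matched the document. */
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    /** Sums the scores of the clauses that matched and scales the sum by coord(). */
    static float score(float[] matchedClauseScores, int totalClauses) {
        float sum = 0f;
        for (float s : matchedClauseScores) {
            sum += s;
        }
        return sum * coord(matchedClauseScores.length, totalClauses);
    }

    public static void main(String[] args) {
        // A four-clause query where only two clauses matched this document.
        float result = score(new float[] {0.5f, 0.25f}, 4);
        System.out.println(result); // (0.5 + 0.25) * 2/4 = 0.375
    }
}
```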


</section>


<section id="Query Classes"><title>Query Classes</title>

<p>For information on the Query Classes, refer to the
<a href="api/core/org/apache/lucene/search/package-summary.html#query">search package javadocs</a>.
</p>


</section>


<section id="Changing Similarity"><title>Changing Similarity</title>

<p>One of the ways of changing the scoring characteristics of Lucene is to change the similarity factors. For information on
how to do this, see the
<a href="api/core/org/apache/lucene/search/package-summary.html#changingSimilarity">search package javadocs</a>.</p>


</section>


</section>


<section id="Changing your Scoring -- Expert Level"><title>Changing your Scoring -- Expert Level</title>

<p>At a much deeper level, one can affect scoring by implementing one's own Query classes (and related scoring classes). To learn more
about how to do this, refer to the
<a href="api/core/org/apache/lucene/search/package-summary.html#scoring">search package javadocs</a>.
</p>


</section>


<section id="Appendix"><title>Appendix</title>


<section id="Algorithm"><title>Algorithm</title>

<p>This section is mostly notes on stepping through the Scoring process and serves as
fertilizer for the earlier sections.</p>

<p>In the typical search application, a
<a href="api/core/org/apache/lucene/search/Query.html">Query</a>
is passed to the
<a href="api/core/org/apache/lucene/search/Searcher.html">Searcher</a>,
beginning the scoring process.
</p>

<p>Once inside the Searcher, a
<a href="api/core/org/apache/lucene/search/Collector.html">Collector</a>
is used for the scoring and sorting of the search results.
These important objects are involved in a search:
<ol>
<li>The
<a href="api/core/org/apache/lucene/search/Weight.html">Weight</a>
object of the Query. The Weight object is an internal representation of the Query that
allows the Query to be reused by the Searcher.
</li>
<li>The Searcher that initiated the call.</li>
<li>A
<a href="api/core/org/apache/lucene/search/Filter.html">Filter</a>
for limiting the result set. Note, the Filter may be null.
</li>
<li>A
<a href="api/core/org/apache/lucene/search/Sort.html">Sort</a>
object for specifying how to sort the results if the standard score based sort method is not
desired.
</li>
</ol>
</p>

<p>Assuming we are not sorting (since sorting doesn't
affect the raw Lucene score),
we call one of the search methods of the Searcher, passing in the
<a href="api/core/org/apache/lucene/search/Weight.html">Weight</a>
object created by Searcher.createWeight(Query),
<a href="api/core/org/apache/lucene/search/Filter.html">Filter</a>
and the number of results we want. This method
returns a
<a href="api/core/org/apache/lucene/search/TopDocs.html">TopDocs</a>
object, which is an internal collection of search results.
The Searcher creates a
<a href="api/core/org/apache/lucene/search/TopScoreDocCollector.html">TopScoreDocCollector</a>
and passes it along with the Weight and Filter to another expert search method (for more on the
<a href="api/core/org/apache/lucene/search/Collector.html">Collector</a>
mechanism, see
<a href="api/core/org/apache/lucene/search/Searcher.html">Searcher</a>).
The TopScoreDocCollector uses a
<a href="api/core/org/apache/lucene/util/PriorityQueue.html">PriorityQueue</a>
to collect the top results for the search.
</p>
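<p>The idea of collecting the top results with a priority queue can be sketched in plain Java. Note the assumptions: this uses <code>java.util.PriorityQueue</code> rather than Lucene's own PriorityQueue class, the <code>Hit</code> type is a hypothetical stand-in for a scored document, and tie-breaking between equal scores is ignored.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch only, not Lucene code.
final class TopNSketch {

    /** Hypothetical (doc, score) pair standing in for a scored search hit. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /** Keeps only the n best-scoring hits, returned best first. */
    static List<Hit> topN(int[] docs, float[] scores, int n) {
        List<Hit> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Min-heap on score: the head is always the weakest hit currently kept.
        PriorityQueue<Hit> heap = new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
        for (int i = 0; i < docs.length; i++) {
            if (heap.size() < n) {
                heap.add(new Hit(docs[i], scores[i]));
            } else if (scores[i] > heap.peek().score) {
                heap.poll();                            // evict the weakest hit
                heap.add(new Hit(docs[i], scores[i]));
            }
        }
        result.addAll(heap);
        result.sort((a, b) -> Float.compare(b.score, a.score)); // best first
        return result;
    }

    public static void main(String[] args) {
        int[] docs = {0, 1, 2, 3};
        float[] scores = {0.2f, 0.9f, 0.5f, 0.7f};
        for (Hit hit : topN(docs, scores, 2)) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
    }
}
```

<p>The heap never holds more than n entries, so memory stays bounded no matter how many documents match.</p>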

<p>If a Filter is being used, some initial setup is done to determine which docs to include. Otherwise,
we ask the Weight for a
<a href="api/core/org/apache/lucene/search/Scorer.html">Scorer</a>
for the
<a href="api/core/org/apache/lucene/index/IndexReader.html">IndexReader</a>
of the current searcher and we proceed by
calling the score method on the
<a href="api/core/org/apache/lucene/search/Scorer.html">Scorer</a>.
</p>

<p>At last, we are actually going to score some documents. The score method takes in the Collector
(most likely the TopScoreDocCollector or TopFieldCollector) and does its business.
Of course, here is where things get involved. The
<a href="api/core/org/apache/lucene/search/Scorer.html">Scorer</a>
that is returned by the
<a href="api/core/org/apache/lucene/search/Weight.html">Weight</a>
object depends on what type of Query was submitted. In most real world applications with multiple
query terms,
the
<a href="api/core/org/apache/lucene/search/Scorer.html">Scorer</a>
is going to be a
<a href="http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/search/BooleanScorer2.java?view=log">BooleanScorer2</a>
(see the section on customizing your scoring for info on changing this).
</p>

<p>Assuming a BooleanScorer2 scorer, we first initialize the Coordinator, which is used to apply the
coord() factor. We then
get an internal Scorer based on the required, optional and prohibited parts of the query.
Using this internal Scorer, the BooleanScorer2 then proceeds
into a while loop based on the Scorer#next() method. The next() method advances to the next document
matching the query. This is an
abstract method in the Scorer class and is thus overridden by all derived
implementations. If you have a simple OR query,
your internal Scorer is most likely a DisjunctionSumScorer, which essentially combines the scores
from the sub scorers of the OR'd terms.</p>
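<p>The loop described above can be approximated by the following sketch of an OR over several sub scorers. It is a simplification under stated assumptions: each hypothetical <code>SubScorer</code> is just a pair of arrays with ascending doc IDs, and the smallest current doc is found by a linear scan where a real implementation would more likely use a heap.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only, not Lucene code.
final class DisjunctionSketch {

    /** Hypothetical sub scorer: ascending doc IDs with one score per doc. */
    static final class SubScorer {
        final int[] docs;
        final float[] scores;
        int pos = 0;

        SubScorer(int[] docs, float[] scores) {
            this.docs = docs;
            this.scores = scores;
        }

        boolean exhausted() {
            return pos >= docs.length;
        }
    }

    /** One matched document: its ID, summed score, and number of matching sub scorers. */
    static final class Match {
        final int doc;
        final float score;
        final int matches;

        Match(int doc, float score, int matches) {
            this.doc = doc;
            this.score = score;
            this.matches = matches;
        }
    }

    /** Walks all sub scorers in doc ID order, summing the scores of those on the same doc. */
    static List<Match> scoreAll(List<SubScorer> subs) {
        List<Match> out = new ArrayList<>();
        while (true) {
            // Find the smallest doc ID any sub scorer is currently positioned on.
            int minDoc = Integer.MAX_VALUE;
            for (SubScorer s : subs) {
                if (!s.exhausted() && s.docs[s.pos] < minDoc) {
                    minDoc = s.docs[s.pos];
                }
            }
            if (minDoc == Integer.MAX_VALUE) {
                break; // every sub scorer is exhausted
            }
            float sum = 0f;
            int matches = 0;
            for (SubScorer s : subs) {
                if (!s.exhausted() && s.docs[s.pos] == minDoc) {
                    sum += s.scores[s.pos];
                    matches++;
                    s.pos++; // advance this sub scorer to its next doc
                }
            }
            out.add(new Match(minDoc, sum, matches));
        }
        return out;
    }

    public static void main(String[] args) {
        List<SubScorer> subs = new ArrayList<>();
        subs.add(new SubScorer(new int[] {1, 3}, new float[] {0.5f, 0.25f}));
        subs.add(new SubScorer(new int[] {3, 7}, new float[] {0.5f, 1.0f}));
        for (Match m : scoreAll(subs)) {
            System.out.println("doc=" + m.doc + " score=" + m.score + " matches=" + m.matches);
        }
    }
}
```

<p>The match count gathered per document is the kind of number a coord() factor would be computed from.</p>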


</section>


</section>


</body>


</document>
