<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
Apache Lucene ICU integration module
This module exposes functionality from
<a href="http://site.icu-project.org/">ICU</a> to Apache Lucene. ICU4J is a Java
library that enhances Java's internationalization support by improving
performance, keeping current with the Unicode Standard, and providing richer
APIs. This module exposes the following functionality:
<ul>
  <li><a href="#segmentation">Text Segmentation</a>: Tokenizes text based on
  properties and rules defined in Unicode.</li>
  <li><a href="#collation">Collation</a>: Compares strings according to the
  conventions and standards of a particular language, region or country.</li>
  <li><a href="#normalization">Normalization</a>: Converts text to a unique,
  standard form.</li>
  <li><a href="#casefolding">Case Folding</a>: Removes case distinctions with
  Unicode's Default Caseless Matching algorithm.</li>
  <li><a href="#searchfolding">Search Term Folding</a>: Removes distinctions
  (such as accent marks) between similar characters for a loose or fuzzy search.</li>
  <li><a href="#transform">Text Transformation</a>: Transforms Unicode text in
  a context-sensitive fashion: e.g. mapping Traditional to Simplified Chinese.</li>
</ul>
<h1><a name="segmentation">Text Segmentation</a></h1>
<p>
Text Segmentation (Tokenization) divides document and query text into index terms
(typically words). Unicode provides special properties and rules so that this can
be done in a manner that works well with most languages.
</p>
<p>
Text Segmentation implements the word segmentation specified in
<a href="http://unicode.org/reports/tr29/">Unicode Text Segmentation</a>.
Additionally, the algorithm can be tailored based on writing system; for example,
text in the Thai script is automatically delegated to a dictionary-based segmentation
algorithm.
</p>
<h2>Use Cases</h2>
<ul>
  <li>As a more thorough replacement for StandardTokenizer that works well for
  most languages.</li>
</ul>
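<p>
The word-boundary rules behind this segmentation are specified in UAX #29, which the
JDK's <code>java.text.BreakIterator</code> also implements. A rough, Lucene-free sketch
of the idea can therefore be written with the standard library alone; this is an
illustration of the boundary rules, not the ICUTokenizer implementation:
</p>

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordSegmentationSketch {
  // Split text on Unicode word boundaries, keeping only segments that
  // contain at least one letter or digit (i.e. skip spaces and punctuation).
  public static List<String> words(String text, Locale locale) {
    BreakIterator it = BreakIterator.getWordInstance(locale);
    it.setText(text);
    List<String> result = new ArrayList<>();
    int start = it.first();
    for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
      String candidate = text.substring(start, end);
      if (candidate.codePoints().anyMatch(Character::isLetterOrDigit)) {
        result.add(candidate);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(words("Testing, 1 2 3.", Locale.ROOT)); // [Testing, 1, 2, 3]
  }
}
```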
<h2>Example Usages</h2>
<h3>Tokenizing multilanguage text</h3>
<pre class="prettyprint">
  /**
   * This tokenizer will work well in general for most languages.
   */
  Tokenizer tokenizer = new ICUTokenizer(reader);
</pre>
<h1><a name="collation">Collation</a></h1>
<p>
<code>ICUCollationKeyFilter</code>
converts each token into its binary <code>CollationKey</code> using the
provided <code>Collator</code>, and then encodes the <code>CollationKey</code> with
{@link org.apache.lucene.util.IndexableBinaryStringTools}, to allow it to be
stored as an index term.
</p>
<p>
<code>ICUCollationKeyFilter</code> depends on ICU4J 4.4 to produce the
<code>CollationKey</code>s. <code>icu4j-4.4.jar</code>
is included in Lucene's Subversion repository at <code>contrib/icu/lib/</code>.
</p>
<h2>Use Cases</h2>
<ul>
  <li>
    Efficient sorting of terms in languages that use non-Unicode character
    orderings. (Lucene Sort using a Locale can be very slow.)
  </li>
  <li>
    Efficient range queries over fields that contain terms in languages that
    use non-Unicode character orderings. (Range queries using a Locale can be
    very slow.)
  </li>
  <li>
    Effective Locale-specific normalization (case differences, diacritics, etc.).
    ({@link org.apache.lucene.analysis.LowerCaseFilter} and
    {@link org.apache.lucene.analysis.ASCIIFoldingFilter} provide these services
    in a generic way that doesn't take into account locale-specific needs.)
  </li>
</ul>
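<p>
To see why binary order is inadequate for such languages, compare the JDK's own
<code>java.text.Collator</code> with plain code-point comparison. This is a
standard-library illustration, independent of Lucene:
</p>

```java
import java.text.Collator;
import java.util.Locale;

public class CollationOrderSketch {
  public static void main(String[] args) {
    // In code-point (binary) order, 'é' (U+00E9) sorts after 'f' (U+0066)...
    System.out.println("\u00E9".compareTo("f") > 0);       // true
    // ...but a French collator sorts 'é' with 'e', i.e. before 'f'.
    Collator french = Collator.getInstance(Locale.FRENCH);
    System.out.println(french.compare("\u00E9", "f") < 0); // true
  }
}
```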
<h2>Example Usages</h2>
<h3>Farsi Range Queries</h3>
<pre class="prettyprint">
  Collator collator = Collator.getInstance(new Locale("ar"));
  ICUCollationKeyAnalyzer analyzer = new ICUCollationKeyAnalyzer(collator);
  RAMDirectory ramDir = new RAMDirectory();
  IndexWriter writer = new IndexWriter
    (ramDir, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
  Document doc = new Document();
  doc.add(new Field("content", "\u0633\u0627\u0628",
                    Field.Store.YES, Field.Index.ANALYZED));
  writer.addDocument(doc);
  writer.close();
  IndexSearcher is = new IndexSearcher(ramDir, true);

  // The AnalyzingQueryParser in Lucene's contrib allows terms in range queries
  // to be passed through an analyzer - Lucene's standard QueryParser does not
  AnalyzingQueryParser aqp = new AnalyzingQueryParser("content", analyzer);
  aqp.setLowercaseExpandedTerms(false);

  // Unicode order would include U+0633 in [ U+062F - U+0698 ], but Farsi
  // orders the U+0698 character before the U+0633 character, so the single
  // indexed Term above should NOT be returned by a ConstantScoreRangeQuery
  // with a Farsi Collator (or an Arabic one for the case when Farsi is not
  // supported).
  ScoreDoc[] result
    = is.search(aqp.parse("[ \u062F TO \u0698 ]"), null, 1000).scoreDocs;
  assertEquals("The index Term should not be included.", 0, result.length);
</pre>
<h3>Danish Sorting</h3>
<pre class="prettyprint">
  Analyzer analyzer
    = new ICUCollationKeyAnalyzer(Collator.getInstance(new Locale("da", "dk")));
  RAMDirectory indexStore = new RAMDirectory();
  IndexWriter writer = new IndexWriter
    (indexStore, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
  String[] tracer = new String[] { "A", "B", "C", "D", "E" };
  String[] data = new String[] { "HAT", "HUT", "H\u00C5T", "H\u00D8T", "HOT" };
  String[] sortedTracerOrder = new String[] { "A", "E", "B", "D", "C" };
  for (int i = 0 ; i < data.length ; ++i) {
    Document doc = new Document();
    doc.add(new Field("tracer", tracer[i], Field.Store.YES, Field.Index.NO));
    doc.add(new Field("contents", data[i], Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(doc);
  }
  writer.close();
  Searcher searcher = new IndexSearcher(indexStore, true);
  Sort sort = new Sort();
  sort.setSort(new SortField("contents", SortField.STRING));
  Query query = new MatchAllDocsQuery();
  ScoreDoc[] result = searcher.search(query, null, 1000, sort).scoreDocs;
  for (int i = 0 ; i < result.length ; ++i) {
    Document doc = searcher.doc(result[i].doc);
    assertEquals(sortedTracerOrder[i], doc.getValues("tracer")[0]);
  }
</pre>
<h3>Turkish Case Normalization</h3>
<pre class="prettyprint">
  Collator collator = Collator.getInstance(new Locale("tr", "TR"));
  collator.setStrength(Collator.PRIMARY);
  Analyzer analyzer = new ICUCollationKeyAnalyzer(collator);
  RAMDirectory ramDir = new RAMDirectory();
  IndexWriter writer = new IndexWriter
    (ramDir, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
  Document doc = new Document();
  doc.add(new Field("contents", "DIGY", Field.Store.NO, Field.Index.ANALYZED));
  writer.addDocument(doc);
  writer.close();
  IndexSearcher is = new IndexSearcher(ramDir, true);
  QueryParser parser = new QueryParser("contents", analyzer);
  Query query = parser.parse("d\u0131gy"); // U+0131: dotless i
  ScoreDoc[] result = is.search(query, null, 1000).scoreDocs;
  assertEquals("The index Term should be included.", 1, result.length);
</pre>
<h2>Caveats and Comparisons</h2>
<p>
<strong>WARNING:</strong> Make sure you use exactly the same
<code>Collator</code> at index and query time -- <code>CollationKey</code>s
are only comparable when produced by
the same <code>Collator</code>. Since {@link java.text.RuleBasedCollator}s
are not independently versioned, it is unsafe to search against stored
<code>CollationKey</code>s unless the following are exactly the same (best
practice is to store this information with the index and check that it
remains the same at query time):
</p>
<ul>
  <li>JVM version, including patch version</li>
  <li>
    The language (and country and variant, if specified) of the Locale
    used when constructing the collator via
    {@link java.text.Collator#getInstance(java.util.Locale)}.
  </li>
  <li>
    The collation strength used - see {@link java.text.Collator#setStrength(int)}
  </li>
</ul>
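<p>
One way to follow that best practice is to record the relevant parameters as a small
fingerprint string alongside the index and verify it at query time. The sketch below
uses only the standard library; the class and method names are illustrative, not
Lucene APIs:
</p>

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorFingerprint {
  // Combine everything that affects CollationKey stability into one string.
  public static String of(Locale locale, int strength) {
    return System.getProperty("java.vendor") + "|"
         + System.getProperty("java.version") + "|"
         + locale.toString() + "|"
         + strength;
  }

  public static void main(String[] args) {
    // At index time: compute this and store it with the index.
    String stored = of(Locale.FRENCH, Collator.PRIMARY);
    // At query time: recompute and refuse to search if anything changed.
    if (!stored.equals(of(Locale.FRENCH, Collator.PRIMARY))) {
      throw new IllegalStateException("Collator environment changed; reindex required");
    }
  }
}
```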
<p>
<code>ICUCollationKeyFilter</code> uses ICU4J's <code>Collator</code>, which
makes its version available, thus allowing collation to be versioned
independently from the JVM. <code>ICUCollationKeyFilter</code> is also
significantly faster and generates significantly shorter keys than
<code>CollationKeyFilter</code>. See
<a href="http://site.icu-project.org/charts/collation-icu4j-sun"
>http://site.icu-project.org/charts/collation-icu4j-sun</a> for key
generation timing and key length comparisons between ICU4J and
<code>java.text.Collator</code> over several languages.
</p>
<p>
<code>CollationKey</code>s generated by <code>java.text.Collator</code>s are
not compatible with those generated by ICU Collators. Specifically, if
you use <code>CollationKeyFilter</code> to generate index terms, do not use
<code>ICUCollationKeyFilter</code> on the query side, or vice versa.
</p>
<h1><a name="normalization">Normalization</a></h1>
<p>
<code>ICUNormalizer2Filter</code> normalizes term text to a
<a href="http://unicode.org/reports/tr15/">Unicode Normalization Form</a>, so
that <a href="http://en.wikipedia.org/wiki/Unicode_equivalence">equivalent</a>
forms are standardized to a unique form.
</p>
<h2>Use Cases</h2>
<ul>
  <li>Removing differences in width for Asian-language text.</li>
  <li>Standardizing complex text with non-spacing marks so that characters are
  ordered consistently.</li>
</ul>
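<p>
The equivalence problem is easy to demonstrate with the standard library's
<code>java.text.Normalizer</code>. This is a JDK illustration of NFC equivalence,
not the ICU-based filter itself:
</p>

```java
import java.text.Normalizer;

public class NormalizationSketch {
  public static void main(String[] args) {
    String composed = "\u00E9";     // é as a single code point
    String decomposed = "e\u0301";  // e followed by combining acute accent
    // The two encodings render identically but compare unequal...
    System.out.println(composed.equals(decomposed));  // false
    // ...until both are normalized to the same form (NFC here).
    System.out.println(
        Normalizer.normalize(decomposed, Normalizer.Form.NFC).equals(composed)); // true
  }
}
```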
<h2>Example Usages</h2>
<h3>Normalizing text to NFC</h3>
<pre class="prettyprint">
  /**
   * Normalizer2 objects are unmodifiable and immutable.
   */
  Normalizer2 normalizer = Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.COMPOSE);
  /**
   * This filter will normalize to NFC.
   */
  TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, normalizer);
</pre>
<h1><a name="casefolding">Case Folding</a></h1>
<p>
Default caseless matching, or case folding, is more than just conversion to
lowercase. For example, it handles cases such as the Greek sigma, so that
"Μάϊος" and "ΜΆΪΟΣ" will match correctly.
</p>
<p>
Case folding is still only an approximation of the language-specific rules
governing case. If the specific language is known, consider using
ICUCollationKeyFilter and indexing collation keys instead. This implementation
performs the "full" case folding specified in the Unicode standard, and this
may change the length of the term. For example, the German ß is case-folded
to the string 'ss'.
</p>
<p>
Case folding is related to normalization, and as such is coupled with it in
this integration. To perform case folding, you use normalization with the form
"nfkc_cf" (which is the default).
</p>
<h2>Use Cases</h2>
<ul>
  <li>As a more thorough replacement for LowerCaseFilter that has good behavior
  for most languages.</li>
</ul>
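<p>
A standard-library example shows why plain lowercasing is not a safe substitute:
Java's <code>String.toLowerCase</code> is locale-sensitive, and under a Turkish
locale uppercase 'I' becomes dotless 'ı' (U+0131), whereas Unicode default case
folding maps 'I' to 'i' regardless of locale:
</p>

```java
import java.util.Locale;

public class LowercasePitfall {
  public static void main(String[] args) {
    // Locale-sensitive lowercasing: Turkish maps 'I' to dotless 'ı' (U+0131).
    System.out.println("I".toLowerCase(new Locale("tr", "TR")).equals("\u0131")); // true
    // Locale-neutral lowercasing maps 'I' to ordinary 'i'.
    System.out.println("I".toLowerCase(Locale.ROOT).equals("i"));                 // true
  }
}
```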
<h2>Example Usages</h2>
<h3>Lowercasing text</h3>
<pre class="prettyprint">
  /**
   * This filter will case-fold and normalize to NFKC.
   */
  TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer);
</pre>
<h1><a name="searchfolding">Search Term Folding</a></h1>
<p>
Search term folding removes distinctions (such as accent marks) between
similar characters. It is useful for a fuzzy or loose search.
</p>
<p>
Search term folding implements many of the foldings specified in
<a href="http://www.unicode.org/reports/tr30/tr30-4.html">Character Foldings</a>
as a special normalization form. This folding applies NFKC, Case Folding, and
many character foldings recursively.
</p>
<h2>Use Cases</h2>
<ul>
  <li>As a more thorough replacement for ASCIIFoldingFilter and LowerCaseFilter
  that applies the same ideas to many more languages.</li>
</ul>
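<p>
A very rough approximation of accent folding can be sketched with the standard
library: decompose to NFD, then strip the combining marks. ICUFoldingFilter performs
many more foldings than this, but the principle is similar:
</p>

```java
import java.text.Normalizer;

public class AccentFoldSketch {
  // Decompose accented characters, then remove combining marks (\p{M}).
  public static String fold(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
  }

  public static void main(String[] args) {
    System.out.println(fold("r\u00E9sum\u00E9")); // resume
  }
}
```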
<h2>Example Usages</h2>
<h3>Removing accents</h3>
<pre class="prettyprint">
  /**
   * This filter will case-fold, remove accents and other distinctions, and
   * normalize to NFKC.
   */
  TokenStream tokenstream = new ICUFoldingFilter(tokenizer);
</pre>
<h1><a name="transform">Text Transformation</a></h1>
<p>
ICU provides text-transformation functionality via its Transliteration API. This allows
you to transform text in a variety of ways, taking context into account.
</p>
<p>
For more information, see the
<a href="http://userguide.icu-project.org/transforms/general">User's Guide</a>
and the
<a href="http://userguide.icu-project.org/transforms/general/rules">Rule Tutorial</a>.
</p>
<h2>Use Cases</h2>
<ul>
  <li>Convert Traditional to Simplified Chinese.</li>
  <li>Transliterate between different writing systems: e.g. Romanization.</li>
</ul>
<h2>Example Usages</h2>
<h3>Convert Traditional to Simplified</h3>
<pre class="prettyprint">
  /**
   * This filter will map Traditional Chinese to Simplified Chinese
   */
  TokenStream tokenstream = new ICUTransformFilter(tokenizer, Transliterator.getInstance("Traditional-Simplified"));
</pre>
<h3>Transliterate Serbian Cyrillic to Serbian Latin</h3>
<pre class="prettyprint">
  /**
   * This filter will map Serbian Cyrillic to Serbian Latin according to BGN rules
   */
  TokenStream tokenstream = new ICUTransformFilter(tokenizer, Transliterator.getInstance("Serbian-Latin/BGN"));
</pre>
<h1><a name="backcompat">Backwards Compatibility</a></h1>
<p>
This module exists to provide up-to-date Unicode functionality that supports
the most recent version of Unicode (currently 6.0). However, users who require
stronger backwards compatibility can restrict
{@link org.apache.lucene.analysis.icu.ICUNormalizer2Filter} to operate on only
a specific Unicode version by using a {@link com.ibm.icu.text.FilteredNormalizer2}.
</p>
<h2>Example Usages</h2>
<h3>Restricting normalization to Unicode 5.0</h3>
<pre class="prettyprint">
  /**
   * This filter will do NFC normalization, but will ignore any characters that
   * did not exist as of Unicode 5.0. Because of the normalization stability policy
   * of Unicode, this is an easy way to force normalization to a specific version.
   */
  Normalizer2 normalizer = Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.COMPOSE);
  UnicodeSet set = new UnicodeSet("[:age=5.0:]");
  // see FilteredNormalizer2 docs, the set should be frozen or performance will suffer
  set.freeze();
  FilteredNormalizer2 unicode50 = new FilteredNormalizer2(normalizer, set);
  TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, unicode50);
</pre>