<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<meta name="generator" content=
"HTML Tidy for Linux/x86 (vers 1st November 2002), see www.w3.org">
<meta name="description" content=
"A simple, portable and lightweight C++ library for easy handling of UTF-8 encoded strings">
<meta name="keywords" content="UTF-8 C++ portable utf8 unicode generic templates">
<meta name="author" content="Nemanja Trifunovic">
UTF8-CPP: UTF-8 with C++ in a Portable Way
<style type="text/css">
list-style-type: none;
UTF8-CPP: UTF-8 with C++ in a Portable Way
<a href="https://sourceforge.net/projects/utfcpp">The Sourceforge project page</a>
<a href="#introduction">Introduction</a>
<a href="#examples">Examples of Use</a>
<a href="#reference">Reference</a>
<a href="#funutf8">Functions From utf8 Namespace</a>
<a href="#typesutf8">Types From utf8 Namespace</a>
<a href="#fununchecked">Functions From utf8::unchecked Namespace</a>
<a href="#typesunchecked">Types From utf8::unchecked Namespace</a>
<a href="#points">Points of Interest</a>
<a href="#conclusion">Conclusion</a>
<a href="#links">Links</a>
<h2 id="introduction">
Many C++ developers miss an easy and portable way of handling Unicode encoded
strings. The C++ Standard is currently Unicode agnostic, and while some work is being
done to introduce Unicode to the next incarnation called C++0x, for the moment
nothing of the sort is available. In the meantime, developers use third-party
libraries like ICU, OS-specific capabilities, or simply roll their own solutions.
In order to easily handle UTF-8 encoded Unicode strings, I have come up with a small
generic library. For anybody used to working with STL algorithms and iterators, it should be
easy and natural to use. The code is freely available for any purpose; check out
the license at the beginning of the utf8.h file. If you run into
bugs or performance issues, please let me know and I'll do my best to address them.
The purpose of this article is not to offer an introduction to Unicode in general,
and UTF-8 in particular. If you are not familiar with Unicode, be sure to check out
the <a href="http://www.unicode.org/">Unicode Home Page</a> or some other source of
information on Unicode. Also, it is not my aim to advocate the use of UTF-8
encoded strings in C++ programs; if you want to handle UTF-8 encoded strings from
C++, I am sure you have good reasons for it.
To illustrate the use of this utf8 library, we shall open a file containing UTF-8
encoded text, check whether it starts with a byte order mark, read each line into a
<code>std::string</code>, check it for validity, convert the text to UTF-16, and
back to UTF-8:
<span class="preprocessor">#include <fstream></span>
<span class="preprocessor">#include <iostream></span>
<span class="preprocessor">#include <iterator></span>
<span class="preprocessor">#include <string></span>
<span class="preprocessor">#include <vector></span>
<span class="preprocessor">#include "utf8.h"</span>
<span class="keyword">using namespace</span> std;
<span class="keyword">int</span> main(<span class="keyword">int</span> argc, <span class="keyword">char</span>** argv)
{
    <span class="keyword">if</span> (argc != <span class="literal">2</span>) {
        cout << <span class="literal">"\nUsage: docsample filename\n"</span>;
        <span class="keyword">return</span> <span class="literal">0</span>;
    }
    <span class="keyword">const char</span>* test_file_path = argv[<span class="literal">1</span>];
    <span class="comment">// Open the test file (must be UTF-8 encoded)</span>
    ifstream fs8(test_file_path);
    <span class="keyword">if</span> (!fs8.is_open()) {
        cout << <span class="literal">"Could not open "</span> << test_file_path << endl;
        <span class="keyword">return</span> <span class="literal">0</span>;
    }
    <span class="comment">// Read the first line of the file</span>
    <span class="keyword">unsigned</span> line_count = <span class="literal">1</span>;
    string line;
    <span class="keyword">if</span> (!getline(fs8, line))
        <span class="keyword">return</span> <span class="literal">0</span>;
    <span class="comment">// Look for utf-8 byte-order mark at the beginning</span>
    <span class="keyword">if</span> (line.size() > <span class="literal">2</span>) {
        <span class="keyword">if</span> (utf8::is_bom(line.c_str()))
            cout << <span class="literal">"There is a byte order mark at the beginning of the file\n"</span>;
    }
    <span class="comment">// Play with all the lines in the file</span>
    <span class="keyword">do</span> {
        <span class="comment">// check for invalid utf-8 (for a simple yes/no check, there is also utf8::is_valid function)</span>
        string::iterator end_it = utf8::find_invalid(line.begin(), line.end());
        <span class="keyword">if</span> (end_it != line.end()) {
            cout << <span class="literal">"Invalid UTF-8 encoding detected at line "</span> << line_count << <span class="literal">"\n"</span>;
            cout << <span class="literal">"This part is fine: "</span> << string(line.begin(), end_it) << <span class="literal">"\n"</span>;
        }
        <span class="comment">// Get the line length (at least for the valid part)</span>
        <span class="keyword">int</span> length = utf8::distance(line.begin(), end_it);
        cout << <span class="literal">"Length of line "</span> << line_count << <span class="literal">" is "</span> << length << <span class="literal">"\n"</span>;
        <span class="comment">// Convert it to utf-16</span>
        vector<unsigned short> utf16line;
        utf8::utf8to16(line.begin(), end_it, back_inserter(utf16line));
        <span class="comment">// And back to utf-8</span>
        string utf8line;
        utf8::utf16to8(utf16line.begin(), utf16line.end(), back_inserter(utf8line));
        <span class="comment">// Confirm that the conversion went OK:</span>
        <span class="keyword">if</span> (utf8line != string(line.begin(), end_it))
            cout << <span class="literal">"Error in UTF-16 conversion at line: "</span> << line_count << <span class="literal">"\n"</span>;
        line_count++;
        <span class="comment">// Testing getline itself (rather than eof) also handles a last line without a trailing newline</span>
    } <span class="keyword">while</span> (getline(fs8, line));
    <span class="keyword">return</span> <span class="literal">0</span>;
}
In the previous code sample, we have seen the use of the following functions from
the <code>utf8</code> namespace: first we used the <code>is_bom</code> function to detect
a UTF-8 byte order mark at the beginning of the file; then for each line we
detected invalid UTF-8 sequences with <code>find_invalid</code>; the number
of characters (more precisely, the number of Unicode code points) in each line was
determined with <code>utf8::distance</code>; finally, we converted
each line to UTF-16 encoding with <code>utf8to16</code> and back to UTF-8 with
<code>utf16to8</code>.
Functions From utf8 Namespace
Available in version 1.0 and later.
Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence
to a UTF-8 string.
<span class="keyword">template</span> <<span class="keyword">typename</span> octet_iterator>
octet_iterator append(uint32_t cp, octet_iterator result);
<code>cp</code>: A 32 bit integer representing a code point to append to the
sequence.<br>
<code>result</code>: An output iterator to the place in the sequence where to
append the code point.<br>
<span class="return_value">Return value</span>: An iterator pointing to the place
after the newly appended sequence.
<span class="keyword">unsigned char</span> u[<span class="literal">5</span>] = {<span class="literal">0</span>,<span class="literal">0</span>,<span class="literal">0</span>,<span class="literal">0</span>,<span class="literal">0</span>};
<span class="keyword">unsigned char</span>* end = append(<span class="literal">0x0448</span>, u);
assert (u[<span class="literal">0</span>] == <span class="literal">0xd1</span> && u[<span class="literal">1</span>] == <span class="literal">0x88</span> && u[<span class="literal">2</span>] == <span class="literal">0</span> && u[<span class="literal">3</span>] == <span class="literal">0</span> && u[<span class="literal">4</span>] == <span class="literal">0</span>);
Note that <code>append</code> does not allocate any memory; it is the caller's
responsibility to make sure there is enough memory allocated for the operation. To make
things more interesting, <code>append</code> can add anywhere between 1 and 4
octets to the sequence. In practice, you would most often want to use
<code>std::back_inserter</code> so that the destination container grows as needed.
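To make the 1 to 4 octet range concrete, here is a minimal sketch of how a code point can be written through an output iterator. It is an illustration only and not the code from utf8.h: the names encode_utf8 and encode_to_string are hypothetical, and unlike append the sketch does not validate the code point.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <string>

// Hypothetical helper (not part of utf8.h): writes the UTF-8 encoding of cp
// through out and returns the advanced iterator. No validation is performed.
template <typename OutputIt>
OutputIt encode_utf8(std::uint32_t cp, OutputIt out)
{
    if (cp < 0x80) {                 // 1 octet:  0xxxxxxx
        *out++ = static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 octets: 110xxxxx 10xxxxxx
        *out++ = static_cast<char>(0xC0 | (cp >> 6));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<char>(0xE0 | (cp >> 12));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<char>(0xF0 | (cp >> 18));
        *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encodes a single code point into a fresh string; back_inserter grows the
// string by exactly as many octets as the code point needs.
inline std::string encode_to_string(std::uint32_t cp)
{
    std::string s;
    encode_utf8(cp, std::back_inserter(s));
    return s;
}
```

Because the output goes through std::back_inserter, the caller does not have to guess in advance whether the code point will take one octet or four.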
In case of an invalid code point, a <code>utf8::invalid_code_point</code> exception
is thrown.
Available in version 1.0 and later.
Given the iterator to the beginning of the UTF-8 sequence, it returns the code
point and moves the iterator to the next position.
<span class="keyword">template</span> <<span class="keyword">typename</span> octet_iterator>
uint32_t next(octet_iterator& it, octet_iterator end);
<code>it</code>: a reference to an iterator pointing to the beginning of a UTF-8
encoded code point. After the function returns, it is incremented to point to the
beginning of the next code point.<br>
<code>end</code>: end of the UTF-8 sequence to be processed. If <code>it</code>
becomes equal to <code>end</code> during the extraction of a code point, a
<code>utf8::not_enough_room</code> exception is thrown.<br>
<span class="return_value">Return value</span>: the 32 bit representation of the
processed UTF-8 code point.
<span class="keyword">const char</span>* twochars = <span class="literal">"\xe6\x97\xa5\xd1\x88"</span>;
<span class="keyword">const char</span>* w = twochars;
<span class="keyword">int</span> cp = next(w, twochars + <span class="literal">6</span>);
assert (cp == <span class="literal">0x65e5</span>);
assert (w == twochars + <span class="literal">3</span>);
This function is typically used to iterate through a UTF-8 encoded string.
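The kind of loop this enables can be sketched without the library as follows. The names decode_next and decode_all are hypothetical, standard exceptions stand in for the library's own exception types, and the sketch checks only the lead and trail octet bit patterns, so it accepts some sequences (overlong forms, encoded surrogates, out-of-range values) that a full validator would reject.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the decoding step: reads one code point from
// [it, end) and advances it past the octets that were consumed.
inline std::uint32_t decode_next(const char*& it, const char* end)
{
    if (it == end) throw std::out_of_range("empty range");
    const unsigned char lead = static_cast<unsigned char>(*it++);
    int trail = 0;          // number of trail octets still expected
    std::uint32_t cp = 0;   // code point being assembled
    if (lead < 0x80)              { trail = 0; cp = lead; }
    else if ((lead >> 5) == 0x06) { trail = 1; cp = lead & 0x1F; }
    else if ((lead >> 4) == 0x0E) { trail = 2; cp = lead & 0x0F; }
    else if ((lead >> 3) == 0x1E) { trail = 3; cp = lead & 0x07; }
    else throw std::invalid_argument("not a lead octet");
    for (; trail > 0; --trail) {
        if (it == end) throw std::out_of_range("truncated sequence");
        const unsigned char c = static_cast<unsigned char>(*it);
        if ((c >> 6) != 0x02) throw std::invalid_argument("expected a trail octet");
        ++it;
        cp = (cp << 6) | (c & 0x3F);   // append the six payload bits
    }
    return cp;
}

// Walks the whole range and collects every code point in order.
inline std::vector<std::uint32_t> decode_all(const char* begin, const char* end)
{
    std::vector<std::uint32_t> cps;
    while (begin != end)
        cps.push_back(decode_next(begin, end));
    return cps;
}
```

The caller never computes how many octets a code point occupies; the iterator passed by reference is simply left at the start of the next one.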
In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is
thrown.
Available in version 2.1 and later.
Given the iterator to the beginning of the UTF-8 sequence, it returns the code
point for the following sequence without changing the value of the iterator.
<span class="keyword">template</span> <<span class="keyword">typename</span> octet_iterator>
uint32_t peek_next(octet_iterator it, octet_iterator end);
<code>it</code>: an iterator pointing to the beginning of a UTF-8
encoded code point.<br>
<code>end</code>: end of the UTF-8 sequence to be processed. If <code>it</code>
becomes equal to <code>end</code> during the extraction of a code point, a
<code>utf8::not_enough_room</code> exception is thrown.<br>
<span class="return_value">Return value</span>: the 32 bit representation of the
processed UTF-8 code point.
<span class="keyword">const char</span>* twochars = <span class="literal">"\xe6\x97\xa5\xd1\x88"</span>;
<span class="keyword">const char</span>* w = twochars;
<span class="keyword">int</span> cp = peek_next(w, twochars + <span class="literal">6</span>);
assert (cp == <span class="literal">0x65e5</span>);
assert (w == twochars);
In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is
thrown.
Available in version 1.02 and later.
Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it
decreases the iterator until it hits the beginning of the previous UTF-8 encoded
code point and returns the 32 bit representation of the code point.
<span class="keyword">template</span> <<span class="keyword">typename</span> octet_iterator>
uint32_t prior(octet_iterator& it, octet_iterator start);
<code>it</code>: a reference pointing to an octet within a UTF-8 encoded string.
After the function returns, it is decremented to point to the beginning of the
previous code point.<br>
<code>start</code>: an iterator to the beginning of the sequence where the search
for the beginning of a code point is performed. It is a
safety measure to prevent passing the beginning of the string in the search for a
UTF-8 lead octet.<br>
<span class="return_value">Return value</span>: the 32 bit representation of the
previous code point.
<span class="keyword">const char</span>* twochars = <span class="literal">"\xe6\x97\xa5\xd1\x88"</span>;
<span class="keyword">const char</span>* w = twochars + <span class="literal">3</span>;
<span class="keyword">int</span> cp = prior (w, twochars);
assert (cp == <span class="literal">0x65e5</span>);
assert (w == twochars);
This function has two purposes: one is to iterate backwards through a UTF-8
encoded string. Note that it is usually a better idea to iterate forward instead,
since <code>utf8::next</code> is faster. The second purpose is to find the beginning
of a UTF-8 sequence if we have a random position within a string.
<code>it</code> will typically point to the beginning of
a code point, and <code>start</code> will point to the
beginning of the string to ensure we don't go backwards too far. <code>it</code> is
decreased until it points to a lead UTF-8 octet, and then the UTF-8 sequence
beginning with that octet is decoded to a 32 bit representation and returned.
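The backward search for a lead octet can be sketched as below. This is an assumption-laden illustration rather than the library's code: step_back is a hypothetical name, it only locates the lead octet and leaves the decoding to a separate step, and it throws standard exceptions instead of the library's own.

```cpp
#include <cassert>
#include <stdexcept>

// True for octets of the form 10xxxxxx, which can never start a sequence.
inline bool is_trail_octet(char c)
{
    return (static_cast<unsigned char>(c) >> 6) == 0x02;
}

// Hypothetical helper: moves back from it to the lead octet of the previous
// code point, never stepping before start.
inline const char* step_back(const char* it, const char* start)
{
    if (it == start)
        throw std::out_of_range("already at the start of the sequence");
    do {
        --it;                                   // skip trail octets going backwards
    } while (is_trail_octet(*it) && it != start);
    if (is_trail_octet(*it))                    // hit start without finding a lead octet
        throw std::invalid_argument("no lead octet found");
    return it;
}
```

The same routine serves both purposes named above: called at a code point boundary it steps to the previous code point, and called at a random position it finds the start of the code point that contains that position.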
In case <code>start</code> is reached before a UTF-8 lead octet is hit, or if an
invalid UTF-8 sequence is started by the lead octet, an <code>invalid_utf8</code>
exception is thrown.
Deprecated in version 1.02 and later.
Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it
decreases the iterator until it hits the beginning of the previous UTF-8 encoded
code point and returns the 32 bit representation of the code point.
<span class="keyword">template</span> <<span class="keyword">typename</span> octet_iterator>
uint32_t previous(octet_iterator& it, octet_iterator pass_start);
<code>it</code>: a reference pointing to an octet within a UTF-8 encoded string.
After the function returns, it is decremented to point to the beginning of the
previous code point.<br>
<code>pass_start</code>: an iterator to the point in the sequence where the search
for the beginning of a code point is aborted if no result was reached. It is a
safety measure to prevent passing the beginning of the string in the search for a
UTF-8 lead octet.<br>
<span class="return_value">Return value</span>: the 32 bit representation of the
previous code point.
<span class="keyword">const char</span>* twochars = <span class="literal">"\xe6\x97\xa5\xd1\x88"</span>;
<span class="keyword">const char</span>* w = twochars + <span class="literal">3</span>;
<span class="keyword">int</span> cp = previous (w, twochars - <span class="literal">1</span>);
assert (cp == <span class="literal">0x65e5</span>);
assert (w == twochars);
<code>utf8::previous</code> is deprecated, and <code>utf8::prior</code> should
be used instead, although existing code can continue using this function.
The problem is the parameter <code>pass_start</code>, which points to the position
just before the beginning of the sequence. Standard containers don't have the
concept of "pass start", so the function cannot be used with their iterators.
<code>it</code> will typically point to the beginning of
a code point, and <code>pass_start</code> will point to the octet just before the
beginning of the string to ensure we don't go backwards too far. <code>it</code> is
decreased until it points to a lead UTF-8 octet, and then the UTF-8 sequence
beginning with that octet is decoded to a 32 bit representation and returned.
In case <code>pass_start</code> is reached before a UTF-8 lead octet is hit, or if an
invalid UTF-8 sequence is started by the lead octet, an <code>invalid_utf8</code>
exception is thrown.
Available in version 1.0 and later.
Advances an iterator by the specified number of code points within a UTF-8
sequence.
<span class="keyword">template</span> <<span class="keyword">typename</span> octet_iterator, typename distance_type>
<span class="keyword">void</span> advance (octet_iterator& it, distance_type n, octet_iterator end);
<code>it</code>: a reference to an iterator pointing to the beginning of a UTF-8
encoded code point. After the function returns, it is incremented to point to the
nth following code point.<br>
<code>n</code>: a positive integer that shows how many code points we want to
advance.<br>
<code>end</code>: end of the UTF-8 sequence to be processed. If <code>it</code>
becomes equal to <code>end</code> during the extraction of a code point, a
<code>utf8::not_enough_room</code> exception is thrown.<br>
<span class="keyword">const char</span>* twochars = <span class="literal">"\xe6\x97\xa5\xd1\x88"</span>;
<span class="keyword">const char</span>* w = twochars;
advance (w, <span class="literal">2</span>, twochars + <span class="literal">6</span>);
assert (w == twochars + <span class="literal">5</span>);
This function works only "forward". In case of a negative <code>n</code>, there is
In case of an invalid code point, a <code>utf8::invalid_code_point</code> exception
is thrown.
Available in version 1.0 and later.
Given the iterators to two UTF-8 encoded code points in a sequence, returns the
number of code points between them.
<span class="keyword">template</span> <<span class="keyword">typename</span> octet_iterator>
<span class="keyword">typename</span> std::iterator_traits<octet_iterator>::difference_type distance (octet_iterator first, octet_iterator last);
<code>first</code>: an iterator to the beginning of a UTF-8 encoded code point.<br>
<code>last</code>: an iterator to the "post-end" of the last UTF-8 encoded code
point in the sequence whose length we are trying to determine. It can be the
beginning of a new code point, or not.<br>
<span class="return_value">Return value</span>: the distance between the iterators,
in code points.
<span class="keyword">const char</span>* twochars = <span class="literal">"\xe6\x97\xa5\xd1\x88"</span>;
size_t dist = utf8::distance(twochars, twochars + <span class="literal">5</span>);
assert (dist == <span class="literal">2</span>);
This function is used to find the length (in code points) of a UTF-8 encoded
string. The reason it is called <em>distance</em>, rather than, say,
<em>length</em>, is mainly that developers are used to <em>length</em> being an
O(1) function. Computing the length of a UTF-8 string is a linear operation, and
it looked better to model it after the <code>std::distance</code> algorithm.
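To see why the operation is linear, note that in well-formed UTF-8 every code point contributes exactly one octet that is not a trail octet (10xxxxxx), so the length can only be found by looking at every octet. The sketch below counts those octets; count_code_points is a hypothetical name, and the sketch assumes the input has already been validated, whereas the library function validates as it goes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: number of code points in a string that is assumed to
// hold valid UTF-8. Every octet is visited once, hence O(n).
inline std::size_t count_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80)   // not a trail octet, so a new code point starts here
            ++n;
    }
    return n;
}
```

Compare this with std::string::size(), which returns the number of octets in constant time and is 5 rather than 2 for the two-character string used in the example above.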
In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is
thrown. If <code>last</code> does not point to the past-the-end of a UTF-8 sequence,
a <code>utf8::not_enough_room</code> exception is thrown.
Available in version 1.0 and later.
Converts a UTF-16 encoded string to UTF-8.
<span class="keyword">template</span> <<span class="keyword">typename</span> u16bit_iterator, <span class="keyword">typename</span> octet_iterator>
octet_iterator utf16to8 (u16bit_iterator start, u16bit_iterator end, octet_iterator result);
<code>start</code>: an iterator pointing to the beginning of the UTF-16 encoded
string to convert.<br>
<code>end</code>: an iterator pointing to past-the-end of the UTF-16 encoded
string to convert.<br>
<code>result</code>: an output iterator to the place in the UTF-8 string where to
append the result of conversion.<br>
<span class="return_value">Return value</span>: An iterator pointing to the place
after the appended UTF-8 string.
<span class="keyword">unsigned short</span> utf16string[] = {<span class="literal">0x41</span>, <span class="literal">0x0448</span>, <span class="literal">0x65e5</span>, <span class="literal">0xd834</span>, <span class="literal">0xdd1e</span>};
vector<<span class="keyword">unsigned char</span>> utf8result;
utf16to8(utf16string, utf16string + <span class="literal">5</span>, back_inserter(utf8result));
assert (utf8result.size() == <span class="literal">10</span>);
In case of an invalid UTF-16 sequence, a <code>utf8::invalid_utf16</code> exception is
thrown.
Available in version 1.0 and later.
Converts a UTF-8 encoded string to UTF-16.
<span class="keyword">template</span> <<span class="keyword">typename</span> u16bit_iterator, typename octet_iterator>
u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, u16bit_iterator result);
<code>start</code>: an iterator pointing to the beginning of the UTF-8 encoded
string to convert.<br>
<code>end</code>: an iterator pointing to past-the-end of the UTF-8 encoded
string to convert.<br>
<code>result</code>: an output iterator to the place in the UTF-16 string where to
append the result of conversion.<br>
<span class="return_value">Return value</span>: An iterator pointing to the place
after the appended UTF-16 string.
<span class="keyword">char</span> utf8_with_surrogates[] = <span class="literal">"\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e"</span>;
vector <<span class="keyword">unsigned short</span>> utf16result;
utf8to16(utf8_with_surrogates, utf8_with_surrogates + <span class="literal">9</span>, back_inserter(utf16result));
assert (utf16result.size() == <span class="literal">4</span>);
assert (utf16result[<span class="literal">2</span>] == <span class="literal">0xd834</span>);
assert (utf16result[<span class="literal">3</span>] == <span class="literal">0xdd1e</span>);
In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is
thrown. If <code>end</code> does not point to the past-the-end of a UTF-8 sequence, a
<code>utf8::not_enough_room</code> exception is thrown.
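Code points above U+FFFF do not fit into a single 16-bit unit and are stored in UTF-16 as a surrogate pair, which is why a single 4-octet UTF-8 sequence produces two UTF-16 units in the example above. The arithmetic can be sketched as follows; append_utf16 and combine_surrogates are hypothetical names, not library functions, and both assume valid input.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: appends the UTF-16 code units for cp to out.
// Assumes cp is a valid code point (not a surrogate, not above 0x10FFFF).
inline void append_utf16(std::uint32_t cp, std::vector<std::uint16_t>& out)
{
    if (cp < 0x10000) {
        out.push_back(static_cast<std::uint16_t>(cp));   // fits in one unit
    } else {
        cp -= 0x10000;                                   // 20 bits remain
        out.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));    // lead surrogate: high 10 bits
        out.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));  // trail surrogate: low 10 bits
    }
}

// Inverse operation: rebuilds the code point from a lead/trail surrogate pair.
inline std::uint32_t combine_surrogates(std::uint16_t lead, std::uint16_t trail)
{
    return 0x10000
         + ((static_cast<std::uint32_t>(lead) - 0xD800) << 10)
         + (static_cast<std::uint32_t>(trail) - 0xDC00);
}
```

For U+1D11E, the code point encoded by the last four octets of the example string, this yields the units 0xd834 and 0xdd1e that the example asserts.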
Available in version 1.0 and later.
Converts a UTF-32 encoded string to UTF-8.
<span class="keyword">template</span> <<span class="keyword">typename</span> octet_iterator, typename u32bit_iterator>
octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, octet_iterator result);
<code>start</code>: an iterator pointing to the beginning of the UTF-32 encoded
string to convert.<br>
<code>end</code>: an iterator pointing to past-the-end of the UTF-32 encoded
string to convert.<br>
<code>result</code>: an output iterator to the place in the UTF-8 string where to
append the result of conversion.<br>
<span class="return_value">Return value</span>: An iterator pointing to the place
after the appended UTF-8 string.
<span class="keyword">int</span> utf32string[] = {<span class="literal">0x448</span>, <span class="literal">0x65E5</span>, <span class="literal">0x10346</span>, <span class="literal">0</span>};
vector<<span class="keyword">unsigned char</span>> utf8result;
utf32to8(utf32string, utf32string + <span class="literal">3</span>, back_inserter(utf8result));
assert (utf8result.size() == <span class="literal">9</span>);
In case of an invalid UTF-32 string, a <code>utf8::invalid_code_point</code> exception
is thrown.
Available in version 1.0 and later.
Converts a UTF-8 encoded string to UTF-32.
<span class="keyword">template</span> <<span class="keyword">typename</span> octet_iterator, <span class="keyword">typename</span> u32bit_iterator>
u32bit_iterator utf8to32 (octet_iterator start, octet_iterator end, u32bit_iterator result);
<code>start</code>: an iterator pointing to the beginning of the UTF-8 encoded
string to convert.<br>
<code>end</code>: an iterator pointing to past-the-end of the UTF-8 encoded string
to convert.<br>
<code>result</code>: an output iterator to the place in the UTF-32 string where to
append the result of conversion.<br>
<span class="return_value">Return value</span>: An iterator pointing to the place
after the appended UTF-32 string.
<span class="keyword">const char</span>* twochars = <span class="literal">"\xe6\x97\xa5\xd1\x88"</span>;
vector<<span class="keyword">int</span>> utf32result;
utf8to32(twochars, twochars + <span class="literal">5</span>, back_inserter(utf32result));
assert (utf32result.size() == <span class="literal">2</span>);
In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is
thrown. If <code>end</code> does not point to the past-the-end of a UTF-8 sequence, a
<code>utf8::not_enough_room</code> exception is thrown.
Available in version 1.0 and later.
Detects an invalid sequence within a UTF-8 string.
<span class="keyword">template</span> <<span class="keyword">typename</span> octet_iterator>
octet_iterator find_invalid(octet_iterator start, octet_iterator end);
<code>start</code>: an iterator pointing to the beginning of the UTF-8 string to
test for validity.<br>
<code>end</code>: an iterator pointing to past-the-end of the UTF-8 string to test
for validity.<br>
<span class="return_value">Return value</span>: an iterator pointing to the first
invalid octet in the UTF-8 string. In case none were found, equals
<code>end</code>.
<span class="keyword">char</span> utf_invalid[] = <span class="literal">"\xe6\x97\xa5\xd1\x88\xfa"</span>;
<span class="keyword">char</span>* invalid = find_invalid(utf_invalid, utf_invalid + <span class="literal">6</span>);
assert (invalid == utf_invalid + <span class="literal">5</span>);
This function is typically used to make sure a UTF-8 string is valid before
processing it with other functions. It is especially important to call it before
doing any of the <em>unchecked</em> operations on it.
Available in version 1.0 and later.
Checks whether a sequence of octets is a valid UTF-8 string.
<span class="keyword">template</span> <<span class="keyword">typename</span> octet_iterator>
<span class="keyword">bool</span> is_valid(octet_iterator start, octet_iterator end);
<code>start</code>: an iterator pointing to the beginning of the UTF-8 string to
test for validity.<br>
<code>end</code>: an iterator pointing to past-the-end of the UTF-8 string to test
for validity.<br>
<span class="return_value">Return value</span>: <code>true</code> if the sequence
is a valid UTF-8 string; <code>false</code> if not.
<span class="keyword">char</span> utf_invalid[] = <span class="literal">"\xe6\x97\xa5\xd1\x88\xfa"</span>;
<span class="keyword">bool</span> bvalid = is_valid(utf_invalid, utf_invalid + <span class="literal">6</span>);
assert (bvalid == false);
<code>is_valid</code> is a shorthand for <code>find_invalid(start, end) ==
end;</code>. You may want to use it to make sure that a byte sequence is a valid
UTF-8 string without the need to know where it fails if it is not valid.
utf8::replace_invalid
Available in version 2.0 and later.
Replaces all invalid UTF-8 sequences within a string with a replacement marker.
<span class="keyword">template</span> <<span class="keyword">typename</span> octet_iterator, <span class="keyword">typename</span> output_iterator>
output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out, uint32_t replacement);
<span class="keyword">template</span> <<span class="keyword">typename</span> octet_iterator, <span class="keyword">typename</span> output_iterator>
output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out);
<code>start</code>: an iterator pointing to the beginning of the UTF-8 string to
look for invalid UTF-8 sequences.<br>
<code>end</code>: an iterator pointing to past-the-end of the UTF-8 string to look
for invalid UTF-8 sequences.<br>
<code>out</code>: An output iterator to the range where the result of replacement
is stored.<br>
<code>replacement</code>: A Unicode code point for the replacement marker. The
version without this parameter assumes the value <code>0xfffd</code>.<br>
<span class="return_value">Return value</span>: An iterator pointing to the place
after the UTF-8 string with replaced invalid sequences.
<span class="keyword">char</span> invalid_sequence[] = <span class="literal">"a\x80\xe0\xa0\xc0\xaf\xed\xa0\x80z"</span>;
vector<<span class="keyword">char</span>> replace_invalid_result;
replace_invalid (invalid_sequence, invalid_sequence + sizeof(invalid_sequence), back_inserter(replace_invalid_result), <span class="literal">'?'</span>);
<span class="keyword">bool</span> bvalid = is_valid(replace_invalid_result.begin(), replace_invalid_result.end());
assert (bvalid);
<span class="keyword">const char</span>* fixed_invalid_sequence = <span class="literal">"a????z"</span>;
assert (std::equal(replace_invalid_result.begin(), replace_invalid_result.end(), fixed_invalid_sequence));
<code>replace_invalid</code> does not perform in-place replacement of invalid
sequences. Rather, it produces a copy of the original string with the invalid
sequences replaced with a replacement marker. Therefore, <code>out</code> must not
be in the <code>[start, end]</code> range.
If <code>end</code> does not point to the past-the-end of a UTF-8 sequence, a
<code>utf8::not_enough_room</code> exception is thrown.
Available in version 1.0 and later.
Checks whether a sequence of three octets is a UTF-8 byte order mark (BOM).
<span class="keyword">template</span> <<span class="keyword">typename</span> octet_iterator>
<span class="keyword">bool</span> is_bom (octet_iterator it);
<code>it</code>: beginning of the 3-octet sequence to check.<br>
<span class="return_value">Return value</span>: <code>true</code> if the sequence
is a UTF-8 byte order mark; <code>false</code> if not.
<span class="keyword">unsigned char</span> byte_order_mark[] = {<span class="literal">0xef</span>, <span class="literal">0xbb</span>, <span class="literal">0xbf</span>};
<span class="keyword">bool</span> bbom = is_bom(byte_order_mark);
assert (bbom == <span class="literal">true</span>);
The typical use of this function is to check the first three bytes of a file. If
they form the UTF-8 BOM, we want to skip them before processing the actual UTF-8
encoded text.
Types From utf8 Namespace
Available in version 2.0 and later.
Adapts the underlying octet iterator to iterate over the sequence of code points,
rather than raw octets.
<span class="keyword">template</span> <<span class="keyword">typename</span> octet_iterator>
<span class="keyword">class</span> iterator;
<h5>Member functions</h5>
<dt><code>iterator();</code> <dd> the default constructor; the underlying <code>octet_iterator</code> is
constructed with its default constructor.
<dt><code><span class="keyword">explicit</span> iterator (const octet_iterator& octet_it,
const octet_iterator& range_start,
const octet_iterator& range_end);</code> <dd> a constructor
that initializes the underlying <code>octet_iterator</code> with <code>octet_it</code>
and sets the range in which the iterator is considered valid.
<dt><code>octet_iterator base () <span class="keyword">const</span>;</code> <dd> returns the
underlying <code>octet_iterator</code>.
<dt><code>uint32_t operator * () <span class="keyword">const</span>;</code> <dd> decodes the UTF-8 sequence
the underlying <code>octet_iterator</code> is pointing to and returns the code point.
<dt><code><span class="keyword">bool operator</span> == (const iterator& rhs)
<span class="keyword">const</span>;</code> <dd> returns <span class="keyword">true</span>
if the two underlying iterators are equal.
<dt><code><span class="keyword">bool operator</span> != (const iterator& rhs)
<span class="keyword">const</span>;</code> <dd> returns <span class="keyword">true</span>
if the two underlying iterators are not equal.
<dt><code>iterator& <span class="keyword">operator</span> ++ (); </code> <dd> the prefix increment: moves
the iterator to the next UTF-8 encoded code point.
<dt><code>iterator <span class="keyword">operator</span> ++ (<span class="keyword">int</span>); </code> <dd>
the postfix increment: moves the iterator to the next UTF-8 encoded code point and returns a copy of the iterator as it was before the move.
<dt><code>iterator& <span class="keyword">operator</span> -- (); </code> <dd> the prefix decrement: moves
the iterator to the previous UTF-8 encoded code point.
<dt><code>iterator <span class="keyword">operator</span> -- (<span class="keyword">int</span>); </code> <dd>
the postfix decrement: moves the iterator to the previous UTF-8 encoded code point and returns a copy of the iterator as it was before the move.
936
<span class="keyword">char</span>* threechars = <span class="literal">"\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88"</span>;
937
utf8::iterator<<span class="keyword">char</span>*> it(threechars, threechars, threechars + <span class="literal">9</span>);
938
utf8::iterator<<span class="keyword">char</span>*> it2 = it;
940
assert (*it == <span class="literal">0x10346</span>);
941
assert (*(++it) == <span class="literal">0x65e5</span>);
942
assert ((*it++) == <span class="literal">0x65e5</span>);
943
assert (*it == <span class="literal">0x0448</span>);
945
utf8::iterator<<span class="keyword">char</span>*> endit (threechars + <span class="literal">9</span>, threechars, threechars + <span class="literal">9</span>);
946
assert (++it == endit);
947
assert (*(--it) == <span class="literal">0x0448</span>);
948
assert ((*it--) == <span class="literal">0x0448</span>);
949
assert (*it == <span class="literal">0x65e5</span>);
950
assert (--it == utf8::iterator<<span class="keyword">char</span>*>(threechars, threechars, threechars + <span class="literal">9</span>));
951
assert (*it == <span class="literal">0x10346</span>);
954
The purpose of the <code>utf8::iterator</code> adapter is to enable easy iteration as well as the use of STL
955
algorithms with UTF-8 encoded strings. Increment and decrement operators are implemented in terms of
956
<code>utf8::next()</code> and <code>utf8::prior()</code> functions.
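To make the idea of an adapter concrete, here is a minimal, self-contained sketch of a code point iterator that can be handed to STL algorithms. It is not the library's class: <code>cp_iterator</code> and <code>count_code_point</code> are hypothetical names, the sketch only moves forward, and it assumes valid UTF-8 with no range checks.

```cpp
#include <cassert>
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <string>

// Hypothetical stand-in for a code point iterator; NOT the library's class.
// Assumes valid UTF-8 input and performs no range checks.
class cp_iterator {
public:
    typedef std::input_iterator_tag iterator_category;
    typedef std::uint32_t value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const std::uint32_t* pointer;
    typedef std::uint32_t reference;

    explicit cp_iterator(std::string::const_iterator it) : it_(it) {}

    // Decode the sequence at the current position without moving.
    std::uint32_t operator*() const {
        std::string::const_iterator p = it_;
        const unsigned char lead = static_cast<unsigned char>(*p);
        const int len = sequence_length(lead);
        // Strip the length marker bits from the lead octet.
        std::uint32_t cp = (len == 1) ? lead : (lead & (0x7f >> len));
        for (int i = 1; i < len; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(*++p) & 0x3f);
        return cp;
    }

    // Skip the whole sequence the iterator currently points to.
    cp_iterator& operator++() {
        it_ += sequence_length(static_cast<unsigned char>(*it_));
        return *this;
    }
    cp_iterator operator++(int) { cp_iterator tmp(*this); ++*this; return tmp; }

    bool operator==(const cp_iterator& rhs) const { return it_ == rhs.it_; }
    bool operator!=(const cp_iterator& rhs) const { return it_ != rhs.it_; }

private:
    // Number of octets in a sequence, judged from its lead octet only.
    static int sequence_length(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead >> 5) == 0x06) return 2;
        if ((lead >> 4) == 0x0e) return 3;
        return 4;
    }

    std::string::const_iterator it_;
};

// Count occurrences of a code point (not an octet) with a plain STL algorithm.
std::size_t count_code_point(const std::string& s, std::uint32_t cp) {
    return static_cast<std::size_t>(
        std::count(cp_iterator(s.begin()), cp_iterator(s.end()), cp));
}
```

With such an adapter, <code>std::count</code> compares whole code points, so searching for U+0448 never matches a stray octet in the middle of another sequence.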
959
Note that the <code>utf8::iterator</code> adapter is a checked iterator. It operates on the range specified in
960
the constructor; any attempt to go out of that range will result in an exception. Even the comparison operators
961
require both iterator objects to be constructed against the same range - otherwise an exception is thrown. Typically,
962
the range will be determined by sequence container functions <code>begin</code> and <code>end</code>, i.e.:
965
std::string s = <span class="literal">"example"</span>;
966
utf8::iterator<std::string::iterator> i (s.begin(), s.begin(), s.end());
968
<h3 id="fununchecked">
969
Functions From utf8::unchecked Namespace
972
utf8::unchecked::append
975
Available in version 1.0 and later.
978
Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence to a UTF-8 string.
982
<span class="keyword">template</span> <<span class=
983
"keyword">typename</span> octet_iterator>
984
octet_iterator append(uint32_t cp, octet_iterator result);
988
<code>cp</code>: A 32 bit integer representing a code point to append to the sequence.<br>
990
<code>result</code>: An output iterator to the place in the sequence where to
991
append the code point.<br>
992
<span class="return_value">Return value</span>: An iterator pointing to the place
993
after the newly appended sequence.
999
<span class="keyword">unsigned char</span> u[<span class="literal">5</span>] = {<span
1000
class="literal">0</span>,<span class="literal">0</span>,<span class=
1001
"literal">0</span>,<span class="literal">0</span>,<span class="literal">0</span>};
1002
<span class="keyword">unsigned char</span>* end = unchecked::append(<span class=
1003
"literal">0x0448</span>, u);
1004
assert (u[<span class="literal">0</span>] == <span class=
1005
"literal">0xd1</span> && u[<span class="literal">1</span>] == <span class=
1006
"literal">0x88</span> && u[<span class="literal">2</span>] == <span class=
1007
"literal">0</span> && u[<span class="literal">3</span>] == <span class=
1008
"literal">0</span> && u[<span class="literal">4</span>] == <span class=
1009
"literal">0</span>);
1012
This is a faster but less safe version of <code>utf8::append</code>. It does not
1013
check for validity of the supplied code point, and may produce an invalid UTF-8 sequence.
1017
utf8::unchecked::next
1020
Available in version 1.0 and later.
1023
Given the iterator to the beginning of a UTF-8 sequence, it returns the code point
1024
and moves the iterator to the next position.
1027
<span class="keyword">template</span> <<span class=
1028
"keyword">typename</span> octet_iterator>
1029
uint32_t next(octet_iterator& it);
1033
<code>it</code>: a reference to an iterator pointing to the beginning of a UTF-8
1034
encoded code point. After the function returns, it is incremented to point to the
1035
beginning of the next code point.<br>
1036
<span class="return_value">Return value</span>: the 32 bit representation of the
1037
processed UTF-8 code point.
1043
<span class="keyword">char</span>* twochars = <span class=
1044
"literal">"\xe6\x97\xa5\xd1\x88"</span>;
1045
<span class="keyword">char</span>* w = twochars;
1046
<span class="keyword">int</span> cp = unchecked::next(w);
1047
assert (cp == <span class="literal">0x65e5</span>);
1048
assert (w == twochars + <span class="literal">3</span>);
1051
This is a faster but less safe version of <code>utf8::next</code>. It does not
1052
check for validity of the supplied UTF-8 sequence.
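To show what "unchecked" decoding amounts to, here is a minimal sketch of reading one code point and advancing the pointer. The name <code>decode_next</code> is hypothetical and the code is an illustration under the assumption of valid input, not the library's implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, NOT the library's code: decode one code point starting
// at `it` and leave `it` at the first octet of the following sequence.
// The input is assumed to be valid UTF-8; nothing is verified.
std::uint32_t decode_next(const char*& it) {
    const unsigned char lead = static_cast<unsigned char>(*it++);
    if (lead < 0x80)
        return lead;                       // single octet (ASCII)

    int trailing;                          // number of continuation octets
    std::uint32_t cp;                      // payload bits of the lead octet
    if ((lead >> 5) == 0x06)      { trailing = 1; cp = lead & 0x1f; }
    else if ((lead >> 4) == 0x0e) { trailing = 2; cp = lead & 0x0f; }
    else                          { trailing = 3; cp = lead & 0x07; }

    // Each continuation octet contributes its low six bits.
    while (trailing-- > 0)
        cp = (cp << 6) | (static_cast<unsigned char>(*it++) & 0x3f);
    return cp;
}
```

Because the length is taken from the lead octet on trust, a truncated or malformed sequence makes this kind of code read past the intended data, which is the price of skipping the checks.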
1055
utf8::unchecked::peek_next
1058
Available in version 2.1 and later.
1061
Given the iterator to the beginning of a UTF-8 sequence, it returns the code point.
1064
<span class="keyword">template</span> <<span class=
1065
"keyword">typename</span> octet_iterator>
1066
uint32_t peek_next(octet_iterator it);
1070
<code>it</code>: an iterator pointing to the beginning of a UTF-8
1071
encoded code point.<br>
1072
<span class="return_value">Return value</span>: the 32 bit representation of the
1073
processed UTF-8 code point.
1079
<span class="keyword">char</span>* twochars = <span class=
1080
"literal">"\xe6\x97\xa5\xd1\x88"</span>;
1081
<span class="keyword">char</span>* w = twochars;
1082
<span class="keyword">int</span> cp = unchecked::peek_next(w);
1083
assert (cp == <span class="literal">0x65e5</span>);
1084
assert (w == twochars);
1087
This is a faster but less safe version of <code>utf8::peek_next</code>. It does not
1088
check for validity of the supplied UTF-8 sequence.
1091
utf8::unchecked::prior
1094
Available in version 1.02 and later.
1097
Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it
1098
decreases the iterator until it hits the beginning of the previous UTF-8 encoded
1099
code point and returns the 32 bit representation of the code point.
1102
<span class="keyword">template</span> <<span class=
1103
"keyword">typename</span> octet_iterator>
1104
uint32_t prior(octet_iterator& it);
1108
<code>it</code>: a reference to an iterator pointing to an octet within a UTF-8 encoded string.
1109
After the function returns, it is decremented to point to the beginning of the
1110
previous code point.<br>
1111
<span class="return_value">Return value</span>: the 32 bit representation of the
1112
previous code point.
1118
<span class="keyword">char</span>* twochars = <span class=
1119
"literal">"\xe6\x97\xa5\xd1\x88"</span>;
1120
<span class="keyword">char</span>* w = twochars + <span class="literal">3</span>;
1121
<span class="keyword">int</span> cp = unchecked::prior (w);
1122
assert (cp == <span class="literal">0x65e5</span>);
1123
assert (w == twochars);
1126
This is a faster but less safe version of <code>utf8::prior</code>. It does not
1127
check for validity of the supplied UTF-8 sequence and offers no boundary checking.
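Stepping backwards relies on the fact that continuation octets always have the bit pattern <code>10xxxxxx</code>. The following sketch walks back over them until it reaches a lead octet, collecting the payload bits on the way. <code>decode_prior</code> is a hypothetical name; the code assumes valid UTF-8 and that at least one complete code point precedes the pointer.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, NOT the library's code: move `it` back to the start of
// the previous code point and return that code point.
// No boundary check: calling it at the start of a buffer reads out of range.
std::uint32_t decode_prior(const char*& it) {
    std::uint32_t cp = 0;
    int shift = 0;
    unsigned char octet = static_cast<unsigned char>(*--it);

    // Continuation octets look like 10xxxxxx; gather their six payload bits.
    while ((octet & 0xc0) == 0x80) {
        cp |= static_cast<std::uint32_t>(octet & 0x3f) << shift;
        shift += 6;
        octet = static_cast<unsigned char>(*--it);
    }

    if (shift == 0)
        return octet;                      // single octet (ASCII)

    // `octet` is now the lead; its payload width depends on how many
    // continuation octets were seen (shift / 6).
    const unsigned char lead_mask = static_cast<unsigned char>(0x3f >> (shift / 6));
    return cp | (static_cast<std::uint32_t>(octet & lead_mask) << shift);
}
```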
1130
utf8::unchecked::previous (deprecated, see utf8::unchecked::prior)
1133
Deprecated in version 1.02 and later.
1136
Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it
1137
decreases the iterator until it hits the beginning of the previous UTF-8 encoded
1138
code point and returns the 32 bit representation of the code point.
1141
<span class="keyword">template</span> <<span class=
1142
"keyword">typename</span> octet_iterator>
1143
uint32_t previous(octet_iterator& it);
1147
<code>it</code>: a reference to an iterator pointing to an octet within a UTF-8 encoded string.
1148
After the function returns, it is decremented to point to the beginning of the
1149
previous code point.<br>
1150
<span class="return_value">Return value</span>: the 32 bit representation of the
1151
previous code point.
1157
<span class="keyword">char</span>* twochars = <span class=
1158
"literal">"\xe6\x97\xa5\xd1\x88"</span>;
1159
<span class="keyword">char</span>* w = twochars + <span class="literal">3</span>;
1160
<span class="keyword">int</span> cp = unchecked::previous (w);
1161
assert (cp == <span class="literal">0x65e5</span>);
1162
assert (w == twochars);
1165
This function is deprecated only for consistency with the "checked"
1166
versions, where <code>prior</code> should be used instead of <code>previous</code>.
1167
In fact, <code>unchecked::previous</code> behaves exactly the same as <code>
1168
unchecked::prior</code>.
1171
This is a faster but less safe version of <code>utf8::previous</code>. It does not
1172
check for validity of the supplied UTF-8 sequence and offers no boundary checking.
1175
utf8::unchecked::advance
1178
Available in version 1.0 and later.
1181
Advances an iterator by the specified number of code points within a UTF-8 sequence.
1185
<span class="keyword">template</span> <<span class=
1186
"keyword">typename</span> octet_iterator, typename distance_type>
1187
<span class="keyword">void</span> advance (octet_iterator& it, distance_type n);
1191
<code>it</code>: a reference to an iterator pointing to the beginning of a UTF-8
1192
encoded code point. After the function returns, it is incremented to point to the
1193
nth following code point.<br>
1194
<code>n</code>: a positive integer that shows how many code points we want to advance.<br>
1201
<span class="keyword">char</span>* twochars = <span class=
1202
"literal">"\xe6\x97\xa5\xd1\x88"</span>;
1203
<span class="keyword">char</span>* w = twochars;
1204
unchecked::advance (w, <span class="literal">2</span>);
1205
assert (w == twochars + <span class="literal">5</span>);
1208
This function works only "forward". In case of a negative <code>n</code>, there is
1212
This is a faster but less safe version of <code>utf8::advance</code>. It does not
1213
check for validity of the supplied UTF-8 sequence and offers no boundary checking.
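Advancing by code points does not require decoding them: the lead octet alone tells how many octets to skip. The sketch below illustrates this under the assumption of valid input; <code>sequence_length</code> and <code>advance_code_points</code> are hypothetical names, not library functions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: length in octets of the sequence that starts with `lead`.
// Assumes `lead` really is a valid lead octet.
int sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx
    if ((lead >> 5) == 0x06) return 2;    // 110xxxxx
    if ((lead >> 4) == 0x0e) return 3;    // 1110xxxx
    return 4;                             // 11110xxx
}

// Move `it` forward by `n` code points. No end-of-range check is made,
// so the caller must know that `n` code points are actually available.
void advance_code_points(const char*& it, std::size_t n) {
    for (; n > 0; --n)
        it += sequence_length(static_cast<unsigned char>(*it));
}
```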
1216
utf8::unchecked::distance
1219
Available in version 1.0 and later.
1222
Given the iterators to two UTF-8 encoded code points in a sequence, returns the
1223
number of code points between them.
1226
<span class="keyword">template</span> <<span class="keyword">typename</span> octet_iterator>
<span class="keyword">typename</span> std::iterator_traits<octet_iterator>::difference_type distance (octet_iterator first, octet_iterator last);
1232
<code>first</code>: an iterator to a beginning of a UTF-8 encoded code point.<br>
1233
<code>last</code>: an iterator to a "post-end" of the last UTF-8 encoded code
1234
point in the sequence whose length we are trying to determine. It can be the
1235
beginning of a new code point, or not.<br>
1236
<span class="return_value">Return value</span>: the distance between the iterators,
1243
<span class="keyword">char</span>* twochars = <span class=
1244
"literal">"\xe6\x97\xa5\xd1\x88"</span>;
1245
size_t dist = utf8::unchecked::distance(twochars, twochars + <span class=
1246
"literal">5</span>);
1247
assert (dist == <span class="literal">2</span>);
1250
This is a faster but less safe version of <code>utf8::distance</code>. It does not
1251
check for validity of the supplied UTF-8 sequence.
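For valid UTF-8, the number of code points in a range equals the number of octets that are not continuation octets. The sketch below uses that observation; it is a different, well-known counting technique shown for illustration, not a claim about how the library computes the distance, and <code>count_code_points</code> is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: count code points in [first, last) by counting the
// octets that start a sequence (everything except 10xxxxxx continuation octets).
// Only meaningful for valid UTF-8 that begins on a code point boundary.
std::ptrdiff_t count_code_points(const char* first, const char* last) {
    std::ptrdiff_t n = 0;
    for (; first != last; ++first)
        if ((static_cast<unsigned char>(*first) & 0xc0) != 0x80)
            ++n;
    return n;
}
```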
1254
utf8::unchecked::utf16to8
1257
Available in version 1.0 and later.
1260
Converts a UTF-16 encoded string to UTF-8.
1263
<span class="keyword">template</span> <<span class=
1264
"keyword">typename</span> u16bit_iterator, <span class=
1265
"keyword">typename</span> octet_iterator>
1266
octet_iterator utf16to8 (u16bit_iterator start, u16bit_iterator end, octet_iterator result);
1270
<code>start</code>: an iterator pointing to the beginning of the UTF-16 encoded
1271
string to convert.<br>
1272
<code>end</code>: an iterator pointing to past-the-end of the UTF-16 encoded
1273
string to convert.<br>
1274
<code>result</code>: an output iterator to the place in the UTF-8 string where to
1275
append the result of conversion.<br>
1276
<span class="return_value">Return value</span>: An iterator pointing to the place
1277
after the appended UTF-8 string.
1283
<span class="keyword">unsigned short</span> utf16string[] = {<span class=
1284
"literal">0x41</span>, <span class="literal">0x0448</span>, <span class=
1285
"literal">0x65e5</span>, <span class="literal">0xd834</span>, <span class=
1286
"literal">0xdd1e</span>};
1287
vector<<span class="keyword">unsigned char</span>> utf8result;
1288
unchecked::utf16to8(utf16string, utf16string + <span class=
1289
"literal">5</span>, back_inserter(utf8result));
1290
assert (utf8result.size() == <span class="literal">10</span>);
1293
This is a faster but less safe version of <code>utf8::utf16to8</code>. It does not
1294
check for validity of the supplied UTF-16 sequence.
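The example above turns five UTF-16 code units into ten octets because the last two units form a surrogate pair that encodes a single code point. The following sketch shows the arithmetic; all function names are hypothetical, and a lead surrogate is simply assumed to be followed by a trail surrogate, which is exactly the kind of check an unchecked conversion skips.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Combine a lead (0xd800..0xdbff) and a trail (0xdc00..0xdfff) surrogate
// into the code point they represent.
std::uint32_t combine_surrogates(std::uint16_t lead, std::uint16_t trail) {
    return 0x10000u
         + ((static_cast<std::uint32_t>(lead) - 0xd800u) << 10)
         + (static_cast<std::uint32_t>(trail) - 0xdc00u);
}

// Number of octets needed to encode `cp` in UTF-8.
std::size_t utf8_length(std::uint32_t cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Total UTF-8 length of a UTF-16 range. Unpaired surrogates are NOT detected.
std::size_t utf8_length_of_utf16(const std::uint16_t* first, const std::uint16_t* last) {
    std::size_t total = 0;
    while (first != last) {
        std::uint32_t cp = *first++;
        if (cp >= 0xd800 && cp <= 0xdbff)  // lead surrogate: consume the trail too
            cp = combine_surrogates(static_cast<std::uint16_t>(cp), *first++);
        total += utf8_length(cp);
    }
    return total;
}
```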
1297
utf8::unchecked::utf8to16
1300
Available in version 1.0 and later.
1303
Converts a UTF-8 encoded string to UTF-16.
1306
<span class="keyword">template</span> <<span class=
1307
"keyword">typename</span> u16bit_iterator, typename octet_iterator>
1308
u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, u16bit_iterator result);
1312
<code>start</code>: an iterator pointing to the beginning of the UTF-8 encoded
1313
string to convert.<br> <code>end</code>: an iterator pointing to
1314
past-the-end of the UTF-8 encoded string to convert.<br>
1315
<code>result</code>: an output iterator to the place in the UTF-16 string where to
1316
append the result of conversion.<br>
1317
<span class="return_value">Return value</span>: An iterator pointing to the place
1318
after the appended UTF-16 string.
1324
<span class="keyword">char</span> utf8_with_surrogates[] = <span class=
1325
"literal">"\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e"</span>;
1326
vector <<span class="keyword">unsigned short</span>> utf16result;
1327
unchecked::utf8to16(utf8_with_surrogates, utf8_with_surrogates + <span class=
1328
"literal">9</span>, back_inserter(utf16result));
1329
assert (utf16result.size() == <span class="literal">4</span>);
1330
assert (utf16result[<span class="literal">2</span>] == <span class=
1331
"literal">0xd834</span>);
1332
assert (utf16result[<span class="literal">3</span>] == <span class=
1333
"literal">0xdd1e</span>);
1336
This is a faster but less safe version of <code>utf8::utf8to16</code>. It does not
1337
check for validity of the supplied UTF-8 sequence.
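Going in the other direction, a code point above U+FFFF has to be split into two UTF-16 code units. The sketch below shows that split for an already decoded code point; <code>append_utf16</code> is a hypothetical name and the value is assumed to be a valid code point.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: append `cp` to `out` as one or two UTF-16 code units.
// No validity check: `cp` is assumed to be at most 0x10ffff and not a surrogate.
void append_utf16(std::uint32_t cp, std::vector<std::uint16_t>& out) {
    if (cp > 0xffff) {
        const std::uint32_t v = cp - 0x10000u;   // 20 bit value to split
        out.push_back(static_cast<std::uint16_t>(0xd800u + (v >> 10)));     // lead
        out.push_back(static_cast<std::uint16_t>(0xdc00u + (v & 0x3ffu)));  // trail
    } else {
        out.push_back(static_cast<std::uint16_t>(cp));
    }
}
```

Applied to U+1D11E, the code point encoded by the last four octets of the example string, this yields the units <code>0xd834</code> and <code>0xdd1e</code> that the assertions above check for.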
1340
utf8::unchecked::utf32to8
1343
Available in version 1.0 and later.
1346
Converts a UTF-32 encoded string to UTF-8.
1349
<span class="keyword">template</span> <<span class=
1350
"keyword">typename</span> octet_iterator, <span class=
1351
"keyword">typename</span> u32bit_iterator>
1352
octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, octet_iterator result);
1356
<code>start</code>: an iterator pointing to the beginning of the UTF-32 encoded
1357
string to convert.<br>
1358
<code>end</code>: an iterator pointing to past-the-end of the UTF-32 encoded
1359
string to convert.<br>
1360
<code>result</code>: an output iterator to the place in the UTF-8 string where to
1361
append the result of conversion.<br>
1362
<span class="return_value">Return value</span>: An iterator pointing to the place
1363
after the appended UTF-8 string.
1369
<span class="keyword">int</span> utf32string[] = {<span class=
1370
"literal">0x448</span>, <span class="literal">0x65e5</span>, <span class=
1371
"literal">0x10346</span>, <span class="literal">0</span>};
1372
vector<<span class="keyword">unsigned char</span>> utf8result;
1373
unchecked::utf32to8(utf32string, utf32string + <span class=
1374
"literal">3</span>, back_inserter(utf8result));
1375
assert (utf8result.size() == <span class="literal">9</span>);
1378
This is a faster but less safe version of <code>utf8::utf32to8</code>. It does not
1379
check for validity of the supplied UTF-32 sequence.
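To illustrate both the encoding step and what the missing validity check means in practice, here is a minimal sketch of encoding a single code point. <code>append_utf8</code> is a hypothetical name, not the library's function; like an unchecked conversion, it encodes whatever value it is given.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: append the UTF-8 encoding of `cp` to `out`.
// No validity check: surrogates and out-of-range values are encoded blindly.
void append_utf8(std::uint32_t cp, std::vector<unsigned char>& out) {
    if (cp < 0x80) {
        out.push_back(static_cast<unsigned char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<unsigned char>((cp >> 6) | 0xc0));
        out.push_back(static_cast<unsigned char>((cp & 0x3f) | 0x80));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<unsigned char>((cp >> 12) | 0xe0));
        out.push_back(static_cast<unsigned char>(((cp >> 6) & 0x3f) | 0x80));
        out.push_back(static_cast<unsigned char>((cp & 0x3f) | 0x80));
    } else {
        out.push_back(static_cast<unsigned char>((cp >> 18) | 0xf0));
        out.push_back(static_cast<unsigned char>(((cp >> 12) & 0x3f) | 0x80));
        out.push_back(static_cast<unsigned char>(((cp >> 6) & 0x3f) | 0x80));
        out.push_back(static_cast<unsigned char>((cp & 0x3f) | 0x80));
    }
}
```

Passing a surrogate such as <code>0xd800</code> to this sketch produces the octets <code>ed a0 80</code>, which is not valid UTF-8. A checked conversion would reject that input instead.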
1382
utf8::unchecked::utf8to32
1385
Available in version 1.0 and later.
1388
Converts a UTF-8 encoded string to UTF-32.
1391
<span class="keyword">template</span> <<span class=
1392
"keyword">typename</span> octet_iterator, typename u32bit_iterator>
1393
u32bit_iterator utf8to32 (octet_iterator start, octet_iterator end, u32bit_iterator result);
1397
<code>start</code>: an iterator pointing to the beginning of the UTF-8 encoded
1398
string to convert.<br>
1399
<code>end</code>: an iterator pointing to past-the-end of the UTF-8 encoded string to convert.<br>
1401
<code>result</code>: an output iterator to the place in the UTF-32 string where to
1402
append the result of conversion.<br>
1403
<span class="return_value">Return value</span>: An iterator pointing to the place
1404
after the appended UTF-32 string.
1410
<span class="keyword">char</span>* twochars = <span class=
1411
"literal">"\xe6\x97\xa5\xd1\x88"</span>;
1412
vector<<span class="keyword">int</span>> utf32result;
1413
unchecked::utf8to32(twochars, twochars + <span class=
1414
"literal">5</span>, back_inserter(utf32result));
1415
assert (utf32result.size() == <span class="literal">2</span>);
1418
This is a faster but less safe version of <code>utf8::utf8to32</code>. It does not
1419
check for validity of the supplied UTF-8 sequence.
1421
<h3 id="typesunchecked">
1422
Types From utf8::unchecked Namespace
1428
Available in version 2.0 and later.
1431
Adapts the underlying octet iterator to iterate over the sequence of code points,
1432
rather than raw octets.
1435
<span class="keyword">template</span> <<span class="keyword">typename</span> octet_iterator>
1436
<span class="keyword">class</span> iterator;
1439
<h5>Member functions</h5>
1441
<dt><code>iterator();</code> <dd> the default constructor; the underlying <code>octet_iterator</code> is
1442
constructed with its default constructor.
1443
<dt><code><span class="keyword">explicit</span> iterator (const octet_iterator& octet_it);
1444
</code> <dd> a constructor
1445
that initializes the underlying <code>octet_iterator</code> with <code>octet_it</code>.
1446
<dt><code>octet_iterator base () <span class="keyword">const</span>;</code> <dd> returns the
1447
underlying <code>octet_iterator</code>.
1448
<dt><code>uint32_t operator * () <span class="keyword">const</span>;</code> <dd> decodes the UTF-8 sequence
1449
the underlying <code>octet_iterator</code> is pointing to and returns the code point.
1450
<dt><code><span class="keyword">bool operator</span> == (const iterator& rhs)
1451
<span class="keyword">const</span>;</code> <dd> returns <span class="keyword">true</span>
1452
if the two underlying iterators are equal.
1453
<dt><code><span class="keyword">bool operator</span> != (const iterator& rhs)
1454
<span class="keyword">const</span>;</code> <dd> returns <span class="keyword">true</span>
1455
if the two underlying iterators are not equal.
1456
<dt><code>iterator& <span class="keyword">operator</span> ++ (); </code> <dd> the prefix increment - moves
1457
the iterator to the next UTF-8 encoded code point.
1458
<dt><code>iterator <span class="keyword">operator</span> ++ (<span class="keyword">int</span>); </code> <dd>
1459
the postfix increment - moves the iterator to the next UTF-8 encoded code point and returns the current one.
1460
<dt><code>iterator& <span class="keyword">operator</span> -- (); </code> <dd> the prefix decrement - moves
1461
the iterator to the previous UTF-8 encoded code point.
1462
<dt><code>iterator <span class="keyword">operator</span> -- (<span class="keyword">int</span>); </code> <dd>
1463
the postfix decrement - moves the iterator to the previous UTF-8 encoded code point and returns the current one.
1469
<span class="keyword">char</span>* threechars = <span class="literal">"\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88"</span>;
1470
utf8::unchecked::iterator<<span class="keyword">char</span>*> un_it(threechars);
1471
utf8::unchecked::iterator<<span class="keyword">char</span>*> un_it2 = un_it;
1472
assert (un_it2 == un_it);
1473
assert (*un_it == <span class="literal">0x10346</span>);
1474
assert (*(++un_it) == <span class="literal">0x65e5</span>);
1475
assert ((*un_it++) == <span class="literal">0x65e5</span>);
1476
assert (*un_it == <span class="literal">0x0448</span>);
1477
assert (un_it != un_it2);
1478
utf8::unchecked::iterator<<span class="keyword">char</span>*> un_endit (threechars + <span class="literal">9</span>);
1479
assert (++un_it == un_endit);
1480
assert (*(--un_it) == <span class="literal">0x0448</span>);
1481
assert ((*un_it--) == <span class="literal">0x0448</span>);
1482
assert (*un_it == <span class="literal">0x65e5</span>);
1483
assert (--un_it == utf8::unchecked::iterator<<span class="keyword">char</span>*>(threechars));
1484
assert (*un_it == <span class="literal">0x10346</span>);
1487
This is an unchecked version of <code>utf8::iterator</code>. It is faster in many cases, but offers
1488
no validity or range checks.
1494
Design goals and decisions
1497
The library was designed to be:
1501
Generic: for better or worse, there are many C++ string classes out there, and
1502
the library should work with as many of them as possible.
1505
Portable: the library should be portable both across different platforms and
1506
compilers. The only non-portable code is a small section that declares unsigned
1507
integers of different sizes: three typedefs. They can be changed by the users of
1508
the library if they don't match their platform. The default setting should work
1509
for Windows (both 32 and 64 bit), and most 32 bit and 64 bit Unix derivatives.
1512
Lightweight: follow the "pay only for what you use" guideline.
1515
Unintrusive: avoid forcing any particular design or even programming style on the
1516
user. This is a library, not a framework.
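As an illustration of the portability point, the three typedefs could look roughly like the sketch below, with compile-time checks that flag a platform where the defaults do not fit. The typedef names and the <code>sketch</code> namespace are assumptions made for this example, and <code>static_assert</code> requires a C++11 compiler.

```cpp
#include <cassert>
#include <climits>

namespace sketch {
    // Assumed names and default choices; change the right-hand sides if they
    // do not have the required widths on a given platform.
    typedef unsigned char  uint8_t;
    typedef unsigned short uint16_t;
    typedef unsigned int   uint32_t;

    // Fail the build instead of silently mis-decoding on an unusual platform.
    static_assert(sizeof(uint8_t)  * CHAR_BIT == 8,  "uint8_t must be 8 bits wide");
    static_assert(sizeof(uint16_t) * CHAR_BIT == 16, "uint16_t must be 16 bits wide");
    static_assert(sizeof(uint32_t) * CHAR_BIT == 32, "uint32_t must be 32 bits wide");
}
```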
1523
In case you want to look into other means of working with UTF-8 strings from C++,
1524
here is the list of solutions I am aware of:
1528
<a href="http://icu.sourceforge.net/">ICU Library</a>. It is very powerful,
1529
complete, feature-rich, mature, and widely used. Also big, intrusive,
1530
non-generic, and doesn't play well with the Standard Library. I definitely
1531
recommend looking at ICU even if you don't plan to use it.
<a href="http://www.gtkmm.org/gtkmm2/docs/tutorial/html/ch03s04.html">Glib::ustring</a>.
1536
A class specifically made to work with UTF-8 strings, and also feel like
1537
<code>std::string</code>. If you prefer to have yet another string class in your
1538
code, it may be worth a look. Be aware of the licensing issues, though.
1541
Platform dependent solutions: Windows and POSIX have functions to convert strings
1542
from one encoding to another. That is only a subset of what my library offers,
1543
but if that is all you need it may be good enough, especially given the fact that
1544
these functions are mature and tested in production.
1547
<h2 id="conclusion">
1551
Until Unicode becomes officially recognized by the C++ Standard Library, we need to
1552
use other means to work with UTF-8 strings. The template functions I describe in this
1553
article may be a good step in this direction.
1560
<a href="http://www.unicode.org/">The Unicode Consortium</a>.
1563
<a href="http://icu.sourceforge.net/">ICU Library</a>.
1566
<a href="http://en.wikipedia.org/wiki/UTF-8">UTF-8 at Wikipedia</a>
1569
<a href="http://www.cl.cam.ac.uk/~mgk25/unicode.html">UTF-8 and Unicode FAQ for