~ubuntu-branches/ubuntu/wily/ledger/wily

Viewing changes to lib/utfcpp/doc/utf8cpp.html

Committer: Package Import Robot
Author(s): David Bremner
Date: 2014-10-08 19:20:38 UTC
mfrom: (1.1.3) (9.1.7 sid)
Revision ID: package-import@ubuntu.com-20141008192038-py5cxm93rdt3x2uz

Tags: 3.1+dfsg1-1

New upstream release

files added:
.pc/0001-debcherry-fixup-patch.patch

.pc/0001-debcherry-fixup-patch.patch/doc

.pc/0001-debcherry-fixup-patch.patch/doc/CMakeLists.txt

.pc/0001-debcherry-fixup-patch.patch/src

.pc/0001-debcherry-fixup-patch.patch/src/CMakeLists.txt

.pc/0002-Fix-texinfo-syntax-errors.patch

.pc/0002-Fix-texinfo-syntax-errors.patch/doc

.pc/0002-Fix-texinfo-syntax-errors.patch/doc/ledger3.texi

contrib/iso4127-commodities

contrib/iso4127-commodities/iso4217ledger.sh

contrib/iso4127-commodities/iso4217ledger.xsl

debian/patches/0001-debcherry-fixup-patch.patch

debian/patches/0002-Fix-texinfo-syntax-errors.patch

default.nix

doc/DEVELOP.md

doc/GLOSSARY.md

lisp/ledger-fontify.el

lisp/ledger-navigate.el

test/baseline/cmd-equity.test

test/baseline/cmd-tags.test

test/baseline/feat-balance_assert-off.test

test/baseline/feat-balance_assert_split.test

test/baseline/feat-convert-with-directives.dat

test/baseline/feat-convert-with-directives.test

test/baseline/opt-no-aliases.test

test/baseline/opt-recursive-aliases.test

test/baseline/opt-time-colon.test

test/regress/04D86CD0.test

test/regress/1036.test

test/regress/1046.test

test/regress/1050.test

test/regress/1072.test

test/regress/1074.test

test/regress/375.test

test/regress/383.test

test/regress/494-a.ledger

test/regress/494-b.ledger

test/regress/634AA589.test

test/regress/712-a.test

test/regress/712-b.test

test/regress/713-a.test

test/regress/713-b.test

test/regress/755.test

test/regress/785.test

test/regress/999-a.test

test/regress/999-b.test

test/regress/AA2FF2B.test

test/regress/DE17CCF1.test

test/regress/error-in-include.dat

test/regress/error-in-include.test

files removed:
.pc/0001-enable-tidy-mode-for-texi2pdf.patch

.pc/0001-enable-tidy-mode-for-texi2pdf.patch/doc

.pc/0001-enable-tidy-mode-for-texi2pdf.patch/doc/CMakeLists.txt

.pc/0002-replace-sha1.cc-with-boost-uuid-details-sha1.patch

.pc/0002-replace-sha1.cc-with-boost-uuid-details-sha1.patch/src

.pc/0002-replace-sha1.cc-with-boost-uuid-details-sha1.patch/src/CMakeLists.txt

.pc/0002-replace-sha1.cc-with-boost-uuid-details-sha1.patch/src/filters.cc

.pc/0002-replace-sha1.cc-with-boost-uuid-details-sha1.patch/src/system.hh.in

.pc/0002-replace-sha1.cc-with-boost-uuid-details-sha1.patch/src/utils.h

debian/clean

debian/patches/0001-enable-tidy-mode-for-texi2pdf.patch

debian/patches/0002-replace-sha1.cc-with-boost-uuid-details-sha1.patch

doc/INSTALL

doc/LICENSE-sha1

doc/README

lib/utfcpp/doc

lib/utfcpp/doc/ReleaseNotes

lib/utfcpp/doc/utf8cpp.html

lib/utfcpp/source

lib/utfcpp/source/utf8

lib/utfcpp/source/utf8.h

lib/utfcpp/source/utf8/checked.h

lib/utfcpp/source/utf8/core.h

lib/utfcpp/source/utf8/unchecked.h

test/baseline/dir-alias-recursive.test

test/baseline/feat-convert-with-diretives.dat

test/baseline/feat-convert-with-diretives.test

test/regress/712.test

files modified:
.gitignore

.pc/applied-patches

.travis.yml

CMakeLists.txt

README-1ST

README.md

acprep

contrib/ledger-completion.bash

debian/changelog

debian/control

debian/copyright

debian/docs

debian/get-orig-source.sh

debian/patches/series

debian/rules

doc/CMakeLists.txt

doc/Doxyfile.in

doc/NEWS

doc/ledger-mode.texi

doc/ledger.1

doc/ledger3.texi

lisp/CMakeLists.txt

lisp/ledger-commodities.el

lisp/ledger-complete.el

lisp/ledger-context.el

lisp/ledger-exec.el

lisp/ledger-fonts.el

lisp/ledger-init.el

lisp/ledger-mode.el

lisp/ledger-occur.el

lisp/ledger-post.el

lisp/ledger-reconcile.el

lisp/ledger-regex.el

lisp/ledger-report.el

lisp/ledger-schedule.el

lisp/ledger-sort.el

lisp/ledger-state.el

lisp/ledger-test.el

lisp/ledger-texi.el

lisp/ledger-xact.el

src/CMakeLists.txt

src/account.cc

src/account.h

src/amount.cc

src/annotate.cc

src/context.h

src/convert.cc

src/csv.cc

src/emacs.cc

src/emacs.h

src/error.cc

src/format.cc

src/global.cc

src/item.h

src/iterators.cc

src/journal.cc

src/journal.h

src/option.h

src/output.cc

src/pool.cc

src/post.h

src/print.cc

src/ptree.cc

src/py_times.cc

src/report.cc

src/report.h

src/select.cc

src/session.cc

src/stream.cc

src/system.hh.in

src/textual.cc

src/times.cc

src/times.h

src/token.cc

src/utils.cc

src/utils.h

src/xact.cc

test/CMakeLists.txt

test/baseline/cmd-select.test

test/baseline/cmd-source.test

test/baseline/dir-account.test

test/baseline/dir-alias-fail.test

test/baseline/dir-alias.test

test/baseline/dir-commodity.test

test/baseline/dir-payee.test

test/baseline/opt-base.test

test/baseline/opt-date.test

test/baseline/opt-datetime-format.test

test/baseline/opt-dc.test

test/baseline/opt-equity.test

test/baseline/opt-gain.test

test/baseline/opt-head.test

test/baseline/opt-historical.test

test/baseline/opt-immediate.test

test/baseline/opt-lot-dates.test

test/baseline/opt-lot-prices.test

test/baseline/opt-lots-actual.test

test/baseline/opt-lots.test

test/baseline/opt-lots_basis.test

test/baseline/opt-lots_basis_base.test

test/baseline/opt-pedantic.test

test/baseline/opt-pivot.test

test/baseline/opt-price.test

test/baseline/opt-primary-date.test

test/baseline/opt-register-format.test

test/baseline/opt-rich-data.test

test/baseline/opt-tail.test

test/baseline/opt-time-report.test

test/regress/15A80F68.test

test/regress/1A546C4D.test

test/regress/2E3496BD.test

test/regress/5D92A5EB.test

test/regress/6188B0EC.test

test/regress/78AB4B87_py.test

test/regress/8EAF77C0.test

test/regress/9188F587_py.test

test/regress/BF3C1F82-2.test

test/regress/BF3C1F82.test

test/regress/C0212EAC.test

test/regress/C19E4E9B.test

test/regress/CAE63F5C-b.test

test/regress/CAE63F5C-c.test

test/regress/D51BFF74.test

test/regress/D9C8EB08.test

test/regress/DDB54BB8.test

test/unit/CMakeLists.txt

test/unit/t_balance.cc

tools/build.sh

tools/gendocs.sh

tools/spellcheck.sh

lib/utfcpp/doc/utf8cpp.html

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

<meta name="generator" content=
"HTML Tidy for Linux/x86 (vers 1st November 2002), see www.w3.org">

<meta name="description" content=
"A simple, portable and lightweight C++ library for easy handling of UTF-8 encoded strings">

<title>

UTF8-CPP: UTF-8 with C++ in a Portable Way

</title>

<style type="text/css">

<!--

span.return_value {

color: brown;

}

span.keyword {

color: blue;

}

span.preprocessor {

color: navy;

}

span.literal {

color: olive;

}

span.comment {

color: green;

}

code {

font-weight: bold;

}

ul.toc {

list-style-type: none;

}

p.version {

font-size: small;

font-style: italic;

}

-->

</style>

</head>

<body>

<h1>

UTF8-CPP: UTF-8 with C++ in a Portable Way

</h1>

<a href="https://sourceforge.net/projects/utfcpp">The Sourceforge project page</a>

<h2>

Table of Contents

</h2>

<ul class="toc">
<li>
<a href="#introduction">Introduction</a>
</li>
<li>
<a href="#examples">Examples of Use</a>
</li>
<li>
<a href="#reference">Reference</a>
<ul class="toc">
<li>
<a href="#funutf8">Functions From utf8 Namespace</a>
</li>
<li>
<a href="#typesutf8">Types From utf8 Namespace</a>
</li>
<li>
<a href="#fununchecked">Functions From utf8::unchecked Namespace</a>
</li>
<li>
<a href="#typesunchecked">Types From utf8::unchecked Namespace</a>
</li>
</ul>
</li>
<li>
<a href="#points">Points of Interest</a>
</li>
<li>
<a href="#conclusion">Conclusion</a>
</li>
<li>
<a href="#links">Links</a>
</li>
</ul>

</div>

<h2 id="introduction">
Introduction
</h2>

Many C++ developers miss an easy and portable way of handling Unicode encoded
strings. The C++ Standard is currently Unicode agnostic, and while some work is being
done to introduce Unicode in its next revision, called C++0x, for the moment
nothing of the sort is available. In the meantime, developers use third-party
libraries like ICU, OS-specific capabilities, or simply roll their own
solutions.

In order to easily handle UTF-8 encoded Unicode strings, I have come up with a small
generic library. For anybody used to working with STL algorithms and iterators, it should be
easy and natural to use. The code is freely available for any purpose - check out
the license at the beginning of the utf8.h file. If you run into
bugs or performance issues, please let me know and I'll do my best to address them.

The purpose of this article is not to offer an introduction to Unicode in general,
and UTF-8 in particular. If you are not familiar with Unicode, be sure to check out the
<a href="http://www.unicode.org/">Unicode Home Page</a> or some other source of
information on Unicode. Also, it is not my aim to advocate the use of UTF-8
encoded strings in C++ programs; if you want to handle UTF-8 encoded strings from
C++, I am sure you have good reasons for it.

<h2 id="examples">
Examples of use
</h2>

To illustrate the use of this utf8 library, we shall open a file containing UTF-8
encoded text, check whether it starts with a byte order mark, read each line into a
<code>std::string</code>, check it for validity, convert the text to UTF-16, and
back to UTF-8:

<pre>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
#include "utf8.h"
using namespace std;

int main(int argc, char** argv)
{
    if (argc != 2) {
        cout << "\nUsage: docsample filename\n";
        return 0;
    }
    const char* test_file_path = argv[1];
    // Open the test file (must be UTF-8 encoded)
    ifstream fs8(test_file_path);
    if (!fs8.is_open()) {
        cout << "Could not open " << test_file_path << endl;
        return 0;
    }
    // Read the first line of the file
    unsigned line_count = 1;
    string line;
    if (!getline(fs8, line))
        return 0;
    // Look for utf-8 byte-order mark at the beginning
    if (line.size() > 2) {
        if (utf8::is_bom(line.c_str()))
            cout << "There is a byte order mark at the beginning of the file\n";
    }
    // Play with all the lines in the file
    do {
        // check for invalid utf-8 (for a simple yes/no check, there is also utf8::is_valid function)
        string::iterator end_it = utf8::find_invalid(line.begin(), line.end());
        if (end_it != line.end()) {
            cout << "Invalid UTF-8 encoding detected at line " << line_count << "\n";
            cout << "This part is fine: " << string(line.begin(), end_it) << "\n";
        }
        // Get the line length (at least for the valid part)
        int length = utf8::distance(line.begin(), end_it);
        cout << "Length of line " << line_count << " is " << length << "\n";
        // Convert it to utf-16
        vector<unsigned short> utf16line;
        utf8::utf8to16(line.begin(), end_it, back_inserter(utf16line));
        // And back to utf-8
        string utf8line;
        utf8::utf16to8(utf16line.begin(), utf16line.end(), back_inserter(utf8line));
        // Confirm that the conversion went OK:
        if (utf8line != string(line.begin(), end_it))
            cout << "Error in UTF-16 conversion at line: " << line_count << "\n";
        line_count++;
        // Looping on getline also processes a last line that has no trailing newline
    } while (getline(fs8, line));
    return 0;
}
</pre>

In the previous code sample, we have seen the use of the following functions from
the <code>utf8</code> namespace: first we used the <code>is_bom</code> function to detect
a UTF-8 byte order mark at the beginning of the file; then for each line we
detected invalid UTF-8 sequences with <code>find_invalid</code>; the number
of characters (more precisely, the number of Unicode code points) in each line was
determined with <code>utf8::distance</code>; finally, we converted
each line to UTF-16 encoding with <code>utf8to16</code> and back to UTF-8 with
<code>utf16to8</code>.
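
To make the byte order mark check less abstract, here is a minimal sketch of what such a test amounts to: the UTF-8 byte order mark is the three octets EF BB BF. The helper name <code>starts_with_utf8_bom_sketch</code> is hypothetical and is not part of the library; it only illustrates the idea and is not how <code>is_bom</code> is necessarily implemented.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of utf8.h: reports whether `s` begins with
// the three octets EF BB BF, which is the UTF-8 encoding of U+FEFF.
inline bool starts_with_utf8_bom_sketch(const std::string& s) {
    return s.size() >= 3 &&
           static_cast<unsigned char>(s[0]) == 0xEF &&
           static_cast<unsigned char>(s[1]) == 0xBB &&
           static_cast<unsigned char>(s[2]) == 0xBF;
}

// Usage: one line that starts with the mark and one that does not.
inline void bom_demo() {
    assert(starts_with_utf8_bom_sketch("\xEF\xBB\xBF" "hello"));
    assert(!starts_with_utf8_bom_sketch("hello"));
}
```

The size test comes first so that lines shorter than three octets are never indexed out of range, which mirrors the <code>line.size() > 2</code> guard in the sample program.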

<h2 id="reference">
Reference
</h2>

<h3 id="funutf8">
Functions From utf8 Namespace
</h3>

<h4>
utf8::append
</h4>

Available in version 1.0 and later.

Encodes a 32-bit code point as a UTF-8 sequence of octets and appends the sequence
to a UTF-8 string.

<pre>
template <typename octet_iterator>
octet_iterator append(uint32_t cp, octet_iterator result);
</pre>

<code>cp</code>: a 32-bit integer representing the code point to append to the
sequence.
<code>result</code>: an output iterator to the place in the sequence where the
code point is to be appended.
Return value: an iterator pointing to the place
after the newly appended sequence.

Example of use:

<pre>
unsigned char u[5] = {0, 0, 0, 0, 0};
unsigned char* end = append(0x0448, u);
assert (u[0] == 0xd1 && u[1] == 0x88 && u[2] == 0 && u[3] == 0 && u[4] == 0);
</pre>

Note that <code>append</code> does not allocate any memory - it is the responsibility of
the caller to make sure there is enough memory allocated for the operation. To make
things more interesting, <code>append</code> can add anywhere between 1 and 4
octets to the sequence. In practice, you would most often want to use
<code>std::back_inserter</code> to ensure that the necessary memory is allocated.

In case of an invalid code point, a <code>utf8::invalid_code_point</code> exception
is thrown.
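
To show why an append can produce anywhere between 1 and 4 octets, the following sketch encodes a code point by hand. It is an illustration under stated assumptions, not the library's implementation: the names <code>encode_utf8_sketch</code> and <code>encode_to_string</code> are hypothetical, and the sketch does not validate the code point the way <code>append</code> does.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <string>

// Hypothetical helper, not part of utf8.h. Writes the UTF-8 encoding of `cp`
// through `out` and returns the advanced iterator. The caller is assumed to
// pass a valid code point; no validation is done here.
template <typename OutputIt>
OutputIt encode_utf8_sketch(std::uint32_t cp, OutputIt out) {
    if (cp < 0x80) {                        // 1 octet:  0xxxxxxx
        *out++ = static_cast<char>(cp);
    } else if (cp < 0x800) {                // 2 octets: 110xxxxx 10xxxxxx
        *out++ = static_cast<char>(0xC0 | (cp >> 6));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {              // 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<char>(0xE0 | (cp >> 12));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                // 4 octets: 11110xxx and three trail octets
        *out++ = static_cast<char>(0xF0 | (cp >> 18));
        *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Convenience wrapper: std::back_inserter grows the string as needed.
inline std::string encode_to_string(std::uint32_t cp) {
    std::string s;
    encode_utf8_sketch(cp, std::back_inserter(s));
    return s;
}

// Usage: the same code point and octets as in the append example.
inline void encode_demo() {
    assert(encode_to_string(0x0448) == "\xd1\x88");
}
```

Writing through an output iterator keeps the encoder independent of the destination, which is why a <code>std::back_inserter</code> removes the need to size the buffer in advance.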

<h4>
utf8::next
</h4>

Available in version 1.0 and later.

Given the iterator to the beginning of the UTF-8 sequence, it returns the code
point and moves the iterator to the next position.

<pre>
template <typename octet_iterator>
uint32_t next(octet_iterator& it, octet_iterator end);
</pre>

<code>it</code>: a reference to an iterator pointing to the beginning of a UTF-8
encoded code point. After the function returns, it is incremented to point to the
beginning of the next code point.
<code>end</code>: end of the UTF-8 sequence to be processed. If <code>it</code>
becomes equal to <code>end</code> during the extraction of a code point, a
<code>utf8::not_enough_room</code> exception is thrown.
Return value: the 32-bit representation of the
processed UTF-8 code point.

Example of use:

<pre>
const char* twochars = "\xe6\x97\xa5\xd1\x88";
const char* w = twochars;
int cp = next(w, twochars + 6);
assert (cp == 0x65e5);
assert (w == twochars + 3);
</pre>

This function is typically used to iterate through a UTF-8 encoded string.

In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is
thrown.
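
The decoding step described above can be sketched as follows. This is a simplified illustration, not the library's code: <code>decode_next_sketch</code> is a hypothetical name, it throws <code>std::runtime_error</code> rather than the library's exception types, and it does not reject overlong encodings or surrogate code points.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical helper, not part of utf8.h. Decodes the code point that starts
// at `it` and leaves `it` at the beginning of the following code point.
template <typename OctetIt>
std::uint32_t decode_next_sketch(OctetIt& it, OctetIt end) {
    if (it == end)
        throw std::runtime_error("not enough room");
    const std::uint32_t lead = static_cast<unsigned char>(*it++);
    std::uint32_t cp = 0;
    int trailing = 0;
    if (lead < 0x80)               { cp = lead;        trailing = 0; }
    else if ((lead >> 5) == 0x06)  { cp = lead & 0x1F; trailing = 1; }
    else if ((lead >> 4) == 0x0E)  { cp = lead & 0x0F; trailing = 2; }
    else if ((lead >> 3) == 0x1E)  { cp = lead & 0x07; trailing = 3; }
    else throw std::runtime_error("invalid lead octet");
    for (; trailing > 0; --trailing) {
        if (it == end)
            throw std::runtime_error("not enough room");
        const std::uint32_t octet = static_cast<unsigned char>(*it);
        if ((octet & 0xC0) != 0x80)          // trail octets look like 10xxxxxx
            throw std::runtime_error("invalid trail octet");
        cp = (cp << 6) | (octet & 0x3F);     // append six payload bits
        ++it;
    }
    return cp;
}

// Usage: the same octets as in the example for next.
inline void decode_demo() {
    const char text[] = "\xe6\x97\xa5\xd1\x88";
    const char* it = text;
    assert(decode_next_sketch(it, text + 5) == 0x65e5);
    assert(it == text + 3);
}
```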

<h4>
utf8::peek_next
</h4>

Available in version 2.1 and later.

Given the iterator to the beginning of the UTF-8 sequence, it returns the code
point for the following sequence without changing the value of the iterator.

<pre>
template <typename octet_iterator>
uint32_t peek_next(octet_iterator it, octet_iterator end);
</pre>

<code>it</code>: an iterator pointing to the beginning of a UTF-8
encoded code point.
<code>end</code>: end of the UTF-8 sequence to be processed. If <code>it</code>
becomes equal to <code>end</code> during the extraction of a code point, a
<code>utf8::not_enough_room</code> exception is thrown.
Return value: the 32-bit representation of the
processed UTF-8 code point.

Example of use:

<pre>
const char* twochars = "\xe6\x97\xa5\xd1\x88";
const char* w = twochars;
int cp = peek_next(w, twochars + 6);
assert (cp == 0x65e5);
assert (w == twochars);
</pre>

In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is
thrown.

<h4>
utf8::prior
</h4>

Available in version 1.02 and later.

Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it
decreases the iterator until it hits the beginning of the previous UTF-8 encoded
code point and returns the 32-bit representation of the code point.

<pre>
template <typename octet_iterator>
uint32_t prior(octet_iterator& it, octet_iterator start);
</pre>

<code>it</code>: a reference pointing to an octet within a UTF-8 encoded string.
After the function returns, it is decremented to point to the beginning of the
previous code point.
<code>start</code>: an iterator to the beginning of the sequence where the search
for the beginning of a code point is performed. It is a
safety measure to prevent passing the beginning of the string in the search for a
UTF-8 lead octet.
Return value: the 32-bit representation of the
previous code point.

Example of use:

<pre>
const char* twochars = "\xe6\x97\xa5\xd1\x88";
const char* w = twochars + 3;
int cp = prior (w, twochars);
assert (cp == 0x65e5);
assert (w == twochars);
</pre>

This function has two purposes: one is to iterate backwards through a UTF-8
encoded string. Note that it is usually a better idea to iterate forward instead,
since <code>utf8::next</code> is faster. The second purpose is to find the beginning
of a UTF-8 sequence if we have a random position within a string.

<code>it</code> will typically point to the beginning of
a code point, and <code>start</code> will point to the
beginning of the string to ensure we don't go backwards too far. <code>it</code> is
decreased until it points to a lead UTF-8 octet, and then the UTF-8 sequence
beginning with that octet is decoded to a 32-bit representation and returned.

In case <code>start</code> is reached before a UTF-8 lead octet is hit, or if an
invalid UTF-8 sequence is started by the lead octet, an <code>invalid_utf8</code>
exception is thrown.
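
The backward search for a lead octet can be sketched as below. The sketch relies only on the fact that UTF-8 trail octets have the bit pattern 10xxxxxx; the names <code>is_trail_octet</code> and <code>step_back_to_lead_sketch</code> are hypothetical, the exceptions are standard ones rather than the library's, and no decoding is attempted.

```cpp
#include <cassert>
#include <stdexcept>

// Trail (continuation) octets in UTF-8 have the form 10xxxxxx.
inline bool is_trail_octet(unsigned char octet) {
    return (octet & 0xC0) == 0x80;
}

// Hypothetical helper, not part of utf8.h. Starting just before `it`, walks
// backwards until it reaches an octet that is not a trail octet, never
// stepping before `start`. Returns an iterator to that octet.
template <typename OctetIt>
OctetIt step_back_to_lead_sketch(OctetIt it, OctetIt start) {
    if (it == start)
        throw std::out_of_range("nothing precedes the start of the sequence");
    --it;
    while (is_trail_octet(static_cast<unsigned char>(*it))) {
        if (it == start)
            throw std::runtime_error("reached start without finding a lead octet");
        --it;
    }
    return it;
}

// Usage: from the end of the first code point back to its lead octet.
inline void step_back_demo() {
    const char text[] = "\xe6\x97\xa5\xd1\x88";
    const char* begin = text;
    assert(step_back_to_lead_sketch(begin + 3, begin) == begin);
}
```

Because <code>start</code> is a real position inside the sequence, this form works with iterators of standard containers.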

<h4>
utf8::previous
</h4>

Deprecated in version 1.02 and later.

Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it
decreases the iterator until it hits the beginning of the previous UTF-8 encoded
code point and returns the 32-bit representation of the code point.

<pre>
template <typename octet_iterator>
uint32_t previous(octet_iterator& it, octet_iterator pass_start);
</pre>

<code>it</code>: a reference pointing to an octet within a UTF-8 encoded string.
After the function returns, it is decremented to point to the beginning of the
previous code point.
<code>pass_start</code>: an iterator to the point in the sequence where the search
for the beginning of a code point is aborted if no result was reached. It is a
safety measure to prevent passing the beginning of the string in the search for a
UTF-8 lead octet.
Return value: the 32-bit representation of the
previous code point.

Example of use:

<pre>
const char* twochars = "\xe6\x97\xa5\xd1\x88";
const char* w = twochars + 3;
// pass_start is the position just before the beginning of the sequence
int cp = previous (w, twochars - 1);
assert (cp == 0x65e5);
assert (w == twochars);
</pre>

<code>utf8::previous</code> is deprecated, and <code>utf8::prior</code> should
be used instead, although existing code can continue using this function.
The problem is the parameter <code>pass_start</code>, which points to the position
just before the beginning of the sequence. Standard containers don't have the
concept of "pass start", so the function cannot be used with their iterators.

<code>it</code> will typically point to the beginning of
a code point, and <code>pass_start</code> will point to the octet just before the
beginning of the string to ensure we don't go backwards too far. <code>it</code> is
decreased until it points to a lead UTF-8 octet, and then the UTF-8 sequence
beginning with that octet is decoded to a 32-bit representation and returned.

In case <code>pass_start</code> is reached before a UTF-8 lead octet is hit, or if an
invalid UTF-8 sequence is started by the lead octet, an <code>invalid_utf8</code>
exception is thrown.

<h4>
utf8::advance
</h4>

Available in version 1.0 and later.

Advances an iterator by the specified number of code points within a UTF-8
sequence.

<pre>
template <typename octet_iterator, typename distance_type>
void advance (octet_iterator& it, distance_type n, octet_iterator end);
</pre>

<code>it</code>: a reference to an iterator pointing to the beginning of a UTF-8
encoded code point. After the function returns, it is incremented to point to the
nth following code point.
<code>n</code>: a positive integer that shows how many code points we want to
advance.
<code>end</code>: end of the UTF-8 sequence to be processed. If <code>it</code>
becomes equal to <code>end</code> during the extraction of a code point, a
<code>utf8::not_enough_room</code> exception is thrown.

Example of use:

<pre>
const char* twochars = "\xe6\x97\xa5\xd1\x88";
const char* w = twochars;
advance (w, 2, twochars + 6);
assert (w == twochars + 5);
</pre>

This function works only "forward". In case of a negative <code>n</code>, there is
no effect.

In case of an invalid code point, a <code>utf8::invalid_code_point</code> exception
is thrown.
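
As a rough illustration of advancing by code points rather than by octets, the sketch below skips one lead octet and its trail octets per step. It assumes the input is already valid UTF-8 and performs no decoding, so it is not equivalent to the checked <code>advance</code>; the name <code>advance_sketch</code> and the use of <code>std::out_of_range</code> are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical helper, not part of utf8.h. Moves `it` forward by `n` code
// points in a sequence that is assumed to be valid UTF-8. A negative `n`
// leaves `it` unchanged, since the loop body never runs.
template <typename OctetIt>
void advance_sketch(OctetIt& it, std::ptrdiff_t n, OctetIt end) {
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        if (it == end)
            throw std::out_of_range("ran past the end of the sequence");
        ++it;                                              // step over the lead octet
        while (it != end &&
               (static_cast<unsigned char>(*it) & 0xC0) == 0x80)
            ++it;                                          // step over trail octets
    }
}

// Usage: two code points occupy five octets in this string.
inline void advance_demo() {
    const char text[] = "\xe6\x97\xa5\xd1\x88";
    const char* w = text;
    advance_sketch(w, 2, text + 5);
    assert(w == text + 5);
}
```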

<h4>
utf8::distance
</h4>

Available in version 1.0 and later.

Given the iterators to two UTF-8 encoded code points in a sequence, returns the
number of code points between them.

<pre>
template <typename octet_iterator>
typename std::iterator_traits<octet_iterator>::difference_type distance (octet_iterator first, octet_iterator last);
</pre>

<code>first</code>: an iterator to the beginning of a UTF-8 encoded code point.
<code>last</code>: an iterator to the "post-end" of the last UTF-8 encoded code
point in the sequence whose length we are trying to determine. It can be the
beginning of a new code point, or not.
Return value: the distance between the iterators,
in code points.

Example of use:

<pre>
const char* twochars = "\xe6\x97\xa5\xd1\x88";
size_t dist = utf8::distance(twochars, twochars + 5);
assert (dist == 2);
</pre>

This function is used to find the length (in code points) of a UTF-8 encoded
string. The reason it is called distance, rather than, say,
length, is mainly that developers are used to length being an
O(1) operation. Computing the length of a UTF-8 string is a linear operation, and
it looked better to model it after the <code>std::distance</code> algorithm.

In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is
thrown. If <code>last</code> does not point to the past-the-end of a UTF-8 sequence,
a <code>utf8::not_enough_room</code> exception is thrown.
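
The linear cost mentioned above is easy to see in a sketch: every octet has to be inspected, and only octets that are not trail octets start a new code point. The function below is a hypothetical illustration that assumes valid input and performs none of the checks that <code>utf8::distance</code> does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of utf8.h. Counts code points in a range that
// is assumed to hold valid UTF-8 by counting the octets that are not trail
// octets (trail octets have the form 10xxxxxx).
template <typename OctetIt>
std::ptrdiff_t count_code_points_sketch(OctetIt first, OctetIt last) {
    std::ptrdiff_t count = 0;
    for (; first != last; ++first) {
        if ((static_cast<unsigned char>(*first) & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

// Usage: five octets, but only two code points.
inline void count_demo() {
    const std::string twochars = "\xe6\x97\xa5\xd1\x88";
    assert(twochars.size() == 5);
    assert(count_code_points_sketch(twochars.begin(), twochars.end()) == 2);
}
```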

<h4>
utf8::utf16to8
</h4>

Available in version 1.0 and later.

Converts a UTF-16 encoded string to UTF-8.

<pre>
template <typename u16bit_iterator, typename octet_iterator>
octet_iterator utf16to8 (u16bit_iterator start, u16bit_iterator end, octet_iterator result);
</pre>

<code>start</code>: an iterator pointing to the beginning of the UTF-16 encoded
string to convert.
<code>end</code>: an iterator pointing to past-the-end of the UTF-16 encoded
string to convert.
<code>result</code>: an output iterator to the place in the UTF-8 string where the
result of the conversion is to be appended.
Return value: an iterator pointing to the place
after the appended UTF-8 string.

Example of use:

<pre>
unsigned short utf16string[] = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e};
vector<unsigned char> utf8result;
utf16to8(utf16string, utf16string + 5, back_inserter(utf8result));
assert (utf8result.size() == 10);
</pre>

In case of an invalid UTF-16 sequence, a <code>utf8::invalid_utf16</code> exception is
thrown.
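
The only non-trivial part of reading UTF-16 is the surrogate pair, such as the last two units in the example above. The following sketch turns UTF-16 code units into code points; it is an assumption-laden illustration (hypothetical names, <code>std::runtime_error</code> instead of <code>utf8::invalid_utf16</code>) rather than the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

inline bool is_lead_surrogate(std::uint16_t u)  { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_trail_surrogate(std::uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical helper, not part of utf8.h. Converts UTF-16 code units to code
// points, combining each lead/trail surrogate pair into one code point.
inline std::vector<std::uint32_t>
utf16_to_code_points_sketch(const std::vector<std::uint16_t>& in) {
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const std::uint32_t unit = in[i];
        if (is_lead_surrogate(in[i])) {
            if (i + 1 == in.size() || !is_trail_surrogate(in[i + 1]))
                throw std::runtime_error("unpaired lead surrogate");
            const std::uint32_t trail = in[++i];
            // Ten payload bits from each unit, offset by 0x10000.
            out.push_back(0x10000u + ((unit - 0xD800u) << 10) + (trail - 0xDC00u));
        } else if (is_trail_surrogate(in[i])) {
            throw std::runtime_error("unpaired trail surrogate");
        } else {
            out.push_back(unit);
        }
    }
    return out;
}

// Usage: the five code units from the example hold four code points.
inline void utf16_demo() {
    const std::vector<std::uint16_t> units = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e};
    const std::vector<std::uint32_t> cps = utf16_to_code_points_sketch(units);
    assert(cps.size() == 4);
    assert(cps[3] == 0x1d11e);
}
```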

<h4>
utf8::utf8to16
</h4>

Available in version 1.0 and later.

Converts a UTF-8 encoded string to UTF-16.

<pre>
template <typename u16bit_iterator, typename octet_iterator>
u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, u16bit_iterator result);
</pre>

<code>start</code>: an iterator pointing to the beginning of the UTF-8 encoded
string to convert.
<code>end</code>: an iterator pointing to
past-the-end of the UTF-8 encoded string to convert.
<code>result</code>: an output iterator to the place in the UTF-16 string where the
result of the conversion is to be appended.
Return value: an iterator pointing to the place
after the appended UTF-16 string.

Example of use:

<pre>
char utf8_with_surrogates[] = "\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e";
vector <unsigned short> utf16result;
utf8to16(utf8_with_surrogates, utf8_with_surrogates + 9, back_inserter(utf16result));
assert (utf16result.size() == 4);
assert (utf16result[2] == 0xd834);
assert (utf16result[3] == 0xdd1e);
</pre>

In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is
thrown. If <code>end</code> does not point to the past-the-end of a UTF-8 sequence, a
<code>utf8::not_enough_room</code> exception is thrown.
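
The reason the example above yields four UTF-16 units from three code points is that a code point above U+FFFF is split into a surrogate pair. A minimal sketch of that split follows; <code>split_into_surrogates_sketch</code> is a hypothetical name, and the caller is assumed to pass a code point in the range 0x10000 to 0x10FFFF.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper, not part of utf8.h. Splits a code point in the range
// [0x10000, 0x10FFFF] into a (lead, trail) surrogate pair. The range is a
// precondition and is not checked here.
inline std::pair<std::uint16_t, std::uint16_t>
split_into_surrogates_sketch(std::uint32_t cp) {
    const std::uint32_t v = cp - 0x10000u;   // 20 significant bits remain
    return std::make_pair(
        static_cast<std::uint16_t>(0xD800u + (v >> 10)),      // high ten bits
        static_cast<std::uint16_t>(0xDC00u + (v & 0x3FFu)));  // low ten bits
}

// Usage: U+1D11E is the code point encoded as F0 9D 84 9E in the example.
inline void split_demo() {
    const std::pair<std::uint16_t, std::uint16_t> p = split_into_surrogates_sketch(0x1d11e);
    assert(p.first == 0xd834);
    assert(p.second == 0xdd1e);
}
```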

<h4>
utf8::utf32to8
</h4>

Available in version 1.0 and later.

Converts a UTF-32 encoded string to UTF-8.

<pre>
template <typename octet_iterator, typename u32bit_iterator>
octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, octet_iterator result);
</pre>

<code>start</code>: an iterator pointing to the beginning of the UTF-32 encoded
string to convert.
<code>end</code>: an iterator pointing to past-the-end of the UTF-32 encoded
string to convert.
<code>result</code>: an output iterator to the place in the UTF-8 string where the
result of the conversion is to be appended.
Return value: an iterator pointing to the place
after the appended UTF-8 string.

Example of use:

<pre>
int utf32string[] = {0x448, 0x65E5, 0x10346, 0};
vector<unsigned char> utf8result;
utf32to8(utf32string, utf32string + 3, back_inserter(utf8result));
assert (utf8result.size() == 9);
</pre>

In case of an invalid UTF-32 string, a <code>utf8::invalid_code_point</code> exception
is thrown.
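
To make the size of 9 in the example above easier to verify, the sketch below computes how many octets each code point needs, together with a basic validity test. Both helpers are hypothetical, and the validity rule shown (no surrogates, nothing above U+10FFFF) is only the Unicode minimum; the library's own check may be stricter.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of utf8.h. A code point can be encoded if it
// does not exceed U+10FFFF and is not in the surrogate range.
inline bool is_encodable_code_point_sketch(std::uint32_t cp) {
    return cp <= 0x10FFFFu && !(cp >= 0xD800u && cp <= 0xDFFFu);
}

// Hypothetical helper: number of octets the UTF-8 encoding of `cp` occupies,
// assuming `cp` is encodable.
inline int utf8_length_sketch(std::uint32_t cp) {
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Usage: the three code points from the example need 2 + 3 + 4 octets.
inline void length_demo() {
    assert(utf8_length_sketch(0x448) +
           utf8_length_sketch(0x65E5) +
           utf8_length_sketch(0x10346) == 9);
}
```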

<h4>
utf8::utf8to32
</h4>

Available in version 1.0 and later.

Converts a UTF-8 encoded string to UTF-32.

<pre>
template <typename octet_iterator, typename u32bit_iterator>
u32bit_iterator utf8to32 (octet_iterator start, octet_iterator end, u32bit_iterator result);
</pre>

<code>start</code>: an iterator pointing to the beginning of the UTF-8 encoded
string to convert.
<code>end</code>: an iterator pointing to past-the-end of the UTF-8 encoded string
to convert.
<code>result</code>: an output iterator to the place in the UTF-32 string where the
result of the conversion is to be appended.
Return value: an iterator pointing to the place
after the appended UTF-32 string.

Example of use:

<pre>
const char* twochars = "\xe6\x97\xa5\xd1\x88";
vector<int> utf32result;
utf8to32(twochars, twochars + 5, back_inserter(utf32result));
assert (utf32result.size() == 2);
</pre>

In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is
thrown. If <code>end</code> does not point to the past-the-end of a UTF-8 sequence, a
<code>utf8::not_enough_room</code> exception is thrown.

<h4>
utf8::find_invalid
</h4>

Available in version 1.0 and later.

Detects an invalid sequence within a UTF-8 string.

<pre>
template <typename octet_iterator>
octet_iterator find_invalid(octet_iterator start, octet_iterator end);
</pre>

<code>start</code>: an iterator pointing to the beginning of the UTF-8 string to
test for validity.
<code>end</code>: an iterator pointing to past-the-end of the UTF-8 string to test
for validity.
Return value: an iterator pointing to the first
invalid octet in the UTF-8 string. In case none were found, equals
<code>end</code>.

Example of use:

<pre>
char utf_invalid[] = "\xe6\x97\xa5\xd1\x88\xfa";
char* invalid = find_invalid(utf_invalid, utf_invalid + 6);
assert (invalid == utf_invalid + 5);
</pre>

This function is typically used to make sure a UTF-8 string is valid before
processing it with other functions. It is especially important to call it before
doing any of the unchecked operations on the string.

759

<h4>

760

utf8::is_valid

761

</h4>

762

763

Available in version 1.0 and later.

764

765

766

Checks whether a sequence of octets is a valid UTF-8 string.

767

768

<pre>

769

template <<span class=

770

"keyword">typename octet_iterator>

771

bool is_valid(octet_iterator start, octet_iterator end);

772

773

</pre>

774

775

<code>start</code>: an iterator pointing to the beginning of the UTF-8 string to

776

test for validity.

777

<code>end</code>: an iterator pointing to pass-the-end of the UTF-8 string to test

778

for validity.

779

Return value: <code>true</code> if the sequence

780

is a valid UTF-8 string; <code>false</code> if not.

781

782

Example of use:

783

<pre>

784

char utf_invalid[] = <span class=

785

"literal">"\xe6\x97\xa5\xd1\x88\xfa";

786

bool bvalid = is_valid(utf_invalid, utf_invalid + <span

787

class="literal">6);

788

assert (bvalid == false);

789

</pre>

790

791

<code>is_valid</code> is a shorthand for <code>find_invalid(start, end) ==

792

end;</code>. You may want to use it to make sure that a byte seqence is a valid

793

UTF-8 string without the need to know where it fails if it is not valid.
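
The relationship between the two checks can be illustrated without the library. The following is only a sketch: <code>find_invalid_sketch</code> and <code>is_valid_sketch</code> are hypothetical names, they work on <code>std::string</code> indices rather than iterators, and the validation rules are a simplified reading of the UTF-8 encoding, not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not the library's code: returns the index of the first
// octet that starts an ill-formed UTF-8 sequence, or s.size() if there is none.
std::size_t find_invalid_sketch(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (lead < 0x80)              { len = 1; cp = lead; }
        else if ((lead >> 5) == 0x06) { len = 2; cp = lead & 0x1f; }
        else if ((lead >> 4) == 0x0e) { len = 3; cp = lead & 0x0f; }
        else if ((lead >> 3) == 0x1e) { len = 4; cp = lead & 0x07; }
        else return i;                     // stray trail octet or invalid lead octet
        if (len > s.size() - i) return i;  // the sequence is cut off by the end
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char trail = static_cast<unsigned char>(s[i + k]);
            if ((trail & 0xc0) != 0x80) return i;  // not of the form 10xxxxxx
            cp = (cp << 6) | (trail & 0x3f);
        }
        const bool overlong = (len == 2 && cp < 0x80) ||
                              (len == 3 && cp < 0x800) ||
                              (len == 4 && cp < 0x10000);
        const bool surrogate = cp >= 0xd800 && cp <= 0xdfff;
        if (overlong || surrogate || cp > 0x10ffff) return i;
        i += len;
    }
    return s.size();
}

// The validity check is the search for an invalid octet that finds nothing.
bool is_valid_sketch(const std::string& s) {
    return find_invalid_sketch(s) == s.size();
}
```

Use the search form when the position of the failure matters, for example to report it or to truncate the input there; use the boolean form when only a yes or no answer is needed.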

<h4>
utf8::replace_invalid
</h4>

Available in version 2.0 and later.

Replaces all invalid UTF-8 sequences within a string with a replacement marker.

<pre>
template <typename octet_iterator, typename output_iterator>
output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out, uint32_t replacement);
template <typename octet_iterator, typename output_iterator>
output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out);
</pre>

<code>start</code>: an iterator pointing to the beginning of the UTF-8 string to
look for invalid UTF-8 sequences.
<code>end</code>: an iterator pointing one past the end of the UTF-8 string to look
for invalid UTF-8 sequences.
<code>out</code>: an output iterator to the range where the result of replacement
is stored.
<code>replacement</code>: a Unicode code point for the replacement marker. The
version without this parameter assumes the value <code>0xfffd</code>.
Return value: an iterator pointing to the place
after the UTF-8 string with replaced invalid sequences.

Example of use:

<pre>
char invalid_sequence[] = "a\x80\xe0\xa0\xc0\xaf\xed\xa0\x80z";
vector<char> replace_invalid_result;
replace_invalid (invalid_sequence, invalid_sequence + sizeof(invalid_sequence), back_inserter(replace_invalid_result), '?');
bool bvalid = is_valid(replace_invalid_result.begin(), replace_invalid_result.end());
assert (bvalid);
char fixed_invalid_sequence[] = "a????z";
assert (std::equal(replace_invalid_result.begin(), replace_invalid_result.end(), fixed_invalid_sequence));
</pre>

<code>replace_invalid</code> does not perform in-place replacement of invalid
sequences. Rather, it produces a copy of the original string with the invalid
sequences replaced with a replacement marker. Therefore, <code>out</code> must not
be in the <code>[start, end]</code> range.

If <code>end</code> does not point one past the end of a UTF-8 sequence, a
<code>utf8::not_enough_room</code> exception is thrown.

<h4>
utf8::is_bom
</h4>

Available in version 1.0 and later.

Checks whether a sequence of three octets is a UTF-8 byte order mark (BOM).

<pre>
template <typename octet_iterator>
bool is_bom (octet_iterator it);
</pre>

<code>it</code>: beginning of the 3-octet sequence to check.
Return value: <code>true</code> if the sequence
is a UTF-8 byte order mark; <code>false</code> if not.

Example of use:

<pre>
unsigned char byte_order_mark[] = {0xef, 0xbb, 0xbf};
bool bbom = is_bom(byte_order_mark);
assert (bbom == true);
</pre>

The typical use of this function is to check the first three bytes of a file. If
they form the UTF-8 BOM, we want to skip them before processing the actual UTF-8
encoded text.
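
The "check, then skip" step can be sketched without the library as follows. <code>bom_length</code> and <code>strip_bom</code> are hypothetical helper names used only for illustration, and the input is assumed to be already read into a <code>std::string</code>.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns 3 if the buffer starts with the UTF-8 BOM
// (octets EF BB BF), otherwise 0.
std::size_t bom_length(const std::string& bytes) {
    if (bytes.size() >= 3 &&
        static_cast<unsigned char>(bytes[0]) == 0xef &&
        static_cast<unsigned char>(bytes[1]) == 0xbb &&
        static_cast<unsigned char>(bytes[2]) == 0xbf) {
        return 3;
    }
    return 0;
}

// Hypothetical helper: returns the text with a leading BOM removed, if any.
std::string strip_bom(const std::string& bytes) {
    return bytes.substr(bom_length(bytes));
}
```

The size test comes first so that inputs shorter than three octets are never read past their end.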

<h3>
Types From utf8 Namespace
</h3>

<h4>
utf8::iterator
</h4>

Available in version 2.0 and later.

Adapts the underlying octet iterator to iterate over the sequence of code points,
rather than raw octets.

<pre>
template <typename octet_iterator>
class iterator;
</pre>

<h5>Member functions</h5>
<dl>
<dt><code>iterator();</code> <dd> the default constructor; the underlying <code>octet_iterator</code> is
constructed with its default constructor.
<dt><code>explicit iterator (const octet_iterator& octet_it,
const octet_iterator& range_start,
const octet_iterator& range_end);</code> <dd> a constructor
that initializes the underlying <code>octet_iterator</code> with <code>octet_it</code>
and sets the range in which the iterator is considered valid.
<dt><code>octet_iterator base () const;</code> <dd> returns the
underlying <code>octet_iterator</code>.
<dt><code>uint32_t operator * () const;</code> <dd> decodes the UTF-8 sequence
the underlying <code>octet_iterator</code> is pointing to and returns the code point.
<dt><code>bool operator == (const iterator& rhs)
const;</code> <dd> returns true
if the two underlying iterators are equal.
<dt><code>bool operator != (const iterator& rhs)
const;</code> <dd> returns true
if the two underlying iterators are not equal.
<dt><code>iterator& operator ++ (); </code> <dd> the prefix increment - moves
the iterator to the next UTF-8 encoded code point.
<dt><code>iterator operator ++ (int); </code> <dd>
the postfix increment - moves the iterator to the next UTF-8 encoded code point and returns the current one.
<dt><code>iterator& operator -- (); </code> <dd> the prefix decrement - moves
the iterator to the previous UTF-8 encoded code point.
<dt><code>iterator operator -- (int); </code> <dd>
the postfix decrement - moves the iterator to the previous UTF-8 encoded code point and returns the current one.
</dl>

Example of use:

<pre>
char threechars[] = "\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88";
utf8::iterator<char*> it(threechars, threechars, threechars + 9);
utf8::iterator<char*> it2 = it;
assert (it2 == it);
assert (*it == 0x10346);
assert (*(++it) == 0x65e5);
assert ((*it++) == 0x65e5);
assert (*it == 0x0448);
assert (it != it2);
utf8::iterator<char*> endit (threechars + 9, threechars, threechars + 9);
assert (++it == endit);
assert (*(--it) == 0x0448);
assert ((*it--) == 0x0448);
assert (*it == 0x65e5);
assert (--it == utf8::iterator<char*>(threechars, threechars, threechars + 9));
assert (*it == 0x10346);
</pre>

The purpose of the <code>utf8::iterator</code> adapter is to enable easy iteration as well as the use of STL
algorithms with UTF-8 encoded strings. Increment and decrement operators are implemented in terms of
<code>utf8::next()</code> and <code>utf8::prior()</code> functions.

Note that the <code>utf8::iterator</code> adapter is a checked iterator. It operates on the range specified in
the constructor; any attempt to go out of that range will result in an exception. Even the comparison operators
require both iterator objects to be constructed against the same range - otherwise an exception is thrown. Typically,
the range will be determined by the sequence container functions <code>begin</code> and <code>end</code>, i.e.:

<pre>
std::string s = "example";
utf8::iterator<std::string::iterator> i (s.begin(), s.begin(), s.end());
</pre>

<h3>
Functions From utf8::unchecked Namespace
</h3>

<h4>
utf8::unchecked::append
</h4>

Available in version 1.0 and later.

Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence
to a UTF-8 string.

<pre>
template <typename octet_iterator>
octet_iterator append(uint32_t cp, octet_iterator result);
</pre>

<code>cp</code>: a 32 bit integer representing a code point to append to the
sequence.
<code>result</code>: an output iterator to the place in the sequence where to
append the code point.
Return value: an iterator pointing to the place
after the newly appended sequence.

Example of use:

<pre>
unsigned char u[5] = {0,0,0,0,0};
unsigned char* end = unchecked::append(0x0448, u);
assert (u[0] == 0xd1 && u[1] == 0x88 && u[2] == 0 && u[3] == 0 && u[4] == 0);
</pre>

This is a faster but less safe version of <code>utf8::append</code>. It does not
check for validity of the supplied code point, and may produce an invalid UTF-8
sequence.
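
To make "may produce an invalid UTF-8 sequence" concrete, here is a minimal sketch of an encoder that performs no validity check. <code>append_unchecked_sketch</code> is a hypothetical name and the code is an illustration of the UTF-8 bit layout, not the library's source.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: appends cp to out as 1 to 4 octets using the UTF-8 bit
// layout. It never rejects cp, so a surrogate such as 0xd800 is encoded too.
void append_unchecked_sketch(std::uint32_t cp, std::string& out) {
    if (cp < 0x80) {                       // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 110xxxxx 10xxxxxx
        out += static_cast<char>((cp >> 6) | 0xc0);
        out += static_cast<char>((cp & 0x3f) | 0x80);
    } else if (cp < 0x10000) {             // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>((cp >> 12) | 0xe0);
        out += static_cast<char>(((cp >> 6) & 0x3f) | 0x80);
        out += static_cast<char>((cp & 0x3f) | 0x80);
    } else {                               // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>((cp >> 18) | 0xf0);
        out += static_cast<char>(((cp >> 12) & 0x3f) | 0x80);
        out += static_cast<char>(((cp >> 6) & 0x3f) | 0x80);
        out += static_cast<char>((cp & 0x3f) | 0x80);
    }
}
```

Passing <code>0xd800</code> to this sketch yields the octets <code>ed a0 80</code>, which is one of the invalid sequences used in the <code>replace_invalid</code> example. A checked encoder would refuse that input instead.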

<h4>
utf8::unchecked::next
</h4>

Available in version 1.0 and later.

Given the iterator to the beginning of a UTF-8 sequence, it returns the code point
and moves the iterator to the next position.

<pre>
template <typename octet_iterator>
uint32_t next(octet_iterator& it);
</pre>

<code>it</code>: a reference to an iterator pointing to the beginning of a UTF-8
encoded code point. After the function returns, it is incremented to point to the
beginning of the next code point.
Return value: the 32 bit representation of the
processed UTF-8 code point.

Example of use:

<pre>
char twochars[] = "\xe6\x97\xa5\xd1\x88";
char* w = twochars;
int cp = unchecked::next(w);
assert (cp == 0x65e5);
assert (w == twochars + 3);
</pre>

This is a faster but less safe version of <code>utf8::next</code>. It does not
check for validity of the supplied UTF-8 sequence.
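
What skipping the validity check means can be seen in a small decoder that trusts its input. This is a sketch with hypothetical names (<code>octet</code>, <code>next_unchecked_sketch</code>); it assumes the pointer refers to the first octet of a well-formed sequence and reads past the buffer otherwise.

```cpp
#include <cassert>
#include <cstdint>

// Reads one octet as an unsigned value, whatever the signedness of char.
static std::uint32_t octet(const char* p) {
    return static_cast<unsigned char>(*p);
}

// Hypothetical helper: decodes the code point at it and advances it past the
// sequence. The lead octet alone decides how many octets are consumed.
std::uint32_t next_unchecked_sketch(const char*& it) {
    std::uint32_t cp = octet(it);
    if (cp < 0x80) {
        it += 1;
    } else if ((cp >> 5) == 0x06) {
        cp = ((cp & 0x1f) << 6) | (octet(it + 1) & 0x3f);
        it += 2;
    } else if ((cp >> 4) == 0x0e) {
        cp = ((cp & 0x0f) << 12) | ((octet(it + 1) & 0x3f) << 6) |
             (octet(it + 2) & 0x3f);
        it += 3;
    } else {
        cp = ((cp & 0x07) << 18) | ((octet(it + 1) & 0x3f) << 12) |
             ((octet(it + 2) & 0x3f) << 6) | (octet(it + 3) & 0x3f);
        it += 4;
    }
    return cp;
}
```

Nothing here checks that the trail octets exist or have the form <code>10xxxxxx</code>, which is why the input should be validated once up front.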

<h4>
utf8::unchecked::peek_next
</h4>

Available in version 2.1 and later.

Given the iterator to the beginning of a UTF-8 sequence, it returns the code point
without moving the iterator.

<pre>
template <typename octet_iterator>
uint32_t peek_next(octet_iterator it);
</pre>

<code>it</code>: an iterator pointing to the beginning of a UTF-8
encoded code point.
Return value: the 32 bit representation of the
processed UTF-8 code point.

Example of use:

<pre>
char twochars[] = "\xe6\x97\xa5\xd1\x88";
char* w = twochars;
int cp = unchecked::peek_next(w);
assert (cp == 0x65e5);
assert (w == twochars);
</pre>

This is a faster but less safe version of <code>utf8::peek_next</code>. It does not
check for validity of the supplied UTF-8 sequence.

<h4>
utf8::unchecked::prior
</h4>

Available in version 1.02 and later.

Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it
decreases the iterator until it hits the beginning of the previous UTF-8 encoded
code point and returns the 32 bit representation of the code point.

<pre>
template <typename octet_iterator>
uint32_t prior(octet_iterator& it);
</pre>

<code>it</code>: a reference pointing to an octet within a UTF-8 encoded string.
After the function returns, it is decremented to point to the beginning of the
previous code point.
Return value: the 32 bit representation of the
previous code point.

Example of use:

<pre>
char twochars[] = "\xe6\x97\xa5\xd1\x88";
char* w = twochars + 3;
int cp = unchecked::prior (w);
assert (cp == 0x65e5);
assert (w == twochars);
</pre>

This is a faster but less safe version of <code>utf8::prior</code>. It does not
check for validity of the supplied UTF-8 sequence and offers no boundary checking.
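
The backward step described above relies on the fact that every trail octet has the bit pattern <code>10xxxxxx</code>. A minimal sketch of just that step, under the assumption that the input is valid UTF-8 (<code>start_of_previous</code> is a hypothetical name, and decoding is left out):

```cpp
#include <cassert>

// Hypothetical helper: returns a pointer to the first octet of the code point
// that precedes it. There is no lower bound, so calling it at the very start
// of a buffer walks off the front of it.
const char* start_of_previous(const char* it) {
    do {
        --it;
    } while ((static_cast<unsigned char>(*it) & 0xc0) == 0x80);  // skip trail octets
    return it;
}
```

The missing lower bound is the "no boundary checking" part: the caller has to know that at least one code point precedes the iterator.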

<h4>
utf8::unchecked::previous (deprecated, see utf8::unchecked::prior)
</h4>

Deprecated in version 1.02 and later.

Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it
decreases the iterator until it hits the beginning of the previous UTF-8 encoded
code point and returns the 32 bit representation of the code point.

<pre>
template <typename octet_iterator>
uint32_t previous(octet_iterator& it);
</pre>

<code>it</code>: a reference pointing to an octet within a UTF-8 encoded string.
After the function returns, it is decremented to point to the beginning of the
previous code point.
Return value: the 32 bit representation of the
previous code point.

Example of use:

<pre>
char twochars[] = "\xe6\x97\xa5\xd1\x88";
char* w = twochars + 3;
int cp = unchecked::previous (w);
assert (cp == 0x65e5);
assert (w == twochars);
</pre>

This function is deprecated only for consistency with the "checked"
versions, where <code>prior</code> should be used instead of <code>previous</code>.
In fact, <code>unchecked::previous</code> behaves exactly the same as
<code>unchecked::prior</code>.

This is a faster but less safe version of <code>utf8::previous</code>. It does not
check for validity of the supplied UTF-8 sequence and offers no boundary checking.

<h4>
utf8::unchecked::advance
</h4>

Available in version 1.0 and later.

Advances an iterator by the specified number of code points within a UTF-8
sequence.

<pre>
template <typename octet_iterator, typename distance_type>
void advance (octet_iterator& it, distance_type n);
</pre>

<code>it</code>: a reference to an iterator pointing to the beginning of a UTF-8
encoded code point. After the function returns, it is incremented to point to the
nth following code point.
<code>n</code>: a positive integer that shows how many code points we want to
advance.

Example of use:

<pre>
char twochars[] = "\xe6\x97\xa5\xd1\x88";
char* w = twochars;
unchecked::advance (w, 2);
assert (w == twochars + 5);
</pre>

This function works only "forward". In case of a negative <code>n</code>, there is
no effect.

This is a faster but less safe version of <code>utf8::advance</code>. It does not
check for validity of the supplied UTF-8 sequence and offers no boundary checking.

<h4>
utf8::unchecked::distance
</h4>

Available in version 1.0 and later.

Given the iterators to two UTF-8 encoded code points in a sequence, returns the
number of code points between them.

<pre>
template <typename octet_iterator>
typename std::iterator_traits<octet_iterator>::difference_type distance (octet_iterator first, octet_iterator last);
</pre>

<code>first</code>: an iterator to the beginning of a UTF-8 encoded code point.
<code>last</code>: an iterator pointing one past the last UTF-8 encoded code
point in the sequence whose length we are trying to determine. It can be the
beginning of a new code point, or not.
Return value: the distance between the iterators,
in code points.

Example of use:

<pre>
char twochars[] = "\xe6\x97\xa5\xd1\x88";
size_t dist = utf8::unchecked::distance(twochars, twochars + 5);
assert (dist == 2);
</pre>

This is a faster but less safe version of <code>utf8::distance</code>. It does not
check for validity of the supplied UTF-8 sequence.
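
For valid input, the number of code points in a range equals the number of octets that are not trail octets. The sketch below uses that shortcut; it is a swapped-in technique chosen for brevity, not a description of how the library computes the distance, and <code>distance_sketch</code> is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: counts code points in [first, last) by counting every
// octet that is not a trail octet (10xxxxxx). Assumes valid UTF-8.
std::ptrdiff_t distance_sketch(const char* first, const char* last) {
    std::ptrdiff_t count = 0;
    for (; first != last; ++first) {
        if ((static_cast<unsigned char>(*first) & 0xc0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

On invalid input the count is still well defined but no longer meaningful, which mirrors the caveat for the unchecked function.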

<h4>
utf8::unchecked::utf16to8
</h4>

Available in version 1.0 and later.

Converts a UTF-16 encoded string to UTF-8.

<pre>
template <typename u16bit_iterator, typename octet_iterator>
octet_iterator utf16to8 (u16bit_iterator start, u16bit_iterator end, octet_iterator result);
</pre>

<code>start</code>: an iterator pointing to the beginning of the UTF-16 encoded
string to convert.
<code>end</code>: an iterator pointing one past the end of the UTF-16 encoded
string to convert.
<code>result</code>: an output iterator to the place in the UTF-8 string where to
append the result of conversion.
Return value: an iterator pointing to the place
after the appended UTF-8 string.

Example of use:

<pre>
unsigned short utf16string[] = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e};
vector<unsigned char> utf8result;
unchecked::utf16to8(utf16string, utf16string + 5, back_inserter(utf8result));
assert (utf8result.size() == 10);
</pre>

This is a faster but less safe version of <code>utf8::utf16to8</code>. It does not
check for validity of the supplied UTF-16 sequence.
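
The last two units in the example, <code>0xd834</code> and <code>0xdd1e</code>, form a surrogate pair, which is why five 16 bit units yield only four code points and 1 + 2 + 3 + 4 = 10 octets. A sketch of the standard UTF-16 arithmetic for combining such a pair follows; <code>combine_surrogates</code> is a hypothetical name and the function does not verify that its arguments really are surrogates.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: combines a lead (0xd800..0xdbff) and a trail
// (0xdc00..0xdfff) surrogate into one code point above 0xffff.
// No range check is performed, in the spirit of the unchecked functions.
std::uint32_t combine_surrogates(std::uint16_t lead, std::uint16_t trail) {
    return 0x10000u +
           ((static_cast<std::uint32_t>(lead) - 0xd800u) << 10) +
           (static_cast<std::uint32_t>(trail) - 0xdc00u);
}
```

For the pair above the result is <code>0x1d11e</code>, which takes four octets in UTF-8.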

<h4>
utf8::unchecked::utf8to16
</h4>

Available in version 1.0 and later.

Converts a UTF-8 encoded string to UTF-16.

<pre>
template <typename u16bit_iterator, typename octet_iterator>
u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, u16bit_iterator result);
</pre>

<code>start</code>: an iterator pointing to the beginning of the UTF-8 encoded
string to convert.
<code>end</code>: an iterator pointing one past the end of the UTF-8 encoded
string to convert.
<code>result</code>: an output iterator to the place in the UTF-16 string where to
append the result of conversion.
Return value: an iterator pointing to the place
after the appended UTF-16 string.

Example of use:

<pre>
char utf8_with_surrogates[] = "\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e";
vector <unsigned short> utf16result;
unchecked::utf8to16(utf8_with_surrogates, utf8_with_surrogates + 9, back_inserter(utf16result));
assert (utf16result.size() == 4);
assert (utf16result[2] == 0xd834);
assert (utf16result[3] == 0xdd1e);
</pre>

This is a faster but less safe version of <code>utf8::utf8to16</code>. It does not
check for validity of the supplied UTF-8 sequence.
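
The example produces four 16 bit units from three code points because the last code point, <code>0x1d11e</code>, lies above <code>0xffff</code> and has to be written as a surrogate pair. A sketch of that split, using the standard UTF-16 arithmetic (<code>split_to_surrogates</code> is a hypothetical name, and the argument is assumed to be in the range 0x10000 to 0x10ffff):

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper: splits a code point above 0xffff into a lead and a
// trail surrogate. The caller must ensure 0x10000 <= cp <= 0x10ffff.
std::pair<std::uint16_t, std::uint16_t> split_to_surrogates(std::uint32_t cp) {
    const std::uint32_t v = cp - 0x10000u;  // 20 significant bits remain
    return std::make_pair(
        static_cast<std::uint16_t>(0xd800u + (v >> 10)),     // upper 10 bits
        static_cast<std::uint16_t>(0xdc00u + (v & 0x3ffu))); // lower 10 bits
}
```

Applied to <code>0x1d11e</code>, this gives <code>0xd834</code> and <code>0xdd1e</code>, the two values asserted in the example.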

<h4>
utf8::unchecked::utf32to8
</h4>

Available in version 1.0 and later.

Converts a UTF-32 encoded string to UTF-8.

<pre>
template <typename octet_iterator, typename u32bit_iterator>
octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, octet_iterator result);
</pre>

<code>start</code>: an iterator pointing to the beginning of the UTF-32 encoded
string to convert.
<code>end</code>: an iterator pointing one past the end of the UTF-32 encoded
string to convert.
<code>result</code>: an output iterator to the place in the UTF-8 string where to
append the result of conversion.
Return value: an iterator pointing to the place
after the appended UTF-8 string.

Example of use:

<pre>
int utf32string[] = {0x448, 0x65e5, 0x10346, 0};
vector<unsigned char> utf8result;
unchecked::utf32to8(utf32string, utf32string + 3, back_inserter(utf8result));
assert (utf8result.size() == 9);
</pre>

This is a faster but less safe version of <code>utf8::utf32to8</code>. It does not
check for validity of the supplied UTF-32 sequence.

<h4>
utf8::unchecked::utf8to32
</h4>

Available in version 1.0 and later.

Converts a UTF-8 encoded string to UTF-32.

<pre>
template <typename octet_iterator, typename u32bit_iterator>
u32bit_iterator utf8to32 (octet_iterator start, octet_iterator end, u32bit_iterator result);
</pre>

<code>start</code>: an iterator pointing to the beginning of the UTF-8 encoded
string to convert.
<code>end</code>: an iterator pointing one past the end of the UTF-8 encoded string
to convert.
<code>result</code>: an output iterator to the place in the UTF-32 string where to
append the result of conversion.
Return value: an iterator pointing to the place
after the appended UTF-32 string.

Example of use:

<pre>
char twochars[] = "\xe6\x97\xa5\xd1\x88";
vector<int> utf32result;
unchecked::utf8to32(twochars, twochars + 5, back_inserter(utf32result));
assert (utf32result.size() == 2);
</pre>

This is a faster but less safe version of <code>utf8::utf8to32</code>. It does not
check for validity of the supplied UTF-8 sequence.

<h3>
Types From utf8::unchecked Namespace
</h3>

<h4>
utf8::unchecked::iterator
</h4>

Available in version 2.0 and later.

Adapts the underlying octet iterator to iterate over the sequence of code points,
rather than raw octets.

<pre>
template <typename octet_iterator>
class iterator;
</pre>

<h5>Member functions</h5>
<dl>
<dt><code>iterator();</code> <dd> the default constructor; the underlying <code>octet_iterator</code> is
constructed with its default constructor.
<dt><code>explicit iterator (const octet_iterator& octet_it);
</code> <dd> a constructor
that initializes the underlying <code>octet_iterator</code> with <code>octet_it</code>.
<dt><code>octet_iterator base () const;</code> <dd> returns the
underlying <code>octet_iterator</code>.
<dt><code>uint32_t operator * () const;</code> <dd> decodes the UTF-8 sequence
the underlying <code>octet_iterator</code> is pointing to and returns the code point.
<dt><code>bool operator == (const iterator& rhs)
const;</code> <dd> returns true
if the two underlying iterators are equal.
<dt><code>bool operator != (const iterator& rhs)
const;</code> <dd> returns true
if the two underlying iterators are not equal.
<dt><code>iterator& operator ++ (); </code> <dd> the prefix increment - moves
the iterator to the next UTF-8 encoded code point.
<dt><code>iterator operator ++ (int); </code> <dd>
the postfix increment - moves the iterator to the next UTF-8 encoded code point and returns the current one.
<dt><code>iterator& operator -- (); </code> <dd> the prefix decrement - moves
the iterator to the previous UTF-8 encoded code point.
<dt><code>iterator operator -- (int); </code> <dd>
the postfix decrement - moves the iterator to the previous UTF-8 encoded code point and returns the current one.
</dl>

Example of use:

<pre>
char threechars[] = "\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88";
utf8::unchecked::iterator<char*> un_it(threechars);
utf8::unchecked::iterator<char*> un_it2 = un_it;
assert (un_it2 == un_it);
assert (*un_it == 0x10346);
assert (*(++un_it) == 0x65e5);
assert ((*un_it++) == 0x65e5);
assert (*un_it == 0x0448);
assert (un_it != un_it2);
utf8::unchecked::iterator<char*> un_endit (threechars + 9);
assert (++un_it == un_endit);
assert (*(--un_it) == 0x0448);
assert ((*un_it--) == 0x0448);
assert (*un_it == 0x65e5);
assert (--un_it == utf8::unchecked::iterator<char*>(threechars));
assert (*un_it == 0x10346);
</pre>

This is an unchecked version of <code>utf8::iterator</code>. It is faster in many cases, but offers
no validity or range checks.

<h2>
Points of interest
</h2>
<h4>
Design goals and decisions
</h4>

The library was designed to be:

<ol>
<li>
Generic: for better or worse, there are many C++ string classes out there, and
the library should work with as many of them as possible.
</li>
<li>
Portable: the library should be portable across both platforms and
compilers. The only non-portable code is a small section that declares unsigned
integers of different sizes: three typedefs. They can be changed by the users of
the library if they don't match their platform. The default setting should work
for Windows (both 32 and 64 bit), and most 32 bit and 64 bit Unix derivatives.
</li>
<li>
Lightweight: follow the "pay only for what you use" guideline.
</li>
<li>
Unintrusive: avoid forcing any particular design or even programming style on the
user. This is a library, not a framework.
</li>
</ol>

<h4>
Alternatives
</h4>

In case you want to look into other means of working with UTF-8 strings from C++,
here is the list of solutions I am aware of:

<ol>
<li>
<a href="http://icu.sourceforge.net/">ICU Library</a>. It is very powerful,
complete, feature-rich, mature, and widely used. It is also big, intrusive,
non-generic, and doesn't play well with the Standard Library. I definitely
recommend looking at ICU even if you don't plan to use it.
</li>
<li>
<a href=
"http://www.gtkmm.org/gtkmm2/docs/tutorial/html/ch03s04.html">Glib::ustring</a>.
A class specifically made to work with UTF-8 strings, and also to feel like
<code>std::string</code>. If you prefer to have yet another string class in your
code, it may be worth a look. Be aware of the licensing issues, though.
</li>
<li>
Platform dependent solutions: Windows and POSIX have functions to convert strings
from one encoding to another. That is only a subset of what my library offers,
but if that is all you need it may be good enough, especially given the fact that
these functions are mature and tested in production.
</li>
</ol>

<h2>
Conclusion
</h2>

Until Unicode becomes officially recognized by the C++ Standard Library, we need to
use other means to work with UTF-8 strings. The template functions I describe in this
article may be a good step in this direction.

<h2>
Links
</h2>
<ol>
<li>
<a href="http://www.unicode.org/">The Unicode Consortium</a>.
</li>
<li>
<a href="http://icu.sourceforge.net/">ICU Library</a>.
</li>
<li>
<a href="http://en.wikipedia.org/wiki/UTF-8">UTF-8 at Wikipedia</a>
</li>
<li>
<a href="http://www.cl.cam.ac.uk/~mgk25/unicode.html">UTF-8 and Unicode FAQ for
Unix/Linux</a>
</li>
</ol>
</body>
</html>
