~ubuntu-branches/ubuntu/precise/perl/precise

Viewing changes to pod/perlunicode.pod

Committer: Bazaar Package Importer
Author(s): Niko Tyni
Date: 2011-02-06 11:31:38 UTC
mto: (8.2.12 experimental) (1.1.12)
mto: This revision was merged to the branch mainline in revision 46.
Revision ID: james.westby@ubuntu.com-20110206113138-lzpm3g6rur7i3eyp

Tags: upstream-5.12.3

Import upstream version 5.12.3

files added:
cpan/CGI/t/headers.t

cpan/CGI/t/multipart_init.t

pod/perl5123delta.pod

files modified:
Cross/config.sh-arm-linux

Cross/config.sh-arm-linux-n770

INSTALL

MANIFEST

META.yml

Makefile.SH

NetWare/Makefile

NetWare/config_H.wc

Porting/config.sh

Porting/config_H

README.aix

README.haiku

README.os2

README.vms

README.vos

cpan/CGI/lib/CGI.pm

cpan/Module-Build/lib/Module/Build/Platform/cygwin.pm

dist/B-Deparse/Deparse.pm

dist/Module-CoreList/Changes

dist/Module-CoreList/lib/Module/CoreList.pm

dist/constant/t/constant.t

epoc/config.sh

epoc/createpkg.pl

ext/B/t/concise-xs.t

ext/Socket/Socket.pm

ext/Socket/Socket.xs

ext/VMS-Stdio/t/vms_stdio.t

gv.c

hints/catamount.sh

hints/vos.sh

lib/utf8_heavy.pl

patchlevel.h

perlio.c

plan9/config.plan9

plan9/config_sh.sample

pod.lst

pod/perl.pod

pod/perl5122delta.pod

pod/perldebguts.pod

pod/perlebcdic.pod

pod/perlhack.pod

pod/perlhist.pod

pod/perlpolicy.pod

pod/perlport.pod

pod/perlre.pod

pod/perlrebackslash.pod

pod/perlrecharclass.pod

pod/perlrepository.pod

pod/perlreref.pod

pod/perlunicode.pod

pod/perluniintro.pod

pod/perlvar.pod

pp_hot.c

t/op/sub_lval.t

t/re/regexp_unicode_prop.t

t/uni/class.t

vms/descrip_mms.template

vms/vms.c

win32/Makefile

win32/Makefile.ce

win32/makefile.mk

win32/pod.mak

pod/perlunicode.pod

from cover to cover, Perl does support many Unicode features.

People who want to learn to use Unicode in Perl, should probably read

L<the Perl Unicode tutorial, perlunitut|perlunitut>, before reading

the L<Perl Unicode tutorial, perlunitut|perlunitut>, before reading

this reference document.

Also, the use of Unicode may present security issues that aren't obvious.

Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.

=over 4

=item Input and Output Layers

102

semantics in a particular lexical scope. See L<bytes>.

100

103

101

104

The C<use feature 'unicode_strings'> pragma is intended to always, regardless

102

of platform, force Unicode semantics in a particular lexical scope. In

103

release 5.12, it is partially implemented, applying only to case changes.

105

of platform, force character (Unicode) semantics in a particular lexical scope.

106

In release 5.12, it is partially implemented, applying only to case changes.

104

107

See L</The "Unicode Bug"> below.
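As a rough illustration of the lexical scoping described above, the following sketch upper-cases the same one-byte string with and without the feature in scope. The variable names are made up for the example, and the behavior noted afterwards assumes an ASCII platform with no locale in effect.

```perl
use strict;
use warnings;

# "\xE0" is LATIN SMALL LETTER A WITH GRAVE in Latin-1. The string is held
# as a single byte, i.e. it is not internally UTF-8 encoded.
my $small = "\xE0";

# No feature in scope: the case change follows the platform's native rules.
my $native = uc($small);

my $unicode;
{
    use feature 'unicode_strings';   # lexically scoped to this block
    $unicode = uc($small);           # case change under Unicode rules
}

printf "native:  U+%04X\n", ord($native);
printf "unicode: U+%04X\n", ord($unicode);
```

Under those assumptions the second line should report U+00C0, while the first is expected to leave the character unchanged.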

105

108

106

109

The C<utf8> pragma is primarily a compatibility device that enables

180

183

181

184

=item *

182

185

183

Character classes in regular expressions match characters instead of

186

Bracketed character classes in regular expressions match characters instead of

184

187

bytes and match against the character properties specified in the

185

188

Unicode properties database. C<\w> can be used to match a Japanese

186

189

ideograph, for instance.

187

190

188

191

=item *

189

192

190

Named Unicode properties, scripts, and block ranges may be used like

191

character classes via the C<\p{}> "matches property" construct and

193

Named Unicode properties, scripts, and block ranges may be used (like bracketed

194

character classes) by using the C<\p{}> "matches property" construct and

192

195

the C<\P{}> negation, "doesn't match property".

193

196

See L</"Unicode Character Properties"> for more details.

194

197

261

264

262

265

=item *

263

266

264

You can define your own mappings to be used in lc(),

265

lcfirst(), uc(), and ucfirst() (or their string-inlined versions).

267

You can define your own mappings to be used in C<lc()>,

268

C<lcfirst()>, C<uc()>, and C<ucfirst()> (or their double-quoted string inlined

269

versions such as C<\U>).

266

270

See L</"User-Defined Case Mappings"> for more details.

267

271

268

272

=back

278

282

=head2 Unicode Character Properties

279

283

280

284

Most Unicode character properties are accessible by using regular expressions.

281

They are used like character classes via the C<\p{}> "matches property"

282

construct and the C<\P{}> negation, "doesn't match property".

283

284

For instance, C<\p{Uppercase}> matches any character with the Unicode

285

They are used (like bracketed character classes) by using the C<\p{}> "matches

286

property" construct and the C<\P{}> negation, "doesn't match property".

287

288

Note that the only time that Perl considers a sequence of individual code

289

points as a single logical character is in the C<\X> construct, already

290

mentioned above. Therefore "character" in this discussion means a single

291

Unicode code point.

292

293

For instance, C<\p{Uppercase}> matches any single character with the Unicode

285

294

"Uppercase" property, while C<\p{L}> matches any character with a

286

295

General_Category of "L" (letter) property. Brackets are not

287

required for single letter properties, so C<\p{L}> is equivalent to C<\pL>.

296

required for single letter property names, so C<\p{L}> is equivalent to C<\pL>.

288

297

289

More formally, C<\p{Uppercase}> matches any character whose Unicode Uppercase

290

property value is True, and C<\P{Uppercase}> matches any character whose

291

Uppercase property value is False, and they could have been written as

292

C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively

298

More formally, C<\p{Uppercase}> matches any single character whose Unicode

299

Uppercase property value is True, and C<\P{Uppercase}> matches any character

300

whose Uppercase property value is False, and they could have been written as

301

C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively.
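The equivalence between the short and the explicit forms can be checked with a small sketch like the one below. The sample characters are arbitrary choices for illustration, not taken from the documentation.

```perl
use strict;
use warnings;

# A, a, GREEK CAPITAL LETTER DELTA, GREEK SMALL LETTER DELTA, and a digit.
my @samples = ("A", "a", "\x{0394}", "\x{03B4}", "5");

for my $ch (@samples) {
    my $short    = $ch =~ /\A\p{Uppercase}\z/       ? 1 : 0;
    my $explicit = $ch =~ /\A\p{Uppercase=True}\z/  ? 1 : 0;
    my $negated  = $ch =~ /\A\P{Uppercase}\z/       ? 1 : 0;
    my $false    = $ch =~ /\A\p{Uppercase=False}\z/ ? 1 : 0;
    printf "U+%04X  \\p=%d  =True=%d  \\P=%d  =False=%d\n",
        ord($ch), $short, $explicit, $negated, $false;
}
```

For every sample, the first two columns should agree with each other, as should the last two.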

293

302

294

303

This formality is needed when properties are not binary, that is if they can

295

304

take on more values than just True and False. For example, the Bidi_Class (see

296

305

L</"Bidirectional Character Types"> below), can take on a number of different

297

306

values, such as Left, Right, Whitespace, and others. To match these, one needs

298

307

to specify the property name (Bidi_Class), and the value being matched against

299

(Left, Right, I<etc.>). This is done, as in the examples above, by having the

308

(Left, Right, etc.). This is done, as in the examples above, by having the

300

309

two components separated by an equal sign (or interchangeably, a colon), like

301

310

C<\p{Bidi_Class: Left}>.
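A minimal sketch of the compound form follows. It uses the short value names C<L>, C<R>, and C<WS> from the property table rather than the longer spellings in the prose, on the assumption that those are the names the running Perl accepts. The C<bidi_class> helper is hypothetical and exists only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Perl: names the Bidi_Class of the first
# character of a string, for three of the many possible values.
sub bidi_class {
    my ($ch) = @_;
    return "L"  if $ch =~ /\A\p{Bidi_Class=L}/;    # equals sign
    return "R"  if $ch =~ /\A\p{Bidi_Class: R}/;   # colon, with a space
    return "WS" if $ch =~ /\A\p{Bc=WS}/;           # short property name
    return "other";
}

# Latin letter, Hebrew alef, space, and a European digit.
printf "U+%04X %s\n", ord($_), bidi_class($_) for "A", "\x{05D0}", " ", "1";
```

The equals sign and the colon are interchangeable here, as the paragraph above says.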

302

311

403

412

404

413

Single-letter properties match all characters in any of the

405

414

two-letter sub-properties starting with the same letter.

406

C<LC> and C<L&> are special cases, which are aliases for the set of

407

C<Ll>, C<Lu>, and C<Lt>.

415

C<LC> and C<L&> are special cases, which are both aliases for the set consisting of everything matched by C<Ll>, C<Lu>, and C<Lt>.
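The relationship between the single-letter C<L> property and the cased-letter aliases can be sketched as below. The four sample characters are assumptions picked to cover C<Lu>, C<Ll>, C<Lt>, and C<Lo>.

```perl
use strict;
use warnings;

my %sample = (
    "A"        => "Lu",   # uppercase letter
    "a"        => "Ll",   # lowercase letter
    "\x{01C5}" => "Lt",   # titlecase digraph
    "\x{05D0}" => "Lo",   # Hebrew alef, a letter without case
);

for my $ch (sort keys %sample) {
    my $any_letter = $ch =~ /\A\pL\z/    ? 1 : 0;   # same as \p{L}
    my $cased_lc   = $ch =~ /\A\p{LC}\z/ ? 1 : 0;
    my $cased_amp  = $ch =~ /\A\p{L&}\z/ ? 1 : 0;
    printf "U+%04X (%s)  L=%d  LC=%d  L&=%d\n",
        ord($ch), $sample{$ch}, $any_letter, $cased_lc, $cased_amp;
}
```

All four samples should match C<\pL>, but only the first three should match C<\p{LC}> and C<\p{L&}>.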

408

416

409

417

Because Perl hides the need for the user to understand the internal

410

418

representation of Unicode characters, there is no need to implement

413

421

414

422

=head3 B<Bidirectional Character Types>

415

423

416

Because scripts differ in their directionality--Hebrew is

417

written right to left, for example--Unicode supplies these properties in

424

Because scripts differ in their directionality (Hebrew is

425

written right to left, for example) Unicode supplies these properties in

418

426

the Bidi_Class class:

419

427

420

428

Property Meaning

451

459

Hiragana or Katakana. There are many more.

452

460

453

461

The Unicode Script property gives what script a given character is in,

454

and can be matched with the compound form like C<\p{Script=Hebrew}> (short:

455

C<\p{sc=hebr}>). Perl furnishes shortcuts for all script names. You can omit

456

everything up through the equals (or colon), and simply write C<\p{Latin}> or

457

C<\P{Cyrillic}>.

462

and the property can be specified with the compound form like

463

C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>). Perl furnishes shortcuts for all

464

script names. You can omit everything up through the equals (or colon), and

465

simply write C<\p{Latin}> or C<\P{Cyrillic}>.
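The three spellings can be compared side by side in a short sketch such as this one. The choice of HEBREW LETTER ALEF and of the digit C<5> as test characters is an assumption made for the example.

```perl
use strict;
use warnings;

my $alef = "\x{05D0}";   # HEBREW LETTER ALEF

my $compound = $alef =~ /\A\p{Script=Hebrew}\z/ ? 1 : 0;   # compound form
my $short    = $alef =~ /\A\p{sc=hebr}\z/       ? 1 : 0;   # short names
my $shortcut = $alef =~ /\A\p{Hebrew}\z/        ? 1 : 0;   # script shortcut
my $not_cyr  = $alef =~ /\A\P{Cyrillic}\z/      ? 1 : 0;   # negated shortcut

# A digit is not in the Latin script; it belongs to the script called Common.
my $digit_latin  = "5" =~ /\A\p{Latin}\z/  ? 1 : 0;
my $digit_common = "5" =~ /\A\p{Common}\z/ ? 1 : 0;

print "alef:  compound=$compound short=$short shortcut=$shortcut not_cyrillic=$not_cyr\n";
print "digit: latin=$digit_latin common=$digit_common\n";
```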

458

466

459

467

A complete list of scripts and their shortcuts is in L<perluniprops>.

460

468

475

483

block is all characters whose ordinals are between 0 and 127, inclusive, in

476

484

other words, the ASCII characters. The "Latin" script contains some letters

477

485

from this block as well as several more, like "Latin-1 Supplement",

478

"Latin Extended-A", I<etc.>, but it does not contain all the characters from

486

"Latin Extended-A", etc., but it does not contain all the characters from

479

487

those blocks. It does not, for example, contain digits, because digits are

480

488

shared across many scripts. Digits and similar groups, like punctuation, are in

481

489

the script called C<Common>. There is also a script called C<Inherited> for

571

579

necessary to know some basics about decomposition.

572

580

Consider a character, say H. It could appear with various marks around it,

573

581

such as an acute accent, or a circumflex, or various hooks, circles, arrows,

574

I<etc.>, above, below, to one side and/or the other, I<etc.> There are many

582

I<etc.>, above, below, to one side and/or the other, etc. There are many

575

583

possibilities among the world's languages. The number of combinations is

576

584

astronomical, and if there were a character for each combination, it would

577

585

soon exhaust Unicode's more than a million possible characters. So Unicode
