11
11
from cover to cover, Perl does support many Unicode features.
13
13
People who want to learn to use Unicode in Perl, should probably read
14
L<the Perl Unicode tutorial, perlunitut|perlunitut>, before reading
14
the L<Perl Unicode tutorial, perlunitut|perlunitut>, before reading
15
15
this reference document.
17
Also, the use of Unicode may present security issues that aren't obvious.
18
Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
19
22
=item Input and Output Layers
99
102
semantics in a particular lexical scope. See L<bytes>.
101
104
The C<use feature 'unicode_strings'> pragma is intended to always, regardless
102
of platform, force Unicode semantics in a particular lexical scope. In
103
release 5.12, it is partially implemented, applying only to case changes.
105
of platform, force character (Unicode) semantics in a particular lexical scope.
106
In release 5.12, it is partially implemented, applying only to case changes.
104
107
See L</The "Unicode Bug"> below.
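
As a minimal sketch of the case-change behavior described above, assuming an
ASCII platform, no C<use locale>, and a string that Perl holds internally as
bytes (the variable names are illustrative only):

    use strict;
    use warnings;

    # "\xE0" is LATIN SMALL LETTER A WITH GRAVE, held here as a single byte.
    my $s = "\xE0";

    # Outside the feature's scope, uc() applies ASCII-only rules to this string.
    my $native = uc $s;

    # Inside the scope, uc() applies Unicode rules to the very same string.
    my $unicode = do { use feature 'unicode_strings'; uc $s };

    printf "native: %X, unicode_strings: %X\n", ord $native, ord $unicode;

Under those assumptions the first result should be left unchanged, while the
second should be U+00C0, LATIN CAPITAL LETTER A WITH GRAVE. Note that
C<use v5.12> or later enables this feature for the whole enclosing scope, so
the sketch deliberately avoids it.
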
106
109
The C<utf8> pragma is primarily a compatibility device that enables
183
Character classes in regular expressions match characters instead of
186
Bracketed character classes in regular expressions match characters instead of
184
187
bytes and match against the character properties specified in the
185
188
Unicode properties database. C<\w> can be used to match a Japanese
186
189
ideograph, for instance.
190
Named Unicode properties, scripts, and block ranges may be used like
191
character classes via the C<\p{}> "matches property" construct and
193
Named Unicode properties, scripts, and block ranges may be used (like bracketed
194
character classes) by using the C<\p{}> "matches property" construct and
192
195
the C<\P{}> negation, "doesn't match property".
193
196
See L</"Unicode Character Properties"> for more details.
264
You can define your own mappings to be used in lc(),
265
lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
267
You can define your own mappings to be used in C<lc()>,
268
C<lcfirst()>, C<uc()>, and C<ucfirst()> (or their double-quoted string inlined
269
versions such as C<\U>).
266
270
See L</"User-Defined Case Mappings"> for more details.
278
282
=head2 Unicode Character Properties
280
284
Most Unicode character properties are accessible by using regular expressions.
281
They are used like character classes via the C<\p{}> "matches property"
282
construct and the C<\P{}> negation, "doesn't match property".
284
For instance, C<\p{Uppercase}> matches any character with the Unicode
285
They are used (like bracketed character classes) by using the C<\p{}> "matches
286
property" construct and the C<\P{}> negation, "doesn't match property".
288
Note that the only time that Perl considers a sequence of individual code
289
points as a single logical character is in the C<\X> construct, already
290
mentioned above. Therefore "character" in this discussion means a single
293
For instance, C<\p{Uppercase}> matches any single character with the Unicode
285
294
"Uppercase" property, while C<\p{L}> matches any character with a
286
295
General_Category of "L" (letter) property. Brackets are not
287
required for single letter properties, so C<\p{L}> is equivalent to C<\pL>.
296
required for single letter property names, so C<\p{L}> is equivalent to C<\pL>.
289
More formally, C<\p{Uppercase}> matches any character whose Unicode Uppercase
290
property value is True, and C<\P{Uppercase}> matches any character whose
291
Uppercase property value is False, and they could have been written as
292
C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively
298
More formally, C<\p{Uppercase}> matches any single character whose Unicode
299
Uppercase property value is True, and C<\P{Uppercase}> matches any character
300
whose Uppercase property value is False, and they could have been written as
301
C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively.
294
303
This formality is needed when properties are not binary, that is if they can
295
304
take on more values than just True and False. For example, the Bidi_Class (see
296
305
L</"Bidirectional Character Types"> below), can take on a number of different
297
306
values, such as Left, Right, Whitespace, and others. To match these, one needs
298
307
to specify the property name (Bidi_Class), and the value being matched against
299
(Left, Right, I<etc.>). This is done, as in the examples above, by having the
308
(Left, Right, etc.). This is done, as in the examples above, by having the
300
309
two components separated by an equal sign (or interchangeably, a colon), like
301
310
C<\p{Bidi_Class: Left}>.
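
The following is a small sketch of the single and compound forms, not a
definitive reference. The sample string and variable names are made up for
illustration, and the Bidi_Class value is written with the names C<L> and
C<Left_To_Right>, as L<perluniprops> lists them; that spelling is an
assumption of this example rather than something taken from the text above.

    use strict;
    use warnings;

    # a, B, GREEK SMALL LETTER ALPHA, GREEK CAPITAL LETTER ALPHA, digit 1
    my $str = "aB\x{3B1}\x{391}1";

    my @upper      = $str =~ /\p{Uppercase}/g;       # single (binary) form
    my @upper_long = $str =~ /\p{Uppercase=True}/g;  # equivalent compound form
    my @not_upper  = $str =~ /\P{Uppercase}/g;       # negation
    my @letters    = $str =~ /\pL/g;                 # same as \p{L}

    # Non-binary property: both the name and the value are given.
    my $a_is_ltr = "A" =~ /\p{Bidi_Class: Left_To_Right}/ ? 1 : 0;
    my $one_is_l = "1" =~ /\p{Bidi_Class=L}/              ? 1 : 0;

    printf "upper=%d upper_long=%d not_upper=%d letters=%d\n",
        scalar @upper, scalar @upper_long, scalar @not_upper, scalar @letters;
    print "A is left-to-right: $a_is_ltr; 1 is left-to-right: $one_is_l\n";

Only counts and flags are printed, so no output layer is needed for the wide
characters in the sample string.
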
404
413
Single-letter properties match all characters in any of the
405
414
two-letter sub-properties starting with the same letter.
406
C<LC> and C<L&> are special cases, which are aliases for the set of
407
C<Ll>, C<Lu>, and C<Lt>.
415
C<LC> and C<L&> are special cases, which are both aliases for the set consisting of everything matched by C<Ll>, C<Lu>, and C<Lt>.
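
A brief sketch of how C<LC> and C<L&> relate to the single-letter C<L>
property; the sample characters are chosen only for illustration:

    use strict;
    use warnings;

    my @chars = (
        "a",          # Ll, lowercase letter
        "A",          # Lu, uppercase letter
        "\x{1C5}",    # Lt, LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
        "\x{5D0}",    # Lo, HEBREW LETTER ALEF (a letter, but not a cased one)
        "5",          # Nd, not a letter at all
    );

    my @cased     = grep { /\p{LC}/ } @chars;   # Ll, Lu, and Lt only
    my @cased_amp = grep { /\p{L&}/ } @chars;   # alias for the same set
    my @letters   = grep { /\pL/ }    @chars;   # every L* sub-property

    printf "LC=%d L&=%d L=%d\n",
        scalar @cased, scalar @cased_amp, scalar @letters;
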
409
417
Because Perl hides the need for the user to understand the internal
410
418
representation of Unicode characters, there is no need to implement
451
459
Hiragana or Katakana. There are many more.
453
461
The Unicode Script property gives what script a given character is in,
454
and can be matched with the compound form like C<\p{Script=Hebrew}> (short:
455
C<\p{sc=hebr}>). Perl furnishes shortcuts for all script names. You can omit
456
everything up through the equals (or colon), and simply write C<\p{Latin}> or
462
and the property can be specified with the compound form like
463
C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>). Perl furnishes shortcuts for all
464
script names. You can omit everything up through the equals (or colon), and
465
simply write C<\p{Latin}> or C<\P{Cyrillic}>.
459
467
A complete list of scripts and their shortcuts is in L<perluniprops>.
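
The forms above can be sketched as follows. The sample characters and variable
names are illustrative only, and the C<Common> check assumes that ASCII digits
are assigned to that script.

    use strict;
    use warnings;

    my $alef = "\x{5D0}";   # HEBREW LETTER ALEF
    my $zhe  = "\x{436}";   # CYRILLIC SMALL LETTER ZHE

    # Compound form, its short form, and the single-form shortcut.
    my $compound = $alef =~ /\p{Script=Hebrew}/ ? 1 : 0;
    my $short    = $alef =~ /\p{sc=hebr}/       ? 1 : 0;
    my $shortcut = $alef =~ /\p{Hebrew}/        ? 1 : 0;

    my $latin_a      = "a"  =~ /\p{Latin}/    ? 1 : 0;
    my $a_not_cyr    = "a"  =~ /\P{Cyrillic}/ ? 1 : 0;
    my $zhe_cyrillic = $zhe =~ /\p{Cyrillic}/ ? 1 : 0;

    # Digits are shared between scripts, so they are not in Latin.
    my $digit_latin  = "5" =~ /\p{Latin}/  ? 1 : 0;
    my $digit_common = "5" =~ /\p{Common}/ ? 1 : 0;

    print "Hebrew: $compound $short $shortcut\n";
    print "Latin a: $latin_a; a not Cyrillic: $a_not_cyr; zhe: $zhe_cyrillic\n";
    print "digit in Latin: $digit_latin; digit in Common: $digit_common\n";
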
475
483
block is all characters whose ordinals are between 0 and 127, inclusive, in
476
484
other words, the ASCII characters. The "Latin" script contains some letters
477
485
from this block as well as several more, like "Latin-1 Supplement",
478
"Latin Extended-A", I<etc.>, but it does not contain all the characters from
486
"Latin Extended-A", etc., but it does not contain all the characters from
479
487
those blocks. It does not, for example, contain digits, because digits are
480
488
shared across many scripts. Digits and similar groups, like punctuation, are in
481
489
the script called C<Common>. There is also a script called C<Inherited> for
571
579
necessary to know some basics about decomposition.
572
580
Consider a character, say H. It could appear with various marks around it,
573
581
such as an acute accent, or a circumflex, or various hooks, circles, arrows,
574
I<etc.>, above, below, to one side and/or the other, I<etc.> There are many
582
I<etc.>, above, below, to one side and/or the other, etc. There are many
575
583
possibilities among the world's languages. The number of combinations is
576
584
astronomical, and if there were a character for each combination, it would
577
585
soon exhaust Unicode's more than a million possible characters. So Unicode