554 554   Character Ranges and Classes
556     - Character ranges in regular expression character classes (C</[a-z]/>)
557     - and in the C<tr///> (also known as C<y///>) operator are not magically
558     - Unicode-aware. What this means is that C<[A-Za-z]> will not magically start
559     - to mean "all alphabetic letters"; not that it does mean that even for
560     - 8-bit characters, you should be using C</[[:alpha:]]/> in that case.
562     - For specifying character classes like that in regular expressions,
563     - you can use the various Unicode properties--C<\pL>, or perhaps
564     - C<\p{Alphabetic}>, in this particular case. You can use Unicode
565     - code points as the end points of character ranges, but there is no
566     - magic associated with specifying a certain range. For further
567     - information--there are dozens of Unicode character classes--see
    556 + Character ranges in regular expression bracketed character classes ( e.g.,
    557 + C</[a-z]/>) and in the C<tr///> (also known as C<y///>) operator are not
    558 + magically Unicode-aware. What this means is that C<[A-Za-z]> will not
    559 + magically start to mean "all alphabetic letters" (not that it does mean that
    560 + even for 8-bit characters; for those, if you are using locales (L<perllocale>),
    561 + use C</[[:alpha:]]/>; and if not, use the 8-bit-aware property C<\p{alpha}>).
    563 + All the properties that begin with C<\p> (and its inverse C<\P>) are actually
    564 + character classes that are Unicode-aware. There are dozens of them, see
    567 + You can use Unicode code points as the end points of character ranges, and the
    568 + range will include all Unicode code points that lie between those end points.
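As an illustration of the behaviour this hunk documents, here is a minimal sketch (not part of the patch) that contrasts an ASCII range, a C<\p> property, and a range with Unicode code points as end points. The variable names and sample characters are chosen only for this example.

```perl
use strict;
use warnings;

# Sample characters, written as escapes so the source stays ASCII.
my $e_acute = "\x{E9}";     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
my $alpha   = "\x{3B1}";    # U+03B1 GREEK SMALL LETTER ALPHA

# An ASCII range is just a range: it does not grow to cover other letters.
my $ascii_hits_alpha   = ($alpha   =~ /^[A-Za-z]$/) ? 1 : 0;
my $ascii_hits_e_acute = ($e_acute =~ /^[A-Za-z]$/) ? 1 : 0;

# A \p property is Unicode-aware.
my $prop_hits_alpha   = ($alpha   =~ /^\p{Alphabetic}$/) ? 1 : 0;
my $prop_hits_e_acute = ($e_acute =~ /^\p{Alphabetic}$/) ? 1 : 0;

# Code points as range end points: everything between them is included.
# U+0391 (capital alpha) .. U+03C9 (small omega) contains U+03B1.
my $range_hits_alpha = ($alpha =~ /^[\x{391}-\x{3C9}]$/) ? 1 : 0;

# The same kind of range works in tr///; with an empty replacement list
# and no modifiers it only counts the matching characters.
my $text        = "ab\x{3B1}\x{3B2}\x{3C9}";
my $greek_count = ($text =~ tr/\x{3B1}-\x{3C9}//);

print "[A-Za-z] matches alpha:        $ascii_hits_alpha\n";
print "[A-Za-z] matches e-acute:      $ascii_hits_e_acute\n";
print "\\p{Alphabetic} matches alpha:   $prop_hits_alpha\n";
print "\\p{Alphabetic} matches e-acute: $prop_hits_e_acute\n";
print "[U+0391-U+03C9] matches alpha: $range_hits_alpha\n";
print "Greek lowercase count via tr:  $greek_count\n";
```

The range in the last two checks carries no special meaning beyond its end points; whether it corresponds to a useful set of letters is up to whoever picks the end points.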
607 607   How Do I Know Whether My String Is In Unicode?
609 609   You shouldn't have to care. But you may, because currently the semantics of the
610     - characters whose ordinals are in the range 128 to 255 is different depending on
    610 + characters whose ordinals are in the range 128 to 255 are different depending on
611 611   whether the string they are contained within is in Unicode or not.
612 612   (See L<perlunicode/When Unicode Does Not Happen>.)
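The difference described here can be observed with a short sketch (again, not part of the patch). It assumes an ASCII platform and a script that does not enable C<use locale>, C<use feature 'unicode_strings'>, or C<use v5.12> or later, since any of those changes the rules being demonstrated. C<utf8::upgrade> and C<utf8::is_utf8> are core functions; the variable names are illustrative.

```perl
use strict;
use warnings;
# Deliberately absent: "use feature 'unicode_strings'", "use v5.12" (or
# later), and "use locale". Each of them would change the results below.

my $native   = "\xE9";      # ordinal 233, stored as a single byte
my $upgraded = $native;
utf8::upgrade($upgraded);   # same character, now stored as UTF-8 internally

# The two scalars hold the same character and compare equal...
my $same_string = ($native eq $upgraded) ? 1 : 0;

# ...but only one of them carries the internal flag...
my $native_flag   = utf8::is_utf8($native)   ? 1 : 0;
my $upgraded_flag = utf8::is_utf8($upgraded) ? 1 : 0;

# ...and under the default regex rules, \w treats them differently.
my $native_is_word   = ($native   =~ /\w/) ? 1 : 0;
my $upgraded_is_word = ($upgraded =~ /\w/) ? 1 : 0;

print "strings compare equal:    $same_string\n";
print "flag on native copy:      $native_flag\n";
print "flag on upgraded copy:    $upgraded_flag\n";
print "native copy matches \\w:   $native_is_word\n";
print "upgraded copy matches \\w: $upgraded_is_word\n";
```

Adding the C</u> modifier to the patterns (or enabling C<unicode_strings>) would make both copies match, which is one way to stop caring about the flag.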
622 622   return the value of the internal "utf8ness" flag attached to the
623 623   C<$string>. If the flag is off, the bytes in the scalar are interpreted
624 624   as a single byte encoding. If the flag is on, the bytes in the scalar
625     - are interpreted as the (multi-byte, variable-length) UTF-8 encoded code
626     - points of the characters. Bytes added to a UTF-8 encoded string are
    625 + are interpreted as the (variable-length, potentially multi-byte) UTF-8 encoded
    626 + code points of the characters. Bytes added to a UTF-8 encoded string are
627 627   automatically upgraded to UTF-8. If mixed non-UTF-8 and UTF-8 scalars
628 628   are merged (double-quoted interpolation, explicit concatenation, and
629 629   printf/sprintf parameter substitution), the result will be UTF-8 encoded