~ubuntu-branches/ubuntu/precise/perl/precise

Viewing changes to pod/perluniintro.pod

Committer: Bazaar Package Importer
Author(s): Niko Tyni
Date: 2011-02-06 11:31:38 UTC
mto: (8.2.12 experimental) (1.1.12)
mto: This revision was merged to the branch mainline in revision 46.
Revision ID: james.westby@ubuntu.com-20110206113138-lzpm3g6rur7i3eyp

Tags: upstream-5.12.3

Import upstream version 5.12.3

files added:
cpan/CGI/t/headers.t

cpan/CGI/t/multipart_init.t

pod/perl5123delta.pod

files modified:
Cross/config.sh-arm-linux

Cross/config.sh-arm-linux-n770

INSTALL

MANIFEST

META.yml

Makefile.SH

NetWare/Makefile

NetWare/config_H.wc

Porting/config.sh

Porting/config_H

README.aix

README.haiku

README.os2

README.vms

README.vos

cpan/CGI/lib/CGI.pm

cpan/Module-Build/lib/Module/Build/Platform/cygwin.pm

dist/B-Deparse/Deparse.pm

dist/Module-CoreList/Changes

dist/Module-CoreList/lib/Module/CoreList.pm

dist/constant/t/constant.t

epoc/config.sh

epoc/createpkg.pl

ext/B/t/concise-xs.t

ext/Socket/Socket.pm

ext/Socket/Socket.xs

ext/VMS-Stdio/t/vms_stdio.t

gv.c

hints/catamount.sh

hints/vos.sh

lib/utf8_heavy.pl

patchlevel.h

perlio.c

plan9/config.plan9

plan9/config_sh.sample

pod.lst

pod/perl.pod

pod/perl5122delta.pod

pod/perldebguts.pod

pod/perlebcdic.pod

pod/perlhack.pod

pod/perlhist.pod

pod/perlpolicy.pod

pod/perlport.pod

pod/perlre.pod

pod/perlrebackslash.pod

pod/perlrecharclass.pod

pod/perlrepository.pod

pod/perlreref.pod

pod/perlunicode.pod

pod/perluniintro.pod

pod/perlvar.pod

pp_hot.c

t/op/sub_lval.t

t/re/regexp_unicode_prop.t

t/uni/class.t

vms/descrip_mms.template

vms/vms.c

win32/Makefile

win32/Makefile.ce

win32/makefile.mk

win32/pod.mak

pod/perluniintro.pod

@@ -553,19 +553,19 @@

 Character Ranges and Classes

-Character ranges in regular expression character classes (C</[a-z]/>)
-and in the C<tr///> (also known as C<y///>) operator are not magically
-Unicode-aware. What this means is that C<[A-Za-z]> will not magically start
-to mean "all alphabetic letters"; not that it does mean that even for
-8-bit characters, you should be using C</[[:alpha:]]/> in that case.
-
-For specifying character classes like that in regular expressions,
-you can use the various Unicode properties--C<\pL>, or perhaps
-C<\p{Alphabetic}>, in this particular case. You can use Unicode
-code points as the end points of character ranges, but there is no
-magic associated with specifying a certain range. For further
-information--there are dozens of Unicode character classes--see
-L<perlunicode>.
+Character ranges in regular expression bracketed character classes ( e.g.,
+C</[a-z]/>) and in the C<tr///> (also known as C<y///>) operator are not
+magically Unicode-aware. What this means is that C<[A-Za-z]> will not
+magically start to mean "all alphabetic letters" (not that it does mean that
+even for 8-bit characters; for those, if you are using locales (L<perllocale>),
+use C</[[:alpha:]]/>; and if not, use the 8-bit-aware property C<\p{alpha}>).
+
+All the properties that begin with C<\p> (and its inverse C<\P>) are actually
+character classes that are Unicode-aware. There are dozens of them, see
+L<perluniprops>.
+
+You can use Unicode code points as the end points of character ranges, and the
+range will include all Unicode code points that lie between those end points.

 =item *


@@ -607,7 +607,7 @@
 How Do I Know Whether My String Is In Unicode?

 You shouldn't have to care. But you may, because currently the semantics of the
-characters whose ordinals are in the range 128 to 255 is different depending on
+characters whose ordinals are in the range 128 to 255 are different depending on
 whether the string they are contained within is in Unicode or not.
 (See L<perlunicode/When Unicode Does Not Happen>.)


@@ -622,8 +622,8 @@
 return the value of the internal "utf8ness" flag attached to the
 C<$string>. If the flag is off, the bytes in the scalar are interpreted
 as a single byte encoding. If the flag is on, the bytes in the scalar
-are interpreted as the (multi-byte, variable-length) UTF-8 encoded code
-points of the characters. Bytes added to a UTF-8 encoded string are
+are interpreted as the (variable-length, potentially multi-byte) UTF-8 encoded
+code points of the characters. Bytes added to a UTF-8 encoded string are
 automatically upgraded to UTF-8. If mixed non-UTF-8 and UTF-8 scalars
 are merged (double-quoted interpolation, explicit concatenation, and
 printf/sprintf parameter substitution), the result will be UTF-8 encoded

@@ -648,6 +648,7 @@
     use bytes;
     print length($unicode), "\n"; # will also print 2
                                   # (the 0xC4 0x80 of the UTF-8)
+    no bytes;

 =item *

