1
<?xml version="1.0" encoding="latin1" ?>
2
<!DOCTYPE erlref SYSTEM "erlref.dtd">
9
<holder>Ericsson AB, All Rights Reserved</holder>
12
The contents of this file are subject to the Erlang Public License,
13
Version 1.1, (the "License"); you may not use this file except in
14
compliance with the License. You should have received a copy of the
15
Erlang Public License along with this software. If not, it can be
16
retrieved online at http://www.erlang.org/.
18
Software distributed under the License is distributed on an "AS IS"
19
basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See
20
the License for the specific language governing rights and limitations
23
The Initial Developer of the Original Code is Ericsson AB.
32
<module>unicode</module>
33
<modulesummary>Functions for converting Unicode characters</modulesummary>
35
<p>This module contains functions for converting between different character representations. Basically it converts between iso-latin-1 characters and Unicode ditto, but it can also convert between different Unicode encodings (like UTF-8, UTF-16 and UTF-32).</p>
36
<p>The default Unicode encoding in Erlang is in binaries UTF-8, which is also the format in which built in functions and libraries in OTP expect to find binary Unicode data. In lists, Unicode data is encoded as integers, each integer representing one character and encoded simply as the Unicode codepoint for the character.</p>
37
<p>Other Unicode encodings than integers representing codepoints or UTF-8 in binaries are referred to as "external encodings". The iso-latin-1 encoding is in binaries and lists referred to as latin1-encoding.</p>
38
<p>It is recommended to only use external encodings for communication with external entities where this is required. When working inside the Erlang/OTP environment, it is recommended to keep binaries in UTF-8 when representing Unicode characters. Latin1 encoding is supported both for backward compatibility and for communication with external entities not supporting Unicode character sets.</p>
42
<title>DATA TYPES</title>
43
<marker id="charlist_definition"></marker>
45
unicode_binary() = binary() with characters encoded in UTF-8 coding standard
46
unicode_char() = integer() representing valid unicode codepoint
48
chardata() = charlist() | unicode_binary()
50
charlist() = [unicode_char() | unicode_binary() | charlist()]
51
a unicode_binary is allowed as the tail of the list</code>
54
external_unicode_binary() = binary() with characters coded in a user specified Unicode encoding other than UTF-8 (UTF-16 or UTF-32)
56
external_chardata() = external_charlist() | external_unicode_binary()
58
external_charlist() = [unicode_char() | external_unicode_binary() | external_charlist()]
59
a external_unicode_binary is allowed as the tail of the list</code>
62
latin1_binary() = binary() with characters coded in iso-latin-1
63
latin1_char() = integer() representing valid latin1 character (0-255)
65
latin1_chardata() = latin1_charlist() | latin1_binary()
67
latin1_charlist() = [latin1_char() | latin1_binary() | latin1_charlist()]
68
a latin1_binary is allowed as the tail of the list</code>
72
<name>erlang:characters_to_list(Data, Encoding) -> list() | {error, list(), RestData} | {incomplete, list(), binary()} </name>
73
<fsummary>Convert a collection of characters to list of Unicode characters</fsummary>
75
<v>Data = ListData | binary()</v>
76
<v>RestData = ListData | binary()</v>
77
<v>ListData = [ int() | binary() ] (binary allowed as tail of list)</v>
78
<v>Encoding = unicode | latin1</v>
82
<p>This function converts a possibly deep list of integers and
83
binaries into a list of integers representing unicode
84
characters. The binaries in the input may have characters
85
encoded as latin1 (0 - 255, one character per byte), in which
86
case the <c>Encoding</c> parameter should be given as
87
<c>latin1</c>, or have characters encoded as UTF-8, in
88
which case the <c>Encoding</c> should be given as
89
<c>unicode</c>. Only when the <c>Encoding</c> is <c>unicode</c>,
90
integers in the list are allowed to be grater than 255.</p>
92
<p>If <c>Encoding</c> is <c>latin1</c>, the <c>Data</c> parameter
93
corresponds to the <c>iodata()</c> type, but for <c>unicode</c>,
94
the <c>Data</c> parameter can contain integers greater than 255
95
(unicode characters beyond the iso-latin-1 range), which would
96
make it invalid as <c>iodata()</c>.</p>
98
<p>The purpose of the function is mainly to be able to convert
99
combinations of unicode characters into a pure unicode
100
string in list representation for further processing. For
101
writing the data to an external entity, the reverse function
103
marker="#erlang:characters_to_utf8/2">erlang:characters_to_utf8/2</seealso>
106
<p>If for some reason, the data cannot be converted, either
107
because of illegal unicode/latin1 characters in the list, or
108
because of invalid UTF-8 encoding in any binaries, an error
109
tuple is returned. The error tuple contains the tag
110
<c>error</c>, a list representing the characters that could be
111
converted before the error occured and a representation of the
112
characters including and after the offending integer/bytes. The
113
last part is mostly for debugging as it still constitutes a
114
possibly deep and/or mixed list, not necessarily of the same
115
depth as the original data. The error occurs when traversing the
116
list and whatever's left to decode is simply returned as is.</p>
118
<p>However, if the input <c>Data</c> is a pure binary, the third
119
part of the error tuple is guaranteed to be a binary as
122
<p>Errors occur for the following reasons:</p>
123
<list type="bulleted">
125
<item>Integers out of range - If <c>Encoding</c> is
126
<c>latin1</c>, an error occurs whenever an integer greater
127
than 255 is found in the lists. If <c>Encoding</c> is
128
unicode, error occurs whenever an integer greater than
129
<c>16#10FFFF</c> (the maximum unicode character) or in the
130
range <c>16#D800</c> to <c>16#DFFF</c> (invalid unicode
131
range) is found.</item>
133
<item>UTF-8 encoding incorrect - If <c>Encoding</c> is
134
<c>unicode</c>, the bytes in any binaries have to be valid
135
UTF-8. Errors can occur for various
136
reasons, including "pure" decoding errors
138
bits of the bytes being wrong), the bytes are decoded to a
139
too large number, the bytes are decoded to a code-point in the
141
range or encoding is "overlong", meaning that a
142
number should have been encoded in fewer bytes. The
143
case of a truncated UTF-8 is handled specially, see the
144
paragraph about incomplete binaries below. If
145
<c>Encoding</c> is <c>latin1</c>, binaries are always valid
146
as long as they contain whole bytes,
147
as each byte falls into the valid iso-latin-1 range.</item>
151
<p>A special type of error is when no actual invalid integers or
152
bytes are found, but a trailing <c>binary()</c> consists of too
153
few bytes to decode the last character. This error might occur
154
if bytes are read from a file in chunks or binaries in other
155
ways are split on non UTF-8 boundaries. In this case an
156
<c>incomplete</c> tuple is returned instead of the <c>error</c>
157
tuple. It consists of the same parts as the <c>error</c> tuple, but
158
the tag is <c>incomplete</c> instead of <c>error</c> and the
159
last element is always guaranteed to be a binary consisting of
160
the first part of a (so far) valid UTF-8 character.</p>
162
<p>If one UTF-8 characters is split over two consecutive
163
binaries in the <c>Data</c>, the conversion succeeds. This means
164
that a character can be decoded from a range of binaries as long
165
as the whole range is given as input without errors
166
occuring. Example:</p>
170
case erlang:characters_to_list(Data,unicode) of
171
{inclomplete,Encoded, Rest} ->
172
More = get_some_more_data(),
173
Encoded ++ decode_data([Rest, More]);
174
{error,Encoded,Rest} ->
175
handle_error(Encoded,Rest);
180
<p>Bit-strings that are not whole bytes are however not allowed,
181
so a UTF-8 character has to be split along 8-bit boundaries to
184
<p>If any parameters are of the wrong type, the list structure
185
is invalid (a number as tail) or the binaries does not contain
186
whole bytes (bit-strings), a <c>badarg</c> exception is
193
<name>erlang:characters_to_utf8(Data, Encoding) -> binary() | {error, binary(), RestData} | {incomplete, binary(), binary()} </name>
194
<fsummary>Convert a collection of characters to an UTF-8 binary</fsummary>
196
<v>Data = ListData | binary()</v>
197
<v>RestData = ListData | binary()</v>
198
<v>ListData = [ int() | binary() ] (binary allowed as tail of list)</v>
199
<v>Encoding = unicode | latin1</v>
203
<p>This function behaves as <seealso
204
marker="#erlang:characters_to_list/2">erlang:characters_to_list/2</seealso>,
205
but produces an UTF-8 binary instead of a unicode list. Note
206
that even if <c>Encoding</c> is given as <c>latin1</c>, the
207
output is UTF-8. The <c>Encoding</c> defines how input is to be
208
interpreted, not what output is generated. To convert a possibly
209
deep list of iso-latin-1 characters to a iso-latin-1 binary, use
211
marker="#iolist_to_binary/1">iolist_to_binary/1</seealso>.</p>
213
<p>Errors and exceptions occur as in <seealso
214
marker="#erlang:characters_to_list/2">erlang:characters_to_list/2</seealso>,
215
but of course the second element in the <c>error</c> or
216
<c>incomplete</c> tuple will be a <c>binary()</c> and not a