~rdoering/ubuntu/karmic/erlang/fix-535090

This module contains functions for converting between different character representations. Its main purpose is to convert between iso-latin-1 characters and Unicode characters, but it can also convert between different Unicode encodings (such as UTF-8, UTF-16 and UTF-32).

In binaries, the default Unicode encoding in Erlang is UTF-8, which is also the format in which built-in functions and libraries in OTP expect to find binary Unicode data. In lists, Unicode data is encoded as integers, each integer representing one character and holding the Unicode code point of that character.

Unicode encodings other than code point integers in lists or UTF-8 in binaries are referred to as "external encodings". The iso-latin-1 encoding, in both binaries and lists, is referred to as latin1 encoding.

It is recommended to only use external encodings for communication with external entities where this is required. When working inside the Erlang/OTP environment, it is recommended to keep binaries in UTF-8 when representing Unicode characters. Latin1 encoding is supported both for backward compatibility and for communication with external entities not supporting Unicode character sets.

</description>

<title>DATA TYPES</title>

unicode_binary() = binary() with characters encoded in UTF-8 coding standard

unicode_char() = integer() representing valid unicode codepoint

chardata() = charlist() | unicode_binary()

charlist() = [unicode_char() | unicode_binary() | charlist()]

a unicode_binary is allowed as the tail of the list</code>

external_unicode_binary() = binary() with characters coded in a user specified Unicode encoding other than UTF-8 (UTF-16 or UTF-32)

external_chardata() = external_charlist() | external_unicode_binary()

external_charlist() = [unicode_char() | external_unicode_binary() | external_charlist()]

an external_unicode_binary is allowed as the tail of the list</code>

latin1_binary() = binary() with characters coded in iso-latin-1

latin1_char() = integer() representing valid latin1 character (0-255)

latin1_chardata() = latin1_charlist() | latin1_binary()

latin1_charlist() = [latin1_char() | latin1_binary() | latin1_charlist()]

a latin1_binary is allowed as the tail of the list</code>

</section>

<funcs>

<func>

<name>erlang:characters_to_list(Data, Encoding) -> list() | {error, list(), RestData} | {incomplete, list(), binary()} </name>

<fsummary>Convert a collection of characters to a list of Unicode characters</fsummary>

<type>

<v>Data = ListData | binary()</v>

<v>RestData = ListData | binary()</v>

<v>ListData = [ int() | binary() ] (binary allowed as tail of list)</v>

<v>Encoding = unicode | latin1</v>

</type>

<desc>

This function converts a possibly deep list of integers and
binaries into a list of integers representing unicode
characters. The binaries in the input may have characters
encoded as latin1 (0 - 255, one character per byte), in which
case the <c>Encoding</c> parameter should be given as
<c>latin1</c>, or have characters encoded as UTF-8, in
which case the <c>Encoding</c> should be given as
<c>unicode</c>. Only when the <c>Encoding</c> is <c>unicode</c>
are integers in the list allowed to be greater than 255.

If <c>Encoding</c> is <c>latin1</c>, the <c>Data</c> parameter
corresponds to the <c>iodata()</c> type, but for <c>unicode</c>,
the <c>Data</c> parameter can contain integers greater than 255
(unicode characters beyond the iso-latin-1 range), which would
make it invalid as <c>iodata()</c>.
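
As an illustration, the following is a minimal sketch of how the two
<c>Encoding</c> values treat the same bytes. It assumes that the function
is reachable as <c>unicode:characters_to_list/2</c>, the name exported by
released Erlang/OTP, rather than under the <c>erlang:</c> prefix used in
this text. The module name <c>conv_sketch</c> and its function names are
hypothetical.

<code>
-module(conv_sketch).
-export([as_unicode/0, as_latin1/0]).

%% ASSUMPTION: unicode:characters_to_list/2 stands in for the
%% erlang:characters_to_list/2 described in this text.

%% The three bytes E2 82 AC are the UTF-8 encoding of U+20AC.
%% The binary is built with list_to_binary/1 only to keep this
%% example free of markup-sensitive characters.
euro_bytes() ->
    list_to_binary([16#E2, 16#82, 16#AC]).

%% Deep, mixed input decoded as unicode: the binary is read as UTF-8
%% and the integer 16#20AC is accepted although it exceeds 255.
%% Returns [$a, 16#20AC, 16#20AC, $b].
as_unicode() ->
    unicode:characters_to_list([$a, [euro_bytes(), [16#20AC]], $b], unicode).

%% The same bytes decoded as latin1: one character per byte.
%% Returns [$a, 16#E2, 16#82, 16#AC].
as_latin1() ->
    unicode:characters_to_list([$a, euro_bytes()], latin1).
</code>

Passing the list given to <c>as_unicode/0</c> with <c>latin1</c> instead
would not be valid input, since it contains an integer above 255.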

The purpose of the function is mainly to be able to convert
combinations of unicode characters into a pure unicode
string in list representation for further processing. For
writing the data to an external entity, the reverse function
<seealso marker="#erlang:characters_to_utf8/2">erlang:characters_to_utf8/2</seealso>
comes in handy.

If for some reason the data cannot be converted, either
because of illegal unicode/latin1 characters in the list or
because of invalid UTF-8 encoding in any binaries, an error
tuple is returned. The error tuple contains the tag
<c>error</c>, a list representing the characters that could be
converted before the error occurred, and a representation of the
characters including and after the offending integer/bytes. The
last part is mostly for debugging, as it still constitutes a
possibly deep and/or mixed list, not necessarily of the same
depth as the original data. The error occurs when traversing the
list, and whatever is left to decode is simply returned as is.

However, if the input <c>Data</c> is a pure binary, the third
part of the error tuple is guaranteed to be a binary as
well.

Errors occur for the following reasons:

<list>

<item>Integers out of range - If <c>Encoding</c> is
<c>latin1</c>, an error occurs whenever an integer greater
than 255 is found in the lists. If <c>Encoding</c> is
<c>unicode</c>, an error occurs whenever an integer greater than
<c>16#10FFFF</c> (the maximum unicode character) or in the
range <c>16#D800</c> to <c>16#DFFF</c> (invalid unicode
range) is found.</item>

<item>UTF-8 encoding incorrect - If <c>Encoding</c> is
<c>unicode</c>, the bytes in any binaries have to be valid
UTF-8. Errors can occur for various reasons, including "pure"
decoding errors (like the upper bits of the bytes being wrong),
the bytes being decoded to a too large number, the bytes being
decoded to a code point in the invalid unicode range, or the
encoding being "overlong", meaning that a number should have been
encoded in fewer bytes. The case of a truncated UTF-8 sequence is
handled specially, see the paragraph about incomplete binaries
below. If <c>Encoding</c> is <c>latin1</c>, binaries are always
valid as long as they contain whole bytes, as each byte falls
into the valid iso-latin-1 range.</item>

</list>
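
The error cases listed above can be sketched as follows. This is an
illustration under the assumption that the function is reachable as
<c>unicode:characters_to_list/2</c>, the name exported by released
Erlang/OTP, rather than under the <c>erlang:</c> prefix used in this
text. The module name <c>error_sketch</c> and its function names are
hypothetical.

<code>
-module(error_sketch).
-export([latin1_out_of_range/0, invalid_range/0, overlong/0]).

%% ASSUMPTION: unicode:characters_to_list/2 stands in for the
%% erlang:characters_to_list/2 described in this text.

%% 256 does not fit in latin1. Expected shape: {error, "a", Rest},
%% where Rest describes the input from the offending integer on.
latin1_out_of_range() ->
    unicode:characters_to_list([$a, 256, $b], latin1).

%% 16#D800 lies in the invalid range 16#D800 to 16#DFFF.
%% Expected shape: {error, "a", Rest}.
invalid_range() ->
    unicode:characters_to_list([$a, 16#D800, $b], unicode).

%% The bytes C0 80 are an overlong encoding of code point 0. The
%% input is a pure binary, so the third element of the error tuple
%% is a binary as well, holding the bytes C0 80.
overlong() ->
    unicode:characters_to_list(list_to_binary([$a, 16#C0, 16#80]), unicode).
</code>

For list input the exact shape of the third element is not specified, so
callers should match on the tag and the converted prefix only.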

A special type of error is when no actual invalid integers or
bytes are found, but a trailing <c>binary()</c> consists of too
few bytes to decode the last character. This error might occur
if bytes are read from a file in chunks or if binaries are split
in other ways on non-UTF-8 boundaries. In this case an
<c>incomplete</c> tuple is returned instead of the <c>error</c>
tuple. It consists of the same parts as the <c>error</c> tuple, but
the tag is <c>incomplete</c> instead of <c>error</c> and the
last element is always guaranteed to be a binary consisting of
the first part of a (so far) valid UTF-8 character.
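
A truncated binary can be sketched as follows. This is an illustration
under the assumption that the function is reachable as
<c>unicode:characters_to_list/2</c>, the name exported by released
Erlang/OTP, rather than under the <c>erlang:</c> prefix used in this
text. The module name <c>incomplete_sketch</c> and the function name
<c>truncated/0</c> are hypothetical.

<code>
-module(incomplete_sketch).
-export([truncated/0]).

%% ASSUMPTION: unicode:characters_to_list/2 stands in for the
%% erlang:characters_to_list/2 described in this text.

%% E2 82 are the first two bytes of the three-byte UTF-8 sequence
%% E2 82 AC (U+20AC). Nothing is invalid so far, but the character
%% cannot be completed, so the result is {incomplete, "a", Rest}
%% with Rest being a binary holding the bytes E2 82.
truncated() ->
    unicode:characters_to_list(list_to_binary([$a, 16#E2, 16#82]), unicode).
</code>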

If one UTF-8 character is split over two consecutive
binaries in the <c>Data</c>, the conversion succeeds. This means
that a character can be decoded from a range of binaries as long
as the whole range is given as input without errors
occurring. Example:

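<code>
%% Decode Data, fetching more input whenever the data ends in the
%% middle of a UTF-8 character. get_some_more_data/0 and
%% handle_error/2 are placeholders that are not defined here.
decode_data(Data) ->
    case erlang:characters_to_list(Data,unicode) of
        {incomplete,Encoded,Rest} ->
            %% Rest holds the leading bytes of an unfinished character;
            %% put them in front of the next chunk and keep decoding.
            More = get_some_more_data(),
            Encoded ++ decode_data([Rest, More]);
        {error,Encoded,Rest} ->
            handle_error(Encoded,Rest);
        List ->
            List
    end.
</code>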

Bit-strings that are not whole bytes are however not allowed,
so a UTF-8 character has to be split along 8-bit boundaries to
ever be decoded.

If any parameters are of the wrong type, the list structure
is invalid (a number as tail) or the binaries do not contain
whole bytes (bit-strings), a <c>badarg</c> exception is
thrown.

</desc>
</func>

<func>
<name>erlang:characters_to_utf8(Data, Encoding) -> binary() | {error, binary(), RestData} | {incomplete, binary(), binary()} </name>
<fsummary>Convert a collection of characters to a UTF-8 binary</fsummary>
<type>
<v>Data = ListData | binary()</v>
<v>RestData = ListData | binary()</v>
<v>ListData = [ int() | binary() ] (binary allowed as tail of list)</v>
<v>Encoding = unicode | latin1</v>
</type>
<desc>

This function behaves as <seealso
marker="#erlang:characters_to_list/2">erlang:characters_to_list/2</seealso>,
but produces a UTF-8 binary instead of a unicode list. Note
that even if <c>Encoding</c> is given as <c>latin1</c>, the
output is UTF-8. The <c>Encoding</c> defines how input is to be
interpreted, not what output is generated. To convert a possibly
deep list of iso-latin-1 characters to an iso-latin-1 binary, use
<seealso
marker="#iolist_to_binary/1">iolist_to_binary/1</seealso>.
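
The difference between the two conversions can be sketched as follows.
This is an illustration under the assumption that
<c>unicode:characters_to_binary/2</c>, as exported by released
Erlang/OTP, is the counterpart of the function documented here as
<c>erlang:characters_to_utf8/2</c>. The module name <c>utf8_sketch</c>
and its function names are hypothetical.

<code>
-module(utf8_sketch).
-export([latin1_to_utf8/0, latin1_to_latin1/0]).

%% ASSUMPTION: unicode:characters_to_binary/2 stands in for the
%% erlang:characters_to_utf8/2 described in this text.

%% A deep latin1 input holding the characters 16#E5, 16#E4, 16#F6.
input() ->
    [16#E5, [16#E4], list_to_binary([16#F6])].

%% Encoding = latin1 only says how the input is read. The output is
%% UTF-8, here two bytes per character: C3 A5 C3 A4 C3 B6.
latin1_to_utf8() ->
    unicode:characters_to_binary(input(), latin1).

%% iolist_to_binary/1 keeps one byte per character: E5 E4 F6.
latin1_to_latin1() ->
    iolist_to_binary(input()).
</code>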

Errors and exceptions occur as in <seealso
marker="#erlang:characters_to_list/2">erlang:characters_to_list/2</seealso>,
but of course the second element in the <c>error</c> or
<c>incomplete</c> tuple will be a <c>binary()</c> and not a
<c>list()</c>.

</desc>
</func>
</funcs>
</erlref>
