~vcs-imports/libiconv/trunk

1 by Bruno Haible
Import from libiconv-0.3.
Q: Why does libiconv support encoding XXX? Why does libiconv not support
   encoding ZZZ?

A: libiconv, as an internationalization library, supports those character
   sets and encodings which are in wide-spread use in at least one territory
   of the world.

   Hint1: On http://www.w3c.org/International/O-charset-lang.html you find a
   page "Languages, countries, and the charsets typically used for them".
   From this table, we can conclude that the following are in active use:

     ISO-8859-1, CP1252   Afrikaans, Albanian, Basque, Catalan, Danish, Dutch,
                          English, Faroese, Finnish, French, Galician, German,
                          Icelandic, Irish, Italian, Norwegian, Portuguese,
                          Scottish, Spanish, Swedish
     ISO-8859-2           Croatian, Czech, Hungarian, Polish, Romanian, Slovak,
                          Slovenian
     ISO-8859-3           Esperanto, Maltese
     ISO-8859-5           Bulgarian, Byelorussian, Macedonian, Russian,
                          Serbian, Ukrainian
     ISO-8859-6           Arabic
     ISO-8859-7           Greek
     ISO-8859-8           Hebrew
     ISO-8859-9, CP1254   Turkish
     ISO-8859-10          Inuit, Lapp
     ISO-8859-13          Latvian, Lithuanian
     ISO-8859-15          Estonian
     KOI8-R               Russian
     SHIFT_JIS            Japanese
     ISO-2022-JP          Japanese
     EUC-JP               Japanese

   Ordered by frequency on the web (1997):
     ISO-8859-1, CP1252   96%
     SHIFT_JIS             1.6%
     ISO-2022-JP           1.2%
     EUC-JP                0.4%
     CP1250                0.3%
     CP1251                0.2%
     CP850                 0.1%
     MACINTOSH             0.1%
     ISO-8859-5            0.1%
     ISO-8859-2            0.0%

   Hint2: The character sets mentioned in the XFree86 4.0 locale.alias file.

     ISO-8859-1           Afrikaans, Basque, Breton, Catalan, Danish, Dutch,
                          English, Estonian, Faroese, Finnish, French,
                          Galician, German, Greenlandic, Icelandic,
                          Indonesian, Irish, Italian, Lithuanian, Norwegian,
                          Occitan, Portuguese, Scottish, Spanish, Swedish,
                          Walloon, Welsh
     ISO-8859-2           Albanian, Croatian, Czech, Hungarian, Polish,
                          Romanian, Serbian, Slovak, Slovenian
     ISO-8859-3           Esperanto
     ISO-8859-4           Estonian, Latvian, Lithuanian
     ISO-8859-5           Bulgarian, Byelorussian, Macedonian, Russian,
                          Serbian, Ukrainian
     ISO-8859-6           Arabic
     ISO-8859-7           Greek
     ISO-8859-8           Hebrew
     ISO-8859-9           Turkish
     ISO-8859-14          Breton, Irish, Scottish, Welsh
     ISO-8859-15          Basque, Breton, Catalan, Danish, Dutch, Estonian,
                          Faroese, Finnish, French, Galician, German,
                          Greenlandic, Icelandic, Irish, Italian, Lithuanian,
                          Norwegian, Occitan, Portuguese, Scottish, Spanish,
                          Swedish, Walloon, Welsh
     KOI8-R               Russian
     KOI8-U               Russian, Ukrainian
     EUC-JP (alias eucJP)      Japanese
     ISO-2022-JP (alias JIS7)  Japanese
     SHIFT_JIS (alias SJIS)    Japanese
     U90                       Japanese
     S90                       Japanese
     EUC-CN (alias eucCN)      Chinese
     EUC-TW (alias eucTW)      Chinese
     BIG5                      Chinese
     EUC-KR (alias eucKR)      Korean
     ARMSCII-8                 Armenian
     GEORGIAN-ACADEMY          Georgian
     GEORGIAN-PS               Georgian
     TIS-620 (alias TACTIS)    Thai
     MULELAO-1                 Laotian
     IBM-CP1133                Laotian
     VISCII                    Vietnamese
     TCVN                      Vietnamese
     NUNACOM-8                 Inuktitut

   Hint3: The character sets supported by Netscape Communicator 4.

     Where is this documented? For the complete picture, I had to use
     "strings netscape" and then a lot of guesswork. For a quick take,
     look at the "View - Character set" menu of Netscape Communicator 4.6:

     ISO-8859-{1,2,5,7,9,15}
     WINDOWS-{1250,1251,1253}
     KOI8-R               Cyrillic
     CP866                Cyrillic
     Autodetect           Japanese  (EUC-JP, ISO-2022-JP, ISO-2022-JP-2, SJIS)
     EUC-JP               Japanese
     SHIFT_JIS            Japanese
     GB2312               Chinese
     BIG5                 Chinese
     EUC-TW               Chinese
     Autodetect           Korean    (EUC-KR, ISO-2022-KR, but not JOHAB)

     UTF-8
     UTF-7

   Hint4: The character sets supported by Microsoft Internet Explorer 4.

     ISO-8859-{1,2,3,4,5,6,7,8,9}
     WINDOWS-{1250,1251,1252,1253,1254,1255,1256,1257}
     KOI8-R               Cyrillic
     KOI8-RU              Ukrainian
     ASMO-708             Arabic
     EUC-JP               Japanese
     ISO-2022-JP          Japanese
     SHIFT_JIS            Japanese
     GB2312               Chinese
     HZ-GB-2312           Chinese
     BIG5                 Chinese
     EUC-KR               Korean
     ISO-2022-KR          Korean
     WINDOWS-874          Thai
     WINDOWS-1258         Vietnamese

     UTF-8
     UTF-7
     UNICODE              actually UNICODE-LITTLE
     UNICODEFEFF          actually UNICODE-BIG

     and various DOS character sets: DOS-720, DOS-862, IBM852, CP866.

   We take the union of all these four sets. The result is:

   European and Semitic languages
     * ASCII.
       We implement this because it is occasionally useful to know or to
       check whether some text is entirely ASCII (i.e. if the conversion
       ISO-8859-x -> UTF-8 is trivial).
     * ISO-8859-{1,2,3,4,5,6,7,8,9,10}
       We implement these because they are widely used, except ISO-8859-4,
       which appears to have been superseded by ISO-8859-13 in the Baltic
       countries. But it's an ISO standard anyway.
     * ISO-8859-13
       We implement this because it's a standard in Lithuania and Latvia.
     * ISO-8859-14
       We implement this because it's an ISO standard.
     * ISO-8859-15
       We implement this because it's increasingly used in Europe, because
       of the Euro symbol.
     * ISO-8859-16
       We implement this because it's an ISO standard.
     * KOI8-R, KOI8-U
       We implement these because they appear to be the predominant encodings
       on Unix in Russia and Ukraine, respectively.
     * KOI8-RU
       We implement this because MSIE4 supports it.
     * KOI8-T
       We implement this because it is the locale encoding in glibc's Tajik
       locale.
     * PT154
       We implement this because it is the locale encoding in glibc's Kazakh
       locale.
     * RK1048
       We implement this because it's a standard in Kazakhstan.
     * CP{1250,1251,1252,1253,1254,1255,1256,1257}
       We implement these because they are the predominant Windows encodings
       in Europe.
     * CP850
       We implement this because it is mentioned as occurring on the web
       in the aforementioned statistics.
     * CP862
       We implement this because Ron Aaron says it is sometimes used in web
       pages and emails.
     * CP866
       We implement this because Netscape Communicator does.
     * CP1131
       We implement this because it is the locale encoding of a Belarusian
       locale in FreeBSD and MacOS X.
     * Mac{Roman,CentralEurope,Croatian,Romania,Cyrillic,Greek,Turkish} and
       Mac{Hebrew,Arabic}
       We implement these because the Sun JDK does, and because Mac users
       don't deserve to be punished.
     * Macintosh
       We implement this because it is mentioned as occurring on the web
       in the aforementioned statistics.
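The ASCII rationale above (checking whether an ISO-8859-x -> UTF-8
conversion is trivial) can be illustrated with a minimal sketch. It uses
Python's built-in codecs rather than libiconv itself, and the helper names
is_trivially_utf8 and latin1_to_utf8 are made up for this illustration.

```python
def is_trivially_utf8(data: bytes) -> bool:
    """Return True if every byte is below 0x80, i.e. the text is pure ASCII."""
    return all(byte < 0x80 for byte in data)


def latin1_to_utf8(data: bytes) -> bytes:
    """Convert ISO-8859-1 bytes to UTF-8 using Python's built-in codecs."""
    return data.decode("iso8859_1").encode("utf-8")


ascii_text = b"plain text"
latin1_text = b"caf\xe9"  # 0xE9 is e-acute in ISO-8859-1

# For pure ASCII input the conversion leaves the bytes unchanged.
print(is_trivially_utf8(ascii_text), latin1_to_utf8(ascii_text) == ascii_text)
# A single non-ASCII byte makes the UTF-8 output differ from the input.
print(is_trivially_utf8(latin1_text), latin1_to_utf8(latin1_text))
```

When the check succeeds, a caller could skip the conversion entirely and
pass the bytes through unchanged.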
   Japanese
     * EUC-JP, SHIFT_JIS, ISO-2022-JP
       We implement these because they are widely used. EUC-JP and SHIFT_JIS
       are used more for files, whereas ISO-2022-JP is recommended for email.
     * CP932
       We implement this because it is the Microsoft variant of SHIFT_JIS,
       used on Windows.
     * ISO-2022-JP-2
       We implement this because it's the common way to represent mails which
       make use of JIS X 0212 characters.
     * ISO-2022-JP-1
       We implement this because it's in the RFCs, but I don't think it is
       really used.
     * ISO-2022-JP-MS
       We implement this because Microsoft Outlook Express / Microsoft MimeOLE
       sends emails in this encoding.
     * U90, S90
       We DON'T implement these because I have no information about what they
       are or who uses them.
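A small sketch of how the three main Japanese encodings differ, which is
presumably why ISO-2022-JP is the one recommended for email: it stays within
7 bits and switches character sets with escape sequences, whereas EUC-JP and
SHIFT_JIS use bytes with the high bit set. The sketch relies on Python's
built-in codecs (not libiconv), and is_seven_bit is a helper invented here.

```python
text = "\u65e5\u672c\u8a9e"  # three kanji, all in JIS X 0208

iso2022jp = text.encode("iso2022_jp")
eucjp = text.encode("euc_jp")
sjis = text.encode("shift_jis")


def is_seven_bit(data: bytes) -> bool:
    """True if no byte has the high bit set."""
    return all(byte < 0x80 for byte in data)


# ISO-2022-JP is stateful: ESC $ B selects JIS X 0208 and ESC ( B returns
# to ASCII. The bytes in between are the raw 7-bit JIS codes.
payload = iso2022jp[3:-3]

print(iso2022jp, is_seven_bit(iso2022jp))
print(eucjp, is_seven_bit(eucjp))
print(sjis, is_seven_bit(sjis))
```

For JIS X 0208 characters, the EUC-JP bytes are the same JIS codes with the
high bit set on each byte, so the payload and the EUC-JP form differ only in
that bit.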
   Simplified Chinese
     * EUC-CN = GB2312
       We implement this because it is the widely used representation
       of simplified Chinese.
     * GBK
       We implement this because it appears to be used on Solaris and Windows.
     * GB18030
       We implement this because it is an official requirement in the
       People's Republic of China.
     * ISO-2022-CN
       We implement this because it is in the RFCs, but I have no idea
       whether it is really used.
     * ISO-2022-CN-EXT
       We implement this because it's in the RFCs, but I don't think it is
       really used.
     * HZ = HZ-GB-2312
       We implement this because the RFCs recommend it for Usenet postings,
       and because MSIE4 supports it.
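To make the relationship between some of these names concrete, here is a
rough sketch using Python's built-in codecs (an assumption of this example;
libiconv's own converters may differ in details). It shows that HZ carries
the same GB2312 codes as EUC-CN in a 7-bit form, and that GB18030 can encode
a character that GB2312 and GBK cannot. The helper can_encode is
hypothetical.

```python
text = "\u4e2d\u6587"  # two GB2312 characters

euc_cn = text.encode("gb2312")  # Python's gb2312 codec produces the EUC-CN form
hz = text.encode("hz")          # 7-bit form, bracketed by ~{ and ~}

# A character that exists in Unicode but not in GB2312 or GBK.
emoji = "\U0001F600"


def can_encode(s: str, codec: str) -> bool:
    """True if the codec can represent every character of s."""
    try:
        s.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


print(euc_cn, hz)
print([can_encode(emoji, c) for c in ("gb2312", "gbk", "gb18030")])
```

The 7-bit property is what made HZ usable on transports that were not
8-bit clean.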
   Traditional Chinese
     * EUC-TW
       We implement it because it appears to be used on Unix.
     * BIG5
       We implement it because it is the de-facto standard for traditional
       Chinese.
     * CP950
       We implement this because it is the Microsoft variant of BIG5, used
       on Windows.
     * BIG5+
       We DON'T implement this because it doesn't appear to be in wide use.
       Only the CWEX fonts use this encoding. Furthermore, the conversion
       tables in the big5p package are not coherent: If you convert directly,
       you get different results than when you convert via GBK.
     * BIG5-HKSCS
       We implement it because it is the de-facto standard for traditional
       Chinese in Hong Kong.
   Korean
     * EUC-KR
       We implement this because it appears to be the widely used
       representation for Korean.
     * CP949
       We implement this because it is the Microsoft variant of EUC-KR, used
       on Windows.
     * ISO-2022-KR
       We implement it because it is in the RFCs and because MSIE4 supports
       it, but I have no idea whether it's really used.
     * JOHAB
       We implement this because it is apparently used on Windows as a locale
       encoding (codepage 1361).
     * ISO-646-KR
       We DON'T implement this because, although it is an old ASCII variant,
       its glyph for 0x7E is not clear: RFC 1345 and unicode.org's JOHAB.TXT
       say it's a tilde, but Ken Lunde's "CJKV Information Processing" says
       it's an overline. And it is not ISO-IR registered.
   Armenian
     * ARMSCII-8
       We implement it because XFree86 supports it.
   Georgian
     * Georgian-Academy, Georgian-PS
       We implement these because they both appear to be used for Georgian;
       XFree86 supports them.
   Thai
     * ISO-8859-11, TIS-620
       We implement these because they seem to be the standard for Thai.
     * CP874
       We implement this because MSIE4 supports it.
     * MacThai
       We implement this because the Sun JDK does, and because Mac users
       don't deserve to be punished.
   Laotian
     * MuleLao-1, CP1133
       We implement these because XFree86 supports them. I have no idea which
       one is used more widely.
   Vietnamese
     * VISCII, TCVN
       We implement these because XFree86 supports them.
     * CP1258
       We implement this because MSIE4 supports it.
   Other languages
     * NUNACOM-8 (Inuktitut)
       We DON'T implement this because it isn't part of Unicode yet, and
       therefore doesn't convert to anything except itself.
   Platform specifics
     * HP-ROMAN8, NEXTSTEP
       We implement these because they were the native character sets on HPs
       and NeXTs for a long time, and libiconv is intended to be usable on
       these old machines.
   Full Unicode
     * UTF-8, UCS-2, UCS-4
       We implement these. Obviously.
     * UCS-2BE, UCS-2LE, UCS-4BE, UCS-4LE
       We implement these because they are the preferred internal
       representation of strings in Unicode aware applications. These are
       non-ambiguous names, known to glibc. (glibc doesn't have
       UCS-2-INTERNAL and UCS-4-INTERNAL.)
     * UTF-16, UTF-16BE, UTF-16LE
       We implement these, because UTF-16 is still the favourite encoding of
       the president of the Unicode Consortium (for political reasons), and
       because they appear in RFC 2781.
     * UTF-32, UTF-32BE, UTF-32LE
       We implement these because they are part of Unicode 3.1.
     * UTF-7
       We implement this because it is essential functionality for mail
       applications.
     * C99
       We implement it because it's used for C and C++ programs and because
       it's a nice encoding for debugging.
     * JAVA
       We implement it because it's used for Java programs and because it's
       a nice encoding for debugging.
     * UNICODE (little endian), UNICODEFEFF (big endian)
       We DON'T implement these because they are stupid and not standardized.
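The C99 and JAVA entries above are called nice for debugging because they
render any text as plain ASCII. The following is only a sketch of that idea,
not libiconv's converter: it assumes the escape forms are C99 universal
character names (\uXXXX and \UXXXXXXXX) and Java's \uXXXX escapes with
surrogate pairs for characters beyond U+FFFF. The function names to_c99 and
to_java are invented for this example.

```python
def to_c99(text: str) -> str:
    """Escape non-ASCII characters as C99 universal character names."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp < 0x10000:
            out.append(f"\\u{cp:04x}")
        else:
            out.append(f"\\U{cp:08x}")
    return "".join(out)


def to_java(text: str) -> str:
    """Escape non-ASCII characters as Java \\uXXXX escapes (UTF-16 units)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp < 0x10000:
            out.append(f"\\u{cp:04x}")
        else:
            # Characters beyond U+FFFF become a UTF-16 surrogate pair.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out.append(f"\\u{high:04x}\\u{low:04x}")
    return "".join(out)


sample = "na\u00efve \u20ac \U0001d11e"
print(to_c99(sample))
print(to_java(sample))
```

The two forms agree for characters up to U+FFFF and differ only in how they
spell characters outside the Basic Multilingual Plane.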
   Full Unicode, in terms of 'uint16_t' or 'uint32_t'
   (with machine dependent endianness and alignment)
     * UCS-2-INTERNAL, UCS-4-INTERNAL
       We implement these because they are the preferred internal
       representation of strings in Unicode aware applications.
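The difference between the explicit-endianness names (such as UCS-2BE and
UCS-2LE) and the machine-dependent INTERNAL variants can be sketched as
follows. This is an illustration with Python's standard library, not with
libiconv: it uses the utf-16-be and utf-16-le codecs, which yield the same
bytes as UCS-2BE and UCS-2LE for characters in the Basic Multilingual Plane,
and struct to imitate an in-memory array of 16-bit integers.

```python
import struct
import sys

text = "A\u20ac"  # U+0041 and U+20AC, both in the Basic Multilingual Plane

big = text.encode("utf-16-be")     # most significant byte first
little = text.encode("utf-16-le")  # least significant byte first

# Machine-dependent form: pack the code points as 16-bit integers in the
# byte order of the machine running this script.
native = struct.pack(f"={len(text)}H", *(ord(ch) for ch in text))

print(big.hex(), little.hex(), native.hex(), sys.byteorder)
```

The native form matches one of the two explicit forms depending on the
machine, which is why the unambiguous BE/LE names are the safer choice for
data that leaves the process.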

Q: Support encodings mentioned in RFC 1345?
A: No, they are not in use any more. Supporting ISO-646 variants is pointless
   since ISO-8859-* have been adopted.

Q: Support EBCDIC?
A: Available through --enable-extra-encodings.
   Why? Because several people (Ulrich Schwab, Calvin Buckley) have shown
   interest in these encodings, by preparing forks of GNU libiconv.

Q: How do I add a new character set?
A: 1. Explain the "why" in this file, above.
   2. You need to have a conversion table from/to Unicode. Transform it into
      the format used by the mapping tables found on ftp.unicode.org: each
      line contains the character code, in hex, with 0x prefix, then
      whitespace, then the Unicode code point, in hex, 4 hex digits, with 0x
      prefix. '#' counts as a comment delimiter until end of line.
      Please also send your table to Mark Leisher <mleisher@crl.nmsu.edu> so
      he can include it in his collection.
   3. If it's an 8-bit character set, use the '8bit_tab_to_h' program in the
      tools directory to generate the C code for the conversion. You may
      tweak the resulting C code if you are not satisfied with its quality,
      but this is rarely needed.
      If it's a two-dimensional character set (with rows and columns), use
      the 'cjk_tab_to_h' program in the tools directory to generate the C
      code for the conversion. You will need to modify the main() function to
      recognize the new character set name, with the proper dimensions, but
      that shouldn't be too hard. This yields the CCS (coded character set).
      The CES (character encoding scheme) you have to write by hand.
   4. Store the resulting C code file in the lib directory. Add a #include
      directive to converters.h, and add an entry to the encodings.def file.
   5. Compile the package, and test your new encoding using a program like
      iconv(1) or clisp(1).
   6. Augment the testsuite: Add a line to tests/Makefile.in. For a stateless
      encoding, create the complete table as a TXT file. For a stateful
      encoding, provide a text snippet encoded using your new encoding and
      its UTF-8 equivalent.
   7. Update the README and man/iconv_open.3, to mention the new encoding.
      Add a note in the NEWS file.
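The table format described in step 2 can be checked before it is fed to the
generator programs. Below is a minimal sketch of such a check; it is not
part of libiconv, and the function names, the sample table, and the
uniqueness check are assumptions made for this illustration.

```python
def parse_mapping_table(text: str) -> dict[int, int]:
    """Parse lines of the form '0xXX <whitespace> 0xYYYY  # comment'."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # '#' starts a comment
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # a code listed without a Unicode mapping
        code, ucs = int(fields[0], 16), int(fields[1], 16)
        table[code] = ucs
    return table


def check_round_trip(table: dict[int, int]) -> bool:
    """True if no two character codes map to the same Unicode code point."""
    return len(set(table.values())) == len(table)


# Hypothetical excerpt of an 8-bit table, for illustration only.
sample = """\
# code  Unicode  name
0x41    0x0041   # LATIN CAPITAL LETTER A
0xA4    0x20AC   # EURO SIGN
0xE9    0x00E9   # LATIN SMALL LETTER E WITH ACUTE
"""
table = parse_mapping_table(sample)
print({hex(k): hex(v) for k, v in table.items()}, check_round_trip(table))
```

A table that passes the uniqueness check converts to Unicode and back
without loss, which is the property a complete table for a stateless
encoding (step 6) is meant to exercise.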

Q: What about bidirectional text? Should it be tagged or reversed when
   converting from ISO-8859-8 or ISO-8859-6 to Unicode? Qt appears to do
   this, see qt-2.0.1/src/tools/qrtlcodec.cpp.
A: After reading RFC 1556: I don't think so. Support for ISO-8859-8-I and
   ISO-8859-8-E remains to be implemented.
   On the other hand, a page on www.w3c.org says that ISO-8859-8 in *email*
   is visually encoded, ISO-8859-8 in *HTML* is logically encoded, i.e.
   the same as ISO-8859-8-I. I'm confused.

Other character sets not implemented:
"MNEMONIC" = "csMnemonic"
"MNEM" = "csMnem"
"ISO-10646-UCS-Basic" = "csUnicodeASCII"
"ISO-10646-Unicode-Latin1" = "csUnicodeLatin1" = "ISO-10646"
"ISO-10646-J-1"
"UNICODE-1-1" = "csUnicode11"
"csWindows31Latin5"

Other aliases not implemented (and not implemented in glibc-2.1 either):
  From MSIE4:
    ISO-8859-1: alias ISO8859-1
    ISO-8859-2: alias ISO8859-2
    KSC_5601: alias KS_C_5601
    UTF-8: aliases UNICODE-1-1-UTF-8 UNICODE-2-0-UTF-8


Q: How can I integrate libiconv into my package?
A: Just copy the entire libiconv package into a subdirectory of your package.
   At configuration time, call libiconv's configure script with the
   appropriate --srcdir option and maybe --enable-static or --disable-shared.
   Then "cd libiconv && make && make install-lib libdir=... includedir=...".
   'install-lib' is a special (not GNU standardized) target which installs
   only the include file - in $(includedir) - and the library - in $(libdir) -
   and does not use other directory variables. After "installing" libiconv
   in your package's build directory, building of your package can proceed.

Q: Why is the testsuite so big?
A: Because some of the tests are very comprehensive.
   If you don't feel like using the testsuite, you can simply remove the
   tests/ directory.