# $Id: format.txt,v 1.2 2001/01/02 18:46:20 mleisher Exp $

This package generates some data files that contain character properties useful

The first data file is called "ctype.dat" and contains a compressed form of
the character properties found in the Unicode Character Database (UCDB).
Additional properties can be specified in limited UCDB format in another file
to avoid modifying the original UCDB.

The following is a property name and code table to be used with the character

    Mn   0   Mark, Non-Spacing
    Mc   1   Mark, Spacing Combining
    Nd   3   Number, Decimal Digit
    Zp   8   Separator, Paragraph
    Cs   11  Other, Surrogate
    Co   12  Other, Private Use
    Cn   13  Other, Not Assigned
    Lu   14  Letter, Uppercase
    Ll   15  Letter, Lowercase
    Lt   16  Letter, Titlecase
    Lm   17  Letter, Modifier
    Pc   19  Punctuation, Connector
    Pd   20  Punctuation, Dash
    Ps   21  Punctuation, Open
    Pe   22  Punctuation, Close
    Po   23  Punctuation, Other
    Sc   25  Symbol, Currency
    Sk   26  Symbol, Modifier
    ES   31  European Number Separator
    ET   32  European Number Terminator
    CS   34  Common Number Separator
    S    36  Segment Separator
    Pi   47  Punctuation, Initial
    Pf   48  Punctuation, Final

# Implementation specific properties.

    Sy   41  Symmetric (characters which are part of open/close pairs)
    Ss   45  Space, Other (controls viewed as spaces in ctype isspace())
    Cp   46  Defined character

The actual binary data is formatted as follows:

Assumptions: unsigned short is at least 16 bits in size and unsigned long
is at least 32 bits in size.

    unsigned short ByteOrderMark
    unsigned short OffsetArraySize
    unsigned long  Bytes
    unsigned short Offsets[OffsetArraySize + 1]
    unsigned long  Ranges[N], N = value of Offsets[OffsetArraySize]

The Bytes field provides the total byte count used for the Offsets[] and
Ranges[] arrays. The Offsets[] array is aligned on a 4-byte boundary and
there is always one extra node on the end to hold the final index of the
Ranges[] array. The Ranges[] array contains pairs of 4-byte values
representing a range of Unicode characters. The pairs are arranged in
increasing order by the first character code in the range.

Determining if a particular character is in the property list requires a
simple binary search to determine if a character is in any of the ranges
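
The search described above might be sketched in C as follows. This is only an
illustration, not the package's own code: the function name prop_lookup and
its parameters are hypothetical, and it assumes that Offsets[p] and
Offsets[p + 1] bound the slice of Ranges[] belonging to property code p,
counted in unsigned longs, and that the arrays are already loaded and in
native byte order.

```c
#include <stddef.h>

/*
 * Sketch: return 1 if `code` lies in any range of property `prop`, else 0.
 * Hypothetical helper; `offsets` and `ranges` stand for the loaded
 * Offsets[] and Ranges[] arrays.
 */
static int prop_lookup(const unsigned short *offsets,
                       const unsigned long *ranges,
                       unsigned short prop, unsigned long code)
{
    /* Assumption: this property's pairs occupy
     * ranges[offsets[prop]] .. ranges[offsets[prop + 1] - 1]. */
    long lo = offsets[prop];
    long hi = (long)offsets[prop + 1] - 2;  /* start index of the last pair */

    while (lo <= hi) {
        /* Take the midpoint and clear the low bit so that it lands on the
         * first element of a (start, end) pair. */
        long mid = ((lo + hi) / 2) & ~1L;

        if (code > ranges[mid + 1])
            lo = mid + 2;          /* code is above this range */
        else if (code < ranges[mid])
            hi = mid - 2;          /* code is below this range */
        else
            return 1;              /* start <= code <= end */
    }
    return 0;
}
```

A property with no ranges has equal neighbouring offsets under this
assumption, so the loop body never runs and the function returns 0.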

If the ByteOrderMark is equal to 0xFFFE, then the data was generated on a
machine with a different endian order and the values must be byte-swapped.

To swap a 16-bit value:

    c = (c >> 8) | ((c & 0xff) << 8)

To swap a 32-bit value:

    c = ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
        (((c >> 16) & 0xff) << 8) | (c >> 24)
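
For illustration, the two swaps and the ByteOrderMark test could be wrapped as
small C helpers. The names swap16, swap32 and needs_swap are hypothetical and
do not come from the package. The extra masks are there only because the types
are assumed to be "at least" 16 and 32 bits wide, so they may be wider on some
machines.

```c
/* Sketch only: hypothetical helper names. */

/* Swap the two low bytes of a 16-bit value. */
static unsigned short swap16(unsigned short c)
{
    /* Mask both halves in case unsigned short is wider than 16 bits. */
    return (unsigned short)(((c >> 8) & 0xff) | ((c & 0xff) << 8));
}

/* Reverse the four low bytes of a 32-bit value. */
static unsigned long swap32(unsigned long c)
{
    /* Mask the top byte in case unsigned long is wider than 32 bits. */
    return ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
           (((c >> 16) & 0xff) << 8) | ((c >> 24) & 0xff);
}

/* Nonzero when a header's ByteOrderMark says the data must be swapped. */
static int needs_swap(unsigned short bom)
{
    return bom == 0xfffe;
}
```

A loader would read the header first, call needs_swap() on the ByteOrderMark,
and then apply swap16() to every remaining unsigned short field and swap32()
to every unsigned long field.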

The next data file is called "case.dat" and contains three case mapping tables
in the following order: upper, lower, and title case. Each table is in
increasing order by character code and each mapping contains 3 unsigned longs
which represent the possible mappings.

The format for the binary form of these tables is:

    unsigned short ByteOrderMark
    unsigned short NumMappingNodes, count of all mapping nodes
    unsigned short CaseTableSizes[2], upper and lower mapping node counts
    unsigned long  CaseTables[NumMappingNodes * 3]

The starting indexes of the case tables are calculated as follows:

    LowerIndex = CaseTableSizes[0] * 3;
    TitleIndex = LowerIndex + CaseTableSizes[1] * 3;
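
As a rough sketch, a lookup in one of the three tables could use these
starting indexes as shown below. The function case_lookup and its parameters
are hypothetical. The sketch also makes two assumptions: that the upper table
starts at index 0 of CaseTables[], and that the first of the 3 unsigned longs
in each node is the character code the table is sorted by.

```c
#include <stddef.h>

/*
 * Sketch: find the 3-field mapping node for `code` in one case table.
 * `start` is the index of the table's first field (0, LowerIndex or
 * TitleIndex) and `nodes` is the number of nodes in that table.
 * Returns a pointer to the node, or NULL if `code` has no mapping.
 */
static const unsigned long *case_lookup(const unsigned long *case_tables,
                                        unsigned long start,
                                        unsigned long nodes,
                                        unsigned long code)
{
    unsigned long lo = 0, hi = nodes;   /* half-open range of node numbers */

    while (lo < hi) {
        unsigned long mid = lo + (hi - lo) / 2;
        const unsigned long *node = case_tables + start + mid * 3;

        /* Assumption: node[0] is the sort key of this table. */
        if (code < node[0])
            hi = mid;
        else if (code > node[0])
            lo = mid + 1;
        else
            return node;
    }
    return NULL;
}
```

Since NumMappingNodes counts all mapping nodes, the node count of the title
table would be NumMappingNodes - CaseTableSizes[0] - CaseTableSizes[1].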

The order of the fields for the three tables is:

If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.

Because the tables are in increasing order by character code, locating a
mapping requires a simple binary search on one of the 3 codes that make up

It is important to note that there can only be 65536 mapping nodes, which,
divided into 3 portions, allows 21845 nodes for each case mapping table. The
distribution of mappings may be more or less than 21845 per table, but only

This data file is called "comp.dat" and contains data that tracks character
pairs that have a single Unicode value representing the combination of the two

The format for the binary form of this table is:

    unsigned short ByteOrderMark
    unsigned short NumCompositionNodes, count of composition nodes
    unsigned long  Bytes, total number of bytes used for composition nodes
    unsigned long  CompositionNodes[NumCompositionNodes * 4]

If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.

The CompositionNodes[] array consists of groups of 4 unsigned longs. The
first of these is the character code representing the combination of two
other character codes, the second records the number of character codes that
make up the composition (not currently used), and the last two are the pair
of character codes whose combination is represented by the character code in
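
To make the node layout concrete, here is a minimal C sketch that looks up the
composite for a pair of character codes. The function name compose_pair is
hypothetical. The sketch uses a plain linear scan because the text above does
not state how the nodes are ordered, and returning 0 for "no composition" is a
convention of this sketch only.

```c
/*
 * Sketch: return the character code that represents the combination of
 * `first` and `second`, or 0 if the table holds no such pair.
 */
static unsigned long compose_pair(const unsigned long *nodes,
                                  unsigned short num_nodes,
                                  unsigned long first, unsigned long second)
{
    unsigned long i;

    for (i = 0; i < num_nodes; i++) {
        /* n[0] = composite, n[1] = count (not currently used),
         * n[2] and n[3] = the pair being combined. */
        const unsigned long *n = nodes + i * 4;

        if (n[2] == first && n[3] == second)
            return n[0];
    }
    return 0;
}
```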

The next data file is called "decomp.dat" and contains the decomposition data
for all characters whose decompositions contain more than one character and
are *not* compatibility decompositions. Compatibility decompositions are
signaled in the UCDB format by the use of the <compat> tag in the
decomposition field. Each list of character codes represents a full
decomposition of a composite character. The nodes are arranged in increasing
order by character code.

The format for the binary form of this table is:

    unsigned short ByteOrderMark
    unsigned short NumDecompNodes, count of all decomposition nodes
    unsigned long  DecompNodes[(NumDecompNodes * 2) + 1]
    unsigned long  Decomp[N], N = sum of all counts in DecompNodes[]

If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.

The DecompNodes[] array consists of pairs of unsigned longs, the first of
which is the character code and the second is the initial index of the list
of character codes representing the decomposition.

Locating the decomposition of a composite character requires a binary search
for a character code in the DecompNodes[] array and using its index to
locate the start of the decomposition. The length of the decomposition list
is the index in the following element in DecompNodes[] minus the current
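
The lookup just described might look roughly like this in C. The name
decomp_lookup and its parameters are hypothetical. The sketch assumes that the
one extra element at the end of DecompNodes[] holds the end index of the last
decomposition list, so that the length of the final node's list can be
computed like any other.

```c
#include <stddef.h>

/*
 * Sketch: find the decomposition of `code`.
 * On success, stores the list length in *length and returns a pointer
 * to the first character code of the list inside Decomp[].
 * Returns NULL if `code` has no entry.
 */
static const unsigned long *decomp_lookup(const unsigned long *decomp_nodes,
                                          unsigned short num_nodes,
                                          const unsigned long *decomp,
                                          unsigned long code,
                                          unsigned long *length)
{
    unsigned long lo = 0, hi = num_nodes;   /* half-open range of nodes */

    while (lo < hi) {
        unsigned long mid = lo + (hi - lo) / 2;
        const unsigned long *node = decomp_nodes + mid * 2;

        if (code < node[0]) {
            hi = mid;
        } else if (code > node[0]) {
            lo = mid + 1;
        } else {
            /* Start index of the next list. For the last node this is
             * assumed to be the trailing extra element. */
            unsigned long end = (mid + 1 < num_nodes)
                ? decomp_nodes[(mid + 1) * 2 + 1]
                : decomp_nodes[(unsigned long)num_nodes * 2];

            *length = end - node[1];
            return decomp + node[1];
        }
    }
    return NULL;
}
```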

The next data file is called "cmbcl.dat" and contains the characters with
non-zero combining classes.

The format for the binary form of this table is:

    unsigned short ByteOrderMark
    unsigned short NumCCLNodes
    unsigned long  CCLNodes[NumCCLNodes * 3]

If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.

The CCLNodes[] array consists of groups of three unsigned longs. The first
and second are the beginning and ending of a range and the third is the
combining class of that range.
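
A lookup over these range triples could be sketched as below. The function
name combining_class is hypothetical, and the binary search assumes the ranges
are stored in increasing order and do not overlap, which the text above does
not state. The fallback of 0 for characters not in the table is the Unicode
default combining class.

```c
/*
 * Sketch: return the combining class of `code`, or 0 when no range in
 * the table contains it.
 */
static unsigned long combining_class(const unsigned long *ccl_nodes,
                                     unsigned short num_nodes,
                                     unsigned long code)
{
    unsigned long lo = 0, hi = num_nodes;   /* half-open range of nodes */

    while (lo < hi) {
        unsigned long mid = lo + (hi - lo) / 2;
        /* n[0] = range start, n[1] = range end, n[2] = combining class */
        const unsigned long *n = ccl_nodes + mid * 3;

        if (code < n[0])
            hi = mid;
        else if (code > n[1])
            lo = mid + 1;
        else
            return n[2];
    }
    return 0;
}
```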

If a character is not found in this table, then the combining class is

It is important to note that only 65536 distinct ranges plus combining class
can be specified because NumCCLNodes is usually a 16-bit number.

The final data file is called "num.dat" and contains the characters that have
a numeric value associated with them.

The format for the binary form of the table is:

    unsigned short ByteOrderMark
    unsigned short NumNumberNodes
    unsigned long  NumberNodes[NumNumberNodes]
    unsigned short ValueNodes[(Bytes - (NumNumberNodes * sizeof(unsigned long)))

If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.

The NumberNodes array contains pairs of values, the first of which is the
character code and the second an index into the ValueNodes array. The
ValueNodes array contains pairs of integers which represent the numerator
and denominator of the numeric value of the character. If the character
happens to map to an integer, both the values in ValueNodes will be the
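
The relationship between the two arrays can be sketched in C as follows. The
function name number_lookup is hypothetical. The sketch assumes that the index
stored in NumberNodes counts individual unsigned shorts in ValueNodes, and it
takes the number of (code, index) pairs as a parameter rather than guessing
how NumNumberNodes is counted. A linear scan is used so that no ordering of
the pairs has to be assumed.

```c
/*
 * Sketch: fetch the numerator and denominator stored for `code`.
 * Returns 1 and fills in both outputs when `code` is found, else 0.
 */
static int number_lookup(const unsigned long *number_nodes,
                         unsigned long num_pairs,
                         const unsigned short *value_nodes,
                         unsigned long code,
                         unsigned short *numerator,
                         unsigned short *denominator)
{
    unsigned long i;

    for (i = 0; i < num_pairs; i++) {
        /* n[0] = character code, n[1] = index into ValueNodes */
        const unsigned long *n = number_nodes + i * 2;

        if (n[0] == code) {
            /* Assumption: n[1] counts unsigned shorts, not pairs. */
            *numerator = value_nodes[n[1]];
            *denominator = value_nodes[n[1] + 1];
            return 1;
        }
    }
    return 0;
}
```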