A structure containing data for charset+collation pair implementation.
Virtual functions that use this data are collected into separate
structures, MY_CHARSET_HANDLER and MY_COLLATION_HANDLER.

typedef struct charset_info_st
  MY_UNI_IDX *tab_from_uni;
  uint strxfrm_multiply;
  uint16 max_sort_char; /* For LIKE optimization */
  MY_CHARSET_HANDLER *cset;
  MY_COLLATION_HANDLER *coll;

CHARSET_INFO fields description:
================================

number - an ID uniquely identifying this charset+collation pair.

primary_number - ID of a charset+collation pair, which consists
of the same character set and the default collation of this
character set. Not really used now. Intended to optimize some
parts of the code where we need to find the default collation
using its non-default counterpart for the given character set.

binary_number - ID of a charset+collation pair, which consists
of the same character set and the binary collation of this
character set. Not really used now.

csname - name of the character set for this charset+collation pair.
name - name of the collation for this charset+collation pair.
comment - a text comment, displayed in "Description" column of
SHOW CHARACTER SET output.

ctype - pointer to array[257] of "type of characters"
bit mask for each character, e.g., whether a
character is a digit, letter, separator, etc.

The macros index the array as ctype[(char)+1].
In most ctype libraries, ctype[0] is traditionally
reserved for EOF (-1). The idea is that you can use
the result from fgetc() directly with ctype[]. As
we have to be compatible with external ctype[] versions,
it's better to do it the same way as they do.
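
The +1 offset can be illustrated with a minimal sketch. The flag values and the sk_ names below are hypothetical, not the real masks or macros from m_ctype.h, and the table is filled for plain ASCII only. The sketch also assumes EOF is -1, as it is on common platforms.

```c
#include <stdio.h>   /* fgetc() and EOF, which this sketch assumes to be -1 */

/* Hypothetical bit flags; the real masks are defined elsewhere. */
#define SK_DIGIT 0x01
#define SK_ALPHA 0x02
#define SK_SPACE 0x04

static unsigned char sk_ctype[257];   /* slot 0 is reserved for EOF (-1) */

/* Fill the table for plain ASCII; a real charset ships a static table. */
void sk_init_ctype(void)
{
  int c;
  sk_ctype[0] = 0;                          /* EOF belongs to no class */
  for (c = 0; c < 256; c++) {
    unsigned char flags = 0;
    if (c >= '0' && c <= '9') flags |= SK_DIGIT;
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) flags |= SK_ALPHA;
    if (c == ' ' || (c >= '\t' && c <= '\r')) flags |= SK_SPACE;
    sk_ctype[c + 1] = flags;                /* shifted by one slot */
  }
}

/* c is an int in the range -1..255, which is what fgetc() returns. */
int sk_isdigit(int c) { return (sk_ctype[c + 1] & SK_DIGIT) != 0; }
int sk_isalpha(int c) { return (sk_ctype[c + 1] & SK_ALPHA) != 0; }
int sk_isspace(int c) { return (sk_ctype[c + 1] & SK_SPACE) != 0; }
```

Because of the shift, a caller can pass the return value of fgetc() straight to sk_isdigit() without testing for EOF first.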

to_lower - pointer to array[256] used in LCASE()
to_upper - pointer to array[256] used in UCASE()
sort_order - pointer to array[256] used for string comparison

In all Asian charsets these arrays are set up as follows:

- All bytes in the range 0x80..0xFF are marked as letters in the
  ctype array.

- The to_lower and to_upper arrays map only ASCII letters.
  UPPER() and LOWER() don't really work for multi-byte characters.
  Most of the characters in Asian character sets are ideograms
  anyway and they don't have case mapping. However, there are
  still some characters from European alphabets, for example:

  _ujis 0x8FAAF2 - LATIN CAPITAL LETTER Y WITH ACUTE
  _ujis 0x8FABF2 - LATIN SMALL LETTER Y WITH ACUTE

  But they don't map to each other with UPPER and LOWER operations.

- The sort_order array is filled case insensitively for the
  ASCII range 0x00..0x7F, and in "binary" fashion for the multi-byte
  range 0x80..0xFF for these collations:

  So multi-byte characters are sorted just according to their codes.

- Two collations are still case insensitive for the ASCII characters,
  but have special sorting order for multi-byte characters
  (something more complex than just according to codes):

  So handlers for these collations use only the 0x00..0x7F part
  of their sort_order arrays, and apply the special functions
  for multi-byte characters.
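
The "case insensitive for ASCII, binary for 0x80..0xFF" layout described above can be sketched as follows. This is an illustration under assumptions, not the actual table of any collation; the sk_ names are hypothetical, and the comparison works byte by byte, which is enough when multi-byte characters are ordered by their codes.

```c
#include <stddef.h>

static unsigned char sk_sort_order[256];

/* ASCII letters get a case-folded weight; every other byte,
   including 0x80..0xFF, keeps its own value ("binary" order). */
void sk_init_sort_order(void)
{
  int c;
  for (c = 0; c < 256; c++)
    sk_sort_order[c] = (unsigned char) c;
  for (c = 'a'; c <= 'z'; c++)
    sk_sort_order[c] = (unsigned char) (c - 'a' + 'A');
}

/* Compare two byte strings through the table.
   Returns -1, 0 or 1; on a common prefix the shorter string sorts first. */
int sk_compare(const unsigned char *a, size_t alen,
               const unsigned char *b, size_t blen)
{
  size_t n = alen < blen ? alen : blen;
  size_t i;
  for (i = 0; i < n; i++) {
    int wa = sk_sort_order[a[i]];
    int wb = sk_sort_order[b[i]];
    if (wa != wb)
      return wa < wb ? -1 : 1;
  }
  return alen == blen ? 0 : (alen < blen ? -1 : 1);
}
```

With this table "abc" and "ABC" compare as equal, while two multi-byte characters that differ in their second byte are ordered by that byte.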

In Unicode character sets we have full support for UPPER/LOWER mapping,
sorting order, and character type detection.
"utf8_general_ci" still has the "old-fashioned" arrays
like to_upper, to_lower, sort_order and ctype, but they are
not really used (maybe only in some rare legacy functions).

Unicode conversion data
-----------------------
For 8-bit character sets:

tab_to_uni  : array[256] of charset->Unicode translation
tab_from_uni: a structure for Unicode->charset translation

Non-8-bit charsets have their own structures per charset,
hidden in the corresponding ctype-xxx.c file, and don't use
the tab_to_uni and tab_from_uni tables.
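
For an 8-bit character set the forward translation is a single array lookup. The sketch below uses a hypothetical table: ISO-8859-1 values, where every byte equals its code point, with one ISO-8859-15 style override at 0xA4. The reverse direction is done with a linear scan purely for brevity; it is a stand-in and says nothing about how the real tab_from_uni structure is organized.

```c
#include <stdint.h>

static uint16_t sk_tab_to_uni[256];

void sk_init_tab_to_uni(void)
{
  int c;
  for (c = 0; c < 256; c++)
    sk_tab_to_uni[c] = (uint16_t) c;   /* ISO-8859-1: byte == code point */
  sk_tab_to_uni[0xA4] = 0x20AC;        /* ISO-8859-15 has EURO SIGN here */
}

/* charset -> Unicode: one array lookup */
uint16_t sk_to_uni(unsigned char byte)
{
  return sk_tab_to_uni[byte];
}

/* Unicode -> charset: linear scan stand-in.
   Returns the byte value, or -1 if the code point is not representable. */
int sk_from_uni(uint16_t wc)
{
  int c;
  for (c = 0; c < 256; c++)
    if (sk_tab_to_uni[c] == wc)
      return c;
  return -1;
}
```

A production table would replace the scan with an indexed lookup, since Unicode->charset conversion runs once per character.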

These maps are used to quickly identify whether a character is an
identifier part, a digit, a special character, or a part of another
SQL language lexical item.

They could probably be combined with the ctype array in the future.
For now, these two arrays are used in the parser, while the
separate ctype[] array is used in other parts of the code,
such as fulltext.

strxfrm_multiply - how many times a sort key (that is, a string
                   that can be passed into memcmp() for comparison)
                   can be longer than the original string.
                   Usually it is 1. For some complex
                   collations it can be bigger. For example,
                   in latin1_german2_ci, a sort key is up to
                   two times longer than the original string,
                   e.g. the letter 'A' with two dots above is
                   substituted with 'AE'.
mbminlen - minimum multi-byte sequence length.
           Now always 1 except for ucs2. For ucs2, it is 2.
mbmaxlen - maximum multi-byte sequence length.
           1 for 8-bit charsets. Can also be 2 or 3.
max_sort_char - for LIKE range:
                in case of 8-bit character sets - native code
                of maximum character (max_str pad byte);
                in case of UTF8 and UCS2 - Unicode code of the maximum
                possible character (usually U+FFFF). This code is
                converted to multi-byte representation (usually 0xEFBFBF)
                and then used as a pad sequence for max_str;
                in case of other multi-byte character sets -
                max_str pad byte (usually 0xFF).
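
The role of strxfrm_multiply in sizing a sort key buffer can be sketched as below. The expansion rule is a deliberately tiny stand-in (only A/a with diaeresis in latin1 expands, to "AE") and is not the real latin1_german2_ci weight table; the function and macro names are hypothetical.

```c
#include <stddef.h>

#define SK_STRXFRM_MULTIPLY 2   /* worst case: every byte becomes two */

/* Build a memcmp()-comparable key from a latin1 string.
   Returns the key length, or 0 if dst cannot hold the worst case
   of srclen * SK_STRXFRM_MULTIPLY bytes. */
size_t sk_make_sort_key(unsigned char *dst, size_t dstlen,
                        const unsigned char *src, size_t srclen)
{
  size_t i, n = 0;
  if (dstlen < srclen * SK_STRXFRM_MULTIPLY)
    return 0;
  for (i = 0; i < srclen; i++) {
    unsigned char c = src[i];
    if (c == 0xC4 || c == 0xE4) {        /* A/a with diaeresis -> "AE" */
      dst[n++] = 'A';
      dst[n++] = 'E';
    } else if (c >= 'a' && c <= 'z') {
      dst[n++] = (unsigned char) (c - 'a' + 'A');   /* fold ASCII case */
    } else {
      dst[n++] = c;
    }
  }
  return n;
}
```

The caller allocates the buffer from the worst case up front, so the key builder never has to reallocate while expanding characters.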

MY_CHARSET_HANDLER is a collection of character-set
related routines, defined in m_ctype.h. It has the
following set of functions:

ismbchar()  - detects whether the given string is a multi-byte sequence
mbcharlen() - returns length of multi-byte sequence starting with
              the given character.
numchars()  - returns number of characters in the given string, e.g.
              in SQL function CHAR_LENGTH().
charpos()   - calculates the offset of the given position in the string.
              Used in SQL functions LEFT(), RIGHT(), SUBSTRING(),
            - finds the length of correctly formed multi-byte beginning.
              Used in INSERTs to cut a beginning of the given string
              that is:
              a) "well formed" according to the given character set.
              b) can fit into the given data type
              Terminates the string in the good position, taking into
              account multi-byte character boundaries.

lengthsp()  - returns the length of the given string without trailing spaces.
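
The "well formed and fits" prefix calculation can be sketched for UTF-8 as follows. This is a simplified illustration with hypothetical names: it handles only 1- to 3-byte sequences, rejects overlong forms, and does not reject surrogate code points, so it should not be read as a complete validator.

```c
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence starting at s, or 0 if the
   sequence is malformed or cut off by end. Handles 1..3 bytes only. */
static size_t sk_utf8_seq_len(const unsigned char *s, const unsigned char *end)
{
  unsigned char c = s[0];
  if (c < 0x80)
    return 1;
  if (c >= 0xC2 && c <= 0xDF)
    return (end - s >= 2 && (s[1] & 0xC0) == 0x80) ? 2 : 0;
  if (c >= 0xE0 && c <= 0xEF) {
    if (end - s < 3 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
      return 0;
    if (c == 0xE0 && s[1] < 0xA0)
      return 0;                           /* overlong 3-byte form */
    return 3;
  }
  return 0;                               /* stray or unsupported lead byte */
}

/* Byte length of the longest prefix of s[0..len) that is well formed
   and holds at most max_chars characters. */
size_t sk_well_formed_prefix_len(const unsigned char *s, size_t len,
                                 size_t max_chars)
{
  const unsigned char *p = s;
  const unsigned char *end = s + len;
  while (p < end && max_chars > 0) {
    size_t n = sk_utf8_seq_len(p, end);
    if (n == 0)
      break;                              /* stop before the bad sequence */
    p += n;
    max_chars--;
  }
  return (size_t) (p - s);
}
```

The returned length always falls on a character boundary, so truncating the input there never splits a multi-byte character.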

Unicode conversion routines
---------------------------
mb_wc - converts the left multi-byte sequence into its Unicode code.
wc_mb - converts the given Unicode code into multi-byte sequence.

Case and sort conversion
------------------------
caseup_str - converts the given 0-terminated string to uppercase
casedn_str - converts the given 0-terminated string to lowercase
caseup     - converts the given string to uppercase using length
casedn     - converts the given string to lowercase using length
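
The difference between the 0-terminated and the length-based variants can be shown with a small sketch. It assumes an 8-bit character set with a 256-entry to_upper style table that folds ASCII letters only; the sk_ names are hypothetical and the signatures are not those of the real handlers.

```c
#include <stddef.h>

static unsigned char sk_to_upper[256];

/* Identity mapping except for ASCII lowercase letters. */
void sk_init_to_upper(void)
{
  int c;
  for (c = 0; c < 256; c++)
    sk_to_upper[c] = (unsigned char) c;
  for (c = 'a'; c <= 'z'; c++)
    sk_to_upper[c] = (unsigned char) (c - 'a' + 'A');
}

/* 0-terminated variant: stops at the first zero byte. */
void sk_caseup_str(unsigned char *s)
{
  for (; *s; s++)
    *s = sk_to_upper[*s];
}

/* Length-based variant: converts exactly len bytes, zero bytes included. */
void sk_caseup(unsigned char *s, size_t len)
{
  size_t i;
  for (i = 0; i < len; i++)
    s[i] = sk_to_upper[s[i]];
}
```

The length-based form is the one that stays correct for values containing embedded zero bytes, which the 0-terminated form silently truncates.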

Number-to-string conversion routines
------------------------------------
The names are pretty self-describing.

String padding routines
-----------------------
fill() - writes the given Unicode value into the given string
         with the given length. Used to pad the string, usually
         with space character, according to the given charset.

String-to-number conversion routines
------------------------------------
These functions are almost the same as their STDLIB counterparts,
but they:
- accept length instead of 0-terminator
- are character set dependent
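
A length-bounded conversion can be sketched as below. The function name and signature are hypothetical. The sketch assumes an ASCII-compatible character set in which digits are single bytes, which is exactly the part that makes the real routines character set dependent, and it performs no overflow checking.

```c
#include <stddef.h>

/* Parse an optionally signed decimal integer from s[0..len).
   Never reads past len, so s does not need a 0-terminator.
   *consumed receives the number of bytes used, or 0 if no digits
   were found. */
long sk_parse_long(const char *s, size_t len, size_t *consumed)
{
  size_t i = 0;
  size_t digits_start;
  long value = 0;
  int negative = 0;

  while (i < len && (s[i] == ' ' || s[i] == '\t'))
    i++;                                  /* skip leading blanks */
  if (i < len && (s[i] == '+' || s[i] == '-'))
    negative = (s[i++] == '-');
  digits_start = i;
  while (i < len && s[i] >= '0' && s[i] <= '9')
    value = value * 10 + (s[i++] - '0');
  *consumed = (i == digits_start) ? 0 : i;
  return negative ? -value : value;
}
```

Passing a length of 3 for the buffer "12345" parses only "123", which a 0-terminator based function could not do without copying the buffer.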

Simple scanner routines
-----------------------
scan() - skips leading spaces in the given string.
         Used when a string value is inserted into a numeric field.

strnncoll()   - compares two strings according to the given collation
strnncollsp() - like the above but ignores trailing spaces
strnxfrm()    - makes a sort key suitable for memcmp() corresponding
                to the given string
like_range()  - creates a LIKE range, for optimizer
wildcmp()     - wildcard comparison, for LIKE
strcasecmp()  - 0-terminated string comparison
instr()       - finds the first substring appearance in the string
hash_sort()   - calculates hash value taking into account
                the collation rules, e.g. case-insensitivity,
                accent sensitivity, etc.
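
Two of the ideas above, comparison that ignores trailing spaces and a hash that respects the collation, can be sketched together. The collation here is a stand-in that only folds ASCII case, the hash is 32-bit FNV-1a rather than whatever the real handlers use, and all sk_ names are hypothetical. Trailing spaces are simply stripped before comparing, which follows the description above; a real implementation may treat them differently.

```c
#include <stddef.h>

/* Stand-in collation weight: fold ASCII case, keep other bytes as is. */
static unsigned sk_weight(unsigned char c)
{
  return (c >= 'a' && c <= 'z') ? (unsigned) (c - 'a' + 'A') : c;
}

/* Length of s without trailing spaces. */
static size_t sk_lengthsp(const unsigned char *s, size_t len)
{
  while (len > 0 && s[len - 1] == ' ')
    len--;
  return len;
}

/* Compare by weight, ignoring trailing spaces. Returns -1, 0 or 1. */
int sk_compare_sp(const unsigned char *a, size_t alen,
                  const unsigned char *b, size_t blen)
{
  size_t i, n;
  alen = sk_lengthsp(a, alen);
  blen = sk_lengthsp(b, blen);
  n = alen < blen ? alen : blen;
  for (i = 0; i < n; i++) {
    unsigned wa = sk_weight(a[i]);
    unsigned wb = sk_weight(b[i]);
    if (wa != wb)
      return wa < wb ? -1 : 1;
  }
  return alen == blen ? 0 : (alen < blen ? -1 : 1);
}

/* 32-bit FNV-1a over the weights, so strings that compare as equal
   also hash to the same value. */
unsigned long sk_hash_sort(const unsigned char *s, size_t len)
{
  unsigned long h = 2166136261UL;
  size_t i;
  len = sk_lengthsp(s, len);
  for (i = 0; i < len; i++) {
    h ^= sk_weight(s[i]);
    h = (h * 16777619UL) & 0xFFFFFFFFUL;
  }
  return h;
}
```

Hashing the weights instead of the raw bytes is what keeps the hash consistent with the comparison function: any two strings the collation treats as equal end up in the same bucket.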