1 * String collation Notes
We are going to implement Windows-like collation, as opposed to ICU (which
implements UCA, the Unicode Collation Algorithm).
11 * create collation element table(s)
12 - infer how Windows collation element table is composed
14 - write table generator source(s)
15 : mostly implemented. Need to fix nearly 400 mappings.
16 They are mainly 1) IPA extensions (U+250-U+300),
17 2) Latin extensions (U+1E00-U+1F00), 3) Letterlike
18 symbols (U+2100-U+2140), 4) some Cyrillic letters
19 (U+460-U+500), and 5) some Hangul characters.
20 - culture-specific sortkey data
: They are defined in mono-tailoring-source.txt.
All single sortkey remappings in all cultures are filled.
Contractions are not fully checked yet (they should be filled
from UCA tailorings via create-tailorings.exe).
27 ** How to implement CompareInfo members
Computes the sort key of every character element into a byte[].
32 Find first difference and compare it.
33 "Larger/smaller" matters (beyond "different").
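As a minimal sketch of "find the first difference and compare it", assuming both sort keys are already available as byte[] (for example from SortKey.KeyData): the result carries a sign, not just equal/different. The class and method names are hypothetical and are not taken from the implementation.

```csharp
using System;

public static class SortKeyBytes
{
	// Returns a negative value, zero, or a positive value,
	// in the same spirit as String.Compare ().
	// Hypothetical helper, for illustration only.
	public static int Compare (byte [] x, byte [] y)
	{
		int len = Math.Min (x.Length, y.Length);
		for (int i = 0; i < len; i++) {
			if (x [i] != y [i])
				// "Larger/smaller" matters, not only "different".
				return x [i] < y [i] ? -1 : 1;
		}
		// One key is a prefix of the other: the shorter one sorts first.
		return x.Length.CompareTo (y.Length);
	}
}
```

With the keys these notes give for 'A' (0E 02 ...) and 'B' (0E 09 ...), the first differing byte is the second one, so 'A' sorts before 'B'.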
35 It calls CompareInternal() which also answers if the target
36 is fully consumed, so it just returns true if it says that
37 the target is fully consumed.
39 It tries CompareInternal() to compare source and target at
40 the end, where source varies from minimum tail to the
42 IndexOf(), LastIndexOf()
43 For character search, it finds the matching character element
44 to the end (or start) of the string to find.
For string search, it invokes one of the private IndexOf() (or
LastIndexOf()) overloads, passing the first character element
of the target, and if found, tests whether the position is a valid
start point, using IsPrefix() (or IsSuffix()).
For Compare() and IsPrefix(), it uses forward iteration, which moves
forward and doesn't stop until either it finds the next "primary" character
or it reaches the end of the string, checking with IsSafe(char).
For IndexOf(char) and LastIndexOf(char), there is no special
optimization (since the codepoints usually do not match, while they
often match under natural collation), but it omits extraneous sortkey
61 IsSuffix() reuses Compare() and returns false if it does not consume
the target string more than 3 times. 3 is a kind of magic number that
represents the longest expansion.
IndexOf(string) is implemented as a combination of IndexOf(char) and
IsPrefix().
68 LastIndexOf(string) is implemented as a combination of
69 LastIndexOf(char) and IsPrefix().
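The combination described above can be sketched on top of the public CompareInfo API. This is only an illustration under simplifying assumptions: it searches for the first char of the target rather than its first character element, and the class name SearchSketch is hypothetical.

```csharp
using System;
using System.Globalization;

public static class SearchSketch
{
	// Finds a candidate with IndexOf (char), then validates it with IsPrefix ().
	public static int IndexOf (CompareInfo ci, string source,
		string target, CompareOptions opt)
	{
		if (target.Length == 0)
			return 0;
		int start = 0;
		while (start < source.Length) {
			int idx = ci.IndexOf (source, target [0], start, opt);
			if (idx < 0)
				return -1; // no more candidates
			if (ci.IsPrefix (source.Substring (idx), target, opt))
				return idx; // valid start point
			start = idx + 1; // try the next candidate
		}
		return -1;
	}
}
```

LastIndexOf(string) would be the mirror image: walk candidates backwards with LastIndexOf(char) and validate each one the same way.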
71 Porting them to C code is an alternative possible approach, but from
72 Compare() optimization experience, it is quick enough.
74 ** How to support CompareOptions
There are two kinds of "ignorance": strippers' ignorance and
normalizers' ignorance.
79 The strippers will "filter characters out" and there will be no
80 corresponding character elements in SortKey binaries.
Normalizers, on the other hand, fold certain characters together,
but each character still remains distinct from unrelated characters.
84 For example, with IgnoreKanaType Hiragana "A" and Katakana "A" are
85 not distinguished, but Hiragana "A" and Hiragana "I" are.
87 Actually, even without any IgnoreXXX flags (i.e. "None"), there are
88 many characters that are ignored ("completely ignorable").
90 Except for LCID 101/1125(div), '\ufdf2' is completely ignorable.
91 This rule even applies to CompareOptions.None.
96 Maybe culture-dependent TextInfo.ToLower() could be used.
98 Unlike ICU (specialCaseToLower()), even with tr-TR(LCID 31)
99 and IgnoreCase, I\u0307 is not regarded as equal to i.
102 ToKanaTypeInsensitive(). Note that this does not cover the
103 whole "level 4" differences described later.
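A kana-type normalizer can be sketched as a fixed-offset fold, assuming that the fullwidth Katakana letters U+30A1-U+30F6 sit exactly 0x60 above their Hiragana counterparts. The direction of the fold (Katakana to Hiragana) and the body below are assumptions for illustration, not the actual ToKanaTypeInsensitive() implementation; halfwidth Katakana is not handled.

```csharp
using System;

public static class KanaSketch
{
	// Folds fullwidth Katakana onto Hiragana; everything else is unchanged.
	public static char ToKanaTypeInsensitive (char c)
	{
		if (c >= '\u30A1' && c <= '\u30F6')
			return (char) (c - 0x60);
		return c;
	}
}
```

After folding, Katakana "A" (U+30A2) and Hiragana "A" (U+3042) become the same character, while Hiragana "A" and Hiragana "I" (U+3044) stay different, which is the behavior described for IgnoreKanaType.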
106 ToWidthInsensitive(), which is likely to be culture
107 independent. See also "Notes".
109 IgnoreNonSpace (see also Strippers; this flag works in both sides)
110 For some cultures this logic is still incomplete. All culture-
111 dependent collators must handle valid "replacement" of "one or
112 more characters" which might be related to specific
For example, there is a Japanese text sorting rule that
nevertheless applies to InvariantCulture as well. Concretely,
"\u3042\u30FC" is equivalent to "\u3042\u3042" only when
IgnoreNonSpace is specified.
I'll take those items from CLDR (those items which have
<reset before="..." />), case by case though.
124 I already wrote all the required strippers which should be MS
125 compatible (at least with .NET 1.1 invariant culture).
128 IsIgnorableNonSpacing().
129 Some Diacritic characters are covered by this flag.
131 There are some culture *dependent* characters:
132 LCID 90/1114(syr) : 64b, 652, 670
136 UnicodeCategory does not work here.
138 There are some culture *dependent* characters:
139 LCID 17/1041(ja) : 2015
140 LCID 90/1114(syr) : 64b, 652
144 See "sort order categories" section.
148 First to note: we won't use collation element table from unicode.org.
There are many differences between ICU/UCA and Windows even though they
look so similar: collation keys are placed in different levels, composition
is culture dependent, etc. Historically, Windows collation was
designed before UCA was specified, so basically Windows is obsolete
156 - Logic: Unlike UCA it has no concept of "blocked" combining marks,
157 and combining marks are never considered as an independent character
158 (thus combining in Windows is buggy).
- Data: Windows is based on an old Unicode standard (even older than 1.1).
160 It ignores minor cultures. Character property values differ as well
161 as those from the default Unicode collation element table (DUCET).
162 In a few cultures Windows collation is close to the native language
163 (e.g. Tamil, while it does not conform to TAM).
165 IgnoreWidth/IgnoreSymbols is processed after Kana voice mark
166 decomposition (something like NFD, but not equivalent. Example: \u304C
167 is completely equivalent to \u304B\u309B, which is not part of NFKD).
<del>This means, if there are combined Kana characters, they will be
first decomposed and then compared.</del> It scarcely matters since
there are special weight data for Japanese.
172 *** Microsoft design problem
The Microsoft implementation seems to have a serious problem: many,
many characters that are used in specific cultures, such as
Myanmar, Mongolian, Cherokee, Ethiopic, Tagalog, Khmer, are regarded as
"completely ignorable".
179 I tagged many LAMESPEC items in the implementation (both in collator
180 and table generator).
183 ** MS collation design inference
187 Each character has several "weights". It is a common concept between
192 - level 1: primary difference
193 The first byte of level 1 means the category of the character.
194 - level 2: diacritic difference, including Japanese voice mark countup
195 - level 3: case/width sensitivity, and Hangul properties
196 - level 4: kana weight (all of them have primary category 22, at least
198 - level 5: shift weight (apostrophe, hyphens etc.)
Note that these levels do not map one-to-one to IgnoreXXX flags. Thus
it is not OK to simply omit some levels of sortkey values in reflection
String comparison is done from level 1 to 5. The comparison won't
stop until either it finds a primary difference or it reaches
the end (in which case the first difference at a later level is returned).
For example, "e" is smaller than "E", but "eB" is bigger than "EA".
If the collator simply returned the case difference at the first 'e' and 'E',
"eB" would wrongly be reported as smaller than "EA".
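The rule can be sketched with stand-in weights. Everything below is hypothetical and only covers ASCII letters: the primary weight is just the uppercased letter and the case weight is 0 for lowercase, 1 for uppercase. The point is only that a lower-level difference is remembered, not returned, until the primary comparison has finished.

```csharp
using System;

public static class LevelSketch
{
	// Stand-in weights (hypothetical, ASCII letters only).
	static int Primary (char c) { return Char.ToUpperInvariant (c); }
	static int CaseWeight (char c) { return Char.IsUpper (c) ? 1 : 0; }

	public static int Compare (string x, string y)
	{
		int pending = 0; // first case difference seen so far
		int len = Math.Min (x.Length, y.Length);
		for (int i = 0; i < len; i++) {
			int p = Primary (x [i]).CompareTo (Primary (y [i]));
			if (p != 0)
				return p; // a primary difference ends the comparison
			if (pending == 0)
				pending = CaseWeight (x [i]).CompareTo (CaseWeight (y [i]));
		}
		if (x.Length != y.Length)
			return x.Length.CompareTo (y.Length);
		return pending; // only now does the case difference count
	}
}
```

Here Compare ("e", "E") is negative, but Compare ("eB", "EA") is positive because the primary difference between 'B' and 'A' overrides the pending case difference.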
212 **** level 5: shift weight by StringSort
There are some characters that are treated specially, namely
apostrophe and hyphens. The sortkeys for them are put after level 4
(thus here I write them as "level 5"). They have a different sort key
format; see immediately below. There are no level 5 characters when
StringSort is specified.
222 00 means the end of sort key.
223 01 means the end of the level.
224 02-FF means the value.
225 Actually '2' could be cut when all the following values are
226 also '2' (i.e. the sort key binary won't contain extraneous '2').
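The byte layout above can be sketched as a small key builder. It is an illustration under assumptions: the name SortKeyLayout is hypothetical, and trimming of trailing 02 bytes is applied to levels 2 and 3 only, because a level 1 weight such as "0E 02" must keep its 02.

```csharp
using System;
using System.Collections.Generic;

public static class SortKeyLayout
{
	// Joins the per-level byte runs with 01 and terminates with 00.
	public static byte [] Build (params byte [][] levels)
	{
		List<byte> key = new List<byte> ();
		for (int i = 0; i < levels.Length; i++) {
			int end = levels [i].Length;
			// Assumption: only levels 2 and 3 drop trailing 02 values.
			if (i == 1 || i == 2)
				while (end > 0 && levels [i][end - 1] == 2)
					end--;
			for (int j = 0; j < end; j++)
				key.Add (levels [i][j]);
			if (i + 1 < levels.Length)
				key.Add (1); // end of the level
		}
		key.Add (0); // end of the sort key
		return key.ToArray ();
	}
}
```

Passing five levels where only level 1 carries a weight produces the familiar "xx yy 01 01 01 01 00" shape, and passing four empty levels plus a level 5 run produces "01 01 01 01 ... 00".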
228 Every level has different key layout.
232 It looks like all level 2 keys are just accumulated, however without
233 considering overflow. It sometimes makes sense (e.g. diaeresis and
234 acute) but it causes many conflicts (e.g. "A\u0308\u0301" and "\u1EA6"
235 are incorrectly regarded as equal).
237 Anyways since Japanese voice mark has level 2 value as 1 it just
238 looked like the sum of voice marks.
242 The actual value analysis is not complete in this document. See the
243 actual generator code.
245 The actual values are + 2 (e.g. for Hangul Normal Jamo, the value is 4)
248 - 2: Jongseong (11A8-11F9)
249 - 4: Half width? (FFA0-FFDC) and Compatibility Jamo? (3165-318E)
250 - 5: Compatibility Jamo (3130-3164)?
251 TODO: Learn about Korean characters.
254 - 4 circled inverse (2776-277F)
255 - 8 circled sans serif (2780-2789)
256 - C circled inverse && sans serif (278A-2793)
257 - 47 roman (2160-2182)
260 - 2 Isolated form in presentation form B in FE80-FE8D
261 - 4 Alef/Bet/Gimel/Dalet (2135-2138)
262 - 8 Final form in presentation form B in FE82-FEF2
263 - 18 Medial form in presentation form B in FE8C-FEF4
264 Grep "ISOLATED", "FINAL" or "MEDIAL" on UnicodeData.txt
265 (and filter by codepoints).
266 or alternatively, see DerivedDecompositionType.txt.
267 - 22 6A9 (TODO: what is it?)
268 - 28 6AA (TODO: what is it?)
271 - 1 Fullwidth. UnicodeData.txt has <full>.
272 - 2 Subscript. UnicodeData.txt has <sub>.
273 - 8 Small capital, 03C2 (TODO: why?),
274 2104, 212B(flag=1A) (TODO: why?)
275 grep "SMALL CAPITAL" against UnicodeData.txt.
276 - C only FE42. TODO: what is this?
277 - E Superscripts. UnicodeData.txt has <super>.
279 DerivedCoreProperties.txt has Uppercase property.
281 Note that simple 02 (value is 00) could be omitted.
Summary: at least 7 bits are required to represent this in a table:
smallCapital, uppercase, normalization forms (2 bits: full/sub/super),
arabic forms (2 bits: isolated/medial/final)
Those sortkey data are collected only for Japanese (category 22)
292 There are 3 sections each of them ends with FF. Each of them
293 represents the values for character by character:
294 - small letter type (kogaki moji); C4 (small) or E4 (normal)
295 - category middle section:
296 two subsections separated by 0x02
299 or 4 (voice mark - \u309D,\u309E,\u30FD,\u30FE, \uFF70)
300 or 5 (dash mark - \u30FC)
301 - kana type; C4 (katakana) or E4 (hiragana)
302 - width; 2 (normal) or C5 (full) or C4 (half)
LAMESPEC: those characters with value '4' in the middle section differ
in level 2 with respect to voice marks, but kana types are not
differentiated (bug). It is ignored when IgnoreNonSpace applies.
310 UPDATED: I noticed offsetL does not exist, so removed it from here.
312 [offsetM + 0x80]? [const 3 + (offsetS + 1) * 4] [category] [level1]
314 where "offsetM" and "offsetS" represents the offset in the input
315 string. "offsetM" is always larger than 0x80.
316 LAMESPEC: This design results in a buggy overflow.
byte [] data = new CultureInfo ("").CompareInfo.GetSortKey (s).KeyData;
int idx = 0;
// skip levels 1 to 4; each of them ends with 01
for (int i = 0; i < 4; i++, idx++)
	for (; data [idx] != 1; idx++)
		;
// dump the rest (level 5 and the terminating 00)
for (; idx < data.Length; idx++)
	Console.Write ("{0:X02} ", data [idx]);
Console.WriteLine ();
329 inputs (s) and results:
331 80 07 06 82 80 2F 06 82 00 // '-' + new string ('A', 10) + '-'
332 80 07 06 82 81 97 06 82 00 // (100)
333 80 07 06 82 8F A7 06 82 00 // (1000)
334 80 07 06 82 9C 47 06 82 00 // (10000)
335 80 07 06 82 9A 87 06 82 00 // (100000)
336 80 07 06 82 89 07 06 82 00 // (1000000)
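The sample results above decode as 64 * offsetM + offsetS, wrapping
around at 8192 (for example, 81 97 gives 64 * 1 + 36 = 100, and
9C 47 gives 64 * 28 + 16 = 1808 = 10000 - 8192).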
340 (const '3' may actually vary but no idea.
341 At least 00, 01 and 02 are not acceptable since they are reserved.
342 02 is not reserved by definition above, but the key-size optimizer
343 uses it as a special mark, as mentioned above.)
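One reading that reproduces all six sample results above is a 15-bit counter, 7 + 4 * offset, with the top bit of the first byte always set. The sketch below follows that reading; it is inferred from the samples only, and the names ShiftOffset, Encode and Decode are hypothetical. Note that the samples decode with a multiplier of 64.

```csharp
using System;

public static class ShiftOffset
{
	// Returns the two offset bytes as one 16-bit value (high byte first).
	public static int Encode (int offset)
	{
		// The counter wraps at 15 bits, which is the overflow noted above.
		return 0x8000 | ((7 + 4 * offset) & 0x7FFF);
	}

	// Recovers the (wrapped) offset from the two bytes.
	public static int Decode (byte hi, byte lo)
	{
		int offsetM = hi - 0x80;
		int offsetS = (lo - 3) / 4 - 1;
		return 64 * offsetM + offsetS;
	}
}
```

For example, Encode (10) is 0x802F and Encode (10000) is 0x9C47, matching the dumps, while Decode (0x9C, 0x47) is 1808 because the counter has wrapped once at 8192.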
347 Here is the simple sortkey dumper:
using System;
using System.Globalization;
public class SortKeyDumper
{
	public static void Main (string [] args)
	{
		CultureInfo culture = args.Length > 0 ?
			new CultureInfo (args [0]) :
			CultureInfo.InvariantCulture;
		CompareInfo ci = culture.CompareInfo;
		for (int i = 0; i < char.MaxValue; i++) {
			string s = new string ((char) i, 1);
			// skip completely ignorable characters
			if (ci.Compare (s, "") == 0)
				continue;
			byte [] data = ci.GetSortKey (s).KeyData;
			foreach (byte b in data) {
				Console.Write ("{0:X02}", b);
				Console.Write (' ');
			}
			Console.WriteLine (" : {0:X}, {1} {2}",
				i,
				Char.GetUnicodeCategory ((char) i),
				data [2] != 1 ? '!' : ' ');
		}
	}
}
371 *** multiple character mappings
Some sequences of characters are considered as a "composite" that is
to be composed either as another character or another sequence of
characters. Those "composites" might not have a corresponding equivalent
character in sortkey.
377 Similarly, some single characters are expanded to a sequence of
380 **** diacritic characters
382 Except for those shift-weight characters, there are only
383 diacritical (or other kinds of nonspacing) characters that don't
384 have primary weights.
386 Diacritics are not regarded as a base character when placed after
387 (maybe some kind of) letters.
389 The behavior is diacritic character dependent. For example, Japanese
390 combination of a Kana character and a voice mark is compulsory (the
391 resulting sort key is regarded as identical to the corresponding
392 single character. Try \u304B\u309B with \u304C. It is invariant).
394 In French cultures, diacritic orderings are checked from right to left.
396 **** Composite character processing
398 There are some sequences of characters that are treated as another
399 character or another sequence of characters.
401 By default, there is no composite form.
402 http://www.microsoft.com/globaldev/dis_v1/disv1.asp?DID=dis33d&File=S24C2.asp
403 (Note that composite is different from expansion.)
Note that composite characters are likely to not have equivalent
408 **** Expanded character processing
410 Some characters are expanded to two or more characters:
412 C6 (AE), E6 (ae), 1F1-1F3 (dz), 1C4-1C6 (Dz), FB00-FB06 (ff, fi),
413 132-133 (IJ), 1C7-1C9 (LJ), 1CA-1CC (NJ), 152-153 (OE),
414 DF (ss), FB06 (st), FB05 (\u017Ft), FE, DE, 5F0-5F2,
416 (CJK extension is not really expanded)
418 They don't match with any of Unicode normalization.
420 Some alphabetic cultures have different mappings, but mostly small
421 (at least da-DK, lt-LT, fr-FR, es-ES have tiny differences).
423 Invariant culture also puts Czech unique character \u0161 between s
424 and t, unlike described here:
425 http://www.microsoft.com/globaldev/dis_v1/disv1.asp?DID=dis33d&File=S24C0.asp
427 *** default sort key table
431 When CompareOptions.StringSort is specified, then it modifies
432 characters in category 2 from "1 1 1 1 80 07 06 xx" to
433 "06 xx yy zz" and some characters become case sensitive.
435 For details, "level 5" description above.
437 To handle them simply, they are laid out as "category 0x01" (which
438 never happens in the actual sortkeys) for those shift-weight ones
There seem to be no further differences between StringSort and None.
445 The value analysis is not complete in this document. See the
446 actual generator code.
449 -0A: Korean parenthesized numbers (3200-321C)
450 -0C: Korean circled numbers (3260-327B)
452 -03: Japanese voice mark
454 <primary category 13 : Arabic>
455 -08: 627-648 (basic Abjad letters)
457 -05: waw with hamza (624)
458 -07: yeh with hamza (626. ignore Presentation Form A area)
459 -0A: alef with hamza above (623)
460 -0A: alef with hamza below (625)
462 <primary category 0E : diacritics>
463 Characters in non "0E" category are out of scope.
464 They can be grepped in UnicodeData.txt.
472 Note that 1C4-1C6 are covered but they are also expanded.
473 -15: breve (cyrillic are also covered? at least 4C1/4C2 are.)
474 -16: dialytika and tonos (category 0F though)
477 -1A: ring above | 212B
478 -1B: ogonek ("WITH OGONEK;")
-1C: cedilla ("WITH CEDILLA;")
480 -1D: double acute | acute and dot above
481 -1E: stroke, except for 0E[1F] and cp{19B, 1BE} |
482 circumflex and acute | 18B,18C,19A,289
(i.e. this is not a one-to-one mapping: not every
"stroke" is mapped to 1E, and not every 1E is mapped to
486 -1F: diaeresis and acute | with circumflex and grave | l slash
487 beware "symbol slash"
488 -20: diaeresis and grave | 19B,19F
489 -21: breve and acute | D8,F8
490 -22: caron and dot above | breve and grave
491 -23: macron and acute
492 -24: macron and grave
493 -25: diaeresis and caron | dot above and macron | tilde and acute
494 -26: ring above and acute
495 -28: diaeresis and macron | cedilla and acute |
497 -29: circumflex and tilde
498 -2A: tilde and diaeresis
499 -2B: stroke and acute
501 -2F: cedilla and breve
502 -30: ogonek and macron
503 -43: hook, except for cp{192,1B2,25A,25D,27B,28B,2B1,2B5} |
504 left hook | with hook above except for cp{1EF6,1EF7} |
506 -44: double grave | 1EF6,1EF7
508 -48: preceded by apostrophe (actually only 149)
510 -55: line below | circumflex and hook above
511 -57: palatal hook (actually only 1AB)
512 -58: dot below except for cp{1EA0,1EA1}
513 -59: "retroflex" (without "WITH") | diaeresis below | 1EA0,1EA1
514 -5A: ring below | 1E76,1E77
515 -60: circumflex below except for cp{1E76,1E77} | horn and acute
516 -61: breve below | horn and grave
517 -63: tilde below | 2125
518 -68: D0,F0,182,183 | dot below and dot above | topbar
519 -69: right half ring | horn and tilde
520 -6A: circumflex and dot below
521 -6D: breve and dot below
522 -6E: dot below and macron
523 -95: horn and hook above
526 (for 01-0D and 7B-8A, they are not related to diacritics.)
528 <category BlahBlahNumbers from 0100 to 1000>
529 -38: Arabic-Indic numbers (660-669)
530 -39: extended Arabic-Indic numbers (6F0-6F9)
531 -3A: Devanagari numbers (966-96F)
532 -3B: Bengali numbers (9E6-9EF)
533 -3C: Bengali currency enumerators (9F4-9F9)
534 -3D: Gurmukhi numbers (A66-A6F)
-3E: Gujarati numbers (AE6-AEF)
536 -3F: Oriya digit numbers (B66-B6F)
537 -40: Tamil numbers (BE7-BF2)
538 -41: Telugu numbers (C66-C6F)
539 -42: Kannada numbers (CE6-CEF)
-43: Malayalam numbers (D66-D6F)
541 -44: Thai numbers (E50-E59)
542 -45: Lao numbers (ED0-ED9)
543 <miscellaneous numbers>
544 -47: Roman numbers (2160-2182)
-4E: Hangzhou numbers (3021-3029)
-E0[64]: 2107 (Euler)
548 -E0[87]: some Tone letters (TONE TWO / TONE SIX)
549 -EE: Circled letter-or-digits and katakanas
550 CIRCLED {DIGIT|NUMBER|LATIN|KATAKANA}
551 numbers (2460-2473,2776-2793,24EA)
554 -F3: Parenthesized enumerations
557 PARENTHESIZED {DIGIT|NUMBER|LATIN}
558 -F4: Numbers with dot (2488-249B)
559 {DIGIT|NUMBER} * FULL STOP
562 -258,25C-25E,285,286,29A,297 -> 0E[80-86,88]
563 -27F,2B3-2B6 -> 0E 8A[80-84]
568 -20D0-20E1 -> 01[DD-F0]
569 -483-486 -> 01[94-97]
570 -559,55A -> 01[98,99]
573 -346-348,2BE-2C5,2CE-2CF -> 01[74-7F]
574 -2D1-2D3,2DE,2E4-2E9 -> 01[81-8A]
576 -342,343 -> 01[8D,8E]
579 -700-780 01[8D-AF]. Maybe there is some kind of traditional
580 order in Estrangela, but for now am not sure.
582 -740-742 -> 01[8D-8F]
583 -747,748,732,735,738,739,73C,73F,743-746,730 -> 01[90,91,94-9F]
584 -731,733,734,736,737,73A,73B,73D,73E,749,74A,7A6-7A9
587 -7AA-7B0 -> 01[B0-B6]
589 -591-5C2 except for 5BA,5BE -> 01[03-33] in order
591 No further patterns for >= 80
593 TODO: Below are not done yet:
594 - x < 0x80 in non-"0E" part
595 - 03 <= x <= 0D in "0E" part
596 - 7B <= x <= 7F in "0E" part
598 **** sortkey details by category
600 The actual value analysis is not complete in this document. See the
601 actual generator code.
603 1 specially ignored ones (Japanese, Tamil, Thai)
605 IdentifyBy: constants
606 Unicode: 3099-309C, BCD, E47, E4C, FF9E, FF9F
607 SortKey: 01 01 01 01 00
609 2 shift weight characters
611 They are either at 01 01 01 01 or 06, depending on StringSort. For
612 convenience, I use 06 to describe them.
614 2.1 control characters (specified as such in Unicode), except for
615 whitespaces (0009-000D).
618 IdentifyBy: UnicodeCategory.Control
619 Unicode: 0001-000F minus 0009-000D, 007F-009F
620 SortKey: 06 03 - 06 3D
624 Unicode: 0027,FF07 (')
625 SortKey: 06 80 (and width insensitive equivalents)
627 2.3 minus sign, hyphen, dash
minus signs: FE63, 207B (super), 208B (sub), 002D, FF0D (full-width)
629 hyphens: 00AD (soft), 2010, 2011 (nonbreaking) ... Unicode HYPHEN?
630 dashes, horizontal bars: FE58 ... UnicodeCategory.DashPunctuation
632 IdentifyBy: UnicodeCategory.DashPunctuation
633 SortKey: 06 81 - 06 90 (and nonspace equivalents)
635 2.4 Arabic spacing and equivalents (64B-652, FE70-FE7F)
636 They are part of nonspacing mark, but not equal.
638 SortKey: 06 A0 - 06 A7 (and nonspace equivalents)
640 3 nonprimary characters, mixed.
642 ModifierSymbol, except for that are not in category 0 and "07" area
643 (i.e. < 128) nor those equivalents
645 NonSpacingMark which is ignorable (IsIgnorableNonSpacing())
646 // 30D, CD5-CD6, ABD, 2B9-2C5, 2C8, 2CB-2CD, 591-5C2. NonSpacingMark in
647 // 981-A3C. A4D, A70, A71, ABC ...
649 TODO: I need more insight to write table generator.
651 SortKey: 01 03 01 - 01 B6 01
653 This part of MS table design is problematic (buggy): \u0592 should
654 not be equal to \u09BC.
656 I guess, this buggy design is because Microsoft first thought that
657 there won't be more than 255 characters in this area. Or they might be
658 aware of the problem but prefer table optimization.
1) We should not mix those codes (make things sequential) and should
expand level 2 length to 2 bytes. Instead of having a direct value, we
could use an index (pointer) into a zero-terminated level 2 table.
2) Include those characters from minor cultures here.
668 If in "discriminatory mode", those tables could be still provided
669 as to be compatible to Windows.
Additionally there seem to be some bugs around Modifier letter collection.
For example, 2C6 should be a nonspacing diacritical character but it
is regarded as a primary character. The same applies to Mandarin
tone marks (2C9-2CB) (and there are plenty of such characters).
676 4 space separators and some kind of marks
678 4.1 whitespaces, paragraph separator etc.
679 UnicodeCategory.SpaceSeparator : 20, 3000, A0, 9-D, 2000-200B
681 SortKey : 07 02 - 07 18
683 4.2 some OtherSymbols: 2422-2423
685 SortKey : 07 19 - 07 1A
687 4.3 ASCII compatible marks ('!', '^', ...)
688 Non-alpha-numeric < 0x7F except for [[+-<=>']]
689 small compatibility equivalents -> itself, wide
692 FIXME: how to identify them?
693 some Punctuations: InitialQuote/FinalQuote/Open/Close/Connector
694 some OtherSymbols: 2400-2424
695 3003, 3006, 2D0, 10FB
remaining Punctuations: 9xx, 7xx
699 SortKey : 07 1B - 07 F0
5 mathematical symbols
702 InitialQuotePunctuation and FinalQuotePunctuation in ASCII
703 (not Quotation_Mark property in PropList.txt ; 22, 27)
705 byte area MathSymbol: 2B,3C,3D,3E,AB,B1,BB,D7,F7 except for AC
706 some MathSymbol (2044, 208A, 208C, 207A, 207C)
707 OtherLetter (1C0-1C2)
708 2200-22FF MathSymbol except for 221E (INF. ; regarded as a number)
710 SortKey : 08 02 - 08 F8
712 6 Arrows and Box drawings
713 09 02 .. 09 7C : 2300-237A
714 only primary differences
715 09 BC ... 09 FE : 25A0-AB, 25E7-EB, 25AC-B5, 25EC-EF, 25B6-B9,
716 25BC-C3, 25BA-25BB, 25C4-25D8, 25E6, 25DA-25E5
718 This area contains level 2 values.
719 2190- (non-codepoint order)
720 note that there are many compatibility equivalents
721 2500- except for 266F (#)
723 SortKey : 09 02 - 09 7C, 09 BC 01 03 - 09 BC 01 13,
724 09 {BD|BE|BF} 01 {03|04}, ...
725 TODO: fill the patterns
7 currency symbols and some punctuations
728 byte CurrencySymbols except for 24 ($)
729 byte OtherSymbols (A7-B6)
730 ConnectorPunctuation - 2040 (i.e. FF65, 30FB)
OtherPunct/ConnectorPunct/CurrencySymbol 2020-20AC - 20AC
732 OtherSymbol 3012-303F,3004,327F
733 MathSymbol/OtherSymbol 2600-2767 (math = 266F)
734 OtherSymbol 2440-244A, 2117
735 20AC (CurrencySymbol)
SortKey : 0A 02 - 0A FB
740 all DecimalDigitNumber, LetterNumber, non-CJK OtherNumber.
742 digits, in numeric order. We can use NET_2_0 CharUnicodeInfo.
745 SortKey : 0C 02 (9F8), 0C 03 - 0C E1 (normal numbers), 0C FF (INF.)
747 9 (E) latin letters (alphabets), mixing alphabetical symbols
748 Alphabets, A to Z, mixing alphabetical symbols. See below.
749 F8-2B8 except for (1BB-1BD and 1C0-1C3), but not sequential.
752 For diacritical orders, see level 2.
754 For 'A' it is "0E 02", for 'B' "0E 09" ... 'Z' "0E A9", ezh "0E AA".
755 0E B3 (1BE), 0E B4 (298)
757 There are CJK compatibility characters (3800-) and letterlike
758 symbols (2100-) in those A-to-Z area, ordered by character name.
760 Primary weights are sometimes culture-dependent.
761 FIXME: [0E 0D], [0E 0E], [0E 4B], [0E 75], [0E B2] are unknown
768 0B: 10D in hr|lt|lv|pl, 107 in pl
769 0C: C7 in az|tr, 10D in cs|sk, 106 in hr
775 1E: 110 (D with stroke) in hr
778 22: 18F=259 in az, E9 in is, 119 in pl, EA in vi, 1EBE-1EC7 in vi
782 26: 11F in az|tr, 123 in lv
785 2D: 267 (Heng with hook)
786 2E: 33CB in az, 33CA in tr
789 33: CD in is, 79 in lt
804 73: F1 in es, 1CC in hr
808 7D: F6 in az|hu|tr, 151 in hu, F3 in is|pl, F4 in sk|vi, 1ED0-1ED9 in vi
821 96: 17F (LATIN SMALL LONG S)
822 97: 15F in az|tr, 161 in cs|hr|lt|lv|sk|sl, 7A,179-17C in et, 15B in pl
823 98: 17E in et, 15F in ro, 15B in sl
831 A0: FA in is, 1B0,1EE8-1EF1 in vi
832 A1: FC in az|tr, 56,57 in et, FC,171 in hu, FB in vi
842 AB: DE in is, 17E in lt|lv, 17A in pl
843 AC: E6 in da|is, 1E3 in is, 17C in pl, 17E in sl
AD: 17E in cs|hr|sk, E5 in fi, F6,F8 in is, 17A in sl
848 B1: E5 in da, "aa" in da
852 10 culture dependent letters (general)
853 0F: 386-3F2 ... Greek and Coptic
854 386-3CF: [0F 02] - [0F 19] (consider primary equivalents)
855 3D0-3EF: [0F 40] - [0F 54]
856 10: 400-4E9 ... Cyrillic.
857 For 400-45F and 4B1, they are mostly UCA DUCET order.
858 After that 460-481 follows, by codepoint.
859 (490-4FF except for 4B1 and Cyrillic supplementary are unused.)
860 11: 531-586 ... Armenian.
861 Simply sorted by codepoint (handle case).
862 12: 5D0-5F2 ... Hebrew.
863 Codepoint order (handle case).
864 13: 621-6D5 plus 670 (NonSpacingMark) ... Arabic
866 They look like ordered by Arabic Presentation Form B except
867 for FE95, and considering diacritical equivalents maybe based
868 on the primary character area (621-6D5).
869 There are still some special characters: 67E,686,698,6AF ...
870 which might not have equivalent characters (I wonder how they
871 are inserted into the presentation form B map).
874 - hamza, waw, yeh (621,624,626) are special: [13 07]
875 - For all remaining letters, get primary letter name
876 and store it into dictionary. If unique, then
877 increment index by 4 from [13 0B]
879 674-6D5 : by codepoint from [13 84].
880 14: 901-963 exc. 93A-93D 950-954 ... Devanagari.
881 For <905 codepoint order, x2 from [14 04].
882 For 905-939 codepoint order, x4 from [14 0B].
883 For 93E-94D codepoint order, x2 from [14 DA].
884 15: 982-9FA ... Bengali. Actually all UnicodeCategories except for
885 NonSpacingMark, DecimalDigitNumber and OtherNumber.
886 For <9E0 simple codepoint order from [15 02].
887 For >9E0 simple codepoint order from [15 3B].
888 16: A05-A74 exc. A3C A4D A66-A71 ... Gurmukhi.
889 The same as UCA order, x4 from [16 04].
890 17: A81-AE0 exc. ABC-ABD ... Gujarati.
891 Mostly equivalent to UCA, but insert {AB3,A81-A83} before AB9,
893 18: B00-B70 ... Oriya
894 All but NonSpacingMark and DecimalDigitNumber, by codepoint.
895 19: B80-BFF ... Tamil
896 BD7 is special : [19 02].
897 B82-B93 (vowels) : x2 from [19 0A].
898 B94 (vowel AU) : [19 24]
899 For consonant order Windows has native Tamil order which is
901 http://www.nationmaster.com/encyclopedia/Tamil-alphabet
902 (The order is still different in "Grantha" order from TAM.)
903 So, we should just hold constant array for consonants.
And put them in order, x4 from [19 26].
905 BBE-BCC : SpacingCombiningMark and BC0 ... x2 from [19 82].
906 1A: C00-C61 ... Telugu.
907 C55 and C56 are ignored (C5x line and remaining part of C6x
908 line just look like ignored).
909 C60 and C61 are specially placed. C60 after C0B, C61 after C0C.
910 Except for above, by codepoint, x3 from [1A 04].
911 1B: C80-CE5 ... Kannada.
912 CD5,CD6 (and CE6-CEF: DecimalDigitNumber) are ignored.
by codepoint, x3 from [1B 04].
914 1C: D02-D40 ... Malayalam.
915 by simple codepoint from [1C 02].
916 (1D: Sinhala ... totally ignored?)
917 1E: E00-E44 ... Thai.
918 preceding vowels (E40-E44) by codepoint [1E 02 - 1E 06]
919 consonants (E01-E2A) by codepoint, x6 from [1E 07].
920 1F: E2B-E5B,E80-EDF ... Thai / Lao. (Thai breaks the category wall.)
922 remaining consonants (E2B-E2E) by codepoint, x6 from [1E 07].
923 remaining vowels (E2F-E3A) by codepoint.
924 E45,E46,E4E,E4F,E5A,E5B
926 E80-EDF by codepoint from [1F 02].
927 21: 10A0-10FF ... Georgian
928 Mostly equal to UCA order, but swap 10E3 <-> 10F3,
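Many entries above have the shape "by codepoint, xN from [category weight]". A naive reading of that shape can be sketched as follows. It assumes a gap-free codepoint range and ignores every exception listed above, and the name StepWeight is hypothetical.

```csharp
using System;

public static class StepWeight
{
	// level 1 weight = first weight + (distance from first codepoint) * step
	public static byte Level1 (int cp, int firstCp, int firstWeight, int step)
	{
		int w = firstWeight + (cp - firstCp) * step;
		// The weight must stay within a single byte.
		if (cp < firstCp || w > 0xFF)
			throw new ArgumentOutOfRangeException ("cp");
		return (byte) w;
	}
}
```

For the Thai consonants (E01-E2A, x6 from [1E 07]) this gives 07 for E01 and FD for E2A, so the whole run just fits into one category, which may be why the remaining Thai consonants continue in the next one.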
931 11 (22) japanese kana letters and symbols, not in codepoint order
933 For single character, the sortkeys look like:
934 - Katakana normal A, Half Width (FF71) : FF 02 C4 FF C4 FF 01 00
935 - Katakana normal A, Full Width (30A2) : FF C4 FF 01 00
936 - Hiragana normal A, Full Width (3042) : FF FF 01 00
938 Actually for level 4 weights, there is a different rule (see
939 "level 4" format above).
941 There is also 32D0 (normal katakana A with circle) that have
942 diacritic difference.
944 For primary weights, 'A' to 'O' are mapped to 22-26, 'Ka' to 'Ko'
945 are to 2A-2E, 'Sa' to 'So' are to 32-36 ... and follows.
946 'Nn' is special: [22 80].
948 After Kana characters, there are CJK compat characters.
949 From 22 97 01 01 01 01 00 (3349) to 22 A6 01 01 01 01 00 (333B) are
950 sorted in JIS table order (CP932.TXT). Remaining square characters
951 are maybe sorted in Alphabetic order.
953 UCA DUCET also does not apply here.
955 12 (23) bopomofo letters
956 3105-312C: simple codepoint order from [23 02].
958 13 culture dependent letters 2
959 710-72C : Estrangela (ancient Syriac).
961 711 is excluded (superscript).
962 714,716,71C,724 and 727 are "alternative" characters.
963 SortKey: [24 0B]-[24 60], by x where x is 2 for those
964 which is "alternative" defined above, otherwise 4.
967 Equals to UCA order, x2 from [24 6E].
969 (Maybe we should add remaining minor-culture characters here. Tibetan,
Limbu, Tagalog, Hanunoo, Buhid, Tagbanwa, Myanmar, Khmer, Tai-Le,
971 Mongolian, Cherokee, Canadian-Aboriginal, Ogham, Runic are ignored)
973 14 (41-45) surrogate Pt.1
975 15 (52 02-7E C8) hangul, mixing combined ones
977 It starts from 1100. After width-insensitive equivalents, those
978 syllables (from AC00) follow (until <del>AE4B</del>D7A3).
979 It follows kinda based on some formula (sometimes it looks not
980 e.g. 1117). FIXME: this area should be clarified more.
Hangul Syllables should not be filled in the table. Instead, they
can be easily computed by the following formula:
// rc is the codepoint for the input Syllable
// (p holds "(category << 8) + level1weight")
int ri = ((int) rc - 0xAC00) + 1;
ushort p = (ushort)
	((ri / 254) * 256 + (ri % 254) + 2);
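The formula can be wrapped in a small function to see why it uses 254 and "+ 2": the low byte then always stays in 02-FF, so it never collides with the reserved 00 and 01 values. This is a sketch of the formula as written here, with a range check added; the name HangulSketch is hypothetical, and the result is relative (adding the category base is outside this sketch).

```csharp
using System;

public static class HangulSketch
{
	// Relative "(category << 8) + level1weight" value for a syllable.
	public static int SyllableOffset (char rc)
	{
		if (rc < '\uAC00' || rc > '\uD7A3')
			throw new ArgumentOutOfRangeException ("rc");
		int ri = ((int) rc - 0xAC00) + 1;
		// 254 usable values per high byte; "+ 2" skips 00 and 01.
		return (ri / 254) * 256 + (ri % 254) + 2;
	}
}
```

U+AC00 maps to 0x0003, U+ACFC to 0x00FF, and the next syllable U+ACFD rolls over to 0x0102.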
991 Hangul Jamo cannot be filled in the table directly, since
U+1113 - U+1159 holds additional primary key bytes.
993 FIXME: find out how they can be computed.
994 See http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icuhtml/design/collation/ICU_collation_design.htm?rev=HEAD&content-type=text/html#Hangul_Implicit_CEs
998 9E 02-F0 B4 [3192-319F,3220-3243,3280-32B0,4E00-9FA5] : CJK mark,
999 parenthesized CJK (part), circled CJK (part), CJK ideograph.
Ordered but considering compatible characters (i.e. there is
1001 no other way than having massive mapping).
1002 F0 B5-F1 E4 [F900-FA2D]. CJK compatibility ideograph.
1004 LAMESPEC: in the latest spec CJK ends at 9F BB. Since MS table
1005 joins these two categories without any consideration, it is
1006 impossible to insert those new characters without breaking binary
1009 17 (E5 02-FE 33) PrivateUse.
1011 In fact it overlaps to CJK characters (maybe layout design failure).
1013 18 (F2 01-F2 31) surrogate Pt.2
1015 In fact it overlaps with PrivateUse (maybe layout design failure).
1017 19 (FE FF 10 02 - FE FF 29 E9) CJK extensions
1021 They should be computed: this range has to be checked anyway
1022 (the sortkey values cannot be taken directly since they need the
1023 FE FF part), and the values can be computed anyway.
1025 20 (FF FF 01 01 01 01 00) special.
1026 Japanese extender marks:
1027 3005, 3031, 3032, 309D, 309E, 30FC, 30FD, 30FE, FF70
1029 LAMESPEC: In native usage, Microsoft's understanding of Japanese
1030 3031 and 3032 is wrong. They are never used to repeat *just the
1031 previous one* character, but are usually used to repeat two or
1032 more characters. Also, 3005 is not always used to repeat exactly
1033 one character but is sometimes used to repeat two (or possibly more)
1036 Arabic shadda: FE7C (isolated), FE7D (medial)
1037 (Actually they are not extenders in Unicode PropList.txt)
1040 - by UnicodeCategory -
1042 DashPunctuation 6 (no exception)
1043 DecimalDigitNumber C (no exception)
1044 EnclosingMark 1 E (no exception)
1046 LetterNumber C (no exception)
1047 LineSeparator 7 (only 2028)
1048 ParagraphSeparator 7 (only 2029)
1050 SpaceSeparator 7 (no exception)
1053 OtherNumber C(<3192), 9E-A7 (3124<)
1055 Control 6 except for 9-D (7)
1056 FinalQuotePunctuation 7 except for BB (8)
1057 InitialQuotePunctuation 7 except for AB (8)
1058 ClosePunctuation 7 except for 232A (9)
1059 OpenPunctuation 7 except for 2329 (9)
1060 ConnectorPunctuation 7 except for FF65, 30FB, 2040 (A)
1062 OtherLetter 1, 7, 8 (1C0-1C2), C, 12-FF
1063 MathSymbol 8, 9, 6, 7, A, C
1064 OtherSymbol 7, 9, A, C, E, F, <22, 52<
1065 CurrencySymbol A except for FF69,24,FF04 (7) and 9F2,9F3 (15)
1067 LowercaseLetter E-11 except for B5 (A) and 1BD (C)
1068 TitlecaseLetter E (no exception)
1069 UppercaseLetter E,F,10,11,21 except for 1BC (C)
1070 ModifierLetter 1, 7, E, 1F, FF
1071 ModifierSymbol 1, 6, 7
1072 NonSpacingMark 1, 6, 13-1F
1073 OtherPunctuation 1, 7, A, 1F
1074 SpacingCombiningMark 1, 14-22
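As a rough illustration of how the "no exception" rows above could be used, the following sketch maps a UnicodeCategory to its sort key category byte. The class and method names are hypothetical, only rows marked "no exception" (plus the two separator rows) are included, and all other categories are reported as "needs the table".

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class CategoryDefaults
{
	// Only the rows listed above as having no exception.
	static readonly Dictionary<UnicodeCategory, byte> map =
		new Dictionary<UnicodeCategory, byte> {
			{ UnicodeCategory.DashPunctuation, 0x06 },
			{ UnicodeCategory.DecimalDigitNumber, 0x0C },
			{ UnicodeCategory.LetterNumber, 0x0C },
			{ UnicodeCategory.LineSeparator, 0x07 },
			{ UnicodeCategory.ParagraphSeparator, 0x07 },
			{ UnicodeCategory.SpaceSeparator, 0x07 },
			{ UnicodeCategory.TitlecaseLetter, 0x0E },
		};

	// Returns false when the category has exceptions or several
	// possible values, i.e. when the real table must be consulted.
	public static bool TryGet (char c, out byte category)
	{
		return map.TryGetValue (char.GetUnicodeCategory (c), out category);
	}
}
```

Such a shortcut would only be a sanity check for the generated table, not a replacement for it.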
1076 *** Culture dependent design
1078 (To verify this section, run the simple dumper code shown above
1079 with all the supported cultures.)
1081 **** primary cultures and non-primary cultures
1083 This code iterates through all cultures and saves a character dump
1084 for each of them, using the sort key dumper shown above.
	public static void Main ()
	{
		foreach (CultureInfo ci in CultureInfo.GetCultures (
			CultureTypes.AllCultures)) {
			// Run the sort key dumper for this culture and
			// capture its standard output.
			ProcessStartInfo psi = new ProcessStartInfo ();
			psi.FileName = "../allsortkey.exe";
			psi.Arguments = ci.Name;
			psi.RedirectStandardOutput = true;
			psi.UseShellExecute = false;
			Process p = new Process ();
			p.StartInfo = psi;
			p.Start ();
			string s = p.StandardOutput.ReadToEnd ();
			// Save the dump as <culture name>.txt.
			StreamWriter sw = new StreamWriter (ci.Name + ".txt", false, Encoding.UTF8);
			sw.Write (s);
			sw.Close ();
		}
	}
1105 For each sub culture (that has a parent culture), its collation
1106 mapping is identical to that of its parent, except for az-AZ-Cyrl.
1110 - zh-CHS = zh-CN = zh-SG = zh-MO : pronunciation
1111 - zh-TW = zh-HK = zh-CHT : stroke count
1116 (UCA implies that there are some cultures that sort alphabets from
1117 large to small, but as far as I can see there is no such CultureInfo.)
1119 **** Latin characters and NonSpacingMark order tailorings
1121 div : FDF2 is 24 83 01 01 01 01 00 (only 1 difference)
1122 syr : some NonSpacingMarks are totally ignorable.
1123 tt,kk,mk,az-AZ-Cyrl,uk : Cyrillic difference
1124 az,et,lt,lv,sl,tr,sv,ro,pl,no,is,hu,fi,es,da : Latin difference
1126 sk,hr,cs : Latin and NonSpacingMark differences
1130 **** CJK character order tailorings
1134 There are five different CJK orderings:
1135 default, ko(-KR), ja(-JP), zh-CHS and zh-CHT
1136 Each of them has a very different CJK mapping.
1138 Since they seem to be based on traditional encodings, we are likely to
1139 provide separate constant tables and switch depending on the culture.
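Switching tables by culture could look roughly like the sketch below. The enum and class names are invented for illustration; the grouping of the Chinese cultures follows the equivalences listed earlier in this section (pronunciation vs. stroke count), and everything else falls back to the default ordering.

```csharp
using System;

enum CjkOrdering {
	Default, Korean, Japanese, ChinesePronunciation, ChineseStrokeCount
}

static class CjkOrderingSelector
{
	// Picks which constant CJK table a culture name would use.
	public static CjkOrdering Select (string cultureName)
	{
		switch (cultureName) {
		case "zh-CHS": case "zh-CN": case "zh-SG": case "zh-MO":
			return CjkOrdering.ChinesePronunciation;
		case "zh-CHT": case "zh-TW": case "zh-HK":
			return CjkOrdering.ChineseStrokeCount;
		}
		if (cultureName == "ko" ||
		    cultureName.StartsWith ("ko-", StringComparison.Ordinal))
			return CjkOrdering.Korean;
		if (cultureName == "ja" ||
		    cultureName.StartsWith ("ja-", StringComparison.Ordinal))
			return CjkOrdering.Japanese;
		return CjkOrdering.Default;
	}
}
```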
1141 <what characters are different from the invariant culture?>
1143 ko : CJK layout difference (52 -> 80)
1144 ja,zh-CHS,zh-TW : dash (5C), CJK layout difference.
1146 Target characters are : CJK misc (3190-), Parenthesized CJK
1147 (3200-), CJK compat (3300-), CJK ideographs (4E00-),
1148 CJK compat ideograph (F900-), Half/Full width compat (FF00-)
1150 Additionally for Korean: Jamo (1100-), Hangul syllables (AC00)
1152 <how are they composed?>
1154 Japanese CJK order looks based on JIS table order. Those characters
1155 which are also in JIS table are moved to 80 xx. Those which are *not*
1156 in JIS table are left as is (9E-FE).
1158 Additionally, Windows has different order for characters below:
1159 4EDD,337B,337E,337D,337D,337C
1160 They come in front of the first CJK character.
1162 Maybe Korean CJK order respects KS C 5619. Note that Korean mixes
1163 Hangul and CJK in its order, so it is not a flat order without indexes
1164 (thus, for CJK they are not computable). Also, there are extra
1165 level2 values in the Korean CJK map.
1167 For some Chinese cultures such as zh-CHS, character order is based on
1168 pinyin. For the remaining ones such as zh-TW, it is based on stroke count.
1170 CLDR of unicode.org has reference ordering of those characters, so
1171 our collation table extracts the sorting order from it.
1172 http://www.unicode.org/cldr/
1174 **** Accent evaluation order
1176 With French cultures, diacritical marks are counted in reverse order.
1177 French ordering does not affect some diacritics (the Japanese
1178 voice mark is not affected - FIXME: I doubt it, because the algorithm
1179 does not seem to allow it).
1181 Some other cultures might also have a different order, but none is obvious.
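The effect of counting diacritics in reverse can be sketched as follows. The weight values and names are hypothetical; the only point is that the level 2 weights of the two strings are scanned from the end instead of from the start, which flips the result for pairs such as "côte" and "coté".

```csharp
using System;

static class AccentOrder
{
	// Compares two arrays of level 2 (diacritic) weights, one entry
	// per base character. With backward == true the arrays are
	// scanned from the last entry, as in French ordering.
	public static int Compare (byte [] a, byte [] b, bool backward)
	{
		int n = Math.Min (a.Length, b.Length);
		for (int k = 0; k < n; k++) {
			int ia = backward ? a.Length - 1 - k : k;
			int ib = backward ? b.Length - 1 - k : k;
			if (a [ia] != b [ib])
				return a [ia] < b [ib] ? -1 : 1;
		}
		return a.Length.CompareTo (b.Length);
	}
}
```

With made-up weights 2 for acute and 3 for circumflex, "côte" is `{0, 3, 0, 0}` and "coté" is `{0, 0, 0, 2}`: the forward scan puts "coté" first, the backward scan puts "côte" first.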
1184 ** Mono implementation plans
1188 CompareInfo contains many overloaded methods that are just for
1189 convenience. This class contains almost only the required members.
1191 This class also provides access to tailoring information which is
1192 culture instance dependent:
1195 - contractions/expansions - returns contraction or expansion
1196 - diacritical remapping
1197 - CJK custom mapping
1199 For data area, see CollationDataStructures.txt for now.
1201 *** UnicodeTable (for now MSCompatUnicodeTable)
1203 Provides access to character information related to
1204 the collation element table (of our own).
1205 FIXME: I want to fix some bugs in the Windows collation table, especially
1206 to not ignore some characters, but it requires table modification
1207 which results in further memory allocation. Maybe it could be done
1208 as a patch for the runtime (or classlib) sources.
1210 - ignorable, ignorable nonspace, normalize width, normalize kanatype
1211 - level 4 sortkey provision method(s)
1213 **** character comparison
1215 Since a composite character is likely to *not have* an equivalent
1216 codepoint, character comparison cannot be done just by comparing
1217 the "resulting char" value.
1218 On the other hand, since a composite character is likely to *have* an
1219 equivalent codepoint, character comparison cannot be done just by
1220 comparing the "source char" value either.
1222 ***** future optimizations
1224 From where the codepoints differ, for each string it adjusts the
1225 position so that it represents exactly one character element. That is,
1226 it finds the primary character as the start of the range and the last
1227 nonprimary character as the end of the range.
1229 Once Compare() has adjusted the character location to a valid
1230 comparison position, further comparison is done as the usual comparison,
1231 i.e. sortkey comparison considering comparisonLevel.
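The adjustment described above could be sketched like this. It is not the planned implementation: the names are hypothetical, and IsPrimary is a stand-in that treats every non-mark character as primary, whereas the real check would consult the collation table.

```csharp
using System;
using System.Globalization;

static class ElementRange
{
	// Stand-in for the table-based check: combining marks are
	// treated as nonprimary, everything else as primary.
	static bool IsPrimary (char c)
	{
		UnicodeCategory uc = char.GetUnicodeCategory (c);
		return uc != UnicodeCategory.NonSpacingMark
			&& uc != UnicodeCategory.SpacingCombiningMark
			&& uc != UnicodeCategory.EnclosingMark;
	}

	// Index of the first differing char, or -1 if none is found
	// within the common length.
	public static int FirstDifference (string a, string b)
	{
		int n = Math.Min (a.Length, b.Length);
		for (int i = 0; i < n; i++)
			if (a [i] != b [i])
				return i;
		return -1;
	}

	// Widens pos to one character element: back to the primary
	// character, forward to the last nonprimary character.
	public static void Adjust (string s, int pos, out int start, out int end)
	{
		start = pos;
		while (start > 0 && !IsPrimary (s [start]))
			start--;
		end = pos;
		while (end + 1 < s.Length && !IsPrimary (s [end + 1]))
			end++;
	}
}
```

For "re\u0301sume" and "re\u0300sume" the first difference is at index 2 (the combining accent), and the adjusted range is 1..2, i.e. "e" plus its accent.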
1233 **** Characters in the table / characters computed
1235 Currently I plan not to store the following characters in the table
1236 but to compute them on demand:
1241 **** CJK Unified Ideographs
1243 For CJK unified ideographs, I had to make those culture-dependent
1244 tables in memory. Since they came from some classical encodings, they
1245 are not computed. Thus, they are in a separate table.
1247 **** Level 4: Kana type
1249 The table does not contain level 4 (kanatype) properties for
1250 all the characters. They can be simply computed.
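A kana type check can indeed be derived from codepoint ranges alone, as the sketch below suggests. The names are hypothetical and the ranges are the Unicode letter ranges of the Hiragana, Katakana and Halfwidth Katakana blocks; which level 4 weight each type maps to is not covered here.

```csharp
static class KanaType
{
	// Hiragana letters U+3041..U+3096.
	public static bool IsHiragana (char c)
	{
		return '\u3041' <= c && c <= '\u3096';
	}

	// Katakana letters U+30A1..U+30FA and halfwidth katakana
	// U+FF66..U+FF9D.
	public static bool IsKatakana (char c)
	{
		return ('\u30A1' <= c && c <= '\u30FA')
			|| ('\uFF66' <= c && c <= '\uFF9D');
	}
}
```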
1252 **** Level 3: Case/Width properties
1254 Case properties will be stored as a byte array, covering limited
1255 codepoint areas (cp < 3120 || FE00 < cp).
1257 For Hangul characters, it will be computed by codepoint areas.
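One way to index such a byte array without wasting space on the skipped area is to fold the two codepoint ranges into one contiguous index. This is only a sketch of that idea; the class name and the exact layout are assumptions, not the actual table format.

```csharp
static class CaseTableIndex
{
	const int LowEnd = 0x3120;    // cp < 3120 is stored
	const int HighStart = 0xFE01; // FE00 < cp is stored

	// Number of entries the folded table would need.
	public const int Size = LowEnd + (0xFFFF - HighStart + 1);

	// Returns the array index for cp, or -1 when cp falls in the
	// area that is not stored (and is computed or ignored instead).
	public static int ToIndex (int cp)
	{
		if (cp < 0 || cp > 0xFFFF)
			return -1;
		if (cp < LowEnd)
			return cp;
		if (cp >= HighStart)
			return LowEnd + (cp - HighStart);
		return -1;
	}
}
```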
1259 **** Level 2: Diacritical properties
1261 The table will hold one byte per character. If we provide
1262 a non-buggy mode (Windows is buggy here by design; it just sums
1263 secondary weight values up), the values will come from UCA and
1264 a non-blocking check will be introduced.
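To see why summing is lossy, consider the sketch below. The weight values and names are made up, and wrapping the sum to one byte is an assumption; the point is only that different mark sequences can collapse to the same level 2 value once they are added up.

```csharp
static class DiacriticWeight
{
	// Windows-like behavior as described above: the secondary
	// weights of all marks on a base character are summed.
	// Truncation to a single byte is an assumption of this sketch.
	public static byte Sum (byte [] marks)
	{
		int total = 0;
		foreach (byte w in marks)
			total += w;
		return (byte) (total & 0xFF);
	}
}
```

With hypothetical weights, marks `{2, 3}` and a single mark `{5}` both sum to 5, and `{2, 3}` and `{3, 2}` cannot be told apart either.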
1266 Note that Japanese voice marks are considered at level 2 but no need to
1270 ** Reference materials
1272 Developing International Software for Windows 95 and Windows NT
1273 Appendix D Sort Order for Selected Languages
1274 http://www.microsoft.com/globaldev/dis_v1/disv1.asp?DID=dis33d&File=S24BF.asp
1276 UTR#10 Unicode Collation Algorithm (It is still informative)
1277 http://www.unicode.org/reports/tr10/
1279 UAX#15 Unicode Normalization
1280 http://www.unicode.org/reports/tr15/
1281 especially its canonical/compatibility equivalent characters might
1282 be informative to get those equivalent characters.
1284 To know which character can be expanded, Unicode Character Database
1285 (UCD) is informative (it's informative but not normative to us)
1286 http://www.unicode.org/Public/UNIDATA/UCD.html
1288 Decent char-by-char explanation is available here:
1289 http://www.fileformat.info/info/unicode/
1291 Wine uses the UCA default element table, but has Windows-like character
1292 filtering support in its LCMapString implementation:
1293 http://cvs.winehq.com/cvsweb/wine/dlls/kernel/locale.c
1294 http://cvs.winehq.com/cvsweb/wine/libs/unicode/sortkey.c
1296 Mimer has decent materials on culture specific collations:
1297 http://developer.mimer.com/collations/
1299 This is written in Japanese, but awesome analysis on MS Access
1301 http://www.asahi-net.or.jp/~ez3k-msym/comp/acccoll.htm