We are going to implement Windows-like collation, separate from ICU (which
implements UCA, the Unicode Collation Algorithm).
11 * create collation element table(s)
12 - infer how Windows collation element table is composed
14 - write table generator source(s)
15 : mostly implemented. Need to fix nearly 400 mappings.
16 They are mainly 1) IPA extensions (U+250-U+300),
17 2) Latin extensions (U+1E00-U+1F00), 3) Letterlike
18 symbols (U+2100-U+2140), 4) some Cyrillic letters
19 (U+460-U+500), and 5) some Hangul characters.
20 - culture-specific sortkey data
21 : They are defined in mono-tailoring-source.txt.
All single-sortkey remappings in all cultures are filled.
Contractions are not fully checked yet (they should be filled
from UCA tailorings via create-tailorings.exe).
27 ** How to implement CompareInfo members
GetSortKey()
Compute the sort key for every character element into byte[].
Compare()
Find the first difference and compare it.
"Larger/smaller" matters (beyond "different").
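As a rough illustration of "find the first difference and compare it", here is a minimal sketch that orders two already-computed sort keys. The class and method names are hypothetical, and it only handles finished byte[] keys, not the incremental comparison the collator itself performs.

```csharp
using System;

public static class SortKeyCompareSketch
{
    // Returns a negative value, zero, or a positive value.
    // The first differing byte decides the order; if one key is a
    // prefix of the other, the shorter key sorts first.
    public static int CompareKeys (byte [] a, byte [] b)
    {
        int len = Math.Min (a.Length, b.Length);
        for (int i = 0; i < len; i++) {
            if (a [i] != b [i])
                return a [i] < b [i] ? -1 : 1;
        }
        return a.Length.CompareTo (b.Length);
    }
}
```

The sign of the result matters, not just whether it is zero, which is why the differing bytes are compared rather than merely detected.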
IsPrefix()
It calls CompareInternal(), which also reports whether the target
is fully consumed, so it simply returns true when the target is
reported as fully consumed.
IsSuffix()
It tries CompareInternal() to compare source and target at
the end, where source varies from the minimum tail to the
42 IndexOf(), LastIndexOf()
43 For character search, it finds the matching character element
44 to the end (or start) of the string to find.
For string search, it invokes one of the private IndexOf() (or
LastIndexOf()) overloads, passing the first character element
of the target, and if found, tests whether the sequence is a valid
start point, using IsPrefix() (or IsSuffix()).
For Compare() and IsPrefix(), it uses forward iteration, which moves
forward and does not stop until it either finds the next "primary" character
or reaches the end of the string, checking with IsSafe(char).
56 For IndexOf(char) and LastIndexOf(char), there is no special
optimization (since the codepoints usually do not match, while they
often match as a natural collation), but it omits extraneous sortkey
61 IsSuffix() reuses Compare() and returns false if it does not consume
the target string more than 3 times. 3 is a kind of magic number that
represents the longest expansion.
65 IndexOf(string) is implemented as a combination of IndexOf(char) and
68 LastIndexOf(string) is implemented as a combination of
69 LastIndexOf(char) and IsPrefix().
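The "character search plus prefix test" combination can be sketched with the public CompareInfo API. This is only an approximation under stated assumptions: IndexOfString is a hypothetical helper, it uses the first UTF-16 char of the target where the notes speak of the first character element, and it does not account for ignorable characters at the start of the target.

```csharp
using System;
using System.Globalization;

public static class IndexOfSketch
{
    // Hypothetical helper: candidates come from IndexOf(char), and each
    // candidate is verified with IsPrefix() on the rest of the source.
    public static int IndexOfString (CompareInfo ci, string source,
        string target, CompareOptions options)
    {
        if (target.Length == 0)
            return 0;
        int start = 0;
        while (start < source.Length) {
            int pos = ci.IndexOf (source, target [0], start, options);
            if (pos < 0)
                return -1; // no more candidates
            if (ci.IsPrefix (source.Substring (pos), target, options))
                return pos; // candidate is a valid start point
            start = pos + 1;
        }
        return -1;
    }
}
```

A LastIndexOf() counterpart would walk the candidates backwards in the same way.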
Porting them to C code is a possible alternative approach, but judging from
the Compare() optimization experience, the current code is quick enough.
74 ** How to support CompareOptions
There are two kinds of "ignorance": strippers' ignorance and
normalizers' ignorance.
79 The strippers will "filter characters out" and there will be no
80 corresponding character elements in SortKey binaries.
Normalizers, on the other hand, fold variants of a character together,
while the difference between unrelated characters is still in effect.
For example, with IgnoreKanaType Hiragana "A" and Katakana "A" are
not distinguished, but Hiragana "A" and Hiragana "I" are.
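To make the normalizer idea concrete, here is a minimal sketch of a kana type fold. It is an assumption for illustration only: it maps the Katakana range U+30A1..U+30F6 onto Hiragana by a fixed offset and does not try to reproduce what the real ToKanaTypeInsensitive() covers (half-width forms, for instance, are left alone).

```csharp
public static class KanaNormalizerSketch
{
    // Maps Katakana U+30A1..U+30F6 onto the Hiragana block by
    // subtracting 0x60; every other character is returned unchanged.
    public static char ToKanaTypeInsensitive (char c)
    {
        if (c >= '\u30A1' && c <= '\u30F6')
            return (char) (c - 0x60);
        return c;
    }
}
```

After this fold, Katakana "A" (U+30A2) and Hiragana "A" (U+3042) become the same character, while Hiragana "I" (U+3044) stays distinct.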
87 Actually, even without any IgnoreXXX flags (i.e. "None"), there are
88 many characters that are ignored ("completely ignorable").
90 Except for LCID 101/1125(div), '\ufdf2' is completely ignorable.
91 This rule even applies to CompareOptions.None.
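A quick way to probe whether a character is completely ignorable is to compare it with the empty string. The helper below is a hypothetical sketch; the answer depends on the platform's collation data, so it is a probe rather than a definition.

```csharp
using System.Globalization;

public static class IgnorableSketch
{
    // A character is treated as completely ignorable when it compares
    // equal to the empty string under the given culture and options.
    public static bool IsCompletelyIgnorable (CompareInfo ci, char c,
        CompareOptions options)
    {
        return ci.Compare (new string (c, 1), "", options) == 0;
    }
}
```

Passing CompareOptions.None checks the case described here, where a character is ignored even without any IgnoreXXX flag.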
96 Maybe culture-dependent TextInfo.ToLower() could be used.
98 Unlike ICU (specialCaseToLower()), even with tr-TR(LCID 31)
99 and IgnoreCase, I\u0307 is not regarded as equal to i.
102 ToKanaTypeInsensitive(). Note that this does not cover the
103 whole "level 4" differences described later.
106 ToWidthInsensitive(), which is likely to be culture
107 independent. See also "Notes".
109 IgnoreNonSpace (see also Strippers; this flag works in both sides)
110 For some cultures this logic is still incomplete. All culture-
111 dependent collators must handle valid "replacement" of "one or
112 more characters" which might be related to specific
For example, there is a Japanese text sorting rule that
nevertheless applies to InvariantCulture. Concretely,
116 "\u3042\u30FC" is equivalent to "\u3042\u3042" only when
117 IgnoreNonSpace is specified.
I'll take those items from CLDR (those items which have
120 <reset before="..." />), case by case though.
124 I already wrote all the required strippers which should be MS
125 compatible (at least with .NET 1.1 invariant culture).
128 IsIgnorableNonSpacing().
129 Some Diacritic characters are covered by this flag.
131 There are some culture *dependent* characters:
132 LCID 90/1114(syr) : 64b, 652, 670
136 UnicodeCategory does not work here.
138 There are some culture *dependent* characters:
139 LCID 17/1041(ja) : 2015
140 LCID 90/1114(syr) : 64b, 652
144 See "sort order categories" section.
148 First to note: we won't use collation element table from unicode.org.
There are many differences between ICU/UCA and Windows even though they
look so similar: collation keys at different levels, culture-dependent
composition, etc. Historically, Windows collation was
designed before UCA was specified, so basically Windows is obsolete
156 - Logic: Unlike UCA it has no concept of "blocked" combining marks,
157 and combining marks are never considered as an independent character
158 (thus combining in Windows is buggy).
- Data: Windows is based on an old Unicode standard (even older than 1.1).
160 It ignores minor cultures. Character property values differ as well
161 as those from the default Unicode collation element table (DUCET).
162 In a few cultures Windows collation is close to the native language
163 (e.g. Tamil, while it does not conform to TAM).
IgnoreWidth/IgnoreSymbols are processed after Kana voice mark
decomposition (something like NFD, but not equivalent. Example: \u304C
is completely equivalent to \u304B\u309B, which is not part of NFKD).
<del>This means, if there are combined Kana characters, they will be
first decomposed and then compared.</del> It scarcely matters since
there is special weight data for Japanese.
172 *** Microsoft design problem
The Microsoft implementation seems to have a serious problem: many
characters that are used in specific cultures, such as
Myanmar, Mongolian, Cherokee, Ethiopic, Tagalog, Khmer, are regarded as
"completely ignorable".
179 I tagged many LAMESPEC items in the implementation (both in collator
180 and table generator).
183 ** MS collation design inference
187 Each character has several "weights". It is a common concept between
192 - level 1: primary difference
193 The first byte of level 1 means the category of the character.
194 - level 2: diacritic difference, including Japanese voice mark countup
195 - level 3: case/width sensitivity, and Hangul properties
196 - level 4: kana weight (all of them have primary category 22, at least
198 - level 5: shift weight (apostrophe, hyphens etc.)
Note that these levels do not exactly match IgnoreXXX flags. Thus
201 it is not OK that we omit some levels of sortkey values in reflection
String comparison is done from level 1 to 5. The comparison does not
stop until it either finds a primary difference or reaches
the end (in which case the upper level differences are returned).
For example, "e" is smaller than "E", but "eB" is bigger than "EA".
If the collator simply returned the case difference at the first 'e' and 'E',
"eB" would incorrectly be smaller than "EA".
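The "eB" versus "EA" example can be reproduced with a toy two-level comparer. Everything here is an assumption made for illustration: the weights are invented (primary weight is the upper-cased char, lowercase gets the smaller case weight), and only two levels are modelled instead of five.

```csharp
using System;

public static class LevelCompareSketch
{
    // Hypothetical weights: letters share a primary weight regardless
    // of case, and lowercase has the smaller case weight.
    static int Primary (char c) { return char.ToUpperInvariant (c); }
    static int CaseWeight (char c) { return char.IsUpper (c) ? 1 : 0; }

    public static int Compare (string a, string b)
    {
        int lower = 0; // first lower-level difference seen so far
        int len = Math.Min (a.Length, b.Length);
        for (int i = 0; i < len; i++) {
            int p = Primary (a [i]).CompareTo (Primary (b [i]));
            if (p != 0)
                return p; // a primary difference ends the comparison
            if (lower == 0)
                lower = CaseWeight (a [i]).CompareTo (CaseWeight (b [i]));
        }
        if (a.Length != b.Length)
            return a.Length.CompareTo (b.Length);
        return lower; // no primary difference: fall back to case
    }
}
```

The case difference found at the first character is only remembered, not returned, so the later primary difference between 'B' and 'A' wins.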
212 **** level 5: shift weight by StringSort
There are some characters that are treated specially, namely
apostrophe and hyphens. The sortkeys for them are put after level 4
(thus here I write them as "level 5"). They have a different sort key
format. See immediately below. There are no level 5 characters when
StringSort is specified.
222 00 means the end of sort key.
223 01 means the end of the level.
224 02-FF means the value.
225 Actually '2' could be cut when all the following values are
226 also '2' (i.e. the sort key binary won't contain extraneous '2').
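Given these markers, a sort key can be cut into its levels with a simple scan. The sketch below assumes that exactly four 01 separators precede the level 5 part and that everything up to the terminating 00 after them belongs to level 5; the names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class SortKeyLevelsSketch
{
    // Splits a sort key into its levels. 01 ends each of the first four
    // levels, 00 ends the key, and whatever lies between the fourth 01
    // and the final 00 is returned as the last element.
    public static List<byte []> SplitLevels (byte [] key)
    {
        var levels = new List<byte []> ();
        var current = new List<byte> ();
        foreach (byte b in key) {
            if (b == 0)
                break; // end of sort key
            if (b == 1 && levels.Count < 4) {
                levels.Add (current.ToArray ()); // end of a level
                current.Clear ();
            } else {
                current.Add (b);
            }
        }
        levels.Add (current.ToArray ());
        return levels;
    }
}
```

For a key such as 0E 02 01 01 01 01 00 this yields a two-byte first level and four empty ones.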
228 Every level has different key layout.
232 It looks like all level 2 keys are just accumulated, however without
233 considering overflow. It sometimes makes sense (e.g. diaeresis and
234 acute) but it causes many conflicts (e.g. "A\u0308\u0301" and "\u1EA6"
235 are incorrectly regarded as equal).
Anyway, since the Japanese voice mark has a level 2 value of 1, the result
just looks like the count of voice marks.
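The accumulation behaviour can be modelled as a plain byte sum. This is a sketch under the assumption that the weights are simply added in one byte; the weights used in the usage note are made up and are not taken from the actual table.

```csharp
public static class Level2Sketch
{
    // Adds up the level 2 weights of a base character and its
    // diacritics in a single byte. unchecked makes the wrap-around
    // explicit: nothing detects the overflow.
    public static byte Accumulate (params byte [] weights)
    {
        byte sum = 0;
        foreach (byte w in weights)
            sum = unchecked ((byte) (sum + w));
        return sum;
    }
}
```

With hypothetical weights, Accumulate (0x10, 0x05) and Accumulate (0x15) give the same value, which is the kind of conflict described above, and Accumulate (0xF0, 0x20) silently wraps to 0x10.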
242 The actual values are + 2 (e.g. for Hangul Normal Jamo, the value is 4)
245 - 2: Jongseong (11A8-11F9)
246 - 4: Half width? (FFA0-FFDC) and Compatibility Jamo? (3165-318E)
247 - 5: Compatibility Jamo (3130-3164)?
248 TODO: Learn about Korean characters.
251 - 4 circled inverse (2776-277F)
252 - 8 circled sans serif (2780-2789)
253 - C circled inverse && sans serif (278A-2793)
254 - 47 roman (2160-2182)
257 - 2 Isolated form in presentation form B in FE80-FE8D
258 - 4 Alef/Bet/Gimel/Dalet (2135-2138)
259 - 8 Final form in presentation form B in FE82-FEF2
260 - 18 Medial form in presentation form B in FE8C-FEF4
261 Grep "ISOLATED", "FINAL" or "MEDIAL" on UnicodeData.txt
262 (and filter by codepoints).
263 or alternatively, see DerivedDecompositionType.txt.
264 - 22 6A9 (TODO: what is it?)
265 - 28 6AA (TODO: what is it?)
268 - 1 Fullwidth. UnicodeData.txt has <full>.
269 - 2 Subscript. UnicodeData.txt has <sub>.
270 - 8 Small capital, 03C2 (TODO: why?),
271 2104, 212B(flag=1A) (TODO: why?)
272 grep "SMALL CAPITAL" against UnicodeData.txt.
273 - C only FE42. TODO: what is this?
274 - E Superscripts. UnicodeData.txt has <super>.
276 DerivedCoreProperties.txt has Uppercase property.
278 Note that simple 02 (value is 00) could be omitted.
280 Summary: at least 7 bits are required as to represent a table -
281 smallCapital, uppercase, normalization forms (2 bits:full/sub/super),
282 arabic forms (2 bits:isolated/medial/final)
This sortkey data is collected only for Japanese (category 22)
There are 3 sections, each of which ends with FF. Each of them
represents the values character by character:
291 - small letter type (kogaki moji); C4 (small) or E4 (normal)
292 - category middle section:
293 two subsections separated by 0x02
296 or 4 (voice mark - \u309D,\u309E,\u30FD,\u30FE, \uFF70)
297 or 5 (dash mark - \u30FC)
298 - kana type; C4 (katakana) or E4 (hiragana)
299 - width; 2 (normal) or C5 (full) or C4 (half)
LAMESPEC: those characters with value '4' in the middle section differ
in level 2 wrt voice marks, but do not differentiate kana types
(bug). It is ignored when IgnoreNonSpace applies.
307 UPDATED: I noticed offsetL does not exist, so removed it from here.
309 [offsetM + 0x80]? [const 3 + (offsetS + 1) * 4] [category] [level1]
where "offsetM" and "offsetS" represent the offset in the input
312 string. "offsetM" is always larger than 0x80.
313 LAMESPEC: This design results in a buggy overflow.
    byte [] data = new CultureInfo ("").CompareInfo.GetSortKey (s).KeyData;
    int idx = 0;
    // skip levels 1 to 4; each of them ends with 01
    for (int i = 0; i < 4; i++, idx++)
        for (; data [idx] != 1; idx++)
            ;
    // dump the remaining bytes (the level 5 part)
    for (; idx < data.Length; idx++)
        Console.Write ("{0:X02} ", data [idx]);
    Console.WriteLine ();
326 inputs (s) and results:
328 80 07 06 82 80 2F 06 82 00 // '-' + new string ('A', 10) + '-'
329 80 07 06 82 81 97 06 82 00 // (100)
330 80 07 06 82 8F A7 06 82 00 // (1000)
331 80 07 06 82 9C 47 06 82 00 // (10000)
332 80 07 06 82 9A 87 06 82 00 // (100000)
333 80 07 06 82 89 07 06 82 00 // (1000000)
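The actual offset is 64 * offsetM + offsetS, modulo 8192: for example,
8F A7 gives 64 * 15 + 40 = 1000, and 9C 47 gives 64 * 28 + 16 = 1808,
which is 10000 - 8192.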
337 (const '3' may actually vary but no idea.
338 At least 00, 01 and 02 are not acceptable since they are reserved.
339 02 is not reserved by definition above, but the key-size optimizer
340 uses it as a special mark, as mentioned above.)
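The two leading bytes of a level 5 entry can be decoded by inverting the layout quoted above. This is a minimal sketch assuming the layout [offsetM + 0x80] [3 + (offsetS + 1) * 4] holds as written and that the const is 3; DecodeOffset is a hypothetical name.

```csharp
public static class ShiftWeightSketch
{
    // Decodes the two leading bytes of a level 5 entry, following the
    // layout [offsetM + 0x80] [3 + (offsetS + 1) * 4].
    public static void DecodeOffset (byte first, byte second,
        out int offsetM, out int offsetS)
    {
        offsetM = first - 0x80;
        offsetS = (second - 3) / 4 - 1;
    }
}
```

For the second entry of the first sample (80 2F) this gives offsetM = 0 and offsetS = 10, and for the (1000) sample (8F A7) it gives offsetM = 15 and offsetS = 40.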
344 Here is the simple sortkey dumper:
    using System;
    using System.Globalization;

    public class SortKeyDumper
    {
        public static void Main (string [] args)
        {
            CultureInfo culture = args.Length > 0 ?
                new CultureInfo (args [0]) :
                CultureInfo.InvariantCulture;
            CompareInfo ci = culture.CompareInfo;
            for (int i = 0; i < char.MaxValue; i++) {
                string s = new string ((char) i, 1);
                // skip completely ignorable characters
                if (ci.Compare (s, "") == 0)
                    continue;
                byte [] data = ci.GetSortKey (s).KeyData;
                foreach (byte b in data)
                    Console.Write ("{0:X02} ", b);
                // '!' marks keys whose third byte is not a level separator
                Console.WriteLine (" : {0:X}, {1} {2}",
                    i,
                    Char.GetUnicodeCategory ((char) i),
                    data [2] != 1 ? '!' : ' ');
            }
        }
    }
368 *** multiple character mappings
Some sequences of characters are considered as a "composite" that is
to be composed either as another character or another sequence of
characters. Such a "composite" might not have a corresponding equivalent
character in the sortkey.
374 Similarly, some single characters are expanded to a sequence of
377 **** diacritic characters
379 Except for those shift-weight characters, there are only
380 diacritical (or other kinds of nonspacing) characters that don't
381 have primary weights.
383 Diacritics are not regarded as a base character when placed after
384 (maybe some kind of) letters.
386 The behavior is diacritic character dependent. For example, Japanese
387 combination of a Kana character and a voice mark is compulsory (the
388 resulting sort key is regarded as identical to the corresponding
389 single character. Try \u304B\u309B with \u304C. It is invariant).
391 In French cultures, diacritic orderings are checked from right to left.
393 **** Composite character processing
395 There are some sequences of characters that are treated as another
396 character or another sequence of characters.
398 By default, there is no composite form.
399 http://www.microsoft.com/globaldev/dis_v1/disv1.asp?DID=dis33d&File=S24C2.asp
400 (Note that composite is different from expansion.)
Note that composite characters are likely to not have equivalent
405 **** Expanded character processing
407 Some characters are expanded to two or more characters:
409 C6 (AE), E6 (ae), 1F1-1F3 (dz), 1C4-1C6 (Dz), FB00-FB06 (ff, fi),
410 132-133 (IJ), 1C7-1C9 (LJ), 1CA-1CC (NJ), 152-153 (OE),
411 DF (ss), FB06 (st), FB05 (\u017Ft), FE, DE, 5F0-5F2,
413 (CJK extension is not really expanded)
415 They don't match with any of Unicode normalization.
417 Some alphabetic cultures have different mappings, but mostly small
418 (at least da-DK, lt-LT, fr-FR, es-ES have tiny differences).
420 Invariant culture also puts Czech unique character \u0161 between s
421 and t, unlike described here:
422 http://www.microsoft.com/globaldev/dis_v1/disv1.asp?DID=dis33d&File=S24C0.asp
424 *** default sort key table
428 When CompareOptions.StringSort is specified, then it modifies
429 characters in category 2 from "1 1 1 1 80 07 06 xx" to
430 "06 xx yy zz" and some characters become case sensitive.
For details, see the "level 5" description above.
434 To handle them simply, they are laid out as "category 0x01" (which
435 never happens in the actual sortkeys) for those shift-weight ones
There seem to be no further differences between StringSort and None.
443 -0A: Korean parenthesized numbers (3200-321C)
444 -0C: Korean circled numbers (3260-327B)
446 -03: Japanese voice mark
448 <primary category 13 : Arabic>
449 -08: 627-648 (basic Abjad letters)
451 -05: waw with hamza (624)
452 -07: yeh with hamza (626. ignore Presentation Form A area)
453 -0A: alef with hamza above (623)
454 -0A: alef with hamza below (625)
456 <primary category 0E : diacritics>
457 Characters in non "0E" category are out of scope.
458 They can be grepped in UnicodeData.txt.
466 Note that 1C4-1C6 are covered but they are also expanded.
467 -15: breve (cyrillic are also covered? at least 4C1/4C2 are.)
468 -16: dialytika and tonos (category 0F though)
471 -1A: ring above | 212B
472 -1B: ogonek ("WITH OGONEK;")
473 -1C: cedilla (WITH CEDILLA;")
474 -1D: double acute | acute and dot above
475 -1E: stroke, except for 0E[1F] and cp{19B, 1BE} |
476 circumflex and acute | 18B,18C,19A,289
(i.e. they are not a one-to-one mapping: not every
"stroke" is mapped to 1E, nor is every 1E mapped to
480 -1F: diaeresis and acute | with circumflex and grave | l slash
481 beware "symbol slash"
482 -20: diaeresis and grave | 19B,19F
483 -21: breve and acute | D8,F8
484 -22: caron and dot above | breve and grave
485 -23: macron and acute
486 -24: macron and grave
487 -25: diaeresis and caron | dot above and macron | tilde and acute
488 -26: ring above and acute
489 -28: diaeresis and macron | cedilla and acute |
491 -29: circumflex and tilde
492 -2A: tilde and diaeresis
493 -2B: stroke and acute
495 -2F: cedilla and breve
496 -30: ogonek and macron
497 -43: hook, except for cp{192,1B2,25A,25D,27B,28B,2B1,2B5} |
498 left hook | with hook above except for cp{1EF6,1EF7} |
500 -44: double grave | 1EF6,1EF7
502 -48: preceded by apostrophe (actually only 149)
504 -55: line below | circumflex and hook above
505 -57: palatal hook (actually only 1AB)
506 -58: dot below except for cp{1EA0,1EA1}
507 -59: "retroflex" (without "WITH") | diaeresis below | 1EA0,1EA1
508 -5A: ring below | 1E76,1E77
509 -60: circumflex below except for cp{1E76,1E77} | horn and acute
510 -61: breve below | horn and grave
511 -63: tilde below | 2125
512 -68: D0,F0,182,183 | dot below and dot above | topbar
513 -69: right half ring | horn and tilde
514 -6A: circumflex and dot below
515 -6D: breve and dot below
516 -6E: dot below and macron
517 -95: horn and hook above
520 (for 01-0D and 7B-8A, they are not related to diacritics.)
522 <category BlahBlahNumbers from 0100 to 1000>
523 -38: Arabic-Indic numbers (660-669)
524 -39: extended Arabic-Indic numbers (6F0-6F9)
525 -3A: Devanagari numbers (966-96F)
526 -3B: Bengali numbers (9E6-9EF)
527 -3C: Bengali currency enumerators (9F4-9F9)
528 -3D: Gurmukhi numbers (A66-A6F)
-3E: Gujarati numbers (AE6-AEF)
530 -3F: Oriya digit numbers (B66-B6F)
531 -40: Tamil numbers (BE7-BF2)
532 -41: Telugu numbers (C66-C6F)
533 -42: Kannada numbers (CE6-CEF)
-43: Malayalam numbers (D66-D6F)
535 -44: Thai numbers (E50-E59)
536 -45: Lao numbers (ED0-ED9)
537 <miscellaneous numbers>
538 -47: Roman numbers (2160-2182)
539 -4E: Hangchou numbers (3021-3029)
-E0[64]: 2107 (Euler)
542 -E0[87]: some Tone letters (TONE TWO / TONE SIX)
543 -EE: Circled letter-or-digits and katakanas
544 CIRCLED {DIGIT|NUMBER|LATIN|KATAKANA}
545 numbers (2460-2473,2776-2793,24EA)
548 -F3: Parenthesized enumerations
551 PARENTHESIZED {DIGIT|NUMBER|LATIN}
552 -F4: Numbers with dot (2488-249B)
553 {DIGIT|NUMBER} * FULL STOP
556 -258,25C-25E,285,286,29A,297 -> 0E[80-86,88]
557 -27F,2B3-2B6 -> 0E 8A[80-84]
562 -20D0-20E1 -> 01[DD-F0]
563 -483-486 -> 01[94-97]
564 -559,55A -> 01[98,99]
567 -346-348,2BE-2C5,2CE-2CF -> 01[74-7F]
568 -2D1-2D3,2DE,2E4-2E9 -> 01[81-8A]
570 -342,343 -> 01[8D,8E]
-700-780 01[8D-AF]. Maybe there is some kind of traditional
order in Estrangela, but for now I am not sure.
576 -740-742 -> 01[8D-8F]
577 -747,748,732,735,738,739,73C,73F,743-746,730 -> 01[90,91,94-9F]
578 -731,733,734,736,737,73A,73B,73D,73E,749,74A,7A6-7A9
581 -7AA-7B0 -> 01[B0-B6]
583 -591-5C2 except for 5BA,5BE -> 01[03-33] in order
585 No further patterns for >= 80
587 TODO: Below are not done yet:
588 - x < 0x80 in non-"0E" part
589 - 03 <= x <= 0D in "0E" part
590 - 7B <= x <= 7F in "0E" part
592 **** sortkey details by category
594 1 specially ignored ones (Japanese, Tamil, Thai)
596 IdentifyBy: constants
597 Unicode: 3099-309C, BCD, E47, E4C, FF9E, FF9F
598 SortKey: 01 01 01 01 00
600 2 shift weight characters
602 They are either at 01 01 01 01 or 06, depending on StringSort. For
603 convenience, I use 06 to describe them.
605 2.1 control characters (specified as such in Unicode), except for
606 whitespaces (0009-000D).
609 IdentifyBy: UnicodeCategory.Control
610 Unicode: 0001-000F minus 0009-000D, 007F-009F
611 SortKey: 06 03 - 06 3D
615 Unicode: 0027,FF07 (')
616 SortKey: 06 80 (and width insensitive equivalents)
618 2.3 minus sign, hyphen, dash
minus signs: FE63, 207B (super), 208B (sub), 002D, FF0D (full-width)
620 hyphens: 00AD (soft), 2010, 2011 (nonbreaking) ... Unicode HYPHEN?
621 dashes, horizontal bars: FE58 ... UnicodeCategory.DashPunctuation
623 IdentifyBy: UnicodeCategory.DashPunctuation
624 SortKey: 06 81 - 06 90 (and nonspace equivalents)
626 2.4 Arabic spacing and equivalents (64B-652, FE70-FE7F)
627 They are part of nonspacing mark, but not equal.
629 SortKey: 06 A0 - 06 A7 (and nonspace equivalents)
631 3 nonprimary characters, mixed.
633 ModifierSymbol, except for that are not in category 0 and "07" area
634 (i.e. < 128) nor those equivalents
636 NonSpacingMark which is ignorable (IsIgnorableNonSpacing())
637 // 30D, CD5-CD6, ABD, 2B9-2C5, 2C8, 2CB-2CD, 591-5C2. NonSpacingMark in
638 // 981-A3C. A4D, A70, A71, ABC ...
640 TODO: I need more insight to write table generator.
642 SortKey: 01 03 01 - 01 B6 01
644 This part of MS table design is problematic (buggy): \u0592 should
645 not be equal to \u09BC.
647 I guess, this buggy design is because Microsoft first thought that
648 there won't be more than 255 characters in this area. Or they might be
649 aware of the problem but prefer table optimization.
1) We should not mix those codes (make things sequential) and should expand
the level 2 length to 2 bytes. Instead of having a direct value, we
could use an index (pointer) to a zero-terminated level 2 table.
2) Include those characters from minor cultures here.
If in "discriminatory mode", those tables could still be provided
to be compatible with Windows.
Additionally there seem to be some bugs around Modifier letter collection.
For example, 2C6 should be a nonspacing diacritical character but it
is regarded as a primary character. The same applies to Mandarin
tone marks (2C9-2CB) (and there are plenty of such characters).
667 4 space separators and some kind of marks
669 4.1 whitespaces, paragraph separator etc.
670 UnicodeCategory.SpaceSeparator : 20, 3000, A0, 9-D, 2000-200B
672 SortKey : 07 02 - 07 18
674 4.2 some OtherSymbols: 2422-2423
676 SortKey : 07 19 - 07 1A
678 4.3 ASCII compatible marks ('!', '^', ...)
679 Non-alpha-numeric < 0x7F except for [[+-<=>']]
680 small compatibility equivalents -> itself, wide
683 FIXME: how to identify them?
684 some Punctuations: InitialQuote/FinalQuote/Open/Close/Connector
685 some OtherSymbols: 2400-2424
686 3003, 3006, 2D0, 10FB
remaining Punctuations: 9xx, 7xx
690 SortKey : 07 1B - 07 F0
5 mathematical symbols
693 InitialQuotePunctuation and FinalQuotePunctuation in ASCII
694 (not Quotation_Mark property in PropList.txt ; 22, 27)
696 byte area MathSymbol: 2B,3C,3D,3E,AB,B1,BB,D7,F7 except for AC
697 some MathSymbol (2044, 208A, 208C, 207A, 207C)
698 OtherLetter (1C0-1C2)
699 2200-22FF MathSymbol except for 221E (INF. ; regarded as a number)
701 SortKey : 08 02 - 08 F8
703 6 Arrows and Box drawings
704 09 02 .. 09 7C : 2300-237A
705 only primary differences
706 09 BC ... 09 FE : 25A0-AB, 25E7-EB, 25AC-B5, 25EC-EF, 25B6-B9,
707 25BC-C3, 25BA-25BB, 25C4-25D8, 25E6, 25DA-25E5
709 This area contains level 2 values.
710 2190- (non-codepoint order)
711 note that there are many compatibility equivalents
712 2500- except for 266F (#)
714 SortKey : 09 02 - 09 7C, 09 BC 01 03 - 09 BC 01 13,
715 09 {BD|BE|BF} 01 {03|04}, ...
716 TODO: fill the patterns
7 currency symbols and some punctuations
719 byte CurrencySymbols except for 24 ($)
720 byte OtherSymbols (A7-B6)
721 ConnectorPunctuation - 2040 (i.e. FF65, 30FB)
OtherPunct/ConnectorPunct/CurrencySymbol 2020-20AC - 20AC
723 OtherSymbol 3012-303F,3004,327F
724 MathSymbol/OtherSymbol 2600-2767 (math = 266F)
725 OtherSymbol 2440-244A, 2117
726 20AC (CurrencySymbol)
SortKey : 0A 02 - 0A FB
731 all DecimalDigitNumber, LetterNumber, non-CJK OtherNumber.
733 digits, in numeric order. We can use NET_2_0 CharUnicodeInfo.
736 SortKey : 0C 02 (9F8), 0C 03 - 0C E1 (normal numbers), 0C FF (INF.)
738 9 (E) latin letters (alphabets), mixing alphabetical symbols
739 Alphabets, A to Z, mixing alphabetical symbols. See below.
740 F8-2B8 except for (1BB-1BD and 1C0-1C3), but not sequential.
743 For diacritical orders, see level 2.
745 For 'A' it is "0E 02", for 'B' "0E 09" ... 'Z' "0E A9", ezh "0E AA".
746 0E B3 (1BE), 0E B4 (298)
748 There are CJK compatibility characters (3800-) and letterlike
749 symbols (2100-) in those A-to-Z area, ordered by character name.
751 Primary weights are sometimes culture-dependent.
752 FIXME: [0E 0D], [0E 0E], [0E 4B], [0E 75], [0E B2] are unknown
759 0B: 10D in hr|lt|lv|pl, 107 in pl
760 0C: C7 in az|tr, 10D in cs|sk, 106 in hr
766 1E: 110 (D with stroke) in hr
769 22: 18F=259 in az, E9 in is, 119 in pl, EA in vi, 1EBE-1EC7 in vi
773 26: 11F in az|tr, 123 in lv
776 2D: 267 (Heng with hook)
777 2E: 33CB in az, 33CA in tr
780 33: CD in is, 79 in lt
795 73: F1 in es, 1CC in hr
799 7D: F6 in az|hu|tr, 151 in hu, F3 in is|pl, F4 in sk|vi, 1ED0-1ED9 in vi
812 96: 17F (LATIN SMALL LONG S)
813 97: 15F in az|tr, 161 in cs|hr|lt|lv|sk|sl, 7A,179-17C in et, 15B in pl
814 98: 17E in et, 15F in ro, 15B in sl
822 A0: FA in is, 1B0,1EE8-1EF1 in vi
823 A1: FC in az|tr, 56,57 in et, FC,171 in hu, FB in vi
833 AB: DE in is, 17E in lt|lv, 17A in pl
834 AC: E6 in da|is, 1E3 in is, 17C in pl, 17E in sl
835 AD: 17E in cs|hr|sk, E5 in fi, F6,F8 in is 17A in sl
839 B1: E5 in da, "aa" in da
843 10 culture dependent letters (general)
844 0F: 386-3F2 ... Greek and Coptic
845 386-3CF: [0F 02] - [0F 19] (consider primary equivalents)
846 3D0-3EF: [0F 40] - [0F 54]
847 10: 400-4E9 ... Cyrillic.
848 For 400-45F and 4B1, they are mostly UCA DUCET order.
849 After that 460-481 follows, by codepoint.
850 (490-4FF except for 4B1 and Cyrillic supplementary are unused.)
851 11: 531-586 ... Armenian.
852 Simply sorted by codepoint (handle case).
853 12: 5D0-5F2 ... Hebrew.
854 Codepoint order (handle case).
855 13: 621-6D5 plus 670 (NonSpacingMark) ... Arabic
They appear to be ordered by Arabic Presentation Form B except
858 for FE95, and considering diacritical equivalents maybe based
859 on the primary character area (621-6D5).
860 There are still some special characters: 67E,686,698,6AF ...
861 which might not have equivalent characters (I wonder how they
862 are inserted into the presentation form B map).
865 - hamza, waw, yeh (621,624,626) are special: [13 07]
866 - For all remaining letters, get primary letter name
867 and store it into dictionary. If unique, then
868 increment index by 4 from [13 0B]
870 674-6D5 : by codepoint from [13 84].
871 14: 901-963 exc. 93A-93D 950-954 ... Devanagari.
872 For <905 codepoint order, x2 from [14 04].
873 For 905-939 codepoint order, x4 from [14 0B].
874 For 93E-94D codepoint order, x2 from [14 DA].
875 15: 982-9FA ... Bengali. Actually all UnicodeCategories except for
876 NonSpacingMark, DecimalDigitNumber and OtherNumber.
877 For <9E0 simple codepoint order from [15 02].
878 For >9E0 simple codepoint order from [15 3B].
879 16: A05-A74 exc. A3C A4D A66-A71 ... Gurmukhi.
880 The same as UCA order, x4 from [16 04].
881 17: A81-AE0 exc. ABC-ABD ... Gujarati.
882 Mostly equivalent to UCA, but insert {AB3,A81-A83} before AB9,
884 18: B00-B70 ... Oriya
885 All but NonSpacingMark and DecimalDigitNumber, by codepoint.
886 19: B80-BFF ... Tamil
887 BD7 is special : [19 02].
888 B82-B93 (vowels) : x2 from [19 0A].
889 B94 (vowel AU) : [19 24]
890 For consonant order Windows has native Tamil order which is
892 http://www.nationmaster.com/encyclopedia/Tamil-alphabet
893 (The order is still different in "Grantha" order from TAM.)
894 So, we should just hold constant array for consonants.
And put them in order, x4 from [19 26].
896 BBE-BCC : SpacingCombiningMark and BC0 ... x2 from [19 82].
897 1A: C00-C61 ... Telugu.
898 C55 and C56 are ignored (C5x line and remaining part of C6x
899 line just look like ignored).
900 C60 and C61 are specially placed. C60 after C0B, C61 after C0C.
901 Except for above, by codepoint, x3 from [1A 04].
902 1B: C80-CE5 ... Kannada.
903 CD5,CD6 (and CE6-CEF: DecimalDigitNumber) are ignored.
904 by codepoint, 3x from [1B 04].
905 1C: D02-D40 ... Malayalam.
906 by simple codepoint from [1C 02].
907 (1D: Sinhala ... totally ignored?)
908 1E: E00-E44 ... Thai.
909 preceding vowels (E40-E44) by codepoint [1E 02 - 1E 06]
910 consonants (E01-E2A) by codepoint, x6 from [1E 07].
911 1F: E2B-E5B,E80-EDF ... Thai / Lao. (Thai breaks the category wall.)
913 remaining consonants (E2B-E2E) by codepoint, x6 from [1E 07].
914 remaining vowels (E2F-E3A) by codepoint.
915 E45,E46,E4E,E4F,E5A,E5B
917 E80-EDF by codepoint from [1F 02].
918 21: 10A0-10FF ... Georgian
919 Mostly equal to UCA order, but swap 10E3 <-> 10F3,
922 11 (22) japanese kana letters and symbols, not in codepoint order
924 For single character, the sortkeys look like:
925 - Katakana normal A, Half Width (FF71) : FF 02 C4 FF C4 FF 01 00
926 - Katakana normal A, Full Width (30A2) : FF C4 FF 01 00
927 - Hiragana normal A, Full Width (3042) : FF FF 01 00
929 Actually for level 4 weights, there is a different rule (see
930 "level 4" format above).
There is also 32D0 (normal katakana A with circle) that has a
diacritic difference.
935 For primary weights, 'A' to 'O' are mapped to 22-26, 'Ka' to 'Ko'
936 are to 2A-2E, 'Sa' to 'So' are to 32-36 ... and follows.
937 'Nn' is special: [22 80].
939 After Kana characters, there are CJK compat characters.
940 From 22 97 01 01 01 01 00 (3349) to 22 A6 01 01 01 01 00 (333B) are
941 sorted in JIS table order (CP932.TXT). Remaining square characters
942 are maybe sorted in Alphabetic order.
944 UCA DUCET also does not apply here.
946 12 (23) bopomofo letters
947 3105-312C: simple codepoint order from [23 02].
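A range described as "simple codepoint order" can be computed instead of stored. The sketch below assumes that "simple" means consecutive weights with a step of 1 starting at [23 02]; that reading, and the helper name, are assumptions.

```csharp
using System;

public static class BopomofoSketch
{
    // Returns the two leading sort key bytes (category, level 1) for a
    // bopomofo letter, assuming consecutive weights from [23 02].
    public static byte [] PrimaryKey (char c)
    {
        if (c < '\u3105' || c > '\u312C')
            throw new ArgumentOutOfRangeException ("c");
        return new byte [] { 0x23, (byte) (0x02 + (c - '\u3105')) };
    }
}
```

Under that assumption the last letter, U+312C, would land at [23 29].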
949 13 culture dependent letters 2
950 710-72C : Estrangela (ancient Syriac).
952 711 is excluded (superscript).
953 714,716,71C,724 and 727 are "alternative" characters.
954 SortKey: [24 0B]-[24 60], by x where x is 2 for those
955 which is "alternative" defined above, otherwise 4.
958 Equals to UCA order, x2 from [24 6E].
960 (Maybe we should add remaining minor-culture characters here. Tibetan,
Limbu, Tagalog, Hanunoo, Buhid, Tagbanwa, Myanmar, Khmer, Tai-Le,
962 Mongolian, Cherokee, Canadian-Aboriginal, Ogham, Runic are ignored)
964 14 (41-45) surrogate Pt.1
966 15 (52 02-7E C8) hangul, mixing combined ones
968 It starts from 1100. After width-insensitive equivalents, those
969 syllables (from AC00) follow (until <del>AE4B</del>D7A3).
It seems to be based on some formula (though sometimes it looks
otherwise, e.g. 1117). FIXME: this area should be clarified more.
Hangul syllables should not be filled in the table. Instead, they
can be easily computed by the following formula:
    // rc is the codepoint for the input Syllable
    // (p holds "category << 8 + level1weight")
    int ri = ((int) rc - 0xAC00) + 1;
    ushort p = (ushort)
        ((ri / 254) * 256 + (ri % 254) + 2);
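The expression can be wrapped in a small function to see how it behaves at the edges of the syllable range. This is a sketch of the expression exactly as quoted; RelativeWeight is a hypothetical name, and no base is added because the text does not spell out how the result is positioned inside the Hangul category.

```csharp
public static class HangulSyllableSketch
{
    // Evaluates the quoted expression for a syllable in U+AC00..U+D7A3.
    // Dividing by 254 and adding 2 keeps the low byte in 02..FF, so it
    // never collides with the 00 and 01 markers.
    public static int RelativeWeight (char rc)
    {
        int ri = ((int) rc - 0xAC00) + 1;
        return (ri / 254) * 256 + (ri % 254) + 2;
    }
}
```

U+AC00 gives 0x0003, and U+ACFD (where ri reaches 254) rolls over to 0x0102 rather than producing a low byte of 00 or 01.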
Hangul Jamo cannot be filled in the table directly, since
U+1113 - U+1159 hold additional primary key bytes.
984 FIXME: find out how they can be computed.
985 See http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icuhtml/design/collation/ICU_collation_design.htm?rev=HEAD&content-type=text/html#Hangul_Implicit_CEs
989 9E 02-F0 B4 [3192-319F,3220-3243,3280-32B0,4E00-9FA5] : CJK mark,
990 parenthesized CJK (part), circled CJK (part), CJK ideograph.
Ordered but considering compatible characters (i.e. there is
992 no other way than having massive mapping).
993 F0 B5-F1 E4 [F900-FA2D]. CJK compatibility ideograph.
995 LAMESPEC: in the latest spec CJK ends at 9F BB. Since MS table
996 joins these two categories without any consideration, it is
997 impossible to insert those new characters without breaking binary
1000 17 (E5 02-FE 33) PrivateUse.
In fact it overlaps with CJK characters (maybe layout design failure).
1004 18 (F2 01-F2 31) surrogate Pt.2
In fact it overlaps with PrivateUse (maybe layout design failure).
1008 19 (FE FF 10 02 - FE FF 29 E9) CJK extensions
1012 They should be computed, since this range should be anyways
1013 checked (to not directly acquire the sortkey values but needs
1014 FE FF part) and anyways it can be computed.
1016 20 (FF FF 01 01 01 01 00) special.
1017 Japanese extender marks:
1018 3005, 3031, 3032, 309D, 309E, 30FC, 30FD, 30FE, FF70
1020 LAMESPEC: In native context Microsoft's understanding of Japanese
1021 3031 and 3032 is wrong. They can never be used to repeat *just
1022 previous one* character, but are usually used to repeat two or
1023 more characters. Also, 3005 is not always used to repeat exactly
1024 one character but sometimes used to repeat two (or possibly more)
1027 Arabic shadda: FE7C (isolated), FE7D (medial)
1028 (Actually they are not extenders in Unicode PropList.txt)
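A lookup for these special characters could be sketched as follows. The class
and method names are hypothetical; the codepoints are exactly the ones listed
above.

```csharp
using System;

public static class ExtenderMarks
{
	// Codepoints listed above as Japanese extender marks.
	static readonly char [] japaneseExtenders = {
		'\u3005', '\u3031', '\u3032', '\u309D', '\u309E',
		'\u30FC', '\u30FD', '\u30FE', '\uFF70'
	};

	public static bool IsJapaneseExtender (char c)
	{
		return Array.IndexOf (japaneseExtenders, c) >= 0;
	}

	// Arabic shadda presentation forms listed above.
	public static bool IsArabicShadda (char c)
	{
		return c == '\uFE7C' || c == '\uFE7D';
	}
}
```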
1031 - by UnicodeCategory -
1033 DashPunctuation 6 (no exception)
1034 DecimalDigitNumber C (no exception)
1035 EnclosingMark 1 E (no exception)
1037 LetterNumber C (no exception)
1038 LineSeparator 7 (only 2028)
1039 ParagraphSeparator 7 (only 2029)
1041 SpaceSeparator 7 (no exception)
1044 OtherNumber C(<3192), 9E-A7 (3124<)
1046 Control 6 except for 9-D (7)
1047 FinalQuotePunctuation 7 except for BB (8)
1048 InitialQuotePunctuation 7 except for AB (8)
1049 ClosePunctuation 7 except for 232A (9)
1050 OpenPunctuation 7 except for 2329 (9)
1051 ConnectorPunctuation 7 except for FF65, 30FB, 2040 (A)
1053 OtherLetter 1, 7, 8 (1C0-1C2), C, 12-FF
1054 MathSymbol 8, 9, 6, 7, A, C
1055 OtherSymbol 7, 9, A, C, E, F, <22, 52<
1056 CurrencySymbol A except for FF69,24,FF04 (7) and 9F2,9F3 (15)
1058 LowercaseLetter E-11 except for B5 (A) and 1BD (C)
1059 TitlecaseLetter E (no exception)
1060 UppercaseLetter E,F,10,11,21 except for 1BC (C)
1061 ModifierLetter 1, 7, E, 1F, FF
1062 ModifierSymbol 1, 6, 7
1063 NonSpacingMark 1, 6, 13-1F
1064 OtherPunctuation 1, 7, A, 1F
1065 SpacingCombiningMark 1, 14-22
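For the rows above that have no exception (or only a trivial one), the value
can be derived from the UnicodeCategory alone. The following is only a sketch:
the class and method names are hypothetical, and char.GetUnicodeCategory()
reflects the Unicode version of the running framework, which may differ from
the data this list was taken from.

```csharp
using System.Globalization;

public static class CategoryByUnicodeCategory
{
	// Returns the value listed above for the rows that need no table
	// lookup, or -1 when the UnicodeCategory alone does not decide it.
	public static int Guess (char c)
	{
		switch (char.GetUnicodeCategory (c)) {
		case UnicodeCategory.DashPunctuation:
			return 0x6;
		case UnicodeCategory.DecimalDigitNumber:
		case UnicodeCategory.LetterNumber:
			return 0xC;
		case UnicodeCategory.LineSeparator:
		case UnicodeCategory.ParagraphSeparator:
		case UnicodeCategory.SpaceSeparator:
			return 0x7;
		case UnicodeCategory.TitlecaseLetter:
			return 0xE;
		case UnicodeCategory.Control:
			// 6, except for U+0009..U+000D which are 7
			return (c >= '\u0009' && c <= '\u000D') ? 0x7 : 0x6;
		default:
			return -1;
		}
	}
}
```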
1067 *** Culture dependent design
1069 (To verify this section, run the simple dumper code shown above
1070 with all the supported cultures.)
1072 **** primary cultures and non-primary cultures
1074 This code runs the sort key dumper shown above once per culture
1075 and writes each character dump to a file named after the culture.
	using System.Diagnostics;
	using System.Globalization;
	using System.IO;
	using System.Text;
	public class Driver {
		public static void Main ()
		{
			foreach (CultureInfo ci in CultureInfo.GetCultures (
				CultureTypes.AllCultures)) {
				ProcessStartInfo psi = new ProcessStartInfo ();
				psi.FileName = "../allsortkey.exe";
				psi.Arguments = ci.Name;
				psi.RedirectStandardOutput = true;
				psi.UseShellExecute = false;
				// run the dumper and capture its whole output
				Process p = Process.Start (psi);
				string s = p.StandardOutput.ReadToEnd ();
				using (StreamWriter sw = new StreamWriter (ci.Name + ".txt", false, Encoding.UTF8))
					sw.Write (s);
			}
		}
	}
1096 For each sub culture (that has a parent culture), its collation
1097 mapping is identical to that of its parent, except for az-AZ-Cyrl.
1101 - zh-CHS = zh-CN = zh-SG = zh-MO : pronunciation
1102 - zh-TW = zh-HK = zh-CHT : stroke count
1107 (UCA implies that there are some cultures that sort alphabets from
1108 large to small, but as far as I can see there is no such CultureInfo.)
1110 **** Latin characters and NonSpacingMark order tailorings
1112 div : FDF2 is 24 83 01 01 01 01 00 (only 1 difference)
1113 syr : some NonSpacingMarks are totally ignorable.
1114 tt,kk,mk,az-AZ-Cyrl,uk : Cyrillic difference
1115 az,et,lt,lv,sl,tr,sv,ro,pl,no,is,hu,fi,es,da : Latin difference
1117 sk,hr,cs : Latin and NonSpacingMark differences
1121 **** CJK character order tailorings
1125 There are five different CJK orderings:
1126 default, ko(-KR), ja(-JP), zh-CHS and zh-CHT
1127 Each of them has a very different CJK mapping.
1129 Since they seem to be based on traditional encodings, we are likely to
1130 provide separate constant tables and switch depending on the culture.
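Switching between the five orderings could look like the sketch below. The
enum and class names are hypothetical; the Chinese grouping follows the
zh-CHS / zh-CHT equalities noted earlier in this section, and the real
implementation may well key on something other than the culture name.

```csharp
using System;

public enum CjkOrdering { Default, Korean, Japanese, ChineseSimplified, ChineseTraditional }

public static class CjkOrderingSelector
{
	// cultureName is a CultureInfo.Name such as "ja-JP" or "zh-TW".
	public static CjkOrdering Select (string cultureName)
	{
		switch (cultureName) {
		case "zh-CHS": case "zh-CN": case "zh-SG": case "zh-MO":
			return CjkOrdering.ChineseSimplified; // pronunciation based
		case "zh-CHT": case "zh-TW": case "zh-HK":
			return CjkOrdering.ChineseTraditional; // stroke count based
		}
		if (cultureName == "ja" || cultureName.StartsWith ("ja-", StringComparison.Ordinal))
			return CjkOrdering.Japanese;
		if (cultureName == "ko" || cultureName.StartsWith ("ko-", StringComparison.Ordinal))
			return CjkOrdering.Korean;
		return CjkOrdering.Default;
	}
}
```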
1132 <what characters are different from the invariant culture?>
1134 ko : CJK layout difference (52 -> 80)
1135 ja,zh-CHS,zh-TW : dash (5C), CJK layout difference.
1137 Target characters are : CJK misc (3190-), Parenthesized CJK
1138 (3200-), CJK compat (3300-), CJK ideographs (4E00-),
1139 CJK compat ideograph (F900-), Half/Full width compat (FF00-)
1141 Additionally for Korean: Jamo (1100-), Hangul syllables (AC00)
1143 <how are they composed?>
1145 Japanese CJK order looks based on JIS table order. Those characters
1146 which are also in JIS table are moved to 80 xx. Those which are *not*
1147 in JIS table are left as is (9E-FE).
1149 Additionally, Windows has different order for characters below:
1150 4EDD,337B,337E,337D,337D,337C
1151 They come in front of the first CJK character.
1153 Maybe Korean CJK order respects KS C 5619. Note that Korean mixes
1154 Hangul and CJK in its order, so it is not a flat order without indexes
1155 (thus, for CJK they are not computable). Also, there are extra
1156 level 2 values in the Korean CJK map.
1158 For some Chinese such as zh-CHS, character order is based on pinyin.
1159 And for remaining Chinese such as zh-TW, it is stroke count based.
1161 CLDR of unicode.org has reference ordering of those characters, so
1162 our collation table extracts the sorting order from it.
1163 http://www.unicode.org/cldr/
1165 **** Accent evaluation order
1167 With French cultures, diacritical marks are evaluated in reverse order.
1168 French ordering does not affect some diacritics (the Japanese
1169 voice mark is not affected - FIXME: I doubt it, because the algorithm
1170 does not seem to allow it).
1172 Some other cultures might also have different orders, but none is obvious.
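The effect of the reversed evaluation can be sketched with one level 2 weight
per character element. The names and the weight values are hypothetical, and
both arrays are assumed to have the same length.

```csharp
using System;

public static class SecondaryWeightComparer
{
	// a and b hold one level 2 (diacritic) weight per character
	// element. With backward == true the first difference is searched
	// from the end of the string, as for French cultures.
	public static int Compare (byte [] a, byte [] b, bool backward)
	{
		if (a.Length != b.Length)
			throw new ArgumentException ("arrays must have equal length");
		for (int i = 0; i < a.Length; i++) {
			int idx = backward ? a.Length - 1 - i : i;
			if (a [idx] != b [idx])
				return a [idx] < b [idx] ? -1 : 1;
		}
		return 0;
	}
}
```

With made-up weights { 0, 1, 0, 0 } for "côte" and { 0, 0, 0, 2 } for "coté",
the forward comparison puts "côte" after "coté", while the backward comparison
puts it before.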
1175 ** Mono implementation plans
1179 CompareInfo contains many overloaded methods that are just for
1180 convenience. This class contains almost only required members.
1182 This class also provides access to tailoring information, which is
1183 culture instance dependent:
1186 - contractions/expansions - returns contraction or expansion
1187 - diacritical remapping
1188 - CJK custom mapping
1190 For data area, see CollationDataStructures.txt for now.
1192 *** UnicodeTable (for now MSCompatUnicodeTable)
1194 Provides access to character information related to
1195 the collation element table (of our own).
1196 FIXME: I want to fix some bugs in Windows collation table especially
1197 to not ignore some characters, but it requires table modification
1198 which results in further memory allocation. Maybe it would be done
1199 as a patch for the runtime (or classlib) sources.
1201 - ignorable, ignorable nonspace, normalize width, normalize kanatype
1202 - level 4 sortkey provision method(s)
1204 **** character comparison
1206 Since a composite character is likely to *not have* an equivalent
1207 codepoint, character comparison cannot be done just by comparing the
1208 "resulting char" value.
1209 In contrast, since a composite character is likely to *do have* an
1210 equivalent codepoint, character comparison cannot be done just by
1211 comparing the "source char" value either.
1213 ***** future optimizations
1215 Starting from where the codepoints differ, for each string it adjusts
1216 the position so that it covers exactly one character element. That is,
1217 it finds the primary character as the start of the range and the last
1218 nonprimary character as the end of the range.
1220 Once Compare() has adjusted the character location to a valid
1221 comparison position, further comparison is done as a usual comparison,
1222 i.e. sortkey comparison considering comparisonLevel.
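The position adjustment could be sketched as below. This is not the planned
implementation: all names are hypothetical, and the "primary" test is replaced
by a stand-in that treats every character other than a NonSpacingMark as
primary. The index passed in is assumed to be inside the string.

```csharp
using System.Globalization;

public static class ElementBoundary
{
	// Stand-in for the real "is primary" test (an assumption).
	static bool IsPrimary (char c)
	{
		return char.GetUnicodeCategory (c) != UnicodeCategory.NonSpacingMark;
	}

	// Moves back from the first differing index to the primary
	// character that starts the character element.
	public static int FindStart (string s, int diff)
	{
		int i = diff;
		while (i > 0 && !IsPrimary (s [i]))
			i--;
		return i;
	}

	// Returns the index just past the last nonprimary character of
	// the element that starts at 'start'.
	public static int FindEnd (string s, int start)
	{
		int i = start + 1;
		while (i < s.Length && !IsPrimary (s [i]))
			i++;
		return i;
	}
}
```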
1224 **** Characters in the table / characters computed
1226 Currently I plan not to store the following characters in the table
1227 but to compute them on demand:
1232 **** CJK Unified Ideographs
1234 For CJK unified ideographs, I had to keep the culture-dependent
1235 tables in memory. Since they came from some classical encodings, they
1236 are not computed. Thus, they are in a separate table.
1238 **** Level 4: Kana type
1240 The table does not contain level 4 (kanatype) properties for
1241 the whole characters. They can be simply computed.
1243 **** Level 3: Case/Width properties
1245 Case properties will be stored as a byte array, limited to certain
1246 codepoint areas (cp < 3120 || FE00 < cp).
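Such a limited array needs a codepoint-to-index mapping. A minimal sketch,
assuming the two areas are simply stored back to back (the names are
hypothetical):

```csharp
public static class Level3Index
{
	const int LowEnd = 0x3120;    // exclusive upper bound of the low area
	const int HighStart = 0xFE01; // first codepoint of the high area

	// Length of a byte array that covers both areas.
	public const int TableLength = LowEnd + (0x10000 - HighStart);

	// Returns the index into the compacted array, or -1 if cp lies in
	// the gap (0x3120..0xFE00) or outside the BMP and is not stored.
	public static int ToIndex (int cp)
	{
		if (cp < 0 || cp > 0xFFFF)
			return -1;
		if (cp < LowEnd)
			return cp;
		if (cp >= HighStart)
			return LowEnd + (cp - HighStart);
		return -1;
	}
}
```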
1248 For Hangul characters, it will be computed by codepoint areas.
1250 **** Level 2: Diacritical properties
1252 The table will hold one byte per character. If we provide a
1253 non-buggy mode (Windows is buggy here by design; it just sums
1254 secondary weight values up), the values will come from UCA and a
1255 non-blocking check will be introduced.
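Why summing is lossy can be seen from a small sketch. The names and weight
values are hypothetical, and the wrap-around to a single byte is an
assumption, not something verified against Windows.

```csharp
public static class Level2Sum
{
	// Adds up the level 2 weights of a base character and its
	// combining marks. Truncating to a byte is an assumption here.
	public static byte Combine (params byte [] weights)
	{
		int sum = 0;
		foreach (byte w in weights)
			sum += w;
		return unchecked ((byte) sum);
	}
}
```

With this scheme, Combine (2, 3) and Combine (5) give the same result, so two
different mark sequences cannot be told apart at level 2.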
1257 Note that Japanese voice marks are considered at level 2 but no need to
1261 ** Reference materials
1263 Developing International Software for Windows 95 and Windows NT
1264 Appendix D Sort Order for Selected Languages
1265 http://www.microsoft.com/globaldev/dis_v1/disv1.asp?DID=dis33d&File=S24BF.asp
1267 UTR#10 Unicode Collation Algorithm (It is still informative)
1268 http://www.unicode.org/reports/tr10/
1270 UAX#15 Unicode Normalization
1271 http://www.unicode.org/reports/tr15/
1272 especially its canonical/compatibility equivalent characters might
1273 be informative to get those equivalent characters.
1275 To know which character can be expanded, Unicode Character Database
1276 (UCD) is informative (it's informative but not normative to us)
1277 http://www.unicode.org/Public/UNIDATA/UCD.html
1279 A decent char-by-char explanation is available here:
1280 http://www.fileformat.info/info/unicode/
1282 Wine uses UCA default element table, but has windows-like character
1283 filterings support in their LCMapString implementation:
1284 http://cvs.winehq.com/cvsweb/wine/dlls/kernel/locale.c
1285 http://cvs.winehq.com/cvsweb/wine/libs/unicode/sortkey.c
1287 Mimer has decent materials on culture specific collations:
1288 http://developer.mimer.com/collations/
1290 This is written in Japanese, but awesome analysis on MS Access
1292 http://www.asahi-net.or.jp/~ez3k-msym/comp/acccoll.htm