We are going to implement Windows-like collation, independently of ICU (which
implements UCA, the Unicode Collation Algorithm).
11 * create collation element table(s)
12 - infer how Windows collation element table is composed
14 - write table generator source(s)
15 : mostly implemented. Need to fill 50 blank characters
16 which should not be blank, and fix 1100 lines of
18 - culture-specific sortkey data
19 : Some are written in mono-tailoring-source.txt. They
20 come from dumped diffs (shown later) and UCA tailorings
21 (via create-tailorings.exe).
22 * implement collation methods
- All methods are implemented at a practically usable level.
- Except for GetSortKey(), Compare() and IsPrefix(), they do not
handle Japanese and Arabic extenders.
- Currently the tailored sortkey does not handle CompareOptions.
However, the flag is significant.
31 ** How to implement CompareInfo members
34 Compute sort key for every character elements into byte[].
36 Find first difference and compare it.
37 "Larger/smaller" matters (beyond "different").
39 It calls CompareInternal() which also answers if the target
40 is fully consumed, so it just returns true if it says that
41 the target is fully consumed.
43 It tries CompareInternal() to compare source and target at
44 the end, where source varies from minimum tail to the
46 IndexOf(), LastIndexOf()
For a character search, it looks for the matching character element,
scanning toward the end (or start) of the string being searched.
49 For string search, it invokes one of private IndexOf() (or
50 LastIndexOf()) overload passing the first character element
51 of the target, and if found, tests if the sequence is a valid
52 start point, using IsPrefix() (or IsSuffix()).
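
The string search described above can be sketched as follows. This is only a minimal illustration built on the public CompareInfo API, not the actual implementation: the helper name is made up, and taking target [0] as the "first character element" is a simplifying assumption that ignores contractions and ignorable characters.

```csharp
using System;
using System.Globalization;

public static class IndexOfSketch
{
	// Hypothetical helper: string search built from a character search
	// plus a prefix test at each candidate position.
	public static int IndexOf (CompareInfo ci, string s, string target, CompareOptions opt)
	{
		if (target.Length == 0)
			return 0;
		int start = 0;
		while (start < s.Length) {
			// Find the next occurrence of the first character of the target.
			int idx = ci.IndexOf (s, target [0], start, s.Length - start, opt);
			if (idx < 0)
				return -1;
			// Candidate found: test whether the target really starts here.
			if (ci.IsPrefix (s.Substring (idx), target, opt))
				return idx;
			start = idx + 1;
		}
		return -1;
	}

	public static void Main ()
	{
		CompareInfo ci = CultureInfo.InvariantCulture.CompareInfo;
		Console.WriteLine (IndexOf (ci, "abcabd", "abd", CompareOptions.Ordinal));
	}
}
```

LastIndexOf() would mirror this with a backward character search and IsSuffix().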
54 *** future Optimization
For GetSortKey(), Compare(), IsPrefix() and IndexOf(), it uses forward
iteration, which moves forward and does not stop until it either finds the
next "primary" character or reaches the end of the string, checking
HasContractHead(char) for composites.

For IsSuffix() and LastIndexOf(), it uses backward iteration, which
moves backward and does not stop until it either finds a "primary"
character or reaches the beginning of the string, checking
HasContractTail(char) for composites.
66 Porting them to C code is an alternative possible approach.
68 ** How to support CompareOptions
There are two kinds of "ignorance": strippers' ignorance and
normalizers' ignorance.
The strippers "filter characters out", so there are no
corresponding character elements in the SortKey binaries.
Normalizers, on the other hand, fold related characters into a common
form, so a difference between unrelated characters still takes effect.
78 For example, with IgnoreKanaType Hiragana "A" and Katakana "A" are
79 not distinguished, but Hiragana "A" and Hiragana "I" are.
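
A toy sketch of the two kinds of ignorance, assuming nothing about the real tables: the helper names are hypothetical, the ignorable set is supplied by the caller, and the normalizer only folds Katakana U+30A1-U+30F6 onto Hiragana U+3041-U+3096 (a fixed offset of 0x60).

```csharp
using System;
using System.Text;

public static class IgnoranceSketch
{
	// Stripper: drops the character entirely, so nothing of it would
	// reach the sort key. The ignorable set is illustrative only.
	public static string Strip (string s, string ignorable)
	{
		StringBuilder sb = new StringBuilder ();
		foreach (char c in s)
			if (ignorable.IndexOf (c) < 0)
				sb.Append (c);
		return sb.ToString ();
	}

	// Normalizer: folds Katakana onto Hiragana and keeps everything else,
	// so different letters stay different after folding.
	public static string FoldKanaType (string s)
	{
		StringBuilder sb = new StringBuilder ();
		foreach (char c in s)
			sb.Append (c >= '\u30A1' && c <= '\u30F6' ? (char) (c - 0x60) : c);
		return sb.ToString ();
	}

	public static void Main ()
	{
		// Katakana "A" folds onto Hiragana "A" ...
		Console.WriteLine (FoldKanaType ("\u30A2") == "\u3042");
		// ... but Hiragana "A" and Hiragana "I" remain distinct.
		Console.WriteLine (FoldKanaType ("\u3042") == FoldKanaType ("\u3044"));
		// A stripped character leaves no trace at all.
		Console.WriteLine (Strip ("a\uFDF2b", "\uFDF2"));
	}
}
```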
81 Actually, even without any IgnoreXXX flags (i.e. "None"), there are
82 many characters that are ignored ("completely ignorable").
84 Except for LCID 101/1125(div), '\ufdf2' is completely ignorable.
85 This rule even applies to CompareOptions.None.
90 Maybe culture-dependent TextInfo.ToLower() could be used.
92 Unlike ICU (specialCaseToLower()), even with tr-TR(LCID 31)
93 and IgnoreCase, I\u0307 is not regarded as equal to i.
96 ToKanaTypeInsensitive(). Note that this does not cover the
97 whole "level 4" differences described later.
100 ToWidthInsensitive(), which is likely to be culture
101 independent. See also "Notes".
103 IgnoreNonSpace (see also Strippers; this flag works in both sides)
104 For some cultures this logic is still incomplete. All culture-
105 dependent collators must handle valid "replacement" of "one or
106 more characters" which might be related to specific
For example, there is a Japanese text sorting rule that
nevertheless applies to InvariantCulture. Concretely,
"\u3042\u30FC" is equivalent to "\u3042\u3042" only when
IgnoreNonSpace is specified.
I'll take those items from CLDR (those items which have
<reset before="..." />), case by case though.
118 I already wrote all the required strippers which should be MS
119 compatible (at least with .NET 1.1 invariant culture).
122 IsIgnorableNonSpacing().
123 Some Diacritic characters are covered by this flag.
125 There are some culture *dependent* characters:
126 LCID 90/1114(syr) : 64b, 652, 670
130 UnicodeCategory does not work here.
132 There are some culture *dependent* characters:
133 LCID 17/1041(ja) : 2015
134 LCID 90/1114(syr) : 64b, 652
138 See "sort order categories" section.
142 First to note: we won't use collation element table from unicode.org.
There are many differences between ICU/UCA and Windows even though they
look so similar: collation keys at different levels, culture-
dependent composition, etc. Historically, Windows collation was
designed before UCA was specified, so basically Windows is obsolete
150 - Logic: Unlike UCA it has no concept of "blocked" combining marks,
151 and combining marks are never considered as an independent character
152 (thus combining in Windows is buggy).
- Data: Windows is based on an old Unicode standard (even older than 1.1).
154 It ignores minor cultures. Character property values differ as well
155 as those from the default Unicode collation element table (DUCET).
156 In a few cultures Windows collation is close to the native language
157 (e.g. Tamil, while it does not conform to TAM).
159 IgnoreWidth/IgnoreSymbols is processed after Kana voice mark
160 decomposition (something like NFD, but not equivalent. Example: \u304C
161 is completely equivalent to \u304B\u309B, which is not part of NFKD).
<del>This means, if there is a combined Kana character, it will be
first decomposed and then compared.</del> It scarcely matters since
there are special weight data for Japanese.
166 *** Microsoft design problem
The Microsoft implementation seems to have a serious problem: many,
many characters that are used in specific cultures, such as
Myanmar, Mongolian, Cherokee, Ethiopic, Tagalog, Khmer, are regarded as
"completely ignorable".
173 I tagged many LAMESPEC items in the implementation (both in collator
174 and table generator).
177 ** MS collation design inference
181 Each character has several "weights". It is a common concept between
186 - level 1: primary difference
187 The first byte of level 1 means the category of the character.
188 - level 2: diacritic difference, including Japanese voice mark countup
189 - level 3: case/width sensitivity, and Hangul properties
190 - level 4: kana weight (all of them have primary category 22, at least
192 - level 5: shift weight (apostrophe, hyphens etc.)
Note that these levels do not map one-to-one onto IgnoreXXX flags. Thus
195 it is not OK that we omit some levels of sortkey values in reflection
String comparison is done from level 1 to 5. The comparison does not
stop until it either finds a primary difference or reaches
the end (in which case the upper level difference is returned).
For example, "e" is smaller than "E", but "eB" is bigger than "EA".
If the collator just returned the case difference at the first 'e' and 'E',
"eB" would wrongly come out smaller than "EA".
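
The "e"/"E" example can be reproduced with a toy two-level comparer. The weights here are invented for illustration (primary weight is the letter regardless of case, case weight is 0 for lowercase and 1 for uppercase); the point is only that the first case difference is remembered, not returned, until all primary weights have been compared.

```csharp
using System;

public static class LevelOrderSketch
{
	// Toy weights, not taken from any real table (ASCII letters only).
	static int Primary (char c) { return char.ToUpperInvariant (c); }
	static int CaseWeight (char c) { return char.IsUpper (c) ? 1 : 0; }

	public static int Compare (string a, string b)
	{
		int n = Math.Min (a.Length, b.Length);
		int caseDiff = 0;
		for (int i = 0; i < n; i++) {
			int d = Primary (a [i]) - Primary (b [i]);
			if (d != 0)
				return d; // a primary difference wins immediately
			if (caseDiff == 0)
				caseDiff = CaseWeight (a [i]) - CaseWeight (b [i]);
		}
		if (a.Length != b.Length)
			return a.Length - b.Length;
		// Only when no primary difference exists does the case level count.
		return caseDiff;
	}

	public static void Main ()
	{
		Console.WriteLine (Compare ("e", "E") < 0);
		Console.WriteLine (Compare ("eB", "EA") > 0);
	}
}
```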
206 **** level 5: shift weight by StringSort
There are some characters that are treated specially, namely
apostrophe and hyphens. The sortkeys for them are put after level 4
(thus here I write them as "level 5"). They have a different sort key
format. See immediately below. There are no level 5 characters when
StringSort is specified.
216 00 means the end of sort key.
217 01 means the end of the level.
218 02-FF means the value.
219 Actually '2' could be cut when all the following values are
220 also '2' (i.e. the sort key binary won't contain extraneous '2').
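
A minimal sketch of reading this layout back, assuming only the three rules above (00 ends the key, 01 ends a level, everything else is a value). The class and method names are hypothetical; the second helper shows the trailing '2' trimming on a single level.

```csharp
using System;
using System.Collections.Generic;

public static class SortKeyLevels
{
	// Splits raw sort key bytes into levels: 01 ends a level, 00 ends the key.
	public static List<byte []> Split (byte [] key)
	{
		List<byte []> levels = new List<byte []> ();
		List<byte> current = new List<byte> ();
		foreach (byte b in key) {
			if (b == 0x00 || b == 0x01) {
				levels.Add (current.ToArray ());
				current.Clear ();
				if (b == 0x00)
					break;
			} else {
				current.Add (b);
			}
		}
		return levels;
	}

	// Drops trailing 02 bytes from one level, since they carry no information.
	public static byte [] TrimTrailing02 (byte [] level)
	{
		int len = level.Length;
		while (len > 0 && level [len - 1] == 0x02)
			len--;
		byte [] result = new byte [len];
		Array.Copy (level, result, len);
		return result;
	}

	public static void Main ()
	{
		byte [] key = { 0x22, 0x97, 0x01, 0x01, 0x01, 0x01, 0x00 };
		List<byte []> levels = Split (key);
		for (int i = 0; i < levels.Count; i++)
			Console.WriteLine ("level {0}: {1} byte(s)", i + 1, levels [i].Length);
	}
}
```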
222 Every level has different key layout.
It looks like all level 2 keys are simply accumulated, without
considering overflow. It sometimes makes sense (e.g. diaeresis and
acute) but it causes many conflicts (e.g. "A\u0308\u0301" and "\u1EA6"
are incorrectly regarded as equal).
Anyway, since a Japanese voice mark has a level 2 value of 1, it just
looked like the sum of voice marks.
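
To illustrate why plain accumulation is fragile, here is a sketch with made-up weights (they are NOT taken from the real table): two different pairs of marks can add up to the same byte, and a large sum silently wraps around.

```csharp
using System;

public static class Level2Accumulation
{
	// Adds level 2 weights into a single byte, ignoring overflow,
	// which is the behavior inferred above.
	public static byte Accumulate (params byte [] weights)
	{
		byte sum = 0;
		foreach (byte w in weights)
			sum = unchecked ((byte) (sum + w));
		return sum;
	}

	public static void Main ()
	{
		// Hypothetical weights: two different combinations collide.
		Console.WriteLine ("{0:X2} {1:X2}", Accumulate (0x13, 0x0E), Accumulate (0x12, 0x0F));
		// Hypothetical weights: the sum wraps past FF.
		Console.WriteLine ("{0:X2}", Accumulate (0xF0, 0x20));
	}
}
```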
236 The actual values are + 2 (e.g. for Hangul Normal Jamo, the value is 4)
239 - 2: Jongseong (11A8-11F9)
240 - 4: Half width? (FFA0-FFDC) and Compatibility Jamo? (3165-318E)
241 - 5: Compatibility Jamo (3130-3164)?
242 TODO: Learn about Korean characters.
245 - 4 circled inverse (2776-277F)
246 - 8 circled sans serif (2780-2789)
247 - C circled inverse && sans serif (278A-2793)
248 - 47 roman (2160-2182)
251 - 2 Isolated form in presentation form B in FE80-FE8D
252 - 4 Alef/Bet/Gimel/Dalet (2135-2138)
253 - 8 Final form in presentation form B in FE82-FEF2
254 - 18 Medial form in presentation form B in FE8C-FEF4
255 Grep "ISOLATED", "FINAL" or "MEDIAL" on UnicodeData.txt
256 (and filter by codepoints).
257 or alternatively, see DerivedDecompositionType.txt.
258 - 22 6A9 (TODO: what is it?)
259 - 28 6AA (TODO: what is it?)
262 - 1 Fullwidth. UnicodeData.txt has <full>.
263 - 2 Subscript. UnicodeData.txt has <sub>.
264 - 8 Small capital, 03C2 (TODO: why?),
265 2104, 212B(flag=1A) (TODO: why?)
266 grep "SMALL CAPITAL" against UnicodeData.txt.
267 - C only FE42. TODO: what is this?
268 - E Superscripts. UnicodeData.txt has <super>.
270 DerivedCoreProperties.txt has Uppercase property.
272 Note that simple 02 (value is 00) could be omitted.
Summary: at least 7 bits are required to represent a table:
smallCapital, uppercase, normalization forms (2 bits: full/sub/super),
arabic forms (2 bits: isolated/medial/final)
Those sortkey data are collected only for Japanese (category 22)
There are 3 sections, each of which ends with FF. Each of them
represents the values character by character:
285 - small letter type (kogaki moji); C4 (small) or E4 (normal)
286 - category middle section:
287 two subsections separated by 0x02
290 or 4 (voice mark - \u309D,\u309E,\u30FD,\u30FE, \uFF70)
291 or 5 (dash mark - \u30FC)
292 - kana type; C4 (katakana) or E4 (hiragana)
293 - width; 2 (normal) or C5 (full) or C4 (half)
LAMESPEC: those characters whose middle section value is '4' differ
in level 2 with respect to voice marks, but kana types are not
differentiated (bug). It is ignored when IgnoreNonSpace applies.
301 UPDATED: I noticed offsetL does not exist, so removed it from here.
303 [offsetM + 0x80]? [const 3 + (offsetS + 1) * 4] [category] [level1]
where "offsetM" and "offsetS" represent the offset in the input
string. "offsetM" is always larger than 0x80.
307 LAMESPEC: This design results in a buggy overflow.
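	string s = "-" + new string ('A', 10) + "-";
	byte [] data = new CultureInfo ("").CompareInfo.GetSortKey (s).KeyData;
	int idx = 0;
	// skip levels 1 to 4, each of which is terminated by 01
	for (int i = 0; i < 4; i++, idx++)
		for (; data [idx] != 1; idx++)
			;
	// dump the remaining (level 5) bytes
	for (; idx < data.Length; idx++)
		Console.Write ("{0:X02} ", data [idx]);
	Console.WriteLine ();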
320 inputs (s) and results:
322 80 07 06 82 80 2F 06 82 00 // '-' + new string ('A', 10) + '-'
323 80 07 06 82 81 97 06 82 00 // (100)
324 80 07 06 82 8F A7 06 82 00 // (1000)
325 80 07 06 82 9C 47 06 82 00 // (10000)
326 80 07 06 82 9A 87 06 82 00 // (100000)
327 80 07 06 82 89 07 06 82 00 // (1000000)
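Judging from the samples above, the two offset bytes decode as
64 * offsetM + (offsetS + 1), which equals the number of 'A's plus one
modulo 8192 (a factor of 63 does not reproduce the samples).
(The constant '3' may actually vary, but I have no idea.
At least 00, 01 and 02 are not acceptable since they are reserved.
02 is not reserved by the definition above, but the key-size optimizer
uses it as a special mark, as mentioned above.)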
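
The six dumps above can be checked mechanically. The sketch below assumes the layout [offsetM + 0x80] [3 + (offsetS + 1) * 4] and reads the pair in radix 64; with that reading (rather than a factor of 63) every sample decodes to the number of 'A's plus one, modulo 8192, which also shows where the overflow comes from. The class name is made up for this check.

```csharp
using System;

public static class Level5OffsetCheck
{
	// Decodes the two offset bytes of the second '-' under the assumed layout.
	public static int Decode (byte hi, byte lo)
	{
		return (hi - 0x80) * 64 + (lo - 3) / 4;
	}

	public static void Main ()
	{
		// { number of 'A's, first offset byte, second offset byte }
		int [,] samples = {
			{ 10, 0x80, 0x2F }, { 100, 0x81, 0x97 }, { 1000, 0x8F, 0xA7 },
			{ 10000, 0x9C, 0x47 }, { 100000, 0x9A, 0x87 }, { 1000000, 0x89, 0x07 }
		};
		for (int i = 0; i < samples.GetLength (0); i++) {
			int expected = (samples [i, 0] + 1) % 8192;
			int decoded = Decode ((byte) samples [i, 1], (byte) samples [i, 2]);
			Console.WriteLine ("{0}: decoded {1}, expected {2}",
				samples [i, 0], decoded, expected);
		}
	}
}
```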
338 Here is the simple sortkey dumper:
	public static void Main (string [] args)
	{
		CultureInfo culture = args.Length > 0 ?
			new CultureInfo (args [0]) :
			CultureInfo.InvariantCulture;
		CompareInfo ci = culture.CompareInfo;
		for (int i = 0; i < char.MaxValue; i++) {
			string s = new string ((char) i, 1);
			// skip completely ignorable characters
			if (ci.Compare (s, "") == 0)
				continue;
			byte [] data = ci.GetSortKey (s).KeyData;
			foreach (byte b in data) {
				Console.Write ("{0:X02}", b);
			}
			// codepoint, Unicode category, and '!' when level 2 is not empty
			Console.WriteLine (" : {0:X}, {1} {2}",
				i,
				Char.GetUnicodeCategory ((char) i),
				data [2] != 1 ? '!' : ' ');
		}
	}
362 *** multiple character mappings
Some sequences of characters are considered a "composite" that is
to be composed either as another character or as another sequence of
characters. Such "composites" might not have a corresponding equivalent
character in the sortkey.
368 Similarly, some single characters are expanded to a sequence of
371 **** diacritic characters
373 Except for those shift-weight characters, there are only
374 diacritical (or other kinds of nonspacing) characters that don't
375 have primary weights.
377 Diacritics are not regarded as a base character when placed after
378 (maybe some kind of) letters.
380 The behavior is diacritic character dependent. For example, Japanese
381 combination of a Kana character and a voice mark is compulsory (the
382 resulting sort key is regarded as identical to the corresponding
383 single character. Try \u304B\u309B with \u304C. It is invariant).
385 In French cultures, diacritic orderings are checked from right to left.
387 **** Composite character processing
389 There are some sequences of characters that are treated as another
390 character or another sequence of characters.
392 By default, there is no composite form.
393 http://www.microsoft.com/globaldev/dis_v1/disv1.asp?DID=dis33d&File=S24C2.asp
394 (Note that composite is different from expansion.)
Note that composite characters are likely not to have an equivalent
399 **** Expanded character processing
401 Some characters are expanded to two or more characters:
403 C6 (AE), E6 (ae), 1F1-1F3 (dz), 1C4-1C6 (Dz), FB00-FB06 (ff, fi),
404 132-133 (IJ), 1C7-1C9 (LJ), 1CA-1CC (NJ), 152-153 (OE),
405 DF (ss), FB06 (st), FB05 (\u017Ft), FE, DE, 5F0-5F2,
407 (CJK extension is not really expanded)
409 They don't match with any of Unicode normalization.
411 Some alphabetic cultures have different mappings, but mostly small
412 (at least da-DK, lt-LT, fr-FR, es-ES have tiny differences).
414 Invariant culture also puts Czech unique character \u0161 between s
415 and t, unlike described here:
416 http://www.microsoft.com/globaldev/dis_v1/disv1.asp?DID=dis33d&File=S24C0.asp
418 *** default sort key table
422 When CompareOptions.StringSort is specified, then it modifies
423 characters in category 2 from "1 1 1 1 80 07 06 xx" to
424 "06 xx yy zz" and some characters become case sensitive.
For details, see the "level 5" description above.
428 To handle them simply, they are laid out as "category 0x01" (which
429 never happens in the actual sortkeys) for those shift-weight ones
There seem to be no further differences between StringSort and None.
437 -0A: Korean parenthesized numbers (3200-321C)
438 -0C: Korean circled numbers (3260-327B)
440 -03: Japanese voice mark
442 <primary category 13 : Arabic>
443 -08: 627-648 (basic Abjad letters)
445 -05: waw with hamza (624)
446 -07: yeh with hamza (626. ignore Presentation Form A area)
447 -0A: alef with hamza above (623)
448 -0A: alef with hamza below (625)
450 <primary category 0E : diacritics>
451 Characters in non "0E" category are out of scope.
452 They can be grepped in UnicodeData.txt.
460 Note that 1C4-1C6 are covered but they are also expanded.
461 -15: breve (cyrillic are also covered? at least 4C1/4C2 are.)
462 -16: dialytika and tonos (category 0F though)
465 -1A: ring above | 212B
466 -1B: ogonek ("WITH OGONEK;")
-1C: cedilla ("WITH CEDILLA;")
468 -1D: double acute | acute and dot above
469 -1E: stroke, except for 0E[1F] and cp{19B, 1BE} |
470 circumflex and acute | 18B,18C,19A,289
(i.e. the mapping is not one-to-one: not every
"stroke" is mapped to 1E, and not every 1E is mapped to
474 -1F: diaeresis and acute | with circumflex and grave | l slash
475 beware "symbol slash"
476 -20: diaeresis and grave | 19B,19F
477 -21: breve and acute | D8,F8
478 -22: caron and dot above | breve and grave
479 -23: macron and acute
480 -24: macron and grave
481 -25: diaeresis and caron | dot above and macron | tilde and acute
482 -26: ring above and acute
483 -28: diaeresis and macron | cedilla and acute |
485 -29: circumflex and tilde
486 -2A: tilde and diaeresis
487 -2B: stroke and acute
489 -2F: cedilla and breve
490 -30: ogonek and macron
491 -43: hook, except for cp{192,1B2,25A,25D,27B,28B,2B1,2B5} |
492 left hook | with hook above except for cp{1EF6,1EF7} |
494 -44: double grave | 1EF6,1EF7
496 -48: preceded by apostrophe (actually only 149)
498 -55: line below | circumflex and hook above
499 -57: palatal hook (actually only 1AB)
500 -58: dot below except for cp{1EA0,1EA1}
501 -59: "retroflex" (without "WITH") | diaeresis below | 1EA0,1EA1
502 -5A: ring below | 1E76,1E77
503 -60: circumflex below except for cp{1E76,1E77} | horn and acute
504 -61: breve below | horn and grave
505 -63: tilde below | 2125
506 -68: D0,F0,182,183 | dot below and dot above | topbar
507 -69: right half ring | horn and tilde
508 -6A: circumflex and dot below
509 -6D: breve and dot below
510 -6E: dot below and macron
511 -95: horn and hook above
514 (for 01-0D and 7B-8A, they are not related to diacritics.)
516 <category BlahBlahNumbers from 0100 to 1000>
517 -38: Arabic-Indic numbers (660-669)
518 -39: extended Arabic-Indic numbers (6F0-6F9)
519 -3A: Devanagari numbers (966-96F)
520 -3B: Bengali numbers (9E6-9EF)
521 -3C: Bengali currency enumerators (9F4-9F9)
522 -3D: Gurmukhi numbers (A66-A6F)
-3E: Gujarati numbers (AE6-AEF)
524 -3F: Oriya digit numbers (B66-B6F)
525 -40: Tamil numbers (BE7-BF2)
526 -41: Telugu numbers (C66-C6F)
527 -42: Kannada numbers (CE6-CEF)
-43: Malayalam numbers (D66-D6F)
529 -44: Thai numbers (E50-E59)
530 -45: Lao numbers (ED0-ED9)
531 <miscellaneous numbers>
532 -47: Roman numbers (2160-2182)
-4E: Hangzhou numbers (3021-3029)
-E0[64]: 2107 (Euler)
536 -E0[87]: some Tone letters (TONE TWO / TONE SIX)
537 -EE: Circled letter-or-digits and katakanas
538 CIRCLED {DIGIT|NUMBER|LATIN|KATAKANA}
539 numbers (2460-2473,2776-2793,24EA)
542 -F3: Parenthesized enumerations
545 PARENTHESIZED {DIGIT|NUMBER|LATIN}
546 -F4: Numbers with dot (2488-249B)
547 {DIGIT|NUMBER} * FULL STOP
550 -258,25C-25E,285,286,29A,297 -> 0E[80-86,88]
551 -27F,2B3-2B6 -> 0E 8A[80-84]
556 -20D0-20E1 -> 01[DD-F0]
557 -483-486 -> 01[94-97]
558 -559,55A -> 01[98,99]
561 -346-348,2BE-2C5,2CE-2CF -> 01[74-7F]
562 -2D1-2D3,2DE,2E4-2E9 -> 01[81-8A]
564 -342,343 -> 01[8D,8E]
-700-780 -> 01[8D-AF]. Maybe there is some kind of traditional
order in Estrangela, but for now I am not sure.
570 -740-742 -> 01[8D-8F]
571 -747,748,732,735,738,739,73C,73F,743-746,730 -> 01[90,91,94-9F]
572 -731,733,734,736,737,73A,73B,73D,73E,749,74A,7A6-7A9
575 -7AA-7B0 -> 01[B0-B6]
577 -591-5C2 except for 5BA,5BE -> 01[03-33] in order
579 No further patterns for >= 80
581 TODO: Below are not done yet:
582 - x < 0x80 in non-"0E" part
583 - 03 <= x <= 0D in "0E" part
584 - 7B <= x <= 7F in "0E" part
586 **** sortkey details by category
588 1 specially ignored ones (Japanese, Tamil, Thai)
590 IdentifyBy: constants
591 Unicode: 3099-309C, BCD, E47, E4C, FF9E, FF9F
592 SortKey: 01 01 01 01 00
594 2 shift weight characters
596 They are either at 01 01 01 01 or 06, depending on StringSort. For
597 convenience, I use 06 to describe them.
599 2.1 control characters (specified as such in Unicode), except for
600 whitespaces (0009-000D).
603 IdentifyBy: UnicodeCategory.Control
604 Unicode: 0001-000F minus 0009-000D, 007F-009F
605 SortKey: 06 03 - 06 3D
609 Unicode: 0027,FF07 (')
610 SortKey: 06 80 (and width insensitive equivalents)
612 2.3 minus sign, hyphen, dash
minus signs: FE63, 207B (super), 208B (sub), 002D, FF0D (full-width)
614 hyphens: 00AD (soft), 2010, 2011 (nonbreaking) ... Unicode HYPHEN?
615 dashes, horizontal bars: FE58 ... UnicodeCategory.DashPunctuation
617 IdentifyBy: UnicodeCategory.DashPunctuation
618 SortKey: 06 81 - 06 90 (and nonspace equivalents)
620 2.4 Arabic spacing and equivalents (64B-652, FE70-FE7F)
621 They are part of nonspacing mark, but not equal.
623 SortKey: 06 A0 - 06 A7 (and nonspace equivalents)
625 3 nonprimary characters, mixed.
627 ModifierSymbol, except for that are not in category 0 and "07" area
628 (i.e. < 128) nor those equivalents
630 NonSpacingMark which is ignorable (IsIgnorableNonSpacing())
631 // 30D, CD5-CD6, ABD, 2B9-2C5, 2C8, 2CB-2CD, 591-5C2. NonSpacingMark in
632 // 981-A3C. A4D, A70, A71, ABC ...
634 TODO: I need more insight to write table generator.
636 SortKey: 01 03 01 - 01 B6 01
638 This part of MS table design is problematic (buggy): \u0592 should
639 not be equal to \u09BC.
I guess this buggy design is because Microsoft first thought that
there wouldn't be more than 255 characters in this area. Or they might
have been aware of the problem but preferred table optimization.
1) We should not mix those codes (make things sequential) and should
expand the level 2 length to 2 bytes. Instead of having a direct value,
we could use an index (pointer) to a zero-terminated level 2 table.
2) Include those characters from minor cultures here.
If in "discriminatory mode", those tables could still be provided
so as to be compatible with Windows.
Additionally there seem to be some bugs around the Modifier letter
collection. For example, 2C6 should be a nonspacing diacritical character
but it is regarded as a primary character. The same applies to Mandarin
tone marks (2C9-2CB) (and there are plenty of such characters).
661 4 space separators and some kind of marks
663 4.1 whitespaces, paragraph separator etc.
664 UnicodeCategory.SpaceSeparator : 20, 3000, A0, 9-D, 2000-200B
666 SortKey : 07 02 - 07 18
668 4.2 some OtherSymbols: 2422-2423
670 SortKey : 07 19 - 07 1A
672 4.3 ASCII compatible marks ('!', '^', ...)
673 Non-alpha-numeric < 0x7F except for [[+-<=>']]
674 small compatibility equivalents -> itself, wide
677 FIXME: how to identify them?
678 some Punctuations: InitialQuote/FinalQuote/Open/Close/Connector
679 some OtherSymbols: 2400-2424
680 3003, 3006, 2D0, 10FB
remaining Punctuations: 9xx, 7xx
684 SortKey : 07 1B - 07 F0
5 mathematical symbols
687 InitialQuotePunctuation and FinalQuotePunctuation in ASCII
688 (not Quotation_Mark property in PropList.txt ; 22, 27)
690 byte area MathSymbol: 2B,3C,3D,3E,AB,B1,BB,D7,F7 except for AC
691 some MathSymbol (2044, 208A, 208C, 207A, 207C)
692 OtherLetter (1C0-1C2)
693 2200-22FF MathSymbol except for 221E (INF. ; regarded as a number)
695 SortKey : 08 02 - 08 F8
697 6 Arrows and Box drawings
698 09 02 .. 09 7C : 2300-237A
699 only primary differences
700 09 BC ... 09 FE : 25A0-AB, 25E7-EB, 25AC-B5, 25EC-EF, 25B6-B9,
701 25BC-C3, 25BA-25BB, 25C4-25D8, 25E6, 25DA-25E5
703 This area contains level 2 values.
704 2190- (non-codepoint order)
705 note that there are many compatibility equivalents
706 2500- except for 266F (#)
708 SortKey : 09 02 - 09 7C, 09 BC 01 03 - 09 BC 01 13,
709 09 {BD|BE|BF} 01 {03|04}, ...
710 TODO: fill the patterns
7 currency symbols and some punctuations
713 byte CurrencySymbols except for 24 ($)
714 byte OtherSymbols (A7-B6)
715 ConnectorPunctuation - 2040 (i.e. FF65, 30FB)
OtherPunct/ConnectorPunct/CurrencySymbol 2020-20AC - 20AC
717 OtherSymbol 3012-303F,3004,327F
718 MathSymbol/OtherSymbol 2600-2767 (math = 266F)
719 OtherSymbol 2440-244A, 2117
720 20AC (CurrencySymbol)
SortKey : 0A 02 - 0A FB
725 all DecimalDigitNumber, LetterNumber, non-CJK OtherNumber.
727 digits, in numeric order. We can use NET_2_0 CharUnicodeInfo.
730 SortKey : 0C 02 (9F8), 0C 03 - 0C E1 (normal numbers), 0C FF (INF.)
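
As a sketch of the "numeric order" idea, digits from different scripts can be ordered with CharUnicodeInfo.GetNumericValue(). The helper below is hypothetical and only sorts a handful of characters; it says nothing about the actual weights in category 0C.

```csharp
using System;
using System.Globalization;

public static class DigitOrderSketch
{
	// Returns a copy of the input sorted by Unicode numeric value.
	public static char [] SortByNumericValue (char [] digits)
	{
		char [] result = (char []) digits.Clone ();
		Array.Sort (result, delegate (char x, char y) {
			return CharUnicodeInfo.GetNumericValue (x)
				.CompareTo (CharUnicodeInfo.GetNumericValue (y));
		});
		return result;
	}

	public static void Main ()
	{
		// Arabic-Indic three, ASCII one, Devanagari two
		char [] sorted = SortByNumericValue (new char [] { '\u0663', '1', '\u0968' });
		foreach (char c in sorted)
			Console.Write ("{0:X4} ", (int) c);
		Console.WriteLine ();
	}
}
```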
732 9 (E) latin letters (alphabets), mixing alphabetical symbols
733 Alphabets, A to Z, mixing alphabetical symbols. See below.
734 F8-2B8 except for (1BB-1BD and 1C0-1C3), but not sequential.
737 For diacritical orders, see level 2.
739 For 'A' it is "0E 02", for 'B' "0E 09" ... 'Z' "0E A9", ezh "0E AA".
740 0E B3 (1BE), 0E B4 (298)
742 There are CJK compatibility characters (3800-) and letterlike
743 symbols (2100-) in those A-to-Z area, ordered by character name.
745 Primary weights are sometimes culture-dependent.
746 FIXME: [0E 0D], [0E 0E], [0E 4B], [0E 75], [0E B2] are unknown
753 0B: 10D in hr|lt|lv|pl, 107 in pl
754 0C: C7 in az|tr, 10D in cs|sk, 106 in hr
760 1E: 110 (D with stroke) in hr
763 22: 18F=259 in az, E9 in is, 119 in pl, EA in vi, 1EBE-1EC7 in vi
767 26: 11F in az|tr, 123 in lv
770 2D: 267 (Heng with hook)
771 2E: 33CB in az, 33CA in tr
774 33: CD in is, 79 in lt
789 73: F1 in es, 1CC in hr
793 7D: F6 in az|hu|tr, 151 in hu, F3 in is|pl, F4 in sk|vi, 1ED0-1ED9 in vi
806 96: 17F (LATIN SMALL LONG S)
807 97: 15F in az|tr, 161 in cs|hr|lt|lv|sk|sl, 7A,179-17C in et, 15B in pl
808 98: 17E in et, 15F in ro, 15B in sl
816 A0: FA in is, 1B0,1EE8-1EF1 in vi
817 A1: FC in az|tr, 56,57 in et, FC,171 in hu, FB in vi
827 AB: DE in is, 17E in lt|lv, 17A in pl
828 AC: E6 in da|is, 1E3 in is, 17C in pl, 17E in sl
829 AD: 17E in cs|hr|sk, E5 in fi, F6,F8 in is 17A in sl
833 B1: E5 in da, "aa" in da
837 10 culture dependent letters (general)
838 0F: 386-3F2 ... Greek and Coptic
839 386-3CF: [0F 02] - [0F 19] (consider primary equivalents)
840 3D0-3EF: [0F 40] - [0F 54]
841 10: 400-4E9 ... Cyrillic.
842 For 400-45F and 4B1, they are mostly UCA DUCET order.
843 After that 460-481 follows, by codepoint.
844 (490-4FF except for 4B1 and Cyrillic supplementary are unused.)
845 11: 531-586 ... Armenian.
846 Simply sorted by codepoint (handle case).
847 12: 5D0-5F2 ... Hebrew.
848 Codepoint order (handle case).
849 13: 621-6D5 plus 670 (NonSpacingMark) ... Arabic
851 They look like ordered by Arabic Presentation Form B except
852 for FE95, and considering diacritical equivalents maybe based
853 on the primary character area (621-6D5).
854 There are still some special characters: 67E,686,698,6AF ...
855 which might not have equivalent characters (I wonder how they
856 are inserted into the presentation form B map).
859 - hamza, waw, yeh (621,624,626) are special: [13 07]
860 - For all remaining letters, get primary letter name
861 and store it into dictionary. If unique, then
862 increment index by 4 from [13 0B]
864 674-6D5 : by codepoint from [13 84].
865 14: 901-963 exc. 93A-93D 950-954 ... Devanagari.
866 For <905 codepoint order, x2 from [14 04].
867 For 905-939 codepoint order, x4 from [14 0B].
868 For 93E-94D codepoint order, x2 from [14 DA].
869 15: 982-9FA ... Bengali. Actually all UnicodeCategories except for
870 NonSpacingMark, DecimalDigitNumber and OtherNumber.
871 For <9E0 simple codepoint order from [15 02].
872 For >9E0 simple codepoint order from [15 3B].
873 16: A05-A74 exc. A3C A4D A66-A71 ... Gurmukhi.
874 The same as UCA order, x4 from [16 04].
875 17: A81-AE0 exc. ABC-ABD ... Gujarati.
876 Mostly equivalent to UCA, but insert {AB3,A81-A83} before AB9,
878 18: B00-B70 ... Oriya
879 All but NonSpacingMark and DecimalDigitNumber, by codepoint.
880 19: B80-BFF ... Tamil
881 BD7 is special : [19 02].
882 B82-B93 (vowels) : x2 from [19 0A].
883 B94 (vowel AU) : [19 24]
884 For consonant order Windows has native Tamil order which is
886 http://www.nationmaster.com/encyclopedia/Tamil-alphabet
887 (The order is still different in "Grantha" order from TAM.)
888 So, we should just hold constant array for consonants.
And put them in order, x4 from [19 26].
890 BBE-BCC : SpacingCombiningMark and BC0 ... x2 from [19 82].
891 1A: C00-C61 ... Telugu.
892 C55 and C56 are ignored (C5x line and remaining part of C6x
893 line just look like ignored).
894 C60 and C61 are specially placed. C60 after C0B, C61 after C0C.
895 Except for above, by codepoint, x3 from [1A 04].
896 1B: C80-CE5 ... Kannada.
897 CD5,CD6 (and CE6-CEF: DecimalDigitNumber) are ignored.
898 by codepoint, 3x from [1B 04].
899 1C: D02-D40 ... Malayalam.
900 by simple codepoint from [1C 02].
901 (1D: Sinhala ... totally ignored?)
902 1E: E00-E44 ... Thai.
903 preceding vowels (E40-E44) by codepoint [1E 02 - 1E 06]
904 consonants (E01-E2A) by codepoint, x6 from [1E 07].
905 1F: E2B-E5B,E80-EDF ... Thai / Lao. (Thai breaks the category wall.)
907 remaining consonants (E2B-E2E) by codepoint, x6 from [1E 07].
908 remaining vowels (E2F-E3A) by codepoint.
909 E45,E46,E4E,E4F,E5A,E5B
911 E80-EDF by codepoint from [1F 02].
912 21: 10A0-10FF ... Georgian
913 Mostly equal to UCA order, but swap 10E3 <-> 10F3,
916 11 (22) japanese kana letters and symbols, not in codepoint order
918 For single character, the sortkeys look like:
919 - Katakana normal A, Half Width (FF71) : FF 02 C4 FF C4 FF 01 00
920 - Katakana normal A, Full Width (30A2) : FF C4 FF 01 00
921 - Hiragana normal A, Full Width (3042) : FF FF 01 00
923 Actually for level 4 weights, there is a different rule (see
924 "level 4" format above).
There is also 32D0 (normal katakana A with circle) that has a
diacritic difference.
929 For primary weights, 'A' to 'O' are mapped to 22-26, 'Ka' to 'Ko'
930 are to 2A-2E, 'Sa' to 'So' are to 32-36 ... and follows.
931 'Nn' is special: [22 80].
933 After Kana characters, there are CJK compat characters.
934 From 22 97 01 01 01 01 00 (3349) to 22 A6 01 01 01 01 00 (333B) are
935 sorted in JIS table order (CP932.TXT). Remaining square characters
936 are maybe sorted in Alphabetic order.
938 UCA DUCET also does not apply here.
940 12 (23) bopomofo letters
941 3105-312C: simple codepoint order from [23 02].
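
Rules of the form "by codepoint, xN from [cat ww]" used throughout this list boil down to simple arithmetic. Here is a sketch of that reading, assuming the range is contiguous (no skipped codepoints), which does not hold for every script listed above. The helper name is hypothetical.

```csharp
using System;

public static class CodepointOrderWeight
{
	// level 1 weight = start weight + step * (distance from the first codepoint)
	public static int Level1 (int cp, int first, int startWeight, int step)
	{
		return startWeight + step * (cp - first);
	}

	public static void Main ()
	{
		// Bopomofo: simple codepoint order from [23 02]
		Console.WriteLine ("23 {0:X2}", Level1 (0x3105, 0x3105, 0x02, 1));
		Console.WriteLine ("23 {0:X2}", Level1 (0x312C, 0x3105, 0x02, 1));
		// Thai consonants: x6 from [1E 07]; E2A still fits into one byte
		Console.WriteLine ("1E {0:X2}", Level1 (0x0E2A, 0x0E01, 0x07, 6));
	}
}
```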
943 13 culture dependent letters 2
944 710-72C : Estrangela (ancient Syriac).
946 711 is excluded (superscript).
947 714,716,71C,724 and 727 are "alternative" characters.
SortKey: [24 0B]-[24 60], stepping by x, where x is 2 for those
which are "alternative" as defined above, otherwise 4.
Equal to UCA order, x2 from [24 6E].
954 (Maybe we should add remaining minor-culture characters here. Tibetan,
Limbu, Tagalog, Hanunoo, Buhid, Tagbanwa, Myanmar, Khmer, Tai-Le,
956 Mongolian, Cherokee, Canadian-Aboriginal, Ogham, Runic are ignored)
958 14 (41-45) surrogate Pt.1
960 15 (52 02-7E C8) hangul, mixing combined ones
It starts from 1100. After width-insensitive equivalents, those
syllables (from AC00) follow (until <del>AE4B</del>D7A3).
The order seems to be based on some formula (though sometimes it does
not look so, e.g. 1117). FIXME: this area should be clarified more.
Hangul Syllables should not be filled in the table. Instead, they
can be easily computed by the following formula:
970 // rc is the codepoint for the input Syllable
971 // (p holds "category << 8 + level1weight")
972 int ri = ((int) rc - 0xAC00) + 1;
974 ((ri / 254) * 256 + (ri % 254) + 2);
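
A runnable sketch of the formula above, returning only the relative value (how it is combined with the category and starting weight is not shown here and is left out on purpose). The class name is hypothetical. Note that the low byte is always ri % 254 + 2, so it can never be 00 or 01, which are reserved in the sort key format.

```csharp
using System;

public static class HangulSyllableWeight
{
	// Relative two-byte value for a Hangul syllable (AC00-D7A3).
	public static int Compute (char rc)
	{
		if (rc < '\uAC00' || rc > '\uD7A3')
			throw new ArgumentOutOfRangeException ("rc");
		int ri = ((int) rc - 0xAC00) + 1;
		return (ri / 254) * 256 + (ri % 254) + 2;
	}

	public static void Main ()
	{
		Console.WriteLine ("{0:X4}", Compute ('\uAC00'));
		Console.WriteLine ("{0:X4}", Compute ('\uD7A3'));
	}
}
```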
Hangul Jamo cannot be filled in the table directly, since
U+1113 - U+1159 hold additional primary key bytes.
FIXME: find out how they can be computed.
979 See http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icuhtml/design/collation/ICU_collation_design.htm?rev=HEAD&content-type=text/html#Hangul_Implicit_CEs
983 9E 02-F0 B4 [3192-319F,3220-3243,3280-32B0,4E00-9FA5] : CJK mark,
984 parenthesized CJK (part), circled CJK (part), CJK ideograph.
Ordered but considering compatible characters (i.e. there is
986 no other way than having massive mapping).
987 F0 B5-F1 E4 [F900-FA2D]. CJK compatibility ideograph.
989 LAMESPEC: in the latest spec CJK ends at 9F BB. Since MS table
990 joins these two categories without any consideration, it is
991 impossible to insert those new characters without breaking binary
994 17 (E5 02-FE 33) PrivateUse.
In fact it overlaps with CJK characters (maybe layout design failure).
998 18 (F2 01-F2 31) surrogate Pt.2
In fact it overlaps with PrivateUse (maybe layout design failure).
1002 19 (FE FF 10 02 - FE FF 29 E9) CJK extensions
They should be computed, since this range has to be checked anyway
(the sortkey values cannot be acquired directly because they need the
FE FF part), and the values can be computed anyway.
1010 20 (FF FF 01 01 01 01 00) special.
1011 Japanese extender marks:
1012 3005, 3031, 3032, 309D, 309E, 30FC, 30FD, 30FE, FF70
1014 LAMESPEC: Microsoft's understanding of Japanese 3031 and 3032 is
1015 wrong in native usage. They are never used to repeat *just the
1016 previous* character, but are usually used to repeat two or
1017 more characters. Also, 3005 is not always used to repeat exactly
1018 one character but sometimes used to repeat two (or possibly more)
1021 Arabic shadda: FE7C (isolated), FE7D (medial)
1022 (Actually they are not extenders in Unicode PropList.txt)
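The special characters listed above could be detected with a small lookup. This is a minimal sketch: the class and method names are hypothetical, and only the code points come from the lists above.

```csharp
using System;

// Hypothetical helper; the code points are the ones listed above.
public static class ExtenderMarks
{
	static readonly char [] japaneseExtenders = {
		'\u3005', '\u3031', '\u3032', '\u309D', '\u309E',
		'\u30FC', '\u30FD', '\u30FE', '\uFF70'
	};

	// True for the Japanese extender marks.
	public static bool IsJapaneseExtender (char c)
	{
		return Array.IndexOf (japaneseExtenders, c) >= 0;
	}

	// True for the isolated and medial forms of Arabic shadda.
	public static bool IsArabicShadda (char c)
	{
		return c == '\uFE7C' || c == '\uFE7D';
	}
}
```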
1025 - by UnicodeCategory -
1027 DashPunctuation          6 (no exception)
1028 DecimalDigitNumber       C (no exception)
1029 EnclosingMark            1 E (no exception)
1031 LetterNumber             C (no exception)
1032 LineSeparator            7 (only 2028)
1033 ParagraphSeparator       7 (only 2029)
1035 SpaceSeparator           7 (no exception)
1038 OtherNumber              C(<3192), 9E-A7 (3124<)
1040 Control                  6 except for 9-D (7)
1041 FinalQuotePunctuation    7 except for BB (8)
1042 InitialQuotePunctuation  7 except for AB (8)
1043 ClosePunctuation         7 except for 232A (9)
1044 OpenPunctuation          7 except for 2329 (9)
1045 ConnectorPunctuation     7 except for FF65, 30FB, 2040 (A)
1047 OtherLetter              1, 7, 8 (1C0-1C2), C, 12-FF
1048 MathSymbol               8, 9, 6, 7, A, C
1049 OtherSymbol              7, 9, A, C, E, F, <22, 52<
1050 CurrencySymbol           A except for FF69,24,FF04 (7) and 9F2,9F3 (15)
1052 LowercaseLetter          E-11 except for B5 (A) and 1BD (C)
1053 TitlecaseLetter          E (no exception)
1054 UppercaseLetter          E,F,10,11,21 except for 1BC (C)
1055 ModifierLetter           1, 7, E, 1F, FF
1056 ModifierSymbol           1, 6, 7
1057 NonSpacingMark           1, 6, 13-1F
1058 OtherPunctuation         1, 7, A, 1F
1059 SpacingCombiningMark     1, 14-22
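For the rows marked "(no exception)" or limited to a single code point, the category can be read straight off the UnicodeCategory. The sketch below covers only those rows. The class and method names are hypothetical, and the return values are the hexadecimal category numbers from the table above.

```csharp
using System;
using System.Globalization;

// Hypothetical helper covering only the unambiguous rows of the table.
public static class CategoryGuess
{
	// Returns the category for unambiguous UnicodeCategory values,
	// or -1 when the real table has to be consulted.
	public static int FromUnicodeCategory (char c)
	{
		switch (char.GetUnicodeCategory (c)) {
		case UnicodeCategory.DashPunctuation:
			return 0x6;
		case UnicodeCategory.SpaceSeparator:
		case UnicodeCategory.LineSeparator:
		case UnicodeCategory.ParagraphSeparator:
			return 0x7;
		case UnicodeCategory.DecimalDigitNumber:
		case UnicodeCategory.LetterNumber:
			return 0xC;
		case UnicodeCategory.TitlecaseLetter:
			return 0xE;
		default:
			return -1; // rows with exceptions or several categories
		}
	}
}
```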
1061 *** Culture dependent design
1063 (To verify this section, run the simple dumper code shown above
1064 with all the supported cultures.)
1066 **** primary cultures and non-primary cultures
1068 This code runs the sort key dumper shown above once for each
1069 culture and saves each character dump to a file.
	// Needs System.Diagnostics, System.Globalization, System.IO
	// and System.Text.
	public static void Main ()
	{
		foreach (CultureInfo ci in CultureInfo.GetCultures (
			CultureTypes.AllCultures)) {
			ProcessStartInfo psi = new ProcessStartInfo ();
			psi.FileName = "../allsortkey.exe";
			psi.Arguments = ci.Name;
			psi.RedirectStandardOutput = true;
			psi.UseShellExecute = false;
			Process p = new Process ();
			p.StartInfo = psi;
			p.Start ();
			// Read the dumper output and save it as <culture>.txt
			string s = p.StandardOutput.ReadToEnd ();
			StreamWriter sw = new StreamWriter (ci.Name + ".txt", false, Encoding.UTF8);
			sw.Write (s);
			sw.Close ();
		}
	}
1090 For each sub culture (that has a parent culture), its collation
1091 mapping is identical to that of its parent, except for az-AZ-Cyrl.
1095 - zh-CHS = zh-CN = zh-SG = zh-MO : pronunciation
1096 - zh-TW = zh-HK = zh-CHT : stroke count
1101 (UCA implies that there are some cultures that sort alphabets from
1102 large to small, but as far as I can see there is no such CultureInfo.)
1104 **** Latin characters and NonSpacingMark order tailorings
1106 div : FDF2 is 24 83 01 01 01 01 00 (only 1 difference)
1107 syr : some NonSpacingMarks are totally ignorable.
1108 tt,kk,mk,az-AZ-Cyrl,uk : Cyrillic difference
1109 az,et,lt,lv,sl,tr,sv,ro,pl,no,is,hu,fi,es,da : Latin difference
1111 sk,hr,cs : Latin and NonSpacingMark differences
1115 **** CJK character order tailorings
1119 There are five different CJK orderings:
1120 default, ko(-KR), ja(-JP), zh-CHS and zh-CHT
1121 Each of them has a very different CJK mapping.
1123 Since they seem to be based on traditional encodings, we will likely
1124 provide separate constant tables and switch depending on the culture.
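Switching tables by culture could start from a selector like the one below. This is only a sketch: the enum and method names are hypothetical, and the zh-* grouping follows the equivalences listed earlier in this section (zh-CHS = zh-CN = zh-SG = zh-MO, zh-TW = zh-HK = zh-CHT).

```csharp
using System;

// Hypothetical names for the five CJK orderings.
public enum CjkOrdering { Default, Ko, Ja, ZhCHS, ZhCHT }

public static class CjkOrderingSelector
{
	// Picks the CJK table from a culture name such as "ja-JP".
	public static CjkOrdering FromCultureName (string name)
	{
		switch (name) {
		case "zh-CHS": case "zh-CN": case "zh-SG": case "zh-MO":
			return CjkOrdering.ZhCHS;
		case "zh-CHT": case "zh-TW": case "zh-HK":
			return CjkOrdering.ZhCHT;
		}
		if (name == "ko" || name.StartsWith ("ko-", StringComparison.Ordinal))
			return CjkOrdering.Ko;
		if (name == "ja" || name.StartsWith ("ja-", StringComparison.Ordinal))
			return CjkOrdering.Ja;
		return CjkOrdering.Default;
	}
}
```

Each enum value would then index one constant table, so a culture never loads a table it does not use.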
1126 <what characters are different from the invariant culture?>
1128 ko : CJK layout difference (52 -> 80)
1129 ja,zh-CHS,zh-TW : dash (5C), CJK layout difference.
1131 Target characters are : CJK misc (3190-), Parenthesized CJK
1132 (3200-), CJK compat (3300-), CJK ideographs (4E00-),
1133 CJK compat ideograph (F900-), Half/Full width compat (FF00-)
1135 Additionally for Korean: Jamo (1100-), Hangul syllables (AC00-)
1137 <how are they composed?>
1139 Japanese CJK order looks based on JIS table order. Those characters
1140 which are also in JIS table are moved to 80 xx. Those which are *not*
1141 in JIS table are left as is (9E-FE).
1143 Additionally, Windows has different order for characters below:
1144 4EDD,337B,337E,337D,337D,337C
1145 They come in front of the first CJK character.
1147 Maybe Korean CJK order respects KS C 5619. Note that Korean mixes
1148 Hangul and CJK in its order, so it is not a flat order without indexes
1149 (thus, for CJK they are not computable). Also, there are extra
1150 level 2 values for the Korean CJK map.
1152 For some Chinese cultures such as zh-CHS, character order is based on
1153 pinyin. For the remaining ones such as zh-TW, it is based on stroke count.
1155 CLDR of unicode.org has reference ordering of those characters, so
1156 our collation table extracts the sorting order from it.
1157 http://www.unicode.org/cldr/
1159 **** Accent evaluation order
1161 With French cultures, diacritical marks are evaluated in reverse order.
1162 French ordering affects only some diacritics (the Japanese
1163 voice mark is not affected - FIXME: I doubt it, because the algorithm
1164 does not seem to allow it).
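The effect of reverse accent evaluation can be sketched as follows. This is an illustration only: the class and method names are hypothetical, the level 2 weights are made-up values, and the inputs are assumed to be equal-length weight arrays for strings that already match at level 1.

```csharp
using System;

// Hypothetical illustration of forward vs. reverse accent comparison.
public static class AccentOrder
{
	// Compares two equal-length arrays of level 2 weights, scanning
	// from the end when frenchSort is true.
	public static int CompareLevel2 (byte [] a, byte [] b, bool frenchSort)
	{
		if (a.Length != b.Length)
			throw new ArgumentException ("arrays must have equal length");
		for (int i = 0; i < a.Length; i++) {
			int idx = frenchSort ? a.Length - 1 - i : i;
			if (a [idx] != b [idx])
				return a [idx] < b [idx] ? -1 : 1;
		}
		return 0;
	}
}
```

Taking "coté" as { 0, 0, 0, 1 } and "côte" as { 0, 2, 0, 0 }, the forward scan puts "coté" first because it has no accent at index 1, while the reverse scan puts "côte" first because it has no accent at the last index.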
1166 Some other cultures might also have different ones, but not obvious.
1169 ** Mono implementation plans
1173 CompareInfo contains many overloaded methods that are just for
1174 convenience. This class contains almost only required members.
1176 This class also provides access to tailoring information which is
1177 culture instance dependent:
1180 - contractions/expansions - returns contraction or expansion
1181 - diacritical remapping
1182 - CJK custom mapping
1184 For data area, see CollationDataStructures.txt for now.
1186 *** UnicodeTable (for now MSCompatUnicodeTable)
1188 Provides access to character information related to
1189 the collation element table (of our own).
1190 FIXME: I want to fix some bugs in Windows collation table especially
1191 to not ignore some characters, but it requires table modification
1192 which results in further memory allocation. Maybe it would be done
1193 as a patch for the runtime (or classlib) sources.
1195 - ignorable, ignorable nonspace, normalize width, normalize kanatype
1196 - level 4 sortkey provision method(s)
1198 **** character comparison
1200 Since a composite character may *not have* an equivalent
1201 codepoint, character comparison cannot be done just by comparing
1202 the "resulting char" value.
1203 Conversely, since a composite character may well *have* an
1204 equivalent codepoint, character comparison cannot be done just
1205 by comparing the "source char" value either.
1207 ***** future optimizations
1209 From where those codepoints differ, for each string it adjusts the
1210 position so that it represents exactly one character element. That is,
1211 it finds the primary character as the start of the range and the last
1212 nonprimary character as the end of the range.
1214 Once Compare() has adjusted the character location to a valid
1215 comparison position, further comparison is done as a usual comparison,
1216 i.e. sortkey comparison considering comparisonLevel.
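The position adjustment described above might look like the sketch below. All names are hypothetical. As a simplifying assumption it treats NonSpacingMark characters as the only "nonprimary" ones, whereas the real check would consult the collation table.

```csharp
using System;
using System.Globalization;

// Hypothetical sketch of the "adjust to character element" step.
public static class ElementBoundary
{
	// Assumption: only nonspacing marks count as nonprimary.
	static bool IsNonPrimary (char c)
	{
		return char.GetUnicodeCategory (c) == UnicodeCategory.NonSpacingMark;
	}

	// Index of the first differing char, or the shorter length.
	public static int FirstDifference (string s1, string s2)
	{
		int len = Math.Min (s1.Length, s2.Length);
		for (int i = 0; i < len; i++)
			if (s1 [i] != s2 [i])
				return i;
		return len;
	}

	// Moves back to the primary character that starts the element.
	public static int ElementStart (string s, int index)
	{
		while (index > 0 && index < s.Length && IsNonPrimary (s [index]))
			index--;
		return index;
	}

	// Index of the last nonprimary character of the element.
	public static int ElementEnd (string s, int start)
	{
		int end = start;
		while (end + 1 < s.Length && IsNonPrimary (s [end + 1]))
			end++;
		return end;
	}
}
```

For "e\u0301x" and "e\u0300x" the first difference is at index 1, a combining mark, so the element to compare is the range 0 to 1 in both strings.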
1218 **** Characters in the table / characters computed
1220 Currently I plan to leave the following characters out of the table
1221 and compute them on demand:
1226 **** CJK Unified Ideographs
1228 For CJK unified ideographs, I had to make those culture-dependent
1229 tables in memory. Since they came from some classical encodings, they
1230 are not computed. Thus, they are in a separate table.
1232 **** Level 4: Kana type
1234 The table does not contain level 4 (kanatype) properties for
1235 any character. They can simply be computed.
1237 **** Level 3: Case/Width properties
1239 Case properties will be stored as a byte array, with limited areas of
1240 codepoint (cp < 3120 || FE00 < cp).
1242 For Hangul characters, it will be computed by codepoint areas.
1244 **** Level 2: Diacritical properties
1246 The table will be composed of one byte per character. If we provide
1247 a non-buggy mode (Windows is buggy here by design; it just sums
1248 secondary weight values up), the values will come from UCA and
1249 a non-blocking check will be introduced.
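The summing behavior mentioned above can be illustrated with a few lines. The class name and the weight values are made up for this sketch, and it says nothing about how Windows treats sums that exceed one byte.

```csharp
using System;

// Hypothetical illustration of summing level 2 weights.
public static class Level2Sum
{
	// Adds up the secondary weights of all marks on one character.
	public static int Sum (params byte [] weights)
	{
		int sum = 0;
		foreach (byte w in weights)
			sum += w;
		return sum;
	}
}
```

With made-up weights, marks weighted 2 and 3 on one base character sum to 5, the same value as a single mark weighted 5, so distinct mark sequences can become indistinguishable at level 2.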
1251 Note that Japanese voice marks are considered at level 2 but no need to
1255 ** Reference materials
1257 Developing International Software for Windows 95 and Windows NT
1258 Appendix D Sort Order for Selected Languages
1259 http://www.microsoft.com/globaldev/dis_v1/disv1.asp?DID=dis33d&File=S24BF.asp
1261 UTR#10 Unicode Collation Algorithm (It is still informative)
1262 http://www.unicode.org/reports/tr10/
1264 UAX#15 Unicode Normalization
1265 http://www.unicode.org/reports/tr15/
1266 especially its canonical/compatibility equivalent characters might
1267 be informative to get those equivalent characters.
1269 To know which characters can be expanded, the Unicode Character
1270 Database (UCD) is useful (informative but not normative to us)
1271 http://www.unicode.org/Public/UNIDATA/UCD.html
1273 A decent char-by-char explanation is available here:
1274 http://www.fileformat.info/info/unicode/
1276 Wine uses the UCA default element table, but has Windows-like character
1277 filtering support in its LCMapString implementation:
1278 http://cvs.winehq.com/cvsweb/wine/dlls/kernel/locale.c
1279 http://cvs.winehq.com/cvsweb/wine/libs/unicode/sortkey.c
1281 Mimer has decent materials on culture specific collations:
1282 http://developer.mimer.com/collations/
1284 This is written in Japanese, but awesome analysis on MS Access
1286 http://www.asahi-net.or.jp/~ez3k-msym/comp/acccoll.htm