mcs/class/corlib/Mono.Globalization.Unicode/Collation-notes.txt

   1 * String collation
   2
   3 ** Summary
   4
   5         We are going to implement Windows-like collation rather than follow
   6         ICU (which implements UCA, the Unicode Collation Algorithm).
   7
   8
   9 ** Tasks
  10
  11         * create collation element table(s)
  12                 - infer how Windows collation element table is composed
  13                   : mostly analyzed.
  14                 - write table generator source(s)
  15                   : mostly implemented. Need to fill 50 blank characters
  16                     which should not be blank, and fix 1100 lines of
  17                     different mappings.
  18                 - culture-specific sortkey data
  19                   : Some are written in mono-tailoring-source.txt. They
  20                     come from dumped diffs (shown later) and UCA tailorings
  21                     (via create-tailorings.exe).
  22         * implement collation methods
  23                 - All methods are implemented to a practically usable level.
  24                 - Except for GetSortKey(), Compare() and IsPrefix(), they do
  25                   not handle Japanese and Arabic extenders.
  26         * fix design (FIXMEs)
  27                 - Currently tailored sortkeys do not honor CompareOptions,
  28                   although the flags are significant there.
  29
  30
  31 ** How to implement CompareInfo members
  32
  33         GetSortKey() : done
  34                 Compute the sort key of every character element into a byte[].
  35         Compare() : done
  36                 Find first difference and compare it.
  37                 "Larger/smaller" matters (beyond "different").
  38         IsPrefix()
  39                 It calls CompareInternal() which also answers if the target
  40                 is fully consumed, so it just returns true if it says that
  41                 the target is fully consumed.
  42         IsSuffix()
  43                 It tries CompareInternal() to compare source and target at
  44                 the end, where source varies from minimum tail to the
  45                 original args.
  46         IndexOf(), LastIndexOf()
  47                 For character search, it scans toward the end (or start) of
  48                 the source string for the matching character element.
  49                 For string search, it invokes one of private IndexOf() (or
  50                 LastIndexOf()) overload passing the first character element
  51                 of the target, and if found, tests if the sequence is a valid
  52                 start point, using IsPrefix() (or IsSuffix()).
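
             The string search described above can be sketched with the public
             CompareInfo API. This is only an illustration, not the Mono
             implementation: the class and method names are hypothetical, and it
             uses the first char of the target as a stand-in for the "first
             character element".

```csharp
using System.Globalization;

static class IndexOfSketch
{
	// Hypothetical helper: find the first char of target, then confirm the
	// candidate position with IsPrefix ().
	public static int IndexOf (CompareInfo ci, string source, string target,
		CompareOptions options)
	{
		if (target.Length == 0)
			return 0;
		int start = 0;
		while (start < source.Length) {
			int i = ci.IndexOf (source, target [0], start, options);
			if (i < 0)
				return -1; // the first element does not occur again
			if (ci.IsPrefix (source.Substring (i), target, options))
				return i; // valid start point
			start = i + 1; // false candidate; resume after it
		}
		return -1;
	}
}
```

             With CompareOptions.Ordinal, searching "abcabd" for "abd" rejects
             the candidate at index 0 and returns 3. LastIndexOf() would mirror
             this with IsSuffix().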
  53
  54 *** future Optimization
  55
  56         For GetSortKey(), Compare(), IsPrefix() and IndexOf(), it uses forward
  57         iteration, which moves forward and does not stop until it either finds
  58         the next "primary" character or reaches the end of the string, checking
  59         HasContractHead(char) for composites.
  60
  61         For IsSuffix() and LastIndexOf(), it uses backward iteration, which
  62         moves backward and does not stop until it either finds a "primary"
  63         character or reaches the beginning of the string, checking
  64         HasContractTail(char) for composites.
  65
  66         Porting them to C code is an alternative possible approach.
  67
  68 ** How to support CompareOptions
  69
  70         There are two kinds of "ignorance": strippers' ignorance and
  71         normalizers' ignorance.
  72
  73         The strippers will "filter characters out" and there will be no
  74         corresponding character elements in SortKey binaries.
  75
  76         Normalizers, on the other hand, map a character onto its counterpart,
  77         so the result is still distinguished from unrelated characters.
  78         For example, with IgnoreKanaType Hiragana "A" and Katakana "A" are
  79         not distinguished, but Hiragana "A" and Hiragana "I" are.
  80
  81         Actually, even without any IgnoreXXX flags (i.e. "None"), there are
  82         many characters that are ignored ("completely ignorable").
  83
  84         Except for LCID 101/1125(div), '\ufdf2' is completely ignorable.
  85         This rule even applies to CompareOptions.None.
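
             A minimal sketch of the two kinds of ignorance, assuming a stripper
             that knows only U+FDF2 and a kana-type normalizer that folds
             Katakana onto Hiragana by code point offset. The class and method
             names are hypothetical; the real IsIgnorableXXX() and
             ToKanaTypeInsensitive() tables are much larger.

```csharp
using System.Text;

static class IgnoranceSketch
{
	// Stripper: drop the character entirely, so no sortkey element is left.
	public static string Strip (string s)
	{
		StringBuilder sb = new StringBuilder ();
		foreach (char c in s)
			if (c != '\uFDF2')
				sb.Append (c);
		return sb.ToString ();
	}

	// Normalizer: keep the character but fold Katakana (U+30A1-U+30F6)
	// onto Hiragana (U+3041-U+3096), which sit 0x60 code points lower.
	public static string FoldKanaType (string s)
	{
		StringBuilder sb = new StringBuilder ();
		foreach (char c in s)
			sb.Append (c >= '\u30A1' && c <= '\u30F6' ? (char) (c - 0x60) : c);
		return sb.ToString ();
	}
}
```

             After folding, Katakana "A" (\u30A2) equals Hiragana "A" (\u3042),
             while Hiragana "A" and Hiragana "I" (\u3044) remain distinct.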
  86
  87 *** Normalizers
  88
  89         IgnoreCase
  90                 Maybe culture-dependent TextInfo.ToLower() could be used.
  91
  92                 Unlike ICU (specialCaseToLower()), even with tr-TR(LCID 31)
  93                 and IgnoreCase, I\u0307 is not regarded as equal to i.
  94
  95         IgnoreKanaType
  96                 ToKanaTypeInsensitive(). Note that this does not cover the
  97                 whole "level 4" differences described later.
  98
  99         IgnoreWidth
 100                 ToWidthInsensitive(), which is likely to be culture
 101                 independent. See also "Notes".
 102
 103         IgnoreNonSpace (see also Strippers; this flag works in both sides)
 104                 For some cultures this logic is still incomplete. All culture-
 105                 dependent collators must handle valid "replacement" of "one or
 106                 more characters" which might be related to specific
 107                 CompareOptions.
 108                 For example, there is a Japanese text sorting rule that
 109                 nevertheless applies to InvariantCulture. Concretely,
 110                 "\u3042\u30FC" is equivalent to "\u3042\u3042" only when
 111                 IgnoreNonSpace is specified.
 112
 113                 I'll take those items from CLDR (those items which have
 114                 <reset before="..." />), case by case though.
 115
 116 *** Strippers
 117
 118         I already wrote all the required strippers which should be MS
 119         compatible (at least with .NET 1.1 invariant culture).
 120
 121         IgnoreNonSpace
 122                 IsIgnorableNonSpacing().
 123                 Some Diacritic characters are covered by this flag.
 124
 125                 There are some culture *dependent* characters:
 126                         LCID 90/1114(syr) : 64b, 652, 670
 127
 128         IgnoreSymbols
 129                 IsIgnorableSymbol().
 130                 UnicodeCategory does not work here.
 131
 132                 There are some culture *dependent* characters:
 133                         LCID 17/1041(ja) : 2015
 134                         LCID 90/1114(syr) : 64b, 652
 135
 136 *** StringSort
 137
 138         See "sort order categories" section.
 139
 140 ** ICU and UCA
 141
 142         First to note: we won't use collation element table from unicode.org.
 143
 144         There are many differences between ICU/UCA and Windows although they
 145         look so similar: collation keys in different levels, culture
 146         dependent composition, etc. Historically, Windows collation was
 147         designed before UCA was specified, so basically Windows is obsolete
 148         in this area.
 149
 150         - Logic: Unlike UCA it has no concept of "blocked" combining marks,
 151           and combining marks are never considered as an independent character
 152           (thus combining in Windows is buggy).
 153         - Data: Windows is based on old Unicode standard (even older than 1.1).
 154           It ignores minor cultures. Character property values differ as well
 155           as those from the default Unicode collation element table (DUCET).
 156           In a few cultures Windows collation is close to the native language
 157           (e.g. Tamil, while it does not conform to TAM).
 158
 159         IgnoreWidth/IgnoreSymbols is processed after Kana voice mark
 160         decomposition (something like NFD, but not equivalent. Example: \u304C
 161         is completely equivalent to \u304B\u309B, which is not part of NFKD).
 162         <del>This means, if there is a combined Kana characters, it will be
 163         first decomposed and then compared.</del> It scarcely matters since
 164         there are special weight data for Japanese.
 165
 166 *** Microsoft design problem
 167
 168         Microsoft implementation seems to have a serious problem that many,
 169         many characters that are used in specific cultures, such as
 170         Myanmar, Mongolian, Cherokee, Ethiopic, Tagalog, Khmer, are regarded as
 171         "completely ignorable".
 172
 173         I tagged many LAMESPEC items in the implementation (both in collator
 174         and table generator).
 175
 176
 177 ** MS collation design inference
 178
 179 *** Levels
 180
 181         Each character has several "weights". It is a common concept between
 182         Windows and UCA.
 183
 184         There are 5 levels:
 185
 186         - level 1: primary difference
 187           The first byte of level 1 means the category of the character.
 188         - level 2: diacritic difference, including Japanese voice mark countup
 189         - level 3: case/width sensitivity, and Hangul properties
 190         - level 4: kana weight (all of them have primary category 22, at least
 191           in InvariantCulture)
 192         - level 5: shift weight (apostrophe, hyphens etc.)
 193
 194         Note that these levels do not map one-to-one onto IgnoreXXX flags. Thus
 195         we cannot simply omit some levels of sortkey values according to
 196         CompareOptions.
 197
 198         String comparison is done from level 1 to 5. The comparison does not
 199         stop until it either finds a primary difference or reaches the end
 200         (in which case the difference found at levels 2 to 5 is returned).
 201
 202         For example, "e" is smaller than "E", but "eB" is bigger than "EA".
 203         If the collator just returned the case difference at the first 'e' and
 204         'E', "eB" would wrongly come out smaller than "EA".
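
             The "e"/"E" example can be sketched as follows. The weights are toy
             values chosen for illustration (primary = the letter ignoring case,
             level 3 = 0x10 for uppercase), not real table data, and the class
             name is hypothetical.

```csharp
static class LevelCompareSketch
{
	// Toy weights: primary ignores case, level 3 marks uppercase.
	static int Primary (char c) { return char.ToLowerInvariant (c); }
	static int Level3 (char c) { return char.IsUpper (c) ? 0x10 : 0; }

	public static int Compare (string a, string b)
	{
		int lower = 0; // first non-primary difference seen so far
		int n = a.Length < b.Length ? a.Length : b.Length;
		for (int i = 0; i < n; i++) {
			int d = Primary (a [i]) - Primary (b [i]);
			if (d != 0)
				return d; // a primary difference always wins
			if (lower == 0)
				lower = Level3 (a [i]) - Level3 (b [i]);
		}
		if (a.Length != b.Length)
			return a.Length - b.Length;
		return lower; // reached the end: report the remembered difference
	}
}
```

             Here Compare ("e", "E") is negative, while Compare ("eB", "EA") is
             positive because the primary difference between 'B' and 'A'
             overrides the earlier case difference.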
 205
 206 **** level 5: shift weight by StringSort
 207
 208         There are some characters that are treated specially, namely
 209         apostrophes and hyphens. The sortkeys for them are put after level 4
 210         (thus here I write them as "level 5"). They have a different sort key
 211         format, described under "sort key format" below. There are no level 5
 212         characters when StringSort is specified.
 213
 214 *** sort key format
 215
 216         00 means the end of sort key.
 217         01 means the end of the level.
 218         02-FF means the value.
 219         Actually '2' could be cut when all the following values are
 220         also '2' (i.e. the sort key binary won't contain extraneous '2').
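
             As a rough sketch of this framing, the following splits KeyData
             into its five levels, assuming (as in the dumps in these notes) that
             levels 1 to 4 are each terminated by 01 and that level 5 runs up to
             the final 00. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

static class SortKeySplitSketch
{
	// Returns five byte arrays: levels 1-4 (01-terminated) and level 5.
	public static List<byte []> Split (byte [] key)
	{
		List<byte []> levels = new List<byte []> ();
		int start = 0;
		for (int i = 0; i < key.Length && levels.Count < 4; i++) {
			if (key [i] != 1)
				continue;
			levels.Add (Slice (key, start, i));
			start = i + 1;
		}
		int end = key.Length;
		if (end > start && key [end - 1] == 0)
			end--; // drop the terminating 00
		levels.Add (Slice (key, start, end));
		return levels;
	}

	static byte [] Slice (byte [] src, int from, int to)
	{
		byte [] ret = new byte [to - from];
		Array.Copy (src, from, ret, 0, ret.Length);
		return ret;
	}
}
```

             For the key 01 01 01 01 80 07 06 82 00 this yields four empty levels
             followed by the level 5 bytes 80 07 06 82.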
 221
 222         Every level has different key layout.
 223
 224 **** level 2
 225
 226         It looks like all level 2 keys are just accumulated, however without
 227         considering overflow. It sometimes makes sense (e.g. diaeresis and
 228         acute) but it causes many conflicts (e.g. "A\u0308\u0301" and "\u1EA6"
 229         are incorrectly regarded as equal).
 230
 231         Anyways since Japanese voice mark has level 2 value as 1 it just
 232         looked like the sum of voice marks.
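
             A minimal sketch of this accumulation. It assumes that each stored
             level 2 byte carries a bias of 2 (so that 02 means "no weight") and
             that combining two marks adds their bytes and removes one bias. That
             rule is inferred, not documented; it happens to reproduce several
             combined entries under "level 2 details". The class name is
             hypothetical.

```csharp
static class Level2Sketch
{
	// Assumption: stored byte = weight + 2, so a sum carries one extra 2.
	// The cast wraps silently at 256, mirroring the lack of overflow checks.
	public static byte Combine (byte a, byte b)
	{
		return unchecked ((byte) (a + b - 2));
	}
}
```

             With the values listed under "level 2 details", diaeresis (13) plus
             acute (0E) gives 1F, and so does circumflex (12) plus grave (0F),
             which is exactly the kind of conflict described above.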
 233
 234 **** level 3
 235
 236         The actual values are + 2 (e.g. for Hangul Normal Jamo, the value is 4)
 237
 238         For Korean letters:
 239                 - 2: Jongseong (11A8-11F9)
 240                 - 4: Half width? (FFA0-FFDC) and Compatibility Jamo? (3165-318E)
 241                 - 5: Compatibility Jamo (3130-3164)?
 242                      TODO: Learn about Korean characters.
 243
 244         For numbers:
 245                 - 4 circled inverse (2776-277F)
 246                 - 8 circled sans serif (2780-2789)
 247                 - C circled inverse && sans serif (278A-2793)
 248                 - 47 roman (2160-2182)
 249
 250         For Arabic letters:
 251                 - 2 Isolated form in presentation form B in FE80-FE8D
 252                 - 4 Alef/Bet/Gimel/Dalet (2135-2138)
 253                 - 8 Final form in presentation form B in FE82-FEF2
 254                 - 18 Medial form in presentation form B in FE8C-FEF4
 255                      Grep "ISOLATED", "FINAL" or "MEDIAL" on UnicodeData.txt
 256                      (and filter by codepoints).
 257                      or alternatively, see DerivedDecompositionType.txt.
 258                 - 22 6A9 (TODO: what is it?)
 259                 - 28 6AA (TODO: what is it?)
 260
 261         For other letters:
 262                 - 1 Fullwidth. UnicodeData.txt has <full>.
 263                 - 2 Subscript. UnicodeData.txt has <sub>.
 264                 - 8 Small capital, 03C2 (TODO: why?),
 265                     2104, 212B(flag=1A) (TODO: why?)
 266                     grep "SMALL CAPITAL" against UnicodeData.txt.
 267                 - C only FE42. TODO: what is this?
 268                 - E Superscripts. UnicodeData.txt has <super>.
 269                 - 10 Uppercase.
 270                      DerivedCoreProperties.txt has Uppercase property.
 271
 272         Note that simple 02 (value is 00) could be omitted.
 273
 274         Summary: at least 7 bits are required to represent a table:
 275         smallCapital, uppercase, normalization forms (2 bits:full/sub/super),
 276         arabic forms (2 bits:isolated/medial/final)
 277
 278 **** level 4
 279
 280         These sortkey data are collected only for Japanese (category 22)
 281         characters.
 282
 283         There are 3 sections, each of which ends with FF. Each section holds
 284         its values character by character:
 285         - small letter type (kogaki moji); C4 (small) or E4 (normal)
 286         - category middle section:
 287                 two subsections separated by 0x02
 288                 - char type;
 289                   3 (normal)
 290                   or 4 (voice mark - \u309D,\u309E,\u30FD,\u30FE, \uFF70)
 291                   or 5 (dash mark - \u30FC)
 292                 - kana type; C4 (katakana) or E4 (hiragana)
 293         - width; 2 (normal) or C5 (full) or C4 (half)
 294
 295           LAMESPEC: those characters of value '4' of middle section differ
 296           in level 2 wrt voice marks, but do not differentiate kana types
 297           (bug). It is ignored when IgnoreNonSpace applies.
 298
 299 **** level 5
 300
 301         UPDATED: I noticed offsetL does not exist, so removed it from here.
 302
 303         [offsetM + 0x80]? [const 3 + (offsetS + 1) * 4] [category] [level1]
 304
 305         where "offsetM" and "offsetS" together represent the offset in the
 306         input string. The first byte (offsetM + 0x80) is always 0x80 or larger.
 307         LAMESPEC: This design results in a buggy overflow.
 308
 309         <xmp>
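             // requires: using System; using System.Globalization;
             // s is the input string, e.g. "-" + new string ('A', 10) + "-"
             byte [] data = new CultureInfo ("").CompareInfo.GetSortKey (s).KeyData;
             int idx = 0;
             // skip levels 1 to 4, each of which is terminated by 01
             for (int i = 0; i < 4; i++, idx++)
                     for (; data [idx] != 1; idx++)
                             ;
             // dump the rest (level 5 and the terminating 00)
             for (; idx < data.Length; idx++)
                     Console.Write ("{0:X02} ", data [idx]);
             Console.WriteLine ();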
 318         </xmp>
 319
 320         inputs (s) and results:
 321
 322         80 07 06 82 80 2F 06 82 00 // '-' + new string ('A', 10) + '-'
 323         80 07 06 82 81 97 06 82 00 // (100)
 324         80 07 06 82 8F A7 06 82 00 // (1000)
 325         80 07 06 82 9C 47 06 82 00 // (10000)
 326         80 07 06 82 9A 87 06 82 00 // (100000)
 327         80 07 06 82 89 07 06 82 00 // (1000000)
 328
 329         The actual offset is 64 * offsetM + offsetS (81 97: 64 * 1 + 36 = 100)
 330
 331         (const '3' may actually vary but no idea.
 332         At least 00, 01 and 02 are not acceptable since they are reserved.
 333         02 is not reserved by definition above, but the key-size optimizer
 334         uses it as a special mark, as mentioned above.)
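
             The two position bytes can be sketched as below. All six dumps
             above are reproduced when the offset is split as
             64 * offsetM + offsetS and offsetM is truncated to 7 bits; both
             points are inferred from the dumps only. The case offsetS = 63,
             where the second byte would exceed FF, is not covered by the dumps,
             so the wrap-around there is an unverified assumption. The class name
             is hypothetical.

```csharp
static class Level5Sketch
{
	// Encodes the two bytes that precede [category] [level1].
	// offset is the value observed in the dumps, e.g. 10 for the trailing
	// '-' in "-AAAAAAAAAA-".
	public static byte [] EncodeOffset (int offset)
	{
		int offsetM = (offset / 64) & 0x7F; // only 7 bits survive (overflow)
		int offsetS = offset % 64;
		return new byte [] {
			(byte) (0x80 + offsetM),
			unchecked ((byte) (3 + (offsetS + 1) * 4))
		};
	}
}
```

             EncodeOffset (100) gives 81 97 and EncodeOffset (10000) gives
             9C 47, the latter showing the overflow: 10000 and 10000 - 8192
             encode identically.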
 335
 336 *** sort key table
 337
 338         Here is the simple sortkey dumper:
 339
             using System;
             using System.Globalization;

             public class SortKeyDumper
             {
                     public static void Main (string [] args)
                     {
                             // optional argument: culture name (default: invariant)
                             CultureInfo culture = args.Length > 0 ?
                                     new CultureInfo (args [0]) :
                                     CultureInfo.InvariantCulture;
                             CompareInfo ci = culture.CompareInfo;
                             // <= so that U+FFFF is dumped as well
                             for (int i = 0; i <= char.MaxValue; i++) {
                                     string s = new string ((char) i, 1);
                                     if (ci.Compare (s, "") == 0)
                                             continue; // completely ignorable
                                     byte [] data = ci.GetSortKey (s).KeyData;
                                     foreach (byte b in data) {
                                             Console.Write ("{0:X02}", b);
                                             Console.Write (' ');
                                     }
                                     // '!' flags keys whose third byte is not
                                     // the level 1 terminator (01)
                                     Console.WriteLine (" : {0:X}, {1} {2}",
                                             i,
                                             Char.GetUnicodeCategory ((char) i),
                                             data [2] != 1 ? '!' : ' ');
                             }
                     }
             }
 361
 362 *** multiple character mappings
 363
 364         Some sequences of characters are considered as a "composite" that is
 365         to be composed either as another character or another sequence of
 366         characters. Those "composite" might not have corresponding equivalent
 367         character in sortkey.
 368         Similarly, some single characters are expanded to a sequence of
 369         characters.
 370
 371 **** diacritic characters
 372
 373         Except for those shift-weight characters, there are only
 374         diacritical (or other kinds of nonspacing) characters that don't
 375         have primary weights.
 376
 377         Diacritics are not regarded as a base character when placed after
 378         (maybe some kind of) letters.
 379
 380         The behavior is diacritic character dependent. For example, Japanese
 381         combination of a Kana character and a voice mark is compulsory (the
 382         resulting sort key is regarded as identical to the corresponding
 383         single character. Try \u304B\u309B with \u304C. It is invariant).
 384
 385         In French cultures, diacritic orderings are checked from right to left.
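
             A sketch of what right-to-left diacritic checking changes, assuming
             one level 2 byte per character (02 for "no diacritic", 0E for acute
             and 12 for circumflex, as listed under "level 2 details") and inputs
             of equal length. The class name is hypothetical and this is not the
             actual collator code.

```csharp
static class DiacriticOrderSketch
{
	// Compares per-character level 2 weights, scanning from the start or,
	// for French-style ordering, from the end. Equal lengths are assumed.
	public static int Compare (byte [] a, byte [] b, bool backward)
	{
		for (int n = 0; n < a.Length; n++) {
			int i = backward ? a.Length - 1 - n : n;
			if (a [i] != b [i])
				return a [i] - b [i];
		}
		return 0;
	}
}
```

             For "c\u00F4te" (02 12 02 02) against "cot\u00E9" (02 02 02 0E),
             the forward scan puts "cot\u00E9" first, while the backward scan
             puts "c\u00F4te" first.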
 386
 387 **** Composite character processing
 388
 389         There are some sequences of characters that are treated as another
 390         character or another sequence of characters.
 391
 392         By default, there is no composite form.
 393         http://www.microsoft.com/globaldev/dis_v1/disv1.asp?DID=dis33d&File=S24C2.asp
 394         (Note that composite is different from expansion.)
 395
 396         Note that composite characters are likely not to have an equivalent
 397         codepoint.
 398
 399 **** Expanded character processing
 400
 401         Some characters are expanded to two or more characters:
 402
 403         C6 (AE), E6 (ae), 1F1-1F3 (dz), 1C4-1C6 (Dz), FB00-FB06 (ff, fi),
 404         132-133 (IJ), 1C7-1C9 (LJ), 1CA-1CC (NJ), 152-153 (OE),
 405         DF (ss), FB06 (st), FB05 (\u017Ft), FE, DE, 5F0-5F2,
 406         1113-115F (hangul)
 407         (CJK extension is not really expanded)
 408
 409         They don't match with any of Unicode normalization.
 410
 411         Some alphabetic cultures have different mappings, but mostly small
 412         (at least da-DK, lt-LT, fr-FR, es-ES have tiny differences).
 413
 414         Invariant culture also puts Czech unique character \u0161 between s
 415         and t, unlike described here:
 416         http://www.microsoft.com/globaldev/dis_v1/disv1.asp?DID=dis33d&File=S24C0.asp
 417
 418 *** default sort key table
 419
 420 **** StringSort
 421
 422         When CompareOptions.StringSort is specified, then it modifies
 423         characters in category 2 from "1 1 1 1 80 07 06 xx" to
 424         For details, see the "level 5" description above.
 425
 426         For details, "level 5" description above.
 427
 428         To handle them simply, they are laid out as "category 0x01" (which
 429         never happens in the actual sortkeys) for those shift-weight ones
 430         in the table.
 431
 432         There seem to be no further differences between StringSort and None.
 433
 434 **** level 2 details
 435
 436         Known value maps:
 437                 -0A: Korean parenthesized numbers (3200-321C)
 438                 -0C: Korean circled numbers (3260-327B)
 439
 440                 -03: Japanese voice mark
 441
 442                 <primary category 13 : Arabic>
 443                 -08: 627-648 (basic Abjad letters)
 444                 -09: madda (622)
 445                 -05: waw with hamza (624)
 446                 -07: yeh with hamza (626. ignore Presentation Form A area)
 447                 -0A: alef with hamza above (623)
 448                 -0A: alef with hamza below (625)
 449
 450                 <primary category 0E : diacritics>
 451                 Characters in non "0E" category are out of scope.
 452                 They can be grepped in UnicodeData.txt.
 453                 -0E: acute
 454                 -0F: grave
 455                 -10: dot above
 456                 -11: middle dot
 457                 -12: circumflex
 458                 -13: diaeresis
 459                 -14: caron
 460                      Note that 1C4-1C6 are covered but they are also expanded.
 461                 -15: breve (cyrillic are also covered? at least 4C1/4C2 are.)
 462                 -16: dialytika and tonos (category 0F though)
 463                 -17: macron
 464                 -19: tilde
 465                 -1A: ring above | 212B
 466                 -1B: ogonek ("WITH OGONEK;")
 467                 -1C: cedilla ("WITH CEDILLA;")
 468                 -1D: double acute | acute and dot above
 469                 -1E: stroke, except for 0E[1F] and cp{19B, 1BE} |
 470                      circumflex and acute | 18B,18C,19A,289
 471                      (i.e. this is not a one-to-one mapping: not every
 472                      "stroke" is mapped to 1E, and not every 1E comes from
 473                      a "stroke".)
 474                 -1F: diaeresis and acute | with circumflex and grave | l slash
 475                         beware "symbol slash"
 476                 -20: diaeresis and grave | 19B,19F
 477                 -21: breve and acute | D8,F8
 478                 -22: caron and dot above | breve and grave
 479                 -23: macron and acute
 480                 -24: macron and grave
 481                 -25: diaeresis and caron | dot above and macron | tilde and acute
 482                 -26: ring above and acute
 483                 -28: diaeresis and macron | cedilla and acute |
 484                      macron and diaeresis
 485                 -29: circumflex and tilde
 486                 -2A: tilde and diaeresis
 487                 -2B: stroke and acute
 488                 -2C: breve and tilde
 489                 -2F: cedilla and breve
 490                 -30: ogonek and macron
 491                 -43: hook, except for cp{192,1B2,25A,25D,27B,28B,2B1,2B5} |
 492                      left hook | with hook above except for cp{1EF6,1EF7} |
 493                      27D,284
 494                 -44: double grave | 1EF6,1EF7
 495                 -46: inverted breve
 496                 -48: preceded by apostrophe (actually only 149)
 497                 -52: horn
 498                 -55: line below | circumflex and hook above
 499                 -57: palatal hook (actually only 1AB)
 500                 -58: dot below except for cp{1EA0,1EA1}
 501                 -59: "retroflex" (without "WITH") | diaeresis below | 1EA0,1EA1
 502                 -5A: ring below | 1E76,1E77
 503                 -60: circumflex below except for cp{1E76,1E77} | horn and acute
 504                 -61: breve below | horn and grave
 505                 -63: tilde below | 2125
 506                 -68: D0,F0,182,183 | dot below and dot above | topbar
 507                 -69: right half ring | horn and tilde
 508                 -6A: circumflex and dot below
 509                 -6D: breve and dot below
 510                 -6E: dot below and macron
 511                 -95: horn and hook above
 512                 -AA: horn and dot
 513
 514                 (for 01-0D and 7B-8A, they are not related to diacritics.)
 515
 516                 <category BlahBlahNumbers from 0100 to 1000>
 517                 -38: Arabic-Indic numbers (660-669)
 518                 -39: extended Arabic-Indic numbers (6F0-6F9)
 519                 -3A: Devanagari numbers (966-96F)
 520                 -3B: Bengali numbers (9E6-9EF)
 521                 -3C: Bengali currency enumerators (9F4-9F9)
 522                 -3D: Gurmukhi numbers (A66-A6F)
 523                 -3E: Gujarati numbers (AE6-AEF)
 524                 -3F: Oriya digit numbers (B66-B6F)
 525                 -40: Tamil numbers (BE7-BF2)
 526                 -41: Telugu numbers (C66-C6F)
 527                 -42: Kannada numbers (CE6-CEF)
 528                 -43: Malayalam numbers (D66-D6F)
 529                 -44: Thai numbers (E50-E59)
 530                 -45: Lao numbers (ED0-ED9)
 531                 <miscellaneous numbers>
 532                 -47: Roman numbers (2160-2182)
 533                 -4E: Hangchou numbers (3021-3029)
 534
 535                 -E0[64]: 2107 (Euler)
 536                 -E0[87]: some Tone letters (TONE TWO / TONE SIX)
 537                 -EE: Circled letter-or-digits and katakanas
 538                         CIRCLED {DIGIT|NUMBER|LATIN|KATAKANA}
 539                         numbers (2460-2473,2776-2793,24EA)
 540                         latin (24B6-24E9)
 541                         katakana (32D0-32FE)
 542                 -F3: Parenthesized enumerations
 543                         numbers (2474-2487)
 544                         latin (249C-24B5)
 545                         PARENTHESIZED {DIGIT|NUMBER|LATIN}
 546                 -F4: Numbers with dot (2488-249B)
 547                         {DIGIT|NUMBER} * FULL STOP
 548
 549                 <miscellaneous>
 550                 -258,25C-25E,285,286,29A,297 -> 0E[80-86,88]
 551                 -27F,2B3-2B6 -> 0E 8A[80-84]
 552                 -3D3 -> 0F[44]
 553                 -476,477 -> 10[46]
 554                 -215F -> 0C[03]
 555
 556                 -20D0-20E1 -> 01[DD-F0]
 557                 -483-486 -> 01[94-97]
 558                 -559,55A -> 01[98,99]
 559                 -711 -> 01[9A]
 560
 561                 -346-348,2BE-2C5,2CE-2CF -> 01[74-7F]
 562                 -2D1-2D3,2DE,2E4-2E9 -> 01[81-8A]
 563
 564                 -342,343 -> 01[8D,8E]
 565                 -345 -> 01[90]
 566
 567                 -700-780 01[8D-AF]. Maybe there is some kind of traditional
 568                 order in Estrangela, but for now I am not sure.
 569                 /*
 570                 -740-742 -> 01[8D-8F]
 571                 -747,748,732,735,738,739,73C,73F,743-746,730 -> 01[90,91,94-9F]
 572                 -731,733,734,736,737,73A,73B,73D,73E,749,74A,7A6-7A9
 573                  -> 01[A0-AA,AC-AF]
 574                 */
 575                 -7AA-7B0 -> 01[B0-B6]
 576
 577                 -591-5C2 except for 5BA,5BE -> 01[03-33] in order
 578
 579                 No further patterns for >= 80
 580
 581                 TODO: Below are not done yet:
 582                         - x < 0x80 in non-"0E" part
 583                         - 03 <= x <= 0D in "0E" part
 584                         - 7B <= x <= 7F in "0E" part
 585
 586 **** sortkey details by category
 587
 588         1 specially ignored ones (Japanese, Tamil, Thai)
 589
 590                 IdentifyBy: constants
 591                 Unicode: 3099-309C, BCD, E47, E4C, FF9E, FF9F
 592                 SortKey: 01 01 01 01 00
 593
 594         2 shift weight characters
 595
 596         They are either at 01 01 01 01 or 06, depending on StringSort. For
 597         convenience, I use 06 to describe them.
 598
 599         2.1 control characters (specified as such in Unicode), except for
 600         whitespaces (0009-000D).
 601
 602                 ProcessAfter: 4.1
 603                 IdentifyBy: UnicodeCategory.Control
 604                 Unicode: 0001-001F minus 0009-000D, 007F-009F
 605                 SortKey: 06 03 - 06 3D
 606
 607         2.2 Apostrophe
 608                 IdentifyBy: constant
 609                 Unicode: 0027,FF07 (')
 610                 SortKey: 06 80 (and width insensitive equivalents)
 611
 612         2.3  minus sign, hyphen, dash
 613           minus signs: FE63, 207B (super), 208B (sub), 002D, FF0D (full-width)
 614           hyphens: 00AD (soft), 2010, 2011 (nonbreaking) ... Unicode HYPHEN?
 615           dashes, horizontal bars: FE58 ... UnicodeCategory.DashPunctuation
 616
 617                 IdentifyBy: UnicodeCategory.DashPunctuation
 618                 SortKey: 06 81 - 06 90 (and nonspace equivalents)
 619
 620         2.4 Arabic spacing and equivalents (64B-652, FE70-FE7F)
 621           They are part of nonspacing mark, but not equal.
 622
 623                 SortKey: 06 A0 - 06 A7 (and nonspace equivalents)
 624
 625         3 nonprimary characters, mixed.
 626
 627           ModifierSymbol, except for that are not in category 0 and "07" area
 628           (i.e. < 128) nor those equivalents
 629
 630           NonSpacingMark which is ignorable (IsIgnorableNonSpacing())
 631           // 30D, CD5-CD6, ABD, 2B9-2C5, 2C8, 2CB-2CD, 591-5C2. NonSpacingMark in
 632           // 981-A3C. A4D, A70, A71, ABC ...
 633
 634           TODO: I need more insight to write table generator.
 635
 636           SortKey: 01 03 01 - 01 B6 01
 637
 638           This part of MS table design is problematic (buggy): \u0592 should
 639           not be equal to \u09BC.
 640
 641           I guess, this buggy design is because Microsoft first thought that
 642           there won't be more than 255 characters in this area. Or they might be
 643           aware of the problem but prefer table optimization.
 644
 645           Ideal solutions:
 646
 647           1) We should not mix those codes (make things sequential) and should
 648              expand level 2 length to 2 bytes. Instead of having direct value,
 649              we could use index (pointer) to zero-terminating level 2 table.
 650
 651           2) Include those characters from minor cultures here.
 652
 653           If in "discriminatory mode", those tables could be still provided
 654           as to be compatible to Windows.
 655
 656           Additionally there seem to be some bugs around Modifier letter
 657           collection. For example, 2C6 should be a nonspacing diacritical
 658           character but it is regarded as a primary character. The same applies
 659           to Mandarin tone marks (2C9-2CB) (and plenty of other characters).
 660
 661         4 space separators and some kind of marks
 662
 663         4.1 whitespaces, paragraph separator etc.
 664           UnicodeCategory.SpaceSeparator : 20, 3000, A0, 9-D, 2000-200B
 665
 666           SortKey : 07 02 - 07 18
 667
 668         4.2 some OtherSymbols: 2422-2423
 669
 670           SortKey : 07 19 - 07 1A
 671
 672         4.3 ASCII compatible marks ('!', '^', ...)
 673           Non-alpha-numeric < 0x7F except for [[+-<=>']]
 674           small compatibility equivalents -> itself, wide
 675
 676         4.4 other marks
 677           FIXME: how to identify them?
 678           some Punctuations: InitialQuote/FinalQuote/Open/Close/Connector
 679           some OtherSymbols: 2400-2424
 680           3003, 3006, 2D0, 10FB
 681           remaining Punctuations: 9xx, 7xx
 682           70F (Format)
 683
 684           SortKey : 07 1B - 07 F0
 685
 686         5 mathematical symbols
 687           InitialQuotePunctuation and FinalQuotePunctuation in ASCII
 688           (not Quotation_Mark property in PropList.txt ; 22, 27)
 689
 690           byte area MathSymbol: 2B,3C,3D,3E,AB,B1,BB,D7,F7 except for AC
 691           some MathSymbol (2044, 208A, 208C, 207A, 207C)
 692           OtherLetter (1C0-1C2)
 693           2200-22FF MathSymbol except for 221E (INF. ; regarded as a number)
 694
 695           SortKey : 08 02 - 08 F8
 696
 697         6 Arrows and Box drawings
 698           09 02 .. 09 7C : 2300-237A
 699                         only primary differences
 700           09 BC ... 09 FE : 25A0-AB, 25E7-EB, 25AC-B5, 25EC-EF, 25B6-B9,
 701                         25BC-C3, 25BA-25BB, 25C4-25D8, 25E6, 25DA-25E5
 702                         21*,25*,26*,27*
 703                         This area contains level 2 values.
 704           2190- (non-codepoint order)
 705                 note that there are many compatibility equivalents
 706           2500- except for 266F (#)
 707
 708           SortKey : 09 02 - 09 7C, 09 BC 01 03 - 09 BC 01 13,
 709                     09 {BD|BE|BF} 01 {03|04}, ...
 710                     TODO: fill the patterns
 711
 712         7 currency symbols and some punctuations
 713           byte CurrencySymbols except for 24 ($)
 714           byte OtherSymbols (A7-B6)
 715           ConnectorPunctuation - 2040 (i.e. FF65, 30FB)
 716           OtherPunct/ConnectorPunct/CurrencySymbol 2020-20AC - 20AC
 717           OtherSymbol 3012-303F,3004,327F
 718           MathSymbol/OtherSymbol 2600-2767 (math = 266F)
 719           OtherSymbol 2440-244A, 2117
 720           20AC (CurrencySymbol)
 721
 722           SortKey : 0A 02 - 0A FB
 723
 724         8 (C) numbers
 725           all DecimalDigitNumber, LetterNumber, non-CJK OtherNumber.
 726           9F8.
 727           digits, in numeric order. We can use NET_2_0 CharUnicodeInfo.
 728           221E. (INF.)
 729
 730           SortKey : 0C 02 (9F8), 0C 03 - 0C E1 (normal numbers), 0C FF (INF.)
 731
 732         9 (E) latin letters (alphabets), mixing alphabetical symbols
 733           Alphabets, A to Z, mixing alphabetical symbols. See below.
 734           F8-2B8 except for (1BB-1BD and 1C0-1C3), but not sequential.
 735           2E0-2E3.
 736
 737           For diacritical orders, see level 2.
 738
 739           For 'A' it is "0E 02", for 'B' "0E 09" ... 'Z' "0E A9", ezh "0E AA".
 740           0E B3 (1BE), 0E B4 (298)
 741
 742           There are CJK compatibility characters (3800-) and letterlike
 743           symbols (2100-) in those A-to-Z area, ordered by character name.
 744
 745           Primary weights are sometimes culture-dependent.
 746                 FIXME: [0E 0D], [0E 0E], [0E 4B], [0E 75], [0E B2] are unknown
 747                 02: A
 748                 03: C4 in sk|vi
 749                 04: C1 in is|pl|vi
 750                 05-08: CJKext
 751                 09: B
 752                 0A: C
 753                 0B: 10D in hr|lt|lv|pl, 107 in pl
 754                 0C: C7 in az|tr, 10D in cs|sk, 106 in hr
 755                 0F-19: CJKext
 756                 1A: D
 757                 1B: 189 (African D)
 758                 1C: 2A3 (DZ Digraph)
 759                 1D: 1C6 (dz) in hr
 760                 1E: 110 (D with stroke) in hr
 761                 1F-20: CJKext
 762                 21: E
 763                 22: 18F=259 in az, E9 in is, 119 in pl, EA in vi, 1EBE-1EC7 in vi
 764                 23: F
 765                 24: CJKext
 766                 25: G
 767                 26: 11F in az|tr, 123 in lv
 768                 28-2B: CJKext
 769                 2C: H
 770                 2D: 267 (Heng with hook)
 771                 2E: 33CB in az, 33CA in tr
 772                 2F-31: CJKext
 773                 32: I
 774                 33: CD in is, 79 in lt
 775                 34: CJKext
 776                 35: J
 777                 36: K
 778                 37-47: CJKext
 779                 48: L
 780                 49: 2114
 781                 4A: 1C9 in hr
 782                 4C: 142 in pl
 783                 4D-50: CJKext
 784                 51: M
 785                 52-6F: CJKext
 786                 70: N
 787                 71: 2116
 788                 72: 144 in pl
 789                 73: F1 in es, 1CC in hr
 790                 74: 14B
 791                 76-7B: CJKext
 792                 7C: O
 793                 7D: F6 in az|hu|tr, 151 in hu, F3 in is|pl, F4 in sk|vi, 1ED0-1ED9 in vi
 794                 7E: P
 795                 7F-88: CJKext
 796                 89: Q
 797                 8A: R
 798                 8B: 211E
 799                 8C: 211F
 800                 8D: 159 in cs|sk
 801                 8E-90: CJKext
 802                 91: S
 803                 92: 2108
 804                 93: 2120
 805                 94-95: CJKext
 806                 96: 17F (LATIN SMALL LONG S)
 807                 97: 15F in az|tr, 161 in cs|hr|lt|lv|sk|sl, 7A,179-17C in et, 15B in pl
 808                 98: 17E in et, 15F in ro, 15B in sl
 809                 99: T
 810                 9A: 2121
 811                 9B: CJKext
 812                 9C: 2122
 813                 9D: 2A6
 814                 9E: 166
 815                 9F: U
 816                 A0: FA in is, 1B0,1EE8-1EF1 in vi
 817                 A1: FC in az|tr, 56,57 in et, FC,171 in hu, FB in vi
 818                 A2: V
 819                 A3: 2123
 820                 A4: W
 821                 A5: CJKext
 822                 A6: X
 823                 A7: Y
 824                 A8: FD in is
 825                 A9: Z
 826                 AA: 292
 827                 AB: DE in is, 17E in lt|lv, 17A in pl
 828                 AC: E6 in da|is, 1E3 in is, 17C in pl, 17E in sl
 829                 AD: 17E in cs|hr|sk, E5 in fi, F6,F8 in is, 17A in sl
 830                 AE: F6,F8,151 in da
 831                 AF: E4 in fi
 832                 B0: F6,F8,151 in fi
 833                 B1: E5 in da, "aa" in da
 834                 B3: 1BE
 835                 B4: 298
 836
 837         10 culture dependent letters (general)
 838           0F: 386-3F2 ... Greek and Coptic
 839                 386-3CF: [0F 02] - [0F 19] (consider primary equivalents)
 840                 3D0-3EF: [0F 40] - [0F 54]
 841           10: 400-4E9 ... Cyrillic.
 842                 For 400-45F and 4B1, they are mostly UCA DUCET order.
 843                 After that 460-481 follows, by codepoint.
 844                 (490-4FF except for 4B1 and Cyrillic supplementary are unused.)
 845           11: 531-586 ... Armenian.
 846                 Simply sorted by codepoint (handle case).
 847           12: 5D0-5F2 ... Hebrew.
 848                 Codepoint order (handle case).
 849           13: 621-6D5 plus 670 (NonSpacingMark) ... Arabic
 850                 Area 1:
 851                 They look like ordered by Arabic Presentation Form B except
 852                 for FE95, and considering diacritical equivalents maybe based
 853                 on the primary character area (621-6D5).
 854                 There are still some special characters: 67E,686,698,6AF ...
 855                 which might not have equivalent characters (I wonder how they
 856                 are inserted into the presentation form B map).
 857
 858                         Solution:
 859                         - hamza, waw, yeh (621,624,626) are special: [13 07]
 860                         - For all remaining letters, get primary letter name
 861                           and store it into dictionary. If unique, then
 862                           increment index by 4 from [13 0B]
 863                 Area 2:
 864                 674-6D5 : by codepoint from [13 84].
 865           14: 901-963 exc. 93A-93D 950-954 ... Devanagari.
 866                 For <905 codepoint order, x2 from [14 04].
 867                 For 905-939 codepoint order, x4 from [14 0B].
 868                 For 93E-94D codepoint order, x2 from [14 DA].
 869           15: 982-9FA ... Bengali. Actually all UnicodeCategories except for
 870                 NonSpacingMark, DecimalDigitNumber and OtherNumber.
 871                 For <9E0 simple codepoint order from [15 02].
 872                 For >9E0 simple codepoint order from [15 3B].
 873           16: A05-A74 exc. A3C A4D A66-A71 ... Gurmukhi.
 874                 The same as UCA order, x4 from [16 04].
 875           17: A81-AE0 exc. ABC-ABD ... Gujarati.
 876                 Mostly equivalent to UCA, but insert {AB3,A81-A83} before AB9,
 877                 x4 from [17 04].
 878           18: B00-B70 ... Oriya
 879                 All but NonSpacingMark and DecimalDigitNumber, by codepoint.
 880           19: B80-BFF ... Tamil
 881                 BD7 is special : [19 02].
 882                 B82-B93 (vowels) : x2 from [19 0A].
 883                 B94 (vowel AU) : [19 24]
 884                 For consonant order Windows has native Tamil order which is
 885                 different from UCA.
 886                 http://www.nationmaster.com/encyclopedia/Tamil-alphabet
 887                 (The order is still different in "Grantha" order from TAM.)
 888                 So, we should just hold constant array for consonants.
 889                 And put them in order, x4 from [19 26].
 890                 BBE-BCC : SpacingCombiningMark and BC0 ... x2 from [19 82].
 891           1A: C00-C61 ... Telugu.
 892                 C55 and C56 are ignored (C5x line and remaining part of C6x
 893                 line just look like ignored).
 894                 C60 and C61 are specially placed. C60 after C0B, C61 after C0C.
 895                 Except for above, by codepoint, x3 from [1A 04].
 896           1B: C80-CE5 ... Kannada.
 897                 CD5,CD6 (and CE6-CEF: DecimalDigitNumber) are ignored.
 898                 by codepoint, x3 from [1B 04].
 899           1C: D02-D40 ... Malayalam.
 900                 by simple codepoint from [1C 02].
 901           (1D: Sinhala ... totally ignored?)
 902           1E: E00-E44 ... Thai.
 903                 preceding vowels (E40-E44) by codepoint [1E 02 - 1E 06]
 904                 consonants (E01-E2A) by codepoint, x6 from [1E 07].
 905           1F: E2B-E5B,E80-EDF ... Thai / Lao. (Thai breaks the category wall.)
 906                 Thai:
 907                 remaining consonants (E2B-E2E) by codepoint, x6 from [1E 07].
 908                 remaining vowels (E2F-E3A) by codepoint.
 909                 E45,E46,E4E,E4F,E5A,E5B
 910                 Lao:
 911                 E80-EDF by codepoint from [1F 02].
 912           21: 10A0-10FF ... Georgian
 913                 Mostly equal to UCA order, but swap 10E3 <-> 10F3,
 914                 x5 from [21 05].
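
               Many entries above follow the pattern "by codepoint, xN from
               [category start]". A minimal sketch of that rule is below.
               StridedWeight is a hypothetical helper name, and the sketch
               assumes the run has no excluded or unassigned code points,
               which does not hold for every script listed above.

```csharp
using System;

static class StridedWeightSketch
{
    // Level 1 weight for cp in a run starting at 'first': the first
    // character gets 'start' and each following code point advances
    // the weight by 'stride'. Hypothetical helper, not Mono API.
    public static byte StridedWeight (int cp, int first, byte start, int stride)
    {
        int w = start + (cp - first) * stride;
        if (cp < first || w > 0xFF)
            throw new ArgumentOutOfRangeException ("cp");
        return (byte) w;
    }
}
```

               With the Thai consonant entry (E01-E2A, x6 from [1E 07]) this
               yields 07 for E01 and FD for E2A, so that run just fits into
               a single byte.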
 915
 916         11 (22) japanese kana letters and symbols, not in codepoint order
 917
 918           For single character, the sortkeys look like:
 919           - Katakana normal A, Half Width (FF71) : FF 02 C4 FF C4 FF 01 00
 920           - Katakana normal A, Full Width (30A2) : FF C4 FF 01 00
 921           - Hiragana normal A, Full Width (3042) : FF FF 01 00
 922
 923           Actually for level 4 weights, there is a different rule (see
 924           "level 4" format above).
 925
 926           There is also 32D0 (normal katakana A with circle) that has a
 927           diacritic difference.
 928
 929           For primary weights, 'A' to 'O' are mapped to 22-26, 'Ka' to 'Ko'
 930           to 2A-2E, 'Sa' to 'So' to 32-36, and so on.
 931           'Nn' is special: [22 80].
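
               The three ranges given above (22-26, 2A-2E, 32-36) suggest
               a step of 8 per kana row and 1 per vowel. A small sketch
               under that assumption follows. KanaPrimary is a hypothetical
               name, and extending the step beyond the 'Sa' row is only an
               extrapolation from "and so on", not something checked against
               Windows.

```csharp
using System;

static class KanaWeightSketch
{
    // row: 0 = 'A' row, 1 = 'Ka' row, 2 = 'Sa' row, ... (up to 9, assumed)
    // vowel: 0 = a, 1 = i, 2 = u, 3 = e, 4 = o
    // Returns the level 1 weight inside category 22.
    public static byte KanaPrimary (int row, int vowel)
    {
        if (row < 0 || row > 9)
            throw new ArgumentOutOfRangeException ("row");
        if (vowel < 0 || vowel > 4)
            throw new ArgumentOutOfRangeException ("vowel");
        return (byte) (0x22 + row * 8 + vowel);
    }
}
```

               'Nn' ([22 80]) does not fit this scheme and would have to be
               handled separately.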
 932
 933           After Kana characters, there are CJK compat characters.
 934           From 22 97 01 01 01 01 00 (3349) to 22 A6 01 01 01 01 00 (333B) are
 935           sorted in JIS table order (CP932.TXT). Remaining square characters
 936           are maybe sorted in Alphabetic order.
 937
 938           UCA DUCET also does not apply here.
 939
 940         12 (23) bopomofo letters
 941                 3105-312C: simple codepoint order from [23 02].
 942
 943         13 culture dependent letters 2
 944                 710-72C : Estrangela (ancient Syriac).
 945                         codepoint order.
 946                         711 is excluded (superscript).
 947                         714,716,71C,724 and 727 are "alternative" characters.
 948                         SortKey: [24 0B]-[24 60], by x where x is 2 for those
 949                         which is "alternative" defined above, otherwise 4.
 950
 951                 780-7A5 : Thaana
 952                         Equals to UCA order, x2 from [24 6E].
 953
 954         (Maybe we should add remaining minor-culture characters here. Tibetan,
 955         Limbu, Tagalog, Hanunoo, Buhid, Tagbanwa, Myanmar, Khmer, Tai-Le,
 956         Mongolian, Cherokee, Canadian-Aboriginal, Ogham, Runic are ignored)
 957
 958         14 (41-45) surrogate Pt.1
 959
 960         15 (52 02-7E C8) hangul, mixing combined ones
 961
 962           It starts from 1100. After width-insensitive equivalents, the
 963           syllables (from AC00) follow (until D7A3).
 964           The order seems to be based on some formula (though sometimes it
 965           does not look so, e.g. 1117). FIXME: this area should be clarified more.
 966
 967           Hangul syllables should not be filled in the table. Instead, they
 968           can be easily computed by the following formula:
 969
 970                 // rc is the codepoint for the input Syllable
 971                 // (p holds "(category << 8) + level1weight")
 972                 int ri = ((int) rc - 0xAC00) + 1;
 973                 ushort p = (ushort)
 974                         ((ri / 254) * 256 + (ri % 254) + 2);
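
               As a worked example, the formula can be wrapped in a method
               as below. SyllableWeight is a hypothetical name. These notes
               do not show how the result is placed into the 52.. category
               range, so the sketch returns the raw value of the formula
               only.

```csharp
using System;

static class HangulSyllableSketch
{
    // Applies the formula from the notes to one Hangul syllable.
    // High byte: category part, low byte: level 1 weight (02..FF).
    public static ushort SyllableWeight (char rc)
    {
        if (rc < '\uAC00' || rc > '\uD7A3')
            throw new ArgumentOutOfRangeException ("rc");
        int ri = ((int) rc - 0xAC00) + 1;
        return (ushort) ((ri / 254) * 256 + (ri % 254) + 2);
    }
}
```

               For AC00 this gives 00 03, and for ACFD (the 254th syllable)
               it gives 01 02, i.e. the low byte wraps around after FF
               and skips 00 and 01.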
 975
 976           Hangul Jamo cannot be filled in the table directly, since
 977           U+1113 - U+1159 hold additional primary key bytes.
 978           FIXME: find out how they can be computed.
 979           See http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icuhtml/design/collation/ICU_collation_design.htm?rev=HEAD&content-type=text/html#Hangul_Implicit_CEs
 980
 981         16 (9E 02-F1 E4) CJK
 982
 983            9E 02-F0 B4 [3192-319F,3220-3243,3280-32B0,4E00-9FA5] : CJK mark,
 984                 parenthesized CJK (part), circled CJK (part), CJK ideograph.
 985                 Ordered but considering compatible characters (i.e. there is
 986                 no other way than having massive mapping).
 987            F0 B5-F1 E4 [F900-FA2D]. CJK compatibility ideograph.
 988
 989            LAMESPEC: in the latest spec CJK ends at 9F BB. Since MS table
 990            joins these two categories without any consideration, it is
 991            impossible to insert those new characters without breaking binary
 992            compatibility.
 993
 994         17 (E5 02-FE 33) PrivateUse.
 995
 996            In fact it overlaps with CJK characters (maybe layout design failure).
 997
 998         18 (F2 01-F2 31) surrogate Pt.2
 999
1000            In fact it overlaps with PrivateUse (maybe layout design failure).
1001
1002         19 (FE FF 10 02 - FE FF 29 E9) CJK extensions
1003
1004            3400-4DB5. Ordered.
1005
1006            They should be computed: this range has to be checked anyway
1007            (the sortkey values cannot be taken as is but need the FE FF
1008            part), and the values are computable.
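
                A possible way to compute them is sketched below. It
                assumes that the two bytes after FE FF advance one step per
                code point and that the second byte runs through 02..FF
                (254 values). That assumption reproduces both endpoints
                given in the heading (3400 -> 10 02, 4DB5 -> 29 E9), but
                the values in between are not verified. ExtensionWeight is
                a hypothetical name.

```csharp
using System;

static class CjkExtensionSketch
{
    // Returns the two bytes that follow the FE FF prefix for a
    // CJK extension code point in 3400-4DB5.
    public static byte [] ExtensionWeight (int cp)
    {
        if (cp < 0x3400 || cp > 0x4DB5)
            throw new ArgumentOutOfRangeException ("cp");
        int idx = cp - 0x3400;
        return new byte [] {
            (byte) (0x10 + idx / 254),
            (byte) (0x02 + idx % 254)
        };
    }
}
```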
1009
1010         20 (FF FF 01 01 01 01 00) special.
1011            Japanese extender marks:
1012                 3005, 3031, 3032, 309D, 309E, 30FC, 30FD, 30FE, FF70
1013
1014            LAMESPEC: In native usage, Microsoft's understanding of Japanese
1015            3031 and 3032 is wrong. They are never used to repeat *just the
1016            previous* character, but are usually used to repeat two or
1017            more characters. Also, 3005 is not always used to repeat exactly
1018            one character but sometimes used to repeat two (or possibly more)
1019            characters.
1020
1021            Arabic shadda: FE7C (isolated), FE7D (medial)
1022            (Actually they are not extenders in Unicode PropList.txt)
1023
1024
1025         - by UnicodeCategory -
1026
1027         DashPunctuation         6 (no exception)
1028         DecimalDigitNumber      C (no exception)
1029         EnclosingMark           1 E (no exception)
1030         Format                  7 (only 70F)
1031         LetterNumber            C (no exception)
1032         LineSeparator           7 (only 2028)
1033         ParagraphSeparator      7 (only 2029)
1034         PrivateUse
1035         SpaceSeparator          7 (no exception)
1036         Surrogate
1037
1038         OtherNumber             C(<3192), 9E-A7 (3124<)
1039
1040         Control                 6 except for 9-D (7)
1041         FinalQuotePunctuation   7 except for BB (8)
1042         InitialQuotePunctuation 7 except for AB (8)
1043         ClosePunctuation        7 except for 232A (9)
1044         OpenPunctuation         7 except for 2329 (9)
1045         ConnectorPunctuation    7 except for FF65, 30FB, 2040 (A)
1046
1047         OtherLetter             1, 7, 8 (1C0-1C2), C, 12-FF
1048         MathSymbol              8, 9, 6, 7, A, C
1049         OtherSymbol             7, 9, A, C, E, F, <22, 52<
1050         CurrencySymbol          A except for FF69,24,FF04 (7) and 9F2,9F3 (15)
1051
1052         LowercaseLetter         E-11 except for B5 (A) and 1BD (C)
1053         TitlecaseLetter         E (no exception)
1054         UppercaseLetter         E,F,10,11,21 except for 1BC (C)
1055         ModifierLetter          1, 7, E, 1F, FF
1056         ModifierSymbol          1, 6, 7
1057         NonSpacingMark          1, 6, 13-1F
1058         OtherPunctuation        1, 7, A, 1F
1059         SpacingCombiningMark    1, 14-22
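
             For the categories marked "no exception" above, the sort key
             category can be guessed from UnicodeCategory alone. The sketch
             below covers only those unambiguous rows and returns -1 for
             everything else. GuessCategory is a hypothetical name, and
             EnclosingMark is left out because its row lists two values.

```csharp
using System;
using System.Globalization;

static class CategoryGuessSketch
{
    // Returns the sort key category byte for characters whose
    // UnicodeCategory maps to exactly one category in the list,
    // or -1 when the list has exceptions or several candidates.
    public static int GuessCategory (char c)
    {
        switch (Char.GetUnicodeCategory (c)) {
        case UnicodeCategory.DashPunctuation:
            return 0x06;
        case UnicodeCategory.SpaceSeparator:
            return 0x07;
        case UnicodeCategory.DecimalDigitNumber:
        case UnicodeCategory.LetterNumber:
            return 0x0C;
        case UnicodeCategory.TitlecaseLetter:
            return 0x0E;
        default:
            return -1;
        }
    }
}
```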
1060
1061 *** Culture dependent design
1062
1063         (To verify this section, run the simple dumper code shown above
1064         with all the supported cultures.)
1065
1066 **** primary cultures and non-primary cultures
1067
1068         This code runs the sort key dumper shown above once for every culture
1069         and writes each dump to "<culture name>.txt".
1070
1071         public static void Main ()
1072         {
1073                 foreach (CultureInfo ci in CultureInfo.GetCultures (
1074                         CultureTypes.AllCultures)) {
1075                         ProcessStartInfo psi = new ProcessStartInfo ();
1076                         psi.FileName = "../allsortkey.exe";
1077                         psi.Arguments = ci.Name;
1078                         psi.RedirectStandardOutput = true;
1079                         psi.UseShellExecute = false;
1080                         Process p = new Process ();
1081                         p.StartInfo = psi;
1082                         p.Start ();
1083                         string s = p.StandardOutput.ReadToEnd ();
1084                         StreamWriter sw = new StreamWriter (ci.Name + ".txt", false, Encoding.UTF8);
1085                         sw.Write (s);
1086                         sw.Close ();
1087                 }
1088         }
1089
1090         For each sub culture (that has a parent culture), its collation
1091         mapping is identical to that of its parent, except for az-AZ-Cyrl.
1092
1093         Additionally,
1094
1095         - zh-CHS = zh-CN = zh-SG = zh-MO : pronunciation
1096         - zh-TW = zh-HK = zh-CHT : stroke count
1097         - da = no
1098         - fi = sv
1099         - hr = sr
1100
1101         (UCA implies that there are some cultures that sort alphabets from
1102         large to small, but as far as I can see there is no such CultureInfo.)
1103
1104 **** Latin characters and NonSpacingMark order tailorings
1105
1106         div : FDF2 is 24 83 01 01 01 01 00 (only 1 difference)
1107         syr : some NonSpacingMarks are totally ignorable.
1108         tt,kk,mk,az-AZ-Cyrl,uk : cyrillic difference
1109         az,et,lt,lv,sl,tr,sv,ro,pl,no,is,hu,fi,es,da : latin difference
1110         fr : 1C4-1C6.
1111         sk,hr,cs : latin and NonSpacingMark differences
1112
1113         ja,ko : 5C
1114
1115 **** CJK character order tailorings
1116
1117         <how many tables?>
1118
1119         There are five different CJK orderings:
1120         default, ko(-KR), ja(-JP), zh-CHS and zh-CHT
1121         Each of them has a very different CJK mapping.
1122
1123         Since they seem to be based on traditional encodings, we are likely to
1124         provide separate constant tables and switch depending on the culture.
1125
1126         <what characters are different from the invariant culture?>
1127
1128         ko : CJK layout difference (52 -> 80)
1129         ja,zh-CHS,zh-TW : dash (5C), CJK layout difference.
1130
1131         Target characters are : CJK misc (3190-), Parenthesized CJK
1132         (3200-), CJK compat (3300-), CJK ideographs (4E00-),
1133         CJK compat ideograph (F900-), Half/Full width compat (FF00-)
1134
1135         Additionally for Korean: Jamo (1100-), Hangul syllables (AC00)
1136
1137         <how are they composed?>
1138
1139         Japanese CJK order looks to be based on JIS table order. Those characters
1140         which are also in JIS table are moved to 80 xx. Those which are *not*
1141         in JIS table are left as is (9E-FE).
1142
1143         Additionally, Windows has different order for characters below:
1144         4EDD,337B,337E,337D,337D,337C
1145         They come in front of the first CJK character.
1146
1147         Maybe Korean CJK order respects KS C 5619. Note that Korean mixes
1148         Hangul and CJK in its order, so it is not a flat order without indexes
1149         (thus, for CJK they are not computable). Also, there are extra
1150         level 2 values in the Korean CJK map.
1151
1152         For some Chinese such as zh-CHS, character order is based on pinyin.
1153         And for remaining Chinese such as zh-TW, it is stroke count based.
1154
1155         CLDR of unicode.org has reference ordering of those characters, so
1156         our collation table extracts the sorting order from it.
1157         http://www.unicode.org/cldr/
1158
1159 **** Accent evaluation order
1160
1161         With French cultures, diacritical marks are evaluated in reverse order.
1162         French ordering does not affect some diacritics (the Japanese
1163         voice mark is not affected - FIXME: I doubt it, because the algorithm
1164         does not seem to allow it).
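
             The reverse evaluation can be illustrated with a small sketch
             that compares per-character level 2 weights either from the
             start or from the end. CompareSecondary is a hypothetical name
             and the weight values used in the note below are made up for
             illustration; they are not Windows values. It also assumes
             both inputs have the same primary letters, hence equal length.

```csharp
using System;

static class FrenchAccentSketch
{
    // Compares two arrays of level 2 (diacritic) weights, one entry
    // per character. With backwards = true the last difference
    // decides, as in French ordering; otherwise the first one does.
    public static int CompareSecondary (byte [] a, byte [] b, bool backwards)
    {
        if (a.Length != b.Length)
            throw new ArgumentException ("arrays must have equal length");
        for (int i = 0; i < a.Length; i++) {
            int idx = backwards ? a.Length - 1 - i : i;
            if (a [idx] != b [idx])
                return a [idx] < b [idx] ? -1 : 1;
        }
        return 0;
    }
}
```

             With { 0, 1, 0, 0 } for "c-o(circumflex)-t-e" and
             { 0, 0, 0, 2 } for "c-o-t-e(acute)", the forward comparison
             puts the second string first, while the backward comparison
             puts the first string first.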
1165
1166         Some other cultures might also have different ones, but not obvious.
1167
1168
1169 ** Mono implementation plans
1170
1171 *** Collator
1172
1173         CompareInfo contains many overloaded methods that are just for
1174         convenience. This class contains almost only required members.
1175
1176         This class also provides access to tailoring information which is
1177         culture instance dependent:
1178
1179         - French sorting
1180         - contractions/expansions - returns contraction or expansion
1181         - diacritical remapping
1182         - CJK custom mapping
1183
1184         For data area, see CollationDataStructures.txt for now.
1185
1186 *** UnicodeTable (for now MSCompatUnicodeTable)
1187
1188         Provides several accessors for character information related to
1189         the collation element table (of our own).
1190         FIXME: I want to fix some bugs in Windows collation table especially
1191         to not ignore some characters, but it requires table modification
1192         which results in further memory allocation. Maybe it would be done
1193         as a patch for the runtime (or classlib) sources.
1194
1195         - ignorable, ignorable nonspace, normalize width, normalize kanatype
1196         - level 4 sortkey provision method(s)
1197
1198 **** character comparison
1199
1200         Since a composite character is likely to *not have* an equivalent
1201         codepoint, character comparison cannot be done just by expecting a
1202         "resulting char" value.
1203         On the other hand, since a composite character is also likely to *have*
1204         an equivalent codepoint, character comparison cannot be done just by
1205         comparing "source char" values either.
1206
1207 ***** future optimizations
1208
1209         From where those codepoints differ, for each string it adjusts the
1210         position so that it represents exactly one character element. That is,
1211         find the primary character as the start of the range and the last
1212         nonprimary character as the end of the range.
1213
1214         Once Compare() has adjusted the character location to a valid
1215         comparison position, further comparison is done as usual,
1216         i.e. sortkey comparison considering comparisonLevel.
1217
1218 **** Characters in the table / characters computed
1219
1220         Currently I plan not to include the following characters in the table
1221         but to compute them on demand:
1222
1223         - PrivateUse
1224         - Surrogate
1225
1226 **** CJK Unified Ideographs
1227
1228         For CJK unified ideographs, I had to make those culture-dependent
1229         tables in memory. Since they came from some classical encodings, they
1230         are not computed. Thus, they are in a separate table.
1231
1232 **** Level 4: Kana type
1233
1234         The table does not contain level 4 (kanatype) properties for
1235         all characters. They can be simply computed.
1236
1237 **** Level 3: Case/Width properties
1238
1239         Case properties will be stored as a byte array, with limited areas of
1240         codepoint (cp < 3120 || FE00 < cp).
1241
1242         For Hangul characters, it will be computed by codepoint areas.
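
             One way to keep such a byte array compact is to map the two
             codepoint areas onto one contiguous index range. The sketch
             below assumes exactly the areas named above (cp < 3120 and
             FE00 < cp) within the BMP. ToIndex is a hypothetical name
             here and the real table layout may differ.

```csharp
static class Level3IndexSketch
{
    const int LowEnd = 0x3120;    // first area is cp < LowEnd
    const int HighStart = 0xFE00; // second area is HighStart < cp

    // Maps a code point to an index into a compact byte array,
    // or returns -1 when the code point is outside both areas.
    public static int ToIndex (int cp)
    {
        if (cp < 0 || cp > 0xFFFF)
            return -1;
        if (cp < LowEnd)
            return cp;
        if (cp > HighStart)
            return LowEnd + (cp - HighStart - 1);
        return -1;
    }
}
```

             The array then needs 3120 + 1FF (hex) entries instead of
             10000 (hex).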
1243
1244 **** Level 2: Diacritical properties
1245
1246         The table will hold one byte per character. If we provide
1247         non-buggy mode (Windows is buggy here by design; it just sums
1248         secondary weight values up), the values will come from UCA and
1249         non-blocking check will be introduced.
1250
1251         Note that Japanese voice marks are considered at level 2 but no need to
1252         have maps.
1253
1254
1255 ** Reference materials
1256
1257         Developing International Software for Windows 95 and Windows NT
1258         Appendix D Sort Order for Selected Languages
1259         http://www.microsoft.com/globaldev/dis_v1/disv1.asp?DID=dis33d&File=S24BF.asp
1260
1261         UTR#10 Unicode Collation Algorithm (It is still informative)
1262         http://www.unicode.org/reports/tr10/
1263
1264         UAX#15 Unicode Normalization
1265         http://www.unicode.org/reports/tr15/
1266         especially its canonical/compatibility equivalent characters might
1267         be informative to get those equivalent characters.
1268
1269         To know which character can be expanded, Unicode Character Database
1270         (UCD) is informative (it's informative but not normative to us)
1271         http://www.unicode.org/Public/UNIDATA/UCD.html
1272
1273         A decent char-by-char explanation is available here:
1274         http://www.fileformat.info/info/unicode/
1275
1276         Wine uses the UCA default element table, but has Windows-like character
1277         filtering support in its LCMapString implementation:
1278         http://cvs.winehq.com/cvsweb/wine/dlls/kernel/locale.c
1279         http://cvs.winehq.com/cvsweb/wine/libs/unicode/sortkey.c
1280
1281         Mimer has decent materials on culture specific collations:
1282         http://developer.mimer.com/collations/
1283
1284         This is written in Japanese, but awesome analysis on MS Access
1285         string sorting:
1286         http://www.asahi-net.or.jp/~ez3k-msym/comp/acccoll.htm