mcs/class/corlib/Mono.Globalization.Unicode/Collation-notes.txt

   1 * String collation
   2
   3 ** Summary
   4
   5         We are going to implement Windows-like collation, as opposed to ICU
   6         (which implements UCA, the Unicode Collation Algorithm).
   7
   8
   9 ** Tasks
  10
  11         * create collation element table(s)
  12                 - infer how Windows collation element table is composed
  13                   : mostly analyzed.
  14                 - write table generator source(s)
  15                   : mostly implemented. Need to fix nearly 400 mappings.
  16                     They are mainly 1) IPA extensions (U+250-U+300),
  17                     2) Latin extensions (U+1E00-U+1F00), 3) Letterlike
  18                     symbols (U+2100-U+2140), 4) some Cyrillic letters
  19                     (U+460-U+500), and 5) some Hangul characters.
  20                 - culture-specific sortkey data
  21                   : They are defined in mono-tailoring-source.txt.
  22                     All single sortkey remappings in all cultures are filled.
  23                     Contractions are not fully checked yet (should be filled
  24                     from UCA tailorings via create-tailorings.exe).
  25
  26
  27 ** How to implement CompareInfo members
  28
  29         GetSortKey() : done
  30                 Compute the sort key of every character element into a byte[].
  31         Compare() : done
  32                 Find first difference and compare it.
  33                 "Larger/smaller" matters (beyond "different").
  34         IsPrefix()
  35                 It calls CompareInternal() which also answers if the target
  36                 is fully consumed, so it just returns true if it says that
  37                 the target is fully consumed.
  38         IsSuffix()
  39                 It tries CompareInternal() to compare source and target at
  40                 the end, where source varies from minimum tail to the
  41                 original args.
  42         IndexOf(), LastIndexOf()
  43                 For character search, it finds the matching character element
  44                 to the end (or start) of the string to find.
  45                 For string search, it invokes one of private IndexOf() (or
  46                 LastIndexOf()) overload passing the first character element
  47                 of the target, and if found, tests if the sequence is a valid
  48                 start point, using IsPrefix() (or IsSuffix()).
  49
  50 *** Optimizations
  51
  52         For Compare() and IsPrefix(), it uses forward iteration, which moves
  53         forward and does not stop until either it finds the next "primary"
  54         character or it reaches the end of the string, checking with IsSafe(char).
  55
  56         For IndexOf(char) and LastIndexOf(char), there is no special
  57         optimization (since the codepoints usually do not match, while they
  58         often match under natural collation), but it omits extraneous sortkey
  59         value computation.
  60
  61         IsSuffix() reuses Compare() and returns false if it does not consume
  62         the target string more than 3 times. 3 is a kind of magic number that
  63         represents the longest expansion.
  64
  65         IndexOf(string) is implemented as a combination of IndexOf(char) and
  66         IsPrefix().
  67
  68         LastIndexOf(string) is implemented as a combination of
  69         LastIndexOf(char) and IsPrefix().
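
             The IndexOf(string) combination described above can be sketched
             roughly as follows. This is only an illustration built on the public
             CompareInfo API, not the actual collator code; the class name
             IndexOfSketch is hypothetical, and it assumes that the first char of
             the target is also its first character element (which is not true
             for contractions).

```csharp
using System.Globalization;

public static class IndexOfSketch
{
    // Hypothetical helper: find the first character with IndexOf(char),
    // then confirm the candidate start point with IsPrefix().
    public static int IndexOf(CompareInfo ci, string source, string target,
                              CompareOptions opt)
    {
        if (target.Length == 0)
            return 0;
        int start = 0;
        while (start < source.Length)
        {
            // Candidate position: next occurrence of the first character.
            int i = ci.IndexOf(source, target[0], start, opt);
            if (i < 0)
                return -1;
            // The candidate is valid only if the target is a prefix here.
            if (ci.IsPrefix(source.Substring(i), target, opt))
                return i;
            start = i + 1;
        }
        return -1;
    }
}
```

             LastIndexOf(string) would follow the same shape, scanning with
             LastIndexOf(char) from the end instead.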
  70
  71         Porting them to C code is an alternative possible approach, but from
  72         Compare() optimization experience, it is quick enough.
  73
  74 ** How to support CompareOptions
  75
  76         There are two kinds of "ignorance": strippers' ignorance and
  77         normalizers' ignorance.
  78
  79         The strippers will "filter characters out" and there will be no
  80         corresponding character elements in SortKey binaries.
  81
  82         Normalizers, on the other hand, map related characters onto a common
  83         form, so differences between unrelated characters remain in effect.
  84         For example, with IgnoreKanaType Hiragana "A" and Katakana "A" are
  85         not distinguished, but Hiragana "A" and Hiragana "I" are.
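
             A minimal sketch of the two kinds of "ignorance" follows. The names
             IgnoranceSketch, Strip and FoldKanaType are hypothetical and are not
             the real implementation; the kana folding assumes only the plain
             Hiragana block U+3041-U+3096, which sits 0x60 below the corresponding
             Katakana.

```csharp
using System;
using System.Text;

public static class IgnoranceSketch
{
    // Stripper: filtered characters vanish, so they leave no element
    // in the resulting sort key at all.
    public static string Strip(string s, Func<char, bool> isIgnorable)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
            if (!isIgnorable(c))
                sb.Append(c);
        return sb.ToString();
    }

    // Normalizer: Hiragana is folded onto Katakana, so Hiragana "A" and
    // Katakana "A" become equal while "A" and "I" stay different.
    public static char FoldKanaType(char c)
    {
        return (c >= '\u3041' && c <= '\u3096') ? (char)(c + 0x60) : c;
    }
}
```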
  86
  87         Actually, even without any IgnoreXXX flags (i.e. "None"), there are
  88         many characters that are ignored ("completely ignorable").
  89
  90         Except for LCID 101/1125(div), '\ufdf2' is completely ignorable.
  91         This rule even applies to CompareOptions.None.
  92
  93 *** Normalizers
  94
  95         IgnoreCase
  96                 Maybe culture-dependent TextInfo.ToLower() could be used.
  97
  98                 Unlike ICU (specialCaseToLower()), even with tr-TR(LCID 31)
  99                 and IgnoreCase, I\u0307 is not regarded as equal to i.
 100
 101         IgnoreKanaType
 102                 ToKanaTypeInsensitive(). Note that this does not cover the
 103                 whole "level 4" differences described later.
 104
 105         IgnoreWidth
 106                 ToWidthInsensitive(), which is likely to be culture
 107                 independent. See also "Notes".
 108
 109         IgnoreNonSpace (see also Strippers; this flag works in both sides)
 110                 For some cultures this logic is still incomplete. All culture-
 111                 dependent collators must handle valid "replacement" of "one or
 112                 more characters" which might be related to specific
 113                 CompareOptions.
 114                 For example, there is a Japanese text sorting rule that
 115                 nevertheless applies to InvariantCulture. Concretely,
 116                 "\u3042\u30FC" is equivalent to "\u3042\u3042" only when
 117                 IgnoreNonSpace is specified.
 118
 119                 I'll take those items from CLDR (those items which have
 120                 <reset before="..." />), case by case though.
 121
 122 *** Strippers
 123
 124         I already wrote all the required strippers which should be MS
 125         compatible (at least with .NET 1.1 invariant culture).
 126
 127         IgnoreNonSpace
 128                 IsIgnorableNonSpacing().
 129                 Some Diacritic characters are covered by this flag.
 130
 131                 There are some culture *dependent* characters:
 132                         LCID 90/1114(syr) : 64b, 652, 670
 133
 134         IgnoreSymbols
 135                 IsIgnorableSymbol().
 136                 UnicodeCategory does not work here.
 137
 138                 There are some culture *dependent* characters:
 139                         LCID 17/1041(ja) : 2015
 140                         LCID 90/1114(syr) : 64b, 652
 141
 142 *** StringSort
 143
 144         See "sort order categories" section.
 145
 146 ** ICU and UCA
 147
 148         First to note: we won't use collation element table from unicode.org.
 149
 150         There are many differences between ICU/UCA and Windows although they
 151         look so similar: collation keys in different levels, culture
 152         dependent composition, etc. Historically, Windows collation was
 153         designed before UCA was specified, so basically Windows is obsolete
 154         in this area.
 155
 156         - Logic: Unlike UCA it has no concept of "blocked" combining marks,
 157           and combining marks are never considered as an independent character
 158           (thus combining in Windows is buggy).
 159         - Data: Windows is based on an old Unicode standard (even older than 1.1).
 160           It ignores minor cultures. Character property values differ as well
 161           as those from the default Unicode collation element table (DUCET).
 162           In a few cultures Windows collation is close to the native language
 163           (e.g. Tamil, while it does not conform to TAM).
 164
 165         IgnoreWidth/IgnoreSymbols is processed after Kana voice mark
 166         decomposition (something like NFD, but not equivalent. Example: \u304C
 167         is completely equivalent to \u304B\u309B, which is not part of NFKD).
 168         <del>This means, if there are combined Kana characters, they will be
 169         first decomposed and then compared.</del> It scarcely matters since
 170         there are special weight data for Japanese.
 171
 172 *** Microsoft design problem
 173
 174         Microsoft implementation seems to have a serious problem that many,
 175         many characters that are used in specific cultures, such as
 176         Myanmar, Mongolian, Cherokee, Ethiopic, Tagalog, Khmer, are regarded as
 177         "completely ignorable".
 178
 179         I tagged many LAMESPEC items in the implementation (both in collator
 180         and table generator).
 181
 182
 183 ** MS collation design inference
 184
 185 *** Levels
 186
 187         Each character has several "weights". It is a common concept between
 188         Windows and UCA.
 189
 190         There are 5 levels:
 191
 192         - level 1: primary difference
 193           The first byte of level 1 means the category of the character.
 194         - level 2: diacritic difference, including Japanese voice mark countup
 195         - level 3: case/width sensitivity, and Hangul properties
 196         - level 4: kana weight (all of them have primary category 22, at least
 197           in InvariantCulture)
 198         - level 5: shift weight (apostrophe, hyphens etc.)
 199
 200         Note that these levels do not map one-to-one onto IgnoreXXX flags. Thus
 201         we cannot simply omit some levels of sortkey values according to
 202         CompareOptions.
 203
 204         String comparison is done from level 1 to 5. The comparison won't
 205         stop until either it finds a primary difference, or it reaches
 206         the end (in which case upper level differences are returned).
 207
 208         For example, "e" is smaller than "E", but "eB" is bigger than "EA".
 209         If the collator just returned the case difference at the first 'e' and
 210         'E', "eB" would wrongly be smaller than "EA".
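
             The rule can be illustrated with a toy two-level comparer. This is
             a sketch under stated assumptions, not the real collator: it handles
             ASCII letters only, uses the case-folded letter as the primary weight
             and lowercase < uppercase as the only upper-level weight. The class
             name LevelCompareSketch is hypothetical.

```csharp
using System;

public static class LevelCompareSketch
{
    public static int Compare(string a, string b)
    {
        int caseDiff = 0; // first upper-level difference seen so far
        int n = Math.Min(a.Length, b.Length);
        for (int i = 0; i < n; i++)
        {
            // A primary difference wins immediately.
            int primary = char.ToLowerInvariant(a[i]) - char.ToLowerInvariant(b[i]);
            if (primary != 0)
                return primary < 0 ? -1 : 1;
            // Remember the first case difference, but keep scanning.
            if (caseDiff == 0)
                caseDiff = (char.IsUpper(a[i]) ? 1 : 0) - (char.IsUpper(b[i]) ? 1 : 0);
        }
        if (a.Length != b.Length)
            return a.Length < b.Length ? -1 : 1;
        // No primary difference at all: the upper-level difference decides.
        return caseDiff;
    }
}
```

             With these toy weights, Compare ("e", "E") is negative while
             Compare ("eB", "EA") is positive, as in the example above.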
 211
 212 **** level 5: shift weight by StringSort
 213
 214         There are some characters that are treated specially. Namely they are
 215         apostrophe and hyphens. The sortkeys for them are put after level 4
 216         (thus here I write them as "level 5"). They have a different sort key
 217         format. See immediately below. There are no level 5 characters when
 218         StringSort is specified.
 219
 220 *** sort key format
 221
 222         00 means the end of sort key.
 223         01 means the end of the level.
 224         02-FF means the value.
 225         A trailing 02 can be cut when all the following values are
 226         also 02 (i.e. the sort key binary won't contain extraneous 02).
 227
 228         Every level has different key layout.
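
             As a rough sketch of the format above, the following helper splits
             a sort key into its levels. It assumes the layout described here:
             four 01 separators, then the level 5 bytes, then a terminating 00.
             The names SortKeyLevels, Split and TrimTrailing02 are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class SortKeyLevels
{
    // Returns five byte arrays: levels 1-4 (separated by 01) and level 5
    // (everything after the fourth 01, up to the terminating 00).
    public static byte[][] Split(byte[] key)
    {
        var levels = new List<byte[]>();
        var current = new List<byte>();
        foreach (byte b in key)
        {
            if (b == 0x00)
                break; // end of sort key
            if (b == 0x01 && levels.Count < 4)
            {
                levels.Add(current.ToArray()); // end of this level
                current.Clear();
            }
            else
                current.Add(b);
        }
        levels.Add(current.ToArray());
        while (levels.Count < 5)
            levels.Add(new byte[0]);
        return levels.ToArray();
    }

    // Drops trailing 02 values, mirroring the key-size optimization.
    public static byte[] TrimTrailing02(byte[] level)
    {
        int len = level.Length;
        while (len > 0 && level[len - 1] == 0x02)
            len--;
        byte[] result = new byte[len];
        Array.Copy(level, result, len);
        return result;
    }
}
```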
 229
 230 **** level 2
 231
 232         It looks like all level 2 keys are just accumulated, however without
 233         considering overflow. It sometimes makes sense (e.g. diaeresis and
 234         acute) but it causes many conflicts (e.g. "A\u0308\u0301" and "\u1EA6"
 235         are incorrectly regarded as equal).
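
             The conflict can be reproduced with plain byte addition. This sketch
             assumes the single-diacritic values listed under "level 2 details"
             (diaeresis 13, acute 0E, circumflex 12, grave 0F) and ignores any
             constant bias the real table may apply; Level2Sketch is a
             hypothetical name.

```csharp
public static class Level2Sketch
{
    // Adds level 2 weights in a single byte; overflow silently wraps,
    // which is the "without considering overflow" behavior noted above.
    public static byte Accumulate(params byte[] weights)
    {
        byte sum = 0;
        foreach (byte w in weights)
            sum = unchecked((byte)(sum + w));
        return sum;
    }
}
```

             Accumulate (0x13, 0x0E) and Accumulate (0x12, 0x0F) both give the
             same byte, so diaeresis + acute cannot be told apart from
             circumflex + grave.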
 236
 237         Anyway, since the Japanese voice mark has a level 2 value of 1, it just
 238         looks like the sum of voice marks.
 239
 240 **** level 3
 241
 242         The actual values are + 2 (e.g. for Hangul Normal Jamo, the value is 4)
 243
 244         For Korean letters:
 245                 - 2: Jongseong (11A8-11F9)
 246                 - 4: Half width? (FFA0-FFDC) and Compatibility Jamo? (3165-318E)
 247                 - 5: Compatibility Jamo (3130-3164)?
 248                      TODO: Learn about Korean characters.
 249
 250         For numbers:
 251                 - 4 circled inverse (2776-277F)
 252                 - 8 circled sans serif (2780-2789)
 253                 - C circled inverse && sans serif (278A-2793)
 254                 - 47 roman (2160-2182)
 255
 256         For Arabic letters:
 257                 - 2 Isolated form in presentation form B in FE80-FE8D
 258                 - 4 Alef/Bet/Gimel/Dalet (2135-2138)
 259                 - 8 Final form in presentation form B in FE82-FEF2
 260                 - 18 Medial form in presentation form B in FE8C-FEF4
 261                      Grep "ISOLATED", "FINAL" or "MEDIAL" on UnicodeData.txt
 262                      (and filter by codepoints).
 263                      or alternatively, see DerivedDecompositionType.txt.
 264                 - 22 6A9 (TODO: what is it?)
 265                 - 28 6AA (TODO: what is it?)
 266
 267         For other letters:
 268                 - 1 Fullwidth. UnicodeData.txt has <full>.
 269                 - 2 Subscript. UnicodeData.txt has <sub>.
 270                 - 8 Small capital, 03C2 (TODO: why?),
 271                     2104, 212B(flag=1A) (TODO: why?)
 272                     grep "SMALL CAPITAL" against UnicodeData.txt.
 273                 - C only FE42. TODO: what is this?
 274                 - E Superscripts. UnicodeData.txt has <super>.
 275                 - 10 Uppercase.
 276                      DerivedCoreProperties.txt has Uppercase property.
 277
 278         Note that simple 02 (value is 00) could be omitted.
 279
 280         Summary: at least 7 bits are required to represent a table -
 281         smallCapital, uppercase, normalization forms (2 bits:full/sub/super),
 282         arabic forms (2 bits:isolated/medial/final)
 283
 284 **** level 4
 285
 286         These sortkey data are collected only for Japanese (category 22)
 287         characters.
 288
 289         There are 3 sections, each of which ends with FF. Each of them
 290         represents the values character by character:
 291         - small letter type (kogaki moji); C4 (small) or E4 (normal)
 292         - category middle section:
 293                 two subsections separated by 0x02
 294                 - char type;
 295                   3 (normal)
 296                   or 4 (voice mark - \u309D,\u309E,\u30FD,\u30FE, \uFF70)
 297                   or 5 (dash mark - \u30FC)
 298                 - kana type; C4 (katakana) or E4 (hiragana)
 299         - width; 2 (normal) or C5 (full) or C4 (half)
 300
 301           LAMESPEC: those characters of value '4' in the middle section differ
 302           in level 2 wrt voice marks, but do not differentiate kana types
 303           (bug). It is ignored when IgnoreNonSpace applies.
 304
 305 **** level 5
 306
 307         UPDATED: I noticed offsetL does not exist, so removed it from here.
 308
 309         [offsetM + 0x80]? [const 3 + (offsetS + 1) * 4] [category] [level1]
 310
 311         where "offsetM" and "offsetS" represent the offset in the input
 312         string. The first byte (offsetM + 0x80) is always 0x80 or larger.
 313         LAMESPEC: This design results in a buggy overflow.
 314
 315         <xmp>
 316         byte [] data = new CultureInfo ("").CompareInfo.GetSortKey (s).KeyData;
 317         int idx = 0;
 318         for (int i = 0; i < 4; i++, idx++)
 319                 for (; data [idx] != 1; idx++)
 320                         ;
 321         for (; idx < data.Length; idx++)
 322                 Console.Write ("{0:X02} ", data [idx]);
 323         Console.WriteLine ();
 324         </xmp>
 325
 326         inputs (s) and results:
 327
 328         80 07 06 82 80 2F 06 82 00 // '-' + new string ('A', 10) + '-'
 329         80 07 06 82 81 97 06 82 00 // (100)
 330         80 07 06 82 8F A7 06 82 00 // (1000)
 331         80 07 06 82 9C 47 06 82 00 // (10000)
 332         80 07 06 82 9A 87 06 82 00 // (100000)
 333         80 07 06 82 89 07 06 82 00 // (1000000)
 334
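 335         The actual offset is 64 * offsetM + offsetS, with offsetM wrapping at
             0x80 (e.g. 81 97 gives 64 * 1 + 36 = 100, and 9C 47 gives
             64 * (28 + 128) + 16 = 10000).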
 336
 337         (const '3' may actually vary, but I have no idea how.
 338         At least 00, 01 and 02 are not acceptable: 00 and 01 are reserved,
 339         and although 02 is not reserved by the definition above, the key-size
 340         optimizer uses it as a special mark, as mentioned above.)
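
             The two offset bytes in the sample dumps can be reproduced with the
             sketch below. It is inferred only from the six samples above, taking
             n as the number of 'A' characters before the second hyphen (n = 0 for
             the first hyphen); how n relates to the string index in general is an
             assumption, and Level5Sketch is a hypothetical name.

```csharp
public static class Level5Sketch
{
    // Returns the first two level 5 bytes for offset n.
    public static byte[] EncodeOffset(int n)
    {
        int p = n + 1;
        // High part: wraps modulo 128, which is the overflow noted above.
        byte m = (byte)(0x80 + (p / 64) % 128);
        // Low part: const 3 plus a multiple of 4.
        byte s = (byte)(3 + (p % 64) * 4);
        return new byte[] { m, s };
    }
}
```

             For instance, EncodeOffset (0) yields 80 07 and EncodeOffset (100)
             yields 81 97, matching the dumps; 10000 wraps to 9C 47.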
 341
 342 *** sort key table
 343
 344         Here is the simple sortkey dumper:
 345
 346         public static void Main (string [] args)
 347         {
 348                 CultureInfo culture = args.Length > 0 ?
 349                         new CultureInfo (args [0]) :
 350                         CultureInfo.InvariantCulture;
 351                 CompareInfo ci = culture.CompareInfo;
 352                 for (int i = 0; i < char.MaxValue; i++) {
 353                         string s = new string ((char) i, 1);
 354                         if (ci.Compare (s, "") == 0)
 355                                 continue; // ignored
 356                         byte [] data = ci.GetSortKey (s).KeyData;
 357                         foreach (byte b in data) {
 358                                 Console.Write ("{0:X02}", b);
 359                                 Console.Write (' ');
 360                         }
 361                         Console.WriteLine (" : {0:X}, {1} {2}",
 362                                 i,
 363                                 Char.GetUnicodeCategory ((char) i),
 364                                 data [2] != 1 ? '!' : ' ');
 365                 }
 366         }
 367
 368 *** multiple character mappings
 369
 370         Some sequences of characters are considered as a "composite" that is
 371         to be composed either into another character or into another sequence
 372         of characters. Such a "composite" might not have a corresponding
 373         equivalent character in the sortkey.
 374         Similarly, some single characters are expanded to a sequence of
 375         characters.
 376
 377 **** diacritic characters
 378
 379         Except for those shift-weight characters, there are only
 380         diacritical (or other kinds of nonspacing) characters that don't
 381         have primary weights.
 382
 383         Diacritics are not regarded as a base character when placed after
 384         (maybe some kind of) letters.
 385
 386         The behavior is diacritic character dependent. For example, Japanese
 387         combination of a Kana character and a voice mark is compulsory (the
 388         resulting sort key is regarded as identical to the corresponding
 389         single character. Try \u304B\u309B with \u304C. It is invariant).
 390
 391         In French cultures, diacritic orderings are checked from right to left.
 392
 393 **** Composite character processing
 394
 395         There are some sequences of characters that are treated as another
 396         character or another sequence of characters.
 397
 398         By default, there is no composite form.
 399         http://www.microsoft.com/globaldev/dis_v1/disv1.asp?DID=dis33d&File=S24C2.asp
 400         (Note that composite is different from expansion.)
 401
 402         Note that composite characters are likely not to have an equivalent
 403         codepoint.
 404
 405 **** Expanded character processing
 406
 407         Some characters are expanded to two or more characters:
 408
 409         C6 (AE), E6 (ae), 1F1-1F3 (dz), 1C4-1C6 (Dz), FB00-FB06 (ff, fi),
 410         132-133 (IJ), 1C7-1C9 (LJ), 1CA-1CC (NJ), 152-153 (OE),
 411         DF (ss), FB06 (st), FB05 (\u017Ft), FE, DE, 5F0-5F2,
 412         1113-115F (hangul)
 413         (CJK extension is not really expanded)
 414
 415         They do not match any of the Unicode normalization forms.
 416
 417         Some alphabetic cultures have different mappings, but mostly small
 418         (at least da-DK, lt-LT, fr-FR, es-ES have tiny differences).
 419
 420         Invariant culture also puts Czech unique character \u0161 between s
 421         and t, unlike described here:
 422         http://www.microsoft.com/globaldev/dis_v1/disv1.asp?DID=dis33d&File=S24C0.asp
 423
 424 *** default sort key table
 425
 426 **** StringSort
 427
 428         When CompareOptions.StringSort is specified, then it modifies
 429         characters in category 2 from "1 1 1 1 80 07 06 xx" to
 430         "06 xx yy zz" and some characters become case sensitive.
 431
 432         For details, see the "level 5" description above.
 433
 434         To handle them simply, they are laid out as "category 0x01" (which
 435         never happens in the actual sortkeys) for those shift-weight ones
 436         in the table.
 437
 438         There seem to be no further differences between StringSort and None.
 439
 440 **** level 2 details
 441
 442         Known value maps:
 443                 -0A: Korean parenthesized numbers (3200-321C)
 444                 -0C: Korean circled numbers (3260-327B)
 445
 446                 -03: Japanese voice mark
 447
 448                 <primary category 13 : Arabic>
 449                 -08: 627-648 (basic Abjad letters)
 450                 -09: madda (622)
 451                 -05: waw with hamza (624)
 452                 -07: yeh with hamza (626. ignore Presentation Form A area)
 453                 -0A: alef with hamza above (623)
 454                 -0A: alef with hamza below (625)
 455
 456                 <primary category 0E : diacritics>
 457                 Characters in non "0E" category are out of scope.
 458                 They can be grepped in UnicodeData.txt.
 459                 -0E: acute
 460                 -0F: grave
 461                 -10: dot above
 462                 -11: middle dot
 463                 -12: circumflex
 464                 -13: diaeresis
 465                 -14: caron
 466                      Note that 1C4-1C6 are covered but they are also expanded.
 467                 -15: breve (cyrillic are also covered? at least 4C1/4C2 are.)
 468                 -16: dialytika and tonos (category 0F though)
 469                 -17: macron
 470                 -19: tilde
 471                 -1A: ring above | 212B
 472                 -1B: ogonek ("WITH OGONEK;")
 473                 -1C: cedilla ("WITH CEDILLA;")
 474                 -1D: double acute | acute and dot above
 475                 -1E: stroke, except for 0E[1F] and cp{19B, 1BE} |
 476                      circumflex and acute | 18B,18C,19A,289
 477                      (i.e. this is not a one-to-one mapping: not every
 478                      "stroke" is mapped to 1E, and not every 1E is a
 479                      "stroke".)
 480                 -1F: diaeresis and acute | with circumflex and grave | l slash
 481                         beware "symbol slash"
 482                 -20: diaeresis and grave | 19B,19F
 483                 -21: breve and acute | D8,F8
 484                 -22: caron and dot above | breve and grave
 485                 -23: macron and acute
 486                 -24: macron and grave
 487                 -25: diaeresis and caron | dot above and macron | tilde and acute
 488                 -26: ring above and acute
 489                 -28: diaeresis and macron | cedilla and acute |
 490                      macron and diaeresis
 491                 -29: circumflex and tilde
 492                 -2A: tilde and diaeresis
 493                 -2B: stroke and acute
 494                 -2C: breve and tilde
 495                 -2F: cedilla and breve
 496                 -30: ogonek and macron
 497                 -43: hook, except for cp{192,1B2,25A,25D,27B,28B,2B1,2B5} |
 498                      left hook | with hook above except for cp{1EF6,1EF7} |
 499                      27D,284
 500                 -44: double grave | 1EF6,1EF7
 501                 -46: inverted breve
 502                 -48: preceded by apostrophe (actually only 149)
 503                 -52: horn
 504                 -55: line below | circumflex and hook above
 505                 -57: palatal hook (actually only 1AB)
 506                 -58: dot below except for cp{1EA0,1EA1}
 507                 -59: "retroflex" (without "WITH") | diaeresis below | 1EA0,1EA1
 508                 -5A: ring below | 1E76,1E77
 509                 -60: circumflex below except for cp{1E76,1E77} | horn and acute
 510                 -61: breve below | horn and grave
 511                 -63: tilde below | 2125
 512                 -68: D0,F0,182,183 | dot below and dot above | topbar
 513                 -69: right half ring | horn and tilde
 514                 -6A: circumflex and dot below
 515                 -6D: breve and dot below
 516                 -6E: dot below and macron
 517                 -95: horn and hook above
 518                 -AA: horn and dot
 519
 520                 (for 01-0D and 7B-8A, they are not related to diacritics.)
 521
 522                 <category BlahBlahNumbers from 0100 to 1000>
 523                 -38: Arabic-Indic numbers (660-669)
 524                 -39: extended Arabic-Indic numbers (6F0-6F9)
 525                 -3A: Devanagari numbers (966-96F)
 526                 -3B: Bengali numbers (9E6-9EF)
 527                 -3C: Bengali currency enumerators (9F4-9F9)
 528                 -3D: Gurmukhi numbers (A66-A6F)
 529                 -3E: Gujarati numbers (AE6-AEF)
 530                 -3F: Oriya digit numbers (B66-B6F)
 531                 -40: Tamil numbers (BE7-BF2)
 532                 -41: Telugu numbers (C66-C6F)
 533                 -42: Kannada numbers (CE6-CEF)
 534                 -43: Malayalam numbers (D66-D6F)
 535                 -44: Thai numbers (E50-E59)
 536                 -45: Lao numbers (ED0-ED9)
 537                 <miscellaneous numbers>
 538                 -47: Roman numbers (2160-2182)
 539                 -4E: Hangchou numbers (3021-3029)
 540
 541                 -E0[64]: 2107 (Euler)
 542                 -E0[87]: some Tone letters (TONE TWO / TONE SIX)
 543                 -EE: Circled letter-or-digits and katakanas
 544                         CIRCLED {DIGIT|NUMBER|LATIN|KATAKANA}
 545                         numbers (2460-2473,2776-2793,24EA)
 546                         latin (24B6-24E9)
 547                         katakana (32D0-32FE)
 548                 -F3: Parenthesized enumerations
 549                         numbers (2474-2487)
 550                         latin (249C-24B5)
 551                         PARENTHESIZED {DIGIT|NUMBER|LATIN}
 552                 -F4: Numbers with dot (2488-249B)
 553                         {DIGIT|NUMBER} * FULL STOP
 554
 555                 <miscellaneous>
 556                 -258,25C-25E,285,286,29A,297 -> 0E[80-86,88]
 557                 -27F,2B3-2B6 -> 0E 8A[80-84]
 558                 -3D3 -> 0F[44]
 559                 -476,477 -> 10[46]
 560                 -215F -> 0C[03]
 561
 562                 -20D0-20E1 -> 01[DD-F0]
 563                 -483-486 -> 01[94-97]
 564                 -559,55A -> 01[98,99]
 565                 -711 -> 01[9A]
 566
 567                 -346-348,2BE-2C5,2CE-2CF -> 01[74-7F]
 568                 -2D1-2D3,2DE,2E4-2E9 -> 01[81-8A]
 569
 570                 -342,343 -> 01[8D,8E]
 571                 -345 -> 01[90]
 572
 573                 -700-780 01[8D-AF]. Maybe there is some kind of traditional
 574                 order in Estrangela, but for now I am not sure.
 575                 /*
 576                 -740-742 -> 01[8D-8F]
 577                 -747,748,732,735,738,739,73C,73F,743-746,730 -> 01[90,91,94-9F]
 578                 -731,733,734,736,737,73A,73B,73D,73E,749,74A,7A6-7A9
 579                  -> 01[A0-AA,AC-AF]
 580                 */
 581                 -7AA-7B0 -> 01[B0-B6]
 582
 583                 -591-5C2 except for 5BA,5BE -> 01[03-33] in order
 584
 585                 No further patterns for >= 80
 586
 587                 TODO: Below are not done yet:
 588                         - x < 0x80 in non-"0E" part
 589                         - 03 <= x <= 0D in "0E" part
 590                         - 7B <= x <= 7F in "0E" part
 591
 592 **** sortkey details by category
 593
 594         1 specially ignored ones (Japanese, Tamil, Thai)
 595
 596                 IdentifyBy: constants
 597                 Unicode: 3099-309C, BCD, E47, E4C, FF9E, FF9F
 598                 SortKey: 01 01 01 01 00
 599
 600         2 shift weight characters
 601
 602         They are either at 01 01 01 01 or 06, depending on StringSort. For
 603         convenience, I use 06 to describe them.
 604
 605         2.1 control characters (specified as such in Unicode), except for
 606         whitespaces (0009-000D).
 607
 608                 ProcessAfter: 4.1
 609                 IdentifyBy: UnicodeCategory.Control
 610                 Unicode: 0001-000F minus 0009-000D, 007F-009F
 611                 SortKey: 06 03 - 06 3D
 612
 613         2.2 Apostrophe
 614                 IdentifyBy: constant
 615                 Unicode: 0027,FF07 (')
 616                 SortKey: 06 80 (and width insensitive equivalents)
 617
 618         2.3  minus sign, hyphen, dash
 619           minus signs: FE63, 207B (super), 208B (sub), 002D, FF0D (full-width)
 620           hyphens: 00AD (soft), 2010, 2011 (nonbreaking) ... Unicode HYPHEN?
 621           dashes, horizontal bars: FE58 ... UnicodeCategory.DashPunctuation
 622
 623                 IdentifyBy: UnicodeCategory.DashPunctuation
 624                 SortKey: 06 81 - 06 90 (and nonspace equivalents)
 625
 626         2.4 Arabic spacing and equivalents (64B-652, FE70-FE7F)
 627           They are part of nonspacing mark, but not equal.
 628
 629                 SortKey: 06 A0 - 06 A7 (and nonspace equivalents)
 630
 631         3 nonprimary characters, mixed.
 632
 633           ModifierSymbol, except for that are not in category 0 and "07" area
 634           (i.e. < 128) nor those equivalents
 635
 636           NonSpacingMark which is ignorable (IsIgnorableNonSpacing())
 637           // 30D, CD5-CD6, ABD, 2B9-2C5, 2C8, 2CB-2CD, 591-5C2. NonSpacingMark in
 638           // 981-A3C. A4D, A70, A71, ABC ...
 639
 640           TODO: I need more insight to write table generator.
 641
 642           SortKey: 01 03 01 - 01 B6 01
 643
 644           This part of MS table design is problematic (buggy): \u0592 should
 645           not be equal to \u09BC.
 646
 647           I guess, this buggy design is because Microsoft first thought that
 648           there won't be more than 255 characters in this area. Or they might be
 649           aware of the problem but prefer table optimization.
 650
 651           Ideal solutions:
 652
 653           1) We should not mix those codes (make things sequential) and should
 654              expand level 2 length to 2 bytes. Instead of having a direct value,
 655              we could use an index (pointer) into a zero-terminated level 2 table.
 656
 657           2) Include those characters from minor cultures here.
 658
 659           If in "discriminatory mode", those tables could be still provided
 660           as to be compatible to Windows.
 661
 662           Additionally there seem to be some bugs around Modifier letter collection.
 663           For example, 2C6 should be a nonspacing diacritical character but it
 664           is regarded as a primary character. The same applies to Mandarin
 665           tone marks (2C9-2CB) (and there are plenty of such characters).
 666
 667         4 space separators and some kind of marks
 668
 669         4.1 whitespaces, paragraph separator etc.
 670           UnicodeCategory.SpaceSeparator : 20, 3000, A0, 9-D, 2000-200B
 671
 672           SortKey : 07 02 - 07 18
 673
 674         4.2 some OtherSymbols: 2422-2423
 675
 676           SortKey : 07 19 - 07 1A
 677
 678         4.3 ASCII compatible marks ('!', '^', ...)
 679           Non-alpha-numeric < 0x7F except for [[+-<=>']]
 680           small compatibility equivalents -> itself, wide
 681
 682         4.4 other marks
 683           FIXME: how to identify them?
 684           some Punctuations: InitialQuote/FinalQuote/Open/Close/Connector
 685           remaining Punctuations: 9xx, 7xx
 686           3003, 3006, 2D0, 10FB
 687           remaining Punctuations: 9xx, 7xx
 688           70F (Format)
 689
 690           SortKey : 07 1B - 07 F0
 691
 692         5 mathematical symbols
 693           InitialQuotePunctuation and FinalQuotePunctuation in ASCII
 694           (not Quotation_Mark property in PropList.txt ; 22, 27)
 695
 696           byte area MathSymbol: 2B,3C,3D,3E,AB,B1,BB,D7,F7 except for AC
 697           some MathSymbol (2044, 208A, 208C, 207A, 207C)
 698           OtherLetter (1C0-1C2)
 699           2200-22FF MathSymbol except for 221E (INF. ; regarded as a number)
 700
 701           SortKey : 08 02 - 08 F8
 702
 703         6 Arrows and Box drawings
 704           09 02 .. 09 7C : 2300-237A
 705                         only primary differences
 706           09 BC ... 09 FE : 25A0-AB, 25E7-EB, 25AC-B5, 25EC-EF, 25B6-B9,
 707                         25BC-C3, 25BA-25BB, 25C4-25D8, 25E6, 25DA-25E5
 708                         21*,25*,26*,27*
 709                         This area contains level 2 values.
 710           2190- (non-codepoint order)
 711                 note that there are many compatibility equivalents
 712           2500- except for 266F (#)
 713
 714           SortKey : 09 02 - 09 7C, 09 BC 01 03 - 09 BC 01 13,
 715                     09 {BD|BE|BF} 01 {03|04}, ...
 716                     TODO: fill the patterns
 717
 718         7 currency symbols and some punctuations
 719           byte CurrencySymbols except for 24 ($)
 720           byte OtherSymbols (A7-B6)
 721           ConnectorPunctuation - 2040 (i.e. FF65, 30FB)
 722           OtherPunct/ConnectorPunct/CurrencySymbol 2020-20AC - 20AC
 723           OtherSymbol 3012-303F,3004,327F
 724           MathSymbol/OtherSymbol 2600-2767 (math = 266F)
 725           OtherSymbol 2440-244A, 2117
 726           20AC (CurrencySymbol)
 727
 728           SortKey : 0A 02 - 0A FB
 729
 730         8 (C) numbers
 731           all DecimalDigitNumber, LetterNumber, non-CJK OtherNumber.
 732           9F8.
 733           digits, in numeric order. We can use NET_2_0 CharUnicodeInfo.
 734           221E. (INF.)
 735
 736           SortKey : 0C 02 (9F8), 0C 03 - 0C E1 (normal numbers), 0C FF (INF.)
 737
 738         9 (E) latin letters (alphabets), mixing alphabetical symbols
 739           Alphabets, A to Z, mixing alphabetical symbols. See below.
 740           F8-2B8 except for (1BB-1BD and 1C0-1C3), but not sequential.
 741           2E0-2E3.
 742
 743           For diacritical orders, see level 2.
 744
 745           For 'A' it is "0E 02", for 'B' "0E 09" ... 'Z' "0E A9", ezh "0E AA".
 746           0E B3 (1BE), 0E B4 (298)
 747
 748           There are CJK compatibility characters (3800-) and letterlike
 749           symbols (2100-) in those A-to-Z area, ordered by character name.
 750
 751           Primary weights are sometimes culture-dependent.
 752                 FIXME: [0E 0D], [0E 0E], [0E 4B], [0E 75], [0E B2] are unknown
 753                 02: A
 754                 03: C4 in sk|vi
 755                 04: C1 in is|pl|vi
 756                 05-08: CJKext
 757                 09: B
 758                 0A: C
 759                 0B: 10D in hr|lt|lv|pl, 107 in pl
 760                 0C: C7 in az|tr, 10D in cs|sk, 106 in hr
 761                 0F-19: CJKext
 762                 1A: D
 763                 1B: 189 (African D)
 764                 1C: 2A3 (DZ Digraph)
 765                 1D: 1C6 (dz) in hr
 766                 1E: 110 (D with stroke) in hr
 767                 1F-20: CJKext
 768                 21: E
 769                 22: 18F=259 in az, E9 in is, 119 in pl, EA in vi, 1EBE-1EC7 in vi
 770                 23: F
 771                 24: CJKext
 772                 25: G
 773                 26: 11F in az|tr, 123 in lv
 774                 28-2B: CJKext
 775                 2C: H
 776                 2D: 267 (Heng with hook)
 777                 2E: 33CB in az, 33CA in tr
 778                 2F-31: CJKext
 779                 32: I
 780                 33: CD in is, 79 in lt
 781                 34: CJKext
 782                 35: J
 783                 36: K
 784                 37-47: CJKext
 785                 48: L
 786                 49: 2114
 787                 4A: 1C9 in hr
 788                 4C: 142 in pl
 789                 4D-50: CJKext
 790                 51: M
 791                 52-6F: CJKext
 792                 70: N
 793                 71: 2116
 794                 72: 144 in pl
 795                 73: F1 in es, 1CC in hr
 796                 74: 14B
 797                 76-7B: CJKext
 798                 7C: O
 799                 7D: F6 in az|hu|tr, 151 in hu, F3 in is|pl, F4 in sk|vi, 1ED0-1ED9 in vi
 800                 7E: P
 801                 7F-88: CJKext
 802                 89: Q
 803                 8A: R
 804                 8B: 211E
 805                 8C: 211F
 806                 8D: 159 in cs|sk
 807                 8E-90: CJKext
 808                 91: S
 809                 92: 2108
 810                 93: 2120
 811                 94-95: CJKext
 812                 96: 17F (LATIN SMALL LONG S)
 813                 97: 15F in az|tr, 161 in cs|hr|lt|lv|sk|sl, 7A,179-17C in et, 15B in pl
 814                 98: 17E in et, 15F in ro, 15B in sl
 815                 99: T
 816                 9A: 2121
 817                 9B: CJKext
 818                 9C: 2122
 819                 9D: 2A6
 820                 9E: 166
 821                 9F: U
 822                 A0: FA in is, 1B0,1EE8-1EF1 in vi
 823                 A1: FC in az|tr, 56,57 in et, FC,171 in hu, FB in vi
 824                 A2: V
 825                 A3: 2123
 826                 A4: W
 827                 A5: CJKext
 828                 A6: X
 829                 A7: Y
 830                 A8: FD in is
 831                 A9: Z
 832                 AA: 292
 833                 AB: DE in is, 17E in lt|lv, 17A in pl
 834                 AC: E6 in da|is, 1E3 in is, 17C in pl, 17E in sl
 835                 AD: 17E in cs|hr|sk, E5 in fi, F6,F8 in is, 17A in sl
 836                 AE: F6,F8,151 in da
 837                 AF: E4 in fi
 838                 B0: F6,F8,151 in fi
 839                 B1: E5 in da, "aa" in da
 840                 B3: 1BE
 841                 B4: 298
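
               A minimal C# sketch of how such culture-dependent primary weights
               could be looked up. The rows are copied from the list above; the
               names LatinTailoring and TryGetPrimary are hypothetical, and a
               miss simply means that the default table applies.

```csharp
using System.Collections.Generic;

public static class LatinTailoring
{
    // Key: "culture:codepoint (hex)". Value: level 1 weight in category 0E.
    // Only a few rows from the notes are reproduced here.
    static readonly Dictionary<string, byte> map = new Dictionary<string, byte> {
        { "az:C7", 0x0C }, { "tr:C7", 0x0C },   // 0C: C7 in az|tr
        { "cs:10D", 0x0C }, { "sk:10D", 0x0C }, // 0C: 10D in cs|sk
        { "hr:106", 0x0C },                     // 0C: 106 in hr
        { "es:F1", 0x73 }, { "hr:1CC", 0x73 },  // 73: F1 in es, 1CC in hr
        { "is:FD", 0xA8 },                      // A8: FD in is
    };

    // Returns true and the tailored weight if the culture remaps the character.
    public static bool TryGetPrimary(string culture, int cp, out byte weight)
    {
        return map.TryGetValue(culture + ":" + cp.ToString("X"), out weight);
    }
}
```

               With these rows, C7 in tr gets [0E 0C], which falls between
               C [0E 0A] and D [0E 1A].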
 842
 843         10 culture dependent letters (general)
 844           0F: 386-3F2 ... Greek and Coptic
 845                 386-3CF: [0F 02] - [0F 19] (consider primary equivalents)
 846                 3D0-3EF: [0F 40] - [0F 54]
 847           10: 400-4E9 ... Cyrillic.
 848                 For 400-45F and 4B1, they are mostly UCA DUCET order.
 849                 After that 460-481 follows, by codepoint.
 850                 (490-4FF except for 4B1 and Cyrillic supplementary are unused.)
 851           11: 531-586 ... Armenian.
 852                 Simply sorted by codepoint (handle case).
 853           12: 5D0-5F2 ... Hebrew.
 854                 Codepoint order (handle case).
 855           13: 621-6D5 plus 670 (NonSpacingMark) ... Arabic
 856                 Area 1:
 857                 They look like ordered by Arabic Presentation Form B except
 858                 for FE95, and considering diacritical equivalents maybe based
 859                 on the primary character area (621-6D5).
 860                 There are still some special characters: 67E,686,698,6AF ...
 861                 which might not have equivalent characters (I wonder how they
 862                 are inserted into the presentation form B map).
 863
 864                         Solution:
 865                         - hamza, waw, yeh (621,624,626) are special: [13 07]
 866                         - For all remaining letters, get primary letter name
 867                           and store it into dictionary. If unique, then
 868                           increment index by 4 from [13 0B]
 869                 Area 2:
 870                 674-6D5 : by codepoint from [13 84].
 871           14: 901-963 exc. 93A-93D 950-954 ... Devanagari.
 872                 For <905 codepoint order, x2 from [14 04].
 873                 For 905-939 codepoint order, x4 from [14 0B].
 874                 For 93E-94D codepoint order, x2 from [14 DA].
 875           15: 982-9FA ... Bengali. Actually all UnicodeCategories except for
 876                 NonSpacingMark, DecimalDigitNumber and OtherNumber.
 877                 For <9E0 simple codepoint order from [15 02].
 878                 For >9E0 simple codepoint order from [15 3B].
 879           16: A05-A74 exc. A3C A4D A66-A71 ... Gurmukhi.
 880                 The same as UCA order, x4 from [16 04].
 881           17: A81-AE0 exc. ABC-ABD ... Gujarati.
 882                 Mostly equivalent to UCA, but insert {AB3,A81-A83} before AB9,
 883                 x4 from [17 04].
 884           18: B00-B70 ... Oriya
 885                 All but NonSpacingMark and DecimalDigitNumber, by codepoint.
 886           19: B80-BFF ... Tamil
 887                 BD7 is special : [19 02].
 888                 B82-B93 (vowels) : x2 from [19 0A].
 889                 B94 (vowel AU) : [19 24]
 890                 For consonant order Windows has native Tamil order which is
 891                 different from UCA.
 892                 http://www.nationmaster.com/encyclopedia/Tamil-alphabet
 893                 (The order is still different in "Grantha" order from TAM.)
 894                 So, we should just hold constant array for consonants.
 895                 And put them in order, x4 from [19 26].
 896                 BBE-BCC : SpacingCombiningMark and BC0 ... x2 from [19 82].
 897           1A: C00-C61 ... Telugu.
 898                 C55 and C56 are ignored (C5x line and remaining part of C6x
 899                 line just look like ignored).
 900                 C60 and C61 are specially placed. C60 after C0B, C61 after C0C.
 901                 Except for above, by codepoint, x3 from [1A 04].
 902           1B: C80-CE5 ... Kannada.
 903                 CD5,CD6 (and CE6-CEF: DecimalDigitNumber) are ignored.
 904                 by codepoint, x3 from [1B 04].
 905           1C: D02-D40 ... Malayalam.
 906                 by simple codepoint from [1C 02].
 907           (1D: Sinhala ... totally ignored?)
 908           1E: E00-E44 ... Thai.
 909                 preceding vowels (E40-E44) by codepoint [1E 02 - 1E 06]
 910                 consonants (E01-E2A) by codepoint, x6 from [1E 07].
 911           1F: E2B-E5B,E80-EDF ... Thai / Lao. (Thai breaks the category wall.)
 912                 Thai:
 913                 remaining consonants (E2B-E2E) by codepoint, x6 from [1E 07].
 914                 remaining vowels (E2F-E3A) by codepoint.
 915                 E45,E46,E4E,E4F,E5A,E5B
 916                 Lao:
 917                 E80-EDF by codepoint from [1F 02].
 918           21: 10A0-10FF ... Georgian
 919                 Mostly equal to UCA order, but swap 10E3 <-> 10F3,
 920                 x5 from [21 05].
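
               The "xN from [cc ww]" notation in the list above can be sketched
               in C# as follows, assuming it means that the first character gets
               level 1 weight ww and each following character gets N more. The
               names SteppedWeights, Stepped and ThaiLevel1 are hypothetical, and
               only the Thai entries of category 1E are covered.

```csharp
public static class SteppedWeights
{
    // "xN from [cc ww]": the first character gets level 1 weight ww and
    // each following character gets N more (this reading is an assumption).
    public static int Stepped(int cp, int firstCp, int startWeight, int step)
    {
        return startWeight + (cp - firstCp) * step;
    }

    // Thai, category 1E: preceding vowels E40-E44 take 02-06, consonants
    // E01-E2A are "x6 from [1E 07]". Returns -1 for anything else.
    public static int ThaiLevel1(int cp)
    {
        if (cp >= 0x0E40 && cp <= 0x0E44)
            return Stepped(cp, 0x0E40, 0x02, 1);
        if (cp >= 0x0E01 && cp <= 0x0E2A)
            return Stepped(cp, 0x0E01, 0x07, 6);
        return -1;
    }
}
```

               Under this reading the last consonant, E2A, lands on FD
               (07 + 41 * 6), which still fits into one byte.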
 921
 922         11 (22) japanese kana letters and symbols, not in codepoint order
 923
 924           For single character, the sortkeys look like:
 925           - Katakana normal A, Half Width (FF71) : FF 02 C4 FF C4 FF 01 00
 926           - Katakana normal A, Full Width (30A2) : FF C4 FF 01 00
 927           - Hiragana normal A, Full Width (3042) : FF FF 01 00
 928
 929           Actually for level 4 weights, there is a different rule (see
 930           "level 4" format above).
 931
 932           There is also 32D0 (normal katakana A with circle) that has a
 933           diacritic difference.
 934
 935           For primary weights, 'A' to 'O' are mapped to 22-26, 'Ka' to 'Ko'
 936           to 2A-2E, 'Sa' to 'So' to 32-36 ... and so on.
 937           'Nn' is special: [22 80].
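
               The three rows quoted above differ by 8 each, so the pattern can
               be written as a small C# sketch. Extending the same stride to the
               later rows is an assumption, and the names KanaWeights and Level1
               are hypothetical.

```csharp
public static class KanaWeights
{
    // row: 0 = A-row, 1 = Ka-row, 2 = Sa-row, ...; vowel: 0..4 = a, i, u, e, o.
    // From the notes: A..O -> 22..26, Ka..Ko -> 2A..2E, Sa..So -> 32..36,
    // i.e. each row starts 8 above the previous one. Applying that stride
    // to the later rows is an assumption.
    public static int Level1(int row, int vowel)
    {
        return 0x22 + row * 8 + vowel;
    }

    // 'Nn' does not follow the pattern: its value is 80.
    public const int Nn = 0x80;
}
```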
 938
 939           After Kana characters, there are CJK compat characters.
 940           From 22 97 01 01 01 01 00 (3349) to 22 A6 01 01 01 01 00 (333B) are
 941           sorted in JIS table order (CP932.TXT). Remaining square characters
 942           are maybe sorted in Alphabetic order.
 943
 944           UCA DUCET also does not apply here.
 945
 946         12 (23) bopomofo letters
 947                 3105-312C: simple codepoint order from [23 02].
 948
 949         13 culture dependent letters 2
 950                 710-72C : Estrangela (ancient Syriac).
 951                         codepoint order.
 952                         711 is excluded (superscript).
 953                         714,716,71C,724 and 727 are "alternative" characters.
 954                         SortKey: [24 0B]-[24 60], by x where x is 2 for those
 955                         which is "alternative" defined above, otherwise 4.
 956
 957                 780-7A5 : Thaana
 958                         Equals to UCA order, x2 from [24 6E].
 959
 960         (Maybe we should add remaining minor-culture characters here. Tibetan,
 961         Limbu, Tagalog, Hanunoo, Buhid, Tagbanwa, Myanmar, Khmer, Tai-Le,
 962         Mongolian, Cherokee, Canadian-Aboriginal, Ogham, Runic are ignored)
 963
 964         14 (41-45) surrogate Pt.1
 965
 966         15 (52 02-7E C8) hangul, mixing combined ones
 967
 968           It starts from 1100. After width-insensitive equivalents, those
 969           syllables (from AC00) follow (until D7A3).
 970           The order seems to be based on some formula (though sometimes it
 971           does not look so, e.g. 1117). FIXME: this area should be clarified more.
 972
 973           Hangul syllables should not be filled in the table. Instead, they
 974           can be easily computed by the following formula:
 975
                     // rc is the codepoint of the input syllable
                     // (p holds "(category << 8) + level1weight")
                     int ri = ((int) rc - 0xAC00) + 1;
                     ushort p = (ushort)
                             ((ri / 254) * 256 + (ri % 254) + 2);
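
               For reference, the same arithmetic wrapped in a method, so that
               it can be tried on concrete syllables. The names
               HangulSyllableWeights and Compute are hypothetical; how the
               result is combined with the 52.. category base is not shown in
               these notes and is left out here.

```csharp
public static class HangulSyllableWeights
{
    // Same arithmetic as the snippet above: 254 values per high byte,
    // low byte running from 02 to FF.
    public static ushort Compute(char rc)
    {
        int ri = ((int) rc - 0xAC00) + 1;
        return (ushort) ((ri / 254) * 256 + (ri % 254) + 2);
    }
}
```

               For U+AC00, ri is 1 and the result is 0003. For U+D7A3, ri is
               11172 = 43 * 254 + 250, so the result is 2BFC. The low byte never
               becomes 00 or 01, presumably to keep clear of the bytes used as
               terminator and level separator in sort keys.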
 981
 982           Hangul Jamo cannot be filled in the table directly, since
 983           U+1113 - U+1159 hold additional primary key bytes.
 984           FIXME: find out how they can be computed.
 985           See http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icuhtml/design/collation/ICU_collation_design.htm?rev=HEAD&content-type=text/html#Hangul_Implicit_CEs
 986
 987         16 (9E 02-F1 E4) CJK
 988
 989            9E 02-F0 B4 [3192-319F,3220-3243,3280-32B0,4E00-9FA5] : CJK mark,
 990                 parenthesized CJK (part), circled CJK (part), CJK ideograph.
 991                 Ordered but considering compatible characters (i.e. there is
 992                 no other way than having massive mapping).
 993            F0 B5-F1 E4 [F900-FA2D]. CJK compatibility ideograph.
 994
 995            LAMESPEC: in the latest spec CJK ends at 9F BB. Since MS table
 996            joins these two categories without any consideration, it is
 997            impossible to insert those new characters without breaking binary
 998            compatibility.
 999
1000         17 (E5 02-FE 33) PrivateUse.
1001
1002            In fact it overlaps with CJK characters (maybe a layout design failure).
1003
1004         18 (F2 01-F2 31) surrogate Pt.2
1005
1006            In fact it overlaps with PrivateUse (maybe a layout design failure).
1007
1008         19 (FE FF 10 02 - FE FF 29 E9) CJK extensions
1009
1010            3400-4DB5. Ordered.
1011
1012            They should be computed: this range has to be checked anyway (the
1013            sortkey values cannot be taken directly from the table since they
1014            need the FE FF part), and the values are easy to compute.
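
                A possible computation, sketched in C#. The layout (254 values
                per row, second byte running from 02 to FF) is inferred from the
                two endpoints in the heading above and is not confirmed
                elsewhere in these notes. The names CjkExtensionWeights and
                Compute are hypothetical.

```csharp
public static class CjkExtensionWeights
{
    // Returns the two bytes that follow the FE FF prefix for U+3400..U+4DB5.
    // Assumed layout: first byte starts at 10, second byte runs 02..FF.
    public static byte[] Compute(int cp)
    {
        int idx = cp - 0x3400;
        return new byte[] {
            (byte) (0x10 + idx / 254),
            (byte) (0x02 + idx % 254)
        };
    }
}
```

                This reproduces both ends of the range: 3400 gives 10 02, and
                4DB5 (index 6581 = 25 * 254 + 231) gives 29 E9.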
1015
1016         20 (FF FF 01 01 01 01 00) special.
1017            Japanese extender marks:
1018                 3005, 3031, 3032, 309D, 309E, 30FC, 30FD, 30FE, FF70
1019
1020            LAMESPEC: In native context Microsoft's understanding of Japanese
1021            3031 and 3032 is wrong. They are never used to repeat *just the
1022            previous* character, but are usually used to repeat two or
1023            more characters. Also, 3005 is not always used to repeat exactly
1024            one character but is sometimes used to repeat two (or possibly more)
1025            characters.
1026
1027            Arabic shadda: FE7C (isolated), FE7D (medial)
1028            (Actually they are not extender in Unicode PropList.txt)
1029
1030
1031         - by UnicodeCategory -
1032
1033         DashPunctuation         6 (no exception)
1034         DecimalDigitNumber      C (no exception)
1035         EnclosingMark           1 E (no exception)
1036         Format                  7 (only 70F)
1037         LetterNumber            C (no exception)
1038         LineSeparator           7 (only 2028)
1039         ParagraphSeparator      7 (only 2029)
1040         PrivateUse
1041         SpaceSeparator          7 (no exception)
1042         Surrogate
1043
1044         OtherNumber             C(<3192), 9E-A7 (3124<)
1045
1046         Control                 6 except for 9-D (7)
1047         FinalQuotePunctuation   7 except for BB (8)
1048         InitialQuotePunctuation 7 except for AB (8)
1049         ClosePunctuation        7 except for 232A (9)
1050         OpenPunctuation         7 except for 2329 (9)
1051         ConnectorPunctuation    7 except for FF65, 30FB, 2040 (A)
1052
1053         OtherLetter             1, 7, 8 (1C0-1C2), C, 12-FF
1054         MathSymbol              8, 9, 6, 7, A, C
1055         OtherSymbol             7, 9, A, C, E, F, <22, 52<
1056         CurrencySymbol          A except for FF69,24,FF04 (7) and 9F2,9F3 (15)
1057
1058         LowercaseLetter         E-11 except for B5 (A) and 1BD (C)
1059         TitlecaseLetter         E (no exception)
1060         UppercaseLetter         E,F,10,11,21 except for 1BC (C)
1061         ModifierLetter          1, 7, E, 1F, FF
1062         ModifierSymbol          1, 6, 7
1063         NonSpacingMark          1, 6, 13-1F
1064         OtherPunctuation        1, 7, A, 1F
1065         SpacingCombiningMark    1, 14-22
1066
1067 *** Culture dependent design
1068
1069         (To verify this section, run the simple dumper code shown above
1070         with all the supported cultures.)
1071
1072 **** primary cultures and non-primary cultures
1073
1074         This code runs the character dump for every culture,
1075         using the sort key dumper shown above.
1076
             using System.Diagnostics;
             using System.Globalization;
             using System.IO;
             using System.Text;

             public class DumpAllCultures
             {
                     public static void Main ()
                     {
                             foreach (CultureInfo ci in CultureInfo.GetCultures (
                                     CultureTypes.AllCultures)) {
                                     // Run the sort key dumper for this culture
                                     // and capture its standard output.
                                     ProcessStartInfo psi = new ProcessStartInfo ();
                                     psi.FileName = "../allsortkey.exe";
                                     psi.Arguments = ci.Name;
                                     psi.RedirectStandardOutput = true;
                                     psi.UseShellExecute = false;
                                     using (Process p = Process.Start (psi)) {
                                             string s = p.StandardOutput.ReadToEnd ();
                                             p.WaitForExit ();
                                             // Save the dump as <culture name>.txt.
                                             using (StreamWriter sw = new StreamWriter (
                                                     ci.Name + ".txt", false, Encoding.UTF8))
                                                     sw.Write (s);
                                     }
                             }
                     }
             }
1095
1096         For each sub culture (that has a parent culture), its collation
1097         mapping is identical to that of its parent, except for az-AZ-Cyrl.
1098
1099         Additionally,
1100
1101         - zh-CHS = zh-CN = zh-SG = zh-MO : pronunciation
1102         - zh-TW = zh-HK = zh-CHT : stroke count
1103         - da = no
1104         - fi = sv
1105         - hr = sr
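
             The rules above can be sketched in C# as a function from a culture
             name to the name whose collation table it shares. Which member of
             each group acts as the representative is an arbitrary choice made
             for this sketch, the parent is found by cutting the name at the
             first '-' (a simplification of CultureInfo.Parent), and the names
             CollationCulture and Resolve are hypothetical.

```csharp
public static class CollationCulture
{
    // Maps a culture name to the name whose collation table it shares.
    public static string Resolve(string name)
    {
        if (name == "az-AZ-Cyrl")
            return name; // the one sub culture that differs from its parent
        switch (name) {
        case "zh-CHS": case "zh-CN": case "zh-SG": case "zh-MO":
            return "zh-CHS"; // pronunciation order
        case "zh-CHT": case "zh-TW": case "zh-HK":
            return "zh-CHT"; // stroke count order
        }
        // Sub cultures share the mapping of their parent culture.
        int dash = name.IndexOf('-');
        string parent = dash < 0 ? name : name.Substring(0, dash);
        switch (parent) {
        case "no": return "da";
        case "sv": return "fi";
        case "sr": return "hr";
        default: return parent;
        }
    }
}
```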
1106
1107         (UCA implies that there are some cultures that sort alphabets from
1108         large to small, but as far as I can see there is no such CultureInfo.)
1109
1110 **** Latin characters and NonSpacingMark order tailorings
1111
1112         div : FDF2 is 24 83 01 01 01 01 00 (only 1 difference)
1113         syr : some NonSpacingMarks are totally ignorable.
1114         tt,kk,mk,az-AZ-Cyrl,uk : cyrillic difference
1115         az,et,lt,lv,sl,tr,sv,ro,pl,no,is,hu,fi,es,da : latin difference
1116         fr : 1C4-1C6.
1117         sk,hr,cs : latin and NonSpacingMark differences
1118
1119         ja,ko : 5C
1120
1121 **** CJK character order tailorings
1122
1123         <how many tables?>
1124
1125         There are five different CJK orderings:
1126         default, ko(-KR), ja(-JP), zh-CHS and zh-CHT
1127         Each of them has a very different CJK mapping.
1128
1129         Since they seem to be based on traditional encodings, we are likely to
1130         provide separate constant tables and switch depending on the culture.
1131
1132         <what characters are different from the invariant culture?>
1133
1134         ko : CJK layout difference (52 -> 80)
1135         ja,zh-CHS,zh-TW : dash (5C), CJK layout difference.
1136
1137         Target characters are : CJK misc (3190-), Parenthesized CJK
1138         (3200-), CJK compat (3300-), CJK ideographs (4E00-),
1139         CJK compat ideograph (F900-), Half/Full width compat (FF00-)
1140
1141         Additionally for Korean: Jamo (1100-), Hangul syllables (AC00)
1142
1143         <how are they composed?>
1144
1145         Japanese CJK order looks based on JIS table order. Those characters
1146         which are also in JIS table are moved to 80 xx. Those which are *not*
1147         in JIS table are left as is (9E-FE).
1148
1149         Additionally, Windows has different order for characters below:
1150         4EDD,337B,337E,337D,337D,337C
1151         They come in front of the first CJK character.
1152
1153         Maybe Korean CJK order respects KS C 5619. Note that Korean mixes
1154         Hangul and CJK in their order so it's not flat order without indexes
1155         (thus, for CJK they are not computable). Also, there are extra
1156         level 2 values in the Korean CJK map.
1157
1158         For some Chinese such as zh-CHS, character order is based on pinyin.
1159         And for remaining Chinese such as zh-TW, it is stroke count based.
1160
1161         CLDR of unicode.org has reference ordering of those characters, so
1162         our collation table extracts the sorting order from it.
1163         http://www.unicode.org/cldr/
1164
1165 **** Accent evaluation order
1166
1167         With French cultures, diacritical marks are compared in reverse order
1168         (from the end of the string). French ordering does not seem to affect
1169         some diacritics (Japanese voice mark is not affected - FIXME: I doubt
1170         it, because the algorithm does not seem to allow it).
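
             A minimal C# sketch of the difference, assuming each string has
             already been reduced to one level 2 weight per character. The
             weights used in the note below are made up for illustration, and
             the names AccentOrder and CompareLevel2 are hypothetical.

```csharp
public static class AccentOrder
{
    // Compares two level 2 weight sequences of equal length. With
    // backwards == true the last difference decides (French rule);
    // otherwise the first one does.
    public static int CompareLevel2(byte[] a, byte[] b, bool backwards)
    {
        for (int n = 0; n < a.Length; n++) {
            int i = backwards ? a.Length - 1 - n : n;
            if (a[i] != b[i])
                return a[i] < b[i] ? -1 : 1;
        }
        return 0;
    }
}
```

             With "côte" as { 0, 1, 0, 0 } and "coté" as { 0, 0, 0, 2 }, the
             forward comparison puts "coté" first, while the backward (French)
             comparison puts "côte" first.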
1171
1172         Some other cultures might also have different ones, but not obvious.
1173
1174
1175 ** Mono implementation plans
1176
1177 *** Collator
1178
1179         CompareInfo contains many overloaded methods that are just for
1180         convenience. This class contains almost only required members.
1181
1182         This class also provides access to tailoring information which is
1183         culture instance dependent:
1184
1185         - French sorting
1186         - contractions/expansions - returns contraction or expansion
1187         - diacritical remapping
1188         - CJK custom mapping
1189
1190         For data area, see CollationDataStructures.txt for now.
1191
1192 *** UnicodeTable (for now MSCompatUnicodeTable)
1193
1194         Provides access to various character information related to
1195         the collation element table (of our own).
1196         FIXME: I want to fix some bugs in Windows collation table especially
1197         to not ignore some characters, but it requires table modification
1198         which results in further memory allocation. Maybe it would be done
1199         as a patch for the runtime (or classlib) sources.
1200
1201         - ignorable, ignorable nonspace, normalize width, normalize kanatype
1202         - level 4 sortkey provision method(s)
1203
1204 **** character comparison
1205
1206         Since a composite character may *not have* an equivalent
1207         codepoint, character comparison cannot be done just by looking at
1208         the "resulting char" value.
1209         Conversely, since a composite character may *have* an
1210         equivalent codepoint, character comparison cannot be done just
1211         by comparing the "source char" values either.
1212
1213 ***** future optimizations
1214
1215         From where those codepoints differ, for each string it adjusts the
1216         position so that it represents exactly one character element. That is,
1217         find the primary character as the start of the range and the last
1218         nonprimary character as the end of the range.
1219
1220         Once Compare() has adjusted the character location to a valid
1221         comparison position, further comparison is done as usual,
1222         i.e. sortkey comparison considering comparisonLevel.
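
             The position adjustment can be sketched in C# as below. Treating
             the three mark categories as "nonprimary" is an assumption (the
             real check would use the collation table), and the names
             CharElement, IsNonPrimary and GetRange are hypothetical.

```csharp
using System.Globalization;

public static class CharElement
{
    // Assumption: a character is "nonprimary" if it is any kind of mark.
    static bool IsNonPrimary(char c)
    {
        switch (CharUnicodeInfo.GetUnicodeCategory(c)) {
        case UnicodeCategory.NonSpacingMark:
        case UnicodeCategory.SpacingCombiningMark:
        case UnicodeCategory.EnclosingMark:
            return true;
        default:
            return false;
        }
    }

    // Widens pos to the character element around it: start is the primary
    // character, end is one past the last nonprimary character.
    public static void GetRange(string s, int pos, out int start, out int end)
    {
        start = pos;
        while (start > 0 && IsNonPrimary(s[start]))
            start--;
        end = pos + 1;
        while (end < s.Length && IsNonPrimary(s[end]))
            end++;
    }
}
```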
1223
1224 **** Characters in the table / characters computed
1225
1226         Currently I plan not to store the following characters in the table
1227         but to compute them on demand:
1228
1229         - PrivateUse
1230         - Surrogate
1231
1232 **** CJK Unified Ideographs
1233
1234         For CJK unified ideographs, I had to make those culture-dependent
1235         tables in memory. Since they came from some classical encodings, they
1236         cannot be computed. Thus, they are in separate tables.
1237
1238 **** Level 4: Kana type
1239
1240         The table does not contain level 4 (kanatype) properties for
1241         all characters. They can simply be computed.
1242
1243 **** Level 3: Case/Width properties
1244
1245         Case properties will be stored as a byte array, with limited areas of
1246         codepoint (cp < 3120 || FE00 < cp).
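
             One way to index such a packed array, sketched in C#. Placing the
             high area directly after the low one is an assumed layout, and the
             names Level3Index, Length and ToIndex are hypothetical.

```csharp
public static class Level3Index
{
    const int LowEnd = 0x3120;    // stored: cp < 3120
    const int HighStart = 0xFE01; // stored: FE00 < cp

    // Number of entries needed when only the two areas are kept.
    public const int Length = LowEnd + (0xFFFF - HighStart + 1);

    // Maps a codepoint to its slot in the packed array, or -1 if the
    // codepoint lies outside both areas.
    public static int ToIndex(int cp)
    {
        if (cp >= 0 && cp < LowEnd)
            return cp;
        if (cp >= HighStart && cp <= 0xFFFF)
            return LowEnd + (cp - HighStart);
        return -1;
    }
}
```

             With this layout the array needs 13087 entries instead of 65536.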
1247
1248         For Hangul characters, it will be computed by codepoint areas.
1249
1250 **** Level 2: Diacritical properties
1251
1252         The table will be composed of one byte per character. If we provide a
1253         non-buggy mode (Windows is buggy here by design; it just sums
1254         secondary weight values up), the values will come from UCA and
1255         a non-blocking check will be introduced.
1256
1257         Note that Japanese voice marks are considered at level 2 but need no
1258         maps.
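
             Why plain summing is a problem can be shown with a tiny C# sketch.
             It assumes the summing behavior described above; the weights in
             the note below are made up, and the names Level2Sum and Combine
             are hypothetical.

```csharp
public static class Level2Sum
{
    // Adds up the level 2 weights of the marks attached to one base
    // character, keeping the result in a single byte.
    public static byte Combine(params byte[] markWeights)
    {
        int sum = 0;
        foreach (byte w in markWeights)
            sum += w;
        return (byte) sum;
    }
}
```

             Two marks with weights 3 and 4 produce the same byte as a single
             mark with weight 7, so the two sequences become indistinguishable
             at level 2.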
1259
1260
1261 ** Reference materials
1262
1263         Developing International Software for Windows 95 and Windows NT
1264         Appendix D Sort Order for Selected Languages
1265         http://www.microsoft.com/globaldev/dis_v1/disv1.asp?DID=dis33d&File=S24BF.asp
1266
1267         UTR#10 Unicode Collation Algorithm (It is still informative)
1268         http://www.unicode.org/reports/tr10/
1269
1270         UAX#15 Unicode Normalization
1271         http://www.unicode.org/reports/tr15/
1272         especially its canonical/compatibility equivalent characters might
1273         be informative to get those equivalent characters.
1274
1275         To know which character can be expanded, Unicode Character Database
1276         (UCD) is informative (it's informative but not normative to us)
1277         http://www.unicode.org/Public/UNIDATA/UCD.html
1278
1279         Decent char-by-char explanation is available here:
1280         http://www.fileformat.info/info/unicode/
1281
1282         Wine uses the UCA default element table, but supports Windows-like
1283         character filtering in their LCMapString implementation:
1284         http://cvs.winehq.com/cvsweb/wine/dlls/kernel/locale.c
1285         http://cvs.winehq.com/cvsweb/wine/libs/unicode/sortkey.c
1286
1287         Mimer has decent materials on culture specific collations:
1288         http://developer.mimer.com/collations/
1289
1290         This is written in Japanese, but awesome analysis on MS Access
1291         string sorting:
1292         http://www.asahi-net.or.jp/~ez3k-msym/comp/acccoll.htm