mcs/class/corlib/Mono.Globalization.Unicode/Collation-notes.txt

   1 * String collation Notes
   2
   3 ** Summary
   4
   5         We are going to implement Windows-like collation, as opposed to ICU
   6         (which implements UCA, the Unicode Collation Algorithm).
   7
   8
   9 ** Tasks
  10
  11         * create collation element table(s)
  12                 - infer how Windows collation element table is composed
  13                   : mostly analyzed.
  14                 - write table generator source(s)
  15                   : mostly implemented. Need to fix nearly 400 mappings.
  16                     They are mainly 1) IPA extensions (U+250-U+300),
  17                     2) Latin extensions (U+1E00-U+1F00), 3) Letterlike
  18                     symbols (U+2100-U+2140), 4) some Cyrillic letters
  19                     (U+460-U+500), and 5) some Hangul characters.
  20                 - culture-specific sortkey data
  21                   : They are defined in mono-tailoring-source.txt.
  22                     All single-sortkey remappings in all cultures are filled.
  23                     Contractions are not fully checked yet (should be filled
  24                     from UCA tailorings via create-tailorings.exe).
  25
  26
  27 ** How to implement CompareInfo members
  28
  29         GetSortKey() : done
  30                 Compute sort keys for all character elements into byte[].
  31         Compare() : done
  32                 Find first difference and compare it.
  33                 "Larger/smaller" matters (beyond "different").
  34         IsPrefix()
  35                 It calls CompareInternal() which also answers if the target
  36                 is fully consumed, so it just returns true if it says that
  37                 the target is fully consumed.
  38         IsSuffix()
  39                 It tries CompareInternal() to compare source and target at
  40                 the end, where source varies from minimum tail to the
  41                 original args.
  42         IndexOf(), LastIndexOf()
  43                 For character search, it scans for the matching character
  44                 element towards the end (or start) of the string.
  45                 For string search, it invokes one of private IndexOf() (or
  46                 LastIndexOf()) overload passing the first character element
  47                 of the target, and if found, tests if the sequence is a valid
  48                 start point, using IsPrefix() (or IsSuffix()).
  49
  50 *** Optimizations
  51
  52         Compare() and IsPrefix() use forward iteration, which does not stop
  53         until it either finds the next "primary" character or reaches the
  54         end of the string, checking with IsSafe(char).
  55
  56         For IndexOf(char) and LastIndexOf(char), there is no special
  57         optimization (since the codepoints usually do not match, while they
  58         often match as a natural collation), but it omits extraneous sortkey
  59         value computation.
  60
  61         IsSuffix() reuses Compare() and returns false if it does not consume
  62         the target string more than 3 times. 3 is a kind of magic number that
  63         represents the longest expansion.
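
             The IsSuffix() idea (compare growing tails of the source against
             the target) can be sketched as follows. This is only a minimal
             sketch, not the Mono implementation: the class and method names
             are hypothetical, it uses the public CompareInfo.Compare()
             instead of CompareInternal(), and reading the "3 times" bound as
             "tails up to 3 * target.Length" is an assumption.

```csharp
using System;
using System.Globalization;

static class IsSuffixSketch
{
    // Hypothetical helper. The factor 3 is the "magic number" from the
    // notes; how exactly it bounds the search is an assumption here.
    public static bool IsSuffix (CompareInfo ci, string source, string target, CompareOptions opt)
    {
        if (target.Length == 0)
            return true;
        int maxTail = Math.Min (source.Length, target.Length * 3);
        // Try the minimum tail first, then grow it towards the whole source.
        for (int len = 1; len <= maxTail; len++) {
            string tail = source.Substring (source.Length - len);
            if (ci.Compare (tail, target, opt) == 0)
                return true;
        }
        return false;
    }
}
```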
  64
  65         IndexOf(string) is implemented as a combination of IndexOf(char) and
  66         IsPrefix().
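
             A minimal sketch of that combination, written against the public
             CompareInfo API rather than the private overloads. The class
             name is hypothetical, and it uses the first char of the target
             where the real code uses the first character element.

```csharp
using System;
using System.Globalization;

static class IndexOfSketch
{
    // Hypothetical helper: only illustrates "IndexOf(char), then IsPrefix()".
    public static int IndexOf (CompareInfo ci, string source, string target, CompareOptions opt)
    {
        if (target.Length == 0)
            return 0;
        int start = 0;
        while (start < source.Length) {
            // Step 1: find a candidate start point for the target's first char.
            int idx = ci.IndexOf (source, target [0], start, source.Length - start, opt);
            if (idx < 0)
                return -1;
            // Step 2: accept it only if the target is a prefix of the rest.
            if (ci.IsPrefix (source.Substring (idx), target, opt))
                return idx;
            start = idx + 1;
        }
        return -1;
    }
}
```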
  67
  68         LastIndexOf(string) is implemented as a combination of
  69         LastIndexOf(char) and IsPrefix().
  70
  71         Porting them to C code is an alternative possible approach, but from
  72         Compare() optimization experience, it is quick enough.
  73
  74 ** How to support CompareOptions
  75
  76         There are two kinds of "ignorance": strippers' ignorance and
  77         normalizers' ignorance.
  78
  79         The strippers will "filter characters out" and there will be no
  80         corresponding character elements in SortKey binaries.
  81
  82         Normalizers, on the other hand, map certain characters onto each
  83         other while keeping them distinct from unrelated characters.
  84         For example, with IgnoreKanaType Hiragana "A" and Katakana "A" are
  85         not distinguished, but Hiragana "A" and Hiragana "I" are.
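
             The two kinds of ignorance can be observed through the public
             API. The helper below is a hypothetical sketch (the class and
             method names are not part of any library); it only wraps
             CompareInfo.Compare() with one normalizer flag and one stripper
             flag on the invariant culture.

```csharp
using System;
using System.Globalization;

static class IgnoranceDemo
{
    static readonly CompareInfo Invariant = CultureInfo.InvariantCulture.CompareInfo;

    // Normalizer: kana types collapse, different letters stay distinct.
    public static bool SameIgnoringKanaType (string a, string b)
    {
        return Invariant.Compare (a, b, CompareOptions.IgnoreKanaType) == 0;
    }

    // Stripper: symbols are filtered out before comparison.
    public static bool SameIgnoringSymbols (string a, string b)
    {
        return Invariant.Compare (a, b, CompareOptions.IgnoreSymbols) == 0;
    }
}
```

             SameIgnoringKanaType ("\u3042", "\u30A2") is the Hiragana "A" /
             Katakana "A" pair from the example above, and
             SameIgnoringKanaType ("\u3042", "\u3044") is the "A" / "I" pair.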
  86
  87         Actually, even without any IgnoreXXX flags (i.e. "None"), there are
  88         many characters that are ignored ("completely ignorable").
  89
  90         Except for LCID 101/1125(div), '\ufdf2' is completely ignorable.
  91         This rule even applies to CompareOptions.None.
  92
  93 *** Normalizers
  94
  95         IgnoreCase
  96                 Maybe culture-dependent TextInfo.ToLower() could be used.
  97
  98                 Unlike ICU (specialCaseToLower()), even with tr-TR(LCID 31)
  99                 and IgnoreCase, I\u0307 is not regarded as equal to i.
 100
 101         IgnoreKanaType
 102                 ToKanaTypeInsensitive(). Note that this does not cover the
 103                 whole "level 4" differences described later.
 104
 105         IgnoreWidth
 106                 ToWidthInsensitive(), which is likely to be culture
 107                 independent. See also "Notes".
 108
 109         IgnoreNonSpace (see also Strippers; this flag works in both sides)
 110                 For some cultures this logic is still incomplete. All culture-
 111                 dependent collators must handle valid "replacement" of "one or
 112                 more characters" which might be related to specific
 113                 CompareOptions.
 114                 For example, there is a Japanese text sorting rule that
 115                 nonetheless applies to InvariantCulture. Concretely,
 116                 "\u3042\u30FC" is equivalent to "\u3042\u3042" only when
 117                 IgnoreNonSpace is specified.
 118
 119                 I'll take those items from CLDR (those items which have
 120                 <reset before="..." />), case by case though.
 121
 122 *** Strippers
 123
 124         I already wrote all the required strippers which should be MS
 125         compatible (at least with .NET 1.1 invariant culture).
 126
 127         IgnoreNonSpace
 128                 IsIgnorableNonSpacing().
 129                 Some Diacritic characters are covered by this flag.
 130
 131                 There are some culture *dependent* characters:
 132                         LCID 90/1114(syr) : 64b, 652, 670
 133
 134         IgnoreSymbols
 135                 IsIgnorableSymbol().
 136                 UnicodeCategory does not work here.
 137
 138                 There are some culture *dependent* characters:
 139                         LCID 17/1041(ja) : 2015
 140                         LCID 90/1114(syr) : 64b, 652
 141
 142 *** StringSort
 143
 144         See "sort order categories" section.
 145
 146 ** ICU and UCA
 147
 148         First to note: we won't use collation element table from unicode.org.
 149
 150         There are many differences between ICU/UCA and Windows even though
 151         they look so similar: collation keys in different levels, culture
 152         dependent composition, etc. Historically, Windows collation was
 153         designed before UCA was specified, so basically Windows is obsolete
 154         in this area.
 155
 156         - Logic: Unlike UCA it has no concept of "blocked" combining marks,
 157           and combining marks are never considered as an independent character
 158           (thus combining in Windows is buggy).
 159         - Data: Windows is based on an old Unicode standard (even older than 1.1).
 160           It ignores minor cultures. Character property values differ as well
 161           as those from the default Unicode collation element table (DUCET).
 162           In a few cultures Windows collation is close to the native language
 163           (e.g. Tamil, while it does not conform to TAM).
 164
 165         IgnoreWidth/IgnoreSymbols is processed after Kana voice mark
 166         decomposition (something like NFD, but not equivalent. Example: \u304C
 167         is completely equivalent to \u304B\u309B, which is not part of NFKD).
 168         <del>This means, if there are combined Kana characters, they will be
 169         first decomposed and then compared.</del> It scarcely matters since
 170         there are special weight data for Japanese.
 171
 172 *** Microsoft design problem
 173
 174         The Microsoft implementation seems to have a serious problem: many,
 175         many characters that are used in specific cultures, such as
 176         Myanmar, Mongolian, Cherokee, Ethiopic, Tagalog, Khmer, are regarded as
 177         "completely ignorable".
 178
 179         I tagged many LAMESPEC items in the implementation (both in collator
 180         and table generator).
 181
 182
 183 ** MS collation design inference
 184
 185 *** Levels
 186
 187         Each character has several "weights". It is a common concept between
 188         Windows and UCA.
 189
 190         There are 5 levels:
 191
 192         - level 1: primary difference
 193           The first byte of level 1 means the category of the character.
 194         - level 2: diacritic difference, including Japanese voice mark countup
 195         - level 3: case/width sensitivity, and Hangul properties
 196         - level 4: kana weight (all of them have primary category 22, at least
 197           in InvariantCulture)
 198         - level 5: shift weight (apostrophe, hyphens etc.)
 199
 200         Note that these levels do not map one-to-one to IgnoreXXX flags. Thus
 201         it is not OK to simply omit some levels of sortkey values according
 202         to CompareOptions.
 203
 204         String comparison is done from level 1 to 5. The comparison won't
 205         stop until it either finds a primary difference or reaches the end
 206         (in which case differences at the upper levels are returned).
 207
 208         For example, "e" is smaller than "E", but "eB" is bigger than "EA".
 209         If the collator just returned the case difference at the first 'e' and
 210         'E', "eB" would wrongly be smaller than "EA".
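
             The "e"/"E" example can be reproduced with a toy two-level
             comparer. This is only a sketch under stated assumptions: the
             weights are made up (primary = letter identity, case weight = 0
             for lowercase and 1 for uppercase) and are not Windows data, and
             the class name is hypothetical.

```csharp
using System;

static class LevelCompareSketch
{
    // Toy weights (assumptions): letters differing only by case share a primary.
    static int Primary (char c) { return char.ToUpperInvariant (c); }
    static int CaseWeight (char c) { return char.IsUpper (c) ? 1 : 0; }

    public static int Compare (string a, string b)
    {
        int pending = 0; // first case difference seen so far
        int n = Math.Min (a.Length, b.Length);
        for (int i = 0; i < n; i++) {
            int p = Primary (a [i]) - Primary (b [i]);
            if (p != 0)
                return Math.Sign (p); // a primary difference wins immediately
            if (pending == 0)
                pending = Math.Sign (CaseWeight (a [i]) - CaseWeight (b [i]));
        }
        if (a.Length != b.Length)
            return a.Length < b.Length ? -1 : 1;
        // No primary difference anywhere: report the remembered case difference.
        return pending;
    }
}
```

             With these toy weights Compare ("e", "E") is negative while
             Compare ("eB", "EA") is positive, because the primary difference
             between 'B' and 'A' is found before the remembered case
             difference is reported.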
 211
 212 **** level 5: shift weight by StringSort
 213
 214         There are some characters that are treated specially. Namely they are
 215         apostrophe and hyphens. The sortkeys for them are put after level 4
 216         (thus here I write them as "level 5"). They have a different sort key
 217         format. See immediately below. There are no level 5 characters when
 218         StringSort is specified.
 219
 220 *** sort key format
 221
 222         00 means the end of sort key.
 223         01 means the end of the level.
 224         02-FF means the value.
 225         Actually '2' could be cut when all the following values are
 226         also '2' (i.e. the sort key binary won't contain extraneous '2').
 227
 228         Every level has a different key layout.
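
             The 00/01 framing above is enough to split raw sort key bytes
             into levels. The following is a minimal sketch with hypothetical
             names; it assumes that the first four 01 bytes close levels 1 to
             4 and that everything after them, up to the 00 terminator, is
             level 5.

```csharp
using System;
using System.Collections.Generic;

static class SortKeyLevels
{
    // Splits raw sort key bytes into levels (five for a complete key).
    public static List<byte[]> Split (byte[] key)
    {
        var levels = new List<byte[]> ();
        var current = new List<byte> ();
        foreach (byte b in key) {
            if (b == 0x00)
                break; // 00: end of sort key
            if (b == 0x01 && levels.Count < 4) {
                // 01: end of the current level (levels 1 to 4 only)
                levels.Add (current.ToArray ());
                current.Clear ();
            } else {
                current.Add (b);
            }
        }
        levels.Add (current.ToArray ()); // whatever remains is the last level
        return levels;
    }
}
```

             For the bytes 01 01 01 01 80 07 06 82 00 this yields four empty
             levels followed by the four-byte level 5 entry 80 07 06 82.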
 229
 230 **** level 2
 231
 232         It looks like all level 2 keys are just accumulated, however without
 233         considering overflow. It sometimes makes sense (e.g. diaeresis and
 234         acute) but it causes many conflicts (e.g. "A\u0308\u0301" and "\u1EA6"
 235         are incorrectly regarded as equal).
 236
 237         Anyway, since a Japanese voice mark has a level 2 value of 1, the
 238         result just looks like a count of voice marks.
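
             The accumulation can be sketched as plain byte arithmetic. The
             rule below is an inference, not a confirmed algorithm: it
             assumes every stored level 2 value carries a base of 2, so
             combining marks means adding (value - 2) for each mark. That
             assumption agrees with the entries in the "level 2 details"
             value map (e.g. diaeresis 13 + acute 0E gives 1F, and circumflex
             12 + grave 0F also gives 1F, which is exactly the conflict
             described above). The class name is hypothetical.

```csharp
using System;

static class Level2Sketch
{
    // Assumed rule: stored values have a base of 2; marks simply add up.
    public static byte Accumulate (params byte[] stored)
    {
        int sum = 2;
        foreach (byte v in stored)
            sum += v - 2;
        // No overflow handling, mirroring the behavior described in the notes.
        return unchecked ((byte) sum);
    }
}
```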
 239
 240 **** level 3
 241
 242         The actual value analysis is not complete in this document. See the
 243         actual generator code.
 244
 245         The actual values are + 2 (e.g. for Hangul Normal Jamo, the value is 4)
 246
 247         For Korean letters:
 248                 - 2: Jongseong (11A8-11F9)
 249                 - 4: Half width? (FFA0-FFDC) and Compatibility Jamo? (3165-318E)
 250                 - 5: Compatibility Jamo (3130-3164)?
 251                      TODO: Learn about Korean characters.
 252
 253         For numbers:
 254                 - 4 circled inverse (2776-277F)
 255                 - 8 circled sans serif (2780-2789)
 256                 - C circled inverse && sans serif (278A-2793)
 257                 - 47 roman (2160-2182)
 258
 259         For Arabic letters:
 260                 - 2 Isolated form in presentation form B in FE80-FE8D
 261                 - 4 Alef/Bet/Gimel/Dalet (2135-2138)
 262                 - 8 Final form in presentation form B in FE82-FEF2
 263                 - 18 Medial form in presentation form B in FE8C-FEF4
 264                      Grep "ISOLATED", "FINAL" or "MEDIAL" on UnicodeData.txt
 265                      (and filter by codepoints).
 266                      or alternatively, see DerivedDecompositionType.txt.
 267                 - 22 6A9 (TODO: what is it?)
 268                 - 28 6AA (TODO: what is it?)
 269
 270         For other letters:
 271                 - 1 Fullwidth. UnicodeData.txt has <full>.
 272                 - 2 Subscript. UnicodeData.txt has <sub>.
 273                 - 8 Small capital, 03C2 (TODO: why?),
 274                     2104, 212B(flag=1A) (TODO: why?)
 275                     grep "SMALL CAPITAL" against UnicodeData.txt.
 276                 - C only FE42. TODO: what is this?
 277                 - E Superscripts. UnicodeData.txt has <super>.
 278                 - 10 Uppercase.
 279                      DerivedCoreProperties.txt has Uppercase property.
 280
 281         Note that simple 02 (value is 00) could be omitted.
 282
 283         Summary: at least 7 bits are required to represent a table -
 284         smallCapital, uppercase, normalization forms (2 bits:full/sub/super),
 285         arabic forms (2 bits:isolated/medial/final)
 286
 287 **** level 4
 288
 289         This sortkey data is collected only for Japanese (category 22)
 290         characters.
 291
 292         There are 3 sections, each of which ends with FF. Each of them
 293         represents the values character by character:
 294         - small letter type (kogaki moji); C4 (small) or E4 (normal)
 295         - category middle section:
 296                 two subsections separated by 0x02
 297                 - char type;
 298                   3 (normal)
 299                   or 4 (voice mark - \u309D,\u309E,\u30FD,\u30FE, \uFF70)
 300                   or 5 (dash mark - \u30FC)
 301                 - kana type; C4 (katakana) or E4 (hiragana)
 302         - width; 2 (normal) or C5 (full) or C4 (half)
 303
 304           LAMESPEC: those characters with value '4' in the middle section
 305           differ in level 2 wrt voice marks, but do not differentiate kana
 306           types (bug). It is ignored when IgnoreNonSpace applies.
 307
 308 **** level 5
 309
 310         UPDATED: I noticed offsetL does not exist, so removed it from here.
 311
 312         [offsetM + 0x80]? [const 3 + (offsetS + 1) * 4] [category] [level1]
 313
 314         where "offsetM" and "offsetS" represent the offset in the input
 315         string. The first byte is always 0x80 or larger.
 316         LAMESPEC: This design results in a buggy overflow.
 317
 318         <xmp>
 319         byte [] data = new CultureInfo ("").CompareInfo.GetSortKey (s).KeyData;
 320         int idx = 0;
 321         for (int i = 0; i < 4; i++, idx++)
 322                 for (; data [idx] != 1; idx++)
 323                         ;
 324         for (; idx < data.Length; idx++)
 325                 Console.Write ("{0:X02} ", data [idx]);
 326         Console.WriteLine ();
 327         </xmp>
 328
 329         inputs (s) and results:
 330
 331         80 07 06 82 80 2F 06 82 00 // '-' + new string ('A', 10) + '-'
 332         80 07 06 82 81 97 06 82 00 // (100)
 333         80 07 06 82 8F A7 06 82 00 // (1000)
 334         80 07 06 82 9C 47 06 82 00 // (10000)
 335         80 07 06 82 9A 87 06 82 00 // (100000)
 336         80 07 06 82 89 07 06 82 00 // (1000000)
 337
 338         The actual offset is 64 * offsetM + offsetS (e.g. 8F A7: 64 * 15 + 40 = 1000)
 339
 340         (const '3' may actually vary but no idea.
 341         At least 00, 01 and 02 are not acceptable since they are reserved.
 342         02 is not reserved by definition above, but the key-size optimizer
 343         uses it as a special mark, as mentioned above.)
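
             The two offset bytes can be reproduced with a short encoder.
             This is a sketch fitted to the six sample rows above, not a
             confirmed algorithm: it assumes offset = 64 * offsetM + offsetS,
             that only the low 7 bits of offsetM survive in the first byte,
             and that the second byte wraps on overflow. The class and
             method names are hypothetical.

```csharp
using System;

static class Level5Sketch
{
    // offset: number of characters preceding the shift-weight character
    // (10, 100, 1000, ... in the sample inputs above).
    public static byte[] EncodeOffset (int offset)
    {
        int offsetM = offset / 64;
        int offsetS = offset % 64;
        // Assumption: the high bit is always set and offsetM is truncated to 7 bits.
        byte first = (byte) (0x80 | (offsetM & 0x7F));
        // const 3 + (offsetS + 1) * 4, wrapping silently on overflow.
        byte second = unchecked ((byte) (3 + (offsetS + 1) * 4));
        return new byte[] { first, second };
    }
}
```

             EncodeOffset (10) gives 80 2F and EncodeOffset (10000) gives
             9C 47, matching the corresponding rows in the table above.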
 344
 345 *** sort key table
 346
 347         Here is the simple sortkey dumper:
 348
 349         public static void Main (string [] args)
 350         {
 351                 CultureInfo culture = args.Length > 0 ?
 352                         new CultureInfo (args [0]) :
 353                         CultureInfo.InvariantCulture;
 354                 CompareInfo ci = culture.CompareInfo;
 355                 for (int i = 0; i < char.MaxValue; i++) {
 356                         string s = new string ((char) i, 1);
 357                         if (ci.Compare (s, "") == 0)
 358                                 continue; // ignored
 359                         byte [] data = ci.GetSortKey (s).KeyData;
 360                         foreach (byte b in data) {
 361                                 Console.Write ("{0:X02}", b);
 362                                 Console.Write (' ');
 363                         }
 364                         Console.WriteLine (" : {0:X}, {1} {2}",
 365                                 i,
 366                                 Char.GetUnicodeCategory ((char) i),
 367                                 data [2] != 1 ? '!' : ' ');
 368                 }
 369         }
 370
 371 *** multiple character mappings
 372
 373         Some sequences of characters are considered as a "composite" that is
 374         to be composed either as another character or another sequence of
 375         characters. Such a "composite" might not have a corresponding
 376         equivalent character in the sortkey.
 377         Similarly, some single characters are expanded to a sequence of
 378         characters.
 379
 380 **** diacritic characters
 381
 382         Apart from the shift-weight characters, only diacritical (or
 383         other kinds of nonspacing) characters do not have primary
 384         weights.
 385
 386         Diacritics are not regarded as a base character when placed after
 387         (maybe some kind of) letters.
 388
 389         The behavior is diacritic character dependent. For example, Japanese
 390         combination of a Kana character and a voice mark is compulsory (the
 391         resulting sort key is regarded as identical to the corresponding
 392         single character. Try \u304B\u309B with \u304C. It is invariant).
 393
 394         In French cultures, diacritic orderings are checked from right to left.
 395
 396 **** Composite character processing
 397
 398         There are some sequences of characters that are treated as another
 399         character or another sequence of characters.
 400
 401         By default, there is no composite form.
 402         http://www.microsoft.com/globaldev/dis_v1/disv1.asp?DID=dis33d&File=S24C2.asp
 403         (Note that composite is different from expansion.)
 404
 405         Note that composite characters are likely not to have an equivalent
 406         codepoint.
 407
 408 **** Expanded character processing
 409
 410         Some characters are expanded to two or more characters:
 411
 412         C6 (AE), E6 (ae), 1F1-1F3 (dz), 1C4-1C6 (Dz), FB00-FB06 (ff, fi),
 413         132-133 (IJ), 1C7-1C9 (LJ), 1CA-1CC (NJ), 152-153 (OE),
 414         DF (ss), FB06 (st), FB05 (\u017Ft), FE, DE, 5F0-5F2,
 415         1113-115F (hangul)
 416         (CJK extension is not really expanded)
 417
 418         They don't match any of the Unicode normalization forms.
 419
 420         Some alphabetic cultures have different mappings, but mostly small
 421         (at least da-DK, lt-LT, fr-FR, es-ES have tiny differences).
 422
 423         Invariant culture also puts Czech unique character \u0161 between s
 424         and t, unlike described here:
 425         http://www.microsoft.com/globaldev/dis_v1/disv1.asp?DID=dis33d&File=S24C0.asp
 426
 427 *** default sort key table
 428
 429 **** StringSort
 430
 431         When CompareOptions.StringSort is specified, then it modifies
 432         characters in category 2 from "1 1 1 1 80 07 06 xx" to
 433         "06 xx yy zz" and some characters become case sensitive.
 434
 435         For details, see the "level 5" description above.
 436
 437         To handle them simply, they are laid out as "category 0x01" (which
 438         never happens in the actual sortkeys) for those shift-weight ones
 439         in the table.
 440
 441         There seem to be no further differences between StringSort and None.
 442
 443 **** level 2 details
 444
 445         The value analysis is not complete in this document. See the
 446         actual generator code.
 447
 448         Known value maps:
 449                 -0A: Korean parenthesized numbers (3200-321C)
 450                 -0C: Korean circled numbers (3260-327B)
 451
 452                 -03: Japanese voice mark
 453
 454                 <primary category 13 : Arabic>
 455                 -08: 627-648 (basic Abjad letters)
 456                 -09: madda (622)
 457                 -05: waw with hamza (624)
 458                 -07: yeh with hamza (626. ignore Presentation Form A area)
 459                 -0A: alef with hamza above (623)
 460                 -0A: alef with hamza below (625)
 461
 462                 <primary category 0E : diacritics>
 463                 Characters in non "0E" category are out of scope.
 464                 They can be grepped in UnicodeData.txt.
 465                 -0E: acute
 466                 -0F: grave
 467                 -10: dot above
 468                 -11: middle dot
 469                 -12: circumflex
 470                 -13: diaeresis
 471                 -14: caron
 472                      Note that 1C4-1C6 are covered but they are also expanded.
 473                 -15: breve (cyrillic are also covered? at least 4C1/4C2 are.)
 474                 -16: dialytika and tonos (category 0F though)
 475                 -17: macron
 476                 -19: tilde
 477                 -1A: ring above | 212B
 478                 -1B: ogonek ("WITH OGONEK;")
 479                 -1C: cedilla ("WITH CEDILLA;")
 480                 -1D: double acute | acute and dot above
 481                 -1E: stroke, except for 0E[1F] and cp{19B, 1BE} |
 482                      circumflex and acute | 18B,18C,19A,289
 483                      (i.e. it is not a one-to-one mapping. Not every
 484                      "stroke" is mapped to 1E, and not every 1E is mapped
 485                      from "stroke".)
 486                 -1F: diaeresis and acute | with circumflex and grave | l slash
 487                         beware "symbol slash"
 488                 -20: diaeresis and grave | 19B,19F
 489                 -21: breve and acute | D8,F8
 490                 -22: caron and dot above | breve and grave
 491                 -23: macron and acute
 492                 -24: macron and grave
 493                 -25: diaeresis and caron | dot above and macron | tilde and acute
 494                 -26: ring above and acute
 495                 -28: diaeresis and macron | cedilla and acute |
 496                      macron and diaeresis
 497                 -29: circumflex and tilde
 498                 -2A: tilde and diaeresis
 499                 -2B: stroke and acute
 500                 -2C: breve and tilde
 501                 -2F: cedilla and breve
 502                 -30: ogonek and macron
 503                 -43: hook, except for cp{192,1B2,25A,25D,27B,28B,2B1,2B5} |
 504                      left hook | with hook above except for cp{1EF6,1EF7} |
 505                      27D,284
 506                 -44: double grave | 1EF6,1EF7
 507                 -46: inverted breve
 508                 -48: preceded by apostrophe (actually only 149)
 509                 -52: horn
 510                 -55: line below | circumflex and hook above
 511                 -57: palatal hook (actually only 1AB)
 512                 -58: dot below except for cp{1EA0,1EA1}
 513                 -59: "retroflex" (without "WITH") | diaeresis below | 1EA0,1EA1
 514                 -5A: ring below | 1E76,1E77
 515                 -60: circumflex below except for cp{1E76,1E77} | horn and acute
 516                 -61: breve below | horn and grave
 517                 -63: tilde below | 2125
 518                 -68: D0,F0,182,183 | dot below and dot above | topbar
 519                 -69: right half ring | horn and tilde
 520                 -6A: circumflex and dot below
 521                 -6D: breve and dot below
 522                 -6E: dot below and macron
 523                 -95: horn and hook above
 524                 -AA: horn and dot
 525
 526                 (for 01-0D and 7B-8A, they are not related to diacritics.)
 527
 528                 <category BlahBlahNumbers from 0100 to 1000>
 529                 -38: Arabic-Indic numbers (660-669)
 530                 -39: extended Arabic-Indic numbers (6F0-6F9)
 531                 -3A: Devanagari numbers (966-96F)
 532                 -3B: Bengali numbers (9E6-9EF)
 533                 -3C: Bengali currency enumerators (9F4-9F9)
 534                 -3D: Gurmukhi numbers (A66-A6F)
 535                 -3E: Gujarati numbers (AE6-AEF)
 536                 -3F: Oriya digit numbers (B66-B6F)
 537                 -40: Tamil numbers (BE7-BF2)
 538                 -41: Telugu numbers (C66-C6F)
 539                 -42: Kannada numbers (CE6-CEF)
 540                 -43: Malayalam numbers (D66-D6F)
 541                 -44: Thai numbers (E50-E59)
 542                 -45: Lao numbers (ED0-ED9)
 543                 <miscellaneous numbers>
 544                 -47: Roman numbers (2160-2182)
 545                 -4E: Hangchou numbers (3021-3029)
 546
 547                 -E0[64]: 2107 (Euler)
 548                 -E0[87]: some Tone letters (TONE TWO / TONE SIX)
 549                 -EE: Circled letter-or-digits and katakanas
 550                         CIRCLED {DIGIT|NUMBER|LATIN|KATAKANA}
 551                         numbers (2460-2473,2776-2793,24EA)
 552                         latin (24B6-24E9)
 553                         katakana (32D0-32FE)
 554                 -F3: Parenthesized enumerations
 555                         numbers (2474-2487)
 556                         latin (249C-24B5)
 557                         PARENTHESIZED {DIGIT|NUMBER|LATIN}
 558                 -F4: Numbers with dot (2488-249B)
 559                         {DIGIT|NUMBER} * FULL STOP
 560
 561                 <miscellaneous>
 562                 -258,25C-25E,285,286,29A,297 -> 0E[80-86,88]
 563                 -27F,2B3-2B6 -> 0E 8A[80-84]
 564                 -3D3 -> 0F[44]
 565                 -476,477 -> 10[46]
 566                 -215F -> 0C[03]
 567
 568                 -20D0-20E1 -> 01[DD-F0]
 569                 -483-486 -> 01[94-97]
 570                 -559,55A -> 01[98,99]
 571                 -711 -> 01[9A]
 572
 573                 -346-348,2BE-2C5,2CE-2CF -> 01[74-7F]
 574                 -2D1-2D3,2DE,2E4-2E9 -> 01[81-8A]
 575
 576                 -342,343 -> 01[8D,8E]
 577                 -345 -> 01[90]
 578
 579                 -700-780 -> 01[8D-AF]. Maybe there is some kind of traditional
 580                 order in Estrangela, but for now I am not sure.
 581                 /*
 582                 -740-742 -> 01[8D-8F]
 583                 -747,748,732,735,738,739,73C,73F,743-746,730 -> 01[90,91,94-9F]
 584                 -731,733,734,736,737,73A,73B,73D,73E,749,74A,7A6-7A9
 585                  -> 01[A0-AA,AC-AF]
 586                 */
 587                 -7AA-7B0 -> 01[B0-B6]
 588
 589                 -591-5C2 except for 5BA,5BE -> 01[03-33] in order
 590
 591                 No further patterns for >= 80
 592
 593                 TODO: Below are not done yet:
 594                         - x < 0x80 in non-"0E" part
 595                         - 03 <= x <= 0D in "0E" part
 596                         - 7B <= x <= 7F in "0E" part
 597
 598 **** sortkey details by category
 599
 600         The actual value analysis is not complete in this document. See the
 601         actual generator code.
 602
 603         1 specially ignored ones (Japanese, Tamil, Thai)
 604
 605                 IdentifyBy: constants
 606                 Unicode: 3099-309C, BCD, E47, E4C, FF9E, FF9F
 607                 SortKey: 01 01 01 01 00
 608
 609         2 shift weight characters
 610
 611         They are either at 01 01 01 01 or 06, depending on StringSort. For
 612         convenience, I use 06 to describe them.
 613
 614         2.1 control characters (specified as such in Unicode), except for
 615         whitespaces (0009-000D).
 616
 617                 ProcessAfter: 4.1
 618                 IdentifyBy: UnicodeCategory.Control
 619                 Unicode: 0001-000F minus 0009-000D, 007F-009F
 620                 SortKey: 06 03 - 06 3D
 621
 622         2.2 Apostrophe
 623                 IdentifyBy: constant
 624                 Unicode: 0027,FF07 (')
 625                 SortKey: 06 80 (and width insensitive equivalents)
 626
 627         2.3  minus sign, hyphen, dash
 628           minus signs: FE63, 207B (super), 208B (sub), 002D, FF0D (full-width)
 629           hyphens: 00AD (soft), 2010, 2011 (nonbreaking) ... Unicode HYPHEN?
 630           dashes, horizontal bars: FE58 ... UnicodeCategory.DashPunctuation
 631
 632                 IdentifyBy: UnicodeCategory.DashPunctuation
 633                 SortKey: 06 81 - 06 90 (and nonspace equivalents)
 634
 635         2.4 Arabic spacing and equivalents (64B-652, FE70-FE7F)
 636           They are part of nonspacing mark, but not equal.
 637
 638                 SortKey: 06 A0 - 06 A7 (and nonspace equivalents)
 639
 640         3 nonprimary characters, mixed.
 641
 642           ModifierSymbol, except for that are not in category 0 and "07" area
 643           (i.e. < 128) nor those equivalents
 644
 645           NonSpacingMark which is ignorable (IsIgnorableNonSpacing())
 646           // 30D, CD5-CD6, ABD, 2B9-2C5, 2C8, 2CB-2CD, 591-5C2. NonSpacingMark in
 647           // 981-A3C. A4D, A70, A71, ABC ...
 648
 649           TODO: I need more insight to write table generator.
 650
 651           SortKey: 01 03 01 - 01 B6 01
 652
 653           This part of MS table design is problematic (buggy): \u0592 should
 654           not be equal to \u09BC.
 655
 656           I guess, this buggy design is because Microsoft first thought that
 657           there won't be more than 255 characters in this area. Or they might be
 658           aware of the problem but prefer table optimization.
 659
 660           Ideal solutions:
 661
 662           1) We should not mix those codes (make things sequential) and should
 663              expand level 2 length to 2 bytes. Instead of having a direct value,
 664              we could use an index (pointer) to a zero-terminated level 2 table.
 665
 666           2) Include those characters from minor cultures here.
 667
 668           If in "discriminatory mode", those tables could still be provided
 669           so as to be compatible with Windows.
 670
 671           Additionally there seem to be some bugs around Modifier letter collection.
 672           For example, 2C6 should be a nonspacing diacritical character but it
 673           is regarded as a primary character. The same applies to Mandarin
 674           tone marks (2C9-2CB) (and there are plenty of such characters).
 675
 676         4 space separators and some kind of marks
 677
 678         4.1 whitespaces, paragraph separator etc.
 679           UnicodeCategory.SpaceSeparator : 20, 3000, A0, 9-D, 2000-200B
 680
 681           SortKey : 07 02 - 07 18
 682
 683         4.2 some OtherSymbols: 2422-2423
 684
 685           SortKey : 07 19 - 07 1A
 686
 687         4.3 ASCII compatible marks ('!', '^', ...)
 688           Non-alpha-numeric < 0x7F except for [[+-<=>']]
 689           small compatibility equivalents -> itself, wide
 690
 691         4.4 other marks
 692           FIXME: how to identify them?
 693           some Punctuations: InitialQuote/FinalQuote/Open/Close/Connector
 694           some OtherSymbols: 2400-2424
 695           3003, 3006, 2D0, 10FB
 696           remaining Punctuations: 9xx, 7xx
 697           70F (Format)
 698
 699           SortKey : 07 1B - 07 F0
 700
 701         5 mathematical symbols
 702           InitialQuotePunctuation and FinalQuotePunctuation in ASCII
 703           (not Quotation_Mark property in PropList.txt ; 22, 27)
 704
 705           byte area MathSymbol: 2B,3C,3D,3E,AB,B1,BB,D7,F7 except for AC
 706           some MathSymbol (2044, 208A, 208C, 207A, 207C)
 707           OtherLetter (1C0-1C2)
 708           2200-22FF MathSymbol except for 221E (INF. ; regarded as a number)
 709
 710           SortKey : 08 02 - 08 F8
 711
 712         6 Arrows and Box drawings
 713           09 02 .. 09 7C : 2300-237A
 714                         only primary differences
 715           09 BC ... 09 FE : 25A0-AB, 25E7-EB, 25AC-B5, 25EC-EF, 25B6-B9,
 716                         25BC-C3, 25BA-25BB, 25C4-25D8, 25E6, 25DA-25E5
 717                         21*,25*,26*,27*
 718                         This area contains level 2 values.
 719           2190- (non-codepoint order)
 720                 note that there are many compatibility equivalents
 721           2500- except for 266F (#)
 722
 723           SortKey : 09 02 - 09 7C, 09 BC 01 03 - 09 BC 01 13,
 724                     09 {BD|BE|BF} 01 {03|04}, ...
 725                     TODO: fill the patterns
 726
 727         7 currency symbols and some punctuations
 728           byte CurrencySymbols except for 24 ($)
 729           byte OtherSymbols (A7-B6)
 730           ConnectorPunctuation - 2040 (i.e. FF65, 30FB)
 731           OtherPunct/ConnectorPunct/CurrencySymbol 2020-20AC - 20AC
 732           OtherSymbol 3012-303F,3004,327F
 733           MathSymbol/OtherSymbol 2600-2767 (math = 266F)
 734           OtherSymbol 2440-244A, 2117
 735           20AC (CurrencySymbol)
 736
 737           SortKey : 0A 02 - 0A FB
 738
 739         8 (C) numbers
 740           all DecimalDigitNumber, LetterNumber, non-CJK OtherNumber.
 741           9F8.
 742           digits, in numeric order. We can use NET_2_0 CharUnicodeInfo.
 743           221E. (INF.)
 744
 745           SortKey : 0C 02 (9F8), 0C 03 - 0C E1 (normal numbers), 0C FF (INF.)
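
               A minimal sketch of the "digits, in numeric order" rule, using
               CharUnicodeInfo.GetNumericValue(). The class name and the
               codepoint tie-break are assumptions of this sketch; how Windows
               orders characters with the same numeric value is not verified here.

```csharp
using System;
using System.Globalization;

public static class NumericOrderSketch
{
    // Sorts number characters by numeric value. Ties fall back to
    // codepoint order, which is an assumption of this sketch.
    public static char [] SortByNumericValue (char [] chars)
    {
        char [] result = (char []) chars.Clone ();
        Array.Sort (result, delegate (char x, char y) {
            int c = CharUnicodeInfo.GetNumericValue (x).CompareTo (
                CharUnicodeInfo.GetNumericValue (y));
            return c != 0 ? c : x.CompareTo (y);
        });
        return result;
    }
}
```

               For example, U+0661 (value 1) and U+FF12 (value 2) both sort
               before ASCII '3' regardless of their codepoints.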
 746
 747         9 (E) latin letters (alphabets), mixing alphabetical symbols
 748           Alphabets, A to Z, mixing alphabetical symbols. See below.
 749           F8-2B8 except for (1BB-1BD and 1C0-1C3), but not sequential.
 750           2E0-2E3.
 751
 752           For diacritical orders, see level 2.
 753
 754           For 'A' it is "0E 02", for 'B' "0E 09" ... 'Z' "0E A9", ezh "0E AA".
 755           0E B3 (1BE), 0E B4 (298)
 756
 757           There are CJK compatibility characters (3800-) and letterlike
 758           symbols (2100-) in those A-to-Z area, ordered by character name.
 759
 760           Primary weights are sometimes culture-dependent.
 761                 FIXME: [0E 0D], [0E 0E], [0E 4B], [0E 75], [0E B2] are unknown
 762                 02: A
 763                 03: C4 in sk|vi
 764                 04: C1 in is|pl|vi
 765                 05-08: CJKext
 766                 09: B
 767                 0A: C
 768                 0B: 10D in hr|lt|lv|pl, 107 in pl
 769                 0C: C7 in az|tr, 10D in cs|sk, 106 in hr
 770                 0F-19: CJKext
 771                 1A: D
 772                 1B: 189 (African D)
 773                 1C: 2A3 (DZ Digraph)
 774                 1D: 1C6 (dz) in hr
 775                 1E: 110 (D with stroke) in hr
 776                 1F-20: CJKext
 777                 21: E
 778                 22: 18F=259 in az, E9 in is, 119 in pl, EA in vi, 1EBE-1EC7 in vi
 779                 23: F
 780                 24: CJKext
 781                 25: G
 782                 26: 11F in az|tr, 123 in lv
 783                 28-2B: CJKext
 784                 2C: H
 785                 2D: 267 (Heng with hook)
 786                 2E: 33CB in az, 33CA in tr
 787                 2F-31: CJKext
 788                 32: I
 789                 33: CD in is, 79 in lt
 790                 34: CJKext
 791                 35: J
 792                 36: K
 793                 37-47: CJKext
 794                 48: L
 795                 49: 2114
 796                 4A: 1C9 in hr
 797                 4C: 142 in pl
 798                 4D-50: CJKext
 799                 51: M
 800                 52-6F: CJKext
 801                 70: N
 802                 71: 2116
 803                 72: 144 in pl
 804                 73: F1 in es, 1CC in hr
 805                 74: 14B
 806                 76-7B: CJKext
 807                 7C: O
 808                 7D: F6 in az|hu|tr, 151 in hu, F3 in is|pl, F4 in sk|vi, 1ED0-1ED9 in vi
 809                 7E: P
 810                 7F-88: CJKext
 811                 89: Q
 812                 8A: R
 813                 8B: 211E
 814                 8C: 211F
 815                 8D: 159 in cs|sk
 816                 8E-90: CJKext
 817                 91: S
 818                 92: 2108
 819                 93: 2120
 820                 94-95: CJKext
 821                 96: 17F (LATIN SMALL LONG S)
 822                 97: 15F in az|tr, 161 in cs|hr|lt|lv|sk|sl, 7A,179-17C in et, 15B in pl
 823                 98: 17E in et, 15F in ro, 15B in sl
 824                 99: T
 825                 9A: 2121
 826                 9B: CJKext
 827                 9C: 2122
 828                 9D: 2A6
 829                 9E: 166
 830                 9F: U
 831                 A0: FA in is, 1B0,1EE8-1EF1 in vi
 832                 A1: FC in az|tr, 56,57 in et, FC,171 in hu, FB in vi
 833                 A2: V
 834                 A3: 2123
 835                 A4: W
 836                 A5: CJKext
 837                 A6: X
 838                 A7: Y
 839                 A8: FD in is
 840                 A9: Z
 841                 AA: 292
 842                 AB: DE in is, 17E in lt|lv, 17A in pl
 843                 AC: E6 in da|is, 1E3 in is, 17C in pl, 17E in sl
 844                 AD: 17E in cs|hr|sk, E5 in fi, F6,F8 in is, 17A in sl
 845                 AE: F6,F8,151 in da
 846                 AF: E4 in fi
 847                 B0: F6,F8,151 in fi
 848                 B1: E5 in da, "aa" in da
 849                 B3: 1BE
 850                 B4: 298
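
               The culture-independent rows of the list above can be held as a
               small constant array. This is only a sketch: the class and
               method names are hypothetical, and tailored or accented letters
               are deliberately left out.

```csharp
public static class LatinPrimarySketch
{
    // Level 1 weights of unaccented A-Z inside category 0E, copied from
    // the list above (index 0 is 'A').
    static readonly byte [] weights = {
        0x02, 0x09, 0x0A, 0x1A, 0x21, 0x23, 0x25, 0x2C, 0x32, // A-I
        0x35, 0x36, 0x48, 0x51, 0x70, 0x7C, 0x7E, 0x89, 0x8A, // J-R
        0x91, 0x99, 0x9F, 0xA2, 0xA4, 0xA6, 0xA7, 0xA9        // S-Z
    };

    // Returns (category << 8) | level1 for an ASCII letter, or -1 otherwise.
    // Case is a level 3 difference, so 'a' and 'A' share the same value.
    public static int GetPrimary (char c)
    {
        int i;
        if (c >= 'A' && c <= 'Z')
            i = c - 'A';
        else if (c >= 'a' && c <= 'z')
            i = c - 'a';
        else
            return -1;
        return (0x0E << 8) | weights [i];
    }
}
```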
 851
 852         10 culture dependent letters (general)
 853           0F: 386-3F2 ... Greek and Coptic
 854                 386-3CF: [0F 02] - [0F 19] (consider primary equivalents)
 855                 3D0-3EF: [0F 40] - [0F 54]
 856           10: 400-4E9 ... Cyrillic.
 857                 For 400-45F and 4B1, they are mostly UCA DUCET order.
 858                 After that 460-481 follows, by codepoint.
 859                 (490-4FF except for 4B1 and Cyrillic supplementary are unused.)
 860           11: 531-586 ... Armenian.
 861                 Simply sorted by codepoint (handle case).
 862           12: 5D0-5F2 ... Hebrew.
 863                 Codepoint order (handle case).
 864           13: 621-6D5 plus 670 (NonSpacingMark) ... Arabic
 865                 Area 1:
 866                 They look like ordered by Arabic Presentation Form B except
 867                 for FE95, and considering diacritical equivalents maybe based
 868                 on the primary character area (621-6D5).
 869                 There are still some special characters: 67E,686,698,6AF ...
 870                 which might not have equivalent characters (I wonder how they
 871                 are inserted into the presentation form B map).
 872
 873                         Solution:
 874                         - hamza, waw, yeh (621,624,626) are special: [13 07]
 875                         - For all remaining letters, get primary letter name
 876                           and store it into dictionary. If unique, then
 877                           increment index by 4 from [13 0B]
 878                 Area 2:
 879                 674-6D5 : by codepoint from [13 84].
 880           14: 901-963 exc. 93A-93D 950-954 ... Devanagari.
 881                 For <905 codepoint order, x2 from [14 04].
 882                 For 905-939 codepoint order, x4 from [14 0B].
 883                 For 93E-94D codepoint order, x2 from [14 DA].
 884           15: 982-9FA ... Bengali. Actually all UnicodeCategories except for
 885                 NonSpacingMark, DecimalDigitNumber and OtherNumber.
 886                 For <9E0 simple codepoint order from [15 02].
 887                 For >9E0 simple codepoint order from [15 3B].
 888           16: A05-A74 exc. A3C A4D A66-A71 ... Gurmukhi.
 889                 The same as UCA order, x4 from [16 04].
 890           17: A81-AE0 exc. ABC-ABD ... Gujarati.
 891                 Mostly equivalent to UCA, but insert {AB3,A81-A83} before AB9,
 892                 x4 from [17 04].
 893           18: B00-B70 ... Oriya
 894                 All but NonSpacingMark and DecimalDigitNumber, by codepoint.
 895           19: B80-BFF ... Tamil
 896                 BD7 is special : [19 02].
 897                 B82-B93 (vowels) : x2 from [19 0A].
 898                 B94 (vowel AU) : [19 24]
 899                 For consonant order Windows has native Tamil order which is
 900                 different from UCA.
 901                 http://www.nationmaster.com/encyclopedia/Tamil-alphabet
 902                 (The order is still different in "Grantha" order from TAM.)
 903                 So, we should just hold constant array for consonants.
 904                 And put them in order, x4 from [19 26].
 905                 BBE-BCC : SpacingCombiningMark and BC0 ... x2 from [19 82].
 906           1A: C00-C61 ... Telugu.
 907                 C55 and C56 are ignored (C5x line and remaining part of C6x
 908                 line just look like ignored).
 909                 C60 and C61 are specially placed. C60 after C0B, C61 after C0C.
 910                 Except for above, by codepoint, x3 from [1A 04].
 911           1B: C80-CE5 ... Kannada.
 912                 CD5,CD6 (and CE6-CEF: DecimalDigitNumber) are ignored.
 913                 by codepoint, 3x from [1B 04].
 914           1C: D02-D40 ... Malayalam.
 915                 by simple codepoint from [1C 02].
 916           (1D: Sinhala ... totally ignored?)
 917           1E: E00-E44 ... Thai.
 918                 preceding vowels (E40-E44) by codepoint [1E 02 - 1E 06]
 919                 consonants (E01-E2A) by codepoint, x6 from [1E 07].
 920           1F: E2B-E5B,E80-EDF ... Thai / Lao. (Thai breaks the category wall.)
 921                 Thai:
 922                 remaining consonants (E2B-E2E) by codepoint, x6 from [1E 07].
 923                 remaining vowels (E2F-E3A) by codepoint.
 924                 E45,E46,E4E,E4F,E5A,E5B
 925                 Lao:
 926                 E80-EDF by codepoint from [1F 02].
 927           21: 10A0-10FF ... Georgian
 928                 Mostly equal to UCA order, but swap 10E3 <-> 10F3,
 929                 x5 from [21 05].
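
               The "xN from [cc ww]" notation used in this list means that the
               level 1 weight advances by N for each character of the run. A
               minimal sketch, using the Thai consonant run as the example and
               assuming every codepoint in the run takes one slot (runs with
               skipped or primary-equivalent characters need an explicit list
               instead). All names are hypothetical.

```csharp
public static class StrideSketch
{
    // Level 1 weight of the index-th character of a run that starts at
    // 'start' and advances by 'step' per character. Throws on overflow
    // because a level 1 weight must fit in one byte.
    public static byte GetLevel1 (int index, byte start, int step)
    {
        return checked ((byte) (start + index * step));
    }

    // Thai consonants E01-E2A: "by codepoint, x6 from [1E 07]".
    public static byte GetThaiConsonantLevel1 (char c)
    {
        if (c < '\u0E01' || c > '\u0E2A')
            throw new System.ArgumentOutOfRangeException ("c");
        return GetLevel1 (c - '\u0E01', 0x07, 6);
    }
}
```

               With these numbers the last consonant (E2A) lands on FD, so
               the run still fits in the single level 1 byte of category 1E.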
 930
 931         11 (22) japanese kana letters and symbols, not in codepoint order
 932
 933           For single character, the sortkeys look like:
 934           - Katakana normal A, Half Width (FF71) : FF 02 C4 FF C4 FF 01 00
 935           - Katakana normal A, Full Width (30A2) : FF C4 FF 01 00
 936           - Hiragana normal A, Full Width (3042) : FF FF 01 00
 937
 938           Actually for level 4 weights, there is a different rule (see
 939           "level 4" format above).
 940
 941           There is also 32D0 (normal katakana A with circle) that have
 942           diacritic difference.
 943
 944           For primary weights, 'A' to 'O' are mapped to 22-26, 'Ka' to 'Ko'
 945           are to 2A-2E, 'Sa' to 'So' are to 32-36 ... and follows.
 946           'Nn' is special: [22 80].
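
               The three rows quoted above fit a simple pattern of eight slots
               per row. A minimal sketch of that pattern follows; it is only
               confirmed by the A, Ka and Sa rows given here, and extending it
               to later rows is an assumption. 'Nn' ([22 80]) is not covered.

```csharp
public static class KanaPrimarySketch
{
    // row: 0 = A row, 1 = Ka row, 2 = Sa row, ...
    // vowel: 0..4 = a, i, u, e, o.
    // Returns the level 1 weight inside category 22.
    public static byte GetLevel1 (int row, int vowel)
    {
        if (row < 0 || vowel < 0 || vowel > 4)
            throw new System.ArgumentOutOfRangeException ();
        return checked ((byte) (0x22 + row * 8 + vowel));
    }
}
```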
 947
 948           After Kana characters, there are CJK compat characters.
 949           From 22 97 01 01 01 01 00 (3349) to 22 A6 01 01 01 01 00 (333B) are
 950           sorted in JIS table order (CP932.TXT). Remaining square characters
 951           are maybe sorted in Alphabetic order.
 952
 953           UCA DUCET also does not apply here.
 954
 955         12 (23) bopomofo letters
 956                 3105-312C: simple codepoint order from [23 02].
 957
 958         13 culture dependent letters 2
 959                 710-72C : Estrangela (ancient Syriac).
 960                         codepoint order.
 961                         711 is excluded (superscript).
 962                         714,716,71C,724 and 727 are "alternative" characters.
 963                         SortKey: [24 0B]-[24 60], by x where x is 2 for those
 964                         which is "alternative" defined above, otherwise 4.
 965
 966                 780-7A5 : Thaana
 967                         Equals to UCA order, x2 from [24 6E].
 968
 969         (Maybe we should add remaining minor-culture characters here. Tibetan,
 970         Limbu, Tagalog, Hanunoo, Buhid, Tagbanwa, Myanmar, Khmer, Tai-Le,
 971         Mongolian, Cherokee, Canadian-Aboriginal, Ogham, Runic are ignored)
 972
 973         14 (41-45) surrogate Pt.1
 974
 975         15 (52 02-7E C8) hangul, mixing combined ones
 976
 977           It starts from 1100. After width-insensitive equivalents, those
 978           syllables (from AC00) follow (until D7A3).
 979           The order seems to be based on some formula (though sometimes it does
 980           not look so, e.g. 1117). FIXME: this area should be clarified more.
 981
 982           Hangul Syllables should not be filled in the table. Instead, they
 983           can be easily computed by the following formula:
 984
 985                 // rc is the codepoint for the input Syllable
 986                 // (p holds "(category << 8) + level1weight")
 987                 int ri = ((int) rc - 0xAC00) + 1;
 988                 ushort p = (ushort)
 989                         ((ri / 254) * 256 + (ri % 254) + 2);
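
               The same arithmetic wrapped as a self-contained method, as a
               sketch only. The class name and the range check are additions.
               The result is the value p of the snippet above; how it relates
               to the 52 02 starting point of this area is not stated here, so
               no base is added.

```csharp
public static class HangulSyllableSketch
{
    // Each lead byte holds 254 level 1 values (02..FF), so the bytes
    // 00 and 01 never appear in the low byte.
    public static ushort GetPrimary (char rc)
    {
        if (rc < '\uAC00' || rc > '\uD7A3')
            throw new System.ArgumentOutOfRangeException ("rc");
        int ri = ((int) rc - 0xAC00) + 1;
        return (ushort) ((ri / 254) * 256 + (ri % 254) + 2);
    }
}
```

               For AC00 this gives 0003, and for the last syllable D7A3
               (ri = 11172) it gives 2BFC.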
 990
 991           Hangul Jamo cannot be filled in the table directly, since
 992           U+1113 - U+1159 hold additional primary key bytes.
 993           FIXME: find out how they can be computed.
 994           See http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icuhtml/design/collation/ICU_collation_design.htm?rev=HEAD&content-type=text/html#Hangul_Implicit_CEs
 995
 996         16 (9E 02-F1 E4) CJK
 997
 998            9E 02-F0 B4 [3192-319F,3220-3243,3280-32B0,4E00-9FA5] : CJK mark,
 999                 parenthesized CJK (part), circled CJK (part), CJK ideograph.
1000                 Ordered but considering compatible characters (i.e. there is
1001                 no other way than having massive mapping).
1002            F0 B5-F1 E4 [F900-FA2D]. CJK compatibility ideograph.
1003
1004            LAMESPEC: in the latest spec CJK ends at 9F BB. Since MS table
1005            joins these two categories without any consideration, it is
1006            impossible to insert those new characters without breaking binary
1007            compatibility.
1008
1009         17 (E5 02-FE 33) PrivateUse.
1010
1011            In fact it overlaps with CJK characters (maybe a layout design failure).
1012
1013         18 (F2 01-F2 31) surrogate Pt.2
1014
1015            In fact it overlaps with PrivateUse (maybe a layout design failure).
1016
1017         19 (FE FF 10 02 - FE FF 29 E9) CJK extensions
1018
1019            3400-4DB5. Ordered.
1020
1021            They should be computed rather than stored. This range must be
1022            checked anyway (its sortkey values cannot be taken directly since
1023            they need the FE FF part), and the values can be computed.
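
                A sketch of one way to compute them, assuming 254 values
                (02..FF) per lead byte as in the Hangul formula. The formula
                is inferred only from the two endpoints in the heading
                (3400 -> FE FF 10 02, 4DB5 -> FE FF 29 E9); values in between
                have not been checked against Windows. Names are hypothetical.

```csharp
public static class CjkExtensionSketch
{
    // Returns the four primary bytes for U+3400..U+4DB5.
    public static byte [] GetPrimary (char c)
    {
        if (c < '\u3400' || c > '\u4DB5')
            throw new System.ArgumentOutOfRangeException ("c");
        int i = c - 0x3400;
        return new byte [] {
            0xFE, 0xFF,
            (byte) (0x10 + i / 254),
            (byte) (0x02 + i % 254)
        };
    }
}
```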
1024
1025         20 (FF FF 01 01 01 01 00) special.
1026            Japanese extender marks:
1027                 3005, 3031, 3032, 309D, 309E, 30FC, 30FD, 30FE, FF70
1028
1029            LAMESPEC: In native context, Microsoft's understanding of Japanese
1030            3031 and 3032 is wrong. They are never used to repeat *just the
1031            previous one* character, but are usually used to repeat two or
1032            more characters. Also, 3005 is not always used to repeat exactly
1033            one character but is sometimes used to repeat two (or possibly more)
1034            characters.
1035
1036            Arabic shadda: FE7C (isolated), FE7D (medial)
1037            (Actually they are not extenders in Unicode PropList.txt)
1038
1039
1040         - by UnicodeCategory -
1041
1042         DashPunctuation         6 (no exception)
1043         DecimalDigitNumber      C (no exception)
1044         EnclosingMark           1 E (no exception)
1045         Format                  7 (only 70F)
1046         LetterNumber            C (no exception)
1047         LineSeparator           7 (only 2028)
1048         ParagraphSeparator      7 (only 2029)
1049         PrivateUse
1050         SpaceSeparator          7 (no exception)
1051         Surrogate
1052
1053         OtherNumber             C(<3192), 9E-A7 (3124<)
1054
1055         Control                 6 except for 9-D (7)
1056         FinalQuotePunctuation   7 except for BB (8)
1057         InitialQuotePunctuation 7 except for AB (8)
1058         ClosePunctuation        7 except for 232A (9)
1059         OpenPunctuation         7 except for 2329 (9)
1060         ConnectorPunctuation    7 except for FF65, 30FB, 2040 (A)
1061
1062         OtherLetter             1, 7, 8 (1C0-1C2), C, 12-FF
1063         MathSymbol              8, 9, 6, 7, A, C
1064         OtherSymbol             7, 9, A, C, E, F, <22, 52<
1065         CurrencySymbol          A except for FF69,24,FF04 (7) and 9F2,9F3 (15)
1066
1067         LowercaseLetter         E-11 except for B5 (A) and 1BD (C)
1068         TitlecaseLetter         E (no exception)
1069         UppercaseLetter         E,F,10,11,21 except for 1BC (C)
1070         ModifierLetter          1, 7, E, 1F, FF
1071         ModifierSymbol          1, 6, 7
1072         NonSpacingMark          1, 6, 13-1F
1073         OtherPunctuation        1, 7, A, 1F
1074         SpacingCombiningMark    1, 14-22
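
             The "no exception" rows of this list can be turned directly into
             a lookup. A minimal sketch with a hypothetical class name; every
             other UnicodeCategory needs the per-character rules above, so
             the method returns -1 for them. The runtime's Unicode data may
             be newer than the data these notes were written against.

```csharp
using System.Globalization;

public static class CategoryByUnicodeCategorySketch
{
    // Returns the sort key category for the unambiguous rows of the list
    // above, or -1 when UnicodeCategory alone does not decide it.
    public static int GetCategory (char c)
    {
        switch (char.GetUnicodeCategory (c)) {
        case UnicodeCategory.DashPunctuation:
            return 0x06;
        case UnicodeCategory.SpaceSeparator:
            return 0x07;
        case UnicodeCategory.DecimalDigitNumber:
        case UnicodeCategory.LetterNumber:
            return 0x0C;
        case UnicodeCategory.TitlecaseLetter:
            return 0x0E;
        default:
            return -1;
        }
    }
}
```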
1075
1076 *** Culture dependent design
1077
1078         (To verify this section, run the simple dumper code shown above
1079         with all the supported cultures.)
1080
1081 **** primary cultures and non-primary cultures
1082
1083         This code runs the character dump for every culture, using the
1084         sort key dumper shown above.
1085
1086         public static void Main ()
1087         {
1088                 foreach (CultureInfo ci in CultureInfo.GetCultures (
1089                         CultureTypes.AllCultures)) {
1090                         ProcessStartInfo psi = new ProcessStartInfo ();
1091                         psi.FileName = "../allsortkey.exe";
1092                         psi.Arguments = ci.Name;
1093                         psi.RedirectStandardOutput = true;
1094                         psi.UseShellExecute = false;
1095                         Process p = new Process ();
1096                         p.StartInfo = psi;
1097                         p.Start ();
1098                         string s = p.StandardOutput.ReadToEnd ();
1099                         StreamWriter sw = new StreamWriter (ci.Name + ".txt", false, Encoding.UTF8);
1100                         sw.Write (s);
1101                         sw.Close ();
1102                 }
1103         }
1104
1105         For each sub culture (that has a parent culture), its collation
1106         mapping is identical to that of its parent, except for az-AZ-Cyrl.
1107
1108         Additionally,
1109
1110         - zh-CHS = zh-CN = zh-SG = zh-MO : pronunciation
1111         - zh-TW = zh-HK = zh-CHT : stroke count
1112         - da = no
1113         - fi = sv
1114         - hr = sr
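
             The sharing rules above could be applied when choosing a table,
             roughly as sketched here. The class name is hypothetical, and
             which member of each equal pair acts as the representative
             (for example "da" rather than "no") is an arbitrary choice of
             this sketch.

```csharp
using System.Collections.Generic;

public static class CollationCultureSketch
{
    static readonly Dictionary<string, string> aliases = CreateAliases ();

    static Dictionary<string, string> CreateAliases ()
    {
        Dictionary<string, string> d = new Dictionary<string, string> ();
        d ["zh-CN"] = "zh-CHS"; d ["zh-SG"] = "zh-CHS"; d ["zh-MO"] = "zh-CHS";
        d ["zh-TW"] = "zh-CHT"; d ["zh-HK"] = "zh-CHT";
        d ["no"] = "da"; d ["sv"] = "fi"; d ["sr"] = "hr";
        return d;
    }

    // Maps a culture name to the name whose collation table it shares.
    public static string Resolve (string name)
    {
        // These names contain '-' but must not fall back to a parent.
        if (name == "az-AZ-Cyrl" || name == "zh-CHS" || name == "zh-CHT")
            return name;
        string alias;
        if (aliases.TryGetValue (name, out alias))
            return alias;
        // A sub culture shares its parent's mapping.
        int dash = name.IndexOf ('-');
        return dash < 0 ? name : Resolve (name.Substring (0, dash));
    }
}
```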
1115
1116         (UCA implies that there are some cultures that sort alphabets from
1117         large to small, but as far as I can see there is no such CultureInfo.)
1118
1119 **** Latin characters and NonSpacingMark order tailorings
1120
1121         div : FDF2 is 24 83 01 01 01 01 00 (only 1 difference)
1122         syr : some NonSpacingMarks are totally ignorable.
1123         tt,kk,mk,az-AZ-Cyrl,uk : cyrillic difference
1124         az,et,lt,lv,sl,tr,sv,ro,pl,no,is,hu,fi,es,da : latin difference
1125         fr : 1C4-1C6.
1126         sk,hr,cs : latin and NonSpacingMark differences
1127
1128         ja,ko : 5C
1129
1130 **** CJK character order tailorings
1131
1132         <how many tables?>
1133
1134         There are five different CJK orderings:
1135         default, ko(-KR), ja(-JP), zh-CHS and zh-CHT
1136         Each of them has a very different CJK mapping.
1137
1138         Since they seem to be based on traditional encodings, we are likely to
1139         provide other constant tables and switch depending on the culture.
1140
1141         <what characters are different from the invariant culture?>
1142
1143         ko : CJK layout difference (52 -> 80)
1144         ja,zh-CHS,zh-TW : dash (5C), CJK layout difference.
1145
1146         Target characters are : CJK misc (3190-), Parenthesized CJK
1147         (3200-), CJK compat (3300-), CJK ideographs (4E00-),
1148         CJK compat ideograph (F900-), Half/Full width compat (FF00-)
1149
1150         Additionally for Korean: Jamo (1100-), Hangul syllables (AC00)
1151
1152         <how are they composed?>
1153
1154         Japanese CJK order looks based on JIS table order. Those characters
1155         which are also in JIS table are moved to 80 xx. Those which are *not*
1156         in JIS table are left as is (9E-FE).
1157
1158         Additionally, Windows has different order for characters below:
1159         4EDD,337B,337E,337D,337D,337C
1160         They come in front of the first CJK character.
1161
1162         Maybe Korean CJK order respects KS C 5619. Note that Korean mixes
1163         Hangul and CJK in its order, so it's not a flat order without indexes
1164         (thus, for CJK they are not computable). Also, there are extra
1165         level 2 values for the Korean CJK map.
1166
1167         For some Chinese such as zh-CHS, character order is based on pinyin.
1168         And for remaining Chinese such as zh-TW, it is stroke count based.
1169
1170         CLDR of unicode.org has reference ordering of those characters, so
1171         our collation table extracts the sorting order from it.
1172         http://www.unicode.org/cldr/
1173
1174 **** Accent evaluation order
1175
1176         With French cultures, diacritical marks are counted in reverse order.
1177         French ordering does not affect some diacritics (the Japanese
1178         voice mark is not affected - FIXME: I doubt it, because the algorithm
1179         does not seem to allow it).
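
             A minimal sketch of what "reverse order" means for level 2: when
             two strings have equal primary weights, French ordering lets the
             last diacritic difference decide instead of the first. The
             weight values in the usage note are made up for illustration;
             the class name is hypothetical.

```csharp
public static class FrenchSecondarySketch
{
    // Compares per-character level 2 (diacritic) weights of two strings
    // whose primary weights are already equal. With backwards == true the
    // last difference decides, as in French ordering.
    public static int Compare (byte [] x, byte [] y, bool backwards)
    {
        if (x.Length != y.Length)
            throw new System.ArgumentException ("lengths must match");
        for (int n = 0; n < x.Length; n++) {
            int i = backwards ? x.Length - 1 - n : n;
            if (x [i] != y [i])
                return x [i] < y [i] ? -1 : 1;
        }
        return 0;
    }
}
```

             With hypothetical weights { 0, 3, 0, 0 } for "c\u00F4te" and
             { 0, 0, 0, 2 } for "cot\u00E9", forward comparison puts
             "cot\u00E9" first, while backward comparison puts "c\u00F4te"
             first.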
1180
1181         Some other cultures might also have different ones, but none are obvious.
1182
1183
1184 ** Mono implementation plans
1185
1186 *** Collator
1187
1188         CompareInfo contains many overloaded methods that are just for
1189         convenience. This class contains almost only required members.
1190
1191         This class also provides access to tailoring information which is
1192         culture instance dependent:
1193
1194         - French sorting
1195         - contractions/expansions - returns contraction or expansion
1196         - diacritical remapping
1197         - CJK custom mapping
1198
1199         For data area, see CollationDataStructures.txt for now.
1200
1201 *** UnicodeTable (for now MSCompatUnicodeTable)
1202
1203         Provides several accessors to character information related to
1204         the collation element table (of our own).
1205         FIXME: I want to fix some bugs in the Windows collation table, especially
1206         to not ignore some characters, but it requires table modification
1207         which results in further memory allocation. Maybe it would be done
1208         as a patch for the runtime (or classlib) sources.
1209
1210         - ignorable, ignorable nonspace, normalize width, normalize kanatype
1211         - level 4 sortkey provision method(s)
1212
1213 **** character comparison
1214
1215         Since a composite character may *not have* an equivalent
1216         codepoint, character comparison cannot be done just by expecting a
1217         "resulting char" value.
1218         In contrast, since a composite character may *have* an
1219         equivalent codepoint, character comparison also cannot be done just
1220         by comparing "source char" values.
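
             The two cases can be seen with Unicode normalization. This is
             only an illustration using String.Normalize(); it is not meant
             as the comparison method of the collator itself, which works on
             its own tables. The class name is hypothetical.

```csharp
using System;
using System.Text;

public static class CompositeComparisonSketch
{
    // True when the two strings are canonically equivalent, even if
    // their "source char" sequences differ.
    public static bool AreCanonicallyEqual (string x, string y)
    {
        return string.Equals (
            x.Normalize (NormalizationForm.FormD),
            y.Normalize (NormalizationForm.FormD),
            StringComparison.Ordinal);
    }
}
```

             "\u00E9" and "e\u0301" differ as source chars but are
             equivalent, while "q\u0301" has no single-codepoint form to
             compare against at all.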
1221
1222 ***** future optimizations
1223
1224         From where those codepoints differ, for each string it adjusts the
1225         position so that it represents exactly one character element. That is,
1226         it finds the primary character as the start of the range and the last
1227         nonprimary character as the end of the range.
1228
1229         Once Compare() has adjusted the character location to a valid
1230         comparison position, further comparison is done as a usual comparison,
1231         i.e. sortkey comparison considering comparisonLevel.
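
             A rough sketch of the position adjustment described above. It
             approximates "nonprimary character" with
             UnicodeCategory.NonSpacingMark, which is an assumption; the
             real check would use the collation table. Names are
             hypothetical.

```csharp
using System.Globalization;

public static class DifferencePositionSketch
{
    // Returns the start of the character element that contains the first
    // differing char, or -1 when the strings do not differ within their
    // common length.
    public static int FindElementStart (string x, string y)
    {
        int min = x.Length < y.Length ? x.Length : y.Length;
        int i = 0;
        while (i < min && x [i] == y [i])
            i++;
        if (i == min)
            return -1;
        // Step back over marks that belong to the preceding primary
        // character, so that comparison restarts at an element boundary.
        while (i > 0 && (IsMark (x [i]) || IsMark (y [i])))
            i--;
        return i;
    }

    static bool IsMark (char c)
    {
        return char.GetUnicodeCategory (c) == UnicodeCategory.NonSpacingMark;
    }
}
```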
1232
1233 **** Characters in the table / characters computed
1234
1235         Currently I plan not to store the following characters in the table
1236         but to compute them on demand:
1237
1238         - PrivateUse
1239         - Surrogate
1240
1241 **** CJK Unified Ideographs
1242
1243         For CJK unified ideographs, I had to make those culture-dependent
1244         tables in memory. Since they come from some classical encodings, they
1245         are not computed. Thus, they are in a separate table.
1246
1247 **** Level 4: Kana type
1248
1249         The table does not contain level 4 (kanatype) properties for
1250         all the characters. They can be simply computed.
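
             A minimal sketch of such a computation, based on the Unicode
             block ranges for kana letters. Whether Windows treats exactly
             these ranges as Hiragana and Katakana at level 4 is an
             assumption. The half width range here excludes FF70 and the
             voiced marks FF9E-FF9F.

```csharp
public static class KanaTypeSketch
{
    public static bool IsHiragana (char c)
    {
        return c >= '\u3041' && c <= '\u3096';
    }

    public static bool IsKatakana (char c)
    {
        // Full width letters, then half width letters.
        return (c >= '\u30A1' && c <= '\u30FA')
            || (c >= '\uFF66' && c <= '\uFF9D' && c != '\uFF70');
    }
}
```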
1251
1252 **** Level 3: Case/Width properties
1253
1254         Case properties will be stored as a byte array, with limited areas of
1255         codepoint (cp < 3120 || FE00 < cp).
1256
1257         For Hangul characters, it will be computed by codepoint areas.
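
             The limited areas (cp < 3120 || FE00 < cp) suggest a compact
             array with the gap removed. A sketch of the index mapping
             follows; the layout (second area placed directly after the
             first) and the names are assumptions.

```csharp
public static class Level3IndexSketch
{
    // Number of slots: 0..0x311F plus 0xFE01..0xFFFF.
    public const int TableSize = 0x3120 + (0xFFFF - 0xFE00);

    // Maps a codepoint to a slot in the compact array, or -1 when the
    // codepoint lies in the uncovered gap (0x3120..0xFE00).
    public static int ToIndex (char cp)
    {
        if (cp < 0x3120)
            return cp;
        if (cp > 0xFE00)
            return 0x3120 + (cp - 0xFE01);
        return -1;
    }
}
```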
1258
1259 **** Level 2: Diacritical properties
1260
1261         The table will be composed of one byte per character. If we provide
1262         non-buggy mode (Windows is buggy here by design; it just sums
1263         secondary weight values up), the values will come from UCA and
1264         non-blocking check will be introduced.
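
             Why summing is lossy can be shown with made-up weights:
             different mark combinations collapse to the same level 2 value.
             A sketch only; the truncation to one byte is an assumption, and
             the name is hypothetical.

```csharp
public static class Level2SumSketch
{
    // Adds up the level 2 weights of the marks attached to one base
    // character and keeps the low byte of the total.
    public static byte Sum (params byte [] weights)
    {
        int total = 0;
        foreach (byte w in weights)
            total += w;
        return unchecked ((byte) total);
    }
}
```

             Sum (3, 5) and Sum (4, 4) both yield 8, so the two mark
             sequences become indistinguishable at level 2.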
1265
1266         Note that Japanese voice marks are considered at level 2, but there is
1267         no need to have maps for them.
1268
1269
1270 ** Reference materials
1271
1272         Developing International Software for Windows 95 and Windows NT
1273         Appendix D Sort Order for Selected Languages
1274         http://www.microsoft.com/globaldev/dis_v1/disv1.asp?DID=dis33d&File=S24BF.asp
1275
1276         UTR#10 Unicode Collation Algorithm (It is still informative)
1277         http://www.unicode.org/reports/tr10/
1278
1279         UAX#15 Unicode Normalization
1280         http://www.unicode.org/reports/tr15/
1281         especially its canonical/compatibility equivalent characters might
1282         be informative to get those equivalent characters.
1283
1284         To know which character can be expanded, Unicode Character Database
1285         (UCD) is informative (it's informative but not normative to us)
1286         http://www.unicode.org/Public/UNIDATA/UCD.html
1287
1288         Decent char-by-char explanation is available here:
1289         http://www.fileformat.info/info/unicode/
1290
1291         Wine uses the UCA default element table, but has Windows-like character
1292         filtering support in its LCMapString implementation:
1293         http://cvs.winehq.com/cvsweb/wine/dlls/kernel/locale.c
1294         http://cvs.winehq.com/cvsweb/wine/libs/unicode/sortkey.c
1295
1296         Mimer has decent materials on culture specific collations:
1297         http://developer.mimer.com/collations/
1298
1299         This is written in Japanese, but it is an awesome analysis of MS Access
1300         string sorting:
1301         http://www.asahi-net.or.jp/~ez3k-msym/comp/acccoll.htm