mcs/class/corlib/Mono.Globalization.Unicode/normalization-notes.txt

* Normalization implementation notes

** Basics

        Unicode normalization is implemented as String.Normalize(), which
        supports all four forms: FormD, FormC, FormKD and FormKC.

        FormD and FormKD decompose the input string.
        FormC and FormKC decompose the input string and then recompose it.

        Mono's Unicode normalization methods are implemented in
        Mono.Globalization.Unicode.Normalization.

*** Normalization array resources

        The normalization implementation relies heavily on array lookups.
        Most of the arrays represent UCD (Unicode Character Database)
        data, which is essential to Unicode normalization.

        By default (in the release), the arrays are defined as C arrays in
        normalization-table.h and then loaded via icalls (see the static
        constructor).

        Alternatively, for debugging purposes, you can switch to managed
        array lookup instead. The arrays are then defined in
        NormalizationGenerated.cs.

        Both the .h and the -Generated.cs files can be generated by running
        create-normalization-source.exe, which reads UCD and emits them.

        There are 6 arrays in our implementation. The bracketed value after
        each name is the array size:

        - byte props [char.MaxValue]:
          Stores "properties" for each character. These are the
          normalization-specific properties defined in
          "DerivedNormalizationProps.txt".
          It is used for quick checks (NF*_QC), among other things.

        - int mappedChars []:
          Stores all the normalized strings of the mapping entries, expanded
          into a single array of chars. The element at 0 is 0. Each string
          is NUL-terminated (ends with 0). The entries are sorted first by
          the primary composite (source) char, and second by the normalized
          string.

          For example, if the normalized string of the first mapping entry
          has length 2, then [1] holds its first character, [2] holds its
          second character, and [3] holds the terminating 0.

        - short charMapIndex [char.MaxValue]:
          Stores, for each primary composite (source) Unicode character,
          the index of its mapping in mappedChars. If there is no mapping
          for the character, the index value is 0.

          Note that the mapping itself (a source char paired with its
          normalized string) is not stored directly in any of the arrays;
          it is always reached through these indexes.

          example (simplified: the leading 0 and the terminating 0s are
          omitted; the indexes are those of the source chars A to E):
                  mappedChars: [A1, A2, B1, C1, C2, D1, D2, D3, E1]
                  charMapIndex: [0, 2, 3, 5, 8]

        - short helperIndex [char.MaxValue]:
          For each character, stores the index into mappedChars of the
          first character of the first normalized string that begins with
          that character (note that it is *not* a map from the primary
          composite but from the head of the normalized strings).
          If there is no such mapping for the character, the value is 0.

        - ushort mapIdxToComposite [maps.Length]:
          Stores the primary composite (source) character for each mapping,
          where the key is the index into mappedChars.
          It is the reverse of the charMapIndex array (which maps a char to
          a map index).

          example: char src = (char) mapIdxToComposite [mapIdx];

        - byte combiningClass [char.MaxValue]:
          Stores the UCD CombiningClass (Canonical_Combining_Class) value
          for each Unicode character.
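
        The forward lookup described above (charMapIndex into mappedChars)
        can be sketched in C roughly as follows. This is only an
        illustration under stated assumptions: the table holds two
        hand-written toy entries rather than the generated UCD data, and
        char_map_index() and decompose_one() are hypothetical names that
        do not come from the Mono sources.

```c
#include <stddef.h>

/* Toy table, NOT the generated data. The layout follows the notes:
   element 0 is 0, and every normalized string ends with 0. */
static const int mapped_chars[] = {
    0,                    /* [0] reserved, so index 0 can mean "no mapping" */
    0x0041, 0x0300, 0,    /* [1] entry for U+00C0: A + combining grave */
    0x0041, 0x0301, 0,    /* [4] entry for U+00C1: A + combining acute */
};

/* Hypothetical stand-in for charMapIndex. The real array has one slot
   per char; a switch keeps this sketch small. */
static short char_map_index(unsigned cp)
{
    switch (cp) {
    case 0x00C0: return 1;
    case 0x00C1: return 4;
    default:     return 0;    /* no mapping for this character */
    }
}

/* Copies the normalized string mapped from cp into out (at most cap
   elements) and returns the number of elements copied; 0 if cp has
   no mapping. */
static size_t decompose_one(unsigned cp, int *out, size_t cap)
{
    size_t start = (size_t) char_map_index(cp);
    size_t n = 0;

    if (start == 0)
        return 0;
    /* Walk the entry until its terminating 0. */
    while (n < cap && mapped_chars[start + n] != 0) {
        out[n] = mapped_chars[start + n];
        n++;
    }
    return n;
}
```

        With this toy table, decompose_one (0x00C0, buf, 4) fills buf with
        0x0041, 0x0300 and returns 2, while a character without a mapping
        returns 0. Reserving element 0 of mappedChars is what lets index 0
        double as the "no mapping" marker.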
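
        The reverse direction (helperIndex and mapIdxToComposite) might be
        used along these lines when recomposing. This is a sketch, not the
        actual Mono code path. It assumes that helperIndex only gives a
        starting point and that later entries must be scanned one by one,
        and that mapIdxToComposite is keyed by the start index of each
        entry. The tables are toy data, and helper_index() and compose()
        are hypothetical names.

```c
#include <stddef.h>

/* Toy table, NOT the generated data; same layout as the notes describe. */
static const int mapped_chars[] = {
    0,
    0x0041, 0x0300, 0,    /* [1] entry for U+00C0 */
    0x0041, 0x0301, 0,    /* [4] entry for U+00C1 */
    0x0043, 0x0327, 0,    /* [7] entry for U+00C7 */
};
#define MAPPED_LEN (sizeof mapped_chars / sizeof mapped_chars[0])

/* Assumed layout: keyed by the start index of an entry in mapped_chars.
   Slots that are not entry starts stay 0 in this toy. */
static const unsigned short map_idx_to_composite[MAPPED_LEN] = {
    [1] = 0x00C0, [4] = 0x00C1, [7] = 0x00C7,
};

/* Hypothetical stand-in for helperIndex: start index of the first entry
   whose normalized string begins with head, or 0 if there is none. */
static short helper_index(unsigned head)
{
    switch (head) {
    case 0x0041: return 1;
    case 0x0043: return 7;
    default:     return 0;
    }
}

/* Returns the source character whose normalized string is exactly
   seq[0..len), or 0 if no entry matches. */
static unsigned compose(const int *seq, size_t len)
{
    size_t i;

    if (len == 0)
        return 0;
    i = (size_t) helper_index((unsigned) seq[0]);
    if (i == 0)
        return 0;
    while (i < MAPPED_LEN) {
        size_t n = 0;

        /* Compare the candidate entry with seq, element by element. */
        while (n < len && mapped_chars[i + n] != 0
               && mapped_chars[i + n] == seq[n])
            n++;
        if (n == len && mapped_chars[i + n] == 0)
            return map_idx_to_composite[i];
        /* No match: skip past this entry's terminating 0. */
        while (mapped_chars[i] != 0)
            i++;
        i++;
    }
    return 0;
}
```

        With this toy data, the sequence 0x0041, 0x0301 yields 0x00C1: the
        scan starts at the entry for U+00C0, rejects it, and matches the
        next entry. The scan does not rely on entries with the same head
        being adjacent, since the notes only say that entries are sorted
        by source character.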