* Collation Data structure and code
+** This document is not up to date
-** General
-
- I will create a code generator named "collation-builder" (currently
- create-mscompat-collation-table.cs), which creates collation support
- source files:
-
- - collation-tables.h : C header that holds raw constant arrays
- - CollationTable.cs : C# source that declares raw constant arrays
-
- The latter one is totally optional. It is created just for ease of
- debugging in pure managed world.
-
- CollationSource class is used to represent a culture-specific collation
- resource set.
-
- Note that there are many characters whose sortkey cannot be acquired
- only via those tables. For example, Korean Jamo ((U+1113 - U+115F) has
- primary keys which is more than 2 bytes.
+ Currently the table generator produces all three kinds of files
+ (C header, C# source and binary resources), but only the binary
+ resources are actually used. The C# source is useful for debugging;
+ the C header is historical.
** Manual tasks required to maintain the source.
that is likely not to be in sync with the constant arrays.
-
** collation-tables.h
+ Note: similar structures exist in MSCompatUnicodeTable.cs, but
+ they are now managed code.
+
typedef struct {
ushort lcid;
ushort tailoringIndex;
ushort tailoringCount;
short reverseAccentOrder; /* 1:French sort. 0:Normal */
- } CollationSource;
+ } TailoringInfo;
The [*] arrays are compressed using CodePointIndexer,
whose max value is char.MaxValue+1.
// Holds sortkey basis.
- char [*] category;
- char [*] level1;
- char [*] level2;
- char [*] level3;
- char [*] ignorableFlags; // 1:complete, 2:symbol, 3:nonspace
+ guint8 [*] category;
+ guint8 [*] level1;
+ guint8 [*] level2;
+ guint8 [*] level3;
+ guint8 [*] ignorableFlags; // 1:complete, 2:symbol, 3:nonspace
gunichar [*] widthCompat;
// Holds special arrays for CJK order which is culture dependent.
- ushort [*] cjkCHS;
- ushort [*] cjkCHT;
- ushort [*] cjkJA;
- ushort [*] cjkKO;
- char [*] cjkKOlv2;
+ guint16 [*] cjkCHS;
+ guint16 [*] cjkCHT;
+ guint16 [*] cjkJA;
+ guint16 [*] cjkKO;
+ guint8 [*] cjkKOlv2;
gunichar [whole_tailoring_count] tailorings;
CollationSource [culture_count] collationSources;
static CodePointIndexer Ignorable;
static CodePointIndexer WidthCompat;
static CodePointIndexer CjkCHS;
- static CodePointIndexer CjkCHT;
- static CodePointIndexer CjkJA;
- static CodePointIndexer CjkKO;
- static CodePointIndexer CjkKOLevel1;
+ static CodePointIndexer Cjk;
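The compression that CodePointIndexer applies to the [*] arrays can be pictured as a mapping from a few code point ranges to consecutive slots in a compact array. The C sketch below assumes that model. The names `cp_range`, `cp_indexer`, `cp_to_index` and the demo ranges are hypothetical, chosen for illustration only; they are not the actual CodePointIndexer API.

```c
#include <stddef.h>

/* Hypothetical descriptor: code points in [start, end) are stored
 * consecutively in the compact array, beginning at index_start. */
typedef struct {
    int start;
    int end;          /* exclusive */
    int index_start;
} cp_range;

/* Hypothetical indexer: sorted, non-overlapping ranges plus one
 * shared slot for every code point that falls outside all of them. */
typedef struct {
    const cp_range *ranges;
    size_t count;
    int default_index;
} cp_indexer;

/* Map a code point to its slot in the compact array. */
static int cp_to_index(const cp_indexer *ix, int cp)
{
    for (size_t i = 0; i < ix->count; i++) {
        const cp_range *r = &ix->ranges[i];
        if (cp < r->start)
            break;    /* ranges are ascending, so no later range matches */
        if (cp < r->end)
            return r->index_start + (cp - r->start);
    }
    return ix->default_index;
}

/* Demo data (assumed values): two stored ranges, 0x200 as the default slot. */
static const cp_range demo_ranges[] = {
    { 0x0000, 0x0100, 0x000 },
    { 0x3000, 0x3100, 0x100 },
};
static const cp_indexer demo_indexer = { demo_ranges, 2, 0x200 };
```

With the demo ranges, U+0041 maps to slot 0x41, U+3042 maps to slot 0x142, and any code point outside both ranges (for example U+2000) maps to the shared default slot 0x200. A table built this way needs 0x201 entries instead of char.MaxValue+1.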
** CollatorSource.cs
static ushort [] cjkKO;
static byte [] cjkKOlv2;
- class CollationSource // instantiated for each CultureInfo
+ class TailoringInfo // instantiated for each CultureInfo
{
// Primary constants
int tailoringIndex;
int tailoringCount;
- bool reverseAccentOrder;
-
- // This array is set according to CJK type, and CJK type
- // will be hardcoded, being identified from LCID.
- ushort [] cjk;
-
- // For Korean, level2 table is specially treated.
-
- // Computed values for optimization in use.
- byte [*] hasTailoringHead;
- byte [*] hasTailoringTail;
+ bool frenchSort;
}
+