From: Atsushi Eno Date: Tue, 26 Apr 2005 08:45:03 +0000 (-0000) Subject: 2005-04-26 Atsushi Enomoto X-Git-Url: http://wien.tomnetworks.com/gitweb/?a=commitdiff_plain;h=316abdb206f395349eb2ddbcedae30acae3a7034;p=mono.git 2005-04-26 Atsushi Enomoto * Collation-notes.txt : some updates. * create-mapping-char-source.cs : superscripts and subscripts are also ignored in IgnoreWidth comparison. * Makefile : tiny touch fix. svn path=/branches/atsushi/mcs/; revision=43583 --- diff --git a/mcs/class/corlib/Mono.Globalization.Unicode/ChangeLog b/mcs/class/corlib/Mono.Globalization.Unicode/ChangeLog index 68d5003138c..55854cb98ff 100644 --- a/mcs/class/corlib/Mono.Globalization.Unicode/ChangeLog +++ b/mcs/class/corlib/Mono.Globalization.Unicode/ChangeLog @@ -1,3 +1,10 @@ +2005-04-26 Atsushi Enomoto + + * Collation-notes.txt : some updates. + * create-mapping-char-source.cs : superscripts and subscripts are also + ignored in IgnoreWidth comparison. + * Makefile : tiny touch fix. + 2005-04-25 Atsushi Enomoto * CompareInfoImpl.cs, Collator.cs : conceptual stuff (not working). diff --git a/mcs/class/corlib/Mono.Globalization.Unicode/Collation-notes.txt b/mcs/class/corlib/Mono.Globalization.Unicode/Collation-notes.txt index ef2e96d0507..65a30763c28 100644 --- a/mcs/class/corlib/Mono.Globalization.Unicode/Collation-notes.txt +++ b/mcs/class/corlib/Mono.Globalization.Unicode/Collation-notes.txt @@ -1,10 +1,15 @@ String collation +* Summary + + We are going to implement Windows-like collation, apart from ICU which + is conformant to Unicode specifications. + * CompareInfo members GetSortKey() Compute sort key for every characters into byte[]. - Use collation element table. + Use collation element table, but Windows specific ones. Compare() Find first difference and compare it. "Larger/smaller" matters. IsPrefix() @@ -34,8 +39,7 @@ String collation not distinguished, but Hiragana "A" and Hiragana "I" are. Actually, even without any IgnoreXXX flags (i.e. "None"), there are - many characters that are excluded from results. I'd say it as - "completely ignorable" characters (as said in UCA). + many characters that are ignored ("completely ignorable"). For LCID 101/1125(div), '\ufdf2' is completely ignorable. This rule even applies to CompareOptions.None. @@ -49,10 +53,10 @@ String collation and IgnoreCase, I\u0307 is not regarded as equal to i. IgnoreKanaType - We need ToHiragana() like ToLower(). See also "Notes". + ToKanaTypeInsensitive(). See also "Notes". IgnoreWidth - We need ToFullWidth(), which is likely to be culture + ToWidthInsensitive(), which is likely to be culture independent. See also "Notes". ** Strippers @@ -61,15 +65,15 @@ String collation compatible (at least with .NET 1.1 invariant culture). IgnoreNonSpace - It is in a black box. - - Some Diacritic characters are covered by this flag. + IsIgnorableNonSpacing(). + Some Diacritic characters are covered by this flag. There are some culture *dependent* characters: LCID 90/1114(syr) : 64b, 652, 670 IgnoreSymbols - We need to implement our own Char.IsSymbol(). - UnicodeCategory does not work. + IsIgnorableSymbol(). + UnicodeCategory does not work here. There are some culture *dependent* characters: LCID 17/1041(ja) : 2015 @@ -88,16 +92,20 @@ String collation * Collation element table tailoring + Deprecated; We won't use collation element table from unicode.org. + We will contain only the default element table and Chinese table. (Japanese might be added too, since CLDR contains a large table for it) - Other rules are always "evaluated"; no physical expansion is done. + Other rules are always "evaluated"; no physical expansion is done to + the table loaded in memory (it's too wasting). * Notes Since UCA Level 3 handles both casing and width, it is impossible to use UCA variables for IgnoreWidth, at least with the default element - table. + table. And IgnoreKanaType cannot be handled without case and width + insensitivity. IgnoreWidth/IgnoreSymbols is processed after Kana voice mark decomposition (NFD). @@ -109,3 +117,101 @@ String collation Myanmar, Mongolian, Cherokee, Etiopic, Tagalog, Khmer, are regarded as "completely ignorable". +* MS collation design inference + +** sort key format + + 00 means the end of sort key. + 01 means the end of the level. + 02-FF means the value. + + There are 5 levels. + + - level 1: primary difference + The first byte of level 1 means the category of the character. + - level 2: case sensitivity + - level 3: diacritic difference + - level 4: kana type (mostly at primary category 22) + - level 5: control characters etc. + +** default + + So the problem is, how to detect diacritic. Maybe they are combined + similarly to what is specified in UCA. + +*** sort order categories + + 1 (0) specially ignored ones (Japanese, Tamil, Thai) + + 3099-309C, BCD, E47, E4C, FF9E, FF9F + + 2 (1) maybe nonspacing marks + + 2.1 control characters (specified as such in Unicode), except for + whitespaces (0009-000D). + + 2.2 0027,FF07 (') + + 2.3 minus sign, hyphen, dash + minus signs: FE63, 207B (super), 208B (sub), 002D, 00FD (full-width) + hyphens: 00AD (soft), 2010, 2011 (nonbreaking) ... Unicode HYPHEN? + dashes, horizontal bars: FE58 ... Unicode DASH? + + 2.4 Arabic spacing and equivalents (64B-651, FE70-FE7F) + They are part of nonspacing mark, but not equal. + + 2.5 Nonspacing marks mixed + 30D, 591-5C2, Mn:981-A3C, A4D, A70, A71, ABC, ABD ... + + 3 (7) space separators and some kind of marks + + 3.1 whitespaces, paragraph separator etc. + + 3.2 other marks ('!', '^', ...) + + 4 (8) mathmatical symbols + + 5 (9) some other symbols + + 6 (A) punctuations + + 7 (C) numbers + + 8 (E) latin letters (alphabets) + + 9 (F) greek letters + + ... + + (21) georgian letters + + 13 (22) japanese kana letters and symbols + + 14 (23) bopomofo letters + + 15 (24) syriac letters + + 16 (41-45) surrogate Pt.1 + + 17 (52-7E) hangul + + 18 (9E-FE) CJK (kangxi etc.), PrivateUse mixed, surrogate Pt.2 + + 19 (FE) CJK extensions (3400-) + + 20 (FF) Some supplemental Japanese/Arabic marks + +** Traditional Spanish + + It has some combined characters as a unique character (like 'll'). + +** Czech + + Invariant culture also puts Czech unique character \u0161 between s + and t, unlike described here: + http://www.microsoft.com/globaldev/dis_v1/disv1.asp?DID=dis33d&File=S24C0.asp + +** Other locales + + There are some character reorderings. + diff --git a/mcs/class/corlib/Mono.Globalization.Unicode/Makefile b/mcs/class/corlib/Mono.Globalization.Unicode/Makefile index 90eb5ac73e5..ab7dd98b6e1 100644 --- a/mcs/class/corlib/Mono.Globalization.Unicode/Makefile +++ b/mcs/class/corlib/Mono.Globalization.Unicode/Makefile @@ -94,7 +94,7 @@ $(CB_CLASS_TABLE) : if ! test -d UCD/extracted; then mkdir UCD/extracted; fi wget http://www.unicode.org/Public/UNIDATA/extracted/DerivedCombiningClass.txt mv DerivedCombiningClass.txt UCD/extracted/ - touch UCD/DerivedCombiningClass.txt + touch UCD/extracted/DerivedCombiningClass.txt sample : diff --git a/mcs/class/corlib/Mono.Globalization.Unicode/create-char-mapping-source.cs b/mcs/class/corlib/Mono.Globalization.Unicode/create-char-mapping-source.cs index a3656d1856b..7916b3f752d 100644 --- a/mcs/class/corlib/Mono.Globalization.Unicode/create-char-mapping-source.cs +++ b/mcs/class/corlib/Mono.Globalization.Unicode/create-char-mapping-source.cs @@ -224,6 +224,8 @@ namespace Mono.Globalization.Unicode switch (combiningCategory) { case "narrow": case "wide": + case "super": + case "sub": widthSensitives.Add (cp); break; }