hg/stx-libbasic: comparison Character.st


-:d8a2b0f3efff
+:d222015cc39c
 The word 'asciiValue' is a historic leftover - actually, any integer
 code is allowed and actually used (i.e. characters are not limited to 8bit).
 Also, the encoding is actually Unicode, of which ASCII is a subset: the first 128
 code points (0 to 127) have the same values in both.
 Some heavily used Characters are kept as singletons; i.e. for every asciiValue (0..N),
 there exists exactly one instance of Character, which is shared.
 Character value:xxx checks for this, and returns a reference to an existing instance.
 For N<=255, this is guaranteed; i.e. in all Smalltalks, the single byte characters are always
 handled like this, and you can therefore safely compare them using == (identity compare).
 Other characters (i.e. codepoint > N) are not guaranteed to be shared;
 i.e. these may or may not be created as required.
 Actually, do NOT depend on which characters are and which are not shared.
 Always compare using #= if there is any chance of a non-ascii character being involved.
 Once again (because beginners sometimes make this mistake):
-This means: you may compare characters using #== ONLY IFF you are certain,
+	This means: you may compare characters using #== ONLY IFF you are certain,
-that the characters ranges is 0..255.
+	that the characters ranges is 0..255.
-Otherwise, you HAVE TO compare using #=. (if in doubt, always compare using #=).
+	Otherwise, you HAVE TO compare using #=. (if in doubt, always compare using #=).
-Sorry for this inconvenience, but it is (practically) impossible to keep
+	Sorry for this inconvenience, but it is (practically) impossible to keep
-the possible maximum of 2^32 characters (Unicode) around, for that convenience alone.
+	the possible maximum of 2^32 characters (Unicode) around, for that convenience alone.
 In ST/X, N is (currently) 1024. This means that all the latin characters and some others are
 kept as singleton in the CharacterTable class variable (which is also used by the VM when characters
 are instantiated).
 Interval elements (i.e. ($a to:$z) do:[...] );
 They are not a big deal, but convenient add-ons.
 Some of these have been modified a bit.
 WARNING: characters are known by compiler and runtime system -
-do not change the instance layout.
+	     do not change the instance layout.
 Also, although you can create subclasses of Character, the compiler always
 creates instances of Character for literals ...
 ... and other classes are hard-wired to always return instances of characters
 in some cases (i.e. String>>at:, Symbol>>at: etc.).
 Therefore, it may not make sense to create a character-subclass.
 Case Mapping in Unicode:
-There are a number of complications to case mappings that occur once the repertoire
+	There are a number of complications to case mappings that occur once the repertoire
-of characters is expanded beyond ASCII.
+	of characters is expanded beyond ASCII.
-* Because of the inclusion of certain composite characters for compatibility,
+	* Because of the inclusion of certain composite characters for compatibility,
-such as U+01F1 'DZ' capital dz, there is a third case, called titlecase,
+	  such as U+01F1 'DZ' capital dz, there is a third case, called titlecase,
-which is used where the first letter of a word is to be capitalized
+	  which is used where the first letter of a word is to be capitalized
-(e.g. Titlecase, vs. UPPERCASE, or lowercase).
+	  (e.g. Titlecase, vs. UPPERCASE, or lowercase).
-For example, the title case of the example character is U+01F2 'Dz' capital d with small z.
+	  For example, the title case of the example character is U+01F2 'Dz' capital d with small z.
-* Case mappings may produce strings of different length than the original.
+	* Case mappings may produce strings of different length than the original.
-For example, the German character U+00DF small letter sharp s expands when uppercased to
+	  For example, the German character U+00DF small letter sharp s expands when uppercased to
-the sequence of two characters 'SS'.
+	  the sequence of two characters 'SS'.
-This also occurs where there is no precomposed character corresponding to a case mapping.
+	  This also occurs where there is no precomposed character corresponding to a case mapping.
-*** This is not yet implemented (in 5.2) ***
+	  *** This is not yet implemented (in 5.2) ***
-* Characters may also have different case mappings, depending on the context.
+	* Characters may also have different case mappings, depending on the context.
-For example, U+03A3 capital sigma lowercases to U+03C3 small sigma if it is not followed
+	  For example, U+03A3 capital sigma lowercases to U+03C3 small sigma if it is not followed
-by another letter, but lowercases to 03C2 small final sigma if it is.
+	  by another letter, but lowercases to 03C2 small final sigma if it is.
-*** This is not yet implemented (in 5.2) ***
+	  *** This is not yet implemented (in 5.2) ***
-* Characters may have case mappings that depend on the locale.
+	* Characters may have case mappings that depend on the locale.
-For example, in Turkish the letter 0049 'I' capital letter i lowercases to 0131 small dotless i.
+	  For example, in Turkish the letter 0049 'I' capital letter i lowercases to 0131 small dotless i.
-*** This is not yet implemented (in 5.2) ***
+	  *** This is not yet implemented (in 5.2) ***
-* Case mappings are not, in general, reversible.
+	* Case mappings are not, in general, reversible.
-For example, once the string 'McGowan' has been uppercased, lowercased or titlecased,
+	  For example, once the string 'McGowan' has been uppercased, lowercased or titlecased,
-the original cannot be recovered by applying another uppercase, lowercase, or titlecase operation.
+	  the original cannot be recovered by applying another uppercase, lowercase, or titlecase operation.
 Collation Sequence:
-*** This is not yet implemented (in 5.2) ***
+	*** This is not yet implemented (in 5.2) ***
 [author:]
-Claus Gittinger
+	Claus Gittinger
 [see also:]
-String TwoByteString Unicode16String Unicode32String
+	String TwoByteString Unicode16String Unicode32String
-StringCollection Text
+	StringCollection Text
 "
 ! !
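The sharing rule described in the class comment can be modelled outside Smalltalk. The following is a minimal C sketch, not the ST/X implementation: `Char`, `shared_table`, `char_for`, `same_identity` and `same_value` are hypothetical names, and `SHARED_LIMIT` assumes the N = 1024 that the comment gives for ST/X. Pointer comparison plays the role of #==, and comparing the stored code points plays the role of #=.

```c
#include <stdlib.h>

/* Assumed N: the class comment states that ST/X currently shares 0..1023. */
#define SHARED_LIMIT 1024

/* Hypothetical stand-in for a Character instance. */
typedef struct { unsigned int codePoint; } Char;

/* Hypothetical stand-in for the CharacterTable class variable. */
static Char shared_table[SHARED_LIMIT];

/* Rough analogue of "Character codePoint:": small code points come from
 * the shared table, larger ones are allocated on every call.
 * Returns NULL if allocation fails; the caller frees non-shared results. */
static Char *char_for(unsigned int cp) {
    if (cp < SHARED_LIMIT) {
        shared_table[cp].codePoint = cp;   /* idempotent slot initialisation */
        return &shared_table[cp];
    }
    Char *c = malloc(sizeof *c);
    if (c != NULL) {
        c->codePoint = cp;
    }
    return c;
}

/* Analogue of #== : are both references the very same object? */
static int same_identity(const Char *a, const Char *b) {
    return a == b;
}

/* Analogue of #= : do both objects carry the same code point? */
static int same_value(const Char *a, const Char *b) {
    return a->codePoint == b->codePoint;
}
```

In this model, two calls of `char_for(65)` return the same pointer, whereas two calls of `char_for(0x20AC)` return distinct objects that still compare equal by value. That is the situation in which only #= gives a dependable answer.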
 !Character class methodsFor:'instance creation'!
 codePoint:anInteger
 "return a character with codePoint anInteger"
 %{  /* NOCONTEXT */
+#ifdef __JAVA__
+{
+	char ch = (char)(context.stArg(0).intValue("[codePoint:]"));
+	return context._RETURN(STCharacter._new(ch));
+}
+/* NOTREACHED */
+#else
 INT __codePoint;
 if (__isSmallInteger(anInteger)) {
 	__codePoint = __smallIntegerVal(anInteger);
 	if ((unsigned INT)(__codePoint) <= MAX_IMMEDIATE_CHARACTER /* (__codePoint >= 0) && (__codePoint <= 255) */) {
 	    RETURN ( __MKCHARACTER(__codePoint) );
 	} else {
 	    RETURN ( __MKUCHARACTER(__codePoint) );
 	}
 }
+#endif
 %}.
 (anInteger between:0 and:(CharacterTable size - 1)) ifTrue:[
 	^ CharacterTable at:(anInteger + 1)
 ].
 (anInteger between:16r100 and:16r3FFFFFFF) ifTrue:[
 separators
 "return a collection of separator chars.
 Added for squeak compatibility"
 Separators isNil ifTrue:[
-Separators := Array
+	Separators := Array
-with:Character space
+	    with:Character space
-with:Character return
+	    with:Character return
-"/ with:Character cr
+	    "/ with:Character cr
-with:Character tab
+	    with:Character tab
-with:Character lf
+	    with:Character lf
-with:Character ff
+	    with:Character ff
 ].
 ^ Separators
 "
 Character separators
 self == aCharacter ifTrue:[^ true].
 aCharacter isCharacter ifFalse:[^ false].
 ^ asciivalue = aCharacter codePoint
 "
-$A = (Character value:65)
+	$A = (Character value:65)
-$A = (Character codePoint:65)
+	$A = (Character codePoint:65)
-$A = ($B-1)
+	$A = ($B-1)
-$A = 65
+	$A = 65
 "
 !
 > aMagnitude
 "return true, if the arguments asciiValue is less than the receiver's"
 CAVEAT:
 	for now, this method is only correct for unicode characters up to u+1d6ff (Unicode3.1).
 	(which is more than mozilla does, btw. ;-)"
 %{
+#ifdef __JAVA__
+{
+	char ch = self.charValue("[asLowercase]");
+	ch = java.lang.Character.toLowerCase(ch);
+	return context._RETURN(STCharacter._new(ch));
+}
+/* NOTREACHED */
+#else
 static int __mapping[] = {
 /* From    To             Every   Diff   */
 0x0041, ((0x19 << 8) | 0x01), 0x0020  ,
 0x00c0, ((0x16 << 8) | 0x01), 0x0020  ,
 0x00d8, ((0x06 << 8) | 0x01), 0x0020  ,
 	    }
 	}
 }
 RETURN (self);
 allocationError: ;
+#endif /* ! __JAVA__ */
 %}.
 ^ ObjectMemory allocationFailureSignal raise.
 "
 $A asLowercase
 CAVEAT:
 	for now, this method is only correct for unicode characters up to u+1d6ff (Unicode3.1).
 	(which is more than mozilla does, btw. ;-)"
 %{
+#ifdef __JAVA__
+{
+	char ch = self.charValue("[asUppercase]");
+	ch = java.lang.Character.toUpperCase(ch);
+	return context._RETURN(STCharacter._new(ch));
+}
+/* NOTREACHED */
+#else
 static int __mapping[] = {
 /* From    To             Every   Diff   */
 0x0061, ((0x19 << 8) | 0x01), -32  ,
 0x00b5, ((0x00 << 8) | 0x3b), 0x02e7  ,
 0x00e0, ((0x16 << 8) | 0x01), -32   ,
 	    }
 	}
 }
 RETURN (self);
 allocationError: ;
+#endif /* ! __JAVA__ */
 %}.
 ^ ObjectMemory allocationFailureSignal raise.
 "
 $A asLowercase
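The code that walks the `__mapping` tables is elided in this comparison, so the following C sketch is only one plausible reading of the table layout. It assumes each entry is three ints: the first code point of a range, a packed value `(span << 8) | step` where `span` is the distance to the last code point of the range, and a difference that is added to the code point. The names `lower_mapping`, `map_with_table` and `sketch_to_lower` are hypothetical; the three table rows are the lowercase entries quoted above.

```c
#include <stddef.h>

/* From, (span << 8) | step, diff: the three lowercase entries shown in the diff. */
static const int lower_mapping[] = {
    0x0041, ((0x19 << 8) | 0x01), 0x0020,
    0x00c0, ((0x16 << 8) | 0x01), 0x0020,
    0x00d8, ((0x06 << 8) | 0x01), 0x0020,
};

/* Assumed decoding: a code point matches an entry if it lies in
 * [from, from + span] and sits on a multiple of step from the start.
 * Unmatched code points are returned unchanged. */
static unsigned int map_with_table(unsigned int cp, const int *table, size_t nInts) {
    for (size_t i = 0; i + 2 < nInts; i += 3) {
        unsigned int from = (unsigned int)table[i];
        unsigned int span = ((unsigned int)table[i + 1]) >> 8;
        unsigned int step = ((unsigned int)table[i + 1]) & 0xFF;

        if (cp < from || cp > from + span) continue;
        if (step == 0 || (cp - from) % step != 0) continue;
        return (unsigned int)((int)cp + table[i + 2]);
    }
    return cp;
}

/* Convenience wrapper over the three quoted rows only. */
static unsigned int sketch_to_lower(unsigned int cp) {
    return map_with_table(cp, lower_mapping,
                          sizeof lower_mapping / sizeof lower_mapping[0]);
}
```

Under these assumptions `sketch_to_lower(0x41)` yields 0x61 and `sketch_to_lower(0xC0)` yields 0xE0, while 0xD7 (the multiplication sign) falls between two ranges and is returned unchanged.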
 unsigned INT val;
 // fast code for common cases
 val = __intVal(__characterVal(self));
 if (val <= 0xFF) {
-if (__isCharacter(aStringOrCharacter)) {
+	if (__isCharacter(aStringOrCharacter)) {
-unsigned INT val2 = __intVal(__characterVal(aStringOrCharacter));
+	    unsigned INT val2 = __intVal(__characterVal(aStringOrCharacter));
-if (val2 <= 0xFF) {
+	    if (val2 <= 0xFF) {
-char buffer[2];
+		char buffer[2];
-buffer[0] = val;
+		buffer[0] = val;
-buffer[1] = val2;
+		buffer[1] = val2;
-s = __MKSTRING_L(buffer, 2);
+		s = __MKSTRING_L(buffer, 2);
-if (s != nil) {
+		if (s != nil) {
-RETURN (s);
+		    RETURN (s);
-}
+		}
-}
+	    }
-} else {
+	} else {
-if (__isString(aStringOrCharacter)) {
+	    if (__isString(aStringOrCharacter)) {
-int strSize = __stringSize(aStringOrCharacter);
+		int strSize = __stringSize(aStringOrCharacter);
-s = __MKEMPTYSTRING(strSize+1);
+		s = __MKEMPTYSTRING(strSize+1);
-if (s != nil) {
+		if (s != nil) {
-__StringInstPtr(s)->s_element[0] = val;
+		    __StringInstPtr(s)->s_element[0] = val;
-memcpy(__StringInstPtr(s)->s_element+1, __stringVal(aStringOrCharacter), strSize+1); // copies 0-byte too
+		    memcpy(__StringInstPtr(s)->s_element+1, __stringVal(aStringOrCharacter), strSize+1); // copies 0-byte too
-RETURN (s);
+		    RETURN (s);
-}
+		}
-}
+	    }
-}
+	}
 }
 %}.
 ^ self asString , aStringOrCharacter
 "
 (although the fallBack is to display its printString ...)"
 "/ what a kludge - Dolphin and Squeak mean: printOn: a stream;
 "/ ST/X (and some old ST80's) mean: draw-yourself on a GC.
 (aGCOrStream isStream) ifFalse:[
-^ super displayOn:aGCOrStream
+	^ super displayOn:aGCOrStream
 ].
 self storeOn:aGCOrStream.
 aGCOrStream nextPutAll:' "16r'.
 asciivalue printOn:aGCOrStream base:16.
 asNonDiacritical
 "return a new character which represents the receiver without diacritics.
 This is used with string search and when lists are to be ordered/sorted by base character order.
 CAVEAT:
-for now, this method is only correct for unicode characters up to u+2FF,
+	for now, this method is only correct for unicode characters up to u+2FF,
-i.e. latin languages"
+	i.e. latin languages"
 %{  /* NOCONTEXT */
 REGISTER INT val;
 /* because used so often, this is open coded, instead of table driven */
 val = __intVal(__INST(asciivalue));
 switch (val >> 8) {
-case 0x00:
+	case 0x00:
-if (val < 0xC0) { RETURN(self); }
+	    if (val < 0xC0) { RETURN(self); }
-if (val <= 0xC6) { val = 'A'; break; }
+	    if (val <= 0xC6) { val = 'A'; break; }
-if (val == 0xC7) { val = 'C'; break; }
+	    if (val == 0xC7) { val = 'C'; break; }
-if (val <= 0xCB) { val = 'E'; break; }
+	    if (val <= 0xCB) { val = 'E'; break; }
-if (val <= 0xCF) { val = 'I'; break; }
+	    if (val <= 0xCF) { val = 'I'; break; }
-if (val == 0xD0) { val = 'D'; break; }
+	    if (val == 0xD0) { val = 'D'; break; }
-if (val == 0xD1) { val = 'N'; break; }
+	    if (val == 0xD1) { val = 'N'; break; }
-if (val <= 0xD6) { val = 'O'; break; }
+	    if (val <= 0xD6) { val = 'O'; break; }
-if (val == 0xD7) { RETURN(self) }
+	    if (val == 0xD7) { RETURN(self) }
-if (val == 0xD8) { val = 'O'; break; }
+	    if (val == 0xD8) { val = 'O'; break; }
-if (val <= 0xDC) { val = 'U'; break; }
+	    if (val <= 0xDC) { val = 'U'; break; }
-if (val == 0xDD) { val = 'Y'; break; }
+	    if (val == 0xDD) { val = 'Y'; break; }
-if (val < 0xE0) { RETURN(self) }
+	    if (val < 0xE0) { RETURN(self) }
-if (val <= 0xE6) { val = 'a'; break; }
+	    if (val <= 0xE6) { val = 'a'; break; }
-if (val == 0xE7) { val = 'c'; break; }
+	    if (val == 0xE7) { val = 'c'; break; }
-if (val <= 0xEB) { val = 'e'; break; }
+	    if (val <= 0xEB) { val = 'e'; break; }
-if (val <= 0xEF) { val = 'i'; break; }
+	    if (val <= 0xEF) { val = 'i'; break; }
-if (val == 0xF0) { val = 'd'; break; }
+	    if (val == 0xF0) { val = 'd'; break; }
-if (val == 0xF1) { val = 'n'; break; }
+	    if (val == 0xF1) { val = 'n'; break; }
-if (val <= 0xF6) { val = 'o'; break; }
+	    if (val <= 0xF6) { val = 'o'; break; }
-if (val == 0xF7) { RETURN(self) }
+	    if (val == 0xF7) { RETURN(self) }
-if (val == 0xF8) { val = 'o'; break; }
+	    if (val == 0xF8) { val = 'o'; break; }
-if (val <= 0xFC) { val = 'u'; break; }
+	    if (val <= 0xFC) { val = 'u'; break; }
-if (val == 0xFD) { val = 'y'; break; }
+	    if (val == 0xFD) { val = 'y'; break; }
-if (val == 0xFF) { val = 'y'; break; }
+	    if (val == 0xFF) { val = 'y'; break; }
-RETURN (self);
+	    RETURN (self);
-case 0x01:
+	case 0x01:
-if (val <= 0x105) { val = (val & 1) ? 'a' : 'A'; break; }
+	    if (val <= 0x105) { val = (val & 1) ? 'a' : 'A'; break; }
-if (val <= 0x10D) { val = (val & 1) ? 'c' : 'C'; break; }
+	    if (val <= 0x10D) { val = (val & 1) ? 'c' : 'C'; break; }
-if (val <= 0x111) { val = (val & 1) ? 'd' : 'D'; break; }
+	    if (val <= 0x111) { val = (val & 1) ? 'd' : 'D'; break; }
-if (val <= 0x11B) { val = (val & 1) ? 'e' : 'E'; break; }
+	    if (val <= 0x11B) { val = (val & 1) ? 'e' : 'E'; break; }
-if (val <= 0x123) { val = (val & 1) ? 'g' : 'G'; break; }
+	    if (val <= 0x123) { val = (val & 1) ? 'g' : 'G'; break; }
-if (val <= 0x127) { val = (val & 1) ? 'h' : 'H'; break; }
+	    if (val <= 0x127) { val = (val & 1) ? 'h' : 'H'; break; }
-if (val <= 0x133) { val = (val & 1) ? 'i' : 'I'; break; }
+	    if (val <= 0x133) { val = (val & 1) ? 'i' : 'I'; break; }
-if (val <= 0x137) { val = (val & 1) ? 'k' : 'K'; break; }
+	    if (val <= 0x137) { val = (val & 1) ? 'k' : 'K'; break; }
-if (val == 0x138) { val = 'K'; break; }
+	    if (val == 0x138) { val = 'K'; break; }
-if (val <= 0x142) { val = (val & 1) ? 'L' : 'l'; break; }
+	    if (val <= 0x142) { val = (val & 1) ? 'L' : 'l'; break; }
-if (val <= 0x148) { val = (val & 1) ? 'N' : 'n'; break; }
+	    if (val <= 0x148) { val = (val & 1) ? 'N' : 'n'; break; }
-if (val <= 0x14B) { val = (val & 1) ? 'n' : 'N'; break; }
+	    if (val <= 0x14B) { val = (val & 1) ? 'n' : 'N'; break; }
-if (val <= 0x153) { val = (val & 1) ? 'o' : 'O'; break; }
+	    if (val <= 0x153) { val = (val & 1) ? 'o' : 'O'; break; }
-if (val <= 0x159) { val = (val & 1) ? 'r' : 'R'; break; }
+	    if (val <= 0x159) { val = (val & 1) ? 'r' : 'R'; break; }
-if (val <= 0x161) { val = (val & 1) ? 's' : 'S'; break; }
+	    if (val <= 0x161) { val = (val & 1) ? 's' : 'S'; break; }
-if (val <= 0x167) { val = (val & 1) ? 't' : 'T'; break; }
+	    if (val <= 0x167) { val = (val & 1) ? 't' : 'T'; break; }
-if (val <= 0x173) { val = (val & 1) ? 'u' : 'U'; break; }
+	    if (val <= 0x173) { val = (val & 1) ? 'u' : 'U'; break; }
-if (val <= 0x175) { val = (val & 1) ? 'w' : 'W'; break; }
+	    if (val <= 0x175) { val = (val & 1) ? 'w' : 'W'; break; }
-if (val <= 0x178) { val = (val & 1) ? 'y' : 'Y'; break; }
+	    if (val <= 0x178) { val = (val & 1) ? 'y' : 'Y'; break; }
-if (val <= 0x17E) { val = (val & 1) ? 'Z' : 'z'; break; }
+	    if (val <= 0x17E) { val = (val & 1) ? 'Z' : 'z'; break; }
-RETURN (self);
+	    RETURN (self);
-case 0x02:
+	case 0x02:
-if (val <= 0x203) { val = (val & 1) ? 'a' : 'A'; break; }
+	    if (val <= 0x203) { val = (val & 1) ? 'a' : 'A'; break; }
-if (val <= 0x207) { val = (val & 1) ? 'e' : 'E'; break; }
+	    if (val <= 0x207) { val = (val & 1) ? 'e' : 'E'; break; }
-if (val <= 0x20B) { val = (val & 1) ? 'i' : 'I'; break; }
+	    if (val <= 0x20B) { val = (val & 1) ? 'i' : 'I'; break; }
-if (val <= 0x20F) { val = (val & 1) ? 'o' : 'O'; break; }
+	    if (val <= 0x20F) { val = (val & 1) ? 'o' : 'O'; break; }
-if (val <= 0x213) { val = (val & 1) ? 'r' : 'R'; break; }
+	    if (val <= 0x213) { val = (val & 1) ? 'r' : 'R'; break; }
-if (val <= 0x217) { val = (val & 1) ? 'u' : 'U'; break; }
+	    if (val <= 0x217) { val = (val & 1) ? 'u' : 'U'; break; }
-if (val <= 0x219) { val = (val & 1) ? 's' : 'S'; break; }
+	    if (val <= 0x219) { val = (val & 1) ? 's' : 'S'; break; }
-if (val <= 0x21B) { val = (val & 1) ? 't' : 'T'; break; }
+	    if (val <= 0x21B) { val = (val & 1) ? 't' : 'T'; break; }
-RETURN (self);
+	    RETURN (self);
-case 0x03:
+	case 0x03:
-// to be done
+	    // to be done
-RETURN (self);
+	    RETURN (self);
-case 0x04:
+	case 0x04:
-// to be done
+	    // to be done
-RETURN (self);
+	    RETURN (self);
 }
 if (val <= MAX_IMMEDIATE_CHARACTER) {
-RETURN (__MKCHARACTER(val)) ;
+	RETURN (__MKCHARACTER(val)) ;
 }
 RETURN (__MKUCHARACTER(val)) ;
 %}
 "
 ! !
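The `(val & 1)` tests in the 0x01 case of asNonDiacritical rely on the way Latin Extended-A lays out its letters: capital and small forms alternate, and the parity that marks a capital flips between runs. A minimal C sketch of that idea for two runs only, with `strip_sketch` as a hypothetical name that is not part of ST/X:

```c
/* Maps two Latin Extended-A runs to their base letters; every other
 * code point is returned unchanged. */
static unsigned int strip_sketch(unsigned int cp) {
    /* U+0100..U+0105 (A with macron, breve, ogonek):
     * capitals sit on even code points, small letters on odd ones. */
    if (cp >= 0x100 && cp <= 0x105) {
        return (cp & 1) ? 'a' : 'A';
    }
    /* U+0139..U+0142 (L with acute, cedilla, caron, middle dot, stroke):
     * here the capitals sit on odd code points, so the test is reversed. */
    if (cp >= 0x139 && cp <= 0x142) {
        return (cp & 1) ? 'L' : 'l';
    }
    return cp;
}
```

Because the parity convention changes from run to run, each range needs its own check of which form comes first; a single global rule would map some capitals to small letters.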
 !Character class methodsFor:'documentation'!
 version
-^ '$Header: /cvs/stx/stx/libbasic/Character.st,v 1.159 2015-02-07 15:36:49 cg Exp $'
+^ '$Header: /cvs/stx/stx/libbasic/Character.st,v 1.160 2015-04-15 00:30:56 cg Exp $'
 !
 version_CVS
-^ '$Header: /cvs/stx/stx/libbasic/Character.st,v 1.159 2015-02-07 15:36:49 cg Exp $'
+^ '$Header: /cvs/stx/stx/libbasic/Character.st,v 1.160 2015-04-15 00:30:56 cg Exp $'
 ! !

branch	jv
changeset 18217	d222015cc39c
parent 18120	e3a375d5f6a8
parent 18215	5940d5eff81b
child 18261	22bdfc405bca