The word 'asciiValue' is a historic leftover; any integer code point
is allowed and actually used (i.e. characters are not limited to 8 bits).
The encoding is Unicode, of which ASCII is a subset: the first 128 code points
(0 to 127) have the same values as in ASCII.

Some heavily used Characters are kept as singletons; i.e. for every asciiValue (0..N),
there exists exactly one instance of Character, which is shared.
Character value:xxx checks for this, and returns a reference to an existing instance.
For code points up to 255, this is guaranteed; i.e. in all Smalltalks, the single byte
characters are always handled like this, and you can therefore safely compare them
using == (identity compare).

Other characters (i.e. codepoint > N) are not guaranteed to be shared;
i.e. these may or may not be created anew as required.
Do NOT depend on which characters are and which are not shared.
Always compare using #= if there is any chance of a non-ASCII character being involved.

Once again (because beginners sometimes make this mistake):
you may compare characters using #== ONLY IF you are certain
that all characters involved are in the range 0..255.
Otherwise, you HAVE TO compare using #= (if in doubt, always compare using #=).
Sorry for this inconvenience, but it is (practically) impossible to keep
the possible maximum of 2^32 characters (Unicode) around, for that convenience alone.

In ST/X, N is (currently) 1024. This means that all the Latin characters and some others are
kept as singletons in the CharacterTable class variable (which is also used by the VM when characters
are instantiated).

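The sharing rule can be illustrated outside of Smalltalk with a minimal C sketch. It is not the ST/X implementation: the names CharacterObj, characterTable, new_character, character_value, same_object, same_value and check_sharing are hypothetical, and SHARED_LIMIT merely mirrors the value of N given above. Code points up to the limit are served from a table, so pointer identity (the analogue of ==) holds; anything above is allocated anew on every call, so only a value compare (the analogue of =) is reliable.

```c
#include <assert.h>
#include <stdlib.h>

#define SHARED_LIMIT 1024u  /* assumption: mirrors N as stated for ST/X */

/* hypothetical stand-in for a Character object */
typedef struct {
    unsigned int codePoint;
} CharacterObj;

/* shared instances for code points 0..SHARED_LIMIT, filled lazily */
static CharacterObj *characterTable[SHARED_LIMIT + 1];

static CharacterObj *new_character(unsigned int codePoint)
{
    CharacterObj *c = malloc(sizeof *c);

    if (c == NULL) abort();
    c->codePoint = codePoint;
    return c;
}

/* rough analogue of Character value:xxx */
CharacterObj *character_value(unsigned int codePoint)
{
    if (codePoint <= SHARED_LIMIT) {
        if (characterTable[codePoint] == NULL) {
            characterTable[codePoint] = new_character(codePoint);
        }
        return characterTable[codePoint];   /* shared singleton */
    }
    return new_character(codePoint);        /* fresh object on every call */
}

/* analogue of == (identity compare) */
int same_object(const CharacterObj *a, const CharacterObj *b)
{
    return a == b;
}

/* analogue of = (value compare) */
int same_value(const CharacterObj *a, const CharacterObj *b)
{
    return a->codePoint == b->codePoint;
}

/* usage: identity holds only inside the shared range */
void check_sharing(void)
{
    CharacterObj *w1 = character_value(0x4E2D);
    CharacterObj *w2 = character_value(0x4E2D);

    assert(same_object(character_value('A'), character_value('A')));
    assert(!same_object(w1, w2));
    assert(same_value(w1, w2));
    free(w1);
    free(w2);
}
```

In this sketch the two objects for U+4E2D are distinct but equal in value, which is the situation the advice about #= guards against.
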
Interval elements (i.e. ($a to:$z) do:[...] );
They are not a big deal, but convenient add-ons.
Some of these have been modified a bit.

WARNING: characters are known by compiler and runtime system -
do not change the instance layout.

Also, although you can create subclasses of Character, the compiler always
creates instances of Character for literals, and other classes are hard-wired
to always return instances of Character in some cases (e.g. String>>at:, Symbol>>at: etc.).
Therefore, it may not make sense to create a subclass of Character.

Case Mapping in Unicode:
There are a number of complications to case mappings that occur once the repertoire
of characters is expanded beyond ASCII.

* Because of the inclusion of certain composite characters for compatibility,
  such as U+01F1 'DZ' capital dz, there is a third case, called titlecase,
  which is used where the first letter of a word is to be capitalized
  (e.g. Titlecase, vs. UPPERCASE, or lowercase).
  For example, the title case of the example character is U+01F2 'Dz' capital d with small z.

* Case mappings may produce strings of different length than the original.
  For example, the German character U+00DF small letter sharp s expands when uppercased to
  the sequence of two characters 'SS'.
  This also occurs where there is no precomposed character corresponding to a case mapping.
  *** This is not yet implemented (in 5.2) ***

* Characters may also have different case mappings, depending on the context.
  For example, U+03A3 capital sigma lowercases to U+03C3 small sigma if it is followed
  by another letter, but lowercases to U+03C2 small final sigma if it is not.
  *** This is not yet implemented (in 5.2) ***

* Characters may have case mappings that depend on the locale.
  For example, in Turkish the letter U+0049 'I' capital letter i lowercases to U+0131 small dotless i.
  *** This is not yet implemented (in 5.2) ***

* Case mappings are not, in general, reversible.
  For example, once the string 'McGowan' has been uppercased, lowercased or titlecased,
  the original cannot be recovered by applying another uppercase, lowercase, or titlecase operation.

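Two of the complications listed above can be made concrete with a small C sketch. The functions upper_expand, lower_simple and check_case_mapping are hypothetical helpers written for this illustration only; they are not part of ST/X and cover just ASCII letters plus the two special cases named in the text (U+00DF and the Turkish dotless i).

```c
#include <assert.h>
#include <stddef.h>

/* Uppercase one code point into out[]; returns how many code points
   were written. Only ASCII letters and U+00DF are handled here. */
size_t upper_expand(unsigned int cp, unsigned int out[2])
{
    if (cp == 0x00DF) {                 /* sharp s expands to two letters */
        out[0] = 'S';
        out[1] = 'S';
        return 2;
    }
    out[0] = (cp >= 'a' && cp <= 'z') ? cp - ('a' - 'A') : cp;
    return 1;
}

/* Lowercase one code point; turkish != 0 selects the Turkish mapping
   of capital I. Only ASCII letters are handled here. */
unsigned int lower_simple(unsigned int cp, int turkish)
{
    if (cp == 'I' && turkish) return 0x0131;    /* small dotless i */
    return (cp >= 'A' && cp <= 'Z') ? cp + ('a' - 'A') : cp;
}

/* usage: length change, non-reversibility and locale dependence */
void check_case_mapping(void)
{
    unsigned int up[2];

    assert(upper_expand(0x00DF, up) == 2);
    assert(up[0] == 'S' && up[1] == 'S');
    assert(lower_simple(up[0], 0) == 's');      /* ss, not U+00DF again */
    assert(lower_simple('I', 0) == 'i');
    assert(lower_simple('I', 1) == 0x0131);
}
```

Uppercasing U+00DF yields two code points, and lowercasing those gives 'ss' rather than the original character, which is one instance of the non-reversibility noted above.
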
Collation Sequence:
*** This is not yet implemented (in 5.2) ***

[author:]
    Claus Gittinger

[see also:]
    String TwoByteString Unicode16String Unicode32String
    StringCollection Text
"
! !

!Character class methodsFor:'instance creation'!


asNonDiacritical
    "return a new character which represents the receiver without diacritics.
     This is used with string search and when lists are to be ordered/sorted by base character order.
     CAVEAT:
        for now, this method is only correct for Unicode characters up to U+02FF,
        i.e. Latin languages"

%{  /* NOCONTEXT */

    REGISTER INT val;

    /* because used so often, this is open coded, instead of table driven */
    val = __intVal(__INST(asciivalue));
    switch (val >> 8) {
        case 0x00:
            if (val < 0xC0) { RETURN (self); }
            if (val <= 0xC6) { val = 'A'; break; }
            if (val == 0xC7) { val = 'C'; break; }
            if (val <= 0xCB) { val = 'E'; break; }
            if (val <= 0xCF) { val = 'I'; break; }
            if (val == 0xD0) { val = 'D'; break; }
            if (val == 0xD1) { val = 'N'; break; }
            if (val <= 0xD6) { val = 'O'; break; }
            if (val == 0xD7) { RETURN (self); }
            if (val == 0xD8) { val = 'O'; break; }
            if (val <= 0xDC) { val = 'U'; break; }
            if (val == 0xDD) { val = 'Y'; break; }

            if (val < 0xE0) { RETURN (self); }
            if (val <= 0xE6) { val = 'a'; break; }
            if (val == 0xE7) { val = 'c'; break; }
            if (val <= 0xEB) { val = 'e'; break; }
            if (val <= 0xEF) { val = 'i'; break; }
            if (val == 0xF0) { val = 'd'; break; }
            if (val == 0xF1) { val = 'n'; break; }
            if (val <= 0xF6) { val = 'o'; break; }
            if (val == 0xF7) { RETURN (self); }
            if (val == 0xF8) { val = 'o'; break; }
            if (val <= 0xFC) { val = 'u'; break; }
            if (val == 0xFD) { val = 'y'; break; }
            if (val == 0xFF) { val = 'y'; break; }
            RETURN (self);

        case 0x01:
            /* in most ranges even code points are uppercase and odd ones lowercase;
               the L, N (up to 0x148) and Z ranges are the other way round */
            if (val <= 0x105) { val = (val & 1) ? 'a' : 'A'; break; }
            if (val <= 0x10D) { val = (val & 1) ? 'c' : 'C'; break; }
            if (val <= 0x111) { val = (val & 1) ? 'd' : 'D'; break; }
            if (val <= 0x11B) { val = (val & 1) ? 'e' : 'E'; break; }
            if (val <= 0x123) { val = (val & 1) ? 'g' : 'G'; break; }
            if (val <= 0x127) { val = (val & 1) ? 'h' : 'H'; break; }
            if (val <= 0x133) { val = (val & 1) ? 'i' : 'I'; break; }
            /* U+0134/U+0135 are J with circumflex, not K */
            if (val <= 0x135) { val = (val & 1) ? 'j' : 'J'; break; }
            if (val <= 0x137) { val = (val & 1) ? 'k' : 'K'; break; }
            /* U+0138 (kra) is a lowercase letter */
            if (val == 0x138) { val = 'k'; break; }
            if (val <= 0x142) { val = (val & 1) ? 'L' : 'l'; break; }
            if (val <= 0x148) { val = (val & 1) ? 'N' : 'n'; break; }
            if (val <= 0x14B) { val = (val & 1) ? 'n' : 'N'; break; }
            if (val <= 0x153) { val = (val & 1) ? 'o' : 'O'; break; }
            if (val <= 0x159) { val = (val & 1) ? 'r' : 'R'; break; }
            if (val <= 0x161) { val = (val & 1) ? 's' : 'S'; break; }
            if (val <= 0x167) { val = (val & 1) ? 't' : 'T'; break; }
            if (val <= 0x173) { val = (val & 1) ? 'u' : 'U'; break; }
            if (val <= 0x175) { val = (val & 1) ? 'w' : 'W'; break; }
            if (val <= 0x178) { val = (val & 1) ? 'y' : 'Y'; break; }
            if (val <= 0x17E) { val = (val & 1) ? 'Z' : 'z'; break; }
            RETURN (self);

        case 0x02:
            if (val <= 0x203) { val = (val & 1) ? 'a' : 'A'; break; }
            if (val <= 0x207) { val = (val & 1) ? 'e' : 'E'; break; }
            if (val <= 0x20B) { val = (val & 1) ? 'i' : 'I'; break; }
            if (val <= 0x20F) { val = (val & 1) ? 'o' : 'O'; break; }
            if (val <= 0x213) { val = (val & 1) ? 'r' : 'R'; break; }
            if (val <= 0x217) { val = (val & 1) ? 'u' : 'U'; break; }
            if (val <= 0x219) { val = (val & 1) ? 's' : 'S'; break; }
            if (val <= 0x21B) { val = (val & 1) ? 't' : 'T'; break; }
            RETURN (self);

        case 0x03:
            // to be done
            RETURN (self);

        case 0x04:
            // to be done
            RETURN (self);
    }
    if (val <= MAX_IMMEDIATE_CHARACTER) {
        RETURN (__MKCHARACTER(val));
    }
    RETURN (__MKUCHARACTER(val));
%}
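
For comparison with the open-coded switch above, a table-driven variant of the Latin-1 part (U+00C0 to U+00FF) might look like the following C sketch. The names latin1Base, as_non_diacritical_latin1 and check_latin1 are made up for this illustration and do not exist in ST/X; the table entries are transcribed from the case 0x00 branch, with 0 standing for: return the receiver unchanged.

```c
#include <assert.h>

/* Base letters for U+00C0..U+00FF; 0 means: leave the character unchanged. */
static const char latin1Base[64] = {
    /* C0 */ 'A','A','A','A','A','A','A','C','E','E','E','E','I','I','I','I',
    /* D0 */ 'D','N','O','O','O','O','O', 0 ,'O','U','U','U','U','Y', 0 , 0 ,
    /* E0 */ 'a','a','a','a','a','a','a','c','e','e','e','e','i','i','i','i',
    /* F0 */ 'd','n','o','o','o','o','o', 0 ,'o','u','u','u','u','y', 0 ,'y'
};

/* Returns the base letter for a Latin-1 code point with a diacritic,
   or the code point itself if the table has no mapping for it. */
unsigned int as_non_diacritical_latin1(unsigned int val)
{
    if (val >= 0xC0 && val <= 0xFF && latin1Base[val - 0xC0] != 0) {
        return (unsigned int) latin1Base[val - 0xC0];
    }
    return val;
}

/* usage: a few spot checks against the switch */
void check_latin1(void)
{
    assert(as_non_diacritical_latin1(0xE9) == 'e');     /* e with acute */
    assert(as_non_diacritical_latin1(0xD7) == 0xD7);    /* multiplication sign */
    assert(as_non_diacritical_latin1(0xFF) == 'y');     /* y with diaeresis */
}
```

The table trades the chain of comparisons for one bounds check and one lookup; whether that is faster than the open-coded form for the common case (val below 0xC0) is not something this sketch measures.
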

    "