author | Stefan Vogel <sv@exept.de> |
Fri, 27 Oct 2017 16:14:37 +0200 | |
branch | expecco_2_11_1_branch |
changeset 22329 | 20662662693b |
parent 17568 | e90410336cc2 |
child 21478 | 2e63fbcbfa85 |
permissions | -rw-r--r-- |
"{ Encoding: utf8 }"

"
 COPYRIGHT (c) 2015 by eXept Software AG
              All Rights Reserved

 This software is furnished under a license and may be used
 only in accordance with the terms of that license and with the
 inclusion of the above copyright notice. This software may not
 be provided or otherwise made available to, or used by, any
 other person. No title to or ownership of the software is
 hereby transferred.
"
"{ Package: 'stx:libbasic' }"

"{ NameSpace: CharacterEncoderImplementations }"

ISO10646_to_UTF8 subclass:#ISO10646_to_UTF8_MAC
    instanceVariableNames:''
    classVariableNames:'AccentMap DecomposeMap ComposeMap'
    poolDictionaries:''
    category:'Collections-Text-Encodings'
!

!ISO10646_to_UTF8_MAC class methodsFor:'documentation'!

copyright
"
 COPYRIGHT (c) 2015 by eXept Software AG
              All Rights Reserved

 This software is furnished under a license and may be used
 only in accordance with the terms of that license and with the
 inclusion of the above copyright notice. This software may not
 be provided or otherwise made available to, or used by, any
 other person. No title to or ownership of the software is
 hereby transferred.
"
!

documentation
"
    UTF-8 can encode some diacritical characters (umlauts) in multiple ways:
        - either as a single precomposed Unicode character (e.g. ä -> U+00E4 -> C3 A4)
        - or in the so-called 'Normalization Form Canonical Decomposition' (NFD), i.e. as a regular 'a'
          followed by a combining diacritical mark (for example: the combining acute accent, U+0301).

    Mac OS X needs the second form for its file names.
    However, OS X does not decompose the ranges U+2000-U+2FFF, U+F900-U+FAFF and U+2F800-U+2FAFF.

    This is a quick-and-dirty hack, to at least support the first page (latin1) characters.
    Will be enhanced for the 2nd and 3rd unicode page, when I find time.

    [caveat:]
        only a small subset of multi-composes is supported so far (for example: trema plus acute)

    [author:]
        Claus Gittinger

    [instance variables:]

    [class variables:]
        ComposeMap DecomposeMap

    [see also:]
        http://developer.apple.com/library/mac/#qa/qa2001/qa1173.html

"
! !

!ISO10646_to_UTF8_MAC class methodsFor:'initialization'!

initializeDecomposeMap
    "the map which decomposes a diacritical character into its two components"

    DecomposeMap := Dictionary new.
    ComposeMap := Dictionary new.

    #(
        "/ attention: the following strings contain non-latin characters
        "/ if you don't see them, change your font setting for a better font

        (16r0300 "grave"        'AÀaàEÈeèIÌiìoòOÒUÙuùNǸnǹWẀwẁYỲyỳÜǛüǜ')
        (16r0301 "acute"        'AÁaáEÉeéIÍiíOÓoóUÚuúyýYÝCĆcćNŃnńRŔrŕSŚsśZŹzźGǴgǵÆǼæǽØǾøǿMḾmḿKḰkḱPṔpṕWẂwẃÜǗüǘ')
        (16r0302 "circumflex"   'AÂaâEÊeêIÎiîOÔoôUÛuûCĈcĉGĜgĝHĤhĥJĴjĵSŜsŝWŴwŵYŶyŷZẐzẑ')
        (16r0303 "tilde"        'AÃaãNÑnñOÕoõUŨuũYỸyỹEẼeẽVṼvṽ')
        (16r0304 "macron"       'AĀaāEĒeēIĪiīOŌoōUŪuūGḠgḡÜǕüǖ' )
        (16r0306 "breve"        'AĂaăEĔeĕGĞgğIĬiĭOŎoŏUŬuŭ')
        (16r0307 "dot above"    'AȦaȧOȮoȯCĊcċEĖeėGĠgġZŻzżBḂbḃDḊdḋFḞfḟHḢhḣMṀmṁNṄnṅPṖpṗRṘrṙSṠsṡTṪtṫWẆwẇXẊxẋYẎyẏ' )
        (16r0308 "umlaut/trema" 'AÄaäEËeëOÖoöUÜuüIÏiïyÿYŸHḦhḧXẌxẍtẗÙǛùǜŪǕūǖÚǗúǘǓǙǔǚ')
        (16r030A "ring"         'AÅaåUŮuůwẘyẙ')
        (16r030B "dbl acute"    'OŐoőUŰuű')
        (16r030C "hacek"        'CČcčDĎEĚeěNŇnňRŘrřSŠsšZŽzžAǍaǎIǏiǐOǑoǒUǓuǔGǦgǧKǨkǩÜǙüǚ')
        (16r030F "dbl grave"    'AȀaȁEȄeȅIȈiȉOȌoȍRȐrȑUȔuȕ')
        (16r0311 "inv. breve"   'AȂaȃEȆeȇIȊiȋOȎoȏRȒrȓUȖuȗ')
        (16r0317 "acute below"  'KĶkķLĻlļNŅnņRŖrŗSȘsșTȚtț')
        (16r0327 "cedilla"      'CÇcçSŞsşTŢtţEȨeȩDḐdḑHḨhḩ')
        (16r0328 "ogonek"       'AĄaąEĘeęIĮiįOǪoǫUŲuų')
    ) do:[:eachPair |
        |composeCode mapping|

        composeCode := eachPair first.
        mapping := eachPair second.
        mapping pairWiseDo:[:baseChar :composedChar |
            "/ setup, so that we find
            "/      DecomposeMap at:"$à codePoint" 16rE0 put:#( "$a codePoint" 16r61 "grave codePoint" 16r0300).
            DecomposeMap
                at:composedChar codePoint
                put:(Array with:baseChar codePoint with:composeCode)
        ].

        ComposeMap at:composeCode put:mapping.
    ].
! !

!ISO10646_to_UTF8_MAC methodsFor:'encoding & decoding'!

compositionOf: baseChar with: diacriticalChar to: outStream
    "compose two characters into one
     a + umlaut-diacritic-mark -> ä."

    |cp map i|

    cp := diacriticalChar codePoint.
    (cp between:16r300 and:16r328) ifTrue:[
        map := ComposeMap at:cp ifAbsent:nil.
        map notNil ifTrue:[
            "/ compose
            i := map indexOf: baseChar.
            i ~~ 0 ifTrue:[
                outStream nextPut: (map at:i+1).
                ^ self.
            ].
        ].
    ].

    "/ leave as is
    outStream nextPut: baseChar.
    outStream nextPut: diacriticalChar.
!

decodeString:aStringOrByteCollection
    "return a Unicode string from the passed in UTF-8-MAC encoded string.
     This is UTF-8 with compose-characters decomposed
     (i.e. as separate codes, not as single combined characters).

     For now, here is a limited version, which should work
     at least for most European countries...
    "

    |s buff previous|

    s := super decodeString:aStringOrByteCollection.
    (s contains:[:char | char codePoint between:16r0300 and:16r0328]) ifFalse:[^ s].

    ComposeMap isNil ifTrue:[
        self class initializeDecomposeMap
    ].

    buff := CharacterWriteStream on:''.
    previous := nil.
    s do:[:each |
        (each codePoint between:16r0300 and:16r0328) ifTrue:[
            previous isNil ifTrue:[
                buff isEmpty ifTrue:[
                    "/ wrong - combiner not allowed here.
                    buff nextPut:each.
                ] ifFalse:[
                    "/ ouch - a multi-compose
                    previous := buff last.
                    buff skip:-1.
                    self compositionOf:previous with:each to:buff.
                ].
            ] ifFalse:[
                self compositionOf:previous with:each to:buff.
            ].
            previous := nil.
        ] ifFalse:[
            previous notNil ifTrue:[
                buff nextPut:previous.
            ].
            previous := each.
        ].
    ].
    previous notNil ifTrue:[
        buff nextPut:previous.
    ].
    ^ buff contents.

    "
     (ISO10646_to_UTF8 new encodeString:'aäoöuü') asByteArray
       -> #[97 195 164 111 195 182 117 195 188]

     (ISO10646_to_UTF8 new decodeString:
         (ISO10646_to_UTF8 new encodeString:'aäoöuü') asByteArray)

     (ISO10646_to_UTF8_MAC new encodeString:'aäoöuü') asByteArray
       -> #[97 97 204 136 111 111 204 136 117 117 204 136]

     (ISO10646_to_UTF8_MAC new decodeString:
         (ISO10646_to_UTF8_MAC new encodeString:'aäoöuü') asByteArray)
    "
!

decompositionOf: codePointIn into:outBlockWithTwoArgs
    "if required, decompose a diacritical character into a base character and a combining mark;
     e.g. ä -> a + umlaut-diacritic-mark.
     Pass both code points as args to the given block.
     For non-diacritical characters, the block is not evaluated.
     Return true, if a decomposition was done."

    |entry|

    codePointIn < 16rC0 ifTrue:[ ^ false ].

    entry := DecomposeMap at:codePointIn ifAbsent:nil.
    entry isNil ifTrue:[ ^ false ].

    outBlockWithTwoArgs value:(entry at:1) value:(entry at:2).
    ^ true
!

encodeString:aUnicodeString
    "return the UTF-8-MAC representation of aUnicodeString.
     This is UTF-8 with compose-characters decomposed (i.e. as separate codes, not as
     single combined characters).

     For now, here is a limited version, which should work
     at least for most European countries...
    "

    |gen s decomp codePoint composeCodePoint|

    DecomposeMap isNil ifTrue:[
        self class initializeDecomposeMap
    ].

    gen :=
        [:codePointArg |
            |codePoint "{Class: SmallInteger }" b1 b2 b3 b4 b5 v "{Class: SmallInteger }"|

            codePoint := codePointArg.
            codePoint <= 16r7F ifTrue:[
                s nextPut:(Character value:codePoint).
            ] ifFalse:[
                b1 := Character value:((codePoint bitAnd:16r3F) bitOr:2r10000000).
                v := codePoint bitShift:-6.
                v <= 16r1F ifTrue:[
                    s nextPut:(Character value:(v bitOr:2r11000000)).
                    s nextPut:b1.
                ] ifFalse:[
                    b2 := Character value:((v bitAnd:16r3F) bitOr:2r10000000).
                    v := v bitShift:-6.
                    v <= 16r0F ifTrue:[
                        s nextPut:(Character value:(v bitOr:2r11100000)).
                        s nextPut:b2; nextPut:b1.
                    ] ifFalse:[
                        b3 := Character value:((v bitAnd:16r3F) bitOr:2r10000000).
                        v := v bitShift:-6.
                        v <= 16r07 ifTrue:[
                            s nextPut:(Character value:(v bitOr:2r11110000)).
                            s nextPut:b3; nextPut:b2; nextPut:b1.
                        ] ifFalse:[
                            b4 := Character value:((v bitAnd:16r3F) bitOr:2r10000000).
                            v := v bitShift:-6.
                            v <= 16r03 ifTrue:[
                                s nextPut:(Character value:(v bitOr:2r11111000)).
                                s nextPut:b4; nextPut:b3; nextPut:b2; nextPut:b1.
                            ] ifFalse:[
                                b5 := Character value:((v bitAnd:16r3F) bitOr:2r10000000).
                                v := v bitShift:-6.
                                v <= 16r01 ifTrue:[
                                    s nextPut:(Character value:(v bitOr:2r11111100)).
                                    s nextPut:b5; nextPut:b4; nextPut:b3; nextPut:b2; nextPut:b1.
                                ] ifFalse:[
                                    "/ cannot happen - we only support up to 30 bit characters
                                    self error:'ascii value > 31bit in utf8Encode'.
                                ]
                            ].
                        ].
                    ].
                ].
            ].
        ].

    decomp :=
        [:baseCodePointArg :composeCodePointArg |
            codePoint := baseCodePointArg. composeCodePoint := composeCodePointArg
        ].

    s := WriteStream on:(String uninitializedNew:aUnicodeString size).
    aUnicodeString do:[:eachCharacter |
        |needExtra|

        codePoint := eachCharacter codePoint.
        needExtra := self decompositionOf: codePoint into:decomp.
        gen value:codePoint.
        needExtra ifTrue:[
            gen value:composeCodePoint
        ].
    ].

    ^ s contents

    "
     (self encodeString:'hello') asByteArray
       -> #[104 101 108 108 111]
     (self encodeString:(Character value:16r40) asString) asByteArray
       -> #[64]
     (self encodeString:(Character value:16r7F) asString) asByteArray
       -> #[127]
     (self encodeString:(Character value:16r80) asString) asByteArray
       -> #[194 128]
     (self encodeString:(Character value:16rFF) asString) asByteArray
       -> #[121 204 136]    (ÿ is decomposed into y + U+0308; plain UTF-8 would give #[195 191])

     (ISO10646_to_UTF8 new encodeString:'aäoöuü') asByteArray
       -> #[97 195 164 111 195 182 117 195 188]
     (ISO10646_to_UTF8_MAC new encodeString:'aäoöuü') asByteArray
       -> #[97 97 204 136 111 111 204 136 117 117 204 136]

     ISO10646_to_UTF8_MAC new decodeString:
         (ISO10646_to_UTF8_MAC new encodeString:'Packages aus VSE für Smalltalk_X') asByteArray
    "
! !

!ISO10646_to_UTF8_MAC methodsFor:'queries'!

nameOfEncoding
    ^ #'utf8-mac'
! !

!ISO10646_to_UTF8_MAC class methodsFor:'documentation'!

version
    ^ '$Header: /cvs/stx/stx/libbasic/CharacterEncoderImplementations__ISO10646_to_UTF8_MAC.st,v 1.8 2015-02-27 18:53:22 cg Exp $'
!

version_CVS
    ^ '$Header: /cvs/stx/stx/libbasic/CharacterEncoderImplementations__ISO10646_to_UTF8_MAC.st,v 1.8 2015-02-27 18:53:22 cg Exp $'
! !

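"
 Illustration only, not part of the original source: a hand-worked sketch of what
 encodeString: does for the single character ä (U+00E4). The variable names below
 are made up for this example; the numbers follow from the umlaut row in
 initializeDecomposeMap and from the bit arithmetic of the gen block.
 It is kept inside a comment so that filing in this source has no side effects;
 the expressions can be evaluated one by one in a workspace.

     |entry cp low lead|

     entry := Array with:16r61 with:16r0308.
         (what DecomposeMap is expected to hold for 16rE4: base a, combining diaeresis)
     cp := entry at:2.
     low := (cp bitAnd:16r3F) bitOr:2r10000000.
         -> 136 (16r88), the continuation byte
     lead := (cp bitShift:-6) bitOr:2r11000000.
         -> 204 (16rCC), the lead byte of a two-byte sequence
     ByteArray with:(entry at:1) with:lead with:low.
         -> #[97 204 136]

 The base character stays a single ASCII byte, and the combining mark becomes
 the two bytes CC 88, which matches the 97 204 136 triples in the encodeString:
 examples.
"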