CharacterEncoderImplementations__ISO10646_to_UTF8.st
author Jan Vrany <jan.vrany@fit.cvut.cz>
Sat, 03 Oct 2015 08:50:56 +0100
branch jv
changeset 18807 d79ce9fb5198
parent 18630 a74d669db937
child 19863 513bd7237fe7
permissions -rw-r--r--
Fixed EncodedStream>>next for UTF8 to Unicode decoder.
"
 COPYRIGHT (c) 2004 by eXept Software AG
	      All Rights Reserved

 This software is furnished under a license and may be used
 only in accordance with the terms of that license and with the
 inclusion of the above copyright notice.   This software may not
 be provided or otherwise made available to, or used by, any
 other person.  No title to or ownership of the software is
 hereby transferred.
"
"{ Package: 'stx:libbasic' }"

"{ NameSpace: CharacterEncoderImplementations }"

TwoByteEncoder subclass:#ISO10646_to_UTF8
	instanceVariableNames:''
	classVariableNames:''
	poolDictionaries:''
	category:'Collections-Text-Encodings'
!

ISO10646_to_UTF8 class instanceVariableNames:'theOneAndOnlyInstance'

"
 No other class instance variables are inherited by this class.
"
!

!ISO10646_to_UTF8 class methodsFor:'documentation'!

copyright
"
 COPYRIGHT (c) 2004 by eXept Software AG
	      All Rights Reserved

 This software is furnished under a license and may be used
 only in accordance with the terms of that license and with the
 inclusion of the above copyright notice.   This software may not
 be provided or otherwise made available to, or used by, any
 other person.  No title to or ownership of the software is
 hereby transferred.
"
!

examples
"
  Encoding (unicode to utf8)
     ISO10646_to_UTF8 encodeString:'hello'.


  Decoding (utf8 to unicode):
     |t|

     t := ISO10646_to_UTF8 encodeString:'Helloś'.
     ISO10646_to_UTF8 decodeString:t.
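
  Decoding truncated input (a sketch only, not verified against a running image;
  it assumes that DecodingError can be trapped with the standard on:do: protocol
  and that messageText answers the errorString passed when the error is raised):
     |r|

     r := [
             ISO10646_to_UTF8 decodeString:(String with:(Character value:16rC5))
          ] on:DecodingError do:[:ex |
             ex return:ex messageText
          ].

  The byte 16rC5 announces a two-byte sequence, but the input ends right after it,
  so decodeString: is expected to report the input as too short.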
"
! !

!ISO10646_to_UTF8 class methodsFor:'instance creation'!

flushSingleton
    "flushes the cached singleton"

    theOneAndOnlyInstance := nil

    "
     self flushSingleton
    "
!

new
    "returns a singleton"

    theOneAndOnlyInstance isNil ifTrue:[
        theOneAndOnlyInstance := self basicNew initialize.
    ].
    ^ theOneAndOnlyInstance.
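
    "
     A quick check of the singleton behavior (a sketch; it should answer true,
     because both sends return the cached instance):
         self new == self new
    "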
!

theOneAndOnlyInstance
    "returns a singleton"

    theOneAndOnlyInstance isNil ifTrue:[
        theOneAndOnlyInstance := self basicNew initialize.
    ].
    ^ theOneAndOnlyInstance.
! !

!ISO10646_to_UTF8 methodsFor:'encoding & decoding'!

decode:aCode
    self shouldNotImplement "/ no single byte conversion possible
!

decodeString:aStringOrByteCollection
    "given a string in UTF8 encoding,
     return a new string containing the same characters, in Unicode encoding.
     Returns either a normal String, a Unicode16String or a Unicode32String instance.
     This is only useful when reading from external sources or communicating with
     other systems
     (ST/X never uses utf8 internally, but always uses strings of fully decoded unicode characters).
     This only handles characters of up to 30 bits.

     If you work a lot with utf-8 encoded text files,
     this is a first-class candidate for a primitive."
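
    "A sketch of the bit arithmetic performed for a two-byte sequence, using the
     bytes 16rC5 16r9B (the UTF-8 form of the last character in the class-side
     example string). The temporary name code is illustrative only and is not
     used by this method:

         |code|
         code := 16rC5 bitAnd:2r00011111.
         code := (code bitShift:6) bitOr:(16r9B bitAnd:2r00111111).

     code is now 16r15B. That exceeds 16rFF, so such input raises nBitsRequired
     to 16 and the result presumably no longer fits a single-byte String."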

    |sz anyAbove7BitAscii nBitsRequired
     ascii "{ Class: SmallInteger }"
     byte  "{ Class: SmallInteger }"
     s newString idx next6Bits last6Bits
     errorReporter|

    "/ fast track, also avoid creation of new strings if aStringOrByteCollection is already a 7-bit string
    aStringOrByteCollection containsNon7BitAscii ifFalse:[
        ^ aStringOrByteCollection asSingleByteString
    ].

    errorReporter := [:msg | 
                             DecodingError newException
                                defaultValue:aStringOrByteCollection;
                                raiseRequestWith:aStringOrByteCollection errorString:msg.
                     ].

    next6Bits := [
                    | byte |

                    byte := s nextByte.
                    byte isNil ifTrue:[^ errorReporter value:'short utf8 string'].
                    ascii := (ascii bitShift:6) bitOr:(byte bitAnd:2r00111111).
                    (byte bitAnd:2r11000000) ~~ 2r10000000 ifTrue:[
                        ^ errorReporter value:'illegal followbyte (next)'.
                    ].
                 ].

    last6Bits := [
                    | a byte |

                    byte := s nextByte.
                    byte isNil ifTrue:[^ errorReporter value:'short utf8 string'].
                    a := (ascii bitShift:6) bitOr:(byte bitAnd:2r00111111).
                    (a > 16r3FFFFFFF) ifTrue:[
                        "/ ST/X can only represent 30 bit unicode characters.
                        errorReporter value:'unicode character out of range'.
                        a := 16r3FFFFFFF.
                    ].
                    ascii := a.
                    (byte bitAnd:2r11000000) ~~ 2r10000000 ifTrue:[
                        ^ errorReporter value:'illegal followbyte (last)'.
                    ].
                 ].

    nBitsRequired := 8.
    anyAbove7BitAscii := false.
    sz := 0.
    s := aStringOrByteCollection readStream.

    "first determine the string size"
    [s atEnd] whileFalse:[
        byte := ascii := s nextByte.
        (byte bitAnd:16r80) ~~ 0 ifTrue:[
            anyAbove7BitAscii := true.
            (byte bitAnd:2r11100000) == 2r11000000 ifTrue:[
                "/ 80 .. 7FF
                ascii := (byte bitAnd:2r00011111).
                next6Bits value.
                ascii > 16rFF ifTrue:[
                    nBitsRequired := nBitsRequired max:16
                ].
                "/ a strict utf8 decoder does not allow overlong sequences
                ascii < 16r80 ifTrue:[
                    errorReporter value:'overlong utf8 sequence'
                ].
            ] ifFalse:[
                (byte bitAnd:2r11110000) == 2r11100000 ifTrue:[
                    "/ 800 .. FFFF
                    ascii := (byte bitAnd:2r00001111).
                    next6Bits value.
                    next6Bits value.
                    ascii > 16rFF ifTrue:[
                        nBitsRequired := nBitsRequired max:16
                    ].
                    ascii < 16r800 ifTrue:[
                        errorReporter value:'overlong utf8 sequence'
                    ].
                ] ifFalse:[
                    (byte bitAnd:2r11111000) == 2r11110000 ifTrue:[
                        "/ 10000 .. 1FFFFF
                        ascii := (byte bitAnd:2r00000111).
                        next6Bits value.
                        next6Bits value.
                        next6Bits value.
                        ascii > 16rFF ifTrue:[
                            ascii > 16rFFFF ifTrue:[
                                nBitsRequired := nBitsRequired max:32
                            ] ifFalse:[
                                nBitsRequired := nBitsRequired max:16
                            ]
                        ].
                        ascii < 16r10000 ifTrue:[
                            errorReporter value:'overlong utf8 sequence'
                        ].
                    ] ifFalse:[
                        (byte bitAnd:2r11111100) == 2r11111000 ifTrue:[
                            "/ 200000 .. 3FFFFFF
                            ascii := (byte bitAnd:2r00000011).
                            next6Bits value.
                            next6Bits value.
                            next6Bits value.
                            next6Bits value.
                            ascii > 16rFF ifTrue:[
                                ascii > 16rFFFF ifTrue:[
                                    nBitsRequired := nBitsRequired max:32
                                ] ifFalse:[
                                    nBitsRequired := nBitsRequired max:16
                                ]
                            ].
                            "/ the lower bound is hexadecimal, like the range noted above
                            ascii < 16r200000 ifTrue:[
                                errorReporter value:'overlong utf8 sequence'
                            ].
                        ] ifFalse:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   222
                            (byte bitAnd:2r11111110) == 2r11111100 ifTrue:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   223
                                "/ 4000000 .. 7FFFFFFF
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   224
                                ascii := (byte bitAnd:2r00000001).
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   225
                                next6Bits value.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   226
                                next6Bits value.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   227
                                next6Bits value.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   228
                                next6Bits value.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   229
                                last6Bits value.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   230
                                ascii > 16rFF ifTrue:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   231
                                    ascii > 16rFFFF ifTrue:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   232
                                        nBitsRequired := nBitsRequired max:32
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   233
                                    ] ifFalse:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   234
                                        nBitsRequired := nBitsRequired max:16
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   235
                                    ]
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   236
                                ].
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   237
                                ascii < 16r4000000 ifTrue:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   238
                                    errorReporter value:'overlong utf8 sequence'
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   239
                                ].
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   240
                            ] ifFalse:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   241
                                errorReporter value:'invalid utf8 encoding'
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   242
                            ]
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   243
                        ]
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   244
                    ]
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   245
                ]
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   246
            ].
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   247
        ].
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   248
        sz := sz + 1.
8081
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   249
    ].
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   250
    nBitsRequired == 8 ifTrue:[
8460
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   251
        anyAbove7BitAscii ifFalse:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   252
            "/ can return the original string
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   253
            aStringOrByteCollection isString ifTrue:[^ aStringOrByteCollection].
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   254
        ].
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   255
        newString := String uninitializedNew:sz
8081
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   256
    ] ifFalse:[
8460
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   257
        nBitsRequired <= 16 ifTrue:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   258
            newString := Unicode16String new:sz
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   259
        ] ifFalse:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   260
            newString := Unicode32String new:sz
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   261
        ]
8081
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   262
    ].
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   263
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   264
    next6Bits := [
8460
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   265
                    |byte|
8081
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   266
8460
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   267
                    byte := s nextByte.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   268
                    ascii := (ascii bitShift:6) bitOr:(byte bitAnd:2r00111111).
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   269
                 ].
8081
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   270
14172
8c2cf2a68116 changed:
Stefan Vogel <sv@exept.de>
parents: 11996
diff changeset
   271
    s reset.
8081
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   272
    idx := 1.
14172
8c2cf2a68116 changed:
Stefan Vogel <sv@exept.de>
parents: 11996
diff changeset
   273
8c2cf2a68116 changed:
Stefan Vogel <sv@exept.de>
parents: 11996
diff changeset
   274
    "now fill the string"
8081
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   275
    [s atEnd] whileFalse:[
8460
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   276
        byte := ascii := s nextByte.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   277
        (byte bitAnd:2r10000000) ~~ 0 ifTrue:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   278
            (byte bitAnd:2r11100000) == 2r11000000 ifTrue:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   279
                ascii := (byte bitAnd:2r00011111).
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   280
                next6Bits value.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   281
            ] ifFalse:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   282
                (byte bitAnd:2r11110000) == 2r11100000 ifTrue:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   283
                    ascii := (byte bitAnd:2r00001111).
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   284
                    next6Bits value.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   285
                    next6Bits value.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   286
                ] ifFalse:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   287
                    (byte bitAnd:2r11111000) == 2r11110000 ifTrue:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   288
                        ascii := (byte bitAnd:2r00000111).
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   289
                        next6Bits value.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   290
                        next6Bits value.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   291
                        next6Bits value.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   292
                    ] ifFalse:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   293
                        (byte bitAnd:2r11111100) == 2r11111000 ifTrue:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   294
                            ascii := (byte bitAnd:2r00000011).
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   295
                            next6Bits value.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   296
                            next6Bits value.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   297
                            next6Bits value.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   298
                            next6Bits value.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   299
                        ] ifFalse:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   300
                            (byte bitAnd:2r11111110) == 2r11111100 ifTrue:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   301
                                ascii := (byte bitAnd:2r00000001).
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   302
                                next6Bits value.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   303
                                next6Bits value.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   304
                                next6Bits value.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   305
                                next6Bits value.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   306
                                last6Bits value.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   307
                            ]
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   308
                        ]
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   309
                    ]
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   310
                ]
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   311
            ].
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   312
        ].
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   313
        newString at:idx put:(Character value:ascii).
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   314
        idx := idx + 1.
8081
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   315
    ].
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   316
    ^ newString
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   317
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   318
    "
8411
44509c4f92f0 *** empty log message ***
ca
parents: 8406
diff changeset
   319
     CharacterArray fromUTF8Bytes:#[ 16r41 16r42 ]
44509c4f92f0 *** empty log message ***
ca
parents: 8406
diff changeset
   320
     CharacterArray fromUTF8Bytes:#[ 16rC1 16r02 ]
44509c4f92f0 *** empty log message ***
ca
parents: 8406
diff changeset
   321
     CharacterArray fromUTF8Bytes:#[ 16rE0 16r81 16r02 ]
44509c4f92f0 *** empty log message ***
ca
parents: 8406
diff changeset
   322
     CharacterArray fromUTF8Bytes:#[ 16rEF 16rBF 16rBF ]
8081
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   323
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   324
   rfc2279 examples:
8411
44509c4f92f0 *** empty log message ***
ca
parents: 8406
diff changeset
   325
     CharacterArray fromUTF8Bytes:#[ 16r41 16rE2 16r89 16rA2 16rCE 16r91 16r2E ]
44509c4f92f0 *** empty log message ***
ca
parents: 8406
diff changeset
   326
     CharacterArray fromUTF8Bytes:#[ 16rED 16r95 16r9C 16rEA 16rB5 16rAD 16rEC 16r96 16rB4 ]
44509c4f92f0 *** empty log message ***
ca
parents: 8406
diff changeset
   327
     CharacterArray fromUTF8Bytes:#[ 16rE6 16r97 16rA5 16rE6 16r9C 16rAC 16rE8 16rAA 16r9E ]
8081
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   328
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   329
   invalid:
8411
44509c4f92f0 *** empty log message ***
ca
parents: 8406
diff changeset
   330
     CharacterArray fromUTF8Bytes:#[ 16rC0 16r80 ]
44509c4f92f0 *** empty log message ***
ca
parents: 8406
diff changeset
   331
     CharacterArray fromUTF8Bytes:#[ 16rE0 16r80 16r80 ]
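     overlong 4-byte form of code point 0 (invalid per RFC 3629; that this decoder
     reports it as overlong is an assumption, not a tested result):
     CharacterArray fromUTF8Bytes:#[ 16rF0 16r80 16r80 16r80 ]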
8081
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   332
    "
9928
46cf4350beb2 *** empty log message ***
Claus Gittinger <cg@exept.de>
parents: 8773
diff changeset
   333
46cf4350beb2 *** empty log message ***
Claus Gittinger <cg@exept.de>
parents: 8773
diff changeset
   334
    "Modified: / 18-09-2006 / 19:55:52 / cg"
8081
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   335
!
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   336
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   337
encode:aCode
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   338
    self shouldNotImplement "/ no single byte conversion possible
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   339
!
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   340
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   341
encodeString:aUnicodeString
17489
22f6151b5135 class: CharacterEncoderImplementations::ISO10646_to_UTF8
Claus Gittinger <cg@exept.de>
parents: 14172
diff changeset
   342
    "return the UTF-8 representation of a Unicode string.
8081
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   343
     The resulting string is only useful for storing in an external file,
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   344
     not for use inside ST/X.
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   345
8411
44509c4f92f0 *** empty log message ***
ca
parents: 8406
diff changeset
   346
     If you work a lot with UTF-8 encoded text files,
8081
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   347
     this is a first-class candidate for a primitive."
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   348
18625
37d697b9bf8d class: CharacterEncoderImplementations::ISO10646_to_UTF8
Stefan Vogel <sv@exept.de>
parents: 18604
diff changeset
   349
    |s
37d697b9bf8d class: CharacterEncoderImplementations::ISO10646_to_UTF8
Stefan Vogel <sv@exept.de>
parents: 18604
diff changeset
   350
     stringSize "{ Class: SmallInteger }"|
8081
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   351
18604
54caf7b64994 class: CharacterEncoderImplementations::ISO10646_to_UTF8
Stefan Vogel <sv@exept.de>
parents: 18601
diff changeset
   352
    "/ avoid creation of new strings if possible
18601
00dc53dfe54d class: CharacterEncoderImplementations::ISO10646_to_UTF8
Stefan Vogel <sv@exept.de>
parents: 17623
diff changeset
   353
    aUnicodeString containsNon7BitAscii ifFalse:[
00dc53dfe54d class: CharacterEncoderImplementations::ISO10646_to_UTF8
Stefan Vogel <sv@exept.de>
parents: 17623
diff changeset
   354
        ^ aUnicodeString asSingleByteString
00dc53dfe54d class: CharacterEncoderImplementations::ISO10646_to_UTF8
Stefan Vogel <sv@exept.de>
parents: 17623
diff changeset
   355
    ].
8460
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   356
18625
37d697b9bf8d class: CharacterEncoderImplementations::ISO10646_to_UTF8
Stefan Vogel <sv@exept.de>
parents: 18604
diff changeset
   357
    stringSize := aUnicodeString size.
37d697b9bf8d class: CharacterEncoderImplementations::ISO10646_to_UTF8
Stefan Vogel <sv@exept.de>
parents: 18604
diff changeset
   358
    s := WriteStream on:(String uninitializedNew:(stringSize * 3 // 2)).
37d697b9bf8d class: CharacterEncoderImplementations::ISO10646_to_UTF8
Stefan Vogel <sv@exept.de>
parents: 18604
diff changeset
   359
    1 to:stringSize do:[:idx |
37d697b9bf8d class: CharacterEncoderImplementations::ISO10646_to_UTF8
Stefan Vogel <sv@exept.de>
parents: 18604
diff changeset
   360
        |character codePoint "{Class: SmallInteger }" b1 b2 b3 b4 b5 v "{Class: SmallInteger }"|
8081
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   361
18625
37d697b9bf8d class: CharacterEncoderImplementations::ISO10646_to_UTF8
Stefan Vogel <sv@exept.de>
parents: 18604
diff changeset
   362
        character := aUnicodeString at:idx.
37d697b9bf8d class: CharacterEncoderImplementations::ISO10646_to_UTF8
Stefan Vogel <sv@exept.de>
parents: 18604
diff changeset
   363
        codePoint := character codePoint.
8460
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   364
        codePoint <= 16r7F ifTrue:[
18625
37d697b9bf8d class: CharacterEncoderImplementations::ISO10646_to_UTF8
Stefan Vogel <sv@exept.de>
parents: 18604
diff changeset
   365
            s nextPut:character.
8460
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   366
        ] ifFalse:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   367
            b1 := Character value:((codePoint bitAnd:16r3F) bitOr:2r10000000).
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   368
            v := codePoint bitShift:-6.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   369
            v <= 16r1F ifTrue:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   370
                s nextPut:(Character value:(v bitOr:2r11000000)).
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   371
                s nextPut:b1.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   372
            ] ifFalse:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   373
                b2 := Character value:((v bitAnd:16r3F) bitOr:2r10000000).
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   374
                v := v bitShift:-6.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   375
                v <= 16r0F ifTrue:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   376
                    s nextPut:(Character value:(v bitOr:2r11100000)).
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   377
                    s nextPut:b2; nextPut:b1.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   378
                ] ifFalse:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   379
                    b3 := Character value:((v bitAnd:16r3F) bitOr:2r10000000).
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   380
                    v := v bitShift:-6.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   381
                    v <= 16r07 ifTrue:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   382
                        s nextPut:(Character value:(v bitOr:2r11110000)).
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   383
                        s nextPut:b3; nextPut:b2; nextPut:b1.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   384
                    ] ifFalse:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   385
                        b4 := Character value:((v bitAnd:16r3F) bitOr:2r10000000).
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   386
                        v := v bitShift:-6.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   387
                        v <= 16r03 ifTrue:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   388
                            s nextPut:(Character value:(v bitOr:2r11111000)).
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   389
                            s nextPut:b4; nextPut:b3; nextPut:b2; nextPut:b1.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   390
                        ] ifFalse:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   391
                            b5 := Character value:((v bitAnd:16r3F) bitOr:2r10000000).
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   392
                            v := v bitShift:-6.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   393
                            v <= 16r01 ifTrue:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   394
                                s nextPut:(Character value:(v bitOr:2r11111100)).
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   395
                                s nextPut:b5; nextPut:b4; nextPut:b3; nextPut:b2; nextPut:b1.
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   396
                            ] ifFalse:[
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   397
                                "/ cannot happen - we only support up to 30 bit characters
18625
37d697b9bf8d class: CharacterEncoderImplementations::ISO10646_to_UTF8
Stefan Vogel <sv@exept.de>
parents: 18604
diff changeset
   398
                                EncodingError raiseWith:character errorString:'codePoint > 31bit in #utf8Encode'.
8460
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   399
                            ]
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   400
                        ].
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   401
                    ].
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   402
                ].
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   403
            ].
f4d333135e1d tuned encoding/decoding (quick check for 8-bit chars)
penk
parents: 8411
diff changeset
   404
        ].
8081
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   405
    ].
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   406
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   407
    ^ s contents
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   408
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   409
    "
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   410
     (self encodeString:'hello') asByteArray                             #[104 101 108 108 111]
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   411
     (self encodeString:(Character value:16r40) asString) asByteArray    #[64]
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   412
     (self encodeString:(Character value:16r7F) asString) asByteArray    #[127]
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   413
     (self encodeString:(Character value:16r80) asString) asByteArray    #[194 128]
8411
44509c4f92f0 *** empty log message ***
ca
parents: 8406
diff changeset
   414
     (self encodeString:(Character value:16rFF) asString) asByteArray    #[195 191]
44509c4f92f0 *** empty log message ***
ca
parents: 8406
diff changeset
   415
     (self encodeString:(Character value:16r100) asString) asByteArray   #[196 128]
44509c4f92f0 *** empty log message ***
ca
parents: 8406
diff changeset
   416
     (self encodeString:(Character value:16r200) asString) asByteArray   #[200 128]
44509c4f92f0 *** empty log message ***
ca
parents: 8406
diff changeset
   417
     (self encodeString:(Character value:16r400) asString) asByteArray   #[208 128]
44509c4f92f0 *** empty log message ***
ca
parents: 8406
diff changeset
   418
     (self encodeString:(Character value:16r800) asString) asByteArray   #[224 160 128]
44509c4f92f0 *** empty log message ***
ca
parents: 8406
diff changeset
   419
     (self encodeString:(Character value:16r1000) asString) asByteArray  #[225 128 128]
44509c4f92f0 *** empty log message ***
ca
parents: 8406
diff changeset
   420
     (self encodeString:(Character value:16r2000) asString) asByteArray  #[226 128 128]
44509c4f92f0 *** empty log message ***
ca
parents: 8406
diff changeset
   421
     (self encodeString:(Character value:16r4000) asString) asByteArray  #[228 128 128]
44509c4f92f0 *** empty log message ***
ca
parents: 8406
diff changeset
   422
     (self encodeString:(Character value:16r8000) asString) asByteArray  #[232 128 128]
44509c4f92f0 *** empty log message ***
ca
parents: 8406
diff changeset
   423
     (self encodeString:(Character value:16rFFFF) asString) asByteArray  #[239 191 191]
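
     further examples for code points above 16rFFFF; the expected bytes are hand-derived
     from the UTF-8 bit layout (not taken from a test run) and assume that
     Character value: accepts these code points:
     (self encodeString:(Character value:16r10000) asString) asByteArray   #[240 144 128 128]
     (self encodeString:(Character value:16r10FFFF) asString) asByteArray  #[244 143 191 191]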
8081
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   424
    "
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   425
! !
b468050174a9 initial checkin
Claus Gittinger <cg@exept.de>
parents:
diff changeset
   426
18807
d79ce9fb5198 Fixed EncodedStream>>next for UTF8 to Unicode decoder.
Jan Vrany <jan.vrany@fit.cvut.cz>
parents: 18630
diff changeset
   427
!ISO10646_to_UTF8 methodsFor:'queries'!
11974
bbbf98b676b0 *** empty log message ***
Claus Gittinger <cg@exept.de>
parents: 10670
diff changeset
   428
bbbf98b676b0 *** empty log message ***
Claus Gittinger <cg@exept.de>
parents: 10670
diff changeset
   429
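bytesToReadFor:firstByte
    "return the number of bytes in the UTF-8 sequence that starts with firstByte,
     determined by counting the leading 1-bits of firstByte.
     Returns 1 for 7-bit ASCII bytes; continuation bytes are not checked here
     and also yield 1"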
bbbf98b676b0 *** empty log message ***
Claus Gittinger <cg@exept.de>
parents: 10670
diff changeset
   430
    |bytesToRead|
bbbf98b676b0 *** empty log message ***
Claus Gittinger <cg@exept.de>
parents: 10670
diff changeset
   431
bbbf98b676b0 *** empty log message ***
Claus Gittinger <cg@exept.de>
parents: 10670
diff changeset
   432
    bytesToRead := 1.
bbbf98b676b0 *** empty log message ***
Claus Gittinger <cg@exept.de>
parents: 10670
diff changeset
   433
    (firstByte isBitSet:8) ifFalse:[^1].
bbbf98b676b0 *** empty log message ***
Claus Gittinger <cg@exept.de>
parents: 10670
diff changeset
   434
    7 downTo:3
bbbf98b676b0 *** empty log message ***
Claus Gittinger <cg@exept.de>
parents: 10670
diff changeset
   435
        do:[:idx | 
bbbf98b676b0 *** empty log message ***
Claus Gittinger <cg@exept.de>
parents: 10670
diff changeset
   436
            (firstByte isBitSet:idx) ifTrue:[
bbbf98b676b0 *** empty log message ***
Claus Gittinger <cg@exept.de>
parents: 10670
diff changeset
   437
                bytesToRead := bytesToRead + 1
bbbf98b676b0 *** empty log message ***
Claus Gittinger <cg@exept.de>
parents: 10670
diff changeset
   438
            ] ifFalse:[
bbbf98b676b0 *** empty log message ***
Claus Gittinger <cg@exept.de>
parents: 10670
diff changeset
   439
                ^bytesToRead                
bbbf98b676b0 *** empty log message ***
Claus Gittinger <cg@exept.de>
parents: 10670
diff changeset
   440
            ]
bbbf98b676b0 *** empty log message ***
Claus Gittinger <cg@exept.de>
parents: 10670
diff changeset
   441
        ].
bbbf98b676b0 *** empty log message ***
Claus Gittinger <cg@exept.de>
parents: 10670
diff changeset
   442
    ^bytesToRead
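
    "usage sketch, not part of the original source. Assumptions: the encoder
     can be instantiated with #new, and #isBitSet: counts bits from 1 (LSB),
     as the test of bit 8 in this method implies. The byte values are
     illustrative UTF-8 lead bytes; the value after -> is the expected
     printIt result.

     CharacterEncoderImplementations::ISO10646_to_UTF8 new bytesToReadFor:16r41     -> 1
     CharacterEncoderImplementations::ISO10646_to_UTF8 new bytesToReadFor:16rC3     -> 2
     CharacterEncoderImplementations::ISO10646_to_UTF8 new bytesToReadFor:16rE2     -> 3
     CharacterEncoderImplementations::ISO10646_to_UTF8 new bytesToReadFor:16rF0     -> 4
    "
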
    "Created: / 14-06-2005 / 17:17:24 / janfrog"
!

characterSize:charOrcodePoint
    "return the number of bytes required to encode charOrcodePoint
     (a character or an integer code point) as UTF-8"

    "Taken from RFC 3629"

    (charOrcodePoint asInteger between:16r00000000 and:16r0000007F) ifTrue:[^1].
    (charOrcodePoint asInteger between:16r00000080 and:16r000007FF) ifTrue:[^2].
    (charOrcodePoint asInteger between:16r00000800 and:16r0000FFFF) ifTrue:[^3].
    (charOrcodePoint asInteger between:16r00010000 and:16r0010FFFF) ifTrue:[^4].

    ^self error:'Invalid codePoint'
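
    "usage sketch, not part of the original source. Assumption: the encoder
     can be instantiated with #new. The arguments are illustrative; the value
     after -> is the expected printIt result, one per range tested in this
     method.

     CharacterEncoderImplementations::ISO10646_to_UTF8 new characterSize:$a           -> 1
     CharacterEncoderImplementations::ISO10646_to_UTF8 new characterSize:16rE9        -> 2
     CharacterEncoderImplementations::ISO10646_to_UTF8 new characterSize:16r20AC      -> 3
     CharacterEncoderImplementations::ISO10646_to_UTF8 new characterSize:16r1F600     -> 4
    "
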
    "Created: / 15-06-2005 / 15:16:22 / janfrog"
!

nameOfEncoding
    ^ #utf8
! !

!ISO10646_to_UTF8 methodsFor:'stream support'!

readNext:charactersToRead charactersFrom:stream

    | s |

    s := (String new:charactersToRead) writeStream.
    charactersToRead timesRepeat:[
        | c |
        c := stream peek.
        s nextPutAll:(stream next:(self bytesToReadFor:c))
    ].
    ^ self decodeString:s contents

    "Created: / 16-06-2005 / 11:45:14 / masca"
!

readNextCharacterFrom:stream

    | c bytesYetToRead s |
    c := stream peek.
    bytesYetToRead := self bytesToReadFor:c codePoint.
    bytesYetToRead == 1 ifTrue:[
        stream next.
        ^ c.
    ].
    s := (String new:1 + bytesYetToRead) writeStream.
    s nextPutAll:(stream next: bytesYetToRead).
    ^ self decodeString:s contents
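
    "usage sketch, not part of the original source. Assumptions: the encoder
     can be instantiated with #new, and the input is a character stream (the
     names enc and in are hypothetical). Only the single-byte path is shown;
     multi-byte results depend on #decodeString:, which is not part of this
     section, so no expected value is given for them.

     |enc in|
     enc := CharacterEncoderImplementations::ISO10646_to_UTF8 new.
     in := 'abc' readStream.
     enc readNextCharacterFrom:in.      -> $a (high bit clear, so one byte is consumed)
     in next                            -> $b (the stream was advanced by one)
    "
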
    "Created: / 14-06-2005 / 17:03:59 / janfrog"
    "Modified: / 03-10-2015 / 08:49:09 / Jan Vrany <jan.vrany@fit.cvut.cz>"
! !

!ISO10646_to_UTF8 class methodsFor:'documentation'!

version
    ^ '$Header$'
!

version_HG

    ^ '$Changeset: <not expanded> $'
! !