CharacterEncoderImplementations__ISO10646_to_UTF8.st
author       Jan Vrany <jan.vrany@labware.com>
date         Tue, 01 Jun 2021 20:19:13 +0100
branch       jv
changeset    25424:51bd8a6b196f
parent       25406:eba3da836698
permissions  -rw-r--r--
Cherry-picked `Context`
cherry-picked Context.st from a6b6dda4caff:
* 4aaf30c174e9: #DOCUMENTATION by cg, Claus Gittinger <cg@exept.de>
* c67311afcc6c: #OTHER by cg, Claus Gittinger <cg@exept.de>
* 883f79e7b2a6: #FEATURE by cg, Claus Gittinger <cg@exept.de>
* 716f3fbb09e9: Don't mark contexts with `CATCHMARK`, Jan Vrany <jan.vrany@fit.cvut.cz>
* cff24fa817b0: #REFACTORING by stefan, Stefan Vogel <sv@exept.de>
* 521f0d837330: #UI_ENHANCEMENT by cg, Claus Gittinger <cg@exept.de>
* bf1118f0fcca: #UI_ENHANCEMENT by cg, Claus Gittinger <cg@exept.de>
* e587cdd22868: #BUGFIX by cg, Claus Gittinger <cg@exept.de>
* fe9f9487a3ed: #DOCUMENTATION by cg, Claus Gittinger <cg@exept.de>
* d5b781899274: #BUGFIX by cg, Claus Gittinger <cg@exept.de>
* 8258751a7465: #FEATURE by cg, Claus Gittinger <cg@exept.de>
* 40173e082cbc: Copyright updates, Jan Vrany <jan.vrany@fit.cvut.cz>
* 6db5c28207d5: #UI_ENHANCEMENT by cg, Claus Gittinger <cg@exept.de>
* 871ea64fd5dc: #FEATURE by cg, Claus Gittinger <cg@exept.de>
* 4b544a108e4e: #DOCUMENTATION by cg, Claus Gittinger <cg@exept.de>
* 9a8d8399e566: #FEATURE by cgexept.de, Claus Gittinger <cg@exept.de>
* 170b00be0103: #BUGFIX by stefan, Stefan Vogel <sv@exept.de>
* a6c73965eae8: #FEATURE by cg, Claus Gittinger <cg@exept.de>
* ce2a0e462ff0: #FEATURE by cg, Claus Gittinger <cg@exept.de>
* 46a260a9ca92: #FEATURE by cg, Claus Gittinger <cg@exept.de>
* 46cab49167fb: #UI_ENHANCEMENT by exept, Claus Gittinger <cg@exept.de>
* 7d52dfd3997d: #DOCUMENTATION by exept, Claus Gittinger <cg@exept.de>
* c52eeea62763: Fix `Context >> argAndVarNames` in cases when debug info is not available, Jan Vrany <jan.vrany@labware.com>
* b5d6963fe4a9: Backed out changeset c52eeea62763, Jan Vrany <jan.vrany@labware.com>
* 6fd3896f8703: #FEATURE by exept, Claus Gittinger <cg@exept.de>
* b530ee616256: #REFACTORING by cg, Claus Gittinger <cg@exept.de>
* ef9b481d7498: #FEATURE by cg, Claus Gittinger <cg@exept.de>
* ea663b72bd51: #UI_ENHANCEMENT by cg, Claus Gittinger <cg@exept.de>
* 6179572a733c: #FEATURE by exept, Claus Gittinger <cg@exept.de>
* 84155b1b6622: #DOCUMENTATION by exept, Claus Gittinger <cg@exept.de>
* 37d06602d856: *** empty log message ***, Claus Gittinger <cg@exept.de>
* f927b9022fea: *** empty log message ***, Claus Gittinger <cg@exept.de>
* 427d3be62d97: #UI_ENHANCEMENT by exept, Claus Gittinger <cg@exept.de>
"
COPYRIGHT (c) 2004 by eXept Software AG
COPYRIGHT (c) 2015 Jan Vrany
COPYRIGHT (c) 2021 LabWare
All Rights Reserved
This software is furnished under a license and may be used
only in accordance with the terms of that license and with the
inclusion of the above copyright notice. This software may not
be provided or otherwise made available to, or used by, any
other person. No title to or ownership of the software is
hereby transferred.
"
"{ Package: 'stx:libbasic' }"
"{ NameSpace: CharacterEncoderImplementations }"
TwoByteEncoder subclass:#ISO10646_to_UTF8
    instanceVariableNames:''
    classVariableNames:''
    poolDictionaries:''
    category:'Collections-Text-Encodings'
!
ISO10646_to_UTF8 class instanceVariableNames:'theOneAndOnlyInstance'
"
No other class instance variables are inherited by this class.
"
!
!ISO10646_to_UTF8 class methodsFor:'documentation'!
copyright
"
COPYRIGHT (c) 2004 by eXept Software AG
COPYRIGHT (c) 2015 Jan Vrany
COPYRIGHT (c) 2021 LabWare
All Rights Reserved
This software is furnished under a license and may be used
only in accordance with the terms of that license and with the
inclusion of the above copyright notice. This software may not
be provided or otherwise made available to, or used by, any
other person. No title to or ownership of the software is
hereby transferred.
"
!
documentation
"
    I encode Unicode strings into UTF8 and decode UTF8 back into Unicode strings.

    Note the naming (a common source of confusion):
        Unicode is the set of number-to-glyph assignments
    whereas:
        UTF8 is a concrete way of transmitting Unicode codePoints (numbers).
        UTF16 is another concrete encoding, for example.

    ST/X NEVER uses UTF8 internally - all characters are full 24bit characters.
    Only when exchanging data are these converted into UTF8 (or other) byte sequences.
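
    A worked example (an illustration derived from the RFC 3629 byte layout,
    not taken from the original author):
    the codePoint 16r20AC (Euro sign) lies in 16r0800..16rFFFF and therefore
    takes three bytes. Its 16 bits, split 4/6/6, are 0010 000010 101100:
        1110xxxx filled with 0010    gives 16rE2
        10xxxxxx filled with 000010  gives 16r82
        10xxxxxx filled with 101100  gives 16rAC
    so UTF8 transmits this single codePoint as the bytes 16rE2 16r82 16rAC.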
"
!
examples
"
    Encoding (Unicode to UTF8):
        ISO10646_to_UTF8 encodeString:'hello'.

    Decoding (UTF8 to Unicode):
        |t|

        t := ISO10646_to_UTF8 encodeString:'Hello'.
        ISO10646_to_UTF8 decodeString:t.
"
! !
!ISO10646_to_UTF8 class methodsFor:'instance creation'!
flushSingleton
    "flush the cached singleton; the next send of #new creates a fresh instance"

    theOneAndOnlyInstance := nil

    "
     self flushSingleton
    "
!
new
    "return the singleton, creating it on first use"

    theOneAndOnlyInstance isNil ifTrue:[
        theOneAndOnlyInstance := self basicNew initialize.
    ].
    ^ theOneAndOnlyInstance.
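
    "
     a minimal sketch of the singleton behaviour, derived from the code above
     (the short class name is used as in the class-side examples; outside the
     namespace it may need the CharacterEncoderImplementations:: prefix):

     ISO10646_to_UTF8 new == ISO10646_to_UTF8 new.      -> true, same cached object

     |old|
     old := ISO10646_to_UTF8 new.
     ISO10646_to_UTF8 flushSingleton.
     old == ISO10646_to_UTF8 new.                       -> false, a fresh instance was created
    "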
!
theOneAndOnlyInstance
    "return the singleton, creating it on first use (same as #new)"

    theOneAndOnlyInstance isNil ifTrue:[
        theOneAndOnlyInstance := self basicNew initialize.
    ].
    ^ theOneAndOnlyInstance.
! !
!ISO10646_to_UTF8 methodsFor:'encoding & decoding'!
decode:aCode
self shouldNotImplement "/ no single byte conversion possible
!
decodeString:aStringOrByteCollection
    "given a string in UTF8 encoding,
     return a new string containing the same characters, in Unicode encoding.
     Returns either a normal String, a Unicode16String or a Unicode32String instance.
     This is only useful when reading from external sources or communicating with
     other systems
     (ST/X never uses UTF8 internally, but always uses strings of fully decoded Unicode characters).
     This only handles up to 30-bit characters."

    ^ CharacterArray decodeFromUTF8:aStringOrByteCollection.
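
    "
     a hedged round-trip sketch: codePoint 16rE4 is encoded first and then
     decoded again. It assumes that the round trip is lossless, so the last
     expression should answer 228 (16rE4).

     |utf8|
     utf8 := ISO10646_to_UTF8 new encodeString:(String with:(Character value:16rE4)).
     (ISO10646_to_UTF8 new decodeString:utf8) size.
     (ISO10646_to_UTF8 new decodeString:utf8) first codePoint.
    "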
!
encode:aCode
self shouldNotImplement "/ no single byte conversion possible
!
encodeString:aUnicodeString
    "return the UTF-8 representation of a Unicode string.
     The resulting string is only useful for storing in an external file
     or sending to another system, not for use inside ST/X."

    ^ aUnicodeString utf8Encoded.
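
    "
     a hedged sketch: codePoint 16rE4 needs two bytes in UTF8 (16rC3 16rA4),
     so the encoded string should have size 2 and start with code 16rC3 (195),
     assuming utf8Encoded follows RFC 3629 and answers one character per byte.

     |enc|
     enc := ISO10646_to_UTF8 new encodeString:(String with:(Character value:16rE4)).
     enc size.
     enc first codePoint.
    "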
! !
!ISO10646_to_UTF8 methodsFor:'queries'!
bytesToReadFor:firstByte
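    "return the length in bytes of the UTF8 sequence introduced by firstByte
     (an integer byte value), found by counting its leading 1 bits.
     A byte with the top bit clear (ASCII) stands for itself, so 1 is returned."

    |bytesToRead|

    bytesToRead := 1.
    (firstByte isBitSet:8) ifFalse:[^1].
    7 downTo:3 do:[:idx |
        (firstByte isBitSet:idx) ifTrue:[
            bytesToRead := bytesToRead + 1
        ] ifFalse:[
            ^bytesToRead
        ]
    ].
    ^bytesToRead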
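
    "
     a minimal sketch of the results, worked out by hand from the loop above;
     it assumes that isBitSet: counts bits from 1 (bit 1 is the least
     significant bit), which is what the code relies on.

     ISO10646_to_UTF8 new bytesToReadFor:16r41.      -> 1  (0xxxxxxx, ASCII)
     ISO10646_to_UTF8 new bytesToReadFor:16rC3.      -> 2  (110xxxxx)
     ISO10646_to_UTF8 new bytesToReadFor:16rE2.      -> 3  (1110xxxx)
     ISO10646_to_UTF8 new bytesToReadFor:16rF0.      -> 4  (11110xxx)
    "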
"Created: / 14-06-2005 / 17:17:24 / janfrog"
!
characterSize:charOrcodePoint
    "return the number of bytes required to encode codePoint.
     Taken from RFC 3629"

    (charOrcodePoint asInteger between:16r00000000 and:16r0000007F) ifTrue:[^1].
    (charOrcodePoint asInteger between:16r00000080 and:16r000007FF) ifTrue:[^2].
    (charOrcodePoint asInteger between:16r00000800 and:16r0000FFFF) ifTrue:[^3].
    (charOrcodePoint asInteger between:16r00010000 and:16r0010FFFF) ifTrue:[^4].
    ^self error:'Invalid codePoint'
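
    "
     a minimal sketch of the results, worked out by hand from the ranges above
     (one value per range; both characters and integers are accepted):

     ISO10646_to_UTF8 new characterSize:$a.          -> 1
     ISO10646_to_UTF8 new characterSize:16rE4.       -> 2
     ISO10646_to_UTF8 new characterSize:16r20AC.     -> 3
     ISO10646_to_UTF8 new characterSize:16r1F600.    -> 4
    "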
"Created: / 15-06-2005 / 15:16:22 / janfrog"
!
nameOfEncoding
^ #utf8
! !
!ISO10646_to_UTF8 methodsFor:'stream support'!
readNext:charactersToRead charactersFrom:stream
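    "read charactersToRead UTF8-encoded characters from stream
     and return them as a decoded string"

    | s |

    s := (String new:charactersToRead) writeStream.
    charactersToRead timesRepeat:[
        | c |

        c := stream peek.
        "/ asInteger, so that both characters and byte values are accepted
        s nextPutAll:(stream next:(self bytesToReadFor:c asInteger))
    ].
    ^ self decodeString:s contents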
"Created: / 16-06-2005 / 11:45:14 / masca"
!
readNextCharacterFrom:stream
    "read one UTF8-encoded character from stream and return it decoded"

    | c bytesYetToRead s |

    c := stream peek.
    bytesYetToRead := self bytesToReadFor:c codePoint.
    bytesYetToRead == 1 ifTrue:[
        stream next.
        ^ c.
    ].
    s := (String new:1 + bytesYetToRead) writeStream.
    s nextPutAll:(stream next: bytesYetToRead).
    s := self decodeString:s contents.
    self assert: s size == 1.
    ^ s first
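
    "
     a minimal sketch, assuming the stream delivers characters
     (as a ReadStream on an encoded String does). For the pure ASCII
     input below each character is one byte, so the two reads should
     answer $h and $i, leaving the stream at its end.

     |in|
     in := (ISO10646_to_UTF8 encodeString:'hi') readStream.
     ISO10646_to_UTF8 new readNextCharacterFrom:in.
     ISO10646_to_UTF8 new readNextCharacterFrom:in.
     in atEnd.
    "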
"Created: / 14-06-2005 / 17:03:59 / janfrog"
"Modified: / 03-10-2015 / 08:49:09 / Jan Vrany <jan.vrany@fit.cvut.cz>"
"Modified: / 29-01-2021 / 09:21:26 / Jan Vrany <jan.vrany@labware.com>"
! !
!ISO10646_to_UTF8 class methodsFor:'documentation'!
version
^ '$Header$'
!
version_CVS
^ '$Header$'
!
version_HG
^ '$Changeset: <not expanded> $'
! !