#FEATURE by cg
class: PhoneticStringUtilities::MiracodeStringComparator
comment/format in: #encode:
class: PhoneticStringUtilities::MiracodeStringComparator class
comment/format in: #documentation
class: PhoneticStringUtilities::SoundexStringComparator
comment/format in: #encode:
class: PhoneticStringUtilities::SoundexStringComparator class
comment/format in: #documentation
class: PhoneticStringUtilities::SpanishPhoneticCodeStringComparator
class definition
added:
#convertFirst:
#convertRest:
#encode:
class: PhoneticStringUtilities::SpanishPhoneticCodeStringComparator class
added:
#documentation
#examples
"{ Encoding: utf8 }"
"
COPYRIGHT (c) 1994 by Claus Gittinger
COPYRIGHT (c) 2009 by eXept Software AG
All Rights Reserved
This software is furnished under a license and may be used
only in accordance with the terms of that license and with the
inclusion of the above copyright notice. This software may not
be provided or otherwise made available to, or used by, any
other person. No title to or ownership of the software is
hereby transferred.
"
"{ Package: 'stx:libbasic2' }"
"{ NameSpace: Smalltalk }"
Object subclass:#PhoneticStringUtilities
instanceVariableNames:''
classVariableNames:''
poolDictionaries:''
category:'Collections-Text-Support'
!
Object subclass:#PhoneticStringComparator
instanceVariableNames:''
classVariableNames:''
poolDictionaries:''
privateIn:PhoneticStringUtilities
!
PhoneticStringUtilities::PhoneticStringComparator subclass:#ExtendedSoundexStringComparator
instanceVariableNames:''
classVariableNames:'CharacterTranslationDict'
poolDictionaries:''
privateIn:PhoneticStringUtilities
!
PhoneticStringUtilities::PhoneticStringComparator subclass:#SingleResultPhoneticStringComparator
instanceVariableNames:''
classVariableNames:''
poolDictionaries:''
privateIn:PhoneticStringUtilities
!
PhoneticStringUtilities::SingleResultPhoneticStringComparator subclass:#MRAStringComparator
instanceVariableNames:''
classVariableNames:'CharacterTranslationDict'
poolDictionaries:''
privateIn:PhoneticStringUtilities
!
PhoneticStringUtilities::SingleResultPhoneticStringComparator subclass:#SoundexStringComparator
instanceVariableNames:''
classVariableNames:'CharacterTranslationDict'
poolDictionaries:''
privateIn:PhoneticStringUtilities
!
PhoneticStringUtilities::SoundexStringComparator subclass:#MySQLSoundexStringComparator
instanceVariableNames:''
classVariableNames:''
poolDictionaries:''
privateIn:PhoneticStringUtilities
!
PhoneticStringUtilities::SingleResultPhoneticStringComparator subclass:#NYSIISStringComparator
instanceVariableNames:''
classVariableNames:''
poolDictionaries:''
privateIn:PhoneticStringUtilities
!
PhoneticStringUtilities::SingleResultPhoneticStringComparator subclass:#PhonemStringComparator
instanceVariableNames:''
classVariableNames:'CharacterTranslationDict'
poolDictionaries:''
privateIn:PhoneticStringUtilities
!
PhoneticStringUtilities::PhoneticStringComparator subclass:#DoubleMetaphoneStringComparator
instanceVariableNames:'inputKey primaryTranslation secondaryTranslation startIndex
currentIndex skipCount'
classVariableNames:''
poolDictionaries:''
privateIn:PhoneticStringUtilities
!
PhoneticStringUtilities::SingleResultPhoneticStringComparator subclass:#KoelnerPhoneticCodeStringComparator
instanceVariableNames:''
classVariableNames:'CharacterTranslationDict'
poolDictionaries:''
privateIn:PhoneticStringUtilities
!
PhoneticStringUtilities::SoundexStringComparator subclass:#MiracodeStringComparator
instanceVariableNames:''
classVariableNames:''
poolDictionaries:''
privateIn:PhoneticStringUtilities
!
PhoneticStringUtilities::SingleResultPhoneticStringComparator subclass:#SpanishPhoneticCodeStringComparator
instanceVariableNames:''
classVariableNames:'CharacterTranslationDict'
poolDictionaries:''
privateIn:PhoneticStringUtilities
!
!PhoneticStringUtilities class methodsFor:'documentation'!
copyright
"
COPYRIGHT (c) 1994 by Claus Gittinger
COPYRIGHT (c) 2009 by eXept Software AG
All Rights Reserved
This software is furnished under a license and may be used
only in accordance with the terms of that license and with the
inclusion of the above copyright notice. This software may not
be provided or otherwise made available to, or used by, any
other person. No title to or ownership of the software is
hereby transferred.
"
!
documentation
"
Utilities which are helpful to perform phonetic string searches or comparisons.
These are all variations or improvements of the soundex algorithm, which usually fails
to provide good results for non-english languages.
soundexCode
this algorithm was originally contained in the CharacterArray class;
nysiis
a modified soundex algorithm
miracode
another modified soundex algorithm ('american soundex') used in the 1880 census.
mySQLSoundex
another modified soundex algorithm used in mySQL.
koelner phoneticCode
provides a functionality similar to soundex, but much more tuned towards the German language
Double metaphone
works with most european languages.
phonem
described in Georg Wilde and Carsten Meyer, 'Doppelgaenger gesucht - Ein Programm fuer kontextsensitive phonetische Textumwandlung'
from 'ct Magazin fuer Computer & Technik 25/1999'.
More info for german readers is found in:
http://www.uni-koeln.de/phil-fak/phonetik/Lehre/MA-Arbeiten/magister_wilz.pdf
"
!
sampleData
"
for the 50 most common german names, we get:
ext.
name soundex soundex metaphone phonet phonet2 phonix daitsch phonem koeln
müller M460 54600000 MLR MÜLA NILA M4000000 689000 MYLR 657
schmidt S253 25300000 SKMTT SHMIT ZNIT S5300000 463000 CMYD 8628
schneider S253 25360000 SKNTR SHNEIDA ZNEITA S5300000 463900 CNAYDR 8627
fischer F260 12600000 FSKR FISHA FIZA F8000000 749000 VYCR 387
weber W160 16000000 WBR WEBA FEBA $1000000 779000 VBR 317
meyer M600 56000000 MYR MEIA NEIA M0000000 619000 MAYR 67
wagner W256 25600000 WKNR WAKNA FAKNA $2500000 756900 VACNR 367
schulz S242 24200000 SKLS SHULS ZULZ S4800000 484000 CULC 85
becker B260 12600000 BKR BEKA BEKA B2000000 759000 BCR 147
hoffmann H155 15500000 HFMN HOFMAN UFNAN $7550000 576600 OVMAN 036
schäfer S216 21600000 SKFR SHEFA ZEFA S7000000 479000 CVR 837
"
! !
!PhoneticStringUtilities class methodsFor:'phonetic codes'!
koelnerPhoneticCodeOf:aString
"return a koelner phonetic code.
The koelnerPhonetic code is for the german language what the soundex code is for english;
it returns simular strings for similar sounding words.
There are some differences to soundex, though:
its length is not limited to 4, but depends on the length of the original string;
it does not start with the first character of the input.
This algorithm is described by Postel 1969"
^ (KoelnerPhoneticCodeStringComparator new phoneticStringsFor:aString) first
"
#(
'Müller'
'Miller'
'Mueller'
'Mühler'
'Mühlherr'
'Mülherr'
'Myler'
'Millar'
'Myller'
'Müllar'
'Müler'
'Muehler'
'Mülller'
'Müllerr'
'Muehlherr'
'Muellar'
'Mueler'
'Mülleer'
'Mueller'
'Nüller'
'Nyller'
'Niler'
'Czerny'
'Tscherny'
'Czernie'
'Tschernie'
'Schernie'
'Scherny'
'Scherno'
'Czerne'
'Zerny'
'Tzernie'
'Breschnew'
) do:[:w |
Transcript show:w; show:'->'; showCR:(PhoneticStringUtilities koelnerPhoneticCodeOf:w)
].
"
"
PhoneticStringUtilities koelnerPhoneticCodeOf:'Breschnew'. '17863'.
PhoneticStringUtilities koelnerPhoneticCodeOf:'Breschneff'. '17863'.
PhoneticStringUtilities koelnerPhoneticCodeOf:'Braeschneff'. '17863'.
PhoneticStringUtilities koelnerPhoneticCodeOf:'Braessneff'. '17863'.
PhoneticStringUtilities koelnerPhoneticCodeOf:'Pressneff'. '17863'.
PhoneticStringUtilities koelnerPhoneticCodeOf:'Presznäph'. '17863'.
PhoneticStringUtilities koelnerPhoneticCodeOf:'Preschnjiev'. '17863'.
"
!
miracodeCodeOf:aString
"return a miracode soundex phonetic code or nil.
Miracode is a slightly modified soundex algorithm.
Notice that there are better algorithms around (doubleMetaphone) "
^ (MiracodeStringComparator new phoneticStringsFor:aString) first
"
PhoneticStringUtilities miracodeCodeOf:'claus'
PhoneticStringUtilities miracodeCodeOf:'clause'
PhoneticStringUtilities miracodeCodeOf:'close'
PhoneticStringUtilities miracodeCodeOf:'smalltalk'
PhoneticStringUtilities miracodeCodeOf:'smaltalk'
PhoneticStringUtilities miracodeCodeOf:'smaltak'
PhoneticStringUtilities miracodeCodeOf:'smaltok'
PhoneticStringUtilities miracodeCodeOf:'smoltok'
PhoneticStringUtilities miracodeCodeOf:'aa'
PhoneticStringUtilities miracodeCodeOf:'by'
PhoneticStringUtilities miracodeCodeOf:'bab'
PhoneticStringUtilities miracodeCodeOf:'bob'
PhoneticStringUtilities miracodeCodeOf:'bop'
PhoneticStringUtilities miracodeCodeOf:'pub'
"
"Created: / 28-07-2017 / 15:32:41 / cg"
!
mySQLSoundexCodeOf:aString
"return the mySQL soundex code. The mysql soundex coed is different from the miracode 'american' soundex
(no 4char limitation; different order of duplicate vowel vs. duplicate code elimination).
Notice that there are better algorithms around (doubleMetaphone) "
^ (MySQLSoundexStringComparator new phoneticStringsFor:aString) first
"
#(
'Müller'
'Miller'
'Mueller'
'Mühler'
'Mühlherr'
'Mülherr'
'Myler'
'Millar'
'Myller'
'Müllar'
'Müler'
'Muehler'
'Mülller'
'Müllerr'
'Muehlherr'
'Muellar'
'Mueler'
'Mülleer'
'Mueller'
'Nüller'
'Nyller'
'Niler'
'Czerny'
'Tscherny'
'Czernie'
'Tschernie'
'Schernie'
'Scherny'
'Scherno'
'Czerne'
'Zerny'
'Tzernie'
'Breschnew'
) do:[:w |
Transcript show:w; show:'->'; showCR:(PhoneticStringUtilities mySQLSoundexCodeOf:w)
].
"
"
PhoneticStringUtilities mySQLSoundexCodeOf:'Breschnew'.
PhoneticStringUtilities mySQLSoundexCodeOf:'Breschneff'.
PhoneticStringUtilities mySQLSoundexCodeOf:'Braeschneff'.
PhoneticStringUtilities mySQLSoundexCodeOf:'Braessneff'.
PhoneticStringUtilities mySQLSoundexCodeOf:'Pressneff'.
PhoneticStringUtilities mySQLSoundexCodeOf:'Presznäph'.
PhoneticStringUtilities mySQLSoundexCodeOf:'Preschnjiev'.
"
"Modified (comment): / 28-07-2017 / 15:34:03 / cg"
!
soundexCodeOf:aString
"return a soundex phonetic code or nil.
Soundex (1918, 1922) returns similar codes for similar sounding words, making it a useful
tool when searching for words where the correct spelling is unknown.
(read Knuth or search the web if you don't know what a soundex code is).
Caveat: 'similar sounding words' means: 'similar sounding in english'.
Notice that there are better algorithms around (doubleMetaphone) "
^ (SoundexStringComparator new phoneticStringsFor:aString) first
"/ old code - now use code in private class...
"/ |inStream codeStream ch last lch codeLength codes code lastCode|
"/
"/ inStream := aString readStream.
"/ inStream skipSeparators.
"/ inStream atEnd ifTrue:[
"/ ^ nil
"/ ].
"/
"/ ch := inStream next.
"/ ch isLetter ifFalse:[
"/ ^ nil
"/ ].
"/ codeLength := 0.
"/
"/ codes := Dictionary new.
"/ codes atAll:'bpfv' put:$1.
"/ codes atAll:'cskgjqxz' put:$2.
"/ codes atAll:'dt' put:$3.
"/ codes atAll:'l' put:$4.
"/ codes atAll:'nm' put:$5.
"/ codes atAll:'r' put:$6.
"/
"/ codeStream := WriteStream on:(String new:4).
"/ codeStream nextPut:(ch asUppercase).
"/ last := ch asLowercase.
"/ lastCode := codes at:last ifAbsent:nil.
"/
"/ [inStream atEnd] whileFalse:[
"/ ch := inStream next.
"/ lch := ch asLowercase.
"/ lch = last ifFalse:[
"/ last := lch.
"/
"/ code := codes at:lch ifAbsent:nil.
"/ (code notNil and:[ code ~= lastCode]) ifTrue:[
"/ codeLength < 3 ifTrue:[
"/ codeStream nextPut:code.
"/ codeLength := codeLength + 1.
"/ codeLength > 3 ifTrue:[^ codeStream contents].
"/ ].
"/ ].
"/ lastCode := code.
"/ ]
"/ ].
"/ [ codeLength < 3 ] whileTrue:[
"/ codeStream nextPut:$0.
"/ codeLength := codeLength + 1.
"/ ].
"/
"/ ^ codeStream contents
"
PhoneticStringUtilities soundexCodeOf:'claus'
PhoneticStringUtilities soundexCodeOf:'clause'
PhoneticStringUtilities soundexCodeOf:'close'
PhoneticStringUtilities soundexCodeOf:'smalltalk'
PhoneticStringUtilities soundexCodeOf:'smaltalk'
PhoneticStringUtilities soundexCodeOf:'smaltak'
PhoneticStringUtilities soundexCodeOf:'smaltok'
PhoneticStringUtilities soundexCodeOf:'smoltok'
PhoneticStringUtilities soundexCodeOf:'aa'
PhoneticStringUtilities soundexCodeOf:'by'
PhoneticStringUtilities soundexCodeOf:'bab'
PhoneticStringUtilities soundexCodeOf:'bob'
PhoneticStringUtilities soundexCodeOf:'bop'
"
"Modified (comment): / 28-07-2017 / 15:33:53 / cg"
! !
!PhoneticStringUtilities class methodsFor:'queries'!
isUtilityClass
^ self == PhoneticStringUtilities
! !
!PhoneticStringUtilities::PhoneticStringComparator class methodsFor:'constant'!
defaultClass
^SoundexStringComparator
! !
!PhoneticStringUtilities::PhoneticStringComparator class methodsFor:'documentation'!
documentation
"
abstract superclass for various phonetic comparators.
They returns similar strings for similar sounding words, which can be used
to find similar sounding words in a search list.
Notice, that some comparators are better for particular languages.
"
!
examples
"
PhoneticStringUtilities::SoundexStringComparator new
does:'miller' soundLike:'miler'.
PhoneticStringUtilities::SoundexStringComparator new
does:'miller' soundLike:'milner'.
PhoneticStringUtilities::SoundexStringComparator new
does:'müller' soundLike:'mueller'.
PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new
does:'müller' soundLike:'mueller'.
"
! !
!PhoneticStringUtilities::PhoneticStringComparator class methodsFor:'instance creation'!
new
^ self basicNew initialize.
! !
!PhoneticStringUtilities::PhoneticStringComparator class methodsFor:'queries'!
isAbstract
^ self == PhoneticStringUtilities::PhoneticStringComparator
! !
!PhoneticStringUtilities::PhoneticStringComparator methodsFor:'api'!
does:aString soundLike:anotherString
|translations1 translations2|
translations1 := self phoneticStringsFor:aString.
translations2 := self phoneticStringsFor:anotherString.
^ translations1 contains:[:t1 |
translations2 contains:[:t2 | t1 = t2]]
"
PhoneticStringUtilities::SoundexStringComparator new
does:'miller' soundLike:'miler'.
PhoneticStringUtilities::SoundexStringComparator new
does:'miller' soundLike:'milner'.
PhoneticStringUtilities::SoundexStringComparator new
does:'müller' soundLike:'mueller'.
PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new
does:'müller' soundLike:'mueller'.
"
"Modified (comment): / 13-07-2017 / 17:51:43 / cg"
!
phoneticStringsFor: aString
"Should answer an array of alternate phonetic strings for the given input string."
self subclassResponsibility
"
(PhoneticStringUtilities::SoundexStringComparator new
phoneticStringsFor:'miller') first
'miller' asSoundexCode
"
"Modified (comment): / 27-07-2017 / 15:07:59 / cg"
! !
!PhoneticStringUtilities::PhoneticStringComparator methodsFor:'initialization'!
initialize
"Invoked when a new instance is created."
"/ please change as required (and remove this comment)
"/ super initialize. -- commented since inherited method does nothing
! !
!PhoneticStringUtilities::ExtendedSoundexStringComparator class methodsFor:'documentation'!
documentation
"
There are many extended and enhanced soundex variants around;
here is one, called 'extended soundex'. It is destribed for example in
http://www.epidata.dk/documentation.php.
An author or origin is unknown.
The number of digits is increased to 5 or 8;
The first character is not used literally; instead it is encoded like the rest.
This might have a negative effect on names starting with a vovel, though.
Overall, it can be doubted if this is really an enhancement after all.
"
! !
!PhoneticStringUtilities::ExtendedSoundexStringComparator methodsFor:'api'!
phoneticStringsFor:aString
"generates both an extended soundex of length 5 and one of length 8"
|first second u t prevCode|
u := aString asUppercase.
first := second := ''.
u do:[:c |
t := self translate:c.
(t notNil and:[ t ~= '0' and:[ t ~= prevCode ]]) ifTrue:[
first := first , t.
second := second , t.
second size == 8 ifTrue:[
^ Array with:(first copyTo:5) with:second
].
].
prevCode := t
].
[ first size < 5 ] whileTrue:[
first := first , '0'.
second := second , '0'.
].
[ second size < 8 ] whileTrue:[
second := second , '0'
].
^ Array with:first with:second
"
self basicNew phoneticStringsFor:'müller' #('87900' '87900000')
self basicNew phoneticStringsFor:'miller' #('87900' '87900000')
self basicNew phoneticStringsFor:'muller' #('87900' '87900000')
self basicNew phoneticStringsFor:'muler' #('87900' '87900000')
self basicNew phoneticStringsFor:'schmidt' #('38600' '38600000')
self basicNew phoneticStringsFor:'schneider' #('38690' '38690000')
self basicNew phoneticStringsFor:'fischer' #('23900' '23900000')
self basicNew phoneticStringsFor:'weber' #('19000' '19000000')
self basicNew phoneticStringsFor:'meyer' #('89000' '89000000')
self basicNew phoneticStringsFor:'wagner' #('48900' '48900000')
self basicNew phoneticStringsFor:'schulz' #('37500' '37500000')
self basicNew phoneticStringsFor:'becker' #('13900' '13900000')
self basicNew phoneticStringsFor:'hoffmann' #('28800' '28800000')
self basicNew phoneticStringsFor:'schäfer' #('32900' '32900000')
"
! !
!PhoneticStringUtilities::ExtendedSoundexStringComparator methodsFor:'private'!
translate:aCharacter
"use simple if's for more speed when compiled"
"vowels serve as separators"
aCharacter == $A ifTrue:[^ '0' ].
aCharacter == $E ifTrue:[^ '0' ].
aCharacter == $I ifTrue:[^ '0' ].
aCharacter == $O ifTrue:[^ '0' ].
aCharacter == $U ifTrue:[^ '0' ].
aCharacter == $Y ifTrue:[^ '0' ].
aCharacter == $B ifTrue:[^ '1' ].
aCharacter == $P ifTrue:[^ '1' ].
aCharacter == $F ifTrue:[^ '2' ].
aCharacter == $V ifTrue:[^ '2' ].
aCharacter == $C ifTrue:[^ '3' ].
aCharacter == $S ifTrue:[^ '3' ].
aCharacter == $K ifTrue:[^ '3' ].
aCharacter == $G ifTrue:[^ '4' ].
aCharacter == $J ifTrue:[^ '4' ].
aCharacter == $Q ifTrue:[^ '5' ].
aCharacter == $X ifTrue:[^ '5' ].
aCharacter == $Z ifTrue:[^ '5' ].
aCharacter == $D ifTrue:[^ '6' ].
aCharacter == $G ifTrue:[^ '6' ].
aCharacter == $T ifTrue:[^ '6' ].
aCharacter == $L ifTrue:[^ '7' ].
aCharacter == $M ifTrue:[^ '8' ].
aCharacter == $N ifTrue:[^ '8' ].
aCharacter == $R ifTrue:[^ '9' ].
^ nil
! !
!PhoneticStringUtilities::SingleResultPhoneticStringComparator class methodsFor:'documentation'!
documentation
"
documentation to be added.
[author:]
cg
[instance variables:]
[class variables:]
[see also:]
"
! !
!PhoneticStringUtilities::SingleResultPhoneticStringComparator methodsFor:'api'!
encode:word
^ self subclassResponsibility
"Created: / 28-07-2017 / 15:20:49 / cg"
!
phoneticStringsFor:word
^ Array with:(self encode:word)
"Created: / 28-07-2017 / 15:20:38 / cg"
! !
!PhoneticStringUtilities::MRAStringComparator class methodsFor:'documentation'!
documentation
"
Match Rating Approach Encoder
The Western Airlines matching rating approach name encoder
[see also:]
https://en.wikipedia.org/wiki/Match_Rating_Approach
G.B. Moore, J.L. Kuhns, J.L. Treffzs, and C.A. Montgomery,
''Accessing Individual Records from Personal Data Files Using Nonunique Identifiers''
US National Institute of Standards and Technology, SP-500-2 (1977), p. 17.
"
!
rCode
"<<END
## Copyright (c) 2015, James P. Howard, II <jh@jameshoward.us>
##
## Redistribution and use in source and binary forms, with or without
## modification, are permitted provided that the following conditions are
## met:
##
## Redistributions of source code must retain the above copyright
## notice, this list of conditions and the following disclaimer.
##
## Redistributions in binary form must reproduce the above copyright
## notice, this list of conditions and the following disclaimer in
## the documentation and/or other materials provided with the
## distribution.
##
## THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
## "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
## LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
## A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
## HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
## SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
## LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
## DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
## THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
## (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
## OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#' @rdname mra
#' @title Match Rating Approach Encoder
#'
#' @description
#' The Western Airlines matching rating approach name encoder
#'
#' @param word string or vector of strings to encode
#' @param x MRA-encoded character vector
#' @param y MRA-encoded character vector
#'
#' @details
#'
#' The variable \code{word} is the name to be encoded. The variable
#' \code{maxCodeLen} is \emph{not} supported in this algorithm encoder
#' because the algorithm itself is dependent upon its six-character
#' length. The variables \code{x} and \code{y} are MRA-encoded and are
#' compared to each other using the MRA comparison specification.
#'
#' @return The \code{mra_encode} function returns match rating approach
#' encoded character vector. The \code{mra_compare} returns a boolean
#' vector which is \code{TRUE} if \code{x} and \code{y} pass the MRA
#' comparison test.
#'
#' @references
#'
#' G.B. Moore, J.L. Kuhns, J.L. Treffzs, and C.A. Montgomery,
#' \emph{Accessing Individual Records from Personal Data Files Using
#' Nonunique Identifiers,} US National Institute of Standards and
#' Technology, SP-500-2 (1977), p. 17.
#'
#' @family phonics
#'
#' @examples
#' mra_encode("William")
#' mra_encode(c("Peter", "Peady"))
#' mra_encode("Stevenson")
#' @rdname mra
#' @name mra_encode
#' @export
mra_encode <- function(word) {
## First, remove any nonalphabetical characters and uppercase it
word <- gsub("[^[:alpha:]]*", "", word)
word <- toupper(word)
## First character of key = first character of name
first <- substr(word, 1, 1)
word <- substr(word, 2, nchar(word))
## Delete vowels not at the start of the word
word <- gsub("[AEIOU]", "", word)
word <- paste(first, word, sep = "")
## Remove duplicate consecutive characters
word <- gsub("([A-Z])\\1+", "\\1", word)
## If longer than 6 characters, take first and last 3...and we have
## to vectorize it
for(i in 1:length(word)) {
if((l = nchar(word[i])) > 6) {
first <- substr(word[i], 1, 3)
last <- substr(word[i], l - 2, l)
word[i] <- paste(first, last, sep = "");
}
}
return(word)
}
#' @rdname mra
#' @name mra_compare
#' @export
mra_compare <- function(x, y) {
mra <- data.frame(x = x, y = y, sim = 0, min = 100, stringsAsFactors = FALSE)
## Obtain the minimum rating value by calculating the length sum of
## the encoded strings and using table A (from Wikipedia). We start
## by setting the minimum to be the sum and move from there.
mra$lensum <- nchar(mra$x) + nchar(mra$y)
mra$min[mra$lensum == 12] <- 2
mra$min[mra$lensum > 7 && mra$lensum <= 11] <- 3
mra$min[mra$lensum > 4 && mra$lensum <= 7] <- 4
mra$min[mra$lensum <= 4] <- 5
## If the length difference between the encoded strings is 3 or
## greater, then no similarity comparison is done. For us, we
## continue the similarity comparison out of laziness and ensure the
## minimum is impossibly high to meet.
mra$min[abs(nchar(mra$x) - nchar(mra$y)) >= 3] <- 100
## Start the comparison.
x <- strsplit(mra$x, split = "")
y <- strsplit(mra$y, split = "")
rows <- nrow(mra)
for(i in 1:rows) {
## Process the encoded strings from left to right and remove any
## identical characters found from both strings respectively.
j <- 1
while(j < min(length(x[[i]]), length(y[[i]]))) {
if(x[[i]][j] == y[[i]][j]) {
x[[i]] <- x[[i]][-j]
y[[i]] <- y[[i]][-j]
} else
j <- j + 1
}
## Process the unmatched characters from right to left and
## remove any identical characters found from both names
## respectively.
x[[i]] <- rev(x[[i]])
y[[i]] <- rev(y[[i]])
j <- 1
while(j < min(length(x[[i]]), length(y[[i]]))) {
if(x[[i]][j] == y[[i]][j]) {
x[[i]] <- x[[i]][-j]
y[[i]] <- y[[i]][-j]
} else
j <- j + 1
}
## Subtract the number of unmatched characters from 6 in the
## longer string. This is the similarity rating.
len <- min(length(x[[i]]), length(y[[i]]))
mra$sim[i] <- 6 - len
}
## If the similarity is greater than or equal to the minimum
## required, it is a successful match.
mra$match <- (mra$sim >= mra$min)
return(mra$match)
}
END>>
! !
!PhoneticStringUtilities::MRAStringComparator methodsFor:'api'!
encode:wordIn
"see https://en.wikipedia.org/wiki/Match_Rating_Approach"
|word prev|
word := wordIn.
"/ First, remove any nonalphabetical characters and uppercase it
word := word select:#isLetter thenCollect:#asUppercase.
"/ Delete vowels not at the start of the word
word := word first asString , ((word from:2) reject:#isVowel).
"/ Remove duplicate consecutive characters
prev := nil.
word := word
collect:[:char |
char == prev ifTrue:[
$*
] ifFalse:[
prev := char.
char.
].
]
thenSelect:[:char | char ~~ $*].
"/ If longer than 6 characters, take first and last 3
word size > 6 ifTrue:[
word := (word copyFirst:3),(word copyLast:3)
].
^ word.
"
self new encode:'Catherine' -> 'CTHRN'
self new encode:'CatherineCatherine' -> 'CTHHRN'
self new encode:'Butter' -> 'BTR'
self new encode:'Byrne' -> 'BYRN'
self new encode:'Boern' -> 'BRN'
self new encode:'Smith' -> 'SMTH'
self new encode:'Smyth' -> 'SMYTH'
self new encode:'Kathryn' -> 'KTHRYN'
"
"Created: / 28-07-2017 / 15:19:22 / cg"
"Modified (comment): / 31-07-2017 / 15:14:31 / cg"
! !
!PhoneticStringUtilities::SoundexStringComparator class methodsFor:'documentation'!
documentation
"
WARNING: this is the so called 'simplified soundex' algorithm;
there are more variants like miracode (american soundex) or
mysqlSoundex around.
Be sure to use the correct algorithm, if the generated strings must be compatible
(otherwise, the differences are probably too small to be noticed as effect, but
your search will be different)
The following was copied from http://www.civilsolutions.com.au/publications/dedup.htm
SOUNDEX is a phonetic coding algorithm that ignores many of the unreliable
components of names, but by doing so reports more matches.
There are some variations around in the literature;
the following is called 'simplified soundex', and the rules for coding a name are:
1. The first letter of the name is used in its un-coded form to serve as the prefix
character of the code. (The rest of the code is numerical).
2. Thereafter, W and H are ignored entirely.
3. A, E, I, 0, U, Y are not assigned a code number, but do serve as 'separators' (see Step 5).
4. Other letters of the name are converted to a numerical equivalent:
B, P, F, V 1
C, G, J, K, Q, S, X, Z 2
D, T 3
L 4
M, N 5
R 6
5. There are two exceptions:
1. Letters that follow prefix letters which would, if coded, have the same
numerical code, are ignored in all cases unless a ''separator'' (see Step 3) precedes them.
2. The second letter of any pair of consonants having the same code number is likewise ignored,
i.e. unless there is a ''separator'' between them in the name.
6. The final SOUNDEX code consists of the prefix letter plus three numerical characters.
Longer codes are truncated to this length, and shorter codes are extended to it by adding zeros.
Notice, that in another variant, w and h are treated slightly differently.
This is only of relevance, if you need to reconstruct original soundex codes of other programs
or for the original 1880 us census data.
Also notice, that soundex deals better with english.
For german and other languages, other algorithms may provide better results.
"
! !
!PhoneticStringUtilities::SoundexStringComparator methodsFor:'api'!
encode:word
|u p t prevCode|
u := word asUppercase.
p := u first asString.
prevCode := self translate:u first.
u from:2 to:u size do:[:c |
t := self translate:c.
(t notNil and:[ t ~= '0' and:[ t ~= prevCode ]]) ifTrue:[
p := p , t.
p size == 4 ifTrue:[^ p ].
].
prevCode := t
].
[ p size < 4 ] whileTrue:[
p := p , '0'
].
^ (p copyFrom:1 to:4)
"
self new encode:'washington' -> 'W252'
self new encode:'lee' -> 'L000'
self new encode:'Gutierrez' -> 'G362'
self new encode:'Pfister' -> 'P236'
self new encode:'Jackson' -> 'J250'
self new encode:'Tymczak' -> 'T522'
"
"notice:
MiracodeStringComparator new encode:'Ashcraft' -> 'A261'
self new encode:'Ashcraft' -> 'A226'
"
"Created: / 28-07-2017 / 15:21:23 / cg"
"Modified (comment): / 01-08-2017 / 19:01:43 / cg"
! !
!PhoneticStringUtilities::SoundexStringComparator methodsFor:'private'!
translate:aCharacter
"use simple if's for more speed when compiled"
"vowels serve as separators"
aCharacter == $A ifTrue:[^ '0' ].
aCharacter == $E ifTrue:[^ '0' ].
aCharacter == $I ifTrue:[^ '0' ].
aCharacter == $O ifTrue:[^ '0' ].
aCharacter == $U ifTrue:[^ '0' ].
aCharacter == $Y ifTrue:[^ '0' ].
aCharacter == $B ifTrue:[^ '1' ].
aCharacter == $P ifTrue:[^ '1' ].
aCharacter == $F ifTrue:[^ '1' ].
aCharacter == $V ifTrue:[^ '1' ].
aCharacter == $C ifTrue:[^ '2' ].
aCharacter == $S ifTrue:[^ '2' ].
aCharacter == $K ifTrue:[^ '2' ].
aCharacter == $G ifTrue:[^ '2' ].
aCharacter == $J ifTrue:[^ '2' ].
aCharacter == $Q ifTrue:[^ '2' ].
aCharacter == $X ifTrue:[^ '2' ].
aCharacter == $Z ifTrue:[^ '2' ].
aCharacter == $D ifTrue:[^ '3' ].
aCharacter == $T ifTrue:[^ '3' ].
aCharacter == $L ifTrue:[^ '4' ].
aCharacter == $M ifTrue:[^ '5' ].
aCharacter == $N ifTrue:[^ '5' ].
aCharacter == $R ifTrue:[^ '6' ].
^ nil
! !
!PhoneticStringUtilities::MySQLSoundexStringComparator class methodsFor:'documentation'!
documentation
"
MySQL soundex is like american Soundex (i.e. miracode) without the 4 character limitation,
and also removing vokals first, then removing duplicate codes
(whereas the soundex code does this in reverse order).
These variations are important, if you need the miracode soundex codes to be generated.
"
! !
!PhoneticStringUtilities::MySQLSoundexStringComparator methodsFor:'api'!
encode:word
|u p t prevCode|
u := word asUppercase.
p := u first asString.
prevCode := self translate:u first.
u from:2 to:u size do:[:c |
t := self translate:c.
(t notNil and:[ t ~= '0' and:[ t ~= prevCode ]]) ifTrue:[
p := p , t.
].
(t ~= '0' and:[ c ~= $W and:[c ~= $H]]) ifTrue:[
prevCode := t.
].
].
[ p size < 4 ] whileTrue:[
p := p , '0'
].
^ p
"Created: / 28-07-2017 / 15:23:41 / cg"
"Modified: / 31-07-2017 / 17:53:51 / cg"
! !
!PhoneticStringUtilities::NYSIISStringComparator class methodsFor:'documentation'!
documentation
"
NYSIIS Algorithm:
1.
remove all ''S'' and ''Z'' chars from the end of the surname
2.
transcode initial strings
MAC => MC
PF => F
3.
Transcode trailing strings as follows,
IX => IC
EX => EC
YE,EE,IE => Y
NT,ND => D
4.
transcode ''EV'' to ''EF'' if not at start of name
5.
use first character of name as first character of key
6.
remove any ''W'' that follows a vowel
7.
replace all vowels with ''A''
8.
transcode ''GHT'' to ''GT''
9.
transcode ''DG'' to ''G''
10.
transcode ''PH'' to ''F''
11.
if not first character, eliminate all ''H'' preceded or followed by a vowel
12.
change ''KN'' to ''N'', else ''K'' to ''C''
13.
if not first character, change ''M'' to ''N''
14.
if not first character, change ''Q'' to ''G''
15.
transcode ''SH'' to ''S''
16.
transcode ''SCH'' to ''S''
17.
transcode ''YW'' to ''Y''
18.
if not first or last character, change ''Y'' to ''A''
19.
transcode ''WR'' to ''R''
20.
if not first character, change ''Z'' to ''S''
21.
transcode terminal ''AY'' to ''Y''
22.
remove traling vowels
23.
collapse all strings of repeated characters
24.
if first char of original surname was a vowel, append it to the code
"
! !
!PhoneticStringUtilities::NYSIISStringComparator methodsFor:'api'!
encode:aString
|k|
k := self rule1:(aString asUppercase).
k := self rule2:k.
k := self rule3:k.
k := self rule4:k.
k := self rule5:k.
k := self rule6:k.
k := self rule7:k.
k := self rule8:k.
k := self rule9:k.
k := self rule10:k.
k := self rule11:k.
k := self rule12:k.
k := self rule13:k.
k := self rule14:k.
k := self rule15:k.
k := self rule16:k.
k := self rule17:k.
k := self rule18:k.
k := self rule19:k.
k := self rule20:k.
k := self rule21:k.
k := self rule22:k.
k := self rule23:k.
k := self rule24:k originalKey:aString.
^ k
"
self new encode:'hello'
self new encode:'bliss'
"
"
self new phoneticStringsFor:'hello'
self new phoneticStringsFor:'bliss'
"
"Created: / 28-07-2017 / 15:34:52 / cg"
! !
!PhoneticStringUtilities::NYSIISStringComparator methodsFor:'private'!
rule10:key
"10. transcode 'PH' to 'F' "
^ self
transcodeAll:'PH'
of:key
to:'F'
startingAt:1
!
rule11:key
|k c|
"11. if not first character, eliminate all 'H' preceded or followed by a vowel "
k := key copy.
c := SortedCollection sortBlock:[:a :b | b < a ].
2 to:key size do:[:i |
(key at:i) = $H ifTrue:[
((key at:i - 1) isVowel
or:[ (i < key size) and:[ (key at:i + 1) isVowel ] ]) ifTrue:[ c add:i ]
]
].
c do:[:n |
k := (k copyFrom:1 to:n - 1) , (k copyFrom:n + 1 to:k size)
].
^ k
!
rule12:key
|k|
"12. change 'KN' to 'N', else 'K' to 'C' "
k := self
transcodeAll:'KN'
of:key
to:'K'
startingAt:1.
k := self
transcodeAll:'K'
of:k
to:'C'
startingAt:1.
^ k
!
rule13:key
"13. if not first character, change 'M' to 'N' "
^ self
transcodeAll:'M'
of:key
to:'N'
startingAt:2
!
rule14:key
"14. if not first character, change 'Q' to 'G' "
^ self
transcodeAll:'Q'
of:key
to:'G'
startingAt:2
!
rule15:key
"15. transcode 'SH' to 'S' "
^ self
transcodeAll:'SH'
of:key
to:'S'
startingAt:1
!
rule16:key
"16. transcode 'SCH' to 'S' "
^ self
transcodeAll:'SCH'
of:key
to:'S'
startingAt:1
!
rule17:key
"17. transcode 'YW' to 'Y' "
^ self
transcodeAll:'YW'
of:key
to:'Y'
startingAt:1
!
rule18:key
|k|
"18. if not first or last character, change 'Y' to 'A' "
k := self
transcodeAll:'Y'
of:key
to:'A'
startingAt:2.
key last = $Y ifTrue:[
k at:k size put:$Y
].
^ k
!
rule19:key
"19. transcode 'WR' to 'R' "
^ self
transcodeAll:'WR'
of:key
to:'R'
startingAt:1
!
rule1:key
|k|
k := key copy.
"1. Remove all 'S' and 'Z' chars from the end of the name"
[
'SZ' includes:k last
] whileTrue:[ k := k copyFrom:1 to:(k size - 1) ].
^ k
!
rule20:key
"20. if not first character, change 'Z' to 'S' "
^ self
transcodeAll:'Z'
of:key
to:'S'
startingAt:2
!
rule21:key
"21. transcode terminal 'AY' to 'Y' "
^ self
transcodeAll:'AY'
of:key
to:'Y'
startingAt:key size - 1
!
rule22:key
|k|
"22. remove trailing vowels "
k := key copy.
[ k last isVowel ] whileTrue:[
k := k copyFrom:1 to:k size - 1
].
^ k
!
rule23:key
|k c|
"23. collapse all strings of repeated characters "
k := key copy.
c := SortedCollection sortBlock:[:a :b | b < a ].
k size to:2 do:[:i |
(k at:i) = (k at:i - 1) ifTrue:[
c add:i
]
].
c do:[:n |
k := (k copyFrom:1 to:n - 1) , (k copyFrom:n + 1 to:k size)
].
^ k
!
rule24:key originalKey:originalKey
|k|
"24. if first char of original surname was a vowel, append it to the code"
k := key copy.
originalKey first isVowel ifTrue:[
k := k , originalKey first asString asUppercase
].
^ k
!
rule2:key
|k|
k := key copy.
"2. Transcode initial strings: MAC => MC PF => F"
(k startsWith:'MAC') ifTrue:[
k := 'MC' , (k copyFrom:4)
].
(k startsWith:'PF') ifTrue:[
k := 'F' , (k copyFrom:3)
].
^ k
!
rule3:key
|k|
"3. Transcode trailing strings as follows:
IX => IC
EX => EC
YE, EE, IE => Y
NT, ND => D"
k := key copy.
k := self
transcodeTrailing:#( 'IX' )
of:k
to:'IC'.
k := self
transcodeTrailing:#( 'EX' )
of:k
to:'EC'.
k := self
transcodeTrailing:#( 'YE' 'EE' 'IE' )
of:k
to:'Y'.
k := self
transcodeTrailing:#( 'NT' 'ND' )
of:k
to:'D'.
^ k
!
rule4:key
"4. Transcode 'EV' to 'EF' if not at start of name"
^ self
transcodeAll:'EV'
of:key
to:'EF'
startingAt:2
!
rule5:key
"5. Use first character of name as first character of key. Ignored because we're doing an in-place conversion"
^ key
!
rule6:key
|k i|
"6. Remove any 'W' that follows a vowel"
k := key copy.
i := 2.
[
(i := k indexOf:$W startingAt:i) > 0
] whileTrue:[
(k at:i - 1) isVowel ifTrue:[
k := (k copyFrom:1 to:i - 1) , (k copyFrom:i + 1 to:k size).
i := i - 1
]
].
^ k
!
rule7:key
|k|
"7. replace all vowels with 'A' "
k := key copy.
1 to:key size do:[:i |
(key at:i) isVowel ifTrue:[
k at:i put:$A
]
].
^ k
!
rule8:key
"8. transcode 'GHT' to 'GT' "
^ self
transcodeAll:'GHT'
of:key
to:'GT'
startingAt:1
!
rule9:key
"9. transcode 'DG' to 'G' "
^ self
transcodeAll:'DG'
of:key
to:'G'
startingAt:1
!
transcodeAll:aString of:key to:replacementString startingAt:start
|k i|
k := key copy.
[
(i := k indexOfSubCollection:aString startingAt:start) > 0
] whileTrue:[
k := (k copyFrom:1 to:i - 1) , replacementString
, (k copyFrom:i + aString size to:k size)
].
^ k
!
transcodeTrailing:anArrayOfStrings of:key to:replacementString
|answer|
answer := key copy.
anArrayOfStrings do:[:aString |
answer := self
transcodeAll:aString
of:answer
to:replacementString
startingAt:(answer size - aString size) + 1
].
^ answer
! !
!PhoneticStringUtilities::PhonemStringComparator class methodsFor:'documentation'!
documentation
"
Implementation of the PHONEM algorithm, as described in
'Georg Wilde and Carsten Meyer, Doppelgaenger gesucht -
Ein Programm fuer kontextsensitive phonetische Textumwandlung
ct Magazin fuer Computer & Technik 25/1998'
This algorithm deals better with the german language (it cares for umlauts)
"
! !
!PhoneticStringUtilities::PhonemStringComparator methodsFor:'api'!
encode:aString
|s idx t t2|
s := aString asUppercase.
idx := 1.
[idx < (s size-1)] whileTrue:[
t2 := nil.
t := s copyFrom:idx to:idx+1.
t = 'SC' ifTrue:[ t2 := 'C' ]
ifFalse:[ t = 'SZ' ifTrue:[ t2 := 'C' ]
ifFalse:[ t = 'CZ' ifTrue:[ t2 := 'C' ]
ifFalse:[ t = 'TZ' ifTrue:[ t2 := 'C' ]
ifFalse:[ t = 'TS' ifTrue:[ t2 := 'C' ]
ifFalse:[ t = 'KS' ifTrue:[ t2 := 'X' ]
ifFalse:[ t = 'PF' ifTrue:[ t2 := 'V' ]
ifFalse:[ t = 'QU' ifTrue:[ t2 := 'KW' ]
ifFalse:[ t = 'PH' ifTrue:[ t2 := 'V' ]
ifFalse:[ t = 'UE' ifTrue:[ t2 := 'Y' ]
ifFalse:[ t = 'AE' ifTrue:[ t2 := 'E' ]
ifFalse:[ t = 'OE' ifTrue:[ t2 := 'Ö' ]
ifFalse:[ t = 'EI' ifTrue:[ t2 := 'AY' ]
ifFalse:[ t = 'EY' ifTrue:[ t2 := 'AY' ]
ifFalse:[ t = 'EU' ifTrue:[ t2 := 'OY' ]
ifFalse:[ t = 'AU' ifTrue:[ t2 := 'A§' ]
ifFalse:[ t = 'OU' ifTrue:[ t2 := '§ ' ]]]]]]]]]]]]]]]]].
t2 notNil ifTrue:[
s := (s copyTo:idx-1),t2,(s copyFrom:idx+2)
] ifFalse:[
idx := idx + 1.
].
].
"/ single character substitutions via tr
s := s copyTransliterating:'ÖÄZKGQÜIJFWPT§' to:'YECCCCYYYVVDDUA'.
s := s copyTransliterating:'ABCDLMNORSUVWXY' to:'' complement:true squashDuplicates:false.
s := s copyTransliterating:'ABCDLMNORSUVWXY' to:'ABCDLMNORSUVWXY' complement:false squashDuplicates:true.
^ s
"
self basicNew encode:'müller' -> 'MYLR'
self basicNew encode:'mueller' -> 'MYLR'
self basicNew encode:'möller' -> 'MYLR'
self basicNew encode:'miller' -> 'MYLR'
self basicNew encode:'muller' -> 'MULR'
self basicNew encode:'muler' -> 'MULR'
self basicNew phoneticStringsFor:'müller' #('MYLR')
self basicNew phoneticStringsFor:'mueller' #('MYLR')
self basicNew phoneticStringsFor:'möller' #('MYLR')
self basicNew phoneticStringsFor:'miller' #('MYLR')
self basicNew phoneticStringsFor:'muller' #('MULR')
self basicNew phoneticStringsFor:'muler' #('MULR')
self basicNew phoneticStringsFor:'schmidt' #('CMYD')
self basicNew phoneticStringsFor:'schneider' #('CNAYDR')
self basicNew phoneticStringsFor:'fischer' #('VYCR')
self basicNew phoneticStringsFor:'weber' #('VBR')
self basicNew phoneticStringsFor:'weeber' #('VBR')
self basicNew phoneticStringsFor:'webber' #('VBR')
self basicNew phoneticStringsFor:'wepper' #('VBR')
self basicNew phoneticStringsFor:'meyer' #('MAYR')
self basicNew phoneticStringsFor:'maier' #('MAYR')
self basicNew phoneticStringsFor:'mayer' #('MAYR')
self basicNew phoneticStringsFor:'mayr' #('MAYR')
self basicNew phoneticStringsFor:'meir' #('MAYR')
self basicNew phoneticStringsFor:'wagner' #('VACNR')
self basicNew phoneticStringsFor:'schulz' #('CULC')
self basicNew phoneticStringsFor:'becker' #('BCR')
self basicNew phoneticStringsFor:'hoffmann' #('OVMAN')
self basicNew phoneticStringsFor:'haus' #('AUS')
self basicNew phoneticStringsFor:'schäfer' #('CVR')
self basicNew phoneticStringsFor:'scheffer' #('CVR')
self basicNew phoneticStringsFor:'schaeffer' #('CVR')
self basicNew phoneticStringsFor:'schaefer' #('CVR')
"
"Created: / 28-07-2017 / 15:38:08 / cg"
! !
!PhoneticStringUtilities::DoubleMetaphoneStringComparator class methodsFor:'LICENSE'!
copyright
"
Copyright (c) 2002-2004 Robert Jarvis
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation
files (the 'Software'), to deal in the Software without restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom
the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial
portions of the Software.
THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE
USE OR OTHER DEALINGS IN THE SOFTWARE.'
"
! !
!PhoneticStringUtilities::DoubleMetaphoneStringComparator class methodsFor:'classification'!
isSlavoGermanic:aString
^ #('w' 'k' 'cz' 'witz' 'ä' 'ö' 'ü' 'ß') contains:[:sub | aString includesString:sub]
"
self isSlavoGermanic:'walter'
self isSlavoGermanic:'horowitz'
self isSlavoGermanic:'müller'
self isSlavoGermanic:'miller'
"
"Modified: / 28-07-2017 / 10:14:38 / cg"
! !
!PhoneticStringUtilities::DoubleMetaphoneStringComparator class methodsFor:'documentation'!
documentation
"
The Double Metaphone algorithm
see internet: https://en.wikipedia.org/wiki/Metaphone
"
! !
!PhoneticStringUtilities::DoubleMetaphoneStringComparator methodsFor:'accessing'!
currentIndex
^currentIndex
!
currentIndex: anInteger
currentIndex := anInteger
!
inputKey
^inputKey
!
inputKey: aString
inputKey := aString asUppercase
!
primaryTranslation
^primaryTranslation
!
primaryTranslation: anObject
primaryTranslation := anObject
!
secondaryTranslation
^secondaryTranslation
!
secondaryTranslation: anObject
secondaryTranslation := anObject
!
skipCount
^skipCount
!
skipCount: anInteger
skipCount := anInteger
!
startIndex
^startIndex
!
startIndex: anObject
startIndex := anObject
! !
!PhoneticStringUtilities::DoubleMetaphoneStringComparator methodsFor:'api'!
phoneticStringsFor:aString
"Private - Answers an array of alternate phonetic strings for the given input string."
inputKey := aString.
self performInitialProcessing.
self processRemainingCharacters.
^ Array with:primaryTranslation with:secondaryTranslation
"Modified (format): / 28-07-2017 / 11:25:02 / cg"
! !
!PhoneticStringUtilities::DoubleMetaphoneStringComparator methodsFor:'initialization'!
initialize
super initialize.
startIndex := 1.
primaryTranslation := ''.
secondaryTranslation := ''.
skipCount := 0.
currentIndex := 1.
"Modified: / 28-07-2017 / 11:18:44 / cg"
! !
!PhoneticStringUtilities::DoubleMetaphoneStringComparator methodsFor:'private'!
addPrimaryTranslation:aString
primaryTranslation := (primaryTranslation , aString)
"Modified: / 28-07-2017 / 11:19:09 / cg"
!
addSecondaryTranslation:aString
secondaryTranslation := secondaryTranslation , aString
"Modified: / 28-07-2017 / 11:17:11 / cg"
!
isSlavoGermanic: aString
^((aString includesAnyOf: 'WK') or:
[ (aString indexOfSubCollection: 'CZ' startingAt: 1) >= 1 ]) or:
[ (aString indexOfSubCollection: 'WITZ' startingAt: 1) >= 1 ]
!
keyAt: anInteger
(anInteger between:1 and:inputKey size) ifTrue: [
^ inputKey at: anInteger
].
^ Character space
"Modified: / 28-07-2017 / 11:38:30 / cg"
!
keyLeftString: lengthInteger
^self keyMidString: lengthInteger from: 1
!
keyMidString: lengthInteger from: fromInteger
| result from len additionalSpaces |
result := ''.
from := fromInteger.
len := lengthInteger.
"Prepend spaces if caller is requesting characters from before the start of the string"
[ from < 1 ] whileTrue:
[ result := result, ' '.
from := from + 1.
len := len - 1 ].
from + len - 1 > inputKey size
ifTrue:
[ additionalSpaces := from + len - 1 - inputKey size.
len := inputKey size - from + 1 ]
ifFalse: [ additionalSpaces := 0 ].
result := result, (inputKey copyFrom: from to: (from+len-1 min: inputKey size)).
[ additionalSpaces > 0 ] whileTrue:
[ result := result, ' '.
additionalSpaces := additionalSpaces - 1 ].
^result
"Modified: / 28-07-2017 / 11:20:43 / cg"
!
keyRightString: lengthInteger
^self keyMidString: lengthInteger from: inputKey size - lengthInteger + 1
"Modified: / 28-07-2017 / 11:20:51 / cg"
!
performInitialProcessing
(#( 'GN' 'KN' 'PN' 'WR' 'PS' ) includes:(inputKey copyFrom:1 to:2)) ifTrue:[
startIndex := startIndex + 1
].
(self keyAt:1) = $X ifTrue:[
self
addPrimaryTranslation:'S';
addSecondaryTranslation:'S'.
startIndex := startIndex + 1
].
(self keyAt:1) isVowel ifTrue:[
self
addPrimaryTranslation:'A';
addSecondaryTranslation:'A'.
startIndex := startIndex + 1
]
"Modified: / 28-07-2017 / 11:36:31 / cg"
!
processB
self
addPrimaryTranslation: 'P';
addSecondaryTranslation: 'P'.
(self keyAt: (currentIndex + 1)) == $B ifTrue: [
skipCount := skipCount + 1
].
"Modified: / 28-07-2017 / 11:26:03 / cg"
!
processC
"i"
((((currentIndex >= 3
and: [ (self keyAt: currentIndex-2) isVowel not ])
and: [ (self keyMidString: 3 from: currentIndex-1) = 'ACH' ])
and: [ (self keyAt: currentIndex+2) ~= $I ])
and: [ ((self keyAt: currentIndex+2) ~= $E)
or: [ (self keyMidString: 6 from: currentIndex-2) ~= 'BACHER'
and: [ (self keyMidString: 6 from: currentIndex-2) ~= 'MACHER' ] ] ])
ifTrue:
[ self addPrimaryTranslation: 'K'.
self addSecondaryTranslation: 'K'.
skipCount := skipCount + 2.
^self ].
"ii"
(inputKey beginsWith: 'CAESAR')
ifTrue:
[ self addPrimaryTranslation: 'S'.
self addSecondaryTranslation: 'S'.
skipCount := skipCount + 1.
^self ].
"iii"
(self keyMidString: 4 from: currentIndex) = 'CHIA'
ifTrue:
[ self addPrimaryTranslation: 'K'.
self addSecondaryTranslation: 'K'.
skipCount := skipCount + 1.
^self ].
"iv"
(self keyMidString: 2 from: currentIndex) = 'CH'
ifTrue:
[ (currentIndex > 1 "a"
and: [ (self keyMidString: 4 from: currentIndex) = 'CHAE' ])
ifTrue: [ self
addPrimaryTranslation: 'K';
addSecondaryTranslation: 'X'.
skipCount := skipCount + 1.
^self ].
(currentIndex = 1 "b"
and: [ (inputKey size > 5 and: [(inputKey copyFrom: 1 to: 6) = 'CHARAC'
or: [ (inputKey copyFrom: 1 to: 6) = 'CHARIS' ]] )
or: [inputKey size > 4 and: [ ((((inputKey copyFrom: 1 to: 4) = 'CHOR'
or: [ (inputKey copyFrom: 1 to: 4) = 'CHYM' ])
or: [ (inputKey copyFrom: 1 to: 4) = 'CHIA' ])
or: [ (inputKey copyFrom: 1 to: 4) = 'CHEM' ])
and: [ (inputKey copyFrom: 1 to: 4) ~= 'CHORE' ] ] ] ])
ifTrue: [ self
addPrimaryTranslation: 'K';
addSecondaryTranslation: 'K'.
skipCount := skipCount + 1.
^self ].
(((((#('VAN ' 'VON ') includes: (inputKey copyFrom: 1 to: 4)) "c"
or: [ (inputKey copyFrom: 1 to: 3) = 'SCH' ])
or: [ #('ORCHES' 'ARCHIT' 'ORCHID')
includes: (self keyMidString: 6 from: currentIndex-2) ])
or: [ #($T $S) includes: (self keyAt: currentIndex+2) ])
or: [ ((currentIndex = 1)
or: [ #($A $O $U $E) includes: (self keyAt: currentIndex-1) ])
and: [ #($L $R $N $M $B $H $F $V $W $ ) includes: (self keyAt: currentIndex+2) ] ] )
ifTrue:
[ self
addPrimaryTranslation: 'K';
addSecondaryTranslation: 'K'.
skipCount := skipCount + 1.
^self ]
ifFalse:
[ currentIndex > 1
ifTrue:
[ (inputKey copyFrom: 1 to: 2) = 'MC'
ifTrue:
[ self
addPrimaryTranslation: 'K';
addSecondaryTranslation: 'K' ]
ifFalse:
[ self
addPrimaryTranslation: 'X';
addSecondaryTranslation: 'K' ] ]
ifFalse:
[ self
addPrimaryTranslation: 'X';
addSecondaryTranslation: 'X' ].
skipCount := skipCount + 1.
^self ] ].
"v"
(self keyAt: currentIndex+1) = $Z
ifTrue:
[ self
addPrimaryTranslation: 'S';
addSecondaryTranslation: 'X'.
skipCount := skipCount + 1.
^self ].
"vi"
(self keyMidString: 3 from: currentIndex+1) = 'CIA'
ifTrue:
[ self
addPrimaryTranslation: 'X';
addSecondaryTranslation: 'X'.
skipCount := skipCount + 2.
^self ].
"vii"
((self keyAt: currentIndex+1) = $C
and: [ ((currentIndex = 2)
and: [ (self keyAt: 1) = $M ]) not ])
ifTrue:
[ ((#($I $E $H) includes: (self keyAt: currentIndex+2))
and: [ (self keyMidString: 2 from: currentIndex+2) ~= 'HU' ])
ifTrue:
[ ((currentIndex = 2 and: [ (self keyAt: 1) = $A ])
or: [ #('UCCEE' 'UCCES') includes: (self keyMidString: 5 from: currentIndex-1)])
ifTrue:
[self
addPrimaryTranslation: 'KS';
addSecondaryTranslation: 'KS'.
skipCount := skipCount + 2.
^self ]
ifFalse:
[self
addPrimaryTranslation: 'X';
addSecondaryTranslation: 'X'.
skipCount := skipCount + 2.
^self ] ]
ifFalse:
[ self
addPrimaryTranslation: 'K';
addSecondaryTranslation: 'K'.
skipCount := skipCount + 2.
^self ] ].
"viii"
(#($K $G $Q) includes: (self keyAt: currentIndex+1))
ifTrue:
[ self
addPrimaryTranslation: 'K';
addSecondaryTranslation: 'K'.
skipCount := skipCount + 1.
^self ].
"ix"
(#($I $E $Y) includes: (self keyAt: currentIndex+1))
ifTrue:
[ (#('CIO' 'CIE' 'CIA') includes: (self keyMidString: 3 from: currentIndex))
ifTrue:
[self
addPrimaryTranslation: 'S';
addSecondaryTranslation: 'X' ]
ifFalse:
[self
addPrimaryTranslation: 'S';
addSecondaryTranslation: 'S'].
skipCount := skipCount + 1.
^self ].
"x"
self
addPrimaryTranslation: 'K';
addSecondaryTranslation: 'K'.
"xi"
(#(' C' ' Q' ' G') includes: (self keyMidString: 2 from: currentIndex+1))
ifTrue:
[ skipCount := skipCount + 2 ]
ifFalse:
[ ((#($C $K $Q) includes: (self keyAt: currentIndex+1))
and: [ (#('CE' 'CI') includes: (self keyMidString: 2 from: currentIndex+1)) not ])
ifTrue: [ skipCount := skipCount + 1] ]
"Modified: / 28-07-2017 / 11:29:11 / cg"
!
processCedille
self
addPrimaryTranslation: 'S';
addSecondaryTranslation: 'S'
!
processD
"i"
(self keyAt: currentIndex+1) = $G
ifTrue:
[ (#($I $E $Y) includes: (self keyAt: currentIndex+2))
ifTrue:
[ self
addPrimaryTranslation: 'J';
addSecondaryTranslation: 'J'.
skipCount := skipCount + 2.
^self ]
ifFalse:
[ self
addPrimaryTranslation: 'TK';
addSecondaryTranslation: 'TK'.
skipCount := skipCount + 1.
^self ] ].
"ii"
(#($T $D) includes: (self keyAt: currentIndex+1))
ifTrue:
[ self
addPrimaryTranslation: 'T';
addSecondaryTranslation: 'T'.
skipCount := skipCount + 1.
^self ].
"iii"
self
addPrimaryTranslation: 'T';
addSecondaryTranslation: 'T'
"Modified: / 28-07-2017 / 11:27:39 / cg"
!
processF
self
addPrimaryTranslation: 'F';
addSecondaryTranslation: 'F'.
(self keyAt: currentIndex+1) = $F
ifTrue: [ skipCount := skipCount + 1 ]
"Modified (format): / 28-07-2017 / 11:29:21 / cg"
!
processG
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp
case 'G':
if(GetAt(current + 1) == 'H')
{"
| word |
(self keyAt: currentIndex + 1) = $H
ifTrue: [
"if((current > 0) AND !!IsVowel(current - 1))"
(currentIndex > 1 and: [(self keyAt: currentIndex - 1) isVowel not])
ifTrue: [
" {
MetaphAdd(K);
current += 2;
break;
}"
self
addPrimaryTranslation: 'K';
addSecondaryTranslation: 'K'.
skipCount := skipCount + 1.
^self
].
"if(current < 3)
{"
currentIndex < 4
ifTrue: [
" //'ghislane', ghiradelli
if(current == 0)
{ "
currentIndex = 1
ifTrue: [
"if(GetAt(current + 2) == 'I')"
(self keyAt: currentIndex + 2) = $I
ifTrue: [
"MetaphAdd(J);"
self addPrimaryTranslation: 'J';
addSecondaryTranslation: 'J'.
] ifFalse: [
"MetaphAdd(K);"
self addPrimaryTranslation: 'K';
addSecondaryTranslation: 'K'.
].
" current += 2;
break;"
skipCount := skipCount + 1.
^self
]
].
" //Parker's rule (with some further refinements) - e.g., 'hugh'
if(((current > 1) AND StringAt((current - 2), 1, B, H, D, ) )
//e.g., 'bough'
OR ((current > 2) AND StringAt((current - 3), 1, B, H, D, ) )
//e.g., 'broughton'
OR ((current > 3) AND StringAt((current - 4), 1, B, H, ) ) )
"
(((currentIndex > 2 and: [#($B $H $D) includes: (self keyAt: currentIndex - 2)])
or: [currentIndex > 3 and: [#($B $H $D) includes: (self keyAt: currentIndex - 3)]])
or: [currentIndex > 4 and: [#($B $H) includes: (self keyAt: currentIndex - 4)]])
ifTrue: [
"current += 2;
break;"
skipCount := skipCount + 1.
^self
] ifFalse: [
" //e.g., 'laugh', 'McLaughlin', 'cough', 'gough', 'rough', 'tough'
if((current > 2)
AND (GetAt(current - 1) == 'U')
AND StringAt((current - 3), 1, C, G, L, R, T, ) )"
(currentIndex > 3 and: [
((self keyAt: currentIndex - 1) = $U) and: [
#($C $G $L $R $T) includes: (self keyAt: currentIndex - 3)
]
]) ifTrue: [
"MetaphAdd(F);"
self addPrimaryTranslation: 'F';
addSecondaryTranslation: 'F'.
] ifFalse: [
" if((current > 0) AND GetAt(current - 1) !!= 'I')
MetaphAdd(K);"
(currentIndex > 1 and: [(self keyAt: currentIndex - 1) ~= $I])
ifTrue: [
self addPrimaryTranslation: 'K';
addSecondaryTranslation: 'K'.
].
].
skipCount := skipCount + 1.
^self
].
].
"if(GetAt(current + 1) == 'N')"
(self keyAt: currentIndex + 1) = $N
ifTrue: [
"if((current == 1) AND IsVowel(0) AND !!SlavoGermanic())"
(currentIndex = 2 and: [(inputKey at: 1) isVowel and: [(self isSlavoGermanic: inputKey) not]])
ifTrue: [
"MetaphAdd(KN, N);"
self addPrimaryTranslation: 'KN';
addSecondaryTranslation: 'N'.
] ifFalse: [
" //not e.g. 'cagney'
if(!!StringAt((current + 2), 2, EY, )
AND (GetAt(current + 1) !!= 'Y')
AND !!SlavoGermanic())"
((inputKey size >= (currentIndex + 2)) and: [
(inputKey copyFrom: currentIndex + 2 to: (currentIndex + 4 min: inputKey size)) ~= 'EY' and: [
(self keyAt: currentIndex + 1) ~= $Y and: [
(self isSlavoGermanic: inputKey) not
]
]
]) ifTrue: [
self addPrimaryTranslation: 'N';
addSecondaryTranslation: 'KN'.
] ifFalse: [
self addPrimaryTranslation: 'KN';
addSecondaryTranslation: 'KN'.
].
].
skipCount := skipCount + 1.
^self
].
" //'tagliaro'
if(StringAt((current + 1), 2, LI, ) AND !!SlavoGermanic())"
((inputKey size >= (currentIndex + 3)) and: [
(inputKey copyFrom: currentIndex + 1 to: currentIndex + 2) = 'LI' and: [
(self isSlavoGermanic: inputKey) not]])
ifTrue: [
self addPrimaryTranslation: 'KL';
addSecondaryTranslation: 'L'.
skipCount := skipCount + 1.
^self.
].
" //-ges-,-gep-,-gel-, -gie- at beginning
if((current == 0)
AND ((GetAt(current + 1) == 'Y')
OR StringAt((current + 1), 2, ES, EP, EB, EL, EY, IB, IL, IN, IE, EI, ER, )) )"
(currentIndex = 1 and: [
((self keyAt: currentIndex + 1) = $Y) or: [
(#('ES' 'EP' 'EB' 'EL' 'EY' 'IB' 'IL' 'IN' 'IE' 'EI' 'ER') includes:
(inputKey copyFrom: currentIndex + 1 to: currentIndex + 2))
]]) ifTrue: [
self addPrimaryTranslation: 'K';
addSecondaryTranslation: 'J'.
skipCount := skipCount + 1.
^self.
].
" // -ger-, -gy-
if((StringAt((current + 1), 2, ER, ) OR (GetAt(current + 1) == 'Y'))
AND !!StringAt(0, 6, DANGER, RANGER, MANGER, )
AND !!StringAt((current - 1), 1, E, I, )
AND !!StringAt((current - 1), 3, RGY, OGY, ) )
"
(((inputKey copyFrom: currentIndex + 1 to: (currentIndex + 3 min: inputKey size)) = 'ER' or: [
((self keyAt: currentIndex + 1) = $Y)])
and: [((#('DANGER' 'RANGER' 'MANGER') includes: (word := inputKey copyFrom: 1 to: (6 min: inputKey size))) not)
and: [(self keyAt: currentIndex - 1) ~= $E
and: [(#('RGY' 'OGY') includes: (inputKey copyFrom: currentIndex - 1 to: currentIndex + 1)) not]]])
ifTrue: [
self addPrimaryTranslation: 'K';
addSecondaryTranslation: 'J'.
skipCount := skipCount + 1.
^self.
].
" // italian e.g, 'biaggi'
if(StringAt((current + 1), 1, E, I, Y, ) OR StringAt((current - 1), 4, AGGI, OGGI, ))
"
((#($E $I $Y) includes: (self keyAt: (currentIndex + 1))) or: [(#('AGGI' 'OGGI') includes: (inputKey copyFrom: currentIndex - 1 to: (currentIndex + 2 min: inputKey size)))])
ifTrue: [
" //obvious germanic
if((StringAt(0, 4, VAN , VON , ) OR StringAt(0, 3, SCH, ))
OR StringAt((current + 1), 2, ET, )) MetaphAdd(K);"
word := (inputKey copyFrom: 1 to: 4).
((#('VAN ' 'VON ') includes: word) or: [(word copyFrom: 1 to: 3) = 'SCH' or: [(word copyFrom: 1 to: 2) = 'ET']])
ifTrue: [
self addPrimaryTranslation: 'K';
addSecondaryTranslation: 'K'.
] ifFalse: [
" //always soft if french ending
if(StringAt((current + 1), 4, IER , ))
MetaphAdd(J);
else
MetaphAdd(J, K);
current += 2;
break;"
(((inputKey copyFrom: currentIndex + 1 to: (currentIndex + 5 min: inputKey size)), ' ') copyFrom: 1 to: 4) = 'IER '
ifTrue: [
self addPrimaryTranslation: 'J';
addSecondaryTranslation: 'J'.
] ifFalse: [
self addPrimaryTranslation: 'J';
addSecondaryTranslation: 'K'.
].
].
skipCount := skipCount + 1.
^self.
].
" if(GetAt(current + 1) == 'G')
current += 2;
else
current += 1;
MetaphAdd(K);
break;"
(self keyAt: (currentIndex + 1)) = $G
ifTrue: [
skipCount := skipCount + 1.
].
self addPrimaryTranslation: 'K';
addSecondaryTranslation: 'K'.
"Modified: / 28-07-2017 / 11:31:33 / cg"
!
processH
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp
case 'H':
//only keep if first & before vowel or btw. 2 vowels
if(((current == 0) OR IsVowel(current - 1))
AND IsVowel(current + 1))
{
MetaphAdd(H);
current += 2;
}else//also takes care of 'HH'
current += 1;
break;
"
(((currentIndex = 1)
or: [ (self keyAt: currentIndex - 1) isVowel])
and: [(self keyAt: currentIndex + 1) isVowel])
ifTrue: [
self addPrimaryTranslation: 'H';
addSecondaryTranslation: 'H'.
skipCount := skipCount + 1.
^self.
]
"Modified: / 28-07-2017 / 11:29:52 / cg"
!
processJ
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp
case 'J':
//obvious spanish, 'jose', 'san jacinto'
if(StringAt(current, 4, JOSE, ) OR StringAt(0, 4, SAN , ) )
{
if(((current == 0) AND (GetAt(current + 4) == ' ')) OR StringAt(0, 4, SAN , ) )
MetaphAdd(H);
else
{
MetaphAdd(J, H);
}
current +=1;
break;
}
if((current == 0) AND !!StringAt(current, 4, JOSE, ))
MetaphAdd(J, A);//Yankelovich/Jankelowicz
else
//spanish pron. of e.g. 'bajador'
if(IsVowel(current - 1)
AND !!SlavoGermanic()
AND ((GetAt(current + 1) == 'A') OR (GetAt(current + 1) == 'O')))
MetaphAdd(J, H);
else
if(current == last)
MetaphAdd(J, );
else
if(!!StringAt((current + 1), 1, L, T, K, S, N, M, B, Z, )
AND !!StringAt((current - 1), 1, S, K, L, ))
MetaphAdd(J);
if(GetAt(current + 1) == 'J')//it could happen!!
current += 2;
else
current += 1;
break;
"
| currentWord firstWord nextLetter |
currentWord := inputKey copyFrom: currentIndex to: (currentIndex + 3 min: inputKey size).
firstWord := inputKey copyFrom: 1 to: (4 min: inputKey size).
nextLetter := self keyAt: currentIndex + 1.
(currentWord = 'JOSE' or: [firstWord = 'SAN '])
ifTrue: [
((currentIndex = 1 and: [inputKey size = 4 or: [inputKey size >= 5 and: [self keyAt: currentIndex + 4 = $ ]]])
or: [firstWord = 'SAN '])
ifTrue: [
self addPrimaryTranslation: 'H';
addSecondaryTranslation: 'H'.
] ifFalse: [
self addPrimaryTranslation: 'J';
addSecondaryTranslation: 'H'.
].
^self.
].
(currentIndex = 1 and: [firstWord ~= 'JOSE'])
ifTrue: [
self addPrimaryTranslation: 'J';
addSecondaryTranslation: 'A'.
] ifFalse: [
((currentIndex > 1 and: [(self keyAt: currentIndex -1) isVowel])
and: [(self isSlavoGermanic: inputKey) not and: [nextLetter == $A or: [nextLetter == $O]]])
ifTrue: [
self addPrimaryTranslation: 'J';
addSecondaryTranslation: 'H'.
] ifFalse: [
currentIndex = inputKey size
ifTrue: [
self addPrimaryTranslation: 'J';
addSecondaryTranslation: ' '.
] ifFalse: [
((#($L $T $K $S $N $M $B $Z) includes: nextLetter) not and: [(#($S $K $L) includes: (self keyAt: currentIndex - 1)) not])
ifTrue: [
self addPrimaryTranslation: 'J';
addSecondaryTranslation: 'J'.
].
].
].
].
nextLetter == $J
ifTrue: [
skipCount := skipCount + 1.
].
"Modified: / 28-07-2017 / 11:31:41 / cg"
!
processK
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp
case 'K':
if(GetAt(current + 1) == 'K')
current += 2;
else
current += 1;
MetaphAdd(K);
break;
"
(self keyAt: currentIndex + 1) = $K
ifTrue: [
skipCount := skipCount + 1
].
self addPrimaryTranslation: 'K';
addSecondaryTranslation: 'K'.
"Modified: / 28-07-2017 / 11:31:46 / cg"
!
processL
"case 'L':
if(GetAt(current + 1) == 'L')
{
//spanish e.g. 'cabrillo', 'gallegos'
if(((current == (length - 3))
AND StringAt((current - 1), 4, ILLO, ILLA, ALLE, ))
OR ((StringAt((last - 1), 2, AS, OS, ) OR StringAt(last, 1, A, O, ))
AND StringAt((current - 1), 4, ALLE, )) )
{
MetaphAdd(L, );
current += 2;
break;
}
current += 2;
}else
current += 1;
MetaphAdd(L);
break;
"
| currentWord |
(self keyAt: currentIndex + 1) = $L
ifTrue: [
(((currentIndex = (inputKey size - 2))
and: [(currentIndex > 1 and: [#('ILLO' 'ILLA' 'ALLE') includes: (currentWord := inputKey copyFrom: currentIndex - 1 to: (currentIndex + 2 min: inputKey size))])])
or: [((#('AS' 'OS') includes: (inputKey copyFrom: inputKey size - 1 to: inputKey size)) or: [#($A $O) includes: (self keyAt: inputKey size)]) and: [currentWord = 'ALLE']
])
ifTrue: [
self addPrimaryTranslation: 'L';
addSecondaryTranslation: ' '.
skipCount := skipCount + 1.
^self.
].
skipCount := skipCount + 1.
].
self addPrimaryTranslation: 'L';
addSecondaryTranslation: 'L'.
"Modified: / 28-07-2017 / 11:32:03 / cg"
!
processM
"case 'M':
if((StringAt((current - 1), 3, UMB, )
AND (((current + 1) == last) OR StringAt((current + 2), 2, ER, )))
//'dumb','thumb'
OR (GetAt(current + 1) == 'M') )
current += 2;
else
current += 1;
MetaphAdd(M);
break;
"
(((currentIndex > 1 and: [(inputKey copyFrom: currentIndex - 1 to: (currentIndex +1 min: inputKey size)) = 'UMB'])
and: [currentIndex + 1 = inputKey size or: [(inputKey copyFrom: (currentIndex + 2 min: inputKey size) to: (currentIndex + 4 min: inputKey size)) = 'ER']])
or: [(self keyAt: currentIndex + 1) = $M])
ifTrue: [
skipCount := skipCount + 1.
].
self addPrimaryTranslation: 'M';
addSecondaryTranslation: 'M'.
"Modified: / 28-07-2017 / 11:32:08 / cg"
!
processN
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp
case 'N':
if(GetAt(current + 1) == 'N')
current += 2;
else
current += 1;
MetaphAdd(N);
break;
"
(self keyAt: currentIndex + 1) = $N
ifTrue: [
skipCount := skipCount + 1
].
self addPrimaryTranslation: 'N';
addSecondaryTranslation: 'N'.
"Modified: / 28-07-2017 / 11:32:14 / cg"
!
processNtilde
"case 'Ñ':
current += 1;
MetaphAdd(N);
break;
"
self addPrimaryTranslation: 'N';
addSecondaryTranslation: 'N'.
!
processP
"case 'P':
if(GetAt(current + 1) == 'H')
{
MetaphAdd(F);
current += 2;
break;
}
//also account for campbell, raspberry
if(StringAt((current + 1), 1, P, B, ))
current += 2;
else
current += 1;
MetaphAdd(P);
break;
"
| nextLetter |
(nextLetter := self keyAt: currentIndex + 1) = $H
ifTrue: [
self addPrimaryTranslation: 'F';
addSecondaryTranslation: 'F'.
skipCount := skipCount + 1.
^self.
].
(#($P $B) includes: nextLetter)
ifTrue: [
skipCount := skipCount + 1.
] ifFalse: [
self addPrimaryTranslation: 'P';
addSecondaryTranslation: 'P'.
].
"Modified: / 28-07-2017 / 11:32:28 / cg"
!
processQ
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp
case 'Q':
if(GetAt(current + 1) == 'Q')
current += 2;
else
current += 1;
MetaphAdd(K);
break;
"
(self keyAt: currentIndex + 1) = $Q
ifTrue: [
skipCount := skipCount + 1
].
self addPrimaryTranslation: 'K';
addSecondaryTranslation: 'K'.
"Modified: / 28-07-2017 / 11:32:32 / cg"
!
processR
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp
case 'R':
//french e.g. 'rogier', but exclude 'hochmeier'
if((current == last)
AND !!SlavoGermanic()
AND StringAt((current - 2), 2, IE, )
AND !!StringAt((current - 4), 2, ME, MA, ))
MetaphAdd(, R);
else
MetaphAdd(R);
if(GetAt(current + 1) == 'R')
current += 2;
else
current += 1;
break;
"
(currentIndex = inputKey size and: [
(self isSlavoGermanic: inputKey) not and: [
(inputKey copyFrom: ((currentIndex - 2) max: 1) to: ((currentIndex - 1) max: 1)) = 'IE' and: [
(#('ME' 'MA') includes: (inputKey copyFrom: ((currentIndex - 4) max: 1) to: ((currentIndex - 3) max: 1))) not
]
]
])
ifTrue: [
self addPrimaryTranslation: '';
addSecondaryTranslation: 'R'.
] ifFalse: [
self addPrimaryTranslation: 'R';
addSecondaryTranslation: 'R'.
].
(self keyAt: currentIndex + 1) = $R
ifTrue: [
skipCount := skipCount + 1
].
"Modified: / 28-07-2017 / 11:32:37 / cg"
!
processRemainingCharacters
startIndex to: inputKey size do:[ :i |
| c methodSelector |
skipCount = 0 ifTrue:[
((primaryTranslation size > 4) and: [ secondaryTranslation size > 4 ])
ifTrue: [ ^self ].
currentIndex := i.
c := self keyAt: i.
(c isVowel not and: [c ~= $Y]) ifTrue:[
c == $Ç ifTrue: [
methodSelector := #processCedille
] ifFalse: [ c == $Ñ ifTrue: [
methodSelector := #processNtilde
] ifFalse: [
methodSelector := ('process', c asString) asSymbol
]].
self perform: methodSelector
]
] ifFalse: [
skipCount := skipCount - 1
]
]
"Modified: / 28-07-2017 / 11:24:15 / cg"
!
processS
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp
case 'S':
//special cases 'island', 'isle', 'carlisle', 'carlysle'
if(StringAt((current - 1), 3, ISL, YSL, ))
{
current += 1;
break;
}
//special case 'sugar-'
if((current == 0) AND StringAt(current, 5, SUGAR, ))
{
MetaphAdd(X, S);
current += 1;
break;
}
if(StringAt(current, 2, SH, ))
{
//germanic
if(StringAt((current + 1), 4, HEIM, HOEK, HOLM, HOLZ, ))
MetaphAdd(S);
else
MetaphAdd(X);
current += 2;
break;
}
//italian & armenian
if(StringAt(current, 3, SIO, SIA, ) OR StringAt(current, 4, SIAN, ))
{
if(!!SlavoGermanic())
MetaphAdd(S, X);
else
MetaphAdd(S);
current += 3;
break;
}
//german & anglicisations, e.g. 'smith' match 'schmidt', 'snider' match 'schneider'
//also, -sz- in slavic language altho in hungarian it is pronounced 's'
if(((current == 0)
AND StringAt((current + 1), 1, M, N, L, W, ))
OR StringAt((current + 1), 1, Z, ))
{
MetaphAdd(S, X);
if(StringAt((current + 1), 1, Z, ))
current += 2;
else
current += 1;
break;
}
if(StringAt(current, 2, SC, ))
{
//Schlesinger's rule
if(GetAt(current + 2) == 'H')
//dutch origin, e.g. 'school', 'schooner'
if(StringAt((current + 3), 2, OO, ER, EN, UY, ED, EM, ))
{
//'schermerhorn', 'schenker'
if(StringAt((current + 3), 2, ER, EN, ))
{
MetaphAdd(X, SK);
}else
MetaphAdd(SK);
current += 3;
break;
}else{
if((current == 0) AND !!IsVowel(3) AND (GetAt(3) !!= 'W'))
MetaphAdd(X, S);
else
MetaphAdd(X);
current += 3;
break;
}
if(StringAt((current + 2), 1, I, E, Y, ))
{
MetaphAdd(S);
current += 3;
break;
}
//else
MetaphAdd(SK);
current += 3;
break;
}
//french e.g. 'resnais', 'artois'
if((current == last) AND StringAt((current - 2), 2, AI, OI, ))
MetaphAdd(, S);
else
MetaphAdd(S);
if(StringAt((current + 1), 1, S, Z, ))
current += 2;
else
current += 1;
break;
"
| nextChar char2 chars char |
(#('ISL' 'YSL') includes: (inputKey copyFrom: (currentIndex - 1 max: 1) to: (currentIndex + 1 min: inputKey size)))
ifTrue: [
^self
].
(currentIndex = 1 and: [(inputKey copyFrom: 1 to: (5 min: inputKey size)) = 'SUGAR'])
ifTrue: [
self addPrimaryTranslation: 'X';
addSecondaryTranslation: 'S'.
^self.
].
(inputKey copyFrom: currentIndex to: ((currentIndex + 1) min: inputKey size)) = 'SH'
ifTrue: [
(#('HEIM' 'HOEK' 'HOLM' 'HOLZ') includes: (inputKey copyFrom: (currentIndex + 1 min: inputKey size) to: ((currentIndex + 5) min: inputKey size)))
ifTrue: [
self addPrimaryTranslation: 'S';
addSecondaryTranslation: 'S'.
] ifFalse: [
self addPrimaryTranslation: 'X';
addSecondaryTranslation: 'X'.
].
skipCount := skipCount + 1.
^self
].
((#('SIO' 'SIA') includes: (inputKey copyFrom: currentIndex to: (currentIndex + 2 min: inputKey size)))
or: [(inputKey copyFrom: currentIndex to: (currentIndex + 3 min: inputKey size)) = 'SIAN'])
ifTrue: [
(self isSlavoGermanic: inputKey) not
ifTrue: [
self addPrimaryTranslation: 'S';
addSecondaryTranslation: 'X'.
] ifFalse: [
self addPrimaryTranslation: 'S';
addSecondaryTranslation: 'S'.
].
skipCount := skipCount + 2.
^self
].
((currentIndex = 1 and: [#($M $N $L $W) includes: (self keyAt: currentIndex + 1)])
or: [(nextChar := self keyAt: currentIndex + 1) = $Z])
ifTrue: [
self addPrimaryTranslation: 'S';
addSecondaryTranslation: 'X'.
nextChar == $Z
ifTrue: [
skipCount := skipCount + 1.
^self.
].
^self.
].
((inputKey copyFrom: currentIndex to: ((currentIndex + 1) min: inputKey size)) = 'SC')
ifTrue: [
(char2 := self keyAt: currentIndex + 2) = $H
ifTrue: [
(#('OO' 'ER' 'EN' 'UY' 'ED' 'EM') includes: (chars := inputKey copyFrom: ((currentIndex + 3) min: inputKey size) to: ((currentIndex + 4) min: inputKey size)))
ifTrue: [
(#('ER' 'EN') includes: chars)
ifTrue: [
self addPrimaryTranslation: 'X';
addSecondaryTranslation: 'SK'.
] ifFalse: [
self addPrimaryTranslation: 'SK';
addSecondaryTranslation: 'SK'.
].
skipCount := skipCount + 2.
^self.
] ifFalse: [
((currentIndex = 1 and: [(char := inputKey at: 4 ifAbsent: [$b]) isVowel not]) and: [char ~= $W])
ifTrue: [
self addPrimaryTranslation: 'X';
addSecondaryTranslation: 'S'.
] ifFalse: [
self addPrimaryTranslation: 'X';
addSecondaryTranslation: 'X'.
].
skipCount := skipCount + 2.
^self .
].
] ifFalse: [
(#($I $E $Y) includes: char2)
ifTrue: [
self addPrimaryTranslation: 'S';
addSecondaryTranslation: 'S'.
skipCount := skipCount + 2.
^self .
] ifFalse: [
self addPrimaryTranslation: 'SK';
addSecondaryTranslation: 'SK'.
skipCount := skipCount + 2.
^self.
]
].
].
(currentIndex = inputKey size and: [(#('AI' 'OI') includes: (inputKey copyFrom: ((currentIndex - 2) max: 1) to: ((currentIndex - 1) max: 1)))])
ifTrue: [
self addPrimaryTranslation: '';
addSecondaryTranslation: 'S'.
] ifFalse: [
self addPrimaryTranslation: 'S';
addSecondaryTranslation: 'S'.
].
(#($S $Z) includes: (self keyAt: currentIndex + 1))
ifTrue: [
skipCount := skipCount + 1.
^self.
].
"Modified: / 28-07-2017 / 11:34:18 / cg"
!
processT
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp
case 'T':
if(StringAt(current, 4, TION, ))
{
MetaphAdd(X);
current += 3;
break;
}
if(StringAt(current, 3, TIA, TCH, ))
{
MetaphAdd(X);
current += 3;
break;
}
if(StringAt(current, 2, TH, )
OR StringAt(current, 3, TTH, ))
{
//special case 'thomas', 'thames' or germanic
if(StringAt((current + 2), 2, OM, AM, )
OR StringAt(0, 4, VAN , VON , )
OR StringAt(0, 3, SCH, ))
{
MetaphAdd(T);
}else{
MetaphAdd(0, T);
}
current += 2;
break;
}
if(StringAt((current + 1), 1, T, D, ))
current += 2;
else
current += 1;
MetaphAdd(T);
break;
"
((inputKey copyFrom: currentIndex to: ((currentIndex + 3) min: inputKey size)) = 'TION')
ifTrue: [
self addPrimaryTranslation: 'X';
addSecondaryTranslation: 'X'.
skipCount := skipCount + 2.
^self.
].
(#('TIA' 'TCH') includes: (inputKey copyFrom: currentIndex to: ((currentIndex + 2) min: inputKey size)))
ifTrue: [
self addPrimaryTranslation: 'X';
addSecondaryTranslation: 'X'.
skipCount := skipCount + 2.
^self.
].
(((inputKey copyFrom: currentIndex to: ((currentIndex + 1) min: inputKey size)) = 'TH') or: [
((inputKey copyFrom: currentIndex to: ((currentIndex + 2) min: inputKey size)) = 'TTH')
])
ifTrue: [
((#('OM' 'AM') includes: (inputKey copyFrom: currentIndex + 2 to: ((currentIndex + 3) min: inputKey size)))
or: [(#('VAN ' 'VON ') includes: (inputKey copyFrom: 1 to: (4 min: inputKey size)))
or: [(inputKey copyFrom: 1 to: (3 min: inputKey size)) = 'SCH']
])
ifTrue: [
self addPrimaryTranslation: 'T';
addSecondaryTranslation: 'T'.
] ifFalse: [
self addPrimaryTranslation: '0';
addSecondaryTranslation: 'T'.
].
skipCount := skipCount + 1.
^self.
].
(#($T $D) includes: (self keyAt: currentIndex + 1))
ifTrue: [
skipCount := skipCount + 1.
].
self addPrimaryTranslation: 'T';
addSecondaryTranslation: 'T'.
"Modified: / 28-07-2017 / 11:33:33 / cg"
!
processV
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp
case 'V':
if(GetAt(current + 1) == 'V')
current += 2;
else
current += 1;
MetaphAdd(F);
break;
"
(self keyAt: currentIndex + 1) = $V
ifTrue: [
skipCount := skipCount + 1
].
self addPrimaryTranslation: 'F';
addSecondaryTranslation: 'F'.
"Modified: / 28-07-2017 / 11:34:27 / cg"
!
processW
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp
case 'W':
//can also be in middle of word
if(StringAt(current, 2, WR, ))
{
MetaphAdd(R);
current += 2;
break;
}
if((current == 0)
AND (IsVowel(current + 1) OR StringAt(current, 2, WH, )))
{
//Wasserman should match Vasserman
if(IsVowel(current + 1))
MetaphAdd(A, F);
else
//need Uomo to match Womo
MetaphAdd(A);
}
//Arnow should match Arnoff
if(((current == last) AND IsVowel(current - 1))
OR StringAt((current - 1), 5, EWSKI, EWSKY, OWSKI, OWSKY, )
OR StringAt(0, 3, SCH, ))
{
MetaphAdd(, F);
current +=1;
break;
}
//polish e.g. 'filipowicz'
if(StringAt(current, 4, WICZ, WITZ, ))
{
MetaphAdd(TS, FX);
current +=4;
break;
}
//else skip it
current +=1;
break;
"
| word nextLetter |
((word := inputKey copyFrom: currentIndex to: (currentIndex + 1 min: inputKey size)) = 'WR')
ifTrue: [
self addPrimaryTranslation: 'R';
addSecondaryTranslation: 'R'.
skipCount := skipCount + 1.
^self
].
((currentIndex = 1 and: [(nextLetter := self keyAt: currentIndex + 1) isVowel]) or: [
word = 'WH'
])
ifTrue: [
nextLetter isVowel
ifTrue: [
self addPrimaryTranslation: 'A';
addSecondaryTranslation: 'F'.
] ifFalse: [
self addPrimaryTranslation: 'A';
addSecondaryTranslation: 'A'.
]
].
((((currentIndex = inputKey size) and: [(self keyAt: currentIndex - 1) isVowel])
or: [#('EWSKI' 'EWSKY' 'OWSKI' 'OWSKY') includes: (inputKey copyFrom: ((currentIndex - 1) max: 1) to: (currentIndex + 3 min: inputKey size))])
or: [inputKey startsWith:'SCH'])
ifTrue: [
self addPrimaryTranslation: '';
addSecondaryTranslation: 'F'.
^self.
].
(#('WICZ' 'WITZ') includes: (inputKey copyFrom: currentIndex to: (currentIndex + 4 min: inputKey size)))
ifTrue: [
self addPrimaryTranslation: 'TS';
addSecondaryTranslation: 'FX'.
skipCount := skipCount + 3.
^self
].
"Modified: / 28-07-2017 / 11:34:51 / cg"
!
processX
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp
case 'X':
//french e.g. breaux
if(!!((current == last)
AND (StringAt((current - 3), 3, IAU, EAU, )
OR StringAt((current - 2), 2, AU, OU, ))) )
MetaphAdd(KS);
if(StringAt((current + 1), 1, C, X, ))
current += 2;
else
current += 1;
break;
"
((currentIndex = inputKey size)
and: [(#('IAU' 'EAU') includes: (inputKey copyFrom: ((currentIndex - 3) min: 1) to: currentIndex))
or: [(#('AU' 'OU') includes: (inputKey copyFrom: ((currentIndex - 2) min: 1) to: currentIndex))]])
ifFalse: [
self addPrimaryTranslation: 'KS';
addSecondaryTranslation: 'KS'.
].
(#($C $X) includes: (self keyAt: currentIndex + 1))
ifTrue: [
skipCount := skipCount + 1.
^self
]
"Modified: / 28-07-2017 / 11:34:44 / cg"
!
processZ
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp
case 'Z':
//chinese pinyin e.g. 'zhao'
if(GetAt(current + 1) == 'H')
{
MetaphAdd(J);
current += 2;
break;
}else
if(StringAt((current + 1), 2, ZO, ZI, ZA, )
OR (SlavoGermanic() AND ((current > 0) AND GetAt(current - 1) !!= 'T')))
{
MetaphAdd(S, TS);
}
else
MetaphAdd(S);
if(GetAt(current + 1) == 'Z')
current += 2;
else
current += 1;
break;
"
(self keyAt: currentIndex + 1) = $H
ifTrue: [
self addPrimaryTranslation: 'J';
addSecondaryTranslation: 'J'.
skipCount := skipCount + 1.
^self
] ifFalse: [
((#('ZO' 'ZI' 'ZA') includes: (inputKey copyFrom: ((currentIndex + 1) min: inputKey size) to: ((currentIndex + 2) min: inputKey size))) or: [
(self isSlavoGermanic: inputKey) and: [(currentIndex > 1 and: [(self keyAt: currentIndex - 1) ~= 'T'])]
])
ifTrue: [
self addPrimaryTranslation: 'S';
addSecondaryTranslation: 'TS'.
] ifFalse: [
self addPrimaryTranslation: 'S';
addSecondaryTranslation: 'S'.
].
(self keyAt: currentIndex + 1) = $Z
ifTrue: [
skipCount := skipCount + 1.
^self
].
]
"Modified: / 28-07-2017 / 11:35:12 / cg"
! !
!PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator class methodsFor:'documentation'!
documentation
"
The 'Kölner Phonetik' (cologne phonetic) code is for the german language
what the soundex code is for english:
it returns similar strings for similar sounding words
(but is specifically aware of the pronunciation of German and eastern languages) .
There are some other differences to soundex, though:
its length is not limited to 4, but depends on the length of the original string;
it does not start with the first character of the input, but returns a pure numeric string.
This algorithm was described by Postel 1969,
See http://de.wikipedia.org/wiki/K%C3%B6lner_Phonetik
self new phoneticStringsFor:'Müller-Lüdenscheidt' -> #('65752682')
"
!
examples
"
words sounding similar (german pronunciation) will deliver a similar code:
#(
'Müller'
'Miller'
'Mueller'
'Mühler'
'Mühlherr'
'Mülherr'
'Myler'
'Millar'
'Myller'
'Müllar'
'Müler'
'Muehler'
'Mülller'
'Müllerr'
'Muehlherr'
'Muellar'
'Mueler'
'Mülleer'
'Mueller'
'Nüller'
'Nyller'
'Niler'
'Czerny'
'Tscherny'
'Czernie'
'Tschernie'
'Schernie'
'Scherny'
'Scherno'
'Czerne'
'Zerny'
'Tzernie'
'Breschnew'
'Breschnew'
'Breschneff'
'Breschnjeff'
'Braeschneff'
'Braessneff'
'Pressneff'
'Presznäph'
'Präschnäf'
'Breschnjeff'
'Breschnijeff'
'Breschnieff'
'Bräschnieff'
'Braschnieff'
'Broschnieff'
) do:[:w |
Transcript show:w; show:'->'; showCR:(PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new encode:w)
].
"
! !
!PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator methodsFor:'api'!
encode: aString
"return a koelner phonetic code.
The koelnerPhonetic code is for the german language what the soundex code is for english;
it returns simular strings for similar sounding words.
There are some differences to soundex, though:
its length is not limited to 4, but depends on the length of the original string;
it does not start with the first character of the input.
This algorithm is described by Postel 1969"
|in ret val rslt|
in := aString withoutSeparators asLowercase.
in := in copyReplaceString:'ph' withString:'f'.
(in includesAny:'öäüß') ifTrue:[
in := in copyReplaceAll:$ü withAll:'u'.
in := in copyReplaceAll:$ä withAll:'a'.
in := in copyReplaceAll:$ö withAll:'o'.
in := in copyReplaceAll:$ß withAll:'ss'.
].
in := in select:[:ch | ch isLetter].
in := '#',in,'#'.
ret := ''.
1 to:in size-2 do:[:i |
|sub|
sub := in copyFrom:i to:i+2.
val := (i==1)
ifTrue:[ self convertFirst:sub ]
ifFalse:[ self convertRest:sub ].
ret := ret,val
].
ret := ret select:[:ch | ch ~= $-].
(ret startsWith:'0') ifTrue:[
ret := '0',(ret select:[:ch | ch ~= $0]).
] ifFalse:[
ret := ret select:[:ch | ch ~= $0].
].
rslt := String streamContents:[:s |
|prev|
ret do:[:ch |
ch ~= prev ifTrue:[
s nextPut:ch
].
prev := ch.
].
].
^ rslt.
"
#(
'Müller'
'Miller'
'Mueller'
'Mühler'
'Mühlherr'
'Mülherr'
'Myler'
'Millar'
'Myller'
'Müllar'
'Müler'
'Muehler'
'Mülller'
'Müllerr'
'Muehlherr'
'Muellar'
'Mueler'
'Mülleer'
'Mueller'
'Nüller'
'Nyller'
'Niler'
'Czerny'
'Tscherny'
'Czernie'
'Tschernie'
'Schernie'
'Scherny'
'Scherno'
'Czerne'
'Zerny'
'Tzernie'
'Breschnew'
'Breschnew'
'Breschneff'
'Breschnjeff'
'Braeschneff'
'Braessneff'
'Pressneff'
'Presznäph'
'Präschnäf'
'Breschnjeff'
'Breschnijeff'
'Breschnieff'
) do:[:w |
Transcript show:w; show:'->'; showCR:(PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new encode:w)
].
"
"
PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new encode:'Breschnew' -> '17863'
PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new encode:'Breschneff' -> '17863'
PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new encode:'Braeschneff' -> '17863'
PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new encode:'Braessneff' -> '17863'
PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new encode:'Pressneff' -> '17863'
PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new encode:'Presznäph' -> '17863'
PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new encode:'Präschnäf' -> '17863'
PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new encode:'Breschnjeff' -> '17863'
PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new encode:'Breschnijeff' -> '17863'
PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new encode:'Breschnieff' -> '17863'
"
"
self basicNew encode:'müller' -> '657'
self basicNew encode:'möller' -> '657'
self basicNew encode:'miller' -> '657'
self basicNew encode:'muller' -> '657'
self basicNew encode:'muler' -> '657'
self basicNew encode:'schmidt' -> '862'
self basicNew encode:'schneider' -> '8627'
self basicNew encode:'fischer' -> '387'
self basicNew encode:'weber' -> '317'
self basicNew encode:'meyer' -> '67'
self basicNew encode:'wagner' -> '3467'
self basicNew encode:'schulz' -> '858'
self basicNew encode:'becker' -> '147'
self basicNew encode:'hoffmann' -> '036'
self basicNew encode:'schäfer' -> '837'
"
"Created: / 28-07-2017 / 15:24:33 / cg"
! !
!PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator methodsFor:'private'!
convertFirst:chars
|c2 c3|
chars size == 3 ifTrue:[
c2 := (chars at:2).
c2 == $a ifTrue:[^ '0'].
c2 == $e ifTrue:[^ '0'].
c2 == $i ifTrue:[^ '0'].
c2 == $j ifTrue:[^ '0'].
c2 == $y ifTrue:[^ '0'].
c2 == $o ifTrue:[^ '0'].
c2 == $u ifTrue:[^ '0'].
c2 == $c ifTrue:[
c3 := (chars at:3).
(c3 == $a) ifTrue:[^ '4'].
(c3 == $h) ifTrue:[^ '4'].
(c3 == $k) ifTrue:[^ '4'].
(c3 == $l) ifTrue:[^ '4'].
(c3 == $o) ifTrue:[^ '4'].
(c3 == $q) ifTrue:[^ '4'].
(c3 == $r) ifTrue:[^ '4'].
(c3 == $u) ifTrue:[^ '4'].
(c3 == $x) ifTrue:[^ '4'].
^ '8'
].
"/ #(
"/ ('#a#' '0')
"/ ('#e#' '0')
"/ ('#i#' '0')
"/ ('#j#' '0')
"/ ('#y#' '0')
"/ ('#o#' '0')
"/ ('#u#' '0')
"/
"/ ('#ca' '4')
"/ ('#ch' '4')
"/ ('#ck' '4')
"/ ('#cl' '4')
"/ ('#co' '4')
"/ ('#cq' '4')
"/ ('#cr' '4')
"/ ('#cu' '4')
"/ ('#cx' '4')
"/
"/ ('#c#' '8')
"/ ) do:[:pair |
"/ (pair first match:chars) ifTrue:[
"/ ^ pair second
"/ ]
"/ ].
].
^ self convertRest:chars
"Modified: / 29-07-2017 / 14:22:20 / cg"
!
convertRest:chars
chars size == 3 ifFalse:[
self error:'cannot happen'.
^ '?'
].
#(
"/ used to be matchpattern code,
"/ but doing these glob-matches is too slow.
"/ changed to:
"/ start nil code
"/ nil end code
"/ nil char code
"/
(nil 'ds' " '#ds' " '8')
(nil 'dc' " '#dc' " '8')
(nil 'dz' " '#dz' " '8')
(nil 'ts' " '#ts' " '8')
(nil 'tc' " '#tc' " '8')
(nil 'tz' " '#tz' " '8')
(nil $d " '#d#' " '2')
(nil $t " '#t#' " '2')
('cx' nil " 'cx#' " '8')
('kx' nil " 'kx#' " '8')
('qx' nil " 'qx#' " '8')
(nil $x " '#x#' " '48')
('sc' nil " 'sc#' " '8')
('sz' nil " 'sz#' " '8')
(nil 'ca' " '#ca' " '4')
(nil 'co' " '#co' " '4')
(nil 'cu' " '#cu' " '4')
(nil 'ch' " '#ch' " '4')
(nil 'ck' " '#ck' " '4')
(nil 'cx' " '#cx' " '4')
(nil 'cq' " '#cq' " '4')
(nil $c " '#c#' " '8')
(nil $a " '#a#' " '0')
(nil $e " '#e#' " '0')
(nil $i " '#i#' " '0')
(nil $j " '#j#' " '0')
(nil $y " '#y#' " '0')
(nil $o " '#o#' " '0')
(nil $u " '#u#' " '0')
(nil $h " '#h#' " '-')
(nil $l " '#l#' " '5')
(nil $r " '#r#' " '7')
(nil $m " '#m#' " '6')
(nil $n " '#n#' " '6')
(nil $s " '#s#' " '8')
(nil $z " '#z#' " '8')
(nil $b " '#b#' " '1')
(nil $p " '#p#' " '1')
(nil $f " '#f#' " '3')
(nil $v " '#v#' " '3')
(nil $w " '#w#' " '3')
(nil $g " '#g#' " '4')
(nil $k " '#k#' " '4')
(nil $q " '#q#' " '4')
(nil nil " '###' " '?')
) do:[:vector |
|v1 v2|
(v1 := vector at:1) notNil ifTrue:[
"/ prefix
(chars startsWith:v1) ifTrue:[^ (vector at:3) ].
] ifFalse:[
(v2 := vector at:2) isCharacter ifTrue:[
"/ middle character compare
(chars at:2) == v2 ifTrue:[^ (vector at:3) ].
] ifFalse:[
v2 isString ifTrue:[
"/ suffix
(chars endsWith:v2) ifTrue:[^ (vector at:3) ].
] ifFalse:[
^ '?'
]
]
].
"/ (vector first match:chars) ifTrue:[
"/ ^ vector second
"/ ]
].
self error:'cannot happen'
"Modified: / 29-07-2017 / 14:17:38 / cg"
! !
!PhoneticStringUtilities::MiracodeStringComparator class methodsFor:'documentation'!
documentation
"
Miracode (also called << American Soundex >>) is like Soundex with the
addition that h and w are discarded if they separate consonants.
These variants may be specifically important because they were used in
U.S. National Archives.
Most archive data were encoded with Miracode,
but there are some (older) entries encoded with Simplified Soundex.
The HW-rule was documented as a standard in 1910,
but actually data of 1880, 1900 and 1910
censuses were encoded with mixed methods.
self new encode:'washington' -> 'W252'
self new encode:'lee' -> 'L000'
self new encode:'Gutierrez' -> 'G362'
self new encode:'Pfister' -> 'P236'
self new encode:'Jackson' -> 'J250'
self new encode:'Tymczak' -> 'T522'
notice:
MiracodeStringComparator new
encode:'Ashcraft' -> 'A261'
SoundexStringComparator
new encode:'Ashcraft' -> 'A226'
see also:
https://www.archives.gov/research/census/soundex.html
"
! !
!PhoneticStringUtilities::MiracodeStringComparator methodsFor:'api'!
encode:word
|u p t prevCode|
u := word asUppercase.
p := u first asString.
prevCode := self translate:u first.
u from:2 to:u size do:[:c |
t := self translate:c.
(t notNil
and:[ t ~= '0'
and:[ t ~= prevCode ]]) ifTrue:[
p := p , t.
p size == 4 ifTrue:[^ p ].
].
(c ~= $W and:[c ~= $H]) ifTrue:[
prevCode := t.
].
].
[ p size < 4 ] whileTrue:[
p := p , '0'
].
^ (p copyFrom:1 to:4)
"
self new encode:'washington' -> 'W252'
self new encode:'lee' -> 'L000'
self new encode:'Gutierrez' -> 'G362'
self new encode:'Pfister' -> 'P236'
self new encode:'Jackson' -> 'J250'
self new encode:'Tymczak' -> 'T522'
"
"notice:
MiracodeStringComparator new encode:'Ashcraft' -> 'A261'
self new encode:'Ashcraft' -> 'A226'
"
"Created: / 28-07-2017 / 15:23:16 / cg"
"Modified (comment): / 01-08-2017 / 19:01:51 / cg"
! !
!PhoneticStringUtilities::SpanishPhoneticCodeStringComparator class methodsFor:'documentation'!
documentation
"
The 'Spanish Phonetik' (spanish phonetic) code is for the spanish language
what the soundex code is for english:
it returns similar strings for similar sounding words
(but is specifically aware of the pronunciation of spanish) .
There are some other differences to soundex, though:
its length is not limited to 4, but depends on the length of the original string;
it does not start with the first character of the input,
but returns a pure numeric string,
it uses different character groups
This algorithm was described by Marıa del Pilar Angeles, Adrian Espino-Gamez,
and Jonathan Gil-Moncada, in
'Comparison of a Modified Spanish phonetic,
Soundex, and Phonex coding functions during data matching process'
See https://www.researchgate.net/publication/285589803_Comparison_of_a_Modified_Spanish_Phonetic_Soundex_and_Phonex_coding_functions_during_data_matching_process
"
!
examples
"
words sounding similar (german pronunciation) will deliver a similar code:
#(
'Marıa'
'Pilar'
'Angeles'
'Adrian'
'Gamez'
) do:[:w |
Transcript show:w; show:'->'; showCR:(PhoneticStringUtilities::SpanishPhoneticCodeStringComparator new encode:w)
].
"
! !
!PhoneticStringUtilities::SpanishPhoneticCodeStringComparator methodsFor:'api'!
encode: aString
"return a spanish phonetic code.
The spanishPhonetic code is for the spanish language what the soundex code is for english;
it returns simular strings for similar sounding words.
There are some differences to soundex, though:
its length is not limited to 4, but depends on the length of the original string;
it does not start with the first character of the input,
it uses different character groups.
This algorithm is described by Marıa del Pilar Angeles, Adrian Espino-Gamez,
Jonathan Gil-Moncada."
|in|
in := aString withoutSeparators asUppercase.
^ String streamContents:[:out |
|prev|
in do:[:ch |
ch == prev ifFalse:[
ch == $P ifTrue:[
out nextPut:$0.
] ifFalse:[ ('BV' includes:ch) ifTrue:[
out nextPut:$1.
] ifFalse:[ ('FH' includes:ch) ifTrue:[
out nextPut:$2.
] ifFalse:[ ('DT' includes:ch) ifTrue:[
out nextPut:$3.
] ifFalse:[ ('SZCX' includes:ch) ifTrue:[
out nextPut:$4.
] ifFalse:[ ('YL' includes:ch) ifTrue:[
out nextPut:$5.
] ifFalse:[ ('NŃM' includes:ch) ifTrue:[
out nextPut:$6.
] ifFalse:[ ('QK' includes:ch) ifTrue:[
out nextPut:$7.
] ifFalse:[ ('GJ' includes:ch) ifTrue:[
out nextPut:$8.
] ifFalse:[ ('R' includes:ch) ifTrue:[
out nextPut:$9.
]]]]]]]]]].
prev := ch.
].
].
].
"
self new encode:'Jose'
"
"Created: / 28-07-2017 / 15:24:33 / cg"
"Modified: / 01-08-2017 / 18:48:50 / cg"
! !
!PhoneticStringUtilities::SpanishPhoneticCodeStringComparator methodsFor:'private'!
convertFirst:chars
|c2 c3|
chars size == 3 ifTrue:[
c2 := (chars at:2).
c2 == $a ifTrue:[^ '0'].
c2 == $e ifTrue:[^ '0'].
c2 == $i ifTrue:[^ '0'].
c2 == $j ifTrue:[^ '0'].
c2 == $y ifTrue:[^ '0'].
c2 == $o ifTrue:[^ '0'].
c2 == $u ifTrue:[^ '0'].
c2 == $c ifTrue:[
c3 := (chars at:3).
(c3 == $a) ifTrue:[^ '4'].
(c3 == $h) ifTrue:[^ '4'].
(c3 == $k) ifTrue:[^ '4'].
(c3 == $l) ifTrue:[^ '4'].
(c3 == $o) ifTrue:[^ '4'].
(c3 == $q) ifTrue:[^ '4'].
(c3 == $r) ifTrue:[^ '4'].
(c3 == $u) ifTrue:[^ '4'].
(c3 == $x) ifTrue:[^ '4'].
^ '8'
].
"/ #(
"/ ('#a#' '0')
"/ ('#e#' '0')
"/ ('#i#' '0')
"/ ('#j#' '0')
"/ ('#y#' '0')
"/ ('#o#' '0')
"/ ('#u#' '0')
"/
"/ ('#ca' '4')
"/ ('#ch' '4')
"/ ('#ck' '4')
"/ ('#cl' '4')
"/ ('#co' '4')
"/ ('#cq' '4')
"/ ('#cr' '4')
"/ ('#cu' '4')
"/ ('#cx' '4')
"/
"/ ('#c#' '8')
"/ ) do:[:pair |
"/ (pair first match:chars) ifTrue:[
"/ ^ pair second
"/ ]
"/ ].
].
^ self convertRest:chars
"Modified: / 29-07-2017 / 14:22:20 / cg"
!
convertRest:chars
chars size == 3 ifFalse:[
self error:'cannot happen'.
^ '?'
].
#(
"/ used to be matchpattern code,
"/ but doing these glob-matches is too slow.
"/ changed to:
"/ start nil code
"/ nil end code
"/ nil char code
"/
(nil 'ds' " '#ds' " '8')
(nil 'dc' " '#dc' " '8')
(nil 'dz' " '#dz' " '8')
(nil 'ts' " '#ts' " '8')
(nil 'tc' " '#tc' " '8')
(nil 'tz' " '#tz' " '8')
(nil $d " '#d#' " '2')
(nil $t " '#t#' " '2')
('cx' nil " 'cx#' " '8')
('kx' nil " 'kx#' " '8')
('qx' nil " 'qx#' " '8')
(nil $x " '#x#' " '48')
('sc' nil " 'sc#' " '8')
('sz' nil " 'sz#' " '8')
(nil 'ca' " '#ca' " '4')
(nil 'co' " '#co' " '4')
(nil 'cu' " '#cu' " '4')
(nil 'ch' " '#ch' " '4')
(nil 'ck' " '#ck' " '4')
(nil 'cx' " '#cx' " '4')
(nil 'cq' " '#cq' " '4')
(nil $c " '#c#' " '8')
(nil $a " '#a#' " '0')
(nil $e " '#e#' " '0')
(nil $i " '#i#' " '0')
(nil $j " '#j#' " '0')
(nil $y " '#y#' " '0')
(nil $o " '#o#' " '0')
(nil $u " '#u#' " '0')
(nil $h " '#h#' " '-')
(nil $l " '#l#' " '5')
(nil $r " '#r#' " '7')
(nil $m " '#m#' " '6')
(nil $n " '#n#' " '6')
(nil $s " '#s#' " '8')
(nil $z " '#z#' " '8')
(nil $b " '#b#' " '1')
(nil $p " '#p#' " '1')
(nil $f " '#f#' " '3')
(nil $v " '#v#' " '3')
(nil $w " '#w#' " '3')
(nil $g " '#g#' " '4')
(nil $k " '#k#' " '4')
(nil $q " '#q#' " '4')
(nil nil " '###' " '?')
) do:[:vector |
|v1 v2|
(v1 := vector at:1) notNil ifTrue:[
"/ prefix
(chars startsWith:v1) ifTrue:[^ (vector at:3) ].
] ifFalse:[
(v2 := vector at:2) isCharacter ifTrue:[
"/ middle character compare
(chars at:2) == v2 ifTrue:[^ (vector at:3) ].
] ifFalse:[
v2 isString ifTrue:[
"/ suffix
(chars endsWith:v2) ifTrue:[^ (vector at:3) ].
] ifFalse:[
^ '?'
]
]
].
"/ (vector first match:chars) ifTrue:[
"/ ^ vector second
"/ ]
].
self error:'cannot happen'
"Modified: / 29-07-2017 / 14:17:38 / cg"
! !
!PhoneticStringUtilities class methodsFor:'documentation'!
version
^ '$Header$'
!
version_CVS
^ '$Header$'
! !