author | Claus Gittinger <cg@exept.de> |
Tue, 01 Aug 2017 11:40:16 +0200 | |
changeset 4488 | 51f2907c7389 |
parent 4487 | 908110f595e9 |
child 4489 | 2d7af11ffcd7 |
permissions | -rw-r--r-- |
"{ Encoding: utf8 }"

"
 COPYRIGHT (c) 1994 by Claus Gittinger
 COPYRIGHT (c) 2009 by eXept Software AG
              All Rights Reserved

 This software is furnished under a license and may be used
 only in accordance with the terms of that license and with the
 inclusion of the above copyright notice. This software may not
 be provided or otherwise made available to, or used by, any
 other person. No title to or ownership of the software is
 hereby transferred.
"
"{ Package: 'stx:libbasic2' }"

"{ NameSpace: Smalltalk }"

Object subclass:#PhoneticStringUtilities
    instanceVariableNames:''
    classVariableNames:''
    poolDictionaries:''
    category:'Collections-Text-Support'
!

Object subclass:#PhoneticStringComparator
    instanceVariableNames:''
    classVariableNames:''
    poolDictionaries:''
    privateIn:PhoneticStringUtilities
!

PhoneticStringUtilities::PhoneticStringComparator subclass:#ExtendedSoundexStringComparator
    instanceVariableNames:''
    classVariableNames:'CharacterTranslationDict'
    poolDictionaries:''
    privateIn:PhoneticStringUtilities
!

PhoneticStringUtilities::PhoneticStringComparator subclass:#SingleResultPhoneticStringComparator
    instanceVariableNames:''
    classVariableNames:''
    poolDictionaries:''
    privateIn:PhoneticStringUtilities
!

PhoneticStringUtilities::SingleResultPhoneticStringComparator subclass:#MRAStringComparator
    instanceVariableNames:''
    classVariableNames:'CharacterTranslationDict'
    poolDictionaries:''
    privateIn:PhoneticStringUtilities
!

PhoneticStringUtilities::SingleResultPhoneticStringComparator subclass:#SoundexStringComparator
    instanceVariableNames:''
    classVariableNames:'CharacterTranslationDict'
    poolDictionaries:''
    privateIn:PhoneticStringUtilities
!

PhoneticStringUtilities::SoundexStringComparator subclass:#MySQLSoundexStringComparator
    instanceVariableNames:''
    classVariableNames:''
    poolDictionaries:''
    privateIn:PhoneticStringUtilities
!

PhoneticStringUtilities::SingleResultPhoneticStringComparator subclass:#NYSIISStringComparator
    instanceVariableNames:''
    classVariableNames:''
    poolDictionaries:''
    privateIn:PhoneticStringUtilities
!

PhoneticStringUtilities::SingleResultPhoneticStringComparator subclass:#PhonemStringComparator
    instanceVariableNames:''
    classVariableNames:'CharacterTranslationDict'
    poolDictionaries:''
    privateIn:PhoneticStringUtilities
!

PhoneticStringUtilities::PhoneticStringComparator subclass:#DoubleMetaphoneStringComparator
    instanceVariableNames:'inputKey primaryTranslation secondaryTranslation startIndex
        currentIndex skipCount'
    classVariableNames:''
    poolDictionaries:''
    privateIn:PhoneticStringUtilities
!

PhoneticStringUtilities::SingleResultPhoneticStringComparator subclass:#KoelnerPhoneticCodeStringComparator
    instanceVariableNames:''
    classVariableNames:'CharacterTranslationDict'
    poolDictionaries:''
    privateIn:PhoneticStringUtilities
!

PhoneticStringUtilities::SoundexStringComparator subclass:#MiracodeStringComparator
    instanceVariableNames:''
    classVariableNames:''
    poolDictionaries:''
    privateIn:PhoneticStringUtilities
!

!PhoneticStringUtilities class methodsFor:'documentation'!

copyright
"
 COPYRIGHT (c) 1994 by Claus Gittinger
 COPYRIGHT (c) 2009 by eXept Software AG
              All Rights Reserved

 This software is furnished under a license and may be used
 only in accordance with the terms of that license and with the
 inclusion of the above copyright notice. This software may not
 be provided or otherwise made available to, or used by, any
 other person. No title to or ownership of the software is
 hereby transferred.
"
!

documentation
"
    Utilities which are helpful to perform phonetic string searches or comparisons.
    These are all variations or improvements of the soundex algorithm, which usually fails
    to provide good results for non-English languages.

    soundexCode
        this algorithm was originally contained in the CharacterArray class;

    nysiis
        a modified soundex algorithm

    miracode
        another modified soundex algorithm ('american soundex') used in the 1880 census.

    mySQLSoundex
        another modified soundex algorithm used in mySQL.

    koelner phoneticCode
        provides a functionality similar to soundex, but much more tuned towards the German language

    Double metaphone
        works with most European languages.

    phonem
        described in Georg Wilde and Carsten Meyer, 'Doppelgaenger gesucht - Ein Programm fuer kontextsensitive phonetische Textumwandlung'
        from 'ct Magazin fuer Computer & Technik 25/1999'.

    More info for German readers is found in:
        http://www.uni-koeln.de/phil-fak/phonetik/Lehre/MA-Arbeiten/magister_wilz.pdf
"
!

sampleData
"
    for the 50 most common german names, we get:

                         ext.
    name       soundex   soundex   metaphone  phonet   phonet2  phonix    daitsch  phonem  koeln

    müller     M460      54600000  MLR        MÜLA     NILA     M4000000  689000   MYLR    657
    schmidt    S253      25300000  SKMTT      SHMIT    ZNIT     S5300000  463000   CMYD    8628
    schneider  S253      25360000  SKNTR      SHNEIDA  ZNEITA   S5300000  463900   CNAYDR  8627
    fischer    F260      12600000  FSKR       FISHA    FIZA     F8000000  749000   VYCR    387
    weber      W160      16000000  WBR        WEBA     FEBA     $1000000  779000   VBR     317
    meyer      M600      56000000  MYR        MEIA     NEIA     M0000000  619000   MAYR    67
    wagner     W256      25600000  WKNR       WAKNA    FAKNA    $2500000  756900   VACNR   367
    schulz     S242      24200000  SKLS       SHULS    ZULZ     S4800000  484000   CULC    85
    becker     B260      12600000  BKR        BEKA     BEKA     B2000000  759000   BCR     147
    hoffmann   H155      15500000  HFMN       HOFMAN   UFNAN    $7550000  576600   OVMAN   036
    schäfer    S216      21600000  SKFR       SHEFA    ZEFA     S7000000  479000   CVR     837
"
! !

!PhoneticStringUtilities class methodsFor:'phonetic codes'!

koelnerPhoneticCodeOf:aString
    "return a koelner phonetic code.
     The koelnerPhonetic code is for the German language what the soundex code is for English;
     it returns similar strings for similar sounding words.
     There are some differences to soundex, though:
         its length is not limited to 4, but depends on the length of the original string;
         it does not start with the first character of the input.
     This algorithm is described by Postel 1969"

    ^ (KoelnerPhoneticCodeStringComparator new phoneticStringsFor:aString) first

    "
     #(
        'Müller'
        'Miller'
        'Mueller'
        'Mühler'
        'Mühlherr'
        'Mülherr'
        'Myler'
        'Millar'
        'Myller'
        'Müllar'
        'Müler'
        'Muehler'
        'Mülller'
        'Müllerr'
        'Muehlherr'
        'Muellar'
        'Mueler'
        'Mülleer'
        'Mueller'
        'Nüller'
        'Nyller'
        'Niler'
        'Czerny'
        'Tscherny'
        'Czernie'
        'Tschernie'
        'Schernie'
        'Scherny'
        'Scherno'
        'Czerne'
        'Zerny'
        'Tzernie'
        'Breschnew'
     ) do:[:w |
         Transcript show:w; show:'->'; showCR:(PhoneticStringUtilities koelnerPhoneticCodeOf:w)
     ].
    "

    "
     PhoneticStringUtilities koelnerPhoneticCodeOf:'Breschnew'.   '17863'.
     PhoneticStringUtilities koelnerPhoneticCodeOf:'Breschneff'.  '17863'.
     PhoneticStringUtilities koelnerPhoneticCodeOf:'Braeschneff'. '17863'.
     PhoneticStringUtilities koelnerPhoneticCodeOf:'Braessneff'.  '17863'.
     PhoneticStringUtilities koelnerPhoneticCodeOf:'Pressneff'.   '17863'.
     PhoneticStringUtilities koelnerPhoneticCodeOf:'Presznäph'.   '17863'.
     PhoneticStringUtilities koelnerPhoneticCodeOf:'Preschnjiev'. '17863'.
    "
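
    "
     a hedged usage sketch (not part of the original examples): the codes listed
     above are all '17863', so comparing two of these spellings through the
     comparator is expected to answer true. Evaluate it to confirm:

     PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new
        does:'Breschnew' soundLike:'Pressneff'.
    "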
!

miracodeCodeOf:aString
    "return a miracode soundex phonetic code or nil.
     Miracode is a slightly modified soundex algorithm.
     Notice that there are better algorithms around (doubleMetaphone) "

    ^ (MiracodeStringComparator new phoneticStringsFor:aString) first

    "
     PhoneticStringUtilities miracodeCodeOf:'claus'
     PhoneticStringUtilities miracodeCodeOf:'clause'
     PhoneticStringUtilities miracodeCodeOf:'close'
     PhoneticStringUtilities miracodeCodeOf:'smalltalk'
     PhoneticStringUtilities miracodeCodeOf:'smaltalk'
     PhoneticStringUtilities miracodeCodeOf:'smaltak'
     PhoneticStringUtilities miracodeCodeOf:'smaltok'
     PhoneticStringUtilities miracodeCodeOf:'smoltok'
     PhoneticStringUtilities miracodeCodeOf:'aa'
     PhoneticStringUtilities miracodeCodeOf:'by'
     PhoneticStringUtilities miracodeCodeOf:'bab'
     PhoneticStringUtilities miracodeCodeOf:'bob'
     PhoneticStringUtilities miracodeCodeOf:'bop'
     PhoneticStringUtilities miracodeCodeOf:'pub'
    "

    "Created: / 28-07-2017 / 15:32:41 / cg"
!

mySQLSoundexCodeOf:aString
    "return the mySQL soundex code. The mysql soundex code is different from the miracode 'american' soundex
     (no 4char limitation; different order of duplicate vowel vs. duplicate code elimination).
     Notice that there are better algorithms around (doubleMetaphone) "

    ^ (MySQLSoundexStringComparator new phoneticStringsFor:aString) first

    "
     #(
        'Müller'
        'Miller'
        'Mueller'
        'Mühler'
        'Mühlherr'
        'Mülherr'
        'Myler'
        'Millar'
        'Myller'
        'Müllar'
        'Müler'
        'Muehler'
        'Mülller'
        'Müllerr'
        'Muehlherr'
        'Muellar'
        'Mueler'
        'Mülleer'
        'Mueller'
        'Nüller'
        'Nyller'
        'Niler'
        'Czerny'
        'Tscherny'
        'Czernie'
        'Tschernie'
        'Schernie'
        'Scherny'
        'Scherno'
        'Czerne'
        'Zerny'
        'Tzernie'
        'Breschnew'
     ) do:[:w |
         Transcript show:w; show:'->'; showCR:(PhoneticStringUtilities mySQLSoundexCodeOf:w)
     ].
    "

    "
     PhoneticStringUtilities mySQLSoundexCodeOf:'Breschnew'.
     PhoneticStringUtilities mySQLSoundexCodeOf:'Breschneff'.
     PhoneticStringUtilities mySQLSoundexCodeOf:'Braeschneff'.
     PhoneticStringUtilities mySQLSoundexCodeOf:'Braessneff'.
     PhoneticStringUtilities mySQLSoundexCodeOf:'Pressneff'.
     PhoneticStringUtilities mySQLSoundexCodeOf:'Presznäph'.
     PhoneticStringUtilities mySQLSoundexCodeOf:'Preschnjiev'.
    "

    "Modified (comment): / 28-07-2017 / 15:34:03 / cg"
!

soundexCodeOf:aString
    "return a soundex phonetic code or nil.
     Soundex (1918, 1922) returns similar codes for similar sounding words, making it a useful
     tool when searching for words where the correct spelling is unknown.
     (read Knuth or search the web if you don't know what a soundex code is).
     Caveat: 'similar sounding words' means: 'similar sounding in english'.
     Notice that there are better algorithms around (doubleMetaphone) "

    ^ (SoundexStringComparator new phoneticStringsFor:aString) first

"/ old code - now use code in private class...
"/    |inStream codeStream ch last lch codeLength codes code lastCode|
"/
"/    inStream := aString readStream.
"/    inStream skipSeparators.
"/    inStream atEnd ifTrue:[
"/        ^ nil
"/    ].
"/
"/    ch := inStream next.
"/    ch isLetter ifFalse:[
"/        ^ nil
"/    ].
"/    codeLength := 0.
"/
"/    codes := Dictionary new.
"/    codes atAll:'bpfv' put:$1.
"/    codes atAll:'cskgjqxz' put:$2.
"/    codes atAll:'dt' put:$3.
"/    codes atAll:'l' put:$4.
"/    codes atAll:'nm' put:$5.
"/    codes atAll:'r' put:$6.
"/
"/    codeStream := WriteStream on:(String new:4).
"/    codeStream nextPut:(ch asUppercase).
"/    last := ch asLowercase.
"/    lastCode := codes at:last ifAbsent:nil.
"/
"/    [inStream atEnd] whileFalse:[
"/        ch := inStream next.
"/        lch := ch asLowercase.
"/        lch = last ifFalse:[
"/            last := lch.
"/
"/            code := codes at:lch ifAbsent:nil.
"/            (code notNil and:[ code ~= lastCode]) ifTrue:[
"/                codeLength < 3 ifTrue:[
"/                    codeStream nextPut:code.
"/                    codeLength := codeLength + 1.
"/                    codeLength > 3 ifTrue:[^ codeStream contents].
"/                ].
"/            ].
"/            lastCode := code.
"/        ]
"/    ].
"/    [ codeLength < 3 ] whileTrue:[
"/        codeStream nextPut:$0.
"/        codeLength := codeLength + 1.
"/    ].
"/
"/    ^ codeStream contents

    "
     PhoneticStringUtilities soundexCodeOf:'claus'
     PhoneticStringUtilities soundexCodeOf:'clause'
     PhoneticStringUtilities soundexCodeOf:'close'
     PhoneticStringUtilities soundexCodeOf:'smalltalk'
     PhoneticStringUtilities soundexCodeOf:'smaltalk'
     PhoneticStringUtilities soundexCodeOf:'smaltak'
     PhoneticStringUtilities soundexCodeOf:'smaltok'
     PhoneticStringUtilities soundexCodeOf:'smoltok'
     PhoneticStringUtilities soundexCodeOf:'aa'
     PhoneticStringUtilities soundexCodeOf:'by'
     PhoneticStringUtilities soundexCodeOf:'bab'
     PhoneticStringUtilities soundexCodeOf:'bob'
     PhoneticStringUtilities soundexCodeOf:'bop'
    "

    "Modified (comment): / 28-07-2017 / 15:33:53 / cg"
! !

!PhoneticStringUtilities class methodsFor:'queries'!

isUtilityClass
    ^ self == PhoneticStringUtilities
! !

!PhoneticStringUtilities::PhoneticStringComparator class methodsFor:'constant'!

defaultClass
    ^SoundexStringComparator
! !

!PhoneticStringUtilities::PhoneticStringComparator class methodsFor:'documentation'!

documentation
"
    abstract superclass for various phonetic comparators.
    They return similar strings for similar sounding words, which can be used
    to find similar sounding words in a search list.

    Notice that some comparators are better for particular languages.
"
!

examples
"
    PhoneticStringUtilities::SoundexStringComparator new
        does:'miller' soundLike:'miler'.

    PhoneticStringUtilities::SoundexStringComparator new
        does:'miller' soundLike:'milner'.

    PhoneticStringUtilities::SoundexStringComparator new
        does:'müller' soundLike:'mueller'.

    PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new
        does:'müller' soundLike:'mueller'.
"
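
"
    a sketch of filtering a word list with a comparator; the word list is made up
    for illustration, and defaultClass answers SoundexStringComparator in this version:

    |cmp|
    cmp := PhoneticStringUtilities::PhoneticStringComparator defaultClass new.
    #('miler' 'milner' 'mueller' 'smith')
        select:[:each | cmp does:'miller' soundLike:each].
"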
! !

!PhoneticStringUtilities::PhoneticStringComparator class methodsFor:'instance creation'!

new
    ^ self basicNew initialize.
! !

!PhoneticStringUtilities::PhoneticStringComparator class methodsFor:'queries'!

isAbstract
    ^ self == PhoneticStringUtilities::PhoneticStringComparator
! !

!PhoneticStringUtilities::PhoneticStringComparator methodsFor:'api'!

does:aString soundLike:anotherString
    |translations1 translations2|

    translations1 := self phoneticStringsFor:aString.
    translations2 := self phoneticStringsFor:anotherString.

    ^ translations1 contains:[:t1 |
        translations2 contains:[:t2 | t1 = t2]]

    "
     PhoneticStringUtilities::SoundexStringComparator new
        does:'miller' soundLike:'miler'.

     PhoneticStringUtilities::SoundexStringComparator new
        does:'miller' soundLike:'milner'.

     PhoneticStringUtilities::SoundexStringComparator new
        does:'müller' soundLike:'mueller'.

     PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new
        does:'müller' soundLike:'mueller'.
    "
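
    "
     does:soundLike: answers true as soon as any code of the first word equals any
     code of the second. For a comparator that answers several codes per word, the
     same test can be spelled out by hand (a sketch; the two words are arbitrary).
     By construction, the last expression is expected to answer true:

     |cmp codes1 codes2|
     cmp := PhoneticStringUtilities::ExtendedSoundexStringComparator new.
     codes1 := cmp phoneticStringsFor:'miller'.
     codes2 := cmp phoneticStringsFor:'muler'.
     (codes1 contains:[:c | codes2 includes:c]) = (cmp does:'miller' soundLike:'muler').
    "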
|
    "Modified (comment): / 13-07-2017 / 17:51:43 / cg"
!

phoneticStringsFor: aString
    "Should answer an array of alternate phonetic strings for the given input string."

    self subclassResponsibility

    "
     (PhoneticStringUtilities::SoundexStringComparator new
        phoneticStringsFor:'miller') first

     'miller' asSoundexCode
    "

    "Modified (comment): / 27-07-2017 / 15:07:59 / cg"
! !

!PhoneticStringUtilities::PhoneticStringComparator methodsFor:'initialization'!

initialize
    "Invoked when a new instance is created."

    "/ please change as required (and remove this comment)

    "/ super initialize.   -- commented since inherited method does nothing
! !

!PhoneticStringUtilities::ExtendedSoundexStringComparator class methodsFor:'documentation'!

documentation
"
    There are many extended and enhanced soundex variants around;
    here is one, called 'extended soundex'. It is described for example in
        http://www.epidata.dk/documentation.php.
    An author or origin is unknown.

    The number of digits is increased to 5 or 8;
    The first character is not used literally; instead it is encoded like the rest.
    This might have a negative effect on names starting with a vowel, though.

    Overall, it can be doubted if this is really an enhancement after all.
"
! !

!PhoneticStringUtilities::ExtendedSoundexStringComparator methodsFor:'api'!

phoneticStringsFor:aString
    "generates both an extended soundex of length 5 and one of length 8"

    |first second u t prevCode|

    u := aString asUppercase.
    first := second := ''.
    u do:[:c |
        t := self translate:c.
        (t notNil and:[ t ~= '0' and:[ t ~= prevCode ]]) ifTrue:[
            first := first , t.
            second := second , t.
            second size == 8 ifTrue:[
                ^ Array with:(first copyTo:5) with:second
            ].
        ].
        prevCode := t
    ].
    [ first size < 5 ] whileTrue:[
        first := first , '0'.
        second := second , '0'.
    ].
    [ second size < 8 ] whileTrue:[
        second := second , '0'
    ].
    "/ first may have collected 6 or 7 codes at this point; cut it to the documented length of 5
    ^ Array with:(first copyTo:5) with:second

    "
     self basicNew phoneticStringsFor:'müller'    #('87900' '87900000')
     self basicNew phoneticStringsFor:'miller'    #('87900' '87900000')
     self basicNew phoneticStringsFor:'muller'    #('87900' '87900000')
     self basicNew phoneticStringsFor:'muler'     #('87900' '87900000')
     self basicNew phoneticStringsFor:'schmidt'   #('38600' '38600000')
     self basicNew phoneticStringsFor:'schneider' #('38690' '38690000')
     self basicNew phoneticStringsFor:'fischer'   #('23900' '23900000')
     self basicNew phoneticStringsFor:'weber'     #('19000' '19000000')
     self basicNew phoneticStringsFor:'meyer'     #('89000' '89000000')
     self basicNew phoneticStringsFor:'wagner'    #('48900' '48900000')
     self basicNew phoneticStringsFor:'schulz'    #('37500' '37500000')
     self basicNew phoneticStringsFor:'becker'    #('13900' '13900000')
     self basicNew phoneticStringsFor:'hoffmann'  #('28800' '28800000')
     self basicNew phoneticStringsFor:'schäfer'   #('32900' '32900000')
    "
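
    "
     worked trace for 'hoffmann', following translate: and the loop above
     (a reading aid, derived by hand from this method):

        H -> nil   (no code; prevCode becomes nil)
        O -> '0'   (vowel; not emitted, but resets prevCode)
        F -> '2'   (emitted)
        F -> '2'   (same as prevCode; skipped)
        M -> '8'   (emitted)
        A -> '0'   (vowel; resets prevCode)
        N -> '8'   (emitted again, because the vowel separated it from the M)
        N -> '8'   (same as prevCode; skipped)

     the collected '288' is then padded with zeros to both lengths:

     (PhoneticStringUtilities::ExtendedSoundexStringComparator new
        phoneticStringsFor:'hoffmann') first = '28800'.
    "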
! !

!PhoneticStringUtilities::ExtendedSoundexStringComparator methodsFor:'private'!

translate:aCharacter
    "use simple if's for more speed when compiled"

    "vowels serve as separators"
    aCharacter == $A ifTrue:[^ '0' ].
    aCharacter == $E ifTrue:[^ '0' ].
    aCharacter == $I ifTrue:[^ '0' ].
    aCharacter == $O ifTrue:[^ '0' ].
    aCharacter == $U ifTrue:[^ '0' ].
    aCharacter == $Y ifTrue:[^ '0' ].

    aCharacter == $B ifTrue:[^ '1' ].
    aCharacter == $P ifTrue:[^ '1' ].

    aCharacter == $F ifTrue:[^ '2' ].
    aCharacter == $V ifTrue:[^ '2' ].

    aCharacter == $C ifTrue:[^ '3' ].
    aCharacter == $S ifTrue:[^ '3' ].
    aCharacter == $K ifTrue:[^ '3' ].

    aCharacter == $G ifTrue:[^ '4' ].
    aCharacter == $J ifTrue:[^ '4' ].

    aCharacter == $Q ifTrue:[^ '5' ].
    aCharacter == $X ifTrue:[^ '5' ].
    aCharacter == $Z ifTrue:[^ '5' ].

    aCharacter == $D ifTrue:[^ '6' ].
    "/ $G is already mapped to '4' above; a second test for it here could never match
    aCharacter == $T ifTrue:[^ '6' ].

    aCharacter == $L ifTrue:[^ '7' ].

    aCharacter == $M ifTrue:[^ '8' ].
    aCharacter == $N ifTrue:[^ '8' ].

    aCharacter == $R ifTrue:[^ '9' ].
    ^ nil
! !

!PhoneticStringUtilities::SingleResultPhoneticStringComparator class methodsFor:'documentation'!

documentation
"
    abstract superclass for phonetic comparators which produce exactly one
    phonetic code per word. Subclasses implement encode:;
    phoneticStringsFor: wraps that single code in a one-element Array.

    [author:]
        cg

    [instance variables:]

    [class variables:]

    [see also:]

"
! !

!PhoneticStringUtilities::SingleResultPhoneticStringComparator methodsFor:'api'!

encode:word
    ^ self subclassResponsibility

    "Created: / 28-07-2017 / 15:20:49 / cg"
!

phoneticStringsFor:word
    ^ Array with:(self encode:word)

    "Created: / 28-07-2017 / 15:20:38 / cg"
! !

651 |
!PhoneticStringUtilities::MRAStringComparator class methodsFor:'documentation'! |
|
2208 | 652 |
|
653 |
documentation |
|
654 |
" |
|
4488 | 655 |
Match Rating Approach Encoder |
656 |
||
657 |
The Western Airlines matching rating approach name encoder |
|
658 |
||
659 |
[see also:] |
|
660 |
https://en.wikipedia.org/wiki/Match_Rating_Approach |
|
661 |
||
662 |
G.B. Moore, J.L. Kuhns, J.L. Treffzs, and C.A. Montgomery, |
|
663 |
''Accessing Individual Records from Personal Data Files Using Nonunique Identifiers'' |
|
664 |
US National Institute of Standards and Technology, SP-500-2 (1977), p. 17. |
|
2208 | 665 |
" |
4488 | 666 |
! |
667 |
||
668 |
rCode |
|
669 |
"<<END |
|
670 |
## Copyright (c) 2015, James P. Howard, II <jh@jameshoward.us> |
|
671 |
## |
|
672 |
## Redistribution and use in source and binary forms, with or without |
|
673 |
## modification, are permitted provided that the following conditions are |
|
674 |
## met: |
|
675 |
## |
|
676 |
## Redistributions of source code must retain the above copyright |
|
677 |
## notice, this list of conditions and the following disclaimer. |
|
678 |
## |
|
679 |
## Redistributions in binary form must reproduce the above copyright |
|
680 |
## notice, this list of conditions and the following disclaimer in |
|
681 |
## the documentation and/or other materials provided with the |
|
682 |
## distribution. |
|
683 |
## |
|
684 |
## THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS |
|
685 |
## "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT |
|
686 |
## LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR |
|
687 |
## A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT |
|
688 |
## HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, |
|
689 |
## SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT |
|
690 |
## LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, |
|
691 |
## DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY |
|
692 |
## THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT |
|
693 |
## (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE |
|
694 |
## OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. |
|
695 |
||
696 |
#' @rdname mra |
|
697 |
#' @title Match Rating Approach Encoder |
|
698 |
#' |
|
699 |
#' @description |
|
700 |
#' The Western Airlines matching rating approach name encoder |
|
701 |
#' |
|
702 |
#' @param word string or vector of strings to encode |
|
703 |
#' @param x MRA-encoded character vector |
|
704 |
#' @param y MRA-encoded character vector |
|
705 |
#' |
|
706 |
#' @details |
|
707 |
#' |
|
708 |
#' The variable \code{word} is the name to be encoded. The variable |
|
709 |
#' \code{maxCodeLen} is \emph{not} supported in this algorithm encoder |
|
710 |
#' because the algorithm itself is dependent upon its six-character |
|
711 |
#' length. The variables \code{x} and \code{y} are MRA-encoded and are |
|
712 |
#' compared to each other using the MRA comparison specification. |
|
713 |
#' |
|
714 |
#' @return The \code{mra_encode} function returns match rating approach |
|
715 |
#' encoded character vector. The \code{mra_compare} returns a boolean |
|
716 |
#' vector which is \code{TRUE} if \code{x} and \code{y} pass the MRA |
|
717 |
#' comparison test. |
|
718 |
#' |
|
719 |
#' @references |
|
720 |
#' |
|
721 |
#' G.B. Moore, J.L. Kuhns, J.L. Treffzs, and C.A. Montgomery, |
|
722 |
#' \emph{Accessing Individual Records from Personal Data Files Using |
|
723 |
#' Nonunique Identifiers,} US National Institute of Standards and |
|
724 |
#' Technology, SP-500-2 (1977), p. 17. |
|
725 |
#' |
|
726 |
#' @family phonics |
|
727 |
#' |
|
728 |
#' @examples |
|
729 |
#' mra_encode("William") |
|
730 |
#' mra_encode(c("Peter", "Peady")) |
|
731 |
#' mra_encode("Stevenson") |
|
732 |
||
733 |
#' @rdname mra |
|
734 |
#' @name mra_encode |
|
735 |
#' @export |
|
736 |
mra_encode <- function(word) { |
|
737 |
||
738 |
## First, remove any nonalphabetical characters and uppercase it |
|
739 |
word <- gsub("[^[:alpha:]]*", "", word) |
|
740 |
word <- toupper(word) |
|
741 |
||
742 |
## First character of key = first character of name |
|
743 |
first <- substr(word, 1, 1) |
|
744 |
word <- substr(word, 2, nchar(word)) |
|
745 |
||
746 |
## Delete vowels not at the start of the word |
|
747 |
word <- gsub("[AEIOU]", "", word) |
|
748 |
word <- paste(first, word, sep = "") |
|
749 |
||
750 |
## Remove duplicate consecutive characters |
|
751 |
word <- gsub("([A-Z])\\1+", "\\1", word) |
|
752 |
||
753 |
## If longer than 6 characters, take first and last 3...and we have |
|
754 |
## to vectorize it |
|
755 |
for(i in 1:length(word)) { |
|
756 |
if((l = nchar(word[i])) > 6) { |
|
757 |
first <- substr(word[i], 1, 3) |
|
758 |
last <- substr(word[i], l - 2, l) |
|
759 |
word[i] <- paste(first, last, sep = ""); |
|
760 |
} |
|
761 |
} |
|
762 |
||
763 |
return(word) |
|
764 |
} |
|
765 |
||
766 |
#' @rdname mra |
|
767 |
#' @name mra_compare |
|
768 |
#' @export |
|
769 |
mra_compare <- function(x, y) { |
|
770 |
mra <- data.frame(x = x, y = y, sim = 0, min = 100, stringsAsFactors = FALSE) |
|
771 |
||
772 |
## Obtain the minimum rating value by calculating the length sum of |
|
773 |
## the encoded strings and using table A (from Wikipedia). We start |
|
774 |
## by setting the minimum to be the sum and move from there. |
|
775 |
mra$lensum <- nchar(mra$x) + nchar(mra$y) |
|
776 |
mra$min[mra$lensum == 12] <- 2 |
|
777 |
mra$min[mra$lensum > 7 && mra$lensum <= 11] <- 3 |
|
778 |
mra$min[mra$lensum > 4 && mra$lensum <= 7] <- 4 |
|
779 |
mra$min[mra$lensum <= 4] <- 5 |
|
780 |
||
781 |
## If the length difference between the encoded strings is 3 or |
|
782 |
## greater, then no similarity comparison is done. For us, we |
|
783 |
## continue the similarity comparison out of laziness and ensure the |
|
784 |
## minimum is impossibly high to meet. |
|
785 |
mra$min[abs(nchar(mra$x) - nchar(mra$y)) >= 3] <- 100 |
|
786 |
||
787 |
## Start the comparison. |
|
788 |
x <- strsplit(mra$x, split = "") |
|
789 |
y <- strsplit(mra$y, split = "") |
|
790 |
rows <- nrow(mra) |
|
791 |
for(i in 1:rows) { |
|
792 |
## Process the encoded strings from left to right and remove any |
|
793 |
## identical characters found from both strings respectively. |
|
794 |
j <- 1 |
|
795 |
while(j < min(length(x[[i]]), length(y[[i]]))) { |
|
796 |
if(x[[i]][j] == y[[i]][j]) { |
|
797 |
x[[i]] <- x[[i]][-j] |
|
798 |
y[[i]] <- y[[i]][-j] |
|
799 |
} else |
|
800 |
j <- j + 1 |
|
801 |
} |
|
802 |
||
803 |
## Process the unmatched characters from right to left and |
|
804 |
## remove any identical characters found from both names |
|
805 |
## respectively. |
|
806 |
x[[i]] <- rev(x[[i]]) |
|
807 |
y[[i]] <- rev(y[[i]]) |
|
808 |
j <- 1 |
|
809 |
while(j < min(length(x[[i]]), length(y[[i]]))) { |
|
810 |
if(x[[i]][j] == y[[i]][j]) { |
|
811 |
x[[i]] <- x[[i]][-j] |
|
812 |
y[[i]] <- y[[i]][-j] |
|
813 |
} else |
|
814 |
j <- j + 1 |
|
815 |
} |
|
816 |
## Subtract the number of unmatched characters from 6 in the |
|
817 |
## longer string. This is the similarity rating. |
|
818 |
len <- max(length(x[[i]]), length(y[[i]])) |
|
819 |
mra$sim[i] <- 6 - len |
|
820 |
} |
|
821 |
||
822 |
## If the similarity is greater than or equal to the minimum |
|
823 |
## required, it is a successful match. |
|
824 |
mra$match <- (mra$sim >= mra$min) |
|
825 |
return(mra$match) |
|
826 |
} |
|
827 |
||
828 |
END>> |
|
2208 | 829 |
! ! |
830 |
||
4488 | 831 |
!PhoneticStringUtilities::MRAStringComparator methodsFor:'api'! |
832 |
||
833 |
encode:wordIn |
|
834 |
"see https://en.wikipedia.org/wiki/Match_Rating_Approach" |
|
835 |
||
836 |
|word prev| |
|
837 |
||
838 |
word := wordIn. |
|
839 |
||
840 |
"/ First, remove any nonalphabetical characters and uppercase it |
|
841 |
||
842 |
word := word select:#isLetter thenCollect:#asUppercase. |
|
843 |
||
844 |
"/ Delete vowels not at the start of the word |
|
845 |
||
846 |
word := word first asString , ((word from:2) reject:#isVowel). |
|
847 |
||
848 |
"/ Remove duplicate consecutive characters |
|
849 |
||
850 |
prev := nil. |
|
851 |
word := word |
|
852 |
collect:[:char | |
|
853 |
char == prev ifTrue:[ |
|
854 |
$* |
|
855 |
] ifFalse:[ |
|
856 |
prev := char. |
|
857 |
char. |
|
858 |
]. |
|
859 |
] |
|
860 |
thenSelect:[:char | char ~~ $*]. |
|
861 |
||
862 |
"/ If longer than 6 characters, take first and last 3 |
|
863 |
word size > 6 ifTrue:[ |
|
864 |
word := (word copyFirst:3),(word copyLast:3) |
|
2208 | 865 |
]. |
4488 | 866 |
^ word. |
2208 | 867 |
|
868 |
" |
|
4488 | 869 |
self new encode:'Catherine' -> 'CTHRN' |
870 |
self new encode:'CatherineCatherine' -> 'CTHHRN' |
|
871 |
self new encode:'Butter' -> 'BTR' |
|
872 |
self new encode:'Byrne' -> 'BYRN' |
|
873 |
self new encode:'Boern' -> 'BRN' |
|
874 |
self new encode:'Smith' -> 'SMTH' |
|
875 |
self new encode:'Smyth' -> 'SMYTH' |
|
876 |
self new encode:'Kathryn' -> 'KTHRYN' |
|
2211 | 877 |
" |
4486 | 878 |
|
4488 | 879 |
"Created: / 28-07-2017 / 15:19:22 / cg" |
880 |
"Modified (comment): / 31-07-2017 / 15:14:31 / cg" |
|
2208 | 881 |
! ! |
882 |
||
883 |
!PhoneticStringUtilities::SoundexStringComparator class methodsFor:'documentation'! |
|
884 |
||
885 |
documentation |
|
886 |
" |
|
3185
9833bbba2050
class: PhoneticStringUtilities
Claus Gittinger <cg@exept.de>
parents:
2580
diff
changeset
|
887 |
WARNING: this is the so-called 'simplified soundex' algorithm;
888 |
other variants exist, such as miracode (American soundex) or mysqlSoundex.
889 |
Be sure to use the matching algorithm if the generated codes must be compatible with another program
890 |
(otherwise, the differences are probably too small to be noticeable,
891 |
but search results can differ).
892 |
|
893 |
The following was copied from http://www.civilsolutions.com.au/publications/dedup.htm |
894 |
|
895 |
SOUNDEX is a phonetic coding algorithm that ignores many of the unreliable |
896 |
components of names, but by doing so reports more matches. |
897 |
|
898 |
There are some variations around in the literature; |
899 |
the following is called 'simplified soundex', and the rules for coding a name are: |
900 |
|
901 |
1. The first letter of the name is used in its un-coded form to serve as the prefix |
902 |
character of the code. (The rest of the code is numerical). |
903 |
|
904 |
2. Thereafter, W and H are ignored entirely. |
905 |
|
906 |
3. A, E, I, O, U, Y are not assigned a code number, but do serve as 'separators' (see Step 5).
907 |
|
908 |
4. Other letters of the name are converted to a numerical equivalent: |
909 |
B, P, F, V 1 |
910 |
C, G, J, K, Q, S, X, Z 2 |
911 |
D, T 3 |
912 |
L 4 |
913 |
M, N 5 |
914 |
R 6 |
915 |
|
916 |
5. There are two exceptions: |
917 |
1. Letters that follow prefix letters which would, if coded, have the same |
918 |
numerical code, are ignored in all cases unless a ''separator'' (see Step 3) precedes them. |
919 |
|
920 |
2. The second letter of any pair of consonants having the same code number is likewise ignored, |
921 |
i.e. unless there is a ''separator'' between them in the name. |
922 |
|
923 |
6. The final SOUNDEX code consists of the prefix letter plus three numerical characters. |
924 |
Longer codes are truncated to this length, and shorter codes are extended to it by adding zeros. |
925 |
|
926 |
Notice that in another variant, W and H are treated slightly differently.
927 |
This is only relevant if you need to reproduce the soundex codes of other programs
928 |
or of the original 1880 US census data.
3646 | 929 |
|
930 |
Also notice that soundex works best with English. |

931 |
For German and other languages, other algorithms may provide better results. |
|
2208 | 932 |
" |
933 |
! ! |
|
934 |
||
935 |
!PhoneticStringUtilities::SoundexStringComparator methodsFor:'api'! |
|
936 |
||
4488 | 937 |
encode:word |
2208 | 938 |
|u p t prevCode| |
939 |
||
4488 | 940 |
u := word asUppercase. |
2208 | 941 |
p := u first asString. |
942 |
prevCode := self translate:u first. |
|
943 |
u from:2 to:u size do:[:c | |
|
944 |
t := self translate:c. |
|
945 |
(t notNil and:[ t ~= '0' and:[ t ~= prevCode ]]) ifTrue:[ |
|
946 |
p := p , t. |
|
4488 | 947 |
p size == 4 ifTrue:[^ p ]. |
2208 | 948 |
]. |
949 |
prevCode := t |
|
950 |
]. |
|
951 |
[ p size < 4 ] whileTrue:[ |
|
952 |
p := p , '0' |
|
953 |
]. |
|
4488 | 954 |
^ (p copyFrom:1 to:4) |
955 |
||
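"
 Usage sketch (added for illustration, not part of the original source).
 Results were traced by hand from encode: and translate:, not run in an image:

 self new encode:'Robert'    -> 'R163'
 self new encode:'Rupert'    -> 'R163'
 self new encode:'Pfister'   -> 'P236'
 self new encode:'Lee'       -> 'L000'
 self new encode:'Ashcraft'  -> 'A226'

 In the last example, H has no code but resets prevCode, so both S and C
 are coded; American soundex (miracode) would give 'A261' instead.
"
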
956 |
"Created: / 28-07-2017 / 15:21:23 / cg" |
|
2208 | 957 |
! ! |
958 |
||
959 |
!PhoneticStringUtilities::SoundexStringComparator methodsFor:'private'! |
|
960 |
||
961 |
translate:aCharacter |
|
962 |
"use simple if's for more speed when compiled" |
|
963 |
||
964 |
"vowels serve as separators" |
|
965 |
aCharacter == $A ifTrue:[^ '0' ]. |
|
966 |
aCharacter == $E ifTrue:[^ '0' ]. |
|
967 |
aCharacter == $I ifTrue:[^ '0' ]. |
|
968 |
aCharacter == $O ifTrue:[^ '0' ]. |
|
969 |
aCharacter == $U ifTrue:[^ '0' ]. |
|
970 |
aCharacter == $Y ifTrue:[^ '0' ]. |
|
971 |
||
972 |
aCharacter == $B ifTrue:[^ '1' ]. |
|
973 |
aCharacter == $P ifTrue:[^ '1' ]. |
|
974 |
aCharacter == $F ifTrue:[^ '1' ]. |
|
975 |
aCharacter == $V ifTrue:[^ '1' ]. |
|
976 |
||
977 |
aCharacter == $C ifTrue:[^ '2' ]. |
|
978 |
aCharacter == $S ifTrue:[^ '2' ]. |
|
979 |
aCharacter == $K ifTrue:[^ '2' ]. |
|
980 |
aCharacter == $G ifTrue:[^ '2' ]. |
|
981 |
aCharacter == $J ifTrue:[^ '2' ]. |
|
982 |
aCharacter == $Q ifTrue:[^ '2' ]. |
|
983 |
aCharacter == $X ifTrue:[^ '2' ]. |
|
984 |
aCharacter == $Z ifTrue:[^ '2' ]. |
|
985 |
||
986 |
aCharacter == $D ifTrue:[^ '3' ]. |
|
987 |
aCharacter == $T ifTrue:[^ '3' ]. |
|
988 |
||
989 |
aCharacter == $L ifTrue:[^ '4' ]. |
|
990 |
||
991 |
aCharacter == $M ifTrue:[^ '5' ]. |
|
992 |
aCharacter == $N ifTrue:[^ '5' ]. |
|
993 |
||
994 |
aCharacter == $R ifTrue:[^ '6' ]. |
|
995 |
^ nil |
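
"
 Usage sketch (added for illustration, not part of the original source).
 Results were read off the comparisons above, not run in an image.
 Only uppercase letters are recognized; callers are expected to
 uppercase the word first, as encode: does:

 self new translate:$A    -> '0'
 self new translate:$B    -> '1'
 self new translate:$R    -> '6'
 self new translate:$H    -> nil
 self new translate:$b    -> nil
"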
|
996 |
! ! |
|
997 |
||
998 |
!PhoneticStringUtilities::MySQLSoundexStringComparator class methodsFor:'documentation'! |
|
999 |
||
1000 |
documentation |
|
1001 |
" |
|
3185 | 1002 |
MySQL soundex is like American Soundex (i.e. miracode), but without the 4-character limit;
1003 |
it also removes vowels first and then removes duplicate codes
1004 |
(whereas the soundex code does this in reverse order).
1005 |
|
4133 | 1006 |
These variations are important, if you need the miracode soundex codes to be generated. |
2208 | 1007 |
" |
1008 |
! ! |
|
1009 |
||
1010 |
!PhoneticStringUtilities::MySQLSoundexStringComparator methodsFor:'api'! |
|
1011 |
||
4488 | 1012 |
encode:word |
2208 | 1013 |
|u p t prevCode| |
1014 |
||
4488 | 1015 |
u := word asUppercase. |
2208 | 1016 |
p := u first asString. |
1017 |
prevCode := self translate:u first. |
|
1018 |
u from:2 to:u size do:[:c | |
|
1019 |
t := self translate:c. |
|
1020 |
(t notNil and:[ t ~= '0' and:[ t ~= prevCode ]]) ifTrue:[ |
|
1021 |
p := p , t. |
|
1022 |
]. |
|
1023 |
(t ~= '0' and:[ c ~= $W and:[c ~= $H]]) ifTrue:[ |
|
1024 |
prevCode := t. |
|
1025 |
]. |
|
1026 |
]. |
|
1027 |
[ p size < 4 ] whileTrue:[ |
|
1028 |
p := p , '0' |
|
1029 |
]. |
|
4488 | 1030 |
^ p |
1031 |
||
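"
 Usage sketch (added for illustration, not part of the original source).
 Results were traced by hand, not run in an image, and assume that
 translate: is the mapping of SoundexStringComparator (i.e. that it is
 inherited; the class definition is not shown in this part of the file):

 self new encode:'Robert'      -> 'R163'
 self new encode:'Lee'         -> 'L000'
 self new encode:'Ashcraft'    -> 'A2613'
 self new encode:'Washington'  -> 'W25235'

 Codes are padded to at least 4 characters but never truncated.
 Vowels, W and H do not reset prevCode here, so the C in Ashcraft
 is dropped as a duplicate of S.
"
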
1032 |
"Created: / 28-07-2017 / 15:23:41 / cg" |
|
1033 |
"Modified: / 31-07-2017 / 17:53:51 / cg" |
|
2208 | 1034 |
! ! |
1035 |
||
1036 |
!PhoneticStringUtilities::NYSIISStringComparator class methodsFor:'documentation'! |
|
1037 |
||
1038 |
documentation |
|
1039 |
" |
|
3185 | 1040 |
NYSIIS Algorithm: |
1041 |
|
1042 |
1. |
1043 |
remove all ''S'' and ''Z'' chars from the end of the surname |
1044 |
|
1045 |
2. |
1046 |
transcode initial strings |
1047 |
MAC => MC |
1048 |
PF => F |
1049 |
|
1050 |
3. |
1051 |
Transcode trailing strings as follows, |
1052 |
|
1053 |
IX => IC |
1054 |
EX => EC |
1055 |
YE,EE,IE => Y |
1056 |
NT,ND => D |
1057 |
|
1058 |
4. |
1059 |
transcode ''EV'' to ''EF'' if not at start of name |
1060 |
|
1061 |
5. |
1062 |
use first character of name as first character of key |
1063 |
|
1064 |
6. |
1065 |
remove any ''W'' that follows a vowel |
1066 |
|
1067 |
7. |
1068 |
replace all vowels with ''A'' |
1069 |
|
1070 |
8. |
1071 |
transcode ''GHT'' to ''GT'' |
1072 |
|
1073 |
9. |
1074 |
transcode ''DG'' to ''G'' |
1075 |
|
1076 |
10. |
1077 |
transcode ''PH'' to ''F'' |
1078 |
|
1079 |
11. |
1080 |
if not first character, eliminate all ''H'' preceded or followed by a vowel |
1081 |
|
1082 |
12. |
1083 |
change ''KN'' to ''N'', else ''K'' to ''C'' |
1084 |
|
1085 |
13. |
1086 |
if not first character, change ''M'' to ''N'' |
1087 |
|
1088 |
14. |
1089 |
if not first character, change ''Q'' to ''G'' |
1090 |
|
1091 |
15. |
1092 |
transcode ''SH'' to ''S'' |
1093 |
|
1094 |
16. |
1095 |
transcode ''SCH'' to ''S'' |
1096 |
|
1097 |
17. |
1098 |
transcode ''YW'' to ''Y'' |
1099 |
|
1100 |
18. |
1101 |
if not first or last character, change ''Y'' to ''A'' |
1102 |
|
1103 |
19. |
1104 |
transcode ''WR'' to ''R'' |
1105 |
|
1106 |
20. |
1107 |
if not first character, change ''Z'' to ''S'' |
1108 |
|
1109 |
21. |
1110 |
transcode terminal ''AY'' to ''Y'' |
1111 |
|
1112 |
22. |
1113 |
remove trailing vowels
1114 |
|
1115 |
23. |
1116 |
collapse all strings of repeated characters |
1117 |
|
1118 |
24. |
1119 |
if first char of original surname was a vowel, append it to the code |
2208 | 1120 |
" |
1121 |
! ! |
|
1122 |
||
1123 |
!PhoneticStringUtilities::NYSIISStringComparator methodsFor:'api'! |
|
1124 |
||
4488 | 1125 |
encode:aString |
2208 | 1126 |
|k| |
1127 |
||
1128 |
k := self rule1:(aString asUppercase). |
|
1129 |
k := self rule2:k. |
|
1130 |
k := self rule3:k. |
|
1131 |
k := self rule4:k. |
|
1132 |
k := self rule5:k. |
|
1133 |
k := self rule6:k. |
|
1134 |
k := self rule7:k. |
|
1135 |
k := self rule8:k. |
|
1136 |
k := self rule9:k. |
|
1137 |
k := self rule10:k. |
|
1138 |
k := self rule11:k. |
|
1139 |
k := self rule12:k. |
|
1140 |
k := self rule13:k. |
|
1141 |
k := self rule14:k. |
|
1142 |
k := self rule15:k. |
|
1143 |
k := self rule16:k. |
|
1144 |
k := self rule17:k. |
|
1145 |
k := self rule18:k. |
|
1146 |
k := self rule19:k. |
|
1147 |
k := self rule20:k. |
|
1148 |
k := self rule21:k. |
|
1149 |
k := self rule22:k. |
|
1150 |
k := self rule23:k. |
|
1151 |
k := self rule24:k originalKey:aString. |
|
4488 | 1152 |
^ k |
1153 |
||
1154 |
" |
|
1155 |
self new encode:'hello' |
|
1156 |
self new encode:'bliss' |
|
1157 |
" |
|
2208 | 1158 |
" |
1159 |
self new phoneticStringsFor:'hello' |
|
3839 | 1160 |
self new phoneticStringsFor:'bliss' |
2208 | 1161 |
" |
4488 | 1162 |
|
1163 |
"Created: / 28-07-2017 / 15:34:52 / cg" |
|
2208 | 1164 |
! ! |
1165 |
||
1166 |
!PhoneticStringUtilities::NYSIISStringComparator methodsFor:'private'! |
|
1167 |
||
1168 |
rule10:key |
|
1169 |
"10. transcode 'PH' to 'F' " |
|
1170 |
||
1171 |
^ self |
|
1172 |
transcodeAll:'PH' |
|
1173 |
of:key |
|
1174 |
to:'F' |
|
1175 |
startingAt:1 |
|
1176 |
! |
|
1177 |
||
1178 |
rule11:key |
|
1179 |
|k c| |
|
1180 |
||
1181 |
"11. if not first character, eliminate all 'H' preceded or followed by a vowel " |
|
1182 |
k := key copy. |
|
1183 |
c := SortedCollection sortBlock:[:a :b | b < a ]. |
|
1184 |
2 to:key size do:[:i | |
|
1185 |
(key at:i) = $H ifTrue:[ |
|
1186 |
((key at:i - 1) isVowel |
|
1187 |
or:[ (i < key size) and:[ (key at:i + 1) isVowel ] ]) ifTrue:[ c add:i ] |
|
1188 |
] |
|
1189 |
]. |
|
1190 |
c do:[:n | |
|
1191 |
k := (k copyFrom:1 to:n - 1) , (k copyFrom:n + 1 to:k size) |
|
1192 |
]. |
|
1193 |
^ k |
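
"
 Usage sketch (added for illustration, not part of the original source).
 Results were traced by hand, not run in an image; the argument is
 expected to be uppercase already, as in encode:

 self new rule11:'AHA'       -> 'AA'
 self new rule11:'HAHN'      -> 'HAN'
 self new rule11:'SCHMIDT'   -> 'SCHMIDT'

 A leading H is kept, and an H between two consonants is kept as well.
"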
|
1194 |
! |
|
1195 |
||
1196 |
rule12:key |
|
1197 |
|k| |
|
1198 |
||
1199 |
"12. change 'KN' to 'N', else 'K' to 'C' " |
|
1200 |
k := self |
|
1201 |
transcodeAll:'KN' |
|
1202 |
of:key |
|
1203 |
to:'N' |
|
1204 |
startingAt:1. |
|
1205 |
k := self |
|
1206 |
transcodeAll:'K' |
|
1207 |
of:k |
|
1208 |
to:'C' |
|
1209 |
startingAt:1. |
|
1210 |
^ k |
|
1211 |
! |
|
1212 |
||
1213 |
rule13:key |
|
1214 |
"13. if not first character, change 'M' to 'N' " |
|
1215 |
||
1216 |
^ self |
|
1217 |
transcodeAll:'M' |
|
1218 |
of:key |
|
1219 |
to:'N' |
|
1220 |
startingAt:2 |
|
1221 |
! |
|
1222 |
||
1223 |
rule14:key |
|
1224 |
"14. if not first character, change 'Q' to 'G' " |
|
1225 |
||
1226 |
^ self |
|
1227 |
transcodeAll:'Q' |
|
1228 |
of:key |
|
1229 |
to:'G' |
|
1230 |
startingAt:2 |
|
1231 |
! |
|
1232 |
||
1233 |
rule15:key |
|
1234 |
"15. transcode 'SH' to 'S' " |
|
1235 |
||
1236 |
^ self |
|
1237 |
transcodeAll:'SH' |
|
1238 |
of:key |
|
1239 |
to:'S' |
|
1240 |
startingAt:1 |
|
1241 |
! |
|
1242 |
||
1243 |
rule16:key |
|
1244 |
"16. transcode 'SCH' to 'S' " |
|
1245 |
||
1246 |
^ self |
|
1247 |
transcodeAll:'SCH' |
|
1248 |
of:key |
|
1249 |
to:'S' |
|
1250 |
startingAt:1 |
|
1251 |
! |
|
1252 |
||
1253 |
rule17:key |
|
1254 |
"17. transcode 'YW' to 'Y' " |
|
1255 |
||
1256 |
^ self |
|
1257 |
transcodeAll:'YW' |
|
1258 |
of:key |
|
1259 |
to:'Y' |
|
1260 |
startingAt:1 |
|
1261 |
! |
|
1262 |
||
1263 |
rule18:key |
|
1264 |
|k| |
|
1265 |
||
1266 |
"18. if not first or last character, change 'Y' to 'A' " |
|
1267 |
k := self |
|
1268 |
transcodeAll:'Y' |
|
1269 |
of:key |
|
1270 |
to:'A' |
|
1271 |
startingAt:2. |
|
1272 |
key last = $Y ifTrue:[ |
|
1273 |
k at:k size put:$Y |
|
1274 |
]. |
|
1275 |
^ k |
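
"
 Usage sketch (added for illustration, not part of the original source).
 Results were traced by hand, not run in an image:

 self new rule18:'MAYER'     -> 'MAAER'
 self new rule18:'KELLY'     -> 'KELLY'
 self new rule18:'YVONNE'    -> 'YVONNE'

 A final Y is first replaced and then restored; a leading Y is never touched.
"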
|
1276 |
! |
|
1277 |
||
1278 |
rule19:key |
|
1279 |
"19. transcode 'WR' to 'R' " |
|
1280 |
||
1281 |
^ self |
|
1282 |
transcodeAll:'WR' |
|
1283 |
of:key |
|
1284 |
to:'R' |
|
1285 |
startingAt:1 |
|
1286 |
! |
|
1287 |
||
1288 |
rule1:key |
|
1289 |
|k| |
|
1290 |
||
1291 |
k := key copy. |
|
1292 |
"1. Remove all 'S' and 'Z' chars from the end of the name" |
|
1293 |
[ |
|
3839 | 1294 |
k notEmpty and:[ 'SZ' includes:k last ]
2208 | 1295 |
] whileTrue:[ k := k copyFrom:1 to:(k size - 1) ]. |
1296 |
^ k |
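
"
 Usage sketch (added for illustration, not part of the original source).
 Results were traced by hand, not run in an image; the argument is
 expected to be uppercase already, as in encode:

 self new rule1:'BLISS'      -> 'BLI'
 self new rule1:'SCHULZ'     -> 'SCHUL'
 self new rule1:'SMITH'      -> 'SMITH'
"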
|
1297 |
! |
|
1298 |
||
1299 |
rule20:key |
|
1300 |
"20. if not first character, change 'Z' to 'S' " |
|
1301 |
||
1302 |
^ self |
|
1303 |
transcodeAll:'Z' |
|
1304 |
of:key |
|
1305 |
to:'S' |
|
1306 |
startingAt:2 |
|
1307 |
! |
|
1308 |
||
1309 |
rule21:key |
|
1310 |
"21. transcode terminal 'AY' to 'Y' " |
|
1311 |
||
1312 |
^ self |
|
1313 |
transcodeAll:'AY' |
|
1314 |
of:key |
|
1315 |
to:'Y' |
|
1316 |
startingAt:key size - 1 |
|
1317 |
! |
|
1318 |
||
1319 |
rule22:key |
|
1320 |
|k| |
|
1321 |
||
1322 |
"22. remove trailing vowels " |
|
1323 |
k := key copy. |
|
1324 |
[ k notEmpty and:[ k last isVowel ] ] whileTrue:[ |
|
1325 |
k := k copyFrom:1 to:k size - 1 |
|
1326 |
]. |
|
1327 |
^ k |
|
1328 |
! |
|
1329 |
||
1330 |
rule23:key |
|
1331 |
|k c| |
|
1332 |
||
1333 |
"23. collapse all strings of repeated characters " |
|
1334 |
k := key copy. |
|
1335 |
c := SortedCollection sortBlock:[:a :b | b < a ]. |
|
1336 |
2 to:k size do:[:i | |
|
1337 |
(k at:i) = (k at:i - 1) ifTrue:[ |
|
1338 |
c add:i |
|
1339 |
] |
|
1340 |
]. |
|
1341 |
c do:[:n | |
|
1342 |
k := (k copyFrom:1 to:n - 1) , (k copyFrom:n + 1 to:k size) |
|
1343 |
]. |
|
1344 |
^ k |
|
1345 |
! |
|
1346 |
||
1347 |
rule24:key originalKey:originalKey |
|
1348 |
|k| |
|
1349 |
||
1350 |
"24. if first char of original surname was a vowel, append it to the code" |
|
1351 |
k := key copy. |
|
1352 |
originalKey first isVowel ifTrue:[ |
|
1353 |
k := k , originalKey first asString asUppercase |
|
1354 |
]. |
|
1355 |
^ k |
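
"
 Usage sketch (added for illustration, not part of the original source).
 Results were traced by hand, not run in an image; the key arguments
 are made-up intermediate values, not real NYSIIS codes:

 self new rule24:'LS' originalKey:'Ellis'    -> 'LSE'
 self new rule24:'SNT' originalKey:'Smith'   -> 'SNT'
"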
|
1356 |
! |
|
1357 |
||
1358 |
rule2:key |
|
1359 |
|k| |
|
1360 |
||
1361 |
k := key copy. |
|
1362 |
"2. Transcode initial strings: MAC => MC PF => F" |
|
4184 | 1363 |
(k startsWith:'MAC') ifTrue:[ |
1364 |
k := 'MC' , (k copyFrom:4) |
|
2208 | 1365 |
]. |
4184 | 1366 |
(k startsWith:'PF') ifTrue:[ |
1367 |
k := 'F' , (k copyFrom:3) |
|
2208 | 1368 |
]. |
1369 |
^ k |
|
1370 |
! |
|
1371 |
||
1372 |
rule3:key |
|
1373 |
|k| |
|
1374 |
||
1375 |
"3. Transcode trailing strings as follows: |
|
1376 |
IX => IC |
|
1377 |
EX => EC |
|
1378 |
YE, EE, IE => Y |
|
1379 |
NT, ND => D" |
|
1380 |
k := key copy. |
|
1381 |
k := self |
|
1382 |
transcodeTrailing:#( 'IX' ) |
|
1383 |
of:k |
|
1384 |
to:'IC'. |
|
1385 |
k := self |
|
1386 |
transcodeTrailing:#( 'EX' ) |
|
1387 |
of:k |
|
1388 |
to:'EC'. |
|
1389 |
k := self |
|
1390 |
transcodeTrailing:#( 'YE' 'EE' 'IE' ) |
|
1391 |
of:k |
|
1392 |
to:'Y'. |
|
1393 |
k := self |
|
1394 |
transcodeTrailing:#( 'NT' 'ND' ) |
|
1395 |
of:k |
|
1396 |
to:'D'. |
|
1397 |
^ k |
|
1398 |
! |
|
1399 |
||
1400 |
rule4:key |
|
1401 |
"4. Transcode 'EV' to 'EF' if not at start of name" |
|
1402 |
||
1403 |
^ self |
|
1404 |
transcodeAll:'EV' |
|
1405 |
of:key |
|
1406 |
to:'EF' |
|
1407 |
startingAt:2 |
|
1408 |
! |
|
1409 |
||
1410 |
rule5:key |
|
1411 |
"5. Use first character of name as first character of key. Ignored because we're doing an in-place conversion" |
|
1412 |
||
1413 |
^ key |
|
1414 |
! |
|
1415 |
||
1416 |
rule6:key |
|
1417 |
|k i| |
|
1418 |
||
1419 |
"6. Remove any 'W' that follows a vowel" |
|
1420 |
k := key copy. |
|
1421 |
i := 2. |
|
1422 |
[ |
|
1423 |
(i := k indexOf:$W startingAt:i) > 0 |
|
1424 |
] whileTrue:[ |
|
1425 |
(k at:i - 1) isVowel ifTrue:[ |
|
1426 |
k := (k copyFrom:1 to:i - 1) , (k copyFrom:i + 1 to:k size). |
|
1427 |
i := i - 1 |
|
1428 |
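] ifFalse:[ i := i + 1 ]   "skip a W that does not follow a vowel; otherwise the search finds it again forever" |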
|
1429 |
]. |
|
1430 |
^ k |
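
"
 Usage sketch (added for illustration, not part of the original source).
 Results were traced by hand, not run in an image. Both inputs contain
 only a W that follows a vowel:

 self new rule6:'LOWE'       -> 'LOE'
 self new rule6:'HOWARD'     -> 'HOARD'
"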
|
1431 |
! |
|
1432 |
||
1433 |
rule7:key |
|
1434 |
|k| |
|
1435 |
||
1436 |
"7. replace all vowels with 'A' " |
|
1437 |
k := key copy. |
|
1438 |
1 to:key size do:[:i | |
|
1439 |
(key at:i) isVowel ifTrue:[ |
|
1440 |
k at:i put:$A |
|
1441 |
] |
|
1442 |
]. |
|
1443 |
^ k |
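
"
 Usage sketch (added for illustration, not part of the original source).
 Results were traced by hand, not run in an image. Y is not treated as
 a vowel by isVowel (compare the 'Byrne' -> 'BYRN' example of the MRA encoder):

 self new rule7:'HELLO'      -> 'HALLA'
 self new rule7:'BYRNE'      -> 'BYRNA'
"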
|
1444 |
! |
|
1445 |
||
1446 |
rule8:key |
|
1447 |
"8. transcode 'GHT' to 'GT' " |
|
1448 |
||
1449 |
^ self |
|
1450 |
transcodeAll:'GHT' |
|
1451 |
of:key |
|
1452 |
to:'GT' |
|
1453 |
startingAt:1 |
|
1454 |
! |
|
1455 |
||
1456 |
rule9:key |
|
1457 |
"9. transcode 'DG' to 'G' " |
|
1458 |
||
1459 |
^ self |
|
1460 |
transcodeAll:'DG' |
|
1461 |
of:key |
|
1462 |
to:'G' |
|
1463 |
startingAt:1 |
|
1464 |
! |
|
1465 |
||
1466 |
transcodeAll:aString of:key to:replacementString startingAt:start |
|
1467 |
|k i| |
|
1468 |
||
1469 |
k := key copy. |
|
1470 |
[ |
|
1471 |
(i := k indexOfSubCollection:aString startingAt:start) > 0 |
|
1472 |
] whileTrue:[ |
|
1473 |
k := (k copyFrom:1 to:i - 1) , replacementString |
|
1474 |
, (k copyFrom:i + aString size to:k size) |
|
1475 |
]. |
|
1476 |
^ k |
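
"
 Usage sketch, not part of the original source. It assumes self is this
 comparator class. The search always restarts at the given start index, so
 the replacement string should not itself contain the search string.

    self basicNew transcodeAll:'GHT' of:'KNIGHT' to:'GT' startingAt:1
        answers 'KNIGT'
    self basicNew transcodeAll:'EV' of:'EVANS' to:'EF' startingAt:2
        answers 'EVANS' (the only match lies before the start index)
"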
|
1477 |
! |
|
1478 |
||
1479 |
transcodeTrailing:anArrayOfStrings of:key to:replacementString |
|
1480 |
|answer| |
|
1481 |
||
1482 |
answer := key copy. |
|
1483 |
anArrayOfStrings do:[:aString | |
|
1484 |
answer := self |
|
1485 |
transcodeAll:aString |
|
1486 |
of:answer |
|
1487 |
to:replacementString |
|
1488 |
startingAt:((answer size - aString size) + 1 max:1) |
|
1489 |
]. |
|
1490 |
^ answer |
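
"
 Usage sketch, not part of the original source. It assumes self is this
 comparator class. Each suffix is looked up only at the position where it
 would end the key; the sample keys are illustrative.

    self basicNew transcodeTrailing:#( 'NT' 'ND' ) of:'GRANT' to:'D'
        answers 'GRAD'
    self basicNew transcodeTrailing:#( 'YE' 'EE' 'IE' ) of:'LEE' to:'Y'
        answers 'LY'
"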
|
1491 |
! ! |
|
1492 |
||
2211 | 1493 |
!PhoneticStringUtilities::PhonemStringComparator class methodsFor:'documentation'! |
1494 |
||
1495 |
documentation |
|
1496 |
" |
|
3185
9833bbba2050
class: PhoneticStringUtilities
Claus Gittinger <cg@exept.de>
parents:
2580
diff
changeset
|
1497 |
Implementation of the PHONEM algorithm, as described in |
|
1498 |
'Georg Wilde and Carsten Meyer, Doppelgaenger gesucht - |
|
1499 |
Ein Programm fuer kontextsensitive phonetische Textumwandlung |
|
1500 |
ct Magazin fuer Computer & Technik 25/1998' |
3646 | 1501 |
|
1502 |
This algorithm handles the German language better (it takes umlauts into account) |
|
2211 | 1503 |
" |
1504 |
! ! |
|
1505 |
||
1506 |
!PhoneticStringUtilities::PhonemStringComparator methodsFor:'api'! |
|
1507 |
||
4488 | 1508 |
encode:aString |
2211 | 1509 |
|s idx t t2| |
1510 |
||
1511 |
s := aString asUppercase. |
|
1512 |
||
1513 |
idx := 1. |
|
1514 |
[idx < (s size-1)] whileTrue:[ |
|
1515 |
t2 := nil. |
|
1516 |
t := s copyFrom:idx to:idx+1. |
|
1517 |
t = 'SC' ifTrue:[ t2 := 'C' ] |
|
1518 |
ifFalse:[ t = 'SZ' ifTrue:[ t2 := 'C' ] |
|
1519 |
ifFalse:[ t = 'CZ' ifTrue:[ t2 := 'C' ] |
|
1520 |
ifFalse:[ t = 'TZ' ifTrue:[ t2 := 'C' ] |
|
1521 |
ifFalse:[ t = 'TS' ifTrue:[ t2 := 'C' ] |
|
1522 |
ifFalse:[ t = 'KS' ifTrue:[ t2 := 'X' ] |
|
1523 |
ifFalse:[ t = 'PF' ifTrue:[ t2 := 'V' ] |
|
1524 |
ifFalse:[ t = 'QU' ifTrue:[ t2 := 'KW' ] |
|
1525 |
ifFalse:[ t = 'PH' ifTrue:[ t2 := 'V' ] |
|
1526 |
ifFalse:[ t = 'UE' ifTrue:[ t2 := 'Y' ] |
|
1527 |
ifFalse:[ t = 'AE' ifTrue:[ t2 := 'E' ] |
|
4488 | 1528 |
ifFalse:[ t = 'OE' ifTrue:[ t2 := 'Ö' ] |
2211 | 1529 |
ifFalse:[ t = 'EI' ifTrue:[ t2 := 'AY' ] |
1530 |
ifFalse:[ t = 'EY' ifTrue:[ t2 := 'AY' ] |
|
1531 |
ifFalse:[ t = 'EU' ifTrue:[ t2 := 'OY' ] |
|
4488 | 1532 |
ifFalse:[ t = 'AU' ifTrue:[ t2 := 'A§' ] |
1533 |
ifFalse:[ t = 'OU' ifTrue:[ t2 := '§ ' ]]]]]]]]]]]]]]]]]. |
|
2211 | 1534 |
t2 notNil ifTrue:[ |
1535 |
s := (s copyTo:idx-1),t2,(s copyFrom:idx+2) |
|
1536 |
] ifFalse:[ |
|
1537 |
idx := idx + 1. |
|
1538 |
]. |
|
1539 |
]. |
|
1540 |
||
1541 |
"/ single character substitutions via tr |
|
4488 | 1542 |
s := s copyTransliterating:'ÖÄZKGQÜIJFWPT§' to:'YECCCCYYYVVDDUA'. |
2211 | 1543 |
s := s copyTransliterating:'ABCDLMNORSUVWXY' to:'' complement:true squashDuplicates:false. |
1544 |
s := s copyTransliterating:'ABCDLMNORSUVWXY' to:'ABCDLMNORSUVWXY' complement:false squashDuplicates:true. |
|
4488 | 1545 |
^ s |
2211 | 1546 |
|
1547 |
" |
|
4488 | 1548 |
self basicNew encode:'müller' -> 'MYLR' |
1549 |
self basicNew encode:'mueller' -> 'MYLR' |
|
1550 |
self basicNew encode:'möller' -> 'MYLR' |
|
1551 |
self basicNew encode:'miller' -> 'MYLR' |
|
1552 |
self basicNew encode:'muller' -> 'MULR' |
|
1553 |
self basicNew encode:'muler' -> 'MULR' |
|
1554 |
||
1555 |
self basicNew phoneticStringsFor:'müller' #('MYLR') |
|
3646 | 1556 |
self basicNew phoneticStringsFor:'mueller' #('MYLR') |
4488 | 1557 |
self basicNew phoneticStringsFor:'möller' #('MYLR') |
2211 | 1558 |
self basicNew phoneticStringsFor:'miller' #('MYLR') |
1559 |
self basicNew phoneticStringsFor:'muller' #('MULR') |
|
1560 |
self basicNew phoneticStringsFor:'muler' #('MULR') |
|
4488 | 1561 |
|
2211 | 1562 |
self basicNew phoneticStringsFor:'schmidt' #('CMYD') |
1563 |
self basicNew phoneticStringsFor:'schneider' #('CNAYDR') |
|
1564 |
self basicNew phoneticStringsFor:'fischer' #('VYCR') |
|
1565 |
self basicNew phoneticStringsFor:'weber' #('VBR') |
|
4488 | 1566 |
self basicNew phoneticStringsFor:'weeber' #('VBR') |
1567 |
self basicNew phoneticStringsFor:'webber' #('VBR') |
|
1568 |
self basicNew phoneticStringsFor:'wepper' #('VBR') |
|
1569 |
||
2211 | 1570 |
self basicNew phoneticStringsFor:'meyer' #('MAYR') |
4488 | 1571 |
self basicNew phoneticStringsFor:'maier' #('MAYR') |
1572 |
self basicNew phoneticStringsFor:'mayer' #('MAYR') |
|
1573 |
self basicNew phoneticStringsFor:'mayr' #('MAYR') |
|
1574 |
self basicNew phoneticStringsFor:'meir' #('MAYR') |
|
1575 |
||
2211 | 1576 |
self basicNew phoneticStringsFor:'wagner' #('VACNR') |
1577 |
self basicNew phoneticStringsFor:'schulz' #('CULC') |
|
1578 |
self basicNew phoneticStringsFor:'becker' #('BCR') |
|
1579 |
self basicNew phoneticStringsFor:'hoffmann' #('OVMAN') |
|
4488 | 1580 |
self basicNew phoneticStringsFor:'haus' #('AUS') |
1581 |
||
1582 |
self basicNew phoneticStringsFor:'schäfer' #('CVR') |
|
3646 | 1583 |
self basicNew phoneticStringsFor:'scheffer' #('CVR') |
1584 |
self basicNew phoneticStringsFor:'schaeffer' #('CVR') |
|
1585 |
self basicNew phoneticStringsFor:'schaefer' #('CVR') |
|
2211 | 1586 |
" |
4488 | 1587 |
|
1588 |
"Created: / 28-07-2017 / 15:38:08 / cg" |
|
2211 | 1589 |
! ! |
1590 |
||
2208 | 1591 |
!PhoneticStringUtilities::DoubleMetaphoneStringComparator class methodsFor:'LICENSE'! |
1592 |
||
2209 | 1593 |
copyright |
1594 |
" |
|
1595 |
Copyright (c) 2002-2004 Robert Jarvis |
|
2208 | 1596 |
|
2209 | 1597 |
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation |
1598 |
files (the 'Software'), to deal in the Software without restriction, including without limitation the rights to use, |
|
1599 |
copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom |
|
1600 |
the Software is furnished to do so, subject to the following conditions: |
|
1601 |
||
1602 |
The above copyright notice and this permission notice shall be included in all copies or substantial |
|
1603 |
portions of the Software. |
|
2208 | 1604 |
|
2209 | 1605 |
THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, |
1606 |
INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. |
|
1607 |
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, |
|
1608 |
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE |
|
1609 |
USE OR OTHER DEALINGS IN THE SOFTWARE.' |
|
1610 |
" |
|
1611 |
! ! |
|
2208 | 1612 |
|
2213 | 1613 |
!PhoneticStringUtilities::DoubleMetaphoneStringComparator class methodsFor:'classification'! |
1614 |
||
1615 |
isSlavoGermanic:aString |
|
4488 | 1616 |
^ #('w' 'k' 'cz' 'witz' 'ä' 'ö' 'ü' 'ß') contains:[:sub | aString includesString:sub] |
2213 | 1617 |
|
1618 |
" |
|
1619 |
self isSlavoGermanic:'walter' |
|
4488 | 1620 |
self isSlavoGermanic:'horowitz' |
1621 |
self isSlavoGermanic:'müller' |
|
1622 |
self isSlavoGermanic:'miller' |
|
2213 | 1623 |
" |
4488 | 1624 |
|
1625 |
"Modified: / 28-07-2017 / 10:14:38 / cg" |
|
2213 | 1626 |
! ! |
1627 |
||
2209 | 1628 |
!PhoneticStringUtilities::DoubleMetaphoneStringComparator class methodsFor:'documentation'! |
2208 | 1629 |
|
3685 | 1630 |
documentation |
2209 | 1631 |
" |
4488 | 1632 |
The Double Metaphone algorithm |
1633 |
||
1634 |
see internet: https://en.wikipedia.org/wiki/Metaphone |
|
2209 | 1635 |
" |
2208 | 1636 |
! ! |
1637 |
||
1638 |
!PhoneticStringUtilities::DoubleMetaphoneStringComparator methodsFor:'accessing'! |
|
1639 |
||
1640 |
currentIndex |
|
1641 |
^currentIndex |
|
1642 |
! |
|
1643 |
||
1644 |
currentIndex: anInteger |
|
1645 |
currentIndex := anInteger |
|
1646 |
! |
|
1647 |
||
1648 |
inputKey |
|
1649 |
^inputKey |
|
1650 |
! |
|
1651 |
||
1652 |
inputKey: aString |
|
1653 |
inputKey := aString asUppercase |
|
1654 |
! |
|
1655 |
||
1656 |
primaryTranslation |
|
1657 |
^primaryTranslation |
|
1658 |
! |
|
1659 |
||
1660 |
primaryTranslation: anObject |
|
1661 |
primaryTranslation := anObject |
|
1662 |
! |
|
1663 |
||
1664 |
secondaryTranslation |
|
1665 |
^secondaryTranslation |
|
1666 |
! |
|
1667 |
||
1668 |
secondaryTranslation: anObject |
|
1669 |
secondaryTranslation := anObject |
|
1670 |
! |
|
1671 |
||
1672 |
skipCount |
|
1673 |
^skipCount |
|
1674 |
! |
|
1675 |
||
1676 |
skipCount: anInteger |
|
1677 |
skipCount := anInteger |
|
1678 |
! |
|
1679 |
||
1680 |
startIndex |
|
1681 |
^startIndex |
|
1682 |
! |
|
1683 |
||
1684 |
startIndex: anObject |
|
1685 |
startIndex := anObject |
|
1686 |
! ! |
|
1687 |
||
1688 |
!PhoneticStringUtilities::DoubleMetaphoneStringComparator methodsFor:'api'! |
|
1689 |
||
4488 | 1690 |
phoneticStringsFor:aString |
1691 |
"Private - Answers an array of alternate phonetic strings for the given input string." |
|
1692 |
||
1693 |
inputKey := aString. |
|
1694 |
self performInitialProcessing. |
|
1695 |
self processRemainingCharacters. |
|
1696 |
^ Array with:primaryTranslation with:secondaryTranslation |
|
1697 |
||
1698 |
"Modified (format): / 28-07-2017 / 11:25:02 / cg" |
|
2208 | 1699 |
! ! |
1700 |
||
1701 |
!PhoneticStringUtilities::DoubleMetaphoneStringComparator methodsFor:'initialization'! |
|
1702 |
||
1703 |
initialize |
|
4488 | 1704 |
super initialize. |
1705 |
||
1706 |
startIndex := 1. |
|
1707 |
primaryTranslation := ''. |
|
1708 |
secondaryTranslation := ''. |
|
1709 |
skipCount := 0. |
|
1710 |
currentIndex := 1. |
|
1711 |
||
1712 |
"Modified: / 28-07-2017 / 11:18:44 / cg" |
|
2208 | 1713 |
! ! |
1714 |
||
1715 |
!PhoneticStringUtilities::DoubleMetaphoneStringComparator methodsFor:'private'! |
|
1716 |
||
4488 | 1717 |
addPrimaryTranslation:aString |
1718 |
primaryTranslation := (primaryTranslation , aString) |
|
1719 |
||
1720 |
"Modified: / 28-07-2017 / 11:19:09 / cg" |
|
2208 | 1721 |
! |
1722 |
||
4488 | 1723 |
addSecondaryTranslation:aString |
1724 |
secondaryTranslation := secondaryTranslation , aString |
|
1725 |
||
1726 |
"Modified: / 28-07-2017 / 11:17:11 / cg" |
|
2208 | 1727 |
! |
1728 |
||
1729 |
isSlavoGermanic: aString |
|
1730 |
^((aString includesAnyOf: 'WK') or: |
|
1731 |
[ (aString indexOfSubCollection: 'CZ' startingAt: 1) >= 1 ]) or: |
|
1732 |
[ (aString indexOfSubCollection: 'WITZ' startingAt: 1) >= 1 ] |
|
1733 |
! |
|
1734 |
||
1735 |
keyAt: anInteger |
|
4488 | 1736 |
(anInteger between:1 and:inputKey size) ifTrue: [ |
1737 |
^ inputKey at: anInteger |
|
1738 |
]. |
|
1739 |
^ Character space |
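
"
 Usage sketch, not part of the original source. It assumes self is this
 comparator class and relies on the inputKey: accessor above, which
 uppercases its argument. Out-of-range indices answer a space instead of
 raising an error, which lets the process methods look ahead and behind
 without bounds checks.

    (self basicNew inputKey:'smith') keyAt:1      answers $S
    (self basicNew inputKey:'smith') keyAt:0      answers Character space
    (self basicNew inputKey:'smith') keyAt:6      answers Character space
"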
|
1740 |
||
1741 |
"Modified: / 28-07-2017 / 11:38:30 / cg" |
|
2208 | 1742 |
! |
1743 |
||
1744 |
keyLeftString: lengthInteger |
|
1745 |
^self keyMidString: lengthInteger from: 1 |
|
1746 |
! |
|
1747 |
||
1748 |
keyMidString: lengthInteger from: fromInteger |
|
4488 | 1749 |
| result from len additionalSpaces | |
1750 |
||
1751 |
result := ''. |
|
1752 |
from := fromInteger. |
|
1753 |
len := lengthInteger. |
|
1754 |
||
1755 |
"Prepend spaces if caller is requesting characters from before the start of the string" |
|
1756 |
||
1757 |
[ from < 1 ] whileTrue: |
|
1758 |
[ result := result, ' '. |
|
1759 |
from := from + 1. |
|
1760 |
len := len - 1 ]. |
|
1761 |
||
1762 |
from + len - 1 > inputKey size |
|
1763 |
ifTrue: |
|
1764 |
[ additionalSpaces := from + len - 1 - inputKey size. |
|
1765 |
len := inputKey size - from + 1 ] |
|
1766 |
ifFalse: [ additionalSpaces := 0 ]. |
|
1767 |
||
1768 |
result := result, (inputKey copyFrom: from to: (from+len-1 min: inputKey size)). |
|
1769 |
||
1770 |
[ additionalSpaces > 0 ] whileTrue: |
|
1771 |
[ result := result, ' '. |
|
1772 |
additionalSpaces := additionalSpaces - 1 ]. |
|
1773 |
||
1774 |
^result |
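
"
 Usage sketch, not part of the original source. It assumes self is this
 comparator class and relies on the inputKey: accessor above, which
 uppercases its argument. Requests that reach past either end of the key
 are padded with spaces so the result keeps the requested length.

    (self basicNew inputKey:'smith') keyMidString:3 from:2    answers 'MIT'
    (self basicNew inputKey:'smith') keyMidString:3 from:4    answers 'TH '
    (self basicNew inputKey:'smith') keyMidString:3 from:0    answers ' SM'
"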
|
1775 |
||
1776 |
"Modified: / 28-07-2017 / 11:20:43 / cg" |
|
2208 | 1777 |
! |
1778 |
||
1779 |
keyRightString: lengthInteger |
|
4488 | 1780 |
^self keyMidString: lengthInteger from: inputKey size - lengthInteger + 1 |
1781 |
||
1782 |
"Modified: / 28-07-2017 / 11:20:51 / cg" |
|
2208 | 1783 |
! |
1784 |
||
1785 |
performInitialProcessing |
|
4488 | 1786 |
(#( 'GN' 'KN' 'PN' 'WR' 'PS' ) includes:(self keyLeftString:2)) ifTrue:[ |
1787 |
startIndex := startIndex + 1 |
|
1788 |
]. |
|
1789 |
(self keyAt:1) = $X ifTrue:[ |
|
1790 |
self |
|
1791 |
addPrimaryTranslation:'S'; |
|
1792 |
addSecondaryTranslation:'S'. |
|
1793 |
startIndex := startIndex + 1 |
|
1794 |
]. |
|
1795 |
(self keyAt:1) isVowel ifTrue:[ |
|
1796 |
self |
|
1797 |
addPrimaryTranslation:'A'; |
|
1798 |
addSecondaryTranslation:'A'. |
|
1799 |
startIndex := startIndex + 1 |
|
1800 |
] |
|
1801 |
||
1802 |
"Modified: / 28-07-2017 / 11:36:31 / cg" |
|
2208 | 1803 |
! |
1804 |
||
1805 |
processB |
|
4488 | 1806 |
self |
1807 |
addPrimaryTranslation: 'P'; |
|
1808 |
addSecondaryTranslation: 'P'. |
|
1809 |
||
1810 |
(self keyAt: (currentIndex + 1)) == $B ifTrue: [ |
|
1811 |
skipCount := skipCount + 1 |
|
1812 |
]. |
|
1813 |
||
1814 |
"Modified: / 28-07-2017 / 11:26:03 / cg" |
|
2208 | 1815 |
! |
1816 |
||
1817 |
processC |
|
2213 | 1818 |
"i" |
1819 |
((((currentIndex >= 3 |
|
1820 |
and: [ (self keyAt: currentIndex-2) isVowel not ]) |
|
1821 |
and: [ (self keyMidString: 3 from: currentIndex-1) = 'ACH' ]) |
|
1822 |
and: [ (self keyAt: currentIndex+2) ~= $I ]) |
|
1823 |
and: [ ((self keyAt: currentIndex+2) ~= $E) |
|
1824 |
or: [ (self keyMidString: 6 from: currentIndex-2) ~= 'BACHER' |
|
1825 |
and: [ (self keyMidString: 6 from: currentIndex-2) ~= 'MACHER' ] ] ]) |
|
1826 |
ifTrue: |
|
1827 |
[ self addPrimaryTranslation: 'K'. |
|
1828 |
self addSecondaryTranslation: 'K'. |
|
4488 | 1829 |
skipCount := skipCount + 2. |
2213 | 1830 |
^self ]. |
1831 |
||
1832 |
"ii" |
|
4488 | 1833 |
(currentIndex = 1 and: [inputKey beginsWith: 'CAESAR']) |
2213 | 1834 |
ifTrue: |
1835 |
[ self addPrimaryTranslation: 'S'. |
|
1836 |
self addSecondaryTranslation: 'S'. |
|
4488 | 1837 |
skipCount := skipCount + 1. |
2213 | 1838 |
^self ]. |
1839 |
||
1840 |
"iii" |
|
1841 |
(self keyMidString: 4 from: currentIndex) = 'CHIA' |
|
1842 |
ifTrue: |
|
1843 |
[ self addPrimaryTranslation: 'K'. |
|
1844 |
self addSecondaryTranslation: 'K'. |
|
4488 | 1845 |
skipCount := skipCount + 1. |
2213 | 1846 |
^self ]. |
1847 |
||
1848 |
"iv" |
|
1849 |
(self keyMidString: 2 from: currentIndex) = 'CH' |
|
1850 |
ifTrue: |
|
1851 |
[ (currentIndex > 1 "a" |
|
1852 |
and: [ (self keyMidString: 4 from: currentIndex) = 'CHAE' ]) |
|
1853 |
ifTrue: [ self |
|
1854 |
addPrimaryTranslation: 'K'; |
|
4488 | 1855 |
addSecondaryTranslation: 'X'. |
1856 |
skipCount := skipCount + 1. |
|
1857 |
^self ]. |
|
2213 | 1858 |
|
1859 |
(currentIndex = 1 "b" |
|
4488 | 1860 |
and: [ (inputKey size > 5 and: [(inputKey copyFrom: 1 to: 6) = 'CHARAC' |
1861 |
or: [ (inputKey copyFrom: 1 to: 6) = 'CHARIS' ]] ) |
|
1862 |
or: [inputKey size > 4 and: [ ((((inputKey copyFrom: 1 to: 4) = 'CHOR' |
|
1863 |
or: [ (inputKey copyFrom: 1 to: 4) = 'CHYM' ]) |
|
1864 |
or: [ (inputKey copyFrom: 1 to: 4) = 'CHIA' ]) |
|
1865 |
or: [ (inputKey copyFrom: 1 to: 4) = 'CHEM' ]) |
|
1866 |
and: [ (inputKey copyFrom: 1 to: 5) ~= 'CHORE' ] ] ] ]) |
|
2213 | 1867 |
ifTrue: [ self |
1868 |
addPrimaryTranslation: 'K'; |
|
4488 | 1869 |
addSecondaryTranslation: 'K'. |
1870 |
skipCount := skipCount + 1. |
|
1871 |
^self ]. |
|
1872 |
||
1873 |
(((((#('VAN ' 'VON ') includes: (self keyLeftString: 4)) "c" |
|
1874 |
or: [ (self keyLeftString: 3) = 'SCH' ]) |
|
2213 | 1875 |
or: [ #('ORCHES' 'ARCHIT' 'ORCHID') |
1876 |
includes: (self keyMidString: 6 from: currentIndex-2) ]) |
|
1877 |
or: [ #($T $S) includes: (self keyAt: currentIndex+2) ]) |
|
1878 |
or: [ ((currentIndex = 1) |
|
1879 |
or: [ #($A $O $U $E) includes: (self keyAt: currentIndex-1) ]) |
|
1880 |
and: [ #($L $R $N $M $B $H $F $V $W $ ) includes: (self keyAt: currentIndex+2) ] ] ) |
|
1881 |
ifTrue: |
|
1882 |
[ self |
|
1883 |
addPrimaryTranslation: 'K'; |
|
4488 | 1884 |
addSecondaryTranslation: 'K'. |
1885 |
skipCount := skipCount + 1. |
|
1886 |
^self ] |
|
2213 | 1887 |
ifFalse: |
1888 |
[ currentIndex > 1 |
|
1889 |
ifTrue: |
|
4488 | 1890 |
[ (inputKey copyFrom: 1 to: 2) = 'MC' |
2213 | 1891 |
ifTrue: |
1892 |
[ self |
|
1893 |
addPrimaryTranslation: 'K'; |
|
1894 |
addSecondaryTranslation: 'K' ] |
|
1895 |
ifFalse: |
|
1896 |
[ self |
|
1897 |
addPrimaryTranslation: 'X'; |
|
1898 |
addSecondaryTranslation: 'K' ] ] |
|
1899 |
ifFalse: |
|
1900 |
[ self |
|
1901 |
addPrimaryTranslation: 'X'; |
|
1902 |
addSecondaryTranslation: 'X' ]. |
|
4488 | 1903 |
skipCount := skipCount + 1. |
2213 | 1904 |
^self ] ]. |
1905 |
||
1906 |
"v" |
|
1907 |
(self keyAt: currentIndex+1) = $Z |
|
1908 |
ifTrue: |
|
1909 |
[ self |
|
1910 |
addPrimaryTranslation: 'S'; |
|
4488 | 1911 |
addSecondaryTranslation: 'X'. |
1912 |
skipCount := skipCount + 1. |
|
1913 |
^self ]. |
|
2213 | 1914 |
|
1915 |
"vi" |
|
1916 |
(self keyMidString: 3 from: currentIndex+1) = 'CIA' |
|
1917 |
ifTrue: |
|
1918 |
[ self |
|
1919 |
addPrimaryTranslation: 'X'; |
|
4488 | 1920 |
addSecondaryTranslation: 'X'. |
1921 |
skipCount := skipCount + 2. |
|
1922 |
^self ]. |
|
2213 | 1923 |
|
1924 |
"vii" |
|
1925 |
((self keyAt: currentIndex+1) = $C |
|
1926 |
and: [ ((currentIndex = 2) |
|
1927 |
and: [ (self keyAt: 1) = $M ]) not ]) |
|
1928 |
ifTrue: |
|
1929 |
[ ((#($I $E $H) includes: (self keyAt: currentIndex+2)) |
|
1930 |
and: [ (self keyMidString: 2 from: currentIndex+2) ~= 'HU' ]) |
|
1931 |
ifTrue: |
|
1932 |
[ ((currentIndex = 2 and: [ (self keyAt: 1) = $A ]) |
|
1933 |
or: [ #('UCCEE' 'UCCES') includes: (self keyMidString: 5 from: currentIndex-1)]) |
|
1934 |
ifTrue: |
|
1935 |
[self |
|
1936 |
addPrimaryTranslation: 'KS'; |
|
4488 | 1937 |
addSecondaryTranslation: 'KS'. |
1938 |
skipCount := skipCount + 2. |
|
1939 |
^self ] |
|
2213 | 1940 |
ifFalse: |
1941 |
[self |
|
1942 |
addPrimaryTranslation: 'X'; |
|
4488 | 1943 |
addSecondaryTranslation: 'X'. |
1944 |
skipCount := skipCount + 2. |
|
1945 |
^self ] ] |
|
2213 | 1946 |
ifFalse: |
1947 |
[ self |
|
1948 |
addPrimaryTranslation: 'K'; |
|
4488 | 1949 |
addSecondaryTranslation: 'K'. |
1950 |
skipCount := skipCount + 2. |
|
1951 |
^self ] ]. |
|
2213 | 1952 |
|
1953 |
"viii" |
|
1954 |
(#($K $G $Q) includes: (self keyAt: currentIndex+1)) |
|
1955 |
ifTrue: |
|
1956 |
[ self |
|
1957 |
addPrimaryTranslation: 'K'; |
|
4488 | 1958 |
addSecondaryTranslation: 'K'. |
1959 |
skipCount := skipCount + 1. |
|
1960 |
^self ]. |
|
2213 | 1961 |
|
1962 |
"ix" |
|
1963 |
(#($I $E $Y) includes: (self keyAt: currentIndex+1)) |
|
1964 |
ifTrue: |
|
1965 |
[ (#('CIO' 'CIE' 'CIA') includes: (self keyMidString: 3 from: currentIndex)) |
|
1966 |
ifTrue: |
|
1967 |
[self |
|
1968 |
addPrimaryTranslation: 'S'; |
|
1969 |
addSecondaryTranslation: 'X' ] |
|
1970 |
ifFalse: |
|
1971 |
[self |
|
1972 |
addPrimaryTranslation: 'S'; |
|
1973 |
addSecondaryTranslation: 'S']. |
|
4488 | 1974 |
skipCount := skipCount + 1. |
2213 | 1975 |
^self ]. |
1976 |
||
1977 |
"x" |
|
1978 |
self |
|
1979 |
addPrimaryTranslation: 'K'; |
|
1980 |
addSecondaryTranslation: 'K'. |
|
1981 |
||
1982 |
"xi" |
|
1983 |
(#(' C' ' Q' ' G') includes: (self keyMidString: 2 from: currentIndex+1)) |
|
1984 |
ifTrue: |
|
4488 | 1985 |
[ skipCount := skipCount + 2 ] |
2213 | 1986 |
ifFalse: |
1987 |
[ ((#($C $K $Q) includes: (self keyAt: currentIndex+1)) |
|
1988 |
and: [ (#('CE' 'CI') includes: (self keyMidString: 2 from: currentIndex+1)) not ]) |
|
4488 | 1989 |
ifTrue: [ skipCount := skipCount + 1] ] |
1990 |
||
1991 |
"Modified: / 28-07-2017 / 11:29:11 / cg" |
|
2208 | 1992 |
! |
1993 |
||
1994 |
processCedille |
|
1995 |
self |
|
1996 |
addPrimaryTranslation: 'S'; |
|
1997 |
addSecondaryTranslation: 'S' |
|
1998 |
! |
|
1999 |
||
2000 |
processD |
|
2213 | 2001 |
"i" |
2002 |
(self keyAt: currentIndex+1) = $G |
|
2003 |
ifTrue: |
|
2004 |
[ (#($I $E $Y) includes: (self keyAt: currentIndex+2)) |
|
2005 |
ifTrue: |
|
2006 |
[ self |
|
2007 |
addPrimaryTranslation: 'J'; |
|
4488 | 2008 |
addSecondaryTranslation: 'J'. |
2009 |
skipCount := skipCount + 2. |
|
2213 | 2010 |
^self ] |
2011 |
ifFalse: |
|
2012 |
[ self |
|
2013 |
addPrimaryTranslation: 'TK'; |
|
4488 | 2014 |
addSecondaryTranslation: 'TK'. |
2015 |
skipCount := skipCount + 1. |
|
2213 | 2016 |
^self ] ]. |
2017 |
||
2018 |
"ii" |
|
2019 |
(#($T $D) includes: (self keyAt: currentIndex+1)) |
|
2020 |
ifTrue: |
|
2021 |
[ self |
|
2022 |
addPrimaryTranslation: 'T'; |
|
4488 | 2023 |
addSecondaryTranslation: 'T'. |
2024 |
skipCount := skipCount + 1. |
|
2025 |
^self ]. |
|
2213 | 2026 |
|
2027 |
"iii" |
|
2028 |
self |
|
2029 |
addPrimaryTranslation: 'T'; |
|
2030 |
addSecondaryTranslation: 'T' |
|
4488 | 2031 |
|
2032 |
"Modified: / 28-07-2017 / 11:27:39 / cg" |
|
2208 | 2033 |
! |
2034 |
||
2035 |
processF |
|
4488 | 2036 |
self |
2037 |
addPrimaryTranslation: 'F'; |
|
2038 |
addSecondaryTranslation: 'F'. |
|
2039 |
||
2040 |
(self keyAt: currentIndex+1) = $F |
|
2041 |
ifTrue: [ skipCount := skipCount + 1 ] |
|
2042 |
||
2043 |
"Modified (format): / 28-07-2017 / 11:29:21 / cg" |
|
2208 | 2044 |
! |
2045 |
||
2046 |
processG |
|
2047 |
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp |
|
2048 |
case 'G': |
|
2049 |
if(GetAt(current + 1) == 'H') |
|
2050 |
{" |
|
2051 |
| word | |
|
2213 | 2052 |
(self keyAt: currentIndex + 1) = $H |
2208 | 2053 |
ifTrue: [ |
2054 |
"if((current > 0) AND !!IsVowel(current - 1))" |
|
2055 |
||
2213 | 2056 |
(currentIndex > 1 and: [(self keyAt: currentIndex - 1) isVowel not]) |
2208 | 2057 |
ifTrue: [ |
2058 |
" { |
|
2059 |
MetaphAdd(K); |
|
2060 |
current += 2; |
|
2061 |
break; |
|
2062 |
}" |
|
2063 |
||
4488 | 2064 |
self |
2065 |
addPrimaryTranslation: 'K'; |
|
2066 |
addSecondaryTranslation: 'K'. |
|
2067 |
skipCount := skipCount + 1. |
|
2068 |
^self |
|
2208 | 2069 |
]. |
2070 |
||
2071 |
"if(current < 3) |
|
2072 |
{" |
|
2073 |
||
2074 |
currentIndex < 4 |
|
2075 |
ifTrue: [ |
|
2076 |
||
2077 |
" //'ghislane', ghiradelli |
|
2078 |
if(current == 0) |
|
2079 |
{ " |
|
2080 |
currentIndex = 1 |
|
2081 |
ifTrue: [ |
|
2082 |
"if(GetAt(current + 2) == 'I')" |
|
2083 |
||
2213 | 2084 |
(self keyAt: currentIndex + 2) = $I |
2208 | 2085 |
ifTrue: [ |
2086 |
"MetaphAdd(J);" |
|
2087 |
self addPrimaryTranslation: 'J'; |
|
2088 |
addSecondaryTranslation: 'J'. |
|
2089 |
] ifFalse: [ |
|
2090 |
"MetaphAdd(K);" |
|
2091 |
self addPrimaryTranslation: 'K'; |
|
2092 |
addSecondaryTranslation: 'K'. |
|
2093 |
]. |
|
2094 |
" current += 2; |
|
2095 |
break;" |
|
4488 | 2096 |
skipCount := skipCount + 1. |
2097 |
^self |
|
2208 | 2098 |
] |
2099 |
]. |
|
2100 |
||
2101 |
" //Parker's rule (with some further refinements) - e.g., 'hugh' |
|
2102 |
if(((current > 1) AND StringAt((current - 2), 1, B, H, D, ) ) |
|
2103 |
//e.g., 'bough' |
|
2104 |
OR ((current > 2) AND StringAt((current - 3), 1, B, H, D, ) ) |
|
2105 |
//e.g., 'broughton' |
|
2106 |
OR ((current > 3) AND StringAt((current - 4), 1, B, H, ) ) ) |
|
2107 |
" |
|
2213 | 2108 |
(((currentIndex > 2 and: [#($B $H $D) includes: (self keyAt: currentIndex - 2)]) |
2109 |
or: [currentIndex > 3 and: [#($B $H $D) includes: (self keyAt: currentIndex - 3)]]) |
|
2110 |
or: [currentIndex > 4 and: [#($B $H) includes: (self keyAt: currentIndex - 4)]]) |
|
2208 | 2111 |
ifTrue: [ |
2112 |
"current += 2; |
|
2113 |
break;" |
|
4488 | 2114 |
skipCount := skipCount + 1. |
2115 |
^self |
|
2208 | 2116 |
] ifFalse: [ |
2117 |
" //e.g., 'laugh', 'McLaughlin', 'cough', 'gough', 'rough', 'tough' |
|
2118 |
if((current > 2) |
|
2119 |
AND (GetAt(current - 1) == 'U') |
|
2120 |
AND StringAt((current - 3), 1, C, G, L, R, T, ) )" |
|
2121 |
(currentIndex > 3 and: [ |
|
2213 | 2122 |
((self keyAt: currentIndex - 1) = $U) and: [ |
2123 |
#($C $G $L $R $T) includes: (self keyAt: currentIndex - 3) |
|
2208 | 2124 |
] |
2125 |
]) ifTrue: [ |
|
2126 |
"MetaphAdd(F);" |
|
2127 |
self addPrimaryTranslation: 'F'; |
|
2128 |
addSecondaryTranslation: 'F'. |
|
2129 |
] ifFalse: [ |
|
2130 |
" if((current > 0) AND GetAt(current - 1) !!= 'I') |
|
2131 |
MetaphAdd(K);" |
|
2213 | 2132 |
(currentIndex > 1 and: [(self keyAt: currentIndex - 1) ~= $I]) |
2208 | 2133 |
ifTrue: [ |
2134 |
self addPrimaryTranslation: 'K'; |
|
2135 |
addSecondaryTranslation: 'K'. |
|
2136 |
]. |
|
2137 |
]. |
|
4488 | 2138 |
skipCount := skipCount + 1. |
2139 |
^self |
|
2208 | 2140 |
]. |
2141 |
]. |
|
2142 |
"if(GetAt(current + 1) == 'N')" |
|
2213 | 2143 |
(self keyAt: currentIndex + 1) = $N |
2208 | 2144 |
ifTrue: [ |
2145 |
"if((current == 1) AND IsVowel(0) AND !!SlavoGermanic())" |
|
4488 | 2146 |
(currentIndex = 2 and: [(inputKey at: 1) isVowel and: [(self isSlavoGermanic: inputKey) not]]) |
2208 | 2147 |
ifTrue: [ |
2148 |
"MetaphAdd(KN, N);" |
|
2149 |
self addPrimaryTranslation: 'KN'; |
|
2150 |
addSecondaryTranslation: 'N'. |
|
2151 |
] ifFalse: [ |
|
2152 |
" //not e.g. 'cagney' |
|
2153 |
if(!!StringAt((current + 2), 2, EY, ) |
|
2154 |
AND (GetAt(current + 1) !!= 'Y') |
|
2155 |
AND !!SlavoGermanic())" |
|
4488 | 2156 |
((inputKey size >= (currentIndex + 2)) and: [ |
2157 |
(self keyMidString: 2 from: currentIndex + 2) ~= 'EY' and: [ |
|
2213 | 2158 |
(self keyAt: currentIndex + 1) ~= $Y and: [ |
4488 | 2159 |
(self isSlavoGermanic: inputKey) not |
2208 | 2160 |
] |
2161 |
] |
|
2162 |
]) ifTrue: [ |
|
2163 |
self addPrimaryTranslation: 'N'; |
|
2164 |
addSecondaryTranslation: 'KN'. |
|
2165 |
] ifFalse: [ |
|
2166 |
self addPrimaryTranslation: 'KN'; |
|
2167 |
addSecondaryTranslation: 'KN'. |
|
2168 |
]. |
|
2169 |
]. |
|
4488 | 2170 |
skipCount := skipCount + 1. |
2171 |
^self |
|
2208 | 2172 |
]. |
2173 |
" //'tagliaro' |
|
2174 |
if(StringAt((current + 1), 2, LI, ) AND !!SlavoGermanic())" |
|
4488 | 2175 |
((inputKey size >= (currentIndex + 3)) and: [ |
2176 |
(inputKey copyFrom: currentIndex + 1 to: currentIndex + 2) = 'LI' and: [ |
|
2177 |
(self isSlavoGermanic: inputKey) not]]) |
|
2208 | 2178 |
ifTrue: [ |
2179 |
self addPrimaryTranslation: 'KL'; |
|
2180 |
addSecondaryTranslation: 'L'. |
|
4488 | 2181 |
skipCount := skipCount + 1. |
2182 |
^self. |
|
2208 | 2183 |
]. |
2184 |
" //-ges-,-gep-,-gel-, -gie- at beginning |
|
2185 |
if((current == 0) |
|
2186 |
AND ((GetAt(current + 1) == 'Y') |
|
2187 |
OR StringAt((current + 1), 2, ES, EP, EB, EL, EY, IB, IL, IN, IE, EI, ER, )) )" |
|
2213 | 2188 |
(currentIndex = 1 and: [ |
2189 |
((self keyAt: currentIndex + 1) = $Y) or: [ |
|
2208 | 2190 |
(#('ES' 'EP' 'EB' 'EL' 'EY' 'IB' 'IL' 'IN' 'IE' 'EI' 'ER') includes: |
4488 | 2191 |
(self keyMidString: 2 from: currentIndex + 1)) |
2208 | 2192 |
]]) ifTrue: [ |
2193 |
self addPrimaryTranslation: 'K'; |
|
2194 |
addSecondaryTranslation: 'J'. |
|
4488 | 2195 |
skipCount := skipCount + 1. |
2196 |
^self. |
|
2208 | 2197 |
]. |
2198 |
" // -ger-, -gy- |
|
2199 |
if((StringAt((current + 1), 2, ER, ) OR (GetAt(current + 1) == 'Y')) |
|
2200 |
AND !!StringAt(0, 6, DANGER, RANGER, MANGER, ) |
|
2201 |
AND !!StringAt((current - 1), 1, E, I, ) |
|
2202 |
AND !!StringAt((current - 1), 3, RGY, OGY, ) ) |
|
2203 |
" |
|
4488 | 2204 |
(((self keyMidString: 2 from: currentIndex + 1) = 'ER' or: [ |
2213 | 2205 |
((self keyAt: currentIndex + 1) = $Y)]) |
4488 | 2206 |
and: [((#('DANGER' 'RANGER' 'MANGER') includes: (word := inputKey copyFrom: 1 to: (6 min: inputKey size))) not) |
2213 | 2207 |
and: [(#($E $I) includes: (self keyAt: currentIndex - 1)) not |
4488 | 2208 |
and: [(#('RGY' 'OGY') includes: (inputKey copyFrom: currentIndex - 1 to: currentIndex + 1)) not]]]) |
2208 | 2209 |
ifTrue: [ |
2210 |
self addPrimaryTranslation: 'K'; |
|
2211 |
addSecondaryTranslation: 'J'. |
|
4488 | 2212 |
skipCount := skipCount + 1. |
2213 |
^self. |
|
2208 | 2214 |
]. |
2215 |
||
2216 |
" // italian e.g, 'biaggi' |
|
2217 |
if(StringAt((current + 1), 1, E, I, Y, ) OR StringAt((current - 1), 4, AGGI, OGGI, )) |
|
2218 |
" |
|
4488 | 2219 |
((#($E $I $Y) includes: (self keyAt: (currentIndex + 1))) or: [#('AGGI' 'OGGI') includes: (self keyMidString: 4 from: currentIndex - 1)]) |
2208 | 2220 |
ifTrue: [ |
2221 |
" //obvious germanic |
|
2222 |
if((StringAt(0, 4, VAN , VON , ) OR StringAt(0, 3, SCH, )) |
|
2223 |
OR StringAt((current + 1), 2, ET, )) MetaphAdd(K);" |
|
4488 | 2224 |
word := self keyLeftString: 4. |
2208 | 2225 |
((#('VAN ' 'VON ') includes: word) or: [(word copyFrom: 1 to: 3) = 'SCH' or: [(self keyMidString: 2 from: currentIndex + 1) = 'ET']]) |
2226 |
ifTrue: [ |
|
2227 |
self addPrimaryTranslation: 'K'; |
|
2228 |
addSecondaryTranslation: 'K'. |
|
2229 |
] ifFalse: [ |
|
2230 |
" //always soft if french ending |
|
2231 |
if(StringAt((current + 1), 4, IER , )) |
|
2232 |
MetaphAdd(J); |
|
2233 |
else |
|
2234 |
MetaphAdd(J, K); |
|
2235 |
current += 2; |
|
2236 |
break;" |
|
4488 | 2237 |
(self keyMidString: 4 from: currentIndex + 1) = 'IER ' |
2208 | 2238 |
ifTrue: [ |
2239 |
self addPrimaryTranslation: 'J'; |
|
2240 |
addSecondaryTranslation: 'J'. |
|
2241 |
] ifFalse: [ |
|
2242 |
self addPrimaryTranslation: 'J'; |
|
2243 |
addSecondaryTranslation: 'K'. |
|
2244 |
]. |
|
2245 |
||
2246 |
]. |
|
4488 | 2247 |
skipCount := skipCount + 1. |
2248 |
^self. |
|
2208 | 2249 |
]. |
2250 |
||
2251 |
" if(GetAt(current + 1) == 'G') |
|
2252 |
current += 2; |
|
2253 |
else |
|
2254 |
current += 1; |
|
2255 |
MetaphAdd(K); |
|
2256 |
break;" |
|
2257 |
||
2213 | 2258 |
(self keyAt: (currentIndex + 1)) = $G |
2208 | 2259 |
ifTrue: [ |
4488 | 2260 |
skipCount := skipCount + 1. |
2208 | 2261 |
]. |
2262 |
self addPrimaryTranslation: 'K'; |
|
2263 |
addSecondaryTranslation: 'K'. |
|
4488 | 2264 |
|
2265 |
"Modified: / 28-07-2017 / 11:31:33 / cg" |
|
2208 | 2266 |
! |
2267 |
||
2268 |
processH |
|
2213 | 2269 |
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp |
2270 |
case 'H': |
|
2208 | 2271 |
//only keep if first & before vowel or btw. 2 vowels |
2272 |
if(((current == 0) OR IsVowel(current - 1)) |
|
2273 |
AND IsVowel(current + 1)) |
|
2274 |
{ |
|
2275 |
MetaphAdd(H); |
|
2276 |
current += 2; |
|
2277 |
}else//also takes care of 'HH' |
|
2278 |
current += 1; |
|
2279 |
break; |
|
2280 |
" |
|
2281 |
||
2213 | 2282 |
(((currentIndex = 1) |
2283 |
or: [ (self keyAt: currentIndex - 1) isVowel]) |
|
2284 |
and: [(self keyAt: currentIndex + 1) isVowel]) |
|
2285 |
ifTrue: [ |
|
2286 |
self addPrimaryTranslation: 'H'; |
|
2287 |
addSecondaryTranslation: 'H'. |
|
4488 | 2288 |
skipCount := skipCount + 1. |
2289 |
^self. |
|
2213 | 2290 |
] |
4488 | 2291 |
|
2292 |
"Modified: / 28-07-2017 / 11:29:52 / cg" |
|
2208 | 2293 |
! |
2294 |
||
2295 |
processJ |
|
2213 | 2296 |
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp |
2297 |
case 'J': |
|
2208 | 2298 |
//obvious spanish, 'jose', 'san jacinto' |
2299 |
if(StringAt(current, 4, JOSE, ) OR StringAt(0, 4, SAN , ) ) |
|
2300 |
{ |
|
2301 |
if(((current == 0) AND (GetAt(current + 4) == ' ')) OR StringAt(0, 4, SAN , ) ) |
|
2302 |
MetaphAdd(H); |
|
2303 |
else |
|
2304 |
{ |
|
2305 |
MetaphAdd(J, H); |
|
2306 |
} |
|
2307 |
current +=1; |
|
2308 |
break; |
|
2309 |
} |
|
2310 |
||
2311 |
if((current == 0) AND !!StringAt(current, 4, JOSE, )) |
|
2312 |
MetaphAdd(J, A);//Yankelovich/Jankelowicz |
|
2313 |
else |
|
2314 |
//spanish pron. of e.g. 'bajador' |
|
2315 |
if(IsVowel(current - 1) |
|
2316 |
AND !!SlavoGermanic() |
|
2317 |
AND ((GetAt(current + 1) == 'A') OR (GetAt(current + 1) == 'O'))) |
|
2318 |
MetaphAdd(J, H); |
|
2319 |
else |
|
2320 |
if(current == last) |
|
2321 |
MetaphAdd(J, ); |
|
2322 |
else |
|
2323 |
if(!!StringAt((current + 1), 1, L, T, K, S, N, M, B, Z, ) |
|
2324 |
AND !!StringAt((current - 1), 1, S, K, L, )) |
|
2325 |
MetaphAdd(J); |
|
2326 |
||
2327 |
if(GetAt(current + 1) == 'J')//it could happen!! |
|
2328 |
current += 2; |
|
2329 |
else |
|
2330 |
current += 1; |
|
2331 |
break; |
|
2332 |
" |
|
2213 | 2333 |
| currentWord firstWord nextLetter | |
4488 | 2334 |
currentWord := inputKey copyFrom: currentIndex to: (currentIndex + 3 min: inputKey size). |
2335 |
firstWord := inputKey copyFrom: 1 to: (4 min: inputKey size). |
|
2213 | 2336 |
nextLetter := self keyAt: currentIndex + 1. |
2337 |
(currentWord = 'JOSE' or: [firstWord = 'SAN ']) |
|
2338 |
ifTrue: [ |
|
4488 | 2339 |
((currentIndex = 1 and: [inputKey size = 4 or: [inputKey size >= 5 and: [(self keyAt: currentIndex + 4) = Character space]]])
or: [firstWord = 'SAN '])
2341 |
ifTrue: [ |
|
2342 |
self addPrimaryTranslation: 'H'; |
|
2343 |
addSecondaryTranslation: 'H'. |
|
2344 |
] ifFalse: [ |
|
2345 |
self addPrimaryTranslation: 'J'; |
|
2346 |
addSecondaryTranslation: 'H'. |
|
2347 |
]. |
|
2348 |
^self. |
|
2349 |
]. |
|
2350 |
(currentIndex = 1 and: [firstWord ~= 'JOSE']) |
|
2351 |
ifTrue: [ |
|
2352 |
self addPrimaryTranslation: 'J'; |
|
2353 |
addSecondaryTranslation: 'A'. |
|
2354 |
] ifFalse: [ |
|
2355 |
((currentIndex > 1 and: [(self keyAt: currentIndex -1) isVowel]) |
|
4488 | 2356 |
and: [(self isSlavoGermanic: inputKey) not and: [nextLetter == $A or: [nextLetter == $O]]]) |
2213 | 2357 |
ifTrue: [ |
2358 |
self addPrimaryTranslation: 'J'; |
|
2359 |
addSecondaryTranslation: 'H'. |
|
2360 |
] ifFalse: [ |
|
4488 | 2361 |
currentIndex = inputKey size |
2213 | 2362 |
ifTrue: [ |
2363 |
self addPrimaryTranslation: 'J'; |
|
2364 |
addSecondaryTranslation: ' '. |
|
2365 |
] ifFalse: [ |
|
2366 |
((#($L $T $K $S $N $M $B $Z) includes: nextLetter) not and: [(#($S $K $L) includes: (self keyAt: currentIndex - 1)) not]) |
|
2367 |
ifTrue: [ |
|
2368 |
self addPrimaryTranslation: 'J'; |
|
2369 |
addSecondaryTranslation: 'J'. |
|
2370 |
]. |
|
2371 |
]. |
|
2372 |
]. |
|
2373 |
]. |
|
2374 |
nextLetter == $J |
2213 | 2375 |
ifTrue: [ |
4488 | 2376 |
skipCount := skipCount + 1. |
2213 | 2377 |
]. |
4488 | 2378 |
|
2379 |
"Modified: / 28-07-2017 / 11:31:41 / cg" |
|
2208 | 2380 |
! |
2381 |
||
2382 |
processK |
|
2213 | 2383 |
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp |
2384 |
case 'K': |
|
2208 | 2385 |
if(GetAt(current + 1) == 'K') |
2386 |
current += 2; |
|
2387 |
else |
|
2388 |
current += 1; |
|
2389 |
MetaphAdd(K); |
|
2390 |
break; |
|
2213 | 2391 |
" |
2392 |
||
2393 |
(self keyAt: currentIndex + 1) = $K |
|
2394 |
ifTrue: [ |
|
4488 | 2395 |
skipCount := skipCount + 1 |
2213 | 2396 |
]. |
2397 |
self addPrimaryTranslation: 'K'; |
|
2398 |
addSecondaryTranslation: 'K'. |
|
4488 | 2399 |
|
2400 |
"Modified: / 28-07-2017 / 11:31:46 / cg" |
|
2208 | 2401 |
! |
2402 |
||
2403 |
processL |
|
2404 |
||
2405 |
"case 'L': |
|
2406 |
if(GetAt(current + 1) == 'L') |
|
2407 |
{ |
|
2408 |
//spanish e.g. 'cabrillo', 'gallegos' |
|
2409 |
if(((current == (length - 3)) |
|
2410 |
AND StringAt((current - 1), 4, ILLO, ILLA, ALLE, )) |
|
2411 |
OR ((StringAt((last - 1), 2, AS, OS, ) OR StringAt(last, 1, A, O, )) |
|
2412 |
AND StringAt((current - 1), 4, ALLE, )) ) |
|
2413 |
{ |
|
2414 |
MetaphAdd(L, ); |
|
2415 |
current += 2; |
|
2416 |
break; |
|
2417 |
} |
|
2418 |
current += 2; |
|
2419 |
}else |
|
2420 |
current += 1; |
|
2421 |
MetaphAdd(L); |
|
2422 |
break; |
|
2423 |
" |
|
2213 | 2424 |
| currentWord | |
2425 |
(self keyAt: currentIndex + 1) = $L |
|
2426 |
ifTrue: [ |
|
4488 | 2427 |
"currentWord is computed up front, so that the 'ALLE' test in the second alternative also sees it"
currentWord := currentIndex > 1
    ifTrue: [inputKey copyFrom: currentIndex - 1 to: (currentIndex + 2 min: inputKey size)]
    ifFalse: [''].
(((currentIndex = (inputKey size - 2))
    and: [#('ILLO' 'ILLA' 'ALLE') includes: currentWord])
 or: [((#('AS' 'OS') includes: (inputKey copyFrom: inputKey size - 1 to: inputKey size)) or: [#($A $O) includes: (self keyAt: inputKey size)]) and: [currentWord = 'ALLE']
])
2431 |
ifTrue: [ |
|
2432 |
self addPrimaryTranslation: 'L'; |
|
2433 |
addSecondaryTranslation: ' '. |
|
4488 | 2434 |
skipCount := skipCount + 1. |
2435 |
^self. |
|
2213 | 2436 |
]. |
4488 | 2437 |
skipCount := skipCount + 1. |
2213 | 2438 |
]. |
2439 |
self addPrimaryTranslation: 'L'; |
|
4488 | 2440 |
addSecondaryTranslation: 'L'. |
2441 |
||
2442 |
"Modified: / 28-07-2017 / 11:32:03 / cg" |
|
2208 | 2443 |
! |
2444 |
||
2445 |
processM |
|
2446 |
||
2447 |
"case 'M': |
|
2448 |
if((StringAt((current - 1), 3, UMB, ) |
|
2449 |
AND (((current + 1) == last) OR StringAt((current + 2), 2, ER, ))) |
|
2450 |
//'dumb','thumb' |
|
2451 |
OR (GetAt(current + 1) == 'M') ) |
|
2452 |
current += 2; |
|
2453 |
else |
|
2454 |
current += 1; |
|
2455 |
MetaphAdd(M); |
|
2456 |
break; |
|
2457 |
" |
|
4488 | 2458 |
"the 'ER' test looks at exactly two characters, as StringAt((current + 2), 2, ER) does"
(((currentIndex > 1 and: [(inputKey copyFrom: currentIndex - 1 to: (currentIndex + 1 min: inputKey size)) = 'UMB'])
and: [currentIndex + 1 = inputKey size or: [(inputKey copyFrom: (currentIndex + 2 min: inputKey size) to: (currentIndex + 3 min: inputKey size)) = 'ER']])
or: [(self keyAt: currentIndex + 1) = $M])
2461 |
ifTrue: [ |
|
4488 | 2462 |
skipCount := skipCount + 1. |
2213 | 2463 |
]. |
2464 |
self addPrimaryTranslation: 'M'; |
|
2465 |
addSecondaryTranslation: 'M'. |
|
4488 | 2466 |
|
2467 |
"Modified: / 28-07-2017 / 11:32:08 / cg" |
|
2208 | 2468 |
! |
2469 |
||
2470 |
processN |
|
2213 | 2471 |
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp |
2472 |
case 'N': |
|
2208 | 2473 |
if(GetAt(current + 1) == 'N') |
2474 |
current += 2; |
|
2475 |
else |
|
2476 |
current += 1; |
|
2477 |
MetaphAdd(N); |
|
2478 |
break; |
|
2479 |
||
2213 | 2480 |
" |
2481 |
||
2482 |
(self keyAt: currentIndex + 1) = $N |
|
2483 |
ifTrue: [ |
|
4488 | 2484 |
skipCount := skipCount + 1 |
2213 | 2485 |
]. |
2486 |
self addPrimaryTranslation: 'N'; |
|
2487 |
addSecondaryTranslation: 'N'. |
|
4488 | 2488 |
|
2489 |
"Modified: / 28-07-2017 / 11:32:14 / cg" |
|
2208 | 2490 |
! |
2491 |
||
2492 |
processNtilde |
|
4488 | 2493 |
"case 'Ñ': |
2208 | 2494 |
current += 1; |
2495 |
MetaphAdd(N); |
|
2496 |
break; |
|
2497 |
" |
|
2498 |
self addPrimaryTranslation: 'N'; |
|
2499 |
addSecondaryTranslation: 'N'. |
|
2500 |
! |
|
2501 |
||
2502 |
processP |
|
2213 | 2503 |
"case 'P': |
2208 | 2504 |
if(GetAt(current + 1) == 'H') |
2505 |
{ |
|
2506 |
MetaphAdd(F); |
|
2507 |
current += 2; |
|
2508 |
break; |
|
2509 |
} |
|
2510 |
||
2511 |
//also account for campbell, raspberry |
|
2512 |
if(StringAt((current + 1), 1, P, B, )) |
|
2513 |
current += 2; |
|
2514 |
else |
|
2515 |
current += 1; |
|
2516 |
MetaphAdd(P); |
|
2517 |
break; |
|
2518 |
" |
|
2213 | 2519 |
| nextLetter | |
2520 |
(nextLetter := self keyAt: currentIndex + 1) = $H |
|
2521 |
ifTrue: [ |
|
2522 |
self addPrimaryTranslation: 'F'; |
|
2523 |
addSecondaryTranslation: 'F'. |
|
4488 | 2524 |
skipCount := skipCount + 1. |
2525 |
^self. |
|
2213 | 2526 |
]. |
2527 |
"also account for campbell, raspberry: skip the following P or B, but always add P"
(#($P $B) includes: nextLetter)
ifTrue: [
    skipCount := skipCount + 1.
].
self addPrimaryTranslation: 'P';
    addSecondaryTranslation: 'P'.
|
4488 | 2534 |
|
2535 |
"Modified: / 28-07-2017 / 11:32:28 / cg" |
|
2208 | 2536 |
! |
2537 |
||
2538 |
processQ |
|
2213 | 2539 |
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp |
2540 |
case 'Q': |
|
2208 | 2541 |
if(GetAt(current + 1) == 'Q') |
2542 |
current += 2; |
|
2543 |
else |
|
2544 |
current += 1; |
|
2545 |
MetaphAdd(K); |
|
2546 |
break; |
|
2547 |
||
2213 | 2548 |
" |
2549 |
||
2550 |
(self keyAt: currentIndex + 1) = $Q |
|
2551 |
ifTrue: [ |
|
4488 | 2552 |
skipCount := skipCount + 1 |
2213 | 2553 |
]. |
2554 |
self addPrimaryTranslation: 'K'; |
|
2555 |
addSecondaryTranslation: 'K'. |
|
4488 | 2556 |
|
2557 |
"Modified: / 28-07-2017 / 11:32:32 / cg" |
|
2208 | 2558 |
! |
2559 |
||
2560 |
processR |
|
2213 | 2561 |
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp |
2562 |
case 'R': |
|
2208 | 2563 |
//french e.g. 'rogier', but exclude 'hochmeier' |
2564 |
if((current == last) |
|
2565 |
AND !!SlavoGermanic() |
|
2566 |
AND StringAt((current - 2), 2, IE, ) |
|
2567 |
AND !!StringAt((current - 4), 2, ME, MA, )) |
|
2568 |
MetaphAdd(, R); |
|
2569 |
else |
|
2570 |
MetaphAdd(R); |
|
2571 |
||
2572 |
if(GetAt(current + 1) == 'R') |
|
2573 |
current += 2; |
|
2574 |
else |
|
2575 |
current += 1; |
|
2576 |
break; |
|
2213 | 2577 |
" |
4488 | 2578 |
(currentIndex = inputKey size and: [ |
2579 |
(self isSlavoGermanic: inputKey) not and: [ |
|
2580 |
(inputKey copyFrom: ((currentIndex - 2) max: 1) to: ((currentIndex - 1) max: 1)) = 'IE' and: [ |
|
2581 |
(#('ME' 'MA') includes: (inputKey copyFrom: ((currentIndex - 4) max: 1) to: ((currentIndex - 3) max: 1))) not |
|
2213 | 2582 |
] |
2583 |
] |
|
2584 |
]) |
|
2585 |
ifTrue: [ |
|
2586 |
self addPrimaryTranslation: ''; |
|
2587 |
addSecondaryTranslation: 'R'. |
|
2588 |
] ifFalse: [ |
|
2589 |
self addPrimaryTranslation: 'R'; |
|
2590 |
addSecondaryTranslation: 'R'. |
|
2591 |
]. |
|
2592 |
(self keyAt: currentIndex + 1) = $R |
|
2593 |
ifTrue: [ |
|
4488 | 2594 |
skipCount := skipCount + 1 |
2213 | 2595 |
]. |
4488 | 2596 |
|
2597 |
"Modified: / 28-07-2017 / 11:32:37 / cg" |
|
2208 | 2598 |
! |
2599 |
||
2600 |
processRemainingCharacters |
|
4488 | 2601 |
startIndex to: inputKey size do:[ :i | |
2208 | 2602 |
| c methodSelector | |
2603 |
||
4488 | 2604 |
skipCount = 0 ifTrue:[ |
2605 |
((primaryTranslation size > 4) and: [ secondaryTranslation size > 4 ]) |
|
2208 | 2606 |
ifTrue: [ ^self ]. |
2607 |
||
4488 | 2608 |
currentIndex := i. |
2208 | 2609 |
c := self keyAt: i. |
2610 |
||
2611 |
(c isVowel not and: [c ~= $Y]) ifTrue:[ |
|
4488 | 2612 |
c == $Ç ifTrue: [ |
2208 | 2613 |
methodSelector := #processCedille |
4488 | 2614 |
] ifFalse: [ c == $Ñ ifTrue: [ |
2208 | 2615 |
methodSelector := #processNtilde |
2616 |
] ifFalse: [ |
|
2617 |
methodSelector := ('process', c asString) asSymbol |
|
2618 |
]]. |
|
2619 |
self perform: methodSelector |
|
2620 |
] |
|
2621 |
] ifFalse: [ |
|
4488 | 2622 |
skipCount := skipCount - 1 |
2208 | 2623 |
] |
2624 |
] |
|
4488 | 2625 |
|
2626 |
"Modified: / 28-07-2017 / 11:24:15 / cg" |
|
2208 | 2627 |
! |
2628 |
||
2629 |
processS |
|
2213 | 2630 |
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp |
2631 |
case 'S': |
|
2208 | 2632 |
//special cases 'island', 'isle', 'carlisle', 'carlysle' |
2633 |
if(StringAt((current - 1), 3, ISL, YSL, )) |
|
2634 |
{ |
|
2635 |
current += 1; |
|
2636 |
break; |
|
2637 |
} |
|
2638 |
||
2639 |
//special case 'sugar-' |
|
2640 |
if((current == 0) AND StringAt(current, 5, SUGAR, )) |
|
2641 |
{ |
|
2642 |
MetaphAdd(X, S); |
|
2643 |
current += 1; |
|
2644 |
break; |
|
2645 |
} |
|
2646 |
||
2647 |
if(StringAt(current, 2, SH, )) |
|
2648 |
{ |
|
2649 |
//germanic |
|
2650 |
if(StringAt((current + 1), 4, HEIM, HOEK, HOLM, HOLZ, )) |
|
2651 |
MetaphAdd(S); |
|
2652 |
else |
|
2653 |
MetaphAdd(X); |
|
2654 |
current += 2; |
|
2655 |
break; |
|
2656 |
} |
|
2657 |
||
2658 |
//italian & armenian |
|
2659 |
if(StringAt(current, 3, SIO, SIA, ) OR StringAt(current, 4, SIAN, )) |
|
2660 |
{ |
|
2661 |
if(!!SlavoGermanic()) |
|
2662 |
MetaphAdd(S, X); |
|
2663 |
else |
|
2664 |
MetaphAdd(S); |
|
2665 |
current += 3; |
|
2666 |
break; |
|
2667 |
} |
|
2668 |
||
2669 |
//german & anglicisations, e.g. 'smith' match 'schmidt', 'snider' match 'schneider' |
|
2670 |
//also, -sz- in slavic language altho in hungarian it is pronounced 's' |
|
2671 |
if(((current == 0) |
|
2672 |
AND StringAt((current + 1), 1, M, N, L, W, )) |
|
2673 |
OR StringAt((current + 1), 1, Z, )) |
|
2674 |
{ |
|
2675 |
MetaphAdd(S, X); |
|
2676 |
if(StringAt((current + 1), 1, Z, )) |
|
2677 |
current += 2; |
|
2678 |
else |
|
2679 |
current += 1; |
|
2680 |
break; |
|
2681 |
} |
|
2682 |
||
2683 |
if(StringAt(current, 2, SC, )) |
|
2684 |
{ |
|
2685 |
//Schlesinger's rule |
|
2686 |
if(GetAt(current + 2) == 'H') |
|
2687 |
//dutch origin, e.g. 'school', 'schooner' |
|
2688 |
if(StringAt((current + 3), 2, OO, ER, EN, UY, ED, EM, )) |
|
2689 |
{ |
|
2690 |
//'schermerhorn', 'schenker' |
|
2691 |
if(StringAt((current + 3), 2, ER, EN, )) |
|
2692 |
{ |
|
2693 |
MetaphAdd(X, SK); |
|
2694 |
}else |
|
2695 |
MetaphAdd(SK); |
|
2696 |
current += 3; |
|
2697 |
break; |
|
2698 |
}else{ |
|
2699 |
if((current == 0) AND !!IsVowel(3) AND (GetAt(3) !!= 'W')) |
|
2700 |
MetaphAdd(X, S); |
|
2701 |
else |
|
2702 |
MetaphAdd(X); |
|
2703 |
current += 3; |
|
2704 |
break; |
|
2705 |
} |
|
2706 |
||
2707 |
if(StringAt((current + 2), 1, I, E, Y, )) |
|
2708 |
{ |
|
2709 |
MetaphAdd(S); |
|
2710 |
current += 3; |
|
2711 |
break; |
|
2712 |
} |
|
2713 |
//else |
|
2714 |
MetaphAdd(SK); |
|
2715 |
current += 3; |
|
2716 |
break; |
|
2717 |
} |
|
2718 |
||
2719 |
//french e.g. 'resnais', 'artois' |
|
2720 |
if((current == last) AND StringAt((current - 2), 2, AI, OI, )) |
|
2721 |
MetaphAdd(, S); |
|
2722 |
else |
|
2723 |
MetaphAdd(S); |
|
2724 |
||
2725 |
if(StringAt((current + 1), 1, S, Z, )) |
|
2726 |
current += 2; |
|
2727 |
else |
|
2728 |
current += 1; |
|
2729 |
break; |
|
2730 |
" |
|
2731 |
||
2213 | 2732 |
| nextChar char2 chars char | |
4488 | 2733 |
(#('ISL' 'YSL') includes: (inputKey copyFrom: (currentIndex - 1 max: 1) to: (currentIndex + 1 min: inputKey size))) |
2213 | 2734 |
ifTrue: [ |
2735 |
^self |
|
2736 |
]. |
|
4488 | 2737 |
(currentIndex = 1 and: [(inputKey copyFrom: 1 to: (5 min: inputKey size)) = 'SUGAR']) |
2213 | 2738 |
ifTrue: [ |
2739 |
self addPrimaryTranslation: 'X'; |
|
2740 |
addSecondaryTranslation: 'S'. |
|
2741 |
^self. |
|
2742 |
]. |
|
4488 | 2743 |
(inputKey copyFrom: currentIndex to: ((currentIndex + 1) min: inputKey size)) = 'SH' |
2213 | 2744 |
ifTrue: [ |
4488 | 2745 |
(#('HEIM' 'HOEK' 'HOLM' 'HOLZ') includes: (inputKey copyFrom: (currentIndex + 1 min: inputKey size) to: ((currentIndex + 4) min: inputKey size)))
2213 | 2746 |
ifTrue: [ |
2747 |
self addPrimaryTranslation: 'S'; |
|
2748 |
addSecondaryTranslation: 'S'. |
|
2749 |
] ifFalse: [ |
|
2750 |
self addPrimaryTranslation: 'X'; |
|
2751 |
addSecondaryTranslation: 'X'. |
|
2752 |
]. |
|
4488 | 2753 |
skipCount := skipCount + 1. |
2754 |
^self |
|
2213 | 2755 |
]. |
4488 | 2756 |
((#('SIO' 'SIA') includes: (inputKey copyFrom: currentIndex to: (currentIndex + 2 min: inputKey size))) |
2757 |
or: [(inputKey copyFrom: currentIndex to: (currentIndex + 3 min: inputKey size)) = 'SIAN']) |
|
2213 | 2758 |
ifTrue: [ |
4488 | 2759 |
(self isSlavoGermanic: inputKey) not |
2213 | 2760 |
ifTrue: [ |
2761 |
self addPrimaryTranslation: 'S'; |
|
2762 |
addSecondaryTranslation: 'X'. |
|
2763 |
] ifFalse: [ |
|
2764 |
self addPrimaryTranslation: 'S'; |
|
2765 |
addSecondaryTranslation: 'S'. |
|
2766 |
]. |
|
4488 | 2767 |
skipCount := skipCount + 2. |
2768 |
^self |
|
2213 | 2769 |
]. |
2770 |
((currentIndex = 1 and: [#($M $N $L $W) includes: (self keyAt: currentIndex + 1)]) |
|
2771 |
or: [(nextChar := self keyAt: currentIndex + 1) = $Z]) |
|
2772 |
ifTrue: [ |
|
2773 |
self addPrimaryTranslation: 'S'; |
|
2774 |
addSecondaryTranslation: 'X'. |
|
2775 |
nextChar == $Z |
2213 | 2776 |
ifTrue: [ |
4488 | 2777 |
skipCount := skipCount + 1. |
2778 |
^self. |
|
2213 | 2779 |
]. |
2780 |
^self. |
|
2781 |
]. |
|
4488 | 2782 |
((inputKey copyFrom: currentIndex to: ((currentIndex + 1) min: inputKey size)) = 'SC') |
2213 | 2783 |
ifTrue: [ |
2784 |
(char2 := self keyAt: currentIndex + 2) = $H |
|
2785 |
ifTrue: [ |
|
4488 | 2786 |
(#('OO' 'ER' 'EN' 'UY' 'ED' 'EM') includes: (chars := inputKey copyFrom: ((currentIndex + 3) min: inputKey size) to: ((currentIndex + 4) min: inputKey size))) |
2213 | 2787 |
ifTrue: [ |
2788 |
(#('ER' 'EN') includes: chars) |
|
2789 |
ifTrue: [ |
|
2790 |
self addPrimaryTranslation: 'X'; |
|
2791 |
addSecondaryTranslation: 'SK'. |
|
2792 |
] ifFalse: [ |
|
2793 |
self addPrimaryTranslation: 'SK'; |
|
2794 |
addSecondaryTranslation: 'SK'. |
|
2795 |
]. |
|
4488 | 2796 |
skipCount := skipCount + 2. |
2797 |
^self. |
|
2213 | 2798 |
] ifFalse: [ |
4488 | 2799 |
((currentIndex = 1 and: [(char := inputKey at: 4 ifAbsent: [$b]) isVowel not]) and: [char ~= $W]) |
2213 | 2800 |
ifTrue: [ |
2801 |
self addPrimaryTranslation: 'X'; |
|
2802 |
addSecondaryTranslation: 'S'. |
|
2803 |
] ifFalse: [ |
|
2804 |
self addPrimaryTranslation: 'X'; |
|
2805 |
addSecondaryTranslation: 'X'. |
|
2806 |
]. |
|
4488 | 2807 |
skipCount := skipCount + 2. |
2808 |
^self . |
|
2213 | 2809 |
]. |
2810 |
] ifFalse: [ |
|
2811 |
(#($I $E $Y) includes: char2) |
|
2812 |
ifTrue: [ |
|
2813 |
self addPrimaryTranslation: 'S'; |
|
2814 |
addSecondaryTranslation: 'S'. |
|
4488 | 2815 |
skipCount := skipCount + 2. |
2816 |
^self . |
|
2213 | 2817 |
] ifFalse: [ |
2818 |
self addPrimaryTranslation: 'SK'; |
|
2819 |
addSecondaryTranslation: 'SK'. |
|
4488 | 2820 |
skipCount := skipCount + 2. |
2821 |
^self. |
|
2213 | 2822 |
] |
2823 |
]. |
|
2824 |
]. |
|
4488 | 2825 |
(currentIndex = inputKey size and: [(#('AI' 'OI') includes: (inputKey copyFrom: ((currentIndex - 2) max: 1) to: ((currentIndex - 1) max: 1)))]) |
2213 | 2826 |
ifTrue: [ |
2827 |
self addPrimaryTranslation: ''; |
|
2828 |
addSecondaryTranslation: 'S'. |
|
2829 |
] ifFalse: [ |
|
2830 |
self addPrimaryTranslation: 'S'; |
|
2831 |
addSecondaryTranslation: 'S'. |
|
2832 |
]. |
|
2833 |
(#($S $Z) includes: (self keyAt: currentIndex + 1)) |
|
2834 |
ifTrue: [ |
|
4488 | 2835 |
skipCount := skipCount + 1. |
2836 |
^self. |
|
2213 | 2837 |
]. |
4488 | 2838 |
|
2839 |
"Modified: / 28-07-2017 / 11:34:18 / cg" |
|
2208 | 2840 |
! |
2841 |
||
2842 |
processT |
|
2213 | 2843 |
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp |
2844 |
case 'T': |
|
2208 | 2845 |
if(StringAt(current, 4, TION, )) |
2846 |
{ |
|
2847 |
MetaphAdd(X); |
|
2848 |
current += 3; |
|
2849 |
break; |
|
2850 |
} |
|
2851 |
||
2852 |
if(StringAt(current, 3, TIA, TCH, )) |
|
2853 |
{ |
|
2854 |
MetaphAdd(X); |
|
2855 |
current += 3; |
|
2856 |
break; |
|
2857 |
} |
|
2858 |
||
2859 |
if(StringAt(current, 2, TH, ) |
|
2860 |
OR StringAt(current, 3, TTH, )) |
|
2861 |
{ |
|
2862 |
//special case 'thomas', 'thames' or germanic |
|
2863 |
if(StringAt((current + 2), 2, OM, AM, ) |
|
2864 |
OR StringAt(0, 4, VAN , VON , ) |
|
2865 |
OR StringAt(0, 3, SCH, )) |
|
2866 |
{ |
|
2867 |
MetaphAdd(T); |
|
2868 |
}else{ |
|
2869 |
MetaphAdd(0, T); |
|
2870 |
} |
|
2871 |
current += 2; |
|
2872 |
break; |
|
2873 |
} |
|
2874 |
||
2875 |
if(StringAt((current + 1), 1, T, D, )) |
|
2876 |
current += 2; |
|
2877 |
else |
|
2878 |
current += 1; |
|
2879 |
MetaphAdd(T); |
|
2880 |
break; |
|
2881 |
" |
|
4488 | 2882 |
((inputKey copyFrom: currentIndex to: ((currentIndex + 3) min: inputKey size)) = 'TION') |
2213 | 2883 |
ifTrue: [ |
2884 |
self addPrimaryTranslation: 'X'; |
|
4488 | 2885 |
addSecondaryTranslation: 'X'. |
2886 |
skipCount := skipCount + 2. |
|
2887 |
^self. |
|
2213 | 2888 |
]. |
4488 | 2889 |
(#('TIA' 'TCH') includes: (inputKey copyFrom: currentIndex to: ((currentIndex + 2) min: inputKey size))) |
2213 | 2890 |
ifTrue: [ |
2891 |
self addPrimaryTranslation: 'X'; |
|
4488 | 2892 |
addSecondaryTranslation: 'X'. |
2893 |
skipCount := skipCount + 2. |
|
2894 |
^self. |
|
2213 | 2895 |
]. |
4488 | 2896 |
(((inputKey copyFrom: currentIndex to: ((currentIndex + 1) min: inputKey size)) = 'TH') or: [ |
2897 |
((inputKey copyFrom: currentIndex to: ((currentIndex + 2) min: inputKey size)) = 'TTH') |
|
2213 | 2898 |
]) |
2899 |
ifTrue: [ |
|
4488 | 2900 |
((#('OM' 'AM') includes: (inputKey copyFrom: currentIndex + 2 to: ((currentIndex + 3) min: inputKey size))) |
2901 |
or: [(#('VAN ' 'VON ') includes: (inputKey copyFrom: 1 to: (4 min: inputKey size))) |
|
2902 |
or: [(inputKey copyFrom: 1 to: (3 min: inputKey size)) = 'SCH'] |
|
2213 | 2903 |
]) |
2904 |
ifTrue: [ |
|
2905 |
self addPrimaryTranslation: 'T'; |
|
2906 |
addSecondaryTranslation: 'T'. |
|
2907 |
] ifFalse: [ |
|
2908 |
self addPrimaryTranslation: '0'; |
|
2909 |
addSecondaryTranslation: 'T'. |
|
2910 |
]. |
|
4488 | 2911 |
skipCount := skipCount + 1. |
2912 |
^self. |
|
2213 | 2913 |
]. |
2914 |
(#($T $D) includes: (self keyAt: currentIndex + 1)) |
|
2915 |
ifTrue: [ |
|
4488 | 2916 |
skipCount := skipCount + 1. |
2213 | 2917 |
]. |
2918 |
self addPrimaryTranslation: 'T'; |
|
4488 | 2919 |
addSecondaryTranslation: 'T'. |
2920 |
||
2921 |
"Modified: / 28-07-2017 / 11:33:33 / cg" |
|
2208 | 2922 |
! |
2923 |
||
2924 |
processV |
|
2213 | 2925 |
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp |
2926 |
case 'V': |
|
2208 | 2927 |
if(GetAt(current + 1) == 'V') |
2928 |
current += 2; |
|
2929 |
else |
|
2930 |
current += 1; |
|
2931 |
MetaphAdd(F); |
|
2932 |
break; |
|
2933 |
||
2934 |
||
2213 | 2935 |
" |
2936 |
||
2937 |
(self keyAt: currentIndex + 1) = $V |
|
2938 |
ifTrue: [ |
|
4488 | 2939 |
skipCount := skipCount + 1 |
2213 | 2940 |
]. |
2941 |
self addPrimaryTranslation: 'F'; |
|
2942 |
addSecondaryTranslation: 'F'. |
|
4488 | 2943 |
|
2944 |
"Modified: / 28-07-2017 / 11:34:27 / cg" |
|
2208 | 2945 |
! |
2946 |
||
2947 |
processW |
|
2213 | 2948 |
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp |
2949 |
case 'W': |
|
2208 | 2950 |
//can also be in middle of word |
2951 |
if(StringAt(current, 2, WR, )) |
|
2952 |
{ |
|
2953 |
MetaphAdd(R); |
|
2954 |
current += 2; |
|
2955 |
break; |
|
2956 |
} |
|
2957 |
||
2958 |
if((current == 0) |
|
2959 |
AND (IsVowel(current + 1) OR StringAt(current, 2, WH, ))) |
|
2960 |
{ |
|
2961 |
//Wasserman should match Vasserman |
|
2962 |
if(IsVowel(current + 1)) |
|
2963 |
MetaphAdd(A, F); |
|
2964 |
else |
|
2965 |
//need Uomo to match Womo |
|
2966 |
MetaphAdd(A); |
|
2967 |
} |
|
2968 |
||
2969 |
//Arnow should match Arnoff |
|
2970 |
if(((current == last) AND IsVowel(current - 1)) |
|
2971 |
OR StringAt((current - 1), 5, EWSKI, EWSKY, OWSKI, OWSKY, ) |
|
2972 |
OR StringAt(0, 3, SCH, )) |
|
2213 | 2973 |
{ |
2208 | 2974 |
MetaphAdd(, F); |
2975 |
current +=1; |
|
2976 |
break; |
|
2977 |
} |
|
2978 |
||
2979 |
//polish e.g. 'filipowicz' |
|
2980 |
if(StringAt(current, 4, WICZ, WITZ, )) |
|
2981 |
{ |
|
2982 |
MetaphAdd(TS, FX); |
|
2983 |
current +=4; |
|
2984 |
break; |
|
2985 |
} |
|
2986 |
||
2987 |
//else skip it |
|
2988 |
current +=1; |
|
2989 |
break; |
|
2990 |
" |
|
2213 | 2991 |
| word nextLetter | |
4488 | 2992 |
((word := inputKey copyFrom: currentIndex to: (currentIndex + 1 min: inputKey size)) = 'WR') |
2213 | 2993 |
ifTrue: [ |
2994 |
self addPrimaryTranslation: 'R'; |
|
2995 |
addSecondaryTranslation: 'R'. |
|
4488 | 2996 |
skipCount := skipCount + 1. |
2997 |
^self |
|
2213 | 2998 |
]. |
2999 |
"as in the C original: (current == 0) AND (IsVowel(current + 1) OR 'WH')"
(currentIndex = 1 and: [(nextLetter := self keyAt: currentIndex + 1) isVowel or: [word = 'WH']])
|
3002 |
ifTrue: [ |
|
3003 |
nextLetter isVowel |
|
3004 |
ifTrue: [ |
|
3005 |
self addPrimaryTranslation: 'A'; |
|
3006 |
addSecondaryTranslation: 'F'. |
|
3007 |
] ifFalse: [ |
|
3008 |
self addPrimaryTranslation: 'A'; |
|
3009 |
addSecondaryTranslation: 'A'. |
|
3010 |
] |
|
3011 |
]. |
|
4488 | 3012 |
((((currentIndex = inputKey size) and: [(self keyAt: currentIndex - 1) isVowel]) |
3013 |
or: [#('EWSKI' 'EWSKY' 'OWSKI' 'OWSKY') includes: (inputKey copyFrom: ((currentIndex - 1) max: 1) to: (currentIndex + 3 min: inputKey size))]) |
|
3014 |
or: [inputKey startsWith:'SCH']) |
|
2213 | 3015 |
ifTrue: [ |
3016 |
self addPrimaryTranslation: ''; |
|
3017 |
addSecondaryTranslation: 'F'. |
|
3018 |
^self. |
|
3019 |
]. |
|
4488 | 3020 |
(#('WICZ' 'WITZ') includes: (inputKey copyFrom: currentIndex to: (currentIndex + 3 min: inputKey size)))
2213 | 3021 |
ifTrue: [ |
3022 |
self addPrimaryTranslation: 'TS'; |
|
3023 |
addSecondaryTranslation: 'FX'. |
|
4488 | 3024 |
skipCount := skipCount + 3. |
3025 |
^self |
|
2213 | 3026 |
]. |
4488 | 3027 |
|
3028 |
"Modified: / 28-07-2017 / 11:34:51 / cg" |
|
2208 | 3029 |
! |
3030 |
||
3031 |
processX |
|
2213 | 3032 |
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp |
3033 |
case 'X': |
|
2208 | 3034 |
//french e.g. breaux |
3035 |
if(!!((current == last) |
|
3036 |
AND (StringAt((current - 3), 3, IAU, EAU, ) |
|
3037 |
OR StringAt((current - 2), 2, AU, OU, ))) ) |
|
3038 |
MetaphAdd(KS); |
|
3039 |
||
3040 |
if(StringAt((current + 1), 1, C, X, )) |
|
3041 |
current += 2; |
|
3042 |
else |
|
3043 |
current += 1; |
|
3044 |
break; |
|
3045 |
" |
|
3046 |
||
3047 |
||
4488 | 3048 |
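"look at the characters before the X only, as StringAt((current - 3), 3, ...) and StringAt((current - 2), 2, ...) do"
((currentIndex = inputKey size)
and: [(#('IAU' 'EAU') includes: (inputKey copyFrom: ((currentIndex - 3) max: 1) to: ((currentIndex - 1) max: 1)))
or: [(#('AU' 'OU') includes: (inputKey copyFrom: ((currentIndex - 2) max: 1) to: ((currentIndex - 1) max: 1)))]])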
|
3051 |
ifFalse: [ |
2213 | 3052 |
self addPrimaryTranslation: 'KS'; |
3053 |
addSecondaryTranslation: 'KS'. |
|
3054 |
]. |
|
3055 |
(#($C $X) includes: (self keyAt: currentIndex + 1)) |
|
3056 |
ifTrue: [ |
|
4488 | 3057 |
skipCount := skipCount + 1. |
3058 |
^self |
|
2213 | 3059 |
] |
3060 |
|
4488 | 3061 |
"Modified: / 28-07-2017 / 11:34:44 / cg" |
2208 | 3062 |
! |
3063 |
||
3064 |
processZ |
|
2213 | 3065 |
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp |
3066 |
case 'Z': |
|
2208 | 3067 |
//chinese pinyin e.g. 'zhao' |
3068 |
if(GetAt(current + 1) == 'H') |
|
3069 |
{ |
|
3070 |
MetaphAdd(J); |
|
3071 |
current += 2; |
|
3072 |
break; |
|
3073 |
}else |
|
3074 |
if(StringAt((current + 1), 2, ZO, ZI, ZA, ) |
|
3075 |
OR (SlavoGermanic() AND ((current > 0) AND GetAt(current - 1) !!= 'T'))) |
|
3076 |
{ |
|
3077 |
MetaphAdd(S, TS); |
|
3078 |
} |
|
3079 |
else |
|
3080 |
MetaphAdd(S); |
|
3081 |
||
3082 |
if(GetAt(current + 1) == 'Z') |
|
3083 |
current += 2; |
|
3084 |
else |
|
3085 |
current += 1; |
|
3086 |
break; |
|
3087 |
" |
|
3088 |
||
2213 | 3089 |
(self keyAt: currentIndex + 1) = $H |
3090 |
ifTrue: [ |
|
3091 |
self addPrimaryTranslation: 'J'; |
|
3092 |
addSecondaryTranslation: 'J'. |
|
4488 | 3093 |
skipCount := skipCount + 1. |
3094 |
^self |
|
2213 | 3095 |
] ifFalse: [ |
4488 | 3096 |
((#('ZO' 'ZI' 'ZA') includes: (inputKey copyFrom: ((currentIndex + 1) min: inputKey size) to: ((currentIndex + 2) min: inputKey size))) or: [ |
3097 |
(self isSlavoGermanic: inputKey) and: [(currentIndex > 1 and: [(self keyAt: currentIndex - 1) ~= $T])]
|
2213 | 3098 |
]) |
3099 |
ifTrue: [ |
|
3100 |
self addPrimaryTranslation: 'S'; |
|
3101 |
addSecondaryTranslation: 'TS'. |
|
3102 |
] ifFalse: [ |
|
3103 |
self addPrimaryTranslation: 'S'; |
|
3104 |
addSecondaryTranslation: 'S'. |
|
3105 |
]. |
|
3106 |
(self keyAt: currentIndex + 1) = $Z |
|
3107 |
ifTrue: [ |
|
4488 | 3108 |
skipCount := skipCount + 1. |
3109 |
^self |
|
2213 | 3110 |
]. |
3111 |
] |
|
4488 | 3112 |
|
3113 |
"Modified: / 28-07-2017 / 11:35:12 / cg" |
|
3114 |
! ! |
|
3115 |
||
3116 |
!PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator class methodsFor:'documentation'! |
|
3117 |
||
3118 |
documentation |
|
3119 |
" |
|
3120 |
The 'Kölner Phonetik' (Cologne phonetics) code is for the German language
what the soundex code is for English:
it returns similar strings for similar sounding words
(but is specifically aware of the pronunciation of German and eastern languages).

There are some other differences to soundex, though:
its length is not limited to 4, but depends on the length of the original string;
it does not start with the first character of the input, but returns a pure numeric string.

This algorithm was described by Postel in 1969.
See http://de.wikipedia.org/wiki/K%C3%B6lner_Phonetik

self new phoneticStringsFor:'Müller-Lüdenscheidt' -> #('65752682')
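
A usage sketch (an illustration added here, not taken from Postel's description;
it uses only the encode: method of this class): two spellings are considered
to sound alike when their codes are equal. According to the examples at the
end of encode:, both 'müller' and 'miller' encode to '657'.

    |comparator|

    comparator := self new.
    (comparator encode:'müller') = (comparator encode:'miller')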
|
3133 |
" |
|
3134 |
! |
|
3135 |
||
3136 |
examples |
|
3137 |
" |
|
3138 |
words sounding similar (german pronunciation) will deliver a similar code: |
|
3139 |
||
3140 |
#( |
|
3141 |
'Müller' |
|
3142 |
'Miller' |
|
3143 |
'Mueller' |
|
3144 |
'Mühler' |
|
3145 |
'Mühlherr' |
|
3146 |
'Mülherr' |
|
3147 |
'Myler' |
|
3148 |
'Millar' |
|
3149 |
'Myller' |
|
3150 |
'Müllar' |
|
3151 |
'Müler' |
|
3152 |
'Muehler' |
|
3153 |
'Mülller' |
|
3154 |
'Müllerr' |
|
3155 |
'Muehlherr' |
|
3156 |
'Muellar' |
|
3157 |
'Mueler' |
|
3158 |
'Mülleer' |
|
3159 |
'Mueller' |
|
3160 |
'Nüller' |
|
3161 |
'Nyller' |
|
3162 |
'Niler' |
|
3163 |
'Czerny' |
|
3164 |
'Tscherny' |
|
3165 |
'Czernie' |
|
3166 |
'Tschernie' |
|
3167 |
'Schernie' |
|
3168 |
'Scherny' |
|
3169 |
'Scherno' |
|
3170 |
'Czerne' |
|
3171 |
'Zerny' |
|
3172 |
'Tzernie' |
|
3173 |
'Breschnew' |
|
3174 |
'Breschnew' |
|
3175 |
'Breschneff' |
|
3176 |
'Breschnjeff' |
|
3177 |
'Braeschneff' |
|
3178 |
'Braessneff' |
|
3179 |
'Pressneff' |
|
3180 |
'Presznäph' |
|
3181 |
'Präschnäf' |
|
3182 |
'Breschnjeff' |
|
3183 |
'Breschnijeff' |
|
3184 |
'Breschnieff' |
|
3185 |
'Bräschnieff' |
|
3186 |
'Braschnieff' |
|
3187 |
'Broschnieff' |
|
3188 |
) do:[:w | |
|
3189 |
Transcript show:w; show:'->'; showCR:(PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new encode:w) |
|
3190 |
]. |
|
3191 |
" |
|
3192 |
! ! |
|
3193 |
||
3194 |
!PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator methodsFor:'api'! |
|
3195 |
||
3196 |
encode: aString |
|
3197 |
"return a koelner phonetic code. |
|
3198 |
The koelnerPhonetic code is for the German language what the soundex code is for English;
it returns similar strings for similar sounding words.
|
3200 |
There are some differences to soundex, though: |
|
3201 |
its length is not limited to 4, but depends on the length of the original string; |
|
3202 |
it does not start with the first character of the input. |
|
3203 |
This algorithm is described by Postel 1969" |
|
3204 |
||
3205 |
|in ret val rslt| |
|
3206 |
||
3207 |
in := aString withoutSeparators asLowercase. |
|
3208 |
in := in copyReplaceString:'ph' withString:'f'. |
|
3209 |
(in includesAny:'öäüß') ifTrue:[ |
|
3210 |
in := in copyReplaceAll:$ü withAll:'u'. |
|
3211 |
in := in copyReplaceAll:$ä withAll:'a'. |
|
3212 |
in := in copyReplaceAll:$ö withAll:'o'. |
|
3213 |
in := in copyReplaceAll:$ß withAll:'ss'. |
|
3214 |
]. |
|
3215 |
in := in select:[:ch | ch isLetter]. |
|
3216 |
in := '#',in,'#'. |
|
3217 |
||
3218 |
ret := ''. |
|
3219 |
1 to:in size-2 do:[:i | |
|
3220 |
|sub| |
|
3221 |
||
3222 |
sub := in copyFrom:i to:i+2. |
|
3223 |
val := (i==1) |
|
3224 |
ifTrue:[ self convertFirst:sub ] |
|
3225 |
ifFalse:[ self convertRest:sub ]. |
|
3226 |
ret := ret,val |
|
3227 |
]. |
|
3228 |
||
3229 |
ret := ret select:[:ch | ch ~= $-]. |
|
3230 |
||
3231 |
(ret startsWith:'0') ifTrue:[ |
|
3232 |
ret := '0',(ret select:[:ch | ch ~= $0]). |
|
3233 |
] ifFalse:[ |
|
3234 |
ret := ret select:[:ch | ch ~= $0]. |
|
3235 |
]. |
|
3236 |
||
3237 |
rslt := String streamContents:[:s | |
|
3238 |
|prev| |
|
3239 |
||
3240 |
ret do:[:ch | |
|
3241 |
ch ~= prev ifTrue:[ |
|
3242 |
s nextPut:ch |
|
3243 |
]. |
|
3244 |
prev := ch. |
|
3245 |
]. |
|
3246 |
]. |
|
3247 |
^ rslt. |
|
3248 |
||
    "
     #(
        'Müller'
        'Miller'
        'Mueller'
        'Mühler'
        'Mühlherr'
        'Mülherr'
        'Myler'
        'Millar'
        'Myller'
        'Müllar'
        'Müler'
        'Muehler'
        'Mülller'
        'Müllerr'
        'Muehlherr'
        'Muellar'
        'Mueler'
        'Mülleer'
        'Mueller'
        'Nüller'
        'Nyller'
        'Niler'
        'Czerny'
        'Tscherny'
        'Czernie'
        'Tschernie'
        'Schernie'
        'Scherny'
        'Scherno'
        'Czerne'
        'Zerny'
        'Tzernie'
        'Breschnew'
        'Breschnew'
        'Breschneff'
        'Breschnjeff'
        'Braeschneff'
        'Braessneff'
        'Pressneff'
        'Presznäph'
        'Präschnäf'
        'Breschnjeff'
        'Breschnijeff'
        'Breschnieff'
     ) do:[:w |
         Transcript show:w; show:'->'; showCR:(PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new encode:w)
     ].
    "

    "
     PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new encode:'Breschnew' -> '17863'
     PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new encode:'Breschneff' -> '17863'
     PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new encode:'Braeschneff' -> '17863'
     PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new encode:'Braessneff' -> '17863'
     PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new encode:'Pressneff' -> '17863'
     PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new encode:'Presznäph' -> '17863'
     PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new encode:'Präschnäf' -> '17863'
     PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new encode:'Breschnjeff' -> '17863'
     PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new encode:'Breschnijeff' -> '17863'
     PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new encode:'Breschnieff' -> '17863'
    "
    "
     self basicNew encode:'müller' -> '657'
     self basicNew encode:'möller' -> '657'
     self basicNew encode:'miller' -> '657'
     self basicNew encode:'muller' -> '657'
     self basicNew encode:'muler' -> '657'
     self basicNew encode:'schmidt' -> '862'
     self basicNew encode:'schneider' -> '8627'
     self basicNew encode:'fischer' -> '387'
     self basicNew encode:'weber' -> '317'
     self basicNew encode:'meyer' -> '67'
     self basicNew encode:'wagner' -> '3467'
     self basicNew encode:'schulz' -> '858'
     self basicNew encode:'becker' -> '147'
     self basicNew encode:'hoffmann' -> '036'
     self basicNew encode:'schäfer' -> '837'
    "
|
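    "
     Worked example (a hand trace of the steps of this method, added for illustration;
     it assumes that convertFirst: and convertRest: behave as listed in the private protocol):

     'Müller'
        normalized and padded:          '#muller#'
        3-character windows:            '#mu' 'mul' 'ull' 'lle' 'ler' 'er#'
        code of each middle letter:     6 0 5 5 0 7
        '0' codes dropped:              '6557'
        equal neighbours collapsed:     '657'

     'hoffmann'
        normalized and padded:          '#hoffmann#'
        code of each middle letter:     - 0 3 3 6 0 6 6     (h gives the placeholder '-')
        placeholder dropped:            '0336066'
        '0' codes dropped, leading '0' kept:  '033666'
        equal neighbours collapsed:     '036'

     Notice that the zeros are dropped before equal neighbours are collapsed; therefore
     two equal codes which are only separated by a vowel end up as a single code.
    "
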
    "Created: / 28-07-2017 / 15:24:33 / cg"
! !

!PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator methodsFor:'private'!

convertFirst:chars
    |c2 c3|

    chars size == 3 ifTrue:[
        c2 := (chars at:2).
        c2 == $a ifTrue:[^ '0'].
        c2 == $e ifTrue:[^ '0'].
        c2 == $i ifTrue:[^ '0'].
        c2 == $j ifTrue:[^ '0'].
        c2 == $y ifTrue:[^ '0'].
        c2 == $o ifTrue:[^ '0'].
        c2 == $u ifTrue:[^ '0'].

        c2 == $c ifTrue:[
            c3 := (chars at:3).
            (c3 == $a) ifTrue:[^ '4'].
            (c3 == $h) ifTrue:[^ '4'].
            (c3 == $k) ifTrue:[^ '4'].
            (c3 == $l) ifTrue:[^ '4'].
            (c3 == $o) ifTrue:[^ '4'].
            (c3 == $q) ifTrue:[^ '4'].
            (c3 == $r) ifTrue:[^ '4'].
            (c3 == $u) ifTrue:[^ '4'].
            (c3 == $x) ifTrue:[^ '4'].
            ^ '8'
        ].

"/        #(
"/            ('#a#' '0')
"/            ('#e#' '0')
"/            ('#i#' '0')
"/            ('#j#' '0')
"/            ('#y#' '0')
"/            ('#o#' '0')
"/            ('#u#' '0')
"/
"/            ('#ca' '4')
"/            ('#ch' '4')
"/            ('#ck' '4')
"/            ('#cl' '4')
"/            ('#co' '4')
"/            ('#cq' '4')
"/            ('#cr' '4')
"/            ('#cu' '4')
"/            ('#cx' '4')
"/
"/            ('#c#' '8')
"/        ) do:[:pair |
"/            (pair first match:chars) ifTrue:[
"/                ^ pair second
"/            ]
"/        ].
    ].

    ^ self convertRest:chars

    "Modified: / 29-07-2017 / 14:22:20 / cg"
!

convertRest:chars
    chars size == 3 ifFalse:[
        self error:'cannot happen'.
        ^ '?'
    ].

    #(
        "/ used to be matchpattern code,
        "/ but doing these glob-matches is too slow.
        "/ changed to:
        "/      start nil code
        "/      nil end code
        "/      nil char code
        "/
        (nil 'ds'   " '#ds' "    '8')
        (nil 'dc'   " '#dc' "    '8')
        (nil 'dz'   " '#dz' "    '8')
        (nil 'ts'   " '#ts' "    '8')
        (nil 'tc'   " '#tc' "    '8')
        (nil 'tz'   " '#tz' "    '8')
        (nil $d     " '#d#' "    '2')
        (nil $t     " '#t#' "    '2')
        ('cx' nil   " 'cx#' "    '8')
        ('kx' nil   " 'kx#' "    '8')
        ('qx' nil   " 'qx#' "    '8')
        (nil $x     " '#x#' "    '48')
        ('sc' nil   " 'sc#' "    '8')
        ('sz' nil   " 'sz#' "    '8')
        (nil 'ca'   " '#ca' "    '4')
        (nil 'co'   " '#co' "    '4')
        (nil 'cu'   " '#cu' "    '4')
        (nil 'ch'   " '#ch' "    '4')
        (nil 'ck'   " '#ck' "    '4')
        (nil 'cx'   " '#cx' "    '4')
        (nil 'cq'   " '#cq' "    '4')
        (nil $c     " '#c#' "    '8')
        (nil $a     " '#a#' "    '0')
        (nil $e     " '#e#' "    '0')
        (nil $i     " '#i#' "    '0')
        (nil $j     " '#j#' "    '0')
        (nil $y     " '#y#' "    '0')
        (nil $o     " '#o#' "    '0')
        (nil $u     " '#u#' "    '0')
        (nil $h     " '#h#' "    '-')
        (nil $l     " '#l#' "    '5')
        (nil $r     " '#r#' "    '7')
        (nil $m     " '#m#' "    '6')
        (nil $n     " '#n#' "    '6')
        (nil $s     " '#s#' "    '8')
        (nil $z     " '#z#' "    '8')
        (nil $b     " '#b#' "    '1')
        (nil $p     " '#p#' "    '1')
        (nil $f     " '#f#' "    '3')
        (nil $v     " '#v#' "    '3')
        (nil $w     " '#w#' "    '3')
        (nil $g     " '#g#' "    '4')
        (nil $k     " '#k#' "    '4')
        (nil $q     " '#q#' "    '4')
        (nil nil    " '###' "    '?')
    ) do:[:vector |
        |v1 v2|

        (v1 := vector at:1) notNil ifTrue:[
            "/ prefix
            (chars startsWith:v1) ifTrue:[^ (vector at:3) ].
        ] ifFalse:[
            (v2 := vector at:2) isCharacter ifTrue:[
                "/ middle character compare
                (chars at:2) == v2 ifTrue:[^ (vector at:3) ].
            ] ifFalse:[
                v2 isString ifTrue:[
                    "/ suffix
                    (chars endsWith:v2) ifTrue:[^ (vector at:3) ].
                ] ifFalse:[
                    ^ '?'
                ]
            ]
        ].

"/        (vector first match:chars) ifTrue:[
"/            ^ vector second
"/        ]
    ].

    self error:'cannot happen'

    "Modified: / 29-07-2017 / 14:17:38 / cg"
! !

!PhoneticStringUtilities::MiracodeStringComparator class methodsFor:'documentation'!

documentation
"
    Miracode (also called American Soundex) is like Soundex, with the addition that h and w
    are ignored when they separate consonants: two consonants with the same code which are
    only separated by h or w are encoded once.

    These variants are particularly important because they were used by the U.S. National Archives.
    Most archive data were encoded with Miracode, but there are some entries encoded with
    Simplified Soundex.

    The HW-rule was documented as a standard in 1910, but in practice the data of the 1880,
    1900 and 1910 censuses were encoded with mixed methods.
"
! !

!PhoneticStringUtilities::MiracodeStringComparator methodsFor:'api'!

encode:word
    |u p t prevCode|

    u := word asUppercase.
    "/ the first letter is kept as is; its code only suppresses an equal code which follows
    p := u first asString.
    prevCode := self translate:u first.
    u from:2 to:u size do:[:c |
        t := self translate:c.
        (t notNil
          and:[ t ~= '0'
          and:[ t ~= prevCode ]]) ifTrue:[
            p := p , t.
            p size == 4 ifTrue:[^ p ].
        ].
        "/ HW-rule: H and W do not change prevCode,
        "/ so equal codes which are separated by them are encoded only once
        (c ~= $W and:[c ~= $H]) ifTrue:[
            prevCode := t.
        ].
    ].
    "/ pad with zeros to a length of 4
    [ p size < 4 ] whileTrue:[
        p := p , '0'
    ].
    ^ (p copyFrom:1 to:4)
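
    "
     Illustration of the HW-rule (a sketch added for illustration; it assumes that translate:
     returns the usual Soundex digit codes, e.g. '2' for S and C, '6' for R and '1' for F,
     and either '0' or nil for vowels, H and W; translate: is not shown in this section):

        PhoneticStringUtilities::MiracodeStringComparator new encode:'Ashcraft'

     S is encoded as 2; the following H leaves prevCode untouched, so the C (also 2) is
     suppressed, and R and F add 6 and 1. Under these assumptions, the result would be 'A261'.
     Simplified Soundex, where the H resets the previous code, would encode the C again
     and give 'A226' instead.
    "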

    "Created: / 28-07-2017 / 15:23:16 / cg"
! !

!PhoneticStringUtilities class methodsFor:'documentation'!

version
    ^ '$Header$'
!

version_CVS
    ^ '$Header$'
! !