author | Claus Gittinger <cg@exept.de> |
Thu, 09 Jun 2016 17:48:53 +0200 | |
changeset 3928 | d1133788cbba |
parent 3839 | 6874980a5d05 |
child 4133 | eda6f1bfc8d2 |
permissions | -rw-r--r-- |
2197 | 1 |
" |
2 |
COPYRIGHT (c) 1994 by Claus Gittinger |
|
3 |
COPYRIGHT (c) 2009 by eXept Software AG |
|
4 |
All Rights Reserved |
|
5 |
||
6 |
This software is furnished under a license and may be used |
|
7 |
only in accordance with the terms of that license and with the |
|
8 |
inclusion of the above copyright notice. This software may not |
|
9 |
be provided or otherwise made available to, or used by, any |
|
10 |
other person. No title to or ownership of the software is |
|
11 |
hereby transferred. |
|
12 |
" |
|
13 |
"{ Package: 'stx:libbasic2' }" |
|
14 |
|
3488 | 15 |
"{ NameSpace: Smalltalk }" |
16 |
|
2197 | 17 |
Object subclass:#PhoneticStringUtilities |
18 |
instanceVariableNames:'' |
|
19 |
classVariableNames:'' |
|
20 |
poolDictionaries:'' |
|
21 |
category:'Collections-Text-Support' |
|
22 |
! |
|
23 |
||
2208 | 24 |
Object subclass:#PhoneticStringComparator |
25 |
instanceVariableNames:'' |
|
26 |
classVariableNames:'' |
|
27 |
poolDictionaries:'' |
|
28 |
privateIn:PhoneticStringUtilities |
|
29 |
! |
|
30 |
||
2211 | 31 |
PhoneticStringUtilities::PhoneticStringComparator subclass:#ExtendedSoundexStringComparator |
32 |
instanceVariableNames:'' |
|
33 |
classVariableNames:'CharacterTranslationDict' |
|
34 |
poolDictionaries:'' |
|
35 |
privateIn:PhoneticStringUtilities |
|
36 |
! |
|
37 |
||
2208 | 38 |
PhoneticStringUtilities::PhoneticStringComparator subclass:#KoelnerPhoneticCodeStringComparator |
39 |
instanceVariableNames:'' |
|
40 |
classVariableNames:'CharacterTranslationDict' |
|
41 |
poolDictionaries:'' |
|
42 |
privateIn:PhoneticStringUtilities |
|
43 |
! |
|
44 |
||
45 |
PhoneticStringUtilities::PhoneticStringComparator subclass:#SoundexStringComparator |
|
46 |
instanceVariableNames:'' |
|
47 |
classVariableNames:'CharacterTranslationDict' |
|
48 |
poolDictionaries:'' |
|
49 |
privateIn:PhoneticStringUtilities |
|
50 |
! |
|
51 |
||
52 |
PhoneticStringUtilities::SoundexStringComparator subclass:#MySQLSoundexStringComparator |
|
53 |
instanceVariableNames:'' |
|
54 |
classVariableNames:'' |
|
55 |
poolDictionaries:'' |
|
56 |
privateIn:PhoneticStringUtilities |
|
57 |
! |
|
58 |
||
59 |
Object subclass:#NYSIISStringComparator |
|
60 |
instanceVariableNames:'' |
|
61 |
classVariableNames:'' |
|
62 |
poolDictionaries:'' |
|
63 |
privateIn:PhoneticStringUtilities |
|
64 |
! |
|
65 |
||
2211 | 66 |
PhoneticStringUtilities::PhoneticStringComparator subclass:#PhonemStringComparator |
67 |
instanceVariableNames:'' |
|
68 |
classVariableNames:'CharacterTranslationDict' |
|
69 |
poolDictionaries:'' |
|
70 |
privateIn:PhoneticStringUtilities |
|
71 |
! |
|
72 |
||
2208 | 73 |
PhoneticStringUtilities::PhoneticStringComparator subclass:#DoubleMetaphoneStringComparator |
74 |
instanceVariableNames:'inputKey primaryTranslation secondaryTranslation startIndex |
|
75 |
currentIndex skipCount' |
|
76 |
classVariableNames:'' |
|
77 |
poolDictionaries:'' |
|
78 |
privateIn:PhoneticStringUtilities |
|
79 |
! |
|
80 |
||
81 |
PhoneticStringUtilities::SoundexStringComparator subclass:#MiracodeStringComparator |
|
82 |
instanceVariableNames:'' |
|
83 |
classVariableNames:'' |
|
84 |
poolDictionaries:'' |
|
85 |
privateIn:PhoneticStringUtilities |
|
86 |
! |
|
87 |
||
2197 | 88 |
!PhoneticStringUtilities class methodsFor:'documentation'! |
89 |
||
90 |
copyright |
|
91 |
" |
|
92 |
COPYRIGHT (c) 1994 by Claus Gittinger |
|
93 |
COPYRIGHT (c) 2009 by eXept Software AG |
|
94 |
All Rights Reserved |
|
95 |
||
96 |
This software is furnished under a license and may be used |
|
97 |
only in accordance with the terms of that license and with the |
|
98 |
inclusion of the above copyright notice. This software may not |
|
99 |
be provided or otherwise made available to, or used by, any |
|
100 |
other person. No title to or ownership of the software is |
|
101 |
hereby transferred. |
|
102 |
" |
|
103 |
! |
|
104 |
||
105 |
documentation |
|
106 |
" |
|
2445 | 107 |
Utilities which are helpful to perform phonetic string searches or comparisons. |
108 |
These are all variations or improvements of the soundex algorithm, which usually fails |
|
109 |
to provide good results for non-English languages. |
|
2285 | 110 |
|
2208 | 111 |
soundexCode |
112 |
this algorithm was originally contained in the CharacterArray class; |
|
113 |
||
114 |
nysiis |
|
115 |
a modified soundex algorithm |
|
116 |
||
2209 | 117 |
miracode |
118 |
another modified soundex algorithm ('american soundex') used in the 1880 census. |
|
119 |
||
120 |
mySQLSoundex |
|
121 |
another modified soundex algorithm used in mySQL. |
|
122 |
||
2208 | 123 |
koelner phoneticCode |
124 |
provides functionality similar to soundex, but tuned for the German language |
|
125 |
||
126 |
Double metaphone |
|
127 |
works with most european languages. |
|
2211 | 128 |
|
129 |
phonem |
|
130 |
described in Georg Wilde and Carsten Meyer, 'Doppelgaenger gesucht - Ein Programm fuer kontextsensitive phonetische Textumwandlung' |
|
131 |
from 'ct Magazin fuer Computer & Technik 25/1999'. |
|
132 |
||
133 |
More information for German readers can be found in: |
|
134 |
http://www.uni-koeln.de/phil-fak/phonetik/Lehre/MA-Arbeiten/magister_wilz.pdf |
|
135 |
" |
|
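"
 Usage sketch (added for illustration): the class-side entry points defined
 in the 'phonetic codes' category below each answer a single code string.
 Results are not listed here, because they depend on the algorithm chosen.

    PhoneticStringUtilities soundexCodeOf:'Miller'.
    PhoneticStringUtilities mySQLSoundexCodeOf:'Miller'.
    PhoneticStringUtilities koelnerPhoneticCodeOf:'Müller'.
"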
136 |
! |
|
137 |
||
138 |
sampleData |
|
139 |
" |
|
140 |
for some of the 50 most common German names, we get: |
|
141 |
||
142 |
ext. |
|
143 |
name soundex soundex metaphone phonet phonet2 phonix daitsch phonem koeln |
|
144 |
||
145 |
müller M460 54600000 MLR MÜLA NILA M4000000 689000 MYLR 657 |
|
146 |
schmidt S253 25300000 SKMTT SHMIT ZNIT S5300000 463000 CMYD 8628 |
|
147 |
schneider S253 25360000 SKNTR SHNEIDA ZNEITA S5300000 463900 CNAYDR 8627 |
|
148 |
fischer F260 12600000 FSKR FISHA FIZA F8000000 749000 VYCR 387 |
|
149 |
weber W160 16000000 WBR WEBA FEBA $1000000 779000 VBR 317 |
|
150 |
meyer M600 56000000 MYR MEIA NEIA M0000000 619000 MAYR 67 |
|
151 |
wagner W256 25600000 WKNR WAKNA FAKNA $2500000 756900 VACNR 367 |
|
152 |
schulz S242 24200000 SKLS SHULS ZULZ S4800000 484000 CULC 85 |
|
153 |
becker B260 12600000 BKR BEKA BEKA B2000000 759000 BCR 147 |
|
154 |
hoffmann H155 15500000 HFMN HOFMAN UFNAN $7550000 576600 OVMAN 036 |
|
155 |
schäfer S216 21600000 SKFR SHEFA ZEFA S7000000 479000 CVR 837 |
|
2197 | 156 |
" |
157 |
! ! |
|
158 |
||
159 |
!PhoneticStringUtilities class methodsFor:'phonetic codes'! |
|
160 |
||
161 |
koelnerPhoneticCodeOf:aString |
|
162 |
"return a koelner phonetic code. |
|
163 |
The koelnerPhonetic code is for the German language what the soundex code is for English; |
|
164 |
it returns similar strings for similar-sounding words. |
|
165 |
There are some differences to soundex, though: |
|
166 |
its length is not limited to 4, but depends on the length of the original string; |
|
2207 | 167 |
it does not start with the first character of the input. |
168 |
This algorithm is described by Postel 1969" |
|
2197 | 169 |
|
2209 | 170 |
^ (KoelnerPhoneticCodeStringComparator new phoneticStringsFor:aString) first |
2197 | 171 |
|
172 |
" |
|
173 |
#( |
|
174 |
'Müller' |
|
175 |
'Miller' |
|
176 |
'Mueller' |
|
177 |
'Mühler' |
|
178 |
'Mühlherr' |
|
179 |
'Mülherr' |
|
180 |
'Myler' |
|
181 |
'Millar' |
|
182 |
'Myller' |
|
183 |
'Müllar' |
|
184 |
'Müler' |
|
185 |
'Muehler' |
|
186 |
'Mülller' |
|
187 |
'Müllerr' |
|
188 |
'Muehlherr' |
|
189 |
'Muellar' |
|
190 |
'Mueler' |
|
191 |
'Mülleer' |
|
192 |
'Mueller' |
|
193 |
'Nüller' |
|
194 |
'Nyller' |
|
195 |
'Niler' |
|
196 |
'Czerny' |
|
197 |
'Tscherny' |
|
198 |
'Czernie' |
|
199 |
'Tschernie' |
|
200 |
'Schernie' |
|
201 |
'Scherny' |
|
202 |
'Scherno' |
|
203 |
'Czerne' |
|
204 |
'Zerny' |
|
205 |
'Tzernie' |
|
206 |
'Breschnew' |
|
207 |
) do:[:w | |
|
208 |
Transcript show:w; show:'->'; showCR:(PhoneticStringUtilities koelnerPhoneticCodeOf:w) |
|
209 |
]. |
|
210 |
" |
|
211 |
||
212 |
" |
|
2209 | 213 |
PhoneticStringUtilities koelnerPhoneticCodeOf:'Breschnew'. '17863'. |
214 |
PhoneticStringUtilities koelnerPhoneticCodeOf:'Breschneff'. '17863'. |
|
215 |
PhoneticStringUtilities koelnerPhoneticCodeOf:'Braeschneff'. '17863'. |
|
216 |
PhoneticStringUtilities koelnerPhoneticCodeOf:'Braessneff'. '17863'. |
|
217 |
PhoneticStringUtilities koelnerPhoneticCodeOf:'Pressneff'. '17863'. |
|
218 |
PhoneticStringUtilities koelnerPhoneticCodeOf:'Presznäph'. '17863'. |
|
219 |
PhoneticStringUtilities koelnerPhoneticCodeOf:'Preschnjiev'. '17863'. |
|
220 |
" |
|
221 |
! |
|
222 |
||
223 |
mySQLSoundexCodeOf:aString |
|
224 |
"return the mySQL soundex code. The mySQL soundex code differs from the miracode 'american' soundex |
|
225 |
(no 4char limitation; different order of duplicate vowel vs. duplicate code elimination)" |
|
226 |
||
227 |
^ (MySQLSoundexStringComparator new phoneticStringsFor:aString) first |
|
228 |
||
229 |
" |
|
230 |
#( |
|
231 |
'Müller' |
|
232 |
'Miller' |
|
233 |
'Mueller' |
|
234 |
'Mühler' |
|
235 |
'Mühlherr' |
|
236 |
'Mülherr' |
|
237 |
'Myler' |
|
238 |
'Millar' |
|
239 |
'Myller' |
|
240 |
'Müllar' |
|
241 |
'Müler' |
|
242 |
'Muehler' |
|
243 |
'Mülller' |
|
244 |
'Müllerr' |
|
245 |
'Muehlherr' |
|
246 |
'Muellar' |
|
247 |
'Mueler' |
|
248 |
'Mülleer' |
|
249 |
'Mueller' |
|
250 |
'Nüller' |
|
251 |
'Nyller' |
|
252 |
'Niler' |
|
253 |
'Czerny' |
|
254 |
'Tscherny' |
|
255 |
'Czernie' |
|
256 |
'Tschernie' |
|
257 |
'Schernie' |
|
258 |
'Scherny' |
|
259 |
'Scherno' |
|
260 |
'Czerne' |
|
261 |
'Zerny' |
|
262 |
'Tzernie' |
|
263 |
'Breschnew' |
|
264 |
) do:[:w | |
|
265 |
Transcript show:w; show:'->'; showCR:(PhoneticStringUtilities mySQLSoundexCodeOf:w) |
|
266 |
]. |
|
267 |
" |
|
268 |
||
269 |
" |
|
270 |
PhoneticStringUtilities mySQLSoundexCodeOf:'Breschnew'. |
|
271 |
PhoneticStringUtilities mySQLSoundexCodeOf:'Breschneff'. |
|
272 |
PhoneticStringUtilities mySQLSoundexCodeOf:'Braeschneff'. |
|
273 |
PhoneticStringUtilities mySQLSoundexCodeOf:'Braessneff'. |
|
274 |
PhoneticStringUtilities mySQLSoundexCodeOf:'Pressneff'. |
|
275 |
PhoneticStringUtilities mySQLSoundexCodeOf:'Presznäph'. |
|
276 |
PhoneticStringUtilities mySQLSoundexCodeOf:'Preschnjiev'. |
|
2197 | 277 |
" |
278 |
! |
|
279 |
||
280 |
soundexCodeOf:aString |
|
281 |
"return a soundex phonetic code or nil. |
|
2207 | 282 |
Soundex (1918, 1922) returns similar codes for similar sounding words, making it a useful |
2197 | 283 |
tool when searching for words where the correct spelling is unknown. |
284 |
(read Knuth or search the web if you don't know what a soundex code is). |
|
285 |
Caveat: 'similar sounding words' means: 'similar sounding in english'." |
|
286 |
||
2210 | 287 |
^ (SoundexStringComparator new phoneticStringsFor:aString) first |
2197 | 288 |
|
2210 | 289 |
"/ old code - now use code in private class... |
290 |
"/ |inStream codeStream ch last lch codeLength codes code lastCode| |
|
291 |
"/ |
|
292 |
"/ inStream := aString readStream. |
|
293 |
"/ inStream skipSeparators. |
|
294 |
"/ inStream atEnd ifTrue:[ |
|
295 |
"/ ^ nil |
|
296 |
"/ ]. |
|
297 |
"/ |
|
298 |
"/ ch := inStream next. |
|
299 |
"/ ch isLetter ifFalse:[ |
|
300 |
"/ ^ nil |
|
301 |
"/ ]. |
|
302 |
"/ codeLength := 0. |
|
303 |
"/ |
|
304 |
"/ codes := Dictionary new. |
|
305 |
"/ codes atAll:'bpfv' put:$1. |
|
306 |
"/ codes atAll:'cskgjqxz' put:$2. |
|
307 |
"/ codes atAll:'dt' put:$3. |
|
308 |
"/ codes atAll:'l' put:$4. |
|
309 |
"/ codes atAll:'nm' put:$5. |
|
310 |
"/ codes atAll:'r' put:$6. |
|
311 |
"/ |
|
312 |
"/ codeStream := WriteStream on:(String new:4). |
|
313 |
"/ codeStream nextPut:(ch asUppercase). |
|
314 |
"/ last := ch asLowercase. |
|
315 |
"/ lastCode := codes at:last ifAbsent:nil. |
|
316 |
"/ |
|
317 |
"/ [inStream atEnd] whileFalse:[ |
|
318 |
"/ ch := inStream next. |
|
319 |
"/ lch := ch asLowercase. |
|
320 |
"/ lch = last ifFalse:[ |
|
321 |
"/ last := lch. |
|
322 |
"/ |
|
323 |
"/ code := codes at:lch ifAbsent:nil. |
|
324 |
"/ (code notNil and:[ code ~= lastCode]) ifTrue:[ |
|
325 |
"/ codeLength < 3 ifTrue:[ |
|
326 |
"/ codeStream nextPut:code. |
|
327 |
"/ codeLength := codeLength + 1. |
|
328 |
"/ codeLength > 3 ifTrue:[^ codeStream contents]. |
|
329 |
"/ ]. |
|
330 |
"/ ]. |
|
331 |
"/ lastCode := code. |
|
332 |
"/ ] |
|
333 |
"/ ]. |
|
334 |
"/ [ codeLength < 3 ] whileTrue:[ |
|
335 |
"/ codeStream nextPut:$0. |
|
336 |
"/ codeLength := codeLength + 1. |
|
337 |
"/ ]. |
|
338 |
"/ |
|
339 |
"/ ^ codeStream contents |
|
2197 | 340 |
|
341 |
" |
|
342 |
PhoneticStringUtilities soundexCodeOf:'claus' |
|
343 |
PhoneticStringUtilities soundexCodeOf:'clause' |
|
344 |
PhoneticStringUtilities soundexCodeOf:'close' |
|
345 |
PhoneticStringUtilities soundexCodeOf:'smalltalk' |
|
346 |
PhoneticStringUtilities soundexCodeOf:'smaltalk' |
|
347 |
PhoneticStringUtilities soundexCodeOf:'smaltak' |
|
348 |
PhoneticStringUtilities soundexCodeOf:'smaltok' |
|
349 |
PhoneticStringUtilities soundexCodeOf:'smoltok' |
|
350 |
PhoneticStringUtilities soundexCodeOf:'aa' |
|
351 |
PhoneticStringUtilities soundexCodeOf:'by' |
|
352 |
PhoneticStringUtilities soundexCodeOf:'bab' |
|
353 |
PhoneticStringUtilities soundexCodeOf:'bob' |
|
354 |
PhoneticStringUtilities soundexCodeOf:'bop' |
|
355 |
" |
|
356 |
! ! |
|
357 |
||
3648 | 358 |
!PhoneticStringUtilities class methodsFor:'queries'! |
359 |
||
360 |
isUtilityClass |
|
361 |
^ self == PhoneticStringUtilities |
|
362 |
! ! |
|
363 |
||
2208 | 364 |
!PhoneticStringUtilities::PhoneticStringComparator class methodsFor:'constant'! |
365 |
||
366 |
defaultClass |
|
367 |
^SoundexStringComparator |
|
368 |
! ! |
|
369 |
||
3646 | 370 |
!PhoneticStringUtilities::PhoneticStringComparator class methodsFor:'documentation'! |
371 |
||
372 |
documentation |
|
373 |
" |
|
374 |
abstract superclass for various phonetic comparators. |
|
375 |
They return similar strings for similar-sounding words, which can be used |
|
376 |
to find similar sounding words in a search list. |
|
377 |
||
378 |
Note that some comparators work better for particular languages. |
|
379 |
" |
|
380 |
! ! |
|
381 |
||
2208 | 382 |
!PhoneticStringUtilities::PhoneticStringComparator class methodsFor:'instance creation'! |
383 |
||
384 |
new |
|
385 |
^ self basicNew initialize. |
|
386 |
! ! |
|
387 |
||
3646 | 388 |
!PhoneticStringUtilities::PhoneticStringComparator class methodsFor:'queries'! |
389 |
||
390 |
isAbstract |
|
391 |
^ self == PhoneticStringUtilities::PhoneticStringComparator |
|
392 |
! ! |
|
393 |
||
2208 | 394 |
!PhoneticStringUtilities::PhoneticStringComparator methodsFor:'api'! |
395 |
||
396 |
does:aString soundLike:anotherString |
|
397 |
|translations1 translations2| |
|
398 |
||
399 |
translations1 := self phoneticStringsFor:aString. |
|
400 |
translations2 := self phoneticStringsFor:anotherString. |
|
401 |
||
402 |
^ translations1 contains:[:t1 | |
|
403 |
translations2 contains:[:t2 | t1 = t2]] |
|
404 |
||
405 |
" |
|
406 |
PhoneticStringUtilities::SoundexStringComparator new |
|
407 |
does:'miller' soundLike:'miler'. |
|
408 |
PhoneticStringUtilities::SoundexStringComparator new |
|
409 |
does:'miller' soundLike:'milner'. |
|
410 |
" |
|
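"
 A comparator may answer more than one code per word; two words sound alike
 as soon as any pair of their codes is equal. A sketch using the extended
 soundex comparator, which answers two codes per word (the result given
 follows from the codes listed in that class's phoneticStringsFor: comment):

    PhoneticStringUtilities::ExtendedSoundexStringComparator new
        does:'miller' soundLike:'muller'.       -> true
"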
411 |
! |
|
412 |
||
413 |
phoneticStringsFor: aString |
|
414 |
"Should answer an array of alternate phonetic strings for the given input string." |
|
415 |
self subclassResponsibility |
|
416 |
||
417 |
" |
|
418 |
(PhoneticStringUtilities::SoundexStringComparator new |
|
419 |
phoneticStringsFor:'miller') first |
|
420 |
'miller' asSoundexCode |
|
421 |
" |
|
422 |
! ! |
|
423 |
||
424 |
!PhoneticStringUtilities::PhoneticStringComparator methodsFor:'initialization'! |
|
425 |
||
426 |
initialize |
|
427 |
"Invoked when a new instance is created." |
|
428 |
||
429 |
"/ please change as required (and remove this comment) |
|
430 |
||
431 |
"/ super initialize. -- commented since inherited method does nothing |
|
432 |
! ! |
|
433 |
||
2211 | 434 |
!PhoneticStringUtilities::ExtendedSoundexStringComparator class methodsFor:'documentation'! |
435 |
||
436 |
documentation |
|
437 |
" |
|
438 |
There are many extended and enhanced soundex variants around; |
|
439 |
here is one, called 'extended soundex'. It is described for example in |
|
440 |
http://www.epidata.dk/documentation.php. |
|
441 |
Its author and origin are unknown. |
|
442 |
||
443 |
The number of digits is increased to 5 or 8; |
|
444 |
The first character is not used literally; instead it is encoded like the rest. |
|
445 |
This might have a negative effect on names starting with a vowel, though. |
|
446 |
||
447 |
Overall, it is doubtful whether this is really an enhancement. |
|
448 |
" |
|
449 |
! ! |
|
450 |
||
451 |
!PhoneticStringUtilities::ExtendedSoundexStringComparator methodsFor:'api'! |
|
452 |
||
453 |
phoneticStringsFor:aString |
|
454 |
"generates both an extended soundex of length 5 and one of length 8" |
|
455 |
||
456 |
|first second u t prevCode| |
|
457 |
||
458 |
u := aString asUppercase. |
|
459 |
first := second := ''. |
|
460 |
u do:[:c | |
|
461 |
t := self translate:c. |
|
462 |
(t notNil and:[ t ~= '0' and:[ t ~= prevCode ]]) ifTrue:[ |
|
463 |
first := first , t. |
|
464 |
second := second , t. |
|
465 |
second size == 8 ifTrue:[ |
|
466 |
^ Array with:(first copyTo:5) with:second |
|
467 |
]. |
|
468 |
]. |
|
469 |
prevCode := t |
|
470 |
]. |
|
471 |
[ first size < 5 ] whileTrue:[ |
|
472 |
first := first , '0'. |
|
473 |
second := second , '0'. |
|
474 |
]. |
|
475 |
[ second size < 8 ] whileTrue:[ |
|
476 |
second := second , '0' |
|
477 |
]. |
|
478 |
^ Array with:(first copyTo:5) with:second |
|
479 |
||
480 |
" |
|
481 |
self basicNew phoneticStringsFor:'müller' #('87900' '87900000') |
|
482 |
self basicNew phoneticStringsFor:'miller' #('87900' '87900000') |
|
483 |
self basicNew phoneticStringsFor:'muller' #('87900' '87900000') |
|
484 |
self basicNew phoneticStringsFor:'muler' #('87900' '87900000') |
|
485 |
self basicNew phoneticStringsFor:'schmidt' #('38600' '38600000') |
|
486 |
self basicNew phoneticStringsFor:'schneider' #('38690' '38690000') |
|
487 |
self basicNew phoneticStringsFor:'fischer' #('23900' '23900000') |
|
488 |
self basicNew phoneticStringsFor:'weber' #('19000' '19000000') |
|
489 |
self basicNew phoneticStringsFor:'meyer' #('89000' '89000000') |
|
490 |
self basicNew phoneticStringsFor:'wagner' #('48900' '48900000') |
|
491 |
self basicNew phoneticStringsFor:'schulz' #('37500' '37500000') |
|
492 |
self basicNew phoneticStringsFor:'becker' #('13900' '13900000') |
|
493 |
self basicNew phoneticStringsFor:'hoffmann' #('28800' '28800000') |
|
494 |
self basicNew phoneticStringsFor:'schäfer' #('32900' '32900000') |
|
495 |
" |
|
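"
 How a single word is coded, as a sketch (translate: is the private helper
 defined below; asArray is used so that the nil answered for H fits into
 the result):

    'SCHMIDT' asArray collect:[:c |
        PhoneticStringUtilities::ExtendedSoundexStringComparator new translate:c].
        -> '3' '3' nil '8' '0' '6' '6'

 Dropping nil, the vowel code '0' and every code equal to its predecessor
 leaves '386', which is then padded with zeros to '38600' and '38600000'.
"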
496 |
! ! |
|
497 |
||
498 |
!PhoneticStringUtilities::ExtendedSoundexStringComparator methodsFor:'private'! |
|
499 |
||
500 |
translate:aCharacter |
|
501 |
"use simple if's for more speed when compiled" |
|
502 |
||
503 |
"vowels serve as separators" |
|
504 |
aCharacter == $A ifTrue:[^ '0' ]. |
|
505 |
aCharacter == $E ifTrue:[^ '0' ]. |
|
506 |
aCharacter == $I ifTrue:[^ '0' ]. |
|
507 |
aCharacter == $O ifTrue:[^ '0' ]. |
|
508 |
aCharacter == $U ifTrue:[^ '0' ]. |
|
509 |
aCharacter == $Y ifTrue:[^ '0' ]. |
|
510 |
||
511 |
aCharacter == $B ifTrue:[^ '1' ]. |
|
512 |
aCharacter == $P ifTrue:[^ '1' ]. |
|
513 |
||
514 |
aCharacter == $F ifTrue:[^ '2' ]. |
|
515 |
aCharacter == $V ifTrue:[^ '2' ]. |
|
516 |
||
517 |
aCharacter == $C ifTrue:[^ '3' ]. |
|
518 |
aCharacter == $S ifTrue:[^ '3' ]. |
|
519 |
aCharacter == $K ifTrue:[^ '3' ]. |
|
520 |
||
521 |
aCharacter == $G ifTrue:[^ '4' ]. |
|
522 |
aCharacter == $J ifTrue:[^ '4' ]. |
|
523 |
||
524 |
aCharacter == $Q ifTrue:[^ '5' ]. |
|
525 |
aCharacter == $X ifTrue:[^ '5' ]. |
|
526 |
aCharacter == $Z ifTrue:[^ '5' ]. |
|
527 |
||
528 |
aCharacter == $D ifTrue:[^ '6' ]. |
|
529 |
aCharacter == $G ifTrue:[^ '6' ]. "/ never reached: $G already answers '4' above |
|
530 |
aCharacter == $T ifTrue:[^ '6' ]. |
|
531 |
||
532 |
aCharacter == $L ifTrue:[^ '7' ]. |
|
533 |
||
534 |
aCharacter == $M ifTrue:[^ '8' ]. |
|
535 |
aCharacter == $N ifTrue:[^ '8' ]. |
|
536 |
||
537 |
aCharacter == $R ifTrue:[^ '9' ]. |
|
538 |
^ nil |
|
539 |
! ! |
|
540 |
||
2208 | 541 |
!PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator class methodsFor:'documentation'! |
542 |
||
543 |
documentation |
|
544 |
" |
|
545 |
The koelnerPhonetic code is for the German language what the soundex code is for English. |
|
3646 | 546 |
It returns similar strings for similar sounding words. |
2208 | 547 |
|
548 |
There are some differences to soundex, though: |
|
549 |
its length is not limited to 4, but depends on the length of the original string; |
|
550 |
it does not start with the first character of the input. |
|
551 |
||
552 |
This algorithm was described by Postel 1969 |
|
553 |
" |
|
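"
 How the input is scanned, as a sketch: the lowercased word is framed by '#'
 and every window of three characters is translated via convertFirst: (first
 window) or convertRest: (all others); a '#' in a pattern matches any
 character. The windows for 'meyer' can be listed with:

    |in|

    in := '#' , 'meyer' , '#'.
    (1 to:in size - 2) collect:[:i | in copyFrom:i to:i + 2].
        -> '#me' 'mey' 'eye' 'yer' 'er#'

 These translate to 6 0 0 0 7; dropping the zeros (only a leading zero would
 be kept) and collapsing repeated digits gives '67', which is the code listed
 for 'meyer' in the phoneticStringsFor: comment.
"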
554 |
! ! |
|
555 |
||
556 |
!PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator methodsFor:'api'! |
|
557 |
||
558 |
phoneticStringsFor: aString |
|
559 |
"return a koelner phonetic code. |
|
560 |
The koelnerPhonetic code is for the German language what the soundex code is for English; |
|
561 |
it returns similar strings for similar-sounding words. |
|
562 |
There are some differences to soundex, though: |
|
563 |
its length is not limited to 4, but depends on the length of the original string; |
|
564 |
it does not start with the first character of the input. |
|
565 |
This algorithm is described by Postel 1969" |
|
566 |
||
567 |
|in ret val rslt| |
|
568 |
||
569 |
in := aString withoutSeparators asLowercase. |
|
570 |
in := in copyReplaceString:'ph' withString:'f'. |
|
571 |
in := in copyReplaceAll:$ü withAll:'u'. |
|
572 |
in := in copyReplaceAll:$ä withAll:'a'. |
|
573 |
in := in copyReplaceAll:$ö withAll:'o'. |
|
574 |
in := in copyReplaceAll:$ß withAll:'ss'. |
|
575 |
in := '#',in,'#'. |
|
576 |
||
577 |
ret := ''. |
|
578 |
1 to:in size-2 do:[:i | |
|
579 |
|sub| |
|
580 |
||
581 |
sub := in copyFrom:i to:i+2. |
|
582 |
val := (i==1) |
|
583 |
ifTrue:[ self convertFirst:sub ] |
|
584 |
ifFalse:[ self convertRest:sub ]. |
|
585 |
ret := ret,val |
|
586 |
]. |
|
587 |
||
588 |
ret := ret select:[:ch | ch ~= $-]. |
|
589 |
||
590 |
(ret startsWith:'0') ifTrue:[ |
|
591 |
ret := '0',(ret select:[:ch | ch ~= $0]). |
|
592 |
] ifFalse:[ |
|
593 |
ret := ret select:[:ch | ch ~= $0]. |
|
594 |
]. |
|
595 |
||
596 |
rslt := String streamContents:[:s | |
|
597 |
|prev| |
|
598 |
||
599 |
ret do:[:ch | |
|
600 |
ch ~= prev ifTrue:[ |
|
601 |
s nextPut:ch |
|
602 |
]. |
|
603 |
prev := ch. |
|
604 |
]. |
|
605 |
]. |
|
606 |
^ Array with:rslt. |
|
607 |
||
608 |
" |
|
609 |
#( |
|
610 |
'Müller' |
|
611 |
'Miller' |
|
612 |
'Mueller' |
|
613 |
'Mühler' |
|
614 |
'Mühlherr' |
|
615 |
'Mülherr' |
|
616 |
'Myler' |
|
617 |
'Millar' |
|
618 |
'Myller' |
|
619 |
'Müllar' |
|
620 |
'Müler' |
|
621 |
'Muehler' |
|
622 |
'Mülller' |
|
623 |
'Müllerr' |
|
624 |
'Muehlherr' |
|
625 |
'Muellar' |
|
626 |
'Mueler' |
|
627 |
'Mülleer' |
|
628 |
'Mueller' |
|
629 |
'Nüller' |
|
630 |
'Nyller' |
|
631 |
'Niler' |
|
632 |
'Czerny' |
|
633 |
'Tscherny' |
|
634 |
'Czernie' |
|
635 |
'Tschernie' |
|
636 |
'Schernie' |
|
637 |
'Scherny' |
|
638 |
'Scherno' |
|
639 |
'Czerne' |
|
640 |
'Zerny' |
|
641 |
'Tzernie' |
|
642 |
'Breschnew' |
|
643 |
) do:[:w | |
|
644 |
Transcript show:w; show:'->'; showCR:(PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new phoneticStringsFor:w) first |
|
645 |
]. |
|
646 |
" |
|
647 |
||
648 |
" |
|
649 |
PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new phoneticStringsFor:'Breschnew' -> '17863' |
|
650 |
PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new phoneticStringsFor:'Breschneff' -> '17863' |
|
651 |
PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new phoneticStringsFor:'Braeschneff' -> '17863' |
|
652 |
PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new phoneticStringsFor:'Braessneff' -> '17863' |
|
653 |
PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new phoneticStringsFor:'Pressneff' -> '17863' |
|
654 |
PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new phoneticStringsFor:'Presznäph' -> '17863' |
|
655 |
PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator new phoneticStringsFor:'Präschnäf' -> '17863' |
|
656 |
" |
|
2211 | 657 |
" |
658 |
self basicNew phoneticStringsFor:'müller' #('657') |
|
659 |
self basicNew phoneticStringsFor:'möller' #('657') |
|
660 |
self basicNew phoneticStringsFor:'miller' #('657') |
|
661 |
self basicNew phoneticStringsFor:'muller' #('657') |
|
662 |
self basicNew phoneticStringsFor:'muler' #('657') |
|
663 |
self basicNew phoneticStringsFor:'schmidt' #('862') |
|
664 |
self basicNew phoneticStringsFor:'schneider' #('8627') |
|
665 |
self basicNew phoneticStringsFor:'fischer' #('387') |
|
666 |
self basicNew phoneticStringsFor:'weber' #('317') |
|
667 |
self basicNew phoneticStringsFor:'meyer' #('67') |
|
668 |
self basicNew phoneticStringsFor:'wagner' #('3467') |
|
669 |
self basicNew phoneticStringsFor:'schulz' #('858') |
|
670 |
self basicNew phoneticStringsFor:'becker' #('147') |
|
671 |
self basicNew phoneticStringsFor:'hoffmann' #('036') |
|
672 |
self basicNew phoneticStringsFor:'schäfer' #('837') |
|
673 |
" |
|
2208 | 674 |
! ! |
675 |
||
676 |
!PhoneticStringUtilities::KoelnerPhoneticCodeStringComparator methodsFor:'private'! |
|
677 |
||
678 |
convertFirst:chars |
|
679 |
#( |
|
680 |
('#a#' '0') |
|
681 |
('#e#' '0') |
|
682 |
('#i#' '0') |
|
683 |
('#j#' '0') |
|
684 |
('#y#' '0') |
|
685 |
('#o#' '0') |
|
686 |
('#u#' '0') |
|
687 |
||
688 |
('#ca' '4') |
|
689 |
('#ch' '4') |
|
690 |
('#ck' '4') |
|
691 |
('#cl' '4') |
|
692 |
('#co' '4') |
|
693 |
('#cq' '4') |
|
694 |
('#cr' '4') |
|
695 |
('#cu' '4') |
|
696 |
('#cx' '4') |
|
697 |
||
698 |
('#c#' '8') |
|
699 |
) do:[:pair | |
|
700 |
(pair first match:chars) ifTrue:[ |
|
701 |
^ pair second |
|
702 |
] |
|
703 |
]. |
|
704 |
||
705 |
^ self convertRest:chars |
|
706 |
! |
|
707 |
||
708 |
convertRest:chars |
|
709 |
#( |
|
710 |
('#ds' '8') |
|
711 |
('#dc' '8') |
|
712 |
('#dz' '8') |
|
713 |
('#ts' '8') |
|
714 |
('#tc' '8') |
|
715 |
('#tz' '8') |
|
716 |
('#d#' '2') |
|
717 |
('#t#' '2') |
|
718 |
('cx#' '8') |
|
719 |
('kx#' '8') |
|
720 |
('qx#' '8') |
|
721 |
('#x#' '48') |
|
722 |
('sc#' '8') |
|
723 |
('sz#' '8') |
|
724 |
('#ca' '4') |
|
725 |
('#co' '4') |
|
726 |
('#cu' '4') |
|
727 |
('#ch' '4') |
|
728 |
('#ck' '4') |
|
729 |
('#cx' '4') |
|
730 |
('#cq' '4') |
|
731 |
('#c#' '8') |
|
732 |
('#a#' '0') |
|
733 |
('#e#' '0') |
|
734 |
('#i#' '0') |
|
735 |
('#j#' '0') |
|
736 |
('#y#' '0') |
|
737 |
('#o#' '0') |
|
738 |
('#u#' '0') |
|
739 |
('#h#' '-') |
|
740 |
('#l#' '5') |
|
741 |
('#r#' '7') |
|
742 |
('#m#' '6') |
|
743 |
('#n#' '6') |
|
744 |
('#s#' '8') |
|
745 |
('#z#' '8') |
|
746 |
('#b#' '1') |
|
747 |
('#p#' '1') |
|
748 |
('#f#' '3') |
|
749 |
('#v#' '3') |
|
750 |
('#w#' '3') |
|
751 |
('#g#' '4') |
|
752 |
('#k#' '4') |
|
753 |
('#q#' '4') |
|
754 |
('###' '?') |
|
755 |
) do:[:pair | |
|
756 |
(pair first match:chars) ifTrue:[ |
|
757 |
^ pair second |
|
758 |
] |
|
759 |
]. |
|
760 |
||
761 |
self error:'cannot happen' |
|
762 |
! ! |
|
763 |
||
764 |
!PhoneticStringUtilities::SoundexStringComparator class methodsFor:'documentation'! |
|
765 |
||
766 |
documentation |
|
767 |
" |
|
3185 | 768 |
WARNING: this is the so-called 'simplified soundex' algorithm; |
769 |
there are more variants like miracode (american soundex) or mysqlSoundex around. |
|
770 |
Be sure to use the correct algorithm if the generated strings must be compatible |
|
771 |
(otherwise, the differences are probably too small to notice, but |
|
772 |
your search results will differ) |
|
773 |
||
774 |
The following was copied from http://www.civilsolutions.com.au/publications/dedup.htm |
|
775 |
||
776 |
SOUNDEX is a phonetic coding algorithm that ignores many of the unreliable |
|
777 |
components of names, but by doing so reports more matches. |
|
778 |
||
779 |
There are some variations around in the literature; |
|
780 |
the following is called 'simplified soundex', and the rules for coding a name are: |
|
781 |
||
782 |
1. The first letter of the name is used in its un-coded form to serve as the prefix |
|
783 |
character of the code. (The rest of the code is numerical). |
|
784 |
||
785 |
2. Thereafter, W and H are ignored entirely. |
|
786 |
||
787 |
3. A, E, I, O, U, Y are not assigned a code number, but do serve as 'separators' (see Step 5). |
|
788 |
||
789 |
4. Other letters of the name are converted to a numerical equivalent: |
|
790 |
B, P, F, V 1 |
|
791 |
C, G, J, K, Q, S, X, Z 2 |
|
792 |
D, T 3 |
|
793 |
L 4 |
|
794 |
M, N 5 |
|
795 |
R 6 |
|
796 |
||
797 |
5. There are two exceptions: |
|
798 |
1. Letters that follow prefix letters which would, if coded, have the same |
|
799 |
numerical code, are ignored in all cases unless a ''separator'' (see Step 3) precedes them. |
|
800 |
||
801 |
2. The second letter of any pair of consonants having the same code number is likewise ignored, |
|
802 |
i.e. unless there is a ''separator'' between them in the name. |
|
803 |
||
804 |
6. The final SOUNDEX code consists of the prefix letter plus three numerical characters. |
|
805 |
Longer codes are truncated to this length, and shorter codes are extended to it by adding zeros. |
|
806 |
||
807 |
Note that in another variant, w and h are treated slightly differently. |
|
808 |
This is only of relevance, if you need to reconstruct original soundex codes of other programs |
9833bbba2050
class: PhoneticStringUtilities
Claus Gittinger <cg@exept.de>
parents:
2580
diff
changeset
|
809 |
or for the original 1880 us census data. |
3646 | 810 |
|
811 |
Also notice, that soundex deals better with english. |
|
812 |
For german and other languages, other algorithms may provide better results. |
|
2208 | 813 |
" |
814 |
! ! |
|

!PhoneticStringUtilities::SoundexStringComparator methodsFor:'api'!

phoneticStringsFor:aString
    |u p t prevCode|

    u := aString asUppercase.
    p := u first asString.
    prevCode := self translate:u first.
    u from:2 to:u size do:[:c |
        t := self translate:c.
        (t notNil and:[ t ~= '0' and:[ t ~= prevCode ]]) ifTrue:[
            p := p , t.
            p size == 4 ifTrue:[^ Array with:p ].
        ].
        prevCode := t
    ].
    [ p size < 4 ] whileTrue:[
        p := p , '0'
    ].
    ^ Array with:(p copyFrom:1 to:4)
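
    "
     Usage sketch (illustrative only; not part of the original source).
     The results shown were traced by hand through the method above.
     It is an assumption that instances are created with #new, as in the
     example comments of the sibling comparator classes.

     self new phoneticStringsFor:'Robert'       #('R163')
     self new phoneticStringsFor:'Rupert'       #('R163')
     self new phoneticStringsFor:'Hamburger'    #('H516')
     self new phoneticStringsFor:'Lee'          #('L000')
    "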
|
! !

!PhoneticStringUtilities::SoundexStringComparator methodsFor:'private'!

translate:aCharacter
    "use simple if's for more speed when compiled"

    "vowels serve as separators"
    aCharacter == $A ifTrue:[^ '0' ].
    aCharacter == $E ifTrue:[^ '0' ].
    aCharacter == $I ifTrue:[^ '0' ].
    aCharacter == $O ifTrue:[^ '0' ].
    aCharacter == $U ifTrue:[^ '0' ].
    aCharacter == $Y ifTrue:[^ '0' ].

    aCharacter == $B ifTrue:[^ '1' ].
    aCharacter == $P ifTrue:[^ '1' ].
    aCharacter == $F ifTrue:[^ '1' ].
    aCharacter == $V ifTrue:[^ '1' ].

    aCharacter == $C ifTrue:[^ '2' ].
    aCharacter == $S ifTrue:[^ '2' ].
    aCharacter == $K ifTrue:[^ '2' ].
    aCharacter == $G ifTrue:[^ '2' ].
    aCharacter == $J ifTrue:[^ '2' ].
    aCharacter == $Q ifTrue:[^ '2' ].
    aCharacter == $X ifTrue:[^ '2' ].
    aCharacter == $Z ifTrue:[^ '2' ].

    aCharacter == $D ifTrue:[^ '3' ].
    aCharacter == $T ifTrue:[^ '3' ].

    aCharacter == $L ifTrue:[^ '4' ].

    aCharacter == $M ifTrue:[^ '5' ].
    aCharacter == $N ifTrue:[^ '5' ].

    aCharacter == $R ifTrue:[^ '6' ].
    ^ nil
! !

!PhoneticStringUtilities::MySQLSoundexStringComparator class methodsFor:'documentation'!

documentation
"
    MySQL soundex is like American Soundex (i.e. miracode) without the 4 character limitation.
    In addition, it removes vowels first and then removes duplicate codes
    (whereas the soundex code does this in reverse order).

    These variations are important if you need the same soundex codes to be generated.
"
! !

!PhoneticStringUtilities::MySQLSoundexStringComparator methodsFor:'api'!

phoneticStringsFor:aString
    |u p t prevCode|

    u := aString asUppercase.
    p := u first asString.
    prevCode := self translate:u first.
    u from:2 to:u size do:[:c |
        t := self translate:c.
        (t notNil and:[ t ~= '0' and:[ t ~= prevCode ]]) ifTrue:[
            p := p , t.
        ].
        (t ~= '0' and:[ c ~= $W and:[c ~= $H]]) ifTrue:[
            prevCode := t.
        ].
    ].
    [ p size < 4 ] whileTrue:[
        p := p , '0'
    ].
    ^ Array with:p
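
    "
     Usage sketch (illustrative only; not part of the original source).
     The results shown were traced by hand through the method above.
     It is an assumption that instances are created with #new.

     no truncation to 4 characters:
     self new phoneticStringsFor:'Robert'       #('R163')
     self new phoneticStringsFor:'Hamburger'    #('H51626')

     a vowel does not separate equal codes here, so the second B is dropped:
     self new phoneticStringsFor:'Bob'          #('B000')
     self new phoneticStringsFor:'Lloyd'        #('L300')
    "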
|
! !

!PhoneticStringUtilities::NYSIISStringComparator class methodsFor:'documentation'!

documentation
"
    NYSIIS Algorithm:

    1.
        remove all ''S'' and ''Z'' chars from the end of the surname

    2.
        transcode initial strings
            MAC => MC
            PF => F

    3.
        transcode trailing strings as follows:

            IX => IC
            EX => EC
            YE,EE,IE => Y
            NT,ND => D

    4.
        transcode ''EV'' to ''EF'' if not at start of name

    5.
        use first character of name as first character of key

    6.
        remove any ''W'' that follows a vowel

    7.
        replace all vowels with ''A''

    8.
        transcode ''GHT'' to ''GT''

    9.
        transcode ''DG'' to ''G''

    10.
        transcode ''PH'' to ''F''

    11.
        if not first character, eliminate all ''H'' preceded or followed by a vowel

    12.
        change ''KN'' to ''N'', else ''K'' to ''C''

    13.
        if not first character, change ''M'' to ''N''

    14.
        if not first character, change ''Q'' to ''G''

    15.
        transcode ''SH'' to ''S''

    16.
        transcode ''SCH'' to ''S''

    17.
        transcode ''YW'' to ''Y''

    18.
        if not first or last character, change ''Y'' to ''A''

    19.
        transcode ''WR'' to ''R''

    20.
        if not first character, change ''Z'' to ''S''

    21.
        transcode terminal ''AY'' to ''Y''

    22.
        remove trailing vowels

    23.
        collapse all strings of repeated characters

    24.
        if first char of original surname was a vowel, append it to the code
"
! !

!PhoneticStringUtilities::NYSIISStringComparator methodsFor:'api'!

phoneticStringsFor:aString
    |k|

    k := self rule1:(aString asUppercase).
    k := self rule2:k.
    k := self rule3:k.
    k := self rule4:k.
    k := self rule5:k.
    k := self rule6:k.
    k := self rule7:k.
    k := self rule8:k.
    k := self rule9:k.
    k := self rule10:k.
    k := self rule11:k.
    k := self rule12:k.
    k := self rule13:k.
    k := self rule14:k.
    k := self rule15:k.
    k := self rule16:k.
    k := self rule17:k.
    k := self rule18:k.
    k := self rule19:k.
    k := self rule20:k.
    k := self rule21:k.
    k := self rule22:k.
    k := self rule23:k.
    k := self rule24:k originalKey:aString.
    ^ Array with:k

    "
     self new phoneticStringsFor:'hello'
     self new phoneticStringsFor:'bliss'
    "
! !

!PhoneticStringUtilities::NYSIISStringComparator methodsFor:'private'!

rule10:key
    "10. transcode 'PH' to 'F' "

    ^ self
        transcodeAll:'PH'
        of:key
        to:'F'
        startingAt:1
!

rule11:key
    |k c|

    "11. if not first character, eliminate all 'H' preceded or followed by a vowel "
    k := key copy.
    c := SortedCollection sortBlock:[:a :b | b < a ].
    2 to:key size do:[:i |
        (key at:i) = $H ifTrue:[
            ((key at:i - 1) isVowel
                or:[ (i < key size) and:[ (key at:i + 1) isVowel ] ]) ifTrue:[ c add:i ]
        ]
    ].
    c do:[:n |
        k := (k copyFrom:1 to:n - 1) , (k copyFrom:n + 1 to:k size)
    ].
    ^ k
!

rule12:key
    |k|

    "12. change 'KN' to 'N', else 'K' to 'C' "
    k := self
        transcodeAll:'KN'
        of:key
        to:'N'
        startingAt:1.
    k := self
        transcodeAll:'K'
        of:k
        to:'C'
        startingAt:1.
    ^ k
!

rule13:key
    "13. if not first character, change 'M' to 'N' "

    ^ self
        transcodeAll:'M'
        of:key
        to:'N'
        startingAt:2
!

rule14:key
    "14. if not first character, change 'Q' to 'G' "

    ^ self
        transcodeAll:'Q'
        of:key
        to:'G'
        startingAt:2
!

rule15:key
    "15. transcode 'SH' to 'S' "

    ^ self
        transcodeAll:'SH'
        of:key
        to:'S'
        startingAt:1
!

rule16:key
    "16. transcode 'SCH' to 'S' "

    ^ self
        transcodeAll:'SCH'
        of:key
        to:'S'
        startingAt:1
!

rule17:key
    "17. transcode 'YW' to 'Y' "

    ^ self
        transcodeAll:'YW'
        of:key
        to:'Y'
        startingAt:1
!

rule18:key
    |k|

    "18. if not first or last character, change 'Y' to 'A' "
    k := self
        transcodeAll:'Y'
        of:key
        to:'A'
        startingAt:2.
    key last = $Y ifTrue:[
        k at:k size put:$Y
    ].
    ^ k
!

rule19:key
    "19. transcode 'WR' to 'R' "

    ^ self
        transcodeAll:'WR'
        of:key
        to:'R'
        startingAt:1
!
|
1157 |
||
1158 |
rule1:key |
|
1159 |
|k| |
|
1160 |
||
1161 |
k := key copy. |
|
1162 |
"1. Remove all 'S' and 'Z' chars from the end of the name" |
|
1163 |
[ |
|
3839 | 1164 |
'SZ' includes:k last |
2208 | 1165 |
] whileTrue:[ k := k copyFrom:1 to:(k size - 1) ]. |
1166 |
^ k |
|
1167 |
! |
|

rule20:key
    "20. if not first character, change 'Z' to 'S' "

    ^ self
        transcodeAll:'Z'
        of:key
        to:'S'
        startingAt:2
!

rule21:key
    "21. transcode terminal 'AY' to 'Y' "

    ^ self
        transcodeAll:'AY'
        of:key
        to:'Y'
        startingAt:key size - 1
!
||
1189 |
rule22:key |
|
1190 |
|k| |
|
1191 |
||
1192 |
"22. remove trailing vowels " |
|
1193 |
k := key copy. |
|
1194 |
[ k last isVowel ] whileTrue:[ |
|
1195 |
k := k copyFrom:1 to:k size - 1 |
|
1196 |
]. |
|
1197 |
^ k |
|
1198 |
! |
|
1199 |
||
1200 |
rule23:key |
|
1201 |
|k c| |
|
1202 |
||
1203 |
"23. collapse all strings of repeated characters " |
|
1204 |
k := key copy. |
|
1205 |
c := SortedCollection sortBlock:[:a :b | b < a ]. |
|
1206 |
k size to:2 do:[:i | |
|
1207 |
(k at:i) = (k at:i - 1) ifTrue:[ |
|
1208 |
c add:i |
|
1209 |
] |
|
1210 |
]. |
|
1211 |
c do:[:n | |
|
1212 |
k := (k copyFrom:1 to:n - 1) , (k copyFrom:n + 1 to:k size) |
|
1213 |
]. |
|
1214 |
^ k |
|
1215 |
! |
|
1216 |
||
1217 |
rule24:key originalKey:originalKey |
|
1218 |
|k| |
|
1219 |
||
1220 |
"24. if first char of original surname was a vowel, append it to the code" |
|
1221 |
k := key copy. |
|
1222 |
originalKey first isVowel ifTrue:[ |
|
1223 |
k := k , originalKey first asString asUppercase |
|
1224 |
]. |
|
1225 |
^ k |
|
1226 |
! |
|
1227 |
||
1228 |
rule2:key |
|
1229 |
|k| |
|
1230 |
||
1231 |
k := key copy. |
|
1232 |
"2. Transcode initial strings: MAC => MC PF => F" |
|
1233 |
(k copyFrom:1 to:3) = 'MAC' ifTrue:[ |
|
1234 |
k := 'MC' , (k copyFrom:4 to:k size) |
|
1235 |
]. |
|
1236 |
(k copyFrom:1 to:2) = 'PF' ifTrue:[ |
|
1237 |
k := 'F' , (k copyFrom:3 to:k size) |
|
1238 |
]. |
|
1239 |
^ k |
|
1240 |
! |
|
1241 |
||
1242 |
rule3:key |
|
1243 |
|k| |
|
1244 |
||
1245 |
"3. Transcode trailing strings as follows: |
|
1246 |
IX => IC |
|
1247 |
EX => EC |
|
1248 |
YE, EE, IE => Y |
|
1249 |
NT, ND => D" |
|
1250 |
k := key copy. |
|
1251 |
k := self |
|
1252 |
transcodeTrailing:#( 'IX' ) |
|
1253 |
of:k |
|
1254 |
to:'IC'. |
|
1255 |
k := self |
|
1256 |
transcodeTrailing:#( 'EX' ) |
|
1257 |
of:k |
|
1258 |
to:'EC'. |
|
1259 |
k := self |
|
1260 |
transcodeTrailing:#( 'YE' 'EE' 'IE' ) |
|
1261 |
of:k |
|
1262 |
to:'Y'. |
|
1263 |
k := self |
|
1264 |
transcodeTrailing:#( 'NT' 'ND' ) |
|
1265 |
of:k |
|
1266 |
to:'D'. |
|
1267 |
^ k |
|
1268 |
! |
|
1269 |
||
1270 |
rule4:key |
|
1271 |
"4. Transcode 'EV' to 'EF' if not at start of name" |
|
1272 |
||
1273 |
^ self |
|
1274 |
transcodeAll:'EV' |
|
1275 |
of:key |
|
1276 |
to:'EF' |
|
1277 |
startingAt:2 |
|
1278 |
! |
|
1279 |
||
1280 |
rule5:key |
|
1281 |
"5. Use first character of name as first character of key. Ignored because we're doing an in-place conversion" |
|
1282 |
||
1283 |
^ key |
|
1284 |
! |
|
1285 |
||
1286 |
rule6:key |
|
1287 |
|k i| |
|
1288 |
||
1289 |
"6. Remove any 'W' that follows a vowel" |
|
1290 |
k := key copy. |
|
1291 |
i := 2. |
|
1292 |
[ |
|
1293 |
(i := k indexOf:$W startingAt:i) > 0 |
|
1294 |
] whileTrue:[ |
|
1295 |
(k at:i - 1) isVowel ifTrue:[ |
|
1296 |
k := (k copyFrom:1 to:i - 1) , (k copyFrom:i + 1 to:k size). |
|
1297 |
i := i - 1 |
|
1298 |
] |
|
1299 |
]. |
|
1300 |
^ k |
|
1301 |
! |
|
1302 |
||
1303 |
rule7:key |
|
1304 |
|k| |
|
1305 |
||
1306 |
"7. replace all vowels with 'A' " |
|
1307 |
k := key copy. |
|
1308 |
1 to:key size do:[:i | |
|
1309 |
(key at:i) isVowel ifTrue:[ |
|
1310 |
k at:i put:$A |
|
1311 |
] |
|
1312 |
]. |
|
1313 |
^ k |
|
1314 |
! |
|
1315 |
||
1316 |
rule8:key |
|
1317 |
"8. transcode 'GHT' to 'GT' " |
|
1318 |
||
1319 |
^ self |
|
1320 |
transcodeAll:'GHT' |
|
1321 |
of:key |
|
1322 |
to:'GT' |
|
1323 |
startingAt:1 |
|
1324 |
! |
|
1325 |
||
1326 |
rule9:key |
|
1327 |
"9. transcode 'DG' to 'G' " |
|
1328 |
||
1329 |
^ self |
|
1330 |
transcodeAll:'DG' |
|
1331 |
of:key |
|
1332 |
to:'G' |
|
1333 |
startingAt:1 |
|
1334 |
! |
|
1335 |
||
1336 |
transcodeAll:aString of:key to:replacementString startingAt:start |
|
1337 |
|k i| |
|
1338 |
||
1339 |
k := key copy. |
|
1340 |
[ |
|
1341 |
(i := k indexOfSubCollection:aString startingAt:start) > 0 |
|
1342 |
] whileTrue:[ |
|
1343 |
k := (k copyFrom:1 to:i - 1) , replacementString |
|
1344 |
, (k copyFrom:i + aString size to:k size) |
|
1345 |
]. |
|
1346 |
^ k |
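
    "
     Usage sketch (illustrative only; not part of the original source).
     The results shown were traced by hand through the method above.
     Every occurrence found at or after the start index is replaced;
     the search restarts from the same index after each replacement.

     self new transcodeAll:'PH' of:'PHILIPH' to:'F' startingAt:1     'FILIF'
     self new transcodeAll:'M' of:'MAMMA' to:'N' startingAt:2        'MANNA'
    "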
|
!

transcodeTrailing:anArrayOfStrings of:key to:replacementString
    |answer|

    answer := key copy.
    anArrayOfStrings do:[:aString |
        answer := self
                    transcodeAll:aString
                    of:answer
                    to:replacementString
                    startingAt:(answer size - aString size) + 1
    ].
    ^ answer
! !

!PhoneticStringUtilities::PhonemStringComparator class methodsFor:'documentation'!

documentation
"
    Implementation of the PHONEM algorithm, as described in
    'Georg Wilde and Carsten Meyer, Doppelgaenger gesucht -
    Ein Programm fuer kontextsensitive phonetische Textumwandlung
    c't Magazin fuer Computer & Technik 25/1998'

    This algorithm deals better with the German language (it takes umlauts into account).
"
! !

!PhoneticStringUtilities::PhonemStringComparator methodsFor:'api'!

phoneticStringsFor:aString
    |s idx t t2|

    s := aString asUppercase.

    idx := 1.
    [idx < (s size-1)] whileTrue:[
        t2 := nil.
        t := s copyFrom:idx to:idx+1.
        t = 'SC' ifTrue:[ t2 := 'C' ]
        ifFalse:[ t = 'SZ' ifTrue:[ t2 := 'C' ]
        ifFalse:[ t = 'CZ' ifTrue:[ t2 := 'C' ]
        ifFalse:[ t = 'TZ' ifTrue:[ t2 := 'C' ]
        ifFalse:[ t = 'TS' ifTrue:[ t2 := 'C' ]
        ifFalse:[ t = 'KS' ifTrue:[ t2 := 'X' ]
        ifFalse:[ t = 'PF' ifTrue:[ t2 := 'V' ]
        ifFalse:[ t = 'QU' ifTrue:[ t2 := 'KW' ]
        ifFalse:[ t = 'PH' ifTrue:[ t2 := 'V' ]
        ifFalse:[ t = 'UE' ifTrue:[ t2 := 'Y' ]
        ifFalse:[ t = 'AE' ifTrue:[ t2 := 'E' ]
        ifFalse:[ t = 'OE' ifTrue:[ t2 := 'Ö' ]
        ifFalse:[ t = 'EI' ifTrue:[ t2 := 'AY' ]
        ifFalse:[ t = 'EY' ifTrue:[ t2 := 'AY' ]
        ifFalse:[ t = 'EU' ifTrue:[ t2 := 'OY' ]
        ifFalse:[ t = 'AU' ifTrue:[ t2 := 'A§' ]
        ifFalse:[ t = 'OU' ifTrue:[ t2 := '§ ' ]]]]]]]]]]]]]]]]].
        t2 notNil ifTrue:[
            s := (s copyTo:idx-1),t2,(s copyFrom:idx+2)
        ] ifFalse:[
            idx := idx + 1.
        ].
    ].

    "/ single character substitutions via tr
    s := s copyTransliterating:'ÖÄZKGQÜIJFWPT§' to:'YECCCCYYYVVDDUA'.
    s := s copyTransliterating:'ABCDLMNORSUVWXY' to:'' complement:true squashDuplicates:false.
    s := s copyTransliterating:'ABCDLMNORSUVWXY' to:'ABCDLMNORSUVWXY' complement:false squashDuplicates:true.
    ^ Array with:s

    "
     self basicNew phoneticStringsFor:'müller'      #('MYLR')
     self basicNew phoneticStringsFor:'mueller'     #('MYLR')
     self basicNew phoneticStringsFor:'möller'      #('MYLR')
     self basicNew phoneticStringsFor:'miller'      #('MYLR')
     self basicNew phoneticStringsFor:'muller'      #('MULR')
     self basicNew phoneticStringsFor:'muler'       #('MULR')
     self basicNew phoneticStringsFor:'schmidt'     #('CMYD')
     self basicNew phoneticStringsFor:'schneider'   #('CNAYDR')
     self basicNew phoneticStringsFor:'fischer'     #('VYCR')
     self basicNew phoneticStringsFor:'weber'       #('VBR')
     self basicNew phoneticStringsFor:'meyer'       #('MAYR')
     self basicNew phoneticStringsFor:'wagner'      #('VACNR')
     self basicNew phoneticStringsFor:'schulz'      #('CULC')
     self basicNew phoneticStringsFor:'becker'      #('BCR')
     self basicNew phoneticStringsFor:'hoffmann'    #('OVMAN')
     self basicNew phoneticStringsFor:'schäfer'     #('CVR')
     self basicNew phoneticStringsFor:'scheffer'    #('CVR')
     self basicNew phoneticStringsFor:'schaeffer'   #('CVR')
     self basicNew phoneticStringsFor:'schaefer'    #('CVR')
    "
! !

!PhoneticStringUtilities::DoubleMetaphoneStringComparator class methodsFor:'LICENSE'!

copyright
"
    Copyright (c) 2002-2004 Robert Jarvis

    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation
    files (the 'Software'), to deal in the Software without restriction, including without limitation the rights to use,
    copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom
    the Software is furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all copies or substantial
    portions of the Software.

    THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
    INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
    IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
    WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE
    USE OR OTHER DEALINGS IN THE SOFTWARE.'
"
! !

!PhoneticStringUtilities::DoubleMetaphoneStringComparator class methodsFor:'classification'!

isSlavoGermanic:aString
    ^ #('w' 'k' 'cz' 'witz') contains:[:sub | aString includesString:sub]

    "
     self isSlavoGermanic:'walter'
    "
! !

!PhoneticStringUtilities::DoubleMetaphoneStringComparator class methodsFor:'documentation'!

documentation
"
    The Double Metaphone algorithm (devised by Lawrence Philips).
    For each input string it answers two codes: a primary and a secondary (alternate) translation.
    See the published descriptions of the algorithm for the encoding rules.
"
! !
1480 |
||
!PhoneticStringUtilities::DoubleMetaphoneStringComparator methodsFor:'accessing'!

currentIndex
    ^currentIndex
!

currentIndex: anInteger
    currentIndex := anInteger
!

inputKey
    ^inputKey
!

inputKey: aString
    inputKey := aString asUppercase
!

primaryTranslation
    ^primaryTranslation
!

primaryTranslation: anObject
    primaryTranslation := anObject
!

secondaryTranslation
    ^secondaryTranslation
!

secondaryTranslation: anObject
    secondaryTranslation := anObject
!

skipCount
    ^skipCount
!

skipCount: anInteger
    skipCount := anInteger
!

startIndex
    ^startIndex
!

startIndex: anObject
    startIndex := anObject
! !

1531 |
!PhoneticStringUtilities::DoubleMetaphoneStringComparator methodsFor:'api'! |
|
1532 |
||
1533 |
phoneticStringsFor: aString |
|
1534 |
"Private - Answers an array of alternate phonetic strings for the given input string." |
|
1535 |
||
1536 |
self inputKey: aString. |
|
1537 |
self performInitialProcessing. |
|
1538 |
self processRemainingCharacters. |
|
1539 |
||
2209 | 1540 |
^ Array with: primaryTranslation with: secondaryTranslation |
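
"
 Usage sketch, not part of the original source. It assumes that #new answers
 an initialized instance (instance creation is not shown in this section).
 As far as this section shows, the translations and the start index are
 accumulated in instance variables and never reset, so a fresh instance
 per word is the safe choice:

    |codes|
    codes := PhoneticStringUtilities::DoubleMetaphoneStringComparator new
                phoneticStringsFor:'Schmidt'.
    codes first.
    codes last.

 codes first is the primary code, codes last the secondary (alternate) code.
"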
2208 | 1541 |
! ! |
1542 |
||
!PhoneticStringUtilities::DoubleMetaphoneStringComparator methodsFor:'initialization'!

initialize
    super initialize.

    self
        startIndex: 1;
        primaryTranslation: '';
        secondaryTranslation: '';
        skipCount: 0;
        currentIndex: 1
! !

!PhoneticStringUtilities::DoubleMetaphoneStringComparator methodsFor:'private'!

addPrimaryTranslation: aString
    self primaryTranslation: self primaryTranslation, aString
!

addSecondaryTranslation: aString
    self secondaryTranslation: self secondaryTranslation, aString
!

decrementSkipCount
    self skipCount: self skipCount - 1
!

incrementSkipCount
    self incrementSkipCount: 1
!

incrementSkipCount: anInteger
    self skipCount: self skipCount + anInteger
!

incrementStartIndex
    self startIndex: self startIndex + 1
!

1582 |
isSlavoGermanic: aString |
|
1583 |
^((aString includesAnyOf: 'WK') or: |
|
1584 |
[ (aString indexOfSubCollection: 'CZ' startingAt: 1) >= 1 ]) or: |
|
1585 |
[ (aString indexOfSubCollection: 'WITZ' startingAt: 1) >= 1 ] |
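
"
 Example sketch, not part of the original source. It assumes that #new answers
 an instance and that #includesAnyOf: answers true if any of the given
 characters occurs in the receiver. The argument is expected in uppercase,
 as stored by #inputKey:.

    PhoneticStringUtilities::DoubleMetaphoneStringComparator new isSlavoGermanic:'WALTER'
    PhoneticStringUtilities::DoubleMetaphoneStringComparator new isSlavoGermanic:'SMITH'

 The first should answer true (the key contains a W); the second should answer
 false (no W or K, and neither CZ nor WITZ occurs).
"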
|
1586 |
! |
|
1587 |
||
1588 |
keyAt: anInteger |
|
1589 |
(anInteger >=1 and: [ anInteger <= self inputKey size ]) |
|
1590 |
ifTrue: [ ^self inputKey at: anInteger ] |
|
1591 |
ifFalse: [ ^$ ] |
|
1592 |
! |
|
1593 |
||
1594 |
keyLeftString: lengthInteger |
|
1595 |
^self keyMidString: lengthInteger from: 1 |
|
1596 |
! |
|
1597 |
||
1598 |
keyMidString: lengthInteger from: fromInteger |
|
1599 |
| result from len additionalSpaces | |
|
1600 |
||
1601 |
result := ''. |
|
1602 |
from := fromInteger. |
|
1603 |
len := lengthInteger. |
|
1604 |
||
1605 |
"Prepend spaces if caller is requesting characters from before the start of the string" |
|
1606 |
||
1607 |
[ from < 1 ] whileTrue: |
|
1608 |
[ result := result, ' '. |
|
1609 |
from := from + 1. |
|
1610 |
len := len - 1 ]. |
|
1611 |
||
1612 |
from + len - 1 > self inputKey size |
|
1613 |
ifTrue: |
|
1614 |
[ additionalSpaces := from + len - 1 - self inputKey size. |
|
1615 |
len := self inputKey size - from + 1 ] |
|
1616 |
ifFalse: [ additionalSpaces := 0 ]. |
|
1617 |
||
1618 |
result := result, (self inputKey copyFrom: from to: (from+len-1 min: self inputKey size)). |
|
1619 |
||
1620 |
[ additionalSpaces > 0 ] whileTrue: |
|
1621 |
[ result := result, ' '. |
|
1622 |
additionalSpaces := additionalSpaces - 1 ]. |
|
1623 |
||
1624 |
^result |
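
"
 Example sketch, not part of the original source. It assumes that #new answers
 an instance. Requests that reach before the start or past the end of the key
 are padded with spaces, so callers can compare against fixed-length patterns
 without bounds checks:

    |c|
    c := PhoneticStringUtilities::DoubleMetaphoneStringComparator new.
    c inputKey:'Smith'.
    c keyMidString:3 from:0.
    c keyMidString:3 from:4.

 The key is stored as 'SMITH', so the first request answers ' SM'
 (one leading space) and the second answers 'TH ' (one trailing space).
"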
|
1625 |
! |
|
1626 |
||
1627 |
keyRightString: lengthInteger |
|
1628 |
^self keyMidString: lengthInteger from: self inputKey size - lengthInteger + 1 |
|
1629 |
! |
|
1630 |
||
performInitialProcessing
    "Skip the silent first letter of GN, KN, PN, WR and PS.
     keyLeftString: pads with spaces, so keys shorter than two characters are safe."

    (#('GN' 'KN' 'PN' 'WR' 'PS') includes: (self keyLeftString: 2))
        ifTrue: [ self incrementStartIndex ].

    (self keyAt: 1) = $X
        ifTrue:
            [ self
                addPrimaryTranslation: 'S';
                addSecondaryTranslation: 'S'.
              self incrementStartIndex ].

    (self keyAt: 1) isVowel
        ifTrue:
            [ self
                addPrimaryTranslation: 'A';
                addSecondaryTranslation: 'A'.
              self incrementStartIndex ]
!

1650 |
processB |
|
2213 | 1651 |
self |
1652 |
addPrimaryTranslation: 'P'; |
|
1653 |
addSecondaryTranslation: 'P'. |
|
1654 |
(self keyAt: (currentIndex + 1)) = $B |
|
1655 |
ifTrue: [ self incrementSkipCount ]. |
|
2208 | 1656 |
! |
1657 |
||
1658 |
processC |
|
2213 | 1659 |
"i" |
1660 |
((((currentIndex >= 3 |
|
1661 |
and: [ (self keyAt: currentIndex-2) isVowel not ]) |
|
1662 |
and: [ (self keyMidString: 3 from: currentIndex-1) = 'ACH' ]) |
|
1663 |
and: [ (self keyAt: currentIndex+2) ~= $I ]) |
|
1664 |
and: [ ((self keyAt: currentIndex+2) ~= $E)
or: [ #('BACHER' 'MACHER') includes: (self keyMidString: 6 from: currentIndex-2) ] ])
|
1667 |
ifTrue: |
|
1668 |
[ self addPrimaryTranslation: 'K'. |
|
1669 |
self addSecondaryTranslation: 'K'. |
|
1670 |
self incrementSkipCount: 2. |
|
1671 |
^self ]. |
|
1672 |
||
1673 |
"ii" |
|
1674 |
(self inputKey beginsWith: 'CAESAR') |
|
1675 |
ifTrue: |
|
1676 |
[ self addPrimaryTranslation: 'S'. |
|
1677 |
self addSecondaryTranslation: 'S'. |
|
1678 |
self incrementSkipCount: 1. |
|
1679 |
^self ]. |
|
1680 |
||
1681 |
"iii" |
|
1682 |
(self keyMidString: 4 from: currentIndex) = 'CHIA' |
|
1683 |
ifTrue: |
|
1684 |
[ self addPrimaryTranslation: 'K'. |
|
1685 |
self addSecondaryTranslation: 'K'. |
|
1686 |
self incrementSkipCount: 1. |
|
1687 |
^self ]. |
|
1688 |
||
1689 |
"iv" |
|
1690 |
(self keyMidString: 2 from: currentIndex) = 'CH' |
|
1691 |
ifTrue: |
|
1692 |
[ (currentIndex > 1 "a" |
|
1693 |
and: [ (self keyMidString: 4 from: currentIndex) = 'CHAE' ]) |
|
1694 |
ifTrue: [ self |
|
1695 |
addPrimaryTranslation: 'K'; |
|
1696 |
addSecondaryTranslation: 'X'; |
|
1697 |
incrementSkipCount: 1. |
|
1698 |
^self ]. |
|
1699 |
||
1700 |
(currentIndex = 1 "b" |
|
1701 |
and: [ (self inputKey size > 5 and: [(self inputKey copyFrom: 1 to: 6) = 'CHARAC' |
|
1702 |
or: [ (self inputKey copyFrom: 1 to: 6) = 'CHARIS' ]] ) |
|
1703 |
or: [self inputKey size > 4 and: [ ((((self inputKey copyFrom: 1 to: 4) = 'CHOR' |
|
1704 |
or: [ (self inputKey copyFrom: 1 to: 4) = 'CHYM' ]) |
|
1705 |
or: [ (self inputKey copyFrom: 1 to: 4) = 'CHIA' ]) |
|
1706 |
or: [ (self inputKey copyFrom: 1 to: 4) = 'CHEM' ]) |
|
1707 |
and: [ (self inputKey copyFrom: 1 to: 5) ~= 'CHORE' ] ] ] ])
|
1708 |
ifTrue: [ self |
|
1709 |
addPrimaryTranslation: 'K'; |
|
1710 |
addSecondaryTranslation: 'K'; |
|
1711 |
incrementSkipCount: 1. |
|
1712 |
^self ]. |
|
1713 |
||
1714 |
(((((#('VAN ' 'VON ') includes: (self keyLeftString: 4)) "c"
or: [ (self keyLeftString: 3) = 'SCH' ])
|
1716 |
or: [ #('ORCHES' 'ARCHIT' 'ORCHID') |
|
1717 |
includes: (self keyMidString: 6 from: currentIndex-2) ]) |
|
1718 |
or: [ #($T $S) includes: (self keyAt: currentIndex+2) ]) |
|
1719 |
or: [ ((currentIndex = 1) |
|
1720 |
or: [ #($A $O $U $E) includes: (self keyAt: currentIndex-1) ]) |
|
1721 |
and: [ #($L $R $N $M $B $H $F $V $W $ ) includes: (self keyAt: currentIndex+2) ] ] ) |
|
1722 |
ifTrue: |
|
1723 |
[ self |
|
1724 |
addPrimaryTranslation: 'K'; |
|
1725 |
addSecondaryTranslation: 'K'; |
|
1726 |
incrementSkipCount: 1. |
|
1727 |
^self ] |
|
1728 |
ifFalse: |
|
1729 |
[ currentIndex > 1 |
|
1730 |
ifTrue: |
|
1731 |
[ (self inputKey copyFrom: 1 to: 2) = 'MC' |
|
1732 |
ifTrue: |
|
1733 |
[ self |
|
1734 |
addPrimaryTranslation: 'K'; |
|
1735 |
addSecondaryTranslation: 'K' ] |
|
1736 |
ifFalse: |
|
1737 |
[ self |
|
1738 |
addPrimaryTranslation: 'X'; |
|
1739 |
addSecondaryTranslation: 'K' ] ] |
|
1740 |
ifFalse: |
|
1741 |
[ self |
|
1742 |
addPrimaryTranslation: 'X'; |
|
1743 |
addSecondaryTranslation: 'X' ]. |
|
1744 |
self incrementSkipCount: 1. |
|
1745 |
^self ] ]. |
|
1746 |
||
1747 |
"v" |
|
1748 |
(self keyAt: currentIndex+1) = $Z |
|
1749 |
ifTrue: |
|
1750 |
[ self |
|
1751 |
addPrimaryTranslation: 'S'; |
|
1752 |
addSecondaryTranslation: 'X'; |
|
1753 |
incrementSkipCount: 1. |
|
1754 |
^self ]. |
|
1755 |
||
1756 |
"vi" |
|
1757 |
(self keyMidString: 3 from: currentIndex+1) = 'CIA' |
|
1758 |
ifTrue: |
|
1759 |
[ self |
|
1760 |
addPrimaryTranslation: 'X'; |
|
1761 |
addSecondaryTranslation: 'X'; |
|
1762 |
incrementSkipCount: 2. |
|
1763 |
^self ]. |
|
1764 |
||
1765 |
"vii" |
|
1766 |
((self keyAt: currentIndex+1) = $C |
|
1767 |
and: [ ((currentIndex = 2) |
|
1768 |
and: [ (self keyAt: 1) = $M ]) not ]) |
|
1769 |
ifTrue: |
|
1770 |
[ ((#($I $E $H) includes: (self keyAt: currentIndex+2)) |
|
1771 |
and: [ (self keyMidString: 2 from: currentIndex+2) ~= 'HU' ]) |
|
1772 |
ifTrue: |
|
1773 |
[ ((currentIndex = 2 and: [ (self keyAt: 1) = $A ]) |
|
1774 |
or: [ #('UCCEE' 'UCCES') includes: (self keyMidString: 5 from: currentIndex-1)]) |
|
1775 |
ifTrue: |
|
1776 |
[self |
|
1777 |
addPrimaryTranslation: 'KS'; |
|
1778 |
addSecondaryTranslation: 'KS'; |
|
1779 |
incrementSkipCount: 2. |
|
1780 |
^self ] |
|
1781 |
ifFalse: |
|
1782 |
[self |
|
1783 |
addPrimaryTranslation: 'X'; |
|
1784 |
addSecondaryTranslation: 'X'; |
|
1785 |
incrementSkipCount: 2. |
|
1786 |
^self ] ] |
|
1787 |
ifFalse: |
|
1788 |
[ self |
|
1789 |
addPrimaryTranslation: 'K'; |
|
1790 |
addSecondaryTranslation: 'K'; |
|
1791 |
incrementSkipCount: 1. "skip only the second C; the C++ reference advances by 2 here"
|
1792 |
^self ] ]. |
|
1793 |
||
1794 |
"viii" |
|
1795 |
(#($K $G $Q) includes: (self keyAt: currentIndex+1)) |
|
1796 |
ifTrue: |
|
1797 |
[ self |
|
1798 |
addPrimaryTranslation: 'K'; |
|
1799 |
addSecondaryTranslation: 'K'; |
|
1800 |
incrementSkipCount: 1. |
|
1801 |
^self ]. |
|
1802 |
||
1803 |
"ix" |
|
1804 |
(#($I $E $Y) includes: (self keyAt: currentIndex+1)) |
|
1805 |
ifTrue: |
|
1806 |
[ (#('CIO' 'CIE' 'CIA') includes: (self keyMidString: 3 from: currentIndex)) |
|
1807 |
ifTrue: |
|
1808 |
[self |
|
1809 |
addPrimaryTranslation: 'S'; |
|
1810 |
addSecondaryTranslation: 'X' ] |
|
1811 |
ifFalse: |
|
1812 |
[self |
|
1813 |
addPrimaryTranslation: 'S'; |
|
1814 |
addSecondaryTranslation: 'S']. |
|
1815 |
self incrementSkipCount: 1. |
|
1816 |
^self ]. |
|
1817 |
||
1818 |
"x" |
|
1819 |
self |
|
1820 |
addPrimaryTranslation: 'K'; |
|
1821 |
addSecondaryTranslation: 'K'. |
|
1822 |
||
1823 |
"xi" |
|
1824 |
(#(' C' ' Q' ' G') includes: (self keyMidString: 2 from: currentIndex+1)) |
|
1825 |
ifTrue: |
|
1826 |
[ self incrementSkipCount: 2 ] |
|
1827 |
ifFalse: |
|
1828 |
[ ((#($C $K $Q) includes: (self keyAt: currentIndex+1)) |
|
1829 |
and: [ (#('CE' 'CI') includes: (self keyMidString: 2 from: currentIndex+1)) not ]) |
|
1830 |
ifTrue: [ self incrementSkipCount: 1] ] |
|
2208 | 1831 |
! |
1832 |
||
1833 |
processCedille |
|
1834 |
self |
|
1835 |
addPrimaryTranslation: 'S'; |
|
1836 |
addSecondaryTranslation: 'S' |
|
1837 |
! |
|
1838 |
||
1839 |
processD |
|
2213 | 1840 |
"i" |
1841 |
(self keyAt: currentIndex+1) = $G |
|
1842 |
ifTrue: |
|
1843 |
[ (#($I $E $Y) includes: (self keyAt: currentIndex+2)) |
|
1844 |
ifTrue: |
|
1845 |
[ self |
|
1846 |
addPrimaryTranslation: 'J'; |
|
1847 |
addSecondaryTranslation: 'J'; |
|
1848 |
incrementSkipCount: 2. |
|
1849 |
^self ] |
|
1850 |
ifFalse: |
|
1851 |
[ self |
|
1852 |
addPrimaryTranslation: 'TK'; |
|
1853 |
addSecondaryTranslation: 'TK'; |
|
1854 |
incrementSkipCount: 1. |
|
1855 |
^self ] ]. |
|
1856 |
||
1857 |
"ii" |
|
1858 |
(#($T $D) includes: (self keyAt: currentIndex+1)) |
|
1859 |
ifTrue: |
|
1860 |
[ self |
|
1861 |
addPrimaryTranslation: 'T'; |
|
1862 |
addSecondaryTranslation: 'T'; |
|
1863 |
incrementSkipCount: 1. |
|
1864 |
^self ]. |
|
1865 |
||
1866 |
"iii" |
|
1867 |
self |
|
1868 |
addPrimaryTranslation: 'T'; |
|
1869 |
addSecondaryTranslation: 'T' |
|
2208 | 1870 |
! |
1871 |
||
1872 |
processF |
|
1873 |
self |
|
1874 |
addPrimaryTranslation: 'F'; |
|
1875 |
addSecondaryTranslation: 'F'. |
|
1876 |
(self keyAt: self currentIndex+1) = $F |
|
1877 |
ifTrue: [ self incrementSkipCount: 1 ] |
|
1878 |
! |
|
1879 |
||
1880 |
processG |
|
1881 |
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp |
|
1882 |
case 'G': |
|
1883 |
if(GetAt(current + 1) == 'H') |
|
1884 |
{" |
|
1885 |
| word | |
|
2213 | 1886 |
(self keyAt: currentIndex + 1) = $H |
2208 | 1887 |
ifTrue: [ |
1888 |
"if((current > 0) AND !!IsVowel(current - 1))" |
|
1889 |
||
2213 | 1890 |
(currentIndex > 1 and: [(self keyAt: currentIndex - 1) isVowel not]) |
2208 | 1891 |
ifTrue: [ |
1892 |
" { |
|
1893 |
MetaphAdd(K); |
|
1894 |
current += 2; |
|
1895 |
break; |
|
1896 |
}" |
|
1897 |
||
1898 |
self addPrimaryTranslation: 'K'; |
|
1899 |
addSecondaryTranslation: 'K'. |
|
1900 |
^self incrementSkipCount: 1 |
|
1901 |
]. |
|
1902 |
||
1903 |
"if(current < 3) |
|
1904 |
{" |
|
1905 |
||
1906 |
currentIndex < 4 |
|
1907 |
ifTrue: [ |
|
1908 |
||
1909 |
" //'ghislane', ghiradelli |
|
1910 |
if(current == 0) |
|
1911 |
{ " |
|
1912 |
currentIndex = 1 |
|
1913 |
ifTrue: [ |
|
1914 |
"if(GetAt(current + 2) == 'I')" |
|
1915 |
||
2213 | 1916 |
(self keyAt: currentIndex + 2) = $I |
2208 | 1917 |
ifTrue: [ |
1918 |
"MetaphAdd(J);" |
|
1919 |
self addPrimaryTranslation: 'J'; |
|
1920 |
addSecondaryTranslation: 'J'. |
|
1921 |
] ifFalse: [ |
|
1922 |
"MetaphAdd(K);" |
|
1923 |
self addPrimaryTranslation: 'K'; |
|
1924 |
addSecondaryTranslation: 'K'. |
|
1925 |
]. |
|
1926 |
" current += 2; |
|
1927 |
break;" |
|
1928 |
^self incrementSkipCount: 1 |
|
1929 |
] |
|
1930 |
]. |
|
1931 |
||
1932 |
" //Parker's rule (with some further refinements) - e.g., 'hugh' |
|
1933 |
if(((current > 1) AND StringAt((current - 2), 1, B, H, D, ) ) |
|
1934 |
//e.g., 'bough' |
|
1935 |
OR ((current > 2) AND StringAt((current - 3), 1, B, H, D, ) ) |
|
1936 |
//e.g., 'broughton' |
|
1937 |
OR ((current > 3) AND StringAt((current - 4), 1, B, H, ) ) ) |
|
1938 |
" |
|
2213 | 1939 |
(((currentIndex > 2 and: [#($B $H $D) includes: (self keyAt: currentIndex - 2)]) |
1940 |
or: [currentIndex > 3 and: [#($B $H $D) includes: (self keyAt: currentIndex - 3)]]) |
|
1941 |
or: [currentIndex > 4 and: [#($B $H) includes: (self keyAt: currentIndex - 4)]]) |
|
2208 | 1942 |
ifTrue: [ |
1943 |
"current += 2; |
|
1944 |
break;" |
|
1945 |
^self incrementSkipCount: 1 |
|
1946 |
] ifFalse: [ |
|
1947 |
" //e.g., 'laugh', 'McLaughlin', 'cough', 'gough', 'rough', 'tough' |
|
1948 |
if((current > 2) |
|
1949 |
AND (GetAt(current - 1) == 'U') |
|
1950 |
AND StringAt((current - 3), 1, C, G, L, R, T, ) )" |
|
1951 |
(currentIndex > 3 and: [ |
|
2213 | 1952 |
((self keyAt: currentIndex - 1) = $U) and: [ |
1953 |
#($C $G $L $R $T) includes: (self keyAt: currentIndex - 3) |
|
2208 | 1954 |
] |
1955 |
]) ifTrue: [ |
|
1956 |
"MetaphAdd(F);" |
|
1957 |
self addPrimaryTranslation: 'F'; |
|
1958 |
addSecondaryTranslation: 'F'. |
|
1959 |
] ifFalse: [ |
|
1960 |
" if((current > 0) AND GetAt(current - 1) !!= 'I') |
|
1961 |
MetaphAdd(K);" |
|
2213 | 1962 |
(currentIndex > 1 and: [(self keyAt: currentIndex - 1) ~= $I]) |
2208 | 1963 |
ifTrue: [ |
1964 |
self addPrimaryTranslation: 'K'; |
|
1965 |
addSecondaryTranslation: 'K'. |
|
1966 |
]. |
|
1967 |
]. |
|
1968 |
^self incrementSkipCount: 1 |
|
1969 |
]. |
|
1970 |
]. |
|
1971 |
"if(GetAt(current + 1) == 'N')" |
|
2213 | 1972 |
(self keyAt: currentIndex + 1) = $N |
2208 | 1973 |
ifTrue: [ |
1974 |
"if((current == 1) AND IsVowel(0) AND !!SlavoGermanic())" |
|
1975 |
(currentIndex = 2 and: [(self inputKey at: 1) isVowel and: [(self isSlavoGermanic: self inputKey) not]]) |
|
1976 |
ifTrue: [ |
|
1977 |
"MetaphAdd(KN, N);" |
|
1978 |
self addPrimaryTranslation: 'KN'; |
|
1979 |
addSecondaryTranslation: 'N'. |
|
1980 |
] ifFalse: [ |
|
1981 |
" //not e.g. 'cagney' |
|
1982 |
if(!!StringAt((current + 2), 2, EY, ) |
|
1983 |
AND (GetAt(current + 1) !!= 'Y') |
|
1984 |
AND !!SlavoGermanic())" |
|
2213 | 1985 |
((self keyMidString: 2 from: currentIndex + 2) ~= 'EY' and: [
(self keyAt: currentIndex + 1) ~= $Y and: [
(self isSlavoGermanic: self inputKey) not
]
]) ifTrue: [
|
1992 |
self addPrimaryTranslation: 'N'; |
|
1993 |
addSecondaryTranslation: 'KN'. |
|
1994 |
] ifFalse: [ |
|
1995 |
self addPrimaryTranslation: 'KN'; |
|
1996 |
addSecondaryTranslation: 'KN'. |
|
1997 |
]. |
|
1998 |
]. |
|
1999 |
^self incrementSkipCount: 1 |
|
2000 |
]. |
|
2001 |
" //'tagliaro' |
|
2002 |
if(StringAt((current + 1), 2, LI, ) AND !!SlavoGermanic())" |
|
2213 | 2003 |
((self keyMidString: 2 from: currentIndex + 1) = 'LI' and: [
(self isSlavoGermanic: self inputKey) not])
2006 |
ifTrue: [ |
|
2007 |
self addPrimaryTranslation: 'KL'; |
|
2008 |
addSecondaryTranslation: 'L'. |
|
2009 |
^self incrementSkipCount: 1. |
|
2010 |
]. |
|
2011 |
" //-ges-,-gep-,-gel-, -gie- at beginning |
|
2012 |
if((current == 0) |
|
2013 |
AND ((GetAt(current + 1) == 'Y') |
|
2014 |
OR StringAt((current + 1), 2, ES, EP, EB, EL, EY, IB, IL, IN, IE, EI, ER, )) )" |
|
2213 | 2015 |
(currentIndex = 1 and: [ |
2016 |
((self keyAt: currentIndex + 1) = $Y) or: [ |
|
2208 | 2017 |
(#('ES' 'EP' 'EB' 'EL' 'EY' 'IB' 'IL' 'IN' 'IE' 'EI' 'ER') includes: |
2213 | 2018 |
(self keyMidString: 2 from: currentIndex + 1))
2208 | 2019 |
]]) ifTrue: [ |
2020 |
self addPrimaryTranslation: 'K'; |
|
2021 |
addSecondaryTranslation: 'J'. |
|
2022 |
^self incrementSkipCount: 1. |
|
2023 |
]. |
|
2024 |
" // -ger-, -gy- |
|
2025 |
if((StringAt((current + 1), 2, ER, ) OR (GetAt(current + 1) == 'Y')) |
|
2026 |
AND !!StringAt(0, 6, DANGER, RANGER, MANGER, ) |
|
2027 |
AND !!StringAt((current - 1), 1, E, I, ) |
|
2028 |
AND !!StringAt((current - 1), 3, RGY, OGY, ) ) |
|
2029 |
" |
|
2213 | 2030 |
(((self keyMidString: 2 from: currentIndex + 1) = 'ER' or: [
((self keyAt: currentIndex + 1) = $Y)])
and: [((#('DANGER' 'RANGER' 'MANGER') includes: (word := self inputKey copyFrom: 1 to: (6 min: self inputKey size))) not)
and: [(#($E $I) includes: (self keyAt: currentIndex - 1)) not
and: [(#('RGY' 'OGY') includes: (self inputKey copyFrom: currentIndex - 1 to: currentIndex + 1)) not]]])
|
2208 | 2035 |
ifTrue: [ |
2036 |
self addPrimaryTranslation: 'K'; |
|
2037 |
addSecondaryTranslation: 'J'. |
|
2038 |
^self incrementSkipCount: 1. |
|
2039 |
]. |
|
2040 |
||
2041 |
" // italian e.g, 'biaggi' |
|
2042 |
if(StringAt((current + 1), 1, E, I, Y, ) OR StringAt((current - 1), 4, AGGI, OGGI, )) |
|
2043 |
" |
|
2213 | 2044 |
((#($E $I $Y) includes: (self keyAt: (currentIndex + 1))) or: [(#('AGGI' 'OGGI') includes: (self keyMidString: 4 from: currentIndex - 1))])
2208 | 2045 |
ifTrue: [ |
2046 |
" //obvious germanic |
|
2047 |
if((StringAt(0, 4, VAN , VON , ) OR StringAt(0, 3, SCH, )) |
|
2048 |
OR StringAt((current + 1), 2, ET, )) MetaphAdd(K);" |
|
2049 |
word := self keyLeftString: 4.
((#('VAN ' 'VON ') includes: word) or: [(word copyFrom: 1 to: 3) = 'SCH' or: [(self keyMidString: 2 from: currentIndex + 1) = 'ET']])
|
2051 |
ifTrue: [ |
|
2052 |
self addPrimaryTranslation: 'K'; |
|
2053 |
addSecondaryTranslation: 'K'. |
|
2054 |
] ifFalse: [ |
|
2055 |
" //always soft if french ending |
|
2056 |
if(StringAt((current + 1), 4, IER , )) |
|
2057 |
MetaphAdd(J); |
|
2058 |
else |
|
2059 |
MetaphAdd(J, K); |
|
2060 |
current += 2; |
|
2061 |
break;" |
|
2213 | 2062 |
(self keyMidString: 4 from: currentIndex + 1) = 'IER '
2208 | 2063 |
ifTrue: [ |
2064 |
self addPrimaryTranslation: 'J'; |
|
2065 |
addSecondaryTranslation: 'J'. |
|
2066 |
] ifFalse: [ |
|
2067 |
self addPrimaryTranslation: 'J'; |
|
2068 |
addSecondaryTranslation: 'K'. |
|
2069 |
]. |
|
2070 |
||
2071 |
]. |
|
2072 |
^self incrementSkipCount: 1. |
|
2073 |
]. |
|
2074 |
||
2075 |
" if(GetAt(current + 1) == 'G') |
|
2076 |
current += 2; |
|
2077 |
else |
|
2078 |
current += 1; |
|
2079 |
MetaphAdd(K); |
|
2080 |
break;" |
|
2081 |
||
2213 | 2082 |
(self keyAt: (currentIndex + 1)) = $G |
2208 | 2083 |
ifTrue: [ |
2084 |
self incrementSkipCount: 1. |
|
2085 |
]. |
|
2086 |
self addPrimaryTranslation: 'K'; |
|
2087 |
addSecondaryTranslation: 'K'. |
|
2088 |
! |
|
2089 |
||
2090 |
processH |
|
2213 | 2091 |
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp |
2092 |
case 'H': |
|
2208 | 2093 |
//only keep if first & before vowel or btw. 2 vowels |
2094 |
if(((current == 0) OR IsVowel(current - 1)) |
|
2095 |
AND IsVowel(current + 1)) |
|
2096 |
{ |
|
2097 |
MetaphAdd(H); |
|
2098 |
current += 2; |
|
2099 |
}else//also takes care of 'HH' |
|
2100 |
current += 1; |
|
2101 |
break; |
|
2102 |
" |
|
2103 |
||
2213 | 2104 |
(((currentIndex = 1) |
2105 |
or: [ (self keyAt: currentIndex - 1) isVowel]) |
|
2106 |
and: [(self keyAt: currentIndex + 1) isVowel]) |
|
2107 |
ifTrue: [ |
|
2108 |
self addPrimaryTranslation: 'H'; |
|
2109 |
addSecondaryTranslation: 'H'. |
|
2110 |
^self incrementSkipCount: 1. |
|
2111 |
] |
|
2208 | 2112 |
! |
2113 |
||
2114 |
processJ |
|
2213 | 2115 |
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp |
2116 |
case 'J': |
|
2208 | 2117 |
//obvious spanish, 'jose', 'san jacinto' |
2118 |
if(StringAt(current, 4, JOSE, ) OR StringAt(0, 4, SAN , ) ) |
|
2119 |
{ |
|
2120 |
if(((current == 0) AND (GetAt(current + 4) == ' ')) OR StringAt(0, 4, SAN , ) ) |
|
2121 |
MetaphAdd(H); |
|
2122 |
else |
|
2123 |
{ |
|
2124 |
MetaphAdd(J, H); |
|
2125 |
} |
|
2126 |
current +=1; |
|
2127 |
break; |
|
2128 |
} |
|
2129 |
||
2130 |
if((current == 0) AND !!StringAt(current, 4, JOSE, )) |
|
2131 |
MetaphAdd(J, A);//Yankelovich/Jankelowicz |
|
2132 |
else |
|
2133 |
//spanish pron. of e.g. 'bajador' |
|
2134 |
if(IsVowel(current - 1) |
|
2135 |
AND !!SlavoGermanic() |
|
2136 |
AND ((GetAt(current + 1) == 'A') OR (GetAt(current + 1) == 'O'))) |
|
2137 |
MetaphAdd(J, H); |
|
2138 |
else |
|
2139 |
if(current == last) |
|
2140 |
MetaphAdd(J, ); |
|
2141 |
else |
|
2142 |
if(!!StringAt((current + 1), 1, L, T, K, S, N, M, B, Z, ) |
|
2143 |
AND !!StringAt((current - 1), 1, S, K, L, )) |
|
2144 |
MetaphAdd(J); |
|
2145 |
||
2146 |
if(GetAt(current + 1) == 'J')//it could happen!! |
|
2147 |
current += 2; |
|
2148 |
else |
|
2149 |
current += 1; |
|
2150 |
break; |
|
2151 |
" |
|
2213 | 2152 |
| currentWord firstWord nextLetter | |
2153 |
currentWord := self inputKey copyFrom: currentIndex to: (currentIndex + 3 min: self inputKey size). |
|
2154 |
firstWord := self inputKey copyFrom: 1 to: (4 min: self inputKey size). |
|
2155 |
nextLetter := self keyAt: currentIndex + 1. |
|
2156 |
(currentWord = 'JOSE' or: [firstWord = 'SAN ']) |
|
2157 |
ifTrue: [ |
|
2158 |
((currentIndex = 1 and: [self inputKey size = 4 or: [self inputKey size >= 5 and: [(self keyAt: currentIndex + 4) = $ ]]])
|
2159 |
or: [firstWord = 'SAN ']) |
|
2160 |
ifTrue: [ |
|
2161 |
self addPrimaryTranslation: 'H'; |
|
2162 |
addSecondaryTranslation: 'H'. |
|
2163 |
] ifFalse: [ |
|
2164 |
self addPrimaryTranslation: 'J'; |
|
2165 |
addSecondaryTranslation: 'H'. |
|
2166 |
]. |
|
2167 |
^self. |
|
2168 |
]. |
|
2169 |
(currentIndex = 1 and: [firstWord ~= 'JOSE']) |
|
2170 |
ifTrue: [ |
|
2171 |
self addPrimaryTranslation: 'J'; |
|
2172 |
addSecondaryTranslation: 'A'. |
|
2173 |
] ifFalse: [ |
|
2174 |
((currentIndex > 1 and: [(self keyAt: currentIndex -1) isVowel]) |
|
2175 |
and: [(self isSlavoGermanic: self inputKey) not and: [nextLetter == $A or: [nextLetter == $O]]]) |
2213 | 2176 |
ifTrue: [ |
2177 |
self addPrimaryTranslation: 'J'; |
|
2178 |
addSecondaryTranslation: 'H'. |
|
2179 |
] ifFalse: [ |
|
2180 |
currentIndex = self inputKey size |
|
2181 |
ifTrue: [ |
|
2182 |
self addPrimaryTranslation: 'J'; |
|
2183 |
addSecondaryTranslation: ' '. |
|
2184 |
] ifFalse: [ |
|
2185 |
((#($L $T $K $S $N $M $B $Z) includes: nextLetter) not and: [(#($S $K $L) includes: (self keyAt: currentIndex - 1)) not]) |
|
2186 |
ifTrue: [ |
|
2187 |
self addPrimaryTranslation: 'J'; |
|
2188 |
addSecondaryTranslation: 'J'. |
|
2189 |
]. |
|
2190 |
]. |
|
2191 |
]. |
|
2192 |
]. |
|
2193 |
nextLetter == $J |
2213 | 2194 |
ifTrue: [ |
2195 |
self incrementSkipCount: 1. |
|
2196 |
]. |
|
2208 | 2197 |
! |
2198 |
||
2199 |
processK |
|
2213 | 2200 |
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp |
2201 |
case 'K': |
|
2208 | 2202 |
if(GetAt(current + 1) == 'K') |
2203 |
current += 2; |
|
2204 |
else |
|
2205 |
current += 1; |
|
2206 |
MetaphAdd(K); |
|
2207 |
break; |
|
2213 | 2208 |
" |
2209 |
||
2210 |
(self keyAt: currentIndex + 1) = $K |
|
2211 |
ifTrue: [ |
|
2212 |
self incrementSkipCount: 1 |
|
2213 |
]. |
|
2214 |
self addPrimaryTranslation: 'K'; |
|
2215 |
addSecondaryTranslation: 'K'. |
|
2208 | 2216 |
! |
2217 |
||
2218 |
processL |
|
2219 |
||
2220 |
"case 'L': |
|
2221 |
if(GetAt(current + 1) == 'L') |
|
2222 |
{ |
|
2223 |
//spanish e.g. 'cabrillo', 'gallegos' |
|
2224 |
if(((current == (length - 3)) |
|
2225 |
AND StringAt((current - 1), 4, ILLO, ILLA, ALLE, )) |
|
2226 |
OR ((StringAt((last - 1), 2, AS, OS, ) OR StringAt(last, 1, A, O, )) |
|
2227 |
AND StringAt((current - 1), 4, ALLE, )) ) |
|
2228 |
{ |
|
2229 |
MetaphAdd(L, ); |
|
2230 |
current += 2; |
|
2231 |
break; |
|
2232 |
} |
|
2233 |
current += 2; |
|
2234 |
}else |
|
2235 |
current += 1; |
|
2236 |
MetaphAdd(L); |
|
2237 |
break; |
|
2238 |
" |
|
2213 | 2239 |
| currentWord | |
2240 |
(self keyAt: currentIndex + 1) = $L |
|
2241 |
ifTrue: [ |
|
2242 |
currentWord := self keyMidString: 4 from: currentIndex - 1.
(((currentIndex = (self inputKey size - 2))
and: [#('ILLO' 'ILLA' 'ALLE') includes: currentWord])
or: [((#('AS' 'OS') includes: (self inputKey copyFrom: self inputKey size - 1 to: self inputKey size)) or: [#($A $O) includes: (self keyAt: self inputKey size)]) and: [currentWord = 'ALLE']
])
|
2246 |
ifTrue: [ |
|
2247 |
self addPrimaryTranslation: 'L'; |
|
2248 |
addSecondaryTranslation: ' '. |
|
2249 |
^self incrementSkipCount: 1. |
|
2250 |
]. |
|
2251 |
self incrementSkipCount: 1. |
|
2252 |
]. |
|
2253 |
self addPrimaryTranslation: 'L'; |
|
2254 |
addSecondaryTranslation: 'L'. |
|
2208 | 2255 |
! |
2256 |
||
2257 |
processM |
|
2258 |
||
2259 |
"case 'M': |
|
2260 |
if((StringAt((current - 1), 3, UMB, ) |
|
2261 |
AND (((current + 1) == last) OR StringAt((current + 2), 2, ER, ))) |
|
2262 |
//'dumb','thumb' |
|
2263 |
OR (GetAt(current + 1) == 'M') ) |
|
2264 |
current += 2; |
|
2265 |
else |
|
2266 |
current += 1; |
|
2267 |
MetaphAdd(M); |
|
2268 |
break; |
|
2269 |
" |
|
2213 | 2270 |
(((currentIndex > 1 and: [(self inputKey copyFrom: currentIndex - 1 to: (currentIndex +1 min: self inputKey size)) = 'UMB']) |
2271 |
and: [currentIndex + 1 = self inputKey size or: [(self keyMidString: 2 from: currentIndex + 2) = 'ER']])
|
2272 |
or: [(self keyAt: currentIndex + 1) = $M]) |
|
2273 |
ifTrue: [ |
|
2274 |
self incrementSkipCount: 1. |
|
2275 |
]. |
|
2276 |
self addPrimaryTranslation: 'M'; |
|
2277 |
addSecondaryTranslation: 'M'. |
|
2208 | 2278 |
! |
2279 |
||
2280 |
processN |
|
2213 | 2281 |
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp |
2282 |
case 'N': |
|
2208 | 2283 |
if(GetAt(current + 1) == 'N') |
2284 |
current += 2; |
|
2285 |
else |
|
2286 |
current += 1; |
|
2287 |
MetaphAdd(N); |
|
2288 |
break; |
|
2289 |
||
2213 | 2290 |
" |
2291 |
||
2292 |
(self keyAt: currentIndex + 1) = $N |
|
2293 |
ifTrue: [ |
|
2294 |
self incrementSkipCount: 1 |
|
2295 |
]. |
|
2296 |
self addPrimaryTranslation: 'N'; |
|
2297 |
addSecondaryTranslation: 'N'. |
|
2208 | 2298 |
! |
2299 |
||
2300 |
processNtilde |
|
2301 |
"case 'Ñ': |
|
2302 |
current += 1; |
|
2303 |
MetaphAdd(N); |
|
2304 |
break; |
|
2305 |
" |
|
2306 |
self addPrimaryTranslation: 'N'; |
|
2307 |
addSecondaryTranslation: 'N'. |
|
2308 |
! |
|
2309 |
||
2310 |
processP |
|
2213 | 2311 |
"case 'P': |
2208 | 2312 |
if(GetAt(current + 1) == 'H') |
2313 |
{ |
|
2314 |
MetaphAdd(F); |
|
2315 |
current += 2; |
|
2316 |
break; |
|
2317 |
} |
|
2318 |
||
2319 |
//also account for campbell, raspberry |
|
2320 |
if(StringAt((current + 1), 1, P, B, )) |
|
2321 |
current += 2; |
|
2322 |
else |
|
2323 |
current += 1; |
|
2324 |
MetaphAdd(P); |
|
2325 |
break; |
|
2326 |
" |
|
2213 | 2327 |
| nextLetter | |
2328 |
(nextLetter := self keyAt: currentIndex + 1) = $H |
|
2329 |
ifTrue: [ |
|
2330 |
self addPrimaryTranslation: 'F'; |
|
2331 |
addSecondaryTranslation: 'F'. |
|
2332 |
^self incrementSkipCount: 1. |
|
2333 |
]. |
|
2334 |
(#($P $B) includes: nextLetter)
ifTrue: [
self incrementSkipCount: 1.
].
self addPrimaryTranslation: 'P';
addSecondaryTranslation: 'P'.
|
2208 | 2341 |
! |
2342 |
||
2343 |
processQ |
|
2213 | 2344 |
"http://aspell.sourceforge.net/metaphone/dmetaph.cpp |
2345 |
case 'Q': |
|
2208 | 2346 |
if(GetAt(current + 1) == 'Q') |
2347 |
current += 2; |
|
2348 |
else |
|
2349 |
current += 1; |
|
2350 |
MetaphAdd(K); |
|
2351 |
break; |
|
2352 |
||
2213 | 2353 |
" |
2354 |
||
2355 |
(self keyAt: currentIndex + 1) = $Q |
|
2356 |
ifTrue: [ |
|
2357 |
self incrementSkipCount: 1 |
|
2358 |
]. |
|
2359 |
self addPrimaryTranslation: 'K'; |
|
2360 |
addSecondaryTranslation: 'K'. |
|
2208 | 2361 |
! |
2362 |
||
2363 |
processR
    "http://aspell.sourceforge.net/metaphone/dmetaph.cpp
        case 'R':
            //french e.g. 'rogier', but exclude 'hochmeier'
            if((current == last)
                AND !!SlavoGermanic()
                AND StringAt((current - 2), 2, IE, )
                AND !!StringAt((current - 4), 2, ME, MA, ))
                MetaphAdd(, R);
            else
                MetaphAdd(R);

            if(GetAt(current + 1) == 'R')
                current += 2;
            else
                current += 1;
            break;
    "

    (currentIndex = self inputKey size and: [
        (self isSlavoGermanic: self inputKey) not and: [
            (self inputKey copyFrom: ((currentIndex - 2) max: 1) to: ((currentIndex - 1) max: 1)) = 'IE' and: [
                (#('ME' 'MA') includes: (self inputKey copyFrom: ((currentIndex - 4) max: 1) to: ((currentIndex - 3) max: 1))) not
            ]
        ]
    ])
        ifTrue: [
            self addPrimaryTranslation: '';
                addSecondaryTranslation: 'R'.
        ] ifFalse: [
            self addPrimaryTranslation: 'R';
                addSecondaryTranslation: 'R'.
        ].
    (self keyAt: currentIndex + 1) = $R
        ifTrue: [
            self incrementSkipCount: 1
        ].
!

processRemainingCharacters
    "walk the rest of the key and dispatch each consonant to its processX method.
     Vowels and Y are not dispatched here. A handler that consumes following characters
     raises the skip count, and those positions are then only counted down.
     Stops as soon as both translations are longer than 4 characters."

    self startIndex to: self inputKey size do:[ :i |
        | c methodSelector |

        self skipCount = 0 ifTrue:[
            ((self primaryTranslation size > 4) and: [ self secondaryTranslation size > 4 ])
                ifTrue: [ ^self ].

            self currentIndex: i.
            c := self keyAt: i.

            (c isVowel not and: [c ~= $Y]) ifTrue:[
                c == $Ç ifTrue: [
                    methodSelector := #processCedille
                ] ifFalse: [ c == $Ñ ifTrue: [
                    methodSelector := #processNtilde
                ] ifFalse: [
                    methodSelector := ('process', c asString) asSymbol
                ]].
                self perform: methodSelector
            ]
        ] ifFalse: [
            self decrementSkipCount
        ]
    ]
!

processS
    "http://aspell.sourceforge.net/metaphone/dmetaph.cpp
        case 'S':
            //special cases 'island', 'isle', 'carlisle', 'carlysle'
            if(StringAt((current - 1), 3, ISL, YSL, ))
            {
                current += 1;
                break;
            }

            //special case 'sugar-'
            if((current == 0) AND StringAt(current, 5, SUGAR, ))
            {
                MetaphAdd(X, S);
                current += 1;
                break;
            }

            if(StringAt(current, 2, SH, ))
            {
                //germanic
                if(StringAt((current + 1), 4, HEIM, HOEK, HOLM, HOLZ, ))
                    MetaphAdd(S);
                else
                    MetaphAdd(X);
                current += 2;
                break;
            }

            //italian & armenian
            if(StringAt(current, 3, SIO, SIA, ) OR StringAt(current, 4, SIAN, ))
            {
                if(!!SlavoGermanic())
                    MetaphAdd(S, X);
                else
                    MetaphAdd(S);
                current += 3;
                break;
            }

            //german & anglicisations, e.g. 'smith' match 'schmidt', 'snider' match 'schneider'
            //also, -sz- in slavic language altho in hungarian it is pronounced 's'
            if(((current == 0)
                AND StringAt((current + 1), 1, M, N, L, W, ))
                OR StringAt((current + 1), 1, Z, ))
            {
                MetaphAdd(S, X);
                if(StringAt((current + 1), 1, Z, ))
                    current += 2;
                else
                    current += 1;
                break;
            }

            if(StringAt(current, 2, SC, ))
            {
                //Schlesinger's rule
                if(GetAt(current + 2) == 'H')
                    //dutch origin, e.g. 'school', 'schooner'
                    if(StringAt((current + 3), 2, OO, ER, EN, UY, ED, EM, ))
                    {
                        //'schermerhorn', 'schenker'
                        if(StringAt((current + 3), 2, ER, EN, ))
                        {
                            MetaphAdd(X, SK);
                        }else
                            MetaphAdd(SK);
                        current += 3;
                        break;
                    }else{
                        if((current == 0) AND !!IsVowel(3) AND (GetAt(3) !!= 'W'))
                            MetaphAdd(X, S);
                        else
                            MetaphAdd(X);
                        current += 3;
                        break;
                    }

                if(StringAt((current + 2), 1, I, E, Y, ))
                {
                    MetaphAdd(S);
                    current += 3;
                    break;
                }
                //else
                MetaphAdd(SK);
                current += 3;
                break;
            }

            //french e.g. 'resnais', 'artois'
            if((current == last) AND StringAt((current - 2), 2, AI, OI, ))
                MetaphAdd(, S);
            else
                MetaphAdd(S);

            if(StringAt((current + 1), 1, S, Z, ))
                current += 2;
            else
                current += 1;
            break;
    "

    | nextChar char2 chars char |

    "special cases 'island', 'isle', 'carlisle', 'carlysle': the S is silent"
    (#('ISL' 'YSL') includes: (self inputKey copyFrom: (currentIndex - 1 max: 1) to: (currentIndex + 1 min: self inputKey size)))
        ifTrue: [
            ^self
        ].
    (currentIndex = 1 and: [(self inputKey copyFrom: 1 to: (5 min: self inputKey size)) = 'SUGAR'])
        ifTrue: [
            self addPrimaryTranslation: 'X';
                addSecondaryTranslation: 'S'.
            ^self.
        ].
    (self inputKey copyFrom: currentIndex to: ((currentIndex + 1) min: self inputKey size)) = 'SH'
        ifTrue: [
            "germanic: the four characters after the S, i.e. currentIndex + 1 .. currentIndex + 4"
            (#('HEIM' 'HOEK' 'HOLM' 'HOLZ') includes: (self inputKey copyFrom: (currentIndex + 1 min: self inputKey size) to: ((currentIndex + 4) min: self inputKey size)))
                ifTrue: [
                    self addPrimaryTranslation: 'S';
                        addSecondaryTranslation: 'S'.
                ] ifFalse: [
                    self addPrimaryTranslation: 'X';
                        addSecondaryTranslation: 'X'.
                ].
            ^self incrementSkipCount: 1
        ].
    "italian & armenian"
    ((#('SIO' 'SIA') includes: (self inputKey copyFrom: currentIndex to: (currentIndex + 2 min: self inputKey size)))
        or: [(self inputKey copyFrom: currentIndex to: (currentIndex + 3 min: self inputKey size)) = 'SIAN'])
        ifTrue: [
            (self isSlavoGermanic: self inputKey) not
                ifTrue: [
                    self addPrimaryTranslation: 'S';
                        addSecondaryTranslation: 'X'.
                ] ifFalse: [
                    self addPrimaryTranslation: 'S';
                        addSecondaryTranslation: 'S'.
                ].
            ^self incrementSkipCount: 2
        ].
    "german & anglicisations, and -sz- in slavic languages"
    ((currentIndex = 1 and: [#($M $N $L $W) includes: (self keyAt: currentIndex + 1)])
        or: [(nextChar := self keyAt: currentIndex + 1) = $Z])
        ifTrue: [
            self addPrimaryTranslation: 'S';
                addSecondaryTranslation: 'X'.
            nextChar == $Z
                ifTrue: [
                    ^self incrementSkipCount: 1.
                ].
            ^self.
        ].
    ((self inputKey copyFrom: currentIndex to: ((currentIndex + 1) min: self inputKey size)) = 'SC')
        ifTrue: [
            (char2 := self keyAt: currentIndex + 2) = $H
                ifTrue: [
                    "dutch origin, e.g. 'school', 'schooner'"
                    (#('OO' 'ER' 'EN' 'UY' 'ED' 'EM') includes: (chars := self inputKey copyFrom: ((currentIndex + 3) min: self inputKey size) to: ((currentIndex + 4) min: self inputKey size)))
                        ifTrue: [
                            "'schermerhorn', 'schenker'"
                            (#('ER' 'EN') includes: chars)
                                ifTrue: [
                                    self addPrimaryTranslation: 'X';
                                        addSecondaryTranslation: 'SK'.
                                ] ifFalse: [
                                    self addPrimaryTranslation: 'SK';
                                        addSecondaryTranslation: 'SK'.
                                ].
                            ^self incrementSkipCount: 2.
                        ] ifFalse: [
                            ((currentIndex = 1 and: [(char := self inputKey at: 4 ifAbsent: [$b]) isVowel not]) and: [char ~= $W])
                                ifTrue: [
                                    self addPrimaryTranslation: 'X';
                                        addSecondaryTranslation: 'S'.
                                ] ifFalse: [
                                    self addPrimaryTranslation: 'X';
                                        addSecondaryTranslation: 'X'.
                                ].
                            ^self incrementSkipCount: 2.
                        ].
                ] ifFalse: [
                    (#($I $E $Y) includes: char2)
                        ifTrue: [
                            self addPrimaryTranslation: 'S';
                                addSecondaryTranslation: 'S'.
                            ^self incrementSkipCount: 2.
                        ] ifFalse: [
                            self addPrimaryTranslation: 'SK';
                                addSecondaryTranslation: 'SK'.
                            ^self incrementSkipCount: 2.
                        ]
                ].
        ].
    "french e.g. 'resnais', 'artois'"
    (currentIndex = self inputKey size and: [(#('AI' 'OI') includes: (self inputKey copyFrom: ((currentIndex - 2) max: 1) to: ((currentIndex - 1) max: 1)))])
        ifTrue: [
            self addPrimaryTranslation: '';
                addSecondaryTranslation: 'S'.
        ] ifFalse: [
            self addPrimaryTranslation: 'S';
                addSecondaryTranslation: 'S'.
        ].
    (#($S $Z) includes: (self keyAt: currentIndex + 1))
        ifTrue: [
            ^self incrementSkipCount: 1.
        ].
!

processT
    "http://aspell.sourceforge.net/metaphone/dmetaph.cpp
        case 'T':
            if(StringAt(current, 4, TION, ))
            {
                MetaphAdd(X);
                current += 3;
                break;
            }

            if(StringAt(current, 3, TIA, TCH, ))
            {
                MetaphAdd(X);
                current += 3;
                break;
            }

            if(StringAt(current, 2, TH, )
                OR StringAt(current, 3, TTH, ))
            {
                //special case 'thomas', 'thames' or germanic
                if(StringAt((current + 2), 2, OM, AM, )
                    OR StringAt(0, 4, VAN , VON , )
                    OR StringAt(0, 3, SCH, ))
                {
                    MetaphAdd(T);
                }else{
                    MetaphAdd(0, T);
                }
                current += 2;
                break;
            }

            if(StringAt((current + 1), 1, T, D, ))
                current += 2;
            else
                current += 1;
            MetaphAdd(T);
            break;
    "

    ((self inputKey copyFrom: currentIndex to: ((currentIndex + 3) min: self inputKey size)) = 'TION')
        ifTrue: [
            self addPrimaryTranslation: 'X';
                addSecondaryTranslation: 'X'.
            ^self incrementSkipCount: 2.
        ].
    (#('TIA' 'TCH') includes: (self inputKey copyFrom: currentIndex to: ((currentIndex + 2) min: self inputKey size)))
        ifTrue: [
            self addPrimaryTranslation: 'X';
                addSecondaryTranslation: 'X'.
            ^self incrementSkipCount: 2.
        ].
    (((self inputKey copyFrom: currentIndex to: ((currentIndex + 1) min: self inputKey size)) = 'TH') or: [
        ((self inputKey copyFrom: currentIndex to: ((currentIndex + 2) min: self inputKey size)) = 'TTH')
    ])
        ifTrue: [
            ((#('OM' 'AM') includes: (self inputKey copyFrom: currentIndex + 2 to: ((currentIndex + 3) min: self inputKey size)))
                or: [(#('VAN ' 'VON ') includes: (self inputKey copyFrom: 1 to: (4 min: self inputKey size)))
                    or: [(self inputKey copyFrom: 1 to: (3 min: self inputKey size)) = 'SCH']
                ])
                ifTrue: [
                    self addPrimaryTranslation: 'T';
                        addSecondaryTranslation: 'T'.
                ] ifFalse: [
                    self addPrimaryTranslation: '0';
                        addSecondaryTranslation: 'T'.
                ].
            ^self incrementSkipCount: 1.
        ].
    (#($T $D) includes: (self keyAt: currentIndex + 1))
        ifTrue: [
            self incrementSkipCount: 1.
        ].
    self addPrimaryTranslation: 'T';
        addSecondaryTranslation: 'T'.
!

processV
    "http://aspell.sourceforge.net/metaphone/dmetaph.cpp
        case 'V':
            if(GetAt(current + 1) == 'V')
                current += 2;
            else
                current += 1;
            MetaphAdd(F);
            break;
    "

    (self keyAt: currentIndex + 1) = $V
        ifTrue: [
            self incrementSkipCount: 1
        ].
    self addPrimaryTranslation: 'F';
        addSecondaryTranslation: 'F'.
!

processW
    "http://aspell.sourceforge.net/metaphone/dmetaph.cpp
        case 'W':
            //can also be in middle of word
            if(StringAt(current, 2, WR, ))
            {
                MetaphAdd(R);
                current += 2;
                break;
            }

            if((current == 0)
                AND (IsVowel(current + 1) OR StringAt(current, 2, WH, )))
            {
                //Wasserman should match Vasserman
                if(IsVowel(current + 1))
                    MetaphAdd(A, F);
                else
                    //need Uomo to match Womo
                    MetaphAdd(A);
            }

            //Arnow should match Arnoff
            if(((current == last) AND IsVowel(current - 1))
                OR StringAt((current - 1), 5, EWSKI, EWSKY, OWSKI, OWSKY, )
                OR StringAt(0, 3, SCH, ))
            {
                MetaphAdd(, F);
                current +=1;
                break;
            }

            //polish e.g. 'filipowicz'
            if(StringAt(current, 4, WICZ, WITZ, ))
            {
                MetaphAdd(TS, FX);
                current +=4;
                break;
            }

            //else skip it
            current +=1;
            break;
    "

    | word nextLetter |

    ((word := self inputKey copyFrom: currentIndex to: (currentIndex + 1 min: self inputKey size)) = 'WR')
        ifTrue: [
            self addPrimaryTranslation: 'R';
                addSecondaryTranslation: 'R'.
            ^self incrementSkipCount: 1
        ].
    "word-initial W followed by a vowel, or word-initial WH;
     nextLetter is always assigned before it is tested below"
    (currentIndex = 1 and: [(nextLetter := self keyAt: currentIndex + 1) isVowel or: [word = 'WH']])
        ifTrue: [
            nextLetter isVowel
                ifTrue: [
                    self addPrimaryTranslation: 'A';
                        addSecondaryTranslation: 'F'.
                ] ifFalse: [
                    self addPrimaryTranslation: 'A';
                        addSecondaryTranslation: 'A'.
                ]
        ].
    "'Arnow' should match 'Arnoff'"
    ((((currentIndex = self inputKey size) and: [(self keyAt: currentIndex - 1) isVowel])
        or: [#('EWSKI' 'EWSKY' 'OWSKI' 'OWSKY') includes: (self inputKey copyFrom: ((currentIndex - 1) max: 1) to: (currentIndex + 3 min: self inputKey size))])
        or: [(self inputKey copyFrom: 1 to: (3 min: self inputKey size)) = 'SCH'])
        ifTrue: [
            self addPrimaryTranslation: '';
                addSecondaryTranslation: 'F'.
            ^self.
        ].
    "polish e.g. 'filipowicz': four characters, currentIndex .. currentIndex + 3"
    (#('WICZ' 'WITZ') includes: (self inputKey copyFrom: currentIndex to: (currentIndex + 3 min: self inputKey size)))
        ifTrue: [
            self addPrimaryTranslation: 'TS';
                addSecondaryTranslation: 'FX'.
            ^self incrementSkipCount: 3
        ].
!

processX
    "http://aspell.sourceforge.net/metaphone/dmetaph.cpp
        case 'X':
            //french e.g. breaux
            if(!!((current == last)
                AND (StringAt((current - 3), 3, IAU, EAU, )
                    OR StringAt((current - 2), 2, AU, OU, ))) )
                MetaphAdd(KS);

            if(StringAt((current + 1), 1, C, X, ))
                current += 2;
            else
                current += 1;
            break;
    "

    "a final X after IAU, EAU, AU or OU is silent (french, e.g. 'breaux');
     the compared substrings end at the character before the X"
    ((currentIndex = self inputKey size)
        and: [(#('IAU' 'EAU') includes: (self inputKey copyFrom: ((currentIndex - 3) max: 1) to: ((currentIndex - 1) max: 1)))
            or: [(#('AU' 'OU') includes: (self inputKey copyFrom: ((currentIndex - 2) max: 1) to: ((currentIndex - 1) max: 1)))]])
        ifFalse: [
            self addPrimaryTranslation: 'KS';
                addSecondaryTranslation: 'KS'.
        ].
    (#($C $X) includes: (self keyAt: currentIndex + 1))
        ifTrue: [
            ^self incrementSkipCount: 1
        ]

    "Modified: / 24-07-2011 / 06:54:25 / cg"
!

processZ
    "http://aspell.sourceforge.net/metaphone/dmetaph.cpp
        case 'Z':
            //chinese pinyin e.g. 'zhao'
            if(GetAt(current + 1) == 'H')
            {
                MetaphAdd(J);
                current += 2;
                break;
            }else
                if(StringAt((current + 1), 2, ZO, ZI, ZA, )
                    OR (SlavoGermanic() AND ((current > 0) AND GetAt(current - 1) !!= 'T')))
                {
                    MetaphAdd(S, TS);
                }
                else
                    MetaphAdd(S);

            if(GetAt(current + 1) == 'Z')
                current += 2;
            else
                current += 1;
            break;
    "

    (self keyAt: currentIndex + 1) = $H
        ifTrue: [
            self addPrimaryTranslation: 'J';
                addSecondaryTranslation: 'J'.
            ^self incrementSkipCount: 1
        ] ifFalse: [
            "the preceding character is compared against the character $T, not the string 'T'"
            ((#('ZO' 'ZI' 'ZA') includes: (self inputKey copyFrom: ((currentIndex + 1) min: self inputKey size) to: ((currentIndex + 2) min: self inputKey size))) or: [
                (self isSlavoGermanic: self inputKey) and: [(currentIndex > 1 and: [(self keyAt: currentIndex - 1) ~= $T])]
            ])
                ifTrue: [
                    self addPrimaryTranslation: 'S';
                        addSecondaryTranslation: 'TS'.
                ] ifFalse: [
                    self addPrimaryTranslation: 'S';
                        addSecondaryTranslation: 'S'.
                ].
            (self keyAt: currentIndex + 1) = $Z
                ifTrue: [
                    ^self incrementSkipCount: 1
                ].
        ]
! !

!PhoneticStringUtilities::MiracodeStringComparator class methodsFor:'documentation'!

documentation
"
    Miracode (also called American Soundex) is like Soundex, with the addition that h and w are
    discarded if they separate consonants: two consonants with the same code which are
    separated only by h or w are encoded once.

    This variant is of particular importance because it was used by the U.S. National Archives.
    Most archive data were encoded with Miracode, but some entries were encoded with
    Simplified Soundex.

    The HW-rule was documented as a standard in 1910, but in practice the data of the 1880, 1900
    and 1910 censuses were encoded with mixed methods.
"
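!

examples
"
    A minimal usage sketch of the HW-rule described in the documentation method.
    Assumptions, none of which are confirmed by this file chunk: the comparator can be
    instantiated with plain #new, and #translate: answers the usual Soundex digit codes.
    Under the HW-rule the S and C of 'Ashcraft' share a single code although an H
    separates them, so American Soundex conventionally encodes that name as A261,
    whereas Simplified Soundex would encode the C a second time.
                                                                [exBegin]
    |comparator|

    comparator := PhoneticStringUtilities::MiracodeStringComparator new.
    Transcript showCR:(comparator phoneticStringsFor:'Ashcraft') first.
                                                                [exEnd]

    #phoneticStringsFor: answers a one-element Array, hence the #first.
"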
! !

!PhoneticStringUtilities::MiracodeStringComparator methodsFor:'api'!

phoneticStringsFor:aString
    "answer a one-element Array holding the 4-character code for aString:
     the first letter followed by up to three codes, padded with '0'.
     H and W do not update prevCode, so equal codes separated only by H or W
     are added once (the HW-rule)."

    |u p t prevCode|

    u := aString asUppercase.
    p := u first asString.
    prevCode := self translate:u first.
    u from:2 to:u size do:[:c |
        t := self translate:c.
        (t notNil
        and:[ t ~= '0'
        and:[ t ~= prevCode ]]) ifTrue:[
            p := p , t.
            p size == 4 ifTrue:[^ Array with:p ].
        ].
        (c ~= $W and:[c ~= $H]) ifTrue:[
            prevCode := t.
        ].
    ].
    [ p size < 4 ] whileTrue:[
        p := p , '0'
    ].
    ^ Array with:(p copyFrom:1 to:4)
! !

!PhoneticStringUtilities class methodsFor:'documentation'!

version
    ^ '$Header$'
!

version_CVS
    ^ '$Header$'
! !