diff -r 27271594c7d8 -r 9833bbba2050 PhoneticStringUtilities.st
--- a/PhoneticStringUtilities.st Tue Feb 25 08:19:54 2014 +0100
+++ b/PhoneticStringUtilities.st Tue Feb 25 08:24:31 2014 +0100
@@ -739,48 +739,48 @@
 documentation
 "
-WARNING: this is the so called 'simplified soundex' algorithm;
-there are more variants like miracode (american soundex) or mysqlSoundex around.
-Be sure to use the correct algorithm, if the generated strings must be compatible
-(otherwise, the differences are probably too small to be noticed as effect)
-
-The following was copied from http://www.civilsolutions.com.au/publications/dedup.htm
-
-SOUNDEX is a phonetic coding algorithm that ignores many of the unreliable
-components of names, but by doing so reports more matches.
-
-There are some variations around in the literature;
-the following is called 'simplified soundex', and the rules for coding a name are:
-
-1. The first letter of the name is used in its un-coded form to serve as the prefix
- character of the code. (The rest of the code is numerical).
-
-2. Thereafter, W and H are ignored entirely.
-
-3. A, E, I, 0, U, Y are not assigned a code number, but do serve as 'separators' (see Step 5).
-
-4. Other letters of the name are converted to a numerical equivalent:
- B, P, F, V 1
- C, G, J, K, Q, S, X, Z 2
- D, T 3
- L 4
- M, N 5
- R 6
-
-5. There are two exceptions:
- 1. Letters that follow prefix letters which would, if coded, have the same
- numerical code, are ignored in all cases unless a ''separator'' (see Step 3) precedes them.
-
- 2. The second letter of any pair of consonants having the same code number is likewise ignored,
- i.e. unless there is a ''separator'' between them in the name.
-
-6. The final SOUNDEX code consists of the prefix letter plus three numerical characters.
- Longer codes are truncated to this length, and shorter codes are extended to it by adding zeros.
-
-Notice, that in another variant, w and h are treated slightly differently.
-This is only of relevance, if you need to reconstruct original soundex codes of other programs
-or for the original 1880 us census data.
-
+ WARNING: this is the so called 'simplified soundex' algorithm;
+ there are more variants like miracode (american soundex) or mysqlSoundex around.
+ Be sure to use the correct algorithm, if the generated strings must be compatible
+ (otherwise, the differences are probably too small to be noticed as effect, but
+ your search will be different)
+
+ The following was copied from http://www.civilsolutions.com.au/publications/dedup.htm
+
+ SOUNDEX is a phonetic coding algorithm that ignores many of the unreliable
+ components of names, but by doing so reports more matches.
+
+ There are some variations around in the literature;
+ the following is called 'simplified soundex', and the rules for coding a name are:
+
+ 1. The first letter of the name is used in its un-coded form to serve as the prefix
+ character of the code. (The rest of the code is numerical).
+
+ 2. Thereafter, W and H are ignored entirely.
+
+ 3. A, E, I, 0, U, Y are not assigned a code number, but do serve as 'separators' (see Step 5).
+
+ 4. Other letters of the name are converted to a numerical equivalent:
+ B, P, F, V 1
+ C, G, J, K, Q, S, X, Z 2
+ D, T 3
+ L 4
+ M, N 5
+ R 6
+
+ 5. There are two exceptions:
+ 1. Letters that follow prefix letters which would, if coded, have the same
+ numerical code, are ignored in all cases unless a ''separator'' (see Step 3) precedes them.
+
+ 2. The second letter of any pair of consonants having the same code number is likewise ignored,
+ i.e. unless there is a ''separator'' between them in the name.
+
+ 6. The final SOUNDEX code consists of the prefix letter plus three numerical characters.
+ Longer codes are truncated to this length, and shorter codes are extended to it by adding zeros.
+
+ Notice, that in another variant, w and h are treated slightly differently.
+ This is only of relevance, if you need to reconstruct original soundex codes of other programs
+ or for the original 1880 us census data.
 "
 ! !
@@ -849,11 +849,11 @@
 documentation
 "
-MySQL soundex is like american Soundex (i.e. miracode) without the 4 character limitation,
-and also removing vokals first, then removing duplicate codes
-(whereas the soundex code does this in reverse order).
-
-These variations are important, if you need the ame soundex codes to be generated.
+ MySQL soundex is like american Soundex (i.e. miracode) without the 4 character limitation,
+ and also removing vokals first, then removing duplicate codes
+ (whereas the soundex code does this in reverse order).
+
+ These variations are important, if you need the ame soundex codes to be generated.
 "
 ! !
@@ -884,86 +884,86 @@
 documentation
 "
-NYSIIS Algorithm:
-
-1.
- remove all ''S'' and ''Z'' chars from the end of the surname
-
-2.
- transcode initial strings
- MAC => MC
- PF => F
-
-3.
- Transcode trailing strings as follows,
-
- IX => IC
- EX => EC
- YE,EE,IE => Y
- NT,ND => D
-
-4.
- transcode ''EV'' to ''EF'' if not at start of name
-
-5.
- use first character of name as first character of key
-
-6.
- remove any ''W'' that follows a vowel
-
-7.
- replace all vowels with ''A''
-
-8.
- transcode ''GHT'' to ''GT''
-
-9.
- transcode ''DG'' to ''G''
-
-10.
- transcode ''PH'' to ''F''
-
-11.
- if not first character, eliminate all ''H'' preceded or followed by a vowel
-
-12.
- change ''KN'' to ''N'', else ''K'' to ''C''
-
-13.
- if not first character, change ''M'' to ''N''
-
-14.
- if not first character, change ''Q'' to ''G''
-
-15.
- transcode ''SH'' to ''S''
-
-16.
- transcode ''SCH'' to ''S''
-
-17.
- transcode ''YW'' to ''Y''
-
-18.
- if not first or last character, change ''Y'' to ''A''
-
-19.
- transcode ''WR'' to ''R''
-
-20.
- if not first character, change ''Z'' to ''S''
-
-21.
- transcode terminal ''AY'' to ''Y''
-
-22.
- remove traling vowels
-
-23.
- collapse all strings of repeated characters
-
-24.
- if first char of original surname was a vowel, append it to the code
+ NYSIIS Algorithm:
+
+ 1.
+ remove all ''S'' and ''Z'' chars from the end of the surname
+
+ 2.
+ transcode initial strings
+ MAC => MC
+ PF => F
+
+ 3.
+ Transcode trailing strings as follows,
+
+ IX => IC
+ EX => EC
+ YE,EE,IE => Y
+ NT,ND => D
+
+ 4.
+ transcode ''EV'' to ''EF'' if not at start of name
+
+ 5.
+ use first character of name as first character of key
+
+ 6.
+ remove any ''W'' that follows a vowel
+
+ 7.
+ replace all vowels with ''A''
+
+ 8.
+ transcode ''GHT'' to ''GT''
+
+ 9.
+ transcode ''DG'' to ''G''
+
+ 10.
+ transcode ''PH'' to ''F''
+
+ 11.
+ if not first character, eliminate all ''H'' preceded or followed by a vowel
+
+ 12.
+ change ''KN'' to ''N'', else ''K'' to ''C''
+
+ 13.
+ if not first character, change ''M'' to ''N''
+
+ 14.
+ if not first character, change ''Q'' to ''G''
+
+ 15.
+ transcode ''SH'' to ''S''
+
+ 16.
+ transcode ''SCH'' to ''S''
+
+ 17.
+ transcode ''YW'' to ''Y''
+
+ 18.
+ if not first or last character, change ''Y'' to ''A''
+
+ 19.
+ transcode ''WR'' to ''R''
+
+ 20.
+ if not first character, change ''Z'' to ''S''
+
+ 21.
+ transcode terminal ''AY'' to ''Y''
+
+ 22.
+ remove traling vowels
+
+ 23.
+ collapse all strings of repeated characters
+
+ 24.
+ if first char of original surname was a vowel, append it to the code
 "
 ! !
@@ -1334,10 +1334,10 @@
 documentation
 "
-Implementation of the PHONEM algorithm, as described in
-'Georg Wilde and Carsten Meyer, Doppelgaenger gesucht -
-Ein Programm fuer kontextsensitive phonetische Textumwandlung
-ct Magazin fuer Computer & Technik 25/1998'
+ Implementation of the PHONEM algorithm, as described in
+ 'Georg Wilde and Carsten Meyer, Doppelgaenger gesucht -
+ Ein Programm fuer kontextsensitive phonetische Textumwandlung
+ ct Magazin fuer Computer & Technik 25/1998'
 "
 ! !
@@ -1437,8 +1437,8 @@
 documentaion
 "
-The Double Metaphone algorithm:
-see internet
+ The Double Metaphone algorithm:
+ see internet
 "
 ! !
@@ -2854,15 +2854,15 @@
 documentation
 "
-Miracode (also called American Soundex) is like Soundex with the addition that h and w are
-discarded if they separate consonants.
-
-These variants may be specifically important because they were used in U.S. National Archives.
-Most archive data were encoded with Miracode, but there are some entries encoded with
-Simplified Soundex.
-
-The HW-rule was documented as a standard in 1910, but actually data of 1880, 1900 and 1910
-censuses were encoded with mixed methods.
+ Miracode (also called American Soundex) is like Soundex with the addition that h and w are
+ discarded if they separate consonants.
+
+ These variants may be specifically important because they were used in U.S. National Archives.
+ Most archive data were encoded with Miracode, but there are some entries encoded with
+ Simplified Soundex.
+
+ The HW-rule was documented as a standard in 1910, but actually data of 1880, 1900 and 1910
+ censuses were encoded with mixed methods.
 "
 ! !
@@ -2895,9 +2895,10 @@
 !PhoneticStringUtilities class methodsFor:'documentation'!
 version
- ^ '$Header: /cvs/stx/stx/libbasic2/PhoneticStringUtilities.st,v 1.11 2011-07-24 04:58:20 cg Exp $'
+ ^ '$Header: /cvs/stx/stx/libbasic2/PhoneticStringUtilities.st,v 1.12 2014-02-25 07:24:31 cg Exp $'
 !
 version_CVS
- ^ '$Header: /cvs/stx/stx/libbasic2/PhoneticStringUtilities.st,v 1.11 2011-07-24 04:58:20 cg Exp $'
+ ^ '$Header: /cvs/stx/stx/libbasic2/PhoneticStringUtilities.st,v 1.12 2014-02-25 07:24:31 cg Exp $'
 ! !
+