hg/stx-libbasic2: comparison PhoneticStringUtilities.st


-:27271594c7d8
+:9833bbba2050
 !PhoneticStringUtilities::SoundexStringComparator class methodsFor:'documentation'!
 documentation
 "
 WARNING: this is the so called 'simplified soundex' algorithm;
 there are more variants like miracode (american soundex) or mysqlSoundex around.
 Be sure to use the correct algorithm, if the generated strings must be compatible
-(otherwise, the differences are probably too small to be noticed as effect)
+(otherwise, the differences are probably too small to be noticed as effect, but
+your search will be different)
-The following was copied from http://www.civilsolutions.com.au/publications/dedup.htm
+The following was copied from http://www.civilsolutions.com.au/publications/dedup.htm
-SOUNDEX is a phonetic coding algorithm that ignores many of the unreliable
-components of names, but by doing so reports more matches.
+SOUNDEX is a phonetic coding algorithm that ignores many of the unreliable
+components of names, but by doing so reports more matches.
-There are some variations around in the literature;
-the following is called 'simplified soundex', and the rules for coding a name are:
+There are some variations around in the literature;
+the following is called 'simplified soundex', and the rules for coding a name are:
-1. The first letter of the name is used in its un-coded form to serve as the prefix
-character of the code. (The rest of the code is numerical).
+1. The first letter of the name is used in its un-coded form to serve as the prefix
+character of the code. (The rest of the code is numerical).
-2. Thereafter, W and H are ignored entirely.
+2. Thereafter, W and H are ignored entirely.
-3. A, E, I, O, U, Y are not assigned a code number, but do serve as 'separators' (see Step 5).
+3. A, E, I, O, U, Y are not assigned a code number, but do serve as 'separators' (see Step 5).
-4. Other letters of the name are converted to a numerical equivalent:
-B, P, F, V              1
+4. Other letters of the name are converted to a numerical equivalent:
-C, G, J, K, Q, S, X, Z  2
+B, P, F, V              1
-D, T                    3
+C, G, J, K, Q, S, X, Z  2
-L                       4
+D, T                    3
-M, N                    5
+L                       4
-R                       6
+M, N                    5
+R                       6
-5. There are two exceptions:
-1. Letters that follow prefix letters which would, if coded, have the same
+5. There are two exceptions:
-numerical code, are ignored in all cases unless a ''separator'' (see Step 3) precedes them.
+1. Letters that follow prefix letters which would, if coded, have the same
+numerical code, are ignored in all cases unless a ''separator'' (see Step 3) precedes them.
-2. The second letter of any pair of consonants having the same code number is likewise ignored,
-i.e. unless there is a ''separator'' between them in the name.
+2. The second letter of any pair of consonants having the same code number is likewise ignored,
+i.e. unless there is a ''separator'' between them in the name.
-6. The final SOUNDEX code consists of the prefix letter plus three numerical characters.
-Longer codes are truncated to this length, and shorter codes are extended to it by adding zeros.
+6. The final SOUNDEX code consists of the prefix letter plus three numerical characters.
+Longer codes are truncated to this length, and shorter codes are extended to it by adding zeros.
-Notice, that in another variant, w and h are treated slightly differently.
-This is only of relevance, if you need to reconstruct original soundex codes of other programs
+Notice, that in another variant, w and h are treated slightly differently.
-or for the original 1880 us census data.
+This is only of relevance, if you need to reconstruct original soundex codes of other programs
+or for the original 1880 us census data.
 "
 ! !
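The six rules quoted in the comment above can be sketched roughly as follows. This is an illustration in Python, not the Smalltalk implementation from this file; the name `simplified_soundex` is hypothetical. It reads rule 2 literally (H and W get no code and do not act as separators) and assumes that non-ASCII and non-letter characters are simply dropped, which the comment does not specify.

```python
# Code table from step 4; letters missing from it carry no code.
CODES = {}
for letters, digit in (("BPFV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

# Step 3: uncoded letters that separate equal codes.
SEPARATORS = set("AEIOUY")


def simplified_soundex(name):
    """Hypothetical helper: code `name` following steps 1-6 as quoted."""
    # Assumption: anything outside A-Z is dropped before coding.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]                  # step 1: uncoded prefix letter
    previous = CODES.get(letters[0])     # step 5.1: the prefix code counts
    for ch in letters[1:]:
        if ch in SEPARATORS:
            previous = None              # a separator resets the comparison
            continue
        code = CODES.get(ch)
        if code is None:                 # step 2: H and W, ignored entirely
            continue
        if code != previous:             # step 5.2: skip repeated codes
            result += code
        previous = code
    return (result + "000")[:4]          # step 6: pad or truncate to 4


print(simplified_soundex("Robert"))   # R163
print(simplified_soundex("Tymczak"))  # T522
```

In "Tymczak" the Y and the A act as separators, so the codes for C and K are both kept even though Z is dropped as a repeat of C.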
 !PhoneticStringUtilities::SoundexStringComparator methodsFor:'api'!
 !PhoneticStringUtilities::MySQLSoundexStringComparator class methodsFor:'documentation'!
 documentation
 "
 MySQL soundex is like American Soundex (i.e. Miracode) without the 4 character limitation,
 and it also removes vowels first and then removes duplicate codes
 (whereas the soundex code does this in reverse order).
 These variations are important if you need the same soundex codes to be generated.
 "
 ! !
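The effect of the reversed order described in the comment above can be illustrated with a small Python sketch. It is not the Smalltalk code from this file and is not guaranteed to match MySQL's output in every case; `mysql_style_soundex` is a hypothetical name. It assumes that the first letter's code counts as the predecessor of the first digit and that results shorter than four characters are padded with zeros.

```python
# Standard Soundex code table; letters missing from it carry no code.
CODES = {ch: digit
         for letters, digit in (("BPFV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                                ("L", "4"), ("MN", "5"), ("R", "6"))
         for ch in letters}


def mysql_style_soundex(name):
    """Hypothetical helper: drop uncoded letters first, then collapse repeats."""
    # Assumption: anything outside A-Z is dropped before coding.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    # Pass 1: remove vowels (and H, W, Y) by keeping only coded letters.
    digits = [CODES[c] for c in letters[1:] if c in CODES]
    # Pass 2: collapse runs of equal codes. Vowels are already gone,
    # so they can no longer act as separators.
    result = letters[0]
    previous = CODES.get(letters[0])     # assumption: prefix code counts
    for d in digits:
        if d != previous:
            result += d
        previous = d
    # No truncation to 4 characters; short codes are padded (assumption).
    return result.ljust(4, "0")


print(mysql_style_soundex("Tymczak"))     # T520
print(mysql_style_soundex("Washington"))  # W25235
```

For "Tymczak" the codes of C, Z and K all collapse into a single 2 because the intervening A was removed in the first pass, whereas a variant that collapses duplicates first and treats vowels as separators keeps a second 2.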
 !PhoneticStringUtilities::MySQLSoundexStringComparator methodsFor:'api'!
 !PhoneticStringUtilities::NYSIISStringComparator class methodsFor:'documentation'!
 documentation
 "
 NYSIIS Algorithm:
 1.
 remove all ''S'' and ''Z'' chars from the end of the surname
 2.
 transcode initial strings
 MAC => MC
 PF => F
 3.
 Transcode trailing strings as follows,
 IX => IC
 EX => EC
 YE,EE,IE => Y
 NT,ND => D
 4.
 transcode ''EV'' to ''EF'' if not at start of name
 5.
 use first character of name as first character of key
 6.
 remove any ''W'' that follows a vowel
 7.
 replace all vowels with ''A''
 8.
 transcode ''GHT'' to ''GT''
 9.
 transcode ''DG'' to ''G''
 10.
 transcode ''PH'' to ''F''
 11.
 if not first character, eliminate all ''H'' preceded or followed by a vowel
 12.
 change ''KN'' to ''N'', else ''K'' to ''C''
 13.
 if not first character, change ''M'' to ''N''
 14.
 if not first character, change ''Q'' to ''G''
 15.
 transcode ''SH'' to ''S''
 16.
 transcode ''SCH'' to ''S''
 17.
 transcode ''YW'' to ''Y''
 18.
 if not first or last character, change ''Y'' to ''A''
 19.
 transcode ''WR'' to ''R''
 20.
 if not first character, change ''Z'' to ''S''
 21.
 transcode terminal ''AY'' to ''Y''
 22.
 remove trailing vowels
 23.
 collapse all strings of repeated characters
 24.
 if first char of original surname was a vowel, append it to the code
 "
 ! !
 !PhoneticStringUtilities::NYSIISStringComparator methodsFor:'api'!
 !PhoneticStringUtilities::PhonemStringComparator class methodsFor:'documentation'!
 documentation
 "
 Implementation of the PHONEM algorithm, as described in
 'Georg Wilde and Carsten Meyer, Doppelgaenger gesucht -
 Ein Programm fuer kontextsensitive phonetische Textumwandlung
 ct Magazin fuer Computer & Technik 25/1998'
 "
 ! !
 !PhoneticStringUtilities::PhonemStringComparator methodsFor:'api'!
 !PhoneticStringUtilities::DoubleMetaphoneStringComparator class methodsFor:'documentation'!
 documentaion
 "
 The Double Metaphone algorithm:
 see internet
 "
 ! !
 !PhoneticStringUtilities::DoubleMetaphoneStringComparator methodsFor:'accessing'!
 !PhoneticStringUtilities::MiracodeStringComparator class methodsFor:'documentation'!
 documentation
 "
 Miracode (also called American Soundex) is like Soundex with the addition that h and w are
 discarded if they separate consonants.
 These variants may be specifically important because they were used in U.S. National Archives.
 Most archive data were encoded with Miracode, but there are some entries encoded with
 Simplified Soundex.
 The HW-rule was documented as a standard in 1910, but actually data of 1880, 1900 and 1910
 censuses were encoded with mixed methods.
 "
 ! !
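The H/W rule mentioned in the comment above can be made concrete with a short Python sketch. It is an illustration only, not the Smalltalk code from this file; the function `soundex` and its parameter `hw_separates` are hypothetical. With `hw_separates=False`, H and W are discarded when they stand between consonants, as the comment describes for Miracode. With `hw_separates=True`, the sketch assumes a variant in which H and W separate equal codes the way vowels do; the comment only implies this contrast.

```python
# Standard Soundex code table; letters missing from it carry no code.
CODES = {ch: digit
         for letters, digit in (("BPFV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                                ("L", "4"), ("MN", "5"), ("R", "6"))
         for ch in letters}


def soundex(name, hw_separates):
    """Hypothetical helper contrasting two treatments of H and W."""
    # Assumption: anything outside A-Z is dropped before coding.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    previous = CODES.get(letters[0])
    for ch in letters[1:]:
        code = CODES.get(ch)
        if code is None:
            if ch in "HW" and not hw_separates:
                continue                 # Miracode: H/W are transparent
            previous = None              # vowels (and H/W if asked) separate
            continue
        if code != previous:
            result += code
        previous = code
    return (result + "000")[:4]


print(soundex("Ashcraft", hw_separates=False))  # A261
print(soundex("Ashcraft", hw_separates=True))   # A226
```

In "Ashcraft" the S and C share code 2 and are divided only by an H, so the two treatments of H produce different codes, while a name such as "Robert" comes out the same either way.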
 !PhoneticStringUtilities::MiracodeStringComparator methodsFor:'api'!
 ! !
 !PhoneticStringUtilities class methodsFor:'documentation'!
 version
-^ '$Header: /cvs/stx/stx/libbasic2/PhoneticStringUtilities.st,v 1.11 2011-07-24 04:58:20 cg Exp $'
+^ '$Header: /cvs/stx/stx/libbasic2/PhoneticStringUtilities.st,v 1.12 2014-02-25 07:24:31 cg Exp $'
 !
 version_CVS
-^ '$Header: /cvs/stx/stx/libbasic2/PhoneticStringUtilities.st,v 1.11 2011-07-24 04:58:20 cg Exp $'
+^ '$Header: /cvs/stx/stx/libbasic2/PhoneticStringUtilities.st,v 1.12 2014-02-25 07:24:31 cg Exp $'
 ! !

changeset 3185	9833bbba2050
parent 2580	7ce713ba2618
child 3488	5a69e672d7f8