PhoneticStringUtilities.st
changeset 3185 9833bbba2050
parent 2580 7ce713ba2618
child 3488 5a69e672d7f8
3184:27271594c7d8 3185:9833bbba2050
   737 
   737 
   738 !PhoneticStringUtilities::SoundexStringComparator class methodsFor:'documentation'!
   738 !PhoneticStringUtilities::SoundexStringComparator class methodsFor:'documentation'!
   739 
   739 
   740 documentation
   740 documentation
   741 "
   741 "
   742 WARNING: this is the so-called 'simplified soundex' algorithm;
   742     WARNING: this is the so-called 'simplified soundex' algorithm;
   743 there are more variants like miracode (american soundex) or mysqlSoundex around.
   743       there are more variants like miracode (american soundex) or mysqlSoundex around.
   744 Be sure to use the correct algorithm, if the generated strings must be compatible
   744       Be sure to use the correct algorithm, if the generated strings must be compatible
   745 (otherwise, the differences are probably too small to be noticed as effect)
   745       (otherwise, the differences are probably too small to be noticed as effect, but
   746 
   746       your search will be different)
   747 The following was copied from http://www.civilsolutions.com.au/publications/dedup.htm
   747 
   748 
   748     The following was copied from http://www.civilsolutions.com.au/publications/dedup.htm
   749 SOUNDEX is a phonetic coding algorithm that ignores many of the unreliable
   749 
   750 components of names, but by doing so reports more matches. 
   750     SOUNDEX is a phonetic coding algorithm that ignores many of the unreliable
   751 
   751     components of names, but by doing so reports more matches. 
   752 There are some variations around in the literature; 
   752 
   753 the following is called 'simplified soundex', and the rules for coding a name are:
   753     There are some variations around in the literature; 
   754 
   754     the following is called 'simplified soundex', and the rules for coding a name are:
   755 1. The first letter of the name is used in its un-coded form to serve as the prefix
   755 
   756    character of the code. (The rest of the code is numerical).
   756     1. The first letter of the name is used in its un-coded form to serve as the prefix
   757 
   757        character of the code. (The rest of the code is numerical).
   758 2. Thereafter, W and H are ignored entirely.
   758 
   759 
   759     2. Thereafter, W and H are ignored entirely.
   760 3. A, E, I, O, U, Y are not assigned a code number, but do serve as 'separators' (see Step 5).
   760 
   761 
   761     3. A, E, I, O, U, Y are not assigned a code number, but do serve as 'separators' (see Step 5).
   762 4. Other letters of the name are converted to a numerical equivalent:
   762 
   763              B, P, F, V              1 
   763     4. Other letters of the name are converted to a numerical equivalent:
   764              C, G, J, K, Q, S, X, Z  2 
   764                  B, P, F, V              1 
   765              D, T                    3 
   765                  C, G, J, K, Q, S, X, Z  2 
   766              L                       4 
   766                  D, T                    3 
   767              M, N                    5 
   767                  L                       4 
   768              R                       6 
   768                  M, N                    5 
   769 
   769                  R                       6 
   770 5. There are two exceptions: 
   770 
   771     1. Letters that follow prefix letters which would, if coded, have the same
   771     5. There are two exceptions: 
   772        numerical code, are ignored in all cases unless a ''separator'' (see Step 3) precedes them.
   772         1. Letters that follow prefix letters which would, if coded, have the same
   773 
   773            numerical code, are ignored in all cases unless a ''separator'' (see Step 3) precedes them.
   774     2. The second letter of any pair of consonants having the same code number is likewise ignored, 
   774 
   775        i.e. unless there is a ''separator'' between them in the name.
   775         2. The second letter of any pair of consonants having the same code number is likewise ignored, 
   776 
   776            i.e. unless there is a ''separator'' between them in the name.
   777 6. The final SOUNDEX code consists of the prefix letter plus three numerical characters.
   777 
   778    Longer codes are truncated to this length, and shorter codes are extended to it by adding zeros.
   778     6. The final SOUNDEX code consists of the prefix letter plus three numerical characters.
   779 
   779        Longer codes are truncated to this length, and shorter codes are extended to it by adding zeros.
   780 Notice, that in another variant, w and h are treated slightly differently.
   780 
   781 This is only of relevance, if you need to reconstruct original soundex codes of other programs
   781     Notice, that in another variant, w and h are treated slightly differently.
   782 or for the original 1880 U.S. census data.
   782     This is only of relevance, if you need to reconstruct original soundex codes of other programs
   783 
   783     or for the original 1880 U.S. census data.
   784 "
   784 "
   785 ! !
   785 ! !
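
The six rules in the comment above can be sketched as a pair of workspace blocks. This is a minimal illustration, not the code of SoundexStringComparator: the block names codeOf and soundex are hypothetical, the input is assumed to be a non-empty name, and W and H are ignored entirely, as rule 2 states.

```smalltalk
"Sketch of the 'simplified soundex' rules from the documentation comment.
 codeOf and soundex are hypothetical names, not part of the class."
| codeOf soundex |

"Rule 4: answer the digit character for a letter, or nil if it has no code."
codeOf := [:ch |
    | idx |
    idx := #('BPFV' 'CGJKQSXZ' 'DT' 'L' 'MN' 'R')
                findFirst:[:group | group includes:ch].
    idx = 0 ifTrue:[nil] ifFalse:[Character digitValue:idx]
].

soundex := [:name |
    | upper out lastCode |
    upper := name asUppercase.
    out := WriteStream on:(String new).
    "Rule 1: the first letter is kept un-coded as the prefix."
    out nextPut:(upper first).
    "Rule 5.1: remember the prefix's code so an equal code right after it is dropped."
    lastCode := codeOf value:(upper first).
    2 to:upper size do:[:i |
        | ch code |
        ch := upper at:i.
        "Rule 2: W and H are skipped and do not act as separators."
        ('WH' includes:ch) ifFalse:[
            code := codeOf value:ch.
            "Rule 5.2: emit a digit only if it differs from the previous code."
            (code notNil and:[code ~= lastCode]) ifTrue:[out nextPut:code].
            "Rule 3: an uncoded letter sets lastCode to nil, i.e. acts as a separator."
            lastCode := code
        ]
    ].
    "Rule 6: pad with zeros, then truncate to prefix plus three digits."
    (out contents , '000') copyFrom:1 to:4
].

soundex value:'Robert'.     "R163"
soundex value:'Pfister'.    "P236"
soundex value:'Lee'.        "L000"
```

In 'Pfister' the F is dropped because it has the same code as the prefix P and no separator comes between them; 'Lee' shows the zero padding of rule 6.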
   786 
   786 
   787 !PhoneticStringUtilities::SoundexStringComparator methodsFor:'api'!
   787 !PhoneticStringUtilities::SoundexStringComparator methodsFor:'api'!
   788 
   788 
   847 
   847 
   848 !PhoneticStringUtilities::MySQLSoundexStringComparator class methodsFor:'documentation'!
   848 !PhoneticStringUtilities::MySQLSoundexStringComparator class methodsFor:'documentation'!
   849 
   849 
   850 documentation
   850 documentation
   851 "
   851 "
   852 MySQL soundex is like american Soundex (i.e. miracode) without the 4 character limitation,
   852     MySQL soundex is like american Soundex (i.e. miracode) without the 4 character limitation,
   853 and also removing vowels first, then removing duplicate codes
   853     and also removing vowels first, then removing duplicate codes
   854 (whereas the soundex code does this in reverse order).
   854     (whereas the soundex code does this in reverse order).
   855 
   855 
   856 These variations are important, if you need the same soundex codes to be generated.
   856     These variations are important, if you need the same soundex codes to be generated.
   857 "
   857 "
   858 ! !
   858 ! !
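
The effect of the different ordering can be sketched as follows. This is only an illustration of the description above, not MySQL's source and not the code of MySQLSoundexStringComparator. The block names are hypothetical, and two points are assumptions: that uncoded letters (vowels, H, W) never act as separators, and that short codes are still padded to four characters.

```smalltalk
"Sketch: drop uncoded letters first, then collapse equal neighbouring codes.
 codeOf and mysqlSoundex are hypothetical names."
| codeOf mysqlSoundex |

codeOf := [:ch |
    | idx |
    idx := #('BPFV' 'CGJKQSXZ' 'DT' 'L' 'MN' 'R')
                findFirst:[:group | group includes:ch].
    idx = 0 ifTrue:[nil] ifFalse:[Character digitValue:idx]
].

mysqlSoundex := [:name |
    | upper out lastCode result |
    upper := name asUppercase.
    out := WriteStream on:(String new).
    out nextPut:(upper first).
    lastCode := codeOf value:(upper first).
    2 to:upper size do:[:i |
        | code |
        code := codeOf value:(upper at:i).
        "Uncoded letters are simply skipped and leave lastCode untouched,
         so equal codes on both sides of a vowel still collapse (assumption)."
        (code notNil and:[code ~= lastCode]) ifTrue:[
            out nextPut:code.
            lastCode := code
        ]
    ].
    result := out contents.
    "No 4-character limit; padding of short codes is an assumption."
    [result size < 4] whileTrue:[result := result , '0'].
    result
].

mysqlSoundex value:'Tymczak'.   "T520"
```

Under the simplified soundex rules the A in 'Tymczak' separates Z from K, so K is coded again and the result is T522; with the vowels removed first, C, Z and K collapse into a single 2.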
   859 
   859 
   860 !PhoneticStringUtilities::MySQLSoundexStringComparator methodsFor:'api'!
   860 !PhoneticStringUtilities::MySQLSoundexStringComparator methodsFor:'api'!
   861 
   861 
   882 
   882 
   883 !PhoneticStringUtilities::NYSIISStringComparator class methodsFor:'documentation'!
   883 !PhoneticStringUtilities::NYSIISStringComparator class methodsFor:'documentation'!
   884 
   884 
   885 documentation
   885 documentation
   886 "
   886 "
   887 NYSIIS Algorithm:
   887     NYSIIS Algorithm:
   888 
   888 
   889 1.
   889     1.
   890     remove all ''S'' and ''Z'' chars from the end of the surname 
   890         remove all ''S'' and ''Z'' chars from the end of the surname 
   891 
   891 
   892 2.
   892     2.
   893     transcode initial strings
   893         transcode initial strings
   894         MAC => MC
   894             MAC => MC
   895         PF => F
   895             PF => F
   896 
   896 
   897 3.
   897     3.
   898     Transcode trailing strings as follows,
   898         Transcode trailing strings as follows,
   899     
   899         
   900         IX => IC
   900             IX => IC
   901         EX => EC
   901             EX => EC
   902         YE,EE,IE => Y
   902             YE,EE,IE => Y
   903         NT,ND => D 
   903             NT,ND => D 
   904 
   904 
   905 4.
   905     4.
   906     transcode ''EV'' to ''EF'' if not at start of name
   906         transcode ''EV'' to ''EF'' if not at start of name
   907 
   907 
   908 5.
   908     5.
   909     use first character of name as first character of key 
   909         use first character of name as first character of key 
   910 
   910 
   911 6.
   911     6.
   912     remove any ''W'' that follows a vowel 
   912         remove any ''W'' that follows a vowel 
   913 
   913 
   914 7.
   914     7.
   915     replace all vowels with ''A'' 
   915         replace all vowels with ''A'' 
   916 
   916 
   917 8.
   917     8.
   918     transcode ''GHT'' to ''GT'' 
   918         transcode ''GHT'' to ''GT'' 
   919 
   919 
   920 9.
   920     9.
   921     transcode ''DG'' to ''G'' 
   921         transcode ''DG'' to ''G'' 
   922 
   922 
   923 10.
   923     10.
   924     transcode ''PH'' to ''F'' 
   924         transcode ''PH'' to ''F'' 
   925 
   925 
   926 11.
   926     11.
   927     if not first character, eliminate all ''H'' preceded or followed by a vowel 
   927         if not first character, eliminate all ''H'' preceded or followed by a vowel 
   928 
   928 
   929 12.
   929     12.
   930     change ''KN'' to ''N'', else ''K'' to ''C'' 
   930         change ''KN'' to ''N'', else ''K'' to ''C'' 
   931 
   931 
   932 13.
   932     13.
   933     if not first character, change ''M'' to ''N'' 
   933         if not first character, change ''M'' to ''N'' 
   934 
   934 
   935 14.
   935     14.
   936     if not first character, change ''Q'' to ''G'' 
   936         if not first character, change ''Q'' to ''G'' 
   937 
   937 
   938 15.
   938     15.
   939     transcode ''SH'' to ''S'' 
   939         transcode ''SH'' to ''S'' 
   940 
   940 
   941 16.
   941     16.
   942     transcode ''SCH'' to ''S'' 
   942         transcode ''SCH'' to ''S'' 
   943 
   943 
   944 17.
   944     17.
   945     transcode ''YW'' to ''Y'' 
   945         transcode ''YW'' to ''Y'' 
   946 
   946 
   947 18.
   947     18.
   948     if not first or last character, change ''Y'' to ''A'' 
   948         if not first or last character, change ''Y'' to ''A'' 
   949 
   949 
   950 19.
   950     19.
   951     transcode ''WR'' to ''R'' 
   951         transcode ''WR'' to ''R'' 
   952 
   952 
   953 20.
   953     20.
   954     if not first character, change ''Z'' to ''S'' 
   954         if not first character, change ''Z'' to ''S'' 
   955 
   955 
   956 21.
   956     21.
   957     transcode terminal ''AY'' to ''Y'' 
   957         transcode terminal ''AY'' to ''Y'' 
   958 
   958 
   959 22.
   959     22.
   960     remove trailing vowels 
   960         remove trailing vowels 
   961 
   961 
   962 23.
   962     23.
   963     collapse all strings of repeated characters 
   963         collapse all strings of repeated characters 
   964 
   964 
   965 24.
   965     24.
   966     if first char of original surname was a vowel, append it to the code
   966         if first char of original surname was a vowel, append it to the code
   967 "
   967 "
   968 ! !
   968 ! !
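
Several of the steps above (8, 9, 10, 15, 16, 17 and 19) are plain substring replacements that do not depend on position. A minimal sketch of those steps only, applied in the listed order, might look like this; the names rules and transcode are hypothetical, and the position-dependent steps are deliberately left out.

```smalltalk
"Sketch of the unconditional NYSIIS transcoding steps only.
 rules and transcode are hypothetical names, not part of the class."
| rules transcode |

"Each entry is (pattern replacement), in the order of the steps listed above."
rules := #( #('GHT' 'GT') #('DG' 'G') #('PH' 'F')
            #('SH' 'S') #('SCH' 'S') #('YW' 'Y') #('WR' 'R') ).

transcode := [:key |
    rules
        inject:key asUppercase
        into:[:str :rule | str copyReplaceAll:(rule first) with:(rule last)]
].

transcode value:'Wright'.   "RIGT"
```

Keeping the rules in a table makes their order explicit, which matters because an earlier replacement can create or destroy a match for a later one.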
   969 
   969 
   970 !PhoneticStringUtilities::NYSIISStringComparator methodsFor:'api'!
   970 !PhoneticStringUtilities::NYSIISStringComparator methodsFor:'api'!
   971 
   971 
  1332 
  1332 
  1333 !PhoneticStringUtilities::PhonemStringComparator class methodsFor:'documentation'!
  1333 !PhoneticStringUtilities::PhonemStringComparator class methodsFor:'documentation'!
  1334 
  1334 
  1335 documentation
  1335 documentation
  1336 "
  1336 "
  1337 Implementation of the PHONEM algorithm, as described in
  1337     Implementation of the PHONEM algorithm, as described in
  1338 'Georg Wilde and Carsten Meyer, Doppelgaenger gesucht -
  1338     'Georg Wilde and Carsten Meyer, Doppelgaenger gesucht -
  1339 Ein Programm fuer kontextsensitive phonetische Textumwandlung
  1339     Ein Programm fuer kontextsensitive phonetische Textumwandlung
  1340 ct Magazin fuer Computer & Technik 25/1998'
  1340     ct Magazin fuer Computer & Technik 25/1998'
  1341 "
  1341 "
  1342 ! !
  1342 ! !
  1343 
  1343 
  1344 !PhoneticStringUtilities::PhonemStringComparator methodsFor:'api'!
  1344 !PhoneticStringUtilities::PhonemStringComparator methodsFor:'api'!
  1345 
  1345 
  1435 
  1435 
  1436 !PhoneticStringUtilities::DoubleMetaphoneStringComparator class methodsFor:'documentation'!
  1436 !PhoneticStringUtilities::DoubleMetaphoneStringComparator class methodsFor:'documentation'!
  1437 
  1437 
  1438 documentaion
  1438 documentaion
  1439 "
  1439 "
  1440 The Double Metaphone algorithm:
  1440     The Double Metaphone algorithm:
  1441 see internet
  1441     see internet
  1442 "
  1442 "
  1443 ! !
  1443 ! !
  1444 
  1444 
  1445 !PhoneticStringUtilities::DoubleMetaphoneStringComparator methodsFor:'accessing'!
  1445 !PhoneticStringUtilities::DoubleMetaphoneStringComparator methodsFor:'accessing'!
  1446 
  1446 
  2852 
  2852 
  2853 !PhoneticStringUtilities::MiracodeStringComparator class methodsFor:'documentation'!
  2853 !PhoneticStringUtilities::MiracodeStringComparator class methodsFor:'documentation'!
  2854 
  2854 
  2855 documentation
  2855 documentation
  2856 "
  2856 "
  2857 Miracode (also called American Soundex) is like Soundex with the addition that h and w are 
  2857     Miracode (also called American Soundex) is like Soundex with the addition that h and w are 
  2858 discarded if they separate consonants.
  2858     discarded if they separate consonants.
  2859 
  2859 
  2860 These variants may be specifically important because they were used in U.S. National Archives. 
  2860     These variants may be specifically important because they were used in U.S. National Archives. 
  2861 Most archive data were encoded with Miracode, but there are some entries encoded with 
  2861     Most archive data were encoded with Miracode, but there are some entries encoded with 
  2862 Simplified Soundex. 
  2862     Simplified Soundex. 
  2863 
  2863 
  2864 The HW-rule was documented as a standard in 1910, but actually data of 1880, 1900 and 1910 
  2864     The HW-rule was documented as a standard in 1910, but actually data of 1880, 1900 and 1910 
  2865 censuses were encoded with mixed methods.
  2865     censuses were encoded with mixed methods.
  2866 "
  2866 "
  2867 ! !
  2867 ! !
  2868 
  2868 
  2869 !PhoneticStringUtilities::MiracodeStringComparator methodsFor:'api'!
  2869 !PhoneticStringUtilities::MiracodeStringComparator methodsFor:'api'!
  2870 
  2870 
  2893 ! !
  2893 ! !
  2894 
  2894 
  2895 !PhoneticStringUtilities class methodsFor:'documentation'!
  2895 !PhoneticStringUtilities class methodsFor:'documentation'!
  2896 
  2896 
  2897 version
  2897 version
  2898     ^ '$Header: /cvs/stx/stx/libbasic2/PhoneticStringUtilities.st,v 1.11 2011-07-24 04:58:20 cg Exp $'
  2898     ^ '$Header: /cvs/stx/stx/libbasic2/PhoneticStringUtilities.st,v 1.12 2014-02-25 07:24:31 cg Exp $'
  2899 !
  2899 !
  2900 
  2900 
  2901 version_CVS
  2901 version_CVS
  2902     ^ '$Header: /cvs/stx/stx/libbasic2/PhoneticStringUtilities.st,v 1.11 2011-07-24 04:58:20 cg Exp $'
  2902     ^ '$Header: /cvs/stx/stx/libbasic2/PhoneticStringUtilities.st,v 1.12 2014-02-25 07:24:31 cg Exp $'
  2903 ! !
  2903 ! !
       
  2904