
!PhoneticStringUtilities::SoundexStringComparator class methodsFor:'documentation'!

documentation
"
    WARNING: this is the so-called 'simplified soundex' algorithm;
    there are other variants around, such as miracode (american soundex) or mysqlSoundex.
    Be sure to use the correct algorithm if the generated strings must be compatible
    with those of another program (otherwise, the differences are probably too small
    to be noticeable, although your search results will differ).

    The following was copied from http://www.civilsolutions.com.au/publications/dedup.htm

    SOUNDEX is a phonetic coding algorithm that ignores many of the unreliable
    components of names, but by doing so reports more matches.

    There are some variations around in the literature;
    the following is called 'simplified soundex', and the rules for coding a name are:

    1. The first letter of the name is used in its un-coded form to serve as the prefix
       character of the code. (The rest of the code is numerical).

    2. Thereafter, W and H are ignored entirely.

    3. A, E, I, O, U, Y are not assigned a code number, but do serve as 'separators' (see Step 5).

    4. Other letters of the name are converted to a numerical equivalent:
           B, P, F, V                 1
           C, G, J, K, Q, S, X, Z     2
           D, T                       3
           L                          4
           M, N                       5
           R                          6

    5. There are two exceptions:
       1. Letters that follow prefix letters which would, if coded, have the same
          numerical code, are ignored in all cases unless a 'separator' (see Step 3) precedes them.

       2. The second letter of any pair of consonants having the same code number is likewise ignored,
          i.e. unless there is a 'separator' between them in the name.

    6. The final SOUNDEX code consists of the prefix letter plus three numerical characters.
       Longer codes are truncated to this length, and shorter codes are extended to it by adding zeros.

    Notice that in another variant, W and H are treated slightly differently.
    This is only of relevance if you need to reconstruct original soundex codes of other programs
    or of the original 1880 US census data.
"
! !
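
Rules 1 to 6 above can be sketched in a few lines. The following Python is only an illustration of the rules as listed, not the implementation used by SoundexStringComparator. The function name `simplified_soundex` and the `CODES` table are hypothetical, and rule 2 is read literally (W and H are skipped and do not act as separators).

```python
# Letter-to-digit table from rule 4 (hypothetical name, for illustration only).
CODES = {}
for letters, digit in (("BPFV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def simplified_soundex(name):
    """Encode `name` following rules 1-6 as listed in the comment above."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]                 # rule 1: prefix letter, un-coded
    last = CODES.get(letters[0])      # rule 5.1: remember the prefix letter's code
    for ch in letters[1:]:
        if ch in "HW":                # rule 2: ignored, does not separate
            continue
        digit = CODES.get(ch)         # None for the separators A, E, I, O, U, Y
        if digit is not None and digit != last:
            code += digit             # rule 4, unless suppressed by rule 5
        last = digit                  # a separator resets `last` to None
    return (code + "000")[:4]         # rule 6: pad with zeros, truncate to 4
```

With this reading, 'Robert' and 'Rupert' both encode to R163, and 'Pfister' encodes to P236 because the F shares the code of the prefix letter P.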

!PhoneticStringUtilities::SoundexStringComparator methodsFor:'api'!


!PhoneticStringUtilities::NYSIISStringComparator class methodsFor:'documentation'!

documentation
"
    NYSIIS Algorithm:

    1.
        remove all 'S' and 'Z' chars from the end of the surname

    2.
        transcode initial strings
            MAC => MC
            PF => F

    3.
        transcode trailing strings as follows:

            IX => IC
            EX => EC
            YE,EE,IE => Y
            NT,ND => D

    4.
        transcode 'EV' to 'EF' if not at start of name

    5.
        use first character of name as first character of key

    6.
        remove any 'W' that follows a vowel

    7.
        replace all vowels with 'A'

    8.
        transcode 'GHT' to 'GT'

    9.
        transcode 'DG' to 'G'

    10.
        transcode 'PH' to 'F'

    11.
        if not first character, eliminate all 'H' preceded or followed by a vowel

    12.
        change 'KN' to 'N', else 'K' to 'C'

    13.
        if not first character, change 'M' to 'N'

    14.
        if not first character, change 'Q' to 'G'

    15.
        transcode 'SH' to 'S'

    16.
        transcode 'SCH' to 'S'

    17.
        transcode 'YW' to 'Y'

    18.
        if not first or last character, change 'Y' to 'A'

    19.
        transcode 'WR' to 'R'

    20.
        if not first character, change 'Z' to 'S'

    21.
        transcode terminal 'AY' to 'Y'

    22.
        remove trailing vowels

    23.
        collapse all strings of repeated characters

    24.
        if first char of original surname was a vowel, append it to the code
"
! !
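
The opening steps of the list are plain prefix and suffix rewrites. As a minimal sketch, assuming each step is applied once and in the order given, steps 1 to 4 could look like this in Python. `nysiis_preprocess` and the two tables are hypothetical names, the code is not taken from NYSIISStringComparator, and steps 5 to 24 are not covered.

```python
# Rewrite tables for steps 2 and 3 (hypothetical names, for illustration only).
PREFIXES = (("MAC", "MC"), ("PF", "F"))
SUFFIXES = (("IX", "IC"), ("EX", "EC"), ("YE", "Y"), ("EE", "Y"),
            ("IE", "Y"), ("NT", "D"), ("ND", "D"))


def nysiis_preprocess(surname):
    """Apply steps 1-4 of the NYSIIS list above to `surname`."""
    # Work on upper-case ASCII letters only.
    name = "".join(c for c in surname.upper() if "A" <= c <= "Z")

    # Step 1: remove all trailing 'S' and 'Z' characters.
    name = name.rstrip("SZ")

    # Step 2: transcode the initial string (at most one rule applies).
    for old, new in PREFIXES:
        if name.startswith(old):
            name = new + name[len(old):]
            break

    # Step 3: transcode the trailing string (at most one rule applies).
    for old, new in SUFFIXES:
        if name.endswith(old):
            name = name[:-len(old)] + new
            break

    # Step 4: 'EV' -> 'EF', except for an 'EV' at the very start of the name.
    return name[:1] + name[1:].replace("EV", "EF")
```

For example, 'Macintosh' becomes MCINTOSH, 'Stevens' becomes STEFEN, and 'Evans' becomes EVAN, because the leading 'EV' is left alone.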

!PhoneticStringUtilities::NYSIISStringComparator methodsFor:'api'!


!PhoneticStringUtilities::MiracodeStringComparator class methodsFor:'documentation'!

documentation
"
    Miracode (also called American Soundex) is like Soundex with the addition that H and W are
    discarded if they separate consonants.

    These variants are of particular importance because they were used by the U.S. National Archives.
    Most archive data were encoded with Miracode, but there are some entries encoded with
    Simplified Soundex.

    The HW-rule was documented as a standard in 1910, but in practice the data of the 1880, 1900
    and 1910 censuses were encoded with mixed methods.
"
! !
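
The effect of the HW-rule is easiest to see on a name where H or W sits between two consonants with the same code. The sketch below is an illustration in Python under stated assumptions, not the code of MiracodeStringComparator. `soundex_variant` and its `discard_hw` flag are hypothetical names, and the variant without the rule is assumed to treat H and W as separators, in the same way as vowels.

```python
# Standard Soundex consonant groups; the index in the tuple gives digits 1..6.
GROUPS = ("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R")
CODES = {ch: str(i) for i, group in enumerate(GROUPS, start=1) for ch in group}


def soundex_variant(name, discard_hw):
    """Four-character code for `name`.

    discard_hw=True  : H and W are dropped, so the consonants around them
                       count as adjacent (the HW-rule described above).
    discard_hw=False : H and W separate consonants like vowels do
                       (assumed behaviour of the variant without the rule).
    """
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    last = CODES.get(letters[0])
    for ch in letters[1:]:
        if discard_hw and ch in "HW":
            continue                  # skipped without resetting `last`
        digit = CODES.get(ch)         # None for vowels, Y and undiscarded H, W
        if digit is not None and digit != last:
            code += digit
        last = digit
    return (code + "000")[:4]
```

For 'Ashcraft' the S and C share code 2. Discarding the H merges them (A261), whereas keeping the H as a separator codes both (A226). Names without H or W between equal-coded consonants, such as 'Robert', come out the same either way.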

!PhoneticStringUtilities::MiracodeStringComparator methodsFor:'api'!
