PhoneticStringUtilities.st
changeset 3185 9833bbba2050
parent 2580 7ce713ba2618
child 3488 5a69e672d7f8
--- a/PhoneticStringUtilities.st	Tue Feb 25 08:19:54 2014 +0100
+++ b/PhoneticStringUtilities.st	Tue Feb 25 08:24:31 2014 +0100
@@ -739,48 +739,48 @@
 
 documentation
 "
-WARNING: this is the so called 'simplified soundex' algorithm;
-there are more variants like miracode (american soundex) or mysqlSoundex around.
-Be sure to use the correct algorithm, if the generated strings must be compatible
-(otherwise, the differences are probably too small to be noticed as effect)
-
-The following was copied from http://www.civilsolutions.com.au/publications/dedup.htm
-
-SOUNDEX is a phonetic coding algorithm that ignores many of the unreliable
-components of names, but by doing so reports more matches. 
-
-There are some variations around in the literature; 
-the following is called 'simplified soundex', and the rules for coding a name are:
-
-1. The first letter of the name is used in its un-coded form to serve as the prefix
-   character of the code. (The rest of the code is numerical).
-
-2. Thereafter, W and H are ignored entirely.
-
-3. A, E, I, O, U, Y are not assigned a code number, but do serve as 'separators' (see Step 5).
-
-4. Other letters of the name are converted to a numerical equivalent:
-             B, P, F, V              1 
-             C, G, J, K, Q, S, X, Z  2 
-             D, T                    3 
-             L                       4 
-             M, N                    5 
-             R                       6 
-
-5. There are two exceptions: 
-    1. Letters that follow prefix letters which would, if coded, have the same
-       numerical code, are ignored in all cases unless a ''separator'' (see Step 3) precedes them.
-
-    2. The second letter of any pair of consonants having the same code number is likewise ignored, 
-       i.e. unless there is a ''separator'' between them in the name.
-
-6. The final SOUNDEX code consists of the prefix letter plus three numerical characters.
-   Longer codes are truncated to this length, and shorter codes are extended to it by adding zeros.
-
-Note that in another variant, w and h are treated slightly differently.
-This is only relevant if you need to reconstruct the original soundex codes of other programs
-or of the original 1880 US census data.
-
+    WARNING: this is the so called 'simplified soundex' algorithm;
+      there are more variants like miracode (american soundex) or mysqlSoundex around.
+      Be sure to use the correct algorithm, if the generated strings must be compatible
+      (otherwise, the differences are probably too small to have a noticeable effect, but
+      your search results may differ)
+
+    The following was copied from http://www.civilsolutions.com.au/publications/dedup.htm
+
+    SOUNDEX is a phonetic coding algorithm that ignores many of the unreliable
+    components of names, but by doing so reports more matches. 
+
+    There are some variations around in the literature; 
+    the following is called 'simplified soundex', and the rules for coding a name are:
+
+    1. The first letter of the name is used in its un-coded form to serve as the prefix
+       character of the code. (The rest of the code is numerical).
+
+    2. Thereafter, W and H are ignored entirely.
+
+    3. A, E, I, O, U, Y are not assigned a code number, but do serve as 'separators' (see Step 5).
+
+    4. Other letters of the name are converted to a numerical equivalent:
+                 B, P, F, V              1 
+                 C, G, J, K, Q, S, X, Z  2 
+                 D, T                    3 
+                 L                       4 
+                 M, N                    5 
+                 R                       6 
+
+    5. There are two exceptions: 
+        1. Letters that follow prefix letters which would, if coded, have the same
+           numerical code, are ignored in all cases unless a ''separator'' (see Step 3) precedes them.
+
+        2. The second letter of any pair of consonants having the same code number is likewise ignored, 
+           i.e. unless there is a ''separator'' between them in the name.
+
+    6. The final SOUNDEX code consists of the prefix letter plus three numerical characters.
+       Longer codes are truncated to this length, and shorter codes are extended to it by adding zeros.
+
+    Note that in another variant, w and h are treated slightly differently.
+    This is only relevant if you need to reconstruct the original soundex codes of other programs
+    or of the original 1880 US census data.
 "
 ! !
 
@@ -849,11 +849,11 @@
 
 documentation
 "
-MySQL soundex is like american Soundex (i.e. miracode) without the 4 character limitation;
-it also removes vowels first and then removes duplicate codes
-(whereas the soundex code does this in reverse order).
-
-These variations are important if you need the same soundex codes to be generated.
+    MySQL soundex is like american Soundex (i.e. miracode) without the 4 character limitation;
+    it also removes vowels first and then removes duplicate codes
+    (whereas the soundex code does this in reverse order).
+
+    These variations are important if you need the same soundex codes to be generated.
 "
 ! !
 
@@ -884,86 +884,86 @@
 
 documentation
 "
-NYSIIS Algorithm:
-
-1.
-    remove all ''S'' and ''Z'' chars from the end of the surname 
-
-2.
-    transcode initial strings
-        MAC => MC
-        PF => F
-
-3.
-    Transcode trailing strings as follows,
-    
-        IX => IC
-        EX => EC
-        YE,EE,IE => Y
-        NT,ND => D 
-
-4.
-    transcode ''EV'' to ''EF'' if not at start of name
-
-5.
-    use first character of name as first character of key 
-
-6.
-    remove any ''W'' that follows a vowel 
-
-7.
-    replace all vowels with ''A'' 
-
-8.
-    transcode ''GHT'' to ''GT'' 
-
-9.
-    transcode ''DG'' to ''G'' 
-
-10.
-    transcode ''PH'' to ''F'' 
-
-11.
-    if not first character, eliminate all ''H'' preceded or followed by a vowel 
-
-12.
-    change ''KN'' to ''N'', else ''K'' to ''C'' 
-
-13.
-    if not first character, change ''M'' to ''N'' 
-
-14.
-    if not first character, change ''Q'' to ''G'' 
-
-15.
-    transcode ''SH'' to ''S'' 
-
-16.
-    transcode ''SCH'' to ''S'' 
-
-17.
-    transcode ''YW'' to ''Y'' 
-
-18.
-    if not first or last character, change ''Y'' to ''A'' 
-
-19.
-    transcode ''WR'' to ''R'' 
-
-20.
-    if not first character, change ''Z'' to ''S'' 
-
-21.
-    transcode terminal ''AY'' to ''Y'' 
-
-22.
-    remove trailing vowels 
-
-23.
-    collapse all strings of repeated characters 
-
-24.
-    if first char of original surname was a vowel, append it to the code
+    NYSIIS Algorithm:
+
+    1.
+        remove all ''S'' and ''Z'' chars from the end of the surname 
+
+    2.
+        transcode initial strings
+            MAC => MC
+            PF => F
+
+    3.
+        Transcode trailing strings as follows,
+        
+            IX => IC
+            EX => EC
+            YE,EE,IE => Y
+            NT,ND => D 
+
+    4.
+        transcode ''EV'' to ''EF'' if not at start of name
+
+    5.
+        use first character of name as first character of key 
+
+    6.
+        remove any ''W'' that follows a vowel 
+
+    7.
+        replace all vowels with ''A'' 
+
+    8.
+        transcode ''GHT'' to ''GT'' 
+
+    9.
+        transcode ''DG'' to ''G'' 
+
+    10.
+        transcode ''PH'' to ''F'' 
+
+    11.
+        if not first character, eliminate all ''H'' preceded or followed by a vowel 
+
+    12.
+        change ''KN'' to ''N'', else ''K'' to ''C'' 
+
+    13.
+        if not first character, change ''M'' to ''N'' 
+
+    14.
+        if not first character, change ''Q'' to ''G'' 
+
+    15.
+        transcode ''SH'' to ''S'' 
+
+    16.
+        transcode ''SCH'' to ''S'' 
+
+    17.
+        transcode ''YW'' to ''Y'' 
+
+    18.
+        if not first or last character, change ''Y'' to ''A'' 
+
+    19.
+        transcode ''WR'' to ''R'' 
+
+    20.
+        if not first character, change ''Z'' to ''S'' 
+
+    21.
+        transcode terminal ''AY'' to ''Y'' 
+
+    22.
+        remove trailing vowels 
+
+    23.
+        collapse all strings of repeated characters 
+
+    24.
+        if first char of original surname was a vowel, append it to the code
 "
 ! !
 
@@ -1334,10 +1334,10 @@
 
 documentation
 "
-Implementation of the PHONEM algorithm, as described in
-'Georg Wilde and Carsten Meyer, Doppelgaenger gesucht -
-Ein Programm fuer kontextsensitive phonetische Textumwandlung
-ct Magazin fuer Computer & Technik 25/1998'
+    Implementation of the PHONEM algorithm, as described in
+    'Georg Wilde and Carsten Meyer, Doppelgaenger gesucht -
+    Ein Programm fuer kontextsensitive phonetische Textumwandlung
+    ct Magazin fuer Computer & Technik 25/1998'
 "
 ! !
 
@@ -1437,8 +1437,8 @@
 
 documentaion
 "
-The Double Metaphone algorithm:
-see internet
+    The Double Metaphone algorithm:
+    see internet
 "
 ! !
 
@@ -2854,15 +2854,15 @@
 
 documentation
 "
-Miracode (also called American Soundex) is like Soundex with the addition that h and w are 
-discarded if they separate consonants.
-
-These variants may be specifically important because they were used in U.S. National Archives. 
-Most archive data were encoded with Miracode, but there are some entries encoded with 
-Simplified Soundex. 
-
-The HW-rule was documented as a standard in 1910, but actually data of 1880, 1900 and 1910 
-censuses were encoded with mixed methods.
+    Miracode (also called American Soundex) is like Soundex with the addition that h and w are 
+    discarded if they separate consonants.
+
+    These variants may be specifically important because they were used in U.S. National Archives. 
+    Most archive data were encoded with Miracode, but there are some entries encoded with 
+    Simplified Soundex. 
+
+    The HW-rule was documented as a standard in 1910, but actually data of 1880, 1900 and 1910 
+    censuses were encoded with mixed methods.
 "
 ! !
 
@@ -2895,9 +2895,10 @@
 !PhoneticStringUtilities class methodsFor:'documentation'!
 
 version
-    ^ '$Header: /cvs/stx/stx/libbasic2/PhoneticStringUtilities.st,v 1.11 2011-07-24 04:58:20 cg Exp $'
+    ^ '$Header: /cvs/stx/stx/libbasic2/PhoneticStringUtilities.st,v 1.12 2014-02-25 07:24:31 cg Exp $'
 !
 
 version_CVS
-    ^ '$Header: /cvs/stx/stx/libbasic2/PhoneticStringUtilities.st,v 1.11 2011-07-24 04:58:20 cg Exp $'
+    ^ '$Header: /cvs/stx/stx/libbasic2/PhoneticStringUtilities.st,v 1.12 2014-02-25 07:24:31 cg Exp $'
 ! !
+
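
The six 'simplified soundex' rules quoted in the first documentation hunk can be sketched as follows. This is a minimal illustration in Python, not the Smalltalk code of PhoneticStringUtilities; the function name simplified_soundex and the CODES and SEPARATORS tables are hypothetical names chosen for this sketch. It follows the rules as worded in the comment (W and H are skipped without acting as separators), so its output may differ from what the class in this file or other Soundex variants produce.

```python
# Rule 4: consonant groups and their digits.
CODES = {}
for letters, digit in (("BPFV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

# Rule 3: uncoded letters that act as separators.
SEPARATORS = set("AEIOUY")


def simplified_soundex(name):
    """Return a 4-character code following the six rules as documented."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    prefix = letters[0]            # rule 1: first letter is kept uncoded
    last = CODES.get(prefix)       # code of the previous coded letter, if any
    digits = []
    for ch in letters[1:]:
        if ch in "HW":
            continue               # rule 2: ignored, does not separate
        if ch in SEPARATORS:
            last = None            # rule 3: a separator resets the comparison
            continue
        code = CODES[ch]
        if code != last:           # rule 5: drop repeats of the same code
            digits.append(code)
        last = code
    # Rule 6: prefix plus three digits, zero-padded or truncated.
    return (prefix + "".join(digits) + "000")[:4]


print(simplified_soundex("Robert"))   # R163
print(simplified_soundex("Pfister"))  # P236
```

In "Pfister" the F is dropped because it has the same code as the prefix P and no separator stands between them (rule 5.1); "Lee" has no coded consonants after the prefix and is padded to L000 (rule 6).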