--- a/PhoneticStringUtilities.st Tue Feb 25 08:19:54 2014 +0100
+++ b/PhoneticStringUtilities.st Tue Feb 25 08:24:31 2014 +0100
@@ -739,48 +739,48 @@
documentation
"
-WARNING: this is the so called 'simplified soundex' algorithm;
-there are more variants like miracode (american soundex) or mysqlSoundex around.
-Be sure to use the correct algorithm, if the generated strings must be compatible
-(otherwise, the differences are probably too small to be noticed as effect)
-
-The following was copied from http://www.civilsolutions.com.au/publications/dedup.htm
-
-SOUNDEX is a phonetic coding algorithm that ignores many of the unreliable
-components of names, but by doing so reports more matches.
-
-There are some variations around in the literature;
-the following is called 'simplified soundex', and the rules for coding a name are:
-
-1. The first letter of the name is used in its un-coded form to serve as the prefix
- character of the code. (The rest of the code is numerical).
-
-2. Thereafter, W and H are ignored entirely.
-
-3. A, E, I, 0, U, Y are not assigned a code number, but do serve as 'separators' (see Step 5).
-
-4. Other letters of the name are converted to a numerical equivalent:
- B, P, F, V 1
- C, G, J, K, Q, S, X, Z 2
- D, T 3
- L 4
- M, N 5
- R 6
-
-5. There are two exceptions:
- 1. Letters that follow prefix letters which would, if coded, have the same
- numerical code, are ignored in all cases unless a ''separator'' (see Step 3) precedes them.
-
- 2. The second letter of any pair of consonants having the same code number is likewise ignored,
- i.e. unless there is a ''separator'' between them in the name.
-
-6. The final SOUNDEX code consists of the prefix letter plus three numerical characters.
- Longer codes are truncated to this length, and shorter codes are extended to it by adding zeros.
-
-Notice, that in another variant, w and h are treated slightly differently.
-This is only of relevance, if you need to reconstruct original soundex codes of other programs
-or for the original 1880 us census data.
-
+ WARNING: this is the so-called 'simplified soundex' algorithm;
+ other variants, such as miracode (American soundex) or mysqlSoundex, also exist.
+ Be sure to use the correct algorithm if the generated strings must be compatible
+ with another implementation (otherwise, the differences are probably too small to
+ be noticed, although your search results may differ)
+
+ The following was copied from http://www.civilsolutions.com.au/publications/dedup.htm
+
+ SOUNDEX is a phonetic coding algorithm that ignores many of the unreliable
+ components of names, but by doing so reports more matches.
+
+ There are some variations around in the literature;
+ the following is called 'simplified soundex', and the rules for coding a name are:
+
+ 1. The first letter of the name is used in its un-coded form to serve as the prefix
+ character of the code. (The rest of the code is numerical).
+
+ 2. Thereafter, W and H are ignored entirely.
+
+ 3. A, E, I, O, U, Y are not assigned a code number, but do serve as 'separators' (see Step 5).
+
+ 4. Other letters of the name are converted to a numerical equivalent:
+ B, P, F, V => 1
+ C, G, J, K, Q, S, X, Z => 2
+ D, T => 3
+ L => 4
+ M, N => 5
+ R => 6
+
+ 5. There are two exceptions:
+ 1. Letters that follow the prefix letter and would, if coded, have the same
+ numerical code as the prefix are ignored, unless a ''separator'' (see Step 3) precedes them.
+
+ 2. The second letter of any pair of consonants having the same code number is likewise ignored,
+ unless there is a ''separator'' between them in the name.
+
+ 6. The final SOUNDEX code consists of the prefix letter plus three numerical characters.
+ Longer codes are truncated to this length, and shorter codes are extended to it by adding zeros.
+
+ Note that in another variant, W and H are treated slightly differently.
+ This is only relevant if you need to reproduce the soundex codes of other programs
+ or of the original 1880 US census data.
"
! !
@@ -849,11 +849,11 @@
documentation
"
-MySQL soundex is like american Soundex (i.e. miracode) without the 4 character limitation,
-and also removing vokals first, then removing duplicate codes
-(whereas the soundex code does this in reverse order).
-
-These variations are important, if you need the ame soundex codes to be generated.
+ MySQL soundex is like American Soundex (i.e. miracode) without the 4-character limit;
+ it also removes vowels first and then removes duplicate codes
+ (whereas the soundex code does this in the reverse order).
+
+ These variations matter if you need the same soundex codes to be generated.
"
! !
@@ -884,86 +884,86 @@
documentation
"
-NYSIIS Algorithm:
-
-1.
- remove all ''S'' and ''Z'' chars from the end of the surname
-
-2.
- transcode initial strings
- MAC => MC
- PF => F
-
-3.
- Transcode trailing strings as follows,
-
- IX => IC
- EX => EC
- YE,EE,IE => Y
- NT,ND => D
-
-4.
- transcode ''EV'' to ''EF'' if not at start of name
-
-5.
- use first character of name as first character of key
-
-6.
- remove any ''W'' that follows a vowel
-
-7.
- replace all vowels with ''A''
-
-8.
- transcode ''GHT'' to ''GT''
-
-9.
- transcode ''DG'' to ''G''
-
-10.
- transcode ''PH'' to ''F''
-
-11.
- if not first character, eliminate all ''H'' preceded or followed by a vowel
-
-12.
- change ''KN'' to ''N'', else ''K'' to ''C''
-
-13.
- if not first character, change ''M'' to ''N''
-
-14.
- if not first character, change ''Q'' to ''G''
-
-15.
- transcode ''SH'' to ''S''
-
-16.
- transcode ''SCH'' to ''S''
-
-17.
- transcode ''YW'' to ''Y''
-
-18.
- if not first or last character, change ''Y'' to ''A''
-
-19.
- transcode ''WR'' to ''R''
-
-20.
- if not first character, change ''Z'' to ''S''
-
-21.
- transcode terminal ''AY'' to ''Y''
-
-22.
- remove traling vowels
-
-23.
- collapse all strings of repeated characters
-
-24.
- if first char of original surname was a vowel, append it to the code
+ NYSIIS Algorithm:
+
+ 1.
+ remove all ''S'' and ''Z'' chars from the end of the surname
+
+ 2.
+ transcode initial strings
+ MAC => MC
+ PF => F
+
+ 3.
+ transcode trailing strings
+
+ IX => IC
+ EX => EC
+ YE,EE,IE => Y
+ NT,ND => D
+
+ 4.
+ transcode ''EV'' to ''EF'' if not at start of name
+
+ 5.
+ use first character of name as first character of key
+
+ 6.
+ remove any ''W'' that follows a vowel
+
+ 7.
+ replace all vowels with ''A''
+
+ 8.
+ transcode ''GHT'' to ''GT''
+
+ 9.
+ transcode ''DG'' to ''G''
+
+ 10.
+ transcode ''PH'' to ''F''
+
+ 11.
+ if not first character, eliminate all ''H'' preceded or followed by a vowel
+
+ 12.
+ change ''KN'' to ''N'', else ''K'' to ''C''
+
+ 13.
+ if not first character, change ''M'' to ''N''
+
+ 14.
+ if not first character, change ''Q'' to ''G''
+
+ 15.
+ transcode ''SH'' to ''S''
+
+ 16.
+ transcode ''SCH'' to ''S''
+
+ 17.
+ transcode ''YW'' to ''Y''
+
+ 18.
+ if not first or last character, change ''Y'' to ''A''
+
+ 19.
+ transcode ''WR'' to ''R''
+
+ 20.
+ if not first character, change ''Z'' to ''S''
+
+ 21.
+ transcode terminal ''AY'' to ''Y''
+
+ 22.
+ remove trailing vowels
+
+ 23.
+ collapse all strings of repeated characters
+
+ 24.
+ if first char of original surname was a vowel, append it to the code
"
! !
@@ -1334,10 +1334,10 @@
documentation
"
-Implementation of the PHONEM algorithm, as described in
-'Georg Wilde and Carsten Meyer, Doppelgaenger gesucht -
-Ein Programm fuer kontextsensitive phonetische Textumwandlung
-ct Magazin fuer Computer & Technik 25/1998'
+ Implementation of the PHONEM algorithm, as described in
+ 'Georg Wilde and Carsten Meyer, Doppelgaenger gesucht -
+ Ein Programm fuer kontextsensitive phonetische Textumwandlung
+ ct Magazin fuer Computer & Technik 25/1998'
"
! !
@@ -1437,8 +1437,8 @@
documentaion
"
-The Double Metaphone algorithm:
-see internet
+ The Double Metaphone algorithm:
+ see Lawrence Philips, 'The Double Metaphone Search Algorithm', C/C++ Users Journal, June 2000
"
! !
@@ -2854,15 +2854,15 @@
documentation
"
-Miracode (also called American Soundex) is like Soundex with the addition that h and w are
-discarded if they separate consonants.
-
-These variants may be specifically important because they were used in U.S. National Archives.
-Most archive data were encoded with Miracode, but there are some entries encoded with
-Simplified Soundex.
-
-The HW-rule was documented as a standard in 1910, but actually data of 1880, 1900 and 1910
-censuses were encoded with mixed methods.
+ Miracode (also called American Soundex) is like Soundex with the addition that h and w are
+ discarded if they separate consonants.
+
+ These variants matter in particular because they were used by the U.S. National Archives.
+ Most archive data were encoded with Miracode, but there are some entries encoded with
+ Simplified Soundex.
+
+ The HW-rule was documented as a standard in 1910, but in practice the data of the 1880, 1900
+ and 1910 censuses were encoded with a mix of both methods.
"
! !
@@ -2895,9 +2895,10 @@
!PhoneticStringUtilities class methodsFor:'documentation'!
version
- ^ '$Header: /cvs/stx/stx/libbasic2/PhoneticStringUtilities.st,v 1.11 2011-07-24 04:58:20 cg Exp $'
+ ^ '$Header: /cvs/stx/stx/libbasic2/PhoneticStringUtilities.st,v 1.12 2014-02-25 07:24:31 cg Exp $'
!
version_CVS
- ^ '$Header: /cvs/stx/stx/libbasic2/PhoneticStringUtilities.st,v 1.11 2011-07-24 04:58:20 cg Exp $'
+ ^ '$Header: /cvs/stx/stx/libbasic2/PhoneticStringUtilities.st,v 1.12 2014-02-25 07:24:31 cg Exp $'
! !
+
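The six 'simplified soundex' steps in the first documentation comment can be sketched as follows. This is a minimal Python illustration, not the Smalltalk code in PhoneticStringUtilities, and the function name simplified_soundex is hypothetical. It assumes step 2 means that W and H never act as separators, and that characters outside A-Z are dropped before coding.

```python
# Step 4: consonant groups and their numerical codes.
CODES = {}
for letters, digit in (
    ("BPFV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
):
    for ch in letters:
        CODES[ch] = digit

# Step 3: letters that get no code but separate equal codes.
SEPARATORS = set("AEIOUY")


def simplified_soundex(name):
    """Return the 4-character code for name, or '' if it has no A-Z letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    prefix = letters[0]              # step 1: first letter is kept un-coded
    digits = []
    last = CODES.get(prefix)         # code of the previous coded letter, if any
    for ch in letters[1:]:
        if ch in "HW":
            continue                 # step 2: ignored, does not reset 'last'
        if ch in SEPARATORS:
            last = None              # step 3: a separator allows a repeat
            continue
        code = CODES[ch]
        if code != last:             # step 5: skip repeats of the same code
            digits.append(code)
        last = code
    # step 6: prefix plus exactly three digits, zero-padded or truncated
    return (prefix + "".join(digits) + "000")[:4]
```

Under these assumptions, simplified_soundex("Robert") and simplified_soundex("Rupert") both give "R163", and simplified_soundex("Pfister") gives "P236" because the F shares the prefix letter's code and is dropped.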