Soundex is a useful function for performing data matching
While you can use a Soundex function in the process of identifying potential duplicate strings, I don’t recommend it. Here’s why …
- The algorithm encodes consonants
- Vowels will not be encoded unless it is the first letter
- Consonants to the right of a vowel are not coded
- Similar sounding consonants share the same digit
- C,G,J,K,Q,S,X,Z are all encoded with the same digit
To illustrate the impact of this type of encoding let’s look at an example of soundex codes for deviations of my first name, William.
As you can see from the brief example above, Soundex codes fall short of matching like strings. One of my biggest issues with Soundex can be illustrated in the comparison of the typical nicknames for William. Only Billy and Bill are similarly coded, while Will is not coded similar to Bill or William.
I plan to dig deeper into Soundex functions and their applicability in a future blog post. In the meantime, I wanted to get this observation of mine out there for public consumption.
Thanks for taking the time to visit the weblog!