Soundex is a useful function for performing data matching
While you can use a Soundex function in the process of identifying potential duplicate strings, I don’t recommend it. Here’s why …
- The algorithm encodes consonants
- A vowel is not encoded; it is kept only when it is the first letter
- Adjacent consonants that share a digit are coded once, and only the first three digits after the initial letter are kept
- Similar sounding consonants share the same digit
- C,G,J,K,Q,S,X,Z are all encoded with the same digit
To illustrate the impact of this type of encoding, let’s look at an example of Soundex codes for variations of my first name, William.
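As a rough sketch of the rules listed above, the Python function below implements standard American Soundex. That variant is an assumption on my part: the Soundex built into a given database or CRM product may differ in edge cases such as how H and W are handled. The names `soundex` and `CODES` are illustrative and do not come from any particular library.

```python
# Digit groups used by standard American Soundex.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Return the 4-character American Soundex code for a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    # Start from the first letter's own code so an immediate repeat is skipped.
    prev = CODES.get(first, "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters that share a code
        code = CODES.get(c, "")  # vowels and Y have no code
        if code and code != prev:
            digits.append(code)
        prev = code
    # Keep the first letter plus three digits, padding with zeros.
    return (first + "".join(digits) + "000")[:4]


for name in ("William", "Will", "Bill", "Billy"):
    print(name, soundex(name))
# William W450
# Will W400
# Bill B400
# Billy B400
```

Because the first letter is kept as-is, any nickname that changes the initial letter can never share a code with the full name, and the padding with zeros makes short names like Will and Bill differ from William in the third position as well.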
As you can see from the brief example above, Soundex codes fall short of matching like strings. One of my biggest issues with Soundex can be illustrated by comparing the typical nicknames for William. Only Billy and Bill are coded alike, while Will is not coded the same as Bill or William.
I plan to dig deeper into Soundex functions and their applicability in a future blog post. In the meantime, I wanted to get this observation of mine out there for public consumption.
Thanks for taking the time to visit the weblog!
William Sharp
Yes, I agree. On top of that, Soundex is an algorithm that has since been improved upon by Metaphone and Double Metaphone.
I think it is important that people realize the real limitations of Soundex. It is embedded in a lot of technologies, specifically Microsoft products like Dynamics CRM and SQL Server.
Soundex has been pretty much deprecated in favor of newer algorithms such as Double Metaphone. At Aware Research we often use multiple algorithms for phonetic similarity and then run a second pass to select and rank the most probable match(es).
Two-pass validation is often the most thorough process. Thanks for stopping by and commenting, Justin.