Soundex for String Matching

April 30, 2011
By William Sharp

Soundex is a useful function for performing data matching

While you can use a Soundex function in the process of identifying potential duplicate strings, I don’t recommend it. Here’s why:

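  • The first letter is kept as-is, and the remaining consonants are encoded as digits
  • Vowels (along with H, W, and Y) are not encoded unless they are the first letter
  • Adjacent consonants that share a digit are coded only once, and the code stops after three digits, so later consonants are ignored
  • Similar-sounding consonants share the same digit
  • C, G, J, K, Q, S, X, and Z are all encoded with the same digit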

To illustrate the impact of this type of encoding, let’s look at an example of Soundex codes for variations of my first name, William.
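As a rough sketch of how such codes come about, here is a minimal Python version of the encoding. It assumes the standard American Soundex rules (first letter plus three digits, with H and W not separating duplicate codes); the function name soundex and the list of name variations are illustrative assumptions, not the tool or data used for this post.

```python
# Digit groups assumed from standard American Soundex.
GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
CODES = {letter: digit for digit, letters in GROUPS.items() for letter in letters}


def soundex(name: str) -> str:
    """Return a four-character Soundex code, or "" if the name has no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")  # the first letter's digit suppresses an immediate repeat
    for c in letters[1:]:
        code = CODES.get(c, "")  # vowels, H, W, and Y have no digit
        if code and code != prev:
            digits.append(code)
        if c not in "HW":  # H and W do not break a run of equal digits
            prev = code
    # Pad with zeros, then truncate to one letter plus three digits.
    return (first + "".join(digits) + "000")[:4]


# Hypothetical variations of "William", chosen for illustration.
for variant in ["William", "Will", "Willy", "Bill", "Billy"]:
    print(f"{variant:8} {soundex(variant)}")
# William  W450
# Will     W400
# Willy    W400
# Bill     B400
# Billy    B400
```

Because the first letter is kept verbatim, any variant that starts with a different letter can never share a code with the original, no matter how similar the rest of the name sounds.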

As you can see from the brief example above, Soundex codes fall short of matching like strings. One of my biggest issues with Soundex can be illustrated by comparing the typical nicknames for William. Only Billy and Bill are similarly coded, while Will is not coded similarly to Bill or William.

I plan to dig deeper into Soundex functions and their applicability in a future blog post. In the meantime, I wanted to get this observation out there for public consumption.


Thanks for taking the time to visit the weblog!

William Sharp

sharp@thedataqualitychronicle.org

4 Responses to Soundex for String Matching

  1. Steve Sarsfield on April 30, 2011 at 2:13 pm

    Yes, I agree. On top of that, Soundex has since been improved upon by algorithms such as Metaphone and Double Metaphone.

    • William Sharp on May 1, 2011 at 11:13 am

      I think it is important that people realize the real limitations of Soundex. It is embedded in a lot of technologies, specifically Microsoft products like Dynamics CRM and SQL Server.

  2. Justin Ellings on April 30, 2011 at 4:39 pm

    Soundex has been pretty much deprecated in favor of newer algorithms such as double-metaphone. At Aware Research we often use multiple algorithms for phonetic similarity and then run a second pass to select and rank the most probable match(es).

    • William Sharp on May 1, 2011 at 11:15 am

      Two-pass validation is often the most thorough process. Thanks for stopping by and commenting, Justin.
