Back to List

More Adventures with Fuzzy Tools: Six Reasons Why Fuzzy Lookup is Better than a Normal Lookup

Jared Kuehn Jared Kuehn  |  
Mar 22, 2017
 
A few months ago, I wrote about a migration I was working on where I incorporated a Fuzzy Lookup into a data matching process between two environments (see ‘Fuzzy Lookup in SSIS: A Simple Approach to Resolving Complex Data Quality Problems’).

Time has passed, that project has finished, and I now find myself in the middle of yet another migration with a need for matching. From the beginning of the initial design process, I recommended we utilize the Fuzzy tools available to handle every matching need for the migration.

That’s right. All of our matching needs. No rounds of exact matching algorithms. We are comparing data sets, identifying how similar entire records are, defining a threshold and trusting the results.

You may think that’s crazy. Exact matches can be trusted to be accurate. Can you find accurate matches with only a partial match?

The answer: Yes you can! You would be surprised at how accurate non-exact matches can be. Using a Fuzzy Lookup, I am finding more matches than I would with a normal Lookup. The data may differ slightly, but they are still matches.

The better question to ask is: Why? Why would I pick a process that (theoretically) does not guarantee accurate results? Why wouldn’t I go with exact matches that I can trust?

Here are six reasons why I would utilize the Fuzzy tools over exact matching algorithms:
  1. Assume users will not enter data accurately. In an ideal world, typos would not happen, and all data entered into an application would be exactly as it should be. In reality, this is a rarity. Applications interfaces can be designed to increase the chances that entered data will be accurate and formatted, but no matter what, there will be text fields that need to be populated and can’t be checked for data quality (People’s names, for example). Oftentimes, these free-form text fields are the ones you need to compare in order to find your matches. Using fuzzy tools, you can identify matches between records even when typos occur, where exact matches would miss the match.
  2. Data changes over time, at different times. Say you are comparing two datasets of people, and you are trying to find a person existing in both sets. What if a person’s information gets updated in one set, but not in the other? An exact match would not find a match. A fuzzy lookup could find it.
  3. Exact matching algorithms can be wrong. As I mentioned in point 1, user-entered fields may be inaccurate. If your exact matching criteria is too narrow, it is possible to make an invalid match if you only check a subset of your data for exact matches. With fuzzy tools, you can check all available data at once and define similar matches, decreasing the chances of finding an invalid match due to inaccurate data in some of your fields.
  4. More flexible. Suppose you had a project where you had to match multiple, unique datasets to a master data set over time. For each unique dataset, if you wanted to match exactly, you may need to create unique rules that allows each one to match to your master set, rebuilding your process every time. With Fuzzy tools, you can design a single process that can match any set to your master set. You may have to adjust your similarity thresholds and conform unique datasets a little to find valid matches, but those should be your only adjustments. 
  5. Less effort for faster turnaround. In scenarios like the ones I’ve worked in, regardless of the approach you take, time must be spent analyzing the matching results and tuning your matching process to produce valid, accurate results. In an exact-match scenario, ‘tuning’ involves determining your unique cases and building rules to accommodate the cases. Depending on your data quality, there is no telling how many rules you may need to build to get the number of accurate matches you are expecting. It could take days, even weeks to build, with development time increasing the time it takes to get data to a point of analysis. Using fuzzy tools, the base architecture takes a few hours to set up. Once that is done, you can begin running your data through the tools, analyzing the results, and tuning the tools with different thresholds and formatting your data. You will get your initial matching results sooner with fuzzy tools, and making future adjustments will also take less time.
  6. Fuzzy tools will still find exact matches. This shouldn’t be surprising, but I think this is a misunderstanding that some people have (I had the same thought when I began learning about fuzzy tools). Fuzzy tools are designed to find matches when only part of the data matches. This doesn’t mean it ignores matches that are exact. Records where all columns match exactly receive a Similarity of 1 (or 100%), and are included with the matches.
The more I work with Fuzzy tools, the more I am seeing how powerful they can be. I plan to post more blogs soon about how best to use Fuzzy tools in whatever process you are working on.
 

 

Love our Blogs?

Sign up to get notified of new Skyline posts.

 

Comments
Blog post currently doesn't have any comments.
 Security code

Related Content


Blog Article
More Adventures with Fuzzy Tools: Six Reasons Why Fuzzy Lookup is Better than a Normal Lookup
Jared KuehnJared Kuehn  |  
Mar 22, 2017
A few months ago, I wrote about a migration I was working on where I incorporated a Fuzzy Lookup into a data matching process between two environments (see ‘Fuzzy Lookup in SSIS: A Simple Approach to Resolving Complex Data Quality Problems’). Time has passed, that project has...
Blog Article
A First Look at Power BI Desktop Reports in SSRS 2016
Cory CundyCory Cundy  |  
Nov 22, 2016
With Microsoft announcing a technical preview of Power BI reports in SQL Server Reporting Services (SSRS), I was interested in giving it a try to see how it compared to the Power BI service. The preview is available through a pre-created virtual machine in Azure that is set up with the proper...
Blog Article
PASS Summit 2016 Takeaways
Cory CundyCory Cundy  |  
Nov 15, 2016
I recently had the privilege of attending PASS Summit for the first time. I found it to be a very educational experience filled with great sessions and the opportunity to network with highly technical subject matter experts. I walked away from the conference with numerous takeaways, but there...
Blog Article
Fuzzy Lookup in SSIS: A Simple Approach to Resolving Complex Data Quality Problems
Jared KuehnJared Kuehn  |  
Oct 25, 2016
Have you ever had a data accuracy issue? A situation where you have to build complex logic to compare data between two systems because names are misspelled, or data is missing in your matching columns?   I ran into this recently with a project migrating data between two environments. We...
Blog Article
SSIS Project Deployment Error – Changing Protection Level Failed
Scott HietpasScott Hietpas  |  
Oct 14, 2016
Overview Starting with SQL Server 2012, SQL Server Integration Services (SSIS) introduced support for the Project Deployment model.  This enables all objects within a project to be deployed together rather than package-level deployment.   When working with SSIS package using the...