Back to List

More Adventures with Fuzzy Tools: Six Reasons Why Fuzzy Lookup is Better than a Normal Lookup

Jared Kuehn Jared Kuehn  |  
Mar 22, 2017
 
A few months ago, I wrote about a migration I was working on where I incorporated a Fuzzy Lookup into a data matching process between two environments (see ‘Fuzzy Lookup in SSIS: A Simple Approach to Resolving Complex Data Quality Problems’).

Time has passed, that project has finished, and I now find myself in the middle of yet another migration with a need for matching. From the beginning of the initial design process, I recommended we utilize the Fuzzy tools available to handle every matching need for the migration.

That’s right. All of our matching needs. No rounds of exact matching algorithms. We are comparing data sets, identifying how similar entire records are, defining a threshold and trusting the results.

You may think that’s crazy. Exact matches can be trusted to be accurate. Can you find accurate matches with only a partial match?

The answer: Yes you can! You would be surprised at how accurate non-exact matches can be. Using a Fuzzy Lookup, I am finding more matches than I would with a normal Lookup. The data may differ slightly, but they are still matches.

The better question to ask is: Why? Why would I pick a process that (theoretically) does not guarantee accurate results? Why wouldn’t I go with exact matches that I can trust?

Here are six reasons why I would utilize the Fuzzy tools over exact matching algorithms:
  1. Assume users will not enter data accurately. In an ideal world, typos would not happen, and all data entered into an application would be exactly as it should be. In reality, this is a rarity. Applications interfaces can be designed to increase the chances that entered data will be accurate and formatted, but no matter what, there will be text fields that need to be populated and can’t be checked for data quality (People’s names, for example). Oftentimes, these free-form text fields are the ones you need to compare in order to find your matches. Using fuzzy tools, you can identify matches between records even when typos occur, where exact matches would miss the match.
  2. Data changes over time, at different times. Say you are comparing two datasets of people, and you are trying to find a person existing in both sets. What if a person’s information gets updated in one set, but not in the other? An exact match would not find a match. A fuzzy lookup could find it.
  3. Exact matching algorithms can be wrong. As I mentioned in point 1, user-entered fields may be inaccurate. If your exact matching criteria is too narrow, it is possible to make an invalid match if you only check a subset of your data for exact matches. With fuzzy tools, you can check all available data at once and define similar matches, decreasing the chances of finding an invalid match due to inaccurate data in some of your fields.
  4. More flexible. Suppose you had a project where you had to match multiple, unique datasets to a master data set over time. For each unique dataset, if you wanted to match exactly, you may need to create unique rules that allows each one to match to your master set, rebuilding your process every time. With Fuzzy tools, you can design a single process that can match any set to your master set. You may have to adjust your similarity thresholds and conform unique datasets a little to find valid matches, but those should be your only adjustments. 
  5. Less effort for faster turnaround. In scenarios like the ones I’ve worked in, regardless of the approach you take, time must be spent analyzing the matching results and tuning your matching process to produce valid, accurate results. In an exact-match scenario, ‘tuning’ involves determining your unique cases and building rules to accommodate the cases. Depending on your data quality, there is no telling how many rules you may need to build to get the number of accurate matches you are expecting. It could take days, even weeks to build, with development time increasing the time it takes to get data to a point of analysis. Using fuzzy tools, the base architecture takes a few hours to set up. Once that is done, you can begin running your data through the tools, analyzing the results, and tuning the tools with different thresholds and formatting your data. You will get your initial matching results sooner with fuzzy tools, and making future adjustments will also take less time.
  6. Fuzzy tools will still find exact matches. This shouldn’t be surprising, but I think this is a misunderstanding that some people have (I had the same thought when I began learning about fuzzy tools). Fuzzy tools are designed to find matches when only part of the data matches. This doesn’t mean it ignores matches that are exact. Records where all columns match exactly receive a Similarity of 1 (or 100%), and are included with the matches.
The more I work with Fuzzy tools, the more I am seeing how powerful they can be. I plan to post more blogs soon about how best to use Fuzzy tools in whatever process you are working on.
 

 

Love our Blogs?

Sign up to get notified of new Skyline posts.

 

Comments
Blog post currently doesn't have any comments.
 Security code

Related Content


Blog Article
Haven't I Seen You Before? Fuzzy Grouping Your Datasets for Data Deduplication
Jared KuehnJared Kuehn  |  
Jun 20, 2017
If you’ve been reading my earlier posts regarding Fuzzy logic (See the end of this blog for links), you would know that I am usually talking about Fuzzy Lookups, taking two sets of data and comparing them to find similarities. Did you know that you can perform the same analysis on a single...
Blog Article
My Memory is a Little Fuzzy: Things I Want to Remember When Setting Up Fuzzy Lookups
Jared KuehnJared Kuehn  |  
Jun 08, 2017
In my last post, I explained why I would want to use Fuzzy Lookups for my various matching needs. Now I want to explain the how: How does one set up a Fuzzy Lookup so that it helps solve my data quality problems? The simple answer is that I use the available SSIS Fuzzy Lookup component in a Data...
Blog Article
Three Steps to Implement AD Security for an On-Premises Data Gateway
Eric SaltzmannEric Saltzmann  |  
Apr 28, 2017
The idea to implement AD groups was first realized when trying to dynamically populate O365 group membership, that can be done with PowerShell. The benefit of that knowledge lead me to thinking how I could reverse the process and extract the list of members in an AD group. A short discussion with...
Blog Article
More Adventures with Fuzzy Tools: Six Reasons Why Fuzzy Lookup is Better than a Normal Lookup
Jared KuehnJared Kuehn  |  
Mar 22, 2017
A few months ago, I wrote about a migration I was working on where I incorporated a Fuzzy Lookup into a data matching process between two environments (see ‘Fuzzy Lookup in SSIS: A Simple Approach to Resolving Complex Data Quality Problems’). Time has passed, that project has...
Blog Article
A First Look at Power BI Desktop Reports in SSRS 2016
Cory CundyCory Cundy  |  
Nov 22, 2016
With Microsoft announcing a technical preview of Power BI reports in SQL Server Reporting Services (SSRS), I was interested in giving it a try to see how it compared to the Power BI service. The preview is available through a pre-created virtual machine in Azure that is set up with the proper...