
More Adventures with Fuzzy Tools: Six Reasons Why Fuzzy Lookup is Better than a Normal Lookup

Jared Kuehn  |  
Mar 22, 2017
 
A few months ago, I wrote about a migration I was working on where I incorporated a Fuzzy Lookup into a data matching process between two environments (see ‘Fuzzy Lookup in SSIS: A Simple Approach to Resolving Complex Data Quality Problems’).

Time has passed, that project has finished, and I now find myself in the middle of yet another migration with a need for matching. From the beginning of the initial design process, I recommended we utilize the Fuzzy tools available to handle every matching need for the migration.

That’s right. All of our matching needs. No rounds of exact matching algorithms. We are comparing data sets, identifying how similar entire records are, defining a threshold and trusting the results.

You may think that’s crazy. Exact matches can be trusted to be accurate. Can you find accurate matches with only a partial match?

The answer: yes, you can! You would be surprised at how accurate non-exact matches can be. Using a Fuzzy Lookup, I am finding more matches than I would with a standard Lookup. The records may differ slightly, but they are still matches.

The better question to ask is: Why? Why would I pick a process that (theoretically) does not guarantee accurate results? Why wouldn’t I go with exact matches that I can trust?

Here are six reasons why I would utilize the Fuzzy tools over exact matching algorithms:
  1. Assume users will not enter data accurately. In an ideal world, typos would not happen, and all data entered into an application would be exactly as it should be. In reality, this is a rarity. Application interfaces can be designed to increase the chances that entered data will be accurate and properly formatted, but no matter what, there will be free-form text fields that cannot be checked for data quality (people's names, for example). Oftentimes, these free-form text fields are the ones you need to compare in order to find your matches. Using fuzzy tools, you can identify matches between records even when typos occur, where exact matching would miss them.
  2. Data changes over time, at different times. Say you are comparing two datasets of people, and you are trying to find a person existing in both sets. What if a person’s information gets updated in one set, but not in the other? An exact match would not find a match. A fuzzy lookup could find it.
  3. Exact matching algorithms can be wrong. As I mentioned in point 1, user-entered fields may be inaccurate. If your exact matching criteria are too narrow and you only check a subset of your columns for exact matches, it is possible to make an invalid match. With fuzzy tools, you can compare all available data at once and score how similar records are, decreasing the chances of an invalid match caused by inaccurate data in some of your fields.
  4. More flexible. Suppose you had a project where you had to match multiple, unique datasets to a master dataset over time. To match exactly, you might need to create unique rules that allow each dataset to match to your master set, rebuilding your process every time. With fuzzy tools, you can design a single process that can match any set to your master set. You may have to adjust your similarity thresholds and conform each unique dataset a little to find valid matches, but those should be your only adjustments.
  5. Less effort for faster turnaround. In scenarios like the ones I've worked in, regardless of the approach you take, time must be spent analyzing the matching results and tuning your matching process to produce valid, accurate results. In an exact-match scenario, 'tuning' means identifying your unique cases and building rules to accommodate them. Depending on your data quality, there is no telling how many rules you may need to build before you reach the number of accurate matches you expect. It could take days, even weeks, to build, delaying the point at which the data is ready for analysis. With fuzzy tools, the base architecture takes a few hours to set up. Once that is done, you can begin running your data through the tools, analyzing the results, and tuning with different thresholds and data formatting. You will get your initial matching results sooner with fuzzy tools, and future adjustments will also take less time.
  6. Fuzzy tools will still find exact matches. This shouldn't be surprising, but I think this is a misunderstanding that some people have (I had the same thought when I began learning about fuzzy tools). Fuzzy tools are designed to find matches when only part of the data matches. This doesn't mean they ignore matches that are exact. Records where all columns match exactly receive a Similarity of 1 (or 100%), and are included with the matches.
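To make the idea behind these points concrete, here is a minimal sketch of threshold-based fuzzy matching in Python. It uses `difflib.SequenceMatcher`, a character-based similarity, purely as a stand-in; the SSIS Fuzzy Lookup computes its own token-based similarity internally, so the `fuzzy_lookup` function, the sample names, and the 0.8 threshold below are illustrative assumptions, not the SSIS implementation.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Rough character-based similarity between two strings (0.0 to 1.0).
    Stands in for the Similarity score a fuzzy tool would compute."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def fuzzy_lookup(record, reference, threshold=0.8):
    """Return (best_match, score) from the reference set if the best
    score meets the threshold; otherwise return None (no match)."""
    best, best_score = None, 0.0
    for candidate in reference:
        score = similarity(record, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return (best, best_score) if best_score >= threshold else None

reference = ["Jonathan Smith", "Maria Gonzalez", "Robert Johnson"]

# A typo ("Jonathon") still finds its match, scoring well above the threshold.
print(fuzzy_lookup("Jonathon Smith", reference))
# An exact match is not ignored: it comes back with a score of 1.0 (point 6).
print(fuzzy_lookup("Maria Gonzalez", reference))
# A record with no close counterpart falls below the threshold and returns None.
print(fuzzy_lookup("Completely Different", reference))
```

Tuning the process, as described in point 5, amounts to adjusting the threshold and the normalization (the lowercasing and trimming above) rather than writing new rules per dataset.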
The more I work with Fuzzy tools, the more I am seeing how powerful they can be. I plan to post more blogs soon about how best to use Fuzzy tools in whatever process you are working on.
