
My Memory is a Little Fuzzy: Things I Want to Remember When Setting Up Fuzzy Lookups

Jared Kuehn  |  Jun 08, 2017
In my last post, I explained why I would want to use Fuzzy Lookups for my various matching needs. Now I want to explain the how: How does one set up a Fuzzy Lookup so that it helps solve my data quality problems?

The simple answer is that I use the available SSIS Fuzzy Lookup component in a Data Flow task. There are various documents and blogs that explain the basics of adding a Fuzzy Lookup component to your process, so I will let those speak for themselves. Here I want to cover the things I’ve learned that most other writings don’t cover. Once you have the basic architecture set up, there are several things you should consider, both in your process and in your data.

Here are some tips to jog your memory:
  1. Formatting your datasets consistently will return higher similarity scores. Like the Lookup transformation in SSIS, the Fuzzy tools may be case sensitive. That means that ‘Johann’ and ‘johann’ will not be exact matches. To make more accurate matches, format both sets of data under the same rules (lower case, remove spaces, etc.). If you are comparing text data that may arrive in different formats, be sure to either remove or standardize any formatting that could cause unnecessary differences. Phone numbers are a good example: sometimes they include dashes and parentheses ((123) 456-7890), and other times those characters are omitted (1234567890). You may be able to improve the column similarity with the token delimiter settings, but from what I’ve tried, if one dataset has the delimiter and the other does not, that alone won’t produce an exact match.
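The standardization step in tip 1 happens before the data reaches the Fuzzy Lookup (e.g., in a Derived Column or a staging query). A minimal Python sketch of the idea — the function names here are my own, not part of the SSIS tooling:

```python
import re

def normalize_text(value: str) -> str:
    """Lower-case and strip spaces so case/spacing don't depress similarity."""
    return value.lower().replace(" ", "")

def normalize_phone(value: str) -> str:
    """Keep digits only, so '(123) 456-7890' and '1234567890' compare equal."""
    return re.sub(r"\D", "", value)
```

Apply the same rules to both the source and the reference index; normalizing only one side just moves the mismatch around.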
  2. Overall similarity = Sum(each column’s similarity) / Number of columns being compared…sort of. As I’ve written before, each column you analyze gets its own similarity percentage. If you add those percentages together and divide by the number of compared columns, you get the overall similarity…sometimes. I haven’t yet figured out everything that goes into the overall similarity calculation, but at least this gives you a general idea of what similarity to shoot for when defining your threshold.
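The rough rule of thumb in tip 2 is just an average. Since the component’s exact formula isn’t documented here, treat this only as an estimate for picking a starting threshold:

```python
def approximate_overall_similarity(column_similarities):
    """Average the per-column scores (each in [0, 1]) -- a rough stand-in
    for the overall similarity the Fuzzy Lookup reports."""
    return sum(column_similarities) / len(column_similarities)
```

For example, columns scoring 0.90, 0.80, and 1.00 average to about 0.90, so a threshold near that value would catch this record.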
  3. In Fuzzy tools, NULL will match to NULL. That’s right: the absence of a value matches exactly to the absence of a value, which is different from how SQL Server handles NULL comparisons (under default settings). To get around this, I recommend that you set one of your datasets (most likely your reference index) so that every column has a default value other than NULL. When selecting your default value, make sure it is something that is unlikely to find a match in the other dataset (i.e. using ‘Smith’ as the default in a Last Name column may not be wise).
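Tip 3’s workaround — replacing NULLs with per-column sentinel defaults before matching — could be sketched like this (the sentinel values are illustrative; pick ones unlikely to occur in your data):

```python
# Illustrative sentinel defaults, chosen so they won't fuzzy-match real data.
SENTINELS = {
    "last_name": "~none~",
    "phone": "0000000000",
}

def apply_defaults(row: dict) -> dict:
    """Replace None (NULL) values with a per-column sentinel default,
    so NULL-to-NULL no longer counts as an exact match."""
    return {col: (SENTINELS.get(col, "~none~") if val is None else val)
            for col, val in row.items()}
```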
  4. You can set individual columns to match either Exactly or Fuzzy. It took me a while to find this in the Lookup settings, but I consider it a useful find. In the same way that you receive a similarity score for each column that is compared, you can also define whether a column must match exactly or can match fuzzily, and you can set a similarity threshold on each column as well. One word of caution: if you set a column to Exact, you will only find a match when that column matches exactly; even if every other column matches, that record will not be returned. I would only use this when I am 100% confident in the accuracy of the data in that column, in both the source and the reference index.
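The per-column logic in tip 4 — Exact columns acting as a hard gate, Fuzzy columns checked against their own thresholds — can be sketched as follows. This is my own simplification, not the component’s actual algorithm:

```python
def record_matches(similarities, exact_columns, thresholds):
    """similarities: {column: score in [0, 1]}.
    Exact columns must score exactly 1.0; fuzzy columns must meet
    their per-column threshold (default 0.0 = no minimum)."""
    for col, score in similarities.items():
        if col in exact_columns:
            if score < 1.0:
                return False  # exact gate: one miss rejects the record
        elif score < thresholds.get(col, 0.0):
            return False
    return True
```

Note how the exact gate behaves: a record with a perfect name score but an imperfect exact-column score is still rejected — which is why the column needs to be trustworthy on both sides.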
  5. Only strings (Unicode or non-Unicode) can be evaluated with Fuzzy logic. Fuzzy Lookup looks for similarities in two data sets, so I don’t find it surprising that Fuzzy logic is applied to string columns, which have the highest chance of being incorrect. I do find it surprising that it is applied only to string columns. You can include other data types in a Fuzzy tool, but they will automatically be set to Exact. If you wish to evaluate those columns as Fuzzy, a workaround is to convert the non-string columns into string columns first. I typically do this in my Fuzzy Lookups.
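The conversion workaround in tip 5 (in SSIS you would typically use a Data Conversion or Derived Column transformation) amounts to casting everything to a string before the fuzzy comparison:

```python
def stringify(row: dict) -> dict:
    """Cast non-string values (numbers, dates, etc.) to strings so the
    column becomes eligible for fuzzy rather than exact comparison."""
    return {col: (val if isinstance(val, str) else str(val))
            for col, val in row.items()}
```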
  6. Make your comparison set as wide as possible. The goal with Fuzzy logic is to find accurate matches based on the similarity of entire records, not just exact matches on a narrow set of fields. The more fields you add to your comparison, the better your chance of finding an accurate match even when one or two fields don’t line up. A wider set also reduces the chance of a false positive, where incorrectly entered data matches a record it shouldn’t.
  7. Check your results with your eyes. Regardless of how you handle your matching process, it is tedious. It would be nice to have a fully automated process, but the truth of the matter is that someone needs to review the matches and judge how good they are. The benefit I’ve found with Fuzzy Lookups is that you spend less time altering your process and more time reviewing results. I usually go through multiple rounds of adjusting the similarity threshold before I settle on a value. I’ve seen some writers recommend setting two thresholds: a higher threshold whose matches are trusted as accurate, and a lower threshold of ‘potential’ matches that are then reviewed and decided on individually. I haven’t used this approach yet, but given my experiences so far, I intend to incorporate it in future projects.
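The two-threshold idea from tip 7 could be sketched as a simple triage step. The 0.90 and 0.75 cutoffs here are placeholders; your own values would come from reviewing results:

```python
def triage(overall_similarity, trusted=0.90, review=0.75):
    """Classify a candidate match: auto-accept above the trusted
    threshold, queue the middle band for human review, reject the rest."""
    if overall_similarity >= trusted:
        return "accept"
    if overall_similarity >= review:
        return "review"
    return "reject"
```

The payoff is that human attention is spent only on the middle band, instead of on every match above a single threshold.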
I hope these tips help you out on your implementations of Fuzzy tools. Stay Fuzzy!