
Data Lake vs Data Warehouse: Avoiding the Data Swamp

Scott Hietpas | Jul 16, 2019
 
In this blog series, Scott Hietpas, a principal consultant with Skyline Technologies’ data team, and Matt Pluster, the Data Analytics and Data Platform director, respond to some common questions on data warehouses and data lakes. Previous installments covered pros and cons of each solution, and the advantages of a hybrid configuration. For a full overview on this topic, check out the original Data Lake vs Data Warehouse webinar.
 

What is a data swamp?

In a previous blog, we talked about how one of the benefits of a data lake is being able to explore a large variety of data. We can land whatever data we want in this structure, and there aren't any rules for how to store it.
 
Just as you may have your own way of organizing folders on your computer, the data lake offers different options. Because you can put any kind of data into the data lake, you can create a series of containers, folders, sub-folders, and so on. However, getting that data out again requires some guidance based on how you structure and store it. Without a good plan, you get a data swamp, which basically means you have stuff all over the place (maybe different versions, or in different partitions), and it can be difficult to find anything.
 
A lot of thought can go into how you logically organize your data. If you don't put some upfront thought into that, you end up with files everywhere and no organization system. And that starts to diminish the value of your data lake.
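To make the idea of a folder plan concrete, here is a minimal sketch of one possible naming convention. The zone names, the zone/source/dataset/date layout, and the `lake_path` helper are all assumptions for illustration, not a prescribed standard; the point is only that every file lands somewhere predictable.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical zone names; choose whatever fits your organization.
ZONES = {"raw", "curated"}


def lake_path(zone: str, source: str, dataset: str, day: date, filename: str) -> PurePosixPath:
    """Build a path shaped like zone/source/dataset/year=YYYY/month=MM/day=DD/filename."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return PurePosixPath(
        zone,
        source,
        dataset,
        f"year={day:%Y}",
        f"month={day:%m}",
        f"day={day:%d}",
        filename,
    )


# Example: where a day's worth of (hypothetical) thermostat readings would land.
print(lake_path("raw", "iot", "thermostat_readings", date(2019, 7, 16), "readings.json"))
```

Routing every write through one helper like this means the convention lives in a single place, which also makes it easier to change when you refactor.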
 

How do you avoid the data swamp?

Other than just having a plan, what are some steps to take to prevent ending up with a data swamp? One of the first ways to avoid a data swamp is to start small. Don’t try to take on the world in your initial planning process. You should absolutely have an initial plan for what your structure should be, but it’s also important to set aside time for refactoring. As soon as you introduce additional elements and requirements, your structure is going to change.
 
Don’t expect to get it perfect right away. Plan to be agile, knowing you will want to refactor it along the way. Even if you initially get it 100% right, the structure of your data lake is an ongoing thing that will continue to evolve. Just planning for that refactoring time will help you keep the data lake useful to your various users.
 

How do I plan our data lake structure?

Ideally, the structure of your data lake should be requirements-driven from the start. If new Big Data sources are disrupting your current data warehouse (for example, new IoT data, JSON files, or log files), I would start by taking those on individually. However, you also want to take a step back and think about what other types of data you might store and how to structure them.
 
The most difficult part of minimizing data swamp risk is finding the right balance between planning ahead for unknowns and realizing that (no matter how long you plan) there will be disruptors ahead. It's not a one-and-done thing. Just make sure you allocate time to rethink your structure when it becomes necessary.
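One lightweight way to take sources on individually is to keep a small registry of what has been onboarded and periodically check the lake against it. The sketch below assumes a hypothetical zone/source folder layout and a hand-maintained `REGISTERED` set; the file listing is made-up sample data, not output from any real storage API.

```python
from pathlib import PurePosixPath

# Hypothetical registry: a (zone, source) pair is added each time a new
# requirement brings a new data source into the lake.
REGISTERED = {
    ("raw", "iot"),
    ("raw", "weblogs"),
}


def find_strays(paths):
    """Return the paths whose first two folders (zone, source) were never registered."""
    strays = []
    for p in paths:
        parts = PurePosixPath(p).parts
        # A well-placed file needs at least zone/source/filename.
        if len(parts) < 3 or (parts[0], parts[1]) not in REGISTERED:
            strays.append(p)
    return strays


# Made-up listing standing in for whatever your storage tool reports.
listing = [
    "raw/iot/thermostats/2019/07/16/readings.json",
    "raw/weblogs/site/2019/07/16/access.log",
    "tmp/copy_of_export_v2.csv",
    "raw/crm/export.json",
]
print(find_strays(listing))
```

A growing list of strays is a reasonable signal that it is time to revisit the structure, either by registering a new source or by moving files to where they belong.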
 

What’s the best data storage solution?

Data lakes work well for that initial landing and for some quick exploration of data, but our most successful clients often implement a mix of data lakes and data warehouses. From an analytical standpoint, we’re seeing the value of getting to a structured data model in a data warehouse as the next layer downstream from the data lake. This especially applies to Microsoft-centric technologies. Microsoft has said it considers it a best practice to have a physical relational data model that serves as the source for an Analysis Services tabular model or a Power BI data model.
 
That notion of going from data lake to data warehouse to tabular model to Power BI really is the best practice that we're seeing in the industry. We've worked hard at Skyline Technologies to wrap some best practices around the implementation of that kind of a solution.
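As a rough illustration of the first hop (data lake to a structured, relational model), the sketch below loads a few JSON records into a typed table and queries it. Everything here is an assumption for demonstration: the sample records, the `thermostat_reading` table, and the use of an in-memory SQLite database as a stand-in for a real data warehouse.

```python
import json
import sqlite3

# Hypothetical raw records, as they might land in the lake as JSON lines.
raw_lines = [
    '{"device_id": "t-100", "reading_time": "2019-07-16T08:00:00", "temp_f": 71.5}',
    '{"device_id": "t-100", "reading_time": "2019-07-16T09:00:00", "temp_f": 73.0}',
    '{"device_id": "t-200", "reading_time": "2019-07-16T08:00:00", "temp_f": 68.2}',
]

# In-memory SQLite stands in for the warehouse in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE thermostat_reading (
        device_id    TEXT NOT NULL,
        reading_time TEXT NOT NULL,
        temp_f       REAL NOT NULL
    )
    """
)

# Parse each JSON record and load it into the relational table.
rows = [
    (rec["device_id"], rec["reading_time"], rec["temp_f"])
    for rec in map(json.loads, raw_lines)
]
conn.executemany("INSERT INTO thermostat_reading VALUES (?, ?, ?)", rows)
conn.commit()

# Once the data has a fixed shape, downstream models can query it directly.
avg_by_device = conn.execute(
    "SELECT device_id, AVG(temp_f) FROM thermostat_reading "
    "GROUP BY device_id ORDER BY device_id"
).fetchall()
print(avg_by_device)
```

The raw JSON stays in the lake for exploration, while the table gives reporting tools a stable schema to build on.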
 
If balancing the needs of a data lake or data warehouse is one of the challenges you're facing in your organization, our data team would welcome the opportunity to share more best practices with you and talk about this topic in greater detail. Feel free to contact us or subscribe to our blog for more practical insights.