Data Lake vs Data Warehouse: Avoiding the Data Swamp

Scott Hietpas  |  
Jul 16, 2019
 
In this blog series, Scott Hietpas, a principal consultant with Skyline Technologies’ data team, and Matt Pluster, the Data Analytics and Data Platform director, respond to some common questions on data warehouses and data lakes. Previous installments covered pros and cons of each solution, and the advantages of a hybrid configuration. For a full overview on this topic, check out the original Data Lake vs Data Warehouse webinar.
 

What is a data swamp?

In a previous blog, we talked about how one of the benefits of a data lake is being able to explore a large variety of data. We can land whatever data we want in this structure, and there aren't any rules for how to store it.
 
Just as you may have your own way of organizing folders on your computer, the data lake gives you options: you can create a series of containers, folders, sub-folders, and so on, because you can put any kind of data into the data lake. Getting data back out, however, requires some guidance based on how you structured and stored it. Without a good plan, you get a data swamp, which basically means you have stuff all over the place (maybe in different versions, or in different partitions), and it can be difficult to find anything.
 
A lot of thought can go into how you logically organize your data. If you don't put some upfront thought into that, you end up with files everywhere and no organization system. And that starts to diminish the value of your data lake.
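One practical way to put that upfront thought into action is to agree on a single, predictable naming convention for paths in the lake. As a minimal sketch, assuming a zone/source/dataset layout with date partitions (the zone names and partition scheme here are illustrative assumptions, not a convention prescribed in this article):

```python
from datetime import date

def lake_path(zone: str, source: str, dataset: str, load_date: date) -> str:
    """Build a predictable data lake folder path.

    Layout: <zone>/<source>/<dataset>/year=YYYY/month=MM/day=DD
    Date partitions make it easy to find (or reload) one day's files.
    """
    return (
        f"{zone}/{source}/{dataset}/"
        f"year={load_date.year:04d}/month={load_date.month:02d}/day={load_date.day:02d}"
    )

# If every process that lands data uses the same helper, files stop
# ending up "all over the place" in ad hoc folders.
print(lake_path("raw", "erp", "sales-orders", date(2019, 7, 16)))
# raw/erp/sales-orders/year=2019/month=07/day=16
```

The specifics matter less than the consistency: whatever convention you choose, enforcing it in one shared piece of code (rather than in each ingestion process) is what keeps the structure from drifting.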
 

How do you avoid the data swamp?

Other than just having a plan, what are some steps to take to prevent ending up with a data swamp? One of the first ways to avoid a data swamp is starting small. Don’t try to take on the world in your initial planning process. You should absolutely have an initial plan about what your structure should be, but it’s also important to plan time around refactoring. As soon as you introduce additional elements and requirements, your structure is going to change.
 
Don’t expect to get it perfect right away. Plan to be agile, knowing you will want to refactor along the way. Even if you initially get it 100% right, the structure of your data lake will continue to evolve as your data does. Simply planning for that refactoring time will help you keep the data lake useful to your various users.
 

How do I plan our data lake structure?

When getting started with the structure for your data lake, ideally it would be requirements-driven. If you have a new Big Data disruptor to your current data warehouse (for example, new IoT data, JSON files, log files, etc.), I would just start taking those on individually. However, you also want to take a step back and start to think about what other types of data you might store and how to structure it.
 
The most difficult part of minimizing data swamp risk is finding the right balance between planning ahead for unknowns and realizing that (no matter how long you plan) there will be disruptors ahead. It's not a one-and-done thing. Just make sure you allocate time to rethink your structure when it becomes necessary.
 

What’s the best data storage solution?

Data lakes work well for that initial landing and for some quick exploration of data, but our most successful clients often implement a mix of data lakes and data warehouses. From an analytical standpoint, we’re seeing the value of a structured data model in a data warehouse as the next layer downstream from the data lake. This especially applies to Microsoft-centric technologies: Microsoft has said they consider it a best practice to have a physical relational data model that serves as the source for an Analysis Services tabular model or a Power BI data model.
 
That flow of going from data lake to data warehouse to tabular model to Power BI really is the best practice we're seeing in the industry. We've worked hard at Skyline Technologies to wrap some best practices around the implementation of that kind of solution.
 
If balancing the needs of a data lake or data warehouse is one of the challenges you're facing in your organization, our data team would welcome the opportunity to share more best practices with you and talk about this topic in greater detail. Feel free to contact us or subscribe to our blog for more practical insights.