Data Lake vs Data Warehouse: What’s the Best Configuration?

Scott Hietpas  |  Jul 02, 2019
 
In this blog series, Scott Hietpas, a principal consultant with Skyline Technologies’ data team, responds to some common questions on data warehouses and data lakes. For a full overview on this topic, check out the original Data Lake vs Data Warehouse webinar.
 
In my previous blog, I talked about some of the pros and cons of data lake and data warehouse solutions. While it might make sense for some organizations to choose one or the other, we find that the data lake and the data warehouse work better together.
 
We can use the data lake early in an architecture because it's optimized for ingesting data in Big Data scenarios. We can also use it to give the data science layer access to that data. However, when it comes to end users and self-service, there's still a lot of value in a highly optimized query layer, and we still benefit from having an analytical data model in a data warehouse. We want the best query performance, and we want to make it easy for users to understand how to slice and dice the data.
 
[Image: data lake and data warehouse, better together]
 
We typically put a data warehouse on top of the data lake, and then our reporting can come out of that data warehouse.
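
To make that layering concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the raw order records, the fact_orders table and the function names are all hypothetical, and an in-memory SQLite database stands in for a real data warehouse. Raw records land in the "lake" as-is, a load step cleans and types them, and the report queries only the warehouse.

```python
import json
import sqlite3

# Hypothetical raw events as they might land in a data lake: one JSON
# document per line, with no enforced schema (note the extra "coupon"
# field and the record that is missing "amount").
RAW_LAKE_LINES = [
    '{"order_id": 1, "region": "Midwest", "amount": "120.50"}',
    '{"order_id": 2, "region": "West", "amount": "80.00", "coupon": "SPRING"}',
    '{"order_id": 3, "region": "Midwest"}',
]


def load_warehouse(lines):
    """Clean raw lake records and load them into a typed warehouse table."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
    conn.execute(
        "CREATE TABLE fact_orders ("
        "order_id INTEGER PRIMARY KEY, region TEXT NOT NULL, amount REAL NOT NULL)"
    )
    for line in lines:
        record = json.loads(line)
        if "amount" not in record:
            continue  # skip rows that fail the (assumed) quality rule
        conn.execute(
            "INSERT INTO fact_orders VALUES (?, ?, ?)",
            (record["order_id"], record["region"], float(record["amount"])),
        )
    conn.commit()
    return conn


def sales_by_region(conn):
    """The kind of slice-and-dice query a report would run on the warehouse."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM fact_orders GROUP BY region ORDER BY region"
    )
    return dict(rows.fetchall())


warehouse = load_warehouse(RAW_LAKE_LINES)
report = sales_by_region(warehouse)
print(report)  # {'Midwest': 120.5, 'West': 80.0}
```

The untouched raw lines remain available to the data science layer, while end users only ever see the modeled table.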
 

What’s the Best Configuration?

The image above shows a single data lake feeding a single data warehouse, but there are lots of options. The image below shows a more complex configuration. We might want to separate data based on where it is stored, and therefore have a couple of data lakes. We can also have a collection of data marts or data warehouses that serve different audiences. All the options that existed before data lakes still exist today; we just have an additional layer now to better meet the needs of Big Data scenarios.
 
[Image: Power BI with multiple data lakes and data warehouses]
 
We're also seeing some changing functionality. I covered the individual strengths and weaknesses of data lakes and data warehouses in the previous blog of this series, but the primary difference is that the data lake uses schema-on-read and the data warehouse uses schema-on-write. Today, we're starting to see some relational database technology coming into the data lake or sitting on top of it.
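
The schema-on-read versus schema-on-write distinction can be sketched in a few lines of Python. This is a toy model, not how any particular product works: the SCHEMA dictionary and the Lake and Warehouse classes are hypothetical names invented for the illustration. The warehouse validates a record before storing it; the lake stores anything and applies the schema only when someone reads.

```python
import json

# Hypothetical schema for a sensor reading: field name -> required Python type.
SCHEMA = {"sensor": str, "temp_c": float}


def validate(record):
    """Return the record unchanged, or raise ValueError if it breaks SCHEMA."""
    if not isinstance(record, dict):
        raise ValueError("record must be a JSON object")
    for field, expected in SCHEMA.items():
        if not isinstance(record.get(field), expected):
            raise ValueError(f"{field} must be of type {expected.__name__}")
    return record


class Warehouse:
    """Schema-on-write: bad records are rejected before they are stored."""

    def __init__(self):
        self.rows = []

    def write(self, raw_text):
        self.rows.append(validate(json.loads(raw_text)))


class Lake:
    """Schema-on-read: everything is stored; structure is applied by the reader."""

    def __init__(self):
        self.blobs = []

    def write(self, raw_text):
        self.blobs.append(raw_text)  # no checks at write time

    def read(self):
        rows = []
        for blob in self.blobs:
            try:
                rows.append(validate(json.loads(blob)))
            except ValueError:  # also covers json.JSONDecodeError
                continue  # unreadable under this schema, but still kept in the lake
        return rows


good = '{"sensor": "a1", "temp_c": 21.5}'
bad = '{"sensor": "b2", "temp_c": "n/a"}'

lake, warehouse = Lake(), Warehouse()
for raw in (good, bad):
    lake.write(raw)
    try:
        warehouse.write(raw)
    except ValueError as err:
        print("warehouse rejected:", err)

print(len(lake.blobs), len(lake.read()), len(warehouse.rows))  # 2 1 1
```

The lake keeps both records, so a later reader with a different schema could still make use of the second one. The warehouse never stored it, which is what lets it promise clean data to self-service users.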
 
Generally, there's still enough differentiation that we see the data warehouse and data lake as separate layers. However, we're seeing advancements on both sides that try to reach a middle ground and be good at both.
 

Examples of New Hybrid Options

Data Lake Hybrids: Databricks Delta

One example is Azure Databricks and its Databricks Delta implementation. Databricks Delta starts to give us more transaction management on top of the data lake. Other high-value features include data versioning, so we can track point-in-time values, and more efficient upserts. With it, we're starting to see relational-style functionality layered on top of the data lake.
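
To show what versioning and upserts mean in practice, here is a conceptual sketch in plain Python. It is not the Databricks Delta API, and the VersionedTable class and the customer rows are hypothetical. It also keeps a full snapshot per commit for simplicity, which is an assumption of this toy and not a description of how Delta stores data. The point is only the behavior: an upsert updates matching keys and inserts new ones as a single commit, and earlier versions stay readable.

```python
from copy import deepcopy


class VersionedTable:
    """Toy table that keeps one snapshot per commit (illustration only)."""

    def __init__(self, key):
        self.key = key
        self._versions = [{}]  # version 0 is the empty table

    @property
    def latest_version(self):
        return len(self._versions) - 1

    def upsert(self, rows):
        """Update rows whose key exists and insert the rest, as one commit."""
        staged = deepcopy(self._versions[-1])
        for row in rows:
            key_value = row[self.key]  # a KeyError here aborts before any commit
            staged[key_value] = {**staged.get(key_value, {}), **row}
        self._versions.append(staged)  # commit only after every row is staged
        return self.latest_version

    def read(self, version=None):
        """Read the latest snapshot, or a point-in-time snapshot by version."""
        snapshot = self._versions[-1 if version is None else version]
        return sorted(deepcopy(list(snapshot.values())), key=lambda r: r[self.key])


customers = VersionedTable(key="id")
v1 = customers.upsert([{"id": 1, "city": "Appleton"}, {"id": 2, "city": "Green Bay"}])
v2 = customers.upsert([{"id": 2, "city": "Madison"}, {"id": 3, "city": "Milwaukee"}])

print(customers.read())    # current state: id 2 now shows Madison, id 3 is new
print(customers.read(v1))  # point-in-time state: id 2 still shows Green Bay
```

Because a failed upsert never appends a snapshot, readers see either the whole batch or none of it, which is the transaction-management idea in miniature.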
 
[Image: Azure Databricks features]
 

Common Data Model: Power BI Dataflows

Another example is the Common Data Model, which Microsoft is pursuing as part of an open data initiative to put more structure around a data lake so that it lends itself to end-user analysis.
 
Power BI Dataflows lets us ingest data at a self-service or enterprise level and land that data in a data lake while still keeping some schema and meaning around it. This preserves an understanding of the relationships between entities and makes it easier to build data visualizations on top of them.
 
Again, this may not meet everybody's performance needs (a data warehouse may still make sense), but we're seeing some opportunities in that middle ground of hybrid solutions.
 

In Summary

Does it make sense to have a data lake, a data warehouse, or both? It ultimately depends on your business requirements.
 

A data lake can add a lot of value to your organization if you:

  • Work with Big Data and need to handle the volume, velocity, or variety of data.
  • Have a data science, machine learning, or AI function that needs to explore data more broadly, including data that may not yet have a known analytical value.
  • Value speed over accuracy (meaning you prioritize the ability to analyze data quickly over a more formal IT extract, transform and load process).
 
The data lake can offer a lot of value. But, from an end-user perspective, the data warehouse is still a cornerstone of a good data analytics solution.
 

The data warehouse is generally your cornerstone because it allows you to:

  • Provide the single source of truth that businesses expect in most cases.
  • Ensure you have a quality, analytical data model that lends itself to slicing and dicing the data.
  • Guarantee that you have accurate data when users are doing a self-service scenario off that model.
 
Both solutions offer plenty of value to your organization, and they will generally work better together. In my next blog, I’m going to look at a common question about the data swamp that can happen with a data lake (and how to avoid that situation).
 
Can’t wait for the next blog? Check out the Data Lake vs Data Warehouse webinar to explore this hot topic in full.
 