
Data Lake vs Data Warehouse: What’s the Best Configuration?

Scott Hietpas  |  
Jul 02, 2019
 
In this blog series, Scott Hietpas, a principal consultant with Skyline Technologies’ data team, responds to some common questions on data warehouses and data lakes. For a full overview on this topic, check out the original Data Lake vs Data Warehouse webinar.
 
In my previous blog, I talked about some of the pros and cons of data lake and data warehouse solutions. While it might make sense for some organizations to choose one or the other, we find that the data lake and the data warehouse really work better together.
 
We can use the data lake early in an architecture because it's optimized for ingesting data in Big Data scenarios. We can also use it to give the data science layer access to the data. However, when it comes to end users and self-service, there's still a lot of value in a highly optimized query layer, so we benefit from keeping the analytical data model in a data warehouse. That gives us the best query performance and makes it easy for users to understand how to slice and dice the data.
 
[Image: data lake and data warehouse, better together]
 
We typically put a data warehouse on top of the data lake, and then our reporting can come out of that data warehouse.
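As a rough illustration of that layering, here is a minimal Python sketch. Everything in it is an assumption made for the example: the raw records, the `fact_sales` table, and the function names are hypothetical, and an in-memory SQLite database stands in for a real warehouse. It only shows the direction of flow: raw data lands in the lake, a typed model is loaded into the warehouse, and reporting queries the warehouse.

```python
import json
import sqlite3

# Hypothetical raw events as they might land in a data lake:
# one JSON document per line, stored exactly as received.
RAW_LAKE_LINES = [
    '{"order_id": 1, "region": "Midwest", "amount": 120.0}',
    '{"order_id": 2, "region": "Midwest", "amount": 80.0}',
    '{"order_id": 3, "region": "South", "amount": 50.0}',
]


def load_warehouse(lines):
    """Parse raw lake records and load them into a typed warehouse table."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
    conn.execute(
        "CREATE TABLE fact_sales ("
        "order_id INTEGER PRIMARY KEY, region TEXT NOT NULL, amount REAL NOT NULL)"
    )
    rows = [(r["order_id"], r["region"], r["amount"]) for r in map(json.loads, lines)]
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
    return conn


def sales_by_region(conn):
    """A report-style query that slices the warehouse table by region."""
    cur = conn.execute(
        "SELECT region, SUM(amount) FROM fact_sales GROUP BY region ORDER BY region"
    )
    return cur.fetchall()


warehouse = load_warehouse(RAW_LAKE_LINES)
report = sales_by_region(warehouse)
print(report)
```

The report never touches the raw lines directly; it reads only the modeled table, which is the role the data warehouse plays for end users in this architecture.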
 

What’s the Best Configuration?

The image above shows a single data lake feeding a single data warehouse, but there are lots of options, and the image below is more complex. We might want to separate data based on where it is stored and therefore end up with a couple of data lakes. We can also have a collection of data marts or data warehouses that serve different audiences. All the options that existed before data lakes still exist today; we just have an additional layer now to better meet the needs of Big Data scenarios.
 
[Image: Power BI with multiple data lakes and data warehouses]
 
We're also seeing some changing functionality. I covered the individual strengths and weaknesses of data lakes and data warehouses in the previous blog of this series, but the primary difference is that the data lake applies schema-on-read while the data warehouse applies schema-on-write. Today, we're starting to see some of that relational database technology coming into the data lake or sitting on top of it.
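To make the schema-on-read versus schema-on-write distinction concrete, here is a small Python sketch. The sensor records, the `reading` table, and the helper names are hypothetical, and SQLite is used only as a convenient stand-in for a warehouse. The lake side keeps every record and applies a schema when reading; the warehouse side enforces the schema when writing and rejects the record that does not fit.

```python
import json
import sqlite3

# Hypothetical incoming records; the second one does not match the expected shape.
records = [
    '{"id": 1, "temp_c": 21.5}',
    '{"id": 2, "temp_c": "n/a", "note": "sensor offline"}',
]

# Schema-on-read: the lake stores everything as-is, with no validation.
lake = list(records)


def read_temps(lines):
    """Apply a schema at read time, skipping values that do not fit it."""
    temps = []
    for line in lines:
        value = json.loads(line).get("temp_c")
        if isinstance(value, (int, float)):
            temps.append(float(value))
    return temps


# Schema-on-write: the table definition is enforced when rows are inserted.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reading ("
    "id INTEGER PRIMARY KEY, "
    "temp_c REAL NOT NULL CHECK (typeof(temp_c) = 'real'))"
)

rejected = []
for line in records:
    doc = json.loads(line)
    try:
        conn.execute(
            "INSERT INTO reading (id, temp_c) VALUES (?, ?)",
            (doc["id"], doc["temp_c"]),
        )
    except sqlite3.IntegrityError:
        rejected.append(doc["id"])  # the row never reaches the warehouse

lake_temps = read_temps(lake)
warehouse_rows = conn.execute("SELECT id, temp_c FROM reading").fetchall()
print(len(lake), lake_temps, warehouse_rows, rejected)
```

The trade-off shows up in where the bad record surfaces: the lake keeps it for later exploration, while the warehouse refuses it up front so every query sees clean, typed data.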
 
Generally, there's still enough differentiation that we see the data warehouse and data lake as separate layers. However, we're seeing advancements on both sides that try to reach a middle ground and be good at both.
 

Examples of New Hybrid Options

Data Lake Hybrids: Databricks Delta

One example is Azure Databricks and its Databricks Delta implementation. Databricks Delta starts to give us more transaction management on the data lake. Other high-value features include data versioning, so we can track point-in-time values, and more efficient upserts. With it, we're starting to see warehouse-style functionality layered on top of the data lake.
 
[Image: Azure Databricks features]
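The ideas of versioning and upserts can be sketched with a toy in-memory table in Python. This is not the Databricks Delta API: the `VersionedTable` class and its methods are hypothetical, and keeping a full snapshot per commit is a simplification chosen for readability, not a description of how a real engine stores data.

```python
from copy import deepcopy


class VersionedTable:
    """Toy in-memory table that keeps one snapshot per commit (hypothetical)."""

    def __init__(self):
        self._versions = [{}]  # version 0 is the empty table

    @property
    def latest_version(self):
        return len(self._versions) - 1

    def upsert(self, rows, key="id"):
        """Update rows whose key exists, insert the rest, commit a new version."""
        snapshot = deepcopy(self._versions[-1])
        for row in rows:
            snapshot[row[key]] = {**snapshot.get(row[key], {}), **row}
        self._versions.append(snapshot)
        return self.latest_version

    def read(self, version=None):
        """Return rows as of a version (latest by default), sorted by key."""
        if version is None:
            version = self.latest_version
        snapshot = self._versions[version]
        return [snapshot[k] for k in sorted(snapshot)]


table = VersionedTable()
v1 = table.upsert([{"id": 1, "status": "new"}, {"id": 2, "status": "new"}])
v2 = table.upsert([{"id": 2, "status": "shipped"}, {"id": 3, "status": "new"}])

print(table.read(v1))  # point-in-time view before the second upsert
print(table.read())    # current view
```

Reading with `v1` returns order 2 as it was before it shipped, which is the kind of point-in-time question that data versioning is meant to answer.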
 

Common Data Model: Power BI Dataflows

Another example is the Common Data Model, which Microsoft is advancing through an open data initiative to put more structure around a data lake so that it lends itself to end-user analysis.
 
Power BI Dataflows let us ingest data at a self-service or enterprise level and land that data in a data lake while still keeping some schema and meaning around it. This maintains an understanding of the relationships between entities and makes it easier to build data visualizations on top of it.
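The general idea of keeping schema and relationships alongside files in a lake can be sketched in Python. This is only an illustration: the `MODEL` dictionary, the entity names, and the headerless CSV contents are all hypothetical and are not the actual Common Data Model or Dataflows format.

```python
import csv
import io

# Hypothetical metadata describing entities stored as plain CSV in a lake.
MODEL = {
    "entities": {
        "Customer": {"columns": {"customer_id": int, "name": str}},
        "Order": {"columns": {"order_id": int, "customer_id": int, "total": float}},
    },
    "relationships": [
        {"from": ("Order", "customer_id"), "to": ("Customer", "customer_id")},
    ],
}

# Hypothetical headerless data files; the metadata supplies names and types.
LAKE_FILES = {
    "Customer": "1,Acme\n2,Globex\n",
    "Order": "10,1,99.5\n11,1,20.0\n12,2,5.25\n",
}


def read_entity(name):
    """Read an entity's file and apply the column names and types from MODEL."""
    columns = MODEL["entities"][name]["columns"]
    rows = []
    for raw in csv.reader(io.StringIO(LAKE_FILES[name])):
        rows.append({col: cast(value) for (col, cast), value in zip(columns.items(), raw)})
    return rows


def join_on_relationship(rel):
    """Join a child entity to its parent using a declared relationship."""
    (child, child_col), (parent, parent_col) = rel["from"], rel["to"]
    parents = {row[parent_col]: row for row in read_entity(parent)}
    return [{**parents[row[child_col]], **row} for row in read_entity(child)]


joined = join_on_relationship(MODEL["relationships"][0])
totals = {}
for row in joined:
    totals[row["name"]] = totals.get(row["name"], 0.0) + row["total"]
print(totals)
```

Because the relationship is declared in the metadata, a reporting tool would not need to guess how orders relate to customers before building a visual such as total sales per customer.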
 
Again, this may not meet everybody's performance needs (a data warehouse may still make sense), but we're seeing some opportunities in that middle ground of hybrid solutions.
 

In Summary

Does it make sense to have a data lake, a data warehouse, or both? It ultimately depends on your business requirements.
 

A data lake can add a lot of value to your organization if you:

  • Work with Big Data and need to handle the volume, velocity, or variety of data.
  • Have data science or machine learning/AI work that requires broader exploration of data that may not yet have a known analytical value.
  • Value speed over accuracy (meaning you prioritize the ability to analyze data quickly over a more formal IT extract, transform and load process).
 
The data lake can offer a lot of value. But, from an end-user perspective, the data warehouse is still a cornerstone of a good data analytics solution.
 

The data warehouse is generally your cornerstone because it allows you to:

  • Provide the single source of truth that businesses expect in most cases.
  • Ensure you have a quality, analytical data model that lends itself to slicing and dicing the data.
  • Guarantee that you have accurate data when users are doing a self-service scenario off that model.
 
Both solutions offer plenty of value to your organization, and they will generally work better together. In my next blog, I’m going to look at a common question about the data swamp that can happen with a data lake (and how to avoid that situation).
 
Can’t wait for the next blog? Check out the Data Lake vs Data Warehouse webinar to explore this hot topic in full.
 

 
