Back to List

Data Lake vs Data Warehouse: Pros and Cons

Scott Hietpas Scott Hietpas  |  
Jun 18, 2019
 
In this blog series, Scott Hietpas, a principal consultant with Skyline Technologies’ data team, responds to some common questions on data warehouses and data lakes. For a full overview on this topic, check out the original Data Lake vs Data Warehouse webinar.
 
There's a lot of discussion around data lakes and data warehouses. When organizations are evaluating these two technologies, there is one big question: “Where should you store your data to do the type of analysis you need and get the most out of it?” While it may be tempting to choose only the newer data lake format, or the tried-and-true data warehouse, ultimately both systems offer both pros and cons. We are going to explore some of the big ones in this blog.
 

Pros of the Traditional Data Warehouse

While the traditional data warehouse has started to see some Big Data challenges (from social media and marketing tools) and the growing demands of data science, there are still a lot of benefits to this solution.
 

Well-Established

One of the core benefits of the data warehouse is that it's a well-established, proven solution. The SQL server stack is a widely available, mature toolset that has been around for a long time. If you need additional resources or a team internally to support your data storage system, it's easy to find resources that are familiar with those tools. Although new functionality can create issues, for the most part a data warehouse is well-proven and can run optimally with very little maintenance.
 

Performance

A data warehouse also delivers very good performance - partially due to its underlying structure. Since a data warehouse has a schema-on-write, SQL servers (or other data warehouses) already understand its underlying structure. Therefore, its query engine can understand how to work with that data to quickly return query results.
 

Usability

The usability of a data warehouse is due to the analytical model. Users may not be familiar with how to get data out of the source systems where the data originates. Often, those structures can be extremely complex and normalized. Hence the data warehouse is an opportunity to have a more denormalized model that makes it easier for users to understand how to slice and dice that data. The data warehouse is flexible in that it can work well on-premises or in the cloud. Whether you're using a SQL server stack on-prem, or exploring an Azure SQL database / data warehouse in the cloud, any of these data warehouse options can support this type of approach very well.
 

Cons of the Traditional Data Warehouse

While data warehouses have significant advantages, they also have some significant drawbacks in this era of Big Data and high data science demand. Some of the cons of the data warehouse are:
 

Storage Costs

Storage cost is one of the key modern drawbacks of a data warehouse. Database resources are more expensive than those of a data lake due to features. The budget-conscious company (that is, most companies) must ask if it needs to store large volumes of data in an Azure SQL database or data warehouse (which may have a higher storage cost than a data lake).
 

Latency

With a data warehouse, you have to wait for each business process component to be built to get value from that data. Although there's a lot of value in having an analytical model, it also takes time to have your development team support getting data into that model, typically through the development of ETL processes. If you have data that's not in your data warehouse today, and users want to report on it, then there can be additional latency to get that data into the warehouse.
 

Limited Exploration

Cost and warehouse model can restrict capturing data of unknown value, which can limit exploration. You might only pull certain amounts of data where you had known reporting requirements. There may be a whole other set of data that you're not bringing in that users might want to explore.
 

Not Designed for Big Data

Of course, the Big Data challenge is one of the major sticking points when it comes to the data warehouse. Its schema-on-write is not really optimized for that variety, velocity and volume of data. Disruptors like hybrid source systems (cloud and on-prem), and high data volume, variety and velocity coming from big data scenarios can limit the effectiveness of your data warehouse.
 

Pros of the Data Lake

The data lake solves some of the pros and cons of the traditional data warehouse, but it also offers drawbacks of its own. First, let’s look at its pros.
 

Volume, Variety and Velocity

The real strength of the data lake is that it does a good job addressing some of those Big Data challenges like volume, variety, and velocity. It’s infinitely scalable, handles structured or unstructured data, and is designed for rapid data ingestion. It can also support Internet of Things scenarios. Because it's schema-on-read, we don't have to understand the format of the data until we're ready to read it back out. We can write that data very quickly before we have that understanding and without risk of write errors.
 

Low Cost Storage

A data lake is also low cost (relatively speaking). That allows us to not worry as much about the types of data that we're storing. We may not know the analytical value around our data yet, but with data lake storage, we can inexpensively hang on to it for some future point where we might be able to find value for it.
 

Accessible Data

Landing the data in the data lake makes it easy for us to open copies of that data (or subsets of that data) to different user groups. Whether that's for self-service or data science, we can control access to that data.
 

Cons of the Data Lake

While these are some big positives, the data lake also has some significant cons.
 

Cloud Bias

There are on-premises data lake solutions (Hadoop is a very common one). However, installing a data lake solution on-prem can be much more complex, whereas spinning off a data lake in the cloud is very simple.  This may be considered a negative if it does not align with your infrastructure strategy.
 

Skillset Learning Curve

The data lake often comes with a new set of tools and services that need to be understood (and it’s a bit of a learning curve). That requires some additional investment, either from recruiting to get the right team members or doing internal professional development to understand those new tools.
 

Transition Period

When moving from a data warehouse to a data lake (or just adding a data lake to your existing system), there’s going to be a transition period. If you already have an existing data warehouse, how does the data lake fit into that? Do you rework some of that for the new needs and functionality, or not?
 

Investment in Best Practices and Processes

A data lake also requires significant investment not just from a skillset perspective, but also in understanding how that impacts your best practices. In addition to being able to do the work, you need to understand what your organizational standards will be around doing that work.
 

Not Optimized for Query Performance

Although we can very easily land data in the data lake, in order to query that data back out, the data lake doesn't have the same underlying query engine that a data warehouse does. Those queries may not perform to the level you need.
 

Which Option Is Right for You?

Often, we see that the data lake and the data warehouse really work better together. In my next blog, I’m going to walk through what a successful hybrid solution looks like.
 
Can’t wait for the next blog? Check out the Data Lake vs Data Warehouse webinar to explore this hot topic in full.
 
Business IntelligenceData Analytics

 

Love our Blogs?

Sign up to get notified of new Skyline posts.

 


Related Content


Blog Article
Data Lake vs Data Warehouse: Avoiding the Data Swamp
Scott HietpasScott Hietpas  |  
Jul 16, 2019
In this blog series, Scott Hietpas, a principal consultant with Skyline Technologies’ data team, and Matt Pluster, the Data Analytics and Data Platform director, respond to some common questions on data warehouses and data lakes. Previous installments covered pros and cons of each solution,...
Blog Article
Data Lake vs Data Warehouse: What’s the Best Configuration?
Scott HietpasScott Hietpas  |  
Jul 02, 2019
In this blog series, Scott Hietpas, a principal consultant with Skyline Technologies’ data team, responds to some common questions on data warehouses and data lakes. For a full overview on this topic, check out the original Data Lake vs Data Warehouse webinar.   In my previous blog...
Blog Article
How to Use Report Tables to Model and Analyze BI Requirements
Rachael WilterdinkRachael Wilterdink  |  
Jun 27, 2019
If you’re involved in eliciting, modeling, analyzing, or consuming requirements for BI projects, this post is for you. Over the next several months, we will be releasing 10 Techniques for Business Analysts (BAs) to model and analyze Business Intelligence (BI) requirements on our blog...
Blog Article
How to Use a Glossary to Model and Analyze BI Requirements
Rachael WilterdinkRachael Wilterdink  |  
Jun 13, 2019
If you’re involved in eliciting, modeling, analyzing, or consuming requirements for BI projects, this post is for you. Over the next several months, we will be releasing 10 Techniques for Business Analysts (BAs) to model and analyze Business Intelligence (BI) requirements on our blog...
Blog Article
Machine Monitoring IoT Solution with Azure Services and Power BI
Eric SaltzmannEric Saltzmann  |  
Jun 11, 2019
We often hear organizations ask how they can drive more insights out of their connected devices. Though the Internet of Things (IoT) has been a buzzword for the last few years, many organizations are still struggling through the headache of implementing an IoT pilot or solution. Most of the...