Data warehouses have been around for decades, but in the last few years you’ve probably heard the term “data lake”. Despite what some believe, a data lake is not just a new buzzword for a data warehouse. In fact, the two are quite different concepts and often complement each other in an organization’s data infrastructure.
What is a Data Warehouse?
In simple terms, a data warehouse is a database where data from different systems is stored and modeled to support analysis and other actives related to answering questions with it. During the process of loading and transforming data into a warehouse, the data goes through processing that cleans and structures the data for optimal querying.
Thanks to such processing, data warehouses can be queried directly with SQL queries, but are typically accessed via Business Intelligence and Visualization tools.
Though the data is structured (think tables with a predefined schema), the tables in a data warehouse are designed to support the types of queries used in data analysis. This is considered denormalized design, vs. the normalized design you’ll see in databases that power software applications.
Though years ago data warehouses were often built on the same database technology as application databases, there are now numerous technologies specifically designed for the challenges of storing and querying high volumes of data. Examples include Amazon Redshift, Microsoft Azure SQL Data Warehouse and Snowflake.
How is a Data Lake Different?
Just like a data warehouse, a data lake is a central repository for data that can come from just about anywhere. That’s where the similarities stop however.
A data lake is where data is stored, but without the structure or query optimization of a data warehouse. It will likely contain a high volume of data as well as a variety of data types. For example, a single data lake might contain a collection of blog posts stored as text files, a history of billions of Tweets, and JSON documents containing events generated by sensors embedded in a fleet of vehicles. It can even store structured data like a standard database, though it’s not optimized for querying such data.
Storing large amounts of data in a data lake is far less expensive than storing it in a data warehouse, and because it’s unstructured data (no pre-defined schema) making changes to the types or properties of data stored is far easier than modifying a warehouse schema. Just send the data in as is and store it!
There are tradeoffs however. Without a schema, asking questions of the data is not as easy as writing a nice clean SQL query. You’ll need to understand the contents of the data and how to handle cases where one document has a property that another may not. For example, how do you handle a sensor events with varying properties? Because the data is stored in a raw, unstructured form in the data lake there’s no guarantee a particular property will be present in each event. In addition, the nature of unstructured data leads to challenges in data governance.
Data lakes aren’t a great fit for sitting behind a business intelligence tool, or for consumption by a general audience within an organization. Instead, they are often utilized by data scientists and machine learning engineers who need raw data, and a lot of it! In addition, data lakes can serve as a landing zone for data before its processed and sent to a data warehouse. Data engineers are more frequently implementing data lakes for that reason alone.
AWS, Azure and other cloud providers have several storage options that are frequently used for data lakes.
Find Your Fit
There’s no simple answer to whether you need a data warehouse, a data lake, or both. Hopefully this brief guide gives you a good idea of what they are and how they might fit your needs.
Don’t forget to sign up for the Data Liftoff mailing list to get more content and to stay up to date on the latest in data science and data engineering.