Data Warehouses, Lakes, Hubs, and Vaults Explained

Let’s cut right to the chase: you are reading this blog because you are looking for expertise on storing data for your current needs. I will define data warehouses, lakes, hubs, and vaults and walk through sample use cases I have seen for each. The differences between them are subtle, but each serves a different purpose in the data world today.

Data Warehouses

Definition

A data warehouse is a consolidated, structured repository for storing data assets. Data warehouses typically model data in one of two ways: a star schema or third normal form (3NF). These are only the fundamental principles for how you structure your data model. We have seen, advised on, and implemented both (in addition to the snowflake schema, which is a variation of the star schema in my mind). The one major flaw of a data warehouse is that everything must be strictly defined up front, both the schema and the integration.

Use Case

The most common use case for a data warehouse is to consolidate data and answer a business-related question such as: How many users are visiting my product pages from North America? This ties the information you receive from your end users to a business question that must be answered from a structured data set. Most would identify this as the cookie-cutter business intelligence solution.
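As a rough illustration of how a star schema supports a question like the one above, here is a minimal sketch using Python's built-in sqlite3 module. The table names (dim_region, dim_page, fact_page_view), the columns, and the sample rows are all hypothetical and exist only for this example; a real warehouse would use its own engine and a much richer model.

```python
import sqlite3


def build_star_schema(conn):
    """Create one fact table surrounded by two dimension tables (hypothetical names)."""
    conn.executescript("""
        CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE dim_page (
            page_id INTEGER PRIMARY KEY,
            url TEXT NOT NULL,
            page_type TEXT NOT NULL
        );
        CREATE TABLE fact_page_view (
            view_id INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL,
            region_id INTEGER NOT NULL REFERENCES dim_region(region_id),
            page_id INTEGER NOT NULL REFERENCES dim_page(page_id)
        );
    """)
    # Made-up sample data for the sketch.
    conn.executemany("INSERT INTO dim_region VALUES (?, ?)",
                     [(1, "North America"), (2, "Europe")])
    conn.executemany("INSERT INTO dim_page VALUES (?, ?, ?)",
                     [(1, "/products/a", "product"), (2, "/blog/x", "blog")])
    conn.executemany(
        "INSERT INTO fact_page_view (user_id, region_id, page_id) VALUES (?, ?, ?)",
        [(101, 1, 1), (101, 1, 1), (102, 1, 1), (103, 2, 1), (104, 1, 2)],
    )


def count_product_page_visitors(conn, region_name):
    """Count distinct users who viewed a product page from the given region."""
    row = conn.execute("""
        SELECT COUNT(DISTINCT f.user_id)
        FROM fact_page_view AS f
        JOIN dim_region AS r ON r.region_id = f.region_id
        JOIN dim_page AS p ON p.page_id = f.page_id
        WHERE r.name = ? AND p.page_type = 'product'
    """, (region_name,)).fetchone()
    return row[0]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_star_schema(connection)
    print(count_product_page_visitors(connection, "North America"))
```

The point of the layout is that the business question becomes a couple of joins from the fact table out to its dimensions, which is only possible because the schema was strictly defined before any data arrived.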

 Data Warehouse Scheme

There is an alternative to the traditional ETL (extract, transform, load) approach that is becoming more popular, especially with cloud-based and more powerful warehouses. Organizations are adopting ELT (extract, load, transform): they “stage” their raw data in the warehouse (such as HP Vertica) and then let the power of the database perform the transformation. Essentially, you perform the most expensive operations on the system where you have the most resources.
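To make the staging idea concrete, here is a small sketch, again with Python's sqlite3 module standing in for a real warehouse. The staging table stg_sales, the target table sales_by_region, and the sample rows are assumptions made up for this example. The raw rows are loaded exactly as they arrive, and the cleanup and aggregation run as SQL inside the database.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from a source system: all text, untidy.
RAW_EVENTS = [
    ("2024-01-05", " north america ", "20.00"),
    ("2024-01-05", "Europe", "5.00"),
    ("2024-01-06", "NORTH AMERICA", "10.50"),
]


def extract_and_load(conn, rows):
    """E and L: copy the source rows into a staging table without reshaping them."""
    conn.execute("CREATE TABLE stg_sales (sale_date TEXT, region TEXT, amount TEXT)")
    conn.executemany("INSERT INTO stg_sales VALUES (?, ?, ?)", rows)


def transform_in_database(conn):
    """T: let the database engine clean, cast, and aggregate the staged rows."""
    conn.execute("""
        CREATE TABLE sales_by_region AS
        SELECT UPPER(TRIM(region)) AS region,
               SUM(CAST(amount AS REAL)) AS total_amount
        FROM stg_sales
        GROUP BY UPPER(TRIM(region))
    """)


def run_elt(rows):
    """Run the load and the in-database transform, returning region totals."""
    conn = sqlite3.connect(":memory:")
    extract_and_load(conn, rows)
    transform_in_database(conn)
    return dict(conn.execute("SELECT region, total_amount FROM sales_by_region"))


if __name__ == "__main__":
    print(run_elt(RAW_EVENTS))
```

Notice that the Python side does no data cleaning at all. In an ETL flow the trimming and casting would happen before the load; here they are deferred to the database.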

Data Warehouse Scheme

If you would like to learn more about warehouses, their extended use case, and the advantages/disadvantages of selecting a data warehouse, please click here.

Sample technologies used today: RDBMS, Redshift, Snowflake, HP Vertica

For more information, download our white paper "Your Guide to Enterprise Data Architecture".

Data Lakes

Definition 

A data lake is a methodology for storing raw data in a single repository. The type of data stored in the lake does not matter: it can be unstructured, structured, semi-structured, or binary. The fundamental idea is to make any and all data from your applications available so your data team can provide insights on a business problem or value proposition. The challenge begins when you try to make sense of that data. If you are dumping data into a data lake, how do you know which data you need and which you don’t? How do you determine where the data resides in the lake? Without careful management, a data lake can very quickly become a data swamp.
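One common way to keep a lake from turning into a swamp is to record a catalog entry every time something lands in it. The sketch below uses a local directory in place of HDFS or S3, and the function names, the raw/<source>/<name> layout, and the catalog.jsonl file are all hypothetical choices for this example rather than any standard.

```python
import hashlib
import json
from pathlib import Path


def land_raw_object(lake_root, source, name, payload, description):
    """Store the payload bytes untouched and append a catalog entry describing them."""
    lake_root = Path(lake_root)
    target = lake_root / "raw" / source / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)  # no parsing, no schema: the lake keeps raw data

    entry = {
        "path": target.relative_to(lake_root).as_posix(),
        "source": source,
        "description": description,
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    with (lake_root / "catalog.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


def find_in_catalog(lake_root, source):
    """Answer 'where does this data live?' by reading the catalog, not the lake."""
    catalog = Path(lake_root) / "catalog.jsonl"
    if not catalog.exists():
        return []
    with catalog.open(encoding="utf-8") as fh:
        entries = [json.loads(line) for line in fh if line.strip()]
    return [entry for entry in entries if entry["source"] == source]
```

For example, after calling land_raw_object(root, "crm", "contacts.csv", data, "CRM contact export"), a later find_in_catalog(root, "crm") returns the entry with its relative path, size, and checksum, so the data team can locate the file without crawling the whole repository.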

Use Case

The use cases we see for creating a data lake revolve around reporting, visualization, analytics, and machine learning. If you would like to learn more about data lakes, please see a more in-depth look here.

Here is the architecture we see evolving:

Data Lake Scheme

Sample technologies used today: HDFS, S3, Azure Data Lake

 

Data Hubs

Definition

A data hub is a centralized system in which data is stored, defined, and served. I like to think of a data hub as a hybrid of a data lake and a data warehouse: it provides a central repository where your applications can dump data, but adds a level of harmonization at ingest so the data is indexed and can easily be queried. Please note that this is not the same as a data warehouse architecture, because the ETL processing merely indexes the data you have rather than mapping it into a strict structure. The challenge comes at implementation time: how do you harmonize all of your siloed data sources?
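Here is a toy sketch of what "harmonization at ingest" can mean in practice. It is plain Python, not how MarkLogic or any other product works internally, and the TinyDataHub class, its methods, and the field_aliases parameter are hypothetical names invented for this example. Each document is stored exactly as received, while an index maps canonical field names and values back to the documents that contain them.

```python
from collections import defaultdict


class TinyDataHub:
    """Keeps documents as-is and builds a searchable index while ingesting them."""

    def __init__(self):
        self._documents = {}
        self._index = defaultdict(set)  # (canonical field, value) -> document ids

    def ingest(self, doc_id, document, field_aliases=None):
        """Store the original document and index its fields under canonical names."""
        aliases = field_aliases or {}
        self._documents[doc_id] = document
        for field, value in document.items():
            canonical = aliases.get(field, field).lower()
            self._index[(canonical, str(value).lower())].add(doc_id)

    def query(self, field, value):
        """Return the original documents whose indexed field matches the value."""
        ids = self._index.get((field.lower(), str(value).lower()), set())
        return [self._documents[doc_id] for doc_id in sorted(ids)]


if __name__ == "__main__":
    hub = TinyDataHub()
    # Two silos name the same concept differently; the alias map harmonizes them.
    hub.ingest("crm-1", {"Email": "ada@example.com", "Region": "EMEA"},
               {"Email": "email"})
    hub.ingest("shop-7", {"customer_email": "ADA@example.com", "total": 42},
               {"customer_email": "email"})
    print(hub.query("email", "ada@example.com"))
```

Unlike the warehouse case, nothing forces the two sources into one table shape. Only the index is harmonized, which is the distinction drawn above between indexing data and mapping it into a strict structure.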

Use Case

In general, we see the same use cases for a data hub as we would for a data lake: reporting, visualization, analytics, and machine learning.

Data Hub Scheme

Sample technologies used today: MarkLogic

Data Vaults

Definition

A data vault is a system made up of a model, a methodology, and an architecture that is specifically designed to keep solving a complete business problem as requirements change. As your business requirements morph over time, the data vault maintains the historical system of reference (or archive) of your data and easily relates it to the new standard of data that you have defined. I like to think of the data vault as a customized, dynamic solution that gives business users access to all data, both current and historical.

 

Use Case

To me, the glaring use case is auditing, whether for banking, security, logistics, or any number of other reasons you might audit the data in your systems. Let’s say you decide to update your security model to include additional fields and new applications in your enterprise. Using a data vault, you can checkpoint the time when you made the security model changes and update your infrastructure (and all associated applications) with those changes, while the business team continues to receive the full view of historical and current information in the audit trail.
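The security model scenario can be sketched with the hub and satellite tables that data vault modeling is built around. This is a heavily simplified illustration using Python's sqlite3 module: the table names (hub_user, sat_user_security, sat_user_mfa), the columns, and the dates are hypothetical, and a production vault would typically also carry hash keys, record sources, and link tables. The new security field arrives as an additional satellite, so the existing history is never rewritten.

```python
import sqlite3


def build_vault(conn):
    """Create a hub for the business key plus satellites that keep dated history."""
    conn.executescript("""
        CREATE TABLE hub_user (user_key TEXT PRIMARY KEY, load_date TEXT NOT NULL);
        CREATE TABLE sat_user_security (
            user_key TEXT NOT NULL REFERENCES hub_user(user_key),
            load_date TEXT NOT NULL,
            role TEXT NOT NULL,
            PRIMARY KEY (user_key, load_date)
        );
        -- Added when the security model changed; earlier tables stay untouched.
        CREATE TABLE sat_user_mfa (
            user_key TEXT NOT NULL REFERENCES hub_user(user_key),
            load_date TEXT NOT NULL,
            mfa_enabled INTEGER NOT NULL,
            PRIMARY KEY (user_key, load_date)
        );
    """)
    # Made-up sample history for one user.
    conn.execute("INSERT INTO hub_user VALUES ('u-1', '2023-01-01')")
    conn.executemany("INSERT INTO sat_user_security VALUES (?, ?, ?)",
                     [("u-1", "2023-01-01", "analyst"), ("u-1", "2024-03-01", "admin")])
    conn.execute("INSERT INTO sat_user_mfa VALUES ('u-1', '2024-03-01', 1)")


def latest_value(conn, table, column, user_key, as_of):
    """Return the most recent satellite value loaded on or before the as_of date."""
    # table and column come from trusted constants below, never from user input.
    row = conn.execute(
        f"SELECT {column} FROM {table} "
        "WHERE user_key = ? AND load_date <= ? "
        "ORDER BY load_date DESC LIMIT 1",
        (user_key, as_of),
    ).fetchone()
    return row[0] if row else None


def user_as_of(conn, user_key, as_of):
    """Rebuild the audit view of a user as it stood on a given ISO date."""
    role = latest_value(conn, "sat_user_security", "role", user_key, as_of)
    mfa = latest_value(conn, "sat_user_mfa", "mfa_enabled", user_key, as_of)
    return {"role": role, "mfa_enabled": None if mfa is None else bool(mfa)}
```

Asking for user_as_of(conn, "u-1", "2023-06-01") gives the view from before the change, with a role but no MFA information, while a date after 2024-03-01 returns both the updated role and the new field. That is the checkpoint behavior described above: the old and new standards of data sit side by side and are related through the same business key.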

Data Vault Scheme

Sample technologies used today: RDBMS, Redshift, Snowflake

 

Conclusion

I hope that you have learned a little about how we see each of these data models and that you see the value in each of them. There is no one model or technology that I can offer as superior to the others. You must analyze your requirements, needs, and, of course, budget before deciding which approach to use. Technology is constantly evolving, and each of these models will evolve with it.

 
