Data Lakehouse: Everything You Need To Know

Luke Smith
Enterprise Solutions Architect
December 2, 2022
Matt Tanner
Developer Relations Lead
December 2, 2022

In today’s technological climate, the amount of data being generated can be overwhelming. The volume of data created and stored every second seems almost infinite. The benefit of this new era of massive data is that companies can rely on it to drive critical business decisions, improve product offerings, and serve customers better. The more data that is created, captured, and analyzed, the more accurate the insights a business can draw from it. As technology advances to accommodate this ever-growing volume, each day sees the birth of a new term or the evolution of an existing solution. The need for organizations to invest in technologies that capture and store data, and to capitalize on the valuable insights that can be derived from it, has reached an all-time high.

Data warehouses have served the purpose of collating structured data over the years. Early on, these were often built in-house, before vendors progressed to a “platform” approach with out-of-the-box data warehouse products. The data warehouse enabled analytics for Business Intelligence and other use cases that excel when large amounts of data are used to derive insights.

As the technology evolved, organizations began to realize that the data warehouse is inefficient when it comes to storing streaming information that can be used for Data Science and Machine Learning. As unstructured data became more common, data lakes solved the problem of handling large amounts of raw unstructured data, which often makes up the bulk of the data being created and stored. This solved the storage problem, but once the data was in the data lake, it could not be queried for structured operations, like those used in traditional SQL use cases. Therefore, organizations were forced to combine both data architectures to fully utilize the data.

The combination of traditional data warehouses and data lakes solves many of the data processing issues that plagued the platforms when used individually. But, combining the two approaches also gives rise to a lot of operational and data governance complications. This led to the development of a different type of system which could bridge the gap: the data lakehouse.

What Is a Data Lakehouse?

A data lakehouse is a newer data architecture that combines elements of a data warehouse with those of a data lake. This hybrid approach aims to bring together the best characteristics of both while addressing the limitations encountered when the data warehouse and data lake are used as individual solutions.

The data lakehouse architecture is dual-layered: a warehouse layer sits on top of a data lake and can enforce schema-on-write. By using a schema-on-write approach, the data lakehouse provides quality and control over the data to enable efficient BI and reporting capabilities. It leverages the low-cost storage of the data lake to keep large volumes of data in raw format, and the structured features of the data warehouse for data management. The outcome is a single repository for all of the enterprise data across projects such as data science, Business Intelligence (BI), and Machine Learning (ML). Covering this full array of use cases was just not efficiently possible with the data warehouse and data lake technologies when used on their own.

In a nutshell, the data lakehouse system stores and processes diverse data formats at a low cost just like the data lake, and manages and optimizes data for SQL performance just like the data warehouse. The best of both worlds to cover the use case needs of modern organizations.

Data Lakehouse: Simple, Flexible, And Low Cost

A data lakehouse is built to house both structured and unstructured data. This means businesses only need to rely on one data repository instead of requiring the services of both a data warehouse and a data lake.

Previously, organizations relied on both technologies as part of their core data infrastructure and architecture: generally, structured data in the data warehouse was used for BI analytics, while the data lake was leveraged for data science tasks, including Artificial Intelligence (AI) use cases such as Machine Learning. Since the data lakehouse was introduced, these use cases have become accessible within a single platform.

One feature of the data lakehouse is its ability to easily handle unstructured data such as text, images, videos, and audio, and to organize that data into structured forms. This is done by extracting and identifying features from the data so it can be cataloged and indexed, which shows the flexibility of the data lakehouse in resolving business needs. The classification and indexing can be performed with computer vision or natural language processing algorithms that understand and classify the content of a picture, text file, or voice recording dumped as raw data into the lakehouse. This enables teams to move faster in carrying out analysis, all while keeping a low-cost storage system like the data lake.
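
As a rough illustration of this pattern, the sketch below walks a directory of raw image files, runs a classifier over each one, and writes the extracted features to a structured table that can be queried later. The `classify_image` helper, the paths, and the catalog schema are hypothetical stand-ins rather than part of any specific lakehouse API.

```python
# Illustrative sketch: turning raw image files into a structured, queryable index.
# classify_image() is a placeholder for a real computer vision model, and the
# paths and schema are hypothetical.
from pathlib import Path

import pandas as pd


def classify_image(path: str) -> dict:
    """Stand-in for a CV model that extracts a label and confidence from an image."""
    return {"label": "unknown", "confidence": 0.0}


records = []
for image_path in Path("/lake/raw/images").glob("*.jpg"):
    features = classify_image(str(image_path))
    records.append(
        {
            "file_path": str(image_path),
            "label": features["label"],
            "confidence": features["confidence"],
        }
    )

# The structured index can be written back to the lakehouse as Parquet and then
# queried with SQL alongside the raw files it describes.
catalog = pd.DataFrame(records)
catalog.to_parquet("/lake/catalog/image_index.parquet", index=False)
```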

Data lakehouses also provide the most complete and up-to-date data available, are inexpensive to scale, and can be queried with a wide range of tools. This is a major benefit compared to being limited to tools that handle only structured data, such as SQL-based tools.

The Technology Behind Enabling The Data Lakehouse

The data lakehouse has a few technological advancements that enable it to function the way it does. We will discuss three of them in this section, namely:

  • Metadata layers
  • High performance through new query engine designs
  • Optimized access for both data science and machine learning tools

Metadata Layers

The metadata layer of the data lakehouse acts as a middleman between the raw data files and the metadata used to categorize and manage them. It sits on top of open file formats, like Parquet, and keeps track of the files that form part of different table versions. This is what allows the data lakehouse to offer structured management features, such as ACID transactions. The metadata layer is also used for data validation, schema enforcement and evolution, support for data streaming, and access to older versions of a table.
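
As a minimal sketch of what these metadata-layer features look like in practice, assuming Delta Lake is configured in the Spark session (the table path and data are hypothetical):

```python
# Minimal sketch of metadata-layer features using Delta Lake; the table path
# and data are hypothetical, and Delta Lake is assumed to be configured.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("metadata-layer-sketch").getOrCreate()

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Writes go through the Delta transaction log, so they are ACID; appending a
# DataFrame whose schema does not match the existing table fails (schema enforcement).
df.write.format("delta").mode("append").save("/lakehouse/users")

# Time travel: the metadata layer keeps older table versions, which can be read back.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/lakehouse/users")
v0.show()
```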

New Query Engine Designs For High Performance

For the data lakehouse to prove its efficiency, it has to carry out its operations faster than the existing two-tier architecture. This has led to new query engine designs that enable high-performance SQL analysis. These designs include improvements such as caching hot data in RAM or on SSDs and transcoding it into more efficient formats, vectorized execution on modern CPUs, auxiliary data structures such as statistics and indexes, and data layout optimizations that cluster co-accessed data. Together, these techniques make the new query engines effective at delivering high performance on large datasets within the data lakehouse.

Optimized Access For Data Science And Machine Learning Tools

Data scientists and machine learning engineers will find it easy to use the data lakehouse because of the open data formats in which its data is stored. Popular data science and machine learning tools like Pandas, PyTorch, TensorFlow, and Spark DataFrames, amongst others, can read open formats like Parquet and ORC directly. Data lakehouse features like audit history and time travel also help improve reproducibility in machine learning.
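
For example, a data scientist can read lakehouse tables stored as Parquet straight into Pandas or Spark without any export step; the file paths below are hypothetical:

```python
# Reading the lakehouse's open file formats directly with common DS tools;
# the paths are hypothetical placeholders.
import pandas as pd

# pandas (with pyarrow installed) reads Parquet files straight from storage.
events = pd.read_parquet("/lakehouse/events/part-00000.parquet")
print(events.head())

# The same files can be loaded as a Spark DataFrame for distributed feature
# engineering or model training, e.g.:
#   spark.read.parquet("/lakehouse/events/")
```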

History of Data Architectures

Data architectures date back to when information was just beginning to be stored digitally. This gave rise to data warehouses, where structured data could be archived for specific business intelligence and reporting purposes. This served businesses well for decades, and some legacy solutions may still leverage this dated approach. It fell out of favor when unstructured data began to be utilized more and used to derive even further business insights than the structured data found in the typical data warehouse. This led to the advent of data lakes, though with their own limitations.

With the dawn of the big data era, it was only a matter of time before suggestions would be made to combine the best of the two data architectures into one single unit to serve the needs of data scientists, machine learning engineers, and the rest. Let's take a brief look at the history of data architecture and how we arrived at the need for the data lakehouse.

Background On Data Warehouses

Data warehouses have long served the purpose of analyzing structured data for decision-making and business intelligence applications. First proposed in the 1980s, the data warehouse has a long history as a centralized repository that accumulates data from corporate sources like CRMs, RDBMSs, flat files, and more. The output of the data warehouse was predominantly used for data analytics and reporting. Before being loaded, data must be checked against, and conform to, the warehouse’s predefined schema. This structured format allows complex SQL queries to be run against the data.

Data warehouses, though useful, come with limitations. These include high costs as volumes continuously grow, and the inability to handle unstructured data, which in modern organizations forms the bulk of what is produced: audio, video, social media posts, and text documents. Data warehouses also do not lend themselves well to complex data processing such as machine learning and data science tasks. Because the data warehouse is mostly suited to specific tasks involving structured data, the need arose for another data architecture that could address these issues and use cases.

The Invention of Data Lakes

The birth of data lakes came about as the desire grew to handle raw data in a variety of formats in a cost-effective manner, for data science and machine learning purposes. A data lake can be defined as a repository that stores a large amount of raw data in its native formats. The data lake architecture supports data that is unstructured, semi-structured, structured, and in open file formats such as Apache Parquet for big data processing.

Unlike the data warehouse, a data lake does not require the transformation of data before loading since it does not enforce a schema. The lack of transformation during load makes it more flexible and cost-effective compared to a data warehouse. This enables raw information like pictures, videos, and sound recordings to be stored and valuable insights to be harvested from it, unlocking phenomenal amounts of value for organizations. For example, hours of production line video footage can be analyzed to uncover patterns and derive figures for better decision-making.

Unfortunately, data lakes cannot be used like traditional data warehouses for structured analysis, such as business intelligence, because a data lake requires additional tools and techniques to support SQL queries. Data lakes also suffer from performance issues, poor data quality, and inconsistency due to the way data is managed. Because data is often simply dumped into the data lake with little organization, it can stagnate and become difficult to use.

With these challenges, many organizations sought ways to combine the two data architectures to surmount the problem. It was thought this could be done by using a two-tier data architecture model involving one big data lake and multiple purpose-built data warehouses.

Common Two-Tier Data Architecture

A two-tier data architecture involves the extraction, transformation, and loading (ETL) of data from operational databases into a data lake. Once migrated, the data is stored in low-cost object storage, in formats compatible with machine learning tools, and then moved through further ETL processes into a suitable format for loading into a data warehouse for business intelligence and analytics.
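
To make the two hops concrete, here is a rough PySpark sketch of such a pipeline. The JDBC URLs, credentials, storage paths, and table names are all hypothetical placeholders, not a reference implementation:

```python
# Rough sketch of the two ETL hops in a two-tier architecture; connection
# details, paths, and table names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("two-tier-etl").getOrCreate()

# Hop 1: extract from the operational database and land raw data in the lake.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://ops-db:5432/shop")
    .option("dbtable", "orders")
    .load()
)
orders.write.mode("append").parquet("s3://data-lake/raw/orders/")

# Hop 2: reshape the raw data and load it into the warehouse for BI.
daily = (
    spark.read.parquet("s3://data-lake/raw/orders/")
    .groupBy(F.to_date("order_ts").alias("order_date"))
    .agg(F.sum("amount").alias("revenue"))
)
(
    daily.write.format("jdbc")
    .option("url", "jdbc:postgresql://warehouse:5432/analytics")
    .option("dbtable", "daily_revenue")
    .mode("overwrite")
    .save()
)
```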

However, this multi-tier architecture involves many ETL steps. This causes issues for a data team since they have to constantly alternate between the two models to enable BI tools and ML processes across the data. The result is delays, duplicate data, increased complexity, security challenges, and additional operational costs. It requires regular maintenance and can easily lead to data staleness if the data is not kept consistent between the two systems.

Due to the challenges of using a two-tier architecture, another path needed to be forged. This would need to be a system that can offer several improvements in data architectures, data processing, and metadata management to capture data common to all platforms for ML and BI applications. This is the exact reason the data lakehouse was proposed. The data lakehouse addresses problems encountered by the two-tier architecture such as:

  • Reliability - fewer ETL cycles mean fewer points of failure
  • Data staleness - data is readily available for analytics rather than waiting on batch loads
  • Advanced analytics - workloads that are not possible on traditional data warehouses become practical
  • Reduced cost - data is stored in one place instead of being duplicated across several systems

Using CDC to Move Data into a Data Lakehouse

As stated earlier in this piece, the Data Lakehouse combines the best elements of Data Lakes and Data Warehouses. Combining these platforms delivers the data reliability, governance, and performance of a data warehouse with the flexibility, openness, and machine learning support of a data lake. It helps to eliminate redundant ETL jobs and reduce data redundancy, saving the time and effort a data engineer would otherwise spend administering multiple platforms.

The Lakehouse architecture has now been adopted by several proprietary services, such as Databricks Delta Lake and Microsoft’s Azure Synapse Analytics service. Other services, like Google BigQuery and AWS Redshift, support the architecture in a less native way. The Lakehouse is also being implemented in open-source projects such as Apache Iceberg and Apache Hudi.

The effectiveness of Data Lakehouses can be taken even further when Change Data Capture (CDC) is applied. CDC extracts data changes from a source database and ingests the changes into the cloud in near real-time. It moves data continuously and loads only the changed data into the target. This means the data volumes moving to the target are much smaller and faster to load than heavy batch loads.

Using CDC is more effective, reliable, and faster than the traditional batch ingestion of data, therefore, making it a much sought-after method of data ingestion. CDC is a go-to approach for data teams and analysts who want to merge data produced from different sources into a unified source of truth in real-time.

CDC makes data movement smoother and less costly as well: the change feeds are smaller, cheaper, and ingested continuously, so the load is evenly distributed and does not negatively impact the systems involved. For example, online purchases can be analyzed instantly by the marketing team to understand a customer’s behavior once a purchase is made. This data helps improve the customer’s experience and lets the marketing team tweak campaigns with instant recommendations. CDC is also important because data teams, and the teams leveraging the data, can move faster when there is no need to access multiple systems; data is located and accessed via a single unified system.

There are various methods by which CDC can be implemented on a Data Lakehouse to cater to changes in your data, including timestamp-based, trigger-based, diff-based, and transaction-log-based approaches. CDC is very efficient because only the data changes made after the previous replication are captured from the source and synced to the data in the Data Lakehouse.
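
As a conceptual illustration of the timestamp-based approach (this is not Arcion’s or Databricks’ internal implementation), a replication job can poll the source for rows modified since the last sync point; the table, columns, and data below are hypothetical:

```python
# Conceptual sketch of timestamp-based CDC; the table and data are hypothetical.
import sqlite3  # in-memory stand-in for any SQL source system

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "alice", "2022-12-15 09:00:00"), (2, "bob", "2023-01-10 14:30:00")],
)

# Only rows modified after the previous replication point are captured.
last_sync = "2023-01-01 00:00:00"
changed_rows = conn.execute(
    "SELECT id, name, updated_at FROM customers WHERE updated_at > ?",
    (last_sync,),
).fetchall()

# Each changed row would be merged into the lakehouse target, and last_sync
# advanced to the newest updated_at value replicated.
for row in changed_rows:
    print("replicate:", row)  # -> (2, 'bob', '2023-01-10 14:30:00')
```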

The Databricks Lakehouse Platform is one of the most popular Data Lakehouse options available today. It is powered by open-source technologies like Apache Spark, MLflow, and Delta Lake. In the rest of this article, we will explore Delta Lake and show how CDC can be implemented to move data to the Databricks Data Lakehouse.

Delta Lake on Databricks provides database-like features such as ACID transactions, scalable metadata handling, schema evolution and enforcement, and time travel. Some of the highlights of Delta Lake include:

  • Delta Lake allows data to be shared with other organizations regardless of which computing platform they are on.
  • It removes data files that are no longer referenced by the current table state once they are older than the retention threshold.
  • It performs data optimization such as file compaction and ordering (illustrated in the sketch after this list).
  • It caches tables using the Spark nodes' local storage.
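
A short sketch of what the compaction, ordering, and retention clean-up above look like with Delta Lake SQL; the table and column names are hypothetical, and `spark` is assumed to be an active Databricks/Spark session:

```python
# Sketch of Delta Lake maintenance operations; the table and column names are
# hypothetical, and `spark` is assumed to be an active Databricks/Spark session.

# Compaction and data layout optimization: rewrite many small files into larger
# ones and co-locate rows that are frequently queried together.
spark.sql("OPTIMIZE sales ZORDER BY (customer_id)")

# Remove data files no longer referenced by the current table version once they
# are older than the retention threshold (the default is 7 days / 168 hours).
spark.sql("VACUUM sales RETAIN 168 HOURS")
```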

Streaming is part of Databricks’ DNA. End-to-end streaming and CDC are made possible on the Databricks Lakehouse through Spark Structured Streaming and Delta Live Tables (DLT). Delta Lake is built on Apache Spark and is designed to let users incrementally improve the quality of data in their Lakehouse until it is ready for consumption. Apache Spark has native streaming with DataFrames, but Delta has taken streaming and CDC further with the addition of live tables.

DLT takes care of the full orchestration of streaming with the Databricks Lakehouse. Its operations include spinning up clusters, watermarking, backpressure handling, CDC, data quality (DQ) checks, version control, and more. With DLT, you can automatically create high-quality lineage, perform data quality checks, and recover pipelines. Built-in features can drop, alert on, or quarantine incorrect data if something does not match expectations. Data quality metrics are also captured, tracked, and reported as data is moved into the platform.
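
A minimal sketch of what a DLT pipeline with a data quality expectation can look like; it only runs inside a Databricks DLT pipeline, and the landing path and table names are hypothetical:

```python
# Minimal Delta Live Tables sketch; runs only inside a Databricks DLT pipeline.
# The landing path and table names are hypothetical.
import dlt
from pyspark.sql import functions as F


@dlt.table(comment="Raw click events streamed into the bronze layer")
def bronze_clicks():
    return (
        spark.readStream.format("cloudFiles")      # Auto Loader picks up new files
        .option("cloudFiles.format", "json")
        .load("/landing/clicks/")
    )


@dlt.table(comment="Cleaned clicks promoted to the silver layer")
@dlt.expect_or_drop("valid_user", "user_id IS NOT NULL")   # DQ check: drop bad rows
def silver_clicks():
    return dlt.read_stream("bronze_clicks").withColumn(
        "ingested_at", F.current_timestamp()
    )
```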

DLT also supports merge operations and log versioning from multiple sources. This is done using the Change Data Feed (CDF) feature, which allows you to efficiently update and delete records. CDF identifies changes and exposes only the rows that changed between versions, which makes downstream consumption of Merge, Update, and Delete operations far more efficient.

When using the CDF feature, raw data from different sources is inserted into the bronze table, then filtered, cleaned, and augmented in the silver table. Finally, aggregated values are computed in the gold table based on the changes to the data in the silver table. To have the CDF feature available on a table, you must enable it on the bronze table at creation. Alternatively, you can enable it later as an update to the table, or on a cluster for all tables created by that cluster. More information about activation can be found in the Databricks documentation.
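
As a sketch, enabling and then consuming the change feed with Delta Lake looks roughly like this; the table name is hypothetical and `spark` is assumed to be an active Databricks/Spark session:

```python
# Sketch of enabling and consuming the Change Data Feed; the table name is
# hypothetical and `spark` is assumed to be an active Databricks/Spark session.

# Enable CDF on an existing bronze table (it can also be set at creation time,
# or cluster-wide via spark.databricks.delta.properties.defaults.enableChangeDataFeed).
spark.sql(
    "ALTER TABLE bronze_orders SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
)

# Read only the rows that changed since a given table version; each row carries
# a _change_type column (insert, update_preimage, update_postimage, delete).
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)
    .table("bronze_orders")
)
changes.show()
```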

Enabling CDC with Arcion

This section of the guide shows you how to enable CDC with Arcion so you can push data to the Databricks Data Lakehouse, further explained with an example outlining each step. If you have streaming data from entities and attributes that you want to tie to the rest of your data model, you can do this via Arcion. Arcion can use data sources such as Oracle, SQL Server, Teradata, Netezza, and many others to move data to the lakehouse. Once the data is in the Databricks Lakehouse instance, you can query it and take actions based on it to help inform decision-making or other use cases.

Arcion is a no-code, real-time CDC platform that completes the Lakehouse architecture. This combination provides unified data warehousing for streaming, BI, and ML on the Lakehouse. Using Arcion to enable real-time replication to the Lakehouse opens up use cases that were not previously possible. With Arcion and the Databricks Data Lakehouse, you can:

  • perform fraud analytics,
  • detect anomalies in IoT sensor data,
  • improve the gaming experience for gamers,
  • perform predictive maintenance,
  • run an efficient stock trading platform,
  • analyze large volumes of streaming IoT data from millions of sensors to harness actionable insights

All of this can be done while analyzing 25,000 events per second.

Arcion accelerates data movement across applications, traditional transactional databases, and cloud platforms. This is done through high-performance, high-quality, and auto-scalable data pipelines. Arcion can be used to integrate data from diverse platforms in a zero code, high-efficiency, cloud-native way and move it to the Databricks Lakehouse to empower users to build applications at scale.

Arcion CDC is built on three pillars: transactional integrity, scalability, and low data latency. Arcion's CDC methods include:

  • Log-based: Supports all OLTP sources like Oracle, MySQL, Postgres, SQL Server, MongoDB, Informix, Cassandra, etc.
  • Timestamp-based: Requires a timestamp column per table that is updated on every modification. It supports data warehouses like Teradata, BigQuery, etc.
  • Checksum-based: If the first two are not possible, Arcion uses this method to carry out CDC. It requires no setup and computes checksums of the source and target tables to determine whether there is any difference, then applies the differences to the target (a conceptual sketch follows this list).
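
To illustrate the idea behind the checksum-based method (a conceptual sketch only, not Arcion’s actual implementation), rows can be hashed on both sides and compared; the sample rows are hypothetical:

```python
# Conceptual illustration of checksum/diff-based change detection; this is not
# Arcion's implementation, and the sample rows are hypothetical.
import hashlib


def row_checksum(row: tuple) -> str:
    """Hash a row's values so source and target rows can be compared cheaply."""
    return hashlib.md5("|".join(map(str, row)).encode()).hexdigest()


source_rows = {1: (1, "alice", "NY"), 2: (2, "bob", "SF")}
target_rows = {1: (1, "alice", "NY"), 2: (2, "bob", "LA")}

# Keys whose checksums differ (or that are missing on the target) are the
# changes that need to be applied to bring the target back in sync.
changed_keys = [
    key
    for key, row in source_rows.items()
    if key not in target_rows or row_checksum(row) != row_checksum(target_rows[key])
]
print(changed_keys)  # -> [2]
```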

Next, let’s look at how to set up an end-to-end CDC pipeline from Oracle to the Databricks Delta Lakehouse using Arcion. This gives you an overview of how to perform an initial snapshot load and real-time data replication. Data from Oracle will be replicated, and changes will show up instantly in the Databricks Lakehouse. Below is a step-by-step walkthrough of how to get it all working.

  1. Log in to Arcion Cloud at arcion.io.
  2. On the Arcion page, click New Replication, found at the top left corner of the screen, as shown in the diagram below.
  3. On the New page, the first step is General, where you will set up the replication.

This is done by carrying out the following steps, seen in the picture above:

  • Set up a new Name.
  • Select the Replication mode:
      ◦ Snapshot: for the initial data load
      ◦ Realtime: for CDC to pick up live changes
      ◦ Full: for an initial snapshot load followed by real-time replication
  • Select the Write mode:
      ◦ Appending: used to append or add more data
      ◦ Replacing: used to define or replace the data schema
      ◦ Truncating: used to truncate data while keeping the table definitions
  • After setting up the replication, click Next.
  • Clicking Next takes you to the Source page.

Here you can select a default data source or create your own source using Use an extractor. In this example, Oracle RDS is the source, as seen in the picture above.

  • Click Continue to Destination after specifying your source.
  • On the Destination page, select a destination for your data. In this case, Databricks.

To know if the database connection worked between the Source and Destination, click on Test the Connection.

  • When this is confirmed, click Continue to Filter.
  • The Filter page displays the schema information of the table, which you can filter as you desire.
  • Click Start Replication when you are done filtering for the replication to commence.
  • The Mapper function page displays next. You can rename tables and columns, and change the data types of tables and columns from this page. It appears as a dashboard.
  • Use the View Details tab on the Mapper page to see the progress of the replication, and the Back to Summary tab on the View Details page to return to the Mapper page, as shown below.

After this setup, if there is a change in the source table on Oracle, it is automatically picked up and reflected in the target Databricks instance through the Arcion service connection. The View Details page shown above can be used to observe any such changes.

Earlier on, it was mentioned that the Databricks Data Lakehouse improves the CDC and streaming process by using Delta Live Tables. The example above simply inserted the data into a bronze table. Using the CDF feature of Delta Live Tables, that data can now be promoted from bronze to silver and, subsequently, from silver to gold, enhancing your streaming and live data in real time in your Databricks environment.

Conclusion

This article has explained the data lakehouse concept: a combination of the best features of the data warehouse and the data lake. Data lakehouses can handle both structured and unstructured data for Business Intelligence, AI, and ML. As the architecture becomes increasingly popular, more organizations will adopt it as a further step in their own data-driven evolution.

Despite all of its promise, it is worth noting that the data lakehouse is still a concept in its infancy; some drawbacks may surface along the way, and a considerable amount of work remains before it sees wide adoption among data engineers. That being said, many major players within the data platform space have already begun to build production-ready versions of this concept.

Finally, we took a look at CDC and a data lakehouse in action by using Arcion, Oracle, and Databricks Data Lakehouse. We covered the benefits of Arcion as well as a step-by-step implementation to get you started on the platform.

Matt is a developer at heart with a passion for data, software architecture, and writing technical content. In the past, Matt worked at some of the largest finance and insurance companies in Canada before pivoting to working for fast-growing startups.
Luke has two decades of experience working with database technologies and has worked for companies like Oracle, AWS, and MariaDB. He is experienced in C++, Python, and JavaScript. He now works at Arcion as an Enterprise Solutions Architect to help companies simplify their data replication process.