The amount of data being produced in recent years has increased dramatically alongside the desire to gain insights from data. The data management practices of an organization can now serve as a competitive advantage in customer acquisition and retention, business development, operational expertise, and more.
There has been an increase not only in the volume of data being generated but also in its velocity. The velocity of data refers to the speed at which data is generated, distributed, and collected, and how quickly it changes. High-velocity data is now commonplace, as most user-facing applications and enterprise reporting tools expect access to large amounts of real-time or near real-time data. With this fundamental change in how data is produced and consumed, several challenges need to be surmounted to make sure that the data generated serves the needs of the organization.
This article will focus on the data engineering concept of Change Data Capture, often shortened to “CDC”. In this article you will be introduced to:
- The concept of Change Data Capture
- How CDC fits into a data pipeline
- CDC’s role within Extract, Transform, Load (ETL) techniques
- How CDC works in real-world scenarios
- The benefits to businesses or organizations that adopt CDC
By the end of the article, you should have a firm understanding of Change Data Capture, its components, and the role it can play in managing rapidly changing data.
What is a CDC Data Pipeline?
Before you can understand Change Data Capture and its role in a data pipeline, it’s important to first understand what constitutes a data pipeline. A data pipeline can be defined as a series of connected steps for data processing whereby the output of one step serves as a direct input to the next. Generally, this is part of moving data from a source database into a target database or big data platform. Each data processing element in the chain is responsible for a particular action, and the pipeline can span everything from the ingestion of data, through its transformation, to its storage at a final destination. Data pipelines can be unique in their setup and can serve different applications in the stack. A data pipeline is usually a fundamental part of most data replication and data migration solutions.
A data pipeline can also be set up to be triggered by events. A CDC data pipeline can be event-based, in that data changes in the source system trigger the replication or extraction of data. Once an event occurs, the process of transforming the data and eventually storing it at a destination, such as a data lake or data warehouse, is kicked off. The way to think about it is that whenever data changes, it moves through a series of processing steps from the source to the destination. In the simplest terms, data pipelines provide an efficient way to move data from one location to another.
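The idea above can be sketched in a few lines of Python. This is a minimal, illustrative pipeline, not any specific CDC product: each change event flows through a chain of steps where the output of one step is the direct input to the next.

```python
# A minimal sketch of an event-triggered pipeline. The step functions and
# the shape of the change event are illustrative assumptions.

def extract(event):
    # Pull the changed record out of the raw change event.
    return event["record"]

def transform(record):
    # Normalize field names before loading.
    return {k.lower(): v for k, v in record.items()}

def load(record, destination):
    # Append the processed record to the destination store.
    destination.append(record)

def on_change(event, destination):
    # The pipeline: the output of one step is the input of the next.
    load(transform(extract(event)), destination)

# Whenever data changes at the source, the handler fires and the record
# moves through the chain into the destination.
warehouse = []
on_change({"record": {"ID": 1, "Name": "Ada"}}, warehouse)
```

In a real deployment, `on_change` would be wired to a database trigger, a message queue, or a change-log reader rather than called by hand.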
What is Change Data Capture?
Change Data Capture is a method used to detect changes in data and apply those changes to a different data store. When a change occurs in a source system, such as a database, action is taken by a process or another system to store or replicate those changes. Change Data Capture can be used for the replication of data, from one system to another, such as a big data platform, or to ensure that a source system and a target system are in sync.
Change Data Capture is a particularly important data replication and data migration tool for real-time data systems where analysis is required to be performed on recent data. Real-time predictive analytics use cases require the ability for changes in data to be replicated into a central data store to derive insights. Real-time recommendation engines are another application for CDC where instant access to changes in data can be relevant in bringing value to customers.
Without such an approach, stale data will be used and the insights derived could be inaccurate, potentially with major business impacts. In addition to analysis, Change Data Capture can serve as a migration/backup solution for huge volumes of data. Instead of data being periodically migrated, with the inherent risks associated with such latency, both systems can stay in sync in real time.
There are two main approaches to implementing Change Data Capture, namely, pull-based and push-based. Both approaches have benefits and drawbacks which are unique to each solution.
In the pull-based approach to Change Data Capture, it is the responsibility of the target (destination) system to periodically pull updated data from the source system. Essentially, the target polls the source for changes at set intervals of time rather than in response to changes in the data itself. This makes it the less efficient and less preferred approach when CDC needs to happen in real time.
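A pull-based flow might look like the following sketch, where the target tracks the highest version (or timestamp) it has seen and asks the source for anything newer on a fixed schedule. The row shape and version column are illustrative assumptions.

```python
import time

def poll_changes(source_rows, last_version):
    """Return rows whose version is newer than the last one the target saw."""
    return [r for r in source_rows if r["version"] > last_version]

def run_polling_loop(source_rows, target, interval_s=0.0, cycles=3):
    """Pull-based CDC: poll at fixed intervals, regardless of activity."""
    last_version = 0
    for _ in range(cycles):
        for row in poll_changes(source_rows, last_version):
            target[row["id"]] = row  # apply the change to the target copy
            last_version = max(last_version, row["version"])
        time.sleep(interval_s)  # wait out the interval even if nothing changed
    return last_version

source = [
    {"id": 1, "version": 1, "name": "Ada"},
    {"id": 2, "version": 2, "name": "Grace"},
]
target = {}
final_version = run_polling_loop(source, target)
```

Note the drawback baked into the loop: changes made between polls sit unseen until the next interval, which is exactly the latency that makes polling a poor fit for real-time CDC.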
In a push-based system, the source system broadcasts any changes to data, and the target system, which is subscribed to those broadcasts, performs actions that update its own copy of the data. The push-based approach to Change Data Capture (CDC) experiences less latency, as the target system is aware of changes the moment they occur. It also works well when there are multiple destinations the data will be pushed to. This is extremely relevant today, when organizations tend to use multiple platforms for storage, analytics, and insights.
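The push-based, multiple-destination pattern can be sketched as a simple publish/subscribe arrangement. The class and change format here are illustrative, standing in for a real message broker or replication stream.

```python
# A minimal sketch of push-based CDC: the source broadcasts each change and
# every subscribed target applies it immediately. Names are illustrative.

class ChangeBroadcaster:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, apply_change):
        # Register a target's change handler.
        self.subscribers.append(apply_change)

    def publish(self, change):
        # Push the change to every target the moment it occurs.
        for apply_change in self.subscribers:
            apply_change(change)

# Two destinations subscribed to the same change stream, as is common
# when one source feeds several platforms.
warehouse, search_index = {}, {}
source = ChangeBroadcaster()
source.subscribe(lambda c: warehouse.update({c["id"]: c["row"]}))
source.subscribe(lambda c: search_index.update({c["id"]: c["row"]}))

source.publish({"id": 1, "row": {"name": "Ada"}})
```

Because targets react the moment `publish` runs, there is no polling interval and therefore no built-in replication lag.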
What is Change Data Capture in ETL?
Extract, Transform, Load (ETL) can be defined as a process in the data integration workflow. ETL consists of three distinct steps:
- Extraction - getting data from one or multiple sources
- Transformation - processing that data into a suitable format through cleaning, binning, applying calculations, concatenations, removal of outliers, etc.
- Loading - storing the transformed data in a data warehouse or some other target system.
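The three steps above can be sketched end to end in Python, with in-memory lists standing in for real source systems and a data warehouse. All names and the cleaning rules are illustrative.

```python
def extract(sources):
    # Extraction: gather records from one or more sources.
    return [row for source in sources for row in source]

def transform(rows):
    # Transformation: clean and normalize the extracted records
    # (here: strip whitespace, drop records missing a value).
    cleaned = []
    for row in rows:
        if row.get("value") is None:
            continue  # discard incomplete records
        cleaned.append({"name": row["name"].strip(), "value": row["value"]})
    return cleaned

def load(rows, warehouse):
    # Loading: store the transformed rows in the target system.
    warehouse.extend(rows)

# Data from multiple sources, collated and normalized into one target.
crm = [{"name": " Ada ", "value": 10}]
billing = [{"name": "Grace", "value": None}]
warehouse = []
load(transform(extract([crm, billing])), warehouse)
```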
Extract, Transform, Load (ETL) is a vital component of the data engineering process wherever insights are sought from data. The data generated by an enterprise usually comes from multiple sources and is stored in different systems. An ETL process is how data can be collated, normalized, and loaded into an output container for further analysis or long-term storage. From this requirement, it is clear that the way ETL is implemented in an organization must accommodate a Change Data Capture component. If it does not, the stored data can get stale very quickly and suffer from data quality and data integrity issues, eventually leading to consequences for the business or missed opportunities.
A good approach to handling this problem is to ensure that the ETL process is triggered whenever a change occurs in the various sources of data. This can be done by incorporating a CDC-enabled data pipeline anywhere data needs to be sent to a target system in real time.
How Does Change Data Capture Work?
In this section, you will look at some actual methods for implementing Change Data Capture. To simplify this section, assume that there are two database systems: the first is called the source system and the other is the target (or destination) system. The source system is the primary database that contains the data whose changes you want to track. The data from the source system will be replicated into the target system, and through this process, both databases will be kept in sync.
Depending on the type of implementation, the target/destination system is supposed to capture changes immediately after they occur. In a system that does not require real-time replication, this could also be done within a specified timeframe instead of instantaneously. The focus of the implementation of Change Data Capture can either be on the source system or the target system, depending on factors such as whether the CDC solution is pull-based or push-based.
If the focus is on the source system and it is query-based, it typically involves executing SQL statements at the source. This may happen after scanning records on the database. The disadvantage of this approach is that it consumes more compute resources and queries may happen on a production database. This may lead to performance issues. This is especially true if the solution is trying to serve a data-intensive application.
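A common form of this query-based approach is to scan for rows changed since the last sync point, using a last-modified timestamp column. The sketch below uses an in-memory SQLite database; the `orders` schema and timestamps are illustrative assumptions.

```python
import sqlite3

# Query-based CDC at the source: periodically run a SQL query that scans
# for rows updated since the last check. Note that this query runs against
# the (possibly production) source database, which is the performance
# drawback described above.

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL, updated_at TEXT)"
)
conn.execute("INSERT INTO orders VALUES (1, 9.99, '2024-01-01T10:00:00')")
conn.execute("INSERT INTO orders VALUES (2, 19.99, '2024-01-02T12:00:00')")

def fetch_changes(conn, since):
    # The CDC query: scan for records modified after the last sync point.
    cur = conn.execute(
        "SELECT id, total, updated_at FROM orders WHERE updated_at > ?",
        (since,),
    )
    return cur.fetchall()

# Only the order updated after the last sync point is returned.
changed = fetch_changes(conn, "2024-01-01T23:59:59")
```

In practice the query would run on a schedule and the caller would persist the newest `updated_at` value it has seen, to use as the next `since` argument.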
On the other hand, if the focus of the implementation is on the target database and it is event-driven, the change logs on the source database can be read and actions can be performed on the target to reflect the changes recorded in the source system’s log files. Specifically, this approach is known as “log-based CDC”.
Since Change Data Capture (CDC) is a software design pattern, it can be implemented at the application level or its implementation could rely solely on the database system in use.
An application-level implementation could involve having a column that tracks changes to specific row records in the database schema. Whenever a record is updated, the audit column which is used to track changes is also updated. The application will then determine according to its custom logic how to react to that change.
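A sketch of that application-level pattern: the schema carries an audit column (here a `version` counter) that the application bumps on every update, and the application's custom logic compares versions to decide which rows changed. This is purely illustrative and not tied to any specific framework.

```python
def update_record(table, record_id, fields):
    # The application updates a record and bumps its audit column.
    row = table[record_id]
    row.update(fields)
    row["version"] += 1  # audit column: incremented on every change

def changed_since(table, versions_seen):
    # Custom logic: react to rows whose audit column has moved past the
    # version last processed for that row.
    return [
        rid for rid, row in table.items()
        if row["version"] > versions_seen.get(rid, 0)
    ]

users = {1: {"name": "Ada", "version": 1}}
seen = {1: 1}          # versions already processed downstream
update_record(users, 1, {"name": "Ada L."})
changed = changed_since(users, seen)
```

A timestamp column such as `last_modified` works the same way as the integer counter shown here.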
For an implementation that is based on the capabilities of the database, the best solution is to rely on the change logs that record every transaction that happens on the database. That way, the target system can scan the transaction (change) logs and replicate the transaction events that are not yet accounted for in its copy of the data. Log-based CDC has the advantage of having minimal impact on the production database; it also does not need administrative privileges on the database itself, only access to the logs. Using the change logs for CDC is dependent on the database in question, and the details differ from one database system to another.
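The log-scanning idea can be sketched as follows: the target replays only the log entries it has not yet applied, tracked by a log position (offset). The log format here is an illustrative assumption; real databases each expose their change logs differently.

```python
# A minimal sketch of log-based CDC: scan the source's transaction log and
# replay the entries the target has not yet accounted for.

def apply_log(change_log, target, last_offset):
    """Replay log entries from last_offset onward; return the new offset."""
    for offset, entry in enumerate(change_log):
        if offset < last_offset:
            continue  # already replicated in an earlier scan
        op, key, value = entry
        if op in ("INSERT", "UPDATE"):
            target[key] = value
        elif op == "DELETE":
            target.pop(key, None)
    return len(change_log)  # resume point for the next scan

# An illustrative change log recording three transactions on the source.
log = [("INSERT", 1, "alice"), ("INSERT", 2, "bob"), ("DELETE", 1, None)]
target = {}
offset = apply_log(log, target, 0)
```

Because the target only reads the log, the production database itself does little extra work, which is the minimal-impact property noted above.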
Benefits of Change Data Capture for Business
The benefits of Change Data Capture for businesses that understand the importance of data and data integration are numerous. The most obvious benefit of CDC for any business in the midst of a data migration project is the replication or migration of data, likely in real-time. This can unlock many use cases for the business that are not possible without the ability to analyze data in real-time.
For a digital transformation, where data is being moved from a legacy system to a new system, CDC can be a crucial part of your data migration strategy. Data migration is usually one of the largest challenges in digital transformation since it would normally require databases to be taken offline before the data could be extracted and loaded into the target database. With CDC, a database no longer needs to be taken offline to migrate its data.
CDC can serve as a tool to back up data to avoid loss. Data can be kept in sync across various systems to improve the reliability and availability of that data. This can also help as part of a business continuity plan.
Another benefit of CDC for businesses is having data constantly updated and stored in data warehouses for further analysis by Business Intelligence tools or as an input process to an AI/machine learning pipeline. Any insights derived from such data will be up to date and accurate.
Finally, a well-implemented Change Data Capture solution can lead to cost savings in terms of the amount spent on provisioning infrastructure for data handling and database migration. For organizations with large volumes of data, a naive approach, such as creating full snapshots of data, can incur significant costs over time. With CDC, only the segments of data that have changed need to be considered.
How does it look with Arcion?
Arcion goes beyond the basics of CDC. Supporting many of the most popular databases, data warehouses, and other tools, Arcion enables log-based CDC that goes further than just keeping data in sync. Arcion can also capture DDLs, DMLs, schema changes, and several other non-DML changes. This makes for robust CDC pipelines that are easy to create and support, all with no code required.
Arcion also supports other types of CDC including:
- Delta-based Change Data Capture: by computing incremental delta snapshots using a timestamp column in the table, Arcion is able to track modifications and convert them to operations in the target.
- Checksum-based Change Data Capture: Arcion can perform an intelligent checksum to compare the rows and columns in the source and the target. It then adds the missing rows in the target table.
Arcion also supports both self-managed, on-premise deployments and fully-managed environments through Arcion Cloud. With almost unlimited flexibility, Arcion delivers real-time data wherever and whenever you need it.