The amount of data being produced in recent years has increased dramatically alongside the desire to gain insights from data. The data management practices of an organization can now serve as a competitive advantage in customer acquisition and retention, business development, operational expertise, and more.
There has been an increase not only in the volume of data being generated but in its velocity. The velocity of data refers to the speed in which data is generated, distributed, collected and how quickly data changes. Data with a high velocity is now commonplace as most user-facing applications and enterprise reporting tools expect access to large amounts of real-time or near real-time data. With this fundamental change in how data is expressed, several challenges need to be surmounted to make sure that the data generated serves the needs of the organization.
This article will focus on the data engineering concept of Change Data Capture, often shortened to “CDC”. In this article you will be introduced to:
- The concept of Change Data Capture
- How CDC fits into a data pipeline
- CDC’s role within Extract, Load, Transform (ETL) techniques
- How CDC works in real-world scenarios
- The benefits to businesses or organizations that adopt CDC
By the end of the article, you should have a firm understanding of Change Data Capture, its components, and the role it can play in managing rapidly changing data.
What is Change Data Capture?
Change Data Capture is a method used to detect changes in data and apply those changes to a different data store. When a change occurs in a source system, such as a database, action is taken by a process or another system to store or replicate those changes. Change Data Capture can be used for the replication of data, from one system to another, such as a big data platform, or to ensure that a source system and a target system are in sync.
Change Data Capture is a particularly important data replication and data migration tool for real-time data systems where analysis is required to be performed on recent data. Real-time predictive analytics use cases require the ability for changes in data to be replicated into a central data store to derive insights. Real-time recommendation engines are another application for CDC where instant access to changes in data can be relevant in bringing value to customers.
Without such an approach, stale data will be used and the insights derived could potentially be inaccurate. This could potentially have major business impacts. In addition to analysis, Change Data Capture can serve as a migration/backup solution for huge volumes of data. Instead of data being periodically migrated, with the inherent risks associated with such latency, both systems can stay in sync in a real-time fashion.
There are two main approaches to implementing Change Data Capture, namely, pull-based and push-based. Both approaches have benefits and drawbacks which are unique to each solution.
In the pull-based approach to Change Data Capture, it is the responsibility of the target (destination) system to periodically pull updated data from the source system. This approach is not efficient and is the less preferred approach when it comes to doing CDC in real-time. Essentially done by polling the source for changes, this is usually done in set intervals of time and isn’t based on changes to the data on the source.
In a push-based system, the source system broadcasts any changes to data, and the target system which is subscribed to those broadcasts performs actions that updates its own copy of the data. The push-based approach to Change Data Capture (CDC) experiences less latency as the target system is aware of changes the moment they occur. It also works well when there are multiple destinations for where the data will be pushed to. This is extremely relevant in modern times where organizations tend to use multiple platforms for storage, analytics, and insights.
Change Data Capture in ETL
Extract, Transform, Load (ETL) can be defined as a process in the data integration workflow. ETL consists of 3 distinct steps, including:
- Extraction - getting data from one or multiple sources
- Transformation - processing that data into a suitable format through cleaning, binning, applying calculations, concatenations, removal of outliers, etc.
- Loading - storing the transformed data in a data warehouse or some other target system.
Extract, Transform, Load (ETL) is a vital component of the data engineering process where insights are sought from data. The data generated by an enterprise is usually from multiple sources and are stored in different systems. An ETL process is how data can be collated, normalized, and loaded into an output container for further analysis or long-term storage. It is within this requirement, that it is clear to see that if the process around how ETL is implemented in an organization must accommodate a Change Data Capture component. If it does not, the data stored can get stale very quickly and suffer from data quality and data integrity issues, eventually leading to consequences to the business or missed opportunities.
A good approach to handling this problem is to ascertain that the ETL process is triggered whenever a change occurs in the various sources of data. This can be done by incorporating a CDC-enabled data pipeline in any place where data needs to be sent to a target system in real-time.
Change Data Capture Methods
Different methods can be used to implement CDC. Each comes with its own advantages and disadvantages, some are easier to implement while others may be more flexible. Let’s take a look at the 4 most common ways to implement Change Data Capture below.
Using Audit Columns to implement CDC is one way you can create Change Data Capture at the application level. Usually, this is done by adding “LAST_UPDATED” or “DATE_MODIFIED” timestamp columns. Then, retrieving the latest data is as simple as querying for the rows that have been changed since the data was last extracted. The advantage of using Audit Column CDC is that it can be built directly into the application which requires the data. It can all be done without needing to use any other tools since it is all handled between the database and the consuming application.
The disadvantages of this method are that it does require the database to handle extra load from the constant querying/scanning for changes in the data. Also, when data is deleted the tracking column in the row will also be deleted. This means that it can be tricky to propagate deletes to the target system or application. This can be customized in order to create a workaround for this, such as adding a deleted flag or a recently_deleted table but it is still not optimal. There are also many concerns with data consistency that comes with implementing this method of CDC.
The table delta or ‘tablediff’ approach to CDC is when a tool or script is used to compare the data in two tables to spot the differences. Then you can use additional scripts to create DML statements and apply the deltas from the source table to the target. The advantage of this approach is that you can implement it by simply adding a few native SQL scripts to find changes in the data. You can then use further SQL scripts to write the changes to the target database.
The disadvantage of this approach is that you will require three copies of all data in the database the original data, the previous snapshot, and the current snapshot. This extra storage will increase the storage costs for the data. This also means that it does not scale well when you have applications with really heavy transactional workloads.
The amount of CPU required to identify and apply the differences to a target increases in a linear fashion as more data volume is moved into storage. This approach also does not accommodate real-time use cases because of the amount of latency that appears when implementing a table delta CDC approach.
This method has been around for decades. Trigger-based CDC captures all insert, update and delete operations performed on tables or databases. For every insert, update and delete statement, a trigger is fired which captures the data manipulation language (DML) statement.
This can require a lot of work since trigger-based CDC requires database triggers to be created in order to identify the changes that have occurred in the source system. When the trigger is fired, the trigger will capture the changes and write the changes into another table, typically called a shadow or staging table.
Trigger-based CDC is easy to implement but can require a fair amount of work due to the manual process of creating the triggers for each table. This method of CDC can capture the entire state of the transaction and is also highly customizable.
Some disadvantages of using triggers to enable change data capture should also be called out. As mentioned previously, a big disadvantage is the setup of individual triggers for each table.
This can lead to costly implementation and management overhead in the case of a large source database. Another challenge with triggers is that for large transactions, there is a significant overhead of doing multiple writes to a database for every insert, update or delete.
Also, to apply the changes in the target database, the replication tool needs to connect to the source database at regular intervals; that can put additional load on the source database system and impact performance.
Log-based CDC, as mentioned earlier in our article, works by reading the transaction logs of the source database. By reading the database’s log, you get the complete list of all data changes in their exact order of application.
However, every database uses its own custom format to write redo log records to the transaction log. This means that a customized solution needs to be built to read those formats from the log and convert them into logical transactions.
The transactions then need to be pushed into DML statements that could be written to a target system. Therefore, there is a lot of engineering effort required to build a solution that enables a log-based CDC process to read a transaction log.
Log-based CDC comes with many advantages over other CDC methods. First, you can get a lot of additional metadata from the transaction log, e.g., a transaction identifier that is useful to ensure that the replication has the ability to resume upon crash. Second, in log-based CDC, there is no connection being made to the source database, and no extra queries are performed on the source system, thus making it a zero-impact solution.
Some log-based CDC solutions, such as Arcion, also offer agentless log-based CDC where no CDC agent even needs to be installed on the database server. The ability to not impact the performance of the source database is extremely important for large production systems which are likely running at full capacity and cannot bear any extra overhead.
Benefits of Change Data Capture
The benefits of Change Data Capture for businesses that understand the importance of data and data integration are numerous. The most obvious benefit of CDC for any business in the midst of a data migration project is the replication or migration of data, likely in real-time. This can unlock many use cases for the business that are not possible without the ability to analyze data in real-time.
For a digital transformation, where data is being moved from a legacy system to a new system, CDC can be a crucial part of your data migration strategy. Data migration is usually one of the largest challenges in digital transformation since it would normally require databases to be taken offline. Once offline, then the data would be extracted and loaded into the target database. With CDC, the requirement for a database to be offline to migrate data is no longer required.
CDC can serve as a tool to back up data to avoid loss. Data can be kept in sync across various systems to improve the reliability and availability of that data. This can also help as part of a business continuity plan.
Another benefit of CDC for businesses is having data constantly updated and stored in data warehouses for further analysis by Business Intelligence tools or as an input process to an AI/machine learning pipeline. Any insights derived from such data will be up to date and accurate.
Finally, a well-implemented Change Data capture solution can lead to cost savings in terms of the amounts spent on provisioning infrastructure for data handling and database migration. For organizations with large volumes of data, a native approach such as creating snapshots of data, can incur significant costs over time. With CDC, only segments of data need to be considered.
How Does It Look With Arcion?
Arcion goes beyond the basics of CDC. Supporting many of the most popular databases, data warehouses, and other popular tools, Arcion enables log-based CDC that goes further than just keeping data in sync. Arcion can also capture changes including DDLs, DMLs, schema changes, and several other non-DML changes. This makes for robust CDC pipelines that are easy to create and support, all with no-code required.
Arcion also supports other types of CDC when the source databases' transactions logs aren't available. These alternative CDC methods includes:
- Delta-based Change Data Capture: This is a way of doing audit column-style CDC by computing incremental delta snapshots using a timestamp column in the table, Arcion is able to track modifications and convert that to operations in target.
- Checksum-based Change Data Capture: This is a way of implementing table delta/”tablediff”-style CDC. Arcion can perform an intelligent checksum to compare the rows and columns in the source and the target. It then adds the missing rows in the target table. Learn more about Arcion’s low-latency Change Data Capture.
Arcion also supports both self-managed, on-premise deployments as well as the ability to run in a fully-managed environment while using Arcion Cloud. With almost unlimited flexibility, Arcion delivers real-time data wherever and whenever you need it.