Business transactions stored in relational databases are essential to gaining an accurate understanding of overall business operations. Without this data, organizations would be unable to effectively act on the insights generated by their analytics. This is why companies replicate data to data warehouses for analytical purposes.
Time-sensitive data replication is increasingly important in cloud migrations, which come with unique challenges: data must be moved and kept up to date without disrupting the applications that access it. In the past, many businesses relied on batch data transfers run once or a few times a day, which is far too slow for migrating data in real time. To migrate and replicate data in a timely manner, organizations must leverage optimized cloud storage platforms and modern technologies designed specifically to handle this type of big data traffic while ensuring safe data transmission with no downtime.
Change Data Capture (CDC) is becoming increasingly popular in today's data-driven markets because it is fast, cost-effective, and reliable. CDC enables companies to move data almost instantaneously from relational databases (like SQL Server or Oracle) to data warehouses, data lakes, or other databases. This makes it an ideal solution for near real-time analytics on current operations and for cloud migration projects. There are four common CDC methods: audit columns, table deltas, triggers, and log-based capture. In this article, we will cover each of these four methods.
What is Change Data Capture?
Change Data Capture is an incredibly useful software process: it identifies and selectively tracks changes to core data within databases. CDC offers real-time or near-real-time data movement as new changes are made in the database.
In many industries today, data arrives at an ever-increasing pace, and decisions must be made quickly and precisely to achieve success. This is where Change Data Capture can be the best option: it has emerged as an effective solution for low-latency, reliable real-time data replication. It is also extremely beneficial for cloud migrations, since it enables companies to migrate their data with zero downtime.
As companies look to implement multi-cloud strategies in the coming years, an essential step for success is choosing the right change data capture method for their business.
Change Data Capture in ETL
Extract, Transform, Load (ETL) is a process in the data integration workflow. ETL consists of three distinct steps:
- Extraction - getting data from one or multiple sources
- Transformation - processing that data into a suitable format through cleaning, binning, applying calculations, concatenations, removal of outliers, etc.
- Loading - storing the transformed data in a data warehouse or some other target system.
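As a minimal sketch, the three steps above can be illustrated in Python. The source records, the cleaning rule, and the target table are all hypothetical; a real pipeline would read from production systems and load into a warehouse rather than an in-memory SQLite database.

```python
import sqlite3

# Extract: pull raw records from a (hypothetical) source system
raw_rows = [
    {"id": 1, "amount": " 19.99 "},
    {"id": 2, "amount": "5.00"},
]

# Transform: clean the data into a suitable format (strip whitespace, cast types)
clean_rows = [(r["id"], float(r["amount"].strip())) for r in raw_rows]

# Load: store the transformed rows in a target system
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", clean_rows)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

Each step maps directly to the list above; in practice the three stages are usually separate jobs connected by an orchestration tool.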
ETL is a vital component of the data engineering process by which insights are sought from data. The data generated by an enterprise usually comes from multiple sources and is stored in different systems. An ETL process is how that data is collated, normalized, and loaded into an output container for further analysis or long-term storage. Given this requirement, it is clear that the way ETL is implemented in an organization must accommodate a Change Data Capture component. If it does not, the stored data can get stale very quickly and suffer from data quality and data integrity issues, eventually leading to negative consequences for the business or missed opportunities.
A good approach to handling this problem is to ensure that the ETL process is triggered whenever a change occurs in any of the data sources. This can be done by incorporating a CDC-enabled data pipeline wherever data needs to be sent to a target system in real time.
Change Data Capture Methods
Different methods can be used to implement CDC. Each comes with its own advantages and disadvantages: some are easier to implement, while others are more flexible. Let's take a look at the four most common ways to implement Change Data Capture below.
Audit Columns
Using audit columns is one way to implement Change Data Capture at the application level. Usually, this is done by adding "LAST_UPDATED" or "DATE_MODIFIED" timestamp columns to each table. Retrieving the latest data is then as simple as querying for the rows that have changed since the data was last extracted. The advantage of audit-column CDC is that it can be built directly into the application that requires the data. No other tools are needed, since everything is handled between the database and the consuming application.
The disadvantage of this method is that the database must handle the extra load from the constant querying for changes. Also, when a row is deleted, its tracking column is deleted with it, which makes it tricky to propagate deletes to the target system or application. Workarounds exist, such as adding a deleted flag or a recently_deleted table, but they are not optimal. There are also data consistency concerns that come with implementing this method of CDC.
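To make the audit-column approach concrete, here is a small sketch using Python's built-in sqlite3 module. The table, column names, and timestamps are illustrative; the essential idea is the single query filtering on the timestamp column against the last extraction's high-water mark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customers (
    id INTEGER PRIMARY KEY,
    name TEXT,
    last_updated TEXT)""")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", "2023-01-01T10:00:00"),
     (2, "Grace", "2023-01-02T09:30:00")])

# The extraction job remembers the high-water mark from its previous run
last_extracted = "2023-01-01T12:00:00"

# Pull only rows modified since then -- this is the entire CDC query
changed = conn.execute(
    "SELECT id, name FROM customers WHERE last_updated > ?",
    (last_extracted,)).fetchall()
```

Note that a row deleted from `customers` would simply vanish from this result set, which is exactly the delete-propagation problem described above.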
Table Deltas
The table delta, or "tablediff", approach to CDC uses a tool or script to compare the data in two tables and spot the differences. Additional scripts can then create DML statements that apply the deltas from the source table to the target. The advantage of this approach is that it can be implemented by simply adding a few native SQL scripts to find the changes in the data, plus further SQL scripts to write those changes to the target database.
The disadvantage of this approach is that it requires three copies of all data in the database: the original data, the previous snapshot, and the current snapshot. This extra storage increases costs, and it means the approach does not scale well for applications with heavy transactional workloads.
The amount of CPU required to identify and apply the differences to a target also grows linearly as more data volume is moved into storage. This approach does not accommodate real-time use cases either, because of the latency inherent in computing table deltas.
Trigger-based CDC
This method has been around for decades. Trigger-based CDC captures all insert, update, and delete operations performed on tables or databases. For every insert, update, and delete statement, a trigger fires and captures the data manipulation language (DML) statement.
This can require a lot of work, since trigger-based CDC requires database triggers to be created to identify the changes that occur in the source system. When a trigger fires, it captures the changes and writes them into another table, typically called a shadow or staging table.
Trigger-based CDC is conceptually simple but can require a fair amount of work due to the manual process of creating the triggers for each table. This method of CDC can capture the entire state of the transaction and is also highly customizable.
Some disadvantages of using triggers for change data capture should also be called out. As mentioned previously, a big disadvantage is the need to set up individual triggers for each table, which can lead to costly implementation and management overhead for a large source database. Another challenge is that for large transactions, there is significant overhead from performing multiple writes to the database for every insert, update, or delete.
Also, to apply the changes to the target database, the replication tool needs to connect to the source database at regular intervals, which can put additional load on the source database system and impact performance.
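The trigger-and-shadow-table pattern can be demonstrated end to end with SQLite (the table and trigger names here are illustrative; real databases each have their own trigger syntax). Three triggers, one per DML operation, copy every change into a shadow table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
-- Shadow table that the triggers write captured changes into
CREATE TABLE orders_changes (op TEXT, id INTEGER, status TEXT);

CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
    INSERT INTO orders_changes VALUES ('INSERT', NEW.id, NEW.status);
END;
CREATE TRIGGER orders_upd AFTER UPDATE ON orders BEGIN
    INSERT INTO orders_changes VALUES ('UPDATE', NEW.id, NEW.status);
END;
CREATE TRIGGER orders_del AFTER DELETE ON orders BEGIN
    INSERT INTO orders_changes VALUES ('DELETE', OLD.id, OLD.status);
END;
""")

# Every DML statement below causes a second, hidden write to the shadow table
conn.execute("INSERT INTO orders VALUES (1, 'new')")
conn.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 1")

changes = conn.execute("SELECT op, id, status FROM orders_changes").fetchall()
```

The extra write per statement is visible here, which is exactly the overhead on large transactions noted above, and a replication tool would still need to poll `orders_changes` to ship the rows onward.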
Log-based CDC
Log-based CDC, as mentioned earlier in this article, works by reading the transaction logs of the source database. By reading the database's log, you get the complete list of all data changes in their exact order of application.
However, every database uses its own custom format to write redo log records to the transaction log. This means that a customized solution needs to be built to read those formats from the log and convert them into logical transactions.
The transactions then need to be converted into DML statements that can be written to a target system. As a result, significant engineering effort is required to build a solution that can read a transaction log and power a log-based CDC process.
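The log-to-DML conversion step can be sketched as follows. Real redo logs are proprietary binary formats, so the plain-text entries below are entirely made up for illustration; the point is only the shape of the transformation from ordered log records into replayable DML statements.

```python
# A made-up, simplified log format; real transaction logs are binary
# and vendor-specific, which is where the engineering effort goes
log_lines = [
    "txn=42 op=INSERT table=users pk=7 cols=name:'Ada'",
    "txn=42 op=UPDATE table=users pk=7 cols=name:'Grace'",
    "txn=43 op=DELETE table=users pk=7",
]

def to_dml(entry):
    """Convert one parsed log record into a DML statement for the target."""
    f = dict(tok.split("=", 1) for tok in entry.split())
    if f["op"] == "INSERT":
        col, val = f["cols"].split(":", 1)
        return f"INSERT INTO {f['table']} (id, {col}) VALUES ({f['pk']}, {val})"
    if f["op"] == "UPDATE":
        col, val = f["cols"].split(":", 1)
        return f"UPDATE {f['table']} SET {col} = {val} WHERE id = {f['pk']}"
    return f"DELETE FROM {f['table']} WHERE id = {f['pk']}"

# Replaying in log order preserves the exact order of application
statements = [to_dml(line) for line in log_lines]
```

The transaction identifier (`txn=`) carried alongside each record is the metadata that lets a real pipeline resume from a known position after a crash.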
Log-based CDC comes with many advantages over other CDC methods. First, you get a lot of additional metadata from the transaction log, e.g., a transaction identifier, which ensures that replication can resume after a crash. Second, in log-based CDC, no extra queries are performed against the source tables, making it an essentially zero-impact solution for the source system.
Some log-based CDC solutions, such as Arcion, also offer agentless log-based CDC where no CDC agent even needs to be installed on the database server. The ability to not impact the performance of the source database is extremely important for large production systems which are likely running at full capacity and cannot bear any extra overhead.
Benefits of Change Data Capture
The benefits of Change Data Capture for businesses that understand the importance of data and data integration are numerous. The most obvious benefit of CDC for any business in the midst of a data migration project is the ability to replicate or migrate data in real time. This unlocks many use cases that are not possible without real-time data analysis.
For a digital transformation, where data is being moved from a legacy system to a new one, CDC can be a crucial part of your data migration strategy. Data migration is usually one of the biggest challenges in digital transformation, since it traditionally requires databases to be taken offline so the data can be extracted and loaded into the target database. With CDC, taking a database offline to migrate its data is no longer necessary.
CDC can serve as a tool to back up data to avoid loss. Data can be kept in sync across various systems to improve the reliability and availability of that data. This can also help as part of a business continuity plan.
Another benefit of CDC for businesses is having data constantly updated and stored in data warehouses for further analysis by Business Intelligence tools or as an input process to an AI/machine learning pipeline. Any insights derived from such data will be up to date and accurate.
Finally, a well-implemented Change Data Capture solution can lead to cost savings in the amounts spent provisioning infrastructure for data handling and database migration. For organizations with large volumes of data, a naive approach such as repeatedly creating full snapshots of the data can incur significant costs over time. With CDC, only the changed segments of data need to be considered.
How Does It Look With Arcion?
Arcion goes beyond the basics of CDC. Supporting many of the most popular databases, data warehouses, and other popular tools, Arcion enables log-based CDC that goes further than just keeping data in sync. Arcion can also capture changes including DDLs, DMLs, schema changes, and several other non-DML changes. This makes for robust CDC pipelines that are easy to create and support, all with no code required.
Arcion also supports other types of CDC for cases where the source database's transaction logs aren't available. These alternative CDC methods include:
- Delta-based Change Data Capture: This is a way of doing audit-column-style CDC. By computing incremental delta snapshots using a timestamp column in the table, Arcion is able to track modifications and convert them into operations on the target.
- Checksum-based Change Data Capture: This is a way of implementing table delta/"tablediff"-style CDC. Arcion performs an intelligent checksum to compare the rows and columns in the source and the target, then adds the missing rows to the target table. Learn more about Arcion's low-latency Change Data Capture.
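The checksum idea can be illustrated in a few lines. This is not Arcion's implementation, just a generic sketch with hypothetical data: hash each row on both sides, keyed by primary key, and ship over any row whose checksum is missing or different on the target.

```python
import hashlib

# Hypothetical source and target tables, keyed by primary key
source = {1: ("Ada", "ada@example.com"), 2: ("Grace", "grace@example.com")}
target = {1: ("Ada", "ada@example.com")}

def row_checksum(row):
    """A per-row checksum; comparing hashes avoids comparing full rows."""
    return hashlib.md5("|".join(row).encode()).hexdigest()

# Rows that are absent from the target, or whose checksums differ,
# are the ones that need to be written to the target table
missing = {pk: row for pk, row in source.items()
           if pk not in target or row_checksum(target[pk]) != row_checksum(row)}
```

Comparing fixed-size checksums rather than full row contents is what keeps this approach practical for wide tables.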
Arcion supports both self-managed, on-premises deployments and fully-managed deployments with Arcion Cloud. With almost unlimited flexibility, Arcion delivers real-time data wherever and whenever you need it.