In this blog, let’s take a close look at the CDC solutions available today and weigh their pros and cons, so you can make an informed decision as you choose the CDC solution that fits your requirements.
What is Change Data Capture?
Change Data Capture (CDC) identifies and captures changes in a source database so they can be integrated and replicated to a destination database safely and reliably, maintaining data consistency throughout the process. Users can rely on the change logs for debugging. Arcion also implements agentless CDC for low-latency data replication and zero-downtime database migrations, empowering users with real-time analytics.
CDC vs. Change Tracking
Change Data Capture is an asynchronous process that tracks and records all data manipulation language (DML) changes to be applied to a target system. It keeps a comprehensive log that needs additional storage. On the other hand, Change Tracking is a lightweight synchronous process that captures only the last change to the data.
CDC vs. ETL Process
Change Data Capture offers real-time performance as an advantage over ETL, which processes data in phases and in batches. Data extraction happens in real time rather than in batches, while data transformation and loading can happen simultaneously; at times, loading can even precede transformation. This greatly improves the performance of data transfer and replication, especially at large data volumes.
Challenges in Change Data Capture
Is CDC easy to implement? If not, what are the top challenges in CDC implementation?
Parsing The Transaction Log
Log-based CDC can face challenges in parsing highly customized transaction logs, limiting the number of databases supported.
Synchronizing The Capture Starting Point
A successful CDC process needs to achieve perfect synchronization of two functions:
- Taking a snapshot of the data at the current time
- Applying changes from that point forward
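A minimal sketch of that synchronization, assuming a simple position-stamped change log (all names here are hypothetical, not any vendor's API): the snapshot remembers the log position it corresponds to, and only later changes are replayed.

```python
# Illustrative sketch of synchronizing a snapshot with the change stream.
# A stand-in "transaction log": (position, operation, key, value) tuples.
change_log = [
    (1, "INSERT", "a", 1),
    (2, "UPDATE", "a", 2),
    (3, "INSERT", "b", 5),
]

def take_snapshot(table, log):
    """Copy current state and remember the log position it corresponds to."""
    snapshot = dict(table)
    start_pos = log[-1][0] if log else 0
    return snapshot, start_pos

def apply_changes(target, log, start_pos):
    """Replay only the changes recorded after the snapshot position."""
    for pos, op, key, value in log:
        if pos <= start_pos:
            continue  # already reflected in the snapshot
        if op in ("INSERT", "UPDATE"):
            target[key] = value
        elif op == "DELETE":
            target.pop(key, None)

source = {"a": 2, "b": 5}
target, pos = take_snapshot(source, change_log)
# A new change arrives after the snapshot...
change_log.append((4, "DELETE", "b", None))
apply_changes(target, change_log, pos)
print(target)  # {'a': 2}
```

If the starting position is captured even slightly after the snapshot, changes in between are silently lost, which is why this synchronization is a genuine challenge.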
Handling a Table's Schema
Database schemas can vary significantly between systems due to different data types, non-standard behaviors, support for invalid data, and more. To address these challenges, CDC tools transform the data as it moves from one schema to the other, but some level of manual intervention may still be required to ensure data integrity.
Recovery and Exactly-Once Delivery
Adhering to an “exactly-once” delivery pattern can be tricky with CDC because it needs to maintain the source and targets in sync irrespective of any external or underlying issues (connectivity, errors, and so on).
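One common way to approximate exactly-once delivery is to make the apply step idempotent by checkpointing the last applied change position, so redelivered changes after a retry or crash are skipped. A minimal sketch, with hypothetical names and an in-memory checkpoint standing in for durable storage:

```python
# Sketch (not any vendor's design) of effectively exactly-once apply:
# persist the last applied change position and ignore replays.
applied_up_to = 0          # durable checkpoint in a real system
target = {}

def apply_once(change, checkpoint):
    """Apply a change only if its position is past the checkpoint."""
    pos, key, value = change
    if pos <= checkpoint:
        return checkpoint   # duplicate after a retry/crash: ignore
    target[key] = value
    return pos              # advance the checkpoint with the write

# Position 2 is redelivered, simulating a retry after a failure.
stream = [(1, "a", 1), (2, "b", 2), (2, "b", 2), (3, "a", 9)]
for change in stream:
    applied_up_to = apply_once(change, applied_up_to)

print(target, applied_up_to)  # {'a': 9, 'b': 2} 3
```

In a real system, the checkpoint and the write must be committed atomically; if they are stored separately, a crash between them reintroduces duplicates or loss.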
Performance and Tuning
A CDC tool’s low-latency performance depends on how well it is configured for its data sources, which can be numerous and vary widely for large data transfers. Each source has unique requirements that call for some level of expertise.
Types of Change Data Capture
There are three common ways to implement CDC.
Log-based CDC uses source database transaction logs. Pros:
- Acquires additional valuable metadata from the transaction logs
- Connectionless, and hence a zero-impact solution for the source database
Cons:
- Highly customized logs may require customized CDC solutions and larger engineering efforts
Query-based CDC relies on a timestamp or a separate identifier column to identify database changes. Pros:
- Commonly used for data replication
Cons:
- Deleted records cannot be identified since they are no longer in the database
- Severe performance overhead because the CDC process continuously queries the source database for changes
- Timestamp and audit-date columns may not be replicated, risking incorrect replication
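As a concrete sketch of the query-based approach (illustrative table and column names, using SQLite for self-containment), the tool repeatedly polls for rows whose audit column has passed the last watermark; note that deleted rows never show up:

```python
# Hypothetical query-based CDC: poll a table for rows whose audit
# column changed since the last sync point.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL, updated_at INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 10.0, 100), (2, 20.0, 105)])

last_sync = 0

def poll_changes(conn, since):
    """Return rows modified after the last sync point (misses DELETEs!)."""
    rows = conn.execute(
        "SELECT id, total, updated_at FROM orders WHERE updated_at > ?",
        (since,)).fetchall()
    new_watermark = max((r[2] for r in rows), default=since)
    return rows, new_watermark

changed, last_sync = poll_changes(conn, last_sync)
print(changed)   # both rows on the first poll
conn.execute("UPDATE orders SET total = 25.0, updated_at = 110 WHERE id = 2")
changed, last_sync = poll_changes(conn, last_sync)
print(changed)   # [(2, 25.0, 110)]
```

Each poll is a full query against the source, which is exactly where the performance overhead mentioned above comes from.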
Trigger-based CDC uses triggers fired on insert, update, and delete operations on tables or databases to capture the DML statement. Pros:
- Easy to implement
- Highly customizable and captures the entire state of the transaction
Cons:
- Considerable manual work, which can be quite costly, is required to create and maintain all the triggers for each table and operation
- Operational overhead, as each database operation captured by a trigger needs an individual write to a shadow or staging table
- The replication tool needs to connect to the source database at regular intervals, which can slow performance, especially for large databases
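Trigger-based capture can be sketched directly in SQL; the example below uses SQLite (table and trigger names are illustrative) to copy each DML operation into a shadow table:

```python
# Illustrative trigger-based CDC: triggers record every insert, update,
# and delete into a shadow (audit) table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE customers_audit (op TEXT, id INTEGER, name TEXT);

CREATE TRIGGER trg_ins AFTER INSERT ON customers BEGIN
  INSERT INTO customers_audit VALUES ('INSERT', NEW.id, NEW.name);
END;
CREATE TRIGGER trg_upd AFTER UPDATE ON customers BEGIN
  INSERT INTO customers_audit VALUES ('UPDATE', NEW.id, NEW.name);
END;
CREATE TRIGGER trg_del AFTER DELETE ON customers BEGIN
  INSERT INTO customers_audit VALUES ('DELETE', OLD.id, OLD.name);
END;
""")

conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("UPDATE customers SET name = 'Ada L.' WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 1")

print(conn.execute("SELECT * FROM customers_audit").fetchall())
# [('INSERT', 1, 'Ada'), ('UPDATE', 1, 'Ada L.'), ('DELETE', 1, 'Ada L.')]
```

Every DML statement now costs an extra write to the shadow table, which is the operational overhead noted above, and these triggers must be created and maintained for every tracked table.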
Things to Consider in a Change Data Capture Tool
Keep these factors in mind as you choose a CDC tool for your technology stack.
Performance and Scalability
Ensure you have a CDC tool that can scale as your data grows without slowing you down. Ask yourself these questions to evaluate a CDC tool.
Can the tool meet the SLA requirements unique to your use cases? Remember, data volume will continue to grow and real-time replication is critical in most cases. Ask for a proof of concept to validate these needs.
Can it scale on demand to support variable workloads and rising data volumes? At times, you may need more data, faster. Check if the CDC tool is based on a parallel cloud architecture to distribute replication threads across multiple compute clusters.
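The idea behind parallelized replication is easier to see with a sketch. This illustrative Python snippet (names are hypothetical, and real tools distribute work across compute clusters rather than local threads) replicates independent tables concurrently instead of one after another:

```python
# Sketch of parallel replication fan-out: independent tables are
# handled by concurrent workers rather than sequentially.
from concurrent.futures import ThreadPoolExecutor

def replicate_table(name):
    """Placeholder for the real per-table snapshot/CDC work."""
    return f"{name}: done"

tables = ["orders", "customers", "events", "inventory"]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(replicate_table, tables))
print(results)
```

The same fan-out principle applies within a single large table, which can be split into ranges and copied in parallel.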
How does the tool impact your organization’s data architecture? Ensure the tool does not burden your existing infrastructure. Tools that use a source or target agent will slow down performance. Look for a cloud-native SaaS solution to minimize the on-premises footprint.
How does this tool extract new and changed data from the source? Pick a log-based CDC tool to minimize production workload disruptions.
Does the tool guarantee transactional integrity? This is one of the key considerations to ensure data integrity on the target database - that it does not operate on partial and invalid data.
Ease of Use
Ease of use is critical in a CDC tool. Do not let a hard-to-use tool indirectly affect the speed of operations and performance of the process. Ask these questions to evaluate your tool’s ease of use.
Does the tool require extensive training? Ask your vendor to provide supporting documentation and a combination of live and recorded training sessions to help your teams become productive during the PoC stage.
What level of automation does it provide? Good CDC tools exploit automation for repetitive tasks.
What is the level of effort on implementation and overall management of the CDC tool? Seek out a tool that streamlines implementation, administration, and monitoring. Minimize or eliminate on-premises installations with a cloud-native tool.
How does this tool affect your data team’s efficiency? Compare the number of hours required of your team with the CDC tool in contrast to what they spent earlier in data integration.
Flexibility
Look for a flexible tool that can meet evolving requirements and use cases.
Does the tool allow modular changes? The CDC tool should allow modular changes to elements without impacting the entire system.
What kind of data ingestion can the tool support? The tool should enable inserts, updates, or deletes in addition to schema changes.
Does the tool support an open architecture? Pick a tool that supports all the major sources, targets, processors, formats, and tools (transformation, workflow, and database), plus all major programming languages. In other words, an open data access and portability model.
Data Governance
Most industries require data governance, and certain industries must adhere to stringent data governance programs, especially with regard to personally identifiable information (PII).
Does this tool centralize all your pipeline metadata and metrics? The tool should make it easy for your data engineers to track data flows.
Does this tool provide granular role-based access controls? Check if the tool allows access control by the user, dataset, and task types. Third-party identity management tools integration is also a valid requirement to keep in mind.
Does the tool mask data? Empower your data teams to mask sensitive PII information while playing with advanced analytics or other necessary operations.
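A hedged sketch of what such masking might look like in a pipeline step (the field names and the hash-then-truncate scheme are illustrative, not any product's behavior): identifiers are hashed so they remain joinable across datasets, while highly sensitive fields are redacted outright.

```python
# Hypothetical PII masking before data reaches analysts.
import hashlib

def mask_record(record, hash_fields=("email",), redact_fields=("ssn",)):
    """Hash joinable identifiers; redact fields that must never leak."""
    masked = dict(record)
    for f in hash_fields:
        if f in masked:
            masked[f] = hashlib.sha256(masked[f].encode()).hexdigest()[:12]
    for f in redact_fields:
        if f in masked:
            masked[f] = "***"
    return masked

row = {"id": 7, "email": "jo@example.com", "ssn": "123-45-6789"}
print(mask_record(row))   # id kept, email hashed, ssn redacted
```

Hashing (rather than redacting) the email keeps analytics like distinct-user counts possible without exposing the raw value.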
Does the tool track data lineage? The CDC tool should track the lineage of sources, tasks, and their users so the data can be trusted.
Does the tool meet your regulatory and compliance requirements? Users should be able to audit user actions and document governance-related information for compliance and industry regulations such as the General Data Protection Regulation (GDPR), California Consumer Privacy Act, and the Health Insurance Portability and Accountability Act (HIPAA).
With this backdrop, let’s see if Arcion’s CDC can be your pick.
How Can Arcion Help?
Arcion provides a data pipeline tool that is scalable, reliable, and extremely easy to use, making it the go-to CDC tool for several leading enterprises. Arcion’s no-code CDC platform can ingest data in minutes for near real-time decision-making. Its robust data pipelines ensure high availability using log-based CDC and feature auto-scaling. It supports flexible deployment models: on-premises, cloud-based, or a hybrid approach.
Arcion uses multiple types of CDC. These include log-based, delta-based, and checksum-based CDC. The tool handles both DML and non-DML changes. It also supports schema evolution and conversion, transformation in columns, and DDLs.
As a zero-code platform, Arcion CDC can be set up and configured through its intuitive UI in minutes, with no custom code required. Extensive documentation, tutorials, and blogs, in addition to customer support, round out the user experience.
Let’s take a deeper dive into Arcion’s CDC tool features.
Sub-second Latency From Distributed & Highly Scalable Architecture
Arcion is the world’s only CDC solution with an underlying end-to-end multi-threaded architecture, which supports auto-scaling both vertically and horizontally. Its patent-pending technology parallelizes every single Arcion CDC process for maximum throughput. So users get ultra-low latency and maximum throughput even as data volume grows.
100% Agentless Change Data Capture
Arcion CDC is a completely agentless CDC supporting more than 20 connectors, once again making it the only CDC vendor to do so. It supports all major enterprise databases (e.g., Oracle, SQL Server, Sybase, DB2 mainframes, IBM Informix, and SAP HANA, to name a few) and open-source databases such as Postgres, MySQL, MongoDB, and Cassandra, along with a variety of data warehouses like Netezza, SAP HANA, BigQuery, and Snowflake, among many others. Arcion uses database logs and does not read from the database itself. Being agentless, it eliminates inherent security concerns as well. It guarantees data replication at scale, in real time.
Effortless Setup & Maintenance
Arcion's no-code platform removes DevOps dependencies; you do not need to incorporate Kafka, Spark Streaming, Kinesis, or other streaming tools. So you can simplify the data architecture, saving both time and costs.
Data Consistency Guaranteed
Arcion ensures zero data loss with built-in validation support that works automatically throughout the replication and migration process, making it seamless and efficient.
Pre-Built Enterprise Data Connectors
See the full list of Arcion connectors. It is an extensive list of common databases, data warehouses, and cloud-native platforms. As an advantage over typical ETL tools, Arcion ensures that you have full control over data with the flexibility of automation. You can move data from a single source to multiple targets or vice versa.
Other Change Data Capture Tools
Here is a list of a few additional data replication and migration tools for an objective comparison.
Fivetran is a Software-as-a-Service (SaaS) product that allows enterprises to move siloed data into accessible storage like data warehouses in the cloud. With Fivetran, users can connect to multiple databases and applications without building data pipelines.
Fivetran acquired HVR, a data integration platform, in September 2021. HVR’s data replication technology leverages CDC. Fivetran has indicated plans to integrate HVR with its products, including support for on-prem installations by 2026. Existing on-prem HVR customers may want to start evaluating on-prem CDC alternatives.
Features:
- No-code platform
- Prebuilt data connectors
- Managed service
Pros:
- Ease of use; no data pipelines required
- Extensive list of pre-built SaaS connectors, including Customer Relationship Management (CRM) and social media apps
- Data transformation capabilities
Cons:
- No support for self-hosted deployment, either for SaaS or database connectors
- Limited database connectors
- Primarily cloud-only, with a significant ramp ahead for on-prem
- DDL replication limited to Oracle, as described in their documentation
- No granular control over data specification
- Users are limited to the prebuilt data connectors
- Does not support streaming column transformations for analytical platforms
InfoSphere Information Server is a data integration platform with multiple offerings in its product suite, including InfoSphere DataStage, which is primarily an ETL tool.
Features:
- Graphical framework for data jobs
- Data integration
- ETL and ELT operations
Pros:
- Eliminates organizational data silos
- Integrates data across multiple systems and enables data analytics for insights
- Data standardization using a unified business language
- Access to a rich ecosystem of IBM data tools
Cons:
- Vendor lock-in
- Prior knowledge of the IBM ecosystem required to optimize for customer use cases
Oracle GoldenGate is a software tool designed to move data between locations. It is best viewed as a family of products rather than a single product. Certain components are available on-prem and others on Oracle Cloud Infrastructure.
Features:
- Business continuity and high availability
- Zero-downtime initial load, database migration, and upgrades
- Data integration
- Live reporting
Pros:
- Real-time, low-latency data movement
- Data consistency and improved performance by moving only committed transactions
- Heterogeneous support for multiple databases running on different operating systems; supports different versions and releases of Oracle Database too
- Simple architecture and easy to configure
- High performance with minimal overhead on databases and infrastructure
Cons:
- Expensive product with separate licensing in addition to the Oracle DB license
- Complex to deploy and configure; time needed can range from days to weeks depending on the database and storage structures
- Challenges in replicating character sets
- Data extraction during replication may strain memory usage
- Possible data delivery issues with XML data and HDFS; needs additional validation
Talend is a data integration platform that enables ETL operations for various sources and targets. It has a drag-and-drop UI to improve productivity significantly (~10x faster than manual programming).
Features:
- Data integration
- Data integrity
- Data quality
- Data governance
Pros:
- Supports relational databases like MySQL, Oracle, etc.
- Strong developer community
- UI-based Talend Open Studio for ease of configuration; minimal coding required
- Supports on-premise and cloud deployments
Cons:
- Log-based CDC supports only Oracle sources; for the rest, CDC is trigger-based, complex to set up, and needs an additional backup to minimize the operational load on the source
- Manual work in queries and coding required for setup; not completely no-code in most cases
- The free version of Open Studio is only for developers; a separate, quite expensive license is required for server installation
- Ambiguous pricing model
Debezium is an open-source CDC tool based on Apache Kafka. It captures row-level changes using transaction logs, recording events in the same order the changes were made to the database, and publishes them as topics to Apache Kafka.
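A simplified Debezium-style change event is sketched below; the envelope's payload carries the row state before and after the change, an operation code, and source metadata (the field values here are illustrative):

```python
# Parsing a simplified Debezium-style change event from a Kafka topic.
import json

event_json = """{
  "payload": {
    "before": null,
    "after": {"id": 42, "status": "new"},
    "op": "c",
    "source": {"table": "orders", "lsn": 12345}
  }
}"""

# Debezium operation codes: c=create, u=update, d=delete, r=snapshot read.
OPS = {"c": "INSERT", "u": "UPDATE", "d": "DELETE", "r": "SNAPSHOT-READ"}

payload = json.loads(event_json)["payload"]
print(OPS[payload["op"]], payload["source"]["table"], payload["after"])
# INSERT orders {'id': 42, 'status': 'new'}
```

Because each event carries both the old and new row images, downstream consumers can rebuild target state without querying the source.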
Features:
- CDC tool
- Data monitoring
- Event streams
- Speed and scalability
Pros:
- Free, open-source software
- Maintains the order of events
- Fast and scalable
- Supports monitoring of common databases such as MySQL, MongoDB, PostgreSQL, SQL Server, etc.
Cons:
- Extensive engineering effort and technical know-how required
- No guaranteed transactional integrity (or zero data loss)
- Some Debezium connectors have scalability issues; for example, Postgres users have faced memory exceptions with plugins such as wal2json, and the lack of snapshot tables means large tables can be unavailable for long periods
- Does not handle schema evolution gracefully; needs custom logic in Kafka to process schema evolution
- Limited data type support for some connectors; for example, Debezium’s Oracle connector cannot handle BLOB data types
- Does not scale well with data volume; data pipelines need extensive design and implementation time
- Hidden costs in engineering effort and maintenance
- Data pipelines can have long-term scalability issues even when designed with short-term data volume in mind
StreamSets is a complete end-to-end data integration solution to build, run, monitor, and deliver continuous data for DataOps. The tool is designed to minimize time spent on resolving issues and allow data teams to focus on applying data.
Pros:
- Efficient streaming and management of record-based data
- User-friendly interface with live dashboards to fix errors in real time
- Supports multiple file formats and connectivity options
Cons:
- StreamSets integration with Spark functions is challenging for large datasets
- The entire data flow must be paused when settings in one processor are updated
Qlik Replicate is a data ingestion and replication platform for heterogeneous environments. It empowers organizations and corporations to move, stream, and ingest data across various locations with minimal impact on operational efficiencies.
Features:
- Data replication
- Data ingestion
- Support for most enterprise data sources, including mainframes
- Heterogeneous systems like data warehouses, cloud platforms, enterprise databases, and more
Pros:
- Real-time data streaming for CDC
- Automated replication to a cloud data warehouse with no manual coding required
- Centralized monitoring of all resources through a single interface
Cons:
- Legacy tool not designed for the cloud era; needs manual work and high maintenance effort
- Not an end-to-end multi-threaded solution, despite its claims; in target databases like Snowflake and Databricks, it cannot scale horizontally, making it best suited for smaller projects that replicate less than 1TB of data per day
- Needs an additional third-party clustering solution to achieve high availability, which adds design and maintenance complexity
- Cannot guarantee data consistency, leading to missing data and errors in target databases
- No staging area: for any issue, the process must be restarted from the beginning to avoid data loss, which can derail projects and make them much more expensive
- Self-hosted deployment only; no managed cloud SaaS offering
To wrap up, we have covered what CDC is and the factors to keep in mind when selecting a CDC tool. The list of alternative database replication and migration tools, with their pros and cons, can help you make an objective selection. This blog is intended as a guide as you plan a successful data migration or data replication project. The right decision often starts with asking the right questions.
To make things simple: if you are looking for an easy-to-use CDC solution that ensures reliability and scalability, look no further than Arcion. Arcion gives you complete deployment flexibility; you can install it on-premises or in the cloud. With pre-built connectors, it is not only incredibly quick and easy to configure, it also delivers extremely high-performance data pipelines. Book a free demo to see Arcion in action, and bring the power of zero downtime and zero data loss to your organization today.