Modern microservices-based architectures and ever-evolving analytics needs often require syncing data across multiple databases. These databases are usually scattered across the enterprise and span many different systems. In most cases, this synchronization cannot wait for a batch job that runs daily, or even hourly; modern applications often need databases synced in real time. Every data engineer worth their salt has encountered this requirement at some point in their career. This is where Change Data Capture comes in.
What is Change Data Capture?
Change Data Capture, or CDC, means reacting to data changes in a datasource, transforming them, and loading them into another database or storage system in real time. Essentially, you would use CDC to make sure that multiple datasources stay in sync with one another almost instantly. This differs from a batch approach, which collects changes at set intervals and then applies them where they are needed. The batch approach introduces delay and is not conducive to real-time business insights or data needs. Change Data Capture allows users to keep multiple databases in sync, or to replicate data to an analytics platform such as Snowflake or Databricks in real time, so that crucial business decisions can be made when they matter most.
Understanding Change Data Capture using Debezium
Implementing Change Data Capture primarily requires the database to emit events on inserts, updates, and deletes. This is generally accomplished by listening to write-ahead logs or binary logs that are present in most databases. The captured events are then pushed to a queue like Kafka or RabbitMQ. Lastly, a separate consumer process handles the update to the destination system.
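The consumer step described above can be sketched in a few lines of Python. This is a minimal illustration, not Debezium's own code: the event shape (an `op` field with `c`, `u`, or `d`, plus `before` and `after` row images) is loosely modeled on Debezium's change event envelope, and the in-memory dictionary, the `apply_event` and `replay` helpers, and the sample rows are all hypothetical stand-ins for a real destination system.

```python
from typing import Any, Dict, Iterable

# Hypothetical event shape for illustration: "op" is "c" (create),
# "u" (update) or "d" (delete), with "before"/"after" row images.
ChangeEvent = Dict[str, Any]
Target = Dict[int, Dict[str, Any]]


def apply_event(target: Target, event: ChangeEvent) -> None:
    """Apply one change event to an in-memory stand-in for the destination."""
    op = event["op"]
    if op in ("c", "u"):
        row = event["after"]
        target[row["id"]] = row  # upsert: insert and update share one code path
    elif op == "d":
        target.pop(event["before"]["id"], None)  # tolerate already-deleted rows
    else:
        raise ValueError(f"unknown op: {op!r}")


def replay(events: Iterable[ChangeEvent]) -> Target:
    """Consume events in order and return the resulting destination state."""
    target: Target = {}
    for event in events:
        apply_event(target, event)
    return target


EVENTS = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "Ada"}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "Bob"}},
    {"op": "u", "before": {"id": 2, "name": "Bob"}, "after": {"id": 2, "name": "Robert"}},
    {"op": "d", "before": {"id": 1, "name": "Ada"}, "after": None},
]

final_state = replay(EVENTS)
```

In a real pipeline the events would arrive from the queue rather than from a list, and the upsert and delete would be statements against the destination database.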
While most databases make their logs available for subscription, they do not follow a common standard and use their own proprietary formats. Subscribing to and processing these events can thus be done only by coupling tightly to the source database. Hence, a modern development team that uses a variety of databases has its work cut out for it: connectors must be implemented for each database, with little possibility of code reuse. Debezium tries to solve this problem by providing connectors for some of the most popular databases. It supports platforms such as MySQL, PostgreSQL, MongoDB, SQL Server, Oracle, and Db2.
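In a typical CDC architecture based on Debezium and Kafka Connect, Debezium source connectors run inside Kafka Connect, read the database logs, and publish change events to Kafka topics. Sink connectors or custom consumers then apply those events to the destination system.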
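As a rough illustration of the Kafka Connect side of this architecture, the sketch below builds the JSON body that would be sent to the `POST /connectors` endpoint of the Kafka Connect REST API to register a Debezium PostgreSQL connector. The property names follow Debezium 2.x and should be checked against the version in use. The helper name `build_connector_request`, the host name, the credentials, and the table names are hypothetical placeholders.

```python
import json
from typing import Any, Dict, List


def build_connector_request(
    name: str, host: str, dbname: str, tables: List[str]
) -> Dict[str, Any]:
    """Build a Kafka Connect registration payload for a Debezium Postgres source."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",        # logical decoding plugin
            "database.hostname": host,
            "database.port": "5432",
            "database.user": "cdc_user",      # placeholder credentials
            "database.password": "change-me",
            "database.dbname": dbname,
            "topic.prefix": name,             # prefix for the Kafka topic names
            "table.include.list": ",".join(tables),
        },
    }


request_body = build_connector_request(
    name="inventory",
    host="db.example.internal",  # hypothetical host
    dbname="inventory",
    tables=["public.customers", "public.orders"],
)
payload = json.dumps(request_body, indent=2)
```

The sketch only builds the payload and makes no network call. Sending it would be an HTTP POST with a `Content-Type: application/json` header to the Connect worker.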
Can I use Debezium without Kafka?
While Apache Kafka is generally found in Debezium-based CDC architectures, it is not a mandatory requirement. Debezium Server can also deliver events to other messaging systems such as Google Cloud Pub/Sub and Amazon Kinesis. It is also possible to run Debezium in embedded mode and use one's own mechanism to subscribe to and process events.
Is Debezium Open-Source?
Yes. Debezium is an open-source distributed platform for Change Data Capture, used to build the data pipelines that make CDC possible. The rest of this blog will cover how Debezium can be used for Change Data Capture, the drawbacks of this well-known CDC tool, and the alternatives that are available.
Need for Debezium Alternatives
Even though the vision of an open-source CDC platform that supports most databases is appealing, in reality development teams face many issues when implementing Debezium-based production pipelines. Aside from the fact that you will need a strong team with technical expertise in Kafka and DevOps to ensure scalability and reliability, the architecture itself comes with many limitations. Some of these concerns include:
- How to guarantee zero data loss? There is no guarantee of zero data loss or transactional integrity if any of the components in this setup fails. Maintaining integrity is the sole responsibility of the development team here.
- How to ensure the pipeline architecture is scalable? Even though Debezium advertises many connectors, some of them have scalability issues. Even with good old Postgres, it is not uncommon to run into out-of-memory exceptions when using plugins such as wal2json to convert write-ahead log output to JSON. Another example is the inability to snapshot tables while still accepting incoming events. For large tables, this can mean long periods during which the table is unavailable while snapshotting.
- How to handle schema evolution? Debezium does not handle schema evolution in a graceful manner. Even though there is support for schema evolution in some of the source databases, the procedure to implement this is different for different databases. This means a lot of custom logic has to be implemented in Kafka processing to handle schema evolution.
- How to support various data types for different connectors? Some of the connectors have limitations when it comes to supporting specific data types. For example, Debezium’s Oracle connector is limited in handling BLOB data types, since support for them is still in an incubating state. It is not wise to rely on BLOB data types in Debezium-based production pipelines.
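To make the first concern above more concrete, here is a minimal sketch of one kind of safeguard a team might have to build itself: a sink that remembers the highest source log position it has applied and ignores redelivered events. It assumes each event carries a monotonically increasing position (comparable to a log sequence number). The `IdempotentSink` class and the event fields are hypothetical, not part of Debezium.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class IdempotentSink:
    """In-memory stand-in for a destination that skips already-applied events."""

    rows: Dict[int, Dict[str, Any]] = field(default_factory=dict)
    last_position: int = -1  # highest source log position applied so far

    def apply(self, position: int, event: Dict[str, Any]) -> bool:
        """Apply the event unless it was already seen; return True if applied."""
        if position <= self.last_position:
            return False  # duplicate delivered again, e.g. after a restart
        if event["op"] == "d":
            self.rows.pop(event["id"], None)
        else:
            self.rows[event["id"]] = event["row"]
        # In a real system the row write and this checkpoint would need to be
        # committed in the same transaction; otherwise a crash between the two
        # steps can still lose or duplicate a change.
        self.last_position = position
        return True


sink = IdempotentSink()
stream = [
    (100, {"op": "u", "id": 1, "row": {"id": 1, "qty": 5}}),
    (101, {"op": "u", "id": 1, "row": {"id": 1, "qty": 7}}),
    (100, {"op": "u", "id": 1, "row": {"id": 1, "qty": 5}}),  # redelivered
    (102, {"op": "u", "id": 2, "row": {"id": 2, "qty": 1}}),
]
applied = [sink.apply(position, event) for position, event in stream]
```

Without the position check, the redelivered event would overwrite the newer quantity with a stale one.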
Debezium also comes with a lot of hidden costs. Even though it is free to use, a large amount of engineering effort and time is required to set up production CDC pipelines using Debezium. Finding trained engineers to implement these pipelines is another challenge.
Architects often design data pipelines that solve only today’s problem and do not consider the long-term needs of the system. When a business grows, its data volume grows with it, which means pipelines can become overloaded very quickly. That is why it is wise to explore solutions with long-term scalability and support in mind.
These limitations pose serious problems while designing enterprise CDC pipelines based on Debezium. This is why it is wise to be aware of alternatives.
Top Debezium Alternatives
Arcion
Arcion is the only cloud-native, distributed CDC-based data replication solution on the market. It was created to help enterprises accelerate data connectivity across traditional transactional databases and cloud platforms through high-performance, high-availability, and auto-scalable data pipelines. Arcion focuses on unifying your data silos with zero-maintenance, reliable CDC pipelines.
- Fast, reliable pipelines powered by low-latency, distributed change data capture that handle end-to-end replication (no Kafka needed).
- Native support for CDC across multiple databases like MySQL, Oracle, SAP HANA, SAP IQ, Informix, Db2, SQL Server, Teradata, etc., and cloud analytics platforms like Databricks, Snowflake, SingleStore, Yugabyte, etc.
- Guarantees end-to-end data consistency and transactional integrity even with schema changes on the source. Built-in checkpointing and restart capability ensure you never miss a transaction and always get "exactly once" data delivery.
- Cloud native architecture with built-in scale up & out parallelism allows Arcion to ingest at 10k ops/sec/table to cloud analytic platforms like Databricks & Snowflake. It is also designed to handle scalability for tables with billions of rows.
- Arcion handles schema changes and supports schema evolution out of the box, requiring no user intervention. This helps mitigate data loss and eliminate downtime caused by pipeline-breaking schema changes.
- Completely managed, no-code method for setting up Change Data Capture for all major legacy transactional systems and cloud analytic platforms.
- Available in two deployment options: Self-hosted (on-prem or VPC) or fully managed, Arcion Cloud.
- Arcion Cloud supports only five connectors: MySQL, Snowflake, and Oracle as sources, and Databricks, Snowflake, and SingleStore as targets.
Talend Data Integration
Talend is a data integration platform that enables one to extract, transform, and load data across various sources and destinations. Talend provides solutions for both cloud and on-premises deployments. With its drag-and-drop interface, Talend claims development that is 10 times faster than hand-coding.
- Talend supports Change Data Capture with most common relational databases, such as MySQL and Oracle.
- Has great community support and a long history of supporting enterprise data pipelines.
- Talend Open Studio provides a user interface to configure the data source and destination. Most implementations can be done without writing code using this tool.
- The platform is available for on-premise and cloud-based deployments.
- Talend supports log-based CDC only for Oracle. For other databases, CDC is trigger-based and complex to set up, because a backup database is needed to take the load off the main source database. More detail is available in Talend's CDC documentation.
- The setup is not completely no-code based. Users will have to write queries and code in most cases.
- Talend Open Studio is free only for development. The server installation is licensed and expensive.
- Talend pricing is not transparent and requires multiple conversations with the sales team.
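The trigger-based approach mentioned in the list above can be illustrated with a small, self-contained sketch. It is not Talend's implementation: it uses SQLite from Python's standard library, and the `customers` and `customers_changes` tables, the trigger names, and the `capture_with_triggers` helper are hypothetical. The point it shows is that every write to the source table causes a second write to a change table inside the source database.

```python
import sqlite3
from typing import List, Tuple


def capture_with_triggers() -> List[Tuple[str, int, str]]:
    """Record inserts, updates and deletes in a change table via triggers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE customers_changes (
            change_id   INTEGER PRIMARY KEY AUTOINCREMENT,
            op          TEXT NOT NULL,
            customer_id INTEGER NOT NULL,
            name        TEXT
        );
        CREATE TRIGGER customers_ai AFTER INSERT ON customers
        BEGIN
            INSERT INTO customers_changes (op, customer_id, name)
            VALUES ('I', NEW.id, NEW.name);
        END;
        CREATE TRIGGER customers_au AFTER UPDATE ON customers
        BEGIN
            INSERT INTO customers_changes (op, customer_id, name)
            VALUES ('U', NEW.id, NEW.name);
        END;
        CREATE TRIGGER customers_ad AFTER DELETE ON customers
        BEGIN
            INSERT INTO customers_changes (op, customer_id, name)
            VALUES ('D', OLD.id, OLD.name);
        END;
        """
    )
    # Ordinary application writes; the triggers fire as a side effect.
    conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
    conn.execute("UPDATE customers SET name = 'Ada L.' WHERE id = 1")
    conn.execute("DELETE FROM customers WHERE id = 1")
    changes = conn.execute(
        "SELECT op, customer_id, name FROM customers_changes ORDER BY change_id"
    ).fetchall()
    conn.close()
    return changes


changes = capture_with_triggers()
```

A replication process would then read `customers_changes` in `change_id` order and apply each row to the target. Log-based CDC avoids this extra write load by reading the database's own transaction log instead.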
HVR (now part of Fivetran)
HVR (acquired by Fivetran) is a data integration platform that connects and replicates data across various sources and destinations. HVR supports end-to-end data replication, which means the product can perform the initial data migration followed by real-time data replication. HVR’s replication technology for most sources is powered by CDC (change data capture).
- HVR supports most OLTP databases as sources, including MySQL, SQL Server, Oracle, and Db2.
- It supports real-time transactional log-based replication.
- It has a data validation tool that can be run after data migration.
- There is no managed service on the cloud, though data can be cloud-based. (Fivetran has announced that some of the HVR connectors have been integrated into Fivetran, but the HVR platform itself is not available as a managed cloud service.)
- No easy trial. It does not offer any self-serve options and requires contact with the sales team to get started.
- DDL replication is limited to some sources and targets. It is very well supported for Oracle as a source, but support for other sources is weaker, as described in the HVR docs.
- No support for streaming column transformations like computing new partition columns for analytical platforms.
Hevo
Hevo is a completely cloud-based ETL platform that can connect a variety of sources and destinations without a single line of code. Users build end-to-end data pipelines with Hevo that let them pull data from all their sources into the warehouse, run transformations for analytics, and deliver operational intelligence to business tools.
- Completely no-code data pipeline creation with a user-friendly interface for configuring source and destination.
- Connector support is great for on-premises as well as cloud-based data sources.
- Hevo provides a transparent pricing model, and users can get started from its website on a self-service basis.
- Change Data Capture is implemented based on querying and requires tables to have a timestamp column. There is no transactional log based CDC support.
- The Change Data Capture mechanism employed by Hevo places unnecessary load on the source database. Hence it may not be ideal for production databases.
- Change Data Capture is supported only for a limited set of databases, such as PostgreSQL, MySQL, and Oracle.
- Completely cloud-based deployment model. On-premises installations are not possible. Hence this may not be an option for enterprises that are particular about data security and data location.
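Query-based CDC with a timestamp column, as described in the list above, can be sketched as a simple polling loop. This is a generic illustration and not Hevo's actual implementation: it uses SQLite from Python's standard library, integer timestamps for determinism, and hypothetical names (`customers`, `updated_at`, `poll_changes`, `demo`).

```python
import sqlite3
from typing import List, Tuple

Row = Tuple[int, str, int]


def poll_changes(conn: sqlite3.Connection, last_seen: int) -> Tuple[List[Row], int]:
    """Return rows modified after last_seen, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    new_mark = rows[-1][2] if rows else last_seen
    return rows, new_mark


def demo() -> Tuple[List[Row], List[Row]]:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
    )
    conn.execute("INSERT INTO customers VALUES (1, 'Ada', 1)")
    conn.execute("INSERT INTO customers VALUES (2, 'Bob', 2)")
    first_poll, mark = poll_changes(conn, last_seen=0)

    # One update and one delete happen between polls.
    conn.execute("UPDATE customers SET name = 'Ada L.', updated_at = 3 WHERE id = 1")
    conn.execute("DELETE FROM customers WHERE id = 2")
    second_poll, mark = poll_changes(conn, last_seen=mark)
    conn.close()
    return first_poll, second_poll


first_poll, second_poll = demo()
```

The second poll returns the updated row but carries no trace of the deleted one, and each poll is an extra query against the source table. Both effects are typical of query-based capture in general, which is why log-based CDC is often preferred for production databases.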
Qlik Replicate
Qlik Replicate is a data integration platform that can be installed on Windows or Linux systems to connect data across various sources and destinations. Replicate supports bulk replication as well as real-time incremental replication using CDC (change data capture).
- Qlik offers comprehensive source support, with popular databases such as MySQL, SQL Server, Oracle, and PostgreSQL on the supported list.
- It supports transactional log-based replication.
- Closely coupled with Qlik Analytics which means data integration and visualization can be done inside one platform.
- Only supports on-premises installations, though data sources can be cloud-based.
- Qlik pricing is not transparent. It does not offer any self-serve options and requires contact with the sales team to get started.
- The Change Data Capture feature comes with many limitations that vary by data source. For example, when using Oracle as a source, it does not support batch operations using primary keys. Such changes will not be reflected in the target database. In the case of SQL Server, column-level encryption is not supported.
- In the case of SQL Server, masked data in the source database will appear in plain text in the target database.
Oracle GoldenGate
Oracle GoldenGate is Oracle’s solution for data integration and Change Data Capture. GoldenGate works as a “real-time data mesh” platform that allows users to design, execute, orchestrate, and monitor their data replication and stream data processing solutions that are hosted within Oracle Cloud Infrastructure.
- GoldenGate is one of the best-known CDC solutions and has been around for more than a decade, which makes it one of the most mature solutions on the market.
- Works well when Oracle is the source for CDC. Integration and setup are seamless in the case of Oracle.
- Beyond CDC, it can also act as an independent data integration service.
- Closely coupled with Oracle. Even though it can support other databases, like MySQL and IBM Db2, Change Data Capture support for common sources like PostgreSQL is limited.
- GoldenGate installation and configuration is a complicated process and requires specialized skills. It is advised only for large enterprises with database administration teams.
- GoldenGate does not support any cloud-based data sources.
- GoldenGate licenses are very expensive and may be cost-prohibitive for small and medium businesses.
- GoldenGate is owned by Oracle, which is naturally less incentivized to develop features that help users move data out of the Oracle ecosystem.
We have now seen how Debezium works as an open-source tool for implementing Change Data Capture. We also explored the alternatives that can be considered if you are being held back by its limitations. Arcion provides a cloud-native, completely managed, no-code pipeline creation tool for real-time data integration.
Looking for a great alternative to Debezium and Kafka-based Change Data Capture platforms? Book a personalized demo with our database expert to see Arcion in action.