If you’re building applications or doing anything with data within your enterprise, chances are that you have heard of Apache Kafka. One of the most popular stream processing platforms in existence, the use of Kafka since its inception has exploded. Kafka was built at LinkedIn and is now used at Netflix, Pinterest, and Airbnb. With many big names leveraging the platform, it makes sense why it's one of the most desired technologies to include in applications big and small. Let’s dive into the world of Apache Kafka and CDC.
Introduction to Apache Kafka
Apache Kafka is an open-source stream processing platform that features a distributed event store capable of ingesting, storing, and processing streaming data. The data processed by Kafka can come from various sources such as sensors, databases, mobile devices, cloud services, and software applications. Apache Kafka is mainly used for developing streaming applications that can handle simultaneous and continuously generated data from thousands of sources.
Another major use case for Apache Kafka is in the deployment of data streaming pipelines. In these pipelines, data from one set of sources is continuously fed as input through Apache Kafka, modified and transformed, and then sent out as a stream of records to other data platforms. Apache Kafka is an extremely efficient way to move data around an enterprise at scale through Kafka-based data pipelines.
Apache Kafka was initially developed by LinkedIn to handle its internal streaming requirements. Eventually, Kafka was made available to the broader community through the Apache Software Foundation. It is one of the most popular data platform components in the Apache stack and has easy integration with many other components within the Apache Stack. Common software combined with Kafka includes other Apache offerings like Apache Spark, Flink, Hadoop, and many others.
Apache Kafka offers three main functionalities to its users. These functionalities include:
- a way to publish and subscribe to a stream of event records,
- a way to store those streams, and finally,
- a way to process those streams in a scalable, reliable, and durable manner.
The architecture of Apache Kafka is designed around a set of APIs. The Admin API is used to manage and inspect topics, brokers, and other Kafka objects. In Apache Kafka, event records are published under topics that can be subscribed to. Brokers represent the storage layer of Apache Kafka. Another prominent Apache Kafka API is the Producer API which is used to write or publish a stream of records into a topic. The Consumer API is responsible for the subscription layer that enables users to subscribe to specific topics and process them. The Kafka Streams API is a higher-level library built on both the Producer and Consumer APIs that allows for stream processing applications to be more easily developed. It simplifies the concept of streams and provides access to higher-level functions that can be used for transforming and processing data. The Kafka Connect API is used to provide connectivity to Apache Kafka from and to various external systems like databases and warehouses. You can create your Kafka connector with the Connect API, however, there are hundreds of ready-to-use connectors provided by the community.
Key Features of Apache Kafka
When it comes to the popularity of Kafka, many key features make it a highly-desired platform for many enterprises. Some of the key features of Apache Kafka that make it a reliable event-streaming platform are explained below.
Apache Kafka is a scalable streaming platform because it utilizes a distributed event store and topics, wherein stream records are grouped and partitioned. This means that topics are stored on brokers across different buckets in a distributed fashion. This enables producers and consumers to read and write records from many brokers at the same time. The order of records in each topic is preserved despite being stored across clusters. This architecture allows Kafka to easily scale depending on the amount of traffic being pushed through the platform.
Apache Kafka is an incredibly fast distributed event store. It is capable of processing trillions of records a day and can scale to massive volumes without a lag or drop in performance. It has a decoupled architecture for data streams thereby guaranteeing increased speeds.
Data is stored in Apache Kafka in a highly available manner. Data is stored across different clusters and availability regions. Topic partitions mean that data is distributed and written permanently to disk, reducing the risks of data loss or corruption from server failures. Data can be stored for as long as needed and the data can be accessed in a high throughput environment.
Connectivity and Extensibility
Apache Kafka features a robust set of APIs that enables users to connect to all major data sources. Supported data sources include the most popular databases, data warehouses, big data stores, and online platforms. It can easily be included in your data pipeline because of the numerous connectors provided by the community and the open nature of the platform. Apache Kafka can also be extended to other ecosystems and use cases by leveraging its low-level APIs.
What is Change Data Capture
As mentioned earlier, in brief, Change Data Capture, or CDC, is a concept or mechanism that is used to identify a data change that has occurred in a data store. These changes are captured as CDC events. The CDC process will then apply the identified changes in the CDC events to other data stores in real time. It is used to synchronize data across the data stores. Using CDC ensures data consistency for applications and systems by moving data from one data repository to another.
There are various ways Change Data Capture can be implemented. The available methods will differ from one database to the other. One of the ways CDC can be implemented is through a query-based approach. This approach is where you query the source databases periodically to determine what has changed since the last query has been executed. From this, you can determine the changes and apply them to a target database downstream to keep it in sync with the source. The downfall of this approach is that it is resource intensive and can lead to performance degradation of the source database.
Another approach is through the use of a trigger-based system. This approach involves a process where a change in the source application triggers the modification of a “change” table. The “change” table then records all updates to the tracked tables in the database. These changes can then be queried and applied to downstream target databases.
The last approach we will discuss, and most preferred in many cases, is the use of a log-based Change Data Capture. Most CDC processes, at least those that are not intrusive, tend to rely on the internal log of a database called Write-Ahead Log (WAL). The Write-Ahead Log contains all the changes made to the database before they are applied to data files with tables and indexes. These changes can then be propagated to a destination to synchronize data between a source and a target.
Change Data Capture (CDC) can also encompass software tools to identify changes in source tables and databases. The software can then apply the detected changes to a target database to keep it in sync with the source. Using a CDC tool is one of the most effective and quick ways to implement real-time data synchronization between two or more databases. There are a few CDC tools on the market that offer effective CDC implementations with differing levels of customization and complexity.
The Importance of Kafka CDC
Change Data Capture, as seen in the previous section, is an invaluable tool when constructing data pipelines. When CDC principles are combined with Apache Kafka, organizations can access real-time data from their streaming sources through the reliability of an event streaming platform like Kafka. A major use of Kafka within the enterprise is enabling the transfer of data from one enterprise database to another. Kafka effectively turns database events into streaming data that can be acted upon by other applications to power real-time insights, analytics, and decision-making.
Apache Kafka is designed to scale and handle millions of messages. This allows Kafka to be used to set up efficient real-time data streaming pipelines that can move huge volumes of data with high availability and low latency. The advantages highlighted in the previous statement are some of the attributes of a well-thought Change Data Capture solution as well. Apache Kafka, used for CDC, can utilize the log-based approach by using the Kafka Connect API to read database change logs and ingest the changes as event streams. This is preferable because the events can be persisted on Kafka brokers, and consumed by subscribers. Using log-based ingestion of the data has no impact on the performance of the source system, a major plus. Kafka can also be used to transform the data on the fly before it is forwarded to other connected systems. This can ensure that the data is in the correct format before it reaches its destination.
Another important factor in Kafka-enabled CDC processes is that it can be used as the input to a real-time streaming application that harnesses data from various sources. The real-time streaming application that utilizes Kafka could take many forms. One example may be an analytics dashboard that captures user interactions on a website. Another may be a Business Intelligence (BI) application that analyzes the clickstream and performance of online advertisements.
Using Change Data Capture with Apache Kafka can lead to better utilization of the network bandwidth of your applications. With the high throughput of Kafka, you will not need to worry about batch processes impacting the performance of your data pipelines for end users. Overall, Kafka can be a very solid solution for companies who are looking to leverage or improve their CDC and data streaming capabilities.
Types of Kafka CDC
There are two types of Change Data Capture associated with Apache Kafka. These two types include some of the CDC concepts we touched on earlier, query-based CDC and log-based CDC. Kafka can easily support either of these methodologies when implementing a CDC solution.
Both types of CDC implementations have their specific pros and cons, many of which we discussed earlier. The ideal choice will be based on the requirements of the specific use case where CDC will be leveraged and what the underlying technologies can also support.
Query-based CDC using Kafka is done by querying data in the database and identifying the changes that have occurred since the last query was executed. The query usually has a predicate or conditional expression which may be tied to a timestamp or identifier column in a database table. The predicate may be used in conjunction with a WHERE clause to query slices of data that represent updates. The main advantage of query-based Kafka CDC is that it is easy to set up. It can be as simple as using a connection to the database like Java Database Connectivity (JDBC) and the ability to query the necessary tables. One of the disadvantages of this method is that it relies on specific columns in the source for its functionality to track changes. In a situation where you cannot modify the schema, you will have very little flexibility. Another major disadvantage is that query-based CDC also has an impact on the performance of the source database. This is a result of data constantly being polled to track changes. The idea of tracking changes less infrequently can lead to stale data and the consequences associated with that.
Log-based CDC using Kafka, on the other hand, is the preferred approach in most cases. Its functionality is built on scanning database transaction logs that contain the change event data. This method is non-intrusive and typically leads to having a holistic representation of the state of the database at any point in time. The modifications in the transaction logs can be streamed as events to a Kafka topic that tracks changes in the database. Log-based Kafka CDC provides greater efficiency and fidelity of the data captured. The main challenge is that it can be more intensive to set up since it requires higher privileges to the source database and the logs within it.
How to Enable Change Data Capture in Kafka
An easy way of enabling Kafka CDC is to utilize the Kafka Streams library. Streams in Kafka are defined as an unbounded set of data that is continuously updated. Kafka Streams can be seen as an immutable sequence of records where each data record is characterized as a key-value pair. It is the main abstraction that is used in developing stream processing applications in Kafka. Streams can be arranged in processor topologies that represent the logical abstraction of your stream processing code.
To enable Change Data Capture in Apache Kafka, you will need to leverage the Kafka Streams library. The library itself has no dependencies, apart from Kafka itself, to design stream processing topologies that can support exactly-once processing semantics. Exactly-once processing can guarantee that a data record will be processed exactly once even in the event of a failure. You can also use it to process records based on event time in conjunction with other windowing operations. Kafka Streams provides several stream processing primitives and gives you access to its APIs through the low-level Processor API or the high-level and easier-to-use Streams Domain Specific Language (DSL).
Next, let’s take a deep dive into how to enable CDC with Kafka using 2 different methods. The first method we will explore is the simplest and most scalable. Here we will use Arcion to build a CDC pipeline using Kafka. The second method is where we will use Kafka Streams to create a CDC pipeline.
Method 1: Set Up Kafka CDC Using Arcion
Oftentimes, using a third-party tool that is optimized for easily setting up CDC pipelines is the quickest and easiest way forward. You can set up Change Data Capture with a third-party tool like Arcion in a matter of minutes. This streamlines the process of creating a CDC pipeline versus directly using the APIs exposed by Kafka through its various components. One of the advantages of doing this is that the low-level interactions required to implement CDC in Kafka are abstracted away by a data platform like Arcion. By quickly editing the config files and running a CLI command, a pipeline is available in a few moments with no code and a minimal learning curve. Below is an example of the steps required to set up Change Data Capture with SQL Server, Kafka, and Arcion.
Step 1: Use SQL Server Installer to Add The SQL Server Replication Feature to The Database
Replication components can be installed using the SQL Server installation Wizard or at a command prompt. For the exact steps to do it with either approach, check out the Microsoft documentation.
Once you have your SQL Server installation equipped with the SQL Server Replication Feature, we can begin to move to the next step of installing Java and Apache Kafka on our Linux VM.
Step 2: Install Java On Your Linux VM
Next, you will need to install Java on the machine and add Java to the environment variable. Java is a required dependency since the Kafka libraries run on Java. To install Java and add it to the environment variables for the machine, use the commands listed below.
When the commands complete successfully, Java will be installed and configured as needed to run our Kafka installation. Next, let’s install Kafka.
Step 3: Download And Install Apache Kafka
Our next step will require a few different commands to download and untar the Kafka library. Each step is shown in the commands listed below.
In the above commands, we create a directory for Kafka to be extracted into, use wget to download the dependency, and then untar it so that it can be used. Now that we have Kafka downloaded and ready to run, we can proceed to get Arcion installed and configured.
Step 4: Download and Install Arcion Self-hosted
Our last step before actually starting up our solution is to download and install Arcion Self-hosted. This will require a few steps, including downloading Replicant, creating a home directory, and adding the license key. Each step can be seen in detail by referencing our quickstart guide in our docs. We will configure Replicant a little later in this walkthrough. Before that, we will fire up our Apache Kafka Connect cluster.
Step 5: Configure Arcion to Connect With SQL Server and Kafka
We now need to configure and enable Arcion to connect with SQL Server and Kafka. The replicant-cli dependency and corresponding folder we downloaded and extracted in Step 4 will now be used. We will refer to the directory as $REPLICANT_HOME in the following steps. Now, let’s go ahead and configure all of the connections with Arcion.
Set Up The Connection Configuration
From $REPLICANT_HOME, navigate to the sample Kafka connection configuration file.
Once the file is opened, you will need to make some changes to the configuration. Make the necessary changes as shown in the sample below. Some comments have been left in the example to guide you with each tweak that may be required.
Set Up The Applier Configuration
From $REPLICANT_HOME, navigate to the sample Kafka connection configuration file.
The configuration file has two parts: parameters related to snapshot mode and parameters related to real-time mode. Below is an example of parameters for snapshot mode.
Below is an example of parameters for real-time mode.
For more information on how to configure Arcion to connect with the SQL Server source connector and Kafka, visit here.
Step 6: Monitor Kafka Topics For Changes
The SQL Server connector publishes several metrics about the connector’s activities that can be monitored. The connector has two types of metrics:
- Snapshot metrics which help in monitoring the snapshot activity,
- Streaming metrics which help in monitoring the progress and activity while the connector reads CDC table data.
At this point, your setup is complete and your Kafka, SQL Server, and Arcion CDC setup are ready to go.
Advantages of Using Arcion
There are many benefits to using an integrated data management platform like Arcion when implementing CDC. Below are a few highlights of the features of Arcion that make it a great fit for enterprises that want to utilize a robust Change Data Capture solution.
Many other existing CDC solutions don’t scale for high-volume, high-velocity data, resulting in slow pipelines, and slow delivery to the target systems. Arcion is the only distributed, end-to-end multi-threaded CDC solution that auto-scales vertically & horizontally. Any process that runs on Source & Target is parallelized using patent-pending techniques to achieve maximum throughput. There isn’t a single step within the pipeline that is single-threaded. It means Arcion users get ultra-low latency CDC replication and can always keep up with the forever-increasing data volume on the Source.
100% Agentless Change Data Capture
Arcion is the only CDC vendor in the market that offers 100% agentless CDC to all its supported enterprise connectors. The agentless CDC applies to all complex enterprise databases including SQL Server, MySQL, Oracle, IBM Db2, SAP and many others. Arcion reads directly from the database transaction log, never reading from the database itself. Previously, data teams faced administrative nightmares and security risks associated with running agent-based software in production environments. You can now replicate data in real-time, at scale, with guaranteed delivery — but without the inherent performance issues or the security concerns.
Data Consistency Guaranteed
Arcion provides transactional integrity and data consistency through its CDC technology. To further this effort, Arcion also has built-in data validation support that works automatically and efficiently to ensure data integrity is always maintained. It offers a solution for both scalable data migration and replication while making sure that zero data loss has occurred.
Automatic Schema Conversion & Schema Evolution Support
Arcion handles schema changes out of the box requiring no user intervention. This helps mitigate data loss and eliminate downtime caused by pipeline-breaking schema changes by intercepting changes in the source database and propagating them while ensuring compatibility with the target's schema evolution. Other solutions would reload (re-do the snapshot) the data when there is a schema change in the source databases, which causes pipeline downtime and requires a lot of compute resources, which can get expensive.
Pre-Built Enterprise Data Connectors
Arcion has a library of pre-built data connectors. These connectors can support the most popular enterprise databases, data warehouses, and cloud-native analytics platforms (see full list). Unlike other ETL tools, Arcion provides full control over data while still maintaining a high degree of automation. Data can be moved from one source to multiple targets or multiple sources to a single target depending on your use case. If you branch out into other technologies outside of Kafka or other data technologies in your stack, you’ll already have the capability within Arcion to handle your new sources and targets without the need for another pipeline technology.
Method 2: Setup Kafka CDC Using Kafka Streams
To set up Kafka CDC using Kafka Streams, you can use the Processor API or the Kafka Streams DSL. The Streams DSL is built on the Processor API and is more beginner-friendly than directly using the API. It is also recommended for most users since most data processing operations can be expressed in a few lines of DSL code.
The Steams DSL has built-in abstractions for streams and tables such as KSteam, KTable, and GlobalKTable. You can also use declarative functional programming for stateless transformations like map and filter, or you can use stateful transformations for aggregations like count and reduce. The high-level steps required to define processor topologies using Streams DSL are outlined below.
- You will need to specify an input stream that reads records from Kafka topics. This can be a source processor which is a special type of stream processor that is not connected to any upstream processors. It can be seen as the starting point and can send consumed records downstream.
- Next, you can compose the transformations on the streams. Various stream processors or nodes can be connected, each responsible for a transformation step.
- Finally, you can publish the output streams of the transformation steps to Kafka topics or expose them to other systems through the available Kafka APIs.
Advantages of Using Kafka Streams
The Kafka Streams library allows for quick prototyping of stream processing applications through its many abstractions. You can write and deploy an application on a single machine or scale it out across multiple machines to support high-volume workloads. Kafka Streams are lightweight and can be easily integrated into Java applications. Kafka Streams also have a low barrier to entry for developers when compared with the Producer and Consumer APIs of Kafka. They are durable and can handle the load balancing of multiple instances of an application by relying on Kafka’s parallelism model.
Limitations and Challenges of Kafka CDC
The main challenge of implementing Kafka CDC is that it requires a lot of technical knowledge. Domain-specific expertise is required to properly design and deploy a CDC solution using Kafka. Using Kafka CDC takes a lot of heavy lifting and should be well thought out before attempting to build a production-ready CDC pipeline. You will also need to have the necessary database privileges if you are to set up a CDC pipeline that leverages data that is also accessed by other users and applications. The log-based Kafka CDC is a great way to insure this but will require knowledge of dealing with the transaction log, which is a relatively low-level database component.
This article provided a comprehensive look at Apache Kafka and its use as a Change Data Capture solution. You were introduced to the concept behind Apache Kafka, what it solves, and its architecture. You were also introduced to the Change Data Capture paradigm and how it can benefit businesses and organizations through data replication and migration. You were then shown the types of CDC available in Kafka and how to set up Change Data Capture using Arcion and Kafka Streams. Lastly, we reviewed the advantages and disadvantages of natively implementing CDC with Kafka.
With Arcion you can unlock the power of your data through zero data loss and zero downtime pipelines in minutes.