Kafka CDC SQL Server: A Detailed Guide

Matt Tanner
Developer Relations Lead
November 21, 2022
16
 min read
Join our newsletter

In modern times, the need for real-time data access and analytics is ubiquitous for almost all companies, regardless of industry. Businesses require this instantaneous access to make accurate predictions and to inform their business decisions and customers simply just demand it. Gone are the days when batch processing was the go-to method for tracking changes and moving data to other platforms or analytics engines. The demands of modern businesses can no longer be accommodated by such old and lagging technologies.

To fill the need for real-time data access and replication, many RDBMS databases have built-in Change Data Capture (CDC) functionalities. The inclusion of CDC as a core piece of a modern data stack helps to leverage data in real-time by creating pipelines that can monitor a source database for changes and apply those changes to various targets.

On top of some databases natively supporting CDC-based replication, other tools have also been created to fill this void and add even more flexibility. One of these tools is Apache Kafka, one of the most popular streaming tools currently in existence. Using Kafka is a very common way to build a pipeline with SQL Server to stream data in your SQL Server database. In this overview, we will go over a step-by-step approach to show how this can be done. We will also look at how an equivalent pipeline can also be created with Arcion even more simply. Both of these methods will be discussed below in detail, however, first, let’s take a look at what Apache Kafka is all about.

Table of Contents

What is Kafka

Apache Kafka is an open-source distributed event streaming service used to build high-performance data pipelines. Many organizations that adopt Kafka use it for streaming analytics, data integration, and to assist with database migrations. Kafka is one of the most well-known and frequently used platforms for data streaming. According to the Kafka website, it is used to meet the stream-processing needs of over 80% of the Fortune 100.

Streaming data is data that is continuously generated from various data sources. Each source's data is processed in real time and the data records created are typically sent for storage. To actually handle this processing and routing of data, a stream processing tool is required. Each of the sources will need to be connected to one or multiple destinations through a data pipeline. This is exactly where Kafka comes in to complete the solution.

Kafka is a distinguished tool that can handle large amounts of data at scale. Kafka was built to process data sequentially and incrementally as it is primarily used to build real-time streaming data pipelines. The Apache Kafka framework provides two types of messaging models: queuing and publish-subscribe. These messaging models offer different approaches to processing and moving data along the data pipelines created with Kafka. Both are highly scalable and can transfer millions of messages per second.

Reasons to Use Kafka 

Kafta has become an important and popular platform used by thousands of companies in the big data ecosystem for data streaming. Leveraging Kafka as part of a streaming solution has the following benefits associated with it:

  • Highly scalable: Kafka allows data to be distributed across multiple servers through its partitioned log model, therefore making data scale beyond what can be accomplished on a single server.
  • Fast: Kafka achieves very low latency by decoupling data streams to ensure it has high throughput.
  • Durable: Kafka avoids data failure by distributing partitions and replicating data across multiple servers. This helps to avoid data loss by making the data fault-tolerant and durable.

Things You Can Do With Kafka

Kafka offers three main functions to users, namely:

  • It helps you to publish and subscribe to streams of records.
  • Store streams of records efficiently in a chronological way in which the records were generated.
  • Helps in processing streams of records in real-time by providing connectors to import and export records from diverse data sources.

Methods To Achieve SQL Server CDC

The essence of this blog is to show you how to link Kafka with an SQL CDC Server to be able to stream data in real time. Two popular methods will be discussed, the first using the popular CDC tool Debezium and the second, simpler method that uses Arcion

Read our “Debezium Alternatives: The Ultimate Guide” to learn more about the limitations of using Debezium for enterprise-grade CDC, what are the top alternatives, and their pros and cons. 

Debezium is an open-source project with various plugins that can help retrieve data from databases as it captures changes to your data. Debezium is easily connected with Kafka through Kafka Connect. This allows for large data sets to be moved into and out of a Kafka data pipeline. Debezium’s SQL Server Connector can monitor and record the row-level changes in the schemas of a SQL Server database. The first time it connects to a SQL Server database/cluster, it reads a consistent snapshot of all of the schemas. 

The following section covers a step-by-step guide on how to create a streaming pipeline to ingest CDC events from SQL Server to Kafka. In order for this to work, we’ve listed the prerequisite system and configuration requirements below.

Prerequisites

Method 1: SQL Server CDC to Kafka Using Debezium

Connecting your SQL Server instance to Kafka is quite simple using Debezium. Below we will break down each step, including sample commands using a sample SQL Server database called TestDB

Step 1: Enable CDC For The Table(s) That Should Be Captured By The Connector

Our first step is to enable change data capture for the current database. We can do this by using the sys.sp_cdc_enable_db stored procedure. This procedure enables change data capture for the entire database. Once executed, change data capture records for insert, update, and delete activities will be created. Below is an example of how to call the stored procedure in SQL Server.

USE TestDB 

GO 

EXEC sys.sp_cdc_enable_db 

GO

Alternatively, you can turn on CDC only for specific tables within your database. To do this, we would instead use the sys.sp_cdc_enable_table stored procedure. An example of this can be seen below that enables CDC on the Posts table in the TestDB database.

USE TestDB 

GO 

EXEC sys.sp_cdc_enable_table 
@source_schema = N'dbo', 
@source_name = N'Posts', 
@role_name = N'Admin', 
@supports_net_changes = 1 
GO

Once we have enabled CDC on either the entire database or select tables, we can then check out the configuration by calling the sys.sp_cdc_help_change_data_capture stored procedure. This stored procedure returns the change data capture configuration for each table enabled for change data capture in the current database. An example of the command can be seen below.

USE MyDB

GO 

EXEC sys.sp_cdc_help_change_data_capture 

GO

If you require more info or a deep dive into the SQL Server CDC docs, visit here for more details on how to enable and disable CDC on SQL Server.

With either of the cdc_enable stored procedures executed, CDC will be enabled for the tables within your SQL Server instance. Our next step is to configure the Kafka setup.

Step 2: Install Java On Your Linux VM

On our Linux VM, mentioned in the prerequisites, you will need to install Java on the machine and add Java to the environment variable. Java is a required dependency since the Kafka libs run on Java. To install Java and add it to the environment variables for the machine, use the commands listed below.

> sudo yum install java-1.8.0
> export JAVA_HOME=/usr/lib/jvm/jre-1.8.0

When the commands complete successfully, Java will be installed and configured as needed to run our Kafka installation. Next, let’s install Kafka.

Step 3: Download And Install Apache Kafka

Our next step will require a few different commands to download and untar the Kafka library. Each step is shown in the commands listed below.

> mkdir kafka/
> cd kafka
> wget https://archive.apache.org/dist/kafka/3.2.1/kafka-3.2.1-src.tgz (asc, sha512)
> tar -xzf kafka-3.2.1-src.tgz (asc, sha512)
> cd kafka-3.2.1-src

In the above commands, we create a directory for Kafka to be extracted into, use wget to download the dependency, and then untar it so that it can be used. Now that we have Kafka downloaded and ready to run, we need to also install Debezium so that we can connect Kafka to our SQL Server instance.

Step 4: Download And Install A Debezium  Connector

In the kafka-3.2.1-src folder that we ended the last set of commands with, we will download the Debezium libraries. Then, we will untar it into your Kafka Connect environment before moving it to the directory of the plugins folder. Lastly, we will run the command to set our plugin path for Kafka Connect.

> wget https://repo1.maven.org/maven2/io/debezium/debezium-connector-sqlserver/1.2.0.Final/debezium-connector-sqlserver-1.2.0.Final-plugin.tar.gz
> tar -xzf /kafka/connect/debezium-connector-sqlserver
> plugin.path=/kafka/connect

After running the above commands, Debezium is downloaded and installed in the correct directory. Now we are ready to actually spin up our Kafka server.

Step 5: Start Up The Kafka Server

The next step in the procedure is to start the Kafka server. ZooKeeper, an open-source project that is used to coordinate various processes on Kafka’s distributed cluster, will be used to start up the server. ZooKeeper can be started in a different terminal window using the command below. Apache ZooKeeper is available in the bin directory of your Kafka installation.

> bin/zookeeper-server-start.sh config/zookeeper.properties
[2020-07-19 19:08:32,351] INFO Reading configuration from: config/zookeeper.properties (org.apache.zookeeper.server.quorum.QuorumPeerConfig)
...

Relevant information is obtained from the configuration above and a config file with default values is provided before starting the Kafka server with the code below.

> bin/kafka-server-start.sh config/server.properties
[2020-07-19 19:08:48,052] INFO Verifying properties (kafka.utils.VerifiableProperties)
[2020-07-19 19:08:48,101] INFO Property socket.send.buffer.bytes is overridden to 1044389 (kafka.utils.VerifiableProperties)
...

Step 6: Configure Kafka Connect

Now that our Kafka cluster is up, we will want to add our SQL Server connector to Kafka Connect. Kafka Connect exposes a REST API to manage Debezium connectors. To communicate with the Kafka Connect service, you can send an API call with the relevant details using curl or another REST API client. 

We will use the Kafka Connect REST API to add the connector to your Kafka Connect cluster. Before we send any calls to the REST API, we first need to create the configuration. The configuration below is an example of the configuration for a connector instance that monitors a SQL Server server at port 3306 on 192.168.99.100.

{  
  "name": "inventory-connector",    
  "config": {    
  "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",     
  "database.hostname": "192.168.99.100",     
  "database.port": "1433",     
  "database.user": "sa",     
  "database.password": "Password!",     
  "database.dbname": "testDB",     
  "database.server.name": "fullfillment",     
  "table.whitelist": "dbo.customers",     
  "database.history.kafka.bootstrap.servers": "kafka:9092",     
  "database.history.kafka.topic": "dbhistory.fullfillment"   
 }
}

Once we are happy with this configuration, we can then create a request to enable the connector using the Kafka Connect REST API. The command will look like this.

$ curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '
 { 
 "name": "inventory-connector", 
 "config": { "connector.class": "io.debezium.connector.mysql.MySqlConnector", 
 "tasks.max": "1", 
 "database.hostname": "mysql", 
 "database.port": "3306", 
 "database.user": 
 "debezium", 
 "database.password": "dbz", 
 "database.server.id": "184054", 
 "database.server.name": "dbserver1", 
 "database.include.list": "inventory", 
 "database.history.kafka.bootstrap.servers": "kafka:9092", 
 "database.history.kafka.topic": "dbhistory.inventory" 
 } 
}'

When the connector starts, Kafka Connect takes a consistent snapshot of the schemas in your SQL Server database and starts streaming changes, producing events for every inserted, updated, and deleted row. At this point, we will have change events streaming into Kafka through our setup.

Step 7: Monitor Kafka logs For Changed Data 

The SQL Server connector publishes several metrics about the connector’s activities that can be monitored. The connector has two types of metrics, the Snapshot metrics which help in monitoring the snapshot activity, and the Streaming metrics which help in monitoring the progress and activity while the connector reads CDC table data. In the terminal that you used to start your Kafka instance, you should see a fair amount of log outputs from events that are occurring within the SQL server instance. 

Method 2: Using Arcion

Arcion can create a CDC data pipeline that delivers real-time data replication with low-impact to the source, resilient, reliable, at insane scalability. It has the ability to do automatic schema conversions and monitor for schema changes and propagate them appropriately. It supports all types of data versions for replication without restrictions, and has flexible deployment methods (Arcion Self-hosted & Arcion Cloud). Compared to the previous solution we implemented above, Arcion adds a whole new layer of scalability, security, and ease of use and maintenance.

One thing to note is that Arcion is an end-to-end replication tool - which means you don’t need any additional streaming tools like Kafka, Spark Streaming, Kinesis, etc. It simplifies data architecture and reduces the maintenance overheads significantly.

Let’s take a look at how to use Arcion with SQL Server and Kafka to build a data pipeline that scales and is easy to use.

Step 1: Use SQL Server Installer to add the SQL Server Replication Feature to the Database

Replication components can be installed by using the SQL Server installation Wizard or at a command prompt. For the exact steps on how to do it with either approach, check out the Microsoft documentation.

Once you have your SQL Server installation equipped with the SQL Server Replication Feature, we can begin to move to the next step of installing Java and Apache Kafka on our Linux VM.

Step 2: Install Java On Your Linux VM

Just like in the previous example, you will need to install Java on the machine and add Java to the environment variable. Java is a required dependency since the Kafka libs run on Java. To install Java and add it to the environment variables for the machine, use the commands listed below.

> sudo yum install java-1.8.0
> export JAVA_HOME=/usr/lib/jvm/jre-1.8.0

> sudo yum install java-1.8.0

> export JAVA_HOME=/usr/lib/jvm/jre-1.8.0

When the commands complete successfully, Java will be installed and configured as needed to run our Kafka installation. Next, let’s install Kafka.

Step 3: Download And Install Apache Kafka

Again, like the previous solution we created, uur next step will require a few different commands to download and untar the Kafka library. Each step is shown in the commands listed below.

> mkdir kafka/
> cd kafka
> wget https://archive.apache.org/dist/kafka/3.2.1/kafka-3.2.1-src.tgz (asc, sha512)
> tar -xzf kafka-3.2.1-src.tgz (asc, sha512)
> cd kafka-3.2.1-src

In the above commands, we create a directory for Kafka to be extracted into, use wget to download the dependency, and then untar it so that it can be used. Now that we have Kafka downloaded and ready to run, we need to also install Debezium so that we can connect Kafka to our SQL Server instance.

Step 4: Download And Install Arcion Self-hosted

Our last step before actually starting up our solution is to download and install Arcion Self-hosted. This will require a few steps, including downloading Replicant, creating a home directory for it, and adding in your license. Each step can be seen in detail by referencing our quickstart guide in our docs. We will configure Replicant a little later in this walkthrough. Before that, we will fire up our Kafka Connect cluster.

Step 5: Configure Arcion to Connect With SQL Server and Kafka

We now need to configure and enable Arcion to connect with SQL Server and Kafka. The replicant-cli dependency and corresponding folder we downloaded and extracted in Step 4 will now be used. We will refer to the directory as $REPLICANT_HOME in the following steps. Now, let’s go ahead and configure all of the connections with Arcion.

Set Up The Connection Configuration

From $REPLICANT_HOME, navigate to the sample Kafka connection configuration file.

vi conf/conn/kafka.yaml

vi conf/conn/kafka.yaml

Once the file is opened, you will need to make some changes to the configuration. Make the necessary changes as shown in the sample below. Some comments have been left in the example to guide you with each tweak that may be required.

type: KAFKA 

username: 'replicant' #Replace replicant with the username of your user that connects to your Kafka server
password: 'Replicant#123' #Replace Replicant#123 with your user's password 

#ssl:
#  enable: true
#  trust-store:
#      path: "/kafka.server.truststore.jks"
#      password: "" 

#Multiple Kafka brokers can be specified using the format below:
brokers:   
 broker1: #Replace broker1 with your broker name       
  host: localhost #Replace localhost with your broker's host       
  port: 19092 #Replace 19092 with your broker's port   
 broker2: #Replace broker2 with your broker name       
  host: localhost #Replace localhost with your broker's host       
  port: 29092 #Replace 29092 with your broker's port
max-retries: #Number of times any operation on the system will be re-attempted on failures.
retry-wait-duration-ms : #Duration in milliseconds replicant should wait before performing then next retry of a failed operation

Set Up The Applier Configuration

From $REPLICANT_HOME, navigate to the sample Kafka connection configuration file.

vi conf/dst/kafka.yaml

The configuration file has two parts namely: parameters related to snapshot mode and parameters related to real-time mode. Below is an example of parameters for snapshot mode.

snapshot: 
 threads: 16 #Specify the maximum number of threads Replicant should use for writing to the target  
 
 replication-factor: 1 
 schema-dictionary: SCHEMA_DUMP  # Allowed values: POJO | SCHEMA_DUMP| NONE 
 kafka-compression-type: lz4 
 kafka-batch-size-in-bytes: 100000 
 kafka-buffer-memory-size-in-bytes: 67108864 
 kafka-linger-ms: 10

Below is an example of parameters for real-time mode.

realtime:  
 txn-size-rows: 1000  
 before-image-format: ALL  # Allowed values : KEY, ALL  
 after-image-format: ALL   # Allowed values : UPDATED, ALL  
 # shard-key: id  
 # num-shards: 1  
 # shard-function: MOD  
 
 # per-table-config:  
 # - tables:  
 #   io_blitzz_nation:  
 #   shard-key: id  
 #   num-shards: 16  
 #   shard-function: NONE  
 #   io_blitzz_region:  
 #   shard-key: id  
 #   io_blitzz_customer:  
 #   shard-key: custkey  
 #   num-shards: 160

For more information on how to configure Arcion to Connect With SQL Server and Kafka, visit here.

Step 6: Monitor Kafka Topics For Changes

The SQL Server connector publishes several metrics about the connector’s activities that can be monitored. The connector has two types of metrics: 

  • Snapshot metrics which help in monitoring the snapshot activity, 
  • Streaming metrics which help in monitoring the progress and activity while the connector reads CDC table data.

At this point, your setup is complete and your Kafka, SQL Server, and Arcion CDC setup are ready to go.

Limitations of Kafka + Debezium Approach

Even though the vision of Kafka plus Debezium is poetic, in reality, development teams face many issues while implementing Debezium-and-Kafka-based production pipelines. Aside from the fact that you will need a strong team with technical expertise in Kafka and DevOps to ensure scalability and reliability, the architecture itself comes with many limitations. Some of these concerns include:

  • How to guarantee zero data loss? There is no guarantee of zero data loss or transactional integrity if any of the components in this setup fails. Maintaining integrity is the sole responsibility of the development team here. 
  • How to ensure the pipeline architecture is scalable? Even though Debezium advertises many connectors, some of them have scalability issues. Even in the case of good old Postgres, it is not uncommon to run into out-of-memory exceptions while using plugins, like wal2json, to convert write-ahead log output to JSON. Another example is the lack of ability to snapshot tables while still being open for incoming events. This means for large tables, there could be long periods of time where the table is unavailable while snapshotting. 
  • How to handle schema evolution? Debezium does not handle schema evolution in a graceful manner. Even though there is support for schema evolution in some of the source databases, the procedure to implement this is different for different databases. This means a lot of custom logic has to be implemented in Kafka processing to handle schema evolution.
  • How to support various data types for different connectors? Some of the connectors have limitations when it comes to supporting specific data types. For example, Debezium’s Oracle connector has a limitation in handling BLOB data types since they are still in an incubation state. It is not wise to use BLOB data types in Debezium-based production modules.

The Debezium+Kafka approach also comes with a lot of hidden costs. Even though it is free to use, a large amount of engineering effort and time is required to set up production CDC pipelines. The availability of trained engineers to implement these pipelines is also another challenge.

In many cases, architects often design data pipelines that just solve today’s problem and do not consider the long-term implications or needs of the system. When the business grows, their data volume grows. This means that pipelines can become overloaded very quickly. That is why it is wise to explore solutions with long-term scalability and support in mind. 

Why Use Arcion?

When it comes to using Kafka and SQL Server together, it’s obvious from the above demonstration that it is much simpler to implement and maintain Arcion than other approaches. 

Arcion is a go-to solution for many enterprises who are looking to select a data pipeline tool that is scalable, reliable, and extremely easy to configure and use. It provides robust data pipelines that offer high availability, leverage log-based CDC, and auto-scalable features. Available with multiple deployment options, Arcion can migrate data to and from on-prem data sources, cloud-based data sources or a mix of both. 

The zero code approach allows users to easily configure Arcion and build their data pipelines without having to write any code. Arcion can be set up and configured strictly through configuration files or by using Arcion’s intuitive and easy-to-use UI to set up pipelines in a matter of minutes. Compared to homegrown solutions or ones that mismatch a bunch of different technologies, Arcion makes implementation smooth by providing 24/7 support through extensive documentation, tutorials, blogs, and customer support.

Let’s take a look at some specific features that lend themselves well to the Kafka and SQL Server CDC use case.

Sub-Second Latency 

Many other existing CDC solutions don’t scale for high-volume, high-velocity data, resulting in slow pipelines, slow delivery to the target systems. Arcion is the only distributed, end-to-end multi-threaded CDC solution that auto-scales vertically & horizontally. Any process that runs on Source & Target is parallelized using patent-pending techniques to achieve maximum throughput. There isn’t a single step within the pipeline that is single threaded. It means Arcion users get ultra low latency CDC replication and can always keep up with the forever increasing data volume on Source.

Don’t just hear from us. Hear it directly from our user.

100% Agentless Change Data Capture

Arcion is the only CDC vendor in the market that offers 100% agentless CDC to all its supported 20+ connectors. The agentless CDC applies to all the complex enterprise databases including SQL Server. Arcion reads directly from database logs, never reading from the database itself. Previously, data teams faced administrative nightmares and security risks associated with running agent-based software on production environments. You can now replicate data in real time, at scale, with guaranteed delivery — but without the inherent performance issues or the security concerns.

Data Consistency Guaranteed

Arcion provides transactional integrity and data consistency through its CDC technology. To further this effort, Arcion also has built-in data validation support that works automatically and efficiently to ensure data integrity is always maintained. It offers a solution for both scalable data migration and replication while making sure that zero data loss has occurred.

Automatic schema conversion & schema evolution support

Arcion handles schema changes out of the box requiring no user intervention. This helps mitigate data loss and eliminate downtime caused by pipeline-breaking schema changes by intercepting changes in the source database and propagating them while ensuring compatibility with the target's schema evolution. Other solutions would reload (re-do the snapshot) the data when there is a schema change in the source databases, which causes pipeline downtime and requires a lot of compute resources (expensive!).   

Pre-Built 20+ Enterprise Data Connectors

Arcion has a library of pre-built data connectors. These connectors can provide support for almost 20 enterprise databases, data warehouses, and cloud-native analytics platforms (see full list). Unlike other ETL tools, Arcion provides full control over data while still maintaining a high degree of automation. Data can be moved from one source to multiple targets or multiple sources to a single target depending on your use case. This means that if you branch out into other technologies outside of Kafka and SQL Server, you’ll already have the capability within Arcion to handle your new sources and targets without the need for another pipeline technology.

Conclusion

At this point, we have explored multiple facets of how to set up SQL Server CDC using Kafka. First, we took a look at what Kafka is all about and then took a deep dive into two methods of how CDC can be implemented. We outlined what it takes to implement SQL Server CDC to Kafka using Debezium and the alternative and easier route of using Arcion. Both were explained in detail, including the setup and commands it would take to configure both platforms. We also went in-depth on the Arcion features that bring convenience and quality to the SQL Server CDC use case. By now, you should be able to confidently make a decision on which approach is best for your applications and be able to get a good start on implementing CDC with Kafka and SQL Server. 

Looking for a modern CDC solution? Download Arcion Self-hosted or sign up for Arcion Cloud for free and unlock the power of your data through zero data loss and zero downtime pipelines in minutes.

Take Arcion for a Spin

Deploy the only cloud-native data replication platform you’ll ever need. Get real-time, high-performance data pipelines today.

5 connectors: Oracle, MySQL, Databricks, Snowflake, SingleStore

Pre-configured enterprise instance

Available in four US AWS regions

Free download

20+ enterprise source and target connectors

Deploy on-prem or VPC

Satisfy security requirements

Join the waitlist for Arcion Cloud (beta)

Fully managed, in the cloud.

Start your 30-day free trial with Arcion self-hosted edition

Self managed, wherever you want it.

Please use a valid email so we can send you the trial license.