Change Data Capture Azure SQL: The Definitive Guide

Luke Smith
Enterprise Solutions Architect
January 30, 2023
Matt Tanner
Developer Relations Lead
January 30, 2023
Matt Tanner
Developer Relations Lead
January 30, 2023
Get a migration case study in your inbox
Join our newsletter

It’s extremely common to have data that is generated by various sources across an enterprise. From tabular data found in databases, web data derived from the interactions of customers, to data from sensors and Internet of Things (IoT) devices, there are infinite possibilities and formats for the data produced. However, for this data to be of value to an organization, a data strategy must be in place that moves data from between where it is initially created or input to the applications that need it.

Change Data Capture, as you will see in the rest of this article, is one such technique that can move recently created and updated data from where the change occurred to the target system where it is needed. To enable Change Data Capture, you need a scalable data storage solution that guarantees availability, is performant and can recover from disaster or error.

In this article, we will focus on enabling Change Data Capture processes using Azure SQL Database. The concept of Change Data Capture will be the first topic we cover, followed by an overview of Azure SQL Database, including its key features and advantages. We will then look at how to enable Change Data Capture in Azure. As part of this, we will look at the steps necessary for implementing Change Data Capture in Azure SQL Database. Each step in the process will be highlighted and various table-valued functions will also be covered. Lastly, we will cover implementing Change Data Capture in Azure SQL Database using Arcion.

At the end of the article, the reader will have a firm grasp on the concepts of Change Data Capture, Azure SQL Database, how to leverage Azure SQL Database to set up a Change Data Capture pipeline, and how to use Arcion to achieve Azure SQL CDC through a unified data integration platform.

Table of Contents

What is Change Data Capture (CDC)?

Change Data Capture is a software design pattern that stipulates that changes in data in one part of an application, environment, or system are communicated or transmitted downstream to another part of the software system or architecture. In a good CDC solution, the process will make sure that only data that has changed is sent to the target system. To do this, changes in the source data are tracked so that only the updated data is sent from the source system to the target system.

In databases, Change Data Capture processes can be implemented between two or more databases. The source or primary database can have its data change events transmitted or communicated to one or more target databases downstream. By only transmitting the data that has been changed and applying that diff to the target database, CDC can be very performant and efficient. With a snapshot approach to CDC, the entire state of the database is copied from one location to another which is much less efficient.

There are various mechanisms for achieving Change Data Capture within databases. Two popular high-level methodologies are Push vs Pull approaches. With a push-based Change Data Capture approach, the source database pushes new changes/updates to the target database. The expectation is that the target database will be available to receive updates. If the source database is being used in production, the database will likely experience more workload since, in addition to responding to queries from applications, it also needs to send changes to the target database. For the pull-based approach, the target database periodically requests any changes from the primary database that have occurred since the last pull. With this approach there is less strain on the primary database and ensures the target database is up to date since it has to be online to request updates.

Most modern Relational Database Management Systems (RDBMS) support Change Data Capture natively. Between the available solutions, each implementation may be quite different. Some rely on adding additional columns to table data to track changes, while others use transaction logs that can be read to replicate changes.

What is Azure SQL Database?

Azure SQL Database is a part of the family of modern, efficient, reliable, and scalable cloud databases offered within Microsoft Azure's cloud services. Azure SQL DB is built on the same technology that powers Microsoft SQL Server, which is a favorite amongst enterprises because of the amount of workload it can process. It is also a tried and tested database solution and is readily supported and integrated with many enterprise business applications. Azure SQL Database is therefore an ideal solution for enterprises that want to maximize their in-house SQL Server database expertise.  It is a perfect fit for organizations with SQL Server expertise who want to migrate towards a cloud-centric approach where their data infrastructure is fully managed by a cloud service provider.

As with many fully-managed solutions, Azure Database SQL can enable teams to prototype more quickly and easily scale out production-ready solutions. It has built-in controls that can manage who has access to the data at a granular level, which is often required for large enterprises with sensitive data. It also has threat detection to ensure that data remains secure.

Azure SQL Database comes with a Service Level Agreement (SLA) of 99.995% of uptime, virtually guaranteeing that your data will be highly-available and performance will be maintained regardless of spikes or peak workloads. Since Azure SQL Database is a cloud offering in the more general Azure infrastructure, it is integrated with other Azure services on the platform, like Artificial Intelligence services, and also supports serverless computation for hyper scalability.

How to Enable Change Data Capture in Azure SQL Database

In Azure SQL Database, Change Data Capture is achieved through the recording of changes to the source tables in transaction logs. Another process then reads those changes from the logs and updates the mirrored or change tables. The changes in the mirrored table are then made available through Change Data Capture query functions. Unlike in SQL Server CDC, where the SQL Server Agent is responsible for starting stored procedures to capture changes, a Change Data Capture scheduler is used to run periodic capture and cleanup of change tables in Azure SQL Database.

To track changes in Azure SQL Database, Change Data Capture must first be enabled at the database level before it can be enabled on tables of interest. This activation should be done by a user with a db_owner role. CDC can be activated on the Azure SQL Database by running the sys.sp_cdc_enable_db stored procedure on the database to be enabled with Change Data Capture. You can also check to know whether CDC is enabled on a database by querying the is_cdc_enabled column in the sys.databases catalog view.

Below is the command required to enable Change Data Capture on a database in Azure SQL.

-- ====
-- Enable Database for CDC template
-- ====
USE MyDB
GO
EXEC sys.sp_cdc_enable_db
GO

The next step is to activate CDC on the source tables. To do so, the database user must be a member of the db_owner role. The stored procedure in sys.sp_cdc_enable_table can be used to achieve this. Running this stored procedure will create a capture instance for the individual table that is referenced. To determine if CDC has been enabled on a table, you can inspect the is_tracked_by_cdc column in the sys.tables catalog view. Below is the command to enable Change Data Capture on a table.

-- =========
-- Enable a Table Specifying Filegroup Option Template
-- =========
USE MyDB
GO
EXEC sys.sp_cdc_enable_table
@source_schema = N'dbo',
@source_name   = N'MyTable',
@role_name     = N'MyRole',
@filegroup_name = N'MyDB_CT',
@supports_net_changes = 1
GO

View Captured Data Using Change Data Capture in Azure SQL Database

When Change Data Capture is enabled on a table in Azure SQL Database, all columns by default are tracked. These tracked changes are recorded in a separate change table associated with the source table. However, each change is recorded as a new row in the Change table. The first 5 columns are used to store metadata information about the change while the remaining columns are used to store the actual data associated with the change. The metadata columns used are:

  • __$start_lsn
  • __$end_lsn
  • __$seqval
  • __$operation
  • __$update_mask

The __$start_lsn and __$end_lsn columns are used to identify the log sequence number (LSN). The LSN is used to track the commit associated with the change and the transaction it occurred in.

The __$seqval column is used to order more changes that may have occurred as part of the same transaction.

 The __$operation column is very important since it can be used to know the type of operation that occurred on the source table. The change table stores one-row entry for an INSERT or DELETE operation but stores two entries for an UPDATE operation. The first entry in an UPDATE operation contains the data before the change while the second entry contains the data after the change. The __$operation column has 4 valid possible values that identify each operation namely:

  • 1 = DELETE,
  • 2 = INSERT,
  • 3 = UPDATE (before change),
  • 4 = UPDATE (after change)

It is therefore possible to query the Change table using the __$operation column and valid parameters to view a slice of the updated data, however, it may be more beneficial to use table-valued functions in the long run.

The final metadata column is __$update_mask, which is a variable mask. For INSERT and DELETE operations, the mask will have all bits set. For UPDATE, only the columns with changed data will have their bits set.

Use the Table-valued Function for Viewing Change Data

The change tables in Azure SQL Database are not intended to be queried directly. Instead, Azure SQL Database provides a more systematic way for users to access change data through table-valued functions that are available. There are various table-valued functions that can help you access subsets of the data in your change tables depending on your needs. Some of the table-valued functions include:

cdc.fn_cdc_get_all_changes_<capture_instance>

This function returns a table that contains rows for every change that occurred in the source table within the interval specified by the log sequence number (LSN). So if a column was changed multiple times within the interval, each change is included in the result set returned.

 cdc.fn_cdc_get_net_changes_<capture_instance>

This function returns a table that contains only the net changes within the log sequence numbers range. That means that even if multiple changes occurred in a column within that interval, only one row is returned that reflects the net change for that column.

 Sys.fn_cdc_get_min_lsn

This function returns the value of the start_lsn column which is used to track the low endpoint to determine the validity of the capture instance.

 Sys.fn_cdc_has_column_changed

This function specifies whether the associated column was modified in the row change result using the update mask.

How Arcion Can Help Set Up CDC with Azure SQL

Arcion is a cloud-native Change Data Capture platform that integrates with various enterprise databases and supports the replication of data at scale while guaranteeing transactional integrity.  Arcion utilizes agentless, log-based CDC for its supported databases. This approach eliminates the need to install or manage CDC software and plugins on the source database.

The Azure SQL family of databases is fully supported by Arcion. The steps required to enable Change Data Capture natively on Azure SQL Database, which was discussed in previous sections of this article, can be more conveniently carried out with Arcion. This can be done through a simplified and intuitive UI or the CLI. For example, you can set up a streaming pipeline in Arcion using Change Data Capture to replicate changes from Azure SQL Database to any of the other destination databases supported such as Oracle, DB2, SAP HANA, Salesforce, PostgreSQL, MongoDB, Snowflake, and many more. Below are some highlights of using Arcion for Change Data Capture, including and in addition to some of the ones mentioned above.

Sub-Second Latency

Many other existing CDC solutions don’t scale for high-volume, high-velocity data, resulting in slow pipelines, and slow delivery to the target systems. Arcion is the only distributed, end-to-end multi-threaded CDC solution that auto-scales vertically & horizontally. Any process that runs on Source & Target is parallelized using patent-pending techniques to achieve maximum throughput. There isn’t a single step within the pipeline that is single-threaded. It means Arcion users get ultra-low latency CDC replication and can always keep up with the forever increasing data volume on Source.

100% Agentless Change Data Capture

Arcion is the only CDC vendor in the market that offers 100% agentless CDC to all its supported 20+ connectors. The agentless CDC applies to all complex enterprise databases including Azure SQL and SQL Server. Arcion reads directly from the database transaction log, never reading from the database itself. Previously, data teams faced administrative nightmares and security risks associated with running agent-based software in production environments. You can now replicate data in real-time, at scale, with guaranteed delivery — but without the inherent performance issues or the security concerns.

Data Consistency Guaranteed

Arcion provides transactional integrity and data consistency through its CDC technology. To further this effort, Arcion also has built-in data validation support that works automatically and efficiently to ensure data integrity is always maintained. It offers a solution for both scalable data migration and replication while making sure that zero data loss has occurred.

Automatic Schema Conversion & Schema Evolution Support

Arcion handles schema changes out of the box requiring no user intervention. This helps mitigate data loss and eliminate downtime caused by pipeline-breaking schema changes by intercepting changes in the source database and propagating them while ensuring compatibility with the target's schema evolution. Other solutions would reload (re-do the snapshot) the data when there is a schema change in the source databases, which causes pipeline downtime and requires a lot of compute resources, which can get expensive!

Pre-Built Enterprise Data Connectors

Arcion has a library of pre-built data connectors. These connectors can provide support for the most popular enterprise databases, data warehouses, and cloud-native analytics platforms (see full list). Unlike other ETL tools, Arcion provides full control over data while still maintaining a high degree of automation. Data can be moved from one source to multiple targets or multiple sources to a single target depending on your use case. This means that if you branch out into other technologies outside of Azure SQL and SQL Server, you’ll already have the capability within Arcion to handle your new sources and targets without the need for another pipeline technology.

Conclusion

Change Data Capture is an invaluable technique for any data-centric organization to have in its data tool kit. When combined with a powerful database solution like Azure SQL Database, data can be replicated and utilized at scale. This article explained the concepts of Change Data Capture, Azure SQL Database, how to perform CDC in Azure SQL Database, and how to use table-valued functions to view change data.

Lastly, we looked at how a data integration platform like Arcion can help you simplify the CDC workflow in Azure SQL Database. Arcion offers a best-in-class solution for organizations looking for a modern approach to CDC. To get started today you can book a personal demo with our CDC experts today to get started!

With Arcion you can unlock the power of your data through zero data loss and zero downtime pipelines in minutes.

Matt is a developer at heart with a passion for data, software architecture, and writing technical content. In the past, Matt worked at some of the largest finance and insurance companies in Canada before pivoting to working for fast-growing startups.
Matt is a developer at heart with a passion for data, software architecture, and writing technical content. In the past, Matt worked at some of the largest finance and insurance companies in Canada before pivoting to working for fast-growing startups.
Luke has two decades of experience working with database technologies and has worked for companies like Oracle, AWS, and MariaDB. He is experienced in C++, Python, and JavaScript. He now works at Arcion as an Enterprise Solutions Architect to help companies simplify their data replication process.
Join our newsletter

Take Arcion for a Spin

Deploy the only cloud-native data replication platform you’ll ever need. Get real-time, high-performance data pipelines today.

Free download

8 sources & 6 targets

Pre-configured enterprise instance

Available in four US AWS regions

Contact us

20+ enterprise source and target connectors

Deploy on-prem or VPC

Satisfy security requirements