What is Microsoft SQL Server?
One of the most popular databases on the market, Microsoft SQL Server is a Relational Database Management System (RDBMS) developed by Microsoft. Microsoft markets MS SQL Server in different variants to suit the needs of a large, diverse audience. Each of these flavors come at an increasing price and with increased feature availability.
Some editions of Microsoft SQL Server include Standard, Web, Enterprise, Workgroup, and a few others. Microsoft also offers a cloud-based version of SQL Server known as Microsoft Azure SQL Database. Alongside Oracle Database and IBM's DB2, Microsoft SQL Server is one of the leading database solutions, a favorite of enterprises and large organizations. These organizations rely on SQL Server because of its security, reliability, and widespread technical support.
Unlike other relational database solutions, Microsoft SQL Server uses proprietary Transact-SQL (T-SQL) for its SQL operations. T-SQL is a dialect of SQL developed solely for use with Microsoft SQL Server. Within SQL Server there are proprietary data types as well that extend beyond traditional SQL data types. Queries in Microsoft SQL Server are declarative in nature and clearly specify what is to be retrieved, however, the query processor in SQL Server is in charge of optimizing queries according to a sequence of actions called a query plan.
An extremely flexible solution, Microsoft SQL Server can be used as the data store for miniature applications running on a single machine to sizeable resource-intensive internet applications serving a massive workload of concurrent users. It is a versatile database solution that can be a good fit for most use cases that require a secure and reliable way to store data. Whether you're deploying the solution on-premise or on a cloud platform, such as Google Cloud, Amazon Web Services, or Microsoft Azure, SQL Server is widely available and supported. This is especially true for those who are working within the .NET and Microsoft development ecosystems
What is SQL Server Change Data Capture (CDC)?
Change Data Capture is a software design pattern for databases whereby changes that occur in a source table or database are captured and replicated in another table or database. That means as data is updated in a source system, that change is applied in the target system. Microsoft SQL Server which is one of the leading database solutions that also supports this convention. This makes it a great tool for both database migration and data replication.
Change Data Capture in Microsoft SQL Server is implemented by the SQL Server agent to record INSERT, UPDATE, and DELETE operations as they occur in a table. The details of these recorded changes can be consumed in a relational format. The information in columns and other metadata required to replicate the change is then applied to a target environment, keeping the two separate data stores in sync, automatically.
Why do we use CDC in SQL Server?
The robustness of Microsoft SQL Server means that it is a good candidate for the implementation of an organization’s Change Data Capture strategies. There are several benefits to having Change Data Capture as part of the data migration and data replication strategy used within an organization.
In this section, you will be introduced to some of those benefits as they relate to Microsoft SQL Server. First, Change Data Capture can be used for real time loading of data into a data warehouse. Unlike Extract, Transform, Load (ETL) operations, where one database system is periodically migrating data to another database system, Change Data Capture implemented in Microsoft SQL Server can be used to stream changes and migrate data in a source SQL Server instance to a target system, like a data warehouse, as the changes occur. The data stored in the data warehouse can then be used for data analytics or by other Business Intelligence (BI) applications for the derivation of insights.
The benefit of doing this according to the Change Data Capture paradigm is that any insights gleaned from the data will be up to date. As many know, having real-time data is the key to making insightful decisions when they matter most. A delay in the data being processed could lead to lost opportunities or, in the case of predictive business analytics, completely stunt the use case.
Second, Change Data Capture in Microsoft SQL Server can be used to synchronize the data stored in an organization’s on-premises database with its cloud database. For Microsoft SQL Server, the equivalent cloud database offering - Microsoft Azure SQL Database can be used as the target that closely mirrors the on-prem data, so any updates to the on-prem data is replicated in the cloud. The cloud database can then be used as a data source by other data-intensive applications like reporting or analytics software. Using CDC as a data migration tool can ensure that databases are kept in sync and data quality remains intact, an important part of a data protection strategy.
Third, an audit application can be built into an organization’s workflow using Change Data Capture in Microsoft SQL Server to ascertain when data changed, why it was changed and by who. More will be discussed on Change Data Capture as an audit solution in a subsequent section of this article.
Understanding Microsoft SQL Server Incremental Data Load
Incremental data load can be defined as the process of copying only new data or updated data from a source database to a destination database. The data loaded into the destination database at the time the incremental load process begins includes modified or new records, every other record that is identical in both the source and destination database is ignored.
The main advantage of incremental data load is time savings. It takes less time to load only relevant data compared to setups where data in the destination or target database is completely dropped before the new state in the source database is copied over. By using Microsoft SQL Server as a data loader, it is possible to implement incremental data load through a few different approaches. These include adding timestamp/rowversion datatypes columns, created/modified date and time columns, Change Data Capture or through replication. When using incremental data load with Microsoft SQL Server, it is important to make sure that there is no misidentification of source rows that have changed so that data integrity can be preserved in the target database.
Installation and Architecture
To use Change Data Capture in Microsoft SQL Server, it must be enabled. Previously, Change Data Capture was only available in SQL Server Enterprise edition, however from SQL Server 2016, it can be used across other editions.
Change Data Capture in Microsoft SQL server works by using two SQL Server Agent jobs that are created once CDC is enabled on a database. The first agent is responsible for populating the database tables with the information that has changed while the second agent cleans old records according to the configurable retention policy.
When a database is enabled for Change Data capture, the cdc schema, cdc user, metadata tables and other objects are created for that database. A database can also be checked to know if it is currently enabled for CDC by querying the is_cdc_enabled column in the sys.databases catalog view.
Change Data Capture as an Audit Solution
Change Data Capture can be used in conjunction with Microsoft SQL Server as an audit solution to track changes in a table’s data. The database operations for INSERT, UPDATE and DELETE can be recorded with detailed information on a mirrored table whenever a change is detected. This can be more efficient than setting up individual triggers to write to audit tables.
Enabling CDC in SQL Server will provide an audit trail to see the change in the structure of the data. Additional columns are created in the mirrored table to record a description of the changes. Every INSERT statement has a corresponding record that shows the inserted value, the DELETE statement equally has a record for the deleted value, however, the UPDATE statement has two records. The first record for the UPDATE statement shows the data before the change and the second record shows the data after the update has been performed.
Some of the additional columns that are created as part of the audit functionality are __$start_lsn and __$end_lsn, which show the commit log sequence number assigned by SQL Server Engine for its internal use to track the change, __$seqval, which tracks the order of a specific change in the hierarchy of other changes in the same transaction, __$operation, which identifies the kind of operation (insert, delete, update), etc.
How to Enable CDC (Change Data Capture): Step-by-Step
There are two ways to enable Change Data Capture in Microsoft SQL Server. CDC can be enabled either at the database level or on specific tables. To enable Change Data Capture at the database level, a member of sysadmin fixed server role has to run a stored procedure (sys.sp_cdc_enable_db) in the database context. The content of the stored procedure is shown below.
If you are using SQL Server Management Studio, you can go to View > Template Explorer > SQL Server Templates and then select the change data capture sub-folder to access the templates.
To enable Change Data Capture for specific tables, it needs to have been enabled at the database level. CDC at the table level can be enabled by members of the db_owner fixed database role group. You can run the procedure below to enable it.
In the above procedure, the @filegroup_name parameter is used to specify the filegroup that will contain the change table associated with this instance. By default, the change table is located in the default filegroup of the database.
Change Data Capture can be disabled by a member of the sysadmin fixed server role group by running the stored procedure sys.sp_cdc_disable_db (Transact-SQL) in the database context. The content of the stored procedure is shown below.
Similarly, Change Data Capture can be disabled on a table by a member of the db_owner fixed database role via the stored procedure below.
Auditing DML Changes
After successfully enabling Change Data Capture at the database level and on specific tables, INSERT, UPDATE and DELETE operations get mirrored in the change table alongside values populated in the additional columns created to track the changes. For each Data Manipulation Language (DML) change such as an INSERT, for example, the data stored in the change table will vary according to the type of operation. In the __$operation column on the change table which was discussed in an earlier section, its value will be:
- 1 for DELETE
- 2 for INSERT
- 3 for an UPDATE before a change
- 4 for an UPDATE after a change
The change table will have new records that correspond to INSERT, UPDATE or DELETE operations performed on the source table. That way data in the change table can be queried to get the time a change occurred in a specific record, the type of change, the value associated with the record before and after the change occurred, and the order of the change and the specific columns that were updated.
SQL Server Change Data Capture: Pros & Cons
Most design patterns in data and software engineering require trade-offs and Change Data Capture with Microsoft SQL Server is no exception. Let’s take a look at some advantages and disadvantages of using SQL Server Change Data Capture as a strategy in your data stack.
Below are the most obvious benefits of using Microsoft SQL Server Change Data Capture for data transfer:
- SQL Server Change Data Capture does not have a requirement that the table to be replicated must have a primary key. Unlike other comparable solutions, this means that tables that do not have a primary key can have their data replicated in a target table.
- The Change Data Capture functionality in SQL Server can be used for Always On Availability Groups in SQL Server because the data captured by CDC is easily reproduced across replicas.
Some disadvantages of SQL Server Change Data Capture are listed below:
- Even though a Change Data Capture mechanism is native to SQL Server, there is still a significant amount of custom coding required if the changes recorded as part of the replication are to be fed to a data warehouse. That means that plumbing needs to be done to have SQL Server replication work with a connected data warehouse.
- To make sure that data integrity is maintained when using SQL Server CDC, ongoing maintenance of the database must be performed so that changes are not misrepresented, especially for new tables and data types.
In this article, you were introduced to the RDBMS - Microsoft SQL Server, you were also introduced to what Change Data Capture is as relates to SQL Server and its application as an audit solution. You were then shown how to enable CDC to migrate data in SQL Server both at the database and table level. Finally, some advantages and disadvantages of SQL Server Change Data Capture were enumerated to enable you to have a well-rounded understanding if faced with choosing SQL Server as a Change Data Capture solution.
Looking for a more robust CDC solution that easily supports many different sources and targets, including SQL Server? Arcion is one of the best data replication and data migration tools to use when implementing CDC with Microsoft SQL Server. By using Arcion, it’s easy for organizations to build pipelines using SQL Server as a source or destination. Easily move data from SQL Server to or from other platforms such as MySQL, MongoDB, and many other popular databases. Arcion also opens up use cases to efficiently move data into big data platforms such as Snowflake or DataBricks, with no code required. For this particular use case, Arcion is much easier than the built-in CDC supported by Microsoft SQL Server.
Benefits of using Arcion over Microsoft SQL Server CDC include:
- No-code connector support for 20+ sources and target databases and data warehouses
- Multiple deployment types supported across cloud and on-premise installations
- Configuration can easily be done through UI, without any SQL commands, SQL Server configuration changes, or code
- Automatic schema conversion & schema evolution support out-of-the box (including SQL and NoSQL conversion)
- Patent-pending distributed & highly scalable architecture: Arcion is the only end-to-end multi-threaded CDC solution on the market that auto-scales vertically & horizontally. Any process that Arcion runs on Source & Target is parallelized using patent-pending techniques to achieve maximum throughput.
- Built-in high availability (HA): Arcion is designed with high availability built-in. It makes the pipeline robust without disruption and data always available in the target, in real-time.
- Auto-recovery (patent-pending): Internally, Arcion does a lot of check-pointing. Therefore, any time the process gets killed for any reason (e.g., database, disk, network, server crashes), it resumes from the point where it was left off, instead of restarting from scratch. The entire process is highly optimized with a novel design that makes the recovery extremely fast.
Arcion supports many different deployment types, including self-managed on-premise, fully-managed cloud, hybrid, and even multi-cloud deployments. If you’re looking for a flexible on-premise solution, try out Arcion Self-Managed. To get started even more quickly, try out Arcion Cloud and have a scalable and reliable cloud data migration tool set up in minutes.