Data storage and replication are two of the essential features of the data industry. In most cases, data operations should support the storage of data in a database and also support data integration with heterogeneous systems. The latter is where data replication comes into play.
MySQL is one of the most widely used databases for modern applications and other systems. Unfortunately, a common way of syncing data from a MySQL instance to another platform is by using a batch-based method. This generally puts a heavy onus on data engineers to constantly create and maintain these batch-based data pipelines. Batch-based pipelines generally extract and replicate data stored in MySQL databases into other data warehouses, data lakes, and cloud data platforms to make the data readily available for analysis.
Change Data Capture (CDC), on the other hand, is a powerful and sought-after real-time method of ingesting data from databases. CDC monitors the database continuously and replicates the changes made on your MySQL database to other downstream databases and applications. MySQL CDC is an essential modern method of extracting data from the MySQL database. Any changes in the data are encapsulated in a “change event” and applied to other data stores. When it is applied correctly, MySQL CDC also can replicate all changes in the data, including delete operations, from the source to the destination.
There are several methods that you can use to implement CDC in MySQL. Some of these methods are similar in approach as they overlap regardless of the technology used. In this guide, we are ultimately going to look at three simple ways in which you can effectively create and maintain real-time data ingestion. We will look at using MySQL CDC to detect changes made to the database and apply those changes to a target system.
What is MySQL?
MySQL is an open-source Relational Database Management System (RDBMS) that is based on a client-server model. MySQL was owned by the Swedish company MySQL AB and later sold to Sun Microsystems, now Oracle. The platform is now fully owned, supported, and distributed by Oracle.
An RDBMS organizes data into one or more tables where the data found in each table may be related to one another. These relationships help in structuring the data and enforcing a data model. SQL is the language used by programmers to create, extract, and modify data from a RDBMS. SQL can also facilitate management tasks within the platform. RDBMS system management includes tasks such as controlling users’ access to the database, managing network access, facilitating the testing of the database to show integrity, and creating database backups.
MySQL operates on the principles of RDBMS and uses SQL to query and manage the databases in a MySQL instance. Being an RDBMS, it possesses the ability to store data between tables efficiently, and create and manage databases of structured data. You can use and modify MySQL’s source code to meet your needs since it is open-sourced. This open-source approach does come with limitations since the GNU Public Licence (GPL) determines what can and can’t be altered.
MySQL is very fast and lightweight which has added to its popularity. It can run alongside other applications on the same server or be hosted independently. MySQL can also scale up easily to maximize the CPU power and memory of the machine where it is being hosted. MySQL is a crucial component of the popular LAMP (Linux, Apache, MySQL, Perl/PHP/Python) web application software stack. It is also easily deployed on many of the most popular computing platforms like Linux, macOS, Microsoft Windows, and Ubuntu. MySQL is still actively developed and maintained so it is constantly being updated and improved to cater to legacy and the most cutting-edge use cases.
Features of MySQL
These are the main features of MySQL:
- Open-source: MySQL is free and open-source, therefore, it can be modified through its source code to handle your basic requirements.
- Client/Server Architecture: MySQL is based on the model of a client-server where the database server (MySQL) can communicate with many clients (multiple applications and programs) to query and make changes.
- Scalable: MySQL is developed to scale easily to meet your business needs, be it a small or large amount of data, clusters of machines, etc. It can handle almost any amount of data.
- Character Sets: MySQL supports different character sets such as Latin 1 (cp 1252 character coding), German, Unicode characters, and so on.
- Data Types: MySQL contains multiple data types such as unsigned integers, signed integers, text, date, time, datetime, timestamp, blob, FLOAT, DOUBLE, character (CHAR), variable character (VARCHAR), and so on.
- Large Databases: MySQL supports large databases of up to about 40 to 50 million records, 150,000 to 200,000 tables, and up to 5,000,000,000 rows.
- Client and Utility Programs: MySQL contains many client and utility programs such as mysqladmin, MySQL Workbench, etc.
- Compatible With Many Operating Systems: MySQL runs on many operating systems like Microsoft Windows, macOS, Linux, varieties of UNIX, Nivell Netwave, and many more.
- MySQL Community: MySQL has a devoted community of developers that one can access to tackle issues and troubleshoot as well.
Benefits of Using MySQL
The benefits of using MySQL include:
- Ease of Use: MySQL is easy to use, all you simply need is a basic knowledge of SQL to build and interact with MySQL. You can also get valuable information about MySQL from several helpful resources like blogs, white papers, books, and the like.
- Flexibility: MySQL is highly flexible and it supports many applications.
- Fast and Reliable: Data access and manipulation is fast in MySQL as data is stored efficiently in its memory. This storage also ensures that data is consistent and true thereby making it reliable.
- Frequent Updates: MySQL is updated frequently to meet modern business needs with new features and security improvements.
- Data is Secure: MySQL is very secure and consists of a security layer that provides an interface with a password system to verify the host before accessing a database and the password is encrypted when connecting to the server.
- Storage Systems: MySQL stores repeatedly used statements as Query Cache and frequently used tables support is provided for the Memory Storage Engine.
How to Set Up MySQL CDC
Change Data Capture (CDC) is a technique that captures changes performed on a database. These changes include INSERT, UPDATE, and DELETE along with DDL changes which are then conveyed to the target database. With CDC activated in MySQL, it becomes easier to keep track of data changes within MySQL. With CDC enabled, data changes are captured in real or near real-time and can be replicated to a target platform. Using CDC can also be a great mechanism to recover and minimize the impact of failures since data may be extracted just before the failure occurred. The changes can then be reapplied to the database to guard against potential data loss and data integrity issues.
As soon as a change event is captured using MySQL CDC, the changes can be replicated in a data warehouse or other platform. Once on the target platform, the data can be further analyzed and used for business intelligence and other use cases such as auditing, copying data to another system, or simply processing events.
Setting up MySQL CDC is largely dependent on what is required of your business. The easiest and most efficient way of doing this is by using the binary logs, but other methods exist. Alternate methods include using triggers, timestamps, and general or audit log plugins. All of these methods will be discussed in the next section to help you figure out which method is the best fit for your use case and setup.
Understanding The Methods To Set Up MySQL CDC
The primary goal of MySQL CDC is to ensure that data changes are readily captured in MySQL and stored appropriately in a data warehouse. This section of this write-up is going to look at how to set up MySQL CDC using three methods:
- MySQL CDC Setup: Using MySQL Triggers
- MySQL CDC Setup: Using Binary Logs
- MySQL CDC Setup: Using Arcion
MySQL CDC Setup: Using MySQL Triggers
One of the easiest ways to implement CDC in MySQL is by using update timestamps. Anytime a record is inserted or modified, the update timestamp is altered to reflect the current time and date. This will allow processes and applications to know the last time when the record was changed. For this to work, it requires that you have an UPDATE timestamp field on the tables you want to capture. The UPDATE fields are then updated with the current timestamp whenever the record is changed. This method can become tedious and lack efficiency since it requires an external script to scan the table and take action every time a change occurs. To avoid the need for constant scanning and custom scripting, this is where database triggers come into play.
Triggers allow you to perform certain actions before or after an action has taken place in a database. They are sometimes referred to as database hooks and execute when predefined events happen in the database. Triggers can be used to always update the timestamp in a table since they look out for any INSERT, UPDATE, and DELETE operations. For example, when an UPDATE command is carried out on a row on the source table, a trigger can be set to fire. The updated rows UPDATE column, the one mentioned earlier, can then be updated to the current date and time. When an INSERT or DELETE command is executed, the trigger runs an update command in MySQL to automatically update the timestamp column to use the current date and time when the record was inserted or deleted.
The example below shows how to create a trigger for an INSERT operation. The trigger will be set on a source table named Students. The table will contain student_id, student_email, student_name, and updated columns. Here is an example SQL script below to create the example table.
Once the table is defined, an INSERT trigger can be created. The trigger will update the updated column on the affected rows with the current timestamp using the now() function.
Alternatively, a trigger can be created that captures all INSERT events and replicates this on a new row on the table student_change_events. This is similar to creating a change audit log. Below you can see the creation of the change event table and corresponding trigger.
Trigger stores captured events inside MySQL only. For these changes to be captured in other data systems, you will have to query the MySQL table holding the events continuously. This can lead to a rather complex setup and cause a fair amount of overhead on the entire system, including the source MySQL database.
Using triggers for MySQL CDC can be beneficial as shown in the listed points below:
- It captures changes instantly giving room for real-time processing.
- Triggers can be used to capture all event types namely: INSERT, UPDATE, and DELETE.
- Custom metadata can be easily added to the change event when using triggers.
Triggers on MySQL CDC also have their downside as shown below:
- Using triggers affects the performance of the database because it increases the execution time of the original operation.
- Triggers require changes are made to the source MySQL database.
- Creating and managing triggers increase the operational complexity of the database.
- If changes need to be applied to a data source outside of the source MySQL database, you will be required to create a data pipeline to query for the changes and apply them to the target database.
- Changes to the schema of the monitored table mean that the trigger function must be manually updated to match.
MySQL CDC Setup: Using Binary Logs
MySQL has an internal feature called binary log (binlog) where all the operations occurring on a database are recorded. The binlog also holds the record of DDL and changes to the tables’ data. The MySQL binlog provides an effective way of keeping tabs on changes happening on the data, highly useful for use with MySQL CDC processes. The binlog is useful for data recovery and replication between different databases found on different machines.
The binlog can also be described as a binary file on disk which holds records of events that prompt changes to the content or structure of the MySQL database. These events include INSERT, UPDATE, and DELETE operations. To use the binlog, MySQL Server must be started with the –log_bin option, which is not enabled by default. To activate binlog, the following three parameters must be set up correctly in the MySQL database server configuration: binlog_format, binlog_rows_image, and expire_logs_days.
The binlog_rows_image should be defined for MySQL 5.6 or higher and should be left out for earlier releases. The expire_logs_days determines how often data is brought from the log and retained in the database. Choosing the right number of days is essential when setting up the expire_logs_days. Setting the days to a very long time will reduce the available storage in the MySQL server or cluster. Setting it to a short time will reduce the retention time in which the log may be purged, reducing its effectiveness for certain use cases.
There are a lot of available open-source tools that can help in the implementation of binlog-based CDC. These tools include Debezium and Maxwell’s daemon, amongst many others. These can be especially useful tools when using a managed MySQL instance since they generally do not directly give access to the configuration. These tools are used to read the binary logs and then write to Kafka. From Kafka, the events can be written to other databases and data platforms.
Binary logs for MySQL CDC have certain advantages such as:
- It enables the capture of events in real-time making sure that downstream applications always have the latest data from MySQL.
- This MySQL CDC method does not affect the performance of the MySQL database as events are read directly on the file system.
- Binlog MySQL CDC handles and detects all types of change events like INSERT, UPDATE, DELETE, and even schema changes.
Having said this, binlog also has the following disadvantages:
- Binlog is not available in old versions of MySQL.
- Binlog does not store the entire change events that occur on a database table, rather, it stores operations that were performed within a given retention period which is defined by the configuration option expire_logs_days.
To overcome the challenges above, the last method using Arcion is recommended and explained next.
MySQL CDC Setup: Using Arcion
Using Arcion to enable CDC in MySQL is extremely simple and robust. This is especially true when comparing it to the methods discussed above and overcomes many of the limitations.
In the example below, Arcion is used to create a CDC pipeline between a MySQL instance and Databricks using some example schemas and data. For the full breakdown, including how to set up MySQL and Databricks to follow along, check out the full article here. The example below requires that the bin_log be enabled on the MySQL instance so that Arcion can track the changes and replicate them from MySQL over to the Databricks instance. Let’s take a look at each step required to set up the CDC-enabled pipeline.
First, log into Arcion Cloud or start an instance of Arcion Self-Hosted (GUI available), you will land on the Replications screen. Here, we will click on the New Replication button in the middle of the screen.
Next, we will select our replication mode and write mode. A few options are available to suit your needs. For replication modes, Arcion supports:
- Snapshot (the initial load)
- Full (snapshot + CDC)
For write modes, Arcion supports:
For our purposes here, we will select the replication mode as Full and the write mode as Truncating. You will also see that I have named the replication “MySQL to Databricks”. Once the Name is populated and the Replication and Write Modes are selected, click Next at the bottom of the screen.
We are then brought to the Source screen. From here we will click the Create New button.
We then will select MySQL as our source.
And then scroll to the bottom of the page and click Continue.
Now, we can add our MySQL instance details. These details include:
- Connection Name
All other fields will have default values.
Once the connection is saved, we will want to pull in the schema from the database. On the next page, we will be prompted to click the Sync Connector button. Click the button and Arcion Cloud will connect to our MySQL instance and pull down the schema.
Once completed, the UI in Arcion Cloud will display the retrieved schema. Then we will click Continue in the bottom right corner of the screen to move to the next step.
We now have our data source correctly configured. This will be displayed on the next screen as well as a Test Connection button. To ensure that everything is working correctly, we will click the Test Connection button.
The results should look like this once the test is done running. You can click the Done button to exit.
With our test successful, we can now click Continue to Destination in the bottom right corner of the screen to move to the steps where we set up our destination.
On the Destination screen, we will click New Connection to start the set up of our Databricks connector.
Then, select Databricks as your Connection Type and click Continue.
On the next screen, input your connection details. These details include
- Connection Name
- JDBC URL
All other fields will have default values.
For the JDBC URL and Host fields, these values can be found in the Databricks console. For the Username and Password fields, the Username field will just have the string “token” entered into it. The Password field will use a Personal Access Token generated within Databricks.
We will then click Save Connection at the bottom of the screen.
After the connection is saved, we will then click on Test Connection to ensure that the connection is working as intended.
The Connection Test modal will then display. If the connection is set up correctly, you will see all green checks on all the tests performed. Once complete, click Done.
Our last step on this screen is to click on the Continue button.
On the next screen, we will sync the connector. Click Sync Connector and wait for the process to complete.
Once complete, you will see the schema loaded onto the screen. We can then click Continue in the bottom right corner of the screen.
On the next screen, we will make sure to our new Databricks connection as the Destination. Once selected, click Continue to Filter.
On the Filters screen, we will check the Select All checkbox so that all of our tables and columns will be replicated from the source to the destination.
Optionally you can also click on the Map Tables and Per Table Configs (Applier Configuration Docs, Extractor Configuration Docs) buttons to add further configuration. For this example, we will leave these as their default values. After this, you will click Start Replication.
The replication will then begin.
Once the initial data is loaded, the pipeline will continue to run, monitor for changes, and apply those changes to the destination. The idle pipeline will still show RUNNING in the top right of the screen but will show a row replication rate of 0 until new data is written to the source. You’ll also notice that the Phase description of the pipeline will now show Change Data Capture instead of Loading Snapshot Data.
If we start adding data to the MySQL instance (for example, by running our load script again) we will see that Arcion detects this and will then sync that data over to Databricks in real-time. This can be seen in the spike in INSERT operations in the previous screenshot from updating some records in the MySQL instance.
Why Use Arcion For MySQL CDC?
When it comes to creating data pipelines easily, the right tool can make all the difference. Of all the tools available, we built Arcion so that it is much simpler to implement and maintain than other tools and approaches.
Arcion is a go-to solution for many enterprises who are looking to select a data pipeline tool that is scalable, reliable, and extremely easy to configure and use. It provides robust data pipelines that offer high availability, streaming capabilities through log-based CDC, and auto-scalable features. Available with multiple deployment options, Arcion can migrate data to and from on-prem data sources, cloud-based data sources or a mix of both.
The zero-code approach allows users to easily configure Arcion and build their data pipelines without writing any code. Arcion can be set up and configured strictly through configuration files or by using Arcion’s intuitive and easy-to-use UI to set up pipelines in a matter of minutes. Compared to homegrown solutions or ones that mismatch a bunch of different technologies, Arcion makes implementation smooth by providing 24/7 support through extensive documentation, tutorials, blogs, and customer support.
Let’s take a look at some specific features that will benefit you while building MySQL CDC pipelines with Arcion.
Many other existing data pipeline solutions don’t scale for high-volume, high-velocity data. This results in slow pipelines and slow delivery to the target systems. Arcion is the only distributed, end-to-end multi-threaded CDC solution that auto-scales vertically & horizontally. Any process that runs on Source & Target is parallelized using patent-pending techniques to achieve maximum throughput. There isn’t a single step within the pipeline that is single-threaded. It means Arcion users get ultra-low latency CDC replication and can always keep up with the forever-increasing data volume on Source.
100% Agentless Change Data Capture
Arcion is the only CDC vendor in the market that offers 100% agentless CDC to all its supported 20+ connectors. Arcion reads directly from database logs, never reading from the database itself. Previously, data teams faced administrative nightmares and security risks associated with running agent-based software in production environments. You can now replicate data in real-time, at scale, with guaranteed delivery - but without the inherent performance issues or the security concerns of having to install an agent to extract data from your pipeline sources.
Data Consistency Guaranteed
Arcion provides transactional integrity and data consistency through its CDC technology. To further this effort, Arcion also has built-in data validation support that works automatically and efficiently to ensure data integrity is always maintained. It offers a solution for both scalable data migration and replication while making sure that zero data loss has occurred.
Automatic Schema Conversion & Schema Evolution Support
Arcion handles schema changes out of the box requiring no user intervention. This helps mitigate data loss and eliminate downtime caused by pipeline-breaking schema changes. This is possible by intercepting changes in the source database and propagating them while ensuring compatibility with the target's schema evolution. Other solutions would reload (re-do the snapshot) the data when there is a schema change in the source databases, which causes pipeline downtime and requires a lot of computing resources (expensive!).
Pre-Built 20+ Enterprise Data Connectors
Arcion has a large library of pre-built data connectors. These connectors can provide support for almost 20 enterprise databases, data warehouses, and cloud-native analytics platforms (see full list). Unlike other ETL tools, Arcion provides full control over data while still maintaining a high degree of automation. Data can be moved from one source to multiple targets or multiple sources to a single target depending on your use case. This means that if you branch out into other technologies, you’ll already have the capability within Arcion to handle your new sources and targets without the need for another pipeline technology.
The guide has covered a lot about MySQL CDC starting from definitions, and benefits, to showing how to set up MySQL CDC, and methods to implement it.
From the write-up, it is easy to see that using any of the mentioned methods will go a long way in creating an efficient work environment for your business growth. Though the methods listed are helpful, it is advisable that if you have the need for real-time MySQL CDC to use Arcion. Setting up MySQL CDC with Arcion is far easier, scalable, and more reliable than the other methods we reviewed
To get started with Arcion today, download Arcion Self-hosted for free and see Arcion in action yourself (no payment info required). If you prefer, you can also book a personal demo with our MySQL CDC experts today to get started!