Data can be infinitely complex, especially inside modern organizations. Typical databases store data but do not make it easy to make sense of that data. Of course, you could pull some high-level insights from a SQL query against your primary database, but deriving critical business insights is much harder on these platforms, and combining data from multiple databases this way is nearly impossible. This has driven a considerable increase in dependency on data platforms like Snowflake and Databricks. These platforms can consolidate and store large amounts of data from multiple sources and allow for complex querying and deep insights across enterprise-wide data.
Once data lands on these platforms, the possibilities are almost infinite. Some everyday use cases across large enterprises include:
- Behavior and purchase analysis enable real-time recommendations, providing customers with a personalized experience.
- Expenditure pattern analysis enables financial institutions to use real-time fraud detection to prevent losses and protect customers.
To power these use cases, you need two things: lots of historical data to derive patterns from, and real-time streaming of the latest data to act upon. So how do you enable such a setup? Through robust and fast data pipelines.
Modern tooling and integrations have made this very easy to configure. With a product like Arcion, creating and operating data pipelines requires minimal engineering and support effort. These pipelines can quickly enable CDC and unlock new capabilities for your business. Curious exactly how easy it is? We will get to that shortly, after we go over what CDC, or Change Data Capture, is.
What is Change Data Capture?
By definition, Change Data Capture is an approach to data integration based on the identification, capture, and delivery of the changes made to enterprise data sources. In short, Change Data Capture instantly detects changes in data from a source, like a primary application database, and applies those changes to the data warehouse. As a result, it addresses problems related to moving data safely, reliably, quickly, and consistently around the enterprise.
Previously, changes would be added to a backlog, move through a batch process that carried them over to a destination, and be applied well after those changes occurred in the source database. This approach was exceptionally far from real-time and left businesses without current data at the moments they needed it most, when making decisions or recommendations.
A common characteristic of most Change Data Capture products, especially those that rely on log-scanning mechanisms, is low impact on the source databases.
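Conceptually, a CDC pipeline consumes an ordered stream of change events from the source's log and replays each one against the target. The Python sketch below is purely illustrative (it is not Arcion's implementation or API): hypothetical insert, update, and delete events are applied, in commit order, to an in-memory "warehouse" keyed by primary key.

```python
# Illustrative sketch of CDC semantics, not any vendor's implementation:
# an ordered stream of change events is replayed against a target store.

events = [
    {"op": "insert", "key": 1, "row": {"customer": "Ada", "total": 120}},
    {"op": "insert", "key": 2, "row": {"customer": "Grace", "total": 80}},
    {"op": "update", "key": 2, "row": {"customer": "Grace", "total": 95}},
    {"op": "delete", "key": 1},
]

def apply_event(target: dict, event: dict) -> None:
    """Apply one change event to the target, keyed by primary key."""
    if event["op"] in ("insert", "update"):
        target[event["key"]] = event["row"]
    elif event["op"] == "delete":
        target.pop(event["key"], None)

warehouse = {}
for event in events:  # events must be applied in commit order
    apply_event(warehouse, event)

print(warehouse)  # {2: {'customer': 'Grace', 'total': 95}}
```

The key property to notice is that applying each event as it arrives keeps the target continuously in sync, instead of waiting for a nightly batch job.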
Change Data Capture serves a variety of purposes:
- Minimal effort data streaming triggered by transactional database changes.
- Real-time database replication to support data warehousing or cloud migration.
- Real-time analytics enablement as data is transferred from transactional to analytic environments with very low latency.
- Zero-downtime database migrations.
- Time travel log recording for debugging and audit purposes.
At this point, you can see how CDC works in theory. Now, let’s see how simple it is to set up a data pipeline with Arcion to see it in action.
Using Arcion, MySQL, and Snowflake for CDC
In this walkthrough, let’s imagine that we are working with a company with lots of retail data. They want to use this data to implement strategies for customers in real-time, such as product recommendations and other insights to boost sales. Unfortunately, their current setup does not facilitate this because their data is spread across multiple databases, and the platforms they are using do not support data analytics very well.
Because they need a common place to discover, manage, and share data on the cloud, the company has decided to use Snowflake. Their main goal is to move data from a MySQL database to Snowflake. After this, they will add their other databases into the mix. We will focus on the first step of the initial MySQL database and get that data into Snowflake. Later, database connections will follow the same steps.
They want to unlock analytic capabilities with the least development effort possible, given they are still evaluating cloud offerings. Real-time database replication is a good fit for this use case.
We will load our MySQL database with the Online Retail II UCI data set data. It will work well for our purposes and easily allow us to use the raw data to create a one-to-one copy of the data into our data lake created in Snowflake. This would effectively create a bronze layer approach to our data lake.
MySQL will be used as the source. It is a widely used and easy-to-set-up relational database, so most readers should be able to follow along and adapt the steps to other databases.
Snowflake will be used as the target data warehouse due to its vast presence in the market. For example, almost half of the Fortune 500 use it (source: Snowflake Fast Facts 2022 Report).
I’m also going to use Arcion because it offers cloud-native deployment options and OLTP and data warehouse connectors support, resulting in a straightforward setup process.
For the readers who prefer a video walk-through, click the video below to follow along. Otherwise, skip this part and jump to the step-by-step walk through below the video.
First, we will get our MySQL database configured and loaded up with our example data.
1. Create the source database
2. Create the source table
3. Create a user for replication matters
4. Grant the user only the minimal required privileges
5. Allow external network access to MySQL (port 3306 by default)
This step depends on the infrastructure that hosts the MySQL server, and detailing it is out of scope for this blog post. If external network access is not allowed, please consider setting up Arcion’s Replicant agent in the MySQL network instead of using Arcion Cloud.
6. Load data into the source table
7. Set Binary Log format to ROW
You will also need to ensure that the MySQL instance’s binary logging format (binlog_format) is set to ROW to support CDC with Arcion. Depending on how and where the instance is deployed, you can do this in many ways. However, here is an example of how to do it when running MySQL on Amazon RDS.
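Taken together, the SQL portions of the steps above (1 through 4, plus the binlog check in step 7) might look something like the following script. All object names here (`retail_db`, `online_retail`, `arcion_user`) and the column definitions are placeholder assumptions based on the Online Retail II UCI data set; adjust them to your environment. Steps 5 and 6 (network access and data loading) depend on your infrastructure and tooling, so they are not shown.

```sql
-- 1. Create the source database (name is a placeholder)
CREATE DATABASE retail_db;

-- 2. Create the source table; columns approximate the Online Retail II UCI data set
CREATE TABLE retail_db.online_retail (
  invoice      VARCHAR(16),
  stock_code   VARCHAR(16),
  description  VARCHAR(128),
  quantity     INT,
  invoice_date DATETIME,
  price        DECIMAL(10, 2),
  customer_id  INT,
  country      VARCHAR(64)
);

-- 3. Create a user dedicated to replication
CREATE USER 'arcion_user'@'%' IDENTIFIED BY 'choose-a-strong-password';

-- 4. Grant only the minimal privileges CDC needs: read access on the
--    source schema, plus the privileges required to read the binary log
GRANT SELECT ON retail_db.* TO 'arcion_user'@'%';
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'arcion_user'@'%';

-- 7. Confirm the binary log format is ROW (on self-managed MySQL, set
--    binlog_format = ROW in my.cnf; on Amazon RDS, use a DB parameter group)
SHOW VARIABLES LIKE 'binlog_format';
```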
Next, we will move on to configuring our Snowflake instance.
1. Create the target database
2. Create the target schema
3. Create the target table
4. Create a role and a user for replication matters
5. Grant the role with the required privileges
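As with the MySQL side, the Snowflake steps above could be scripted roughly as follows. The names (`retail_db`, `retail_schema`, `arcion_role`, `arcion_user`) and the exact grants are illustrative assumptions, not a definitive list; consult your own security requirements when granting privileges.

```sql
-- 1. Create the target database (name is a placeholder)
CREATE DATABASE retail_db;

-- 2. Create the target schema
CREATE SCHEMA retail_db.retail_schema;

-- 3. Create the target table, mirroring the MySQL source table
CREATE TABLE retail_db.retail_schema.online_retail (
  invoice      VARCHAR(16),
  stock_code   VARCHAR(16),
  description  VARCHAR(128),
  quantity     INT,
  invoice_date TIMESTAMP,
  price        NUMBER(10, 2),
  customer_id  INT,
  country      VARCHAR(64)
);

-- 4. Create a role and a user for replication
CREATE ROLE arcion_role;
CREATE USER arcion_user PASSWORD = 'choose-a-strong-password' DEFAULT_ROLE = arcion_role;
GRANT ROLE arcion_role TO USER arcion_user;

-- 5. Grant the role the privileges needed on the target objects
GRANT USAGE ON DATABASE retail_db TO ROLE arcion_role;
GRANT USAGE ON SCHEMA retail_db.retail_schema TO ROLE arcion_role;
GRANT ALL ON TABLE retail_db.retail_schema.online_retail TO ROLE arcion_role;
```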
Arcion Cloud CDC setup
With our data source and target created, we will now log into Arcion Cloud to set up our replication pipeline to enable CDC. You can sign up and log into Arcion here.
Once logged into Arcion Cloud, we will land on the Replications screen. Here, we will click on the New Replication button in the middle of the screen.
Next, we will select our replication mode and write mode. A few options are available to suit your needs. For replication modes, Arcion supports:
- Snapshot (the initial load)
- Full (snapshot + CDC)
For write modes, Arcion supports several options, including the Truncating mode we will use here.
We will select the replication mode as Full and the write mode as Truncating for our purposes here. You will also see that I have named the replication “MySQL to Snowflake.”
Once the Name is populated and the Replication and Write Modes are selected, click Next at the bottom of the screen.
We are then brought to the Source screen. From here we will click the Create New button.
We then will select MySQL as our source.
And then, scroll to the bottom of the page and click Continue.
Now, we can add in our MySQL instance details. These details include:
- Connection Name
All other fields will default. For Username and Password, we will use the user created in the script we ran earlier against our MySQL instance.
Once the connection is saved, we will want to pull in the schema from the database. Arcion Cloud will prompt us to click the Sync Connector button on the next page. Click the button, and Arcion Cloud will connect to our MySQL instance and pull down the schema.
Once completed, the UI in Arcion Cloud will display the retrieved schema. Then we will click Continue in the bottom right corner of the screen to move to the next step.
We now have our data source correctly configured. This will be displayed on the next screen, as well as a Test Connection button. To ensure that everything is working correctly, we will click the Test Connection button.
The results should look like this once the test is finished running. Finally, you can click the Done button to exit.
With our test successful, we can now click Continue to Destination in the bottom right corner of the screen to move to the steps where we set up our destination.
On the Destination screen, we will click New Connection to start the setup of our Snowflake connector.
Then, select Snowflake as your Connection Type and click Continue.
On the next screen, input your connection details. These details include:
- Connection Name
All other fields will default. For username and password, we will use the user created in the script we ran earlier against our Snowflake instance.
On the next screen, we will sync the connector. Click Sync Connector and wait for the process to complete.
Once complete, you will see the schema loaded onto the screen. We can then click Continue in the bottom right corner of the screen.
Our last step in configuring the connection with Snowflake is to test the connection. We will click the Test Connection button and wait for the results to return to Arcion Cloud.
You should see that all tests have passed to ensure Arcion has access to everything required to create the connection.
> Note: if Host Port Reachable does not pass, ensure that you have not included “https://” on the URL for your Snowflake connection. This can cause that check to error out.
Now, we can click Continue to Filter to begin the Filter configuration for our pipeline.
On the Filters screen, we will check the Select All checkbox to replicate all of our tables and columns from the source to the destination.
You can also click on the Map Tables and Per Table Configs (Applier Configuration Docs, Extractor Configuration Docs) buttons to add further configuration. For our purposes, we will leave these as their default values. After this, you will click Start Replication.
The replication will then begin.
Once the initial data is loaded, the pipeline will continue to run, monitor for changes, and apply those changes to the destination. The idle pipeline will still show RUNNING in the top right of the screen but will show a row replication rate of 0 until new data is written to the source. You’ll also notice that the Phase description of the pipeline will now show Change Data Capture instead of Loading Snapshot Data.
If we start adding data to the MySQL instance (for example, by rerunning our load script), we will see that Arcion detects this and then syncs that data over to Snowflake in real-time.
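One simple way to see this for yourself is to insert a single row into the source and then look for it in the target. The statements below are a sketch using the placeholder object names assumed earlier in this post; substitute your own database, schema, and table names.

```sql
-- Run against MySQL: insert a new row into the source table
INSERT INTO retail_db.online_retail
  (invoice, stock_code, description, quantity, invoice_date, price, customer_id, country)
VALUES
  ('581587', '22138', 'BAKING SET 9 PIECE RETROSPOT', 3, NOW(), 4.95, 12680, 'France');

-- Run against Snowflake: moments later, the same row should appear in the target
SELECT *
FROM retail_db.retail_schema.online_retail
WHERE invoice = '581587';
```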
As mentioned in the opening, our next steps would be to connect our other data sources in Arcion, if required. In minutes, we have set up a CDC-enabled data pipeline that can power all of the use cases defined by the business that need real-time data analytics. Our initial data from MySQL has been synced over to Snowflake and future data will be moved over to Snowflake in real-time as it is created, updated, or deleted from the MySQL data source.
The nature of this real-time data movement into Snowflake can power many use cases that require instant access to data in sync with one or multiple data sources or primary databases. For retail enterprises, near-instant inventory and supply chain management, better customer experience, and product recommendations can now be powered by this pipeline and the automatically synced Snowflake data. This new functionality is unlocked in a matter of a few clicks.
Arcion Cloud allows us to set up these pipelines in a matter of minutes, with minimal configuration and minimal support and maintenance once the pipeline is running. To get started today, simply sign up for an Arcion Cloud account and activate your free 14-day trial to see how simple it is to create and use CDC-enabled data pipelines.