Today’s modern businesses leverage the latest technologies within their stack. Building applications with cutting-edge tools and platforms and delivering great customer experiences that are responsive and intuitive. The businesses that take it even further are those that are coupling great customer experiences and growing adoption with Big Data and AI platforms. The use of these platforms unlocks a massive amount of business intelligence and real-time analytics that help businesses and their customers to thrive.
One of the platforms often chosen for business intelligence and real-time analytics is Databricks. The Databricks platform offers multiple solutions including data science and engineering, machine learning, and SQL data storage. We will showcase Databricks SQL in the example below. Databricks SQL is a serverless data warehouse on the Databricks Lakehouse Platform that lets you run all your SQL and BI applications at scale with great performance, a unified governance model, open formats and APIs, and can integrate with many different tools.
In this blog post, I’ll cover the most important building blocks and advantages of using CDC to process data in real-time. These topics will hopefully offer insights into how to enable data-driven decision-making in industries like retail, media & entertainment, and finance. Leveraging real-time analytics can help with:
- Behavior and purchase analysis to enable better-targeted offerings and recommendations on the fly, providing customers with a more personalized experience.
- allowing sales teams to focus on the most efficient channels instead of spending time on less performant ones.
- Expenditure patterns analysis to allow financial institutions to detect fraud before it happens, effectively preventing losses.
Is your company still lacking real-time capabilities or reluctant to adopt the latest technologies? At most companies, legacy approaches to data migration and replication for analytics are still the norm and preferred solution. Many companies still process data in batch jobs, which can mean minutes, hours, or even days worth of delay from the time when the data was created or retrieved and when it gets to the database or platform where it will be analyzed. By the time the data reaches any type of processing, the latency caused by these legacy approaches has already severely depleted the viability of the data. A more modern solution is required in order to forego the issues with these legacy approaches. Change Data Capture, usually abbreviated to “CDC”, for instance, brings a modern and painless approach to moving data around, especially from transactional databases to data lakes. CDC is one of the easiest and most popular ways to create real-time data pipelines. Just how easy? In a matter of minutes, with no custom code, we can set up a pipeline. Read on to find out how!
What is Change Data Capture?
Change Data Capture is an approach to data integration that is based on the identifying and capturing changes made to a data source. Once the changes are identified, the CDC process or pipeline delivers those changes to a destination such as another database or data lake through replication. CDC pipelines address the challenges related to moving data safely, reliably, quickly, and consistently around the enterprise.
An advantage to using Change Data Capture to replicate data is the minimal impact it has on the source database while monitoring for changes in the data. This impact is especially true for pipelines that rely on log-based CDC. For solutions like Arcion that implement agentless CDC, the impact is even more minimal than traditional agent-based CDC. Agentless CDC removes the inherent performance issues that come with agent-based CDC, the potential security concerns, and administrative burdens associated with the installation of proprietary software on database servers. It decreases the amount of support effort needed to maintain data pipelines.
Change Data Capture serves a variety of purposes:
- Minimal effort data streaming triggered by transactional database changes.
- Real-time database replication to support data warehousing or cloud migration.
- Real-time analytics enablement as data is transferred from transactional to analytic environments with really low latency.
- Enable database migration with zero downtime.
- Time travel log recording for debugging and audit purposes.
As CDC has become more popular, many Change Data Capture solutions have popped up including many open-source and enterprise-grade solutions. Debezium is probably the most popular open-source solution, frequently used with Apache Kafka to enable event streaming. HVR has been available for over a decade but, after its acquisition to Fivetran, has become less visible in the ecosystem. Users are still awaiting for HVR’s functionalities to be merged into Fivetran’s offering for database CDC. Although it can be deployed in the leading cloud providers, it requires a fair amount of configuration to work properly. Not necessarily the most “cloud-native” solution on the market.
On the other hand, Arcion and Striim are newer technologies that have cloud and self-hosted deployment models. These solutions are truly cloud-native and make implementing CDC on the cloud extremely simple.
To show you how to create an end-to-end CDC solution with MySQL and Databricks, we will use the Arcion Cloud to demonstrate the entire process. If you require an on-premise solution, you can also follow the same steps in Arcion Self-hosted. Let’s get started!
A hands-on guide to Databricks Delta CDC using Arcion
In the example below, we will be implementing a CDC solution with Databricks for a retail company that wants to begin leveraging their invoice data to make more informed business decisions. They want to incorporate a data analytics solution into their current architecture. The issue they are facing is that their current on-premise servers are not able to support the additional workloads that would be introduced with an on-premise analytics solution. They have now decided that they will use a cloud-based solution to support their analytics use case and have decided on Databricks as the platform of choice. They want to unlock the necessary analytics capabilities with the least amount of development effort possible since they are still evaluating the viability of the proposed solution. For their scenario, using a CDC pipeline that supports real-time database replication is a good fit.
We will require some sample data that includes some retail invoices to demonstrate how the solution will work. For this, we will use one of the many sample retail datasets freely available on Kaggle. I’m going to use Online Retail II UCI as it will work well for our purposes and easily allow us to use the raw data to demonstrate how our CDC pipeline will work. We will use this data to create a one-to-one copy of the data from our database into our data lake, created in Databricks. Since the data will be going from source to destination, the data lake, in its raw form we are effectively creating a bronze layer approach to our data lake.
MySQL will be used as the source database. It is widely used and it should be easy to follow the process below. Depending on what database you use as a source, you may also be able to replicate the steps with other databases, such as Oracle or SQL Server (visit Arcion Docs for more info).
Databricks will be used as the target data warehouse due to its huge popularity and great performance, especially when coupled with Arcion. If your organization uses Snowflake as the data warehouse solution, read this MySQL to Snowflake blog for the step-by-step guide.
Our first step is to set up our source database. On a running instance of MySQL, we will run the following commands to create the database, user for our replication needs, and load the database with raw data. The steps below go over each facet.
- Create the source database
- Create the source table
- Create a user for replication matters
- Grant the user only the minimal required privileges
- Allow external network access to MySQL (port 3306 by default)
This step depends on the infrastructure that hosts the MySQL server and is detailing it is out of scope oth the present blog post. If external network access is not allowed for any reason, please consider setting up Arcion’s Replicant agent in the MySQL network instead of using Arcion Cloud.
- Load data into the source table
- Set Binary Log format to ROW
You will also need to ensure that the MySQL instances Binary Logging format (binlog_format) is set to ROW in order to support CDC with Arcion. This can be done many ways depending on how and where the instance is deployed. Here is an example of how to do it when running MySQL on Amazon RDS.
Our next step is to set up our Databricks instance so that we can start moving our MySQL data into the big data platform. Below we will cover exactly how to configure Databricks and generate the necessary connection details and token to connect it with our CDC pipeline.
- Create the SQL Warehouse
Once you’ve logged into your Databricks account, you will then click the Persona Switcher and select SQL.
Once you are on the SQL persona, you will then select the SQL Warehouses menu item from the side menu. Once on the SQL Warehouses screen, you will click on the Create SQL Warehouse button to create a new SQL warehouse on your Databricks instance.
A modal will then appear to help you create the new SQL Warehouse. Fill in the Name field and click Create. Optionally, you can also select your Cluster size from the dropdown. Since we have a minimal amount of data to store, I have selected the smallest cluster size which is a 2X-small instance.
After this, your new SQL Warehouse will be created on Databricks. Next, you’ll be brough to a screen which will show you the details of your instance and whether it is running. You can also get to this screen by clicking on your SQL Warehouse name in the list back on the SQL Warehouses main screen.
- Retrieve the Connection Details
At the top of the screen that shows the overview for our Arcion CDC SQL Warehouse, click on Connection details to grab a few details we will need to set up our CDC pipeline in Arcion. The details we will need include the Server hostname and the JDBC URL. Save these details for use later (or just remember where you can retrieve them when needed in the Arcion setup).
- Generate a new Personal Access Token
Lastly, we will need to generate an access token so that Arcion can connect to the Databricks SQL Warehouse we created. To do this, click on the Settings menu item in the sidebar and then select User Settings.
Once you are on the User Settings screen, under the Personal Access Tokens tab, click on Generate new token.
A modal will then appear where you can add a comment to identify the access token you are generating. For this example, I’ve put “Arcion CDC” in the Comment field. Then, you’ll click the Generate button in the bottom right of the modal.
Next, you’ll see the generated token appear in the modal. As it says, copy the access token now and save it. You won’t be able to retrieve it later. If you misplaced the token, you can just generate a new one later. Once you’ve copied the token, click Okay to dismiss the modal.
Now we have created and generated all of the necessary things on Databricks to create our pipeline with Arcion. Let’s actually generate our CDC pipeline now!
Arcion Cloud CDC setup
With our data source and target created, we will now log into Arcion Cloud to set up our replication pipeline to enable CDC. You can sign up for Arcion Cloud here or speak to our database expert team about Arcion Self-Hosted here.
Once logged into Arcion Cloud or started an instance of Arcion Self-Hosted, you will land on the Replications screen. Here, we will click on the New Replication button in the middle of the screen.
Next, we will select our replication mode and write mode. A few options are available to suit your needs. For replication modes, Arcion supports:
- Snapshot (the initial load)
- Full (snapshot + CDC)
For write modes, Arcion supports:
For our purposes here, we will select the replication mode as Full and the write mode as Truncating. You will also see that I have named the replication “MySQL to Databricks”. Once the Name is populated and the Replication and Write Modes are selected, click Next at the bottom of the screen.
We are then brought to the Source screen. From here we will click the Create New button.
We then will select MySQL as our source.
And then scroll to the bottom of the page and click Continue.
Now, we can add in our MySQL instance details. These details include:
- Connection Name
All other fields will be defaulted. For username and password we will use the users created in the script we ran earlier against our MySQL instance.
Once the connection is saved, we will want to pull in the schema from the database. On the next page, we will be prompted to click the Sync Connector button. Click the button and Arcion Cloud will connect to our MySQL instance and pull down the schema.
Once completed, the UI in Arcion Cloud will display the retrieved schema. Then we will click Continue in the bottom right corner of the screen to move to the next step.
We now have our data source correctly configured. This will be displayed on the next screen as well as a Test Connection button. To ensure that everything is working correctly, we will click the Test Connection button.
The results should look like this once the test is done running. You can click the Done button to exit.
With our test successful, we can now click Continue to Destination in the bottom right corner of the screen to move to the steps where we set up our destination.
On the Destination screen, we will click New Connection to start the set up of our Databricks connector.
Then, select Databricks as your Connection Type and click Continue.
On the next screen, input your connection details. These details include:
- Connection Name
- JDBC URL
All other fields will be defaulted.
For the JDBC URL and Host fields, we will fill in the details with the values we generated in Step 2 of the Databricks setup above. For Username and Password fields, the Username field will just have the string “token” entered into it. The Password field will use the Personal Access Token we generated in Databricks in Step 3 of the Databricks setup above.
We will then click Save Connection at the bottom of the screen.
After the connection is saved, we will then click on Test Connection to ensure that the connection is working as intended.
The Connection Test modal will then display. If the connection is set up correctly, you will see all green checks on all the tests performed. Once complete, click Done.
Our last step on this screen is to click on the Continue button.
On the next screen, we will sync the connector. Click Sync Connector and wait for the process to complete.
Once complete, you will see the schema loaded onto the screen. We can then click Continue in the bottom right corner of the screen.
On the next screen we will make sure to our new Databricks connection as the Destination. Once selected, click Continue to Filter.
On the Filters screen, we will check the Select All checkbox so that all of our tables and columns will be replicated over from the source to the destination.
Optionally you can also click on the Map Tables and Per Table Configs (Applier Configuration Docs, Extractor Configuration Docs) buttons to add further configuration. For our purposes, we will leave these as their default values. After this, you will click Start Replication.
The replication will then begin.
Once the initial data is loaded, the pipeline will continue to run, monitor for changes, and apply those changes to the destination. The idle pipeline will still show RUNNING in the top right of the screen but will show a row replication rate of 0 until new data is written to the source. You’ll also notice that the Phase description of the pipeline will now show Change Data Capture instead of Loading Snapshot Data.
If we start adding data to the MySQL instance (for example, by running our load script again) we will see that Arcion detects this and will then sync that data over to Databricks in real-time. This can be seen in the spike in INSERT operations in the previous screenshot from updating some records in the MySQL instance.
With that, we have successfully set up a CDC-enabled data pipeline with Arcion. Our initial data from MySQL has been synced over to Databricks and future data will be moved over to Databricks in real-time.
The nature of this real-time data movement into Databricks can power many use cases which require instant access to data that is in sync with one or multiple data sources or primary databases. For retail enterprises, near-instant inventory and supply chain management, better customer experience, and product recommendations can now be powered by this pipeline and the data that is instantly synced over to Databricks. This new functionality is unlocked in a matter of a few clicks.
Arcion Cloud allows us to set up these pipelines in a matter of minutes, with minimal configuration, and minimal support and maintenance once the pipeline is running. To get started today, simply sign up for an Arcion Cloud account or book a personalized demo to see Arcion Self-Hosted in action.