The ability to migrate and replicate data across various database sources into data analytics tools is essential for today’s businesses. As many organizations have discovered, a lot of information can be derived when all your data is located in a single and unified source of truth. As a result of this, competition has taken off among organizations to find solutions for building production-ready data pipelines and orchestrating workflows that can run at an ever-increasing scale. These pipelines are expected to deliver results in hours, minutes, or even in real time.
In this article, we are going to walk you through how to create a connection between the popular and open-source database management system MySQL and the enterprise-grade analytics and big data platform Databricks. This write-up will also show that when a connection is established correctly, users can use Databricks to perform data engineering and data science tasks with real-time data from MySQL. First, let’s take a look at both MySQL and Databricks to further your understanding of each platform before creating a pipeline between the two.
Introduction to MySQL
MySQL is one of the most popular open-source Relational Database Management Systems (RDMS) in existence. The platform runs on a client-server model and has numerous features that are ideal for both small and large applications. Since MySQL is open-source, anyone can use it by downloading the software and modifying the source code to meet their business needs.
MySQL’s popularity is driven by its high performance, reliability, and the fact that it is very easy to use and deploy. Due to its ease of use, it became a staple in one of the most popular web app stacks used across the internet: the LAMP stack. It is compatible with many computing platforms such as Microsoft Windows, macOS, Linux, and Ubuntu and supports connectivity through languages like PHP, Python, Java/JDBC, C++, C, ODBC, Perl, and Ruby. It also has support for open-source applications like Drupal, Joomla, Magento, and other popular CMS platforms with the most popular CMS use case being WordPress. Battle-tested as one of the most popular database solutions for many of the world’s largest web applications, it has become a go-to for big organizations such as Facebook, Twitter, Netflix, Uber, and others with large volumes of data and users accessing their applications.
MySQL also has variations such as MySQL Heatwave, a fully managed service that enables customers to run OLTP, OLAP, and machine learning workloads directly from a MySQL Database. Also available is MySQL Enterprise Edition with its complete set of advanced features, management tools, and technical support. Others include, MySQL for OEM/ISV and MySQL Cluster CGE used for web, cloud, and communications services.
Introduction to Databricks
Databricks is an enterprise big data platform, and much more, that was developed by the creators of Apache Spark. The platform provides a set of tools for unifying the process of building, deploying, sharing, scaling, and maintaining enterprise-grade data solutions. The tools available on Databricks can help with data engineering workflows, machine learning modeling, and tracking, data discovery/annotation exploration, creating dashboards and visualizations, managing security and governance, and lots more. These features are all easily made available since Databricks is cloud-native and is easily deployed on top of distributed Cloud computing environments like Microsoft Azure, AWS, or Google Cloud.
Databricks simplifies the process of collating large amounts of data either in a Data Warehouse or Data Lakes by incorporating its Lakehouse architecture. The Lakehouse architecture offers the capabilities and advantages of both a data warehouse and a data lake, in a combined fashion. This approach helps to easily eliminate unwanted data silos and provides data teams with a unified single source of data.
Databricks also offers several open-sourced projects such as Delta Lake, MLflow, Redash, Delta Sharing, Apache Spark, and Structured Streaming. Each of these technologies can be easily integrated into a Databricks instance. Databricks has given users the power to add and manage updates of these open-source integrations in the Databricks Runtime releases. You can also expand, integrate, and optimize performance using some of Databricks’ proprietary tools like Workflows, Photon, Delta Live Tables, Unity Catalog, and Databricks SQL. Another plus-side for users is that Databricks also makes it easy for you to interact with it programmatically by using various tools and technologies exposed through their REST APIs, CLI, and Terraform.
Methods to Load Data from MySQL to Databricks
Now that we are familiar with both platforms, it’s time to look at how to get the two connected to establish a robust data pipeline. In this section, we are going to look at two distinct methods by which we can load data from MySQL into Databricks. We will look at one method which allows for a massive amount of automation and scalability and a second which is more manual and less easy to implement at scale. The first, more automated method discussed will be using Arcion to create a CDC (Change Data Capture) pipeline. The second part will look at manually transferring data from MySL to Databricks.
Method 1: MySQL to Databricks Using Arcion
Arcion is one of the foremost real-time, in-memory Change Data Capture (CDC) solutions that offer users massive scalability and data consistency at all times. It is suitable for both migrating and replicating data across various enterprise databases and data warehouses. In this section, we will look at how to connect MySQL to Databricks using the Arcion CLI tool, Replicant.
Although Databricks can be hosted on various cloud environments, like Microsoft Azure, AWS, and Google Cloud, the steps in this write-up are specifically for an Azure Databricks instance. Many of the steps will be applicable across the other providers although slight tweaks may be needed. With Azure Databricks as a data target for our Arcion pipeline, you will need to have a Databricks workspace on Azure and an Azure container in ADLS Gen2.
With the prerequisites out of the way, let’s look at the steps involved in creating the connection between MySQL and Databricks using Arcion. The first few steps will show how to connect your MySQL instance as a data source in Arcion and subsequent steps will show you how to connect your Databricks instance to Arcion as a destination.
Step 1: Download And Install Arcion Self-hosted
The first thing we will need is to configure Arcion. For this, we will need to download and install the self-hosted version of Arcion. To gain access, you will be required to follow the steps outlined on the Arcion Self-hosted webpage. Once you have downloaded Arcion’s CLI tool Replicant, you will need to create a home directory for it. By default, the home directory will be where Replicant was downloaded. This home directory, from now on, will just be referred to as $REPLICANT_HOME in the examples below. The last step is adding your credentials to your Arcion instance, which is covered in the quick start guide that you can reference for more details on how to do this.
Having downloaded and installed Arcion Self-hosted, we can move on to the next steps where we will configure and enable Arcion to connect with MySQL and Databricks.
Step 2: Install mysqlbinlog Utility on the MySQL Server
Next, we will need to Install the mysqlbinlog utility on the machine where Replicant will be running. If you have the MySQL Client already installed, chances are you will already have the correct mysqlbinlog installed as well. You can test that by using the following command:
This command will display the version information of the mysqlbinlog utility if it is installed correctly. Ensure that the mysqlbinlog installed is compatible with the source MySQL version.
As an example, to install the mysqlbinlog utility on Linux through the command-line interface (CLI), you can follow these steps:
1. Open a terminal or shell session on your Linux machine.
2. Update the package manager's repository information by running the following command:
3. Once the repository information is updated, you can install the mysqlbinlog utility by running the following command:
This command will install the MySQL client package, which includes the mysqlbinlog utility.
4. During the installation process, you may be prompted to enter your password. Provide the password for the sudo command to continue with the installation.
5. After the installation is complete, you can verify that the `mysqlbinlog` utility is installed by running the following command:
Step 3: Enable Binary Logging in the MySQL Instance
For Arcion to accurately track the changes in the source MySQL instance, you’ll need to enable binary logging. To do this, carry out the following steps:
1. Open the MySQL option file var/lib/my.cnf. The my.cnf file, also known as the MySQL configuration file, is usually located in the /etc/mysql or /etc directory on Linux systems. However, the exact location may vary depending on the Linux distribution and how MySQL was installed. Once the file is found and opened, add the following lines below into the file to specify the base name to use for the binary log files and to set the binary logging format.
2. Then, use the command below to export the $MYSQL_HOME path.
3. To enable binary logging, we will need to restart the MySQL instance. Restart MySQL using the command below:
4. Lastly, confirm if binlogging is turned on by logging into the MySQL instance and running the following statements:
Step 4: Setting Up MySQL User for Arcion
For Arcion and Replicant to work correctly, it is best to create a user for Arcion to use to log into and access the MySQL instance. Below are a few steps and examples of SQL statements that show exactly how to do this. You’ll want to replace variables like 'username' and 'password' with your values.
1. Create a MySQL user.
2. Grant privileges on all the tables that would be involved in the replication.
3. Grant the following Replication privileges for the newly created user.
4. Lastly, log into the MySQL instance using the new user and confirm whether the newly created user can access the bin logs.
Step 5: Setting Up The Connection Configuration For MySQL
With our MySQL instance configured and the necessary user(s) created, we can start to plug the connection details into Arcion. First, go to the sample connection configuration file found on $REPLICANT_HOME and open it up for editing.
In the connection configuration file, we will need to add our values so Arcion can connect to our MySQL instance. If your connection credentials are stored on AWS Secrets Manager, you can retrieve them using Replicant; if not, you will have to input them into the configuration file as seen in the example below.
From the example above, you will need to supply values for the following fields:
HOST: This refers to the hostname of the MySQL instance.
PORT: This is the port number of the MySQL instance.
USERNAME: This refers to the username of the user that connects to your MySQL instance.
PASSWORD: This is the password associated with the user.
MAX_NUMBER_OF_CONNECTIONS: This is used to specify the maximum number of connections Replicant can open in MySQL.
Step 6: Setting Up The Filter Configuration
The next step in configuring Arcion is to set up the Filter configuration file. This file will let Arcion know which data collections, tables, or files to replicate to best suit your replication strategy.
1. Go to the sample Filter configuration file found on $REPLICANT_HOME.
2. Within the file, specify the data to be replicated based on your needs. Below is a template of the format to be used within the file.
Step 7: Setting Up The Extractor Configuration
For real-time replication, a heartbeat table must be created in the source MySQL. To create a heartbeat table in your MySQL instance, do the following steps.
1. In the catalog/schema you are going to replicate, create a heartbeat table with the DDL as shown below. Remember to replace the <user_database> with the name of your specific database.
2. Grant the user configured for replication INSERT, UPDATE, and DELETE privileges for the heartbeat table.
3. Next, you’ll need to actually configure the Extractor configuration file. From $REPLICANT_HOME, open the Extractor configuration file.
4. In the Extractor configuration file, make the changes below in the realtime section to set up the heartbeat configuration.
With that final step, our MySQL instance is now configured to work with Arcion. Next, we need to set up the Databricks target.
Step 8: Create a Databricks Cluster
Below is a description of how to create both a Databricks all-purpose cluster and a SQL Warehouse. Either of these storage solutions can be used with Arcion. After the steps for configuring the cluster, there are a few details you’ll need to jot down to plug in the new instance to your connection configuration file in Arcion.
Setting Up an All-Purpose Cluster
To set up a Databricks all-purpose cluster, do the following steps:
- Log in to your Databricks Workspace.
- Go to Data Science & Engineering > Compute > Create Compute in the Databricks console.
- Enter the name you want to give the cluster.
- Select the latest Databricks runtime version.
- Set up an external stage.
- Click Create Cluster.
Getting The Connection Details for The All-Purpose Cluster
Having completed the cluster setup, you need to get the connection details for the cluster to be able to establish a connection between the Databricks instance and Arcion. This is provided to Replicant via the target connection configuration file. The necessary details can be retrieved for the newly created all-purpose cluster through the following steps:
- Click on the Advanced Options toggle.
- Click on the JDBC/ODBC tab and write down the Server Hostname, Port, and JDBC URL values.
Setting Up a SQL Warehouse
To set up a Databricks SQL warehouse (SQL Compute), do the following steps:
- Log in to your Databricks Workspace.
- Go to SQL > Review SQL warehouses > Create SQL Warehouse on the Databricks console.
- Enter a name you want to give the SQL warehouse.
- Select a cluster size.
- Set up an external stage.
- Click Create.
Getting The Connection Details Tor The SQL Warehouse
Having completed the setup, you need to get the connection details for the SQL warehouse to be able to establish a connection between the Databricks instance and Arcion. This info is provided to Replicant using the target connection configuration file. The necessary data can be retrieved for the new SQL warehouse instance through the following steps:
- Go to the Connection Details tab.
- Write down the Server Hostname, Port, and JDBC URL values.
Step 9: Create a Personal Access Token For The Databricks Cluster
A personal access token is required to configure replication capabilities through Replicant. To create a personal access token, visit the Generate a personal access token page of the Databricks docs for the exact steps required. You’ll need the token details in a later step so keep them handy.
Step 10: Configure The ADLS Container For Staging
You’ll now need to grant your Databricks instance access to your Azure Data Lake Storage (ADLS). Any of the following methods listed below can be used in granting Databricks access to ADLS.
- Access Azure Data Lake Storage Gen2 or Blob Storage using OAuth 2.0 with an Azure service principal.
- Access Azure Data Lake Storage Gen2 or Blob Storage using a SAS token.
- Access Azure Data Lake Storage Gen2 or Blob Storage using the account key.
You can also use Spark configuration properties to access data in an Azure storage account if you do not want to use Python resources as shown in the resources above.
Below are the steps required in using the Spark configuration for both cluster and SQL warehouse.
Spark Configuration For a Cluster
- Click the Advanced Options toggle found on the cluster configuration page.
- Click on the Spark tab.
- Enter your configuration properties in the Spark Config textbox.
Spark Configuration For A SQL Warehouse
- Click your username in the top bar of Databricks workspace and select Admin Console from the dropdown menu.
- Select the SQL Warehouse Settings tab.
- Enter your configuration properties in the Data Access Configuration textbox.
The example below shows how to access data in an Azure storage account using the storage account key by entering your Spark configuration. In the example, <STORAGE_ACCOUNT> is the name of your Azure storage account and <STORAGE_ACCOUNT_KEY> is your storage account key.
Step 11: Obtain the JDBC Driver for Databricks
Next, you’ll need to obtain the correct JDBC driver for Replicant to be able to connect to your Databricks instance. You can obtain the appropriate JDBC driver for Legacy Databricks and Databricks Unity Catalog through the steps detailed below.
For Legacy Databricks
- Download the JDBC 4.2-compatible Databricks JDBC Driver ZIP.
- From the downloaded ZIP, locate and extract the SparkJDBC42.jar file.
- Put the SparkJDBC42.jar file inside the $REPLICANT_HOME/lib directory.
For Databricks Unity Catalog
- Go to the Databricks JDBC Driver download page and download the driver.
- From the downloaded ZIP, locate and extract the DatabricksJDBC42.jar file.
- Put the DatabricksJDBC42.jar file inside the $REPLICANT_HOME/lib directory.
Step 12: Configure The Replicant Connection For Databricks
At this point, we can begin to plug in our connection details for our new Databricks instance. This part of the process is simply providing the Databricks connection details we jotted down from the previous steps to Arcion. Adding those details can be done with the following steps:
1. Go to the Databricks sample connection configuration file found within $REPLICANT_HOME.
2. The connection configuration file obtained has two parts:
~a. Parameters related to target Databricks connection.
~b. Parameters related to stage configuration.
Parameters Related to The Target Databricks Connection
If your connection credentials are stored on AWS Secrets Manager, you can retrieve them using Replicant; if not, you will have to input them manually as seen in the example connection configuration file below.
From the sample, we have the following fields which need values:
url: This is where you replace JDBC_URL with the JDBC you retrieved earlier when creating a cluster or SQL warehouse.
username: This is the username that connects you to your Databricks server.
password: This is the password registered with the username
host: This is where you input the HOSTNAME of your Databricks that was created earlier when creating a cluster or SQL warehouse.
port: This is the PORT_NUMBER of the Databricks that was created earlier creating a cluster or SQL warehouse.
max-connections: This specifies the maximum number of connections Replicant can open in Databricks. It can be altered to meet your needs.
Parameters Related to The Stage Configuration
The stage configuration requires you to use an external stage to hold the data files. Then, you’ll load the data onto the target database from there. The example below is a stage configuration for Azure Databricks using a storage account key and SAS token for authentication.
Using a Storage Account Key
Using a SAS Token
From the example, we have the following fields which need to be populated with a value:
type: This refers to the stage type. For Azure Legacy Databricks, set the type to AZURE.
root-dir: This is the directory under the ADLS container, this directory is used by Replicant to stage bulk-load files.
conn-url: This is the name of the ADLS container.
account-name: This is the name of the ADLS storage account, corresponding to the same storage account in the Configure ADLS container as the stage section.
secret-key: If you want to authenticate ADLS using a storage account key, specify your storage account key here.
sas-token: If you’re using a shared access signature (SAS) token to authenticate ADLS, specify the SAS token here.
Step 13: Configuring The Mapper File (Optional)
This is an optional step that aids in defining data mapping from your source to Azure Databricks in the mapper file. You can know more about this from Mapper configuration and Mapper configuration in Databricks.
Step 14: Setting Up The Applier Configuration
Lastly, we need to set up our Applier configuration to specify how data should be applied to our Databricks target. To do this, follow the steps outlined below.
1. Go to the applier configuration file found in $REPLICANT_HOME.
2. The configuration file opened has two parts:
~a. Parameters related to snapshot mode.
~b. Parameters related to realtime mode.
Parameters Related to Snapshot Mode
For the snapshot mode configuration, apply the following changes to the Applier configuration file:
Additional parameters that can be set for snapshot mode are also shown below:
Parameters Related to Realtime Mode
The realtime section in the Applier file is used to specify the configuration for real-time replication, which runs after snapshot mode ends in the case of a CDC setup. To get this to work, we have to enable Type-2 CDC as this allows a Target to have a history of all transactions performed in the Source. This allows transactions such as an INSERT in the Source is an INSERT in the Target, or an UPDATE in the Source is an INSERT in the Target with additional metadata for each transaction such as Operation Performed and Time of Operation. The primary requirement for Type-2 CDC is to enable full-row logging on the source, which we enabled earlier by turning on binary logging within the MySQL instance.
To enable Type-2 CDC for the Databricks target, the following steps should be carried out:
- Under the realtime section of the Databricks Applier configuration file, add the parameters shown below.
- Add the following parameter under the snapshot section in the Extractor configuration file.
With that, the last step is to run Replicant to begin the CDC process. To run Replicant, check out the Running Replicant docs page which outlines all of the necessary commands.
Advantages of Using Arcion
The following are the benefits of using Arcion to load data from MySQL to Databricks:
- Sub-second Latency From Distributed & Highly Scalable Architecture: Arcion is the world’s only CDC solution with an underlying end-to-end multi-threaded architecture, which supports auto-scaling both vertically and horizontally. Its patent-pending technology parallelizes every single Arcion CDC process for maximum throughput. So users get ultra-low latency and maximum throughput even as data volume grows.
- Arcion allows smooth schema management of your data thereby ensuring data integrity, reliability, and consistency. As the schema evolves on the MySQL instance, these changes will be applied to the Databricks instance automatically to keep both in sync.
- Effortless Setup & Maintenance: Arcion's no-code platform removes DevOps dependencies; you do not need to incorporate Kafka, Spark Streaming, Kinesis, or other streaming tools. So you can simplify the data architecture, saving both time and costs.
- 1:1 dedicated support from POC to production: Arcion team offers live support for all self-hosted users around the clock through email, dedicated Slack channel, and calls. If users get stuck, our team of database replication experts is available to help get things on track to make your database replication and CDC efforts a success.
Method 2: MySQL to Databricks Manually
focusing on a method that relies on built-in tools in both platforms. This method uses the JDBC (Java Database Connectivity) driver for MySQL and the MySQL connector in Databricks Runtime. JDBC is an API for the Java programming language that defines how a client may access a database, enabling connections to virtually any relational database.
Option 1: Using JDBC
First, we will use the JDBC driver to connect to MySQL. The connection can be made in Python or Scala, two languages popular in data science due to their strong capabilities in data analysis and machine learning. The code snippets below show how to establish this connection in both languages. Databricks supports connecting to external databases using JDBC. The basic syntax provided below queries MySQL using its JDBC driver and using the MySQL connector in Databricks Runtime. It shows how to create the configuration and how to use the connection with examples in Python, SQL, and Scala. For more details on reading, writing, configuring parallelism, and query pushdown, visit Query databases using JDBC.
In the code snippets below, you’ll need to replace the placeholder fields ("<...>") with your specific database information. Be sure to replace <database-host-url>, <database-name>, <table-name>, <username>, and <password> with your actual MySQL database's host URL, the name of the database you want to connect to, the specific table within the database you're interested in, and your MySQL username and password, respectively.
You can query MySQL using its JDBC driver either in Python or Scala as shown below:
Option 2: Using The MySQL Connector in The Databricks Runtime
The second option is to establish a connection from Databricks to MySQL Server using DataFrames. DataFrames are a distributed collection of data, organized into named columns. They are conceptually equivalent to a table in a relational database but with more optimized operations. The code below is used to read data from a table in a MySQL database into a DataFrame in Apache Spark.
In Databricks Runtime 11.3 LTS and above, the mysql keyword can be used to include the driver for connecting to SQL Server. The Python and Scala examples below show how to establish a connection. To use these examples, you’ll need to run this code within a Databricks Notebook and replace the dbtable, host, port, database, user, and password variables with your actual MySQL instance details. Setting the port variable is optional and if not provided will default to 3306, the default port for MySQL.
Disadvantages of Manually Loading Data
Based on the above methods for manually loading data, there are a few disadvantages to going this route. Some of the disadvantages of manually loading data from MySQL to Databricks include:
- These tactics make it difficult to transfer data in real time. To do this, you will be required to create custom code that will move these changes between the platforms. Not doing this right could lead to potential problems with data accuracy and consistency.
- Since it is difficult to transfer data continuously in real time, this approach is really only suitable for a one-time bulk transfer of data. Replication use cases are hard to implement with this approach but one-time data migrations can be handled with it.
- As you scale this solution, you will likely be required to maintain and add to the custom scripts. This can be very time-consuming and could lead to errors and inconsistencies in the data.
- You may not have technical assistance when implementing this custom solution. There may be limited help and resources if a problem is encountered.
In this write-up, we reviewed how to connect MySQL to Databricks using Arcion for an automated and scalable approach and manually using Databricks native tools leveraging Python and Scala. We started by looking at what MySQL and Databricks are at a high level and then explored the two methods in more detail.
We looked at step-by-step instructions for both methods and covered the advantages and disadvantages of each. In the end, we highlighted the shortcomings of the second and more manual approach of the two and showed the preferred approach of using Arcion. For building robust pipelines to move data from MySQL to Databricks, Arcion is the most scalable and easy-to-use option. For more details on how to get started with Arcion, contact our team today.