The increased need for the migration and replication of data is the result of a constant demand for real-time data delivery. It has become the norm for organizations to try to make the most of data found on different platforms. Enterprises know that the more data they have, the more accurate the output from analytics engines and big data solutions becomes.
With this in mind, the constant evolution of data requirements has been felt across different sectors. These fast-paced changes have given rise to new methods for connecting various platforms as a single entity to meet enterprise goals. The need to continuously improve existing strategies is a constant battle that data scientists and their organizations face on a daily basis.
In this article, we will look at how to connect Microsoft SQL Server to Databricks. This combination of source and target platforms is extremely popular, but the methods for achieving this connectivity are vast. In the overview and walkthrough below, we will attempt to answer all the questions associated with getting your Microsoft SQL Server to Databricks connection up and running. Later in the article, we will focus on two easy methods of creating this integration: an automated approach and a manual one. First, let’s dig into the details of what Microsoft SQL Server and Databricks are and the features they offer.
Introduction to SQL Server
Microsoft SQL Server is a relational database management system developed by Microsoft. The system is designed to manage, store, and retrieve enterprise-scale amounts of data. Microsoft SQL Server offers credible data security and data integrity, and it is extremely reliable as a primary application database. SQL Server instances may run locally on the same computer as an application, remotely over a network, or in a distributed fashion. It also supports a variety of analytics, transactional, and business intelligence needs.
As the name implies, Microsoft SQL Server is built on the Structured Query Language (SQL), which is used to manage databases and query data. The platform mainly uses an extension of SQL known as Transact-SQL (T-SQL). T-SQL adds variables, stored procedures, and other helpful features that enhance the capabilities of standard SQL.
Microsoft SQL Server comes in several editions, including Enterprise, Standard, Express, and Developer. Which edition a project requires depends largely on the needs of the user, and each edition has features and use cases it is better suited to. For example, the Enterprise edition is mostly used by large organizations with huge workloads, since they need access to all the features SQL Server offers, such as advanced analytics and machine learning capabilities. In contrast, the Express edition is free and suited for smaller-scale applications, while the Standard edition is a middle-ground offering for users who do not need the more advanced SQL Server features. Lastly, the Developer edition is similar in functionality to the Enterprise edition but is mostly used by developers for testing and modeling demo applications.
Introduction to Databricks
Databricks is an enterprise big data platform, and much more, developed by the creators of Apache Spark. The platform aims to be full-service and, as such, provides a set of tools for unifying the process of building, deploying, sharing, scaling, and maintaining enterprise-grade data solutions. The tools available on the platform can help with data engineering workflows, machine learning modeling and tracking, data discovery, annotation, and exploration, creating dashboards and visualizations, managing security and governance, and lots more. These features are all readily available because Databricks is cloud-native and deploys on top of distributed cloud computing environments like Microsoft Azure, AWS, or Google Cloud.
Databricks simplifies the process of collating large amounts of data, whether in a data warehouse or a data lake, through its Lakehouse architecture. The Lakehouse architecture combines the capabilities and advantages of both a data warehouse and a data lake. This approach helps eliminate unwanted data silos and provides data teams with a unified, single source of data.
Databricks also offers several open-source projects such as Delta Lake, MLflow, Redash, Delta Sharing, Apache Spark, and Structured Streaming. Each of these technologies can be easily integrated into a Databricks instance, and Databricks manages updates of these open-source integrations in its Databricks Runtime releases. You can also expand, integrate, and optimize performance using some of Databricks’ proprietary tools like Workflows, Photon, Delta Live Tables, Unity Catalog, and Databricks SQL. Another plus for users is that Databricks also makes it easy to interact with the platform programmatically through its REST APIs, CLI, and Terraform provider.
Methods to Load Data from SQL Server to Databricks
Our next step is to explore some common ways to get data from our primary SQL Server database into Databricks. The following section will look at two distinct methods by which data can be loaded from SQL Server into Databricks: using Arcion to implement a CDC-enabled approach, and manually transferring data into Databricks using DataFrames and JDBC from within a Databricks notebook.
Method 1: SQL Server to Databricks Using Arcion
The first method we will dive into is a more automated and real-time approach to moving data between the two platforms. For this, we will use Arcion. Arcion is a real-time, in-memory Change Data Capture (CDC) solution trusted by some of the largest organizations in the world. Highlights of Arcion’s capabilities include massive data scalability and data consistency at all times. When it comes to applicable use cases, Arcion is suitable for migrating and replicating data with ease since it integrates with various enterprise databases and data warehouses, including SQL Server and Databricks. Over the course of this section, we will explore how to connect Microsoft SQL Server to Databricks using Arcion’s CLI tool, known as Replicant.
As stated in the introduction, Databricks can be hosted on various cloud platforms such as Microsoft Azure, AWS, and Google Cloud. Although many of the steps below will be similar across each platform, the steps in this write-up are specifically for an Azure Databricks instance. If you are following the instructions below and using Azure Databricks as a target with Arcion, you will need a Databricks workspace on Azure and an Azure container in Azure Data Lake Storage (ADLS) Gen2. With those prerequisites established, let’s look at the steps involved in setting it all up.
Step 1: Download And Install Arcion Self-hosted
The first thing we need to do is configure Arcion. For this, we will download and install the self-hosted version of Arcion. To gain access, you will be required to follow the steps outlined on the Arcion Self-hosted webpage. Once you have downloaded Arcion’s CLI tool, Replicant, you will need to create a home directory for it. By default, the home directory will be wherever Replicant was downloaded. This home directory will be referred to as $REPLICANT_HOME in the examples below. The last step is adding your credentials to your Arcion instance, which is covered in the quick start guide.
Having downloaded and installed Arcion Self-hosted, we can move on to the next steps where we will configure and enable Arcion to connect with SQL Server and Databricks.
Step 2: Set Up Connection Configuration for SQL Server
The next step will be for us to set up the connection configuration file for SQL Server. To do this, we will open the sample connection configuration file, sqlserver.yaml. The file can be found in the $REPLICANT_HOME directory under the conf/conn directory. Below is an example command to open up the file with Vi.
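```bash
# From the $REPLICANT_HOME directory
vi conf/conn/sqlserver.yaml
```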
Then, we will go ahead and add our SQL Server credentials to give Arcion access to the instance. If your connection credentials are stored in AWS Secrets Manager, you can have Replicant retrieve them from there. Alternatively, we can add them directly to the sqlserver.yaml file we opened with the previous command. Some example inputs can be seen below.
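The following is an illustrative sketch of the connection configuration using placeholder values. Exact key names (for example, the maximum-connections setting) can vary between Replicant versions, so verify them against the Arcion documentation:

```yaml
type: SQLSERVER

extractor: CHANGE            # CDC extractor: CHANGE (default) or LOG

host: localhost              # hostname of the SQL Server instance
port: 1433                   # port of the SQL Server instance

username: 'replicant_user'   # database user Replicant connects as
password: 'replicant_password'

database: 'tpcc'             # database to replicate from

max-connections: 30          # maximum connections Replicant uses to fetch source data
```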
From the configuration above, you can set the following values:
Extractor: For SQL Server as a data source, two CDC extractors are available to choose from: CHANGE and LOG. You can select the one you want to use by setting the extractor parameter in the connection configuration file. The CHANGE extractor is the default; with it, SQL Server Change Tracking is used for real-time replication. If you choose the LOG extractor, the Replicant SQL Server Agent will be used as the CDC extractor.
host: This refers to the hostname of the SQL Server instance
port: This is the port number of the SQL Server instance
username: This refers to the username of the DATABASE user
password: This is the password associated with the USERNAME
database: This is where you input the name of the SQL Server database to connect to
max_number_of_connections: This is used to specify the maximum number of connections Replicant would use to fetch data from the source.
is_azure: If you are hosting SQL Server on Azure, the is_azure parameter must also be set to true. In the connection configuration file, that will look like this:
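```yaml
# Only required when the SQL Server instance is hosted on Azure
is_azure: true
```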
After you are done setting the configuration, save the file and exit the text editor.
Step 3: Set up Extractor Configuration for SQL Server
Now that our connection configuration file is set, we need to set our Extractor configuration. To do this, from the $REPLICANT_HOME directory, open the Extractor configuration file located at conf/src/sqlserver.yaml. The command below shows an example of how to open up the file with Vi:
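```bash
# From the $REPLICANT_HOME directory
vi conf/src/sqlserver.yaml
```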
Once the file is opened, you will need to decide which replication mode you want to use. Arcion fully supports realtime, delta-snapshot, and snapshot modes for SQL Server. In this example, we will configure realtime mode, and our focus will be on full-mode replication. For this to work, we will also need to create a heartbeat table. It is also important in full-mode replication that all tables to be replicated have a primary key on them.
First, let’s create the heartbeat table. Below is an example of a SQL command you can run on your SQL Server instance to create a heartbeat table.
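Here is a sketch of such a table. The database and schema names are placeholders; use the database you are replicating, and make sure the identifiers match the heartbeat settings you add to the Extractor configuration in the next step:

```sql
USE tpcc;  -- placeholder: the database being replicated

-- Heartbeat table used by Replicant during realtime replication
CREATE TABLE dbo.replicate_io_cdc_heartbeat (
    [timestamp] BIGINT NOT NULL,
    PRIMARY KEY ([timestamp])
);
```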
After creating the heartbeat table, the next thing to do is to add the realtime configuration to the Extractor configuration file. In the extractor configuration file opened earlier, let's add a section for realtime mode (if it’s not already present). The configuration below shows an example of how to set up Replicant to operate in realtime mode:
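The snippet below is an illustrative sketch of a realtime section. The thread and fetch settings are example values, and the heartbeat identifiers must match the table created above; verify the parameter names against the Arcion Extractor configuration reference:

```yaml
realtime:
  threads: 4                    # parallel extractor threads
  fetch-size-rows: 10000        # rows fetched per extraction batch
  heartbeat:
    enable: true
    catalog: "tpcc"             # database that holds the heartbeat table
    schema: "dbo"
    table-name: replicate_io_cdc_heartbeat
    column-name: timestamp
    interval-ms: 10000          # how often Replicant updates the heartbeat row
```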
You can refer to Arcion’s Realtime mode documentation for further reading and more information about the configuration and parameters you can set.
Optional Step: Enable Change Tracking for Realtime Mode
For SQL Server Change Tracking to be used in realtime mode, Change Tracking must be enabled on all databases and tables being replicated. This step is only needed if the extractor setting was set to CHANGE in your connection configuration file, as seen below.
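```yaml
extractor: CHANGE
```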
To enable Change Tracking on the SQL Server instance, execute the example SQL command shown below with your parameters subbed in. The database_name variable in the statement should be replaced with the name of the database you want Change Tracking enabled on. If you have multiple databases you want to enable this for, you can run the statement once for each database.
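The retention settings below are illustrative defaults; adjust them for your environment:

```sql
-- Enable Change Tracking at the database level
ALTER DATABASE database_name
SET CHANGE_TRACKING = ON
(CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);
```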
Alternatively, Change Tracking can also be enabled on a specific table only. The example SQL command below can be used to enable Change Tracking on a per-table basis. The table_name variable should be replaced with the name of the table you want Change Tracking enabled on. If you want to enable this on multiple tables, run the statement individually for each table.
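```sql
-- Enable Change Tracking for a single table
ALTER TABLE table_name
ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = ON);  -- optional: also track which columns changed
```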
Step 4: Create a Databricks Cluster
Below is a description of how to create both a Databricks all-purpose cluster and a SQL Warehouse. Either of these storage solutions can be used with Arcion. After the steps for configuring the cluster, there are a few details you’ll need to jot down in order to plug in the new instance to your connection configuration file in Arcion.
Setting Up an All-Purpose Cluster
To set up a Databricks all-purpose cluster, do the following steps:
- Log in to your Databricks Workspace.
- Go to Data Science & Engineering > Compute > Create Compute in the Databricks console.
- Enter the name you want to give the cluster.
- Select the latest Databricks runtime version.
- Set up an external stage.
- Click Create Cluster.
Getting The Connection Details for The All-Purpose Cluster
Having completed the cluster setup, you need to get the connection details for the cluster to be able to establish a connection between the Databricks instance and Arcion. This is provided to Replicant via the target connection configuration file. The necessary details can be retrieved for the newly created all-purpose cluster through the following steps:
- Click on the Advanced Options toggle.
- Click on the JDBC/ODBC tab and write down the Server Hostname, Port, and JDBC URL values.
Setting Up a SQL Warehouse
To set up a Databricks SQL warehouse (SQL Compute), do the following steps:
- Log in to your Databricks Workspace.
- Go to SQL > Review SQL warehouses > Create SQL Warehouse on the Databricks console.
- Enter a name you want to give the SQL warehouse.
- Select a cluster size.
- Set up an external stage.
- Click Create.
Getting The Connection Details for The SQL Warehouse
Having completed the setup, you need to get the connection details for the SQL warehouse to be able to establish a connection between the Databricks instance and Arcion. This info is provided to Replicant using the target connection configuration file. The necessary data can be retrieved for the new SQL warehouse instance through the following steps:
- Go to the Connection Details tab.
- Write down the Server Hostname, Port, and JDBC URL values.
Step 5: Create a Personal Access Token For The Databricks Cluster
A personal access token is required to configure replication capabilities through Replicant. To create a personal access token, visit the Generate a personal access token page of the Databricks docs for the exact steps required. You’ll need the token details in a later step so keep them handy.
Step 6: Configure The ADLS Container For Staging
You’ll now need to grant your Databricks instance access to your Azure Data Lake Storage (ADLS). Any of the following methods listed below can be used in granting Databricks access to ADLS.
- Access Azure Data Lake Storage Gen2 or Blob Storage using OAuth 2.0 with an Azure service principal.
- Access Azure Data Lake Storage Gen2 or Blob Storage using a SAS token.
- Access Azure Data Lake Storage Gen2 or Blob Storage using the account key.
You can also use Spark configuration properties to access data in an Azure storage account if you do not want to use the Python-based approaches shown in the resources above.
Below are the steps required to apply the Spark configuration for both a cluster and a SQL warehouse.
Spark Configuration For a Cluster
- Click the Advanced Options toggle found on the cluster configuration page.
- Click on the Spark tab.
- Enter your configuration properties in the Spark Config textbox.
Spark Configuration For A SQL Warehouse
- Click your username in the top bar of Databricks workspace and select Admin Console from the dropdown menu.
- Select the SQL Warehouse Settings tab.
- Enter your configuration properties in the Data Access Configuration textbox.
The example below shows how to access data in an Azure storage account using the storage account key by entering your Spark configuration. In the example, <STORAGE_ACCOUNT> is the name of your Azure storage account and <STORAGE_ACCOUNT_KEY> is your storage account key.
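One plausible form of that property, following the Databricks account-key pattern, is shown below. Depending on where you set it (cluster Spark Config versus a SQL warehouse’s Data Access Configuration), the key may need a spark.hadoop. prefix, so confirm against the Databricks storage documentation:

```ini
# Grant access to ADLS Gen2 using the storage account key
fs.azure.account.key.<STORAGE_ACCOUNT>.dfs.core.windows.net <STORAGE_ACCOUNT_KEY>

# Some contexts expect the Hadoop-prefixed form instead:
spark.hadoop.fs.azure.account.key.<STORAGE_ACCOUNT>.dfs.core.windows.net <STORAGE_ACCOUNT_KEY>
```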
Step 7: Obtain the JDBC Driver for Databricks
Next, you’ll need to obtain the correct JDBC driver for Replicant to be able to connect to your Databricks instance. You can obtain the appropriate JDBC driver for Legacy Databricks and Databricks Unity Catalog through the steps detailed below.
For Legacy Databricks
- Download the JDBC 4.2-compatible Databricks JDBC Driver ZIP.
- From the downloaded ZIP, locate and extract the SparkJDBC42.jar file.
- Put the SparkJDBC42.jar file inside the $REPLICANT_HOME/lib directory.
For Databricks Unity Catalog
- Go to the Databricks JDBC Driver download page and download the driver.
- From the downloaded ZIP, locate and extract the DatabricksJDBC42.jar file.
- Put the DatabricksJDBC42.jar file inside the $REPLICANT_HOME/lib directory.
Step 8: Configure The Replicant Connection For Databricks
At this point, we can begin to plug in our connection details for our new Databricks instance. This part of the process is simply providing the Databricks connection details we jotted down from the previous steps to Arcion. Adding those details can be done with the following steps:
1. Go to the Databricks sample connection configuration file found within $REPLICANT_HOME.
2. The connection configuration file obtained has two parts:
   a. Parameters related to the target Databricks connection.
   b. Parameters related to stage configuration.
Parameters Related to The Target Databricks Connection
If your connection credentials are stored in AWS Secrets Manager, you can retrieve them using Replicant; if not, you will have to input them manually, as seen in the example connection configuration file below.
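The sketch below shows what the Databricks connection section can look like, based on the fields described afterwards. The type value and placeholders are assumptions to verify against Arcion’s Databricks target documentation; the JDBC URL, hostname, and port are the values noted in Step 4, and the personal access token from Step 5 is typically supplied as the password:

```yaml
type: DATABRICKS_DELTALAKE        # target type; confirm the exact value in the Arcion docs

url: "JDBC_URL"                   # JDBC URL noted when creating the cluster or SQL warehouse
username: "USERNAME"              # Databricks user
password: "PASSWORD"              # password or personal access token
host: "HOSTNAME"                  # Server Hostname noted earlier
port: "PORT_NUMBER"               # Port noted earlier

max-connections: 30               # maximum connections Replicant can open in Databricks
```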
From the sample, we have the following fields which need values:
url: This is where you replace JDBC_URL with the JDBC URL you retrieved earlier when creating a cluster or SQL warehouse.
username: This is the username that connects you to your Databricks server.
password: This is the password registered with the username
host: This is where you input the HOSTNAME of your Databricks that was created earlier when creating a cluster or SQL warehouse.
port: This is the PORT_NUMBER of the Databricks instance that was noted earlier when creating a cluster or SQL warehouse.
max-connections: This specifies the maximum number of connections Replicant can open in Databricks. It can be altered to meet your needs.
Parameters Related to The Stage Configuration
The stage configuration requires you to use an external stage to hold the data files; the data is then loaded onto the target database from there. The examples below show stage configurations for Azure Databricks using a storage account key and a SAS token for authentication, respectively.
Using a Storage Account Key
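A sketch of the stage section using a storage account key. The directory, container, and account names here are placeholders, and the key names should be verified against the Arcion stage configuration reference:

```yaml
stage:
  type: AZURE                                   # stage type for Azure Legacy Databricks
  root-dir: "replicate-stage/databricks-stage"  # directory under the ADLS container for bulk-load files
  conn-url: "replicate-container"               # name of the ADLS container
  account-name: "replicant-storageaccount"      # ADLS storage account name
  secret-key: "STORAGE_ACCOUNT_KEY"             # storage account key
```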
Using a SAS Token
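The same section using a SAS token instead of the account key (again, values are placeholders):

```yaml
stage:
  type: AZURE
  root-dir: "replicate-stage/databricks-stage"
  conn-url: "replicate-container"
  account-name: "replicant-storageaccount"
  sas-token: "SAS_TOKEN"                        # shared access signature token for the container
```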
From the example, we have the following fields which need to be populated with a value:
type: This refers to the stage type. For Azure Legacy Databricks, set the type to AZURE.
root-dir: This is the directory under the ADLS container that Replicant uses to stage bulk-load files.
conn-url: This is the name of the ADLS container.
account-name: This is the name of the ADLS storage account, corresponding to the same storage account in the Configure ADLS container as the stage section.
secret-key: If you want to authenticate ADLS using a storage account key, specify your storage account key here.
sas-token: If you’re using a shared access signature (SAS) token to authenticate ADLS, specify the SAS token here.
Step 9: Configure The Mapper File (Optional)
This is an optional step that lets you define how data maps from your source to Azure Databricks in the Mapper file. You can learn more about this from the Mapper configuration and Mapper configuration in Databricks documentation pages.
Step 10: Set up The Applier Configuration
Lastly, we need to set up our Applier configuration to specify how data should be applied to our Databricks target. To do this, follow the steps outlined below.
1. Go to the applier configuration file found in $REPLICANT_HOME.
2. The configuration file opened has two parts:
   a. Parameters related to snapshot mode.
   b. Parameters related to realtime mode.
Parameters Related to Snapshot Mode
For the snapshot mode configuration, apply the following changes to the Applier configuration file:
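The snippet below is an illustrative sketch only; the parameter names and values are examples and should be verified against Arcion’s Applier configuration reference for Databricks:

```yaml
snapshot:
  threads: 16               # number of parallel applier threads
  batch-size-rows: 100000   # rows written to Databricks per batch
  txn-size-rows: 1000000    # rows grouped into a single transaction
```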
Additional parameters that can be set for snapshot mode are also shown below:
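For example, bulk loading through the external stage can typically be tuned in this section as well. The keys below are an assumption and should be checked against the Arcion docs before use:

```yaml
snapshot:
  bulk-load:
    enable: true            # stage data files externally and bulk-load them into Databricks
    type: FILE              # load from staged files
```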
Parameters Related to Realtime Mode
The realtime section in the Applier file is used to specify the configuration for real-time replication, which runs after snapshot mode ends in the case of a CDC setup. To get this to work, we have to enable Type-2 CDC, which allows a Target to keep a history of all transactions performed in the Source. With Type-2 CDC, an INSERT in the Source becomes an INSERT in the Target, and an UPDATE in the Source also becomes an INSERT in the Target, with additional metadata for each transaction such as the operation performed and the time of the operation. The primary requirement for Type-2 CDC is to enable full-row logging in the SQL Server Source.
To enable Type-2 CDC for the Databricks target, the following steps should be carried out:
- Under the realtime section of the Databricks Applier configuration file, add the parameters shown below.
- Add the following parameter under the snapshot section in the Extractor configuration file.
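The snippets below are a sketch of these two changes as best recalled from Arcion’s Type-2 CDC documentation; the exact parameter names should be verified against the current Arcion docs. First, the realtime section of the Databricks Applier configuration:

```yaml
realtime:
  enable-type2-cdc: true    # keep a history of every source transaction on the target
  replay-strategy: NONE     # apply changes as appends rather than merges
```

And the parameter added under the snapshot section of the Extractor configuration file:

```yaml
snapshot:
  csv-publish-method: READ  # assumed source-side setting required for Type-2 CDC; verify in Arcion docs
```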
With that, the last step is to actually run Replicant to begin the CDC process. To run Replicant, check out the Running Replicant docs page which outlines all of the necessary commands.
Advantages of Using Arcion
The following are the benefits of using Arcion to load data from Microsoft SQL Server to Databricks:
- Sub-second Latency From Distributed & Highly Scalable Architecture: Arcion is the world’s only CDC solution with an underlying end-to-end multi-threaded architecture, which supports auto-scaling both vertically and horizontally. Its patent-pending technology parallelizes every single Arcion CDC process for maximum throughput. So users get ultra-low latency and maximum throughput even as data volume grows.
- Automatic Schema Management: Arcion allows smooth schema management of your data, thereby ensuring data integrity, reliability, and consistency. As the schema evolves on the SQL Server instance, those changes are applied to the Databricks instance automatically to keep both in sync.
- Effortless Setup & Maintenance: Arcion's no-code platform removes DevOps dependencies; you do not need to incorporate Kafka, Spark Streaming, Kinesis, or other streaming tools. So you can simplify the data architecture, saving both time and costs.
- 1:1 dedicated support from POC to production: The Arcion team offers live support for all self-hosted users around the clock through email, a dedicated Slack channel, and calls. If users get stuck, our team of database replication experts is available to help get things on track and make your database replication and CDC efforts a success.
Method 2: SQL Server to Databricks Manually
In this section of the article, we will cover the second method of connecting Microsoft SQL Server to Databricks by running a manual replication process. To follow along with the example below, it is assumed that you already have a SQL Server database deployed either locally or on a cloud platform such as Google Cloud, AWS, or Microsoft Azure. If you don’t have an active Databricks or SQL Server instance deployed, you will need to create a SQL Server database and a Databricks instance first. With your SQL Server and Databricks instances up and running, you can continue with the following steps to connect SQL Server to Databricks.
Option 1: Using DataFrames To Create a Connection To SQL Server
The first step is to establish a connection from Databricks to SQL Server using DataFrames. DataFrames are a distributed collection of data, organized into named columns. They are conceptually equivalent to a table in a relational database but with more optimized operations. The code below is used to read data from a table in a SQL Server database into a DataFrame in Apache Spark.
In Databricks Runtime 11.3 LTS and above, the sqlserver keyword can be used to include the driver for connecting to SQL Server. The Python and Scala examples below show how to establish a connection. To use these examples, you’ll need to run the code within a Databricks notebook and replace the hostName, port, username, password, databaseName, and schemaName.tableName values with your actual SQL Server details. The port is optional and defaults to 1433, the default port for SQL Server, if not provided. The schema name is also optional and defaults to "dbo" if not provided.
For Python
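The example below is a sketch following the connector pattern in the Databricks documentation; the placeholder option values correspond to the variables described above:

```python
# Read a SQL Server table into a Spark DataFrame using the built-in "sqlserver" connector
# (Databricks Runtime 11.3 LTS and above; the `spark` session is provided by the notebook)
remote_table = (
    spark.read
    .format("sqlserver")
    .option("host", "hostName")                 # SQL Server hostname
    .option("port", "port")                     # optional, defaults to 1433
    .option("user", "username")
    .option("password", "password")
    .option("database", "databaseName")
    .option("dbtable", "schemaName.tableName")  # schema defaults to "dbo" if omitted
    .load()
)

display(remote_table)  # preview the data in the notebook
```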
For Scala
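The equivalent Scala sketch, under the same assumptions:

```scala
// Read a SQL Server table into a Spark DataFrame using the built-in "sqlserver" connector
val remoteTable = spark.read
  .format("sqlserver")
  .option("host", "hostName")                 // SQL Server hostname
  .option("port", "port")                     // optional, defaults to 1433
  .option("user", "username")
  .option("password", "password")
  .option("database", "databaseName")
  .option("dbtable", "schemaName.tableName")  // schema defaults to "dbo" if omitted
  .load()

display(remoteTable)  // preview the data in the notebook
```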
Alternatively, you can also use SQL syntax to import the data. The example script below shows how to create a table and specify the SQL Server connection details. As with the Python and Scala examples, replace <host-name>, <port>, <username>, <password>, <database-name>, and <schema-name.table-name> with your actual SQL Server details. The port is again optional and defaults to 1433, and the schema name defaults to "dbo" if not provided.
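A sketch of that SQL, following the same Databricks connector pattern and run from a notebook SQL cell:

```sql
-- Create a Databricks table that references the remote SQL Server table
DROP TABLE IF EXISTS sqlserver_table;
CREATE TABLE sqlserver_table
USING sqlserver
OPTIONS (
  dbtable '<schema-name.table-name>',
  host '<host-name>',
  port '<port>',
  database '<database-name>',
  user '<username>',
  password '<password>'
);
```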
This code essentially sets up a reference in Databricks to a specific table in your SQL Server database, using the provided connection details. After running this code, you can query the sqlserver_table within your Databricks notebook as if it was a local table, but the actual data stays in SQL Server.
Option 2: Using the Legacy JDBC Driver for Connectivity
In Databricks Runtime 11.2 and below, the JDBC settings must be specified to identify the driver and configurations. JDBC (Java Database Connectivity) is an API for the Java programming language that defines how a client may access a database. It provides methods to query and update data in a database and is oriented towards relational databases.
The Python and Scala examples below demonstrate how to query SQL Server using the SQL Server JDBC driver. Similar to Option 1 above, replace <database-host-url>, <database-name>, <table-name>, <username>, and <password> with your actual SQL Server details. You can run this code in a Databricks notebook.
For Python
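A Python sketch of the JDBC read, assuming the SQL Server JDBC driver is available on the cluster; the port value is an assumption you should change if your instance uses a non-default port:

```python
# Query SQL Server over JDBC (Databricks Runtime 11.2 and below)
driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"

database_host = "<database-host-url>"
database_port = "1433"   # adjust if your instance uses a non-default port
database_name = "<database-name>"
table = "<table-name>"
user = "<username>"
password = "<password>"

url = f"jdbc:sqlserver://{database_host}:{database_port};database={database_name}"

remote_table = (
    spark.read
    .format("jdbc")
    .option("driver", driver)
    .option("url", url)
    .option("dbtable", table)
    .option("user", user)
    .option("password", password)
    .load()
)
```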
For Scala
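The equivalent Scala sketch, under the same assumptions:

```scala
// Query SQL Server over JDBC (Databricks Runtime 11.2 and below)
val driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"

val databaseHost = "<database-host-url>"
val databasePort = "1433"   // adjust if your instance uses a non-default port
val databaseName = "<database-name>"
val table = "<table-name>"
val user = "<username>"
val password = "<password>"

val url = s"jdbc:sqlserver://$databaseHost:$databasePort;database=$databaseName"

val remoteTable = spark.read
  .format("jdbc")
  .option("driver", driver)
  .option("url", url)
  .option("dbtable", table)
  .option("user", user)
  .option("password", password)
  .load()
```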
All these code snippets assume that a Spark session (spark) is already active, which is automatically done for you in Databricks notebooks.
Disadvantages of Manually Loading Data
Based on the above methods for manually loading data, there are a few disadvantages to going this route. Some of the disadvantages of manually loading data from SQL Server to Databricks include:
- These tactics make it difficult to transfer data in real time. To do so, you would be required to write custom code that captures changes and moves them between the platforms. Not doing this right could lead to problems with data accuracy and consistency.
- Since it is difficult to transfer data continuously in real time, this approach is really only suitable for a one-time bulk transfer of data. Replication use cases are hard to implement with this approach, but one-time data migrations can be handled with it.
- As you scale this solution, you will likely be required to maintain and add to the custom scripts. This can be very time-consuming and could lead to errors and inconsistencies in the data.
- You may not have technical assistance when implementing this custom solution. There may be limited help and resources if a problem is encountered.
Conclusion
In this write-up, we went over how to connect Microsoft SQL Server to Databricks, both in an automated and scalable way using Arcion and manually using Databricks-native tools with Python and Scala. We started off by looking at what SQL Server and Databricks are at a high level and then dove into the two methods in more detail.
We looked at step-by-step instructions for both methods and covered the advantages and disadvantages of each. In the end, we highlighted the shortcomings of the second and more manual approach of the two and showed the preferred approach of using Arcion. For building robust pipelines to move data from SQL Server to Databricks, Arcion is the most scalable and easy-to-use option. For more details on how to get started with Arcion, contact our team today.