The role of data as a pivotal asset in today's digital world is unquestionable. In the data industry, it is now universally acknowledged that strategically structuring databases for integration with other databases or data warehouses can generate massive business benefits. These benefits extend beyond traditional data storage and help companies to unlock the ability to enable comprehensive business analytics and facilitate machine learning workloads. The result is a wealth of actionable insights, driving success across diverse applications and business operations. This has become the future of data and the standard businesses aim for.
Companies need to focus heavily on creating robust data pipelines to harness these benefits. Taking this direction includes some prerequisites that need to be satisfied to ensure a smooth, efficient data pipeline setup. Transforming and compiling large sets of data into digestible formats is necessary, followed by making sure data is readily available for external inquiries and internal reporting. These seemingly simple steps can significantly streamline operations and unlock many use cases. To take things a step further, data provisioning is expected to be responsive and meet the demands of users in real or near-real time.
In this article, we will look at how to enable the data benefits discussed previously, focusing specifically on extracting data from IBM Db2 and loading it into Delta Lake on Databricks. Databricks, powered by the robust Apache Spark, is a cloud-based service that, when integrated with IBM Db2, provides an enhanced environment for performing data engineering and data science operations on live Db2 data.
We will outline two methods to achieve this data pipeline: a hands-on, manual process and an automated one using Arcion. Both methods are designed to migrate data between IBM Db2 and Databricks, ultimately enabling analytics and machine learning use cases. First, let's look at both platforms and then dive into how to set up the pipelines.
Introduction to IBM Db2
First, we will look at our source database in this setup: IBM’s Db2 Platform. IBM Db2 is a leading cloud-native database engineered for critical workloads, enabling low-latency transactions and real-time analysis. Initially designed to handle relational models, Db2 was broadened to incorporate object-relational features and non-relational structures like JSON and XML. Db2 powers some of the most critical applications in the world and is heavily used within the finance and insurance spaces, amongst many others.
Db2 offers a single engine for database administrators (DBAs) and builds on decades of innovation, boasting top-notch security, easy scalability, high performance, and high availability. Its features allow for running mission-critical applications with high availability, assisting in real-time decision-making, allowing for fast data ingestion, in-database analytics, and in-memory data processing.
IBM Db2 is a cloud-native database optimized for high-data-intensive apps and services. It supports high-performance machine learning models inside Db2 for creating, evaluating, and predicting trends within the data stored on the platform. Db2’s single database engine handles transactional (OLTP), analytical (OLAP), and operational workloads.
Introduction to Databricks
Databricks is an enterprise big data software built on top of Apache Spark. One of the more popular big data platforms, Databricks equips users with tools for building, deploying, sharing, scaling, and maintaining enterprise-grade data solutions. By leveraging Apache Spark, Databricks simplifies data engineering workflows and enables machine learning models, data discovery/annotation, analytics dashboards, visualizations, security management, governance, and more. Databricks also offers excellent scalability and easy configuration since it can be deployed on many popular cloud platforms like Microsoft Azure, AWS, and Google Cloud.
Like other big data platforms, Databricks streamlines the process of gathering large amounts of data from Data Warehouses or Data Lakes using its lakehouse architecture. This architecture combines the capabilities of data warehouses and data lakes. As a result, a data lakehouse offers the flexibility, cost-efficiency, and scalability of a data lake and the data management capabilities and ACID compliance of a data warehouse. Truly the best of both worlds.
Databricks also supports several open-source projects, such as Delta Lake, MLflow, Redash, Delta Sharing, Apache Spark, and Structured Streaming. These technologies can be easily integrated into a Databricks instance. Users can enhance integration and optimize performance using Databricks’ advanced proprietary tools, including Workflows, Photon, Delta Live Tables, Unity Catalog, and Databricks SQL. Additionally, Databricks offers various tools and technologies for developers to interact with the platform via REST APIs, CLI, and Terraform.
Methods to Load Data from Db2 to Databricks
Now that we have looked at both platforms in detail, let’s explore how to connect them to create a robust data pipeline. In this section, we will look at two distinct methods we can use to load data from Db2 into Databricks. The first method, using Arcion, is a modern process that allows for a massive amount of automation and scalability, while the second is a manual process that is difficult to implement at scale. Let’s begin by looking at Arcion.
Method 1: Db2 to Databricks Using the Arcion
Arcion is one of the foremost real-time, in-memory Change Data Capture (CDC) solutions offering users massive scalability and data consistency. It is suitable for migrating and replicating data across various enterprise databases and warehouses. In this section, we will look at connecting Db2 to Databricks using the Arcion CLI tool, Replicant.
Db2 as a source database on Arcion comes in three packages, namely IBM Db2 with Kafka/MQ, IBM Db2 on Native I Series AS/400, and IBM Db2 with Native LUW, and all have their unique attributes, though our focus will be on IBM Db2 with Kafka/MQ. Likewise, Databricks can be hosted on various cloud environments, like Microsoft Azure, AWS, and Google Cloud, but the steps in this write-up are specifically for an Azure Databricks instance. Many steps are applicable across the other providers, although slight tweaks may be needed. With Azure Databricks as a data target for our Arcion pipeline, you must have a Databricks workspace on Azure and an Azure container in ADLS Gen2.
The first few steps in this section cover how to connect your Db2 instance as a data source in Arcion, and the subsequent steps show how to connect your Databricks instance to Arcion as a destination.
Step 1: Download And Install Arcion Self-hosted
The first thing we will need is to configure Arcion. For this, we must download and install the self-hosted version of Arcion. To gain access, you must follow the steps outlined on the Arcion Self-hosted page. Once you have downloaded Arcion’s CLI tool Replicant, you must create a home directory for it. By default, the home directory will be where Replicant was downloaded. From now on, this home directory will just be referred to as $REPLICANT_HOME in the examples below. The last step is adding your credentials to your Arcion instance, which is covered in the quick start guide that you can reference for more details on how to do this.
Having downloaded and installed Arcion Self-hosted successfully, proceed to the next steps to configure and enable Arcion to connect with Db2 and Databricks.
Step 2: Check Permissions on Source Db2
The next thing to do is to verify that the necessary permissions for using Db2 as a source to perform replication are applied. Check the Arcion docs here to find the exact permissions needed.
Step 3: Enable CDC Replication for Db2 MQ
To enable CDC, follow the instructions here to enable CDC-based replication on the Db2 MQ server.
Step 4: Create the Heartbeat Table
A heartbeat table must be created on the source database for CDC replication. You can use the DDL statement below to create this.
Step 5: Set up Connection Configuration for IBM Db2 with Kafka MQ
With the necessary permissions granted, CDC replication enabled, and a heartbeat table created on our IBM Db2 with Kafka MQ, we can move on to plugging the connection details into Arcion. First, open the sample connection configuration file in the $REPLICANT_HOME directory and open it up for editing.
The configuration file has two parts: parameters related to the source Db2 server connection and CDC logs and monitoring parameters.
Parameters Related to the Source Db2 Server Connection
In the connection configuration file, we must add our values so Arcion can connect to our Db2 instance. If your connection credentials are stored on AWS Secrets Manager, you can retrieve them using Replicant; if not, you will have to input them into the configuration file we opened before. The exact syntax can be seen in the example below.
You’ll need to ensure that the database, host, port, username, and password values are set to the correct values for your Db2 instance.
If the proxy-source is set to true, Replicant will not attempt to connect to the source database. It can be enabled for real-time mode when the log is in a separate storage space from the source database.
Parameters Related to CDC Logs and Monitoring
You can choose between IBM MQ and Kafka as storage for Db2 logs for CDC-based replication. All the CDC log and monitoring configurations can be found in the cdc-log-config field. You can specify the storage type by using cdc-log-storage parameter.
Below is a sample CDC log configuration using MQ as cdc-log-storage.
If you choose Kafka to store your Db2 CDC logs, cdc-log-storage can be set to KAFKA_TRANSACTIONAL or KAFKA_EVENTUAL.
Below is a sample configuration for KAFKA_TRANSACTIONAL.
Below is a sample configuration for KAFKA_EVENTUAL.
Step 6: Set up Extractor Configuration
Now that our connector configuration file is set, we need to set our extractor configuration. To set this, from the $REPLICANT_HOME directory, open the Extractor configuration file. This can be done by running the following command in the terminal.
Once the file is opened, you must decide what replication mode to use. The configuration file has three parts that Arcion fully supports for Db2: realtime, delta-snapshot, and snapshot modes.
Parameters Related to Realtime Mode
In realtime mode, the start-position parameter specifying the starting log position for real-time replication is structured differently for Db2 MQ and Kafka; therefore, the parameters are also configured differently.
Below is a sample extractor real-time configuration for Db2 MQ.
Below is a sample extractor realtime configuration for Db2 Kafka.
Parameters Related to Snapshot Mode
The sample below shows how to set up parameters related to snapshot mode.
Parameters Related to Delta Snapshot Mode
You can use the sample below for delta snapshot mode.
Step 7: Create a Databricks Cluster
Below is a description for creating a Databricks all-purpose cluster and a SQL Warehouse. Either of these storage solutions can be used with Arcion. After the steps for configuring the cluster, there are a few details you’ll need to jot down to plug in the new instance to your connection configuration file in Arcion.
Setting Up an All-Purpose Cluster
To set up a Databricks all-purpose cluster, do the following steps:
- Log in to your Databricks Workspace.
- Go to Data Science & Engineering > Compute > Create Compute in the Databricks console.
- Enter the name you want to give the cluster.
- Select the latest Databricks runtime version.
- Set up an external stage.
- Click Create Cluster.
Getting The Connection Details for The All-Purpose Cluster
Having completed the cluster setup, you need to get the connection details for the cluster to establish a connection between the Databricks instance and Arcion. This is provided to Replicant via the target connection configuration file. The necessary details can be retrieved for the newly created all-purpose cluster through the following steps:
- Click on the Advanced Options toggle.
- Click on the JDBC/ODBC tab and note down the Server Hostname, Port, and JDBC URL values.
Setting Up a SQL Warehouse
Alternatively, you can also use a Databricks SQL Warehouse. To set up a Databricks SQL warehouse (SQL Compute), do the following steps:
- Log in to your Databricks Workspace.
- Go to SQL > Review SQL warehouses > Create SQL Warehouse on the Databricks console.
- Enter a name you want to give the SQL warehouse.
- Select a cluster size.
- Set up an external stage.
- Click Create.
Getting The Connection Details Tor The SQL Warehouse
Having completed the setup, you need to get the connection details for the SQL warehouse to establish a connection between the Databricks instance and Arcion. This info is provided to Replicant using the target connection configuration file. The necessary data can be retrieved for the new SQL warehouse instance through the following steps:
- Go to the Connection Details tab.
- Write down the Server Hostname, Port, and JDBC URL values.
Step 8: Create a Personal Access Token For The Databricks Cluster
A personal access token is required to configure replication capabilities through Replicant. To create a personal access token, visit the Generate a personal access token page of the Databricks docs for the exact steps required. You’ll need the token details in a later step, so make sure to save them and keep them handy.
Step 9: Configure The ADLS Container For Staging
You must now grant your Databricks instance access to your Azure Data Lake Storage (ADLS). Any of the following methods listed below can be used in granting Databricks access to ADLS, depending on the type of security you want to use.
- Access Azure Data Lake Storage Gen2 or Blob Storage using OAuth 2.0 with an Azure service principal.
- Access Azure Data Lake Storage Gen2 or Blob Storage using a SAS token.
- Access Azure Data Lake Storage Gen2 or Blob Storage using the account key.
Alternatively, you can also use Spark configuration properties to access data in an Azure storage account if you do not want to use Python resources, as shown in the resources above.
Below are the steps required to use the Spark configuration for the cluster and SQL warehouse.
Spark Configuration For a Cluster
- Click the Advanced Options toggle found on the cluster configuration page.
- Click on the Spark tab.
- Enter your configuration properties in the Spark Config textbox.
Spark Configuration For A SQL Warehouse
- Click your username in the top bar of Databricks workspace and select Admin Console from the dropdown menu.
- Select the SQL Warehouse Settings tab.
- Enter your configuration properties in the Data Access Configuration textbox.
The example below shows how to access data in an Azure storage account using the storage account key by entering your Spark configuration. In the example, <STORAGE_ACCOUNT> is the name of your Azure storage account, and <STORAGE_ACCOUNT_KEY> is your storage account key.
Step 10: Obtain the JDBC Driver for Databricks
Next, you must obtain the correct JDBC driver for Replicant to connect to your Databricks instance. You can obtain the appropriate JDBC driver for Legacy Databricks and Databricks Unity Catalog through the steps below.
For Legacy Databricks
- Download the JDBC 4.2-compatible Databricks JDBC Driver ZIP.
- From the downloaded ZIP, locate and extract the SparkJDBC42.jar file.
- Put the SparkJDBC42.jar file inside the $REPLICANT_HOME/lib directory.
For Databricks Unity Catalog
- Go to the Databricks JDBC Driver download page and download the driver.
- From the downloaded ZIP, locate and extract the DatabricksJDBC42.jar file.
- Put the DatabricksJDBC42.jar file inside the $REPLICANT_HOME/lib directory.
Step 11: Configure The Replicant Connection For Databricks
At this point, we can begin to plug in our connection details for our new Databricks instance. In this part of the process, we simply provide the Databricks connection details we jotted down from the previous steps to Arcion. Adding those details can be done with the following steps:
1. Go to the Databricks sample connection configuration file found within $REPLICANT_HOME.
2. The connection configuration file obtained has two parts:
- Parameters related to target Databricks connection.
- Parameters related to stage configuration.
Parameters Related to The Target Databricks Connection
If your connection credentials are stored on AWS Secrets Manager, you can retrieve them using Replicant; if not, you will have to input them manually, as seen in the example connection configuration file below.
From the sample, we have the following fields which need values:
url: This is where you replace JDBC_URL with the JDBC you retrieved earlier when creating a cluster or SQL warehouse.
username: This is the username that connects you to your Databricks server.
password: This is the password registered with the username
host: This is where you input the HOSTNAME of your Databricks created earlier when creating a cluster or SQL warehouse.
port: This is the PORT_NUMBER of the Databricks instance created earlier when you created a cluster or SQL warehouse.
max-connections: This specifies the maximum number of connections Replicant can open in Databricks. It can be altered to meet your needs.
Parameters Related to The Stage Configuration
The stage configuration requires an external stage to hold the data files. Then, you’ll load the data onto the target database from there. The example below is a stage configuration for Azure Databricks using a storage account key and SAS token for authentication.
Using a Storage Account Key
Using a SAS Token
From the example, we have the following fields which need to be populated with a value:
type: This refers to the stage type. For Azure Legacy Databricks, set the type to AZURE.
root-dir: This is the directory under the ADLS container, Replicant uses this directory to stage bulk-load files.
conn-url: This is the name of the ADLS container.
account-name: This is the name of the ADLS storage account, corresponding to the same storage account in the Configure ADLS container as the stage section.
secret-key: If you want to authenticate ADLS using a storage account key, specify your storage account key here.
sas-token: If you’re using a shared access signature (SAS) token to authenticate ADLS, specify the SAS token here.
Step 12: Configuring The Mapper File (Optional)
This optional step aids in defining data mapping from your source to Azure Databricks in the Mapper file. You can learn more about the Mapper file from the Mapper configuration and Mapper configuration in Databricks docs.
Step 13: Setting Up The Applier Configuration
Lastly, we must set up our Applier configuration to specify how data should be applied to our Databricks target. To do this, follow the steps outlined below.
1. Go to the applier configuration file found in $REPLICANT_HOME.
2. The configuration file opened has two parts:
- Parameters related to snapshot mode.
- Parameters related to realtime mode.
Parameters Related to Snapshot Mode
For snapshot mode configuration, apply the following changes to the Applier configuration file:
Additional parameters that can be set for snapshot mode are also shown below:
Parameters Related to Realtime Mode
The realtime section in the Applier file is used to specify the configuration for real-time replication, which runs after snapshot mode ends in the case of a CDC setup. To get this to work, we have to enable Type-2 CDC, which allows a target platform to have a history of all transactions performed in the source database. This allows transactions such as an INSERT on the source to be applied as an INSERT on the target, or an UPDATE on the source is an INSERT on the target with additional metadata for each transaction, such as Operation Performed and Time of Operation.
To enable Type-2 CDC for the Databricks target, the following steps should be carried out:
- Add the parameters shown below under the realtime section of the Databricks Applier configuration file.
- Add the following parameter under the snapshot section in the Extractor configuration file.
Step 14: Running Replicant
With that, the last step is to run Replicant to begin the CDC process. To run Replicant, check out the Running Replicant docs page, which outlines the necessary commands.
Advantages of using Arcion
Below are a few of the benefits of using Arcion to load data from Db2 to Databricks:
- Arcion is natively integrated with Databricks Partner Connect, where users can set up production-ready pipelines in minutes with a few clicks. Not to mention, Arcion is the only CDC solution that supports Databrick Unity Catalog.
- Sub-second latency from distributed & highly scalable architecture: Arcion is the only CDC solution with an underlying end-to-end multi-threaded architecture, which supports auto-scaling both vertically and horizontally. Its patent-pending technology parallelizes every single Arcion CDC process for maximum throughput. So users get ultra-low latency and maximum throughput even as data volume grows.
- Arcion allows smooth schema management of your data, thereby ensuring data integrity, reliability, and consistency. As the schema evolves on the MySQL instance, these changes will be applied to the BigQuery instance automatically to keep both in sync.
- Arcion is the only CDC vendor in the market that offers 100% agentless CDC to all its supported 20+ connectors. Arcion’s agentless CDC connectors apply to all the complex enterprise databases modern enterprises use. It reads directly from the transaction logs, never reading from the database itself. Previously, data teams faced administrative nightmares and security risks associated with running agent-based software in production environments. You can now replicate data in real-time, at scale, with guaranteed delivery — but without the inherent performance issues or the security concerns of agent-based connectors.
- Arcion provides transactional integrity and data consistency through its CDC technology. To further this effort, Arcion also has built-in data validation support that works automatically and efficiently to ensure data integrity is always maintained. It offers a solution for both scalable data migration and replication while making sure that zero data loss has occurred.
- Effortless setup & maintenance: Arcion's no-code platform removes DevOps dependencies; you do not need to incorporate Kafka, Spark Streaming, Kinesis, or other streaming tools. So you can simplify the data architecture, saving both time and cost.
- Arcion is SOC 2 Type 1 & Type 2, HIPAA, PCI compliant. The enterprise-grade security and compliance standards ensure data governance.
Method 2: Manual Data Transfer from Db2 to Databricks
The second method of integrating Db2 with Databricks is a more hands-on approach. It involves manually extracting data from Db2 and loading it into Delta Lake on Databricks. This process utilizes a connector, allowing various read and write operations to be performed using the Apache Spark engine.
Step 1: Extracting Data from Db2
One of the prevalent methods to extract data from a relational database is by writing a SELECT query, where you can specify filters, ordering, and limits on the results. Alternatively, the EXPORT command can export data from an entire table. For one-time batches, you can directly query the tables that need to be moved. However, for ongoing replication, you should specify database-specific actions to enable data collection and access for the database user.
To extract data from Db2, a connection must be defined by specifying the connection details. These details include the Host, Port, Database, Username, and Password. If SSL is enabled on the port, activate the SSL property and provide the certificate value for the SSL Certificate.
To create a connection and query data from Db2, you might use a script similar to the following:
You’ll need to replace the values for <your_db>, <your_host>, <your_port>, <your_username>, <your_password>, and <your_table> with your actual database name, hostname, port number, username, password, and table name, respectively.
Once you’ve established that your query returns the required data, you can move on to configuring the Db2 connector to read the extracted data.
Configuring the Db2 Connector as a Source
To configure the Db2 connector for reading table data, you need to set properties for the read mode. This can be achieved on IBM Db2 with the following steps:
1. From the job design canvas, double-click on the Db2 connector stage.
2. On the Properties tab, go to the Usage section and specify the Schema and Table Name to read from. Alternatively, a SELECT statement can be used.
3. The Row Limit and Byte Limit properties can be set to limit the number of rows returned by the stage. The default value for the number of partitions and the Byte Limit is 4.
4. Click on OK, then save the job.
Step 2: Loading Data into Delta Lake on Databricks
To load the data into Databricks, you’ll also need to create a Delta table. Once a Delta table is created, you can write data using Apache Spark's Structured Streaming API. The Delta Lake transaction log ensures exact processing, even when other streams or batch queries run against the table concurrently. In append mode, new records are added to the table as the stream runs.
Creating a Delta Table
You can use Python, R, Scala, or SQL to create a Delta table on Databricks for all reads, writes, and table creation commands. Here's a sample SQL command to create a Delta table:
Reading and Writing to a Delta Table
Now that the Delta table has been created, we can use it. Data in Delta tables can be accessed by either the table name or the table path, as demonstrated below:
Databricks Delta Lake uses standard SQL syntax for writing data to tables. Writing data to the table can be done using append mode. By doing this, we can add new data to an existing Delta table atomically, as shown below:
To replace all the data in a table atomically, you can use overwrite mode:
Disadvantages of Manual Data Loading
The manual method shown above best suits one-time data migration from the source Db2 database. Using this method at scale, for large amounts of data, or repeatedly becomes daunting. This approach becomes increasingly complicated and bulky when updating the data frequently or syncing multiple tables. Doing this would require additional code, making the data engineering processes more complex.
Building the pipeline from scratch involves multiple processes, and there is no fixed set of steps to follow or guidance when facing challenges. These drawbacks are especially prominent when dealing with a high data volume, a high data change rate, or complex transformations. However, these challenges can be mitigated with an automated process, such as using a platform like Arcion.
This blog post has shown you how to configure an IBM Db2 to Databricks by guiding you through the steps needed to create the connection between the two platforms. This was done using two methods: via a manual process and by using Arcion. We took an in-depth look at both Db2 and Databricks and also looked at some of the advantages and disadvantages of replicating data between the two platforms. Ultimately, users were shown the ease of use and scalability of Arcion for powering their modern data stack.
As we covered above, Arcion is an enterprise-ready Change Data Capture (CDC) solution that is profoundly scalable and efficiently integrates with some of the most popular enterprise databases and data warehouses. Arcion is the preferred platform for companies striving to modernize their data infrastructure with its real-time, in-memory CDC solution that ensures data consistency. You can get started today by connecting with our team at Arcion and begin migrating your data from Db2 to Databricks in a matter of minutes.