In today's data-driven world, the ability to effectively manage, analyze, and migrate data is crucial. As businesses continue to grow and evolve, so does the volume and complexity of their data. With this exponential growth, the need for scalable and flexible data management solutions has become more apparent than ever. One such transition that many data scientists and engineers are focused on is migrating and replicating data from IBM's Db2 to Snowflake.
Db2, one of the most widely-used enterprise database management systems out there, has been a staple in the data architecture of many organizations. However, the advent of cloud data platforms like Snowflake, known for its performance, scalability, and flexibility, has sparked interest in using such platforms for business cases that are hard to support on Db2.
In this guide we will look at an overview of Db2 and Snowflake, and detailed methodologies for migrating data between the two platforms. We'll start with an examination of Db2 and Snowflake's features, followed by a step-by-step walkthrough of data migration using Arcion, a trusted tool in the field, and its benefits. Lastly, we will take a look at using a more custom and manual approach to extracting and loading data into Snowflake from Db2 and its shortfalls. First stop, let’s take a look at IBM’s Db2 platform.
Introduction to Db2
Db2 is a suite of products developed by IBM that aid in the storage, retrieval, and analysis of data. Known for its footprint in some of the largest enterprises on earth, Db2 supports both structured and unstructured data types and is equipped to handle mission-critical workloads.
Db2 was first introduced to the market in 1983 and was built specifically for IBM platforms. Since then, it has evolved to support various operating systems, including Linux, Windows, and Unix, through a common codebase represented as Linux-Unix-Windows (LUX). Db2 is commonly used by companies and corporations that utilize IBM servers and require management of big data, as it is reliable, production-ready, performant, and scalable. The track-record for Db2 allows it to be held in high regard.
The data management products provided as part of the Db2 line are available for on-premises installation and as fully managed cloud offerings. This gives organizations the flexibility to choose the deployment type that best fits their use case. Like the offerings of other similar data platforms, some organizations can even opt for a hybrid strategy where proprietary or private data is stored on-premises, while more user-facing data is stored in the cloud.
Db2 is utilized heavily across several industries like finance, manufacturing, and automotive industries. Db2 has also become a favorite tool for data engineers, data architects, and data analysts, as it can help with designing, managing, optimizing, and monitoring data workflows throughout an organization.
The range of products under the Db2 umbrella include Db2 Database and Db2 Warehouse. Db2 Database, usually just referred to as Db2, is a Relational Database Management System (RDBMS) capable of handling enterprise-grade data and is particularly performant for Online Transaction Processing (OLTP) style deployments. Db2 Database can be deployed as a local, on-premises database that can be installed on servers run on an organization’s private infrastructure. Db2 Warehouse is a data warehouse product capable of handling analytics and machine learning workloads. This gives it an advantage over Db2 Database since it supports the storage of more diverse data types. It also features parallel processing capabilities and is available as an on-premises solution.
Db2 on Cloud and Db2 Warehouse on Cloud are similar in functionality to Db2 and Db2 Warehouse, with the key difference being that they are cloud offerings. As cloud offerings, both platforms offer a fully managed cloud experience where end users do not have to worry about the installation or management of the underlying software. Depending on the expertise of the company running the Db2 instances, this can sometimes be preferred.
Db2 also offers a product that caters to big data known as Db2 Big SQL. Db2 Big SQL is a SQL-on-Hadoop engine that provides superior query processing functionality and parallel operations suitable for handling data workloads with large volumes and velocity. Db2 Event Store is the Db2 product for in-memory data storage and utilizes an optimized in-memory database for event-driven applications capable of analyzing streaming data. This blog will specifically only focus on Db2 database migrations.
Features of Db2
Db2 has many great features that distinguish it from the many other database products and solutions currently available on the market. In this section, we will look at some of the standout features of the platform.
Advanced SQL Engine
Db2 encompasses a unified SQL engine that facilitates the writing of a query once, thereby making it immediately operable across the entire spectrum of Db2 products and platforms. As the Structured Query Language (SQL) was initially developed by IBM, Db2, being part of this ecosystem, supports SQL functionality and querying capabilities at an exceptionally high performance level. For example, a SQL query written for the Db2 Database can be directly applied to Db2 Warehouse without any modification.
Artificial Intelligence (AI) Functionality
Unlike some traditional database products that lack the inherent capability to exploit emerging AI technologies, Db2 possesses the ability to apply machine learning strategies. This empowers Db2 to optimize queries, enhance performance and efficiency, minimize overhead, and provide automatic data skipping, ensuring redundant data are not included in query results.
Data Types Support
Db2 accommodates various data types, including both structured and unstructured data. For instance, data in Db2 can be stored in a tabular form either as rows or columnar data. Moreover, Db2 can handle unstructured data types, such as XML and JSON, thereby providing extensive support for a wide range of data types. This versatility makes Db2 an adaptable solution for diverse data requirement criteria across projects.
Db2 stands as a highly available data solution, with data stored in Db2 products being readily accessible. It possesses several features to combat data disasters, employing recovery processes that ensure data remains available for querying and use.
Db2 is also notable for its scalability, allowing users to expand the capacity of their workloads. As it is available in cloud environments, users can augment their storage and computational power through cloud deployments in a managed process that does not disrupt their existing operations. For example, if a company's data workload grows, they can easily scale up their Db2 on Cloud deployment to accommodate this increased demand.
Introduction to Snowflake
Snowflake is one of the most popular data platforms in existence. It has been rapidly adopted by some of the largest organizations across almost every industry. Snowflake is an advanced, cloud-based data platform offering data storage and analytics services. Described as a data-as-a-service product, it empowers end users to manage both transactional and analytical workloads in a cloud environment that is fully-managed by Snowflake. This means that Snowflake requires no installation, configuration, management, or monitoring of any software or service. Notably, Snowflake operates exclusively in the cloud, with every aspect - from management to feature introduction - handled by the Snowflake team. Users can access Snowflake through a web browser and experience all of the functionality without any system or infrastructure management necessary.
One of Snowflake's key innovations lies in its separation of computation and storage, a design choice that allows each component to scale independently of the other. This results in reduced storage costs and improved query processing times. Snowflake's architecture is also unique since it is built from the ground up as a cloud-native solution. Snowflake is not based on any existing big data technology, but instead designed specifically for the cloud and built from the ground-up.
Even though the introduction of Snowflake has ushered a new kind of data product for the cloud age, Snowflake continues to support technologies that data engineers, architects, and administrators are already familiar with. For instance, Snowflake is SQL-compatible and supports most standard Data Definition Language (DDL) and Data Manipulation Language (DML) operations. This makes the learning curve for the platform manageable and allows even new users, with a data engineering background, to quickly utilize it.
Snowflake’s architecture is composed of three layers: the database storage layer, the query processing layer, and the cloud services layer. The database storage layer serves as Snowflake's central data repository. Data stored in this layer is optimized in a columnar format, with Snowflake managing the organization, structure, and compression of the files. However, the objects stored in this layer cannot be accessed directly and can only be retrieved through standard SQL queries.
The second layer is the query processing layer. The query processing layer leverages the concept of virtual warehouses to enhance query performance. Each virtual warehouse is a Massively Parallel Processing (MPP) compute cluster that can access data without impacting the performance of other virtual warehouses.
Lastly, the third layer is the cloud services layer which oversees various services that ensure all components function well and communicate with each other. These managed services include authentication, infrastructure management, access control, and metadata management, among others.
Features of Snowflake
Snowflake’s popularity has grown from the wide array of use cases in which it is applicable to handle. Let’s look at some of the most highlighted features of Snowflake below.
Snowflake is a platform designed specifically for cloud workloads, resulting in superior performance compared to data solutions that have been adapted for the cloud. It is cloud-agnostic, meaning it operates seamlessly on major cloud service providers such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure. The choice of cloud provider does not impact Snowflake's performance or usability, allowing users to select the provider they are most comfortable with.
Thanks to its architecture, which separates computation and storage, Snowflake offers impressive scalability. Users can efficiently scale up or scale out components as needed, paying only for the services they use. This approach to scaling gives businesses and organizations considerable flexibility in designing their data strategies and saves both time and money.
Data Types Support
Snowflake accommodates both structured and semi-structured data types. It even boasts a unique data type called VARIANT, which allows for the storage of semi-structured data, such as JSON, alongside structured data types. Snowflake can also connect to objects in AWS, GCP, and Azure staging areas, facilitating data extraction pipelines.
Security and Data Governance
Snowflake enables users to comply with data governance policies by deploying instances in specific geographical regions. Security is integral to the Snowflake platform, supported by various authentication methods including Multi-factor Authentication (MFA), Federated Authentication, Single Sign-On (SSO), and External OAuth. All data is encrypted both at rest and in transit, and communication between clients and the server is secured by Transport Layer Security (TLS). Granular access control at the object level further increases the platform's security.
Snowflake's Time Travel feature allows users to view or restore their data in a historical context. Standard to all accounts, Time Travel can be extended up to 90 days in the past, depending on the user's Snowflake subscription tier. This feature enables users to 'undrop' objects, essentially restoring them, provided the drop event occurred within the user's Time Travel window.
How to Migrate Data from Db2 to Snowflake
There are quite a few different ways that you can migrate your data from Db2 to Snowflake. Figuring out what your use case is, the scale at which you need to implement, and other factors play a crucial part in deciding which method is best. In this section, you will be introduced to two popular methods - migrating data from Db2 to Snowflake using Arcion, a real-time, in-memory Change Data Capture (CDC) platform that is designed for infinite scalability and low latency. The second method we explore will be based on using custom code to migrate data, which can be seen as a more manual process.
Method 1: Db2 to Snowflake Using Arcion
The first approach to data migration we will look at is Arcion. Arcion provides a reputable in-memory Change Data Capture (CDC) solution that offers users petabyte-level scalability, data consistency, and data integrity. It is a great solution for use cases where data migration and/or data replication is required since it can integrate with multiple enterprise databases and data warehouses with pre-built connectors.
Arcion is available in two types of deployments, Arcion Self-hosted and Arcion Cloud. In this section, we will be using Arcion Self-hosted. Arcion Self-hosted has two ways that can be used to set up a CDC pipeline for replication and migration: via CLI & via GUI. For maximum flexibility, the CLI version will be used in the rest of this article since it supports the full suite of Arcion’s available connectors.
The Arcion CLI tool is called Replicant and is the tool we will use to migrate data from Db2 to Snowflake. To make sure that the platform will work correctly, you’ll need to ensure that the Snowflake user that Arcion will connect with possesses CREATE TABLE and CREATE STAGE privileges on the catalogs or schemas where Replicant will replicate tables. You’ll also need CREATE DATABASE and CREATE SCHEMA privileges granted to the user to create catalogs or schemas on the target Snowflake system. With those prerequisites handled, let us dive into replicating data using Arcion which is precisely mapped out in steps beneath.
Step 1: Download And Install Arcion Self-hosted
The first thing to do is to download and install Arcion Self-hosted. To do this, you’ll need to contact the Arcion team to get the Arcion program as well as a license key. Once you have access, the process would include downloading Arcion’s CLI tool Replicant, creating a home directory where Replicant was downloaded, referred to as $REPLICANT_HOME throughout this guide, and finally, adding your license. You can reference the quick start guide for more details on how to do these steps in detail.
Step 2: Enable Permissions on Db2
Ensure that the Db2 user that will be used as part of the replication process has read access to all the databases, schemas, and tables to be replicated. The user should also have read access to the following system views and tables:
The user must also be able to execute permission on the system procedures listed below:
Step 3: Create the Heartbeat Table for CDC Replication
To enable CDC replication with Arcion via Db2, you’ll need to create a heartbeat table on Db2. Below is an example script to set up a heartbeat table for use with Arcion CDC.
After creating the heartbeat table in Db2, we can then move forward with enabling CDC.
Step 4: Enable CDC Replication
The following steps will allow you to enable CDC replication in your Db2 instance.
- First, you must run the command below for every table you want to replicate. This command will enable change events to be captured which is necessary for enabling CDC.
- Next, you can check if the Db2 database is recoverable by running the command below.
- If the Db2 database is not recoverable, you can make it so by setting LOGARCHMETH1 or LOGARCHMETH2. An example of the commands to run to enable this are shown in the commands below.
In the setup shown above, it is assumed that you are running Replicant from the same server where the source Db2 database is deployed. If that is not the case, you’ll need to configure some environment variables for Db2 and install the Db2 Data Server Client in order to make Replicant work across servers. For more information on how to do this, check out the Arcion Db2 Source docs.
Step 5: Configure Replicant for Db2
Now we will actually begin to dial in our Db2 configuration within Arcion. In the $REPLICANT_HOME/conf/conn directory, You can find a sample connection configuration file called db2_src.yaml. You will need to open this file and from here you will configure the Db2 connection. In the example below, you can see how to configure the file with a username and password for authentication.
The node key in the example above represents the name of the Db2 node you are to connect to. The value of READ_LOG in cdc-log-storage informs Replicant that you want to use the native db2ReadLog as your log reader for the CDC process.
Step 6: Set up Extractor Configuration
After the connection configuration file has been set, you can set up the extractor configuration file. Depending on how the file is set, you can enable snapshot mode, delta snapshot mode, or real-time mode. In this example, we will set up the extractor to do a snapshot load of the data. Essentially, this will take our data and migrate it to Snowflake once. To keep the two platforms in sync, you’ll also need to set up the extractor for realtime mode. To set it up for snapshot mode, go to $REPLICANT_HOME and issue the command below to open the Extractor configuration file.
Once opened, you can use the below example as a sample configuration for snapshot mode, replacing the value as necessary based on your Db2 instance.
For my specifics and advanced configurations, check out the Extractor config docs. With this configuration set, we are done with our Db2 connector configuration and can move onto configuring Snowflake.
Step 7: Set up connection configuration for Snowflake
From $REPLICANT_HOME, the directory where Arcion is installed, go to the sample Snowflake Connection configuration file, snowflake.yaml, located in the conf/conn/ directory.
The configuration file for setting up the connection for Snowflake is divided into two parts: the Snowflake connection-related parameters and the stage configuration-related parameters.
For Snowflake connection-related parameters, to set up the target Snowflake instance, two methods can be used to get an authenticated connection, they are RSA key pair authentication or basic username and password authentication. To use the basic username and password authentication, you can get your credentials by using Replicant to retrieve the credentials from AWS Secrets Manager if you are using it. Alternatively, you can specify your credentials in the connection configuration file as seen below.
From the example above, the value definitions for each field are as follows:
type: This indicates the connection type representing the database, it is Snowflake in this case.
host: This shows the hostname of your Snowflake instance.
port: This is the port number to connect to the host.
warehouse: This is the name of the Snowflake warehouse.
username: This refers to the username credential to access the Snowflake system.
password: This is the password associated with the username.
max-connections: This is used to indicate the maximum number of connections Replicant uses to load data into the Snowflake instance.
Having set up the Snowflake connection-related parameters, you can also set up the stage configuration-related parameters. The stage configuration enables users to tune native or external staging areas for bulk loading.
The RSA key pair for authentication is going to be used to configure the parameters related to stage configuration. Snowflake’s key pair authentication support is used for enhanced authentication security as an alternative to the basic username and password. To use this, follow the link above to go through the procedures for generating a private key, generating a public key, storing both keys, assigning the public key to a Snowflake user, and verifying the user’s public key fingerprint. Finally, editing the connection configuration with these new details for the stage configuration should also be done.
Step 8: Setting Up The Applier Configuration
Next, we need to add the Applier configuration for Snowflake. The Applier configuration file contains all the parameters that Replicant uses while loading data into Snowflake. While it is not necessary to modify most of the parameters, they can be adjusted to optimize the Applier job as necessary
From $REPLICANT_HOME, go to the sample Snowflake Applier configuration file:
The configuration file for setting up the Applier configuration is divided into two parts namely the parameters related to snapshot mode and the parameters related to real-time mode. For our purposes, we will only set the parameters for snapshot mode.
In the Applier configuration file ,for snapshot mode, we will make the changes below:
With this, the Snowflake connector has been configured and we can now run Replicant Arcion’s program for running the CDC pipeline which will be the final step in the Db2 to Snowflake connection using Arcion.
Step 9: Running Replicant
From $REPLICANT_HOME, run the following command to execute Replicant in snapshot mode. Once the command is executed, Replicant starts up the pipeline immediately and will begin migrating data from Db2 to Snowflake.
After the process has completed, your Db2 data will now have been migrated over to Snowflake.
Advantages of using Arcion
As seen in the example above, Db2 and Snowflake are fully supported by Arcion. The ability to migrate and replicate data from Db2 to Snowflake can be done through a few simple steps. For example, you can set up a streaming pipeline in Arcion using Change Data Capture to replicate changes from Db2 to any of the other destination databases supported, like Snowflake, or others like Oracle, DB2, SAP (ASE, HANA, IQ), SingleStore, and many more. Below are some highlights of using Arcion for data migrations and Change Data Capture.
Many other existing CDC solutions don’t scale for high-volume, high-velocity data, resulting in slow pipelines, and slow delivery to the target systems. Arcion is the only distributed, end-to-end multi-threaded CDC solution that auto-scales vertically & horizontally. Any process that runs on Source & Target is parallelized using patent-pending techniques to achieve maximum throughput. There isn’t a single step within the pipeline that is single-threaded. It means Arcion users get ultra-low latency CDC replication and can always keep up with the forever-increasing data volume on Source.
100% Agentless Change Data Capture
Arcion is the only CDC vendor in the market that offers 100% agentless CDC to all its supported enterprise connectors. The agentless CDC applies to all complex enterprise databases including Db2 and Snowflake. Arcion reads directly from the database transaction log, never reading from the database itself. Previously, data teams faced administrative nightmares and security risks associated with running agent-based software in production environments. You can now replicate data in real-time, at scale, with guaranteed delivery — but without the inherent performance issues or the security concerns.
Data Consistency Guaranteed
Arcion provides transactional integrity and data consistency through its CDC technology. To further this effort, Arcion also has built-in data validation support that works automatically and efficiently to ensure data integrity is always maintained. It offers a solution for both scalable data migration and replication while making sure that zero data loss has occurred.
Automatic Schema Conversion & Schema Evolution Support
Arcion handles schema changes out of the box requiring no user intervention. This helps mitigate data loss and eliminate downtime caused by pipeline-breaking schema changes by intercepting changes in the source database and propagating them while ensuring compatibility with the target's schema evolution. Other solutions would reload (re-do the snapshot) the data when there is a schema change in the source databases, which causes pipeline downtime and requires a lot of compute resources, which can get expensive!
Method 2: Db2 to Snowflake With Custom Code
To migrate or replicate data from Db2 to Snowflake using a custom code approach, the process typically involves transferring data from the source database to a staging area. In Snowflake, this stage could be internal or external, pointing to objects in a cloud provider such as Amazon S3. Data in the staging area is then imported into Snowflake using various methods, including Snowpipe, the COPY INTO command, or tasks. It's crucial that the Snowflake table receiving the data supports the same data types and schema as the source database table. Here are the steps for this process:
Step 1: Extract Data from Db2
With the EXPORT utility in Db2, you can export data from specific tables or the entire database. To do so, you need DATAACCESS authority, the CONTROL privilege, or the SELECT privilege on all tables you want to export. Below is an example of how you can use the EXPORT command:
In this command, your_data.ixf is the filename where the exported data will be stored, and your_table is the table from which you're exporting data.
Step 2: Create a Snowflake Stage
A Snowflake stage serves as a temporary location for your data before it's imported into Snowflake. As Snowflake doesn't directly pull data from databases, it relies on the concept of a stage. You can create an external stage using AWS or your preferred cloud provider. Ensure that Snowflake has the necessary permissions to access the relevant files. Below is an example of how to create a stage in Snowflake using AWS and S3:
In this command, replace your_bucket/your_folder/, your_aws_key_id, your_aws_secret_key, and your_encryption_key with your actual values before running the command.
Step 3: Import Data into Snowflake
Lastly, you can import data into Snowflake from the external stage using the COPY INTO command. The data will be loaded into Snowflake tables. Here's an example command:
Replace your_snowflake_table with the actual name of your Snowflake table. After running the command, the data from the staging area will be moved into the table in Snowflake.
Limitations and Challenges of Using Custom Code Method
The followings are the limitations and challenges associated with the method above and using custom code to extract and load data from Db2 to Snowflake:
- The custom code method described here is more of a manual process and does not necessarily support real-time replication or migration.
- The custom code method shown requires expertise on both Db2 and Snowflake and an inherent understanding of the schema structures and how to convert between them without suffering data loss.
- This method of connecting Db2 to Snowflake can be difficult to implement and errors can easily creep up if not handled properly. This can create issues which may not appear until later in time or as you start to scale/move more data.
We've taken a comprehensive look at the process of migrating data from Db2 to Snowflake in this article, offering a deep dive into both platforms and their individual features. We've explored two distinct migration methods, underscoring the potential pitfalls of creating a custom code solution from scratch.
Arcion, however, offers a robust and intuitive alternative. By opting for Arcion, you're choosing a solution that boosts your data team's productivity, streamlines the migration process, and ensures the scalability and performance of your data platform. This is a choice that not only simplifies your migration and replication journey but also enhances the value you extract from your data.
If you're ready to get started with Arcion and implement a robust solution for migrating Db2 to Snowflake, contact the Arcion team today! Let our team of Db2 to Snowflake migration experts assist you in making your migration and replication projects a success.