Moving data from a primary application database to an analytics platform can uncover massive insights and opportunities. The amount of data that enterprises have contains value that is usually underserved by traditional application databases. From this, the rise of big data and data analytics platforms was born.
One of the most popular databases in existence today is MySQL. Powering a good chunk of the internet, MySQL is the backend for some very large applications. Because of the sheer volume of MySQL deployments out there, there is a lot of data ready to be harnessed. While MySQL can take businesses a long way in terms of garnering insights, Snowflake has become the specialized analytics platform of choice.
To garner insights from Snowflake, you need to get data into it. For that, you’ll need to build a data pipeline to move data from MySQL to Snowflake. Fortunately, this is exactly what we will cover in this blog in two different flavors. First, we will go over using Arcion to move data from MySQL to Snowflake. After that, we will look at how to move data from MySQL to Snowflake with custom code. Let’s start by taking a deeper look at MySQL and Snowflake first.
Introduction to MySQL
MySQL is one of the most popular open-source Relational Database Management Systems (RDBMS) in existence. The platform runs on a client-server model and has numerous features that are ideal for both small and large applications. Since MySQL is open-source, anyone can use it by downloading the software and modifying the source code to meet their business needs.
MySQL’s popularity is driven by its high performance, reliability, and the fact that it is very easy to use and deploy. Due to its ease of use, it became a staple in one of the most popular web app stacks used across the internet: the LAMP stack. It is compatible with many computing platforms such as Microsoft Windows, macOS, and Linux distributions like Ubuntu, and supports connectivity through languages like PHP, Python, Java/JDBC, C++, C, ODBC, Perl, and Ruby. It also supports open-source applications like Drupal, Joomla, Magento, and other popular CMS platforms, with the most popular CMS use case being WordPress. Battle-tested as one of the most popular database solutions for many of the world’s largest web applications, it has become a go-to for big organizations such as Facebook, Twitter, Netflix, Uber, and others with large volumes of data and users accessing their applications.
MySQL also has variations such as MySQL HeatWave, a fully managed service that enables customers to run OLTP, OLAP, and machine learning workloads directly from a MySQL database. Also available is MySQL Enterprise Edition with its complete set of advanced features, management tools, and technical support. Others include MySQL for OEM/ISV and MySQL Cluster CGE, used for web, cloud, and communications services.
Introduction to Snowflake
Snowflake is a cloud-based Software-as-a-Service (SaaS) data warehouse and analytics platform that enables customers to analyze data. Its cloud data warehouse is built on Amazon Web Services (AWS), Microsoft Azure, and Google Cloud infrastructure, providing a platform for storing and retrieving data. It supports standard ANSI SQL and handles both structured and semi-structured data such as JSON, Parquet, XML, and many other formats.
With Snowflake, there is no hardware or software to install, configure, or manage since the platform is fully managed by Snowflake itself. This means organizations do not need dedicated resources for setup, maintenance, and support of the platform, as they would for on-premises deployments. Everything runs in the cloud and requires near-zero administration.
Snowflake’s unique architecture separates its storage layer from its compute layer, allowing each to scale independently so customers can use and pay for storage and compute separately. It is flexible and highly scalable, able to accommodate massive amounts of data and many concurrent users. For further information on Snowflake, visit the official website.
How to Migrate Data from MySQL to Snowflake
Below we will look further into how to migrate and replicate data from MySQL to Snowflake. The methods below cover both an automated approach and a more manual, custom approach. For each method, we will cover the advantages and disadvantages that can help inform which method you decide to implement. With that, let’s start by looking at how to move data from MySQL to Snowflake with Arcion.
Method 1: MySQL to Snowflake Using Arcion
Arcion is an enterprise-ready Change Data Capture (CDC) solution that offers users petabyte-level scalability, data consistency, and data integrity. It is a viable solution for both data migration and data replication since it can easily integrate with some of the most popular enterprise databases and data warehouses.
Arcion’s CLI tool, Replicant, is what we will use to migrate data from MySQL to Snowflake. To enable this to work correctly, you’ll need to ensure that the Snowflake user that Arcion will connect with possesses CREATE TABLE and CREATE STAGE privileges on the catalogs or schemas where Replicant will replicate tables. You’ll also need CREATE DATABASE and CREATE SCHEMA privileges granted to the user to create catalogs or schemas on the target Snowflake system. With those prerequisites out of the way, let's dive into implementing CDC with Arcion through the steps shown below.
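As a rough illustration only, the grants might look like the following, assuming a hypothetical role arcion_role (assigned to the Arcion service user) and a hypothetical database analytics_db; adapt the object names to your own environment.

```sql
-- Illustrative grants for the role used by the Arcion service user (names are placeholders)
GRANT CREATE DATABASE ON ACCOUNT TO ROLE arcion_role;
GRANT CREATE SCHEMA ON DATABASE analytics_db TO ROLE arcion_role;
GRANT CREATE TABLE, CREATE STAGE ON SCHEMA analytics_db.public TO ROLE arcion_role;
```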
Step 1: Download And Install Arcion Self-hosted
The first thing we will need is to configure Arcion. For this, we will need to download and install the self-hosted version of Arcion. To gain access, you will be required to follow the steps outlined on the Arcion Self-hosted webpage. Once you have downloaded Arcion’s CLI tool Replicant, you will need to create a home directory for it. By default, the home directory will be where Replicant was downloaded. This home directory, from now on, will just be referred to as $REPLICANT_HOME in the examples below. The last step is adding your credentials to your Arcion instance, which is covered in the quick start guide that you can reference for more details on how to do this.
Having downloaded and installed Arcion Self-hosted, we can move on to the next steps where we will configure and enable Arcion to connect with MySQL and Snowflake.
Step 2: Install mysqlbinlog Utility on the MySQL Server
Next, we will need to install the mysqlbinlog utility on the machine where Replicant will be running. If you have the MySQL client already installed, chances are you will already have the correct mysqlbinlog installed as well. You can test that by using the following command:
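Assuming the utility is on your PATH, printing its version is a quick check:

```bash
mysqlbinlog --version
```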
This command will display the version information of the mysqlbinlog utility if it is installed correctly. Ensure that the mysqlbinlog installed is compatible with the source MySQL version.
As an example, to install the mysqlbinlog utility on Linux through the command-line interface (CLI), you can follow these steps:
1. Open a terminal or shell session on your Linux machine.
2. Update the package manager's repository information by running the following command:
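On a Debian- or Ubuntu-based system, for example, this would be:

```bash
sudo apt-get update
```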
3. Once the repository information is updated, you can install the mysqlbinlog utility by running the following command:
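On Debian or Ubuntu, for example, the package is typically named mysql-client (the name may differ on other distributions):

```bash
sudo apt-get install mysql-client
```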
This command will install the MySQL client package, which includes the mysqlbinlog utility.
4. During the installation process, you may be prompted to enter your password. Provide the password for the sudo command to continue with the installation.
5. After the installation is complete, you can verify that the `mysqlbinlog` utility is installed by running the following command:
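As before, printing the version confirms the utility is available:

```bash
mysqlbinlog --version
```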
Step 3: Enable Binary Logging in the MySQL Instance
For Arcion to accurately track the changes in the source MySQL instance, you’ll need to enable binary logging. To do this, carry out the following steps:
- Open the MySQL option file, my.cnf. This file, also known as the MySQL configuration file, is usually located in the /etc/mysql or /etc directory on Linux systems; however, the exact location may vary depending on the Linux distribution and how MySQL was installed. Once the file is found and opened, add the lines below to specify the base name to use for the binary log files and to set the binary logging format.
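The lines to add would look something like the following; the log base name shown here is only an example, and ROW format is what log-based CDC generally requires:

```ini
# add these under the [mysqld] group of my.cnf
log-bin=mysql-log.bin
binlog_format=ROW
```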
- Then, use the command below to export the $MYSQL_HOME path.
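For example, assuming a typical default location (adjust the path to the directory where your server-specific my.cnf resides):

```bash
export MYSQL_HOME=/var/lib/mysql
```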
- To enable binary logging, we will need to restart the MySQL instance. Restart MySQL using the command below:
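On a systemd-based Linux host, this is typically:

```bash
sudo systemctl restart mysql   # the service may be named mysqld on some distributions
```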
- Lastly, confirm if binlogging is turned on by logging into the MySQL instance and running the following statements:
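For example:

```sql
SHOW VARIABLES LIKE 'log_bin';        -- should return ON
SHOW VARIABLES LIKE 'binlog_format';  -- should return ROW
```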
Step 4: Setting Up MySQL User for Arcion
For Arcion and Replicant to work correctly, it is best to create a user for Arcion to use to log into and access the MySQL instance. Below are a few steps and examples of SQL statements that show exactly how to do this. You’ll want to replace variables like 'username' and 'password' with your values.
- Create a MySQL user.
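A minimal example, with placeholder values to replace:

```sql
-- 'replicant_host' is the host Replicant connects from; replace all values with your own
CREATE USER 'username'@'replicant_host' IDENTIFIED BY 'password';
```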
- Grant privileges on all the tables that would be involved in the replication.
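For example, granting read access on every table in the database being replicated (placeholder names):

```sql
GRANT SELECT ON <user_database>.* TO 'username'@'replicant_host';
```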
- Grant the following Replication privileges for the newly created user.
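The replication privileges are global, so they are granted on *.*:

```sql
GRANT REPLICATION CLIENT ON *.* TO 'username'@'replicant_host';
GRANT REPLICATION SLAVE ON *.* TO 'username'@'replicant_host';
```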
- Lastly, log into the MySQL instance using the new user and confirm whether the newly created user can access the bin logs.
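For example, after connecting as the new user:

```sql
-- requires the REPLICATION CLIENT privilege granted above
SHOW BINARY LOGS;
```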
Step 5: Setting Up The Connection Configuration For MySQL
With our MySQL instance configured and the necessary user(s) created, we can start to plug the connection details into Arcion. First, go to the sample connection configuration file found on $REPLICANT_HOME and open it up for editing.
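Assuming the sample files follow the usual Replicant layout, this would be something along these lines (the exact filename may differ in your release):

```bash
vi conf/conn/mysql.yaml
```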
In the connection configuration file, we will need to add our values so Arcion can connect to our MySQL instance. If your connection credentials are stored on AWS Secrets Manager, you can retrieve them using Replicant; if not, you will have to input them into the configuration file as seen in the example below.
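A minimal sketch of the file, using placeholders for the values described below:

```yaml
type: MYSQL

host: HOST
port: PORT

username: "USERNAME"
password: "PASSWORD"

max-connections: MAX_NUMBER_OF_CONNECTIONS
```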
From the example above, you will need to supply values for the following fields:
HOST: This refers to the hostname of the MySQL instance.
PORT: This is the port number of the MySQL instance.
USERNAME: This refers to the username of the user that connects to your MySQL instance.
PASSWORD: This is the password associated with the user.
MAX_NUMBER_OF_CONNECTIONS: This is used to specify the maximum number of connections Replicant can open in MySQL.
Step 6: Setting Up The Filter Configuration
The next step in configuring Arcion is to set up the Filter configuration file. This file will let Arcion know which data collections, tables, or files to replicate to best suit your replication strategy.
- Go to the sample Filter configuration file found on $REPLICANT_HOME.
- Within the file, specify the data to be replicated based on your needs. Below is a template of the format to be used within the file.
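A sketch of what the Filter file might contain, assuming it lives at a path like filter/mysql_filter.yaml and follows the usual allow-list layout; check the sample file shipped with Replicant for the exact schema. Table and column names here are placeholders.

```yaml
allow:
- catalog: "<your_database>"
  types: [TABLE]
  allow:
    table_one:
    table_two:
      allow: ["column_a", "column_b"]
```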
Step 7: Setting Up The Extractor Configuration
For real-time replication, a heartbeat table must be created in the source MySQL. To create a heartbeat table in your MySQL instance, do the following steps.
- In the catalog/schema you are going to replicate, create a heartbeat table with the DDL as shown below. Remember to replace the <user_database> with the name of your specific database.
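A DDL sketch along these lines; the table and column names must match whatever your Extractor heartbeat configuration uses, so treat the names below as assumptions to verify:

```sql
CREATE TABLE `<user_database>`.`replicate_io_cdc_heartbeat` (
  `timestamp` BIGINT NOT NULL,
  PRIMARY KEY (`timestamp`)
);
```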
- Grant the user configured for replication INSERT, UPDATE, and DELETE privileges for the heartbeat table.
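For example:

```sql
GRANT INSERT, UPDATE, DELETE
  ON `<user_database>`.`replicate_io_cdc_heartbeat`
  TO 'username'@'replicant_host';
```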
- Next, you’ll need to configure the Extractor configuration file. From $REPLICANT_HOME, open the Extractor configuration file.
- In the Extractor configuration file, make the changes below in the realtime section to set up the heartbeat configuration.
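Assuming the Extractor file lives at a path like conf/src/mysql.yaml, the heartbeat portion of its realtime section might look like this; the key names are based on typical Replicant samples and should be checked against the file in your release:

```yaml
realtime:
  heartbeat:
    enable: true
    catalog: "<user_database>"
    table-name: replicate_io_cdc_heartbeat
    column-name: timestamp
```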
With that final step, our MySQL instance is now configured to work with Arcion. Next, we need to set up the Snowflake target.
Step 8: Setting Up The Connection Configuration for Snowflake
From $REPLICANT_HOME, the directory where Arcion is installed, go to the sample Snowflake Connection configuration file, snowflake.yaml, located in the conf/conn/ directory.
The configuration file for setting up the connection for Snowflake is divided into two parts: the Snowflake connection-related parameters and the stage configuration-related parameters.
For the Snowflake connection-related parameters, there are two methods to establish an authenticated connection to the target Snowflake instance: RSA key pair authentication or basic username and password authentication. To use basic username and password authentication, you can retrieve your credentials with Replicant from AWS Secrets Manager if you are using it. Alternatively, you can specify your credentials in the connection configuration file as seen below.
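A minimal sketch of the connection file, using placeholders for the fields described below:

```yaml
type: SNOWFLAKE

host: HOST
port: PORT
warehouse: "WAREHOUSE_NAME"

username: "USERNAME"
password: "PASSWORD"

max-connections: 20
```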
From the example above, the value definitions for each field are as follows:
type: This indicates the connection type representing the database; in this case, Snowflake.
host: This shows the hostname of your Snowflake instance.
port: This is the port number to connect to the host.
warehouse: This is the name of the Snowflake warehouse.
username: This refers to the username credential to access the Snowflake system.
password: This is the password associated with the username.
max-connections: This is used to indicate the maximum number of connections Replicant uses to load data into the Snowflake instance.
Having set up the Snowflake connection-related parameters, you can also set up the stage configuration-related parameters. The stage configuration enables users to tune native or external staging areas for bulk loading.
The RSA key pair authentication method is used when configuring the parameters related to the stage configuration. Snowflake’s key pair authentication support provides enhanced security as an alternative to basic username and password authentication. To use it, follow the link above to go through the procedures for generating a private key, generating a public key, storing both keys, assigning the public key to a Snowflake user, and verifying the user’s public key fingerprint. Finally, edit the connection configuration with these new details for the stage configuration.
Step 9: Setting Up The Applier Configuration
Next, we need to add the Applier configuration for Snowflake. The Applier configuration file contains all the parameters that Replicant uses while loading data into Snowflake. While it is not necessary to modify most of the parameters, they can be adjusted to optimize the Applier job as necessary.
From $REPLICANT_HOME, go to the sample Snowflake Applier configuration file:
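Assuming the usual layout, that would be something like:

```bash
vi conf/dst/snowflake.yaml
```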
The configuration file for setting up the Applier is divided into two parts: the parameters related to snapshot mode and the parameters related to realtime mode.
To configure the parameters related to snapshot mode
For snapshot mode, make the changes below:
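A sketch of the snapshot section; the parameter names and values here are assumptions based on typical Replicant Applier samples and should be verified against the sample file in your release:

```yaml
snapshot:
  threads: 16
  batch-size-rows: 100000
  txn-size-rows: 1000000
```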
To configure the parameters related to realtime mode
The realtime section in the snowflake.yaml config file is used to configure the parameters related to realtime mode, below is a sample code showing how to set up the config file for the realtime mode.
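Again, an illustrative sketch only; confirm the exact keys against the shipped sample file:

```yaml
realtime:
  threads: 8
  max-retries-per-op: 30
  retry-wait-duration-ms: 5000
```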
With this, the Snowflake connector has been configured, and we can now run Replicant, Arcion’s program for running the CDC pipeline. This is the final step in the MySQL to Snowflake connection using Arcion.
Step 10: Running Replicant
From $REPLICANT_HOME, run the following command to execute Replicant in realtime mode. When Replicant starts up, the pipeline will be immediately activated after the command is executed.
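Assuming the configuration file paths used in the earlier steps, the invocation would look roughly like this; check your Replicant release for the exact options:

```bash
./bin/replicant realtime conf/conn/mysql.yaml conf/conn/snowflake.yaml \
  --extractor conf/src/mysql.yaml \
  --applier conf/dst/snowflake.yaml \
  --filter filter/mysql_filter.yaml
```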
After executing the command and the pipeline is started in realtime mode, Replicant does the following:
- Replicant first creates the destination schemas if they are not already present. If the destination schemas are present, Replicant appends to the existing tables.
- Replicant starts replicating real-time operations performed within the MySQL instance, obtained via log-based CDC. By default, real-time mode starts replicating from the latest log position, but a custom start position can be specified by the user in the real-time section of the extractor configuration file.
Advantages of Using Arcion
As seen in the example above, MySQL and Snowflake are fully supported by Arcion. Migrating and replicating data from MySQL to Snowflake can be done in a few simple steps. For example, you can set up a streaming pipeline in Arcion using Change Data Capture to replicate changes from MySQL to any of the other supported destination databases, like Snowflake, or others like Oracle, DB2, SAP (ASE, HANA, IQ), SingleStore, and many more. Below are some highlights of using Arcion for data migrations and Change Data Capture.
Sub-Second Latency
Many other existing CDC solutions don’t scale for high-volume, high-velocity data, resulting in slow pipelines, and slow delivery to the target systems. Arcion is the only distributed, end-to-end multi-threaded CDC solution that auto-scales vertically & horizontally. Any process that runs on Source & Target is parallelized using patent-pending techniques to achieve maximum throughput. There isn’t a single step within the pipeline that is single-threaded. It means Arcion users get ultra-low latency CDC replication and can always keep up with the forever-increasing data volume on Source.
100% Agentless Change Data Capture
Arcion is the only CDC vendor in the market that offers 100% agentless CDC to all its supported enterprise connectors. Agentless CDC applies to all complex enterprise databases including MySQL and Snowflake. Arcion reads directly from the database transaction log, never reading from the database itself. Previously, data teams faced administrative nightmares and security risks associated with running agent-based software in production environments. You can now replicate data in real-time, at scale, with guaranteed delivery — but without the inherent performance issues or the security concerns.
Data Consistency Guaranteed
Arcion provides transactional integrity and data consistency through its CDC technology. To further this effort, Arcion also has built-in data validation support that works automatically and efficiently to ensure data integrity is always maintained. It offers a solution for both scalable data migration and replication while making sure that zero data loss has occurred.
Automatic Schema Conversion & Schema Evolution Support
Arcion handles schema changes out of the box requiring no user intervention. This helps mitigate data loss and eliminate downtime caused by pipeline-breaking schema changes by intercepting changes in the source database and propagating them while ensuring compatibility with the target's schema evolution. Other solutions would reload (re-do the snapshot) the data when there is a schema change in the source databases, which causes pipeline downtime and requires a lot of compute resources, which can get expensive!
Method 2: MySQL to Snowflake by Custom Code
The second method of moving data from MySQL to Snowflake is by using a custom code approach. The process being implemented is divided into four separate steps: extracting the data from MySQL, handling the data types, staging the files in internal/external storage, and finally, copying the staged files into a Snowflake table.
Step 1: Extracting Your Data from MySQL
There are two ways discussed here to extract data from MySQL: first, using the command-line tool mysqldump, and second, running a SQL query using the MySQL client. Let’s start by looking at the mysqldump tool.
Using mysqldump to extract data
mysqldump is a MySQL client utility mostly used to create a logical backup of a database or table. The sample below shows how to extract a single table using mysqldump.
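For example, with placeholder names:

```bash
mysqldump -u username -p database_name table_name > table_name.sql
```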
After executing the command, the table_name.sql output file will contain the statements needed to recreate the table along with INSERT statements for its data, as shown below:
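An illustrative fragment of what the dump contains (the table and values are made up):

```sql
CREATE TABLE `table_name` (
  `id` int NOT NULL,
  `name` varchar(50) DEFAULT NULL,
  PRIMARY KEY (`id`)
);

INSERT INTO `table_name` VALUES (1,'Alice'),(2,'Bob'),(3,'Carol');
```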
Whether mysqldump was run on the same machine as the MySQL instance or on a different one, you can convert the resulting dump file to CSV with a short text-processing command.
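A rough sketch using grep and sed is shown below. It assumes GNU sed and the default extended INSERT format, and that the data itself never contains the `),(` separator; string values also keep their single quotes, so further cleanup may be needed to produce strict CSV.

```bash
grep "^INSERT INTO \`table_name\`" table_name.sql \
  | sed -e 's/^INSERT INTO `table_name` VALUES //' \
        -e 's/;$//' \
        -e 's/),(/\n/g' \
        -e 's/^(//' \
        -e 's/)$//' > table_name.csv
```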
Using a SQL query to extract data
You can also extract data by executing SQL commands using the MySQL client utility and then redirecting the output to a file. Below is a sample of how this can be done.
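A sketch of this approach; the -B flag produces tab-separated batch output, and the sed step naively replaces tabs with commas, so fields containing commas or tabs would need proper quoting in a real pipeline:

```bash
mysql -B -u username -p -h mysql_host database_name \
  -e "SELECT * FROM table_name;" \
  | sed 's/\t/,/g' > table_name_data.csv
```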
The output obtained from the execution above can then be transformed using text editing utilities like sed or awk to clean and format the data.
Either of the approaches above will give us the raw data from MySQL in a format that it can be moved into Snowflake.
Step 2: Handling the Data Types and Formats
It is important to handle data types and formats from MySQL so that they match the corresponding Snowflake data types. This can be done with simple find-and-replace edits in a text editor to alter or remove values that are not compatible with Snowflake. You can also download and install the SnowSQL command-line client to help you handle larger migrations more easily.
Before replicating data from MySQL to Snowflake, it is worth noting the following points, as they help keep the handling of data types on Snowflake straightforward.
- Unlike some other cloud-based analytical services, Snowflake lets you define constraints such as UNIQUE, PRIMARY KEY, FOREIGN KEY, and NOT NULL; note, however, that only NOT NULL is enforced.
- Several character sets including UTF-8, UTF-16, and others are supported on Snowflake.
- Most of the date/time formats are allowed on Snowflake. You can also specify them while loading your data to tables using the File format option.
Once your data types are corrected to be supported in Snowflake, we can move forward to loading the data into the staging environment.
Step 3: Staging The Data Files in a Storage Area
A stage acts as an interim storage space where data is initially kept before being transferred to a warehouse or an analytical platform. It can either be internal or external and serves as a conduit for loading and unloading data from Snowflake. This staging area often leverages an object storage solution offered by a public cloud provider such as AWS or GCP due to their nearly unlimited scalability. Fortunately, Snowflake provides support for both internal and external stages.
Internal Stage
An internal stage is automatically allocated for each user and each table for staging data files. Additionally, there’s an option to create an Internal Named Stage using SQL statements, which provides a wider range of options for named stages, simplifies the process, and offers more flexibility when loading data.
Loading data this way requires running several DDL and DML statements, so it is best to use SnowSQL, a command-line client, to connect to Snowflake, execute SQL queries, and manage all DDL and DML operations, such as loading data into or unloading data out of database tables. An example of creating or replacing an Internal Named Stage is illustrated below:
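For example, using a hypothetical stage name and a simple CSV file format:

```sql
CREATE OR REPLACE STAGE my_mysql_stage
  FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1);
```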
Following the creation of an internal stage, the PUT command can be used to stage the data files. Here is the syntax to accomplish that:
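For example, from a SnowSQL session (the local file path and stage name are placeholders):

```sql
PUT file:///tmp/data/table_name.csv @my_mysql_stage;
```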
External Storage
Snowflake also supports external staging for Amazon S3 and Microsoft Azure. You can establish your external stage on these platforms to store your data files. The process of creating external storage on S3 comprises three steps: creating a storage bucket, creating the directory to house the CSV file, and uploading the CSV file to the S3 bucket. Additionally, you need to supply IAM credentials and encryption keys when creating an external stage on S3. This provides Snowflake with the necessary permissions to read and write data to the S3 bucket using the IAM role.
The following example can be used to supply the IAM credentials and encryption keys:
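A sketch using placeholder bucket, credential, and key values:

```sql
CREATE OR REPLACE STAGE my_s3_stage
  URL = 's3://my-bucket/mysql-export/'
  CREDENTIALS = (AWS_KEY_ID = 'YOUR_AWS_KEY_ID' AWS_SECRET_KEY = 'YOUR_AWS_SECRET_KEY')
  ENCRYPTION = (MASTER_KEY = 'YOUR_MASTER_KEY')
  FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1);
```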
It’s important to note that before executing any of these commands, you would need to ensure that you have the necessary permissions and roles in Snowflake to carry out these operations, and the paths, credentials, and keys used in the examples need to be replaced with your actual data.
Step 4: Copying The Staged Files Into a Snowflake Table
Finally, the COPY INTO SQL command is leveraged to copy files from the staging area into a designated table in Snowflake. Upon executing the SQL command, data from all staged files gets loaded into the specified table within a Snowflake database. The warehouse is used to execute these operations. Below are a few different examples demonstrating how to utilize the COPY INTO SQL command for data loading.
To load data from a named internal stage, you can use:
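For example, using the internal stage created earlier (table and stage names are placeholders):

```sql
COPY INTO target_table
  FROM @my_mysql_stage
  FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1);
```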
To load a single file from an external stage, the command remains the same:
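For example:

```sql
COPY INTO target_table
  FROM @my_s3_stage/table_name.csv;
```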
To load data files directly from an external location like S3, use the following command while substituting with your actual credentials:
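For example, with placeholder credentials:

```sql
COPY INTO target_table
  FROM 's3://my-bucket/mysql-export/'
  CREDENTIALS = (AWS_KEY_ID = 'YOUR_AWS_KEY_ID' AWS_SECRET_KEY = 'YOUR_AWS_SECRET_KEY')
  ENCRYPTION = (MASTER_KEY = 'YOUR_MASTER_KEY')
  FILE_FORMAT = (TYPE = 'CSV');
```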
To load files specified using patterns, use:
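For example, loading only the staged files whose names match a pattern:

```sql
COPY INTO target_table
  FROM @my_mysql_stage
  FILE_FORMAT = (TYPE = 'CSV')
  PATTERN = '.*table_name.*[.]csv';
```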
Note that the above commands are executed within the context of a running Snowflake session and will use the compute resources of the Snowflake warehouse that is currently active for the session.
Limitations and Challenges of Using a Custom Code Method
The following are the limitations and challenges associated with this method:
- Limited to Full Data Loads: the custom code method primarily caters to full data loading to the target table. Implementing Change Data Capture (CDC) would require additional custom code to create a data pipeline to extract incremental load from MySQL and update the Snowflake table. This process can be time-consuming and complex.
- Lack of Real-Time Streaming: Real-time data streaming isn't supported with this method of connection. As a result, it doesn't facilitate immediate data availability, which can be a major hurdle for businesses that need to make real-time decisions based on the most up-to-date information.
- Error-Prone and Cumbersome: This method of connecting MySQL to Snowflake can be cumbersome and error-prone. It demands close attention and technical expertise. Mistakes could lead to migration failures, data corruption, or data loss.
- Inefficient for Large Data Volumes: Custom code solutions might not be suitable for large-scale data migration, as they might not be optimized for high performance or efficient resource usage. This could result in prolonged migration times or even process failures.
- Maintenance and Upkeep: Custom scripts require ongoing maintenance and updates to stay compatible with changes in source and target systems. This can be a resource-intensive task and could also lead to downtime during updates.
- Lack of Robust Error Handling and Recovery: Custom code solutions often lack robust error handling and recovery mechanisms. In case of failures, it might be hard to pinpoint the exact cause, and there might not be an easy way to resume the process from where it failed.
- Absence of a User-Friendly Interface: With custom code, there's typically no user-friendly interface for managing and monitoring the data migration process. This might make it hard for non-technical stakeholders to oversee the migration process.
- Lack of Documentation and Support: Unlike commercial solutions, custom code does not usually come with extensive documentation or professional support. This could make it difficult to troubleshoot issues or improve the code over time.
By considering these limitations, you can better evaluate whether a custom code approach is suitable for your MySQL to Snowflake migration or if it would be more beneficial to use a specialized data migration tool or service like Arcion.
Conclusion
In conclusion, transitioning data from MySQL to Snowflake is a process that demands meticulous attention and proficiency. We have explored two potential methodologies for executing this migration: leveraging Arcion's capabilities or developing custom code. While both methods possess their unique advantages, it's evident that the Arcion route offers substantial benefits including efficiency, ease of use, robust error handling, and more. The custom code alternative, on the other hand, although providing a high level of flexibility, comes with its own set of challenges and potential drawbacks.
The choice of method will largely depend on your organization's specific needs, the complexity of your data, your team's technical abilities and expertise, and the resources you have at your disposal. Regardless of the choice, a successful migration requires a sound strategy and rigorous execution.
If you are looking for an easier, more efficient way to migrate your data from MySQL to Snowflake, we highly recommend considering Arcion. As shown above, Arcion is easy to implement and can help you achieve MySQL to Snowflake connectivity in a matter of minutes. Our team is ready and eager to assist you in creating a seamless data migration experience, minimizing downtime, and ensuring your data is moved securely and accurately. Want to chat with one of our MySQL to Snowflake migration and replication experts? Reach out to the Arcion team today to get started.