AWS RDS To Snowflake: Load data In Minutes With 2 Easy Methods

Luke Smith
Enterprise Solutions Architect
July 14, 2023
Matt Tanner
Developer Relations Lead
July 14, 2023
Luke Smith
Matt Tanner
Enterprise Solutions Architect
Developer Relations Lead
July 15, 2023
19
 min read
Join our newsletter

Organizations often grapple with efficiently synchronizing their daily data streams from various platforms into a single platform. By unifying data, companies can be empowered to derive meaningful insights, such as through analytics, to inform strategic decisions that drive their businesses forward.

In today's fast-paced business environment, companies frequently have a disparate approach to storing and analyzing data. They use databases to securely store transactional data and separate data warehouses for analytical processing. Such an arrangement necessitates a seamless connection between these two (or more) entities, ensuring a real-time data flow from the transactional databases to the analytical engine. This arrangement enhances data understanding and prevents performance degradation of the source database when using the data for analytics.

Data warehouses have earned their reputation as reliable structured and unstructured data repositories. They permit robust analysis without disrupting the operational integrity of the warehouse itself. Consequently, many organizations opt to transfer their data from operational databases to these warehouses to facilitate comprehensive and efficient data analysis.

In this blog post, we will explore how to create a connection between various cloud-hosted databases to a data warehouse. Specifically, we'll provide a detailed, step-by-step guide on establishing a connection between databases hosted on AWS RDS and Snowflake. This guide will show you how to effectively migrate data between these two powerful data management tools. Below we will explore two distinct yet efficient methods, a manual and an automated, to move data between the two platforms. First, let's take a look into the specifics of Amazon RDS.

Table of Contents

What is Amazon RDS?

Amazon Relational Database Services (RDS) is a comprehensive web service that Amazon Web Services (AWS) provides. This platform facilitates the setting up, operating, and scaling of a relational database within the AWS cloud environment. Since its inception in 2009, Amazon RDS has been actively catering to many operating systems, aiming to simplify traditionally complex and time-consuming administrative tasks. The result is a streamlined, efficient database management process, enabling users to focus more on their core responsibilities and less on database operations.

At its core, Amazon RDS operates as a fully managed suite of services. This design allows for rapid deployment and ease of scalability for a wide array of relational database engines. Whether your data resides in the cloud or on-premises, Amazon RDS effectively manages it. The supported engines encompass MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, and Amazon RDS on AWS Outposts for on-premises requirements.

One of the unique characteristics of Amazon RDS is its ability to automate time-consuming administration tasks. These include hardware provisioning, database setup, patching, and backups, thus freeing up your time to focus on your applications to provide better value to your business. Furthermore, it offers high availability and failover support for DB instances with Multi-AZ deployments, ensuring your databases remain operational and accessible.

Amazon RDS stands out not just for its technical features but also for its economic efficiency. It provides flexible pricing options where users can either opt for a free trial to acquaint themselves with the platform or choose a pay-as-you-go model, paying solely for the services they utilize with no attached minimum fees. This is attainable through the use of RDS on-demand or reserved instances. Amazon RDS's pricing model is transparent and predictable, allowing you to optimize variable and steady workloads costs.

Additionally, Amazon RDS offers scalable capacity to meet the demands of standard relational databases, enabling users to easily perform a variety of administrative processes. These processes range from backing up databases and implementing point-in-time recovery to patching database software. It even supports the execution of more complex tasks, such as replicating databases for enhanced availability and scaling read traffic, creating read replicas, or sharding databases at the application level.

Amazon RDS provides a robust and flexible security model to bolster its security offerings. Users can leverage AWS Identity and Access Management (IAM) to manage access to their database instances. In addition, Amazon RDS also includes built-in security features for protecting and encrypting sensitive data at rest and in transit. With the ability to run your DB instances in Amazon Virtual Private Cloud (VPC), you can isolate your database in your virtual network and connect to your on-premises IT infrastructure using industry-standard encrypted IPsec VPNs. 

In summary, Amazon RDS is a powerful, feature-rich service that combines speed, flexibility, and cost-effectiveness, allowing businesses of all sizes to leverage the power of relational databases without the associated administrative burden.

What is Snowflake?

Snowflake is a cloud-based Software-as-a-Service (SaaS) data warehouse and analytics platform which enables customers to analyze data. Its cloud data warehouse is built on Amazon Web Service (AWS), Microsoft Azure, and Google infrastructure, providing a platform for storing and retrieving data. It is powered by the standard ANSI SQL protocol that supports full and semi-structured data such as JSON, Parquet, XML, and many other formats. 

With Snowflake, there is no hardware or software to install, configure, or manage since Snowflake itself fully manages the platform. This means organizations do not need dedicated resources for setup, maintenance, and support of the platform, as you would for in-house and physical operations. Everything is carried out on the cloud and requires near-zero administration.

Snowflake’s unique architecture separates its storage unit from its compute unit, scaling differently so customers can use and pay for both independently. It is flexible and highly scalable, accommodating massive amounts of data and many users using its computing power. For further information on Snowflake, visit the official website.

How to Connect Amazon RDS to Snowflake? (Methods) ( H2 )

Amazon Relational Database Service (RDS) allows you to create and connect to various DB instances, including MariaDB, MySQL, Microsoft SQL Server, Oracle, or PostgreSQL. Before creating or connecting to a DB instance, ensure you've completed the Setting up for Amazon RDS tasks, as each DB engine may have slightly different setup and connection requirements.

This article will discuss two methods to connect Amazon RDS and Snowflake: ETL custom scripts and Arcion. These methods enable seamless data migration between the platforms.

Method 1: Custom Scripts

Custom scripts offer a flexible way to create a connection between AWS RDS and Snowflake. This process is broken down into two primary steps for ease of understanding.

Step 1: Exporting Amazon RDS DB Snapshot Data to Amazon S3

Set up a working environment before exporting Amazon RDS Snapshot data to Amazon S3. This environment should include a Snowflake virtual warehouse, an active user with the necessary permissions to connect and query a cluster, an RDS instance, and a sample database, as demonstrated below:


To export DB snapshot data to an Amazon S3 bucket, Amazon RDS extracts data from a database or a table in the snapshot, which is then stored in a compressed Apache Parquet format. Follow these steps to export the DB snapshot data to S3:

  1. Create a manual snapshot of your data to Amazon S3. Use the following shell script as a guide:
#!/bin/bashtimestamp=$(date +%Y-%m-%d-%H-%M-%S)db_name="phdata-rds-snowflake"db_backup="${db_name}-backup-${timestamp}"aws rds create-db-snapshot \    --db-instance-identifier ${db_name} \    --db-snapshot-identifier ${db_backup}


  1. Create an AWS Identity and Access Management (IAM) policy to grant the snapshot export access to the S3 bucket. Use this AWS CLI command as an example:
aws iam create-policy --policy-name ExportPolicy --policy-document '{    "Version": "2012-10-17",    "Statement": [        {            "Effect": "Allow",            "Action": ["s3:ListBucket","s3:GetBucketLocation"],            "Resource": ["arn:aws:s3:::*"]        },        {            "Effect": "Allow",            "Action": ["s3:PutObject*","s3:GetObject*","s3:DeleteObject*"],            "Resource": ["arn:aws:s3:::<YOUR_BUCKET_NAME>","arn:aws:s3:::<YOUR_BUCKET_NAME>*"]        }    ]}'
  1. Create an IAM role for Amazon RDS to access the Amazon S3 buckets:
aws iam create-role --role-name rds-s3-export-role --assume-role-policy-document '{    "Version": "2012-10-17",    "Statement": [        {            "Effect": "Allow",            "Principal": {"Service": "export.rds.amazonaws.com"},            "Action": "sts:AssumeRole"        }    ]}'
  1. Attach the created IAM policy to the IAM role:
aws iam attach-role-policy --policy-arn <YOUR_POLICY_ARN> --role-name rds-s3-export-role

‚Äć

  1. Go to the navigation pane from the Amazon RDS console and select Snapshots. On the Snapshots list window, select the snapshot to be exported to S3 or use the following sample code:
#!/bin/bashtimestamp=$(date +%Y-%m-%d-%H-%M-%S)export_task_id="my-export"db_snapshot_arn="********"target_bucket_name="********"iam_rds_exporter_arn="********"kms_key_id="********"aws rds start-export-task \    --export-task-identifier ${export_task_id} \    --source-arn ${db_snapshot_arn} \    --s3-bucket-name ${target_bucket_name} \    --iam-role-arn ${iam_rds_exporter_arn} \    --kms-key-id ${kms_key_id}


Step 2: Loading Data to Snowflake

After exporting the data from Amazon RDS to Amazon S3, the next step is to load this data into Snowflake. This involves several steps:

  1. Create a file format object in Snowflake, specifying the Parquet file format type:
create or replace file format parquet_format type = 'parquet';
  1. Use SnowSQL to create a stage that allows querying the AWS S3 bucket:
CREATE OR REPLACE STAGE rds_s3_stage    file_format = parquet_format    credentials = (        aws_key_id='********',        aws_secret_key='********'    )    url = 's3://<YOUR_BUCKET_NAME>';
  1. Create a target relational table in Snowflake for the S3 data:
create or replace temporary table amazon_rds_snapshot (       brand varchar default null,       device varchar default null,       airlines variant default null );
  1. Load the S3 data into the Snowflake table:
  1. Run a query on the relational table to confirm successful data access:
select * from amazon_rds_snapshot;

Disadvantages of using Custom Scripts to Connect Amazon RDS to Snowflake

While custom scripts offer flexibility, they come with certain considerations:

  • This method typically involves a one-time loading of the RDS database and may not cater to continuous data ingestion needs.
  • The S3 bucket must be in the same AWS region as the snapshot.
  • Only certain regions, including parts of the US (Oregon, Ohio, N.Virginia), Europe (Ireland), and Asia Pacific (Tokyo), currently support exporting snapshots.
  • Additional coding might be needed for data cleaning, transformation, and enrichment, which can increase the complexity of the process.
  • Regular updates and maintenance of the custom scripts may be required to handle changes in data structures or schemas.
  • Error handling and resolution may require considerable technical expertise and time investment.

Method 2: Using Arcion to Move Data from Amazon RDS to Snowflake

Arcion is an enterprise-ready Change Data Capture (CDC) solution that offers users petabyte-level scalability, data consistency, and data integrity. It is a viable data migration and replication solution since it can easily integrate with some of the most popular enterprise databases and data warehouses.

Arcion’s CLI tool, Replicant, is what we will use to migrate data from Amazon RDS to Snowflake. To enable this to work correctly, you’ll need to ensure that the Snowflake user that Arcion will connect with possesses CREATE TABLE and CREATE STAGE privileges on the catalogs or schemas where Replicant will replicate tables. You’ll also need CREATE DATABASE and CREATE SCHEMA privileges granted to the user to create catalogs or schemas on the target Snowflake system. With those prerequisites out of the way, let's dive into implementing CDC with Arcion through the below steps. 

Step 1: Download And Install Arcion Self-hosted

The first thing to carry out is to download and install Arcion Self-hosted. The process would include downloading Arcion’s CLI tool Replicant, creating a home directory where Replicant was downloaded, $REPLICANT_HOME, and finally, adding your credentials. You can reference the quick start guide for more details on how to do this.

Having downloaded and installed Arcion Self-hosted, we move on to further steps where we will configure and enable Arcion to connect with Amazon RDS and Snowflake. As stated earlier, the DB instance may be MariaDB, MySQL, Microsoft SQL Server, Oracle, or PostgreSQL.

Step 2: Setting Up The Connection Configuration File For Amazon RDS

To set up the source connector Amazon RDS, go to the sample connection configuration file found on $REPLICANT_HOME.

vi conf/conn/<database-name>.yaml

Within this file, all of the necessary parameters can be set to ensure that Arcion can connect to the source database. You must make the necessary changes by inputting your database's specific credentials, as seen below.

type: <database-name>host: <host> #Enter the hostname or IP address to connect to the source             #databaseport: <port> #Enter the port on which the source database server is runningusername: <database-username> #Enter the source database usernamepassword: <user-password>  #Enter the source database passwordmax-connections: 30max-retries: 10retry-wait-duration-ms: 1000

From the code above, we have the following:

HOST: This refers to the hostname of the Amazon RDS instance

PORT: This is the port number of the Amazon RDS instance

USERNAME: This refers to the username of the database user

PASSWORD: This is the password associated with the username

DATABASE: This is where you input the name of the Amazon RDS database to connect to

MAX_NUMBER_OF_CONNECTIONS: This is used to specify the maximum number of connections Replicant would use to fetch data from the source After you are done with the configuration, save the file.

Step 3: Setting Up The Filter Configuration For The Amazon RDS Instance

In the Filter configuration, you can specify which tables should be included in any migration or replication activities. To set this, from $REPLICANT_HOME, go to the Filter configuration file:

vi filter/<source-database-name>_filter.yaml

Make the necessary changes as shown in the snippet below:

allow:- catalog: "catalog_name" #Enter your database name if applicable  schema : "schema_name" #Enter your schema name if applicable  types: [TABLE,VIEW] #Database object types to replicate  allow:    "Table_Name1": #Enter the name of your table    "Table_Name2": #Enter the name of your table

Step 4: Setting Up The Extractor Configuration

A heartbeat table must be created in the source Amazon RDS database for real-time replication. To create a heartbeat table in your Amazon RDS instance, do the following steps (which may vary depending on which database you are using. The steps below have been tested against MySQL).

  1. In the catalog/schema, you will replicate and create a heartbeat table with the DDL, as shown below. Remember to replace the <user_database> with the name of your specific database.
CREATE TABLE `<user_database>`.`replicate_io_cdc_heartbeat`(  timestamp BIGINT NOT NULL,  PRIMARY KEY(timestamp));
  1. Grant the user configured for replication INSERT, UPDATE, and DELETE privileges for the heartbeat table. 
  2. Next, you’ll need to configure the Extractor configuration file. From $REPLICANT_HOME,  open the Extractor configuration file.
vi conf/src/mysql.yaml
  1. In the Extractor configuration file, make the changes below in the realtime section to set up the heartbeat configuration.
realtime:  heartbeat:    enable: true    catalog: tpch #Replace tpch with the name of the database containing your heartbeat table    table-name: replicate_io_cdc_heartbeat #Replace replicate_io_cdc_heartbeat with your heartbeat table's name if applicable    column-name: timestamp #Replace timestamp with your heartbeat table's column name if applicable

With that final step, our Amazon RDS instance is configured to work with Arcion. Next, we need to set up the Snowflake target.

Step 5: Setting Up The Connection Configuration for Snowflake

From $REPLICANT_HOME, the directory where Arcion is installed, go to the sample Snowflake Connection configuration file, snowflake.yaml, located in the conf/conn/ directory.

vi conf/conn/snowflake.yaml

The configuration file for setting up the connection for Snowflake is divided into two parts: the Snowflake connection-related parameters and the stage configuration-related parameters. 

For Snowflake connection-related parameters, to set up the target Snowflake instance, two methods can be used to get an authenticated connection, they are RSA key pair authentication or basic username and password authentication. To use the basic username and password authentication, you can get your credentials by using Replicant to retrieve the credentials from AWS Secrets Manager if you are using it. Alternatively, you can specify your credentials in the connection configuration file below.

type: SNOWFLAKEhost: SNOWFLAKE_HOSTNAMEport: PORT_NUMBERwarehouse: "WAREHOUSE_NAME"username: "USERNAME"password: "PASSWORD"max-connections: 20 #Specify the maximum number of connections replicant can open in Snowflakemax-retries: 10retry-wait-duration-ms: 1000

From the example above, the value definitions for each field are as follows:

type: This indicates the connection type representing the database, it is Snowflake in this case. 

host: This shows the hostname of your Snowflake instance.

port: This is the port number to connect to the host.

warehouse: This is the name of the Snowflake warehouse.

username: This refers to the username credential to access the Snowflake system.

password: This is the password associated with the username.

max-connections: This indicates the maximum number of connections Replicant uses to load data into the Snowflake instance.

Having set up the Snowflake connection-related parameters, you can also set up the stage configuration-related parameters. The stage configuration lets users tune native or external staging areas for bulk loading. 

The RSA key pair for authentication will be used to configure the parameters related to stage configuration. Snowflake’s key pair authentication support is used for enhanced authentication security as an alternative to the basic username and password. To use this, follow the link above to go through the procedures for generating a private key, generating a public key, storing both keys, assigning the public key to a Snowflake user, and verifying the user’s public key fingerprint. Finally, editing the connection configuration with these new details for the stage configuration should also be done. 

Step 6: Setting Up The Applier Configuration

Next, we need to add the Applier configuration for Snowflake. The Applier configuration file contains all the parameters Replicant uses while loading data into Snowflake. While it is not necessary to modify most of the parameters, they can be adjusted to optimize the Applier job as necessary

From $REPLICANT_HOME, go to the sample Snowflake Applier configuration file:

vi conf/dst/snowflake.yaml

The configuration file for setting up the Applier configuration is divided into two parts: the parameters related to snapshot mode and the parameters related to real-time mode. 

To configure the parameters related to snapshot mode

For snapshot mode, make the changes below:

snapshot:  threads: 16 #Specify the maximum number of threads Replicant should use for writing to the target  batch-size-rows: 100_000  txn-size-rows: 1_000_000  #If bulk-load is used, Replicant will use the native bulk-loading capabilities of the target database  bulk-load:    enable: true|false #Set to true if you want to enable bulk loading    type: FILE|PIPE #Specify the type of bulk loading between FILE and PIPE    serialize: true|false #Set to true if you want the generated files to be applied in a serial/parallel fashion    #For versions 20.09.14.3 and beyond    native-load-configs: #Specify the user-provided LOAD configuration string which will be appended to the s3-specific LOAD SQL command
To configure the parameters related to realtime mode

The realtime section in the snowflake.yaml config file is used to configure the parameters related to realtime mode. Below is some sample code showing how to set up the config file for the realtime mode.

realtime: threads: 8 #Specify the maximum number of threads Replicant should use for writing to the target max-retries-per-op: 30 #Specify the maximum amount of retries for a failed operation retry-wait-duration-ms: 5000 #Specify the time in milliseconds Replicant should wait before retrying a failed operation cdc-stage-type: FILE #Enter your cdc-stage-type

With this, the Snowflake connector has been configured, and we can now run Replicant Arcion’s program for running the CDC pipeline, which will be the final step in the Amazon RDS to Snowflake connection using Arcion.

Step 7: Running Replicant

From $REPLICANT_HOME, run the following command to execute Replicant in realtime mode. When Replicant starts up,  the pipeline will be immediately activated after executing the command.

./bin/replicant realtimeconf/conn/<src_database>.yaml \conf/conn/snowflake.yaml \--extractor conf/src/<src_database>.yaml \--applier conf/dst/snowflake.yaml¬† \--id repl2 --replace ‚Äďoverwrite

After executing the command and the pipeline is started in realtime mode, Replicant does the following:

  • Replicant first creates the destination schemas if they are not already present but if the destination schemas are present, Replicant appends them to the existing tables.
  • Replicant replicates real-time operations within the Amazon RDS instance, obtained via log-based CDC. By default, real-time mode starts replicating from the latest log position. Still, a custom start position can be specified by the user in the real-time section of the extractor configuration file.

Advantages of Using Arcion

  1. Sub-second latency from distributed & highly scalable architecture: Arcion is the only CDC solution with an underlying end-to-end multi-threaded architecture, which supports auto-scaling both vertically and horizontally. Its patent-pending technology parallelizes every single Arcion CDC process for maximum throughput. So users get ultra-low latency and maximum throughput even as data volume grows.
  2. Arcion allows smooth schema management of your data, thereby ensuring data integrity, reliability, and consistency. As the schema evolves on the MySQL instance, these changes will be applied to the BigQuery instance automatically to keep both in sync.
  3. Arcion is the only CDC vendor in the market that offers 100% agentless CDC to all its supported 20+ connectors. Arcion‚Äôs agentless CDC connectors apply to all the complex enterprise databases modern enterprises use, like Microsoft SQL Server, MongoDB, and all versions of Oracle. Arcion reads directly from the transaction logs, never reading from the database itself. Previously, data teams faced administrative nightmares and security risks associated with running agent-based software in production environments. You can now replicate data in real-time, at scale, with guaranteed delivery ‚ÄĒ but without the inherent performance issues or the security concerns of agent-based connectors.
  4. Arcion provides transactional integrity and data consistency through its CDC technology. To further this effort, Arcion also has built-in data validation support that works automatically and efficiently to ensure data integrity is always maintained. It offers a solution for both scalable data migration and replication while making sure that zero data loss has occurred.
  5. Effortless setup & maintenance: Arcion's no-code platform removes DevOps dependencies; you do not need to incorporate Kafka, Spark Streaming, Kinesis, or other streaming tools. So you can simplify the data architecture, saving both time and cost.
  6. Arcion is SOC 2 Type 1 & Type 2, HIPAA, PCI compliant. The enterprise-grade security and compliance standards ensure data governance.

Conclusion

In this article, we've explored two primary methods to transition your data from Amazon RDS to Snowflake: custom ETL scripts and an automated, scalable tool, Arcion. Each approach offers unique benefits, with the most suitable for your needs depending on various factors. These include your project's specific demands, your team's skill set (since some efforts will require the skills of data engineers), and the nature of your data. A successful migration hinges on both detailed planning and being able to execute each step within the plan. Transferring your data from Amazon RDS to Snowflake is no different. It requires you to pay close attention to every stage of the operation to ensure a smooth migration or replication implementation. Failure to do so can lead to adverse issues with your data down the line.

While both methods presented here have advantages, they also have challenges. For instance, their limitations and potential complications can constrain custom ETL scripts. Arcion, however, provides an alternative that can alleviate many of the complexities seen with custom scripts and code. With a streamlined implementation process, Arcion can expedite the creation of a connection between Amazon RDS and Snowflake, saving you both time and potential headaches.

In conclusion, if you're searching for a more streamlined and efficient way to transition your data from Amazon RDS to Snowflake, Arcion is the trusted way to go. Arcion is a real-time, in-memory Change Data Capture (CDC) solution that offers scalability and ensures consistent data integrity throughout the migration and replication process. Arcion's team of experts is always on hand to guide you through the process, helping you navigate any challenges and ensuring a seamless migration and replication experience. Get started with Arcion today by connecting with our team to help you skillfully set up and implement a data integration strategy for migrating and replicating data from AWS RDS to Snowflake.

Matt is a developer at heart with a passion for data, software architecture, and writing technical content. In the past, Matt worked at some of the largest finance and insurance companies in Canada before pivoting to working for fast-growing startups.
Luke has two decades of experience working with database technologies and has worked for companies like Oracle, AWS, and MariaDB. He is experienced in C++, Python, and JavaScript. He now works at Arcion as an Enterprise Solutions Architect to help companies simplify their data replication process.
Join our newsletter

Take Arcion for a Spin

Deploy the only cloud-native data replication platform you’ll ever need. Get real-time, high-performance data pipelines today.

Free download

8 sources & 6 targets

Pre-configured enterprise instance

Available in four US AWS regions

Contact us

20+ enterprise source and target connectors

Deploy on-prem or VPC

Satisfy security requirements