Data warehousing solutions have been greatly adopted by data enthusiasts and the urge to create data pipelines has also grown across the board. Data pipelines have become a necessary part of data analysis as data analysts seek ways to harness every possible feature offered by the warehousing platforms. These benefits include the ability to gain relevant insights from data produced and to stay ahead of the competition.
Migrating data from various data sources, especially cloud solutions, is a common theme in modern analytics. As such, this blog post is going to explain how to connect Snowflake to PostgreSQL so you can seamlessly transfer your data between both platforms. PostgreSQL and Snowflake are both very popular tools with a lot of different use cases but there are very few guides for detailing the connection between Snowflake and PostgreSQL. In this article, we will cover two methods for moving data from Snowflake into Postres. The first will be a manual and less scalable method using custom scripts. The second method we will use Arcion to set up a CDC pipeline that can support real-time data replication and migration.
As you likely already know, Snowflake is a powerful data warehousing and analytics platform that allows users to analyze their data through its cloud offering. PostgreSQL, is an object-relational database system that is reputable amongst developers for its integrity and compatibility with other data solutions, as well as its scalability. Let's dive into the various ways we can create a data pipeline between the two platforms, but first, let’s look at a brief introduction to Snowflake and PostgreSQL.
What is Snowflake?
Snowflake is a cloud-based Software-as-a-Service (SaaS) data warehouse and analytics platform which enables customers to analyze data. Its cloud data warehouse is built on Amazon Web Service (AWS), Microsoft Azure, and Google infrastructure, providing a platform for storing and retrieving data. It is powered by the standard ANSI SQL protocol that supports full and semi-structured data such as JSON, Parquet, XML, and many other formats.
With Snowflake, there is no hardware or software to install, configure, or manage since the platform is fully managed by Snowflake itself. This means organizations do not need dedicated resources for setup, maintenance, and support of the platform, as you would for in-house and physical operations. This is because everything is carried out on the cloud and requires near-zero administration.
Snowflake’s unique architecture separates its storage unit from its compute unit, thereby scaling differently so customers can use and pay for both independently. It is very flexible and highly scalable to accommodate massive amounts of data and a lot of users using its computing power. For further information on Snowflake, visit the official website.
What is Postgres?
PostgreSQL is an open-source relational database management system (RDBMS) operated by the Global Development Group. Fans of PostgreSQL tend to enjoy many benefits, including a large amount of extensibility within the platform and strict SQL compliance. It uses the standard Structured Query Language ( SQL) and pgAdmin tool to provide GUI and SQL interface.
PostgreSQL has been in active development since its initial release in 1996. It is widely used in production systems running on all major operating systems such as Linux, Unix, Mac OS, Windows, and others. PostgreSQL is extremely versatile and can be used for different use cases such as a data warehouse, for use with analytics/business intelligence applications, and as the database layer for web and mobile applications.
PostgreSQL has a lot of history and has been in existence for a long time, backed by years of development through its open-sourced community. This history and strong community foundation give it a good reputation among data management users as it is highly reliable, has data integrity, and is easy to use.
How to Connect Snowflake to Postgres
Now that we’ve taken a brief look over each platform, let’s move on to explaining how to connect Snowflake to PostgreSQL. Below we will look at two methods to achieving this, the first is by using custom scripts and the second is by using Arcion. Let’s dive in!
Method 1: Custom Scripts
First, let’s walk through a method to establish a connection between Snowflake and PostgreSQL by creating and executing custom scripts. The technique involves exporting data from Snowflake to a staging area in CSV, JSON, or PARQUET format. Once exported, the next step is to import this data into PostgreSQL.
Before we get started, it's essential to note the nuances of data type conversion between Snowflake and PostgreSQL, as there will be domain-specific logic needed to migrate the data. Prior to moving data between the two platforms, consider the following points:
- Use the appropriate file encoding for your specific needs. Snowflake supports a variety of character sets, including UTF-8.
- Snowflake offers support for constraints such as UNIQUE, PRIMARY KEY, FOREIGN KEY, and NOT NULL. Snowflake, by default, exports NULL as '\\N' in the CSV, while Postgres interprets a blank space ' ' as NULL.
- Snowflake supports almost all date/time formats.
- Be mindful of type differences between Snowflake and PostgreSQL data types to avoid any downstream errors when unloading from Snowflake.
Step 1: Unloading Data from Snowflake to a Stage
The COPY INTO <location> command is a useful method to extract data from Snowflake tables into one or more files in a stage. The stage can either be an internal or external location, such as Amazon S3, Google Cloud Storage, or Microsoft Azure. The extracted files can then be retrieved from the stage using the GET command.
Below is a script which shows how to extract data from a Snowflake table and exporting it to an Amazon S3 bucket, in CSV format.
Step 2: Loading Data from a Stage to PostgreSQL
To load data from the staged file into PostgreSQL, one option is to download the staged file to your application server and upload it to PostgreSQL using the COPY INTO statement. Alternatively, you can utilize a cloud service's external stage utilities to load the file directly into PostgreSQL.
In the below example, you can see the script using the GET command in Snowflake to download a CSV file. This can be run on your local machine or application server:
After the CSV file is downloaded, use PostgreSQL's COPY command to load the data from the file into the desired PostgreSQL table. The example script below demonstrates how this can be done.
Alternatively, if you are using AWS and both Snowflake and PostgreSQL are hosted in the same region, you can use Amazon S3 to transfer files from Snowflake's stage to PostgreSQL. AWS offers its own Data Pipeline service for this purpose, or you can use the open-source tool 's3cmd' for direct file transfers.
Below is example of loading data into PostgreSQL using the COPY command in conjunction with the s3_to_redshif function provided by AWS Data Pipeline:
In the above script, <access-key> and <secret-access-key> should be replaced with your AWS access key ID and secret access key.
Disadvantages of using Custom Scripts to Connect Snowflake to Postgres
As with many manual procedures, there are some disadvantages that can impact using the above techniques to migrate data. Below are some disadvantages of using the custom scripting method for migrating data from Snowflake to PostgreSQL:
Complexity: This approach requires a good understanding of both Snowflake and PostgreSQL, including their SQL syntax, data types, and file handling capabilities. Not all data scientists have extensive experience with SQL or database administration, so this could be a steep learning curve for some.
Time-Consuming: Writing custom scripts and debugging any issues that arise can be a time-consuming process, especially for larger databases or more complex data structures.
Error-prone: Manual scripting leaves room for human error. A simple mistake in the script could lead to data loss, corruption, or inaccuracies.
No Direct Support: There's no dedicated support team to call upon if you encounter problems. You'll need to rely on community forums, documentation, or in-house expertise to resolve any issues.
Scalability Concerns: If the volume of data grows, the scripts might need to be updated or optimized to handle the larger datasets. This approach may not scale well without significant additional work.
Inefficiency with Large Datasets: Moving large datasets by exporting to a file and then importing may be inefficient, particularly if network bandwidth is a limitation. Direct data transfer methods could be faster.
Concurrency Issues: If data in the source database is updated frequently, maintaining data consistency in the target database can be challenging. The migration process might need to be carefully scheduled to avoid data discrepancies.
Security Concerns: Downloading data onto a local machine or an application server could potentially expose sensitive information if not handled with adequate security measures. This means that the user implementing the migration will need to ensure that the process complies with all relevant data privacy and protection regulations.
Dependency on Third-Party Tools: If you are using cloud services like AWS S3 for the data transfer, you'll be dependent on their availability and performance. Any issues with these services could disrupt your data migration process.
Maintenance Overhead: Over time, the scripts will need to be maintained and updated to adapt to changes in database schema, security protocols, and the evolving capabilities of Snowflake and PostgreSQL.
while custom scripting can provide a flexible way to migrate data, it also brings several challenges that anyone implementing it should consider. More automated, enterprise-grade solutions tend to be more efficient for large-scale or ongoing data migrations.
Method 2: Using Arcion to Move Data from Snowflake to Postgres
The second method of creating a connection between Snowflake and PostgreSQL explained in this section of the write-up will involve the using Arcion and it’s CLI tool, Replicant. Arcion is a real-time in-memory Change Data Capture (CDC) solution that offers a high level of scalability and sub-second latency, ensuring the reliability of your data. It is a leading solution for data migration and data replication and has integrations with various enterprise databases and data warehouses.
To enable this to work correctly, ensure the PostgreSQL user that Arcion will connect with as a destination possesses CREATE TABLE privilege on the catalogs or schemas where Replicant will replicate tables to. You’ll also need to grant the user CREATE DATABASE and CREATE SCHEMA privileges on the target PostgreSQL system. Once these permissions are set, you can proceed to the first step.
Step 1: Download And Install Arcion Self-hosted
The first thing to do is to download and install Arcion Self-hosted. To do this, you’ll need to contact the Arcion team to get the Arcion program as well as a license key. Once you have access, the process would include downloading Arcion’s CLI tool Replicant, creating a home directory where Replicant was downloaded, referred to as $REPLICANT_HOME throughout this guide, and finally, adding your license. You can reference the quick start guide for more details on how to do these steps in detail.
Step 2: Set up Connection Configuration for Snowflake
Once downloaded, go to the sample Connection Configuration file found in $REPLICANT_HOME.
In the file, the connection consists of two parts: parameters related to the source Snowflake server connection and parameters related to stage configuration.
Parameters related to source Snowflake server connection
To connect to the Source Snowflake server, two methods can be used to get an authenticated connection: RSA key pair authentication or basic username and password authentication. To use the basic username and password authentication, you can get your credentials by using Replicant to Retrieve the credentials from AWS Secrets Manager if you stored your credentials there. If you aren’t, you can input your credentials in the connection configuration file as seen below.
From the sample above, the value definitions for each field are as follows:
type: This indicates the connection type representing the database, it is Snowflake in this case.
host: This shows the hostname of your Snowflake instance.
port: This is the port number to connect to the host.
warehouse: This is the name of the Snowflake warehouse.
username: This refers to the username to access the Snowflake server.
password: This is the password associated with the username.
max-connections: This indicates the maximum number of connections Replicant can open in Snowflake
Parameters related to stage configuration
The stage configuration enables users to select by default Snowflake’s native stage or external staging areas like Azure and S3 for bulk loading. Basic authentication for stage configuration or the RSA key pair for authentication can be used to configure the parameters related to stage configuration. To use Snowflake’s key pair authentication support, click on the link and follow the instructions for generating a private key, generating a public key, storing both keys, assigning the public key to a Snowflake user, and verifying the user’s public key fingerprint, and editing the connection configuration file.
Step 3: Set up Extractor Configuration
Next, you’ll specify your configuration in the Extractor configuration file to meet your replication mode requirements. An example of how to set up the sample Extractor configuration file snowflake.yaml found in the $REPLICANT_HOME/conc/src directory. Within the file, you can set your Extractor parameters for both snapshot and realtime mode. Examples of both are shown below.
snapshot mode configuration
realtime mode configuration
Step 4: Set up Connection Configuration for PostgreSQL
Now that the Snowflake source is set up, we can move to configuring PostgreSQL as the target. For this, go to the sample connection PostgreSQL configuration file found on $REPLICANT_HOME.
You can use basic username and password authentication to configure the connection file. If your credentials are stored on AWS Secret Manager, Replicant can use Retrieve the credentials from AWS Secrets Manager to retrieve them. Alternatively, you can put your credentials in the connection configuration file as seen below.
From the sample above, the value definitions for each field are as follows:
type: This indicates the connection type representing the database, it is PostgreSQL in this case.
host: This shows the hostname of your PostgreSQL host.
port: This is the port number to connect to the host.
database: This is the database name.
username: This refers to the username to connect to your PostgreSQL server.
password: This is the password associated with the username.
max-connections: This indicates the maximum number of connections Replicant can open in PostgreSQL.
socket-timeout-s: This is the time-out value for the socket read operation.
max-retries: The number of times an operation can be re-attempted upon failure.
retry-wait-duration-ms: The duration Replicant should wait before performing the next try of a failed operation
Step 5: Configure the Mapper file (optional)
This is an optional step that can be used to define your data mapping from the source Snowflake to your target PostgreSQL. You can do this by simply specifying the mapping rules in the mapper file as shown in the configuration file below.
Step 6: Set up Applier Configuration
From REPLICANT_HOME, go to the PostgreSQL Applier sample configuration file:
The configuration file has two parts, parameters related to snapshot mode and parameters related to realtime mode.
Parameters related to snapshot mode
The sample code below shows how to configure parameters related to the snapshot mode.
Parameters related to realtime mode
The realtime section of the configuration file is used to set parameters for realtime mode.
Step 7: Running Replicant
From $REPLICANT_HOME, run the following command to execute Replicant in realtime mode. When Replicant starts up, the pipeline will be immediately activated after the command is executed.
After executing the command and the pipeline is started in realtime mode, Replicant does the following:
- Replicant first creates the destination schemas if they are not already present but, if the destination schemas are present, Replicant appends them to the existing tables.
- Replicant starts replicating real-time operations performed within the Snowflake instance, obtained via log-based CDC. By default, real-time mode starts replicating from the latest log position, but a custom start position can be specified by the user in the real-time section of the extractor configuration file.
Advantages of Using Arcion
As seen in the example above, Snowflake and PostgreSQL are fully supported by Arcion. The ability to migrate and replicate data from Snowflake to PostgreSQL can be done through a few simple steps. For example, you can set up a streaming pipeline in Arcion using Change Data Capture to replicate changes from Snowflake to any of the other destination databases supported, like PostgreSQL, or others like Oracle, DB2, SAP (ASE, HANA, IQ), SingleStore, and many more. Below are some highlights of using Arcion for data migrations and Change Data Capture.
Many other existing CDC solutions don’t scale for high-volume, high-velocity data, resulting in slow pipelines, and slow delivery to the target systems. Arcion is the only distributed, end-to-end multi-threaded CDC solution that auto-scales vertically & horizontally. Any process that runs on Source & Target is parallelized using patent-pending techniques to achieve maximum throughput. There isn’t a single step within the pipeline that is single-threaded. It means Arcion users get ultra-low latency CDC replication and can always keep up with the forever-increasing data volume on Source.
100% Agentless Change Data Capture
Arcion is the only CDC vendor in the market that offers 100% agentless CDC to all its supported enterprise connectors. The agentless CDC applies to all complex enterprise databases including PostgreSQL and Snowflake. Arcion reads directly from the database transaction log, never reading from the database itself. Previously, data teams faced administrative nightmares and security risks associated with running agent-based software in production environments. You can now replicate data in real-time, at scale, with guaranteed delivery — but without the inherent performance issues or the security concerns.
Data Consistency Guaranteed
Arcion provides transactional integrity and data consistency through its CDC technology. To further this effort, Arcion also has built-in data validation support that works automatically and efficiently to ensure data integrity is always maintained. It offers a solution for both scalable data migration and replication while making sure that zero data loss has occurred.
Automatic Schema Conversion & Schema Evolution Support
Arcion handles schema changes out of the box requiring no user intervention. This helps mitigate data loss and eliminate downtime caused by pipeline-breaking schema changes by intercepting changes in the source database and propagating them while ensuring compatibility with the target's schema evolution. Other solutions would reload (re-do the snapshot) the data when there is a schema change in the source databases, which causes pipeline downtime and requires a lot of compute resources, which can get expensive!
This article served as a comprehensive step-by-step guide on how to set up an Snowflake to PostgreSQL integration. We covered two methods which included a custom or manual setup for importing data via CSV and using Arcion. We covered each step of how to implement a manual CSV import into PostgreSQL and also discussed some of the limitations of this approach. For Arcion, we showed how to implement real-time CDC to move data from Snowflake to PostgreSQL. Arcion was highlighted as a simple and scalable solution for Snowflake to PostgreSQL integration and replication.
As we saw, Arcion is a real-time, in-memory Change Data Capture (CDC) solution that guarantees scalability and ensures data consistency at all times when moving data from Snowflake to PostgreSQL. To easily implement a data integration strategy, get started with Arcion today by connecting with our team of Snowflake and PostgreSQL replication experts. CDC pipelines built with Arcion are easy to build, maintain, and scale and are production-ready in a matter of minutes.