In today's data-driven industry, where extracting valuable insights from vast amounts of data is essential for making informed business decisions, the ability to efficiently replicate a database containing millions of records to a data warehouse has become crucial. Organizations heavily rely on the deductions derived from their data to ensure growth and sustainability. It is imperative to have a fast, efficient, and cost-effective method for transferring data between different database systems to analyze it and understand it fully.
Typically, databases can be categorized into two types: Online Transactional Processing (OLTP) and Online Analytical Processing (OLAP). An OLTP database serves as a storage and processing unit that handles concurrent transactions. It is optimized for tasks such as online banking, shopping, text messages, and order entry, where multiple users perform simultaneous operations. On the other hand, an OLAP data warehouse focuses on data storage and processing, primarily storing denormalized data to facilitate efficient querying and data analysis.
This blog post will specifically examine the migration process from PostgreSQL, a popular example of an OLTP database that adheres to the ACID (Atomicity, Consistency, Isolation, Durability) principles, to Snowflake, a powerful cloud-based OLAP data warehouse. By establishing a seamless connection between these two platforms, you will gain the ability to conveniently migrate your data, ensuring a smooth transition and optimal performance in your data analysis endeavors. In the following sections, we will look at two different methods, one that is highly manual and requires custom code and scripts, and another that is fully automated and requires no code. With the agenda set, let's get started on this data migration journey, exploring the intricacies and best practices along the way.
What is Postgres?
PostgreSQL is a popular and powerful open-source Relational Database Management System (RDBMS). It uses standard Structured Query Language (SQL) and the pgAdmin tool to provide GUI and SQL interface. Postgres is a strictly SQL-compliant platform.
It has been widely used in production systems for many years and has been in active development since its initial release in 1996. It supports all major operating systems such as Linux, Unix, Mac OS, and Windows. It is highly reliable, has data integrity, and data correctness backed by years of development through its open-source community.
PostgreSQL is a very robust database that is frequently used as the database layer for web applications, as a data warehouse, or for analytics/business intelligence applications. PostgreSQL is truly a multi-use database with a track record of being the database of choice for some of the largest and most well-known applications and services.
PostgreSQL is commonly seen being used:
- By large corporations and startups for transactional database purposes.
- As a back-end database in Linux, Apache, PostgreSQL, and PHP (python and Perl) (LAPP) stack to power dynamic websites and web applications.
- As a geospatial database used for geographic information systems (GIS) with the PostGIS extension found on PostgreSQL.
What is Snowflake?
Snowflake is a cloud-based Software-as-a-Service (SaaS) data warehouse and analytics platform which enables customers to analyze data. Its cloud data warehouse is built on Amazon Web Service (AWS), Microsoft Azure, and Google infrastructure, providing a platform for storing and retrieving data. It is powered by the standard ANSI SQL protocol that supports full and semi-structured data such as JSON, Parquet, XML, and many other formats.
With Snowflake, there is no hardware or software to install, configure, or manage since the platform is fully managed by Snowflake itself. This means organizations do not need dedicated resources for setup, maintenance, and support of the platform, as you would for in-house and physical operations. Everything is carried out on the cloud and requires near-zero administration.
Snowflake’s unique architecture separates its storage unit from its compute unit, thereby scaling differently so customers can use and pay for both independently. It is very flexible and highly scalable to accommodate massive amounts of data and a lot of users using its computing power.
How to Connect Postgres to Snowflake? (Methods)
Two methods of creating a connection between PostgreSQL and Snowflake will be explained in this section of the write-up so you can have a seamless and effective transfer of data across both tools. The first method will involve you using custom scripts to achieve the connection and the second is using an effective modern third-party tool Arcion.
Method 1: Custom Scripts
In this approach, we will use custom scripts to migrate data from PostgreSQL to Snowflake. The process involves exporting data from PostgreSQL as a CSV file and then importing the CSV file into Snowflake. However, it's important to note that PostgreSQL and Snowflake handle data types and data conversion differently. Therefore, it's crucial to ensure that NULL values are parsed correctly on both platforms and to be aware of the appropriate file encodings to use. Let's go through the steps involved in this migration process.
Step 1: Extracting Data from PostgreSQL
There are multiple ways to extract data from PostgreSQL: using the COPY TO command, the copy command, or the pg_dump utility.
Using the COPY TO command
The COPY TO command is the easiest and fastest method to export data from PostgreSQL tables to a CSV file. It generates a CSV file on the database server. You can use the COPY TO command to export an entire table, the results of a SELECT query, or extract data from multiple tables using a PL/PgSQL procedure. Here's an example code snippet for using the COPY TO command:
To extract data from multiple tables you can use a PL/PgSQL procedure. To use a PL/PgSQL procedure, execute the following command in the psql prompt:
The COPY TO command has the following attributes:
- An absolute path to the file is required.
- TO specifies the table or query from which the file will be exported.
- FORMAT specifies the file format to be used (e.g., text, binary, CSV, JSON, etc.).
- HEADER indicates whether the first line of the file contains column headers.
- DELIMITER specifies the character used to separate columns within each line of the file.
- NULL represents the string that denotes an empty value.
- QUOTE is the quoting character used when data is enclosed in quotes (the default is double quote, ").
pg_dump is a utility primarily used for backing up a PostgreSQL database or tables. However, it can also be used to extract data from tables. Here's an example command to use pg_dump for data extraction:
The output file, table_name.sql, will contain INSERT statements for the data, as shown below:
Step 2: Staging of Data Files
After extracting the data from PostgreSQL, the next step is to upload the extracted file to a data stage. The staging location can be either internal or external and can be temporary or permanent. It may also include a directory table that catalogs the staged files in cloud storage. Let’s take a look at how to load data to both an internal and external stage in the steps documented below.
Internal stages can be created by users using SQL statements. They provide a means to load data and define file formats. Here's the SQL syntax for creating an internal stage:
Here's a script example of how to create an internal stage:
To stage data files to an internal stage, you can use the PUT command. This will stage data files from a local directory/folder on a client machine to a named internal stage, a stage for a specified table, or a stage for the current user. Below is the syntax for the PUT command:
An external stage references data files stored outside of Snowflake, such as in Amazon S3 buckets, Google Cloud Storage buckets, or Microsoft Azure containers. The storage location can be private/protected or public but should be accessible from the client machine. Below is the syntax for creating an external stage:
Here's an example script to reference when creating an external stage:
In the example above, you’ll need to replace the values for URL, CREDENTIALS, and ENCRYPTION with your corresponding values for where your data is stored.
Step 3: Copying Staged Files to Snowflake
To load data from staged files into Snowflake, you can use the COPY INTO <table> command. This command loads data from staged files to a named stage, a named external stage, or an external location (e.g., Amazon S3, Google Cloud Storage, or Microsoft Azure) into an existing table. Here's the syntax for the COPY INTO command:
Below is an example of copying data from an Amazon S3 external location:
In the example above, you’ll need to replace the FROM URL, CREDENTIALS, and ENCRYPTION parameters with ones that match your implementation.
By following these steps, you can migrate data from PostgreSQL to Snowflake using custom code. To keep your Snowflake table up to date with the latest data changes, you'll need to implement additional code for ongoing updates. This is one of a few downsides to this approach that we will discuss below.
Disadvantages of using Custom Scripts to Connect Postgres to Snowflake
While using custom scripts to migrate and replicate data from PostgreSQL to Snowflake can be a viable approach, there are a few disadvantages to consider. Let’s take a look at some of them in more detail.
Data Type and Conversion Challenges: PostgreSQL and Snowflake may handle data types and data conversion differently. This can lead to potential challenges when migrating data, especially with complex data types or data transformations. It requires careful attention and validation to ensure that the data is accurately converted during the migration process.
Manual Effort and Maintenance: Implementing custom scripts for data migration requires manual effort and ongoing maintenance. As the data structures or schemas change over time, the scripts may need to be updated accordingly. This can increase the complexity and effort required to ensure the accuracy and integrity of the migrated data especially if the migration will happen more than once.
Lack of Automation: Custom scripts often lack the automation capabilities that dedicated migration tools or platforms offer. Manual scripting may require additional steps and manual interventions to handle complex scenarios, resulting in a more time-consuming and error-prone process. If custom scripts are automated, they can become very brittle and break even with the slightest change to connectivity details, schemas, etc.
Limited Scalability: The custom script approach may face scalability limitations when dealing with large volumes of data. It may be challenging to optimize the performance and efficiency of the migration process, especially when dealing with millions or billions of records. This can lead to longer migration times and potential disruptions in data availability during the migration process.
Dependency on Technical Expertise: Building and maintaining custom migration scripts requires a certain level of technical expertise in both PostgreSQL and Snowflake. It may require specialized knowledge of SQL, scripting languages, and the intricacies of the data structures in both systems. This dependency on technical expertise can introduce complexity and potential bottlenecks in the migration process.
Data Consistency and Integrity: Ensuring data consistency and integrity during the migration process is crucial. Custom scripts may require additional validation and error-handling mechanisms to handle potential issues such as data discrepancies, data truncation, or data loss. Without robust error handling and validation mechanisms, there is a risk of migrating inaccurate or incomplete data which can cause issues down the road.
Lack of Built-in Monitoring and Reporting: Custom scripts may lack built-in monitoring and reporting capabilities that dedicated migration tools or platforms provide. Monitoring the progress of the migration, identifying potential errors or issues, and generating comprehensive reports may require additional manual effort and customization. Without this functionality in place, it can be hard to diagnose issues once the data has been migrated and is being used.
Considering these disadvantages, it's important to evaluate whether the custom script approach aligns with your specific migration requirements, data complexity, and available resources. Exploring alternative migration methods or leveraging specialized migration tools, like Arcion, can help mitigate some of these challenges and provide a more streamlined and robust data migration experience. Next, let’s take a look at a more automated and scalable approach to migrating data with Arcion.
Method 2: Using Arcion to Move Data from Postgres to Snowflake
Arcion is an enterprise-ready Change Data Capture (CDC) solution that offers users petabyte-level scalability, data consistency, and data integrity. It is a viable solution for both data migration and data replication since it can easily integrate with some of the most popular enterprise databases and data warehouses.
Arcion’s CLI tool, Replicant, is what we will use to migrate data from PostgreSQL to Snowflake. To enable this to work correctly, you will need to ensure that the Snowflake user that Arcion will connect with possesses CREATE TABLE and CREATE STAGE privileges on the catalogs or schemas where Replicant will replicate tables. You will also need CREATE DATABASE and CREATE SCHEMA privileges granted to the user to create catalogs or schemas on the target Snowflake system. With these out of the way, let us dive into implementing CDC with Arcion through the steps shown below.
Step 1: Download And Install Arcion Self-hosted
The first thing we will need is to configure Arcion. For this, we will need to download and install the self-hosted version of Arcion. To gain access, you will be required to follow the steps outlined on the Arcion Self-hosted webpage. Once you have downloaded Arcion’s CLI tool Replicant, you will need to create a home directory for it. By default, the home directory will be where Replicant was downloaded. This home directory, from now on, will just be referred to as $REPLICANT_HOME in the examples below. The last step is adding your credentials to your Arcion instance, which is covered in the quick start guide that you can reference for more details on how to do this.
Having downloaded and installed Arcion Self-hosted, we can move on to the next steps where we will configure and enable Arcion to connect with PostgreSQL and Snowflake.
Step 2: Create a user in PostgreSQL
After the installation of Arcion, the next step is to create a user in PostgreSQL by executing the following steps.
- Connect to your database server and log into the PostgreSQL client
- Create a user that will be used for replication
- Grant the permissions listed below
Step 3: Setup PostgreSQL for Replication
For Arcion to access the data in Postgres, you’ll need to create a PostgreSQL user for replication. Below is an example of the commands that need to be run on your Postgres instance to do this.
First, you’ll need to open the postgresql.conf file so it can be edited. Below is the command you can use to edit it with vi.
Next, set up the parameters as shown below in postgresql.conf
Lastly, to enable log consumption for CDC replication on the PostgreSQL server, the test_decoding plugin that is by default installed in PostgreSQL can be used. Alternatively, you can install the logical decoding plugin wal2json.
For the tables part of the replication process that do not have a primary key, set the REPLICA IDENTITY to FULL.
Step 4: Setup Connection Configuration
From $REPLICANT_HOME, go to the connection configuration file.
Next, you can instruct Replicant to retrieve your connection credentials from AWS Secrets Manager if they are stored there. If not, simply put them as shown in the sample below.
You can also enable SSL for the connection by including the SSL field and specifying the needed parameters.
Step 5: Setup Filter Configuration
From $REPLICANT_HOME, open the filter configuration file.
Specify the data to be replicated based on your replication needs using the template format shown below.
Step 6: Setup Extractor Configuration
To set up the extractor configuration, you will have to specify the configuration in the Extractor configuration file according to your requirements. A sample Extractor configuration file, postgresql.yaml, is found in the $REPLICANT_HOME/conf/src directory. In this file, you can configure either snapshot, realtime, or delta-snapshot replication modes based on your needs by specifying parameters within each respective mode.
To configure snapshot replication
Below is a sample for the snapshot configuration mode.
To configure realtime replication
For the realtime replication mode, you must create a heartbeat table in the source PostgreSQL database. The steps below are used to create the table, grant privileges, and configure the Extractor configuration for realtime replication.
To meet the heartbeat table requirement, create a heartbeat table in the schema of the database to be replicated using the DDL below.
Next, Grant INSERT, UPDATE, and DELETE privileges to the user configured for replication.
Lastly, under the realtime section of the extractor configuration file, specify your configuration, including your heartbeat table details, as seen in the example below.
To configure delta-snapshot replication
The delta-snapshot replication mode can be used when you are unable to create replication slots in PostgreSQL using either wal2json or test_decoding. It carries out its replication without replication slots and uses PostgreSQL’s internal column to identify changes. The example below shows how this can be done in the delta-snapshot section of the Extractor configuration file.
Step 7: Setting Up The Connection Configuration for Snowflake
With our PostgreSQL connection configured, we can now move on to setting up the Snowflake connection.
From $REPLICANT_HOME, the directory where Arcion is installed, go to the sample Snowflake Connection configuration file, snowflake.yaml, located in the conf/conn/ directory.
The configuration file for setting up the connection for Snowflake is divided into two parts: the Snowflake connection-related parameters and the stage configuration-related parameters.
For Snowflake connection-related parameters, to set up the target Snowflake instance, two methods can be used to get an authenticated connection, they are RSA key pair authentication or basic username and password authentication. To use the basic username and password authentication, you can get your credentials by using Replicant to retrieve the credentials from AWS Secrets Manager if you are using it. Alternatively, you can specify your credentials in the connection configuration file as seen below.
From the example above, the value definitions for each field are as follows:
type: This indicates the connection type representing the database, it is Snowflake in this case.
host: This shows the hostname of your Snowflake instance.
port: This is the port number to connect to the host.
warehouse: This is the name of the Snowflake warehouse.
username: This refers to the username credential to access the Snowflake system.
password: This is the password associated with the username.
max-connections: This indicates the maximum number of connections Replicant uses to load data into the Snowflake instance.
Having set up the Snowflake connection-related parameters, you can also set up the stage configuration-related parameters. The stage configuration enables users to tune native or external staging areas for bulk loading.
The RSA key pair for authentication is going to be used to configure the parameters related to stage configuration. Snowflake’s key pair authentication support is used for enhanced authentication security as an alternative to the basic username and password. To use this, follow the link above to go through the procedures for generating a private key, generating a public key, storing both keys, assigning the public key to a Snowflake user, and verifying the user’s public key fingerprint. Finally, editing the connection configuration with these new details for the stage configuration should also be done.
Step 8: Setting Up The Applier Configuration
Next, we need to add the Applier configuration for Snowflake. The Applier configuration file contains all the parameters that Replicant uses while loading data into Snowflake. While it is not necessary to modify most of the parameters, they can be adjusted to optimize the Applier job as necessary
From $REPLICANT_HOME, go to the sample Snowflake Applier configuration file:
The configuration file for setting up the Applier configuration is divided into two parts namely the parameters related to snapshot mode and the parameters related to real-time mode.
To configure the parameters related to snapshot mode
For snapshot mode, make the changes below:
To configure the parameters related to realtime mode
The realtime section in the snowflake.yaml config file is used to configure the parameters related to realtime mode, below is a sample code showing how to set up the config file for the realtime mode.
With this, the Snowflake connector has been configured and we can now run Replicant Arcion’s program for running the CDC pipeline which will be the final step in the PostgreSQL to Snowflake connection using Arcion.
Step 9: Running Replicant
From $REPLICANT_HOME, run the following command to execute Replicant in realtime mode. When Replicant starts up, the pipeline will be immediately activated after the command is executed.
After executing the command and the pipeline is started in realtime mode, Replicant does the following:
- Replicant first creates the destination schemas if they are not already present but, if the destination schemas are present, Replicant appends them to the existing tables.
- Replicant starts replicating real-time operations performed within the PostgreSQL instance, obtained via log-based CDC. By default, real-time mode starts replicating from the latest log position, but a custom start position can be specified by the user in the real-time section of the extractor configuration file.
This article showed you how to transfer PostgreSQL with Snowflake for the transfer of data. We covered how to do this using two distinct methods: custom scripts, and using Arcion to set up a real-time CDC data pipeline.
From the examples above, we can see that using custom scripts is a full refresh method, essentially taking a snapshot of the entire dataset and adding it to the target. To keep data up-to-date would require more scripts if you want to get any incremental data. On the other hand, using Arcion ensures that any data changes in PostgreSQL are captured in real-time and sent to Snowflake. This means that any incremental changes are detected and replicated to the warehouse in real time, keeping any insights generated from analytics up-to-date and relevant.
Arcion is a real-time, in-memory Change Data Capture (CDC) solution that guarantees scalability and ensures data consistency at all times when moving data from PostgreSQL to Snowflake. To easily implement a data integration strategy, get started with Arcion today by connecting with our team. CDC pipelines built with Arcion are easy to build, maintain, and scale and are production-ready in a matter of minutes.