PostgreSQL Streaming Replication: 2 Easy Ways

Luke Smith
Enterprise Solutions Architect
December 20, 2022
Matt Tanner
Developer Relations Lead
December 20, 2022

With PostgreSQL, there are many ways to observe and copy changes from one database to another. When it comes to keeping data in sync across multiple platforms, there's nothing more efficient than using streaming replication. In this article, we will focus on two easy ways to enable streaming replication using PostgreSQL as a source. Let’s take a look at what PostgreSQL is, a few ways to implement replication, and a step-by-step walkthrough of how to use Arcion and how to implement streaming replication natively in Postgres.

Introduction To PostgreSQL

PostgreSQL is a popular and powerful open-source Relational Database Management System (RDBMS). It uses standard Structured Query Language (SQL), and the pgAdmin tool provides both a GUI and a SQL interface. Postgres is highly conformant with the SQL standard.

PostgreSQL has been widely used in production systems for many years and has been in active development since its initial release in 1996. It supports all major operating systems, including Linux, Unix, macOS, and Windows. It is highly reliable and offers strong data integrity and correctness, backed by years of development from its open-source community.

PostgreSQL is a very robust database that is frequently used as the database layer for web applications, as a data warehouse, or for analytics and business intelligence applications. PostgreSQL is truly a multi-use database with a track record of being the database of choice for some very large, high-profile applications and services. 

PostgreSQL is commonly seen being used:

  • By large corporations and startups for transactional database purposes.
  • As a back-end database in the Linux, Apache, PostgreSQL, and PHP (or Python and Perl) (LAPP) stack to power dynamic websites and web applications. 
  • As a geospatial database used for geographic information systems (GIS) with the PostGIS extension found on PostgreSQL.

For more information on PostgreSQL, you can visit the official site here.

Introduction To Replication

In general, replication is the act of copying or reproducing an entity so that multiple copies of the original exist. Replication in databases is closely related to that definition: it is the act of copying data from one database instance to another. Replication allows for the sharing of information between two or more databases, generally as a means of keeping two or more data sources in sync.

Replication can be carried out between databases hosted on the same physical machine or across a network. The database on which initial changes occur acts as the source, also known as the active, master, or primary node. The other databases or nodes that receive the subsequent changes or copy of the replicated data are known as the target, sometimes referred to as the destination, slave, or secondary node. The source database usually allows for read and write capabilities while the target is generally used as a read-only server. Of course, there are scenarios where all databases that are part of the replication strategy can perform read and write operations while staying in sync.

Best Ways To Replicate Data From PostgreSQL

Data can be replicated in PostgreSQL either synchronously or asynchronously. This can be done using a single-master or multi-master model, and a logical or physical replication model.

  • With synchronous replication, the source database receives a write transaction, propagates the change to the targets in the replication pipeline, and waits for confirmation before the transaction is reported as committed. With asynchronous replication, the primary node commits the transaction first and propagates the change to standby nodes afterward, without waiting for confirmation (see the configuration sketch after this list). 
  • In a single-master setup, there is only one active node that can receive write and read transactions. Other nodes within the setup perform read operations only. On the other hand, in a multi-master replication, more than one node can receive and perform write and read operations.
  • The core idea of physical replication is to preserve the state of the physical system as closely as possible: it ships the binary, block-level changes recorded in the write-ahead log. Logical replication, by contrast, is only concerned with reproducing the logical change itself, such as the rows inserted or updated, regardless of how the data is physically stored. 
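
As a point of reference, in native PostgreSQL the choice between synchronous and asynchronous behavior is governed by two parameters on the primary. The sketch below is illustrative only; the standby name standby1 is an assumed application_name, not something from this setup.

# postgresql.conf on the primary (illustrative sketch)
synchronous_standby_names = 'standby1'  # standby(s) that must confirm a commit; an empty value means replication is asynchronous
synchronous_commit = on                 # with a synchronous standby configured, commits wait for its confirmation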

Prerequisites

In order to understand and implement the methods discussed below, you will need to meet two prerequisites.

  • Familiarity with PostgreSQL and basic database concepts.
  • PostgreSQL installed on both a primary host and a standby host.

Once these two prerequisites are complete, you can move forward and implement both of the streaming replication approaches outlined in the rest of the article.

Replication Methods In PostgreSQL

As mentioned, there are many ways to implement streaming replication in PostgreSQL. In this section, we will discuss two methods for carrying out PostgreSQL streaming replication. The first uses Arcion as a tool to quickly set up a real-time CDC pipeline; the second uses the streaming replication functionality that is available natively in Postgres. Let's get started by looking at the first method, Arcion.

Method 1: Arcion, An Easy Way To Do It 

Arcion allows users to create a no-code data pipeline for replication. It offers both self-managed on-premise and fully-managed cloud products to fit your exact needs. By using Arcion, it is easy for organizations to build pipelines for replication using many different data sources. Arcion allows you to move data from a source database to an external system such as a big data platform like Snowflake or Databricks, with no code required. 

For many use cases, Arcion is much easier and more flexible than the built-in replication and CDC tools supported by the major database providers. Below is an example of how to make use of Arcion for streaming replication in PostgreSQL. 

Create a User in PostgreSQL

Step 1: Log in to the PostgreSQL client.

psql -U $POSTGRESQL_ROOT_USER

Step 2: Create a user to be used for replication.

CREATE USER <username> PASSWORD '<password>';

Step 3: Grant the user the following permissions:

GRANT USAGE ON SCHEMA "<schema>" TO <username>; 

GRANT SELECT ON ALL TABLES IN SCHEMA "<schema>" TO <username>; 

ALTER ROLE <username> WITH REPLICATION;
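
To confirm the role now has replication privileges, a quick check against the pg_roles catalog can be run; this is just an optional sanity check.

-- Verify the replication attribute on the new user (placeholder username)
SELECT rolname, rolreplication FROM pg_roles WHERE rolname = '<username>';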

Set Up PostgreSQL for Replication

In order for Arcion to capture changes from Postgres, the instance needs to be configured for logical replication. Below is an example of the changes that need to be made on your Postgres instance to do this.

Step 1: Edit the postgresql.conf

vi $PGDATA/postgresql.conf

Step 2: Set the parameters shown below in postgresql.conf. Note that changing wal_level requires a restart of the PostgreSQL server to take effect.

wal_level = logical
max_replication_slots = 1 #Can be increased if more slots need to be created

Step 3: To enable log consumption for CDC replication on the PostgreSQL server, the test_decoding plugin, which is installed by default in PostgreSQL, can be used. Alternatively, you can install the wal2json logical decoding plugin.
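
Arcion reads changes through logical replication slots, which are listed later in the connection configuration. If you want to create and inspect a slot ahead of time, the built-in functions below can be used; the slot name io_replicate is only illustrative.

-- Create a logical replication slot using the default test_decoding plugin
SELECT pg_create_logical_replication_slot('io_replicate', 'test_decoding');

-- Peek at pending changes on the slot without consuming them
SELECT * FROM pg_logical_slot_peek_changes('io_replicate', NULL, NULL);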

Set Up Connection Configuration

Next, in Arcion we must set up our configuration file with our connection details. Providing these details lets Arcion know how to access the database.

Step 1: On the server containing your Arcion deployment, from REPLICANT_HOME, navigate to the PostgreSQL connection configuration file:

vi conf/conn/postgresql.yaml

Step 2: If your connection details are stored in AWS Secrets Manager, you can tell Replicant to retrieve them. Alternatively, you can enter credentials such as the username and password directly in the configuration file. This is the method shown in the example below.

type: POSTGRESQL 

host: localhost #Replace localhost with your PostgreSQL host name
port: 5432 #Replace the default port number 5432 if needed 

database: "postgres" #Replace postgres with your database name
username: "replicant" #Replace replicant with your PostgreSQL username
password: "Replicant#123" #Replace Replicant#123 with your user's password 

max-connections: 30 #Maximum number of connections Replicant can open in PostgreSQL
socket-timeout-s: 60 #The timeout value for socket read operations, in seconds. A value of zero means the timeout is disabled.
max-retries: 10 #Number of times any operation on the source system will be re-attempted on failures.
retry-wait-duration-ms: 1000 #Duration in milliseconds Replicant should wait before performing the next retry of a failed operation. 

#List your replication slots (slots which hold the real-time changes of the source database) as follows
replication-slots:
  io_replicate: #Replace "io_replicate" with your replication slot name
    - wal2json #Plugin used to create the replication slot (wal2json | test_decoding)
  io_replicate1: #Replace "io_replicate1" with your replication slot name
    - wal2json

log-reader-type: SQL #Possible values: SQL or STREAM

Step 3: If you require SSL, you can enable it by including the ssl field and specifying its parameters in the configuration file. An example of this can be seen below: 

ssl:
  ssl-cert: <full_path_to_SSL_certificate_file>
  root-cert: <full_path_to_SSL_root_certificate_file>
  ssl-key: <full_path_to_SSL_key_file>

Set Up Filter Configuration

Next, we must configure the Filter to specify which tables, columns, and data should be replicated.

Step 1: On the server containing your Arcion deployment, still in the REPLICANT_HOME directory, navigate to the PostgreSQL filter configuration file:

vi filter/postgresql_filter.yaml

Step 2: Now, in the postgresql_filter.yaml file, we will add the configuration that specifies the data to be replicated. The example below shows the format this configuration should follow.

allow:
  catalog: <your_catalog_name>
  schema: <your_schema_name>
  types: <your_object_type>

  #If no collections are specified, all the data tables in the provided catalog and schema will be replicated
  allow:
    <your_table_name>:
      allow: ["your_column_name"]
      conditions: "your_condition"

    <your_table_name>:
      allow: ["your_column_name"]
      conditions: "your_condition"

    <your_table_name>:

For a detailed explanation of configuration parameters in the filter file, refer to the Filter Reference docs.

Set Up Extractor Configuration

Lastly, we will need to create a heartbeat table on the PostgreSQL instance. This table allows Arcion to accurately compute replication latency, which is needed for optimal results.

Step 1: On your master PostgreSQL database, create the heartbeat table for Arcion:

CREATE TABLE "<user_database>"."public"."replicate_io_cdc_heartbeat"("timestamp" INT8 NOT NULL, PRIMARY KEY("timestamp"));
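
Depending on how privileges were granted earlier, the replication user may also need write access to this heartbeat table. The statement below is a hedged example, assuming the user created in the earlier step.

-- Allow the replication user (placeholder) to write heartbeat rows
GRANT INSERT, UPDATE, DELETE ON "public"."replicate_io_cdc_heartbeat" TO <username>;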

Step 2: On the server containing your Arcion deployment, from the REPLICANT_HOME directory, navigate to the PostgreSQL extractor configuration file:

vi conf/src/postgresql.yaml

Step 3: In the extractor configuration file, add the following lines, with the appropriate substitutions, to make the heartbeat table from your Postgres instance available to Arcion.

realtime:
  heartbeat:
    enable: true
    catalog: "postgres" #Replace postgres with your database name
    schema: "public" #Replace public with your schema name
    table-name: replicate_io_cdc_heartbeat #Heartbeat table name, if changed
    column-name: timestamp #Heartbeat table column name, if changed

With that, your PostgreSQL instance has now been added as a source to Arcion. The last step, depending on the platform you are replicating data to, is to set up your target configuration. To create a target, visit the target setup docs and select your desired platform.

Once the target is configured, Arcion can be started and replication can begin. Depending on what mode you’ll be running Arcion in, different commands can be issued. For a full breakdown of how to run Arcion once it is configured, check out the docs.

Method 2: PostgreSQL Streaming Replication

Streaming replication is a method of replicating data between two or more servers. In this configuration, the actual source database is known as the primary or main server. The servers where the data is replicated to are generally referred to as the slave or standby servers. The standby server is constantly updated by reading a stream of Write-Ahead Log (WAL) records from the primary server. This action can be carried out synchronously or asynchronously depending on the needs of the business.

PostgreSQL streaming replication, also known as WAL replication, is set up by installing PostgreSQL on all the servers where replication will occur. The WAL records can then be streamed between the servers and the transactions applied to the standby servers. The most common configuration uses the master-slave model. 

The master server, which is the primary instance, is built to handle the primary database and all its operations. The slave acts as a secondary instance and tracks all the changes made on the primary server, applying these changes to itself. The master is a read-and-write server while the slave is a read-only server. The data on the slave server are merely copies of the original data hosted on the master server. 

If needed, the slave server can be promoted to a primary server. This is a good way of ensuring high availability and a backup of the data in case of disaster or error. When the master server goes down, a slave server can quickly be promoted into the master position, ensuring that data remains accessible and service interruptions are minimized.
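
As an illustration, a standby can be promoted with the pg_promote() SQL function, which is available in PostgreSQL 12 and later.

-- Run on the standby, from psql (PostgreSQL 12 and later)
SELECT pg_promote();

On older versions, the equivalent is running pg_ctl promote -D <data_directory> from the shell on the standby, where the data directory path is a placeholder.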

To set up PostgreSQL streaming replication, the following steps are required.

  • Configuring the master node
  • Configuring the standby node
  • Starting the replication

Let’s take a look at each of these steps in detail below.

Configuring Master Node

The first step is to actually ensure that a master node is created. You can use an existing PostgreSQL database for this, however, in the example below we will walk through the steps of creating a new Postgres database cluster.

Step 1: Database Initialization

To initialize the database, the initdb utility command can be used. initdb creates a new PostgreSQL database cluster, which is a collection of databases managed by a single server instance. It creates the directories in which the database data will live, generates the shared catalog tables, and creates the template databases.
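
A minimal sketch of initializing and starting a new cluster is shown below; the data directory path is only illustrative.

# Initialize a new database cluster in the given data directory
initdb -D /usr/local/pgsql/data

# Start the server against that data directory
pg_ctl -D /usr/local/pgsql/data -l logfile start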

Step 2: Creating A New User

The next step is to create a new user with replication privileges. This is done using the command below.

CREATE USER <user_name> REPLICATION LOGIN ENCRYPTED PASSWORD '<password>';

Replace the username and password in the query above with those of the user you will use for replication. The command grants the privileges necessary for the user to perform replication.

Step 3: Configuring PostgreSQL Streaming Replication Properties

Configure the streaming properties in the PostgreSQL configuration file, postgresql.conf, by applying the changes shown in the example below. The parameters are explained after the example, and a quick verification query follows the list.

wal_level = logical
wal_log_hints = on
max_wal_senders = 8
max_wal_size = 1GB
hot_standby = on

The parameters listed above include:

  • wal_level: This is used to enable PostgreSQL streaming replication. Its possible values are minimal, replica, and logical. It specifies how much information is written to the WAL, and must be at least replica to allow connections from standby servers.
  • wal_log_hints: This helps when the standby server is out of sync with the master server. It is needed for the pg_rewind capability.
  • max_wal_senders: This is used to indicate the maximum number of concurrent connections from standby servers or streaming base backup clients. The default value is 10 and choosing 0 means replication is disabled.
  • max_wal_size: This specifies the maximum size the WAL is allowed to grow to between automatic checkpoints.
  • hot_standby: This enables read-only queries on the standby while it is in recovery, when it is set to on.
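
After restarting the server so the changes take effect, the values can be verified from psql; this is just an optional sanity check.

-- Confirm the streaming replication settings currently in effect
SELECT name, setting FROM pg_settings
WHERE name IN ('wal_level', 'wal_log_hints', 'max_wal_senders', 'max_wal_size', 'hot_standby');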

Step 4: Replication Entry In pg_hba.conf File

Adding a replication entry will allow the created servers to connect for replication purposes. The pg_hba.conf file is usually located in the data directory of PostgreSQL. The code below is an illustration of how to set up the primary server to allow a user to connect and act as a standby server using a specified IP address.

host replication <user_name> <IP_address>/32 md5
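
For illustration, assuming the replication user is rep_user and the standby's address is 192.168.0.22 (both placeholders), the entry might look like this:

# pg_hba.conf entry on the primary
host    replication    rep_user    192.168.0.22/32    md5

After editing pg_hba.conf, the configuration can be reloaded without a full restart, for example by running SELECT pg_reload_conf(); from psql on the primary.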

Configuring Standby Node

After configuring the master node for PostgreSQL streaming replication, the next step is to configure the standby, or slave, node. The standby node is configured for streaming replication in two easy steps.

Step 1: Create A Backup Of The Master Node

The pg_basebackup utility is used to create a backup of the master node, which serves as the starting point for the slave node. It is used as shown in the code below.

pg_basebackup -D <data_directory> -h <master_host> -X stream -c fast -U rep_user -W

From the command above, the following parameters are used (a concrete example follows this list):

  • -D: Specifies the data directory where the backup, the standby's data directory, will be written.
  • -h: Indicates the master host to connect to.
  • -X stream: Streams the required WAL (transaction log) files while the backup is taken.
  • -c fast: Sets the checkpoint mode to fast so the backup starts immediately.
  • -W: Prompts for a password before connecting to the database.
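
As a concrete, hedged example, an invocation might look like the following; the host, data directory, and user are placeholders, and the -R flag (covered in the next step) writes the replication configuration for you automatically.

# Run on the standby host to clone the primary (illustrative values)
pg_basebackup -h 192.168.0.26 -D /var/lib/postgresql/standby_data -U rep_user -X stream -c fast -R -W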

 

Step 2: Creating The Replication Configuration File

Create the replication configuration file, recovery.conf, if it does not already exist. It is created in the PostgreSQL data directory, or it can be generated automatically by passing the -R option to pg_basebackup. Note that recovery.conf applies to PostgreSQL versions before 12; on newer versions the same settings go in postgresql.conf (or postgresql.auto.conf) and an empty standby.signal file marks the server as a standby. The recovery.conf file should be set up as shown in the example below.

standby_mode = 'on'
primary_conninfo = 'host=<master_host> port=<postgres_port> user=<replication_user> password=<password> application_name=<host_name>'
recovery_target_timeline = 'latest'

From the example above, the following parameters can be seen.

  • standby_mode: This causes the server to start as the slave when this parameter is set to ON.
  • primary_conninfo: With the help of a connection string, primary_conninfo creates the connection between the slave and the master server. The master's host address, port, and the replication user's credentials are supplied here. An example of this is illustrated below.
primary_conninfo = 'host=192.168.0.26 port=5432 user=rep_user password=rep_pass'
  • recovery_target_timeline: This specifies the timeline to recover to; setting it to 'latest' makes the standby follow the newest timeline.

Step 3: Restart Secondary Server

Finally, the secondary server must be restarted to complete the configuration process. With this, your PostgreSQL instance is set up for streaming replication and will replicate data from the primary to the secondary database.
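
A quick sketch of restarting the standby is shown below; the data directory path is a placeholder.

# On the standby, restart the server to apply the configuration
pg_ctl -D <data_directory> restart

Replication can then be confirmed from the primary by querying the pg_stat_replication view, for example SELECT client_addr, state, sync_state FROM pg_stat_replication; a row in the streaming state indicates the standby is connected and receiving WAL.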

Conclusion

This article has taken you through how to set up PostgreSQL streaming replication. Two ways of implementing this were discussed above. The first method, using Arcion, allows users to leverage a dynamic no-code connector platform to quickly set up replication. Arcion can be used by organizations to build pipelines for replication beyond just a Postgres-to-Postgres setup, unlike native Postgres streaming capabilities. For more information on how to set up PostgreSQL Streaming Replication using Arcion, visit here.

The second approach uses PostgreSQL's built-in streaming replication. This method shows how you can configure the master and standby nodes to ensure reliable streaming. However, this method can become difficult quickly if one is not well-versed in the requisite knowledge, which makes Arcion a simpler approach to streaming replication. Arcion also adds flexibility, supporting many different source and target platforms as well as automated schema evolution between platforms. To learn more about Arcion and try it out for yourself, sign up for a free download of Arcion Self-Hosted and start enabling Postgres streaming replication today.

Matt is a developer at heart with a passion for data, software architecture, and writing technical content. In the past, Matt worked at some of the largest finance and insurance companies in Canada before pivoting to working for fast-growing startups.
Luke has two decades of experience working with database technologies and has worked for companies like Oracle, AWS, and MariaDB. He is experienced in C++, Python, and JavaScript. He now works at Arcion as an Enterprise Solutions Architect to help companies simplify their data replication process.