
With PostgreSQL, there are many ways to observe and copy changes from one database to another. When it comes to keeping data in sync across multiple platforms, there's nothing more efficient than using streaming replication. In this article, we will focus on two easy ways to enable streaming replication using PostgreSQL as a source. Let’s take a look at what PostgreSQL is, a few ways to implement replication, and a step-by-step walkthrough of how to use Arcion and how to implement streaming replication natively in Postgres.
Introduction To PostgreSQL
PostgreSQL is a popular and powerful open-source Relational Database Management System (RDBMS). It uses standard Structured Query Language (SQL) and the pgAdmin tool to provide GUI and SQL interface. Postgres is a strictly SQL-compliant platform.
PostgreSQL has been widely used in production systems for many years and has been in active development since its initial release in 1996. It provides support for all major operating systems such as Linux, Unix, Mac OS, and Windows. It is highly reliable, has data integrity, and data correctness backed by years of development through its open-source community.
PostgreSQL is a very robust database that is frequently used as the database layer for web applications, as a data warehouse, or for analytics/business intelligence applications. PostgreSQL is truly a multi-use database with a track record of being the database of choice for some truly large and great applications and services.
PostgreSQL is commonly seen being used:
- By large corporations and startups for transactional database purposes.
- As a back-end database in Linux, Apache, PostgreSQL, and PHP (python and Perl) (LAPP) stack to power dynamic websites and web applications.
- As a geospatial database used for geographic information systems (GIS) with the PostGIS extension found on PostgreSQL.
For more information on PostgreSQL, you can visit the official site here.
Introduction To Replication
In general, replication is the act of copying or reproducing an entity so that multiple copies of the original will be in existence. Replication in databases is closely related to the definition given above as it is the act of copying data from one database instance to another. Replication allows for the sharing of information between two or more databases, generally as a means of keeping two or more data sources in sync.
Replication can be carried out between databases hosted on the same physical machine or across a network. The database on which initial changes occur acts as the source, also known as the active, master, or primary node. The other databases or nodes that receive the subsequent changes or copy of the replicated data are known as the target, sometimes referred to as the destination, slave, or secondary node. The source database usually allows for read and write capabilities while the target is generally used as a read-only server. Of course, there are scenarios where all databases that are part of the replication strategy can perform read and write operations while staying in sync.
Best Ways To Replicate Data From PostgreSQL
Data can be replicated in PostgreSQL either synchronously or asynchronously. This can be done using a single-master or multi-master model, and a logical or physical replication model.
- With synchronous replication, the source database receives a write transaction and propagates the change to all targets in the replication pipeline. A confirmation is sent before the transaction is committed or marked as completed. With asynchronous replication, the primary node receives a transaction and then executes the transaction before propagating that change to standby nodes.
- In a single-master setup, there is only one active node that can receive write and read transactions. Other nodes within the setup perform read operations only. On the other hand, in a multi-master replication, more than one node can receive and perform write and read operations.
- The core idea of physical replication is that it tries to preserve the state of the physical system as much as possible. To do this, physical replication uses binary formatting to enable replication. For logical replication, the only concern in regard to copying the data is to make sure that it is identical to the change that occurred.
Prerequisites
In order to understand and implement the methods discussed below, you will need to ensure two prerequisites.
- Familiarity with PostgreSQL and basic database concepts.
- PostgreSQL is installed on a host workstation and a standby.
Once these two prerequisites are complete, you can move forward and implement both of the streaming replication approaches outlined in the rest of the article.
Replication Methods In PostgreSQL
As mentioned, there are many ways to implement streaming replication in PostgreSQL. We are going to discuss two methods in which you can carry out PostgreSQL streaming replication in this section of the write-up. The two methods we will cover will be using Arcion as a tool to quickly set up a real-time CDC pipeline and the built-in streaming replication functionality that is available natively in Postgres. Let's get started by looking at the first method, Arcion.
Method 1: Arcion, An Easy Way To Do It
Arcion allows users to create a no-code data pipeline for replication. It offers both self-managed on-premise and fully-managed cloud products to fit your exact needs. By using Arcion, it is easy for organizations to build pipelines for replication using many different data sources. Arcion allows you to move data from a source database to an external system such as a big data platform like Snowflake or Databricks, with no code required.
For many use cases, Arcion is much easier and more flexible than the built-in replication and CDC tools supported by the major database providers. Below is an example of how to make use of Arcion for streaming replication in PostgreSQL.
Create a User in Postgresql
Step 1: Log in to the PostgreSQL client.
Step 2: Create a user to be used for replication.
Step 3: Grant the user the following permissions:
Setup PostgreSQL for Replication
In order for Arcion to access the data in Postgres, you’ll need to create a PostgreSQL user for replication. Below shows an example of the commands that need to be run on your Postgres instance to do this.
Step 1: Edit the postgresql.conf
Step 2: Set up the parameters as shown below in postgresql.conf
Step 3: To enable log consumption for CDC replication on the PostgreSQL server, the test_decoding plugin that is by default installed in PostgreSQL can be used. Alternatively, you can install the logical decoding plugin wal2json.
Set Up Connection Configuration
Next, in Arcion we must set up our configuration file with our connection details. Providing these details lets Arcion know how to access the database.
Step 1: On the server containing your Arcion deployment, from REPLICANT_HOME, navigate to the PostgreSQL connection configuration file:
Step 2: If your connection file is stored in AWS Secrets Manager, you can tell Replicant to retrieve it. Alternatively, you can put in credentials such as username and password in the configuration file directly. This is the method that can be seen in the example below.
Step 3: If you require SSL, you can enable SSL by including the SSL field and specifying the parameters in the configuration file. An example of this can be seen below:
Set Up Filter Configuration
Next, we must configure the Filter to specify which tables, columns, and data should be replicated.
Step 1: On the server containing your Arcion deployment, still in the REPLICANT_HOME directory, navigate to the PostgreSQL filter configuration file:
Step 2: Now, in the postgresql_filter.yaml file, we will add the exact configuration which will specify the data that should be replicated. The examples below show the format that this configuration should be input as.
For a detailed explanation of configuration parameters in the filter file, refer to the Filter Reference docs.
Set Up Extractor Configuration
Lastly, we will need to create a heartbeat table on the PostgreSQL instance. This table ensures an accurate computation of latency that is needed for optimal results from Arcion.
Step 1: On your master PostgreSQL database, create the heartbeat table for Arcion:
Step 2: On the server containing your Arcion deployment, from the REPLICANT_HOME directory, navigate to the PostgreSQL extractor configuration file:
Step 3: In the extractor configuration file, add the following lines, with the appropriate substitutions, to make the heartbeat table from your Postgres instance available to Arcion.
With that, your PostgreSQL instance has now been added as a source to Arcion. The last step, depending on the platform you are replicating data to, is to set up your target configuration. To create a target, visit the target setup docs and select your desired platform.
Once the target is configured, Arcion can be started and replication can begin. Depending on what mode you’ll be running Arcion in, different commands can be issued. For a full breakdown of how to run Arcion once it is configured, check out the docs.
Method 2: PostgreSQL Streaming Replication
Streaming replication is a method of replicating data between two or more servers. In this configuration, the actual source database is known as the primary or main server. The servers where the data is replicated to are generally referred to as the slave or standby servers. The standby server is constantly updated by reading a stream of Write-Ahead Log (WAL) records from the primary server. This action can be carried out synchronously or asynchronously depending on the needs of the business.
PostgreSQL streaming replication, also known as WAL replication, is set up by installing PostgreSQL on all the servers where replication will occur. The WAL files can then be transferred across the servers and transactions can be applied to the standby servers. The most common implementation of this is implemented by using the master-slave model.
The master server, which is the primary instance, is built to handle the primary database and all its operations. The slave acts as a secondary instance and tracks all the changes made on the primary server, applying these changes to itself. The master is a read-and-write server while the slave is a read-only server. The data on the slave server are merely copies of the original data hosted on the master server.
If needed, the slave server can be promoted to a primary server. This can be a good way of ensuring high availability and backup of data in case of disaster or error. When the master server goes, a slave server can quickly be switched into the master position, ensuring that data is readily accessible and service interruptions are minimized.
To set up PostgreSQL streaming replication, the following steps are required.
- Configuring the master node
- Configuring the standby node.
- Starting the replication
Let’s take a look at each of these steps in detail below.
Configuring Master Node
The first step is to actually ensure that a master node is created. You can use an existing PostgreSQL database for this, however, in the example below we will walk through the steps of creating a new Postgres database cluster.
Step 1: Database Initialization
To initialize the database, the initidb utility command can be used. initidb creates a new PostgreSQL database cluster which is a collection of databases managed by a single server instance having directories in which the database data will live, generating shared catalog tables, and creating templates
Step 2: Creating A New User
The next step is to create a new user with replication privileges. This is done using the command below.
The username and password for the query above should contain the username and password of the user you will use for replication. The command above will ensure the necessary privileges will be granted for the user to enable replication.
Step 3: Configuring PostgreSQL Streaming Replication Properties
Configure the streaming properties using the PostgreSQL configuration file, postgresql.conf. Apply the following changes to it that are shown in the example below.
The parameters listed above include:
- wal_level: This is used to enable PostgreSQL streaming replication, its has replica, minimal, and logical as possible values. It is used to specify the level of logging for the WAL to allow connections from standby servers.
- wal_log_hints: This helps when the standby server is out of sync with the master server. It is needed for the pg_rewind capability.
- max_wal_senders: This is used to indicate the maximum number of concurrent connections from standby servers or streaming base backup clients. The default value is 10 and choosing 0 means replication is disabled.
- max_wal_size: This is used to specify the size of the WAL files that can be retained within the log files.
- hot_standby: This is used to enable a read-only connection with the slave when it is set to on.
Step 4: Replication Entry In pg_hba.conf File
Adding a replication entry will allow the created servers to connect for replication purposes. The pg_hba.conf file is usually located in the data directory of PostgreSQL. The code below is an illustration of how to set up the primary server to allow a user to connect and act as a standby server using a specified IP address.
Configuring Standby Node
After configuring the master node for PostgreSQL streaming replication, the next step is to configure the standby, or slave, node. The standby node is configured for streaming replication in two easy steps.
Step 1: Create A Backup Of The Master Node
pg_basebackup utility is used in the creation of a backup of the master node. This is the starting point of the slave node. It is used as given in the code below.
From the code above, the following parameters are gotten:
- -D: This shows the directory you are working on.
- -h: This is used to indicate the master host.
- -X: This is used to specify the necessary transactional log files.
- -C: This is used to set the checkpoints.
- -W: This prompts a user for a password before connecting to the database.
Step 2: Creating The Replication Configuration File
Create the replication configuration file as a recovery.conf if it does not exist. It is created in the PostgreSQL installation data directory or can be created automatically using the -R option in the pg_basebackup. The recovery.conf file should be set up as shown in the example below.
From the example above, the following parameters can be seen.
- standby_mode: This causes the server to start as the slave when this parameter is set to ON.
- primary_conninfo: With the help of a connection string, primary_conninfo is used to create a connection between the slave and the master server. The IP address, username, and password of the master and slave are input for the creation of the connection. An example of this is illustrated below.
- recovery_target_timeline: The recovery time is set using this parameter.
Step 3: Restart Secondary Server
Finally, the secondary server must be restarted to complete the configuration process. With this, your PostgreSQL instance is set up for Streaming replication and will replicate data between the primary and secondary databases.
Conclusion
This article has taken you through how to set up PostgreSQL streaming replication. Two ways of implementing this were discussed above. The first method, using Arcion, allows users to leverage a dynamic no-code connector platform to quickly set up replication. Arcion can be used by organizations to build pipelines for replication beyond just a Postgres-to-Postgres setup, unlike native Postgres streaming capabilities. For more information on how to set up PostgreSQL Streaming Replication using Arcion, visit here.
The second approach mentioned was by using PostgreSQL built-in replication for streaming. This method shows how you could configure the master and standby nodes to ensure a reliable streaming service. However, it can be deduced that this method may become difficult very quickly if one is not vested with the requisite knowledge, hence, using Arcion for streaming replication is a better and simpler approach. Arcion also includes the added bonus of flexibility which includes supporting many different sources and target platforms as well as supporting automated schema evolution between platforms. To learn more about Arcion and try it out for yourself, sign-up for a free download of Arcion Self-Hosted and start enabling Postgres streaming replication today.