PostgreSQL CDC: A Comprehensive Guide

Luke Smith
Enterprise Solutions Architect
December 13, 2022
Matt Tanner
Developer Relations Lead
December 13, 2022
Matt Tanner
Matt Tanner
Developer Relations Lead
Developer Relations Lead
May 4, 2023
20
 min read
Join our newsletter

PostgreSQL is an open-source relational database that can be used to get access to your data in real-time. It has advanced features, is very versatile, functions as an RDBMS database, and can double as a data warehouse. 

Connecting PostgreSQL to other data stores for the exchange of data is done through data pipelines which may use time-based batches. These batch jobs are used to retrieve data from time to time from other stores which are then consolidated in PostgreSQL. Using batch jobs to move data can prove to be inadequate and open to errors and false results. This is because the latest version of the data may not be present in the transactional database because it had not been synced properly. Data synchronization issues are a major detriment to making data-driven decisions that are accurate and reliable. In this situation and others that require consistent and real-time data synchronization, another method is needed to continuously and accurately sync data between the instances. One of the most popular methods to implement this is to use Change Data Capture, usually shortened to CDC.

In databases, CDC is used to track changes in the data and take action regarding the changes. Usually, the action taken is to replicate the data changes from the source database to a target data store. Based on the above definition, CDC essentially captures change data, transforms the change data into a format a destination database supports for upload, and finally, uploads the data to the destination database. This keeps the data in sync between two, or more, databases.

There are multiple CDC methods available that can be implemented for use with a PostgreSQL database. A PostgreSQL-compatible CDC solution requires a way to detect and identify changes made to your database and apply the changes to changes to a target system. 

In this article, you will learn how to implement CDC in PostgreSQL using four different methods. These methods include 

  • Using Triggers in PostgreSQL
  • Using Queries in PostgreSQL
  • Using Logical Replication in PostgreSQL
  • Using Arcion
Table of Contents

What Is PostgreSQL?

PostgreSQL is an open-source relational database management system (RDBMS) operated by the Global Development Group. Fans of PostgreSQL tend to enjoy many benefits, including a large amount of extensibility within the platform and strict SQL compliance. It uses the standard Structured Query Language ( SQL) and pgAdmin tool to provide GUI and SQL interface.

PostgreSQL has been in active development since its initial release in 1996. It is widely used in production systems running on all major operating systems such as Linux, Unix, Mac OS, Windows, and others. PostgreSQL is extremely versatile and can be used for different use cases such as a data warehouse, for use with analytics/business intelligence applications, and as the database layer for web and mobile applications.

PostgreSQL has a lot of history and has been in existence for a long time, backed by years of development through its open-sourced community. This history and strong community foundation give it a good reputation among data management users as it is highly reliable, has data integrity, and is easy to use.

PostgreSQL is commonly used for the following purpose:

  1. For transactional database purposes by large corporations and startups to support their products.
  2. As a back-end database in the LAPP stack to power dynamic websites and web applications. LAPP stack includes Linux, Apache, PostgreSQL, and PHP (python and Perl).
  3. It also supports a geospatial database that is used for geographic information systems (GIS) with the PostGIS extension found on PostgreSQL.

What Is CDC?

As mentioned earlier, in brief, Change Data Capture, or CDC, is a concept or mechanism that is used to identify a data change that has occurred in a data store. The CDC process will then apply those identified changes to other data stores in real time. It is used to synchronize data across the data stores. Using CDC ensures data consistency for applications and systems by moving data from one data repository to another. Most CDC processes, at least those that are not intrusive, tend to rely on the internal log of a database called Write-Ahead Log (WAL). The Write-Ahead Log contains all the changes made to the database before they are applied to data files with tables and indexes. These changes can then be propagated to a destination to synchronize data between a source and a target.

Change Data Capture (CDC) can also encompass software tools to identify changes in source tables and databases. The software can then apply the detected changes to a target database to keep it in sync with the source. Using a CDC tool is one of the most effective and quick ways to implement real-time data synchronization between two or more databases. There are a few CDC tools on the market that offer effective CDC implementations with differing levels of customization and complexity.

Having explained what PostgreSQL and CDC both represent, let us dive into the ways in which PostgreSQL CDC can be set up from the four methods mentioned earlier.

How To Implement CDC With Triggers In PostgreSQL

Triggers allow you to carry out actions in a database when a specific action or event has taken place. A trigger can be defined as a specification you make that the database should automatically execute a particular function whenever a certain type of operation is carried out. The trigger itself could be executed before or after the operation is done depending on how the trigger is configured. In PostgreSQL, triggers can be attached to tables, views, and foreign tables. They can be defined to execute when any INSERT, UPDATE or DELETE events occur in the table. Triggers can also be used for TRUNCATE statements.

To trigger CDC on a specific table in PostgreSQL, you can call the SELECT statement using the audit.audit_table syntax that is shown below. 

SELECT audit.audit_table( ‚Äė target_table_name‚Äô ) ;

With this trigger, the table now has audit events recorded at a row level for every insert/update/delete and the query text will always be logged. The statement is to be executed by a user role with permission to access the audit schema and this approach captures events only inside PostgreSQL.

To cancel the auditing, use the code below:

DROP TRIGGER audit_trigger_row on target_table_name;
DROP TRIGGER audit_trigger_stm on target_table_name;

To use triggers to enable CDC, a common way is to create a table for the change records. Then, when a change occurs on a table the trigger will write the necessary change data to the table containing the change records. To actually propagate the changes to other data systems like data warehouses, you will have to continually query the PostgreSQL table harboring the change events. This is usually done by writing programs to read and parse logs to apply the changes to the target databases. This process can be quite complex and isn’t overly efficient. For more on triggers in PostgreSQL, visit here.

The advantage of triggers in PostgreSQL is that, unlike the other methods that will be discussed, all operations are carried out at the SQL level. With triggers, all change events are captured in real time since the triggers execute instantly when a change occurs. The downside of using triggers is it has a significant impact on the performance of the database. This is because as triggers run on the database any operations they perform add to the load placed on the database. Another downside is that if changes are to be reflected on a target outside of the PostgreSQL instance where the trigger is executed then a different data pipeline is also needed. The second pipeline will query for any changes and apply them to the target database. Creating and managing triggers and multiple pipelines can lead to an increasing amount of operational complexity.

How To Implement CDC With Queries In PostgreSQL

One of the methods of implementing CDC solutions in PostgreSQL is by querying an update timestamp column in the table. Any time a record is inserted or modified in a table, an update is made to the update timestamp column to reflect the date and time the record was changed. This means you can query the table only to return the rows that have been updated since the last time it was queried for changes. In practice, monitoring changes made to the timestamp column would be done using scripts to detect and write the changes to a target database.

The sample command below can be used to query a table with a timestamp column to only return rows that have been updated after a certain time. 

SELECT * FROM (table name) WHERE (timestamp column) > ‚ÄėTIMESTAMP_LAST_QUERY‚Äô;

The main advantage of query-based CDC is it can be implemented without any major changes made to PostgreSQL. The only requirement is that the schema has a column to record the timestamp of the last change event on the row. It is however disadvantaged for record deletion as it does not capture DELETE events. Tracking delete events will require implementing a workaround, such as disabling DELETE operations and using a deleted boolean column instead of actually deleting the record. Query-based CDC also puts an additional load on the PostgreSQL server since it uses the query layer for extracting data. Applying the changes to another target database will require constantly querying the table for changes, increasing the use of resources on the server. In all, this approach is intensive and demands a lot of effort, time, and resources from the developer.

How To Implement CDC With Logical Replication In PostgreSQL

PostgreSQL logical replication can be described as a streaming representation of the Write-Ahead Log (WAL). As mentioned earlier, the Write-Ahead Log maintains a record of all the data changes on the PostgreSQL database. These changes are observed as logical decoding streams and, at the database level, are identified as a logical replication slot. A replication slot is simply a stream of changes in a database. A database might have more than one slot or stream of changes.

Logical replication is implemented by using a logical decoding plugin, which converts WAL internal representation into the desired format for ingestion. PostgreSQL provides pgoutput by default which allows for Postgres database changes to be output into streams. pgoutput works well for replicating data between PostgreSQL instances since the streams are in binary form. It does not work well when trying to move PostgreSQL change stream data to another type of database or platform.

For moving change stream data to a platform outside of PostgreSQL a logical decoding plugin is recommended. A logical decoding plugin is a program written in C and is installed on the source PostgreSQL server. A popular logical decoding plugin is the wal2json plugin. With this plugin, the change stream data is formatted into JSON so it can be read by other target platforms in JSON format. This is preferable in these instances versus the binary output of pgoutput.

When PostgreSQL is configured, the change capture details should show the schema and tables to read from, the initial change to use, and the operations that will be included. Also, the name of the replication slot to be used must be defined and the configuration should determine if the replication slots would be removed on close. You can also decide how it should behave when it encounters an unsupported data type. Lastly, the user that connects to the database to access the change event details must have the replication or super role.

Logical replication in PostgreSQL uses a publish and subscribe model where one or more subscribers can subscribe to one or more publications on a publisher node. Data is pulled from the publications by subscribers and they may even re-publish the data to allow replication or more complex configurations.

To enable logical replication, changes have to be made to the configuration files postgresql.conf and pga_hba.conf. In the postgresql.conf file, the wal_level needs to be set to logical.

 wal_level  =  logical 

In the pga_hba.conf file, we need to add an entry like the one below where we will add a Repuser entry so that the user can have access to the server (on all IPv4 addresses).

Repuser    0.0.0.0/0    md5

For further clarity, let's use the example below to demonstrate how to set up logical replication. In the example, a publisher and a subscriber node will be configured to enable the implementation of CDC.

Publisher Configuration

In Postgresql.conf, set wal_level to logical and restart the PostgreSQL service.

 -bash-4.2$ grep 
wal_level /var/lib/pgsql/10/data/postgresql.confwal_level = logical
-bash-4.2$

Create User

Use the code below to create a user

postgres=# 
CREATE USER repuser_1 REPLICATION ENCRYPTED PASSWORD 'pYUBG_787V';CREATE ROLE
postgres=# 

Create a Database and Populate the data

Create a database testdb_1 and populate some data using pgbench.

postgres=# 
CREATE DATABASE testdb_1;CREATE DATABASE
postgres=#
postgres=# \q
-bash-4.2$
-bash-4.2$ /usr/pgsql-10/bin/pgbench -i -s 5 testdb_1
NOTICE:  table   "pgbench_history" does not exist, skipping
NOTICE:  table   "pgbench_tellers" does not exist, skipping
NOTICE:  table   "pgbench_accounts" does not exist, skipping
NOTICE:  table   "pgbench_branches" does not exist, skipping
creating tables...
100000 of 500000 tuples (20%) done (elapsed 0.18 s,remaining 0.71 s)
200000 of 500000 tuples (40%) done (elapsed 0.36 s,remaining 0.54 s)
300000 of 500000 tuples (60%) done (elapsed 0.58 s,remaining 0.39 s)
400000 of 500000 tuples (80%) done (elapsed 1.00 s,remaining 0.25 s)
500000 of 500000 tuples (100%) done (elapsed 1.44 s,remaining 0.00s)
vacuum...
set primary keys...
done.
-bash-4.2$ psql testdb_1
psql (10.0)
Type "help" for help.
testdb_1=# \dt+
List of relations
Schema |       Name       | Type    |  Owner   |    Size   | 
--------+------------------+-------+----------+---------+----
public |   pgbench_accounts | table | postgres | 64 MB     |
public |   pgbench_branches | table | postgres | 40 kB     |
public |   pgbench_history  | table | postgres | 0   bytes |
public |   pgbench_tellers  | table | postgres |   40 kB   |
(4 rows)
testdb_1=#

Create Publication

Create a publication from the code below.

testdb_1=# CREATE PUBLICATION   testdb_1_pub FOR TABLE 
pgbench_accounts, pgbench_history;

CREATE PUBLICATION

testdb_1=#
testdb_1=# GRANT SELECT on TABLE   pgbench_accounts,pgbench_history  TO   repuser_1;
GRANT
testdb_1=#

Lastly, in pg_hba.conf, set the configuration file pg_hba.conf as seen below

-bash-4.2$ cat /var/lib/pgsql/10/data/pg_hba.conf
# TYPE  DATABASE        USER       ADDRESS                   METHOD
host    all             all        192.168.241.231/32        md5
local   all             all                                  peer

Subscriber Configuration

Create a database

Create a database for the subscriber

postgres=# create database testdb_1_lr;
CREATE DATABASE
postgres=#

Create tables

We need to create the tables we want to replicate in the SLAVE server.

testdb_1_lr=# CREATE TABLE pgbench_history ( 
tid integer,
bid integer,
aid integer,
delta integer,
mtime timestamp   without time zone,
filler   character(22));
CREATE TABLE
testdb_1_lr=# CREATE TABLE pgbench_accounts (aid integer NOT   NULL,bid integer,abalance integer,filler character(84)) WITH (fillfactor='100');
CREATE TABLE
testdb_1_lr=# ALTER TABLE ONLY pgbench_accounts ADD CONSTRAINT   pgbench_accounts_pkey PRIMARY KEY (aid);
ALTER TABLE
testdb_1_lr=#

Create subscription

The sample below is used to create a subscription to connect to the publication.

testdb_1_lr=# CREATE SUBSCRIPTION testdb_1_sub CONNECTION   'dbname=testdb_1 host=192.168.241.230   user=repuser_1 password=pYUBG_787V port=5432' PUBLICATION testdb_1_pub;
NOTICE:  created   replication slot "testdb_1_sub" on publisher
CREATE SUBSCRIPTION
testdb_1_lr=#

Let’s run a quick test on the subscriber.

-bash-4.2$ psql testdb_1_lr
psql (10.3)
Type "help" for help.
testdb_1_lr=# \! ifconfig    ens33| grep 'inet' | cut -d: -f2 | awk '{ print $2}'
192.168.241.231
testdb_1_lr=# \dt+
List of relations
Schema |       Name       | Type    |  Owner   |    Size   | 
--------+------------------+-------+----------+---------+-----------
public |   pgbench_accounts | table | postgres | 64 MB     |
public |   pgbench_history  | table | postgres | 0   bytes |
(2 rows)
testdb_1_lr=#

Logical replication will start by copying a snapshot of the data on the publisher database. When this is accomplished, changes executed on the publisher will be sent to the subscriber in real time. The subscriber will apply the changes based on the order in which commits were made on the publisher. This is done to ensure that transactional consistency is guaranteed for the publications within any single subscription.

Using Arcion For PostgreSQL CDC

Of course, sometimes it’s easier to use a CDC tool which allows for deep customization and high performance without the effort of a custom solution. The built-in tools for replication within PostgreSQL will work off the start but they are hard to scale. As replication becomes more complicated, whether that’s through high volume or through added code complexity, many of the native tools begin to show their weaknesses. When it comes to using CDC with PostgreSQL, using a purpose-built CDC tool like Arcion to implement and maintain CDC functionality is the preferred choice. 

Arcion is the go-to solution for many enterprises who are looking to select a data pipeline tool that is scalable, reliable, and extremely easy to configure and use. It provides robust data pipelines that offer high availability, leverage log-based CDC, and auto-scalable features. Available with multiple deployment options, Arcion can migrate data to and from on-prem data sources, cloud-based data sources or a mix of both. If you’re interested in learning more about how Arcion can help you achieve and scale  real-time replication with PostgreSQL, visit our PostgreSQL docs pages.

The zero-code approach allows users to easily configure Arcion and build their data pipelines without having to write any code. Arcion can be set up and configured strictly through configuration files or by using Arcion’s intuitive and easy-to-use UI to set up pipelines in a matter of minutes. Compared to homegrown solutions or ones that mismatch a bunch of different technologies, Arcion makes implementation smooth by providing 24/7 support through extensive documentation, tutorials, blogs, and customer support.

Let’s take a look at some specific features that are relevant for Postgres CDC.

Sub-Second Latency 

Many other existing CDC solutions don’t scale for high-volume, high-velocity data, resulting in slow pipelines, and slow delivery to the target systems. Arcion is the only distributed, end-to-end multi-threaded CDC solution that auto-scales vertically & horizontally. Any process that runs on Source & Target is parallelized using patent-pending techniques to achieve maximum throughput. There isn’t a single step within the pipeline that is single-threaded. It means Arcion users get ultra-low latency CDC replication and can always keep up with the forever increasing data volume on the source system.

100% Agentless Change Data Capture

Arcion is the only CDC vendor in the market that offers 100% agentless CDC to all its supported 20+ connectors. The agentless CDC applies to all complex enterprise databases including PostgreSQL. Arcion reads directly from database logs, and implements log based CDC while never reading from the database itself. Previously, data teams faced administrative nightmares and security risks associated with running agent-based software in production environments. You can now replicate data in real-time, at scale, with guaranteed delivery ‚ÄĒ but without the inherent performance issues or the security concerns of agent-based CDC connector.

Data Consistency Guaranteed

Arcion provides transactional integrity and data consistency through its CDC technology. To further this effort, Arcion also has built-in data validation support that works automatically and efficiently to ensure data integrity is always maintained. It offers a solution for both scalable data migration and data replication while making sure that zero data loss has occurred.

Automatic schema conversion & schema evolution support

Arcion handles schema changes out of the box requiring no user intervention. This helps mitigate data loss and eliminate downtime caused by pipeline-breaking schema changes by intercepting changes in the source database and propagating them while ensuring compatibility with the target's schema evolution. Other solutions would reload (re-do the snapshot) the data when there is a schema change in the source databases, which causes pipeline downtime and requires a lot of compute resources, which can be expensive!   

Pre-Built 20+ Enterprise Data Connectors

Arcion has a library of pre-built data connectors. These connectors can provide support for almost 20 enterprise databases, data warehouses, and cloud-native analytics platforms (see full list). Unlike other ETL tools, Arcion provides full control over data while still maintaining a high degree of automation. Data can be moved from one source to multiple targets or multiple sources to a single target depending on your use case. This means that if you branch out into other technologies outside of PostgreSQL, such as Microsoft SQL Server, you’ll already have the capability within Arcion to handle your new sources and targets without the need for another pipeline technology.

Conclusion

This article has covered a few possible solutions in which data in PostgreSQL can be continuously synced via CDC. The methods we touched on included applying CDC  through the use of triggers, queries, and logical replication. Each of these methods was explained and syntax was given on how they can be implemented. Lastly, one final option was given to make implementing CDC even easier and more scalable. In the last section, we looked at Arcion and its agentless, no-code approach to building data pipelines for PostgreSQL instances. 

Ultimately, deciding on the method to use that will suit you and your company’s CDC needs will depend on resources, time, and experience. Some methods may be easy to implement but tougher to scale, some may be too simplistic for your CDC needs, and some may check every box. The easiest and most robust way to achieve CDC in PostgreSQL can be done by using Arcion to build scalable pipelines in a matter of minutes. To try out Arcion for yourself, download Arcion Self-hosted for free today (no payment info required).

Matt is a developer at heart with a passion for data, software architecture, and writing technical content. In the past, Matt worked at some of the largest finance and insurance companies in Canada before pivoting to working for fast-growing startups.
Luke has two decades of experience working with database technologies and has worked for companies like Oracle, AWS, and MariaDB. He is experienced in C++, Python, and JavaScript. He now works at Arcion as an Enterprise Solutions Architect to help companies simplify their data replication process.
Join our newsletter

Take Arcion for a Spin

Deploy the only cloud-native data replication platform you’ll ever need. Get real-time, high-performance data pipelines today.

Free download

8 sources & 6 targets

Pre-configured enterprise instance

Available in four US AWS regions

Contact us

20+ enterprise source and target connectors

Deploy on-prem or VPC

Satisfy security requirements