Most businesses use more than one platform for their data needs. This can include the primary database(s) where data is initially written, big data platforms where insights are mined, and perhaps archive systems where older records are stored to free up space on the primary databases. How do you make sure that data moves between all of these systems accurately and efficiently? This is where modern businesses leverage the power of Change Data Capture.
Most database systems and other data platforms can support CDC, and a variety of tools can be used to implement it. In this article, we will look specifically at how to use CDC with Oracle databases and some of the tools that can be used to do so.
What is Oracle CDC?
Oracle CDC, or Oracle Change Data Capture, is a technology that efficiently identifies data that has been added, updated, or removed from an Oracle database. These changes are then replicated to a target database so that they are made available to other systems, applications, or individuals. CDC allows for the easy extraction and transportation of relational data from one or more databases into a data warehouse for analysis. Data warehouse destinations may include Snowflake, SQL Server, Azure Synapse, Redshift, S3, Kafka, Postgres, or Google BigQuery. With CDC, real-time data replication can be achieved, unlocking the real-time analytics use cases that many of these platforms excel at. That was not possible with a batch-based approach to extracting and loading data into big data platforms.
With Oracle, implementing a CDC process allows data that is inserted, updated, and deleted to be monitored so that these changes can be reflected in the destination. Although real-time CDC is generally the desired end state, this could also be done at specific time intervals set by a user. CDC replicates only the data changes to the destination database or data warehouse, eliminating the need to repeatedly copy or extract an entire database or even entire tables, a process that can be time-consuming, expensive, and cause performance degradation.
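To make the change-only idea concrete, here is a minimal Python sketch of applying a CDC change stream to a target table, modeled as an in-memory dict keyed by primary key. The event format and field names are illustrative assumptions, not any particular tool's wire format.

```python
# Sketch: applying a stream of CDC change events to a target table.
# Event shape ("op", "key", "row") is an illustrative assumption.

def apply_change(target, event):
    """Apply a single insert/update/delete event to the target."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        target[key] = event["row"]          # upsert the new row image
    elif op == "delete":
        target.pop(key, None)               # remove the row if present
    return target

# Only the three changed rows are shipped -- never the whole table.
changes = [
    {"op": "insert", "key": 1, "row": {"name": "Ada"}},
    {"op": "update", "key": 1, "row": {"name": "Ada L."}},
    {"op": "delete", "key": 2},
]

target_table = {2: {"name": "Grace"}}
for ev in changes:
    apply_change(target_table, ev)

print(target_table)  # {1: {'name': 'Ada L.'}}
```

Note that replicating these three small events is far cheaper than re-copying the full table, which is the whole appeal of CDC.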
What are the Types of Oracle Replication with CDC?
Oracle replication with CDC can be done in a few different ways, and each type has its own pros and cons depending on your use case. CDC with Oracle can be achieved using timestamps, triggers, or transaction logs.
1. Oracle CDC using Timestamps
Changed records can be replicated by using audit timestamps. These timestamps note the date and time when a record was inserted or updated so that the changes can be replicated to the target. This is a frequently used method that works for a fair number of use cases, but it has the following disadvantages:
- Records that are deleted cannot be identified since they are no longer in the table. This restricts the method's usefulness, especially where rows are deleted frequently.
- It puts stress on the system and increases the overhead on the server. This happens because the CDC process has to run queries continually on the source Oracle database and tables to check for changes.
- Incorrect replication may arise if the audit date and timestamp columns are not recorded accurately by the source application, which can affect data integrity.
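A minimal sketch of the timestamp-polling approach, with an illustrative table and `last_updated` column (the equivalent SQL would filter on a watermark, e.g. `WHERE last_updated > :watermark`):

```python
# Sketch of timestamp-based CDC polling: fetch only rows whose audit
# timestamp is newer than the last poll's watermark. Table contents and
# the last_updated column are illustrative assumptions. Note that
# deleted rows simply vanish from the table, so this method never sees them.

from datetime import datetime

def poll_changes(rows, last_seen):
    """Return rows modified since last_seen, plus the new watermark."""
    changed = [r for r in rows if r["last_updated"] > last_seen]
    new_watermark = max((r["last_updated"] for r in changed), default=last_seen)
    return changed, new_watermark

rows = [
    {"id": 1, "last_updated": datetime(2023, 1, 1, 9, 0)},
    {"id": 2, "last_updated": datetime(2023, 1, 1, 10, 30)},
]

changed, watermark = poll_changes(rows, datetime(2023, 1, 1, 10, 0))
print([r["id"] for r in changed])  # [2]
```

The continual re-querying this requires is exactly the source-side overhead noted above.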
2. Oracle CDC Using Triggers
Oracle CDC can also capture changes by using triggers defined on the tables you want to replicate. When an insert, update, or delete occurs, the trigger writes the change, with a timestamp, into an audit table, and replication is then carried out from that audit table. One of the major concerns with this approach is the performance impact of the triggers themselves: they put a further load on the Oracle database, increasing the overhead of each operation and potentially leading to performance degradation.
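The trigger approach can be sketched as follows, with a Python function standing in for the table trigger and a list standing in for the audit table (names and shapes are illustrative, not Oracle trigger syntax):

```python
# Sketch of trigger-based CDC: every write also appends a row to an
# audit log, mimicking what an AFTER INSERT/UPDATE/DELETE trigger
# would do inside the database.

from datetime import datetime, timezone

audit_log = []   # stands in for the audit table the trigger writes to

def audited_write(table, op, key, row=None):
    """Apply the change and record it, as a trigger would."""
    if op == "DELETE":
        table.pop(key, None)
    else:                       # INSERT or UPDATE
        table[key] = row
    audit_log.append({
        "op": op, "key": key, "row": row,
        "ts": datetime.now(timezone.utc),   # timestamp used for replication
    })

orders = {}
audited_write(orders, "INSERT", 100, {"total": 25})
audited_write(orders, "DELETE", 100)

print(len(audit_log))  # 2 -- both changes captured, including the delete
```

Unlike timestamp polling, the delete is captured here, but every write now pays the cost of the extra audit insert.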
Log-based Oracle CDC Using Transaction Logs
Aside from the two types of CDC mentioned above, log-based approaches can also be used. Using log-based CDC in Oracle means that the CDC process uses transaction logs to identify and apply changes to target platforms or tables. To achieve this, the CDC tool mines the Oracle redo log and/or the archive logs generated by the Oracle instance. These logs are generally used for data recovery in the case of system failure, making them a great source to identify changes happening in the source Oracle database.
Compared to the approaches above, log-based CDC is regarded as the best and preferred Oracle CDC method. It is preferred because it works asynchronously and puts little pressure on the Oracle source system. Since log-based CDC only requires access to the logs to replicate changes, it adds minimal load to the source database while replicating all changes accurately and efficiently.
Log-based CDC provides a real-time view of the data on the destination as it happens on the source. This real-time capability is possible since transactions are applied to the destination immediately after the logs are mined. Once mined and replicated, the data can be accessed on the destination or target platform which may be a data warehouse, data lake, analytical application, or consolidated across various enterprise systems and applications. Once the changed data has been replicated to the target system important deductions, valuable insights, opportunities, business trends, and operational efficiencies can be obtained from the data.
Log-based Oracle CDC using transaction logs can be enabled on all Oracle versions, including 12c and 19c. The only difference in implementation is that the log formats may differ slightly from one Oracle version to another.
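The core of log-based CDC, applying mined entries in system change number (SCN) order, can be sketched like this; the record layout is an illustrative assumption, not the actual redo-log format:

```python
# Sketch of log-based CDC: redo-log entries carry a system change
# number (SCN), and the reader applies them strictly in SCN order so
# the target converges to the source state.

def apply_log(entries, target):
    """Apply mined log entries to the target in SCN order."""
    for e in sorted(entries, key=lambda e: e["scn"]):
        if e["op"] == "DELETE":
            target.pop(e["key"], None)
        else:                               # INSERT or UPDATE
            target[e["key"]] = e["row"]
    return target

# Entries may be mined out of order; sorting by SCN restores sequence.
mined = [
    {"scn": 102, "op": "UPDATE", "key": 7, "row": {"qty": 3}},
    {"scn": 101, "op": "INSERT", "key": 7, "row": {"qty": 1}},
    {"scn": 103, "op": "DELETE", "key": 9},
]

synced = apply_log(mined, {9: {"qty": 5}})
print(synced)  # {7: {'qty': 3}}
```

Because all of this happens against the logs rather than the live tables, the source database does no extra work beyond what it already does for recovery.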
Oracle CDC 12c: Oracle Streams (Deprecated)
Oracle Streams was a native CDC tool provided by Oracle that could enable real-time replication on Oracle databases. It automatically pushed transaction changes to a queue for onward replication and wrote these changes to the target database or tables. Unfortunately, for users who want to use this approach now, it was deprecated in Oracle 12c and is considered obsolete. Users are encouraged to use other solutions, such as Oracle GoldenGate, to enable this functionality. Compared to Oracle Streams, Oracle GoldenGate is a premium product that comes with a steep licensing fee. Fortunately, other log-based CDC options, such as Arcion, are available that are easier to set up and much more cost-effective.
Oracle CDC 19c: Oracle LogMiner Continuous Mining (Deprecated)
Oracle LogMiner provides users with methods of querying logged changes made on an Oracle database by using SQL commands and referencing data in the Oracle redo logs. Database administrators mostly use it to query transaction logs, but it can also be used to deliver real-time changes from the Oracle redo logs that third-party CDC tools can use for replicating Oracle data.
Just like Oracle Streams above, LogMiner's continuous mining capability is considered a legacy approach to CDC and is obsolete. The continuous mining option was removed in Oracle 19c, pushing customers toward other tools for CDC, namely the high-cost Oracle GoldenGate.
Oracle GoldenGate Replication
Oracle acquired GoldenGate in 2009, and it is now Oracle's preferred replication tool of choice. Of course, it is a premium tool with a premium price. GoldenGate is available as an add-on license for Oracle Database Enterprise Edition, where it provides real-time replication from various sources using CDC. It can also be licensed separately for non-Oracle databases.
To create an efficient replication process, Oracle GoldenGate needs other Oracle-licensed products, since replication requires creating tables on the target based on the table definitions on the source database. GoldenGate users also need data type mappings, which can be provided by other Oracle-licensed tools such as Oracle Data Integrator (ODI), Oracle's ETL tool, to avoid data loss. With these tools in place, Oracle GoldenGate delivers real-time CDC replication with continual incremental updates on the destination, high throughput, and low impact on the source systems.
The flexibility of GoldenGate can be limited, though, if you plan to use it for targets outside of the Oracle ecosystem. For instance, if your target is Amazon S3, Snowflake, or another cloud data platform, you will need a separate license for Oracle GoldenGate for Big Data.
As mentioned before, Oracle GoldenGate can be very expensive because you will need other Oracle-licensed products to use it to its full potential. For example, Oracle Veridata is needed to verify and reconcile data from the source, Oracle GoldenGate Studio is needed for a graphical user interface, and Oracle GoldenGate Management Pack is required for alerting and monitoring capabilities. Bearing in mind that these products are licensed separately, the integration between them can lead to significant costs. This can become a major blocker for organizations with smaller budgets.
In addition to the costly license fees, GoldenGate is not optimized for cloud data warehouses (CDWs):
- Setting up GoldenGate for CDWs is a lengthy process that requires Kafka (or another streaming service), custom coding, and heavy coordination between DevOps, data engineering, and developers.
- Pipelines from GoldenGate to CDWs require significant ongoing maintenance, especially technical resources with deep GoldenGate skills, which drives perpetual resource costs.
- GoldenGate pipelines are complex, with many integration points that break frequently, so downstream models, analyses, and applications can see significant performance impacts.
- The license costs, the engineering resources needed to implement and maintain it, and the additional compute required to run it make the solution expensive, brittle, and suboptimal in performance, not to mention the constant fear of the audit process due to the pricing structure.
Oracle RAC Replication (Real Application Clusters) CDC
Oracle Real Application Clusters (RAC) is an Oracle database environment that is highly available and scalable. RAC allows a single Oracle database to run across multiple servers while accessing shared storage. If a node fails, the database continues to run on the remaining servers, or nodes, in the cluster.
Each server, or node, creates its own transaction log, identified by a thread identifier and an incremental sequence, which means users can start small and add nodes or servers as the need arises. This can become increasingly complex at scale, since the logs must be read and applied in the sequence in which they were generated across the servers. This makes Oracle CDC on RAC tedious and limited, since RAC itself only replicates data for high availability and redundancy, not for analytics use cases or data movement to big data platforms.
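Merging the per-node redo threads into one globally ordered change stream can be sketched with a simple SCN merge; the thread contents here are illustrative:

```python
# Sketch of merging per-node RAC redo streams: each node (redo thread)
# produces its own SCN-ordered log, and the reader merges them into one
# globally ordered change stream before applying it.

import heapq

thread_1 = [(101, "t1-insert"), (104, "t1-update")]   # (scn, change)
thread_2 = [(102, "t2-insert"), (103, "t2-delete")]

# heapq.merge assumes each input stream is already sorted by SCN,
# which holds because each node writes its own log sequentially.
merged = list(heapq.merge(thread_1, thread_2))

print([scn for scn, _ in merged])  # [101, 102, 103, 104]
```

With more nodes, more threads feed the same merge, which is where the complexity the paragraph describes comes from.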
Oracle CDC For High Throughput/High Volume Environments
In some Oracle environments, the bulk of transactions may occur rapidly at certain hours of the day before returning to a normal pace. These spikes in usage, or consistently high data volumes, pose a challenge: the CDC process must handle the increased throughput and scale appropriately. The system must try to replicate at the same speed it achieves under lighter load. Many CDC tools experience significant lag in these high-throughput environments. This can be mitigated by scaling up to match the demands on the system during peak hours and scaling down as traffic eases.
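One way to reason about that scale-up/scale-down decision is a back-of-envelope worker-sizing heuristic; all figures and names below are illustrative assumptions, not any tool's actual policy:

```python
# Rough sketch of a scale-out heuristic for peak-hour CDC lag: size the
# number of apply workers from the observed backlog and per-worker
# throughput, clamped to a configured range.

import math

def workers_needed(backlog_events, events_per_worker_sec, target_catchup_sec,
                   min_workers=1, max_workers=16):
    """Workers required to drain the backlog within the target window."""
    needed = math.ceil(backlog_events / (events_per_worker_sec * target_catchup_sec))
    return max(min_workers, min(max_workers, needed))

# Peak hour: 1.2M event backlog, 5k events/sec per worker, 60s target.
print(workers_needed(1_200_000, 5_000, 60))   # 4
# Off-peak: a small backlog scales back down to the minimum.
print(workers_needed(10_000, 5_000, 60))      # 1
```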
Oracle CDC In A Catch-up Scenario
A backlog of transactions may build up when the Oracle database is hard to reach, for example because of a network failure or other issues that prevent logs from being accessed and replication from taking place. When these issues are resolved, the CDC process needs to replicate the backlog of transactions to the target. Catch-up must then proceed as if CDC were operating in a high-throughput environment, to get the source and target systems in sync as quickly as possible. If the target system cannot get in sync, subsequent transactions may be significantly delayed or, with a large enough backlog, may never catch up.
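Whether catch-up is even possible comes down to simple arithmetic: the pipeline's apply rate must exceed the source's ongoing change rate, and the gap between them determines how long catch-up takes. A sketch with illustrative numbers:

```python
# Back-of-envelope catch-up model: the backlog only shrinks at the
# rate by which the apply rate exceeds the ongoing source change rate.

def catchup_seconds(backlog_events, apply_rate, source_rate):
    """Seconds to drain the backlog, or None if it can never catch up."""
    net_rate = apply_rate - source_rate    # events/sec actually reclaimed
    if net_rate <= 0:
        return None                        # falls further behind forever
    return backlog_events / net_rate

print(catchup_seconds(600_000, 8_000, 5_000))  # 200.0 seconds
print(catchup_seconds(600_000, 4_000, 5_000))  # None -- apply rate too low
```

This is why catch-up demands the same throughput headroom as a peak-load environment: without it, `net_rate` never goes positive.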
Remote Log Mining for Oracle CDC
When the Oracle source system is overloaded and performance degradation is a risk, a reasonable option is to mine the log files remotely, away from the Oracle source database. The logs can be mined on a completely different server and the changes later merged into the target database, alleviating extra load on the source and mitigating any detrimental effects on its performance.
Why The Process Needs To Be Accurate For Oracle CDC
The Oracle CDC transaction process needs to be accurate to ensure referential integrity. When an application performs a transaction that is saved, the change data is created and captured in the underlying Oracle tables. This change data has to be brought across in full for every transaction when performing CDC. If only a portion of it is available for replication, referential integrity problems arise at the destination because the data no longer matches the source.
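One common way to preserve that integrity, applying changes one committed transaction at a time, can be sketched as follows (the record shapes and transaction ids are illustrative assumptions):

```python
# Sketch of transaction-grouped apply: changes are grouped by
# transaction id and each group is staged and committed as a unit,
# so the target never exposes half a transaction.

from itertools import groupby

def apply_transactions(changes, target):
    """Apply changes one committed transaction at a time."""
    changes = sorted(changes, key=lambda c: c["txn"])
    for txn, group in groupby(changes, key=lambda c: c["txn"]):
        staged = dict(target)              # stage the whole transaction
        for c in group:
            staged[c["key"]] = c["row"]
        target.clear()
        target.update(staged)              # commit all-or-nothing
    return target

# An order and its line item share txn 42, so they land together:
# the target never shows an item pointing at a missing order.
changes = [
    {"txn": 42, "key": "order:1", "row": {"total": 30}},
    {"txn": 42, "key": "item:1",  "row": {"order": "order:1", "qty": 2}},
]
applied = apply_transactions(changes, {})
print(sorted(applied))  # ['item:1', 'order:1']
```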
Even though users may assume that their Oracle CDC process may be running accurately and error-free with acceptable latency, it is also important to know if this is the reality and if every change is being replicated to the destination. Of course, the best way to ensure that all the pieces of the puzzle are as they should be is to check the results. In the next section, we will explore exactly how we can verify that the process is running accurately.
Oracle CDC: Time Series Your Data For Point-In-Time Analytics
To ensure accuracy, especially for analytics, point-in-time queries on the destination data lake or data warehouse are important. The timestamps in the data should carry the commit date from the Oracle source, or the timestamp the records were saved to Oracle with, rather than the timestamp at which they were loaded into the destination. This allows audits to verify that the data was correctly loaded into the destination. By using the source timestamp, we can sequentially verify that each piece of data is accounted for; a timestamp of when the data was loaded into the destination would give minimal insight into its accuracy.
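Such an audit can be as simple as comparing row counts bucketed by the source commit timestamp on both sides; a sketch with illustrative data and field names:

```python
# Sketch of a point-in-time audit: because target rows keep the source
# commit timestamp, per-hour row counts on source and target can be
# compared directly to find gaps in replication.

from collections import Counter

def counts_by_hour(rows):
    """Row counts bucketed by the source commit hour."""
    return Counter(r["commit_ts"][:13] for r in rows)   # 'YYYY-MM-DDTHH'

source = [{"commit_ts": "2023-05-01T09:15"}, {"commit_ts": "2023-05-01T10:05"}]
target = [{"commit_ts": "2023-05-01T09:15"}]            # one row missing

missing = counts_by_hour(source) - counts_by_hour(target)
print(dict(missing))  # {'2023-05-01T10': 1} -- the 10:00 hour is short a row
```

Bucketing by load time instead would make both sides incomparable, which is the point the paragraph above makes.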
Why Do You Need An Oracle Replication Tool?
It is much easier to use an automated Oracle replication tool than to hand-code an Oracle CDC solution: a purpose-built replication tool offers accuracy and dependability and is easier to implement than a homegrown solution. Hand-coding a replication solution can be a daunting task, as maintenance and support pose serious problems. An automated Oracle replication tool makes the entire process straightforward and can save organizations from a lot of headaches down the line.
When selecting an Oracle replication tool to use, there are some things to look out for such as:
- Can the replication tool handle a scenario where you have a high throughput?
- Will the replication tool be able to catch up and recover when there is an outage resulting in large CDC volumes?
- Can the replication tool work remotely and deliver accurate data?
- Is it trustworthy, and can it deliver data in real time?
- Can it handle schema evolution or schema drift?
- If there is a need for an upgrade, can the replication tool work with Oracle 12c and Oracle 19c?
- Bonus point: Can it support various data types for different connectors?
These questions can help guide you toward the optimal solution for your particular setup. Of course, if two solutions are evenly matched in terms of functionality, the deciding factor is usually cost and ease of use.
Why Use Arcion?
Arcion is a go-to solution for many enterprises that are looking for an Oracle CDC tool that is scalable, reliable, extremely easy to configure and use, and does not break the budget. It provides robust data pipelines that offer high availability, leverage log-based CDC, and auto-scale. Available with multiple deployment options, Arcion can migrate data to and from on-prem data sources, cloud-based data sources, or a mix of both.
The zero-code approach allows users to easily configure Arcion and build their data pipelines without having to write any code. Arcion can be set up strictly through configuration files or by using Arcion's intuitive and easy-to-use UI to build pipelines in a matter of minutes. Compared to homegrown solutions or ones cobbled together from mismatched technologies, Arcion makes implementation smooth by providing 24/7 support through extensive documentation, tutorials, blogs, and customer support.
Let’s take a look at some specific features that lend themselves well to the Oracle CDC use case.
Many existing CDC solutions don't scale for high-volume, high-velocity data, resulting in slow pipelines and slow delivery to the target systems. Arcion is the only distributed, end-to-end multi-threaded CDC solution that auto-scales vertically and horizontally. Any process that runs on the source and target is parallelized using patent-pending techniques to achieve maximum throughput; there isn't a single step within the pipeline that is single-threaded. This means Arcion users get ultra-low-latency CDC replication and can always keep up with ever-increasing data volumes on the source.
Don’t just hear it from us. Hear it directly from our users.
Oracle Native Log Reader
In our August 2022 release, we shipped the Oracle Native Log Reader feature, which helps users achieve 10x faster extraction from Oracle with zero impact on the source. Arcion now reads directly from the Oracle redo log files (online and archive) to stream CDC records out of an Oracle database. This is a tremendous win for all our users: with the Oracle Native Log Reader, Arcion users can stream changes out of Oracle without connecting to the database and running expensive LogMiner queries inside it (which could impact the performance of the Oracle server and consume a lot of PGA memory).
Another benefit of the Native Log Reader over the Oracle APIs (e.g., LogMiner) is the complete decoupling from an existing Oracle feature that had several limitations (e.g., LogMiner cannot mine binary column data larger than 4KB).
The secret sauce, though, is that when the Oracle Native Log Reader is combined with our scalable, parallel, end-to-end multi-threaded architecture, Arcion can extract CDC records out of Oracle faster than any traditional CDC solution. If you have ever had trouble extracting data out of high-volume Oracle installations such as RAC because of the terabytes of redo logs produced daily, Arcion is your answer.
Oracle Native Log Reader is currently in beta and available with Arcion Self-hosted, and it’ll be available in Arcion Cloud later this year. Read VentureBeat’s coverage of Arcion’s August 2022 release.
Automatic DDL/Schema Evolution
Keeping schemas aligned between source and target is a classic problem for data engineers who set up and maintain pipelines. Previously, replication would need to be stopped if there was a change to the DDL or schema on the source database. Relevant updates would need to be made to the config files or schema to reflect the change on the target system. Only then could replication resume, if it did not need to start over entirely. This, of course, leads to downtime, consumes expensive compute resources, and is prone to user error and data loss when editing configuration files.
Arcion supports automated DDL and schema evolution without requiring any user intervention, and it is currently the only platform to support out-of-the-box DDL replication with Snowflake and Databricks. Arcion automatically replicates DDL and schema updates to the target database when a change is made on the source, eliminating downtime. No manual changes to the configuration files and no manual pipeline restarts are needed; everything is handled by the Arcion platform.
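As a simplified sketch of the underlying idea (not Arcion's implementation): diff the source and target column sets and generate the DDL needed to bring the target in line. The table name, columns, and generated SQL text are all illustrative assumptions.

```python
# Sketch of automatic schema evolution: compare source and target
# column definitions and emit ALTER statements for columns the
# source has gained.

def schema_drift_ddl(table, source_cols, target_cols):
    """Generate ALTER statements for columns added on the source."""
    added = {c: t for c, t in source_cols.items() if c not in target_cols}
    return [f"ALTER TABLE {table} ADD {col} {typ}"
            for col, typ in sorted(added.items())]

source_cols = {"id": "NUMBER", "name": "VARCHAR2(50)", "email": "VARCHAR2(100)"}
target_cols = {"id": "NUMBER", "name": "VARCHAR2(50)"}

print(schema_drift_ddl("customers", source_cols, target_cols))
# ['ALTER TABLE customers ADD email VARCHAR2(100)']
```

A real implementation would also handle drops, renames, and type changes, each of which is considerably harder than the additive case shown here.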
Here is a quick video to show you how Arcion Automatic DDL works:
100% Agentless Change Data Capture
Arcion is the only CDC vendor on the market that offers 100% agentless CDC across all of its 15+ supported connectors. Agentless CDC applies to all the complex enterprise databases, including every Oracle version. Arcion reads directly from the transaction logs, never from the database itself. Previously, data teams faced administrative nightmares and security risks from running agent-based software in production environments. You can now replicate data in real time, at scale, with guaranteed delivery, but without the inherent performance issues or security concerns.
Data Consistency Guaranteed
Arcion provides transactional integrity and data consistency through its CDC technology. To further this, Arcion also has built-in data validation support that works automatically and efficiently to ensure data integrity is always maintained. It offers a solution for both scalable data migration and replication while ensuring zero data loss.
Pre-Built Enterprise Data Connectors
Arcion has a library of pre-built data connectors. These connectors can provide support for the most popular enterprise databases, data warehouses, and cloud-native analytics platforms (see full list). Unlike other CDC tools, Arcion provides full control over data while still maintaining a high degree of automation. Data can be moved from one source to multiple targets or multiple sources to a single target depending on your use case. This means that if you branch out into other technologies outside of Oracle, you’ll already have the capability within Arcion to handle your new sources and targets without the need for another pipeline technology.
Data is a valuable asset for any organization: patterns deduced from it drive decisions for the growth and improvement of the company. Thousands of organizations across the globe use Oracle databases to store diverse information ranging from sales data and customer details to operations, campaigns, and HR. Ensuring that this data is duplicated accurately across all business locations enables a smooth flow of operations and an environment of evidence-based decision-making. When we talk about the “future of business”, we are really talking about helping businesses become more efficient with data to drive their business to the next frontier.
Irrespective of your company's size, position, or profit margins, businesses are looking for ways to leverage the information they can derive from the vast amounts of data they collect. One of the best ways to do this is by capturing changes and delivering that data in real time to wherever it needs to be for processing and analysis. The most modern and efficient way to do this is Change Data Capture, or CDC.
This article looked at several ways to implement Oracle CDC, including legacy approaches, and highlighted its significance and the ways you can use it to meet your business needs. With this knowledge in hand, you can confidently see the value of CDC and move forward with a robust implementation to support your Oracle CDC needs.
Interested in seeing Arcion in action? Book a personalized demo today with our database experts.