Oracle CDC to Kafka - Load data in Minutes

Luke Smith
Enterprise Solutions Architect
January 11, 2023
Matt Tanner
Developer Relations Lead
January 11, 2023
Matt Tanner
Matt Tanner
Developer Relations Lead
Developer Relations Lead
January 25, 2023
20
 min read
Join our newsletter

Bringing the massive amount of data produced by modern enterprises together, quickly, is the main priority for many organizations. Combining data from across the enterprise can lead to deeper insights and enable more accurate decision-making. In this article, we will explore how data can be moved and combined in real-time by using Oracle CDC and Kafka to integrate data from various data sources.

Most techniques outlined below can be implemented very quickly, potentially in a matter of minutes. We will go over how Kafka can act as a link to detect data changes in a database. The data can be captured by Oracle CDC for onward movement to an Oracle Database. On top of this common approach, we will also explore other tools, like Arcion. Many of these tools allow for even easier development of real-time data pipelines that leverage CDC.

Table of Contents

What is Kafka?

Apache Kafka is an open-source distributed event streaming service used to build high-performance data pipelines. These pipelines created can be used for streaming analytics, data integration, and many other use cases where real-time data is advantageous.

Kafka is one of the most widely used data streaming platforms in use today. The process of streaming data allows data that is continuously generated from various data sources to be moved in real-time to a target destination. The data records being streamed are typically sent for storage or to an analytics platform. Kafka is a distinguished tool that can handle the influx of data generated by modern enterprises. Kafka can process the data sequentially and incrementally as it is primarily used to build real-time streaming data pipelines.

The Apache Kafka framework provides two messaging models: queuing and publish-subscribe. The implementation is based on Java and Scala, taking data streams from data sources and then providing real-time analysis of the big data streams. It can transfer millions of messages per second. Kafka is relatively easy to configure and is extremely scalable.

What is Oracle?

Oracle is a popular Relational Database Management System (RDBMS) produced and marketed by Oracle Corporation. It is a multi-model database system where its database is a collection of data treated as a unit. These collections are used to store and retrieve related information. Oracle database can also enable enterprise grid computing, and was one of the first to do so. Grid computing is all about the on-demand sharing of computing resources within a tightly-coupled network. This technology allows Oracle to be one of the most flexible and cost-effective ways to manage information and applications.

Oracle databases have both logical and physical structures. The perk to this approach is that the physical storage of data can be managed without affecting access to logical storage structures. Oracle services and products offer customers cost-optimized and highly-performant versions of Oracle Database. Oracle also offers in-memory, NoSQL, and MySQL databases that are available on-premise or in the cloud via Oracle Cloud Infrastructure.

Tools and Techniques for Setting Up a Kafka Oracle Integration

Various tools and techniques can be used to set up Kafka Oracle integration. The examples below will cover a few of them and give details about each method so you can decide if the technology fits your specific use case.

Develop Your Custom Connector and Use The Connect API

The first method that will be discussed is how to integrate Kafka and Oracle by developing a custom connector and using the connect API. Apache Kafka exposes APIs for fully-managed connectors with client libraries available in many languages such as C, C++, Java, Python, and Go. These APIs and connectors allow you to write code to move data from a system and write it to Kafka with massive flexibility. The biggest concern with using this method is that you also need to write code that would account for handling failures, restarting the system, scaling, logging, running in multiple nodes, and so on. With so much custom code required, the increase in flexibility can become a tedious job with many technical hurdles to overcome.

With this in mind, the next method discussed, Kafka Connect, may be a more suitable and scalable option. Compared to building a custom connector, when you want to integrate a data system, such as Oracle CDC to Kafka, it can be much simpler with Kafka Connect.

Kafka Connect Tool

Kafka Connect is a tool for streaming data between Apache Kafka and other data systems. It enables large amounts of data to be moved in and out of Kafka by simply defining connectors. The available connectors can allow for the entire database or metrics from your application servers to be imported into Kafka topics. Kafka Connect offers efficient scalability and reliability, with low latency.

Kafka Connect is free, open-source, and widely adopted. Kafka Connect works as a centralized data hub for data integration between databases, key-value stores, search indexes, and file systems. With this connection in place, it becomes a lot easier to stream data to and from Kafka through reliable and quality connector plugins.

Kafka Connect can be deployed as a standalone process that runs on a single machine or as a distributed, scalable service that can support an entire organization. Kafka connect involves two types of connections to work. These connections are the source connector and the sink connector. The source connector is used to ingest entire databases and stream table updates to Kafka topics. The sink connector is used to deliver data from Kafka topics to secondary indexes/databases. Now that you know the basics, let’s review what has been covered below.

Kafka Connect Features

Kafka Connect has the following features:

  • Kafka Connect can either be deployed as a single mode referred to as a standalone or a distributed service.
  • Kafka Connect is highly flexible and scalable as it runs with streaming and batch-oriented systems on a single node or scale to a distributed service.
  • Kafka Connect is regarded as a central data pipeline allowing the use of  abstractions to pull or push data to Kafka.
  • Kafka connects allows the use of REST API for the management of multiple connectors in a cluster.
  • Kafka Connect leverages existing connectors or connectors can be extended for a specific use case.

Kafta Connect Concepts

To discuss the inner workings of Kafka connect, let us first look at some important concepts for a clearer understanding.

  • Connectors: This is used to define where data should be copied to and from. It coordinates data streaming by creating and managing Tasks.
  • Tasks: This is used to implement how data is copied to or from Kafka.
  • Workers: This is the running process that executes Connectors and Tasks. It can either be standalone or distributed.
  • Converters: This is the code used to translate data between Connect and the system sending or receiving data.
  • Transforms: This is a simple logic statement to alter each message produced by or sent to a Connector.
  • Dead Letter Queue: This is used to illustrate how Connect handles Connector errors.

Having defined Kafka Connect and explored its concepts and features, let us dive into how to set it up to integrate with a database system like Oracle. Below is a brief example of how this can be done.

Step 1: Configure Startup Worker Parameters

Connectors and Tasks described previously are seen as logical units of work that are scheduled to execute in a process. This process is referred to as Worker in Kafka, and to get a connection, it needs to be configured.

Some Startup Worker Parameters to be configured include:

  • value.converter: This takes care of value conversion through messages sent or received from Kafka.
  • key.converter: This is the code that will be used to indicate the class to convert from Kafka Connect format to a format written to Kafka. It is used to translate the conversion of keys in messages written to the system sending or receiving data from Kafka.
  • bootstrap.servers: This is a list of Kafka servers used to loop in connections to Kafka.
  • offset.storage.topic: This is used to specify the topic to be used for the storage of offsets, it has lots of partitions and it is usually configured for compaction.
  • offset.storage.file.filename: This is used to indicate the file to store offset data for the integration.
  • config.storage.topic: This is used to specify the topic that will be used for storing the Connector and Task configurations in a distributed mode.
  • Connector configuration: This is used for the configuration of Connector parameters such as Name, Tasks, Connector Class, and lots more.

Step 2. Define the Transformations That You Need

Transformations help to make simple and lightweight modifications to messages. The transformation accepts a record as an input and outputs a modified record based on the transformation described. This is convenient for minor data adjustments and event routing. If required, multiple transformations can also be hooked together in the connection configuration. It becomes necessary as data formats stored in Kafka partitions might be different from the source or sink databases, therefore, the following transformation parameters, and potentially others not listed here, should be defined:

  • transforms: This is a list of the transformations and indicates the order in which the transformations take place.
  • transforms.$alias.type: This shows the fully qualified class name for the transformation.
  • transform.$alias.$transformationSpecificConfig: This indicates the configuration properties for the transformation.
  • extractfield: This is used to extract the specified field from a Struct when a schema is present, or a Map in the case of schemaless data.
  • extracttopic: This replaces the record topic with a new topic derived from its key or value.
  • insertfield: This inserts a field using attributes from the record metadata or a configured static value.
  • insertheader: This inserts a literal value as a record header, though, it is not currently available for managed connectors.
  • replacefield: This is used to filter or rename fields.

Step 3. Specify Your Predicates

To apply transformations to specific messages, you’ll need to specify predicates. The predicate will be used so that the transformation is applied only to records that satisfy a specific condition. Predicates can be configured with the properties shown below and others that may not be listed here:

  • predicates: This is used to indicate where predicates will be applied for a set of aliases.
  • predicates.$alias.type: This shows the fully qualified class name for the predicate.
  • predicates.$alias.$predicateSpecificConfig: This highlights the configuration properties for the predicate.

Step 4. Specify Your Error Reporting and Logging Options

An invalid record error may occur when using the Kafka Connect tool. This may be a result of several conditions like having records arrive at the Sink Connector serialized in a JSON format when the Sink Connector was configured to handle data in Avro format. Kafka Connect handles this error based on the connector configuration property for errors.tolerance. Some frequently used options to handle errors include:

  • errors.log.enable
  • errors.log.include.messages
  • errors.dead letterqueue.context.headers.enable

Step 5. Start the Standalone Connector or Distributed Mode Connector 

To continue the Kafka Connect configuration, you have to decide how you want the worker that handles the process of connectors and tasks to run. Kafka Connect can be run in standalone mode or distributed mode. This is done by carrying out thorough research to identify which mode best works for your environment.

  • Standalone mode: This is useful for the development and testing of Kafka Connect on a local machine for environments that typically use a single agent.
  • Distributed mode: This runs workers on multiple machines to form a Connect cluster. This mode is more fault tolerant than the standalone. If a node develops an issue, Kafka Connects distributes the work of that node to other nodes in the cluster without any data loss. This is possible since the connection configurations, offset information, and status are stored in the Kafka cluster and replicated. Distributed mode is recommended for production environments because it is highly scalable, available, and has enormous management benefits and more nodes can be added whenever the need arises.

Step 6. Use REST APIs to Manage Your Connectors

The REST API is used to manage and monitor connectors on Kafka Connect and runs by default on port 8083. Since Kafka Connect runs as a service, the REST API is used to manage its connectors. When executed in the distributed mode, the REST API becomes the primary interface to the cluster. When a request is made to any cluster, the REST API automatically forwards the request to its required destination. The standalone mode also runs the REST interface despite being a simple connector on the command line. It uses the REST API to get status information, add, and remove connectors without stopping the connection process.

Using Oracle CDC to Kafka

Using Oracle CDC with Kafka allows the capture of everything in an already existing database along with new changes made to the data. This capture can be done in two ways, query-based CDC or log-based CDC.

Query-based CDC uses a database query to gather new data from the database. The query itself will include a predicate to identify the changes made, usually based on a timestamp field, an incrementing identifier column, or potentially both. Query-based CDC can be implemented via the JDBC connector for Kafka Connect.

Log-based CDC uses the database's transaction log to retrieve information about the changes made. Depending on the database you use, the transaction log implemented may vary but they all follow the same principles. When changes such as insert, update or delete occur in a database, it is written/added to the transaction log. From here the new entry is decoded and the actual data from the database is written to a new event added to Apache Kafka. Log-based CDC can be implemented using several connectors but, for this article, we will focus on Oracle GoldenGate and later, using Arcion.

Using GoldenGate

Oracle GoldenGate is a managed service provider providing a comprehensive software package for replicating data from different environments. It provides solutions that enable customers to design, execute, and monitor data replication. GoldenGate allows for the streaming of data through data integration, transactional CDC, transformations, and verifications between operational and enterprise systems.

Oracle GoldenGate can be used to implement log-based CDC with Oracle CDC to Kafka. The Oracle GoldenGate for Big Data Kafka Handler streams change capture data from an Oracle GoldenGate trail to a Kafka topic. Additionally, the Kafka Handler provides functionality to publish messages to a separate schema topic. Schema publication for Avro and JSON is supported.

Using Kafka’s JDBC Connector

Another alternative of Oracle CDC to Kafka is to leverage query-based CDC using Kafka’s JDBC connector. You can use the Kafka Connect JDBC source connector to import from any relational database into Apache topics with a JDBC driver. Also, the JDBC sink connector can be used to export data from Kafka topics to other relational databases through a JDBC driver. The JDBC connectors support a wide range of databases without needing custom code, thereby allowing connections to be made to many RDBMS such as Oracle, SQL Server, MySQL, DB2, and others.

Using Arcion

Due to the complexity of using Kafka, many enterprises are looking to select a replication tool that is scalable, reliable, and extremely easy to configure and use. Many of these organizations turn to Arcion. Although using Kafka and Oracle together can be a great way to leverage CDC, it can be quite complex to implement and maintain. Using a purpose-built platform that offers flexible deployments for on-premise and cloud, and a no-code approach can be much simpler to get off the ground quickly. Now, let’s look at an example of how easy it is to enable CDC with Arcion and Oracle.

Connecting Oracle to Arcion

Arcion allows users to use an easy-to-use UI to set up a replication pipeline using CDC. In a few steps, you’ll have Oracle connected to Arcion. Once connected, a snapshot of the Oracle database can be loaded into the target platform, which in this example will be Snowflake. After the initial snapshot is loaded, any subsequent changes will also be synced in real-time from Oracle to the target. Let’s begin with a quick example of how to connect Oracle to Snowflake with Arcion.

The first step is to log into Arcion and set up a new replication by clicking the New button on the Replications screen.

The next step is to choose a Replication mode and a Write mode. Once selected, click the Next button.

> For more details on the different Replication and Write modes, check out our docs.

On the  Source screen, select Oracle as the source. If no Oracle Source is defined, click Create New and follow the instructions to create one. Once selected, click Continue to Destination.

On the Destination screen, select Snowflake as the target destination. If no Snowflake Destination is defined, click New Connection and follow the instructions to create one. Once selected, click Continue to Filter.

Next, on the Filter screen, we will choose what schemas, tables, and columns we want to replicate from Oracle to Snowflake. Choose how you would like to filter the data by checking off the applicable checkboxes on the left and clicking Start Replication.

From here, the replication will start. This next screen will allow you to monitor the process and inspect any errors that occur.

Because we selected Snapshot Mode as the replication type, once the data has been moved from Oracle to Snowflake, Arcion will shut down the process. If you would like to consistently replicate all changes through CDC, you can choose Full or Real-time mode. Once our Snapshot job completes, we will see a summary.

With that, you will have connected Oracle to Snowflake. Of course, this is just one of many examples that Arcion can support.

Benefits of Using Arcion for Oracle CDC

Arcion provides robust data pipelines that offer high availability, leverage log-based CDC, and auto-scalable features. The flexible deployment options mean that Arcion can migrate data to and from on-prem data sources, cloud-based data sources, or a mix of both. Arcion’s zero-code approach to data replication allows users to easily configure Arcion and build their data pipelines without having to write any code. Arcion can be set up and configured strictly through configuration files or by using Arcion’s intuitive and easy-to-use UI to set up pipelines in a matter of minutes. Arcion also makes implementation smooth by providing extensive documentation, tutorials, blogs, and 24/7 customer support.

Let’s take a look at some specific features that lend themselves well to implementing replication for Oracle databases.

Sub-Second Latency

Many other existing CDC solutions don’t scale for high-volume, high-velocity data, resulting in slow pipelines, and slow delivery to the target systems. Arcion is the only distributed, end-to-end multi-threaded CDC solution that auto-scales vertically & horizontally. Any process that runs on Source & Target is parallelized using patent-pending techniques to achieve maximum throughput. There isn’t a single step within the pipeline that is single-threaded. It means Arcion users get ultra-low latency CDC replication and can always keep up with the forever-increasing data volume on Source.

Oracle Native Log Reader

With our Oracle Native Log Reader feature, users can achieve 10x faster extraction from Oracle but with zero impact on the source system. Arcion now reads directly from Oracle redo log files (online and archive) to stream CDC records out of an Oracle database. This is a tremendous win for all our users. With Oracle Native Log Reader, Arcion users can stream changes out of Oracle without connecting to Oracle and running expensive LogMiner queries inside Oracle (which could impact the performance of the Oracle server and also consume a lot of PGA memory).

Another benefit of the Native Log Reader compared to Oracle API (e.g., LogMiner) is the complete decoupling from an existing Oracle feature which had several limitations (e.g. LogMiner cannot mine binary column data more than 4KB).

But the secret sauce is when the Oracle Native Log Reader is combined with our scalable, parallel, end-to-end multi-threaded architecture. Arcion can extract CDC records out of Oracle faster than any other traditional CDC solution. If you ever have trouble extracting data out of high-volume Oracle installations, like Oracle RAC, because of the terabytes of redo log it produces daily, Arcion is your answer.

Automatic DDL/Schema Evolution

Keeping schemas aligned between Source and Target is a classic problem among data engineers who set up and maintain pipelines. Previously, replication would need to be stopped if there was a change to the DDL or schema on the source database. Relevant updates would need to be made to the config files or schema to reflect the change on the target system. Only then could replication resume or, worse yet, be restarted. This, of course, leads to downtime, consumes expensive compute resources, and can be prone to user error and data loss when editing configuration or schema files.

Arcion supports automated DDL and schema evolution without requiring any user intervention, including being the only platform to currently support out-of-the-box DDL replication with Snowflake and Databricks. Arcion automatically replicates DDL and schema updates to the target database when a change is made on the source, eliminating downtime. Furthermore, Arcion does this automatically, meaning no manual changes to the configuration files and no manual pipeline restarts are needed. Everything is taken care of by the Arcion platform without requiring any user intervention.

100% Agentless Change Data Capture

Arcion is the only CDC vendor in the market that offers 100% agentless CDC to all its supported 20+ connectors. Arcion’s agentless CDC connectors apply to all the complex enterprise databases modern enterprises use, like Microsoft SQL Server, MongoDB, and all versions of Oracle. Arcion reads directly from the transaction logs, never reading from the database itself. Previously, data teams faced administrative nightmares and security risks associated with running agent-based software in production environments. You can now replicate data in real-time, at scale, with guaranteed delivery — but without the inherent performance issues or the security concerns of agent-based connectors.

Data Consistency Guaranteed

Arcion provides transactional integrity and data consistency through its CDC technology. To further this effort, Arcion also has built-in data validation support that works automatically and efficiently to ensure data integrity is always maintained. It offers a solution for both scalable data migration and replication while making sure that zero data loss has occurred.

20+ Pre-Built Enterprise Data Connectors

Arcion has a library of pre-built data connectors. These connectors can provide support for over 20 enterprise databases, data warehouses, and cloud-native analytics platforms (see full list). Unlike other CDC tools, Arcion provides full control over data while still maintaining a high degree of automation. Data can be moved from one source to multiple targets or multiple sources to a single target depending on your use case. This means that if you branch out into other technologies outside of Oracle, you’ll already have the capability within Arcion to handle your new sources and targets without the need for another pipeline technology.

Conclusion

In this write up we discovered a few ways to power Oracle replication through Kafka. We dove into what CDC is all about and offered you some possible methods you could use to replicate your data. We took a deep dive into native Oracle and Kafka replication techniques, including creating custom connectors or using Kafka Connect.

Lastly, we took a deep dive into using Arcion for Oracle CDC. We looked at many of the features and advantages of Arcion as well as a brief example of how to connect ORacle to Arcion for CDC. To get started with Oracle replication and try out Arcion for yourself, simply download Arcion Self-hosted for free or spin up pipelines in Arcion Cloud in minutes with no payment info required.

Matt is a developer at heart with a passion for data, software architecture, and writing technical content. In the past, Matt worked at some of the largest finance and insurance companies in Canada before pivoting to working for fast-growing startups.
Luke has two decades of experience working with database technologies and has worked for companies like Oracle, AWS, and MariaDB. He is experienced in C++, Python, and JavaScript. He now works at Arcion as an Enterprise Solutions Architect to help companies simplify their data replication process.

Take Arcion for a Spin

Deploy the only cloud-native data replication platform you’ll ever need. Get real-time, high-performance data pipelines today.

8 sources & 6 targets

Pre-configured enterprise instance

Available in four US AWS regions

Free download

20+ enterprise source and target connectors

Deploy on-prem or VPC

Satisfy security requirements

Join the waitlist for Arcion Cloud (beta)

Fully managed, in the cloud.

Start your 30-day free trial with Arcion self-hosted edition

Self managed, wherever you want it.

Please use a valid email so we can send you the trial license.