Data-driven decision-making has become a major factor among corporations that want to remain relevant and modern. In today's world, these companies rely heavily on data analysis and big data processing to make business decisions. This type of approach requires moving data from databases found in different locations into a central store or repository. This is usually done in batches or in real-time and provides analytics tools used for the analysis of the data they need to perform optimally.
Being able to have a fast, reliable, scalable, and easy-to-use platform where data can be stored, processed, transformed, and explored all in one place is crucial. Making the most out of the large amount of data produced daily is a constant quest for data engineers, data scientists, and data analysts. The efforts put in helping to ensure that an organization has the data needed to make decisions that improve their business and overall productivity.
One technology used heavily as a primary application database is Oracle Database. It is one of the most popular and efficient databases around and has a rich history of being trusted by enterprises large and small. Oracle Databases are used by many companies to manage, collect, organize, and store data from applications.
When it comes to analytics and machine learning, Databricks is a common platform to stream data into, especially from Oracle. Databricks is a cloud-based data tool that allows users to transform data, explore data through Machine Learning models, and much more. It is frequently used for achieving business needs all in a one-stop shop as it provides tools to handle even the most complex business use cases in a single location.
In this article, we will explore both Oracle and Databricks and explain how to load data from Oracle to Databricks. We will look at two primary techniques for loading data: using Arcion for CDC-enabled data loading and also a more legacy-based manual approach. Each of these methods will highlight how data replication and migration can help you in virtualizing your data storage so it can be accessed from anywhere in the world, in a single place. The outcomes of implementing these methods will help your business leverage data science to support better and more accurate decision-making.
Introduction to Oracle
Oracle Database is an extremely popular Relational Database Management System (RDBMS) that offers comprehensive and fully integrated applications and services on the cloud. The Oracle platform is commonly used by enterprises to handle large amounts of data and is trusted as a go-to technology when building systems that need to scale. It is a multi-model database system that stores, organizes, and retrieves information in a variety of possible ways. Oracle is also offered through a cloud offering that delivers consistent processes, helps in migrating enterprise workloads, and also builds cloud-native apps.
Part of Oracle’s popularity comes from its flexible deployment option. Oracle is cross-platform and able to be deployed and run on multiple operating systems including Windows, UNIX, Linux, and macOS. When it comes to language support, SQL is the query language that is used on Oracle to interact with the database. Oracle Database is available for on-premise deployments or as a managed solution as an Oracle Cloud Infrastructure offering.
There are five editions of the Oracle database to choose from, including Standard Edition One, Standard Edition, Enterprise Edition, Express Edition, and Personal Edition. The differences between the editions are quite large although the underlying database is generally the same. Standard Edition One has everything required to build business applications, while Standard Edition offers support for larger machines and Oracle RAC. Enterprise Edition contains all the features of the Oracle database, Express Edition (XE) is an entry-level edition, and the final offering is a single-user edition known as Personal Edition.
Features of Oracle
As a trusted enterprise technology, Oracle offers many features that can be utilized. Many of these features have assisted enterprises big and small to create powerful applications and services. Let’s explore these features in more detail in the breakdown below.
- Client/Server Architecture: Oracle Database operates within a client/server architecture. This approach allows processing to be split between both client and server applications. Though it is based on distributed systems, it provides the same data consistency and transparency as a non-distributed system.
- Availability: Due to technologies like Real Application Clusters (RAC), Data Guard, and Golden Gate, Oracle is never out of service, offline, or unavailable. This leads to the database being available at all times even during planned or unplanned schedules, and downtimes.
- Scalability: Oracle contains features such as Real Application Clusters (RAC) and advanced Portability to ensure that your database is highly scalable regardless of your application's needs and usage.
- Concurrent Processing: Several applications can run at the same time on Oracle as it has support for multiple users due to its concurrent processing ability.
- Performance: the performance of an Oracle instance can easily be optimized and increased with tools like Oracle Real Application Testing, Oracle Database In-Memory, and Oracle Advanced Compression. These tools help to improve factors such as query execution time and faster retrieval and alteration of data.
Introduction to Databricks
Databricks is an enterprise big data software built on top of Apache Spark. The platform provides a set of tools that make it easy to unify the process of building, deploying, sharing, scaling, and maintaining enterprise-grade data solutions. Built on top of Apache Spark, Databricks makes it easier to build and explore data engineering workflows, machine learning modeling, data discovery/annotation, analytics dashboards and visualizations, managing security and governance, and lots more. Databricks is highly scalable and easy to configure since it can be deployed on top of distributed Cloud computing environments like Microsoft Azure, AWS, or Google Cloud.
Databricks simplifies the process of collating large amounts of data either in a Data Warehouse or Data Lakes by incorporating its data lakehouse architecture. The lakehouse architecture offers the capabilities of both the data warehouse and the data lake. By combining the two platforms, a data lakehouse offers the flexibility, cost-efficiency, and scalability of a data lake and the data management capabilities and ACID compliance of a data warehouse.
Features of Databricks
Databricks is a very fully-featured data engineering platform. Below we will take a look at some of the most popular features and reasons for enterprises to use Databricks.
- Unified Platform: Databricks is a platform that unifies all your data into a single source to enable data consistency, help in data governance, and make your data available for all your analytics and AI needs. Your data teams do not need to source for data independently but can get open and secure highly-performant data sharing through Databricks.
- Language: Databricks supports multiple languages such as Python, Scala, R, and SQL. These languages can be used in the same coding environment to build algorithms. Spark SQL can be used for data transformation while R can be used for data visualization and Python used to evaluate performance.
- Data Sources: One of the major features of Databricks is its ability to connect with a wide range of sources to help you perform data analytics. It not only connects with cloud storage services like AWS, Azure, or Google Cloud and on-premise storage like SQL servers, CSV, and JSON, but it also provides extension connectivity for Platforms like MongoDB, Avro files, and lots more.
- Flexibility: Databricks is flexible as it provides scalability for small-scaled jobs such as development or testing and the running of jobs on a large scale like big data processing. It helps you to scale efficiently with every workload from simple data pipelines to large scaled data.
- Optimized Spark Engine: Databricks can easily integrate with various open-source libraries and allows users to quickly set up fully-managed Apache Spark clusters. These clusters can be set up and configured without constant monitoring as Databricks lets you get the most recent versions of Apache Spark.
Methods to Load Data from Oracle to Databricks
In this section, we will explore two methods whereby the integration of data between Oracle and Databricks can be achieved seamlessly. The first method we will look at will utilize Arcion, a Change Data Capture (CDC) tool that automates a good chunk of the data loading process. It’s worth mentioning that Arcion is an official integration partner for Databricks, and also a Databricks Venture portfolio company.
The second method we will cover is a manual configuration of loading data from Oracle to Databricks using some of the native tools built into the platforms.
Method 1: Oracle to Databricks Using the Arcion
First, let's look at how you can use Arcion’s Oracle and Databricks connectors to replicate data from your Oracle database into Databricks. To do this effectively, you will be required to set up the source database, set up the target database, then run Arcion’s replication tool, Replicant. If need be, you may also want to check the performance and troubleshoot or tweak the settings to match your exact needs.
To set up a replication pipeline from the Oracle database to Databricks using Arcion’s Oracle connector, you must first prepare the Oracle Database for replication by executing the five steps documented below in an Oracle client.
Step 1: Obtain the JDBC Driver for Oracle
The appropriate JDBC driver for Oracle has to be obtained before setting up Replicant, Arcion’s CDC tool. To do this, visit the Oracle Database JDBC driver Downloads page to download a driver that meets your business needs. To run Arcion efficiently, we recommend JDBC Driver for Oracle 18c and Java 8-compatible drivers. Go to the Oracle Database 18c Downloads section from the link provided above to download the necessary drivers. Be sure to put the ojdbc8.jar file inside the $REPLICANT_HOME/lib directory. The $REPLICANT_HOME directory is the directory where Replicant is installed. For more information on how to install Replicant, see the documentation.
Step 2: Set up Oracle User
Our next step is to create an Oracle User for Replicant to use. To set up an Oracle user, perform the following commands, replacing the <VARIABLES> with the values you want to use.
1. Create a user for Replicant by using the command below.
2. Provide the create session permission.
3. Next, you can grant the select permission to the specific tables that would be involved in the replication.
Alternatively, you can grant permission on all the tables as shown below.
Step 3: Set up Change Data Capture (CDC)
To use log-based CDC, the Oracle database should be in ARCHIVELOG mode. To set the database to ARCHIVELOG mode, use the command shown below before proceeding to set permissions and enable the logs.
1. From ARCHIVELOG mode, grant the EXECUTE_CATALOG_ROLE role to the user created for Replicant so you can use the DBMS_LOGMNR package.
2. For Replicant to access the START_LOGMNR procedure, provide the following permissions as seen below. The shown commands will work for Oracle 12c and above.
3. For Replicant to access v_$logmnr_contents, provide the permission below for Oracle 19c and above.
After setting up the permissions to grant replicant access to the necessary logs, you now need to enable either the primary key or all column logging at the database level or the table level. One thing to note is that if you use table-level logging, you must also enable it in the CDC heartbeat table we will create in Step 7 as well.
For database-level supplemental logging, you will run the following commands on the database instance:
1. Run the following command to enable Force Logging.
2. Run the following command to enable PRIMARY KEY logging.
Alternatively, you can enable ALL column logging.
For table-level supplemental logging, Force Logging is not required. To enable table-level logging you will do the following:
1. For PRIMARY KEY logging use the following command.
Alternatively, for ALL Column logging use the following command.
If table-level logging is enabled, Force Logging should not be enabled as it is used for database-level logging only.
2. Grant access to the following permissions as they would be required.
Step 4: Set up Global Permissions
Global permissions can either be granted as One Time Access or as Continuous Access. One-Time Access is only needed for the initial data snapshot and for re-init operations where snapshots are taken for newly added tables. One-Time Access permission can be revoked after the necessary operations are completed.
On the other hand, Continuous Access permissions are given and required throughout any replication process for it to work correctly. To set up continuous access, follow the steps listed below to enable the global permissions needed for replication to work.
1. For One-Time Access, run the following command to provide the correct privileges.
2. For Continuous Access permissions, run the following command to grant needed access.
3. Grant the following continuous access permissions for the tables involved in the replication or enable flashback for all tables.
Alternatively, you can grant continuous access permission to all tables using the following command.
4. Lastly, provide access to the system views to allow for schema migration as shown below.
Step 5: Set up Connection Configuration
Our next step is to set up the connection configuration so that Replicant can connect to our Oracle instance. To set up the connection configuration, carry out the following steps:
1. From $REPLICANT_HOME, go to the Oracle connection configuration file.
2. Your connection credentials can be retrieved easily by Replicant if it is stored on AWS Secrets Manager, otherwise, you will need to input them as shown in the code.
Step 6: Set up Filter Configuration
To set up the filter configuration, carry out the following steps:
1. From $REPLICANT_HOME, go to the filter configuration file.
2. Specify the schema(s) or table(s) to be replicated to meet your needs as explained in the filter file template below.
Step 7: Set up Extractor Configuration
For real-time replication, you must set up a heartbeat table in the source Oracle database. To create the heartbeat table, do the following:
1. Create a heartbeat table in the schema you are going to replicate with the commands below:
2. Ensure that the user created for replication has access to the heartbeat table. If they don’t, grant INSERT, UPDATE, and DELETE privileges to the user configured for replication.
With the heartbeat table created, you now need to configure the Extractor configuration. The steps required to set up the extractor are shown below.
1. From $REPLICANT_HOME, go to the extractor configuration file for Oracle.
2. In the oracle.yaml file, we will add in our configuration parameters. The configuration has two parts: parameters related to snapshot mode and parameters related to real-time mode. With CDC, snapshot mode runs first to extract and load the initial data. After the data is loaded, realtime mode is flipped on and any further updates to the data are handled through real-time replication which keeps the source and the target in sync.
To set the parameters related to snapshot mode let's take a look at each parameter. The following list includes the snapshot mode parameters related to Oracle as a data source:
- fetch-create-sql: This is used to apply the exact create SQL on the source to target. This parameter is only available when creating an Oracle-to-Oracle pipeline.
- fetch-create-sql-no-constraints: This is used to split create table SQL and Primary/Unique/Foreign key constraints as different SQLs. Tables are created without constraints but after the completion of the snapshot, the constraints are applied. This parameter is only available when creating an Oracle-to-Oracle pipeline.
- serialize-fetch-createSql: This is used to fetch create SQL in a serialized manner after fetching table schema.
- serialize-fetch-create-Sql-no-constraints: This is used to fetch the SQL of Primary/Unique/Foreign Key constraints in a serialized manner. This parameter is only available when creating an Oracle-to-Oracle pipeline.
- native-load: Replicant uses this parameter set to set up the Oracle Data Dump Export (expdp) utility to load the table instead of JDBC. it handles large-scale data. The configuration parameters listed below are available under the native-load.
- enable: This is used to enable the native-load and it is set at true or false.
- stage-type: This is used to determine the type of staging area. The allowed values are SHARED_FS and ASM.
- directory: This denotes the Oracle directory object corresponding to the stage-type.
- path: This is the full path to the Network File System (NFS) representing the directory shared between the Replicant and Oracle.
Below is a sample code showing the configuration for snapshot mode.
Next, let’s look at the parameters related to real-time mode. If your data is to be replicated in real-time, the real-time section is what you will use to specify your configuration. The following Oracle-specific parameters are available:
Below is a sample configuration to illustrate the parameters associated with the real-time mode.
Advantages of Using Arcion
If you’ve followed the steps above, you can now begin to experience the benefits of Arcion. Although there are plenty of benefits we can list here, we will look at three of the most important ones for Oracle-to-Databricks replication.
Below are a few of the benefits of using Arcion to load data from Oracle to Databricks:
- No-code configuration and maintenance make it simple and easy to set up Oracle to Databricks replication with Arcion. Within minutes you can create production-ready data pipelines that are flexible and scalable.
- Arcion ensures smooth schema management by supporting automated schema evolution. This takes away the manual task of keeping schemas in sync between the platforms and eliminates any tedious and manual work that may be required.
- Performance and scalability are included by default. The distributed and parallel nature of the Arcion platform allows it to be extremely performant even when under heavy load, dealing with massive amounts of data. These capabilities are maintained while keeping costs minimal.
- The Arcion team offers live support around the clock to make sure any roadblocks are taken care of quickly. Arcion offers support through various channels including email, chat, and easily scheduled calls with database and replication experts.
- Arcion natively integrated with Databricks Partner Connect, where users can set up production-ready pipelines in minutes with a few clicks. Below is a video that shows how to replicate data from Oracle to Databricks via Databricks Partner Connect.
Method 2: Oracle to Databricks manually
There are plenty of ways to use the tools built into Oracle and Databricks to move data manually. Of course, most of these methods can become compute-intensive when used at scale and don’t generally facilitate real-time data loading. In this article, the second tactic for data loading will focus on using CSV export.
Oracle to Databricks Using CSV Export
To use the CSV Export functionality, you will first need to connect Oracle to use the CSV Export function. This is required to generate the CSV so you can move the exported data via CSV to Databricks.
Step 1: Connecting to Oracle and perform the CSV Export
- Connect to the database and table to be exported.
- Open the context menu by right-clicking on the table and in the menu select Export to start the Export Wizard.
- In the Export Wizard, deselect the Export DDL option and change the format to CSV. You’ll also set the path for where the exported file will be saved. After this is set, click Next.
- On the screen you will specify the columns you wish to export. You can use the Edit button to change the default settings if you want and then proceed by clicking Next.
- In the export summary that is displayed next, check the output to make sure your settings and everything are as expected. Lastly, click Finish to complete the export.
Step 2: Moving the CSV data to Databricks
The CSV data can be read into Databricks in two different ways. The first option is to use the Databricks UI or to directly upload the data to the table directly. To import the CSV through the UI or to upload to the table, do the following:
- On the Workspace UI, click on the Settings icon found in the lower-left corner and select Admin Console.
- Next, go to the Workspace Settings tab and expand the Advanced section.
- Ensure the Upload Data using the UI toggle is turned on and click Confirm to upload the CSV to Databricks.
- The data is then uploaded to the table where you can modify and read the data to be migrated from the CSV data to Databricks. by clicking on the Preview Table button you can choose a cluster to preview the table in. At this point you can also choose the appropriate data type for the table attributes. After the configurations, click on the Create Table button.
At this point, the Oracle data exported via CSV will be available in Databricks. Having successfully moved the CSV data into Databricks, you can now go to the Data Section and select the cluster where the file was uploaded to read the data.
Disadvantages of Manually Loading Data
There are quite a few disadvantages, especially over more automated methods, to moving data manually from Oracle to Databricks. Let’s take a look in more detail at the disadvantages of manually loading the data from one platform to the other.
- The process of loading data manually from Oracle to Databricks can be confusing as it is not a straightforward process and requires a few steps. This process involves the movement of data from one location to another before the final upload which must be executed in the proper order.
- Loading the data manually may not give room for smooth schema management, therefore, this may result in lots of tedious work. Keeping the schema in sync between the two platforms can be difficult to do when managing them manually. Not having the schemas in sync can lead to a massive amount of problems including the situation where data is not able to be loaded or is loaded improperly.
- This approach only allows for a snapshot to be uploaded so any updates after the CSV is created will have to be imported again, manually, to keep the database relatively up to date. Real-time replication can not be achieved through this approach and frequent updates that move data through CSV export can be intensive and prone to error and issues.
This article has looked at how to load data from Oracle to Databricks in two different ways. First, we looked at using data to load Oracle Database data in Databricks. After that, we gave another more manual method a try: CSV Export and load. Both the manual and automated approaches have their advantages and disadvantages, many of which were covered in the article. With the overview shown in this article, you can use either approach using the instructions that were provided above.
Of course, if you are looking for the easiest way to load data from Oracle Database into Databricks, Arcion’s CDC platform is the easiest and most robust way to do it. The highly-scalable platform provides a no-code approach to setting up real-time CDC pipelines for loading data from one platform to another. To try out the platform for yourself, talk with our CDC experts today to create Oracle to Databricks pipelines in a matter of minutes.