In today's rapidly evolving data landscape, organizations are continuously seeking more efficient and scalable solutions to handle their data processing and analytics needs. With the exponential growth of data volumes and the advent of cloud-based technologies, businesses are increasingly looking to migrate their data from traditional systems to modern, cloud-based data warehouses. One migration path that has gained immense popularity is moving data from Oracle, one of the leading relational database management systems known for its robustness and reliability, to BigQuery, a powerful and fully managed data warehouse hosted on the Google Cloud Platform.
In this blog, we delve into the intricacies of migrating data from Oracle to BigQuery, providing a comprehensive overview of the entire process. We will explore the advantages of leveraging BigQuery's cloud-native architecture for its elastic scalability and ability to unlock the full potential of an organization’s data. Additionally, we will highlight the various methods available for loading data from Oracle to BigQuery, ranging from automated solutions like Arcion to more manual approaches.
By delving into the nuances of these migration methods, you will be equipped with the knowledge necessary to make informed decisions when embarking on a data migration journey. Whether you are considering a migration project or simply interested in exploring the possibilities, this blog serves as a valuable resource, offering insights into the benefits and challenges associated with each approach.
Join us as we explore the world of data migration, discover the possibilities that Oracle to BigQuery migration brings, and gain valuable insights into the methods and tools available to facilitate a seamless transition. With the right knowledge and approach, you can harness the power of BigQuery and unlock new opportunities for data-driven decision-making within your organization.
Introduction to Oracle
Oracle is a popular Relational Database Management System (RDBMS) produced and marketed by Oracle Corporation. It is one of the most widely used RDBMSs for storing, organizing, and retrieving information. Oracle was the first database designed for Enterprise Grid Computing (EGC), which allows data to be retrieved from any location through a group of computers connected over a network. This makes Oracle a flexible and cost-effective way to manage information and applications.
Oracle runs on multiple types of servers across different operating systems such as Windows, UNIX, Linux, and macOS. Oracle is a multi-model database system where each database is a collection of data treated as a unit to store and retrieve related information. Oracle also offers in-memory, NoSQL, and MySQL databases that are available on-premises or via Oracle Cloud.
Oracle offers five editions of its database to choose from: Standard Edition One, Standard Edition, Enterprise Edition, Express Edition (XE), and Personal Edition. For further information on Oracle, see the official Oracle documentation.
Introduction to BigQuery
Google BigQuery is a serverless, cloud-based data warehouse that offers extensive data analytics web services for processing huge amounts of data. It supports the querying of data using ANSI SQL and can be used to analyze data on a large scale, running into billions of rows.
The BigQuery platform also utilizes columnar storage called Capacitor to ease data querying and aggregation of results. BigQuery has built-in auto-scaling which allows the platform to automatically allocate computing resources as needed. This means that users don’t need to spin up new instances or virtual machines to scale. BigQuery can securely handle large volumes of data since BigQuery automatically encrypts all data before it is written to disk. When the data is read by an authorized user it is automatically decrypted. By default, Google manages the encryption keys used to protect your data although this can be configured as needed.
Methods to Load Data from Oracle to BigQuery
In this next section, we will go over two methods that can be used to move data from Oracle to BigQuery. Both methods are explained in step-by-step detail so that they can be easily followed and implemented. The first method uses Arcion to set up an automated and scalable pipeline. The second method is to manually export data as a CSV file and upload it to BigQuery using custom code; we will discuss the disadvantages of that approach as well. With the agenda set, let's dive into it!
Method 1: Oracle to BigQuery Using Arcion
The first method of connecting Oracle to BigQuery discusses how to use Arcion’s Oracle and BigQuery connectors to replicate data from your Oracle source database to a BigQuery target. For this to be done, we will set up the source database and connector, set up the target database and connector, then run Replicant, Arcion’s CLI tool that runs the replication process.
To load data into BigQuery, you’ll need to make sure you have the necessary IAM permissions set, since specific permissions are required to run load jobs and to load data into BigQuery tables and partitions.
Step 1: Download And Install Arcion Self-hosted
The first thing to do is to download and install Arcion Self-hosted. To do this, you’ll need to contact the Arcion team to get the Arcion program as well as a license key. Once you have access, the process includes downloading Arcion’s CLI tool, Replicant, creating a home directory for it (referred to as $REPLICANT_HOME throughout this guide), and finally, adding your license. You can reference the quick start guide for more details on each of these steps.
Step 2: Obtain the JDBC Driver for Oracle
The appropriate JDBC driver for Oracle has to be obtained before setting up Replicant, Arcion’s CDC tool. To do this, visit the Oracle Database JDBC driver Downloads page to download a driver that meets your business needs. To run Arcion efficiently, we recommend JDBC Driver for Oracle 18c and Java 8-compatible drivers. Go to the Oracle Database 18c Downloads section from the link provided above to download the necessary drivers. Be sure to put the ojdbc8.jar file inside the $REPLICANT_HOME/lib directory. The $REPLICANT_HOME directory is the directory where Replicant is installed. For more information on how to install Replicant, see the documentation.
Step 3: Set Up The Oracle User
Our next step is to create an Oracle User for Replicant to use. To set up an Oracle user, perform the following commands, replacing the <VARIABLES> with the values you want to use.
- Create a user for Replicant by using the command below.
- Provide the create session permission.
- Next, you can grant the select permission to the specific tables that would be involved in the replication.
Alternatively, you can grant permission on all the tables as shown below.
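Putting these steps together, the user setup might look like the following sketch, where the angle-bracket placeholders (user name, password, schema, and table names) are values you supply:

```sql
-- Create a dedicated user for Replicant
CREATE USER <USERNAME> IDENTIFIED BY <PASSWORD>;

-- Allow the user to connect to the database
GRANT CREATE SESSION TO <USERNAME>;

-- Grant read access only to the tables involved in the replication
GRANT SELECT ON <SCHEMA>.<TABLE_NAME> TO <USERNAME>;

-- Alternatively, grant read access to all tables
GRANT SELECT ANY TABLE TO <USERNAME>;
```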
Step 4: Set Up Change Data Capture (CDC)
To use log-based CDC, the Oracle database should be in ARCHIVELOG mode. To set the database to ARCHIVELOG mode, use the command shown below before proceeding to set permissions and enable the logs.
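For reference, switching an instance into ARCHIVELOG mode is typically done from SQL*Plus as a user with SYSDBA privileges, roughly as follows (this restarts the database, so plan for a brief outage):

```sql
-- Connect as SYSDBA, then restart the database into ARCHIVELOG mode
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
ALTER DATABASE ARCHIVELOG;
ALTER DATABASE OPEN;

-- Confirm the new log mode
SELECT LOG_MODE FROM V$DATABASE;
```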
Set permissions
- With the database in ARCHIVELOG mode, grant the EXECUTE_CATALOG_ROLE role to the user created for Replicant so that it can use the DBMS_LOGMNR package.
- For Replicant to access the START_LOGMNR procedure, grant the following permissions. These commands work for Oracle 12c and above.
- For Replicant to access v_$logmnr_contents, provide the permission below for Oracle 19c and above.
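The grants referenced above might look like the following, again substituting the Replicant user's name; the LOGMINING privilege applies to Oracle 12c and later, and the v_$logmnr_contents grant to Oracle 19c and later:

```sql
-- Allow use of the DBMS_LOGMNR package
GRANT EXECUTE_CATALOG_ROLE TO <USERNAME>;

-- Required to call DBMS_LOGMNR.START_LOGMNR on Oracle 12c and above
GRANT LOGMINING TO <USERNAME>;
GRANT EXECUTE ON DBMS_LOGMNR TO <USERNAME>;

-- Required to read LogMiner output on Oracle 19c and above
GRANT SELECT ON v_$logmnr_contents TO <USERNAME>;
```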
Enable logs
After setting up the permissions that grant Replicant access to the necessary logs, you now need to enable either primary key or all-column logging at the database level or the table level. Note that if you use table-level logging, you must also enable it on the CDC heartbeat table we will create in Step 8.
For database-level supplemental logging, you will run the following commands on the database instance:
- Run the following command to enable Force Logging.
- Run the following command to enable PRIMARY KEY logging.
Alternatively, you can enable ALL column logging.
For table-level supplemental logging, Force Logging is not required. To enable table-level logging you will do the following:
- For PRIMARY KEY logging use the following command.
Alternatively, for ALL Column logging use the following command.
If table-level logging is enabled, Force Logging should not be enabled as it is used for database-level logging only.
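The database-level and table-level supplemental logging commands referenced above might look like this (choose either the PRIMARY KEY or ALL variant, and either the database-level or table-level approach):

```sql
-- Database-level supplemental logging
ALTER DATABASE FORCE LOGGING;
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS;
-- ...or log all columns instead of just the primary key
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;

-- Table-level supplemental logging (Force Logging not required)
ALTER TABLE <SCHEMA>.<TABLE_NAME> ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS;
-- ...or log all columns of the table
ALTER TABLE <SCHEMA>.<TABLE_NAME> ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
```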
- Grant access to the following permissions as they will be required.
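The exact list of objects is in Arcion's Oracle source documentation, but the grants typically cover the redo- and archive-log views that LogMiner reads, along the lines of:

```sql
-- Dynamic performance views commonly needed for log-based CDC
GRANT SELECT ON v_$log TO <USERNAME>;
GRANT SELECT ON v_$logfile TO <USERNAME>;
GRANT SELECT ON v_$archived_log TO <USERNAME>;
GRANT SELECT ON v_$database TO <USERNAME>;
GRANT SELECT ON v_$transaction TO <USERNAME>;
```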
Step 5: Set Up The Global Permissions
Global permissions can either be granted as One-Time Access or as Continuous Access. One-Time Access is only needed for the initial data snapshot and for re-init operations where snapshots are taken for newly added tables. One-Time Access permission can be revoked after the necessary operations are completed.
On the other hand, Continuous Access permissions are given and required throughout any replication process for it to work correctly. To set up continuous access, follow the steps listed below to enable the global permissions needed for replication to work.
- For One-Time Access, run the following command to provide the correct privileges.
- For Continuous Access permissions, run the following command to grant needed access.
- Grant the following continuous access permissions for the tables involved in the replication or enable flashback for all tables.
Alternatively, you can grant continuous access permission to all tables using the following command.
- Lastly, provide access to the system views to allow for schema migration as shown below.
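As a sketch, the global permissions described above could be granted as follows; the exact set of system views Replicant needs for schema migration is listed in Arcion's documentation, so treat the views below as representative examples only:

```sql
-- One-Time Access: needed only for the initial snapshot and re-init operations
GRANT SELECT ANY TABLE TO <USERNAME>;

-- Continuous Access: flashback on the replicated tables...
GRANT FLASHBACK ON <SCHEMA>.<TABLE_NAME> TO <USERNAME>;
-- ...or on all tables
GRANT FLASHBACK ANY TABLE TO <USERNAME>;

-- Continuous Access: data-dictionary views used for schema migration (representative examples)
GRANT SELECT ON DBA_TABLES TO <USERNAME>;
GRANT SELECT ON DBA_TAB_COLUMNS TO <USERNAME>;
GRANT SELECT ON DBA_CONSTRAINTS TO <USERNAME>;
GRANT SELECT ON DBA_CONS_COLUMNS TO <USERNAME>;
```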
Step 6: Set Up The Oracle Connection Configuration
Our next step is to set up the connection configuration so that Replicant can connect to our Oracle instance. To set up the connection configuration, carry out the following steps:
- From $REPLICANT_HOME, go to the Oracle connection configuration file.
- Your connection credentials can be retrieved automatically by Replicant if they are stored in AWS Secrets Manager; otherwise, you will need to enter them directly, as shown in the sample below.
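Assuming the connection file lives at conf/conn/oracle.yaml under $REPLICANT_HOME (check your installation for the exact path), a manually filled-in configuration might look roughly like this; confirm the key names against Arcion's connection configuration reference:

```yaml
type: ORACLE

host: oracle.example.internal   # hostname of the Oracle server (placeholder)
port: 1521                      # Oracle listener port
service-name: ORCLPDB1          # Oracle service name (placeholder)

username: 'REPLICANT_USER'      # the user created in Step 3
password: 'REPLICANT_PASSWORD'

max-connections: 30             # upper bound on connections Replicant may open
```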
Step 7: Set Up The Filter Configuration
To set up the filter configuration, carry out the following steps:
- From $REPLICANT_HOME, go to the filter configuration file.
- Specify the schema(s) or table(s) to be replicated to meet your needs as explained in the filter file template below.
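An illustrative filter file is sketched below; the schema and table names are placeholders, and the exact structure of the allow list should be verified against Arcion's filter configuration reference:

```yaml
allow:
  # Replicate only the listed tables from this schema (placeholder names)
  - schema: "REPLICANT_SRC"
    types: [TABLE]
    allow:
      CUSTOMERS:
      ORDERS:
        # Optionally restrict replication to specific columns
        allow: ["ORDER_ID", "CUSTOMER_ID", "STATUS"]
```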
Step 8: Set Up The Extractor Configuration
For real-time replication, you must set up a heartbeat table in the source Oracle database. To create the heartbeat table, do the following:
- Create a heartbeat table in the schema you are going to replicate with the commands below:
- Ensure that the user created for replication has access to the heartbeat table. If they don’t, grant INSERT, UPDATE, and DELETE privileges to the user configured for replication.
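A heartbeat table of this shape is commonly used; the table name is an example and must match whatever you later reference in the Extractor configuration:

```sql
-- Example heartbeat table in the replicated schema
CREATE TABLE <SCHEMA>.replicate_io_cdc_heartbeat (
  "timestamp" NUMBER NOT NULL,
  PRIMARY KEY ("timestamp")
);

-- Make sure the replication user can write to it
GRANT INSERT, UPDATE, DELETE ON <SCHEMA>.replicate_io_cdc_heartbeat TO <USERNAME>;
```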
With the heartbeat table created, you now need to configure the Extractor configuration. The steps required to set up the extractor are shown below.
- From $REPLICANT_HOME, go to the extractor configuration file for Oracle.
- In the oracle.yaml file, we will add in our configuration parameters. The configuration has two parts: parameters related to snapshot mode and parameters related to real-time mode. With CDC, snapshot mode runs first to extract and load the initial data. After the data is loaded, realtime mode is flipped on and any further updates to the data are handled through real-time replication which keeps the source and the target in sync.
[Image: Different modes of replication that Arcion supports]
To set the parameters related to snapshot mode let's take a look at each parameter. The following list includes the snapshot mode parameters related to Oracle as a data source:
- fetch-create-sql: This is used to apply the exact create SQL on the source to target. This parameter is only available when creating an Oracle-to-Oracle pipeline.
- fetch-create-sql-no-constraints: This is used to split create table SQL and Primary/Unique/Foreign key constraints as different SQLs. Tables are created without constraints but after the completion of the snapshot, the constraints are applied. This parameter is only available when creating an Oracle-to-Oracle pipeline.
- serialize-fetch-createSql: This is used to fetch create SQL in a serialized manner after fetching table schema.
- serialize-fetch-create-Sql-no-constraints: This is used to fetch the SQL of Primary/Unique/Foreign Key constraints in a serialized manner. This parameter is only available when creating an Oracle-to-Oracle pipeline.
- native-load: Replicant uses this set of parameters to load tables using Oracle's Data Pump Export (expdp) utility instead of JDBC, which is better suited to large-scale data. The configuration parameters listed below are available under native-load.
- enable: This is used to enable native-load and is set to true or false.
- stage-type: This is used to determine the type of staging area. The allowed values are SHARED_FS and ASM.
- directory: This denotes the Oracle directory object corresponding to the stage-type.
- path: This is the full path to the Network File System (NFS) representing the directory shared between the Replicant and Oracle.
Below is a sample configuration for snapshot mode.
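The values shown are arbitrary examples, and the key names should be confirmed against Arcion's extractor reference for Oracle:

```yaml
snapshot:
  threads: 16
  fetch-size-rows: 5000

  fetch-create-sql: false
  fetch-create-sql-no-constraints: false

  native-load:
    enable: true
    stage-type: SHARED_FS        # or ASM
    directory: SHARED_STAGE      # Oracle directory object matching the stage type (placeholder)
    path: /mnt/shared/replicant  # NFS path shared between Replicant and Oracle (placeholder)
```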
Next, let’s look at the parameters related to real-time mode. If your data is to be replicated in real-time, the real-time section is what you will use to specify your configuration. The following Oracle-specific parameters are available:
- block-ddl-transaction
- use-current-scn
- start-scn
- inter-source-latency-s
- log-miner-dict-file
- oldest-active-txn-window-hr
Below is a sample configuration to illustrate the parameters associated with the real-time mode.
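Again, the values are placeholders, and the heartbeat block should point at the table created in the previous step:

```yaml
realtime:
  threads: 4
  fetch-size-rows: 10000

  block-ddl-transaction: true
  use-current-scn: true
  # start-scn: 2362927            # optionally begin replication from a specific SCN

  heartbeat:
    enable: true
    schema: "REPLICANT_SRC"       # schema that holds the heartbeat table (placeholder)
    interval-ms: 10000
```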
Step 9: Obtain the JDBC driver for Google BigQuery
Replicant also requires the correct JDBC driver for Google BigQuery as a dependency. To obtain the right driver, you can follow the steps shown below:
- Go to the JDBC drivers for BigQuery page.
- From there, download the latest JDBC 4.2-compatible driver ZIP.
- From the downloaded ZIP, locate and extract the GoogleBigQueryJDBC42.jar file.
- Put the GoogleBigQueryJDBC42.jar file inside the $REPLICANT_HOME/lib directory.
Step 10: Set up connection configuration for Google BigQuery
From the $REPLICANT_HOME directory, you will open the BigQuery Connection configuration file:
Once the BigQuery Connection configuration file is open, we will configure the BigQuery server connection. For this, you will need to input your credentials, such as your BigQuery username and password. If your connection credentials are stored in AWS Secrets Manager, Replicant can simply obtain them for you. If you are manually inputting these credentials, the sample below shows how they can be added:
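The sketch uses the HTTPS endpoint and placeholder credentials; confirm the exact keys against Arcion's BigQuery connection reference:

```yaml
type: BIGQUERY

host: https://www.googleapis.com/bigquery/v2   # BigQuery endpoint over HTTPS
port: 443

username: 'USERNAME'       # BigQuery credential placeholders
password: 'PASSWORD'

max-connections: 20        # maximum connections Replicant may use for loading
```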
From the sample above, the value definitions for each field are:
- type: This indicates the connection type representing the database; it is BIGQUERY in this case.
- host: This is the hostname of your BigQuery instance.
- port: This is the port number used to connect to the host.
- username: This is the username credential used to access the BigQuery system.
- password: This is the password associated with the username.
- max-connections: This indicates the maximum number of connections Replicant can use to load data into the BigQuery instance.
Arcion supports both HTTP and HTTPS for the BigQuery connection.
To use HTTP, follow these steps:
- Set http as the protocol in the host URL.
- Set port to 80.
To use HTTPS, follow these steps:
- Set https as the protocol in the host URL.
- Set port to 443.
The https setup is what is shown in the example above.
Arcion can also support CSV and Parquet as intermediary formats to send data to the BigQuery server. To specify the stage format, you can add the stage field in the connection configuration file with the desired file-format chosen:
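For example, a stage block along these lines could select Parquet (or CSV) as the intermediary format; confirm the accepted values in Arcion's BigQuery target documentation:

```yaml
stage:
  type: NATIVE
  file-format: PARQUET   # or CSV
```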
Step 11: Set up Applier configuration
Next, we will need to set the Applier configuration to tell Arcion how to apply the data to our BigQuery instance. From $REPLICANT_HOME, open the sample BigQuery Applier configuration file:
Arcion Replicant supports snapshot, realtime, and full replication modes for BigQuery as a target. Below we will look at how to set the Applier configuration based on what type of replication should be applied.
Configuration for snapshot replication
Replicant supports two methods for snapshot replication: loading data with the load job method, which is the default, and streaming data using the BigQuery Storage Write API.
For the configuration with the load job method, a sample is shown below:
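This sketch is illustrative; the sizing values are arbitrary and the key names should be checked against Arcion's BigQuery Applier reference:

```yaml
snapshot:
  threads: 16

  batch-size-rows: 100000000
  txn-size-rows: 1000000000

  # Default behavior: load data with BigQuery load jobs
  use-write-storage-api: false

  bulk-load:
    enable: true
    type: FILE
```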
For a configuration with Storage Write API, a sample is given below:
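Again as an illustrative sketch, the main difference is the flag that switches to streaming through the Storage Write API:

```yaml
snapshot:
  threads: 16

  batch-size-rows: 100000000
  txn-size-rows: 1000000000

  # Stream data through the BigQuery Storage Write API instead of load jobs
  use-write-storage-api: true
```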
Configure realtime replication
For realtime replication mode, Replicant also supports loading data with the load job method, which is the default, and streaming data using the BigQuery Storage Write API.
For the configuration with the load job method, a sample is shown below:
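As an illustrative sketch with arbitrary batch sizes:

```yaml
realtime:
  threads: 8

  batch-size-rows: 1000
  txn-size-rows: 10000

  use-write-storage-api: false

  replay-strategy: INSERT_MERGE
```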
For the configuration with the Storage Write API method, a sample is given below:
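And, as a sketch, the streaming variant, which also shows a replay-strategy setting:

```yaml
realtime:
  threads: 8

  batch-size-rows: 1000
  txn-size-rows: 10000

  use-write-storage-api: true

  replay-strategy: IN_MEMORY_MERGE
```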
The replay-strategy parameter found in the samples above is how Arcion applies CDC changes to the target in real time. Arcion supports the following replay strategies:
- NONE
- INSERT_DELETE
- INSERT_MERGE
- IN_MEMORY_MERGE
Step 12: Running Replicant
With that, the last step is to run Replicant to begin the CDC process. To run Replicant, check out the Running Replicant docs page which outlines all of the necessary commands.
Advantages of using Arcion
- Sub-second Latency From Distributed & Highly Scalable Architecture: Arcion is the world’s only CDC solution with an underlying end-to-end multi-threaded architecture, which supports auto-scaling both vertically and horizontally. Its patent-pending technology parallelizes every single Arcion CDC process for maximum throughput. So users get ultra-low latency and maximum throughput even as data volume grows.
- Automatic Schema Management: Arcion allows smooth schema management of your data, thereby ensuring data integrity, reliability, and consistency. As the schema evolves on the Oracle instance, these changes are applied to the BigQuery instance automatically to keep both in sync.
- Effortless Setup & Maintenance: Arcion's no-code platform removes DevOps dependencies; you do not need to incorporate Kafka, Spark Streaming, Kinesis, or other streaming tools. So you can simplify the data architecture, saving both time and cost.
- 1:1 dedicated support from POC to production: The Arcion team offers live support for all self-hosted users around the clock through email, a dedicated Slack channel, and calls. If users get stuck, our team of database replication experts is available to help get things on track to make your database replication and CDC efforts a success.
Method 2: Oracle to BigQuery manually
The second method will explore how to link Oracle to BigQuery manually. We are going to export our data from the Oracle database as a CSV file, then upload it to Google Cloud Storage (GCS), before finally uploading the data from GCS to BigQuery.
Step 1: Connecting to Oracle to Export Data in CSV Format
- Open the Oracle SQL Developer tool.
- Right-click on the table to open the context menu, then select Export to start the integrated Export Wizard.
- Deselect the Export DDL option, change the format to CSV, and set the path for the exported file. Click Next.
- On the next screen, enter the filename and location. Specify the columns you wish to export using the Columns tab. You can also use the Edit button to change the settings if you wish, or proceed by clicking Next.
- A summary of the export process is displayed; check that your settings are in order before clicking Finish to complete the export.
Step 2: Upload the CSV file to Google Cloud Storage (GCS)
Having completed the process of exporting your Oracle data to a CSV file, the next phase is uploading the CSV file from the local filesystem to Google Cloud Storage (GCS). There are numerous ways to get this done, such as using the Google Cloud Console, the command line, client libraries, Terraform, or the REST APIs. This article will look at the first two methods mentioned: the Google Cloud Console and the command line.
Before describing these methods, you must first create a bucket, the basic container that holds your data in Cloud Storage and to which the CSV file will be uploaded. If you already have a bucket, skip the bucket creation and go straight to the upload steps.
Creating a bucket
To create the bucket using Console, do the following:
- In the Google Cloud Console, go to the Cloud Storage Buckets page.
- Click Create Bucket to open the creation form.
- Enter the information you wish to give the bucket and click on Continue after completing each step. The information to be input includes:
a) Enter a Name for the bucket.
b) Choose Region for Location type and us-east1 for Location.
c) Choose Standard for the default storage class.
d) Choose Uniform for Access control.
e) Uncheck Enforce public access prevention on this bucket so you can share your objects.
f) Choose None for protection tools.
- Click Create.
With a bucket now created in Cloud Storage to hold your data, we can proceed to upload the data to Google Cloud Storage.
Using the Google Cloud Console to upload data to Google Cloud Storage
To use the Google Cloud Console to upload an object to a bucket, complete the steps below:
- In the Google Cloud Console, go to the Cloud Storage Buckets page.
- In the list of available buckets, select the bucket to which you wish to upload the object.
- In the Objects tab, you can either do the following:
- Drag and drop the desired file from your desktop or file manager into the main pane of the Google Cloud console.
- Click the Upload files button, and a dialog box appears from which you will select the files you want to upload. Click on Open after the selection.
Using the command line to upload data to Google Cloud Storage
For the command line, you can either use gcloud or gsutil to upload an object to a bucket. Below is an explanation of how this is done.
For gcloud, we are going to use the gcloud storage cp command below to achieve this.
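The general form of the command is shown below; replace the placeholders with your own file path and bucket name:

```sh
gcloud storage cp OBJECT_LOCATION gs://DESTINATION_BUCKET_NAME/
```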
In the command above, OBJECT_LOCATION is the local path to the object you are uploading, and DESTINATION_BUCKET_NAME is the name of the bucket to which you are uploading it.
If successful, the command prints a response confirming that the object was copied to the bucket.
For gsutil, we are going to use the gsutil cp command below to achieve this.
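The gsutil form is equivalent:

```sh
gsutil cp OBJECT_LOCATION gs://DESTINATION_BUCKET_NAME/
```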
As with gcloud, OBJECT_LOCATION is the local path to the object you are uploading, and DESTINATION_BUCKET_NAME is the name of the bucket to which you are uploading it.
If successful, the command prints a response confirming that the object was copied to the bucket.
Remember, to complete the uploading of data using both the Google Cloud console and the command-line utility, you must have the proper IAM permissions set. These are mentioned at the top of the Method 1 steps above.
Step 3: Moving the data from Google Cloud Storage (GCS) to BigQuery
The final step in the process of connecting Oracle to BigQuery is moving the data from Google Cloud Storage (GCS) to BigQuery. There are several ways to load your CSV file into BigQuery but we will focus on using the Cloud console and SQL.
Before proceeding, the following items are things to note to ensure a seamless transfer of your data to BigQuery:
- When CSV data is loaded from Google Cloud Storage, it is loaded into a new table or partition, appended to an existing table or partition, or used to overwrite an existing table or partition.
- The data loaded into BigQuery is converted into the columnar format used by Capacitor, BigQuery's storage format.
- When data is loaded from GCS to a BigQuery table, the dataset that contains the table you want to create, append, or overwrite must be in the same regional or multi-regional location as the Cloud Storage bucket.
- Grant Identity and Access Management (IAM) roles that permit users to perform the required tasks. You have to grant IAM permissions to load data into BigQuery and also to load data from Cloud Storage.
To load data into BigQuery, the IAM permissions needed to run a load job include:
- bigquery.tables.create
- bigquery.tables.updateData
- bigquery.tables.update
- bigquery.jobs.create
Each of the following predefined IAM roles includes the permissions needed to load data into a BigQuery table or partition.
- roles/bigquery.dataEditor
- roles/bigquery.dataOwner
- roles/bigquery.admin
- bigquery.user
- bigquery.jobUser
To load data into BigQuery from Cloud Storage, the IAM permissions needed include:
- storage.buckets.get
- storage.objects.get
- storage.objects.list
Having granted the IAM role permissions required, let us proceed to move data from Google Cloud Storage to BigQuery.
Loading the CSV file from Google Cloud Storage into BigQuery using the Console
To load the CSV data from GCS into a new BigQuery table using the Console, do the following:
- Go to the BigQuery page found on the Google Cloud console.
- In the Explorer pane, select a dataset by expanding your project.
- In the Dataset info section, click Create table.
- In the Create table panel, specify the following details:
a) In the Source section, select Google Cloud Storage in the Create table from list. Then, do the following:
- Select a file from the Cloud Storage bucket.
- For File format, select CSV.
b) In the Destination section, specify the following details:
- For Dataset, select the dataset in which you want to create the table.
- In the Table field, enter the name of the table that you want to create.
- Verify that the Table type field is set to Native table.
c) In the Schema section, enter the schema definition. To enable auto-detection of the schema, select Auto detect.
d) Specify the optional step Partition and cluster settings if you wish.
e) Click Advanced options and do the following:
- For Write preference, select Write if empty as this option creates a new table and loads your data into it.
- For the Number of errors allowed, enter the maximum number of rows containing errors that can be ignored or leave the default value of 0.
- Select Unknown values to ignore values in a row that are not present in the table's schema.
- For the Field delimiter, choose the character that separates the cells in your CSV file: Comma, Tab, Pipe, or Custom.
- For Header rows to skip, enter the number of header rows to skip at the top of the CSV file.
- Select Allow quoted newlines to allow quoted data sections that contain newline characters in a CSV file.
- Select Allow jagged rows to accept rows in CSV files that are missing trailing optional columns.
- For Encryption, click Customer-managed key to use a Cloud Key Management Service key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest with a Google-managed key.
f) Finally, click Create table.
Loading the CSV file from Google Cloud Storage into BigQuery using SQL
SQL uses the LOAD DATA DDL statement to load the CSV data from GCS into a new BigQuery table. To use SQL to load the data into a new table named mytable, do the following:
- In the Google Cloud console, go to the BigQuery page.
- Enter the statement below in the query editor:
- Click Run.
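The statement entered in the query editor might look like the following example, which creates (or overwrites) mydataset.mytable from a CSV file in the bucket; the column list, bucket, and file name are illustrative:

```sql
LOAD DATA OVERWRITE mydataset.mytable
(order_id INT64, customer_id INT64, status STRING)
FROM FILES (
  format = 'CSV',
  skip_leading_rows = 1,
  uris = ['gs://DESTINATION_BUCKET_NAME/exported_data.csv']
);
```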
At this point, we have only loaded our CSV file from Google Cloud Storage (GCS) into a new table in BigQuery. For appending to or overwriting an existing table, and for other load methods, you can visit the official BigQuery documentation.
Disadvantages of Manually Loading Data
- Writing custom code is ideal when you need to move data only once. The manual process becomes cumbersome when the data must be updated frequently and repeatedly.
- BigQuery does not guarantee data consistency for external data sources, as unexpected behavior may be encountered when changes are made to the data while a query is running.
- The dataset's location must be in the same region as, or contained within the same multi-region as, the Cloud Storage bucket.
- Nested or repeated data cannot be used with CSV files, as the CSV format does not support this type of data.
- Compressed and uncompressed files cannot both be included in the same load job when using a CSV.
- The maximum size for a gzip file for CSV is 4 GB.
Conclusion
This article served as a comprehensive step-by-step guide on how to set up an Oracle to BigQuery integration. We covered two methods: using Arcion, and a custom, manual setup that imports data via CSV. We walked through each step of implementing a manual CSV import into BigQuery and discussed some of the limitations of that approach. For Arcion, we showed how to implement real-time CDC to move data from Oracle to Google BigQuery. Arcion was highlighted as a simple and scalable solution for Oracle to BigQuery integration and replication.
As we saw, Arcion is a real-time, in-memory Change Data Capture (CDC) solution that guarantees scalability and ensures data consistency at all times when moving data from Oracle to BigQuery. To easily implement a data integration strategy, get started with Arcion today by connecting with our team of Oracle and BigQuery replication experts. CDC pipelines built with Arcion are easy to build, maintain, and scale and are production-ready in a matter of minutes.