Oracle to BigQuery: Load Data in Minutes

Luke Smith
Enterprise Solutions Architect
June 1, 2023
Matt Tanner
Developer Relations Lead
June 1, 2023

In today's rapidly evolving data landscape, organizations are continuously seeking more efficient and scalable solutions to handle their data processing and analytics needs. With the exponential growth of data volumes and the advent of cloud-based technologies, businesses are increasingly looking to migrate their data from traditional systems to modern, cloud-based data warehouses. One migration path that has gained immense popularity is moving data from Oracle, one of the leading relational database management systems known for its robustness and reliability, to BigQuery, a powerful and fully managed data warehouse hosted on the Google Cloud Platform.

In this blog, we delve into the intricacies of migrating data from Oracle to BigQuery, providing a comprehensive overview of the entire process. We will explore the advantages of leveraging BigQuery's cloud-native architecture for its elastic scalability and ability to unlock the full potential of an organization’s data. Additionally, we will highlight the various methods available for loading data from Oracle to BigQuery, ranging from automated solutions like Arcion to more manual approaches.

By delving into the nuances of these migration methods, you will be equipped with the knowledge necessary to make informed decisions when embarking on a data migration journey. Whether you are considering a migration project or simply interested in exploring the possibilities, this blog serves as a valuable resource, offering insights into the benefits and challenges associated with each approach.

Join us as we explore the world of data migration, discover the possibilities that Oracle to BigQuery migration brings, and gain valuable insights into the methods and tools available to facilitate a seamless transition. With the right knowledge and approach, you can harness the power of BigQuery and unlock new opportunities for data-driven decision-making within your organization.


Introduction to Oracle

Oracle is a popular Relational Database Management System (RDBMS) produced and marketed by Oracle Corporation. It is one of the most used RDBMS to store, organize, and retrieve information. Oracle was the first database to use Enterprise Grid Computing (EGC) to perform tasks. EGC helps in the retrieval of data from any location through a group of computers connected to a network. This makes Oracle a flexible and cost-effective way to manage information and applications. 

Oracle runs on multiple types of servers across different operating systems such as Windows, UNIX, Linux, and macOS. Oracle is a multi-model database system where each database is a collection of data treated as a unit to store and retrieve related information. Oracle also offers in-memory, NoSQL, and MySQL databases, available on-premises or via Oracle Cloud.

Oracle offers five database editions to choose from: Standard Edition One, Standard Edition, Enterprise Edition, Express Edition (XE), and Personal Edition. For further information on Oracle, see the Oracle documentation.

Introduction to BigQuery

Google BigQuery is a serverless, cloud-based data warehouse that offers extensive data analytics web services for processing huge amounts of data. It supports the querying of data using ANSI SQL and can be used to analyze data on a large scale, running into billions of rows. 

The BigQuery platform also utilizes columnar storage called Capacitor to ease data querying and aggregation of results. BigQuery has built-in auto-scaling which allows the platform to automatically allocate computing resources as needed. This means that users don’t need to spin up new instances or virtual machines to scale. BigQuery can securely handle large volumes of data since BigQuery automatically encrypts all data before it is written to disk. When the data is read by an authorized user it is automatically decrypted. By default, Google manages the encryption keys used to protect your data although this can be configured as needed.
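
As a quick illustration of BigQuery's ANSI SQL support, the hedged example below aggregates rows from one of Google's public datasets (the bigquery-public-data.usa_names.usa_1910_2013 table is used purely for illustration and is not part of the migration itself):

-- Example only: find the most common names in a BigQuery public dataset
SELECT
  name,
  SUM(number) AS total
FROM
  `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY
  name
ORDER BY
  total DESC
LIMIT 10;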

Methods to Load Data from Oracle to BigQuery

In this next section, we will go over two methods that can be used to move data from Oracle to BigQuery. Both methods are explained in step-by-step detail so that they can be easily followed and implemented. The first method uses Arcion to set up an automated and scalable pipeline. The second method manually exports the data as a CSV and uploads it to BigQuery; since this is essentially a custom-code approach, we will also discuss its disadvantages. With the agenda set, let’s dive into it!

Method 1: Oracle to BigQuery Using Arcion

The first method of connecting Oracle to BigQuery discusses how to use Arcion’s Oracle and BigQuery connectors to replicate data from your Oracle source database to a BigQuery target. For this to be done, we will set up the source database and connector, set up the target database and connector, then run Replicant, Arcion’s CLI tool that runs the replication process.

To load data into BigQuery, you’ll need to make sure you have the necessary IAM permissions set since certain permissions are required to run and load data into BigQuery tables and partitions.

Step 1: Download And Install Arcion Self-hosted

The first thing to do is to download and install Arcion Self-hosted. To do this, you’ll need to contact the Arcion team to get the Arcion program as well as a license key. Once you have access, the process includes downloading Arcion’s CLI tool, Replicant, creating a home directory for it (referred to as $REPLICANT_HOME throughout this guide), and finally adding your license. You can reference the quick start guide for more details on each of these steps.
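
A minimal sketch of the install is shown below, assuming a hypothetical archive name and license filename (use whatever the Arcion team actually provides):

# The archive and license filenames below are placeholders
unzip replicant-cli-<version>.zip -d ~/arcion
export REPLICANT_HOME=~/arcion/replicant-cli   # referred to as $REPLICANT_HOME in this guide
cp replicant.lic $REPLICANT_HOME/              # place the license file provided by Arcion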

Step 2: Obtain the JDBC Driver for Oracle

The appropriate JDBC driver for Oracle has to be obtained before setting up Replicant, Arcion’s CDC tool. To do this, visit the Oracle Database JDBC driver Downloads page to download a driver that meets your business needs. To run Arcion efficiently, we recommend JDBC Driver for Oracle 18c and Java 8-compatible drivers.  Go to the Oracle Database 18c Downloads section from the link provided above to download the necessary drivers. Be sure to put the ojdbc8.jar file inside the $REPLICANT_HOME/lib directory. The $REPLICANT_HOME directory is the directory where Replicant is installed. For more information on how to install Replicant, see the documentation.
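
For example, assuming the driver JAR was downloaded to your current directory, copying it into place might look like this (the path is illustrative):

# Copy the Oracle JDBC driver into Replicant's lib directory
cp ojdbc8.jar $REPLICANT_HOME/lib/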

Step 3: Set Up The Oracle User

Our next step is to create an Oracle User for Replicant to use. To set up an Oracle user, perform the following commands, replacing the <VARIABLES> with the values you want to use.

  1. Create a user for Replicant by using the command below.
CREATE USER <USERNAME> IDENTIFIED BY <PASSWORD>
DEFAULT TABLESPACE <user-defined-tablespace>
QUOTA unlimited on <user-defined-tablespace>
TEMPORARY TABLESPACE TEMP;
  2. Grant the CREATE SESSION permission so the user can connect.
GRANT CREATE SESSION TO <USERNAME>;
  3. Next, grant the SELECT permission on the specific tables that will be involved in the replication.
GRANT SELECT ON <TABLENAME> TO <USERNAME>;

Alternatively, you can grant permission on all the tables as shown below.

GRANT SELECT ANY TABLE TO <USERNAME>;
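
Putting these commands together, a filled-in sketch might look like the following (the username, password, and tablespace are hypothetical placeholders; substitute your own values):

-- Hypothetical example values
CREATE USER arcion_repl IDENTIFIED BY StrongPassword123
DEFAULT TABLESPACE users
QUOTA unlimited on users
TEMPORARY TABLESPACE TEMP;

GRANT CREATE SESSION TO arcion_repl;
GRANT SELECT ANY TABLE TO arcion_repl;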

Step 4: Set Up Change Data Capture (CDC)

To use log-based CDC, the Oracle database should be in ARCHIVELOG mode. To set the database to ARCHIVELOG mode, use the commands shown below before proceeding to set permissions and enable the logs.

SHUTDOWN IMMEDIATE
STARTUP MOUNT
ALTER DATABASE ARCHIVELOG
ALTER DATABASE OPEN
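
To confirm the change, you can check the database’s current log mode, which should now return ARCHIVELOG:

-- Verify that the database is in ARCHIVELOG mode
SELECT log_mode FROM v$database;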

Set permissions

  1. From ARCHIVELOG mode, grant the EXECUTE_CATALOG_ROLE role to the user created for Replicant so you can use the DBMS_LOGMNR package.
GRANT EXECUTE_CATALOG_ROLE TO <USERNAME>;
  2. For Replicant to access the START_LOGMNR procedure, provide the following permission. The command shown below works for Oracle 12c and above.
GRANT LOGMINING TO <USERNAME>;
  3. For Replicant to access v_$logmnr_contents, provide the permissions below for Oracle 19c and above.
GRANT SELECT ON v_$logmnr_contents TO <USERNAME>;
GRANT SELECT ON gv_$archived_log TO <USERNAME>;
GRANT SELECT ON v_$logfile TO <USERNAME>;

Enable logs

After setting up the permissions that grant Replicant access to the necessary logs, you now need to enable either primary key or all-column logging at the database level or the table level. One thing to note is that if you use table-level logging, you must also enable it for the CDC heartbeat table we will create in Step 8.

For database-level supplemental logging, you will run the following commands on the database instance:

  1. Run the following command to enable Force Logging.
ALTER DATABASE FORCE LOGGING
  2. Run the following command to enable PRIMARY KEY logging.
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS

Alternatively, you can enable ALL column logging.

ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS

For table-level supplemental logging, Force Logging is not required. To enable table-level logging you will do the following:

  1. For PRIMARY KEY logging use the following command.
ALTER TABLE <TABLE_NAME> ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS

Alternatively, for ALL Column logging use the following command.

ALTER TABLE <TABLE_NAME> ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS

Note that Force Logging applies to database-level logging only and should not be enabled when you are using table-level logging. After enabling supplemental logging, switch the current log file:

ALTER SYSTEM SWITCH LOGFILE
  2. Grant access to the following permissions as they will be required.
GRANT SELECT ON gv_$instance TO <USERNAME>;
GRANT SELECT ON gv_$PDBS TO <USERNAME>;
GRANT SELECT ON gv_$log TO <USERNAME>;
GRANT SELECT ON gv_$database_incarnation TO <USERNAME>;
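
To confirm which supplemental logging options are active, you can query the data dictionary as shown below (the schema name in the second query is a placeholder):

-- Database-level supplemental logging status
SELECT supplemental_log_data_min, supplemental_log_data_pk, supplemental_log_data_all
FROM v$database;

-- Table-level supplemental log groups for a given schema
SELECT table_name, log_group_type
FROM dba_log_groups
WHERE owner = '<SCHEMA>';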

Step 5: Set Up The Global Permissions

Global permissions can either be granted as One-Time Access or as Continuous Access. One-Time Access is only needed for the initial data snapshot and for re-init operations where snapshots are taken for newly added tables. One-Time Access permission can be revoked after the necessary operations are completed. 

On the other hand, Continuous Access permissions are given and required throughout any replication process for it to work correctly. To set up continuous access, follow the steps listed below to enable the global permissions needed for replication to work.

  1. For One-Time Access, run the following command to provide the correct privileges.
GRANT SELECT ON DBA_SEGMENTS TO <USERNAME>;
  2. For Continuous Access permissions, run the following commands to grant the needed access.
GRANT SELECT ON gv_$database TO <USERNAME>;
GRANT SELECT ON gv_$transaction TO <USERNAME>;
GRANT SELECT ON gv_$session TO <USERNAME>; --Not required for Replicant release 20.8.13.7 and above
  3. Grant the following continuous access permission for the tables involved in the replication or enable flashback for all tables.
GRANT FLASHBACK ON <TABLE_NAME> TO <USERNAME>;

Alternatively, you can grant continuous access permission to all tables using the following command.

GRANT FLASHBACK ANY TABLE TO <USERNAME>;
  4. Lastly, provide access to the system views to allow for schema migration as shown below.
GRANT SELECT ON ALL_TABLES TO <USERNAME>;
GRANT SELECT ON ALL_VIEWS TO <USERNAME>;
GRANT SELECT ON ALL_CONSTRAINTS TO <USERNAME>;
GRANT SELECT ON ALL_CONS_COLUMNS TO <USERNAME>;
GRANT SELECT ON ALL_PART_TABLES TO <USERNAME>;
GRANT SELECT ON ALL_PART_KEY_COLUMNS TO <USERNAME>;
GRANT SELECT ON ALL_TAB_COLUMNS TO <USERNAME>;
GRANT SELECT ON SYS.ALL_INDEXES TO <USERNAME>;
GRANT SELECT ON SYS.ALL_IND_COLUMNS TO <USERNAME>;
GRANT SELECT ON SYS.ALL_IND_EXPRESSIONS TO <USERNAME>;

Step 6: Set Up The Oracle Connection Configuration

Our next step is to set up the connection configuration so that Replicant can connect to our Oracle instance. To set up the connection configuration, carry out the following steps:

  1. From $REPLICANT_HOME, go to the Oracle connection configuration file.
vi conf/conn/oracle.yaml
  2. Your connection credentials can be retrieved easily by Replicant if they are stored in AWS Secrets Manager; otherwise, you will need to input them as shown in the sample below.
type: ORACLE

host: localhost #Replace localhost with your Oracle hostname
port: 1521 #Replace the default port number 1521 if needed
service-name: IO #Replace IO with the service name of your Oracle Listener

username: 'REPLICANT' #Replace REPLICANT with your username to connect to Oracle
password: 'Replicant#123' #Replace Replicant#123 with your user's password

max-connections: 30 #Maximum number of connections replicant can open in Oracle
continuous-log-mining: true

Step 7: Set Up The Filter Configuration

To set up the filter configuration, carry out the following steps:

  1. From $REPLICANT_HOME, go to the filter configuration file.
vi filter/oracle_filter.yaml
  2. Specify the schema(s) or table(s) to be replicated as explained in the filter file template below; a filled-in example follows the template.
allow:
  schema: <your_schema_name>
  types: <your_object_type>

  allow:
    your_table_name_1:
      allow: ["your_column_name"]
      conditions: "your_condition"  

    your_table_name_2:

    your_table_name_3:
      allow: ["your_column_name"]
      conditions: "your_condition"
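
As a hedged illustration, here is the template filled in for a hypothetical TPCH schema (all schema, table, column, and condition values below are placeholders):

allow:
  schema: TPCH
  types: TABLE

  allow:
    LINEITEM:
      allow: ["L_ORDERKEY", "L_QUANTITY"]
      conditions: "L_QUANTITY > 0"

    ORDERS:

    CUSTOMER: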

Step 8: Set Up The Extractor Configuration

For real-time replication, you must set up a heartbeat table in the source Oracle database. To create the heartbeat table, do the following:

  1. Create a heartbeat table in the schema you are going to replicate with the commands below:
CREATE TABLE "<schema>"."replicate_io_cdc_heartbeat"(
  "timestamp" NUMBER NOT NULL,
  PRIMARY KEY("timestamp"));
  2. Ensure that the user created for replication has access to the heartbeat table. If not, grant INSERT, UPDATE, and DELETE privileges on the table to that user.

With the heartbeat table created, you now need to configure the Extractor configuration. The steps required to set up the extractor are shown below.

  1. From $REPLICANT_HOME, go to the extractor configuration file for Oracle.
vi conf/src/oracle.yaml
  2. In the oracle.yaml file, we will add our configuration parameters. The configuration has two parts: parameters related to snapshot mode and parameters related to real-time mode. With CDC, snapshot mode runs first to extract and load the initial data. After the data is loaded, real-time mode takes over and any further updates to the data are handled through real-time replication, which keeps the source and the target in sync.


[Image: Different modes of replication that Arcion supports]

To set the parameters related to snapshot mode, let's take a look at each one. The following list includes the snapshot mode parameters related to Oracle as a data source:

  • fetch-create-sql: This is used to apply the exact create SQL on the source to target. This parameter is only available when creating an Oracle-to-Oracle pipeline.
  • fetch-create-sql-no-constraints: This is used to split create table SQL and Primary/Unique/Foreign key constraints as different SQLs.  Tables are created without constraints but after the completion of the snapshot, the constraints are applied. This parameter is only available when creating an Oracle-to-Oracle pipeline.
  • serialize-fetch-createSql: This is used to fetch create SQL in a serialized manner after fetching table schema.
  • serialize-fetch-createSql-no-constraints: This is used to fetch the SQL of Primary/Unique/Foreign Key constraints in a serialized manner. This parameter is only available when creating an Oracle-to-Oracle pipeline.
  • native-load: Replicant uses this parameter set to configure the Oracle Data Pump Export (expdp) utility to load tables instead of JDBC, which is better suited to large-scale data. The following configuration parameters are available under native-load:
    • enable: This is used to enable native-load and is set to true or false.
    • stage-type: This is used to determine the type of staging area. The allowed values are SHARED_FS and ASM.
    • directory: This denotes the Oracle directory object corresponding to the stage-type.
    • path: This is the full path to the Network File System (NFS) directory shared between Replicant and Oracle.

Below is a sample code showing the configuration for snapshot mode.

snapshot:
  threads: 16
  fetch-size-rows: 10_000
  verify-row-count: false
  _fetch-exact-row-count: true
  _traceDBTasks: true
#  inf-number-behavior: EXCEPTION   # EXCEPTION, ROUND, CONVERT_TO_NULL
#  flashback-query: true
#  parallel-query: true
#  fetch-user-roles: true

#   native-load:
#     enable: false
#     stage-type: SHARED_FS
#     directory: SHARED_STAGE
#     shared-path: FULL_PATH_TO_NFS #full path to the NFS shared directory 

#   per-table-config:
#   - schema: tpch
#     tables:
#       lineitem1:
#         row-identifier-key: [ORDERKEY, LINENUMBER]
#         extraction-priority: 1 #Higher value is a higher priority. Both positive and negative values are allowed. The default priority is 0 if unspecified.
#       lineitem1:
#         row-identifier-key: [ORDERKEY, LINENUMBER]
#        products:
#          per-partition-config:
#          - partition-name: SYS_P461
#            row-count: 0
#          - partition-name: SYS_P462
#            row-count: 0
#          - partition-name: SYS_P463
#            row-count: 1
#          - partition-name: SYS_P464
#            row-count: 3
#        part:
#          row-count: 2000

Next, let’s look at the parameters related to real-time mode. If your data is to be replicated in real-time, the real-time section is what you will use to specify your configuration. The following Oracle-specific parameters are available:

  • block-ddl-transaction
  • use-current-scn
  • start-scn
  • inter-source-latency-s
  • log-miner-dict-file
  • oldest-active-txn-window-hr

Below is a sample configuration to illustrate the parameters associated with the real-time mode.

realtime:
  threads: 4
  _traceDBTasks: true
  #fetch-size-rows: 0
  heartbeat:
    enable: true
    schema: "tpch"
    interval-ms: 10000
    table-name: replicate_io_cdc_heartbeat

  start-position:
    start-scn: 2362927

Step 9: Obtain the JDBC driver for Google BigQuery

Replicant also requires the correct JDBC driver for Google BigQuery as a dependency. Download the BigQuery JDBC driver from Google's JDBC driver downloads page and place the driver JAR inside the $REPLICANT_HOME/lib directory, just as you did with the Oracle driver in Step 2.
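
A minimal sketch of this, assuming the extracted driver JAR is named GoogleBigQueryJDBC42.jar (the exact filename depends on the driver version you download):

# The JAR name below is an assumption; use the file from the driver ZIP you downloaded
cp GoogleBigQueryJDBC42.jar $REPLICANT_HOME/lib/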

Step 10: Set up connection configuration for Google BigQuery

From the $REPLICANT_HOME directory, you will open the BigQuery Connection configuration file:

vi conf/conn/bigquery.yaml

Once the BigQuery Connection configuration file is open, we will configure the BigQuery server connection. For this, you will need to input your credentials, such as your BigQuery username and password. If your connection credentials are stored in AWS Secrets Manager, Replicant can simply obtain them for you. If you are manually inputting these credentials, the sample below shows how they can be added:

type: BIGQUERY

host: https://www.googleapis.com/bigquery/v2
port: 443
project-id: bigquerytest-268904
auth-type: 0
o-auth-service-acc-email: bigquerytest@bigquerytest-268904.iam.gserviceaccount.com
o-auth-pvt-key-path: <path_to_oauth_private_key>
location: US
timeout: 500

username: "USERNAME"
password: "PASSWORD"

max-connections: 20

max-retries: 10
retry-wait-duration-ms: 1000

From the sample above, the value definitions for each field are:

type: This indicates the connection type representing the database; in this case, it is BIGQUERY.

host: This shows the hostname of your BigQuery instance.

port: This is the port number to connect to the host.

username: This refers to the username credential to access the BigQuery system.

password: This is the password associated with the username.

max-connections: This indicates the maximum number of connections Replicant can use to load data into the BigQuery instance.

Arcion supports both HTTP and HTTPS for BigQuery connection.

To use HTTP, follow these steps:

  • Set http as the protocol in the host URL.
  • Set port to 80.

To use HTTPS, follow these steps:

  • Set https as the protocol in the host URL.
  • Set port to 443.

The https setup is what is shown in the example above.

Arcion also supports CSV and Parquet as intermediary formats for sending data to the BigQuery server. To specify the stage format, you can add the stage field in the connection configuration file with the desired file-format chosen:

stage:
  type: NATIVE
  file-format: {CSV|PARQUET}

Step 11: Set up Applier configuration

Next, we will need to set the Applier configuration to tell Arcion how to apply the data to our BigQuery instance. From $REPLICANT_HOME, open the sample BigQuery Applier configuration file:

vi conf/dst/bigquery.yaml

Arcion Replicant supports snapshot, realtime, and full replication modes for BigQuery as a target. Below we will look at how to set the Applier configuration based on what type of replication should be applied.

Configuration for snapshot replication

Replicant supports two methods for snapshot replication: loading data with the load job method, which is the default, and streaming data using the BigQuery Storage Write API.

For the configuration with the load job method, a sample is shown below:

snapshot:
  threads: 16

  batch-size-rows: 100_000_000
  txn-size-rows: 1_000_000_000
  
  use-write-storage-api: false

  bulk-load:
    enable: true
    type: FILE
    save-file-on-error: true
    serialize: true

  use-quoted-identifiers: false

For a configuration with Storage Write API, a sample is given below:

snapshot:
  threads: 16

  batch-size-rows: 100_000_000
  txn-size-rows: 1_000_000_000
  
  use-write-storage-api: true

  bulk-load:
    enable: true
    type: FILE
    save-file-on-error: true
    serialize: true

Configuration for realtime replication

For realtime replication mode, Replicant also supports loading data with the load job method, which is the default, and streaming data using the BigQuery Storage Write API.

For the configuration with the load job method, a sample is shown below:

realtime:
  threads: 16
  _traceDBTasks: true
  skip-tables-on-failures : false
  replay-strategy: IN_MEMORY_MERGE
  per-table-config:
    - catalog: tpch_scale_0_01
      tables:
        nation:
          replay-strategy: MERGE
          enable-dependency-tracking: false
        region:
          replay-strategy: INSERT_MERGE
          enable-dependency-tracking: true

For the configuration with the Storage Write API method, a sample is given below:

realtime:
  threads: 16
  _traceDBTasks: true
  skip-tables-on-failures : false

  use-write-storage-api: true

  replay-strategy: IN_MEMORY_MERGE
  per-table-config:
    - catalog: tpch_scale_0_01
      tables:
        nation:
          replay-strategy: MERGE
          enable-dependency-tracking: false
        region:
          replay-strategy: INSERT_MERGE
          enable-dependency-tracking: true

The replay-strategy parameter found in the code above is Arcion’s way to implement CDC changes and apply these changes in real time to the target. Arcion supports the following replay strategies:

  • NONE
  • INSERT_DELETE
  • INSERT_MERGE
  • IN_MEMORY_MERGE

Step 12: Running Replicant

With that, the last step is to run Replicant to begin the CDC process. To run Replicant, check out the Running Replicant docs page which outlines all of the necessary commands.
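
As a rough sketch of what that can look like with the configuration files created in the previous steps (the exact mode and flags for your deployment are described on that docs page, so treat the command below as illustrative rather than definitive):

# Illustrative only: run replication in full mode (initial snapshot followed by realtime CDC)
./bin/replicant full conf/conn/oracle.yaml conf/conn/bigquery.yaml \
  --extractor conf/src/oracle.yaml \
  --applier conf/dst/bigquery.yaml \
  --filter filter/oracle_filter.yaml \
  --id repl1 --replace-existing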

Advantages of using Arcion

  1. Sub-second Latency From Distributed & Highly Scalable Architecture: Arcion is the world’s only CDC solution with an underlying end-to-end multi-threaded architecture, which supports auto-scaling both vertically and horizontally. Its patent-pending technology parallelizes every single Arcion CDC process for maximum throughput. So users get ultra-low latency and maximum throughput even as data volume grows.
  2. Arcion allows smooth schema management of your data thereby ensuring data integrity, reliability, and consistency. As the schema evolves on the Oracle instance, these changes will be applied to the BigQuery instance automatically to keep both in sync.
  3. Effortless Setup & Maintenance: Arcion's no-code platform removes DevOps dependencies; you do not need to incorporate Kafka, Spark Streaming, Kinesis, or other streaming tools. So you can simplify the data architecture, saving both time and cost.
  4. 1:1 dedicated support from POC to production: The Arcion team offers live support for all self-hosted users around the clock through email, a dedicated Slack channel, and calls. If users get stuck, our team of database replication experts is available to help get things on track to make your database replication and CDC efforts a success.

Method 2: Oracle to BigQuery manually

The second method will explore how to link Oracle to BigQuery manually. We are going to export our data from the Oracle database as a CSV file, then upload it to Google Cloud Storage (GCS), before finally uploading the data from GCS to BigQuery. 

Step 1: Connecting Oracle to Export Data as a CSV Format

  • Open the Oracle SQL Developer tool.
  • Right-click on the table to open the context menu, then select Export to start the integrated Export Wizard.
  • Deselect the Export DDL option, change the format to CSV, and set the path for the exported file. Click on Next.
  • On the next screen, enter the filename and location. Specify the columns you wish to export by using the Columns tab. You can also use the Edit button to change the settings if you wish, or proceed by clicking on Next.
  • A summary of the export process is displayed; check that your settings are in order before clicking on Finish to complete the export.

Step 2: Upload the CSV file to Google Cloud Storage (GCS).

Having completed the process of exporting your Oracle data to the CSV file, the next phase is uploading the CSV file from the local filesystem to Google Cloud Storage (GCS). There are numerous ways to get this done, such as using the Google Cloud Console, the command line, client libraries, Terraform, and REST APIs. This article will look at the first two methods mentioned: the Google Cloud Console and the command line.

Before describing these methods, you must first create a bucket, the basic container that holds your data in Cloud Storage and to which the CSV file will be uploaded. If you already have a bucket, skip bucket creation and move on to the next part.

Creating a bucket

To create the bucket using Console, do the following:

  1. In the Google Cloud Console, go to the Cloud Storage Buckets page.
  2. Click Create Bucket to open the creation form.
  3. Enter the information you wish to give the bucket and click on Continue after completing each step. The information to be input includes:
    a) Enter a Name for the bucket
    b) Choose Region for Location type and us-east1 for Location.
    c) Choose Standard for the default storage class.
    d) Choose Uniform for Access Control.
    e) Uncheck Enforce public access prevention on this bucket so you can share your objects.
    f) Choose None for protection tools.
  4. Click on Create.

With a bucket created on Cloud Storage to hold your data, we can now proceed to upload the data to Google Cloud Storage.

Using The Google Cloud Console to upload data to Google Cloud Storage.

To use the Google Cloud Console to upload an object to a bucket, complete the steps below:

  1. In the Google Cloud Console, go to the Cloud Storage Buckets page.
  2. In the list of available buckets, select the bucket to which you wish to upload the object.
  3. In the Objects tab, do either of the following:
  • Drag and drop the desired file from your desktop or file manager into the main pane in the Google Cloud console.
  • Click the Upload files button, and a dialog box appears from which you can select the files you want to upload. Click on Open after the selection.

Using Command Line to upload data to Google Cloud Storage.

For the command line, you can either use gcloud or gsutil to upload an object to a bucket. Below is an explanation of how this is done.

For gcloud, we are going to use the gcloud storage cp command below to achieve this.

gcloud storage cp OBJECT_LOCATION gs://DESTINATION_BUCKET_NAME/

From the command above, we have OBJECT_LOCATION which is the local path to where your object is saved, and the DESTINATION_BUCKET_NAME, the name of the bucket to which you are uploading your object. 
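
For example, with a hypothetical export file and bucket name (both placeholders; substitute your own), the command might look like this:

gcloud storage cp ./oracle_export.csv gs://my-oracle-export-bucket/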

If successful, you will receive a response similar to the one below.

Completed files 1/1 | 164.3kiB/164.3kiB

For gsutil, we are going to use the gsutil cp command below to achieve this.

gsutil cp OBJECT_LOCATION gs://DESTINATION_BUCKET_NAME/

From the command above, we have OBJECT_LOCATION which is the local path to where your object is saved, and the DESTINATION_BUCKET_NAME, the name of the bucket to which you are uploading your object. 

If successful, you will receive a response similar to the one below.

Operation completed over 1 objects/58.8 KiB.

Remember, to complete the uploading of data using both the Google Cloud console and the command-line utility, you must have the proper IAM permissions set. These are mentioned at the top of the Method 1 steps above.

Step 3: Moving the data from Google Cloud Storage (GCS) to BigQuery

The final step in the process of connecting Oracle to BigQuery is moving the data from Google Cloud Storage (GCS) to BigQuery. There are several ways to load your CSV file into BigQuery but we will focus on using the Cloud console and SQL.

Before proceeding, the following items are things to note to ensure a seamless transfer of your data to BigQuery:

  • When CSV data is loaded from Google Cloud Storage, it can be loaded into a new table or partition, appended to an existing table or partition, or used to overwrite an existing table or partition.
  • The data loaded into BigQuery is converted into a columnar format for Capacitor which is BigQuery’s storage format.
  • When data is loaded from GCS to a BigQuery table, the dataset that contains the table you want to create, append, or overwrite must be in the same regional or multi-regional location as the Cloud Storage bucket.
  • Grant Identity and Access Management (IAM) roles that permit users to perform tasks. You have to grant IAM permissions to load data into BigQuery and also grant IAM permissions to load data from Cloud Storage.

To load data into BigQuery, the IAM permissions needed to run a load job include:

  • bigquery.tables.create
  • bigquery.tables.updateData
  • bigquery.tables.update
  • bigquery.jobs.create

Each of the following predefined IAM roles includes the permissions needed to load data into a BigQuery table or partition.

  • roles/bigquery.dataEditor
  • roles/bigquery.dataOwner
  • roles/bigquery.admin
  • roles/bigquery.user
  • roles/bigquery.jobUser

To load data into BigQuery from Cloud Storage, the IAM permissions needed include:

  • storage.buckets.get
  • storage.objects.get
  • storage.objects.list
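
As a hedged example of granting such access with the gcloud CLI (the project ID and user email below are placeholders, and your organization may prefer more narrowly scoped roles; a Cloud Storage role covering the storage permissions above can be granted the same way):

# Hypothetical project ID and user; substitute your own values
gcloud projects add-iam-policy-binding my-project-id \
  --member="user:analyst@example.com" \
  --role="roles/bigquery.dataEditor"

# A job-creation role is also needed to run load jobs
gcloud projects add-iam-policy-binding my-project-id \
  --member="user:analyst@example.com" \
  --role="roles/bigquery.jobUser"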

Having granted the IAM role permissions required, let us proceed to move data from Google Cloud Storage to BigQuery.

Loading the CSV file from Google Cloud Storage into BigQuery using the Console

To load the CSV data from GCS into a new BigQuery table using the Console, do the following:

  1. Go to the BigQuery page found on the Google Cloud console.
  2. In the Explorer pane, select a dataset by expanding your project.
  3. In the Dataset info section, click Create table.
  4. In the Create table panel, specify the following details:
    a) In the Source section, select Google Cloud Storage from the Create table from list. Then, do the following:
    - Select a file from the Cloud Storage bucket.  
    - For File format, select CSV.
    b) In the Destination section, specify the following details:
    - For Dataset, select the dataset in which you want to create the table.
    - In the Table field, enter the name of the table that you want to create.
    - Verify that the Table type field is set to Native table.
    c) In the Schema section, enter the schema definition. To enable the auto-detection of a schema, select Auto detect. 
    d) Specify the optional step Partition and cluster settings if you wish.
    e) Click Advanced options and do the following:
    - For Write preference, select Write if empty as this option creates a new table and loads your data into it.
    - For the Number of errors allowed, enter the maximum number of rows containing errors that can be ignored or leave the default value of 0.
    - For Unknown values, check the box to ignore values in a row that are not present in the table's schema.
    - For the Field delimiter, choose the character that separates the cells in your CSV file: Comma, Tab, Pipe, or Custom.
    - For Header rows to skip, enter the number of header rows to skip at the top of the CSV file.
    - Select Allow quoted newlines to allow quoted data sections that contain newline characters in a CSV file.
    - Select Allow jagged rows to accept rows in CSV files that are missing trailing optional columns. 
    - For Encryption, click the Customer-managed key to use a Cloud Key Management Service key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest.
    f) Finally, click Create table.

Loading the CSV file from Google Cloud Storage into BigQuery using SQL

In SQL, the LOAD DATA DDL statement loads the CSV data from GCS into a new BigQuery table. To load the data into a new table named mytable, do the following:

  1. In the Google Cloud console, go to the BigQuery page.
  2. Enter the statement below in the query editor:
LOAD DATA OVERWRITE mydataset.mytable
FROM FILES (
  format = 'CSV',
  uris = ['gs://bucket/path/file.csv']);
  3. Click Run.

So far, we have only loaded our CSV file into a new BigQuery table from Google Cloud Storage (GCS). For appending to or overwriting an existing table, and for other methods of loading data, you can visit the official documentation.

Disadvantages of Manually Loading Data

  1. Writing custom code is fine when you need to move data once, but the manual process becomes cumbersome when the data needs to be updated frequently and repeatedly.
  2. BigQuery does not guarantee data consistency for external data sources as unexpected behavior may be encountered when changes are made to the data while a query is running. 
  3. The data set's location value must be in the same region or contained in the same multi-region as the Cloud Storage Bucket.
  4. Nested or repeated data cannot be used with CSV files as the CSV format does not support this type of data.
  5. Compressed and uncompressed files cannot both be included in the same CSV load job.
  6. The maximum size for a gzip file for CSV is 4 GB.

Conclusion

This article served as a comprehensive step-by-step guide on how to set up an Oracle to BigQuery integration. We covered two methods: using Arcion and a custom, manual setup for importing data via CSV. We walked through each step of implementing a manual CSV import into BigQuery and discussed some of the limitations of that approach. For Arcion, we showed how to implement real-time CDC to move data from Oracle to Google BigQuery, highlighting Arcion as a simple and scalable solution for Oracle to BigQuery integration and replication.

As we saw, Arcion is a real-time, in-memory Change Data Capture (CDC) solution that guarantees scalability and ensures data consistency at all times when moving data from Oracle to BigQuery. To easily implement a data integration strategy, get started with Arcion today by connecting with our team of Oracle and BigQuery replication experts. CDC pipelines built with Arcion are easy to build, maintain, and scale and are production-ready in a matter of minutes.

Matt is a developer at heart with a passion for data, software architecture, and writing technical content. In the past, Matt worked at some of the largest finance and insurance companies in Canada before pivoting to working for fast-growing startups.
Luke has two decades of experience working with database technologies and has worked for companies like Oracle, AWS, and MariaDB. He is experienced in C++, Python, and JavaScript. He now works at Arcion as an Enterprise Solutions Architect to help companies simplify their data replication process.