Microsoft SQL Server to BigQuery: 2 Best Ways

Luke Smith
Enterprise Solutions Architect
May 16, 2023
Matt Tanner
Developer Relations Lead
May 16, 2023
Matt Tanner
Developer Relations Lead
May 16, 2023
Get a migration case study in your inbox
Join our newsletter

When it comes to data warehousing, Google’s BigQuery is one of the most popular platforms to push data into. Fully managed and scalable, it makes sense why data engineers have gravitated toward the technology. A major source of the data in BigQuery comes from enterprise databases, such as Microsoft SQL Server. This article explores two common ways to move data from Microsoft SQL Server into Google BigQuery. The first method will cover a manual ETL process, and the second method will use Arcion to create a more scalable and automated pipeline in minutes.

Table of Contents

Introduction to Microsoft SQL Server

Our first stop is to look at a brief introduction to SQL Server. Microsoft’s SQL Server platform is a relational database management system that is used by some of the largest businesses on the planet. The platform is designed to manage, store, and retrieve information when requested by other software applications. The software applications using the SQL Server backend may be on the same server or across a network in a distributed environment.

SQL Server supports various analytics operations, transactional operations, as well as business intelligence operations. SQL Server, as you may have assumed, uses Structured Query Language (SQL) to manage and query the database. SQL Server has many built-in services such as SQL Server Reporting Services (SSRS), SQL Server Agent Services (SSAS), SQL Server Integration Services (SSIS), and others.

Introduction to BigQuery

Google BigQuery is a serverless, cloud-based data warehouse that offers big data analytics web services for processing huge amounts of data. It supports the querying of data using ANSI SQL which is used to analyze data on a large scale, running into billions of rows. 

The BigQuery platform also utilizes columnar storage called Capacitor to ease data querying and aggregation of results. BigQuery has built-in auto scaling which allows the platform to automatically allocate computing resources as needed. This means that users don’t need to spin up new instances or virtual machines to scale. BigQuery can securely handle large volumes of data since BigQuery automatically encrypts all data before it is written to disk. When the data is read by an authorized user it is automatically decrypted. By default, Google manages the key encryption keys used to protect your data although this can be configured as needed.

How to Migrate Data from Microsoft SQL Server to BigQuery

There are several different options in which a connection can be set up to migrate data from Microsoft SQL Server to BigQuery. Some of the most common methods include continuous replication through CDC and Cloud Data Fusion Replication job, Manual ETL setup, Arcion, and many other popular approaches. In this article, we will be covering in detail the steps you need to follow to migrate your data from SQL Server to Snowflake using a manual ETL process and Arcion. We will cover both in a step-by-step manner so that users can implement either method confidently. Let’s start by looking at the manual ETL process.

Method 1: Using Manual ETL Process to Connect Microsoft SQL Server to BigQuery

In this method, the source data coming from SQL Server requires a staging area. In this staging area, data is uploaded before onward transfer and upload to BigQuery. The staging area can be either external or internal depending on the architecture of your system. For this write-up, we will be using Google Cloud Storage (GCS) as the staging area for the SQL Server data that will later be migrated to BigQuery.

In the manual process of connecting Microsoft SQL Server to BigQuery, we will divide the process into three distinct sections. These sections include:

  • Exporting SQL Server data to a CSV file through SQL Server Management Studio (SSMS) export wizard. 
  • Uploading the CSV file to the staging area, Google Cloud Storage (GCS).
  • Moving the data from Google Cloud Storage (GCS) to BigQuery.

Step 1: Exporting SQL Server data using SQL Server Management Studio

SQL Server Management Studio (SSMS) will be used to access and extract data from the SQL Server database. The SQL Server infrastructure is managed by SSMS and used as an integrated environment across SQL Server, Azure SQL Database, Azure SQL Managed Instance, and other Microsoft SQL products. SSMS can be used to do everything from access, configure, manage, and administer the databases hosted within SQL Server. We will use SSMS to export the data into a Comma Separated Values (CSV) file before outputting it to a GCS Bucket. 

Before we get to exporting the CSV, we need to ensure that SQL Management Studio is installed. Once it is installed, we will carry out the following steps:

  1. Launch SQL Server Management Studio and connect your Microsoft SQL Server instance.
  2. From the Object Explorer window, select the database you want to export. Once selected, you can then right-click on the Tasks context menu. From the sub-menu, choose Export Data to export the table in CSV.
  3. After selecting Export Data, you will see the SQL Server Import and Export Wizard window appear. From here you will click on the Next button. In the next window, choose the data source you want to copy from the drop-down menu.
  4. For the data source, you will choose SQL Server Native Client. 
  5. Next, select a SQL Server instance.
  6. Under Authentication, select “Use Windows Authentication”.
  7. Select the database you want to copy under the Database option.
  8. Click on Next after filling out the inputs.
  9. In the Choose a Destination window that appears, go to the Destination drop-down box and select Flat File Destination so data can be copied from the SQL Server to CSV.
  10. Now, create a name for the CSV file where the data will be written in the File name field, then click on Next.
  11. In the window that appears next, you will choose the Copy data from one or more tables or views option and click Next.
  12. The next screen to appear will be the Configure Flat File Destination screen. From here you will choose the table you want to export from the Source table or view options. Once selected, click Next.
  13. On the Save and Run Package window that appears you will click Next since there is nothing to change.
  14. The Complete Wizard that appears next shows a breakdown of all your selections during the entire exporting process. Ensure that your options are correct and accurate before proceeding to click on the Finish button. Once clicked, the Wizard will start exporting the SQL server database to CSV. 

After the exporting process, you are alerted if the export was successful or not on the next window that comes up. If the export is successful, your CSV file will be found in the file destination that was specified for export.

Step 2: Upload the CSV file to Google Cloud Storage (GCS)

Having completed the process of exporting your SQL Server data to the CSV file, the next phase is the transfer or uploading of the CSV file to Google Cloud Storage (GCS) from the local filesystem. There are numerous ways to get this done including through the console, command line, client libraries, Terraform, and REST APIs. In this article, we will look at the first two methods mentioned here, uploading via the console and command line.

Before going ahead to describe the process involved in the two methods, let us first create a GCS Bucket to store the data in. An GCS Bucket is a basic container that will hold all the data in Google Cloud Storage. The bucket will be where the CSV file will be uploaded to. If you already have a bucket created that you want to use you can skip the creation and move to the next step.

Creating a bucket

To create the bucket using the GCP Console, do the following:

  1. In the Google Cloud Console, go to the Cloud Storage Buckets page.
  2. Open the bucket creation form by clicking Create bucket.
  3. Enter the information you wish to give the bucket and click on Continue after completing each step. The information to be input includes:
  1. Enter a Name for the bucket
  2. Choose the Region for Location type and “us-east1” for Location (or the region that is closest to you).
  3. Choose Standard for the default storage class.
  4. Choose Uniform for Access control.
  5. Uncheck Enforce public access prevention on this bucket so you can share your objects.
  6. Choose None for protection tools.
  1. Click on Create.

After you’ve clicked create, the bucket will have been created on Cloud Storage to hold your CSV data. Next, we can proceed on how to upload data to Google Cloud Storage.

Using the GCP Console to upload data to Google Cloud Storage

To use the GCP Console to upload an object to a bucket, you’ll need to complete the following steps.

  1. In the Google Cloud Console, go to the Cloud Storage Buckets page.
  2.  In the list of available buckets, select the bucket you wish to upload the object.
  3. In the Objects tab, you can do either of the following to add the file:
  1. Drag and drop the desired file from your desktop or file manager depending on where your file is stored to the main pane in the Google Cloud console.
  2. Click the Upload files button, and a dialog box appears from which you will select the files you want to upload. Click on Open after the selection.

Using Command Line to upload data to Google Cloud Storage

Alternatively, to use the command line, you can either use gcloud or gsutil to upload an object to a bucket. An explanation of how this is done is given below.

  • For gcloud, we are going to use the gcloud storage cp command below to achieve this.
gcloud storage cp OBJECT_LOCATION gs://DESTINATION_BUCKET_NAME/

In the command above, the OBJECT_LOCATION is the local path to where your object is saved, and the DESTINATION_BUCKET_NAME, the name of the bucket to which you are uploading your object. 

If successful, you will receive a response similar to the one below.

Completed files 1/1 | 164.3kiB/164.3kiB
  • To use gsutil instead, we can use the gsutil cp command below to achieve the same result. An example command can be seen below.
gsutil cp OBJECT_LOCATION gs://DESTINATION_BUCKET_NAME/

In the command above, the OBJECT_LOCATION is the local path to where your object is saved, and the DESTINATION_BUCKET_NAME is the name of the bucket to which you are uploading your object. 

If successful, you will receive a response similar to the one below.

Operation completed over 1 objects/58.8 KiB.

It’s important to note that to complete the uploading of data using either the Google Cloud console or the command-line utility, you must have the proper IAM permissions.

Step 3: Moving the data from Google Cloud Storage (GCS) to BigQuery

The final step in the process of connecting Microsoft SQL Server to BigQuery is moving the data from Google Cloud Storage (GCS) to BigQuery. There are several ways to load your CSV file into BigQuery. This can be done using the console, SQL, bq, API, a custom Java program, and quite a few other ways. In this article we will be focused on using the GCP Console and SQL.

Before proceeding, the following items are things to note to ensure a seamless transfer of your data into BigQuery:

  • When CSV data is loaded from Google Cloud Storage, it is loaded into a new table or partition, or appended to or overwritten an existing table or partition. 
  • The data loaded into BigQuery is converted into a columnar format for Capacitor, which is BigQuery’s storage format.
  • When data is loaded from GCS to a BigQuery table, the dataset that contains the table you want to create, append, or overwrite must be in the same regional or multi-regional location as the Cloud Storage bucket.
  • Grant Identity and Access Management (IAM) roles that permit users to perform tasks. You have to grant IAM permissions to load data into BIgQuery and also grant IAM permissions to load data from Cloud Storage.

To load data into BIgQuery, the IAM permissions needed to run a load job into BigQuery include:

  • bigquery.tables.create
  • bigquery.tables.updateData
  • bigquery.tables.update
  • bigquery.jobs.create

The following predefined IAM roles are needed to load data into a table or partition in BigQuery.

  • roles/bigquery.dataEditor
  • roles/bigquery.dataOwner
  • roles/bigquery.admin 
  • bigquery.user
  • bigquery.jobUser

Permissions to load data into BigQuery from Cloud Storage, the IAM permissions needed include:

  • storage.buckets.get
  • storage.objects.get
  • storage.objects.list

Having taken note of the IAM role permissions required and granting them, let us proceed to move the CSV data from our Google Cloud Storage Bucket to BigQuery.

Loading the CSV file Google Cloud Storage using the Console into BigQuery

To load the CSV data from GCS into a new BigQuery table using the Console, do the following:

  1. Go to the BigQuery page found on the Google Cloud console.
  2. In the Explorer pane, select a dataset by expanding your project.
  3. In the Dataset info section, click Create table.
  4. Specify the following details in the Create table panel:

    a. Select Google Cloud Storage in the Create table from the list in the Source section. Then, do the following:

    - Select a file from the Cloud Storage bucket.
    - For File format, select CSV.  

    b. Specify the following details in the Destination section:

    - Select the dataset in which you want to create the table by using the option Dataset.
    - Enter the name of the table that you want to create by using the Table field option.
    - Ensure that the Table type field is set to Native table.

    c. Enter the schema definition in the Schema section. To enable the auto-detection of a schema, select Auto detect

    d. Specify the optional step Partition and cluster settings if you wish.

    e. Click Advanced options and do the following:

    - For Write preference, select Write if empty as this option creates a new table and loads your data into it.
    - For Number of errors allowed, enter the maximum number of rows containing errors that can be ignored or leave the default value of 0.
    - Unknown values
    are used to ignore values in a row that are not present in the table's schema.
    - Commas
    , Tabs, Pipes, or Custom are characters that separate cells in your CSV file. Select Field delimiter to choose a character. 
    - For Header rows to skip, enter the number of header rows to skip at the top of the CSV file.
    - Select Allow quoted newlines to allow quoted data sections that contain newline characters in a CSV file.
    - To accept rows in CSV files, select Allow jagged rows.
    - A Cloud Key Management Service key for Encryption is to be used,  click the Customer-managed key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest.

    f. Finally, click Create table.

Loading the CSV file Google Cloud Storage using SQL into BigQuery

SQL uses the LOAD DATA DDL Statement to load the CSV data from GCS into a new BigQuery table. To use SQL to load into a new table mytable, do the following:

  1. In the Google Cloud console, go to the BigQuery page.
  2. Enter the statement below in the query editor:
LOAD DATA OVERWRITE mydataset.mytableFROM FILES (  format = 'CSV',  uris = ['gs://bucket/path/file.csv']);
  1. Click Run.

At this point we have succeeded in loading our CSV file into a new table in BigQuery from Google Cloud Storage (GCS). For appending to or overwriting a table and other information or methods of doing this, you can visit the official Google Cloud documentation here to see all of the options.

Limitations and Challenges of Using Custom Code Methods

Although using a custom code approach like the one discussed above is possible, it may not always be optimal. Manual processes can have many issues and limitations that appear quite quickly at scale. The following are the limitations of using the custom code method explained above.

  • Writing custom code is ideal when you need to move data once. Carrying out the manual process becomes cumbersome and bulky when trying to update the data frequently and on a recurring basis.
  • BigQuery does not guarantee data consistency for external data sources as unexpected behavior may be encountered when changes are made to the data while a query is running. 
  • The data set's location value must be in the same region or contained in the same multi-region as the Cloud Storage Bucket.
  • Nested or repeated data can not be used on CSV files as the CSV format does not support this type of data.
  • Compressed and uncompressed files cannot both be included in the same load job when using a CSV.
  • The maximum size for a gzip file for CSV is 4 GB.

Method 2: Microsoft SQL Server to Bigquery using Arcion

In this section, we will explore the second method of how to create a connection to migrate data from SQL Server to BigQuery. This time we will be using the Arcion CLI tool, Replicant.

Arcion, a real-time in-memory Change Data Capture (CDC) solution guarantees users petabyte-level scalability, sub-second latency, and ensures data consistency at all times. It is one of the foremost solutions for data migration and data replication and has integrations with various enterprise databases and data warehouses, including SQL Server and BigQuery. 

To load data into BigQuery, you need to make sure you have the necessary IAM permissions. The correct permissions are required to run a load job and to load data into BigQuery tables and partitions. Refer to the section above (the section around IAM permissions) for the exact permissions needed. Once the correct permissions are granted, we can look at the steps involved in connecting Microsoft SQL Server to BigQuery using Arcion. 

Step 1: Download And Install Arcion Self-hosted

The first step in our setup is to get in touch with our team to access the download link of Arcion Self-hosted. This step involves downloading Arcion’s CLI tool Replicant, creating a home directory where Replicant was downloaded, referred to as $REPLICANT_HOME in the steps below, and finally, adding your credentials. You can reference the quick start guide for more details on how to do each of these requirements.

Having downloaded and installed Arcion Self-hosted, we can move on to further steps where we will configure and enable Arcion to connect with SQL Server and Snowflake. 

Step 2: Set up Connection Configuration for SQL Server

First we will open the sample SQL Server connection configuration file found in the $REPLICANT_HOME directory.

vi conf/conn/sqlserver.yaml

If your connection credentials are stored in a Secrets Manager, you can retrieve them using Replicant; if not, you will have to input the credentials as seen below in the username and password  fields.

type: SQLSERVERextractor: {CHANGE|LOG}host: localhostport: 1433username: 'USERNAME'password: 'PASSWORD'database: 'tpcc'max-connections: MAX_NUMBER_OF_CONNECTIONS#ssl:#  enable: true#  hostname-verification: false

From the configuration above, we have the following values that can be specified:

HOST: This refers to the hostname of the SQL Server instance

PORT: This is the port number of the SQL Server instance

USERNAME: This refers to the username of the DATABASE user

PASSWORD: This is the password associated with the USERNAME

DATABASE: This is where you input the name of the SQL Server database to connect to

MAX_NUMBER_OF_CONNECTIONS: This is used to specify the maximum number of connections Replicant would use to fetch data from the source

For those hosting SQL Server on Azure, the parameter below must be added and set to true in the connection configuration file:

is_azure: true

After you are done adding the necessary values to the configuration file, save and exit the file.

Step 3: Set up Extractor Configuration for SQL Server

Next, we will go to the SQL Server Extractor configuration file in the $REPLICANT_HOME directory.

vi conf/src/sqlserver.yaml

Arcion supports three modes of replications  for SQL Server: Realtime, delta-snapshot, and snapshot. In our example we will set the configuration to support real-time mode and our focus will be on full mode replication. For this type of replication, we will also need to create a heartbeat table. It is also important with full-mode replication that all tables to be replicated have a primary key on them. 

The example below shows how a heartbeat table can be created in the SQL Server instance.

CREATE TABLE "tpcc"."dbo"."replicate_io_cdc_heartbeat"("timestamp" BIGINT NOT NULL, PRIMARY KEY("timestamp"))

After creating the heartbeat table, the next thing to do is to add the realtime configuration to the Extractor configuration file that we opened above. The sample configuration below shows how to set up Replicant to operate in realtime mode:

realtime: 
	threads: 4 
  fetch-size-rows: 10000 
  fetch-duration-per-extractor-slot-s: 3 
 heartbeat:      
 	enable: true      
  catalog: "tpcc"      
  schema: "dbo"      
  interval-ms: 10000

You can visit here for further reading and more information about Realtime mode configuration and parameters on Arcion.

Step 4: Obtain the JDBC driver for Google BigQuery

Replicant relies on the JDBC driver for Google BigQuery. To obtain the required driver, you will need to follow these steps:

Step 5: Set up connection configuration for Google BigQuery

From the $REPLICANT_HOME directory, open the BigQuery Connection configuration file.

vi conf/conn/bigquery.yaml

After locating the BigQuery Connection configuration file, configure the BigQuery server connection by inputting your credentials, like username and password. Alternatively, if your connection credentials are stored in a Secrets Manager, Replicant can obtain them for you. The sample below shows how to input the credentials:

type: BIGQUERY


host: https://www.googleapis.com/bigquery/v2
port: 443
project-id: bigquerytest-268904
auth-type: 0
o-auth-service-acc-email: bigquerytest@bigquerytest-268904.iam.gserviceaccount.com
o-auth-pvt-key-path: <path_to_oauth_private_key>
location: US
timeout: 500


username: "USERNAME"
password: "PASSWORD"


max-connections: 20


max-retries: 10
retry-wait-duration-ms: 1000

From the sample above, the value definitions for each of the fields are as follows:

type: This indicates the connection type representing the database, it is BigQuery in this case. 

host: This shows the hostname of your BigQueryinstance.

port: This is the port number to connect to the host.

username: This refers to the username credential to access the BigQuery system.

password: This is the password you put in reference to the username.

max-connections: This is used to indicate the maximum number of connections Replicant uses to load data into the BigQuery instance.

Arcion supports both HTTP and HTTPS for BigQuery connection. Both can be configured in the bigquery.yaml shown above. 

To use HTTP, follow these steps:

  • Set http as the protocol in the host URL.
  • Set port to 80.

To use HTTPS, follow these steps:

  • Set https as the protocol in the host URL.
  • Set port to 443.

Arcion supports CSV and Parquet as intermediary formats to send data to the BigQuery server. To specify the stage format, use the stage field in the connection configuration file:

stage:
  type: NATIVE
  file-format: {CSV|PARQUET}

Step 6: Set up Applier configuration

Our last step in configuration is to specify the Applier configuration. To set this, from $REPLICANT_HOME directory you’ll open to the sample BigQuery Applier configuration file:

vi conf/dst/bigquery.yaml

Arcion Replicant supports snapshot, realtime, and full replication modes for BigQuery as a target.  Below we will look at how to set up the configuration for snapshot and realtime modes.

Configuration for snapshot replication

Replicant supports the following two methods for snapshot replication:

  • loading data with the load job method, the default method
  • streaming data using BigQuery Storage Write API

With this in mind, let’s take a look at how each of these methods can be configured in Arcion.

For the configuration with the load job method, the configuration set in the conf/dst/bigquery.yaml file can be set as shown in the example below.

snapshot:
  threads: 16


  batch-size-rows: 100_000_000
  txn-size-rows: 1_000_000_000
  
  use-write-storage-api: false


  bulk-load:
    enable: true
    type: FILE
    save-file-on-error: true
    serialize: true


  use-quoted-identifiers: false

For a configuration that uses the Storage Write API, you can follow the example configuration shown below:

snapshot:
  threads: 16


  batch-size-rows: 100_000_000
  txn-size-rows: 1_000_000_000
  
  use-write-storage-api: true


  bulk-load:
    enable: true
    type: FILE
    save-file-on-error: true
    serialize: true
Configure realtime replication

For realtime replication, Replicant supports the same two methods we looked at in the snapshot section above: the load job and streaming data using BigQuery Storage Write API methods.

For using the load job method, the configuration in the conf/dst/bigquery.yaml file can be set as shown below.

realtime:
  threads: 16
    _traceDBTasks: true
    skip-tables-on-failures : false
    replay-strategy: IN_MEMORY_MERGE
    per-table-config:
      - catalog: tpch_scale_0_01
        tables:
          nation:
            replay-strategy: MERGE
            enable-dependency-tracking: false
          region:
            replay-strategy: INSERT_MERGE
            enable-dependency-tracking: true

To use the Storage Write API method, the configuration can be set as shown below.

realtime:
  threads: 16
  _traceDBTasks: true
  skip-tables-on-failures : false


  use-write-storage-api: true


  replay-strategy: IN_MEMORY_MERGE
  per-table-config:
    - catalog: tpch_scale_0_01
      tables:
        nation:
          replay-strategy: MERGE
          enable-dependency-tracking: false
        region:
          replay-strategy: INSERT_MERGE
          enable-dependency-tracking: true

The replay-strategy parameter found in the code above is Arcion’s way to implement CDC changes and apply these changes in real-time to the target. Arcion supports the following replay strategies:

  • NONE
  • INSERT_DELETE
  • INSERT_MERGE
  • IN_MEMORY_MERGE

Conclusion

This article served as a comprehensive step-by-step guide on how to set up Microsoft SQL Server to BigQuery integration. We covered two methods which included a custom or manual setup for importing data via CSV and using Arcion. We covered each step on how to implement a manual CSV import into BigQuery and also discussed some of the limitations of this approach. For Arcion, we showed how to implement real-time CDC to move data from SQL Server to Google BigQuery. Arcion was highlighted as a simple and scalable solution for  SQL Server to BigQuery integration and replication.

As we saw, Arcion is a real-time, in-memory Change Data Capture (CDC) solution that guarantees scalability and ensures data consistency at all times when moving data from SQL Server to BigQuery. To easily implement a data integration strategy, get started with Arcion today by connecting with our team of SQL Server and BigQuery replication experts. CDC pipelines built with Arcion are easy to build, maintain, and scale and are production-ready in a matter of minutes.

Matt is a developer at heart with a passion for data, software architecture, and writing technical content. In the past, Matt worked at some of the largest finance and insurance companies in Canada before pivoting to working for fast-growing startups.
Matt is a developer at heart with a passion for data, software architecture, and writing technical content. In the past, Matt worked at some of the largest finance and insurance companies in Canada before pivoting to working for fast-growing startups.
Luke has two decades of experience working with database technologies and has worked for companies like Oracle, AWS, and MariaDB. He is experienced in C++, Python, and JavaScript. He now works at Arcion as an Enterprise Solutions Architect to help companies simplify their data replication process.
Join our newsletter
Back to Blog
Microsoft SQL Server to BigQuery: 2 Best Ways

Microsoft SQL Server to BigQuery: 2 Best Ways

Matt Tanner
Developer Relations Lead
May 16, 2023

Introduction to Microsoft SQL Server

Our first stop is to look at a brief introduction to SQL Server. Microsoft’s SQL Server platform is a relational database management system that is used by some of the largest businesses on the planet. The platform is designed to manage, store, and retrieve information when requested by other software applications. The software applications using the SQL Server backend may be on the same server or across a network in a distributed environment.

SQL Server supports various analytics operations, transactional operations, as well as business intelligence operations. SQL Server, as you may have assumed, uses Structured Query Language (SQL) to manage and query the database. SQL Server has many built-in services such as SQL Server Reporting Services (SSRS), SQL Server Agent Services (SSAS), SQL Server Integration Services (SSIS), and others.

Introduction to BigQuery

Google BigQuery is a serverless, cloud-based data warehouse that offers big data analytics web services for processing huge amounts of data. It supports the querying of data using ANSI SQL which is used to analyze data on a large scale, running into billions of rows. 

The BigQuery platform also utilizes columnar storage called Capacitor to ease data querying and aggregation of results. BigQuery has built-in auto scaling which allows the platform to automatically allocate computing resources as needed. This means that users don’t need to spin up new instances or virtual machines to scale. BigQuery can securely handle large volumes of data since BigQuery automatically encrypts all data before it is written to disk. When the data is read by an authorized user it is automatically decrypted. By default, Google manages the key encryption keys used to protect your data although this can be configured as needed.

How to Migrate Data from Microsoft SQL Server to BigQuery

There are several different options in which a connection can be set up to migrate data from Microsoft SQL Server to BigQuery. Some of the most common methods include continuous replication through CDC and Cloud Data Fusion Replication job, Manual ETL setup, Arcion, and many other popular approaches. In this article, we will be covering in detail the steps you need to follow to migrate your data from SQL Server to Snowflake using a manual ETL process and Arcion. We will cover both in a step-by-step manner so that users can implement either method confidently. Let’s start by looking at the manual ETL process.

Method 1: Using Manual ETL Process to Connect Microsoft SQL Server to BigQuery

In this method, the source data coming from SQL Server requires a staging area. In this staging area, data is uploaded before onward transfer and upload to BigQuery. The staging area can be either external or internal depending on the architecture of your system. For this write-up, we will be using Google Cloud Storage (GCS) as the staging area for the SQL Server data that will later be migrated to BigQuery.

In the manual process of connecting Microsoft SQL Server to BigQuery, we will divide the process into three distinct sections. These sections include:

  • Exporting SQL Server data to a CSV file through SQL Server Management Studio (SSMS) export wizard. 
  • Uploading the CSV file to the staging area, Google Cloud Storage (GCS).
  • Moving the data from Google Cloud Storage (GCS) to BigQuery.

Step 1: Exporting SQL Server data using SQL Server Management Studio

SQL Server Management Studio (SSMS) will be used to access and extract data from the SQL Server database. The SQL Server infrastructure is managed by SSMS and used as an integrated environment across SQL Server, Azure SQL Database, Azure SQL Managed Instance, and other Microsoft SQL products. SSMS can be used to do everything from access, configure, manage, and administer the databases hosted within SQL Server. We will use SSMS to export the data into a Comma Separated Values (CSV) file before outputting it to a GCS Bucket. 

Before we get to exporting the CSV, we need to ensure that SQL Management Studio is installed. Once it is installed, we will carry out the following steps:

  1. Launch SQL Server Management Studio and connect your Microsoft SQL Server instance.
  2. From the Object Explorer window, select the database you want to export. Once selected, you can then right-click on the Tasks context menu. From the sub-menu, choose Export Data to export the table in CSV.
  3. After selecting Export Data, you will see the SQL Server Import and Export Wizard window appear. From here you will click on the Next button. In the next window, choose the data source you want to copy from the drop-down menu.
  4. For the data source, you will choose SQL Server Native Client. 
  5. Next, select a SQL Server instance.
  6. Under Authentication, select “Use Windows Authentication”.
  7. Select the database you want to copy under the Database option.
  8. Click on Next after filling out the inputs.
  9. In the Choose a Destination window that appears, go to the Destination drop-down box and select Flat File Destination so data can be copied from the SQL Server to CSV.
  10. Now, create a name for the CSV file where the data will be written in the File name field, then click on Next.
  11. In the window that appears next, you will choose the Copy data from one or more tables or views option and click Next.
  12. The next screen to appear will be the Configure Flat File Destination screen. From here you will choose the table you want to export from the Source table or view options. Once selected, click Next.
  13. On the Save and Run Package window that appears you will click Next since there is nothing to change.
  14. The Complete Wizard that appears next shows a breakdown of all your selections during the entire exporting process. Ensure that your options are correct and accurate before proceeding to click on the Finish button. Once clicked, the Wizard will start exporting the SQL server database to CSV. 

After the exporting process, you are alerted if the export was successful or not on the next window that comes up. If the export is successful, your CSV file will be found in the file destination that was specified for export.

Step 2: Upload the CSV file to Google Cloud Storage (GCS)

Having completed the process of exporting your SQL Server data to the CSV file, the next phase is the transfer or uploading of the CSV file to Google Cloud Storage (GCS) from the local filesystem. There are numerous ways to get this done including through the console, command line, client libraries, Terraform, and REST APIs. In this article, we will look at the first two methods mentioned here, uploading via the console and command line.

Before going ahead to describe the process involved in the two methods, let us first create a GCS Bucket to store the data in. An GCS Bucket is a basic container that will hold all the data in Google Cloud Storage. The bucket will be where the CSV file will be uploaded to. If you already have a bucket created that you want to use you can skip the creation and move to the next step.

Creating a bucket

To create the bucket using the GCP Console, do the following:

  1. In the Google Cloud Console, go to the Cloud Storage Buckets page.
  2. Open the bucket creation form by clicking Create bucket.
  3. Enter the information you wish to give the bucket and click on Continue after completing each step. The information to be input includes:
  1. Enter a Name for the bucket
  2. Choose the Region for Location type and “us-east1” for Location (or the region that is closest to you).
  3. Choose Standard for the default storage class.
  4. Choose Uniform for Access control.
  5. Uncheck Enforce public access prevention on this bucket so you can share your objects.
  6. Choose None for protection tools.
  1. Click on Create.

After you’ve clicked create, the bucket will have been created on Cloud Storage to hold your CSV data. Next, we can proceed on how to upload data to Google Cloud Storage.

Using the GCP Console to upload data to Google Cloud Storage

To use the GCP Console to upload an object to a bucket, you’ll need to complete the following steps.

  1. In the Google Cloud Console, go to the Cloud Storage Buckets page.
  2.  In the list of available buckets, select the bucket you wish to upload the object.
  3. In the Objects tab, you can do either of the following to add the file:
  1. Drag and drop the desired file from your desktop or file manager depending on where your file is stored to the main pane in the Google Cloud console.
  2. Click the Upload files button, and a dialog box appears from which you will select the files you want to upload. Click on Open after the selection.

Using Command Line to upload data to Google Cloud Storage

Alternatively, to use the command line, you can either use gcloud or gsutil to upload an object to a bucket. An explanation of how this is done is given below.

  • For gcloud, we are going to use the gcloud storage cp command below to achieve this.
gcloud storage cp OBJECT_LOCATION gs://DESTINATION_BUCKET_NAME/

In the command above, the OBJECT_LOCATION is the local path to where your object is saved, and the DESTINATION_BUCKET_NAME, the name of the bucket to which you are uploading your object. 

If successful, you will receive a response similar to the one below.

Completed files 1/1 | 164.3kiB/164.3kiB
  • To use gsutil instead, we can use the gsutil cp command below to achieve the same result. An example command can be seen below.
gsutil cp OBJECT_LOCATION gs://DESTINATION_BUCKET_NAME/

In the command above, the OBJECT_LOCATION is the local path to where your object is saved, and the DESTINATION_BUCKET_NAME is the name of the bucket to which you are uploading your object. 

If successful, you will receive a response similar to the one below.

Operation completed over 1 objects/58.8 KiB.

It’s important to note that to complete the uploading of data using either the Google Cloud console or the command-line utility, you must have the proper IAM permissions.

Step 3: Moving the data from Google Cloud Storage (GCS) to BigQuery

The final step in the process of connecting Microsoft SQL Server to BigQuery is moving the data from Google Cloud Storage (GCS) to BigQuery. There are several ways to load your CSV file into BigQuery. This can be done using the console, SQL, bq, API, a custom Java program, and quite a few other ways. In this article we will be focused on using the GCP Console and SQL.

Before proceeding, the following items are things to note to ensure a seamless transfer of your data into BigQuery:

  • When CSV data is loaded from Google Cloud Storage, it is loaded into a new table or partition, or appended to or overwritten an existing table or partition. 
  • The data loaded into BigQuery is converted into a columnar format for Capacitor, which is BigQuery’s storage format.
  • When data is loaded from GCS to a BigQuery table, the dataset that contains the table you want to create, append, or overwrite must be in the same regional or multi-regional location as the Cloud Storage bucket.
  • Grant Identity and Access Management (IAM) roles that permit users to perform tasks. You have to grant IAM permissions to load data into BIgQuery and also grant IAM permissions to load data from Cloud Storage.

To load data into BIgQuery, the IAM permissions needed to run a load job into BigQuery include:

  • bigquery.tables.create
  • bigquery.tables.updateData
  • bigquery.tables.update
  • bigquery.jobs.create

The following predefined IAM roles are needed to load data into a table or partition in BigQuery.

  • roles/bigquery.dataEditor
  • roles/bigquery.dataOwner
  • roles/bigquery.admin 
  • bigquery.user
  • bigquery.jobUser

Permissions to load data into BigQuery from Cloud Storage, the IAM permissions needed include:

  • storage.buckets.get
  • storage.objects.get
  • storage.objects.list

Having taken note of the IAM role permissions required and granting them, let us proceed to move the CSV data from our Google Cloud Storage Bucket to BigQuery.

Loading the CSV file Google Cloud Storage using the Console into BigQuery

To load the CSV data from GCS into a new BigQuery table using the Console, do the following:

  1. Go to the BigQuery page found on the Google Cloud console.
  2. In the Explorer pane, select a dataset by expanding your project.
  3. In the Dataset info section, click Create table.
  4. Specify the following details in the Create table panel:

    a. Select Google Cloud Storage in the Create table from the list in the Source section. Then, do the following:

    - Select a file from the Cloud Storage bucket.
    - For File format, select CSV.  

    b. Specify the following details in the Destination section:

    - Select the dataset in which you want to create the table by using the option Dataset.
    - Enter the name of the table that you want to create by using the Table field option.
    - Ensure that the Table type field is set to Native table.

    c. Enter the schema definition in the Schema section. To enable the auto-detection of a schema, select Auto detect

    d. Specify the optional step Partition and cluster settings if you wish.

    e. Click Advanced options and do the following:

    - For Write preference, select Write if empty as this option creates a new table and loads your data into it.
    - For Number of errors allowed, enter the maximum number of rows containing errors that can be ignored or leave the default value of 0.
    - Unknown values
    are used to ignore values in a row that are not present in the table's schema.
    - Commas
    , Tabs, Pipes, or Custom are characters that separate cells in your CSV file. Select Field delimiter to choose a character. 
    - For Header rows to skip, enter the number of header rows to skip at the top of the CSV file.
    - Select Allow quoted newlines to allow quoted data sections that contain newline characters in a CSV file.
    - To accept rows in CSV files, select Allow jagged rows.
    - A Cloud Key Management Service key for Encryption is to be used,  click the Customer-managed key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest.

    f. Finally, click Create table.

Loading the CSV file Google Cloud Storage using SQL into BigQuery

SQL uses the LOAD DATA DDL Statement to load the CSV data from GCS into a new BigQuery table. To use SQL to load into a new table mytable, do the following:

  1. In the Google Cloud console, go to the BigQuery page.
  2. Enter the statement below in the query editor:
LOAD DATA OVERWRITE mydataset.mytableFROM FILES (  format = 'CSV',  uris = ['gs://bucket/path/file.csv']);
  1. Click Run.

At this point we have succeeded in loading our CSV file into a new table in BigQuery from Google Cloud Storage (GCS). For appending to or overwriting a table and other information or methods of doing this, you can visit the official Google Cloud documentation here to see all of the options.

Limitations and Challenges of Using Custom Code Methods

Although using a custom code approach like the one discussed above is possible, it may not always be optimal. Manual processes can have many issues and limitations that appear quite quickly at scale. The following are the limitations of using the custom code method explained above.

  • Writing custom code is ideal when you need to move data once. Carrying out the manual process becomes cumbersome and bulky when trying to update the data frequently and on a recurring basis.
  • BigQuery does not guarantee data consistency for external data sources as unexpected behavior may be encountered when changes are made to the data while a query is running. 
  • The data set's location value must be in the same region or contained in the same multi-region as the Cloud Storage Bucket.
  • Nested or repeated data can not be used on CSV files as the CSV format does not support this type of data.
  • Compressed and uncompressed files cannot both be included in the same load job when using a CSV.
  • The maximum size for a gzip file for CSV is 4 GB.

Method 2: Microsoft SQL Server to Bigquery using Arcion

In this section, we will explore the second method of how to create a connection to migrate data from SQL Server to BigQuery. This time we will be using the Arcion CLI tool, Replicant.

Arcion, a real-time in-memory Change Data Capture (CDC) solution guarantees users petabyte-level scalability, sub-second latency, and ensures data consistency at all times. It is one of the foremost solutions for data migration and data replication and has integrations with various enterprise databases and data warehouses, including SQL Server and BigQuery. 

To load data into BigQuery, you need to make sure you have the necessary IAM permissions. The correct permissions are required to run a load job and to load data into BigQuery tables and partitions. Refer to the section above (the section around IAM permissions) for the exact permissions needed. Once the correct permissions are granted, we can look at the steps involved in connecting Microsoft SQL Server to BigQuery using Arcion. 

Step 1: Download And Install Arcion Self-hosted

The first step in our setup is to get in touch with our team to access the download link of Arcion Self-hosted. This step involves downloading Arcion’s CLI tool Replicant, creating a home directory where Replicant was downloaded, referred to as $REPLICANT_HOME in the steps below, and finally, adding your credentials. You can reference the quick start guide for more details on how to do each of these requirements.

Having downloaded and installed Arcion Self-hosted, we can move on to further steps where we will configure and enable Arcion to connect with SQL Server and Snowflake. 

Step 2: Set up Connection Configuration for SQL Server

First we will open the sample SQL Server connection configuration file found in the $REPLICANT_HOME directory.

vi conf/conn/sqlserver.yaml

If your connection credentials are stored in a Secrets Manager, you can retrieve them using Replicant; if not, you will have to input the credentials as seen below in the username and password  fields.

type: SQLSERVERextractor: {CHANGE|LOG}host: localhostport: 1433username: 'USERNAME'password: 'PASSWORD'database: 'tpcc'max-connections: MAX_NUMBER_OF_CONNECTIONS#ssl:#  enable: true#  hostname-verification: false

From the configuration above, we have the following values that can be specified:

HOST: This refers to the hostname of the SQL Server instance

PORT: This is the port number of the SQL Server instance

USERNAME: This refers to the username of the DATABASE user

PASSWORD: This is the password associated with the USERNAME

DATABASE: This is where you input the name of the SQL Server database to connect to

MAX_NUMBER_OF_CONNECTIONS: This is used to specify the maximum number of connections Replicant would use to fetch data from the source

For those hosting SQL Server on Azure, the parameter below must be added and set to true in the connection configuration file:

is_azure: true

After you are done adding the necessary values to the configuration file, save and exit the file.

Step 3: Set up Extractor Configuration for SQL Server

Next, we will go to the SQL Server Extractor configuration file in the $REPLICANT_HOME directory.

vi conf/src/sqlserver.yaml

Arcion supports three modes of replications  for SQL Server: Realtime, delta-snapshot, and snapshot. In our example we will set the configuration to support real-time mode and our focus will be on full mode replication. For this type of replication, we will also need to create a heartbeat table. It is also important with full-mode replication that all tables to be replicated have a primary key on them. 

The example below shows how a heartbeat table can be created in the SQL Server instance.

CREATE TABLE "tpcc"."dbo"."replicate_io_cdc_heartbeat"("timestamp" BIGINT NOT NULL, PRIMARY KEY("timestamp"))

After creating the heartbeat table, the next thing to do is to add the realtime configuration to the Extractor configuration file that we opened above. The sample configuration below shows how to set up Replicant to operate in realtime mode:

realtime: 
	threads: 4 
  fetch-size-rows: 10000 
  fetch-duration-per-extractor-slot-s: 3 
 heartbeat:      
 	enable: true      
  catalog: "tpcc"      
  schema: "dbo"      
  interval-ms: 10000

You can visit here for further reading and more information about Realtime mode configuration and parameters on Arcion.

Step 4: Obtain the JDBC driver for Google BigQuery

Replicant relies on the JDBC driver for Google BigQuery. To obtain the required driver, you will need to follow these steps:

Step 5: Set up connection configuration for Google BigQuery

From the $REPLICANT_HOME directory, open the BigQuery Connection configuration file.

vi conf/conn/bigquery.yaml

After locating the BigQuery Connection configuration file, configure the BigQuery server connection by inputting your credentials, like username and password. Alternatively, if your connection credentials are stored in a Secrets Manager, Replicant can obtain them for you. The sample below shows how to input the credentials:

type: BIGQUERY


host: https://www.googleapis.com/bigquery/v2
port: 443
project-id: bigquerytest-268904
auth-type: 0
o-auth-service-acc-email: bigquerytest@bigquerytest-268904.iam.gserviceaccount.com
o-auth-pvt-key-path: <path_to_oauth_private_key>
location: US
timeout: 500


username: "USERNAME"
password: "PASSWORD"


max-connections: 20


max-retries: 10
retry-wait-duration-ms: 1000

From the sample above, the value definitions for each of the fields are as follows:

type: This indicates the connection type representing the database, it is BigQuery in this case. 

host: This shows the hostname of your BigQueryinstance.

port: This is the port number to connect to the host.

username: This refers to the username credential to access the BigQuery system.

password: This is the password you put in reference to the username.

max-connections: This is used to indicate the maximum number of connections Replicant uses to load data into the BigQuery instance.

Arcion supports both HTTP and HTTPS for BigQuery connection. Both can be configured in the bigquery.yaml shown above. 

To use HTTP, follow these steps:

  • Set http as the protocol in the host URL.
  • Set port to 80.

To use HTTPS, follow these steps:

  • Set https as the protocol in the host URL.
  • Set port to 443.

Arcion supports CSV and Parquet as intermediary formats to send data to the BigQuery server. To specify the stage format, use the stage field in the connection configuration file:

stage:
  type: NATIVE
  file-format: {CSV|PARQUET}

Step 6: Set up Applier configuration

Our last step in configuration is to specify the Applier configuration. To set this, from $REPLICANT_HOME directory you’ll open to the sample BigQuery Applier configuration file:

vi conf/dst/bigquery.yaml

Arcion Replicant supports snapshot, realtime, and full replication modes for BigQuery as a target.  Below we will look at how to set up the configuration for snapshot and realtime modes.

Configuration for snapshot replication

Replicant supports the following two methods for snapshot replication:

  • loading data with the load job method, the default method
  • streaming data using BigQuery Storage Write API

With this in mind, let’s take a look at how each of these methods can be configured in Arcion.

For the configuration with the load job method, the configuration set in the conf/dst/bigquery.yaml file can be set as shown in the example below.

snapshot:
  threads: 16


  batch-size-rows: 100_000_000
  txn-size-rows: 1_000_000_000
  
  use-write-storage-api: false


  bulk-load:
    enable: true
    type: FILE
    save-file-on-error: true
    serialize: true


  use-quoted-identifiers: false

For a configuration that uses the Storage Write API, you can follow the example configuration shown below:

snapshot:
  threads: 16


  batch-size-rows: 100_000_000
  txn-size-rows: 1_000_000_000
  
  use-write-storage-api: true


  bulk-load:
    enable: true
    type: FILE
    save-file-on-error: true
    serialize: true
Configure realtime replication

For realtime replication, Replicant supports the same two methods we looked at in the snapshot section above: the load job and streaming data using BigQuery Storage Write API methods.

For using the load job method, the configuration in the conf/dst/bigquery.yaml file can be set as shown below.

realtime:
  threads: 16
    _traceDBTasks: true
    skip-tables-on-failures : false
    replay-strategy: IN_MEMORY_MERGE
    per-table-config:
      - catalog: tpch_scale_0_01
        tables:
          nation:
            replay-strategy: MERGE
            enable-dependency-tracking: false
          region:
            replay-strategy: INSERT_MERGE
            enable-dependency-tracking: true

To use the Storage Write API method, the configuration can be set as shown below.

realtime:
  threads: 16
  _traceDBTasks: true
  skip-tables-on-failures : false


  use-write-storage-api: true


  replay-strategy: IN_MEMORY_MERGE
  per-table-config:
    - catalog: tpch_scale_0_01
      tables:
        nation:
          replay-strategy: MERGE
          enable-dependency-tracking: false
        region:
          replay-strategy: INSERT_MERGE
          enable-dependency-tracking: true

The replay-strategy parameter found in the code above is Arcion’s way to implement CDC changes and apply these changes in real-time to the target. Arcion supports the following replay strategies:

  • NONE
  • INSERT_DELETE
  • INSERT_MERGE
  • IN_MEMORY_MERGE

Conclusion

This article served as a comprehensive step-by-step guide on how to set up Microsoft SQL Server to BigQuery integration. We covered two methods which included a custom or manual setup for importing data via CSV and using Arcion. We covered each step on how to implement a manual CSV import into BigQuery and also discussed some of the limitations of this approach. For Arcion, we showed how to implement real-time CDC to move data from SQL Server to Google BigQuery. Arcion was highlighted as a simple and scalable solution for  SQL Server to BigQuery integration and replication.

As we saw, Arcion is a real-time, in-memory Change Data Capture (CDC) solution that guarantees scalability and ensures data consistency at all times when moving data from SQL Server to BigQuery. To easily implement a data integration strategy, get started with Arcion today by connecting with our team of SQL Server and BigQuery replication experts. CDC pipelines built with Arcion are easy to build, maintain, and scale and are production-ready in a matter of minutes.

Matt Tanner
Developer Relations Lead

Take Arcion for a Spin

Deploy the only cloud-native data replication platform you’ll ever need. Get real-time, high-performance data pipelines today.

Free download

8 sources & 6 targets

Pre-configured enterprise instance

Available in four US AWS regions

Contact us

20+ enterprise source and target connectors

Deploy on-prem or VPC

Satisfy security requirements

Join the waitlist for Arcion Cloud (beta)

Fully managed, in the cloud.