When it comes to data warehousing, Google’s BigQuery is one of the most popular platforms to push data into. Fully managed and scalable, it makes sense why data engineers have gravitated toward the technology. A major source of the data in BigQuery comes from enterprise databases, such as Microsoft SQL Server. This article explores two common ways to move data from Microsoft SQL Server into Google BigQuery. The first method will cover a manual ETL process, and the second method will use Arcion to create a more scalable and automated pipeline in minutes.
Introduction to Microsoft SQL Server
Our first stop is a brief introduction to SQL Server. Microsoft's SQL Server platform is a relational database management system used by some of the largest businesses on the planet. The platform is designed to manage, store, and retrieve information when requested by other software applications. The software applications using the SQL Server backend may be on the same server or across a network in a distributed environment.
SQL Server supports analytics operations, transactional operations, and business intelligence operations. SQL Server, as you may have assumed, uses Structured Query Language (SQL) to manage and query the database. SQL Server has many built-in services such as SQL Server Reporting Services (SSRS), SQL Server Analysis Services (SSAS), SQL Server Integration Services (SSIS), and others.
Introduction to BigQuery
Google BigQuery is a serverless, cloud-based data warehouse that offers big data analytics web services for processing huge amounts of data. It supports querying data with ANSI SQL, allowing analysis at a scale of billions of rows.
The BigQuery platform also utilizes columnar storage, called Capacitor, to ease data querying and aggregation of results. BigQuery has built-in autoscaling, which allows the platform to automatically allocate computing resources as needed. This means that users don't need to spin up new instances or virtual machines to scale. BigQuery can securely handle large volumes of data since it automatically encrypts all data before it is written to disk. When the data is read by an authorized user, it is automatically decrypted. By default, Google manages the key encryption keys used to protect your data, although this can be configured as needed.
How to Migrate Data from Microsoft SQL Server to BigQuery
There are several different ways a connection can be set up to migrate data from Microsoft SQL Server to BigQuery. Some of the most common methods include continuous replication through CDC with a Cloud Data Fusion Replication job, a manual ETL setup, Arcion, and other popular approaches. In this article, we will cover in detail the steps you need to follow to migrate your data from SQL Server to BigQuery using a manual ETL process and Arcion. We will cover both in a step-by-step manner so that users can implement either method confidently. Let's start by looking at the manual ETL process.
Method 1: Using Manual ETL Process to Connect Microsoft SQL Server to BigQuery
In this method, the source data coming from SQL Server requires a staging area. In this staging area, data is uploaded before onward transfer and upload to BigQuery. The staging area can be either external or internal depending on the architecture of your system. For this write-up, we will be using Google Cloud Storage (GCS) as the staging area for the SQL Server data that will later be migrated to BigQuery.
In the manual process of connecting Microsoft SQL Server to BigQuery, we will divide the process into three distinct sections. These sections include:
- Exporting SQL Server data to a CSV file through the SQL Server Management Studio (SSMS) export wizard.
- Uploading the CSV file to the staging area, Google Cloud Storage (GCS).
- Moving the data from Google Cloud Storage (GCS) to BigQuery.
Step 1: Exporting SQL Server data using SQL Server Management Studio
SQL Server Management Studio (SSMS) will be used to access and extract data from the SQL Server database. SSMS is an integrated environment for managing SQL Server infrastructure across SQL Server, Azure SQL Database, Azure SQL Managed Instance, and other Microsoft SQL products. SSMS can be used to access, configure, manage, and administer the databases hosted within SQL Server. We will use SSMS to export the data into a Comma-Separated Values (CSV) file before uploading it to a GCS bucket.
Before we get to exporting the CSV, we need to ensure that SQL Server Management Studio is installed. Once it is installed, we will carry out the following steps:
- Launch SQL Server Management Studio and connect to your Microsoft SQL Server instance.
- From the Object Explorer window, select the database you want to export. Once selected, right-click it and hover over Tasks in the context menu. From the sub-menu, choose Export Data to export the table to CSV.
- After selecting Export Data, you will see the SQL Server Import and Export Wizard window appear. From here you will click on the Next button. In the next window, choose the data source you want to copy from the drop-down menu.
- For the data source, you will choose SQL Server Native Client.
- Next, select a SQL Server instance.
- Under Authentication, select “Use Windows Authentication”.
- Select the database you want to copy under the Database option.
- Click on Next after filling out the inputs.
- In the Choose a Destination window that appears, go to the Destination drop-down box and select Flat File Destination so data can be copied from the SQL Server to CSV.
- Now, create a name for the CSV file where the data will be written in the File name field, then click on Next.
- In the window that appears next, you will choose the Copy data from one or more tables or views option and click Next.
- The next screen to appear will be the Configure Flat File Destination screen. From here you will choose the table you want to export from the Source table or view options. Once selected, click Next.
- On the Save and Run Package window that appears you will click Next since there is nothing to change.
- The Complete the Wizard screen that appears next shows a breakdown of all your selections during the entire exporting process. Ensure that your options are correct and accurate before clicking the Finish button. Once clicked, the wizard will start exporting the SQL Server database to CSV.
After the exporting process completes, the next window alerts you whether the export was successful. If the export succeeded, your CSV file will be found in the destination that was specified during the export.
Step 2: Upload the CSV file to Google Cloud Storage (GCS)
Having completed the process of exporting your SQL Server data to the CSV file, the next phase is the transfer or uploading of the CSV file to Google Cloud Storage (GCS) from the local filesystem. There are numerous ways to get this done including through the console, command line, client libraries, Terraform, and REST APIs. In this article, we will look at the first two methods mentioned here, uploading via the console and command line.
Before going ahead to describe the process involved in the two methods, let us first create a GCS bucket to store the data in. A GCS bucket is a basic container that holds all the data in Google Cloud Storage; it is where the CSV file will be uploaded. If you already have a bucket created that you want to use, you can skip the creation and move to the next step.
Creating a bucket
To create the bucket using the GCP Console, do the following:
- In the Google Cloud Console, go to the Cloud Storage Buckets page.
- Open the bucket creation form by clicking Create bucket.
- Enter the information you wish to give the bucket and click on Continue after completing each step. The information to be input includes:
- Enter a Name for the bucket
- Choose the Region for Location type and “us-east1” for Location (or the region that is closest to you).
- Choose Standard for the default storage class.
- Choose Uniform for Access control.
- Uncheck Enforce public access prevention on this bucket so you can share your objects.
- Choose None for protection tools.
- Click on Create.
After you click Create, the bucket is created in Cloud Storage, ready to hold your CSV data. Next, we can look at how to upload data to Google Cloud Storage.
Using the GCP Console to upload data to Google Cloud Storage
To use the GCP Console to upload an object to a bucket, you’ll need to complete the following steps.
- In the Google Cloud Console, go to the Cloud Storage Buckets page.
- In the list of available buckets, select the bucket to which you wish to upload the object.
- In the Objects tab, you can do either of the following to add the file:
- Drag and drop the desired file from your desktop or file manager into the main pane in the Google Cloud console.
- Click the Upload files button, and a dialog box appears from which you will select the files you want to upload. Click on Open after the selection.
Using Command Line to upload data to Google Cloud Storage
Alternatively, to use the command line, you can use either gcloud or gsutil to upload an object to a bucket. An explanation of how this is done is given below.
- For gcloud, we are going to use the gcloud storage cp command below to achieve this.
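A minimal sketch of the command is shown below; OBJECT_LOCATION and DESTINATION_BUCKET_NAME are placeholders for your own local file path and bucket name.

```bash
# OBJECT_LOCATION is the local path to the exported CSV file (for example ./sqlserver_export.csv)
# DESTINATION_BUCKET_NAME is the name of the GCS bucket created earlier
gcloud storage cp OBJECT_LOCATION gs://DESTINATION_BUCKET_NAME
```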
In the command above, OBJECT_LOCATION is the local path to where your object is saved, and DESTINATION_BUCKET_NAME is the name of the bucket to which you are uploading your object.
If successful, you will receive a response similar to the one below.
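For example, the output typically looks like the following (the object name and sizes will reflect your file):

```bash
Copying file://OBJECT_LOCATION to gs://DESTINATION_BUCKET_NAME/OBJECT_LOCATION
  Completed files 1/1 | 164.3kiB/164.3kiB
```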
- To use gsutil instead, we can achieve the same result with the gsutil cp command. An example command can be seen below.
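A sketch of the equivalent gsutil command, using the same placeholders:

```bash
# OBJECT_LOCATION is the local path to the CSV file; DESTINATION_BUCKET_NAME is your GCS bucket
gsutil cp OBJECT_LOCATION gs://DESTINATION_BUCKET_NAME
```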
In the command above, the OBJECT_LOCATION is the local path to where your object is saved, and the DESTINATION_BUCKET_NAME is the name of the bucket to which you are uploading your object.
If successful, you will receive a response similar to the one below.
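For example (illustrative output only; the content type and size will reflect your file):

```bash
Copying file://OBJECT_LOCATION [Content-Type=text/csv]...
Operation completed over 1 objects/58.8 KiB.
```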
It’s important to note that to complete the uploading of data using either the Google Cloud console or the command-line utility, you must have the proper IAM permissions.
Step 3: Moving the data from Google Cloud Storage (GCS) to BigQuery
The final step in the process of connecting Microsoft SQL Server to BigQuery is moving the data from Google Cloud Storage (GCS) to BigQuery. There are several ways to load your CSV file into BigQuery. This can be done using the console, SQL, bq, API, a custom Java program, and quite a few other ways. In this article we will be focused on using the GCP Console and SQL.
Before proceeding, the following items are things to note to ensure a seamless transfer of your data into BigQuery:
- When CSV data is loaded from Google Cloud Storage, it is loaded into a new table or partition, or it is appended to or overwrites an existing table or partition.
- The data loaded into BigQuery is converted into a columnar format for Capacitor, which is BigQuery’s storage format.
- When data is loaded from GCS to a BigQuery table, the dataset that contains the table you want to create, append, or overwrite must be in the same regional or multi-regional location as the Cloud Storage bucket.
- Grant Identity and Access Management (IAM) roles that permit users to perform tasks. You have to grant IAM permissions to load data into BigQuery and also grant IAM permissions to load data from Cloud Storage.
To load data into BigQuery, the IAM permissions needed to run a load job include:
- bigquery.tables.create
- bigquery.tables.updateData
- bigquery.tables.update
- bigquery.jobs.create
The following predefined IAM roles are needed to load data into a table or partition in BigQuery.
- roles/bigquery.dataEditor
- roles/bigquery.dataOwner
- roles/bigquery.admin
- roles/bigquery.user
- roles/bigquery.jobUser
To load data into BigQuery from Cloud Storage, the IAM permissions needed include:
- storage.buckets.get
- storage.objects.get
- storage.objects.list
Having taken note of the required IAM permissions and granted them, let us proceed to move the CSV data from our Google Cloud Storage bucket to BigQuery.
Loading the CSV file from Google Cloud Storage into BigQuery using the Console
To load the CSV data from GCS into a new BigQuery table using the Console, do the following:
- Go to the BigQuery page found on the Google Cloud console.
- In the Explorer pane, select a dataset by expanding your project.
- In the Dataset info section, click Create table.
- Specify the following details in the Create table panel:
a. In the Source section, select Google Cloud Storage from the Create table from list. Then, do the following:
- Select a file from the Cloud Storage bucket.
- For File format, select CSV.
b. Specify the following details in the Destination section:
- For Dataset, select the dataset in which you want to create the table.
- In the Table field, enter the name of the table that you want to create.
- Ensure that the Table type field is set to Native table.
c. Enter the schema definition in the Schema section. To enable the auto-detection of a schema, select Auto detect.
d. Specify the optional step Partition and cluster settings if you wish.
e. Click Advanced options and do the following:
- For Write preference, select Write if empty as this option creates a new table and loads your data into it.
- For Number of errors allowed, enter the maximum number of rows containing errors that can be ignored or leave the default value of 0.
- For Unknown values, choose whether to ignore values in a row that are not present in the table's schema.
- For Field delimiter, choose the character that separates cells in your CSV file: Comma, Tab, Pipe, or Custom.
- For Header rows to skip, enter the number of header rows to skip at the top of the CSV file.
- Select Allow quoted newlines to allow quoted data sections that contain newline characters in a CSV file.
- To accept rows in CSV files that are missing trailing optional columns, select Allow jagged rows.
- For Encryption, click Customer-managed key to use a key from Cloud Key Management Service. If you leave the Google-managed key setting, BigQuery encrypts the data at rest with a Google-managed key.
f. Finally, click Create table.
Loading the CSV file from Google Cloud Storage into BigQuery using SQL
In SQL, the LOAD DATA DDL statement loads the CSV data from GCS into a new BigQuery table. To use SQL to load into a new table named mytable, do the following:
- In the Google Cloud console, go to the BigQuery page.
- Enter a LOAD DATA statement in the query editor (a sample statement is shown after these steps).
- Click Run.
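The statement below is a minimal sketch of such a LOAD DATA statement. It assumes a dataset named mydataset, a target table named mytable, and placeholder bucket and file names; BigQuery can auto-detect the schema for CSV, or you can list the columns explicitly before FROM FILES.

```sql
-- Load a CSV file from Cloud Storage into a new BigQuery table
LOAD DATA INTO mydataset.mytable
FROM FILES (
  format = 'CSV',
  skip_leading_rows = 1,  -- skip the header row written by the SSMS export
  uris = ['gs://DESTINATION_BUCKET_NAME/FILE_NAME.csv']
);
```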
At this point we have succeeded in loading our CSV file into a new table in BigQuery from Google Cloud Storage (GCS). For appending to or overwriting a table and other information or methods of doing this, you can visit the official Google Cloud documentation here to see all of the options.
Limitations and Challenges of Using Custom Code Methods
Although using a custom code approach like the one discussed above is possible, it may not always be optimal. Manual processes can have many issues and limitations that appear quite quickly at scale. The following are the limitations of using the custom code method explained above.
- Writing custom code is ideal when you need to move data once. Carrying out the manual process becomes cumbersome and bulky when trying to update the data frequently and on a recurring basis.
- BigQuery does not guarantee data consistency for external data sources, as unexpected behavior may be encountered when changes are made to the data while a query is running.
- The dataset's location value must be in the same region or contained in the same multi-region as the Cloud Storage bucket.
- Nested or repeated data cannot be used in CSV files, as the CSV format does not support this type of data.
- Compressed and uncompressed files cannot both be included in the same load job when using a CSV.
- The maximum size for a gzip file for CSV is 4 GB.
Method 2: Microsoft SQL Server to BigQuery using Arcion
In this section, we will explore the second method of how to create a connection to migrate data from SQL Server to BigQuery. This time we will be using the Arcion CLI tool, Replicant.
Arcion, a real-time, in-memory Change Data Capture (CDC) solution, offers petabyte-level scalability, sub-second latency, and data consistency at all times. It is one of the foremost solutions for data migration and data replication, with integrations for various enterprise databases and data warehouses, including SQL Server and BigQuery.
To load data into BigQuery, you need to make sure you have the necessary IAM permissions. The correct permissions are required to run a load job and to load data into BigQuery tables and partitions. Refer to the IAM permissions section above for the exact permissions needed. Once the correct permissions are granted, we can look at the steps involved in connecting Microsoft SQL Server to BigQuery using Arcion.
Step 1: Download And Install Arcion Self-hosted
The first step in our setup is to get in touch with our team to access the download link of Arcion Self-hosted. This step involves downloading Arcion’s CLI tool Replicant, creating a home directory where Replicant was downloaded, referred to as $REPLICANT_HOME in the steps below, and finally, adding your credentials. You can reference the quick start guide for more details on how to do each of these requirements.
Having downloaded and installed Arcion Self-hosted, we can move on to further steps where we will configure and enable Arcion to connect with SQL Server and BigQuery.
Step 2: Set up Connection Configuration for SQL Server
First we will open the sample SQL Server connection configuration file found in the $REPLICANT_HOME directory.
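On a typical install, this file is located at conf/conn/sqlserver.yaml relative to $REPLICANT_HOME (confirm the path for your Replicant version):

```bash
vi conf/conn/sqlserver.yaml
```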
If your connection credentials are stored in a Secrets Manager, you can retrieve them using Replicant; if not, you will have to input the credentials as seen below in the username and password fields.
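The sketch below follows the structure of Arcion's sample SQL Server connection file; the uppercase values are placeholders described underneath, and the real sample file may contain additional options depending on the replication mode you choose.

```yaml
type: SQLSERVER

host: HOST
port: PORT

username: 'USERNAME'
password: 'PASSWORD'

database: 'DATABASE'

max-connections: MAX_NUMBER_OF_CONNECTIONS
```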
From the configuration above, we have the following values that can be specified:
HOST: This refers to the hostname of the SQL Server instance
PORT: This is the port number of the SQL Server instance
USERNAME: This refers to the username of the DATABASE user
PASSWORD: This is the password associated with the USERNAME
DATABASE: This is where you input the name of the SQL Server database to connect to
MAX_NUMBER_OF_CONNECTIONS: This is used to specify the maximum number of connections Replicant would use to fetch data from the source
For those hosting SQL Server on Azure, the parameter below must be added and set to true in the connection configuration file:
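Arcion's documentation refers to this setting as is_azure; treat the exact name as an assumption and confirm it against the documentation for your Replicant version:

```yaml
# Add only when the source SQL Server instance is hosted on Azure
is_azure: true
```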
After you are done adding the necessary values to the configuration file, save and exit the file.
Step 3: Set up Extractor Configuration for SQL Server
Next, we will go to the SQL Server Extractor configuration file in the $REPLICANT_HOME directory.
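The extractor (source) configuration file for SQL Server is typically located at conf/src/sqlserver.yaml relative to $REPLICANT_HOME:

```bash
vi conf/src/sqlserver.yaml
```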
Arcion supports three replication modes for SQL Server: realtime, delta-snapshot, and snapshot. In our example, we will set the configuration to support realtime mode, and our focus will be on full-mode replication. For this type of replication, we will also need to create a heartbeat table. It is also important with full-mode replication that all tables to be replicated have a primary key on them.
The example below shows how a heartbeat table can be created in the SQL Server instance.
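A sketch of the heartbeat table DDL, with placeholder database and schema names (replace database_name and schema_name with the database and schema being replicated):

```sql
-- Heartbeat table used by Replicant for realtime (CDC) replication
CREATE TABLE "database_name"."schema_name"."replicate_io_cdc_heartbeat"(
  "timestamp" BIGINT NOT NULL,
  PRIMARY KEY("timestamp")
);
```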
After creating the heartbeat table, the next thing to do is to add the realtime configuration to the Extractor configuration file that we opened above. The sample configuration below shows how to set up Replicant to operate in realtime mode:
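A minimal sketch of the realtime section, assuming the heartbeat table created above lives in database_name.schema_name; the thread count, fetch size, and exact field names should be checked against your Replicant version:

```yaml
realtime:
  threads: 4
  fetch-size-rows: 10000

  heartbeat:
    enable: true
    catalog: "database_name"
    schema: "schema_name"
    interval-ms: 10000
```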
You can visit here for further reading and more information about Realtime mode configuration and parameters on Arcion.
Step 4: Obtain the JDBC driver for Google BigQuery
Replicant relies on the JDBC driver for Google BigQuery. To obtain the required driver, you will need to follow these steps:
- Go to the JDBC drivers for BigQuery page.
- From there, download the latest JDBC 4.2-compatible driver ZIP.
- In the downloaded ZIP, locate and extract the GoogleBigQueryJDBC42.jar file.
- Lastly, put the GoogleBigQueryJDBC42.jar file inside the $REPLICANT_HOME/lib directory.
Step 5: Set up connection configuration for Google BigQuery
From the $REPLICANT_HOME directory, open the BigQuery Connection configuration file.
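The BigQuery connection configuration file is typically located at conf/conn/bigquery.yaml relative to $REPLICANT_HOME:

```bash
vi conf/conn/bigquery.yaml
```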
After locating the BigQuery Connection configuration file, configure the BigQuery server connection by inputting your credentials, like username and password. Alternatively, if your connection credentials are stored in a Secrets Manager, Replicant can obtain them for you. The sample below shows how to input the credentials:
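A sketch limited to the fields described below; the host shown is BigQuery's standard HTTPS endpoint, and the credential values and connection count are placeholders:

```yaml
type: BIGQUERY

host: https://www.googleapis.com/bigquery/v2
port: 443

username: "USERNAME"
password: "PASSWORD"

max-connections: 20
```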
From the sample above, the value definitions for each of the fields are as follows:
type: This indicates the connection type representing the database; in this case, BigQuery.
host: This is the hostname of your BigQuery instance.
port: This is the port number to connect to the host.
username: This refers to the username credential to access the BigQuery system.
password: This is the password associated with the username.
max-connections: This is used to indicate the maximum number of connections Replicant uses to load data into the BigQuery instance.
Arcion supports both HTTP and HTTPS for the BigQuery connection. Both can be configured in the bigquery.yaml file shown above.
To use HTTP, follow these steps:
- Set http as the protocol in the host URL.
- Set port to 80.
To use HTTPS, follow these steps:
- Set https as the protocol in the host URL.
- Set port to 443.
Arcion supports CSV and Parquet as intermediary formats to send data to the BigQuery server. To specify the stage format, use the stage field in the connection configuration file:
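For example, to stage data as CSV (PARQUET being the other supported value), a stage block might look like the sketch below; the NATIVE stage type is an assumption based on Arcion's sample configurations, so verify it for your setup:

```yaml
stage:
  type: NATIVE
  file-format: CSV
```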
Step 6: Set up Applier configuration
Our last step in configuration is to specify the Applier configuration. To set this, from the $REPLICANT_HOME directory you'll open the sample BigQuery Applier configuration file:
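This is the conf/dst/bigquery.yaml file referenced in the rest of this section:

```bash
vi conf/dst/bigquery.yaml
```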
Arcion Replicant supports snapshot, realtime, and full replication modes for BigQuery as a target. Below we will look at how to set up the configuration for snapshot and realtime modes.
Configuration for snapshot replication
Replicant supports the following two methods for snapshot replication:
- loading data with the load job method, the default method
- streaming data using BigQuery Storage Write API
With this in mind, let’s take a look at how each of these methods can be configured in Arcion.
For the load job method, the conf/dst/bigquery.yaml file can be configured as shown in the example below.
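A sketch of a snapshot section for the load job method; the thread, batch, and bulk-load values are illustrative, and the use-write-storage-api flag is the parameter Arcion's documentation uses to switch between the two methods, so confirm the names against your Replicant version:

```yaml
snapshot:
  threads: 16

  batch-size-rows: 100_000_000
  txn-size-rows: 1_000_000

  # false (the default) selects the load job method
  use-write-storage-api: false

  bulk-load:
    enable: true
    type: FILE
    serialize: true
```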
For a configuration that uses the Storage Write API, you can follow the example configuration shown below:
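The Storage Write API variant differs mainly in the flag that enables it (again, confirm parameter names for your version):

```yaml
snapshot:
  threads: 16

  batch-size-rows: 100_000_000
  txn-size-rows: 1_000_000

  # true switches snapshot loading to the BigQuery Storage Write API
  use-write-storage-api: true
```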
Configure realtime replication
For realtime replication, Replicant supports the same two methods we looked at in the snapshot section above: the load job and streaming data using BigQuery Storage Write API methods.
To use the load job method, the configuration in the conf/dst/bigquery.yaml file can be set as shown below.
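A sketch of a realtime section using the load job method; the values are illustrative and the field names should be verified against Arcion's documentation:

```yaml
realtime:
  threads: 16

  txn-size-rows: 10_000
  batch-size-rows: 10_000

  # false keeps the load job method for applying realtime changes
  use-write-storage-api: false

  replay-strategy: IN_MEMORY_MERGE
```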
To use the Storage Write API method, the configuration can be set as shown below.
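And a sketch of the Storage Write API variant, which also carries the replay-strategy parameter discussed next (values illustrative, names to be confirmed for your version):

```yaml
realtime:
  threads: 16

  txn-size-rows: 10_000
  batch-size-rows: 10_000

  # true applies realtime changes through the BigQuery Storage Write API
  use-write-storage-api: true

  replay-strategy: IN_MEMORY_MERGE
```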
The replay-strategy parameter found in the configurations above is Arcion's way of implementing CDC changes and applying them to the target in real time. Arcion supports the following replay strategies:
- NONE
- INSERT_DELETE
- INSERT_MERGE
- IN_MEMORY_MERGE
Conclusion
This article served as a comprehensive step-by-step guide on how to set up Microsoft SQL Server to BigQuery integration. We covered two methods which included a custom or manual setup for importing data via CSV and using Arcion. We covered each step on how to implement a manual CSV import into BigQuery and also discussed some of the limitations of this approach. For Arcion, we showed how to implement real-time CDC to move data from SQL Server to Google BigQuery. Arcion was highlighted as a simple and scalable solution for SQL Server to BigQuery integration and replication.
As we saw, Arcion is a real-time, in-memory Change Data Capture (CDC) solution that guarantees scalability and ensures data consistency at all times when moving data from SQL Server to BigQuery. To easily implement a data integration strategy, get started with Arcion today by connecting with our team of SQL Server and BigQuery replication experts. CDC pipelines built with Arcion are easy to build, maintain, and scale and are production-ready in a matter of minutes.