Migrating and replicating data from a source database into analytics and big data platforms is a very common use case. Over the years, there have been many ways to accomplish this. One of the main methods for achieving this was to extract the data from the source databases in a nightly batch and load it into the big data platform. This worked until the modern demand for data began requiring data to be delivered to the big data platform in real-time. With that, platforms like Snowflake began to take off in popularity and deliver capabilities that enterprises only dreamed of.
This need for real-time data delivery meant that new solutions needed to be imagined. One such solution was Change Data Capture, also known as CDC. CDC allowed for changes in the data on a source database to be updated in the data warehouse instantly, as the change occurred. Through CDC pipelines, businesses and the industry have been revolutionized. From retail, to finance and government, not a single industry has lacked the desire to implement a real-time data strategy.
A common technology used in source databases is Microsoft SQL server. With its scalability, performance, and the luxury of being under the Microsoft product umbrella, SQL Server has become an extremely common data storage solution for enterprises large and small. In this article, we will take a look at 2 easy methods for moving data from SQL Server to Snowflake. Let’s get started by first diving into the world of SQL Server.
What is Microsoft SQL Server?
Microsoft SQL Server is an enterprise Relational Database Management System (RDBMS) designed to store and retrieve large amounts of relational data. It is widely used by corporations and organizations in mission-critical applications because of its rich history of security, data integrity, and reliability.
Microsoft SQL Server was first released in 1989 and has undergone several iterations that have increased its features and availability across platforms. It was initially only available on Windows but is now available for use on Linux machines as well. Microsoft SQL Server has full support for SQL and, in addition to that, comes with its proprietary extensions to SQL called Transact SQL (T-SQL). Transact SQL facilitates the use of variables, stored procedures, and other features that augment regular SQL.
Microsoft SQL Server is available in different editions. The exact offering needed for the project or organization depends on the use case of the customer and their data requirements. The enterprise offering is great for large organizations with huge workloads that want access to all the features of the database and guaranteed scalability is the Microsoft SQL Server Enterprise. It comes with support for advanced analytics, and machine learning. It also provides fine-grained security and access controls. Microsoft SQL Server Standard is the mid-tier offering and comes with all the standard features of the database. It is suitable for applications that do not need access to some of the advanced features of SQL Server included in the enterprise versions. It also comes with basic reporting and analytics features. Other editions available are the SQL Server Express which is a free version best suited for small-scale applications. The SQL Server Developer edition is similar in functionality to Microsoft SQL Server Enterprise but is for use in development environments to test, model, and demo applications. It should never be used in production.
The Microsoft SQL Server platform is made up of a few vital components. These components include:
- the database engine which is responsible for storing, processing, and securing data
- the SQL Server agent which functions as a task scheduler that can react to events or be triggered manually
- the SQL Server executable that can start, stop or pause an instance of an SQL Server
Microsoft SQL Server can have several instances running on the same machine and is a good choice for deploying standby servers that can prevent service outages. High availability is native to SQL Server and is easy to configure and scale.
Key Features of SQL server
As with any platform that grows and maintains popularity, there are reasons why enterprises continue to choose to use it. Some of the key features of Microsoft SQL Server that make it so popular are listed below.
One of the most important value propositions of Microsoft SQL Server is its high-security standard. Microsoft SQL Server is designed to secure data both at rest and in transit. It is the database platform with the fewest security vulnerabilities according to the National Institute of Standards and Technologies National Vulnerabilities Database. It offers extremely secure data storage since data is constantly encrypted. A big benefit to SQL Server is that you can perform complex computations on the encrypted data based on the roles and access levels assigned to a user.
Microsoft SQL Server can be installed on both Windows and Linux operating systems and supports container orchestration technologies like Kubernetes. This makes the provisioning and deployment of SQL Server instances reliable and compatible with modern DevOps practices.
Microsoft SQL Server boasts high availability of data and accelerated database recovery in cases of a data disaster. It offers great uptime metrics and fast failover when using Always On availability groups to maximize the availability of a set of user databases for an enterprise. The data stored in SQL Server is consistent and remains so after recovery, even if there were active transactions at the time a failure occurred.
Microsoft SQL Server enables you to gain insights from your data through its integration with Power BI Report Server. It allows you to quickly explore and visualize your data to gain a better understanding of your operations. Natively, SQL Server also offers various tools which can help with business intelligence efforts. These tools include SQL Server Reporting Services, SSAS (SQL Server Analysis Services), and SSIS (SQL Server Integration Services).
What is Snowflake?
Snowflake is a modern cloud data platform that is designed from the ground up to handle the heavy data workloads of modern enterprises. To enable Snowflake to power its highly-performant platform, it is not dependent on any existing database technology. Snowflake achieves its scalability and connectivity through a combination of a unique cloud-native architecture and a newly built SQL query engine. Snowflake is a fully managed cloud service which means there is no software to install, configure or manage to leverage the platform. The managed nature of the platform means that any updates and feature improvements are immediately available to users.
Snowflake utilizes Massively Parallel Processing (MPP) compute clusters for its querying. This enables the platform to achieve lightning-fast query executions. In Snowflake, there is also no concept of a “main” database. This allows Snowflake to mitigate scenarios where the system lags because of a high number of concurrent users. Snowflake, instead, persists data in a central repository and this data is accessible from all compute nodes on the platform. Snowflake separates compute from storage, enabling it to scale out both compute and storage capabilities when the need arises.
Snowflake’s architecture is made up of three layers or components. These components include database storage, query processing, and cloud services. The database storage layer is responsible for how the data is compressed and stored on the platform. The database storage layer is not visible or accessible to users directly but users use SQL queries to interact with the data stored there. The query processing layer is the execution domain in charge of utilizing virtual data warehouses that rely on MPP to process queries. The cloud services layer is the coordination component which is a collection of services that carry out activities like user authentication, infrastructure management, metadata management, query optimization, and access controls. Because of its architecture, Snowflake makes collaboration between distributed teams conflict-free as data can be accessed concurrently without an impact on performance.
Key Features of Snowflake
Although many big data and analytics platforms exist, Snowflake has quickly become a dominant player in the market. The key features of Snowflake that have led to its popularity are listed below.
Snowflake has a modern approach to managing and accessing data in the cloud. It serves as a single repository of data and acts as a data warehouse for distributed teams. Team members, with valid authorizations, can access the data from across the enterprise instead of data silos with exclusive access only to certain teams. Snowflake is also cloud provider agnostic and supports all the major cloud service providers like Amazon Web Services, Azure, and Google Cloud Platform.
Snowflake has a wide variety of security features that can be used to restrict who can access what data. You can restrict access to data based on IP addresses, employ corporate Single Sign On (SSO) through federated authentication, or enable two-factor authentication. Snowflake automatically encrypts data whether it is in transit or at rest. It also offers data governance policies that you can use according to the needs of your organization to make sure data access is tightly controlled.
Snowflake fully supports standard SQL alongside other extensions. It supports Data Manipulation Language (DML) operations for bulk loading and unloading of data. It also supports analytical aggregates, stored procedures, transient tables, and scalar and tabular user-defined functions (UDFs).
Snowflake has client connectors and drivers for a large number of programming languages and protocols. Support connectivity includes a Spark connector, Python connector, ODBC client driver, .NET driver, JDBC client driver, and many more. It also has a vibrant ecosystem of third-party applications and partners that connect to various data sources.
How to Load Data from Microsoft SQL Server to Snowflake
In this section, you will be shown how to load data from Microsoft SQL Server to Snowflake using two methods. The methods shown can be used for a wide array of use cases such as moving data from an on-premises Microsoft SQL Server instance to Snowflake as part of your data migration strategy. You may also use these methods as a means to modernize your data infrastructure. Whatever the reason, this section will show you how to achieve that first with Arcion, a cloud-native Change Data Capture platform, and then with SnowSQL, a command line client available through Snowflake.
Method 1: Using Arcion to Migrate Microsoft SQL Server to Snowflake
Arcion is a cloud-native Change Data Capture solution that guarantees petabyte-level scalability and data consistency. It also has integrations to various enterprise databases and data warehouses. Arcion is a highly performant solution for data migration and data replication. In this section, We will use Arcion to migrate data from SQL Server to Snowflake using the Arcion CLI tool, Replicant.
Step 1: Download And Install Arcion Self-hosted
First, we will need to download and install Arcion Self-hosted. This will require a few steps, including downloading Replicant, creating a home directory for it, and adding your license. Each step can be seen in detail by referencing our quickstart guide in our docs.
Next, we need to configure and enable Arcion to connect with SQL Server and Snowflake. We will refer to the directory where Replicant was downloaded to as $REPLICANT_HOME in the following steps. Now, let’s go ahead and configure all of the connections with Arcion.
Step 2: Set up Connection Configuration for SQL Server
From $REPLICANT_HOME, navigate to the sample connection configuration file:
If you store your connection credentials in AWS Secrets Manager, you can tell Replicant to retrieve them. For more information, see Retrieve credentials from AWS Secrets Manager.
Otherwise, you can put your credentials like usernames and passwords in plain form like the sample below:
Replace the following placeholders in the example above with:
HOST: hostname of the SQL Server instance
PORT: port number of the SQL Server instance
USERNAME: the username of the DATABASE user
PASSWORD: the password associated with USERNAME
DATABASE: the name of the SQL Server database to connect to
MAX_NUMBER_OF_CONNECTIONS: the maximum number of connections Replicant would use to fetch data from the source—for example, 30
Note: If you’re hosting SQL Server on Azure, you must also set the following parameter to true in the connection configuration file:
For your Source SQL Server, you can choose from two CDC Extractors. You can specify the Extractor to use by setting the extractor parameter in the connection configuration file to any of the following values:
The default value. With this value set, SQL Server Change Tracking is used for real-time replication. In this case, you don’t need to follow the documentation for Replicant SQL Server Agent.
Uses the Replicant SQL Server Agent as CDC Extractor. In this case, please follow the Replicant SQL Server Agent docs.
After this is complete, save the file.
Step 3: Set up Extractor Configuration for SQL Server
From $REPLICANT_HOME, navigate to the Extractor configuration file:
For full mode replication, you need to create a heartbeat table. Below is an example command of how this table can be created.
It’s also important for full mode replication that all tables that are going to be replicated have a primary key on them.
Once the heartbeat table is created, we will add our realtime configuration to the Extractor configuration file. Below is a sample configuration for setting up Replicant to operate in realtime mode:
For more information about the configuration parameters for realtime mode, see Realtime Mode.
(Optional) Step 3.1: Enable Change Tracking for realtime mode
To use SQL Server Change Tracking for realtime mode, all databases and tables must have change tracking enabled. This step will need to be done if, in your Connection configuration file, you’ve set your extractor setting to
To enable change tracking in the SQL Server instance, execute the following SQL command:
replace database_name with the name of the database you want Change Tracking enabled on.
To enable Change Tracking on a specific table only, execute the following SQL command for each table being replicated:
Replace table_name with the name of the table you want Change Tracking enabled on.
Step 4: Set up connection configuration for Snowflake
From $REPLICANT_HOME, navigate to the sample Snowflake Connection configuration file:
The configuration file has two parts:
- Parameters related to the target Snowflake connection.
- Parameters related to stage configuration.
For connecting to the target Snowflake instance, you can choose between two methods for an authenticated connection:
- RSA key pair authentication
- Basic username and password authentication
In this example, we will simply connect to the server using the basic username and password authentication. For simplicity, we will specify the credentials in a plain form in the connection configuration file as shown below:
Replace the following placeholders in the example above with your values. The value definitions can be seen below for each field:
type: The connection type representing the database. In this case, it’s Snowflake.
host: The hostname of your Snowflake instance.
port: The port number to connect to the host.
warehouse: The name of the Snowflake warehouse.
username: The username credential to access the Snowflake system.
password: The password associated with the username.
max-connections: The maximum number of connections Replicant uses to load data into the Snowflake instance.
Once these details are added, save the Snowflake configuration file.
Step 5: Set up Applier configuration
From $REPLICANT_HOME, navigate to the sample Snowflake Applier configuration file:
The configuration file has two parts:
- Parameters related to snapshot mode.
- Parameters related to realtime mode.
Step 5.1: Parameters related to realtime mode
To operate in realtime mode, we will use the realtime section in the snowflake.yaml config file opened at the start of Step 5. Below is an example of how to set up the config file for realtime mode.
For more information about the configuration parameters for realtime mode, see the Realtime Mode docs.
Now that your Snowflake connector is configured, we can run Replicant, Arcion’s program for running the CDC pipeline.
Step 6: Running Replicant
From $REPLICANT_HOME, run the following command to execute Replicant in realtime mode:
Once the command is executed, Replicant will start up the pipeline. In real-time mode, Replicant first creates the destination schemas if they are not already present. If the destination schemas are present, Replicant appends to the existing tables.
In real-time mode, Replicant starts replicating real-time operations obtained from log-based CDC. By default, real-time mode starts replicating from the latest log position, but a custom start position can be specified by the user in the real-time section of the extractor configuration file.
With this, you have set up a real-time CDC pipeline with Arcion. As changes occur in the source SQL Server database, they will be replicated in your Snowflake instance. In a matter of minutes, with no code, we have created a robust pipeline that will enable us to further many use cases, including running real-time analytics through the Snowflake platform.
Method 2: Using SnowSQL to Migrate Microsoft SQL Server to Snowflake
In this method, we will migrate data from Microsoft SQL Server to Snowflake using Snowflake’s command line client, SnowSQL. The entire migration process involves exporting your MS SQL Server database or tables through a tool like SQL Server Management Studio as a CSV file. Next, we will have to upload the CSV file to an object storage solution in the cloud, such as Amazon Web Services S3, Google Cloud Storage, or Azure Blob Storage. For this demonstration using SnowSQL, it is assumed that you have uploaded the exported CSV to an Amazon Web Services S3 bucket. The final step involves setting up a Snowflake storage integration to access AWS S3. This integration will create an external stage in Snowflake that references the external storage location (AWS S3), and migrate (copying) the data from AWS S3 into Snowflake through the COPY INTO command. This will load the SQL Server data into your Snowflake table.
To set up a Snowflake storage integration to access AWS S3, we must configure access permissions for the S3 bucket. The permissions must also include allowing Snowflake to perform SQL actions. Below are the required permissions that the S3 bucket must have.
Below is an example of a sample policy that grants the required access to objects in an S3 bucket.
Next, create an IAM policy in AWS that allows you to load and unload data from an S3 bucket. Take note of the IAM Role ARN since it will be used later to migrate the data to Snowflake. Once the IAM policy is created, use the command below to create a cloud storage integration in Snowflake.
Next, retrieve the AWS IAM user for your Snowflake account via the command below.
You should take note of STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID. These are the AWS IAM user created for the Snowflake account and the external ID that is needed to establish a trust relationship respectively.
Back in AWS, the next step is to grant the IAM User permission to access the bucket objects. Below is a sample policy document for the IAM role in which we will replace the values of the parameters for AWS and sts:ExternalId with those retrieved above.
On Snowflake, we can now create an external stage that references the integration by issuing the commands below.
At this point, you can use the COPY INTO SQL command from SnowSQL to load the data from the staged file into the target table on Snowflake. The command below is an example of how to do this.
Now, your SQL Server data should be successfully imported into your Snowflake instance. Of course, compared to Arcion, this is a one-time migration that does not do ongoing replication. You could either perform this manually as needed or set up a batch process that could automate this. Regardless, achieving real-time replication using this strategy is limited. Let’s look at these limitations in detail below.
Limitations and Challenges of Using Custom Code
Using method 2 as outlined in the previous section comes with some limitations. Apart from the fact that it is a more involved process and can be error-prone, you cannot use it to migrate files that are larger than 160 GB in size without relying on the Amazon S3 REST API. The custom code method also does not support real-time streaming of data from Microsoft SQL Server to Snowflake. Any data migrated in this manner must be in snapshots and may already be stale before the transfer. Another limitation of this approach is that you have to be familiar with the technologies being used. This includes being aware of your cloud storage provider and the Snowflake APIs to be able to successfully perform the migration. As part of this, you must be on the lookout for any changes in processes or updates on both platforms that could break existing workflows. With these limitations in mind, it can be concluded that generally these approaches are quite brittle and their reliability and scalability are questionable.
Cloud-native data architectures are here to stay and nimble organizations are beginning to migrate their workloads from on-premises data solutions to cloud solutions. Snowflake is a popular cloud data platform with a solid cloud-native architecture and key features which enable the latest use cases. We looked at how to migrate data from Microsoft SQL Server to Snowflake using the SnowSQL tool and Arcion. As discussed in the section on using a custom code approach, using SnowSQL has limitations and challenges that can be addressed by a platform like Arcion. Arcion enables data migration from Microsoft SQL Server to Snowflake through its no-code, real-time platform.
Interested to learn more about how to easily migrate and replicate data from SQL Server to Snowflake? Schedule a personalized demo with our team of database migration and replication experts today.