Data warehouses, which act as central repositories for data from disparate sources within an organization, have grown in importance. This is no surprise, since data analysis, reporting, and other business intelligence techniques have become the foundation for informed business decision-making.
Traditional databases, though great at storing data, do not lend themselves easily to analytics and machine learning use cases. Since data is also usually spread across multiple databases in different locations, it cannot be analyzed holistically and effectively where it sits. The general solution is to consolidate the data in a central repository, such as a data warehouse. Data warehouse solutions are therefore in high demand, as businesses require data analysis to gain the insights and operational know-how needed to scale and optimize their activities.
Many data warehouse vendors exist, but Snowflake is one of the leading platforms on the market. Snowflake's popularity is driven mainly by its cloud-native approach, which makes it a favorite of organizations and software teams that take a cloud-first approach to data storage and utilization.
In this article, we will walk you through what Snowflake is and the key features that make it stand out from other data warehouse solutions. You will also be introduced to the concept of Change Data Capture (CDC) and what CDC means in Snowflake. An overview will also be presented on how Change Data Capture works in Snowflake, alongside its implementation using streams. Finally, we will do a complete walkthrough of how to set up CDC using Arcion and Snowflake.
What is Snowflake?
Snowflake is an advanced data platform that provides a Software-as-a-Service (SaaS) product for data warehousing, data processing, and analytics. It is a cloud-native product that was built from the ground up to harness the compute power and storage available in the cloud. Snowflake is only available for deployment on public cloud infrastructure and is not available as an on-premises or private cloud hosted solution.
Snowflake is not built on any particular database or big data platform technology. It is designed around a completely new SQL engine and an architecture that is optimized for the cloud. Organizations that do not want to maintain an in-house team to install, maintain, and manage data infrastructure find Snowflake's cloud-based approach particularly useful. Since Snowflake is a fully-managed service, ongoing maintenance, management, and upgrades are handled by Snowflake.
The architecture of Snowflake is made up of three layers. These three layers include:
- Database storage
- Query processing
- Cloud services
Snowflake separates its storage and compute capabilities for better performance. It operates on a hybrid architecture that combines the benefits of Shared Disk and Shared Nothing architectures. Snowflake uses the traditional Shared Disk approach for persistent storage and a Shared Nothing approach for querying using Massively Parallel Processing (MPP) compute clusters.
Snowflake is well-known for its flexibility compared to other platforms since Snowflake can scale or reduce resources based on demand. Snowflake’s scaling capabilities can be leveraged at both the storage and compute level. These scaling capabilities mean that customers can easily control cost since they will only pay for what they utilize in terms of compute and storage. Snowflake can be deployed on any of the big three cloud providers - Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform.
Key Features of Snowflake
Some of the most notable features of Snowflake are listed and explained below.
Standard and Extended SQL Support: Despite its unique architecture and cloud-native design, Snowflake supports most SQL Data Definition Language (DDL) and Data Manipulation Language (DML) operations. This means it supports standard statements like INSERT, UPDATE, and DELETE, as well as statistical aggregate functions, transactions, stored procedures, and DML for loading and unloading data. The skills of existing teams using SQL databases can be ported to Snowflake, thereby reducing the barrier to entry.
Security, Governance, and Data Protection: Snowflake has various security and governance policies to protect and store data. Users can choose the geographical location where data is stored, allowing for easier compliance with standards such as GDPR. Snowflake also provides support for several authentication mechanisms, including:
- Multi-factor authentication (MFA)
- Federated authentication/single sign-on (SSO)
- and more
In Snowflake, all communication between the client and the server is protected through Transport Layer Security (TLS). Fine-grained control of data is also provided in Snowflake through object-level access control to make sure that users only have access to the data they need, nothing more.
Ease of Connectivity/Availability of Tools: Snowflake has a web-based Graphical User Interface (GUI) for the management of accounts, monitoring of resources, and querying of data. It also provides a CLI client called SnowSQL which can be used to issue commands to Snowflake in a more scripted or programmatic fashion. There is also an abundance of client connectors and drivers that allow Snowflake to send data to and receive data from other popular tools.
Database Replication and Failover: Databases in Snowflake can be replicated and synced across multiple Snowflake accounts in different regions. Databases can be configured for failover to specific Snowflake accounts to ensure business continuity and enhance disaster recovery.
What is Change Data Capture (CDC)?
Change Data Capture, also known as CDC, is a software design pattern that is mainly used to capture changes in a source database so that they can be replicated in a downstream system or target database. CDC is one of the best ways to enable real-time data replication and migration.
Most databases support some form of Change Data Capture natively, although each implementation will come with nuances. With the rise of analytics and reporting, Change Data Capture has become increasingly important to data stakeholders as real-time data is demanded across applications. Many business functions, such as real-time recommendation engines, also require the use of CDC in order to work optimally and deliver business value.
What is Snowflake CDC?
Change Data Capture in Snowflake is executed through the use of streams. A stream is used to take logical snapshots of a source object, such as tables, views, or external tables. Streams also record information about DML changes occurring within source objects in Snowflake. When the stream for a table is created, a pair of additional columns are added to the source table. The new columns store metadata that tracks changes to data in the source table. The data changes tracked by a stream are expected to be consumed or moved to more permanent storage within a set retention period. If this doesn’t happen, the historical changes captured by the stream are no longer accessible and a new stream will need to be created to reactivate the tracking of changes from that point forward.
How does Snowflake CDC work?
As stated in the previous section, Change Data Capture in Snowflake works via change tracking using table streams. For a stream object to continuously record DML changes, such as inserts, updates, and deletes, it needs to know the point at which the stream records were last consumed. The solution is an "offset": a pointer that indicates the point in time at which the stream was last read in a transaction.
The offset can be thought of as a bookmark that can be placed or removed. A stream offset sits between two table versions, so querying a stream returns the changes caused by transactions committed after the offset and up to the current time of the query. The stream presents a table representation of what has changed at the row level between two transactional points in time of the source object. Streams do not store data themselves; instead, they use this metadata in combination with table versioning. The offset then makes it possible to query and consume a series of change records in a transactional manner.
What are Snowflake Streams?
A stream in Snowflake, also known as a table stream, is an object that records DML changes made to a source object. It uses the metadata associated with those changes so that actions can be taken on the changed data. A stream can provide a minimal set of changes by leveraging the offset from its current placement to the current version of the table. When queried, a stream will return the historic data in the shape and names of the source object, alongside additional columns that provide more information about the type of change.
Let’s take a look at the additional columns returned as part of the result of the query to a stream.
METADATA$ACTION: The value of this column indicates the kind of DML operation that was recorded. The values are INSERT or DELETE. Note that an UPDATE statement is represented by two row entries, one for INSERT and one for DELETE.
METADATA$ISUPDATE: This column indicates whether the row entry was part of an UPDATE statement. The value is TRUE for entries that were part of an UPDATE and FALSE otherwise.
METADATA$ROW_ID: This specifies the unique and immutable ID for a row and can be used to track changes on a particular row over time.
What are the Stream Types Available in Snowflake?
There are three stream types in Snowflake: standard, append-only, and insert-only. Each type serves a different role depending on your use case.
Standard: The standard stream type is supported for streams on tables, directory tables, and views. It tracks all DML changes in the source object (inserts, updates, deletes) and also tracks table truncations. This stream type returns the net change in a row and does this by performing a join on the inserted and deleted rows within a change set. What this means, for example, is that if a row is inserted, and then deleted between two transactional points in time, a standard stream will not return it as part of the query. Because of this, a standard stream is also called a “delta stream” as it returns the net effect of transactions executed on the data.
Append-only: Append-only streams track row inserts only. Updates and deletes, including table truncation, are ignored by an append-only stream. For example, if 5 rows are inserted in a source object and 2 rows are deleted, a query to an append-only stream will return 3 rows. Append-only streams are more performant than standard streams as they consume fewer resources since they only track inserts. Append-only is supported for streams on standard tables, directory tables, and views.
Insert-only: Insert-only streams, like append-only streams, track row inserts only; however, they are supported only on external tables. Files that are referenced by external tables on cloud storage locations are overwritten if an old file is deleted and a new one replaces it. The refresh of the external table metadata may not be automatic in all cases.
Why Use Streams in Snowflake?
The main use of streams in Snowflake is to track changes in data in a source table and to achieve Change Data Capture capabilities. It is cheap resource-wise to create a stream in Snowflake since data is not stored in the stream object. Different types of streams can therefore be created on a source table for various purposes and users. A stream is versatile and can be used to query changes on standard tables, shared tables, views, secure views, directory tables, and external tables.
The downside of using streams in Snowflake is that they require a fair amount of configuration and understanding to achieve certain use cases. Compared to a tool like Arcion, there is a much steeper learning curve. With a platform like Arcion, you can enable CDC for multiple targets in just a few clicks, and you have the flexibility to set up pipelines through either a CLI or a UI. Arcion also enables a no-code approach to setting up pipelines, in contrast to the more scripting-intensive approach of setting up streams in Snowflake.
How to Set Up Snowflake Change Data Capture with Streams
This section of the article will provide hands-on instructions in a step-by-step manner. If you intend to follow along, you should log in to the Snowflake GUI.
Firstly, you’ll need to create a database in Snowflake. For example, you can do so using the following commands:
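A minimal sketch of these commands is shown below; the database name cdc_demo is an assumed example.

```sql
-- Create a database for the walkthrough (the name "cdc_demo" is just an example)
CREATE OR REPLACE DATABASE cdc_demo;

-- Set it as the active database for the session
USE DATABASE cdc_demo;
```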
Next, you will need to create tables that will serve as the source table and the target table. You can create the source table like so.
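A sketch of the source table definition follows, assuming a simple illustrative schema (the id, name, and fee columns are assumptions):

```sql
-- Source table; the column set here is an assumed example schema
CREATE OR REPLACE TABLE members_source (
    id   INT,
    name VARCHAR(100),
    fee  NUMBER(10,2)
);
```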
To create the target table, use the command below.
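Assuming the same illustrative columns as the source table:

```sql
-- Target table; mirrors the assumed schema of members_source
CREATE OR REPLACE TABLE members_target (
    id   INT,
    name VARCHAR(100),
    fee  NUMBER(10,2)
);
```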
The structure of the source and target tables is the same; the only difference is their names.
Next, create a stream on the members_source table to track DML changes using the command below.
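A sketch of the stream creation; the stream name members_stream is an assumed example:

```sql
-- Standard stream that records DML changes made to members_source
CREATE OR REPLACE STREAM members_stream ON TABLE members_source;
```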
You will now populate the source table with dummy data. To do so, use the command below.
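For example, assuming the illustrative id/name/fee columns from earlier (the row values are dummy data):

```sql
-- Insert a few rows of dummy data into the source table
INSERT INTO members_source (id, name, fee)
VALUES
    (1, 'Joe', 0),
    (2, 'Jane', 0),
    (3, 'George', 0);
```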
You can view the tracked data from the stream using the select statement.
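Assuming the stream was named members_stream as above:

```sql
-- Querying a stream returns the changed rows plus the METADATA$ columns;
-- a plain SELECT does not consume the stream or advance its offset
SELECT * FROM members_stream;
```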
The data returned from the query to the stream includes the additional columns METADATA$ACTION, METADATA$ISUPDATE, and METADATA$ROW_ID. Note that the offset in the stream is not advanced, nor is the data returned actually consumed. This is because a select statement was used in the above transaction. To consume a stream, you need to perform a DML operation such as an insert or merge.
Next, you will use the merge statement to consume the data returned from the stream into the members_target table. An example of such a statement can be seen below.
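A sketch of such a merge, assuming the illustrative members_stream name and id/name/fee columns used above:

```sql
-- Consuming the stream in a DML statement advances its offset
MERGE INTO members_target t
USING members_stream s
    ON t.id = s.id
-- Row deleted in the source: delete it from the target
WHEN MATCHED AND s.METADATA$ACTION = 'DELETE' AND s.METADATA$ISUPDATE = FALSE THEN
    DELETE
-- Row updated in the source: apply the new values to the target
WHEN MATCHED AND s.METADATA$ACTION = 'INSERT' AND s.METADATA$ISUPDATE = TRUE THEN
    UPDATE SET t.name = s.name, t.fee = s.fee
-- Brand-new row: insert it into the target
WHEN NOT MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
    INSERT (id, name, fee) VALUES (s.id, s.name, s.fee);
```

Recall that because an UPDATE in the source is recorded as an INSERT/DELETE pair, the METADATA$ISUPDATE flag is what distinguishes a true update from a plain insert or delete.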
After issuing the command above, the members_target table will be populated with data that captures the changes in the members_source table.
In this simple example, we saw how streams can be used to manually move data from a source to a target table. For basic data replication, this may be sufficient. However, what if we want to do this automatically and at scale, or use a data source that isn't Snowflake? This is where we can harness the flexibility and power of Arcion for setting up CDC with Snowflake.
How to Set Up Snowflake Change Data Capture with Arcion
Setting up CDC with Snowflake is easy to do with Arcion. Snowflake is a natively integrated target within Arcion, as well as an Arcion partner.
In this example, we will look at setting up a pipeline between a MySQL instance and Snowflake. We will use the Arcion UI to set up both the source and target for the pipeline. Of course, this can also be done through the Arcion CLI if needed.
First, we log into Arcion Cloud to set up our replication pipeline to enable CDC. You can sign up and log into Arcion.
Once logged into Arcion Cloud, we will land on the Replications screen. Here, we will click on the New Replication button in the middle of the screen.
Next, we will select our replication mode and write mode. A few options are available to suit your needs. For replication modes, Arcion supports:
- Snapshot (the initial load)
- Full (snapshot + CDC)
For write modes, Arcion supports:
For our purposes here, we will select the replication mode as Full and the write mode as Truncating. You will also see that we have named the replication "MySQL to Snowflake".
Once the Name is populated and the Replication and Write Modes are selected, click Next at the bottom of the screen.
We are then brought to the Source screen. From here we will click the Create New button.
We then will select MySQL as our source.
And then scroll to the bottom of the page and click Continue.
Now, we can add in our MySQL instance details. These details include:
- Connection Name
All other fields will be defaulted. For the username and password, you will need to use your own credentials to connect to your server; the ones shown are just an example. Once you've input your details, click Save Connection.
Once the connection is saved, we will want to pull in the schema from the database. On the next page, we will be prompted to click the Sync Connector button. Click the button and Arcion Cloud will connect to our MySQL instance and pull down the schema.
Once completed, the UI in Arcion Cloud will display the retrieved schema. Then we will click Continue in the bottom right corner of the screen to move to the next step.
We now have our data source for MySQL configured. Our MySQL source will now be displayed on the screen as well as a Test Connection button. To ensure that everything is working correctly, we will click the Test Connection button.
Some basic tests will then run to ensure a few important factors are working as expected. The results should look like this once the test is done running. You can click the Done button to exit.
With our test successful, we can now click Continue to Destination in the bottom right corner of the screen to move to the steps where we set up our destination. This will be when we add our connection to Snowflake.
On the Destination screen, we will click New Connection to start the set up of our Snowflake connector.
Then, select Snowflake as your Connection Type and click Continue.
On the next screen, input your connection details. These details include:
- Connection Name
All other fields will be defaulted. For the username and password, you will need to use your own credentials to connect to your Snowflake server; the ones shown are just an example.
On the next screen, we will sync the connector. Click Sync Connector and wait for the process to complete.
Once complete, you will see the schema loaded onto the screen. We can then click Continue in the bottom right corner of the screen.
Our last step in configuring the connection with Snowflake is to test the connection. We will click the Test Connection button and wait for the results to return to Arcion Cloud.
You should see that all tests have passed, confirming that Arcion has access to everything required to create the connection.
Note: if Host Port Reachable does not pass, ensure that you have not included “https://” on the URL for your Snowflake connection. This can cause that check to error out.
Now, we can click Continue to Filter to begin the Filter configuration for our pipeline.
On the Filters screen, we will check the Select All checkbox so that all of our tables and columns will be replicated over from the source to the destination. Depending on your schema, you will see different tables and columns.
Optionally, you can also click on the Map Tables and Per Table Configs (Applier Configuration Docs, Extractor Configuration Docs) buttons to add further configuration. For our purposes, we will leave these as their default values. After this, you will click Start Replication.
The replication will then begin to migrate data from our MySQL instance over to Snowflake.
Once the initial data is loaded, the pipeline will continue to run, monitor for changes, and apply those changes to the destination. The idle pipeline will still show RUNNING in the top right of the screen but will show a row replication rate of 0 until new data is written to the source. You’ll also notice that the Phase description of the pipeline will now show Change Data Capture instead of Loading Snapshot Data.
If we start adding data to the MySQL instance we will see that Arcion detects this and will then sync that data over to Snowflake in real-time. This is CDC in action!
This has been a comprehensive dive into Snowflake and Change Data Capture. By this point, you should have a firm understanding of Snowflake and its key features, Snowflake CDC, and how Change Data Capture works in Snowflake. We also looked at the concept of streams, the types of streams available, how streams enable Change Data Capture natively in Snowflake, and their limitations. Lastly, we took a look at using Arcion to meet your organization's CDC needs with Snowflake. By using Arcion, it's easy for organizations to build pipelines with Snowflake. Benefits of using Arcion include:
- No-code connector support for 20+ sources and target databases and data warehouses
- Multiple deployment types supported across cloud and on-premise installations
- Configuration can easily be done through UI (available both in self-hosted & cloud), with minimal effort and zero code
- Automatic schema conversion & schema evolution support out-of-the-box (including SQL and NoSQL conversion)
- Patent-pending distributed & highly scalable architecture: Arcion is the only end-to-end multi-threaded CDC solution on the market that auto-scales vertically & horizontally. Any process that Arcion runs on Source & Target is parallelized using patent-pending techniques to achieve maximum throughput.
- Built-in high availability (HA): Arcion is designed with high availability built-in. It makes the pipeline robust without disruption and data is always available in the target, in real-time.
- Auto-recovery (patent-pending): Internally, Arcion does extensive check-pointing. Therefore, any time the replication process gets killed for any reason (e.g., database, disk, network, server crashes), it resumes from the point where it was left off, instead of restarting from scratch. The entire process is highly optimized with a novel design that makes the recovery extremely fast.
To get a no-code CDC solution that works seamlessly with Snowflake, try out Arcion Cloud and have a scalable and reliable CDC-enabled pipeline set up in minutes. If you’re looking for a flexible on-premise solution, try out Arcion Self-hosted. Regardless of the deployment type chosen, Arcion provides unparalleled flexibility and scalability for data pipelines using Snowflake.