In today's business landscape, data is coming from more sources than ever before. This data is often complex and requires cleansing and transformation before it can be ingested for reporting and analytics.
While simply having this data is a critical first step, it is not enough. To make data-driven decisions, business leaders need real-time access to this data, and they need to have confidence in its accuracy and completeness.
To achieve this, businesses must invest in a data infrastructure that can handle today's data volume, variety, and velocity. This infrastructure must be able to ingest data from various sources, clean and transform it, and make it available in real-time for analytic reporting or power data-intensive apps.
With the right data infrastructure in place, businesses can unlock the true value of their data and make better-informed decisions that drive growth and success.
Enter Change Data Capture.
What is Change Data Capture?
To fully unlock the value of their data, businesses must first invest in the right data infrastructure. Change Data Capture (CDC) is one essential piece of this puzzle.
CDC involves continuously monitoring database transactions and capturing changes as they occur. This gives businesses real-time visibility into their data, allowing them to make better-informed decisions that drive growth and success.
When combined with other data management tools such as data warehousing and data lakes, CDC can help businesses transform their data into their most valuable assets.
And that's precisely what the most prolific brands of this generation are doing.
How Change Data Capture Works
Change Data Capture (CDC) is a process that captures data changes at the source and propagates those changes to a target. The source can be a database, file system, or any other data store, while the target is either another database or a data warehouse.
CDC can track insertions, deletions, and updates in the source data, and it can provide near-real-time visibility into those changes. CDC can also facilitate data synchronization between two or more data stores.
CDC typically uses log-based or trigger-based approaches to capture data changes. In a log-based approach, CDC reads the logs of the source data store to identify and track changes.
In a trigger-based approach, CDC relies on triggers configured in the source data store to capture changes. Once changes have been captured, they are propagated to the target via an ETL process.
Some CDC solutions also provide bidirectional synchronization, like Arcion, which allows changes in the target data store to be propagated back to the source in parallel.
Why Big Brands Use Change Data Capture
When it comes to data ingestion and database migration at the enterprise scale, companies face several major challenges:
- Exporting and migrating data takes databases offline for hours.
- Without data validation tools built-in, businesses risk data corruption and inconsistency.
- Data moves–at best–through slow and inefficient batch jobs.
- Schema changes in source databases require manual intervention triggering pipeline downtime
- Data lags behind going into the target, leading to stale data and impacting the performance of ML models.
And when it comes to making decisions that impact the business in real-time or time-sensitive, data that's even a few minutes old can be too outdated.
Change Data Capture overcomes all these issues by providing a real-time, accurate, and complete view of data as it changes. This empowers companies to make better-informed decisions that can help them drive growth and success. Thanks to CDC:
- Database migration no longer requires taking the source database offline and achieving zero downtime.
- Data is validated in flight and offline, ensuring transactional integrity and data consistency.
- Data can be incrementally loaded into the target, reducing the window of opportunity for data corruption.
- Data is always up-to-date in the target, eliminating the need for time-consuming and error-prone batch jobs.
- Schema changes in the source database automatically propagate to the target, eliminating pipeline downtime.
- Data is always synchronized between the source and target in parallel, so data is never stale.
Now, let's look at how some of the biggest brands use Change Data Capture to make data their most valuable assets.
SpinalTap: The CDC Tool Built by Airbnb
Over the last couple of years, Airbnb has faced increasingly unique and complex data challenges. As a service-based platform, dynamic pricing, reservation, and availability workflows need to be processed and analyzed in near-real-time to power a variety of features such as search ranking, cancellation predictions, and real-time messaging.
To address these needs, Airbnb built SpinalTap, a log-based Change Data Capture system that propagates data changes from MySQL, DynamoDB, and their in-house storage solution, to Apache Kafka.
Thanks to CDC, Airbnb can achieve several goals:
Cache Invalidation for Real-Time Messaging
By propagating data changes in near-real-time, SpinalTap ensures that the cache is always up-to-date and invalidates it when necessary. This enables Airbnb to provide users with timely and relevant information through features like real-time messaging.
Preferring an asynchronous approach allowed the organization to decouple the data propagation process from the main request-response cycle, which would have been much more complicated to achieve with a synchronous approach. The result is a more responsive user interface with minimal impact on the performance of the overall system.
Search Indexing for Reviews, Inboxes, and Support Tickets
Airbnb has multiple search products that need to be constantly updated with the latest data:
- Review searches
- Support ticket search boxes
SpinalTap is the perfect fit for building the application's indexing pipeline because it can propagate data changes in its data stores (MySQL and DynamoDB) to its search backends in near-real-time with at-least-once delivery semantics and. This enables the search engine to index the latest data and provide users with up-to-date results.
Offline Big Data Processing for Machine Learning
Airbnb's machine learning algorithms need to be trained on the latest data to accurately reflect its users in real-time. SpinalTap is responsible for exporting online datastores to Airbnb's offline big data processing systems for data streaming. This requires high throughput, low latency, and proper scalability.
Thanks to SpinalTap's modular and extensible design, it is easy to add new sources and targets. This has allowed Airbnb to quickly adapt SpinalTap to their changing needs and scale it to support their ever-growing data volume and user base.
Signaling for Availability and Booking Changes
As soon as a user books or cancels a reservation, that information must be propagated to other services to keep the system up-to-date. Depending on services (availability, booking, payment) can subscribe and react to data changes from another service in near real-time.
For example, when a user cancels a reservation, the payment service needs to be notified so that it can refund the user. This would not be possible with a traditional polling-based approach because the payment service would not know to check for a refund until the next poll, which could be hours or days later.
With SpinalTap, the payment service can immediately notify the cancellation and process the refund immediately. This results in a better user experience and fewer support tickets.
DBEvents: How Uber Manages MySQL Change Data at Scale
Uber's core business is built on a rich set of data: everything from the location of its riders and drivers to the price of a ride in a specific city at a specific time. To power features like real-time ETAs and surge pricing, Uber cannot afford to have stale data.
But Uber's old data pipes had several inefficiencies:
- Compute usage: Using a large table to take multiple, inconsistent snapshots and reloading them at intervals is wasteful since only a few records may have been changed.
- Upstream stability: The job puts massive strain on the source, such as during heavy reads on a MySQL table, because of the requirement to load the complete table frequently.
- Data correctness: A significant amount of time was required to program and test data quality checks ahead of time, which led to a poor user experience for data lake users as they had to wait for the data to become available.
- Latency: The time it took for the change to occur in the source table and become available to be queried on the data lake was extremely long, resulting in outdated data.
To keep its data fresh, Uber built DBEvents, a Change Data Capture system that propagates changes from MySQL to Apache Kafka in two ways:
Bootstrap: How Uber Uses Snapshots to Initialize Its Change Data Feeds
Uber developed a source pluggable library that moves data into the Hadoop Digital File System (HDFS) through Marmary, Uber's custom-built data ingestion tool. The semantics from this library are used to bootstrap datasets while providing flexibility for other systems, like batch processing and streaming, to use the same data. And each external source backs up a snapshot of its raw data into HDFS.
With the help of an Uber-made open-sourced service called StorageTapper, which reads data from MySQL databases, then schematizes it and writes it to Kafka or HDFS.
This approach to event creation on target storage systems, as used by Uber, allows the company to create logical backups. Instead of a direct copy of a dataset, a logical backup uses the events created by StorageTapper, to provide a verifiable, auditable dataset that can be used to restore data to any state.
This is especially useful for replicating data across multiple regions, which Uber needs to do to provide its service globally.
Incremental Ingestion: How Uber Propagates Changes in Real-Time With Binlogs
StorageTapper also supports incremental ingestion, which propagates changes in MySQL databases to Kafka or HDFS in real-time. To do this, StorageTapper reads binary logs (binlogs) from MySQL databases, then uses a custom-built Apache Kafka connector based on the Apache Avro format to write the data to Kafka.
Each binlog represents a transaction on the database, and each event in the binlog has a type (e.g., insert, update, delete) and some data. And since each event represents the exact order that the data was changed in the database, they roll over to the data lake in the exact same order.
This allows Uber to have exactly-once semantics when replicating data from MySQL databases, which is critical for maintaining data correctness.
DBLog: How Netflix's Distributed Databases Use Change Data Capture
Netflix has several microservices that rely on hybrid backends, including RDS, NoSQL, ElasticSearch, and Iceberg. For example, when a new movie is added to the platform, the information is written from Apache Flink to a Cassandra database that stores all the metadata about that movie. It would also be required by the UI (accessed through ElasticSearch), and analyzed with a data warehouse solution (such as Iceberg).
Since no single database can handle all the reads and writes for a movie's metadata, Netflix uses a watermark-based CDC system called DBLog to stream data across its various databases.
DBLog uses database triggers to capture changes and write them to a message queue. It also uses watermarks to keep track of which changes have been processed. This allows Netflix to process data in real-time and replay past events if necessary.
Netflix's DBLog processes data in two ways:
Log processing is a critical part of the DBLog framework. Databases emit events in a message queue, and the DBLog agent picks up these messages and writes them to an object store. This process helps to ensure that the data is consistent and accurate.
The log processing system is designed to handle high volumes of data and to provide a consistent experience for users. The system is reliable and scalable, and it can be used with any type of database.
Dump processing is an important part of data management. Dumps store data that would otherwise be lost when transaction logs are purged. By taking dumps in chunks, it is possible to progress with log events while still retaining a full record of the data.
This makes dump processing an essential part of reconstituting a complete source dataset. In addition, dump processing can also help to improve the performance of data mining and analysis by reducing the number of log files that need to be processed.
To ensure that chunks are taken in a way that does not stall log event processing for an extended period of time, watermark events are created in the change log. By doing this, the chunk selection is sequenced, preserving the history of log changes. This way, a selected row with an older value cannot override a newer state from log events.
In addition, by using watermark events, we can more easily identify which chunks need to be processed and in what order. Consequently, this method provides an efficient and effective way to take chunks while still ensuring that log event processing is not stalled for an extended period of time.
Let's take a look at how this works:
Pretend there is a new movie that is being added to the Netflix platform with the ID of 123. The first step is to add the movie's metadata to Cassandra. This write will be replicated to all other databases that need this information. To do this, a DBLog agent will pick up the write and write it to an object store.
The next step is to take a dump of the data. This is done in chunks so that log event processing is not stalled for an extended period of time. For this to work, a watermark event is created in the change log. This way, the chunk selection is sequenced, and we can more easily identify which chunks need to be processed and in what order.
Once the dump is taken, the next step is to process the data. This is done by taking the log files and running them through a series of jobs, giving us the complete dataset we need.
How Shopify is Learning to Manage its Abstract Data
Like the aforementioned companies, Shopify has its own complex data lake, consisting of multiple sources, systems, and data types. To make matters more challenging, Shopify has both internal data warehouses to manage and consumer data from its massive merchant platform of over one million e-commerce stores.
At first, the e-commerce magnate used data warehousing specifically for internal analytics. But as Shopify became the leading platform for merchants building an online business, it quickly realized the need to provide its users with data-driven insights.
Their user base grew faster than they could scale their data management strategy; as they tried to meet their merchants' needs, they were frequently met with challenges:
- Slow query times on OLAP databases
- Limited support for structured data sets
- Batch loading for data ingestion pipelines
So they built out a separate Merchant Analytics platform based on low-latency, columnar storage databases that could handle the scale and speed their merchants demanded.
But Shopify's new merchant features (Shopify Email, Marketing, Capital, and Balance) that give the platform its competitive edge all created new data sets that didn't fit into the existing schema.
And while splitting the two data management platforms meant that users could access high-quality analytics and visualizations without interfering with Shopify's data scientists, it also meant that they needed two systems with different programming languages, tools and standards, and even sets of the same information.
For data scientists, this meant more time spent wrangling data instead of using their skills to build features that would further merchant (and business) success.
Longboat: Shopify's Internal Data Extraction Platform
Shopify's data team realized that they could take a page from their own playbook and use Change Data Capture to give them the accessibility, efficiency, and flexibility they needed to manage their data.
So they created Longboat.
Longboat is a bespoke, query-based CDC software that works with batch extraction jobs. It works by querying the source database for changes and then using those results to generate a new dataset. But it was not without its problems.
Because the software worked in batch extraction jobs, the data was always a minimum of an hour old. And although most data extraction queries were simple enough, Longboat missed all hard deletes, meaning that some data could be unintentionally removed from the dataset while others could be unintentionally left in.
This process also missed intermediate states, so if a buyer started the checkout process but didn't complete it, that data would be left out on the merchant's end, making retargeting and understanding consumer behavior much more difficult. And even when these are not glaring issues, data consistency and missing or lost data are.
Log-Based Change Data Capture and Event Streams, Built With Debezium
The solution to the data management problem at Shopify was to use log-based CDC instead of query-based CDC batch jobs to generate the data sets for their merchant analytics platform.
Since Kafka was a major part of Shopify's existing framework, Shopify's connector uses Debezium, an open source project that offers tools to make it easy to stream changes from various database management systems.
While Longboat extracted sharded MySQL data and wrote queries to cloud storage, Debezium writes change events to Kafka topics.
But even as increased uptake and scale pushed Shopify's data engineering team to look for new ways to make their data platform more reliable, easier to use, and less costly to operate, the benefits of using log-based CDC with Kafka and Debezium have been clear: they have been able to standardize their data extraction tools across their organization, move away from batch jobs, and get their data in near-real-time while unify their batch streaming sources.
And the fact that this process is not yet refined by any means indicates a very bright and exciting future for Shopify's data management platform.
However, it also shines a light on the shortcomings of Debezium, such as difficult to scale and ensure data consistency. For other solutions, check out our blog about Debezium alternatives.
How Reddit Moves its Raw Data Into Its Kafka Cluster
Reddit has long been a big data company and traffic powerhouse.
But with that much user engagement and data comes a big responsibility to keep things running smoothly.
Previously, Reddit's process of moving events and databases into its Kafka cluster required numerous steps, including creating AWS EC2 replicas of Postgres data for data snapshotting and setting up and maintaining complex scripts for data backfilling.
Despite troubleshooting and attempts at easing this process, it was clear that this process would not be able to handle Reddit's increasing volume and velocity of data for two reasons:
- The data was inconsistent. Since Kafka's services worked in real-time, the daily snapshot data (which ran overnight) worked in opposition to this, creating data inconsistencies.
- The infrastructure was too fragile. Since databases and read-replicas ran on EC2 instances, if one failed, the entire process would fail–in which case, manual intervention was required to get things back up and running.
How Reddit Uses Debezium to Stream Change Data
To solve these problems, Reddit used Debezium to stream changes from Postgres databases into Kafka.
This successfully addressed Reddit's two primary concerns: data inconsistencies and infrastructure fragility.
No more running data snapshots overnight.
Thanks to Debezium, the Reddit team is able to create real-time snapshots of its Postgres data in their data warehouse and is able to handle any data changes including schema evolution. However, the Reddit team has ran into issues with invalid schema types making its way into their Data Warehouses. But they don't restrict schema evolution for microservices as the recipient of the data tends to be that same team within their Data Warehouse.
Debezium has also allowed Reddit to move away from its complex and error-prone infrastructure. Small, lightweight Debezium pods replace the big, bulky EC2 instances, and Debezium's built-in monitoring and management means that Reddit no longer has to maintain its own scripts.
The end result? A more reliable, easier-to-use, and less costly data management platform that can handle Reddit's increasing volume and velocity of data.
But for Debezium to work, Reddit employs lots of engineering resources that understand the technology and implement it correctly.
That's the price of using free software. For many companies, an open-source solution requires too much in-house expertise to justify the costs.
The Future of Change Data Capture
The above examples show how Change Data Capture is used by some of the biggest companies in the world to solve some of their most difficult data challenges.
As Shopify, Reddit, and other companies have found, CDC can be a powerful tool for simplifying complex data architectures, reducing costs, and increasing reliability.
The benefits of CDC are clear. And as more and more companies adopt CDC, the technology will only continue to grow in popularity and scale.
Data-driven leaders use CDC to drive breakthrough advantages–if you're not using CDC, you're already behind.
How Arcion Fits Into the Picture
Companies like Uber, Netflix, and Shopify build their own CDC solutions because they process large volumes of data and need low-latency pipelines to power their real-time applications and serve millions of users worldwide. But these companies are the exception, not the rule.
For most organizations–even well-established corporations–the cost and complexity of building a CDC solution from scratch are simply not feasible. Many companies have tons of data trapped in legacy systems and on-premises, and they need a way to modernize their data infrastructure and move to the cloud.
That's where Arcion comes in.
Arcion is a fully-managed change data capture platform that makes it easy and affordable for enterprises to stream, ingest, and migrate their data in real-time. And it empowers organizations to migrate data to modern databases and move from on-prem to the cloud.
With Arcion, there's no need to build or maintain your own CDC solution, or hire an army of engineers to crack the code of making the in-house solution scalable and ensure data consistency. As the only distributed CDC solution on the market, we take care of everything so you can focus on what's important: your business.