I recently read a great article by Nnamdi Iregbulem on Crunchbase about how the modern data stack is becoming real-time. The article mentioned some tools for building real-time infrastructure and real use cases, and in general did a great job of providing an overview of our industry’s changing landscape. However, as someone closely involved in the very process of bringing about that change, I couldn’t help but feel that a very important part of the modern data stack was missing: real-time data ingestion.
So I wanted to expand on the conversation Nnamdi started and share my own experience with the modern data stack, why real-time data ingestion is an essential part of the modern data stack, and how enterprises can integrate it within their existing data infrastructure.
What Is the Modern Data Stack (and What's Missing From It)?
Business profitability today depends largely on an organization’s ability to innovate and adapt to rapidly changing conditions. Decisions need to be made quickly to stay ahead of the competition and at the top of customers’ minds. For business executives to make these timely decisions, they always need the necessary information at hand. But traditional data infrastructure and batch processing are outdated and slow, inhibiting businesses from moving effectively and decisively.
This has highlighted the need for organizations to rethink their data strategy so they can react quickly to changing data as it happens. One of the most effective ways to do this is by adopting the modern data stack. The modern data stack refers to a suite of tools used by organizations for highly efficient data integration. Conventionally, these tools include the following (in the order data flows through them):
- A fully-managed data pipeline for ELT (Extract, Load, Transform)
- A cloud-based columnar warehouse or data lake as a destination
- A data transformation tool
- Business intelligence or data visualization platform
The MDS has been a focused attempt at reimagining traditional data flow with cloud-based tools to build a more scalable, faster, and more flexible data infrastructure. However, the MDS is not just about specific tools but rather about solving data problems and meeting the modern demands of enterprises. So, like any other tech stack, the MDS is always evolving.
For a long time, our data stacks have relied heavily on batch applications, which typically run only every 24 to 48 hours and are resource-intensive, inefficient, expensive, and fairly prone to data inconsistencies. This severely impedes the ability of organizations to get up-to-date and accurate information. But that’s not all.
The modern data stack has been largely centered around SaaS tool data integration (driven by unicorn companies like Fivetran) while ignoring the use cases and value of transactional databases like Oracle, SQL Server, MySQL, DB2, etc.
In a way, this makes sense: the leading SaaS data integration tools are collectively valued at around $12 billion, with unicorn startups such as Fivetran and Airbyte valued at $5.8 billion and $1.5 billion respectively. The growth of SaaS tools means there is a lot of market need to be met in this vertical, and the rules of the free market dictate that companies will naturally cater to this sector.
However, the truth is that enterprise core systems store high-value data and serve as the backbone for corporate strategy. Yet this incredibly valuable data is virtually inaccessible within the modern data stack because transactional and operational databases are, by design, trapped within the systems of record for a myriad of reasons, including security and performance concerns.
This results in limited APIs, forcing enterprises to extract data from OLTP databases using slow and inefficient batch replication and unreliable custom scripts. A few vendors like HVR and Oracle GoldenGate address this space, but none of these existing systems are cloud-native or offer a fully-managed cloud service option. I find it ironic that the entire point of the MDS is to avoid data silos, yet the lack of real-time data integration between systems of record and modern analytics platforms just creates more data silos out of the most valuable data storage systems.
The missing piece in the modern data stack is thus a standard way to push large-scale enterprise data from systems of record (OLTP databases and data warehouses) to the modern data platform your business relies on. Fortunately, not only have we found the missing piece, we have also developed it to an enterprise-ready stage - let’s call it Real-time ELT.
How Real-time ELT Bridges the Gap
Traditionally, ETL (Extract, Transform, Load) tools collected batches of data from a source system on a specified schedule, then transformed the data and loaded it into a data warehouse or a separate database. But customer expectations have increased tremendously in the last couple of years. For instance, 80% of American consumers are more likely to purchase from a company that personalizes its sales offering. But a CRM becomes ineffective if teams have to wait hours, if not days, for the information to be updated.
Financial services firms find themselves in a conundrum where existing batch applications result in higher resource expenditure, significant data inconsistencies, slower analytics, and ultimately, lost revenue. With real-time fraud detection alone (which on its own is a nearly $30 billion industry), financial services firms stand to save hundreds of millions of dollars.
It’s commonly agreed among strategists and banking industry analysts that real-time is the future of finance, and companies that do not begin implementing the necessary systems today will find themselves pushed out of the competition in the coming decade. Real-time ELT (Extract, Load, Transform) solves these problems through log-based CDC.
Change Data Capture (CDC) is a software process that tracks only the changes made to a source system in order to keep the entire downstream data system and its applications up to date. In log-based CDC, changes are tracked and propagated via the database’s change logs in real-time, which gives businesses instant access to data that is not only up-to-date but also 100% accurate. Log-based CDC is also responsible for:
- Real-time ELT’s characteristic low-latency capabilities
- Eliminating the need for re-syncs
- Significantly improving scalability, performance and data consistency
- Helping phase out resource-intensive custom scripts that fail often
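To make the idea concrete, here is a minimal sketch (not any vendor's implementation) of what log-based CDC does downstream: change events read from a database's transaction log are replayed against a target table, keeping it in sync without ever querying the source tables. The event format is invented for illustration.

```python
def apply_change(target: dict, event: dict) -> None:
    """Apply a single change event (keyed by primary key) to the target."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        target[key] = event["row"]
    elif op == "delete":
        target.pop(key, None)

# Events as they would appear, in commit order, in the change log.
log = [
    {"op": "insert", "key": 1, "row": {"id": 1, "status": "new"}},
    {"op": "update", "key": 1, "row": {"id": 1, "status": "paid"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "status": "new"}},
    {"op": "delete", "key": 2, "row": None},
]

target_table: dict = {}
for event in log:
    apply_change(target_table, event)

# The target converges to the source's final state without a full re-sync.
print(target_table)
```

Because every change flows through the log, the target never needs a bulk re-extract to catch up, which is exactly why re-syncs disappear.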
But there’s so much more to a real-time ELT tool than just the (massive) benefits of CDC.
Real-time ELT connects data from systems of record with the rest of the MDS tools, especially cloud analytics platforms, so organizations can take advantage of real-time analytics, drive ML/AI workloads, reduce operational overhead, and much more.
Real-time/streaming ELT is ideal for organizations that have massive datasets requiring deduplication and other preprocessing before ingestion into a real-time analytics data platform. With a modern, real-time data stack, simple incremental transformations take place multiple times each second and give insights instantly. It’s also an easily accessible, easy-to-implement option for enterprises that need to modernize their data infrastructure to support modern use cases.
What to Look for When Selecting a Real-time ELT Tool
With all the different platforms that are available or that are being built to process large-scale real-time data, it can be difficult to know which one to select. But as Nnamdi said, “Assembling and stringing together these various systems is still tricky today. But organizations that make these investments will reap rich rewards—primarily the achievement of the fabled “real-time enterprise,” an organization capable of perceiving and reacting to events and changes in their business as they happen.”
Fortunately, it’s easy to identify the right ELT tool by looking for key characteristics that are much more likely to fill the gaps in the modern data stack.
- Support for a wide range of production-grade connectors
The MDS ecosystem has many tools that connect to sales and marketing applications but almost none for enterprise data stores. Therefore, it’s critical that your ELT tool has built-in connectors for all required systems, especially OLTP systems and data warehouses. This helps prevent data silos and makes it easier to get any data to and from any system. Data connectors, especially for core data systems, also help enterprises grow, as companies can much more readily adopt a variety of databases, each optimized for different use cases, without worrying about interoperability. Production-grade connectors mean the ability to do CDC on core enterprise systems like Oracle, DB2, MongoDB, etc., with no limitations on the types or versions of the production databases.
- Zero data loss architecture
Choose a tool that guarantees 100% complete and accurate data transfer with zero data loss, and that delivers each change exactly once to the target. Data loss is a major problem for any enterprise but can be especially disastrous for companies that leverage machine learning models for analytics and predictions (such as in finance, CRM, supply chain management, etc.), as even slightly off data can completely ruin the accuracy of these models.
When the data is loaded into the target, it should also follow the same transactional semantics as the source to ensure data integrity. This means that data should be applied in the right order. Otherwise, if data that was stored later shows up earlier, your reports will give inaccurate information, leading to poor business decisions.
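A toy illustration of why apply order matters: if buffered change events arrive out of order, they must be replayed in source commit order, e.g. by sorting on a log sequence number (LSN), before they reach the target. The event shapes below are hypothetical.

```python
# Change events for one account, received out of commit order.
events = [
    {"lsn": 3, "op": "update", "key": "acct-1", "balance": 40},
    {"lsn": 1, "op": "insert", "key": "acct-1", "balance": 100},
    {"lsn": 2, "op": "update", "key": "acct-1", "balance": 70},
]

target = {}
for e in sorted(events, key=lambda e: e["lsn"]):  # enforce source commit order
    target[e["key"]] = e["balance"]

print(target["acct-1"])  # 40 — the latest committed value, matching the source
```

Applying the events in arrival order instead would leave the balance at 70, the stale value the article warns about.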
- Zero impact on source
The ELT tool should perform change data capture only on transaction logs and not fire queries against the source production databases to gather changes. It should not disturb the source databases’ performance or operating processes, nor require adding timestamp columns to production tables. Log-based CDC does just this by reading only transactional change streams and logs, thus ensuring zero impact on source production databases.
- Zero maintenance pipelines
When schema changes occur, there would typically be a need to stop the pipeline and manually change the schema on both sources and targets before resuming. This, of course, requires a team of engineers to be on call to continuously monitor for any changes to the schema.
To avoid this, enterprises must look for an ELT solution with data pipelines that are easy to maintain. For example, if the pipeline breaks down, the ELT tool should not require a manual restart as stopping the pipeline can cause missing and/or stale data. The platform should be equipped to handle all schema changes and evolutions automatically.
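As a hedged sketch of what "automatic schema evolution" means in practice: when the change stream carries a schema-change event (e.g. a new column), the pipeline widens the target schema on the fly instead of stopping for a manual migration. The event shapes here are invented for illustration.

```python
target_columns = {"id", "email"}
target_rows = []

def handle(event: dict) -> None:
    """Route a change-stream event: evolve the schema or append a row."""
    if event["type"] == "schema_change":
        target_columns.update(event["added_columns"])  # widen target schema
    elif event["type"] == "row":
        # Project each row onto the current schema; columns a row predates
        # are filled with None so old and new rows coexist in the target.
        target_rows.append({c: event["row"].get(c) for c in target_columns})

handle({"type": "row", "row": {"id": 1, "email": "a@example.com"}})
handle({"type": "schema_change", "added_columns": ["phone"]})
handle({"type": "row", "row": {"id": 2, "email": "b@example.com", "phone": "555"}})
```

No operator intervention is needed between the old and new row shapes, which is the property to look for in a zero-maintenance pipeline.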
- Secure handling of sensitive data
Encryption is one way of protecting Personally Identifiable Information (PII) and other sensitive data. Once the data is encrypted, the risk of a data breach is reduced to a large extent, and the impact of any breach is contained.
The ELT solution should simplify this process so this data can be handled effectively and efficiently, in line with regulatory guidelines.
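One illustrative way to protect PII columns in flight: pseudonymize sensitive fields with a keyed hash before they are loaded into the target. This is a sketch only; a real pipeline would use proper encryption with managed keys (e.g. a KMS), and the column names and key below are placeholders.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-managed-key"  # placeholder; never hardcode keys
PII_COLUMNS = {"email", "ssn"}

def protect(row: dict) -> dict:
    """Return a copy of the row with PII columns replaced by keyed hashes."""
    out = {}
    for col, value in row.items():
        if col in PII_COLUMNS and value is not None:
            digest = hmac.new(SECRET_KEY, str(value).encode(), hashlib.sha256)
            out[col] = digest.hexdigest()  # stable token; still joinable
        else:
            out[col] = value
    return out

row = {"id": 7, "email": "a@example.com", "ssn": "123-45-6789"}
safe = protect(row)
```

A keyed hash keeps tokens stable per value, so downstream joins and deduplication still work even though the raw PII never lands in the warehouse.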
- Auto-scaling
As an organization grows, so will its data needs. Auto-scaling should be built into the ELT solution as a key feature of the product. The tool should have performance optimization features to address growing business needs and be able to handle high-volume, high-velocity, and high-variety data. In the cloud era, businesses expect to scale resources up and down automatically based on their needs.
Exploring the Top Options for Real-time ELT
We’ve established why and how a real-time ELT tool fills the gap in the modern data stack, but which real-time ELT tool should businesses look at specifically?
(Un)fortunately, the answer to this question is rather simple. There just aren’t enough truly real-time ELT solutions that meet all the requirements of modern enterprises. There are ELT solutions, but almost every option has major shortcomings. That said, there are a few options to enable real-time ELT in your organization, each slightly different in terms of size, sophistication, and target use cases.
- Debezium + A team of engineers
Debezium is the most active open-source change data capture (CDC) project and works in tandem with Apache Kafka. Debezium captures all row-level changes from each database’s transaction log. Applications listening to Debezium get access to near real-time updates that they can use to perform actions. Debezium supports a variety of different databases and datastores.
Although a powerful tool, Debezium is not a fully-fledged ELT solution - it only offers the “E” or “Extract” capability out of the box. The “LT” or “Load and Transform” must be built by the users.
Because it’s built on Kafka, it integrates very well with the surrounding Kafka ecosystem, such as Kafka Connect, Confluent’s Schema Registry, and Apache Avro, which help streamline and automate many tasks. That said, for Debezium to be a viable real-time ELT solution, businesses will still require a team of engineers who can write and maintain custom code for the various components of a Debezium-based architecture.
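As a concrete illustration, a Debezium source is typically started by POSTing a JSON connector configuration to Kafka Connect’s REST API. The sketch below builds such a payload for the MySQL connector; the host names and credentials are placeholders, and exact property names vary between Debezium versions, so check the docs for the release you run.

```python
import json

connector = {
    "name": "inventory-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql.internal",     # placeholder host
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "********",
        "database.server.id": "184054",            # unique replication client id
        "database.server.name": "inventory",       # logical name / topic namespace
        "table.include.list": "inventory.orders",  # tables to capture
    },
}

# POST this payload to http://<connect-host>:8083/connectors to start capture;
# Debezium then streams each table's row changes into its own Kafka topic.
payload = json.dumps(connector)
```

Note that this only stands up the “E”: consuming the resulting topics and loading them into a warehouse is the custom work the article describes.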
Debezium is one of the core parts of the data stack for some of the biggest companies so the problems and limitations are well-documented. For instance, Shopify recently published their experience with using Debezium and mentioned some major challenges faced during modernization, including but not limited to:
- Breaking schema changes
- Read locks causing contention on Shopify’s databases
- Debezium does not support snapshotting a table without blocking incoming binlog events (resulting in unacceptable latency in a data-loss event)
- Several tables in Shopify’s Core monolith are too big to snapshot in any reasonable time frame
Some of the problems can be solved through custom programming but the majority of problems faced by Shopify “are not yet resolved with our platform, and we’ve had to communicate these limitations to early platform adopters.”
In addition to the problems faced by Shopify, Debezium customers also face some common challenges including:
- Debezium is only an extractor CDC solution, meaning other components of ELT must be built internally.
- Some Oracle data types are not supported
- Debezium has no in-built data validation tools
- Schema evolution doesn’t work 100% of the time and incompatibility can result in forced updates and re-syncs.
- Since every company writes custom code for the “Load” part differently, it’s difficult to ascertain whether or not the pipeline is scalable and guarantees zero data loss before it's deployed.
- Airbyte
With Airbyte, we move a step above something like Debezium. The two big differences between Debezium and Airbyte are that, unlike Debezium, Airbyte qualifies as an ELT solution (albeit barely so), and that it offers a zero-code interface. However, even at this stage, there is a huge gap between competitors in terms of capabilities, scale, and support network. Airbyte isn’t as well-developed or enterprise-ready as the next two options on this list and, as a result, relies on the open-source community to maintain and update its data connectors. Perhaps for enterprise users, the only silver lining is that with Airbyte’s Connector Development Kit (CDK), users can build their own custom data connectors (although it has a library of over 100 pre-built community data connectors as well).
Unfortunately, at the moment, the open-source model is Airbyte’s most prominent differentiator from the more well-developed ELT options currently on the market. It’s also tightly coupled to SQL and dbt workflows and pipelines, so if your current use cases do not fit this model, it’s best to look elsewhere.
In addition to this, there are some pretty big capabilities and features missing from Airbyte, including:
- Not all connectors are ready for enterprise production.
- Does not handle schema evolution; changes must be applied manually
- Unclear whether the architecture guarantees zero data loss
- No data validation tools
- Fivetran + HVR
Perhaps the most popular ELT solution on the market today, Fivetran is a viable tool for real-time change data capture for a large portion of enterprises. Fivetran has one of the most expansive libraries of data connectors, as well as numerous integrations that help companies modernize their data infrastructure - but only part of it.
Fivetran mainly focuses on SaaS tool integration, and so far there has not been massive adoption of Fivetran for enterprise sources like systems of record, OLTP systems, and data warehouses. This means that enterprises that need to leverage their transactional databases must look elsewhere.
However, to make up for this, Fivetran completed its acquisition of HVR, in October of last year. At $700 million, this was one of the biggest deals in data startups ever. Before the acquisition, HVR was already one of the heritage CDC-based solutions on the market and a quasi-competitor to Fivetran. With this deal, Fivetran and HVR will, at some point, merge their codebases and are expected to deliver a single managed service offering, combining HVR’s CDC pedigree and Fivetran’s cloud infrastructure.
- Arcion
Arcion is the world’s first cloud-native, fully-managed CDC data replication platform. In many ways, Arcion ushers in a new category of data integration platforms, ones that better represent the future. It is built on a highly parallel, highly distributed architecture with low-latency, enterprise-grade CDC for maximum performance and a zero-data-loss architecture. Being cloud-native also means that users get near limitless scalability.
Arcion holds a special place for me because I am the founder and CTO of the company. Back when I was working at SingleStore (previously MemSQL), I experienced the problems of slow-moving data firsthand and realized that the lack of real-time CDC technology was the biggest bottleneck for enterprises with growing data needs, massively impacting the adoption of modern cloud platforms. I realized that every data team would benefit from this technology, and so I decided to start a company to build it. Arcion extracts the maximum value from enterprise data by interconnecting heritage and modern databases, as well as transactional databases such as Oracle, DB2, SAP, and MySQL.
Arcion is also a true ELT platform, providing full coverage every step of the way including:
- Automatic schema conversion
- Guaranteed data consistency
- Built-in data validation tools
- Real-time monitoring, and more
For enterprises and chief data officers, one of the more interesting and unique features Arcion brings to the table is how it builds database interconnectivity. With Arcion, one can deploy production-ready data pipelines in minutes, without a single line of code. This was unheard of in a world where data pipelines can take weeks if not months to be ready and still fail often.
Of course, with limited time and resources, we’ve had to prioritize some things over others. For instance, Arcion only focuses on databases and not SaaS apps. Arcion does have both self-hosted and cloud variants but the latter has fewer data connectors (up to 20 by the end of Q3 this year).
But the good thing is that Arcion (like many other options on this list) is growing rapidly! Fresh off the heels of a $13 million Series A round, we’re building a larger partner ecosystem, bringing new features, integrations, and more.
I think the success of Arcion, Fivetran, Airbyte, and Debezium is a testament to the value of database interoperability, and I am extremely excited for the industry. As for companies, the important thing is to adopt the modern mindset and recognize that the data stack is going real-time. Most of the data replication platforms on the market today are very powerful, so the only thing you need to do is figure out your requirements and choose the platform that meets them.
The growth of the SaaS industry has been meteoric, comparable only to the likes of the cloud revolution. However, the success of SaaS apps does not change the fact that enterprises still heavily rely on systems of record, OLTP systems, and data warehouses, which the modern data stack, for the most part, does not support.
As a result, enterprises have had to resort to conventional, slower, and unreliable methods. But by connecting enterprise databases in real time, real-time ELT supplies the missing piece in our modern data stack, allowing enterprises to go fully real-time.