Future-Proofing Data Pipelines
Data pipelines are critical structures for moving data from its source to a destination. For decades, companies have used data pipelines to move information, such as from transactional to analytic systems. However, as the needs of companies change over time, they might find their data pipeline requirements also change. Now a company called Equalum is hoping to bridge today’s pipeline requirements to tomorrow’s capabilities.
Enterprises are rife with pipelines of various types, moving data this way and that. That could include extract, transform, and load (ETL) pipelines that work in batch; ELT pipelines that pump raw data into cloud data warehouses for transformation; and even pipelines based on change data capture (CDC) technology that pump a continuous stream of updates into some downstream system.
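To make the ETL/ELT distinction concrete, here is a toy Python sketch (illustrative only, not Equalum's code; the `transform` step and field names are invented): the two patterns apply the same transformation, but ETL runs it in the pipeline before loading, while ELT loads raw rows and lets the destination transform them afterward.

```python
# Toy illustration of where the "T" sits in ETL vs. ELT.
# (Hypothetical example; real pipelines run transforms in Spark or in
# the destination warehouse, not in plain Python lists.)

def transform(rows):
    """Example transform: derive an integer cents field from a float amount."""
    return [{**r, "amount_cents": round(r["amount"] * 100)} for r in rows]

def etl(source_rows, warehouse):
    # ETL: transform in the pipe, then load the finished rows.
    warehouse.extend(transform(source_rows))

def elt(source_rows, warehouse):
    # ELT: load raw rows first; the destination transforms them later.
    warehouse.extend(source_rows)
    warehouse[:] = transform(warehouse)

src = [{"id": 1, "amount": 9.99}]
w_etl, w_elt = [], []
etl(src, w_etl)
elt(src, w_elt)
# Both end in the same state; what differs is where the compute happened.
```

The end state is identical either way; the practical difference is which system pays the transformation cost, which is why ELT has grown alongside cheap, elastic warehouse compute.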
There are also real-time streaming systems, including those based on messaging technologies like Apache Kafka, that can move data through a pipeline in real time. And modern companies are also implementing data pipelines as microservices and APIs, using protocols like REST to exchange the data.
Large companies are likely using multiple types of data pipelines, with a mishmash of packaged ETL products combined with hand-built and framework-based pipelines, all working together to move data at the frequency their business units demand.
Some might see this data pipeline sprawl in the enterprise as a problem. Nir Livneh, the CEO and founder of Equalum, sees it as an opportunity.
Livneh is advocating for a new type of data pipeline: the multi-modal data pipeline. By stitching together ETL, ELT, and CDC data pipeline capabilities into a single platform, Equalum can solve both batch and real-time use cases, and thereby stay adaptable to customers’ changing data requirements, Livneh says.
“If you look at companies today, everything is changing with how things are becoming more digital and analytics is becoming the center of the organization,” says Livneh, who has been building big data systems for two decades, including at Quest Software and with the Israeli Intelligence Forces.
“There’s a lot more use cases coming from analytics these days,” he continues. “Some of them require real time service level agreements. Some require more of a historical approach, post-mortem analytics, operational analytics. And that requires multiple approaches of how you want to integrate and ingest data into those analytic models.”
Equalum’s software, which utilizes the Apache Spark and Apache Kafka frameworks, supports traditional ETL data pipelines, where the transformation occurs in the pipe (using Spark routines, with the data persisted to Kafka). It supports ELT workloads, where the transformation occurs in the destination (often Snowflake or Databricks, Livneh says). And it also supports CDC processing with a broad suite of low-level connectors for popular relational databases.
The idea behind being multi-modal is that it allows customers to adapt their data pipelines over time, and to be able to handle new data workloads that emerge in the future, Livneh says.
“A lot of what I’ve seen in my career was IT or data teams building and then starting to stitch” products together, he says. “What you thought initially is not what it’s going to look like a year from now. There’s a lot of mistakes that will be made in future-proofing your architecture.”
In the ideal world we’re currently building in the cloud, every application is instrumented with APIs that allow analysts, data scientists, and data engineers to access all the data they need, at whatever interval they require.
But in the real world, legacy applications squirreled away into data closets and server rooms process the bulk of transactions. They don’t have APIs. In fact, the IT department won’t even let you touch the application because the original developer died 10 years ago and left an undocumented mess.
“We’re not living in a Google world. Not everybody has this amazing data-driven microservices approach, where everything connects and feeds automatically,” Livneh says. “Those architectures are the luxury of new companies. The old companies, they have legacies. They have systems that nobody has touched for years.”
That’s where CDC comes in handy, Livneh says. Equalum developed its own CDC technology to extract binary data from a variety of popular databases, including Oracle, SQL Server, Postgres, MySQL, and Db2 (including Db2 for Linux, Unix, and Windows, and Db2 for i; it doesn’t support Db2 for z/OS).
In addition to extracting data from databases, it can extract data from web services, applications, file systems, and message queues. Once the data is extracted, it relies on Spark-based business logic to transform and prep the data for its target, and it relies on Kafka to move the data.
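The CDC pattern described above can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not Equalum's implementation: a real connector tails the database's binary or redo log, while here a plain list of entries with log sequence numbers (LSNs) plays that role, and another list stands in for the Kafka topic.

```python
# Toy change-data-capture loop (illustrative only).
# transaction_log stands in for a database binary log; downstream
# stands in for a Kafka topic feeding the target system.

transaction_log = [
    {"lsn": 1, "op": "insert", "row": {"id": 1, "name": "ada"}},
    {"lsn": 2, "op": "update", "row": {"id": 1, "name": "ada lovelace"}},
]

downstream = []    # messages published for downstream consumers
last_offset = 0    # last LSN already shipped (a committed offset)

def capture_changes():
    """Ship only unseen log entries downstream, then advance the offset."""
    global last_offset
    for entry in transaction_log:
        if entry["lsn"] > last_offset:
            # A light in-pipe transform before publishing (here: title-case
            # the name field, standing in for Spark-based business logic).
            row = {**entry["row"], "name": entry["row"]["name"].title()}
            downstream.append({"lsn": entry["lsn"], "op": entry["op"], "row": row})
            last_offset = entry["lsn"]

capture_changes()   # first run ships both log entries
capture_changes()   # second run ships nothing: no new changes to capture
```

The key property is that each run moves only the changes written since the last committed offset, which is what lets CDC feed a continuous stream of updates downstream without repeatedly re-reading whole tables.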
Spark and Kafka are powerful frameworks for building data pipelines, but many people are gun-shy when it comes to managing large clusters, especially after years of Hadoop hoopla, Livneh says.
“You spend $5 million on Hadoop. Everybody does that 10 years ago,” he says. “Then suddenly there’s Spark, and everybody says, OK, Hadoop does not make sense anymore. What do you do? Spend another $5 million on Spark? It doesn’t make sense.”
Equalum, which is nearly six years old, appears to be gaining traction with its multi-modal approach to building and maintaining data pipelines. The company counts Fortune 500 firms like Siemens and Warner Bros. among its customers, as well as a few other Fortune 100 ecommerce, retail, and pharmaceutical clients that it cannot name. It’s also raised $25 million.
Looking forward, Livneh is hoping to keep adapting his product to keep his clients nimble. On the roadmap is something called autonomous pipelines, which he likens to IntelliSense for data modeling.
“It’s just going to make you a lot more productive because you’re not going to have to worry about every single point in your data,” he says. “The product will do a lot of that stuff for you.”