Data Pipeline Automation: The Next Step Forward in DataOps
The industry has largely settled on the notion of a data pipeline as a means of encapsulating the engineering work that goes into collecting, transforming, and preparing data for downstream advanced analytics and machine learning workloads. Now the next step forward is to automate that pipeline work, which is a cause that several DataOps vendors are rallying around.
Data engineers are some of the most in-demand people in organizations that are leveraging big data. While data scientists (or machine learning engineers, as many of them are calling themselves nowadays) get most of the glory, it’s the data engineers who do much of the hands-on-keyboard work that makes the magic of data science possible.
Just as data science platforms have emerged to automate the most common data science tasks, we are also seeing new software tools emerging to handle much of the repetitive data pipeline work that is typically handled by data engineers. Data scientists have tools like Kubeflow and Airflow to automate machine learning workflows, but data engineers need their own DataOps tools for managing the pipeline.
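The dependency-ordered task execution at the heart of workflow tools like Airflow can be illustrated with a toy sketch in plain Python. The task names and stub functions below are invented for illustration; real orchestrators layer scheduling, retries, and monitoring on top of this basic idea:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical pipeline steps; the names and return values are stand-ins.
def ingest():    return "raw rows"
def transform(): return "cleaned rows"
def load():      return "rows loaded"

# Declare dependencies: transform needs ingest, load needs transform.
dag = {"ingest": set(), "transform": {"ingest"}, "load": {"transform"}}
tasks = {"ingest": ingest, "transform": transform, "load": load}

# Run every task after its predecessors, in topological order.
results = {}
for name in TopologicalSorter(dag).static_order():
    results[name] = tasks[name]()

print(results["load"])  # rows loaded
```

The point of the sketch is simply that once pipeline steps and their dependencies are declared as data, the ordering and execution can be automated rather than hand-wired.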
In a recent white paper on DataOps, the Eckerson Group explains that the need for better automation comes largely from the immaturity of data analytics pipelines.
“The development of data and analytics pipelines, both simple and complex, is still a handcrafted and largely non-repeatable process with minimal reuse, managed by individuals working in isolation with different tools and approaches,” Wayne Eckerson and Julian Ereth write in “DataOps: Industrializing Data and Analytics.” “The result is both a plodding development environment that can’t keep pace with the demands of a data-driven business and an error-prone operational environment that is slow to respond to change requests.”
The emerging DataOps field borrows many concepts from DevOps techniques used in general software engineering, including a focus on agility, leanness, and continuous delivery, Eckerson Group writes. The core difference is that it’s implemented in a data analytics environment that touches many data sources, data warehouses, and analytic methodologies.
“As data and analytics pipelines become more complex and development teams grow in size,” Eckerson and Ereth write, “organizations need to apply standard processes to govern the flow of data from one step of the data lifecycle to the next – from data ingestion and transformation to analysis and reporting. The goal is to increase agility and cycle times, while reducing data defects, giving developers and business users greater confidence in data analytics output.”
There are a handful of vendors delivering shrink-wrapped solutions in this area, and not (yet) many open source tools. While DataOps is growing in recognition and need, the tools that support automated data pipeline flows are relatively new, Eckerson Group writes. That represents a market opportunity for software vendors and a new niche to fill.
One of the vendors automating data pipeline work is Infoworks.io. The Palo Alto, California company says its Autonomous Data Engine addresses a wide range of data engineering tasks, from the point of ingestion and change data capture to shaping the data and preparing it for consumption by analytics.
“This is where typically 80% of the time is spent,” says Amar Arsikere, CTO, chief product officer, and co-founder of Infoworks. “Ingesting the data, keeping it synchronized, transforming it, preparing the data models, accelerating those models into the right query performance — we are taking that heavy lifting that people have to do with a high degree of automation.”
Arsikere co-founded Infoworks four years ago to address the challenge he saw in the big data space. A former Google engineer who developed the first BigTable data warehouse, Arsikere went on to build a large in-memory database at Zynga. Having solved the data pipeline problem twice, he realized there was demand for a shrink-wrapped solution in a market that was hand coding workflows for Hadoop and other big data ecosystems, and so he founded Infoworks.
Infoworks’ business is still ramping up, but it already has several large clients who are managing thousands of pipelines that touch multiple petabytes of data, according to Arsikere. “We have customers who have on-boarded 1,500 pipelines or use cases in a matter of a few months that otherwise would have taken them a year or more,” he says. “You still need a data engineer to use the product but we are making that data engineer much more productive.”
Another DataOps outfit automating the data pipeline is DataKitchen. The Cambridge, Massachusetts company uses a cooking metaphor to describe its DataOps platform. Multiple people can share and follow recipes while working in a data kitchen to build dishes from multiple ingredients.
Nexla also provides a DataOps platform that helps to automate the creation of data connections to databases and other repositories; the management of repetitive data transformation tasks; data schema management; and data lineage monitoring in big data ecosystems, such as Hadoop. The Millbrae, California company, which serves customers in the e-commerce, insurance, travel, and healthcare industries, also helps to automate the management of data in various data formats, like Parquet, ORC, and Avro.
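One of those repetitive tasks, schema management, amounts to checking incoming records against an expected schema before they move downstream. Here is a minimal sketch of that idea; the field names, types, and rules are invented for illustration and are not Nexla's actual API:

```python
# Hypothetical expected schema: field name -> required Python type.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "region": str}

def schema_violations(record: dict) -> list:
    """Return a list of problems found in a record (empty list = clean)."""
    problems = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"bad type for {field}: {type(record[field]).__name__}")
    return problems

print(schema_violations({"order_id": 1, "amount": 9.99, "region": "EU"}))  # []
print(schema_violations({"order_id": "1", "amount": 9.99}))
# ['bad type for order_id: str', 'missing field: region']
```

Automating this kind of check per pipeline, rather than re-coding it for each data source, is the sort of repetitive work these platforms aim to absorb.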
You can also find data pipeline automation solutions from Bedrock Data. The Boston, Massachusetts-based company sells a product called Fusion that focuses on fusing data from Software as a Service (SaaS) applications like Salesforce and Marketo. The software automatically creates a SQL data warehouse from the disparate data sources, and keeps it up to date, enabling analysts to access it via BI visualization tools.
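The core of that pattern, landing records from multiple SaaS sources in one queryable SQL store, can be sketched with Python's built-in sqlite3 module. The table layouts and sample records below are invented stand-ins, not Bedrock Data's actual design:

```python
import sqlite3

# Toy stand-ins for records pulled from two SaaS APIs (invented data).
salesforce_contacts = [("alice@example.com", "Acme"), ("bob@example.com", "Globex")]
marketo_leads = [("alice@example.com", "webinar"), ("carol@example.com", "ebook")]

# Land both sources as tables in one SQL store (in-memory for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (email TEXT, account TEXT)")
conn.execute("CREATE TABLE leads (email TEXT, source TEXT)")
conn.executemany("INSERT INTO contacts VALUES (?, ?)", salesforce_contacts)
conn.executemany("INSERT INTO leads VALUES (?, ?)", marketo_leads)

# Analysts can now join the fused sources with plain SQL.
rows = conn.execute(
    "SELECT c.email, c.account, l.source "
    "FROM contacts c JOIN leads l ON c.email = l.email"
).fetchall()
print(rows)  # [('alice@example.com', 'Acme', 'webinar')]
```

The value such products add is keeping those tables continuously synchronized with the source applications, which is the part the sketch leaves out.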
StreamSets also serves the emerging DataOps and automated data pipeline markets. The San Francisco, California company touts its software offering as a “cross-platform data movement layer” that gives customers better visibility and control over the performance of their data pipelines, including detection of data drift.
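StreamSets uses "data drift" broadly, covering unexpected changes in a data source's structure and semantics. One simple, illustrative form of drift detection is a statistical check that flags a batch whose values shift away from an established baseline; the threshold and sample numbers below are invented:

```python
from statistics import mean, stdev

def drifted(baseline, batch, threshold=3.0):
    """Flag a batch whose mean moves more than `threshold` baseline
    standard deviations away from the baseline mean (toy heuristic)."""
    return abs(mean(batch) - mean(baseline)) > threshold * stdev(baseline)

baseline = [10.0, 11.0, 9.5, 10.5, 10.2]
print(drifted(baseline, [10.1, 9.9, 10.4]))   # False: within normal range
print(drifted(baseline, [42.0, 41.5, 43.0]))  # True: distribution has shifted
```

A production system would monitor many such signals per field, along with structural changes like new or renamed columns, but the principle is the same: compare what is arriving now against what the pipeline was built to expect.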
Apache Spark is a favorite tool among data engineers, as it provides a powerful environment for manipulating data for downstream analytics and machine learning. However, Spark is not the easiest tool to learn, and optimizing it can be difficult. Many of the data pipeline tools here use Spark under the covers, but hide some of Spark’s complexity.
Many of these data pipeline products run on Hadoop and Spark environments, and can also run on public clouds. They offer pre-built connectors to common data sources, such as HDFS, S3-compatible object stores, Excel, FTP, and relational and NoSQL databases.
While DataOps is emerging as an important discipline, the product category is not nailed down, and different vendors offer different parts of the solution. In Infoworks’ case, the company also includes a built-in data catalog and an OLAP cube component that allows users to rapidly roll out interactive BI applications. Having all these layers built into the data pipeline product makes sense for users, Arsikere says.
“By building something end to end, for everything that a data engineer has to do, there is no third-party integration,” Arsikere says. “There’s no external coding or glue code that’s required to assemble different components. So what that means is it’s easier to get your use case built. That’s what we’re focused on — how can an enterprise build and deploy these use cases fast.”