Leaky Pipelines and the Business Case for Data DevOps
As enterprises pursue digital transformation, migrating critical infrastructure and applications to the cloud has become a key part of the effort, and what can be called “data clouds” have started to take shape. Built on multi-cloud data infrastructure, such as Databricks’ or Snowflake’s data platforms, these data clouds enable businesses to break free from application and storage silos and share data across on-premises, private, public, and hybrid cloud environments.
As a result, data volumes have skyrocketed. Big data-enabled applications generate and ingest more data, and more types of data, from a growing variety of sources, including AI, machine learning (ML), and IoT systems. The nature of data itself is changing radically, in both the volume and the shape of data sets.
As data is freed from constraints, visibility into the data lifecycle gets fuzzy, and traditional quality control tools quickly become obsolete.
Bad Data Flows Just as Easily Through Data Pipelines as Good Data
For the typical enterprise, data monitoring and management are still handled by legacy tools that were designed for a different era, such as Informatica Data Quality (released in 2001) and Talend (released in 2005). These tools were designed to monitor siloed, static data, and they did that well. However, as new technologies entered the mainstream (big data, cloud computing, and data warehouses, lakes, and pipelines), data requirements changed.
The legacy data quality tools were never designed (or intended) to serve as quality control tools for today’s complex, continuous data pipelines, which carry data in motion from application to application and cloud to cloud. Yet data pipelines frequently feed data directly into customer experience and business decision-making software, which opens up massive risks.
A good example of how bad data can escape notice and undermine business goals is “mistake airline fares.” Erroneous currency conversions, human input errors, and even software glitches generate mistake fares so often that some travel experts specialize in finding them.
This is just one example. Bad data can result in incorrect credit scores, shipments sent to the wrong addresses, product flaws, and more. Market research firm Gartner has found that “organizations believe poor data quality to be responsible for an average of $15 million per year in losses.”
Where Are the Safety Inspectors and Cleanup Crews for Data in the Pipelines?
As developers rush to catch up to the challenges of maintaining and managing data in motion at scale, most first turn to the DevOps and CI/CD practices they used to build modern software applications. To port those practices to data, however, there is a key challenge: developers must understand that data scales differently than applications and infrastructure.
With applications increasingly being powered by data pipelines from cloud-based data lakes and warehouses and streaming data sources (such as Kafka and Segment), there needs to be continuous monitoring of the quality of these data sources to prevent outages from occurring.
Organizations must ask: who is responsible for inspecting data before it hits data pipelines, and who cleans up the mess when data pipelines leak or feed bad data into mission-critical applications? As of now, the typical business’s approach to pipeline problems and outages is purely reactive: scrambling for fixes after applications break.
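As a rough illustration of the continuous monitoring described above, the sketch below tracks how often a field is missing across the most recent records of a stream and raises a flag once that rate crosses a limit. It is a minimal sketch, not a production monitor: the class name `NullRateMonitor`, the `user_id` field, the window size, and the threshold are all assumptions chosen for the example, and a real deployment would read from a source such as Kafka rather than from in-memory dictionaries.

```python
from collections import deque


class NullRateMonitor:
    """Tracks the share of recent records in which a field is missing (hypothetical helper)."""

    def __init__(self, field, window=100, max_null_rate=0.05):
        self.field = field
        self.max_null_rate = max_null_rate
        # Only the most recent `window` observations are kept.
        self._flags = deque(maxlen=window)

    def null_rate(self):
        """Fraction of records in the current window where the field is missing."""
        if not self._flags:
            return 0.0
        return sum(self._flags) / len(self._flags)

    def observe(self, record):
        """Record one event; return True if the window's null rate exceeds the limit."""
        self._flags.append(record.get(self.field) is None)
        return self.null_rate() > self.max_null_rate


# Usage: feed each record as it arrives and alert when observe() returns True.
monitor = NullRateMonitor("user_id", window=4, max_null_rate=0.25)
for event in [{"user_id": 1}, {"user_id": 2}, {"user_id": None}, {}]:
    if monitor.observe(event):
        print("null rate too high:", monitor.null_rate())
```

The point of the rolling window is that the check runs against the data actually flowing through the pipeline, so a problem surfaces while it is happening instead of after an application breaks.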
Why Enterprises Need Data DevOps
If today’s typical multi-cloud, data-driven enterprise hopes to scale data platforms with agile techniques, DevOps and data teams should look back on their own evolution to help them plan for the future, specifically taking note of a missing component as DevOps is ported to data: Site Reliability Engineering (SRE).
DevOps for software only succeeded because a strong safety net, SRE, matured alongside it. The discipline of SRE gave organizations a way to monitor the behavior of software after deployment, so that in-production apps meet SLAs in practice, not just in theory. Without SRE, the agile approach would be too risky and error-prone to rely on for business-critical applications and infrastructure.
Modern data pipelines would benefit from something similar to SRE: Data Reliability Engineering.
Some organizations already do dev/stage testing on their data software, but standard dev/stage testing is barely a quality check for big data in motion. Data has characteristics that make it impossible to manage through traditional testing practices. For starters, it’s harder to test data because data is dynamic. The data that will be flowing through your pipes – often generated through apps ingesting real-time information – may not even be available at the time of development or pipeline deployment.
If you rely on dev/stage testing, plenty of bad data can flow through your good data pipelines, resulting in outages and errors, but your quality control tools won’t be able to spot the problem until well after something goes wrong.
Your testing may tell you that data sets are reliable, but that’s only because you’re testing best-guess samples, not real-time data flows. Even with perfect data processing software, you can still end up with garbage data flowing through your pipeline, because pipes are designed for flow, not for quality control.
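The gap between a development-time sample and live data can be sketched in a few lines. The example below builds a baseline from a sample that was available during testing, then checks whether each live batch stays close to it. The function names, the fare values, and the three-sigma limit are assumptions made for illustration; the bad batch imitates a currency conversion error of the kind behind mistake fares.

```python
from statistics import mean, stdev


def build_baseline(sample):
    """Summarize a development-time sample as (mean, standard deviation)."""
    return mean(sample), stdev(sample)


def batch_deviates(baseline, batch, max_sigmas=3.0):
    """Return True if a live batch's mean is far from the baseline mean."""
    base_mean, base_std = baseline
    if not batch:
        return True  # an empty batch is itself suspicious
    if base_std == 0:
        return mean(batch) != base_mean
    return abs(mean(batch) - base_mean) / base_std > max_sigmas


# The sample looked fine at test time (hypothetical fares).
baseline = build_baseline([420.0, 450.0, 480.0, 510.0, 440.0])

# In production, one batch is plausible and one is off by a factor of 100.
print(batch_deviates(baseline, [430.0, 470.0, 455.0]))  # False
print(batch_deviates(baseline, [4.30, 4.70, 4.55]))     # True
```

A dev/stage test that only ever sees the sample would pass in both cases. Only a check that runs on each live batch can catch the second one.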
Getting Started with Data DevOps and Data Reliability Engineering
Developing Data DevOps capabilities shouldn’t be a heavy lift for organizations that already embrace agile and DevOps practices. The trick is carving out new roles and competencies tailored to the unique characteristics of today’s sprawling, ever-moving, high-volume, cloud-enabled data.
If you follow the six steps below to lay the proper quality control foundation, your organization will be well on its way to reining in out-of-control data.
1. Embrace Data DevOps and Clearly Define the Role
Modern data presents different challenges than legacy static data (and the systems that support it), so be sure to clearly differentiate Data DevOps roles from closely related positions. For instance, data engineers are not quality control specialists, nor should they be. They have different priorities. The same is true for data analysts and other software engineers.
2. Determine How and Where DREs Will Fit Into Your Organization
The Data Reliability Engineer (DRE) should work closely with DataOps/DevOps teams, but the role should be created within data teams. To ensure continuous quality, the DRE must be involved in all the key steps in the data creation and management process.
3. Provide the Tools that Set Your Data DevOps Team Up for Success
Data DevOps should have its own set of tools, expertise, and best practices, some drawn from related fields (software testing, for instance), others developed to meet the unique challenges of high-volume, high-cardinality data in motion.
4. Determine How to Author and Maintain Quality Checks and Controls
Many data quality programs fail because the legacy and home-grown tools used to author quality checks struggle to handle complexity. These tools also tend to be complex themselves and difficult to use, so they end up as shelfware. It’s critical to think through the process of updating and maintaining data quality checks as data evolves, relying on intuitive tools that make it easy to get the job done right.
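One way to keep checks easy to author and maintain is to treat each check as a small named rule kept in a list, separate from the code that runs it. The sketch below is only an illustration under that assumption: the `Check` type, the rule names, the `fare` and `currency` fields, and the currency list are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Check:
    """A named rule that a single record either passes or fails."""
    name: str
    predicate: Callable[[dict], bool]


# Rules live in one list, so adding or retiring a check is a one-line change.
CHECKS = [
    Check(
        "fare_is_positive",
        lambda r: isinstance(r.get("fare"), (int, float)) and r["fare"] > 0,
    ),
    Check(
        "currency_is_known",
        lambda r: r.get("currency") in {"USD", "EUR", "GBP"},
    ),
]


def run_checks(records, checks):
    """Count how many records fail each check, keyed by check name."""
    failures = {check.name: 0 for check in checks}
    for record in records:
        for check in checks:
            if not check.predicate(record):
                failures[check.name] += 1
    return failures


records = [
    {"fare": 450.0, "currency": "USD"},
    {"fare": -5, "currency": "USD"},
    {"fare": 300, "currency": "XXX"},
    {"currency": "EUR"},
]
print(run_checks(records, CHECKS))
```

Because the rules are data, not logic scattered through pipeline code, they can be reviewed and updated as the data evolves without touching the runner.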
5. Start Mapping Processes
Don’t forget to map out processes as the Data DevOps team evolves. Be sure your Data DevOps team knows what procedures to follow when a data outage occurs. The DRE may need to pull in other experts, such as data engineers, data analysts, or even business stakeholders, who can interpret the data and distinguish legitimate changes from quality issues.
How will that escalation process work in your organization?
6. Paint a Clear Picture of What Successful Remediation Looks Like
Big data remediation is a unique challenge. With dynamic data, certain types of remediation just don’t make sense. For instance, if you’re correcting problems that resulted in failed HTTP requests or slow page loads, those sessions are already lost and cannot be repaired after the fact.
What does successful remediation look like, and how do you know the problem is now under control?
Conclusion: Modern Data-Driven Apps Need Data DevOps to Ensure Mission-critical Data Reliability
Modern cloud-enabled, data-driven enterprises need reliable, high-quality data to meet their business goals. However, the complexity of data in modern environments means that businesses need DevOps not just for IT and applications, but also for data. Your Data DevOps team and DREs will have much in common with traditional DevOps and SREs, but Data DevOps is a discipline that must approach continuous data quality with a fresh set of eyes, figuring out how to plug dangerous gaps in the data lifecycle, especially with respect to data reliability and data quality.
For most enterprises, the next step toward getting data under control is simply taking a step, any step, toward ensuring continuous quality. Make data quality control a priority, embrace Data DevOps, and start mapping out how these new capabilities will fit with your existing DevOps, data, and testing teams, and you’ll be miles ahead of your competitors.
About the author: Manu Bansal is the co-founder and CEO of Lightup Data. He was previously co-founder of Uhana, an AI-based analytics platform for mobile operators that was acquired by VMware in 2019. He received his Ph.D. in Electrical Engineering from Stanford University.