Building Continuous Data Observability at the Infrastructure Layer
Data is the lifeblood of business today, but getting it where it needs to go is hard, especially as data volumes grow. Data pipelines have become the repeatable method for moving this digital crude, but monitoring those flows for possible data errors becomes more difficult as the number of pipelines increases. For companies facing this dilemma, embracing continuous data delivery at the infrastructure layer can provide some relief.
The pressures put on data engineers by data-driven companies have been well documented. In fact, 97% of data engineers say they’re burned out, and 78% say they wish their jobs came with a therapist, according to a 2021 report on the critical position. At the same time, the quality of the data running through pipelines isn’t great. A survey earlier this year found that 56% of CRM users reported missing or incomplete data, and another 46% reported incorrect data.
Data can go wrong in any number of ways, and humans will invariably be needed to make it right. It’s all about keeping service level agreements (SLAs) intact, says Arvind Prabhakar, the CTO and co-founder of StreamSets, which was acquired by German software giant Software AG for about $580 million earlier this year.
“These SLAs are well understood by the operational staff and the development staff,” says Prabhakar, who previously worked at Hadoop distributor Cloudera and ETL king Informatica. “They get it. They will wake up in the middle of the night if the warehouse reports are not ready by the morning for the exec reviews.”
StreamSets, which Prabhakar and former Informatica executive Girish Pancha founded in 2014, made its mark on the industry by developing software that tracks changes in data values in a pipeline over time. This data drift can be insidious because it’s difficult to detect and can wreak havoc on metrics and reports by surreptitiously changing values over time.
Today StreamSets is pushing the concept of continuous data, which reflects the need for data pipelines to run 24/7 without the benefit of a human engineer constantly watching them. Instead, the data monitoring and observability layer should become part of the infrastructure rather than sitting separate from it, Prabhakar says.
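To make the idea concrete, here is a minimal sketch of what observability baked into the pipeline itself could look like: each stage is wrapped so that record counts and error rates are emitted automatically, with no engineer watching. The wrapper API and metric names are invented for illustration; this is not StreamSets’ actual product interface.

```python
from typing import Callable, Iterable, Iterator

def observed_stage(name: str, fn: Callable[[dict], dict],
                   metrics: dict) -> Callable[[Iterable[dict]], Iterator[dict]]:
    """Wrap a per-record transform so the pipeline self-reports its health."""
    def stage(records: Iterable[dict]) -> Iterator[dict]:
        stats = metrics.setdefault(name, {"in": 0, "out": 0, "errors": 0})
        for record in records:
            stats["in"] += 1
            try:
                yield fn(record)
                stats["out"] += 1
            except Exception:
                # Errors are counted rather than silently dropped; an SLA
                # monitor can alert when the error rate crosses a threshold.
                stats["errors"] += 1
    return stage

metrics: dict = {}
parse = observed_stage("parse", lambda r: {**r, "amount": float(r["amount"])}, metrics)
results = list(parse([{"amount": "3.5"}, {"amount": "oops"}]))
# metrics["parse"] now reads {"in": 2, "out": 1, "errors": 1}
```

The point of the design is that health reporting is a property of the infrastructure, not a separate dashboard someone has to remember to check.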
“Sometimes the problems are so nuanced or esoteric or just complicated that people just come to the conclusion ‘This requires manual supervision,’” he says. “[Or they say] ‘I don’t have control over it.’”
However, this hands-on approach doesn’t work at today’s massive data volumes or with today’s huge data varieties, he says.
“In a world where a vast majority of the data no longer resides in highly consistent, correct, pre-thought-out, predesigned systems, you cannot manually scale this,” Prabhakar says. “You cannot have one person focusing on one pipeline or 10 pipelines and making sure that everything is done right and there’s no needle in the haystack that’s going to blow up a big problem in the system.”
Any number of problems can creep into data: badly formatted JSON, inconsistency between IPv4 and IPv6 addresses, a shift from 10-digit to 12-digit numbers, or the right data landing in the wrong column. There are many ways data can “drift” out of alignment over time.
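The failure modes above can all be caught mechanically with record-level checks. The following sketch illustrates the idea using Python’s standard library; the field names and rules are hypothetical examples, not StreamSets’ implementation.

```python
import ipaddress
import json

def check_record(raw: str, expected_fields: set) -> list:
    """Return a list of drift warnings for one raw JSON record."""
    warnings = []
    # Badly formatted JSON: the record cannot be parsed at all.
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return ["malformed JSON"]
    # Schema drift: fields appearing or disappearing over time.
    fields = set(record)
    if fields - expected_fields:
        warnings.append(f"unexpected fields: {sorted(fields - expected_fields)}")
    if expected_fields - fields:
        warnings.append(f"missing fields: {sorted(expected_fields - fields)}")
    # Mixed address families: IPv6 showing up in a column built for IPv4.
    if "client_ip" in record:
        try:
            if ipaddress.ip_address(record["client_ip"]).version == 6:
                warnings.append("IPv6 address in a field expected to be IPv4")
        except ValueError:
            warnings.append("unparseable IP address")
    # Value drift: an identifier growing beyond its expected 10 digits.
    if "account_id" in record and len(str(record["account_id"])) != 10:
        warnings.append("account_id is not 10 digits")
    return warnings
```

Running checks like these on every record is cheap; the hard part, as Prabhakar notes, is that someone has to anticipate each rule, which is why hand-maintained checks stop scaling as pipelines multiply.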
It doesn’t matter whether one’s pipelines are developed using traditional ETL tools such as those from Informatica or Talend, which have full visibility into the data, or built using schema-less tools like Kafka that have no visibility into the payload, he says. Neither one handles data drift, which is what StreamSets set out to address about eight years ago.
The number of applications and pipelines is exploding at the moment, which makes the issue of master data management a pressing one for companies that want to actually trust their data. During a conversation with an IT professional at a bank, Prabhakar asked how many applications the bank had.
“The answer was ‘We have 5,000 known applications,’” he tells Datanami. “And it was that qualifier she added that struck me: ‘known applications.’ Without any prompting, she went on to say, ‘You know, less than 1% are fully understood, documented, and controlled, and are considered critical path.’”
While a bank may have tens of thousands of established applications, new applications are proliferating even faster at many other companies. The IoT is proving to be an especially rich source of new application and data generation, and many companies are struggling to keep up with the massive number of new data pipelines pumping data at various intervals: batch, micro-batch, or streaming (StreamSets supports any interval).
With all that data looming, relying on humans to keep it running smoothly simply isn’t feasible, Prabhakar says. To support that continuous data load and keep hitting SLAs, the monitoring and observability of data drift must be pushed into the infrastructure itself and become part of the underlying pipeline, he says.
“What we’ve been trying to champion as part of the DataOps strategy is to say, look, you cannot scale that. The manual overlay to support the SLAs that your business processes require is no longer sustainable and supportable, not at this scale, not at the volumes we’re dealing with, not at the complexity you’re dealing with,” he says. “That’s the underlying theme of what we’ve been trying to champion all these years.”