April 21, 2021

Lighting Up Data’s GIGO Problem

(Profit_Image/Shutterstock)

It goes without saying that companies need good data if they’re going to become data-driven. But all too often, companies are working with data that’s erroneous, sub-standard, or just plain wrong. The latest vendor to emerge in the increasingly hot data quality segment of the big data market is Lightup, which today announced the beta of its data quality monitoring solution.

Data quality issues have existed since the dawn of the data age. But for Lightup CEO and co-founder Manu Bansal, the situation reached a head with his first software startup, Uhana, which developed a machine learning system for telecoms.

“We were building some pretty fancy machine learning models on top,” Bansal says of Uhana, which VMware acquired in 2019 for an undisclosed amount. “But where we struggled most was in making sure the data feeding those systems or those pipelines was clean. It was a garbage in, garbage out system.”

Garbage in, garbage out, or GIGO, of course, is the bane of data scientists everywhere. If the data that underlies machine learning algorithms is not trusted, then the decisions that ML models make cannot be trusted, either. This was a problem at Uhana.

“We just didn’t have any visibility into the health of the data feeding the system,” Bansal tells Datanami. “That to me was very frustrating because those issues would often get detected too late. They would already percolate all the way to the customer at times. And even when you would find them, this process of … root causing them would just be very ad hoc, very manual. We just didn’t have them in the playbook initially. So that’s what we set out to do with Lightup.”

Bansal co-founded Lightup with Rajiv Ramanathan and Vivek Joshi in 2019 to automate data quality checking. The service they built utilizes machine learning to detect when elements of data start to deviate from a baseline. When a Lightup SQL query detects a problem in a database table, the service sends an alert to customers via Slack, Mattermost, PagerDuty, Microsoft Teams, Flock, email, or a webhook. It’s then up to the customer to do something about it.
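To make the idea of deviating from a baseline concrete, here is a minimal sketch in Python. It is not Lightup's algorithm, which the article does not describe; it assumes a simple z-score test, and the function name `deviates_from_baseline`, the threshold of 3.0, and the sample row counts are all hypothetical.

```python
from statistics import mean, stdev

def deviates_from_baseline(history, latest, threshold=3.0):
    """Return True when `latest` sits more than `threshold` standard
    deviations away from the mean of `history` (a simple z-score test)."""
    if len(history) < 2:
        return False  # not enough history to form a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return latest != mu
    return abs(latest - mu) / sigma > threshold

# Hypothetical daily row counts for one table.
daily_row_counts = [1000, 1020, 990, 1010, 1005, 995, 1015]

normal_day = deviates_from_baseline(daily_row_counts, 1008)
bad_day = deviates_from_baseline(daily_row_counts, 120)
print(normal_day, bad_day)
```

In a real deployment the result for the bad day would feed an alerting channel such as the ones listed above; that delivery step is left out here.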

Lightup architecture

When problems are found, one of the most popular options is simply to stop the data pipeline, Bansal says. “Some have called it the circuit breaker pattern. Others call it fail fast,” he says. “The idea is that you want to stop the flow of data if you detect something funny, because no data is better than bad data, believe it or not.”
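The circuit breaker idea Bansal describes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Lightup's implementation: `run_pipeline`, `DataQualityError`, and the `non_empty_with_amounts` check are hypothetical names, and a plain list stands in for the warehouse.

```python
class DataQualityError(Exception):
    """Raised to halt the pipeline when a data quality check fails."""

def run_pipeline(batches, check, load):
    """Load batches in order, stopping at the first batch that fails `check`.

    Returns the number of batches loaded when every batch passes.
    """
    loaded = 0
    for batch in batches:
        if not check(batch):
            # Fail fast: nothing after the bad batch reaches the destination.
            raise DataQualityError(
                f"check failed after {loaded} batches; pipeline halted"
            )
        load(batch)
        loaded += 1
    return loaded

def non_empty_with_amounts(batch):
    """Hypothetical check: the batch has rows and every row has an amount."""
    return len(batch) > 0 and all(row.get("amount") is not None for row in batch)

warehouse = []  # stand-in for the destination table
batches = [
    [{"amount": 10}, {"amount": 12}],
    [{"amount": None}],  # bad batch: trips the breaker
    [{"amount": 9}],     # never loaded
]

try:
    run_pipeline(batches, non_empty_with_amounts, warehouse.extend)
except DataQualityError as err:
    print(err)
```

Only the first batch lands in `warehouse`; the batch after the bad one is held back as well, which is the "no data is better than bad data" trade-off in practice.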

Lightup runs as a service atop Kubernetes, and is designed to execute its SQL-based data quality checking routines within a customer’s cloud data warehouse, such as Snowflake, Amazon Redshift or Athena, Google Cloud BigTable, or Databricks. (Alternatively, it can work with a Kafka bus, but the data must first be landed into a “sidecar” repository to establish and maintain a baseline of data quality metrics, Bansal says. That process is invisible to the user, he says.)

Lightup’s data quality checks run as native SQL routines in the cloud data warehouse, which gives it the scale it needs to find problem data amid the vastness of the big data space. Traditional approaches to data quality typically require customers to move their data into the data quality tool, but that doesn’t fly when huge amounts of data are moving quickly, Bansal says.

“It’s something we’re hearing from Fortune 500 companies now,” he says. “They have [an existing data quality tool] and they say it’s not scaling. They want checks to happen within an hour, let’s say, on a terabyte of data. They want something that can [work] with the architecture and apply the check where the data already lives, which is your warehouse or the data lake.”

After the initial setup, Lightup is designed to run in the background, and only alert users when something has gone wrong with the data. There are a number of data issues it can detect, including problems with:

  • Data availability: Is there zero data where there is supposed to be data? Is data being updated at the expected interval?
  • Data conformity: Has the data format changed?
  • Data validity: Have the values of collected data points changed?
  • Schema consistency: Is the data schema still correct?
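The four categories above can each be expressed as a SQL query that runs where the data lives. The sketch below is an assumption-laden illustration rather than Lightup's actual checks: it uses an in-memory SQLite database as a stand-in for a cloud data warehouse, and the `orders` table, its columns, and the rules (dates formatted as YYYY-MM-DD, non-negative amounts) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a cloud data warehouse
conn.execute("CREATE TABLE orders (id INTEGER, created_at TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "2021-04-20", 19.99), (2, "2021-04-21", 5.00), (3, "21/04/2021", -3.00)],
)

def availability(conn, day):
    """Data availability: did any rows arrive for the given day?"""
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE created_at = ?", (day,)
    ).fetchone()
    return n > 0

def conformity_violations(conn):
    """Data conformity: count rows whose date is not formatted YYYY-MM-DD."""
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE created_at NOT GLOB "
        "'[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]'"
    ).fetchone()
    return n

def validity_violations(conn):
    """Data validity: count rows that break an assumed rule (amount >= 0)."""
    (n,) = conn.execute("SELECT COUNT(*) FROM orders WHERE amount < 0").fetchone()
    return n

def schema_matches(conn, expected):
    """Schema consistency: compare (column name, type) pairs with `expected`."""
    actual = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(orders)")]
    return actual == expected

expected_schema = [("id", "INTEGER"), ("created_at", "TEXT"), ("amount", "REAL")]
print(availability(conn, "2021-04-21"), conformity_violations(conn),
      validity_violations(conn), schema_matches(conn, expected_schema))
```

Because each check is a single aggregate query, only a count or a boolean leaves the database, which is the point of running the checks in place instead of moving the data into a separate tool.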

Lightup provides two ways to interact with its product for two classes of users. The first is a dashboard that will appeal to analysts who are expecting an easy-to-use, shrink-wrapped product. The second is an API layer that can be used by data engineers.

This class of users says, “I want to bring this into my own CI/CD pipeline, my own data orchestration pipelines,” Bansal says. “If I’m running jobs with dbt or Spark, scheduling them with Airflow and these jobs are executing every hour, then I want to trigger those tests when data gets updated.”
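One way to picture "trigger those tests when data gets updated" is a small hook that an orchestration job calls after each run. The sketch below does not use Lightup's API or any Airflow, dbt, or Spark interface; `run_checks_if_updated` and the two sample checks are hypothetical, and it assumes the table exposes a last-updated timestamp in ISO 8601 form, so that string comparison matches time order.

```python
from typing import Callable, Dict, Optional, Tuple

Check = Callable[[], bool]

def run_checks_if_updated(
    last_seen: Optional[str],
    current: str,
    checks: Dict[str, Check],
) -> Tuple[str, Dict[str, bool]]:
    """Run every check only when the table's update timestamp has advanced.

    Returns the timestamp to remember for next time and the check results
    (an empty dict when nothing new has arrived).
    """
    if last_seen is not None and current <= last_seen:
        return last_seen, {}
    results = {name: check() for name, check in checks.items()}
    return current, results

# Hypothetical checks; real ones would query the warehouse.
checks = {"has_rows": lambda: True, "no_negative_amounts": lambda: False}

# First hourly run: new data has landed, so the checks execute.
state, results = run_checks_if_updated(None, "2021-04-21T10:00:00", checks)

# Second run: the table has not changed, so nothing is re-tested.
state2, results2 = run_checks_if_updated(state, "2021-04-21T10:00:00", checks)
print(results, results2)
```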

Lightup is designed to be operated in a semi-supervised manner, since there will be situations where humans need to collaborate to figure out what’s going wrong with the data. For example, a gaming company may detect a significant drop in the number of transactions. Is that because user behavior changed in response to something like the pandemic or the launch of a competing gaming platform? Or is it because the log data collector stopped working?

“That’s something that the data engineer can usually not interpret by themselves and they start to bring in others from the organization,” Bansal says. “It starts to become a collaborative exercise.”

The system also needs flexibility to adapt to the cadence of data movement. For example, a financial institution may anticipate lots of transactions occurring once a month, but whether that occurs on the 13th or the 14th of the month can’t be predicted.
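A tolerance window is one simple way to handle an event whose exact day can't be predicted. The following sketch assumes that approach for illustration; it is not taken from Lightup, and `spike_within_window`, the threshold of 5,000, and the April figures are hypothetical.

```python
def spike_within_window(daily_counts, expected_day, tolerance_days, spike_threshold):
    """Return True if any day within `tolerance_days` of `expected_day`
    reaches `spike_threshold` transactions.

    `daily_counts` maps day of month to transaction count; days that are
    missing from the mapping are treated as zero.
    """
    window = range(expected_day - tolerance_days, expected_day + tolerance_days + 1)
    return any(daily_counts.get(day, 0) >= spike_threshold for day in window)

# Hypothetical month in which the big batch landed on the 14th, not the 13th.
april = {12: 900, 13: 1100, 14: 9800, 15: 950}

strict = spike_within_window(april, expected_day=13, tolerance_days=0, spike_threshold=5000)
tolerant = spike_within_window(april, expected_day=13, tolerance_days=1, spike_threshold=5000)
print(strict, tolerant)
```

The strict check would raise a false alarm on the 13th, while the tolerant one does not. The width of the window is exactly the kind of setting a person familiar with the business would adjust.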

“It’s very easy to forecast using black box techniques, which will make assumptions that won’t hold true in practice,” Bansal says. “That tends to be a very big problem. It’s easy to go overboard to the point where you’re creating alert fatigue. Now no one wants to look at the system anymore.”

Bansal has found that what works best is a semi-supervised approach that combines human intuition and time-series algorithms. If the algorithm starts to generate faulty alerts on data quality issues, then users can tune it before setting it loose in production.

Lightup is accepting applications for its beta program. The company is welcoming users who will provide feedback and help to guide the development of the product. The beta is free to use for up to 10 data quality indicators (DQIs), which is the unit that Lightup uses to bill for its service. One DQI can run data quality checks against data residing in one table. The company offers a paid service that starts at 20 DQIs, priced at $10 per table per month. More info is at www.lightup.ai.
