A Bottom-Up Approach to Data Quality
Despite the amazing progress we’ve made in novel data processing techniques, poor data quality remains the bane of analytics. It’s why data scientists spend upwards of 80% of their time preparing and cleansing data instead of exploring the data and building models that leverage it. It’s also why one vendor’s bottom-up approach to data quality could be a model worth exploring.
Poor-quality data is “a huge problem,” Bruce Rogers, chief insights officer at Forbes Media, said earlier this year. “It leaves many companies trying to navigate the Information Age in the equivalent of a horse and buggy.” Awareness of data quality is another issue. A 451 Group study from 2016 found more than 80 percent of survey respondents believed their data quality was better than it actually was.
Most shops that are trying to innovate on data spend an inordinate amount of time cleaning up their data, and have invested in technologies and techniques to help them automate that task. A good number have tried centralized master data management (MDM) solutions to link all data to a master file, while the Hadoop-driven data lake phenomenon has brought a more distributed technique called self-service data preparation to the forefront.
While these solutions are neither simple nor cheap, they’re seen as the best approaches to avoiding the garbage in, garbage out (GIGO) syndrome. There’s no silver bullet to eliminate GIGO, but strict enforcement of data standards (for MDM) and use of machine learning algorithms to automate the fixing of common mistakes (for self-service data prep) are seen as the next-best alternatives.
Now a group of Michiganders is finding traction with another approach to solving the data quality issue. Simply put: Instead of fixing the data after it’s made its way into a data warehouse or data lake, the company, called Naveego, aims to find and fix data quality issues as they originate in business applications themselves.
“We’re really focused on the low-level data in the line of business applications and providing data quality solution to help business run better and monitor and manage their business processes,” says Derek Smith, CEO of Naveego.
Naveego didn’t start out with a data quality focus. Back when it was still part of Safety Net, the Naveego Business Intelligence platform provided a cloud-based repository for storing data from different systems for the purpose of building reports and graphs. It also had MDM capabilities, which helped during an engagement with Breitburn Energy.
“We saw some struggles they were having around MDM in general, and we looked at what we had built and said, ‘We can actually solve a lot of these problems for them,'” Smith tells Datanami. “We showed them the tool and it ended up being used for the MDM component. It was during that engagement that we really pivoted and learned the value of what we built.”
Safety Net spun Naveego out into its own company in 2014 to focus on MDM, and soon thereafter the Traverse City, Michigan company started developing data quality tools. It received some funding earlier this year, and is currently in the process of rolling out its new data quality solution, called Naveego DQS.
According to Smith, DQS enables companies to detect data quality issues that exist across multiple on-premises and cloud databases and applications. Once it helps the customer find the data quality issues, it helps them monitor the databases and applications so the issues don’t come back.
DQS works by leveraging the existing SQL scripts that most business analysts have already written. These SQL scripts are often run periodically, such as before a quarterly report is run, to ensure that the data is up to snuff and error-free.
“A lot of times these consultants or employees will have these go-to SQL queries they’re using, and they’re typically running them manually or running them at the end of the day to prepare a report to go out,” Smith says. “What we did is said, ‘Let’s connect to all of your system and allow you to use your SQL skills, but put it into an automated system that’s going to run every day for you and collect the information and also provide the visibility that the problem is corrected and it’s not re-occurring.’”
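To make the idea concrete, here is a minimal sketch of what "put the go-to SQL queries into an automated daily run" could look like. It is not Naveego's implementation: the table names (`vendors`, `invoices`, `dq_results`), the two checks, and the sample date are all hypothetical, and an in-memory SQLite database stands in for the real line-of-business systems.

```python
import sqlite3
from datetime import date

# Hypothetical checks: each query returns the rows that violate a rule.
CHECKS = {
    "vendor_missing_tax_id":
        "SELECT id FROM vendors WHERE tax_id IS NULL OR tax_id = ''",
    "invoice_negative_amount":
        "SELECT id FROM invoices WHERE amount < 0",
}


def run_checks(conn, checks, run_date):
    """Run every check once and store the violation count for that day."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dq_results "
        "(run_date TEXT, check_name TEXT, violations INTEGER)"
    )
    results = {}
    for name, sql in checks.items():
        violations = len(conn.execute(sql).fetchall())
        conn.execute(
            "INSERT INTO dq_results VALUES (?, ?, ?)",
            (run_date.isoformat(), name, violations),
        )
        results[name] = violations
    conn.commit()
    return results


def demo():
    """Build a tiny sample database (made-up data) and run the checks."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE vendors (id INTEGER PRIMARY KEY, name TEXT, tax_id TEXT);
        CREATE TABLE invoices (id INTEGER PRIMARY KEY, vendor_id INTEGER, amount REAL);
        INSERT INTO vendors VALUES (1, 'Acme', '12-345'), (2, 'Globex', NULL);
        INSERT INTO invoices VALUES (10, 1, 250.0), (11, 2, -40.0), (12, 2, 75.0);
        """
    )
    return run_checks(conn, CHECKS, date(2018, 1, 1))
```

Because each run appends a dated row per check to `dq_results`, a scheduler that calls `run_checks` once a day would build the history needed to see whether a problem was corrected or has come back.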
Naveego DQS has native access into Oracle, MySQL, and SQL Server databases, and an ODBC driver can get the company into most other databases, the company says. It also can automate data quality scripts running on cloud data sources via a REST API. Any issues that are detected are flagged and uploaded into a cloud-based reporting environment for long-term monitoring.
Mike Dominick, Naveego’s chief product officer and a former employee of Breitburn Energy, says Naveego DQS can start delivering better data in days. However, there’s no ready-made blueprint for fixing every type of data quality problem.
“Sometimes it’s going back into the cloud-based vendor management system and finish the provisioning process because they forget to add some critical information,” Dominick says. “Or it’s going into the accounting system to look at data that was maybe entered incorrectly. Or maybe it’s calling guys in the field because maybe their data is out of tolerance, or they’re missing data that wasn’t submitted from field workers.”
We’re currently in the midst of a movement away from monolithic business applications towards a more best-of-breed approach. However, existing data quality tools largely lack the ability to detect problems that occur across multiple systems that are linked together to execute business processes. This is the differentiator that Naveego hopes to exploit.
“Your traditional ERP or database platform can’t really go out to the cloud and say ‘Did the five things required in the cloud system that I need or that the system downstream needs to complete this process, is it all there?'” Dominick says. “We actually have the ability in our platform to look across all five systems as one cohesive component, and say, is everything in the right place at the right time, and if not, which department is responsible for that, and build that visibility so we can find the breakdown in those processes that are becoming more and more disruptive to the organization as you have these different silos that participate in a much bigger process.”
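The cross-system check Dominick describes can be sketched roughly as follows. This is an illustration under stated assumptions, not Naveego's design: the system names, field names, department owners, and the `find_gaps` helper are all hypothetical, and plain dictionaries stand in for data pulled from each system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    system: str  # which system must hold the value
    field: str   # which field must be populated
    owner: str   # department responsible for fixing a gap


# Hypothetical snapshot: system -> entity id -> fields found for that entity.
SYSTEMS = {
    "vendor_mgmt": {"V-100": {"tax_id": "12-345"}},
    "accounting": {"V-100": {"payment_terms": ""}},
    "field_ops": {},  # nothing submitted yet
}

# Hypothetical rules describing what the overall process needs.
REQUIREMENTS = [
    Requirement("vendor_mgmt", "tax_id", "Procurement"),
    Requirement("accounting", "payment_terms", "Finance"),
    Requirement("field_ops", "site_contact", "Operations"),
]


def find_gaps(entity_id, systems, requirements):
    """Return (system, field, owner) for every required value that is missing."""
    gaps = []
    for req in requirements:
        record = systems.get(req.system, {}).get(entity_id, {})
        if record.get(req.field) in (None, ""):
            gaps.append((req.system, req.field, req.owner))
    return gaps
```

Calling `find_gaps("V-100", SYSTEMS, REQUIREMENTS)` on this sample reports the empty payment terms in accounting and the missing field submission, each tagged with the department that would need to act.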
This approach won’t solve all data quality issues. But for data sitting in ERP systems and relational databases, which is a very valuable asset for many companies, it could get them closer to eliminating troublesome errors. And when this largely relational data makes its way into the data warehouse or the cloud, the need for other data quality checks will be reduced.
In fact, Smith says this approach will pay dividends that exceed what more complex data quality tools can provide.
“It’s really about providing a simple way to monitor and provide value at the foundation level,” the CEO says, “but then also provide visibility up to the C level so they can get an idea how efficiently everything is working and start to build trust in the data that’s ultimately feeding their analytics dashboards.”
A new release of the software issued last week adds support for NoSQL databases, such as MongoDB and Apache Cassandra. The company has also added Elasticsearch to provide analytics on large data sets and incorporated Apache Kafka to help ingest data.