Finding a Single Version of Truth Within Big Data
There seems to be an implicit promise associated with the rise of big data analytics: by taking more measurements and performing more calculations, we can deliver deeper insights atop source data, and do so more quickly than before. But this premise, that today's analytics can get us closer to a single version of the truth, may be harder to deliver on than first thought.
In the past, organizations would spend millions of dollars in their quest to achieve the best version of the truth. They would pull data out of source systems, cleanse it, standardize it according to some master data management (MDM) model, and then carefully prep it for analysis in a data mart or warehouse. It was an expensive and time-consuming practice, but it was the best approach available for getting accurate answers.
While today's enterprise data warehouses are very powerful and still useful in analytics, the rise of big data is causing organizations to rethink their approach to analytics. With big data technologies like Apache Hadoop and Apache Spark, companies can now store and analyze data at a scale they never could before.
However, all that power doesn't necessarily bring us closer to the Holy Grail of modern BI: one version of the truth.
Data, Data Everywhere
Frank Bien, the CEO of business intelligence software vendor Looker, says the explosion of data analytics tools, particularly visualization tools, is one reason why companies today are struggling to achieve a single source of the truth with big data.
"There's no governance of the data," Bien tells Datanami. "Everyone, in their own spreadsheet or visualization or workflow, is describing data differently. People get in a room and they can't agree on data metrics. 'Why are we calling this lifetime value? You're not describing that right!'"
Having a common set of agreed-upon definitions is obviously one of the first steps in achieving a single source of the truth. But enforcing those definitions at the data layer isn't always easy. Looker's solution to this dilemma is to bring data modeling back into the fold through LookML, a data modeling language that allows users to describe the data stored in big analytic data stores, such as Greenplum or Hadoop.
Once the data has been modeled, Looker helps transform it in place. That gives users the advantage of working from the full data set, while ensuring that it doesn't have to be moved by traditional ETL techniques, which can be costly and time-consuming in a big data environment.
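The "model once, query in place" idea can be illustrated with a minimal sketch. This is not Looker's actual implementation, and the names and structure here are hypothetical: metrics are defined once in a shared layer, then compiled to SQL that runs inside the source database rather than in someone's spreadsheet.

```python
# Hypothetical sketch of a shared metric layer: each metric is defined
# once, centrally, so two teams can't compute "lifetime_value" two
# different ways. Definitions compile to SQL that runs in place in the
# source database instead of data being copied out via ETL.
METRICS = {
    "lifetime_value": "SUM(orders.amount)",
    "order_count": "COUNT(orders.id)",
}

def compile_query(metric_name: str, group_by: str, table: str = "orders") -> str:
    """Translate a named, agreed-upon metric into SQL for the database."""
    expr = METRICS[metric_name]
    return (
        f"SELECT {group_by}, {expr} AS {metric_name} "
        f"FROM {table} GROUP BY {group_by}"
    )

print(compile_query("lifetime_value", "customer_id"))
# SELECT customer_id, SUM(orders.amount) AS lifetime_value FROM orders GROUP BY customer_id
```

Because every dashboard and visualization pulls from the same definitions, the "why are we calling this lifetime value?" argument happens once, in the model, not in every meeting.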
“We give these companies a common vocabulary and a reliable source of metrics that goes way beyond what some of the simpler visualization tools are doing,” Bien says. “They’re making really informed and better decisions based on those metrics and data.”
Just the Facts, Man
The whole point of business intelligence is to find solid facts that can be used for making business decisions. But what if those facts are actually wrong much of the time?
That’s the unfortunate truth uncovered by Blazent, which develops analytic tools aimed at optimizing the back-office operations in big companies, such as telecommunications firms. According to Blazent, the data used for analysis is wrong, on average, about 40 percent of the time.
The problem is, data can sometimes “drift” when an organization embarks on a large-scale data analytic project that involves taking data from multiple sources and blending it, says Michael Ludwig, Blazent’s chief product architect.
“Without going through a very sophisticated process [to cleanse and standardize the data], you’re essentially making decisions based on bad data,” Ludwig tells Datanami. “There’s no way to align data that’s coming from many different data models with many different representations of the same thing, but named differently or needing to be modified in some way.”
Earlier this summer, Blazent introduced a new product that uses Apache Spark to power the many data quality checks and standardization routines that its clients need to perform in order to have faith in the quality of the data.
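The kind of checks such a pipeline runs can be sketched in plain Python. This is an illustration of the general technique, not Blazent's product, and the field names and rules are hypothetical: aliases from different sources are mapped to one canonical field, values are normalized, and incomplete records are counted as bad data.

```python
# Hypothetical sketch of standardization and quality checks applied
# before records from different sources are blended.

def standardize(record: dict) -> dict:
    """Normalize one record so equivalent values compare equal."""
    out = dict(record)
    # Same thing, named differently across sources: map aliases to one field.
    if "hostname" in out and "host" not in out:
        out["host"] = out.pop("hostname")
    # Canonical casing and whitespace, so "  WEB-01 " and "web-01" match.
    if "host" in out:
        out["host"] = out["host"].strip().lower()
    return out

def quality_score(records: list) -> float:
    """Fraction of records passing a simple completeness check."""
    required = {"host", "owner"}
    good = sum(1 for r in records if required <= set(standardize(r)))
    return good / len(records)

records = [
    {"hostname": "  WEB-01 ", "owner": "it"},
    {"host": "db-02"},  # missing owner: counted as bad data
]
print(quality_score(records))  # 0.5
```

In a real deployment this logic would run at scale on an engine like Spark; the point of the sketch is that decisions are only as good as the share of records that survive rules like these.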
Rise of Plural Truth
While Looker and Blazent advocate using data modeling and fast in-memory processing to bypass traditional ETL in the quest for BI’s holy grail, there are some in the analytics industry who assert that the whole notion of a “single version of the truth” is nonsense.
Jeff Jonas, an IBM Fellow and chief scientist for its Entity Analytics business, wrote an interesting article in 2013 titled "There is No Single Version of the Truth." In it, Jonas argues that trying to find a single version of the truth is a waste of time and resources in today's real-time business climate. Instead, we should embrace "plural versions of the truth."
“The ‘best’ data depends on its source and purpose,” Jonas writes. “While a company may have employee data in different systems, like IT, HR, Finance etc., the employee name and address maintained by the payroll system is probably the best one to use for tax filing.”
That doesn't mean Jonas thinks organizations should not try to reconcile data plurality. But instead of the traditional "merge-purge" technique, which involves massive batch jobs that compare new data against the old, Jonas thinks we are better off using an "entity resolution system."
“Entity resolution systems generally retain every record and attribute, each with its associated attribution,” he writes. “Because entity resolution systems have no data survivorship processing, there is no chance future relevant data will be prematurely discarded.”
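The contrast with merge-purge can be shown in a short sketch. This is a toy illustration of the idea, not any vendor's system, and the match key and field names are hypothetical: instead of electing one "surviving" value and discarding the rest, every source record is retained under the resolved entity, tagged with its attribution.

```python
# Hypothetical sketch of entity resolution without survivorship: all
# records about an entity are kept, each attributed to its source,
# rather than one "winning" record being chosen and the rest purged.
from collections import defaultdict

def resolve(records: list) -> dict:
    """Group (source, record) pairs about the same entity, keeping all."""
    entities = defaultdict(list)
    for source, rec in records:
        # Toy match key; real systems use far richer matching logic.
        key = rec["name"].lower()
        entities[key].append({"source": source, **rec})
    return dict(entities)

records = [
    ("HR", {"name": "Ada Smith", "address": "12 Oak St"}),
    ("Payroll", {"name": "ada smith", "address": "99 Elm Ave"}),
]
entities = resolve(records)
# Both addresses survive with their attribution; the "best" one can be
# chosen later, per purpose (e.g. payroll's address for tax filing).
print(len(entities["ada smith"]))  # 2
```

Because nothing is discarded at merge time, data that looks redundant today can still be matched against tomorrow's records, which is Jonas's point about not prematurely discarding future-relevant data.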
There is no foreseeable end to the tsunami of data that's barreling toward us. However, all that data is not created equal. The businesses that figure out how to accurately blend multiple data streams without succumbing to drift in their metrics will enjoy a competitive advantage.