September 18, 2019

On the Origin of Business Insight in a Data-Rich World


Where does business insight come from? In what circumstances does it arise? Does it erupt spontaneously when a critical mass of data has been reached? Or does it only present itself after a methodical analysis has been conducted? These are questions worth asking now, as we find ourselves in the midst of an architectural shift from on-prem Hadoop to cloud-based systems, and witnessing the emergence of automated machine learning and deep learning systems.

The rise of Hadoop and other NoSQL technologies over the past decade coincided with the emergence of big data as a phenomenon. When these technologies made petabyte-scale storage and parallel processing affordable to the average company, suddenly they could justify holding onto “data exhaust” and other trails of human behavior they otherwise would have discarded.

Thousands of companies leveraged the new technologies and techniques to great effect. Organizations found it was very easy to land just about any type of data in their Hadoop clusters, and many organizations did just that. The schema-on-read approach built into Hadoop allowed organizations to land petabytes’ worth of unstructured and semi-structured data now, and figure out what to do with it later.

But as companies got their feet wet with the new tech, they found this big data thing was harder than expected. Getting insight out of the data stored in Hadoop proved a lot harder than people anticipated. (In fact, just finding where the data was stored often proved to be a big challenge, which spawned the nascent data catalog business.) Organizations that dreamed of turning data scientists loose on big data to create powerful algorithms that could instantly react to new pieces of data instead found they needed to hire teams of data engineers to stitch everything together and make things work. And even then, many projects failed.

The Hadoop hysteria always bothered Addison Snell, an HPC industry analyst and CEO of Intersect360. “There was this big notion of ‘Hey I’ve got this data, let’s get a jar of Hadoop and rub it on our data,’” Snell said during a panel discussion at Tabor Communications’ Advanced Scale Forum in April of this year. “‘What do you mean we still have a big data problem? Did we run out of Hadoop? Get another jar!’

“That was kind of the big thing we were selling there,” Snell continued. “But in the end, that wasn’t the solution. And now it’s kind of advancing from there.”

Garbage In….

Hadoop, alas, was not the silver bullet that people hoped it was. The technology was a breakthrough for its time, and many companies did get value from it. But Hadoop’s technical complexity, combined with the loosening of analytic precepts and overheated big data expectations, proved to be too big a hurdle for the average company to overcome. As the air has gone out of the Hadoop balloon, the value of the companies attached to it has declined. Cloudera merged with Hortonworks to create the last Hadoop distributor, while MapR sold itself in desperation to HPE.

The data lake phenomenon continues on AWS, Azure, and Google Cloud, where all of the same data processing engines that attracted people to Hadoop – Spark, Hive, Flink, Presto, and even good old MapReduce – are alive and well. The public cloud promises all the goodness of Hadoop and the schema-on-read approach, but without the trouble of managing hardware. And even some of the data engineering and integration challenges are minimized, as long as you don’t stray too far from the AWS, Microsoft, or Google branded offerings.

The cloud’s rise prompts the question: Did we learn anything from the whole Hadoop experience? Or did we just substitute an infatuation with one shiny object for another?

One person with an opinion on the matter is Monte Zweben, who barely survived the first dot-com implosion and so far is weathering Hadoop’s crash as the CEO of Splice Machine, which implements a relational database on top of Hadoop and public clouds. According to Zweben, we have failed to learn some important Hadoop lessons.

“You can dump all the data you want on S3 or the Azure data lake and do it mindlessly and you will end up in the same place that the first generation of adopters of Cloudera and Hortonworks and MapR ended up,” he told Datanami recently. “It is the wrong way of thinking.”

Instead of “rubbing a jar of cloud” on a data problem so to speak, Zweben recommends that organizations think carefully about the business problem that they hope to address with big data tech.

“You’ve got to figure out what business outcomes you’re going to try to achieve,” he said. “Whether that’s reducing fraud or reducing churn in marketing or better patient outcomes in healthcare, then you find the application that can benefit from the data and the machine learning that you want to inject, and modernize that application.”

The first generation of data warehouses built on relational databases used a schema-on-write approach, which required a lot of work upfront. Developers had to define columns and enforce data types before they could land a single piece of data. By contrast, the new schema-on-read approaches used by NoSQL and Hadoop systems required no upfront design work. They were untyped and unconstrained.

“With NoSQL systems, you can dump garbage in there and have no constraints whatsoever,” he said. “So this lifting of a technical constraint, I think, led to bad behavior.”
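The difference between the two approaches can be shown in a few lines of Python. What follows is only a minimal sketch, not how any particular database or Hadoop engine works: the `ORDERS_SCHEMA` table, its columns, and all function names are hypothetical and invented for illustration. Schema-on-write rejects a bad record before it is stored, while schema-on-read accepts anything and only discovers the garbage when someone tries to query it.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    type: type


# Hypothetical schema for an illustrative "orders" table.
ORDERS_SCHEMA = [
    Column("order_id", int),
    Column("customer", str),
    Column("amount", float),
]


def write_with_schema(table, record, schema):
    """Schema-on-write: validate every column before the record is stored."""
    for col in schema:
        if col.name not in record:
            raise ValueError(f"missing column: {col.name}")
        if not isinstance(record[col.name], col.type):
            raise TypeError(f"{col.name} must be {col.type.__name__}")
    table.append(dict(record))


def land_raw(lake, raw_line):
    """Schema-on-read, write side: store whatever arrives, unchecked."""
    lake.append(raw_line)


def read_with_schema(lake, schema):
    """Schema-on-read, read side: apply the schema only at query time.

    Returns (rows, rejects); rejects are lines that could not be parsed
    or coerced to the expected types.
    """
    rows, rejects = [], []
    for line in lake:
        try:
            record = json.loads(line)
            rows.append({c.name: c.type(record[c.name]) for c in schema})
        except (ValueError, KeyError, TypeError):
            rejects.append(line)
    return rows, rejects
```

In this sketch the cost of validation does not disappear under schema-on-read. It moves from the person loading the data to every person who later reads it, which is the behavior Zweben is warning about.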

…Garbage Out

Zweben isn’t the first tech industry executive to question the wisdom of new big data approaches. Bill Schmarzo, who is now at Hitachi Vantara, has spent years espousing the need for identifying business outcomes prior to initiating data collection, let alone analysis. Even at the height of Hadoop hysteria back in 2015, Schmarzo was questioning the value of gathering lots of data without a specific use in mind.

“We think that big data means that data governance isn’t important any more. I’m an old old-school data warehousing guy and it always makes me laugh,” Schmarzo told Datanami four years ago. “If you didn’t like data governance before big data, you’re going to hate it after big data. Big data makes that problem worse.”

While machine learning algorithms are very good at identifying patterns in big data, those patterns don’t have much value without business context around them. Yes, today’s technology has gotten very good at helping us to identify patterns and anomalies hidden in huge amounts of data, but the technologies can’t tell a decision-maker whether they matter or not. Without structure, data is just a series of 1s and 0s. It’s not business insight.

Companies today are looking to AutoML solutions to give them the answers they need. But even the AutoML companies are waving caution flags on the idea that their tools can automate everything. Nick Elprin, the CEO and co-founder of Domino Data Lab, recently threw cold water on the notion that citizen data scientists armed with AutoML tools will take over the jobs of data scientists.

“There’s going to be a place for them. They’re useful for a set of things,” Elprin said about citizen data scientists. “But for any problem that’s going to be really competitively differentiating for a business or require deep domain expertise or inventing something new, I think that’s going to be hard for citizens to attack that problem because of the depth of knowledge and expertise [data scientists require] and the constraints of the tools.”

Ryohei Fujimaki, the CEO of dotData, is a bit more optimistic about the potential for AutoML tools to automatically generate business insights from raw data. Specifically, the PhD-level data scientist says his dotData algorithms can excel at finding the meaningful insights hidden in data.

“Data scientists or domain experts can typically explore hundreds or a thousand of feature hypothesis, given a use case, but our feature engineering technology explores more than a million features out of tens of tables,” he told Datanami recently. “Typically dotData can explore let’s say a million to a couple million feature hypothesis automatically, and then find out hundreds of promising features.”
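To make the idea of exploring feature hypotheses concrete, here is a toy sketch of the general technique: enumerate candidate aggregate features from a related table, score each against a target, and keep the most promising. This is not dotData’s algorithm, which the article does not describe. The table layout, the aggregate set, the function names, and the choice of Pearson correlation as the score are all assumptions made for illustration.

```python
from itertools import product
from statistics import mean

# Hypothetical aggregate functions applied to each numeric column.
AGGREGATES = {"count": len, "sum": sum, "mean": mean, "max": max}


def generate_features(entity_ids, child_rows, key, numeric_columns):
    """Enumerate one candidate feature per (column, aggregate) pair.

    Returns {feature_name: [one value per entity, in entity_ids order]}.
    Entities with no child rows get 0.0 for every feature.
    """
    features = {}
    for column, (agg_name, agg) in product(numeric_columns, AGGREGATES.items()):
        values = []
        for entity in entity_ids:
            group = [row[column] for row in child_rows if row[key] == entity]
            values.append(float(agg(group)) if group else 0.0)
        features[f"{agg_name}({column})"] = values
    return features


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def rank_features(features, target, top_k):
    """Keep the top_k feature names by absolute correlation with target."""
    scored = sorted(
        features.items(),
        key=lambda item: abs(pearson(item[1], target)),
        reverse=True,
    )
    return [name for name, _ in scored[:top_k]]
```

With one child table, one numeric column, and four aggregates, the sketch produces four hypotheses. The candidate count grows multiplicatively with tables, columns, aggregates, and filters, which is how automated tools can reach the millions of hypotheses Fujimaki describes. A high score still says nothing about whether a feature makes business sense.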

This may be as close to the magic lamp as we’ve gotten with big data. The maturation of neural networks, in particular, has provided a compelling use case for massive data sets. However, Fujimaki readily admits that, due to regulatory limitations in the financial services industry, most of his customers are not using dotData-generated models. While they may use the AutoML version to see what’s possible and to test the bounds of accuracy, when it comes to developing a production system, companies usually hand-code the model.

“Automated machine learning is great, but at the end they have to do a lot of customization on their model, and typically a decision tree or linear type of model, a simpler model” is a better approach, he said. The lack of clean, annotated data is also thwarting the adoption of neural nets, he added.

There’s no doubt that computer technology has evolved at a rapid pace over the past 10 years. Compared to 2009, the average organization has much more powerful data analytics platforms available to them today. And there’s no reason to think that the advances won’t continue.

But for all the technological progress, organizations are still struggling to extract meaningful business insight from their data. Part of this is the ongoing explosion of data and the tremendous data volumes today, but another element of the failure is the desire to think new technology can provide a shortcut.

There are occasionally big breakthroughs that reset our expectations, and today’s AutoML tools and neural networks can provide major benefits in specific areas. But even these technologies don’t eliminate the need for a more thoughtful and methodical approach to managing and mining big data for big insights. In the end, the value of the business insights you get will tend to reflect the quality of the tools, work, and data that you put into finding them.
