Big Data Is Still Hard. Here’s Why
We’re over a decade into the big data era that emerged from the tectonic collision of mobile, Web 2.0, and cloud forces. Bolstered by progress in machine learning, we stand at the cusp of a new AI era that promises even greater automation of rudimentary tasks. But despite the progress in AI, big data remains a major challenge for many enterprises.
There are lots of reasons why people may feel that big data is a thing of the past. The biggest piece of evidence that big data’s time has passed may be the downfall of Hadoop, which Cloudera once called the “operating system for big data.”
After Cloudera's merger with Hortonworks, Cloudera and MapR Technologies became the two primary backers of Hadoop distributions. Both companies had actually been working to distance themselves from Hadoop's baggage for some time, but they apparently didn't move fast enough for customers and investors, who have hurt both companies by holding out on Hadoop upgrades and investments.
In Hadoop’s place, we have the public cloud vendors and their loose collection of data storage and processing options. Companies can do everything on the Amazon, Microsoft, and Google clouds that they sought to do with Hadoop, at the same petabyte scale. In fact, the clouds have even more processing options, and none of the requirements to actually stand up and manage physical clusters, which is fueling huge growth in cloud adoption.
But companies that were hoping the cloud would solve their data management challenges will be disappointed to find that things aren’t any easier on the cloud than they are on-premise, says Buno Pati, the CEO of Infoworks, a provider of data orchestration tools.
“The cloud doesn’t solve that problem,” Pati told Datanami recently. “There’s not a cloud vendor today that can give you a highly automated, integrated, and abstracted system on which you can manage the entirety of your data and analytics activity.”
While cloud vendors give customers an abundance of in-house data processing options — and an even wider array of third-party solutions via cloud marketplaces — it’s still up to the customer to connect all the dots, Pati said.
“The cloud is an escape valve,” he continued. “They say, ‘Okay, this didn’t work on-prem. Let me run away and try it in the cloud.’…But at the end of the day, it’s still your burden, as the user of the system, to put it all together and make it work.”
With so much investment in cloud platforms, one might assume that data management is destined to improve, to get simpler and easier over time. In actuality, most established enterprises will continue to use on-prem systems to store their most critical data and run their most critical workloads, while using cloud options for data and workloads that are newer and less critical.
This emerging hybrid world, which encompasses on-premise and cloud workloads, will introduce more complexity to data management tasks, and open more opportunities for failure, than if companies were running everything on premise or everything in the cloud. Cloudera and AWS, as the leading representatives of the two camps, have both promised to address this challenge with hybrid data management solutions and tools that work seamlessly across both cloud and on-premise systems, but those promises have not yet been fulfilled.
In the meantime, the big data ecosystem will try to fill in the gaps. We’ll see integrated toolsets for handling the full spectrum of data management tasks — from security and backup to governance and access control — across on-premise and cloud-based systems. And higher up the toolchain, vendors are developing ways to track, discover, prepare, and integrate data across multiple silos.
Companies are demanding better tools, says Avinash Shahdadpuri, an engineer with Nexla, which develops big data integration tools.
“Building connections and providing visibility to this data is a very big challenge at enterprises,” Shahdadpuri said. “If you can make the connection and make it more actionable for end users, that in itself is a big challenge that a lot of enterprises face before they even use data.”
Companies want to automate as much of the repetitive and rudimentary data engineering tasks as they can, he said.
“For a data scientist, these are boilerplate activities that typically are very standardized,” Shahdadpuri said. “You should not spend 70% of the time doing the boilerplate activities. You should spend time actually writing the models and deriving insights out of it rather than trying to connect this format of data, converting it to another format and stuff like that, because that doesn’t really add value.”
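The boilerplate Shahdadpuri describes is easy to picture. Here is a minimal sketch of one such task — converting records from one format (CSV) to another (JSON) with light cleanup. The function name and field names are hypothetical illustrations, not part of any vendor's product:

```python
# A minimal sketch of the kind of boilerplate format conversion described
# above: read records in one format, normalize them, emit them in another.
# Field names ("id", "name") are hypothetical examples.
import csv
import io
import json

def csv_to_json_records(csv_text):
    """Convert CSV text into a list of JSON-serializable dicts,
    trimming stray whitespace and skipping empty rows."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        cleaned = {k.strip(): v.strip() for k, v in row.items() if k}
        if any(cleaned.values()):
            records.append(cleaned)
    return records

raw = "id,name\n1, Alice \n2,Bob\n"
print(json.dumps(csv_to_json_records(raw)))
```

None of this logic adds analytical value, which is exactly the point: it is repetitive glue code that tools like Nexla's aim to automate away.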
Repeating the Past
Devising a process to holistically manage disparate and siloed petabyte- and exabyte-scale data sets in support of multiple analytic workloads and multiple user constituencies is a daunting challenge, to put it mildly. It’s a challenge that was not solved with the smaller data sets and less demanding workloads and user bases of 2000 or 2010, and it’s not going to be solved by 2020 either.
That’s not to say that we’re not making progress. Technology has improved remarkably since the start of the big data era in the first decade of the 21st century. Even Hadoop, which everybody seems to love to hate these days, barely resembles Yahoo’s original vision of a distributed storage system hooked to a batch-oriented MapReduce processing framework. Hadoop did actually become enterprise-grade, even if it requires an army of engineers to run.
The problem, of course, is that real world companies don’t live in green fields. No matter how great the new new thing actually is, you have to integrate it with your existing systems and your existing processes. This is the essential “gotcha” that has doomed many a promising technology, and one that has kept systems integrators in the black for decades.
As it relates to big data, the industry suffers from a lack of creative thinking and the capability to connect the dots for customers, according to Pati.
“You walk into any large enterprise and you’ll see multiple systems built around Teradata and Hadoop and Azure and Google and Amazon,” said Pati, who is also a venture capitalist. “But they are all replicas of what was done in the past with Teradata, which involved a lot of development effort, a lot of skilled talent. And at the end of the day, the demands on the systems in the past were not that great.
“But as this got replicated across those different environments to support different applications – whether it’s Customer 360, machine learning, AI or enterprise analytics – it has become a terribly fragmented mess that’s difficult to manage,” Pati continued. “They seem to be following a legacy model of point capabilities and not taking a holistic view that you have to put solutions together for customers. They can’t be left holding that bag.”
As far as we’ve come with machine learning and AI, we still have yet to fully master all aspects of managing big data. The scale and the complexity of today’s data and analytics applications don’t lend themselves to easy solutions. While a silver bullet is nowhere to be found, that’s not a reason to ignore the problem.
There is still a lot of innovation left to be done in the big data world. If and when major progress is ever made at a wide scale, it will pave the path for even greater utilization of data in the future.