Hadoop at Strata: Not Exactly ‘Failure,’ But It Is Complicated
Has Hadoop failed to deliver the goods? That was a question on the minds of Strata + Hadoop World attendees last week, with opinions expressed both pro and con. Regardless of your stance on that question, it’s clear that the center of gravity has moved beyond the yellow elephant.
The most obvious clue that the industry has moved on from Hadoop came on the second day of the conference, when Cloudera’s Doug Cutting and O’Reilly’s Ben Lorica announced that the conference would be renamed from “Strata + Hadoop World” to “Strata Data Conference.”
Instead of trying to fit all the barnyard animals into the name (Cutting suggested Hadoop + Hive + HBase + Spark + all the other omnivores, as well as “Cutting Con,” which might actually have worked), the conference organizers went back to the roots of the Strata conference in 2011.
(Note to self: it’s ALL about the data.)
That doesn’t mean Hadoop is irrelevant. We will need a place to land unstructured and semi-structured data. But when the biggest Hadoop distributor removes the name of Hadoop from its flagship conference, it’s clearly an indicator that things haven’t gone quite as expected.
It’s clear that the time has passed for the concept of Hadoop as a “data operating system” or even a “data hub,” let alone the center of the data universe. The idea of a “virtual” data plane composed of multiple physical silos has taken its place.
That’s not stopping the Hadoop distributors from declaring their platforms ready to deal with the new (which is the old) reality of data — namely, that it’s distributed and siloed and hard to deal with.
Cloudera, Hortonworks, and MapR have all dedicated time and money to incorporating streaming data pipelines and data analytics into their core platforms, with the promise of centralized data security, governance, and management. That’s all good. However, we’ve yet to see whether they can keep up with the rapidly moving data target.
Shift to Real-Time
The comments from Strata attendees that came up again and again had to do with the advent of real-time streaming architectures and the misalignment with the core Hadoop strategy of “centralize everything.” Hadoop will play a role in centralizing data science and machine learning workloads, but it won’t have a hand in active execution and scoring of live data at the edge, says Bill Schmarzo, the Dell EMC vice president, university professor, and Dean of Big Data.
“There’s still the need, I believe, that data science development still has to happen at the core, or at the data lake, as they’re calling it now,” Schmarzo tells Datanami. “To me, Hadoop still works great for that…But the actual model execution is going to go closer to the edges, because you have data latency problems.”
This view was echoed by Scott Gnau, the CTO of Hortonworks, which, like Cloudera, has taken Hadoop out of the name of its flagship conference. Last year, you will recall, Hortonworks changed the name of its Hadoop Summit line of conferences to DataWorks Summit to reflect the greater role of data streaming architectures, including its Hortonworks DataFlow (HDF) product, which combines Apache NiFi, Apache Storm, and Apache Kafka.
“The center of gravity has moved,” Gnau says. “There are two things we need to do. One is broadening so we’re not just Hadoop, but we’re covering the lifecycle of data: data moving, data at rest, and data analytics as a core infrastructure play.” The other is to simplify the Hadoop products.
To address that second concern, Hortonworks is rolling out a series of pre-built Hadoop bundles for common use cases. The first is an enterprise data warehouse package, developed with Alation and AtScale, that is designed to create a production-ready Hadoop-based EDW in seven weeks or less. Hortonworks will be rolling out more of these over the coming months, including one for cybersecurity and another for IoT automation.
The entire Hadoop ecosystem has been working at this very problem. In fact, much of the news out of last week’s Strata + Hadoop World was about taking the pain of Hadoop out of the equation. For example, data cleansing software vendor Paxata rolled out a version of its software delivered on a preconfigured rack-based appliance that customers can just slide into their data center and away they go.
“It’s completely packaged,” says Nenshad Bardoliwalla, co-founder and chief product officer with Paxata, whose machine learning-driven data quality software has dependencies on Spark and Hadoop.
“You don’t have to hire a Hadoop administrator. You don’t have to know how Hadoop works. You don’t have to know how to tune Spark jobs. We’ve taken all that complexity out and made something that anybody can run,” he says.
Big Data Reset
Big companies have the resources to deal with Hadoop and invest in the DevOps resources to make it work for them, Paxata’s Bardoliwalla says. But the divisions of midsize companies struggle under the complexity of Hadoop. (If you’re a small company, forget about running your own cluster.)
“What we hear over and over again with Hadoop customers, especially those who have used Hadoop for 12 to 18 months,” he says, “is we poured all the data into the data lake. Now what?”
The difficulty in getting data out of Hadoop has been well documented. Bobby Johnson, who helped develop a time-series database with his behavioral analytics company Interana, told Datanami about a meeting he had with a prominent venture capitalist in Silicon Valley.
“He said, with no hint of irony, ‘We’ve solved writing data into Hadoop. Now we just need to solve reading data out.’”
Hadoop may have been marketed as a better data warehouse, but Hortonworks’ Gnau pushed back against the idea that Hadoop is too hard and has failed in its overall mission because it’s an inferior SQL repository and engine compared with the products of traditional EDW vendors like Teradata, where Gnau spent 20 years. (See “Anatomy of a Hadoop Project Failure” for a case study on how Blackboard couldn’t get its Hadoop-based data warehouse project to work.)
“Hadoop exists and was invented to solve a problem, which was not to be a cheaper RDBMS,” he said. “That’s not what it was created to do. It’s a complete miss on what’s happening in the world, whether it be the data generated from the Web, IoT, sensors. That’s the opportunity space.”
Of course, Hadoop wasn’t created to be a better EDW, although that’s how Cloudera initially marketed it. Cutting initially created Hadoop to be a better indexing engine for Yahoo, but over time it’s morphed into what it is today: a general purpose distributed storage and processing platform, with EDW replacement as the enticing fruit to get Hadoop in the door–a Trojan Horse, as it were, to a world of big data riches that awaits.
Too Big To Fail?
To a certain extent, Hadoop’s failure to be all things to all people was destined from the beginning. Sure, Hadoop may not be able to deliver the SLA of a Teradata, but that’s missing the point, says Hortonworks’ Gnau.
“Those RDBMSs are highly effective for high-service level, third-normal-form data with an integrated data model and complex queries in SQL – all that highly integrated from the file system up through the operating system through the software layer. It’s all completely integrated and known,” he says. “If we did that in Hadoop, then we wouldn’t be able to address that white space of unknown data.”
It’s that unclaimed “white space” of big data analytics that is so enticing, and yet so difficult to claim. It’s simultaneously the thing that kept Hadoop going and the thing that undermined its ability to solve specific data problems in a concise manner.
Peter Wang, the CTO and co-founder of Continuum Analytics, got a notion of where Hadoop might be headed in the very beginning, during the first Strata conference held up the road in Santa Clara back in 2011.
“There was an evening talk by Doug Cutting, and no one was there,” Wang tells Datanami. “Everybody was out getting drinks. Me and 12 other idiots are sitting in the audience and literally the creator of Hadoop is talking about the future roadmap and where it’s going to go.”
What did that lack of audience engagement mean? Wang read the tea leaves, and concluded that the technology roadmap of Hadoop would be orthogonal “to how Hadoop will be sold, to how it will be implemented, to how success will be declared, and how the next thing will then be engendered from that.”
In other words: Hadoop was a vehicle upon which vendors would pin their data analytic aspirations. Whether or not it was technically the best platform to achieve those dreams was not the issue. The “white space” of future big data possibilities would be filled with Hadoop. But the concept may have gone too far, as big data innovation continued to occur outside of Hadoop. At some point, the burden of reconciling the progress in Spark and Kafka and TensorFlow and everything else with Hadoop became too great to bear.
In retrospect, it probably wasn’t fair to the Hadoop project and product itself, which does a fine job of ingesting huge volumes of semi-structured and unstructured data, and of orchestrating batch analytics. But the fickle nature, rapid evolution, and huge demands of big data analytics were simply too much for one project to handle–even one as big and strong as Hadoop seemed to be.
Wang relates one of his favorite quotes from the hit television series “Mad Men” to help explain the Hadoop phenomenon.
“Rachel asks Don Draper, ‘What do you think of love?’” Wang says. “Draper says, ‘I don’t believe in love. Guys like me invented love to sell nylons.’
“Hadoop is what Hadoop is,” Wang continues. “But Hadoop as the savior or the next platform was invented to sell Hadoop. It was invented to sell nylons.”