September 29, 2017

Hadoop Was Hard to Find at Strata This Week

It’s been barely six months since the word “Hadoop” was removed from the name of the world’s biggest big-data trade show, and at this week’s Strata Data Conference in New York City, Hadoop all but disappeared. Yet key parts of the Hadoop platform appear likely to survive.

It was a conspicuous absence, to be sure. Making a big yellow elephant essentially vanish in the space of half a year is not an easy feat. But the fact remains that what used to be the rallying point for an entire industry has essentially been reduced to an afterthought. Cloudera, which puts on the show with O’Reilly Media, scarcely even mentioned Hadoop.

Sri Ambati, the CEO and co-founder of H2O.ai, has strong opinions on the matter. “Hadoop is dead,” he told Datanami during an interview about the company’s new Driverless AI offering, which runs on NVIDIA GPUs and automates feature engineering. “Spark killed Hadoop.”

That line of thought was commonly heard at the show. Gartner put the fork into Hadoop earlier this year in its hype cycle for data management, where it declared Hadoop distributions “obsolete” due to complexity, and stated that “the questionable usefulness of the entire Hadoop stack is causing many organizations to reconsider its role in their information infrastructure.”

MapR is usually considered to be one of the three primary distributors of Apache Hadoop, along with Cloudera and Hortonworks, but you likely won’t catch them saying the “H” word in mixed company. CEO Matt Mills recalls his first day with the company two years ago, which happened to be the first day of the Fall 2015 Strata + Hadoop World show.

“Over the next 30 days, we were all about this distro thing,” he recalls. “I’m in here going, If this is the business we’re in, we lost. We were at it for months, but on December 8, 2015 we announced the converged data platform, and we’ve been running from it [Hadoop] since. We took the elephant off our building.”

That hasn’t stopped many people (including this editor) from throwing MapR into the Hadoop bucket, but the message is finally starting to sink in. “It’s happened here at the conference a half dozen times,” Mills says. “Somebody sits down and says, now look, you’re not Hadoop. And I want to say — finally after two years it’s starting to resonate with folks that while we’re about big data, we’re really not about Hadoop.”

Spark is the dominant data processing engine in Hadoop (Leigh Prather/Shutterstock)

Justin Erickson, a product manager with Cloudera, says the company made a conscious effort to stop talking about Hadoop because the term is ambiguous and just leads to confusion among customers. “Having a conversation about ‘Is this Hadoop or not Hadoop’ — it just ends up being more confusing,” he tells Datanami. “You look at our S1 filing and how we talk about ourselves as a company, you’ll find very few mentions of Hadoop. That’s on purpose.”

Instead of having a conversation about Hadoop and whether it will take your business to the next stage, the conversations are more about how Cloudera can address business challenges, such as detecting fraud or predicting churn, with its four main packaged offerings: a data engineering package, a data science platform, an analytic database package, and a package for operational databases. That’s a welcome change, Erickson says.

“There was a long period where the rallying cry was ‘Hadoop is synonymous with big data,’ even if the folks didn’t understand what the context of it was,” Erickson says. “We get people excited about Hadoop and then we have to work with them and say, this is actually why you are excited about Hadoop. Now we’re talking more about the reality of what’s going on. And the reality of what we see is it actually means something to the business, rather than ‘I need Hadoop otherwise I’m going to be embarrassed and I’m going to fall behind.’”

Even if Hadoop itself is no longer the center of attention, it doesn’t mean that people are no longer licensing or subscribing to Hadoop software. The Hadoop family of products, which includes core parts like YARN, HDFS, and MapReduce and about 30 related products like Hive, HBase, Spark, Kudu, and Impala, is still being adopted by new customers. Sales are growing for Cloudera, which is now a public company, just as they are for Hortonworks, which is also a public company. MapR, which plans to go public one day and largely discloses its finances as if it’s already public, also shows growth in sales of its Hadoop converged data platform.

Tracking the barnyard animals has always been a chore

The message from supporting software vendors is the same. “We see growth,” says Wei Zheng, vice president of products at data wrangling firm Trifacta, which just inked a white-label deal with cloud giant Google. “We’ve seen a lot of growth for Hadoop also in the cloud. Cloudera will tell you that their product release is all about the cloud.”

Pepperdata, which develops software to monitor and optimize Hadoop and Spark workloads, doubled its revenues last year. Much of the growth comes from expanding Hadoop clusters. But in the future, the workloads that need optimizing will run mostly on emerging cloud architectures, like Kubernetes, according to its co-founder, Chad Carson.

“We’re really strategically focused much more [on] Spark [than Hadoop] and Kubernetes is coming up right behind it,” he says. “Right now big data has its own specialized stuff. We think Kubernetes is going to replace everything so we’ve actually been working on getting HDFS working on Kubernetes. We’re also working with Google, Red Hat, Palantir and Bloomberg to put Spark on Kubernetes.”

Pepperdata CEO Ash Munshi, who formerly was the CTO of Hadoop pioneer Yahoo, says Spark is displacing Hadoop as the focal point for companies as they look to change or evolve their computing architectures to the emerging data paradigm.

“The guys at Databricks have done such a good job creating a community around this thing and the momentum, and now with the DataFrames architecture that’s actually been put inside, it’s very much a unifying architecture that solves a whole host of problems,” he says. “It’s not going to go away anytime soon.”

It took about 10 years for Hadoop hype to rise and fall, which is in line with a general 10-year cycle in computer science, the CEO says. But the question companies are grappling with today is not which technology to use, but where to put the data.

“They’re not making technology bets. I think they’re making strategic bets on how much do I do on prem versus how much do I do on cloud, how much do I do on hybrid. Do I flex to the cloud or not? Do I flex locally or not? What am I allowed to put on the cloud? Every government is saying, you can’t have that there, you have to have that there. That complicates the world. That’s a big can of worms. And that is strategically driving where the data lives. Because where the data lives has a huge impact.”

Cloudera’s Erickson would not argue with either of those premises. He cites a 2016 survey that found Cloudera’s distribution of Hadoop to be a common way for folks to obtain Spark, and that most of that Spark is hitting YARN. (Questions about work Cloudera is doing around Kubernetes were not answered; Cloudera is currently in a quiet period following its recent announcement that it’s doing a follow-on offering to raise more cash from investors.) “MapReduce is absolutely dead,” he says.

Data lakes come in many forms (cherezoff/Shutterstock)

But even if core Hadoop components like YARN and MapReduce are being replaced and even if Amazon S3 and Microsoft ADLS loom large as repositories for massive data, Erickson says Hadoop’s main architectural concept, that data should be centralized and that application workloads should be moved to the data, is still strong.

“If you look where we stand in terms of how we’re embracing clouds and object stores, it’s basically the same concept,” he says. “It’s just now embraced and rematerialized in a way that you now have a global store that you can go and use.”

Dave Mariani, the CEO of BI on Hadoop vendor AtScale, has yet another take on the rise and fall of Hadoop hype, and what replaces it going forward.

“I think the real revolution was data warehouse to data lake, which is basically schema on write versus schema on read,” he tells Datanami. “To me that’s the real innovation. Hadoop just happened to be [the] first implementation of a data lake.”

As companies adopt cloud-based object stores to store their big data, they’re still benefiting from the schema on read approach, even if they’re not getting that capability from Hadoop. “We all thought maybe Hadoop was going to be the one place. We now realize it’s a place,” he says. “We think that it’s a multi-data platform world. It’s not just going to be one place.”

Hadoop may not be “the one,” but that doesn’t mean it lacks any value. It also doesn’t mean that businesses are fleeing Hadoop, although Gartner claims that many are re-examining its role. There’s evidence that new customers are buying Hadoop and existing customers are expanding their existing Hadoop infrastructures. That’s a testament to Hadoop as a maturing technology, Mariani says.

“We don’t see that Hadoop is dead out there. We don’t see that at all,” he says. “Now, you come to the show and you don’t feel the electricity that you felt in the early days. Those were the pioneering days. It’s 10 years old. It’s more mature, and you have a different set of people coming to these conferences.”

Related Items:

Hadoop Has Failed Us, Tech Experts Say

Hate Hadoop? Then You’re Doing It Wrong

Congratulations Hadoop, You Made It–Now Disappear

This story was corrected. The survey cited by Erickson did not state that 57% of Cloudera customers are using Spark. Instead the report found that 57% of a group of 8,000 users surveyed obtained Spark via Cloudera. Datanami regrets the error.
