The Hadoop dream of unifying data and compute in a distributed manner has all but failed in a smoking heap of cost and complexity, according to technology experts and executives who spoke to Datanami.
“I can’t find a happy Hadoop customer. It’s sort of as simple as that,” says Bob Muglia, CEO of Snowflake Computing, which develops and runs a cloud-based relational data warehouse offering. “It’s very clear to me, technologically, that it’s not the technology base the world will be built on going forward.”
Thousands of organizations store huge amounts of data in Hadoop, and so Hadoop won’t disappear overnight. After all, many companies still run mainframe applications that were originally developed half a century ago. But thanks to better mousetraps like S3 (for storage) and Spark (for processing), Hadoop will be relegated to niche and legacy status going forward, Muglia says.
“The number of customers who have actually successfully tamed Hadoop is probably less than 20 and it might be less than 10,” Muglia says. “That’s just nuts given how long that product, that technology has been in the market and how much general industry energy has gone into it.”
One of the companies that supposedly tamed Hadoop is Facebook, which developed relational database technologies like Hive and Presto to enable SQL querying for data stored in HDFS. But according to Bobby Johnson, who helped run Facebook’s Hadoop cluster before co-founding behavioral analytics company Interana, the fact that Hadoop is still around is a “historical glitch.”
“That may be a little strong,” Johnson says. “But there’s a bunch of things that people have been trying to do with it for a long time that it’s just not well suited for.”
Hadoop’s strengths lie in serving as a cheap storage repository and in processing ETL batch workloads, Johnson says. But it’s ill-suited for running interactive, user-facing applications, he says.
“It’s never really broken out of the developer world,” Johnson says. “People have this idea of, ‘Oh there’s all these legacy data warehouses and the new cool way to do it is Hadoop.’ But when you try to do that, you realize it’s actually a lot worse. It’s better than a data warehouse in that you have all the raw data there, but it’s a lot worse in that it’s so slow.”
Getting answers out of Facebook’s Hadoop environment was an exercise in patience and frustration, according to Johnson. “After years of banging our heads against it at Facebook, it was never great at it,” he says. “It’s really hard to dig into and actually get a real answer from… You really have to understand how this thing works to get what you want.”
Hadoop is great if you’re a data scientist who knows how to code in MapReduce or Pig, Johnson says, but as you go higher up the stack, the abstraction layers have mostly failed to deliver on the promise of enabling business analysts to get at the data.
“At the Hive layer, it’s kind of OK. But people think they’re going to use Hadoop for data warehouse…are pretty surprised that this hot new technology is 10x slower than what they’re using before,” Johnson says. “[Kudu, Impala, and Presto] are much better than Hive. But they are still pretty far behind where people would like them to be.”
The Hadoop community has so far failed to account for the poor performance and high complexity of Hadoop, Johnson says. “The Hadoop ecosystem is still basically in the hands of a small number of experts,” he says. “If you have that power and you’ve learned how to use these tools and you’re a programmer, then this thing is super powerful. But there aren’t a lot of those people. I’ve read all these things how we need another million data scientists in the world, which I think means our tools aren’t very good.”
A better architecture for enabling people to be productive with big data can be found in Apache Kafka, Johnson says.
“I like the Kafka version of the world, which is there’s a pipe of data and anything that wants to do something useful with it can tap into that thing,” he says. “That feels like a better unifying principle, a stream of data and what’s happening versus a disk format.”
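The "pipe of data" idea Johnson describes can be illustrated with a minimal Python sketch. This is not Kafka's API; the `Log` and `Consumer` classes and the `dashboard` and `archiver` names are hypothetical, chosen only to show several readers tapping into one append-only stream, each at its own position.

```python
from dataclasses import dataclass, field


@dataclass
class Log:
    """Append-only sequence of records, standing in for one stream (hypothetical)."""
    records: list = field(default_factory=list)

    def append(self, record):
        self.records.append(record)


@dataclass
class Consumer:
    """Reads a log from its own offset, independent of any other consumer."""
    log: Log
    offset: int = 0

    def poll(self):
        # Return everything appended since this consumer last read,
        # then advance this consumer's private offset.
        new = self.log.records[self.offset:]
        self.offset = len(self.log.records)
        return new


log = Log()
dashboard = Consumer(log)  # hypothetical reader that polls often
archiver = Consumer(log)   # hypothetical reader that polls rarely

log.append({"event": "click", "user": 1})
log.append({"event": "view", "user": 2})

first = dashboard.poll()       # the two records written so far
log.append({"event": "click", "user": 3})
second = dashboard.poll()      # only the record added since the last poll
everything = archiver.poll()   # archiver has not read yet, so it gets all three
```

Because each consumer keeps its own offset, adding a new reader does not disturb existing ones, which is the sense in which anything "can tap into that thing."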
Kafka creator Jay Kreps was responsible for running a large Hadoop cluster at LinkedIn before co-founding Confluent. “It’s just a very complicated stack to build on,” he tells Datanami. “I think that’s more a technology problem than anything else.”
While Kafka is included in many Hadoop distributions, Kreps purposely avoided building any Hadoop dependencies into Kafka, and he strived to make it as simple to use as possible.
“Kafka is definitely its own thing. It runs standalone. It has no connection to Hadoop. I think that’s absolutely a good thing for people who are trying to build production applications,” he says. “You can imagine if you’re trying to build something that’s part of your UX and you need to pull in Kafka, it’s really important that you don’t have to pull in 55 very complicated Hadoop-oriented projects and set all that up and operate it. That would be something that would make this very difficult to get going with, and I think that’s one of the appeals of this whole stream processing area, that it’s relatively easy to work with.”
Plenty of software vendors have hitched their wagons to Hadoop, which was the first open source distributed computing platform to see widespread adoption. Some of these companies, like real-time stream processing vendor DataTorrent, have succeeded in helping to mask some of the underlying complexity in Hadoop.
“From an industry standpoint, people understand the amount of data they have to process necessitates a distributed computing platform,” says Phu Hoang, who built Hadoop-based systems at Yahoo before co-founding DataTorrent several years ago. “Hadoop is the only player out there. They may not be married to it. But they are definitely married to the distributed computing and scale-out strategy.”
Until there’s a better option, Hadoop will be the only game in town for distributed processing, says Hoang, who created the DataTorrent RTS product before releasing it as an open source project named Apache Apex.
“Hadoop is painful. But they don’t see another solution,” he says. “Until there may be other distributed computing platforms out there, our focus will be on making that one as easy to use as possible.”
At the end of the day, no one really wants to get down and play around with infrastructure, Hoang continues. “They want to get their jobs done. They want insight and action,” he says. “Our job is focusing on that application layer and providing solutions to people who need to process data as fast as possible and get insights, and make that Hadoop layer as far away from what they have to think about as possible.”
According to Snowflake’s Muglia, people will stop thinking about Hadoop, but for different reasons.
“It’s not like the people who bet on Hadoop were dumb at the time. They were operating on the best knowledge that they had for where the industry was going,” he says. “It’s just that as you watch over time and you see how it’s maturing in some areas and isn’t maturing in others, we’ve seen a stunning lack of success of maturing the Hadoop industry. At the same time that has happened, the public cloud has emerged as a completely viable alternative that virtually every customer is investing in. In the public cloud environment, most of the Hadoop technologies are no longer relevant.”
Unless you have a large amount of unstructured data like photos, videos, or sound files that you want to analyze, a relational data warehouse will always outperform a Hadoop-based warehouse, Muglia says. And for storing unstructured data, he sees Hadoop being replaced by S3 or other binary large object (BLOB) stores.
Muglia says many of his Snowflake customers are Hadoop refugees. “Everything’s a project” on Hadoop, he says. “From the first moment you want to deploy the first node, then designing how to lay out your data and store your data, getting the data, then beginning to decide how to query it – all these things are just huge efforts. When we provision a Snowflake account, all of those things are solved for you in the first seconds – five seconds! – of using Snowflake. With Hadoop that could be a 12-month project.”
While core Hadoop technologies like HDFS will not see much more adoption for business analytics going forward, other technologies included in Hadoop distributions, like Spark and Presto, will persist and even thrive in the post-Hadoop world, Muglia says.
“If you look at the classic Gartner hype model…we’re in the valley of despair with Hadoop right now,” he says. “As Hadoop exits the valley of despair and enters the period of maturity, the level of adoption of that technology will be orders of magnitude less than what was predicted during the hype cycle.”