Here’s What Doug Cutting Says Is Hadoop’s Biggest Contribution
Apache Hadoop isn’t the center of attention in the IT world anymore, and much of the hype has dissipated (or at least regrouped behind AI). But the open source software project still has a place for on-premise workloads, according to Hadoop co-creator Doug Cutting, who says Hadoop will be remembered most of all for a single contribution it made to IT.
At last week’s Strata Data Conference in San Francisco, Cutting shared some time with Datanami to discuss the open source software project that he’s best known for, the state of distributed systems development right now, and what he sees coming down the pike next.
“[Hadoop] is still going strong,” says Cutting, who is the chief architect at Cloudera, which merged with its rival Hortonworks earlier this year to become the clear market leader in the Hadoop space. “The bulk of our customers are still on-prem, using HDFS. That’s the state of the art today. In terms of the state of practice, people are aspiring to do more and more things in the cloud, and they’re gradually transitioning things there. So when we look forward, we see HDFS and YARN being used less and less.”
While Hadoop’s share of big data workloads will decline, its total usage will continue to increase going forward, Cutting says. “The amount of data under management is going to keep growing at a tremendous clip for decades, and for a certain set of problems, that’s going to continue to be the best tool,” he says. “So we’ll be supporting it, I expect, in 10 years.”
Popping Hype Bubbles
Cutting scoffed a bit when reminded that Gartner says Hadoop peaked early, and that its plateau of productivity will be something less than originally forecast.
“Gartner has their Hype Cycle. Good thing it has ‘hype’ in it,” he says. “Hype is something that can peak and not be proportional to actual utility. There was a lot of hype around it [Hadoop]. The way I look at it, people want open source solutions still. That’s not going away. People have more and more data, and a lot of problems for which this stack is the most applicable in what they’re choosing to do with it.”
Cutting says he doesn’t see anything else coming along that will supplant Hadoop, particularly when it comes to large-scale, on-premise deployments.
“You can imagine there’s going to be some great recession of data generation. I don’t think that’s very likely,” he says. “You could imagine somebody is going to come out with some kind of alternative stack that’s going to entirely replace it. Amazon and Microsoft and Google to some degree are trying to do that. And they’ll convert some percentage of people.
“But there’s a lot of folks who don’t want to get locked into a single cloud vendor, who want to run things on premises, who have large enough systems or have legal requirements where they can’t put things in a public cloud, and people who want to play different public clouds off one another by having some portability,” he continues. “I don’t think any one of those is going to entirely own the data market. If you accept that as the truth, that those three won’t get a monopoly over all data systems, and that people are going to continue to want these, I think there’s a strong future for this platform.”
Now, Cutting has never been one to toot his own horn. One might expect the Java developer to defend Apache Hadoop simply because he, along with Mike Cafarella, created it back in the early 2000s to solve the scalability problems of the open source Nutch Web crawler – technology that later matured at Yahoo.
But Cutting actually has been fairly ruthless in his assessment of his own technological creations, including Apache Hadoop and the Apache Lucene search engine. “Nothing is sacred,” Cutting said at the Apache Big Data conference in 2016. “Any components can be replaced by something that is better.”
One of the fascinating things about Cutting is that his loyalty doesn’t rest with Hadoop – which has had a wild ride, to be sure – but with the open source software development process that gave rise to Hadoop, as well as the open source community that’s built around it.
There’s little doubt that Hadoop has lost some of the tremendous momentum it possessed in the 2012-2015 timeframe. At that time, Hadoop was, for all intents and purposes, the center of the big data world. Hadoop was designed to be the singular data storage and processing platform – an “operating system for big data,” per Cloudera – and anybody wanting to access big data needed to run on Hadoop.
Hadoop’s star rose as people tried out the new distributed operating paradigm, and the open source community and the commercial markets responded. Dozens, if not hundreds, of big data projects and products were developed to run on the fledgling Hadoop ecosystem.
However, cracks soon started to appear in Hadoop’s wall, and the biggest was the sheer difficulty of integrating the various projects that Hadoop distributors like Cloudera were shipping. Distributions that started out with a handful of projects soon ballooned to encompass nearly three dozen interrelated big data projects.
While HDFS and YARN usage may wane as cloud alternatives gain ground, the Hadoop ecosystem as a whole has flourished, and likely will continue to do so, according to Cutting. The lesson of Hadoop that really shines through is how it freed developers to experiment and try new things.
“Hadoop came at a very opportune time,” he says. “It certainly wasn’t transactional or relational in any fundamental way. It tended to encourage people to be more experimental, agile in their approach, to embrace all kinds of wacky data formats and what people like to call unstructured, which I think is kind of a pejorative for what a database doesn’t handle elegantly.”
“But interestingly, Hadoop had a lot of flaws, which made a lot of opportunities for other folks to say, ‘That’s a nice start in a new direction, but we can do better,'” Cutting says. “So we got things like Spark, a vastly higher level and more functional API for doing big data computations. We got different kinds of storage engines besides file systems. We’ve got HBase and Kudu. We’ve had great streaming support, things like Kafka….and query engines that are really scalable to go along with it – SQL ones, Impala, as well as Solr and Elastic.”
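Cutting’s point about Spark’s higher-level API can be felt even in a toy word count. The sketch below is purely illustrative – plain Python standing in for the real frameworks, with none of these helpers being actual Hadoop or Spark APIs – but it contrasts the explicit map/shuffle/reduce phases of the MapReduce model with the single chained expression a higher-level functional API makes possible:

```python
from collections import Counter, defaultdict
from functools import reduce

def mapreduce_word_count(lines):
    """MapReduce style: explicit map, shuffle, and reduce phases.

    Illustrative only -- mimics the shape of the programming model,
    not the real Hadoop API.
    """
    # Map phase: emit a (word, 1) pair for every word
    mapped = [(word, 1) for line in lines for word in line.split()]
    # Shuffle phase: group emitted values by key
    grouped = defaultdict(list)
    for word, one in mapped:
        grouped[word].append(one)
    # Reduce phase: sum the grouped counts for each word
    return {word: reduce(lambda a, b: a + b, ones)
            for word, ones in grouped.items()}

def functional_word_count(lines):
    """Spark-like style: the same job as one functional expression."""
    return dict(Counter(word for line in lines for word in line.split()))

lines = ["big data big ideas", "big data"]
print(mapreduce_word_count(lines))   # {'big': 3, 'data': 2, 'ideas': 1}
print(functional_word_count(lines))  # same result, far less ceremony
```

Both functions compute identical results; the difference is how much of the distributed-execution plumbing the programmer has to spell out by hand, which is the gap Spark’s API closed.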
The Biggest Contribution
While new distributed systems like Hadoop and friends garner attention, the vast majority of money is still spent building traditional relational and transactional systems, Cutting says. Looking forward, it’s likely that technologies like Hadoop will continue to gain traction, even if Hadoop itself isn’t the vehicle.
“It’s a long process,” Cutting says. “More and more folks are finding they can use this set of tools to do data warehousing more effectively, and more cost effectively as well. That’s been a nice boon. I don’t see any reason for this to stop. The development style of open source has really fueled this innovation and will continue to.”
Just as Spark and Kafka emerged from earlier stabs at using Hadoop to solve particular problems, new technologies will arise to fill new gaps that are exposed as part of the constantly shifting IT and business landscape. It’s impossible to predict with any accuracy exactly how it will unfold, but only a fool would bet against the open source community having a big hand in the process.
Cutting says the thing that unlocked much of the innovation embodied by Hadoop and its follow-on technologies is open source itself. Proving the viability of the open source software development and delivery model, he says, is what Hadoop accomplished, and what people will remember as its primary success.
“I think that’s a pretty strong model,” Cutting says of open source. “You saw some innovation from universities and a little bit from database companies. But I think open source has really unlocked that. I think people saw that. In terms of Hadoop, the lasting impact is establishing that style of development — ecosystem development, as well as application development.”