April 1, 2016

Cutting On Random Digital Mutations and Peak Hadoop

Alex Woodie

Doug Cutting giving a keynote address Wednesday during Strata + Hadoop World 2016 in San Jose, California

In a wide-ranging Strata + Hadoop World talk on Wednesday that reminds us why we like Doug Cutting so much, the father of Hadoop riffed on the evolution of big data tech, the power of open source, the promise of Flink, and the possibility of “peak Hadoop” at the ripe old age of 10.

“It’s easy to say it’s all hype,” the Cloudera chief architect and Apache Hadoop co-founder said during a presentation on the next 10 years of Hadoop that was posted to YouTube by The Hive. “People ask, ‘Is this a hype bubble? Are we at peak Hadoop yet?’ Depending on what you mean by Hadoop…I say no, we’re a long ways from that. We’re still at the very early stages.”

The fact that Cutting had to parse the question–just what exactly is Hadoop anyway?–is a leading indicator of just how far Hadoop has come, how much attention it’s gained, and how hard it can be to qualify. There’s something going on, definitely, but it’s hard to pin down. As the source of the most relevant technology to come out of Silicon Valley in a decade, it’s no wonder that people look to Cutting (who also created the Lucene search engine) for insight.

If you consider Hadoop to be the two things that it started with–HDFS and MapReduce–then yes, maybe we’ve reached peak Hadoop, Cutting said during his talk, an expanded version of the Strata + Hadoop World keynote address he delivered earlier in the day. “But the more interesting thing,” he said, “is the broader thing.”

Ah yes, but what is this broader thing? “I think we’ve got these long-term trends, like advancements in hardware driven by Moore’s Law and other related things,” he said. “It’s given us an incredibly large quantity of hardware for less and less money each year.”

If you look back 10 to 20 years, computation was isolated within companies, Cutting said. The accounting department or the warehouse may have used computers, but they were narrowly focused on specific functions. “That’s changing, and that’s continuing to change rapidly,” he said. “Almost every traditional enterprise is becoming predominantly driven by data.”

And these data-driven companies in healthcare, transportation, retail, and other industries aren’t getting their newfound data superpowers from the old guard of enterprise software, guys like Oracle (NYSE: ORCL), IBM (NYSE: IBM), and Microsoft (NASDAQ: MSFT). No, the new capabilities are emanating from open source.

“There’s a separate universe from the IT companies that sold software to enterprises that built the databases,” Cutting said. “The hacker community is what I came out of and we were using open source pretty widely. It was accepted as a way of doing things. We shared software and collaborated, we incorporated new methodologies, whereas enterprise software was pretty stodgy. It didn’t change very quickly. It wasn’t respected very much.”

After Cutting and his colleague Mike Cafarella got Nutch running at scale at Yahoo (NASDAQ: YHOO) and decided to rename the system as an open source project called Apache Hadoop, he felt like he had completed something. Probably the last thought in his head was to turn Hadoop into (gulp) enterprise software used by stodgy old organizations like banks and railways and governments.

“I achieved my ambition, which was to build an open source implementation of Google’s ideas,” he said. “But there were other people who saw greater opportunity in this. They realized that industries were going through digital transition, that more and more companies would need this in the coming years. But they weren’t ready to pick up this software and run with it like LinkedIn (NYSE: LNKD) or Facebook (NASDAQ: FB) or Twitter (NYSE: TWTR) would.”

So Cutting joined Cloudera and the rest is history, right? Well, not quite. It’s not clear if Cutting foresaw what would happen next–that Hadoop would become much, much more than HDFS and MapReduce, that it would not only have the legs to survive 10 years in a cutthroat enterprise IT market, but would lay the foundation for a platform that fostered and thrived on change.

Enter the Spark

“We’re starting to see something that’s more interesting I think in the long term as basic elements of the stack get replaced,” Cutting said. “Spark appeared originally four or five years ago, and got to a point where people found it was interesting to use. And it’s gotten embraced and supported. It is a much better API than MapReduce, much easier to use for batch computing. But it isn’t just batch. It supports online computation.”

Apache Spark is quickly becoming the dominant execution engine for Hadoop. The technological wunderkind has grabbed the big data industry by the horns with the golden promises that we can all bask in the glow of in-memory based insights without the gobbledygook of MapReduce. Finally, we have arrived!

But already, people are looking beyond Spark and wondering what comes next. Apache Flink–the elegant framework coming out of Berlin and data Artisans that offers a single API for dealing with data at rest and data in motion–is turning heads as a better, faster Spark. “My impression is Flink is architected probably a little better than Spark,” Cutting said. “It has some advantages in the way it works.”

Like Spark, Flink runs on Hadoop, but isn’t married to it, and can run on Mesos and other distributed resource schedulers. Other influencers in the big data space see promise in Flink, particularly in combination with Apache Kafka to analyze fast-moving data the moment it arrives. Surely Flink’s arrival means Hadoop is past its prime, no?

Mesos is arguably the top competitor to the core of the Hadoop platform

No. Cutting–who has to be partial to Hadoop, which he named after his son’s throw-around wubbie–is unfazed by the rapid evolution of technology, this increasingly chaotic state of the big data space, where project after project is spun out and thrown at the world. In fact, he seems to revel in it. He wants more.

“I think we’re going to continue to see this kind of improvement,” he said. “It comes from this method of lack of control. There is no central control of the Hadoop platform. Rather what we see is Darwinian evolution. We have people creating new projects, Spark’s a great example. It came out of Berkeley as a sort of random mutation. People tried it. Over time they find it’s a useful mutation, as opposed to the six or seven others that we don’t remember. And it’s the successful one that overtook [the others] and it becomes a success.”

Hadoop As Canvas

In Cutting’s view, Hadoop has become a platform that will continually change and adapt to the needs of its users. It will take advantage of the latest innovations coming out of open source. The future of Hadoop is bright, he says. No sign of a peak here, and MapReduce is but a distant memory.

“As we look forward, it’s a very exciting time,” he said. “We’ve got now a platform that’s much better than the prior generation we had. Not only is it less expensive to do processing and storage, but it’s more functional. You can do not just SQL processing with a variety of SQL engines. But you can do search and machine learning and you can do a NoSQL storage. You have all these options you can do on a shared repository of data. So this encourages a better style of application development.”

In this new world, experimentation is actively encouraged as a way to separate digital haves from have-nots. What better way to identify the next Spark or the next Flink or the next Kafka? While not every experiment will pan out, the population as a whole will benefit, while the stodgy old enterprise software companies turn green with envy.

“It’s an evolutionary process,” Cutting said. “In order to figure out which systems are best we need to figure out which systems are not best and fail. It depends on your level of tolerance for experimentation. Cloudera as a vendor tries to curate and use only the best that have proven themselves. If you’re conservative then you can work with the vendor and stick with the stack….if you’re more experimental you’ll try other things.”

That can lead to a heap of confusion on the part of customers, who are confronted with dozens of Hadoop components and a dizzying array of projects to consider. “But on the whole it’s better than the alternative,” Cutting said. “You could have a massive standard body that says this will be the one API for streaming. It might suck. There might be a better way of doing things. And trying to improve things through a standard body is a lot harder. I think it’s a better process to develop [in open source].”

Specific Gravity

Hadoop has gained enough momentum now that it won’t simply fade away. The distributed computational framework has captivated the imagination of people, and primed the pump for the further democratization of big data through parallel access to massive data sets. It’s achieved a certain critical mass whereby it attracts new things like Spark and Flink simply by being the de facto standard platform for big data. If you’re betting on big data at this point, in one way or another, you’re largely betting on Hadoop.

What will Hadoop look like in 10 years? It’s tough to say, but according to Cutting, it will be quite different, and that’s OK.

“We’re seeing this steady adoption of technology, and we’re seeing a steady integration of new technologies. It’s part of the canon that people are building on, that this technology [will be] around in 10 years,” Cutting said. “Ten years ago Spark wasn’t there. People can build on Spark and count on Spark being around for another decade or more. In a while Spark will be like COBOL. It’s simply part of the canon.”

And the future of Hadoop is…whatever the open source community decides it will be, Cutting said. (Kirill Wright/Shutterstock)

What will Hadoop look like if core components like MapReduce, HDFS, and YARN are replaced? To Cutting, who doesn’t seem attached to technology as much as he is to ideas, it will simply be another day.

“We’re an open source big data platform,” he said. “If over time people stop using MapReduce, people stop using HDFS, people stop using YARN, we’re OK with that. We’ll survive that, because we’ll move onto whatever the new storage system, the new scheduler, the new execution engines are. Because we’re about building this open source platform and supporting it.”

As Hadoop turns 10, the core lesson here is don’t cling to technology, because it will soon change. “I worry about companies…dedicating themselves to a particular project,” he said. “I think that’s a very risky thing because it may not last that long….A company needs to have a long term platform, and I think long term, this platform is going to continue to change.”
