Hadoop’s Second Decade: Where Do We Go From Here?
As we reported last week, the first Hadoop cluster went online at Yahoo 10 years ago. The platform has enjoyed phenomenal, if improbable, growth since then. But where does it go from here? Once again, we tapped the knowledge of Hadoop creator Doug Cutting and other experts in the big data industry to get the lowdown on the high technology.
Cutting, who is the chief architect at commercial Hadoop pioneer Cloudera, sees the loose confederation of open source projects surrounding Hadoop continuing to adapt to changes in the broader environment. In particular, he sees projects that can leverage large memory spaces, such as Apache Spark and Kudu (the columnar storage engine that Cloudera unveiled last year to better handle rapid ingest and analysis of data), helping to steer the ship.
“With hardware advances, memory-based storage will soon be much more affordable and much faster,” Cutting tells Datanami via email. “Current generations of software are designed and optimized for existing memory and storage hardware and will need to adapt or be replaced by software that can take full advantage of this new hardware.”
While cloud deployments of Hadoop are still in the minority, that won’t last for long, Cutting says. “As Hadoop and friends make this transition, they need to take better advantage of cloud facilities, like elastic computation, object storage, and regional data centers,” he says. “Adoption of open-source big-data software will continue. Soon this won’t be the crazy new thing any longer, but rather just the accepted normal, enterprise software.”
The worlds of Hadoop and container technologies like Docker will collide, predicts Arun Murthy, who worked with Cutting on early Hadoop clusters while at Yahoo and is a co-founder of Hortonworks.
“The first phase of Hadoop was about establishing the tech and making sure developers could build apps on top of pure data,” Murthy says via email. “Today’s modern apps are taking us from post-transactions to pre-transaction and thanks to Hadoop and new data types, users can run very fine-grain analytics.
“The trend has been one machine, then virtualization, and now things like Docker and containerization that are primarily driving efficiency,” Murthy continues. “We are moving toward building apps that are DevOps-friendly and repeatable so that no one has to rewrite and duplicate efforts. Hadoop is going to be the framework that will carry Dockerish apps. Users want to see the approach of an app that you download and run on your platform, which makes it simpler overall for developers and end users.”
When Hadoop was born 10 years ago, nobody was predicting that it would end up as the de facto standard for a new breed of data-oriented computer operating system. And it will be equally tough to predict where Hadoop ends up in 10 years, says Nenshad Bardoliwalla, the co-founder and chief product officer at Paxata, a provider of data prep and cleansing software.
“I think Hadoop has established itself as the distributed operating system…and operating systems typically don’t go away. They’re fairly sticky,” Bardoliwalla tells Datanami. “I think we’ll still be talking about Hadoop, but what Hadoop is will be very different. Who knows what components there will be. Everything underneath [Hadoop], from a technology perspective, will be continuously evolved and improved, and HDFS will look very different.”
Hadoop has been evolving away from its batch-oriented roots for some time, and there is general consensus that Hadoop will continue its metamorphosis into a system for consuming, processing, and analyzing data in real time.
“The next wave of Hadoop adoption in 2016 will be directly linked to the massive growth of real-time data and streaming,” says Sean Suchter, founder and CEO of Pepperdata, a provider of performance management software for Hadoop. “Because of this, cluster performance issues are going to become even more critical for enterprises to pay close attention to because service levels (in terms of time) will be just as critical as scalability.”
According to Suchter, Hadoop has its work cut out for it, or rather, the Apache Hadoop community does. “With distributed systems like Hadoop, a performance wall is inevitably hit because it was designed to gracefully schedule and start applications on their respective clusters, but not to manage those same applications while running,” he says via email. “In short, Hadoop is powerless to control active job performance.”
That’s why Suchter founded Pepperdata: to control the flow of application traffic on Hadoop. “Hadoop has the metering lights to send jobs onto the cluster, but what it still needs are traffic cops to address performance once those jobs are running,” he says. “Without active controls making thousands of decisions in real time, SLAs will constantly be missed, severely limiting the types of applications Hadoop can be used for.”
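Suchter’s metering-light/traffic-cop distinction can be made concrete with a toy feedback loop. The sketch below is purely illustrative and is not Pepperdata’s or Hadoop’s actual implementation; every name in it is hypothetical. It assumes a monitor that can see each running job’s CPU share and throttles the lowest-priority jobs whenever the cluster exceeds capacity, acting after admission rather than at scheduling time.

```python
# Illustrative sketch of post-admission "traffic cop" control.
# A scheduler admits jobs (the metering lights); this loop acts
# afterward, on jobs that are already running. All names hypothetical.

from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int      # higher number = more important
    cpu_share: float   # fraction of cluster CPU currently consumed
    throttled: bool = False


def enforce_slas(jobs, capacity=1.0):
    """Throttle lowest-priority running jobs until total CPU fits capacity."""
    total = sum(j.cpu_share for j in jobs if not j.throttled)
    # Visit running jobs from lowest to highest priority.
    for job in sorted(jobs, key=lambda j: j.priority):
        if total <= capacity:
            break
        if not job.throttled:
            job.throttled = True
            total -= job.cpu_share
    return [j.name for j in jobs if j.throttled]


jobs = [
    Job("etl-batch", priority=1, cpu_share=0.5),
    Job("ad-hoc-query", priority=2, cpu_share=0.4),
    Job("fraud-scoring", priority=9, cpu_share=0.3),
]
print(enforce_slas(jobs))  # the low-priority batch job is throttled first
```

A production system would, as Suchter notes, make thousands of such decisions per second across many resources (CPU, memory, disk, network), but the design point is the same: enforcement happens continuously at runtime, not once at scheduling.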
Ted Dunning, the chief applications architect at MapR Technologies, says Hadoop will be the site of a great convergence of storage, messaging, and processing capabilities, with an eye towards combining analytical and operational workloads onto a single, all-encompassing platform.
“If I look back at some of the system diagrams we did at MusicMatch or HNC Software 20 years ago…you can see we were crying in our beer for not having good real-time messaging,” Dunning says. “We had horrendously complicated workflows.”
Thanks to huge advancements in the performance of real-time messaging systems, we can now build systems that are much simpler in their design, he says. “For a long time, fractioning and partitioning of data systems were a necessary evil, but it’s become an unnecessary evil,” he says. “I don’t think you need to split everything apart anymore.”
Hadoop still carries vestiges of its batch-oriented beginning, but it’s quickly becoming capable of running in a real-time manner. “Those old restrictions…are being lifted overnight. They’re melting away,” Dunning says. “New kinds of data that are inherently real time are becoming available, and the ability to process it from these new messaging systems is happening.
“Any time a technology changes by three orders of magnitude, it is like a fundamentally new thing. It’s not like a variation on the old. It is a new revolutionary thing,” Dunning continues. “That’s what we’re seeing here. We’re seeing an absolute revolution, especially in platforms, where you can converge that together with (I can’t believe I’m saying this, but Hadoop is 10 years old) traditional big data.”
We’re in the midst of a “historic re-platforming” that will forever change the face of computing, Dunning says. “I’ve seen two or three of these in the past,” he says. “Microprocessors, relational databases, ubiquity of computing and the Internet. And I think we’re seeing something comparable right now.”