August 12, 2019

Re-Imagining Big Data in a Post-Hadoop World


In the big data battle for architectural supremacy, the cloud is winning and Hadoop is losing. Customers are shying away from investing in monolithic Hadoop clusters in favor of more nimble (if not less expensive) cloud platforms. But even with the Hadoop bubble popped, organizations still face many questions when it comes to building for big data.

HPE’s acquisition of MapR last week was a signpost of sorts marking the demise of Hadoop. Once viewed as the cutting-edge platform of the future, Hadoop now looks like just another legacy platform past its prime. Customers that once looked to Hadoop as the core technology to drive their big data strategies are now shifting gears and adopting cloud platforms to bring those strategies to fruition.

The shift has been profound, both at a technological level and a marketplace level. At a technological level, Hadoop’s commingling of compute and storage — one of the hallmarks of the distributed architecture until the community modified HDFS to support erasure coding in the lackluster Hadoop 3.0 release — has fallen out of favor. In place of HDFS, we have massive cloud-based object stores built on the model of AWS S3 and the capability to spin up compute as needed, using container orchestration technology like Kubernetes as opposed to YARN.
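To make that contrast concrete, here is a minimal PySpark sketch of the decoupled pattern: storage lives in an object store, and compute is an ephemeral job that reads from it directly. It assumes a Spark build with the S3A connector and AWS credentials already configured; the bucket, path, and column name are hypothetical.

    from pyspark.sql import SparkSession

    # Compute is ephemeral: this session can run on a throwaway cluster.
    spark = SparkSession.builder.appName("decoupled-storage-sketch").getOrCreate()

    # Storage is the object store, not disks colocated with the workers
    # (bucket, path, and column are hypothetical).
    events = spark.read.parquet("s3a://example-bucket/events/2019/08/")
    events.groupBy("event_type").count().show()

    # When the job ends, the compute goes away; the data stays in S3.
    spark.stop()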

Instead of spending large sums to employ teams of engineers to run complex on-prem Hadoop clusters, organizations have figured out that it’s more economical to use pre-built distributed computing services from AWS, Microsoft Azure, or Google Cloud Platform, and turn operational control over to the cloud vendors.

These cloud platforms closely resemble Hadoop and include all of the computational engines that emerged from the Hadoop world – Spark, Hive, HBase, and yes, even MapReduce – but the heavy burden of operational complexity falls to the cloud vendors, not to the customers.

Impedance Mismatch

The operational complexity of Hadoop was a killer, says Monte Zweben, the CEO and co-founder of Splice Machine, which develops a relational database for Hadoop and other platforms.

“When we need to transport ourselves to another location and need a vehicle, we go and buy a car,” he says. “We don’t buy a suspension system, a fuel injector, and a bunch of axles and put the whole thing together, so to speak. We don’t go get the bill of materials.

“If you look at Hadoop and the commercial models of the distributors, these are the bills of materials that you need to put together [a] product,” Zweben continues. “And they are very effective, very powerful, and extremely complicated, and they’re targeted at the engineering organizations of the world that build software. [But] they were marketed to IT organizations around the world that have more operational skills, that implement platforms and keep them alive 7×24.”

That impedance mismatch was central to Hadoop’s demise, and did damage to the Hadoop business model. Faced with an onslaught from the cloud, Hadoop subscriptions have stagnated, leading to the very public struggles of MapR and Cloudera. HPE anted up for MapR’s fire sale and in the process saved many of its own enterprise clients in the Fortune 500 and Global 2000 from the ignominy of running an unsupported version of an enterprise data platform. Cloudera still hasn’t named a permanent CEO in the wake of the resignations of former CEO Tom Reilly and Mike Olson, one of its co-founders and its chief strategy officer.

Die Hard Elephants

So where do we go from here? The momentum behind Hadoop has clearly waned, but don’t completely give up on the yellow elephant just yet, says Mike Leone, a senior analyst at Enterprise Strategy Group.

“Dead is a bit strong of a word, but the market is definitely shrinking instead of growing,” Leone tells Datanami. “Our research shows that about 12% of organizations still leverage Hadoop as part of their analytics initiatives. Hadoop had amazing promises from a business standpoint, but fell short on the delivery.”

Organizations had big aspirations for their big data, and while Hadoop may not be the vehicle that delivers them to the promised land, those aspirations are still there.


“Now, there are a number of different ways to achieve the business benefits Hadoop promised, with the continuously growing number of services being offered by the major cloud vendors,” Leone says. “And for those industries not interested in going to the cloud, the major cloud vendors are hoping to enable organizations to take their big data and analytics services to on-premises environments, with technologies like AWS Outposts and Google Cloud’s Anthos.”

With billions invested in Hadoop over the past decade, corporations will be loath to turn off their clusters. Instead, most experts expect Hadoop stacks to stick around for a while, running the custom applications that customers built on them. Hadoop becomes just another legacy technology in Global 2000 data centers that still run their share of IBM mainframes, AS/400s, and even the occasional VAX system.

The New Cloud Architecture

One side effect of the cloud vendors’ victory over on-premises Hadoop is that cloud features are being backported, so to speak, to on-premises systems.

“The cloud architecture is sort of making its way to on-prem data centers,” says Ashish Thusoo, the CEO of Qubole, a provider of cloud-based big data systems. “What does cloud architecture mean? It means all the infrastructure is offered as a service, as opposed to monolith offerings.”

Object stores built on the model of S3, and orchestration frameworks based on Kubernetes that allow compute to be spun up and spun down quickly, are the most visible examples of cloud features making their way to on-premises data centers.
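As a rough illustration of that spin-up/spin-down pattern, the sketch below uses the official kubernetes Python client to launch a batch Job and then tear it down. The job name, container image, and command are hypothetical, and it assumes a reachable cluster and a local kubeconfig.

    from kubernetes import client, config

    config.load_kube_config()  # assumes a local kubeconfig pointing at the cluster
    batch = client.BatchV1Api()

    # Spin up compute on demand as a Kubernetes Job (image and command hypothetical).
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name="wordcount-sketch"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[client.V1Container(
                        name="worker",
                        image="example/wordcount:latest",
                        command=["python", "wordcount.py"],
                    )],
                )
            )
        ),
    )
    batch.create_namespaced_job(namespace="default", body=job)

    # ...wait for completion, then spin the compute back down...
    batch.delete_namespaced_job(
        name="wordcount-sketch", namespace="default",
        body=client.V1DeleteOptions(propagation_policy="Background"),
    )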

“[It’s widely accepted] that a cloud architecture with a separation of compute and storage and compute being ephemeral with a lot of automation to create clusters and everything offered as a service is making its way everywhere,” Thusoo says. “It’s very early in its evolution. It’s nowhere near mainstream or anything like [that]. But that’s what we see these public cloud vendors trying to do.”

Hadoop Lessons Learned

While some might view the disintegration of the Hadoop market as a failure, others will view it as a necessary chapter in IT history.

Hadoop was modeled after technology developed by Google, put into action at Yahoo, and eventually adopted by the other technology giants, like Facebook, Twitter, and Uber, which contributed their creations to open source. The Hadoop method represented one approach to building distributed systems, which were being adopted by the Global 2000 for the first time. It clearly worked well for some firms and not as well for others. As the world evolved, other architectural ideas emerged that many considered better, so the industry tried something new. And so on and so forth.


The Hadoop lesson won’t be ignored, Leone predicts. “I think Hadoop served as a great introduction to a new way of doing things,” he says. “For those organizations that waited to adopt big data processing technology, there are better approaches to accomplish that now, [such as] Spark or leveraging cloud services like GCP’s Dataproc or AWS EMR.”
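For a sense of what leveraging a managed cloud service looks like in practice, here is a hedged boto3 sketch that requests a transient AWS EMR cluster, runs one Spark step, and shuts itself down when the step finishes. The bucket, script, instance types, and region are placeholders, and it assumes the default EMR IAM roles already exist in the account.

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    # A transient cluster: it terminates itself once the step completes.
    response = emr.run_job_flow(
        Name="transient-spark-sketch",
        ReleaseLabel="emr-5.25.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,  # shut down after the work is done
        },
        Steps=[{
            "Name": "spark-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://example-bucket/jobs/etl.py"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",  # assumes the default roles exist
        ServiceRole="EMR_DefaultRole",
    )
    print("Started cluster:", response["JobFlowId"])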

Cloudera will try to compete with a hybrid data platform that eliminates the lock-in dilemma posed by cloud vendors (although the cloud vendors clearly view lock-in as a feature of their business models, not a bug). But according to Leone, it’s only a matter of time until the cloud vendors simply “take out” the remaining Hadoop vendors.

“While organizations can still leverage their preferred Hadoop vendor’s technology on a cloud of their choice, the cloud vendors have created managed services to take out all the complexity associated with Hadoop, such [as] ongoing integration, management, and maintenance,” Leone explains. “If organizations have invested hundreds of thousands of dollars establishing processes that are yielding value to an organization, they will be hard pressed to change those workflows. It’s more appealing to lift and shift those processes to a more efficient infrastructure managed by a cloud vendor. Worst case for the cloud vendors, orgs are running on their infrastructure. The best case for the cloud vendors, orgs ditch their Hadoop vendor and [use] their managed service.”

If Zweben has his way, adopters of new cloud architectures will never repeat what he considers one of the worst features of Hadoop: schema on read.

“In the first generation of Hadoop, everybody just focused on throwing their data onto the platform. There was a great deal of talk about schema on read. And what that meant, to everybody in the community, was don’t worry about it! Just put the data out on Hadoop and people will come and consume in ways that they need it.

“And this was a woeful mistake,” he continues. “It led to a data swamp. And if you combine the complexity of Hadoop, the current state of the data swamp, and the success of the public cloud, you can see that this turned out to be a very big problem for the Hadoop distribution companies.”
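The difference Zweben is pointing at can be shown in a few lines of PySpark: schema-on-read defers structure to every consumer, while declaring a schema up front turns bad records into loud failures instead of silent swamp-fill. The lake path and field names below are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("schema-sketch").getOrCreate()

    # Schema-on-read: dump raw JSON and let each consumer infer structure
    # at query time -- the "don't worry about it" pattern Zweben criticizes.
    raw = spark.read.json("s3a://example-lake/raw/orders/")

    # Schema-on-write discipline: declare the contract up front and fail
    # fast on malformed records, so the lake stays a lake.
    orders_schema = StructType([
        StructField("order_id", StringType(), nullable=False),
        StructField("amount", DoubleType(), nullable=True),
    ])
    orders = (spark.read
              .schema(orders_schema)
              .option("mode", "FAILFAST")
              .json("s3a://example-lake/raw/orders/"))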

Some Things Never Go Out of Style

One could make the argument that Cloudera, Hortonworks, and MapR all missed the public cloud, and now are having their lunches eaten by AWS, Azure, and GCP. There is truth to that, Zweben says. But that doesn’t mean that customers can go ahead and use their new cloud architectures the same way they abused Hadoop.

“You can dump all the data you want on S3 or the Azure data lake and do it mindlessly and you will end up in the same place that the first generation of adopters of Cloudera and Hortonworks and MapR ended up,” he says. “It is the wrong way of thinking.”

The right way to think about big data, in Zweben’s view, is to first figure out what business outcomes you’re hoping to achieve, and then build out from there. Only after you know the business challenge can you be assured that you’re collecting the right data and applying machine learning in the right way.

“Think first about the application you’re going to modernize, and then go find the data you need and the models you need to inject to modernize that application,” Zweben advises. “That inversion of thinking will radically change this whole marketplace.”

Related Items:

Cloud Analytics Proving Costly for Some

HPE Acquires MapR

Is Hadoop Officially Dead?
