July 31, 2017

Hate Hadoop? Then You’re Doing It Wrong

Andrew Brust


There’s been a lot of Monday morning quarterbacking saying Hadoop never made sense in the first place and that it has met its demise. Comments from “Hadoop is slow for small, ad hoc jobs” to “Hadoop is dead and Spark is the victor” are now commonplace.

The frenzy around Hadoop’s deficiencies is almost as fierce as the initial hype around how powerful and disruptive it was. However, while it’s understandable that people have come up against difficulties deploying Hadoop, that doesn’t mean the negative chatter is true.

Out of Sight

TCP/IP powers the Internet, your email, your apps and more. But chances are you don’t hear about it. When you request a ride-sharing service, stream media or surf the Internet, you’re benefitting from its power.

For the most part, we all rely on TCP/IP on a daily basis, yet have neither the interest nor the need to configure it. We don’t spend time on our Macs typing commands like ifconfig to see how our WiFi adapters are configured to get online.

The complexity of the TCP/IP stack is mostly invisible to us now, as Hadoop’s complexity will eventually be

In the 1990s, TCP/IP used to be sold as a product, and adoption was somewhat tepid. Eventually, TCP/IP got built into operating systems and, perhaps paradoxically, that’s when it conquered all. It became a universal standard, and at the same time it disappeared from plain sight.

Hadoop Is Infrastructure

Similarly, Hadoop is the TCP/IP of the Big Data world. It’s the infrastructure that delivers huge benefits. But that benefit is greatly diluted when the infrastructure is exposed. Hadoop has been marketed like a Web browser, but it’s much more like TCP/IP.

If you’re working with Hadoop directly, you’re doing it wrong. If you’re typing “Hadoop” and a bunch of parameters at the command line, you’ve got it all backwards. Do you want to configure and run everything yourself, or do you just want to work with your data and let analytics software handle Hadoop on the back end?

Most people would choose the latter, but the Big Data industry has often directed customers to the former. The industry did it with Hadoop before, and it’s doing it with Spark and numerous machine-learning tools now.

It’s a case of the technologist tail wagging the business user dog, and that never ends well.

Dev Tools Aren’t Biz Tools

It’s not that the industry has been totally oblivious to this problem. Some vendors have tried to up their tooling game and smooth out Hadoop’s rough edges. Open source projects with names like Hue, Jupyter, Zeppelin and Ambari have cropped up, aiming to get Hadoop practitioners off the command line.

Getting Hadoop users off the command line doesn’t necessarily mean they’re more productive

But therein lies the problem. We need tools for business users, not Hadoop practitioners. Hue is great for running and tracking Hadoop execution jobs or for writing queries in SQL or other languages. Jupyter and Zeppelin are great for writing and running code against Spark, in data science-friendly languages like R and Python, and even rendering the data visualizations that code produces.

The problem is that these tools don’t get rid of command-line tasks; they just make people more efficient at doing them. Getting people physically off the command line may be helpful, but having them do the same stuff, even if it’s easier, doesn’t really change the equation.

There’s a balancing act here. To do Big Data analytics right, you shouldn’t have to use the engine – Hadoop in this case – directly, but you still want its full power. To make that happen, you need an analytics tool that tames the technology, without dismissing it or shooing it away.

Find that middle ground and you’ll be on the right track.

The Path Forward for Hadoop

Hadoop isn’t dead, nor is it the problem. Hadoop is an extremely powerful, critical technology. But it’s also infrastructure. It never should have been the poster child for Big Data. Hadoop (and Spark, for that matter) is technology that should be embedded in other technologies and products. That way, those products can, in turn, leverage its power without exposing its complexity.

Hadoop is no more dead than TCP/IP. The problem is how people have used it, not what it does. If you want to do Big Data analytics right, then exploit Hadoop’s power behind the scenes, where it belongs. If you do it that way, Hadoop will be resurrected, not by magic, but by common sense.

But you probably won’t even notice. And if that’s the case, then you’re doing it right.

About the author: Andrew Brust is Senior Director, Market Strategy and Intelligence at Datameer, liaising between the Marketing, Product and Product Management teams, and the big data analytics community. He also writes a blog for ZDNet called “Big on Data” (zdnet.com/blog/big-data), is an advisor to NYTECH, the New York Technology Council, serves as Microsoft Regional Director and MVP, and writes the Redmond Review column for VisualStudioMagazine.com.

Related Items:

Hadoop Has Failed Us, Tech Experts Say

Hadoop at Strata: Not Exactly ‘Failure,’ But It Is Complicated

Charting a Course Out of the Big Data Doldrums
