August 26, 2014

Why Hadoop Isn’t the Big Data Solution You Think It Is

Alex Woodie

Hadoop carries a lot of promise in the IT world for the way it has democratized access to massively parallel storage and computational power. But the level of hype that surrounds Hadoop is disproportionate to its present capabilities, raising the possibility of a big data letdown of elephantine proportions.

The emergence of Hadoop as a next-generation platform for parallel computing has piqued the interest of customers and investors alike. What mid-sized company looking for a big data edge wouldn’t want the same computing architecture that Yahoo and Google use, especially if the technology is open source and runs on commodity x86 servers? The general consensus holds that, if you’ve got a big data problem–or even better, a potential big data advantage–then Hadoop must be your answer.

Not so fast, says Mathias Golombek, the CTO of EXASOL, the German developer of a massively parallel processing (MPP) in-memory database that competes with the likes of Teradata, Actian, and SAP. “Hadoop is really great at scaling storage systems,” Golombek tells Datanami. “But at the end of the day, if you want to do number crunching and complex analytics, you need an in memory or MPP solution that’s specialized for analytical area, which Hadoop isn’t yet.”

That is what EXASOL customer King Ltd, the company behind the Candy Crush Saga game, is doing. Millions of Candy Crush players generate about 10 billion events per day, and that data is landed in a Hadoop cluster. But when King wants to analyze the data, it is moved from HDFS into EXASOL’s column-oriented database, which is the world’s fastest database according to benchmarks.

“I think Hadoop is a great thing for EXASOL because customers are able to store a lot more data,” Golombek says. “The whole idea of programming MapReduce jobs that can be scaled up to hundreds or thousands of nodes was very appealing to many programmers and they built some applications on that….But now if you have a look at recent research and projects, the whole thing is going in the direction that you need again a standard SQL interface.”

The notion that Hadoop is not for analytics will not gain a lot of nods of agreement from Hadoop distributors. Cloudera, Hortonworks, and others will rightfully point to the availability of SQL interfaces, machine learning algorithms, real-time stream processing, and third-party apps as proof that you can do analytics on Hadoop. The combination of Apache Spark running on YARN-enabled Hadoop v2 means the platform is no longer totally dependent on the MapReduce paradigm and its sidekick Pig.

While the YARN-powered Hadoop version 2 is most definitely more inclusive of technologies beyond MapReduce and able to run multiple workloads simultaneously, just getting to version 2 may take a bigger investment than initially budgeted.

For example, the digital media analytics company comScore has been successfully running its MapR Technologies M5 release since 2009. The company relies heavily on first-gen MapReduce jobs to crunch incoming data and prepare it for customers, such as ad firms, who use it to determine what ads to buy. A separate Greenplum MPP database from EMC is used to do the deep exploratory analysis, while Hadoop is left to execute the daily reports.

“It’s a great thing,” comScore CTO Mike Brown says of Hadoop 2, YARN, and the capability to mix workloads. “My concern is, how do we get from 1 to 2? It’s a pretty big task. The way they’ve evolved Hadoop from an API perspective, I can’t just take MapReduce 1 jobs and run it on a MapReduce 2 cluster. I have to re-compile it if I have a lot of jobs in production.” Which comScore does.

It would be a stretch to say that Hadoop has become a victim of its own early success. But the difference between first-gen Hadoop and Hadoop version 2 is big enough that it’s creating an impediment to upgrades. Companies that are new to Hadoop can easily adopt version 2 without the burden of upgrading dozens or hundreds of MapReduce jobs. But for those early adopters, the move is not so easy.

Hadoop is still an emerging technology and is going to suffer its share of growing pains, just like all emerging technologies do. It would be a mistake to think that Hadoop is going to solve all of one’s big data problems, says Judith Hurwitz, CEO of the analyst firm Hurwitz and Associates.

“When there’s a technology that gets very popular, then people want to say, ‘It’s everything.’ Well, it’s not everything,” Hurwitz tells Datanami. “I saw something today where somebody said Hadoop is replacing the data warehouse. No, it’s not. It’s different. You have certain types of data where what you want to do is very structured and you’re looking for a specific set of patterns and looking to analyze a set of data and it’s usually a subset of data and it’s usually much more targeted, whereas Hadoop is much more focused on unstructured data.”

Hadoop so far has proved that it’s very good at solving a set of problems. The experiences of comScore and King, which use Hadoop to land and perform the initial transformations on huge amounts of raw data, are similar to many other Hadoop implementations. But in these situations, Hadoop is but one part of a complex workflow that touches many other parts.

EXASOL’s Golombek sees customers using HDFS to store large amounts of unstructured data. “But then typically there’s a heterogeneous architecture with different layers,” he says. “You have something like a graph database or an in-memory real-time relational database. If you combine all these different applications into one big solution for the customer, then they can benefit in the best way. I don’t believe one system fits everything, like our competitors say–and IBM or especially Oracle and SAP telling the story of combining transactional and analytic system. For me that’s really a kluge of different solutions that don’t fit.”

The Hadoop community and ecosystem are growing tremendously, fueled by big customer interest, and even bigger interest from venture capitalists, who have poured billions into Cloudera, Hortonworks, MapR, and Pivotal. Actual Hadoop spending is relatively modest, accounting for about $815 million in 2014, according to projections by Wikibon. But amid all the Hadoop hype, it’s important not to lose sight of the ultimate goal, which is building applications that solve real-world problems.

“In technology markets, we tend to get caught up in that we turn something that’s a tool and an enabler into the market and I don’t believe it is,” Hurwitz says of Hadoop. “Obviously it will continue to gain momentum. It’s extremely important. But it’s not the solution. It’s a technique. It’s a set of technology approaches that are obviously very important, but it’s a foundational element–it’s not the solution.”

