Why Hadoop Isn’t the Big Data Solution You Think It Is
Hadoop carries a lot of promise in the IT world for the way it has democratized access to massively parallel storage and computational power. But the level of hype that surrounds Hadoop is disproportionate to its present capabilities, raising the possibility of a big data letdown of elephantine proportions.
The emergence of Hadoop as a next-generation platform for parallel computing has piqued the interest of customers and investors alike. What mid-sized company looking for a big data edge wouldn’t want the same computing architecture that Yahoo and Google use, especially if the technology is open source and runs on commodity x86 servers? The general consensus holds that if you’ve got a big data problem, or better yet a potential big data advantage, then Hadoop must be your answer.
Not so fast, says Mathias Golombek, the CTO of EXASOL, the German developer of a massively parallel processing (MPP) in-memory database that competes with the likes of Teradata, Actian, and SAP. “Hadoop is really great at scaling storage systems,” Golombek tells Datanami. “But at the end of the day, if you want to do number crunching and complex analytics, you need an in-memory or MPP solution that’s specialized for the analytical area, which Hadoop isn’t yet.”
That is what EXASOL customer King Ltd, the company behind the Candy Crush Saga game, is doing. Millions of Candy Crush players generate about 10 billion events per day, and that data is landed in a Hadoop cluster. But when King wants to analyze the data, it is moved from HDFS into EXASOL’s column-oriented database, which, according to benchmarks, is the world’s fastest database.
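King’s land-in-Hadoop, analyze-elsewhere pattern can be sketched in miniature. The sketch below is purely illustrative, not King’s actual pipeline: it uses a local newline-delimited JSON file as a stand-in for HDFS and SQLite as a stand-in for EXASOL’s column store. What it preserves is the shape of the workflow, with raw events appended cheaply to a landing zone, then bulk-loaded into an analytical database where interactive SQL queries run.

```python
import json
import os
import sqlite3
import tempfile

# --- Stage 1: "land" raw events in cheap append-only storage (stand-in for HDFS) ---
landing_dir = tempfile.mkdtemp()
events = [
    {"player": "p1", "event": "level_start", "level": 3},
    {"player": "p1", "event": "level_complete", "level": 3},
    {"player": "p2", "event": "level_start", "level": 7},
]
with open(os.path.join(landing_dir, "events-0001.json"), "w") as f:
    for e in events:
        f.write(json.dumps(e) + "\n")

# --- Stage 2: bulk-load the landed data into an analytical store (stand-in for EXASOL) ---
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (player TEXT, event TEXT, level INTEGER)")
with open(os.path.join(landing_dir, "events-0001.json")) as f:
    rows = [(e["player"], e["event"], e["level"]) for e in map(json.loads, f)]
db.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# --- Stage 3: interactive analytics run against the fast SQL store, not the landing zone ---
completions = db.execute(
    "SELECT COUNT(*) FROM events WHERE event = 'level_complete'"
).fetchone()[0]
print(completions)  # 1
```

The division of labor is the point: the landing zone is optimized for cheap, scalable writes, while the analytical store is optimized for fast reads over a curated subset of the data.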
“I think Hadoop is a great thing for EXASOL because customers are able to store a lot more data,” Golombek says. “The whole idea of programming MapReduce jobs that can be scaled up to hundreds or thousands of nodes was very appealing to many programmers and they built some applications on that…. But now if you have a look at recent research and projects, the whole thing is going in the direction that you need again a standard SQL interface.”
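Golombek’s point about the swing back to SQL is easiest to see side by side. The toy example below is plain Python rather than actual Hadoop code: it counts events per player first in a hand-rolled map/shuffle/reduce style, then as a single declarative SQL statement via SQLite. Both produce the same answer, but the SQL version is what most analysts would rather write and maintain.

```python
import sqlite3
from collections import defaultdict

records = [("p1", "level_start"), ("p1", "level_complete"), ("p2", "level_start")]

# MapReduce style: explicit map, shuffle, and reduce phases.
mapped = [(player, 1) for player, _event in records]                 # map
shuffled = defaultdict(list)
for key, value in mapped:                                            # shuffle/group by key
    shuffled[key].append(value)
mr_counts = {key: sum(values) for key, values in shuffled.items()}   # reduce

# SQL style: the same aggregation expressed as one declarative statement.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (player TEXT, event TEXT)")
db.executemany("INSERT INTO events VALUES (?, ?)", records)
sql_counts = dict(db.execute("SELECT player, COUNT(*) FROM events GROUP BY player"))

print(mr_counts == sql_counts)  # True
```

On a real cluster the map/shuffle/reduce phases would be distributed across nodes, which is exactly the plumbing that SQL-on-Hadoop engines hide behind the query.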
The notion that Hadoop is not for analytics won’t get many nods of agreement from Hadoop distributors. Cloudera, Hortonworks, and others will rightfully point to the availability of SQL interfaces, machine learning algorithms, real-time stream processing, and third-party apps as proof that you can do analytics on Hadoop. And with Apache Spark running on YARN-enabled Hadoop version 2, the platform is no longer totally dependent on the MapReduce paradigm and its sidekick Pig.
While the YARN-powered Hadoop version 2 is most definitely more inclusive of technologies beyond MapReduce and able to run multiple workloads simultaneously, just getting to version 2 may take a bigger investment than initially budgeted.
For example, the digital media analytics company comScore has been successfully running its MapR Technologies M5 release since 2009. The company relies heavily on first-gen MapReduce jobs to crunch incoming data and prepare it for customers, such as ad firms, who use it to determine what ads to buy. A separate Greenplum MPP database from EMC is used to do the deep exploratory analysis, while Hadoop is left to execute the daily reports.
“It’s a great thing,” comScore CTO Mike Brown says of Hadoop 2, YARN, and the capability to mix workloads. “My concern is, how do we get from 1 to 2? It’s a pretty big task. The way they’ve evolved Hadoop from an API perspective, I can’t just take MapReduce 1 jobs and run them on a MapReduce 2 cluster. I have to re-compile if I have a lot of jobs in production.” Which comScore does.
It would be a stretch to say that Hadoop has become a victim of its own early success. But the difference between first-gen Hadoop and Hadoop version 2 is big enough that it’s creating an impediment to upgrades. Companies that are new to Hadoop can easily adopt V2 without the burden of upgrading dozens or hundreds of MapReduce jobs. But for those early adopters, the move is not so easy.
Hadoop is still an emerging technology and is going to suffer its share of growing pains, just like all emerging technologies do. It would be a mistake to think that Hadoop is going to solve all of one’s big data problems, says Judith Hurwitz, CEO of the analyst firm Hurwitz and Associates.
“When there’s a technology that gets very popular, then people want to say, ‘It’s everything.’ Well, it’s not everything,” Hurwitz tells Datanami. “I saw something today where somebody said Hadoop is replacing the data warehouse. No, it’s not. It’s different. You have certain types of data where what you want to do is very structured and you’re looking for a specific set of patterns and looking to analyze a set of data, and it’s usually a subset of data and it’s usually much more targeted, whereas Hadoop is much more focused on unstructured data.”
Hadoop so far has proved that it’s very good at solving a certain set of problems. The experiences of comScore and King, which use Hadoop to land and perform the initial transformations on huge amounts of raw data, are similar to many other Hadoop implementations. But in these situations, Hadoop is just one component in a complex workflow that touches many other systems.
EXASOL’s Golombek sees customers using HDFS to store large amounts of unstructured data. “But then typically there’s a heterogeneous architecture with different layers,” he says. “You have something like a graph database or an in-memory real-time relational database. If you combine all these different applications into one big solution for the customer, then they can benefit in the best way. I don’t believe one system fits everything, like our competitors say, with IBM, or especially Oracle and SAP, telling the story of combining transactional and analytic systems. For me that’s really a kluge of different solutions that don’t fit.”
The Hadoop community and ecosystem are growing tremendously, fueled by big customer interest and even bigger interest from venture capitalists, who have poured billions into Cloudera, Hortonworks, MapR, and Pivotal. Actual Hadoop spending is relatively modest, accounting for about $815 million in 2014, according to projections by Wikibon. But amid all the Hadoop hype, it’s important not to lose sight of the ultimate goal, which is building applications that solve real-world problems.
“In technology markets, we tend to get caught up in that: we turn something that’s a tool and an enabler into the market, and I don’t believe it is,” Hurwitz says of Hadoop. “Obviously it will continue to gain momentum. It’s extremely important. But it’s not the solution. It’s a technique. It’s a set of technology approaches that are obviously very important, but it’s a foundational element. It’s not the solution.”