Rethinking Real-Time Hadoop
Hadoop is considered by many to be the best and brightest platform for running big data analytics. Its ability to scale and its vibrant open source community, it is thought, are cementing its place as the center of the analytic data hub of the future. The only problem: Hadoop was envisioned as a batch-oriented system, and its real-time capabilities are still emerging, which has created a gap that fast in-memory NewSQL databases are rushing to fill.
When Yahoo sprang Hadoop and MapReduce on us not so many years ago, it was a batch-oriented system through and through. Since then, many have tried remaking it into a real-time system–through the HBase database, SQL interfaces like Impala and Hive, streaming data engines like Storm, and in-memory frameworks like Spark.
However, these technologies do not appear ready to match the real-time analytic capabilities that in-memory NoSQL and NewSQL databases deliver now. Earlier this month, MemSQL added a new column-oriented store to its in-memory NewSQL database that’s designed to provide fast query response times against transactional data. And last month, NewSQL database vendor VoltDB retrofitted its in-memory relational database with faster SQL capabilities, with an eye toward enabling real-time analytics on huge data stores.
In both cases, the NewSQL database vendors paid homage to Hadoop’s impressive capabilities, but made it clear that the platform just doesn’t cut it when it comes to informing decisions against big data sets with latencies measured in milliseconds.
Hadoop’s Batch Legacy
“Hadoop is still a great platform for storing large volumes of data,” says MemSQL’s director of product marketing Mark Horton. “The issue that a lot of companies are dealing with is, once we get that data in Hadoop, how can we extract the data? Whether it’s Hive or Pig, they need easy access to it to derive value. As we see it today, it [Hadoop] is not built for real-time analytics.”
VoltDB CTO Eric Betts says he has seen plenty of Hadoop projects fail to generate returns when real-time decision-making is a priority. “Hadoop won’t generate 10,000 responses to you per second 24/7. That’s not its use case at all,” he says. “Its use case is doing massive filtering and scanning.”
MapReduce is not the only game in town on Hadoop anymore, and with the introduction of YARN in Hadoop 2, we will see other data processing engines grow in use and prominence. Some organizations are adopting stream processing engines such as S4, Storm, and Spark Streaming with the goal of giving Hadoop more real-time chops.
Betts doesn’t think this approach will work beyond simple analytics, such as counting problems. “Storm isn’t a database. It has no persistence mechanism associated with it. The best it can do is orchestrate a call into another call,” Betts says. “I agree they are positioned [for real time analytics]. But they’re not built for decision-making. If you need to make a decision, you need to make a decision in some context, and you need a stateful system.”
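Betts’ distinction can be made concrete. The sketch below (entirely illustrative; the event schema, user balances, and function names are hypothetical, not drawn from any of these products) contrasts a streaming-style count, which needs no history per event, with a decision that only works against durable stored context:

```python
from collections import Counter

counts = Counter()            # simple counting: fine for a stateless stream job
user_balance = {"u1": 50.0}   # decision context: requires a stateful store

def handle_event(event):
    # Counting problem: each event is self-contained, no lookup needed.
    counts[event["type"]] += 1
    if event["type"] == "purchase":
        # Decision problem: approve only if stored state (the balance)
        # covers the purchase, then update that state.
        ok = user_balance.get(event["user"], 0.0) >= event["amount"]
        if ok:
            user_balance[event["user"]] -= event["amount"]
        return ok

handle_event({"type": "click", "user": "u1", "amount": 0})
approved = handle_event({"type": "purchase", "user": "u1", "amount": 30.0})
print(approved, counts["purchase"])  # True 1
```

The first branch is what a pure stream engine handles well; the second is the “decision in some context” that, per Betts, demands a stateful system underneath.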
The Hadoop vendors are aware of the need for Hadoop to become more real-time oriented, and that’s one of the main reasons why they have invested so heavily in improving SQL access to data stored in either HDFS or HBase. Cloudera is continuing to build on its Apache Impala SQL access engine and is now shipping Spark as part of its Hadoop distribution, while Hortonworks is continuing to invest in Hive, which it’s making better via Tez. MapR Technologies, which has also adopted Cloudera’s Impala, has also tweaked the underlying code in Hadoop to speed things up a bit–in particular, changing the way HBase works to improve the latency.
Keep the MapReduce?
While the momentum behind SQL is large and getting bigger, there are some who question whether abandoning the MapReduce framework is the best way to achieve low latency in Hadoop. A company called ScaleOut Software, for instance, has developed a product called the ScaleOut hServer that marries the speed of in-memory processing with the mass-data efficiency of MapReduce.
ScaleOut Software founder Bill Bain says the data parallel computation capabilities of MapReduce are a great fit for real-time analytics. “Hadoop MapReduce, because it’s data parallel, means that work can be spread out across a set of servers, and that gives you the speedup you need to handle large data sets, up into the petabytes, that are changing rapidly,” he says. “There’s nothing wrong with the MapReduce paradigm for analyzing that [fast moving] data, and the fact that it scales means you can do it very, very quickly. So you wouldn’t want to give up on MapReduce for real time, because it’s exactly what you need.”
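The data-parallel pattern Bain describes can be sketched in a few lines. This is a toy illustration, not anything from Hadoop or ScaleOut: the tick data and the per-symbol averaging job are invented, and the map calls run sequentially here where a cluster would run them in parallel across servers.

```python
from functools import reduce

# Illustrative tick data split into partitions, one per worker, mirroring
# how MapReduce spreads a job across a set of servers.
partitions = [
    [("AAPL", 101.0), ("MSFT", 30.0)],
    [("AAPL", 101.5), ("MSFT", 30.5)],
]

def map_partition(records):
    # Map phase: each worker computes a partial (sum, count) per symbol
    # from only its own slice of the data.
    out = {}
    for sym, price in records:
        s, c = out.get(sym, (0.0, 0))
        out[sym] = (s + price, c + 1)
    return out

def merge(a, b):
    # Reduce phase: merge the partial results from every partition.
    out = dict(a)
    for sym, (s, c) in b.items():
        ms, mc = out.get(sym, (0.0, 0))
        out[sym] = (ms + s, mc + c)
    return out

# In a real cluster the map calls run concurrently on separate machines.
partials = [map_partition(p) for p in partitions]
totals = reduce(merge, partials)
averages = {sym: s / c for sym, (s, c) in totals.items()}
print(averages)  # {'AAPL': 101.25, 'MSFT': 30.25}
```

Because each partition is processed independently, adding servers adds throughput, which is the scaling property Bain is pointing at.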
SQL access alone can’t deliver the goods for real-time analytics, Bain says. For instance, it won’t solve the problem of the hedge fund that needs to continually act on changing stock prices or pieces of news. “You have to update the data based on the query–you have to write back to the database. But you also have to do an analytic computation and decide whether to generate an alert,” he says. “That’s not part of the query. That’s part of business logic. And that logic has to be scaled to run fast enough to give you real time results, so you come back to data parallel computation as the only way to solve the problem.”
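Bain’s read-compute-write loop is more than a query. A minimal sketch, with an invented positions table and stop-loss rule standing in for the hedge fund’s real business logic, might look like this:

```python
# In-memory state standing in for the fund's position store (hypothetical).
positions = {"AAPL": {"shares": 500, "stop_loss": 95.0}}
alerts = []

def on_price_update(symbol, price):
    pos = positions.get(symbol)
    if pos is None:
        return
    # Analytic computation + business logic: decide whether to alert.
    # This decision step is what a plain SQL query doesn't express.
    if price <= pos["stop_loss"]:
        alerts.append((symbol, price))
    # Write-back: the event also updates the stored data.
    pos["last_price"] = price

on_price_update("AAPL", 94.5)
print(alerts)  # [('AAPL', 94.5)]
```

The point of the sketch is that the alert decision and the write-back happen in the same pass as the data access, which is why Bain argues the logic itself, not just the query layer, has to be scaled out.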
Bain says the Apache Spark project is probably the Hadoop engine that has the closest fit to what ScaleOut has built with its hServer, which effectively applies MapReduce jobs continually against smaller pieces of data running in memory on commodity hardware. Spark, which Cloudera is embracing, also keeps much of its data in memory. But, unlike Spark, ScaleOut hServer allows updates to in-memory data, and it ensures that these updates are highly available, he says.
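The “continuous MapReduce over small in-memory slices” idea can be sketched abstractly. This is not ScaleOut’s or Spark’s actual machinery, just an illustration of the shape of it: the same small job is re-run on each incoming batch and folded into long-lived in-memory state.

```python
def run_job(batch):
    # A tiny MapReduce-style job: map each record to a partial count,
    # then reduce the partials into a per-key total for this batch.
    mapped = [{rec: 1} for rec in batch]
    total = {}
    for partial in mapped:
        for k, v in partial.items():
            total[k] = total.get(k, 0) + v
    return total

state = {}  # long-lived in-memory result, updated batch by batch
for batch in [["buy", "sell"], ["buy"], ["sell", "sell"]]:
    result = run_job(batch)           # small, fast job per slice of data
    for k, v in result.items():       # fold the batch result into state
        state[k] = state.get(k, 0) + v

print(state)  # {'buy': 2, 'sell': 3}
```

Each job touches only a small slice, so results stay fresh; the updatable in-memory `state` is the piece Bain says distinguishes hServer from Spark’s mostly read-oriented in-memory data.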
Hadoop has come very far in a short amount of time. It holds tremendous promise as an all-purpose data hub for many different uses in the enterprise. Hadoop’s technology is still evolving, and there will undoubtedly be big advances in real-time analytics. In the meantime, other technologies, like very fast in-memory NoSQL and NewSQL databases, will compete with Hadoop for real-time business.