Converged Platform or Federated Data Plane? The Debate Heats Up
“Bring the compute to the data.” That was Hadoop’s calling card and its solution to the problem of moving big data. However, the rise of cloud repositories and streaming technologies is causing Hadoop distributors to question whether that architecture is the best one going forward. Datanami seeks answers this week at Strata + Hadoop World.
The rise of real-time streaming is both an opportunity and a threat to the Hadoop way of doing things. Some of the big batch workloads that will be disrupted were written in MapReduce on early Hadoop implementations. Now many of those workloads are moving to Apache Spark.
The Hadoop companies aren’t standing still as streaming data becomes a priority for customers. Hortonworks (NASDAQ: HDP) has released Hortonworks Data Flow (HDF) to complement its Hortonworks Data Platform (HDP), while Cloudera has embraced Apache Kafka, the popular publish-and-subscribe message broker, running under YARN. MapR may have gone the furthest of the three by integrating a Kafka-like publish-and-subscribe messaging bus directly into its converged platform (which already sported NFS-style file system hooks and a range of other real-time improvements on HDFS).
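The publish-and-subscribe pattern at the heart of Kafka and MapR’s integrated bus can be shown in miniature. The sketch below is illustrative only (the `Bus` class and its methods are hypothetical, not any vendor’s API); real brokers add partitioning, persistence, and consumer offsets on top of this core idea.

```python
# A publish-and-subscribe message bus in miniature. Each topic fans
# every published message out to all of its subscribers, decoupling
# producers from consumers. Hypothetical toy, not a real broker API.
from collections import defaultdict

class Bus:
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Every subscriber to the topic receives every message.
        for cb in self.subscribers[topic]:
            cb(message)

bus = Bus()
seen = []
bus.subscribe("clicks", seen.append)          # one consumer records events
bus.subscribe("clicks", lambda m: None)       # another consumes independently
bus.publish("clicks", {"user": "a"})
print(seen)  # [{'user': 'a'}]
```

The decoupling is the point: producers never know who is listening, which is what lets the same event stream feed ingestion, analytics, and applications at once.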
It’s apparent that Hortonworks and MapR have decidedly different views on how streaming fits into traditional Hadoop architectures. Hortonworks has taken a data plane approach that seeks to virtually connect multiple data repositories in a federated manner, whereas MapR is going all-in on the converged approach that stresses the importance of a single unified data repository. Cloudera, meanwhile, sits somewhere in the middle (although it’s probably closer to MapR).
Matt Morgan, vice president of product and alliance marketing at Hortonworks, says there’s no point in trying to cram everything that’s needed in a modern application into a single unified architecture.
“If you look at modern data applications, and if you compare the architectures that are associated with them and facilitating data needs of those applications, you will see conclusively that these apps are a composition of platforms,” Morgan says. “This composition of platforms includes data architectures that exist both in the cloud and on-premise, and Hortonworks sees this as one data plane.”
With the just-announced HDP version 2.5, Hortonworks is giving customers more tools to unify the security and governance of data existing in four or more places, including on- and off-premise data lakes like HDP and streaming data platforms like HDF. Specifically, it’s building hooks between Apache Atlas (the data governance component) and Apache Knox (the security tool) that give customers a single view of their data.
Morgan says the market is responding positively to the data plane message. “People are no longer looking at it like, ‘Oh my god, I need to have one single platform, some converged approach,'” he says, clearly referencing MapR’s approach. “I can approach this through a connected approach to glean the information from the data no matter where the data is. That’s the conversation that’s accelerating. That’s the one that’s resonating.”
The folks at MapR are having the same sort of conversations with their customers, but the conversations are headed the opposite way.
According to Jack Norris, the senior vice president of data and applications for MapR, the benefits of having a single unified cluster to handle both data at rest and data in motion still hold water. That’s especially true in the context of next-generation applications, such as those that are built on microservices, which the company is building support for.
“One of the benefits of integrating that publish and subscribe is not just from the efficiency standpoint,” Norris tells Datanami. “If the data pipeline moves from this kind of ingestion process to actually the exchange of data in a rapid manner, then there’s huge advantage to having that happen on the same fabric, rather than having a return trip and having to coordinate the data as it goes back and forth to this other publish and subscribe mechanism.”
When you integrate multiple architectures, you run the risk of introducing additional latency into the whole equation, he says. “Then you’ve introduced delays and gaps and issues on recovery and security and so forth,” Norris says. “As customers get more sophisticated with these flows, that will be a more significant advantage.”
Cloudera, meanwhile, is embracing aspects of both Hortonworks and MapR. The first (and still largest) Hadoop distributor hasn’t integrated a message broker directly into its Hadoop distribution in the same manner as MapR has. But many of its customers are running Kafka directly on YARN, thereby benefiting from some data locality. (For what it’s worth, Hortonworks also includes Kafka in its distribution.)
Cloudera wants to manage all data–whether at rest or in motion–from one location while bringing different engines to bear on that data. At the same time, it’s a firm believer in the Lambda architecture first advocated by Nathan Marz, the creator of Apache Storm, according to Charles Zedlewski, vice president of products for Cloudera.
“Kafka is a piece of a larger real-time or near real-time architecture,” Zedlewski tells Datanami. “It’s typically the combination of Kafka and Spark Streaming for the so-called speed layer. But then there’s always the batch layer that works in conjunction with it, because people want to operate with a larger history of events.”
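The Lambda split Zedlewski describes can be sketched in a few lines: a batch layer recomputes views over full history, a speed layer incrementally counts recent events, and queries merge the two. All names below are hypothetical stand-ins, not Cloudera or Spark APIs; it is a minimal sketch of the pattern, assuming a simple count-per-key view.

```python
# Minimal Lambda-architecture sketch: batch layer + speed layer,
# merged at query time. Hypothetical names throughout.
from collections import Counter

def batch_view(historical_events):
    """Batch layer: recompute a count-per-key view from all of history."""
    return Counter(e["key"] for e in historical_events)

class SpeedLayer:
    """Speed layer: incrementally count events that arrived after the
    last batch run (in practice, consumed from something like Kafka)."""
    def __init__(self):
        self.view = Counter()

    def on_event(self, event):
        self.view[event["key"]] += 1

def query(key, batch, speed):
    """Serving layer: merge the batch and real-time views at query time."""
    return batch[key] + speed.view[key]

history = [{"key": "clicks"}] * 100       # events already in the data lake
batch = batch_view(history)               # periodic full recomputation
speed = SpeedLayer()
for _ in range(3):                        # events since the last batch run
    speed.on_event({"key": "clicks"})
print(query("clicks", batch, speed))      # merged total: 103
```

The cost of the pattern is visible even in the toy: the same counting logic exists twice, once per layer, which is exactly the duplication the Kappa architecture discussed below tries to eliminate.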
Cloudera isn’t a believer at this point in Kafka creator Jay Kreps’ idea of a Kappa architecture, which employs a single layer for processing batch and real-time data using the same engine. “Reasonable people can disagree,” Zedlewski says.
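Kreps’ Kappa idea can be sketched just as briefly: there is only one processing function, and “batch” recomputation is nothing more than replaying the retained event log through that same code. The function and log names below are hypothetical, a sketch of the concept rather than any real implementation.

```python
# Minimal Kappa-architecture sketch: one processing layer serves both
# real-time and batch; reprocessing means replaying the log from scratch
# through the same function. Hypothetical names throughout.
def process(state, event):
    """The single processing layer: fold one event into the state."""
    state[event["key"]] = state.get(event["key"], 0) + event["value"]
    return state

def replay(log):
    """Reprocessing = replaying the whole retained log (e.g. a Kafka
    topic with long retention) through the same function."""
    state = {}
    for event in log:
        state = process(state, event)
    return state

log = [{"key": "sensor", "value": v} for v in (1, 2, 3)]
live = {}
for e in log:                 # the "real-time" path, one event at a time
    live = process(live, e)
assert live == replay(log)    # same code, same result on replay
print(live)  # {'sensor': 6}
```

The appeal is that there is one codebase to maintain instead of a batch and a speed implementation; the trade-off Cloudera points at is that full replays of a long history may be slower than a purpose-built batch engine.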
With that said, Cloudera does see its Kudu project–which offers a happy medium between the scan performance of HDFS and the record-level updating capability of HBase–delivering some convergence in this regard.
“In the future, we see Kudu as the real optimized store for these Lambda architectures,” Zedlewski says, “because it can do real-time response to single events. It can be the speed layer and batch layer for a single store.”
Stream processing and real-time analytics are increasingly becoming where the action is in the big data space. When you consider that Kafka is used by more than one-third of the Fortune 500, you realize the types of investments that corporations are putting into real-time applications. They need to get this right.
As real-time streaming architectures like Kafka continue to gain steam, companies that are building next-generation applications upon them will debate the merits of the unified and the federated approaches. It’s an important question to answer, but one for which there is no clear answer, at least not yet.