Panasas Gets Real About Hadoop
Brent Welch, CTO at HPC storage company Panasas, says he’s been getting a kick out of watching the MapReduce and Hadoop world learn some of the tough lessons the high performance computing folks encountered years ago.
The realms of HPC and enterprise big data have been thrust together by the tectonic force of the Hadoop push, but according to Welch, that collision leaves a deep chasm for many use cases, one that can only be filled by rethinking storage approaches. More specifically, Welch believes high performance network attached storage (NAS) can offer some rather appealing features for the MapReduce world, which is still struggling with limitations and a lack of interoperability related to the Hadoop Distributed File System (HDFS).
The problem here is, if you were to ask most people using Hadoop why they haven’t looked to NAS as their approach of choice, they would probably think you were nuts.
According to the Panasas CTO, the real reason the Hadoop community doesn’t like NAS is that they’ve had bad experiences. “They run big Hadoop workloads against an NFS filer and the filer just gets creamed so, of course, the job doesn’t go well.” Panasas’ answer is high performance NAS that has been purpose-built to work with Hadoop. That message seemed to resonate with the company’s new partner Hortonworks, which has been struggling with ways to let users move data on and off HDFS so they can leverage their existing infrastructure and take advantage of Hadoop for appropriate workloads in a more reliable, scalable way.
The result of this wrangling with Hadoop can be found in Panasas’ recently announced ActiveStor 14 parallel storage system, which has been enhanced with Hadoop support. The company was demoing the product at the Supercomputing show last week (SC12) to show that, in its internal Hadoop benchmarks, ActiveStor 14 came out ahead of a standard local disk implementation by around 30 percent. The Hadoop partnership is, in large part, aimed at making the most of existing investments in HPC systems while still allowing companies to leverage Hadoop. The side benefit, Welch claims, is that there are no new boxes sitting in the corner, floating on their own HDFS island, and that Hadoop data can be brought into the real big data fold.
There are some rather interesting performance numbers for the haters, by the way. Panasas recently benchmarked the benefits of using high performance NAS for Hadoop with some notable results, although again, it makes good sense to think about these in the context of data size and end goals. We talked about the benchmarking whitepaper and will share it with you here when it arrives. Most of the company’s stories about the advantages come from existing Panasas customers who wanted to leverage Hadoop without buying (or renting) more hardware and who needed a way to move data on and off a Hadoop cluster without pulling their hair out.
But on that note, in Panasas’ view, no matter what those workloads look like, there is no performance reason why anyone would want to run with local disk inside the nodes, although the company does admit to the potential emergence of use cases for people who want a hybrid approach. In such a case, users would opt for shared managed storage for their important data while kicking the less critical temporary datasets to local disk. Besides, argued Welch, why would anyone want to tie their data to compute nodes that fail and have to be replaced on a different schedule than storage? It’s too hard to manage and keep track of, and it simply makes more sense to move a computation around than to move the data. All of that aside, he explained that the IT folks who manage disparate systems don’t want important data “scattered” between points in the Hadoop grid; they want to keep track of it and control it. This is the sweet spot of NAS for Hadoop, since, as Panasas claims, the data in HDFS land is tied into Hadoop in such a way that it’s nearly impossible (without extensive wrangling) to run anything but Hadoop workloads against it.
In some ways, we’re still comparing apples to oranges when it comes to Panasas’ traditional customers in the HPC space and some of the Hadoop-using enterprise folks they’re trying to reach. The NAS vendor thinks that the data sizes the enterprise big data folks are talking about are really not all that large, especially when compared to some of the scientific sets that emerge in technical computing. The company’s CMO, Barbara Murphy, who joined our conversation, said that they operate under the rough assumption that the market for Hadoop is more bifurcated than the big data vendors lead us to believe. “On the one side you have the 100 petabyte market, but there’s also the 2 terabyte market, and in total, only about 2% of all enterprises are using Hadoop.” She said these are mostly the Web 2.0 companies, which are doing their own thing anyway.
Welch jumped in with the assertion that “people get a bit of religious fervor about the Hadoop way.” He said that instead of diving in headfirst, there is a simpler way to leverage it when it’s needed: running Hadoop against Panasas’ own parallel file system, PanFS. “It’s going to run Hadoop well, it doesn’t have to displace the use of the local drives on the nodes (even if it could) and there are also other advantages to having high performance shared, reliable, managed storage that can work well in a Hadoop environment.”
In essence, MapReduce came of age inside of what Welch calls a “weak network environment” that he contrasts with HPC, which “has a good network with compute in one place and storage in another with the right bandwidth in between the two.” His goal at Panasas lately has been to demonstrate, through benchmarks like TeraSort, how his company’s storage approach stacks up against local disks on the compute nodes. With Hortonworks tying together the integration and interop pieces via the partnership, it’s becoming possible for HPC users who already leverage Panasas to tap into Hadoop without as much of a headache, and this will probably not be the last partnership of its kind between now and next year’s SC event.