Tachyon Support Coming to Big Data Hypervisor
Organizations that are deploying Apache Spark to do data science on big data may be inclined to invest in Tachyon, the in-memory file system that was developed alongside Spark at the AMPlab. Getting Spark and Tachyon spun up and deployed on bare metal can be a hassle, but it’s a business opportunity for BlueData, which is aiming to be the VMware of big data.
Tachyon is a distributed, in-memory file system designed to enable reliable file sharing at memory-speed across cluster frameworks. The software–which sits above HDFS in the AMPlab diagram–is emerging as a potentially key component to enable the creation of big data analytics pipelines that touch many different engines, such as Spark, Hive, and MapReduce.
The folks at BlueData see a lot of potential in Tachyon, so much so that the company decided to support the file system in EPIC, its big data virtualization platform that lets non-technical users spin up big data clusters with just a few mouse clicks. BlueData, which took home first place in Strata’s recent Startup Showcase, already supports Hadoop distributions from Cloudera and Hortonworks in its software, in addition to Apache Spark, HDFS, and other file systems like Gluster and NFS.
With support for Tachyon now set to be formally unveiled as a tech preview at the upcoming Strata show, BlueData feels it’s well-positioned to help companies remove the shackles preventing them from riding their on-premise clusters into big data’s wild blue yonder, with all the agility, flexibility, and extensibility that is normally afforded only to cloud deployments.
“We always felt that Tachyon had a very high potential to provide the underlying in-memory file system” for emerging big data applications, says Kumar Sreekanti, CEO and co-founder of Mountain View, California-based BlueData. “We recognize that, as much as Hadoop has garnered interest, we think that in-memory or real-time processing will be here to stay. And there will be new frameworks and new applications that will be coming.”
Tachyon clearly is one of those core infrastructure components that BlueData is betting will gain traction as real-time analytics becomes more prevalent. Before founding BlueData with his former VMware colleague Tom Phelan, Sreekanti spent time at the AMPlab, the University of California, Berkeley project that gave rise to Apache Spark and Tachyon. Ion Stoica, the co-director of the AMPlab and CEO of Spark-backer Databricks, is also an adviser to BlueData.
Getting Tachyon running is not a trivial matter, but BlueData says it can take a lot of the headache and hassle out of the process by managing Tachyon as a virtual asset that can be easily created, duplicated, moved, and destroyed without impacting the actual hardware resources that sit underneath it. It is doing that today for Hadoop and Spark, and will soon be doing that for Tachyon with EPIC.
“The real value proposition here is you spin up a Tachyon file system once in the platform, and multiple clusters and multiple users can leverage that shared in-memory file system,” says Anant Chintamaneni, vice president of products for BlueData.
“What happens today is that most people will bring up a Hadoop cluster then manually get the Tachyon code and integrate it with their Hadoop cluster,” he continues. “If they move from that Hadoop cluster to another Hadoop cluster, then they have to spin the Tachyon file system up all over again. If they want to enable these environments for developers, it’s very tedious to integrate Tachyon into each of these environments. Today none of the Hadoop management tools or anybody else out there–Cloudera Manager, Ambari, what have you–can support Tachyon today in their install process.”
While it was developed as part of AMPlab’s stack (with Mesos as the underlying resource manager and Spark providing high-level interfaces for SQL computing, stream processing, graph databases, R and others), there’s nothing preventing Tachyon from being adopted by other big data engines in the Hadoop stack. “Tachyon is HDFS API compliant,” Chintamaneni says. “You can store data in Tachyon and run a Hive job against it or a MapReduce job against it, in addition to Spark.”
While the entire Hadoop stack may benefit, it will be Spark that ultimately drives Tachyon adoption, Chintamaneni says. “I think Tachyon and Spark are made for each other,” he says. “As you see Spark gaining momentum, and as folks start using Spark for more real use cases with larger data volumes, I think they’ll start seeing some of the issues with Spark that are going to be the driver for bringing Tachyon into the stack.”