August 14, 2014

AMPLab’s Tachyon Promises to Solidify In-Memory Analytics

Alex Woodie

U.C. Berkeley’s AMPLab first landed on the radar screens of data scientists with Apache Spark, which promises to provide an in-memory data processing framework to replace or augment MapReduce. More recently, the tech whizzes at AMPLab have whipped up Tachyon, a new distributed file system that sits atop HDFS and aims to allow multiple Hadoop or Spark applications and jobs to access the same data at memory speeds without fears of corrupting it.

The rapid rise of Apache Spark demonstrates the widespread desire in the analytics community for faster processing and more granular iteration of analysis. If the first generation of Hadoop, which leaned heavily on batch-oriented MapReduce jobs, gave us a taste of what was possible with big data analytics, then Spark and its in-memory framework are viewed as the vehicles that will take us to the promised land of interactive and real-time analytics.

Tachyon buffers data as it flows between Spark and HDFS

The in-memory nature of Apache Spark is critical to achieving the big speed-ups over first-gen MapReduce applications that enterprises demand. The more memory that’s available, the more data can be kept up in the air, and the more value can be extracted from it. But problems crop up when the data needs to be written to a file system, where it can be picked up by an application in the next stage of the pipeline. To maintain fault tolerance, the data is typically written to disk via HDFS. But this slows down the whole process and eliminates some of the benefits of using in-memory processing in the first place.

Tachyon provides a potential solution to that dilemma by enabling data to be written to a file system while still in memory, without giving up the fault-tolerance that writing to spinning disk via HDFS has provided.

Tachyon uses a distributed architecture to provide resilience

According to the official Tachyon website at tachyon-project.org, Tachyon provides a memory-centric distributed file system that enables reliable file sharing at memory-speed across cluster frameworks, like Spark and MapReduce. “Tachyon caches working set files in memory, and enables different jobs/queries and frameworks to access cached files at memory speed,” it says. “Thus, Tachyon avoids going to disk to load datasets that are frequently read.”

The open source product, which was first released in 2013, is compatible with existing Spark, Shark, and MapReduce programs, and uses the same HDFS standards for file operations, such as create, open, read, write, close, and delete. But instead of writing to disk, it keeps it all in memory for as long as possible. In addition to HDFS, it supports Amazon S3, GlusterFS, and single-node local file systems as the underlying file system for resiliency purposes, and more file systems are slated to be supported in the future.
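To make the memory-first idea concrete, here is a minimal Python sketch of a file layer that keeps data in memory for as long as it fits and only falls back to an underlying store when it must. It is a toy, not Tachyon's API: the class name MemoryFirstStore, the least-recently-used eviction policy, and the plain dict standing in for HDFS or S3 are all assumptions made for illustration.

```python
from collections import OrderedDict


class MemoryFirstStore:
    """Toy memory-first file layer over a slower underlying store.

    Hypothetical illustration only; this is not Tachyon's API.
    """

    def __init__(self, under_store, capacity_bytes):
        self.under_store = under_store   # dict standing in for HDFS, S3, etc.
        self.capacity_bytes = capacity_bytes
        self.memory = OrderedDict()      # path -> bytes, least recently used first
        self.used_bytes = 0

    def _put_in_memory(self, path, data):
        # Replace any older in-memory copy of the same path.
        if path in self.memory:
            self.used_bytes -= len(self.memory.pop(path))
        # A file larger than the whole memory budget goes straight to the under store.
        if len(data) > self.capacity_bytes:
            self.under_store[path] = data
            return
        # Spill least recently used files to the under store until the new one fits.
        while self.memory and self.used_bytes + len(data) > self.capacity_bytes:
            old_path, old_data = self.memory.popitem(last=False)
            self.used_bytes -= len(old_data)
            self.under_store[old_path] = old_data
        self.memory[path] = data
        self.used_bytes += len(data)

    def write(self, path, data):
        # Writes land in memory; any stale copy in the under store is dropped.
        self.under_store.pop(path, None)
        self._put_in_memory(path, data)

    def read(self, path):
        if path in self.memory:
            self.memory.move_to_end(path)  # mark as recently used
            return self.memory[path]
        data = self.under_store[path]      # raises KeyError if the file is unknown
        self._put_in_memory(path, data)    # cache it for later readers
        return data

    def delete(self, path):
        if path in self.memory:
            self.used_bytes -= len(self.memory.pop(path))
        self.under_store.pop(path, None)
```

In this sketch a file touches the slower store only when memory runs out, which is one simple way to read "keeps it all in memory for as long as possible." A real system would also have to deal with concurrency, block-level storage, and durability.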

The secret sauce in Tachyon is how it recovers from errors. It uses a lineage-based approach that involves logging the data transformations used to build a data set, and then using those logs to rebuild the data if needed. “Tachyon achieves memory-speed and fault-tolerance by using memory aggressively and leveraging lineage information,” says Haoyuan Li, the lead developer for Tachyon at the AMPLab, in a summary of his upcoming Strata + Hadoop World talk.
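The lineage idea can be illustrated with a short Python sketch: instead of replicating a derived data set, the store records which transformation and which inputs produced it, and replays that record if the in-memory copy disappears. The LineageStore class and its method names are hypothetical and greatly simplified; the sketch also assumes that the original inputs are durable and that transformations are deterministic.

```python
class LineageStore:
    """Toy lineage-based recovery (hypothetical, not Tachyon's implementation)."""

    def __init__(self, durable_inputs):
        self.durable = dict(durable_inputs)  # source data assumed to survive failures
        self.memory = {}                     # derived data sets, held only in memory
        self.lineage = {}                    # name -> (transform, input names)
        self.recomputations = 0

    def derive(self, name, transform, input_names):
        # Log how the data set is built, then build it.
        self.lineage[name] = (transform, tuple(input_names))
        self.memory[name] = self._compute(name)
        return self.memory[name]

    def _compute(self, name):
        transform, input_names = self.lineage[name]
        return transform(*(self.get(n) for n in input_names))

    def get(self, name):
        if name in self.memory:
            return self.memory[name]
        if name in self.durable:
            return self.durable[name]
        if name not in self.lineage:
            raise KeyError(name)
        # The in-memory copy is gone: replay the logged transformation,
        # recursively rebuilding any lost inputs first.
        self.recomputations += 1
        self.memory[name] = self._compute(name)
        return self.memory[name]

    def lose(self, name):
        # Simulate losing the in-memory copy, for example when a node fails.
        self.memory.pop(name, None)


store = LineageStore({"raw": [3, 1, 2]})
store.derive("sorted", sorted, ["raw"])
store.derive("doubled", lambda xs: [2 * x for x in xs], ["sorted"])
store.lose("sorted")
store.lose("doubled")
rebuilt = store.get("doubled")  # replays both logged transformations
```

The trade-off in this toy version is that recovery costs compute time rather than the storage and network bandwidth that replication would use.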

Initial results show that Tachyon can attain write throughput 300x higher, and speed up jobs more than 10x, over HDFS. It does this in a reliable manner by avoiding the use of synchronous data replication to disk, and writing data to disk asynchronously, only after it’s been written to memory.
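The difference between synchronous and asynchronous persistence can be sketched in a few lines of Python. In the toy class below, a write returns as soon as the data is in memory, and a background thread copies it to "disk" afterward. The class AsyncCheckpointStore, its methods, and the dict standing in for disk are assumptions for illustration, not Tachyon's implementation.

```python
import queue
import threading


class AsyncCheckpointStore:
    """Toy store: write to memory first, persist in the background (hypothetical)."""

    def __init__(self, disk):
        self.memory = {}
        self.disk = disk                 # dict standing in for a disk-backed store
        self._pending = queue.Queue()    # writes waiting to be persisted
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def write(self, path, data):
        # The caller is unblocked once the data is in memory.
        self.memory[path] = data
        self._pending.put((path, data))

    def read(self, path):
        return self.memory[path]

    def _drain(self):
        # Background loop: persist queued writes one at a time.
        while True:
            path, data = self._pending.get()
            try:
                self.disk[path] = data
            finally:
                self._pending.task_done()

    def flush(self):
        # Block until every queued write has reached the disk stand-in.
        self._pending.join()


disk = {}
store = AsyncCheckpointStore(disk)
store.write("/jobs/stage1", b"intermediate results")
data = store.read("/jobs/stage1")  # served from memory, no wait on disk
store.flush()                       # after this, the disk copy exists too
```

In this sketch, anything written but not yet flushed would be lost if the process died. That window is the gap the lineage-based recovery described above is meant to cover.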

This approach could help enterprises implementing big data analytic systems to overcome disk and network I/O limitations. “More importantly, we believe that due to the inherent bandwidth limitations of replication, a lineage-based recovery strategy like Tachyon’s might be the only way to make cluster storage systems match the speed of in-memory computations in the future,” Li and his fellow AMPLab co-developers Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica, along with Hortonworks developer Eric Baldeschwieler, write in a November 2013 paper titled “Tachyon: Memory Throughput I/O for Cluster Computing Frameworks.”

While Tachyon is still in alpha, it’s a supported component of the AMPLab’s Berkeley Data Analytics Stack (BDAS), along with Spark, Shark, Spark Streaming, and Mesos. It’s the default off-heap storage medium for Spark and is included in the Fedora distribution from Red Hat. Momentum is building for the software, which saw a 0.5 release unveiled less than a month ago.

Today, the Tachyon project involves more than 40 contributors from over 15 institutions, including Yahoo, Intel, Hortonworks, and others. It’s received backing from Cloudera, Databricks, ClearStory Data, Palantir, Conviva, GE, Facebook, Cisco, Ericsson, and others. The software, which is distributed under an Apache 2.0 license, is commercially supported by Atigeo, and has been deployed at multiple companies.

It appears that Tachyon is well-positioned to help solve one of the impediments to the move to in-memory computing. As Apache Spark and in-memory computing gain momentum on Hadoop, chances are good that you’ll be hearing more about Tachyon in the months ahead.

