Alluxio Bolsters Data Orchestration for Hybrid Cloud World
Alluxio was conceived at Cal Berkeley’s AMPLab as a virtualization layer that eliminates barriers between data silos, and allows users to access remote data as if it were local. With today’s launch of Alluxio 2.0, the technology has evolved to the next stage, with a focus on simplifying data engineering in today’s emerging multi-cloud and hybrid cloud paradigms.
In the battle of big data architectures, Hadoop is losing and the cloud is winning. The evidence is hard to dispute. While the open source Apache Hadoop community has taken steps to remain relevant – like separating compute and storage and adopting erasure coding with the December 2017 launch of Hadoop 3.0 – those steps proved insufficient to stem the rising tide of public cloud vendors, where data and compute are separated by default and data is stored in massive object stores.
We are watching big data workloads explode on AWS, Azure, and Google Cloud. Cheap and abundant storage coupled with elastic compute and pre-built applications for real-time streaming, SQL analytics, and machine learning are proving to be a very enticing combination for many companies.
But the fact is that most enterprises are retaining their most critical workloads and data on-prem. Billions have been invested in Hadoop, and companies are loath to abandon it (although some are). As much as the cloud vendors are encouraging customers to move all of their data into the cloud, that is not likely to happen, especially as concerns about vendor lock-in continue to mount.
So instead we have the current situation, where companies are building multi-cloud and hybrid on-prem and cloud systems for many of their IT and big data applications. While this gives customers the freedom they demand, it’s proving challenging for data engineers to keep data in sync between on-prem systems and their cloud repositories.
Silos, Silos, Everywhere
The growing divide between on-prem and cloud data silos isn’t going to go away any time soon, says Haoyuan “H.Y.” Li, the creator of Alluxio the technology and the CTO and founder of Alluxio the company.
“All the storage vendors will say the same thing,” Li told Datanami in a recent interview. “They want to be the data lake, the single one. But it will never happen — at least it won’t happen in the next 20 to 30 years.”
As part of Li’s PhD thesis at Berkeley, he documented a five-to-10 year cycle for storage. Every five to 10 years, we see a breakthrough in storage, where storage gets cheaper, easier, bigger, and better than the previous generation. The emergence of Hadoop nearly 15 years ago was one such moment, and the recent surge in cloud-based object stores is another.
Get used to the constant churn in storage, said Li, who described Hadoop as the beta of the data ecosystem, which has since evolved into today’s version 1.0. “That’s the status quo of the data ecosystem,” he said. “It’s very complex and very hard for organizations to get a harmonious and easy-to-use data platform, and there are many data silos. That’s the challenge in our current ecosystem today.”
According to Li’s theory, there will be another storage breakthrough soon. But Li recognized the folly in going out and moving all of your petabytes into the great new storage medium. That’s the genesis of Alluxio (originally called Tachyon), which was to protect customers from the constantly churning storage market and instead provide a single API that could work with any underlying storage medium.
Instead of fighting the constant emergence of new storage systems and playing the data storage shuffle, Alluxio encourages customers to accept the differences in underlying storage systems and to embrace data silos as a fact of life.
“All the storage vendors, their goal is to kill the silos,” he said. “Our position is to embrace [data silos]. That’s a fundamental difference.”
With Alluxio 2.0, Li is making it easier for customers to embrace data silos, wherever they might reside.
One of the ways it does that is through a new policy-driven data management function that automatically positions data in the appropriate system according to its age. So the hottest data may reside in RAM, the warm data may sit in SSDs or hard disk drives, and the coldest data is moved to one of the cheap “glacier” repositories offered by AWS, Azure, and Google Cloud.
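The age-based placement described above can be pictured with a minimal sketch. This is not Alluxio's actual policy engine or API: the function name, the tier labels, and the one-day and 30-day thresholds are all hypothetical, chosen only to illustrate the hot, warm, and cold idea.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds; the article does not give concrete values.
HOT_MAX_AGE = timedelta(days=1)
WARM_MAX_AGE = timedelta(days=30)


def choose_tier(last_access: datetime, now: datetime) -> str:
    """Pick an illustrative storage tier from the time since last access."""
    age = now - last_access
    if age <= HOT_MAX_AGE:
        return "MEMORY"          # hottest data stays in RAM
    if age <= WARM_MAX_AGE:
        return "SSD_OR_HDD"      # warm data sits on local disks
    return "CLOUD_ARCHIVE"       # coldest data goes to a cheap archive tier


now = datetime(2019, 7, 16, 12, 0, 0)
recent = choose_tier(now - timedelta(hours=2), now)
month_old = choose_tier(now - timedelta(days=10), now)
stale = choose_tier(now - timedelta(days=400), now)
```

A real policy would likely weigh access frequency and available capacity as well as age; the sketch keeps only the age rule that the article mentions.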
“People have a lot of data. But at the end of the day, the valuable data is not all your data. Typically the valuable data is 1% to 10% of your data. The industry average is 4%,” Li said. “With that in mind, you’re moving a very small piece of your data. You cannot replicate all your data all over the place. But with a system like this you can easily orchestrate your data and move the useful data to the right place.”
Alluxio also gives users more control over data access policies that customers set for the underlying storage systems that it’s connected to. The software already features fine-grained control at the file level, and with version 2.0, users can set policies at the directory and folder levels.
Alluxio can “see” the data residing in the file system trees through the metadata, but it won’t actually pull the data until the compute requests it, says Dipti Borkar, vice president of product for Alluxio. “That’s done on demand as the compute requests it,” she said. “You want it compute driven, on demand. You can also pin pre-fetching policies if you know beforehand what a working set might be.”
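The behavior Borkar describes, where metadata is visible up front and data is pulled only when compute asks for it, can be sketched roughly as follows. The class and method names are hypothetical and are not Alluxio's client API; the `fetch` callable stands in for whatever reads from the remote store.

```python
from typing import Callable, Dict, Iterable, List


class OnDemandCache:
    """Toy model: metadata is known immediately, bytes are fetched lazily."""

    def __init__(self, metadata: Dict[str, int], fetch: Callable[[str], bytes]) -> None:
        self._metadata = dict(metadata)   # path -> size in bytes, visible up front
        self._fetch = fetch               # hypothetical loader for the remote store
        self._cache: Dict[str, bytes] = {}
        self.fetch_count = 0              # number of remote reads performed

    def list_paths(self) -> List[str]:
        # Listing touches metadata only; no data is pulled.
        return sorted(self._metadata)

    def read(self, path: str) -> bytes:
        if path not in self._metadata:
            raise FileNotFoundError(path)
        if path not in self._cache:
            # First access: pull the data from the remote store and keep it.
            self._cache[path] = self._fetch(path)
            self.fetch_count += 1
        return self._cache[path]

    def prefetch(self, paths: Iterable[str]) -> None:
        # Warm the cache ahead of time for a known working set.
        for path in paths:
            self.read(path)


remote = {"/logs/a": b"alpha", "/logs/b": b"beta"}
cache = OnDemandCache({p: len(v) for p, v in remote.items()}, remote.__getitem__)
```

Listing paths costs no remote reads, a repeated `read` of the same path hits the cache, and `prefetch` simply performs the reads early.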
The new release also makes it easier to move data among cloud object stores to optimize storage and compute resources. Sometimes companies want data movement to occur continuously, and be highly automated, Borkar said. “It can live on forever once you’ve defined a policy,” she said. “But sometimes you want to have movement that’s highly efficient on a one-time basis.”
Optimizing data for specific compute engines is another new feature. Alluxio clusters (or clusters on which the Alluxio JAR file is deployed) can now be optimized for different workloads, Borkar said.
“You can do compute-focused partitioning to allow for working sets to be not polluted across and having very tight performance and locality,” she said. “So you now can say ‘I want 30 nodes to be dedicated to Spark, Hive gets 500 nodes, and Presto, you get 40 nodes.’”
Alluxio also gets a new REST-based interface for moving data among on-prem or cloud systems, or any combination. Borkar said this feature will simplify the process of mounting S3, HDFS, and other storage repositories to Alluxio.
“So you can say I want to join data that’s in the Kaggle data service or Data.gov and I want to do that on the fly with data in S3 or maybe even on-premise HDFS, and that can get pulled in on demand,” she said. “That provides a simpler approach than pulling that data into one data lake. Just leave it where it is, and pull it on demand, as your data science or data team requires it. We think this is a very interesting capability, particularly as data science gets even more advanced, and the need for external data and combining it with enterprise data becomes more important.”
Alluxio also gets better integration with AWS, specifically EMR. There is now a bootstrapping mechanism that makes Alluxio part of every node within the EMR cluster, Borkar said. “You get metadata caches and strong consistency of metadata that you don’t get with S3,” she said, “but you also get data locality with the data cache that can be used for Spark, Presto, Hive, etc.”
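One way to picture mounting several repositories under a single namespace is a table that maps path prefixes to storage URIs and resolves each request to the longest matching prefix. The sketch below is an assumption-laden illustration, not Alluxio's REST interface or mount implementation; the class, the bucket name, and the HDFS address are all made up.

```python
from typing import Dict, Optional


class MountTable:
    """Toy unified namespace: logical path prefixes map to storage URIs."""

    def __init__(self) -> None:
        self._mounts: Dict[str, str] = {}

    def mount(self, prefix: str, target_uri: str) -> None:
        # Normalize trailing slashes so "/sales/" and "/sales" are the same mount.
        self._mounts[prefix.rstrip("/") or "/"] = target_uri.rstrip("/")

    def resolve(self, path: str) -> str:
        best: Optional[str] = None
        for prefix in self._mounts:
            # Match whole path components only ("/sales" must not match "/salesforce").
            if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
                if best is None or len(prefix) > len(best):
                    best = prefix   # the longest (most specific) mount wins
        if best is None:
            raise KeyError(f"no mount covers {path}")
        suffix = path if best == "/" else path[len(best):]
        return self._mounts[best] + suffix


table = MountTable()
table.mount("/sales", "s3://example-bucket/sales")              # hypothetical bucket
table.mount("/sales/archive", "hdfs://namenode:8020/archive")   # hypothetical cluster
s3_path = table.resolve("/sales/2019/q1.csv")
hdfs_path = table.resolve("/sales/archive/2010.csv")
```

An application sees one path tree and never needs to know which store holds a given file, which is the point of leaving data where it is and pulling it on demand.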
The open source Alluxio community has been busy, and there’s even more stuff with 2.0. Just as Alluxio supports tiering of data, it now supports tiering of metadata. The hottest metadata stays in memory, but as it ages or is accessed less frequently, the metadata is moved off-heap to a RocksDB database. That one trick will allow Alluxio to scale into the billions of files, Borkar said.
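The metadata tiering idea can be sketched as a small in-memory map that spills its least recently used entries to a slower store. In this illustration a plain dictionary stands in for the off-heap RocksDB database, and the class name, capacity parameter, and eviction rule are assumptions rather than Alluxio's actual design.

```python
from collections import OrderedDict
from typing import Dict, Optional


class TieredMetadataStore:
    """Toy two-tier store: hot entries in memory, the rest spilled to a cold store."""

    def __init__(self, hot_capacity: int) -> None:
        if hot_capacity < 1:
            raise ValueError("hot_capacity must be at least 1")
        self._hot: "OrderedDict[str, dict]" = OrderedDict()  # least recently used first
        self._cold: Dict[str, dict] = {}   # stand-in for an on-disk store such as RocksDB
        self._hot_capacity = hot_capacity

    def put(self, path: str, meta: dict) -> None:
        self._cold.pop(path, None)         # keep a single copy of each entry
        self._hot[path] = meta
        self._hot.move_to_end(path)        # mark as most recently used
        self._evict()

    def get(self, path: str) -> Optional[dict]:
        if path in self._hot:
            self._hot.move_to_end(path)
            return self._hot[path]
        if path in self._cold:
            meta = self._cold.pop(path)    # promote on access
            self._hot[path] = meta
            self._evict()
            return meta
        return None

    def tier_of(self, path: str) -> Optional[str]:
        if path in self._hot:
            return "hot"
        if path in self._cold:
            return "cold"
        return None

    def _evict(self) -> None:
        # Spill least recently used entries once the hot tier is over capacity.
        while len(self._hot) > self._hot_capacity:
            old_path, old_meta = self._hot.popitem(last=False)
            self._cold[old_path] = old_meta


store = TieredMetadataStore(hot_capacity=2)
for name in ("/a", "/b", "/c"):
    store.put(name, {"size": 1})
```

After the three inserts, `/a` has been spilled to the cold tier; reading it again promotes it and pushes out `/b`, the least recently used hot entry.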
Alluxio 2.0 also sports a new transport protocol. It’s moved away from Thrift to gRPC, Google’s open source remote procedure call (RPC) framework. The move was made to improve the efficiency of data movement in large clusters, according to Borkar.
“It allows us to scale to thousands of nodes,” she said. “The largest known community user has a 1,300 node cluster. But with this, we expect to go to 3,000 to 4,000 nodes and beyond. Probably once you hit 5,000, you need the next level of optimization.”
Alluxio is available as an open source product with an Apache 2.0 license. The company also sells subscriptions to an enterprise version of the product that includes better security and more advanced data management capabilities, as well as technical support.