November 16, 2021

Alluxio Nabs $50M, Preps for Growth in Data Orchestration


Data orchestration software provider Alluxio today announced the close of an oversubscribed $50-million Series C round, which its CEO plans to spend on a global expansion. It also launched version 2.7 of its software, which is aimed at accelerating machine learning and analytics use cases and providing some relief to the multiplication of data silos.

Haoyuan “HY” Li co-founded Alluxio six years ago with bold plans to build a data virtualization layer that decoupled data processing engines from the underlying storage repositories that actually persist the data. The company was the commercial vehicle for the open source distributed virtual file system Li helped develop at the UC Berkeley AMPLab, alongside other prominent AMPLab projects like Spark and Mesos.

When installed on a cluster next to an existing file system or object store, such as NFS, S3, Ceph, HDFS, or Gluster, that orchestration layer (originally called Tachyon but later renamed Alluxio) could dramatically accelerate the throughput of data engines sitting on top, including Spark, Presto, TensorFlow, H2O, MapReduce, or Impala.
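The decoupling described above can be illustrated with a toy read-through cache. This is only a minimal sketch of the general idea, not Alluxio's API or implementation: the class names `SlowStore` and `OrchestrationLayer` and the file path are hypothetical, chosen purely for illustration.

```python
from typing import Dict, Protocol


class Store(Protocol):
    """Anything that can return the bytes stored at a path."""

    def read(self, path: str) -> bytes: ...


class SlowStore:
    """Hypothetical stand-in for a remote file system or object store."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self._objects = objects
        self.reads = 0  # counts how often the backing store is actually hit

    def read(self, path: str) -> bytes:
        self.reads += 1
        return self._objects[path]


class OrchestrationLayer:
    """Toy in-memory read-through cache placed in front of any Store."""

    def __init__(self, backing: Store) -> None:
        self._backing = backing
        self._memory: Dict[str, bytes] = {}

    def read(self, path: str) -> bytes:
        # Fetch from the backing store only on the first access to a path.
        if path not in self._memory:
            self._memory[path] = self._backing.read(path)
        return self._memory[path]


store = SlowStore({"/warehouse/events.parquet": b"rows"})
layer = OrchestrationLayer(store)
first = layer.read("/warehouse/events.parquet")   # served by the store
second = layer.read("/warehouse/events.parquet")  # served from memory
```

Because the compute side only ever calls `layer.read`, the backing store could in principle be swapped for another one without changing the caller, which is the point Li makes about shifting storage infrastructure.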

This not only provided a performance or efficiency boost, but also protected the business from the continually shifting sands of the storage infrastructure. That was the topic of Li’s PhD thesis at Berkeley, which theorized that the market for storage software goes through a roughly eight-year replacement cycle.

“All the storage vendors, their goal or their message has been [developing] a better storage than before. Better means faster, cheaper, easier to use,” Li says. “For example, HDFS people said HDFS is going to dominate the world. All your data will be moved into HDFS. But that story is actually repeating itself very roughly every eight years, or every decade. So every decade, based on the whole storage industry revolution, there will be a new wave of system architecture to replace the previous generation.”

As an in-memory data orchestration layer, Alluxio speeds data from persistent storage to consuming data engines (Image source: Alluxio)

According to Li, Alluxio provides the mechanism by which customers can start to get off the storage-replacement treadmill (or at least not be completely beholden to it, although they still have to persist their data somewhere). That will have the intended effect of lowering customers’ future storage costs while delivering a 5x to 10x or higher performance boost for today’s workloads, he says.

When the Hadoop bubble popped, Amazon Web Services’ S3 and S3-compatible object stores became the new storage du jour. With the capability to store a nearly infinite amount of data in a global namespace, object stores have embraced the “big” in big data, but at the expense of performance, which is typically abysmal.

It took a bit of time, but Alluxio’s message of performance and future-compatibility now appears to be resonating with some of the biggest businesses in the world, many of whom are struggling with object storage overload. For example, Li says one of his customers, a Fortune 300 company, is already using seven different object storage systems. “And that’s not even counting the file systems,” he tells Datanami.

The beginning of 2020 was rough, with the COVID-19 pandemic and the departure of then-CEO Steven Mih, who left to co-found and lead Ahana, a Presto software company.

“But I took the company back and put it on the right course and we closed the last year very strong,” Li says, adding that the company experienced 3.5x growth in its business in 2020 and was cash-flow positive after the first quarter of 2021. “So far this year, we have been growing very strong as well.”

Alluxio co-founder and CEO Haoyuan “HY” Li wrote his PhD thesis on the impermanence of persisted storage layers

Eight of the 10 largest Internet companies use Alluxio, including Facebook, Airbnb, Uber, Alibaba, Tencent, and Bytedance, the company says. “They’re all running us in production today,” Li says. “Some are running on 10,000 nodes already.”

The $50 million Series C round was led by an unnamed “global investment firm” and had participation from existing investors, including a16z, Seven Seas Partners, and Volcanics Ventures. The San Mateo, California-based company has now raised a total of $70 million.

When asked what he was going to spend the money on, Li responded, “people, people, people.” The company started the fiscal year (which begins February 1) with around 50 people. By the close of the current fiscal year on January 31, 2022, Li hopes to have doubled the number of workers.

“With the new funding, we’re essentially using that to expand our operations globally, particularly APAC and EMEA region,” Li says. “And we are expanding our bandwidth from an R&D perspective to fulfil the need from ecosystem, from customers, etc. At the same time we will enlarge our go-to-market team to better take care of our existing and new customers, and to supply the demand.”

It’s very difficult to go to market with a full-on platform play, Li concedes. So to move the needle, Alluxio needs to show customers that it can serve the demands of current projects. In that regard, Alluxio’s capability to help companies run AI and analytics workloads in a hybrid cloud environment surely fits the bill.

“For example, you run Spark, Presto, TensorFlow either on top of remote [storage] or on premise storage, because they want to keep the data on-premise,” Li says. “Then you would run Alluxio with that, and [benefit from a] 10x or higher hybrid cloud efficiency improvement, performance improvement. You get the value right away.”

The company also announced Alluxio version 2.7, which brings several enhancements to its data orchestration layer. For starters, it brings support for Hudi and Iceberg table formats, which the company says will enable customers to more quickly and easily scale data lakes serving Presto and Spark analytics.

Alluxio 2.7 also introduces a new Container Storage Interface (CSI) driver for Kubernetes and a Kubernetes operator for machine learning, which the company says will make it easier to operate machine learning pipelines on Alluxio in containerized environments.

It also brings support for Nvidia’s Data Loading Library (DALI), a Python library that supports CPU and GPU execution. New techniques for batching data management jobs should also lower the burden on underlying compute resources, the company says, while a new “shadow cache” should help provide better insight into the impact of cache size on response times for Presto environments.
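The question a “shadow cache” is described as answering, namely how cache size affects behavior, can be approximated offline by replaying an access trace against caches of different sizes. The sketch below is not Alluxio's shadow cache; it assumes a plain LRU eviction policy and a made-up trace, and it measures hit ratio as a rough proxy for response time. The function name `lru_hit_ratio` is hypothetical.

```python
from collections import OrderedDict
from typing import Iterable


def lru_hit_ratio(trace: Iterable[str], capacity: int) -> float:
    """Replay a trace against an LRU cache holding `capacity` keys."""
    cache: "OrderedDict[str, None]" = OrderedDict()
    hits = 0
    total = 0
    for key in trace:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / total if total else 0.0


# Made-up access trace; a real study would use logged query accesses.
trace = ["a", "b", "a", "c", "b", "a", "d", "a"]
ratios = {size: lru_hit_ratio(trace, size) for size in (1, 2, 3)}
```

Comparing the entries of `ratios` shows how quickly the hit ratio climbs as the simulated cache grows, which is the kind of sizing insight the feature is said to provide for Presto environments.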

Due to surging customer demand, optimizing Presto performance is a key area of focus going forward for Alluxio, Li says. “They’re virtualizing the compute, we are virtualizing the data,” he says. “So we’re doubling down on that as well.”

According to ESG Analyst Mike Leone, Alluxio can help address pressures that companies with large-scale analytics and AI/ML computing frameworks are coming under.

“Organizations want to use more affordable and scalable storage options like cloud object stores, but they want peace of mind knowing they don’t have to make costly application changes or experience new performance issues,” Leone says in a press release. “Alluxio is helping organizations address these challenges by abstracting away storage details while bringing data closer to compute, especially in hybrid cloud and multi-cloud environments.”

Related Items:

Alluxio Claims 5X Query Speedup by Optimizing Data for Compute

Alluxio Bolsters Data Orchestration for Hybrid Cloud World

Meet Alluxio, the Distributed File System Formerly Known as Tachyon