Alluxio Claims 5X Query Speedup by Optimizing Data for Compute
Ever since Alluxio emerged from the AMPLab, its focus as a data orchestration layer has been to grease the wheels for data initiatives by making remotely stored data appear to be stored locally. Today the company took its silo-busting abstraction layer to new heights with Structured Data Service, which hooks the orchestration layer into existing data catalogs and optimizes data for compute engines rather than storage efficiency.
The big new delivery with today’s announcement of Alluxio version 2.2 is the addition of Structured Data Service, a new component of the enterprise edition of Alluxio that introduces three new features: integration with the Hive Metastore, a Transformation Service, and a new connector for Presto. (The community edition does not include these new features.)
Arguably the most compelling new feature is the Transformation Service, which exposes three new services: coalesce, format conversion, and sorting. Armed with these data services, Alluxio data consumers (i.e., data scientists and data analysts) can see 2.5x to 5x speedups in query performance, the vendor says.
The coalesce service helps data engineers combine a large number of small files into a smaller number of larger files that can be queried more easily. The format conversion service is geared toward converting data stored in formats like CSV into a more optimized format, like Parquet. Finally, the sorting service can pre-process data by ordering it along the dimensions that queries filter on, eliminating the need to scan the entire table.
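To make the coalesce and sorting ideas concrete, here is a minimal Python sketch. It is not Alluxio's implementation or API; the names `coalesce`, `make_chunks`, and `query_range` are hypothetical, and plain lists of dictionaries stand in for files and tables. Format conversion is not covered. The sketch merges many tiny "files" into one row set, then shows how sorting on the filtered column lets a query skip chunks by checking only each chunk's minimum and maximum value.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Row = Dict[str, int]


@dataclass
class Chunk:
    """A block of rows plus min/max statistics for one column."""
    rows: List[Row]
    min_key: int
    max_key: int


def coalesce(small_files: List[List[Row]]) -> List[Row]:
    """Merge many small 'files' (lists of rows) into one row list."""
    return [row for f in small_files for row in f]


def make_chunks(rows: List[Row], key: str, chunk_size: int) -> List[Chunk]:
    """Split rows into fixed-size chunks, recording min/max of `key` per chunk."""
    chunks = []
    for i in range(0, len(rows), chunk_size):
        part = rows[i:i + chunk_size]
        values = [r[key] for r in part]
        chunks.append(Chunk(part, min(values), max(values)))
    return chunks


def query_range(chunks: List[Chunk], key: str, lo: int, hi: int) -> Tuple[List[Row], int]:
    """Return rows with lo <= row[key] <= hi and the number of chunks scanned."""
    hits: List[Row] = []
    scanned = 0
    for c in chunks:
        # Skip chunks whose value range cannot overlap the query range.
        if c.max_key < lo or c.min_key > hi:
            continue
        scanned += 1
        hits.extend(r for r in c.rows if lo <= r[key] <= hi)
    return hits, scanned


# Nine one-row "files" arriving in arbitrary order (illustrative data only).
small_files = [[{"day": d, "value": d * 10}] for d in (7, 2, 9, 4, 1, 8, 3, 6, 5)]
rows = coalesce(small_files)

unsorted_chunks = make_chunks(rows, "day", 3)
sorted_chunks = make_chunks(sorted(rows, key=lambda r: r["day"]), "day", 3)

_, scanned_unsorted = query_range(unsorted_chunks, "day", 4, 5)
_, scanned_sorted = query_range(sorted_chunks, "day", 4, 5)
print(scanned_unsorted, scanned_sorted)
```

In this toy data set, every unsorted chunk spans a wide range of days, so none can be skipped, while the sorted layout confines the matching rows to a single chunk.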
It’s all about presenting data in the form that makes the data analysts and data scientists who query it most productive, says Alluxio CEO Steven Mih.
“It’s not about the producers of the data. It’s about the consumers of the data,” Mih said in an interview with Datanami last week. “You’ll buy data. You’ll gather data. There’s a lot of energy focused on massaging and landing that data. But there’s not a lot of focus on delivering data to the consumers, and that’s the part that Alluxio is focused on.”
When installed next to disparate data stores, Alluxio’s open source software streamlines data access for users, no matter where those users or the data physically reside. Much of that data is headed to the cloud, and Alluxio’s software can help, Mih says.
“I think data has been a friction for many companies, and that’s changing fast,” he says. “The big data warehouses and data lakes that were on prem are now shifting [to the cloud] as companies get more comfortable….The cloud is the lowest common denominator right now, the lowest cost per bit, and S3 and S3-compatible [cloud object stores] have won that.”
But many of the same challenges that organizations faced with on-prem data lakes continue to exist with the new cloud data lakes. Data engineers are still called upon to write ETL jobs and build data pipelines to keep analysts and data scientists busy with the latest data. The cloud has not solved the data silo problem. The cloud has just changed where data silos are stored.
“You have the same data sets, but you may want to have PyTorch running on the data set, or you may want to have Presto running on the data sets,” he says. “Those are different frameworks that expect data in different ways, and so Alluxio is providing that in a compute optimized way based on what compute you’re using.”
A side benefit to customers is lower risk of lock-in with the cloud vendors. “We believe in having less silos of data,” Mih says. “We find that there are many companies that have data on prem, and it’s not all going to go to the cloud. Yes, there will be new parts in the cloud. Mount that with Alluxio. Mount it to on-prem, and now they can run advanced analytics and AI across all those things.”
Eric Kavanagh, CEO of the Bloor Group, says Alluxio’s data distribution techniques are driving benefits for customers operating in a hybrid world. “We can thank Kubernetes for distributed compute; and Alluxio for distributed data,” he says in a press release. “The combination of these technologies offers tremendous promise for our data-driven hybrid and multicloud future.”
Version 2.2 of Alluxio also includes a new Catalog Service that hooks into the Hive Metastore. This will streamline the delivery of schema and table information into Alluxio.
Instead of pointing a data processing engine like Presto or Spark directly at the data, users can point it at the Hive Metastore and get a unified view of all the data catalogs that exist in their systems, says Aseem Rastogi, Alluxio’s vice president of engineering.
“We just basically connect to that catalog and are able to suck in all the table and schema information into our services, and then use that for these other query workloads,” Rastogi says. “Just like it abstracts all the different source systems, it’s also abstracting the various catalog services.”
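The catalog abstraction Rastogi describes can be pictured with a small Python sketch. This is an illustration under stated assumptions, not Alluxio's Catalog Service: the classes `InMemoryCatalog` and `UnifiedCatalog`, the `attach` method, and the sample table names are all hypothetical. The sketch pulls table and schema information from several catalog backends into one namespace that a query engine could consult.

```python
from typing import Dict, List, Protocol


class Catalog(Protocol):
    """Anything that can report its tables and their column names."""

    def list_tables(self) -> Dict[str, List[str]]: ...


class InMemoryCatalog:
    """Hypothetical stand-in for an external catalog such as a metastore."""

    def __init__(self, tables: Dict[str, List[str]]) -> None:
        self._tables = dict(tables)

    def list_tables(self) -> Dict[str, List[str]]:
        return dict(self._tables)


class UnifiedCatalog:
    """Collects schemas from several catalogs under qualified names."""

    def __init__(self) -> None:
        self._schemas: Dict[str, List[str]] = {}

    def attach(self, name: str, catalog: Catalog) -> None:
        # Copy table and schema information in, prefixed by the catalog name.
        for table, columns in catalog.list_tables().items():
            self._schemas[f"{name}.{table}"] = list(columns)

    def tables(self) -> List[str]:
        return sorted(self._schemas)

    def schema(self, qualified_name: str) -> List[str]:
        return self._schemas[qualified_name]


# Two illustrative catalogs with made-up tables.
on_prem = InMemoryCatalog({"orders": ["id", "amount"]})
cloud = InMemoryCatalog({"clicks": ["user_id", "url", "ts"]})

unified = UnifiedCatalog()
unified.attach("onprem", on_prem)
unified.attach("cloud", cloud)
print(unified.tables())
```

Prefixing each table with its source catalog's name is one simple way to avoid collisions when two catalogs define a table with the same name.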
The Catalog Service currently supports the AWS Glue Data Catalog, and in the future Alluxio will work with other data catalogs, including the one from Collibra. In addition to simplifying access to the data, the data catalog integration can also reduce any latency that might be introduced, the company says.
The Hive Metastore integration actually powers the new data transformation services, says Haoyuan “HY” Li, the founder and CTO of Alluxio.
“These schema-aware optimizations are made possible with the new Alluxio Catalog Service which abstracts the widely-used Apache Hive Metastore,” he says in a press release. “So regardless of how the data was initially stored–CSV and text formatted files, for example–the data is now transformed into the generally recognized compute-optimized Parquet format.”