March 11, 2020

Alluxio Claims 5X Query Speedup by Optimizing Data for Compute

Alex Woodie


Ever since Alluxio emerged from the AMPLab, its focus as a data orchestration layer has been to grease the wheels for data initiatives by making remotely stored data appear to be stored locally. Today the company took its silo-busting abstraction layer to new heights with Structured Data Service, which hooks the orchestration layer into existing data catalogs and optimizes data for compute engines rather than storage efficiency.

The big new delivery with today’s announcement of Alluxio version 2.2 is Structured Data Service, a new component of the enterprise edition of Alluxio that introduces three new features: integration with the Hive Metastore, the Transformation Service, and a new connector for Presto. (The community edition does not include these new features.)

Arguably the most compelling new feature is the Transformation Service, which exposes three new services: coalesce, format conversion, and sorting. Armed with these data services, Alluxio data consumers (i.e. data scientists and data analysts) can see 2.5x to 5x speedups in query performance, the vendor says.

The coalesce service helps data engineers combine a large number of small files into a smaller number of larger files that can be more easily queried. The format conversion service is geared toward converting data stored in formats like CSV into a more optimized format, like Parquet. Finally, the sorting service can pre-process data by organizing it along certain important dimensions, eliminating the need to scan the entire table.
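To make two of those ideas concrete, here is a minimal sketch in plain Python of why coalescing and pre-sorting can help a query. It does not use Alluxio's software or API; the sample data and the function names (`coalesce`, `sort_by`, `lookup`) are hypothetical and exist only for illustration. Format conversion to Parquet is left out because it would need a third-party library.

```python
import bisect
import csv
import io

# Hypothetical small files, each holding a few (region, amount) rows.
small_files = [
    "region,amount\nwest,10\neast,5\n",
    "region,amount\nnorth,7\n",
    "region,amount\neast,3\nwest,2\n",
]


def coalesce(files):
    """Combine many small CSV payloads into a single list of row dicts."""
    rows = []
    for payload in files:
        rows.extend(csv.DictReader(io.StringIO(payload)))
    return rows


def sort_by(rows, column):
    """Pre-sort rows on one column and return (sorted keys, sorted rows)."""
    ordered = sorted(rows, key=lambda row: row[column])
    keys = [row[column] for row in ordered]
    return keys, ordered


def lookup(keys, ordered, value):
    """Binary-search the sorted keys instead of scanning every row."""
    lo = bisect.bisect_left(keys, value)
    hi = bisect.bisect_right(keys, value)
    return ordered[lo:hi]


# One-time preparation: coalesce the small files, then sort on the query column.
keys, ordered = sort_by(coalesce(small_files), "region")

# Query time: only the matching slice of the table is touched.
east_rows = lookup(keys, ordered, "east")
print(len(ordered), [row["amount"] for row in east_rows])
```

The preparation cost is paid once, ahead of time, which is the trade the article describes: the data is laid out for the consumer's queries rather than for how it happened to land in storage.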


It’s all about presenting data in the form that makes data analysts and data scientists most productive when they query and work with it, says Alluxio CEO Steven Mih.

“It’s not about the producers of the data. It’s about the consumers of the data,” Mih says in an interview with Datanami last week. “You’ll buy data. You’ll gather data. There’s a lot of energy focused on massaging and landing that data. But there’s not a lot of focus on delivering data to the consumers, and that’s the part that Alluxio is focused on.”

When installed next to disparate data stores, Alluxio’s open source software streamlines data access for users, no matter where those users or the data physically reside. Much of that data is headed to the cloud, and Alluxio’s software can help, Mih says.

“I think data has been a friction for many companies, and that’s changing fast,” he says. “The big data warehouses and data lakes that were on prem are now shifting [to the cloud] as companies get more comfortable….The cloud is the lowest common denominator right now, the lowest cost per bit, and S3 and S3-compatible [cloud object stores] have won that.”

But many of the same challenges that organizations faced with on-prem data lakes continue to exist with the new cloud data lakes. Data engineers are still called upon to write ETL jobs and build data pipelines to keep analysts and data scientists busy with the latest data. The cloud has not solved the data silo problem. The cloud has just changed where data silos are stored.

Alluxio’s idea is to prevent customers from creating so many data silos by exposing different views of the same data to different constituencies, Mih says.

“You have the same data sets, but you may want to have PyTorch running on the data set, or you may want to have Presto running on the data sets,” he says. “Those are different frameworks that expect data in different ways, and so Alluxio is providing that in a compute optimized way based on what compute you’re using.”

A side benefit to customers is lower risk of lock-in with the cloud vendors. “We believe in having less silos of data,” Mih says. “We find that there are many companies that have data on prem, and it’s not all going to go to the cloud. Yes, there will be new parts in the cloud. Mount that with Alluxio. Mount it to on-prem, and now they can run advanced analytics and AI across all those things.”

Eric Kavanagh, CEO of the Bloor Group, says Alluxio’s data distribution techniques are driving benefits for customers operating in a hybrid world. “We can thank Kubernetes for distributed compute; and Alluxio for distributed data,” he says in a press release. “The combination of these technologies offers tremendous promise for our data-driven hybrid and multicloud future.”

Version 2.2 of Alluxio also includes a new Catalog Service that hooks into the Hive Metastore. This will streamline the delivery of schema and table information into Alluxio.

Instead of pointing a data processing engine like Presto or Spark directly at data, users can point it at the Hive Metastore and get a unified view of all the data catalogs that exist in their system, says Aseem Rastogi, Alluxio’s vice president of engineering.

“We just basically connect to that catalog and are able to suck in all the table and schema information into our services, and then use that for these other query workloads,” Rastogi says. “Just like it abstracts all the different source systems, it’s also abstracting the various catalog services.”

The Catalog Service currently supports AWS Glue Data Catalog, and in the future Alluxio will work with other data catalogs, including the one from Collibra. In addition to simplifying access to the data, the data catalog integration can also reduce any latency that might be introduced, the company says.

The Hive Metastore integration actually powers the new data transformation services, says Haoyuan “HY” Li, the founder and CTO of Alluxio.

“These schema-aware optimizations are made possible with the new Alluxio Catalog Service which abstracts the widely-used Apache Hive Metastore,” he says in a press release. “So regardless of how the data was initially stored–CSV and text formatted files, for example–the data is now transformed into the generally recognized compute-optimized Parquet format.”

Alluxio Bolsters Data Orchestration for Hybrid Cloud World

Meet Alluxio, the Distributed File System Formerly Known as Tachyon

