November 11, 2015

AMPLab Releases Succinct, A New Way to Query Data in Spark

Alex Woodie

The folks at UC Berkeley’s AMPLab are starting to talk publicly about Succinct, a fast new in-memory data store that lets users query large amounts of compressed data directly. Early tests show Succinct can perform the same types of big data queries that popular NoSQL databases can do, but on systems one-tenth the size.

Succinct is an in-memory data store that enables the execution of complex queries directly on compressed data. It does this through a novel compression mechanism that embeds all of the information needed to answer queries within the compressed representation of the data. In this manner, it can enable interactive queries over multiple gigabytes’ worth of database values on a system with relatively modest memory.

The problem of how best to store secondary attributes is one that has challenged database developers. Two basic approaches have been taken, AMPLab researchers Rachit Agarwal, Anurag Khandelwal, and Ion Stoica write in the October 2015 paper “Succinct: Enabling Queries on Compressed Data,” which can be downloaded here.

“At one extreme, systems such as column oriented stores simply scan the data,” they write. “However, data scans incur high latency for large data sizes, and have limited throughput since queries typically touch all machines.”

At the other extreme are secondary indexes. “When stored in-memory, these indexes are not only fast, but can achieve high throughput since it is possible to execute each query on a single machine,” the AMPLab directors write. The main disadvantage of indexing? A high memory footprint, since the secondary indexes can be up to eight times (8x) bigger than the source data itself.

Succinct delivers queries faster than native Spark and ElasticSearch

Succinct delivers the same level of query throughput achieved through indexing but without the high memory footprint of secondary indexes. It achieves this by storing an “entropy-compressed representation of the input data that allows random access, enabling efficient storage and retrieval of data,” the researchers write. “Succinct’s data representation natively supports count, search, range and wildcard queries without storing indexes–all the required information is embedded within this compressed representation.”

Succinct also executes queries directly on the compressed representation, avoiding the need for data scans and decompression. “What makes Succinct a unique system,” the AMPLab researchers write, “is that it not only stores a compressed representation of the input data, but also provides functionality similar to systems that use indexes along with input data.”
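To make the idea of answering count and search queries without a scan more concrete, here is a minimal sketch in Python. It is not Succinct's data structure and performs no compression: it uses a plain, uncompressed suffix array as a stand-in, and the function names (`build_suffix_array`, `count`, `search`) are hypothetical, chosen only for illustration. What it shows is that once the suffixes of the input are kept in sorted order, a pattern can be located with binary search instead of touching every byte of the data.

```python
def build_suffix_array(text):
    """Return the start offsets of all suffixes of `text`, in sorted order.

    Naive construction (sorts full suffix strings); fine for a small demo,
    far too slow and memory-hungry for real datasets.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def _match_bounds(text, sa, pattern):
    """Binary-search the slice [start, end) of `sa` whose suffixes begin with `pattern`."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def count(text, sa, pattern):
    """Number of occurrences of `pattern` in `text`."""
    start, end = _match_bounds(text, sa, pattern)
    return end - start


def search(text, sa, pattern):
    """Sorted offsets at which `pattern` occurs in `text`."""
    start, end = _match_bounds(text, sa, pattern)
    return sorted(sa[start:end])


text = "banana"
sa = build_suffix_array(text)
ana_count = count(text, sa, "ana")
ana_offsets = search(text, sa, "ana")
```

Each query here costs a logarithmic number of string comparisons rather than a pass over the whole input. The catch with a plain suffix array is that it is larger than the data it indexes, which is exactly the memory-footprint problem the researchers describe for secondary indexes; per the article, Succinct's contribution is getting comparable query functionality from a compressed representation.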

AMPLab, which gave the world the Apache Spark framework, the Tachyon distributed file system, and the Mesos resource manager (as well as a dozen or so less-visible projects, like KeystoneML, the Akaros OS, CrowdDB, and more) has been working on Succinct for some time, but only recently ported it into Spark and supported Spark’s resilient distributed datasets (RDDs).

“We are very excited to announce the release of Succinct Spark, as a Spark package, that achieves a unique tradeoff–storage overhead no worse (and often lower) than data-scan based techniques and query latency comparable to index-based techniques,” Agarwal wrote last week on the AMPLab blog.

“Succinct Spark enables search (and a wide range of other queries) directly on compressed representation of the RDDs,” he continues. “What differentiates Succinct Spark is that queries are supported without storing any secondary indexes, without data scans and without data decompression–all the required information is embedded within the compressed RDD and queries are executed directly on the compressed RDD.”

According to Agarwal, Succinct allows users to use Spark as a document store similar to ElasticSearch. When used as a document store, Succinct Spark is 2.75x faster than ElasticSearch for search queries while requiring 2.5x lower storage, he says. Compared to native Spark writing to disk, it’s 75x faster.

The Berkeley Data Analytics Stack (BDAS)

The AMPLab has also tested Succinct against MongoDB and Cassandra, which support more complex data types than ElasticSearch’s key-value store. According to a 2014 Gigaom article, the researchers demonstrated how Succinct was able to store a 123GB dataset on a single machine with 64GB of memory, whereas the NoSQL data stores required secondary indexes, which were spread across 16 servers, each with 64GB of memory.
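The figures in that comparison can be worked through with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative, and the last step assumes the whole dataset resided in the single machine's memory, which the article implies but does not state outright.

```python
# Figures quoted in the article.
dataset_gb = 123           # size of the raw dataset
mem_per_machine_gb = 64    # memory on each server
succinct_machines = 1      # Succinct: a single machine
nosql_machines = 16        # NoSQL stores with secondary indexes

# Aggregate memory available to each setup.
succinct_mem_gb = succinct_machines * mem_per_machine_gb   # 64
nosql_mem_gb = nosql_machines * mem_per_machine_gb         # 1024

# How much more memory the indexed setup had at its disposal.
memory_ratio = nosql_mem_gb / succinct_mem_gb              # 16.0

# Assumption: the full dataset was held in the one machine's memory.
# Then the compressed representation is at most this fraction of the raw size.
max_compressed_fraction = succinct_mem_gb / dataset_gb     # about 0.52

print(f"NoSQL cluster memory: {nosql_mem_gb} GB ({memory_ratio:.0f}x Succinct's)")
print(f"Upper bound on Succinct's footprint: {max_compressed_fraction:.0%} of raw data")
```

In other words, the indexed deployment had 1,024GB of aggregate memory against Succinct's 64GB, and under the stated assumption Succinct's representation of the data took up no more than roughly half the raw dataset size.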

Succinct isn’t ready to supplant those popular databases just yet, but it could provide the technical underpinning for radical new approaches to storing and accessing large amounts of data in the future.

“There are a large number of interesting follow up projects in AMPLab on Succinct exploring the fundamental limits to querying on compressed data, adding new applications on top of Succinct, and improving the performance for existing applications,” Agarwal writes. “We will write a lot more about these very exciting projects on Succinct webpage.”

The group has released open source code for Succinct, specifically for compressing RDDs in Spark and exposing DataFrames for Spark SQL. You can find downloads at http://succinct.cs.berkeley.edu.

