March 28, 2016

Resolving Hadoop’s Storage Gap

Todd Lipcon


Over the past several years, the Hadoop ecosystem has made great strides in its real-time access capabilities, narrowing the gap with traditional database technologies. With systems such as Impala and Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds.

With systems such as Apache HBase and Apache Phoenix, applications can achieve millisecond-scale random access to arbitrarily sized datasets. These improvements have allowed Hadoop to expand the set of applications for which it is appropriate, making inroads into workloads like business intelligence, interactive data exploration, and even online user-facing applications such as websites and mobile apps. Additionally, new trends such as the Internet of Things have driven a need for companies to analyze streaming data that arrives in real time, in contrast to traditional batch-oriented data loading processes.

Despite these advances, some important gaps remain that prevent many applications from transitioning to Hadoop-based architectures. Users are often caught between a rock and a hard place: columnar formats such as Apache Parquet offer extremely fast scan rates for analytics but little to no ability for real-time modification or row-by-row indexed access. Online systems such as HBase offer very fast random access, but their scan rates are too slow for large-scale data-warehousing workloads.
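To make the tradeoff concrete, here is a toy sketch in Python. It uses in-memory dictionaries and lists as stand-ins for a keyed row layout and a columnar layout. The table, its column names, and every function below are hypothetical, chosen only for illustration; nothing here reflects the actual storage formats or APIs of Parquet, HBase, or Kudu.

```python
# Toy illustration only: plain Python structures standing in for storage layouts.
# The data and all names below are hypothetical.

rows = [
    {"id": 1, "user": "ann", "amount": 10.0},
    {"id": 2, "user": "bob", "amount": 25.5},
    {"id": 3, "user": "cat", "amount": 4.5},
]

# Row-oriented layout keyed by primary key: a lookup or update touches one record.
row_store = {r["id"]: dict(r) for r in rows}

# Column-oriented layout: the values of each column are stored together.
column_store = {
    "id": [r["id"] for r in rows],
    "user": [r["user"] for r in rows],
    "amount": [r["amount"] for r in rows],
}


def point_lookup(key):
    """Random access by key: cheap in the keyed row layout."""
    return row_store[key]


def sum_amount_rowwise():
    """Analytic scan over rows: visits every whole record to read one field."""
    return sum(record["amount"] for record in row_store.values())


def sum_amount_columnar():
    """Analytic scan over a column: reads only the values it needs."""
    return sum(column_store["amount"])


def update_amount_columnar(key, new_amount):
    """In-place update in the columnar layout: with no key index, the row
    position must first be found by a linear search of the id column."""
    position = column_store["id"].index(key)
    column_store["amount"][position] = new_amount
```

In this sketch the keyed layout makes `point_lookup` trivial but forces the scan to walk whole records, while the columnar layout makes the scan a pass over a single list but turns a single-row update into a search. Real systems differ enormously in detail, but the shape of the tradeoff is the same.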

In my Wednesday session at Strata + Hadoop World, titled “Hadoop’s storage gap: Resolving transactional-access and analytic-performance tradeoffs with Apache Kudu (incubating),” I’ll explore the tradeoffs between real-time transactional access and fast analytic performance in the context of Apache Kudu (incubating), a new addition to the Hadoop ecosystem.

Kudu complements Apache HDFS and HBase, providing a new option that can achieve both fast analytic scan performance and fast random access in a single system. Kudu will enable companies to easily store and analyze fast-changing and constantly arriving data in a single storage system, simplifying application architectures while retaining the best aspects of the current generation of systems. I’ll also explain how Kudu has been engineered in collaboration with Intel to take advantage of features in modern hardware platforms, including vectorized CPU instructions and persistent memory (3D XPoint).

This session runs from 1:50 p.m. to 2:30 p.m. on Wednesday, March 30, in room 230 C.

About the author: Todd Lipcon is an engineer at Cloudera, where he primarily contributes to open source distributed systems in the Apache Hadoop ecosystem. He is a committer and a PMC member on the Apache Hadoop, HBase, and Thrift projects. Prior to Cloudera, Todd worked on web infrastructure at several startups and researched novel machine-learning methods for collaborative filtering. Todd received his bachelor’s degree with honors from Brown University.

