September 5, 2013

Twitter Conjures Up a Hadoop-Storm Hybrid, Ponders IPO

Alex Woodie

This week, Twitter released to the open source community a new data platform called “Summingbird” that glues the batch-oriented processing of Hadoop with the real-time event processing of its own Storm framework. The release comes on the heels of a report about a possible initial public offering of stock for the San Francisco company.

Hadoop is great at combing through huge data sets in search of connections, but its batch orientation introduces sizable latencies into the equation that preclude webscale companies from using it for real-time big data jobs, such as Web search and trending datastreams.

This gave rise to other approaches at big data processing in real time, including the Storm framework that Twitter released in 2011, and which it uses for its “What’s Trending” stream. As Twitter explains, Storm provides very quick processing of data at very low latencies, but does so by sacrificing the “lovely fault tolerant guarantees” of Hadoop.

Now, with the launch of Summingbird, Twitter says that organizations no longer must sacrifice accuracy for speed, or vice versa. By devising a way that meshes Hadoop and Storm into a single whole, organizations can alternatively use Storm for the short-term processing and Hadoop for deep data dives, without requiring the developer to devise kludgey hacks or maintaining time-consuming workarounds.

Twitter — which by the way is reportedly eyeing an IPO next year that could value the company as high as $15 billion — says that Summingbird’s hybrid model allows most data to be processed by Hadoop and served out of a read-only stores, such as ElephantDB or Manhattan.

“Only data that Hadoop hasn’t yet been able to process, data that falls within the latency window, would be served out of a datastore populated in realtime by Storm,” the company says in its blog. “The error of the realtime layer is bounded, as Hadoop will eventually get around to processing the same data and smoothing out any error introduced.”

Developers can use Summingbird to build aggregations in real-time, using Storm, or on Hadoop, explains Twitter’s Sam Ritchie in a June 20 post on Summingbird.

“When the programmer describes her job, that job can be run without change on Storm or Hadoop,” he writes. “Additionally, summingbird can manage merging realtime/online computations with offline batches so that small errors in real-time do not accumulate. Put another way, summingbird gives eventual consistency in a manner that is easy for the programmer to reason about.”

Webscale companies (the biggest among them being Twitter, Facebook, and Google, etc) are very aware of the need for both of these types of big data processing demands, and have already adapted by building systems, or collections of systems, to handle these workloads. The big advantage of Twitter’s approach with Summingbird is that it’s one system.

This approach gives big data developers more flexibility, and eliminates the need for them to pick one approach to meet current needs, only to be stymied down the road when the need for a different approach rears its head. Developers at smaller firms that haven’t yet invested millions in big data hardware and software may benefit the most.

The development of Summingbird spawned several other development projects at Twitter, including things like Algebird, an abstract algebra library for Scala; Bijection, an injection typeclass to share serialization between different execution platforms and clients; Chill, which augments the Kyro serialization library with configuration options and modules; Tormenta, a type-safe layer over Storm’s “scheme” and “spout” interfaces; and Storehaus, which enables Storm to perform real-time aggregations into a number of commonly used backing stores, including Memcached and Redis.

So what’s next for Summingbird? Now that the product is out there, big data developers are free to start using it to devise new ways of combining big data with small data, cold data and fast data, and everything in between. In his presentation, Ritchie foresees “new execution platforms” and “smarter systems” as possible consequences of the move.

Twitter has several future plans laid out, including supporting additional execution platforms, like Spark and Akka, and support for filter-aware data sources, like Parquet. The company is also looking to add more extensions to Summingbird, via the Algebird, Tormenta, and Storehaus projects.

The Summingbird project is hosted at github.com/twitter/summingbird.

Related Articles

Data Science Has Got Talent as Facebook Launches Competition

IBM Helps Make Buildings More Efficient, Thanks to Data

Facebook Advances Giraph With Major Code Injection

Applications: Predictive Analytics, Research Analytics

Tags: big data, Hadoop, storm, summingbird, twitter

Only registered users may comment. Register using the form below.

Check off newsletters you would like to receive*
- HPCwire
- EnterpriseTech
- Datanami
- Technology Conferences & Events
- Advanced Computing Job Bank
- Technology Product Showcase
Email*
Name*
First Last
Organization*
Job Function*
Industry*
Country*
City*
State*
Province*
- Please check here to receive valuable email offers from Datanami on behalf of our select partners.

Twitter Conjures Up a Hadoop-Storm Hybrid, Ponders IPO

Join the discussion Cancel reply

Only registered users may comment. Register using the form below.

April 26, 2024

April 25, 2024

April 24, 2024

April 23, 2024

Sponsored Partner Content

Get your Data AI Ready – Celebrate One Year of Deep Dish Data Virtual Series!

Supercharge Your Data Lake with Spark 3.3

Learn How to Build a Custom Chatbot Using a RAG Workflow in Minutes [Hands-on Demo]

Overcome ETL Bottlenecks with Metadata-driven Integration for the AI Era [Free Guide]

Gartner® Hype Cycle™ for Analytics and Business Intelligence 2023

The Art of Mastering Data Quality for AI and Analytics

Leading Solution Providers

Tabor Network

Sponsored Whitepapers

Top 6 Strategies for Reducing Data Warehouse Costs

Building an Operational Data Warehouse for Real-time Analytics

Sponsored Multimedia

The Power of DataOps: Bring Automation to Life
No Comments

Tactical Steps for Cloud Migration
No Comments

Immuta Data Access Platform
No Comments

Data Mesh: Fact or Fiction?
No Comments

Contributors

Featured Events

AI & Big Data Expo North America 2024

CDAO Canada Public Sector 2024

AI Hardware & Edge AI Summit Europe

AI Hardware & Edge AI Summit 2024

CDAO Government 2024

Twitter Conjures Up a Hadoop-Storm Hybrid, Ponders IPO

Join the discussion Cancel reply

Only registered users may comment. Register using the form below.

April 26, 2024

April 25, 2024

April 24, 2024

April 23, 2024

Most Read Features

Most Read News In Brief

Most Read This Just In

Sponsored Partner Content

Leading Solution Providers

Tabor Network

Sponsored Whitepapers

Sponsored Multimedia

Contributors

Featured Events

Share

Copy short link