Twitter Conjures Up a Hadoop-Storm Hybrid, Ponders IPO
This week, Twitter released to the open source community a new data platform called “Summingbird” that combines the batch-oriented processing of Hadoop with the real-time event processing of its own Storm framework. The release comes on the heels of a report about a possible initial public offering of stock for the San Francisco company.
Hadoop is great at combing through huge data sets in search of connections, but its batch orientation introduces sizable latencies that preclude webscale companies from using it for real-time big data jobs, such as Web search and trending data streams.
This gave rise to other approaches to real-time big data processing, including the Storm framework, which Twitter released in 2011 and uses for its “What’s Trending” stream. As Twitter explains, Storm processes data at very low latencies, but does so by sacrificing the “lovely fault tolerant guarantees” of Hadoop.
Now, with the launch of Summingbird, Twitter says that organizations no longer must sacrifice accuracy for speed, or vice versa. Because Summingbird meshes Hadoop and Storm into a single whole, organizations can use Storm for short-term processing and Hadoop for deep data dives, without requiring developers to devise kludgey hacks or maintain time-consuming workarounds.
Twitter, which is reportedly eyeing an IPO next year that could value the company as high as $15 billion, says that Summingbird’s hybrid model allows most data to be processed by Hadoop and served out of a read-only store, such as ElephantDB or Manhattan.
“Only data that Hadoop hasn’t yet been able to process, data that falls within the latency window, would be served out of a datastore populated in realtime by Storm,” the company says in its blog. “The error of the realtime layer is bounded, as Hadoop will eventually get around to processing the same data and smoothing out any error introduced.”
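The hybrid serving model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Summingbird's code: the function names, the in-memory dictionaries standing in for the batch and realtime stores, and the `horizon` timestamp (the point up to which the batch layer has caught up) are all hypothetical.

```python
from collections import defaultdict

def count_events(events, start, end):
    """Count events per key for timestamps in the half-open range [start, end)."""
    counts = defaultdict(int)
    for timestamp, key in events:
        if start <= timestamp < end:
            counts[key] += 1
    return dict(counts)

def serve(key, batch_store, realtime_store):
    """Answer a query by combining the batch view with the realtime view."""
    return batch_store.get(key, 0) + realtime_store.get(key, 0)

# Hypothetical event log of (timestamp, key) pairs.
events = [(1, "a"), (2, "b"), (3, "a"), (7, "a"), (8, "b")]

# Assumption: the batch layer has processed everything before timestamp 5.
horizon = 5
batch_store = count_events(events, 0, horizon)                # read-only batch view
realtime_store = count_events(events, horizon, float("inf"))  # latency window only

total_a = serve("a", batch_store, realtime_store)
total_b = serve("b", batch_store, realtime_store)
```

When the batch layer later advances its horizon, the counts it recomputes replace whatever the realtime store held for that window, which is why any realtime error stays bounded.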
Developers can use Summingbird to build aggregations either in real time on Storm or in batch on Hadoop, explains Twitter’s Sam Ritchie in a June 20 post on Summingbird.
“When the programmer describes her job, that job can be run without change on Storm or Hadoop,” he writes. “Additionally, summingbird can manage merging realtime/online computations with offline batches so that small errors in real-time do not accumulate. Put another way, summingbird gives eventual consistency in a manner that is easy for the programmer to reason about.”
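To make the "run without change" idea concrete, here is a rough Python sketch in which one job definition is handed to two different runners. The runners are hypothetical stand-ins for a batch platform and a streaming platform; none of these names come from Summingbird, whose actual API is written in Scala.

```python
from collections import defaultdict

def word_count_job(sentence):
    """The job logic, written once: emit a (word, 1) pair for every word."""
    return [(word, 1) for word in sentence.lower().split()]

def run_batch(job, inputs):
    """Batch-style runner: gather all values per key, then sum each group."""
    grouped = defaultdict(list)
    for record in inputs:
        for key, value in job(record):
            grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

def run_streaming(job, inputs):
    """Streaming-style runner: update running totals one record at a time."""
    state = {}
    for record in inputs:
        for key, value in job(record):
            state[key] = state.get(key, 0) + value
    return state

sentences = ["to be or not to be", "be quick"]
batch_result = run_batch(word_count_job, sentences)
streaming_result = run_streaming(word_count_job, sentences)
```

Both runners arrive at the same totals because the job only says what to emit and how values combine (addition here), leaving the execution strategy to the platform.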
Webscale companies (Twitter, Facebook, and Google being among the biggest) are well aware of the need for both types of big data processing, and have already adapted by building systems, or collections of systems, to handle these workloads. The big advantage of Twitter’s approach with Summingbird is that it’s one system.
This approach gives big data developers more flexibility, and eliminates the need for them to pick one approach to meet current needs, only to be stymied down the road when the need for a different approach rears its head. Developers at smaller firms that haven’t yet invested millions in big data hardware and software may benefit the most.
The development of Summingbird spawned several other development projects at Twitter, including Algebird, an abstract algebra library for Scala; Bijection, an injection typeclass to share serialization between different execution platforms and clients; Chill, which augments the Kryo serialization library with configuration options and modules; Tormenta, a type-safe layer over Storm’s “scheme” and “spout” interfaces; and Storehaus, which enables Storm to perform real-time aggregations into a number of commonly used backing stores, including Memcached and Redis.
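For readers wondering why an abstract algebra library matters here, the sketch below shows a monoid, a structure with an identity value and an associative combining operation. It is a minimal Python illustration and not Algebird's API; the `Monoid` class, its `sum_all` method, and the `sum_in_chunks` helper are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Generic, Iterable, TypeVar

T = TypeVar("T")

@dataclass(frozen=True)
class Monoid(Generic[T]):
    """An identity element plus an associative way to combine two values."""
    zero: T
    plus: Callable[[T, T], T]

    def sum_all(self, values: Iterable[T]) -> T:
        """Fold all values together, starting from the identity element."""
        return reduce(self.plus, values, self.zero)

# Two example instances: integer addition and set union.
int_add = Monoid(0, lambda a, b: a + b)
set_union = Monoid(frozenset(), lambda a, b: a | b)

def sum_in_chunks(monoid, values, chunk_size):
    """Aggregate each chunk separately, then combine the partial results."""
    values = list(values)
    partials = [
        monoid.sum_all(values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]
    return monoid.sum_all(partials)

whole = int_add.sum_all(range(10))
chunked = sum_in_chunks(int_add, range(10), 3)
users = sum_in_chunks(
    set_union, [frozenset({1}), frozenset({2, 3}), frozenset({3})], 2
)
```

Because the operation is associative, partial results can be computed in any grouping and merged later, which is the property that lets batch output and realtime output be combined safely.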
So what’s next for Summingbird? Now that the product is out there, big data developers are free to start using it to devise new ways of combining big data with small data, cold data and fast data, and everything in between. In his presentation, Ritchie foresees “new execution platforms” and “smarter systems” as possible consequences of the move.
Twitter has several future plans laid out, including support for additional execution platforms, like Spark and Akka, and for filter-aware data sources, like Parquet. The company is also looking to add more extensions to Summingbird, via the Algebird, Tormenta, and Storehaus projects.
The Summingbird project is hosted at github.com/twitter/summingbird.