Yahoo Unveils SAMOA to Mine Multiple Data Streams
Yahoo last month unveiled a new stream processing framework called Scalable Advanced Massive Online Analysis (SAMOA) that it says will simplify the process of developing and executing machine learning algorithms against multiple data streams. The open source software works with individual stream processing engines, such as Storm and S4, and is available for download now.
An ever-increasing need for speed in the big data world is pushing the sharp end of the analytic spear ever closer to the freshest and best data possible. The batch paradigm, as manifested by the MapReduce framework in Hadoop, is great for deep analysis of huge data sets, but it can’t deliver insights fast enough to satisfy business demands. The longer that data sits before it’s analyzed, the less pertinent the models become, and the greater the lost opportunities that businesses can never get back.
This is forcing businesses toward stream processing engines to harvest information as close to real time as possible. Stream processing engines such as Apache Storm, Apache Samza, and Apache S4 (originally developed by Twitter, LinkedIn, and Yahoo, respectively) have grown in popularity as developers look to squeeze every bit of actionable information out of real-time data flows, and to provide a very short feedback loop that continually updates models and keeps them clear of the dreaded “concept drift.”
Yahoo engineers developed SAMOA to serve as a framework to simplify the process of mining information using multiple analytical tools on multiple data streams. Instead of futzing around with individual stream processing engines (SPEs) like Storm, Samza, and S4, developers can use SAMOA to create a machine learning (ML) algorithm once, and then run that ML algorithm in the individual SPEs as needed.
In effect, SAMOA provides an abstraction layer for accelerating the development of analytical systems that use multiple SPEs. It also enables ML algorithms developed for one SPE to be applied to another, and provides extensibility in integrating new stream processing engines into the framework as they’re created, the company says on its Yahoo Engineering blog.
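The write-once, run-on-any-engine idea can be illustrated with a toy sketch. This is not SAMOA's API (which is Java-based); the names `RunningMean`, `run_sequential`, and `run_buffered` are hypothetical stand-ins, with two simple runner functions playing the role of two different SPEs that execute the same algorithm object.

```python
from collections import deque


class RunningMean:
    """Toy streaming algorithm: an incrementally updated mean.

    Hypothetical engine-neutral contract: one process(event) call per event.
    """

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def process(self, event):
        self.count += 1
        self.mean += (event - self.mean) / self.count
        return self.mean


def run_sequential(algorithm, events):
    """Stand-in for one engine: handle each event as soon as it arrives."""
    return [algorithm.process(e) for e in events]


def run_buffered(algorithm, events, batch_size=2):
    """Stand-in for a second engine: queue events and drain them in small batches."""
    queue = deque()
    outputs = []
    for e in events:
        queue.append(e)
        if len(queue) >= batch_size:
            while queue:
                outputs.append(algorithm.process(queue.popleft()))
    while queue:  # flush whatever is left at the end of the stream
        outputs.append(algorithm.process(queue.popleft()))
    return outputs


events = [2.0, 4.0, 6.0, 8.0]
ENGINES = {"sequential": run_sequential, "buffered": run_buffered}

# The algorithm is written once; only the runner changes.
results = {name: run(RunningMean(), events) for name, run in ENGINES.items()}
```

Because the algorithm only depends on the `process` contract, adding a third runner to `ENGINES` would not require touching `RunningMean`, which is the kind of extensibility the abstraction layer is meant to provide.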
The company sees several uses for SAMOA in the context of SPE computing, including spam detection. The problem with spam detection models is that they start getting stale as soon as they’re updated and creative spammers start finding ways around the filters. With SAMOA, Yahoo is able to keep retraining its spam watchers and updating its spam detection models on a near-continuous basis, thereby doing a better job of keeping the unsavory digital crud out of its customers’ inboxes.
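To make continuous retraining concrete, here is a minimal sketch of a test-then-train loop, in which each incoming message is first classified and then used to update the model. It uses a plain perceptron over word tokens, which is an assumption chosen for brevity and not Yahoo's spam model; the class name `OnlinePerceptron` and the sample messages are hypothetical.

```python
from collections import defaultdict


class OnlinePerceptron:
    """Tiny online classifier: +1 means spam, -1 means legitimate mail."""

    def __init__(self):
        self.weights = defaultdict(float)

    def predict(self, tokens):
        score = sum(self.weights[t] for t in tokens)
        return 1 if score > 0 else -1

    def learn(self, tokens, label):
        # Classic perceptron rule: adjust weights only after a mistake.
        if self.predict(tokens) != label:
            for t in tokens:
                self.weights[t] += label


# Hypothetical labeled stream; the spammer switches to "fr33" midway.
stream = [
    (["free", "pills"], 1),
    (["meeting", "notes"], -1),
    (["free", "offer"], 1),
    (["fr33", "offer"], 1),
    (["fr33", "pills"], 1),
]

model = OnlinePerceptron()
correct = 0
for tokens, label in stream:
    if model.predict(tokens) == label:  # test on the message first...
        correct += 1
    model.learn(tokens, label)          # ...then train on it

adapted = model.predict(["fr33", "deal"])
```

The model misses the first message with the new "fr33" spelling, but because it is updated immediately, the next message using the same trick is caught. A batch-trained model would keep missing it until the next retraining run.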
In addition to serving as a framework, SAMOA also serves as a library of distributed machine learning algorithms. The alpha release, which was first unveiled in July, includes algorithms for classification and clustering.
For classification, SAMOA includes the Vertical Hoeffding Tree (VHT), which is a distributed streaming version of decision trees tailored for sparse data (such as text), the company says. For clustering, it includes a K-means distributed algorithm based on CluStream. The library also includes meta-algorithms such as bagging, the company says.
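The article does not describe VHT's internals, but Hoeffding trees in general rest on the Hoeffding bound, which tells the learner when it has seen enough examples to commit to a split. The sketch below shows that standard bound, not SAMOA's code; the function names and the sample gain values are illustrative assumptions.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Return epsilon such that, with probability 1 - delta, the true mean of a
    variable with the given range lies within epsilon of the mean of n samples."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Split once the best attribute's lead over the runner-up exceeds epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Assumed values: a two-class problem scored by information gain in bits
# (range 1.0), confidence parameter 1e-7, and observed gains of 0.30 vs. 0.20.
early = should_split(0.30, 0.20, 1.0, 1e-7, n=200)
later = should_split(0.30, 0.20, 1.0, 1e-7, n=2000)
```

With only 200 examples the bound is still wider than the 0.10 gap between the two attributes, so the tree waits; by 2,000 examples the bound has shrunk enough to justify the split. This is what lets a decision tree grow from a stream without storing or revisiting past examples.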
“In effect, SAMOA is like Mahout for streaming,” the company says, referring to the popular library of distributed machine learning algorithms that provide advanced capabilities in the areas of collaborative filtering, clustering, and classification. Mahout is primarily deployed atop Hadoop and runs within the MapReduce paradigm, making it less useful for big fast data.
On the execution side, a SAMOA job resembles the topology graph used by Storm. Individual SAMOA nodes communicate by sending messages along streams. For example, a pair of source processor nodes can each send their results to a downstream clusterer node, which then sends them on to evaluator nodes. This allows for a distributed and scalable architecture.
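The source-to-clusterer-to-evaluator layout can be sketched in a few lines. Everything here is a single-process toy under stated assumptions: the `Stream`, `Source`, `Clusterer`, and `Evaluator` classes are hypothetical and do not mirror SAMOA's Java classes, and the clusterer is a simple one-dimensional sequential k-means, not the CluStream-based algorithm.

```python
class Stream:
    """Hypothetical stream: delivers each message to every connected processor."""

    def __init__(self):
        self.subscribers = []

    def connect(self, processor):
        self.subscribers.append(processor)

    def put(self, message):
        for processor in self.subscribers:
            processor.process(message)


class Source:
    """Emits a fixed list of values onto its output stream."""

    def __init__(self, values, out):
        self.values = values
        self.out = out

    def emit_all(self):
        for value in self.values:
            self.out.put(value)


class Clusterer:
    """Assigns each point to the nearest center, then nudges that center toward it."""

    def __init__(self, centers, out):
        self.centers = list(centers)
        self.counts = [1] * len(centers)  # treat each initial center as one point
        self.out = out

    def process(self, point):
        idx = min(range(len(self.centers)), key=lambda i: abs(point - self.centers[i]))
        self.counts[idx] += 1
        self.centers[idx] += (point - self.centers[idx]) / self.counts[idx]
        self.out.put((point, idx))


class Evaluator:
    """Counts how many points were assigned to each cluster."""

    def __init__(self):
        self.counts = {}

    def process(self, message):
        _, idx = message
        self.counts[idx] = self.counts.get(idx, 0) + 1


# Wire the graph: two sources -> clusterer -> evaluator.
points_stream, assignments_stream = Stream(), Stream()
evaluator = Evaluator()
clusterer = Clusterer(centers=[0.0, 10.0], out=assignments_stream)
points_stream.connect(clusterer)
assignments_stream.connect(evaluator)

for source in (Source([1.0, 9.0, 2.0], points_stream), Source([11.0, 0.0], points_stream)):
    source.emit_all()
```

In a real SPE, each node would run on its own worker and the streams would cross the network; the wiring code stays the same shape, which is why this graph style scales out.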
SAMOA currently has Java-based APIs that allow it to work with S4 and Storm. But SAMOA’s creators envision a flexible and extensible framework that can easily adapt to new SPEs and support new algorithms, just as the Mahout library is continually updated with new algorithms. The software is not yet an Apache Incubator project, but that is also one of Yahoo’s goals with SAMOA.
SAMOA is available now under an Apache version 2.0 license. You can download it at https://github.com/yahoo/samoa.