May 26, 2015

Spark Gets New Machine Learning Framework: KeystoneML

The AMPLab this week announced the release of KeystoneML, a framework for building and deploying large-scale machine-learning pipelines within Apache Spark. While the software is still in the alpha stage, it’s stable enough for developers to begin using it, project backers say.

KeystoneML is designed to be a faster and more sophisticated alternative to SparkML, the machine learning framework that’s a full member of the Apache Spark club. Whereas SparkML comes with a basic set of operators for processing text and numbers, KeystoneML includes a richer set of operators and algorithms designed specifically for natural language processing, computer vision, and speech processing.

The package, which is available for download, also includes several example pipelines that reproduce state-of-the-art academic results on public data sets, according to Evan Sparks, a UC Berkeley Ph.D. student who works in the AMPLab and is helping to develop KeystoneML.

“Users familiar with the new package may recognize several similarities in the API concepts and interfaces presented between these two projects,” Sparks wrote today on the AMPLab blog. “This is no coincidence since we contributed to the design and initial implementation of [that package]. However, KeystoneML provides both a richer set of built-in operators–for example, image featurizers and large-scale linear solvers–and modifies the interface to provide type-safety and ensure further robustness.”
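To make the type-safety point concrete, here is a minimal sketch of a typed pipeline interface. It is an illustration only and does not show KeystoneML's actual API: KeystoneML is a Scala library, while this sketch uses Python type hints, and the names `Transformer`, `then`, `tokenize`, and `count_tokens` are hypothetical. The idea is that each stage declares its input and output types, so two stages can be chained only when the first one's output type matches the second one's input type.

```python
from dataclasses import dataclass
from typing import Callable, Generic, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")


@dataclass(frozen=True)
class Transformer(Generic[A, B]):
    """Hypothetical pipeline stage mapping an input of type A to an output of type B."""

    fn: Callable[[A], B]

    def __call__(self, value: A) -> B:
        return self.fn(value)

    def then(self, nxt: "Transformer[B, C]") -> "Transformer[A, C]":
        """Chain two stages; the next stage must accept this stage's output type."""
        return Transformer(lambda value: nxt(self(value)))


# Two illustrative stages for a tiny text pipeline (assumed names, not real operators).
tokenize: Transformer[str, List[str]] = Transformer(lambda s: s.lower().split())
count_tokens: Transformer[List[str], int] = Transformer(len)

# str -> List[str] -> int: the types line up, so the stages compose.
pipeline: Transformer[str, int] = tokenize.then(count_tokens)
result = pipeline("Spark gets a new framework")

# Reversing the order, count_tokens.then(tokenize), is the kind of mistake a
# static type checker would be expected to flag, since int is not str.
```

In a statically typed language such as Scala, a mismatch like the reversed chain would be caught at compile time; in this Python sketch, it would be caught only if a separate type checker is run over the code.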

The AMPLab is developing a big data stack, dubbed the Berkeley Data Analytics Stack, or BDAS, that extends and builds atop some of the technologies developed for Hadoop. Apache Spark is a key component that’s gained a lot of traction, but there are other projects that could also catch fire, including: Tachyon, a distributed file system that sits atop HDFS; Succinct, a compressed data store that sits above Tachyon; and Velox, an online model management system for analytics.

You can read more about KeystoneML at