May 18, 2015: H2O.ai Reinvents Machine Learning for Smart Applications with 3.0 Release

Today H2O.ai, the leading provider of open source machine learning for building smarter applications, announces the general availability of H2O 3.0, the latest major release of the company’s flagship software. The new version offers a single integrated and tested platform for enterprise and open-source use, enhanced usability through a new web user interface (UI) with embeddable workflows, elegant APIs, and direct integration for Python and Sparkling Water.

“We have re-invented machine learning platforms with 3.0. H2O APIs and Flows enable a robust ecosystem of smarter applications and intelligent things,” said Sri Ambati, CEO and co-founder of H2O.ai. “Developers that use machine learning APIs have the opportunity to impact billions of lives with data. Prediction is the new search and ML is the new SQL.”

Machine Learning APIs to Build Intelligent Applications

H2O’s APIs allow developers to rapidly innovate and deploy smarter business applications in private and public clouds. Google-scale Prediction APIs enable developers to train and test models on large datasets, directly within their preferred application development environment. H2O’s REST APIs are well-documented, providing dynamic metadata and JSON schema.

The REST API also underlies the beautiful user experience of H2O Flow, the R package and the Python module. In all cases, H2O can export trained models as Java objects (POJO) that can easily be integrated into applications and real-time systems like Spark Streaming and Apache Storm™. H2O 3.0 seamlessly embeds machine learning algorithms into other applications frameworks. Sparkling Water is a powerful example of bringing H2O algorithms to the developer community of Apache Spark™.
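The integration pattern here is simple: an exported model is just a compiled scoring function that a streaming pipeline calls on every incoming event. As a rough illustration of that pattern only (real H2O POJOs are generated Java classes; the fields and rules below are hypothetical stand-ins), a scoring loop might look like:

```python
# Illustrative sketch only: a hand-written scoring function standing in
# for a model exported from H2O, embedded in a stream-processing loop.
# (Real exported POJOs are generated Java; field names here are invented.)

def score(event):
    """Stand-in for an exported model's predict() method:
    a two-rule decision stump over hypothetical features."""
    if event["amount"] > 1000 and event["new_account"]:
        return "flag"
    return "ok"

def process_stream(events):
    """Score each event as it arrives, the same shape of work a
    Storm bolt or Spark Streaming transformation would perform."""
    return [(e["id"], score(e)) for e in events]

events = [
    {"id": 1, "amount": 1500, "new_account": True},
    {"id": 2, "amount": 40, "new_account": False},
]
print(process_stream(events))  # [(1, 'flag'), (2, 'ok')]
```

Because the scoring logic is self-contained, the same object can be dropped into batch jobs or real-time systems without a round trip to the modeling cluster.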

Data Science Workflows from R, Python and Spark

H2O Flow seamlessly blends a modern web notebook with command-line computing, allowing users to interactively import files, build models, and iteratively improve them. The UI provides a point-and-click interface for every H2O operation and renders all data and models as graphical and tabular displays. Each point-and-click action is translated into an individual workflow script that can be saved for later interactive or offline use.

Users can easily select from a growing list of algorithms available out of the box, including Gradient Boosting Machine, Deep Learning, Generalized Linear Model, K-Means, Distributed Random Forest, and Naïve Bayes. In addition to the cleaner UI, careful consideration was given to how model output is managed and visualized for each algorithm; for example, interactive ROC curves are now rendered for binary classifiers.
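To make the last point concrete, an ROC curve for a binary classifier plots true-positive rate against false-positive rate as the decision threshold sweeps across the model's scores. A minimal pure-Python computation of those points and the area under them (purely illustrative, not H2O's implementation) might look like:

```python
def roc_points(labels, scores):
    """Compute (FPR, TPR) points by sweeping the decision threshold
    over the predicted scores, highest score first."""
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in pairs:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
pts = roc_points(labels, scores)
print(round(auc(pts), 3))  # 0.889
```

An interactive rendering of this curve lets the user pick the threshold that trades off false positives against missed detections for their use case.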

From their laptops users can also use the Python command line, or integrated development environments like IPython Notebook and Jupyter, to drive tera-scale H2O clusters and interact with massive datasets in seconds. By writing simple expressions, developers can conduct data imputation and filtering, manage outliers, run group-bys and joins, perform feature engineering, and more. Developers can then use that data to train, validate and make predictions with H2O’s powerful machine learning models.
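In the h2o Python module these munging steps are written as short frame expressions executed on the cluster; the pure-Python sketch below mimics the semantics of two of them (mean imputation of a missing value, then a group-by average) on a small in-memory table, strictly for illustration, with invented column names:

```python
# Pure-Python illustration of two of the munging steps described above:
# mean imputation of missing values, then a group-by mean.
# (With H2O these run as frame expressions distributed across the cluster.)
from collections import defaultdict

rows = [
    {"city": "Chicago", "temp": 18.0},
    {"city": "Chicago", "temp": None},       # missing value to impute
    {"city": "San Francisco", "temp": 15.0},
    {"city": "San Francisco", "temp": 17.0},
]

# Impute missing 'temp' entries with the column mean.
observed = [r["temp"] for r in rows if r["temp"] is not None]
mean_temp = sum(observed) / len(observed)
for r in rows:
    if r["temp"] is None:
        r["temp"] = mean_temp

# Group by city and average.
groups = defaultdict(list)
for r in rows:
    groups[r["city"]].append(r["temp"])
by_city = {city: sum(v) / len(v) for city, v in groups.items()}
print(by_city)
```

On a tera-scale cluster the same expressions run in parallel over distributed frames, which is what makes interacting with massive datasets feel immediate from a laptop.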

Sparkling Water – Best of Spark and H2O

Sparkling Water allows Apache Spark developers to harness the speed and accuracy of H2O in a single memory address space, integrating Spark’s powerful SQL query and data-munging capabilities with H2O’s machine learning algorithms. Developers can use Sparkling Water interactively through the Scala shell and export RDDs directly into H2O data frames. Alternatively, they can incorporate H2O algorithms directly into Spark applications in batch mode for recurring jobs. When running H2O on top of Spark installations, users have immediate access to the Flow UI, Python notebooks and R interfaces – complete with all H2O features through the REST API.

To showcase the flexibility of Sparkling Water, the team at H2O.ai recently created a Deep Learning application for public safety directly within the Spark shell. Using publicly available crime datasets from the cities of Chicago and San Francisco, the team analyzed historical crime data from both cities and joined it with additional external source data, including weather and US Census-based socioeconomic factors. By leveraging the SQL capability of Spark and publishing the Spark RDD as an H2O frame, the team was able to build and train a neural network that can identify with 92% (Chicago) and 95% (San Francisco) accuracy whether reported crimes will result in an arrest.
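The enrichment join described above can be pictured with a toy example: crime records keyed by date, joined with weather readings for the same date. The field names below are hypothetical (the real application used Spark SQL over the public datasets), but the shape of the operation is the same:

```python
# Toy illustration of the enrichment join described above:
# crime records joined with same-day weather (hypothetical fields).
crimes = [
    {"date": "2015-01-03", "type": "THEFT"},
    {"date": "2015-01-04", "type": "BATTERY"},
]
weather = {
    "2015-01-03": {"temp_max": -5, "precip": 0.0},
    "2015-01-04": {"temp_max": 2, "precip": 0.3},
}

# Merge each crime record with the weather row sharing its date key.
enriched = [{**c, **weather[c["date"]]} for c in crimes]
print(enriched[0])
```

Once the enriched table is published as an H2O frame, the added weather and socioeconomic columns become candidate features for the neural network.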