June 28, 2017

TensorFlow to Hadoop By Way of Datameer

Companies that want to use TensorFlow to execute deep learning models on big data stored in Hadoop may want to check out the new SmartAI offering unveiled by Datameer today.

Deep learning has emerged as one of the hottest techniques for turning massive sets of unstructured data into useful information, and Google’s TensorFlow is arguably the most popular programming and runtime framework for enabling it. So it made sense that Datameer, which was one of the first vendors to develop a soup-to-nuts Hadoop application for big data analytics, has now added support for TensorFlow to its Hadoop-based application.

With today’s unveiling of SmartAI, Datameer is providing a way to execute and operationalize TensorFlow models. “The objective here is to take the stuff that mad scientists are coming up with, and actually take it to the business,” Datameer’s Senior Director of Product Marketing John Morrell tells Datanami.

SmartAI, which is still in technical preview, is not helping data scientists to create the models. They will still do that in their favorite coding environment. Nor is it set up to train the models. If you’re interested in learning about how that can be accomplished on Hadoop, Hortonworks has a good blog post on integrating TensorFlow assemblies into YARN.

Rather, Datameer’s new app is all about solving some of the thorny “last mile” problems that organizations often encounter as they’re moving a trained TensorFlow model from the lab into production.

“AI today has had some problems in terms of operationalization,” Morrell says. “When a data scientist comes up with a formula using their data science tools, they just chuck it over the wall to the IT guy, who then tries to turn it into code, and custom code the whole thing.”

Datameer seeks to help operationalize TensorFlow models with SmartAI

Instead of using scripts and custom coding, SmartAI aims to codify the TensorFlow work into its Hadoop application. Not only does Datameer provide a way to distribute TensorFlow algorithms to nodes in a Hadoop cluster by way of YARN, but it also hooks it into its workflow to help solve some of the thorny issues around code re-use, data governance, and security.

“It allows you to take an AI model that you created in TensorFlow, plug it into Datameer, and then Datameer can operationalize those models,” Morrell continues. “It can operationalize those insights, directly on top of your data lake, and give you all the scale and security and governance and integration with your business systems that is lacking in the data science world.”
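To make the idea of distributing an already-trained model across the nodes of a cluster more concrete, here is a minimal Python sketch. It is not Datameer’s API, TensorFlow, or YARN code: the `score` function, the hard-coded weights, and the in-memory partitions are all hypothetical stand-ins, and a local thread pool plays the role of the cluster workers. It only illustrates the general pattern of applying one fixed model to each partition of data and collecting the predictions.

```python
from concurrent.futures import ThreadPoolExecutor
import math

# Hypothetical "trained model": fixed weights for a tiny logistic scorer.
# In practice these would come from a model trained elsewhere.
WEIGHTS = [0.8, -0.5]
BIAS = 0.1


def score(record):
    """Return a score between 0 and 1 for one record (a list of floats)."""
    z = sum(w * x for w, x in zip(WEIGHTS, record)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def score_partition(partition):
    """Apply the same model to every record in one partition."""
    return [score(record) for record in partition]


def score_all(partitions, workers=2):
    """Score partitions in parallel; results keep the partition order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score_partition, partitions))


# Hypothetical data, split into two partitions as a cluster might hold it.
partitions = [
    [[1.0, 2.0], [0.0, 0.0]],
    [[3.0, 1.0]],
]
results = score_all(partitions)
```

The model is read-only during scoring, so each worker can apply it independently. That independence is what lets this kind of workload spread across many nodes.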

AI is only as good as the data that feeds it, says Datameer CTO Peter Voss. “We’re thrilled to connect the dots by allowing enterprises to bring together massive amounts of disparate data, prepare and design the data pipeline, and now ultimately feed the data into models that have the potential to radically optimize business models,” he says.

Deep learning is a form of machine learning that’s grown rapidly in popularity over the past year. The approach was initially used by Web giants like Google, Yahoo, and Microsoft to turbo-charge image recognition, voice recognition, and natural language processing (NLP) systems. This is typically done by training very large neural networks, with hundreds or thousands of layers, atop speedy GPU processors.

As deep learning racks up the wins and demonstrates better accuracy compared to other machine learning techniques, it’s starting to branch out into the broader market. Today data scientists are looking for other ways to leverage the enormous power of this form of unstructured data analysis. In particular, organizations are examining ways to use deep learning in areas like fraud detection, recommendation systems, healthcare analytics, and analysis of time-series IoT data.

Deep learning’s main advantage lies in speed and simplicity. Many data scientists are looking to use TensorFlow to replace models originally developed with Spark’s MLlib, as TensorFlow can be an order of magnitude faster than Spark, Morrell says. “You can train things about four to 100 times faster and you can put together a model with 10 to 12 lines of coding,” he says.

One of the great things about AI and deep learning in particular, Morrell adds, is that it takes feature engineering out of the equation, “because the deep learning model can automatically figure out what attributes are important,” he says. “This will dramatically speed up their cycles in terms of producing predictive types of models, and it will allow them to tackle many, many more problems.”

Datameer was one of the first vendors offering an end-to-end analytics application for Apache Hadoop that delivered many of the capabilities organizations need to operationalize their big data investments on the distributed platform. As technology evolved, so did Datameer, which added support for Apache Spark to boost the speed and provide access to more data science tools.

TensorFlow was the first deep learning framework added to Datameer’s application, but the company expects to add more frameworks over time, Morrell says.
