Anaconda Enterprise Now Supports GPU Clusters
Organizations that need to train machine learning models on massive amounts of data may be interested in the latest release of Anaconda Enterprise, which adds support for training on GPU clusters.
Customers running Anaconda Enterprise 5.2 will be able to take their Python- or R-based machine learning models directly from the laptop and deploy them on big GPU clusters without making changes to the code or fiddling about with command lines, says Mathew Lodge, SVP of products and marketing for Anaconda.
“Version 5.2 adds support for large-scale training and deployment of machine learning on GPU clusters, including job scheduling to support that,” Lodge tells Datanami. “[It also includes] one click-deployment for data scientists, so they can deploy their work and run training jobs on Kubernetes without having to be a DevOps expert.”
Several different pieces of technology came together in Anaconda Enterprise 5.2 to enable clustered GPU support, Lodge says, including support for Nvidia’s latest drivers and technology, better support for containerized deployment with Docker, and a new version of Kubernetes that simplifies connections to GPUs.
“We worked with Nvidia to develop the capability in our product,” he says, “and some of this is a new version of Kubernetes that has GPU support that came out a couple of months ago. And some of this has been wiring to connect it up to software that can take machine learning and make it run in parallel.”
Organizations are increasingly finding they need massive scale to train their latest machine learning models. Lodge noted that Google’s AlphaGo, the machine learning program that conquered the ancient Chinese game of Go in 2016, required 1 petaflop of processing capability. These days, the current version of that program, AlphaGo Zero, requires 1,000 petaflops.
“Essentially what we’ve seen is you need 10 times the petaflops to train a leading edge AI model,” Lodge says. “So for most organizations, if you want to do market-leading AI, then scale is very important to be able to train those models and to be successful…GPUs make this economically possible because they have a much lower cost per petaflop than regular CPUs.”
While not everybody has the petaflop-scale computational demands of Google, the computing needs of real-world companies are starting to mimic those of the Alphabet subsidiary. Citibank, which is both an Anaconda customer and an investor, is one that comes to Lodge’s mind.
“Citibank [uses Anaconda] in two different areas: credit card risk management, and they also use it for anti-money laundering. They don’t talk too much about exactly how they use it, but they certainly do need that kind of scale because of the number of transactions they have to process and the complexity of the models.”
Some of the data science frameworks that Anaconda ships in its enterprise and distribution versions are parallelized by default, such as TensorFlow. “It runs on Kubernetes by itself,” Lodge says. But other machine learning frameworks, such as Scikit-learn, are single-node by design. “We have some software from Dask that will… take a machine learning algorithm, parallelize it, and run it across multiple nodes in Kubernetes,” he says.
Dask is an open source software effort that was created specifically to parallelize various Python frameworks, including Pandas, NumPy, and Scikit-learn. Dask is included in Anaconda Distribution (the free and open source version) as well as Anaconda Enterprise, which is aimed at helping organizations simplify the deployment of machine learning models.
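The core pattern Dask applies — split the data into chunks, compute partial results on each chunk in parallel, then combine them — can be sketched in miniature with nothing but the standard library. The example below is purely illustrative and is not Dask’s actual API; Dask generalizes this idea to NumPy, Pandas, and Scikit-learn workloads and scales the workers out across processes and Kubernetes nodes rather than threads:

```python
# Illustrative sketch of chunk-and-combine parallelism (the idea behind
# Dask), using only the standard library. Dask applies the same pattern
# to real array, dataframe, and ML workloads across many machines.
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    # Each worker computes a partial result on its own chunk of the data.
    return sum(chunk), len(chunk)


def parallel_mean(data, n_chunks=4):
    # Split the data into roughly equal chunks.
    size = max(1, len(data) // n_chunks)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Map the partial computation over the chunks in parallel.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_sum, chunks))
    # Combine the partial results into the final answer.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


if __name__ == "__main__":
    print(parallel_mean(list(range(1_000))))  # → 499.5
```

The same divide, map, combine structure is what lets a single-node algorithm run across multiple Kubernetes nodes without rewriting the algorithm itself.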
In addition to GPU cluster support and the Kubernetes/Docker enhancements, Anaconda Enterprise 5.2 also gets a new job scheduler. This is an important capability for enterprises that want to productionize their machine learning models, Lodge says.
“What we find is a lot of customers want to train the model on the new data, and they get new data every day, so they want a recurring job that runs the training on the GPUs every day at 3 a.m. or whatever,” he says. “So the two [GPU support and scheduling] kind of go hand in hand.”
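The recurring 3 a.m. training run Lodge describes maps naturally onto a Kubernetes CronJob. The sketch below is illustrative only — the job name, container image, entry point, and GPU request are hypothetical placeholders, not Anaconda Enterprise’s actual configuration:

```yaml
# Hypothetical CronJob: retrain a model on one GPU every day at 3 a.m.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-training            # placeholder name
spec:
  schedule: "0 3 * * *"             # cron syntax: 03:00 daily
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: train
            image: example.com/train:latest   # hypothetical image
            command: ["python", "train.py"]   # hypothetical entry point
            resources:
              limits:
                nvidia.com/gpu: 1             # request one GPU
```

The `nvidia.com/gpu` resource name is how Kubernetes exposes Nvidia GPUs to scheduled workloads, which is the connection between the GPU support and the job scheduling that Lodge says go hand in hand.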
The Austin, Texas-based company has been at the forefront of the data science movement, particularly as it applies to the Python ecosystem, and to a lesser extent R. The company was the first to make sense of the morass of Python packages available and streamline their dependencies with a single distribution, Anaconda, and its package manager, Conda.
Since then, the company has worked to keep its Python software up to date with the rapidly evolving data science world, including support for TensorFlow, Keras, MXNet, PyTorch, and other machine learning frameworks.
The goal with Anaconda Enterprise isn’t necessarily to give data scientists the latest and greatest frameworks, but to make them easier to use, Lodge says.
“Some of the algorithms that are available today, and have been available for a few years, are already better than humans,” Lodge says, citing image classifiers’ breakthrough 2015 performance on ImageNet, where they exceeded human capability. “The problem is not the algorithms. It’s the application and getting it into production.”
Lodge shared a quote from a Google whitepaper that discussed the Web giant’s experiences with deploying artificial intelligence capabilities at scale. The whitepaper said, in effect, that a mature AI system might end up being 5% machine learning code and 95% glue code.
“That’s essentially, in a nutshell, the challenge for most of our customers: the 95% glue code,” Lodge says. “That’s not anything that will differentiate them or have them be more effective in their business. It’s just glue code. What we’re doing is providing and automating the glue code so they can focus on the 5% machine learning stuff where it really makes a difference.”