Pushing the Scale of Deep Learning at ISC
Deep learning is the latest and most compelling technology strategy to take aim at the decades-old “drowning in data/starving for insight” problem. But contrary to the commonly held notion, deep learning is more than a big data problem. Delivering on deep learning’s potential – and capturing a market opportunity that is anticipated to grow 50 percent a year – involves a highly demanding scaling problem, one that requires overlapping computational and communications capabilities as complex as any of the classic supercomputing challenges of the past.
That’s the view of Cray senior VP and CTO Steve Scott, who will discuss “pushing the frontiers of deep learning” at ISC in Frankfurt to close out Deep Learning Day (Wednesday, June 21) at the conference.
Scott told EnterpriseTech (Datanami’s sister publication) the focus of his session will be on training at-scale neural networks to handle complex deep learning applications: self-driving cars, facial recognition, robots sorting mail, supply-chain optimization and aiding in the search for oil and gas, to name a few.
“The main point I’ll be making is that we see a general convergence of data analytics and classic simulation and modeling HPC problems,” he said. “Deep learning folds into that, and the training problem in particular is a classic HPC problem.”
In short, greater machine intelligence requires larger, more complex models – with billions of model weights and hundreds of layers.
Ideally, Scott said, when training a neural network with the stochastic gradient descent algorithm, “you’d process one sample of that training data, then update the weights of your model, and then repeat that process with the next piece of training data and then update the weights of your model again.”
The problem, he said, is that it’s an inherently serial model. So even when using a single node, Scott said, users have traditionally broken up their training data into sets – called “mini-batches” – to speed up the process. The entire training process becomes much more difficult when you want to train your network not on one GPU, or 10 GPUs, but on a hundred or thousands of GPUs.
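The difference between the one-sample-at-a-time ideal and mini-batches can be sketched in a few lines of Python. This is a minimal illustration, not anything Scott presented: the model is a toy one-variable linear fit, and the data generator, learning rate and batch sizes are all assumptions chosen for readability.

```python
import random

def make_data(n, seed=0):
    """Hypothetical toy data: y = 3x + 1 plus a little noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 3.0 * x + 1.0 + rng.gauss(0.0, 0.01)) for x in xs]

def gradient(w, b, batch):
    """Mean gradient of the squared error over one batch for y ~ w*x + b."""
    gw = gb = 0.0
    for x, y in batch:
        err = (w * x + b) - y
        gw += 2.0 * err * x
        gb += 2.0 * err
    n = len(batch)
    return gw / n, gb / n

def train(data, batch_size, epochs=50, lr=0.1):
    """SGD: one weight update per batch. batch_size=1 is the serial ideal."""
    w = b = 0.0
    updates = 0
    for _ in range(epochs):
        for i in range(0, len(data), batch_size):
            gw, gb = gradient(w, b, data[i:i + batch_size])
            w -= lr * gw
            b -= lr * gb
            updates += 1
    return w, b, updates

data = make_data(256)
for batch_size in (1, 32):
    w, b, updates = train(data, batch_size)
    print(f"batch_size={batch_size}: w={w:.3f} b={b:.3f} updates={updates}")
```

With a batch size of 1, every update depends on the one before it, so nothing can run in parallel. With a batch size of 32, the 32 per-sample gradients inside a batch are independent of one another, which is the work a GPU can process at once, at the cost of fewer weight updates per pass over the data.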
You can simplify training by using less data, but that leads to deep learning systems that haven’t been trained thoroughly enough and, therefore, aren’t intelligent enough. “If you have a small amount of data and you try to use it to train a very large neural network,” Scott said, “you end up with a phenomenon called ‘overfitting,’ where the model works very well for the training data you gave it, but it can’t generalize to new data and new situations.”
So scale is essential, and scale is a big challenge.
“Scaling up this training problem to large numbers of compute nodes brings up this classic problem of convergence of your model vs. the parallel speed you can get,” Scott said. “This is a really tough problem. If you have more compute nodes working in parallel you can process more samples per second. But now you’re doing more work each time, you’re processing more samples before you can update the model weights. So the problem of converging to the correct model becomes much more difficult.”
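Some back-of-the-envelope arithmetic makes that trade-off concrete. The sketch below assumes synchronous data-parallel training, in which every node processes its own fixed-size mini-batch and the weights are updated once per combined "global" batch. The dataset size, per-node batch size and node counts are hypothetical numbers picked for illustration.

```python
import math

def updates_per_epoch(num_samples, per_worker_batch, num_workers):
    """Return (global batch size, weight updates in one pass over the data)."""
    global_batch = per_worker_batch * num_workers
    return global_batch, math.ceil(num_samples / global_batch)

# Hypothetical setup: one million training samples, 32 samples per node per step.
for workers in (1, 8, 64, 512):
    global_batch, steps = updates_per_epoch(1_000_000, 32, workers)
    print(f"{workers:4d} nodes: global batch {global_batch:6d}, "
          f"{steps:6d} weight updates per epoch")
```

Each added node raises the number of samples processed per step, but it also means the model gets fewer chances to correct itself in each pass over the data. That is the tension between parallel speed and convergence that Scott describes.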
Scott will discuss the kind of system architecture required to take on deep learning training at scale, an architecture that – surprise! – Cray has been working on for years.
“It calls for a very strong interconnect [the fabric, or network, connecting the processors within the system], and it also has a lot to do with turning this into an MPI [the communications software used by the programs to communicate via the fabric] problem,” Scott said. “It calls for strong synchronization, it calls for overlapping your communications and your computation.
“We think bringing supercomputing technologies, from both a hardware and a software perspective, to bear can help speed up this deep learning problem that many people don’t think of. They think of it as a big data problem, not as a classic supercomputing problem. We think the core problem here in scaling these larger models is one in which supercomputing technology is uniquely qualified to address.”
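A rough sketch of what "turning this into an MPI problem" and "overlapping your communications and your computation" can look like follows. It is a single-process simulation in plain Python, not Cray's implementation and not real MPI code: `allreduce_mean` stands in for an MPI-style allreduce, `backward_layer` is a hypothetical placeholder for a real backpropagation step, and a thread pool plays the role of the interconnect.

```python
from concurrent.futures import ThreadPoolExecutor

def allreduce_mean(per_worker_values):
    """Stand-in for an MPI-style allreduce: element-wise mean across workers."""
    n = len(per_worker_values)
    return [sum(vals) / n for vals in zip(*per_worker_values)]

def backward_layer(layer, worker):
    """Hypothetical per-layer gradient; a real model would backpropagate here."""
    return [float(layer + worker + k) for k in range(3)]

def backward_with_overlap(num_layers, num_workers):
    """Start reducing each layer's gradients while earlier layers still compute."""
    pending = {}
    with ThreadPoolExecutor(max_workers=2) as comm:
        # Backpropagation visits layers from last to first.
        for layer in reversed(range(num_layers)):
            grads = [backward_layer(layer, w) for w in range(num_workers)]
            # Hand this layer's gradients to the "network" and keep computing.
            pending[layer] = comm.submit(allreduce_mean, grads)
        # Synchronization point: all reductions must finish before the update.
        return {layer: future.result() for layer, future in pending.items()}

averaged = backward_with_overlap(num_layers=3, num_workers=4)
for layer in sorted(averaged):
    print(f"layer {layer}: averaged gradient {averaged[layer]}")
```

The design point is the ordering: the reduction for the last layer is already in flight while gradients for earlier layers are still being computed, so communication time hides behind computation instead of adding to it. On a real system, the speed of that reduction depends on the interconnect, which is why Scott stresses it.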
Scott said deep learning has taken root to different degrees in different parts of the market. Hyperscalers (Google, Facebook, Microsoft, AWS, etc.) have thousands of projects under development with many, in voice and image recognition in particular, fully operational.
“It’s really past the tipping point,” Scott said. “The big hyperscalers have demonstrated that this stuff works and now they’re applying it all over the place.”
But the enterprise market, lacking the data and the compute resources of hyperscalers, remains for now in the experimentation and “thinking about it” phase, he said. “The enterprise space is quite a bit further behind. But they see the potential to apply it.” Organizations that are early adopters of IoT, with its attendant volumes of machine data, are and will be the early adopters of deep learning at scale.
“We’re seeing it applied to lots of different problems,” said Scott. “Many people, including me, are optimistic that every area of industry and science and beyond is going to have problems that are amenable to deep learning. We think it’s going to be very widespread, and it’s very large organizations with large amounts of data where it will take root first.”