Uber’s Training Tool Shares Ride for Deep Learning
A Linux Foundation project focused on AI development is expanding with the addition of a deep learning training tool based on an Uber-sponsored project.
Launched by the ride-sharing specialist, the Horovod project is a distributed training framework for Keras, PyTorch and TensorFlow. It is designed to handle resource allocation and to scale machine learning training efforts.
Horovod is also intended, for example, to accelerate a TensorFlow training program written for a single graphics processor by extending it to multiple GPUs. The resource allocation and scaling features are based on new algorithms and tap high-performance networks to scale the training of deep learning models.
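The announcement does not say which algorithms are involved. As a rough illustration of how data-parallel training can combine work from several GPUs, the Python sketch below simulates ring-allreduce, a technique described in Horovod's public documentation, to average gradients across workers. The function name `ring_allreduce_mean` and the list-of-lists layout are hypothetical and chosen only for illustration; this is not Horovod's code or its API.

```python
def ring_allreduce_mean(worker_grads):
    """Simulate ring-allreduce averaging over equal-length gradient lists.

    worker_grads holds one list of floats per simulated worker (GPU).
    Returns one averaged list per worker; all of them end up identical.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    # Work on copies so the caller's lists stay untouched.
    bufs = [list(g) for g in worker_grads]
    # Split the index range into n contiguous chunks (some may be empty).
    bounds = [(c * length) // n for c in range(n + 1)]

    def chunk(c):
        return range(bounds[c], bounds[c + 1])

    # Phase 1 (scatter-reduce): in each step every worker passes one chunk
    # to its right-hand neighbour, which adds it to its own copy.
    for step in range(n - 1):
        # Collect all messages first so the sends are simultaneous.
        msgs = []
        for i in range(n):
            c = (i - step) % n
            msgs.append((c, [bufs[i][k] for k in chunk(c)]))
        for i, (c, data) in enumerate(msgs):
            dst = (i + 1) % n
            for k, v in zip(chunk(c), data):
                bufs[dst][k] += v

    # Worker i now holds the complete sum for chunk (i + 1) % n.
    # Phase 2 (allgather): circulate the finished chunks around the ring,
    # overwriting the partial sums held by each neighbour.
    for step in range(n - 1):
        msgs = []
        for i in range(n):
            c = (i + 1 - step) % n
            msgs.append((c, [bufs[i][k] for k in chunk(c)]))
        for i, (c, data) in enumerate(msgs):
            dst = (i + 1) % n
            for k, v in zip(chunk(c), data):
                bufs[dst][k] = v

    # Divide by the worker count to turn the sums into averages.
    return [[v / n for v in b] for b in bufs]
```

Calling the function with three simulated workers holding `[1.0, 2.0, 3.0, 4.0]`, `[3.0, 4.0, 5.0, 6.0]` and `[5.0, 6.0, 7.0, 8.0]` leaves every worker with `[3.0, 4.0, 5.0, 6.0]`. In this scheme each worker exchanges data only with its ring neighbours, which is one reason such approaches are paired with fast interconnects.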
Uber has reported a doubling of scaling in benchmark testing against a standard distribution of TensorFlow, the Linux Foundation said Thursday (Dec. 13).
“This project has proven highly effective in training machine learning models quickly and efficiently,” said Ibrahim Haddad, the Linux Foundation’s research director.
“Uber built Horovod to make deep learning model training faster and more intuitive for AI researchers across industries,” said Alex Sergeev, the Horovod project leader.
Uber announced in March it was extending its work on distributed deep learning while scaling Horovod on large clusters and supercomputers using IBM’s Power9 architecture.
Along with Uber and IBM (NYSE: IBM), contributors to the Horovod project include Amazon Web Services (NASDAQ: AMZN), Intel Corp. (NASDAQ: INTC) and Nvidia (NASDAQ: NVDA). Uber is using the project to develop self-driving vehicle and trip forecasting applications.
Uber joined the Linux Foundation as a “Gold” member in November. Horovod will be managed as part of the Linux Foundation’s deep learning community.
The Uber project is among a number of efforts aimed at accelerating the GPU-based training of deep learning models. For example, Fast.ai, an organization offering free courses on deep learning, claimed a new speed record in August for training on a popular image database using Nvidia GPUs running on public cloud infrastructure.
A pair of researchers trained a model on the ImageNet database to 93 percent accuracy in 18 minutes using 16 AWS cloud instances, each with eight Nvidia Tesla V100 Tensor Core GPUs. Running Fast.ai and PyTorch libraries, the researchers claimed a 40-percent boost in speed and accuracy for training ImageNet on public infrastructure.
The previous record was held by Google (NASDAQ: GOOGL) on its Tensor Processing Unit Pod cluster.