November 18, 2022

IBM Collaboration Looks to Bring Massive AI Models to Any Cloud

Training machine learning foundation models, which sometimes have billions of parameters, demands serious computing power. For example, the largest version of GPT-3, OpenAI's famous large language model, has 175 billion parameters and requires truly powerful hardware: it was trained on an AI supercomputer that Microsoft developed specifically for OpenAI, containing more than 285,000 CPU cores, 10,000 GPUs, and 400Gb/s InfiniBand networking.

These bespoke high-performance computing systems are expensive and often out of reach for those outside a datacenter or research facility. Researchers at IBM and PyTorch are looking to change that.

IBM announced it has been collaborating with a distributed team within PyTorch, the open-source ML platform run by the Linux Foundation, to enable the training of large AI models on affordable networking hardware such as Ethernet. Additionally, the company has built an open-source operator for optimizing PyTorch deployments on Red Hat OpenShift on IBM Cloud.

Using PyTorch’s Fully Sharded Data Parallel (FSDP) API for data-parallel training, the team successfully trained models with 11 billion parameters across a multi-node, multi-GPU cluster using standard Ethernet networking on IBM Cloud. IBM says this method of training models with 12 billion or fewer parameters is 90% more efficient than using pricey HPC networking systems.
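For readers who want a concrete picture, the sketch below shows roughly what FSDP-based training looks like in PyTorch. It is illustrative only, not IBM's actual training stack: build_model(), training_batches(), and the hyperparameters are hypothetical stand-ins, and the script assumes it is launched with torchrun so the usual distributed environment variables are set.

    import os
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    def main():
        # torchrun supplies RANK, WORLD_SIZE, and LOCAL_RANK; the NCCL backend
        # handles GPU collectives and runs over plain TCP/IP (i.e. Ethernet)
        # when no InfiniBand fabric is present.
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

        model = build_model()             # hypothetical model factory
        model = FSDP(model.cuda())        # shard parameters, gradients, and optimizer state across ranks

        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
        for batch in training_batches():  # hypothetical data loader yielding GPU tensors
            loss = model(batch).mean()    # placeholder loss for the sketch
            loss.backward()
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)

    if __name__ == "__main__":
        main()

Because FSDP shards a model's parameters, gradients, and optimizer state across all workers rather than replicating them on every GPU, each GPU only has to hold a slice of the model, which is what makes multi-billion-parameter training feasible on clusters of this kind.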


“Our approach achieves on-par efficiency training models of this size as HPC networking systems, making HPC networking infrastructure virtually obsolete for small and medium-scale AI models,” said Mike Murphy, a research writer at IBM, in a company blog post.

Murphy describes the infrastructure used for this work as “essentially off-the-shelf hardware” running on IBM Cloud: 200 nodes, each with eight Nvidia A100 80GB GPUs, 96 vCPUs, and 1.2TB of CPU RAM. The GPUs within a node are connected via NVLink with a card-to-card bandwidth of 600GB/s, and nodes are connected by two 100Gb/s Ethernet links with an SR-IOV-based TCP/IP stack, which Murphy says provides a usable bandwidth of 120Gb/s (though he notes that for the 11-billion-parameter model, researchers observed peak network bandwidth utilization of 32Gb/s).
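As a rough illustration of how a PyTorch job is steered onto that kind of TCP/IP networking, NCCL (the communication library PyTorch uses for GPU collectives) exposes environment variables for selecting the network interface. The snippet below is a hedged sketch with a hypothetical interface name, not IBM's actual configuration.

    import os

    # Hypothetical values; the real interface name depends on the cluster's SR-IOV setup.
    os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # send NCCL traffic over the Ethernet interface
    os.environ.setdefault("NCCL_IB_DISABLE", "1")        # no InfiniBand fabric, so fall back to TCP sockets

These would be set before the process group is initialized, or exported in the job's environment by the launcher.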

This GPU system, configured with OpenShift, has been running since May. Currently, the research team is building a production-ready software stack for end-to-end training, tuning, and inference of large AI models.

Though this research was conducted with an 11-billion-parameter model rather than one of GPT-3’s size, IBM hopes to scale the technology to larger models.

“We believe this approach is the first in the industry to achieve scaling efficiencies for models with up to 11 billion parameters that use Kubernetes and PyTorch’s FSDP APIs with standard Ethernet,” said Murphy. “This will allow researchers and organizations to train massive models in any cloud in a far more cost-efficient and sustainable way. In 2023, the goal of the joint team is to continue scaling this technology to handle even larger models.”

Related Items:

One Model to Rule Them All: Transformer Networks Usher in AI 2.0, Forrester Says

IBM Research Open-Sources Deep Search Tools

Meta Releases AI Model That Translates Over 200 Languages
