February 8, 2021

Google’s New Switch Transformer Model Achieves 1.6 Trillion Parameters, Efficiency Gains

Oliver Peckham

(achinthamb/Shutterstock)

Last year, OpenAI wowed the world with its eerily human language generator, GPT-3. The autoregressive model stood at a then-staggering 175 billion parameters, roughly ten times more than any previous dense language model. Now, Google is raising the bar with a model that scales to 1.6 trillion parameters, roughly nine times GPT-3’s count, while delivering major improvements in efficiency compared to previous, hardware-intensive approaches.

The model is built on a mixture-of-experts approach, which uses a variety of specialized “expert” models as constituent parts of the overall architecture, allowing for (as the authors of the paper say) “outrageous numbers of parameters.” “However,” they write, “despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability.”

Enter the Switch Transformers: “scalable and effective natural language learners” that allowed the researchers to increase the parameter count while keeping floating point operations (FLOPs) per example constant. “Our hypothesis,” they say, “is that the parameter count, independent of total computation performed, is a separately important axis on which to scale.”
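To make the "more parameters, same FLOPs" idea concrete, here is a minimal pure-Python sketch of top-1 expert routing, under stated assumptions. It is not the researchers' implementation: the function names (`switch_layer`, `expert_params`, `expert_macs_per_token`), the tiny layer sizes, and the ReLU feed-forward experts are all illustrative choices, and the small cost of the router itself (which does grow with the number of experts) is left out of the count.

```python
import math
import random

def make_expert(d_model, d_ff, rng):
    """One feed-forward expert: two weight matrices (biases omitted)."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_ff)] for _ in range(d_model)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
    return w_in, w_out

def matvec(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def switch_layer(x, router_w, experts):
    """Send one token to the single most probable expert (top-1 routing)."""
    probs = softmax(matvec(x, router_w))
    k = max(range(len(probs)), key=probs.__getitem__)
    w_in, w_out = experts[k]
    hidden = [max(0.0, h) for h in matvec(x, w_in)]  # ReLU
    out = matvec(hidden, w_out)
    # Scale by the router probability of the chosen expert.
    return [probs[k] * o for o in out], k

def expert_params(d_model, d_ff, num_experts):
    """Total expert weights: grows linearly with the number of experts."""
    return num_experts * 2 * d_model * d_ff

def expert_macs_per_token(d_model, d_ff):
    """Multiply-adds spent in the one chosen expert: independent of expert count."""
    return 2 * d_model * d_ff

if __name__ == "__main__":
    rng = random.Random(0)
    d_model, d_ff = 8, 16  # toy sizes, chosen only for illustration
    for num_experts in (2, 8, 32):
        experts = [make_expert(d_model, d_ff, rng) for _ in range(num_experts)]
        router_w = [[rng.gauss(0, 0.1) for _ in range(num_experts)]
                    for _ in range(d_model)]
        token = [rng.gauss(0, 1) for _ in range(d_model)]
        _, chosen = switch_layer(token, router_w, experts)
        print(num_experts, "experts:",
              expert_params(d_model, d_ff, num_experts), "expert parameters,",
              expert_macs_per_token(d_model, d_ff), "expert multiply-adds per token,",
              "routed to expert", chosen)
```

Because each token passes through only one expert, adding experts multiplies the stored weights while the per-token expert compute stays fixed, which is the scaling axis the authors' hypothesis describes.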

An illustration of a Switch Transformer encoder block. Image courtesy of the researchers.

The Googlers built the Switch Transformers on the back of Google’s own T5 models (introduced in 2019), powered them with 32 of Google’s in-house Tensor Processing Units (TPUs), equipped them with 2,048 “experts,” and set them to work on the Colossal Clean Crawled Corpus. The Corpus (“C4”) is a nearly terabyte-scale dataset of text crawled from the web and used to pretrain natural language processing (NLP) models. The researchers masked 15% of the words in the C4 dataset and tasked the Switch Transformers with filling in the blanks. They also asked the model to translate among 101 languages and answer trivia questions.
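The fill-in-the-blanks objective can be sketched in a few lines of Python. This is a simplified illustration, not the researchers' preprocessing code: it masks individual words independently, whereas T5-style pretraining replaces masked spans with sentinel tokens, and the `mask_words` function, the `<mask>` placeholder and the example sentence are all assumptions made for the sake of the example.

```python
import random

def mask_words(words, mask_rate=0.15, mask_token="<mask>", seed=0):
    """Hide roughly mask_rate of the words; return the masked text and the answers."""
    if not words:
        return [], {}
    rng = random.Random(seed)
    # Mask at least one word, and never more words than exist.
    n_mask = min(len(words), max(1, round(len(words) * mask_rate)))
    positions = set(rng.sample(range(len(words)), n_mask))
    masked = [mask_token if i in positions else w for i, w in enumerate(words)]
    # The model's training target: the original word at each masked position.
    targets = {i: words[i] for i in sorted(positions)}
    return masked, targets

if __name__ == "__main__":
    sentence = ("the quick brown fox jumps over the lazy dog while "
                "the model learns to fill in every blank it sees").split()
    masked, targets = mask_words(sentence)
    print(" ".join(masked))
    print(targets)
```

The model sees only the masked sequence and is scored on how well it recovers the words stored in `targets`.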

Speed gains delivered by the Switch Transformer models relative to the baseline T5 model. Image courtesy of the researchers.

The Switch Transformers performed well. The new models saw up to a sevenfold pretraining speedup using the same computational resources and exhibited “no training instability.” The researchers said that while these experiments were focused on “extremely large models,” models with as few as two experts benefit from the new approach.

“We find that these models excel across a diverse set of natural language tasks and in different training regimes, including pre-training, fine-tuning and multi-task training,” the authors conclude. “These advances make it possible to train models with hundreds of billion to trillion parameters and which achieve substantial speedups[.] … We hope our work motivates sparse models as an effective architecture and that this encourages researchers and practitioners to consider these flexible models in natural language tasks, and beyond.”

So: why hasn’t it been done before? “The motivation to try sparse models has been stymied by the massive success of scaling dense models,” they explain, “the success of which is partially driven by co-adaptation with deep learning hardware[.]”

Now, the researchers are setting out to improve the model – specifically, through increased training stability for the largest models and better understanding of the scaling relationships at play.

To learn more, read the paper, which was written by William Fedus, Barret Zoph, and Noam Shazeer (all hailing from Google Brain) and is available here.

