One Model to Rule Them All: Transformer Networks Usher in AI 2.0, Forrester Says
The recent advent of massive transformer networks is ushering in a new age of AI that will give customers advanced natural language capabilities with just a fraction of the skills and data previously required, according to Forrester. The capabilities demonstrated by OpenAI’s GPT-3 and other language models not only were unexpected, but they also mark the beginning of a new arms race among the hyperscalers and bring us a step closer to AGI, the analysts say.
In “AI 2.0: Upgrade Your Enterprise With Five Next-Generation AI Advances,” Forrester analysts Kjell Carlsson, Brandon Purcell, and Mike Gualtieri identified five key technologies that are responsible for advancing the state of artificial intelligence. Transformer networks are the first technology on the list, followed by synthetic data, reinforcement learning, federated learning, and causal inference. All of these are important, and they work together and build off one another to create something that’s more than the sum of its parts.
But make no mistake: There is something special and quite unexpected happening with regard to transformer networks. If Carlsson was attempting to conceal his surprise at the rise of transformer networks during an interview with Datanami last week, he did a lousy job.
This story starts about a year and a half ago, he says. The AI conversation largely was centered around making incremental progress in the use of existing technologies, like convolutional neural networks (CNNs), XGBoost, and random forest models. “All of the innovation is really in how we’re using them,” Carlsson says.
People were experimenting with miniaturizing CNN models and running them on FPGAs and custom processors to push inference out to the edge. The idea was to build on what was already known to push computer vision use cases into the real world. There didn’t seem to be much discussion about fundamentally new technologies.
“But then something happened with the transformer networks,” Carlsson says. “We went from Salesforce two or three years ago talking about decaNLP to ‘Wow, we can do all these different tasks with a single model. Isn’t this cool?’ to ‘Oh, no, wait a minute–this completely blows up the way we were handling our language models before. Now we can create these hyper accurate backends for all of these types of NLP models, and at the same time, it’s easier to manage, it requires less training data, and we can just brute force so many things that previously we just couldn’t do.’”
Last July, OpenAI released GPT-3, which sported 175 billion parameters. The model drew lots of attention for its ability to create novel sentences that sound very human-like, as well as its ability to accurately provide answers. Last month, Google released its Switch Transformer model, which features 1.6 trillion parameters, nearly a 10x increase over GPT-3. The Chinese Web giants are also using transformer networks, as are analytics startups.
What makes these large transformer networks so much better, Carlsson says, is that they can parallelize processing of time-series data. Advances in recurrent neural networks (RNNs) and long short-term memory (LSTM) models gave us transcription engines that brought us closer to the accuracy of humans, but still not beyond human capability (as CNNs have done in some computer vision use cases).
The challenge was that the lack of parallelization prevented RNNs and LSTMs from being able to process time-oriented data at the levels that we needed to push the state of the art forward in a major way, Carlsson says.
“We could take problems that didn’t have this time dimension and parallelize those,” he says. “That’s why we could do image processing so well. But now you’ve moved to a world whereby, okay, now I can incorporate the time dimension. That’s how we understand voice and text. What I said now not only depends on what I said previously, but the meaning also depends on what I’m about to say.”
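The parallelization point can be sketched in a few lines of NumPy. Both functions below are illustrative toys, not real model code: the RNN must walk through the sequence one step at a time because each hidden state depends on the previous one, while self-attention computes every position’s output in a handful of matrix multiplies that a GPU can run for all time steps at once.

```python
import numpy as np

def rnn_step_by_step(x, W_h, W_x):
    """Toy RNN: hidden state t depends on hidden state t-1,
    so the T time steps must be computed one after another."""
    T, d = x.shape
    h = np.zeros(d)
    states = []
    for t in range(T):                  # inherently serial loop
        h = np.tanh(h @ W_h + x[t] @ W_x)
        states.append(h)
    return np.stack(states)

def self_attention(x):
    """Toy scaled dot-product self-attention: every position attends
    to every other position in one batch of matrix operations, so
    all T steps can be computed (and parallelized) at once."""
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)       # (T, T) pairwise similarities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ x                  # weighted mix of all positions

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))         # 5 time steps, 4 features
out = self_attention(x)
print(out.shape)                        # (5, 4): all steps computed jointly
```

The serial loop in the RNN is exactly what caps throughput on long sequences; the attention version replaces it with dense linear algebra, which is why transformer training scales with added compute.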
This situation affects other problems that have a time dimension, he says. “You think of forecasting problems. You think of machine logs and managing infrastructure downtime. You think of fraud,” Carlsson says. “So many of the problems that we want to tackle in the real world have that time dimension in them, and previously, you couldn’t just throw as much data at it as you wanted to because it just wasn’t feasible. Now [with transformer networks] you just keep adding more and more compute at it and you can parse anything.”
There are other qualities of transformer networks that have the potential to change the AI game in a fundamental way. For starters, the pre-trained transformer networks appear to be extremely generalizable. If you have any sort of NLP problem, it no longer makes sense to build your own system from scratch. We have seen this with BERT (Bidirectional Encoder Representations from Transformers), an open source transformer model that’s being used to fight fake news on social media, among other uses.
“We had transfer learning before, but that only worked when you’re in the same domain,” Carlsson says. “Now this is working going across domains and you end up with these bizarre instances where the model has suddenly learned to count, even though nobody ever trained it to count.”
If you have an NLP problem, Carlsson’s advice is to go to the folks who have pre-trained that transformer network model on that “ridiculous corpus of data, do some additional training on it, and voila. There’s no point in doing it on your own.”
It’s not “one model to rule them all” yet, Carlsson says, but it’s getting close. The transformer networks are getting so good that AI users won’t need to have separate models for each specific use case, such as chatbots or voice assistants. There will be one model that works across domains.
Here’s another benefit for AI-using companies: Since the starting point for transformer networks is so much more advanced than in the past, data scientists will no longer be required to build much of the underlying system from scratch.
“For a lot of these new use cases, you don’t need the data scientist with extremely specialized skills that you had before, because now that one model to rule them all, it’s actually available to be used as is,” Carlsson says. “There’s an ecosystem of folks and a set of skills which are no longer about ‘How do I train the model and how do I deploy this model,’ because the hyperscalers are going to be doing that for you. Instead, it’s more about how do I build this solution around it.”
The advances in NLP and natural language understanding (NLU) capabilities will be felt in other areas. While traditional machine learning techniques didn’t face the same types of impediments when working with structured, or tabular, data, the ability to mix and match NLP/NLU techniques in combination with ML atop tabular data will enable a new class of solutions, Carlsson says.
“The opportunity is more in the cross domain [area], when you’re looking at text and language together, or tabular data and video together, or tabular data and images together,” he says.
The advent of massive transformer models training against massive data sets on massive high-performance computing (HPC) clusters could herald a “back to the future” type of moment for big data.
“Back in 2015, everybody was saying the hyperscalers are going to rule the world because they’ve got more data and they’ve got more compute, and they’ve got talented people,” Carlsson says. “But that didn’t materialize. It turned out, yeah you have lots of data but you don’t have the right data. You can’t ingest all the data out there. It doesn’t work very effectively. Now the world is [saying], oh, you do need a supercomputer. You do have these returns to having absolutely incredible amounts of compute and incredible amounts of data.”
In short, the transformer networks are changing the ballgame in AI. The rise of transformer networks isn’t the only factor in the advance of AI 2.0, but it’s a critical one, it was largely unexpected, and its impacts will likely be felt for years.
We’re still pushing the bounds of what these transformer networks can do. The potential to build more accurate forecasts deep into the future, such as predicting a child’s earning potential, or whether a person will develop diabetes, now seems within our grasp, Carlsson says. It’s not going to give rise to Hari Seldons, who practiced psychohistory in Isaac Asimov’s Foundation series. At least not yet.
“It’s not AGI yet, or anytime soon,” he says. “But we weren’t thinking that AGI was ever going to happen. There was no indication that we were even making progress against it. And this is actual progress against it, which is phenomenal.”