Google Mimics Human Brain with Unified Deep Learning Model
Despite the progress we’ve made in deep learning, the approach remains out of reach for many, largely due to the high costs associated with tuning models for specific AI tasks. Now Google hopes to simplify things with a unified deep learning model that works across multiple tasks – and even apparently mimics the human brain’s capability to take learnings from one field and apply them to another.
Researchers from Google and the University of Toronto last month quietly released an academic paper, titled “One Model To Learn Them All,” that introduces a new approach to unifying deep neural network training. Dubbed MultiModel, the approach incorporates building blocks from multiple domains, and aims to generate good training results for a range of deep learning tasks – including speech recognition, image classification, and language translation – on different data types, like images, sound waves, and text data.
The idea is to get away from the high degree of specialization required to get good results out of deep learning, and to provide a more generalized approach that delivers accuracy without the high tuning costs.
“Convolutional networks excel at tasks related to vision, while recurrent neural networks have proven successful at natural language processing tasks,” the researchers write. “But in each case, the network was designed and tuned specifically for the problem at hand. This limits the impact of deep learning, as this effort needs to be repeated for each new task.”
What’s more, the requirement to tailor training regimens to specific tasks runs counter to the underlying concept driving modern neural networking theory – namely, that mimicking the human brain, with its powerful transfer learning capabilities, is the best approach to building machine intelligence.
“The natural question arises,” the researchers write, “Can we create a unified deep learning model to solve tasks across multiple domains?”
The answer, apparently, is yes.
The concept of multi-tasking is not new in the fields of artificial intelligence and deep learning. Researchers have long been aware of the benefits of taking a model trained for one particular task and using it for another.
But until now, those tasks have been relatively closely related. For example, if you wanted to build a machine translation algorithm that converts German text to English, you could strengthen the model by having it train on other languages, too. And some computer vision problems, such as facial landmark detection, behave in a similar way, according to the researchers.
“But all these models are trained on other tasks from the same domain: translation tasks are trained with other translation tasks, vision tasks with other vision tasks, speech tasks with other speech tasks,” the researchers write.
While multi-modal learning has been shown to improve learned representations in the unsupervised setting, no competitive multi-task, multi-modal model has been proposed, the researchers write.
That was, of course, before Google proposed its new MultiModel architecture, which it designed to handle a variety of deep learning tasks and, most importantly, to take learnings from one domain and apply them to another – just as the human brain does.
So, how does one construct a neural network that can get better at image classification problems by working through speech recognition problems, and vice versa? According to the paper, Google’s MultiModel architecture handles these tasks by building task- and domain-specific engines directly into the same model, and then providing linkage to connect them together in an intelligent way.
Architecturally, the MultiModel consists of a few small modality-nets, an encoder, an I/O mixer, and an autoregressive decoder. The encoder and decoder are each built with three key computational blocks to get good performance across different problems, including convolutional blocks; attention blocks; and mixture-of-experts blocks. Four types of modality nets are built into the system, including one for textual language data, one for images, one for audio, and one for categorical data.
The modality nets, which are used to represent data from different domains in a unified fashion, and the computational blocks, which provide domain-specific processing, are critical to the MultiModel architecture.
“To allow training on input data of widely different sizes and dimensions, such as images, sound waves and text, we need sub-networks to convert inputs into a joint representation space,” the researchers write. “We call these sub-networks ‘modality nets’ as they are specific to each modality (images, speech, text) and define transformations between these external domains and a unified representation.”
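In spirit, a modality net is just a per-modality sub-network whose output always lands in the same fixed-size joint representation. The following is a toy sketch of that routing idea only – the function names, dimension, and transforms are invented stand-ins, not the paper's learned networks:

```python
# Hypothetical sketch: each modality gets its own small transform, but
# every transform emits a vector of the same fixed size, so the shared
# model body can consume any of them.

JOINT_DIM = 8  # size of the unified representation space (made up)

def embed_text(tokens):
    # Toy stand-in for a learned text embedding: hash each token into
    # one of JOINT_DIM buckets and count occurrences.
    vec = [0.0] * JOINT_DIM
    for tok in tokens:
        vec[hash(tok) % JOINT_DIM] += 1.0
    return vec

def embed_audio(samples):
    # Toy stand-in for an audio front end: average the waveform over
    # JOINT_DIM equal chunks.
    chunk = max(1, len(samples) // JOINT_DIM)
    vec = [0.0] * JOINT_DIM
    for i in range(JOINT_DIM):
        window = samples[i * chunk:(i + 1) * chunk]
        vec[i] = sum(window) / len(window) if window else 0.0
    return vec

MODALITY_NETS = {"text": embed_text, "audio": embed_audio}

def to_joint_space(modality, data):
    # Route the raw input through its modality net; the shared body
    # only ever sees JOINT_DIM-sized vectors.
    return MODALITY_NETS[modality](data)
```

Text and audio inputs of wildly different shapes both come out as length-8 vectors, which is the property the shared encoder and decoder rely on.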
The body of the MultiModel uses computational building blocks from multiple domains, which is crucial for providing good results on various problems. “We use depthwise-separable convolutions, an attention mechanism, and sparsely-gated mixture-of-experts layers,” the researchers write.
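To see why depthwise-separable convolutions are attractive as a building block, compare parameter counts for a standard convolution against a depthwise-plus-pointwise pair. This is a sketch of the standard arithmetic for that comparison, not code from the paper, and the layer sizes are illustrative:

```python
def standard_conv_params(k, c_in, c_out):
    # A standard conv mixes space and channels in one step:
    # k*k spatial taps for every (input, output) channel pair.
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    # Depthwise step: one k*k spatial filter per input channel.
    # Pointwise step: a 1x1 conv to mix channels afterwards.
    return k * k * c_in + c_in * c_out

# Example: a 3x3 layer with 256 input and 256 output channels.
std = standard_conv_params(3, 256, 256)   # 589,824 parameters
sep = separable_conv_params(3, 256, 256)  # 67,840 parameters
```

For this layer the separable form needs roughly 8.7x fewer parameters, which is why it scales well when one body must serve many tasks.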
What’s interesting is that these blocks were introduced in papers that belonged to different domains, and were not studied before on tasks from other domains, according to the researchers. For example, the Xception architecture was designed for image classification workloads, and hasn’t been applied to text or speech processing before. Similarly, the sparsely-gated mixture-of-experts was designed for language processing tasks, and hasn’t been studied on image problems.
Each of these mechanisms is crucial for the domain it was designed for, the researchers write. “But, interestingly, adding these computational blocks never hurts performance, even on tasks they were not designed for,” they continue. “In fact we find that both attention and mixture-of-experts layers slightly improve performance of MultiModel on ImageNet, the task that needs them the least.”
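A sparsely-gated mixture-of-experts layer routes each input through only a few experts, chosen by a learned gate. The toy sketch below shows just the top-k routing idea; the expert functions and gate scores are made-up stand-ins, not the paper's learned versions:

```python
def sparse_moe(x, experts, gate_scores, k=2):
    # Keep only the top-k experts by gate score (the sparsity),
    # renormalize their scores, and mix the chosen experts' outputs.
    top = sorted(range(len(experts)),
                 key=lambda i: gate_scores[i], reverse=True)[:k]
    total = sum(gate_scores[i] for i in top)
    return sum(gate_scores[i] / total * experts[i](x) for i in top)

# Four toy "experts", each a simple scalar function.
experts = [lambda x: x + 1, lambda x: 2 * x,
           lambda x: x * x, lambda x: -x]
gates = [0.1, 0.6, 0.3, 0.0]  # made-up gate scores

y = sparse_moe(4.0, experts, gates, k=2)  # mixes only 2*x and x*x
```

Because only k experts run per input, total capacity can grow with the number of experts while the per-input compute stays roughly constant.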
Performance + Ramifications
Google implemented the MultiModel architecture using its TensorFlow framework and trained it for eight specific tasks. It then compared the results of training the model on those eight tasks simultaneously versus training those tasks to run separately on state-of-the-art, single use models.
The results were promising. “The joint eight-problem model performs similarly to single-model on large tasks, and better, sometimes significantly, on tasks where less data is available, such as parsing,” the researchers write.
The idea that MultiModel would outperform a dedicated model at text parsing caught the researchers’ attention. They tried training the parser jointly with ImageNet, the well-known image classification dataset, to see if that combination improved parsing performance.
“This is indeed the case,” the researchers write. “The difference in performance is significant, and since we use both dropout and early stopping, we conjecture that it is not related to over-fitting. Rather, it seems, there are computational primitives shared between different tasks that allow for some transfer learning even between such seemingly unrelated tasks as ImageNet and parsing.”
The researchers demonstrated for the first time that a single deep learning model can jointly learn a number of large-scale tasks from multiple domains. “The key to success comes from designing a multi-modal architecture in which as many parameters as possible are shared and from using computational blocks from different domains together,” the researchers write. “We believe that this treads a path towards interesting future work on more general deep learning architectures, especially since our model shows transfer learning from tasks with a large amount of available data to ones where the data is limited.”
This is an important finding, as it indicates that deep learning may not be confined to stove-piped applications, and may in fact be applicable in a much broader sense. For those who are eager for general AI, this brain-like capability demonstrated by Google Brain, Google Research, and University of Toronto researchers could be very important, indeed.