Microsoft Claims Speech Recognition ‘Parity’
Microsoft speech recognition researchers report they have achieved parity with humans using a testing framework that gauges error rates for professional transcribers and the ability to understand “open-ended conversations.”
“In both cases, our automated system establishes a new state-of-the-art, and edges past the human benchmark,” members of the Microsoft Artificial Intelligence and Research group claimed in a paper published this week. “This marks the first time that human parity has been reported for conversational speech.”
Using a U.S. human error rate benchmark to test its speech recognition system, the researchers said their approach equaled the 5.9 percent error rate of professional transcribers and the 11.3 percent error rate for open-ended conversations among friends or family members.
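Figures like 5.9 percent refer to word error rate (WER), the standard metric in speech recognition: the word-level edit distance (substitutions, insertions, and deletions) between a system's transcript and a reference transcript, divided by the reference length. A minimal sketch of the computation (the function name and example sentences are illustrative, not from the paper):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance via dynamic programming over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One missed backchannel word out of seven reference words:
print(wer("uh huh i see what you mean", "uh i see what you mean"))
```

On this toy pair the single deleted "huh" yields a WER of 1/7, about 14 percent, which is why the short function words and backchannel tokens discussed later in the article weigh so heavily on the metric.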
The Microsoft (NASDAQ: MSFT) team attributed the advance to the systematic use of convolutional and recurrent neural networks, including LSTM (Long Short-Term Memory) networks. These speech recognition networks were combined with a “spatial smoothing” method and an acoustic training technique.
Since a single measure of human performance was insufficient to accurately gauge an automated system, the conversational speech recognition system was compared against professional transcribers on both the “Switchboard” benchmark and a separate “CallHome” test. The new Microsoft system showed an improvement of about 0.4 percent, the researchers reported, exceeding human performance “by a small margin.”
Convolutional models were found to perform best, but the researchers noted that LSTM networks also showed promise for both acoustic and language modeling. Inspired by the human auditory cortex, the part of the brain responsible for hearing, the researchers said they employed a spatial smoothing technique to improve the accuracy of their LSTM models.
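The appeal of LSTMs for both acoustic and language modeling is their gating mechanism, which lets the network carry context across long audio frame or word sequences. A minimal single-step sketch of a standard LSTM cell in NumPy (the shapes, weights, and inputs here are illustrative dummies, not the paper's actual models):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. W stacks the four gate weight matrices;
    each gate reads the concatenated [input, previous hidden state]."""
    z = W @ np.concatenate([x, h_prev]) + b
    n = h_prev.size
    i = sigmoid(z[0:n])          # input gate: what to write
    f = sigmoid(z[n:2 * n])      # forget gate: what to keep
    o = sigmoid(z[2 * n:3 * n])  # output gate: what to expose
    g = np.tanh(z[3 * n:4 * n])  # candidate cell contents
    c = f * c_prev + i * g       # updated long-term memory cell
    h = o * np.tanh(c)           # new hidden state
    return h, c

# Run the cell over a short dummy feature sequence.
rng = np.random.default_rng(0)
n_in, n_hid = 8, 4
W = rng.standard_normal((4 * n_hid, n_in + n_hid)) * 0.1
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for _ in range(10):
    h, c = lstm_step(rng.standard_normal(n_in), h, c, W, b)
print(h.shape)  # (4,)
```

The forget gate is the key design choice: because the memory cell `c` is updated additively rather than overwritten, gradients can flow across many time steps, which is what makes these networks effective on long conversational utterances.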
The researchers used three variants of convolutional neural networks in their acoustic model along with a combination of complementary models.
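One common way to combine complementary acoustic models is to average their frame-level class posteriors; the sketch below shows that generic scheme, and the function name, weights, and dummy data are illustrative assumptions rather than the paper's actual combination method:

```python
import numpy as np

def combine_posteriors(posteriors, weights=None):
    """Fuse per-model posteriors by weighted averaging.

    posteriors: list of (frames, classes) arrays, one per model,
    each row a probability distribution over acoustic classes."""
    stacked = np.stack(posteriors)  # (models, frames, classes)
    if weights is None:
        weights = np.ones(len(posteriors)) / len(posteriors)
    combined = np.tensordot(weights, stacked, axes=1)
    # Renormalize so each frame remains a valid distribution.
    return combined / combined.sum(axis=1, keepdims=True)

# Three dummy models, 3 frames, 5 acoustic classes each.
rng = np.random.default_rng(1)
models = [rng.dirichlet(np.ones(5), size=3) for _ in range(3)]
fused = combine_posteriors(models)
print(fused.shape)  # (3, 5)
```

Averaging works best when the component models make uncorrelated errors, which is the rationale for mixing architecturally different networks, such as the three convolutional variants described above.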
The neural networks incorporated into the speech recognition system were trained on Microsoft’s “cognitive toolkit” running on Linux-based servers with multiple GPUs. The toolkit leveraged graphics processing to accelerate the training of acoustic models that previously required weeks or months.
Microsoft released its Computational Network Toolkit on GitHub earlier this year, saying it undertook the project out of necessity: current tools used to improve how computers understand human speech were slowing progress.
Meanwhile, the Microsoft researchers’ analysis of human versus machine errors indicated “substantial equivalence,” with the exception of recognizing familiar aspects of human speech known as “backchannel acknowledgements” such as “uh-huh” and hesitations like “um.”
“The distinction is that backchannel words like ‘uh-huh’ are an acknowledgment of the speaker, also signaling that the speaker should keep talking, while hesitations like ‘uh’ are used to indicate that the current speaker has more to say and wants to keep his or her turn.” These “turn-management devices” therefore “have exactly opposite functions” when a speech recognition system attempts to classify individual words, they noted.
Certain words continue to trip up both human transcribers and speech recognition systems. For example, the researchers found that so-called “short function words” generate the most errors.
Illustrating the subtleties of human speech, they found that the word “I” was omitted most often by transcribers. “While we believe further improvement in function and content words is possible, the significance of the remaining backchannel/hesitation confusions is unclear,” they added.