Voice ‘Fingerprint’ Propels Speaker Recognition
The accuracy of automatic speech recognition has made significant gains in the last few years thanks to the advent of deep neural networks. But there’s one area that has thwarted researchers: telling multiple speakers apart. Now a startup called Chorus says it has made a breakthrough in the matter through a technique it calls “voice fingerprinting.”
Speech recognition and computer vision are arguably the two computational challenges that have benefited the most from deep learning. Armed with huge training sets – including vast troves of photographs and digital recordings of voices – convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have given computers sensory perception that can almost rival humans’ senses.
Despite those successes, we still run into edge cases where deep learning has not delivered breakthroughs in cognitive applications. There’s the self-driving car that makes a spurious correlation between the hue of the sky and the need to suddenly turn left. And in speech recognition, there’s the challenge of diarization, or identifying multiple speakers.
“It’s notoriously hard to know who’s speaking, especially when multiple people are in the same room,” says Micah Breakstone, the chief scientist and co-founder of Chorus, which is based in San Francisco and Israel. “It’s harder than transcribing.”
Chorus has devised a novel way of dealing with this problem in its speech recognition platform, which analyzes its customers’ recorded sales calls en masse to identify potential problems with clients and prospects, and to identify the speech patterns of the most successful salespeople so they can be replicated across the sales team. According to Breakstone, automatic speech recognition engines are built from three components:
- A language model that delivers the statistical probability of a given word appearing near others;
- An acoustic model that turns sound waves into their digital equivalents, or phonemes;
- A dictionary model that turns phonemes into words.
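The interplay of those three components can be sketched with a toy example: a hypothetical acoustic model emits per-segment phoneme probabilities, a lexicon maps phoneme sequences to words, and a language model supplies word priors; a decoder then picks the word with the best combined log score. Every probability, phoneme set, and lexicon entry below is invented purely for illustration – real engines work over far larger search spaces.

```python
import math

# Toy stand-ins for the three components (all values invented for illustration).
# Acoustic model: per-segment phoneme probabilities derived from the audio.
acoustic = [{"K": 0.9, "T": 0.1}, {"AE": 0.8, "EH": 0.2}, {"T": 0.95, "D": 0.05}]

# Dictionary (lexicon) model: phoneme sequences -> words.
lexicon = {("K", "AE", "T"): "cat", ("K", "AE", "D"): "cad"}

# Language model: how likely each word is to appear (as log-probabilities).
lm = {"cat": math.log(0.01), "cad": math.log(0.0001)}

def decode():
    """Score each lexicon entry as acoustic log-likelihood plus language-model prior."""
    def score(phones, word):
        return sum(math.log(p[ph]) for p, ph in zip(acoustic, phones)) + lm[word]
    return max(lexicon.items(), key=lambda kv: score(*kv))[1]
```

Here `decode()` returns "cat": although the acoustic model leaves some ambiguity on the final phoneme, the language model strongly prefers the more common word.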
Breakstone says Google excels across all three areas and has the highest-quality automatic speech recognition engine on the market, with about a 15% word error rate (WER) on real-world data, which translates to an 85% accuracy rate. That’s a significantly higher WER than Google and others claim, which Breakstone attributes to the way they train their models and measure accuracy.
“They measure their accuracy on a test set that has been mulled over for the last 17 years,” Breakstone says. “They’re highly optimized for that specific data set. But if you put it live in the wild, speech recognition is nowhere close to human parity today.”
Chorus’s accuracy is about 90%, or a WER of 10%, Breakstone says. It’s not as accurate as humans, which typically have a WER of 5% to 6%. But it’s better than what you can get with plain vanilla speech recognition from the Web giants, he says.
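The WER figures quoted here have a simple standard definition: the word-level edit distance (substitutions, insertions, and deletions) between the system’s transcript and a human reference, divided by the number of words in the reference. A minimal implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, transcribing the six-word reference "book a demo for next week" as "book the demo for next week" is one substitution, for a WER of 1/6, or about 16.7%. Note that WER can exceed 100% when a transcript contains many spurious insertions, which is one reason "accuracy = 100% − WER" is only a rough shorthand.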
Breakstone attributes Chorus’s success to two things. First, the company trained its engine on a specific type of conversation: sales calls. By focusing just on sales calls and the language commonly used during them – not to mention all the non-traditional company names that are often spoken – the company was able to lower its WER.
“We’re not smarter than Microsoft or Google, but we have invested a hell of a lot in our data,” Breakstone says. “We have huge amounts of data and a large corpus that allow us to…know more about the world we’re transcribing and thereby achieve word error rates that are probably about 25% better on average than the top players.”
Chorus’s other breakthrough has to do with speaker diarization, which has flummoxed many efforts to build automatic speech recognition software. The company’s solution to identifying multiple speakers involves a multi-pronged effort that includes using visual cues (such as faces or names appearing in conference platforms like WebEx) as well as creating phonetic and language models that identify users.
One of the key aspects to solving this problem was automatically enrolling a user into the system, or training the algorithm on the person’s voice without asking them to speak into the microphone for a minute. “Without any human intervention or friction, we can say with extremely high confidence that it’s you speaking without ever asking you to train the system for us,” Breakstone says. “It simplifies the problem by an order of magnitude.”
That doesn’t completely solve the problem, but by combining various deep learning and Gaussian models, Chorus is able to deliver more accurate transcriptions. “Together with additional language features and features from the video and how a person talks, we reach accuracy that has just not been reached before by basically anyone we know of,” he says.
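Chorus hasn’t disclosed its models, but the flavor of Gaussian speaker identification can be sketched: fit a per-speaker Gaussian "voice fingerprint" over acoustic feature frames gathered during automatic enrollment, then assign new frames to whichever speaker’s Gaussian explains them best. The sketch below uses synthetic feature vectors and a single diagonal Gaussian per speaker; real systems extract features such as MFCCs from audio and typically use Gaussian mixtures or neural embeddings.

```python
import numpy as np

def fit_gaussian(frames):
    """Fit a diagonal Gaussian 'voice fingerprint' to a speaker's feature frames."""
    return frames.mean(axis=0), frames.var(axis=0) + 1e-6  # small floor avoids div-by-zero

def log_likelihood(frame, mean, var):
    """Log-density of one feature frame under a diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (frame - mean) ** 2 / var)

def identify(frame, profiles):
    """Assign a frame to whichever enrolled speaker explains it best."""
    return max(profiles, key=lambda spk: log_likelihood(frame, *profiles[spk]))

# Synthetic 13-dimensional 'acoustic features' for two speakers (stand-ins for MFCCs).
rng = np.random.default_rng(0)
alice = rng.normal(loc=0.0, scale=1.0, size=(200, 13))
bob = rng.normal(loc=3.0, scale=1.0, size=(200, 13))
profiles = {"alice": fit_gaussian(alice), "bob": fit_gaussian(bob)}
```

The "automatic enrollment" described above amounts to building `profiles` from segments the system has already attributed to a speaker (for example, via a visual cue in WebEx), so no explicit training phrase is ever requested.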
The company, which has six Ph.Ds. on staff (including Breakstone), has submitted 15 patent applications on its technology. It published some of its research in a recent paper titled, “Fully automatic speaker separation system, with automatic enrolling of recurrent speakers.” Breakstone also recently wrote a blog post about it on Machine Learnings.
The research is paying dividends for Chorus.ai, which has received more than $20 million in funding and has more than 100 customers, according to Breakstone. Companies like Periscope Data, Engagio and Adobe have licensed the software to automatically transcribe and analyze their sales calls.
One customer getting traction with Chorus is Procore, a developer of construction management software based in Carpinteria, California. Alex Jaffe, Procore’s director of sales enablement, says Chorus highlighted a problem with its sales team.
“We noticed early on that our sales reps were not pitching the right pricing model,” Jaffe says in a video posted to the Chorus.ai website. “We worked with teams in product and product marketing to recalibrate the messaging and further train the team on the new pricing model, so we were able to course correct, and we would never have had access to that information without Chorus.”