Speech Recognition Gets an AutoML Training Tool
AutoML — the use of automation to handle the machine learning workflow itself — is branching out to new use cases, taking on some of the most tedious data science tasks involved in training speech recognition models.
Among the latest attempts at automating the data science workflow is an AutoML tool from Deepgram, offering what the speech recognition vendor claims is a new model training framework for machine transcription. The startup’s investors include Nvidia GPU Ventures and In-Q-Tel, the venture arm of the U.S. intelligence community.
Deepgram’s flagship platform scans audio data to train a speech recognition tool. Its deep learning tool uses a hybrid convolutional/recurrent neural network approach, training models via GPU accelerators.
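Deepgram has not published its architecture, but the general convolutional-then-recurrent pattern can be illustrated with a toy forward pass: a 1-D convolution slides over spectrogram frames to extract local acoustic patterns, and a recurrent layer then summarizes the sequence over time. All dimensions and weights below are invented for illustration.

```python
import math, random

random.seed(0)

def conv1d(frames, kernels, width):
    """Slide each kernel across time over a window of `width` feature frames."""
    out = []
    for t in range(len(frames) - width + 1):
        window = [v for frame in frames[t:t + width] for v in frame]
        out.append([math.tanh(sum(w * x for w, x in zip(kern, window)))
                    for kern in kernels])
    return out

def rnn(seq, w_in, w_rec):
    """Plain recurrent layer: h_t = tanh(W_in . x_t + W_rec . h_{t-1})."""
    h = [0.0] * len(w_rec)
    for x in seq:
        h = [math.tanh(sum(wi * xi for wi, xi in zip(w_in[j], x)) +
                       sum(wr * hp for wr, hp in zip(w_rec[j], h)))
             for j in range(len(w_rec))]
    return h

feat_dim, width, n_filters, hidden = 4, 3, 5, 6
# Fake "spectrogram": 20 frames of 4 features each.
frames = [[random.random() for _ in range(feat_dim)] for _ in range(20)]
kernels = [[random.gauss(0, 0.5) for _ in range(width * feat_dim)]
           for _ in range(n_filters)]
w_in = [[random.gauss(0, 0.5) for _ in range(n_filters)] for _ in range(hidden)]
w_rec = [[random.gauss(0, 0.5) for _ in range(hidden)] for _ in range(hidden)]

conv_out = conv1d(frames, kernels, width)   # 18 time steps x 5 filter outputs
summary = rnn(conv_out, w_in, w_rec)        # final hidden state
```

A production system would learn these weights with GPU-accelerated backpropagation and decode the recurrent outputs into text; the sketch only shows how the two layer types compose.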
Deepgram is now adding AutoML to the mix. Such tools are widely used for applications like machine vision, image recognition and natural language processing, but until now have been absent from automatic speech recognition, according to Deepgram CEO Scott Stephenson.
The San Francisco-based company this week released its new AutoML training tool, claiming greater than 90 percent accuracy, faster delivery and half the cost. Deepgram is aiming the speech model trainer at data scientists and engineers “looking to implement speech recognition or replace clunky [automated speech recognition] models that haven’t worked,” Stephenson noted in a blog post unveiling the AutoML tool on Thursday (Aug. 27).
AutoML streamlines the data science workflow by acquiring and prepping data, deriving features from data, selecting the best algorithm, then tuning it. The last step is deployment and monitoring of production models.
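The stages above can be sketched as a simple chained pipeline. The function names and data are stand-ins invented for illustration, not Deepgram's API; each stub marks a step that a real AutoML system would perform automatically.

```python
# Hypothetical stage functions mirroring the AutoML workflow described above.

def acquire_and_prep(source):
    # Stand-in: pretend to load and normalize raw audio clips.
    return [{"clip": f"{source}/clip{i}.wav", "normalized": True} for i in range(3)]

def derive_features(clips):
    # Stand-in feature derivation (e.g., log-mel spectrograms for audio).
    return [{**c, "features": "log-mel"} for c in clips]

def select_algorithm(dataset):
    # Stand-in model search: pick the candidate with the best mock score.
    candidates = {"cnn-rnn": 0.91, "transformer": 0.89}
    return max(candidates, key=candidates.get)

def tune(model, dataset):
    # Stand-in hyperparameter tuning of the chosen algorithm.
    return {"model": model, "tuned": True}

def deploy_and_monitor(model):
    # Final step: push the tuned model to production and watch it.
    return {"deployed": model["model"], "monitoring": "enabled"}

# Run the stages in order, as the article describes.
data = acquire_and_prep("s3://bucket/audio")
data = derive_features(data)
best = select_algorithm(data)
prod = deploy_and_monitor(tune(best, data))
```

The value of AutoML is that each stub becomes a genuine search or optimization step the user never has to hand-code.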
Along with improving utilization of GPU resources and generally making better use of data scientists’ time, Deepgram said its AutoML model trainer eliminates a variety of tasks, including: selecting input audio features; audio noise removal; tuning the “hyperparameters” of models or neural networks; tweaking foundational algorithms; maintaining a custom vocabulary list; and “applying model ensembling with keyword boosting or stacking,” the company said.
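One of the eliminated tasks — hyperparameter tuning — is easy to make concrete. A minimal sketch, using a mock validation objective in place of a real train-and-score run (the search space and error function are invented, not Deepgram's), shows the grid search a data scientist would otherwise run by hand:

```python
from itertools import product

def mock_validation_error(lr, layers):
    # Stand-in for a real training run scored on held-out audio:
    # error is smallest at lr=0.001 with 4 layers.
    return abs(lr - 0.001) * 100 + abs(layers - 4) * 0.02

# Hypothetical search space over two hyperparameters.
space = {"lr": [1e-4, 1e-3, 1e-2], "layers": [2, 4, 6]}

best, best_err = None, float("inf")
for lr, layers in product(space["lr"], space["layers"]):
    err = mock_validation_error(lr, layers)
    if err < best_err:
        best, best_err = {"lr": lr, "layers": layers}, err
```

Real tuners use smarter strategies than exhaustive grids (random search, Bayesian optimization), but the loop structure — propose, evaluate, keep the best — is the part AutoML automates.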
The speech recognition tool reduces those steps to higher-level functions, starting with an audio source, then selecting one of the company’s base models that cover speech recognition from phone calls to meetings. After the model is trained, the user reviews it for accuracy. If needed, additional training focuses on specific audio examples.
The best model is then selected and deployed in the cloud.
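The train-review-refine-deploy loop described above might look like the following sketch. The class, method names, and accuracy numbers are all invented to mirror the workflow the article describes; this is not Deepgram's actual SDK.

```python
# Hypothetical trainer sketch — names and numbers are illustrative only.

class MockTrainer:
    BASE_MODELS = {"phone-call": 0.82, "meeting": 0.80}  # mock starting accuracy

    def __init__(self, audio_source, base_model):
        self.audio = audio_source
        self.model = base_model
        self.accuracy = self.BASE_MODELS[base_model]

    def train(self, extra_examples=None):
        # A full pass moves mock accuracy more than a focused follow-up round.
        self.accuracy = min(0.99, self.accuracy +
                            (0.05 if extra_examples is None else 0.02))
        return self.accuracy

    def deploy(self):
        return {"model": self.model,
                "accuracy": round(self.accuracy, 2),
                "target": "cloud"}

trainer = MockTrainer("recordings/", base_model="phone-call")
trainer.train()                        # initial training on the audio source
while trainer.accuracy < 0.90:         # review step: accuracy below target
    trainer.train(extra_examples=["hard_call.wav"])  # focus on tricky audio
result = trainer.deploy()
```

The while loop corresponds to the review step: if accuracy misses the target, additional training rounds concentrate on specific audio examples until the model is ready to deploy.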
Stephenson made the case for Deepgram’s deep learning approach to automated speech recognition in a recent commentary for sister website EnterpriseAI.com.
A deep learning-based approach, he wrote, “allows enterprises to pick which pieces of the puzzle to build the model from, and then train the model to build itself.
“In many cases, 10 hours of thoughtfully selected audio is all that’s needed to effectively train a model. By doing the work up front, the model can continue to optimize its performance over time, and companies can extract more accuracy and scale out of it,” Stephenson added.