MIT Uses Video to Train Machine Vision System
Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory reported this week they have developed a deep learning algorithm that could help machines using predictive vision anticipate human interactions. The approach uses unlabeled YouTube videos as its source material to train deep networks to predict human interactions.
In a paper titled “Anticipating Visual Representations from Unlabeled Video,” the researchers said they applied recognition algorithms to the trained network’s predictions to forecast future actions. That prediction capability could be used in applications such as training robots to understand that a greeting in the form of a wave could lead to a handshake or an embrace.
“The capability for machines to anticipate future concepts before they begin is a key problem in computer vision that will enable many real-world applications,” the MIT researchers asserted. “We believe abundantly available unlabeled videos are an effective resource we can use to acquire knowledge about the world, which we can use to learn to anticipate the future.”
The key to their approach was that readily available unlabeled video could be used to train deep neural networks to predict visual representations. Those representations “are a promising prediction target because they encode images at a higher semantic level than pixels yet are automatic to compute,” the MIT team noted.
They then applied recognition algorithms to the predicted representations to anticipate future actions and objects. The researchers said they validated experimentally that actions could be predicted one second into the future while objects could be estimated about five seconds ahead.
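The two-stage pipeline described above — regress from the current frame’s representation to a predicted future representation, then run a recognizer on the prediction — can be sketched as follows. This is a minimal illustration only: the toy linear map stands in for the deep regression network, the feature vectors are random stand-ins for real frame representations, and all names and category labels (e.g. “handshake”, “hug”) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each video frame is encoded as a fixed-length visual
# representation (the approach predicts representations, not raw pixels).
DIM = 8

# Toy stand-in for the deep regression network: a linear map fit by least
# squares on (current-frame, future-frame) representation pairs harvested
# from unlabeled video.
current = rng.normal(size=(100, DIM))                  # representations at time t
future = current @ (rng.normal(size=(DIM, DIM)) * 0.5)  # representations ~1s later

W, *_ = np.linalg.lstsq(current, future, rcond=None)

def predict_future(rep):
    """Predict the visual representation about one second ahead."""
    return rep @ W

# Stand-in recognizer trained on a small labeled set: nearest class
# centroid in representation space.
centroids = {"handshake": rng.normal(size=DIM), "hug": rng.normal(size=DIM)}

def recognize(rep):
    return min(centroids, key=lambda c: np.linalg.norm(rep - centroids[c]))

# Anticipate the action: classify the *predicted* future representation,
# not the current frame.
anticipated = recognize(predict_future(current[0]))
print(anticipated)
```

The point of the sketch is the ordering: the regressor is trained on abundant unlabeled video, while only the final recognizer needs labeled examples of the target categories.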
The investigators also said their work builds on previous “big visual data” efforts, such as the millions of online images used to build object and scene recognition systems. They took this approach one step further by mining information from online video, with the goal of developing a training model for machine vision that can anticipate subtle semantic concepts.
Because the training data did not have to be labeled, ample online video from popular TV shows could be fed to the researchers’ proposed deep regression networks. A relatively small set of labeled examples from the desired task was then used to indicate specific categories of human interaction. At test time, the researchers applied standard recognition algorithms to the predicted representation in order to forecast a category.
“There’s a lot of subtlety to understanding and forecasting human interactions,” noted lead MIT researcher Carl Vondrick. “We hope to be able to work off of this example to be able to soon predict even more complex tasks.”
Outside experts said MIT’s approach is a good start, but practical applications are still years away. “This experiment represents a good scenario to apply machine learning because it uses a small set of inputs and possible outputs, so the investment in training is reasonable,” noted Marco Varone, founder and CTO of Expert System, a cognitive computing specialist.
“It is important to note that real-world scenarios are often more complex, as is the case in our sector of natural language understanding,” Varone added. “Here, the costs and resources required are significantly higher and so is the complexity, requiring skills not always accessible to organizations. So while we certainly welcome this research, the real use in the real world is still years away.”