May 13, 2022

MIT Advances Unsupervised Computer Vision with ‘STEGO’

Training machine learning models often means working with labeled data. For computer vision tasks, this might mean, for instance, an hour of camera footage from a car, meticulously sectioned by humans into roads, road signs, vehicles, pedestrians, and so forth. Labeling even that small amount of data can take a human hundreds of hours, bottlenecking the training process. Now, researchers from MIT's Computer Science & Artificial Intelligence Laboratory (CSAIL) are introducing a new, state-of-the-art algorithm for unsupervised computer vision that operates without any human labels.

The model is called STEGO, short for "Self-supervised Transformer with Energy-based Graph Optimization." STEGO performs semantic segmentation, the process of labeling every pixel in an image with a class. Historically, semantic segmentation has been easiest for discrete objects like people or vehicles, and harder for more amorphous, blended elements of the environment like clouds or bushes, or for irregular structures like cancers.
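Concretely, a semantic segmentation is just a grid of class IDs aligned with the image, one per pixel. A toy sketch of that data structure (the class names and values here are illustrative, not taken from the paper):

```python
import numpy as np

# A semantic segmentation assigns one class ID to every pixel.
# Toy 4x6 "image" with three illustrative classes.
CLASSES = {0: "road", 1: "vehicle", 2: "sky"}  # hypothetical label set

segmentation = np.array([
    [2, 2, 2, 2, 2, 2],
    [2, 2, 1, 1, 2, 2],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
])

# Per-class pixel counts, e.g. to measure how much of the scene each class covers.
for class_id, name in CLASSES.items():
    print(f"{name}: {(segmentation == class_id).sum()} pixels")
```

A supervised model learns these per-pixel labels from human annotations; STEGO's contribution is producing such maps with no labels at all.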

“If you’re looking at oncological scans, the surface of planets, or high-resolution biological images, it’s hard to know what objects to look for without expert knowledge. In emerging domains, sometimes even human experts don’t know what the right objects should be,” explained Mark Hamilton, a research affiliate of MIT CSAIL, software engineer at Microsoft, and lead author of the paper describing STEGO, in an interview with MIT’s Rachel Gordon. “In these types of situations where you want to design a method to operate at the boundaries of science, you can’t rely on humans to figure it out before machines do.”

STEGO is built on top of the DINO algorithm, which was itself trained on 14 million images. The researchers tested STEGO on a variety of benchmarks, including the highly diverse COCO-Stuff image dataset. They reported that STEGO doubled the performance of prior unsupervised computer vision models on the COCO-Stuff benchmark, and that it performed similarly well on other benchmarks, including driverless-car and space-imagery datasets.
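STEGO's actual training objective distills feature correspondences from DINO with a learned segmentation head, but the intuition behind building on a self-supervised backbone can be sketched more simply: DINO's patch features already group semantically similar regions, so even naive clustering of them yields a rough, label-free segmentation. A minimal sketch of that idea (the input filename and cluster count are arbitrary assumptions; this is not STEGO's method):

```python
# Minimal sketch: unsupervised segmentation by clustering DINO patch
# features. This is NOT STEGO's algorithm, only the intuition behind
# starting from a self-supervised backbone.
import torch
from PIL import Image
from torchvision import transforms
from sklearn.cluster import KMeans

# Load the self-supervised DINO ViT-S/16 backbone via torch.hub.
model = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])
img = preprocess(Image.open("scene.jpg").convert("RGB")).unsqueeze(0)  # hypothetical input image

with torch.no_grad():
    # Last-layer tokens have shape (1, 1 + 196, 384) for a 224x224 input;
    # drop the leading CLS token to keep only the 14x14 grid of patch features.
    feats = model.get_intermediate_layers(img, n=1)[0][:, 1:, :]

k = 5  # number of segments: a free choice here, not discovered by the model
labels = KMeans(n_clusters=k, n_init=10).fit_predict(feats[0].numpy())
print(labels.reshape(14, 14))  # coarse per-patch pseudo-semantic map
```

STEGO improves on this kind of naive clustering by training a lightweight head that sharpens and propagates the backbone's feature correspondences, which is where the reported benchmark gains come from.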

“In making a general tool for understanding potentially complicated datasets, we hope that this type of an algorithm can automate the scientific process of object discovery from images,” Hamilton said. “There’s a lot of different domains where human labeling would be prohibitively expensive, or humans simply don’t even know the specific structure, like in certain biological and astrophysical domains. We hope that future work enables application to a very broad scope of datasets. Since you don’t need any human labels, we can now start to apply ML tools more broadly.”
