August 18, 2022

The Key to Computer Vision-Driven AI Is a Robust Data Infrastructure

Gil Elbaz


For infrastructure, the sign of true greatness is to go unnoticed. The better it is, the less we think about it. Mobile infrastructure, for example, only ever crosses our minds when we find ourselves struggling to understand someone on the other end of a bad connection, or without service altogether. When driving on a pristine, recently paved highway, we give little thought to the road as it passes silently beneath our wheels. A poorly maintained highway, on the other hand, reminds us of its existence with every pothole, divot, and rough patch we encounter.

Infrastructure only demands our attention when missing, inadequate, or broken. And in the field of computer vision, infrastructure – or rather, what’s missing from it – is currently on the minds of many.

Compute Sets the Standard for Infrastructure

Underpinning every AI/ML project, including computer vision, are three fundamental pillars of development — data, algorithms/models, and compute. Of these three pillars, compute is by far the one with the most robust and deeply entrenched infrastructure. With decades of dedicated enterprise investment and development behind it, cloud computing has become something of a gold standard for IT infrastructure throughout the entire enterprise IT environment – with computer vision being no exception.

Under the infrastructure-as-a-service model, developers have enjoyed on-demand, pay-as-you-go access to an ever-widening pipeline of computing power for nearly 20 years. And in that time, it has radically transformed enterprise IT across the board, through dramatically increased agility, cost-efficiency, scalability, and more. Taken in tandem with the advent of purpose-built, machine-learning GPUs, it’s safe to say that this part of the computer vision infrastructure stack is alive and well. And if we want to see computer vision and AI reach their full potential, we would be wise to use compute as the model on which the rest of CV’s infrastructure stack is based.

Data management has emerged as the number one bottleneck for AI development (3dkombinat/Shutterstock)

Model-Driven Development’s Lineage and Limitations

Until recently, algorithms and model development were the driving force behind computer vision and AI’s development. In both research and commercial development, teams toiled for years testing, tinkering, and incrementally improving AI/ML models and sharing their advancements in open-source communities such as Kaggle. By concentrating its collective efforts on algorithm development and modeling, the fields of computer vision and AI progressed mightily over the first two decades of the new millennium.

In more recent years, however, that progress has slowed as model-centric optimization runs up against the law of diminishing returns. What’s more, a model-centric approach has several limitations. For example, you cannot use the same data to train and then retrain your models. It also requires more manual labor for data cleaning, model validation, and training, which can take precious time and resources away from more innovative, revenue-driving tasks.

Today, through communities like Hugging Face, CV teams have free and open access to a multitude of large, sophisticated algorithms, models, and architectures, each supporting a different core CV capability — from object identification and facial landmark recognition, to pose estimation and feature matching. These assets have become as close to “off-the-shelf” solutions as one could imagine — ready-made tabulae rasae for computer vision and AI teams to train for any number of specialized tasks and use cases.

In the same way that a fundamental human capability like hand-eye coordination can be applied to and trained for a wide variety of different skills – from playing ping pong to pitching a baseball – so too can these modern ML algorithms be trained to perform a range of specific applications. However, while humans become specialized through years of practice and perspiration, machines do so through data training.

Data-Centric AI & The Great Data Bottleneck

This has led many of AI’s foremost minds to call for a new era of deep learning development — an era in which the primary engine of progress is data. It’s been just a few short years since Andrew Ng and others declared data-centricity as the way forward for AI development, and in that brief period the industry has erupted with activity and growth. A plethora of novel commercial applications and use cases for computer vision have emerged, spanning a wide range of industries: from robotics and AR/VR, to automotive manufacturing and home security.

Recently, we ran a study on hands-on-wheel detection in a car using the data-centric approach. Our experiment showed that by using this approach and synthetic data, we were able to identify and generate a specific edge case that was lacking in the training dataset.

Datagen generated synthetic images for its hands-on-wheel test (Image courtesy Datagen)

Though the computer vision industry is abuzz about data, not all of that buzz is unbridled enthusiasm. Though the field has identified data as the path forward, there are plenty of obstacles and pitfalls along the way, many of which are already causing CV teams to stumble. A recent survey of US-based computer vision professionals revealed a field plagued by lengthy project delays, non-standardized processes, and a scarcity of resources – all stemming from data. In the same survey, 99% of respondents reported having had at least one CV project canceled indefinitely due to insufficient training data.

And even the lucky 1% that had thus far avoided project cancellation couldn’t escape project delays. In the survey, every single respondent reported experiencing significant project delays as a result of inadequate or insufficient training data, with 80% reporting delays lasting 3 months or more. Ultimately, infrastructure’s purpose is one of utility – to facilitate, expedite, or convey. And in a world where serious delays are simply a part of doing business, it’s clear that some essential infrastructure is missing.

Traditional Training Data Defies Infrastructurization

However, unlike compute and algorithms, the third pillar of AI/ML development doesn’t readily lend itself to infrastructurization – especially in the field of computer vision, where data is large, messy, and time- and resource-intensive to collect and curate. While there are many databases of labeled, visual training data freely available online — such as the now famous ImageNet database – they’ve proven inadequate on their own as a source for training data in commercial CV development.

That’s because – unlike models, which are generalized by design – training data is, by its very nature, application specific. Data is what differentiates one application of a given model from another, and therefore must be unique to not only a specific task, but also the environment or context in which that task is to be performed. And unlike computing power, which can be generated and accessed at literally the speed of light, traditional visual data must be created or collected by humans (by either snapping photos in the field or combing the internet for suitable images), and then painstakingly cleaned and labeled by humans (a process prone to human error, inconsistency, and bias).

This raises the question: “How can we make visual data that’s both application specific and easily commodifiable (i.e., fast, inexpensive, and versatile)?” Although these two qualities seem directly at odds with each other, a potential solution has already emerged; and it’s showing great promise as a way of reconciling these two essential, yet seemingly incompatible, qualities.

Computer vision (CV) is one of the leading fields in modern AI

Synthetic Data & The Path to a Complete CV Stack

The only way to make visual training data that is both application specific and time- and resource-efficient at scale, is through the use of synthetic data. For those unfamiliar with the concept, synthetic data is artificially generated information meant to faithfully represent some real-world equivalent. In the case of visual synthetic data, that means photo-realistic, computer-generated 3D imagery (CGI) in the form of static images or video.

In response to the many issues to emerge from the dawn of data-centricity, a burgeoning industry has begun to take shape around synthetic data generation — a growing ecosystem of small to medium-sized startups offering a variety of solutions that leverage synthetic data to address the litany of pain points outlined above.

The most promising of these solutions use AI/ML algorithms to generate life-like, 3D imagery with associated ground truth (i.e., metadata) automatically generated for each data point. As a result, synthetic data eliminates the typically months-long process of hand-labeling and annotation – while also removing the possibility for human error and bias.
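To make the idea concrete, here is a toy sketch (illustrative only, not any vendor’s actual pipeline) of why synthetic generation sidesteps hand-labeling: because the generator places every object itself, the ground-truth class and bounding box for each object are known at the moment the pixels are created.

```python
import random

def generate_synthetic_sample(width=64, height=64, n_objects=2, seed=None):
    """Generate a toy 'image' (a 2D grid of 0/1 pixels) plus auto-labels.

    Because the generator places every object itself, the ground truth
    (class name and bounding box) is emitted alongside each sample --
    no human labeling or annotation pass is needed.
    """
    rng = random.Random(seed)
    image = [[0] * width for _ in range(height)]
    labels = []
    for _ in range(n_objects):
        w, h = rng.randint(5, 15), rng.randint(5, 15)
        x, y = rng.randint(0, width - w), rng.randint(0, height - h)
        cls = rng.choice(["car", "pedestrian"])  # hypothetical class names
        for row in range(y, y + h):
            for col in range(x, x + w):
                image[row][col] = 1  # "render" the object
        # The label is produced at the same instant as the pixels.
        labels.append({"class": cls, "bbox": (x, y, w, h)})
    return image, labels

image, labels = generate_synthetic_sample(seed=42)
```

A real platform renders photo-realistic 3D scenes rather than binary grids, but the principle is the same: the metadata falls out of the generation process for free, with no opportunity for labeler error or bias.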

In our paper (presented at NeurIPS 2021), Using Synthetic Data to Uncover Population Biases in Facial Landmarks Detection, we found that to analyze a trained model’s performance and identify its weak spots, one must set aside a portion of the data for testing. The test set has to be large enough to detect statistically significant biases with respect to all the relevant sub-groups in the target population. This requirement may be difficult to satisfy, especially in data-hungry applications. We proposed to overcome this difficulty by generating a synthetic test set. We used the facial landmark detection task to validate our proposal, showing that all the biases observed on real datasets are also seen on a carefully designed synthetic dataset. This shows that synthetic test sets can efficiently detect a model’s weak spots and overcome the limitations of real test sets in terms of quantity and/or diversity.

Today, startups are offering fully-fledged, self-service synthetic data generation platforms to enterprise CV teams to mitigate bias and allow for scaling data acquisition. These platforms allow enterprise CV teams to generate use-case-specific training data on a metered, on-demand basis — bridging the gap between specificity and scale that’s made traditional data ill-suited for infrastructurization.

A New Hope for Computer Vision’s So-Called “Data Janitors”

This is undeniably an exciting time for the field of computer vision. But, like any other field in flux, this is also a time rife with challenges. Exceptional talent and brilliant minds have flocked to the field brimming with ideas and enthusiasm, only to find themselves stymied by the absence of an adequate data pipeline. The field is so mired in inefficiency that today’s data scientists have been described as “data janitors,” first by Steve Lohr all the way back in 2014, and perpetuated ever since by the stubborn persistence of these inefficient processes.

For a field in which a third of organizations already struggle with a skills gap, we can’t afford to squander precious human resources. Synthetic data opens the door to the possibility of a true training data infrastructure – one which, someday, might require as little thought as turning on the faucet for a glass of water, or provisioning compute, for that matter. For the data janitors of the world, that would certainly be a welcome refreshment.

About the author: Gil Elbaz is Datagen’s CTO and Co-founder, based in Tel Aviv. He received his B.Sc. and M.Sc. from the Technion. Gil’s thesis research focused on 3D computer vision and was published at CVPR, the top computer vision research conference in the world.

Related Items:

Data Sourcing Still a Major Bottleneck for AI, Appen Says

Only 12% of AI Users Are Maximizing It, Accenture Says

Computer Vision Platform Datagen Raises $50M Series B
