Synthetic Data: Sometimes Better Than the Real Thing
Having a large stockpile of data is still a prerequisite for advanced analytics and AI. But companies building AI models increasingly are finding that artificially created data can be just as good as the real thing. And in some cases, synthetic data is a superior alternative, specifically when it comes to issues of bias and ethics.
Yashar Behzadi is a big fan of fake data. As the CEO and founder of synthetic data company SynthesisAI, Behzadi knows firsthand how difficult it can be to collect real data to cover all of the situations that AI experts want to train their systems to handle, not to mention the time and expense of getting humans to manually label the images.
“It’s fundamentally limiting,” Behzadi says of real data. “It’s a bottleneck for how fast innovation can occur in this space.”
SynthesisAI generates three-dimensional image data on behalf of its customers, who use it to train computer vision models used in smart phones, teleconferencing systems, and smart assistants. The company uses some of the latest simulation technology from Hollywood and the gaming industry–like Epic Games’ Unreal Engine physics simulator and SideFX’s Houdini 3D animation software–to create huge caches of photo-realistic images that its customers can use to train AI models.
“Essentially we create 3D hi-fidelity simulations of things that you’re interested in, using techniques from CGI [computer-generated imagery], digital effects, and the gaming industry coupled with emerging generative neural networks to create these very immersive, simulated environments,” Behzadi tells Datanami.
“That allows you to virtually train your systems,” he continues. “And because you know everything about every pixel in that scene, you can extract lots more information. So it completely changes the paradigm and allows you to iterate very quickly, very cost effectively–orders of magnitude cheaper and faster to build your systems.”
SynthesisAI has been in business less than two years, but it has already landed three of the world’s four largest handset manufacturers as clients. Dialing in the facial recognition system that’s used to unlock the smart phones is a tough task in the best situations. But these manufacturers need assurances that the system works well no matter what may crop up, including poor lighting conditions, dirt on the lens, mask-wearing due to COVID-19, and dealing with a variety of skin tones.
Add in the demands of an 18-month hardware cycle with constantly improving image sensors, and you can quickly see how difficult it can be to keep the facial recognition system working in a robust manner. Relying on millions of pictures of fake but realistic-looking and diverse people can eliminate the need to physically capture those images, while also sidestepping thorny privacy and bias problems.
“Synthetic data is a great way of approaching both of those problems,” Behzadi says. “It’s inherently private, because you’re generating the data. There are no real people involved. And from the onset, you can establish the distribution of your data to be balanced, so you’re not going to have issues with underrepresentation of a certain ethnic class or gender or age or BMI [body mass index] or whatever the dimension is that you want to be robust against. You can normalize that, just by definition.”
SynthesisAI provides images to customers via an API. Customers can specify exactly what they’re looking for across several dimensions, including the person (age, gender, ethnicity, BMI), facial attributes (beard/mustache, makeup, gaze, expression), accessories (glasses, hats, masks), and camera (different optics, dust on the lens, etc.). Out pop a thousand or a million images with those exact characteristics.
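To make the idea of a balanced request concrete, here is a minimal Python sketch. It is not SynthesisAI's API: the attribute names, the values, and the `balanced_requests` and `totals_by` helpers are all hypothetical, chosen only to mirror the kinds of dimensions described above. It simply asks for the same number of images for every combination of attributes, which is one straightforward way to get a dataset that is balanced by construction.

```python
from collections import Counter
from itertools import product

# Hypothetical attribute names and values, loosely modeled on the dimensions
# named in the article. This is NOT SynthesisAI's actual API schema.
ATTRIBUTES = {
    "age_group": ["18-30", "31-50", "51+"],
    "gender": ["female", "male"],
    "accessory": ["none", "glasses", "mask"],
    "lens": ["clean", "dusty"],
}


def balanced_requests(attributes, images_per_combination):
    """Build one request spec per attribute combination.

    Every combination asks for the same number of images, so each value
    of each attribute ends up equally represented in the total.
    """
    names = list(attributes)
    specs = []
    for values in product(*(attributes[name] for name in names)):
        spec = dict(zip(names, values))
        spec["count"] = images_per_combination
        specs.append(spec)
    return specs


def totals_by(specs, name):
    """Sum the requested image counts for each value of one attribute."""
    counts = Counter()
    for spec in specs:
        counts[spec[name]] += spec["count"]
    return dict(counts)


if __name__ == "__main__":
    specs = balanced_requests(ATTRIBUTES, images_per_combination=10)
    print(len(specs), "request specs")
    print("images per gender:", totals_by(specs, "gender"))
    print("images per age group:", totals_by(specs, "age_group"))
```

A full grid like this grows quickly as dimensions are added, so a real workflow would more likely sample combinations than enumerate all of them; the sketch only shows why balance can hold "by definition" when the data is generated to order.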
In some situations, synthetic data can be used in combination with real data, to bolster the amount of data available for training. But in other cases, 100% synthetic data can be used. In either case, the goal is providing a greater amount of data on which to train AI models.
“The realization now is that 90% of data doesn’t give me any value, because it’s the same data I’ve seen before,” Behzadi says. “So what you really want is data that imparts new information. So now customers are moving to a space and saying ‘I don’t want to pay for wrangling all the data. I only want to pay for the data that helps my model improve.’”
Forrester sees synthetic data as a key ingredient in the emergence of AI 2.0, along with transformer networks, reinforcement learning, federated learning, and causal inference. In addition to helping with computer vision tasks, synthetic data will be widely used to train autonomous vehicles, Forrester says, and it will also be used to provide robust training data that’s free of privacy and ethical concerns in financial services, insurance, and the pharmaceutical industry.
Forrester analyst Kjell Carlsson notes there are several meanings of synthetic data. “There are some ways in which synthetic data is a well-known quantity for some folks in the data engineering and information architecture realms, in that we were generating synthetic data to test systems, you know, for years,” he tells Datanami.
“But this is a very different kind of synthetic data creation, because it’s synthetic data which often is generated by models, not always machine learning models or signal processing models. It’s generated by gaming engines and the like. And there’s likely going to be a lot of folks who look at synthetic data and say, ‘Oh yeah, you know, this is the same thing as the synthetic data we were doing before.’ When in actuality, no, probably a lot of it won’t really transfer over.”
Working with synthetic data will often fall to the same data engineers who have been tasked with ensuring a ready supply of real data for analytics and AI use cases, Carlsson says. “The kinds of synthetic data you need to create and that process of creating it will be in a much more iterative fashion that [fits] together with the AI models that people are building on top of this, and the use cases that people are developing,” he says. “So it’s very much a data engineer who is…joined at the hip with the product manager, data scientist, and developer” who is involved with this.
SynthesisAI is starting with creating synthetic data for computer vision use cases because there are great CGI tools, but synthetic data use cases can extend out to time series and NLP data as well, Behzadi says. In three to five years, this will become much more common, he predicts.
“We’re kind of at the Wright Brothers stage of AI development, where you build a plane, jump off a dune and see what it does, then you change your design,” he says. “With these systems, as they get better and better, to be able to create these photo-realistic, high-fidelity, diverse kinds of scenes, you’re going to be able to do a lot of that system design, a lot of that training virtually.”