June 17, 2019

Five Reasons Synthetic Data Is the Electrolyte to Speed Up Your AI Initiatives

Yashar Behzadi


When taking on an artificial intelligence (AI) project, there is one ingredient above all that is essential for success: clean, well-organized, relevant data.

In theory, data is everywhere (2.5 quintillion bytes of data produced a day to be exact), and that amount is accelerating rapidly with the proliferation of sensors from the Internet of Things (IoT) and machine-generated big data. In fact, even though it may be hard to believe, 90% of the world’s data was created in just the past two years. This includes every tap and swipe we make on our mobile devices, every video we stream, every tweet we read and Instagram post we like or share.

IoT data is also growing at an exponential rate similar to that of mobile and social data. While this data may technically have existed before, we now have the ability to capture it thanks to the shrinking cost of storage and sensor instrumentation. This IoT data represents an enormous untapped opportunity, but the costs associated with recording, storing, organizing and preparing it can be immense and still need to be justified.

Recent studies show 12.5% of staff time is lost in data collection. That’s five hours in every 40-hour work week spent just collecting the data necessary to begin executing AI and ML initiatives.

Once the data is collected, teams then need to correctly label and categorize it to feed and train an algorithm. So what’s the problem? To create reliable and accurate training data, you need to collect and label tens of thousands and sometimes millions of data assets. Assuming that labeling a single comment takes a worker 30 seconds, labeling a dataset of 90,000 comments would take 750 hours, or almost 94 work shifts of 8 hours each.
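The labeling arithmetic can be checked with a short calculation. This is a minimal sketch: the function name `labeling_effort` is hypothetical, and the 90,000-item count is simply the dataset size at which 30 seconds per item adds up to the 750 hours quoted above.

```python
def labeling_effort(num_items, seconds_per_item=30, shift_hours=8):
    """Return (total_hours, shifts) needed to label num_items by hand."""
    total_hours = num_items * seconds_per_item / 3600  # seconds -> hours
    return total_hours, total_hours / shift_hours


hours, shifts = labeling_effort(90_000)
print(f"{hours:.0f} hours, {shifts:.2f} eight-hour shifts")
# prints "750 hours, 93.75 eight-hour shifts"
```

Changing `num_items` to a million-item dataset shows how quickly manual labeling stops being practical for a single worker.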

Privacy concerns are another major issue when considering data sources. It is true that much of the new data is siloed in privately held platforms, but even within these massive platform companies (and they are truly massive – over a quarter of the humans on the planet are active on Facebook) there are privacy concerns and rights to consider (as can be seen in the new privacy focus of Facebook unveiled at their latest F8 Conference).

In terms of sensor data, the ethical implications of surveillance are clear: the right to privacy must be respected, and people should be informed if their actions are being observed and be able to opt out and control the usage of that data.

So, as an executive considering the potential for using artificial intelligence to solve real business problems, how do you know which data sets are relevant, available and usable, and how much time, effort and cost will be required to acquire, prepare and organize that data so that it can be used effectively?


One technique that is commonly employed is to embark on a long and involved process of inventorying all of the data at a company’s disposal. While this may be worthwhile, this still likely only represents a small fraction of the data that a company could collect. But data collection is usually an expensive proposition, and collecting the necessary amounts needed to train highly accurate and robust AI systems can take months or even years.

Finally, the executive must consider that once the data has been collected and prepared, there is no guarantee that an effective combination of algorithms can be employed and appropriately modified in order to achieve the desired result in real world conditions with the required accuracy.

So, what is the right way to proceed? I’d like to propose a dramatically more time- and cost-effective technique for prototyping and developing new AI computer vision applications: synthetic data.

Synthetic data takes the form of digitally created images, video, and 3D environments that can be used to train deep learning models. It combines techniques from the movie and gaming industries, such as CGI, animation and computer simulation, with generative neural networks such as GANs and VAEs to create perfectly labeled, realistic datasets and simulated environments at scale. Using this technology, additional images can be added to the synthetic dataset at virtually no incremental cost. Labeling is likewise provided by design, with no need for costly and time-consuming human labeling. And since all the attributes of the image are known to pixel-perfect precision, key labels such as depth, 3D position and partially obstructed objects are all perfectly accurate.
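To make "labeling by design" concrete, here is a toy sketch in plain Python. It is not how a production pipeline works (those rely on CGI renderers and generative models); the functions `make_sample` and `make_dataset` are hypothetical names, and a single filled rectangle stands in for a rendered object. The point is only that the generator already knows where it drew the object, so the mask and bounding box come for free.

```python
import random


def make_sample(width=16, height=16, rng=None):
    """Draw one rectangle on a blank grid; return (image, mask, bbox)."""
    if rng is None:
        rng = random.Random()
    # The parameters chosen here ARE the ground truth: no human labels them.
    w = rng.randint(2, width // 2)
    h = rng.randint(2, height // 2)
    x0 = rng.randint(0, width - w)
    y0 = rng.randint(0, height - h)
    intensity = rng.randint(128, 255)

    image = [[0] * width for _ in range(height)]
    mask = [[0] * width for _ in range(height)]
    for y in range(y0, y0 + h):
        for x in range(x0, x0 + w):
            image[y][x] = intensity  # what a model would see
            mask[y][x] = 1           # pixel-exact segmentation label
    bbox = (x0, y0, x0 + w - 1, y0 + h - 1)  # inclusive pixel coordinates
    return image, mask, bbox


def make_dataset(n, seed=0):
    """Generate n labeled samples; each extra sample costs only compute."""
    rng = random.Random(seed)
    return [make_sample(rng=rng) for _ in range(n)]


dataset = make_dataset(1000)
image, mask, bbox = dataset[0]
print("first bounding box:", bbox)
```

Every sample carries a mask that matches the image pixel for pixel, which is the property the paragraph above describes for depth, position and occlusion labels in real synthetic datasets.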

Here are five ways that synthetic data can speed up your AI initiatives:

  1. Make a go/no-go decision before even starting data collection: Testing algorithms with synthetic data allows developers to produce proofs-of-concept to justify the time and expense of AI initiatives. They can show that a given combination of algorithms can in principle be modified to achieve the desired results, providing crucial assurance that costs incurred in a full development cycle will not be wasted and giving you the confidence you need to move forward.
  2. Make sure you are collecting the right data: Proposed data to be collected for an AI initiative can be simulated using synthetic data. When conceiving of new AI initiatives, many companies think they have no option but to put the cart before the horse in terms of data collection, and do so blindly, hoping that the data will have some value down the road. Using synthetic data, your business can rapidly develop large-scale, perfectly labeled data sets in line with your requirements for testing purposes. Furthermore, this data can then be modified and improved through iterative testing to give you the highest likelihood of success in your subsequent data collection operation.
  3. Efficiently optimize sensor types and locations: Using synthetic data, you can simulate sensor attributes and positioning to understand the relative value of the number, type and location of cameras or other sensors in a wide variety of settings, including manufacturing, logistics and retail. This avoids a prolonged process of building representative hardware, acquiring data under various configurations, labeling the images and building various models.
  4. Correct for problems with your existing data: Synthetic data can be used to correct for shortcomings in your existing data. One common problem is overfitting, which can occur when one label is heavily overrepresented in your training data and which creates unreliable results in real-world usage. Bias is another pervasive problem, stemming from collected data that does not adequately represent the full range of variation that can occur in reality. Synthetically generated datasets provide a reliable and cost-effective way to correct for these issues and ensure a well-balanced dataset.
  5. Improve your data beyond what would be possible with external collection: Synthetic data can be used for reliable generation of edge cases, which would be extremely difficult or impossible to capture in the wild. Examples include rare weather events, equipment malfunctions, workplace and vehicle accidents and rare disease symptoms. Given that the probability of these events occurring can in some cases be as low as one in 10 million, synthetic data can represent the only way to ensure that your AI system is trained for every eventuality and will perform well precisely when you need it the most.
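The edge-case argument in item 5 can be put in rough numbers. The sketch below assumes, for illustration only, that each collected sample is an independent draw in which the rare event appears with probability one in 10 million; the function names are hypothetical.

```python
import math


def prob_at_least_one(p, n):
    """Chance of seeing the event at least once in n independent samples."""
    # 1 - (1 - p)**n, written with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(n * math.log1p(-p))


def samples_needed(p, confidence=0.95):
    """Smallest n whose chance of containing the event reaches confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p))


p = 1e-7  # one in 10 million, the figure used in item 5
print(f"chance in 1 million samples: {prob_at_least_one(p, 1_000_000):.1%}")
print(f"samples for 95% confidence of one occurrence: {samples_needed(p):,}")
```

Under these assumptions, a million real-world samples would most likely contain no example of the event at all, and it takes on the order of 30 million samples to be reasonably sure of capturing even one. A synthetic generator can instead be asked to produce the event directly.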

While it is not a one-size-fits-all solution or a fire-and-forget technology, we have found that synthetic data has the potential to dramatically improve the economics and chances for success in AI transformation initiatives. So, rather than let data be the barrier to entry that prevents your company from embarking on the important process of AI transformation, or the bottleneck that slows down implementation, consider whether synthetic data could allow you to prototype, test and iterate potential AI applications far more quickly, cheaply and accurately.

About the author: Yashar Behzadi is the CEO of Neuromation, a San Francisco-based AI technology company that’s pioneering the use of synthetic data and generative models to build more capable real-world AI models. Yashar is an experienced entrepreneur who has built transformative businesses in the AI, medical technology, and IoT space. He comes to Neuromation after spending the last 12 years in Silicon Valley building and scaling data-centric technology companies. His work at Proteus Digital Health was recognized by Wired as one of the top 10 technological breakthroughs of 2008 and as a Technology Pioneer by the World Economic Forum. He has been recognized in Wired, Entrepreneur, WSJ, CNET, and numerous other leading tech journals for his contributions to the industry. With 30 patents and patents-pending and a PhD in Bioengineering from UCSD, he is a proven technologist.
