Synthetic Data Market Gets Real
A growing list of data privacy regulations, along with demand for better training data, is spawning new AI-based approaches to managing “personally identifiable” information, including “synthetic” data sets that remove personal information covered by current and pending privacy rules.
Diveplane, an AI startup based in Raleigh, NC, said its synthetic data platform dubbed Geminai generates a twin data set that “acts and feels realistic” for the purposes of data modeling while stripping out personal information. The synthetic data tool targets business and government users seeking to analyze and share data sets while complying with a growing list of data privacy rules.
For instance, the California Consumer Privacy Act (CCPA), scheduled to take effect on January 1, 2020, limits the dissemination of “personally identifiable information.” The EU’s General Data Protection Regulation (GDPR) contains similar restrictions.
Diveplane said its synthetic data tool can address the widening gap in training data brought about by privacy restrictions.
“Many businesses are forced to use inaccurate or incomplete data to train their AI due to privacy requirements, which can lead to the AI making poor or misleading decisions,” said Diveplane CEO Michael Capps. The Geminai tool creates a synthetic “twin” dataset that can be verified by users as they train AI models.
For example, proponents of the synthetic data approach note it can be used to test algorithms, allowing developers to build prototypes that can help justify risky AI initiatives. In another scenario, synthetic data can be used to develop large, labeled data sets customized for a specific project.
Diveplane claims its approach goes beyond simply masking slices of private information such as names and social security numbers. Instead, it addresses what the startup calls the balance between privacy and accurate data used for model training.
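The distinction between masking and synthesis can be made concrete with a toy sketch. The example below is purely illustrative and is not Diveplane’s proprietary method: masking blanks out direct identifiers but leaves real records intact, while a naive synthesizer replaces every record with draws from distributions fitted to the original columns, so aggregate statistics survive but no row corresponds to a real person.

```python
import random
import statistics

# Toy records with direct identifiers (name, SSN) and analytic fields.
records = [
    {"name": "Alice", "ssn": "123-45-6789", "age": 34, "income": 72000},
    {"name": "Bob",   "ssn": "987-65-4321", "age": 41, "income": 65000},
    {"name": "Carol", "ssn": "555-44-3333", "age": 29, "income": 58000},
]

def mask(records):
    """Simple masking: blank out identifiers, keep the real values.
    The remaining fields can still re-identify someone in small data sets."""
    return [{**r, "name": "***", "ssn": "***"} for r in records]

def synthesize(records, n, seed=0):
    """Naive synthesis: sample each numeric column from a normal
    distribution fitted to the originals. Aggregates are roughly
    preserved; no output row belongs to a real individual."""
    rng = random.Random(seed)
    ages = [r["age"] for r in records]
    incomes = [r["income"] for r in records]
    return [
        {
            "age": round(rng.gauss(statistics.mean(ages), statistics.stdev(ages))),
            "income": round(rng.gauss(statistics.mean(incomes), statistics.stdev(incomes))),
        }
        for _ in range(n)
    ]

masked = mask(records)
synthetic = synthesize(records, n=1000)
```

A real platform would model correlations between columns rather than sampling each independently; the sketch only shows why synthesis, unlike masking, severs the link between output rows and real people.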
Other startups offer data discovery approaches to assist with regulatory compliance, including tools that use machine learning algorithms to help track down and manage “personally identifiable information.”
Along with compliance with U.S. and EU privacy regulations, Diveplane is also targeting medical research, including the ability to anonymize patient records so investigators can use those data sets without violating the Health Insurance Portability and Accountability Act, or HIPAA.
Other applications include generating granular data for training neural networks, thereby improving AI functionality, as well as “de-identifying” data sets to allow more data sharing.
Ultimately, the startup’s goal is enabling “understandable AI” that is “trainable, interpretable and auditable.”
Other synthetic data proponents note the emerging approach can be used to produce large volumes of labeled data faster and cheaper than manual labeling. Another use case is generating unique training data that would otherwise be difficult to capture “in the wild.”
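The labeling advantage comes from the generator knowing the ground truth for every example it emits. A minimal sketch, using a hypothetical two-class geometry task rather than any vendor’s pipeline: each point is sampled at random, and its label falls out of the generating rule itself, with no human annotation step.

```python
import random

def generate_labeled(n, seed=0):
    """Generate n labeled 2-D points. The label ("inside"/"outside"
    the unit circle) is known exactly because we wrote the rule that
    produces it -- labels are free, unlike manual annotation."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        label = "inside" if x * x + y * y <= 1.0 else "outside"
        data.append({"x": x, "y": y, "label": label})
    return data

training_set = generate_labeled(1000)
```

The same idea scales to harder domains (rendered scenes, simulated sensor traces), where the simulator’s internal state supplies labels that would be slow or impossible to capture “in the wild.”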