The Real Deal About Synthetic Data

It’s often difficult to access the real-world data needed to train AI models or gain insights, but new techniques for generating look-alike data sets can help.

Image courtesy of Michael Morgenstern/

Data is the essential fuel driving organizations’ advanced analytics and machine learning initiatives, but between privacy concerns and process issues, it’s not always easy for researchers to get their hands on what they need. A promising new avenue to explore is synthetic data, which can be shared and used in ways real-world data can’t. However, this emerging approach isn’t without risks or drawbacks, and it’s essential that organizations carefully explore where and how they invest their resources.

What Is Synthetic Data?

Synthetic data is artificially generated by an AI algorithm that has been trained on a real data set. It has the same predictive power as the original data, but it replaces that data rather than disguising or modifying it. The goal is to reproduce the statistical properties and patterns of an existing data set by modeling its probability distribution and then sampling from that model. The algorithm essentially creates new data that has the same characteristics as the original data, so analyses run on it lead to the same answers. Crucially, however, it’s virtually impossible to reconstruct the original data (think personally identifiable information) from either the algorithm or the synthetic data it has created.
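To make the idea of modeling a distribution and sampling from it concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method used by any vendor or project mentioned in this article: a simple two-variable Gaussian model stands in for the AI algorithm, the "real" records (age and blood pressure) are invented toy data, and every function name is hypothetical. A model this simple also offers no privacy guarantee by itself.

```python
import math
import random
import statistics


def make_toy_real_data(n_rows=5000, seed=42):
    """Invented stand-in for a real data set: (age, systolic pressure) rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        age = rng.gauss(45.0, 12.0)
        # Assumption for the toy data: pressure rises with age, plus noise.
        systolic = 90.0 + 0.8 * age + rng.gauss(0.0, 8.0)
        rows.append((age, systolic))
    return rows


def pearson(rows):
    """Pearson correlation between the two columns of `rows`."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_model(rows):
    """'Train' step: estimate the parameters of a bivariate Gaussian."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    return {
        "mean_x": statistics.mean(xs),
        "mean_y": statistics.mean(ys),
        "sd_x": statistics.stdev(xs),
        "sd_y": statistics.stdev(ys),
        "corr": pearson(rows),
    }


def sample_synthetic(model, n_rows, seed=0):
    """'Generate' step: draw brand-new rows from the fitted model only."""
    rng = random.Random(seed)
    rho = model["corr"]
    rows = []
    for _ in range(n_rows):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        x = model["mean_x"] + model["sd_x"] * z1
        # Mix z1 and z2 so the second column has the fitted correlation.
        y = model["mean_y"] + model["sd_y"] * (
            rho * z1 + math.sqrt(1.0 - rho ** 2) * z2
        )
        rows.append((x, y))
    return rows


real = make_toy_real_data()
model = fit_model(real)
synthetic = sample_synthetic(model, n_rows=len(real))

real_corr = pearson(real)
synthetic_corr = pearson(synthetic)
print("real correlation:     ", round(real_corr, 3))
print("synthetic correlation:", round(synthetic_corr, 3))
```

The generator sees only the five fitted parameters, never the individual rows, yet the age and pressure relationship in the synthetic rows closely tracks the one in the toy data. Production systems replace the Gaussian with far richer models and add explicit privacy checks, but the fit-then-sample pattern is the same.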

Synthetic data is a boon for researchers. Consider what the National Institutes of Health in the U.S. is doing with Syntegra, an IT services startup. Syntegra is using its synthetic data engine to generate and validate a nonidentifiable replica of the NIH’s database of COVID-19 patient records, which comprises more than 2.7 million screened individuals and more than 413,000 COVID-19-positive patients. The synthetic data set, which precisely duplicates the original data set’s statistical properties but with no links to the original information, can be shared and used by researchers across the globe to learn more about the disease and accelerate progress in treatments and vaccines.

The technology has potential across a range of industries. In financial services, where restrictions around data usage and customer privacy are particularly limiting, companies are starting to use synthetic data to help them identify and eliminate bias in how they treat customers — without contravening data privacy regulations. And retailers are seeing the potential for new revenue streams derived from selling synthetic data on customers’ purchasing behavior without revealing personal information.

