Magazine Winter 2022 Issue

The Real Deal About Synthetic Data

It’s often difficult to access the real-world data needed to train AI models or gain insights, but new techniques for generating look-alike data sets can help.

Fernando Lucini October 20, 2021 Reading Time: 9 min

Image courtesy of Michael Morgenstern/theispot.com

Data is the essential fuel driving organizations’ advanced analytics and machine learning initiatives, but between privacy concerns and process issues, it’s not always easy for researchers to get their hands on what they need. A promising new avenue to explore is synthetic data, which can be shared and used in ways real-world data can’t. However, this emerging approach isn’t without risks or drawbacks, and it’s essential that organizations carefully explore where and how they invest their resources.

What Is Synthetic Data?

Synthetic data is artificially generated by an AI algorithm that has been trained on a real data set. It has the same predictive power as the original data but replaces it rather than disguising or modifying it. The goal is to reproduce the statistical properties and patterns of an existing data set by modeling its probability distribution and then sampling from that model. The algorithm essentially creates new data that has all of the same characteristics as the original data, leading to the same answers. However, crucially, it’s virtually impossible to reconstruct the original data (think personally identifiable information) from either the algorithm or the synthetic data it has created.
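To make the idea of "modeling a probability distribution and sampling from it" concrete, here is a minimal sketch in Python. It is an illustration only, not the method used by any vendor mentioned in this article: it assumes the real data can be approximated by a multivariate Gaussian, whereas production systems typically rely on far richer generative models. The toy "age" and "income" columns and the function names `fit_gaussian` and `sample_synthetic` are hypothetical, and nothing in this sketch provides a formal privacy guarantee.

```python
import numpy as np


def fit_gaussian(real):
    """Estimate the mean vector and covariance matrix of the real data."""
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return mean, cov


def sample_synthetic(mean, cov, n_rows, seed=0):
    """Draw brand-new rows from the fitted distribution."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_rows)


# Hypothetical stand-in for a real data set: two correlated columns.
rng = np.random.default_rng(42)
age = rng.normal(45, 12, size=5000)
income = 1000 * age + rng.normal(0, 8000, size=5000)
real = np.column_stack([age, income])

# Step 1: model the distribution. Step 2: sample from the model.
mean, cov = fit_gaussian(real)
synthetic = sample_synthetic(mean, cov, n_rows=5000)

# The synthetic rows are new records, yet the column relationship survives.
real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(f"real correlation:      {real_corr:.3f}")
print(f"synthetic correlation: {synth_corr:.3f}")
```

The two printed correlations should land close to each other, which is the sense in which synthetic data "leads to the same answers." Whether a given generator also makes the original records impossible to reconstruct is a separate question that has to be evaluated for that generator.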

Synthetic data is a boon for researchers. Consider what the National Institutes of Health in the U.S. is doing with Syntegra, an IT services startup. Syntegra is using its synthetic data engine to generate and validate a nonidentifiable replica of the NIH’s database of COVID-19 patient records, which comprises more than 2.7 million screened individuals and more than 413,000 COVID-19-positive patients. The synthetic data set, which precisely duplicates the original data set’s statistical properties but with no links to the original information, can be shared and used by researchers across the globe to learn more about the disease and accelerate progress in treatments and vaccines.

The technology has potential across a range of industries. In financial services, where restrictions around data usage and customer privacy are particularly limiting, companies are starting to use synthetic data to help them identify and eliminate bias in how they treat customers — without contravening data privacy regulations. And retailers are seeing the potential for new revenue streams derived from selling synthetic data on customers’ purchasing behavior without revealing personal information.

About the Author

Fernando Lucini (@fernandolucini) is global data science and machine learning engineering lead at Accenture Applied Intelligence.

Reprint #: 63217
