2021 Trends in Synthetic Data Generation and Detection

December 8, 2020

This post is part of G2's 2021 digital trends series. Read more about G2’s perspective on digital transformation trends in an introduction from Michael Fauscette, G2's chief research officer, and Tom Pringle, VP, market research, and in additional coverage on trends identified by G2’s analysts.

On the one hand: good actors using synthetic data

We are living in the era of data. Companies are looking to use the data they collect to make more informed business decisions. Government organizations, which have historically been slow to innovate, are likewise looking to better understand the data they are amassing so they can provide better care and support for their constituents.

Organizations are looking for ways to utilize data while: 
  1. Preserving data utility: ensuring that the data being used is indeed useful and that valid insights can be drawn from it
  2. Preserving data privacy: ensuring that the data being used has no privacy risks or personally identifiable information (PII)

Frequently, old-school data masking software and de-identification software fall short on both counts. They risk either destroying data utility by producing datasets that are not statistically comparable to the original (violating #1) or allowing people within the data to be identified (violating #2).
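
To make the two requirements concrete, here is a minimal Python sketch of how one might check a candidate dataset against them. It is an illustration under stated assumptions, not a method described by G2 or any vendor: the function names and sample values are hypothetical, utility is approximated by comparing the mean and standard deviation of a single numeric column, and privacy is approximated by looking for rows that exactly match a real record. Real evaluations use much richer statistical and re-identification tests.

```python
import statistics

def utility_gap(original, candidate):
    """Relative gap in mean and standard deviation for one numeric column.

    Smaller values suggest the candidate data is closer to the original.
    (Hypothetical helper; a crude proxy for requirement #1.)
    """
    mean_gap = abs(statistics.mean(original) - statistics.mean(candidate))
    std_gap = abs(statistics.stdev(original) - statistics.stdev(candidate))
    scale = abs(statistics.mean(original)) or 1.0  # avoid dividing by zero
    return mean_gap / scale, std_gap / scale

def leaked_records(original_rows, candidate_rows):
    """Rows of the candidate dataset that exactly match a real record.

    (Hypothetical helper; a crude proxy for requirement #2. An empty
    result does not prove the data is free of privacy risk.)
    """
    real = set(original_rows)
    return [row for row in candidate_rows if row in real]

# Made-up sample values, for illustration only.
original_ages = [34, 45, 29, 52, 41]
candidate_ages = [36, 44, 30, 50, 40]
mean_gap, std_gap = utility_gap(original_ages, candidate_ages)

original_rows = [("Ana", 34), ("Ben", 45)]
candidate_rows = [("P1", 36), ("Ben", 45)]
leaks = leaked_records(original_rows, candidate_rows)
print(mean_gap, std_gap, leaks)
```

In this made-up example the statistics stay close, but one real record survives unchanged in the candidate data, which is the kind of failure described in #2.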

What is Data Masking Software and De-identification Software?

Data masking software protects an organization’s important data by disguising it with random characters or other data. De-identification software replaces personal identifying data in datasets with artificial identifiers, or pseudonyms.
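
As a rough illustration of the two definitions above, the following Python sketch masks a value and replaces an identifier with a pseudonym. It is a simplification, not how any particular product works: the helper names, the fixed fill character (real tools may substitute random characters or realistic fake data), the `person_` prefix, and the demo key are all assumptions made for this example.

```python
import hashlib
import hmac

def mask(value, visible=4, fill="*"):
    """Disguise all but the last `visible` characters of a string."""
    if len(value) <= visible:
        return fill * len(value)
    return fill * (len(value) - visible) + value[-visible:]

def pseudonymize(identifier, secret_key):
    """Replace an identifier with a stable artificial one.

    The pseudonym is derived from a keyed hash (HMAC-SHA256), so the same
    input always maps to the same pseudonym under the same key.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return "person_" + digest.hexdigest()[:12]

# Hypothetical values, for illustration only.
key = b"demo-key-not-for-production"
masked_card = mask("4111111111111111")
alias_a = pseudonymize("jane.doe@example.com", key)
alias_b = pseudonymize("john.roe@example.com", key)
print(masked_card, alias_a, alias_b)
```

Because the pseudonym is stable, records for the same person can still be linked across tables. That keeps the data useful, but it is also one reason pseudonymized data can sometimes be re-identified.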

Over the past few years, G2 has seen the rise of synthetic data, both unstructured and structured, which gives companies tools to programmatically create datasets that are statistically comparable to the original but contain no actual records or PII. Even governmental organizations, such as the National Security Commission on Artificial Intelligence, recognize the importance of this type of data, as they have expressed through partnerships with sellers and in reports.
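
The idea of programmatically creating a statistically similar dataset can be sketched in a few lines of Python. This toy example assumes two numeric columns that are roughly normally distributed; it fits their means, standard deviations, and correlation, then samples brand-new rows from that fitted model. The function names and sample figures are hypothetical, and commercial synthetic data tools rely on far more sophisticated models. Sampling from a fitted model also does not by itself guarantee privacy.

```python
import math
import random
import statistics

def fit_bivariate(xs, ys):
    """Estimate means, standard deviations, and correlation of two columns."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return mx, my, sx, sy, cov / (sx * sy)

def sample_synthetic(params, n, seed=0):
    """Draw n new (x, y) rows from a bivariate normal with the fitted parameters."""
    mx, my, sx, sy, rho = params
    rng = random.Random(seed)
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))  # guard against rounding
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        y = my + sy * (rho * z1 + residual * z2)  # keeps the x-y correlation
        rows.append((x, y))
    return rows

# Made-up "real" data: ages and incomes (in thousands), for illustration only.
ages = [25, 32, 41, 47, 53, 60]
incomes = [31, 40, 52, 58, 61, 70]

params = fit_bivariate(ages, incomes)
synthetic = sample_synthetic(params, n=5000)
synthetic_params = fit_bivariate([r[0] for r in synthetic],
                                 [r[1] for r in synthetic])
print(params)
print(synthetic_params)
```

None of the generated rows is copied from the input, yet the summary statistics and the relationship between the two columns should land close to those of the original data.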

Although synthetic data of different varieties has been around for decades, interest has boomed and techniques have advanced over the past few years. Indeed, over 71% of the 21 companies in G2’s Synthetic Data software category have been founded since 2017, as can be seen below.

graph depicting sellers of synthetic data per their founding date

The positive use cases of synthetic data are manifold and exciting, and the potential industry impact is immense. Pick an industry out of a (very large) hat, and chances are there is a use case where synthetic data can make an impact.

Healthcare
Jasmine Lee, G2 analyst focused on healthcare, has highlighted the appeal and real-life consequences of applying synthetic data to sensitive clinical data. She writes:

Once synthetic data solutions are integrated within the databases of a healthcare organization, it ingests all data points, automating data de-duplication and cleaning, capturing statistical insights and relationships between data points, and facilitating data sharing, delivery, and modeling.  

Autonomous vehicles
Within the autonomous vehicle space, companies are working with synthetic data vendors to build more robust training sets. Traditional methods of training these vehicles are fraught with difficulties, from the expense of building a large and diverse dataset of scenarios to the danger of casualties. With synthetic data, autonomous vehicle makers can programmatically create datasets that are comparable to the real world. With an adequate dataset, these vehicles can be made safer and more reliable.

Financial services
In the financial services space, companies are using synthetic data to share and analyze financial data. For example, businesses can augment customer information, including credit scoring, while preserving the patterns and relationships in transactional time-series data. Real-world applications include modeling complex causal and temporal relationships in transactional flows and building credit risk systems.

Concrete examples include: 

  • Within the healthcare space, The National Institutes of Health have partnered with MDClone to facilitate research into COVID-19 data.
  • Within the autonomous vehicle space, CVEDIA has built SynCity, a simulation platform used to generate data for neural network training and validation. The platform can be used to validate computer vision systems for autonomous vehicles with custom, photo-realistic simulations.
  • In the financial services space, Hazy is already helping some of the world’s top banks and insurance companies reduce compliance risk and speed up data innovation.

On the other hand: bad actors using synthetic data

However, not everything is peachy in the field of synthetic data. Over the past couple of years, we have seen an uptick in the malicious use of synthetic media, especially deepfakes, which can take the form of text, images, audio, or video. Most commonly, people think of a deepfake as an image or video that has been doctored with someone else's likeness.

Below, one can see how interest in this domain has remained relatively low except for two spikes in early 2018 and mid-2018, when the term first began to be used.

graph showing growth of interest in deepfakes in the US since 2018

Deepfakes differ in sophistication: some are amateurish and shoddy, while others are very difficult to detect. What is alarming is that this type of synthetic media is only becoming more advanced and harder to spot. This trend is fueled by the following factors:

  1. Deepfakes-as-a-service: Some bad actors are offering to sell any individual a bespoke deepfake, allowing them to create any sort of media for the right price.
  2. Misinformation for the loss: Bad actors can disseminate deepfake videos through social media and present fake footage as if it were real.

However, there is hope

All is not lost. As noted above, governments have taken notice of both the good and bad sides of synthetic data. Apart from the positive applications of synthetic data we saw above, the U.S. Congress is also investing in solutions to combat deepfakes and is actively working to move the conversation forward.

We have also seen strong interest from cybersecurity firms and social media organizations to combat malicious synthetic media through competitions and data science labs. 

Merry Marwig, G2 analyst focused on data privacy and cybersecurity, remarked:

“G2 does not (yet) have a category for deepfakes and other types of disinformation detection, but we are keeping a close eye on this market in 2021.”
