We are living in the era of data. Companies are looking to utilize the data they collect to make more informed business decisions. Even government organizations, which have historically been slow to innovate, are looking to better understand the data they are amassing in order to provide better care and support for their constituents.
Organizations are looking for ways to utilize data while:
1. Preserving data utility: ensuring that the data being used is indeed useful and that valid insights can be drawn from it
2. Preserving data privacy: ensuring that the data being used carries no privacy risks and exposes no personally identifiable information (PII)
Frequently, old-school data masking and de-identification software just don’t cut it on both counts. They risk either destroying data utility by producing datasets that are not statistically comparable to the original (violating #1) or allowing individuals to be re-identified within the data (violating #2).
What is Data Masking Software and De-identification Software?
Data masking software protects an organization’s important data by disguising it with random characters or other data. De-identification software replaces personal identifying data in datasets with artificial identifiers, or pseudonyms.
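To make the distinction concrete, here is a minimal Python sketch of the two approaches. This is illustrative only: the salt, field values, and pseudonym format are invented, and commercial products are far more sophisticated.

```python
import hashlib

def mask(value: str, keep_last: int = 4) -> str:
    # Masking: disguise most characters while keeping the format recognizable.
    return "*" * (len(value) - keep_last) + value[-keep_last:]

def pseudonymize(value: str, salt: str = "org-secret") -> str:
    # De-identification: replace the identifier with a stable artificial pseudonym.
    digest = hashlib.sha256((salt + value).encode()).hexdigest()
    return "user-" + digest[:10]

print(mask("4111111111111111"))              # → ************1111
print(pseudonymize("jane.doe@example.com"))  # same input always yields the same pseudonym
```

Note that both techniques transform the original record in place; neither produces a new, statistically equivalent dataset, which is exactly the gap synthetic data aims to fill.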
Over the past few years, G2 has seen the rise of synthetic data, both unstructured and structured, which provides companies with tools to programmatically create datasets that are statistically comparable to the originals but contain no actual data or PII. Even governmental organizations, such as the National Security Commission on Artificial Intelligence, recognize the importance of this type of data, as they have expressed through partnerships with sellers and in their reports.
Over the next year, G2 expects to see steady growth in the number of sellers in the synthetic data space along with more novel uses of the technology as knowledge of its impact and import increases.
Although synthetic data of different varieties has been around for decades, the past few years have seen a boom in interest and an advancement in techniques. Indeed, over 71% of the 21 companies in G2’s Synthetic Data software category were founded in 2017 or later, as can be seen below.
The positive use cases of synthetic data are manifold and exciting, with the industry impact being immense. If one picks an industry out of a (very large) hat, chances are there is a use case where synthetic data can make an impact.
Healthcare
Once a synthetic data solution is integrated with a healthcare organization’s databases, it ingests all data points, automating data de-duplication and cleaning, capturing statistical insights and relationships between data points, and facilitating data sharing, delivery, and modeling.
Autonomous vehicles
Within the autonomous vehicle space, companies are working with synthetic data companies to build more robust training sets. Traditional methods of training these vehicles are fraught with difficulties, from the expense of building a large and diversified dataset of scenarios to the danger of casualties. With synthetic data, autonomous vehicle makers can programmatically create datasets that are comparable to the real world. With an adequate dataset, these vehicles are positioned to be safer and more reliable.
Finance
In the financial services space, companies are using synthetic data to share and analyze financial data. For example, businesses can augment customer information, including credit scoring data. With synthetic data, they are able to preserve patterns and relationships in transactional time-series data. Real-world applications include modeling complex causal and temporal relationships in transactional flows and building credit risk systems.
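As a toy illustration of "statistically comparable but not real," one could fit simple summary statistics to numeric transaction features and then sample brand-new records from them. This is a deliberately simplistic Gaussian sketch with invented numbers; real synthetic data vendors use far more powerful generative models that also handle categorical fields and temporal structure.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for "real" transactional features: amount and account balance
# (hypothetical values, correlated on purpose).
real = rng.multivariate_normal(mean=[50.0, 1200.0],
                               cov=[[25.0, 80.0], [80.0, 900.0]],
                               size=5000)

# Fit summary statistics to the real data...
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# ...then sample entirely new records from the fitted distribution.
synthetic = rng.multivariate_normal(mu, cov, size=5000)

# The synthetic rows mirror the originals' means and correlations,
# yet none of the original records appear in them.
print(np.allclose(real.mean(axis=0), synthetic.mean(axis=0), atol=2.0))
```

Preserving these relationships is what makes the synthetic records usable for downstream tasks such as credit risk modeling, where correlations between features carry the signal.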
Within the autonomous vehicle space, CVEDIA has built SynCity, a simulation platform used to generate data for neural network training and validation. The platform can be used to validate computer vision systems for autonomous vehicles with custom, photo-realistic simulations.
In the financial services space, Hazy is already helping some of the world’s top banks and insurance companies reduce compliance risk and speed up data innovation.
On the other hand: bad actors using synthetic data
However, not everything is peachy in the field of synthetic data. Over the past couple of years, we have seen an uptick in the malicious use of synthetic media, especially in the form of deepfakes, a type of synthetic media that can take the form of text, images, audio, or video. Most commonly, people think of deepfakes as images or videos doctored with someone else's likeness.
Below, one can see that interest in this domain has remained relatively low, except for two spikes in early and mid-2018, when the term first began to be used.
Deepfakes differ in sophistication, with some versions being particularly amateur and shoddy, while others are very difficult to detect. What is alarming is that this type of synthetic media is only becoming more advanced and increasingly difficult to detect. This trend is also fueled by the following factors:
Deepfakes-as-a-service: Some bad actors are offering to sell any individual a bespoke deepfake, allowing them to create any sort of media for the right price.
Misinformation for the loss: Bad actors can disseminate deepfake videos through social media and present fake footage as if it were real.
Moving forward, we expect to see more investment in deepfake detection from both cybersecurity firms and media organizations. With regard to the latter, this will likely be fueled by internal innovation and talent as well as strategic investment.
However, there is hope
All is not lost. As noted above, governments have taken notice of both the good and bad side of synthetic data. Apart from the positive applications of synthetic data we saw above, the U.S. Congress is also investing in solutions to combat deepfakes and is actively working to move the conversation forward.
We have also seen strong interest from cybersecurity firms and social media organizations to combat malicious synthetic media through competitions and data science labs.
Merry Marwig, a G2 analyst focused on data privacy and cybersecurity, remarked:
“G2 does not (yet) have a category for deepfakes and other types of disinformation detection, but we are keeping a close eye on this market in 2021.”
Matthew Miller is passionate about emerging technology and its impact on society and businesses. He most recently worked as an AI Research Analyst at CognitionX, a London-based AI-powered Knowledge Network and host of one of Europe's largest AI conferences. He also co-founded a pro bono voice technology group, VAICE, which has helped companies discover the best ways to incorporate voice tech into their businesses and business models. At G2, he is focusing on the AI and Analytics categories and looks forward to learning more. Get in touch at firstname.lastname@example.org.