Synthetic data: How redaction can help you use AI in GDPR-compliant ways
Updated: Feb. 10
The rise of AI technologies gives companies new opportunities to analyze customer and market data. Yet the risk of data breaches puts a stop to full exploration. Redaction software tools can create synthetic data and help solve the problem.
Artificial intelligence (AI) is one of the most promising tools of the century. Doctors and medical researchers can employ it to treat patients more effectively. Businesses can use it to analyze consumption patterns and segment customers. And researchers, NGOs, and government organizations can use AI methods to analyze historical records, political patterns, and much more.
As of 2021, a large quantity of data is available, both from the past and the present. Our societies are becoming increasingly digitized, and personal data is used and stored in a complex web of actors. This abundance of digital information gives businesses in particular an unprecedented opportunity to use machine learning for tactical purposes. The list of benefits is long:
Understand and predict consumer behavior.
Create personalized marketing content.
Detect fraud and scams.
Update lists and evaluate company performance.
And much more… All this helps businesses increase their sales and profits. This is why they want to develop machine learning algorithms that can help analyze, systematize, and interpret data. Yet they face one main challenge:
How do you develop an AI tool based on vast amounts of data WITHOUT clashing with GDPR regulations?
AI versus GDPR
Now, the issue is a complex one. Yet it can be solved with the right automated redaction software tool.
Both AI and machine learning depend entirely on good data: data that is precise, extensive, and useful for cross-referencing. With good data at hand, one can create smart technologies that allow for a quick analysis of useful information. In a way, one can say that it takes good data to make good data.
Fortunately, rich data is everywhere: in records, journals, receipts, surveys, ratings, posts, pictures, licenses, public data records, historical listings, and so on. Yet any unauthorized use clashes with privacy protection rules. It means breaking the law and risking huge fines.
The problem is that using personal data for a new purpose generally requires a legal basis, which in practice often means consent. Say you have several license plates and a list of names and locations. For a car manufacturer to adapt that information into a useful AI instrument, the manufacturer would need extensive consent from every listed person: consent not only to the novel use of their data but also to its transfer and incorporation into an AI technology.
So how do you build software that leverages data but still stays compliant with GDPR, without having to phone 5,000 people manually?
Solution: Synthetic data
There is more or less one way to go: creating synthetic or redacted data. As Annie Brown, founder of an AI-driven social commerce platform, puts it in a recent Forbes article:
“Synthetic data algorithms are especially good at synthesizing behavioral records, such as credit card transactions or purchase histories, including time-dependencies of customers’ actions and behaviors.”
Synthetic data arises when original information is transformed so that all individual details and identifiers are obscured. You extract the statistical properties that you can correlate and use in AI. Synthetic data is anonymous. It is a method that allows you to filter for patterns rather than individual identifiers. By redacting and anonymizing names, licenses, health indicators, or payment details, you secure your right to use the knowledge in your new AI technology.
Keep the characteristics of the original datasets, remove all personal identifiers, and voilà! The data can be incorporated into algorithms without the risk of breaking privacy rules. This process helps everyone from financial institutions and hospitals to corporations create tools for optimizing output and treatment in the near future.
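To make the idea concrete, here is a minimal Python sketch of stripping direct identifiers from a small dataset while keeping an aggregate pattern intact. The field names (name, license_plate, card_number, region, monthly_spend) and the sample records are invented for illustration and are not taken from any real system or from our product.

```python
from statistics import mean

# Hypothetical field names treated as direct identifiers; adjust to your own schema.
DIRECT_IDENTIFIERS = {"name", "license_plate", "card_number"}


def strip_identifiers(records):
    """Return copies of the records without the direct identifier fields."""
    return [
        {key: value for key, value in record.items() if key not in DIRECT_IDENTIFIERS}
        for record in records
    ]


# Invented sample data, for illustration only.
customers = [
    {"name": "Sean", "license_plate": "AB12345", "region": "North", "monthly_spend": 120.0},
    {"name": "Maria", "license_plate": "CD67890", "region": "South", "monthly_spend": 80.0},
    {"name": "Lars", "license_plate": "EF24680", "region": "North", "monthly_spend": 100.0},
]

cleaned = strip_identifiers(customers)

# The aggregate pattern is unchanged even though the identifiers are gone.
original_mean = mean(record["monthly_spend"] for record in customers)
cleaned_mean = mean(record["monthly_spend"] for record in cleaned)
```

Note that dropping direct identifiers is only a first step. The remaining fields can still single out a person when combined, so a sketch like this does not by itself guarantee anonymity.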
Currently, synthetic data seems to be the only way forward. It provides a solution to the ethical issues of consent, privacy, and data transparency integral to AI.
Redaction software to the rescue
Now redaction comes into play. Redaction is all about hiding and blurring sensitive data so that digital documents, data sets, and information can be used for AI. To develop synthetic data that keeps statistical qualities and patterns intact, one needs to process the originals. And it needs to be done well!
That is why a redaction software tool comes in handy. With an automated redaction software tool like ours, your company can easily “clean” datasets. A document redaction tool has three major advantages:
Sensitive content is identified.
Sensitive content is hidden.
Sensitive content is pseudonymized.
A modern redaction software tool can identify, blur, and pseudonymize data by replacing it with fictitious placeholders. Placeholders like B1, B2, and B3 can stand in place of a name, location, security number, age, or what have you. Once names like Sean or Maria turn into B1 and C2, they are fit for any AI algorithm predicting, say, education choices or job market demands on a macro scale.
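The placeholder idea can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not how any particular redaction product works: the Pseudonymizer class is hypothetical, and the list of sensitive values is supplied by hand, whereas a real tool would first have to detect them.

```python
import re
from itertools import count


class Pseudonymizer:
    """Hypothetical helper: swap sensitive values for consistent placeholders (B1, B2, ...)."""

    def __init__(self, prefix="B"):
        self.prefix = prefix
        self._counter = count(1)
        self._mapping = {}

    def placeholder_for(self, value):
        # Reuse the same placeholder for a repeated value so patterns stay intact.
        if value not in self._mapping:
            self._mapping[value] = f"{self.prefix}{next(self._counter)}"
        return self._mapping[value]

    def redact(self, text, sensitive_values):
        # Assign placeholders in the order the values were given.
        for value in sensitive_values:
            self.placeholder_for(value)
        # Replace longer values first so a full name wins over a first name.
        for value in sorted(sensitive_values, key=len, reverse=True):
            pattern = r"\b" + re.escape(value) + r"\b"
            text = re.sub(pattern, self._mapping[value], text)
        return text


names = Pseudonymizer(prefix="B")
sentence = "Sean applied in 2019. Maria applied in 2020. Sean was admitted."
redacted = names.redact(sentence, ["Sean", "Maria"])
# redacted == "B1 applied in 2019. B2 applied in 2020. B1 was admitted."
```

Because the same person always maps to the same placeholder, the fact that one applicant appears twice is still visible to an algorithm even though the name is gone. The mapping table itself is sensitive and would need to be protected or discarded.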
The tricky part about synthetic data is to keep the patterns intact. We want no knowledge to get lost in digital translation. To redact and anonymize properly is not easy.
Avoid bias and poorly anonymized data
Yet redacting and anonymizing properly is important in order to safeguard yourself against two grave pitfalls.
One is poorly anonymized data: data that potentially leaves the door open to retracing the personal information that was supposed to be hidden. That has been a huge problem in the past, and there have been cases where personal data was recovered by cross-referencing otherwise redacted documents.
Data breaches can have serious consequences for a company or organization working with data-based AI tools. One example of a leak is the private addresses of New York taxi drivers, which became traceable via trip records. Without a proper automated redaction tool in place, the risk of such unfortunate data leaks increases.
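One common way to reason about this kind of cross-referencing risk is to check how many records share the same combination of remaining attributes, an idea known as k-anonymity. The Python sketch below is a simplified illustration with invented data and hypothetical field names (driver, pickup_zone, shift). It is not a reconstruction of the taxi case and not a complete privacy audit.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )


def smallest_group_size(records, quasi_identifiers):
    """Return the size of the smallest group; 1 means someone is unique."""
    return min(group_sizes(records, quasi_identifiers).values(), default=0)


def risky_records(records, quasi_identifiers, k=2):
    """Return records whose quasi-identifier combination occurs fewer than k times."""
    sizes = group_sizes(records, quasi_identifiers)
    return [
        record
        for record in records
        if sizes[tuple(record[field] for field in quasi_identifiers)] < k
    ]


# Invented, already pseudonymized sample data, for illustration only.
trips = [
    {"driver": "B1", "pickup_zone": "Midtown", "shift": "night"},
    {"driver": "B2", "pickup_zone": "Midtown", "shift": "night"},
    {"driver": "B3", "pickup_zone": "Harlem", "shift": "day"},
]

fields = ["pickup_zone", "shift"]
k_value = smallest_group_size(trips, fields)
exposed = risky_records(trips, fields, k=2)
```

Here the third record is the only one with its combination of zone and shift, so anyone who knows those two facts from another source could link the record back to a person, even though the name has been replaced.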
Another issue is built-in bias, which can create an imbalance in data sets. Take historical records as an example. They often favor men over women when it comes to the registration and acknowledgment of career performance. Or take a financial crisis that fundamentally changes patterns of consumption, saving, and borrowing. Both would weigh heavily in a data set and tend to skew the algorithms in a non-representative way.
Working with synthetic data through automated redaction offers a chance to correct those errors. It gives a more precise and fairer outcome. Redaction software helps build AI that secures not only 100 percent anonymization but also a fair and valid representation.
Take advantage of digitalization
Today, there is a large and unexplored field of useful, digitally stored data just waiting to be used for research and innovation or for business building and customer understanding. It would be a shame not to take advantage of it.
Cleardox's redaction software can create synthetic data that allows you to go all-in on smart machine learning. It can help you access not just a fragment of relevant data but all of it.
Interested in getting a closer look at our product? Sign up for a demo here!
The Cleardox team