business technology

Data masking can solve the anonymization problem

data security
Written by Nigel Simpkins

Data privacy has become a significant issue for individuals and governments alike. Consumers respond to data breaches by terminating their relationships with an organization, and governments have been increasingly passing data privacy regulations that limit what personal data an organization can collect and what it can do with that data.

As a result, organizations have been making an effort to improve the privacy of their data subjects. However, efforts to anonymize sensitive data before disclosing it to third parties have proven inadequate, prompting the need for more comprehensive anonymization solutions like data masking.

The Need for Data Anonymization

In recent years, governments have been passing legislation designed to help improve the personal privacy of their citizens. This includes several data privacy regulations that redefine how organizations are allowed to collect and process consumers’ personal information.

While many new regulations exist, like the California Consumer Privacy Act (CCPA) and Brazil’s General Data Protection Law (LGPD), the most famous of them is the EU’s General Data Protection Regulation (GDPR). The GDPR set the stage for many of the data protection regulations that followed it by dramatically changing how organizations must manage personal data entrusted to them by EU citizens.

Some important provisions included:

  • Protecting any data that could uniquely identify an individual
  • Allowing consumers access to and control over their collected data
  • Enforcing an “opt in” rather than “opt out” policy for data collection and processing
  • Requiring clear and transparent language for privacy policies
  • Increasing the fines that can be levied for non-compliance

In a nutshell, the new regulation ensures that an individual’s data cannot be collected, processed, or shared without their consent. Organizations with access to the personal data of EU citizens are also required to take action to protect this personal data from being breached. They may be liable for fines or other penalties if a data breach occurs or if they are found not to be taking appropriate precautions.

Shortcomings of Anonymization

One of the ways in which GDPR allows organizations a little leeway is in how they can manage anonymized data. Under the regulation, protected data includes any data that can be used to uniquely identify an individual. Therefore, if an organization anonymizes data to make such identification impossible, then protections on the anonymized data can be relaxed and it can be shared externally.

However, achieving the level of anonymity necessary to meet this goal may be more difficult than expected. Research published in Nature Communications by researchers at Imperial College London demonstrates how easily anonymized data can be reverse-engineered to determine the identity of the data subject.

This research is based on data anonymized using the traditional techniques of organizations “compliant” with GDPR: names and email addresses are removed, but all other identifying information is left intact. The “anonymized” dataset can still include age, gender, marital status, and other attributes, none of which is unique to a particular individual on its own.

However, access to enough of these characteristics is sufficient to deanonymize members of even a large population. Using a collection of five datasets containing anonymized data on 11 million Americans, the researchers were able to correctly reidentify 99.98% of the data subjects with access to just 15 features.
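The effect behind this result can be sketched with a small simulation. The code below builds a hypothetical synthetic population (the attribute names and distributions are illustrative assumptions, not the study’s data) and counts how many records remain consistent with a target as quasi-identifiers are added one at a time:

```python
import random

random.seed(0)

# Hypothetical synthetic population carrying the kind of quasi-identifiers
# that naive anonymization leaves intact.
population = [
    {
        "birth_year": random.randint(1940, 2005),
        "gender": random.choice(["F", "M"]),
        "zip3": random.randint(100, 999),  # first three ZIP code digits
        "marital": random.choice(["single", "married", "divorced", "widowed"]),
    }
    for _ in range(100_000)
]

def candidates(features):
    """Count records consistent with a partial set of quasi-identifiers."""
    return sum(
        all(person[k] == v for k, v in features.items())
        for person in population
    )

# Reveal the target's attributes one by one and watch the pool shrink.
target = population[0]
keys = list(target)
for n in range(1, len(keys) + 1):
    partial = {k: target[k] for k in keys[:n]}
    print(f"{n} feature(s) -> {candidates(partial)} matching records")
```

Because each independent attribute divides the candidate pool roughly by the number of values it can take, even four coarse features reduce 100,000 records to a handful, which is why 15 features were enough to single out individuals in the study.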

Applying Data Masking

The results of this research demonstrate the inadequacy of traditional data anonymization procedures for protecting the privacy of data subjects. Logically, even a few data features, like birthdate, gender, and postal code, would be enough to significantly narrow down the possible identities of an “anonymized” data subject and may even allow them to be uniquely identified. Since many “anonymized” datasets contain much more information than this, additional measures are required to bring such data into compliance with GDPR and similar regulations.

This is where data masking can be a valuable tool for organizations that want to simultaneously share data with partners and maintain the anonymity and privacy of the data subjects. Rather than simply dropping easily identifiable features from a dataset, data masking involves replacing data features with values that are plausible but incorrect.

For example, a partner organization may want to know how age and location affect a user’s preferences. In general, it does not need the granularity that an exact age or exact location provides: being born in December 1990 versus January 1991, or living on one side of a ZIP code boundary rather than the other, likely has minimal impact on a user’s preferences.

A data masking algorithm can randomize data points while maintaining some level of plausibility. For example, a 26-year-old data subject may be recorded as being 28 and living in the next county over. This likely has minimal impact on the partner organization’s ability to identify valuable trends, but it makes it significantly harder to reidentify the data subject. With a strong data masking solution, an organization can perform this obfuscation efficiently and effectively at scale.
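A minimal sketch of such an algorithm is shown below. The field names and the specific transformations (age perturbation, ZIP generalization) are assumptions for illustration; production masking tools add guarantees this sketch omits, such as deterministic mapping of the same input to the same masked value and referential integrity across tables:

```python
import random

def mask_record(record, rng):
    """Replace sensitive fields with plausible-but-incorrect values."""
    masked = dict(record)
    # Perturb age by a few years, clamped to a plausible adult range,
    # so the value is realistic but never exact.
    masked["age"] = max(18, record["age"] + rng.choice([-3, -2, -1, 1, 2, 3]))
    # Generalize location: keep only the region-level ZIP prefix.
    masked["zip"] = record["zip"][:2] + "XXX"
    # Drop direct identifiers outright.
    masked.pop("name", None)
    return masked

rng = random.Random(42)
original = {"name": "Jane Doe", "age": 26, "zip": "90210", "pref": "hiking"}
print(mask_record(original, rng))
```

The analytically useful field (`pref`) passes through untouched, while every quasi-identifier is either perturbed, generalized, or removed.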

Protecting Sensitive Data

Organizations must balance their customers’ need to remain anonymous against their internal and external business partners’ need for data to analyze customer preferences and trends. Traditional attempts to anonymize data have proven inadequate, making it trivial to reidentify the “anonymous” data subjects.

The use of data masking can solve this anonymization problem. Rather than just dropping obviously identifiable data fields (like name and address), data masking replaces data with plausible but incorrect data. These replacements have minimal impact on the results of data processing but provide data subjects with the anonymity that they need.