What is Data De-identification? Why is it important?

Data de-identification is the process of removing the connection between data and the person with whom it was originally associated. This involves transforming or removing personal identifiers. Once personal identifiers have been removed or transformed through the de-identification process, the data is easier to share with third parties and to reuse.

HIPAA expressly regulates data de-identification, which is why most people associate the process with medical data. Businesses and agencies that need to protect individuals' identities under other frameworks, such as CCPA, CPRA or GDPR, also rely on data de-identification.

Personal identifiers can cause health information to be classified as protected health information (PHI), which restricts its use and disclosure. Data de-identification tools that use sensitive data discovery can detect and hide these types of information.

Safe Harbor, the HIPAA method that requires removing a fixed list of 18 types of identifiers, is often praised for its simplicity and low cost, but it is not suitable for every use case. It can be too restrictive, leaving too little utility in the data, or too permissive, leaving too many indirect identifiers.

Expert Determination

Expert determination is the application of scientific and statistical principles to data in order to reduce the risk of re-identification. This allows you to adapt the de-identification process to your specific use case while maximising utility. It is praised for its flexibility.

Expert determination is sometimes considered too expensive because it requires an expert in statistics, and such experts can be costly to source. However, expert determination relies on quantitative methods to reduce the risk of re-identification, which opens up the possibility of leveraging generalization or automation.
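As a rough illustration of what a quantitative view of re-identification risk can look like, the sketch below scores each record as 1 divided by the number of records that share its indirect identifiers. This scoring rule, the function name and the column names are assumptions made for illustration; an actual expert determination would use whatever statistical model the expert judges appropriate.

```python
from collections import Counter


def reidentification_risk(records, indirect_identifiers):
    """Return (max_risk, average_risk) for a list of dict records.

    Assumption for this sketch: a record's risk is 1 / (number of records
    sharing the same values for the indirect identifiers).
    """
    if not records:
        raise ValueError("records must not be empty")
    keys = [tuple(r[col] for col in indirect_identifiers) for r in records]
    group_sizes = Counter(keys)
    risks = [1 / group_sizes[key] for key in keys]
    return max(risks), sum(risks) / len(risks)


# Hypothetical sample data: two records share a group, one stands alone.
patients = [
    {"age_band": "30-39", "zip3": "100"},
    {"age_band": "30-39", "zip3": "100"},
    {"age_band": "60-69", "zip3": "945"},
]
max_risk, average_risk = reidentification_risk(patients, ["age_band", "zip3"])
```

A record that is alone in its group gets the highest possible score, which is the kind of signal that would prompt further generalization or suppression.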

Limited Data Sets

HIPAA also allows limited data sets to be released for research, public health, and healthcare operations. These data sets do not contain direct personal identifiers, but they may include dates of birth and death, age, location, and dates of treatment and discharge. Because limited data sets may still contain identifying information, they remain protected under HIPAA as PHI.

How to de-identify data

Data de-identification usually takes place in two steps.

The first step is to classify and tag direct and indirect identifiers. Direct identifiers are unique identifiers such as Social Security numbers, passport numbers, and taxpayer identification numbers. The remaining identifiers, known as indirect identifiers, are personal attributes that are not unique to any one individual. Examples of indirect identifiers include height, hair color, and ethnicity. Although they are not unique, indirect identifiers may be combined to single out an individual’s records.

Once the data classifiers are verified to represent the data source, it’s possible to automate the tagging process. This makes de-identification much more efficient for data teams.
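The classification step can be sketched in a few lines of Python. This is a minimal illustration, not a production classifier: the column-name hints, the Social Security number pattern, and the function names are all assumptions chosen for the example, and real sensitive data discovery tools use far richer rules.

```python
import re

# Hypothetical tagging rules: these hints and the pattern are illustrative
# assumptions, not an exhaustive or authoritative list.
DIRECT_NAME_HINTS = ("ssn", "social_security", "passport", "taxpayer_id")
INDIRECT_NAME_HINTS = ("height", "hair_color", "ethnicity")
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")


def tag_column(name, sample_values):
    """Return 'direct', 'indirect', or 'untagged' for one column."""
    lowered = name.lower()
    if any(hint in lowered for hint in DIRECT_NAME_HINTS):
        return "direct"
    # Fall back to the values: tag as direct if every non-empty sample
    # looks like a Social Security number.
    values = [v for v in sample_values if v]
    if values and all(SSN_PATTERN.match(v) for v in values):
        return "direct"
    if any(hint in lowered for hint in INDIRECT_NAME_HINTS):
        return "indirect"
    return "untagged"


def tag_table(columns):
    """Tag every column in a {column_name: sample_values} mapping."""
    return {name: tag_column(name, values) for name, values in columns.items()}


# Hypothetical sample of column names and values.
sample = {
    "patient_ssn": ["123-45-6789"],
    "id_number": ["987-65-4321", "111-22-3333"],
    "hair_color": ["brown", "black"],
    "visit_notes": ["follow-up in two weeks"],
}
tags = tag_table(sample)
```

Columns left as "untagged" are the ones a person would still need to review before the rules are trusted to run unattended.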

A combination of several techniques and controls can be used to de-identify data. These organizational and technical measures can affect the appearance of the data as well as its environment, including who can access it and for what purposes. Pseudonymization is one of the main techniques that data engineers and operations teams apply to data before it is fully de-identified.


Although pseudonymization is a useful security technique, it does not always de-identify the data. This is partly because pseudonymization is often used to mask direct identifiers and does not take indirect identifiers into account.

Although some techniques are more powerful than others, pseudonymization can transform direct identifiers using various masking techniques. For example, salted hashes provide a formal guarantee that a hidden value cannot reasonably be linked back to an identifiable individual without knowing the salt, or random input data. Recent privacy and data protection laws require that this random data be kept separate from the pseudonymized information through technical and organizational measures. This makes salted hashes one of the strongest masking techniques.
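A minimal sketch of salted hashing using only the Python standard library follows. It uses HMAC-SHA-256 to combine the secret salt with the value, which is one common way to build a keyed hash. That choice, along with the function names and the sample value, is an assumption for illustration. In practice the salt would live in a separate, access-controlled store rather than in a variable next to the data.

```python
import hashlib
import hmac
import secrets


def generate_salt(num_bytes=32):
    """Create a random secret salt; store it separately from the masked data."""
    return secrets.token_bytes(num_bytes)


def pseudonymize(value, salt):
    """Return a hex token for value that cannot be recomputed without salt."""
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()


salt = generate_salt()
token_a = pseudonymize("123-45-6789", salt)
token_b = pseudonymize("123-45-6789", salt)
token_other_salt = pseudonymize("123-45-6789", generate_salt())
```

With the same salt, the same input always maps to the same token, so records can still be joined on the masked column. With a different salt, the token changes, which is why keeping the salt secret and separate matters.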

Methods for de-identification

There are two main methods of de-identification: randomizing and generalizing.

Generalizing (k-anonymization)

K-anonymization, a data generalization technique, is applied after direct identifiers are masked. It reduces the risk of re-identification by hiding individuals within groups and suppressing indirect identifiers for groups smaller than a predetermined number, k. This is intended to mitigate identity and relational inference attacks. Because only the small groups are suppressed, the technique reduces the amount of data that has to be redacted and increases the data's utility without compromising privacy.
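The suppression part of k-anonymization can be sketched as follows. This is a simplified illustration that only suppresses small groups and does not generalize values (for example, by widening age bands). The function name, the "*" placeholder, and the sample columns are assumptions for the example.

```python
from collections import Counter

SUPPRESSED = "*"  # placeholder value; an assumption for this sketch


def k_anonymize(records, indirect_identifiers, k):
    """Suppress indirect identifiers in groups with fewer than k records.

    Records are dicts; the input list is left unmodified.
    """
    def group_key(record):
        return tuple(record[col] for col in indirect_identifiers)

    group_sizes = Counter(group_key(r) for r in records)
    released = []
    for record in records:
        masked = dict(record)
        if group_sizes[group_key(record)] < k:
            for col in indirect_identifiers:
                masked[col] = SUPPRESSED
        released.append(masked)
    return released


# Hypothetical data with direct identifiers already masked or removed.
records = [
    {"age_band": "30-39", "zip3": "100", "diagnosis": "flu"},
    {"age_band": "30-39", "zip3": "100", "diagnosis": "asthma"},
    {"age_band": "60-69", "zip3": "945", "diagnosis": "diabetes"},
]
released = k_anonymize(records, ["age_band", "zip3"], k=2)
```

Here the first two records form a group of two and are released unchanged, while the third record is alone in its group, so its age band and ZIP prefix are suppressed and only the diagnosis remains.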

Is de-identified data considered PHI?

As long as proper de-identification processes and, in practice, a data audit trail are in place, data that has been de-identified is not considered PHI under HIPAA.

Data de-identification is vital in public health emergencies. While real-time data is important, it is also crucial to ensure privacy, confidentiality and compliance. Data de-identification allows for important health information to be distributed without compromising privacy or confidentiality.

Reduce Compliance Costs

The Center for New Data illustrates the importance of data de-identification in public health. The Center for New Data was born out of the COVID-19 pandemic with the aim of giving researchers the ability to analyze public data for policymakers, governments, and academics. It used Immuta’s dynamic k-anonymization capabilities to meet its urgent need for insight.