Patient health data is an essential component of medical research. Clinical trials depend on accurate data drawn from patient databases to analyze and interpret results; however, it would be highly unethical for the organizations running those trials to have access to patient names and other identifiers. To avoid this, privacy regulations such as the United States Health Insurance Portability and Accountability Act (HIPAA) require patient data to be de-identified before it leaves the originating healthcare institution.
How are records de-identified? [1]
Before data can be de-identified, several factors must be considered:
- How do you define concepts that are flexible across many domains?
- How will you de-identify data and assess it for accuracy to meet privacy and security standards?
- Is there guidance for method development, procurement, assessment, and compliance?
- Who will oversee data monitoring?
- What metrics will be used to determine privacy risk?
- How will personnel be trained to work with the de-identified data?
There are six primary methods of de-identifying data to select from:
- Suppression: The removal of a field or column, like the patients’ names, from the dataset entirely.
- Generalization: A reduction in the granularity of the data, resulting in the information becoming less specific. This might involve changing an exact location to the neighborhood or city.
- Masking: Obscuring data, usually direct identifiers, that have been left in the set, typically by replacing each value with a consistent substitute. The substitute acts as an identifier throughout the dataset, so personnel can still tell which entries belong to the same individual without seeing who that individual is.
- Perturbation: Slightly altering values, similar to adding noise to a system, so that no individual value can be assumed to be exact.
- Aggregation: Grouping raw data together. This allows statistics about the data to be released instead of specific data points.
- Access control & monitoring: Systems that actively monitor and/or limit access to the data (as opposed to mere signed agreements regarding the use of the data).
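As a rough illustration of the first five methods, the sketch below applies them to a handful of toy records. Everything in it is an assumption made for the example: the field names, the sample values, the `deidentify` and `aggregate` helpers, and the choice of a random token for masking and of ±2 years of noise for perturbation. Access control and monitoring is a property of the surrounding system rather than of the data, so it is not shown.

```python
import random
import secrets
from collections import Counter

# Hypothetical toy records; every name and value here is invented for illustration.
records = [
    {"name": "Ann Lee", "neighborhood": "Plateau", "city": "Montreal", "age": 34, "diagnosis": "asthma"},
    {"name": "Ann Lee", "neighborhood": "Plateau", "city": "Montreal", "age": 34, "diagnosis": "asthma"},
    {"name": "Bo Chen", "neighborhood": "Annex", "city": "Toronto", "age": 51, "diagnosis": "diabetes"},
]


def deidentify(rows, noise=2, seed=0):
    """Return copies of the rows with direct identifiers removed or obscured."""
    rng = random.Random(seed)  # seeded only so the example is reproducible
    pseudonyms = {}            # maps each real name to one random token
    released = []
    for row in rows:
        # Masking: the same person always receives the same random token.
        if row["name"] not in pseudonyms:
            pseudonyms[row["name"]] = secrets.token_hex(8)
        released.append({
            "patient_id": pseudonyms[row["name"]],
            # Suppression: the "name" field is simply not copied over.
            # Generalization: keep the city, drop the finer-grained neighborhood.
            "city": row["city"],
            # Perturbation: shift the age by a small random amount.
            "age": row["age"] + rng.randint(-noise, noise),
            "diagnosis": row["diagnosis"],
        })
    return released


def aggregate(rows, field):
    """Aggregation: release counts per value instead of individual rows."""
    return Counter(row[field] for row in rows)


released = deidentify(records)
summary = aggregate(released, "diagnosis")
```

In a real pipeline the pseudonym table would need to be stored securely or discarded, since anyone holding it can reverse the masking.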
The level of de-identification is often assessed using a measure called k-anonymity. A de-identified dataset is k-anonymous if the data for each individual in the released dataset can’t be distinguished from that of at least (k – 1) other individuals whose data are also included in the set.
Best practices [1]
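One simple way to estimate k for a released table is to group rows by their quasi-identifiers (the fields an outsider could plausibly use to single someone out) and take the size of the smallest group. The sketch below assumes one row per individual; the `k_anonymity` helper, the field names, and the sample rows are all hypothetical.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share the same
    values for every quasi-identifier (0 for an empty dataset)."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


# Hypothetical released rows, one per individual.
released = [
    {"age_band": "30-39", "city": "Montreal", "diagnosis": "asthma"},
    {"age_band": "30-39", "city": "Montreal", "diagnosis": "diabetes"},
    {"age_band": "40-49", "city": "Toronto", "diagnosis": "asthma"},
    {"age_band": "40-49", "city": "Toronto", "diagnosis": "asthma"},
    {"age_band": "40-49", "city": "Toronto", "diagnosis": "flu"},
]

k = k_anonymity(released, ["age_band", "city"])
print(k)  # 2: the smallest group (30-39, Montreal) contains two people
```

If k comes out lower than the target, further generalization or suppression of the quasi-identifiers is the usual remedy.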
The anxiety surrounding de-identification stems from the risk of unintended re-identification of the data, whether by accident or by a malicious party, which would breach the patient’s privacy. The first step is correctly identifying all elements unique to the individual. These are typically things like names, email addresses, and phone numbers but might also include IP addresses and computer identifiers, like MAC addresses. Furthermore, it’s necessary to identify everywhere that unique information occurs: while a patient’s name may be removed from one file, metadata and other hidden forms of data might still contain it.
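A first pass at locating such identifiers can be automated with pattern matching over every text field, including metadata. The sketch below is a minimal example under stated assumptions: the regular expressions are deliberately simple (North American style phone numbers, IPv4 only), the `find_identifiers` helper is hypothetical, and names are not covered at all because they cannot be reliably found with patterns alone.

```python
import re

# Deliberately simple, illustrative patterns; real data needs broader coverage.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "mac": re.compile(r"\b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\b"),
}


def find_identifiers(text):
    """Return a dict mapping identifier type to the matches found in text."""
    hits = {}
    for label, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[label] = matches
    return hits


# Hypothetical clinical note and file metadata; both must be scanned.
note = "Follow up with jane.doe@example.org or call 514-555-0199."
metadata = "uploaded from 192.168.0.12, device 00:1A:2B:3C:4D:5E"

for source in (note, metadata):
    print(find_identifiers(source))
```

Running the same scan over file headers, export logs, and free-text comments helps catch identifiers that survive after the obvious columns have been removed.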
The characteristics of the data are also important in determining both the privacy and utility levels (utility being how useful the information is for its intended research purpose after de-identification). The distribution of the data may play a role in determining privacy level; for instance, in a dataset primarily composed of entries from patients with a specific disease, like bladder cancer, the few entries from patients with another type of cancer will stand out and therefore be more easily identifiable.
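A quick check for this kind of outlier is to count how often each value of a sensitive field occurs and flag the values that fall below a chosen threshold. In the sketch below, the `rare_values` helper, the threshold of 5, and the sample data are assumptions for illustration, not a recommended standard.

```python
from collections import Counter


def rare_values(rows, field, min_count=5):
    """Return the values of `field` that appear fewer than `min_count` times,
    together with their counts."""
    counts = Counter(row[field] for row in rows)
    return {value: n for value, n in counts.items() if n < min_count}


# Hypothetical dataset dominated by one diagnosis.
rows = [{"diagnosis": "bladder cancer"}] * 40 + [{"diagnosis": "kidney cancer"}] * 2

print(rare_values(rows, "diagnosis"))  # {'kidney cancer': 2}
```

Flagged values can then be suppressed, generalized into a broader category, or reviewed by hand before release.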
If encryption is used to mask data, it’s critical to ensure that the ciphertext doesn’t resemble the plaintext values of your data. The output of any encryption system should also remain undecipherable even if everything except the encryption key is made public. To this end, not just any mathematical algorithm will do for encrypting confidential data; the algorithm must be strong enough that the information cannot be re-identified without the key.
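The sketch below illustrates the same principle using only the Python standard library. Note the substitution: it uses keyed hashing (HMAC-SHA256) rather than reversible encryption, so the original value cannot be recovered from the output even with the key, but the key is still what prevents an outsider from recomputing the mapping. The algorithm is public; only the key is secret. The `pseudonymize` helper and the sample name are hypothetical.

```python
import hashlib
import hmac
import secrets


def pseudonymize(value, key):
    """Return a keyed, fixed-length token for `value` using HMAC-SHA256."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# The key must be generated randomly and kept secret; the algorithm need not be.
key = secrets.token_bytes(32)

token = pseudonymize("Ann Lee", key)

# The same input and key always give the same token, so records stay linkable.
assert token == pseudonymize("Ann Lee", key)

# Without the right key, the same name maps to an unrelated token.
other_key = secrets.token_bytes(32)
assert token != pseudonymize("Ann Lee", other_key)
```

A plain unkeyed hash would not meet the requirement, because anyone could hash a list of likely names and compare the results.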
Artificial intelligence in de-identification
Artificial intelligence and machine learning have been incorporated into a variety of medical research applications, and they’ve also been used to develop automated strategies for de-identifying patient data. In 2014 and 2016, several groups developed machine learning-based algorithms for competitions to discern optimal de-identification methods. These systems were primarily composed of combinations of elements, including manually derived rule sets and neural network-derived algorithms. Interestingly, it wasn’t the purely machine learning-based systems that won out; hybrid systems consisting of both machine learning and manually inputted rule sets provided a better level of de-identification in relation to the data’s usability.
The long-term benefits of optimizing the de-identification process are obvious: personal patient information remains confidential while researchers can continue to utilize critical data points for their studies. In practice, de-identification is much harder than it looks, requiring a careful balance between privacy and data usability. Ultimately, as artificial intelligence grows in capability, so too will the power of scientists to generate valuable datasets that retain the privacy and security of the patient.
LabTAG by GA International is a leading manufacturer of high-performance specialty labels and a supplier of identification solutions used in research and medical labs as well as healthcare institutions.
References:
- Krehling. “De-Identification Guideline.” Western Information Security and Privacy Research Laboratory Technical Report WL-2020-01, Western University, Canada, 2020. Available at: https://whisperlab.org/technical-reports/de-identification-guideline-WL2020-01.pdf.
- Yogarajan V et al. A survey of automatic de-identification of longitudinal clinical narratives. arXiv. 2018: 1-20.