“Data Identifiability” refers to the ability of a researcher to identify specific individuals within a research dataset. There are varying degrees of data identifiability, ranging from data that is easily linked to specific individuals based on information present in the dataset, to data for which it is impossible to identify specific individuals based on the information present in the dataset.
You may have already heard several terms used to describe levels of data identifiability. Common terms include “identifiable,” “de-identified,” “coded,” “pseudonymized,” “anonymous,” and “aggregate.” In the context of research, many of these terms are used interchangeably, despite each term having its own specific meaning. This often leads to confusion between researchers and IRB reviewers, since for many research projects, the precise nature of the dataset’s identifiability is crucial in making IRB determinations relating to exemption status and exempt categories, as well as for understanding confidentiality protections and future sharing plans.
The VCU IRB has therefore adopted specific terms and definitions relating to data identifiability, in order to assist investigators in communicating details about their research plans with IRB reviewers. The VCU IRB urges researchers to utilize these definitions when developing and describing research plans regarding data identifiability.
Below are definitions for the terms “directly identifiable,” “de-identified,” “anonymous,” and “aggregate.”
“Directly identifiable” means that identifiers are present in the dataset itself. The data is not coded; that is, the identifiers have not been replaced with a code whose link to those identifiers is kept in a separate key.
Here is an example of data that is stored in a directly identifiable manner.
[Image: a data collection spreadsheet storing directly identifiable data. It has five columns labeled (from left to right) Name, Date of Birth, Last Visit, Gender, and Diagnosis.]
In the first column, we have a direct identifier (name), which is information that can be directly linked back to a specific individual. Direct identifiers include information like names, Social Security Numbers, student ID numbers, Medical Record Numbers, and so on.
This direct identifier is stored alongside indirect identifiers (date of birth; last visit date), or information that can be used in combination with other information to re-identify an individual. In this case, the indirect identifiers are date of birth and date of last visit, but other indirect identifiers might include ZIP code, IP address, or certain demographic information, like ethnicity, race, or occupation, depending on the size of the sample and the distribution of characteristics among the study population.
Both of these types of identifiers are stored alongside data (gender; diagnosis), which is private information collected for research purposes. In this example, the data includes gender and diagnosis. These are variables that, alone, might not identify someone specific, but since the variables are stored in the same document as direct and indirect identifiers, the whole dataset is identifiable.
Directly identifiable data is the highest level of data identifiability, and carries the highest risk of re-identification of specific individuals, because identifiers are stored in the same location as the data. Secondary use of directly identifiable data is considered human subjects research, and will require IRB review.
De-identified data (AKA “coded” or “pseudonymized” data) is data that has had identifiable information removed and replaced with a code (consisting of numbers, letters, symbols, or combinations thereof), AND that code is linked to identifying information in a separate document, known as the “key document” or “code key document.”
Here is an example of data that is stored in a “de-identified” manner.
[Image: a data collection table storing data in a de-identified manner. It is a spreadsheet with three columns (from left to right): code, gender, diagnosis.]
[Image: a code key document. It is a spreadsheet with three columns (from left to right): code, name, date of birth.]
The important takeaway from this example is that these are two separate files. The first, the data collection file, contains only participant data (in this case, gender and diagnosis) along with a code (e.g., 001, 002) that uniquely identifies each participant. The second, the code key document, contains only participant identifiers (in this example, name and date of birth) and links those identifiers to the same code used in the data collection file.
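The two-file arrangement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, field names, and the helper `split_into_coded_data_and_key` are hypothetical, and sequential codes are used simply to mirror the 001, 002 pattern in the example.

```python
# Hypothetical records; the names and values are made up for illustration.
records = [
    {"name": "Jane Doe", "date_of_birth": "1980-03-14", "gender": "F", "diagnosis": "Asthma"},
    {"name": "John Roe", "date_of_birth": "1975-11-02", "gender": "M", "diagnosis": "Diabetes"},
]

# Assumption: these are the only identifier columns in the dataset.
IDENTIFIER_FIELDS = ("name", "date_of_birth")


def split_into_coded_data_and_key(rows, identifier_fields=IDENTIFIER_FIELDS):
    """Return (data_rows, key_rows): two tables that share only a code."""
    data_rows, key_rows = [], []
    for index, row in enumerate(rows, start=1):
        code = f"{index:03d}"  # 001, 002, ...
        # Key document: the code plus the identifiers, and nothing else.
        key_rows.append({"code": code, **{f: row[f] for f in identifier_fields}})
        # Data collection file: the code plus the non-identifying data.
        data_rows.append(
            {"code": code, **{f: v for f, v in row.items() if f not in identifier_fields}}
        )
    return data_rows, key_rows


data_rows, key_rows = split_into_coded_data_and_key(records)
print(data_rows)  # code, gender, diagnosis only
print(key_rows)   # code, name, date_of_birth only
```

In practice the two outputs would be written to separate files with separate access controls, since anyone holding both can re-link the data to individuals.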
With de-identified data, it is still possible to link data back to specific individuals, through the use of the code key document. De-identified data is at a lower data identifiability level than directly identifiable data, and has a lower risk of re-identification of specific individuals, because someone would need access to both the dataset AND the code key document in order to re-identify a specific individual. However, the possibility of re-identification still exists, and so secondary use of de-identified data is often considered human subjects research, and usually requires IRB review (except in specific circumstances, such as when the recipient of the de-identified dataset can demonstrate that they will never be granted access to the key document by the provider of the data).
Anonymous data is data that has had identifiable information removed and replaced with a code (consisting of numbers, letters, symbols, or combinations thereof), AND that code is NOT linked to identifying information in a key document. For data to be truly anonymous, any variables that could be used in combination to identify an individual must be suppressed, removed, or collapsed (aggregated) in order to prevent re-identification.
Here is an example of data that is anonymous.
[Image: a data collection spreadsheet storing data in an anonymous manner. It has five columns (from left to right): code, age group, gender, diagnosis, length of stay.]
In this example, a code has been assigned to each entry, but there is no corresponding key document that links that code back to identifying information. In addition, potential indirect identifiers have been generalized: date of birth has been converted to an age range variable, and the dates of admission and discharge have been replaced with a calculated “length of stay” variable. The remaining information (gender and diagnosis) cannot be used in combination with the other variables to re-identify specific individuals.
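The conversion of indirect identifiers described above might look like the following sketch. The field names, the sample record, the reference date used to compute age, and the 5-year band boundaries (36-40, 41-45, and so on) are all assumptions made for illustration; a real study would choose groupings suited to its own sample.

```python
from datetime import date


def age_on(birth_date, reference_date):
    """Age in whole years on reference_date."""
    had_birthday = (reference_date.month, reference_date.day) >= (
        birth_date.month,
        birth_date.day,
    )
    return reference_date.year - birth_date.year - (0 if had_birthday else 1)


def age_group(age, width=5):
    """Map an age to a band such as '36-40' (bands are 1-5, 6-10, ...)."""
    if age < 1:
        return "<1"  # assumption: infants get their own label
    low = ((age - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"


def generalize(row, reference_date):
    """Replace exact dates with an age band and a length of stay in days."""
    return {
        "age_group": age_group(age_on(row["date_of_birth"], reference_date)),
        "gender": row["gender"],
        "diagnosis": row["diagnosis"],
        "length_of_stay_days": (row["discharged"] - row["admitted"]).days,
    }


# Hypothetical record for illustration only.
row = {
    "date_of_birth": date(1985, 6, 15),
    "admitted": date(2024, 1, 10),
    "discharged": date(2024, 1, 24),
    "gender": "F",
    "diagnosis": "Asthma",
}
anonymous_row = generalize(row, reference_date=date(2024, 1, 10))
print(anonymous_row)
```

Note that this covers only the handling of indirect identifiers. Whether the result counts as anonymous also depends on no key document existing and on the remaining variables not being identifying in combination.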
A note on indirect identifiers: these are pieces of information that could be used in combination with other information to re-identify an individual. For example, one study found that 87% of the US population can be uniquely identified using only date of birth, gender, and 5-digit ZIP code. That study was published in 2000, and advances in technology and the growth of the internet have likely made it even easier since then to identify someone from minimal information.
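One rough way to see this effect in your own dataset is to count how many records are unique on a given combination of indirect identifiers. The sketch below uses made-up rows and a hypothetical helper, `unique_fraction`; it is a simple screening check, not a formal re-identification risk assessment.

```python
from collections import Counter


def unique_fraction(rows, fields):
    """Fraction of rows whose values for `fields` appear exactly once."""
    combos = Counter(tuple(row[f] for f in fields) for row in rows)
    unique = sum(1 for row in rows if combos[tuple(row[f] for f in fields)] == 1)
    return unique / len(rows)


# Hypothetical rows for illustration only.
rows = [
    {"date_of_birth": "1980-03-14", "gender": "F", "zip": "23220"},
    {"date_of_birth": "1980-03-14", "gender": "F", "zip": "23220"},
    {"date_of_birth": "1975-11-02", "gender": "M", "zip": "23220"},
    {"date_of_birth": "1990-07-30", "gender": "F", "zip": "23284"},
]

# A single variable rarely singles anyone out...
print(unique_fraction(rows, ("gender",)))
# ...but combining variables makes more records unique.
print(unique_fraction(rows, ("date_of_birth", "gender", "zip")))
```

A high fraction suggests the combination should be suppressed, removed, or collapsed before the data is described as anonymous.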
Anonymous data is among the lowest levels of data identifiability, and carries a lower risk of re-identification, because direct identifiers are neither present in the data nor associated with it through a code key document. When indirect identifiers are suppressed, removed, or collapsed, this risk of re-identification becomes even lower. Secondary use of anonymous data is generally not considered human subjects research, and its use rarely requires IRB review.
Aggregate data is data which has been combined (summed, categorized, etc.) and presented in summary form. No identifiers or individual-level data are presented.
Here is an example of data that is aggregated.
[Image: a spreadsheet table titled Average Length of Stay (Days). It has three columns (from left to right): Age Group, Men, Women.]
In this example, the data table presents data at the group level. The table shows average length of stay in days, grouped by population characteristics: in this case, 5-year age ranges and gender. For example, the table reports that among men aged 36-40, the average length of stay is 30 days.
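A table like this can be produced from individual-level rows by grouping and averaging, as in the sketch below. The rows, field names, and the helper `average_stay_by_group` are hypothetical; the values were chosen so that one group happens to average 30 days.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical individual-level rows for illustration only.
rows = [
    {"age_group": "36-40", "gender": "M", "length_of_stay_days": 28},
    {"age_group": "36-40", "gender": "M", "length_of_stay_days": 32},
    {"age_group": "36-40", "gender": "F", "length_of_stay_days": 20},
    {"age_group": "36-40", "gender": "F", "length_of_stay_days": 24},
    {"age_group": "41-45", "gender": "F", "length_of_stay_days": 10},
    {"age_group": "41-45", "gender": "F", "length_of_stay_days": 14},
]


def average_stay_by_group(rows):
    """Average length of stay for each (age_group, gender) pair."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["age_group"], row["gender"])].append(row["length_of_stay_days"])
    return {group: mean(stays) for group, stays in groups.items()}


averages = average_stay_by_group(rows)
for (age_band, gender), avg in sorted(averages.items()):
    print(age_band, gender, avg)
```

Only the summary values in `averages` would be shared; the individual-level `rows` are not part of the aggregate dataset.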
Aggregate data is not considered to be identifiable at all, and the risk of re-identification is non-existent, because no individual-level data is present. This means it is not possible to pick out one individual from the dataset to attempt to re-identify, because all information is presented at the group level. Secondary use of aggregate data is not human subjects research, and does not require IRB review.
Investigators are strongly encouraged to adopt this vocabulary when working with the VCU IRB. You should be precise in your use of this language, particularly when describing the nature of secondary data used in the research, when describing the disposition of the study data at the conclusion of the study, and when describing plans for future sharing of the data.