The NIH RFI on genomic data sharing – Consent and genomic data sharing and linking

Doug Fridsma
Mar 2, 2022

This is the fourth in a four-part series on the NIH genomic data sharing policy RFI. This part looks at consent for genomic data sharing and linking.

Framework for consent

We can look at genomic data sharing, linking, and consent in three different situations:

  • Data that is collected and submitted to repositories (or linked) in which all of the data has been consented (clinical trials or IRB-reviewed, consented research),
  • Data that is collected without consent (data in electronic health records and claims data), and
  • Data in which some of the data has been consented but other linked data has not (for example, linking genomic information from clinical trials with real world data found in EHRs)

How should we handle data sharing and linking genomic data collected with consent?

For patients who participate in research and can consent to the use of their data, sharing that data within protected enclaves, and linking it to other data that the patient has consented to share (from a registry or other IRB-reviewed research), should be permitted. It is still important that the data be protected as PII, and either used within access-controlled repositories or de-identified before it is used for other purposes. With meaningful informed consent, patients can be fully aware of the risks and benefits of sharing data within repositories and linking it to other data about them.

What about linking real world data in which some (or all) of the data is collected without consent?

Data that is collected for care delivery but not consented for secondary use is pervasive within EHRs, claims, and consumer data. As genetic testing becomes more widespread, genomic data will be present in these other data sources, for which no consent for sharing or linking has been provided.

When data has been collected without consent, it should be de-identified before it is shared or linked, and reviewed so that the risk of re-identification remains low.
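One simple way to make "risk of re-identification" concrete is to count how many records share the same combination of quasi-identifiers. The sketch below is only an illustration of that idea (a k-anonymity style check), not a substitute for a formal review; the field names (`zip3`, `birth_year`, `sex`), the threshold `k`, and the sample rows are all hypothetical.

```python
from collections import Counter


def group_key(row, quasi_identifiers):
    """Build the tuple of quasi-identifier values that defines a row's group."""
    return tuple(row[q] for q in quasi_identifiers)


def smallest_group_size(rows, quasi_identifiers):
    """Size of the smallest group of rows that share all quasi-identifier values."""
    groups = Counter(group_key(row, quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


def rows_below_k(rows, quasi_identifiers, k):
    """Rows whose quasi-identifier combination is shared by fewer than k rows."""
    groups = Counter(group_key(row, quasi_identifiers) for row in rows)
    return [row for row in rows if groups[group_key(row, quasi_identifiers)] < k]


# Hypothetical de-identified rows (no names, no full dates, truncated ZIP codes).
rows = [
    {"zip3": "021", "birth_year": 1980, "sex": "F"},
    {"zip3": "021", "birth_year": 1980, "sex": "F"},
    {"zip3": "021", "birth_year": 1980, "sex": "F"},
    {"zip3": "945", "birth_year": 1952, "sex": "M"},
]
QUASI_IDENTIFIERS = ["zip3", "birth_year", "sex"]

smallest = smallest_group_size(rows, QUASI_IDENTIFIERS)
flagged = rows_below_k(rows, QUASI_IDENTIFIERS, k=3)
```

In this toy data one row is unique on its quasi-identifiers, so it would be flagged for generalization or suppression before sharing. Genomic data adds its own re-identification risks that a count like this does not capture.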

There are examples in which this kind of data can be useful for clinical research, and can support research that can (and should) provide patients with informed consent. For example, it should be possible to use privacy-preserving record linkage (PPRL) methods to identify patients who have data in two different datasets (without exposing PII) or to identify cohorts within datasets. If these datasets are linked, then expert determination should be used to ensure that the new datasets are still appropriately de-identified.

In linking to data that has been collected without consent, it is important not to expose an individual’s PII as part of the linking process. Linkages should use privacy-preserving mechanisms, and the resulting linked dataset should be evaluated to ensure that its risk of re-identification remains negligible.
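As a rough illustration of how a privacy-preserving linkage can work, the sketch below replaces identifying fields with a keyed one-way token (HMAC-SHA256) and joins two datasets on that token alone. This is a minimal sketch under stated assumptions, not a description of any particular vendor's or agency's method: the function names, the choice of fields, the shared key, and the sample records are all hypothetical, and real deployments also have to handle key management and fuzzy matching.

```python
import hashlib
import hmac
import unicodedata


def normalize(value: str) -> str:
    """Lowercase, strip accents and punctuation so trivial variations still match."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.lower() if ch.isalnum())


def make_token(secret_key: bytes, first: str, last: str, dob: str) -> str:
    """Derive a keyed, one-way token from identifying fields."""
    message = "|".join(normalize(part) for part in (first, last, dob))
    return hmac.new(secret_key, message.encode("utf-8"), hashlib.sha256).hexdigest()


def tokenize(records, secret_key):
    """Drop the identifying fields and key each non-identifying payload by its token."""
    return {
        make_token(secret_key, r["first"], r["last"], r["dob"]): r["payload"]
        for r in records
    }


def link(tokens_a, tokens_b):
    """Join two tokenized datasets on shared tokens; no PII is exchanged."""
    return {
        token: {**tokens_a[token], **tokens_b[token]}
        for token in tokens_a.keys() & tokens_b.keys()
    }


# Hypothetical key and records, for illustration only.
SHARED_KEY = b"example-key-held-by-a-trusted-party"
trial = [
    {"first": "Ana", "last": "O'Neil", "dob": "1980-02-03", "payload": {"variant": "BRCA1"}},
]
claims = [
    {"first": "ANA", "last": "ONeil", "dob": "1980-02-03", "payload": {"claims_count": 4}},
    {"first": "Bo", "last": "Lee", "dob": "1975-07-09", "payload": {"claims_count": 1}},
]

linked = link(tokenize(trial, SHARED_KEY), tokenize(claims, SHARED_KEY))
```

Each data holder tokenizes its own records before anything leaves its environment, so only tokens and non-identifying payloads are ever compared. The linked output would still need review for re-identification risk, since combining payloads can make a record more distinctive.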

What if some of the data is collected with consent to link, and some is not?

An increasingly common scenario is when a patient consents to participate in an IRB-reviewed clinical study, and consents to the use and linkage of that data. In this case, the clinical study data has been properly consented, but it is combined and linked with other data sources that may not have been consented for use. This could include individual-level claims or EHR data, or aggregate data related to social determinants of health. In either case, linkages between consented data and non-consented data can increase the risk of re-identification.

This scenario is similar to the one above, and should be treated as if none of the data has been consented. The non-consented data should be de-identified, and no PII should be shared for either linking or analysis. PPRL methods can still permit privacy-preserving linkages. The resulting dataset should be reviewed to ensure that the new, linked dataset still conforms to the uses and restrictions of the original consented data.

For example, if the consented data is to be held in access-restricted enclaves, the new dataset should be held in that same environment. If the consented data was de-identified before it was shared, then the resulting dataset should be reviewed to be sure that it remains de-identified, and if it does not, remediation should be applied to ensure it falls under the consented uses. The NIH should make sure that meaningful informed consent obtained for both sharing and linking considers the scenarios in which that data may be used, and follows best practices to protect the privacy and confidentiality of the patient’s data.

Should we require consent for linking datasets together?

The RFI asks whether data linkage should be addressed when obtaining consent for sharing and future use of data under the GDS Policy, as well as in IRB consideration of risks associated with submission of data to NIH genomic data repositories, and if so, how to ensure such consent is meaningful.

As described above, when patients participate in IRB-reviewed, consented research, future studies should consent participants for both sharing data and linking data to other datasets. In this setting, the informed consent should include scenarios for sharing and the conditions under which that data will be shared (de-identified, identified within a protected enclave, restricted access with IRB review and controls, etc.). It should also include scenarios outlining how that data might be linked (no linkages allowed, de-identified links allowed, fully identifiable links allowed) and how the investigators plan to protect the data.
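If consent captures these scenarios explicitly, they can be recorded in a structured form and checked before any sharing or linkage takes place. The sketch below is one hypothetical way to encode the options listed above; the class names, the ordering of linkage levels, and the example consent are assumptions made for illustration, not part of the GDS Policy or any existing consent standard.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class SharingCondition(Enum):
    """Conditions under which consented data may be shared."""
    DEIDENTIFIED = "de-identified"
    IDENTIFIED_IN_ENCLAVE = "identified within a protected enclave"
    RESTRICTED_IRB = "restricted access with IRB review and controls"


class LinkagePermission(IntEnum):
    """Linkage levels, ordered from most to least restrictive."""
    NONE = 0
    DEIDENTIFIED = 1
    IDENTIFIABLE = 2


@dataclass(frozen=True)
class Consent:
    """What a participant agreed to for sharing and for linking."""
    sharing: frozenset
    linkage: LinkagePermission


def sharing_allowed(consent: Consent, condition: SharingCondition) -> bool:
    """True if the participant consented to sharing under this condition."""
    return condition in consent.sharing


def linkage_allowed(consent: Consent, requested: LinkagePermission) -> bool:
    """True if the requested linkage is no more permissive than what was consented."""
    return requested <= consent.linkage


# Hypothetical participant: de-identified sharing and de-identified links only.
consent = Consent(
    sharing=frozenset({SharingCondition.DEIDENTIFIED}),
    linkage=LinkagePermission.DEIDENTIFIED,
)

can_link_deidentified = linkage_allowed(consent, LinkagePermission.DEIDENTIFIED)
can_link_identifiable = linkage_allowed(consent, LinkagePermission.IDENTIFIABLE)
can_share_in_enclave = sharing_allowed(consent, SharingCondition.IDENTIFIED_IN_ENCLAVE)
```

Treating the linkage levels as ordered is a design choice: it assumes that a participant who permits fully identifiable links would also permit de-identified ones, which a real consent form would need to state explicitly.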

However, for data that is obtained in other settings (from EHRs or claims data, or repurposed for secondary uses in research), there should be no expectation of obtaining meaningful informed consent. In these circumstances, data should be held in secure enclaves to minimize the risk of unauthorized access; every effort should be made to remove identifiable information; modern, privacy-preserving techniques for linking datasets should be used; and expert determination should be used to assess the risk of re-identification in the linked datasets.

What about data sharing plans that the NIH requires, but does not score as part of a grant application?

As suggested above, the importance of good data sharing and linking plans cannot be overstated. If the NIH believes that good data sharing and linking is essential to scientific advancement as well as to protecting patients who participate in clinical studies, it is imperative that the NIH make this a scored element on grant applications, and that studies with inadequate plans for protecting patient data be removed from consideration. Without a clear incentive to ensure the safety of genomic data, data sharing plans will remain an afterthought, and will not receive the attention that these important data resources require.

Previous: The NIH RFI on genomic data sharing — Protecting genomic data

Doug Fridsma

Doug is currently the Chief Medical Informatics Officer at Health Universe and a senior advisor for Datavant Inc. He was previously the Chief Science Officer for ONC.