The NIH RFI on genomic data sharing — Historical context
The NIH has taken a good step forward in revisiting (and potentially updating) the NIH genomic data sharing policy. In this, the second of four blog posts, I’d like to provide the historical background of genomic data sharing policy and explain why this particular update is important.
Has genomic information always been considered personally identifiable?
Remarkably, the recognition that genomic information is highly personally identifiable is a relatively new concept. When genetic sequencing tools were rapidly advancing in the late 1990s and early 2000s, most people considered genetic sequences unidentifiable unless they were linked to identifying information. Since then, we’ve learned a lot about how genetic information can be used to identify individuals or groups.
What is GINA and why is it important when we are talking about genomic information sharing?
The first real legislation was the Genetic Information Nondiscrimination Act of 2008, commonly referred to as GINA. While this legislation was primarily aimed at preventing discrimination based on genetic information by health insurers (Title I) and employers (Title II), it also required defining genetic information as health information under HIPAA. A number of state laws followed GINA (most notably CalGINA) and provided additional protections from genetic discrimination with respect to emergency services, education, housing and other provisions. And although genetic information was now considered health information, it was not considered highly identifiable, and it was often made publicly available through research databases and shared among researchers.
When did our approach to protecting genomic information change?
In 2013, however, Yaniv Erlich published a study showing that by using short tandem repeats (STRs) and queries to publicly accessible online genealogy databases, he could re-identify the surnames of a significant number of the individuals represented in publicly shared research samples. With further linking (again, to publicly available datasets), the exact identity of individuals could be revealed.
The HIPAA Omnibus rule in 2013 attempted to address Erlich's findings and was a required follow-up to the GINA legislation. Under the HIPAA amendments, genetic information was now classified as health information (Section 105), and so any uniquely identifying genetic information was considered PHI for the first time.
However, genetic information was not classified as one of the 18 direct identifiers, and even when those 18 identifiers are redacted, the risk of re-identification (as demonstrated by Erlich) is high.
What does the current NIH policy on genomic data sharing say?
The current policy, released in 2014, can be found here. Since this policy was published, we have gained greater experience with (and knowledge of) the risks of re-identification in genomic data, a more sophisticated understanding of expert determination for genetic information, and new techniques that allow datasets to be linked in ways that preserve patient privacy. While sharing non-human-subjects data is relatively easy, the policy's approach to sharing data that contains human genetic information, particularly for large datasets, relies on data management restrictions. It says very little about linking and sharing data using more recent privacy-preserving linkage technologies.