The NIH genomic data sharing RFI — Make data-sharing plans scored within a re-identification framework

Doug Fridsma
Mar 2, 2022

This is the first in a four-part series on the NIH genomic data sharing RFI, focused on privacy, data sharing, linking, and consent.

It is encouraging to see a comprehensive update to the NIH data sharing policies for genomic data. Genomic data is critical to the advancement of science and essential for research into disease prediction, prevention, and novel therapies. The value of genomic data to the health and welfare of society cannot be overstated, and I’m glad to see NIH reexamine how to share this information in ways that protect individual privacy while recognizing the value that this data has to society. It is important to find responsible ways to share genomic information that both protect an individual’s information and create opportunities for research and discovery.

The challenge, of course, is that highly specific sequence data is necessary to advance our understanding of science. Redacting genetic information can render it useless for scientific research. Yet because genetic data is unique to an individual, we need a more sophisticated approach to protecting genomic information, and a framework for evaluating the risks of re-identification.

As we think about a framework for assessing re-identification risk, we should recognize that not all genomic data is the same or carries the same risk of re-identification:

  • Tumor genetic information versus germline (heritable) genetic information. Mutations acquired in a tumor would have a much lower risk of re-identification when compared to the germline sequence a person inherits.
  • The length of the sequence. The risk of re-identification rises as sequence length increases, so shorter sequences will have a lower risk of re-identification than long genetic sequences.
  • Rare diseases or sequences versus those more common in populations. Frequency information about genetic mutations, or baseline rates of a particular genetic sequence, will be important for statistically evaluating re-identification risk.
  • The comprehensiveness or number of heritable sequences in a dataset, which can also be used to identify related individuals. For example, a dataset with a small or limited number of SNPs will have a lower risk of re-identification compared to a dataset with numerous or comprehensive sets of SNPs.
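
To make the last two points concrete, here is a rough sketch of how allele frequency and the number of SNPs might feed into a statistical estimate. It is not an NIH method or a validated risk model. It assumes Hardy-Weinberg equilibrium, independence between SNPs, and biallelic markers, and the function names and the example panels are hypothetical, chosen only for illustration.

```python
import math


def genotype_probability(maf: float, minor_allele_count: int) -> float:
    """Chance that a random person carries 0, 1, or 2 copies of the minor allele.

    Assumes Hardy-Weinberg equilibrium at a biallelic SNP with minor
    allele frequency `maf` (an illustrative simplification).
    """
    if not 0.0 <= maf <= 1.0:
        raise ValueError("maf must be between 0 and 1")
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor_allele_count must be 0, 1, or 2")
    k = minor_allele_count
    return math.comb(2, k) * maf**k * (1.0 - maf) ** (2 - k)


def random_match_probability(profile) -> float:
    """Chance that a random person matches every genotype in `profile`.

    `profile` is a list of (maf, minor_allele_count) pairs. Multiplying
    the per-SNP probabilities assumes the SNPs are independent.
    """
    return math.prod(genotype_probability(maf, k) for maf, k in profile)


def expected_matches(profile, population_size: int) -> float:
    """Expected number of *other* people in the population with the same profile."""
    return (population_size - 1) * random_match_probability(profile)


# Hypothetical panels: heterozygous calls at common SNPs (maf = 0.5).
small_panel = [(0.5, 1)] * 10
large_panel = [(0.5, 1)] * 40

print(expected_matches(small_panel, 1_000_001))
print(expected_matches(large_panel, 1_000_001))

# A rare variant narrows the pool far more than a common one does.
print(genotype_probability(0.01, 1), genotype_probability(0.5, 1))
```

Under these assumptions, the expected number of matching individuals shrinks quickly as SNPs are added or as variants become rarer. The fewer people who share a profile, the easier it is to single one person out, which is the intuition behind the bullets above. Real data would violate the independence assumption (nearby SNPs are correlated, and relatives share sequence), so a production risk framework would need a more careful model.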

So how do we create a policy and process that is capable of keeping up with a dynamic and rapidly changing data sharing environment, particularly as new technologies to protect the privacy of patient information emerge? I would strongly urge the NIH to use the scientific peer review process to create incentives for more novel and secure ways of protecting genomic information. Data sharing plans that are required for grant submissions should be reviewed and scored in the same way that other aspects of a grant (significance, methods, prior work) are reviewed and scored. Without making the data sharing plans a scored element, the NIH signals to the research community that data sharing plans are an unimportant afterthought. The NIH should put these plans front and center in all grant applications, and these should be reviewed rigorously.

I explore these ideas (and more) in the rest of the blogs in this series. The next entry covers the history of regulation and our evolving understanding of the nature of genomic information:

Next: The NIH RFI on genomic data sharing — Historical context

--

Doug Fridsma

Doug is currently the Chief Medical Informatics Officer at Health Universe and a senior advisor for Datavant Inc. He was previously the Chief Science Officer for ONC.