An FAQ written by Doc (Michael) Edge and Graham Coop on their paper about genetic genealogy & privacy (pdf link here).
The preprint is scheduled to appear on Oct 22nd and should be available at this link.
What is this paper about?
Our paper is about genetic privacy concerns related to a subset of direct-to-consumer (DTC) genetics services. The largest DTC genetics companies are Ancestry and 23andMe, but our paper is not directly about them—it’s about services that allow users to upload their own genetic datasets.
Several DTC genetics services, including GEDmatch, MyHeritage, FamilyTreeDNA, and LivingDNA, allow people who have been genotyped by other services to upload their data to their databases. So, imagine that you’ve been genotyped by 23andMe but want to find genetic relatives who were genotyped by other services. One option is to be genotyped by more genetic genealogy companies, but another option—usually cheaper and sometimes free—is to download your 23andMe genetic data and upload it to some other services. This helps genealogy enthusiasts find more genetic relatives for less money, and it helps the smaller DTC services grow their databases as well.
The potential problem to which we want to draw attention is that allowing users to upload their own datasets can present serious concerns about genetic privacy. In the paper, we describe some ways that a motivated person (we’ll call this person an “adversary”) could compromise the privacy of people in a DTC genetics database by uploading several genetic datasets, either real or fake. Under some circumstances, an adversary could reveal most of the genetic information of most people in a DTC database by uploading a few hundred datasets and aggregating the information returned by the DTC service. Many genetic genealogy services return the full names of relative matches, and some include email addresses. Therefore, in some cases, large amounts of identifiable genome-wide data may be obtainable by a motivated adversary. We also describe some actions that DTC services can take to limit these risks.
Why did you write this paper?
DTC genetics is a rapidly growing industry, and genetic genealogy is a major driver of demand. Many people love having the ability to find their genetic relatives and piece together their family trees. The ability to search for genetic relatives can be especially valuable for people who are missing information about their genetic relatives, including adoptees, biological children of sperm donors, and descendants of slaves or Holocaust survivors, among others. We want people to be able to continue to use these resources, but also to be able to do so as safely as possible. In writing this paper, we want to clarify some privacy risks so that companies offering DTC services can limit them and so that consumers can be informed. In using personal genomics sites, there is always a tradeoff between privacy risk and the ability to learn something new about your family. Our view is that clearly communicating these risks is the best way forward.
Doesn’t publishing this information make it possible for people to exploit the kinds of privacy vulnerabilities you describe?
To help protect users’ privacy, we wrote to all the entities we could find that currently offer or recently offered upload services—GEDmatch, MyHeritage, FamilyTreeDNA, LivingDNA, and DNA.LAND—ninety days before posting our manuscript. In our letters to them, we outlined the privacy risks we saw and some methods for limiting or preventing them.
Though sharing this information makes it available to people motivated to compromise genealogy enthusiasts’ privacy, it also makes it available to the people who run genetic genealogy services, to potential customers and the public, and to anyone thinking of founding a new DTC genetic genealogy service. We believe that having this information out in the open makes it easier to protect people’s privacy. And if we did not publish this information, there would be nothing to prevent someone motivated to compromise people’s privacy from figuring it out themselves. The ideas underlying these approaches are not very complex and not too difficult to implement. Thus, it seems best to lay out the issues clearly.
How can uploading genetic datasets to genealogy services potentially compromise the privacy of people in the database?
We describe three different approaches that an adversary could take to identify people’s genotype information, which we call “IBS tiling,” “IBS probing,” and “IBS baiting.” “IBS” stands for “identity by state,” a phrase that geneticists use to talk about segments of the genome where two people’s genotypes partially match. When such a segment is a long stretch of the genome, that indicates that the two individuals both inherited this genetic material from a recent common ancestor.
What is IBS tiling?
Because genetic information is inherited in large chunks broken up by recombination, you can think of a person’s genome as a mosaic of pieces inherited from a set of ancestors who lived, say, within the last 20 generations. If you look at ancestors from recent generations, the tiles in the mosaic will be big, because they have not been broken up by many recombination events. The tiles from ancestors farther back in the past will be small, because they have passed down to you through many generations, and the recombination events in each of those generations will have chipped away at the tiles. DTC genetics companies identify genetic relatives by looking for people who share big tiles. But you will tend to share small tiles with lots of people even if they are not closely related to you.
When a DTC genetics company reports that you share a tile with another user in their database, you learn something about the other user’s genotypes. After all, you know your own genotypes, and if the location of the shared tile is reported (it often is), then you know that the other user shares genetic variants with you within the tile. To perform IBS tiling, an adversary would upload lots of genetic datasets and keep track of their shared tiles with everyone in the database. Depending on the way in which the DTC service identifies and reports shared tiles of genetic information, an adversary that uploads enough genotypes could aggregate that information to find out a lot about the genotypes of people in the database. An adversary could gather genotypes to upload from a subset of the many publicly available genetic datasets used for research.
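The aggregation step can be pictured as a simple interval-coverage calculation. The sketch below is only an illustration under stated assumptions: the segment coordinates, the 100 cM chromosome length, and the function names are all hypothetical, and it does not model how any real service computes or reports matches. It shows how short reported segments from several uploads add up to coverage of a target's chromosome, and how a minimum length threshold on the service side removes that coverage.

```python
from typing import List, Tuple

# A reported shared segment: (start, end) on one chromosome, in centiMorgans.
Segment = Tuple[float, float]


def merge_segments(segments: List[Segment]) -> List[Segment]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(
    segments: List[Segment], chrom_length: float, min_length_cm: float = 0.0
) -> float:
    """Fraction of a chromosome covered by segments at least min_length_cm long."""
    kept = [s for s in segments if s[1] - s[0] >= min_length_cm]
    total = sum(end - start for start, end in merge_segments(kept))
    return total / chrom_length


# Hypothetical segments that several uploaded datasets share with one target user.
reported = [(0.0, 4.0), (3.0, 9.0), (20.0, 26.0), (25.0, 30.0)]

print(merge_segments(reported))                             # [(0.0, 9.0), (20.0, 30.0)]
print(covered_fraction(reported, 100.0))                    # 0.19
print(covered_fraction(reported, 100.0, min_length_cm=8.0)) # 0.0
```

In this toy example every individual segment is shorter than 8 cM, so a service that reported only segments of 8 cM or longer would reveal none of them.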
What is IBS probing?
The second approach we describe is called “IBS probing.” IBS probing is similar to IBS tiling, but an adversary could use it to find people who have a specific genetic variant of interest in a DTC service that reports only whether two people share a matching genetic tile and not where the match occurs. The idea is that the adversary fills in most of the genome with fake data that is designed to look unlike actual human genetic data, and thus not to match anyone in the database. The only exception is that in a small region of the genome, the adversary uses real data containing the genetic variant of interest. Thus, any matches returned by the database are likely to be people who have the genetic variant that the adversary is interested in.
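A toy model may make the logic of probing easier to follow. Everything in the sketch below is an assumption made for illustration: genomes are reduced to twelve sites with one allele each, the matcher simply looks for a run of identical alleles, the placeholder value 9 stands in for "data designed to match no one," and the user names are invented. Real services use diploid genotypes and far more sophisticated matching.

```python
from typing import Dict, List


def longest_shared_run(a: List[int], b: List[int]) -> int:
    """Length of the longest run of consecutive sites where a and b are identical."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


def report_matches(
    query: List[int], database: Dict[str, List[int]], min_run: int
) -> List[str]:
    """Toy service: report only WHO matches, not WHERE the match occurs."""
    return [
        name
        for name, genome in database.items()
        if longest_shared_run(query, genome) >= min_run
    ]


# Hypothetical database of three users, 12 sites each (alleles coded 0 or 1).
database = {
    "user_a": [0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0],
    "user_b": [0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0],
    "user_c": [1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0],
}

# Probe: real data only at sites 4-7 (carrying the variant of interest);
# everywhere else a placeholder (9) that cannot match any real allele.
NO_MATCH = 9
probe = [NO_MATCH] * 4 + [1, 0, 1, 1] + [NO_MATCH] * 4

print(report_matches(probe, database, min_run=4))  # ['user_a', 'user_c']
```

Because the probe can only match inside the small real region, the list of matches by itself reveals who carries that stretch of sequence, even though the toy service never reports a location.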
What is IBS baiting?
The third approach is “IBS baiting,” and it relies on tricking a particular class of algorithms that is sometimes used to identify relatives. This class of algorithms does not represent the cutting edge of computational tools to identify IBS, but algorithms in this class potentially allow DTC genetics services to skip a data processing step that has been hard to perform in large datasets until recently. (The skippable step is called “phasing,” which is the attempt to identify genetic variants that occur together on the same strand of DNA.) If a DTC service uses one of these methods to detect relatives, then it might be possible for an adversary to upload pairs of fake datasets that reveal the genotypes of every user in the database at hundreds of places in the genome. Once enough genotypes are gathered by IBS baiting, it’s possible to use well-established algorithms to fill in the rest of the genotypes (this is called “genotype imputation”).
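To see why skipping phasing matters, consider a deliberately simplified model of a phase-unaware matcher. This is an assumption for illustration, not the algorithm of any particular service: genotypes are coded as 0, 1, or 2 copies of one allele, and two people are treated as matching across a window unless they are opposite homozygotes (0 versus 2) at some site. Under that rule, a fake dataset that is heterozygous (1) everywhere can never be ruled out, so making it homozygous at a single site turns the match result into a readout of the other person's genotype at that site.

```python
from typing import List


def phase_unaware_match(a: List[int], b: List[int]) -> bool:
    """Toy matcher: a window matches unless some site has opposite homozygotes."""
    return not any(abs(x - y) == 2 for x, y in zip(a, b))


def make_bait(n_sites: int, site: int, genotype: int) -> List[int]:
    """Fake genotype: heterozygous everywhere except one homozygous site."""
    bait = [1] * n_sites
    bait[site] = genotype
    return bait


def infer_genotype(target: List[int], site: int) -> int:
    """Infer the target's genotype at one site from two match/no-match answers.

    The target is passed in only so the toy matcher can be called; the
    inference itself uses nothing but the two boolean match results.
    """
    matches_0 = phase_unaware_match(make_bait(len(target), site, 0), target)
    matches_2 = phase_unaware_match(make_bait(len(target), site, 2), target)
    if matches_0 and matches_2:
        return 1  # heterozygous: compatible with both baits
    return 0 if matches_0 else 2


# Hypothetical target genotypes at five sites.
target = [0, 2, 1, 2, 0]
recovered = [infer_genotype(target, s) for s in range(len(target))]
print(recovered)  # [0, 2, 1, 2, 0]
```

A phase-aware method would require an actual shared haplotype rather than mere absence of opposite homozygotes, so an all-heterozygous fake dataset would not be treated as compatible with everyone.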
There seem to be large data breaches reported in the news often. Would a breach of a genetic database be any different from those?
Yes, digital security is an increasingly important issue, and major data breaches happen every year. In many cases, these breaches involve passwords, email addresses, or credit card information of customers.
Genetic data is especially important to protect. First off, genetic data could be used for discrimination. Though prediction of traits from genetic information is not very accurate for most traits, it is possible and even likely that the accuracy will increase. When it does, it might be possible to predict important health outcomes, which could lead to discrimination. For example, in the United States, health insurance companies are not allowed to discriminate on the basis of genotype because of the Genetic Information Nondiscrimination Act (GINA). However, GINA does not explicitly disallow discrimination for other kinds of insurance (such as life insurance). GINA may also be repealed in the future, and protections against genetic discrimination vary among states and countries.
Genetic data also has two special features that make it different from many other kinds of sensitive information, such as a credit card number. First, whereas we can change a credit card number or even a social security number (perhaps with a lot of hassle), we cannot change our genotypes. Second, whereas my credit card number does not reveal anything about the credit card numbers of my children or other genetic relatives, my genetic information can reveal something about my genetic relatives’ genotypes.
Couldn’t a criminal just hack into one of these databases by doing…whatever it is computer hackers normally do?
Yes, DTC genetics companies have to deal with all the same data security issues that other companies do. The difference is that the methods we talk about in our paper would only work on genetic data—they take advantage of the structure of genetic variation and the way it is distributed among people, or of the algorithms used to identify genetic relatives. To perform the attacks we describe, an adversary would need to know something about genetics and genetic datasets but would not need to know much about security hacking. The adversary would simply be uploading datasets and aggregating the information returned.
A direct-to-consumer genetics company has my genotype information. Should I be worried because of what’s in your paper?
Not necessarily. It’s important to understand that sharing your genetic information with any company or other organization will always entail some degree of privacy risk. The people we have corresponded with at DTC genetics services that allow uploads have all assured us that they take privacy seriously and that they either will take action or have already taken action to prevent the types of attacks we describe.
On the basis of our results, we do have concerns about the privacy of GEDmatch users. As of mid-December, GEDmatch uses length thresholds for displaying matching segments that are too short, allowing for effective IBS tiling attacks, and GEDmatch also appears to use phase-unaware IBD detection methods, allowing for IBS baiting attacks.
In late November 2019, we demonstrated IBS baiting in GEDmatch using a small number of uploaded artificial genotypes. Before uploading any data to GEDmatch, we first confirmed our planned procedure with the UC Davis IRB and with GEDmatch representatives. We used artificial kits and compared them only to each other, and so avoided interacting with any genotype data of real GEDmatch users and did not violate GEDmatch’s terms and conditions. As of December 15, 2019, GEDmatch was still vulnerable.
The other active services (MyHeritage, FamilyTreeDNA, and LivingDNA) are likely substantially less vulnerable than GEDmatch to the attacks we describe here. LivingDNA does not provide a chromosome browser, precluding IBS tiling attacks. MyHeritage and FamilyTreeDNA use thresholds for revealing matching segment locations that make IBS tiling much less efficient. (However, FamilyTreeDNA’s practice of showing matches as short as 1cM given that two people share at least one long match is still somewhat permissive.) Representatives of MyHeritage, FamilyTreeDNA, and LivingDNA have confirmed to us that their IBD-calling algorithms rely on phased data, which should preclude IBS baiting. (We have not tested this ourselves.) DTC genetic genealogy is a growing field, and any new entities that begin offering upload services may also face threats of the kind we describe.
Can the privacy attacks you describe be prevented?
Yes, to a large extent. We describe a set of policies that DTC genetics services could adopt to limit or prevent the attacks we describe. Some DTC services had already adopted many of these policies when we wrote to them, and others mentioned possible plans to adopt some. These changes tend to involve a trade-off for users: in some cases, better privacy protection means that genealogy enthusiasts have less information to work with, and so services or their user bases might reasonably decide not to adopt some of these suggestions. At the same time, some changes help protect privacy without blocking much information that a genealogist would find useful. The possible policies we suggest are:
Only report shared genetic segments if they are long (we suggest 8 centiMorgans as a possible length threshold).
Do not report the chromosomal locations of matching genetic segments, only their length and number.
Require uploaded genotypes to be cryptographically signed to indicate that the source of the file is a trusted genotyping company. (This would require cooperation between several genotyping services.)
Report only a small number of relatives per uploaded genotype file (we suggest 50, but other or more flexible limits might be set).
Disallow searches of arbitrary genotype files against each other.
Block uploads of publicly available genotype data.
Block uploads with evidence of segments designed to have no matches in the database.
Block uploads with long heterozygous segments or with segments that match many more people than would be expected.
Use phase-aware methods for detecting genetic relatives.
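Three of the policies above (a length threshold, hiding chromosomal locations, and limiting the number of relatives reported) could be combined in the reporting step of a service. The following is a minimal sketch under assumptions: the `SharedSegment` record, the `build_report` function, and the example data are hypothetical, and the defaults of 8 cM and 50 relatives are simply the values suggested in the list.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SharedSegment:
    relative_id: str
    chromosome: int
    start_cm: float
    end_cm: float

    @property
    def length_cm(self) -> float:
        return self.end_cm - self.start_cm


def build_report(
    segments: List[SharedSegment],
    min_length_cm: float = 8.0,
    max_relatives: int = 50,
) -> List[Dict[str, object]]:
    """Summarize matches without revealing where in the genome they occur."""
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for seg in segments:
        if seg.length_cm < min_length_cm:
            continue  # drop short segments entirely
        totals[seg.relative_id] = totals.get(seg.relative_id, 0.0) + seg.length_cm
        counts[seg.relative_id] = counts.get(seg.relative_id, 0) + 1
    # Keep only the closest relatives, ranked by total shared length.
    ranked = sorted(totals, key=lambda rid: totals[rid], reverse=True)[:max_relatives]
    # Report length and number of segments only; no chromosome or coordinates.
    return [
        {"relative_id": rid, "total_cm": totals[rid], "n_segments": counts[rid]}
        for rid in ranked
    ]


# Hypothetical raw matches for one uploaded file.
raw = [
    SharedSegment("alice", 1, 0.0, 20.0),
    SharedSegment("alice", 2, 5.0, 10.0),   # 5 cM: below threshold
    SharedSegment("bob", 3, 10.0, 19.0),
    SharedSegment("carol", 1, 0.0, 3.0),    # 3 cM: below threshold
]

print(build_report(raw))
```

With these inputs the report lists alice (20 cM in one segment) and bob (9 cM in one segment); carol does not appear because her only shared segment is below the threshold.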
I have read news about law enforcement using databases like GEDmatch and FamilyTreeDNA. How does that fit in here?
Yes, in the last two years, genealogists working with law enforcement have uploaded genetic information to DTC genetic genealogy services, attempting to identify the sources of crime-scene samples or missing persons by identifying their relatives. This practice is called long-range familial search, and it received widespread public attention after being used to identify a subject in the decades-old Golden State Killer case. There has been little regulation or oversight of this practice until recently, when the Department of Justice released a set of interim guidelines for long-range familial searches, with a permanent policy to follow soon. Currently, GEDmatch users may opt in to being considered in law enforcement searches, and FamilyTreeDNA users may opt out. Other DTC services are not known to have been searched by law enforcement.
One implication of our paper is that a user who has uploaded many genotypes to a DTC genetic genealogy service may be able to access a lot of information about users in the database via the method we call IBS tiling. Companies that have cooperated with law enforcement to perform many investigations, such as Parabon Nanolabs and Bode Technologies, have uploaded dozens or hundreds of datasets to GEDmatch and/or FamilyTreeDNA. We have no reason to think that these companies are engaging in IBS tiling or storing any information that is not directly pertinent to their searches. Still, data management policies for long-range familial searching should be designed to prevent IBS tiling and the accidental acquisition of genotype information for many people.
What’s the bigger take-away from the paper?
As medical genomics and personal genomics spread into many aspects of our lives, we as individuals and societies need to balance their promises and pitfalls. The storage of large amounts of genetic data necessarily brings a range of privacy issues, some of which may only come to light after people have shared their information (for example, the long-range familial searches came as a surprise to many genetic genealogists). Given the sensitive nature of genetic information, we as a society need to be proactive about avoiding its misuse. We need genetic discrimination laws to be more comprehensive to ensure that personal genomics users are not exposed to discrimination, and we need tools in place to ensure that people can determine when and how their genetic information is used. We also need greater transparency from genomics companies, and organisations that interface with these companies, to allow users confidence in exploring their family histories and personal genomics.