An FAQ written by Doc (Michael) Edge and Graham Coop on their paper about genetic genealogy & privacy (pdf link here).
The preprint is scheduled to appear on Oct 22nd and should be available at this link.
What is this paper about?
Our paper is about genetic privacy concerns related to a subset of direct-to-consumer (DTC) genetics services. The largest DTC genetics companies are Ancestry and 23andMe, but our paper is not directly about them—it’s about services that allow users to upload their own genetic datasets.
Several DTC genetics services, including GEDmatch, MyHeritage, FamilyTreeDNA, and LivingDNA, allow people who have been genotyped by other services to upload their data to their databases. So, imagine that you’ve been genotyped by 23andMe but want to find genetic relatives who were genotyped by other services. One option is to be genotyped by more genetic genealogy companies, but another option—usually cheaper and sometimes free—is to download your 23andMe genetic data and upload it to some other services. This helps genealogy enthusiasts find more genetic relatives for less money, and it helps the smaller DTC services grow their databases as well.
The potential problem to which we want to draw attention is that allowing users to upload their own datasets can present serious concerns about genetic privacy. In the paper, we describe some ways that a motivated person (we’ll call this person an “adversary”) could compromise the privacy of people in a DTC genetics database by uploading several genetic datasets, either real or fake. Under some circumstances, an adversary could reveal most of the genetic information of most people in a DTC database by uploading a few hundred datasets and aggregating the information returned by the DTC service. Many genetic genealogy services return the full names of relative matches, and some include email addresses. Therefore, in some cases large amounts of identifiable genome-wide data may be obtainable by a motivated adversary. We also describe some actions that DTC services can take to limit these risks.
Why did you write this paper?
DTC genetics is a rapidly growing industry, and genetic genealogy is a major driver of demand. Many people love having the ability to find their genetic relatives and piece together their family trees. The ability to search for genetic relatives can be especially valuable for people who are missing information about their genetic relatives, including adoptees, biological children of sperm donors, and descendants of slaves or Holocaust survivors, among others. We want people to be able to continue to use these resources, but also to be able to do so as safely as possible. In writing this paper, we want to clarify some privacy risks so that companies offering DTC services can limit them and so that consumers can be informed. In using personal genomics sites, there is always a tradeoff between privacy risk and the ability to learn something new about your family. Our view is that clearly communicating these risks is the best way forward.
Doesn’t publishing this information make it possible for people to exploit the kinds of privacy vulnerabilities you describe?
To help protect users’ privacy, we wrote to all the entities we could find that currently offer or recently offered upload services—GEDmatch, MyHeritage, FamilyTreeDNA, LivingDNA, and DNA.LAND—ninety days before posting our manuscript. In our letters to them, we outlined the privacy risks we saw and some methods for limiting or preventing them.
Though sharing this information makes it available to people motivated to compromise genealogy enthusiasts’ privacy, it also makes it available to the people who run genetic genealogy services, to potential customers and the public, and to anyone thinking of founding a new DTC genetic genealogy service. We believe that having this information out in the open makes it easier to protect people’s privacy. And if we did not publish this information, there would be nothing to prevent someone motivated to compromise people’s privacy from figuring it out themselves. The ideas underlying these approaches are not very complex and not too difficult to implement. Thus, it seems best to lay out the issues clearly.
How can uploading genetic datasets to genealogy services potentially compromise the privacy of people in the database?
We describe three different approaches that an adversary could take to identify people’s genotype information, which we call “IBS tiling,” “IBS probing,” and “IBS baiting.” “IBS” stands for “identity by state,” a phrase that geneticists use to talk about segments of genome where two people’s genotypes partially match. If these segments are long stretches of the genome, it indicates that the two individuals have both inherited this genetic material from a recent common ancestor.
What is IBS tiling?
Because genetic information is inherited in large chunks broken up by recombination, you can think of a person’s genome as a mosaic of pieces inherited from a set of ancestors who lived, say, within the last 20 generations. If you look at ancestors from recent generations, the tiles in the mosaic will be big, because they have not been broken up by many recombination events. The tiles from ancestors farther back in the past will be small, because they have passed down to you through many generations, and the recombination events in each of those generations will have chipped away at the tiles. DTC genetics companies identify genetic relatives by looking for people who share big tiles. But you will tend to share small tiles with lots of people even if they are not closely related to you.
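As a rough illustration of how quickly the tiles shrink, here is a minimal Python sketch. It rests on a textbook simplification that is not taken from our paper: crossovers are assumed to fall at an average rate of one per Morgan (100 centiMorgans) per meiosis, and chromosome ends are ignored. Under that assumption, a tile that has passed through m meioses has a mean length of about 100/m centiMorgans. The function name `mean_tile_length_cm` is hypothetical.

```python
def mean_tile_length_cm(meioses):
    """Mean tile length in centiMorgans after the given number of meioses.

    Assumption (idealized model): crossovers occur as a Poisson process at
    one per Morgan (100 cM) per meiosis, and chromosome ends are ignored.
    """
    if meioses < 1:
        raise ValueError("meioses must be at least 1")
    return 100.0 / meioses


# Tiles that have passed through more meioses are shorter on average.
for meioses in (1, 5, 10, 20):
    print(meioses, mean_tile_length_cm(meioses))
```

Real tile lengths vary widely around these averages, so the numbers are only a guide to the trend described above.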
When a DTC genetics company reports that you share a tile with another user in their database, you learn something about the other user’s genotypes. After all, you know your own genotypes, and if the location of the shared tile is reported (it often is), then you know that the other user shares genetic variants with you within the tile. To perform IBS tiling, an adversary would upload lots of genetic datasets and keep track of their shared tiles with everyone in the database. Depending on the way in which the DTC service identifies and reports shared tiles of genetic information, an adversary that uploads enough genotypes could aggregate that information to find out a lot about the genotypes of people in the database. An adversary could gather genotypes to upload from a subset of the many publicly available genetic datasets used for research.
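To make the aggregation step concrete, here is a minimal Python sketch on made-up numbers. It assumes that each reported shared tile for one person in the database can be written as a (start, end) interval along a single chromosome. The function name `covered_fraction`, the interval format, and the example coordinates are illustrative assumptions, not details of any real service. The sketch merges overlapping intervals and reports how much of the chromosome they cover.

```python
def covered_fraction(segments, chromosome_length):
    """Fraction of a chromosome covered by the union of (start, end) segments."""
    covered = 0.0
    current_start, current_end = None, None
    for start, end in sorted(segments):
        if current_end is None or start > current_end:
            # Disjoint from the running interval: bank it and start a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlaps the running interval: extend it if needed.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / chromosome_length


# Hypothetical tiles (in cM) that three uploads share with one database user.
tiles = [(0, 12), (10, 30), (45, 60)]
print(covered_fraction(tiles, chromosome_length=100))
```

The point of the sketch is that each additional upload can only add to the covered region, which is why the number of uploads matters so much in this kind of attack.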
What is IBS probing?
The second approach we describe is called “IBS probing.” IBS probing is similar to IBS tiling, but an adversary could use it to find people who have a specific genetic variant of interest in a DTC service that reports only whether two people share a matching genetic tile and not where the match occurs. The idea is that the adversary fills in most of the genome with fake data that is designed to look unlike actual human genetic data, and thus not to match anyone in the database. The only exception is that in a small region of the genome, the adversary uses real data containing the genetic variant of interest. Thus, any matches returned by the database are likely to be people who have the genetic variant that the adversary is interested in.
What is IBS baiting?
The third approach is “IBS baiting,” and it relies on tricking a particular class of algorithms that is sometimes used to identify relatives. This class of algorithms does not represent the cutting edge of computational tools to identify IBS, but algorithms in this class potentially allow DTC genetics services to skip a data processing step that has been hard to perform in large datasets until recently. (The skippable step is called “phasing,” which is the attempt to identify genetic variants that occur together on the same strand of DNA.) If a DTC service uses one of these methods to detect relatives, then it might be possible for an adversary to upload pairs of fake datasets that reveal the genotypes of every user in the database at hundreds of places in the genome. Once enough genotypes are gathered by IBS baiting, it’s possible to use well-established algorithms to fill in the rest of the genotypes (this is called “genotype imputation”).
There seem to be large data breaches reported in the news often. Would a breach of a genetic database be any different from those?
Yes, digital security is an increasingly important issue, and major data breaches happen every year. In many cases, these breaches involve passwords, email addresses, or credit card information of customers.
Genetic data is especially important to protect. First off, genetic data could be used for discrimination. Though prediction of traits from genetic information is not very accurate for most traits, it is possible and even likely that the accuracy will increase. When it does, it might be possible to predict important health outcomes, which could lead to discrimination. For example, in the United States, health insurance companies are not allowed to discriminate on the basis of genotype because of the Genetic Information Nondiscrimination Act (GINA). However, GINA does not explicitly disallow discrimination for other kinds of insurance (such as life insurance). GINA may also be repealed in the future, and protections against genetic discrimination vary among states and countries.
Genetic data also has two special features that make it different from many other kinds of sensitive information, such as a credit card number. First, whereas we can change a credit card number or even a social security number (perhaps with a lot of hassle), we cannot change our genotypes. Second, whereas my credit card number does not reveal anything about the credit card numbers of my children or other genetic relatives, my genetic information can reveal something about my genetic relatives’ genotypes.
Couldn’t a criminal just hack into one of these databases by doing…whatever it is computer hackers normally do?
Yes, DTC genetics companies have to deal with all the same data security issues that other companies do. The difference is that the methods we talk about in our paper would only work on genetic data—they take advantage of the structure of genetic variation and the way it is distributed among people, or of the algorithms used to identify genetic relatives. To perform the attacks we describe, an adversary would need to know something about genetics and genetic datasets but would not need to know much about security hacking. The adversary would simply be uploading datasets and aggregating the information returned.
A direct-to-consumer genetics company has my genotype information. Should I be worried because of what’s in your paper?
Not necessarily. It’s important to understand that sharing your genetic information with any company or other organization will always entail some degree of privacy risk. The people we have corresponded with at DTC genetics services that allow uploads have all assured us that they take privacy seriously and that they either will take action or have already taken action to prevent the types of attacks we describe.
That said, we are not in a position to comment with much precision on the specific vulnerabilities of each service. In most cases, details of the computational steps these services use to identify relatives are private or proprietary, which makes it impossible to verify whether the attacks we describe might work. (Another way of finding this out would be to try to carry out the attacks ourselves, but we will not do this—we have no intention of violating anyone’s privacy, and all of our demonstrations were carried out entirely in publicly available data.)
Some information relevant to determining the risk users face of privacy violations is publicly available and is listed in our paper, such as the minimum matching segment length that the service will show to its users. Each service has had to make a choice about how much information it wants to release to users vs. how much it wants to keep its users’ genotypes private. Releasing more information (such as, for example, the chromosomal locations of matches between genetic relatives) can allow for more fine-grained genealogical analyses and is often appreciated by genetic genealogists, but it also presents extra risk.
Other information relevant to determining risk has not been publicly released, at least not by all DTC services that allow uploads. For example, MyHeritage released a description of their analysis pipeline that strongly suggests that they look for genetic relatives using phased genotypes, which would mean that the IBS baiting method we describe would not work at MyHeritage. We encourage all DTC companies that allow uploads to share such information publicly.
Can the privacy attacks you describe be prevented?
Yes, to a large extent. We describe a set of policies that DTC genetics services could adopt to limit or prevent the attacks we describe. Some DTC services had already adopted many of these policies when we wrote to them, and others mentioned possible plans to adopt some. These changes tend to involve a trade-off for their users: in some cases, better privacy protection means that genealogy enthusiasts have less information to work with, and so services or their user bases might reasonably decide not to adopt some of these suggestions. At the same time, some changes help protect privacy without blocking much information that would be useful to a genealogist. The possible policies we suggest are:
Only report shared genetic segments if they are long (we suggest 8 centiMorgans as a possible length threshold).
Do not report the chromosomal locations of matching genetic segments, only their length and number.
Require uploaded genotypes to be cryptographically signed to indicate that the source of the file is a trusted genotyping company. (This would require cooperation between several genotyping services.)
Report only a small number of relatives per uploaded genotype file (we suggest 50, but other or more flexible limits might be set).
Disallow searches of arbitrary genotype files against each other.
Block uploads of publicly available genotype data.
Block uploads with evidence of segments designed to have no matches in the database.
Block uploads with long heterozygous segments or with segments that match many more people than would be expected.
Use phase-aware methods for detecting genetic relatives.
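Three of the reporting policies above (a minimum segment length, no chromosomal locations, and a cap on the number of relatives) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Match` class, the `build_report` function, and the (chromosome, start, end) segment format are hypothetical and do not describe how any DTC service stores or reports matches. The two constants are the values suggested in the list.

```python
from dataclasses import dataclass

MIN_SEGMENT_CM = 8.0   # length threshold suggested in the list above
MAX_RELATIVES = 50     # per-upload cap suggested in the list above


@dataclass
class Match:
    user_id: str
    segments: list  # hypothetical format: (chromosome, start_cm, end_cm) tuples


def build_report(matches):
    """Return a limited report: long segments only, no locations, capped."""
    rows = []
    for match in matches:
        # Keep only the lengths of long segments; positions are discarded.
        lengths = [end - start for _, start, end in match.segments
                   if end - start >= MIN_SEGMENT_CM]
        if lengths:
            rows.append({
                "user_id": match.user_id,
                "n_segments": len(lengths),
                "total_cm": sum(lengths),
            })
    # Report the closest matches first and cut off the list at the cap.
    rows.sort(key=lambda row: row["total_cm"], reverse=True)
    return rows[:MAX_RELATIVES]


example = [
    Match("user_a", [(1, 0.0, 10.0), (2, 5.0, 9.0)]),
    Match("user_b", [(1, 0.0, 5.0)]),
]
print(build_report(example))
```

In this example only user_a is reported, with one segment of 10 cM, because every other segment falls below the threshold. The remaining policies in the list, such as signed uploads and phase-aware matching, need more machinery than a short sketch can show.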
I have read news about law enforcement using databases like GEDmatch and FamilyTreeDNA. How does that fit in here?
Yes, in the last two years, genealogists working with law enforcement have uploaded genetic information to DTC genetic genealogy services, attempting to identify the sources of crime-scene samples or missing persons by identifying their relatives. This practice is called long-range familial search, and it received widespread public attention after being used to identify a subject in the decades-old Golden State Killer case. There has been little regulation or oversight of this practice until recently, when the Department of Justice released a set of interim guidelines for long-range familial searches, with a permanent policy to follow soon. Currently, GEDmatch users may opt in to being considered in law enforcement searches, and FamilyTreeDNA users may opt out. Other DTC services are not known to have been searched by law enforcement.
One implication of our paper is that a user that has uploaded many genotypes to a DTC genetic genealogy service may be able to access a lot of information about users in the database via the method we call IBS tiling. Companies that have cooperated with law enforcement to perform many investigations, such as Parabon Nanolabs and Bode Technologies, have uploaded dozens or hundreds of datasets to GEDmatch and/or FamilyTreeDNA. We have no reason to think that these companies are engaging in IBS tiling or storing any information that is not directly pertinent to their searches. Still, data management policies for long-range familial searching should be designed to prevent IBS tiling and the accidental acquisition of genotype information for many people.
What’s the bigger take-away from the paper?
As medical genomics and personal genomics spread into many aspects of our lives, we as individuals and societies need to balance their promises and pitfalls. The storage of large amounts of genetic data necessarily brings a range of privacy issues, some of which may only come to light after people have shared their information (for example, the long-range familial searches came as a surprise to many genetic genealogists). Given the sensitive nature of genetic information, we as a society need to be proactive about avoiding its misuse. We need genetic discrimination laws to be more comprehensive to ensure that personal genomics users are not exposed to discrimination, and we need tools in place to ensure that people can determine when and how their genetic information is used. We also need greater transparency from genomics companies, and organisations that interface with these companies, to allow users confidence in exploring their family histories and personal genomics.