New release of “Population and Quantitative Genetics” book

The second release version of “Population and Quantitative Genetics” is now available. All of the LaTeX, figures, etc. are released under a CC-BY 3.0 licence. All of the figures have their attribution, and code is provided for all of the figures produced for the book.

If you make use of these notes, please consider answering this questionnaire. I’m collecting this information in case it helps support further development of the notes.

What’s new? Descriptions throughout the book have been expanded. I’ve also added new figures and illustrations. Entirely new to this release:
A math appendix at the end of the book. This briefly reviews many of the math topics needed to follow the explanations in the book. Links to this appendix have been added throughout the book.
A new final chapter on the interaction of selection and recombination. This new chapter discusses the advantages and disadvantages of sex and recombination, and the evolution of inversions and super genes.

A chapter ‘Population Structure and Correlations Among Loci’ has been broken off the chapter on allele and genotype frequencies.
A chapter ‘The Population Genetics of Divergence and Molecular Substitution.’ has been broken off from the genetic drift chapter and extended. A chapter on ‘Neutral Diversity and Population Structure’ has also been broken off from this genetic drift chapter.
The response to selection chapter has been split into a chapter on single traits (‘The Response to Phenotypic Selection’) and on multiple traits (‘The Response of Multiple Traits to Selection.’). New material on fitness landscapes has been added to the single trait chapter and the multivariate chapter has new material on estimating fitness gradients.
A section on sex ratios and selfish elements has been added to the ‘One-Locus Models of Selection’ chapter.
A section on hybrid zones has been added to the chapter ‘The Interaction of Selection, Mutation, and Migration.’

Posted in Uncategorized | 1 Comment

Paper FAQ for Attacks on genetic privacy via uploads to genealogical databases

An FAQ written by Doc (Michael) Edge and Graham Coop on their paper about genetic genealogy & privacy (pdf link here).

The preprint is scheduled to appear on Oct 22nd and should be available at this link.

What is this paper about?

Our paper is about genetic privacy concerns related to a subset of direct-to-consumer (DTC) genetics services. The largest DTC genetics companies are Ancestry and 23andMe, but our paper is not directly about them—it’s about services that allow users to upload their own genetic datasets.

Several DTC genetics services, including GEDmatch, MyHeritage, FamilyTreeDNA, and LivingDNA, allow people who have been genotyped by other services to upload their data to their databases. So, imagine that you’ve been genotyped by 23andMe but want to find genetic relatives who were genotyped by other services. One option is to be genotyped by more genetic genealogy companies, but another option—usually cheaper and sometimes free—is to download your 23andMe genetic data and upload it to some other services. This helps genealogy enthusiasts find more genetic relatives for less money, and it helps the smaller DTC services grow their databases as well.

The potential problem to which we want to draw attention is that allowing users to upload their own datasets can present serious concerns about genetic privacy. In the paper, we describe some ways that a motivated person (we’ll call this person an “adversary”) could compromise the privacy of people in a DTC genetics database by uploading several genetic datasets, either real or fake. Under some circumstances, an adversary could reveal most of the genetic information of most people in a DTC database by uploading a few hundred datasets and aggregating the information returned by the DTC service. Many genetic genealogy services return the full names of relative matches, and some include email addresses. Therefore, in some cases large amounts of identifiable genome-wide data may be obtainable by a motivated adversary. We also describe some actions that DTC services can take to limit these risks.

Why did you write this paper?

DTC genetics is a massively growing industry, and genetic genealogy is a major driver of demand. Many people love having the ability to find their genetic relatives and piece together their family trees. The ability to search for genetic relatives can be especially valuable for people who are missing information about their genetic relatives, including adoptees, biological children of sperm donors, and descendants of slaves or Holocaust survivors, among others. We want people to be able to continue to use these resources, but also to be able to do so as safely as possible. In writing this paper, we want to clarify some privacy risks so that companies offering DTC services can limit them and so that consumers can be informed. In using personal genomics sites, there is always a tradeoff between privacy risk and the ability to learn something new about your family. Our view is that clearly communicating these risks is the best way forward.

Doesn’t publishing this information make it possible for people to exploit the kinds of privacy vulnerabilities you describe?

To help protect users’ privacy, we wrote to all the entities we could find that currently offer or recently offered upload services—GEDmatch, MyHeritage, FamilyTreeDNA, LivingDNA, and DNA.LAND—ninety days before posting our manuscript. In our letters to them, we outlined the privacy risks we saw and some methods for limiting or preventing them.

Though sharing this information makes it available to people motivated to compromise genealogy enthusiasts’ privacy, it also makes it available to the people who run genetic genealogy services, to potential customers and the public, and to anyone thinking of founding a new DTC genetic genealogy service. We believe that having this information out in the open makes it easier to protect people’s privacy. And if we did not publish this information, there would be nothing to prevent someone motivated to compromise people’s privacy from figuring it out themselves. The ideas underlying these approaches are not very complex and not too difficult to implement. Thus, it seems best to lay out the issues clearly.

How can uploading genetic datasets to genealogy services potentially compromise the privacy of people in the database?

We describe three different approaches that an adversary could take to identify people’s genotype information, which we call “IBS tiling,” “IBS probing,” and “IBS baiting.” “IBS” stands for “identity by state,” a phrase that geneticists use to talk about segments of genome where two people’s genotypes partially match. If these segments are long stretches of the genome, it indicates that the two individuals have both inherited this genetic material from a recent common ancestor.

What is IBS tiling?

Because genetic information is inherited in large chunks broken up by recombination, you can think of a person’s genome as a mosaic of pieces inherited from a set of ancestors who lived, say, within the last 20 generations. If you look at ancestors from recent generations, the tiles in the mosaic will be big, because they have not been broken up by many recombination events. The tiles from ancestors farther back in the past will be small, because they have passed down to you through many generations, and the recombination events in each of those generations will have chipped away at the tiles. DTC genetics companies identify genetic relatives by looking for people who share big tiles. But you will tend to share small tiles with lots of people even if they are not closely related to you.

When a DTC genetics company reports that you share a tile with another user in their database, you learn something about the other user’s genotypes. After all, you know your own genotypes, and if the location of the shared tile is reported (it often is), then you know that the other user shares genetic variants with you within the tile. To perform IBS tiling, an adversary would upload lots of genetic datasets and keep track of their shared tiles with everyone in the database. Depending on the way in which the DTC service identifies and reports shared tiles of genetic information, an adversary that uploads enough genotypes could aggregate that information to find out a lot about the genotypes of people in the database. An adversary could gather genotypes to upload from a subset of the many publicly available genetic datasets used for research.

What is IBS probing?

The second approach we describe is called “IBS probing.” IBS probing is similar to IBS tiling, but an adversary could use it to find people who have a specific genetic variant of interest in a DTC service that reports only whether two people share a matching genetic tile and not where the match occurs. The idea is that the adversary fills in most of the genome with fake data that is designed to look unlike actual human genetic data, and thus not to match anyone in the database. The only exception is that in a small region of the genome, the adversary uses real data containing the genetic variant of interest. Thus, any matches returned by the database are likely to be people who have the genetic variant that the adversary is interested in.

What is IBS baiting?

The third approach is “IBS baiting,” and it relies on tricking a particular class of algorithms that is sometimes used to identify relatives. This class of algorithms does not represent the cutting edge of computational tools to identify IBS, but algorithms in this class potentially allow DTC genetics services to skip a data processing step that has been hard to perform in large datasets until recently. (The skippable step is called “phasing,” which is the attempt to identify genetic variants that occur together on the same strand of DNA.) If a DTC service uses one of these methods to detect relatives, then it might be possible for an adversary to upload pairs of fake datasets that reveal the genotypes of every user in the database at hundreds of places in the genome. Once enough genotypes are gathered by IBS baiting, it’s possible to use well-established algorithms to fill in the rest of the genotypes (this is called “genotype imputation”).

There seem to be large data breaches reported in the news often. Would a breach of a genetic database be any different from those?

Yes, digital security is an increasingly important issue, and major data breaches happen every year. In many cases, these breaches involve passwords, email addresses, or credit card information of customers.

Genetic data is especially important to protect. First off, genetic data could be used for discrimination. Though prediction of traits from genetic information is not very accurate for most traits, it is possible and even likely that the accuracy will increase. When it does, it might be possible to predict important health outcomes, which could lead to discrimination. For example, in the United States, health insurance companies are not allowed to discriminate on the basis of genotype because of the Genetic Information Nondiscrimination Act (GINA). However, GINA does not explicitly disallow discrimination for other kinds of insurance (such as life insurance). GINA may also be repealed in the future, and protections against genetic discrimination vary among states and countries.

Genetic data also has two special features that make it different from many other kinds of sensitive information, such as a credit card number. First, whereas we can change a credit card number or even a social security number (perhaps with a lot of hassle), we cannot change our genotypes. Second, whereas my credit card number does not reveal anything about the credit card numbers of my children or other genetic relatives, my genetic information can reveal something about my genetic relatives’ genotypes.

Couldn’t a criminal just hack into one of these databases by doing…whatever it is computer hackers normally do?

Yes, DTC genetics companies have to deal with all the same data security issues that other companies do. The difference is that the methods we talk about in our paper would only work on genetic data—they take advantage of the structure of genetic variation and the way it is distributed among people, or of the algorithms used to identify genetic relatives. To perform the attacks we describe, an adversary would need to know something about genetics and genetic datasets but would not need to know much about security hacking. The adversary would simply be uploading datasets and aggregating the information returned.

A direct-to-consumer genetics company has my genotype information. Should I be worried because of what’s in your paper?

Not necessarily. It’s important to understand that sharing your genetic information with any company or other organization will always entail some degree of privacy risk. The people we have corresponded with at DTC genetics services that allow uploads have all assured us that they take privacy seriously and that they either will take action or have already taken action to prevent the types of attacks we describe.

On the basis of our results, we do have concerns about the privacy of GEDmatch users. As of mid-December, GEDmatch uses length thresholds for displaying matching segments that are too short, allowing for effective IBS tiling attacks, and GEDmatch also appears to use phase-unaware IBD detection methods, allowing for IBS baiting attacks.

In late November 2019, we demonstrated IBS baiting in GEDmatch by uploading a small number of artificial genotypes. Before uploading any data to GEDmatch, we first confirmed our planned procedure with the UC Davis IRB and with GEDmatch representatives. We used artificial kits and compared them only to each other, and so avoided interacting with any genotype data of real GEDmatch users and did not violate GEDmatch’s terms and conditions. As of December 15, 2019, GEDmatch was still vulnerable.

The other active services (MyHeritage, FamilyTreeDNA, and LivingDNA) are likely substantially less vulnerable than GEDmatch to the attacks we describe here. LivingDNA does not provide a chromosome browser, precluding IBS tiling attacks. MyHeritage and FamilyTreeDNA use thresholds for revealing matching segment locations that make IBS tiling much less efficient. (However, FamilyTreeDNA’s practice of showing matches as short as 1cM given that two people share at least one long match is still somewhat permissive.) Representatives of MyHeritage, FamilyTreeDNA, and LivingDNA have confirmed to us that their IBD-calling algorithms rely on phased data, which should preclude IBS baiting. (We have not tested this ourselves.) DTC genetic genealogy is a growing field, and any new entities that begin offering upload services may also face threats of the kind we describe.

Can the privacy attacks you describe be prevented?

Yes, to a large extent. We describe a set of policies that DTC genetics services could adopt to limit or prevent the attacks we describe. Some DTC services had already adopted many of these policies when we wrote to them, and others mentioned possible plans to adopt some. These changes tend to involve a trade-off for their users—in some cases, better privacy protection means that genealogy enthusiasts have less information to work with, and so services or their user bases might reasonably decide not to adopt some of these suggestions. At the same time, there are some changes that help protect privacy without blocking much information that would be of much use to a genealogist. The possible policies we suggest are:

Only report shared genetic segments if they are long (we suggest 8 centiMorgans as a possible length threshold).
Do not report the chromosomal locations of matching genetic segments, only their length and number.
Require uploaded genotypes to be cryptographically signed to indicate that the source of the file is a trusted genotyping company. (This would require cooperation between several genotyping services.)
Report only a small number of relatives per uploaded genotype file (we suggest 50, but other or more flexible limits might be set).
Disallow searches of arbitrary genotype files against each other.
Block uploads of publicly available genotype data.
Block uploads with evidence of segments designed to have no matches in the database.
Block uploads with long heterozygous segments or with segments that match many more people than would be expected.
Use phase-aware methods for detecting genetic relatives.
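
A few of these policies are simple to express in code. Below is a minimal sketch, in Python, of the first, second, and fourth suggestions: report only long segments, report only their number and total length (never their locations), and cap the number of relatives returned. The `Match` record, the `report_matches` function, and the example identifiers are hypothetical names made up for illustration; no DTC service's actual data model or API is implied.

```python
from dataclasses import dataclass

MIN_SEGMENT_CM = 8.0   # length threshold suggested in the list above
MAX_RELATIVES = 50     # per-upload cap suggested in the list above


@dataclass(frozen=True)
class Match:
    """Hypothetical record of one candidate relative for an uploaded file."""
    relative_id: str            # made-up identifier for a database user
    segment_lengths_cm: tuple   # lengths of shared segments, in centiMorgans


def report_matches(matches):
    """Return (relative_id, number of long segments, total long length) rows.

    Segments shorter than MIN_SEGMENT_CM are dropped, chromosomal locations
    are never part of the output, and at most MAX_RELATIVES rows (those with
    the largest total shared length) are returned.
    """
    rows = []
    for m in matches:
        long_segments = [s for s in m.segment_lengths_cm if s >= MIN_SEGMENT_CM]
        if long_segments:
            rows.append((m.relative_id, len(long_segments), sum(long_segments)))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:MAX_RELATIVES]


if __name__ == "__main__":
    example = [
        Match("user_a", (25.0, 9.5, 3.0)),
        Match("user_b", (4.0, 6.0)),      # only short segments: not reported
        Match("user_c", (12.0,)),
    ]
    for row in report_matches(example):
        print(row)
```

The thresholds are constants here only to keep the sketch short; as noted above, a service might reasonably choose other or more flexible limits.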

I have read news about law enforcement using databases like GEDmatch and FamilyTreeDNA. How does that fit in here?

Yes, in the last two years, genealogists working with law enforcement have uploaded genetic information to DTC genetic genealogy services, attempting to identify the sources of crime-scene samples or missing persons by identifying their relatives. This practice is called long-range familial search, and it received widespread public attention after being used to identify a subject in the decades-old Golden State Killer case. There has been little regulation or oversight of this practice until recently, when the Department of Justice released a set of interim guidelines for long-range familial searches, with a permanent policy to follow soon. Currently, GEDmatch users may opt in to being considered in law enforcement searches, and FamilyTreeDNA users may opt out. Other DTC services are not known to have been searched by law enforcement.

One implication of our paper is that a user that has uploaded many genotypes to a DTC genetic genealogy service may be able to access a lot of information about users in the database via the method we call IBS tiling. Companies that have cooperated with law enforcement to perform many investigations, such as Parabon Nanolabs and Bode Technologies, have uploaded dozens or hundreds of datasets to GEDmatch and/or FamilyTreeDNA. We have no reason to think that these companies are engaging in IBS tiling or storing any information that is not directly pertinent to their searches. Still, data management policies for long-range familial searching should be designed to prevent IBS tiling and the accidental acquisition of genotype information for many people.

What’s the bigger take-away from the paper?

As medical genomics and personal genomics spread into many aspects of our lives, we as individuals and societies need to balance their promises and pitfalls. The storage of large amounts of genetic data necessarily brings a range of privacy issues, some of which may only come to light after people have shared their information (for example, the long-range familial searches came as a surprise to many genetic genealogists). Given the sensitive nature of genetic information, we as a society need to be proactive about avoiding its misuse. We need genetic discrimination laws to be more comprehensive to ensure that personal genomics users are not exposed to discrimination, and we need tools in place to ensure that people can determine when and how their genetic information is used. We also need greater transparency from genomics companies, and organisations that interface with these companies, to allow users confidence in exploring their family histories and personal genomics.

Posted in cooplab, genetic genealogy, new paper | Leave a comment

Coop lab Evolution Talks

Doc Edge How much does GWAS stratification drive variation in polygenic scores? Selection 1 Saturday the 22nd 9:45 AM 552
Vince Buffalo Detecting the signature of polygenic adaptation in temporal datasets Molecular Ecology 1 Sat, June 22 4:15 PM 552
Erin Calfee Parallel selection on introgression into maize from a highland endemic wild relative Gene Flow 1 Sunday the 23rd 10:00 AM 551
Sivan Yair The timing and geography of adaptive Neanderthal introgression in modern humans Gene Flow 2 Sunday the 23rd 11:30 AM 551
Matt Osmond Genetic signatures of evolutionary rescue Pop Gen Theory 2 Tuesday the 25th 11:45 ball_bc

Posted in meetings, trips | Leave a comment

Toronto Darwin Day

Had a lot of fun giving the Darwin Day talk at the University of Toronto on Genetics, Genealogy, and our Vast Family Tree. Here is a pdf of the slides:
Toronto Darwin Day pdf
and the powerpoint
Toronto Darwin Day Power Point Slides
Thanks to all of the people who braved the snowstorm and shut-down campus to come along. Thanks to Aneil for the invitation, and the grad students for a lovely visit and helping out.

Posted in genetic genealogy, meetings, trips | 1 Comment

Woodland genetic genealogy talk

I gave a talk as part of the Woodland (CA) Public Library Science & Society Discussion Series (Thurs once a month). The powerpoint of the slides is here: Woodland genetic genealogy slides [ppt], a pdf of the slides is here (but lacks the animations & gifs).

The discussion was a lot of fun, with many great questions. Thanks to Sudhir Vaikkattil for inviting me, and to Woodland Public Library for hosting the series. The future schedule for the discussion series is here.

If you’re interested in more information you can read my blog posts on the topic, or check out one of these great books on the topic


Posted in genetic genealogy, personal genomics, popgen teaching | Leave a comment

Coop lab at PEGQ

Emily Josephs. Detecting polygenic adaptation in maize. 11:20am – 11:40am Mon, May 14

Erin Calfee. Methods for detecting selection in admixed populations. Short talk: 4:30pm – 4:35pm Mon, May 14. Poster (56M) 8:00pm – 9:00pm Mon, May 14

Doc Edge. Reconstructing the history of polygenic adaptation using local coalescent trees. Poster (324T) 8:00pm – 9:00pm Tue, May 15

Sivan Yair. Characterizing adaptive Neanderthal introgression using ancient and modern population genomic data. Poster (122M) 8:00pm – 9:00pm Mon, May 14

Nancy Chen. Tracking short-term evolution in a pedigreed wild population. 11:00am – 11:15am. Tue, May 15

Kristin Lee. Detecting signatures of convergent adaptation in population genomic data. 3:15pm – 3:30pm Tue, May 15

Vince Buffalo. “A temporal signal of linked selection.” 3:45pm – 4:00pm Tue, May 15 2nd Floor – Capitol Ballroom

Posted in cooplab, meetings | Leave a comment

How lucky was the genetic investigation in the Golden State Killer case?

Last week, police arrested Joseph DeAngelo as a suspect in the case of the Golden State Killer, an infamous serial murderer and rapist whose case has been open for over forty years. The arrest is huge news in and of itself, but for people interested in the social uses of genetic data, the way in which DeAngelo was identified (using genetic genealogy & genetic data from crime-scene samples) was noteworthy. In this blog post, we discuss some of the genetics and math underlying the way in which he was identified (see also Henn et al). Because there’s been lots of discussion of the ethics of these approaches, we will not focus on that here; see here for a collection of links & news articles.

The use of genetic data to identify suspects is not new. In the US, law enforcement makes extensive use of their CODIS (Combined DNA Index System) database; genetic searches against the database have aided almost 400,000 investigations since the mid-1990s. The CODIS database contains the genotypes of over 13 million people, most of whom have been convicted of a crime. The genetic information included about each person in the CODIS database is relatively sparse. Most of the profiles record genotypes at just 13 sites in the genome (since 2017, 20 sites have been genotyped). Because the CODIS sites are highly variable microsatellites, CODIS genotypes identify people nearly uniquely, and they are sometimes called “DNA fingerprints”. (The CODIS markers reveal more than fingerprints do, though: they can reveal considerable ancestry information, can reveal close relatives, and in some cases, it’s possible to identify genome-wide genetic profiles that “match” a particular CODIS dataset well.)

 In a typical case in which law enforcement uses genetic data, the procedure is to genotype a crime-scene sample at the CODIS loci and look for a full or partial match against the CODIS database. If the sample came from a person who is in the CODIS database, he or she is likely to be identified. If there is no match, then the genetic search ends unless other information can be brought to bear.

 In the Golden State Killer case, genotyping the samples at the CODIS markers did not reveal a match—Joseph DeAngelo was apparently not included in the CODIS database. Nonetheless, the genetic search continued. Investigators apparently genotyped the crime scene sample at a genome-wide set of SNPs, or single-nucleotide polymorphisms. SNPs are the markers of choice for large consumer genetics services like Ancestry and 23andMe (as well as for genome-wide association studies [GWAS].) The police cannot access private databases like these—at least not without an extended legal process—but they do not have to. Many users upload their SNP data to third-party websites to perform advanced analyses or to search for matches with people tested by different companies.

 These SNP databases are growing rapidly. The plot below shows the number of users in each of a set of repositories over the last few years (plot from here). The largest databases—AncestryDNA and 23andMe—are private. But the fourth-largest—GEDmatch, which now has about 950,000 profiles—is an online service that searches for genetic matches with any user who uploads an appropriately formatted genotype file. That’s the one that police searched for DeAngelo.

Investigators searched for the suspect’s profile by making a personal user account and uploading a genotype file created from the SNP data obtained from crime-scene samples. To do this, the investigators must have created a data file mimicking the SNP set and file format provided by some genetic genealogy company. There was no exact match in the GEDmatch database (indeed, investigators did not expect that DeAngelo would have uploaded his own data), but the trail was not yet cold. The police could still run a search scanning the database for relatives of the suspect. If it is possible to identify a close relative, then the search for the suspect will be narrowed considerably, even if the suspect is not in the database. This is similar to the familial searching done using the CODIS database, which is legal in some states. (But it is imperfect; see work here and here from Rori Rohlfs and colleagues.) However, in the CODIS database, familial search efficacy is limited to close relatives (usually parents and siblings, and more tenuously uncles/aunts/nieces/nephews and first cousins). Thirteen microsatellite markers’ worth of information is simply not enough to distinguish a distant cousin from an unrelated person. With the hundreds of thousands of markers on a typical SNP chip, familial searching is much more powerful: third cousins can be found most of the time, and many (but not all) fourth cousins can be found too. A sample set of profile matches from GEDmatch is shown below:

Looking at SNP-based relative matches in GEDmatch, police found what they needed in the form of 10 to 20 likely relatives. These likely relatives represented third-to-fourth cousins of DeAngelo, most of whom he had probably never met. Using this genetic data, in combination with genealogical information about these relatives, the Golden State Killer investigation narrowed to one extended family, eventually homing in on DeAngelo himself.

Geneticists and genetic genealogists have been using these techniques for some time; the GEDmatch database exists because genealogists wanted to share genomic resources to help identify relatives, allowing families to be reunited (see here). Widespread reporting of the method used to identify DeAngelo as the suspected Golden State Killer has inspired a surge of interest in genetic privacy (see here for a general review of the topic). Though DeAngelo’s capture is widely celebrated, people are also understandably surprised that the decisions of third or fourth cousins can potentially expose one to surveillance. In this post, we explore some simple models to ask questions about the extent of surveillance that is possible using the methods employed in the Golden State Killer case.

 Two opposed phenomena govern the effectiveness of familial searches on genetic databases, one genealogical and one genetic. The genealogical phenomenon, which we could call “genealogical blowup”, is that the number of relatives one has at a specified degree of relatedness increases as the relatedness becomes more distant. For example, whereas a typical person may have one, two, or three siblings, he or she will usually have a large number—dozens or even hundreds—of third cousins (or “third-degree” cousins). The picture below shows the genealogical blowup phenomenon. On the left, we see the probability that a random person has at least one cousin of degree p in a database (depending on the size of the database), and on the right, we see the average number of cousins contained in a database. The number of genealogical cousins one has—where genealogical cousins are cousins in the usual sense, those connected by genealogy—increases rapidly for more distant relationships.

(The calculation on the left is based on the work of Shchur and Nielsen. To make our calculations, we adopt some simplifying assumptions that are certainly wrong, namely complete inbreeding avoidance, monogamy with random mating, non-overlapping generations, random participation in the database, and population sizes similar to US census sizes across the last few generations. However, these calculations are useful to get a rough sense of the problem. Some details and pointers to other sources are in the notes below. The primary caveat that our assumptions entail is that our computations apply most directly to ancestry groups that are well represented in the database. GEDmatch is mostly composed of profiles from Americans of European ancestry. Recent immigrants to the US and people from non-European backgrounds are likely to find fewer relatives in GEDmatch than are European-Americans whose families have been in the US for a few generations.)
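
To get a feel for genealogical blowup, here is a toy calculation in Python. It is not the Shchur and Nielsen calculation behind the figure, and its numbers should not be read as our results. It assumes, purely for illustration, that every couple has the same number of children (`children_per_couple`), that there is no inbreeding, and that each cousin is in the database independently with probability `sampling_fraction`. Both parameter names and the example values are made up for this sketch.

```python
def expected_cousins(p: int, children_per_couple: float) -> float:
    """Expected number of degree-p genealogical cousins under the toy model.

    A person has 2**p ancestral couples p + 1 generations back. Each couple
    has (children_per_couple - 1) children who are not the person's own
    ancestor, and each of those children has children_per_couple**p
    descendants in the person's generation.
    """
    if p < 1:
        raise ValueError("p must be at least 1 (first cousins)")
    return 2 ** p * (children_per_couple - 1) * children_per_couple ** p


def prob_at_least_one_in_database(n_cousins: float, sampling_fraction: float) -> float:
    """Probability that at least one of n_cousins is in the database,
    assuming each is included independently with the same probability."""
    return 1.0 - (1.0 - sampling_fraction) ** n_cousins


if __name__ == "__main__":
    children = 2.5   # illustrative family size, not an estimate
    fraction = 0.005  # illustrative share of the population in the database
    for degree in range(1, 6):
        n = expected_cousins(degree, children)
        prob = prob_at_least_one_in_database(n, fraction)
        print(f"degree {degree}: about {n:.0f} cousins, P(at least one in database) = {prob:.2f}")
```

Even this crude model shows the qualitative pattern: the count of cousins grows geometrically with the degree, so the chance of having at least one of them in a database climbs quickly even when the database holds a small share of the population.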

The opposing genetic phenomenon is the noisiness of genetic inheritance. Whereas the typical person has many distant cousins, the amount of genetic material shared with each of these distant cousins is small. You are nearly certain to share a lot of your genome with your first cousin, as you both have inherited a lot of your genomes from your shared grandparents. As a result, it is easy to identify pairs of first cousins if they are in the database.

The genomic material you share with your first cousin is the overlapping fragments of genome that both of you have inherited from your shared grandparents. Below we show a simulation of you and your first cousin’s genomic material that you both inherited from your shared grandmother (details about how we made these simulations here). In the third panel we show the overlapping genomic regions in purple. These are regions where you and your cousin will have matching genomic material, due to having inherited it “identical by descent” from your shared grandmother. (If you are full first cousins, you will also have shared genomic regions from your shared grandfather, not shown here.)

 

Now consider the case of third cousins. You share one of eight sets of great-great grandparents with each of your (likely many) third cousins. On average, you and your third cousin each inherit one-sixteenth of your genome from each of those two great-great grandparents. This turns out to imply that on average, a little less than one percent of your and your third cousin’s genomes (2 × (1/16)^2 ≈ 0.78%) will be identical by virtue of descent from those shared ancestors. If you do share one percent of your genomes, then your relationship to your cousin will likely be detectable using SNPs, because the shared portions will be concentrated in relatively long stretches of chromosome that are easy to see statistically. But the more interesting thing is the variation around that average. There is a non-trivial chance (~2%) that you will actually share no identical segments of your genome with your third cousin; in that case, we say you are genealogical cousins but not genetic cousins.
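
The same arithmetic extends to other degrees of cousinship. As a minimal sketch (the function name is ours, and it assumes full cousins who share exactly one ancestral couple p + 1 generations back, with no inbreeding), the expected shared fraction can be computed exactly in Python:

```python
from fractions import Fraction


def expected_ibd_fraction(p: int) -> Fraction:
    """Expected fraction of the genome that full degree-p cousins share
    identical by descent.

    Each cousin inherits, on average, (1/2)**(p + 1) of their genome from
    each of the two shared ancestors, so the expected overlap is
    2 * ((1/2)**(p + 1))**2.
    """
    if p < 1:
        raise ValueError("p must be at least 1 (first cousins)")
    per_ancestor = Fraction(1, 2) ** (p + 1)
    return 2 * per_ancestor ** 2


if __name__ == "__main__":
    for degree in (1, 2, 3, 4, 5):
        frac = expected_ibd_fraction(degree)
        print(f"degree-{degree} cousins: {frac} of the genome on average")
```

For third cousins this gives 1/128, the 0.78% quoted above, and each further degree divides the expectation by four. This is only the average; it says nothing about the variation around it, which is what makes some genealogical cousins genetically invisible.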

Here’s an example where third cousins share some blocks of their genome (on chromosomes 2 and 16) due to their great-great grandmother:

Here’s an example where the same individual shares the same great-great grandmother with another third cousin, but has no genetic sharing due to that connection:

 

As the degree of relatedness decreases—on to fourth cousins, fifth cousins, and so on—an ever-larger proportion of one’s genealogical cousins will not be genetic cousins. The figure below shows the proportion of degree-p cousins with which one expects to share either at least one, two, or three genetic blocks. Sharing 1 block is not very informative (see here). Individuals with whom one shares three or more large genetic fragments are likely strong leads. (Again, the assumptions used here are explained in the notes below.)

An appreciation of these two phenomena—genealogical blowup and the noisiness of genetic inheritance—is crucial for understanding how public SNP databases might be used by law enforcement in the future. There is a tradeoff. One typically has a large number of genealogical eighth cousins, but only a small proportion of them will be genetic cousins, and even these are often impossible to identify as such. On the other hand, it is easy to detect one’s first cousins, but because one typically has a small number of first cousins, the probability that a random person has one in a genetic database is low unless the database is very large. (Another factor relevant for law enforcement is that closer matches are more useful; they narrow the pool of possible suspects more.) The image below combines the considerations illustrated in the previous plots, showing the expected numbers of genetic cousins in the database. The tradeoff of genealogical blowup and the noisiness of genetic inheritance is optimized in the third to fifth cousin range—you have a lot of genealogical cousins at this degree of relatedness, and many of them will be detectable genetic cousins. Because closer relatives are more useful to law enforcement than more distant relatives, it’s likely that many of the cases that could be solved by these methods would involve some mix of 2nd, 3rd, and 4th cousins.

The Golden State Killer results are close to what we expect given the size of the GEDmatch database. Under the assumptions we make here, it’s likely that a large percentage of people have at least one high-confidence genetic cousin in GEDmatch, and the number of 3rd-4th cousins found for DeAngelo—10 to 20—is not too far from the expectations. It’s striking that uploading one’s information to a matching database potentially opens up a large number of other people to eventual identification, and that most of these people are distant enough relatives that one would likely never have met them. To illustrate, consider that 13 million individuals in CODIS likely wouldn’t reveal a familial match because only very close relatives are detectable in CODIS. But using the far smaller GEDmatch database (~1 million individuals), investigators tracked DeAngelo down. As Yaniv Erlich put it recently, “You are a beacon who illuminates 300 people around you.” It’s also striking that we’re already in an era in which familial searches against publicly accessible SNP databases are feasible for a lot of cases, probably the majority of cases where the suspect has substantial recent ancestry in the US—the public datasets are big enough (or will be soon). The limiting factor here may be the genealogical work to trace distant cousins through family trees, but big public datasets might make the genealogical task easier too. From here, it’s a question of deciding the circumstances under which we as a society want these familial searches to be used.

Doc Edge (@DocEdge85) & Graham Coop (@Graham_Coop) 

Thanks to the Coop lab and Debbie Kennett for helpful comments on an earlier draft.  

Notes

A pth cousin is a person with whom one shares an ancestor (in our model, an ancestral couple) p+1 generations ago (your great^(p-1) grandparents). If there’s no inbreeding in one’s recent family tree, then one is descended from 2^p ancestral couples p+1 generations ago. A pair of individuals in the present are pth cousins (or closer) if their sets of 2^p ancestral couples overlap, that is, if they share ancestors p+1 generations ago. Let’s assume that there are N_p potential ancestors in N_p/2 couples, p+1 generations back. If each of these couples has the same probability of having children and there is not too much variation in family size, we can view the problem as if people in the present “choose” their ancestors p+1 generations ago at random. Your ancestors were no doubt very special people, but as far as this model is concerned they were just 2^p random draws from all the couples who’ve left descendants. To calculate the probability that you and I are pth cousins, we just need to calculate the probability that our two sets of 2^p ancestral couples overlap (note that this assumes monogamy, i.e. that we’ll be full not half cousins, but even if that weren’t true, it would just alter things by a factor of two). Now we have something close to a classic probability problem: we draw a set of 2^p balls at random from an urn with N_p/2 balls (one per couple), replace the balls in the urn, and repeat the draw of 2^p balls. What is the probability that at least one ball is a member of both sets of 2^p balls?

The probability that you and I are pth cousins is roughly 4^p/(N_p/2) when 2^p << N_p, i.e. when your ancestors are a small fraction of the total people in the population. (Each of my 2^p ancestral couples is also one of yours with probability about 2^p/(N_p/2), and I have 2^p such couples.) In a current-day database of K individuals, drawn from the same population as you, your expected number of pth cousins is K*4^p/(N_p/2). Two factors make this number blow up quickly as we go back over the generations. First, 4^p grows quickly with p; second, population sizes have increased rapidly in the recent past, which means that N_p declines quickly with p (because p counts generations backward in time).
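To see how good the approximation is, the urn calculation can be done exactly and compared with 4^p/(N_p/2). The sketch below assumes each person's 2^p ancestral couples are drawn without replacement from N_p/2 couples. The function names and the example numbers in the usage section are hypothetical, chosen only for illustration.

```python
def prob_pth_cousins_exact(p: int, n_ancestors: int) -> float:
    """Probability that two independent draws of 2^p couples, each made
    without replacement from n_ancestors/2 couples, share at least one couple."""
    couples = n_ancestors // 2
    draws = 2 ** p
    no_overlap = 1.0
    for i in range(draws):
        # The (i+1)th couple in the second set must avoid all couples in the first set.
        no_overlap *= max(couples - draws - i, 0) / (couples - i)
    return 1 - no_overlap


def prob_pth_cousins_approx(p: int, n_ancestors: int) -> float:
    """Approximation 4^p / (N_p / 2), valid when 2^p is much smaller than N_p."""
    return min(1.0, 4 ** p / (n_ancestors / 2))


def expected_cousins_in_database(p: int, n_ancestors: int, database_size: int) -> float:
    """Expected number of pth cousins in a database of K individuals: K * 4^p / (N_p / 2)."""
    return database_size * prob_pth_cousins_approx(p, n_ancestors)


if __name__ == "__main__":
    # Hypothetical inputs: 2 million potential ancestors, 1 million people in the database.
    for degree in range(1, 6):
        exact = prob_pth_cousins_exact(degree, 2_000_000)
        approx = prob_pth_cousins_approx(degree, 2_000_000)
        expected = expected_cousins_in_database(degree, 2_000_000, 1_000_000)
        print(degree, exact, approx, expected)
```

The two probabilities agree closely while 2^p stays far below N_p, which is the regime the notes rely on.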

One of the biggest uncertainties in our calculations is the size of the pool of possible ancestors. Our calculations should therefore be viewed as a crude approximation. They assume that the population size of possible ancestors is given by the census population size of the USA. To get the census population size we assume a generation time of 30 years, and take the population size in the decade 1950-30*(p+1). We assume that roughly ½ of the individuals in the population are potential parents, and that 90% of potential parents have children. We impose a floor of 1 million potential parents on the population size, to reflect the fact that for people of European ancestry, the pool of ancestors back then would also include Europe. Given the large variation in family sizes, N should likely be lower still, as variation in family size decreases the effective N further.
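The bookkeeping in this paragraph can be sketched as follows. This is our reading of the assumptions, not the code used for the figures: the pool is taken to be census size × ½ × 90%, floored at 1 million. The census figure itself must be supplied by the caller, since no census data are reproduced here, and all names are illustrative.

```python
GENERATION_YEARS = 30          # assumed generation time
REFERENCE_YEAR = 1950          # year from which generations are counted back
PARENT_FRACTION = 0.5          # roughly half of individuals are potential parents
FRACTION_WITH_CHILDREN = 0.9   # 90% of potential parents have children
POOL_FLOOR = 1_000_000         # floor on the pool of potential parents


def ancestor_decade(p: int) -> int:
    """Decade in which the shared ancestors of pth cousins are assumed to have lived."""
    return REFERENCE_YEAR - GENERATION_YEARS * (p + 1)


def ancestor_pool_size(census_population: float) -> float:
    """Pool of possible ancestors (N_p) given the census size in the relevant decade."""
    pool = census_population * PARENT_FRACTION * FRACTION_WITH_CHILDREN
    return max(pool, POOL_FLOOR)


if __name__ == "__main__":
    for degree in range(1, 9):
        print(f"cousin degree {degree}: look up the census for {ancestor_decade(degree)}")
```

For example, third cousins (p = 3) trace their shared ancestors to 1950 - 30*4 = 1830 under these assumptions.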

Shchur and Nielsen recently worked through the probability that you have no pth cousins in a database of K individuals, in a model similar to that described above. The model Shchur and Nielsen use is more realistic than the one we consider here: it allows for some inbreeding and takes explicit account of the fact that some couples will not have children. They find (their equation 7) that the probability that an individual has no pth cousins in the database, given a fixed population size of N, is approximately exp(-2^(2p-2)*K/N).

The math underlying the genetic calculation is described in more detail here. To summarize: if you share two ancestors p+1 generations back with your pth cousin, then you share a particular autosomal chromosomal region with probability 2*(1/2^(2(p+1)-1)). You have 22 autosomal chromosomes, and each generation, recombination happens in ~34 places on these chromosomes. Looking back p+1 generations, your chromosomes are broken up into approximately (22+34*(p+1)) chunks, which are spread across your ancestors. Likewise, your relative’s genome is broken into (22+34*(p+1)) chunks. Because recombination events rarely happen in exactly the same place, your two genomes combined are broken into (22+34*2*(p+1)) pieces. As each of these is inherited identical by descent by both you and your cousin from a given shared ancestor with probability 1/2^(2(p+1)-1), and there are two shared ancestors, you and your cousin should expect to share E_B = 2*(22+34*2*(p+1))/2^(2(p+1)-1) blocks of your autosomal genome. The probability that you share 0 blocks is approximately exp(-E_B), while the probability of sharing 2 or more blocks (Q_p) can be obtained approximately under the Poisson distribution (which is a good approximation beyond 1st cousins).
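The block-sharing calculation can be sketched in a few lines. The assumptions here are ours: the combined genomes are split into 22 + 34*2*(p+1) pieces, each piece is shared through a given ancestor with probability 1/2^(2(p+1)-1), there are two shared ancestors, and the number of shared blocks is Poisson with that mean. Function and constant names are illustrative.

```python
from math import exp, factorial

AUTOSOMES = 22                   # number of autosomal chromosomes
CROSSOVERS_PER_GENERATION = 34   # approximate recombination events per generation


def expected_shared_blocks(p: int) -> float:
    """Expected number of autosomal blocks shared by full pth cousins (E_B)."""
    generations = p + 1
    # Pieces of the two genomes combined, looking back p+1 generations on each side.
    pieces = AUTOSOMES + CROSSOVERS_PER_GENERATION * 2 * generations
    # Probability that both cousins inherit a given piece from one shared ancestor.
    prob_per_ancestor = 1 / 2 ** (2 * generations - 1)
    # Two shared ancestors (an ancestral couple).
    return 2 * pieces * prob_per_ancestor


def prob_at_least_k_blocks(p: int, k: int) -> float:
    """Poisson approximation to the probability of sharing k or more blocks."""
    mean = expected_shared_blocks(p)
    prob_fewer_than_k = sum(exp(-mean) * mean ** j / factorial(j) for j in range(k))
    return 1 - prob_fewer_than_k


if __name__ == "__main__":
    for degree in range(2, 9):
        shares = [prob_at_least_k_blocks(degree, k) for k in (1, 2, 3)]
        print(degree, expected_shared_blocks(degree), shares)
```

Calling `prob_at_least_k_blocks(p, 2)` gives the quantity called Q_p in the notes. Because the Poisson approximation is only described as good beyond first cousins, the loop starts at second cousins.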

Putting all of this together, your expected number of genetic pth cousins is Q_p*K*4^p/(N_p/2). That’s the solid line plotted in the final figure.
