My wonderful population genetics graduate student class surprised me with popgen-inspired cookies for the last class. There are species trees, trees, frequency spectra & equations and a whole boatload of popgen fun. Thanks to the class for a great set of classes.


## How many genomic blocks do you share with a cousin?

Thanksgiving is over, although your fridge may still be full of leftovers. You probably spent your time wondering exactly what you have in common with your cousin, other than your loathing of brussels sprouts. I’m a British ex-pat so I have no real clue, but I guess that is what you were pondering as you stared off over the half-eaten turkey.

In the previous few posts I talked through the probability you share a given number of genomic blocks with a particular ancestor, and how your number of genetic ancestors compares to your number of genealogical ancestors.

We’ll now take a look at the probability that you and a cousin share a given number of autosomal genomic regions. Every generation you go back, the two copies of your genome are spread more and more thinly over your increasingly large number of genealogical ancestors. This means that there is a reasonable chance that cousins of more than a few degrees of separation (e.g. 4th cousins, see definitions here) share no autosomal genomic material due to that shared ancestor. This probability of sharing zero increases the further back you and your cousin share a common ancestor.

In the picture below (left panel) is a simulation of the autosomal genome you inherited from your mother (colouring her 2 copies of each chromosome, one from her mother & one from her father). You can see how she transmitted a mosaic of her two copies of each chromosome to you. We call the switches from maternal to paternal chromosome (in your mother) recombination events. In the right two panels I show your genome in your maternal grandmother and grandfather:

You can see how the genomic chunks they transmitted to your mother, and then she to you, are fragmented across their genomes. For example, in this simulation your mother passed on no genomic material on chromosome 2 from your maternal granddad, and so all of your maternal chromosome 2 comes from your maternal grandmum.
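To make the mosaic idea concrete, here is a minimal Python sketch of one parent transmitting one chromosome. It is not the code behind the figures: it assumes crossovers fall as a Poisson process at a rate of one per Morgan (so no crossover interference), and the function name `transmit_chromosome` and the 2 Morgan chromosome length are hypothetical choices for illustration.

```python
import random

def transmit_chromosome(length_morgans, rng):
    """Return (start, end, grandparent) segments for one transmitted chromosome.

    Assumption: crossovers occur as a Poisson process with rate 1 per Morgan,
    i.e. there is no crossover interference.
    """
    # Place crossovers by drawing exponential gaps along the chromosome.
    crossovers = []
    position = rng.expovariate(1.0)
    while position < length_morgans:
        crossovers.append(position)
        position += rng.expovariate(1.0)

    # Mendelian segregation: which grandparental copy comes first is a coin flip.
    current = rng.choice(["grandmother", "grandfather"])
    segments = []
    start = 0.0
    for point in crossovers + [length_morgans]:
        segments.append((start, point, current))
        # Each crossover switches to the other grandparental copy.
        current = "grandfather" if current == "grandmother" else "grandmother"
        start = point
    return segments

rng = random.Random(1)
for start, end, source in transmit_chromosome(2.0, rng):
    print(f"{start:.2f}-{end:.2f} Morgans from maternal {source}")
```

Running this repeatedly with different seeds gives a feel for how variable the mosaic is, including the occasional chromosome passed on with no crossovers at all.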

To illustrate how variable this process is here’s a slide show of 10 other replicates of this process:


The genomic material you share with your first cousin (on your mother’s side) is the overlapping fragments of genome that both of you have inherited from your shared maternal grandparents. In this next plot I show a simulation of the genomic material that you and your cousin both inherited from your shared maternal grandmother. In the third panel I show the overlapping genomic regions in purple. (If you are full first cousins you will also have shared genomic regions from your shared grandfather, not shown here.) These are regions where you and your cousin will have matching genomic material, due to having inherited it “identical by descent” from your shared grandmother:

In the cartoon below this is sketched out to show the transmission of the grandparental chromosomes (e.g. chromosome 1) to two cousins, and a stretch of identity by descent (IBD) shared between the cousins is shown:

(Note that the two representations do not show the same outcome of transmission, and so do not match up in terms of the shared genomic material.)

The inherent randomness in the transmission of genetic material, and in where recombination events occur, means that the exact number and location of these shared segments is quite random.

We can also look at more distant cousins. For example, consider second cousins who share a great grandmother. Here’s a simulation of 2nd cousins showing the genomic regions they inherited from their shared great grandmother (following the maternal lineage, mother’s mother’s mother), and the overlap in purple in the final panel:

As each of these individuals has eight great grandparents, they have inherited less genomic material from their great grandmother than from their grandmother. This material is also broken into shorter blocks as it has been through more generations of recombination. As these individuals inherit less material from their great grandmother, there are also fewer overlapping blocks of “identity by descent” between second cousins than there were between first cousins, and these regions are smaller.

We only have to go back to 4th cousins until it’s quite likely that they share no overlapping autosomal genomic material due to their shared great-great-great grandmother. Here’s an example of the material that two fourth cousins inherited from their shared great-great-great grandmother:

However, by chance they may have some overlapping material inherited from this ancestor. Also, you potentially have a reasonably large number of fourth cousins, so it is quite likely that you’ll share some genomic material with some of them.

These plots are a potentially nice way of illustrating the shared material, but they do not give us a sense of the probability that you share a given number of blocks identical by descent with a cousin. To look at this I simulated a large number of pedigrees and calculated the number of shared autosomal blocks for a variety of depths of relationship for each simulation. Here are the results for full cousins of varying degrees of relationship. The black line shows the results of the simulation (the light grey dots are an analytical approximation that I’ll explain below):

Looking at these we can see the range of numbers of blocks we expect cousins of a given degree of separation to share. For example, roughly 1 in 100 pairs of full third cousins will share zero blocks of their genome due to that shared pair of ancestors, while roughly 25% of pairs of full fourth cousins will share no blocks of their genome due to that pair of shared ancestors. These results assume that this is the only relationship that our cousins share. However, cousins may also share blocks of their genomes identical by descent due to deeper shared relationships (see Peter Ralph’s and my post on this for more discussion of this point). That means for people who share just a couple of blocks, particularly a single block, it may be difficult to assess whether they truly are closely related or whether by chance they have inherited a block from a much more distant ancestor.

We can also make these plots for varying degrees of half cousins:

You likely share less with half cousins of a given degree than you do with full cousins, as you only share a single recent common ancestor with these individuals rather than a pair of ancestors.

We can also graph out the probability that you and a relative who share one or two ancestors k generations back share zero blocks:

The dots show the simulations, the lines the approximation discussed below.

We can develop a simple, but reasonably accurate, approximation to the expected number of blocks shared between a pair of cousins of a given degree. These approximations have been developed by a number of authors. You can find a reasonable description (open access) in Huff et al, see text above and surrounding equation 7. An elaboration of the ideas laid out here is (I think) used by 23andMe and Ancestry.com to identify individuals who are close relatives in their databases.

Consider first a single genomic region. You and a cousin who share an ancestor d generations back are separated by 2d transmissions (d from the ancestor down to each of you), and each transmission passes on a given copy of the region with probability 1/2. As the ancestor carries two copies of the region, either of which could be the shared one, both of you inherit the region identical by descent from that ancestor with probability 2 × (1/2)^(2d) = 1/2^(2d-1).

That calculation is for a given genomic region; we now have to work out how many different genomic regions you and your cousin could possibly share. You have 22 autosomal chromosomes, and each generation recombination happens in ~34 places on these chromosomes. Looking back d generations your chromosomes are broken up into (22+34d) chunks, which are spread across your ancestors. Likewise your relative’s genome is broken into (22+34d) chunks. Because recombination events rarely happen in exactly the same place, your two genomes combined are broken into roughly (22+34*2d) pieces. As each of these is inherited identical by descent by both you and your cousin from that ancestor with probability 1/2^(2d-1), you and your cousin should expect to share (22+34*2d)/2^(2d-1) regions of your genome identical by descent (and double this for full cousins).

A genome does not always undergo ~34 recombination events per generation; this is just the average number. We can approximate the number of blocks that could possibly be shared between you and a relative by a Poisson distribution with mean (22+68d), as the number of recombination events is roughly Poisson distributed (ignoring recombination interference). As each of these blocks is shared with probability 1/2^(2d-1) for half cousins, the number of shared blocks is Poisson distributed with mean (22+68d)/2^(2d-1) for half cousins with an ancestor d generations ago (and double that mean for full cousins). In R we can code up this distribution for half cousins as dpois(0:70,(33.8*(2*d)+22)/(2^(2*d-1))), where d is the number of generations back to the shared ancestor (d = 2 for first cousins, d = 3 for second cousins, and so on). This approximation is what is shown as light grey dots in the above figures. This approximation also allows us to get the probability of zero blocks, the lines in the graph just above. For example, the probability of zero blocks being shared between two full cousins who share two ancestors d generations back is exp(-2*(33.8*(2*d)+22)/(2^(2*d-1))).
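As a rough check on the approximation, here is a small Python translation of the R expression above. It is a sketch, not the code used for the figures. It takes d to be the number of generations back to the shared ancestor (d = 2 for first cousins), an assumption that reproduces the roughly 1 in 100 and 25% figures quoted earlier for full third and fourth cousins. The function names are made up for illustration.

```python
import math

RECOMB_PER_GEN = 33.8  # average autosomal crossovers per transmission (value from the R snippet)
N_AUTOSOMES = 22

def mean_shared_blocks(d, full=True):
    """Approximate expected number of shared IBD blocks.

    d is the number of generations back to the shared ancestor(s)
    (assumed: d = 2 for first cousins, d = 3 for second cousins, ...).
    """
    half_cousin_mean = (RECOMB_PER_GEN * 2 * d + N_AUTOSOMES) / 2 ** (2 * d - 1)
    return 2 * half_cousin_mean if full else half_cousin_mean

def prob_blocks(n, d, full=True):
    """Poisson probability of sharing exactly n blocks (analogue of dpois)."""
    mean = mean_shared_blocks(d, full)
    return math.exp(-mean) * mean ** n / math.factorial(n)

def prob_zero_blocks(d, full=True):
    """Probability that the cousins share no blocks at all."""
    return math.exp(-mean_shared_blocks(d, full))

print("degree  P(zero | full)  P(zero | half)")
for degree in range(1, 7):
    d = degree + 1  # generations back to the shared ancestor(s)
    print(degree,
          round(prob_zero_blocks(d, full=True), 4),
          round(prob_zero_blocks(d, full=False), 4))
```

The table printed at the end covers first through sixth cousins, so you can see how quickly the chance of sharing nothing rises with each extra degree of separation.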

(I’m not totally happy with this description of the approximation, and will think about how to describe it better).


## How many genetic ancestors do I have?

In my last couple of posts I talked about how much of your (autosomal) genome you inherit from a particular ancestor [1,2]. In the chart below I show a family tree radiating out from one individual. Each successive layer out shows an individual’s ancestors another generation back in time, parents, grandparents, great-grandparents and so on back (red for female, blue for male).

Each generation back your number of ancestors doubles, until you are descended from so many people (e.g. 20 generations back you potentially have 1 million ancestors) that it is quite likely that some people back then are your ancestors multiple times over. How quickly then does your number of genetic ancestors grow, i.e. those ancestors who contributed genetic material to you?

Each generation we go back is expected to halve the amount of autosomal genetic material an ancestor gives to you. As this material is inherited in chunks, we only have to go back ~9 generations until it is quite likely that a specific ancestor contributed zero of your autosomal material to you (see previous post). This process is inherently random, as the process of recombination (the breaking of chromosomes into chunks) and transmission are both random sets of events. To give more intuition, and to demonstrate the nature of the randomness, I thought I’d set up some simulations of the genetic inheritance process back through time.

Below I show the same plot as above (going back 11 generations), but now ancestors that contribute no (autosomal) chunks of genetic material are coloured white (I give the % of ancestors with zero contribution below). I also wanted to illustrate how variable the contribution of (autosomal) genetic material was across ancestors in a particular generation. So I altered the shade of the colour of the ancestor to show what fraction of the genome they contributed. In choosing a scale I divided that fraction through by the maximum contribution of any ancestor in that generation, so that the individual who contributed the most is the darkest shade. Below the figure I give the range of % contributions to this individual, and the mean (which follows 0.5^k).

It’s quite fun to trace particular branches back and see their contribution change over time. These figures were inspired by ones I found at the genetic genealogy blog. I’m not sure how they generated them, and they are for illustrative purposes only. I made scripts to do the simulations and plot in R. I’ll post these scripts to github shortly.

To give a sense of how variable this process is, here’s another example:

From these it is clear that your number of genetic ancestors is increasing, but nowhere near as fast as your number of genealogical ancestors. To illustrate this I derived a simple approximation to the number of genetic ancestors over the generations (I give details below). Using this approximation I derived the number of genetic and genealogical ancestors, in a particular generation, going back over 20 generations:

Your number of genealogical ancestors, in generation k, is growing exponentially (I cropped the figure as otherwise it looks silly). Your number of genetic ancestors at first grows as quickly as your number of genealogical ancestors, as it is very likely that an ancestor a few generations back is also a genetic ancestor. After a few more generations your number of genetic ancestors begins to slow down its rate of growth: while the number of genealogical ancestors is growing rapidly, fewer and fewer of them are genetic ancestors. Your number of genetic ancestors eventually settles down to growing linearly back over the generations, at least over the time-scale here, with your number of ancestors in generation k being roughly 2*(22+33*(k-1)).

To get at this result I did some approximate calculations. If we go back k generations, the autosomes you received from (say) your mum are expected to be broken up into roughly (22+33*(k-1)) different chunks spread across ancestors in generation k (you have 22 autosomes, with roughly 33 recombination events per generation). If we go far enough back each ancestor is expected to contribute at most 1 block, so you have roughly 2*(22+33*(k-1)) genetic ancestors (counting those from your mum and dad).

To develop this a little more, consider the fact that you have 2^(k-1) ancestors k generations back on (say) your mother’s side, so you expect to inherit (22+33*(k-1))/2^(k-1) chunks from each ancestor. We can approximate the distribution of the number of chunks you inherit from a particular ancestor by a Poisson distribution with this mean*. So the probability that you inherit zero of your autosomal genome from a particular ancestor is approximately exp(-(22+33*(k-1))/2^(k-1)). This approximation seems to work quite well, and matches my simulations:
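Here is a short Python sketch of that zero-contribution probability, so you can plug in different values of k. The function name `prob_no_contribution` is hypothetical, and the 22 autosomes and ~33 crossovers per generation are the round numbers used in the text. As the footnote says, the Poisson approximation is only meant for reasonably large k.

```python
import math

def prob_no_contribution(k, autosomes=22, crossovers_per_gen=33):
    """Approximate probability that a specific ancestor k generations back
    contributed none of your autosomal genome (Poisson approximation)."""
    chunks_from_one_parent = autosomes + crossovers_per_gen * (k - 1)
    ancestors_on_that_side = 2 ** (k - 1)
    mean_chunks_per_ancestor = chunks_from_one_parent / ancestors_on_that_side
    return math.exp(-mean_chunks_per_ancestor)

for k in range(5, 15):
    print(k, round(prob_no_contribution(k), 3))
```

Under these numbers the probability is small up to about 7 generations back and then climbs quickly, in line with the ~9 generations quoted above.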

So using this we can write your expected number of genetic ancestors as 2^k*(1-exp(-(22+33*(k-1))/2^(k-1))), as you have 2^k ancestors, each of whom contributes genetic material to you with probability one minus the probability we just derived. When we go back far enough exp(-(22+33*(k-1))/2^(k-1)) ≈ 1-(22+33*(k-1))/2^(k-1), so your number of genetic ancestors in generation k grows linearly as 2*(22+33*(k-1)).
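To see the switch from exponential to linear growth, here is a Python sketch that tabulates genealogical ancestors, the expected number of genetic ancestors, and the linear approximation side by side. The function names are made up for illustration, and this is not the script behind the figure.

```python
import math

def genealogical_ancestors(k):
    """Number of genealogical ancestors k generations back (ignoring inbreeding)."""
    return 2 ** k

def expected_genetic_ancestors(k):
    """Expected number of genetic ancestors in generation k (Poisson approximation)."""
    mean_chunks = (22 + 33 * (k - 1)) / 2 ** (k - 1)
    return 2 ** k * (1 - math.exp(-mean_chunks))

def linear_approximation(k):
    """Large-k limit of the expected number of genetic ancestors."""
    return 2 * (22 + 33 * (k - 1))

print("k  genealogical  genetic  linear")
for k in range(1, 21):
    print(k,
          genealogical_ancestors(k),
          round(expected_genetic_ancestors(k), 1),
          linear_approximation(k))
```

For the first few generations the genetic and genealogical columns agree almost exactly; by k = 20 the genetic column sits just below the linear approximation while the genealogical column has passed a million.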

Your number of genetic ancestors will not grow linearly forever. If we go far enough back your number of genetic ancestors will get large enough, on the order of the size of the population you are descended from, that it will stop growing, as you will be inheriting different chunks of genetic material from the same set of individuals multiple times over. At this point your number of ancestors will begin to plateau. Indeed, once we go back far enough your number of genetic ancestors will actually begin to contract, as human populations have grown rapidly over time. I’ll return to this in another post.

* this will be okay if k is sufficiently large, I can explain this in the comments if folks like. This approximation has been made by many folks, e.g. Huff et al. in estimating genetic relationships between individuals.

This post was inspired in part by a nice post by Luke Jostins (back in 2009). I think there were some errors in Luke’s code. I’ve talked this over with Luke, and he’s attached a note to the old post pointing folks here.

## How much of your genome do you inherit from a particular ancestor?

A generation ago you have two ancestors, your parents, two generations ago you have four grandparents (ignoring the possibility of inbreeding).
Each generation we go back your number of ancestors doubles, such that your number of ancestors k generations back grows as 2^k (again ignoring the possibility of inbreeding, which is a fair assumption for small k and if your ancestry derives from a large population).

However, you only have two copies of your autosomal genome, one from your mum and one from your dad. Each generation we go back halves the amount of autosomal genome you receive, on average, from a particular ancestor. For example, on average 50% of your autosomal genome passed on from your mother comes from your maternal grandmother, and 50% comes from your maternal grandfather. This material is inherited in large chunks, as chromosome fragments are inherited in large blocks between recombination events.

As you inherit autosomal material in large chunks there is some spread around the amount of genetic material you receive; e.g. you might have inherited 45% of your autosomal material from your maternal grandmother, and 55% from your maternal grandfather. In my last post on this topic I looked at the distribution of how much of your autosomes you receive from your grandparents, and I talked about why it was vanishingly unlikely that you received 0% of your genome from a grandparent.

We can take this back further, and look at the spread of how much of your autosomes you receive from ancestors further back, and how far we have to go back until it is quite likely that a particular ancestor contributed no genetic material on your autosomes to you. To do this I again made use of transmission data I had to hand to calculate these quantities using real data. Using data I had for one generation of transmissions, I compounded these together over multiple generations. After doing this I calculated a number of different quantities that I’ll describe below.

First let’s look at the distribution of the number of autosomal genomic blocks you receive from a specific ancestor k generations ago:

The black line is for a typical ancestor, where we do not worry about how many males and females there are along the particular route back through the family tree. If we follow your matrilineal line back we see there are more blocks, as females have a higher recombination rate and so are breaking their genomes up into more blocks; following the patrilineal line we find fewer blocks, as males have lower rates of recombination.

We can also calculate the probability that you inherit zero (large) blocks of your genome from a specific ancestor:

We can also do this for individual chromosomes:

The lower number chromosomes are bigger, recombine more, and so are broken into more chunks, making it more likely that a specific ancestor contributes one of those chunks.

Finally we can look at the distribution of the amount of autosomal material you inherit from an ancestor k generations ago:

Note that these distributions are centered on 1/(2^k).

## How much of your genome do you inherit from a particular grandparent?

You’ve got two copies of each chromosome, having received one copy of each chromosome from your mother and one chromosome from your father (this is true for your autosomes, but not for your X, Y, and mitochondria). When it comes time to pass on your DNA to the next generation, you in turn package up a single copy of each chromosome into a sperm/egg. Sometimes you pass on either mum or dad’s copy of a chromosome at random, often though you pass on a mosaic consisting of the two chromosomes (a recombinant chromosome).

The question came up (via an article by Razib Khan) of what is the probability that by chance your parent entirely failed to pass any autosomal DNA from a grandparent to you (e.g. your father fails to pass on any autosomal genome from your paternal grandfather)? There are 22 autosomes, so if there was no recombination that would happen with probability 2 × 0.5^22 = 4.7×10^(-7). But this probability is very much lower with recombination, as a recombinant chromosome necessarily has material from both of that parent’s parents. A discussion of how to do this calculation with recombination came up via Mike Eisen on twitter [1].

In order for your parent to transmit entire autosomes from only one grandparent, your parent also has to transmit all of their chromosomes without recombination [2]. Recombination also makes this probability differ between the sexes. This is because the probability that a chromosome is transmitted without recombination depends on the sex of the individual: females recombine more than males and so are less likely to transmit a chromosome without recombination. The probability of a chromosome being transmitted without recombination also depends on the size of the chromosome, as big chromosomes recombine more. For example, chromosome 1 has a 2% chance of being transmitted to the next generation without recombination by females, but a 7% chance of this happening in males. Chromosome 22, a much smaller chromosome, has a 37% chance of being transmitted without recombination in females, and a 44% chance in males (you can look up these frequencies in the supplement of a paper I wrote with Adi Fledel-Alon and other folks from Molly Przeworski’s lab).

To work out the probability of all chromosomes being transmitted without recombination for a particular sex we simply multiply together the probability of each chromosome being transmitted without recombination [3]. Doing this, we find that the probability that a male transmits every chromosome without recombination is 8.8×10^(-16), and this probability is substantially lower in females at 2.8×10^(-23).

Then, having not recombined on any chromosome, that parent would also have to transmit every chromosome from the same grandparent (with probability 4.7×10^(-7)). So the probability that your mother fails entirely to transmit any autosomal genetic material from one of her parents to you is 1.3×10^(-29), and your father does this with probability 4.2×10^(-22). So it’s pretty bloody unlikely.
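The arithmetic behind those last two numbers is easy to reproduce. Here is a minimal Python sketch that simply multiplies the quantities quoted above. The per-sex probabilities of no recombination are copied from the text rather than recomputed from the per-chromosome data, and the variable names are my own.

```python
N_AUTOSOMES = 22

# Probability that all 22 unrecombined autosomes come from the same grandparent:
# 0.5 per chromosome, times 2 because either grandparent could be the one left out.
p_all_from_one_grandparent = 2 * 0.5 ** N_AUTOSOMES

# Probability that a parent transmits every autosome without recombination
# (values quoted in the text above).
p_no_recombination = {"father": 8.8e-16, "mother": 2.8e-23}

p_nothing_transmitted = {
    parent: p_all_from_one_grandparent * p
    for parent, p in p_no_recombination.items()
}

print(f"all chromosomes from one grandparent: {p_all_from_one_grandparent:.2e}")
for parent, p in p_nothing_transmitted.items():
    print(f"{parent} transmits nothing from one grandparent: {p:.2e}")
```

Because it treats chromosomes as independent, this inherits the caveat in note [3].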

Perhaps a more interesting question is: what is the distribution of the fraction of the autosomal genome that your parent transmits to you from a particular grandparent (e.g. your maternal grandmother)?

This question has been considered mathematically by a number of authors, as it has important applications for identifying unknown genetic relationships between individuals and estimating various heritability measures. However, to my knowledge no one has actually done this calculation using real recombination data (so I thought it would be fun to do). For each chromosome in turn, using recombination data from real transmissions, I simulated the amount of grandparental chromosome that was transmitted by a parent. For example, here’s the histogram of the distribution of the amount of chromosome 1 and 22 a father or a mother transmits.

These distributions are less variable in females than in males due to the greater number of recombination events in females than in males, and the fraction transmitted is more variable for small chromosomes as they have fewer recombination events. The pdf showing these histograms for every chromosome is here.

I then looked at what fraction of the entire (autosomal) genome from a particular grandparent was transmitted to the next generation.

I was a little surprised by how long tailed this was in males. Roughly 5/1000 fathers transmit less than 20% of one paternal grandparent’s autosomes to the next generation!

Sometime soon I’ll generate these numbers for longer transmission chains, e.g. what’s the distribution of the fraction of your genome that you could expect to receive from a great-grandparent.

1. I originally messed up this calculation, Mike Eisen got the right answer and pointed out my error. Thanks also to Amy Williams and Adam Auton for motivating some of the questions addressed here.

2. The probability of failing to transmit the entirety of one grandparental autosome is actually a lot lower than this, as gene conversion also can lead to transmission of small chunks of genome even if there is no crossing over. Gene conversion is thought to be ~10x as common as crossing over, and I estimate the probability of no transmitted crossovers or gene conversions to be <10^(-90). However, gene conversions are very small, so we might think the calculation above is for the bulk of the genome.

3. This isn't quite right, as the recombination rates of different chromosomes aren't independent of each other.

UPDATE:
A few more details of how I obtained the distributions of transmitted material. I started with a set of 1374 parent-offspring transmissions that we had information for.

For each transmission I took the observed set of crossover events for each chromosome. If a chromosome had no crossovers, with probability 1/2 the parent transmitted the entire grandparental chromosome, otherwise they transmitted nothing for this chromosome.

If a chromosome had one or more recombination events in its transmission from a parent, both grandparents will have a contribution. We then have to decide who contributed what material based on the locations of the recombination events. The crossovers define a set of intervals transmitted together, which alternate between which grandparental material is transmitted. So for each transmission, with probability 1/2 I make the parent transmit the grandparental material corresponding to the odd inter-recombination intervals, else they transmit the even inter-recombination intervals.
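The per-chromosome procedure just described can be sketched in a few lines of Python. This is an illustration of the idea rather than the code I actually ran: the function name `transmitted_fractions`, the chromosome length, and the crossover positions are all hypothetical stand-ins for the real transmission data.

```python
import random

def transmitted_fractions(chrom_length, crossovers, rng):
    """Fractions of one transmitted chromosome that come from each grandparent.

    chrom_length: length of the chromosome (any unit, e.g. Mb).
    crossovers:   observed crossover positions for this transmission (same unit).
    Returns (fraction from the first grandparent, fraction from the second).
    """
    # Crossovers cut the chromosome into intervals that alternate in origin.
    boundaries = [0.0] + sorted(crossovers) + [chrom_length]
    interval_lengths = [b - a for a, b in zip(boundaries, boundaries[1:])]
    odd_total = sum(interval_lengths[0::2])   # 1st, 3rd, 5th, ... intervals
    even_total = sum(interval_lengths[1::2])  # 2nd, 4th, 6th, ... intervals

    # Mendelian coin flip: which grandparent gets the odd intervals.
    # With no crossovers there is a single interval, so one grandparent
    # contributes the whole chromosome and the other contributes nothing.
    if rng.random() < 0.5:
        first, second = odd_total, even_total
    else:
        first, second = even_total, odd_total
    return first / chrom_length, second / chrom_length

rng = random.Random(42)
# Made-up example: a 100 Mb chromosome with crossovers at 20 Mb and 50 Mb.
print(transmitted_fractions(100.0, [20.0, 50.0], rng))
# Made-up example: the same chromosome transmitted with no crossovers.
print(transmitted_fractions(100.0, [], rng))
```

Repeating the coin flip many times over a fixed set of observed crossover positions is what generates the different realizations of Mendelian transmission.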

Thus my simulations represent real transmissions, the only simulated part is the realization of Mendelian transmission (i.e. the 50/50 transmission probabilities). This means that the chromosome specific plots are not really simulations, and truly reflect these transmission data (each transmission contributing two datapoints, corresponding to the two grandparents).

My whole genome simulations are simulations that assume independence of Mendelian transmission across chromosomes. Only strong selection on viability/meiotic drive at individual loci could violate this assumption, and in general there is little evidence for this in humans. Given this assumption I can simulate vast numbers of transmitted autosomes due to the different realizations of Mendelian segregation across chromosomes. These represent pseudo-samples, in the sense that they only reflect the variation in the placement of recombination events across our 1374 parent-offspring transmissions. But overall I think this is not a bad way to approximate the distribution of transmitted material. It won’t be quite right in the very extreme tails, and that would need data on vastly more transmissions.

## The blossoming of Capsella rubella.

Yaniv’s Capsella article is the cover image of PLOS Genetics

Image Credit: Kim Steige

Flowers of the selfing plant species, C. rubella.
In this issue, Brandvain et al. identify blocks of ancestry inherited from the founders of this recently derived species. With these blocks, they learn that C. rubella split from its outcrossing progenitor around 50,000 to 100,000 years ago, and subsequently lost much of its genetic diversity. These ancestry blocks also inform us about the number of individuals that founded C. rubella, the relaxation of purifying selection since its origin, and its spread across the globe.


## Post on The Population Genetic Signature of Polygenic Local Adaptation

We (Jeremy and Graham) have a new arXived paper: “The Population Genetic Signature of Polygenic Local Adaptation” (arXived here). This is a cross post from Haldane’s sieve. Comments are welcome there.

The field of population genetics has devoted a lot of time to identifying signals of adaptation. These tests are usually predicated on the fact that local adaptation can drive large allele frequency changes between populations. However, we’ve known for almost a century that many traits are highly polygenic, so that adaptation can occur through subtle shifts in allele frequencies at many loci. Until now we’ve been unable to detect such signals, but genome-wide association studies (GWAS) now give us a way of potentially learning about selection on quantitative traits from population genetic data. In this paper we develop a set of approaches to do this in a robust population genetic framework.

GWAS usually assume a simple additive model, i.e. no epistasis/dominance, to test for and estimate effect sizes for a genome-wide set of loci. To test whether local adaptation has shaped the genetic basis of the trait, we do the perhaps boneheaded thing of taking the GWAS results at face value. For each population we simply sum up the product of the frequency at each GWAS SNP and the effect size of that SNP. This gives us an estimate of the mean additive genetic value for the phenotype in each population. This is not the mean phenotype of the population as it ignores the fact that we don’t know all the variants affecting our trait; environmental change across populations, gene by environment interactions, and changes in allele frequencies that have altered the dominance and epistatic relationships between alleles (i.e. all that good stuff that makes life interesting). However, these additive genetic values do have the very useful property that they are simple linear functions of the allele frequencies, which means that we can construct a simple and robust model of genetic drift causing these phenotypes to diverge across populations.
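As a toy illustration of that sum, here is a Python sketch that computes an estimated genetic value for each population as the sum over GWAS SNPs of allele frequency times effect size. It is not the code from the paper. The SNP names, population labels, frequencies, and effect sizes are entirely made up, and the function name `genetic_values` is hypothetical.

```python
def genetic_values(frequencies, effect_sizes):
    """Estimated mean additive genetic value for each population.

    frequencies:  {population: {snp: frequency of the effect allele}}
    effect_sizes: {snp: GWAS effect size of that allele}
    """
    return {
        population: sum(freq * effect_sizes[snp] for snp, freq in snp_freqs.items())
        for population, snp_freqs in frequencies.items()
    }

# Hypothetical toy data, for illustration only.
effect_sizes = {"snp1": 0.5, "snp2": -0.2, "snp3": 0.1}
frequencies = {
    "pop_A": {"snp1": 0.2, "snp2": 0.5, "snp3": 0.9},
    "pop_B": {"snp1": 0.6, "snp2": 0.1, "snp3": 0.4},
}

values = genetic_values(frequencies, effect_sizes)
global_mean = sum(values.values()) / len(values)
for population, value in values.items():
    print(population, round(value, 3), "deviation:", round(value - global_mean, 3))
```

Because each value is a linear function of the allele frequencies, any model for how frequencies drift across populations carries straight over to the genetic values.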

In Figure A we show our estimated genetic values using the human height GWAS of Lango Allen et al (2010). As you can see, populations show deviations around the global mean genetic value, and populations from the same geographic regions covary somewhat in the deviation they take, reflecting the fact that allele frequencies at each GWAS locus tend to covary in their shared genetic drift due to population history and migration. For example in Figure B we show allele frequencies at one of the GWAS height loci.

We can approximately model the allele frequencies at a single locus by assuming that they are multivariate normally distributed around the global mean. The covariance matrix of this distribution is given by a matrix closely related to the kinship matrix of our populations, which can be calculated from a genome-wide sample of putatively neutral loci. As our vector of phenotypic genetic values across populations is simply a weighted sum of the individual allele frequencies, our vector of genetic values also follows a multivariate normal distribution. Given that we are summing up a lot of loci, even if the multivariate normal model is a poor approximation to drift at one locus, the central limit theorem suggests that it should still be a good fit to the distribution of the genetic values.

This simple neutral model framework, based on multivariate normal distributions, gives us a strong framework to develop tests of selection. Our most basic test is a test for the over-dispersion of the variance of genetic values (i.e. too great an among population variance, once population structure has been accounted for). We also develop a test for environmental correlations and a way to identify outlier populations and regions to further understand the signal of local adaptation.

We apply our tests to six different GWAS datasets using the HGDP as our set of populations. Our tests reveal wide-spread evidence of selection shaping polygenic traits across populations, although many of the signals are quite subtle. Somewhat surprisingly, we find little evidence for selection on the loci involved in Type 2 diabetes, somewhat of a poster-child for adaptation shaping the genetic basis of a disease thanks to the thrifty gene hypothesis.

We think our approach is a promising way forward to look for selection on the genetic basis of quantitative traits as viewed by GWAS. However, it also highlights some concerns. In developing our tests we found that we had developed a set of methods that already have equivalents in the quantitative trait community, in particular QST, a phenotypic analog of FST (and its extensions by a number of authors). This raises the question of whether, in systems where common garden experiments are possible, there is a need to do GWAS if we are only interested in how local adaptation has shaped traits, or if QST style approaches are the best that one can do. We do think that there is much more that could be learnt by our style of approach, but it should also give researchers pause to consider why they want to “find the genes” for local adaptation.

We’ve already gotten some very helpful comments via Haldane’s sieve. We’d love more comments, particularly about points of confusion that could be clarified, other datasets that might be good to apply this to, or other applications we could develop.