How much of your genome do you inherit from a particular ancestor?

How much of your genetic material do you inherit from a particular ancestor? You inherit your mitochondria through your matrilineal lineage (your mum, your mum’s mum, your mum’s mum’s mum, and so on) and your Y chromosome through your patrilineal lineage, but how is the rest of your genome spread across your ancestors in any given generation?

One generation ago you have two ancestors, your parents; two generations ago you have four grandparents (ignoring the possibility of inbreeding).
Each generation we go back, your number of ancestors doubles, so that your number of ancestors k generations back grows as 2^k (again ignoring the possibility of inbreeding, which is a fair assumption for small k and if your ancestry derives from a large population).

However, you only have two copies of your autosomal genome, one from your mum and one from your dad. Each generation we go back halves the amount of autosomal genome you receive, on average, from a particular ancestor. For example, on average 50% of the autosomal genome passed on by your mother comes from your maternal grandmother and 50% from your maternal grandfather. This material is inherited in large chunks, as chromosomes are passed on in large blocks separated by recombination events.

Because you inherit autosomal material in large chunks, there is some spread around the amount of genetic material you receive; e.g. you might have inherited 45% of your autosomal material from your maternal grandmother and 55% from your maternal grandfather. In my last post on this topic I looked at the distribution of how much of your autosomes you inherit from each grandparent, and I talked about why it is vanishingly unlikely that you received 0% of your genome from a grandparent.

We can take this back further, and look at the spread of how much of your autosomes you receive from ancestors further back, and at how far back we have to go before it is quite likely that a particular ancestor contributed no genetic material to your autosomes. To do this I again made use of real transmission data I had to hand: using data for one generation of transmissions, I compounded these together over multiple generations, and then calculated a number of different quantities that I’ll describe below.

First let's look at the distribution of the number of autosomal genomic blocks you receive from a specific ancestor k generations ago:

[Figure: distribution of the number of autosomal blocks inherited from an ancestor k generations ago]

The black line is for a typical ancestor, where we do not worry about how many males and females there are along the particular route back through the family tree. If we follow your matrilineal line back we see more blocks, as females have a higher recombination rate and so break their genomes up into more blocks; following the patrilineal line we find fewer blocks, as males have lower rates of recombination.

As a rough rule of thumb, the autosomes you received from (say) your mother k generations back are broken into 22+33*(k-1) chunks, as your genome comes in 22 autosomes and there are on average 33 recombination events per transmitted genome. These chunks are spread across your 2^(k-1) maternal ancestors. So, for example, nine generations ago the autosomes you receive from (say) your mum are broken, on average, into 286 large chunks, and these are spread across your 256 ancestors in that generation. Thus on average each of these ancestors has contributed only a single block to you, and by chance it is possible that they contributed zero. This gets worse the further back in time we go: your genome is broken up into more and more chunks, but their number does not grow as fast as your number of ancestors. This makes it increasingly likely that you inherit no autosomal material from a particular ancestor.
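Here's a quick back-of-the-envelope version of this rule of thumb in Python. This is a crude sketch that assumes each chunk lands on an ancestor uniformly at random; the figures below come from compounding real transmission data, not from this approximation:

```python
# Rough rule of thumb from above: the autosomes inherited via one parent
# are broken into ~22 + 33*(k-1) chunks k generations back, spread over
# the 2^(k-1) ancestors on that side. Crudely assume each chunk lands on
# an ancestor uniformly at random.

def expected_chunks(k):
    """Expected number of autosomal chunks k generations back."""
    return 22 + 33 * (k - 1)

def num_ancestors(k):
    """Number of ancestors k generations back on one parent's side."""
    return 2 ** (k - 1)

def prob_zero_blocks(k):
    """Approximate probability that a specific ancestor k generations
    back contributed none of your autosomal chunks."""
    return (1 - 1 / num_ancestors(k)) ** expected_chunks(k)

for k in range(2, 14):
    print(k, expected_chunks(k), num_ancestors(k), round(prob_zero_blocks(k), 3))
```

For k = 9 this gives 286 chunks over 256 ancestors and a zero-block probability of roughly 1/3, in line with the discussion above.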

We can also calculate the probability that you inherit zero (large) blocks of your genome from a specific ancestor:

[Figure: probability of inheriting zero large autosomal blocks from a specific ancestor k generations ago]

We can also do this for individual chromosomes:
[Figure: probability of inheriting zero blocks of a given chromosome from a specific ancestor k generations ago]
The lower-numbered chromosomes are bigger and recombine more, and so are broken into more chunks, making it more likely that a specific ancestor contributes at least one of those chunks.

Finally we can look at the distribution of the amount of autosomal material you inherit from an ancestor k generations ago:

[Figure: distribution of the amount of autosomal material inherited from an ancestor k generations ago]
Note that these distributions are centered on 1/2^k.


How much of your genome do you inherit from a particular grandparent?

You’ve got two copies of each chromosome, having received one copy from your mother and one from your father (this is true for your autosomes, but not for your X, Y, and mitochondria). When it comes time to pass on your DNA to the next generation, you in turn package up a single copy of each chromosome into a sperm/egg. Sometimes you pass on either your mum’s or your dad’s copy of a chromosome intact; often, though, you pass on a mosaic of the two (a recombinant chromosome).
[Figure: cartoon of chromosome transmission with and without recombination]

The question came up (via an article by Razib Khan) of what the probability is that, by chance, your parent entirely failed to pass any autosomal DNA from a grandparent to you (e.g. your father failed to pass on any autosomal genome from your paternal grandfather). There are 22 autosomes, so if there were no recombination this would happen with probability 2 x 0.5^22 = 4.7×10^(-7). But the probability is very much lower with recombination, as a recombinant chromosome necessarily carries material from both parents. A discussion of how to do this calculation with recombination came up via Mike Eisen on twitter [1].

In order for your parent to transmit an autosome consisting entirely of material from one grandparent, your parent has to transmit all of their chromosomes without recombination [2]. Recombination also makes this probability differ between the sexes, because the probability that a chromosome is transmitted without recombination depends on the sex of the individual: females recombine more than males, and so are less likely to transmit a chromosome without recombination. The probability of a chromosome being transmitted without recombination also depends on the size of the chromosome, as big chromosomes recombine more. For example, chromosome 1 has a 2% chance of being transmitted without recombination by females, but a 7% chance in males. Chromosome 22, a much smaller chromosome, has a 37% chance of being transmitted without recombination in females, and a 44% chance in males (you can look up these frequencies in the supplement of a paper I wrote with Adi Fledel-Alon and other folks from Molly Przeworski’s lab).

To work out the probability that every chromosome fails to recombine in a transmission from a parent of a particular sex, we simply multiply together the probabilities of each chromosome being transmitted without recombination [3]. Doing this, we find that the probability that a male transmits every chromosome without recombination is 8.8×10^(-16), and this probability is substantially lower in females, at 2.8×10^(-23).

Then, having not recombined on any chromosome, that parent would also have to transmit every one of those chromosomes from the same grandparent (with probability 4.7×10^(-7), as above). So the probability that your mother fails entirely to transmit any autosomal genetic material from a particular grandparent to you is 1.3×10^(-29), and your father does this with probability 4.2×10^(-22). So it’s pretty bloody unlikely.
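As a sanity check on the arithmetic, here are the numbers from above multiplied together in Python (the per-chromosome non-recombination probabilities themselves come from the paper's supplement, not from anything computed here):

```python
# Probability of transmitting all 22 autosomes from the same grandparent,
# given no recombination anywhere:
p_same_grandparent = 2 * 0.5**22                 # ~4.7e-7

# Probability that no chromosome recombines, quoted in the text above:
p_no_recomb_male = 8.8e-16
p_no_recomb_female = 2.8e-23

print(p_no_recomb_male * p_same_grandparent)     # ~4.2e-22 (father)
print(p_no_recomb_female * p_same_grandparent)   # ~1.3e-29 (mother)
```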

Perhaps a more interesting question is: what is the distribution of the fraction of the autosomal genome that your parent transmits to you from a particular grandparent (e.g. your maternal grandmother)?

This question has been considered mathematically by a number of authors, as it has important applications for identifying unknown genetic relationships between individuals and for estimating various heritability measures. However, to my knowledge no one has actually done this calculation using real recombination data (so I thought it would be fun to do). For each chromosome in turn, using recombination data from real transmissions, I simulated the amount of grandparental chromosome that was transmitted by a parent. For example, here are histograms of the distribution of the amount of chromosomes 1 and 22 a father or a mother transmits.

[Figure: histograms of the fraction of chromosomes 1 and 22 transmitted from a grandparent, by parental sex]

These distributions are less variable in females than in males due to the greater number of recombination events in females, and the fraction transmitted is more variable for small chromosomes as they have fewer recombination events. The pdf showing these histograms for every chromosome is here.

I then looked at what fraction of the entire (autosomal) genome from a particular grandparent was transmitted to the next generation.

[Figure: distribution of the fraction of the autosomal genome transmitted from a particular grandparent, by parental sex]

I was a little surprised by how long-tailed this is in males. Roughly 5 in 1000 fathers transmit less than 20% of one paternal grandparent’s autosomes to the next generation!

Sometime soon I’ll generate these numbers for longer transmission chains, e.g. what’s the distribution of the fraction of your genome you could expect to receive from a great-grandparent?

1. I originally messed up this calculation; Mike Eisen got the right answer and pointed out my error. Thanks also to Amy Williams and Adam Auton for motivating some of the questions addressed here.

2. The probability of failing to transmit the entirety of one grandparental autosome is actually a lot lower than this, as gene conversion can also lead to the transmission of small chunks of genome even when there is no crossing over. Gene conversion is thought to be ~10x as common as crossing over, and I estimate the probability of no transmitted crossovers or gene conversions to be <10^(-90). However, gene conversion tracts are very small, so we can think of the calculation above as applying to the bulk of the genome.

3. This isn't quite right, as the recombination rates of different chromosomes aren't independent of each other.

UPDATE:
A few more details of how I obtained the distributions of transmitted material. I started with a set of 1374 parent-offspring transmissions that we had information for.

For each transmission I took the observed set of crossover events for each chromosome. If a chromosome had no crossovers, then with probability 1/2 the parent transmitted the entire chromosome from the focal grandparent; otherwise they transmitted nothing from that grandparent for this chromosome.

If a chromosome had one or more recombination events in its transmission from a parent, both grandparents will have contributed material, and we have to decide who contributed what based on the locations of the recombination events. The crossovers define a set of intervals transmitted together, alternating between the two grandparents. So for each transmission, with probability 1/2 I make the parent transmit the grandparental material corresponding to the odd-numbered inter-recombination intervals; otherwise they transmit the even-numbered intervals.
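Here is a minimal Python sketch of that resampling scheme for a single chromosome (the data structures and names are hypothetical stand-ins; the real calculation uses the observed crossover locations from the 1374 transmissions):

```python
import random

def fraction_from_focal_grandparent(crossovers, chrom_len):
    """Fraction of a chromosome of length chrom_len that a parent
    transmits from one focal grandparent, given the observed crossover
    positions for this transmission."""
    if not crossovers:
        # No crossover: the whole chromosome comes from a single
        # grandparent, chosen with probability 1/2.
        return 1.0 if random.random() < 0.5 else 0.0
    # Crossovers define intervals that alternate between grandparents;
    # a fair coin decides whether the focal grandparent contributes the
    # odd-numbered or the even-numbered intervals.
    edges = [0.0] + sorted(crossovers) + [chrom_len]
    intervals = [b - a for a, b in zip(edges, edges[1:])]
    start = random.choice([0, 1])
    return sum(intervals[start::2]) / chrom_len

# e.g. one realization for a chromosome of length 200 (arbitrary units)
# with crossovers observed at positions 50 and 120:
print(fraction_from_focal_grandparent([50.0, 120.0], 200.0))
```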

Thus my simulations represent real transmissions; the only simulated part is the realization of Mendelian transmission (i.e. the 50/50 transmission probabilities). This means that the chromosome-specific plots are not really simulations, and truly reflect these transmission data (each transmission contributing two datapoints, corresponding to the two grandparents).

My whole-genome simulations are simulations in that they assume independence of Mendelian transmission across chromosomes. Only strong viability selection or meiotic drive at individual loci could violate this assumption, and in general there is little evidence for this in humans. Given this assumption I can simulate vast numbers of transmitted autosomes from the different realizations of Mendelian segregation across chromosomes. These represent pseudo-samples, in the sense that they only reflect the variation in the placement of recombination events across our 1374 parent-offspring transmissions. But overall I think this is not a bad way to approximate the distribution of transmitted material. It won’t be quite right in the very extreme tails; that would need data on vastly more transmissions.


The blossoming of Capsella rubella.

Yaniv’s Capsella article provides the cover image of PLOS Genetics.
Image Credit: Kim Steige

Flowers of the selfing plant species C. rubella.
In this issue, Brandvain et al. identify blocks of ancestry inherited from the founders of this recently derived species. With these blocks, they learn that C. rubella split from its outcrossing progenitor around 50,000 to 100,000 years ago, and subsequently lost much of its genetic diversity. These ancestry blocks also inform us about the number of individuals that founded C. rubella, the relaxation of purifying selection since its origin, and its spread across the globe.


Post on The Population Genetic Signature of Polygenic Local Adaptation

We (Jeremy and Graham) have a new arXived paper: “The Population Genetic Signature of Polygenic Local Adaptation” (arXived here). This is a cross-post from Haldane’s sieve. Comments are welcome there.

The field of population genetics has devoted a lot of time to identifying signals of adaptation. These tests are usually predicated on the fact that local adaptation can drive large allele frequency changes between populations. However, we’ve known for almost a century that many traits are highly polygenic, so that adaptation can occur through subtle shifts in allele frequencies at many loci. Until now we’ve been unable to detect such signals, but genome-wide association studies (GWAS) now give us a way of potentially learning about selection on quantitative traits from population genetic data. In this paper we develop a set of approaches to do this in a robust population genetic framework.

GWAS usually assume a simple additive model, i.e. no epistasis or dominance, to test for and estimate effect sizes at a genome-wide set of loci. To test whether local adaptation has shaped the genetic basis of the trait, we do the perhaps boneheaded thing of taking the GWAS results at face value. For each population we simply sum up the product of the frequency of each GWAS SNP and the effect size of that SNP. This gives us an estimate of the mean additive genetic value for the phenotype in each population. This is not the mean phenotype of the population, as it ignores the variants affecting our trait that we don’t know about, environmental differences across populations, gene-by-environment interactions, and changes in allele frequencies that have altered the dominance and epistatic relationships between alleles (i.e. all that good stuff that makes life interesting). However, these additive genetic values do have the very useful property that they are simple linear functions of the allele frequencies, which means that we can construct a simple and robust model of genetic drift causing these phenotypes to diverge across populations.
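In code, the genetic value calculation is just a dot product. A minimal sketch with toy arrays (names, dimensions, and the random data are illustrative, not our actual datasets):

```python
import numpy as np

# Toy data: allele frequencies at each GWAS SNP in each population,
# and the GWAS effect-size estimate for each SNP.
n_pops, n_snps = 52, 180
freqs = np.random.uniform(0.05, 0.95, size=(n_pops, n_snps))
effects = np.random.normal(0.0, 0.03, size=n_snps)

# Mean additive genetic value per population: sum over SNPs of
# (allele frequency) x (effect size).
genetic_values = freqs @ effects
print(genetic_values.shape)  # one value per population
```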

[Figure A: estimated genetic values for height across HGDP populations]

In Figure A we show our estimated genetic values using the human height GWAS of Lango Allen et al (2010). As you can see, populations show deviations around the global mean genetic value, and populations from the same geographic region tend to deviate in the same direction, reflecting the fact that allele frequencies at each GWAS locus covary due to the shared genetic drift of population history and migration. For example, in Figure B we show allele frequencies at one of the GWAS height loci.

[Figure B: allele frequencies across populations at a single GWAS height locus]

We can approximately model the allele frequencies at a single locus by assuming that they are multivariate normally distributed around the global mean. The covariance matrix of this distribution is closely related to the kinship matrix of our populations, and can be calculated from a genome-wide sample of putatively neutral loci. As our vector of genetic values across populations is simply a weighted sum of the individual allele frequencies, the vector of genetic values also follows a multivariate normal distribution. Given that we are summing up a lot of loci, even if the multivariate normal model is a poor approximation to drift at any one locus, the central limit theorem suggests that it should still be a good fit to the distribution of the genetic values.
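Written out (in notation I'm introducing here as a sketch of the model just described, so the symbols are mine rather than the paper's):

```latex
% Allele frequencies at GWAS locus \ell across populations, approximately
% multivariate normal around the mean \epsilon_\ell, with covariance set
% by the kinship-like matrix F:
\mathbf{p}_\ell \sim \mathrm{MVN}\!\big(\epsilon_\ell \mathbf{1},\; \epsilon_\ell(1-\epsilon_\ell)\,\mathbf{F}\big)

% Genetic values are a weighted sum over loci (\alpha_\ell the effect
% size), so they are also multivariate normal:
Z_m = \sum_\ell \alpha_\ell\, p_{m,\ell},
\qquad
\mathbf{Z} \sim \mathrm{MVN}\!\big(\mu\,\mathbf{1},\; V_A\,\mathbf{F}\big),
\qquad
V_A = \sum_\ell \alpha_\ell^2\, \epsilon_\ell(1-\epsilon_\ell)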

This simple neutral model, based on multivariate normal distributions, gives us a strong framework in which to develop tests of selection. Our most basic test is a test for over-dispersion of the genetic values (i.e. too great an among-population variance, once population structure has been accounted for). We also develop a test for environmental correlations, and a way to identify outlier populations and regions, to further understand the signal of local adaptation.
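As a rough Python sketch of what the basic over-dispersion test looks like under this model (my reconstruction, not the exact statistic from the paper; conventions around mean-centering and the scaling of F differ in the actual method):

```python
import numpy as np

def overdispersion_stat(Z, F, V_A):
    """Quadratic form measuring excess among-population variance of the
    genetic values Z, relative to the neutral covariance V_A * F.
    Under neutrality this is roughly chi-squared distributed with
    (M - 1) degrees of freedom for M populations."""
    Z_c = Z - Z.mean()            # deviations from the global mean
    F_inv = np.linalg.pinv(F)     # pseudo-inverse (F is near-singular after centering)
    return float(Z_c @ F_inv @ Z_c) / V_A
```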

We apply our tests to six different GWAS datasets, using the HGDP as our set of populations. Our tests reveal widespread evidence of selection shaping polygenic traits across populations, although many of the signals are quite subtle. Somewhat surprisingly, we find little evidence for selection on the loci involved in Type 2 diabetes, somewhat of a poster child for adaptation shaping the genetic basis of a disease thanks to the thrifty gene hypothesis.

We think our approach is a promising way forward to look for selection on the genetic basis of quantitative traits as viewed by GWAS. However, it also highlights some concerns. In developing our tests we found that we had developed a set of methods that already have equivalents in the quantitative trait community, in particular QST, a phenotypic analog of FST (and its extensions by a number of authors). This raises the question of whether, in systems where common garden experiments are possible, there is any need to do GWAS if we are only interested in how local adaptation has shaped traits, or whether QST-style approaches are the best one can do. We do think that there is much more that could be learnt by our style of approach, but it should also give researchers pause to consider why they want to “find the genes” for local adaptation.

We’ve already gotten some very helpful comments via Haldane’s sieve. We’d love more comments, particularly about points of confusion that could be clarified, other datasets that might be good to apply this to, or other applications we could develop.


Identification of Founding Haplotypes Reveals the History of the Selfing Species Capsella rubella out @ PLOS Genetics

Yaniv’s paper: Genomic Identification of Founding Haplotypes Reveals the History of the Selfing Species Capsella rubella is out @ PLOS Genetics. Congrats to all of the authors.


couple of notes on fixation prob. of beneficial allele

There was a conversation on twitter about Haldane’s 2s approximation to the fixation probability of a beneficial allele, and how it relates to the diffusion approximation of the same quantity. This followed from a blog post by Adam Eyre-Walker. I thought I’d write up a couple of notes on it. This post could likely do with more thought/editing, but I thought it would be useful to put it out there.

The 2s result is the correct answer (ignoring terms of order s^2 and higher) for the probability that a mutation is never lost in an infinite population where individuals have a Poisson number of offspring with mean 1+s. The reason why this is “never being lost” instead of “fixed” is that the population is infinite: to persist indefinitely the allele has to escape loss permanently, by never being absorbed by the zero state.

This disagrees with the fixation probability from the diffusion, which is given by (1-exp(-4Nes/(2N)))/(1-exp(-4Nes)) ~ 2s(Ne/N)/(1-exp(-4Nes)). Note the various roles played by Ne (effective population size) and N in these equations (or the lack thereof in the 2s result).

Haldane’s result is not quite “right” for “real” populations (e.g. as modeled by the Wright-Fisher model and its diffusion limit) for two reasons.

The first is that population size is finite, so to fix an allele only needs to reach 2N copies (after which it can never be lost). Weakly beneficial mutations (Ns ~ 1) are slightly more likely to fix than the 2s probability suggests, as they only have to reach 2N copies to escape loss. Similarly, deleterious mutations can never escape loss in an infinite population, but can in a finite population by reaching 2N copies. This is captured by the denominator of the fixation probability under the diffusion model, which increases the fixation probability of alleles with |Ns| ~ 1. The absorption of alleles at 2N copies can also be modeled in finite-individual models (i.e. not the diffusion limit); I seem to remember that Rick Durrett’s book has a section on this.

The second issue with the 2s result is that it assumes that individuals have a Poisson-distributed number of offspring with variance 1 (actually our selected type has mean and variance 1+s, but we ignore the s). In practice that isn’t quite true, as the number of offspring (in the Wright-Fisher model) is binomial with p = 1/(2N) (again not exactly, due to s, but we can ignore that). This is what drops the dependence on Ne out of the 2s equation. It can be factored back in, as the branching-process escape-from-loss probability is easily modified for non-Poisson variance: for an allele with mean offspring number 1+s and variance in offspring number V, the probability that the branching process escapes loss is ~2s/V.
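Putting the two formulas side by side in Python (a sketch with toy parameter values):

```python
import math

def p_fix_branching(s, V=1.0):
    """Haldane-style branching-process approximation: probability of
    escaping loss for mean offspring number 1+s and variance V (~2s/V)."""
    return 2 * s / V

def p_fix_diffusion(s, N, Ne):
    """Diffusion fixation probability for a new mutation starting at
    frequency 1/(2N)."""
    return (1 - math.exp(-4 * Ne * s / (2 * N))) / (1 - math.exp(-4 * Ne * s))

s, N, Ne = 0.01, 10_000, 10_000
print(p_fix_branching(s))          # 0.02
print(p_fix_diffusion(s, N, Ne))   # ~0.0198: close to 2s when Ne = N and Ns >> 1
```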


Thoughts on preprint citation policy

This post is cross posted from Haldane’s sieve.

This guest post is by Graham Coop [@graham_coop] on the journal Molecular Biology and Evolution’s new preprint policy.

We had an interesting discussion via twitter on the potential reasons for MBE’s policy of not allowing a full citation of preprint articles. I thought I’d write up some of my thoughts as shaped by that conversation.

Below I lay out some of the arguments that we discussed and my thoughts on these points. We do not know MBE’s reasoning on this, so I may have missed some obvious practical reason for this citation policy (if so, it would be great if it could be explained). I also note that other journals may well have similar policies about preprint citations, so this is not an argument specifically against MBE. Indeed, it is great that MBE is now allowing preprints, so this is a somewhat minor quibble compared to that step.

One of my main reasons for disliking this policy, other than its singling out of preprints for special treatment, is that it may well disrupt how preprints accumulate citations (via tools like Google Scholar). I view one of the key advantages of preprints as being that they allow the early recognition and acknowledgement of good ideas (with bad ones being allowed to sink out of view). This is particularly important for young researchers, where preprints can potentially allow people on the job market to escape some of the randomness of how long the publication process takes. Allowing young scholars to have their work critiqued, and cited, early seems to me an important step in giving young researchers a head start in an increasingly difficult job market.

Potential arguments against treating preprint citations like any other citation:
1) Allowing full citation of preprints may lose the journal (or the authors) citations.

It is slightly hard to see the logic of (1). If I cite a preprint which has yet to appear in a journal, then by its very nature the journal couldn’t possibly have benefited from that citation. I’m hardly going to delay my own submission/publication to wait for a paper to appear merely so I can cite it (unless I have some prior commitment to a colleague). The same argument seems to hold for the author: citations of the preprint are citations that you would not have received had you not distributed the article early. Now, a fair concern is that journals/authors may lose citations of the published article if, after the article appears, people accidentally cite the arXived paper instead of the final article. However, MBE’s system doesn’t avoid this problem, and it seems like it could be addressed simply by asking the authors to do a PubMed search for each arXived paper to avoid this oversight.

2) Another potential concern is that preprints are, by their nature, subject to change.

Preprints can be updated, so that information contained in them could change, or even be removed. However, preprint sites like arXiv (as well as PeerJ and figshare) keep all previous versions of the paper, and these are clearly labeled and can be cited separately. So I can clearly indicate which version I am citing, and this citation is a permanent entry. While this information may have changed in subsequent versions, this is really no different from the fact that subsequent publications can overturn existing results. What is different with the versioning of preprints is that we get to see more of this process in the open, which feels like a good thing overall.

3) Authors should acknowledge that arXived preprints have not been through peer review.

At first sight there is more validity to this point, but I think it is also weak. As an author, and as a reviewer (and indeed as a reader), you have a responsibility to question whether a citation really supports a particular point. As an author I invest a lot of time in trying to track down the right citations, and to carefully read, and test, the papers I rely heavily on. As a reviewer I regularly question authors’ use of particular citations and point them toward additional work, or ask them to change the wording around a citation. Published papers are not immune from problems, any more than preprints are. If I, and the reviewers of my article, think it is appropriate for me to cite a preprint, then I should be allowed to do so as I would any other article.

This argument also seems somewhat strange: MBE already allows the normal citation of PhD theses and (potentially non-peer-reviewed) books, as pointed out by Antonio Marco. So it is really quite unclear why preprints have been singled out in this way.

All of my articles have benefited greatly from the comments of colleagues and from peer review. I also have a lot of respect for the work done by the editors of various journals, including MBE. However, it is unclear to me who this policy serves. Journal policies should be applied with a light hand; they should ideally allow authors the freedom to fully acknowledge their sources. I see no strong argument for this policy other than that it prevents the further blurring of the line between journals and preprints. In my view the only sustainable way forward for journals and scientific societies is to be innovative focal points for collating peer review and peer recognition. Only by adapting quickly can journals hope to stay relevant in an age where increasingly (to steal Mike Eisen’s phrase) publishing is pushing a button.

Graham Coop
