couple of notes on fixation prob. of beneficial allele

There was a conversation on twitter about Haldane’s 2s approximation to the fixation probability of an allele, and how it related to the diffusion approximation of the same quantity. This followed from a blog post by Adam Eyre-Walker. I thought I’d write a couple of notes up on it. This post could likely do with more thought/editing but I thought it would be useful to put it out there.

The 2s result is the correct (ignoring terms of s^2 and higher) answer for the probability of a mutation never being lost in an infinite population with Poisson number of offspring with mean 1+s. The reason why this is “never being lost” instead of fixed, is that the population is infinite. So to persist indefinitely the allele has to escape loss permanently, by never being absorbed by the zero state.
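As a quick numerical check on the 2s claim, here is a minimal Python sketch. It assumes a Galton-Watson branching process in which each copy independently leaves a Poisson(1+s) number of offspring. The probability of eventual loss is found by repeatedly applying the offspring generating function starting from q=0. The function name, tolerance and iteration cap are illustrative choices, not anything from the original discussion.

```python
import math


def poisson_escape_probability(s, tol=1e-12, max_iter=1_000_000):
    """Probability that a lineage started by one copy is never lost.

    Assumes an infinite population in which every copy independently
    leaves a Poisson(1 + s) number of offspring copies.
    """
    if s <= 0:
        # With mean offspring number <= 1 the lineage is lost with probability 1.
        return 0.0
    mean_offspring = 1.0 + s
    q = 0.0  # probability of being lost by generation 0
    for _ in range(max_iter):
        # Poisson generating function: G(q) = exp(mean * (q - 1)).
        # Applying G once moves from "lost by generation n" to "lost by n + 1".
        q_next = math.exp(mean_offspring * (q - 1.0))
        converged = abs(q_next - q) < tol
        q = q_next
        if converged:
            break
    return 1.0 - q


if __name__ == "__main__":
    for s in (0.001, 0.01, 0.05, 0.1):
        p = poisson_escape_probability(s)
        print(f"s={s:<6} escape prob={p:.5f}   2s={2 * s:.5f}")
```

For small s the computed value sits just below 2s, and the gap widens as s grows, which is the contribution of the s^2 and higher terms that the approximation ignores.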

The 2s result disagrees with the fixation probability from the diffusion approximation, which is given by (1-exp(-4Nes/(2N)))/(1-exp(-4Nes)) ~ 2s(Ne/N)/(1-exp(-4Nes)). Note the various roles played by Ne (eff. pop. size) and N in both equations (or the lack thereof).
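To put some numbers on the diffusion formula, here is a small Python sketch. It evaluates (1-exp(-4Nes/(2N)))/(1-exp(-4Nes)) for a new mutation starting as a single copy. The function name and the example parameter values are assumptions made for illustration.

```python
import math


def diffusion_fixation_probability(s, N, Ne):
    """Diffusion approximation for a new mutation at frequency 1/(2N).

    Evaluates (1 - exp(-4*Ne*s/(2N))) / (1 - exp(-4*Ne*s)). Both minus
    signs cancel, so the ratio is written with expm1, which stays accurate
    when the exponents are small.
    Note: math.expm1 raises OverflowError for very large positive arguments
    (strongly deleterious alleles in very large populations); that case is
    not handled in this sketch.
    """
    if s == 0:
        return 1.0 / (2 * N)  # neutral limit of the formula
    return math.expm1(-4.0 * Ne * s / (2 * N)) / math.expm1(-4.0 * Ne * s)


if __name__ == "__main__":
    N = 10_000
    for Ne in (10_000, 5_000):
        for s in (0.0001, 0.001, 0.01):
            p = diffusion_fixation_probability(s, N, Ne)
            print(f"Ne/N={Ne / N:.1f} s={s:<7} diffusion={p:.6f} "
                  f"2s*Ne/N={2 * s * Ne / N:.6f}")
```

When 4Nes is large the denominator is essentially 1 and the value is close to 2s(Ne/N). With Ns around 1 the value comes out slightly above 2s, and slightly deleterious alleles get a small but nonzero probability.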

Haldane’s result is not quite “right” for “real” populations (e.g. as modeled by Wright-Fisher, and its diffusion limit) for two reasons.

The first is that population size is finite, so to fix we only need to reach 2N copies (and then the allele will never be lost). Weakly beneficial mutations (Ns~1) are slightly more likely to fix than the 2s probability, as they only have to reach 2N copies to never be lost. Similarly, deleterious mutations will never escape loss in an infinite population, but can in a finite population by reaching 2N copies. This is captured by the denominator of the fixation probability under the diffusion model, which increases the fixation prob. of alleles with |Ns|~1. The absorption of alleles at 2N copies can also be modeled in finite individual models (i.e. not the diffusion limit); I seem to remember that Rick Durrett’s book has a section on this.

The second issue with the 2s result is that it assumes that individuals have a Poisson distributed number of offspring with variance 1 (actually our selected type has mean and var 1+s, but we ignore the s). However, in practice that isn’t quite true, as the number of offspring (in Wright-Fisher) is binomial with p=1/(2N) (actually it is not quite this due to s, but we can ignore that). The Poisson assumption also drops the dependence on Ne out of the equation. This can be factored back in, as the branching process probability of escaping loss is easily modified for non-Poisson variance: for an allele with mean offspring number 1+s and variance in offspring number V, the probability that the branching process escapes loss is ~2s/V.
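The 2s/V result can be checked numerically in the same way as the Poisson case. The sketch below is an illustration under an assumption that is not in the text above: the offspring number is taken to be negative binomial, chosen only because it lets the variance V be set separately from the mean 1+s (it requires V > 1+s). The function name and defaults are hypothetical.

```python
def negative_binomial_escape_probability(s, V, tol=1e-12, max_iter=1_000_000):
    """Escape probability for a branching process with overdispersed offspring.

    Assumes each copy leaves a negative binomial number of offspring with
    mean 1 + s and variance V.
    """
    if s <= 0:
        return 0.0  # lost with probability 1 when the mean is <= 1
    mean = 1.0 + s
    if V <= mean:
        raise ValueError("the negative binomial needs V > 1 + s")
    # Dispersion r chosen so that variance = mean + mean**2 / r equals V.
    r = mean * mean / (V - mean)
    q = 0.0  # probability of being lost by generation 0
    for _ in range(max_iter):
        # Negative binomial generating function: G(q) = (1 + mean*(1-q)/r)**(-r).
        q_next = (1.0 + mean * (1.0 - q) / r) ** (-r)
        converged = abs(q_next - q) < tol
        q = q_next
        if converged:
            break
    return 1.0 - q


if __name__ == "__main__":
    s = 0.01
    for V in (1.5, 2.0, 4.0):
        p = negative_binomial_escape_probability(s, V)
        print(f"V={V:<4} escape prob={p:.5f}   2s/V={2 * s / V:.5f}")
```

Under these assumptions, doubling the offspring variance roughly halves the probability of escaping loss, in line with the ~2s/V approximation.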

Posted in popgen teaching, Uncategorized | Leave a comment

Thoughts on preprint citation policy

This post is cross posted from Haldane’s sieve.

This guest post is by Graham Coop [@graham_coop] on the journal Molecular Biology and Evolution’s new preprint policy.

We had an interesting discussion via twitter on the potential reasons for MBE’s policy of not allowing a full citation of preprint articles. I thought I’d write up some of my thoughts as shaped by that conversation.

Following on from this discussion, I thought I’d lay out some of the arguments that we discussed and my thoughts on these points. We do not know MBE’s reasoning on this, so I may have missed some obvious practical reason for this citation policy (if so, it would be great if it could be explained). Also I note that other journals may well have similar policies about preprint citations, so this is not an argument specifically against MBE. It is great that MBE is now allowing preprints, so this is a somewhat minor quibble compared to that step.

One of my main reasons for disliking this policy, other than it singling out preprints for special treatment, is that it may well disrupt how preprints accumulate citations (via tools like Google Scholar). I view one of the key advantages of preprints as being that they allow the early recognition and acknowledgement of good ideas (with bad ones being allowed to sink out of view). This is particularly important for young researchers, where preprints can potentially allow people on the job market to escape some of the randomness of how long the publication process takes. Allowing young scholars to have their work critiqued, and cited, early seems to me an important step in allowing young researchers to get a head start in an increasingly difficult job market.

Potential arguments against treating preprint citations like any other citation:
1) Allowing full citation of preprints may lose the journal (or the authors) citations.

It is slightly hard to see the logic of (1). If I cite a preprint, which has yet to appear in a journal, then by its very nature the journal couldn’t possibly have benefited from that citation. I’m hardly going to delay my own submission/publication to wait for a paper to appear merely so I can cite it (unless I have some prior commitment to a colleague). The same argument seems to hold for the author: citations of the preprint are citations that you would not have received if you did not distribute the article early. Now, a fair concern is that journals/authors may lose citations of the published article if, after the article appears, people accidentally cite the arXived paper instead of the final article. However, MBE’s system doesn’t avoid this problem, and it seems like it could be addressed simply by asking the authors to do a PubMed search for each arXived paper to avoid this oversight.

2) Another potential concern is that preprints are, by their nature, subject to change.

Preprints can be updated, so that information contained in them could change, or even be removed. However, preprint sites like arXiv (as well as PeerJ and figshare) keep all previous versions of the paper, and these are clearly labeled and can be cited separately. So I can clearly indicate which version I am citing, and this citation is a permanent entry. While this information may have changed in subsequent versions, this is really no different from the fact that subsequent publications can overturn existing results. What is different with versioning of preprints is that we get to see more of this process in the open, which feels like a good thing overall.

3) Authors should acknowledge that arXived preprints have not been through peer review.

At first sight there is more validity to this point, but I think it is also weak. As an author, and as a reviewer (and indeed as a reader), you have a responsibility to question whether a citation really supports a particular point. As an author I invest a lot of time in trying to track down the right citations and to carefully read, and test, the papers I rely heavily on. As a reviewer I regularly question authors’ use of particular citations and point them toward additional work or ask them to change the wording around a citation. Published papers are not immune from problems, any more than preprints are. If I, and the reviewers of my article, think it is appropriate for me to cite a preprint then I should be allowed to do so as I would any other article.

Also this argument seems somewhat strange; MBE already allows the normal citation of PhD theses and [potentially unpeer-reviewed] books (as pointed out by Antonio Marco). So it is really quite unclear why preprints have been singled out in this way.

All of my articles have benefited greatly from the comments of colleagues and from peer review. I also have a lot of respect for the work done by editors of various journals, including MBE. However, it is unclear to me who this policy serves. Journal policies should always be a light hand; they should ideally allow the authors freedom to fully acknowledge their sources. I see no strong argument for this policy other than it prevents the further blurring of the line between journals and preprints. In my view the only sustainable way forward for journals and scientific societies is to be innovative focal points for collating peer-review and peer-recognition. Only by adapting quickly can journals hope to stay relevant in an age where increasingly (to steal Mike Eisen’s phrase) publishing is pushing a button.

Graham Coop

Posted in Uncategorized | 3 Comments

new ArXiv paper on the population genomics of a recently derived selfing species

There’s a couple of new Coop lab papers up on the ArXiv.

One is from Yaniv’s work (Brandvain et al.) on a haplotype-based approach to examining the history of a recently derived selfing species (Capsella rubella). This grew out of a collaboration that Yaniv initiated with Stephen Wright and Tanja Slotte. Tanja and Stephen were heavily involved in sequencing the C. rubella genome (along with folks from Detlef Weigel’s lab). As part of this work they also sequenced a set of transcriptomes for that species and its closely related outcrossing progenitor (C. grandiflora). The genome paper has just come out, on which Yaniv and I are minor authors, so congratulations to all on that.


In Yaniv’s paper we analyze the two populations’ transcriptome data to examine the founding and subsequent evolution of the selfing species. To do this he uses the high rate of selfing in C. rubella to infer the haplotypes that founded the species (i.e. the chromosomes that were present in the founders), and uses these to partition variation into alleles that arose before and after the founding of the species. Doing this, we were able to see the dramatic reduction in the population size of C. rubella, and its impact on variation, over the past 100k years of evolution. This paper will appear in press shortly, and congratulations to Yaniv on a really nice paper (in my admittedly biased view).

Posted in new paper | Leave a comment

SMBE2013 Chicago

We (Kristin, Yaniv, Jeremy, and I) had a great time at SMBE. Our talks went well; we’ll likely post some slides shortly. It was great to catch up with so many folks and with Chicago life. We hung out with Coop alums Torsten and Peter, see the photo below (Kristin had gone back to Wisconsin when this was taken). You can catch up with some of the highlights of the conference through twitter #SMBE2013. One highlight of twitter was Alex Cagan’s great illustrations of talks. Jeremy and I were lucky enough to have our talks sketched by Alex, and he’s kindly allowed us to post them below.

[Images: Alex Cagan’s sketches of the two talks, and a group photo]

Posted in cooplab, meetings, photos | Leave a comment

Coop lab at SMBE2013

Hope to see you at SMBE 2013 in Chicago. Yaniv, Jeremy, and I will be at the meeting and giving talks. Kristin Lee, the newest member of the Coop lab (starting Sept), will also be around.

Talks from the Coop lab:
Yaniv Brandvain. Speciation and Introgression in a Mimulus Species Pair. Monday 11.30am Speciation Genomics.

Graham Coop. Predictions and Inference for Coalescent Models with Soft Sweeps. Wednesday. 9.24am. The Evol. Gen. of Polygenic Traits.

Jeremy Berg. General Approaches for Identifying Adaptation Involving Polygenic Traits. Wednesday. 9.45am. The Evol. Gen. of Polygenic Traits.

Hope to see you there; I’m particularly looking forward to catching up with all the Chicago folks.

Graham

Posted in meetings, trips | Leave a comment

Evolution meeting 2013 at Snowbird, UT

The whole lab went to the Evolution meeting at Snowbird, Utah. The talks from the lab went over well it seems (see e.g. Kim Gilbert’s writeup). We’ll hopefully post some pdfs of the talk slides shortly. There were some great talks, and it was lovely to meet a bunch of new folks in person.

Chenling, Jeremy, Yaniv, and I took a road trip on our way there. We went via Mono Lake, Deep Springs College (visiting Amity Wilczek who’s faculty there), and then camped at Great Basin National Park. The pictures are below. It was a wonderful trip, with great hikes, views, and company (+good bourbon).

[Photos from the road trip]

Posted in meetings, photos, trips | 4 Comments

Evolution Meeting 2013

We are heading off shortly to the Evolution 2013 meeting, and taking a bit of a road trip to get up there (via camping in Great Basin National Park, and a few other sights). Hope to see you there. Come see the talks by folks in our lab:

Graham Coop. Saturday 11.15am. Alpine A/B. (Popgen Theory)
“The coalescent with soft sweeps”.
Jeremy Berg. Sunday 11.45am. Rendezvous A. (Gene Flow/Migration II)
“Using environmental correlations and genome-wide associations to detect the signal of polygenic selection”.
Gideon Bradburd. Sunday 2.45. Peruvian B (Genetic Drift).
“Disentangling the effects of geographic and ecological isolation on genetic differentiation”
Alisa Sedghifar. Sunday 3.45pm. Alpine A/B. (Empirical Popgen V.)
“Genomic patterns of latitudinal differentiation in Drosophila simulans.”
Yaniv Brandvain. Monday 9.15am. Cottonwood C. (Speciation IV.)
“Speciation and Introgression in a Mimulus species pair”
Chenling Xu. Monday 10.40. Cottonwood B.
“Molecular analysis of Wolbachia invasions in Drosophila simulans”

Posted in cooplab, meetings, trips | 1 Comment

“Ask me anything” Reddit on our European ancestry paper

Peter Ralph and I are doing an “Ask me anything” on our paper about the Recent genetic genealogy of Europe over at the askScience reddit http://www.reddit.com/r/askscience/comments/1ee560/askscience_ama_we_are_the_authors_of_a_recent/ today [May 15th]. Feel free to pop by and ask us questions.

Posted in genetic genealogy, personal genomics, popgen teaching | Leave a comment

Identification of genomic regions shared between distant relatives

We’ve been addressing some of the FAQs on topics arising from our paper on the geography of recent genetic genealogy in Europe (PLOS Biology). We wanted to write one on shared genetic material in personal genomics data but it got a little long, and so we are posting it as its own blog post.

Personal genomics companies that type SNPs genome-wide can identify blocks of shared genetic material between people in their databases, offering the chance to identify distant relatives. Finding a connection to someone else who is an unknown relative is exciting, whether you do this through your family tree or through personal genomics (we’ve both pored over our 23&me results a bunch). However, given the fact that nearly everyone in Europe is related to nearly everyone else over the past 1000 years (see our recent paper and FAQs), and likely everyone in the world is related over the past ~3000 years, how should you interpret that genetic connection?

The answer to that question is obviously highly personal, and specific to the relationship identified. For example, Peter and Graham are likely to be related a few tens of generations back, but our connection to our siblings is obviously much closer. (Also shared genetic inheritance is only one aspect of what it means to be family, e.g. step parents are part of a family.)

Our paper offers some preliminary answers to questions concerning the observation of distant connections found by personal genomics companies. A lot of the ideas that we’ll touch on in this post are explained more thoroughly here. The short answer is that we think that these single shared blocks (especially the short ones) are from much older shared relatives than you would think, and that they often aren’t a particularly meaningful connection in a genealogical sense.

The difficulty is that the further back we go, the less sharing of genetic material due to recent ancestry there is. Individuals who share many long blocks (if those blocks are correctly identified) are likely close relatives. However, individuals who share a specific ancestor more than eight generations back are unlikely to share even a single chunk of genetic material due to that particular connection (Donnelly 1983, see also the discussion around Figure 1 in Huff et al, and Luke Jostins’ post on this). That said, you have many 8th cousins, so you will share a block with quite a few of these cousins. Conditional on sharing a block of material from that far back, this block is often quite long, highly variable in length, but frequently identifiable by using SNP chips. So a more concrete question is: if you and I share a single block of a given length (say ~10cM), what is it possible to say about our relationship?

We tackle this question in the discussion of our paper. The first difficulty is that the length of the block due to a given relationship is highly variable. The other problem is that while you have many close relatives, you have a huge number of more distant relatives (explained here). This acts to seriously distort our intuition of when a block of a given length would have come from. This is further complicated as the number of distant relatives (e.g. 10th cousins) you have depends strongly on the demography of all of the myriad populations that contributed to your ancestry. For example, if your ancestry comes from a set of populations that have grown very rapidly, like many populations around the world have over the past few thousand years, you will have many fewer close relatives than if you come from a small population that was constant in size. In these two figures [1,2] we show the theoretical age distribution of blocks of three different lengths, for two different demographic scenarios (a constant population and an exponentially growing population respectively). This means that we can’t make a statement like “10cM blocks are from 20-30 generations ago” that will hold for everyone.

Consider that hypothetical block of length 10cM shared between 2 people. Since the mean length of a shared IBD block inherited from five generations ago is 10 cM, we might expect the age of the corresponding common ancestor to be from around five generations ago (10 meioses, since 10cM is 1/10th of a typical chromosome). However, a direct calculation using our inferred demographic histories says that the typical age of a 10 cM block shared by two individuals from the United Kingdom is between 32 and 52 generations (depending on the inferred distribution used). This giant discrepancy results from the fact that you are a priori much more likely to share a common genetic ancestor further in the past, and this acts to skew our answers away from the naive expectation: even though it is unlikely that a 10 cM block is inherited from a particular shared ancestor from 40 generations ago, there are a great number of such older shared ancestors. As discussed above, our estimate does depend drastically on the populations’ shared histories: for instance, the age of such a block shared by someone from the United Kingdom with someone from Italy is even older, usually from around 60 generations ago.

A corollary of this is that if we were seeing 10cM blocks from only 5 generations ago, we must be sampling from a really tiny population, since that would mean a large chance that random people were related through ancestors 5 generations ago (fourth cousins).
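The way the number of possible older ancestors skews the age of a block can be illustrated with a toy calculation. This is a sketch under simplifying assumptions, not the calculation from the paper. It assumes that a block passed down through 2g meioses has an exponentially distributed length with rate 2g per Morgan, that the number of such blocks grows in proportion to 2g, and that the chance two people share a common genetic ancestor g generations back follows a constant population of N diploids. All function names and the value of N are hypothetical.

```python
import math


def constant_size_prior(N):
    """Chance that two lineages first coalesce exactly g generations back."""
    def prior(g):
        return (1.0 / (2 * N)) * (1.0 - 1.0 / (2 * N)) ** (g - 1)
    return prior


def block_age_posterior(length_morgans, coalescence_prior, max_generations=500):
    """Posterior over the age (in generations) of a shared block of given length.

    Returns a list whose entry i is the posterior weight on age i + 1.
    """
    weights = []
    for g in range(1, max_generations + 1):
        meioses = 2 * g
        # Relative abundance of blocks of exactly this length from an
        # ancestor g generations back: (number of blocks ~ meioses) times
        # (exponential length density with rate = meioses).
        likelihood = meioses ** 2 * math.exp(-meioses * length_morgans)
        weights.append(coalescence_prior(g) * likelihood)
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    posterior = block_age_posterior(0.10, constant_size_prior(1_000_000))
    mode = max(range(len(posterior)), key=posterior.__getitem__) + 1
    mean = sum((i + 1) * w for i, w in enumerate(posterior))
    print(f"naive age for a 10cM block: 5 generations")
    print(f"toy posterior mode: {mode} generations, mean: {mean:.1f} generations")
```

Even in this simple setting the typical age comes out well above the naive five generations. The 32 to 52 generation figures quoted above come from the inferred demographic histories, which this sketch does not attempt to reproduce.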

Numbers like the 32-52 generations above must be taken with a grain of salt, as they are highly dependent on the demographic history. However, it does imply that blocks of these lengths are likely coming from deeper in time than the time when all Europeans share all of their common ancestors. Therefore, a single example of a block of around this length is not a particularly meaningful statement about genealogical relationship between two people, as these people share all of their ancestors that far back.

This conclusion may not apply to ancestors from the past very few (perhaps less than eight) generations, from whom we expect to inherit multiple long blocks; in this case, we can hope to infer a specific genealogical relationship with reasonable certainty (e.g., Huff et al., Henn et al), although even then care must be taken to exclude the possibility that these multiple blocks have been inherited from distinct common ancestors (and this will also vary across countries). It is not totally obvious to us how/whether this is currently being done in the relative-finding software that personal genomics companies use. What is really needed is some guidelines and tests, informed by data from Europe and elsewhere, of how long a single shared block has to be to indicate a more meaningful relationship. These efforts have begun in some populations (Henn et al, Gusev et al, Kong et al) but we likely need more of them.

What is potentially informative about these single shared blocks is the geographic pattern of who you share these blocks with. For example, if you have many shared blocks with people from Norway in a company’s database, this would suggest that some of your recent ancestors lived in Norway (although we need to know how many Norwegian people there are in the database to truly understand this result). This is the kind of information that some of these companies use to work out where your genomic ancestry derives from. However, we think that we are still a long way from understanding these tools thoroughly, and that these tools should be treated as only one (likely imperfect) aspect of family history research. For a more general discussion of how personal genomics can inform our views of family history see Sense about Science, which takes a (rightly) skeptical view of some of the more dubious claims (especially those made by companies that only test Y/mtDNA markers).

We note that even if sharing a single long block doesn’t imply a particularly close genealogical relationship, it can imply a stronger genetic relationship than usual. Both are significant, in different ways.

Peter Ralph and Graham Coop

Posted in genetic genealogy, personal genomics, popgen teaching | 1 Comment

Peter’s and my European genetic genealogy paper is out.

My article with Peter Ralph on the geography of recent genetic genealogy in Europe is out in PLOS Biology. We’ve written an FAQ on the paper, which we sent out with the press release. PLOS also has a synopsis of the article. The article has already gotten a bunch of coverage, some of which is linked here:
Carl Zimmer at The Loom, Nature News, Science News, NBC, LA Times

I’ll post more when I get a chance; the past couple of days were a little crazy with all of this.

One of the nice aspects is that the paper has been up on the arXiv preprint server since we first submitted the paper to PLOS Biology (in July 2012). I’ve written about our reasons for doing that here, and blogged about the paper here at Haldane’s sieve. The arXived paper has gathered a number of comments via Haldane’s Sieve and various other sources, including emails from people. A number of these comments, especially by Amy Williams, were very useful in helping shape the final paper. This was feedback we would have never gotten if we hadn’t posted the paper. For example, I only met Amy at a conference after she had commented via Haldane’s sieve, although I’d known of her work (and enjoyed it, but would never have thought to ask her for comments). The paper has already gained a couple of citations via the arXiv. I also appreciate that PLOS has a clear policy on preprints, and had no issue with us blogging about the paper (also they liked the idea of the FAQ).

We had gone back and forth on the issue of whether we should even do a press release, as the simple format of press releases sometimes lends itself to creating confusion (especially as some news outlets seem to just recycle parts of the press release). But we decided that the paper would likely get some coverage, even if we didn’t do a press release, so it was important to get it right. We worked with Andy Fell at UC Davis on the press release, who I’d followed via blogs and twitter, and he was great at talking to us about the work. We all did a bunch of work on the press release, and made sure that we were all totally happy with everything it said. However, having helped write that, and knowing how complex many of these issues are, we could see that there were a lot of basic questions that we wouldn’t be able to cover in a traditional press release format. So we were keen to try and avoid some of the confusion by writing an FAQ.

I think we also benefited a lot from writing the FAQ, especially in terms of getting much of the press coverage reasonably right. We sent it out as a link with our official press release, while the paper was under embargo, and referred all press contacts to it when we answered their questions. A number of the press/blog articles linked back to it. The FAQ has had 5000 views (as of today), presumably due to people following up on the press articles. A number of the reporters had clearly read it before contacting us, which made things a lot easier. Also, writing the FAQ prepared us somewhat for talking to the (few) journalists we talked to, as we had thought through the answers to basic questions. Peter and I have discussed turning the FAQs into some form of article (e.g. nonacademic) on issues concerning genetic and genealogical relatedness, as there’s a tonne of neat and counter-intuitive ideas and facts out there to explain to folks. We’d definitely recommend considering writing FAQs for your articles, especially if they may get some press interest. We may try it for some of our others in the pipeline. It’s a lot of fun and also nice to take the time to clarify the tricky concepts that often go unexplained in scientific papers.

Anyhow, those are my thoughts so far.
Graham

Posted in genetic genealogy, personal genomics, popgen teaching | Leave a comment