## Population genetics course resources: Demonstrating the uses of coalescent simulations

A problem set making use of the coalescent simulations the students wrote for a previous exercise (see here). They use the code to calculate p-values for the observed number of segregating sites. This could be done with the HKA test, but it’s nice to get them to use their simulations. They also use the code to do rejection sampling to get the distribution of the time to the most recent common ancestor given the number of segregating sites (this idea is taken from Tavaré et al. 1997). This exercise could be extended in a number of ways, e.g. as in Tavaré et al. 1997 we could simulate from a prior on theta. I’d also like to get the students to use ms to infer parameters of a bottleneck model (e.g. from Tajima’s D).
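The two uses described above (simulation-based p-values and rejection sampling for the TMRCA) can be sketched roughly as follows. This is a minimal illustration, not the students' actual code: the function names `simulate_tree`, `tail_p_value` and `tmrca_given_s` are hypothetical, time is assumed to be measured in units of 2N generations with theta = 4Nμ for the whole locus, mutations follow the infinite-sites model, and the values of `n`, `theta` and the observed count in the usage lines are purely illustrative.

```python
import random

def simulate_tree(n, theta, rng):
    """Return (tmrca, seg_sites) for one neutral coalescent genealogy.

    Assumes time in units of 2N generations and theta = 4*N*mu per locus.
    """
    tmrca = 0.0
    total_length = 0.0
    for k in range(n, 1, -1):
        # Waiting time while k lineages remain: exponential, rate k choose 2.
        t_k = rng.expovariate(k * (k - 1) / 2.0)
        tmrca += t_k
        total_length += k * t_k
    # Infinite-sites mutations: S ~ Poisson(theta/2 * total branch length),
    # drawn here by counting rate-1 arrivals in an interval of that length.
    mean = theta / 2.0 * total_length
    seg_sites = 0
    elapsed = rng.expovariate(1.0)
    while elapsed < mean:
        seg_sites += 1
        elapsed += rng.expovariate(1.0)
    return tmrca, seg_sites

def tail_p_value(observed, sims):
    """Two-sided Monte Carlo p-value: twice the smaller tail, capped at 1."""
    lower = sum(s <= observed for s in sims) / len(sims)
    upper = sum(s >= observed for s in sims) / len(sims)
    return min(1.0, 2.0 * min(lower, upper))

def tmrca_given_s(n, theta, observed, n_sims, rng):
    """Rejection sampling: keep the TMRCA only when simulated S == observed."""
    kept = []
    for _ in range(n_sims):
        tmrca, s = simulate_tree(n, theta, rng)
        if s == observed:
            kept.append(tmrca)
    return kept

# Illustrative usage with made-up parameter values.
rng = random.Random(1)
n, theta, s_obs = 20, 10.0, 12
draws = [simulate_tree(n, theta, rng) for _ in range(5000)]
s_sims = [s for _, s in draws]
p = tail_p_value(s_obs, s_sims)
posterior = tmrca_given_s(n, theta, s_obs, 20000, rng)
print("p-value:", p)
print("accepted:", len(posterior), "mean TMRCA | S:", sum(posterior) / len(posterior))
```

Exact-match rejection sampling throws away most simulations when the observed count is unlikely under the model, so the number of accepted draws is worth checking before summarising the conditional distribution.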

Q3) You sequence 11 loci of 1000bp in a species of dolphin. In your sample of 10 individuals you observe the following numbers of segregating sites, one per locus:
41, 48, 43, 48, 23, 45, 5, 56, 48, 25, 83
Assume (based on prior knowledge from divergence) that these 11 loci have identical mutation rates (of 2×10⁻⁸ per base per generation) and that there is no recombination within these loci (not a particularly great assumption).
A) What is your best estimate for the effective population size?
B) There seems to be substantial variance in the number of segregating sites across loci, perhaps because some of them are linked to loci undergoing selection. Assuming a constant population size, and your estimate of Ne from part A, use your coalescent simulations to estimate, for each locus, a p-value for the observed number of segregating sites under the neutral coalescent model.
C) Which loci are candidates (at a false positive rate of 0.05) for being influenced by linked selection? What kind of selection could explain your observations? What other aspects of the data could you look at to distinguish these hypotheses? Are your conclusions robust to the assumptions we have made?
D) Your colleague sequences a 500bp region (which putatively has the same mutation rate) in the same 10 individuals. He finds 12 segregating sites. He tells you that he thinks this region has unusually low diversity because it is linked to a locus under selection. Using your simulations, comment on whether his story is plausible.
E) He asks you whether he can estimate the time to the most recent common ancestor (TMRCA) for his locus based on the number of segregating sites and your estimate of Ne. Can you extend your coalescent simulations to estimate the distribution of the TMRCA at his locus? [Hint: you will need to modify your coalescent code to produce both the TMRCA and number of segregating sites.]
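For part A, one common starting point (not necessarily the intended one) is Watterson's estimator, which divides the mean number of segregating sites by the harmonic sum over the sample size and then converts theta = 4Neμ per locus into Ne. The sketch below makes one loud assumption: the 10 diploid individuals are treated as 20 sampled chromosomes; set `n_chromosomes = 10` if the sample is instead treated as 10 sequences. The helper names are hypothetical.

```python
# Data and rates as given in Q3.
seg_sites = [41, 48, 43, 48, 23, 45, 5, 56, 48, 25, 83]
mu = 2e-8            # mutation rate per base per generation
locus_length = 1000  # bases per locus

# ASSUMPTION: 10 diploid individuals -> 20 sampled chromosomes.
n_chromosomes = 20

def harmonic(n):
    """Watterson's a_n: sum of 1/i for i = 1 .. n-1."""
    return sum(1.0 / i for i in range(1, n))

def watterson_theta(s_values, n):
    """Per-locus theta estimate from the mean number of segregating sites."""
    mean_s = sum(s_values) / len(s_values)
    return mean_s / harmonic(n)

theta_hat = watterson_theta(seg_sites, n_chromosomes)
# theta = 4 * Ne * mu * L for a locus of L bases, so solve for Ne.
ne_hat = theta_hat / (4 * mu * locus_length)
print("theta per locus:", theta_hat)
print("Ne estimate:", ne_hat)
```

Averaging across loci before dividing treats the 11 loci as independent replicates with a shared theta, which matches the assumption of identical mutation rates stated in the question.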