## Population genetics course resources: Coding up the coalescent time distribution.

A homework exercise coding up a coalescent model in a constant population of size N. There’s a variety of ways this could be done, the code is setup to guide a person through one of those. The code is applied to working out the time to speciation under the genealogical species concept as an example of the application of coalescent thinking (see Hudson and Coyne for a deeper analysis of this problem). Two other obvious applications that will be next up are: 1) calculating the probability of an observed number of segregating sites; 2) working out the distribution of the time to the most recent common ancestor conditional on a number of segregating sites.

```
##The R assignment for this week is to write code to simulate the times in coalescent trees and the number of segregating sites.
## We'll simulate the to the common ancestor of our sample, the total time in our coalescent and then the number of segregating sites.

#write function below to simulate the Time to the Most Recent Common Ancestor (TMRCA) for n sequences sampled from a population of N diploids
##In this function you'll need to use the rexp function to simulate a single draw from an exponential r.v.
TMRCA=function(n,N){

return(tmrca)
}

##EXERCISE 1
##Write a for loop to iterate your TMRCA function 10000 times, recording each value returned by your function
##you could alternatively use the function replicate() to do this loop.
##plot a histogram of the distribution of the TMRCA, and calculate the mean and variance
##Do this for two different values of N=10000 and 1,000,000 and a few different sample sizes, e.g. n=2, n=5 and n=1000 (try other values as well)
##Hand in these histogram, means and variances and briefly comment on the numbers you obtain and how they compare to the results we've developed in class

##EXERCISE 2
# There are many concepts of what a species is, one species concept is the phylogenetic species concept (PSC). Under this species concept (crudely speaking), a population would be considered a separate species if at 90% of loci in the genome of a population are monophyletic with respect to any sister population (this is the genealogical species concept (GSC), a variant of the PSC).
# Imagine a population has been totally isolated from all other populations. Can you use your coalescent simulations to gain an approximate sense of the time till the population would be declared a separate species by (this admittedly crude representation of) the GSC.
# Many population geneticists, while recognizing the usefulness of a clear definition of a species, are some skeptical of the application of this species concept. This is especially true of population geneticists (perhaps now the majority) who see adaptation as playing the key role in the accumulation of reproductive isolation factors. Can you explain why?
#[Read Hudson and Coyne’s  paper (http://www.jstor.org/pss/3061538) on this topic if you want a deeper analysis of this problem.]

##EXERCISE 3

##Alter this function to simulate the total time in the genealogy for n sequences sampled from a population of N diploids
tot.time=function(n,N){

return(tot.time)
}

##Write a for loop to iterate your total time function 10000 times, recording each value returned by your function
##or use replicate()
##plot a histogram of the distribution of the TMRCA, and calculate the mean and variance
##Do this for two different values of N=10000 and 1,000,000 and two different sample sizes n=5 and n=100 (try other values as well)
##Hand in these histogram means, variance and briefly comment on the numbers you obtain and how they compare to the results we've developed.

u=1e-8

## Write a for loop to iterate your total time function 10000 times, and each iteration simulate a number of segregating sites given the total time in the tree.
## You'll need to make use of the function rpois(n=1,lambda), which simulates a single draw from a Poisson with a mean lambda.
##record the number of segregating site for each iteration
##try a couple of different values of the mutation rate u (e.g. u=1e-7 and 1e-4 represents a region of roughly ten bases and 10000 bases) for the two N's given above.
##plot a histrogram of the number of segregating sites across your simulations, calculate the mean and variance
##Hand in these histogram means, variance and briefly comment on the numbers you obtain and how they compare to the results we've developed.

```

If you do use these scripts and figures, please acknowledge that fact (mainly so that others can find this resource). Also if you do use them it would be great if you could add a comment to the post, so I can see how widely used they are, to get a sense of how worthwhile this is. If you find a bug or make an improved version do let me know.

This entry was posted in popgen teaching, Programming exercises, teaching. Bookmark the permalink.