Journal tea: 1st Nov.

For tuesday let’s read:
Bhatia et al. Genome-wide comparison of African-ancestry populations from CARe and other cohorts reveals signals of natural selection. AJHG 2011

Posted in journal tea | Tagged , | Leave a comment

Journal tea: 13th Oct

Matt Rockman’s review on the QTN program

Posted in journal tea | Leave a comment

Population genetics course resources: Hardy-Weinberg Eq.

Using data from the Hapmap to illustrate Hardy Weinberg equilibrium. The code and the data are below.

The original inspiration for these comes from John Novembre, who made a similar plot for one of his classes. I took 10,000 SNPs from the HapMap CEU European and YRI African populations (the 1st 10k from chromosome 1 on the Affy chip as that was what I had handy). Within each of these populations I plot the allele frequency against the frequency of the 3 genotypes. Each SNP is represented by 3 different coloured points. The solid lines show the mean genotype frequency calculated using a loess smoothing. The dashed line shows the predicted genotype frequency from Hardy Weinberg equilibrium. I find it really pretty that the analytical predictions work as well as they do (I’m not sure whether the slight deviation is finite sample effects, should check into that). Obviously these (and most SNPs) pass HW QC metrics, so it isn’t very surprising still I find it nice to know that popgen makes sense.

The data are from the PHASE2 HapMap and the genotypes were processed into genotype counts using PLINK‘s HWE option (the files are in that format).

This slideshow requires JavaScript.

If you do use these scripts and figures, please acknowledge that fact (mainly so that others can find this resource). Also if you do use them it would be great if you could add a comment to the post, so I can see how widely used they are, to get a sense of how worthwhile this is. If you find a bug or make an improved version do let me know. Thanks again to John for the original inspiration for these plots.

The SNP files:
CEU file

YRI file

plot.geno.vs.HW<-function(file,title){
	
	#read in the HW file from plink
	plink.hwe<-read.table(file,as.is=TRUE) 
	names(plink.hwe)<-c("chr","SNP.id","which.inds","allele.1","allele.2","genotype","obs.het","exp.het","HWE.pval")

	counts<-sapply(plink.hwe$genotype,function(x){as.numeric(strsplit(x,"/")[[1]])})
	counts<-t(counts)
	tot.counts<-rowSums(counts)
	geno.freq<-counts/tot.counts
	allele.freq<-(geno.freq[,1]+.5*geno.freq[,2])

	these.minor<-sample(1:nrow(geno.freq),3000)
	these.major<-sample(1:nrow(geno.freq),3000)
	ss.allele<-c(allele.freq[these.minor],1-allele.freq[these.major])
	ss.geno<-rbind(geno.freq[these.minor,],geno.freq[these.major,c(3,2,1)])


	plot(ss.allele,ss.geno[,1],xlim=c(0,1),ylim=c(0,1),col=adjustcolor("red",0.1),xlab="allele frequency",ylab="genotype frequency",main=title)
	points(ss.allele,ss.geno[,3],xlim=c(0,1),ylim=c(0,1),col=adjustcolor("blue",0.1))
	points(ss.allele,ss.geno[,2],xlim=c(0,1),ylim=c(0,1),col=adjustcolor("green",0.1))
	smooth=1/5
	lines(lowess(ss.geno[,1]~ss.allele,f = smooth),col="black")
	lines(lowess(ss.geno[,3]~ss.allele,f = smooth),col="black")
	lines(lowess(ss.geno[,2]~ss.allele,f = smooth),col="black")

	x=1:1000/1000
	lines(x,x^2,lty=2)
	lines(x,2*x*(1-x),lty=2)
	lines(x,(1-x)^2,lty=2)
	legend(x=0.3,y=1,col=c("red","blue","green",rep("black",2)),legend=c("Homozygote AA","Homozygote aa","Heterozygote Aa","Mean","Hardy Weinberg Expectation"),pch=c(rep(1,3),rep(NA,2)),lty=c(rep(NA,3),1,2))
}
png(file="YRI_HWE.png")
plot.geno.vs.HW(file="YRI_10000.hw.gz",title="HapMap YRI (Africans)")
dev.off()

png(file="CEU_HWE.png")
plot.geno.vs.HW(file="CEU_10000.hw.gz",title="HapMap CEU (Europeans)")
dev.off()
 

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Posted in popgen teaching, Programming exercises, teaching | 14 Comments

Population genetics course resources: Quantitative traits 3.

Final of this initial set of 3 quantitative trait simulations. The first reason I set out to do these simulations is that none of the figures in text books were very good (okay these aren’t pretty but at least I understand them), and also so that they could be used for demonstration for my grad. student class. This year I may try and get my big undergrad. class to download and run R (though this may not work).

Parents are generated from the population as before [1,2], with an additive genetic variance of 1, and a user specified environmental variance.

In this sim. there is a single generation of truncation selection, where individuals in the top SEL% of the phenotypic distribution are chosen to breed. The simulation produces 3 histograms of the parental distribution before and after selection (vertical bars show the mean) and the distribution of phenotypes in the offspring generation.

Ideas:
Height loci?
Allow user to specify alternative forms to selection (i.e. linear, quadratic).
Correlated traits
Multiple generations of selection, showing: movement of trait towards optimum; change in freq. of individual loci resulting in loss/fixation of genetic variation over generations; decline in heritability.

This slideshow requires JavaScript.

I usually show this one after the parent-offspring regression. If I were clever I should also could a version with the parent-child distributions as side-panels on the regression graph. But that would mean leaning how to draw a histogram on its side (sure it must a simple flag in par or some such thing).



one.gen.sel<-function(L=1000,environ.var,sel){
##Quantitative genetics sims
allele.freq<-0.5   ###each locus is assumed to have the same allele frequencies. This is just to simplify the coding, in reality these results work when each locus has its own frequency (and the coding wouldn't be too much harder). 
 

Num_inds=10000

##MAKE A MUM
## For each mother, at each locus we draw an allele (either 0 or 1) from the population allele frequency. 
##We do this twice for each mother two represent the two haplotypes in the mother 
mum.hap.1<-replicate(Num_inds, rbinom(L,1,allele.freq) )
mum.hap.2<-replicate(Num_inds, rbinom(L,1,allele.freq) )
##type mum.hap.1[,1] to see the 1st mothers 1st haplotype

##Each mothers genotype at each locus is either 0,1,2
mum.geno<-mum.hap.1+mum.hap.2

additive.genetic<-colSums(mum.geno)
mean.genetic<-mean(additive.genetic)
genetic.var<-sd(additive.genetic)

additive.genetic<-additive.genetic / sd(additive.genetic)
mum.pheno<- additive.genetic + rnorm(Num_inds,sd=sqrt(environ.var))
mum.pheno<-mum.pheno-mean(mum.pheno)



###FAMILIES


##MAKE A DAD (same code as make a mum, only said in a deeper voice)
dad.hap.1<-replicate(Num_inds, rbinom(L,1,allele.freq) )
dad.hap.2<-replicate(Num_inds, rbinom(L,1,allele.freq) )
dad.geno<-dad.hap.1+dad.hap.2


additive.genetic<-colSums(dad.geno)
additive.genetic<-additive.genetic / sd(additive.genetic)
dad.pheno<- additive.genetic + rnorm(Num_inds,sd=sqrt(environ.var))
dad.pheno<-dad.pheno-mean(dad.pheno)

### Make a child
child.geno<-dad.hap.1+mum.hap.1 ##1/2 from mum 1/2 from dad

additive.genetic<-colSums(child.geno)
additive.genetic<-additive.genetic / sd(additive.genetic)
child.pheno<- additive.genetic + rnorm(Num_inds,sd=sqrt(environ.var))
child.pheno<-child.pheno-mean(child.pheno)



 ##Selection of top sel% of individuals
 
top.sel.per.mums<- mum.pheno>quantile(mum.pheno,p=1-sel) 
top.sel.per.dads<- dad.pheno>quantile(dad.pheno,p=1-sel)
 
 
child.geno<-dad.hap.1[,top.sel.per.dads]+mum.hap.1[,top.sel.per.mums] ##1/2 from mum 1/2 from dad

additive.genetic<-(colSums(child.geno)-mean.genetic)
additive.genetic<-additive.genetic/genetic.var
child.pheno<- additive.genetic + rnorm(length(child.geno),sd=sqrt(environ.var))

layout(1:3)
my.lim<-quantile(c(mum.pheno,dad.pheno),p=c(0.01,0.99))
my.lim[2]<-quantile(child.pheno,p=c(0.99))

hist(c(mum.pheno,dad.pheno),breaks=100,xlim=my.lim,xlab="Phenotype",main=paste("Phenotype distribution before selection,Mean=0, Vg=1 Ve=",environ.var,"Taking top",round(100*sel),"%")); 
hist(c(mum.pheno[top.sel.per.mums],dad.pheno[top.sel.per.dads]),breaks=100,xlim=my.lim,xlab="Phenotype",main=paste("Phenotype distribution after selection, parental mean",format(mean(c(mum.pheno[top.sel.per.mums],dad.pheno[top.sel.per.dads])),dig=4))); 
abline(v= mean(c(mum.pheno[top.sel.per.mums],dad.pheno[top.sel.per.dads])),col="red",lwd=3)
hist(child.pheno,xlim=my.lim,breaks=100,xlab="Phenotype",main=paste("Phenotype distribution in the children Mean in children = ",format(mean(child.pheno),dig=4))); 

abline(v= mean(child.pheno),col="red",lwd=3)
##Mean phenotype after selection
cat("Selected parental mean",mean(c(mum.pheno[top.sel.per.mums],dad.pheno[top.sel.per.dads])),"\n")
##Mean child phenotype
cat("Mean in children = ",mean(child.pheno),"\n")
}
for(sel in c(0.1,0.4))
for(environ.var in c(.01,1,50)){
png(file=paste("One_generation_selection_varg_1_vare_",format(environ.var,dig=1),"sel_",format(sel,dig=1),".png",sep=""))
one.gen.sel(L=50,environ.var=environ.var,sel=sel) ##high herit.
dev.off()
}
 

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Posted in popgen teaching, Programming exercises, teaching | 2 Comments

Population genetics course resources: Quantitative traits 2.

The second quantitative trait code post. The set up is the same as in the last QT post. I simulate parents with L genotypes drawn binomially from the population, and then make kids from these parents.

This one makes the parental mid-point vs child phenotype regression plot. You get to specify the number of loci (L) and environmental variance (environ.var), and genetic.var=1. The slope of the regression of child_pheno ~ (mum_pheno+dad_pheno)/2 is h^2, i.e. genetic.var/(environ.var+genetic.var). What I find really quite neat is how few loci you need for the h^2 regression to work.

Ideas for this year: Use height loci? Add in dominance and other relationships, or get the students to do this. They struggle to understand the dominance variance (and to be honest I always do too).

This slideshow requires JavaScript.



##Paste this in first
par.off.corr<-function(L=20, environ.var,print.slope=FALSE){
##Quantitative genetics sims
allele.freq<-0.5   ###each locus is assumed to have the same allele frequencies. This is just to simplify the coding, in reality these results work when each locus has its own frequency (and the coding wouldn't be too much harder). 
 
Num_inds=1000 

##MAKE A MUM
## For each mother, at each locus we draw an allele (either 0 or 1) from the population allele frequency. 
##We do this twice for each mother two represent the two haplotypes in the mother 
mum.hap.1<-replicate(Num_inds, rbinom(L,1,allele.freq) )
mum.hap.2<-replicate(Num_inds, rbinom(L,1,allele.freq) )
##type mum.hap.1[,1] to see the 1st mothers 1st haplotype

##Each mothers genotype at each locus is either 0,1,2
mum.geno<-mum.hap.1+mum.hap.2

additive.genetic<-colSums(mum.geno)
genetic.sd<-sd(additive.genetic)
mean.genetic<-mean(additive.genetic)

additive.genetic<-additive.genetic / sd(additive.genetic)
mum.pheno<- additive.genetic + rnorm(Num_inds,sd=sqrt(environ.var))
mum.pheno<-mum.pheno-mean(mum.pheno)



##MAKE A DAD (same code as make a mum, only said in a deeper voice)
dad.hap.1<-replicate(Num_inds, rbinom(L,1,allele.freq) )
dad.hap.2<-replicate(Num_inds, rbinom(L,1,allele.freq) )
dad.geno<-dad.hap.1+dad.hap.2


additive.genetic<-colSums(dad.geno)
additive.genetic<-additive.genetic / sd(additive.genetic)
dad.pheno<- additive.genetic + rnorm(Num_inds,sd=sqrt(environ.var))
dad.pheno<-dad.pheno-mean(dad.pheno)

### Make a child
child.geno<-dad.hap.1+mum.hap.1 ##1 haplotype from mum 1 haplotype from dad

additive.genetic<-colSums(child.geno)
additive.genetic<-additive.genetic / sd(additive.genetic)
child.pheno<- additive.genetic + rnorm(Num_inds,sd=sqrt(environ.var))
child.pheno<-child.pheno-mean(child.pheno)



##Calculate midpoints, linear model and plots

parental.midpoint<-(mum.pheno+dad.pheno)/2 ##avg. parents

lm(child.pheno~parental.midpoint) ##linear model between child and mid point
							##the slope of this line is the narrow sense heritability.

# plot parental midpoint against offsprings phenotype.
layout(1) ###done in case this is run after the code with 3 plots
plot(parental.midpoint,child.pheno,xlab="Parental midpoint",ylab="Child's phenotype",cex=1.5,main=paste("L =",L,"environ.var=",environ.var,"genetic.var=1"))
## plot the regression in red
abline(h=0,col="grey",lwd=2)
abline(v=0,col="grey",lwd=2)
 abline(lm(child.pheno~parental.midpoint),col="red",lwd=2)

if(print.slope) text(x=-2,y=3,label=paste("slope= ",format(lm(child.pheno~parental.midpoint)$coeff[2],digit=3)),col="red",lwd=4)

abline(0,1,col="blue",lwd=3)
 
 }
 
for(environ.var in c(0.00001,0.5,1,2)){
png(file=paste("parent_offspring_regression_varg_1_vare_",format(environ.var,dig=1),".png",sep=""))
 par.off.corr(L=200, environ.var=environ.var,print.slope=TRUE)
dev.off()
 }
 

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Posted in popgen teaching, Programming exercises, teaching | 1 Comment

Population genetic course resource: Quantitative traits 1

Here’s the first of 3 Quantitative trait simulations I wrote for last year. We used them in class and also had the TAs in my big undergrad class use them.

Code to simulate the phenotypes of a population. There are L biallelic loci. Each locus segregates an additive allele with equal effect on the phenotype at 50% frequency. The effect of these loci is normalized such that the additive genetic variance is 1. An environmental variance is then added, which you get to chose.

The figures below show histograms of the distribution of the phenotype within the population for L=1,3,10,100. The environ. variance is set low (=0.01) to allow the changing number of peaks to be seen. With a bit more environmental variance these distributions quickly approach the normal distribution (i.e. the central limit kicks in).
Ideas: Redo these simulations using the ~160 height loci found to date. Will hopefully code that up shortly.

This slideshow requires JavaScript.

If you do use these figures or code, please acknowledge that fact (mainly so that others can find this resource). Also add a comment to the post so I can see how widely used they are, to get a sense of how worthwhile this is. If you find a bug or make an improved version do let me know.

##Quantitative genetics sims

pheno.geno<-function(L,environ.var){
allele.freq<-0.5   ###each locus is assumed to have the same allele frequencies. This is just to simplify the coding, in reality these results work when each locus has its own frequency (and the coding wouldn't be too much harder). 
 
 ###L try values between 2 and 100
Num_inds=10000

 ## try some values between zero and (say) 5.
					### the genetic component is scaled below so that its variance is 1
					##thus you only need play with the environmetal variance 
 
##MAKE A MUM
## For each mother, at each locus we draw an allele (either 0 or 1) from the population allele frequency. 
##We do this twice for each mother two represent the two haplotypes in the mother 
mum.hap.1<-replicate(Num_inds, rbinom(L,1,allele.freq) )
mum.hap.2<-replicate(Num_inds, rbinom(L,1,allele.freq) )
##type mum.hap.1[,1] to see the 1st mothers 1st haplotype


##Each mothers genotype at each locus is either 0,1,2
mum.geno<-mum.hap.1+mum.hap.2

##assuming that each allele at locus has the same effect on phenotype, sum the genotypes to get the additive genetic "phenotype"

if(L>1){ additive.genetic<-colSums(mum.geno)}
if(L==1){ additive.genetic<-mum.geno}

##I normalize this to have variance 1, to make it comparable to the environmental variance.
additive.genetic<-additive.genetic / sd(additive.genetic)

##add the environmental contribution to the trait to get the overall phenotype
mum.pheno<- additive.genetic + rnorm(Num_inds,sd=sqrt(environ.var))

##normalize the trait to have mean zero
mum.pheno<-mum.pheno-mean(mum.pheno)

##plot a histogram of the distribution of the phenotype within the population
layout(1) ###done in case this is run after the code with 3 plots
hist(mum.pheno,breaks=1000,xlab="Phenotype",main=paste("Number loci=",L,"environ. variance=",environ.var,"genetic variance=1"))

}

for(L in c(1,3,10,100)){
png(file=paste("geno_pheno_L",L,".png",sep=""))
pheno.geno(L=L,environ.var=.01)
dev.off()
}

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Posted in popgen teaching, Programming exercises, teaching | 2 Comments

Teaching Population genetics resources

I teach a graduate level population genetics class, in our Population Biology Graduate Group core (along with a large undergrad class on An Introduction to Evolution). I taught the course last year and it went reasonably well (well no one complained too bitterly). This year I’ve decided to post some of my resources as I develop them to this blog. The hope is that they’ll be useful to others.

I was inspired to do this by the R code tutorial that Yaniv my great postdoc posted on his website. In general I’m exploring new ways to share more of the work we do, for example: posting our papers to archives and posting all of the code from our papers from now on. It is tricky to share our research work too early as competition (or at least perceived competition) makes that difficult. Teaching is one area where I can see absolutely no downside to sharing as early and as much as possible.

If you do use them, please acknowledge that fact (mainly so that others can find this resource). Also if you do use them it would be great if you could add a comment to the post, so I can see how widely used they are, to get a sense of how worthwhile this is. If you find a bug or make an improved version do let me know.

Hope this is helpful to people. We’ll have to see how successful I am at keeping this up.

Graham

Posted in popgen teaching, Programming exercises, teaching | Tagged | Leave a comment

journal tea Oct 4th.

Bayesian inference of ancient human demography from individual genome sequences -Gronau et al

A summary of the paper by Pritchard

the paper relies on the method of Rannala and Yang

Posted in journal tea | Tagged , , | Leave a comment

Journal tea 29th Sept

Testing for Ancient Admixture between Closely Related Populations
Durand et al.

Posted in journal tea | Tagged | Leave a comment

journal tea 27th Sept.

Let’s read the two new Australian archaic admixture papers:
http://www.sciencemag.org/content/early/2011/09/21/science.1211177
http://www.cell.com/AJHG/abstract/S0002-9297%2811%2900395-8

Posted in journal tea | Tagged , | Leave a comment