library("TreeSim")
## Loading required package: ape
## Loading required package: geiger
## Loading required package: laser
library("TreePar")
## Loading required package: Matrix
## Loading required package: subplex
## Loading required package: deSolve
##
## Attaching package: 'deSolve'
##
## The following object is masked from 'package:graphics':
##
## matplot
If you have not installed them you should use the function
#install.packages("nameofthepackage")
Tutorial modified from Tanja Stadler’s PhyloDynamics seminar.
Simulate a tree using TreeSim
package. We are assuming that the change in shape of the tree is given by a shift in the diversification rates (like a mass extinction) and not because of diversity dependence (we will test this in the next section). Input for the simulation will be:
set.seed(1)
#tree1<-sim.rateshift.taxa()[[1]]
Plot the tree and the lineage through time. Do you notice the “pull of the present”?
Let’s calculate speciation times using the following instruction
#x<-sort(getx(tree1),decreasing=TRUE)
So now we estimate our density dependent diversification rates. Three parameters are of importance (K, lambda_n, mu) mu is fixed but not zero, lambda_n depends on the number of lineages and K is the carrying capacity. For this part we will use function bd.densdep.optim
in TreePar
there are faster options in DDD
and expoTree
(not very intuitive).
What are the input choices for bd.densdep.optim
? What would you specify in order to find MLEs?
#res.DDD<-bd.densdep.optim()
If we want to compare to the model with shifts we can optimize using bd.shifts.optim
.
#res.shifts <- bd.shifts.optim(x,c(rho,1),0.1,0.1,2.1)[[2]][[2]]
How are your estimates for the parameters? What are the potential problems?
To make model comparisons we can calculate AIC values for both models
#aic.ddd <- 2*3+2*res.DDD$value
#aic.shifts <- 2*5+2*res.shifts[1]
What should be happening given what you simulated?
Can you simulate a tree with density dependent diversification? Check the notes of function sim.rateshift.taxa()
Now, fit the density dependent model and the shift dependent model, which one should fit better?
This tutorial largely follows the tutorials by Dan Rabosky on his website.
BAMM consists of two components, the program that runs the actual analysis is in C++ (to be faster), while the analysis of the outputs happens in R.
Easiest might be to install it from the executables (both for Win and OS X), although from source and using homebrew is also possible.
unpack everything, in case of OS X, move to the appropriate directory in the command line and type
tar -xzf bamm-2.3.0-MacOSX.tar.gz
move the files and folders to the directory you want
For this one, simply open R and type
install.packages('BAMMtools')
If it is installed, load the package using
library(BAMMtools)
In order to run BAMM, we need nothing more than a tree and a controlfile. We will use the whale examples in this tutorial. To run your own analysis, it is a good idea to simply modify the example controlfile or use the template controlfile.
Let’s start the whale analysis and then look at the controlfile.
It is easiest to use BAMM when all files are in the same directory where BAMM is, so:
numberOfGenerations
from 10k to 1Mio./bamm -c divcontrol.txt # on windows you can omit the './'
The main data we want to look at later is in the event_data file, and we look at the file mcmc_out to assess convergence. You might want to keep the run_info file with your other files in order to know what settings you used for a specific analysis.
Go to the examples folder and find the ‘whales’ controlfile and tree in the ‘diversification’ subfolder. Open the controlfile ‘divcontrol.txt’. The file is pretty well annotated and most points should be fairly self-explanatory, we will quickly discuss some of them though.
Sections:
Most values are preset in a way that will work out well, and most users work along the policy to not touch anything if you don’t know what it does, but a few parameters are worth looking at (the bold ones you definitely have to modify for your own analysis, or at least consider it…):
sampleProbsFilename
all the priors…
acceptanceRateFreq
scaling operators…
move frequencies…
initial values…
swapPeriod
segLength
More in-depth treatment of how to fine-tune the analysis and how to use advanced features can be found here
Now we want to dig into the outputs of the analysis, which we do in R. There are a number of different options regarding what we can do and should do, and we try to look at the most important ones here.
Just like for any MCMC analyses, e.g. when we date a tree with BEAST, we want to know if we ran the analysis long enough to reach convergence. To do this for BAMM, we look at the file mcmc_out in R:
mcmc <- read.csv("/Users/orlandoschwery/Documents/UT/Courses/Fall 15/Phyloseminar/BAMMtutorial/mcmc_out.txt", header=T) # load file
dim(mcmc) # check dimensions
head(mcmc) # check head
Now we can look at the trace plot of the log likelihood per generation (and if we want at the numbers of shifts per generation)
plot(mcmc$logLik ~ mcmc$generation)
plot(mcmc$N_shifts ~ mcmc$generation)
Now we want to discard a reasonable portion as burnin, e.g. 10%…
burnstart <- floor(0.1 * nrow(mcmc))
mcmcpost <- mcmc[burnstart:nrow(mcmc), ]
… and calculate the effective sample size, which should exceed 200 if our analyses ran long enough (although different people might argue for different minimal values here).
library(coda)
effectiveSize(mcmcpost$logLik) # calculates autocorrelation function
effectiveSize(mcmcpost$N_shift) # effective size on N_shifts
## do this on pre-burnin MCMC
effectiveSize(mcmc$logLik)
effectiveSize(mcmc$N_shift)
Now for the thing we are most interested in: the rate shifts. We load the package BAMMtools, the tree we used and the event_data file, for which we specify the same burnin we decided on before:
library(BAMMtools)
tree <- read.tree("/Users/orlandoschwery/Documents/UT/Courses/Fall 15/Phyloseminar/BAMMtutorial/whaletree.tre")
ed <- getEventData(tree, "/Users/orlandoschwery/Documents/UT/Courses/Fall 15/Phyloseminar/BAMMtutorial/event_data.txt", burnin=0.1)
It is possible to subset the event data file, if the size is too big for R to handle.
Now we can just look at the mean rates inferred:
plot.bammdata(ed, legend=T, spex='netdiv')
We can also look at speciation or extinction rates separately:
plot.bammdata(ed, legend=T, spex='s')
plot.bammdata(ed, legend=T, spex='e')
We can get a first idea on what number of shifts were most commonly found using summary (this just tells us how many samples contain a certain number of shifts, but not WHERE those shifts are):
summary(ed)
To get a more robust idea of how certain shift models are performing compared to each other, we can calculate Bayes Factors:
postfile <- mcmc # since we already loaded this
priorfile <- read.csv("/Users/orlandoschwery/Documents/UT/Courses/Fall 15/Phyloseminar/BAMMtutorial/prior_probs.txt", header=T)
bfmat <- computeBayesFactors(postfile, priorfile, burnin=0.1)
The resulting table shows how models with certain numbers of shifts compare to a specific other one, BFs of >20 indicate good support that a model is better than the one it is compared to, BFs of >50 indicate very strong support.
What our data actually looks like now, are different samples of shift regimes, all of which are sampled at a certain frequency. We can access each of them, and it is a good idea to visualize the most frequent ones:
pset <- getBranchShiftPriors(tree, priorfile)
cset <- credibleShiftSet(ed, pset, BFcriterion=3)
plot.credibleshiftset(cset, lwd=2.5)
Based on the credible shift sets, we might conclude that the most commonly sampled configuration of shifts is the best one:
best <- getBestShiftConfiguration(ed, prior = pset)
plot.bammdata(best, lwd = 2)
addBAMMshifts(best, cex=2.5)
Alternatively, pick the shiftset that maximizes the marginal likelihood of a shift on a branch, a set of maximum shift credibility, similar to when we get a maximum clade credibility tree out of a set of trees:
msc.set <- maximumShiftCredibility(ed, maximize='product')
msc.config <- subsetEventData(ed, index = msc.set$sampleindex)
plot.bammdata(msc.config, lwd=2)
addBAMMshifts(msc.config, cex = 2)
When we have rate shifts to lower or higher rates, some sampled configurations might cancel each other out ( i.e. some say speedup for sisterclade A, some say slowdown for sisterclade B). We can catch this (and more) by looking at the cohort matrix:
cmat <- getCohortMatrix(ed)
cohorts(cmat, ed)
Fin!