Grad Core Phylogenetics 2014

UTK EEB requires new graduate students to take an intensive team-taught course covering ecology and evolution. I present the phylogenetics portion (tree creation and tree use).

Note the anonymous feedback form: Suggest changes while it can still help you in class.

Fall 2014:

This year I am teaching the entire phylogenetics portion. Here is the syllabus (PDF) and lectures are below; the recorded videos are here.

  1. Intro to phylo (PDF), R exercise
  2. Likelihood, Bayes, model selection, bootstrap. (PDF), R exercise, R exercise answers Background reading: Lewis 2001
  3. Ingredients for phylogenetic methods. PDF. Background reading: O’Meara 2012
  4. DNA models, heterogeneity, alignment (PDF). Mesquite, Tracy Heath’s Beast Tutorial
  5. Continuous traits and tree stretching (PDF). R script
  6. Gene tree incongruence. Background reading: Maddison 1997 PDFQuiz
  7. Species, speciation, taxonomy PDF
  8. Diversification PDF

Study Guide
I had three main goals in this section: making sure you know what sort of questions you can address using phylogenies, having some idea of what methods are available (or close to being available, with a little work) to answer these questions, and having the basic understanding to read and know how to build a tree.

Things you won’t have to know: the difference between gamma and kappa, the equation for multivariate normal, authors behind methods, or similar specifics that you’ll forget again in a month. What I do want you to know:

  • How are trees made?
  • Why are trees made?
  • Compare and contrast a phylogram and a chronogram.
  • Given a tree, answer questions about its topology or other features (i.e., what’s the closest relative to X on the tree?).
  • Compare and contrast bootstrapping and Bayesian approaches for understanding uncertainty.
  • What is a continuous time Markov chain with a finite state space?
  • Why do we care about the above?
  • How are models for DNA related to models for a binary morphological trait?
  • Connect the central limit theorem to the multivariate normal. Why does this make sense for an evolutionary model?
  • What is independent contrasts? When would you use this?
  • What are ways to deal with heterogeneity of processes on a tree? Why is this important?
  • What is tree stretching? What questions does this let you ask?
  • What are sister group comparisons used for?
  • What is the difference between Bayesian and likelihood approaches?
  • Compare and contrast hypothesis testing, model selection, and parameter estimation.
  • What are some approaches to understanding diversification?
  • Why are there methods for jointly looking at diversification and trait evolution?
  • Given a certain biological question, I may ask you what sort of approach you would take to it. The various empirical examples should help with this.

Note that this is the content I want you to know, but not necessarily the questions. There would not be time for these to be essay questions, so I will ask for some of this information in short answer questions and just a few essays.

Grad Core Phylogenetics 2012

Syllabus Fall 2012:

Day 1

Topics:
Why phylo?
Continuous time Markov Chain finite state space
Doing this in Geiger: get exercise Discrete.R
Slides (PDF)

Day 2

Topics:
In class review quiz
Central limit theorem CLT.R
Brownian motion, Ornstein Uhlenbeck, Independent Contrasts
Slides (PDF)

Day 3

Topics:
In class review quiz
Likelihood vs Bayes redux
Model comparison exercise
Tree stretching
Tree stretching exercise
Slides (PDF)

Day 4

Topics:
In class review quiz
Working through homework. What did you conclude? What does it mean? (relevant R script is here)
Delve deeper into how to do an analysis.
Diversification
Slides (PDF)

Day 5

Topics:
Doing a good analysis
Slides (PDF)

Day 6

Topics
Continuing pair analyses

Final

Concatenation of all slides (PDF)
Study guide (PDF)

Grad Core Phylogenetics 2011

Syllabus Fall 2011:

Day 1

Topics:
Speciation
Hybridization
Slides: PDF.
For homework, due by start of class Monday, do steps 1-15 of this tutorial.

Day 2

Topics:
What trees are used for
history of life + cool discoveries about topology
tree terminology
clades (dinos birds, reptiles, fish)
sister groups, crown/stem, etc.
rank and its importance
homology
heuristic search
treespace
Slides: PDF.

Day 3

Topics:
Distance
Parsimony
Long branch attraction
Models
Model selection
Bayes/Likelihood

Slides: PDF
Homework:
The next assignment should be pretty fun. By next Monday, write a heuristic search algorithm for getting the maximum parsimony tree in R after downloading a set of GenBank sequences, also within R. I’ve sketched out a lot of the code — the main tricky bit should be figuring out how to store the best tree as you go and learning how to use an if() function. Comments in ALL CAPS are the sections you should focus on for changing. You’re encouraged to play with this assignment: maybe you want to plot the scores as you search, maybe you want to try different search strategies, etc. Feel free to email me for help or talk to your cohort-mates (remember you can use the blackboard forum, too). R template here.

Day 4

Slides: PDF

Day 5

Slides: PDF

Day 6

Slides: PDF

Grad Core Phylogenetics 2010

Syllabus Fall 2010:

Day 1

Topics:
What trees are used for
history of life + cool discoveries about topology
Tree terminology
clades (dinos birds, reptiles, fish)
sister groups, crown/stem, etc.
rank and its importance
homology

Slides: pdf, keynote, mov (QuickTime movie)
R assignment:
By the start of class Wednesday, please:
1) Install R on a computer you can access (you don’t need to bring it in, so a desktop at home is fine). You can get an installer from http://cran.r-project.org/ .
2) Read this web page: http://bodegaphylo.wikispot.org/Phylogenetics_and_Comparative_Methods_in_R
3) Read and do the exercises (aka “type ~8 lines”) from this web page: http://bodegaphylo.wikispot.org/I._Getting_Started_with_R . Do the task view route (it’s easier).
4) Write me an email with the info you get from the See Also section when you run ?read.tree after doing the tutorial (should be one line).

Day 2

Topics:

Treespace/NP-nasty

History

Distance

Parsimony

Long branch attraction

Slides: pdf, keynote, mov (QuickTime movie)
R assignment: Plot the number of trees vs the number of taxa.
Here is the second assignment. Note that http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf is a very readable tutorial you may want to download (written by the same guy who wrote the phylogenetic package ape). The homework is due by start of class on Friday.

1. Fire up R. Input library(phangorn) to load the Phangorn package (who gets the geeky reference? Why the F->Ph transition? Get used to this sort of thing). If this gives you an error, try installing it — install.packages(“phangorn”)
2. Create a plain text file (no formatting: .txt rather than .rdf) called treespace.R. We’re going to be typing our commands in this.
3. Go to http://cran.r-project.org/doc/manuals/R-intro.html
4. Read and understand sections 1.5, 1.7, 1.10, 2.1, 2.3, 2.8, 3.2, 12.1, 12.1.4, and 13 [the other bits are handy, too, but I want to minimize your workload].
5. Read the documentation for allTrees (just type ?allTrees ), plot (same way), and length .
6. Treespace is big, as we’ll learn on Friday. Let’s see how big. In your treespace.R file, include the following commands (don’t copy in “7. Don’t paste this…” or the lines after that into the file). Remember that anything after a # sign is ignored by R (it’s a comment). I’m also attaching this as a file in case the formatting is affected by emailing it.


library(phangorn) #to load the phangorn package
x<-c() #creates an empty x vector y<-c() #creates an empty y vector for (number_of_taxa in 3:8) { #this starts a loop: the first time through, variable number_of_taxa=3, the next time, it equals 4, then 5, etc. up to and including 8. print("now doing this number of taxa:") print(number_of_taxa) x[number_of_taxa-2]<-number_of_taxa #here we are sticking the number of taxa into vector x. x[1] is the first slot in that vector, x[2] is the second slot in that vector… (it expands if it is not that size already -- not all languages do that). Think about why we do number_of_taxa - 2 rather than number_of_taxa. #WAY TO STORE THE LENGTH OF THE OUTPUT FROM allTrees, WHICH IS THE NUMBER OF POSSIBLE TREES, FOR PLOTTING ON THE Y-AXIS } #PLOT THE NUMBER OF POSSIBLE TREES VERSUS THE NUMBER OF TAXA (DO A LOG SCALE FOR NUMBER OF TREES)

7. Don’t paste this or following lines into your file
8. Replace the #WAY… and #PLOT… comments with functioning R code. I’ve given you hints (why did I have you look at what the allTrees command does? what’s that y<-c() in the code for?). 9. Try running your code -- copy and paste into R, or on a Mac you can select all and then command+return to have it run (if you're editing the text within an R window), or you can use source("treespace.R") if you have set the working directory to the folder containing the file (?setwd or use an option in the menu). 10. Fix the errors, repeat 9 (see the joys of loops? -- "while (working==FALSE) { debug some more }"). 11. Send me your final figure (screenshot, pdf, etc.) and your treespace.R file. 12. Is the number of trees growing exponentially with the number of taxa? 13. Are you sure? BONUS (optional): How can you tell? 14. If you're still up for it (this part is optional, too), modify your code to look at all trees from 3 to 9 taxa rather than from 3 to 8. You'll need to be very patient. Why? Bang your head against this for awhile, then talk to your cohort-mates for help, and/or email me. A lot of programming issues, especially early on, are frustrating in the way these problems are (i.e., what function lets me count the number of trees? how do I stick things in a vector?), and it's salutary to try to work through them. There are generally several ways to solve each problem -- go for the one that works first. Later you can tweak with speed, memory usage, code beauty (yes, really), etc. There are a few mailing lists for help with R, and help with R for phylogenetics, but [most of you] are too new to R to start posting there.

Day 3

Topics:

Models

Model selection

Bayes/Likelihood

R code for animation to make image

Slides: pdf, keynote, mov (QuickTime movie)

R assignment: Do your own tree search

The next assignment should be pretty fun. You are going to write a heuristic search algorithm for getting the maximum parsimony tree in R after downloading a set of GenBank sequences, also within R. I’ve sketched out a lot of the code — the main tricky bit should be figuring out how to store the best tree as you go and learning how to use an if() function. Comments in ALL CAPS are the sections you should focus on for changing. You’re encouraged to play with this assignment: maybe you want to plot the scores as you search, maybe you want to try different search strategies, etc. Feel free to email me for help or talk to your cohort-mates (remember you can use the blackboard forum, too). R code below and attached.

And just think: on Monday, many of you had never coded before. Now, you’re able to create a phylogenetic search program and pull things from an online database!

Brian

library(phangorn)

accessionNumbers <- c("U15717", "U15718", "U15719", "U15720", "U15721", "U15722", "U15723", "U15724") #These come from a paper by Paradis (1997) on Tanagers #USE THE read.GenBank command TO DOWNLOAD THESE SEQUENCES FROM GENBANK AND PUT THEM IN AN OBJECT CALLED sequences. #Note that it's read.GenBank, not read.Genbank, read.genbank, readGenBank, etc. Programming is usually very case-sensitive distanceMatrix<-dist.dna(sequences) #get a distance matrix print(distanceMatrix) convertedSeqs<-phyDat(sequences) #convert to a format useful for phangorn treeBad<-upgma(distanceMatrix) #the UPGMA starting tree actually is not that bad (and a NJ tree is the same as the maximum parsimony tree). So let's make it challenging by giving you a bad starting tree (for larger data sets, you would not have this "problem" of a starting tree that is too good) #note how to get the parsimony score of a tree (use convertedSeqs, not sequences) while(parsimony(treeBad,convertedSeqs)<220) { treeBad<-rNNI(treeBad) } maximumTries<-50 #GET SCORE OF STARTING TREE, treeBad #STORE THE CURRENT BEST TREE, WHICH IS treeBad #PRINT THE STARTING SCORE for (tryNumber in 1:maximumTries) { #GET A NEW TREE BY DOING newTree<-rNNI( __SOMETHING__ ) #COMPARE THE SCORE OF THE NEW TREE TO THE OLD TREE #USE AN IF BLOCK (if (new #INFO ON IF() IS AT http://cran.r-project.org/doc/manuals/R-intro.html#Conditional-execution #HERE IS AN EXAMPLE BLOCK if (1<2) { print("one is less than 2") } #PERHAPS MORE THINGS AFTER THE IF BLOCK } #PRINT THE FINAL SCORE #PLOT THE FINAL TREE #BONUS: THINK OF A WAY TO DO AN EXHAUSTIVE SEARCH USING THE allTrees COMMAND. WILL THIS WORK FOR MANY MORE TAXA?

Day 4

Topics:

Bootstrap

Consensus

Clocking

Slides: pdf, keynote, mov (QuickTime movie)

Day 5

Topics:

Supertrees & Supermatrices

Gene tree species tree

Where to find trees

Slides: pdf, keynote, mov (QuickTime movie)

Grad Core Phylogenetics 2009

Syllabus Fall 2009:

Day 1 (W. Sept. 30, 50 minutes)

Lecture1.pdf

Readings: Gregory 2008, Crisp and Cook 2005

Topics:

What trees are used for

history of life + cool discoveries about topology

Tree terminology

clades (dinos birds, reptiles, fish)

sister groups, crown/stem, etc.

rank and its importance

homology

Day 2 (F. Oct. 2, 50 minutes)

Lecture2.pdf

Readings: Felsenstein 1978.

Topics:

Treespace/NP-nasty

History

Distance

Parsimony

Long branch attraction

Day 3 (M. Oct. 5, 2 hr)

Lecture3.pdf

Readings: Huelsenbeck et al. 2002

Topics:

Models

Model selection

Bayes/Likelihood

Sample analysis walkthrough (genbank, alignment, tree search, sketch of posttree analysis)

Day 4 (W. Oct. 7, 50 minutes)

Lecture4.pdf

Readings: none.

Topics:

Bootstrap

Clocking

Day 5 (F. Oct. 9, 50 minutes)

Lecture5.pdf

Readings: Edwards 2009, Hugall et al. 2007

Topics:

Supertrees

Supermatrices

Gene tree species tree