R for phylogenetics

R is a useful environment/language for statistical work. Here is a basic tutorial on using it for phylogenetics. You may also look at the basic info at my general R tutorial page; also see sections 1.5, 1.7, 1.10, 2.1, 2.3, 2.8, 3.2, 12.1, 12.1.4, and 13 of the R manual. R commands will be written this way. It is often best to store them in a text file, so if you want to repeat the analysis, you can just rerun the file. Anything preceded by a # in R code is ignored, so you can use this to store comments to describe what you're doing [for those in my class, send me your script, and include your name as a comment].

  1. Install R. You may want to use RStudio: it works on Mac, Linux, and Windows, is free, and can be easier to use than R's built-in graphical environment as it uses panels rather than floating windows (it works better on a Mac, certainly)
  2. Let's install our first package. A package is a set of functions and perhaps data that make R more functional. Probably the majority of comparative methods that are now released become R packages. We will first install the CRAN task view ("ctv") package. A task view describes all (or at least, attempts to describe all) packages in a particular domain, such as high performance computing or economics. There's a view for phylogenetics. Install by entering install.packages("ctv"). It may ask you to choose a repository -- I generally use Berkeley's, as it's updated frequently.
  3. Hooray, our library is installed. By default, R doesn't load it (you'll eventually have many libraries, and probably don't want them all running at once). But let's load ours: library("ctv")
  4. Now let's install all (hopefully) of the phylogenetics packages. install.views("Phylogenetics")
  5. There are several ways to store trees in R. The most popular is the phylo format used by the package ape. phylo4 format, an S4 format used by phylobase, is probably second in popularity. Let us start with ape. library("ape")
  6. What does ape do? Type ?ape. For more information, type library(help = ape)
  7. Let's make a simple tree using the rcoal function. Type ?rcoal to learn the syntax for creating a tree with 10 taxa. Then create one and store it in an object named "phy".
  8. If you had trouble with the above line, try again. You only learn programming by trying to solve problems.
  9. If you're still having trouble, the command to use is phy<-rcoal(n=10). Remember that in R, <- is an assignment operator: foo<-5 means foo will equal 5.
  10. How do we know this worked? We can print(phy) to get basic info about phy (this is general for any R object). summary(phy) may give us a bit more info. plot(phy) will draw the tree.
  11. How does R know to draw the tree in this way, rather than as a scatter plot or some other graph? Well, class(phy) is "phylo" meaning phy is a kind of "phylo" object. Such objects have their own plotting function, plot.phylo(). So R knows to use this function for plotting. Get help on this function and then plot a tree in the upwards orientation, without taxon labels, and with thick branches. [If you're doing this for my class, this plot is part of your assignment: use the pulldown Export menu on the plot panel to save a copy of this plot to email to me].
  12. Tree space is large. Let's see how large. For this, we will use the phangorn package. Load this into R (see how to activate ape for a hint).
  13. Read the documentation for AllTrees() and length().
  14. Create an R script with the following lines:
    
    library(phangorn) #to load the phangorn package
    x<-c() #creates an empty x vector
    y<-c() #creates an empty y vector
    for (number_of_taxa in 3:8) { #this starts a loop: the first time through, variable number_of_taxa=3, the next time, it equals 4, then 5, etc. up to and including 8.
        print("now doing this number of taxa:")
        print(number_of_taxa)
        x[number_of_taxa-2]<-number_of_taxa #here we are sticking the number of taxa into vector x. x[1] is the first slot in that vector, x[2] is the second slot in that vector… (it expands if it is not that size already -- not all languages do that). Think about why we do number_of_taxa - 2 rather than number_of_taxa.
    
        #WAY TO STORE THE LENGTH OF THE OUTPUT FROM allTrees, WHICH IS THE NUMBER OF POSSIBLE TREES, FOR PLOTTING ON THE Y-AXIS
    }
    
    #PLOT THE NUMBER OF POSSIBLE TREES VERSUS THE NUMBER OF TAXA (DO A LOG SCALE FOR NUMBER OF TREES)
    
  15. This is a chewy problem: change #WAY TO STORE.... and #PLOT THE... with ways to do this. [For those in my class, send me your script and final plot]
  16. The above approach is quite stupid. First, ape has a function to calculate the number of trees directly, without having to create them all (try to find this). This is more efficient. Second, while for() loops are easy to understand, they can often be much slower in R than using commands in the lapply() family. They are trickier to learn, but not that difficult, and can offer orders of magnitude speed up. Finally, you are right now probably only using one core of your multicore machine (if it's a laptop or desktop from the past several years). Look at the foreach() function in the documentation for the doMC package for ways to easily turn a for() loop [bad] to a job that automatically executes in parallel [awesome].