Phylogenetics
Phylogenetic analysis
Each item includes PAUP batch files or commands where appropriate. I wrote this page for the Farrell Lab website, but have included it here as well, where I continue to update it (occasionally). This page is mostly useful for the batch files which are on it. Some useful references which provide broader overviews of phylogenetics are Phylogenetic Trees Made Easy by Barry Hall, chapters 11 and 12 of the somewhat dated (1996) Molecular Systematics, 2nd Edition, edited by David Hillis, Craig Moritz, and Barbara Mable, and the recently updated Molecular Systematics website. A great new resource is Joe Felsenstein’s new book Inferring Phylogenies. You may also check my page of books for phylogenetics. PAUP also has a quick start tutorial on its web page as well as a useful discussion group for asking questions. This page was last updated on February 3, 2004 , though most of its information has not been updated since June 10, 2002.
If you find this page useful, and think it may be useful to others, please link to it from your website. This will rank it higher in search engines, making it easier for others to find. Thanks!
Each step listed individually
NEXUS file preparation
Batch files
Exhaustive and branch-and-bound searches (<13 taxa)
Heuristic searches
Parsimony ratchet
Bootstrap searches
Decay indices, including partitioned Bremer support
Testing for a molecular clock
Calibrating using a published calibration
Calibrating using fossil data
Computing error bars on node times
Bayesian searches (MrBayes)
Hypothesis testing
Tree drawing and editing
NEXUS file preparation
Open the NEXUS file with MacClade 4.03 or higher.
Select characters for one locus by shift-clicking on characters for that locus.
Under the Characters menu, choose “Character Set…” then “Store Character Set”.
Name this set by the name of the locus (omit spaces)
Do the same for other loci.
Make sure that the genetic code is correct for your loci by going to “Genetic Code…” in the Characters menu. For mixed nuclear and mitochondrial loci, choose Drosophila mtDNA.
For protein-coding loci, select all the characters in one exon, then go to Characters -> “Codon Positions” -> “Calculate Positions…”. Select the option “Choose to minimize stop codons”.
Select non-protein-coding regions (i.e., introns, rDNA) and go to Characters -> “Codon Positions” -> “Non-coding”.
Go to Taxa -> “Taxon list”.
Move the outgroup taxa so that they are next to each other by dragging the number to the left of the taxon name.
Select the whole group of outgroup taxa by shift-clicking on their numbers.
Go to Taxa -> “Taxon sets” -> “Store taxon set”.
Name this taxon set “outgroup”.
Do the same for other taxa you would like in a set. Feel free to change the order of the taxa to do this.
When you are done, click on the icon that looks like a Mayan pyramid, then click on one of the taxa. This alphabetizes the taxa, making later steps in PAUP easier.
Go to File -> “Print list” to print the list, along with taxon numbers.
Save the file and quit MacClade.
Open PAUP 4.0b8 or higher. PAUP often has updates, so check its website to make sure you are using a recent version. Also check for bug reports.
From within PAUP, open the file you just saved with MacClade. Choose “Edit”, not “Execute”.
Scroll down to below the data matrix.
NEXUS files are arranged in blocks, starting with “BEGIN ###;” and ending with “END;”. Go to the SETS block [skip this and the next 2 instructions if you have only one locus or no coding regions].
Start a new line. Type “CHARSET”, a space, then “###first =”, replacing ### with the name of your first coding locus.
Look for the CODONS block. You should see something within this block that looks like “1: 1-1300\3 1400-2700\3”, with the numbers changed. If your first coding locus is 1302 long, for example, copy the “1-1300\3”and place it after the equals sign in the line you just created. If your first coding locus extends all the way to 2700 in this example, just with an intron in the middle, copy the entire “1-1300\3 1400-2700\3” to after the equals sign in the line you just created.
Type a semicolon at the end of this line after pasting in the numbers.
You have just created a character set containing all the bases with the first codon position from your first coding locus. Repeat the same process for other loci, all three positions. The general format is “CHARSET [charset name] = [character numbers];”.
You can do the same thing with sets of taxa. You should already have a taxset named “outgroup”. You can make other taxsets using the format “TAXSET [taxset name] = [taxon numbers];”. The taxon number is the same as the number on the taxon’s left in MacClade’s taxon list window, which you have printed out. It’s also the order of taxa within the DATA block.
Create a PAUP block. Make sure this is after one of the “end;” lines but before the next “Begin ###;” lines. The format is “Begin PAUP;” skip a few lines, then “end;”. Don’t forget the semicolons.
Within the PAUP block, create a character partition. The general format is “CHARPARTITION [partition name] = [charset name 1]:[charset name 1], [charset name 2]:[charset name 2], [charset name 3]:[charset name 3];” For example, if the data consists of 28S and COI sequence, you would have a charset called 28s, another charset called coI, and the charparition command would be “CHARPARTITION locus = 28s:28s, coI:coI;” This is used in ILD tests.
Within the PAUP block, paste the following text:
outgroup outgroup /only; set maxtrees=5000 increase=auto autoclose=no torder=right tcompress=yes taxlabels=full storebrlens=yes storetreewts=yes outroot=paraphyl;
This sets the outgroup to be whatever you determined in the outgroup taxset, sets a reasonable starting maxtrees number, and sets other PAUP preferences to useful ones for starting a search. Other PAUP commands can be included in the PAUP block as well.
- You may find it useful to keep a running log of all analyses done. This can be helpful in checking what p-values were of various tests, how many replicates were done in a particular search, etc. To do this, paste the following text within the PAUP block:
log file=[name] append=yes replace=no;
If you do this, delete the log commands from other batch files (see below) so that everything is logged to one file.
Batch files
Batch files are ways to give PAUP commands without going through the menus (Mac) or entering items one at a time in the command line (all platforms). For example, you could set up a PAUP batch file which would do a heuristic search saving the best trees from each addition sequence replicate, save these trees to a file, filter for the best trees, and finally save these best trees to a second tree file. Batch files are also a good way to ensure consistency between searches – for example. you won’t have to remember if you used 5 or 10 addition sequence replicates per bootstrap replicate, as that information will be in the batch file. A batch file has the following format:
#nexus [Tells PAUP that this is a NEXUS file]
begin PAUP; [Starts the PAUP block]
[PAUP commands]; [Gives PAUP instructions on what to do – the same thing you enter in the command line]
end; [Ends the PAUP block]
To create a batch file, first create a new document within PAUP or in some other text editor (such as SimpleText). Then type “#nexus” in the first line and “begin paup;” in the second. After that, include the PAUP commands you want – PAUP’s command reference document, available from the downloads section of its website, is invaluable in constructing these. Finally, write “end;” at the end of the document. Save the document. Now, executing the document, the same way you execute a data file, will start the commands running.
All the batch files on this website can be copied into a new PAUP document and executed. Several of the batch files have places to insert the name of a file or set, often appearing as [name]. Make sure that the brackets are removed: for example, if the file is called tree1, “file=[name]” should become “file=tree1”. PAUP makes batch file troubleshooting easy – if there is a problem with the file, PAUP will open the file and put the cursor where the error occured. PAUP does not see anything in brackets, so comments written there will not affect analyses. The only two exceptions to this are: 1) if the first character within the bracket is an exclamation point, PAUP will display the contents in the display buffer; and 2) Trees can have a [&U] or [&R] to tell PAUP if they are rooted or not.
Exhaustive and branch-and-bound searches (<13 taxa)
Both these search strategies are guaranteed to find the optimal tree. However, they are not generally practical for even moderate numbers of taxa: finding the best tree is generally an NP-hardproblem. Thus, heuristic searches are necessary for most analyses.
Heuristic searches
Heuristic searches are not guaranteed to find the globally-optimal tree, but they can work for many more taxa than exhaustive or branch-and-bound searches. A good starting batch file is:
#nexus
begin PAUP;
log file=hsearch1.log;
set autoclose=yes;
hsearch start=stepwise addseq=random nreps=100 savereps=yes randomize=addseq rstatus=yes hold=1 swap=tbr multrees=yes;
savetrees file=hsearch1.all.tre brlen=yes;
filter best=yes permdel=yes;
savetrees file=hsearch1.best.tre brlen=yes;
log stop;
end; > >
This batch file will do a heuristic search, save all the trees found in each random addition sequence replicate in the hsearch1.all.tre file, then filter for the best trees overall and save them in an hsearch1.best.tre file. PAUP will also output a tree-island profile. A “tree-island” is a group of trees which cannot be reached through branchswapping from a different group of trees. Ideally, all the trees will be on one island which will have been hit every time, no matter what starting tree or taxon addition order. If this is the case, the treespace is fairly simple and we would probably move on to bootstrapping. If the island(s) with the best trees was hit nearly all the time, we’d probably be satisfied, as well. However, if the best trees are not recovered in many of the searches, we would need to search further.
We use a variety of strategies for these next searches. The first strategy is to simply increase the number of addition sequence replicates. To do this, change the nreps=100 in the batch file above to a higher number, perhaps nreps=1000. To avoid confusion, we may also want to replace “hsearch1” with “hsearch2” wherever it appears before running the search again.
Another strategy is to start from random trees, rather than random taxon addition. On the above batch file, change the randomize=addseq to randomize=trees, change the “hsearch#” as above, and search again, with nreps to taste.
These searches will run longer than the initial search. A way to speed up the search while covering more of tree space is to increase the number of addition replicates but reducing the thoroughness of the search for each replicate. We might do this if we believe there may be islands of good trees PAUP is missing. A sample batch file follows. You may want to change the nreps and timelimit.
#nexus
begin paup;
log file=hsearch.tlimit.log;
set maxtrees=10000 increase=auto;
hsearch rstatus=no limitperrep=yes nreps=5000 randomize=trees timelimit=5 savereps=yes;
savetrees file=hsearch.tlimit.all.tre brlen=yes;
filter best=yes permdel=yes;
savetrees file=hsearch.tlimit.best.tre brlen=yes;
end; > >
The parsimony ratchet (below) can be useful in searching treespace broadly, as well.
Parsimony ratchet
The parsimony ratchet is a way to search treespace by reweights characters for some iterations of a search. It is especially good for searches with large numbers of taxa. It is described by Kevin Nixon (Nixon, K. C. 1999. “The Parsimony Ratchet, a new method for rapid parsimony analysis.” Cladistics 15: 407-414). Derek Sikes and Paul Lewis have written a program, called PaupRat, which generates batch files to implement this search strategy in PAUP. A few notes on the use of this: 1) It’s better to do several searches of a moderate number of nreps in each search (create a new ratchet file for each search) than one search with many nreps; 2) It’s useful to insert the commands:
stopcmd “filter best=yes permdel=yes”;
stopcmd “savetrees file=mydata.best.tre”;
into the setup.nex file before the stopcmd “[quit]”; to automatically filter for the best trees.
Bootstrap searches
Bootstrap searches can take quite some time. One useful feature of PAUP is the ability to search on multiple computers or on several different occasions and combine the results. The key to this is saving the bootstrap trees from each search and then loading them all together, then computing the consensus tree using the tree weights. You must be sure to keep the bootstrap search settings (except for the number of bootstrap replicates) the same between searches for this to be valid. This can be done through the menus (make sure to hit the “save trees to file” checkbox in the bootstrap menu), or through the batch files (below).
For the search itself [the search can be stopped before completing all the bootstrap replicates; if doing multiple searches, the treefile name should be changed for each search. You may want to change the number of bootstrap replicates (currently set at 500) and the number of random taxon additions per bootstrap replicate (currently set at a low value of 10)]:
#nexus
begin paup;
set storetreewts=yes;
bootstrap nreps=500 treefile=bootstrap1.tre replace=no brlen=yes/ start=stepwise addseq=random nreps=10 savereps=no randomize=addseq hold=1 swap=tbr multrees=yes;
end;
Load all the bootstrap trees, making sure to store tree weights (an option which should have been made the default by the above batch file), to load all blocks, and NOT to eliminate duplicate trees. Executing the following batch file should set all these options and load the trees saved as bootstrap1.tre in the active folder.
#nexus
begin paup;
gettrees allblocks=yes duptrees=keep storetreewts=yes storebrlens=yes mode=7 file=bootstrap1.tre;
end;
Finally, compute a majority-rule consensus tree. The batch file for this is:
#nexus
begin paup;
log file=bootstrapconsensus.log;
contree /majrule=yes strict=no le50=yes usetreewts=yes showtree=yes treefile=finalbootstrap.tre grpfreq=yes;
log stop;
end; > >
Note that this will create a bootstrap tree which will include nodes with support less than 50% which are consistent with the majority rule tree. Most authors choose to omit bootstrap numbers less than 50% while continuing to show the node; a better approach, if space allows, would be to show all bootstrap values.
Though likelihood is far slower than parsimony, this does not mean that likelihood is too slow for bootstrapping. To speed up likelihood bootstrap searches, one can fix the values of nuisance parameters (for example, use the lset output from ModelTest) and time limit each heuristic search.
Decay indices, including partitioned Bremer support
For simple decay indices, MacClade can generate a batch file. Go to the tree window, then go to the sigma () menu -> “Decay Index Paup File…”. For partitioned Bremer support, we use TreeRot (see its detailed instruction manual).
Choosing a likelihood model
We generally use David Posada’s program ModelTest to determine the appropriate likelihood model. This consists of a batch file which has PAUP compute the likelihood score of a tree under various models and a program which computes the likelihood ratio test and the AIC criterion. When the two criteria choose different models, we use the likelihood ratio test result. This procedure will be brought into PAUP in one of the next updates, according to PAUP’s programmers.
Testing for a molecular clock
Once the appropriate likelihood model has been selected by ModelTest, one can test for a clock. Copy the lscores line corresponding to ModelTest’s chosen model from the modelblock file and paste this over the [LSCORES] lines in the following batch file:
#nexus
begin paup;
set criterion=distance;
log file=clocktest.log;
DSet distance=JC objective=ME base=equal rates=equal pinv=0
subst=all negbrlen=setzero;
NJ showtree=no breakties=random;
set criterion=likelihood;
lset clock=no;
[LSCORES];
;[!Non-clock score above, clock score below];
roottrees;
lset clock=yes;
[LSCORES];
;[!Clock score above];
tstatus;
log stop;
end; > >
This will result in two ln likelihood scores. Take the difference between the scores and double it. This is your test statistic for a chi-squared test; the number of taxa minus 2 is the degrees of freedom. You can use a program such as Microsoft Excel (the chidist function) or this web site to compute the p-value. Significant p-values mean that the clock is rejected. People generally try each locus independently; if you have time, you could use ModelTest to determine the correct likelihood model for each locus, or just use the model chosen for all the loci combined.
Calibrating the tree I: Using a previously-published calibration
There are generally two ways to put ages on nodes of a clock tree: 1) Use a previously-published rate of percent divergence for a particular gene versus time, or 2) Use fossil information (see next entry). Reported calibration rates often are in terms of uncorrected percent sequence divergence per million years (“uncorrected” means uncorrected for multiple hits). To use this value for your tree, compute all uncorrected pairwise distances for your taxa in PAUP. With menus, use File -> Save Distances to File, then Options -> Distance options and select uncorrected “p”. The batch file would be (replacing [filename] with what you want to call the distance file, without brackets):
#nexus
begin paup;
dset distance=p;
savedist format=onecolumn file=[filename] undefined=asterisk;
end; > >
Examine the ensuing list (in PAUP or Microsoft Excel) to find a pair of taxa which have a percent divergence similar to the ones in the taxa used in the original calibration. Using these taxa, rather than the oldest or youngest pair, makes it more likely that the proportion of uncorrected multiple hits will be the same for the calibration and the pair being used to calibrate the entire tree (the rate of uncorrected divergence versus time will go down with time as multiple hits accumulate). The calibration rate can be used to determine the age of the node separating the chosen pair of taxa. The age of the node is then used to calibrate the rest of the tree.
Calibrating the tree II: Using fossil data
The rate of molecular evolution may be affected by many things: mutation rate, generation time, and perhaps even body size. A calibration based on fossil data of the group being studied may be more accurate than a calibration from other organisms. Multiple calibration points will give the most precise estimate. Record infomation about all the fossils in the group, not just the oldest. Fossils tend to give just minimum ages for the group – think about ways to determine maximum ages, as well. For example, the group may not be older than the evolution of life on land, or may not be older than the oldest possible age (which is not the age of the oldest known fossil) of an obligate host. Biogeographic information may also be useful. Fossil calibrations, and minimum ages in general, should be mapped to the stem of the clade (= the most recent common ancestor of the clade and its sister group). . Maximum ages should be mapped to the most recent common ancestor (MRCA) of the clade constrained by that calibration (the “crown”). For example, if we have a calibration for birds using the oldest known bird fossil (a minimum age constraint), we know that the split between birds and their sister group had to happen eariler than that fossil, but not that the MRCA of the bird taxa included in our study lived before that fossil. In contrast, if we say that we know that Lepidoptera must have evolved after the first terrestrial plants (a maximum age constraint), we know that the MRCA of Lepidoptera cannot be older than that, but not that the split separating Lepidoptera from their sister group occurred after it. The total branchlength from the present to the node of each constraint on the clock tree is recorded (the units do not matter, as long as the branch lengths are proportional to time). This information, plus the age and type of the calibration constraint, are entered into an Excel sheet which calculated the maximum and minimum possible rates given all the calibration points. This can easily be done manually, as well. Each branch length is then multiplied by these rates to get the maximum and minimum length of the branch in time. Note that the rate is actually applied to the whole tree at once – the tree is stretched or compressed by the maximum or minimum rates – the ratio of the ages of nodes does not change with the different rates. Other methods must be used to calculate error bars on the ages of nodes due to finite sequence length rather than uncertain calibration. These methods are explained below.
Determining error bars on ages of nodes
For determining error bars on nodes, two methods can be used, both implemented in Sanderson’s r8s program. The first method involves looking at the shape of the likelihood surface. The second method uses bootstrapped datasets and calculates error bars from the distribution of branchlengths on the bootstrapped dataset. We have done the latter method using PAUP and Excel, due to a lack of Linux x86 computers to run r8s. First, Phylip’s SeqBoot is used to create bootstrapped datasets. These are edited in PAUP to merge them into one large dataset (for example, if the original dataset is 3000 characters long, and 500 bootstrap replicate datasets are created, they are edited to make a single 1,500,000 character dataset. A PAUP batch file is then constructed in Excel which divides the combined datasets into character sets and then has PAUP evaluate the branchlengths of the given topology (the clock tree being evaluated) for each replicate dataset (corresponding to a charset). A sample batch file is below for a dataset 2432 characters long. Some of the repetitive commands have been deleted, but the pattern should be clear. Before executing this batch file, load your long datafile, set the likelihood parameters, and load your clock tree.
#nexus
begin paup;
charset rep1 = 1 - 2432 ;
charset rep2 = 2433 - 4864 ;
charset rep3 = 4865 - 7296 ;
_[…. repetitive commands not included here. Just use Excel to follow the pattern….]
_charset rep500 = 1213569 - 1216000 ;
exclude all;
include rep1 ; [! rep 1 ]
savetrees root=yes brlen=yes maxdecimals=6 replace=no append=yes file=brlenwbootstrapdata.tre;
exclude all;
include rep2 ; [! rep 2 ]
savetrees root=yes brlen=yes maxdecimals=6 replace=no append=yes file=brlenwbootstrapdata.tre;
exclude all;
include rep3 ; [! rep 3 ]
savetrees root=yes brlen=yes maxdecimals=6 replace=no append=yes file=brlenwbootstrapdata.tre;
[…. repetitive commands not included here. Just use Excel to follow the pattern….]
exclude all;
include rep500 ; [! rep 500 ]
savetrees root=yes brlen=yes maxdecimals=6 replace=no append=yes file=brlenwbootstrapdata.tre;
end; > >
The trees are all saved in one file with branchlengths by PAUP. This tree file is then opened within Excel, with the length of each branch put in a column. The total length of branches from the root to the node of interest is computed for each tree and divided by the distance from the root to the tips. Then the percentile function is used in Microsoft Excel to determine the 95% confidence interval on the distance from the root to the node of interest, with the distance expressed as proportion of total tree length.
Bayesian searches (MrBayes)
Bayesian phylogenetic inference is a new way of finding trees and tree support, in terms of probability of the topology given the data and your prior beliefs (or the default prior beliefs of your program). I use MrBayes for this. Good introductions to the proper use of MrBayes (such as diagnosing convergence) can be found at Frederick Ronquist’s site.
Hypothesis testing
We often want to test our hypotheses of evolution. For example, we may ask if the tree we get from our analysis is significantly different than the traditional phylogeny, or whether the observed paraphyly of a particular group is strongly supported, or other questions. There are several ways to do this. The first step in most of them is to create a constraint tree. Open the data file in MacClade and go to the tree window. Create a random tree. Then use MacClade’s tree drawing tools to create the constraint tree. A constraint tree for the monophyly of a certain group would have only one resolved internode branch, that between the group and the rest of the taxa. A constraint tree for a previously-existing phylogenetic hypothesis would be more resolved. Export these trees from MacClade, one at a time, naming each appropriately.
Open PAUP and execute the data file. Under analysis, choose load constraints (don’t load as backbone constraint). Then, depending on the analysis and optimization criterion:
For Bayesian analyses: Load the trees found after stationarity. Filter the trees to find those consistent with the constraint tree. The command for this is “filter constraint=[constraint name]”, where [constraint name] is replaced by the name of your constraint tree. Record the number of these trees. This number, divided by the total number of post-stationarity trees, is the posterior probability of the hypothesis represented by the constraint tree. For example, if 60 out of 1000 trees are left after filtering for trees with a certain group monophyletic, the probability of monophyly of this group is 6%.
Under the parsimony criterion: Perform a thorough tree search with the constraint enforced [if using a batch file, insert “enforce=yes constraint=[constraint name]” within the hsearch command (somewhere between “hsearch” and the semicolon), replacing [constraint name] with the name of your constraint]. This generates the most parsimonious tree[s] which meet this constraint. Save these trees. Then load your most parsimonious trees found without the constraint without replacing the trees you just found [go to options in the get trees window and fill in both circles, or use “gettrees mode=7 file=[tree file name];”, replacing [tree file name] with the name of your tree file]. The Templeton and winning-sites tests are the appropriate tests to use in this situation and can be selected under Trees -> Tree Scores… -> Parsimony, then choosing nonparametric tests. The command for this is “pscores /nonparamtest=yes;”. The Kishino-Hasegawa test, which is also available, is inappropriate when comparing trees which were not specified a priori. We often report this value anyway.
Under the likelihood criterion: Perform a thorough tree search with the constraint enforced [if using a batch file, insert “enforce=yes constraint=[constraint name]” within the hsearch command (between “hsearch” and the following semicolon), replacing [constraint name] with the name of your constraint]. This generates the likeliest tree which meets this constraint. Save this tree. Then load all likely trees without replacing the trees you just found [go to options in the get trees window and fill in both circles, or use “gettrees mode=7 file=[tree file name];”, replacing [tree file name] with the name of your tree file]. What “all likely trees” means is difficult. Some people include the maximum likelihood tree and the most parsimonious trees. You could also include all trees with a length no greater than 5 more than the most parsimonious trees or something like that. Then go to Trees -> Tree Scores… -> Likelihood, then click on the button for topology based tests. Choose the Shimodaira-Hasegawa test. This test measures whether some trees are better than others under likelihood. The Kishino-Hasegawa test, one many people use, is inappropriate (non-conservative) unless all the trees have been specified a priori (specified without using any of the data for a tree search). We often report the value anyway, though noting it’s not valid.
Tree drawing and editing
To prepare trees for publication, we use Acrobat PDF Writer. This printer extension allows trees to be saved as PDF files from PAUP, complete with correct branchlengths, scale bars, titles, and bootstrap proportions. These trees may then be opened in Adobe Illustrator and have branches colored, line thicknesses changed, keys added, etc. The trees can be saved in PDF format or output to other formats for publication.