Bias correction in diversification models

Likelihood methods are powerful (they squeeze every last drop of information from the data) but often biased at low sample sizes (e.g., an estimated rate might be systematically too high). They are often used in our field to estimate speciation, extinction, net diversification (speciation - extinction), and turnover (speciation + extinction) rates from dated phylogenies. Sometimes they’re embedded in Bayesian approaches, allowing priors to be added (for good or ill) and uncertainty in parameter estimates to be more easily calculated.

In this age of genomics, who cares about low sample sizes? Well, some charismatic clades folks study are naturally small (there are only so many baleen whale species – maybe you should have worked on insects instead?). But a more common issue occurs in approaches that deal with heterogeneity across the tree of life by applying different rates to different parts of the tree. Older approaches like Medusa (Alfaro et al. 2009) and BAMM (Rabosky 2014), and newer approaches like the Bayesian ClaDS (Maliet et al. 2019) or the non-Bayesian MiSSE (Vasconcelos et al. 2022), all effectively subdivide the tree into chunks (though MiSSE does this in a more continuous way, as it uses hidden states). So a chunk can easily contain few enough taxa to be affected by bias in an estimator.

Jeremy Beaulieu and I have been on a mission lately to address bias in estimators (note this is a different issue from the myriad other issues affecting diversification analyses; see, for example, one of our preprints). There can be bias from ignoring measurement error (Beaulieu & O’Meara 2025; O’Meara & Beaulieu 2024; see also my upcoming poster at Evolution 2026), bias from only looking at “interesting clades” (Beaulieu & O’Meara 2018, 2019), and so on. Today we published the latest paper in this series (Beaulieu & O’Meara 2026).

Tanja Stadler (Stadler, 2013) wrote a key paper on estimators in diversification models, pointing out that the way different implementations estimate “the” speciation rate can vary and comparing different approaches. Inspired by this and some of our (well, largely Jeremy’s) work on finding an unbiased estimator in the Yule (pure birth, no extinction) case, we set out to find unbiased estimators for various diversification parameters.

The way we did this was rather fun. There’s the “real” math way of doing it by deriving equations – we were able to do this for Yule and the critical branching process (speciation = extinction), and for incorporating the effect of not looking at trees of two taxa or fewer. However, for the more complicated case where speciation and extinction can both vary, we could not analytically find a solution.

The approach we used instead has the polite name “symbolic regression” and is formally a machine learning approach (so maybe it aligns with NSF priorities for AI – program officers take note). But it boils down to, “well, we know a bias correction probably uses something about the number of taxa, and/or the age of the tree, and/or the speciation rate estimate, and/or the extinction rate estimate, and/or…” and “the correction might involve the likelihood estimate being multiplied by one (or more) of these, added to by one of these,…” and then trying ALL the possible equations to find the correction that works best. This follows my long-term strategy of “math can be difficult, let’s just make the computer work very hard”, like the approach I used for parametric species delimitation (store the probabilities of all the gene trees (O’Meara 2009; see this header file)) or the phrapl series of papers for looking at phylogeography, gene flow, and species delimitation, which involve estimating the probability of a gene tree by simulating many, many times.

Here, we basically make all the reasonable corrections and try each to see which reduces the bias in the estimator. The key difficulty, as with approximate Bayesian computation, is to figure out a good way to measure the distance between what we are getting and what we want. Something like RMSE is the obvious choice, but the issue is that for really tiny trees, the errors can be huge in a handful of simulations, and those can overwhelmingly drive what is chosen. So instead we regressed the corrected estimates on the true values and saw how far the regression slope and intercept were from the ideal line (slope 1, intercept 0).
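A minimal Python sketch of the "try every candidate and score it against the ideal line" idea. This is not the gramEvol search from the paper. The toy data are made up (the "MLE" is just the true rate shrunk by a known factor plus a little noise, not a birth-death simulation), the four candidate corrections are hand-picked, and the score |slope - 1| + |intercept| is an assumed stand-in for the distance actually used. All function names are hypothetical.

```python
import random

def ols(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def distance_from_ideal(truth, estimate):
    """Assumed score: distance of the fitted line from slope 1, intercept 0."""
    slope, intercept = ols(truth, estimate)
    return abs(slope - 1.0) + abs(intercept)

# Hypothetical candidate corrections: multipliers that depend only on ntax.
CANDIDATES = {
    "1": lambda n: 1.0,
    "n/(n-1)": lambda n: n / (n - 1),
    "(n-1)/(n-2)": lambda n: (n - 1) / (n - 2),
    "(n+1)/n": lambda n: (n + 1) / n,
}

def make_toy_data(reps=2000, seed=1):
    """Toy stand-in for simulations: 'MLE' = truth * (n-2)/(n-1) + small noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(reps):
        true_rate = rng.uniform(0.1, 1.0)
        ntax = rng.randint(3, 20)
        mle = true_rate * (ntax - 2) / (ntax - 1) + rng.gauss(0.0, 0.01)
        rows.append((true_rate, ntax, mle))
    return rows

def best_correction(rows):
    """Score every candidate and return the best name plus all scores."""
    truth = [row[0] for row in rows]
    scores = {}
    for name, factor in CANDIDATES.items():
        corrected = [mle * factor(ntax) for _, ntax, mle in rows]
        scores[name] = distance_from_ideal(truth, corrected)
    return min(scores, key=scores.get), scores

best, scores = best_correction(make_toy_data())
print(best)
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name:12s} {score:.4f}")
```

Because the toy data were built with a known distortion, the search should recover its inverse. The real problem is harder: the candidate set is far larger, and the "truth" comes from simulating trees.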

We validated this approach (which used a slightly modified gramEvol R package (Noorian et al. 2016) to search exhaustively) on the Yule case where we knew the correct result, then applied it to the more complex birth-death result.

Overall, it looked like the best unbiased estimator for speciation rate was the maximum likelihood estimate (MLE) for speciation rate times (number of taxa on the tree - 1) / (number of taxa on the tree - 2). The correction approaches 1 as the number of taxa on the tree (ntax) increases, but at just three taxa, the unbiased speciation rate is twice that of the MLE. The extinction rate, which is famously hard to estimate well, is the MLE of extinction rate times (ntax/(ntax-1) + extinction fraction), where extinction fraction is the MLE for extinction rate divided by the MLE for the speciation rate. However, read the paper for details.
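Since these are basic transformations, here is a minimal Python sketch of them exactly as stated in the paragraph above. The function names are mine, the input checks are an added assumption, and the paper remains the authority on the exact forms and when they apply.

```python
def unbiased_speciation(lambda_mle, ntax):
    """Speciation-rate MLE times (ntax - 1) / (ntax - 2)."""
    if ntax < 3:
        # Assumed guard: the factor is undefined at ntax = 2.
        raise ValueError("need at least 3 taxa")
    return lambda_mle * (ntax - 1) / (ntax - 2)

def unbiased_extinction(mu_mle, lambda_mle, ntax):
    """Extinction-rate MLE times (ntax / (ntax - 1) + extinction fraction)."""
    if ntax < 2 or lambda_mle <= 0:
        # Assumed guard: avoid dividing by zero in either term.
        raise ValueError("need ntax >= 2 and a positive speciation MLE")
    # Extinction fraction uses the two MLEs, not the corrected values.
    extinction_fraction = mu_mle / lambda_mle
    return mu_mle * (ntax / (ntax - 1) + extinction_fraction)

# How quickly the speciation correction factor shrinks toward 1 with tree size.
for ntax in (3, 5, 10, 100):
    print(ntax, round(unbiased_speciation(1.0, ntax), 4))
```

The loop prints the multiplier applied to a speciation MLE of 1.0, so it shows directly how much more the correction matters for a three-taxon chunk than for a hundred-taxon one.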

It’s important to note that this is NOT another example of “diversification methods are wrong” – those that were wrong before are still wrong, and those that have yet to be proven wrong are still yet to be proven wrong¹. However, it does suggest that if your question is about the parameter estimates themselves, it could be worth converting them to the unbiased estimates, especially when comparing between groups of different sizes. I suggest phylogenetic software start returning both the MLEs and the unbiased estimates to users; users could also convert the estimates themselves, as these are basic transformations.

Another note: while we all like to gripe about peer review (somehow especially popular among those whose papers I review), this was a case where the manuscript was improved substantially by the people who volunteered to review it and by AE Mike Steel, especially by one reviewer who caught an error in an earlier draft. Our thanks to the reviewers in this paper were heartfelt. This was my first time publishing in Bulletin of Mathematical Biology, which is published by Springer for the Society for Mathematical Biology, and it was a delightful process – fast, competent reviews and a quick turnaround for publication.

Citations

Alfaro, ME, F Santini, C Brock, H Alamillo, A Dornburg, DL Rabosky, G Carnevale, and LJ Harmon. 2009. Nine exceptional radiations plus high turnover explain species diversity in jawed vertebrates. Proceedings of the National Academy of Sciences 106: 13410-13414. https://doi.org/10.1073/pnas.0811087106
Beaulieu, J. M. and B. C. O’Meara. 2018. Can we build it? Yes we can, but should we use it? Assessing the quality and value of a very large phylogeny of campanulid angiosperms. American Journal of Botany 105(3): 417–432. https://doi.org/10.1002/ajb2.1020
Beaulieu, J. M. and B. C. O’Meara. 2019. Diversity and skepticism are vital for comparative biology: a response to Donoghue and Edwards (2019). American Journal of Botany 106(5): 613–617. https://doi.org/10.1002/ajb2.1278
Beaulieu, J.M., O’Meara, B.C. 2025. Navigating “tip fog”: embracing uncertainty in tip measurements, Evolution, Volume 79, Issue 7, 1 July 2025, Pages 1131–1142, https://doi.org/10.1093/evolut/qpaf067
Beaulieu, J.M., O’Meara, B.C. 2026. Statistical and Structural Bias in Birth-Death Models. Bull Math Biol 88, 81. https://doi.org/10.1007/s11538-026-01644-0
Maliet, O., Hartig, F. & Morlon, H. 2019. A model with many small shifts for estimating species-specific diversification rates. Nat Ecol Evol 3, 1086–1092. https://doi.org/10.1038/s41559-019-0908-0
Noorian F, de Silva AM, Leong PHW. 2016. gramEvol: Grammatical Evolution in R. Journal of Statistical Software 71: 1–26. https://doi.org/10.18637/jss.v071.i01
O’Meara BC. 2009. New heuristic methods for joint species delimitation and species tree inference. Systematic Biology 59(1): 59–73 (print issue January 2010). https://doi.org/10.1093/sysbio/syp077
O’Meara BC, Beaulieu JM. 2024. Noise leads to the perceived increase in evolutionary rates over short time scales. PLOS Computational Biology 20(9): e1012458. https://doi.org/10.1371/journal.pcbi.1012458
Rabosky DL. 2014. Automatic Detection of Key Innovations, Rate Shifts, and Diversity-Dependence on Phylogenetic Trees. PLOS ONE 9(2): e89543. https://doi.org/10.1371/journal.pone.0089543
Stadler, T. 2013. How Can We Improve Accuracy of Macroevolutionary Rate Estimates?, Systematic Biology, Volume 62, Issue 2, March 2013, Pages 321–329, https://doi.org/10.1093/sysbio/sys073
Vasconcelos, T, O’Meara, BC, Beaulieu, JM. 2022. A flexible method for estimating tip diversification rates across a range of speciation and extinction scenarios, Evolution, Volume 76, Issue 7, Pages 1420–1433, https://doi.org/10.1111/evo.14517

To subscribe, go to https://brianomeara.info/blog.xml in an RSS reader.

Footnotes

¹ If George Box worked in diversification, his famous quote, “All models are wrong, but some are useful” might have just been the far pithier “All models are wrong.” :-)

Citation

BibTeX citation:

@online{o'meara2026,
  author = {O’Meara, Brian},
  title = {Bias Correction in Diversification Models},
  date = {2026-04-17},
  url = {https://brianomeara.info/posts/biascorrection/},
  langid = {en}
}

For attribution, please cite this work as:

O’Meara, Brian. 2026. “Bias Correction in Diversification Models.” April 17, 2026. https://brianomeara.info/posts/biascorrection/.