Tree building using maximum likelihood

In this exercise we’ll estimate trees using maximum likelihood and look at the impact of different substitution models.

For this we’ll use the phangorn package again and a data set of primates sequences. The sequences are a short fragment of the mitochondrial genome. You can download the data from here.

As before start by opening the file and converting the data into the right format for phangorn.

# read in the character matrix in ape 
dat = = "data/primates-data.nex")

# & convert to phanghorn object
dat = phangorn::phyDat(dat, type = "DNA")

As before we need to generate a starting tree using neighbour joining.

# create a neighbour joining starting tree
dm = phangorn::dist.hamming(dat)
starting_tree = phangorn::NJ(dm)

Side note: for those interested from last time, for two sequences the Hamming distance is simply the proportion of sites at which they differ. It’s therefore one of the simplest ways to approximate evolutionary distances and generate a starting tree. But remember this approach won’t account for convergent evolution or homoplasy.

The phangorn function pml (pml = phylogenetic maximum likelihood) calculates the likelihood of a tree given the data under a model specified by the user. We’ll start with the simplest model of sequence evolution, the Jukes-Cantor (JC) model.

# calculate the likelihood of the tree given the data
fitJC = phangorn::pml(starting_tree, data=dat, model = "JC")

What do you notice about the base frequencies?

Then we’ll use the function optim.pml to find the tree with the best likelihood score under the Jukes-Cantor model. This function takes the tree and the starting values obtaining using the pml function and explores the tree space by switching around the branches. The argument optNni = TRUE tells the function to optimize the tree topology (i.e. find the tree with the highest likelihood), as well as the branch lengths and the model parameters. The parameters of the JC model are fixed, so the model parameters won’t actually change in this case.

# estimate the tree using ML
fitJC_opt = phangorn::optim.pml(fitJC, model = "JC", optNni = TRUE)

As the ML algorithm proceeds it outputs information about the progress. You can see how the likelihood changes as the tree topology and the branch lengths (edge weights) are updated. You should also see that the likelihood is improving, i.e. getting larger.

The function outputs a list with a bunch of values that might be of interest, including the tree.


Next we’ll do the same with the GTR substitution model.

fitGTR = phangorn::pml(starting_tree, data=dat, model = "GTR")
fitGTR_opt = phangorn::optim.pml(fitGTR, model = "GTR", optNni = TRUE)

What differences do you notice in the output?

Let’s look at the trees. Remember that by default trees generated based on a substitution model will be unrooted. First, let’s root the tree using the Lemur.

# identify the outgroup
outgroup = "Lemur"

# root the tree
JC_tree_rooted = ape::root(fitJC_opt$tree, outgroup = outgroup, resolve.root = TRUE)
GTR_tree_rooted = ape::root(fitGTR_opt$tree, outgroup = outgroup, resolve.root = TRUE)

# plot the trees

Do you notice any differences in the topology or branch lengths? Which one is correct?

Since we have the maximum likelihood estimate under both models, we can use a straightforward model testing to identify the best fitting model.

AIC(fitJC_opt, fitGTR_opt)

How would you interpret these results? Hint

Follow the link to learn more about interpreting AIC scores.

Output a tree from R

To output a tree from R and save it we can use the ape function write.tree. This outputs the tree in Newick format.

# write to screen
# write to file
ape::write.tree(tree, file = "my-tree.nex")

Other resources for buidling trees using maximum likelihood

phangorn is a very useful package but you won’t often see published trees generated using this package. There are other more specialized software packages for tree building using maximum likelihood. Some of these feature additional substitution models or have been optimized for the analysis of large data sets. The analysis in this exercise is very fast because our data set has a small number of species and characters (only 150) – most DNA data sets are much larger than this. Two currently alternative likelihood based programs for tree building are RAxML and IQTREE.

A complete script for this exercise can be downloaded here.