In this exercise we’ll estimate trees using maximum likelihood and look at the impact of different substitution models.
For this we’ll use the
phangorn package again and a data
set of primates sequences. The sequences are a short fragment of the
mitochondrial genome. You can download the data from here.
As before start by opening the file and converting the data into the
right format for
# read in the character matrix in ape dat = ape::read.nexus.data(file = "data/primates-data.nex") # & convert to phanghorn object dat = phangorn::phyDat(dat, type = "DNA")
As before we need to generate a starting tree using neighbour joining.
# create a neighbour joining starting tree dm = phangorn::dist.hamming(dat) starting_tree = phangorn::NJ(dm)
Side note: for those interested from last time, for two sequences the Hamming distance is simply the proportion of sites at which they differ. It’s therefore one of the simplest ways to approximate evolutionary distances and generate a starting tree. But remember this approach won’t account for convergent evolution or homoplasy.
pml (pml =
phylogenetic maximum likelihood) calculates the likelihood of a tree
given the data under a model specified by the user. We’ll start with the
simplest model of sequence evolution, the Jukes-Cantor (JC) model.
# calculate the likelihood of the tree given the data fitJC = phangorn::pml(starting_tree, data=dat, model = "JC") fitJC
What do you notice about the base frequencies?
Then we’ll use the function
optim.pml to find the tree
with the best likelihood score under the Jukes-Cantor model. This
function takes the tree and the starting values obtaining using the
pml function and explores the tree space by switching
around the branches. The argument
optNni = TRUE tells the
function to optimize the tree topology (i.e. find the tree with the
highest likelihood), as well as the branch lengths and the model
parameters. The parameters of the JC model are fixed, so the model
parameters won’t actually change in this case.
# estimate the tree using ML fitJC_opt = phangorn::optim.pml(fitJC, model = "JC", optNni = TRUE) fitJC_opt
As the ML algorithm proceeds it outputs information about the progress. You can see how the likelihood changes as the tree topology and the branch lengths (edge weights) are updated. You should also see that the likelihood is improving, i.e. getting larger.
The function outputs a list with a bunch of values that might be of interest, including the tree.
Next we’ll do the same with the GTR substitution model.
fitGTR = phangorn::pml(starting_tree, data=dat, model = "GTR") fitGTR_opt = phangorn::optim.pml(fitGTR, model = "GTR", optNni = TRUE) fitGTR_opt
What differences do you notice in the output?
Let’s look at the trees. Remember that by default trees generated based on a substitution model will be unrooted. First, let’s root the tree using the Lemur.
# identify the outgroup outgroup = "Lemur" # root the tree JC_tree_rooted = ape::root(fitJC_opt$tree, outgroup = outgroup, resolve.root = TRUE) GTR_tree_rooted = ape::root(fitGTR_opt$tree, outgroup = outgroup, resolve.root = TRUE) # plot the trees plot(JC_tree_rooted) plot(GTR_tree_rooted)
Do you notice any differences in the topology or branch lengths? Which one is correct?
Since we have the maximum likelihood estimate under both models, we can use a straightforward model testing to identify the best fitting model.
How would you interpret these results? Hint
Follow the link to learn more about interpreting AIC scores.
To output a tree from R and save it we can use the
write.tree. This outputs the tree in Newick
# write to screen ape::write.tree(tree)
# write to file ape::write.tree(tree, file = "my-tree.nex")
phangorn is a very useful package but you won’t often
see published trees generated using this package. There are other more
specialized software packages for tree building using maximum
likelihood. Some of these feature additional substitution models or have
been optimized for the analysis of large data sets. The analysis in this
exercise is very fast because our data set has a small number of species
and characters (only 150) – most DNA data sets are much larger
than this. Two currently alternative likelihood based programs for tree
building are RAxML and
A complete script for this exercise can be downloaded here.