In this exercise we’ll estimate trees using maximum likelihood and look at the impact of different substitution models.

For this we’ll use the `phangorn`

package again and a data
set of primates sequences. The sequences are a short fragment of the
mitochondrial genome. You can download the data from here.

As before start by opening the file and converting the data into the
right format for `phangorn`

.

```
# read in the character matrix in ape
dat = ape::read.nexus.data(file = "data/primates-data.nex")
# & convert to phanghorn object
dat = phangorn::phyDat(dat, type = "DNA")
```

As before we need to generate a starting tree using neighbour joining.

```
# create a neighbour joining starting tree
dm = phangorn::dist.hamming(dat)
starting_tree = phangorn::NJ(dm)
```

*Side note: for those interested from last time, for two sequences
the Hamming
distance is simply the proportion of sites at which they differ.
It’s therefore one of the simplest ways to approximate evolutionary
distances and generate a starting tree. But remember this approach won’t
account for convergent evolution or homoplasy.*

The `phangorn`

function `pml`

(pml =
phylogenetic maximum likelihood) calculates the likelihood of a tree
given the data under a model specified by the user. We’ll start with the
simplest model of sequence evolution, the Jukes-Cantor (JC) model.

```
# calculate the likelihood of the tree given the data
fitJC = phangorn::pml(starting_tree, data=dat, model = "JC")
fitJC
```

What do you notice about the base frequencies?

Then we’ll use the function `optim.pml`

to find the tree
with the best likelihood score under the Jukes-Cantor model. This
function takes the tree and the starting values obtaining using the
`pml`

function and explores the tree space by switching
around the branches. The argument `optNni = TRUE`

tells the
function to optimize the tree topology (i.e. find the tree with the
highest likelihood), as well as the branch lengths and the model
parameters. The parameters of the JC model are fixed, so the model
parameters won’t actually change in this case.

```
# estimate the tree using ML
fitJC_opt = phangorn::optim.pml(fitJC, model = "JC", optNni = TRUE)
fitJC_opt
```

As the ML algorithm proceeds it outputs information about the progress. You can see how the likelihood changes as the tree topology and the branch lengths (edge weights) are updated. You should also see that the likelihood is improving, i.e. getting larger.

The function outputs a list with a bunch of values that might be of interest, including the tree.

`fitJC_opt$tree`

Next we’ll do the same with the GTR substitution model.

```
fitGTR = phangorn::pml(starting_tree, data=dat, model = "GTR")
fitGTR_opt = phangorn::optim.pml(fitGTR, model = "GTR", optNni = TRUE)
fitGTR_opt
```

What differences do you notice in the output?

Let’s look at the trees. Remember that by default trees generated based on a substitution model will be unrooted. First, let’s root the tree using the Lemur.

```
# identify the outgroup
outgroup = "Lemur"
# root the tree
JC_tree_rooted = ape::root(fitJC_opt$tree, outgroup = outgroup, resolve.root = TRUE)
GTR_tree_rooted = ape::root(fitGTR_opt$tree, outgroup = outgroup, resolve.root = TRUE)
# plot the trees
plot(JC_tree_rooted)
plot(GTR_tree_rooted)
```

Do you notice any differences in the topology or branch lengths? Which one is correct?

Since we have the maximum likelihood estimate under both models, we can use a straightforward model testing to identify the best fitting model.

`AIC(fitJC_opt, fitGTR_opt)`

How would you interpret these results? Hint

Follow the link to learn more about interpreting AIC scores.

To output a tree from R and save it we can use the `ape`

function `write.tree`

. This outputs the tree in Newick
format.

```
# write to screen
ape::write.tree(tree)
```

```
# write to file
ape::write.tree(tree, file = "my-tree.nex")
```

`phangorn`

is a very useful package but you won’t often
see published trees generated using this package. There are other more
specialized software packages for tree building using maximum
likelihood. Some of these feature additional substitution models or have
been optimized for the analysis of large data sets. The analysis in this
exercise is very fast because our data set has a small number of species
and characters (only 150) – most DNA data sets are *much* larger
than this. Two currently alternative likelihood based programs for tree
building are RAxML and
IQTREE.

A complete script for this exercise can be downloaded here.