Model selection of morphological models

In this exercise we will carry out model selection on different morphological models. The morphological matrix used for this tutorial is from Agnolin et al. 2007 and consists of 51 characters for 12 taxa of Brontornis (terror birds). The data can be downloaded here.

Below I provide code for the different elements of morphological models.

You can decide how you want to build your models and then test to see which one best fits your data.

How to organise your code

Set up a new directory for this tutorial with a directory for the data and one for the scripts. In this tutorial you should make (at least) three Rev scripts: one for the stepping stone commands (steppingstone.rev) and two files for your two morphological models - you can test more models if you want!
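
For example, your directory might look something like this (the model script names here are just suggestions, not part of the tutorial):

morph_model_selection/
├── data/
│   └── Agnolin.nex
└── scripts/
    ├── steppingstone.rev
    ├── mk_model.rev
    └── mkv_model.rev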

Setting up your stepping stone algorithm

Much of the following code is identical to that used for a standard MCMC. We specify that we want to sample the marginal likelihood, but the model setup is exactly the same as in the previous tutorials.

In the steppingstone.rev script we first need to read in our morphological data

morpho <- readDiscreteCharacterData("data/Agnolin.nex")

Define some helper variables

model_name = "Mk" # or whatever model you are testing
taxa <- morpho.names()             # taxon names from the matrix
num_taxa <- taxa.size()            # number of taxa
num_branches <- 2 * num_taxa - 3   # number of branches in an unrooted binary tree

When testing different models I find it helpful to create a variable such as model_name at the start. This can be used to name the output files, which means you don't have to manually change the file names every time you run a different model, and it prevents overwriting the output of previous models.

Create vectors for our moves and monitors

moves = VectorMoves()
monitors = VectorMonitors()
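
As in previous tutorials, monitors can be appended to this vector. A minimal sketch using model_name in the output path (the file name and printgen values here are arbitrary choices, not part of the original tutorial):

monitors.append( mnModel(filename="output/" + model_name + "/posterior.log", printgen=10) )  # log all model parameters
monitors.append( mnScreen(printgen=100) )                                                    # print progress to screen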

Set up the tree model exactly as in previous analyses

# uniform prior on tree topology
phylogeny ~ dnUniformTopology(taxa)

## tree moves
moves.append( mvNNI(phylogeny, weight = num_taxa) ) # nearest neighbour interchange
moves.append( mvSPR(phylogeny, weight = num_taxa/10.0) ) # subtree pruning and regrafting

Set up the prior for the branch lengths

## sample branch lengths
for (i in 1:num_branches) {
   br_lens[i] ~ dnExponential(10.0)
   moves.append( mvScale(br_lens[i]) )
}


## construct tree object
tree := treeAssembly(phylogeny, br_lens)

# save the tree length
TL := sum(br_lens)

Setting up the morphological models

Here you can choose any combination of models you want to test - partitioned or unpartitioned, with or without among-character rate variation (ACRV), and with or without correcting for ascertainment bias.

To set up an Mk model you use the fnJC function in RevBayes, as the Mk model makes the same set of assumptions as the Jukes-Cantor model. Our data set has a maximum of 4 states, so we create a Q matrix of size 4. This number will depend on the data used, so it will need to be changed manually if you use this script with a different data set.

An unpartitioned Mk model can be defined by

Q <- fnJC(4)
seq ~ dnPhyloCTMC(tree=phylogeny, Q=Q, type="Standard")
seq.clamp(morpho)

To account for ascertainment bias we need to tell the model we are conditioning on all sites being variable. To do this we add an argument to the dnPhyloCTMC object as follows

seq ~ dnPhyloCTMC(tree=phylogeny, Q=Q, type="Standard", coding = "variable")
seq.clamp(morpho)

To estimate ACRV we first need to set up a prior on our alpha (or shape) parameter

# Set up Gamma-distributed rate variation.
alpha_morpho ~ dnUniform( 0.0, 1E5 )
rates_morpho := fnDiscretizeGamma( alpha_morpho, alpha_morpho, 4 ) 

# Moves on the parameters to the Gamma distribution.
moves.append(mvScale(alpha_morpho, lambda=1, weight=2.0))

We then need to add this to our phyloCTMC object

seq ~ dnPhyloCTMC(tree=phylogeny, Q=Q, type="Standard", siteRates=rates_morpho)
seq.clamp(morpho)
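
These options can also be combined. For example, a minimal sketch of an Mk model with both ascertainment-bias correction and ACRV (assuming rates_morpho is defined as above):

seq ~ dnPhyloCTMC(tree=phylogeny, Q=Q, type="Standard", coding="variable", siteRates=rates_morpho)
seq.clamp(morpho)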

For a partitioned Mk model we can use a loop to go through the alignment and create partitions based on the number of states observed per character.

n_max_states <- 4
idx = 1
morpho_bystate[1] <- morpho
for (i in 2:n_max_states) {
    morpho_bystate[i] <- morpho                                # make local tmp copy of data
    morpho_bystate[i].setNumStatesPartition(i)                 # only keep character blocks with state space equal to size i
    nc = morpho_bystate[i].nchar()                             # get number of characters per character size with i-sized states

    if (nc > 0) {                                              # for non-empty character blocks
        q[idx] <- fnJC(i)                                      # make i-by-i rate matrix
        m_morph[idx] ~ dnPhyloCTMC( tree=phylogeny,
                                    Q=q[idx],
                                    nSites=nc,
                                    type="Standard" )           # create model of evolution for the character block

        m_morph[idx].clamp(morpho_bystate[i])                  # attach the data

        idx = idx + 1                                          # increment counter
    }
}

This loop can look a bit confusing, but it is just creating a dnPhyloCTMC object for each partition, in the same way we created a single dnPhyloCTMC object for the unpartitioned data.

You can add variable coding and ACRV to each dnPhyloCTMC object in the same way as for the unpartitioned model.
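
For example, a minimal sketch of the call inside the loop with both additions (assuming rates_morpho has been defined as above):

m_morph[idx] ~ dnPhyloCTMC( tree=phylogeny,
                            Q=q[idx],
                            nSites=nc,
                            siteRates=rates_morpho,
                            coding="variable",
                            type="Standard" )
m_morph[idx].clamp(morpho_bystate[i])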

Back to the stepping stone setup

We can now define our model object

## define the model
mymodel = model(phylogeny)

The algorithm that samples a series of distributions between the posterior and the prior is called powerPosterior. Here we specify where to save the output and how many “steps” we want to take

pow_p = powerPosterior(mymodel, moves, monitors, "output/" + model_name  + "/model1.out", cats=47) 
pow_p.burnin(generations=1000,tuningInterval=1000)
pow_p.run(generations=1000)

We can then approximate the marginal likelihood using the following

ss = steppingStoneSampler(file= "output/" + model_name   + "/model1.out", powerColumnName="power", likelihoodColumnName="likelihood")
ps = pathSampler(file="output/" + model_name   + "/model1.out", powerColumnName="power", likelihoodColumnName="likelihood")

I have mainly talked about the stepping stone algorithm, but path sampling is another approach. In RevBayes the output of the same powerPosterior run can be used to approximate the marginal likelihood with both. It can be useful to compare these two approximations, as values that differ by less than ~1 indicate that you have used enough steps in your analysis.
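
A quick way to compare them is to print both estimates (or their absolute difference) to the screen:

print(ss.marginal())                        # stepping stone estimate
print(ps.marginal())                        # path sampling estimate
print(abs(ss.marginal() - ps.marginal()))   # ideally less than ~1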

It can be helpful to save these values to a file

### append both marginal likelihood estimates to a file
write(ss.marginal() , filename= "output/" + model_name  + "/marginal_likelihood.csv", "\n", append=TRUE)
write(ps.marginal() , filename="output/" + model_name  + "/marginal_likelihood.csv", "\n", append=TRUE)
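
Once you have marginal likelihoods for two models, you can compare them with a log Bayes factor. A minimal sketch, where the variable names and values are hypothetical placeholders for the estimates from your own marginal_likelihood.csv files:

ml_mk <- -1275.4     # hypothetical estimate for the unpartitioned Mk model
ml_mkv <- -1268.9    # hypothetical estimate for the Mk model with variable coding
lnBF <- ml_mkv - ml_mk   # log Bayes factor; positive values favour the second model
print(lnBF)
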
  1. Which model did you determine to be the best fit for your data?
  2. Test other models and see if there is a better fit
  3. Does the model that was chosen make sense biologically?

A complete set of scripts for this exercise can be downloaded here.