Tutorial

Dating Tree Sequences

To illustrate the typical use case of tsdate’s Python API, we will first create sample data with msprime and infer a tree sequence from this data using tsinfer. We will then run tsdate on the inferred tree sequence.

Let’s start by creating some sample data with msprime using human-like parameters.

import msprime

sample_ts = msprime.simulate(sample_size=10, Ne=10000,
                                        length=1e4,
                                        mutation_rate=1e-8,
                                        recombination_rate=1e-8,
                                        random_seed=2)
print(sample_ts.num_trees,
      sample_ts.num_nodes)

The output of this code is:

12 29

We take this simulated tree sequence and turn it into a tsinfer SampleData object as documented here, and then infer a tree sequence from the data

import tsinfer

sample_data = tsinfer.SampleData.from_tree_sequence(sample_ts)
inferred_ts = tsinfer.infer(sample_data)

Note

tsdate works best with simplified tree sequences (tsinfer’s documentation provides) details on how to simplify an inferred tree sequence. This should not be an issue when working with tree sequences simulated using msprime.

Next, we run tsdate to estimate the ages of nodes and mutations in the inferred tree sequence:

import tsdate
dated_ts = tsdate.date(inferred_ts, Ne=10000, mutation_rate=1e-8)

All we need to run tsdate (with its default parameters) is the inferred tree sequence object, the estimated effective population size, and estimated mutation rate. Here we have provided a human mutation rate per base pair per generation, so the nodes dates in the resulting tree sequence should be interpreted as generations.

Specifying a Prior

The above example shows the basic use of tsdate, using default parameters. The software has parameters the user can access through the tsdate.build_prior_grid() function which may affect the runtime and accuracy of the algorithm.

Inside Outside vs Maximization

One of the most important parameters to consider is whether tsdate should use the inside-outside or the maximization algorithms to perform inference. A detailed description of the algorithms will be presented in our preprint, but from the users perspective, the inside-outside approach performs better empirically but has issues with numerical stability, while the maximization approach is slightly less accurate empirically, but is numerically stable.

Command Line Interface Example

tsdate also offers a convenient command line interface (CLI) for accessing the core functionality of the algorithm.

For a simple example of CLI, we’ll first save the inferred tree sequence we created in the section above as a file.

import tskit

inferred_ts.dump("inferred_ts.trees")

Now we use the CLI to again date the inferred tree sequence and output the resulting dated tree sequence to dated_ts.trees file:

$ tsdate date inferred_ts.trees dated_ts.trees 10000 1e-8 --progress

The first two arguments are the input and output tree sequence file names, the third is the estimated effective population size, and the fourth is the estimated mutation rate. We also add the --progress option to keep track of tsdate’s progress.

Troubleshooting tsdate

If numerical stability issues are encountered when attempting to date tree sequences using the Inside-Outside algorithm, it may be necessary to remove large sections of the tree which do not have any variable sites using tsdate.preprocess_ts() method.

Inferring and Dating Tree Sequences with Historical (Ancient) Samples

tsdate and tsinfer can be used together to infer tree sequences from both modern and historical samples. The following recipe shows how this is accomplished with a few lines of Python. The only requirement is a tsinfer.SampleData file with modern and historical samples (the latter are specified using the individuals_time array in a tsinfer.SampleData file).

import msprime
import tsdate
import tsinfer
import tskit

import numpy as np


def make_historical_samples():
    samples = [
        msprime.Sample(population=0, time=0),
        msprime.Sample(0, 0),
        msprime.Sample(0, 0),
        msprime.Sample(0, 0),
        msprime.Sample(0, 1.0),
        msprime.Sample(0, 1.0)
    ]
    sim = msprime.simulate(samples=samples, mutation_rate=1, length=100)
    # Get the SampleData file from the simulated tree sequence
    # Retain the individuals times and ignore the sites times.
    samples = tsinfer.SampleData.from_tree_sequence(
      sim, use_sites_time=False, use_individuals_time=True)
    return samples

def infer_historical_ts(samples, Ne=1, mutation_rate=1):
   """
   Input is tsinfer.SampleData file with modern and historical samples.
   """
   modern_samples = samples.subset(np.where(samples.individuals_time[:] == 0)[0])
   inferred_ts = tsinfer.infer(modern_samples) # Infer tree seq from modern samples
   # Removes unary nodes (currently required in tsdate), keeps historical-only sites
   inferred_ts = tsdate.preprocess_ts(inferred_ts, filter_sites=False)
   dated_ts = tsdate.date(inferred_ts, Ne=Ne, mutation_rate=mutation_rate) # Date tree seq
   sites_time = tsdate.sites_time_from_ts(dated_ts)  # Get tsdate site age estimates
   dated_samples = tsdate.add_sampledata_times(
      samples, sites_time) # Get SampleData file with time estimates assigned to sites
   ancestors = tsinfer.generate_ancestors(dated_samples)
   ancestors_w_proxy = ancestors.insert_proxy_samples(
      dated_samples, allow_mutation=True)
   ancestors_ts = tsinfer.match_ancestors(dated_samples, ancestors_w_proxy)
   return tsinfer.match_samples(
      dated_samples, ancestors_ts, force_sample_times=True)

samples = make_historical_samples()
inferred_ts = infer_historical_ts(samples)

We simulate a tree sequence with six sample chromosomes, four modern and two historical. We then infer and date a tree sequence using only the modern samples. Next, we find derived alleles which are carried by the historical samples and use the age of the historical samples to constrain the ages of these alleles. Finally, we reinfer the tree sequence, using the date estimates from tsdate and the historical constraints rather than the frequency of the alleles to order mutations in tsinfer. Historical samples are added to the ancestors tree sequence as proxy nodes, in addition to being used as samples.