Tutorial¶
Dating Tree Sequences¶
To illustrate the typical use case of tsdate
’s Python API, we will first create
sample data with msprime and infer a tree
sequence from this data using tsinfer.
We will then run tsdate
on the inferred tree sequence.
Let’s start by creating some sample data with msprime
using human-like parameters.
import msprime
sample_ts = msprime.simulate(sample_size=10, Ne=10000,
length=1e4,
mutation_rate=1e-8,
recombination_rate=1e-8,
random_seed=2)
print(sample_ts.num_trees,
sample_ts.num_nodes)
The output of this code is:
12 29
We take this simulated tree sequence and turn it into a tsinfer SampleData object as documented here, and then infer a tree sequence from the data
import tsinfer
sample_data = tsinfer.SampleData.from_tree_sequence(sample_ts)
inferred_ts = tsinfer.infer(sample_data)
Note
tsdate
works best with simplified tree sequences (tsinfer
’s documentation provides)
details on how to simplify an inferred tree sequence. This should not be an issue
when working with tree sequences simulated using msprime
.
Next, we run tsdate to estimate the ages of nodes and mutations in the inferred tree sequence:
import tsdate
dated_ts = tsdate.date(inferred_ts, Ne=10000, mutation_rate=1e-8)
All we need to run tsdate
(with its default parameters) is the inferred tree sequence
object, the estimated effective population size, and estimated mutation rate. Here we
have provided a human mutation rate per base pair per generation, so the nodes dates in the
resulting tree sequence should be interpreted as generations.
Specifying a Prior¶
The above example shows the basic use of tsdate
, using default parameters. The
software has parameters the user can access through the tsdate.build_prior_grid()
function which may affect the runtime and accuracy of the algorithm.
Inside Outside vs Maximization¶
One of the most important parameters to consider is whether tsdate
should use the
inside-outside or the maximization algorithms to perform inference. A detailed
description of the algorithms will be presented in our preprint, but from the users
perspective, the inside-outside approach performs better empirically but has issues with
numerical stability, while the maximization approach is slightly less accurate
empirically, but is numerically stable.
Command Line Interface Example¶
tsdate
also offers a convenient command line interface (CLI) for
accessing the core functionality of the algorithm.
For a simple example of CLI, we’ll first save the inferred tree sequence we created in the section above as a file.
import tskit
inferred_ts.dump("inferred_ts.trees")
Now we use the CLI to again date the inferred tree sequence and output the resulting
dated tree sequence to dated_ts.trees
file:
$ tsdate date inferred_ts.trees dated_ts.trees 10000 1e-8 --progress
The first two arguments are the input and output tree sequence file names, the third is the estimated effective population size, and the fourth is the estimated mutation rate. We also add the --progress
option to keep track of tsdate
’s progress.
Troubleshooting tsdate¶
If numerical stability issues are encountered when attempting to date
tree sequences using the Inside-Outside algorithm, it may be necessary to remove
large sections of the tree which do not have any variable sites using
tsdate.preprocess_ts()
method.
Inferring and Dating Tree Sequences with Historical (Ancient) Samples¶
tsdate
and tsinfer
can be used together to infer tree sequences from both
modern and historical samples. The following recipe shows how this is accomplished
with a few lines of Python. The only requirement is a tsinfer.SampleData file with
modern and historical samples (the latter are specified using the
individuals_time array in a tsinfer.SampleData file).
import msprime
import tsdate
import tsinfer
import tskit
import numpy as np
def make_historical_samples():
samples = [
msprime.Sample(population=0, time=0),
msprime.Sample(0, 0),
msprime.Sample(0, 0),
msprime.Sample(0, 0),
msprime.Sample(0, 1.0),
msprime.Sample(0, 1.0)
]
sim = msprime.simulate(samples=samples, mutation_rate=1, length=100)
# Get the SampleData file from the simulated tree sequence
# Retain the individuals times and ignore the sites times.
samples = tsinfer.SampleData.from_tree_sequence(
sim, use_sites_time=False, use_individuals_time=True)
return samples
def infer_historical_ts(samples, Ne=1, mutation_rate=1):
"""
Input is tsinfer.SampleData file with modern and historical samples.
"""
modern_samples = samples.subset(np.where(samples.individuals_time[:] == 0)[0])
inferred_ts = tsinfer.infer(modern_samples) # Infer tree seq from modern samples
# Removes unary nodes (currently required in tsdate), keeps historical-only sites
inferred_ts = tsdate.preprocess_ts(inferred_ts, filter_sites=False)
dated_ts = tsdate.date(inferred_ts, Ne=Ne, mutation_rate=mutation_rate) # Date tree seq
sites_time = tsdate.sites_time_from_ts(dated_ts) # Get tsdate site age estimates
dated_samples = tsdate.add_sampledata_times(
samples, sites_time) # Get SampleData file with time estimates assigned to sites
ancestors = tsinfer.generate_ancestors(dated_samples)
ancestors_w_proxy = ancestors.insert_proxy_samples(
dated_samples, allow_mutation=True)
ancestors_ts = tsinfer.match_ancestors(dated_samples, ancestors_w_proxy)
return tsinfer.match_samples(
dated_samples, ancestors_ts, force_sample_times=True)
samples = make_historical_samples()
inferred_ts = infer_historical_ts(samples)
We simulate a tree sequence with six sample chromosomes, four modern and
two historical. We then infer and date a tree sequence using only the modern
samples. Next, we find derived alleles which are carried by the historical samples and
use the age of the historical samples to constrain the ages of these alleles. Finally,
we reinfer the tree sequence, using the date estimates from tsdate and the historical
constraints rather than the frequency of the alleles to order mutations in tsinfer
.
Historical samples are added to the ancestors tree sequence as proxy nodes, in addition
to being used as samples.