Dating Tree Sequences¶
To illustrate the typical use case of
tsdate’s Python API, we will first create
sample data with msprime and infer a tree
sequence from this data using tsinfer.
We will then run
tsdate on the inferred tree sequence.
Let’s start by creating some sample data with
msprime using human-like parameters.
import msprime sample_ts = msprime.simulate(sample_size=10, Ne=10000, length=1e4, mutation_rate=1e-8, recombination_rate=1e-8, random_seed=2) print(sample_ts.num_trees, sample_ts.num_nodes)
The output of this code is:
We take this simulated tree sequence and turn it into a tsinfer SampleData object as documented here, and then infer a tree sequence from the data
import tsinfer sample_data = tsinfer.SampleData.from_tree_sequence(sample_ts) inferred_ts = tsinfer.infer(sample_data)
tsdate works best with simplified tree sequences (
tsinfer’s documentation provides)
details on how to simplify an inferred tree sequence. This should not be an issue
when working with tree sequences simulated using
Next, we run tsdate to estimate the ages of nodes and mutations in the inferred tree sequence:
import tsdate dated_ts = tsdate.date(inferred_ts, Ne=10000, mutation_rate=1e-8)
All we need to run
tsdate (with its default parameters) is the inferred tree sequence
object, the estimated effective population size, and estimated mutation rate. Here we
have provided a human mutation rate per base pair per generation, so the nodes dates in the
resulting tree sequence should be interpreted as generations.
Specifying a Prior¶
The above example shows the basic use of
tsdate, using default parameters. The
software has parameters the user can access through the
function which may affect the runtime and accuracy of the algorithm.
Inside Outside vs Maximization¶
One of the most important parameters to consider is whether
tsdate should use the
inside-outside or the maximization algorithms to perform inference. A detailed
description of the algorithms will be presented in our preprint, but from the users
perspective, the inside-outside approach performs better empirically but has issues with
numerical stability, while the maximization approach is slightly less accurate
empirically, but is numerically stable.
Command Line Interface Example¶
tsdate also offers a convenient command line interface (CLI) for
accessing the core functionality of the algorithm.
For a simple example of CLI, we’ll first save the inferred tree sequence we created in the section above as a file.
import tskit inferred_ts.dump("inferred_ts.trees")
Now we use the CLI to again date the inferred tree sequence and output the resulting
dated tree sequence to
$ tsdate date inferred_ts.trees dated_ts.trees 10000 1e-8 --progress
The first two arguments are the input and output tree sequence file names, the third is the estimated effective population size, and the fourth is the estimated mutation rate. We also add the
--progress option to keep track of
Inferring and Dating Tree Sequences with Historical (Ancient) Samples¶
tsinfer can be used together to infer tree sequences from both
modern and historical samples. The following recipe shows how this is accomplished
with a few lines of Python. The only requirement is a tsinfer.SampleData file with
modern and historical samples (the latter are specified using the
individuals_time array in a tsinfer.SampleData file).
import msprime import tsdate import tsinfer import tskit import numpy as np def make_historical_samples(): samples = [ msprime.Sample(population=0, time=0), msprime.Sample(0, 0), msprime.Sample(0, 0), msprime.Sample(0, 0), msprime.Sample(0, 1.0), msprime.Sample(0, 1.0) ] sim = msprime.simulate(samples=samples, mutation_rate=1, length=100) # Get the SampleData file from the simulated tree sequence # Retain the individuals times and ignore the sites times. samples = tsinfer.SampleData.from_tree_sequence( sim, use_sites_time=False, use_individuals_time=True) return samples def infer_historical_ts(samples, Ne=1, mutation_rate=1): """ Input is tsinfer.SampleData file with modern and historical samples. """ modern_samples = samples.subset(np.where(samples.individuals_time[:] == 0)) inferred_ts = tsinfer.infer(modern_samples) # Infer tree seq from modern samples # Removes unary nodes (currently required in tsdate), keeps historical-only sites inferred_ts = tsdate.preprocess_ts(inferred_ts, filter_sites=False) dated_ts = tsdate.date(inferred_ts, Ne=Ne, mutation_rate=mutation_rate) # Date tree seq sites_time = tsdate.sites_time_from_ts(dated_ts) # Get tsdate site age estimates dated_samples = tsdate.add_sampledata_times( samples, sites_time) # Get SampleData file with time estimates assigned to sites ancestors = tsinfer.generate_ancestors(dated_samples) ancestors_w_proxy = ancestors.insert_proxy_samples( dated_samples, allow_mutation=True) ancestors_ts = tsinfer.match_ancestors(dated_samples, ancestors_w_proxy) return tsinfer.match_samples( dated_samples, ancestors_ts, force_sample_times=True) samples = make_historical_samples() inferred_ts = infer_historical_ts(samples)
We simulate a tree sequence with six sample chromosomes, four modern and
two historical. We then infer and date a tree sequence using only the modern
samples. Next, we find derived alleles which are carried by the historical samples and
use the age of the historical samples to constrain the ages of these alleles. Finally,
we reinfer the tree sequence, using the date estimates from tsdate and the historical
constraints rather than the frequency of the alleles to order mutations in
Historical samples are added to the ancestors tree sequence as proxy nodes, in addition
to being used as samples.