Python API

This page provides documentation for the tsdate Python API.

Running tsdate

tsdate.date(tree_sequence, Ne, mutation_rate=None, recombination_rate=None, priors=None, *, progress=False, **kwargs)

Take a tree sequence with arbitrary node times and recalculate node times using the tsdate algorithm. If a mutation_rate is given, the mutation clock is used. The recombination clock is unsupported at this time. If neither a mutation_rate nor a recombination_rate is given, a topology-only clock is used.

Parameters:
  • tree_sequence (TreeSequence) – The input tskit.TreeSequence, treated as undated.
  • Ne (float) – The estimated (diploid) effective population size: must be specified.
  • mutation_rate (float) – The estimated mutation rate per unit of genome per generation. If provided, the dating algorithm will use a mutation rate clock to help estimate node dates. Default: None
  • recombination_rate (float) – The estimated recombination rate per unit of genome per generation. This is intended to drive a recombination rate clock to help estimate node dates, but the recombination clock is currently unsupported. Default: None
  • priors (NodeGridValues) – A NodeGridValues object containing the prior probabilities for each node at a set of discrete time points, such as the object returned by tsdate.build_prior_grid(). Default: None
  • eps (float) – Specify minimum distance separating time points. Also specifies the error factor in time difference calculations. Default: 1e-6
  • num_threads (int) – The number of threads to use. A simpler unthreaded algorithm is used unless this is >= 1. Default: None
  • method (string) – What estimation method to use: can be “inside_outside” (empirically better, theoretically problematic) or “maximization” (worse empirically, especially with gamma approximated priors, but theoretically robust). Default: “inside_outside”
  • probability_space (string) – Whether the internal algorithm stores probabilities in “logarithmic” space (slower, less liable to overflow) or “linear” space (faster, may overflow). Default: “logarithmic”
  • ignore_oldest_root (bool) – Whether to ignore the oldest root in the tree sequence in the outside algorithm (only relevant if “inside_outside” is used as the method). Ignoring the oldest root provides greater stability when dating tree sequences inferred from real data. Default: False
  • progress (bool) – Whether to display a progress bar. Default: False
Returns:

A tree sequence with inferred node times in units of generations.

Return type:

tskit.TreeSequence
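
The trade-off behind the probability_space option can be illustrated without tsdate itself. The following is a minimal sketch, not tsdate's implementation: it combines many small probabilities by direct multiplication (as a "linear" space calculation would) and by summing logarithms (as a "logarithmic" space calculation would). The function names and the example values are hypothetical.

```python
import math


def product_linear(probabilities):
    """Multiply probabilities directly (linear space)."""
    total = 1.0
    for p in probabilities:
        total *= p
    return total


def product_log(probabilities):
    """Sum log-probabilities (logarithmic space); returns log of the product."""
    return sum(math.log(p) for p in probabilities)


# Hypothetical values: 1000 independent events, each with probability 1e-5.
probabilities = [1e-5] * 1000

# The true product (1e-5000) is outside the range of a 64-bit float,
# so the linear calculation silently collapses to zero.
linear_result = product_linear(probabilities)

# The logarithm of the same product is an ordinary finite number.
log_result = product_log(probabilities)

print(linear_result, log_result)
```

The linear result loses all information, while the logarithmic result remains usable at the cost of extra log and exp operations, which is the speed versus stability trade-off described for probability_space.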

Specifying Prior and Time Discretisation Options

tsdate.build_prior_grid(tree_sequence, timepoints=20, *, approximate_priors=False, approx_prior_size=None, prior_distribution='lognorm', eps=1e-06, progress=False)

Using the conditional coalescent, calculate the prior distribution for the age of each node given the number of contemporaneous samples below it, and the discretised time slices at which to evaluate node age.

Parameters:
  • tree_sequence (TreeSequence) – The input tskit.TreeSequence, treated as undated
  • timepoints (int_or_array_like) – The number of quantiles used to create the time slices, or manually-specified time slices as a numpy array. Default: 20
  • approximate_priors (bool) – Whether to use a precalculated approximate prior or exactly calculate prior. If approximate prior has not been precalculated, tsdate will do so and cache the result. Default: False
  • approx_prior_size (int) – Number of samples from which to precalculate prior. Should only enter value if approximate_priors=True. If approximate_priors=True and no value specified, defaults to 1000. Default: None
  • prior_distribution (string) – What distribution to use to approximate the conditional coalescent prior. Can be “lognorm” for the lognormal distribution (generally a better fit, but slightly slower to calculate) or “gamma” for the gamma distribution (slightly faster, but a poorer fit for recent nodes). Default: “lognorm”
  • eps (float) – Specify minimum distance separating points in the time grid. Also specifies the error factor in time difference calculations. Default: 1e-6
Returns:

A prior object to pass to tsdate.date() containing prior values for inference and a discretised time grid

Return type:

base.NodeGridValues Object
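
To illustrate what the timepoints and eps parameters control, the sketch below builds a time grid from evenly spaced quantiles of a single lognormal distribution and drops any point that is closer than eps to its predecessor. This is a simplification under stated assumptions, not tsdate's implementation: the helper lognormal_time_grid, the choice of quantile levels, and the values of mu and sigma are all hypothetical, and tsdate derives its grid from the conditional coalescent prior rather than from one fixed distribution.

```python
import math
from statistics import NormalDist


def lognormal_time_grid(mu, sigma, timepoints=20, eps=1e-6):
    """Return a time grid: 0 followed by quantiles of a lognormal distribution.

    Hypothetical helper. The quantile levels k / timepoints
    (k = 1 .. timepoints - 1) are an assumption made for this sketch.
    """
    standard_normal = NormalDist()
    grid = [0.0]
    for k in range(1, timepoints):
        # Lognormal quantile: exp(mu + sigma * z), where z is the
        # corresponding quantile of the standard normal distribution.
        z = standard_normal.inv_cdf(k / timepoints)
        t = math.exp(mu + sigma * z)
        # Keep the point only if it is at least eps older than the last one.
        if t - grid[-1] >= eps:
            grid.append(t)
    return grid


grid = lognormal_time_grid(mu=0.0, sigma=1.0, timepoints=4)
print(grid)
```

Quantile spacing places more grid points where the prior has more probability mass, so recent times are resolved more finely than the long tail of old times.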

Preprocessing Tree Sequences

tsdate.preprocess_ts(tree_sequence, *, minimum_gap=1000000, remove_telomeres=True, **kwargs)

Function to prepare tree sequences for dating by removing gaps without sites and simplifying the tree sequence. Large regions without data can cause overflow/underflow errors in the inside-outside algorithm and poor performance more generally. Removed regions are recorded in the provenance of the resulting tree sequence.

Parameters:
  • tree_sequence (TreeSequence) – The input tskit.TreeSequence to be preprocessed.
  • minimum_gap (float) – The minimum length of a gap between sites for that gap to be removed from the tree sequence. Default: 1000000
  • remove_telomeres (bool) – Whether to remove all material before the first site and after the last site, regardless of its length. Default: True
  • **kwargs – All further keyword arguments are passed to the tskit.TreeSequence.simplify() method.
Returns:

A tree sequence with gaps removed.

Return type:

tskit.TreeSequence
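
The effect of minimum_gap and remove_telomeres can be sketched in plain Python by listing the regions that would qualify for removal given a set of site positions. This is an illustration only, not tsdate's implementation: the function gaps_to_remove is hypothetical, each region is reported as the pair of positions that bound it, and treating a gap of exactly minimum_gap as removable is an assumption.

```python
def gaps_to_remove(site_positions, sequence_length,
                   minimum_gap=1_000_000, remove_telomeres=True):
    """List (left, right) bounds of site-free regions that qualify for removal.

    Hypothetical helper: the exact coordinates deleted around each site
    are not modelled here.
    """
    positions = sorted(site_positions)
    if not positions:
        return []
    regions = []
    # Material before the first site is removed regardless of its length.
    if remove_telomeres and positions[0] > 0:
        regions.append((0, positions[0]))
    # Internal gaps are removed only if they are long enough.
    for left, right in zip(positions, positions[1:]):
        if right - left >= minimum_gap:
            regions.append((left, right))
    # Material after the last site is removed regardless of its length.
    if remove_telomeres and positions[-1] < sequence_length:
        regions.append((positions[-1], sequence_length))
    return regions


sites = [100, 200, 2_000_000, 2_000_500]
with_telomeres = gaps_to_remove(sites, sequence_length=3_000_000)
without_telomeres = gaps_to_remove(
    sites, sequence_length=3_000_000, remove_telomeres=False
)
print(with_telomeres)
print(without_telomeres)
```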

Functions for Inferring Tree Sequences with Historical Samples

tsdate.sites_time_from_ts(tree_sequence, *, unconstrained=True, node_selection='child', min_time=1)

Returns an estimated “time” for each site. This is the estimated age of the oldest MRCA which possesses a derived variant at that site, and is useful for performing (re)inference of a tree sequence. It is calculated from the ages of nodes, with the appropriate nodes identified by the position of mutations in the trees.

If node times in the tree sequence have been estimated by tsdate using the inside-outside algorithm, then as well as a time in the tree sequence, nodes will store additional time estimates that have not been explicitly constrained by the tree topology. By default, this function tries to use these “unconstrained” times, although this is likely to fail (with a warning) on tree sequences that have not been processed by tsdate: in this case the standard node times can be used by setting unconstrained=False.

The concept of a site time is meaningless for non-variable sites, and so the returned time for these sites is np.nan (note that this is not exactly the same as tskit.UNKNOWN_TIME, which marks sites that could have a meaningful time but whose time estimate is unknown).

Parameters:
  • tree_sequence (TreeSequence) – The input tskit.TreeSequence.
  • unconstrained (bool) – Use estimated node times which have not been constrained by tree topology. If True (default), this requires a tree sequence which has been dated using the tsdate inside-outside algorithm. If this is not the case, specify False to use the standard tree sequence node times.
  • node_selection (str) –

    Defines how site times are calculated from the age of the upper and lower nodes that bound each mutation at the site. Options are “child”, “parent”, “arithmetic” or “geometric”, with the following meanings

    • 'child' (default): the site time is the age of the oldest node below each mutation at the site
    • 'parent': the site time is the age of the oldest node above each mutation at the site
    • 'arithmetic': the arithmetic mean of the ages of the node above and the node below each mutation is calculated; the site time is the oldest of these means.
    • 'geometric': the geometric mean of the ages of the node above and the node below each mutation is calculated; the site time is the oldest of these means
  • min_time (float) – A site time of zero implies that no MRCA in the past possessed the derived variant, so the variant cannot be used for inferring relationships between the samples. To allow all variants to be potentially available for inference, if a site time would otherwise be calculated as zero (for example, where the node_selection parameter is “child” or “geometric” and all mutations at a site are associated with leaf nodes), a minimum site time greater than 0 is recommended. By default this is set to 1, which is generally reasonable for times measured in generations or years, although it is also fine to set this to a small epsilon value.
Returns:

Array of length tree_sequence.num_sites with estimated time of each site

Return type:

numpy.ndarray
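
The node_selection options and the min_time floor can be sketched in plain Python. This is an illustrative simplification, not tsdate's implementation: the function site_time and its input format (one (child_age, parent_age) pair per mutation at the site) are hypothetical, and applying min_time as a simple lower bound is an assumption.

```python
import math


def site_time(mutation_bounds, node_selection="child", min_time=1):
    """Estimate a site time from the ages of the nodes bounding each mutation.

    mutation_bounds: list of (child_age, parent_age) pairs, one per mutation
    at the site (hypothetical input format). An empty list represents a
    non-variable site.
    """
    if not mutation_bounds:
        # Site times are meaningless for non-variable sites.
        return math.nan
    if node_selection == "child":
        ages = [child for child, parent in mutation_bounds]
    elif node_selection == "parent":
        ages = [parent for child, parent in mutation_bounds]
    elif node_selection == "arithmetic":
        ages = [(child + parent) / 2 for child, parent in mutation_bounds]
    elif node_selection == "geometric":
        ages = [math.sqrt(child * parent) for child, parent in mutation_bounds]
    else:
        raise ValueError(f"Unknown node_selection: {node_selection!r}")
    # The site time is the oldest candidate age, raised to min_time if needed.
    return max(max(ages), min_time)


# Two mutations at one site: one above a leaf (age 0), one above an older node.
bounds = [(0.0, 8.0), (2.0, 18.0)]
times = {
    option: site_time(bounds, node_selection=option)
    for option in ("child", "parent", "arithmetic", "geometric")
}
print(times)

# A single mutation above a leaf: "child" would give 0, so min_time applies.
leaf_only = site_time([(0.0, 5.0)], node_selection="child")
print(leaf_only)
```

Because NaN never compares equal to itself, non-variable sites in the returned array should be detected with math.isnan (or numpy.isnan) rather than with an equality test.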

tsdate.add_sampledata_times(samples, sites_time)

Return a tsinfer.SampleData file with estimated times associated with sites. Ensures that each site’s time is at least as old as the oldest historic sample carrying a derived allele at that site.

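Parameters:
  • samples (tsinfer.formats.SampleData) – A tsinfer SampleData object to add site times to. Any historic individuals in this SampleData file are used to constrain site times.
  • sites_time – The estimated time for each site in samples, such as the array returned by tsdate.sites_time_from_ts().
Returns:

A tsinfer.SampleData file

Return type:

tsinfer.SampleData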
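
The constraint applied by this function (each site is at least as old as the oldest historic sample carrying its derived allele) can be sketched as follows. This is a minimal illustration, not tsdate's implementation: the function constrain_site_times and its inputs are hypothetical, and plain lists stand in for the SampleData file.

```python
def constrain_site_times(sites_time, historic_carrier_ages):
    """Raise each site time to the age of its oldest historic carrier.

    sites_time: estimated time for each site.
    historic_carrier_ages: for each site, the ages of the historic samples
    that carry a derived allele there (empty if there are none).
    Both arguments are hypothetical stand-ins for data held in SampleData.
    """
    constrained = []
    for time, carrier_ages in zip(sites_time, historic_carrier_ages):
        oldest_carrier = max(carrier_ages, default=0.0)
        constrained.append(max(time, oldest_carrier))
    return constrained


sites_time = [5.0, 100.0, 1.0]
carrier_ages = [[50.0, 20.0], [30.0], []]
constrained = constrain_site_times(sites_time, carrier_ages)
print(constrained)
```

Only the first site changes in this example: its estimated time (5.0) is younger than a historic sample that already carries the derived allele (50.0), which would otherwise be a contradiction.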