Python API

This page provides documentation for the tsdate Python API.

Running tsdate

tsdate.date(tree_sequence, mutation_rate, Ne=None, recombination_rate=None, time_units=None, priors=None, *, return_posteriors=None, progress=False, **kwargs)

Take a tree sequence (which could have uncalibrated node times) and assign new times to non-sample nodes using the tsdate algorithm. If a mutation_rate is given, the mutation clock is used. The recombination clock is unsupported at this time. If neither a mutation_rate nor a recombination_rate is given, a topology-only clock is used. Times associated with mutations and non-sample nodes in the input tree sequence are not used in inference and will be removed.

Parameters:
  • tree_sequence (TreeSequence) – The input tskit.TreeSequence, treated as one whose non-sample nodes are undated.
  • Ne (float) – The estimated (diploid) effective population size used to construct the (default) conditional coalescent prior. This is what is used when priors is None: a positive Ne value is therefore required in this case. Conversely, if priors is not None, no Ne value should be given.
  • mutation_rate (float) – The estimated mutation rate per unit of genome per unit time. If provided, the dating algorithm will use a mutation rate clock to help estimate node dates. Default: None
  • recombination_rate (float) – The estimated recombination rate per unit of genome per unit time. If provided, the dating algorithm will use a recombination rate clock to help estimate node dates. Default: None
  • time_units (str) – The time units used by the mutation_rate and recombination_rate values, and stored in the time_units attribute of the output tree sequence. If the conditional coalescent prior is used, then this also applies to the value of Ne, which in standard coalescent theory is measured in generations. Therefore, if you wish to use mutation and recombination rates measured in (say) years, and are using the conditional coalescent prior, the Ne value which you provide must be scaled by multiplying by the number of years per generation. If None (default), assume "generations".
  • priors (NodeGridValues) – NodeGridValues object containing the prior probabilities for each node at a set of discrete time points. If None (default), use the conditional coalescent prior with a standard set of time points as given by build_prior_grid().
  • return_posteriors (bool) – If True, instead of returning just a dated tree sequence, return a tuple of (dated_ts, posteriors). Note that the dictionary returned in posteriors (described below) is suitable for reading as a pandas DataFrame object, using pd.DataFrame(posteriors).
  • eps (float) – Specify minimum distance separating time points. Also specifies the error factor in time difference calculations. Default: 1e-6
  • num_threads (int) – The number of threads to use. A simpler unthreaded algorithm is used unless this is >= 1. Default: None
  • method (string) – What estimation method to use: can be “inside_outside” (empirically better, theoretically problematic) or “maximization” (worse empirically, especially with gamma approximated priors, but theoretically robust). If None (default) use “inside_outside”
  • probability_space (string) – Whether the internal algorithm should save probabilities in “logarithmic” space (slower, less liable to overflow) or “linear” space (faster, may overflow). Default: “logarithmic”
  • ignore_oldest_root (bool) – Whether the oldest root in the tree sequence should be ignored in the outside algorithm (if “inside_outside” is used as the method). Ignoring the oldest root provides greater stability when dating tree sequences inferred from real data. Default: False
  • progress (bool) – Whether to display a progress bar. Default: False
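
The unit conversion described under time_units can be sketched with plain arithmetic. The numbers below (an Ne of 10,000, a per-generation mutation rate of 1.25e-8 and a generation time of 25 years) are hypothetical values chosen only for illustration. The keyword names collected in date_kwargs are the ones listed in the signature above, and the dictionary is built here without calling tsdate.

```python
import math

# Hypothetical input values, for illustration only.
ne_generations = 10_000        # diploid Ne on the usual generation scale
mu_per_generation = 1.25e-8    # mutation rate per unit of genome per generation
years_per_generation = 25      # assumed generation time

# Express the rate per year, and scale Ne by the same factor so that
# the conditional coalescent prior is also measured in years.
mu_per_year = mu_per_generation / years_per_generation
ne_years = ne_generations * years_per_generation

# The product Ne * mutation_rate is unchanged by the change of units.
assert math.isclose(ne_generations * mu_per_generation, ne_years * mu_per_year)

# Keyword arguments that could then be passed on, e.g. tsdate.date(ts, **date_kwargs).
date_kwargs = {"mutation_rate": mu_per_year, "Ne": ne_years, "time_units": "years"}
```

Because Ne is multiplied and the rate is divided by the same factor, only the time scale of the result changes.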
Returns:

A copy of the input tree sequence but with altered node times, or (if return_posteriors is True) a tuple of that tree sequence plus a dictionary of posterior probabilities from the “inside_outside” estimation method. Each node whose time was inferred corresponds to an item in this dictionary, with the key being the node ID and the value a 1D array of probabilities of the node being in a given time slice (or None if the “inside_outside” method was not used). The start and end times of each time slice are given as 1D arrays in the dictionary, under keys named "start_time" and "end_time".

Return type:

tskit.TreeSequence or (tskit.TreeSequence, dict)
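
As an illustration of the posteriors layout described above, the following sketch looks up the most probable time slice for a node. The dictionary is a hand-written mock with made-up node IDs and probabilities, not real tsdate output, and most_probable_slice is a hypothetical helper that is not part of the tsdate API.

```python
# Mock of the documented layout: node IDs map to per-slice probabilities,
# and "start_time" / "end_time" give the bounds of each slice.
posteriors = {
    "start_time": [0.0, 10.0, 100.0],
    "end_time": [10.0, 100.0, 1000.0],
    5: [0.2, 0.5, 0.3],
    6: [0.0, 0.1, 0.9],
    7: None,  # documented value when "inside_outside" was not used
}


def most_probable_slice(posteriors, node_id):
    """Return (start, end) of the slice with the highest probability, or None."""
    probs = posteriors[node_id]
    if probs is None:
        return None
    best = max(range(len(probs)), key=lambda i: probs[i])
    return posteriors["start_time"][best], posteriors["end_time"][best]


print(most_probable_slice(posteriors, 5))
```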

Specifying Prior and Time Discretisation Options

tsdate.build_prior_grid(tree_sequence, Ne, timepoints=20, *, approximate_priors=False, approx_prior_size=None, prior_distribution='lognorm', eps=1e-06, progress=False)

Using the conditional coalescent, calculate the prior distribution for the age of each node, given the number of contemporaneous samples below it, and the discretised time slices at which to evaluate node age.

Parameters:
  • tree_sequence (TreeSequence) – The input tskit.TreeSequence, treated as undated.
  • Ne (float) – The estimated (diploid) effective population size: must be specified. Using standard (unscaled) values for Ne results in a prior where times are measured in generations.
  • timepoints (int_or_array_like) – The number of quantiles used to create the time slices, or manually-specified time slices as a numpy array. Default: 20
  • approximate_priors (bool) – Whether to use a precalculated approximate prior or to calculate the prior exactly. If the approximate prior has not been precalculated, tsdate will do so and cache the result. Default: False
  • approx_prior_size (int) – Number of samples from which to precalculate the prior. Only specify a value if approximate_priors=True. If approximate_priors=True and no value is specified, defaults to 1000. Default: None
  • prior_distribution (string) – What distribution to use to approximate the conditional coalescent prior. Can be “lognorm” for the lognormal distribution (generally a better fit, but slightly slower to calculate) or “gamma” for the gamma distribution (slightly faster, but a poorer fit for recent nodes). Default: “lognorm”
  • eps (float) – Specify minimum distance separating points in the time grid. Also specifies the error factor in time difference calculations. Default: 1e-6
Returns:

A prior object to pass to tsdate.date() containing prior values for inference and a discretised time grid

Return type:

base.NodeGridValues Object
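
To give a rough picture of what choosing time points by quantile means, the sketch below places time points at evenly spaced quantiles of a single lognormal distribution. This is only an assumption-laden illustration: the parameters mu and sigma are hypothetical, lognormal_quantile_grid is not a tsdate function, and the way tsdate combines the priors of many nodes into one grid is not reproduced here.

```python
import math
from statistics import NormalDist


def lognormal_quantile_grid(mu, sigma, num_quantiles):
    """Time points at the interior quantiles i/num_quantiles of a lognormal.

    mu and sigma are the mean and standard deviation of log(time).
    """
    log_time = NormalDist(mu, sigma)
    quantiles = [i / num_quantiles for i in range(1, num_quantiles)]
    # A lognormal quantile is the exponential of the matching normal quantile.
    return [math.exp(log_time.inv_cdf(q)) for q in quantiles]


grid = lognormal_quantile_grid(mu=0.0, sigma=1.0, num_quantiles=4)
print(grid)
```

Time points chosen this way are dense where the prior puts most of its mass and sparse in the tails.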

Preprocessing Tree Sequences

tsdate.preprocess_ts(tree_sequence, *, minimum_gap=1000000, remove_telomeres=True, **kwargs)

Function to prepare tree sequences for dating by removing gaps without sites and simplifying the tree sequence. Large regions without data can cause overflow/underflow errors in the inside-outside algorithm and poor performance more generally. Removed regions are recorded in the provenance of the resulting tree sequence.

Parameters:
  • tree_sequence (TreeSequence) – The input tskit.TreeSequence to be preprocessed.
  • minimum_gap (float) – The minimum gap between sites to remove from the tree sequence. Default: 1000000
  • remove_telomeres (bool) – Whether all material before the first site and after the last site should be removed, regardless of its length. Default: True
  • **kwargs – All further keyword arguments are passed to the tskit.simplify command.
Returns:

A tree sequence with gaps removed.

Return type:

tskit.TreeSequence
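
The idea of removing gaps without sites can be sketched on a plain list of site positions. The helper below is hypothetical and is not the tsdate implementation: it assumes integer positions, half-open (left, right) intervals that never include a site, and a strict "greater than minimum_gap" comparison.

```python
def gaps_to_remove(site_positions, sequence_length,
                   minimum_gap=1_000_000, remove_telomeres=True):
    """Return half-open (left, right) intervals that contain no sites."""
    positions = sorted(site_positions)
    if not positions:
        return []
    intervals = []
    # Material before the first site.
    if remove_telomeres and positions[0] > 0:
        intervals.append((0, positions[0]))
    # Large gaps between adjacent sites.
    for left, right in zip(positions, positions[1:]):
        if right - left > minimum_gap:
            intervals.append((left + 1, right))
    # Material after the last site.
    if remove_telomeres and positions[-1] + 1 < sequence_length:
        intervals.append((positions[-1] + 1, sequence_length))
    return intervals


print(gaps_to_remove([100, 200, 2_000_000, 2_000_500], 3_000_000))
```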

Functions for Inferring Tree Sequences with Historical Samples

tsdate.sites_time_from_ts(tree_sequence, *, unconstrained=True, node_selection='child', min_time=1)

Returns an estimated “time” for each site. This is the estimated age of the oldest MRCA which possesses a derived variant at that site, and is useful for performing (re)inference of a tree sequence. It is calculated from the ages of nodes, with the appropriate nodes identified by the position of mutations in the trees.

If node times in the tree sequence have been estimated by tsdate using the inside-outside algorithm, then as well as a time in the tree sequence, nodes will store additional time estimates that have not been explicitly constrained by the tree topology. By default, this function tries to use these “unconstrained” times, although this is likely to fail (with a warning) on tree sequences that have not been processed by tsdate: in this case the standard node times can be used by setting unconstrained=False.

The concept of a site time is meaningless for non-variable sites, and so the returned time for these sites is np.nan (note that this is not exactly the same as tskit.UNKNOWN_TIME, which marks sites that could have a meaningful time but whose time estimate is unknown).

Parameters:
  • tree_sequence (TreeSequence) – The input tskit.TreeSequence.
  • unconstrained (bool) – Use estimated node times which have not been constrained by tree topology. If True (default), this requires a tree sequence which has been dated using the tsdate inside-outside algorithm. If this is not the case, specify False to use the standard tree sequence node times.
  • node_selection (str) –

    Defines how site times are calculated from the age of the upper and lower nodes that bound each mutation at the site. Options are “child”, “parent”, “arithmetic” or “geometric”, with the following meanings:

    • 'child' (default): the site time is the age of the oldest node below each mutation at the site
    • 'parent': the site time is the age of the oldest node above each mutation at the site
    • 'arithmetic': the arithmetic mean of the ages of the node above and the node below each mutation is calculated; the site time is the oldest of these means
    • 'geometric': the geometric mean of the ages of the node above and the node below each mutation is calculated; the site time is the oldest of these means
  • min_time (float) – A site time of zero implies that no MRCA in the past possessed the derived variant, so the variant cannot be used for inferring relationships between the samples. To allow all variants to be potentially available for inference, if a site time would otherwise be calculated as zero (for example, where the node_selection parameter is “child” or “geometric” and all mutations at a site are associated with leaf nodes), a minimum site time greater than 0 is recommended. By default this is set to 1, which is generally reasonable for times measured in generations or years, although it is also fine to set this to a small epsilon value.
Returns:

Array of length tree_sequence.num_sites with estimated time of each site

Return type:

numpy.ndarray
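
The four node_selection options can be illustrated on plain numbers. In the sketch below each mutation at a site is represented by a (child_age, parent_age) pair. The site_time helper is hypothetical and is not the tsdate implementation; in particular it assumes that min_time acts as a simple floor on the result, and that a site with no mutations gets NaN.

```python
import math


def site_time(mutation_bounds, node_selection="child", min_time=1):
    """Site time from (child_age, parent_age) pairs, one per mutation."""
    if not mutation_bounds:
        return math.nan  # non-variable site: no meaningful time
    combine = {
        "child": lambda child, parent: child,
        "parent": lambda child, parent: parent,
        "arithmetic": lambda child, parent: (child + parent) / 2,
        "geometric": lambda child, parent: math.sqrt(child * parent),
    }[node_selection]
    # The site time is the oldest value over all mutations at the site.
    oldest = max(combine(child, parent) for child, parent in mutation_bounds)
    return max(oldest, min_time)


# Two mutations: one above a leaf (age 0) and one above an internal node.
bounds = [(0.0, 40.0), (10.0, 90.0)]
for option in ("child", "parent", "arithmetic", "geometric"):
    print(option, site_time(bounds, option))
```

With a single mutation above a leaf, the “child” and “geometric” options would give zero, which is the case that the min_time floor is meant to handle.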

tsdate.add_sampledata_times(samples, sites_time)

Return a tsinfer.SampleData file with estimated times associated with sites. Ensures that each site’s time is at least as old as the oldest historic sample carrying a derived allele at that site.

Parameters:
  • samples (tsinfer.formats.SampleData) – A tsinfer SampleData object to add site times to. Any historic individuals in this SampleData file are used to constrain site times.
Returns:

A tsinfer.SampleData file

Return type:

tsinfer.SampleData
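
The constraint described above (a site cannot be younger than the oldest historic sample that carries its derived allele) can be sketched with plain lists. The constrain_sites_time helper and its genotype layout are hypothetical stand-ins for illustration; they do not use tsinfer and are not how tsdate itself is implemented.

```python
def constrain_sites_time(sites_time, sample_times, genotypes):
    """Raise each site time to the age of its oldest derived-allele carrier.

    genotypes[j][i] is 0 for the ancestral allele and > 0 for a derived
    allele, for sample i at site j.
    """
    constrained = []
    for time, row in zip(sites_time, genotypes):
        oldest_carrier = max(
            (sample_times[i] for i, allele in enumerate(row) if allele > 0),
            default=0.0,
        )
        constrained.append(max(time, oldest_carrier))
    return constrained


sample_times = [0.0, 0.0, 20.0]   # the third sample is historic
genotypes = [[0, 1, 1],           # site 0: derived allele in the historic sample
             [1, 0, 0]]           # site 1: derived allele only in a modern sample
print(constrain_sites_time([5.0, 50.0], sample_times, genotypes))
```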