Five-leaf generalizations of the D-statistic reveal the directionality of admixture
Data files
Oct 24, 2024 version files 11.35 GB
Abstract
Over the past 15 years, the D-statistic, a four-taxon test for organismal admixture (hybridization, or introgression) which incorporates single nucleotide polymorphism data with allelic patterns ABBA and BABA, has seen considerable use. This statistic seeks to discern significant deviation from either a given species tree assumption, or from the balanced incomplete lineage sorting that could otherwise defy this species tree. However, while the D-statistic can successfully discriminate admixture from incomplete lineage sorting, it is not a simple matter to determine the directionality of admixture using only four-leaf tree models. As such, methods have been developed that use 5 leaves to evaluate admixture. Among these, the DFOIL method, which tests allelic patterns on the “symmetric” tree S = (((1,2),(3,4)),5), succeeds in finding admixture direction for many five-taxon examples. However, DFOIL does not make full use of all symmetry, nor can DFOIL function properly when ancient samples are included because of the reliance on singleton patterns (such as BAAAA and ABAAA). Here, we take inspiration from DFOIL to develop a new and completely general family of five-leaf admixture tests, dubbed Δ-statistics, that can either incorporate or exclude the singleton allelic patterns depending on individual taxon and age sampling choices. We describe two new shapes that are also fully testable, namely the “asymmetric” tree A = ((((1,2),3),4),5) and the “quasisymmetric” tree Q = (((1,2),3),(4,5)), which can considerably supplement the “symmetric” S = (((1,2),(3,4)),5) model used by DFOIL. We demonstrate the consistency of Δ-statistics under various simulated scenarios, and provide empirical examples using data from black, brown and polar bears, the latter also including two ancient polar bear samples from previous studies. Recently, DFOIL and one of these ancient samples were used to argue for a dominant polar bear → brown bear introgression direction.
However, we find, using both this ancient polar bear and our own, that by far the strongest signal using both DFOIL and Δ-statistics on tree S is actually bidirectional gene flow of indistinguishable direction. Further experiments on trees A and Q instead highlight what were likely two phases of admixture: one with stronger brown bear → polar bear introgression in ancient times, and a more recent phase with predominant polar bear → brown bear directionality. Code and documentation available at https://github.com/KalleLeppala/Delta-statistics.
https://doi.org/10.5061/dryad.xksn02vr9
Description of the data and file structure
We conducted Δ-statistics analyses using bear SNP data from our previous work (Lan et al., 2022), adding to it the data for another ancient polar bear sample (Wang et al., 2022), to directly reassess the previously published hypotheses on gene flow directionality.
This submission comprises the following data files used in our empirical example:
* PNAS_Bruno_PB_PB_SNPonly_VF_mac2_mindp5_maxdp500_chrenamedto1.vcf.gz
This file is a Variant Call Format (VCF) file comprising the SNP data set used in the empirical analyses described in the paper (Principal Component Analysis using smartPCA and a drift maximum likelihood tree using TreeMix). Data processing, variant calling, and filtering followed the methodology outlined previously (Lan et al., 2022).
* Empirical_bears_example.traw
The VCF file above was converted into a variant-major additive component (“.traw”) file using PLINK 2.0 (--recode A-transpose). This .traw file was used as input in the empirical Δ-statistics tests described in the paper.
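For readers who want to inspect the .traw file before running the Δ-statistics code, the following is a minimal Python sketch of a reader. It is not part of the published pipeline. It assumes the usual PLINK layout as we understand it: a tab-delimited file with six leading columns (CHR, SNP, (C)M, POS, COUNTED, ALT) followed by one column per sample holding the counted-allele dosage (0, 1, 2) or NA for missing calls. The function name `read_traw` and the sample names in the toy input are hypothetical.

```python
import csv
import io

# Leading columns assumed for a PLINK .traw file; all later columns are samples.
TRAW_FIXED = ["CHR", "SNP", "(C)M", "POS", "COUNTED", "ALT"]


def read_traw(handle):
    """Yield (chrom, pos, counted_allele, dosages) for each variant.

    dosages maps each sample column name to 0, 1, 2, or None (for 'NA').
    Assumes hard-call genotypes, so every non-missing entry is an integer.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if header[:6] != TRAW_FIXED:
        raise ValueError("unexpected .traw header: %r" % header[:6])
    samples = header[6:]
    for row in reader:
        if not row:  # skip blank lines
            continue
        values = [None if v == "NA" else int(v) for v in row[6:]]
        yield row[0], int(row[3]), row[4], dict(zip(samples, values))


# Toy two-variant input with made-up sample names, for illustration only.
toy = (
    "CHR\tSNP\t(C)M\tPOS\tCOUNTED\tALT\tbearA_bearA\tbearB_bearB\n"
    "1\t.\t0\t1042\tA\tG\t0\t2\n"
    "1\t.\t0\t2077\tC\tT\tNA\t1\n"
)
records = list(read_traw(io.StringIO(toy)))
```

To read the deposited file instead of the toy input, pass an open file handle, for example `open("Empirical_bears_example.traw", newline="")`. Because the function is a generator, the file is read one variant at a time rather than loaded into memory.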
Sharing/Access information
Other publicly accessible locations of the data.
The genome data used in this study were acquired from two published studies:
1. Lan, T., Leppälä, K., Tomlin, C., Talbot, S. L., Sage, G. K., Farley, S. D., … others (2022). Insights into bear evolution from a Pleistocene polar bear genome. Proceedings of the National Academy of Sciences, 119(24), e2200016119.
https://doi.org/10.1073/pnas.2200016119
Raw reads generated for this study are deposited in the National Center for Biotechnology Information (NCBI), BioProject: PRJNA804505.
2. Wang, M.-S., Murray, G. G., Mann, D., Groves, P., Vershinina, A. O., Supple, M. A., … others (2022). A polar bear paleogenome reveals extensive ancient gene flow from polar bears into brown bears. Nature Ecology & Evolution, 6(7), 936–944.
https://doi.org/10.1038/s41559-022-01753-8
Raw reads generated for this study are deposited in the National Center for Biotechnology Information (NCBI), BioProject: PRJNA720153 and Sequence Read Archive: SRS8777210.
Code/Software
Code and documentation to run Δ-statistics analyses are available at: