Data from: Synergizing Bayesian and heuristic approaches: D-BPP uncovers ghost introgression in Panthera and Thuja
Data files
Mar 03, 2026 version files (17.97 MB)
- dataset.zip (17.96 MB)
- README.md (5.26 KB)
Abstract
Hybridization involving extinct or unsampled (“ghost”) lineages profoundly influences species’ evolutionary histories, but detecting such introgression remains methodologically challenging. We introduce D-BPP, a framework that integrates the heuristic D-statistic (or ABBA-BABA test) with Bayesian phylogenomic inference (implemented in BPP) to efficiently infer phylogenetic networks. In D-BPP, we first employ the D-statistic to rapidly identify candidate introgression events on a predefined bifurcating species tree; then we leverage the Bayesian test in BPP to rigorously validate these candidates and sequentially add them to the species tree, retaining only those events with strong statistical support. When the species tree is ambiguous, D-BPP identifies the most probable topology by comparing introgression models in a Bayesian framework. Through dedicated simulation analyses, we show that the D-BPP workflow has high power: the D-statistic reliably detects the presence of introgression, BPP accurately discriminates among alternative introgression scenarios, and the key procedural steps of the pipeline are empirically well-justified. Critically, our framework excels at detecting ghost introgression, which is often unidentifiable or overlooked by existing methods, whether heuristic or full-likelihood. Applied to genomic datasets from Panthera (big cats) and Thuja (conifers), D-BPP uncovered previously undetected ghost introgression events in both clades, underscoring the pervasive role ghost lineages have played across diverse taxa. By combining the computational efficiency of heuristic D-statistics with the robust statistical rigor of full-likelihood Bayesian inference, D-BPP provides a practical and powerful approach for reconstructing complex reticulate evolutionary histories.
Dataset DOI: 10.5061/dryad.47d7wm3sr
Description
This dataset contains the input files and simulation data necessary to replicate the D-BPP phylogenetic network analyses presented in "Synergizing Bayesian and heuristic approaches: D-BPP uncovers ghost introgression in Panthera and Thuja".
The D-BPP framework is designed to detect and quantify ghost introgression—gene flow from unsampled or extinct lineages—by integrating heuristic screening with rigorous Bayesian model testing. The dataset is organized by empirical study systems (Panthera and Thuja) and includes simulated data used for method validation.
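The heuristic screening step relies on the D-statistic, which contrasts the counts of the two discordant site patterns, ABBA and BABA, for a trio of ingroup taxa plus an outgroup. The following is a minimal sketch of that basic count-based form only; the function name and the example counts are hypothetical, and tools such as Dsuite work from allele frequencies in VCF files and additionally assess significance (for example by block jackknife), which is not shown here.

```python
def d_statistic(n_abba: int, n_baba: int) -> float:
    """Count-based D-statistic: (ABBA - BABA) / (ABBA + BABA)."""
    total = n_abba + n_baba
    if total == 0:
        raise ValueError("no informative ABBA/BABA sites")
    return (n_abba - n_baba) / total


# Hypothetical site-pattern counts for a trio ((P1, P2), P3) with an outgroup.
print(d_statistic(120, 80))  # 0.2
```

A value near zero is what incomplete lineage sorting alone would be expected to produce, while a clear excess of one pattern flags a candidate introgression event for the subsequent Bayesian test in BPP.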
Data Files and Directory Structure
The data are contained within the dataset.zip archive. Upon extraction, the following directory structure is created:
dataset.zip
├── Panthera/
│   ├── data.even.nocat.txt
│   ├── data.odd.nocat.txt
│   └── simulation/
│       ├── concatenatedfile.txt
│       └── MySeq-2500.txt
│
└── Thuja/
    └── Thuja.bpp
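After extracting the archive, a quick check that the expected files are present can be sketched as follows. The relative paths are taken from the directory tree above; the helper name `missing_files` is hypothetical, and whether the archive extracts into an extra top-level folder is an assumption you may need to adjust for by pointing `root` at the folder that contains Panthera/ and Thuja/.

```python
from pathlib import Path

# Relative paths taken from the directory tree in this README.
EXPECTED_FILES = [
    "Panthera/data.even.nocat.txt",
    "Panthera/data.odd.nocat.txt",
    "Panthera/simulation/concatenatedfile.txt",
    "Panthera/simulation/MySeq-2500.txt",
    "Thuja/Thuja.bpp",
]


def missing_files(root: str) -> list[str]:
    """Return the expected files that are absent under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]


if __name__ == "__main__":
    import sys

    # Directory to check: first command-line argument, else the current directory.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for rel in missing_files(root):
        print("missing:", rel)
```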
File Details
| File Name | Analysis Context | Description |
|---|---|---|
| Panthera/data.even.nocat.txt | Panthera Empirical | Input file for BPP analysis: even partition subset. |
| Panthera/data.odd.nocat.txt | Panthera Empirical | Input file for BPP analysis: odd partition subset. |
| Thuja/Thuja.bpp | Thuja Empirical | Input file for BPP analysis. |
| Panthera/simulation/concatenatedfile.txt | Panthera Simulation | Simulated dataset representing a specific introgression scenario. This is a concatenated alignment file. |
| Panthera/simulation/MySeq-2500.txt | Panthera Simulation | Simulated dataset, containing 2500 independent loci, used for BPP analysis. |
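As a sanity check on the multi-locus simulated file, the number of loci can be counted. The sketch below assumes the file follows the PHYLIP-like layout commonly used for BPP sequence input, in which each locus begins with a header line holding two integers (number of sequences, number of sites). That layout has not been verified against the files in this archive, and the sequence names in the toy example are hypothetical.

```python
import re

# Assumption: a locus header is a line holding exactly two integers,
# the number of sequences followed by the number of sites.
HEADER = re.compile(r"^\s*(\d+)\s+(\d+)\s*$")


def count_loci(lines) -> int:
    """Count locus header lines in an iterable of text lines."""
    return sum(1 for line in lines if HEADER.match(line))


# Toy two-locus example with hypothetical sequence names.
example = """\
3 8
a^A  ACGTACGT
b^B  ACGTACGA
c^C  ACGTTCGA

3 6
a^A  ACGTAC
b^B  ACGAAC
c^C  ACGTAC
""".splitlines()

print(count_loci(example))  # 2
```

Passing an open file handle for Panthera/simulation/MySeq-2500.txt to `count_loci` would be expected to return 2500 if the layout assumption holds.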
Access and Source Data
The empirical datasets used in this study were derived from previously published sources, as cited below. The files provided here are reformatted specifically for our analysis.
Thuja Dataset
The raw sequence data and initial phylogenetic inferences were obtained from:
Li J, Zhang Y, Ruhsam M, Milne RI, Wang Y, Wu D, Jia S, Tao T, Mao K. 2022. Seeing through the hedge: Phylogenomics of Thuja (Cupressaceae) reveals prominent incomplete lineage sorting and ancient introgression for Tertiary relict flora. Cladistics 38:187–203.
Panthera Dataset
The raw genomic data and context for inter-species introgression were obtained from:
Santos SHD, Figueiró HV, Flouri T, Ramalho E, Cullen L, Jr., Yang Z, Murphy WJ, Eizirik E. 2025. Massive inter-species introgression overwhelms phylogenomic relationships among jaguar, lion, and leopard. Systematic Biology 74:583–599.
Software and Code
To run these files and replicate the D-BPP analyses, you will need:
1. Software
a. Dsuite
- Reference: Malinsky, M., Matschiner, M. and Svardal, H. (2021) Dsuite ‐ fast D‐statistics and related admixture evidence from VCF files. Molecular Ecology Resources 21, 584–595. doi: https://doi.org/10.1111/1755-0998.13265.
- Download: https://github.com/millanek/Dsuite/tree/master
b. BPP
- Reference: Flouri, T., Jiao, X., Rannala, B., & Yang, Z. (2018). A Bayesian implementation of the multispecies coalescent model with introgression for phylogenomic analysis. Molecular Biology and Evolution, 35(7), 1811–1823.
- Download: https://github.com/bpp/bpp
c. snp-sites
- Reference: Page AJ, Taylor B, Delaney AJ, Soares J, Seemann T, Keane JA, Harris SR. 2016. SNP-sites: Rapid efficient extraction of SNPs from multi-fasta alignments. Microb Genom 2:e000056.
- Download: http://sanger-pathogens.github.io/snp-sites/
d. Newick Utilities
- Download: https://github.com/tjunier/newick_utils
2. Analysis Pipeline Scripts
The GitHub repository below provides a detailed, step-by-step guide for running the D-BPP pipeline, from raw data preparation to the final inference of phylogenetic networks and ghost introgression.
- GitHub Repository: https://github.com/yangyang9608/D-BPP_Workflow
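Before running the pipeline, it can help to confirm that the required programs are reachable from the command line. The sketch below is only a convenience check and is not part of the D-BPP workflow scripts. The executable names are assumptions based on how these tools are commonly installed (`nw_display` stands in for the Newick Utilities suite) and may differ on your system.

```python
import shutil

# Assumed executable names; adjust to match your installation.
TOOLS = {
    "Dsuite": "Dsuite",
    "BPP": "bpp",
    "snp-sites": "snp-sites",
    "Newick Utilities": "nw_display",
}


def find_tools(tools=TOOLS):
    """Map each tool label to its resolved path on PATH, or None if absent."""
    return {label: shutil.which(exe) for label, exe in tools.items()}


if __name__ == "__main__":
    for label, path in find_tools().items():
        print(f"{label}: {path or 'not found on PATH'}")
```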
Contact
For questions about the data or the D-BPP methodology, please contact:
- Corresponding Authors: Yang Yang, Xiao-Xu Pang, Wei-Ning Bai, Da-Yong Zhang
- Email: yangy@mail.bnu.edu.cn; pangxiaoxu@mail.bnu.edu.cn; baiwn@bnu.edu.cn; zhangdy@bnu.edu.cn.
