Data from: Inference of cross-species gene flow using genomic data depends on the methods: Case study of gene flow in Drosophila
Data files
Oct 14, 2024 version files, 125.59 MB
- bpp_half_0.ctl (1.09 KB)
- Imap.txt (516 B)
- README.md (1.17 KB)
- seqfile_half_0.txt (62.60 MB)
- seqfile_half_1.txt (62.99 MB)
Abstract
Analysis of genomic data in the past two decades has highlighted the prevalence of introgression as an important evolutionary force in both plants and animals. The genus Drosophila has received much attention recently, with an analysis of genomic sequence data detailing widespread introgression across the species phylogeny for the genus. However, the methods used in the study are based on data summaries for species triplets and are unable to infer gene flow between sister lineages or to identify the direction of gene flow. Hence, we reanalyze a subset of the data using the Bayesian program bpp, which is a full-likelihood implementation of the multispecies coalescent (MSC) model and can provide more powerful inference of gene flow between species, including its direction, timing, and strength. While our analysis supports the presence of gene flow in the species group, the results differ from the previous study: we infer gene flow between sister lineages undetected previously whereas most gene-flow events inferred in the previous study are rejected in our tests. To verify our conclusions, we performed simulations to examine the properties of Bayesian and summary methods. Bpp was found to have high power to detect gene flow, high accuracy in estimated rates of gene flow, and robustness under misspecification of the mode of gene flow. In contrast, summary methods had low power and produced biased estimates of introgression probability. Our results highlight an urgent need for improving the statistical properties of summary methods and the computational efficiency of likelihood methods for inferring gene flow using genomic sequence data.
This dataset contains all files necessary to reproduce the BPP A00 runs under the MSC-I model using two halves of the Drosophila sequence data, as conducted in https://doi.org/10.5061/dryad.ngf1vhj33.
Description of the data and file structure
Files Included:
seqfile_half_0.txt: Sequence data file for the first half, containing 1389 loci.
seqfile_half_1.txt: Sequence data file for the second half, containing 1388 loci.
bpp_half_0.ctl: BPP control file. The seqfile parameter is set to seqfile_half_0.txt. To use the second half of the data, change this line to seqfile = seqfile_half_1.txt and update the number of loci to 1388 (the second half contains one fewer locus).
Imap.txt: BPP individual map (Imap) file. This file maps individual IDs to their corresponding species IDs for each sequence.
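The switch to the second half described above can be sketched in Python. This is a minimal sketch, not part of the dataset: it assumes the control file sets these values on lines of the form `seqfile = ...` and `nloci = ...`, and the function names and the output file name `bpp_half_1.ctl` are hypothetical. Any output-file settings in the control file are left unchanged, so you may want to rename them by hand to avoid overwriting results from the first half.

```python
import re


def make_second_half_ctl(ctl_text, seqfile="seqfile_half_1.txt", nloci=1388):
    """Return control-file text with the seqfile and nloci values replaced.

    Assumes "key = value" lines; anything after the value (for example a
    trailing comment) is preserved, and all other lines are left untouched.
    """
    out = []
    for line in ctl_text.splitlines():
        m = re.match(r"^(\s*)(seqfile|nloci)(\s*=\s*)(\S+)(.*)$", line)
        if m:
            value = seqfile if m.group(2) == "seqfile" else str(nloci)
            line = f"{m.group(1)}{m.group(2)}{m.group(3)}{value}{m.group(5)}"
        out.append(line)
    return "\n".join(out) + "\n"


def write_second_half_ctl(src="bpp_half_0.ctl", dst="bpp_half_1.ctl"):
    """Read the distributed control file and write an edited copy (dst is a hypothetical name)."""
    with open(src) as fh:
        text = fh.read()
    with open(dst, "w") as fh:
        fh.write(make_second_half_ctl(text))


# In-memory demonstration on a two-line stand-in for the real control file.
sample = "seqfile = seqfile_half_0.txt\nnloci = 1389\n"
print(make_second_half_ctl(sample), end="")
```

Calling `write_second_half_ctl()` in the directory containing the downloaded files would leave bpp_half_0.ctl untouched and write the edited copy alongside it.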
The supplemental information includes the following:
SI text. Bayesian parameter estimation in data simulated using the quartet trees of figure 2
Figures S1-S11
Tables S1-S9
Supplemental Information containing supplemental text, figures, and tables.