Data for: Among-species rate variation produces false signals of introgression

Koppetsch, Thore; Malinsky, Milan; Matschiner, Michael

Published May 31, 2023; Updated Apr 04, 2024 on Dryad. https://doi.org/10.5061/dryad.sf7m0cgbs

Data files

May 31, 2023 version files 7.49 MB

Apr 04, 2024 version files 13.55 MB

Abstract

The role of interspecific hybridization has recently seen increasing attention, especially in the context of diversification dynamics. Genomic research has now made it abundantly clear that both hybridization and introgression – the exchange of genetic material through hybridization and backcrossing – are far more common than previously thought. Besides cases of ongoing or recent genetic exchange between taxa, an increasing number of studies report “ancient introgression” – referring to results of hybridization that took place in the distant past. However, it is not clear whether commonly used methods for the detection of introgression are applicable to such old systems, given that most of these methods were originally developed for analyses at the level of populations and recently diverged species, affected by recent or ongoing genetic exchange. In particular, the assumption of constant evolutionary rates, which is implicit in many commonly used approaches, is more likely to be violated as evolutionary divergence increases. To test the limitations of introgression detection methods when being applied to old systems, we simulated thousands of genomic datasets under a wide range of settings, with varying degrees of among-species rate variation and introgression. Using these simulated datasets, we showed that some commonly applied statistical methods, including the D-statistic and certain tests based on sets of local phylogenetic trees, can produce false-positive signals of introgression between divergent taxa that have different rates of evolution. These misleading signals are caused by the presence of homoplasies occurring at different rates in different lineages. To distinguish between the patterns caused by rate variation and genuine introgression, we developed a new test that is based on the expected clustering of introgressed sites along the genome, and implemented this test in the program Dsuite.

Supplementary Material for Koppetsch et al. includes Supplementary Notes S1-S4, Supplementary Figures S1-S36, and Supplementary Tables S1-S5.

Supplementary Table 1 is provided in file Supplementary_Table_1.xlsx, while all other Supplementary Material is included in file dstats_supplement.pdf.

Supplementary Table 1

This Excel spreadsheet contains several sheets that together hold all the summary content that would be obtained by running the script run_all_simulation_data.sh. All scripts are provided in the GitHub code repository (https://github.com/thorekop/ABBA-Site-Clustering/blob/main/src/). Please note that the run_all_simulation_data.sh script is not itself intended to be executed.

Abbreviations for the variables and parameters listed in Supplementary Table 1 are defined here:

pop_size: effective population sizes
div_time: divergence time of species P1 and P2 in our four-taxon phylogeny, in generations
rec_rate: recombination rate of the chromosome
mut_rate: mutation rate
intr_rate: rate of introgression between species P2 and P3
P2_rate: scale factor by which the branch leading to P2 is extended or shortened to introduce mutation-rate variation
n_variable_sites_vcf: number of variable sites per VCF file
n_biallelic_sites_vcf: number of biallelic sites per VCF file
n_multiallelic_sites_vcf: number of multiallelic sites per VCF file
n_variable_sites_200bp: number of variable sites per alignment of 200 bp extracted from the genomic dataset
n_variable_sites_500bp: number of variable sites per alignment of 500 bp extracted from the genomic dataset
n_variable_sites_1000bp: number of variable sites per alignment of 1000 bp extracted from the genomic dataset
n_pi_sites_200bp: number of parsimony-informative sites per alignment of 200 bp extracted from the genomic dataset
n_pi_sites_500bp: number of parsimony-informative sites per alignment of 500 bp extracted from the genomic dataset
n_pi_sites_1000bp: number of parsimony-informative sites per alignment of 1000 bp extracted from the genomic dataset
run: run number; ten replicate simulations (fifty for the main settings) were performed for each combination of parameters
D_stat: Patterson's D-statistic
p_value: p-values
f4_ratio: f4-ratio estimation, a method for estimating ancestry proportions in an admixed population
sensitive version of the 'ABBA'-site clustering: signals of introgression (given as p-values) detected with the “sensitive” version of the new 'ABBA'-site clustering test
robust version of the 'ABBA'-site clustering: signals of introgression (given as p-values) detected with the “robust” version of the new 'ABBA'-site clustering test
BBAA: number of BBAA sites
ABBA: number of ABBA sites
BABA: number of BABA sites
align_length: length of the equally spaced windows across the simulated chromosome from which variants were extracted (for tree-based introgression-detection methods only)
n_significant: number of significant runs
loglik_dif: difference in log-likelihood
topo_ok: whether the inferred topology still matches the original topology of the four-taxon phylogeny
network: inferred topology
introgression_rate: rate of introgression calculated based on the QuIBL results
mrca_reduction: difference between the two oldest pairwise mean MRCA ages (dMRCA)
means_c_genes_length: mean length of c-genes
means_tracts_length: mean length of single-topology tracts
n_introgressed_tracts: number of introgressed single-topology tracts
means_introgressed_tracts_length: mean length of introgressed single-topology tracts
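
The D_stat, ABBA, and BABA columns are linked by the standard definition of Patterson's D, D = (ABBA - BABA) / (ABBA + BABA). The following is a minimal Python sketch of that relationship, not the Dsuite implementation; the function name and the example counts are hypothetical and are not taken from Supplementary Table 1. Counts are treated as floats because Dsuite derives them from allele frequencies, so they need not be integers.

```python
def d_statistic(abba: float, baba: float) -> float:
    """Return Patterson's D = (ABBA - BABA) / (ABBA + BABA)."""
    total = abba + baba
    if total == 0:
        # Without any ABBA or BABA sites the ratio is undefined.
        raise ValueError("D is undefined when ABBA + BABA is zero")
    return (abba - baba) / total


# Hypothetical counts for illustration only.
print(d_statistic(abba=150, baba=100))  # 0.2
```

In the usual four-taxon arrangement, a positive value indicates an excess of ABBA sites, that is, more allele sharing between P2 and P3 than between P1 and P3.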

The content of the individual sheets is described in more detail below:

Dataset statistics: This sheet gives an overview of the numbers of variable, biallelic, and multiallelic sites per simulated dataset, as well as the numbers of variable and parsimony-informative sites per alignment for alignment lengths of 200, 500, and 1,000 bp.
Dsuite: This sheet summarizes both the Dsuite analysis (results obtained by running the script summarize_dsuite.sh) and the analysis based on the “sensitive” and “robust” versions of 'ABBA'-Site Clustering (results obtained by running the script summarize_abba_clustering.sh).
Dtree: The script summarize_iqtree.sh summarizes the Dtree analysis and writes the results into this table.
SNaQ: The script summarize_snaq.sh summarizes the SNaQ analysis and writes the results into this table.
QuIBL: The script summarize_quibl.sh summarizes the QuIBL analysis and writes the results into this table.
MMS17: The script summarize_meyer_approach.sh summarizes the analysis with the method developed by Meyer, Matschiner, and Salzburger (2017) ("MMS17 method") and writes the results into this table.
Moderate ILS: The script summarize_abba_clustering.sh summarizes the analysis based on 'ABBA'-Site Clustering and writes the results for simulations with a moderate level of ILS (obtained after running the script simulate_data_ils.sh) into this table.
High ILS: The script summarize_abba_clustering.sh summarizes the analysis based on 'ABBA'-Site Clustering and writes the results for simulations with a high level of ILS (obtained after running the script simulate_data_high_ils.sh) into this table.
Very high ILS: The script summarize_abba_clustering.sh summarizes the analysis based on 'ABBA'-Site Clustering and writes the results for simulations with a very high level of ILS (obtained after running the script simulate_data_very_high_ils.sh) into this table.
C-genes_tracts_length: The script calculate_mean_tracts.sh analyzes the lengths of c-genes and single-topology tracts and writes the results into this table. The number of introgressed single-topology tracts is also listed. Please note that in some cases no introgressed single-topology tracts are present, so the mean length of introgressed tracts cannot be determined (indicated with "NA" in the column).
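
Because the means_introgressed_tracts_length column can contain "NA", those entries need to be skipped before any further averaging. The following is a minimal Python sketch, assuming the column has been exported as text cells (for example to CSV); the helper names and the example values are hypothetical.

```python
from statistics import mean
from typing import Optional, Sequence


def parse_length(cell: str) -> Optional[float]:
    """Convert one cell to a float, mapping the "NA" marker to None."""
    cell = cell.strip()
    return None if cell == "NA" else float(cell)


def mean_ignoring_na(cells: Sequence[str]) -> Optional[float]:
    """Average all numeric cells; return None if every cell is "NA"."""
    values = [v for v in map(parse_length, cells) if v is not None]
    return mean(values) if values else None


# Hypothetical column values for illustration only.
print(mean_ignoring_na(["1200", "NA", "800"]))
```

Returning None rather than 0 when every entry is "NA" keeps "no introgressed tracts" distinct from "tracts of length zero".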
