Data for: Among-species rate variation produces false signals of introgression
Data files
May 31, 2023 version files (7.49 MB)
- dstats_supplement.pdf (6.62 MB)
- README.md (4.66 KB)
- Supplementary_Table_1.xlsx (858.96 KB)
Apr 04, 2024 version files (13.55 MB)
- dstats_supplement.pdf (12.51 MB)
- README.md (6.66 KB)
- Supplementary_Table_S1.xlsx (1.03 MB)
Abstract
The role of interspecific hybridization has recently seen increasing attention, especially in the context of diversification dynamics. Genomic research has now made it abundantly clear that both hybridization and introgression – the exchange of genetic material through hybridization and backcrossing – are far more common than previously thought. Besides cases of ongoing or recent genetic exchange between taxa, an increasing number of studies report “ancient introgression” – referring to results of hybridization that took place in the distant past. However, it is not clear whether commonly used methods for the detection of introgression are applicable to such old systems, given that most of these methods were originally developed for analyses at the level of populations and recently diverged species, affected by recent or ongoing genetic exchange. In particular, the assumption of constant evolutionary rates, which is implicit in many commonly used approaches, is more likely to be violated as evolutionary divergence increases. To test the limitations of introgression detection methods when being applied to old systems, we simulated thousands of genomic datasets under a wide range of settings, with varying degrees of among-species rate variation and introgression. Using these simulated datasets, we showed that some commonly applied statistical methods, including the D-statistic and certain tests based on sets of local phylogenetic trees, can produce false-positive signals of introgression between divergent taxa that have different rates of evolution. These misleading signals are caused by the presence of homoplasies occurring at different rates in different lineages. To distinguish between the patterns caused by rate variation and genuine introgression, we developed a new test that is based on the expected clustering of introgressed sites along the genome, and implemented this test in the program Dsuite.
README: Among-species rate variation produces false signals of introgression
Supplementary Material for Koppetsch et al. includes Supplementary Notes S1-S4, Supplementary Figures S1-S36, and Supplementary Tables S1-S5.
Supplementary Table 1 is provided in file Supplementary_Table_1.xlsx, while all other Supplementary Material is included in file dstats_supplement.pdf.
Supplementary Table 1
This Excel spreadsheet contains multiple sheets with the summary results that would be obtained after running the script run_all_simulation_data.sh. All scripts are provided in the GitHub code repository (https://github.com/thorekop/ABBA-Site-Clustering/blob/main/src/). Please note that the run_all_simulation_data.sh script is not intended to be executed itself.
Abbreviations for the variables and parameters listed in Supplementary Table 1 are defined here:
- pop_size: effective population sizes
- div_time: divergence of species P1 and P2 in our four-taxon phylogeny, in generations
- rec_rate: recombination rate of the chromosome
- mut_rate: mutation rate
- intr_rate: rate of introgression between species P2 and P3
- P2_rate: scale factor by which the branch leading to P2 is extended or shortened to introduce mutation-rate variation
- n_variable_sites_vcf: number of variable sites per VCF file
- n_biallelic_sites_vcf: number of biallelic sites per VCF file
- n_multiallelic_sites_vcf: number of multiallelic sites per VCF file
- n_variable_sites_200bp: number of variable sites per 200-bp alignment extracted from the genomic dataset
- n_variable_sites_500bp: number of variable sites per 500-bp alignment extracted from the genomic dataset
- n_variable_sites_1000bp: number of variable sites per 1000-bp alignment extracted from the genomic dataset
- n_pi_sites_200bp: number of parsimony-informative sites per 200-bp alignment extracted from the genomic dataset
- n_pi_sites_500bp: number of parsimony-informative sites per 500-bp alignment extracted from the genomic dataset
- n_pi_sites_1000bp: number of parsimony-informative sites per 1000-bp alignment extracted from the genomic dataset
- run: run number; ten replicate simulations (fifty for the main settings) were performed for each combination of parameters
- D_stat: Patterson's D-statistic
- p_value: p-values
- f4_ratio: f4-ratio estimation, a method for estimating ancestry proportions in an admixed population
- sensitive version of the 'ABBA'-site clustering: signals of introgression (given as p-values) detected with the “sensitive” version of the new 'ABBA'-site clustering test
- robust version of the 'ABBA'-site clustering: signals of introgression (given as p-values) detected with the “robust” version of the new 'ABBA'-site clustering test
- BBAA: number of BBAA sites
- ABBA: number of ABBA sites
- BABA: number of BABA sites
- align_length: length of the equally spaced windows across the simulated chromosome from which variants were extracted (for tree-based introgression-detection methods only)
- n_significant: number of significant runs
- loglik_dif: difference in log-likelihood
- topo_ok: whether the inferred topology still matches the original topology of the four-taxon phylogeny
- network: inferred topology
- introgression_rate: rate of introgression calculated based on the QuIBL results
- mrca_reduction: difference between the two oldest pairwise mean MRCA ages (dMRCA)
- means_c_genes_length: mean length of c-genes
- means_tracts_length: mean length of single-topology tracts
- n_introgressed_tracts: number of introgressed single-topology tracts
- means_introgressed_tracts_length: mean length of introgressed single-topology tracts
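The BBAA, ABBA, and BABA columns relate to D_stat through the standard definition of Patterson's D, D = (ABBA - BABA) / (ABBA + BABA). The following Python sketch illustrates that relationship under simplifying assumptions: each species is represented by a single allele per site, and the outgroup allele is treated as ancestral ("A"). Dsuite itself works with allele frequencies, so the function names and toy sites below are hypothetical and not part of the repository.

```python
from collections import Counter


def site_pattern(p1, p2, p3, outgroup):
    """Classify one site as 'BBAA', 'ABBA', or 'BABA', or return None.

    Assumption: one allele per species; the outgroup allele is ancestral (A)
    and the other allele is derived (B). Non-biallelic sites are skipped.
    """
    if len({p1, p2, p3, outgroup}) != 2:
        return None
    code = "".join("A" if allele == outgroup else "B" for allele in (p1, p2, p3)) + "A"
    return code if code in ("BBAA", "ABBA", "BABA") else None


def d_statistic(n_abba, n_baba):
    """Patterson's D from site-pattern counts; None if no informative sites."""
    total = n_abba + n_baba
    if total == 0:
        return None
    return (n_abba - n_baba) / total


# Toy data (hypothetical): alleles listed as (P1, P2, P3, outgroup).
sites = [("A", "T", "T", "A"), ("T", "A", "T", "A"), ("T", "T", "A", "A"),
         ("A", "T", "T", "A"), ("T", "A", "A", "A")]
counts = Counter(p for p in (site_pattern(*s) for s in sites) if p is not None)
d = d_statistic(counts["ABBA"], counts["BABA"])
```

A positive D indicates an excess of ABBA over BABA sites, which is the pattern expected under introgression between P2 and P3 but, as the study shows, can also arise from among-species rate variation.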
The contents of the individual sheets are described in more detail below:
Dataset statistics: Here, we present a comprehensive overview of the numbers of variable, biallelic, and multiallelic sites per simulated dataset, as well as the numbers of variable and parsimony-informative sites per alignment for alignment lengths of 200, 500, and 1,000 bp.
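To make the distinction between variable and parsimony-informative sites concrete, the sketch below counts both in a small alignment. It uses the usual definition (a site is parsimony-informative if at least two character states each occur in at least two sequences) and is an illustrative assumption about how such counts can be obtained, not the code used to produce the table; the function name and toy alignment are hypothetical.

```python
from collections import Counter


def count_sites(alignment):
    """Return (n_variable, n_parsimony_informative) for equal-length sequences.

    Only the characters A, C, G, and T are counted; gaps and ambiguity
    codes are ignored (an assumption made for this sketch).
    """
    n_variable = 0
    n_pi = 0
    for column in zip(*alignment):
        counts = Counter(base for base in column if base in "ACGT")
        if len(counts) > 1:
            n_variable += 1
        # Parsimony-informative: at least two states, each seen at least twice.
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            n_pi += 1
    return n_variable, n_pi


# Toy four-taxon alignment (hypothetical data).
toy_alignment = ["ACGTA", "ACGAA", "ATGAA", "ATGAC"]
n_variable, n_pi = count_sites(toy_alignment)
```

In this toy alignment, three columns are variable, but only the second column (C, C, T, T) is parsimony-informative, because the other two variable columns contain a state found in a single sequence.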
Dsuite: Here, the results of the Dsuite analysis (obtained by running the script summarize_dsuite.sh) and of the analysis based on both the “sensitive” and “robust” versions of 'ABBA'-site clustering (obtained by running the script summarize_abba_clustering.sh) are summarized in a single table.
Dtree: The script summarize_iqtree.sh summarizes the Dtree analysis and writes the results into this table.
SNaQ: The script summarize_snaq.sh summarizes the SNaQ analysis and writes the results into this table.
QuIBL: The script summarize_quibl.sh summarizes the QuIBL analysis and writes the results into this table.
MMS17: The script summarize_meyer_approach.sh summarizes the analysis with the method developed by Meyer, Matschiner, and Salzburger (2017) ("MMS17 method") and writes the results into this table.
Moderate ILS: The script summarize_abba_clustering.sh summarizes the analysis based on 'ABBA'-site clustering and writes the results for simulations with a moderate level of ILS (obtained after running the script simulate_data_ils.sh) into this table.
High ILS: The script summarize_abba_clustering.sh summarizes the analysis based on 'ABBA'-site clustering and writes the results for simulations with a high level of ILS (obtained after running the script simulate_data_high_ils.sh) into this table.
Very high ILS: The script summarize_abba_clustering.sh summarizes the analysis based on 'ABBA'-site clustering and writes the results for simulations with a very high level of ILS (obtained after running the script simulate_data_very_high_ils.sh) into this table.
C-genes_tracts_length: The script calculate_mean_tracts.sh analyzes the lengths of c-genes and single-topology tracts and writes the results into this table. The number of introgressed single-topology tracts is also listed. Please note that in some cases no single-topology tracts are present, so the mean length of introgressed tracts cannot be determined (indicated with "NA" in the column).
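When working with the tract-length columns, the "NA" entries need to be handled explicitly. The following Python sketch shows one way to do so; the helper names and example values are hypothetical, and it is not taken from calculate_mean_tracts.sh.

```python
def mean_length_or_na(lengths):
    """Mean of a list of tract lengths, or the string 'NA' if the list is empty."""
    if not lengths:
        return "NA"
    return sum(lengths) / len(lengths)


def mean_over_runs(cells):
    """Average a column of per-run values, skipping cells marked 'NA'.

    Returns None if every cell is 'NA'.
    """
    values = [float(cell) for cell in cells if cell != "NA"]
    if not values:
        return None
    return sum(values) / len(values)


# Hypothetical column of means_introgressed_tracts_length values across runs.
column = ["1200.0", "NA", "800.0"]
overall_mean = mean_over_runs(column)
```

Skipping "NA" cells rather than treating them as zero avoids biasing the average downward for parameter combinations in which some runs contain no tracts.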
Methods
Genomic datasets were simulated with msprime and processed with Dsuite, IQ-TREE, SNaQ, and QuIBL.