Population size differences can lead to biases in phylogenetic inference and introgression detection in the presence of purifying selection
Data files
Feb 06, 2024 version files 2.95 GB
- raw_data.zip
- README.md
Abstract
Phylogenetic reconstruction and introgression detection rely on an assumption about the probability distribution of gene tree topologies. Recently, evidence has emerged that population size differences can affect the probability distribution of gene tree topologies in the presence of purifying selection. Here, using the population genetic simulator SLiM, we provide evidence that in the presence of purifying selection, population size differences can lead to biases in phylogenetic inference. We also provide evidence that in the presence of purifying selection, population size differences can cause statistics used for introgression detection to exhibit patterns resembling those caused by introgression. In addition, we present a theoretical analysis showing that the occurrence of population size–dependent gene tree distributions is an inherent consequence of purifying selection. Our work underscores the importance of considering the potential confounding effect of purifying selection on phylogenetic inference and introgression detection.
README: Population Size Differences Can Lead to Biases in Phylogenetic Inference and Introgression Detection in the Presence of Purifying Selection
https://doi.org/10.5061/dryad.2z34tmpsz
This dataset includes Supplementary Materials and Methods, Supplementary Figures, and the scripts that produce the results described in the paper.
Description of the data and file structure
Supplementary Materials and Methods describes the simulation used to examine the value of P(V1|V) and the simulation used to examine the replicability of the results of He et al. (2020).
Supplementary Figures presents the results of the simulations described in Supplementary Materials and Methods.
Part of the raw data is provided in "raw_data.zip", which includes the reconstructed gene trees used to generate Figure 3 and Figure 5, the reconstructed phylogenetic trees used to generate Figure 4, and the Dsuite results used to generate Figure 5. The SLiM simulation results are not included in "raw_data.zip" because they are too large.
In "raw_data.zip", files with suffix ".treefile" in folder "iqtree_trees_four_species_1000bp_nucleotide_u=2.4e-6" are the reconstructed gene trees involved in the generation of Figure 3 and Figure 5; files with suffix "species.treefile" and suffix ".fasta.treefile" in folder "species_trees_four_species_1000bp_nucleotide_u=2.4e-6" are the reconstructed phylogenetic trees involved in the generation of Figure 4; files with suffix "combined_tree.txt" are the Dsuite results involved in the generation of Figure 5. In each of the above-mentioned folders, subfolders with names that start with "neutral" are results generated under neutrality, and subfolders with names that start with "deleterious" are results generated under purifying selection. Numbers after "neutral" or "deleterious" (e.g. 2000_80_2000) are the population sizes of S3, S2, S1, respectively.
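The subfolder naming convention described above can be decoded programmatically. The following is a minimal sketch, not part of the dataset's scripts: the function name `parse_subfolder` and the `Scenario` type are hypothetical, and the exact separator between the "neutral"/"deleterious" prefix and the numbers is an assumption (the pattern tolerates any non-digit characters there).

```python
import re
from typing import NamedTuple, Optional


class Scenario(NamedTuple):
    """Selection regime and population sizes decoded from a subfolder name."""
    selection: str  # "neutral" or "deleterious"
    n_s3: int       # population size of S3 (first number)
    n_s2: int       # population size of S2 (second number)
    n_s1: int       # population size of S1 (third number)


# Assumption: the name starts with the regime, followed by three
# underscore-separated integers, e.g. "neutral_2000_80_2000".
_PATTERN = re.compile(r"^(neutral|deleterious)\D*?(\d+)_(\d+)_(\d+)")


def parse_subfolder(name: str) -> Optional[Scenario]:
    """Return the decoded scenario, or None if the name does not match."""
    match = _PATTERN.match(name)
    if match is None:
        return None
    selection, n_s3, n_s2, n_s1 = match.groups()
    return Scenario(selection, int(n_s3), int(n_s2), int(n_s1))
```

For example, `parse_subfolder("neutral_2000_80_2000")` yields a scenario with S3 = 2000, S2 = 80, and S1 = 2000, following the order stated above.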
Using the scripts included in this dataset, readers can replicate all the results described in our paper, including those included in "raw_data.zip" and the SLiM simulation results.
Sharing/Access information
The scripts used for simulation are also publicly available at:
Code/Software
File "analyze_1000bp_nucleotide.py" implements the SLiM simulation to generate nucleotide sequences.
Files "sim_four_species_1000bp_nucleotide_neutral" and "sim_four_species_1000bp_nucleotide_deleterious" are the SLiM scripts used by
"analyze_1000bp_nucleotide.py".
File "run_iqtree.py" takes the nucleotide sequences produced by "analyze_1000bp_nucleotide.py" as input and runs IQ-TREE.
The true gene trees generated by "analyze_1000bp_nucleotide.py" and the reconstructed gene trees generated by "run_iqtree.py"
are the input of "plot_iqtree_result.py", which generates the results shown in Figure 3.
File "plot_iqtree_result.py" also generates the results shown in the upper panels of Figure 5, which illustrate the impact of purifying selection
on phylogeny-based introgression detection.
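Phylogeny-based introgression detection commonly rests on the expectation that, without introgression, the two discordant gene tree topologies occur at equal frequencies. As a rough illustration of how an asymmetry between them could be tested, the sketch below computes a two-sided exact binomial p-value from two topology counts. This is an assumption about one reasonable test, not a description of what "plot_iqtree_result.py" does; the function name is hypothetical.

```python
from math import comb


def discordant_asymmetry_pvalue(n_a: int, n_b: int) -> float:
    """Two-sided exact binomial p-value for H0: both discordant
    topologies are equally likely (probability 0.5 each).

    n_a and n_b are the numbers of gene trees showing each of the
    two discordant topologies.
    """
    n = n_a + n_b
    if n == 0:
        return 1.0  # no discordant trees, so no evidence of asymmetry
    k = min(n_a, n_b)
    # Lower tail P(X <= k) under Binomial(n, 0.5). The distribution is
    # symmetric, so the two-sided p-value is twice this tail, capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

A small p-value indicates that one discordant topology is over-represented, which is the pattern usually attributed to introgression and which, according to the paper, can also arise from population size differences under purifying selection.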
File "infer_species_tree.py" implements coalescent-based and concatenation-based phylogenetic reconstruction using
the nucleotide sequences generated by "analyze_1000bp_nucleotide.py" and the inferred gene trees generated by "run_iqtree.py".
The results produced by "infer_species_tree.py" are the input of file "plot_species_tree.py", which generates the results shown in Figure 4.
File "run_sim_four_species_chromosome.py" implements the SLiM simulation to generate VCF files. Files "sim_four_species_chromosome_neutral"
and "sim_four_species_chromosome_deleterious" are the SLiM scripts used by "run_sim_four_species_chromosome.py".
File "run_dsuite_for_four_species_chromosome.py" utilizes the VCF files produced by "run_sim_four_species_chromosome.py" as input to run Dsuite.
The results generated by "run_dsuite_for_four_species_chromosome.py" are the input of file "plot_dsuite_result.py",
which generates the results shown in the lower panels in Figure 5. These results illustrate the impact of purifying selection on D-statistic-based
introgression detection.
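For readers unfamiliar with the D-statistic, its standard definition is D = (ABBA - BABA) / (ABBA + BABA), where ABBA and BABA are the counts (or frequency-weighted sums) of the two discordant site patterns. The sketch below only illustrates this definition; Dsuite computes the statistic itself from allele frequencies in the VCF files and also reports significance, and the function name here is hypothetical.

```python
def patterson_d(n_abba: float, n_baba: float) -> float:
    """Patterson's D from ABBA and BABA site-pattern totals.

    D is 0 when both patterns are equally frequent and approaches
    +1 or -1 as one pattern dominates.
    """
    total = n_abba + n_baba
    if total == 0:
        raise ValueError("D is undefined when there are no ABBA or BABA sites")
    return (n_abba - n_baba) / total
```

For instance, 60 ABBA sites and 40 BABA sites give D = 0.2.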
Dsuite requires the source population/species of each sampled individual. This information is provided in
file "sim_four_species_chromosome.population".
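Dsuite expects this kind of file to contain one individual per line, with the individual name and its population/species name separated by whitespace (a tab in Dsuite's documentation). The following sketch, which assumes that "sim_four_species_chromosome.population" follows this layout, groups individuals by population. The function name and the individual and population labels in the usage example are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def read_population_map(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group individual names by population from 'individual<TAB>population' lines."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 2:
            raise ValueError(f"expected two columns, got: {line!r}")
        individual, population = fields[0], fields[1]
        groups[population].append(individual)
    return dict(groups)
```

Usage, with made-up labels: `read_population_map(["i0\tS1", "i1\tS1", "i2\tS2"])` returns `{"S1": ["i0", "i1"], "S2": ["i2"]}`. To read the actual file, pass an open file handle, for example `read_population_map(open("sim_four_species_chromosome.population"))`.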
File "analyze_1bp_biallelic.py" implements SLiM simulations for biallelic loci. File "sim_three_species_1bp_biallelic"
is the SLiM script used by "analyze_1bp_biallelic.py". The results generated by "analyze_1bp_biallelic.py"
are the input of "plot_ancestral_mutation.py" and "plot_biallelic_result.py", which generate the results shown in Supplementary Figures S1 and S2.
Using the above-described scripts, researchers can replicate all the results described in our paper. Running these scripts
requires the installation of Python 3, SLiM, tskit, pyslim, SciPy, NumPy, and Matplotlib.