Population size differences can lead to biases in phylogenetic inference and introgression detection in the presence of purifying selection
Data files
Feb 06, 2024 version files 2.95 GB
-
raw_data.zip
2.95 GB
-
README.md
4.95 KB
Nov 01, 2024 version files 2.95 GB
-
raw_data.zip
2.95 GB
-
README.md
6.57 KB
Apr 17, 2025 version files 2.95 GB
-
raw_data.zip
2.95 GB
-
README.md
7.66 KB
Abstract
Assumptions about the probability distribution of gene tree topologies provide a basis for phylogenetic reconstruction and introgression detection. Initial evidence has suggested that in the presence of purifying selection, population size differences can affect the probability distribution of gene tree topologies. Nevertheless, the impact of this phenomenon on phylogenetic reconstruction and introgression detection remains to be explored. Additionally, a theoretical understanding of this phenomenon remains absent. Here, using the population genetic simulator SLiM, we provide evidence that in the presence of purifying selection, population size differences can cause biases in phylogenetic inference. We also provide evidence that in the presence of purifying selection, population size differences can cause statistics used for introgression detection to exhibit patterns resembling those caused by introgression. Additionally, a theoretical analysis is presented to show that the biological basis underlying the formation of gene trees is different under neutral evolution and under purifying selection, and the population size dependency in gene tree distributions can be deduced from the inherent nature of purifying selection. This work underscores the importance of considering the potential confounding impact of purifying selection on phylogenetic inference and introgression detection.
https://doi.org/10.5061/dryad.2z34tmpsz
This dataset includes Supplementary Methods, Supplementary Figures, and the scripts that produce the results described in the paper.
Description of the data and file structure
Figure S1 shows that rescaling does not significantly alter our results. Figure S2 shows the results of changing the population sizes used by Vanderpool et al. (2020). These results aim to show that Vanderpool et al.’s simulations cannot be used to support their claim “there should be no effect of negative selection on the distribution of tree topologies”. Figure S3 shows the values of P(V1 | V) under neutral evolution and under purifying selection. These results aim to show that the condition P(V1 | V) cannot be satisfied under either neutral evolution or purifying selection.
Part of the raw data can be found in the file “raw_data.zip”, which includes the reconstructed gene trees used to generate Figures 2 and 4, the reconstructed phylogenetic trees used to generate Figure 3, and the Dsuite results used to generate Figure 4. The SLiM simulation results are not included in “raw_data.zip” because they are too large.
In “raw_data.zip”, files with suffix “.treefile” in folder “iqtree_trees_four_species_1000bp_nucleotide_u=2.4e-6” are the reconstructed gene trees used to generate Figures 2 and 4; files with suffix “species.treefile” and suffix “.fasta.treefile” in folder “species_trees_four_species_1000bp_nucleotide_u=2.4e-6” are the reconstructed phylogenetic trees used to generate Figure 3; files with suffix “combined_tree.txt” are the Dsuite results used to generate Figure 4. In each of the above-mentioned folders, subfolders with names that start with “neutral” contain results generated under neutral evolution, and subfolders with names that start with “deleterious” contain results generated under purifying selection. The numbers after “neutral” or “deleterious” (e.g. 2000_80_2000) are the population sizes of S3, S2, and S1, respectively.
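As an illustration of the naming convention above, a subfolder name can be split into its selection regime and population sizes. This is a minimal sketch and not part of the dataset's scripts: the function name is hypothetical, and it assumes that the name starts with “neutral” or “deleterious” and ends with three underscore-separated integers; the exact separator between the prefix and the numbers is a guess.

```python
import re

def parse_subfolder_name(name):
    """Split a result subfolder name into regime and population sizes.

    Assumption (not confirmed by the dataset): the name starts with
    'neutral' or 'deleterious' and ends with three underscore-separated
    integers giving the sizes of S3, S2, and S1, in that order.
    """
    match = re.match(r"^(neutral|deleterious)\D*?(\d+)_(\d+)_(\d+)$", name)
    if match is None:
        raise ValueError(f"unrecognized subfolder name: {name!r}")
    regime = match.group(1)
    s3, s2, s1 = (int(g) for g in match.groups()[1:])
    return {"regime": regime, "S3": s3, "S2": s2, "S1": s1}
```

For example, `parse_subfolder_name("deleterious_2000_80_2000")` would give a purifying-selection run with S3 = 2000, S2 = 80, and S1 = 2000.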
Using the scripts included in this dataset, readers can replicate all the results described in our paper, including those included in “raw_data.zip” and the SLiM simulation results.
Sharing/Access information
The scripts used for simulation are also publicly available at:
https://github.com/he-chong/gene_tree_dist_under_purifying_sel_slim
Code/Software
File “analyze_1000bp_nucleotide.py” runs SLiM to simulate nucleotide sequences and true gene trees.
Files “sim_four_species_1000bp_nucleotide_neutral” and “sim_four_species_1000bp_nucleotide_deleterious” are the SLiM scripts used by “analyze_1000bp_nucleotide.py”.
The nucleotide sequences generated by “analyze_1000bp_nucleotide.py” are the input of “run_iqtree.py”, which runs IQ-TREE to reconstruct gene trees. The true gene trees generated by “analyze_1000bp_nucleotide.py” and the reconstructed gene trees generated by “run_iqtree.py” are the input of “plot_iqtree_result.py”, which generates the results shown in Figure 2.
File “plot_iqtree_result.py” also generates the results shown in the upper panels in Figure 4, which demonstrate the impact of purifying selection on phylogenetic-based introgression detection.
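To make the idea of a gene tree topology distribution concrete, the following sketch tallies topology frequencies from four-taxon Newick strings such as those stored in “.treefile” files. It is an independent illustration and not the code in “plot_iqtree_result.py”: the function names and the taxon labels A to D are hypothetical, and the tree is treated as unrooted, so a topology is identified by which two taxa form a pair.

```python
import re
from collections import Counter

def quartet_split(newick):
    """Return the split of a resolved four-taxon tree as two taxon pairs.

    Returns None for unresolved trees or trees without exactly four taxa.
    """
    s = re.sub(r":[^,();]+", "", newick.strip())  # drop branch lengths
    s = re.sub(r"\)[^,();]+", ")", s)              # drop support labels
    taxa = set(re.findall(r"[^,();\s]+", s))
    cherry = re.search(r"\(([^,()]+),([^,()]+)\)", s)
    if cherry is None or len(taxa) != 4:
        return None
    pair = frozenset(cherry.groups())
    return frozenset({pair, frozenset(taxa - pair)})

def topology_frequencies(newick_strings):
    """Map each observed split to its relative frequency."""
    counts = Counter(quartet_split(t) for t in newick_strings)
    total = sum(counts.values())
    return {split: n / total for split, n in counts.items()}
```

Under this representation, `((A,B),(C,D));` and `(A,B,(C,D));` count as the same topology, which is the comparison needed when one of the four taxa serves as the outgroup.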
File “infer_species_tree.py” performs phylogenetic reconstruction using the nucleotide sequences generated by “analyze_1000bp_nucleotide.py” and the inferred gene trees generated by “run_iqtree.py”. The results produced by “infer_species_tree.py” are the input of “plot_species_tree.py”, which generates the results shown in Figure 3.
File “run_sim_four_species_chromosome.py” runs SLiM to simulate VCF files. “sim_four_species_chromosome_neutral” and “sim_four_species_chromosome_deleterious” are the SLiM scripts used by “run_sim_four_species_chromosome.py”.
The VCF files generated by “run_sim_four_species_chromosome.py” are the input of “run_dsuite_for_four_species_chromosome.py”, which runs Dsuite to calculate the D-statistic. The results generated by “run_dsuite_for_four_species_chromosome.py” are the input of “plot_dsuite_result.py”, which generates results shown in the lower panels in Figure 4. These results demonstrate the impact of purifying selection on D-statistic-based introgression detection.
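For readers unfamiliar with the D-statistic, the count-based ABBA-BABA form is D = (nABBA − nBABA) / (nABBA + nBABA). The sketch below implements only this textbook formula; it is not the code in “run_dsuite_for_four_species_chromosome.py”, the function name is hypothetical, and Dsuite itself works from allele frequencies in the VCF rather than from whole-site counts.

```python
def d_statistic(n_abba, n_baba):
    """Count-based Patterson's D from ABBA and BABA site-pattern counts.

    Returns NaN when there are no informative sites.
    """
    total = n_abba + n_baba
    if total == 0:
        return float("nan")
    return (n_abba - n_baba) / total
```

A value near zero is the expectation without introgression under neutrality; the simulations described here examine how purifying selection combined with population size differences can shift D away from zero.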
Running Dsuite requires information about the population/species to which each sampled individual belongs; this information is provided in “sim_four_species_chromosome.population”.
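A population file of this kind is a plain-text table with one tab-separated line per individual, giving the individual name and then its population, with outgroup individuals labelled “Outgroup” as Dsuite expects. The sketch below shows the layout only; the individual names and the helper function are hypothetical and do not reproduce the contents of “sim_four_species_chromosome.population”.

```python
def format_sets_file(assignments):
    """Render {individual: population} as tab-separated lines."""
    lines = [f"{individual}\t{population}"
             for individual, population in assignments.items()]
    return "\n".join(lines) + "\n"

# Hypothetical sample names; the real file uses the names found in the VCF.
example_assignments = {
    "i0": "S1",
    "i1": "S2",
    "i2": "S3",
    "i3": "Outgroup",
}
sets_text = format_sets_file(example_assignments)
```

The individual names in the first column must match the sample names in the VCF header.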
File “analyze_1bp_biallelic.py” runs SLiM simulations for biallelic loci. “sim_three_species_1bp_biallelic” is the SLiM script used by “analyze_1bp_biallelic.py”. The results generated by “analyze_1bp_biallelic.py” are the input of “plot_biallelic_result.py” and “plot_ancestral_mutation.py”, which generate the results shown in Figure 5 and Supplementary Figure S3, respectively.
File “analyze_Vanderpool_et_al.py” runs SLiM simulations to demonstrate that changing the population sizes used by Vanderpool et al. (2020) affects gene tree frequencies. The true gene trees produced by “analyze_Vanderpool_et_al.py” are the input of “plot_true_gene_trees.py”, which generates the results shown in Supplementary Figure S2. “sim_three_species_1000bp_nucleotide_vanderpool_pop_changed” and “sim_three_species_1000bp_nucleotide_vanderpool” are the SLiM scripts used by “analyze_Vanderpool_et_al.py”.
Additionally, the script “analyze_1000bp_nucleotide.py” also runs SLiM simulations to further explore the changing pattern of the probabilities of gene tree topologies. The true gene trees produced by these simulations are the input of “plot_true_gene_trees.py”, which generates the results shown in Figure 10. These simulations also use the SLiM script “sim_four_species_1000bp_nucleotide_deleterious”.
File “analyze_1bp_biallelic_large_pop.py” simulates gene trees for the results presented in Figure S1.
“sim_three_species_1bp_biallelic_large_pop” is the SLiM script used by “analyze_1bp_biallelic_large_pop.py”.
“plot_true_gene_trees.py” contains the code that generates the results presented in Figure S1.
Using the aforementioned files, one should be able to replicate the analyses presented in our paper. Running these files requires the installation of SLiM, tskit, pyslim, SciPy, NumPy, and Matplotlib.
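Before running the scripts, it can help to check that the dependencies are available. The following is a minimal sketch, not part of the dataset: the function name is hypothetical, and it assumes that the tree-sequence toolkit is importable as `tskit` and that the SLiM executable is on the PATH under the name `slim`.

```python
import importlib.util
import shutil

# Import names assumed for the Python dependencies listed above.
PYTHON_PACKAGES = ["tskit", "pyslim", "scipy", "numpy", "matplotlib"]

def missing_dependencies(packages=PYTHON_PACKAGES, slim_executable="slim"):
    """Return the packages and executables that cannot be found."""
    missing = [p for p in packages if importlib.util.find_spec(p) is None]
    if shutil.which(slim_executable) is None:
        missing.append(slim_executable)
    return missing

if __name__ == "__main__":
    print("Missing:", missing_dependencies() or "nothing")
```

The check covers only the tools named in the requirements list; the external programs called by the scripts, such as IQ-TREE and Dsuite, would need to be verified separately.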
Change Log
10-28-2024
- File “analyze_Vanderpool_et_al.py” has been added, and the results generated by this script (Figure S1) have been added.
- More population size combinations have been added in “analyze_1000bp_nucleotide.py”, which generated the results presented in Figure 9.
- The previous Figure S2 has been moved to the main text and therefore is no longer contained in supplementary data. Additionally, because the new Figure S1 has been added, the previous Figure S1 becomes Figure S2.
4-15-2025
- Files “analyze_1bp_biallelic_large_pop.py” and “sim_three_species_1bp_biallelic_large_pop” have been added. These scripts implement simulations to show that rescaling does not alter our results (new Figure S1).
- Because the new Figure S1 has been added, the previous Figures S1 and S2 become Figures S2 and S3, respectively.
- The original SLiM recipe used for biallelic simulations has been added (original_biallelic_recipe.pdf).
- Some contents of the previous main text have been moved to the Supplementary Materials, including Supplementary Sections S1 and S2, as well as Supplementary Appendices S1–S3.
HE CHONG
email: biohe@foxmail.com