Population size differences can lead to biases in phylogenetic inference and introgression detection in the presence of purifying selection

He, Chong; Chen, Meng-Yun; Zhu, Hao

Published Feb 06, 2024; Updated Nov 25, 2025 on Dryad. https://doi.org/10.5061/dryad.2z34tmpsz

Data files

Feb 06, 2024 version files (2.95 GB): raw_data.zip (2.95 GB), README.md (4.95 KB)

Nov 01, 2024 version files (2.95 GB): raw_data.zip (2.95 GB), README.md (6.57 KB)

Apr 17, 2025 version files (2.95 GB): raw_data.zip (2.95 GB), README.md (7.66 KB)

Aug 10, 2025 version files (2.95 GB): raw_data.zip (2.95 GB), README.md (7.55 KB)

Nov 25, 2025 version files (2.95 GB): raw_data.zip (2.95 GB), README.md (6.04 KB)

Abstract

Assumptions about the probability distribution of gene tree topologies provide a basis for phylogenetic reconstruction and introgression detection. Initial evidence has suggested that in the presence of purifying selection, population size differences can affect the probability distribution of gene tree topologies. Nevertheless, the impact of this phenomenon on phylogenetic reconstruction and introgression detection remains to be explored, and a theoretical understanding of it is still lacking. Here, using the population genetic simulator SLiM, we provide evidence that in the presence of purifying selection, population size differences can cause biases in phylogenetic inference. We also provide evidence that in the presence of purifying selection, population size differences can cause statistics used for introgression detection to exhibit patterns resembling those caused by introgression. Additionally, a theoretical analysis is presented to show that the biological basis underlying the formation of gene trees is different under neutral evolution and under purifying selection, and that the population size dependency in gene tree distributions can be deduced from the inherent nature of purifying selection. This work underscores the importance of considering the potential confounding impact of purifying selection on phylogenetic inference and introgression detection.

https://doi.org/10.5061/dryad.2z34tmpsz

This dataset includes the Supplementary Materials and the scripts that produce the results described in the paper.

Description of the data and file structure

The Supplementary Materials contain the Supplementary Figures and more detailed explanations of the conclusions presented in the main text.

Part of the raw data can be found in the file "raw_data.zip", which includes the reconstructed gene trees used to generate Figures 2 and 4, the reconstructed phylogenetic trees used to generate Figure 3, and the Dsuite results used to generate Figure 4. The SLiM simulation results are not included in "raw_data.zip" because they are too large.

In "raw_data.zip", files with the suffix ".treefile" in the folder "iqtree_trees_four_species_1000bp_nucleotide_u=2.4e-6" are the reconstructed gene trees used to generate Figures 2 and 4; files with the suffixes "species.treefile" and ".fasta.treefile" in the folder "species_trees_four_species_1000bp_nucleotide_u=2.4e-6" are the reconstructed phylogenetic trees used to generate Figure 3; files with the suffix "combined_tree.txt" are the Dsuite results used to generate Figure 4. In each of the above-mentioned folders, subfolders with names that start with "neutral" contain results generated under neutral evolution, and subfolders with names that start with "deleterious" contain results generated under purifying selection. The numbers after "neutral" or "deleterious" (e.g. 2000_80_2000) are the population sizes of S3, S2, and S1, respectively.
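The naming convention above can be decoded programmatically. The following is a minimal sketch, assuming the three population sizes appear as underscore-separated integers after the "neutral" or "deleterious" prefix. The exact separator between the prefix and the numbers is an assumption, and the `Scenario` type and `parse_scenario` helper are hypothetical names that are not part of the dataset.

```python
import re
from typing import NamedTuple, Optional


class Scenario(NamedTuple):
    regime: str  # "neutral" or "deleterious" (purifying selection)
    n_s3: int    # population size of S3
    n_s2: int    # population size of S2
    n_s1: int    # population size of S1


# Assumed layout: prefix, optional non-digit separator, then N_S3_N_S2_N_S1.
_NAME = re.compile(r"^(neutral|deleterious)\D*(\d+)_(\d+)_(\d+)")


def parse_scenario(folder_name: str) -> Optional[Scenario]:
    """Return the regime and population sizes, or None if the name does not match."""
    match = _NAME.match(folder_name)
    if match is None:
        return None
    regime, n_s3, n_s2, n_s1 = match.groups()
    return Scenario(regime, int(n_s3), int(n_s2), int(n_s1))


# Hypothetical subfolder name built from the example in the text.
print(parse_scenario("neutral_2000_80_2000"))
```

Note the order of the fields: the first number is the size of S3 and the last is the size of S1, as stated above.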

Using the scripts included in this dataset, readers can replicate all the results described in our paper, including those included in "raw_data.zip" and the SLiM simulation results.

Sharing/Access information

The scripts used for simulation are also publicly available at:

https://github.com/he-chong/gene_tree_dist_under_purifying_sel_slim

Code/Software

File "analyze_1000bp_nucleotide.py" runs SLiM to simulate nucleotide sequences and true gene trees.

Files "sim_four_species_1000bp_nucleotide_neutral" and "sim_four_species_1000bp_nucleotide_deleterious" are the SLiM scripts used by "analyze_1000bp_nucleotide.py".

The nucleotide sequences generated by "analyze_1000bp_nucleotide.py" are the input of "run_iqtree.py", which runs IQ-TREE to reconstruct gene trees. The true gene trees generated by "analyze_1000bp_nucleotide.py" and the reconstructed gene trees generated by "run_iqtree.py" are the input of "plot_iqtree_result.py", which generates the results shown in Figure 2.

File "plot_iqtree_result.py" also generates the results shown in the upper panels in Figure 4, which demonstrate the impact of purifying selection on phylogenetic-based introgression detection.
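For readers who want to inspect the ".treefile" outputs directly, one plausible way to summarize them is to tally which two ingroup taxa are grouped together in each four-taxon tree. The sketch below is not taken from "plot_iqtree_result.py"; it is a standard-library illustration, and the taxon labels ("S1", "S2", "S3", "O") and the helper names are assumptions, since the labels used inside the tree files are not stated here.

```python
import re
from collections import Counter


def clades(newick):
    """Return the leaf set below every parenthesised node of a Newick string."""
    s = re.sub(r":[^,()]+", "", newick.strip().rstrip(";"))  # drop branch lengths
    s = re.sub(r"\)[^,()]*", ")", s)                         # drop internal labels
    stack, found = [set()], []
    for token in re.findall(r"[(),]|[^(),]+", s):
        if token == "(":
            stack.append(set())
        elif token == ")":
            done = stack.pop()
            found.append(frozenset(done))
            stack[-1] |= done
        elif token != ",":
            stack[-1].add(token.strip())
    return found


def sister_pair(newick, outgroup="O"):
    """Return the two ingroup taxa grouped together in a four-taxon tree."""
    found = clades(newick)
    leaves = max(found, key=len)  # the outermost node holds every leaf
    if len(leaves) != 4 or outgroup not in leaves:
        raise ValueError("expected four taxa including the outgroup")
    for clade in found:
        if len(clade) == 2:
            # A pair containing the outgroup implies the other two are grouped.
            pair = leaves - clade if outgroup in clade else clade
            return tuple(sorted(pair))
    return None  # unresolved (star) tree


# Hypothetical trees; real input would be read from the ".treefile" files.
trees = ["(S1:0.1,S2:0.1,(S3:0.1,O:0.2)95:0.05);", "((S2,S3),S1,O);"]
print(Counter(sister_pair(t) for t in trees))
```

The function accepts both unrooted trees (three children at the outermost node, as IQ-TREE writes them) and rooted ones, because it only looks for the two-versus-two split.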

File "infer_species_tree.py" performs phylogenetic reconstruction using the nucleotide sequences generated by "analyze_1000bp_nucleotide.py" and the inferred gene trees generated by "run_iqtree.py". The results produced by "infer_species_tree.py" are the input of "plot_species_tree.py", which generates the results shown in Figure 3.

File "run_sim_four_species_chromosome.py" runs SLiM to simulate VCF files. "sim_four_species_chromosome_neutral" and "sim_four_species_chromosome_deleterious" are the SLiM scripts used by "run_sim_four_species_chromosome.py".

The VCF files generated by "run_sim_four_species_chromosome.py" are the input of "run_dsuite_for_four_species_chromosome.py", which runs Dsuite to calculate the D-statistic. The results generated by "run_dsuite_for_four_species_chromosome.py" are the input of "plot_dsuite_result.py", which generates results shown in the lower panels in Figure 4. These results demonstrate the impact of purifying selection on D-statistic-based introgression detection.
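As a reminder of what the D-statistic measures, the sketch below computes Patterson's D from per-site derived-allele frequencies using the commonly cited ABBA/BABA formulation. It is an illustration only and does not reproduce Dsuite's implementation or the scripts in this dataset; the function name and the assignment of populations to P1, P2, P3, and the outgroup are assumptions.

```python
from typing import Iterable, Tuple


def patterson_d(freqs: Iterable[Tuple[float, float, float, float]]) -> float:
    """Patterson's D from derived-allele frequencies (p1, p2, p3, p_out) per site."""
    abba = 0.0
    baba = 0.0
    for p1, p2, p3, p_out in freqs:
        # ABBA: derived allele shared by P2 and P3; BABA: shared by P1 and P3.
        abba += (1 - p1) * p2 * p3 * (1 - p_out)
        baba += p1 * (1 - p2) * p3 * (1 - p_out)
    if abba + baba == 0:
        raise ValueError("no ABBA or BABA signal in the input")
    return (abba - baba) / (abba + baba)


# Hypothetical frequencies for three sites.
sites = [(0.0, 1.0, 1.0, 0.0), (1.0, 0.0, 1.0, 0.0), (0.0, 0.5, 1.0, 0.0)]
print(patterson_d(sites))
```

Under the usual null model, ABBA and BABA patterns are expected to be equally frequent, so D is expected to be close to zero; a consistent excess of one pattern is what is conventionally read as a signal of introgression.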

Dsuite requires information about which population/species each sampled individual belongs to; this information is provided in "sim_four_species_chromosome.population".
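Dsuite's documentation describes this kind of file as tab-separated, with one individual per row followed by its population/species, and with the keyword "Outgroup" marking outgroup samples. The sketch below builds text in that layout. The individual names, the population labels, and the `format_population_file` helper are hypothetical and are not taken from "sim_four_species_chromosome.population".

```python
from typing import Mapping


def format_population_file(assignments: Mapping[str, str]) -> str:
    """Return 'individual<TAB>population' lines in the layout Dsuite expects."""
    if "Outgroup" not in assignments.values():
        raise ValueError("at least one individual must be labelled 'Outgroup'")
    return "".join(f"{ind}\t{pop}\n" for ind, pop in assignments.items())


# Hypothetical sample names; they must match the sample names in the VCF header.
example = {"i0": "S1", "i1": "S2", "i2": "S3", "i3": "Outgroup"}
print(format_population_file(example), end="")
```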

File "analyze_1bp_biallelic.py" runs SLiM simulations for biallelic loci. "sim_three_species_1bp_biallelic" is the SLiM script used by "analyze_1bp_biallelic.py". The results generated by "analyze_1bp_biallelic.py" are the input of "plot_biallelic_result.py" and "plot_ancestral_mutation.py", which generate the results shown in Figure 5 and Supplementary Figure S3, respectively.

File "analyze_Vanderpool_et_al.py" runs SLiM simulations to demonstrate that changing the population sizes used by Vanderpool et al. (2020) affects gene tree frequencies. The true gene trees produced by "analyze_Vanderpool_et_al.py" are the input of "plot_true_gene_trees.py", which generates the results shown in Supplementary Figure S2. "sim_three_species_1000bp_nucleotide_vanderpool_pop_changed" and "sim_three_species_1000bp_nucleotide_vanderpool" are the SLiM scripts used by "analyze_Vanderpool_et_al.py".

The script "analyze_1000bp_nucleotide.py" also runs SLiM simulations to further explore the pattern of change in the probabilities of gene tree topologies. The true gene trees produced by these simulations are the input of "plot_true_gene_trees.py", which generates the results shown in Figure 10. These simulations also use the SLiM script "sim_four_species_1000bp_nucleotide_deleterious".

File "analyze_1bp_biallelic_large_pop.py" simulates gene trees for the results presented in Supplementary Figure S1.

"sim_three_species_1bp_biallelic_large_pop" is the SLiM script used by "analyze_1bp_biallelic_large_pop.py".

"plot_true_gene_trees.py" contains the code that generates the results presented in Supplementary Figure S1.

Using the aforementioned files, one should be able to replicate the analyses presented in our paper. Running these files requires the installation of SLiM, tskit, pyslim, SciPy, NumPy, and Matplotlib.
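The input/output relationships described in this section imply an order in which the Python scripts need to be run. The sketch below records those relationships as a dependency map and derives one valid order with the standard-library `graphlib` module (Python 3.9 or later). It only computes an order; whether the scripts take command-line arguments is not stated here, so nothing is executed. The `UPSTREAM` and `run_order` names are illustrative.

```python
from graphlib import TopologicalSorter

# Each script maps to the scripts whose output it consumes, as described above.
UPSTREAM = {
    "run_iqtree.py": {"analyze_1000bp_nucleotide.py"},
    "plot_iqtree_result.py": {"analyze_1000bp_nucleotide.py", "run_iqtree.py"},
    "infer_species_tree.py": {"analyze_1000bp_nucleotide.py", "run_iqtree.py"},
    "plot_species_tree.py": {"infer_species_tree.py"},
    "run_dsuite_for_four_species_chromosome.py": {
        "run_sim_four_species_chromosome.py"
    },
    "plot_dsuite_result.py": {"run_dsuite_for_four_species_chromosome.py"},
    "plot_biallelic_result.py": {"analyze_1bp_biallelic.py"},
    "plot_ancestral_mutation.py": {"analyze_1bp_biallelic.py"},
    "plot_true_gene_trees.py": {
        "analyze_Vanderpool_et_al.py",
        "analyze_1000bp_nucleotide.py",
        "analyze_1bp_biallelic_large_pop.py",
    },
}


def run_order(upstream=UPSTREAM):
    """Return the scripts in an order where every input is produced before use."""
    return list(TopologicalSorter(upstream).static_order())


for step, script in enumerate(run_order(), start=1):
    print(step, script)
```

Several orders are valid because the simulation branches are independent of one another; only the relative order within each branch matters.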

HE CHONG

email: biohe@foxmail.com
