Population size differences can lead to biases in phylogenetic inference and introgression detection in the presence of purifying selection
Data files
Feb 06, 2024 version files (2.95 GB)
- raw_data.zip (2.95 GB)
- README.md (4.95 KB)
Nov 01, 2024 version files (2.95 GB)
- raw_data.zip (2.95 GB)
- README.md (6.57 KB)
Abstract
Phylogenetic reconstruction and introgression detection are based on assumptions about the probability distribution of gene tree topologies. Initial evidence has suggested that population size differences can affect the probability distribution of gene tree topologies in the presence of purifying selection. Nevertheless, the impact of this phenomenon on phylogenetic reconstruction and introgression detection remains to be explored. Additionally, the mechanism underlying this phenomenon remains elusive. Here, using the population genetic simulator SLiM, we provide evidence that in the presence of purifying selection, population size differences can cause biases in phylogenetic inference. We also provide evidence that in the presence of purifying selection, population size differences can cause statistics used for introgression detection to exhibit patterns resembling those caused by introgression. Additionally, a theoretical analysis is presented to show that under purifying selection, the way in which single-population genealogies are connected together to form a gene tree can differ from that under neutral evolution as it is affected by population size differences. Consequently, the probability distribution of gene tree topologies under purifying selection is not identical to that under neutral evolution but instead is affected by population size differences. This work underscores the importance of considering the potential confounding impact of purifying selection on phylogenetic inference and introgression detection.
README: Population Size Differences Can Lead to Biases in Phylogenetic Inference and Introgression Detection in the Presence of Purifying Selection
https://doi.org/10.5061/dryad.2z34tmpsz
This dataset includes Supplementary Methods, Supplementary Figures, and the scripts that produce the results described in the paper.
Description of the data and file structure
Figure S1 shows the results of changing the population sizes used by Vanderpool et al. (2020). These results are intended to show that Vanderpool et al.'s simulations cannot be used to support their claim that "there should be no effect of negative selection on the distribution of tree topologies". Figure S2 shows the values of P(V1|V) under neutral evolution and under purifying selection. These results are intended to show that the condition P(V1|V) cannot be satisfied under either neutral evolution or purifying selection.
Part of the raw data can be found in the file "raw_data.zip", which includes the reconstructed gene trees used to generate Figures 2 and 4, the reconstructed phylogenetic trees used to generate Figure 3, and the Dsuite results used to generate Figure 4. The SLiM simulation results are not included in "raw_data.zip" because they are too large.
In "raw_data.zip", files with the suffix ".treefile" in the folder "iqtree_trees_four_species_1000bp_nucleotide_u=2.4e-6" are the reconstructed gene trees used to generate Figures 2 and 4; files with the suffixes "species.treefile" and ".fasta.treefile" in the folder "species_trees_four_species_1000bp_nucleotide_u=2.4e-6" are the reconstructed phylogenetic trees used to generate Figure 3; files with the suffix "combined_tree.txt" are the Dsuite results used to generate Figure 4. In each of these folders, subfolders whose names start with "neutral" contain results generated under neutral evolution, and subfolders whose names start with "deleterious" contain results generated under purifying selection. The numbers after "neutral" or "deleterious" (e.g. 2000_80_2000) are the population sizes of S3, S2, and S1, respectively.
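The subfolder naming convention described above can be decoded programmatically when looping over the results. The following is a minimal sketch, not part of the deposited scripts. It assumes that the prefix and the three numbers are joined by underscores (e.g. "neutral_2000_80_2000"); the exact separator after the prefix is an assumption, and `Scenario` and `parse_scenario` are hypothetical names.

```python
import re
from typing import NamedTuple


class Scenario(NamedTuple):
    regime: str  # "neutral" or "deleterious"
    n_s3: int    # population size of S3
    n_s2: int    # population size of S2
    n_s1: int    # population size of S1


# Assumed layout: regime prefix, optional non-digit separator,
# then three integers separated by underscores (order: S3, S2, S1).
_PATTERN = re.compile(r"^(neutral|deleterious)\D*?(\d+)_(\d+)_(\d+)")


def parse_scenario(folder_name: str) -> Scenario:
    """Split a results subfolder name into regime and population sizes."""
    match = _PATTERN.match(folder_name)
    if match is None:
        raise ValueError(f"unrecognised subfolder name: {folder_name!r}")
    regime, n_s3, n_s2, n_s1 = match.groups()
    return Scenario(regime, int(n_s3), int(n_s2), int(n_s1))
```

For example, `parse_scenario("neutral_2000_80_2000")` yields a record with `n_s3=2000`, `n_s2=80`, and `n_s1=2000`. If the actual subfolder names use a different separator, the pattern needs to be adjusted accordingly.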
Using the scripts included in this dataset, readers can replicate all the results described in our paper, including those included in "raw_data.zip" and the SLiM simulation results.
Sharing/Access information
The scripts used for simulation are also publicly available at:
https://github.com/he-chong/gene_tree_dist_under_purifying_sel_slim
Code/Software
File "analyze_1000bp_nucleotide.py" runs SLiM to simulate nucleotide sequences and true gene trees.
Files "sim_four_species_1000bp_nucleotide_neutral" and "sim_four_species_1000bp_nucleotide_deleterious" are the SLiM scripts used by "analyze_1000bp_nucleotide.py".
The nucleotide sequences generated by "analyze_1000bp_nucleotide.py" are the input of "run_iqtree.py", which runs IQ-TREE to reconstruct gene trees. The true gene trees generated by "analyze_1000bp_nucleotide.py" and the reconstructed gene trees generated by "run_iqtree.py" are the input of "plot_iqtree_result.py", which generates the results shown in Figure 2.
File "plot_iqtree_result.py" also generates the results shown in the upper panels of Figure 4, which demonstrate the impact of purifying selection on phylogeny-based introgression detection.
File "infer_species_tree.py" performs phylogenetic reconstruction using the nucleotide sequences generated by "analyze_1000bp_nucleotide.py" and the inferred gene trees generated by "run_iqtree.py". The results produced by "infer_species_tree.py" are the input of "plot_species_tree.py", which generates the results shown in Figure 3.
File "run_sim_four_species_chromosome.py" runs SLiM to simulate VCF files. "sim_four_species_chromosome_neutral" and "sim_four_species_chromosome_deleterious" are the SLiM scripts used by "run_sim_four_species_chromosome.py".
The VCF files generated by "run_sim_four_species_chromosome.py" are the input of "run_dsuite_for_four_species_chromosome.py", which runs Dsuite to calculate the D-statistic. The results generated by "run_dsuite_for_four_species_chromosome.py" are the input of "plot_dsuite_result.py", which generates results shown in the lower panels in Figure 4. These results demonstrate the impact of purifying selection on D-statistic-based introgression detection.
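For orientation, the classic count-based form of the D-statistic is D = (nABBA - nBABA) / (nABBA + nBABA), where ABBA and BABA are biallelic site patterns across the taxa (P1, P2, P3, outgroup). The sketch below illustrates that textbook formula only; it is not the code used in this dataset. Dsuite itself works from population allele frequencies in the VCF, so its values come from a more general calculation. The function names and the 0/1 allele coding are assumptions made for illustration.

```python
from typing import Iterable, Tuple

Site = Tuple[int, int, int, int]  # alleles (0 or 1) for P1, P2, P3, outgroup


def count_abba_baba(sites: Iterable[Site]) -> Tuple[int, int]:
    """Count ABBA and BABA patterns, treating the outgroup allele as ancestral."""
    n_abba = n_baba = 0
    for p1, p2, p3, outgroup in sites:
        if p3 == outgroup:
            continue  # P3 carries the ancestral allele: neither pattern
        if p1 == outgroup and p2 == p3:
            n_abba += 1  # derived allele shared by P2 and P3
        elif p2 == outgroup and p1 == p3:
            n_baba += 1  # derived allele shared by P1 and P3
    return n_abba, n_baba


def d_statistic(n_abba: float, n_baba: float) -> float:
    """Return (ABBA - BABA) / (ABBA + BABA)."""
    total = n_abba + n_baba
    if total == 0:
        raise ValueError("no ABBA or BABA sites; D is undefined")
    return (n_abba - n_baba) / total
```

A value of D near zero indicates that the two discordant patterns are about equally common, which is the expectation that introgression tests compare against.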
Running Dsuite requires information about the source population/species of each sampled individual. This information is written in the file "sim_four_species_chromosome.population".
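Dsuite reads this kind of information from a tab-separated text file with one individual per row (individual name, then population/species name), with outgroup individuals labelled "Outgroup". The sketch below shows how such a file could be assembled. The sample names and the `format_sets` helper are hypothetical, and the deposited "sim_four_species_chromosome.population" file may use different individual names and labels.

```python
from typing import Mapping


def format_sets(assignments: Mapping[str, str]) -> str:
    """Render individual-to-population assignments as tab-separated rows."""
    if "Outgroup" not in assignments.values():
        raise ValueError('at least one individual must be labelled "Outgroup"')
    return "".join(f"{ind}\t{pop}\n" for ind, pop in assignments.items())


if __name__ == "__main__":
    # Hypothetical sample names; the population labels follow the S1/S2/S3 naming above.
    example = {"i0": "S1", "i1": "S2", "i2": "S3", "i3": "Outgroup"}
    print(format_sets(example), end="")
```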
File "analyze_1bp_biallelic.py" runs SLiM simulations for biallelic loci. "sim_three_species_1bp_biallelic" is the SLiM script used by "analyze_1bp_biallelic.py". The results generated by "analyze_1bp_biallelic.py" are the input of "plot_biallelic_result.py" and "plot_ancestral_mutation.py", which generate the results shown in Figure 5 and Supplementary Figure S2, respectively.
File "analyze_Vanderpool_et_al.py" runs SLiM simulations to examine how changing the population sizes used by Vanderpool et al. (2020) affects gene tree frequencies. "sim_three_species_1000bp_nucleotide_vanderpool_pop_changed" and "sim_three_species_1000bp_nucleotide_vanderpool" are the SLiM scripts used by "analyze_Vanderpool_et_al.py". The true gene trees produced by "analyze_Vanderpool_et_al.py" are the input of "plot_true_gene_trees.py", which generates the results shown in Supplementary Figure S1.
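Gene tree frequencies for three species amount to tallying which pair of species is sister in each tree. The following is a minimal sketch of that tally and is not taken from "plot_true_gene_trees.py". It assumes the gene trees are available as rooted three-taxon Newick strings with one tip per species; the tip labels S1, S2, and S3 and the function names are assumptions.

```python
import re
from collections import Counter
from typing import Dict, FrozenSet, Iterable

_BRANCH_LENGTH = re.compile(r":[0-9.eE+-]+")
_CHERRY = re.compile(r"\(([^(),;]+),([^(),;]+)\)")


def sister_pair(newick: str) -> FrozenSet[str]:
    """Return the two tips that form the inner clade of a rooted three-taxon tree."""
    stripped = _BRANCH_LENGTH.sub("", newick.strip())
    match = _CHERRY.search(stripped)
    if match is None:
        raise ValueError(f"no resolved sister pair in {newick!r}")
    return frozenset(match.groups())


def topology_frequencies(newicks: Iterable[str]) -> Dict[FrozenSet[str], float]:
    """Fraction of trees supporting each sister pair."""
    counts = Counter(sister_pair(tree) for tree in newicks)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}
```

Keying the result by an unordered pair means that "((S1,S2),S3);" and "(S3,(S2,S1));" are counted as the same topology.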
"analyze_1000bp_nucleotide.py" also runs SLiM simulations to further explore how the frequencies of the three possible gene tree topologies change as population sizes change. "plot_true_gene_trees.py" was also used to generate the results of this part of the analysis, which are shown in Figure 9.
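As a neutral baseline for comparison, the standard multispecies coalescent result for three species gives the concordant topology a probability of 1 - (2/3)e^(-T) and each discordant topology a probability of (1/3)e^(-T), where T is the internal branch length in coalescent units. The sketch below evaluates these expressions. It assumes one sampled lineage per species, a diploid Wright-Fisher population with T = t / (2N), and no gene flow; the function and parameter names are hypothetical and do not come from the deposited scripts.

```python
import math
from typing import Tuple


def neutral_topology_probs(generations: float, ancestral_size: float) -> Tuple[float, float, float]:
    """Neutral probabilities of (concordant, discordant 1, discordant 2) gene trees.

    generations:    length of the internal branch in generations
    ancestral_size: diploid size N of the population along that branch
    """
    if generations < 0 or ancestral_size <= 0:
        raise ValueError("generations must be >= 0 and ancestral_size > 0")
    t_coalescent = generations / (2.0 * ancestral_size)
    discordant = math.exp(-t_coalescent) / 3.0
    return 1.0 - 2.0 * discordant, discordant, discordant
```

Under these neutral assumptions the two discordant topologies are always equally probable, and only the size of the population along the internal branch enters the formula.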
Using the above-described scripts, researchers can replicate all the results described in our paper. Running these scripts requires the installation of Python3, SLiM, tskit, pyslim, SciPy, NumPy, and Matplotlib.
Change Log
10-28-2024
- File "analyze_Vanderpool_et_al.py" has been added, and the results generated by this script (Figure S1) have been added.
- More population size combinations have been added to "analyze_1000bp_nucleotide.py", which generated the results presented in Figure 9.
- The previous Figure S2 has been moved to the main text and is therefore no longer contained in the supplementary data. Additionally, because the new Figure S1 has been added, the previous Figure S1 has become Figure S2.
HE CHONG
email: biohe@foxmail.com