Several methods have been proposed to test for introgression across genomes. One method tests for a genome-wide excess of shared derived alleles between taxa using Patterson's D statistic, but does not establish which loci show such an excess or whether the excess is due to introgression or ancestral population structure. Several recent studies have extended the use of D by applying the statistic to small genomic regions, rather than genome-wide. Here, we use simulations and whole genome data from Heliconius butterflies to investigate the behavior of D in small genomic regions. We find that D is unreliable in this situation as it gives inflated values when effective population size is low, causing D outliers to cluster in genomic regions of reduced diversity. As an alternative, we propose a related statistic f̂d, a modified version of a statistic originally developed to estimate the genome-wide fraction of admixture. f̂d is not subject to the same biases as D, and is better at identifying introgressed loci. Finally, we show that both D and f̂d outliers tend to cluster in regions of low absolute divergence (dXY), which can confound a recently proposed test for differentiating introgression from shared ancestral variation at individual loci.
README
General summary of scripts and commands used in this study.
Figure_1.R
Script to generate plots in Figure 1 B and C.
compare_f_estimators.r
Script to generate Figures 2 and S1.
Figures_3_S3.R
Script to generate Figures 3 and S3. Requires data files such as Heliconius_autosome_windows_5kb.csv and Heliconius_Zchromosome_windows_5kb.csv. These names are hard-coded into this script, so editing is required to load different files.
Figure_4.R
Script to generate Figure 4. Requires as input files such as model_files_win10000_s0.01_l5000_r50.alternate_models.dxy.summary.sg.tsv, generated using run_model_combinations.py, shared_ancestry_simulator.R and generate_summary_statistics.R.
Figure_5.R
Script to generate Figure 5. Requirees as input files such as model_files_win10000_s0.01_l5000_r50.alternate_models.dxy.summary.sg.tsv, generated using run_model_combinations.py shared_ancestry_simulator.R and generate_summary_statistics.R.
egglib_sliding_windows.py
Python script to calculate ABBA BABA statistics, as well as pi and dXY from heliconius whole-genome data. It makes use of the EGGLIB library. Input was a "calls" format file, provided in Martin et al. 2013. Window size is specified with the -w flag, sliding increment with the -i flag and minimum number of sites with the -m flag. The latter is a hard cutoff, and windows with fewer sites are discarded. There is also a soft cutoff between 0 and 1, specified with --minimumExploitableData. The script will output a column called sitesOverMinExD. At a value of 0.5, this would report the number of sites in the window that had genotype calls for at least 50% of the individuals. To analyse autosomes and Z-linked scaffolds separately, the --include and --exclude flags were used, along with the file Hmel1-1_Zupdated_Zscafs.txt, which provides names of all Z-linked scaffolds provided in Martin et al. 2013. For the Z chromosome analysis, ploidy was specified using the --ploidy flag, because there were two females in the dataset of Martin et al. 2013.
Hmel1-1_Zupdated_Zscafs.txt
List of Z-liniked scaffolds, used when running egglib_sliding_windows.py.
run_model_combinations.py
Script to generate YAML files for use by shared_ancestry_simulator.R. These can be generated as follows: ./run_model_combinations.py -m Model_parameter_list.csv -w 10000 -t 30 -s 0.01 -l 5000 -r 5
./run_model_combinations.py -m Model_parameter_list.csv -w 10000 -t 30 -s 0.01 -l 5000 -r 50
Model_parameter_list.csv
Parameter list used by run_model_combinations.py to generate the YAML files used by shared_ancestry_simulator.R.
shared_ancestry_simulator.R
A single combined model can be generated like this: ./shared_ancestry_simulator.R -w 10000 -t 60 -c Alternate_t123-0.4_t23-0.2.yml:0.1,Background_t123-0.6_t21-0.4.yml:0.9. This will generate 10000 windows, 10% of which will be generated using the model described in the file Alternate_t123-0.4_t23-0.2.yml and 90% of which will be generated using Background_t123-0.6_t21-0.4.yml, using 60 threads. See the model files folders for the YAML files generated for this paper. The CSV files for the models will be made available in a Data Dryad repository on publication and can be made available on request.
A single model, as used for the null models reported in the paper, can be run like this: ./shared_ancestry_simulator.R -w 10000 -t 60 -c Background_t123-0.6_t21-0.4.yml:1. The YAML files are generated using run_model_combinations.py.
generate_summary_statistics.R
Summary statistics for the models found in the partition.summary and dxy.summary files were generated as follows:
./generate_summary_statistics.R -m model_files_win10000_s0.01_l5000_r5 -l Model_parameter_list.csv -t 10
./generate_summary_statistics.R -m model_files_win10000_s0.01_l5000_r50 -l Model_parameter_list.csv -t 10
Summary files are produced for alternate and null models and for ms and Seq-Gen output. The Seq-Gen files used for the paper analyses are included in the repository.
model_files_win10000_s0.01_l5000_r5.alternate_models.dxy.summary.sg.tsv
model_files_win10000_s0.01_l5000_r5.alternate_models.partition.summary.sg.tsv
model_files_win10000_s0.01_l5000_r5.null_models.dxy.summary.sg.tsv
model_files_win10000_s0.01_l5000_r5.null_models.partition.summary.sg.tsv
model_files_win10000_s0.01_l5000_r50.alternate_models.dxy.summary.sg.tsv
model_files_win10000_s0.01_l5000_r50.alternate_models.partition.summary.sg.tsv
model_files_win10000_s0.01_l5000_r50.null_models.partition.summary.sg.tsv
model_files_win10000_s0.01_l5000_r50.null_models.dxy.summary.sg.tsv
model_results_table.R
Summarize all tests for differences in mean dXY in a single table.
model_results_table
Summary of all tests for differences in mean dXY.
Heliconius_autosome_windows_5kb
Results of analysis of Heliconius autosomes, with 5kb windows.
Heliconius_autosome_windows_10kb
Results of analysis of Heliconius autosomes, with 10 kb windows.
Heliconius_autosome_windows_20kb
Results of analysis of Heliconius autosomes, with 20 kb windows.
Heliconius_Zchromosome_windows_50kb
Results of analysis of Heliconius autosomes, with 50 kb windows.
Heliconius_autosome_windows_50kb.csv
Heliconius_Zchromosome_windows_5kb
Results of analysis of Heliconius Z chromosome, with 5 kb windows.
Heliconius_Zchromosome_windows_10kb
Results of analysis of Heliconius Z chromosome, with 20 kb windows.
Heliconius_Zchromosome_windows_20kb
Results of analysis of Heliconius Z chromosome, with 20 kb windows.
Heliconius_Zchromosome_windows_50kb
Results of analysis of Heliconius Z chromosome, with 50 kb windows.
Figure_S2.R
Script to generate Figure S2.
Figure_S4.R
Script to generate Figure S5.