Data from: Strong genetic differentiation but limited niche partitioning in a sympatric species pair separated by an allochronic reproductive barrier
Data files
Mar 23, 2026 version files 560.65 MB
-
d_trios_summary.R
4.30 KB
-
dsuite_results.tar.gz
3.15 MB
-
faststructure_pophelper_plotter.R
4.14 KB
-
faststructure_results.tar.gz
2.72 MB
-
faststructure_results.with_outliers.tar.gz
2.84 MB
-
host_data.csv
56.10 KB
-
host_use_analysis_v3.R
11.65 KB
-
lmm.R
3.34 KB
-
neo_CPUE_lmm.xlsx
20.64 KB
-
pixy_results.tar.gz
692.86 KB
-
README.md
22.46 KB
-
relernn_results.tar.gz
2.68 KB
-
seas_habitat_analyses.R
10.35 KB
-
seasonal_abundance_analysis_historical.R
21.08 KB
-
seasonal_data_long.csv
10.70 KB
-
seasonal_data_modern.csv
1.52 KB
-
stacked_plots.R
22.68 KB
-
variant_samples_metadata.xlsx
15.67 KB
-
variants_plus_invariants.vcf.gz
494.56 MB
-
variants.vcf.gz
56.48 MB
-
vcf_pca_plotter.R
4.10 KB
Abstract
In the absence of differential adaptation to ecological niche and the development of reproductive barriers, boundaries between co-existing species are expected to breakdown. The fruit fly species pair, Bactrocera tryoni and B. neohumeralis, have significant overlap in geographic range and host use, with mating time the only confirmed difference in their mating systems. Using a combination of ecological and genomic data, we tested the roles of competition and reproductive isolation on the maintenance of species boundaries in this pair. Genome-wide SNP analyses found strong genetic differentiation between the species with no evidence for hybridization. Most outlier SNPs were restricted to narrow regions towards the centromeres and telomeres of chromosomes, while high nucleotide diversity rates were observed in both species. Enrichment of annotation terms indicated an overabundance of genes with the ‘abnormal neuroanatomy’ term, but non-enrichment of terms associated with sleep and circadian rhythm. Field data found minimal evidence for ecological differentiation based on significant positive temporal, spatial and host-use correlations between the species. However, B. tryoni appears less sensitive to low-humidity than B. neohumeralis and used hosts which were not used by B. neohumeralis suggesting subtle micro-ecological variation between species contributing to reproductive isolation. The simultaneous comparisons of molecular and ecological data in our study have provided a more nuanced understanding of how species boundaries have been maintained in this sibling species pair. Our analyses highlight the importance of genetic divergence for their maintenance in sympatry, but less for a role of competitive displacement or other ecological adaptation.
Name: Mitchell Irvine
Institution: Queensland University of Technology
Email: mitchell.irvine@hdr.qut.edu.au
Dataset Overview
This repository contains both molecular and ecological data used in the analyses by Irvine et. al. (in review and available as a preprint) to assess several hypotheses regarding the population genetics and ecology of two sympatric species: Bactrocera. tryoni and B. neohumeralis.
Molecular analyses
Samples were collected from 19 locations throughout their sympatric range. Subsequent to sequencing and data mapping, variants were called with Freebayes and filtered to retain only biallelic polymorphisms; the file in this state is provided in this repository.
Most downstream analyses also removed the samples with genotype codes 526, 528, 529, 530, and 531, as they were deemed to be outliers with potential sequencing library issues. The variant calls subsequent to these samples being removed were used in a variety of downstream analyses,e.g., fastSTRUCTURE. For Dsuite, we applied some additional filtering steps as detailed in our manuscript. To facilitate pixy analysis, we also called invariant sites with bcftools and merged these into the original Freebayes variant calls with a custom script (merge_invariant_sites.py from https://github.com/zkstewart/Various_scripts).
Reference genome
Variants were called using the CSIRO_BtryS06_freeze2 assembly and annotation as available through NCBI under accession GCF_016617805.1.
This file has 191 contigs/scaffolds in total. Five autosomal scaffolds are contained within, and are denoted with 'NC_' prefix, where all other fragmentary sequences are denoted with 'NW_' prefix.
Recommended Citation
Irvine, M., Stewart, Z., Kumaran, N., Manawaduge, C. G., Ryan, J., Balagawi, S., Missenden, B., Starkie, M., Clarke, A. R., Hurwood, D., & Prentis, P. (2025). Strong genetic differentiation but limited niche partitioning in a sympatric species pair separated by an allochronic reproductive barrier (p. 2023.08.30.555036). bioRxiv. https://doi.org/10.1101/2023.08.30.555036
Description of the data and file structure
Variant call data are provided in Variant Call Format (VCF) for study replication purposes. Several scripts used to produce plots are also provided for the same purpose, with some program commands indicated for clarity of our bioinformatic methods.
Files and Folders
variants.vcf.gz
VCF produced by Freebayes and compressed using bgzip. Samples are identified by their genotype and species name with '_' joining (e.g., '502_neohumeralis'). See the variant_samples_metadata.xlsx file for details.
File contains 28,334 biallelic polymorphism calls for 186 samples. Notably, the five outlier samples (genotype codes 526, 528, 529, 530, and 531) are present in this file. Filtering of these samples gives the variants.no_outliers.vcf.gz file mentioned throughout this README.
variants_plus_invariants.vcf.gz
VCF produced by Freebayes with subsequent incorporation of invariant sites called by bcftools and compressed using bgzip. File is otherwise formatted equivalently to variants.vcf.gz.
File contains the same 28,334 biallelic polymorphism calls as variants.vcf.gz, but has an additional 17,184,726 invariant site calls, giving a total number of 17,213,060 data rows. All data is for the same 186 samples as in the variants.vcf.gz file. Similarly to the original variants VCF, filtering of outlier samples gives the variants_plus_invariants.no_outliers.vcf.gz file mentioned in this README.
variant_samples_metadata.xlsx
Table file with five columns describing the 186 samples used for variant analyses:
- genotype
- An integer code used to identify the sample in our study.
- species
- A string indicating whether the sample is of Bactrocera neohumeralis or Bactrocera tryoni.
- location
- A two-letter abbreviation for sampling locations. Table S1 associated with our manuscript relates these abbreviations to their full site names.
- latitude/longitude
- The decimal degrees indicate the geographic coordinates of each sampling site.
faststructure_results.tar.gz
Result of using fastSTRUCTURE with K=1 through K=4. Command-line usage is indicated below:
plink2 --vcf variants.no_outliers.vcf.gz --make-bed --out variants.no_outliers --allow-extra-chr
for k in 1 2 3 4; do python /location/of/fastStructure/structure.py -K ${k} --input variants.no_outliers --output variants.no_outliers.fs_simple --full --seed=100 --format=bed; done
faststructure_results.with_outliers.tar.gz
As above, but run with the five outlier samples included. Can be used to assess how K=3 looks.
pixy_results.tar.gz
The result of running Pixy is:
pixy --stats pi fst dxy
--vcf variants_plus_invariants.no_outliers.vcf.gz
--populations species_metadata.tsv
--window_size 50000
--n_cores 12
--output_folder 50kb
The species_metadata.tsv file contains two columns: the first is the samples identified by their genotype and species name with '_' joining (e.g., '502_neohumeralis'), and the second is any identifier string to group each species together (e.g., all B. neohumeralis samples are 'group1' and all B. tryoni samples are 'group2')
relernn_results.tar.gz
The BSCORRECTED results of the eLERRN prediction of recombination rate for each species. This was run using command-line instructions as indicated below.
VARSCRIPTDIR="/mnt/c/git/Various_scripts" # local clone of https://github.com/zkstewart/Various_scripts
SEED=111
SPECIES=neohumeralis # or tryoni
python ${VARSCRIPTDIR}/popgen/ReLERNN/filter_fasta_by_snp_num.py -f ${GENOME} -v variants.no_outliers.${SPECIES}.vcf -o genome.bed # VCF has been subset to just this $SPECIES beforehand
ReLERNN_SIMULATE
--vcf variants.no_outliers.${SPECIES}.vcf
--genome genome.bed
--projectDir ${SPECIES}
--nCPU 12
--unphased
--seed ${SEED}
# STEP 4: Train
ReLERNN_TRAIN
--projectDir ${SPECIES}
--nEpochs 100
--nValSteps 20
--gpuID 0
--nCPU 1
--seed ${SEED}
# STEP 5: Predict
ReLERNN_PREDICT
--vcf variants.no_outliers.${SPECIES}.vcf
--projectDir ${SPECIES}
--seed ${SEED}
# Parametric Bootstrapping
ReLERNN_BSCORRECT
--projectDir ${SPECIES}
--nSlice 100
--nReps 100
--seed ${SEED}
dsuite_results.tar.gz
Results obtained from running Dsuite as indicated below. This archive contains the additional files,e.g., phylogenetic tree, needed to run the analysis.
# Filter and format variants
vcftools --gzvcf variants.no_outliers.vcf.gz --max-missing 0.8 --mac 2 --remove-filtered-all --recode --recode-INFO-all --out dsuite.filtered.vcf
mv dsuite.filtered.vcf.recode.vcf dsuite.filtered.vcf
bgzip dsuite.filtered.vcf
tabix dsuite.filtered.vcf.gz
# Run Dsuite
/location/of/Dsuite Dtrios tabix dsuite.filtered.vcf.gz dsuite_sets_all_subset.txt -o dsuite_results -t dsuite.nwk --ABBAclustering
/location/of/Dsuite Fbranch dsuite.nwk dsuite_results_tree.txt > fbranch.txt
/location/of/utils/dtools.py --outgroup oleae_SRR8788622 fbranch.txt dsuite.nwk
/location/of/Dsuite Dinvestigate -n dsuite_Dinvestigate dsuite.filtered.vcf.gz dsuite_sets_all_subset.txt test_trios.txt
d_trios_summary.R
Dsuite outputs were summarized and p-values corrected using this R script.
Code/Software
All R scripts in this repository were run in R version 4.4.1.
vcf_pca_plotter.R
This R script can be used to replicate the PCA plot shown in Figure 2a of the manuscript. The PCA was manually mirrored horizontally with label modifications made for visual purposes.
faststructure_pophelper_plotter.R
This R script can be used to replicate the fastSTRUCTURE plot shown in Figure 2b of the manuscript. Many visual elements, including labels and legends, were manually formatted.
stacked_plots.R
This R script can be used to produce the stacked statistics plots as seen in Figure 3 and Supplementary Figures S5-S8. Some additional files are required to operate, and are indicated as comments in the script file.
Ecological analyses
The ecological analyses presented in the manuscript included seasonal abundance (historical dataset), seasonal abundance (modern dataset), habitat abundance, landscape analysis, and host fruit abundance. Raw data and R scripts for each analysis are provided below.
Seasonal abundance (historical dataset)
Historical seasonal abundance data were taken from May, A. W. S. (1961). A taxonomic and ecological study of the Dacinae (fam. Trypetidae) in Queensland. [Thesis (Ph.D.), University of Queensland]. Here are the raw data extracted from the thesis.
seasonal_data_long.csv
- This dataset shows monthly counts of two species (tryoni and neohumeralis) at a single location (“ath”) from July 1955 to November 1956.
- Tryoni consistently has higher counts, peaking around October–November 1955, then declining through 1956 with some fluctuations.
- Neohumeralis appears in lower numbers and drops to zero from mid-1956 onward, indicating a possible disappearance or seasonal absence.
seasonal_abundance_analysis_historical.R
This R script can be used to calculate the Pearson's correlation and paired t-test statistics presented in the manuscript. This script also includes the code used to generate Figure 4 of the manuscript.
Seasonal abundance (modern dataset), habitat abundance, and B. neohumeralis landscape analyses
The data used for the modern seasonal abundance, habitat abundance, and B. neohumeralis landscape analyses were collected as part of a previous study by Ryan, J., Clarke, A. R., Piper, A. M., Fuller, S., & Prentis, P. J. (2025). Gene flow and abundance of a tropical fruit fly in a horticultural landscape mosaic in eastern Australia are limited by cleared grazing land and area-wide management. Evolutionary applications, 18(4). https://doi.org/https://doi.org/10.1111/eva.70097. The raw data used in this manuscript are provided below.
seasonal_data_modern.csv
- This dataset shows counts of two species (neohumeralis = neo and tryoni = try) across multiple sites and habitat types over time (Apr 2021–Apr 2022).
- Tryoni generally has higher counts and is more widespread, while neohumeralis appears more variable, with spikes in certain rainforest and residential sites.
- Species abundance changes by season and habitat, with rainforest sites showing particularly high and fluctuating populations.
seas_habitat_analyses.R
This R script can be used to calculate the Pearson's correlation and paired t-test statistics presented in the manuscript. This script also includes the code used to generate Figures S10 & S11 of the manuscript.
Here are the raw data used in the linear mixed modelling analysis comparing B. neohumeralis abundance with landscape features. Results are found in Tables S10 and S11. The code for performing the LLM analysis can be accessed in the R script below.
neo_CPUE_lmm.xlsx
This table shows different monitoring sites in April 2021, with their geographic coordinates (longitude/latitude), elevation, and distance to a water source.
For each site, it also gives land-use composition (%) (e.g., grazing, residential, eucalyptus, rainforest, produce) and a B.neo value (likely a measured environmental or biological indicator).
Overall, it compares how landscape characteristics and location vary across sites and may relate to the B.neo measurements.
lmm.R
This R script loads data and fits a linear mixed-effects model to test how environmental variables (e.g., land use, location, water source) influence B. neohumeralis abundance, with Month as a random effect.
It then uses model selection (AICc) and model averaging to identify and combine the best-performing models, giving robust estimates and confidence intervals for the key predictors.
Host abundance
The host abundance analyses were conducted using data collected from the QDPI Maroochy Research Facility in Nambour, QLD, Australia, over three years.
host_data.csv
This dataset records fruit sampling events at a site over time, including collection date, fruit species (common and scientific names), fruit stage (e.g., ripe/mature), total weight, and number of fruits collected.
It also tracks whether two fruit fly species—Bactrocera tryoni and Bactrocera neohumeralis—were detected in each sample (all zeros here = none found).
host_use_analysis_v3.R
This R script can be used to calculate the Pearson's correlation and paired t-test statistics presented in the manuscript. This script also iincludesthe code used to generate Figure S13 of the manuscript.
Supplementary data
Table_S1.csv Trapping records from 19 collection sites for molecular data analysis. The individuals with an intermediate phenotype were allocated to a separate “site” labelled intermediate (acronym – I) for the fastSTRUCTURE analysis.
Table_S10.csv Model support for top performing linear mixed models (LMMs) correlating Bactrocera neohumeralis abundance with landscape predictors after model averaging (top performing models all have ∆AICc < 2 prior to model averaging). Models are ranked according to AIC corrected for finite sample size (∆AICc) and Akaike weight (ω). Degrees of freedom (df) for each model are also given.
Table_S11.csv Statistical support for each landscape predictor following model averaging. Reported are the linear coefficient, standard error (Std. error), Z-score, p value, 2.5% and 97.5% confidence intervals (CI). Bold text denotes statistical significance.
Table_S12.csv Top five performing linear mixed models (LMMs) correlating B. tryoni abundance with landscape predictors. Models are ranked according to AIC corrected for finite sample size (∆AICc) and Akaike weight (ω). Degrees of freedom (df) for each model are also given. Data taken from Ryan et al. (2025).
Table_S13.csv Results of model averaging for the top five performing linear mixed models (LMMs). Reported are the linear coefficient, standard error (Std. error), Z-score, p value, 2.5% and 97.5% confidence intervals (CI). Bold text denotes statistical significance in a model, and CIs that do not overlap 0. Data taken from Ryan et al. (2025).
Table_S14.csv Results for the host use data t-test and correlation analyses between B. tryoni and B. neohumeralis.
Table_S2.csv Read counts obtained for Bactrocera neohumeralis samples after demultiplexing with process_radtags. Also shown are the percentage of reads that successfully mapped back to the genome sequence.
Table_S3.csv Read counts obtained for Bactrocera tryoni samples after demultiplexing with process_radtags. Also shown are the percentage of reads that successfully mapped back to the genome sequence
Table_S4.csv A selection of candidate genes containing or in proximity to outlier SNPs. Genes were selected based on their functional annotations, of which several terms were of interest for hypothesis generation, including: ‘abnormal neuroanatomy’, ‘nervous system development’, ‘sensory system’, ‘sensory organ development’, ‘abnormal sleep’, ‘circadian rhythm’, ‘abnormal circadian rhythm’, ‘abnormal circadian behaviour’, ‘inositol biosynthetic process’, and ‘inositol-1,4,5-trisphosphate 3-kinase activity’. These terms are presented in bold in their corresponding annotation category (molecular function, biological process, phenotype, and phenotypic class) from the FlyBase automated gene summaries. Bactrocera tryoni gene identifiers are indicated in the ‘Gene ID’ column, with the best FlyBase gene hit indicated, including its gene name and FlyBase identifier in parentheses.
Table_S5.csv Trap catches of B. tryoni: B. neohumeralis for seasonal activity in various locations in Queensland, Australia. For Atherton, Ayr, Rockhampton, and Lawes, weekly data were summed to a calendar month. For Maryborough, Sunnybank, Kamerunga, and Gayndah, the percent mean of total trap catches was calculated. For South Johnstone, Rita Island, Nambour, Withcott, Toowoomba, Offham, and Stanthorpe, only total trap catch data were available.
Table_S6.csv Results for the broad-scale seasonal abundance data t-test and correlation analyses between B. tryoni and B. neohumeralis.
Table_S7.csv Results for the fine-scale seasonal abundance data t-test and correlation analyses between B. tryoni and B. neohumeralis.
Table_S8.csv Trap catches of B. tryoni: B. neohumeralis for fine-scale seasonal activity and habitat use in various locations in the Wide-Bay/Burnett region of Queensland, Australia
Table_S9.csv Results for the habitat use data t-test and correlation analyses between B. tryoni and B. neohumeralis.
Fig._S1.png Collection sites throughout the Wide Bay-Burnett region of South East Queensland for the fine-scale seasonal abundance analysis. Figure taken from Ryan et al. (2025).
Fig._S10.png Correlation of fine scale seasonal abundance of Bactrocera tryoni and B. neohumeralis at multiple locations in the Wide Bay-Burnett region of Southeast Queensland. * = p-value < 0.05, ** = p-value <0.01, *** = p-value < 0.001.
Fig._S11.png Bactrocera tryoni and B. neohumeralis abundance in different habitat types across the Wide-Bay Burnett region of South-east Queensland, Australia. Each habitat type was replicated across sites, and each site was sampled six times over 12 months. The correlation coefficients are calculated for all sites and sampling periods combined within a habitat type. Correlation coefficient * = p <0.05, ** = p <0.01, *** = p < 0.001.
Fig._S12.png A) Empirical Bayesian kriging of Bactrocera neohumeralis abundance during each sampling period. i) April 2021, ii) August 2021, iii) October 2021, iv) December 2021, v) February 2022, vi) April 2022, vii) Average across all time periods. B) Empirical Bayesian kriging of B. tryoni abundance during each sampling period. i) April 2021, ii) August 2021, iii) October 2021, iv) December 2021, v) February 2022, vi) April 2022, vii) Average across all time periods, viii) Average abundance across all time periods, excluding outlier from Good Night 2 in February 2022. Sub-figures B i) to viii) were taken from Ryan et al. (2025).
Fig._S13.png Host fruit usage on fruits where both B. tryoni and B. neohumeralis emerged, including correlation coefficient statistics. * = p-value < 0.05, ** = p-value <0.01, *** = p-value < 0.001.
Fig._S2.png Principal component analysis (PCA) of single nucleotide polymorphisms (SNPs) Bactrocera neohumeralis and B. tryoni, including outlier samples, with the values from the first two eigenvectors plotted for each sample. Eigenvectors one (horizontal) and two (vertical) explain 10.434% and 2.497% of the variance in the data, respectively.
Fig._S3.png fastSTRUCTURE results at cluster size (K) equals three for single nucleotide polymorphisms (SNPs) for Bactrocera neohumeralis and B. tryoni. Red, blue, and green colouration of the bars indicates the inferred population admixture proportions as a percentage (numbers not shown). Three-digit numbers along the bottoms of each row of bars are the genotype ID codes.
Fig._S4.png A) Heatmap generated using the results of EMIBD9 assessing the relatedness of sampled individuals via their Identity by Descent (IBD) coefficients. The colour key indicates the coefficient value, with a histogram plotted on top of the colour key to showcase the number of pairwise sample comparisons obtaining that value. The heatmap indicates the IBD value for each pairwise sample comparison, with the sample’s species indicated along the left and top of the plot. Lower values (e.g., soft red colour) indicate low to no relatedness as seen when comparing most Bactrocera neohumeralis samples against most B. tryoni samples. An increased value (e.g., amber colour) indicates some degree of relatedness and is seen when comparing most species amongst themselves. B) Plot generated using the results of rrBLUP assessing the estimated relatedness of sampled individuals via an additive relationship matrix. The colour key indicates the probability of identity by descent, with a histogram plotted on top of the colour key to showcase the number of pairwise sample comparisons obtaining that value. A heatmap indicates the estimated relatedness value for each pairwise sample comparison, with the sample’s species indicated along the left and top of the plot. Negative values (e.g., blue colour) indicate low to no relatedness as seen when comparing B. neohumeralis samples against B. tryoni samples. A positive value (e.g., yellow colour) indicates some degree of relatedness and is seen when comparing individuals within a species.
Fig._S5.png Population genetics statistics calculated for chromosome 2 labelled as NC_052500.1. Figure details are described in Fig. 7.
Fig._S6.png Population genetics statistics calculated for chromosome 3 labelled as NC_052501.1. Figure details are described in Fig. 7.
Fig._S7.png Population genetics statistics calculated for chromosome 4 labelled as NC_052502.1. Figure details are described in Fig. 7.
Fig._S8.png Population genetics statistics calculated for chromosome 5 labelled as NC_052503.1. Figure details are described in Fig. 7.
Fig._S9.png Unplaced contigs positioned and scaled along the X-axis in proportion to the five autosomal chromosomes displayed in Fig. 4. Fst values for each contig were computed and are plotted as the Y-axis value (left), with their NCBI sequence identifier shown to the right. Outlier SNPs are highlighted in blue.
