Ecological divergence despite common mating sites: Genotypes and symbiotypes shed light on cryptic diversity in the black bean aphid species complex

Gimmi, Elena¹; Wallisch, Jesper¹; Vorburger, Christoph¹

Published Apr 26, 2024 on Dryad. https://doi.org/10.5061/dryad.tx95x6b5t

Data files

Apr 26, 2024 version files (204.22 MB in total):

README.md (14.42 KB)
RProject_Afabe_popgen_publication_final.zip (204.21 MB)

Abstract

Different host plants represent ecologically dissimilar environments for phytophagous insects. The resulting divergent selection can promote the evolution of specialized host races, provided that gene flow is reduced between populations feeding on different plants. In black bean aphids belonging to the Aphis fabae complex, several morphologically cryptic taxa have been described based on their distinct host plant preferences. However, host choice and mate choice are largely decoupled in these insects: they are host-alternating and migrate between specific summer host plants and shared winter hosts, with mating occurring on the shared hosts. This provides a yearly opportunity for gene flow among aphids using different summer hosts, and raises the question if and to what extent the ecologically defined taxa are reproductively isolated. Here, we analyzed a geographically and temporally structured dataset of microsatellite genotypes from A. fabae that were mostly collected from their main winter host Euonymus europaeus, and additionally from another winter host and fourteen summer hosts. The data reveals multiple, strongly differentiated genetic clusters, which differ in their association with different summer and winter hosts. The clusters also differ in the frequency of infection with two heritable, facultative endosymbionts, separately hinting at reproductive isolation and divergent ecological selection. Furthermore, we found evidence for occasional hybridization among genetic clusters, with putative hybrids collected more frequently in spring than in autumn. This suggests that similar to host races in other phytophagous insects, both prezygotic and postzygotic barriers including selection against hybrids maintain genetic differentiation among A. fabae taxa, despite a common mating habitat.

Submission Overview

This data submission consists of an R project folder (RProject_Afabe_popgen_publication_final) containing the following files and subfolders:

7 R scripts (filename.R, detailed description below) that can be used to replicate all analyses, figures and tables presented in the manuscript or the manuscript supplement. Set the working directory to the file location to run the scripts.
folder ‘data’: contains 5 data files and 4 subfolders whose content is described in detail below as well as in the README file also present in the folder.
folder 'data_produced' with subfolders ‘dapc’ and ‘hybrid_analysis’: initially empty; intermediate data files produced by running the provided scripts are saved in this folder.
folder 'figures': initially empty; figures produced by running the provided scripts are saved in this folder.
folder 'functions': contains five R scripts with functions required for running the analysis scripts (filename.R, detailed description below)
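
Since the analysis scripts carry a numeric prefix (01 to 07), one way to work through them is in numerical order from the project folder. The following is only a minimal sketch and not part of the submission: the helper name `list_analysis_scripts` is hypothetical, and the commented usage line assumes that the working directory has been set to the unzipped project folder.

```r
# Hypothetical helper (not part of the submission): list the numbered
# analysis scripts in a project folder, sorted by their numeric prefix.
list_analysis_scripts <- function(project_dir = ".") {
  scripts <- list.files(project_dir, pattern = "^[0-9]{2}_.*\\.R$")
  sort(scripts)
}

# Example usage, assuming the working directory is the project folder:
# for (s in list_analysis_scripts()) source(s, echo = TRUE)
```

Only files whose names start with two digits and an underscore are matched, so the helper scripts in the 'functions' folder are not picked up.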

R scripts

description of the 7 R scripts in the main project folder:

###

01_remove_duplicates_create_str_file_table_S2.R

-> Script to read in the genotype data, check for duplicates, remove duplicates, and save the data in a STRUCTURE file format that will serve as input for all subsequent analyses. The output file produced here is also provided in the ‘data’ folder (A_fabae_genotypes_no_dupl.str). Table S2 summarises the genotype data used for the rest of the analysis.

###

02_snapclust_analysis_figures_S1_S2_S3.R

-> script to reproduce the analyses and figures related to snapclust, including snapclust clustering approaches on the full dataset and on data subsets. The ‘balanced data subset’ used also for further analyses (scripts 03 and 07) can be produced here. The version we used for further analysis is provided in the ‘data’ folder (A_fabae_genotypes_snapclust_subset_20221215.str).

###

03_structure_analysis_figures_1_2_3_tables_1_S5_to_S12.R

-> script to read in the STRUCTURE output provided in the ‘data’ folder (full dataset and more balanced data subset, different Ks) and reproduce the related figures and tables including comparisons with the snapclust clustering results.

###

04_endosymbionts_figure_4_table_S16_S17.R

-> script to read in the endosymbiont data, combine it with the clustering results and produce the related figure and tables.

###

05_hybrid_analysis_tables_S13_S14_S15_figure_S7_S8.R

-> script to produce the input files for NewHybrids, and to read in and visualize the NewHybrids output (which is provided in the ‘data’ folder).

###

06_hybrids_tests_table_S13.R

-> script to reproduce the tests for the ability of NewHybrids to detect simulated hybrid genotypes. New input files for NewHybrids are produced and/or the output we obtained (saved in the ‘data’ folder) can be read in for further analysis.

###

07_DAPC_analysis_figures_S9_S10_S11.R

-> script to reproduce DAPC analyses on the full dataset (not presented in the manuscript) and on the more balanced data subset (presented in the manuscript supplement). New DAPC analyses can be run, or the output files we obtained can be read in from the ‘data’ folder to exactly reproduce the results presented in the manuscript.

Folder 'functions'

description of the R functions contained in the folder ‘functions’:

###

function_cakeplot_all_hostplants.R

-> function to produce the ‘cake plots’ per host plant as presented in Figure 3 of the manuscript

###

function_cakeplot_euonymus.R

-> function to produce the ‘cake plots’ per site and timepoint from samples collected from Euonymus europaeus as presented in Figure 4 of the manuscript

###

function_genind2structure.R

-> function to save data stored in a genind object to a structure-formatted file

###

function_hybrid_plots.R

-> function to produce ‘structure plots’ from the output of NewHybrids or snapclust hybrid analysis

###

function_structureplot_euonymus.R

-> function to produce ‘structure plots’ from the output of STRUCTURE or snapclust specifically for the data from Euonymus europaeus (separated by sampling timepoint and site)

Folder 'data'

The 'data' subfolder in the main project folder contains 5 data files and 4 subfolders whose content is described in detail below.

Files:

A_fabae_sample_information.csv
A_fabae_endosymbiont_data.csv
A_fabae_genotype_data.csv
A_fabae_genotypes_no_dupl.str
A_fabae_genotypes_snapclust_subset_20221215.str

Folders:

STRUCTURE_results_full_dataset
STRUCTURE_results_balanced_dataset
hybrid_analysis
DAPC

###

A_fabae_sample_information.csv

-> This file has 9 columns. It contains the raw data concerning aphid sampling.

sample_id: individual aphid sample identifier
site: sampling site (either specifically Faellanden, Gossau or Steinmaur; or generally Zurich)
latitude: latitudinal coordinate of exact sampling site
longitude: longitudinal coordinate of exact sampling site
date: exact sampling date
host_plant: plant species from which the aphid was sampled
timepoint: character name of reference sampling date
wdate: reference sampling date
year: sampling year

###

A_fabae_endosymbiont_data.csv

-> This file has 11 columns. It contains the raw data concerning the results from diagnostic PCR for endosymbiont infections. Only the symbionts Buchnera aphidicola, Hamiltonella defensa and Regiella insecticola have been tested for in all samples; i.e. the number of NAs is high for the other symbionts.

sample_id: individual aphid sample identifier
Buch: presence (1) or absence (0) of obligate symbiont Buchnera aphidicola in the whole aphid DNA extract
Ham: presence (1) or absence (0) of facultative symbiont Hamiltonella defensa in the whole aphid DNA extract
Reg: presence (1) or absence (0) of facultative symbiont Regiella insecticola in the whole aphid DNA extract
Ser: presence (1) or absence (0) of facultative symbiont Serratia symbiotica in the whole aphid DNA extract
Spiro: presence (1) or absence (0) of facultative symbiont Spiroplasma in the whole aphid DNA extract
Fuk: presence (1) or absence (0) of facultative symbiont Fukatsuia symbiotica in the whole aphid DNA extract
Risia: presence (1) or absence (0) of facultative symbiont Rickettsia in the whole aphid DNA extract
Ars: presence (1) or absence (0) of facultative symbiont Arsenophonus in the whole aphid DNA extract
Rilla: presence (1) or absence (0) of facultative symbiont Rickettsiella in the whole aphid DNA extract
Wol: presence (1) or absence (0) of facultative symbiont Wolbachia in the whole aphid DNA extract
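
Because the symbiont columns are coded 0/1 with NA for untested samples, infection frequencies are best read as proportions among tested samples. The sketch below uses invented rows to show how the symbiont table could be joined to the sample information via sample_id and summarised per host plant. Only the column names are taken from the descriptions above; the sample identifiers, the values and the second host plant name are placeholders.

```r
# Toy data: all values are invented; only the column names follow the README.
samples <- data.frame(
  sample_id  = c("s1", "s2", "s3", "s4"),
  host_plant = c("Euonymus europaeus", "Euonymus europaeus",
                 "hypothetical_summer_host", "hypothetical_summer_host")
)
symbionts <- data.frame(
  sample_id = c("s1", "s2", "s3", "s4"),
  Ham = c(1, 0, 1, 1),
  Ser = c(NA, 1, NA, NA)  # not every sample was tested for every symbiont
)

# Join the two tables on the shared sample identifier
merged <- merge(samples, symbionts, by = "sample_id")

# Proportion of infected samples among tested samples (NAs dropped), per host
ham_freq <- tapply(merged$Ham, merged$host_plant, mean, na.rm = TRUE)

# Number of samples actually tested for each symbiont
n_tested <- colSums(!is.na(merged[, c("Ham", "Ser")]))
```

Reporting `n_tested` next to each frequency makes it visible when a proportion rests on only a few tested samples.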

###

A_fabae_genotype_data.csv

-> This file has 17 columns. It contains the raw genotype data of all successfully genotyped aphid samples including duplicates/clonal copies in a one-row format, i.e. the two alleles per marker are given in two neighbouring columns.

sample_id: individual aphid sample identifier
Af85: First allele at marker site Af85
X: Second allele at marker site Af85
Af181: First allele at marker site Af181
X.1: Second allele at marker site Af181
Af86: First allele at marker site Af86
X.2: Second allele at marker site Af86
Af48: First allele at marker site Af48
X.3: Second allele at marker site Af48
Af82: First allele at marker site Af82
X.4: Second allele at marker site Af82
Afbeta: First allele at marker site Afbeta
X.5: Second allele at marker site Afbeta
AfF: First allele at marker site AfF
X.6: Second allele at marker site AfF
Af50: First allele at marker site Af50
X.7: Second allele at marker site Af50
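
To illustrate the one-row layout, the sketch below collapses the allele columns into one multilocus genotype string per sample and flags repeated genotypes. It is a simplified illustration with invented allele sizes and only two markers, not necessarily the duplicate criterion used in script 01. It also assumes that the two alleles of a marker are always given in the same order and that there are no missing values.

```r
# Toy data: allele sizes are invented; the column naming follows the README
# (first allele under the marker name, second allele under X, X.1, ...).
geno <- data.frame(
  sample_id = c("s1", "s2", "s3"),
  Af85  = c(200, 200, 204), X   = c(204, 204, 204),
  Af181 = c(310, 310, 312), X.1 = c(310, 310, 316)
)

# Collapse all allele columns into one multilocus genotype string per sample
allele_cols <- setdiff(names(geno), "sample_id")
mlg <- do.call(paste, c(geno[allele_cols], sep = "_"))

# Flag samples whose multilocus genotype already occurred earlier in the table
is_dup <- duplicated(mlg)
geno_unique <- geno[!is_dup, ]
```

Here s2 carries the same genotype as s1 and is dropped, so `geno_unique` keeps s1 and s3.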

###

A_fabae_genotypes_no_dupl.str

-> This file contains the Aphis fabae genotype data used for all further analyses, i.e. with duplicates/clonal copies removed (see script 01). The data is formatted in STRUCTURE input format, which can be used as input for both snapclust and STRUCTURE:

First row: marker names separated by tabs (i.e. marker1 marker2 marker3 etc.)
All other rows: sample identifier (sample_id) in the first column, alleles per marker (in the order provided in the header row) in a one-row format in the following columns, i.e. the two alleles per marker are provided next to each other separated by a tab (i.e. allele1A allele1B allele2A allele2B allele3A allele3B etc.)
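
The layout described above can be parsed with base R alone. The sketch below writes a small toy file in that layout and reads it back. The marker names, sample identifiers and allele sizes are invented placeholders, and the `_a`/`_b` column suffixes are an arbitrary naming choice. Packages such as adegenet provide their own readers for STRUCTURE files, which the provided scripts may use instead.

```r
# Toy file content: marker names and allele sizes are invented placeholders.
str_lines <- c(
  "marker1\tmarker2",
  "s1\t200\t204\t310\t310",
  "s2\t204\t204\t312\t316"
)
path <- tempfile(fileext = ".str")
writeLines(str_lines, path)

# First row: marker names only
markers <- scan(path, what = character(), nlines = 1, sep = "\t", quiet = TRUE)

# Remaining rows: sample_id followed by two allele columns per marker
geno <- read.table(path, skip = 1, sep = "\t", header = FALSE,
                   stringsAsFactors = FALSE)
names(geno) <- c("sample_id", paste0(rep(markers, each = 2), c("_a", "_b")))
```

The header row has fewer fields than the data rows (one name per marker versus two alleles per marker plus the identifier), which is why it is read separately.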

###

A_fabae_genotypes_snapclust_subset_20221215.str

-> This file contains a subset of the Aphis fabae genotype data created in script 02. It contains all samples assigned to clusters 2, 3, 4, 5 and 6 in the snapclust analysis of the full dataset with K=6, but only a random subset of 222 samples from the dominant cluster 1. The goal is to obtain more balanced sample numbers per putative genetic cluster (222 is the mean number of samples assigned to clusters 2, 3, 4, 5 and 6). The data is given in STRUCTURE format, as in file A_fabae_genotypes_no_dupl.str above.
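
The subsampling logic can be sketched as follows, using invented cluster assignments with far fewer samples than in the real data. All samples from the smaller clusters are kept, and the dominant cluster is randomly reduced to the mean size of the others. The seed and the sample identifiers are arbitrary assumptions, and the actual subset used in the manuscript is the provided .str file, not the output of this sketch.

```r
# Toy cluster assignments (invented): cluster 1 dominates.
set.seed(1)  # arbitrary seed, only to make the example repeatable
cluster   <- c(rep(1, 50), rep(2, 6), rep(3, 4))
sample_id <- paste0("s", seq_along(cluster))

# Target size: mean number of samples in the non-dominant clusters
minor_sizes <- table(cluster[cluster != 1])
n_keep <- round(mean(minor_sizes))

# Keep every sample from the minor clusters and a random draw from cluster 1
keep_minor <- sample_id[cluster != 1]
keep_major <- sample(sample_id[cluster == 1], n_keep)
subset_ids <- c(keep_major, keep_minor)
```

Because the draw from cluster 1 is random, re-running the subsampling with a different seed yields a different subset, which is why a fixed version of the file is provided.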

###

Folder ‘STRUCTURE_results_full_dataset’

-> This folder contains the data related to the clustering analysis of the full dataset with STRUCTURE:

Subfolder ‘STRUCTURE_output_full_dataset’: output folder from STRUCTURE analysis.
- File 20221212_euo_summer21_wang_3_5.spj contains the main program settings
- File project_data corresponds to A_fabae_genotypes_no_dupl.str in the ‘data’ folder of this publication
- Folder ‘25k_200k_wang’ contains the parameter files (‘mainparams’ and ‘extraparams’) and two STRUCTURE output folders: ‘Results’ (100 files summarising 10 runs for each of the 10 values of K from 1 to 10) and ‘PlotData’ (600 files, 6 files for each of the 100 runs). Note: to save time, multiple approaches for the different Ks were run in parallel and manually combined afterwards.
Subfolder ‘clumpak_output_full_dataset’: output folder downloaded following the CLUMPAK analysis via webtool. The averaged membership likelihoods per sample provided in the subfolders K=1, K=2 etc. are used to draw the structure plots and figures presented in the analysis; see script 03.
Subfolder ‘structure_harvester_output_full_dataset’: output folder downloaded following the STRUCTURE HARVESTER analysis via webtool. meanLnProb.pdf, deltaK.pdf and evannoTable.tab correspond to the figures and data shown in Figure S4 of the manuscript supplement.

###

Folder ‘STRUCTURE_results_balanced_dataset’

-> This folder contains the data related to the clustering analysis of the more balanced data subset with STRUCTURE:

Subfolder ‘STRUCTURE_output_balanced_dataset’: output folder from STRUCTURE analysis.
- File 20221215_balanced_dataset_wang.spj contains the main program settings
- File project_data corresponds to A_fabae_genotypes_snapclust_subset_20221215.str in the ‘data’ folder of this publication
- Folder ‘25k_200k_wang’ contains the parameter files (‘mainparams’ and ‘extraparams’) and two STRUCTURE output folders: ‘Results’ (100 files summarising 10 runs for each of the 10 values of K from 1 to 10) and ‘PlotData’ (600 files, 6 files for each of the 100 runs).

Subfolder ‘clumpak_output_balanced_dataset’: output folder downloaded following the CLUMPAK analysis via webtool. The averaged membership likelihoods per sample provided in the subfolders K=1, K=2 etc. are used to draw the structure plots and figures presented in the analysis; see script 03.

Subfolder ‘structure_harvester_output_balanced_dataset’: output folder downloaded following the STRUCTURE HARVESTER analysis via webtool. meanLnProb.pdf, deltaK.pdf and evannoTable.tab correspond to the figures and data shown in Figure S5 of the manuscript supplement.

###

Folder ‘hybrid_analysis’

-> This folder contains the data produced for and analyzed when searching for hybrids with snapclust and NewHybrids.

Subfolder ‘my_data’ contains the genotype data (in NewHybrids input format, 7 files) and the corresponding sample names (7 files) from pairwise combinations of parental genotypes considered for hybrid analysis. This data can be reproduced using script 05.
Subfolder ‘my_result_files’ contains the output from NewHybrids analyses of the files given in my_data (a separate subfolder contains the output for each combination of parental genotypes, i.e. for each input file). This data can be read in and visualized/analyzed using script 05.
Subfolder ‘artificial_hybrids’ contains the datasets including presumably ‘pure’ parents and their simulated hybrids as produced in script 06. In subfolder ‘strfiles_STR_based’ they are saved in a STRUCTURE format, in subfolder ‘nhfiles_STR_based’ they are saved in a NewHybrids-input format.
Subfolder ‘new_hybrids_output’ contains the output from the NewHybrids analysis of the simulated datasets; this data can be visualized/analyzed with script 06.
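
To illustrate what a simulated hybrid genotype looks like, the sketch below draws one allele per marker from each of two parents, which yields a first-generation (F1) genotype. This is a generic illustration with invented allele sizes and a hypothetical function name `simulate_f1`; it is not necessarily the procedure implemented in script 06.

```r
# Hypothetical helper: each parent is a 2-column matrix (one row per marker,
# two alleles per row). The offspring inherits one randomly chosen allele
# per marker from each parent.
simulate_f1 <- function(parent_a, parent_b) {
  n <- nrow(parent_a)
  from_a <- parent_a[cbind(seq_len(n), sample(1:2, n, replace = TRUE))]
  from_b <- parent_b[cbind(seq_len(n), sample(1:2, n, replace = TRUE))]
  cbind(from_a, from_b)
}

set.seed(42)  # arbitrary seed, only to make the example repeatable
pa <- matrix(c(200, 204, 310, 310), ncol = 2, byrow = TRUE)  # invented parent A
pb <- matrix(c(208, 208, 316, 320), ncol = 2, byrow = TRUE)  # invented parent B
f1 <- simulate_f1(pa, pb)
```

Each row of `f1` holds one allele from parent A in the first column and one from parent B in the second, so every marker in the simulated offspring carries ancestry from both parents.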

###

Folder ‘DAPC’

-> This folder contains the data produced and analyzed for DAPC analysis. With script 07, new DAPC analyses can be run (with new values), or the following result files can be loaded to reproduce the exact values and plots that are presented in the manuscript supplement:

dapc_6_20_largesubset.Rds: results from DAPC on the more balanced subset with K=6 groups and 20 PCs retained
dapc_6_40_largesubset.Rds: results from DAPC on the more balanced subset with K=6 groups and 40 PCs retained
dapc_7_20_largesubset.Rds: results from DAPC on the more balanced subset with K=7 groups and 20 PCs retained
xval_output_largesubset_grp5.Rds: results from cross validation to determine the number of PCs to retain after clustering the more balanced data subset into 5 clusters
xval_output_largesubset_grp6.Rds: results from cross validation to determine the number of PCs to retain after clustering the more balanced data subset into 6 clusters
xval_output_largesubset_grp7.Rds: results from cross validation to determine the number of PCs to retain after clustering the more balanced data subset into 7 clusters
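
The choice between loading a saved result and recomputing it can be expressed as a small load-or-compute pattern around `readRDS()` and `saveRDS()`. The sketch below is a generic illustration: the helper name `load_or_compute` is hypothetical, the file is a temporary one, and the list stands in for a real DAPC result. It does not show how script 07 is organised.

```r
# Hypothetical helper: return a saved result if the file exists; otherwise
# compute it, save it to the given path, and return it.
load_or_compute <- function(path, compute) {
  if (file.exists(path)) {
    readRDS(path)
  } else {
    result <- compute()
    saveRDS(result, path)
    result
  }
}

# Example with a temporary file and a placeholder computation
path   <- tempfile(fileext = ".Rds")
first  <- load_or_compute(path, function() list(n_pca = 20, n_groups = 6))
second <- load_or_compute(path, function() stop("not called: file exists"))
```

The second call returns the stored object without running its computation. In the same way, loading the provided .Rds files gives the exact published values, whereas a new run may give slightly different ones.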