Data for: ‘Drifting’ Buchnera genomes track the microevolutionary trajectories of their aphid hosts
Abstract
This repository comprises the data and scripts needed to recreate the analyses presented in the manuscript:
Thia et al. ‘Drifting’ Buchnera genomes track the microevolutionary trajectories of their aphid hosts.
This study explores the coevolution between the endosymbiont Buchnera and its aphid host in the green peach aphid, Myzus persicae. Evidence that Buchnera is polymorphic within aphid clones indicates that these symbionts exist as populations of different strains within a single aphid. Strong signals of genomic covariation were observed between the two organisms. However, genomic variation in either organism was not associated with geography or host plant. Further examination of protein-coding SNPs in Buchnera found that the numbers of non-synonymous and synonymous mutations were, on average, comparable (ratio ~1), with similar site frequency spectra of alleles in both mutation classes. Our results suggest that Buchnera within Myzus persicae are largely evolving under neutral processes. This study provides a unique population genetics perspective on Buchnera-aphid host coevolution at a microevolutionary scale.
This repository contains the data and scripts needed to replicate the analysis in:
- Thia, Zhan, et al. (2024) ‘Drifting’ Buchnera genomes track the microevolutionary trajectories of their aphid hosts. Insect Molecular Biology. DOI: 10.1111/imb.12946
This study investigates the coevolution between Buchnera and its aphid host among a globally distributed sample of Myzus persicae.
This repository is split into two ZIP files: ‘Data’ and ‘Scripts’.
‘Data’ contains three directories that set out the structure for these analyses:
- 0_Metadata/
- 1_Bioinformatics/
- 2_Popgen_analysis/
‘Scripts’ contains the scripts that should be placed into either ‘1_Bioinformatics/’ or ‘2_Popgen_analysis/’ to run the analyses for these separate pipelines.
Note that the ‘2_Popgen_analysis/’ directory is intentionally left empty. This is meant to act as a placeholder to outline how the analysis pipeline should be structured.
Note that in all tabulated files (e.g., CSV or XLSX), missing data is represented by blank cells.
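Because missing values appear as blank cells, it can help to count them per column before analysis. The following is a minimal sketch for the CSV files only (XLSX files need a spreadsheet reader). The function name `count_blank_cells` is hypothetical, and the sketch assumes that no field contains a quoted comma.

```shell
#!/usr/bin/env bash
# Count blank (missing) cells per column in a simple CSV file.
# Assumption: fields never contain quoted commas; use a real CSV parser
# (for example read.csv in R) if they do.
count_blank_cells() {
  awk -F',' '
    { sub(/\r$/, "") }                       # tolerate Windows line endings
    NR == 1 { n = NF; for (i = 1; i <= NF; i++) name[i] = $i; next }
    { for (i = 1; i <= n; i++) if ($i ~ /^[ \t]*$/) blank[i]++ }
    END { for (i = 1; i <= n; i++) printf "%s,%d\n", name[i], blank[i] + 0 }
  ' "$1"
}
```

For example, `count_blank_cells 0_Metadata/distance_coeff_names.csv` would print one `column,count` line per column.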
Code/Software
This project used bioinformatic pipelines to derive genetic variants for Myzus persicae and its endosymbiotic bacterium Buchnera. These variants were then imported into R for population genetic analysis. Data structures and scripts are described below.
Description of the data and file structure
This repository is divided into three core directories:
- 0_Metadata/
- 1_Bioinformatics/
- 2_Popgen_analysis/
These directories should all coexist in the same main directory, as the R code for analyses directly refers to this structure. Bioinformatic bash scripts were executed on the University of Melbourne's Spartan HPC. These scripts should only be launched within their respective directories. See the 'Repository_description.pdf' file.
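The layout described above can be sketched as a small setup step. This is an illustration only: the function name `setup_repository` and both path arguments are hypothetical, and it assumes the ‘Scripts’ ZIP unpacks into a single flat directory of files named ‘1_[…]’ and ‘2_[…]’.

```shell
#!/usr/bin/env bash
# Create the three core directories under one main directory and copy each
# script into the pipeline directory that matches its filename prefix.
# Both arguments are hypothetical paths chosen for illustration.
setup_repository() {
  local main_dir="$1" scripts_dir="$2" f
  mkdir -p "$main_dir/0_Metadata" \
           "$main_dir/1_Bioinformatics" \
           "$main_dir/2_Popgen_analysis"
  for f in "$scripts_dir"/1_*; do
    [ -e "$f" ] && cp "$f" "$main_dir/1_Bioinformatics/"
  done
  for f in "$scripts_dir"/2_*; do
    [ -e "$f" ] && cp "$f" "$main_dir/2_Popgen_analysis/"
  done
  return 0
}
```

A call such as `setup_repository ./Buchnera_project ./Scripts` leaves existing directories from the ‘Data’ ZIP untouched, since `mkdir -p` does nothing when a directory already exists.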
Data: 0_Metadata/
This directory contains higher-level metadata for samples.
distance_coeff_names.csv
A table of labels to rename coefficients in linear models of genetic distance.
Column name | Type | Description |
---|---|---|
Coeff.Default | Character | Default coefficient name produced from the model in R |
Coeff.Rename | Character | New coefficient name to relabel |
metadata_combined.xlsx
Metadata for samples used in this study.
Column name | Type | Description |
---|---|---|
Sample | Character | Sample ID |
Continent | Character | Continent of origin |
Country | Character | Country of origin |
Locality | Character | Locality of origin |
Year_collected | Character | Year collected |
Crop | Character | Crop of origin |
Plant_family | Character | Plant family of origin |
Source | Character | The study from which the sample came |
Msat | Character | Microsatellite genotype |
Lat | Numeric | Latitude |
Long | Numeric | Longitude |
Notes | Character | Additional notes |
Data: 1_Bioinformatics/
Pipeline for running bioinformatic analyses.
1_0_Genomes/
Buchnera_aphidicola_Mper_CP002697_annot.[gff/gtf]
Buchnera annotations for accession CP002697 in GFF and GTF file formats. No header.
Column number | Type | Description |
---|---|---|
1 | Character | Sequence |
2 | Character | Source |
3 | Character | Type |
5 | Integer | Start |
6 | Integer | End |
7 | Character | Quality |
8 | Character | Strand |
9 | Character | Details |
Buchnera_aphidicola_Mper_CP002697_annot.tsv
Buchnera annotations for accession CP002697 in TSV file format.
Column name | Type | Description |
---|---|---|
Name | Character | Sequence |
Type | Character | Type |
Minimum | Integer | Start |
Maximum | Integer | End |
Length | Integer | Annotation length |
Direction | Character | Strand |
Buchnera_aphidicola_Mper_CP002697_cds.gtf
Buchnera coding sequence annotations for accession CP002697 in GTF file format. No header.
Column number | Type | Description |
---|---|---|
1 | Character | Sequence |
2 | Character | Source |
3 | Character | Type |
5 | Integer | Start |
6 | Integer | End |
7 | Character | Quality |
8 | Character | Strand |
9 | Character | Details |
Buchnera_aphidicola_Mper_CP002697_genome.[fasta/gb]
Buchnera genome sequences for accession CP002697 in FASTA and GenBank (GB) file formats.
Buchnera_aphidicola_Mper_CP002697_prot.[faa]
Buchnera protein sequences for accession CP002697 in FASTA file format.
Myzus_persicae_CloneG006_AphidBase_v3.fasta
Myzus persicae genome for clone G006 from AphidBase (version 3) in FASTA file format.
1_4_Variant_calling/
vars_for_snpeff_buch.vcf
Variants for performing SnpEff analyses on Buchnera in VCF file format.
vars_snp_nomiss_5x_gpa.vcf
Variants for performing population genetic analyses on Myzus persicae in VCF file format.
vars_snp_nomiss_20x_buch.vcf
Variants for performing population genetic analyses on Buchnera in VCF file format.
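As a quick sanity check on any of these VCF files, the number of samples and variant records can be read directly from the text. The sketch below relies only on the standard VCF layout (meta lines start with `##`, the column header starts with `#CHROM`, and sample columns follow the `FORMAT` column). The function name `vcf_summary` is hypothetical.

```shell
#!/usr/bin/env bash
# Report the number of sample columns and variant records in an
# uncompressed VCF file.
vcf_summary() {
  awk -F'\t' '
    /^##/     { next }                                   # meta-information lines
    /^#CHROM/ { samples = (NF > 9) ? NF - 9 : 0; next }  # 8 fixed columns + FORMAT
              { records++ }
    END       { printf "samples=%d records=%d\n", samples, records }
  ' "$1"
}
```

For example, `vcf_summary 1_4_Variant_calling/vars_snp_nomiss_20x_buch.vcf` prints a single summary line.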
1_5_Diversity_filter/
sites_cov_buch_20x.csv
Number of sites covered in the Buchnera genome at a depth of 20x.
Column name | Type | Description |
---|---|---|
Sample | Character | Sample ID |
Sites_tot | Character | Total sites |
Sites_cov | Character | Sites covered at depth threshold |
sites_cov_gpa_10x.csv
Number of sites covered in the Myzus persicae genome at a depth of 10x.
Column name | Type | Description |
---|---|---|
Sample | Character | Sample ID |
Sites_tot | Character | Total sites |
Sites_cov | Character | Sites covered at depth threshold |
sites_het_buch_20x.csv
Number of sites that are heterozygous in the Buchnera genome at a depth of 20x.
Column name | Type | Description |
---|---|---|
Sample | Character | Sample ID |
Sites_het | Character | Heterozygous sites |
sites_het_gpa_10x.csv
Number of sites that are heterozygous in the Myzus persicae genome at a depth of 10x.
Column name | Type | Description |
---|---|---|
Sample | Character | Sample ID |
Sites_het | Character | Heterozygous sites |
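
The coverage and heterozygosity tables share the Sample column, so they can be combined per sample. The sketch below divides heterozygous sites by covered sites. That ratio is an assumption made here for illustration and may not match the statistic computed in the R scripts. The function name `het_per_covered_site` and the output column `Het_per_site` are hypothetical, and the column order is taken from the tables above.

```shell
#!/usr/bin/env bash
# Join a sites_cov_*.csv file with the matching sites_het_*.csv file by
# Sample and print heterozygous sites per covered site.
# Assumes a header row in each file and the column order documented above.
het_per_covered_site() {
  local cov_csv="$1" het_csv="$2"
  awk -F',' '
    BEGIN     { print "Sample,Het_per_site" }
              { sub(/\r$/, "") }               # tolerate Windows line endings
    FNR == 1  { next }                         # skip the header row of each file
    NR == FNR { cov[$1] = $3; next }           # first file: Sample -> Sites_cov
    ($1 in cov) && cov[$1] > 0 { printf "%s,%.6f\n", $1, $2 / cov[$1] }
  ' "$cov_csv" "$het_csv"
}
```

For example, `het_per_covered_site sites_cov_buch_20x.csv sites_het_buch_20x.csv` pairs the two Buchnera tables. Samples missing from the coverage table, or with zero covered sites, are skipped.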
1_6_SnpEff_Buchnera/
snpEff_annot_raw_buch.vcf
Annotations of Buchnera variants using SnpEff in VCF file format.
snpEff_annot_parsed.csv
Annotations of Buchnera variants using SnpEff parsed and in CSV file format.
Column name | Type | Description |
---|---|---|
CHROM | Character | Sequence name |
POS | Integer | Sequence position |
LOCUS | Character | Locus ID |
REF | Character | Reference allele |
ALT | Character | Alternate allele |
SEQ | Character | Focal allele sequence |
EFFECT | Character | Mutation effect |
GENE | Character | Gene function |
AMINO | Character | Amino acid |
TYPE | Character | Type of mutation |
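
To compare mutation classes, the values in a column such as TYPE or EFFECT can be tallied. The sketch below looks the column up by its header name. The function name `tally_column` is hypothetical, and the sketch assumes that no field contains a quoted comma. Gene function descriptions may break that assumption, in which case a CSV parser such as `read.csv` in R is the safer choice.

```shell
#!/usr/bin/env bash
# Tally the distinct values found in one named column of a CSV file.
# Assumption: fields never contain quoted commas.
tally_column() {
  local csv="$1" column="$2"
  awk -F',' -v col="$column" '
    { sub(/\r$/, "") }                         # tolerate Windows line endings
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) exit 1                         # requested column is absent
      next
    }
    { count[$idx]++ }
    END { for (k in count) printf "%s,%d\n", k, count[k] }
  ' "$csv" | sort
}
```

For example, `tally_column snpEff_annot_parsed.csv TYPE` prints one `value,count` line per mutation type, from which a ratio between classes can be computed by hand.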
Data: 2_Popgen_analysis/
Pipeline for running population genetic analyses. This data directory is intentionally left empty and serves as a placeholder.
Scripts for Bioinformatics
Any script with the name ‘1_[…]’ is part of the ‘1_Bioinformatics’ pipeline. The ‘[…]’ indicates details of the script’s role.
1_0_R_environment.R
R script to set up environmental variables in R.
1_2_Trim_reads._HPC_jobs.R
R script to generate multiple scripts for trimming samples in parallel on an HPC.
1_3a_Prepare_references.sh
Bash script to prepare reference sequences for mapping.
1_3b_Map_GPA_HPC_jobs.R
R script to generate multiple scripts for mapping samples to the Myzus persicae genome in parallel on an HPC.
1_3c_Map_Buch_HPC_jobs.R
R script to generate multiple scripts for mapping samples to the Buchnera genome in parallel on an HPC.
1_3d_Downsample_BAMs.sh
Bash script to downsample BAM files.
1_4a_Variant_calling_GPA.sh
Bash script to call variants in the Myzus persicae genome.
1_4b_Variant_calling_Buchnera.sh
Bash script to call variants in the Buchnera genome.
1_4c_Import_and_filter_SNPs.R
R script to import and filter called variants.
1_4d_Genotype_probabilities_GPA.R
R script to obtain genotype probabilities from low-coverage Myzus persicae variants.
1_5a_Diversity_filter_GPA_10x.sh
Bash script to obtain genetic diversity statistics for Myzus persicae.
1_5b_Diversity_filter_Buch_20x.sh
Bash script to obtain genetic diversity statistics for Buchnera.
1_6a_SnpEff_Buchnera_annotate.sh
Bash script to annotate mutational effects on Buchnera variants with SnpEff.
1_6b_SnpEff_Buchnera_parse.R
R script to parse the SnpEff annotations of Buchnera variants.
1_Bioinformatics.Rproj
R project file to manage R analyses.
Scripts for Population genetic analysis
Any script with the name ‘2_[…]’ is part of the ‘2_Popgen_analysis’ pipeline. The ‘[…]’ indicates details of the script’s role.
2_0_R_environment.R
R script to set up environmental variables in R.
2_3_Genetic_differentiation.R
R script for calculating genetic differentiation among samples.
2_4_Genetic_distance_trees.R
R script for constructing genetic distance trees.
2_5_Diversity_stats.R
R script for summarising diversity statistics.
2_6_Plant_hosts.R
R script for testing for genetic segregation among plant hosts.
2_7_Buchnera_mutations.R
R script for summarising the mutational effects in the Buchnera genome.
2_Popgen_analysis.Rproj
R project file to manage R analyses.
Sharing/Access information
Whole-genome sequencing data used in this study came from Singh et al. (2021) and from samples sequenced de novo specifically for this work. Source and GenBank SRA accessions are listed below.
Source | Sample | SRA Accession |
---|---|---|
Singh et al. 2021 | CHI-S-71 | SRR13326401 |
Singh et al. 2021 | CHI-S-72 | SRR13326400 |
Singh et al. 2021 | SK-S-82 | SRR13326389 |
Singh et al. 2021 | AUS-S-43 | SRR13326432 |
Singh et al. 2021 | AUS-S-45 | SRR13326430 |
Singh et al. 2021 | AUS-S-46 | SRR13326429 |
Singh et al. 2021 | AUS-S-47 | SRR13326428 |
Singh et al. 2021 | BEL-S-103 | SRR13326480 |
Singh et al. 2021 | FRA-S-21 | SRR13326456 |
Singh et al. 2021 | ITA-S-37 | SRR13326439 |
Singh et al. 2021 | UK-S-11 | SRR13326483 |
Singh et al. 2021 | UK-S-118 | SRR13326475 |
Singh et al. 2021 | UK-S-3 | SRR13326457 |
Singh et al. 2021 | USA-S-G006 | SRR3466613 |
Singh et al. 2021 | AUS-S-42 | SRR13326433 |
Singh et al. 2021 | AUS-S-44 | SRR13326431 |
Singh et al. 2021 | AUS-S-48 | SRR13326427 |
Singh et al. 2021 | AUS-S-49 | SRR13326426 |
Singh et al. 2021 | AUS-S-50 | SRR13326425 |
Singh et al. 2021 | ITA-S-110 | SRR10199541 |
Singh et al. 2021 | ITA-S-33 | SRR13326443 |
Singh et al. 2021 | SPA-S-77 | SRR13326395 |
Singh et al. 2021 | UK-S-105 | SRR10199546 |
Singh et al. 2021 | UK-S-29 | SRR13326448 |
This study | AUS-Z-1 | SRR25064225 |
This study | AUS-Z-10 | SRR25064204 |
This study | AUS-Z-11 | SRR25064223 |
This study | AUS-Z-12 | SRR25064222 |
This study | AUS-Z-13 | SRR25064221 |
This study | AUS-Z-14 | SRR25064220 |
This study | AUS-Z-17 | SRR25064218 |
This study | AUS-Z-23 | SRR25064211 |
This study | AUS-Z-15 | SRR25064219 |
This study | AUS-Z-18 | SRR25064217 |
This study | AUS-Z-19 | SRR25064216 |
This study | AUS-Z-20 | SRR25064215 |
This study | AUS-Z-21 | SRR25064214 |
This study | AUS-Z-3 | SRR25064213 |
This study | AUS-Z-4 | SRR25064210 |
This study | AUS-Z-6 | SRR25064208 |
This study | AUS-Z-7 | SRR25064207 |
This study | AUS-Z-8 | SRR25064206 |
This study | AUS-Z-9 | SRR25064205 |
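
The accessions above can be turned into download commands for the SRA Toolkit. The sketch below only prints `prefetch` and `fasterq-dump` commands and does not run them. It assumes the table has been saved to a text file in the pipe-delimited form shown above. The function name `sra_download_commands` and the output directory argument are hypothetical, and this is not necessarily how the reads were retrieved for the study.

```shell
#!/usr/bin/env bash
# Print SRA Toolkit commands for every accession in a pipe-delimited table
# with the columns: Source | Sample | SRA Accession |
# The output directory is a hypothetical argument chosen by the user.
sra_download_commands() {
  local table="$1" outdir="$2"
  awk -F'|' -v out="$outdir" '
    {
      acc = $3
      gsub(/[ \t\r]/, "", acc)                 # trim padding around the accession
      if (acc ~ /^SRR[0-9]+$/) {               # skips header and separator rows
        printf "prefetch %s\n", acc
        printf "fasterq-dump --outdir %s %s\n", out, acc
      }
    }
  ' "$table"
}
```

The printed commands can be reviewed first and then run, for example with `sra_download_commands accessions.txt raw_reads | bash`.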
This study leverages previously and newly sequenced Myzus persicae to compile a global dataset of Buchnera and aphid host genomes. Analyses include quantification of genetic diversity and differentiation, protein-coding effects, and comparisons of site frequency spectra.