Data for: ‘Drifting’ Buchnera genomes track the microevolutionary trajectories of their aphid hosts

Thia, Joshua

Published Nov 13, 2024 on Dryad. https://doi.org/10.5061/dryad.gf1vhhmvz

Data files

Nov 13, 2024 version files (118.33 MB total):

Data.zip	118.32 MB
README.md	13.80 KB

Abstract

This repository comprises the data and scripts needed to recreate the analyses presented in the manuscript:

Thia et al. ‘Drifting’ Buchnera genomes track the microevolutionary trajectories of their aphid hosts.

This study explores the coevolution between the endosymbiont Buchnera and its aphid host, Myzus persicae (the green peach aphid). Evidence that Buchnera is polymorphic within aphid clones indicates that these symbionts exist as populations of different strains within a single aphid. Strong signals of genomic covariation were observed between the two organisms. However, genomic variation in either organism was not associated with geography or host plant. Further examination of protein-coding SNPs in Buchnera found that the numbers of non-synonymous and synonymous mutations were, on average, comparable (ratio ~1), with similar site frequency spectra of alleles in both mutation classes. Our results suggest that Buchnera within Myzus persicae are largely evolving under neutral processes. This study provides a unique population genetics perspective on Buchnera-aphid host coevolution at a microevolutionary scale.

This repository contains the data and scripts needed to replicate the analysis in:

Thia, Zhan, et al. (2024) ‘Drifting’ Buchnera genomes track the microevolutionary trajectories of their aphid hosts. Insect Molecular Biology. DOI: 10.1111/imb.12946

This study investigates the coevolution between Buchnera and its aphid host among a globally distributed sample of Myzus persicae.

This repository is split into two ZIP files: ‘Data’ and ‘Scripts’.

‘Data’ contains three directories that set out the structure for these analyses:

0_Metadata/
1_Bioinformatics/
2_Popgen_analysis/

‘Scripts’ contains the scripts that should be placed into either ‘1_Bioinformatics/’ or ‘2_Popgen_analysis/’ to run the analyses for these separate pipelines.

Note that the ‘2_Popgen_analysis/’ directory is intentionally left empty. This is meant to act as a placeholder to outline how the analysis pipeline should be structured.

Note that in all tabulated files (e.g., CSV or XLSX), missing data is represented by blank cells.
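
As a minimal sketch of how blank cells might be handled when reading one of the CSV tables outside R, the Python snippet below converts empty strings to `None`. The helper name `read_rows` and the example values are hypothetical and are not part of the repository.

```python
import csv
import io


def read_rows(handle):
    """Read a CSV file handle, turning blank cells into None (missing data)."""
    reader = csv.DictReader(handle)
    return [
        {key: (value if value != "" else None) for key, value in row.items()}
        for row in reader
    ]


# Hypothetical example table; the second sample has a blank (missing) cell.
example = io.StringIO("Sample,Sites_het\nS1,12\nS2,\n")
rows = read_rows(example)
print(rows)
```

Values are kept as strings here; converting them to numbers is left to the caller, so that a missing cell is never silently read as zero.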

Code/Software

This project used a combination of bioinformatic tools to derive genetic variants for Myzus persicae and its endosymbiotic bacterium Buchnera. These variants were then imported into R for population genetic analysis. Data structures and scripts are described below.

Description of the data and file structure

This repository is divided into three core directories:

0_Metadata/
1_Bioinformatics/
2_Popgen_analysis/

These directories should all coexist in the same main directory, as the R code for analyses directly refers to this structure. Bioinformatic bash scripts were executed on the University of Melbourne's Spartan HPC. These scripts should only be launched within their respective directories. See the 'Repository_description.pdf' file.
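
Before running anything, it may help to confirm that the three core directories sit side by side. The following Python sketch does this check; the function name `missing_dirs` is hypothetical, and the main directory path is whatever location the ZIP contents were extracted to.

```python
from pathlib import Path

# The three core directories named in this README.
EXPECTED_DIRS = ("0_Metadata", "1_Bioinformatics", "2_Popgen_analysis")


def missing_dirs(main_dir):
    """Return the expected core directories that are absent from main_dir."""
    root = Path(main_dir)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # "." is a placeholder for the main directory.
    absent = missing_dirs(".")
    print("All directories present" if not absent else f"Missing: {absent}")
```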

Data: 0_Metadata/

This directory contains higher-level metadata for samples.

distance_coeff_names.csv

A table of labels to rename coefficients in linear models of genetic distance.

Column name	Type	Description
Coeff.Default	Character	Default coefficient name produced from the model in R
Coeff.Rename	Character	New coefficient name to relabel

metadata_combined.xlsx

Metadata for samples used in this study.

Column name	Type	Description
Sample	Character	Sample ID
Continent	Character	Continent of origin
Country	Character	Country of origin
Locality	Character	Locality of origin
Year_collected	Character	Year collected
Crop	Character	Crop of origin
Plant_family	Character	Plant family of origin
Source	Character	The study from which the sample came
Msat	Character	Microsatellite genotype
Lat	Numeric	Latitude
Long	Numeric	Longitude
Notes	Character	Additional notes

Data: 1_Bioinformatics/

Pipeline for running bioinformatic analyses.

1_0_Genomes/

Buchnera_aphidicola_Mper_CP002697_annot.[gff/gtf]

Buchnera annotations for accession CP002697 in GFF and GTF file formats. No header.

Column number	Type	Description
1	Character	Sequence
2	Character	Source
3	Character	Type
4	Integer	Start
5	Integer	End
6	Character	Quality
7	Character	Strand
8	Character	Phase
9	Character	Details
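
For readers who want to inspect the annotation files programmatically, here is a minimal Python sketch that splits one annotation line into named fields. It assumes the standard nine-column, tab-separated GFF/GTF layout; the `GffRecord` and `parse_gff_line` names and the example line are hypothetical.

```python
from typing import NamedTuple


class GffRecord(NamedTuple):
    sequence: str
    source: str
    feature: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    details: str


def parse_gff_line(line):
    """Split one tab-separated annotation line into a GffRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    seq, source, feature, start, end, score, strand, phase, details = fields
    return GffRecord(seq, source, feature, int(start), int(end),
                     score, strand, phase, details)


# Hypothetical line for illustration only (not copied from the repository files).
record = parse_gff_line("CP002697\texample\tgene\t100\t400\t.\t+\t.\tID=example_gene")
print(record.feature, record.start, record.end)
```

The score and phase are kept as strings because these columns may hold a `.` placeholder rather than a number.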

Buchnera_aphidicola_Mper_CP002697_annot.tsv

Buchnera annotations for accession CP002697 in TSV file format.

Column name	Type	Description
Name	Character	Sequence
Type	Character	Type
Minimum	Integer	Start
Maximum	Integer	End
Length	Integer	Annotation length
Direction	Character	Strand

Buchnera_aphidicola_Mper_CP002697_cds.gtf

Buchnera coding sequence annotations for accession CP002697 in GTF file format. No header.

Column number	Type	Description
1	Character	Sequence
2	Character	Source
3	Character	Type
4	Integer	Start
5	Integer	End
6	Character	Quality
7	Character	Strand
8	Character	Phase
9	Character	Details

Buchnera_aphidicola_Mper_CP002697_genome.[fasta/gb]

Buchnera genome sequences for accession CP002697 in FASTA and GenBank (GB) file formats.

Buchnera_aphidicola_Mper_CP002697_prot.[faa]

Buchnera protein sequences for accession CP002697 in FASTA file format.

Myzus_persicae_CloneG006_AphidBase_v3.fasta

Myzus persicae genome for clone G006 from AphidBase (version 3) in FASTA file format.

1_4_Variant_calling/

vars_for_snpeff_buch.vcf

Variants for performing SnpEff analyses on Buchnera in VCF file format.

vars_snp_nomiss_5x_gpa.vcf

Variants for performing population genetic analyses on Myzus persicae in VCF file format.

vars_snp_nomiss_20x_buch.vcf

Variants for performing population genetic analyses on Buchnera in VCF file format.
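
The VCF files can be read with dedicated tools, but a quick look at the variants needs only a few lines of code. This Python sketch skips header lines (those beginning with `#`) and yields the sequence, position, reference allele and alternate allele of each record, relying on the standard VCF column order. The function name `iter_variants` and the example lines are hypothetical.

```python
def iter_variants(lines):
    """Yield (chrom, pos, ref, alt) for each non-header VCF line."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta-information and the column header
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _variant_id, ref, alt = fields[:5]
        yield chrom, int(pos), ref, alt


# Hypothetical VCF content for illustration only.
example_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "CP002697\t10\t.\tA\tG\t.\t.\t.\n",
]
variants = list(iter_variants(example_lines))
print(variants)
```

With a real file, the same function can be applied to an open file handle, for example `iter_variants(open("vars_snp_nomiss_20x_buch.vcf"))`.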

1_5_Diversity_filter/

sites_cov_buch_20x.csv

Number of sites covered in the Buchnera genome at a depth of 20x.

Column name	Type	Description
Sample	Character	Sample ID
Sites_tot	Character	Total sites
Sites_cov	Character	Sites covered at depth threshold

sites_cov_gpa_10x.csv

Number of sites covered in the Myzus persicae genome at a depth of 10x.

Column name	Type	Description
Sample	Character	Sample ID
Sites_tot	Character	Total sites
Sites_cov	Character	Sites covered at depth threshold

sites_het_buch_20x.csv

Number of sites that are heterozygous in the Buchnera genome at a depth of 20x.

Column name	Type	Description
Sample	Character	Sample ID
Sites_het	Character	Heterozygous sites

sites_het_gpa_10x.csv

Number of sites that are heterozygous in the Myzus persicae genome at a depth of 10x.

Column name	Type	Description
Sample	Character	Sample ID
Sites_het	Character	Heterozygous sites
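
One way the coverage and heterozygosity tables could be combined is to divide the number of heterozygous sites by the number of sites covered at the depth threshold, per sample. The Python sketch below illustrates this under that assumption; it is not taken from the repository's scripts, and the function names and example values are hypothetical.

```python
import csv
import io


def to_int(cell):
    """Convert a cell to int, treating blank cells as missing (None)."""
    return int(cell) if cell not in (None, "") else None


def het_per_covered_site(cov_handle, het_handle):
    """Join the two tables on Sample and return Sites_het / Sites_cov per sample."""
    covered = {
        row["Sample"]: to_int(row["Sites_cov"]) for row in csv.DictReader(cov_handle)
    }
    result = {}
    for row in csv.DictReader(het_handle):
        n_cov = covered.get(row["Sample"])
        n_het = to_int(row["Sites_het"])
        if n_cov and n_het is not None:  # skip missing values and zero coverage
            result[row["Sample"]] = n_het / n_cov
    return result


# Hypothetical values for illustration only.
cov = io.StringIO("Sample,Sites_tot,Sites_cov\nS1,1000,800\nS2,1000,\n")
het = io.StringIO("Sample,Sites_het\nS1,4\nS2,3\n")
ratios = het_per_covered_site(cov, het)
print(ratios)
```

Samples with a blank cell in either table are left out of the result rather than being treated as zero.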

1_6_SnpEff_Buchnera/

snpEff_annot_raw_buch.vcf

Annotations of Buchnera variants using SnpEff in VCF file format.

snpEff_annot_parsed.csv

Annotations of Buchnera variants using SnpEff parsed and in CSV file format.

Column name	Type	Description
CHROM	Character	Sequence name
POS	Integer	Sequence position
LOCUS	Character	Locus ID
REF	Character	Reference allele
ALT	Character	Alternate allele
SEQ	Character	Focal allele sequence
EFFECT	Character	Mutation effect
GENE	Character	Gene function
AMINO	Character	Amino acid
TYPE	Character	Type of mutation
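
As a rough sketch of how the parsed annotations might be summarised, the Python snippet below counts how often each value occurs in the TYPE column. The function name `tally_column` and the labels `class_a` and `class_b` are placeholders; the actual TYPE values used in snpEff_annot_parsed.csv are not listed in this README.

```python
import csv
import io
from collections import Counter


def tally_column(handle, column="TYPE"):
    """Count how often each value occurs in one column of a CSV table."""
    return Counter(row[column] for row in csv.DictReader(handle))


# Placeholder rows and labels for illustration only.
example = io.StringIO("POS,TYPE\n10,class_a\n20,class_b\n30,class_a\n")
counts = tally_column(example)
print(counts.most_common())
```

A ratio between two mutation classes can then be taken directly from the counts, for example `counts["class_a"] / counts["class_b"]`.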

Data: 2_Popgen_analysis/

This is the pipeline for running population genetic analyses. The data directory is intentionally left empty and serves as a placeholder.

Scripts for Bioinformatics

Any script with the name ‘1_[…]’ is part of the ‘1_Bioinformatics’ pipeline. The ‘[…]’ indicates details of the script’s role.

1_0_R_environment.R

R script to set up environmental variables in R.

1_2_Trim_reads._HPC_jobs.R

R script to generate multiple scripts for trimming samples in parallel on an HPC.

1_3a_Prepare_references.sh

Bash script to prepare reference sequences for mapping.

1_3b_Map_GPA_HPC_jobs.R

R script to generate multiple scripts for mapping samples to the Myzus persicae genome in parallel on an HPC.

1_3c_Map_Buch_HPC_jobs.R

R script to generate multiple scripts for mapping samples to the Buchnera genome in parallel on an HPC.

1_3d_Downsample_BAMs.sh

Bash script to downsample BAM files.

1_4a_Variant_calling_GPA.sh

Bash script to call variants in the Myzus persicae genome.

1_4b_Variant_calling_Buchnera.sh

Bash script to call variants in the Buchnera genome.

1_4c_Import_and_filter_SNPs.R

R script to import and filter called variants.

1_4d_Genotype_probabilities_GPA.R

R script to obtain genotype probabilities from low-coverage Myzus persicae variants.

1_5a_Diversity_filter_GPA_10x.sh

Bash script to obtain genetic diversity statistics for Myzus persicae.

1_5b_Diversity_filter_Buch_20x.sh

Bash script to obtain genetic diversity statistics for Buchnera.

1_6a_SnpEff_Buchnera_annotate.sh

Bash script to annotate mutational effects on Buchnera variants with SnpEff.

1_6b_SnpEff_Buchnera_parse.R

R script to parse the SnpEff annotations of Buchnera variants.

1_Bioinformatics.Rproj

R project file to manage R analyses.

Scripts for Population genetic analysis

Any script with the name ‘2_[…]’ is part of the ‘2_Popgen_analysis’ pipeline. The ‘[…]’ indicates details of the script’s role.
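
The naming convention makes it straightforward to decide where each script belongs. As a small illustration, this Python sketch maps a script's file name to its pipeline directory based on the numeric prefix; the function name `target_directory` is hypothetical.

```python
def target_directory(script_name):
    """Map a script file name to its pipeline directory by its numeric prefix."""
    if script_name.startswith("1_"):
        return "1_Bioinformatics"
    if script_name.startswith("2_"):
        return "2_Popgen_analysis"
    raise ValueError(f"unrecognised script prefix: {script_name}")


for name in ("1_3a_Prepare_references.sh", "2_5_Diversity_stats.R"):
    print(name, "->", target_directory(name))
```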

2_0_R_environment.R

R script to set up environmental variables in R.

2_3_Genetic_differentiation.R

R script for calculating genetic differentiation among samples.

2_4_Genetic_distance_trees.R

R script for constructing genetic distance trees.

2_5_Diversity_stats.R

R script for summarising diversity statistics.

2_6_Plant_hosts.R

R script for testing for genetic segregation among plant hosts.

2_7_Buchnera_mutations.R

R script for summarising the mutational effects in the Buchnera genome.

2_Popgen_analysis.Rproj

R project file to manage R analyses.