Data from: The polygenic strategies of host-specific and general virulence of Botrytis cinerea across diverse eudicot hosts

Caseys, Celine 1 ; Kliebenstein, Daniel 1

Published May 09, 2025 on Dryad. https://doi.org/10.5061/dryad.j6q573npm

Data files

May 09, 2025 version files (34.39 MB total):

Github_Eudicot_GWAS.zip (34.38 MB)

README.md (12.50 KB)

Abstract

Diverse qualitative and quantitative genetic architectures can successfully enable fungal virulence and host range. To model the quantitative genetic architecture of a generalist pathogen with an extensive host range, we conducted a genome-wide association study (GWAS) of the lesion area of Botrytis cinerea across eight hosts. This revealed that it was possible to partition virulence, as defined by the lesion area, into a component common to all hosts and host-specific components. All traits showed that a large proportion of the Botrytis genome likely contributes to fungal lesion development on leaves with small effect sizes. The candidate genes are evenly spread across the core chromosomes with no indication of bipartite genomic architecture. The GWAS-identified polymorphisms and genes show that B. cinerea relies on genetic variants across hundreds of genes for growing on diverse hosts, with most genes influencing relatively few hosts. When pathogen genes were associated with multiple hosts, they were associated with unrelated rather than related host species. Comparative genomics further suggested that the GWAS-identified genes are largely syntenic with other specialist Botrytis species and not unique to B. cinerea. Overall, as shown in A. thaliana, B. cinerea’s generalist behavior is derived from the sum of the genome-wide genetic variation acting within gene networks that differentially coordinate the interaction with diverse hosts.


Description of the data and file structure

Corresponding publication: Dryad Repository

Author: Celine Caseys, Dept. of Plant Sciences, University of California, Davis

Project Overview

This repository contains data and code used in the genome-wide association study (GWAS) of Botrytis cinerea virulence across eight eudicot crop species. The project investigates host-specific and general virulence strategies through polygenic analyses, including Bayesian Sparse Linear Mixed Models (BSLMM), multivariate MASH analysis, SNP annotation, and comparative genomics.

Notebook files (.html, .pdf, or .Rmd) provide detailed documentation of the datasets, analytical pipelines, and scripts.

Main Archive

File: Github_Eudicot_GWAS.zip

Description:

Contains an RStudio project (Github_Eudicot_GWAS.Rproj) which sets the appropriate working directory. Some folders may appear empty depending on your operating system or RStudio version.

Files and variables

1. BSLMM of lesion area across 85 plant genotypes

The input files for GEMMA are in the PLINK file format (https://www.cog-genomics.org/plink/1.9/input#ped).

Subfolder:

Gemma

Associated files:

bslmm_Eudicot98.sh :

Bash script to run GEMMA on Linux.

PLINK Files:

Cor_MAF20NA10.bed

Binary SNP matrix. This is an input file for GEMMA. Because it is stored as a binary file, it is not human-readable when opened in a text editor.
Cor_MAF20NA10.bim

SNP locations for the binary SNP matrix. This is an input file for GEMMA.

Column 1: Chromosome, Column 2: SNP identifier, Column 3: empty, Column 4: SNP position, Columns 5 and 6: Parameters for GEMMA
Cor_MAF20NA10.fam

This is an input file for GEMMA. Columns 1-6 contain the strain information in the PLINK .fam format (https://www.cog-genomics.org/plink/2.0/formats#fam); the PHENOTYPES are in columns 7 and following. Columns: 1 = Family ID (‘FID’); 2 = Individual ID (‘IID’; cannot be ‘0’); 3 = Individual ID of father (‘0’ if father isn’t in dataset); 4 = Individual ID of mother (‘0’ if mother isn’t in dataset); 5 = Sex code (‘1’ = male, ‘2’ = female, ‘0’ = unknown); 6 = Phenotype value (‘1’ = control, ‘2’ = case, ‘-9’/‘0’/non-numeric = missing data case/control).
Eudicot_GWAS_revision.fam

This is an input file for GEMMA. It contains the general virulence and residual datasets. Columns 1-6 contain the strain information in the PLINK .fam format (https://www.cog-genomics.org/plink/2.0/formats#fam); the PHENOTYPES are in columns 7 and following. Columns: 1 = Family ID (‘FID’); 2 = Individual ID (‘IID’; cannot be ‘0’); 3 = Individual ID of father (‘0’ if father isn’t in dataset); 4 = Individual ID of mother (‘0’ if mother isn’t in dataset); 5 = Sex code (‘1’ = male, ‘2’ = female, ‘0’ = unknown); 6 = Phenotype value (‘1’ = control, ‘2’ = case, ‘-9’/‘0’/non-numeric = missing data case/control). The names and order of the traits are listed in the file Revisions_traitList.txt.
Cor_MAF20NA10.map

Non-binary, contains the SNP locations

Column 1: Chromosome, Column 2: SNP identifier, Column 3: empty, Column 4: SNP position.
Cor_MAF20NA10.ped

Non-binary SNP matrix, for all strains across all SNPs.
Cor_MAF20NA10_kmat1_pheno1.cXX.txt

Kinship matrix from GEMMA (centered matrix)
Cor_MAF20NA10_kmat2_pheno1.sXX.txt

Kinship matrix from GEMMA (standardized matrix)

R Code:

Prep_file-rand.R

Creates the randomized datasets for permutation-based null hypothesis testing. The output is ready to be pasted into the .fam file and run through GEMMA. The phenotypes are randomized across the 96 strains while the SNP information remains associated with each strain.
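
The randomization described above can be sketched in a few lines. This is a minimal illustration in Python, not the contents of Prep_file-rand.R: the function name `permute_phenotypes`, the toy rows, and the seed are hypothetical, and the sketch assumes that whole phenotype rows are shuffled together (the actual script may instead shuffle each trait independently).

```python
import random

def permute_phenotypes(fam_rows, seed=0):
    """Shuffle phenotype columns across strains; columns 1-6 stay in place."""
    rng = random.Random(seed)
    info = [row[:6] for row in fam_rows]    # strain information (columns 1-6)
    phenos = [row[6:] for row in fam_rows]  # phenotypes (columns 7 and following)
    rng.shuffle(phenos)                     # reassign phenotype rows to strains
    return [i + p for i, p in zip(info, phenos)]

# Toy .fam-like rows (hypothetical strains and values), split on whitespace.
rows = [
    ["S1", "S1", "0", "0", "0", "-9", "1.2", "0.4"],
    ["S2", "S2", "0", "0", "0", "-9", "2.3", "0.9"],
    ["S3", "S3", "0", "0", "0", "-9", "0.7", "1.5"],
    ["S4", "S4", "0", "0", "0", "-9", "3.1", "0.2"],
]

permuted = permute_phenotypes(rows, seed=42)
for r in permuted:
    print(" ".join(r))
```

Because only the phenotype columns move, the genotype files (.bed, .bim) can be reused unchanged for every permuted dataset.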

2. Analysis of BSLMM hyperparameters

Associated files:

all_hyperParameters.txt

hyperparam: The BSLMM hyperparameter, one of h (heritability), PVE (total proportion of phenotypic variance explained), rho (approximation to the proportion of genetic variance explained by variants with major effects), PGE (proportion of genetic variance explained by sparse effects), pi (proportion of variants with non-zero effects), and n.gamma (number of variants with major effects).

The mean, median, and 2.5% and 97.5% quantiles of each hyperparameter are reported for each of the 20 BSLMM runs (rep 1-20) for each of the 85 genotypes (coded in GenoN) across the 8 host species (coded in Species). Trait1 is the combination of the host species and genotype.
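
As an illustration of how such summaries can be obtained from the posterior samples of one hyperparameter in one run, here is a minimal Python sketch. The sample values are made up, the helper names `quantile` and `summarize` are hypothetical, and the linear-interpolation quantile is an assumption; HyperParam_extract.R may compute the quantiles differently.

```python
import statistics

def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def summarize(samples):
    """Return the mean, median, and 2.5% / 97.5% quantiles of the samples."""
    s = sorted(samples)
    return {
        "mean": statistics.mean(s),
        "median": statistics.median(s),
        "q2.5": quantile(s, 0.025),
        "q97.5": quantile(s, 0.975),
    }

# Made-up posterior samples of PVE for a single run (not real data).
pve_samples = [i / 100 for i in range(101)]
summary = summarize(pve_samples)
print(summary)
```
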
I93_Eudic7_traitList.txt

Num_ID: Numerical identifier for each trait. Numbers 2-8 are the mean lesion area for each host species. Numbers 9-98 are the mean lesion area for each genotype.

Name: species name or accession identifier.
Revisions_traitList.txt

Num_ID: Numerical identifier for each trait.

Name: Trait description as either model corrected emmean or Residuals (Res).

R Code:

HyperParam_extract.R

Import, analyze and plot the BSLMM hyperparameters.

3. MASH Analysis

R Code:

Mash_datAnalysis.R

Imports the results from the BSLMM and runs the multivariate adaptive shrinkage (MASH) pipeline to determine the significance of SNPs. Returns the local false sign rate (LFSR).

4. Analysis of significant SNPs

Associated files:

Signlsfr_85_Geno_SNPs_binom_new.txt

SNP_ID is the combination of the chromosome number and the SNP position.

All other columns are the binary matrix of significant (1) or non-significant (0) association of the trait (see the column name) to the SNP.
AllSNP_CountBySpecie_Table.txt

Num_ID: Numerical row ID

snp_id: the combination of the chromosome number and the SNP position.

All other columns are the binary matrix of significant (1) or non-significant (0) association of the species (see the column name) to the SNP.

N_species is the count of how many species were significantly associated with a SNP.
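
The relationship between the binary matrix and N_species is a row sum, which the following Python sketch illustrates on a toy table. The species column names, the underscore separator in snp_id, and the values are hypothetical placeholders, not taken from the data files.

```python
import csv
import io

# Toy tab-separated table in the layout described above (hypothetical values).
table = (
    "snp_id\tSpeciesA\tSpeciesB\tSpeciesC\n"
    "1_1050\t1\t0\t1\n"
    "1_2210\t0\t0\t0\n"
    "2_530\t1\t1\t1\n"
)

reader = csv.DictReader(io.StringIO(table), delimiter="\t")
counts = {}
for row in reader:
    # N_species: number of species columns flagged as significant (1).
    counts[row["snp_id"]] = sum(
        int(value) for key, value in row.items() if key != "snp_id"
    )

print(counts)
```
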
SNP_genomicLoc_plot_perc.txt

Overview of the percentage of variance explained by each parameter.

Cat: Category as (3' UTR, 5' UTR, CDS functional, CDS synonymous, Intergenic, Intron, Intron with splice).

Perc: Percentage of variance explained by the parameter (in %).

Species: Each of the 8 species or for the whole genome (WG) as control.
AllSNP_CountBySpecies_1.txt

N_Species: Value from 1 to 7, giving the total number of species a gene was associated with

Species: Each of the 8 species

Counts: Count of genes for each species associated to 1 (unique to the species) to 7 other species (shared gene).

R Code:

SlidingWindows_BSLMM.R

Contains the function to calculate the effect of SNPs by sliding windows along the genome.
SNP_sliding_window.R

Import, analyze and plot the sliding window analysis.

5. Analysis of significant genes

Associated files:

All271k_SNPs_Archive.txt

snp_id: the combination of the chromosome number and the SNP position.

seqname: the combination of the chromosome number and the SNP position and nucleotide for the alleles.

omosome: chromosome number

start: SNP location

Distance: Distance of the SNP to the nearest gene. 0 means the SNP is located inside a gene

Gene final: Identifier of the closest gene to a SNP.

Type final: Functional annotation of the SNP.

G500: Association of a SNP to a gene if within 500 bp of the start/stop of the gene.

G1000: Association of a SNP to a gene if within 1000 bp of the start/stop of the gene.

G1500: Association of a SNP to a gene if within 1500 bp of the start/stop of the gene.

gene_ID: Association of a SNP to a gene only if within a gene.

gene_region: as CDS, intron, 3' UTR, 5' UTR.
Gene_Binary_venn_Jan2024.txt

Matrix of 7 species (columns) across significant genes (rows).
Annotations_plot.txt

Process: Annotations for each of the top 10 categories

Num: Count of genes in each of the top 10 categories

Cat: Categories as GO process, Interpro annotations, GO Component, Annotation categories
ALLSW_SummEff.txt

File with the sliding windows summary of the effect sizes.

Start_time: Position of the beginning of the window

end_time: Position of the end of the window

center: Position of the center of the window

summary: Sum of the effect sizes for the SNPs within the window. 0 means there were no SNPs to summarize.

chr: chromosome number

Index: Continuous index across chromosomes to allow plotting.

SumEff: Sum of the effect sizes for the SNPs within the window. NA means there were no SNPs to summarize.

Species: Each of the 8 species
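
To make the window columns concrete, the Python sketch below sums SNP effect sizes in sliding windows along one chromosome. It is only an illustration under stated assumptions: the function name, the window size, the step, the chromosome length, and the SNP values are all hypothetical, and the actual function in SlidingWindows_BSLMM.R may differ (for example in how windows overlap or how effects are signed).

```python
def sliding_window_sums(snps, window, step, chrom_length):
    """Sum effect sizes per window; snps is a list of (position, effect)."""
    out = []
    start = 0
    while start < chrom_length:
        end = start + window
        effects = [eff for pos, eff in snps if start <= pos < end]
        out.append({
            "start": start,
            "end": end,
            "center": start + window // 2,
            "summary": sum(effects),                    # 0 when the window is empty
            "SumEff": sum(effects) if effects else None,  # None stands in for NA
        })
        start += step
    return out

# Hypothetical SNP positions and effect sizes on one chromosome.
snps = [(120, 0.5), (480, 0.25), (1300, 1.0)]
windows = sliding_window_sums(snps, window=1000, step=500, chrom_length=2000)
for w in windows:
    print(w)
```

The two output fields mirror the distinction made above between `summary` (0 for empty windows) and `SumEff` (NA for empty windows).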

R Code:

SNP_genomic _features.R

Import, analyse and plot SNP data about their location within genomic features.

Genes_Clustering.R

Import data and perform hierarchical clustering.

6. Co-expression networks analysis

Associated files:

col0_allreads.csv

Co-expression data for A. thaliana Col-0 from Zhang et al. 2017

Isolate: Each of the 97 Botrytis strains

HostGenotype : Col-0

All other columns are the matrix of Botrytis gene expression.
Mash_AllGenes_500bp.txt

All candidate genes with significant SNPs (identified by BSLMM+MASH) located within the gene or within 500 bp of the start or stop of the CDS.

R Code:

Pipeline_GeneNetwork.R

Import data, filter based on presence in the gene expression dataset, calculate co-expression matrix and generate the gene co-expression network.
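
The steps listed above (filter, correlate, build the network) can be sketched as follows in Python. This is not the code in Pipeline_GeneNetwork.R: the gene names and expression values are invented, and both the use of Pearson correlation and the 0.8 edge threshold are assumptions made for the sake of the example.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy expression matrix: gene -> expression across isolates (made-up values).
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
candidates = ["geneA", "geneB", "geneC", "geneD"]  # geneD has no expression data

# Step 1: keep only candidate genes present in the expression dataset.
kept = [g for g in candidates if g in expression]

# Steps 2 and 3: correlate every pair and keep strong correlations as edges.
threshold = 0.8  # hypothetical cutoff
edges = []
for g1, g2 in combinations(kept, 2):
    r = pearson(expression[g1], expression[g2])
    if abs(r) >= threshold:
        edges.append((g1, g2, r))

print(edges)
```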

7. Comparative genomics via orthologs analysis

Subfolder:

Orthologs

Associated files:

Botrytis_12_ortho_R.xlsx

Cluster & Cluster_ID: Cluster of orthologous genes

representative function: functional annotation for the orthogroup

Botrytis_cinerea_B05.10: B. cinerea genes associated to the orthogroup

Botrytis_aclada: B. aclada genes associated to the orthogroup

Botrytis_deweyae: B. deweyae genes associated to the orthogroup

Botrytis_fragariae: B. fragariae genes associated to the orthogroup

Botrytis_porri: B. porri genes associated to the orthogroup

Botrytis_sinoallii: B. sinoallii genes associated to the orthogroup

Botrytis_hyacinthi: B. hyacinthi genes associated to the orthogroup

Genes_GWS_Sp_SNP.txt

Genes_GWAS: B. cinerea gene IDs

Gene: Gene IDs as coded in the orthogroups

Columns C-J: Matrix of significant (1) or insignificant (0) association to a species (see the column name)

N_species: The number of species a gene was significantly associated to.

CDS: Number of significant SNPs of that gene located in the CDS.

Intron: Number of significant SNPs of that gene located in introns.

five_prime_UTR: Number of significant SNPs of that gene located in the 5' UTR regions.

three_prime_UTR: Number of significant SNPs of that gene located in the 3' UTR regions.

Within_500bp: Number of significant SNPs of that gene located in the promoter region (500 bp downstream or upstream of the gene).

Total_SNP: Total number of SNPs significantly associated with that gene.

R Code:

Orthologs_analysis.R

Import, analyze and plot the orthologs dataset.

The .RData file is a binary file used by R to save the entire workspace, including variables, data frames, functions, and other R objects.

It allows users to reload saved objects exactly as they were during the last session using load(".RData") in R.

Signlsfr_90Geno_SNPs:

This file is a SNP presence-absence matrix across plant genotypes, with each row representing a SNP and columns indicating its presence or absence in individual samples. It’s used to assess genetic variation across species.

Annotation_plot.R

This R script reads gene annotation data, splits it by category, creates bar plots for each subset (Annotations, GO Component, InterPro, and GO Process), and arranges them into a single 2×2 panel figure.

Code/software

This was coded for and functional with R version:

R version 3.6.0 (2019-04-26) -- "Planting of a Tree"

Platform: x86_64-apple-darwin15.6.0 (64-bit)

Access information

Data was derived from the following sources:
