Data for: Endophyte genomes support greater metabolic gene cluster diversity compared with non-endophytes in Trichoderma
Data files
Mar 30, 2023 version files 565.09 MB
-
Metabolic_gene_cluster_data.xlsx
69.97 KB
-
mycoparasitism_gene_analysis.tar.gz
409.93 KB
-
protein_fastas.tar.gz
128.61 MB
-
README.md
18.40 KB
-
Trichoderma_annotations_gffs.tar.gz
426.85 MB
-
Trichoderma_basic_stats.csv
17.44 KB
-
Trichoderma_Orthogroups.txt
9.12 MB
-
Trichoderma_tree.txt
1.48 KB
Mar 30, 2023 version files 565.10 MB
-
EasA_phylogenies.xlsx
11.29 KB
-
Metabolic_gene_cluster_data.xlsx
69.97 KB
-
mycoparasitism_gene_analysis.tar.gz
409.93 KB
-
protein_fastas.tar.gz
128.61 MB
-
README.md
18.74 KB
-
Trichoderma_annotations_gffs.tar.gz
426.85 MB
-
Trichoderma_basic_stats.csv
17.44 KB
-
Trichoderma_Orthogroups.txt
9.12 MB
-
Trichoderma_tree.txt
1.48 KB
Abstract
Trichoderma is a cosmopolitan genus with diverse lifestyles and nutritional modes, including mycotrophy, saprophytism, and endophytism. Previous research has reported greater metabolic gene repertoires in endophytic fungal species compared to closely-related non-endophytes. However, the extent of this ecological trend and its underlying mechanisms are unclear. Some endophytic fungi may also be mycotrophs and have one or more mycoparasitism mechanisms. Mycotrophic endophytes are prominent in certain genera like Trichoderma, therefore, the mechanisms that enable these fungi to colonize both living plants and fungi may be the result of expanded metabolic gene repertoires. Our objective was to determine what, if any, genomic features are overrepresented in endophytic fungi genomes in order to undercover the genomic underpinning of the fungal endophytic lifestyle. Here we compared metabolic gene cluster and mycoparasitism gene diversity across a dataset of thirty-eight Trichoderma genomes representing the full breadth of environmental Trichoderma’s diverse lifestyles and nutritional modes. We generated four new Trichoderma endophyticum genomes to improve the sampling of endophytic isolates from this genus. As predicted, endophytic Trichoderma genomes contained, on average, more total biosynthetic and degradative gene clusters than non-endophytic isolates, suggesting that the ability to create/modify a diversity of metabolites potential is beneficial or necessary to the endophytic fungi. Still, once the phylogenetic signal was taken in consideration, no particular class of metabolic gene cluster was independently associated with the Trichoderma endophytic lifestyle. Several mycoparasitism genes, but no chitinase genes, were associated with endophytic Trichoderma genomes. Most genomic differences between Trichoderma lifestyles and nutritional modes are difficult to disentangle from phylogenetic divergences among species, suggesting that Trichoderma genomes maybe particularly well-equipped for lifestyle plasticity. We also consider the role of endophytism in diversifying secondary metabolism after identifying the horizontal transfer of the ergot alkaloid gene cluster to Trichoderma.
Endophyte genomes support greater metabolic gene cluster diversity compared with non-endophytes in Trichoderma
Contributors: Kelsey Scott1, Zachary Konkel1,2, Emile Gluck-Thaler3, Guillermo E. Valero David1, Coralie Farinas Simmt1, Django Grootmyers4, Priscila Chaverri5,6, Jason Slot1,7
Affiliations
1Department of Plant Pathology, The Ohio State University, Columbus, OH
2Center for Applied Plant Sciences, The Ohio State University, Columbus, OH
3Laboratory of Evolutionary Genetics, University of Neuchtel, Neuchtel, Switzerland
4Department of Ecology and Evolutionary Biology, University of Tennessee, Knoxville, TN
5Department of Natural Sciences, Bowie State University, Bowie, MD
6School of Biology and Natural Products Research Center (CIPRONA), University of Costa Rica, San Jos, Costa Rica
7Center for Psychedelic Drug Research and Education, The Ohio State University, Columbus, OH
Dataset Attribution, Usage, and Sharing/Access information
- bioRxiv preprint link: https://www.biorxiv.org/content/10.1101/2023.03.14.532605v1
- SRA bioproject accession number for newly sequenced genomes: PRJNA899549
BioProject contains genome assemblies of 4 newly sequenced Trichoderma endophyticum genomes - Links to other publicly accessible locations of the data:
- 35 genome assemblies/raw SRA data were downloaded directly from NCBI: https://www.ncbi.nlm.nih.gov/data-hub/taxonomy/5543/
Methods and Analyses
Full methodology available at: https://www.biorxiv.org/content/10.1101/2023.03.14.532605v1
Briefly:
Four isolates of T. endophyticum were isolated from the sapwood of living rubber trees (Hevea spp.) from Peru. These isolates were cultured, DNA extracted, DNA sequenced with MinION long read sequencing and Illuminna PE 150 sequencing, and had their genomes hybrid assembled.
Publicly available Trichoderma genomes and SRA data were used to generate 35 other Trichoderma genomes to supplement the dataset.
For the complete dataset of 39 Trichoderma genomes:
- GENOME ASSEMBLIES:
Raw SRA data was retrieved from NCBI 35 Trichoderma strains. For both the publicly available read data and our own T. endophyticum reads, we removed adapters and low-quality reads from the short-read data with Trimmomatic and assessed overall read quality with FastQC.
Trimmed reads were assembled using SPAdes v3.12.0 to assemble each Trichoderma genome, supplementing with Nanopore reads to create hybrid assemblies for the T. endophyticum genomes.
All assemblies were assessed with QUAST v4.6.3 and Illumina coverage assessed with BBmap.
Our personal assemblies were used if they were considered to be of higher quality than the publicly available assemblies on NCBI. - GENOME ANNOTATIONS:
All genomes were annotated using a MAKER annotation pipeline, this pipeline was performed separately and identically for each separate genome.
De novo repeat libraries were created with RepeatModeler and supplied to MAKER. MAKER was run in three iterations to fine-tune the genome annotations.
In the first iteration, protein and EST data from the same or closely related Trichoderma species were retrieved from NCBI for each genome. The first iteration of MAKER used the repeat library and provided EST and protein data to generate individual preliminary gene annotations for each genome.
For the second iteration of the MAKER2 pipeline, we provided ab initio gene predictions from SNAP (produced from preliminary gene predictions from the first iteration), Augustus (trained using BUSCO), and GeneMark.
SNAP and Augustus were both retrained using the high-quality predictions from iteration 2, and this refined dataset was supplied to MAKER2 for the final iteration to produce the final genome annotations for each isolate (option keep_preds = 1).
To find orthologous sequences across the 39 genome set, we provided the final protein datasets for each Trichoderma genome to OrthoFinder.
The resulting orthology datafiles and the MAKER protein files were processed with Orthofiller to improve the genome annotations and identification of orthologs.
All subsequent analyses used the post-Orthofiller protein datasets. Genes identified with OrthoFiller are preceded with the label "ofg". - ASSIGNMENT OF SPECIES, AND SPECIES LIFESTYLE/NUTRITIONAL MODE
For each genome, the genes encoding translation-elongation factor 1 alpha (TEF1) and RNA polymerase II gene (RPB2) sequence were identified in each genome and compared to their respective GenBank type specimen sequences using BLASTn.
In the case of several Trichoderma isolates, the original species identification was determined to be incorrect, and the species was re-labeled.
lifestyle information (i.e., endophyte, on decaying plant material, in soil, and on other fungi) was based on published research, and nutritional mode (i.e., mycotrophy vs. saprotrophy) was based on ancestral character reconstructions, phylogenetic affinities, and confirmed antagonism experiments.
Two Trichoderma strains (T. bissettii [JCM 1883] and unknown Trichoderma species [OTPB3]) were left as unknown because no ecological information was available.
For each Trichoderma species, we assigned a potential endophytic lifestyle by identifying isolates reportedly isolated from living plant tissue, documented on NCBI.
To be a potential endophyte, we required at least two isolates from separate sources with a blastn match for one of the following: TEF1, 100% coverage and >99.5% identity, and RPB2, 100% coverage and >99.0% identity. For T. arundinaceum, there was no RPB2 sequence from type material available on NCBI, so we looked for matches with TEF1, 100% coverage and >99.5% identity and ITS 100% coverage and >99.5% identity. - CREATION OF PHYLOGENOMIC TREE:
The protein sequences of the single-copy orthogroups from OrthoFinder were aligned (mafft) and curated.
RAxML was used to determine and mapping percentage of 100 rapid bootstraps to the best-scoring maximum-likelihood tree, which resulted in 154 SCO trees with a median BP >98%.
These 154 single-copy ortholog trees were used to create the majority rule extended consensus tree using RAxML (-m GTRCAT).
An alignment consensus phylogenomic tree was created from the 154 SCO IQ-TREEs (those >98% BP support) in IQ-TREE (-bb 1000 -bsam GENESITE -m TEST) [172].
For each of the 38 high-quality Trichoderma genomes, we performed the following analyses:
- SECONDARY METABOLITE DEGRADATIVE GENE CLUSTERS (DGCs), IDENTIFATION AND COMPARISON:
Using a custom pipeline (https://github.com/egluckthaler/cluster_retrieve) we identified the DGCs present in each genome.
This script used gene cluster models containing particular core genes, or genes known to be important for processing secondary metabolites as reference.
The following core genes were searched: aromatic ring-opening dioxygenase (ARD), benzoate 4-monooxygenase (BPH), ferulic acid esterase 7 (CAE), catechol dioxygenase (CCH), epicatechin laccase (ECL), ferulic acid decarboxylase (FAD), pterocarpan hydroxylase (PAH), naringenin 3-dioxygenase (NAD), phenol 2-monooxygenase (PMO), quinate 5-dehydrogenase (QDH), salicylate hydroxylase (SAH), stilbene dioxygenase (SDO), and vanillyl alcohol oxidase (VAO) (described in Gluck-Thaler and Slot 2018).
DGC containing the same core gene were considered to be in the same gene cluster "class".
When we compared DGCs across different genomes, DGCs exhibiting similar organization with a high degree of sequence similarity were referred to as the same DGC family.
-We grouped gene clusters into homologous cluster families using BiG-SCAPE. Clusters were classified into cluster families using default parameters while referencing MIBiG. For degradative clusters, we modified BiG-SCAPE's default model weights to 63% protein domain sequence similarity, 35% protein domain presence-absence, 2% conserved synteny, and a core gene boost set to 2x. - SECONDARY METABOLITE BIOSYNTHETIC GENE CLUSTERS (BGCs), IDENTIFATION AND COMPARISON:
Biosynthetic gene clusters were predicted for each genome using the antiSMASH v5.0b pipeline with glimmerhmm gene prediction enabled.
Clusters were analyzed via the knownclusterblast module to identify potential homologs of MIBiG v2.0 database entries.
As additional quality filters, we only referenced MIBiG accessions with 3 or more genes and removed knownclusterblast genes with bit scores lower than 100, redundant hits within each cluster, clusters with aggregate bit scores less than 100 x cluster gene quantity, and any clusters with less than 60% of the respective MIBiG clusters gene quantity.
We parsed and compiled data from the antiSMASH and knownclusterblast output using smashStats.py (gitlab.com/xonq/mycotools).
BGCs are grouped in different classes respective to their association with different secondary metabolite compound classes, according to antiSMASH nomenclature (e.g. polyketides, non-ribosomal peptides, terpenes, indole etc.).
-We grouped gene clusters into homologous cluster families using BiG-SCAPE. Clusters were classified into cluster families using default parameters while referencing MIBiG. - MYCOPARASITISM-RELATED GENES, IDENTIFATION AND COMPARISON:
Using previously published results of differential gene expression analyses from RNA-seq and oligonucleotide tiling array data, we compiled a list of 746 genes from T. atroviride, T. virens, T. reesei, and T. harzianum that were upregulated during at least one time point or treatment during mycoparasitism assays compared with control time points or treatments.
References for the three differential expression analyses studies are as follows:- Atanasova L, Crom SL, Gruber S, Coulpier F, Seidl-Seiboth V, Kubicek CP, et al. Comparative transcriptomics reveals different strategies of Trichoderma mycoparasitism. BMC Genomics. 2013;14: 121. doi:10.1186/1471-2164-14-121
- Steindorff AS, Ramada MHS, Coelho ASG, Miller RNG, Pappas GJ, Ulhoa CJ, et al. Identification of mycoparasitism-related genes against the phytopathogen Sclerotinia sclerotiorum through transcriptome and expression profile analysis in Trichoderma harzianum. BMC Genomics. 2014;15: 204. doi:10.1186/1471-2164-15-204
- Morn-Diez ME, Carrero-Carrn I, Rubio MB, Jimnez-Daz RM, Monte E, Hermosa R. Transcriptomic Analysis of Trichoderma atroviride Overgrowing Plant-Wilting Verticillium dahliae Reveals the Role of a New M14 Metallocarboxypeptidase CPA1 in Biocontrol. Frontiers in Microbiology. 2019;10: 1120. doi:10.3389/fmicb.2019.01120
For each of these 746 mycoparasitism-related genes, we identified the highest scoring hit in each Trichoderma genome using BLASTp (E value threshold = 0.001). Any orthogroup containing a highest scoring hit to a mycoparasitism-related gene was considered to be a mycoparasitism-related orthogroup.
Naming Conventions of isolates
- each isolate is referred to by an abbreviation composed of their genus name (first three letters of genus name), species name (first two letters of species name), genome assembly version (#), and isolate code.
- example: Triat1-JCM9410 = Trichoderma atroviride genome version 1, isolate JCM9410
Data and File Overview
Trichoderma_tree.txt = phylogenomic tree in Newick format.
* Isolates are referred to by their full species name.
protein_fastas.tar.gz = protein fastas for each of the 39 Trichoderma genomes
* each file contains the protein sequence data for each (post-OrthoFiller) genome
Trichoderma_annotations_gffs.tar.gz = genome annotations (gffs) for the 39 Trichoderma assemblies
* each file contains the gffs for each (post-OrthoFiller) genome
Trichoderma_Orthogroups.txt = text file showing the orthogroups (generated by OrthoFinder) for 38 Trichoderma genomes
* file containing the groupings of orthologous genes for the 39 Trichoderma genomes
* each gene is labeled with the isolate of origin and the gene number
metabolic_gene_cluster_data.xlsx = results of DGC and BGC identification analysis with 38 Trichoderma genomes
* The DGC_families and BGC_families sheets are organized with the Trichoderma genome/isolates on the rows and the metabolic gene clusters on the columns. For each metabolic gene cluster, there is count data (from Big-SCAPE) for the number of identified gene cluster families. The gene cluster families are labeled with both their type and unique gene cluster familiy number.
mycoparasitism_gene_analysis.tar.gz = all related files for the mycoparasitism-related gene analysis including lists of reference genes
* mycoparasitism_gene_list.xlsx
- sheet 1 = experiment overview and stats
- sheet 2 = total number and percent of upregulated genes identified per genome
- sheet 3 = number of each Mycoparasitism gene ortholog identified in each genome (references mycoparasitism gene orthogroups)
* all_OGs.txt = orthogroups create using upregulated genes as references
* all_PAs.txt = counts of each orthogroup in each genome
* separated_based_on_study_of_origin (folder) = contains the orthogroups and orthogroups count per genome, separated based on the study where the mycoparasitism-gene was identified
Trichoderma_basic_stats.csv = genome statistics, lifestyle/nutritional information, gene cluster counts, gene counts and other related data for the 38 Trichoderma genomes
For each of the previously-explained analyses, the results of each have been collected into a single file for convenient statistical analysis.
Explanation of column name:
* genome_code = genome code used to identify each isolate/genome
* genomeID = species name spelled out for visulaziton purposes
* clade = clade name as determined in phylogenomic tree
* species = full species name
* nutritional_mode = mycotroph/saprotroph
* multiple_lifestyles = yes/no, is the species reported as being able to exhibit more than one lifestyle?
* potential_endophyte_incl = yes/no, is the species determined as being an endophyte, including species reported as being found in living plants on NCBI?
* reported_endophyte = yes/no, is the species originally reported as being an endophyte?
* on_fungi = yes/no, is the species reported as living on other fungi?
* soil = yes/no, is the species reported as living freely in the soil?
* wood.litter = yes/no, is the species reported as living on either plant litter or wood?
* Complete BUSCOs = number of complete BUSCO genes identifed (using sordariomyceta_odb9)
* complete busco % = percentage of complete BUSCO identified from the sordariomyceta_odb9 reference dataset
* Complete and single-copy BUSCOs = nnumber of complete and single copy BUSCOs from sordariomyceta_odb9
* avg_base_coverage.shortreads = average coverage of short reads on the genome, determinend using BBMAP
* Number_contigs.500bp = number of contigs in the genome assembly over 500 bp long
* N50 = N50 of genome assembly
* Number_genes.preOrthoFiller = number of genes identified from the MAKER pipeline
* Number_genes.postOrthoFiller = number of genes identified after the OrthoFiller step
* Genome_length = total genome assembly length in bp
* SAH_total = total number of SAH DGCs
* QDH_total = total number of QDH DGCs
* ARD_total = total number of ARD DGCs
* NAD_total = total number of NAD DGCs
* CCH_total = total number of CCH DGCs
* PMO_total = total number of PMO DGCs
* CCH_SAH_total = total number of hybrid CCH-SAH DGCs
* FAD_total = total number of FAD DGCs
* BPH_total = total number of BPH DGCs
* PAH_total = total number of PAH DGCs
* CCH_PMO_hybrid = total number of hybrid CCH-PMO DGCs
* SAH_SDO_hybrid = total number of hybrid SAH-SDO DGCs
* total_cat_fam = total number of DGC families identified
* BGCmetab_genes = total number of metabolic genes identified in antismash analysis
* NRPS_total = total number of NRPS BGCs
* T1PKS_total = total number of T1PKS BGCs
* NRPS_like_total = total number of NRPS-like BGCs
* T1PKS_NRPS_total = total number of hybrid T1PKS-NRPS BGCs
* terpene_total = = total number of terpene BGCs
* betalactone_total = total number of beta-lactone BGCs
* total_fungal_Ripp = total number of fungal-RiPP BGCs
* indole_total = total number of indole BGCs
* other_hybrid_total = total number of BGCs that were hybrids (contained more than one different type of BGC core gene)
* BGC_count_incl_singletons_total = total number of BGC families, including those that were only found in that particular genome (aka singletons)
* metgenes_per_BGC = BGCmetab_genes / BGC_count_incl_singletons_total
* AII = total number of AII chitinase genes
* AIV = total number of AIV chitinase genes
* AV = total number of AV chitinase genes
* BI = total number of BI chitinase genes
* BII = total number of BII chitinase genes
* BV = total number of BV chitinase genes
* CI = total number of CI chitinase genes
* CII = total number of CII chitinase genes
* AIII = total number of AIII chitinase genes
* BIII = total number of BIII chitinase genes
* BIV = total number of BIV chitinase genes
* CHITD = total number of CHITD chitinase genes
* total_Chit_class = total number of chitinase genes identified
* LysM = total number of lysin motifs identified
* CBD = total number of chitin-binding domains identified
* per_chit = (total_Chit_class / Number_genes.postOrthoFiller) * 100 [as percent]
* LysM.per = LysM / total_Chit_class
* CBD.per = CBD / total_Chit_class
* AII.per = (AII / Number_genes.postOrthoFiller) * 100
* AIV.per = (AIV / Number_genes.postOrthoFiller) * 100
* AV.per = (AV / Number_genes.postOrthoFiller) * 100
* BI.per = (BI / Number_genes.postOrthoFiller) * 100
* BII.per = (BII / Number_genes.postOrthoFiller) * 100
* BV.per = (BV / Number_genes.postOrthoFiller) * 100
* CI.per = (CI / Number_genes.postOrthoFiller) * 100
* CII.per = (CII / Number_genes.postOrthoFiller) * 100
* CHITD.per = (CHITD / Number_genes.postOrthoFiller) * 100
* mgene.prop.numgnes = BGCmetab_genes / Number_genes.postOrthoFiller
* DGC.prop = total_cat_fam / Number_genes.postOrthoFiller
* BGC.prop = BGC_count_incl_singletons_total / Number_genes.postOrthoFiller
* total_mycopara_gene = total number of mycoparasitism gene orthologs identified
* mp_gene_per_of_total = (total_mycopara_gene / Number_genes.postOrthoFiller) * 100
EasA_phylogenies.xlsx: Phylogenetic trees showing placement of Brevicompactum clade ergot alkaloid BGC genes, under different contraints.
Metabolic_gene_cluster_data.xlsx: Matrix indicating presence/absence of different gene clusters across the collection of genomes.
