Skip to main content

Comparative transcriptomics reveals divergence in pathogen response gene families amongst 20 forest tree species

Cite this dataset

Lu, Mengmeng; Cao, Min; Yang, Jie; Swenson, Nathan G. (2023). Comparative transcriptomics reveals divergence in pathogen response gene families amongst 20 forest tree species [Dataset]. Dryad.


Forest trees provide critical ecosystem services for humanity that are under threat due to ongoing global change. Measuring and characterizing genetic diversity is key to understanding adaptive potential and developing strategies to mitigate negative consequences arising from climate change. In the area of forest genetic diversity, genetic divergence caused by large-scale changes at the chromosomal level has been largely understudied. In this study, we used the RNA-seq data of twenty co-occurring forest tree species from genera including Acer, Alnus, Amelanchier, Betula, Cornus, Corylus, Dirca, Fraxinus, Ostrya, Populus, Prunus, Quercus, Ribes, Tilia, and Ulmus sampled from Upper Peninsula of Michigan. These data were used to infer the origin and maintenance of gene family variation, species divergence time, as well as gene family expansion and contraction. We identified a signal of common whole genome duplication events shared by core eudicots. We also found rapid evolution, namely fast expansion or fast contraction of gene families, in plant-pathogen interaction genes amongst the studied diploid species. Finally, the results lay the foundation for further research on the genetic diversity and adaptive capacity of forest trees, which will inform forest management and conservation policies.


Leaves of each of twenty forest species were sampled in August 2021, at the University of Notre Dame Environmental Research Center (UNDERC) near Land O' Lakes, Wisconsin, U.S.A. (Fig. 1). The UNDERC site is located at latitude: 46.23391, longitude: -89.537254, and encompasses approximately ~8,000 acres of mixed deciduous forest on both sides of the state line between Wisconsin and Michigan’s upper peninsula. The twenty species used in this study are dominant components of the woody vegetation at the site and are: Acer rubrum (red maple), Acer saccharum (sugar maple), Acer spicatum (mountain maple), Alnus incana (Gary alder), Amelanchier laevis (smooth serviceberry), Betula alleghaniensis (yellow birch), Betula papyrifera (paper birch), Cornus alternifolia (pagoda dogwood), Corylus cornuta (beaked hazelnut), Dirca palustris (Eastern leatherwood), Fraxinus nigra (black ash), Ostrya virginiana (American hophornbeam), Populus grandidentata (bigtooth aspen), Populus tremuloides (quaking aspen), Prunus serotina (black cherry), Prunus virginiana (chokecherry), Quercus rubra (northern red oak), Ribes cynosbati (prickly gooseberry), Tilia americana (American basswood), Ulmus americana (American elm). They are all angiosperms (flowering plants) and eudicots, belonging to the clades of asterids, saxifragales, and rosids.  These species were chosen as they are the most dominant at the study site and in large national forests in the region (i.e. Ottawa, Nicolet, and Hiawatha National Forests). Specifically, these species compose >95% of the tree individuals and above-ground biomass in these forests (Swenson et al. 2017). We focus on the most dominant species in order to provide insights into the ecological and evolutionary processes driving transcriptomic variation within and across populations and communities (Swenson and Jones 2017). Upon collection, the samples were flash-frozen in liquid nitrogen, then transported and stored in an ultracold freezer at the field station.

The samples were sent to Novogene Corporation Inc. (Chula Vista, CA, USA) for mRNA extraction, library preparation, mRNA sequencing, and quality control. The samples were sequenced on an Illumina NovaSeq 6000 platform to produce 150 bp paired-end reads. Raw reads filtering was performed by Novogene to get the clean reads including removing reads containing adapters, removing reads containing N >10%, and removing reads containing low quality base (Phred quality score ≤5) which is over 50% of the total base. An average of 6 Gb paired-end clean reads were acquired for each sample and deposited in NCBI Sequence Read Archive (SRA accession number: PRJNA896247). De novo transcriptome assembly was performed for each sample using the software TransLiG v1.3 with default parameters (Liu et al. 2019). The clean reads were then mapped to the assembled transcriptome using the software RSEM v1.2.25 with default parameters (Li and Dewey 2011). The longest transcript (isoform) with a transcript per million (TPM) value ≥1.5 for each gene was retained. Redundancy was removed from the retained transcripts using the software CD-HIT v4.8.1 with parameters of -c 0.95 -n 8 -T 32 -M 1540 (Fu et al. 2012). To remove potential contamination, the retained transcripts were aligned against the non-redundant (nr) protein database (downloaded on January 22, 2022) using the blastx function implemented by the software DIAMOND v2.0.13.151 (Buchfink et al. 2021) with parameter of --outfmt 102 to assign the best-known taxon ID to each query transcript. The file that was used to convert accession numbers to taxonomy was downloaded from, and the taxon ID file was downloaded from The query transcripts that could be assigned to green plant (Viridiplantae) taxon IDs were retained. The filtered transcripts were aligned to the nr database that only includes green plant subjects using the blastx function implemented by the software DIAMOND v2.0.13.151 with default parameters. The aligned transcripts were combined with those transcripts assigned with green plant taxon IDs to form the final transcriptomes. The completeness of the final transcriptomes was examined using the 1,614 orthologues in the BUSCO (Benchmarking Universal Single Copy Orthologs) v5.2.2 set of embryophyta_odb10 (Manni et al. 2021).

The transcriptomes were annotated with Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) terms. The transcripts were aligned to the nr database that only includes green plant subjects using the blastx function implemented by the software DIAMOND v2.0.13.151 with parameters of -taxonlist 33090 --very-sensitive -f5. The .xml outputs were used as input for mapping and annotation using the software Blast2GO (Götz et al. 2008). The transcripts were translated to protein sequences using the software TransDecoder v5.5.0 ( with default parameters to get the longest open reading frame (ORF) per transcript and then were annotated by the software InterProScan v5.54-87.0 (Jones et al. 2014) with the parameters of -dra -f XML -goterms -pa. The final GO terms were acquired by merging Blast2GO and InterProScan annotation outputs using the software Blast2GO. To get the KEGG terms, the longest ORF per transcript were also annotated using EggNOG-mapper v2.1.7 (Cantalapiedra et al. 2021) with parameters of -m diamond -sensmode more-sensitive -tax_scope 33090 -tax_scope_mode 33090. The assembled transcriptome and annotation were used in the downstream analyses as shown in the flowchart (Supplementary Fig. S1).


National Science Foundation, Award: DEB-2124466

National Natural Science Foundation of China, Award: DEB-32061123003