Skip to main content

Comparative transcriptomics reveals divergence in pathogen response gene families amongst twenty forest tree species


Lu, Mengmeng; Cao, Min; Yang, Jie; Swenson, Nathan G. (2023), Comparative transcriptomics reveals divergence in pathogen response gene families amongst twenty forest tree species, Dryad, Dataset,


Forest trees provide critical ecosystem services for humanity that are under threat due to ongoing global change. Measuring and characterizing genetic diversity is key to understanding adaptive potential and developing strategies to mitigate negative consequences arising from climate change. In the area of forest genetic diversity, genetic divergence caused by large-scale changes at the chromosomal level has been largely understudied. In this study, we used the RNA-seq data of twenty co-occurring forest trees species from genera including Acer, Alnus, Amelanchier, Betula, Cornus, Corylus, Dirca, Fraxinus, Ostrya, Populus, Prunus, Quercus, Ribes, Tilia, and Ulmus sampled from Upper Peninsula of Michigan. These data were used to infer the origin and maintenance of gene family variation, species divergence time, as well as gene family expansion and contraction. We identified a signal of common whole genome duplication events shared by core eudicots, and a signal of recent duplication events specific to particular species. We also found rapid evolution, namely fast expansion or fast contraction of gene families, in plant-pathogen interaction genes amongst the diploid species studied. Finally, the results lay the foundation for further research on the genetic diversity and adaptive capacity of forest trees, which will inform forest management and conservation policies.


Leaves of each of twenty forest species were sampled in August, 2021 at the University of Notre Dame Environmental Research Center (UNDERC) near Land O' Lakes, Wisconsin, U.S.A. (Figure 1). The UNDERC site is located at latitude: 46.23391, longitude: -89.537254, and encompasses approximately ~8,000 acres of mixed deciduous forest on both sides of the state line between Wisconsin and Michigan’s upper peninsula. The twenty species used in this study are dominate components of the woody vegetation at the site and are: Acer rubrum (red maple), Acer saccharum (sugar maple), Acer spicatum (mountain maple), Alnus incana (Gary alder), Amelanchier laevis (smooth serviceberry), Betula alleghaniensis (yellow birch), Betula papyrifera (paper birch), Cornus alternifolia (pagoda dogwood), Corylus cornuta (beaked hazel), Dirca palustris (Eastern leatherwood), Fraxinus nigra (black ash), Ostrya virgniana (American hophornbeam), Populus grandientata (bigtooth aspen), Populus tremuloides (quaking aspen), Prunus serotina (black cherry), Prunus virginiana (chokecherry), Quercus rubra (Northern red oak), Ribes cynosbati (prickly gooseberry), Tilia americana (basswood), Ulmus americana (American elm). They are all angiosperms (flowering plants) and eudicots and belong to the clades of asterids, saxifragles, and rosids. While these species all co-occur at the UNDERC site, their geographic ranges largely overlap on the continental scale and the species have largely co-occurred in North America since at least the Pleistocene. Upon collection, the samples were flash-frozen in liquid nitrogen, then transported and stored in an ultracold freezer at the field station. The samples were sent to Novogene corporation inc. (Chula Vista, CA, USA) for mRNA extraction, library preparation, mRNA sequencing, and quality control. The samples were sequenced on an Illumina NovoSeq 6000 platform to produce 150bp paired-end reads.

Raw reads filtering was performed by Novogene to get the clean reads including removing reads containing adapters, removing reads containing N >10%, and removing reads containing low quality base which is over 50% of the total base. An average of 6Gb paired-end clean reads were acquired for each sample and deposited in NCBI Sequence Read Archive (SRA accession number: PRJNA896247). De novo transcriptome assembly was performed for each sample using the software TransLiG v1.3 with default parameters (Liu et al., 2019). The clean reads were then mapped to the assembled transcriptome using the software RSEM v1.2.25 with default parameters (Li and Dewey, 2011). The longest transcript (isoform) with a transcript per million (TPM) value ≥ 1.5 for each gene was retained. Redundancy was removed from the retained transcripts using the software CD-HIT v4.8.1 with parameters of -c 0.95 -n 8 -T 32 -M 1540 (Fu et al., 2012). To remove potential contamination, the retained transcripts were aligned against the non-redundant (nr) protein database (downloaded on January 22, 2022) using the blastx function implemented by the software DIAMOND v2.0.13.151 (Buchfink et al., 2021) with parameter of --outfmt 102 to assign the best-known taxon ID to each query transcript. The file that was used to convert accession numbers to taxonomy was downloaded from, and the tax ID file was downloaded from The query transcripts that could be assigned to green plant (Viridiplantae) taxon IDs were retained. The filtered transcripts were aligned to the nr database that only includes green plant subjects using the blastx function implemented by the software DIAMOND v2.0.13.151 with default parameters. The aligned transcripts were also retained and then were combined with the previously retained transcripts to form the final transcriptomes. The completeness of the final transcriptomes was examined using the 1,614 orthologues in the BUSCO (Benchmarking Universal Single Copy Orthologs) v5.2.2 set of embryophyta_odb10 (Manni et al., 2021).

The transcriptomes were annotated with Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) terms. The transcripts were aligned to the nr database that only includes green plant subjects using the blastx function implemented by the software DIAMOND v2.0.13.151 with parameters of -taxonlist 33090  --very-sensitive -f5. The .xml outputs were used as input for mapping and annotation using the software Blast2GO (Götz et al., 2008). The transcripts were translated to protein sequences using the software TransDecoder v5.5.0 ( with default parameters to get the longest ORF (open reading frame) per transcript, and then were annotated by the software InterProScan v5.54-87.0 (Jones et al., 2014) with the parameters of -dra -f XML -goterms -pa. The final GO terms were acquired by merging Blast2GO and InterProScan annotation outputs using the software Blast2GO. To get the KEGG terms, the longest ORF per transcript were also annotated using EggNOG-mapper v2.1.7 (Cantalapiedra et al., 2021) with parameters of -m diamond -sensmode more-sensitive -tax_scope 33090 -tax_scope_mode 33090.


National Science Foundation, Award: DEB-2124466

National Science Foundation, Award: DEB-32061123003