Phylogenomics of Boraginaceae using lineage-specific and Angiosperms353 loci
Data files
Mar 20, 2025 version; total size 23.62 MB
-FOR_DRYAD.zip (23.61 MB)
-README.md (3.03 KB)
Abstract
During the past 20 years, the phylogenetics of Boraginaceae has taken shape using plastid DNA regions and the nuclear ribosomal internal transcribed spacer (ITS), but these regions only represent a limited understanding of the evolutionary history of the family. Using hybridization-enrichment sequencing, 531 nuclear regions from lineage-specific and Angiosperms353 loci were sequenced and aligned for 49 species from across Boraginaceae. Additionally, the Angiosperms353 loci were incorporated with a broader dataset of the same loci from 115 accessions of Boraginales and relatives. Based on multiple phylogenetic approaches and datasets, the resolved phylogenies of Boraginaceae were quite similar to our current understanding, yet multiple taxa were recognized in different positions. These included: 1) Echiochiloideae as sister to Cynoglossoideae instead of to the rest of the entire family, 2) Moritzinae as nested within Boragininae, and 3) Lasiocaryeae and Trichodesmeae not resolved as sisters. These different positions recovered, via different methods, using hundreds of nuclear loci suggest that incomplete lineage sorting, hybridization, and heterotachy may have occurred during the early origin of the family. In analyses of Boraginales, Namaceae was resolved as non-monophyletic, providing evidence that a broader Hydrophyllaceae may again be appropriate, and Lennoaceae was nested in Ehretiaceae. While both sets of loci allowed for a well-resolved and well-supported phylogeny to be reconstructed, the lineage-specific loci recovered some of the more intriguing phylogenetic relationships in part because these loci appear to be less conserved than those from Angiosperms353. The two sets of loci provide an interesting complement for understanding patterns of evolution within the family and order.
https://doi.org/10.5061/dryad.f7m0cfz5d
The dataset comprises DNA sequence data used to investigate the phylogenomic relationships of the plant family Boraginaceae and the order to which it belongs, Boraginales. Phylogenomic analyses were conducted using both Maximum Parsimony (MP) and Maximum Likelihood (ML) approaches for concatenated data and Multispecies Coalescent (MSC) methods with phylogenies of individual loci. For each method, multiple analyses were conducted to explore the various combinations of data and the impact of different analysis parameters.
Description of the data and file structure
The dataset includes nuclear and organellar DNA sequence alignments and phylogenies from concatenated and multispecies coalescent analyses of species of Boraginaceae and Boraginales.
-The Alignments folder includes aligned DNA sequence data of orthologues of Angiosperms353 loci, loci from Boraginaceae-specific probes, and organellar and nuclear ribosomal regions.
-The Concatenation folder includes the phylogenies, from IQ-TREE and TNT, of concatenated DNA from Angiosperms353 and Boraginaceae-specific loci. The folder of IQ-TREE files includes results from BIC, AIC, and AICc model selection for partitioned and unpartitioned data; the partitioned data were analyzed with both Gamma rate heterogeneity and FreeRate heterogeneity. Please see Methods for additional information on phylogenetic search parameters.
-The MSC folder includes the input phylogenies from MP (TNT) and ML (IQ-TREE and RAxML-NG) analyses of individual loci as well as results of multispecies coalescent analyses from ASTRAL (including weighted ASTRAL [waster]).
-The Organellar and nrDNA folder includes the results of MP and ML analyses of the plastid and mitochondrial genomes and nrDNA.
-The Paralogues folder includes alignments of Angiosperms353 and Boraginaceae-specific paralogous loci, phylogenies from ML (IQ-TREE and RAxML-NG) analyses of the individual loci, and MSC phylogenies inferred from those individual trees using ASTRAL-Pro.
-The Boraginales folder includes alignments of Angiosperms353 loci; phylogenies of concatenated data analyzed with MP and ML approaches; input phylogenies from MP (TNT) and ML (IQ-TREE and RAxML-NG) analyses of individual loci; results of multispecies coalescent analyses from ASTRAL (including weighted ASTRAL [waster]); alignments of Angiosperms353 and Boraginaceae-specific paralogous loci; phylogenies from ML (IQ-TREE and RAxML-NG) analyses of the individual paralogous loci; and MSC phylogenies inferred from those individual trees using ASTRAL-Pro.
Sharing/Access information
Raw reads are available from GenBank under BioProjects PRJNA1098857 and PRJNA1106503. For other taxa downloaded from GenBank for the Boraginales datasets, accession numbers are listed after the species names.
Code/Software
See Methods for parameters used to align DNA sequence data and for phylogenetic analyses.
Six datasets were constructed for phylogenetic analyses of Boraginaceae: 1) 353-42 (Angiosperms353 loci with all 49 species, which includes 42 loci), 2) 353-305 (Angiosperms353 loci with at least 27 species, which includes 305 loci), 3) Borage-149 (Borage-specific loci with all 49 species, which includes 149 loci), 4) Borage-226 (Borage-specific loci with at least 27 species, which includes 226 loci), 5) B353-191 (Borage-specific and Angiosperms353 loci with all 49 species, which includes 191 loci), and 6) B353-531 (Borage-specific and Angiosperms353 loci with at least 27 species, which includes 531 loci). These six datasets represent two approaches for phylogenetic analyses. One was to maximize the completeness of the matrix by using only loci that were present for all 49 species, and the other was to maximize the number of characters in the matrix by including loci that had sufficient representation among the 49 species. The former resulted in every species being represented at every locus, limiting missing data; the latter resulted in many more loci and characters, but with a greater amount of missing data. A minimum of 27 species per locus was used as the cut-off to balance the incorporation of additional sequence data against the increase in missing data. Furthermore, multiple distinct phylogenetic analyses were conducted and their results compared to explore potential variation in reconstructed trees, to examine the influence of methodological issues in phylogenetics, and to recognize areas of the tree or taxa that were more variable or stable depending on analysis and dataset.
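The two locus-selection strategies described above amount to filtering loci by taxon occupancy. The following is a minimal sketch of that idea, not the authors' pipeline: the function name `select_loci`, the locus names, and the toy four-species example are all hypothetical, and the thresholds 4 and 3 simply stand in for "all 49 species" and "at least 27 species".

```python
from typing import Dict, List, Set


def select_loci(locus_taxa: Dict[str, Set[str]], min_species: int) -> List[str]:
    """Return the names of loci sampled for at least `min_species` species."""
    return sorted(
        locus for locus, taxa in locus_taxa.items() if len(taxa) >= min_species
    )


# Toy data (hypothetical): which species were recovered for each locus.
toy = {
    "locus_A": {"sp1", "sp2", "sp3", "sp4"},
    "locus_B": {"sp1", "sp2", "sp3"},
    "locus_C": {"sp1"},
}

# Complete-matrix strategy: keep only loci present for every species.
complete = select_loci(toy, min_species=4)
# Relaxed strategy: keep loci that meet a lower occupancy cut-off.
relaxed = select_loci(toy, min_species=3)

print("complete:", complete)
print("relaxed:", relaxed)
```

Raising the threshold reduces missing data at the cost of fewer loci, which is the trade-off the six datasets were designed to explore.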
Individual loci, plastid genomes (with only one inverted repeat), mitochondrial genomes, and nuclear ribosomal cistrons were each aligned using the global pair option in MAFFT (Katoh and Standley 2013). Aligned sequence data were subsequently analyzed using Maximum Parsimony (MP) and Maximum Likelihood (ML) methods. With MP analyses in TNT (Goloboff and Catalano 2016), the following search was conducted for each locus: 1,000 parsimony ratchets (Nixon 1999), with 10% upweighting and 10% downweighting, 1,000 drift iterations, 100 rounds of tree fusing, and sectorial searches (Goloboff 1999). This search was run 10 times to find the optimal tree. A strict consensus tree was reconstructed, and 1,000 bootstrap and jackknife replicates were conducted. For ML analyses with RAxML-NG (Kozlov et al. 2019), ModelTest-NG (Darriba et al. 2020) was used to identify the optimal substitution model for each locus according to the Bayesian Information Criterion (BIC), among 11 substitution schemes and with nucleotide frequencies and rate heterogeneity taken into account. RAxML-NG was run to recover the optimal tree, and 1,000 bootstrap replicates were conducted, with Felsenstein bootstrap support and Transfer bootstrap expectation calculated (Lemoine et al. 2018; Lutteropp et al. 2020). IQ-TREE (Minh et al. 2020b) was also employed for ML analyses. For these analyses, ModelFinder (Kalyaanamoorthy et al. 2017) was used to identify the most appropriate model according to BIC, followed by a search for the optimal tree and 1,000 ultrafast bootstrap replicates (UFBoot) (Hoang et al. 2018) with hill-climbing nearest neighbor interchange to reduce the impact of potential severe model violations (bnni option).
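As a rough illustration of the per-locus workflow above, the sketch below assembles command lines for MAFFT, IQ-TREE, and RAxML-NG. The exact commands used in the study are not given here, so these are assumptions based on the options named in the text: `--globalpair` for MAFFT; `-m MFP`, `-B`, and `--bnni` for IQ-TREE 2; and `--all`, `--bs-trees`, and `--bs-metric fbp,tbe` for RAxML-NG. The executable names, file names, and the `GTR+G` model are placeholders (the study selected a model per locus with ModelTest-NG). The code only builds and prints the commands; it does not run them.

```python
import shlex
from typing import List


def mafft_cmd(fasta: str) -> List[str]:
    # --globalpair selects MAFFT's global pairwise alignment strategy.
    # MAFFT writes the alignment to standard output.
    return ["mafft", "--globalpair", fasta]


def iqtree_cmd(alignment: str, replicates: int = 1000) -> List[str]:
    # -m MFP: ModelFinder followed by tree search (BIC is ModelFinder's default criterion).
    # -B: ultrafast bootstrap replicates; --bnni: NNI optimization of UFBoot trees.
    return ["iqtree2", "-s", alignment, "-m", "MFP", "-B", str(replicates), "--bnni"]


def raxmlng_cmd(alignment: str, model: str, replicates: int = 1000) -> List[str]:
    # --all: tree search plus bootstrapping; fbp and tbe request Felsenstein
    # bootstrap proportions and transfer bootstrap expectation, respectively.
    return [
        "raxml-ng", "--all", "--msa", alignment, "--model", model,
        "--bs-trees", str(replicates), "--bs-metric", "fbp,tbe",
    ]


# Hypothetical file names and model, for illustration only.
for cmd in (
    mafft_cmd("locus_A.fasta"),
    iqtree_cmd("locus_A.aln.fasta"),
    raxmlng_cmd("locus_A.aln.fasta", model="GTR+G"),
):
    print(shlex.join(cmd))
```

Keeping the commands as argument lists makes it straightforward to pass them to `subprocess.run` in a loop over loci once the tools are installed.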
Multispecies coalescent (MSC) analyses were conducted with ASTER and ASTRAL v5.6.3 (Zhang et al. 2018; Zhang et al. 2020), and the 353-42, 353-305, Borage-149, Borage-226, B353-191, and B353-531 sets of MP and ML gene trees were analyzed. Additionally, ML trees with Felsenstein bootstrap support values were used as input into weighted ASTRAL with ASTER (Liu and Warnow 2023), using 100 rounds of searching and resampling, to explore another approach for reconstructing species trees from gene trees.
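ASTRAL-family tools take as input a single file of gene trees in Newick format, one tree per line. A minimal sketch of assembling such a file from per-locus tree files is shown below. The function name `combine_gene_trees`, the file names, and the toy trees are hypothetical; the layout of the tree files in the archive is not assumed.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def combine_gene_trees(tree_files: Iterable[Path], out_file: Path) -> int:
    """Write one Newick tree per line to `out_file`; return the number written."""
    count = 0
    with out_file.open("w") as out:
        for path in sorted(tree_files):
            for line in path.read_text().splitlines():
                newick = line.strip()
                if newick:  # skip blank lines
                    out.write(newick + "\n")
                    count += 1
    return count


# Demonstration with two toy gene trees written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "locus_A.tre").write_text("((sp1,sp2),(sp3,sp4));\n")
    (tmp_dir / "locus_B.tre").write_text("((sp1,sp3),(sp2,sp4));\n\n")
    combined = tmp_dir / "gene_trees.tre"
    n_trees = combine_gene_trees(tmp_dir.glob("locus_*.tre"), combined)
    combined_text = combined.read_text()

print(n_trees)
print(combined_text, end="")
```

The same combined file can be built separately for each set of gene trees (for example, the MP and ML trees of each of the six datasets) before running the species-tree analysis.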
Along with phylogenetic analyses of individual loci, sequence data were concatenated into the six aforementioned datasets (353-42 [44,364 bp], 353-305 [259,895 bp], Borage-149 [196,063 bp], Borage-226 [274,651 bp], B353-191 [240,427 bp], B353-531 [535,758 bp]). Each concatenated dataset was analyzed using MP and ML, with parameters in TNT and IQ-TREE mentioned previously, with one difference for ML analyses. With ML, the data were analyzed as both unpartitioned and partitioned by locus, with the optimal partition determined by analyses in ModelFinder Plus using BIC, Akaike Information Criterion (AIC), and corrected Akaike Information Criterion (AICc). For analyses in IQ-TREE, models of both Gamma rate heterogeneity (TESTMERGE) and FreeRate heterogeneity (MFP+MERGE) were investigated. Collectively, this strategy allowed for an exploration of the sensitivity of the data to different partitioning schemes.
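Concatenation with per-locus partitions can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: alignments are held as dictionaries of taxon name to aligned sequence, taxa missing from a locus are padded with gaps, and partition boundaries are printed in the RAxML-style `DNA, name = start-end` format. The function name `concatenate` and the toy loci are hypothetical.

```python
from typing import Dict, List, Tuple

Alignment = Dict[str, str]  # taxon name -> aligned sequence


def concatenate(
    loci: Dict[str, Alignment],
) -> Tuple[Alignment, List[Tuple[str, int, int]]]:
    """Concatenate loci in name order; return the matrix and 1-based partitions."""
    taxa = sorted({taxon for aln in loci.values() for taxon in aln})
    chunks: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    partitions: List[Tuple[str, int, int]] = []
    start = 1
    for name in sorted(loci):
        aln = loci[name]
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{name}: sequences are empty or differ in length")
        length = lengths.pop()
        for taxon in taxa:
            # Taxa missing from this locus are filled with gap characters.
            chunks[taxon].append(aln.get(taxon, "-" * length))
        partitions.append((name, start, start + length - 1))
        start += length
    matrix = {taxon: "".join(parts) for taxon, parts in chunks.items()}
    return matrix, partitions


# Toy example (hypothetical loci and species).
toy_loci = {
    "locusA": {"sp1": "ACGT", "sp2": "AC-T"},
    "locusB": {"sp1": "GG", "sp3": "GA"},
}
matrix, partitions = concatenate(toy_loci)
for taxon, seq in matrix.items():
    print(f">{taxon}\n{seq}")
for name, first, last in partitions:
    print(f"DNA, {name} = {first}-{last}")
```

A partition file of this kind gives a partition-merging analysis its starting scheme, with one partition per locus.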
Phylogenetic analyses of paralogues, identified via HybPiper, were conducted with RAxML-NG and IQ-TREE using the aforementioned parameters. Trees from Angiosperms353 paralogues, Borage-specific paralogues, and both together were analyzed in an MSC framework using ASTRAL-Pro (Zhang et al. 2020), an approach similar to ASTRAL that can account for multiple paralogues per species. All trees were compared to identify congruent and incongruent areas of the phylogeny of Boraginaceae. Along with analyses of paralogous loci, three datasets of single-copy loci as identified by HybPiper were constructed: 25 Borage-specific loci, 209 Angiosperms353 loci, and 234 loci for both sets combined. These datasets were analyzed in MP and ML frameworks as individual loci and as concatenated datasets, using the same aforementioned parameters. This strategy was employed to recognize evolutionary patterns resolved solely by single-copy orthologues.
To further explore the evolutionary relationships of Boraginaceae, sequence data for 95 species of Boraginales (101 accessions) and 13 outgroup species (14 accessions) from Lamiidae were accessed from the Kew Tree of Life Explorer (Baker et al. 2022) and GenBank. These species have been enriched for the Angiosperms353 loci. Using the HybPiper parameters described above, sequences were mapped to loci, and DNA sequences were retrieved. The resulting sequences were combined with those from the Angiosperms353 loci from the present study into two different datasets: one with 297 loci, each of which includes at least 100 species, and one with all Angiosperms353 loci, in which individual loci can include a much smaller number of species. With these datasets, MP and ML phylogenetic analyses were conducted as described above using individual genes, concatenated (partitioned and unpartitioned) datasets, and paralogous loci.