Resolving phylogenetic relationships of a recently and rapidly evolving lineage from western North America (Mentzelia section Bartonia, Loasaceae)
Data files
Feb 05, 2025 version files (46.21 MB total):
- Fabre_etal.2025.zip (46.20 MB)
- README.md (9.09 KB)
Abstract
The landscape of western North America has dramatically transformed since the Miocene to become increasingly heterogeneous, in turn promoting the evolution of many rapidly radiating angiosperm lineages. Phylogenetic relationships of these recently and rapidly radiating groups are difficult to resolve because there is little genetic variation among species and a high degree of noise from incomplete lineage sorting and hybridization. Mentzelia section Bartonia (51 species; Loasaceae) exemplifies this problem well. The clade has been investigated with Sanger sequencing, RADSeq, and genome skimming methods; however, most species relationships remain elusive due to low genetic variability. To better infer species relationships, we applied a hybrid enrichment approach with the Angiosperms353 probe set and implemented a novel bioinformatics workflow that aimed to maximize phylogenetic signal and minimize noise from low-quality sequences, paralogy, and incomplete lineage sorting. With our phylogenomic approach, we recovered greater resolution of species relationships than previous studies based on nrDNA loci. Although a few species relationships still lack strong support, our results indicate that our methods were effective for phylogenetic inference in this recently and rapidly evolving lineage from western North America. To better characterize major groups in the section, we propose the formal designation of three subsections: Decapetala, Multicaulis, and Multiflora.
README: Resolving phylogenetic relationships of a recently and rapidly evolving lineage from western North America (Mentzelia section Bartonia, Loasaceae)
https://doi.org/10.5061/dryad.sf7m0cghb
Description of the data and file structure
Principal Investigator Contact Information:
Name: John Schenk
Institution: Ohio University
Email: schenk@ohio.edu
Alternate Contact Information:
Name: Paige Fabre
Institution: Ohio University
Email: pf271621@ohio.edu
Funding:
Funding for this research was generously provided to John Schenk by NSF DEB award #2117446, and to Paige Fabre by a Graduate Student Research Grant from the Ohio University College of Arts and Sciences.
Project Overview:
The goal of our study was to elucidate robust species-level relationships of Mentzelia section Bartonia (Loasaceae; 51 spp.), a recently and rapidly radiating lineage from western North America. To do this, we used targeted sequencing with Angiosperms353 baits, followed by a rigorous bioinformatics pipeline that addresses the main challenges of recently and rapidly radiating lineages: a low phylogenetic signal-to-noise ratio, a high chance of incomplete lineage sorting (ILS), and the potential for hybridization.
Recommended Citation
Fabre, P. P., J. M. Brokaw, L. D. Hufford, M. G. Johnson, and J. J. Schenk. In press. Resolving phylogenetic relationships of a recently and rapidly evolving lineage from western North America (Mentzelia section Bartonia, Loasaceae). Systematic Botany.
Data Generation:
The bioinformatics pipeline used to generate these data is detailed in Fabre et al. (in press); the basic steps are summarized here.
- Raw, paired-end reads were received from Illumina (raw sequences are available through NCBI)
- Quality control of raw reads was performed with fastp (Chen et al. 2018)
- Angiosperms353 loci (supercontigs [exons with flanking introns]) were assembled with HybPiper (Johnson et al. 2016)
- Supercontigs were evaluated with HybPhaser (Nauheimer et al. 2021); the program removed low-quality samples and loci and generated consensus sequences with ambiguity codes
- Sequences shorter than 25% of the mean recovered length were removed with filter_by_length.py (https://github.com/mossmatters/phyloscripts/tree/master/HybPiperUtils)
- Loci were aligned with MAFFT (Katoh and Standley 2013)
- Outliers were removed from the concatenated alignment with Spruceup (Borowiec 2019)
- Gene trees were inferred with IQ-TREE (Nguyen et al. 2015; Minh et al. 2022)
- Long branches were removed with TreeShrink (Mai and Mirarab 2018)
- Gene tree statistics for the 238-locus dataset were calculated with SortaDate (Smith et al. 2018)
- Gene trees were then filtered based on the SortaDate results from the previous step; only loci with above-average bipartition support were kept, producing the 108-locus dataset
- An additional dataset was made from a different subset of the 238-locus dataset by removing any locus with at least one paralogous sequence flagged by either HybPiper or HybPhaser; this produced the 75-locus dataset
- Species trees were inferred for all three datasets with ASTRAL-III (Zhang et al. 2018)
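The length-filtering step above can be illustrated with a minimal Python sketch. This is not the filter_by_length.py script itself; it assumes that "mean recovered length" means the mean ungapped length of the sequences recovered for a single locus, and the function name and the dictionary input format are hypothetical.

```python
from statistics import mean
from typing import Dict


def filter_by_relative_length(seqs: Dict[str, str], min_fraction: float = 0.25) -> Dict[str, str]:
    """Keep sequences whose ungapped length is at least min_fraction of the mean.

    seqs maps a sample label to its sequence for ONE locus (assumed layout).
    """
    # Measure each sequence without alignment gap characters.
    lengths = {name: len(seq.replace("-", "")) for name, seq in seqs.items()}
    if not lengths:
        return {}
    # Sequences shorter than this threshold are dropped.
    threshold = min_fraction * mean(lengths.values())
    return {name: seq for name, seq in seqs.items() if lengths[name] >= threshold}
```

For example, with two sequences of 100 bp and one of 10 bp, the mean is 70 bp, the threshold is 17.5 bp, and only the 10 bp sequence is removed.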
Note on name corrections:
A few vouchers were originally misidentified, so their names are incorrect in the tree and alignment files. These vouchers are:
MENCRO4332 (labeled as M. cronquistii; corrected to M. lagarosa)
MENSIN786 (labeled as M. sinuata; corrected to M. speciosa)
M. hirsutissima2680 (labeled as M. hirsutissima; corrected to M. nesiotes)
M.tridentata3626 (labeled as M. tridentata; corrected to M. tricuspis)
The corrected names are in the publication.
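When working with the tree and alignment files, the corrections above can be kept in a small lookup table. The sketch below is only an illustration: the labels are copied as written in this README and may differ in spacing or punctuation from the labels in the files, and the function name is hypothetical.

```python
from typing import Dict, Optional

# Labels as written in this README -> corrected species name.
# Check the exact label spelling against the files before relying on it.
CORRECTED_NAMES: Dict[str, str] = {
    "MENCRO4332": "Mentzelia lagarosa",
    "MENSIN786": "Mentzelia speciosa",
    "M. hirsutissima2680": "Mentzelia nesiotes",
    "M.tridentata3626": "Mentzelia tricuspis",
}


def corrected_species(label: str) -> Optional[str]:
    """Return the corrected species name, or None if the label needs no correction."""
    return CORRECTED_NAMES.get(label.strip())
```

Labels that are not in the table return None, so the original identification can be kept for them.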
References Cited in this README:
- Borowiec, M. L. 2019. Spruceup: fast and flexible identification, visualization, and removal of outliers from large multiple sequence alignments. Journal of Open Source Software 4: 1635.
- Chen, S., Y. Zhou, Y. Chen, and J. Gu. 2018. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34: i884–i890.
- Johnson, M. G., E. M. Gardner, Y. Liu, R. Medina, B. Goffinet, A. J. Shaw, N. J. C. Zerega, and N. J. Wickett. 2016. HybPiper: Extracting coding sequence and introns for phylogenetics from high-throughput sequencing reads using target enrichment. Applications in Plant Sciences 4: 1600016.
- Katoh, K. and D. M. Standley. 2013. MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Molecular Biology and Evolution 30: 772–780.
- Mai, U. and S. Mirarab. 2018. TreeShrink: fast and accurate detection of outlier long branches in collections of phylogenetic trees. BMC Genomics 19: 272.
- Minh, B. Q., R. Lanfear, N. Ly-Trong, J. Trifinopoulos, D. Schrempf, and H. A. Schmidt. 2022. IQ-TREE version 2.2.0: Tutorials and manual phylogenomic software by maximum likelihood. http://www.iqtree.org/doc/iqtree-doc.pdf (last accessed 03 January 2025).
- Nauheimer, L., N. Weigner, E. Joyce, D. Crayn, C. Clarke, and K. Nargar. 2021. HybPhaser: A workflow for the detection and phasing of hybrids in target capture data sets. Applications in Plant Sciences 9: e11441.
- Nguyen, L.-T., H. A. Schmidt, A. von Haeseler, and B. Q. Minh. 2015. IQ-TREE: A fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Molecular Biology and Evolution 32: 268–274.
- Smith, S. A., J. W. Brown, and J. F. Walker. 2018. So many genes, so little time: A practical approach to divergence-time estimation in the genomic era. PLoS ONE 13: e0197433.
- Zhang, C., M. Rabiee, E. Sayyari, and S. Mirarab. 2018. ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinformatics 19: 15–30.
Files and variables
Description of the data and file structure:
This submission contains (i) the final alignments, gene trees, and species trees used in Fabre et al. (in press) and (ii) a text file containing select code used to generate those data.
The data folder is organized as follows:
Sub-folder 1: "75_locus" contains all the results/data for the "75-locus" dataset.
75_locus/75_alignments/75_alignments.zip is a zip folder containing individual files for each locus alignment.
75_locus/75_alignments/75_concatenated.nex is a concatenated alignment file of all 75 loci, in nexus format.
75_locus/75_alignments/75_partitions.txt is a partitions file that can be used in conjunction with the concatenated alignment file.
75_locus/75_trees/75_genetrees.zip is a zip folder containing individual files for each gene tree.
75_locus/75_trees/75_ASTRAL.tre is the tree file for the ASTRAL tree.
Sub-folder 2: "108_locus" contains all the results/data for the "108-locus" dataset.
108_loci/108_alignments/108_alignments.zip is a zip folder containing individual files for each locus alignment.
108_loci/108_alignments/108_concatenated.nex is a concatenated alignment file of all 108 loci, in nexus format.
108_loci/108_alignments/108_partitions.txt is a partitions file that can be used in conjunction with the concatenated alignment file.
108_loci/108_trees/108_ASTRAL.tre is the tree file for the ASTRAL tree.
108_loci/108_trees/108_genetrees.tre is a single file that contains all 108 gene trees.
Sub-folder 3: "238_locus" contains all the results/data for the "238-locus" dataset.
238_loci/238_alignments/238_alignments.zip is a zip folder containing individual files for each locus alignment.
238_loci/238_alignments/238_concatenated.nex is a concatenated alignment file of all 238 loci, in nexus format.
238_loci/238_alignments/238_partitions.txt is a partitions file that can be used in conjunction with the concatenated alignment file.
238_loci/238_alignments/238_trees/238_ASTRAL.tre is the tree file for the ASTRAL tree.
238_loci/238_alignments/238_trees/238_genetrees.tre is a single file that contains all 238 gene trees.
238_loci/238_alignments/238_trees/238_genetrees.zip is a zip folder containing individual files for each gene tree.
script.txt: A text file with select code used in the bioinformatics pipeline.
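Because folder names are written in more than one way in the listing above (for example "108_locus" and "108_loci"), it may be safest to list the archive's actual contents before hard-coding any path. The following is a minimal sketch using Python's standard zipfile module; the function name is hypothetical, and whether the archive wraps everything in one extra top-level folder is not stated here.

```python
import zipfile
from collections import defaultdict


def members_by_top_folder(zip_path):
    """Group the files in a zip archive by their top-level folder.

    zip_path may be a filesystem path or a binary file-like object.
    Files stored at the archive root are grouped under the empty string.
    """
    groups = defaultdict(list)
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries, keep files only
            top = name.split("/", 1)[0] if "/" in name else ""
            groups[top].append(name)
    return {folder: sorted(names) for folder, names in groups.items()}
```

Calling `members_by_top_folder("Fabre_etal.2025.zip")` on the downloaded archive returns a dictionary whose keys are the real folder names, which can then be compared with the paths described above.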
Naming conventions:
For ingroup specimens, the naming convention is as follows: MEN (for Mentzelia), followed by the first three letters of the specific epithet, followed by the voucher number. For example, MENALB912 refers to Mentzelia albescens 912. If there are two species that share the same first three letters, a 2 is added to whichever species comes second alphabetically. For example, MENARG4156 refers to Mentzelia argillicola 4156, and MENARG24145 refers to Mentzelia argillosa 4145.
For outgroup specimens, the naming convention is as follows: M. (for Mentzelia), followed by the full specific epithet, followed by the voucher number. For example, Mtricuspis553 refers to Mentzelia tricuspis 553.
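The ingroup convention can be made concrete with a short Python sketch that builds labels from epithets and voucher numbers. The function name and input format are hypothetical, and using a numeric suffix for a third or later species with the same three-letter prefix is an assumption; the convention above only describes the case of two species.

```python
from collections import defaultdict


def ingroup_labels(specimens):
    """Build ingroup labels from (epithet, voucher) pairs.

    Returns a dict mapping each (epithet, voucher) pair to its label.
    """
    # Collect the distinct epithets that share each three-letter prefix.
    by_prefix = defaultdict(set)
    for epithet, _voucher in specimens:
        by_prefix[epithet[:3].upper()].add(epithet.lower())

    labels = {}
    for epithet, voucher in specimens:
        prefix = epithet[:3].upper()
        # Alphabetical rank among epithets with the same prefix (1-based).
        rank = sorted(by_prefix[prefix]).index(epithet.lower()) + 1
        suffix = "" if rank == 1 else str(rank)  # rank > 2 is an assumption
        labels[(epithet, voucher)] = f"MEN{prefix}{suffix}{voucher}"
    return labels
```

Note that the reverse direction is ambiguous without the species list: MENARG24145 could in principle be read as voucher 24145 of the first "ARG" species or voucher 4145 of the second, so labels are best decoded against the list of sampled species rather than by pattern matching alone.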
Code/software
- The script.txt file contains select code used in our bioinformatics pipeline. It is not intended to be comprehensive.
- .nex files may be viewed in any text editor, such as Notepad++
- .tre files may be viewed in FigTree v1.4.4
Access information
Raw sequences are available through NCBI.
Methods
1. Raw, paired-end reads were received from Illumina (raw sequences are available through NCBI)
2. Quality control of raw reads was performed with fastp (Chen et al. 2018)
3. Angiosperms353 loci (supercontigs [exons with flanking introns]) were assembled with HybPiper (Johnson et al. 2016)
4. Supercontigs were evaluated with HybPhaser (Nauheimer et al. 2021); the program removed low-quality samples and loci and generated consensus sequences with ambiguity codes
5. Sequences shorter than 25% of the mean recovered length were removed with filter_by_length.py (https://github.com/mossmatters/phyloscripts/tree/master/HybPiperUtils)
6. Loci were aligned with MAFFT (Katoh and Standley 2013)
7. Outliers were removed from the concatenated alignment with Spruceup (Borowiec 2019)
8. Gene trees were inferred with IQ-TREE (Nguyen et al. 2015; Minh et al. 2022)
9. Long branches were removed with TreeShrink (Mai and Mirarab 2018)
10. Gene tree statistics for the 238-locus dataset were calculated with SortaDate (Smith et al. 2018)
11. Gene trees were then filtered based on the results of step 10; only loci with above-average bipartition support were kept, producing the 108-locus dataset
12. An additional dataset was made from a different subset of the 238-locus dataset by removing any locus with at least one paralogous sequence flagged by either HybPiper or HybPhaser; this produced the 75-locus dataset
13. Species trees were inferred for all three datasets with ASTRAL-III (Zhang et al. 2018)
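The locus selection in step 11 amounts to comparing each locus against the mean bipartition support across all loci. The Python sketch below illustrates that comparison only; it assumes the per-locus support values have already been extracted from the SortaDate output into a dictionary, and the function name and locus names are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def loci_above_mean_support(support: Dict[str, float]) -> List[str]:
    """Return the loci whose bipartition support is strictly above the mean.

    support maps a locus name to its bipartition support score (assumed layout).
    """
    if not support:
        return []
    cutoff = mean(support.values())
    # Loci exactly at the mean are not "above average" and are dropped.
    return sorted(locus for locus, score in support.items() if score > cutoff)
```

With hypothetical scores of 0.2, 0.6, and 0.7 for three loci, the mean is 0.5 and only the last two loci are retained.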