Phylogenetic diversity and endemism of the vascular flora of the páramo ecoregion
Data files
Nov 07, 2025 version files (22.95 MB)
- 5markers_partitions.txt (109 B)
- Concatenated_tree_Final_Sep06_V4.fasta (22.81 MB)
- File_S1_final_phylogeny.new (127.88 KB)
- README.md (7.30 KB)
Abstract
The páramo ecoregion is known for its remarkable biodiversity and for evolutionary histories shaped by complex biogeographical processes. Protecting and conserving the ecological diversity and evolutionary lineages of this hotspot requires analyzing spatial patterns of phylogenetic diversity and endemism. This study uses spatial phylogenetic methods, combining georeferenced occurrence data with a genus-level phylogeny of the region’s entire vascular flora, to investigate the spatial distributions of phylogenetic diversity and endemism. We assembled a comprehensive dataset of 361 equal-area grid cells covering the páramo biome and compiled occurrence records for 567 vascular plant genera. A phylogeny of vascular plants in the páramo was generated by combining publicly available genetic sequence data with novel genetic data from Colombian herbarium collections. The analysis identified distinct centers of neo- and paleoendemism, as well as a strong pattern of phylogenetic clustering across most páramo regions. The phylogenetic beta diversity analysis identified several phylogenetically distinct bioregions, characterized by both widespread, shared evolutionary histories and localized, distinct assemblages. These results provide a basis for prioritizing areas for protection and for further identifying and understanding biogeographical patterns in the páramo ecoregion. The dataset provided here is an alignment of five DNA marker sequences (four chloroplast markers and the nuclear ITS region) that aims to represent all páramo vascular plant genera.
Dataset DOI: 10.5061/dryad.2v6wwq02k
Description of the data and file structure
DNA sequence collection and generation via herbarium genomics.
To build a phylogenetic tree containing all the taxa found in the páramo region, representative sequence data were sourced from the public repository NCBI GenBank (Benson et al., 2018). Five markers were selected: the highly conserved chloroplast maturase K gene (matK), chloroplast NADH dehydrogenase F gene (ndhF), RuBisCO large subunit gene (rbcL), and trnF(GAA) gene, as well as the internal transcribed spacer of the nuclear ribosomal DNA (ITS). From the initial list of 1833 taxa, sequences for at least one of the selected markers were found for about 1670 genera. These sequences were identified and downloaded from NCBI using the Python script Matrix Maker (Freyman & Thornhill, 2016). If at least one species of a genus had a publicly available sequence for a given marker, that sequence was included in the study to represent its genus.
To supplement this set of DNA sequences, leaf tissue samples were collected from 33 dried herbarium plant specimens in the Herbario Luis Siguifredo Espinal Tascón (CUVC) at the University of Valle in Cali, Colombia. Genomic DNA was extracted from these leaf tissue samples using the Qiagen DNeasy Plant Mini Kit, following the manufacturer’s protocol. The concentration of the extracted DNA was measured using the Qubit 4.0 Fluorometer (Life Technologies) with the dsDNA HS Assay Kit. The DNA samples were then sheared to the desired fragment size (400-600 bp) using a Covaris ME220 focused ultrasonicator. The quality and size distribution of the DNA extracts were assessed on an Agilent 4200 TapeStation instrument, with HS D1000 ScreenTapes and reagents, to ensure they were suitable for library preparation. The sheared genomic DNA extracts were used to construct Illumina sequencing libraries following the BEST (blunt-end single-tube) library preparation protocol (Carøe et al., 2018), which involves three main steps: blunt-end repair, adapter ligation, and adapter fill-in. After a qPCR analysis to determine the optimal number of cycles, the purified libraries were subjected to dual-indexing PCR with 16-30 cycles. Each PCR reaction was carried out in a 100-µL volume containing 7.5 µL of library template, 0.8 µL of dNTPs (25 mM), 2 µL each of F and R index primers (10 µM), 1 µL of AmpliTaq Gold polymerase (5 U/µL), 10 µL of AmpliTaq Gold buffer (10X), 10 µL of MgCl₂ (25 mM), 2 µL of BSA (20 mg/mL), and 64.7 µL of H₂O. The resulting libraries were purified and size-selected using a 1:1 ratio of purified DNA:SPRI beads, and then assayed and quantified on the Agilent 4200 TapeStation prior to equimolar pooling. Sequencing was performed by the Novogene UK commercial sequencing service on the NovaSeq X sequencing platform (150-bp paired-end chemistry).
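As a quick arithmetic check of the PCR recipe above, the following sketch sums the listed component volumes and derives final in-reaction concentrations by simple dilution (stock concentration × component volume / total volume). The dictionary layout and the function name `final_concentration` are illustrative choices, not part of the original protocol, and the dNTP stock is taken at face value as 25 mM without assuming whether that figure is per nucleotide or total.

```python
# Reaction components as (volume in µL, stock concentration, unit), copied from the text.
# Components without a stated stock concentration use None.
components = {
    "library template": (7.5, None, None),
    "dNTPs": (0.8, 25.0, "mM"),
    "forward index primer": (2.0, 10.0, "µM"),
    "reverse index primer": (2.0, 10.0, "µM"),
    "AmpliTaq Gold polymerase": (1.0, 5.0, "U/µL"),
    "AmpliTaq Gold buffer": (10.0, 10.0, "X"),
    "MgCl2": (10.0, 25.0, "mM"),
    "BSA": (2.0, 20.0, "mg/mL"),
    "H2O": (64.7, None, None),
}

# Total reaction volume; the text states 100 µL.
total_volume = sum(volume for volume, _, _ in components.values())


def final_concentration(name):
    """Return (final concentration, unit) for one component by simple dilution."""
    volume, stock, unit = components[name]
    if stock is None:
        raise ValueError(f"no stock concentration given for {name!r}")
    return stock * volume / total_volume, unit


if __name__ == "__main__":
    print(f"total volume: {total_volume:.1f} µL")
    for name, (_, stock, _) in components.items():
        if stock is not None:
            value, unit = final_concentration(name)
            print(f"{name}: {value:.3g} {unit}")
```

Under these figures the volumes add up to the stated 100 µL, and the buffer ends up at 1X, which is consistent with a standard dilution of a 10X stock.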
The raw sequence data for each sample were subjected to de novo assembly using GetOrganelle 1.7.7.0 (Jin et al., 2020). The software GeSeq from the toolkit Chlorobox (Tillich et al., 2017) was used to annotate genes in the resulting chloroplast genome assemblies. The sequences of the aforementioned highly conserved genes, along with 1,000-bp flanking regions on either side of the gene, were extracted from their respective annotation files to include in alignments with the data obtained from GenBank.
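The flank extraction described above can be illustrated with a small coordinate sketch. This is not the procedure actually used (the text does not say which tool performed the extraction); it only shows one way to take a gene plus 1,000-bp flanks from an assembled sequence. The function name, the 0-based half-open coordinates, and the `circular` flag are assumptions made for illustration, and strand orientation is ignored.

```python
def extract_with_flanks(genome, start, end, flank=1000, circular=False):
    """Return genome[start:end] plus up to `flank` bases on each side.

    Coordinates are 0-based and half-open (hypothetical convention).
    For a circular assembly the flanks wrap around the origin; for a
    linear or partial assembly they are clipped at the sequence ends.
    """
    n = len(genome)
    if not (0 <= start < end <= n):
        raise ValueError("gene coordinates fall outside the sequence")
    lo, hi = start - flank, end + flank
    if circular:
        # Never return more than one full copy of the genome.
        length = min(hi - lo, n)
        return "".join(genome[(lo + i) % n] for i in range(length))
    return genome[max(lo, 0):min(hi, n)]


if __name__ == "__main__":
    toy = "ACGTACGTAC"  # made-up 10-bp sequence, not real data
    print(extract_with_flanks(toy, 1, 3, flank=2, circular=True))
    print(extract_with_flanks(toy, 1, 3, flank=2, circular=False))
```

The wrap-around branch matters only for circularized assemblies in which a target gene sits within 1,000 bp of the arbitrary start coordinate; partial assemblies would use the clipped branch.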
The de novo assemblies resulted in circularized genomes for 14 samples, with base coverages ranging from 67.5X to 1509.8X and an average chloroplast genome length of 153,875 bp (Table S1). Uncircularized or partial assemblies were obtained for 17 samples, while two samples yielded no assembled sequences. For 28 samples, at least one of the four target chloroplast markers (rbcL, matK, ndhF, and trnF) was assembled and annotated. For 23 samples, all four markers were annotated and included in the analysis.
Sequence alignment and phylogeny.
An initial sequence alignment for each marker was created using MAFFT version 7 (Katoh & Standley, 2014). The software Geneious Prime version 2024.0.5 was used to manually inspect the sequence alignments, to remove long strings of entirely ambiguous bases (encoded as N), to trim sequence ends, and to refine the alignments. In some cases, low-quality sequences were removed entirely from the alignment. The alignments were refined using the ClustalW algorithm (Sofi et al., 2022) within Geneious Prime, with gap penalties set to default values. After alignment, low-quality ends and ambiguous regions were trimmed from each sequence to retain only the high-quality core region of the multiple sequence alignment. To identify potential alignment problems and misidentified sequences, a single-marker phylogeny was generated for each marker using RAxML v8.2.11 (Kozlov et al., 2019). These issues were resolved in each alignment before the five single-marker multiple sequence alignments were concatenated using a custom Python script. The final multi-marker multiple sequence alignment was 13,217 bp in length and contained 1696 sequences, including the newly generated sequences from 31 herbarium specimens. This alignment was used to infer a final maximum-likelihood phylogenetic tree with RAxML, using a GTRGAMMA nucleotide substitution model and five data partitions. The resulting phylogenetic tree was visualized with iTOL (Letunic & Bork, 2024). The phylogenetic tree comprises 7 classes, 67 orders, 231 families, and 1696 genera. The placement of each branch tip was checked for consistency with the reference phylogeny from the Kew Tree of Life Explorer (Baker et al., 2022) and confirmed to belong to a monophyletic group at the order level. For visualizing and rooting the tree, the class Lycopodiopsida was selected as an outgroup based on an available phylogeny of vascular plants (Cole et al., 2019).
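The custom concatenation script itself is not included in this dataset, so the following is only a minimal sketch of how single-marker alignments might be joined into one matrix, with missing markers padded with Ns and the marker boundaries recorded as RAxML-style partition lines (`DNA, name = start-end`). The function name, the nested-dictionary input layout, and the toy sequences are all hypothetical.

```python
def concatenate_alignments(alignments, marker_order):
    """Join per-marker alignments into one matrix.

    alignments: dict mapping marker name -> dict of genus -> aligned sequence
                (hypothetical layout chosen for this sketch).
    Returns (concatenated sequences per genus, RAxML-style partition lines).
    """
    lengths = {}
    for marker in marker_order:
        widths = {len(seq) for seq in alignments[marker].values()}
        if len(widths) != 1:
            raise ValueError(f"{marker}: sequences are not aligned to one length")
        lengths[marker] = widths.pop()

    # Every genus that appears in at least one marker alignment.
    genera = sorted({g for m in marker_order for g in alignments[m]})

    # A genus lacking a marker receives a run of Ns of that marker's width.
    concatenated = {
        g: "".join(alignments[m].get(g, "N" * lengths[m]) for m in marker_order)
        for g in genera
    }

    # Partition coordinates are 1-based and inclusive.
    partitions, start = [], 1
    for m in marker_order:
        end = start + lengths[m] - 1
        partitions.append(f"DNA, {m} = {start}-{end}")
        start = end + 1
    return concatenated, partitions


if __name__ == "__main__":
    # Toy sequences for illustration only, not taken from the dataset.
    toy = {
        "matK": {"Espeletia": "ACG-", "Puya": "ACGT"},
        "rbcL": {"Espeletia": "TT"},
    }
    matrix, parts = concatenate_alignments(toy, ["matK", "rbcL"])
    for genus, seq in matrix.items():
        print(f">{genus}\n{seq}")
    print("\n".join(parts))
```

Padding with Ns keeps every row the same length, so a genus with only one sequenced marker can still be placed by the tree search using the columns it does have.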
Files and variables
File: 5markers_partitions.txt
Description: Indicates the location of each of the five concatenated DNA markers (matK, ndhF, rbcL, trnF, ITS) within the FASTA alignment 'Concatenated_tree_Final_Sep06_V4.fasta'.
File: Concatenated_tree_Final_Sep06_V4.fasta
Description: Multiple sequence alignment of the five concatenated marker sequences for all genera included in the study. Marker sequences that were not available for particular genera are filled with Ns instead.
File: File_S1_final_phylogeny.new
Description: Phylogeny used for all spatial phylogenetic analyses in the study (in standard Newick file format). Tip labels indicate the genus.
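As a rough illustration of how the three files relate, the sketch below parses a partition listing, reports which markers are actually present (not all N) for each genus in the alignment, and pulls tip labels out of a Newick string. It uses only the standard library and toy strings in place of the real files. The partition syntax is assumed to be RAxML-style (`DNA, name = start-end`), which may not match `5markers_partitions.txt` exactly, and the tip-label pattern assumes unquoted labels without spaces; all function names are hypothetical.

```python
import re

PARTITION_RE = re.compile(r"^\s*\w+\s*,\s*(\S+)\s*=\s*(\d+)\s*-\s*(\d+)\s*$")
TIP_RE = re.compile(r"[(,]\s*([^(),:;\s]+)")


def parse_fasta(text):
    """Return dict of record name -> sequence from FASTA-formatted text."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}


def parse_partitions(text):
    """Return list of (marker, start, end) with 1-based inclusive coordinates."""
    parts = []
    for line in text.splitlines():
        match = PARTITION_RE.match(line)
        if match:
            parts.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return parts


def markers_present(sequence, partitions):
    """List markers whose block contains at least one base other than N, gap, or ?."""
    present = []
    for marker, start, end in partitions:
        block = sequence[start - 1:end].upper()
        if set(block) - set("N-?"):
            present.append(marker)
    return present


def tip_labels(newick):
    """Return tip labels: names that directly follow '(' or ',' in the Newick text."""
    return TIP_RE.findall(newick)


if __name__ == "__main__":
    # Toy stand-ins for the three data files; not real data.
    fasta_text = ">Espeletia\nACGTNN\n>Puya\nACGTTT\n"
    partition_text = "DNA, matK = 1-4\nDNA, rbcL = 5-6\n"
    newick_text = "((Espeletia:0.1,Puya:0.2)90:0.3,Lycopodium:0.4);"

    alignment = parse_fasta(fasta_text)
    partitions = parse_partitions(partition_text)
    for genus, seq in alignment.items():
        print(genus, markers_present(seq, partitions))
    print("tips without a sequence:", set(tip_labels(newick_text)) - set(alignment))
```

For the real files, the toy strings would be replaced by the contents read from disk; a dedicated phylogenetics library would be a more robust choice for Newick parsing than the regular expression used here.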
