Data from: A Phylogenomic analysis of Genipa (Rubiaceae) using target sequence capture data
Data files
Dec 24, 2024 version files 4.69 MB
contig_alignment_read_cov_yield_overviewPNG.png (3.93 MB)
gene_trees_ref_assembly.trees (256.13 KB)
Locus20.fasta (112.48 KB)
Locus43.fasta (46.97 KB)
Locus5.fasta (148.32 KB)
Locus55.fasta (76.81 KB)
Locus62.fasta (26 KB)
Locus9.fasta (86.53 KB)
README.md (2.70 KB)
Abstract
The genus Genipa is a widespread, lowland, Neotropical lineage of trees in the coffee family, Rubiaceae. There is long-standing disagreement on the delimitation of species in the genus and how broadly Genipa is circumscribed. Here, we use genomic data to resolve the classification within Genipa. Using target sequence capture we generated a high resolution 245-locus dataset to produce a comprehensive species phylogeny under the multi-species coalescent model. The phylogenomic results strongly support Genipa spruceana, often synonymised with Genipa americana, as a distinct monophyletic species. Similarly, the monophyly of Genipa infundibuliformis, a recently recognized species, is also strongly supported. The phylogeny also shows three distinct, well-supported clades within the widespread species, Genipa americana. These clades are interpreted as three independently evolving lineages in contrast to the two varieties most commonly recognized in G. americana based on previous morphological studies.
README: Data from: A Phylogenomic Analysis of Genipa (Rubiaceae) Using Target Sequence Capture Data
https://doi.org/10.5061/dryad.7wm37pw0g
Summary of Data
Dataset for a target capture study that used the Angiosperms 353 probe set and two tree inference methods, both employing the multispecies coalescent model. The dataset provides the data required to reproduce both tree inference methods: the ASTRAL phylogeny and the STACEY phylogeny published in A Phylogenomic Analysis of Genipa (Rubiaceae) Using Target Sequen...: Ingenta Connect. It also includes an overview plot of the sequence yields and information on the SECAPR bioinformatic pipeline that was used in the work. All raw target capture sequence data from the study are available from the GenBank Sequence Read Archive under BioProject ID PRJNA1029819.
Dataset contains:
Input data tree file of individual gene trees used to produce the ASTRAL phylogeny (gene_trees_ref_assembly.trees).
FASTA files of the six loci used in the STACEY analysis (Locus5.fasta, Locus9.fasta, Locus20.fasta, Locus43.fasta, Locus55.fasta, and Locus62.fasta).
Plot showing an overview of sequence yield (contig_alignment_read_cov_yield_overviewPNG.png).
Description of the data and file structure
Each dataset contains 29 samples. Taxon abbreviations are as follows: G_am (Genipa americana), Gcar (Genipa americana var. caruto), Gspru (Genipa spruceana), and G_infun (Genipa infundibuliformis), each followed by the sample number.
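The abbreviations can be mapped back to taxon names programmatically. The following is a minimal Python sketch: the prefix-to-taxon table comes from the text above, but the label layout it assumes (prefix, an optional underscore, then digits, for example G_am12) and the function name parse_label are illustrative assumptions, so the pattern may need adjusting to the labels actually used in the files.

```python
import re

# Prefix-to-taxon mapping taken from the README description.
TAXA = {
    "G_am": "Genipa americana",
    "Gcar": "Genipa americana var. caruto",
    "Gspru": "Genipa spruceana",
    "G_infun": "Genipa infundibuliformis",
}

# ASSUMPTION: a label is a known prefix, an optional underscore, then the
# sample number. This layout is hypothetical; check it against the real files.
LABEL_RE = re.compile(r"^(G_am|Gcar|Gspru|G_infun)_?(\d+)$")


def parse_label(label):
    """Return (taxon name, sample number) for one sample label."""
    match = LABEL_RE.match(label.strip())
    if match is None:
        raise ValueError(f"Unrecognised sample label: {label!r}")
    return TAXA[match.group(1)], int(match.group(2))
```

For example, parse_label("Gspru3") would return the taxon name Genipa spruceana together with the sample number 3, provided the labels follow the assumed layout.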
The individual gene trees used for the ASTRAL phylogeny were created with IQ-TREE 2 and are provided in Newick format.
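As a quick check of the gene tree file, the tip labels of each tree can be listed without a phylogenetics library. This is a minimal Python sketch that assumes one Newick tree per line and unquoted tip labels; both are assumptions about gene_trees_ref_assembly.trees rather than documented properties, and the function names are illustrative.

```python
import re

# Tip labels directly follow "(" or ","; internal node labels such as
# bootstrap support values follow ")" and are therefore not captured.
# ASSUMPTION: labels are unquoted and contain no parentheses, colons,
# commas, semicolons, or whitespace.
TIP_RE = re.compile(r"[(,]\s*([^():,;\s]+)")


def tip_labels(newick):
    """Return the tip labels of one Newick string, in order of appearance."""
    return TIP_RE.findall(newick)


def tips_per_tree(lines):
    """Return the number of tips in each tree, assuming one tree per line."""
    return [len(tip_labels(line)) for line in lines if line.strip()]
```

Passing an open file handle to tips_per_tree gives one count per gene tree, which makes it easy to spot trees that are missing samples.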
The STACEY data comprise the six loci used in the analysis, each provided as a FASTA file.
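The locus files can be read with a few lines of Python. The sketch below is a generic FASTA parser, not code from the study; the function name parse_fasta is illustrative, and it assumes that the first whitespace-delimited token of each header line is the sample label.

```python
def parse_fasta(lines):
    """Parse FASTA text from an iterable of lines into {name: sequence}."""
    records = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # ASSUMPTION: the sample label is the first token of the header.
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequences may span several lines
    return {key: "".join(parts) for key, parts in records.items()}
```

Because the function accepts any iterable of lines, it can be given an open file handle, for example `with open("Locus5.fasta") as handle: records = parse_fasta(handle)`, after which `len(records)` gives the number of samples in that locus.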
The plot is a PNG image file that shows an overview of the contig yield and read coverage. Each row is a Genipa sample and each column is a targeted locus. The upper portion of the plot shows the results of the de novo contig assembly: loci that could be assembled (blue) and those that could not be assembled (white). The middle portion of the plot is the reference sequence (in green). The lower portion of the plot shows the results of the reference assembly as a heatmap (darker colours represent higher recovery than lighter colours).
Code/Software
The analysis was run with the SECAPR bioinformatic pipeline (DOI: 10.7717/peerj.5175). The scripts used to create the data in this paper are available on GitHub at https://github.com/AntonelliLab/seqcap_processor.
Methods
Methodology
Total genomic DNA was extracted using the NucleoSpin Plant II Kit (Macherey-Nagel, Düren, Germany) or the DNeasy Plant Mini Kit (Qiagen, Hilden, Germany). The protocol followed the manufacturer’s instructions, except that the cell lysis time was increased to overnight to maximise DNA yield. DNA quality was assessed using a NanoDrop 2000 spectrophotometer, and DNA was quantified using the Qubit 2.0.
The NanoDrop 2000 and Qubit 2.0 results were used to identify samples that needed concentrating by vacuum centrifugation. Gel electrophoresis was also carried out to assess DNA fragment size. When the initial DNA quantity was low, multiple extraction rounds were pooled as necessary to meet the minimum concentration requirements of Rapid Genomics (Florida, USA), who performed target capture library preparation and sequencing. The DNA was mechanically sheared to a size of 200–500 base pairs (bp). Illumina libraries were constructed, barcode adapters for the Illumina sequencing platform were ligated to the libraries, and the libraries were then PCR-amplified using standard cycling protocols. Samples were pooled into 16 barcoded libraries with equimolar amounts to a total of 500 ng for hybridization. Target enrichment was performed using the Angiosperms 353 bait set (Johnson et al. 2019), which targets 353 putatively orthologous genes. After enrichment, samples were re-amplified for an additional 6–12 PCR cycles and sequenced on an Illumina NovaSeq 6000 with paired-end 250 bp reads.
The Illumina raw read data were processed using the bioinformatic pipeline SECAPR 2.2.5 (Andermann et al. 2018), which was run on the Sigma2 High-Performance Computing cluster at NTNU, Norway. Raw sequence data were quality checked using FastQC (Andrews 2010) and MultiQC (Ewels et al. 2016) to gain an overview of sequence quality and determine cleaning parameters. Illumina adapters were removed and sequences were cleaned using FastP 0.23 (Chen et al. 2018). The FastP default settings implemented in SECAPR were: i) the read was cut if the accuracy between adapter and read Phred quality score was below 20; ii) maximum percentage of low-quality nucleotides allowed: 40, with reads containing a higher percentage of unqualified (low-quality) nucleotides discarded; iii) size of the sliding window for quality trimming: 5 nucleotides; iv) trimming from front and tail if the quality value was lower than 10; v) reads below a complexity threshold of 10 were removed; vi) poly repeats of length 7 at the end of a read were trimmed; vii) low-complexity filtering was enabled; and viii) length filtering was disabled. The quality of the cleaned reads was checked using FastQC, MultiQC, and the plotting function in SECAPR.
De novo contig assembly was performed on the cleaned reads using SPAdes 3.15.2 (Bankevich et al. 2012). Overlapping sequences were combined into contig sequences using kmer values of 21, 33, 55, 77, 99, and 127. The minimum contig length was set to 200, and contigs under this threshold were discarded. Contigs belonging to target loci were identified by using BLASTn (Camacho et al. 2009) to match the contig sequences against a set of reference sequences for each locus. The reference sequences used were the Gardenia philastrei Pierre ex Pit. Davis, A.P. 4055 (K) sequences from the Royal Botanic Gardens Kew PAFTOL project (Baker et al. 2022). A sequence match was identified if the sequence matched with at least 80% identity across at least 80% of the contig length. Loci with multiple contig matches were discarded, as they may represent paralogous sequences.
A multiple sequence alignment (MSA) was created from the contig data using MAFFT 7.490 (Katoh et al. 2019) for each locus that was recovered across at least three samples, with the addition of the “no trim” parameter to keep the full contig sequence length. In the next step, reference-based mapping was performed using the consensus sequence of each locus' MSA as a genus-specific reference library. This additional reference assembly generally leads to a more efficient and less biased retrieval of DNA reads across all samples for each locus (Andermann et al. 2018) than using the recovered contig sequences for each sample. The minimum coverage parameter was set at four reads. Consensus sequences were generated from the reads mapping to the genus-specific reference at each locus for each sample, and from these consensus sequences multiple sequence alignments were computed for each locus using MAFFT 7.490 (Katoh et al. 2019).
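The contig-matching rule described above (at least 80% identity across at least 80% of the contig length, with loci that match more than one contig discarded as potential paralogs) can be sketched in Python as follows. This is an illustration of the stated criteria only, not SECAPR's own implementation; the function name select_contigs and the tuple layout of the hits are hypothetical.

```python
from collections import defaultdict


def select_contigs(hits, min_identity=80.0, min_coverage=0.8):
    """Return {locus: contig} for loci matched by exactly one contig.

    ASSUMPTION: each hit is a tuple
    (locus, contig, percent_identity, alignment_length, contig_length);
    this layout is hypothetical and chosen for illustration.
    """
    matches = defaultdict(set)
    for locus, contig, pident, aln_len, contig_len in hits:
        covered = aln_len / contig_len  # fraction of the contig aligned
        if pident >= min_identity and covered >= min_coverage:
            matches[locus].add(contig)
    # Loci matched by several contigs may be paralogous, so drop them.
    return {
        locus: next(iter(contigs))
        for locus, contigs in matches.items()
        if len(contigs) == 1
    }
```

With this rule, a locus matched by two contigs that both pass the thresholds is removed entirely rather than resolved in favour of the better hit, which mirrors the conservative treatment of possible paralogs described in the text.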
Phylogenetic Analysis
Two different phylogenetic methods were used. The first was ASTRAL-III (Zhang et al. 2018), which produces a species tree that shares the maximum number of quartet topologies with the input gene trees. The input gene trees were generated in IQ-TREE 2 (Minh et al. 2020) as a set of bootstrap consensus maximum likelihood gene trees, created using 1000 bootstrap replicates with UFBoot2 (Hoang et al. 2018) and automatic substitution model selection with ModelFinder (Kalyaanamoorthy et al. 2017), both implemented in the IQ-TREE 2 software package. The tree was visualised using FigTree v.1.4.3 (Rambaut 2017).
The second species phylogeny was produced using Bayesian inference with Species Tree And Classification Estimation, Yarely (STACEY; Jones 2017) in BEAST2 (Bouckaert et al. 2019) on the CIPRES Science Gateway web portal (Miller et al. 2012). This method simultaneously estimates gene trees and species trees using a birth-death-collapse model. The input data were a subset of six loci from the de novo contig assembly dataset. The subset was selected numerically: the first six loci in the de novo assembly dataset were chosen (5, 9, 20, 43, 55, and 62), with the exception of locus 59, which was excluded from the analysis because it contained only seven of the 29 samples. The XML input was generated in BEAUti 2.6 (Bouckaert et al. 2019). The samples were not preassigned to species and no partitions were selected. The following parameters and priors were selected: species tree model collapse height: 1e-5; strict clock model, with each locus set as relative to the others; JC69 substitution model; bdcGrowthRate: lognormal (M=5, S=2); collapseWeight: beta (alpha=2, beta=2); population prior: lognormal (M=-7, S=2); relativeDeathRate: beta (alpha=1, beta=1). The MCMC was run for 100 million generations, and Tracer v1.7.1 (Rambaut et al. 2018) was used to explore convergence of parameters. The species tree was generated using TreeAnnotator 2.6.3 (Drummond and Rambaut 2007), after discarding 10% as burn-in, and then visualised using FigTree v.1.4.3 (Rambaut 2017).
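The numerical selection of the STACEY loci (the first six loci in numeric order, skipping a locus recovered in too few samples) can be expressed as a short Python sketch. The source states only that locus 59 was excluded because it had seven of 29 samples; the explicit cutoff min_samples, the function name select_loci, and all sample counts in the usage note other than that for locus 59 are hypothetical values chosen for illustration.

```python
def select_loci(sample_counts, n_loci=6, min_samples=15):
    """Pick the first n_loci loci, in numeric order, with enough samples.

    sample_counts maps locus number -> number of samples recovered.
    ASSUMPTION: min_samples is a hypothetical cutoff; the study reports no
    explicit threshold, only that locus 59 (7 of 29 samples) was excluded.
    """
    chosen = []
    for locus in sorted(sample_counts):
        if sample_counts[locus] >= min_samples:
            chosen.append(locus)
        if len(chosen) == n_loci:
            break
    return chosen
```

As a usage example with made-up counts, `select_loci({5: 29, 9: 29, 20: 29, 43: 29, 55: 29, 59: 7, 62: 29, 70: 29})` skips locus 59 and returns the six loci named in the text.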