Dataset from: Incomplete lineage sorting and reticulate evolution mask species relationships in Brunelliaceae, an Andean family with rapid, recent diversification
Murillo-A., José et al. (2022), Dataset from: Incomplete lineage sorting and reticulate evolution mask species relationships in Brunelliaceae, an Andean family with rapid, recent diversification, Dryad, Dataset, https://doi.org/10.5061/dryad.3tx95x6jd
Premise: To date, phylogenetic relationships within the monogeneric Brunelliaceae have been based on morphological evidence, which does not provide sufficient phylogenetic resolution. Here we use target-enriched nuclear data to improve our understanding of phylogenetic relationships in the family.
Methods: We used the Angiosperms353 toolkit for targeted recovery of exonic regions and supercontigs (exons + introns) from low copy nuclear genes from 53 of 70 species in Brunellia, and several outgroup taxa. We removed loci that indicated biased relationships and applied concatenated and coalescent methods to infer Brunellia phylogeny. We identified conflicts among gene trees that may reflect hybridization or incomplete lineage-sorting events and assessed their impact on the phylogeny. Finally, we performed ancestral-state reconstructions of morphological traits and assessed the homology of character states used to define sections and subsections in Brunellia.
Results: Brunellia comprises two major clades and several subclades. Most of these clades/subclades do not correspond to previous infrageneric taxa. There is high topological incongruence among the subclades across analyses.
Conclusions: Phylogenetic reconstructions point to rapid species diversification in Brunelliaceae, reflected in very short branches between successive species splits. The removal of putatively biased loci slightly improves phylogenetic support for individual clades. Reticulate evolution due to hybridization and/or incomplete lineage sorting likely both contribute to gene-tree discordance. Morphological characters used to define taxa in current classification schemes are homoplastic in the ancestral character state reconstructions. While target enrichment data allows us to broaden our understanding of diversification in Brunellia, the relationships among subclades remain incompletely understood.
We used Hyb-Seq data generated for 62 samples of Brunellia and some outgroups. The sequence data consisted of 150-bp paired-end reads obtained on an Illumina HiSeq X. Raw reads were quality-trimmed with Trimmomatic 0.39 (Bolger et al., 2014) to remove low-quality bases from the beginning and end of each read (where 4-bp windows had a quality score below Q20) and to discard reads shorter than 30 bp.
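The windowed trimming rule described above can be illustrated with a minimal Python sketch. It is not Trimmomatic's implementation: it scans only from the 5' end, cuts the read at the first 4-bp window whose mean quality is below Q20, and discards reads left shorter than 30 bp. The function names and the list-of-integers quality representation are assumptions made for illustration.

```python
from statistics import mean


def sliding_window_cut(quals, window=4, min_q=20):
    """Return how many leading bases to keep.

    Scans 5' to 3' and cuts at the start of the first window whose
    mean Phred quality is below min_q (simplified illustration only).
    """
    for start in range(0, len(quals) - window + 1):
        if mean(quals[start:start + window]) < min_q:
            return start
    return len(quals)


def trim_read(seq, quals, window=4, min_q=20, min_len=30):
    """Trim one read; return (seq, quals), or None if it becomes too short."""
    keep = sliding_window_cut(quals, window, min_q)
    if keep < min_len:
        return None  # read discarded, as with a minimum-length filter
    return seq[:keep], quals[:keep]


if __name__ == "__main__":
    # Hypothetical read: 40 good bases followed by 10 poor ones.
    read = "A" * 50
    phred = [30] * 40 + [10] * 10
    result = trim_read(read, phred)
    print(len(result[0]) if result else "discarded")
```

In practice the trimming would be done by Trimmomatic itself; the sketch only shows the logic of a window-based quality cutoff combined with a minimum read length.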
After trimming, paired reads were processed using HybPiper 1.3.1 (Johnson et al., 2016; available at https://github.com/mossmatters/Angiosperms353), with BWA (Li et al., 2009) to align the reads to the DNA targets and SPAdes (Bankevich et al., 2012) for de novo assembly of reads. We performed the first capture using the “Angiosperms353_targetSequences” fasta file available on the HybPiper website. We recovered exonic regions with the “exonerate” script. Because of its high coverage, the sample “Brunellia_inermis_Orozco 4085” was then used to build a custom set of exon targets to maximize data recovery across the sampling. Reads from each sample that assembled to these references were saved by default. We recovered exon data and supercontigs (exons + introns) of the 353 genes using the “reads_first.py” and “exonerate_hits.py” scripts.
To the “exons” and “supercontigs” datasets we added the DNA sequences of Cephalotus follicularis (GenBank PRJDB4484), retrieved from NCBI for the 353 protein targets. We inspected putative paralogs with the “paralog_investigator.py” script, and genes with paralogs in the ingroup were removed from subsequent analyses. The remaining genes were evaluated using ten parameters to minimize potential bias from a strong but misleading signal arising from, for example, sequence saturation, long-branch attraction, or hidden paralogy due to polyploidization. Genes whose value for any of these parameters fell outside 1.5 times the interquartile range were excluded from all downstream analyses.
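The exclusion rule above can be sketched in Python as follows. This is an assumption-laden illustration, not the authors' script: it reads the 1.5 interquartile range criterion as the usual fences (first quartile minus 1.5 × IQR, third quartile plus 1.5 × IQR), and the parameter names `param_a` and `param_b` are hypothetical placeholders for the ten parameters actually used.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def filter_genes(table, k=1.5):
    """Keep genes that are inside the fences for every parameter.

    table maps gene name -> {parameter name -> numeric value}.
    """
    params = list(next(iter(table.values())).keys())
    keep = set(table)
    for p in params:
        lo, hi = iqr_bounds([row[p] for row in table.values()], k)
        keep -= {g for g, row in table.items() if not lo <= row[p] <= hi}
    return keep


if __name__ == "__main__":
    # Hypothetical values; gene g8 is an outlier for param_a.
    values_a = [1, 2, 3, 4, 5, 6, 7, 100]
    genes = {
        f"g{i + 1}": {"param_a": v, "param_b": 10}
        for i, v in enumerate(values_a)
    }
    print(sorted(filter_genes(genes)))
```

A gene is dropped as soon as it is an outlier for any single parameter, which matches the "for any of the considered parameters" wording; the quartile interpolation method is a choice of this sketch and may differ from the one used in the study.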
We aligned the curated “exons” and “supercontigs” datasets using MAFFT (Katoh and Toh, 2014) and trimmed the alignments using TrimAl v1.4 (Capella-Gutierrez et al., 2009) with the "automated1" option. We analyzed concatenated alignments using maximum likelihood (ML) in IQ-TREE v2.0.6 (Minh et al., 2020). We also used IQ-TREE to build trees for individual loci, which were subsequently used as input for ASTRAL analyses (Zhang et al., 2018). Here, we include the phylogenetic trees from the curated "supercontig" dataset, that is, the dataset without paralogs or potentially biasing loci, referred to in the paper as "SCWE".
This dataset contains five folders with the following contents:
Includes the Brunellia_inermis_Orozco4085 gene list used as references for capturing reads on all samples through the HybPiper 1.3.1 pipeline (Johnson et al., 2016; available at https://github.com/mossmatters/Angiosperms353).
Includes the reads captured using HybPiper on each sample. Reads were previously trimmed for quality using Trimmomatic 0.39 (Bolger et al., 2014).
Includes sequences of each gene obtained using HybPiper in two forms: (1) only exonic regions, and (2) exonic regions, introns, and 5' and 3' flanks of the gene (also called supercontigs).
Includes alignments of the "exons" and "supercontigs" datasets (after trimming with TrimAl and removal of paralogous genes).
Includes the ML and ASTRAL trees of the "supercontig-with-exclusions" dataset (SCWE).
Departamento Administrativo de Ciencia, Tecnología e Innovación (COLCIENCIAS), Award: Colciencias-1101658
Dirección de Investigación, Universidad Nacional de Colombia, Award: DIEB-15858
National Science Foundation, Award: DUE-1564969
Southern Illinois University
National Science Foundation, Award: DBI-2002400