
Joining forces in Ochnaceae phylogenomics: A tale of two targeted sequencing probe kits

Cite this dataset

Shah, Toral et al. (2022). Joining forces in Ochnaceae phylogenomics: A tale of two targeted sequencing probe kits [Dataset]. Dryad.


Premise: Both universal and family-specific targeted sequencing probe kits are becoming widely used for the reconstruction of phylogenetic relationships in angiosperms. Within the pantropical Ochnaceae, we show that, with careful data filtering, universal kits are as capable of resolving intergeneric relationships as custom probe kits. Furthermore, we show the strength of combining data from both kits to mitigate bias and provide a more robust resolution of evolutionary relationships.

Methods: We sampled 23 Ochnaceae genera and used targeted sequencing with two probe kits, the universal Angiosperms353 kit, and a family-specific kit. We used maximum likelihood inference with a concatenated matrix of loci and multispecies-coalescence approaches to infer relationships in the family. We explored phylogenetic informativeness and the impact of missing data on resolution and tree support.

Results: For the Angiosperms353 data set, the concatenation approach provided results more congruent with those of the Ochnaceae-specific data set. Filtering missing data was most impactful on the Angiosperms353 data set, with a relaxed threshold being the optimum scenario. The Ochnaceae-specific data set resolved consistent topologies using both inference methods, and no major improvements were obtained after data filtering. The merging of data obtained with the two kits resulted in a well-supported phylogenetic tree.

Conclusions: The Angiosperms353 data set improved with data filtering, and missing data played an important role in phylogenetic reconstruction. The Angiosperms353 data set resolved the phylogenetic backbone of Ochnaceae as well as the family-specific data set did. All analyses indicated that both Sauvagesia L. and Campylospermum Tiegh. as currently circumscribed are polyphyletic and require revised delimitation.


Contig assembly and multiple sequence alignment: The following bioinformatic methods were applied to both data sets. FastQC v.0.11.7 (Andrews, 2010) was used to assess the quality of Illumina raw reads from the bait-enriched samples. The raw sequencing reads were then trimmed with Trimmomatic v.0.36 (Bolger et al., 2014) using the settings LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:36 to remove adapter sequences and low-quality portions. The HybPiper pipeline v.3 (Johnson et al., 2016) was used with default settings to process the quality-checked reads and recover the coding sequences for each locus. Outgroup sequences from the OneKP project (Wickett et al., 2014) were added to each data set.

Paired reads of samples enriched with the Angiosperms353 baits and the Ochnaceae baits were mapped to targets using the BLASTx option (Altschul et al., 1990) and their respective amino acid target files. The sequences obtained with the BLASTx option were used for subsequent analysis because this option was found to recover longer sequences. Mapped reads were then assembled into contigs with SPAdes v.3.13.1 (Bankevich et al., 2012), and the script from the HybPiper suite was used with the .aa flag to output a single sequence per gene, selected on the basis of length, similarity, and coverage. HybPiper flags potential paralogs when multiple contigs map well to a single reference sequence. All loci flagged as potential paralogs were removed from downstream analyses. Subsequent analyses were performed using exon-only data.

Sequence recovery for both data sets is listed in Appendix S2. The percentage of gene recovery was calculated as the sum of the captured length per gene per individual divided by the sum of the mean length of all loci. MAFFT v.7.305b (Katoh et al., 2002) was used to align individual genes using the --auto flag. AMAS (Borowiec, 2016) was used to produce summary statistics for each alignment, evaluating the amount of missing data and the number of parsimony-informative sites (Appendices S3 and S4).
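The gene recovery percentage described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the dictionary layout, and the example lengths are hypothetical, and "mean length of a locus" is taken here to mean the mean length of that locus's reference sequences in the target file.

```python
from statistics import mean


def gene_recovery_percent(captured, target_lengths):
    """Percentage of gene recovery for one individual.

    captured: dict mapping locus name -> recovered length (bp) for the sample.
    target_lengths: dict mapping locus name -> list of reference lengths (bp).
    Both layouts are assumptions made for this sketch.
    """
    # Sum of captured length over all loci (unrecovered loci count as 0).
    total_captured = sum(captured.get(locus, 0) for locus in target_lengths)
    # Sum of the mean reference length of every locus.
    total_mean = sum(mean(lengths) for lengths in target_lengths.values())
    return 100.0 * total_captured / total_mean


# Hypothetical example: two loci, only one recovered.
targets = {"geneA": [900, 1100], "geneB": [500, 500]}
sample = {"geneA": 750}
print(gene_recovery_percent(sample, targets))
```

Loci that were not recovered still contribute to the denominator, so a sample with few recovered genes scores low even if the genes it does have are complete.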

Phylogenetic inference: Both assembled data sets were analyzed individually using the following approaches. An additional data set was generated by combining the genes from both probe kits. The two target files were tested for gene overlap using BLASTx. Seven duplicate genes were removed, and all other recovered genes from both data sets were combined, resulting in 620 individual loci. Where two species were available for a genus, the species with the higher gene recovery from its respective probe kit was selected to represent the genus.
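The rule for choosing one representative species per genus can be illustrated as follows. This is a minimal sketch, not the authors' script: the function name, the input layout, and the placeholder taxon names and recovery values are all hypothetical.

```python
def pick_representatives(recovery):
    """Select, for each genus, the species with the highest gene recovery.

    recovery: dict mapping (genus, species) -> gene recovery percentage.
    Returns a dict mapping genus -> chosen species.
    """
    best = {}
    for (genus, species), pct in recovery.items():
        # Keep the species seen so far with the highest recovery.
        if genus not in best or pct > best[genus][1]:
            best[genus] = (species, pct)
    return {genus: species for genus, (species, _) in best.items()}


# Placeholder names and values, for illustration only.
example = {
    ("GenusA", "sp1"): 62.0,
    ("GenusA", "sp2"): 81.5,
    ("GenusB", "sp1"): 70.0,
}
print(pick_representatives(example))
```

When two species tie, this sketch keeps the one encountered first; the source does not state how ties were handled.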

Multispecies-coalescent (MSC) approach: The aligned exons were used to infer individual maximum likelihood gene trees with IQTREE v.2.0 (Nguyen et al., 2015) with 1000 ultrafast bootstrap replicates using the -bb option. Species trees were then inferred from the gene trees using ASTRAL-III v.5.5.11 (Zhang et al., 2018) with the -t 2 option, which provides full annotation outputs, including quartet support, so that the main topology and the first and second alternatives can be visualized as pie charts on the reconstructed tree.
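The two steps above could be scripted roughly as below. The sketch only assembles the command lines and does not run them. The -bb and -t 2 options come from the text; the executable name `iqtree2`, the jar path `astral.jar`, and all file names are assumptions that would need to match a local installation.

```python
import shlex


def iqtree_gene_tree_cmd(alignment, iqtree_bin="iqtree2"):
    """Command for one ML gene tree with 1000 ultrafast bootstrap replicates."""
    # -s: input alignment; -bb 1000: ultrafast bootstrap, as stated in the text.
    return [iqtree_bin, "-s", alignment, "-bb", "1000"]


def astral_cmd(gene_trees, output, jar="astral.jar"):
    """Command for an ASTRAL species tree with full branch annotation."""
    # -i: file of gene trees; -o: output tree; -t 2: full annotation,
    # including quartet support for the main and alternative topologies.
    return ["java", "-jar", jar, "-i", gene_trees, "-o", output, "-t", "2"]


# Hypothetical file names, for illustration only.
print(shlex.join(iqtree_gene_tree_cmd("locus_001.fasta")))
print(shlex.join(astral_cmd("all_gene_trees.tre", "species_tree.tre")))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the tools are installed, with the per-locus tree files concatenated into one file before the ASTRAL step.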

Concatenation approach: An additional analysis was performed by concatenating the exon alignments for all loci using AMAS. A species tree was generated from the concatenated exon alignments using IQTREE v.2.0, and two measures of genealogical concordance were then calculated for each data set: the gene concordance factor (gCF) and the site concordance factor (sCF), using the options -gcf and -scf in IQTREE v.2.0 (Nguyen et al., 2015). For a given branch, the gCF is the percentage of gene trees containing that branch, and the sCF is the percentage of decisive alignment sites supporting that branch.
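To make the gCF definition concrete, the following sketch computes it for one branch of the species tree. It is a deliberate simplification and not IQTREE's implementation: it assumes every gene tree contains the full taxon set (so every gene tree is decisive for the branch), and it represents each gene tree as a set of bipartitions. The function name and data layout are hypothetical.

```python
def gcf_percent(branch, gene_tree_splits, taxa):
    """Percentage of gene trees that contain a given species-tree branch.

    branch: iterable of taxa on one side of the branch.
    gene_tree_splits: list with one set of frozensets per gene tree, each
        frozenset being one side of a bipartition in that gene tree.
    taxa: iterable of all taxa (assumed present in every gene tree).
    """
    side = frozenset(branch)
    other = frozenset(taxa) - side
    # A gene tree contains the branch if either side of the split is present.
    hits = sum(1 for splits in gene_tree_splits if side in splits or other in splits)
    return 100.0 * hits / len(gene_tree_splits)


# Four toy gene trees on taxa A-D; three of them group A with B.
taxa = ["A", "B", "C", "D"]
gene_trees = [
    {frozenset("AB")},
    {frozenset("AB")},
    {frozenset("CD")},  # the same split, recorded from the other side
    {frozenset("AC")},
]
print(gcf_percent("AB", gene_trees, taxa))
```

With missing taxa, the denominator would have to be restricted to the gene trees that are decisive for the branch, which this sketch does not attempt.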

Usage notes

HybPiper plus all dependencies (see for installation details)






Calleva Foundation

The Sackler Trust

Deutsche Forschungsgemeinschaft, Award: ZI 557/14-1
