Hybrid enrichment is an increasingly popular approach for obtaining hundreds of loci for phylogenetic analysis across many taxa quickly and cheaply. The genes targeted for sequencing are typically single-copy loci, which facilitate a more straightforward sequence assembly and homology assignment process. However, this approach limits the inclusion of most genes of functional interest, which often belong to multi-gene families. Here we demonstrate the feasibility of including large gene families in hybrid enrichment protocols for phylogeny reconstruction and subsequent analyses of molecular evolution, using a new set of bait sequences designed for the “portullugo” (Caryophyllales), a moderately sized lineage of flowering plants (∼2200 species) that includes the cacti and harbors many evolutionary transitions to C4 and CAM photosynthesis. Including multi-gene families allowed us to simultaneously infer a robust phylogeny and construct a dense sampling of sequences for a major enzyme of C4 and CAM photosynthesis, which revealed the accumulation of adaptive amino acid substitutions associated with C4 and CAM origins in particular paralogs. Our final set of matrices for phylogenetic analyses included 75–218 loci across 74 taxa, with ∼50% matrix completeness across datasets. Phylogenetic resolution was greatly improved across the tree, at both shallow and deep levels. Concatenation and coalescent-based approaches both resolve the sister lineage of the cacti with strong support: Anacampserotaceae + Portulacaceae, two lineages of mostly diminutive succulent herbs of warm, arid regions. In spite of this congruence, BUCKy concordance analyses demonstrated strong and conflicting signals across gene trees. Our results add to the growing number of examples illustrating the complexity of phylogenetic signals in genomic-scale data.
Supp. Table 1. Gene families included in the bait design and sequencing project, showing the full name of each gene family, the shortened version of its name that is used in the paper, and whether the gene was classified as related to C4 or CAM photosynthesis.
Supp. Table 2. Voucher table for all individuals included in phylogenetic analyses, including statistics on enrichment success and NCBI SRA accession numbers.
Supp. Table 3. Sequencing statistics for g2, g5, g9, i37, and i57 datasets.
Supp. Table 4. Paralogs per individual per gene family (the same data shown in the heatmap in Supp. Fig. 2).
The results of the MrBayes analyses for all loci, with individuals color-coded according to family and posterior probabilities shown for major groups.
An archived folder containing the following files for each of the five datasets (g2, g5, g9, i37, and i57; example file names for the g5 dataset are given): the concatenated alignment in fasta format (c2p1pgtc2_g5_combined_72inds_163seqs.fa), the RAxML tree from the concatenated alignment (RAxML_bipartitions.c2p1pgtc2_g5_combined_72inds_163seqs), the Astral tree (c2p1pgts2_g5_astral.tre), and the list of included loci (c2p1gt2_Locus_List_g5.txt).
An archived folder containing the following files for each of the individual loci (example file names for ppc2 are given): alignment in fasta format (c2p1pgts2_ppc2.fa), the RAxML tree without bootstrap values (RAxML_bestTree.c2p1pgts2_ppc2), and trees from 100 bootstrap replicates conducted in RAxML (RAxML_bootstrap.c2p1pgts2_ppc2). (The latter two files are the input for an Astral analysis.)
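As an illustration of how the per-locus files might be assembled for an Astral run, the following Python sketch concatenates the RAxML_bestTree.* files into a single gene-tree file and writes a list of paths to the matching RAxML_bootstrap.* files. It is not part of the archived pipeline. The function name build_astral_inputs and the output file names gene_trees.tre and bootstrap_files.txt are hypothetical, and the assumption that Astral takes the first file as its gene-tree input and the second as its list of bootstrap files should be checked against the documentation of the Astral version in use.

```python
from pathlib import Path


def build_astral_inputs(locus_dir, out_dir):
    """Collect per-locus RAxML output into two files for a species-tree run.

    Assumes every locus has a RAxML_bestTree.<name> file and a matching
    RAxML_bootstrap.<name> file in locus_dir (naming as in the archive).
    Output file names are hypothetical choices made for this sketch.
    """
    locus_dir = Path(locus_dir)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    prefix = "RAxML_bestTree."
    kept = []
    with (out_dir / "gene_trees.tre").open("w") as trees, \
            (out_dir / "bootstrap_files.txt").open("w") as paths:
        for best in sorted(locus_dir.glob(prefix + "*")):
            name = best.name[len(prefix):]
            boot = best.with_name("RAxML_bootstrap." + name)
            if not boot.exists():
                # Skip loci that lack bootstrap replicates so that both
                # output files list the same loci in the same order.
                continue
            trees.write(best.read_text().strip() + "\n")
            paths.write(str(boot.resolve()) + "\n")
            kept.append(name)
    return kept
```

Calling build_astral_inputs("loci", "astral_in") on the unpacked folder would return the names of the loci that were included, which can be compared against the locus lists supplied with each dataset.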
The alignment of the ppc1E1 gene family in fasta format (used in the analysis shown in Fig. 5).
The gene family tree for the ppc1E1 gene family created in RAxML (used in the analysis shown in Fig. 5).
An archived folder containing separate folders for each of the nine families in the portullugo (the groups used for the pipeline). Each folder contains two files for each gene family: an sc_*.fa file with the contigs for that gene family and an sb3_*.out file that has the results of BLASTing that fasta file against the database of exons. These two files are the input for part II of the pipeline.
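Before running part II, it can be useful to confirm that each contig file in a family folder has its matching BLAST output. The Python sketch below is one way to do that, not the pipeline's own code. It assumes that the wildcard portion of the two file names is the same string for a given gene family (for example, sc_ppc2.fa paired with sb3_ppc2.out), and the function name pair_pipeline_inputs is hypothetical.

```python
from pathlib import Path


def pair_pipeline_inputs(family_dir):
    """Pair each sc_*.fa contig file with its sb3_*.out BLAST result.

    Assumes the wildcard part of both names is identical for one gene
    family; this is an inference from the caption, not a documented rule.
    Returns (pairs, missing): a dict of gene -> (fasta, blast_out) and a
    list of genes whose BLAST output was not found.
    """
    family_dir = Path(family_dir)
    pairs = {}
    missing = []
    for contigs in sorted(family_dir.glob("sc_*.fa")):
        gene = contigs.stem[len("sc_"):]
        blast_out = family_dir / f"sb3_{gene}.out"
        if blast_out.exists():
            pairs[gene] = (contigs, blast_out)
        else:
            missing.append(gene)
    return pairs, missing
```

Any gene family reported in the missing list would need its BLAST search rerun, or its file names checked, before it could be used as input.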
Supp. Table 5a. Results from the validation analyses after the transcript fragments were run through part II of the pipeline only.
Supp. Table 5b. Results from the validation analyses after the transcript fragments were run through both parts II and III of the pipeline.
Supp. Table 5c. Summary statistics from the validation analyses.
Supp. Table 6. Node support for all nodes that conflict or have less than 95% bootstrap support in any of the concatenated or Astral trees (the trees shown in Fig. 3).
Supp. Table 7. Concordance factors from BUCKy analysis for all putative clades with at least 10% genome-wide support.
An archived folder containing two folders. The trees folder contains the original backbone trees and the list of outgroups for running parts II and III of the pipeline. The blastdbs folder contains the fasta files to make the two BLAST databases (the original database for the start of part I of the pipeline, called forblast20150128b.fa, and the individual databases for each locus, with sequences that have been divided into exons, for the end of part I of the pipeline).
Supplementary methods, focused on bait design and an expanded explanation of the pipeline.