Data from: CAnDI: a new tool to investigate conflict in homologous gene trees and explain convergent trait evolution

Robertson, Holly 1; Walker, Joseph 2; Moyroud, Edwige 1

Published Apr 22, 2025 on Dryad. https://doi.org/10.5061/dryad.g4f4qrfzq

Data files

Apr 22, 2025 version files (6.26 MB)

CarnivoryDataset.zip (6.26 MB)
README.md (4.60 KB)

Abstract

Phenotypic convergence is found across the tree of life, and morphological similarities in distantly related species are often presumed to have evolved independently. However, clarifying the origins of traits has recently highlighted the complex nature of evolution, as apparent convergent features often share similar genetic foundations. Hence, the tree topology of genes that underlie such traits frequently conflicts with the overall history of species relationships. This conflict creates both a challenge for systematists and an exciting opportunity to investigate the rich, complex network of information that connects molecular trajectories with trait evolution. Here we present a novel conflict identification program named CAnDI (Conflict And Duplication Identifier), which enables the analysis of conflict in homologous gene trees rather than inferred orthologs. We demonstrate that the analysis of conflicts in homologous trees using CAnDI yields more comparisons than in ortholog trees in six datasets from across the eukaryotic tree of life. Using the carnivorous trap of Caryophyllales, a charismatic group of flowering plants, as a case study, we demonstrate that analysing conflict on entire homolog trees can aid in inferring the genetic basis of trait evolution: by dissecting all gene relationships within homolog trees, we find genomic evidence that the molecular basis of the plesiomorphic mucilaginous sticky trap was likely present in the ancestor of all carnivorous Caryophyllales. We also show that many genes whose evolutionary trajectories group species with similar trap devices code for proteins contributing to plant carnivory and identify a LATERAL ORGAN BOUNDARY DOMAIN transcription factor as a possible candidate for regulating sticky trap development.

Description of the data and file structure

Our manuscript did not generate new sequencing data; instead, it re-analyses publicly available data, whose sources are described below.

Our manuscript describes new software, CAnDI (Conflict And Duplication Identifier); the link to all code, the manual, and related information on GitHub is given below.

In the folder CarnivoryDataset.zip we also provide:

Subfolder 'CarnivoryHomologs': contains the extracted homolog trees used in the paper from the carnivorous plant dataset.
'CarnivorySpeciesTree.tre': the species tree used for mapping
Subfolder 'SearchResultsCarnivory': contains the results of the CAnDI searches. In this subfolder, the file names begin with the genera being compared using the CAnDI search function. Files with _nosup_ in the name list the genes identified as conflicting, regardless of statistical support for the relationship. Files with _withsup_ in the name list the genes that had statistical support for the relationship in question. Files with _annotation at the end contain the BLAST-identified Arabidopsis annotation.
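The naming convention above can be used to sort the search results programmatically. The following is a minimal Python sketch, not part of CAnDI: the function name is hypothetical, and it assumes that everything before the _nosup_ or _withsup_ marker is the genera prefix and that any file extension follows the optional _annotation suffix. The example file name is invented for illustration.

```python
# Markers described in the README, mapped to a short label.
MARKERS = {"_withsup_": "withsup", "_nosup_": "nosup"}


def classify_result_file(filename):
    """Describe a search-result file from the markers in its name.

    Assumption: the text before the support marker names the genera
    being compared; the exact prefix layout is not documented here.
    """
    stem = filename.rsplit(".", 1)[0]  # drop a trailing extension, if any
    support, genera = "unknown", stem
    for marker, label in MARKERS.items():
        if marker in stem:
            support = label
            genera = stem.split(marker, 1)[0]
            break
    return {
        "genera": genera,
        "support": support,
        "annotated": stem.endswith("_annotation"),
    }


# Hypothetical file name, for illustration only.
info = classify_result_file("DrosophyllumDrosera_withsup_genes_annotation.txt")
print(info)
```

Files whose names carry neither marker are reported with support "unknown" rather than guessed.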

Sharing/Access information

We used six datasets of species trees, ortholog trees, and gene families (homolog trees) from published studies that broadly span the tree of life.

Data were derived from the following sources:

CARN: Walker J.F., Yang Y., Moore M.J., Mikenas J., Timoneda A., Brockington S.F., Smith S.A. 2017. Widespread paleopolyploidy, gene tree conflict, and recalcitrant relationships among the carnivorous Caryophyllales. Am. J. Bot. 104:858–867.
DIAT: Parks M.B., Nakov T., Ruck E.C., Wickett N.J., Alverson A.J. 2018. Phylogenomics reveals an extensive history of genome duplication in diatoms (Bacillariophyta). Am. J. Bot. 105:330–347.
LEGU: Koenen E.J.M., Ojeda D.I., Steeves R., Migliore J., Bakker F.T., Wieringa J.J., Kidner C., Hardy O.J., Pennington R.T., Bruneau A., Hughes C.E. 2020. Large-scale genomic sequence data resolve the deepest divergences in the legume phylogeny and support a near-simultaneous evolutionary origin of all six subfamilies. New Phytol. 225:1355–1369.
ERIC: Larson D.A., Walker J.F., Vargas O.M., Smith S.A. 2020. A consensus phylogenomic approach highlights paleopolyploid and rapid radiation in the history of Ericales. Am. J. Bot. 107:773–789.
HYMN: Johnson B.R., Borowiec M.L., Chiu J.C., Lee E.K., Atallah J., Ward P.S. 2013. Phylogenomics Resolves Evolutionary Relationships among Ants, Bees, and Wasps. Curr. Biol. 23:2058–2062.
AMAR: Morales-Briones D.F., Kadereit G., Tefarikis D.T., Moore M.J., Smith S.A., Brockington S.F., Timoneda A., Yim W.C., Cushman J.C., Yang Y. 2021. Disentangling Sources of Gene Tree Discordance in Phylogenomic Data Sets: Testing Ancient Hybridizations in Amaranthaceae s.l. Syst. Biol. 70:219–235.

Code/Software

The novel program CAnDI (Conflict And Duplication Identifier) is available at

https://github.com/HollyMaeRobertson/CAnDI

For gene expression data analysis, Welch’s t-tests or ANOVA followed by Tukey’s HSD post-hoc tests were conducted in R (https://www.r-project.org; R Core Team 2017).

Gene Ontology analyses were performed with PANTHER16.0.

Comparison of whole homolog trees with ortholog trees for identifying gene tree conflict in six datasets

We compiled six datasets of species trees, ortholog trees, and gene families (homolog trees) from published studies that broadly span the tree of life. We used the species relationships and homolog trees inferred in the initial studies as the basis for each conflict analysis. Where available, we used the rooted homolog trees from the original study; otherwise, we extracted them with the extract_clades.py script from Yang et al. (2015).

All datasets, both homologs and orthologs, were analysed using CAnDI to count the total number of gene-tree nodes assessed against each node in the species tree; each assessed node was scored as either conflicting or concordant with the species tree. Gene duplication nodes were not counted.
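To make the conflicting/concordant distinction concrete, here is a deliberately simplified Python sketch. It is not CAnDI's implementation: it treats each rooted clade as a plain set of taxon names, ignores gene duplications entirely, and uses the textbook rule that two rooted clades are compatible when they are nested or disjoint. The function name and the placeholder taxa A to E are hypothetical.

```python
def classify_clade(gene_clade, species_clade, gene_tree_taxa):
    """Compare one gene-tree clade with one species-tree clade.

    Simplified rule (an assumption, not CAnDI's algorithm):
      - restrict the species clade to taxa sampled in the gene tree;
      - identical taxon sets                  -> "concordant";
      - overlapping sets, neither nested      -> "conflicting";
      - anything else (nested or disjoint)    -> "uninformative".
    """
    g = frozenset(gene_clade)
    s = frozenset(species_clade) & frozenset(gene_tree_taxa)
    if len(s) < 2:
        return "uninformative"  # species node cannot be tested in this tree
    if g == s:
        return "concordant"
    if g & s and not (g <= s or s <= g):
        return "conflicting"
    return "uninformative"


# Placeholder taxa; the gene tree samples A, B, C and D only.
sampled = {"A", "B", "C", "D"}
print(classify_clade({"A", "C"}, {"A", "B"}, sampled))       # overlapping, not nested
print(classify_clade({"A", "B"}, {"A", "B", "E"}, sampled))  # E is unsampled
print(classify_clade({"A", "B"}, {"C", "D"}, sampled))       # disjoint
```

Only the first two outcomes would contribute to a per-node tally of the kind described above; comparisons scored "uninformative" here are simply not assessed.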

We extracted 12,172 rooted homologs from the unrooted homologs previously inferred for the plant family Amaranthaceae (AMAR) by Morales-Briones et al. (2021), using a threshold of 40 ingroup taxa. The AMAR dataset did not have support values for homolog trees. The 12,699 orthologs of the AMAR dataset with appropriate outgroup representation for rooting were from the original study, where they were extracted using the monophyletic outgroup (MO) method (Yang and Smith 2014).

The Diatom (DIAT) dataset from Parks et al. (2018) had duplications extracted with a minimum of 10 ingroup taxa, resulting in 18,914 homolog trees. The support values for this dataset were from a rapid bootstrap analysis; thus, we applied a cutoff of ≥70%. The 197 orthologs from the original study, which were extracted using the Rooted Tree (RT) method (Yang and Smith 2014), were used for the ortholog comparison for the DIAT dataset.

The Legumes dataset (LEGU), consisting of 8,038 homolog trees from Koenen et al. (2020), had been extracted in the original study, and support values on the nodes were from a rapid bootstrap analysis; thus, a cutoff of ≥70% was used. For the LEGU orthologs, we used the ortholog trees extracted with the RT method from the original study. To extract the homologs we used the same outgroup taxa, resulting in 1,014 trees with outgroup representation.

The Hymenoptera dataset (HYMN) was originally published by Johnson et al. (2013) but was subsequently re-analysed by Smith et al. (2015), who identified 5,863 homolog trees and calculated rapid bootstrap support; therefore, we used a support cutoff of ≥70% for the HYMN dataset. To generate orthologs for the HYMN dataset, we used the RT method with a minimum cutoff of four taxa using the prune_from_rooted_tree.py script of Morales-Briones et al. (2021), resulting in 6,519 orthologs.

The 6,902 homolog trees from the Ericales (ERIC) dataset from Larson et al. (2020) were analysed using a support cutoff of SH-aLRT ≥ 80%. Conflict analysis on ortholog trees from the ERIC dataset was conducted using the 382 rooted ortholog trees from the original study that had been extracted using the RT method.

The details of the carnivorous Caryophyllales (CARN) dataset, also used in this comparison, are described below.
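The support cutoffs stated above for these five datasets can be collected in one place. The Python sketch below is illustrative only: the dictionary and function names are hypothetical, and treating every AMAR node as passing (because its homolog trees carry no support values) is an assumption about how unsupported trees are handled, not a documented CAnDI behaviour.

```python
# (support measure, minimum value in %) per dataset, as stated in the text.
# None means the homolog trees carried no support values.
SUPPORT_CUTOFFS = {
    "AMAR": None,
    "DIAT": ("rapid bootstrap", 70.0),
    "LEGU": ("rapid bootstrap", 70.0),
    "HYMN": ("rapid bootstrap", 70.0),
    "ERIC": ("SH-aLRT", 80.0),
}


def node_passes(dataset, support):
    """Return True if a node's support value meets the dataset's cutoff."""
    cutoff = SUPPORT_CUTOFFS[dataset]
    if cutoff is None:
        return True   # assumption: no support values, so no filtering
    if support is None:
        return False  # unlabelled node in a dataset that has support values
    _measure, minimum = cutoff
    return support >= minimum


print(node_passes("DIAT", 70))    # meets the bootstrap cutoff
print(node_passes("ERIC", 79.9))  # below the SH-aLRT cutoff
```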

Investigating the relationship of conflict with trait distribution in the carnivorous Caryophyllales (CARN) dataset

We used a minimum threshold of four ingroup taxa to extract 6,006 rooted homolog trees from the unrooted homolog trees inferred by Walker et al. (2017). We term this the carnivorous Caryophyllales (CARN) dataset. Information regarding this and other datasets, as well as all supplementary material, is available on Dryad (doi:10.5061/dryad.w3r2280xt). The 1,237 orthologs we used for the CARN dataset were those from the original study, which were extracted using one-to-one orthology with 100% gene occupancy required. All trees in the CARN dataset had SH-aLRT support values; thus, we used a support cutoff of ≥80%. We then used CAnDI to identify all homolog trees in the CARN dataset that contained relationships (bipartitions) of interest and the number of times each relationship of interest occurred in a single homolog tree. We identified trees with the bipartition Drosophyllum + Drosera and all trees with the bipartition Drosophyllum + Ancistrocladus. We also used CAnDI to identify all bipartitions that contained only Nepenthes ampullaria, Nepenthes alata, and Drosophyllum lusitanicum. The same procedure was performed to identify all instances in homolog trees where Ancistrocladaceae (Ancistrocladus) was sister to Nepenthaceae (N. alata and N. ampullaria). For each group of taxa investigated, we compared the number of bipartitions that contained only those taxa against the number of instances where an additional species was also included in the bipartition, to eliminate any bias caused by differences in gene recovery in different samples.
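The idea of searching a homolog tree for clades that contain only a given set of taxa can be sketched in a few lines of Python. This is an illustration of the concept, not CAnDI's search function. It assumes plain Newick input without quoted labels and tip names of the form taxon@sequenceID (an assumption about the tip labels; pass a different tip_to_taxon function otherwise). The toy tree is invented and is not taken from the CARN dataset.

```python
import re


def clade_taxon_sets(newick, tip_to_taxon=lambda name: name.split("@")[0]):
    """Return the set of taxa under every internal node of a rooted Newick tree."""
    tokens = re.findall(r"[(),;]|[^(),;]+", newick.strip())
    stack, clades, prev = [], [], None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = stack.pop()
            clades.append(frozenset(done))
            if stack:
                stack[-1] |= done  # taxa also belong to the parent clade
        elif tok in ",;":
            pass
        elif prev == ")":
            pass  # internal label: support value and/or branch length
        else:
            name = tok.split(":")[0].strip()  # strip the branch length
            if name and stack:
                stack[-1].add(tip_to_taxon(name))
        prev = tok
    return clades


def count_exact_clades(newick, target_taxa, **kwargs):
    """Count clades whose taxa are exactly target_taxa (copy number ignored)."""
    target = frozenset(target_taxa)
    return sum(1 for clade in clade_taxon_sets(newick, **kwargs) if clade == target)


# Toy tree for illustration only.
toy = ("((Drosophyllum@1:0.1,Drosera@7:0.2)95:0.05,"
       "(Nepenthes@3:0.1,Ancistrocladus@2:0.1)80:0.1);")
print(count_exact_clades(toy, {"Drosophyllum", "Drosera"}))
```

Because clades are compared as sets, nested clades with the same taxon composition (for example after a duplication) are each counted; a real analysis would also need to apply the support cutoff and handle duplication nodes.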

Investigating gene function in the CARN dataset

Annotations for the CARN dataset were inferred using the coding sequences of the genome of Arabidopsis thaliana (TAIR10; downloaded from Ensembl Plants on Jan 27, 2021). The best BLAST hit (e-value 1e-3) to the transcriptome of Nepenthes alata from each homolog cluster was used to predict the function of each cluster. Using this approach, 137 out of the 146 clusters supporting the monotypic family Drosophyllaceae as sister to the genus Drosera could be annotated (9 clusters failed to return a BLAST hit against A. thaliana). These 137 annotations represented 132 distinct Arabidopsis genes. Gene ontology analyses with PANTHER16.0 were performed on this list of 132 A. thaliana genes using the PANTHER Overrepresentation Test (Released 20210224) and the GO Ontology database (DOI: 10.5281/zenodo.5228828, Released 2021-08-18) and either the “GO biological process complete” or the “GO molecular function complete” dataset. Fisher’s exact test was performed with False Discovery Rate correction for each analysis. A Chitinase-like protein 1 [homolog of Arabidopsis CTL1; labelled cluster2518 by Walker et al. (2017)] was further investigated since this gene is associated with carnivory. We spot-checked the annotation for this gene by testing the BLAST annotation of other homologs within cluster2518 against the NCBI RefSeq non-redundant protein database to ensure that this was not the result of a spurious hit. All sequences from cluster2518 matched a Chitinase gene. We then used the reverse mapping feature of CAnDI (“-r”) to map the coalescent-based species tree topology for the CARN dataset onto the cluster2518 homolog tree.
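Selecting one best hit per query from BLAST output can be sketched as follows. This Python example is a hedged illustration, not the script used in the study: it assumes the standard 12-column tabular BLAST format (query ID in column 1, subject ID in column 2, e-value in column 11, bit score in column 12), reads the stated 1e-3 as a maximum e-value, and defines "best" as the highest bit score, all of which are assumptions. The function name and the example rows are invented.

```python
def best_hits(lines, max_evalue=1e-3):
    """Keep, for each query, the highest-scoring hit with e-value <= max_evalue."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Invented rows for illustration; identifiers are placeholders.
rows = [
    "query1\tgeneA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "query1\tgeneB\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-20\t100",
    "query2\tgeneC\t40.0\t50\t30\t0\t1\t50\t1\t50\t0.5\t20",
]
hits = best_hits(rows)
print(hits)
print(sorted(set(q.split("\t")[0] for q in rows) - set(hits)))  # queries left unannotated
```

Queries with no hit under the cutoff are absent from the result, which corresponds to the clusters reported above as failing to return a hit.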

Plant material

Drosera admirabilis, Drosera aliciae and Drosera coccicaulis were purchased from Wack’s Wicked plants (UK), grown in ambient conditions (natural light, with temperature kept between 18 and 25°C) and watered from the base with distilled water. Tissues were harvested from emerging young leaves (unfurled primordia <1 cm long, trichomes absent or not yet able to secrete mucilage), mature leaves representing fully developed sticky traps densely covered with trichomes (stalked glandular hairs) secreting mucilage droplets (Fig. S5), and inflorescences, and were immediately frozen in liquid nitrogen. For each species, tissues were harvested from three distinct individuals to constitute three biological replicates for the gene expression analysis. All tissue samples were stored at -80°C until RNA extraction.

Gene expression analysis

Frozen tissues (from three distinct individuals for each species, constituting three biological replicates) were ground to a fine powder using a mortar and pestle. RNA was extracted using the Spectrum Plant Total RNA kit (Sigma-Aldrich) and reverse-transcribed using SuperScript III reverse transcriptase (Invitrogen) following the manufacturer’s instructions. Quantitative real-time PCR was performed using the Luna Universal qPCR Mastermix (NEBiolabs) and the LightCycler 480 system (Roche). For each species and tissue type, three independent biological replicates were used (i.e., RNA extracted from three distinct individuals) and three technical replicates were performed for each condition (i.e., three repeats of the qRT-PCR reaction assessing gene expression for a given tissue type, of a given biological replicate for each species). Actin and eIF4A homologs were used as reference genes as established by Arai et al. (2021). The primer efficiencies were 100% for Actin, 93% for eIF4A and 109% for LBD4-like homolog genes. Primers were designed to match gene regions conserved between the three different species of Drosera. Primer sequences are given in Supplementary Spreadsheet 3. Gene expression was calculated relative to the housekeeping gene actin, and the common base method, which accounts for the measured efficiency of each primer pair, was used to calculate relative expression levels (Ganger et al. 2017). Welch’s t-tests or ANOVA followed by Tukey’s HSD post-hoc tests were conducted in R (https://www.r-project.org; R Core Team 2017) and used to test whether LBD4-like average expression levels were equivalent across tissues.
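The efficiency correction in the common base method can be illustrated with a short Python sketch. This reflects our reading of Ganger et al. (2017), in which each Cq value is weighted by log10 of the primer pair's amplification factor before the reference gene is subtracted; it is not the analysis script used here. Converting a percentage efficiency to an amplification factor as 1 + efficiency/100 (so 100% gives 2) is an assumption, and the Cq values in the example are hypothetical.

```python
import math


def weighted_cq(cq, efficiency_percent):
    """Efficiency-weighted Cq: log10(amplification factor) * Cq."""
    amplification = 1.0 + efficiency_percent / 100.0  # 100% -> 2.0 (assumption)
    return math.log10(amplification) * cq


def relative_expression(cq_target, eff_target, cq_ref, eff_ref):
    """Expression of the target relative to the reference gene."""
    delta = weighted_cq(cq_target, eff_target) - weighted_cq(cq_ref, eff_ref)
    return 10 ** (-delta)


# Hypothetical Cq values; efficiencies are those reported in the text
# (109% for the LBD4-like primers, 100% for Actin).
print(relative_expression(cq_target=26.0, eff_target=109, cq_ref=22.0, eff_ref=100))
```

With equal efficiencies of 100%, the calculation reduces to the familiar 2^(-ΔCq): a target that crosses threshold one cycle earlier than the reference gives a ratio of 2.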