Exploring the genotype-to-phenotype map using quantifiable patterns in metazoan genomic and morphological data

Published Nov 03, 2025 on Dryad. https://doi.org/10.5061/dryad.7h44j1050

Data files

Nov 03, 2025 version files 7.14 MB

README.md

13.26 KB
SequencesFilesDataAndScripts.zip

7.12 MB

Abstract

A prevailing problem in evolutionary biology is elucidating the “genotype-phenotype map” that characterizes how genomic activities regulate different aspects of organismal morphology and their variability in both space and time. Here, we explore potential causality between genome content and both morphological complexity and disparity by compiling the regulatory components (i.e., transcription factors, RNA binding proteins, and microRNA families) as well as a representative set of non-regulatory “housekeeping genes” in 32 species belonging to a wide variety of animal phyla, altogether encapsulating a number of varying morphological, ecologic and genomic characteristics. A principal component analysis of these four non-overlapping genomic components from each of these 32 species in relation to their last common ancestor revealed that no relationship exists between genome space and disparity, as changes to animal body plans appear to be largely the result of changes to the gene regulatory networks that govern animal development rather than gaining or losing specific sets of regulatory genes. However, using both phylogenetically correlated as well as phylogenetically uncorrelated statistical tests, we find a strong relationship between the loss of all considered gene types in some parasitic taxa, an exacerbation of a trend that characterizes animal genomes in general. We also find a strong and likely causal relationship between microRNA innovations and organismal complexity. While this analysis of genomic features suggests how complexity and disparity are each encoded in the genome, further analysis of the regulatory networks in which they participate should provide a more comprehensive description of how organisms diversify their morphologies through time.

Peterson, K. J., A. W. Clarke, G. Zolotarov, B. Deline, T. D. Lamb, M. A. McPeek, P. Martinez, and B. Fromm.

Access this dataset on Dryad: DOI:10.5061/dryad.7h44j1050

DATA & CODE OVERVIEW

This data repository (SequencesFilesDataAndScripts.zip) consists of 78 FASTA files of gene sequences (in Sequence Directory); 15 summary data files in .csv format (in Files directory); and Matlab, R, .csv and .tre files used to perform final analyses (in DataAndScripts directory). The contents of these files are described below.

WORKFLOW

Starting with the list of human genes as described in the Materials and Methods, each gene found in one of the other 31 taxa (sequences in FASTA files of Sequence Directory) was manually imputed in a spread sheet with “0” indicating an absence or the positive integer indicating the total number of genes found in the indicated species (.csv files in Files Directory). For all invertebrate taxa minus D. melanogaster (which can be found at FlyBase, with the gene name indicated on each of the three sheets, Supp. Files 1-3), all orthologues were compiled in fasta files and are found in the “Sequences” folder. BUSCO (v5.4.3) used the Metazoa node (954 genes) (Manni et al. 2021). Results were sorted by BUSCO gene ID with corresponding status as present, present as duplicates, fragment or absent (Supp. File 4). microRNA data were taken directly from Mirgenedb.org and uploaded into a fourth spreadsheet (Supp. file 7). The percentage gains and losses for each gene for each taxon were then calculated and compiled into Supp. File 8. These are the data that were used to generate the RawData.csv file.\

Morphological data were taken directly from Deline et al. (2018) and modified to include taxa not included in their original analysis. These data are found in Supp. files 9-10 and are the data used to generate the values in Supp. file 11.

To generate the time-calibrated phylogenetic tree, the protein sequences from 19 housekeeping genes were aligned using Macvector (see Materials and Methods). The fasta file of the aligned sequences are located in Supp. file 12. These sequences were used to derive the branch lengths given an accepted topology of these taxa, which is the file Bilateria_Basal.tre. The branch lengths as measured using three different estimates, along with the N50 value of the assembled genome (taken directly from Genbank) are given in Supp. file 15.

This data assembly process produced the two primary datasets that are the basis of the analysis: the RawData.csv file of percentage gains and losses for each species (column A) in the four gene categories (column B = BUSCO gains; C = transcription factor gains; D = regulatory RNA-binding protein gains; E = miRNA gains; F = BUSCO losses; G = transcription factor losses; H = regulatory RNA-binding protein losses; I = miRNA losses), and the Bilateria_Basal.tre file of the time-calibrated phylogeny among the taxa included in the study.

SUPPLEMENTAL MATERIALS

Sequences (Sequences Directory)

Compressed zip files of all curated DNA-repair proteins (DRPs), regulatory RNA-binding proteins (RBPs) and transcription factors (TFs) from all the invertebrate taxa considered minus D. melanogaster whose sequences are available at FlyBase.

Supplemental Files (Files Directory):

SuppFile1_TF.csv. A .csv file tabulating the number of all transcription factors curated for the 32 species considered herein. Columns A indicates the gene name in human; B the chordate linkage group (CLG); columns C-H indicate the total number of paralogues found in the analyzed vertebrate species (0 = the gene is absent from the genome). Columns W and X indicate the number of paralogues found in the two fish species. Columns I and J indicate the number of genes present in the last common ancestor of eumetazoans (I) and bilaterians (J). Columns K and L indicate the number of paralogues found in the acoelomorphs Xenoturbella (K) and Hofstenia (L). Column M is the number of genes present in the last common ancestor of nephrozoans (i.e., all bilaterians minus the acoelomorphs). Columns M-AJ give the number of paralogues in each of the considered invertebrate species. Column AK gives the gene name in fruit fly taken from FlyBase. Column AL is the number of genes present in the last common ancestor of nephrozoans (i.e., all bilaterians minus the acoelomorphs). NA = not applicable (i.e., the gene evolved after this particular last common ancestor). Transcription factor families and genes that could not be included into the analysis are given in rows 487-508 indicating the name of the family or gene and the reason why the family or gene had to be excluded.

SuppFile2_RBP.csv. A .csv file tabulating the number of RNA-binding proteins curated for the 32 species considered herein. Columns A-AK are the same as in SuppFile1_TF.csv. Column AL gives a representative GO term in human and column AM gives the GO number. In rows 159-176 at the bottom of the file are the genes that could not be included and the reason why. NA = not applicable.

SuppFile3_DRP.csv. A .csv file of the DNA-repair proteins curated for the 32 species considered herein. Columns A-AK are as in SuppFile1_TF.csv. In rows 221-239 at the bottom of the file are the genes that could not be included and the reason why. NA = not applicable.

SuppFile4_BUSCO.csv. A .csv file for the BUSCO gains and losses and the genome assembly used for each of the 32 species considered herein. The numbers are derived from the automated BUSCO output.

SuppFile5_Lva.csv. A .csv file of the genomic locations for the 435 regulatory genes in the sea urchin Lytechinus variegatus. Column A is the gene names. Column B is its location on a chordate linkage group (taken from either Supp. File 1 or Supp. File 2). Column C is the chromosomal location in L. variegatus. Column D is the approximate start site of the conserved protein sequence. Genes are ranked in order from the beginning of chromosome 1 to the end of chromosome 19. Figure 1 was then constructed by graphing these chromosomal locations starting at the beginning of chromosome 1 and working up to the end of chromosome 19 against each of the chordate linkage groups that were assigned a group number (e.g., CLGF = 1).

SuppFile6_Has.csv. A .csv file of the genomic locations for the 446 regulatory genes in the abalone Haliotis asinina. Columns are as in Supp. File 5. The figure was made as Supp. File 5.

SuppFile7_miRNAs.csv. A .csv file of the miRNA families curated for the 32 species considered herein. Columns A-AJ are the same as in SuppFile1_TF.csv (column B intentionally left blank as CLG assignments are not given, but see Peterson et al. 2022). NA = not applicable.

SuppFile8_gainslosses.csv. A .csv file showing the fraction of gains (columns B-E) and losses (columns F-I) of BUSCO genes (B, F), transcription factors (C, G), RNA-binding proteins (D, H), and miRNAs (E, I). These are the data used for the PC analyses.

Supplemental file 9. A. csv file giving the codings for the 1767 morphological characters (see SuppFile10_characters.csv) for each of the 32 species considered.

SuppFile10_characters.csv. A .csv file giving the character descriptions for the 1767 morphological characters including Dimophilus as this taxon was not included in the original Deline et al. (2018) study.

SuppFile11_morph.csv. A .csv file showing the Principal Coordinate Analysis scores of the taxa included in the study. Each taxon (column B) was placed into the corresponding taxonomic group (Column C) from Deline et al. (2018). Columns D-AC give the PCA scores for the 26 highest axes. Taxa that were placed in the same broader taxonomic group were given the same anatomical characteristics and thus their PCA scores are the same.

SuppFile12_clock.fa. Fasta file (.fa) of the concatenated 19 protein sequences from each of the 32 species that was used to determine the phylogeny used for the phylogenetic independent contrasts analyses.

SuppFile13_losses.csv. A .csv file showing the amount of gene loss across the sampled 32 taxa. Column B is the loss of BUSCO genes; Column C the transcription factors; D the miRNA families, E the regulatory RNA-binding proteins; F the non-regulatory RNA binding proteins sampled within the BUSCO set; G the non-RNA binding proteins within the BUSCO set; H the mean of columns C-E and the values used for Fig. 3A.

SuppFile14_removedBUSCOgenes.csv. A .csv file listing the 954 BUSCO genes. Column A give the BUSCO identifier; B the gene name; C whether that gene encodes either a transcription factor (TF), regulatory RNA-binding protein (RBP), or a DNA-repair protein (DRP) and were removed from the BUSCO file, as well as the non-regulatory RNA-binding proteins (RBP).

SuppFile15_branchlengths_N50.csv. A .csv file giving the values used for the N50 and branch length comparisons. Column A lists the genus name of the 32 species under investigation; B the loss of BUSCO genes minus the DNA-repair proteins; C the N50 value; D the loss of DNA-repair proteins (DRP); E the uncorrected branch length (MacClade); F, a Poisson-corrected branch length (MacClade); G the branch length as reconstructed by PAUP using maximum likelihood (see materials and methods).

Data and Scripts (DataAndScripts Directory):

The basic data used in the analyses are the gains and losses in four classes of genes (i.e., BUSCO, transcription factors, RNA binding proteins, and microRNAs). The values of these eight variables for each taxa are in the file RawData.csv. The nexus file Bilateria_Basal.tre contains the time-calibrated phylogeny for taxa included in this study. These files have the basic data used in the main analyses in this paper.

The Matlab script fig3DBilateria.m performs the principal components analysis of this genomic data and draws some of the figures in the paper. The file PCData.csv contains the PC scores produced by this analysis (column A = species names; B-D are the values derived for the three genome PC axes; E-G are the values derived for the three morphospace PC axes; H = genome size taken from either Genbank or Ensembl; I – the number of protein-encoding genes taken directly from either Genbank or Ensembl; J = whether the species is macroscopic [0] or microscopic [1]; K = whether the species is free living [0] or parasitic [1]). These scores are used in subsequent analyses.

Evolutionary contrasts analyses were done using the phyTools, ape, and nlme packages of R. Scripts for these analyses are in this directory as well. The Bilaterian.r script reads in the data file PCData.csv and the tree file Bilateria_Basal.tre to perform the basic evolutionary contrasts analyses on the PC scores.

The CellType.r script reads the CellType.csv data file (columns A-G as in PCData.csv; column H gives the value for cell types, which is taken from Table 1 and where the citations are given) and the Cell_Type.tre tree file to perform these analyses. These two files have fewer taxa than the Bilaterian.r analysis because of a lack of cell type data on a few taxa.

The R script singleBranchTest.r does the single branch evolutionary contrasts analyses to infer rates on basal branches leading to parasite clades. This script also reads the PCData.csv and Bilateria_Basal.tre for the analyses.

The Repetitive.r script reads the Repetitive.csv data file (columns A-G as in PCData.csv; column H gives the value for the percentage of repetitive DNA, which is taken directly from Genbank), and the Repetitive.r script performs the phylogenetic analyses of the repetitive DNA variables.

The Bilaterian_Basal.tre is a Newick file of time calibrated phylogeny among the taxa used for most analyses. Cell_Type.tre – Newick file of the time calibrated phylogeny among the taxa used for the cell type analysis. This is the same tree as in Bilaterian_Basal.tre, with some of the taxa pruned because of a lack of cell type data for them. Likewise, Repetitive.tre is a Newick file of the time calibrated phylogeny among the taxa used for the repetitive DNA test.

ACCESS INFORMATION

1. Licenses/restrictions placed on the data or code

No restrictions or licenses are associated with this data and code.

2. Data derived from other sources

The morphological data used in this paper were derived from Deline, B., J. M. Greenwood, J. W. Clark, M. N. Puttick, K. J. Peterson, and P. C. J. Donoghue. 2018. Evolution of metazoan morphological disparity. Proceedings of the National Academy of Sciences, USA 29:201810575–10. The number of protein-encoding genes and the amount of repetitive DNA values are taken directly from either Genbank or from Ensembl depending on the taxon (see Materials and methods). Cell-type numbers are from the specific sources cited in Table 1.

3. Recommended citation for this data/code archive

Peterson, K. J., A. W. Clarke, G. Zolotarov, B. Deline, T. D. Lamb, M. A. McPeek, P. Martinez, and B. Fromm.^^ Data Archive for Exploring the genotype-to-phenotype map using quantifiable patterns in Metazoan genomic and morphological data. American Naturalist, in press.

Materials and Methods

Gene Complements

The transcription factors for human were taken from Messina et al. (2004) and for fruit fly D. melanogaster these sequences were taken from FlyBase (https://flybase.org/reports/FBgg0000745.htm) (Supp. File 1; all supplemental files can be found at https://doi.org/10.5061/dryad.7h44j1050). The regulatory RNA-binding proteins for human were taken from Gerstberger et al. (2014) and for fruit fly from Gamberi et al. (2006) (Supp. File 2). Regulatory RNA-binding proteins were discerned from non-regulatory RNA-binding proteins by examining the “GO: Biological Function” for each gene and compiling those with clear regulatory roles in either co-transcriptional or post-transcriptional processing in either human or fruit fly (see Supp. File 2). Octopus transcription factors were annotated by searching the proteome with hmmscan program from HMMER v 3.3.2 with –cut_ga option against the list of transcription factor Pfam models (Mistry et al. 2020). DNA repair proteins started with the curated lists of Wood et al. (2001, 2005) (Supp. File 3).

To compile the ancestral repertoire of regulatory and DNA repair proteins present in the bilaterian last common ancestor versus those that evolved afterwards in either the human, fruit fly, or (for the transcription factors) the octopus lineage, first orthologues of each gene in human were searched initially using Ensembl’s “orthologue” function in three other vertebrate species, the mouse Mus musculus, the chicken Gallas gallus and the spotted gar Lepisosteus oculatus, as well as any paralogues identified using Ensembl’s “paralogue” function (see Supp. Fig. 1 for a methods flowchart). Each gene plus all paralogues were then mapped to one of the four gnathostome-specific sub-genomes derived from one of the ~17 ancestral chordate linkage groups (Simakov et al. 2020; Lamb 2021; Nakatani et al. 2021) (Supp. File 2). This allowed us to distinguish between paralogues generated by the two whole-genome duplication events in early vertebrate history (Dehal and Boore 2005) from more ancient gene duplications potentially shared with one or more invertebrate taxa.

To search for paralogues in the invertebrate taxa, we used a combination of reciprocal blastp (using the default settings at NCBI) (Tatusov et al. 1997) and phylogenetic analysis (with sequences aligned using Muscle and the phylogeny generated with Neighbor-joining using uncorrected distances using the default settings in Macvector v. 18.2.5) against the publicly available predicted proteins from the sequenced genomes (assembly accession for each species is given in Supp. File 4). In addition, we searched using reciprocal blastp all available non-bilaterian taxa for each of our considered sequences. This allowed us to polarize potential gains within the Bilateria versus losses in one or more bilaterian sub-taxa. The predicted proteins for all but three taxa were taken directly from GenBank. The predicted proteins from the acoel H. miamia were taken from Ensembl https://metazoa.ensembl.org/Hofstenia_miamia/Info/Index) using the default blastp settings. The predicted proteins from the polyclad flatworm P. crozieri (Leite et al. 2022)** and the xenoturbellid X. bocki (Schiffer et al. 2023) were searched using BLAT. Multiple hits that were not clearly delineated as proteins derived from alternative splicing (e.g., those labeled isoform X1, X2 etc.) were searched against the assembled genome using the default settings in blastn. If a blastp search recovered two or more results, each of these results were first checked against the genome using tblastn using the default settings to make sure that each protein sequence was encoded from a separate locus. Then, we confirmed using reciprocal blastp that each of these protein sequences was equally related to the single query sequence. The number of loci were then recorded for each gene for each taxon. All sequences for all invertebrate taxa minus D. melanogaster (which are available at FlyBase) are available as supplemental materials.

To test the robustness of our orthology assignments, we used the chromosomal location of orthologues between vertebrates and an invertebrate, as it has recently been shown how the 17 chordate linkage groups relate to the 24 bilaterian linkage-groups present in this same last common ancestor, and how these 24 bilaterian linkage groups are distributed across the chromosomes of select invertebrates including the sea urchin Lytechinus variegatus and the abalone Haliotis asinina (Simakov et al. 2022). Using tblastn, we searched the genome of the L. variegatus using the protein sequences curated from the sea urchin S. purpuratus and catalogued the chromosomal location of the 435 putative orthologues whose location was specific to a single chordate linkage-group (Supp. File 5). We repeated this exercise for H. asinina using the protein sequences curated from the abalone H. rufescens and catalogued 446 putative orthologues again specific to a single chordate linkage group (Supp. File 6). Aside from the 7 orthologues present on chromosome 6 in L. variegatus and on chromosome 14 in H. asinina that corresponds to an ancestral linkage group that was lost in chordates but whose genes were dispersed throughout the chordate genome (Simakov et al. 2022), only two genes had inconsistent locations between the chordate linkage group and the bilaterian linkage group of both L. variegatus and H. asinina (Fig. 1, boxed genes) and thus represent potential mis-identified orthologues. However, the remaining genes all demonstrate a one-to-one correspondence between the chordate linkage group and the bilaterian linkage group in at least one of the two analyzed non-chordate species, and thus a near complete correspondence between our results using reciprocal blast versus synteny, this congruence strongly supports the methodology underlying our orthology assignments.

The gains and losses of 765 miRNA families found in at least one of the 32 queried species were taken directly from MirGeneDB v3.0 (Clark et al. 2024) and are summarized in Supplemental File 7. Finally, the curated 954 BUSCO gene data set (Simão et al. 2015; Manni et al. 2021) was assembled for each of the 32 sampled species (BUSCO v5.4.3, Metazoa node, Supp. File 3). Twelve of these BUSCO genes overlapped either the curated set of transcription factors or the curated set of regulatory RNA-binding proteins and were removed from the BUSCO file generating a set of 945 non-regulatory and non-redundant housekeeping genes. An additional 30 genes that encode DNA repair proteins were also removed for the analyses shown in Fig. 2D so that no gene was in both the BUSCO set as well as in any one of the other three curated gene sets.

Genome space

The fraction of gain and loss of each of the four gene sets for each of the 32 terminal taxa were calculated with respect to the bilaterian last common ancestor. These values are given in Supplemental File 8 and were used for the principal components analysis. An alternative ancestor was also explored, one where the xenacoelomorphs (Xenoturbella and the acoel H. miamia) are nested within the deuterostomes and allied with the echinoderms and hemichordates. An alternative principal components analysis found near-identical results (Supp. Fig. 2) and thus we only show the results with xenacoelomorphs reconstructed as basal bilaterians (see the text for further discussion). Principal components analyses were performed on the covariance matrix for the eight genomic variables using the ‘pcacov’ function in Matlab v.2024a (Mathworks Inc.) (all scripts are available as supplemental materials). Principal components scores for all taxa were calculated from the first three eigenvectors. Variable loadings and percent of total variance explained are given in Table 2. Principal component scores for the first three components were calculated for each taxon and used in subsequent analyses. Pearson product-moment correlations and linear regressions of principal components and original variables were calculated using SAS v.9.4 (SAS Institute).

Morphospace

The original Deline et al. (2018) metazoan morphological dataset consists of 212 extant terminal taxa from 34 animal phyla coded with 1,767 discrete characters. The operational taxonomic units varied in rank from family to phylum, but most represented Linnean orders. Deline et al. (2018) mapped character contingencies to analytically differentiate between missing, absent, and non-applicable characteristics. We culled the full morphological dataset to include only the operational taxonomic units corresponding to the 32 genetically characterized genera in the current study. In several instances, multiple taxa fell within the same morphological operational taxonomic unit (e.g. Homo and Mus within placental mammals), in those cases both taxa were coded identically in terms of morphology. Even with the coarse nature of the morphological datasets, the morphology displayed at the generic level was consistent with that of the higher taxonomic group. The exceptions are D. gyrociliatus, which is morphologically distinctive from its associated operational taxonomic unit (Scolecida) and whose coding was based on (Martín-Durán et al., 2021), and the spider mite T. urticae, whose coding was altered from the tick I. scapularis based on the characters provided in OConnor (OConnor 2009). In total, we added five characters to the original dataset to both capture autapomorphic characters within added species (i.e. D. gyrociliatus) or to differentiate distinctive taxa within broad operational taxonomic units (e.g. ticks and mites) (see Supp. Files 9 and 10).

The resulting dataset of 32 taxa and 1,769 characters was analyzed following the methods of Deline et al. (2018) (Supp. File 11). The current analysis only differed in methodology from the previous study in the use of a principal-coordinate eigenvector-based ordination rather than non-metric multidimensional scaling, to use consistent methods to those utilized with the genomic data. The resulting morphospace is consistent with that of the previous study and the previous study found that morphospaces generated with these different ordination methods were highly correlated (Deline et al. 2018). The morphological analyses were conducted in R (4.3.2) using the cluster (Maechler 2018) and ape (Paradis and Schliep 2018) packages.

Estimation of Molecular Branch Lengths and Phylogenetic Independent Contrasts

The estimation of branch lengths was derived from a data set consisting of 19 concatenated protein sequences and assembled for each of the 32 ingroup taxa as well as the outgroups Nematostella vectensis, Corticium candelabrum and Amphimedon queenslandica (Supp. File 12). These proteins were chosen because of their ability to capture the correct phylogeny of a small subset of canonical model system taxa (yeast (human (fly, C.elegans))) or to not strongly support an alternative arrangement (Mushegian et al. 1998; Wang et al. 1999). Further, subsequent work has shown that at least some of these proteins appear to evolve in a clock-like manner (Peterson et al. 2004, 2008). Branch lengths were calculated for each single in-group taxon as well as the three outgroup taxa (see Supp. Fig. 3B) using maximum likelihood (PAUP 4.0a) using the JTT matrix with 4 rate categories and a Gamma distribution with shape parameter = 0.5. The significance of the relationship between DNA-repair proteins and branch lengths is independent of the model of sequence evolution.

To generate phylogenetically independent contrasts analyses, we used Maclade (v. 4.08a) to generate a topology that conforms to the currently accepted phylogeny of these 32 select species (Laumer et al. 2019) (Supp. File 13). Using this same alignment and topology, we used MEGA7 (Kumar et al. 2016) to estimate divergence times incorporating well-constrained calibration points (Erwin et al. 2011). Phylogenetic independent contrasts analyses and phylogenetic regressions were performed using the ape, phytools, and nlme packages of R (Revell and Harmon 2022). The use of alternative topologies (e.g., Xenacoelomorpha as ambulacrarian deuterostomes) had no significant effect on the resulting analysis.

Exploring the genotype-to-phenotype map using quantifiable patterns in metazoan genomic and morphological data

Data files

Abstract

README: Exploring the genotype-to-phenotype map using quantifiable patterns in metazoan genomic and morphological data

Methods