Parallel shifts in differential gene expression reveal convergent miniaturization in fishes
Data files
Oct 10, 2025 version files: 138.94 MB
- Gobies_concatenated_alignment.fasta (68.56 MB)
- Gobies_final_tree.nex (11.74 KB)
- R_Data_and_Code.zip (70.26 MB)
- README.md (16.94 KB)
- Supplementary_Appendix_1_Body_Size_Phylogenomic_Tissue_Data.xlsx (24.70 KB)
- Supplementary_Appendix_2_RNA-seq_Sample_stats.xlsx (13.12 KB)
- Supplementary_Appendix_3_DEG_summary.xlsx (51.48 KB)
Abstract
Body size variation in vertebrates is a complex polygenic trait, tightly correlated with numerous aspects of a species' biology, ecology, and physiology. Miniaturization, the extreme reduction of adult body size, is a common phenomenon across the Tree of Life, yet the mechanisms underlying this process are poorly understood. Here, we investigate the molecular basis of body size evolution in goby fishes, a clade encompassing some of the smallest vertebrates on Earth. We generate a new genome-wide phylogeny for 162 Gobioidei species and perform comparative transcriptomics across three clades with repeated instances of miniaturization and large-bodied forms. We identify 54 differentially expressed one-to-one orthologs between miniature and large-bodied species. These genes reveal distinct functional profiles, suggesting that regulation of cell numbers is a key mechanism governing body size control. Miniature species consistently overexpress growth inhibitors like CDKN1B and ING2, associated with tighter cell cycle regulation and decreased proliferation rates, while large-bodied species upregulate growth-promoting genes such as TGFB3, linked to tissue development and growth signaling. These enriched functional pathways, conserved since the Eocene (50 Ma), suggest macroevolutionary convergence in size regulation over deep time. Our findings provide insights into how size determination is governed at a genetic level and highlight the importance of exploring these factors in non-model organisms to uncover the fundamental processes regulating vertebrate body size evolution.
1. Project Description:
RNA-seq study examining genomic convergence on miniaturization in six species of goby fishes. This dataset contains the concatenated alignment, the phylogenetic tree, Supplementary Appendices 1, 2, and 3, and a zipped folder of all R code and output needed to reproduce the R analyses.
2. Contents:
File 1: Gobies_concatenated_alignment.fasta
Description: Concatenated alignment file for all species
File 2: Gobies_final_tree.nex
Description: RAxML-NG phylogenetic tree file for all goby species within this study
Appendix 1: Supplementary_Appendix_1_Body_Size_Phylogenomic_Tissue_Data.xlsx
Description: Body size data for all species in this study, together with accession numbers for the tissue samples used for phylogenomic estimation. Body lengths were obtained from FishBase or other published sources. Total length (TL) is the most commonly reported measurement; for species lacking TL data, we estimated TL (bold values) from standard length (SL) using the formula TL = 1.257 × SL − 0.0818 cm, which was derived from measurements of the individual gobies used in the RNA-seq analyses.
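The SL-to-TL conversion described above can be sketched in a few lines. This is an illustration in Python rather than the authors' R code; the function name and the example SL value are hypothetical, and only the two coefficients come from this README.

```python
def estimate_total_length_cm(standard_length_cm: float) -> float:
    """Estimate total length (TL, cm) from standard length (SL, cm).

    Applies the regression reported in this README:
    TL = 1.257 * SL - 0.0818 cm.
    """
    if standard_length_cm <= 0:
        raise ValueError("standard length must be positive")
    return 1.257 * standard_length_cm - 0.0818


# Hypothetical SL value, chosen only to illustrate the arithmetic.
example_sl_cm = 2.0
example_tl_cm = estimate_total_length_cm(example_sl_cm)
print(f"SL = {example_sl_cm} cm -> estimated TL = {example_tl_cm:.4f} cm")
```

Because the regression was fitted to the individuals sampled for RNA-seq, estimates for species far outside that size range should be treated with caution.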
Appendix 2: Supplementary_Appendix_2_RNA-seq_Sample_stats.xlsx
Description: Sample statistics for 35 tissue samples of six goby species from RNA-seq analysis. Sample ID, species, tissue replicate and type, sample batch number, RNA Integrity Number (RIN) score, percentage of reads mapping back to reference genome (Boleophthalmus pectinirostris), number of raw read pairs (before trimming of adaptors and low-quality reads), number of trimmed read pairs, percentage of read pairs removed after trimming, and the percentage of reads mapping back to the de novo assembly in the Salmon analyses are provided.
Appendix 3: Supplementary_Appendix_3_DEG_summary.xlsx
Description: Summary of 54 differentially expressed one-to-one orthologs and EggNOG-mapper output.
Column variables are:
- Ortholog ID – Unique identifier for the orthologous gene group.
- Base mean – Average normalized expression level across all samples.
- Log2 fold change – The log2 ratio of expression between conditions; positive = upregulated, negative = downregulated.
- LFC standard error – The estimated uncertainty (standard error) of the log2 fold change (LFC).
- Wald test statistic – The test statistic used to evaluate whether the LFC differs significantly from zero.
- P-value – Probability of observing a Wald statistic at least as extreme as the one obtained if the true LFC were zero.
- Adjusted p-value – P-value corrected for multiple testing (e.g., Benjamini–Hochberg False Discovery Rate).
- EggNOG query ID – ID of the input sequence as submitted to EggNOG.
- Seed ortholog – Closest matching ortholog used to infer annotation.
- E-value – Statistical significance of the ortholog match (lower = better).
- Score – Alignment score of the ortholog match (higher = better).
- EggNOG OGs – Orthologous group(s) the query belongs to in the EggNOG database.
- Max annot lvl – Highest hierarchical level of annotation available (e.g., species, phylum, kingdom).
- COG category – Functional category from the Clusters of Orthologous Genes system (e.g., metabolism, information processing). See here for the interpretation key: https://www.ncbi.nlm.nih.gov/research/cog#
- Description – Functional description of the ortholog.
- Preferred name – Common or gene symbol name of the ortholog.
- GOs – Gene Ontology ID numbers.
- EC – Enzyme Commission number, provides specific enzyme codes for the functions of identified genes or proteins.
- KEGG ko – KEGG orthology identifier linking to KEGG functional pathways.
- KEGG pathway – KEGG pathway(s) associated with the gene.
- KEGG module / KEGG Reaction / KEGG rclass – Additional KEGG entries describing biochemical modules, reactions, and reaction classes.
- BRITE – Hierarchical classification in KEGG BRITE ontology.
- KEGG TC – Transporter classification number (if relevant).
- CAZy – Carbohydrate-active enzyme classification (if relevant).
- PFAMs – Protein family domains identified within the sequence.
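The DESeq2-derived columns above are arithmetically related, which the following sketch makes explicit. It assumes DESeq2's default Wald test, in which the statistic is the LFC divided by its standard error and the p-value is the two-sided tail of a standard normal distribution. The function names and the numeric inputs are hypothetical and are not values from the dataset.

```python
import math


def wald_statistic(log2_fold_change: float, lfc_se: float) -> float:
    """Wald statistic: the LFC estimate divided by its standard error."""
    return log2_fold_change / lfc_se


def two_sided_p_value(stat: float) -> float:
    """Two-sided p-value under a standard normal null.

    P(|Z| >= |stat|) equals erfc(|stat| / sqrt(2)).
    """
    return math.erfc(abs(stat) / math.sqrt(2))


def fold_change(log2_fold_change: float) -> float:
    """Convert a log2 fold change back to a linear fold change."""
    return 2.0 ** log2_fold_change


# Hypothetical values for one ortholog, for illustration only.
example_lfc = 2.0
example_se = 0.5
example_stat = wald_statistic(example_lfc, example_se)
example_p = two_sided_p_value(example_stat)
print(f"stat = {example_stat}, p = {example_p:.3g}, "
      f"fold change = {fold_change(example_lfc)}")
```

A positive LFC of 2 corresponds to a fourfold higher expression in the numerator condition; which condition is the numerator depends on the contrast set in the authors' R script.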
Appendix 4: R_Data_and_Code.zip
Description: Zipped folder containing all files and code needed to reproduce results from all R analyses
File 1 Name: abund_ALL_species_update.isoform.counts.matrix
File 1 Description: The counts matrix from the Trinity analyses
File 2 Name: All_spp_1-1_orthologs_update.tsv
File 2 Description: A list of the 1-1 orthologs for all six goby species in this analysis
File 3 Name: body_sizes.csv
File 3 Description: Body size data for the goby species in this analysis.
Column variables are:
- Original_name - Name in dataset
- Renamed_species - Renamed name
- Short_name - Binomial species name
- Family - Family
- Genus - Genus
- Species - Species
- SL_cm - Standard length in centimeters
- TL_cm - Total length in centimeters
- TL_cm_est - Estimated total length
- Reference - References for where length is taken
- Notes - Notes
File 4 Name: DESeq2_Large.csv
File 4 Description: DESeq2 results containing the orthogroup IDs and the associated adjusted p-values for the differentially expressed orthologs for the large-bodied species
Column variables are:
- Orthogroup - Unique identifier for the orthologous gene group.
- baseMean - Average normalized expression level across all samples.
- log2FoldChange - The log2 ratio of expression between conditions; positive = upregulated, negative = downregulated.
- lfcSE - The estimated uncertainty (standard error, SE) of the log2 fold change (LFC).
- stat - The Wald test statistic used to evaluate whether the LFC differs significantly from zero.
- pvalue - P-value, the probability of observing a Wald statistic at least as extreme as the one obtained if the true LFC were zero.
- padj - Adjusted p-value corrected for multiple testing (e.g., Benjamini–Hochberg False Discovery Rate).
File 5 Name: DESeq2_Small.csv
File 5 Description: DESeq2 results containing the orthogroup IDs and the associated adjusted p-values for the differentially expressed orthologs for the miniature species
Column variables are:
- Orthogroup - Unique identifier for the orthologous gene group.
- baseMean - Average normalized expression level across all samples.
- log2FoldChange - The log2 ratio of expression between conditions; positive = upregulated, negative = downregulated.
- lfcSE - The estimated uncertainty (standard error, SE) of the log2 fold change (LFC).
- stat - The Wald test statistic used to evaluate whether the LFC differs significantly from zero.
- pvalue - P-value, the probability of observing a Wald statistic at least as extreme as the one obtained if the true LFC were zero.
- padj - Adjusted p-value corrected for multiple testing (e.g., Benjamini–Hochberg False Discovery Rate).
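For readers unfamiliar with the Benjamini–Hochberg correction mentioned for the padj column, a minimal sketch of the procedure follows. It is a plain reimplementation for illustration, not the DESeq2 code path; DESeq2 additionally applies independent filtering by default, which this sketch omits. The input p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Made-up p-values, for illustration only.
example_p_values = [0.01, 0.04, 0.03, 0.005]
example_padj = benjamini_hochberg(example_p_values)
print(example_padj)
```

Each adjusted value is the smallest false discovery rate at which that test would be called significant, so adjusted values are never smaller than the raw p-values.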
File 6 Name: Eggnog_annotations_Large_Upreg.csv
File 6 Description: This table contains the EggNOG-mapper annotations for each differentially expressed (DE) orthogroup, along with associated gene names, GO terms, etc., for large-bodied species
Column variables are:
- Orthogroup - Unique identifier for the orthologous gene group.
- query - ID of the input sequence as submitted to EggNOG.
- seed_ortholog - Closest matching ortholog used to infer annotation.
- evalue - Statistical significance of the ortholog match (lower = better).
- score - Alignment score of the ortholog match (higher = better).
- eggNOG_OGs - Orthologous group(s) the query belongs to in the EggNOG database.
- max_annot_lvl - Highest hierarchical level of annotation available (e.g., species, phylum, kingdom).
- COG_category - Functional category from the Clusters of Orthologous Genes system (e.g., metabolism, information processing). See here for the interpretation key: https://www.ncbi.nlm.nih.gov/research/cog#
- Description - Functional description of the ortholog.
- Preferred_name - Common or gene symbol name of the ortholog.
- GOs - Gene Ontology ID numbers.
- EC - Enzyme Commission number, provides specific enzyme codes for the functions of identified genes or proteins.
- KEGG_ko - KEGG orthology identifier linking to KEGG functional pathways.
- KEGG_Pathway - KEGG pathway(s) associated with the gene.
- KEGG_Module / KEGG_Reaction / KEGG_rclass - Additional KEGG entries describing biochemical modules, reactions, and reaction classes.
- BRITE - Hierarchical classification in KEGG BRITE ontology.
- KEGG_TC - Transporter classification number (if relevant).
- CAZy - Carbohydrate-active enzyme classification (if relevant).
- BiGG_Reaction - Lists identifiers of metabolic reactions associated with gene (if relevant)
- PFAMs - Protein family domains identified within the sequence.
File 7 Name: Eggnog_annotations_Small_Upreg.csv
File 7 Description: This table contains the EggNOG-mapper annotations for each differentially expressed (DE) orthogroup, along with associated gene names, GO terms, etc., for miniature species
Column variables are:
- Orthogroup - Unique identifier for the orthologous gene group.
- query - ID of the input sequence as submitted to EggNOG.
- seed_ortholog - Closest matching ortholog used to infer annotation.
- evalue - Statistical significance of the ortholog match (lower = better).
- score - Alignment score of the ortholog match (higher = better).
- eggNOG_OGs - Orthologous group(s) the query belongs to in the EggNOG database.
- max_annot_lvl - Highest hierarchical level of annotation available (e.g., species, phylum, kingdom).
- COG_category - Functional category from the Clusters of Orthologous Genes system (e.g., metabolism, information processing). See here for the interpretation key: https://www.ncbi.nlm.nih.gov/research/cog#
- Description - Functional description of the ortholog.
- Preferred_name - Common or gene symbol name of the ortholog.
- GOs - Gene Ontology ID numbers.
- EC - Enzyme Commission number, provides specific enzyme codes for the functions of identified genes or proteins.
- KEGG_ko - KEGG orthology identifier linking to KEGG functional pathways.
- KEGG_Pathway - KEGG pathway(s) associated with the gene.
- KEGG_Module / KEGG_Reaction / KEGG_rclass - Additional KEGG entries describing biochemical modules, reactions, and reaction classes.
- BRITE - Hierarchical classification in KEGG BRITE ontology.
- KEGG_TC - Transporter classification number (if relevant).
- CAZy - Carbohydrate-active enzyme classification (if relevant).
- BiGG_Reaction - Lists identifiers of metabolic reactions associated with gene (if relevant)
- PFAMs - Protein family domains identified within the sequence.
File 8 Name: gene_mapping.txt
File 8 Description: Table to convert orthogroup IDs to gene names.
File 9 Name: Gene_summary_table_Large_v_Small.csv
File 9 Description: Key mapping GO term IDs to gene names. This is a subset of the EggNOG-mapper results table.
File 10 Name: Gobies_all_R_results.RData
File 10 Description: An RData file containing the saved output of the scripts in the Gobies_R_code.R file. Results include body size and ancestral state reconstruction analyses, differential gene expression analyses using DESeq2, Gene Ontology enrichment analysis, and all output plots (e.g., PCAs, volcano plots, phylogenetic trees, boxplots).
File 11 Name: Gobies_R_code.R
File 11 Description: The R scripts needed to run all analyses in this paper.
File 12 Name: Goby_Samples_Body_Lengths.csv
File 12 Description: Total length and standard length of the goby individuals measured for RNA-seq. This is used to plot a regression between goby sample total length and standard length
File 13 Name: merged_eggnog_deseq_results_large.csv
File 13 Description: The EggNOG-mapper annotations merged with the DESeq2 results for large-bodied species
Column variables include:
- Orthogroup - Unique identifier for the orthologous gene group.
- baseMean - Average normalized expression level across all samples.
- log2FoldChange - The log2 ratio of expression between conditions; positive = upregulated, negative = downregulated.
- lfcSE - The estimated uncertainty (standard error, SE) of the log2 fold change (LFC).
- stat - The Wald test statistic used to evaluate whether the LFC differs significantly from zero.
- pvalue - P-value, the probability of observing a Wald statistic at least as extreme as the one obtained if the true LFC were zero.
- padj - Adjusted p-value corrected for multiple testing (e.g., Benjamini–Hochberg False Discovery Rate).
- query - ID of the input sequence as submitted to EggNOG.
- seed_ortholog - Closest matching ortholog used to infer annotation.
- evalue - Statistical significance of the ortholog match (lower = better).
- score - Alignment score of the ortholog match (higher = better).
- eggNOG_OGs - Orthologous group(s) the query belongs to in the EggNOG database.
- max_annot_lvl - Highest hierarchical level of annotation available (e.g., species, phylum, kingdom).
- COG_category - Functional category from the Clusters of Orthologous Genes system (e.g., metabolism, information processing). See here for the interpretation key: https://www.ncbi.nlm.nih.gov/research/cog#
- Description - Functional description of the ortholog.
- Preferred_name - Common or gene symbol name of the ortholog.
- GOs - Gene Ontology ID numbers.
- EC - Enzyme Commission number, provides specific enzyme codes for the functions of identified genes or proteins.
- KEGG_ko - KEGG orthology identifier linking to KEGG functional pathways.
- KEGG_Pathway - KEGG pathway(s) associated with the gene.
- KEGG_Module / KEGG_Reaction / KEGG_rclass - Additional KEGG entries describing biochemical modules, reactions, and reaction classes.
- BRITE - Hierarchical classification in KEGG BRITE ontology.
- KEGG_TC - Transporter classification number (if relevant).
- CAZy - Carbohydrate-active enzyme classification (if relevant).
- BiGG_Reaction - Lists identifiers of metabolic reactions associated with gene (if relevant)
- PFAMs - Protein family domains identified within the sequence.
File 14 Name: merged_eggnog_deseq_results_small.csv
File 14 Description: The EggNOG-mapper annotations merged with the DESeq2 results for miniature species
Column variables include:
- Orthogroup - Unique identifier for the orthologous gene group.
- baseMean - Average normalized expression level across all samples.
- log2FoldChange - The log2 ratio of expression between conditions; positive = upregulated, negative = downregulated.
- lfcSE - The estimated uncertainty (standard error, SE) of the log2 fold change (LFC).
- stat - The Wald test statistic used to evaluate whether the LFC differs significantly from zero.
- pvalue - P-value, the probability of observing a Wald statistic at least as extreme as the one obtained if the true LFC were zero.
- padj - Adjusted p-value corrected for multiple testing (e.g., Benjamini–Hochberg False Discovery Rate).
- query - ID of the input sequence as submitted to EggNOG.
- seed_ortholog - Closest matching ortholog used to infer annotation.
- evalue - Statistical significance of the ortholog match (lower = better).
- score - Alignment score of the ortholog match (higher = better).
- eggNOG_OGs - Orthologous group(s) the query belongs to in the EggNOG database.
- max_annot_lvl - Highest hierarchical level of annotation available (e.g., species, phylum, kingdom).
- COG_category - Functional category from the Clusters of Orthologous Genes system (e.g., metabolism, information processing). See here for the interpretation key: https://www.ncbi.nlm.nih.gov/research/cog#
- Description - Functional description of the ortholog.
- Preferred_name - Common or gene symbol name of the ortholog.
- GOs - Gene Ontology ID numbers.
- EC - Enzyme Commission number, provides specific enzyme codes for the functions of identified genes or proteins.
- KEGG_ko - KEGG orthology identifier linking to KEGG functional pathways.
- KEGG_Pathway - KEGG pathway(s) associated with the gene.
- KEGG_Module / KEGG_Reaction / KEGG_rclass - Additional KEGG entries describing biochemical modules, reactions, and reaction classes.
- BRITE - Hierarchical classification in KEGG BRITE ontology.
- KEGG_TC - Transporter classification number (if relevant).
- CAZy - Carbohydrate-active enzyme classification (if relevant).
- BiGG_Reaction - Lists identifiers of metabolic reactions associated with gene (if relevant)
- PFAMs - Protein family domains identified within the sequence.
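The merged tables share the Orthogroup column with the separate DESeq2 and EggNOG-mapper files, so an equivalent table could be rebuilt by joining on that key. The sketch below shows one way to do this with the Python standard library. It is not the authors' R code, it uses an inner join as an assumption about how unannotated orthogroups were handled, and every identifier and value in the two inline tables is invented for illustration.

```python
import csv
import io

# Invented stand-ins for a DESeq2 table and an EggNOG-mapper table.
DESEQ_CSV = """Orthogroup,log2FoldChange,padj
OG_hypothetical_1,2.0,0.001
OG_hypothetical_2,-1.5,0.02
"""
EGGNOG_CSV = """Orthogroup,Preferred_name,Description
OG_hypothetical_1,GENE_A,hypothetical description A
"""


def read_rows(csv_text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def merge_on_orthogroup(deseq_rows, eggnog_rows):
    """Inner-join the two tables on the shared Orthogroup column."""
    annotations = {row["Orthogroup"]: row for row in eggnog_rows}
    merged = []
    for row in deseq_rows:
        annotation = annotations.get(row["Orthogroup"])
        if annotation is None:
            continue  # Assumption: drop orthogroups without an annotation.
        merged.append({**row, **annotation})
    return merged


merged_rows = merge_on_orthogroup(read_rows(DESEQ_CSV), read_rows(EGGNOG_CSV))
for row in merged_rows:
    print(row["Orthogroup"], row["Preferred_name"], row["log2FoldChange"])
```

Keeping unannotated orthogroups instead would only require appending the DESeq2 row unchanged when no annotation is found.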
File 15 Name: metadata_all_spp.csv
File 15 Description: Metadata for six goby species (e.g., sample ID, size, clade, and tissue replicate).
File 16 Name: out_gobies.raxml.support_renamed.nex
File 16 Description: RAxML tree file
3. Usage:
- Open "Gobies_R_code.R" in R. The script is well annotated; follow its steps in order if you want to run the analyses yourself.
- Load "Gobies_all_R_results.RData" into R. This restores the saved results of all analyses if you don't want to run the analyses yourself.
Missing data code: NA
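When reading the CSV files outside R, the NA code has to be handled explicitly before numeric comparisons. The following sketch filters a DESeq2-style table by adjusted p-value while skipping NA entries. The 0.05 threshold, the function names, and the inline rows are assumptions for illustration and are not taken from the dataset.

```python
import csv
import io

# Invented rows in the layout of the DESeq2 result files.
EXAMPLE_CSV = """Orthogroup,log2FoldChange,padj
OG_hypothetical_1,2.0,0.001
OG_hypothetical_2,-1.5,NA
OG_hypothetical_3,0.3,0.6
"""


def parse_number(text):
    """Convert a field to float, mapping the NA code to None."""
    text = text.strip()
    if text == "NA" or text == "":
        return None
    return float(text)


def significant_orthogroups(csv_text, alpha=0.05):
    """Return Orthogroup IDs whose padj is present and below alpha."""
    hits = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        padj = parse_number(row["padj"])
        if padj is not None and padj < alpha:
            hits.append(row["Orthogroup"])
    return hits


example_hits = significant_orthogroups(EXAMPLE_CSV)
print(example_hits)
```

Rows with a missing padj are excluded rather than treated as zero, which avoids counting untested orthogroups as significant.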
4. Code/Software:
Analyses were run in R version 4.2.3.
