Data from: Pleiotropy increases with gene age in six model multicellular eukaryotes
Data files
Aug 16, 2025 version files 2.12 GB
- Code.zip (46.01 KB)
- common_data.zip (1.01 GB)
- README.md (11.43 KB)
- species_data.zip (1.11 GB)
Abstract
Fundamental traits of genes, including function, length, and GC content, all vary with gene age. Pleiotropy, where a single gene affects multiple traits, arises through selection for novel traits and is expected to be removed from the genome through subfunctionalization following duplication events. It is unclear, however, how these opposing forces shape the prevalence of pleiotropy through time. We hypothesized that the prevalence of pleiotropy would be lowest in young genes, peak in middle-aged genes, and then either decrease to a middling level in ancient genes or stay near the middle-aged peak, depending on the balance between exaptation and subfunctionalization. To address this question, we have calculated gene age and pleiotropic status for several model multicellular eukaryotes, including Homo sapiens, Mus musculus, Danio rerio, Drosophila melanogaster, Caenorhabditis elegans, and Arabidopsis thaliana. Gene age was determined by finding the most distantly related species that shared an ortholog using the Open Tree of Life and the Orthologous Matrix Database (OMAdb). Pleiotropic status was determined using both protein-protein interactions (STRINGdb) and associated biological processes (Gene Ontology). We found that middle-aged and ancient genes tend to be more pleiotropic than young genes, and that this relationship holds across all species evaluated and across both modalities of measuring pleiotropy. We also found absolute differences in the degree of pleiotropy based on gene functional class, but only when looking at the biological process count. From these results, we propose that there is a fundamental relationship between pleiotropy and gene age, and further study of this relationship may shed light on the mechanism behind the functional changes genes undergo as they age.
Dataset DOI: 10.5061/dryad.m63xsj4fh
Description of the data and file structure
Data
This section describes the data available in the zipped data files in this Dryad entry.
Each file in this folder contains data necessary for recreating the analysis in the published Gene Age Project paper (i.e., the manuscript associated with this entry). Each species folder has species-relevant data and a Python dataframe containing processed information saved as a .pkl file.
species_data.zip: The zipped "species_data" file contains six folders, one for each species in the paper, and each folder contains data that relates strictly to that species. This includes species-relevant information from STRING, OMAdb, Gene Ontology, Ensembl, and the Open Tree of Life.
Each species has a folder in the species_data.zip folder that follows a similar structure, so here I have defined the various files for a general species.
xxxx.protein.links.v12.0.txt: protein-protein interaction data from the STRING database. Organized in columns by Interactor 1, Interactor 2, and the confidence score of that interaction.
XXXXX_Containing_Groups.txt: the OMA groups for each protein found in OMAdb for the species of interest. Each line is tab-separated: OMA group number, OMA group fingerprint, then the list of OMA entry IDs.
Unique_uniprot_ids.txt: the 5-character uniprot species IDs for all species that have a protein that shows up in an omagroup with the species of interest.
Unique_species_names.txt: full scientific species name for each species in the Unique_uniprot_ids.txt file. Used to trim the Open Tree of Life.
Unq_Species_Tree.tre: newick format tree trimmed from the tree of life to only include the species found in Unique_species_names.txt
main.csv: metadata for the taxa found in the trimmed tree.
XXX_Confidence_string_Interactions.pkl: pkl files containing all string interactions of specified confidence (see code for what thresholds constitute med vs high confidence)
UniProtIDs.txt: UniProt IDs for each protein for the species of interest
idmapping.tsv: map uniprot IDs to gene symbol
BIOGRID-ORGANISM-XXXXXX-4.4.227.tab: BioGRID gene-gene interaction data for each organism. Gene-gene interactions were not used in the main text because of poor coverage in species other than H. sapiens, but the BioGRID data has been left here in case future investigators want to review it.
XXXX.protein.info.v11.5.txt: information from the string database about each protein, including string ID, preferred name, size, and functional annotation. Used for converting from string IDs to more general IDs.
oma-refseq.csv: map from oma IDs to refseq IDs from the oma db.
HS_data.tsv: refseq data for each protein, unused in analysis.
BP_GLM....: results from generalized linear models for Biological processes vs age.
PPI_GLM....: results from generalized linear models for protein-protein interactions vs age.
Bins_to_True_age.pkl: map from bins to the actual evolutionary age of a bin, generated manually using the time tree of life https://timetree.org/. Not used in analysis
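As a rough illustration of how the XXXXX_Containing_Groups.txt layout described above could be read, the sketch below parses tab-separated lines into a group number, a fingerprint, and a list of entry IDs, and then collects the 5-character species codes. It assumes that OMA entry IDs begin with the 5-character species code and that lines starting with "#" are comments. The names `OmaGroup`, `parse_oma_groups`, and `unique_species_codes`, as well as the sample lines, are hypothetical and are not taken from the project code.

```python
from typing import Iterable, NamedTuple


class OmaGroup(NamedTuple):
    group_id: int          # numeric OMA group number (not stable across releases)
    fingerprint: str       # OMA group fingerprint (stable across releases)
    entry_ids: list[str]   # OMA entry IDs of the group members


def parse_oma_groups(lines: Iterable[str]) -> list[OmaGroup]:
    """Parse lines of the form: group number <TAB> fingerprint <TAB> entry IDs..."""
    groups = []
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: blank lines and '#' comment lines carry no group data.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip lines that have no entry IDs
        groups.append(OmaGroup(int(fields[0]), fields[1], fields[2:]))
    return groups


def unique_species_codes(groups: Iterable[OmaGroup]) -> list[str]:
    """Return the sorted set of 5-character prefixes of all entry IDs."""
    # Assumption: the first five characters of an entry ID are the species code.
    return sorted({entry[:5] for group in groups for entry in group.entry_ids})


# Hypothetical example lines, not real OMA data.
sample = [
    "# group number, fingerprint, entry IDs",
    "1\tABCDEFG\tHUMAN00001\tMOUSE00002\tDANRE00003",
    "2\tHIJKLMN\tHUMAN00004\tDROME00005",
]
groups = parse_oma_groups(sample)
codes = unique_species_codes(groups)
print(codes)
```

A list like `codes` corresponds in spirit to the contents of Unique_uniprot_ids.txt; the scripts in Code.zip remain the reference for how that file was actually produced.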
For each species, the Gene_Age_Dataframe.pkl file, found in the relevant species folder, contains a dataframe with the columns listed below; each row corresponds to a specific protein in the species of interest. These dataframes were populated using the following scripts, which must be run once per species in numerical order: 1_Oma_Group_Trimmer.py, 2_Collect_Unique_Species.py, 3_Generate_Tree.py, 3a_Process_StringDB_data.py, 4_Generate_DataFrame.py, 5_Get_Gene_Info.py, 6_Broad_GO_Terms.py, 7_Common_Broad_categories.py, 9_GOGeneric_to_GOSlim.py.
- 'Group_ID': numerical identifier for the oma group the protein of interest belongs to. This is not stable across releases.
- 'FingerPrints': string identifier for the oma group the protein of interest belongs to. This is stable across releases.
- 'Entry_IDs': species-specific protein names for all proteins in the oma group.
- 'Species_Name': species name for each species-specific protein.
- 'Root_Distances': distance from the most recent common ancestor of the species of interest and each species found in the oma group of the given protein to the root of the tree.
- 'Min_Dist': minimum MRCA to root distance from the Root_distances column.
- 'Gene': oma DB name for the protein
- 'Assc_BioProcesses': Biological processes associated with the protein from the Gene Ontology
- 'String_Name': name for the protein used in the string database
- 'String_IDs': species + ensembl ID for each protein in the format: 9606.ENSP00000400646
- 'LowConf_PPI': count of protein-protein interactions with confidence above the low-confidence threshold of 0.2 and below 0.4
- 'MedConf_PPI': count of protein-protein interactions with confidence between 0.4 and 0.7
- 'HighConf_PPI': count of protein-protein interactions with confidence above 0.7
- 'GGI': count of gene-gene interactions; only found in the human dataframe. This column is deprecated
- 'Paralogs': Count of paralogs from Ensembl BioMart
- 'Gene_Symbol': When available, provides the gene symbol for the associated protein
- 'Full_Name': when available, provides the full name of the protein
- 'article_count': only in the human dataframe, provides the number of articles found in a PubMed search for the protein of interest.
- 'AgeBin': the bin the protein belongs to when grouping proteins by age, so that each bin has at least 1000 members.
- 'Broad_BioProcesses': the top-level biological processes each protein is associated with (e.g., metabolic process or immune system process)
- 'growth': number of times growth appears in the protein's 'Broad_BioProcesses' list
- 'homeostatic process': number of times homeostatic process appears in the protein's 'Broad_BioProcesses' list
- 'reproduction': number of times reproduction appears in the protein's 'Broad_BioProcesses' list
- 'viral process': number of times viral process appears in the protein's 'Broad_BioProcesses' list
- 'biological process involved in interspecies interaction between organisms': number of times biological process involved in interspecies interaction between organisms appears in the protein's 'Broad_BioProcesses' list
- 'pigmentation': number of times pigmentation appears in the protein's 'Broad_BioProcesses' list
- 'reproductive process': number of times reproductive process appears in the protein's 'Broad_BioProcesses' list
- 'multicellular organismal process': number of times multicellular organismal process appears in the protein's 'Broad_BioProcesses' list
- 'developmental process': number of times developmental process appears in the protein's 'Broad_BioProcesses' list
- 'metabolic process': number of times metabolic process appears in the protein's 'Broad_BioProcesses' list
- 'rhythmic process': number of times rhythmic process appears in the protein's 'Broad_BioProcesses' list
- 'biological regulation': number of times biological regulation appears in the protein's 'Broad_BioProcesses' list
- 'immune system process': number of times immune system process appears in the protein's 'Broad_BioProcesses' list
- 'localization': number of times localization appears in the protein's 'Broad_BioProcesses' list
- 'cellular process': number of times cellular process appears in the protein's 'Broad_BioProcesses' list
- 'response to stimulus': number of times response to stimulus appears in the protein's 'Broad_BioProcesses' list
- 'Assc_BioProcesses_Exp': the biological processes that were explicitly confirmed using experimental evidence in the Gene Ontology
- 'AgeBin_500': the bin the protein belongs to when grouping proteins by age, so that each bin has at least 500 members.
- 'UHC_PPI': count of protein-protein interactions with confidence above 0.95
- 'Slim_BioProcesses': Biological processes found in the GO slim ontology rather than the full ontology.
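To make the confidence-band columns above concrete, here is a minimal sketch of how per-protein interaction counts could be tallied from lines of a protein.links file. It rests on several assumptions that should be checked against the scripts in Code.zip: that the file is whitespace-separated with a header line, that the score column is an integer on a 0 to 1000 scale, that band edges are inclusive at the lower bound, and that each interaction is listed once per direction so counting the first interactor is enough. The function names and the partner IDs in the sample are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Iterable


def confidence_bands(score: float) -> list[str]:
    """Map a 0-1 confidence score to the dataframe column(s) it would count toward."""
    # Band edges come from the column descriptions; inclusivity is an assumption.
    bands = []
    if 0.2 <= score < 0.4:
        bands.append("LowConf_PPI")
    elif 0.4 <= score < 0.7:
        bands.append("MedConf_PPI")
    elif score >= 0.7:
        bands.append("HighConf_PPI")
        if score >= 0.95:
            bands.append("UHC_PPI")  # ultra-high confidence is a subset of high
    return bands


def count_ppi(lines: Iterable[str], scale: float = 1000.0) -> dict[str, Counter]:
    """Count interactions per protein (first column) in each confidence band."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for line in lines:
        parts = line.split()
        # Skip the header and any line without an integer score.
        if len(parts) != 3 or not parts[2].isdigit():
            continue
        protein1, _protein2, raw_score = parts
        for band in confidence_bands(int(raw_score) / scale):
            counts[protein1][band] += 1
    return dict(counts)


# Hypothetical sample lines; only the first ID follows the format shown above.
sample = [
    "protein1 protein2 combined_score",
    "9606.ENSP00000400646 9606.PARTNER_A 250",
    "9606.ENSP00000400646 9606.PARTNER_B 450",
    "9606.ENSP00000400646 9606.PARTNER_C 960",
    "9606.ENSP00000400646 9606.PARTNER_D 150",
]
ppi_counts = count_ppi(sample)
print(ppi_counts["9606.ENSP00000400646"])
```

In this sketch an interaction scoring 0.96 is counted under both 'HighConf_PPI' and 'UHC_PPI', and one scoring 0.15 is counted nowhere; whether the published columns treat the boundaries the same way is not stated here.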
common_data.zip: The zipped "common_data" file contains two folders (common data, opentree13.4_tree), which respectively contain the common data files used across species and the download from opentree that was used to construct each species-specific phylogeny.
The common data folder includes data that is relevant to each species and used to generate the species data frames, like the Open Tree of Life tree file, OMA groups from the OMA database, etc.
To recreate the analysis in the paper, the data from this folder should be kept in the same organization as it is here and added to a directory with the code from the accompanying code folder. Code should be executed in order based on the numeric label given to each file (e.g., 1_oma_group_trimmer.py should be run before 2_collect_unique_species.py).
Files and variables
File: Code.zip
Description: All code + requirements.txt file necessary to recreate the analysis from the ground up.
File: common_data.zip
Description: All data that was used for every species and was not species-specific. Contains three files: GO terms, OMA DB entries, as well as the figures that went into the manuscript.
File: species_data.zip
Description: Species-specific data, including Ensembl Biomart and StringDB data files that were specific to each species. A critical piece of data in each of these folders is the Gene_Age_Dataframe.pkl, which is a file in each species folder that contains processed data used for plotting.
Code/software
Code
The Python code used to process the downloaded data, run the analysis, and generate figures is included in the 'code.zip' file along with a requirements.txt to set up the necessary Python environment.
Assuming that the data has been unpacked in the structure described above, the code should be run in order, from 1_... to 9_... The files with a letter after the number (e.g., 5a_...) generate figures. All code has been commented to provide additional information for running the code.
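The run order described above can be derived from the file names themselves. The sketch below sorts script names by their numeric prefix and then by the optional letter, so that a lettered script follows its numbered sibling; that placement of lettered scripts is an assumption, and `script_sort_key` and `ordered_scripts` are hypothetical helper names rather than part of Code.zip.

```python
import re
from typing import Iterable

# A leading number, an optional lowercase letter, then an underscore (e.g. "3a_").
_PREFIX = re.compile(r"^(\d+)([a-z]?)_")


def script_sort_key(name: str) -> tuple[int, str]:
    """Return (number, letter) so that 3_ < 3a_ < 4_ and 9_ < 10_."""
    match = _PREFIX.match(name)
    if match is None:
        raise ValueError(f"no numeric prefix in {name!r}")
    return int(match.group(1)), match.group(2)


def ordered_scripts(names: Iterable[str]) -> list[str]:
    """Keep only numbered scripts and return them in run order."""
    numbered = [n for n in names if _PREFIX.match(n)]
    return sorted(numbered, key=script_sort_key)


# A few of the script names listed in this README, deliberately shuffled.
names = [
    "3a_Process_StringDB_data.py",
    "requirements.txt",
    "9_GOGeneric_to_GOSlim.py",
    "1_Oma_Group_Trimmer.py",
    "3_Generate_Tree.py",
    "2_Collect_Unique_Species.py",
]
run_order = ordered_scripts(names)
for script in run_order:
    print(script)
```

Comparing the numeric prefix as an integer rather than as text avoids a plain alphabetical sort placing a two-digit prefix before a one-digit one.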
Suggestion: To fully recreate the analysis (rather than starting with the analyzed dataframes), I would suggest deleting all data in each species-specific folder **except** the protein.info files (these come from STRING and you would have to redownload them) and the mart_export files (these come from Ensembl BioMart and would also need to be redownloaded). To run each script for a specific species, you would then need to change the variable names in scripts 1-4 to match the folder names and species names you are interested in analyzing.
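Before deleting anything, it may help to preview which files the suggestion above would remove. The following is a dry-run sketch only: it lists candidates and deletes nothing. It assumes the files to keep can be recognized by the substrings "protein.info" and "mart_export" in their names; `files_to_delete` is a hypothetical helper, and the demo file names are illustrative.

```python
import tempfile
from pathlib import Path

# Name fragments of the files the suggestion says to keep (assumed to be sufficient).
KEEP_MARKERS = ("protein.info", "mart_export")


def files_to_delete(species_dir: Path) -> list[Path]:
    """Return files under species_dir whose names contain none of KEEP_MARKERS."""
    return sorted(
        path
        for path in species_dir.rglob("*")
        if path.is_file() and not any(marker in path.name for marker in KEEP_MARKERS)
    )


# Demo on a throwaway directory with illustrative file names.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in (
        "9606.protein.info.v11.5.txt",
        "mart_export.txt",
        "Gene_Age_Dataframe.pkl",
        "main.csv",
    ):
        (folder / name).touch()
    doomed = [path.name for path in files_to_delete(folder)]

print(doomed)
```

To use it on real data, point `files_to_delete` at one unpacked species folder and review the returned list by hand before removing anything.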
Access information
Other publicly accessible locations of the code used in this project:
Data was derived from the following sources:
