Data from: ERC2.0-evolutionary rate covariation update provides more powerful inference of functional interactions across large phylogenies

Little, Jordan 1; Hoffmann Meyer, Guillermo 2; Grover, Aakash 2; Francette, Alex 3; Partha, Raghavendran 4; Arndt, Karen 2; Smith, Martin 5; Clark, Nathan 2; Chikina, Maria 2

Published Mar 13, 2025 on Dryad. https://doi.org/10.5061/dryad.6m905qg8q

Data files

Mar 13, 2025 version files (7.54 GB total)

ERC2.0_supp_data.zip (7.54 GB)
README.md (3.75 KB)

Abstract

Evolutionary Rate Covariation (ERC) is an established comparative genomics method that identifies sets of genes sharing patterns of sequence evolution, which suggests shared function. Whereas many functional predictions of ERC have been empirically validated, its predictive power has hitherto been limited by its inability to tackle the large numbers of species in contemporary comparative genomics datasets. This study introduces ERC2.0, an enhanced methodology for studying ERC across phylogenies with hundreds of species and tens of thousands of genes. ERC2.0 improves upon previous iterations of ERC in algorithm speed, normalizing for heteroskedasticity, and normalizing correlations via Fisher transformations. These improvements have resulted in greater statistical power to predict biological function. In exemplar yeast and mammalian datasets, we demonstrate that the predictive power of ERC2.0 is improved relative to the previous method, ERC1.0, and that further improvements are obtained by using larger yeast and mammalian phylogenies. We attribute the improvements to both the larger datasets and improved rate normalization. We demonstrate that ERC2.0 has high predictive accuracy for known annotations and can predict the functions of genes in non-model systems. Our findings underscore the potential for ERC2.0 to be used as a single-pass computational tool in candidate gene screening and functional predictions.


This dataset contains output files for the analyses found in
“ERC2.0-evolutionary rate covariation update provides more powerful
inference of functional interactions across large phylogenies”. These
files include ERC matrices calculated using phylogenies of 343 yeast
species and 120 mammal species. This repository also includes Cytoscape
networks for high-scoring ERC pairs, both genome-wide and focused on the
histone chaperone network.

Description of the data and file structure

Table_S1_ERC_runtime_comparison: Excel file comparing the run times
of ERC1.0 and ERC2.0 on a subset of 500 genes

ERC_matrices:

yeast_ftERC.RDS, mammal_ftERC.RDS

These are the full-genome ERC matrices calculated on phylogenies of 343
yeast species and 120 mammal species, respectively.
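The abstract notes that ERC2.0 normalizes correlations via Fisher transformations. As a rough illustration of what such a transformation does, here is a minimal Python sketch. It is not the ERC2.0 implementation: the function name `fisher_z`, the sqrt(n - 3) scaling, and the example numbers are assumptions chosen for illustration, with n standing for the number of paired observations behind a correlation.

```python
import math


def fisher_z(r: float, n: int) -> float:
    """Fisher-transform a correlation r computed from n paired observations.

    Assumption for illustration: the transformed value is scaled by
    sqrt(n - 3), the inverse of the approximate standard error of atanh(r),
    so that correlations based on different n are roughly comparable.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must be greater than 3")
    return math.atanh(r) * math.sqrt(n - 3)


# Made-up inputs: the same raw correlation, supported by few vs. many observations.
few = fisher_z(0.5, 12)
many = fisher_z(0.5, 103)
```

In this sketch the same raw correlation yields a larger transformed score when it rests on more observations, which is the general motivation for this kind of normalization.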

glmnet_analysis_output: the glmnet objects generated from running cv.glmnet on various iterations of ERC datasets.

glmnet_urls: the URLs used to call the annotation databases
(KEGG and GO Biological Process)

glmnet_analysis: R script that generates a glmnet object for each
annotation in a given dataset downloaded from a URL

glmnet_analysis_output/yeast: output from cv.glmnet for GO Biological
Process and KEGG annotations for various iterations of ERC calculated with
yeast phylogenies. The files are named as follows: database_species
number_erc iteration.RDS

glmnet_analysis_output/mammal: output from cv.glmnet for GO Biological
Process and KEGG annotations for various iterations of ERC calculated with
mammalian phylogenies. The files are named as follows: database_species
number_erc iteration.RDS
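To show how the naming scheme above could be used when iterating over these outputs, here is a small Python sketch that splits a file name into its three parts. It assumes the scheme means three underscore-separated fields (database, species number, ERC iteration) followed by the .RDS extension. The function name, the class name, and the example file name are hypothetical and are not taken from the archive.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class GlmnetOutputName(NamedTuple):
    database: str
    species_number: int
    erc_iteration: str


def parse_output_name(path: str) -> GlmnetOutputName:
    """Split '<database>_<species number>_<ERC iteration>.RDS' into fields.

    Assumption: the first two underscores separate the three fields, and
    the species number is a plain integer.
    """
    p = PurePosixPath(path)
    if p.suffix.upper() != ".RDS":
        raise ValueError(f"not an RDS file: {path}")
    parts = p.stem.split("_", 2)
    if len(parts) != 3 or not parts[1].isdigit():
        raise ValueError(f"unexpected file name layout: {p.name}")
    database, species, iteration = parts
    return GlmnetOutputName(database, int(species), iteration)


# Hypothetical file name, invented for illustration only.
example = parse_output_name(
    "glmnet_analysis_output/yeast/KEGG_343_exampleiteration.RDS"
)
```

If the actual database labels contain underscores, the split rule would need adjusting. The .RDS files themselves are R objects and are most directly read in R with readRDS.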

non_cerevisiae_predicted_local.xlsx: Excel file showing
non-cerevisiae orthologs and their subcellular localization as predicted by
glmnet trained on the 343-species yeast ERC matrix. The predicted subcellular
localizations are listed in separate tabs. Each row is a gene predicted to have
that localization, with the gene name in the first column and the BLAST result
from a non-cerevisiae ortholog in the second column.

network_data: ERC and GO Biological process annotation data for clusters generated in Cytoscape

Figure5_yeast_FtERC_clusters: Full list of GO Biological Process
enrichment terms for the clusters shown in Figure 5. Each tab lists the
enriched GO Biological Process terms for a given cluster. The GO BP term
enrichment scores and p-values are recorded in columns 2-7.

Figure6_Mammalian_FtERC_clusters: Full list of GO Biological
Process enrichment terms for the clusters shown in Figure 6.
Each tab lists the enriched GO Biological Process terms for a given cluster.
The GO BP term enrichment scores and p-values are recorded in columns 2-7.

SuppFile4_ERCdataZscoreHistoneChaperonesCoreAll.xlsx: ERC scores
between the histone chaperone proteins displayed in Figure 7. The first and fourth
columns represent the manually curated annotation group as either a number
(column 1) or the histone-chaperone protein it is associated with (column 4).
The second column shows the gene name. The third column shows the ERC value
between the genes listed in column 2 and column 4. Columns 5 and 6 show the
Z-score of the ERC value for a given pair.
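For readers unfamiliar with Z-scores, the Python sketch below shows the general calculation: a value is standardized against the mean and standard deviation of a background set of values. The dataset description does not state which background distribution was used for columns 5 and 6, so the background list, the use of the sample standard deviation, and the function name `z_score` are all assumptions for illustration.

```python
import statistics


def z_score(value: float, background: list[float]) -> float:
    """Standardize `value` against the mean and sample SD of `background`."""
    if len(background) < 2:
        raise ValueError("need at least two background values")
    mean = statistics.fmean(background)
    sd = statistics.stdev(background)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("background values are all identical")
    return (value - mean) / sd


# Made-up numbers, not values from the dataset.
background = [1.0, 2.0, 3.0]
z = z_score(4.0, background)  # (4 - 2) / 1
```

A Z-score computed this way expresses how many standard deviations a pair's ERC value lies above or below the chosen background.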

cytoscape_networks: .cys files for the Cytoscape networks in Figures 5-7

yeast_clusters.cys: Cytoscape networks for top 0.1% of yeast scores as
shown in Figures 5, S4

mammal_clusters: Cytoscape networks for top 0.1% of mammal scores as
shown in Figures 6, S5

EloEvoHistoneChaperonesCore3ManualAnnotationNoERC.cys: Cytoscape network
for the histone chaperone network displayed in Figure 7
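The genome-wide networks above are built from the top 0.1% of scores. The Python sketch below shows one way such a cutoff could be applied to a symmetric gene-by-gene score matrix. It is an illustration under stated assumptions, not the procedure used to build the networks: the function name, the handling of missing values, the rounding rule, and the toy matrix are all hypothetical.

```python
import math
from itertools import combinations


def top_fraction_pairs(genes, matrix, fraction=0.001):
    """Return (gene_a, gene_b, score) for the highest-scoring gene pairs.

    Each unordered pair is counted once (upper triangle only), NaN scores
    are skipped, and at least one pair is kept when any pair is available.
    """
    pairs = [
        (genes[i], genes[j], matrix[i][j])
        for i, j in combinations(range(len(genes)), 2)
        if not math.isnan(matrix[i][j])
    ]
    pairs.sort(key=lambda p: p[2], reverse=True)
    keep = max(1, math.ceil(len(pairs) * fraction)) if pairs else 0
    return pairs[:keep]


# Toy symmetric matrix with made-up scores, not data from the archive.
genes = ["A", "B", "C", "D"]
nan = float("nan")
matrix = [
    [nan, 3.0, 1.0, 0.5],
    [3.0, nan, 2.5, 0.2],
    [1.0, 2.5, nan, 2.0],
    [0.5, 0.2, 2.0, nan],
]
top_half = top_fraction_pairs(genes, matrix, fraction=0.5)
top_default = top_fraction_pairs(genes, matrix)
```

With a real genome-wide matrix the default fraction of 0.001 corresponds to the top 0.1% of pairs. The larger fraction in the toy call only serves to make the small example return more than one pair.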

Code/Software

Software to run ERC2.0 can be found on GitHub.