Evolutionary rate covariation reveals novel protein networks in Mycobacterium tuberculosis

Clark, Nathan¹; Schwartz, Nico¹; Little, Jordan²

Research facility: University of Pittsburgh

Published Oct 02, 2025 on Dryad. https://doi.org/10.5061/dryad.v41ns1s8v

Data files

Oct 02, 2025 version files (3.25 GB total)

Paper_Data.zip (3.25 GB)
README.md (6.56 KB)
Abstract

Proteins that participate together in a function often show covariation in their relative evolutionary rates, because selective pressures influencing their evolution affect co-functional proteins as an ensemble. Evolutionary Rate Covariation (ERC) is a computational comparative genomics tool that identifies this co-variation using collections of orthologous gene sequences from several species, providing measures of correlation that can be used to infer novel functional relationships. Though proven to be successful at identifying novel functional relationships in eukaryotic species, ERC’s usefulness in prokaryotes requires further investigation. In this study, we validate the use of ERC in prokaryotes using 106 species in the Mycobacterium genus, and provide a useful genome-wide dataset of functional relationships between 11,638 mycobacterial orthologous gene groups with Mycobacterium tuberculosis as a focal model organism. To evaluate the dataset’s utility, we studied genome-wide ERC patterns between known functionally interacting proteins and operons, as well as within smaller functional groups such as the recBCD complex, peptidoglycan synthesis, and multiple transport systems. We use these data to demonstrate how ERC can be used as a predictor of functional relationships in prokaryotes and describe three potential novel protein networks for future study. This repository contains all products and scripts associated with this study in Mycobacterial species.


This README file describes all of the data gathered and the code used in this ERC study. All of the ERC data for the gene families, networks, and operons discussed in the manuscript, as well as all other underlying data and materials, are stored here.

Data Access

Myco_Proteomes

Full FASTA RefSeq amino acid sequences for each Mycobacterium species used in this study
Obtained from the National Center for Biotechnology Information
.csv and .pdf versions of a table containing each species name, accession number, genome version, and number of protein coding genes

Orthogroup_Seqeunces

Amino acid sequences for each orthogroup created by OrthoFinder

Gene_Trees

Gene trees corresponding to each orthogroup, generated by OrthoFinder

SpeciesTree_Rooted.tree

Master species tree of all Mycobacterium species used in this study, generated by OrthoFinder

Alingments

MUSCLE alignments of each single copy orthogroup determined by OrthoSnap
Derived from the Orthogroups generated by OrthoFinder

Gene_name_map

.rds and .csv versions of an Mtb gene map listing, for each gene, the orthogroup to which it was assigned, its gene symbol (if applicable), its Rv number, and its NP identifier

FULL_ERC_Matrix

.rds and .csv versions of the full ERC matrix containing all Mycobacterium species
.rds and .csv versions of the same data in dataframe form

TB_ERC_matrix

.rds and .csv versions of an ERC matrix containing all orthogroups with an Mtb gene, with Mtb genes as row and column names
.rds and .csv versions of the same data in dataframe form
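
The relationship between the matrix form and the dataframe form can be illustrated with a small sketch. This is only an assumption about the layout (one row per unordered gene pair); the column order of the actual dataframe files is not described here, the function name `matrix_to_pairs` is hypothetical, and the ERC values are made up.

```python
from itertools import combinations


def matrix_to_pairs(labels, matrix):
    """Flatten a symmetric matrix into (gene_a, gene_b, value) rows,
    one row per unordered gene pair (upper triangle only)."""
    rows = []
    for i, j in combinations(range(len(labels)), 2):
        rows.append((labels[i], labels[j], matrix[i][j]))
    return rows


# Toy 3 x 3 matrix with invented values; the diagonal is left empty.
labels = ["Rv0001", "Rv0002", "Rv0003"]
matrix = [
    [None, 0.42, -0.10],
    [0.42, None, 0.05],
    [-0.10, 0.05, None],
]

for row in matrix_to_pairs(labels, matrix):
    print(row)
```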

Operons

.rds and .csv versions of a dataframe containing each Mtb operon and its respective genes
.rds and .csv versions of a dataframe containing permutation test data done within (intra) each operon
- ERC between genes within each individual operon
.rds and .csv versions of a dataframe containing permutation test data done between (inter) each operon pair
- ERC between the genes of each operon pair

Networks

.rds and .csv versions of dataframes containing permutation test data for each network discussed in the manuscript
This includes the Rip A secretion network, the mixed operon network, the virulence operon network, and the Mce1/Mce4/Esx5 network

Code/Software

OrthoFinder

OrthoFinder.sh
- bash script that opens python virtual environment on a computing cluster and executes the OrthoFinder command

OrthoSnap

OrthoSequences.py
- python script that edits the orthogroup sequences from OrthoFinder to adhere to the format requirements of OrthoSnap
- appends species name to the sequence header
- converts "_" in between the species and gene in the sequence header to "|"
- Ex: WP_085668502.1 -> GCF_002116635_1|WP_085668502.1
GeneTrees.py
- python script that edits the species/gene identifier in the gene trees from OrthoFinder to adhere to the format requirements of OrthoSnap
- converts "_" to "|"
- Ex: GCF_002116635_1_WP_085668502.1 -> GCF_002116635_1|WP_085668502.1
FileMoverScript.sh
- bash script that moves the orthogroup sequences and gene trees into one folder in order for OrthoSnap to loop through
OrthoSnapScript.sh
- bash script that loops through each file in the combined Sequence/Tree directory, executing the orthosnap command for each Sequence/tree pair
OrthoSnap.sh
- script that executes OrthoSnapScript.sh as a job on a computing cluster
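
The identifier conversions described for OrthoSequences.py and GeneTrees.py can be sketched as follows. This is a minimal illustration rather than the repository's code: the function names `add_species_prefix` and `fix_tree_label` are hypothetical, and the regular expression assumes that assembly names have the form `GCF_<digits>_<version>` and that protein identifiers start with a two-letter RefSeq prefix such as `WP_`.

```python
import re

# Assumed label layout in the gene trees: <assembly>_<protein>, e.g.
# GCF_002116635_1_WP_085668502.1 (assembly = GCF_<digits>_<version>).
TREE_LABEL = re.compile(r"\b(GCF_\d+_\d+)_([A-Z]{2}_\d+\.\d+)")


def add_species_prefix(header, species):
    """Prepend the species/assembly name to a FASTA header, separated by '|'."""
    name = header.lstrip(">").strip()
    return f">{species}|{name}"


def fix_tree_label(newick):
    """Replace the '_' between assembly and protein ID with '|' in a Newick string."""
    return TREE_LABEL.sub(r"\1|\2", newick)


print(add_species_prefix(">WP_085668502.1", "GCF_002116635_1"))
print(fix_tree_label("(GCF_002116635_1_WP_085668502.1:0.01);"))
```

Only the underscore that separates the assembly name from the protein identifier is replaced; underscores inside either identifier are left untouched.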

MUSCLE

FileDivider.py
- python script that divides the single-copy orthogroups generated by OrthoSnap into 10 subgroups to speed up the alignment process
- last group gets the remainder
MuscleScript.sh
- template of the bash script used to create the alignments for each of the ten subgroups
- creates alignment and places them in an output folder
Muscle.sh
- bash script to run MuscleScript.sh as a job on a computing cluster
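
The splitting step described for FileDivider.py (ten subgroups, with the last one taking the remainder) might look like the sketch below. The function name `divide` and the file names are hypothetical; the actual script may assign files differently.

```python
def divide(items, n_groups=10):
    """Split items into n_groups consecutive chunks of equal size;
    the last chunk additionally takes whatever is left over."""
    size = len(items) // n_groups
    groups = [items[i * size:(i + 1) * size] for i in range(n_groups - 1)]
    groups.append(items[(n_groups - 1) * size:])
    return groups


# Hypothetical orthogroup file names, for illustration only.
files = [f"OG{i:07d}.fa" for i in range(25)]
groups = divide(files, n_groups=10)
print([len(g) for g in groups])
```

Each subgroup can then be aligned as a separate cluster job, which is the point of the split.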

Run ERC Code

TB_ERC.r
- R script that first executes a phangorn command that creates gene trees for each MUSCLE-aligned single-copy orthogroup
- puts the gene trees into a script that calculates ERC
- outputs full ERC matrix, Mtb specific ERC matrix, and Mtb specific dataframe
ERC_Processor.R
- contains functions needed to execute TB_ERC.R
calcERC.sh
- bash script that executes TB_ERC.R on a computing cluster
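
As a rough illustration of the idea behind ERC (a correlation between the relative evolutionary rates of two genes across the branches of a shared tree), consider the toy sketch below. It is not the method implemented in ERC_Processor.R: dividing gene-tree branch lengths by species-tree branch lengths is an assumed, simplified normalization, the function names are hypothetical, and all numbers are invented.

```python
import math


def relative_rates(gene_branches, species_branches):
    """Divide each gene-tree branch length by the matching species-tree
    branch length (a simplified, assumed normalization)."""
    return [g / s for g, s in zip(gene_branches, species_branches)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Invented branch lengths for the same four branches in each tree.
species = [0.10, 0.20, 0.05, 0.30]
gene_a = [0.12, 0.18, 0.09, 0.25]
gene_b = [0.11, 0.15, 0.08, 0.20]

erc_like = pearson(relative_rates(gene_a, species),
                   relative_rates(gene_b, species))
print(round(erc_like, 3))
```

A value near 1 would indicate that the two genes speed up and slow down on the same branches, which is the signal the study uses to infer functional relationships.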

ERC_data

This folder contains five subfolders with the scripts used to gather the data discussed in the manuscript

Gene name map

Map.py
- python script that creates a two column dataframe of Mtb NP identifiers and the orthogroups they were in
Map2.py
- python script that adds the Mtb genes' gene symbols (if applicable) and Rv numbers to the dataframe created by Map.py to make the complete Mtb gene map

ROC curves

ROC_curve.R
- R script to generate a ROC curve and AUC value for the STRING protein dataset
OP_ROC.R
- R Script to generate a ROC curve and AUC value for Mtb operons
Protein_Scores.py
- python script that retrieves protein pairs from STRING based on a user-specified score threshold
- true positives
RandomPairs.py
- python script that generates a user-specified number of random pairs from the proteins retrieved by Protein_Scores.py
- false positives
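
The AUC that the ROC scripts report can be understood through the small sketch below, which treats the STRING-derived pairs as the positive set and the random pairs as the comparison set. It is a Python restatement of the general idea, not a translation of ROC_curve.R; the function name `auc` is hypothetical and the scores are invented.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive pair scores higher than a randomly chosen comparison
    pair (ties count as one half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Invented ERC values for illustration.
string_pairs = [0.9, 0.8, 0.4]   # pairs above the STRING score threshold
random_pairs = [0.5, 0.3, 0.1]   # randomly drawn protein pairs

print(auc(string_pairs, random_pairs))
```

An AUC of 0.5 means ERC does not separate the two sets at all, while values approaching 1 mean that known interacting pairs consistently receive higher ERC than random pairs.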

Operons

Operon_Extracter.py
- extracts the operons from the operon text file downloaded from the Texas A&M TB Genome annotation portal
Operons_with_at_least_2_genes.py
- trims the list so that each operon for the analysis has at least two genes
C.remover.py
- removes the "c" suffix from Rv numbers (it indicates that the gene is transcribed from the complementary strand) for easier data cleaning; it can be added back later
Operons.R
- R script that loads the operons into an R environment and creates individual lists for each operon
- carries out both the intra-operon and inter-operon permutation tests, and creates a dataframe to store the results of each
Operon_funtions.R
- contains functions needed to execute Operons.R
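
The two cleaning steps described above (dropping the strand suffix and keeping only operons with at least two genes) can be sketched as follows. The function names and the operon names are hypothetical, and the sketch assumes the operons have already been parsed into a dictionary; the format of the downloaded operon text file is not reproduced here.

```python
import re


def strip_strand_suffix(rv_number):
    """Remove a trailing 'c' from an Rv number, e.g. Rv0008c -> Rv0008."""
    return re.sub(r"c$", "", rv_number)


def keep_multigene(operons):
    """Keep only operons that contain at least two genes."""
    return {name: genes for name, genes in operons.items() if len(genes) >= 2}


# Hypothetical operon names mapped to example Rv numbers.
operons = {
    "operon_1": ["Rv0001"],
    "operon_2": ["Rv0002", "Rv0003"],
    "operon_3": ["Rv0008c", "Rv0009"],
}

cleaned = {
    name: [strip_strand_suffix(g) for g in genes]
    for name, genes in keep_multigene(operons).items()
}
print(cleaned)
```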

glmnet

glmnet_functions.R
- contains functions for running glmnet
Run_glmnet.R
- script to execute glmnet for predictions

PermTests/Clustering

DataGathering.R
- contains different permutation test functions for data analysis
  - used to find the significance for all gene families, networks, and operons
- contains a clustering function to cluster operons based on specified user input
