Evolutionary rate covariation reveals novel protein networks in Mycobacterium tuberculosis
Data files
Oct 02, 2025 version, 3.25 GB total
- Paper_Data.zip (3.25 GB)
- README.md (6.56 KB)
Abstract
Proteins that participate together in a function often show covariation in their relative evolutionary rates, because selective pressures influencing their evolution affect co-functional proteins as an ensemble. Evolutionary Rate Covariation (ERC) is a computational comparative genomics tool that identifies this covariation using collections of orthologous gene sequences from several species, providing measures of correlation that can be used to infer novel functional relationships. Though proven successful at identifying novel functional relationships in eukaryotic species, ERC’s usefulness in prokaryotes requires further investigation. In this study, we validate the use of ERC in prokaryotes using 106 species in the Mycobacterium genus, and provide a useful genome-wide dataset of functional relationships between 11,638 mycobacterial orthologous gene groups with Mycobacterium tuberculosis as a focal model organism. To evaluate the dataset’s utility, we studied genome-wide ERC patterns between known functionally interacting proteins and operons, as well as within smaller functional groups such as the recBCD complex, peptidoglycan synthesis, and multiple transport systems. We use these data to demonstrate how ERC can be used as a predictor of functional relationships in prokaryotes and describe three potential novel protein networks for future study. This repository contains all products and scripts associated with this study in mycobacterial species.
This README file describes all of the data gathered and code used in this ERC study. All of the ERC data for the gene families, networks, and operons discussed in the manuscript, as well as all other underlying data and materials used, are stored here.
Data Access
Myco_Proteomes
- Full FASTA RefSeq amino acid sequences for each mycobacterium species used in this study
- Obtained from the National Center for Biotechnology Information
- .csv and .pdf versions of a table containing each species name, accession number, genome version, and number of protein coding genes
Orthogroup_Seqeunces
- Amino acid sequences for each orthogroup created by OrthoFinder
Gene_Trees
- Gene trees corresponding to each orthogroup generated by OrthoFinder
SpeciesTree_Rooted.tree
- Master species tree of all mycobacterium species used in this study generated by OrthoFinder
Alingments
- MUSCLE alignments of each single copy orthogroup determined by OrthoSnap
- Derived from the Orthogroups generated by OrthoFinder
Gene_name_map
- .rds and .csv versions of an Mtb gene map listing, for each gene, the orthogroup to which it was assigned, its gene symbol (if applicable), its Rv number, and its NP identifier
FULL_ERC_Matrix
- .rds and .csv versions of the full ERC matrix containing all mycobacterium species
- .rds and .csv versions of the same data in dataframe form
TB_ERC_matrix
- .rds and .csv versions of an ERC matrix containing all orthogroups with an Mtb gene, with Mtb genes as row and column names
- .rds and .csv versions of the same data in dataframe form
Operons
- .rds and .csv versions of a dataframe containing each Mtb operon and its respective genes
- .rds and .csv versions of a dataframe containing permutation test data done within (intra) each operon
- ERC between genes within each individual operon
- .rds and .csv versions of a dataframe containing permutation test data done between (inter) each operon pair
- ERC between the genes of each operon pair
Networks
- .rds and .csv versions of dataframes containing permutation test data for each network discussed in the manuscript
- This includes the Rip A secretion network, the mixed operon network, the virulence operon network, and the Mce1/Mce4/Esx5 network
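This README does not spell out the permutation procedure behind the operon and network data above; the actual functions live in DataGathering.R. As a rough illustration only, a common form of such a test compares the mean pairwise ERC within a gene set against the means of randomly drawn gene sets of the same size. The function names, the dictionary layout for ERC values, and the add-one p-value correction below are all assumptions for this sketch, not the study's implementation.

```python
import random
from itertools import combinations
from statistics import mean


def mean_pairwise_erc(erc, genes):
    """Mean ERC over all gene pairs in `genes` that have a value in `erc`.

    `erc` is a hypothetical lookup: frozenset({gene_a, gene_b}) -> ERC value.
    Pairs without a value (e.g. NA in the matrix) are skipped.
    """
    values = [erc[frozenset(pair)] for pair in combinations(genes, 2)
              if frozenset(pair) in erc]
    return mean(values) if values else float("nan")


def permutation_p_value(erc, all_genes, gene_set, n_perm=1000, seed=0):
    """One-sided p-value: how often a random gene set of the same size
    has a mean pairwise ERC at least as high as the observed gene set."""
    rng = random.Random(seed)
    observed = mean_pairwise_erc(erc, gene_set)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(all_genes, len(gene_set))
        if mean_pairwise_erc(erc, sample) >= observed:
            hits += 1
    # Add-one correction (an assumption here) keeps the estimate above zero.
    return (hits + 1) / (n_perm + 1)
```

A between-group (inter) version would follow the same pattern but average only over pairs that span the two gene sets.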
Code/Software
OrthoFinder
- OrthoFinder.sh
- bash script that activates a Python virtual environment on a computing cluster and executes the OrthoFinder command
OrthoSnap
- OrthoSequences.py
- python script that edits the orthogroup sequences from OrthoFinder to adhere to the format requirements of OrthoSnap
- appends species name to the sequence header
- converts "_" in between the species and gene in the sequence header to "|"
- Ex: WP_085668502.1 -> GCF_002116635_1|WP_085668502.1
- GeneTrees.py
- python script that edits the species/gene identifier in the gene trees from OrthoFinder to adhere to the format requirements of OrthoSnap
- converts "_" to "|"
- Ex: GCF_002116635_1_WP_085668502.1 -> GCF_002116635_1|WP_085668502.1
- FileMoverScript.sh
- bash script that moves the orthogroup sequences and gene trees into one folder in order for OrthoSnap to loop through
- OrthoSnapScript.sh
- bash script that loops through each file in the combined Sequence/Tree directory, executing the orthosnap command for each Sequence/tree pair
- OrthoSnap.sh
- script that executes OrthoSnapScript.sh as a job on a computing cluster
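The two renaming steps performed by OrthoSequences.py and GeneTrees.py can be sketched as follows. This is a minimal illustration based only on the examples given above, not the scripts themselves; the function names are hypothetical, and the regular expression assumes every species identifier has the form `GCF_<digits>_<digits>`, as in the examples.

```python
import re

# Assumption: species identifiers look like "GCF_002116635_1",
# as in the README examples; other accession formats are not handled.
SPECIES_PREFIX = re.compile(r"(GCF_\d+_\d+)_")


def add_species_to_header(header, species_id):
    """Prefix a FASTA header (without '>') with the species identifier and '|'."""
    return f"{species_id}|{header}"


def relabel_newick(newick):
    """Replace the '_' joining species and gene in each tip label with '|'."""
    return SPECIES_PREFIX.sub(r"\1|", newick)
```

Because `relabel_newick` works on the whole string, it can be applied to a complete Newick tree as well as to a single tip label.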
MUSCLE
- FileDivider.py
- python script that divides single copy orthogroups generated from OrthoSnap into 10 different subgroups to speed up the alignment process
- last group gets the remainder
- MuscleScript.sh
- template of the bash script used to create the alignments for each of the ten subgroups
- creates alignments and places them in an output folder
- Muscle.sh
- bash script to run MuscleScript.sh as a job on a computing cluster
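The splitting rule described for FileDivider.py (ten subgroups, with the last one taking the remainder) can be sketched like this. The function name and the list-of-filenames input are assumptions for illustration; the actual script may move files on disk rather than return lists.

```python
def split_into_groups(items, n_groups=10):
    """Split `items` into `n_groups` consecutive chunks; the last chunk
    also takes whatever is left over after even division."""
    size = len(items) // n_groups
    groups = [items[i * size:(i + 1) * size] for i in range(n_groups - 1)]
    groups.append(items[(n_groups - 1) * size:])
    return groups
```

For example, 23 files would be split into nine groups of two and a final group of five.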
Run ERC Code
- TB_ERC.r
- R script that first executes a phangorn command that creates gene trees for each MUSCLE-aligned single copy orthogroup
- puts the gene trees into a script that calculates ERC
- outputs full ERC matrix, Mtb specific ERC matrix, and Mtb specific dataframe
- ERC_Processor.R
- contains functions needed to execute TB_ERC.R
- calcERC.sh
- bash script that executes TB_ERC.R on a computing cluster
ERC_data
This folder contains five subfolders with the scripts used to gather the data discussed in the manuscript
Gene name map
- Map.py
- python script that creates a two column dataframe of Mtb NP identifiers and the orthogroups they were in
- Map2.py
- python script that adds the Mtb genes' gene symbols (if applicable) and Rv numbers to the dataframe created by Map.py to make the complete Mtb gene map
ROC curves
- ROC_curve.R
- R script to generate a ROC curve and AUC value for the STRING protein dataset
- OP_ROC.R
- R Script to generate a ROC curve and AUC value for Mtb operons
- Protein_Scores.py
- python script that retrieves protein pairs from STRING based on a user-input score threshold
- true positives
- RandomPairs.py
- python script that generates a user-input number of random pairs from the proteins retrieved by Protein_Scores.py
- false positives
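The two pair-selection steps above can be sketched as follows. This is an illustration under stated assumptions, not the contents of Protein_Scores.py or RandomPairs.py: the function names are hypothetical, the input is assumed to be a list of `(protein_a, protein_b, score)` tuples, and excluding the thresholded pairs from the random draw is a design choice made here that the README does not confirm.

```python
import random


def pairs_above_threshold(scored_pairs, threshold):
    """Keep protein pairs whose score is at or above `threshold`."""
    return {frozenset((a, b)) for a, b, score in scored_pairs if score >= threshold}


def random_pairs(proteins, n_pairs, exclude=frozenset(), seed=0):
    """Draw `n_pairs` distinct random pairs that are not in `exclude`."""
    proteins = sorted(set(proteins))
    pool = set(proteins)
    # Count only excluded pairs that could actually be drawn from `proteins`.
    excluded_here = sum(1 for pair in exclude if pair <= pool)
    max_pairs = len(proteins) * (len(proteins) - 1) // 2 - excluded_here
    if n_pairs > max_pairs:
        raise ValueError("not enough candidate pairs")
    rng = random.Random(seed)
    drawn = set()
    while len(drawn) < n_pairs:
        pair = frozenset(rng.sample(proteins, 2))
        if pair not in exclude:
            drawn.add(pair)
    return drawn
```

Storing each pair as a `frozenset` makes the comparison independent of the order in which the two proteins are listed.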
Operons
- Operon_Extracter.py
- extracts the operons from the operon text file downloaded from the Texas A&M TB Genome annotation portal
- Operons_with_at_least_2_genes.py
- trims the list so that each operon for the analysis has at least two genes
- C.remover.py
- removes the "c" suffix from Rv numbers, which indicates that the gene is transcribed from the complementary strand, for easier data cleaning; it can be added back later
- Operons.R
- R script that loads the operons into an R environment and creates individual lists for each operon
- carries out both the intra-operon and inter-operon permutation tests, and creates two dataframes to store the results for each one
- Operon_funtions.R
- contains functions needed to execute Operons.R
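The two cleaning steps described for Operons_with_at_least_2_genes.py and C.remover.py can be sketched as below. The function names, the operon labels, and the dictionary input format are hypothetical; the layout of the downloaded operon text file is not described here, so parsing is left out.

```python
def strip_strand_suffix(rv_number):
    """Drop a trailing lowercase 'c' (complementary-strand marker) from an Rv number."""
    return rv_number[:-1] if rv_number.endswith("c") else rv_number


def operons_with_at_least_two_genes(operons):
    """Keep operons with two or more genes, normalising each Rv number."""
    return {
        name: [strip_strand_suffix(rv) for rv in genes]
        for name, genes in operons.items()
        if len(genes) >= 2
    }
```

Only a lowercase trailing "c" is removed, so other suffixes on Rv numbers are left untouched.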
glmnet
- glmnet_functions.R
- contains functions for running glmnet
- Run_glmnet.R
- script to execute glmnet for predictions
PermTests/Clustering
- DataGathering.R
- contains different permutation test functions for data analysis
- used to find the significance for all gene families, networks, and operons
- contains a clustering function to cluster operons based on specified user input
