Comprehensive dataset of heterogeneous network structures in traditional Chinese medicine research
Data files
Sep 13, 2024 version files 13.98 MB
- herb_compound.csv (1.64 MB)
- herb_efficiacy.csv (76.94 KB)
- Herb_target.csv (9.68 MB)
- herb-symptom.csv (1.01 MB)
- herbherb.csv (24.87 KB)
- README.md (6.90 KB)
- sym-sym.csv (1.13 MB)
- target_target.csv (410.71 KB)
Abstract
This dataset presents a comprehensive collection of heterogeneous network structures commonly encountered in biomedical research. It covers several types of networks: protein-protein interaction networks, herb-target interaction networks, herb-symptom interaction networks, herb-herb association networks, and symptom-symptom association networks. Each network is represented as a graph in which nodes represent entities such as proteins, herbs, symptoms, and efficacies, and edges represent the relationships between these entities. The dataset describes the nodes and edges in each network, along with metadata such as node attributes where applicable. It is intended as a resource for researchers studying complex biological systems and exploring relationships between biomedical entities.
README: Comprehensive Dataset of Heterogeneous Network Structures in Traditional Chinese Medicine Research
https://doi.org/10.5061/dryad.wh70rxwx9
Description of the data and file structure
The dataset is comprised of seven files that encompass a range of networks, including protein-protein interactions, herb-target interactions, herb-symptom interactions, herb-herb associations, and symptom-symptom associations. It features three types of nodes: 1,254 herbs, 1,027 symptoms, and 2,208 protein targets. Additionally, the dataset delineates five types of associations: 1,912 herb-herb associations, 7,220 symptom-symptom associations, 25,133 protein-protein interactions, 168,797 herb-target associations, and 16,739 herb-symptom associations. Below, you'll find detailed descriptions of the contents contained within these seven files.
Herb_target.csv
“Herb_target.csv” is a compressed CSV that contains data on proteins and herbs. It begins with a header row naming the columns, followed by rows that each describe one protein-herb interaction. The columns fall into two groups: the first holds protein attributes (Target ID, Protein Name, Gene Symbol, and Uniprot ID), and the second holds herb attributes (Herb ID, Chinese Pin Yin, English Name, and Latin Name). In this file, "NA" marks missing protein information for a herb-target interaction: no protein name has been identified for that interaction, and the corresponding gene symbol and UniProt ID are likewise unavailable or undocumented.
herb_compound.csv
The “herb_compound.csv” file contains 14,985 compound-herb pairs and includes quality indicators for 1,254 herbs and 1,237 ingredients. The interactions between herbs and compounds are organized into the following fields:
- The CAS number uniquely identifies each chemical compound and is essential for scientific communication.
- The preferred name is based on the IUPAC nomenclature, providing a standardized way to refer to the compound. These compounds contribute to the medicinal benefits of their respective herbs, making them valuable in traditional and modern herbal medicine.
- PubChem ID: A unique identifier assigned to chemical substances in the PubChem database, which provides information on the chemical's properties, biological activities, and safety data.
- ChEMBL ID: A unique identifier for bioactive compounds in the ChEMBL database, which focuses on compound activities relevant to drug discovery and medicinal chemistry.
- Formula: The chemical formula of the compound, indicating the types and quantities of atoms present in the molecule.
- Smiles (Simplified Molecular Input Line Entry System): A compact, human-readable string notation for molecular structure, including atoms, bonds, and functional groups. The Smiles column contains the SMILES representation of each compound, which can be used to identify and compare chemical structures and to predict properties and activities.
- Herb ID: A unique identifier designated for each herb.
A value of "None" in the CAS, PubChem ID, or ChEMBL ID columns indicates a missing or unavailable identifier.
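The "None" convention described above can be handled at load time. The following is a minimal sketch using only the Python standard library; the header names and the sample rows are hypothetical placeholders, so the real header of the file should be checked before adapting it.

```python
import csv
import io

# Hypothetical sample: the header names and values below are placeholders,
# not rows taken from the real herb_compound.csv.
SAMPLE = """CAS,Preferred_name,PubChem_ID,ChEMBL_ID,Formula,Smiles,Herb_ID
000-00-0,compound one,111,CHEMBL_X1,C2H6O,CCO,HERB_A
None,compound two,None,None,CH4,C,HERB_B
"""


def read_rows(handle, missing_token="None"):
    """Yield each CSV row as a dict, mapping the missing-value token to None."""
    for row in csv.DictReader(handle):
        yield {key: (None if value == missing_token else value)
               for key, value in row.items()}


rows = list(read_rows(io.StringIO(SAMPLE)))

# Herbs that have at least one compound without a CAS number.
missing_cas = [row["Herb_ID"] for row in rows if row["CAS"] is None]
print(missing_cas)
```

For the real file, the `io.StringIO(SAMPLE)` argument would be replaced by an opened file handle, for example `open("herb_compound.csv", newline="", encoding="utf-8")`; the encoding is also an assumption.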
herb-symptom.csv
The herb-symptom.csv file contains detailed interactions between herbs and symptoms, organized into the following fields:
- Symptom: Describes the symptom associated with the herb.
- Herb_id: Provides a unique identifier for each herb.
- P-value: Represents the statistical significance value.
- FDR (BH): Denotes the False Discovery Rate adjusted using the Benjamini-Hochberg method.
- FDR (Bonferroni): Shows the P-value adjusted by the Bonferroni correction.
- Relationship: Explains the association between the symptom and the herb.
A value of "None" for P_value, FDR(BH), and FDR(Bonferroni) indicates that no statistical evaluation was performed for that particular interaction.
This dataset is sourced from the Chinese Pharmacopoeia (CHPA), ensuring its credibility. It includes 1,254 unique herb nodes and 1,027 symptom nodes, resulting in a total of 2,281 nodes. Furthermore, it features 16,739 interactions as edges, offering a comprehensive overview of the relationships between herbs and symptoms.
There are two supplementary files related to herbs. One details the associations among herbs (herbherb.csv), while the other lists the therapeutic properties associated with each herb (herb_efficiacy.csv). These files can offer insight into the relationships between herbs and their effectiveness in treating different ailments or health conditions.
Herb-herb association networks indicate relationships among herbs based on shared properties and are available in herbherb.csv. This file has two columns, each representing a herbal medicine, and each row signifies an association between the two herbs. The dataset, compiled from research studies and efficacy-based herb vectors, includes 1,254 herbs connected by 1,912 associations.
Target-target interactions describe the interactions among proteins and are provided in target_target.csv. This file has two columns, each representing a protein, and each row signifies an interaction between the two proteins. This data, sourced from the STRING database, consists of 2,208 protein targets connected by 25,133 interactions.
Symptom-symptom association networks describe relationships between symptoms and are available in sym-sym.csv. This file has two columns, each representing a symptom, and each row signifies an association between the two symptoms. The data is extracted from the Semantic MEDLINE Database (SemMedDB) and encompasses 1,027 symptoms linked by 7,220 associations.
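Because each edge file lists node pairs, the files can be combined into a single heterogeneous graph with typed nodes. The sketch below shows one way to do this with the standard library; the node IDs, the relation labels, and the `add_edges` helper are hypothetical, and the inline pairs stand in for rows read from the CSV files.

```python
from collections import defaultdict


def add_edges(graph, pairs, src_type, dst_type, relation):
    """Add undirected, typed edges; nodes are stored as (type, id) tuples."""
    for src, dst in pairs:
        u, v = (src_type, src), (dst_type, dst)
        graph[u].add((relation, v))
        graph[v].add((relation, u))


graph = defaultdict(set)

# Hypothetical pairs standing in for rows of the five edge files.
add_edges(graph, [("H1", "H2")], "herb", "herb", "herb-herb")
add_edges(graph, [("H1", "P1")], "herb", "target", "herb-target")
add_edges(graph, [("H1", "cough")], "herb", "symptom", "herb-symptom")
add_edges(graph, [("cough", "fever")], "symptom", "symptom", "symptom-symptom")
add_edges(graph, [("P1", "P2")], "target", "target", "target-target")

# All neighbors of herb H1, regardless of relation type.
neighbors = sorted(node for _, node in graph[("herb", "H1")])
print(neighbors)
```

Keeping the node type in the key prevents a herb and a symptom that happen to share an identifier from being merged into one node.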
Access information
The data were collected from various public databases and publications, including the HIT2 database, Chinese Pharmacopoeia (CHPA), SemMedDB, STRING database, and KEGG database. The data sources are typically publicly available online resources and databases accessible to researchers in the field.
Methods
The dataset is composed of seven files that encompass a range of networks, including protein-protein interactions, herb-target interactions, herb-symptom interactions, herb-herb associations, and symptom-symptom associations. It features three types of nodes: 1,254 herbs, 1,027 symptoms, and 2,208 protein targets. Additionally, the dataset includes five types of associations: 1,912 herb-herb associations, 7,220 symptom-symptom associations, 25,133 protein-protein interactions, 168,797 herb-target associations, and 16,739 herb-symptom associations. Below, you'll find detailed descriptions of the contents contained within these seven files.
The first file is a compressed CSV that contains data on proteins and herbs. It begins with a header row naming the columns, followed by rows that each describe one protein-herb interaction. The columns fall into two groups: the first holds protein attributes (Target ID, Protein Name, Gene Symbol, and Uniprot ID), and the second holds herb attributes (Herb ID, Chinese Pin Yin, English Name, and Latin Name). Table 1 presents some of its associated targets. In this file, "NA" marks missing protein information for a herb-target interaction: no protein name has been identified for that interaction, and the corresponding gene symbol and UniProt ID are likewise unavailable or undocumented.
The CSV file encompasses 14,985 pairs of compound-herb activities and includes quality indicators for 1,254 herbs and 1,237 ingredients, detailing the interactions between these herbs and compounds organized into the following fields (Table 2 shows various compounds associated with the herb Coptis deltoidea):
- The CAS number uniquely identifies each chemical compound and is essential for scientific communication.
- The preferred name is based on the IUPAC nomenclature, providing a standardized way to refer to the compound. These compounds contribute to the medicinal benefits of their respective herbs, making them valuable in traditional and modern herbal medicine.
- PubChem ID: A unique identifier assigned to chemical substances in the PubChem database, offering an abundance of information regarding the chemical's properties, biological activities, and safety data.
- ChEMBL ID: A unique identifier for bioactive compounds in the ChEMBL database, focusing on the activities of these compounds relevant to drug discovery and medicinal chemistry.
- Formula: The chemical composition and structure of a compound, indicating the types and quantities of atoms present in the molecule.
- Smiles: (Simplified Molecular Input Line Entry System): SMILES is a string notation used to represent the structure of molecules. It is a compact and human-readable way to describe the molecular structure, including the atoms, bonds, and functional groups. In the CSV file, the Smiles column contains the SMILES representation of the compounds. This information can be used to identify and compare the chemical structures of the compounds, as well as to predict their properties and activities.
- Herb ID: A unique identifier designated for each herb.
In summary, the occurrence of "None" in the CAS, PubChem ID, or ChEMBL ID columns of the herb compound dataset indicates missing or unavailable identifiers.
The other CSV file contains detailed interactions between herbs and symptoms, organized into the following fields:
- Symptom: Describes the symptom associated with the herb.
- Herb_id: Provides a unique identifier for each herb.
- Pinyin name: Represents the Romanized version of the herb's Chinese name.
- Latin name: Lists the herb's Latin botanical name.
- English name: Identifies the herb by its common English name.
- Class in Chinese: Indicates the herb's classification in Chinese.
- Class in English: Describes the herb's classification in English.
- P-value: Represents the statistical significance value.
- FDR (BH): Denotes the False Discovery Rate adjusted using the Benjamini-Hochberg method.
- FDR (Bonferroni): Shows the P-value adjusted by the Bonferroni correction.
- Relationship: Explains the association between the symptom and the herb.
A value of "None" for P_value, FDR(BH), and FDR(Bonferroni) indicates that no statistical evaluation was performed for that particular interaction.
This dataset is sourced from the Chinese Pharmacopoeia (CHPA), ensuring its credibility. It includes 1,254 unique herb nodes and 1,027 symptom nodes, resulting in a total of 2,281 nodes. Furthermore, it features 16,739 interactions as edges, offering a comprehensive overview of the relationships between herbs and symptoms. There are two supplementary files related to herbs. One likely details the interactions among various herbs, while the other appears to outline the therapeutic properties associated with each specific herb. These files can offer valuable insights into the relationships between herbs and their effectiveness in treating different ailments or health conditions.
The herb-herb CSV file contains 1,912 herb-herb pairs and includes quality indicators for 1,254 herbs. Herb-herb associations indicate relationships among herbs based on shared properties.
The target-target CSV file contains 25,133 protein-protein pairs and includes quality indicators for 2,208 protein targets. This data, sourced from the STRING database, consists of 2,208 protein targets connected by 25,133 interactions.
The symptom-symptom CSV file contains 7,220 symptom-symptom pairs and includes quality indicators for 1,027 symptoms. The data is extracted from the Semantic MEDLINE Database (SemMedDB) and encompasses 1,027 symptoms linked by 7,220 associations.
EXPERIMENTAL DESIGN, MATERIALS AND METHODS
This study compiles data on herbs, symptoms, targets, and their interactions from various public databases and publications, as detailed in Table 3. Herb-target and herb-compound associations are obtained from the HIT2 database, while relationships between herbs and their indications are derived from the 2015 edition of the Chinese Pharmacopoeia (CHPA). Herb-herb associations are based on research described in the reference. Furthermore, associations related to herb efficacy were specifically gathered from the Chinese Pharmacopoeia (CHPA). Using these efficacy-based herb relationships, herb vectors are constructed, and cosine similarities between pairs of herbs are calculated, resulting in the development of herb-herb connections. These similarity scores are then applied as edge weights in the network analysis.
Table 3 summarizes information collected from various public databases and publications regarding herbs, symptoms, targets, and their interactions.
| Name | Composition | Source |
| --- | --- | --- |
| Herb-target associations | 1254 herbs, 2208 targets and 168797 herb-target links | HIT2 |
| Herb-compound associations | 1254 herbs, 1237 compound and 14985 herb-compound links | HIT2 |
| Herb-efficacy associations | 829 herbs, 373 efficacies and 3830 herb-efficacy links | CHPA |
| Herb-symptom associations | 465 herbs, 1027 symptoms and 16739 herb-symptom links | CHPA |
| Herb-herb associations | 809 herbs and 1912 links | Herbs linked with similar efficacy |
| Protein-protein interactions | 10622 proteins and 25133 interactions | String10 |
| Symptom-symptom associations | 1027 symptoms and 7220 links | SemMedDB |
Assume there are m types of herbs and n types of efficacies. Each herb is represented by an efficacy vector $V_a = (w_1, w_2, \ldots, w_j, \ldots, w_n)$, where $w_j = 1$ indicates that herb $V_a$ has a relationship with efficacy $j$ and $w_j = 0$ indicates no relationship. The efficacy-based similarity of herb $V_a$ and herb $V_b$ can then be calculated with the standard cosine similarity:

$$\mathrm{sim}(V_a, V_b) = \frac{V_a \cdot V_b}{\lVert V_a \rVert \, \lVert V_b \rVert}$$
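As a small worked illustration of the efficacy-based similarity, the sketch below computes the cosine similarity of two binary efficacy vectors. The vectors are invented for the example, and returning 0.0 for a herb with no recorded efficacy is an assumption of this sketch, not something the dataset description specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: a herb with no efficacies is treated as dissimilar to all.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical herbs described over n = 4 efficacies.
herb_a = [1, 1, 0, 1]
herb_b = [1, 0, 0, 1]

sim = cosine_similarity(herb_a, herb_b)
print(round(sim, 4))
```

Here the two herbs share two efficacies, so the dot product is 2 and the similarity is 2 / (sqrt(3) * sqrt(2)).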
Relationships between symptoms were analyzed using text mining. These connections were first drawn from the Semantic MEDLINE Database (SemMedDB), which contains ternary semantic relationships extracted from the MEDLINE database by the biomedical semantic relation extraction tool SemRep. The significance of each relationship was assessed using Fisher’s exact test, and relationships with P<0.05 were deemed reliable. Protein-protein interactions were obtained from the STRING database. These interactions were filtered using a weight threshold of 700, and the weights were then rescaled linearly by min-max normalization.
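The weight filtering and min-max normalization of protein-protein interactions might look like the sketch below. It assumes that interactions with a weight strictly greater than 700 are the ones retained and that normalization is applied to the retained weights only; the text does not state either point explicitly, and the sample edges are invented.

```python
def filter_and_normalize(edges, threshold=700):
    """Keep edges whose weight exceeds the threshold, then min-max scale weights."""
    # Assumption: "exceeding 700" means edges above the threshold are retained.
    kept = [(u, v, w) for u, v, w in edges if w > threshold]
    if not kept:
        return []
    lo = min(w for _, _, w in kept)
    hi = max(w for _, _, w in kept)
    span = hi - lo
    # If all retained weights are equal, map them to 1.0 to avoid dividing by zero.
    return [(u, v, (w - lo) / span if span else 1.0) for u, v, w in kept]


# Hypothetical weighted interactions (protein, protein, combined score).
edges = [
    ("P1", "P2", 650),
    ("P1", "P3", 750),
    ("P2", "P3", 850),
    ("P3", "P4", 950),
]

normalized = filter_and_normalize(edges)
print(normalized)
```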
We use the chi-square test to evaluate the significance of relationships between herbs and symptoms.
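To make the test concrete, the following sketch computes a Pearson chi-square statistic and P-value for a single herb-symptom pair from a 2x2 contingency table. How the counts are defined (for example, records with and without the herb, and with and without the symptom) is an assumption, since the construction of the table is not described here; the sketch also omits the Yates continuity correction.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 degree of freedom) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("each row and column must have a non-zero total")
    stat = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: rows = herb present / absent, columns = symptom present / absent.
stat, p_value = chi_square_2x2(20, 5, 5, 20)
print(stat, p_value)
```

With these invented counts the statistic is 18.0, which is well beyond the 0.05 significance threshold used in this dataset.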
To pinpoint highly significant and pertinent herb-symptom relationships, we have chosen to consider only those associations with P values under 0.05 as trustworthy results. Due to the extensive range of symptom-herb combinations, identifying these associations presents a significant challenge of multiple comparisons, which requires managing false discovery rates. To tackle this issue, we apply the Bonferroni correction, as detailed in Table 4.
Table 4. Example of herb-symptom relationships for herb SMHB00596

| symptom | Herb_id | P_value | FDR(BH) | FDR(Bonferroni) |
| --- | --- | --- | --- | --- |
| Dyssomnia | SMHB00596 | 0.000577 | 0.003923 | 1 |
| Vitreous Detachment Posterior | SMHB00596 | 0.024232 | 0.052098 | 1 |
| Insomnia | SMHB00596 | 1.00E-05 | 0.000143 | 0.254553 |
| Angina Symptoms | SMHB00596 | 0.024232 | 0.052098 | 1 |
| Hemoptysis | SMHB00596 | 0.000693 | 0.004526 | 1 |
| Epistaxis | SMHB00596 | 0.047887 | 0.077434 | 1 |
| Subwakefullness Syndromes | SMHB00596 | 0.047887 | 0.077434 | 1 |
| Anemic | SMHB00596 | 0.047887 | 0.077434 | 1 |
| Melena | SMHB00596 | 0.000577 | 0.003923 | 1 |
| Palpitation(S) | SMHB00596 | 0.001434 | 0.008137 | 1 |
| Unresponsive To Stimuli | SMHB00596 | 0.024232 | 0.052098 | 1 |
| Anemia | SMHB00596 | 0.024232 | 0.052098 | 1 |
| Cough | SMHB00596 | 0.044011 | 0.07328 | 1 |
| Long Sleeper Syndrome | SMHB00596 | 0.047887 | 0.077434 | 1 |
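The two adjusted columns reported for herb-symptom relationships correspond to standard multiple-testing corrections. The sketch below shows the textbook Bonferroni and Benjamini-Hochberg adjustments on an invented list of P-values; it is not the authors' code, and it will not reproduce the values in Table 4, which depend on the total number of tests in the full analysis.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each P-value by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity of the result.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P-values for four herb-symptom pairs.
raw = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)
print(bonf)
print(bh)
```

Bonferroni is the more conservative of the two, which is consistent with most Bonferroni-adjusted values in the table being capped at 1 while the BH-adjusted values remain small.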