A simplified method for comprehensive capture of the Staphylococcus aureus proteome: S. aureus proteome data table
Data files
May 20, 2025 version files 11.84 MB
- README.md (23.37 KB)
- S._aureus_Proteome_Data_Table_FINAL..xlsx (4.62 MB)
- S._aureus_proteome_data_table_TAB_1.csv (449 B)
- S._aureus_proteome_data_table_TAB_2_TABLE_S1.csv (624.39 KB)
- S._aureus_proteome_data_table_TAB_3_TABLE_S2.csv (5.08 MB)
- S._aureus_proteome_data_table_TAB_4_TABLE_S3.csv (3.24 KB)
- S._aureus_proteome_data_table_TAB_5_TABLE_S4.csv (637.68 KB)
- S._aureus_proteome_data_table_TAB_6_TABLE_S5.csv (850.69 KB)
May 26, 2025 version files 11.83 MB
- DIALib-QC_Quality_Assessment.csv (3.24 KB)
- Differential_Expression_Analysis_Cellular.csv (850.69 KB)
- Differential_Expression_Analysis_Secreted.csv (637.68 KB)
- Legend.csv (567 B)
- README.md (23.42 KB)
- S._aureus_Peptides.csv (5.08 MB)
- S._aureus_Proteins.csv (624.39 KB)
- S._aureus_Proteome_Data_Table_FINAL..xlsx (4.62 MB)
Abstract
Staphylococcus aureus is a major human pathogen causing myriad infections in both community and healthcare settings. Although well studied, a comprehensive exploration of its dynamic and adaptive proteome is still somewhat lacking. Herein, we employed streamlined liquid- and gas-phase fractionation with PASEF analysis on a TIMS-TOF instrument to expand coverage and explore the S. aureus dark proteome. In so doing, we captured the most comprehensive S. aureus proteome to date, totaling 2,231 proteins (85.6% coverage), using a significantly simplified process that demonstrated high reproducibility with minimal input material. We then showcase application of this library for differential expression profiling by investigating temporal dynamics of the S. aureus proteome. This revealed alterations in metabolic processes, ATP production, RNA processing, and stress-response proteins as cultures progressed to stationary growth. Notably, a significant portion of the library (94%) and proteome (80.5%) was identified by this single-shot, DIA-based analysis. Overall, our study shines new light on the unexplored S. aureus proteome, generating a valuable new resource to facilitate further study of this dangerous pathogen.
Dataset DOI: 10.5061/dryad.qbzkh18vz
Data set overview: This dataset contains the data obtained from characterizing the Staphylococcus aureus LAC USA300 proteome using a simplified yet high-throughput methodology for MS/MS capture. The method combines TIMS fractionation with a simplified liquid-phase fractionation on a TIMS-TOF instrument, and is designed to enhance proteome capture while minimizing sample processing and hands-on experimental time. We performed statistical analysis and quality assessment of the resulting DDA-PASEF library of proteins and peptides. Having generated this comprehensive bacterial proteome, we then used it as a resource for differential expression profiling of S. aureus growth phases with a single-shot DIA approach. We characterized the secreted and cellular fractions at mid-exponential phase, entry into stationary phase, and end stationary phase (cultures grown for 3, 6, and 16 hours, respectively); these data are also included in this dataset.
Each CSV file corresponds to one table within the S. aureus proteome data table: S._aureus_Proteome_Data_Table_FINAL.xlsx
Files
Legend.csv: This file contains the legend for the S. aureus proteome data table. Each table name is listed alongside a description of the table and the tab of the Excel spreadsheet on which it is found.
Number of variables: 3
Number of header rows: 1
Number of rows: 7
Variable list:
*Tab number: (text) Excel tab number of each data table.
*Legend/Table Number: (alphanumeric) legend and table designation of each dataset.
*Description: (text) Description of each dataset.
Data type: alphanumeric, text
S._aureus_Proteins.csv: This file contains summary data for each protein identified from the Staphylococcus aureus LAC USA300 strain. Each protein entry includes details such as gene name, protein length, organism, protein coverage, peptide counts, spectral counts, and intensity. Raw mass spectrometry data were searched and processed using FragPipe (v.19.1) with the DIA_SpecLib_Quant workflow. MSBooster, Percolator, and ProteinProphet were used within FragPipe to increase confidence in protein and peptide identifications. A 1% false discovery rate (FDR) cutoff was applied to spectral assignments.
Number of variables: 23
Number of header rows: 1
Number of rows: 2232
Variable list:
*Protein: (alphanumeric) UniProt accession number and identifier for the protein.
*Protein ID: (alphanumeric) UniProt accession number.
*Entry Name: (alphanumeric) UniProt entry name.
*Gene: (alphanumeric) Gene symbol.
*Length: (numeric) Number of amino acids in the protein.
*Organism: (text) Full taxonomic description of the organism.
*Protein Description: (text) Functional annotation of the protein.
*Protein Existence: (text) Evidence level for protein existence.
*Coverage: (numeric) Percentage of protein sequence covered by identified peptides.
*Protein Probability: (numeric) Probability score for protein identification.
*Top Peptide Probability: (numeric) Highest probability among the identified peptides.
*Total Peptides: (numeric) Total number of peptides identified for the protein.
*Unique Peptides: (numeric) Number of unique peptides for the protein.
*Razor Peptides: (numeric) Number of peptides shared between proteins but assigned to the most probable one.
*Total Spectral Count: (numeric) Total MS/MS spectra matched to the protein.
*Unique Spectral Count: (numeric) Spectral count from unique peptides.
*Razor Spectral Count: (numeric) Spectral count from razor peptides.
*Razor Assigned Modifications: (text) Post-translational modifications assigned to razor peptides. M = Methionine Oxidation, C = Carbamidomethyl of Cysteine, N-term = N-terminal modification.
*Razor Observed Modifications: (text) Post-translational modifications observed on razor peptides. M = Methionine Oxidation, C = Carbamidomethyl of Cysteine, N-term = N-terminal modification.
*Indistinguishable Proteins: (alphanumeric) Proteins that cannot be distinguished from the listed protein on the basis of the identified peptides.
Data type: alphanumeric, numeric, text
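The column names in the variable list above can be used directly when reading the file. The following is a minimal sketch using only the Python standard library. The two sample rows, the accession values, and the filtering thresholds are hypothetical and chosen only for illustration; they are not taken from the dataset or from the authors' analysis.

```python
import csv
import io

# Hypothetical two-row excerpt; the real file has 23 columns and different values.
SAMPLE = """Protein ID,Gene,Length,Coverage,Unique Peptides
P00001,geneA,120,45.5,6
P00002,geneB,300,8.2,1
"""

def load_proteins(handle):
    """Read protein rows and convert the numeric columns used below."""
    rows = []
    for row in csv.DictReader(handle):
        row["Length"] = int(row["Length"])
        row["Coverage"] = float(row["Coverage"])
        row["Unique Peptides"] = int(row["Unique Peptides"])
        rows.append(row)
    return rows

def well_supported(rows, min_unique=2, min_coverage=10.0):
    """Keep proteins meeting illustrative (not author-defined) thresholds."""
    return [r for r in rows
            if r["Unique Peptides"] >= min_unique and r["Coverage"] >= min_coverage]

proteins = load_proteins(io.StringIO(SAMPLE))
kept = well_supported(proteins)
print([p["Protein ID"] for p in kept])
```

To run this against the real file, replace `io.StringIO(SAMPLE)` with an open file handle such as `open("S._aureus_Proteins.csv", newline="")`; the file encoding is not stated in this README and may need to be specified.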
S._aureus_Peptides.csv: This file contains peptide-level mass spectrometry data for the protein library identified in Table S1 for the Staphylococcus aureus LAC USA300 strain. Each row represents a peptide along with its associated metadata, including sequence, modifications, protein origin, spectral data, and protein description. Both assigned and observed post-translational modifications (PTMs) are included, such as methionine oxidation, carbamidomethylation, and N-terminal modifications.
Number of variables: 17
Number of header rows: 1
Number of rows: 37340
Variable list:
*Peptide: (alphanumeric) Amino acid sequence of the identified peptide.
*Prev AA: (character) Amino acid that precedes the peptide in the protein sequence.
*Next AA: (character) Amino acid that follows the peptide in the protein sequence.
*Peptide Length: (numeric) Number of amino acid residues in the captured peptide.
*Charges: (numeric list) Observed charge states of the peptide during MS analysis.
*Probability: (numeric) Confidence score of the peptide-spectrum match (PSM).
*Spectral Count: (numeric) Number of spectra corresponding to this peptide.
*Intensity: (numeric) Measured peptide intensity.
*Assigned Modifications: (text) Modifications assigned to this peptide. M = Methionine Oxidation, C = Carbamidomethyl of Cysteine, N-term = N-terminal modification.
*Observed Modifications: (text) Modifications directly identified from mass spectral data. M = Methionine Oxidation, C = Carbamidomethyl of Cysteine, N-term = N-terminal modification.
*Protein: (alphanumeric) UniProt accession number and identifier for the matched protein.
*Protein ID: (alphanumeric) UniProt accession number for the matched protein.
*Entry Name: (alphanumeric) UniProt entry name of the matched protein.
*Gene: (alphanumeric) Gene symbol of the matched protein.
*Protein Description: (text) Functional annotation or description of the matched protein.
*Mapped Genes: (text) Genes mapped from the matched protein.
*Mapped Proteins: (text) Functional protein group or family associated with the matched protein.
Data type: alphanumeric, numeric, character, text.
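Because each row of the peptide table carries the accession of its matched protein, peptide-level evidence can be rolled up per protein. The sketch below groups rows by the Protein ID column and collects peptide counts, spectral counts, and observed charge states. The sample rows are hypothetical, and the assumption that the Charges column holds a comma-separated list of integers has not been checked against the file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt; the comma-separated "Charges" layout is an assumption.
SAMPLE = '''Peptide,Charges,Spectral Count,Protein ID
AAAK,"2, 3",5,P00001
CCCR,2,1,P00001
DDDK,3,4,P00002
'''

def parse_charges(cell):
    """Turn a cell such as '2, 3' into [2, 3]; empty tokens are skipped."""
    return [int(tok) for tok in cell.split(",") if tok.strip()]

def summarize(handle):
    """Aggregate peptide count, spectral count and charge states per protein."""
    summary = defaultdict(lambda: {"peptides": 0, "spectra": 0, "charges": set()})
    for row in csv.DictReader(handle):
        entry = summary[row["Protein ID"]]
        entry["peptides"] += 1
        entry["spectra"] += int(row["Spectral Count"])
        entry["charges"].update(parse_charges(row["Charges"]))
    return dict(summary)

per_protein = summarize(io.StringIO(SAMPLE))
print(per_protein["P00001"])
```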
DIALib-QC_Quality_Assessment.csv: This file contains information from the quality assessment using the DIALib-QC tool (publicly available) of the generated Staphylococcus aureus DDA-PASEF library of proteins and peptides. Headings are labelled with the parameters that were assessed using this tool.
Number of variables: 15
Number of header rows: 1
Number of rows: 3
Variable list with abbreviations in the row below (highlighted in bold):
*Name of library file being analyzed (library): (text) Name assigned to the spectral library file.
*Library format (OpenSWATH, Peakview, or Spectronaut) (format): (text) Format used for the library.
*Number of peptide ions (i.e., precursor, sequence + modifications + charge) (pepions): (numeric) Total count of peptide ions including sequence, modifications, and charge states.
*Number of fragments (fragment ions) in library (Fragments): (numeric) Total number of fragment ions included in the library.
*Percentage of proteotypic peptide ions (not shared) (ptp_percent): (numeric) Percent of peptide ions that map to a single protein and are not shared between proteins.
*Percentage of shared peptide ions (shared_percent): (numeric) Percent of total peptide ions that are shared.
*Number of shared peptide ions (shared_pepions): (numeric) Count of shared peptide ions in the library.
*Number of distinct peptide sequences (peptides): (numeric) Unique peptide sequences without modification.
*Number of distinct modified peptides (sequences + modifications) (mod_peps): (numeric) Unique peptide sequences including modifications.
*Percentage of distinct modified peptides with mass modification (mod_percent): (numeric) Proportion of peptides with mass-modifying PTMs.
*Number of mass-modified amino acids (total_mods): (numeric) Total number of individual amino acids with mass modifications.
*Percentage of charge 2 precursors (chg_2): (numeric) Proportion of precursors with a +2 charge state.
*Percentage of charge 3 precursors (chg_3): (numeric) Proportion of precursors with a +3 charge state.
*Minimum precursor m/z (mass/charge) in library (precursor_min): (numeric) Lowest mass-to-charge ratio for a precursor ion in the library.
*Maximum precursor m/z in library (precursor_max): (numeric) Highest mass-to-charge ratio for a precursor ion in the library.
*Average peptide length (avg_len): (numeric) Mean number of amino acids per peptide.
*Average number of fragments per assay (precursor) (avg_num_frags): (numeric) Mean number of fragment ions associated with each precursor.
*Average fragment sequence length (avg_frag_len): (numeric) Mean length of the fragment ion sequences.
*Percentage of assays with 5 or fewer transitions (short_perc): (numeric) Fraction of assays that include five or fewer transitions (fragment ions).
*Percentage of fragment m/z above precursor m/z (fragment_above_precursor): (numeric) Percentage of fragments with a higher m/z than the precursor ion.
*Percentage of y ions (y_perc): (numeric) Proportion of fragments classified as y-type ions.
*Percentage of b ions (b_perc): (numeric) Proportion of fragments classified as b-type ions.
*Percentage of y ions considering only the top 6 fragments per assay (t6_y_perc): (numeric) Percentage of y-type ions among the top 6 most intense fragments per assay.
*Average intensity (avg_intensity): (numeric) Mean intensity of the peptide fragments (NaN if not available).
*Minimum retention time (RT) in library (rt_min): (numeric) Shortest retention time of any peptide in the library. Units: seconds (s).
*Maximum RT in library (rt_max): (numeric) Longest retention time of any peptide in the library. Units: seconds (s).
*Median RT in library (rt_med): (numeric) Median retention time of peptides. Units: seconds (s).
*r-squared value of fit between RT of +2 and +3 charge states for the same modified peptide (rt_rsq): (numeric) R² correlation for RT alignment across charge states.
*Percentage of +2/+3 charge pairs of the same modified peptide within 5 RT units of each other (rt_five): (numeric) Consistency of RT for peptides across charge states.
*Number of iRT peptides in library (n_irt): (numeric) Count of indexed retention time (iRT) standards present in the library.
Data type: numeric, text
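Since the abbreviations sit in the row directly below the long parameter names, it can be convenient to key the values by abbreviation. The sketch below assumes a three-row layout (long names, abbreviations, values), which is inferred from the row count and the note on abbreviations above rather than verified against the file. The sample values and the library file name are hypothetical.

```python
import csv
import io

# Hypothetical values; assumed layout: long names, abbreviations, values.
SAMPLE = """Name of library file being analyzed,Percentage of proteotypic peptide ions (not shared),Percentage of shared peptide ions,Average intensity
library,ptp_percent,shared_percent,avg_intensity
example_lib.tsv,98.5,1.5,NaN
"""

def to_number(text):
    """Convert a cell to float where possible, otherwise keep the text."""
    try:
        return float(text)
    except ValueError:
        return text

def read_qc(handle):
    """Map each abbreviation (row 2) to its value (row 3)."""
    rows = list(csv.reader(handle))
    _long_names, abbreviations, values = rows[:3]
    return {abbr: to_number(val) for abbr, val in zip(abbreviations, values)}

qc = read_qc(io.StringIO(SAMPLE))
# Proteotypic and shared percentages describe complementary sets of peptide ions.
total = qc["ptp_percent"] + qc["shared_percent"]
print(qc["library"], total)
```

The final check reflects the definitions above: a peptide ion is either proteotypic or shared, so the two percentages would be expected to sum to roughly 100.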
Differential_Expression_Analysis_Secreted.csv: This file contains quantitative data from the single-shot DIA global analysis of Staphylococcus aureus LAC USA300 proteins identified across three time points (3, 6, and 16 hours). These time points characterize different growth phases of S. aureus: mid-exponential phase (3 h), entry into stationary phase (6 h), and end stationary phase (16 h). The analysis used the DDA-PASEF spectral library of the S. aureus proteome described in Tables S1 and S2. This file covers the secreted fraction at these growth time points. Protein groups are listed with their associated gene annotations and descriptions. Replicate intensity values and the corresponding log2-transformed values are included, as are the original intensity values prior to imputation and normalization. Ontological categorizations are provided, and statistical comparisons between conditions are made to evaluate changes in protein abundance. Proteins in purple text are those for which at least one of the two groups being compared has 2 or fewer MS-detected values prior to imputation. Proteins highlighted in red are those for which all four replicates were imputed and there was no original MS detection.
Number of variables: 83
Number of header rows: 1
Number of rows: 501
Variable list:
*Protein_Group: (alphanumeric) UniProt group identifier for clustered proteins.
*Protein_Ids: (alphanumeric) UniProt protein accession(s) in the group.
*Protein_Ids_1: (alphanumeric) Primary protein UniProt accession in the group.
*Protein_Names: (text) UniProt identifier for the primary protein.
*Genes: (alphanumeric) Gene symbol(s) for all proteins in the group.
*Genes_1: (alphanumeric) Gene symbol for the primary protein.
*First_Protein_Description: (text) Functional annotation of the first protein in the group.
*Sec Emp 16h v 6h: (numeric) Ratio comparison between 16-hour and 6-hour conditions.
*Sec Emp 16h v 3h: (numeric) Ratio comparison between 16-hour and 3-hour conditions.
*Sec Emp 6h v 3h: (numeric) Ratio comparison between the 6-hour and 3-hour conditions.
Replicate Intensities (numeric):
*3hr1Sec_LS25 to 3hr4Sec_LS28: Protein intensities for 3-hour secretory replicates.
*6hr1Sec_LS29 to 6hr4Sec_LS32: Protein intensities for 6-hour secretory replicates.
*16hr1Sec_LS33 to 16hr4Sec_LS36: Protein intensities for 16-hour secretory replicates.
Log2 of Replicate Intensities (numeric):
*Log2 3hr1Sec_LS25 to 3hr4Sec_LS28: Log 2 Protein intensities for 3-hour secretory replicates.
*Log2 6hr1Sec_LS29 to 6hr4Sec_LS32: Log 2 Protein intensities for 6-hour secretory replicates.
*Log2 16hr1Sec_LS33 to 16hr4Sec_LS36: Log 2 Protein intensities for 16-hour secretory replicates.
*ANOVA Significant: (text) A one-way ANOVA was performed, followed by a post hoc Tukey’s HSD test (FDR<0.05) to determine the significance of differential expression when comparing the proteomes at different time points (16 vs. 3 h, 16 vs. 6 h, and 6 vs. 3 h) for the secretory fraction.
*Significant pairs: (alphanumeric) Pairwise comparisons for the secretory fraction time points that are ANOVA significant.
*Cluster: (alphanumeric) Cluster designation based on hierarchical clustering.
*GOBP name: (text) name of Gene Ontology Biological Process for all proteins in the group
*GOMF name: (text) name of Gene Ontology Molecular Function for all proteins in the group
*GOCC name: (text) name of Gene Ontology Cellular Component for all proteins in the group
*KEGG name: (text) name from the Kyoto Encyclopedia of Genes and Genomes
*GOBP name_: (text) name of Gene Ontology Biological Process for the main protein
*GOMF name_: (text) name of Gene Ontology Molecular Function for the main protein
*GOCC name_: (text) name of Gene Ontology Cellular Component for the main protein
*Uniprot function: (text) function assigned to the protein from the Uniprot database
*Uniprot subcellular location: (text) subcellular location assigned to the protein from the Uniprot database
*“-Log ANOVA p value”: (numeric) -Log of the ANOVA p value
*Protein_Group: (alphanumeric) Group identifier for the proteins in the group.
*Protein_Ids: (alphanumeric) UniProt ID(s) for the proteins in the group.
*Protein Ids_1: (alphanumeric) UniProt ID for the primary protein.
*Protein_Names: (alphanumeric) UniProt identifiers for the protein(s) in the group.
*Genes: (text) Gene name for the proteins in the group.
*Genes_1: (text) Gene name for the primary protein.
*First_Protein_Description: (text) Functional annotation of the primary protein.
Protein Intensities after Imputation and Normalization:
*Original Cell 3hr1_LS1+13 to Original Cell 3hr4_LS4+16: (numeric) Protein intensity values from the secreted fraction of each 3-hour replicate.
*Original Cell 6hr1_LS5+17 to Original Cell 6hr4_LS8+20: (numeric) Protein intensity values from the secreted fraction of each 6-hour replicate.
*Original Cell 16hr1_LS9+21 to Original Cell 16hr4_LS12+24: (numeric) Protein intensity values from the secreted fraction of each 16-hour replicate.
Protein Intensities prior to Imputation and normalization:
*Original 3hr1Sec_LS25 to Original 3hr2Sec_LS26: (numeric) LFQ intensities for the secreted fraction of the 3-hour replicates prior to imputation and normalization.
*rep: (numeric) Number of replicates with identified intensities for the 3-hour replicates (preceding columns).
*Original 6hr1Sec_LS29 to Original 6hr3Sec_LS31: (numeric) LFQ intensities for the secreted fraction of the 6-hour replicates prior to imputation and normalization.
*rep: (numeric) Number of replicates with identified intensities for the 6-hour replicates (preceding columns).
*Original 16hr2Sec_LS34 to Original 16hr3Sec_LS35: (numeric) LFQ intensities for the secreted fraction of the 16-hour replicates prior to imputation and normalization.
*rep: (numeric) Number of replicates with identified intensities for the 16-hour replicates (preceding columns).
*Genes_1_end: (alphanumeric) Gene names for the primary protein.
*Significant pairs: (alphanumeric) Pairwise comparisons of time points that are significant.
Abbreviations:
Sec = Secreted
GOBP = Gene Ontology Biological Process
GOMF = Gene Ontology Molecular Function
GOCC= Gene Ontology Cellular Component
Data type: alphanumeric, numeric, text
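The variable list above repeats several column names (for example Protein_Group, Genes, rep, and Significant pairs). Python's `csv.DictReader` keeps only the last value when header names collide, so earlier columns with the same name would be silently lost. The sketch below shows one way to make repeated names unique before building row dictionaries. The numeric suffix scheme and the shortened sample header are illustrative assumptions, not part of the dataset.

```python
import csv
import io

def unique_names(header):
    """Append .1, .2, ... to the second and later occurrences of a name."""
    seen = {}
    out = []
    for name in header:
        count = seen.get(name, 0)
        out.append(name if count == 0 else f"{name}.{count}")
        seen[name] = count + 1
    return out

# Hypothetical, shortened header in which "rep" appears twice.
SAMPLE = """Genes_1,Original A,Original B,rep,Original C,Original D,rep
geneA,10,,1,5,6,2
"""

def read_rows(handle):
    """Read rows into dictionaries keyed by de-duplicated column names."""
    reader = csv.reader(handle)
    header = unique_names(next(reader))
    return [dict(zip(header, row)) for row in reader]

rows = read_rows(io.StringIO(SAMPLE))
print(rows[0]["rep"], rows[0]["rep.1"])
```

With this scheme the first rep column (3-hour replicates) stays `rep`, and the later ones become `rep.1` and `rep.2` in file order.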
Differential_Expression_Analysis_Cellular.csv: This file contains quantitative data from the single-shot DIA global analysis of Staphylococcus aureus LAC USA300 proteins identified across three time points (3, 6, and 16 hours). These time points characterize different growth phases of S. aureus: mid-exponential phase (3 h), entry into stationary phase (6 h), and end stationary phase (16 h). The analysis used the DDA-PASEF spectral library of the S. aureus proteome described in Tables S1 and S2. This file covers the cellular fraction at these growth time points. Protein groups are listed with their associated gene annotations and descriptions. Replicate intensity values and the corresponding log2-transformed values are included, as are the original intensity values prior to imputation and normalization. Ontological categorizations are provided, and statistical comparisons between conditions are made to evaluate changes in protein abundance. Proteins in purple text are those for which at least one of the two groups being compared has 2 or fewer MS-detected values prior to imputation. Proteins highlighted in red are those for which all four replicates were imputed and there was no original MS detection.
Number of variables: 83
Number of header rows: 1
Number of rows: 709
Variable list:
*Protein_Group: (alphanumeric) UniProt group identifier for clustered proteins.
*Protein_Ids: (alphanumeric) UniProt protein accession(s) in the group.
*Protein_IDs_1: (alphanumeric) Primary protein UniProt accession in the group.
*Protein_Names: (text) UniProt identifier for the primary protein.
*Genes: (alphanumeric) Gene symbol(s) for all proteins in the group.
*Genes_1: (alphanumeric) Gene symbol for the primary protein.
*First_Protein_Description: (text) Functional annotation of the first protein in the group.
*Cell Emp 16h vs 6h: (numeric) Ratio comparison between 16-hour and 6-hour conditions.
*Cell Emp 16h vs 3h: (numeric) Ratio comparison between 16-hour and 3-hour conditions.
*Cell Emp 6h vs 3h: (numeric) Ratio comparison between the 6-hour and 3-hour conditions.
Replicate Intensities (numeric):
*Cell 3hr1_LS1+13 to Cell 3hr4_LS4+16: Protein intensities for 3-hour cellular replicates.
*Cell 6hr1_LS5+17 to Cell 6hr4_LS8+20: Protein intensities for 6-hour cellular replicates.
*Cell 16hr1_LS9+21 to Cell 16hr4_LS12+24: Protein intensities for 16-hour cellular replicates.
Log2 of Replicate Intensities (numeric):
*Log2 Cell 3hr1_LS1+13 to Log2 Cell 3hr4_LS4+16: Log 2 Protein intensities for 3-hour cellular replicates.
*Log2 Cell 6hr1_LS5+17 to Log2 Cell 6hr4_LS8+20: Log 2 Protein intensities for 6-hour cellular replicates.
*Log2 Cell 16hr1_LS9+21 to Log2 Cell 16hr4_LS12+24: Log 2 Protein intensities for 16-hour cellular replicates.
*ANOVA Significant: (text) A one-way ANOVA was performed, followed by a post hoc Tukey’s HSD test (FDR<0.05) to determine the significance of differential expression when comparing the proteomes at different time points (16 vs. 3 h, 16 vs. 6 h, and 6 vs. 3 h) for the cellular fraction.
*Significant pairs: (alphanumeric) Pairwise comparisons for the cellular fraction time points that are ANOVA significant.
*Cluster: (alphanumeric) Cluster designation based on hierarchical clustering
*GOBP name: (text) name of Gene Ontology Biological Process for all proteins in the group
*GOMF name: (text) name of Gene Ontology Molecular Function for all proteins in the group
*GOCC name: (text) name of Gene Ontology Cellular Component for all proteins in the group
*KEGG name: (text) name from the Kyoto Encyclopedia of Genes and Genomes
*GOBP name_: (text) name of Gene Ontology Biological Process for the main protein
*GOMF name_: (text) name of Gene Ontology Molecular Function for the main protein
*GOCC name_: (text) name of Gene Ontology Cellular Component for the main protein
*Uniprot function: (text) function assigned to the protein from the Uniprot database
*Uniprot subcellular location: (text) subcellular location assigned to the protein from the Uniprot database
*“-Log ANOVA p value”: (numeric) -Log of the ANOVA p value
*Protein_Group: (alphanumeric) Group identifier for the proteins in the group.
*Protein_Ids: (alphanumeric) UniProt ID(s) for the proteins in the group.
*Protein Ids_1: (alphanumeric) UniProt ID for the primary protein.
*Protein_Names: (alphanumeric) UniProt identifiers for the protein(s) in the group.
*Genes: (text) Gene name for the proteins in the group.
*Genes_1: (text) Gene name for the primary protein.
*First_Protein_Description: (text) Functional annotation of the primary protein.
Protein Intensities prior to Imputation and normalization:
*Original Cell 3hr1_LS1+13 to Original Cell 3hr4_LS4+16: (numeric) LFQ intensities for the cellular fraction of the 3-hour replicates prior to imputation and normalization.
*rep: (numeric) Number of replicates with identified intensities for the 3-hour replicates (preceding columns).
*Original Cell 6hr1_LS5+17 to Original Cell 6hr4_LS8+20: (numeric) LFQ intensities for the cellular fraction of the 6-hour replicates prior to imputation and normalization.
*rep: (numeric) Number of replicates with identified intensities for the 6-hour replicates (preceding columns).
*Original Cell 16hr1_LS9+21 to Original Cell 16hr4_LS12+24: (numeric) LFQ intensities for the cellular fraction of the 16-hour replicates prior to imputation and normalization.
*rep: (numeric) Number of replicates with identified intensities for the 16-hour replicates (preceding columns).
Raw Protein Intensities:
*Original 3hr1Sec_LS25 to Original 3hr4Sec_LS28: (numeric) Protein intensity values from the cellular fraction of each 3-hour replicate.
*Original 6hr1Sec_LS29 to Original 6hr4Sec_LS32: (numeric) Protein intensity values from the cellular fraction of each 6-hour replicate.
*Original 16hr2Sec_LS34 to Original 16hr3Sec_LS35: (numeric) Protein intensity values from the cellular fraction of each 16-hour replicate.
*Genes_1_end: (alphanumeric) Gene names for the primary protein.
*Significant pairs: (alphanumeric) Pairwise comparisons of time points that are significant.
Abbreviations:
Cell = Cellular
GOBP = Gene Ontology Biological Process
GOMF = Gene Ontology Molecular Function
GOCC = Gene Ontology Cellular Component
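CSV files do not carry text colour or highlighting, so the purple and red markings described for the two differential expression files are not visible in the CSV versions. The sketch below shows how equivalent flags might be rebuilt from the three rep columns (replicates detected before imputation at each time point). It rests on one reading of the descriptions above: "red" is taken to mean that a time point in the comparison has zero detected replicates, and "red" is given precedence over "purple". The replicate counts in the example are hypothetical.

```python
from itertools import combinations

def flag_comparison(reps_a, reps_b):
    """Classify one pairwise comparison from detected-replicate counts."""
    if reps_a == 0 or reps_b == 0:
        return "red"      # assumed: one group was entirely imputed
    if reps_a <= 2 or reps_b <= 2:
        return "purple"   # at least one group has 2 or fewer detected values
    return "none"

def flag_all(rep_counts):
    """Return a flag for every pair of time points, later time point first."""
    return {
        f"{later} v {earlier}": flag_comparison(rep_counts[later], rep_counts[earlier])
        for earlier, later in combinations(rep_counts, 2)
    }

# Hypothetical counts of detected replicates (out of four) for one protein.
counts = {"3h": 4, "6h": 2, "16h": 0}
flags = flag_all(counts)
print(flags)
```

The keys follow the ordering of the comparison columns (16h v 6h, 16h v 3h, 6h v 3h), which relies on the dictionary listing time points from earliest to latest.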
Version Changes:
02-22-25: Files were given more descriptive names at the journal's request. No changes were made to the data.
