“Tear down that wall” – updating the vocabulary of phage and bacterial lytic proteins
Data files
Oct 23, 2023 version files 38.06 GB
- lysin_vocabulary.HMM_analysis_v2.tar.gz
- README.md
Abstract
Lytic enzymes (often called lysins) break down peptidoglycan, which in turn results in the death of bacteria. As a consequence, they have become one of the most promising alternatives to antibiotics. Such enzymes have been in the research spotlight and some of them have been proven to be effective and safe. The aim of this work was to gather the scattered data on functional and structural diversity of lysins, as well as available bioinformatic tools and data repositories used in studies of these proteins.
In the introductory part of this paper, we disambiguate the terminology of lytic proteins, which, over the years, has grown to be inconsistent in different branches of science, and delineate major lysin groups. We also review the databases and programs that can be harnessed in lysin studies, and put particular emphasis on repositories of Hidden Markov Models. Finally, we describe a comprehensive, meticulously curated set of lysin proteins, protein families and domains, and sort them into clusters that reflect major families. Thus, this work is a guide through a convoluted tangle of terms, concepts, databases, models, approaches and programs used to detect lytic enzymes.
This dataset is from the paper: "Tear down that wall - updating the vocabulary of phage and bacterial lytic proteins"
Authors: Sophia Baldysz, Jakub Barylski, Robert Nawrot
Contact e-mail address: jakub.barylski@amu.edu.pl
Reference Information
Provenance for this README
- File name: README_TearWall.txt
- Author: Jakub Barylski
- Other contributors: Sophia Baldysz
- Date created: 2023-10-17
Contact Information
- Name: Jakub Barylski
- Affiliations: Department of Molecular Virology
Institute of Experimental Biology
Faculty of Biology
Adam Mickiewicz University in Poznan
Collegium Biologicum - ul. Uniwersytetu Poznańskiego 6
61-614 Poznan, Poland
- ORCID ID: https://orcid.org/0000-0001-6630-6932
- Email: jakub.barylski@amu.edu.pl
- Name: Sophia Baldysz
- Affiliations: Department of Molecular Virology
Institute of Experimental Biology
Faculty of Biology
Adam Mickiewicz University in Poznan
Collegium Biologicum - ul. Uniwersytetu Poznańskiego 6
61-614 Poznan, Poland
- ORCID ID: https://orcid.org/0000-0001-9968-3929
- Email: sopbal@amu.edu.pl
Dataset Attribution and Usage
- Dataset Title: "Endolysin-related HMMs"
- Persistent Identifier: https://doi.org/10.5061/dryad.fbg79cnsp
- Source code: https://github.com/zwmuam/lysin_vocabulary
Methodological Information
We selected a comprehensive set of lysin-related sequences, domains and families.
First, we compiled a de-replicated set of bona fide lysins from EnzyBase ("Lysin90.NonGM_EnzyBase.fasta").
Then, we conducted HMMER searches ("hmmsearch_results") against the EnzyBase lysin set
and the "background" set of bacterial and phage UniRef90 proteins
("UniRef90pb.Caudoviricetes_2731619_Bacteria_2_unclassified_dsDNA_phages_79205_UniProt.faa").
Based on these searches, we selected HMMs overrepresented in the lysin set.
Finally, we compared them to the complete model database and visualised the comparisons as "cytoscape_networks" based on:
- co-location ("hmmsearch_results" analysed through Cytoscape_table_from_colocation_analysis.py),
- direct HMM-HMM similarity ("hhblits_results" analysed through Cytoscape_table_from_hhblits.py).
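The co-location idea can be illustrated with a minimal Python sketch. It is not the logic of Cytoscape_table_from_colocation_analysis.py, which is not reproduced here. It assumes that search hits have already been reduced to (protein, HMM) pairs and simply counts how many proteins each pair of HMMs shares. The function name, the input layout and the protein identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def colocation_edges(hits):
    """Count proteins shared by each pair of HMMs.

    hits: iterable of (protein_id, hmm_id) pairs (hypothetical input layout).
    Returns {(hmm_a, hmm_b): n_shared_proteins} with hmm_a < hmm_b.
    """
    hmms_per_protein = defaultdict(set)
    for protein_id, hmm_id in hits:
        hmms_per_protein[protein_id].add(hmm_id)

    edges = defaultdict(int)
    for hmm_ids in hmms_per_protein.values():
        # Every unordered pair of HMMs found on the same protein is one co-occurrence.
        for pair in combinations(sorted(hmm_ids), 2):
            edges[pair] += 1
    return dict(edges)


# Toy example with made-up protein identifiers.
toy_hits = [
    ("protA", "PFAM__CHAP"), ("protA", "PFAM__SH3_5"),
    ("protB", "PFAM__CHAP"), ("protB", "PFAM__SH3_5"),
    ("protC", "PFAM__Amidase_2"),
]
toy_edges = colocation_edges(toy_hits)
for (hmm_a, hmm_b), shared in sorted(toy_edges.items()):
    print(hmm_a, hmm_b, shared, sep="\t")
```

The resulting pairs with their counts form a simple edge table (source, target, weight) of the kind Cytoscape can import as a network.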
External data
Input data were retrieved from:
Sequences:
http://biotechlab.fudan.edu.cn/database/EnzyBase
https://www.uniprot.org
Family/Domain databases:
http://eggnog5.embl.de (ver. 5.0)
http://pfam.xfam.org (ver. 33.0)
http://dmk-brain.ecn.uiowa.edu/pVOGs (accessed 01.02.2020)
https://vogdb.org (ver. vog99)
Bioinformatic analysis
Protein clustering (de-replication, MMseqs2 v 13.45111):
mmseqs easy-cluster ${rep_input_fasta} ${output_prefix} tmp --dbtype 1 --cov-mode 0 -c 0.8 --min-seq-id 0.9
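As an illustration of how the clustering output might be inspected, here is a minimal Python sketch. `mmseqs easy-cluster` writes a two-column, tab-separated file named `<output_prefix>_cluster.tsv` (representative, member). The function name and the sequence identifiers below are hypothetical and the sketch is not part of the published pipeline.

```python
from collections import defaultdict


def read_clusters(lines):
    """Group member IDs by cluster representative from two-column TSV lines."""
    clusters = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        representative, member = line.split("\t")[:2]
        clusters[representative].append(member)
    return dict(clusters)


# Toy lines with made-up identifiers; for real data pass an open file handle,
# e.g. open("<output_prefix>_cluster.tsv").
toy_lines = ["seq1\tseq1\n", "seq1\tseq2\n", "seq3\tseq3\n"]
toy_clusters = read_clusters(toy_lines)
cluster_sizes = {rep: len(members) for rep, members in toy_clusters.items()}
print(cluster_sizes)
```

Each key is a cluster representative, i.e. one sequence of the de-replicated set; the value lists all sequences collapsed into it.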
HMMER queries (HMMER v 3.1b2, http://hmmer.org/download.html):
hmmsearch --domtblout ${output_domtblout} --noali --cpu 15 ${hmm_database} ${derep_input_fasta} &> ${log_file}
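The sketch below shows one way to read a domtblout file and to look for HMMs overrepresented in the lysin set. It is an illustration only: the independent E-value cutoff (1e-3), the pseudocount and the ratio of hit fractions are assumptions made for this example, not the selection criterion used in the paper, and all function names are hypothetical. In hmmsearch output the target is the protein sequence and the query is the HMM.

```python
from collections import defaultdict


def hits_per_hmm(lines, max_ievalue=1e-3):
    """Map each HMM (query) to the set of proteins (targets) it hits.

    Expects hmmsearch --domtblout lines: whitespace-separated, comments start with '#'.
    Columns used (0-based): 0 target name, 3 query name, 12 independent E-value.
    """
    hits = defaultdict(set)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split(maxsplit=22)  # the 23rd field is a free-text description
        target, query, i_evalue = fields[0], fields[3], float(fields[12])
        if i_evalue <= max_ievalue:  # assumed cutoff, not taken from the paper
            hits[query].add(target)
    return hits


def enrichment(lysin_hits, background_hits, n_lysin, n_background):
    """Ratio of hit fractions (lysin set / background set) per HMM, with a pseudocount."""
    ratios = {}
    for hmm, proteins in lysin_hits.items():
        lysin_fraction = len(proteins) / n_lysin
        background_fraction = (len(background_hits.get(hmm, ())) + 1) / (n_background + 1)
        ratios[hmm] = lysin_fraction / background_fraction
    return ratios


def toy_line(target, query, i_evalue):
    """Build a made-up domtblout line with the standard 22 columns plus description."""
    fields = [target, "-", "250", query, "-", "120", "1e-30", "100.0", "0.1",
              "1", "1", "1e-32", i_evalue, "99.0", "0.1",
              "1", "120", "5", "125", "3", "127", "0.95", "toy description"]
    return " ".join(fields)


lysin_lines = ["# comment line", toy_line("lys1", "CHAP", "1e-20"),
               toy_line("lys2", "CHAP", "1e-10"), toy_line("lys2", "LysM", "0.5")]
background_lines = [toy_line("bg1", "CHAP", "1e-15")]

lysin_hits = hits_per_hmm(lysin_lines)
background_hits = hits_per_hmm(background_lines)
ratios = enrichment(lysin_hits, background_hits, n_lysin=2, n_background=99)
print(ratios)
```

For the real files, the handles would be opened on the "Lysin90.*.domtblout" and "UniRef90pb.*.domtblout" tables. The background tables are very large, so streaming them line by line as above avoids loading them into memory.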
HMM-HMM comparisons (hh-suite 3.3.0):
hhblits -i ${query_hmm} -d ${path_to_merged_databases_hhsuite} -o ${output_domtblout_hhr} -blasttab ${output_domtblout_btb}
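A minimal sketch of turning the tabular hhblits output into network edges follows. It assumes the 12-column layout documented for the `-blasttab` option (query, target, #match/tLen, alnLen, #mismatch, #gapOpen, qstart, qend, tstart, tend, eval, score). The E-value cutoff, the function name and the toy values are assumptions for illustration, and this is not the code of Cytoscape_table_from_hhblits.py.

```python
def btb_edges(lines, max_evalue=1e-3):
    """Yield (query, target, evalue, score) for hits passing the E-value cutoff.

    Assumes 12 whitespace-separated columns in the order:
    query, target, #match/tLen, alnLen, #mismatch, #gapOpen,
    qstart, qend, tstart, tend, eval, score.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        query, target = fields[0], fields[1]
        evalue, score = float(fields[10]), float(fields[11])
        # Self-hits carry no information for a similarity network.
        if query != target and evalue <= max_evalue:
            yield query, target, evalue, score


# Toy lines with made-up numbers.
toy_btb = [
    "PFAM__CHAP\tPFAM__CHAP\t1.00\t120\t0\t0\t1\t120\t1\t120\t1E-40\t150.0\n",
    "PFAM__CHAP\tCOG3942\t0.35\t110\t70\t2\t5\t115\t40\t150\t1E-12\t80.5\n",
    "PFAM__CHAP\tCOG0000\t0.10\t30\t27\t1\t1\t30\t1\t30\t5\t12.0\n",
]
edges = list(btb_edges(toy_btb))
for query, target, evalue, score in edges:
    print(query, target, evalue, score, sep="\t")
```

Only the second toy line survives: the first is a self-hit and the third fails the assumed E-value cutoff.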
Network visualisations were created using Cytoscape 3.9.1
Other analyses were conducted using scripts submitted to: https://github.com/zwmuam/lysin_vocabulary
The included files were the basis of the HMM analysis performed in the paper.
File types
- fasta/faa - standard protein fasta sequence format
- domtblout - tabular files generated by hmmscan/hmmsearch with "--domtblout" flag
- ffdata/ffindex - hhsuite database files
- hhr/btb - "hhblits default"/"tabular BLAST" format of hhblits HMM-HMM comparison (one HMM against a whole database)
- cys - Cytoscape networks
Data Structure
[193G] lysin_vocabulary.HMM_analysis_v2
├── [1.8M] cytoscape_networks # cytoscape importable HMM-HMM similarity networks
│ ├── [984K] Colocation_analysis.cys
│ └── [825K] HMM2HMM_hhblits.cys
├── [ 26M] hhblits_results # raw HMM-HMM comparisons
│ ├── [147K] begg_2ZCQV.btb
│ ├── [272K] begg_2ZCQV.hhr
│ ├── [338K] begg_306IT.btb
│ ├── [247K] begg_306IT.hhr
│ ├── [164K] begg_309RS.btb
│ ├── [219K] begg_309RS.hhr
│ ├── [141K] begg_30IKH.btb
│ ├── [275K] begg_30IKH.hhr
│ ├── [ 24K] begg_30PT3.btb
│ ├── [ 26K] begg_30PT3.hhr
│ ├── [317K] begg_31TVC.btb
│ ├── [151K] begg_31TVC.hhr
│ ├── [327K] begg_32E28.btb
│ ├── [225K] begg_32E28.hhr
│ ├── [ 25K] begg_32F1I.btb
│ ├── [113K] begg_32F1I.hhr
│ ├── [154K] begg_32G2Z.btb
│ ├── [145K] begg_32G2Z.hhr
│ ├── [ 18K] begg_32R25.btb
│ ├── [ 74K] begg_32R25.hhr
│ ├── [114K] begg_32YYR.btb
│ ├── [267K] begg_32YYR.hhr
│ ├── [978K] begg_338B2.btb
│ ├── [352K] begg_338B2.hhr
│ ├── [ 14K] begg_33Q34.btb
│ ├── [ 27K] begg_33Q34.hhr
│ ├── [ 12K] begg_346W6.btb
│ ├── [ 12K] begg_346W6.hhr
│ ├── [890K] begg_34AXW.btb
│ ├── [296K] begg_34AXW.hhr
│ ├── [109K] begg_COG0739.btb
│ ├── [ 48K] begg_COG0739.hhr
│ ├── [ 84K] begg_COG0791.btb
│ ├── [171K] begg_COG0791.hhr
│ ├── [134K] begg_COG0860.btb
│ ├── [ 14K] begg_COG0860.hhr
│ ├── [382K] begg_COG1652.btb
│ ├── [438K] begg_COG1652.hhr
│ ├── [430K] begg_COG1705.btb
│ ├── [102K] begg_COG1705.hhr
│ ├── [119K] begg_COG1876.btb
│ ├── [ 57K] begg_COG1876.hhr
│ ├── [127K] begg_COG2989.btb
│ ├── [ 67K] begg_COG2989.hhr
│ ├── [ 32K] begg_COG3409.btb
│ ├── [ 34K] begg_COG3409.hhr
│ ├── [ 36K] begg_COG3757.btb
│ ├── [ 29K] begg_COG3757.hhr
│ ├── [ 83K] begg_COG3772.btb
│ ├── [ 31K] begg_COG3772.hhr
│ ├── [271K] begg_COG3807.btb
│ ├── [495K] begg_COG3807.hhr
│ ├── [ 24K] begg_COG3942.btb
│ ├── [ 49K] begg_COG3942.hhr
│ ├── [227K] begg_COG4193.btb
│ ├── [ 38K] begg_COG4193.hhr
│ ├── [ 92K] begg_COG5263.btb
│ ├── [ 66K] begg_COG5263.hhr
│ ├── [ 43K] begg_COG5632.btb
│ ├── [ 15K] begg_COG5632.hhr
│ ├── [ 12K] PFAM__Amidase02_C.btb
│ ├── [ 35K] PFAM__Amidase02_C.hhr
│ ├── [ 10K] PFAM__Amidase_2.btb
│ ├── [ 32K] PFAM__Amidase_2.hhr
│ ├── [7.5K] PFAM__Amidase_3.btb
│ ├── [ 21K] PFAM__Amidase_3.hhr
│ ├── [ 26K] PFAM__Amidase_5.btb
│ ├── [104K] PFAM__Amidase_5.hhr
│ ├── [ 14K] PFAM__CHAP.btb
│ ├── [ 73K] PFAM__CHAP.hhr
│ ├── [ 11K] PFAM__CW_7.btb
│ ├── [ 36K] PFAM__CW_7.hhr
│ ├── [ 10K] PFAM__CW_binding_1.btb
│ ├── [ 53K] PFAM__CW_binding_1.hhr
│ ├── [137K] PFAM__DUF3597.btb
│ ├── [ 44K] PFAM__DUF3597.hhr
│ ├── [8.1K] PFAM__G5.btb
│ ├── [9.5K] PFAM__G5.hhr
│ ├── [8.6K] PFAM__Glucosaminidase.btb
│ ├── [ 42K] PFAM__Glucosaminidase.hhr
│ ├── [7.2K] PFAM__Glyco_hydro_25.btb
│ ├── [ 12K] PFAM__Glyco_hydro_25.hhr
│ ├── [ 11K] PFAM__GW.btb
│ ├── [ 41K] PFAM__GW.hhr
│ ├── [ 23K] PFAM__Lys.btb
│ ├── [ 46K] PFAM__Lys.hhr
│ ├── [ 13K] PFAM__LysM.btb
│ ├── [ 98K] PFAM__LysM.hhr
│ ├── [8.6K] PFAM__Peptidase_M15_4.btb
│ ├── [ 34K] PFAM__Peptidase_M15_4.hhr
│ ├── [9.8K] PFAM__Peptidase_M23.btb
│ ├── [ 78K] PFAM__Peptidase_M23.hhr
│ ├── [ 15K] PFAM__PG_binding_1.btb
│ ├── [ 90K] PFAM__PG_binding_1.hhr
│ ├── [ 11K] PFAM__Phage_lysozyme.btb
│ ├── [ 32K] PFAM__Phage_lysozyme.hhr
│ ├── [ 19K] PFAM__Prophage_tail.btb
│ ├── [156K] PFAM__Prophage_tail.hhr
│ ├── [9.2K] PFAM__PSA_CBD.btb
│ ├── [ 12K] PFAM__PSA_CBD.hhr
│ ├── [ 29K] PFAM__SH3_3.btb
│ ├── [233K] PFAM__SH3_3.hhr
│ ├── [ 12K] PFAM__SH3_5.btb
│ ├── [ 35K] PFAM__SH3_5.hhr
│ ├── [9.5K] PFAM__SLH.btb
│ ├── [ 54K] PFAM__SLH.hhr
│ ├── [9.3K] PFAM__VanY.btb
│ ├── [ 36K] PFAM__VanY.hhr
│ ├── [7.1K] PFAM__ZoocinA_TRD.btb
│ ├── [9.2K] PFAM__ZoocinA_TRD.hhr
│ ├── [ 23K] pVOGS_VOG0352.btb
│ ├── [ 58K] pVOGS_VOG0352.hhr
│ ├── [ 45K] pVOGS_VOG10230.btb
│ ├── [ 82K] pVOGS_VOG10230.hhr
│ ├── [ 20K] pVOGS_VOG11038.btb
│ ├── [103K] pVOGS_VOG11038.hhr
│ ├── [417K] pVOGS_VOG3721.btb
│ ├── [125K] pVOGS_VOG3721.hhr
│ ├── [ 33K] pVOGS_VOG4565.btb
│ ├── [ 51K] pVOGS_VOG4565.hhr
│ ├── [ 25K] pVOGS_VOG4574.btb
│ ├── [ 18K] pVOGS_VOG4574.hhr
│ ├── [359K] pVOGS_VOG4599.btb
│ ├── [315K] pVOGS_VOG4599.hhr
│ ├── [656K] pVOGS_VOG4615.btb
│ ├── [328K] pVOGS_VOG4615.hhr
│ ├── [ 23K] pVOGS_VOG4649.btb
│ ├── [100K] pVOGS_VOG4649.hhr
│ ├── [ 49K] pVOGS_VOG4666.btb
│ ├── [ 95K] pVOGS_VOG4666.hhr
│ ├── [ 21K] pVOGS_VOG4707.btb
│ ├── [ 26K] pVOGS_VOG4707.hhr
│ ├── [153K] pVOGS_VOG4724.btb
│ ├── [ 42K] pVOGS_VOG4724.hhr
│ ├── [ 19K] pVOGS_VOG4772.btb
│ ├── [ 48K] pVOGS_VOG4772.hhr
│ ├── [ 22K] pVOGS_VOG4824.btb
│ ├── [102K] pVOGS_VOG4824.hhr
│ ├── [ 38K] pVOGS_VOG4865.btb
│ ├── [120K] pVOGS_VOG4865.hhr
│ ├── [ 34K] pVOGS_VOG4910.btb
│ ├── [ 79K] pVOGS_VOG4910.hhr
│ ├── [ 65K] pVOGS_VOG4918.btb
│ ├── [272K] pVOGS_VOG4918.hhr
│ ├── [ 20K] pVOGS_VOG4941.btb
│ ├── [ 39K] pVOGS_VOG4941.hhr
│ ├── [ 95K] pVOGS_VOG4985.btb
│ ├── [193K] pVOGS_VOG4985.hhr
│ ├── [174K] pVOGS_VOG6576.btb
│ ├── [ 96K] pVOGS_VOG6576.hhr
│ ├── [ 16K] pVOGS_VOG7635.btb
│ ├── [ 53K] pVOGS_VOG7635.hhr
│ ├── [ 37K] pVOGS_VOG8236.btb
│ ├── [136K] pVOGS_VOG8236.hhr
│ ├── [ 14K] pVOGS_VOG8294.btb
│ ├── [ 31K] pVOGS_VOG8294.hhr
│ ├── [ 57K] pVOGS_VOG9502.btb
│ ├── [162K] pVOGS_VOG9502.hhr
│ ├── [143K] vegg_4QAVB.btb
│ ├── [112K] vegg_4QAVB.hhr
│ ├── [ 38K] vegg_4QAZY.btb
│ ├── [ 71K] vegg_4QAZY.hhr
│ ├── [ 47K] vegg_4QB01.btb
│ ├── [192K] vegg_4QB01.hhr
│ ├── [110K] vegg_4QBQV.btb
│ ├── [198K] vegg_4QBQV.hhr
│ ├── [ 38K] vegg_4QCGD.btb
│ ├── [125K] vegg_4QCGD.hhr
│ ├── [ 36K] vegg_4QECV.btb
│ ├── [ 82K] vegg_4QECV.hhr
│ ├── [ 35K] vegg_4QENP.btb
│ ├── [126K] vegg_4QENP.hhr
│ ├── [ 38K] vegg_4QEUP.btb
│ ├── [ 73K] vegg_4QEUP.hhr
│ ├── [288K] vog2017_VOG00190.btb
│ ├── [299K] vog2017_VOG00190.hhr
│ ├── [ 55K] vog2017_VOG00298.btb
│ ├── [222K] vog2017_VOG00298.hhr
│ ├── [190K] vog2017_VOG00438.btb
│ ├── [101K] vog2017_VOG00438.hhr
│ ├── [123K] vog2017_VOG00798.btb
│ ├── [168K] vog2017_VOG00798.hhr
│ ├── [ 71K] vog2017_VOG00953.btb
│ ├── [149K] vog2017_VOG00953.hhr
│ ├── [421K] vog2017_VOG01046.btb
│ ├── [269K] vog2017_VOG01046.hhr
│ ├── [ 28K] vog2017_VOG01142.btb
│ ├── [ 36K] vog2017_VOG01142.hhr
│ ├── [ 16K] vog2017_VOG01143.btb
│ ├── [ 33K] vog2017_VOG01143.hhr
│ ├── [ 57K] vog2017_VOG01370.btb
│ ├── [ 60K] vog2017_VOG01370.hhr
│ ├── [ 40K] vog2017_VOG01885.btb
│ ├── [132K] vog2017_VOG01885.hhr
│ ├── [229K] vog2017_VOG02043.btb
│ ├── [ 70K] vog2017_VOG02043.hhr
│ ├── [ 80K] vog2017_VOG02530.btb
│ ├── [215K] vog2017_VOG02530.hhr
│ ├── [ 30K] vog2017_VOG03024.btb
│ ├── [ 90K] vog2017_VOG03024.hhr
│ ├── [ 45K] vog2017_VOG03088.btb
│ ├── [ 84K] vog2017_VOG03088.hhr
│ ├── [586K] vog2017_VOG03420.btb
│ ├── [240K] vog2017_VOG03420.hhr
│ ├── [337K] vog2017_VOG04785.btb
│ ├── [102K] vog2017_VOG04785.hhr
│ ├── [ 72K] vog2017_VOG09725.btb
│ ├── [264K] vog2017_VOG09725.hhr
│ ├── [390K] vog2017_VOG09928.btb
│ ├── [124K] vog2017_VOG09928.hhr
│ ├── [ 65K] vog2017_VOG11215.btb
│ ├── [212K] vog2017_VOG11215.hhr
│ ├── [206K] vog2017_VOG11220.btb
│ ├── [ 65K] vog2017_VOG11220.hhr
│ ├── [ 14K] vog2017_VOG11377.btb
│ ├── [ 44K] vog2017_VOG11377.hhr
│ ├── [ 50K] vog2017_VOG12191.btb
│ ├── [ 53K] vog2017_VOG12191.hhr
│ ├── [ 39K] vog2017_VOG18536.btb
│ ├── [ 43K] vog2017_VOG18536.hhr
│ ├── [ 88K] vog2017_VOG18969.btb
│ ├── [ 59K] vog2017_VOG18969.hhr
│ ├── [ 14K] vog2017_VOG21001.btb
│ ├── [ 31K] vog2017_VOG21001.hhr
│ ├── [ 93K] vog2017_VOG21772.btb
│ ├── [106K] vog2017_VOG21772.hhr
│ ├── [ 22K] vog2017_VOG22320.btb
│ ├── [ 52K] vog2017_VOG22320.hhr
│ ├── [ 32K] vog2017_VOG23062.btb
│ ├── [ 84K] vog2017_VOG23062.hhr
│ ├── [ 66K] vog2017_VOG23388.btb
│ ├── [157K] vog2017_VOG23388.hhr
│ ├── [ 97K] vog2017_VOG23635.btb
│ ├── [297K] vog2017_VOG23635.hhr
│ ├── [ 89K] vog2017_VOG24666.btb
│ └── [ 52K] vog2017_VOG24666.hhr
├── [128G] hmmsearch_results # results of HMMer3 search
│ ├── [515K] Lysin90.AllvogHMMprofiles.domtblout
│ ├── [3.5M] Lysin90.bacterialHMMdbupdated.domtblout
│ ├── [519K] Lysin90.Pfam-A.domtblout
│ ├── [179K] Lysin90.viralHMMdb.domtblout
│ ├── [883K] Lysin90.vog2017db.domtblout
│ ├── [3.7G] UniRef90pb.AllvogHMMprofiles.hmm.domtblout
│ ├── [ 97G] UniRef90pb.bacterialHMMdbupdated.hmm.domtblout
│ ├── [ 18G] UniRef90pb.Pfam-A.hmm.domtblout
│ ├── [2.5G] UniRef90pb.viralHMMdb.hmm.domtblout
│ └── [7.3G] UniRef90pb.vog2017db.hmm.domtblout
├── [ 46G] merged_databases_hhsuite # master HH-suite3 database files
│ ├── [ 10G] AllvogHMMprofiles__bacterialHMMdbupdated__Pfam-A__viralHMMdb__vog2017db_a3m.ffdata
│ ├── [7.3M] AllvogHMMprofiles__bacterialHMMdbupdated__Pfam-A__viralHMMdb__vog2017db_a3m.ffindex
│ ├── [ 56M] AllvogHMMprofiles__bacterialHMMdbupdated__Pfam-A__viralHMMdb__vog2017db_cs219.ffdata
│ ├── [6.3M] AllvogHMMprofiles__bacterialHMMdbupdated__Pfam-A__viralHMMdb__vog2017db_cs219.ffindex
│ ├── [5.5G] AllvogHMMprofiles__bacterialHMMdbupdated__Pfam-A__viralHMMdb__vog2017db_hhm.ffdata
│ ├── [7.3M] AllvogHMMprofiles__bacterialHMMdbupdated__Pfam-A__viralHMMdb__vog2017db_hhm.ffindex
│ ├── [ 31G] AllvogHMMprofiles__bacterialHMMdbupdated__Pfam-A__viralHMMdb__vog2017db_msa.ffdata
│ └── [7.2M] AllvogHMMprofiles__bacterialHMMdbupdated__Pfam-A__viralHMMdb__vog2017db_msa.ffindex
├── [190K] Lysin90.NonGM_EnzyBase.fasta # "positive" endolysin sequence dataset
├── [189K] Negative.NonLysin_UniRef90pb.fasta # "negative" NON-endolysin sequence dataset used for testing and debugging purposes
└── [5.3G] UniRef90pb.Caudoviricetes_2731619_Bacteria_2_unclassified_dsDNA_phages_79205_UniProt.faa # "background" bacterial/phage protein dataset
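Because most of the volume sits in "hmmsearch_results" and "merged_databases_hhsuite", it may be convenient to unpack only the smaller parts of the archive. The sketch below uses Python's standard tarfile module on a tiny in-memory archive. The helper name and the toy paths are hypothetical, and the prefixes to skip should be checked against the member names of the real archive.

```python
import io
import tarfile


def small_members(archive, skip_prefixes):
    """Return archive members whose path does not start with any skipped prefix."""
    return [member for member in archive.getmembers()
            if not any(member.name.startswith(prefix) for prefix in skip_prefixes)]


# Build a toy gzipped tar archive in memory (made-up paths and content).
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as toy:
    for name in ("data/cytoscape_networks/a.cys",
                 "data/hmmsearch_results/big.domtblout"):
        payload = b"toy"
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        toy.addfile(info, io.BytesIO(payload))

# Re-open the archive for reading and keep everything outside the skipped directory.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r:gz") as archive:
    kept = [member.name for member in small_members(archive, ("data/hmmsearch_results",))]
print(kept)
```

With the real file, the list returned by `small_members` could be passed to `archive.extractall(path, members=...)`. Note that listing the members of a gzipped tar still requires reading through the whole archive once.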
Methods
Original data were retrieved from:
http://biotechlab.fudan.edu.cn/database/EnzyBase
Family/Domain databases:
http://eggnog5.embl.de (ver. 5.0)
http://pfam.xfam.org (ver. 33.0)
http://dmk-brain.ecn.uiowa.edu/pVOGs (accessed 01.02.2020)
https://vogdb.org (ver. vog99)
Protein clustering (de-replication, MMseqs2 v 13.45111):
mmseqs easy-cluster ${rep_input_fasta} ${output_prefix} tmp --dbtype 1 --cov-mode 0 -c 0.8 --min-seq-id 0.9
HMMER queries (HMMER v 3.1b2, http://hmmer.org/download.html):
hmmsearch --domtblout ${output_domtblout} --noali --cpu 15 ${hmm_database} ${derep_input_fasta} &> ${log_file}
HMM-HMM comparisons (hh-suite 3.3.0):
hhblits -i ${query_hmm} -d ${path_to_merged_databases_hhsuite} -o ${output_domtblout_hhr} -blasttab ${output_domtblout_btb}
Network visualisations were created using Cytoscape 3.9.1
Other analyses were conducted using scripts submitted to: https://github.com/zwmuam/lysin_vocabulary