Machine learning suggests that small size helps broaden plasmid host range
Data files
Oct 24, 2023 version, files total 376.39 MB
- data_matrix.csv (541.53 KB)
- fastANI_edgeweights (122.84 MB)
- plasmid_net_connected.cys (142.35 MB)
- plasmid_net_grouped_MOB.cys (110.65 MB)
- README.md (5.96 KB)
Nov 03, 2023 version, files total 376.39 MB
- data_matrix.csv (541.53 KB)
- fastANI_edgeweights (122.84 MB)
- plasmid_net_connected.cys (142.35 MB)
- plasmid_net_grouped_MOB.cys (110.65 MB)
- README.md (5.81 KB)
Jun 18, 2024 version, files total 376.39 MB
- fastANI_edgeweights (122.84 MB)
- feature_matrix_F3.csv (541.53 KB)
- plasmid_net_connected.cys (142.35 MB)
- plasmid_net_grouped_MOB.cys (110.65 MB)
- README.md (10.82 KB)
Abstract
Plasmids mediate gene exchange across taxonomic barriers through conjugation, shaping bacterial evolution for billions of years. While plasmid mobility can be harnessed for genetic engineering and drug-delivery applications, rapid plasmid-mediated spread of resistance genes has rendered most clinical antibiotics useless. To solve this urgent and growing problem, we must understand how plasmids spread across bacterial communities. Here, we applied machine-learning models to identify features that are important for extending the plasmid host range. We assembled an up-to-date dataset of more than thirty thousand bacterial plasmids, separated them into 1125 clusters, and assigned each cluster a distribution possibility score, taking into account the host distribution of each taxonomic rank and the sampling bias of the existing sequencing data. Using this score and an optimized plasmid feature pool, we built a model stack consisting of DecisionTreeRegressor, EvoTreeRegressor, and LGBMRegressor as base models and LinearRegressor as a meta-learner. Our mathematical modeling revealed that sequence brevity is the most important determinant for plasmid spread, followed by P-loop NTPases, mobility factors, and β-lactamases. Our results, together with other recent findings, suggest that small plasmids may broaden their range by evading host defenses and using alternative modes of transfer instead of autonomous conjugation.
https://doi.org/10.5061/dryad.1g1jwsv31
Description of the data and file structure
There are four files in this dataset:
(1) "fastANI_edgeweights": the edge weights used for plasmid clustering with Leidenalg (see the leidenalg documentation) and its CPMVertexPartition. The original edge weights were obtained by running FastANI (ParBLiSS/FastANI on github.com) and then transformed into the final edge weights as follows: ANI/100 if ANI >= 95; 1/(1 + 20*(1 - ANI/100)) otherwise. Headers: field 1 is the query sequence; field 2 is the target sequence; field 3 is the final edge weight between these two sequences.
(2) "plasmid_net_connected.cys": the original calculated plasmid network.
(3) "plasmid_net_grouped_MOB.cys": the original plasmid network grouped into 1125 clusters (>= 3 members) inferred by Leidenalg. The nodes are colored according to the three plasmid mobility types: orange, conjugative; blue, mobilizable; green, non-mobilizable.
(4) "feature_matrix_F3.csv": the final data matrix for machine-learning model training and testing. Headers: Psum is the sum of plasmid distribution possibility; the other headers are features, listed below:
feature | description or short name |
---|---|
smallest | size < 20 kbp |
small | 20 kbp - 50 kbp |
average | 50 kbp - 100 kbp |
large | > 100 kbp |
lower | < 30% GC |
low | 30% - 40% GC |
mean | 40% - 50% GC |
high | 50% - 60% GC |
higher | > 60% GC |
circular | plasmid topology |
linear | plasmid topology |
rep_cluster_707 | replication type from Lactobacillaceae |
MOBF | relaxase_types |
MOBP | relaxase_types |
MOBQ | relaxase_types |
MOBC | relaxase_types |
MOBH | relaxase_types |
MOBV | relaxase_types |
MPF_F | mpf_types |
MPF_I | mpf_types |
MPF_T | mpf_types |
oriT_MOBP | oriT_types |
PF09140.14 | MipZ, P-loop_NTPase |
PF00528.25 | BPD_transp_1 |
PF03466.23 | LysR_substrate |
PF04335.16 | VirB8 |
PF13728.9 | TraF |
PF16932.8 | T4SS_TraI |
PF00271.34 | Helicase_C |
PF13817.9 | DDE_Tnp_IS66_C |
PF11799.11 | IMS_C |
PF13520.9 | AA_permease_2 |
PF01751.25 | Toprim |
PF03524.18 | CagX |
PF12293.11 | T4BSS_DotH_IcmK |
PF12696.10 | TraG-D_C, P-loop_NTPase |
PF02796.18 | HTH_7 |
PF04945.16 | YHS |
PF00486.31 | Trans_reg_C |
PF13304.9 | AAA_21, P-loop_NTPase |
PF02775.24 | TPP_enzyme_C |
PF06586.14 | TraK |
PF00126.30 | HTH_1 |
PF02776.21 | TPP_enzyme_N |
PF13560.9 | HTH_31 |
PF00702.29 | Hydrolase |
PF00005.30 | ABC_tran, P-loop_NTPase |
PF00145.20 | DNA_methylase |
PF06067.14 | DUF932 |
PF01443.21 | Viral_helicase1, P-loop_NTPase |
PF00665.29 | rve |
PF06986.14 | F_T4SS_TraN |
PF02899.20 | Phage_int_SAM_1 |
PF08388.14 | GIIM |
PF06952.14 | PsiA |
PF02463.22 | SMC_N, P-loop_NTPase |
PF01609.24 | DDE_Tnp_1 |
PF18821.4 | LPD7 |
PF07690.19 | MFS_1 |
PF00239.24 | Resolvase |
PF07732.18 | Cu-oxidase_3 |
PF01131.23 | Topoisom_bac |
PF13700.9 | DUF4158 |
PF05101.16 | VirB3 |
PF13671.9 | AAA_33, P-loop_NTPase |
PF08751.14 | TrwC relaxase |
PF06406.14 | StbA_N |
PF01464.23 | SLT |
PF06122.14 | TraH |
PF07424.14 | TrbM |
PF00440.26 | TetR_N |
PF00892.23 | EamA |
PF19357.2 | DUF5934 |
PF06290.14 | PsiB |
PF00753.30 | Lactamase_B |
PF00532.24 | Peripla_BP_1 |
PF13683.9 | rve_3 |
PF16509.8 | KORA |
PF00497.23 | SBP_bac_3 |
PF00816.24 | Histone_HNS |
PF13586.9 | DDE_Tnp_1_2 |
PF00817.23 | IMS |
PF08534.13 | Redoxin |
PF09673.13 | TrbC_Ftype |
PF07996.14 | T4SS |
PF08535.13 | KorB |
PF13333.9 | rve_2 |
PF16816.8 | DotD |
PF13610.9 | DDE_Tnp_IS240 |
PF13518.9 | HTH_28 |
PF04956.16 | TrbC |
PF09676.13 | TraV |
PF02737.21 | 3HCDH_N |
PF07728.17 | AAA_5, P-loop_NTPase |
PF09677.13 | TrbI_Ftype |
PF03135.17 | CagE_TrbE_VirB |
PF00589.25 | Phage_integrase |
PF13276.9 | HTH_21 |
PF01527.23 | HTH_Tnp_1 |
PF13434.9 | Lys_Orn_oxgnase |
PF03743.17 | TrbI |
PF01850.24 | PIN |
PF12831.10 | FAD_oxidored |
PF13007.10 | LZ_Tnp_IS66 |
PF13340.9 | DUF4096 |
PF02374.18 | ArsA_ATPase, P-loop_NTPase |
PF00436.28 | SSB |
PF00717.26 | Peptidase_S24 |
PF19044.3 | P-loop_TraG, P-loop_NTPase |
PF02534.17 | T4SS-DNA_transf, P-loop_NTPase |
PF07015.14 | VirC1, P-loop_NTPase |
PF00437.23 | T2SSE, P-loop_NTPase |
PF02195.21 | ParBc |
PF10609.12 | ParA, P-loop_NTPase |
PF06564.15 | CBP_BcsQ, P-loop_NTPase |
PF00122.23 | E1-E2_ATPase, P-type ATPase |
PF02518.29 | HATPase_c |
PF13751.9 | DDE_Tnp_1_6 |
PF01022.23 | HTH_5 |
PF05309.14 | TraE |
PF01266.27 | DAO |
PF01547.28 | SBP_bac_1 |
PF05513.14 | TraA |
PF13708.9 | DUF4942 |
PF11393.11 | T4BSS_DotI_IcmL |
PF04610.17 | TrbL |
PF13342.9 | Toprim_Crpt |
PF13416.9 | SBP_bac_8 |
PF13604.9 | AAA_30, P-loop_NTPase |
PF00462.27 | glutaredoxin |
PF02811.22 | PHP |
PF07733.15 | DNA_pol3_alpha |
PF17657.4 | DNA_pol3_finger |
PF13154.9 | DUF3991 |
PF05284.15 | DUF736 |
PF13155.9 | Toprim_2 |
PF13442.9 | cytochrome_CBB3 |
PF14579.9 | HHH_6 |
PF00034.24 | cytochrom_C |
PF18555.4 | MobL |
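
The edge-weight transformation described for "fastANI_edgeweights" can be written out as a small function. This is a minimal sketch of the stated formula only, not the authors' code: the function name `ani_to_edge_weight` and the input range check are assumptions made for illustration.

```python
def ani_to_edge_weight(ani: float) -> float:
    """Convert a FastANI identity value (percent) into a final edge weight.

    Implements the rule stated in the file description:
        ANI/100                        if ANI >= 95
        1 / (1 + 20 * (1 - ANI/100))   otherwise
    The range check below is an assumption added for safety.
    """
    if not 0.0 <= ani <= 100.0:
        raise ValueError(f"ANI must be a percentage in [0, 100], got {ani}")
    if ani >= 95.0:
        return ani / 100.0
    return 1.0 / (1.0 + 20.0 * (1.0 - ani / 100.0))


if __name__ == "__main__":
    # Print the weight for a few illustrative ANI values.
    for value in (100.0, 97.5, 95.0, 90.0, 80.0):
        print(f"ANI {value:5.1f} -> weight {ani_to_edge_weight(value):.4f}")
```

Under this rule, any pair with ANI below 95 receives a weight below 0.5, while pairs at or above 95 receive at least 0.95, so the transformation separates near-identical plasmids sharply from more distant ones.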
Code/Software
The network (.cys) files can be explored with Cytoscape, an open-source platform for complex network analysis and visualization. The code needed for machine-learning training and testing is available on GitHub (BingWangK/plasmid_project_IAlab).
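
For readers who want to inspect "feature_matrix_F3.csv" before running the full pipeline, the sketch below separates the Psum target from the feature columns using only the Python standard library. It rests on assumptions not confirmed by this description: that the file is comma-delimited, that it has a single header row, and that every column is numeric (an identifier column, if present, would need to be skipped). The function name `load_feature_matrix` and the toy rows are hypothetical and are not taken from the dataset.

```python
import csv
import io


def load_feature_matrix(handle, target="Psum"):
    """Split a feature matrix into feature names, feature rows, and targets.

    Assumes a comma-delimited file with one header row in which every
    column, including the target, holds numeric values.
    """
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or target not in reader.fieldnames:
        raise ValueError(f"column {target!r} not found in header")
    feature_names = [name for name in reader.fieldnames if name != target]
    features, targets = [], []
    for row in reader:
        targets.append(float(row[target]))
        features.append([float(row[name]) for name in feature_names])
    return feature_names, features, targets


# Toy input with made-up values, used only to show the expected layout.
toy_csv = io.StringIO(
    "Psum,smallest,MOBP,PF00753.30\n"
    "0.82,1,0,1\n"
    "0.10,0,1,0\n"
)
names, X, y = load_feature_matrix(toy_csv)
print(names)
print(X)
print(y)
```

With the real file, the same function could be called as `load_feature_matrix(open("feature_matrix_F3.csv", newline=""))`, provided the assumptions above hold.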