The human gut microbiome contains many bacterial strains of the same species (‘strain-level variants’). Describing strains in a biologically meaningful manner, rather than as purely taxonomic objects, is an important goal but is challenging due to the complexity of strain-level variation. Here, we measured patterns of co-evolution across >7,000 strains spanning the bacterial tree of life. Using these patterns as a prior for studying hundreds of gut commensal strains that we isolated, sequenced, and metabolically profiled revealed widespread structure beneath the phylogenetic level of species. Defining strains by their co-evolutionary signatures enabled us to predict their metabolic phenotypes and to engineer consortia from strain genome content alone. Our findings demonstrate a biologically relevant organization to strain-level variation and motivate a new schema for describing bacterial strains based on their evolutionary history.
README: An evolution-based framework for describing human gut bacteria
Extended Data for "An evolution-based framework for describing human gut bacteria".
Description of the data and file structure
Data: xlsx file
· Panel A: Data matrix of Orthologous Gene Group (OGG) annotations for each of the 669 strains in the Commensal Strain Bank. The first column holds the unique ID for each strain. Subsequent columns hold the data for each OGG listed in the first row. Each value denotes the number of sequences from the strain's coding regions that mapped to that OGG (see methods).
· Panel B: Projections of each of the 669 strains onto the principal components describing co-evolutionary variation in UniProt (see methods). The first column is the strain ID; subsequent columns correspond to each of the 7047 principal components, with each value denoting the contribution of strain i to principal component k.
· Panel C: Spectral tree of all 669 strains from the Commensal Strain Bank in Newick format.
Table: S3 (CSV file)
Patristic distances between each strain in the UniProt Spectral Tree. Patristic distance is a unitless distance metric calculated as the sum of branch lengths between two leaves of a phylogenetic tree.
Columns:
strain_1 (UniProt ID for strain 1)
strain_2 (UniProt ID for strain 2)
patristic_distance (patristic distance between strain 1 and strain 2 in the UniProt spectral tree)
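To make the definition of patristic distance concrete, the following is a minimal sketch that sums branch lengths between two leaves of a small toy tree and emits rows in the same three-column layout as Table S3. The tree, its node names, and its branch lengths are invented for illustration and are not taken from the UniProt Spectral Tree; the parent-pointer representation is likewise an assumption chosen for brevity, not the format of the deposited files.

```python
from itertools import combinations

# Hypothetical toy tree: child -> (parent, branch length to parent).
# Names and lengths are made up for illustration only.
TOY_TREE = {
    "A": ("n1", 0.5),
    "B": ("n1", 0.25),
    "n1": ("root", 1.0),
    "C": ("root", 2.0),
}


def path_to_root(tree, leaf):
    """Map each node on the leaf-to-root path to its distance from the leaf."""
    dist = {leaf: 0.0}
    node, total = leaf, 0.0
    while node in tree:
        parent, length = tree[node]
        total += length
        dist[parent] = total
        node = parent
    return dist


def patristic_distance(tree, leaf_a, leaf_b):
    """Sum of branch lengths on the path joining two leaves."""
    up_a = path_to_root(tree, leaf_a)
    up_b = path_to_root(tree, leaf_b)
    shared = set(up_a) & set(up_b)
    # The most recent common ancestor gives the shortest combined path.
    return min(up_a[n] + up_b[n] for n in shared)


def pairwise_rows(tree, leaves):
    """Rows shaped like (strain_1, strain_2, patristic_distance)."""
    return [(a, b, patristic_distance(tree, a, b))
            for a, b in combinations(leaves, 2)]


if __name__ == "__main__":
    for row in pairwise_rows(TOY_TREE, ["A", "B", "C"]):
        print(row)
```

In this toy tree the path from A to B passes only through n1, so its length is 0.5 + 0.25; any pair involving C must pass through the root.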
Materials and Methods (further details and metadata are in the associated article's supplementary material)
Creating a bank of commensal human gut microbiome strains
Fecal samples were obtained from 28 human donors aged 18 to 63, with a median age of 35. Donors were selected as those with no antibiotic use in the past year and no known history of diabetes, colitis, autoimmune disease, cancer, pneumonia, dysentery, or cellulitis at the time of consent. Protocols for fecal sample collection were approved by Memorial Sloan Kettering (MSK) and the University of Chicago. Fresh fecal samples were immediately reduced in an anaerobic chamber upon collection, then diluted and cultured on various growth media. Agar media types varied but included any of the following: Columbia Blood Agar, Brain Heart Infusion + Yeast, Brain Heart Infusion + Mucin, Brain Heart Infusion + Yeast + Acetate or N-Acetylglucosamine, Reinforced Clostridial Agar, Peptone Yeast Glucose, Yeast Casitone Fatty Acids, and Defined media M5. Colonies were selected and grown until sufficiently turbid; 20% glycerol/PBS stocks were then created and stored in a -80 °C freezer.
Colonies were selected for whole-genome sequencing based on pyrosequencing of the 16S region, which provides a rough estimate of genus-level designation. For each donor, only colonies with a sequence identity of less than 99% according to CD-Hit (v. 4.8.1) were selected for whole-genome sequencing (1). Bacterial genomic DNA was extracted using the QIAamp DNA Mini Kit (QIAGEN) according to the manufacturer’s manual. The purified DNA was quantified using a Qubit 2.0 fluorometer. 1000 ng of each sample was prepared for sequencing using the QIAseq FX DNA Library Kit (QIAGEN). The protocol was carried out for a targeted fragment size of 550 bp. Sequencing was performed on the MiSeq or NextSeq platform (Illumina) with a paired-end (PE) kit in pools designed to provide 1-3 million PE reads per sample with a read length of 250 or 150 bp.
Adapters were trimmed off with Trimmomatic using the following parameters: the leading and trailing 3 bp of the sequences were trimmed off, and quality was controlled by a sliding window of 4 with an average quality score of 15 (default parameters of Trimmomatic). Moreover, any read that was less than 50 bp long after trimming and quality control was discarded. The remaining high-quality reads were assembled into contigs using SPAdes (v3.14.0) (2).
Taxonomic classification of the assembled contigs was performed with the following methods: (a) Kraken2 (v2.1.1); (b) the full or partial length 16S rRNA gene from each isolated colony’s assembled contigs was extracted and input into BLASTn (v2.10.1+) to query against NCBI’s RNA RefSeq database (3, 4), and the top five hits for each query were manually curated to determine an isolate’s identity, with identity and coverage cutoffs both at 95%; (c) GTDB-Tk (v1.5.1) (5). Final taxonomy was determined by the consensus of the three methods. Any colony that did not match the initial pyrosequencing taxonomy or lacked consensus was excluded from the commensal strain bank.
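The read quality filter described above (sliding window of 4, average quality score of 15, minimum length of 50 bp) can be illustrated with a simplified sketch. This is not Trimmomatic's implementation and should not be used in its place: the function names are hypothetical, the input is assumed to be a list of per-base Phred scores, and the window rule is reduced to "cut the read at the first window whose mean quality falls below the threshold".

```python
def sliding_window_trim(quals, window=4, min_avg=15):
    """Return how many leading bases survive a simplified sliding-window cut.

    quals is a list of per-base Phred quality scores (an assumed input format).
    The read is cut at the start of the first window whose mean quality is
    below min_avg; reads shorter than the window are left untouched.
    """
    for start in range(0, len(quals) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < min_avg:
            return start
    return len(quals)


def keep_read(quals, min_len=50, window=4, min_avg=15):
    """True if the trimmed read is at least min_len bases long."""
    return sliding_window_trim(quals, window, min_avg) >= min_len


if __name__ == "__main__":
    good_read = [30] * 60 + [2] * 10   # long high-quality prefix
    short_read = [30] * 40 + [2] * 10  # high-quality prefix under 50 bp
    print(sliding_window_trim(good_read), keep_read(good_read))
    print(sliding_window_trim(short_read), keep_read(short_read))
```

The first toy read keeps enough high-quality bases to pass the length cutoff, while the second is discarded because fewer than 50 bases remain after the cut.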
Annotating each strain in the commensal strain bank by its orthologous gene group (OGG) content
For individual isolates, the genome assemblies were annotated using Prokka (v1.12), producing a fasta file of all coding regions from the assembled genome translated to amino-acid protein sequences (10). This fasta file was then input to eggNOG mapper (v2.0.1b) to annotate each protein sequence against the eggNOG database (v5.0) of orthologous gene groups (OGGs) at the level of Bacteria (‘@2’) (11, 12). Each isolate was then aligned based on this common set of OGG features, where each row corresponds to an isolate, each column corresponds to an OGG, and each entry holds the number of protein sequences that matched that OGG. This OGG alignment of 669 isolates forms the CSB OGG matrix (Fig. 1C). Each isolate was annotated across 11,248 total OGGs, of which 5,449 OGGs have greater than zero variance; isolates were also annotated with 16S-BLASTn, GTDB, and final NCBI taxonomic designations at the level of Phylum through Species.
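The construction of the isolate-by-OGG count matrix and the zero-variance filter can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the input is assumed to be a mapping from isolate ID to the list of OGG labels assigned to its proteins (one label per annotated protein), and the isolate names, OGG labels, and function names are all hypothetical.

```python
from collections import Counter
from statistics import pvariance


def build_ogg_matrix(hits_by_isolate):
    """Build a count matrix with isolates as rows and OGGs as columns.

    hits_by_isolate maps an isolate ID to a list of OGG labels, one per
    annotated protein (an assumed input format).
    """
    isolates = sorted(hits_by_isolate)
    counts = {iso: Counter(hits_by_isolate[iso]) for iso in isolates}
    oggs = sorted({ogg for c in counts.values() for ogg in c})
    # Counter returns 0 for OGGs that an isolate lacks.
    matrix = [[counts[iso][ogg] for ogg in oggs] for iso in isolates]
    return isolates, oggs, matrix


def variable_oggs(oggs, matrix):
    """Return the OGGs whose counts have greater than zero variance."""
    keep = []
    for j, ogg in enumerate(oggs):
        column = [row[j] for row in matrix]
        if pvariance(column) > 0:
            keep.append(ogg)
    return keep


if __name__ == "__main__":
    # Made-up annotations for three hypothetical isolates.
    toy_hits = {
        "iso1": ["OGG_a", "OGG_a", "OGG_b"],
        "iso2": ["OGG_b", "OGG_c"],
        "iso3": ["OGG_b"],
    }
    isolates, oggs, matrix = build_ogg_matrix(toy_hits)
    print(isolates, oggs, matrix)
    print(variable_oggs(oggs, matrix))
```

In the toy data, OGG_b occurs exactly once in every isolate, so it carries no variance and is dropped by the filter, mirroring how the 11,248 annotated OGGs reduce to the 5,449 with greater than zero variance.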