Skip to main content

Diversity of biosynthetic gene clusters in gut bacteria of turtle ants

Cite this dataset

Duplais, Christophe (2021). Diversity of biosynthetic gene clusters in gut bacteria of turtle ants [Dataset]. Dryad.


In insect-microbe nutritional symbioses the gut symbionts supplement the host diet with nutrients by producing amino acids and vitamins or by degrading lignin or polysaccharides macromolecules. In multipartite mutualisms composed of multiple symbionts from different taxonomical orders, it has been suggested that beside the genes involved in the nutritional symbiosis the symbionts maintain genes responsible for the production of metabolites putatively playing a role in the maintenance and interaction of the bacterial communities living in close proximity. To test this hypothesis, we investigated the diversity of biosynthetic gene clusters (BGCs) producing different non-primary metabolites in the genomes and metagenomes of the conserved gut symbionts associated with the herbivorous turtle ants (genus: Cephalotes). We studied 17 Cephalotes species collected across several geographical areas to reveal that (i) mining metagenomes and genomes show complementary results demonstrating the robustness of this approach to retrieve BGCs, (ii) the conserved gut symbionts involved in the nutritional symbiosis have a high diversity of BGCs of different chemical families, (iii) the phylogenetic analysis of BGCs encoding the production of arylpolyenes, non-ribosomal peptides (NRP), polyketides (PK), and siderophores shows high similarity between BGCs of a single symbiont across different ant host species, and between BGCs originated from different bacterial orders within a single host species. These findings together suggest multiple mechanisms of bacterial genome conservation and evolution of BGCs. We document the occurrence, diversity, and similarity of BGCs in the genome of obligate gut bacteria involved in multipartite mutualisms across the phylogeny of turtle ants.


Genomes and metagenomes analysis

The 14 genomes of cultured gut bacteria and 18 metagenomes of Cephalotes gut bacteria were obtained from JGI-IMG version 5.056 (Table S6 and S7 respectively) from the previous projects Gs0085494 (“Cephalotes varians microbial communities from the Florida Keys, USA”), Gs0114286 (“Symbiotic bacteria isolated from Cephalotes varians”), Gs0117930 (“Cephalotes ants gut microbiomes”) and Gs0118097 (“Symbiotic bacteria isolated from Cephalotes rohweri”)6.

The 14 cultured isolate genomes are all part of the Cephalotes core microbiome: two Cephaloticoccus genomes (order: Opitutales, class: Opitutae; one genome from each ant host),  four Ventosimonas genomes (order: Pseudomonadales, class: Gammaproteobacteria; one genome from C. varians and three genomes from C. rohweri), six Burkholderiales genomes (class: Betaproteobacteria; four genomes from C. varians and two genomes from C. rohweri), one Rhizobiales genome (class: Alphaproteobacteria;  from C. varians), and one Xanthomonadales genome (class: Gammaproteobacteria; from C. rohweri). All these bacteria belong to the Proteobacteria phylum, except Opitutales which are part of the Verrucomicrobia phylum.

The metagenomic data used in this study have two metagenomes from the same C. varians species. However, for the metagenome of C. varians PL010W the sequencing quality is very different from all the other metagenomes (number of reads, GC content), therefore it was excluded in our analysis (Table S7). The metagenomes were analyzed via the software Anvi’o version 5.557 to sort the different bacterial families composing each metagenome into distinct bins. In this analysis, the fasta sequence of a metagenome is used to create a contig database and a profile database. Then, this contig database is visualized and bins are manually created to maximize completeness while minimizing redundancy. Finally, the software CheckM version 1.158 was used to identify the taxonomy of each bin (Fig. S10).

Bacterial biosynthetic gene clusters analysis

The bacterial biosynthetic gene clusters (BGCs) of each genome and each metagenomic bin were analyzed with antiSMASH version 5.030 (Fig. S10) with the following analysis options: strict detection, and activation of search for KnownClusterBlast, ClusterBlast, SubClusterBlast and Active site finder. BGCs smaller than 5kb were then filtered out of the data and were not included. The taxonomic classification of each cluster was verified to the genus level using the software Blast+59 (Table S8). There was no identification conflict between CheckM and Blast+. NaPDos (, accessed July 2020)31 was used to classify the ketosynthase (KS) domain and condensation (C) domain sequences of NRP and PK retrieved from the genome and metagenome mining analysis and to infer the KS and C phylogenies.

Distance matrix calculations

The published Cephalotes phylogeny was retrieved60 and the packages ape61 and phytools62 of the R software version 3.6.163 were used to exclude from this phylogeny the Cephalotes species from which no genome or metagenome was available. The cophenetic distance option in the R package stats63 was used on this pruned phylogeny to create the Cephalotes phylogenetic distance matrix. The cophenetic distance replaces original pairwise distances between the Cephalotes species by the computed distances between their clusters. To calculate the metagenomic BGCs distance matrix, a BGC matrix containing the numbers and types of each BGC found in each metagenome was created. Then the distance function (Euclidean method) of the R package stats was used on the BGC matrix to calculate the metagenomic BGCs distance matrix. The Euclidean method was chosen because it calculates the absolute distance between two samples with continuous numerical variables, without removing redundancies.

Genomic similarity assessment

The genomic similarity between the cultured isolate genomes and the metagenomes from the C. rohweri and C. varians were compared using two independent methods.

First, each metagenomic bin was compared to each cultured isolate genome from the corresponding host species using the combined values of the average nucleotide identity (gANI) and alignment fraction (AF)64,65. In this method, the nucleotide of the protein-coding genes of each metagenomic bin (genome 1) and each cultured isolate genome (genome 2) are compared using NSimScan66, and only the bidirectional best hits (BBHs) displaying at least 70% sequence identity over at least 70% of the length of the shorter sequence are kept. The gANI is calculated by dividing the sum of the percent of identity times the alignment length for all BBHs by the sum of the lengths of the BBH genes. A gANI higher than 0.95 indicates a threshold for same species level67. The AF is calculated by dividing the sum of the lengths of all BBH genes by the sum of the length of all the genes in genome 1. For AF values higher than 0.5 are considered conspecifics43.

Next, the contigs of each metagenomic bin were mapped to each cultured isolate genome from the corresponding host species using the function in the BBMerge software version 38.2268 with the parameters: kfilter=22 subfilter=15 maxindel=80. These parameters were used in a similar study published recently43. BBMap aligns contig sequences from each metagenomic bin (genome 1) and each cultured isolate genome (genome 2) and return the percentage of contigs in genome 2 which were found in genome 1.

Genetic networking construction

The genetic networking of the genomic and metagenomic bacterial BGCs was generated using BiG-SCAPE version 1.0.040,41 (Fig. S10). The network was constructed with the three following options: “--include_singletons”, “--mix”, and “--cutoffs 1.0”. Briefly, the Pfam-domains of each BGC were identified using HMMER version 3.1b269 with the respective Hidden Markov Models (HMM) obtained from the Pfam database70. The distance matric is created by calculating the distance between every pair of BGC. Pairwise distance calculation is divided between three values: the percentage of shared domain types (Jaccard Index), the similarity of aligned domain sequences (Domain Sequence Similarity index) and the similarity of domain pair types (Adjacency Index). Specific details for each index and the method are available in Navarro-Munoz et al.33. The resulting similarity matrix was filtered with different thresholds between 0.6 and 0.840,71,72, and the thresholds 0.65 was chosen because it filtered out enough to form different groups while maximizing the number of bacterial BGCs in each network40,72. The filtered similarity matrix was visualized with Cytoscape version 3.7.273, using the MCL clusterization algorithm from clusterMaker274 version 1.3.1 with an index value I = 2.0.

BGCs genomic core gene analysis

Although BGCs are made of many domains, some have been deemed the “genomic core”. The genomic core is a group of one or a few domains within a single BGC, which contain all the required modules needed for this BGC to be functional and are highly conserved across bacteria. The genomic core of four types of BGCs (arylpolyene, NRP, siderophore and T1PK) were analyzed using the CORASON version 1.033 (Fig. S10).

First, a database for each type of BGCs was created from all the BGCs of the same type recovered in the genomes and metagenomes, which were obtained from our previous AntiSMASH version 5.0 analysis. Then a BGC from each of our databases was chosen as the reference BGC for its database, and a gene from the biosynthetic core of this reference BGC was chosen as the query protein. The choice of the reference BGC and query protein for a database were made taking different variables into account: (1) the reference BGC must be one of the longest BGCs in the database, (2) the query protein must come from a biosynthetic core gene or an additional biosynthetic gene close to the core, and (3) the query protein must be present in at least half of the BGCs in the database.

Once a reference BGC and a protein query were selected for a database, the CORASON version 1.0 was used to determine the genomic core of this type of BGCs, and to infer the phylogenies. For the phylogenetic reconstruction, the core BGCs are concatenated and aligned using Muscle version 3.8.3175, and the phylogeny is inferred using the Maximum likelihood method in FastTree version 2.1.1076. Specific details for the method are available in Navarro-Munoz et al.33. If less than half of the BGCs present in a database appear in the phylogeny, or if many BGCs appear more than once in the phylogeny (which may happen if the selected query protein belongs to a biosynthetic gene that can be found multiple times in the same BGC), the reference BGC and query protein previously selected were deemed to be inadequate for the analysis. In this case, the process to choose a reference BGC and query protein were done once more, following the same procedure as before. The reference BGC and query protein selected for the four types of BGCs studied are presented in Table S8. The strength of this method is to combine four important steps into one tool: conserved genomic core calculation, orthologue genes blasting, sequence alignment and phylogenetic reconstruction.

Statistical analysis

To test for correlations between the geography of the Cephalotes ants and the BGCs found in the symbiotic metagenomes were the Chi-squared function of the R package stats77 was implemented. The Cephalotes phylogenetic distance matrix and the bacterial BGCs distance matrix from the metagenome analysis were compared using the Mantel test of the R package ape61. The PCoA between the number and types of bacterial BGCs from the metagenome analysis and Cephalotes geography were performed using the prcomp function of the R package stats and the R package ggplot78. The correlations between the number of each type of bacterial BGCs from the metagenome analysis and Cephalotes geography were calculated using the one-way ANOVA function with the Tukey’s pairwise of the Past software version 3.2579.

Usage notes

This Excel file contains all the Supplementary Tables of Chanson et al. 2021.

Table S1. BGCs types identified in the bacterial genomes 12

Table S2. Statistic assessment of the metagenomic bins created. 13

Table S3. BGCs types identified in the bacterial metagenomes. 17

Table S4. Genomic similarity measurement between metagenomic bins and cultured isolate genomes. 27

Table S5. Genomic similarity measurement between BGCs from cultured isolate genomes and metagenomic bins. 31

Table S6. Assembly statistics of genomes. 34

Table S7. Assembly statistics of metagenomes. 36

Table S8. Reference BGC and query protein selected in the genomic core analysis. 37


Agence Nationale de la Recherche, Award: ANR-10-LABX-25-01

National Science Foundation, Award: NSF DEB 1900357