Data for: Ancient rapid radiation explains most conflicts among gene trees and well-supported phylogenomic trees of nostocalean cyanobacteria
Abstract
Prokaryotic genomes are often considered to be mosaics of genes that do not necessarily share the same evolutionary history due to widespread Horizontal Gene Transfers (HGTs). Consequently, representing evolutionary relationships of prokaryotes as bifurcating trees has long been controversial. However, studies reporting conflicts among gene trees derived from phylogenomic datasets have shown that these conflicts can be the result of artifacts or evolutionary processes other than HGT, such as incomplete lineage sorting, low phylogenetic signal, and systematic errors due to substitution model misspecification. Here, we present the results of an extensive exploration of phylogenetic conflicts in the cyanobacterial order Nostocales, for which previous studies have inferred strongly supported conflicting relationships when using different concatenated phylogenomic datasets. We found that most of these conflicts are concentrated in deep clusters of short internodes of the Nostocales phylogeny, where the great majority of individual genes have low resolving power. We then inferred phylogenetic networks to detect HGT events while also accounting for incomplete lineage sorting. Our results indicate that most conflicts among gene trees are likely due to incomplete lineage sorting linked to an ancient rapid radiation, rather than to HGTs. Moreover, the short internodes of this radiation fit the expectations of the anomaly zone, i.e., a region of the tree parameter space where a species tree is discordant with its most likely gene tree. We demonstrated that concatenation of different sets of loci can recover up to 17 distinct and well-supported relationships within the putative anomaly zone of Nostocales, corresponding to the observed conflicts among well-supported trees based on concatenated datasets from previous studies. 
Our findings highlight the important role of rapid radiations as a potential cause of strongly conflicting phylogenetic relationships when using phylogenomic datasets of bacteria. We propose that polytomies may be the most appropriate phylogenetic representation of these rapid radiations that are part of anomaly zones, especially when all possible genomic markers have been considered to infer these phylogenies.
README: Ancient Rapid Radiation Explains Most Conflicts Among Gene Trees and Well-supported Phylogenomic Trees of Nostocalean Cyanobacteria
This repository contains the supplementary materials, as well as all the code, data, and outputs from the analyses used in the study.
Description of the data and file structure
This repository contains Supplementary Materials, code and data. See below for details of the file structure.
Supplementary Materials
The supplementary materials are distributed in two files:
PardoDelaHoz_nostocales_suppl.pdf
: This is the main Supplementary Materials file. It contains all supplementary figures and tables, with captions, as well as the references.
Supplemtary_file_S1.csv
: This is a table that summarizes the results of the GUNC analyses.
Code
The code is distributed in two files:
commands.sh
: This file has the full workflow with calls to all the scripts. It is intended as a reference for navigating the scripts.
scripts_and_misc.zip
: This zip file contains two directories:
scripts/
: All the scripts used are in this directory. It also contains a file called r_functions.R, which has all the custom R functions used in the study.
misc_files/
: This directory contains miscellaneous files that are called by the scripts. For example, there are files with lists of loci codes from the different datasets and the different taxa subsets.
Data
The repository includes the data used as well as almost all the output files produced by the analyses. Everything is in the data.zip file. WARNING: the contents of this archive will take ~69 GB of space when decompressed. The only files we did not include were the raw outputs from the posterior sampling done with PhyloBayes, because they would double the size of this already huge repository. However, the posterior samples from these analyses are summarized in the files under analyses/phylonetworks/bucky/infiles/. Please feel free to contact me at cjpardodelahoz@gmail.com if you want access to the raw outputs. The rest of it is organized as follows:
genomes/
: Contains the assemblies of all 220 genomes that we used in the study in FASTA format. These assemblies were the starting point for all analyses. The file genomes/genome_label.tsv has a table that links the filenames in this directory with the corresponding taxon names.
databases/
: Contains the BLAST database used by mafft-homologs during the alignments of the amino acid sequences.
analyses/
: Contains the output from all analyses. It is structured like this:
genome_qc/
: Contains the output of the BUSCO analyses that were used for QC of the sampled genomes and to extract the loci used in several of the phylogenomic analyses.
busco/all_cyanodb10/ contains the output of the BUSCO analyses done with the cyanobacteria_odb10 database.
busco/all_nostocalesdb10/ contains the output of the BUSCO analyses done with the nostocales_odb10 database. The structure of both of these folders is the same:
BUSCO_result_all.csv
: Matrix of BUSCO loci (rows) and taxa (columns) where the elements indicate whether a particular locus was complete, duplicated, fragmented, or missing from a particular taxon genome.
BUSCO_num_result_all.csv
: Same as BUSCO_result_all.csv, but the elements of the matrix were replaced by 0 (duplicated, fragmented, or missing) and 1 (complete). Both of these files are generated by the R scripts scripts/filter_taxa_and_loci_all.R for the cyanodb10, and scripts/filter_loci_nostocalesdb10.R for the nostocalesdb10.
nested_BUSCO_result_num.csv
: Only under busco/all_cyanodb10/. This is the same matrix as BUSCO_num_result_all.csv, but organized in a nested fashion, which I used to explore the number of loci present in each taxon and the number of taxa in which each locus was present. This file is generated by the R script scripts/filter_taxa_and_loci_all.R.
summary
: Compilation of the full_table.tsv (see below) files from all the taxa. I used this to summarize the BUSCO results across all taxa.
by_taxon/
: Contains the output of the BUSCO analyses done with the respective database, with a directory for each of the 220 taxa. For each taxon, the BUSCO output consists of:
short_summary.specific.*
: A short summary of the BUSCO results.
logs/
: Directory with the STDERR and STDOUT of BUSCO, HMMER, and Prodigal.
prodigal_output/
: Gene predictions in nucleotide and amino acid format (predicted_genes/predicted.*), and a tmp/ subdirectory with the Prodigal run temp files.
run_*_odb10/
: Database-specific BUSCO results:
busco_sequences/
: FASTA files with the hits of the BUSCO loci found for the target genome in nucleotide (*.fna) and amino acid (*.faa) form, organized by BUSCO status (complete, multicopy, and fragmented).
hmmer_output/
: HMMER output for the searches with each of the HMMs of the corresponding ortholog database against the target genome.
full_table.tsv
: Table with full details of BUSCO results, including status for each locus code, sequence coordinates of the hits, and gene names.
missing_busco_list.tsv
: List of missing BUSCO loci codes.
short_summary.txt
: Same as short_summary.specific.* above.
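The relationship between BUSCO_result_all.csv and BUSCO_num_result_all.csv described above can be illustrated with a short Python sketch. This is not the authors' R code (that lives in scripts/filter_taxa_and_loci_all.R); it only shows the recoding rule stated in this README (complete = 1, everything else = 0) and the two summaries mentioned for the nested matrix. The toy table, the locus and taxon names, the "locus" header, and the exact spelling of the status labels are all assumptions made for illustration.

```python
import csv
import io

# Toy stand-in for BUSCO_result_all.csv. The header name "locus", the locus and
# taxon names, and the status labels are hypothetical; check the real file.
TOY_CSV = """locus,taxon_A,taxon_B,taxon_C
locus_1,Complete,Duplicated,Complete
locus_2,Missing,Complete,Fragmented
"""


def to_binary_matrix(handle):
    """Recode a locus-by-taxon status matrix as 1 (complete) or 0 (anything else)."""
    reader = csv.reader(handle)
    header = next(reader)
    taxa = header[1:]
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        locus, statuses = row[0], row[1:]
        matrix[locus] = [
            1 if status.strip().lower() == "complete" else 0 for status in statuses
        ]
    return taxa, matrix


def loci_per_taxon(taxa, matrix):
    """Count the complete loci found in each taxon (column sums)."""
    counts = dict.fromkeys(taxa, 0)
    for values in matrix.values():
        for taxon, value in zip(taxa, values):
            counts[taxon] += value
    return counts


def taxa_per_locus(matrix):
    """Count the taxa in which each locus is complete (row sums)."""
    return {locus: sum(values) for locus, values in matrix.items()}


taxa, binary = to_binary_matrix(io.StringIO(TOY_CSV))
print(loci_per_taxon(taxa, binary))
print(taxa_per_locus(binary))
```

To try this on the real matrix, replace the `io.StringIO(TOY_CSV)` handle with an open file handle to BUSCO_result_all.csv, after confirming that its layout matches the assumptions above.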
prelim/
: Contains the sequences and alignments used to infer the preliminary phylogenetic tree with all taxa used in the study (Fig. S1). It also contains the RAxML output of the concatenated analyses, including the tree in Newick format under trees/concat.
L31/
: Contains the nucleotide and amino acid sequences, alignments, as well as trees (single gene, concatenated, and ASTRAL) and IQ-TREE output obtained with the L31 dataset.
L70/
: Contains the nucleotide and amino acid sequences, alignments, as well as trees (single gene, concatenated, and ASTRAL) and IQ-TREE output obtained with the L70 dataset.
L746/
: Contains the nucleotide and amino acid sequences, alignments, as well as trees (single gene, concatenated, and ASTRAL) and IQ-TREE output obtained with the L746 dataset.
L1648/
: Contains the nucleotide and amino acid sequences, alignments, as well as trees (single gene, concatenated, and ASTRAL) and IQ-TREE output obtained with the L1648 datasets, including the alignments and trees generated with different alignment trimming strategies (ng, strict, kcg, and kcg2). This directory also contains the alignment summaries obtained with AMAS.py, which were used to generate Fig. S3. Finally, there is also a subdirectory with the output tables from ModelFinder, which were used to compare the fit of site-heterogeneous and site-homogeneous models, summarized in Fig. S4.
ngmin/
: Contains the nucleotide and amino acid alignments, as well as trees (single gene, concatenated, and ASTRAL) and IQ-TREE output obtained with the L1082 (nucleotide) and L1233 (amino acid) datasets. It also contains the input and output from the TreeShrink analyses, which were used to filter taxa on relatively long branches out of the L1648 alignments before producing the ngmin datasets.
tbas/
: Contains the nucleotide and amino acid sequences, alignments, as well as the concatenated nucleotide tree and IQ-TREE output obtained with the L1648 loci and 16S rDNA for the 211 taxa that passed the initial QC filter. This is the tree that we will upload to T-BAS for people to use as a reference for placement of new nostocalean taxa.
conflict/
: Contains the input and output files from the three DiscoVista analyses that we conducted to investigate phylogenetic conflict: gene trees vs their corresponding concatenated trees; concatenated trees vs 22 key topological bipartitions; and ASTRAL trees vs 22 key topological bipartitions. The DiscoVista outputs from these analyses were used to generate the pie charts in Fig. 2, the heatmap in Fig. 3, as well as the boxplots in Figs. S6-S7.
phylonetworks/
: Contains the 1293 alignments from taxa subset1 used to infer the Bayesian gene trees from which we calculated concordance factors for the SNaQ inferences. It also contains the raw (modelfinder_bulk/) and extracted (modelfinder_out) output from the ModelFinder analyses on these alignments, which we used to select the best model for each locus for MCMC sampling using PhyloBayes. Within the alignments/ directory, there are also two .pb files for each locus. Those files contain the locus-specific command line that was used to run PhyloBayes with the best model found for each locus. The bucky/infiles/*.in files contain the counts of the different tree topologies from the Bayesian posterior samples from each locus, which were used as input to run BUCKy and infer concordance factors. The bucky/outfiles/*.concordance files contain the log of the BUCKy run for each quartet (set of four taxa with labels 1–12). The bucky/outfiles/*.cf files contain the concordance factors (CF) estimated for each quartet in CSV format. The first four fields are taxon names, followed by the CF estimate and lower (CF_lo) and higher (CF_hi) end of the 95% HPD of the estimates for each of the three alternative topologies of the quartet. The last field is the number of genes used to infer the quartet CFs. The snaq/ directory contains the input and output of the network estimation analyses. The snaq/CFtable.csv file is a compilation of the CFs for all quartets, i.e., all bucky/outfiles/*.cf files described above. The start_tree_subset1.tre file was the starting tree for the SNaQ inferences, and subset1.tree is the major edge topology of the best network with h = 2, which we used for Figure 4 of the paper. The snaq/net*.out files contain the inferred networks in Newick format for each value of h we tested, while the snaq/net*.best files contain the best network for each h according to the pseudolikelihood score. The snaq/net*.log files contain the SNaQ log for each run. The plog_scores.csv file has a table with the pseudolikelihood scores of the best network for each h. The snaq/bootstrap/* directory contains the output and logs of the 100 bootsnaq pseudoreplicate searches on the net2 network. The numbers after boot in the filenames are the starting seeds used for each pseudoreplicate search.
divtime/
: Contains all the files used and generated as part of the divergence time estimation. data/ has the tree from Figure 4 in PHYLIP format, including the fossil calibration, and the concatenated alignment of the 1293 loci for taxa in subset 1. gH/ contains the control file (mcmctree-outBV.ctl) used to infer the gradient and Hessian matrices needed to run the approximate likelihood analyses, the LG matrix (lg.dat), and the output from the codeml analysis. mcmc/c{1..3} and prior/c{1..3} contain the control files (.ctl files) as well as the output from the MCMC sampling of the posterior and prior distributions of node times for each of three chains (c1-c3).
phylogenomic_jackknifing/
: Contains the alignments and trees obtained for the phylogenomic jackknifing analyses, summarized in Fig. 5. loci_samples/ contains files with the list of randomly sampled loci for each one of the dataset sizes explored. For example, 31_rep1 has a list of the first replicate of 31 randomly sampled loci from the 1293 loci that are complete for taxa subset 1.
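As a reading aid for the bucky/outfiles/*.cf files and the snaq/CFtable.csv compilation described under phylonetworks/, the following Python sketch shows one way the per-quartet files could be parsed and stacked into a single table. It is not the script used in the study. It assumes each .cf file holds one comma-separated record with the 14 fields in the order given in this README and no header line; the column names, the toy file names, and the toy values are hypothetical.

```python
import csv
import glob
import os
import tempfile

# Hypothetical column names: this README specifies only the field order
# (four taxa, then CF / CF_lo / CF_hi for three topologies, then gene count).
TOPOLOGIES = ("CF12_34", "CF13_24", "CF14_23")
FIELDS = (
    ["t1", "t2", "t3", "t4"]
    + [topo + suffix for topo in TOPOLOGIES for suffix in ("", "_lo", "_hi")]
    + ["ngenes"]
)


def read_cf_file(path):
    """Parse one quartet file, assuming its last non-empty line is the record."""
    with open(path, newline="") as handle:
        rows = [row for row in csv.reader(handle) if row]
    if not rows:
        raise ValueError(f"{path}: empty file")
    values = [value.strip() for value in rows[-1]]
    if len(values) != len(FIELDS):
        raise ValueError(f"{path}: expected {len(FIELDS)} fields, found {len(values)}")
    record = dict(zip(FIELDS, values))
    for name in FIELDS[4:]:
        record[name] = float(record[name])  # CFs, HPD bounds, and gene count
    return record


def compile_cf_table(pattern):
    """Stack all quartet files matching a glob pattern, in filename order."""
    return [read_cf_file(path) for path in sorted(glob.glob(pattern))]


def write_table(records, out_path):
    """Write the stacked records to one CSV file with a header line."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)


# Toy demonstration with made-up quartets and values.
toy_files = {
    "quartet_1.cf": "A,B,C,D,0.80,0.70,0.90,0.10,0.05,0.15,0.10,0.05,0.15,1293",
    "quartet_2.cf": "A,B,C,E,0.40,0.30,0.50,0.35,0.25,0.45,0.25,0.15,0.35,1200",
}
with tempfile.TemporaryDirectory() as tmp:
    for name, line in toy_files.items():
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write(line + "\n")
    table = compile_cf_table(os.path.join(tmp, "*.cf"))
    out_path = os.path.join(tmp, "CFtable.csv")
    write_table(table, out_path)
    with open(out_path) as handle:
        header_line = handle.readline().strip()

print(header_line)
print(len(table), "quartets compiled")
```

The length check in `read_cf_file` is deliberate: if the real files carry a header line or a different number of fields, the function fails loudly instead of silently misaligning columns.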
Additional sharing of Code
Visit the GitHub repository of the project, which has the version history of the code as well as a detailed workflow and description of the computational setup used in the study.
Methods
This dataset includes the draft genome assemblies from 220 cyanobacterial strains, 215 of which were previously published and retrieved from GenBank (see Table S1 in the associated manuscript) and 5 of which were generated as part of the study. It also includes all the data, code, and output files generated as part of the analyses in the paper. See the README.md for more information. Visit also the project's GitHub repository (https://github.com/cjpardodelahoz/nostocales), which contains the version history of all the code, as well as detailed workflows for the analyses conducted as part of the study.