Data from: Rapid radiations outweigh reticulations during the evolution of a 750-million-year-old lineage of cyanobacteria
Data files
Oct 04, 2025 version (files: 1.23 GB)
- cyano_genomes.zip (763.64 MB)
- label_key.csv (7.49 KB)
- phylogenetics.zip (471.03 MB)
- README.md (6.82 KB)
Abstract
This is the genomic and phylogenetic data used and generated in our study on Nostoc evolution and species boundaries. It contains the sequence data, alignments, and tree files from our phylogenomic analyses and the raw outputs from the analyses of phylogenetic conflict.
This repository contains the genomic and phylogenetic data that we used in our study on Nostoc evolution and species boundaries. Both archives contain directories in the structure that I used during the computational analyses as documented in the code for this project (available from GitHub). Below is a detailed description of the contents of the two files.
Genomes
The cyano_genomes.zip archive contains three directories with two different versions of the genomes we used in the study:
- set103/ contains the fasta files for the 151 genomes that we used for phylogenetic analyses. In this version of the files the genomes are not sorted by putative chromosome or putative plasmid. Rather, all the genomic content of each genome is in a single fasta file. The label_key.csv contains a key linking the file names with the taxa that they correspond to. Note that all of these genomes are also available from NCBI. The accessions are listed in Data S1A of the manuscript.
- set12c/ and set12p/ both contain fasta files for the same 151 genomes, but in this version the contigs are sorted into putative chromosome (set12c/) and putative plasmids (set12p/). We used the chromosome files for the FastANI, PopCOGenT, and GTDB analyses.
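As a quick sanity check after unzipping, one might count the contigs and total sequence length in each genome file. The sketch below is only an illustration and not part of the study's code: the function names are hypothetical, it assumes plain (uncompressed) FASTA text, and it matches every file in a directory because the file extension used in the archive is not stated here.

```python
from pathlib import Path


def fasta_summary(path):
    """Return (number of records, total sequence length) for one FASTA file."""
    n_records = 0
    total_length = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                n_records += 1  # each header line starts a new contig
            else:
                total_length += len(line)  # sequence lines may be wrapped
    return n_records, total_length


def summarize_directory(directory, pattern="*"):
    """Map each file name matching `pattern` in `directory` to its FASTA summary."""
    return {
        p.name: fasta_summary(p)
        for p in sorted(Path(directory).glob(pattern))
        if p.is_file()
    }
```

For example, `summarize_directory("cyano_genomes/set12c")` would return one `(contigs, length)` pair per chromosome file, assuming the archive was extracted into the working directory under that path.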
Phylogenetic data
The phylogenetics.zip archive contains two directories with the essential data to reproduce our phylogenomic analyses. phylogenetics/set200/ has files and output for the divergence time estimation at the Nostocales level and phylogenetics/set103/ has the files and output for phylogenetic analyses of Nostoc, based on the cyano_genomes/set103 genome files.
NOTE: the wASTRAL and rbcLX trees and alignments are also available on the T-BAS platform, where you can do phylogenetic placement of new Nostoc sequences.
Data and output from divergence time estimation on Nostocales – phylogenetics/set200/
- set200.treefile: topology used for divergence time estimation of Nostocales.
- seqs/
  - *faa: amino acid sequence files for each of the 1648 loci from the nostocales_odb_10 database of orthologs used by BUSCO. These are the loci we used in our previous paper on phylogenomics of nostocalean cyanobacteria.
- alignments/single/
  - *aln.faa: aligned amino acid sequences for the 1648 loci.
  - *ng.faa: aligned amino acid sequences for the 1648 loci after trimming sites with gaps.
- concat/
  - concat_ng_1_part.phy: concatenated alignment of the 1648 loci with no gaps and in phylip format for divergence time estimation.
- mcmctree_templates/: This directory contains template files to set up the control and model files needed to run MCMCTree.
  - lg.dat: The amino acid substitution matrix for the LG model.
  - mcmctree-outBV.ctl: Template for the control file used to get in.BV for the MCMCTree run.
  - mcmctree.ctl: Control file template for MCMC sampling.
- divtime/1_part
  - mcmc: output from posterior sampling.
    - c[1..3]/: a folder for each of three chains.
      - FigTree.tre: Tree file with 95% HPD of posterior age estimates.
      - mcmc_c[1..3].txt: Posterior sampling for the internode ages, rate parameters and likelihoods.
  - prior: output from prior sampling.
    - c[1..3]/: a folder for each of three chains.
      - FigTree.tre: Tree file with 95% HPD of prior age estimates.
      - mcmc_c[1..3].txt: Prior sampling for the internode ages, rate parameters and likelihoods.
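The *ng.faa alignments are described as having had sites with gaps trimmed. The sketch below shows one way such trimming can be done, assuming that every column containing at least one gap character is removed. It is not the tool used in the study (the trimming software is not named in this description), and the function name is hypothetical.

```python
def drop_gap_columns(alignment, gap_chars="-"):
    """Keep only the alignment columns in which no sequence has a gap.

    `alignment` maps sequence names to aligned sequences of equal length.
    """
    seqs = list(alignment.values())
    if not seqs:
        return {}
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences in an alignment must have equal length")
    # Indices of columns where every sequence has a residue (no gap character).
    keep = [i for i in range(length) if all(s[i] not in gap_chars for s in seqs)]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


# Toy example with made-up sequences: only columns 0 and 3 are gap-free.
toy = {"taxon_a": "MK-LV", "taxon_b": "MKTL-", "taxon_c": "M-TLV"}
trimmed = drop_gap_columns(toy)
```

Removing every column with any gap is a strict criterion; it keeps the concatenated alignment free of missing data at the cost of discarding sites.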
Data and output for phylogenomic analyses of Nostoc – phylogenetics/set103/
- seqs/
  - *faa: amino acid sequence files for each of the 1899 loci from the nostocales_odb_10 database of orthologs used by BUSCO. The 1517 loci we used for analyses are a subset of these.
  - *fna: nucleotide sequence files for each of the 1899 loci from the nostocales_odb_10 database of orthologs used by BUSCO.
  - 16s.fas: file with 16S rDNA sequences.
  - trnl.fas: file with trnL sequences.
- alignments/single/
  - *aln.faa: aligned amino acid sequences for the 1898 loci.
  - *ng.faa: aligned amino acid sequences for the 1898 loci after trimming sites with gaps.
  - *ng.fna: aligned nucleotide sequences for the 1898 loci after trimming sites with gaps.
  - 16s_aln.*: aligned 16S sequences in nexus and fasta formats.
  - trnl_aln.*: aligned trnL sequences in nexus and fasta formats.
  - *codon_partition: codon partition files used for model selection and phylogenetic inference.
- concat/
  - concat_ng.fna: concatenated nucleotide alignment of 1519 loci after trimming sites with gaps.
  - codon_partition_concat_ng_na: codon partition file used for model selection in the concatenated ML inference.
- divtime/1_part
  - mcmc: output from posterior sampling.
    - c[1..3]/: a folder for each of three chains.
      - FigTree.tre: Tree file with 95% HPD of posterior age estimates.
      - mcmc_c[1..3].txt: Posterior sampling for the internode ages, rate parameters and likelihoods.
  - prior: output from prior sampling.
    - c[1..3]/: a folder for each of three chains.
      - FigTree.tre: Tree file with 95% HPD of prior age estimates.
      - mcmc_c[1..3].txt: Prior sampling for the internode ages, rate parameters and likelihoods.
- conflict/
  - single_vs_concat/: output from topological comparisons between gene trees and the concatenated ML tree.
    - discovista_out/single.metatable.results.csv: DiscoVista output. Each row in the matrix corresponds to a gene tree. The "ID" column indicates the locus and each of the remaining columns corresponds to a topological bipartition in the species tree. The cells of the matrix indicate whether the bipartition was recovered in the gene tree with strong or weak support.
  - single_vs_wastral/: output from topological comparisons between gene trees and the weighted-ASTRAL tree.
    - discovista_out/single.metatable.results.csv: DiscoVista output. Each row in the matrix corresponds to a gene tree. The "ID" column indicates the locus and each of the remaining columns corresponds to a topological bipartition in the species tree. The cells of the matrix indicate whether the bipartition was recovered in the gene tree with strong or weak support.
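To get an overview of a single.metatable.results.csv file, one could tally how often each cell value occurs for every bipartition column. The sketch below is an assumption-laden illustration rather than the analysis code from the study: it assumes a comma-delimited file with a header row, relies only on the "ID" column named above, and makes no assumption about the labels used in the cells. The function name is hypothetical.

```python
import csv
from collections import Counter


def tally_bipartition_support(path, id_column="ID"):
    """Count, for each bipartition column, how often each cell value occurs."""
    tallies = {}
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)  # assumes comma-delimited with a header
        for row in reader:
            for column, value in row.items():
                if column is None or column == id_column:
                    continue  # skip the locus identifier and any overflow fields
                tallies.setdefault(column, Counter())[value] += 1
    return tallies
```

Each entry of the returned dictionary maps a bipartition column to a `Counter` of the values seen across gene trees, so the actual support labels in the file can be inspected before deciding how to summarize them.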
