Data from: AUTOEB: A software for systematically evaluating bipartitions in a phylogenetic tree employing an approximately unbiased test
Data files
Jan 07, 2025 version; files total 35.07 MB
- 16SrDNA_full.fasta (3.38 MB)
- 16SrDNA_full.treefile (167.05 KB)
- arc50_25_.fasta (465.92 KB)
- arc50_25_.treefile (7.29 KB)
- arc50_50_.fasta (928.72 KB)
- arc50_50_.treefile (7.31 KB)
- arc50_full.fasta (1.86 MB)
- arc50_full.treefile (8.73 KB)
- bac104_full.fasta (13.87 MB)
- bac104_full.treefile (27.16 KB)
- euk340_20_.fasta (2.26 MB)
- euk340_20_.treefile (5.15 KB)
- euk340_5_.fasta (567.34 KB)
- euk340_5_.treefile (5.14 KB)
- euk340_full.fasta (11.49 MB)
- euk340_full.treefile (6.08 KB)
- README.md (2.68 KB)
Abstract
The core of molecular phylogeny is the inference of a tree diagram representing the evolutionary relatedness among nucleotide or amino acid sequences. In addition, evaluating the credibility of “bipartitions,” each of which splits the inferred tree into two subtrees, is an indispensable part of modern phylogenetic studies. The most popular method for examining the credibility of bipartitions in a phylogenetic tree is the bootstrap. In the maximum likelihood framework, two alternatives to the bootstrap, UFBoot2 and SH-aLRT, are available. In this study, we propose new software, “AUTOEB,” which evaluates bipartitions in a given phylogenetic tree by employing an approximately unbiased (AU) test. For each bipartition, the software generates two alternative trees from the given tree by disrupting the bipartition of interest with the minimum changes in tree topology and compares them by the AU test. If either or both of the alternative trees fail to be rejected, the software calls the bipartition “unresolved”; otherwise, it calls the bipartition “resolved.” We phylogenetically analyzed four empirical sequence datasets and demonstrated that AUTOEB can provide an alternative criterion for bipartitions that received high support values from the pre-existing methods and can help to avoid potential false interpretations based on phylogenetic trees.
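The decision rule stated in the abstract can be sketched in a few lines of Python. This is an illustration only, not AUTOEB's code: the function name `classify_bipartition` is hypothetical, and the significance level `alpha = 0.05` is an assumption, since the abstract does not state the threshold used to reject a tree.

```python
def classify_bipartition(p_alt1: float, p_alt2: float, alpha: float = 0.05) -> str:
    """Label a bipartition from the AU-test p-values of its two alternative trees.

    alpha is an assumed significance level; the abstract does not give one.
    """
    rejected1 = p_alt1 < alpha  # first alternative tree rejected by the AU test
    rejected2 = p_alt2 < alpha  # second alternative tree rejected by the AU test
    # "Resolved" requires BOTH alternatives to be rejected; if either one
    # survives, the bipartition is called "unresolved".
    return "resolved" if (rejected1 and rejected2) else "unresolved"
```

Under this sketch, a bipartition whose alternatives have p-values 0.01 and 0.001 would be called "resolved", whereas p-values of 0.20 and 0.001 would give "unresolved".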
README: Phylogenetic alignments and trees assessed in Bamba
https://doi.org/10.5061/dryad.18931zd64
Description of the data and file structure
File name format: {dataset}_{size}.{file type}
Dataset
- euk340: multiprotein MSA of Eukaryotes
- arc50: multiprotein MSA of Archaea
- bac104: multiprotein MSA of Bacteria
- 16SrDNA: single-gene MSA (16SrDNA) of Archaea and Bacteria
Size
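- full: Full-length alignment
- XX%: Alignment of a randomly selected XX% of the sites in the corresponding "full" file (in the deposited file names the percent sign appears as an underscore, e.g., arc50_25_.fasta)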
File type
- XX.fasta: Multiple sequence alignment
- XX.treefile: ML tree reconstructed from XX.fasta
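The naming scheme above can be parsed mechanically. The following is a minimal sketch; the function name `parse_name` and the regular expression are illustrative assumptions, not part of the dataset. It accepts both the `%` form used in this README and the `_` form seen in the deposited file names (e.g., `arc50_25_.fasta`).

```python
import re

# Hypothetical pattern for {dataset}_{size}.{file type}; the size is either
# "full" or a percentage followed by "%" (README) or "_" (deposited names).
PATTERN = re.compile(
    r"^(?P<dataset>[A-Za-z0-9]+)_(?P<size>full|\d+[%_])\.(?P<filetype>fasta|treefile)$"
)

def parse_name(filename: str):
    """Return (dataset, percent or None for 'full', file type)."""
    match = PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    size = match.group("size")
    percent = None if size == "full" else int(size[:-1])  # drop trailing % or _
    return match.group("dataset"), percent, match.group("filetype")
```

For example, `parse_name("arc50_25_.fasta")` yields the dataset `arc50`, the percentage `25`, and the file type `fasta`.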
Files and variables
File: 16SrDNA_full.treefile
Description: The ML tree inferred from 16SrDNA_full.fasta
File: 16SrDNA_full.fasta
Description: A phylogenetic alignment of 16SrDNA sequences from bacteria and archaea
File: arc50_25%.treefile
Description: The ML tree inferred from a randomly selected 25% of the positions in arc50_full.fasta
File: arc50_25%.fasta
Description: A phylogenetic alignment of a randomly selected 25% of the positions in arc50_full.fasta
File: arc50_50%.treefile
Description: The ML tree inferred from a randomly selected 50% of the positions in arc50_full.fasta
File: arc50_full.fasta
Description: A phylogenetic alignment comprising 50 proteins conserved among archaeal species
File: arc50_full.treefile
Description: The ML tree inferred from arc50_full.fasta
File: arc50_50%.fasta
Description: A phylogenetic alignment of a randomly selected 50% of the positions in arc50_full.fasta
File: euk340_5%.treefile
Description: The ML tree inferred from a randomly selected 5% of the positions in euk340_full.fasta
File: bac104_full.treefile
Description: The ML tree inferred from bac104_full.fasta
File: euk340_20%.treefile
Description: The ML tree inferred from a randomly selected 20% of the positions in euk340_full.fasta
File: euk340_20%.fasta
Description: A phylogenetic alignment of a randomly selected 20% of the positions in euk340_full.fasta
File: euk340_full.treefile
Description: The ML tree inferred from euk340_full.fasta
File: euk340_full.fasta
Description: A phylogenetic alignment comprising 340 proteins conserved among eukaryotes
File: euk340_5%.fasta
Description: A phylogenetic alignment of a randomly selected 5% of the positions in euk340_full.fasta (i.e., the euk340 alignment)
File: bac104_full.fasta
Description: A phylogenetic alignment comprising 104 proteins conserved among bacterial species
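The subsampled alignments listed above each contain a random subset of the columns of a "full" alignment. A minimal sketch of such site subsampling is given below. It assumes sampling without replacement, preservation of the original column order, and rounding of the target column count; none of these details is stated in this README, and the function name `subsample_sites` and the `seed` parameter are hypothetical.

```python
import random

def subsample_sites(alignment: dict, fraction: float, seed: int = 0) -> dict:
    """Keep a random fraction of alignment columns (same columns for every sequence).

    alignment maps sequence name -> aligned sequence string.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    length = lengths.pop()
    n_keep = round(length * fraction)
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    # Sample column indices without replacement, then restore column order.
    columns = sorted(rng.sample(range(length), n_keep))
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}
```

Calling `subsample_sites(alignment, 0.25)` on a full alignment would produce an alignment analogous to the 25% files, although not the identical columns, because the sites chosen by the authors are not recorded here.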
Methods
We prepared three multiprotein alignments, as well as a single-gene (16S rDNA) alignment. The first multiprotein alignment, “euk340,” comprises 340 proteins (116,499 amino acid positions in total) sampled from 97 phylogenetically diverse eukaryotes. This alignment is identical to that analyzed in Harada et al. (Mol Biol Evol 2024 41:msae014).
We here prepared two new multiprotein alignments designated “arc50” and “bac104.” The proteins contained in arc50 were chosen from the 53 phylogenetically informative markers (arc53) in the Genome Taxonomy Database Release 214 (GTDB R214). Among the genomes available for each order in Archaea, the single genome with the greatest completeness was retained, leaving 149 genomes for the alignment. Then, three proteins that are absent in more than 79 of the 149 genomes were omitted. Single-protein alignments were prepared by MAFFT (version: 7.505) with the L-INS-i method, followed by trimming of ambiguously aligned positions and those with high gap proportions by BMGE (version: 2.0) with the -g 0.2 -e 0.3 options. The final multiprotein alignment was generated by concatenating the 50 single-protein alignments sampled from the 149 archaeal species (12,425 amino acid positions in total). The procedure described above was repeated to prepare bac104. For the multiprotein phylogenetic analyses, 490 genomes with high completeness, each representing an order in Bacteria, were selected. Among the 120 phylogenetically informative markers (bac120) in the GTDB R214, we omitted 16 proteins that are absent in more than 190 of the 490 genomes. The final multiprotein alignment was generated by concatenating the 104 single-protein alignments sampled from the 490 bacterial species (28,182 amino acid positions in total).
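The marker filtering and concatenation steps described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names `filter_markers` and `concatenate` are hypothetical, and padding a taxon that lacks a marker with gap characters is a common supermatrix convention assumed here, since the text does not say how missing proteins were handled.

```python
def filter_markers(presence: dict, n_genomes: int, max_absent: int) -> list:
    """Keep markers absent in at most max_absent genomes.

    presence maps marker name -> set of genomes that carry the marker.
    For arc50 the text implies n_genomes=149, max_absent=79 (53 - 3 = 50 kept);
    for bac104, n_genomes=490, max_absent=190 (120 - 16 = 104 kept).
    """
    return [
        marker
        for marker, genomes in presence.items()
        if n_genomes - len(genomes) <= max_absent
    ]

def concatenate(alignments: list, taxa: list) -> dict:
    """Concatenate single-protein alignments into one multiprotein alignment.

    Each alignment maps taxon -> aligned sequence. A taxon missing from an
    alignment is padded with gaps (assumption, see lead-in).
    """
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        width = len(next(iter(aln.values())))  # column count of this marker
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, "-" * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}
```

The total length of the concatenated alignment is the sum of the trimmed single-protein alignment lengths, which is how the totals quoted above (12,425 and 28,182 positions) would arise.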
We obtained the nucleotide sequences of the small subunit ribosomal RNA genes (16S rDNA) of species in Bacteria and Archaea from the GTDB as of July 26, 2023. First, we selected 3,823 genomes, each of which is classified as a representative and has the greatest completeness among those sampled from each family in Archaea/Bacteria. Then, the 16S rDNA sequences were retrieved from the selected genomes. The 16S rDNA sequences were aligned by MAFFT (version: 7.505) with the L-INS-i method. Ambiguously aligned positions and the positions with high gap proportions in the alignment were discarded by trimAl (version: 1.4.rev15) with the -gt 0.1 option. After the first alignment trimming, 784 sequences in which gaps occupied more than 75% of the total positions were removed from the initial alignment. We further discarded ambiguously aligned positions and the positions with high gap proportions by BMGE (version: 2.0) with the -g 0.4 -e 0.3 options. Then, we subjected the alignment after the second trimming to a preliminary phylogenetic analysis and removed 25 rapidly evolving (long-branch) sequences that potentially bias the phylogenetic inferences. The final refinement of the alignment after the exclusion of the rapidly evolving sequences was done by BMGE with the same options. The final 16SrDNA alignment comprises 3,014 sequences with 1,102 nucleotide positions.
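The sequence-level gap filter described above (removal of sequences in which gaps occupy more than 75% of the positions) can be sketched as below. The function names are hypothetical, and treating only `-` as a gap character is an assumption; the text does not say whether other symbols were counted.

```python
def gap_fraction(seq: str, gap_chars: str = "-") -> float:
    """Fraction of positions in an aligned sequence that are gap characters."""
    return sum(ch in gap_chars for ch in seq) / len(seq)

def drop_gappy_sequences(alignment: dict, max_gap: float = 0.75) -> dict:
    """Remove sequences whose gap fraction is strictly greater than max_gap."""
    return {
        name: seq
        for name, seq in alignment.items()
        if gap_fraction(seq) <= max_gap
    }

# The sequence counts quoted in the text are mutually consistent:
# 3,823 selected - 784 gappy - 25 long-branch = 3,014 retained.
remaining = 3823 - 784 - 25
```

Note that a sequence with exactly 75% gaps is kept in this sketch, following the wording "more than 75%".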
We used IQ-TREE (version: 2.2.0) for all the phylogenetic analyses conducted in this study. For the ML tree reconstruction from euk340 and arc50, the LG + C60 + F + G model, a site-heterogeneous substitution model, was applied. The ML tree was inferred from bac104 under the LG + C20 + F + G model. The C20 option was set to account for the site heterogeneity in bac104, as the C60 option was too computationally intensive for this multiprotein alignment including 490 OTUs. The ML tree was inferred from 16SrDNA under the GTR + F + G model.
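As a hedged illustration of how the model choices above map onto IQ-TREE invocations, the sketch below only builds the argument lists and does not run anything. The exact commands used in the study are not given in the text; the binary name `iqtree2`, the use of `--prefix`, and the helper name `iqtree_command` are assumptions, and settings such as thread counts or random seeds are omitted because they are not reported.

```python
# Substitution models per full alignment, as stated in the Methods text.
MODELS = {
    "euk340_full": "LG+C60+F+G",
    "arc50_full": "LG+C60+F+G",
    "bac104_full": "LG+C20+F+G",
    "16SrDNA_full": "GTR+F+G",
}

def iqtree_command(prefix: str, model: str, binary: str = "iqtree2") -> list:
    """Build an IQ-TREE 2 argument list for an ML tree search (not executed here).

    -s        input alignment
    -m        substitution model
    --prefix  output prefix, so the ML tree is written to <prefix>.treefile
    """
    return [binary, "-s", f"{prefix}.fasta", "-m", model, "--prefix", prefix]

commands = {name: iqtree_command(name, model) for name, model in MODELS.items()}
```

Each list could be passed to `subprocess.run` on a machine where IQ-TREE 2 is installed; with the prefix set this way, the output tree names would match the `.treefile` names in this dataset.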