Is the deuterostome clade an artefact?
Data files
Jun 19, 2025 version files (22.31 GB)
- alignments.zip: 902.35 MB
- crossVals.zip: 154.48 MB
- dataProvenance.xlsx: 25.57 KB
- EDC_files.zip: 25.07 KB
- inputTrees.zip: 768.82 KB
- IQTreeOutputFiles.zip: 4.87 GB
- rawData.zip: 3.88 GB
- README.md: 5.68 KB
- RELLbootstraps.zip: 390.39 MB
- results.zip: 6.45 GB
- scripts.zip: 88.06 KB
- topoSims.zip: 5.67 GB
Jun 20, 2025 version files (22.31 GB)
- alignments.zip: 902.35 MB
- crossVals.zip: 154.48 MB
- dataProvenance.xlsx: 25.57 KB
- EDC_files.zip: 25.07 KB
- inputTrees.zip: 768.82 KB
- IQTreeOutputFiles.zip: 4.87 GB
- rawData.zip: 3.88 GB
- README.md: 5.69 KB
- RELLbootstraps.zip: 390.39 MB
- results.zip: 6.45 GB
- scripts.zip: 88.06 KB
- topoSims.zip: 5.67 GB
Abstract
There is a long-standing consensus that the animal phyla closest to our own phylum of Chordata are the Echinodermata and Hemichordata. These three phyla constitute the major clade of Deuterostomia. Recent analyses have questioned the support for the monophyly of Deuterostomia, however, showing that the branch leading to deuterostomes is very short and may be influenced by systematic error. Here we use a site-by-site approach to explore possible sources of error. Under conditions that promote long-branch attraction (LBA) – especially branch-length heterogeneity and sites constrained in their amino acid composition – we find that deuterostome monophyly is strongly supported. When we make efforts to mitigate these sources of error, support for Deuterostomia markedly decreases or even disappears. Our results call into question one of the longest established major branches of the animal kingdom. A very short, or non-existent, deuterostome branch has implications for interpreting putative deuterostome fossils and for reconstructing the bilaterian ancestor.
Directory list:
dataProvenance.xlsx
: contains the accession numbers and sources for the proteomic and transcriptomic data used in this study
rawData.zip
: contains the pre-processed input data
- OrthoFinder_input contains the raw genomes/proteomes/transcriptomes listed in dataProvenance.xlsx
- orthoGroups_raw.zip contains the raw OrthoFinder output, excluding the WorkingDirectory and the empty Single_Copy_Orthologue_Sequences
- orthoGroups_paraFilter contains the 183 paralogue-filtered orthogroups (see main text for methodology)
scripts.zip
: contains the additional scripts needed to run or modify the analyses or their output that are not already available from the GitHub repositories cited in the main text. The scripts are roughly organised by task:
- dataProcessing
- IQTreeOutputProcessing
- IQTreeSearches
- plotting
- renamingScripts
- simulations
From here on, all subfolders follow these naming conventions:
- Taxon sampling (see main paper):
  - setting6_fast or S6: includes fast-evolving taxa
  - setting7_slow or S7: does not include fast-evolving taxa
- Topology (see Fig. 2 in main paper for tree topologies):
  - MonoDeut: monophyletic Deuterostomia topology
  - ProtAmbu: Orthozoa topology
  - ProtChord: Centroneuralia topology
  - ProtPara: paraphyletic Protostomia
- Amino acid substitution model: LG, CAT or EDM; if the folder name contains ‘_G’, the models were set up as LG+G, CAT+G or EDM+G (see main paper for methodology details)
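The conventions above can be turned into a small lookup for scripting over the archives. The following is a minimal Python sketch, not part of the deposited scripts: the helper names (`parse_model`, `describe`) are hypothetical, and it assumes that model folders are named exactly `LG`, `CAT` or `EDM`, optionally followed by `_G`.

```python
from typing import NamedTuple

# Labels copied from the naming conventions above.
SAMPLING = {
    "setting6_fast": "includes fast-evolving taxa",
    "S6": "includes fast-evolving taxa",
    "setting7_slow": "does not include fast-evolving taxa",
    "S7": "does not include fast-evolving taxa",
}
TOPOLOGIES = {
    "MonoDeut": "monophyletic Deuterostomia",
    "ProtAmbu": "Orthozoa",
    "ProtChord": "Centroneuralia",
    "ProtPara": "paraphyletic Protostomia",
}
BASE_MODELS = ("LG", "CAT", "EDM")


class ModelSpec(NamedTuple):
    base: str    # LG, CAT or EDM
    gamma: bool  # True when the folder name carries the '_G' suffix


def parse_model(folder: str) -> ModelSpec:
    """Split a model folder name such as 'EDM_G' into base model and +G flag.

    Assumption: folders are spelled '<MODEL>' or '<MODEL>_G' and nothing else.
    """
    base, sep, suffix = folder.partition("_")
    if base not in BASE_MODELS or (sep and suffix != "G"):
        raise ValueError(f"unrecognised model folder: {folder!r}")
    return ModelSpec(base, bool(sep))


def describe(path: str) -> dict:
    """Describe a relative path of the form '<sampling>/<model>'."""
    sampling, model = path.strip("/").split("/")[:2]
    spec = parse_model(model)
    return {
        "sampling": SAMPLING[sampling],
        "model": spec.base + ("+G" if spec.gamma else ""),
    }


if __name__ == "__main__":
    print(describe("setting6_fast/EDM_G"))
    print(TOPOLOGIES["ProtAmbu"])
```

Keeping the labels in plain dictionaries makes it easy to adjust them if a folder in the archives turns out to be spelled differently.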
alignments.zip
: contains the concatenated alignments used for all phylogenetic analyses (306-taxon and subsetted alignments)
- concatenated_306taxa: list of taxa and the concatenated alignment in fasta (.aln) and phylip (.phy) format
- setting6_fast: subsetted alignments including long-branched taxa
  - fasta: folder with 100 fasta-formatted randomly subsetted alignments
  - phylip: folder with 100 phylip-formatted randomly subsetted alignments
- setting7_slow: subsetted alignments excluding long-branched taxa
  - fasta: folder with 100 fasta-formatted randomly subsetted alignments
  - phylip: folder with 100 phylip-formatted randomly subsetted alignments
crossVals.zip
: contains the output of the leave-one-out cross-validation (loocv) analyses. The PhyloBayes run files were omitted from the repository to limit its size and can be provided on request.
- aln68_S7 and aln69_S6: contain the phylip alignments, guide trees and EDM category file used as input to PhyloBayes
- Each output folder contains the *.cpo and *.sitelogl files needed to summarise the loocv analyses with the PhyloBayes-provided scripts (details on pages 16-17 of the manual)
EDC_files.zip
: contains the IQTREE-compatible EDM category file generated with the EDCluster package
inputTrees.zip
: contains the guide trees used for the IQTREE topology-scoring analyses
- setting6_fast: contains the pruned trees for 100 subsetted alignments
- setting7_slow: contains the pruned trees for 100 subsetted alignments
- treesToPrune: original 306-taxon unpruned trees for MonoDeut, ProtAmbu and ProtChord hypotheses
IQTreeOutputFiles.zip
: contains the output files from IQTREE’s topology-scoring runs, organised by subsetting strategy and substitution model used (e.g. setting6_fast/EDM_G/). Within each of these subfolders analyses are organised into:
- aln#_MonoDeut
- aln#_ProtAmbu
- aln#_ProtChord
This exact folder structure is needed to run the IQTreeOutput_manipulation.R script. Each subfolder contains the standard IQTREE output files, plus a file with the per-site log-likelihood scores (.sitelh) and a file with the per-site rate categories (.rate); these two files are the input for the IQTreeOutput_manipulation.R script.
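Because the R script depends on this layout, it can help to check an unpacked copy before running it. The sketch below is one possible check in Python and is not part of the deposited scripts. The function name `find_incomplete_runs` is hypothetical, and the check assumes that every alignment number should have all three topology folders and that each folder holds at least one `*.sitelh` and one `*.rate` file.

```python
import re
from pathlib import Path

TOPOLOGIES = ("MonoDeut", "ProtAmbu", "ProtChord")
RUN_DIR = re.compile(r"aln(\d+)_(MonoDeut|ProtAmbu|ProtChord)")


def find_incomplete_runs(model_dir: Path) -> list:
    """List problems found under one <setting>/<model>/ folder.

    Assumption: each alignment is expected to have one run folder per
    topology, and each run folder one *.sitelh and one *.rate file.
    """
    problems = []
    seen = {}  # alignment number -> set of topology names found
    for run in sorted(p for p in model_dir.iterdir() if p.is_dir()):
        match = RUN_DIR.fullmatch(run.name)
        if match is None:
            problems.append(f"{run.name}: unexpected folder name")
            continue
        seen.setdefault(int(match.group(1)), set()).add(match.group(2))
        for ext in (".sitelh", ".rate"):
            if not any(run.glob(f"*{ext}")):
                problems.append(f"{run.name}: no *{ext} file")
    for aln, found in sorted(seen.items()):
        for topo in TOPOLOGIES:
            if topo not in found:
                problems.append(f"aln{aln}: missing {topo} run")
    return problems


if __name__ == "__main__":
    # Example path only; point this at an unpacked setting/model folder.
    target = Path("IQTreeOutputFiles/setting6_fast/EDM_G")
    if target.is_dir():
        for line in find_incomplete_runs(target):
            print(line)
```

An empty result means the folder matches the layout described above under the stated assumptions; it does not validate the contents of the files.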
results.zip
: contains edited files with site likelihoods, rate categories and supported topologies
- pseudoRateCat_noGamma: contains the files needed to generate the supplementary plots S1 and S3
- for each setting*/model/
- sitelhRatesTables: output of the IQTreeOutput_manipulation.R script, file extensions explained in script.
- siteSupportedTopology: output of the likelihood_transform.py script. The output of interest is compiled in the allReps_labelled.stats file.
- siteProfiles: files needed to generate figure 5 in the main text
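To illustrate what a per-site supported topology means, the sketch below picks, for each alignment site, the topology with the highest site log-likelihood. It is a simplified Python illustration and not a reimplementation of likelihood_transform.py, whose exact logic is documented in the script itself. The function name and the toy values are invented for the example, and the site log-likelihoods are assumed to have been read into lists already.

```python
from collections import Counter


def preferred_topology_per_site(site_lnl: dict) -> list:
    """Return, for each site, the topology with the highest log-likelihood.

    site_lnl maps a topology name to its list of per-site log-likelihoods.
    Ties go to the topology listed first in the dictionary.
    """
    lengths = {len(values) for values in site_lnl.values()}
    if len(lengths) != 1:
        raise ValueError("all topologies must cover the same number of sites")
    names = list(site_lnl)
    n_sites = lengths.pop()
    return [max(names, key=lambda name: site_lnl[name][i]) for i in range(n_sites)]


if __name__ == "__main__":
    # Toy values, invented for illustration only.
    toy = {
        "MonoDeut": [-3.0, -5.0, -2.0],
        "ProtAmbu": [-4.0, -4.5, -2.5],
        "ProtChord": [-3.5, -6.0, -1.0],
    }
    best = preferred_topology_per_site(toy)
    print(best)
    print(Counter(best))
```

Counting the winners with `Counter` gives a quick tally of how many sites favour each topology.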
topoSims.zip
: contains the alignments and the IQTree output for the simulation analyses
- Deut: data simulated under the Deuterostomia topology
  - setting6_fast:
    - deutSup_aln14: data simulated from the long-branched subsetted alignment 14
      - alignments: simulated alignments
      - IQTreeRuns: follows the IQTreeOutputFiles.zip structure and naming convention, but all analyses were run with the gamma parameter
      - siteProfileSims.zip: PhyloBayes site-profiles and EDCluster EDM category files for simulated alignments 40, 80, 120, 160 and 200
    - paraSup_aln91: data simulated from the long-branched subsetted alignment 91
      - same subfolders as deutSup_aln14
  - setting7_slow:
    - deutSup_aln47: data simulated from the short-branched subsetted alignment 47
      - same subfolders as setting6_fast/deutSup_aln14
    - paraSup_aln62: data simulated from the short-branched subsetted alignment 62
      - same subfolders as setting6_fast/deutSup_aln14
- Orth: data simulated under the Orthozoa topology
  - subfolders identical to the Deut folder
RELLbootstraps.zip
: contains the R script and workspace image to generate the supplementary plots S6-7
Change Log
20 Jun 2025: Minor updates to README.