Data from: A phylogenomic backbone for Acoelomorpha inferred from transcriptomic data
Data files
Oct 22, 2024 version, files total 954.52 MB
- 00-Assemblies.zip (685.76 MB)
- 01-TransDecoder.zip (228.69 MB)
- 02-Alignments.zip (8.52 MB)
- 03-Genetrees.zip (1.76 MB)
- 04-FullDataset_noNotocelis.zip (9.74 MB)
- 05-Loci_subsampling.zip (18.78 MB)
- 06-Phylogenetically_informative_genes.zip (236.99 KB)
- 07-MCMCtree_inputdata.zip (216.16 KB)
- 08-Morphological_partitions_to_test-Nexus.zip (106.83 KB)
- 09-Morphological_phylogenetics-Nexus.zip (11.89 KB)
- genesortR-Gene_properties.csv (709.14 KB)
- README.md (3.29 KB)
Abstract
Xenacoelomorpha are mostly microscopic, morphologically simple worms, lacking many structures typical of other bilaterians. Xenacoelomorphs, which include three main groups (Acoela, Nemertodermatida, and Xenoturbella), have been proposed to be early-diverging Bilateria, sister to protostomes and deuterostomes, but other phylogenomic analyses have recovered this clade nested within the deuterostomes, as sister to Ambulacraria. The position of Xenacoelomorpha within the metazoan tree has understandably attracted a lot of attention, overshadowing the study of phylogenetic relationships within this group. Given that Xenoturbella includes only six species whose relationships are well understood, we decided to focus on the more speciose Acoelomorpha (Acoela + Nemertodermatida). Here, we have sequenced 29 transcriptomes, doubling the number of sequenced species, to infer a backbone tree for Acoelomorpha based on genomic data. The recovered topology is mostly congruent with previous studies. The most important difference is the recovery of Paratomella as the first off-shoot within Acoela, dramatically changing the reconstruction of the ancestral acoel. In addition, we have detected incongruence between the gene trees and the species tree, likely linked to incomplete lineage sorting, and some signal of introgression between the families Dakuidae and Mecynostomidae, which hampers inferring the correct placement of this family and, particularly, of the genus Notocelis. We have also used this dataset to infer, for the first time, diversification times within Acoelomorpha, which coincide with known bilaterian diversification and extinction events. Given the importance of morphological data in acoelomorph phylogenetics, we tested several partitions and models. Although morphological data failed to recover a robust phylogeny, phylogenetic placement has proven to be a suitable alternative when a reference phylogeny is available.
https://doi.org/10.5061/dryad.nvx0k6f0j
Here, we have sequenced 29 transcriptomes, doubling the number of sequenced species, to infer a backbone tree for Acoelomorpha based on genomic data. In doing so, we not only generated a robust phylogeny but also explored the data to identify potential sources of incongruence that generate topological instability. We then used this new phylogenomic hypothesis to (1) date major cladogenetic events within Acoelomorpha, (2) evaluate the performance of available morphological characters when reconstructing acoelomorph phylogeny and in phylogenetic placement of new species, and (3) study morphological evolution in this group.
Description of the data and file structure
Here, we provide access to the files necessary to replicate our findings, separated into ten directories: eight related to molecular phylogenomics and two to morphological phylogenetics.
- Molecular phylogenomics:
- “00-Assemblies”: the best assembly per specimen, after removing cross-contaminant contigs with CroCo.
- “01-TransDecoder”: two FASTA files per specimen, containing the coding regions as amino acid (.pep) or nucleotide (.cds) sequences.
- “02-Alignments”: Multiple sequence alignments of the 2774 genes included in the full dataset.
- “03-Genetrees”: Gene trees corresponding to these alignments. Please note that the number of gene trees does not match the number of alignments. IQ-TREE cannot infer trees with nodal support from alignments with fewer than four sequences. The minimum number of sequences per alignment is five, but some sequences might be identical, in which case IQ-TREE does not consider them during tree inference.
- “04-FullDataset_noNotocelis”: Full dataset after excluding the species Notocelis gullmarensis. This is the supermatrix created to run phylogenomic analyses and from which all data filtering was performed.
- “05-Loci_subsampling”: Gene trees, supermatrices, and partition files for all submatrices. There are 10 main datasets, created by selecting either the best 300 or the best 567 genes according to each of five properties: occupancy, substitution rate, saturation, compositional heterogeneity, and average patristic distance.
- “06-Phylogenetically_informative_genes”: Dataset created to infer the position of Notocelis and for the topology tests. These genes were selected based on the Likelihood Mapping algorithm implemented in IQ-TREE. Only the genes for which >70% of the quartets fall in one of the corners were selected.
- “07-MCMCtree_inputdata”: Supermatrix, reference tree, and control files for the MCMCtree analysis.
- Morphological phylogenetics:
- “08-Morphological_partitions_to_test-Nexus”: list of all nexus files used to infer the best model configuration for each partition.
- “09-Morphological_phylogenetics-Nexus”: list of nexus files used during phylogenetic inference, i.e. the best model configuration per partition scheme.
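The mismatch between the number of alignments and gene trees noted for “03-Genetrees” can be checked by counting distinct sequences per alignment. The following is a minimal Python sketch, not part of the deposited data; the toy alignment and the function names are hypothetical, and the threshold of four distinct sequences follows the description above rather than a reading of the IQ-TREE source code.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def count_distinct_sequences(records):
    """Count distinct aligned sequences (identical copies are counted once)."""
    return len(set(records.values()))


# Toy alignment (hypothetical data): five sequences, but only three are distinct.
example = """>sp1
MKT-LLA
>sp2
MKT-LLA
>sp3
MKTALLA
>sp4
MKTALLA
>sp5
MRT-LLG
"""

records = read_fasta(example)
n_total = len(records)
n_distinct = count_distinct_sequences(records)

# Assumed rule, following the text: a supported gene tree needs at least
# four distinct sequences.
tree_expected = n_distinct >= 4
print(n_total, n_distinct, tree_expected)
```

An alignment like this toy one meets the five-sequence minimum but would still lack a gene tree, which is the situation described for some files in “02-Alignments”.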
Finally, an additional file called “genesortR-Gene_properties.csv” is included. This is the output of the genesortR script and summarises several gene properties per alignment. This is the file that was used to create all the submatrices analysed.
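As an illustration of how submatrices can be derived from a table of gene properties, here is a short Python sketch that ranks genes by one property and keeps the top N. The column names (`gene`, `occupancy`, `saturation`), the toy values, and the direction in which each property counts as "best" are all assumptions made for the example; check the header of “genesortR-Gene_properties.csv” for the actual column names before adapting it.

```python
import csv
import io


def top_genes(csv_text, column, n, highest_is_best=True):
    """Return the names of the n best genes according to one numeric column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: float(row[column]), reverse=highest_is_best)
    return [row["gene"] for row in rows[:n]]


# Toy table with hypothetical column names and values.
toy_table = """gene,occupancy,saturation
OG0001,0.95,0.40
OG0002,0.60,0.10
OG0003,0.80,0.25
OG0004,0.99,0.55
"""

# Assumption: higher occupancy is better, lower saturation is better.
best_by_occupancy = top_genes(toy_table, "occupancy", 2)
least_saturated = top_genes(toy_table, "saturation", 2, highest_is_best=False)
print(best_by_occupancy, least_saturated)
```

With the real file, `n` would be 300 or 567 to mirror the dataset sizes described for “05-Loci_subsampling”.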
Molecular phylogenomics
A total of 29 transcriptomes were generated from individuals collected between 2007 and 2020, preserved in either RNAlater or RNA Shield, and stored long-term at -20 °C. Total RNA was extracted using the Zymo Microprep Quick-RNA kit (Zymo Research) and amplified with the SMARTer Universal Low Input RNA Kit (Takara Bio). The quality of the extractions was checked with the Bioanalyzer High Sensitivity DNA Analysis, and the samples were sent to either SciLifeLab or Macrogen for sequencing on an Illumina HiSeq X platform. Three cleaning and assembly strategies were devised to maximise assembly completeness. First, following standard practice, raw reads were cleaned with Trimmomatic and assembled with Trinity 2.9.1 with default parameters. Second, reads were assembled with the TransPi pipeline (version 1.1.0) using three k-mer lengths (21, 31, and 41). Third, raw reads were quality-filtered before the Trinity assembly in a three-step process: sequencing errors were corrected with Rcorrector 1.0.4, sequencing adapters were removed with Trimmomatic (as implemented in Trinity 2.9.1), and the reads were quality-filtered with Prinseq 0.20.4, trimming nucleotides with a Phred score under 30 from both ends and filtering out reads with a mean quality under 20, an entropy under 50, or a length under 40 base pairs. Redundant contigs were removed with EvidentialGene v2019.05.14, and cross-contaminants were filtered with CroCo 1.1, measuring contig expression with Kallisto 0.46.2. All transcriptomes were assembled following the three pipelines, and the best assembly was selected based on its completeness score, measured with BUSCO 3.0.2 and the Metazoa_odb9 database. Finally, coding regions with a minimum length of 300 amino acids were extracted with TransDecoder 5.3.0, and duplicates were collapsed (minimum identity 95%, minimum overlap 40 amino acids) with the Dedupe program from BBMap 38.92.
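To make the read-filtering thresholds concrete, the sketch below applies the end-trimming (Phred 30), mean-quality (20), and minimum-length (40 bp) criteria to a single read in plain Python. It is an illustration only and not how Prinseq is implemented: the order of the steps is an assumption, the entropy filter is omitted, and the function names and toy reads are hypothetical.

```python
def trim_ends(quals, min_q=30):
    """Return (start, end) indices after trimming low-quality bases from both ends."""
    start, end = 0, len(quals)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return start, end


def filter_read(seq, quals, min_end_q=30, min_mean_q=20, min_len=40):
    """Trim both ends, then return (seq, quals) if the read passes, else None.

    Assumption: trimming is applied before the length and mean-quality checks.
    """
    start, end = trim_ends(quals, min_end_q)
    seq, quals = seq[start:end], quals[start:end]
    if not seq or len(seq) < min_len:
        return None
    if sum(quals) / len(quals) < min_mean_q:
        return None
    return seq, quals


# Toy reads (hypothetical): 50 bases each, with low-quality ends.
read_ok = ("A" * 50, [10] * 3 + [35] * 45 + [12] * 2)
read_short = ("C" * 50, [10] * 8 + [35] * 35 + [10] * 7)

kept = filter_read(*read_ok)        # 45 bases remain after trimming
dropped = filter_read(*read_short)  # only 35 bases remain, below 40 bp
print(len(kept[0]), dropped)
```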
The extracted proteins were assigned to orthogroups with OrthoFinder 2.4.1 and screened for paralogs with PhyloPyPruner 1.2.3 with the following settings: pruning algorithm “Largest Subtree”, keep orthogroups with at least five taxa, trim branches longer than five times the standard deviation of all branch lengths, collapse nodes with nodal support under 60, and, in species-specific duplications, keep the sequences with the shortest pairwise distance to their sister taxa. For the seven species represented by two transcriptomes, only the specimen with the highest number of orthologs was kept. Non-homologous stretches within the sequences were identified and masked with Prequal 1.02, and all sequences shorter than 250 unmasked amino acids were removed. All remaining orthogroups with more than five species were aligned with MAFFT 7.475 using the L-INS-i algorithm. Ambiguously aligned positions, sequences shorter than 66% of the total alignment length, and sites with more than 80% missing data were filtered with BMGE 1.12. The alignments that did not meet the assumptions of stationarity and homogeneity were identified with IQ-TREE 2.1.3 and removed. The resulting dataset included 2774 genes. This dataset was filtered by occupancy, substitution rate, level of saturation, compositional heterogeneity, and average patristic distances to reduce systematic errors. ASTRAL, IQ-TREE (partitioned and site-specific C20 and C60 models), and PhyloBayes were used for phylogenomic inference. MCMCtree was used to infer divergence times.
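The two numeric alignment filters mentioned above (sequences shorter than 66% of the alignment length, sites with more than 80% missing data) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not BMGE's algorithm: sequences are filtered before sites, gaps and `?` are treated as missing, the entropy-based trimming of ambiguously aligned positions is left out, and the toy alignment is hypothetical.

```python
def filter_alignment(aln, min_seq_fraction=0.66, max_missing=0.80, missing="-?"):
    """Drop short sequences, then drop sites with too much missing data.

    aln: dict of {name: aligned sequence}, all sequences of equal length.
    """
    length = len(next(iter(aln.values())))

    # Step 1: keep sequences whose residues cover at least 66% of the alignment.
    kept = {
        name: seq
        for name, seq in aln.items()
        if sum(c not in missing for c in seq) >= min_seq_fraction * length
    }
    if not kept:
        return {}

    # Step 2: keep sites where at most 80% of the remaining sequences are missing.
    keep_cols = [
        i
        for i in range(length)
        if sum(seq[i] in missing for seq in kept.values()) / len(kept) <= max_missing
    ]
    return {name: "".join(seq[i] for i in keep_cols) for name, seq in kept.items()}


# Toy alignment (hypothetical): sp6 is mostly gaps and column 9 is all gaps.
toy_alignment = {
    "sp1": "MKTAYLLQ-A",
    "sp2": "MKTAYLLQ-A",
    "sp3": "MRTAYLIQ-A",
    "sp4": "MKSAYLLQ-G",
    "sp5": "MKTAFLLQ-A",
    "sp6": "MK--------",
}

filtered = filter_alignment(toy_alignment)
print(sorted(filtered), filtered["sp1"])
```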
Morphological phylogenetics
A morphological matrix including all species and up to 44 characters was prepared based on descriptions from the literature and photographs of the specimens analysed. Several partition schemes were tested to maximise the phylogenetic signal of the data. The stepping-stone algorithm implemented in MrBayes 3.2.7 was used to calculate the marginal likelihood of each scheme under the standard discrete model, applying the ascertainment bias correction and varying the model parameters: fixed or variable rates among partitions (APRV), fixed or variable rates among characters (ACRV), and linked or unlinked branch lengths, testing nine models per partition scheme. Finally, the best overall partition scheme and the best model configuration per scheme were identified with Bayes factors. MrBayes was used to infer a phylogenetic tree for each partition scheme, applying the best-fit model configuration. For each analysis, we ran two independent runs with four Markov chains each for 50 million generations, sampling every 10,000 generations and discarding the first 25% as burn-in. Chain convergence was assessed by checking for correct mixing in the log-likelihood plot, that all ESS values were above 200, and that the Potential Scale Reduction Factor was close to one.
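Model comparison with Bayes factors reduces to differences between marginal log-likelihoods, such as those estimated by stepping-stone sampling. The Python sketch below ranks models on the commonly used 2 ln(BF) scale relative to the best model. The model labels and likelihood values are invented for illustration and are not results from this dataset.

```python
def rank_models(marginal_lnl):
    """Return the best model and 2*ln(Bayes factor) of the best against each model.

    marginal_lnl: dict of {model name: marginal log-likelihood}.
    """
    best = max(marginal_lnl, key=marginal_lnl.get)
    two_ln_bf = {
        model: 2.0 * (marginal_lnl[best] - lnl)
        for model, lnl in marginal_lnl.items()
    }
    return best, two_ln_bf


# Hypothetical marginal log-likelihoods for three model configurations.
toy_marginals = {
    "linked_equal_rates": -1250.4,
    "linked_gamma": -1238.9,
    "unlinked_gamma": -1241.2,
}

best_model, two_ln_bf = rank_models(toy_marginals)
for model, value in sorted(two_ln_bf.items(), key=lambda item: item[1]):
    print(f"{model}\t{value:.1f}")
```

The best model has a value of zero by construction; larger values indicate stronger support for the best model over the model in that row.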
Additionally, the ability of these characters to place a set of species in a given tree was tested using the phylogenetic placement algorithm implemented in RAxML 8.2.12. First, morphological characters were weighted in RAxML using the IQ-TREE topology as a guide tree with four gamma categories and applying the Lewis ascertainment bias correction. Then, a morphological matrix with 84 acoel species was downloaded from Jondelius et al. (2011, Systematic Biology) and used to place the species in the reference tree applying the inferred character weights.