Data from: Phylogenomic diversity of archigregarine apicomplexans
Data files
Oct 04, 2024 version files 2.48 GB
- 1-assemblies.zip (72.94 MB)
- 2-host-COIs.fasta (7.51 KB)
- 3-SSU.zip (65.51 KB)
- 4-multigene_singlegene.zip (48.91 MB)
- 5-multigene_concatenated.zip (16.35 MB)
- 6-raw_images.zip (2.35 GB)
- README.md (5.25 KB)
Abstract
Gregarines are a large and diverse subgroup of Apicomplexa, a lineage of obligate animal symbionts including pathogens like Plasmodium, the malaria parasite. Unlike Plasmodium, however, gregarines are poorly studied, despite the fact that as early-branching apicomplexans they are crucial to our understanding of the origin and evolution of all apicomplexans and their parasitic lifestyle. Exemplifying this, the earliest branch of gregarines, the archigregarines, are particularly poorly studied: around 80 species have been described from marine invertebrates, but almost all of them were assigned to a single genus, Selenidium. Most are known only from light micrographs and largely unresolved rDNA phylogenies, where they exhibit a great deal of sequence variation, but fall into four subclades. To resolve the relationships within archigregarines, we sequenced 12 single-cell transcriptomes from species representing all four known subclades, as well as one blastogregarine (which frequently branch with Selenidium). A 190-gene phylogenomic tree confirmed four maximally-supported individual clades of archigregarines and blastogregarines. These clades are discrete and distantly related, and also correlate with host identity. We propose the establishment of three novel genera of archigregarines to reflect their phylogenetic diversity and host range, and nine novel species isolated from a range of marine invertebrates.
These are supplemental data for the publication "Phylogenomic diversity of archigregarine apicomplexans", in Royal Society Open Biology.
This dataset contains the assembled single-cell transcriptomes of 11 archigregarine taxa and one blastogregarine taxon. These assemblies were used to reconstruct a 190-gene phylogeny of Apicomplexa, which shows archigregarines and blastogregarines are composed of five independent clades branching sister to other gregarines. All single-gene data used to construct the multigene phylogeny, as well as the concatenations, are included, as are the SSU rRNA gene phylogeny data.
Treefiles are openable with FigTree or other tree-viewing software.
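For scripted checks, leaf labels can also be pulled from a NEWICK string without a tree viewer. The following is a minimal sketch using only the Python standard library. The function name `leaf_labels` and the toy tree are illustrative and not part of the dataset, and the sketch assumes unquoted labels without spaces or bracketed comments. A dedicated tree library is a better fit for anything beyond listing taxa.

```python
import re

# A leaf label is the token that directly follows "(" or ",".
# Internal-node labels and support values follow ")" and are therefore skipped.
# Assumption: labels are unquoted and contain no whitespace or NEWICK punctuation.
_LEAF = re.compile(r"[(,]\s*([^\s(),:;]+)")


def leaf_labels(newick):
    """Return the leaf labels in the order they appear in a NEWICK string."""
    return _LEAF.findall(newick)


# Toy tree with hypothetical taxon names, one support value and branch lengths.
toy_tree = "(A:0.1,(B:0.2,C:0.3)100:0.05,D:0.4);"
print(leaf_labels(toy_tree))
```

To apply this to one of the treefiles, read the file contents into a string and pass it to `leaf_labels`.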
Sequenced archigregarines and blastogregarines:
Devanium cincinnus Ph216
Devanium robustum SelFal
Lunidium laculatum SNEK
Lunidium melongena SelMel
Lunidium proboscidis SNEKD
Lunidium shako KNOB
Metzidium perlucensae SQU2901
Selenidium capillus Ph213
Selenidium natalis SEL2980
Selenidium pherusae Ph226
Selenidium validusae WK
Siedleckia leitoscoloplosis BL3
Description of the data and file structure
There are 6 main parts to this dataset:
- Assemblies of 11 archigregarines and 1 blastogregarine, provided as nucleotide sequences (ending in .fasta; generated with rnaSPAdes) and predicted peptides (ending in .pep; predicted with TransDecoder), both in fasta-format.
- Mitochondrial cytochrome c oxidase (COI) sequences of invertebrate hosts ('2-host-COIs.fasta', in fasta-format).
- SSU rRNA gene alignment of gregarines and other apicomplexans. Unmasked (aligned with MAFFT, ending in .muscle.fasta) and trimmed alignment files (trimmed with gblocks, ending in .gblocks.fasta; both in fasta-format), and corresponding treefile (RAxML-ng with 1,000 bootstraps under GTR+Gamma model; ending in .tre, NEWICK-format).
- Multigene single-gene data. This includes the folders:
  - 'mafft': 263 single-gene alignments (all in fasta-format), aligned with MAFFT L-INS-I
  - 'bmge': 263 single-gene trimmed alignments (all in fasta-format), trimmed with BMGE
  - 'SGTs': 263 single-gene treefiles (all in NEWICK-format), estimated with IQTree under the LG4X model and 1,000 ultrafast bootstraps
- Multigene concatenated data. This includes the folders:
  - 'main_190gene': multigene concatenation alignment of 63 apicomplexan taxa and 190 genes (concatenation_190gene_over40per.Jul2-2023.sani.fasta; in fasta-format) and corresponding treefiles derived from phylogenetic analysis, generated with IQTree2 under the LG+C60+F+G model with 1,000 Ultrafast Bootstraps (file 'concatenation_190gene_UFB.treefile') and 200 non-parametric bootstraps under the PMSF model (file 'concatenation_190gene_PMSF.treefile'). Both treefiles are in NEWICK-format.
  - '129gene': multigene concatenation of 63 taxa and 129 genes (CAT_63S129F.fasta; in fasta-format), and corresponding treefile estimated with IQTree2 under the LG+C60+F+G model with 1,000 Ultrafast Bootstraps (CAT_63S129F.fasta.treefile; in NEWICK-format).
  - '22gene': multigene concatenation of 63 taxa and 22 genes (CAT_63S22F.fasta; in fasta-format), and corresponding treefile estimated with IQTree2 under the LG+C60+F+G model with 1,000 Ultrafast Bootstraps (CAT_63S22F.fasta.treefile; in NEWICK-format).
  - 'fast-site_removal': 12 fasta-format alignment files (ending in .fas) and 12 treefiles, each generated by running the corresponding alignment file in IQTree2 under LG+C20+F+G with 1,000 Ultrafast Bootstraps. The alignment files were generated with PhyloFisher's fast_site_remover.py script (https://github.com/TheBrownLab/PhyloFisher) by removing 3,000 fast-evolving amino acid sites at each step.
  - 'heterotacheous-site_removal': 12 fasta-format alignment files (ending in .fas) and 12 treefiles, each generated by running the corresponding alignment file in IQTree2 under LG+C20+F+G with 1,000 Ultrafast Bootstraps. The alignment files were generated with PhyloFisher's heterotachy.py script (https://github.com/TheBrownLab/PhyloFisher) by removing 3,000 heterotachous amino acid sites at each step.
  - 'SR4_aa-recoding': one alignment file in fasta-format and corresponding treefile generated with IQTree2 under the GTR+R6+F model and 1,000 Ultrafast Bootstraps. To generate this alignment, the original alignment from 'main_190gene' was recoded to SR4 using PhyloFisher's recoder.py script (https://github.com/TheBrownLab/PhyloFisher) with the -re SR4 option.
- Raw images. In JPEG or RAW-format (ending in .ARW). This includes individual folders for each taxon:
  - Devanium_cincinnus_Ph216
  - Devanium_robustum_SelFal
  - Lunidium_laculatum_SNEK
  - Lunidium_proboscidis_SNEKD
  - Lunidium_shako_KNOB
  - Metzidium_perlucensae_SQU2901
  - Selenidium_capillus_Ph213
  - Selenidium_natalis_SEL2980
  - Selenidium_pherusae_Ph226
  - Siedleckia_nematoides_BL3
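
The alignment files described above can be sanity-checked with a few lines of Python. The sketch below reads fasta-format text, reports the number of sequences and the alignment length, and illustrates the idea behind SR4 recoding. It is not the PhyloFisher implementation. The function names are hypothetical, the SR4 bins are the commonly cited four-state grouping, and the output letters (A, C, G, T) and the handling of gaps and unknown characters are assumptions made for illustration.

```python
# SR4 bins: 20 amino acids collapsed into four states.
# Assumption: the state letters A/C/G/T are an arbitrary labelling chosen here.
SR4_BINS = {"AGNPST": "A", "CHWY": "C", "DEKQR": "G", "FILMV": "T"}
SR4_MAP = {aa: state for group, state in SR4_BINS.items() for aa in group}


def read_fasta(lines):
    """Parse fasta-format lines into {name: sequence}.

    The name is the first whitespace-delimited token of the header.
    A repeated name overwrites the earlier record.
    """
    chunks = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {n: "".join(parts) for n, parts in chunks.items()}


def alignment_shape(records):
    """Return (number of sequences, alignment length); raise if lengths differ."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) > 1:
        raise ValueError(f"sequences have unequal lengths: {sorted(lengths)}")
    return len(records), (lengths.pop() if lengths else 0)


def recode_sr4(seq):
    """Recode an amino acid sequence to four states.

    Assumption: gaps, X and any other character become "-".
    """
    return "".join(SR4_MAP.get(aa, "-") for aa in seq.upper())


# Toy alignment with hypothetical taxon names.
toy = [">taxon1 example", "AC-D", "EF", ">taxon2", "GHKLMV"]
records = read_fasta(toy)
print(alignment_shape(records))
print({name: recode_sr4(seq) for name, seq in records.items()})
```

For a real file, pass an open file handle to `read_fasta`. If the file matches the description above, the 'main_190gene' concatenation should report 63 sequences.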
Sharing/Access information
Raw Illumina read data is deposited under NCBI BioProject accession PRJNA1090553, and SSU rDNA sequences of all reported taxa can be found under NCBI accessions PP553612 to PP553623.
- Lax, Gordon; Park, Eunji; Na, Ina et al. (2024). Phylogenomic diversity of archigregarine apicomplexans. Open Biology. https://doi.org/10.1098/rsob.240141
