Skip to main content

A molecular phylogeny of historical and contemporary specimens of an under-studied micro-invertebrate group

Cite this dataset

Orr, Russell et al. (2020). A molecular phylogeny of historical and contemporary specimens of an under-studied micro-invertebrate group [Dataset]. Dryad.


Resolution of relationships at lower taxonomic levels is crucial for answering many evolutionary questions, and as such, sufficiently varied species representation is vital. This latter goal is not always achievable with relatively fresh samples. To alleviate the difficulties in procuring rarer taxa, we have seen increasing utilization of historical specimens in building molecular phylogenies using high throughput sequencing. This effort, however, has mainly focused on large-bodied or well-studied groups, with small-bodied and under-studied taxa under-prioritized. Here, we utilize both historical and contemporary specimens, to increase the resolution of phylogenetic relationships among a group of under-studied and small-bodied metazoans, namely, cheilostome bryozoans. In this study, we pioneer sequencing of air-dried cheilostomes, utilizing a recent library preparation method for low DNA input. We evaluate a de novo mitogenome assembly and two iterative methods, using the sequenced target specimen as a reference for mapping, for our sequences. In doing so, we present mitochondrial and ribosomal RNA sequences of 43 cheilostomes representing 37 species, including 14 from historical samples ranging from 50 to 149 years old. The inferred phylogenetic relationships of these samples, analyzed together with publicly available sequence data, are shown in a statistically well-supported 65 taxa and 17 genes cheilostome tree, which is also the most broadly sampled and largest to date. The robust phylogenetic placement of historical samples whose contemporary conspecifics and or congenerics have been sequenced verify the appropriateness of our workflow and give confidence in the phylogenetic placement of those historical samples for which there are no close relatives sequenced.  The success of our workflow is highlighted by the circularization of a total of 27 mitogenomes, seven from historical cheilostome samples. Our study highlights the potential of utilizing DNA from micro-invertebrate specimens stored in natural history collections for resolving phylogenetic relationships among species.


Twenty-one dried historical cheilostome samples and 29 recently collected samples, of which 26 were ethanol-preserved and 3 dried, were targeted (Table S1). We selected the historical samples to represent a spread of collection dates while including those that are phylogenetically previously resolved (i.e. their phylogenetic placements are known and/or we have available contemporary samples of their conspecifics or congenerics) and those that are currently enigmatic. Recently collected samples were selected for their potential for phylogenetic verification of the historical specimens. Each colony was subsampled for DNA isolation and scanning electron microscopy (SEM), using a Hitachi TM4040PLus. For microscopy, we bleached subsamples in diluted household bleach for a few hours to overnight, removing soft tissues in order to document skeletal morphology. SEMs of dried samples were taken both pre- and post-bleaching. All physical vouchers necessary for taxonomic verification, are stored at the Natural History Museum of Oslo, University of Oslo, and SEMs are available in the Online Supporting Information (SI) as SEM cards familiar to bryozoologists. Additional sample metadata are reported in Table S1.


DNA isolation, sequencing, assembly and annotation

DNA from the 21 historical specimens were isolated in a laboratory designed for handling samples with low DNA concentrations (sensi-lab, NHM, Oslo). DNA extractions were performed inside a hood equipped with UV lights and all equipment was bleached and UV-sterilized prior to use. Samples were vortexed twice in nuclease-free H2O, air-dried, then subject to UV for 10 minutes to minimize possible surface contaminants. These treated samples were subsequently crushed with a stainless-steel mortar and pestle (Gondek, Boessenkool, & Star, 2018). To minimize the risk of sample cross-contamination, each historical sample was processed separately prior to DNA isolation. The 29 contemporary samples were isolated in a standard laboratory without the aforementioned surface decontamination and crushing steps.


Genomic DNA for all 50 specimens were isolated using the DNeasy Blood and Tissue kit (QIAGEN, Germantown, MD, USA). Colonies were homogenized in lysis buffer, using a pestle, in the presence of proteinase-K prior to a 16-hour 56oC incubation period. An on-column RNase A step was additionally performed to obtain RNA-free genomic DNA. DNA was eluted in pre-heated 60oC Tris-Cl buffer (10 mM) and incubated at 37oC for 10 minutes. Recovered DNA was quantified prior to library preparation using a Qubit 2.0 fluorometer (ThermoFisher, US).


Sequencing libraries were prepared with the standard KAPA HyperPrep kit (Roche, USA) by the Norwegian Sequencing Centre (NSC, Oslo, Norway) for DNA samples >15ng. For those with <15ng DNA the SMARTer® ThruPLEX® DNA-Seq Kit (Takara Bio Inc, Japan) was used in the sensi-lab at NHM Oslo (Table S1). The contemporary DNA samples were fragmented to a 350 bp insert size, whilst no such step was performed on the historical samples with already short DNA length, giving a variable insert size (Table S2). Unique dual indexes (UDIs) were used once per library and any unligated adapter removed. Libraries constructed from the KAPA and SMART preparation methods were each pooled, then separately loaded on two independent flow cells prior to multiplex sequencing with independent runs. Genomic DNA up to 150 bp in read length was pair-end (PE) sequenced on an Illumina HiSeq4000 at the NSC. A blank control, taken during the DNA extraction of the historical samples, was also sequenced (library preparation: SMARTer® ThruPLEX® DNA-Seq Kit). Illumina HiSeq reads were quality checked using FastQC v.0.11.8 (Andrews, 2010), then quality- and adapter-trimmed using TrimGalore v0.4.4 with a Phred score cutoff of 30  (Krueger, 2015). Trimmed reads were de novo assembled with SPAdes 3.11.1 (Bankevich et al., 2012) using k-mers of 21,33, 55, 77, 99, and 127. The mitogenome and rRNA operon of each sample were identified separately with blastn (Altschul, Gish, Miller, Myers, & Lipman, 1990) using blast+ against a database constructed from broadly sampled cheilostome sequences already deposited in NCBI (Table S6). An E-value of 1.00e-185 and maximum target sequence of 1 were used to filter any blast hits of non- cheilostome origin. We use three current assembly methods and compare their potential for the recovery of mitogenome sequences from our data and subsequent phylogenetic inference of our historical samples. For each historical sample, the SPAdes de novo assembled mitogenome sequence was used as the reference (seed) input for its own iterative mapped assembly using a relatively new but already widely-used method, GetOrganelle (Jin et al., 2019) and another commonly used method, NOVOplasty 3.7 (Dierckxsens, Mardulyn, & Smits, 2016), both under default settings. In addition, a mitogenome sequence from a taxon phylogenetically related to the historical sample in question was also provided for a second iterative assembly (Fig S1). Both iterative methods of GetOrganelle and NOVOplasty work by mapping sequencing reads to a reference, or seed, before de novo extension of contig ends. Circularization (or closure) of each mitochondrial genome was confirmed using blast2 (Altschul et al., 1990), with the same sequence used as query and reference to validate overlap of contig ends. The overlapping mitochondrial sequence region was subsequently trimmed manually.  


Annotation and alignments

Mitogenomes from the three separate assembly methods, for each of the samples were annotated with Mitos2 using a metazoan reference (RefSeq 89) and the invertebrate genetic code (Bernt et al., 2013) to identify two rRNA genes (rrnL and rrnS) and 13 protein coding genes (atp6, atp8, cox1, cox2, cox3, cob, nad1, nad2, nad3, nad4, nad4l, nad5, and nad6). In addition, two rRNA operon genes (ssu (18s) and lsu (28s)) were identified and annotated using RNAmmer (Lagesen et al., 2007). Further, mitogenes and rRNA operons from 27 bryozoan taxa (Table S5), obtained from NCBI, selected for their potential to verify the success of our workflow for our historical specimens, were aligned with our samples to compile a broader taxon sample representing a cheilostome ingroup and ctenostome outgroup. A minimum gene number of four per taxon, to avoid possible effects of missing-data on phylogenetic inference (Philippe et al., 2004; Wiens & Morrill, 2011), was set for this study. The number of genes included for each taxon are shown in Table 1, while sequence length per gene and % missing characters (nucleotide or amino acid) per taxon are shown in Table S4. MAFFT (Katoh & Standley, 2013) was used for alignment with default parameters: for the four rRNA genes (nucleotide) the Q-INS-i model, considering secondary RNA structure, was utilized; for the 13 protein-coding genes, in amino acid format, the G-INS-I model was used. The 17 separate alignments were edited manually using Mesquite v3.1 to remove any uncertain characters (Maddison & Maddison, 2017). Ambiguously aligned characters were removed from each alignment using Gblocks (Talavera & Castresana, 2007) with least stringent parameters. The single-gene alignments were concatenated to a supermatrix using the catfasta2phyml perl script (Nylander, 2010). An initial mitochondrial supermatrix, consisting of up to five assemblies per sample (Fig. S1) was used to compare and evaluate differences among the three assembly methods via a phylogenetic analysis (see next section). Using only a single assembly method per taxon from this initial supermatrix of historical samples we then created a downstream final supermatrix. Each sample (i.e. those presented in Table 1) was represented only once in the final phylogenetic analysis in this supermatrix to which the two rRNA operon genes from the SPAdes de novo assembly were added (Fig. 1). The alignments (both masked and unmasked) are available through Dryad (


Phylogenetic reconstruction

Maximum likelihood (ML) phylogenetic analyses were carried out for each single gene alignment using the “AUTO” parameter in RAxML v8.0.26 (Stamatakis, 2006) to establish the evolutionary model with the best fit. The general time reversible (GTR+G) was the preferred model for the four rRNA genes (18s, 28s, rrnS and rrnL), and MtZoa+G for all 13 protein coding genes. The concatenated datasets, divided into rRNA and protein gene partitions, each with its own separate gamma distribution were analyzed using RAxML. The topology with the highest likelihood score of 100 heuristic searches was chosen. Bootstrap values were calculated from 500 pseudo-replicates.


Bayesian inference (BI) was performed using a modified version of MrBayes incorporating the MtZoa evolutionary model (Huelsenbeck & Ronquist, 2001; Tanabe, 2016) for the samples represented in the final supermatrix (Fig .1). The dataset was executed, as before, with rRNA and protein gene partitions under their separate gamma distributions. Two independent runs, each with three heated and one cold Markov Chain Monte Carlo (MCMC) chain, were initiated from a random starting tree. The MCMC chains were run for 20,000,000 generations with trees sampled every 1,000th generation. The posterior probabilities and mean marginal likelihood values of the trees were calculated after the burnin phase. The average standard deviation of split frequencies between the two runs was <0.01, indicating convergence of the MCMC chains.


ERC, Award: 724324

ERC, Award: 724324