Skip to main content

A phylogenomic approach to resolving interrelationships of polyclad flatworms, with implications for life history evolution

Cite this dataset

Goodheart, Jessica et al. (2022). A phylogenomic approach to resolving interrelationships of polyclad flatworms, with implications for life history evolution [Dataset]. Dryad.


Platyhelminthes (flatworms) are a diverse invertebrate phylum that are useful for exploring life history evolution. Within Platyhelminthes, only two clades develop through a larval stage: free-living polyclads and parasitic neodermatans. Neodermatan larvae are considered evolutionarily derived, whereas polyclad larvae are hypothesized to be retained from the last common ancestor of Platyhelminthes – and Spiralia – due to ciliary band similarities among polyclad and other spiralian larvae. However, larval evolution has been challenging to investigate within polyclads due to low support for deeper phylogenetic relationships. To investigate polyclad life history evolution, we generated transcriptomic data for 21 species of polyclads to build a well-supported phylogeny for the group. We then used ancestral state reconstruction to investigate ancestral modes of development (direct vs indirect) within Polycladida, and flatworms in general. The resulting tree provides strong support for deeper nodes and we recover a new monophyletic clade of early branching cotyleans. Early branching clades of acotyleans and cotyleans possess diverse modes of development, suggesting a complex history of larval evolution in polyclads that likely includes multiple losses and/or multiple gains. Our ancestral state reconstructions in previous platyhelminth phylogenies also suggest that larval similarities between flatworms and other phyla are most likely convergently evolved.


Quality control and assembly of reads

Reads that failed to pass the Illumina “Chastity” quality filter were excluded from our analyses. Reads passing the quality filter were assembled using Trinity (version 2.4.0 for most, but version 2.6.6 for species Boninia divae and Theama mediterranea; (Grabherr et al. 2011)) with default settings, which required assembled transcript fragments to be at least 200 bp in length. Reads were trimmed pre-assembly for the species Boninia divae and Theama mediterranea using Trimmomatic (Bolger, Lohse, and Usadel 2014).

Orthology assignment

Translated transcript fragments were organized into orthologous groups corresponding to a custom platyhelminth-specific core-ortholog set of 9,157 protein models (constructed in the same manner as in (Goodheart et al. 2015)) using HaMStR (version 13.2.6; (Ebersberger, Strauss, and von Haeseler 2009)), which in turn used FASTA (version 36.3.6; (Pearson and Lipman 1988)), GeneWise (version 2.2.0; (Birney, Clamp, and Durbin 2004)), and HMMER (version 3.1b2; (Eddy 2011)). In the first step of the HaMStR procedure, substrings of assembled transcript fragments (translated nucleotide sequences) that matched one of the platyhelminth protein models were provisionally assigned to that orthologous group. To reduce the number of highly divergent, potentially paralogous sequences returned by this search, we set the E-value cutoff defining an HMM hit to 1e-05 (the HaMStR default is 1.0), and retained only the top-scoring quartile of hits. In the second HaMStR step, the provisional hits from the HMM search were compared to our reference taxon, Echinococcus granulosus (Batsch, 1786), and retained only if they survived a reciprocal best BLAST hit test with the reference taxon using an E-value cutoff of 1e-05 (the HaMStR default was 10.0). In our implementation, we substituted FASTA for BLAST (Altschul et al. 1990) because FASTA programs readily accepted our custom amino acid substitution matrix (POLY90). 

The Platyhelminthes core-ortholog set was generated by first downloading all available platyhelminthes clusters with 50% similarity or higher from UniProt (Bairoch et al. 2005) (70,698 clusters). Excluding clusters that contained only one sequence left 20,874 clusters. We calculated the sequence similarity of each cluster and as a heuristic, decided to remove clusters whose percent identity was less than 70%, which left 20,549 clusters. We then assessed the number of times each taxon was represented within those clusters. Echinococcus granulosus was identified as the most closely related, most abundant taxon (9,157 associated clusters with 70% similarity or higher), and was therefore selected as the reference taxon for the custom HaMStR database. We constructed the platyhelminthes HaMStR database by following the steps given in the HaMStR README file, which included generating profile hidden Markov models for each cluster using HMMER. Our platyhelminthes HaMStR database contained 9,157 orthologous groups. All protein sequences for Echinococcus granulosus (UniProt/NCBI taxon ID 6210) were downloaded from UniProt and used to generate the BLAST database for HaMStR.

Construction of the custom substitution matrix (PLATY90) followed the procedure outlined in Lemaitre et al. (Lemaitre et al. 2011), which used only greater than 90%-similarity platyhelminthes clusters downloaded from UniProt with singleton clusters removed. Use of a taxonomically-focused amino acid substitution matrix follows similar procedures used in arthropods (Bazinet et al. 2017) and gastropods (Goodheart et al. 2015, 2017) that seek to improve the amino acid alignments performed in the process of a phylogenomic workflow. In this protocol, a block is defined as a conserved, gap-free region of the alignment. Our blocks output file contained 205,562 blocks.

Construction of data matrix and paralogy filtering

Protein sequences in each orthologous group were aligned using MAFFT (version 7.187; (Katoh and Standley 2013)). We used the --auto and --add fragments options of MAFFT to align transcript fragments to the Echinococcus granulosus reference sequence, which was considered the existing alignment. We converted the protein alignments to corresponding nucleotide alignments using a custom Perl script. A maximum likelihood tree was inferred using RAxML-NG (RAxML Next Generation version 0.6.0; (Kozlov et al. 2018)) for each orthologous group where at least 75% of the taxa were present (4668 orthologous groups), and was given as input to PhyloTreePruner (version 1.0; (Kocot et al. 2013)). Orthologous groups that showed evidence of out-paralogs for any taxa (2530 orthologous groups out of 4426) were pruned according to the default PhyloTreePruner protocol, which removes all additional sequences outside of a maximally inclusive sub-tree. For orthologous groups containing in-paralogs, multiple sequences were combined into a single consensus sequence for each taxon, and orthologous groups for which fewer than 75% of taxa remained were discarded. This process left 4469 orthologous groups eligible for inclusion in our data matrices. Individual orthologous group alignments were concatenated, and codons not represented by sequence data in at least four taxa were then removed.

Phylogenetic analyses

For phylogenetic analysis, the final nucleotide data matrix from transcriptome data was partitioned by codon position by assigning different model parameters and rates to the three codon positions. We conducted the phylogenetic analysis using RAxML-NG (version 0.6.0; (Kozlov et al. 2018)). We used the default settings in RAxML-NG, and partitioned our data set by codon position. Each partition was assigned a general time reversible substitution model (GTR; (Tavaré 1986)) with a rate heterogeneity model with a proportion of invariant sites estimated (+I) and the remainder with a gamma distribution (+G; (Yang 1993)]), along with stepwise-addition starting trees. For our analysis, 500 bootstrap replicates were generated and a best tree search was performed with 20 search replicates. 


  • Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. 1990. “Basic Local Alignment Search Tool.” Journal of Molecular Biology 215 (3): 403–10.
  • Bairoch, Amos, Rolf Apweiler, Cathy H. Wu, Winona C. Barker, Brigitte Boeckmann, Serenella Ferro, Elisabeth Gasteiger, et al. 2005. “The Universal Protein Resource (UniProt).” Nucleic Acids Research 33 (Database issue): D154–59.
  • Bazinet, Adam L., Kim T. Mitter, Donald R. Davis, Erik J. Van Nieukerken, Michael P. Cummings, and Charles Mitter. 2017. “Phylotranscriptomics Resolves Ancient Divergences in the Lepidoptera.” Systematic Entomology 42 (2): 305–16.
  • Birney, Ewan, Michele Clamp, and Richard Durbin. 2004. “GeneWise and Genomewise.” Genome Research 14 (5): 988–95.
  • Bolger, Anthony M., Marc Lohse, and Bjoern Usadel. 2014. “Trimmomatic: A Flexible Trimmer for Illumina Sequence Data.” Bioinformatics  30 (15): 2114–20.
  • Ebersberger, Ingo, Sascha Strauss, and Arndt von Haeseler. 2009. “HaMStR: Profile Hidden Markov Model Based Search for Orthologs in ESTs.” BMC Evolutionary Biology 9 (July): 157.
  • Eddy, Sean R. 2011. “Accelerated Profile HMM Searches.” PLoS Computational Biology 7 (10): e1002195.
  • Goodheart, Jessica A., Adam L. Bazinet, Allen G. Collins, and Michael P. Cummings. 2015. “Relationships within Cladobranchia (Gastropoda: Nudibranchia) Based on RNA-Seq Data: An Initial Investigation.” Royal Society Open Science 2 (9): 150196.
  • Goodheart, Jessica A., Adam L. Bazinet, Ángel Valdés, Allen G. Collins, and Michael P. Cummings. 2017. “Prey Preference Follows Phylogeny: Evolutionary Dietary Patterns within the Marine Gastropod Group Cladobranchia (Gastropoda: Heterobranchia: Nudibranchia).” BMC Evolutionary Biology 17 (1): 221.
  • Grabherr, Manfred G., Brian J. Haas, Moran Yassour, Joshua Z. Levin, Dawn A. Thompson, Ido Amit, Xian Adiconis, et al. 2011. “Full-Length Transcriptome Assembly from RNA-Seq Data without a Reference Genome.” Nature Biotechnology 29 (7): 644–52.
  • Katoh, Kazutaka, and Daron M. Standley. 2013. “MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability.” Molecular Biology and Evolution 30 (4): 772–80.
  • Kocot, Kevin M., Mathew R. Citarella, Leonid L. Moroz, and Kenneth M. Halanych. 2013. “PhyloTreePruner: A Phylogenetic Tree-Based Approach for Selection of Orthologous Sequences for Phylogenomics.” Evolutionary Bioinformatics Online 9 (October): 429–35.
  • Kozlov, Alexey, Diego Darriba, Tomas Flouri, Benoit Morel, and Alexandros Stamatakis. 2018. “RAxML-NG: A Fast, Scalable, and User-Friendly Tool for Maximum Likelihood Phylogenetic Inference.” Bioinformatics. bioRxiv.
  • Lemaitre, Claire, Aurélien Barré, Christine Citti, Florence Tardy, François Thiaucourt, Pascal Sirand-Pugnet, and Patricia Thébault. 2011. “A Novel Substitution Matrix Fitted to the Compositional Bias in Mollicutes Improves the Prediction of Homologous Relationships.” BMC Bioinformatics 12 (1).
  • Pearson, W. R., and D. J. Lipman. 1988. “Improved Tools for Biological Sequence Comparison.” Proceedings of the National Academy of Sciences 85 (8): 2444–48.
  • Tavaré, Simon. 1986. “Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences.” In In Some Mathematical Questions in Biology—DNA Sequence Analysis, edited by R. M. Miura, 57–86. Providence, RI, USA: American Mathematical Society.
  • Yang, Z. 1993. “Maximum-Likelihood Estimation of Phylogeny from DNA Sequences When Substitution Rates Differ over Sites.” Molecular Biology and Evolution 10 (6): 1396–1401.


Smithsonian Instution Global Genome Initiative, Award: GGI-Rolling-2016-035

National Science Foundation, Award: PIRE-1243541

National Science Foundation, Award: PRFB-1711201

Universität Innsbruck