Data from: The impact of paralogy on phylogenomic studies - a case study on annelid relationships


Struck, Torsten H. (2013), Data from: The impact of paralogy on phylogenomic studies - a case study on annelid relationships, Dryad, Dataset,


Phylogenomic studies based on hundreds of genes derived from expressed sequence tag (EST) libraries are increasingly used to reveal the phylogeny of taxa. A prerequisite for these studies is the assignment of genes into clusters of orthologous sequences. Sophisticated methods of orthology prediction are used in such analyses, but it is rarely assessed whether paralogous sequences have been erroneously grouped together as orthologous sequences after the prediction, and whether this had an impact on the phylogenetic reconstruction using a supermatrix approach. Herein, I tested the impact of paralogous sequences on the reconstruction of annelid relationships based on phylogenomic datasets. Using single-partition analyses, screening for bootstrap support, BLAST searches and pruning of sequences in the supermatrix, wrongly assigned paralogous sequences were found in eight partitions. The placement of five taxa (the annelids Owenia, Scoloplos, Sthenelais and Eurythoe and the nemertean Cerebratulus), including its robust bootstrap support, could be attributed to the presence of paralogous sequences in two partitions. Excluding these sequences resulted in a different, more weakly supported placement for these taxa. Moreover, the analyses revealed that paralogous sequences impacted the reconstruction when only a single taxon represented a previously supported higher taxon such as a polychaete family. One possibility for a priori detection of wrongly assigned paralogous sequences would be to combine 1) a screening of single-partition analyses based on criteria such as nodal support or internal branch length with 2) BLAST searches of suspicious cases as presented herein. Also possible are a posteriori approaches in which support for specific clades is investigated by comparing alternative hypotheses based on differences in per-site likelihoods.
Increasing the sizes of EST libraries will also decrease the likelihood of wrongly assigned paralogous sequences, and in the case of orthology prediction methods like HaMStR it is likewise decreased by using more than one reference taxon.
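
The a priori screening step described in the abstract (flagging single-partition analyses by nodal support or internal branch length before checking them with BLAST) can be illustrated with a minimal Python sketch. This is not the procedure or code used in the study: the `PartitionSummary` fields, the `flag_suspicious` helper, both cutoff values and the toy partitions are hypothetical and only show the shape of such a filter.

```python
from dataclasses import dataclass


@dataclass
class PartitionSummary:
    """Per-partition values taken from a single-partition tree (hypothetical fields)."""
    name: str
    max_conflicting_support: float  # highest bootstrap (%) of a node contradicting accepted groupings
    longest_internal_branch: float  # in substitutions per site


def flag_suspicious(partitions, support_cutoff=70.0, branch_cutoff=0.5):
    """Return the names of partitions that exceed either cutoff.

    Both cutoffs are illustrative assumptions, not values from the study.
    Flagged partitions are candidates for a follow-up BLAST check.
    """
    flagged = []
    for p in partitions:
        if (p.max_conflicting_support >= support_cutoff
                or p.longest_internal_branch >= branch_cutoff):
            flagged.append(p.name)
    return flagged


# Toy input, invented for illustration only.
toy = [
    PartitionSummary("gene_A", 12.0, 0.08),
    PartitionSummary("gene_B", 95.0, 0.10),
    PartitionSummary("gene_C", 30.0, 0.90),
]
suspicious = flag_suspicious(toy)
print(suspicious)
```

In this sketch a partition is flagged when either criterion is met, which errs on the side of sending more partitions to the BLAST check. Whether the two criteria should be combined with "or" or "and" is a design choice the abstract does not settle.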

