Detecting and removing sample contamination in phylogenomic data: An example and its implications for Cicadidae phylogeny (Insecta: Hemiptera)
Owen, Christopher (2022), Detecting and removing sample contamination in phylogenomic data: An example and its implications for Cicadidae phylogeny (Insecta: Hemiptera), Dryad, Dataset, https://doi.org/10.5061/dryad.tht76hdz1
Contamination of a genetic sample with DNA from one or more non-target species is a continuing concern of molecular phylogenetic studies, both Sanger sequencing studies and Next-Generation Sequencing (NGS) studies. We developed an automated pipeline for identifying and excluding likely cross-contaminated loci based on detection of bimodal distributions of patristic distances between taxa across gene trees. When the contamination occurs between samples within a dataset, comparisons between a contaminated sample and its contaminant taxon will yield bimodal distributions with one peak close to zero patristic distance. This new method does not rely on a priori knowledge of taxon relatedness, nor does it determine the process(es) that caused the contamination. Exclusion of putatively contaminated loci from a dataset generated for the insect family Cicadidae showed that these sequences were affecting some topological patterns and branch supports, although the effects were sometimes subtle, with some contamination-influenced relationships exhibiting strong bootstrap support. Long tip branches and outlier values for one anchored phylogenomic pipeline statistic (AvgNHomologs) were correlated with the presence of contamination. While the anchored hybrid enrichment (AHE) markers used here, which target hemipteroid taxa, proved effective in aggregate at resolving deep- and shallow-level Cicadidae relationships, individual markers contained inadequate phylogenetic signal, probably due in part to their short length.
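The bimodality criterion described above can be illustrated with a minimal sketch. This is not the authors' pipeline: it swaps in a simple largest-gap heuristic for whatever bimodality test the pipeline actually uses, and the function name `flag_contaminated_loci`, the thresholds `near_zero` and `min_gap_ratio`, and the example distances are all hypothetical. It assumes that patristic distances between one pair of taxa have already been extracted from each gene tree.

```python
from statistics import mean


def flag_contaminated_loci(distances, near_zero=0.01, min_gap_ratio=0.5):
    """Flag loci whose patristic distance falls in a near-zero lower mode.

    distances: dict mapping locus name -> patristic distance between one
        taxon pair in that locus's gene tree (hypothetical input format).
    near_zero: assumed cutoff for the mean of the lower mode.
    min_gap_ratio: assumed minimum size of the largest gap, as a fraction
        of the full range, for the distribution to count as bimodal.
    Returns a sorted list of flagged loci, or [] if no clear lower mode.
    """
    items = sorted(distances.items(), key=lambda kv: kv[1])
    if len(items) < 4:
        return []  # too few gene trees to judge the distribution

    values = [d for _, d in items]
    total_range = values[-1] - values[0]
    if total_range == 0:
        return []  # all distances identical, nothing to separate

    # Split the sorted distances at the largest gap between neighbours.
    gaps = [values[i + 1] - values[i] for i in range(len(values) - 1)]
    split = max(range(len(gaps)), key=gaps.__getitem__)
    lower = values[: split + 1]

    # Treat the distribution as bimodal only if the gap dominates the range,
    # and flag the lower group only if it sits close to zero distance.
    is_bimodal = gaps[split] / total_range >= min_gap_ratio
    if is_bimodal and mean(lower) <= near_zero:
        return sorted(locus for locus, _ in items[: split + 1])
    return []


if __name__ == "__main__":
    # Made-up distances for one taxon pair across six loci.
    example = {"L1": 0.001, "L2": 0.002, "L3": 0.30,
               "L4": 0.32, "L5": 0.28, "L6": 0.31}
    print(flag_contaminated_loci(example))
```

In practice such a check would be repeated for every pair of taxa in the dataset, and the flagged loci for a contaminated sample would be excluded before species tree inference. A mixture model or a formal test of unimodality would be a more robust choice than the gap heuristic used here.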
The cleaned dataset, consisting of 90 genera representing 44 of 56 current Cicadidae tribes, and 429 loci, supported three of the four sampled Cicadidae subfamilies in concatenated-matrix (IQ-TREE ML) and multispecies coalescent-based (ASTRAL-III) species tree analyses, with the fourth subfamily weakly supported in the ML trees. No well-supported patterns from previous family-level Sanger sequencing studies of Cicadidae phylogeny were contradicted. One taxon (Aragualna plenalinea) did not fall with its current subfamily in the genetic tree, and this genus and its tribe Aragualnini are reclassified to Tibicininae following morphological re-examination. Only subtle differences were observed in trees after removal of loci for which divergent base frequencies were detected. Greater success may be achieved by increasing taxon sampling and developing a probe set targeting a more recent common ancestor and longer loci. Searches for contamination are an essential step in phylogenomic analyses of all kinds and our pipeline is an effective solution.