Data from: Dissecting signal and noise in diatom chloroplast protein encoding genes with phylogenetic information profiling
Theriot, Edward C. et al. (2016), Data from: Dissecting signal and noise in diatom chloroplast protein encoding genes with phylogenetic information profiling, Dryad, Dataset, https://doi.org/10.5061/dryad.610md
Previous analyses of single diatom chloroplast protein-encoding genes recovered results highly incongruent with both traditional phylogenies and phylogenies derived from the nuclear-encoded small subunit (SSU) gene. Our analysis here of six individual chloroplast genes (atpB, psaA, psaB, psbA, psbC and rbcL) obtained similarly anomalous results. However, phylogenetic noise in these genes did not appear to be correlated, and their concatenation appeared to effectively sum their collective signal. We empirically demonstrated the value of combining phylogenetic information profiling, partitioned Bremer support and entropy analysis in examining the utility of various partitions in phylogenetic analysis. Noise was low in the 1st and 2nd codon positions, but so was signal. Conversely, high noise levels in the 3rd codon position were accompanied by high signal. Perhaps counterintuitively, simple exclusion experiments demonstrated that this was especially true at deeper nodes, where the 3rd codon position contributed most to a result congruent with morphology and SSU (and the total evidence tree here). Consistent with our empirical findings, the probability of correct signal (derived from information profiling) increased and the statistical significance of substitutional saturation decreased as data were aggregated. In this regard, the aggregated 3rd codon position performed as well as or better than more slowly evolving sites. Simply put, direct methods of noise removal (elimination of fast-evolving sites) disproportionately removed signal. Information profiling and partitioned Bremer support suggest that the addition of chloroplast data will rapidly improve our understanding of the diatom phylogeny, but they also illustrate that some parts of the diatom tree are likely to remain recalcitrant to the addition of molecular data.
The methods based on information profiling have been criticized for their numerous assumptions and parameter estimates and the fact that they are based on quartets of taxa. Our empirical results support theoretical arguments that the simplifying assumptions made in these methods are robust to “real-life” situations.
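The codon-position partitions, gene concatenation and exclusion experiments described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, taxon labels and toy sequences are hypothetical, and it assumes every gene alignment is in frame with a length that is a multiple of three, so that concatenation preserves codon positions.

```python
# Illustrative sketch only; names and toy data below are hypothetical.

def codon_position_partitions(seq):
    """Split an in-frame coding sequence into 1st, 2nd and 3rd codon positions."""
    return {1: seq[0::3], 2: seq[1::3], 3: seq[2::3]}


def exclude_third_positions(seq):
    """Drop every 3rd codon position (a simple exclusion experiment)."""
    return "".join(base for i, base in enumerate(seq) if i % 3 != 2)


def concatenate(genes):
    """Concatenate per-gene alignments for the taxa shared by all genes.

    genes maps gene name -> {taxon -> aligned sequence}. Genes are joined
    in dictionary order; each alignment length is assumed to be a multiple
    of three so the reading frame carries through the concatenation.
    """
    shared = sorted(set.intersection(*(set(aln) for aln in genes.values())))
    return {t: "".join(genes[g][t] for g in genes) for t in shared}


# Toy alignments (made-up sequences, two hypothetical taxa, two genes).
toy_genes = {
    "atpB": {"taxonA": "ATGGCA", "taxonB": "ATGGCT"},
    "rbcL": {"taxonA": "ATGTCA", "taxonB": "ATGTCG"},
}

combined = concatenate(toy_genes)
partitions = {t: codon_position_partitions(s) for t, s in combined.items()}
no_third = {t: exclude_third_positions(s) for t, s in combined.items()}
```

Each partition (or the alignment with 3rd positions excluded) could then be written out and analysed separately in a phylogenetics program of choice, which is the kind of comparison the exclusion experiments rely on.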