Previous analyses of single diatom chloroplast protein-encoding genes recovered results highly incongruent with both traditional phylogenies and phylogenies derived from the nuclear-encoded small subunit (SSU) gene. Our analysis here of six individual chloroplast genes (atpB, psaA, psaB, psbA, psbC and rbcL) obtained similarly anomalous results. However, phylogenetic noise in these genes did not appear to be correlated, and their concatenation appeared to effectively sum their collective signal. We empirically demonstrated the value of combining phylogenetic information profiling, partitioned Bremer support and entropy analysis in examining the utility of various partitions in phylogenetic analysis. Noise was low in the 1st and 2nd codon positions, but so was signal. Conversely, high noise levels in the 3rd codon position were accompanied by high signal. Perhaps counterintuitively, simple exclusion experiments demonstrated that this was especially true at deeper nodes, where the 3rd codon position contributed most to a result congruent with morphology and SSU (and the total evidence tree here). Consistent with our empirical findings, the probability of correct signal (derived from information profiling) increased and the statistical significance of substitutional saturation decreased as data were aggregated. In this regard, the aggregated 3rd codon position performed as well as or better than more slowly evolving sites. Simply put, direct methods of noise removal (elimination of fast-evolving sites) disproportionately removed signal. Information profiling and partitioned Bremer support suggest that addition of chloroplast data will rapidly improve our understanding of the diatom phylogeny, but they also illustrate that some parts of the diatom tree are likely to remain recalcitrant to the addition of molecular data.
The methods based on information profiling have been criticized for their numerous assumptions and parameter estimates and the fact that they are based on quartets of taxa. Our empirical results support theoretical arguments that the simplifying assumptions made in these methods are robust to “real-life” situations.

Theriot et al., diatom, 7 gene dataset, with all charsets defined

Nexus DNA sequence file of aligned sequences. 208 taxa (207 diatoms, 1 outgroup). 9349 bases. 7 genes. In order: nuclear-encoded SSU, chloroplast-encoded atpB, psaA, psaB, psbA, psbC, rbcL. Ends trimmed for each gene so that at least 50% of individual sequences have a base. Character set information included, identifying genes, inferred paired and unpaired sites for SSU, and codon positions. Each protein-encoding gene further trimmed so that it begins with a first codon position.

Theriot.etal.July2014.diatom.7gene.data.with.all.charsets.nex
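Because the file defines character sets for genes and codon positions, partitions can be pulled out programmatically. The following is a minimal Python sketch, assuming the charsets use the common Nexus form `charset name = start-end\step;`. The charset names and ranges in the example (`geneA`, `geneA_pos1`, `geneA_pos3`) are hypothetical and are not taken from the deposited file; more elaborate Nexus constructs (such as `.` for the last site, or references to other charsets) are not handled.

```python
import re

# Matches simple statements such as:  charset name = 1-12\3;
CHARSET_RE = re.compile(r"charset\s+(\S+)\s*=\s*([^;]+);", re.IGNORECASE)
# Matches one range item: start, optional -end, optional \step
RANGE_RE = re.compile(r"(\d+)(?:\s*-\s*(\d+))?(?:\s*\\\s*(\d+))?")


def parse_charsets(nexus_text):
    """Return {charset name: list of 1-based site indices}.

    Only plain numeric ranges with an optional \\step are supported;
    this is an illustrative subset of the Nexus sets syntax.
    """
    charsets = {}
    for name, spec in CHARSET_RE.findall(nexus_text):
        sites = []
        for start, end, step in RANGE_RE.findall(spec):
            first = int(start)
            last = int(end) if end else first
            stride = int(step) if step else 1
            sites.extend(range(first, last + 1, stride))
        charsets[name] = sites
    return charsets


# Hypothetical sets block (names and coordinates are illustrative only).
example = """
begin sets;
  charset geneA = 1-12;
  charset geneA_pos1 = 1-12\\3;
  charset geneA_pos3 = 3-12\\3;
end;
"""

sets = parse_charsets(example)
print(sets["geneA_pos3"])
```

Since each protein-encoding gene is trimmed to begin with a first codon position, a `\3` stride starting at the gene's first site selects 1st positions, and starting two sites later selects 3rd positions.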

Fig.S1.SaturationPlot.psaA3

Plot of transitions and transversions from F84 distances as calculated in DAMBE5, for diatom plus Bolidomonas psaA 3rd codon position.
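For readers unfamiliar with the quantities in such a plot, the sketch below shows one simple way to count transitions and transversions between two aligned sequences in Python. It is an illustration only: it is not the DAMBE5 procedure, it does not compute F84 distances, and the function name `count_ts_tv` and the example sequences are hypothetical. Sites with gaps or ambiguity codes are skipped, which is an assumption of this sketch.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def count_ts_tv(seq1, seq2):
    """Count transitions and transversions between two aligned sequences.

    Sites where either sequence has a gap or ambiguity code are skipped.
    Returns a (transitions, transversions) tuple of raw counts.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = 0
    transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b or a not in BASES or b not in BASES:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    return transitions, transversions


# Made-up aligned sequences for illustration.
ts, tv = count_ts_tv("ACGTAC-T", "GCGAATCT")
print(ts, tv)
```

To restrict the count to 3rd codon positions of an in-frame gene alignment, the sequences can be sliced with `seq[2::3]` before calling the function.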

Fig.S2.7gene.full

Maximum likelihood tree with all terminals identified, from analysis of 7 genes (SSU, atpB, psaA, psaB, psbA, psbC, rbcL) for 207 diatoms and Bolidomonas pacifica (outgroup).

Fig.S3.cplastonly

Maximum likelihood tree with all terminals identified, from analysis of 6 genes (atpB, psaA, psaB, psbA, psbC, rbcL) for 207 diatoms and Bolidomonas pacifica (outgroup).

Fig.S4.ssuonly

Maximum likelihood tree with all terminals identified, from analysis of SSU for 207 diatoms and Bolidomonas pacifica (outgroup).

Table.S1.Taxa.GenbankNumbers

Taxa, authorities, sources, and GenBank accession numbers for the 7 gene dataset on diatoms submitted by Theriot et al.

Table.S2.PrimerTable

Table of primers used for DNA sequencing for dataset submitted by Theriot et al. (7 genes, diatoms).

Table.S3.EntropyTest

Results of Entropy Analysis for substitutional saturation conducted in DAMBE5 on 7 gene diatom dataset submitted by Theriot et al.

Table.S4.CDCSummary

Results of Codon Deviation Coefficient analysis conducted on protein encoding genes of diatoms.

Table.S5.pC.pI.pP.byGene.v2

Probability of correct (C) and incorrect (I) parsimony informative character patterns at select nodes in the diatom tree (Figure 1 in manuscript). Probability of non-parsimony informative character pattern or polytomy (P) at select nodes. Probability of C minus I at select nodes. For each gene individually, for all 6 chloroplast genes concatenated, and for all 7 genes (SSU plus chloroplast genes) concatenated.

Table.S6.pC.pI.pP.byPosition

Probability of correct (C) and incorrect (I) parsimony informative character patterns at select nodes in the diatom tree (Figure 1 in manuscript). Probability of non-parsimony informative character pattern or polytomy (P) at select nodes. Probability of C minus I at select nodes. For each of 5 selected partitions (SSU paired sites, SSU unpaired sites, all 1st codon positions aggregated, all 2nd codon positions aggregated, and all 3rd codon positions aggregated).

Data from: Dissecting signal and noise in diatom chloroplast protein encoding genes with phylogenetic information profiling
