Dryad

Data from: Mitochondrial phylogenomics of early land plants: mitigating the effects of saturation, compositional heterogeneity, and codon-usage bias

Data files

Jul 29, 2014 version files 7.35 MB

Abstract

Phylogenetic analyses using concatenation of genomic-scale data have been seen as a panacea for resolving the incongruences among inferences from few or single genes. However, phylogenomics may also suffer from systematic errors, due to the perhaps cumulative effects of saturation, among-taxa compositional (GC content) heterogeneity, or codon-usage bias plaguing the individual nucleotide loci that are concatenated. Here we provide an example of how these factors affect the inferences of the phylogeny of early land plants based on mitochondrial genomic data. Mitochondrial sequences evolve slowly in plants and hence are thought to be suitable for resolving deep relationships. We newly assembled mitochondrial genomes from 20 bryophytes and complemented these with 40 other streptophytes (land plants plus algal outgroups), compiling a data matrix of 60 taxa and 41 mitochondrial genes. Homogeneous analyses of the concatenated nucleotide data resolve mosses as the sister group to the remaining land plants. However, the corresponding translated amino-acid data support the liverwort lineage in this position. Both results receive weak to moderate support in maximum likelihood analyses, but strong support in Bayesian inferences. Tests of alternative hypotheses using either nucleotide or amino-acid data provide implicit support for the respective optimal topologies. By analyzing the nucleotide data, we found that the 3rd codon positions are more saturated than the 1st and 2nd codon positions, and that excluding the 3rd positions from the analyses leads to a topology congruent with that obtained using amino-acid data. Further, we determined that land plant lineages differ in their nucleotide composition and in their usage of synonymous codon variants.
Composition-heterogeneous Bayesian analyses employing a non-stationary model that accounts for among-lineage variation in composition, and inferences from degenerated nucleotide data that avoid the effects of the synonymous mutations underlying codon-usage bias, again recovered liverworts as sister to the remaining land plants. These analyses indicate that the discrepancy between the nucleotide-based and the amino-acid-based trees is caused by lineage-specific, parallel compositional bias, or by synonymous mutations driving codon-usage bias, as well as by saturation in the 3rd codon positions. While genomic data may generate highly supported phylogenetic trees, these inferences may be artifacts. We suggest that phylogenomic analyses should assess the possible impact of potential biases by comparing protein-coding gene data with their amino-acid translations, by analyzing the data under models that account for compositional bias, and by excluding noisy nucleotide signals due to saturation or codon-usage bias. We caution against relying on any one presentation of the data (nucleotide or amino acid) or any one type of analysis, even when analyzing large-scale data sets and no matter how well supported the result, without fully exploring the effects of substitution models.
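
Two of the checks described in the abstract, excluding 3rd codon positions and comparing GC content among taxa, can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The function names and the toy two-taxon alignment are hypothetical, and the sketch assumes each sequence is an in-frame coding sequence that starts at codon position 1.

```python
def drop_third_positions(seq: str) -> str:
    """Keep codon positions 1 and 2 of an in-frame coding sequence."""
    return "".join(base for i, base in enumerate(seq) if i % 3 != 2)


def gc_content(seq: str) -> float:
    """Fraction of G or C among unambiguous A/C/G/T bases.

    Gaps and ambiguity codes are ignored; returns NaN if nothing is countable.
    """
    counted = [b for b in seq.upper() if b in "ACGT"]
    if not counted:
        return float("nan")
    return sum(b in "GC" for b in counted) / len(counted)


def gc_by_codon_position(seq: str) -> tuple:
    """GC content computed separately for codon positions 1, 2 and 3."""
    return tuple(gc_content(seq[pos::3]) for pos in range(3))


# Hypothetical toy alignment (not taken from the deposited data files).
alignment = {
    "taxon_A": "ATGGCCAAATTT",
    "taxon_B": "ATGGCGAAGTTC",
}

for name, seq in alignment.items():
    reduced = drop_third_positions(seq)
    gc1, gc2, gc3 = gc_by_codon_position(seq)
    print(f"{name}: positions 1+2 = {reduced}, "
          f"GC1 = {gc1:.2f}, GC2 = {gc2:.2f}, GC3 = {gc3:.2f}")
```

Large differences among taxa in third-position GC content in such a table would be one simple hint of the among-lineage compositional heterogeneity the abstract describes. A formal assessment would still need the model-based analyses named above.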