Dissecting incongruence between concatenation- and quartet-based approaches in phylogenomic data
Cite this dataset
Shen, Xing-Xing; Steenwyk, Jacob; Rokas, Antonis (2021). Dissecting incongruence between concatenation- and quartet-based approaches in phylogenomic data [Dataset]. Dryad. https://doi.org/10.5061/dryad.9p8cz8wc5
Topological conflict or incongruence is widespread in phylogenomic data. Concatenation- and coalescent-based approaches often result in incongruent topologies, but the causes of this conflict can be difficult to characterize. We examined incongruence stemming from conflict between likelihood-based signal (quantified by the difference in gene-wise log-likelihood score, or ΔGLS) and quartet-based topological signal (quantified by the difference in gene-wise quartet score, or ΔGQS) for every gene in three phylogenomic studies in animals, fungi, and plants, which were chosen because their concatenation-based IQ-TREE (T1) and quartet-based ASTRAL (T2) phylogenies are known to produce eight conflicting internal branches (bipartitions). By comparing the types of phylogenetic signal for all genes in these three data matrices, we found that 30% to 36% of genes in each data matrix are inconsistent, that is, each of these genes has a higher log-likelihood score for T1 than for T2 (i.e., ΔGLS > 0) whereas its T1 topology has a lower quartet score than its T2 topology (i.e., ΔGQS < 0), or vice versa. Comparison of inconsistent and consistent genes using a variety of metrics (e.g., evolutionary rate, gene tree topology, distribution of branch lengths, hidden paralogy, and gene tree discordance) showed that inconsistent genes are more likely to recover neither T1 nor T2 and have higher levels of gene tree discordance than consistent genes. Simulation analyses demonstrated that removal of inconsistent genes from datasets with low levels of incomplete lineage sorting (ILS) and low and medium levels of gene tree estimation error (GTEE) reduced incongruence and increased accuracy. In contrast, removal of inconsistent genes from datasets with medium and high ILS levels and high GTEE levels eliminated or extensively reduced incongruence, but the resulting congruent species phylogenies were not always topologically identical to the true species trees.
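The consistency criterion described above can be illustrated with a minimal Python sketch. It is not taken from the deposited scripts: the function name `classify_gene` is hypothetical, and reporting a zero difference as "uninformative" is an assumption, since the description only covers the cases in which both differences are non-zero.

```python
def classify_gene(delta_gls: float, delta_gqs: float) -> str:
    """Label one gene by the agreement between its two signals.

    delta_gls: log-likelihood under T1 minus log-likelihood under T2.
    delta_gqs: quartet score for T1 minus quartet score for T2.
    A positive value favors T1; a negative value favors T2.
    """
    # Assumption: a zero difference favors neither tree, so the gene is
    # reported separately instead of being forced into either class.
    if delta_gls == 0 or delta_gqs == 0:
        return "uninformative"
    # Same sign: both signals favor the same topology.
    if (delta_gls > 0) == (delta_gqs > 0):
        return "consistent"
    # Opposite signs: the two signals favor different topologies.
    return "inconsistent"
```

A gene with ΔGLS > 0 and ΔGQS < 0 (or the reverse) is labeled inconsistent, matching the definition in the text.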
This repository contains the phylogenetic data matrices, multiple sequence alignments, phylogenetic trees, supplementary figures and tables, R code, and custom Perl scripts used in this study.
1) Each folder corresponds to one topic in our study.
2) The Example_GLS&GQS_caculation folder shows how to calculate the gene-wise log-likelihood score (GLS) and the gene-wise quartet score (GQS) for every gene.
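Once per-gene scores under T1 and T2 are available, the differences ΔGLS and ΔGQS are simple subtractions. The sketch below is a hedged illustration only: it assumes the scores have already been collected into dictionaries keyed by gene name, and the function name `score_differences`, the gene names, and the numbers are all hypothetical. It does not reproduce the Perl scripts in the example folder.

```python
def score_differences(scores_t1: dict, scores_t2: dict) -> dict:
    """Return {gene: score under T1 minus score under T2}.

    Only genes present in both inputs are kept, so a gene that is
    missing a score under either tree is skipped instead of raising.
    """
    shared = scores_t1.keys() & scores_t2.keys()
    return {gene: scores_t1[gene] - scores_t2[gene] for gene in sorted(shared)}


# Made-up values for illustration only.
gls_t1 = {"gene_A": -1000.5, "gene_B": -2500.0}
gls_t2 = {"gene_A": -1002.0, "gene_B": -2498.0}
gqs_t1 = {"gene_A": 120, "gene_B": 310}
gqs_t2 = {"gene_A": 135, "gene_B": 300}

delta_gls = score_differences(gls_t1, gls_t2)  # positive values favor T1
delta_gqs = score_differences(gqs_t1, gqs_t2)  # positive values favor T1
```

With these made-up numbers, gene_A has a positive ΔGLS and a negative ΔGQS, and gene_B shows the reverse pattern.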
If you have any questions about these files, please email email@example.com.
National Natural Science Foundation of China, Award: 32071665
National Science Foundation, Award: DEB-1442113