Dryad

Data from: Species tree estimation and the impact of gene loss following whole-genome duplication

Data files

Jun 15, 2022 version files 23.64 GB

Abstract

Whole-genome duplication (WGD) has been demonstrated to occur broadly and repeatedly in the evolutionary history of eukaryotes, and is recognized as a prominent evolutionary force, especially in plants. Immediately following WGD, most genes are present in two copies as paralogs. Due to this redundancy, one copy of a paralog pair commonly undergoes pseudogenization and is eventually lost. When speciation occurs shortly after WGD, however, differential loss of paralogs may lead to spurious phylogenetic inference resulting from the inclusion of pseudoorthologs: paralogous genes mistakenly identified as orthologs because they are present in single copies within each sampled species. The influence and impact of including pseudoorthologs versus true orthologs as a result of gene extinction (or incomplete laboratory sampling) in a phylogenetic context has only recently begun to receive empirical attention. Moreover, few studies have investigated this phenomenon in an explicit coalescent framework. Here, using mathematical models, numerous simulated data sets, and two newly assembled empirical data sets, we assess the effect of pseudoorthologs on species tree estimation under varying levels of incomplete lineage sorting (ILS) and different patterns of gene loss following WGD. When gene loss occurs in the terminal branches of the species tree, the alignment-based (BPP) and gene-tree-based (ASTRAL, MP-EST, and STAR) coalescent methods are adversely affected as the level of ILS increases. Their performance can be greatly improved by sampling a sufficiently large number of genes. Under the same circumstances, however, concatenation methods consistently estimate incorrect species trees as the number of sampled genes increases. Furthermore, pseudoorthologs can mislead species tree inference if gene loss occurs in the internal branches of the species tree, where both coalescent and concatenation methods are prone to produce inconsistent results.
Pseudoorthologs are problematic when phylogenomic data sets are filtered only for single-copy genes. Pruning orthologs, or even randomly selecting one copy from each multi-copy gene, can avoid most of these pseudoorthologs. These results underscore the importance of understanding the influence of pseudoorthologs in the phylogenomics era.
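
The two simplest gene-selection strategies mentioned above, keeping only single-copy gene families and randomly selecting one copy per species, can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the data layout (a mapping from species to gene copy identifiers), the function names, and the example families are all hypothetical, and neither function by itself detects whether a retained copy is a pseudoortholog.

```python
import random

def is_single_copy(family):
    """Return True if every sampled species has exactly one copy in this family."""
    return all(len(copies) == 1 for copies in family.values())

def pick_random_copy(family, rng):
    """Keep one randomly chosen copy per species from a gene family."""
    return {species: rng.choice(copies) for species, copies in family.items()}

# Hypothetical gene families: species name -> list of gene copy identifiers.
families = {
    "fam1": {"spA": ["a1"], "spB": ["b1"], "spC": ["c1"]},
    "fam2": {"spA": ["a2", "a3"], "spB": ["b2"], "spC": ["c2", "c3"]},
}

rng = random.Random(0)  # fixed seed so the selection is reproducible

# Strategy 1: retain only families that are single-copy in every species.
single_copy_only = {
    name: fam for name, fam in families.items() if is_single_copy(fam)
}

# Strategy 2: retain every family, sampling one copy per species at random.
one_copy_each = {
    name: pick_random_copy(fam, rng) for name, fam in families.items()
}
```

In this sketch the single-copy filter discards `fam2` entirely, while random selection keeps it with one copy per species.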