Whole-genome duplication (WGD) has been demonstrated to occur broadly and repeatedly in the evolutionary history of eukaryotes, and is recognized as a prominent evolutionary force, especially in plants. Immediately following WGD, most genes are present in two copies as paralogs. Due to this redundancy, one copy of a paralog pair commonly undergoes pseudogenization and is eventually lost. When speciation occurs shortly after WGD, however, differential loss of paralogs may lead to spurious phylogenetic inference resulting from the inclusion of pseudoorthologs, that is, paralogous genes mistakenly identified as orthologs because they are present in single copies within each sampled species. The impact of including pseudoorthologs rather than true orthologs as a result of gene extinction (or incomplete laboratory sampling) in a phylogenetic context has only recently started to gain empirical attention. Moreover, few of these studies have investigated this phenomenon in an explicit coalescent framework. Here, using mathematical models, numerous simulated data sets, and two newly assembled empirical data sets, we assess the effect of pseudoorthologs on species tree estimation under varying levels of incomplete lineage sorting (ILS) and different patterns of gene loss following WGD. When gene loss occurs in the terminal branches of the species tree, the alignment-based (BPP) and gene-tree-based (ASTRAL, MP-EST, and STAR) coalescent methods are adversely affected as the level of ILS increases. Their accuracy can be greatly improved by sampling a sufficiently large number of genes. Under the same circumstances, however, concatenation methods consistently estimate incorrect species trees as the number of sampled genes increases. Furthermore, pseudoorthologs can mislead species tree inference if gene loss occurs in the internal branches of the species tree, where both coalescent and concatenation methods are prone to produce inconsistent results.
However, pseudoorthologs are problematic when filtering only for single-copy genes in phylogenomic data sets. Pruning orthologs or even randomly selecting a copy from multi-copy genes can avoid most of those pseudoorthologs. These results underscore the importance of understanding the influence of pseudoorthologs in the phylogenomics era.

DNA sequences were simulated under the multispecies coalescent model on four 5-taxon species trees with two different topologies (pectinate species trees S1 and S3, and symmetrical species trees S2 and S4; see Fig. 2 in the paper). Next, we generated differential loss of paralogs on each simulated gene according to one of the 14 patterns described in the paper. In the simulation, the four terminal lineages in the genealogical tree of the eight gene copies were removed according to the loss probabilities.
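The loss step can be illustrated with a minimal sketch. It is not the simulation code used for this dataset. It assumes, hypothetically, that the eight gene copies are two paralogs (labelled "A" and "B") in each of four duplicated species, and that exactly one copy per species is removed, with a per-species probability of losing copy A. The species names, labels, and function names are illustrative only; the 14 loss patterns themselves are defined in the paper and are not reproduced here.

```python
import random


def simulate_paralog_loss(species, p_lose_a, rng):
    """Pick the retained paralog ("A" or "B") for each species.

    Hypothetical model: each species starts with copies A and B and
    loses exactly one. p_lose_a[sp] is the probability that copy A is
    the one lost in species sp; otherwise copy B is lost.
    """
    retained = {}
    for sp in species:
        lost = "A" if rng.random() < p_lose_a[sp] else "B"
        retained[sp] = "B" if lost == "A" else "A"
    return retained


def contains_pseudoorthologs(retained):
    """True when the surviving single copies come from different paralogs."""
    return len(set(retained.values())) > 1


if __name__ == "__main__":
    rng = random.Random(1)
    species = ["sp1", "sp2", "sp3", "sp4"]  # illustrative names
    p_lose_a = {sp: 0.5 for sp in species}  # assumed loss probabilities
    kept = simulate_paralog_loss(species, p_lose_a, rng)
    print(kept, contains_pseudoorthologs(kept))
```

Under this assumed model, a gene whose surviving copies mix "A" and "B" looks single-copy in every species yet is a pseudoortholog set, which is the situation the abstract describes.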

S1.tar.gz: DNA sequences simulated from the species tree S1 under the multispecies coalescent model with gene loss

S2.tar.gz: DNA sequences simulated from the species tree S2 under the multispecies coalescent model with gene loss

S3.tar.gz: DNA sequences simulated from the species tree S3 under the multispecies coalescent model with gene loss

S4.tar.gz: DNA sequences simulated from the species tree S4 under the multispecies coalescent model with gene loss
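The four archives can be inspected and unpacked with Python's standard `tarfile` module. The sketch below assumes the archives have been downloaded to the working directory. The internal layout of each archive is not described here, so the code only lists and extracts whatever files are present. The helper names and the output directory are illustrative.

```python
import tarfile
from pathlib import Path


def list_archive_members(archive_path):
    """Return the names of the regular files stored in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return [m.name for m in tar.getmembers() if m.isfile()]


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return that directory."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
    return dest


if __name__ == "__main__":
    for name in ["S1.tar.gz", "S2.tar.gz", "S3.tar.gz", "S4.tar.gz"]:
        if Path(name).exists():  # skip archives not yet downloaded
            print(name, len(list_archive_members(name)), "files")
            extract_archive(name, Path("extracted") / name.split(".")[0])
```

Listing the members before extracting is a cheap way to check the size and structure of an archive without writing anything to disk.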

Data from: Species tree estimation and the impact of gene loss following whole-genome duplication
