DupLoss-2: Improved phylogenomic species tree inference under gene duplication and loss
Data files
Oct 24, 2025 version files (40.26 MB total)
- ElevatedLossRatesData.zip (20.39 MB)
- HighErrorData.zip (19.87 MB)
- README.md (1.47 KB)
Abstract
Accurate species tree reconstruction in the presence of widespread gene duplication and loss is a challenging problem in eukaryote phylogenomics. Many phylogenomics methods have been developed over the years to address this challenge; these range from older methods based on gene tree parsimony to newer quartet-based methods. In this work, we introduce improved software for gene tree parsimony-based species tree reconstruction under gene duplication and loss. The new software, DupLoss-2, uses an improved procedure for computing gene losses and is far more accurate and much easier to use than the previous version, released over a decade ago. We thoroughly evaluate DupLoss-2 and eight other existing methods, including ASTRAL-Pro, ASTRAL-Pro 2, DISCO-ASTRAL, DISCO-ASTRID, FastMulRFS, and SpeciesRax, using existing benchmarking data and find that DupLoss-2 outperforms all other methods on most of the datasets. On average, it reduces reconstruction error by almost 30% compared to iGTP-Duploss, the previous version of this software, and by 10% compared to the best-performing existing method. DupLoss-2 is written in C++ and is freely available as open-source software.
https://doi.org/10.5061/dryad.0cfxpnwb9
Description of the data and file structure
All datasets are simulated and were generated to evaluate methods for phylogenomic species tree inference under gene duplication and loss.
Files and variables
File: HighErrorData.zip
Description: HighErrorData.zip includes estimated gene trees with a high rate of reconstruction error. The estimated gene trees correspond to simulated species trees with 100 taxa and true gene trees simulated under three different gene duplication and loss rates. The zip file consists of three folders corresponding to the three duplication-loss rates. Each folder contains 20 replicate datasets, each consisting of 1000 estimated gene trees.
File: ElevatedLossRatesData.zip
Description: ElevatedLossRatesData.zip contains the input data for an analysis in which a subset of the taxa/species have elevated gene loss rates. The dataset includes three deletion rates (10%, 20%, and 40%), where each rate is the percentage of estimated gene trees from which 20% of the taxa, selected at random, were removed. Each folder contains 20 replicate datasets, each consisting of 1000 estimated gene trees.
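To check a downloaded archive against the folder and replicate counts described above, the files in each directory of the zip can be counted. The following is a minimal Python sketch, not part of the dataset or of DupLoss-2. The function name `files_per_directory` is hypothetical, and the sketch makes no assumption about how folders or gene tree files are named inside the archives, since that layout is not specified here.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def files_per_directory(zip_source):
    """Return {directory path inside the zip: number of files directly in it}.

    `zip_source` may be a filesystem path or a binary file-like object.
    Hypothetical helper for illustration; no particular folder layout
    or file naming scheme is assumed.
    """
    counts = Counter()
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                # Skip explicit directory entries; only count files.
                continue
            # Zip member names always use forward slashes.
            parent = str(PurePosixPath(info.filename).parent)
            counts[parent] += 1
    return dict(counts)
```

For example, calling `files_per_directory("HighErrorData.zip")` and printing the sorted items shows how many files sit under each folder, which can then be compared with the three folders and 20 replicates per folder described above. How the 1000 gene trees of a replicate are stored (one file per replicate or one file per tree) should be confirmed from the archive itself or from README.md.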
Access information
Data was derived from the following source: http://dx.doi.org/10.5061/dryad.mr3g6
