Data and code from: Efficient inference of macrophylogenies: Insights from the avian tree of life

Zhao, Min 1; Thom, Gregory 2; Faircloth, Brant 2; Andersen, Michael 3; Barker, Keith 4; Benz, Brett 5; Braun, Michael 6; Bravo, Gustavo 7; Brumfield, Robb 2; Chesser, Terry 8; Derryberry, Elizabeth 9; Glenn, Travis 10; Harvey, Michael 11; Hosner, Peter 12; Imfeld, Tyler 13; Joseph, Leo 14; Manthey, Joseph 15; McCormack, John 16; McCullough, Jenna 17; Moyle, Robert 18; Oliveros, Carl 2; White Carreiro, Noor 19; Winker, Kevin 20; Field, Daniel 21; Ksepka, Daniel 22; Braun, Edward 1; Kimball, Rebecca 1; Smith, Brian 23

Published Nov 21, 2025 on Dryad. https://doi.org/10.5061/dryad.5dv41nsgw

Data files

Nov 21, 2025 version files (9.30 GB total)

alignments_final_trimmed_clean.zip: 5.30 GB

datafiles_large_tables.zip: 3.29 MB

filtered_concatenated_files.zip: 3.54 GB

README.md: 3.55 KB

scripts.zip: 48.80 MB

treefiles.zip: 404.48 MB

Abstract

The exponential growth of molecular sequence data over the past decade has enabled the construction of numerous clade-specific phylogenies encompassing hundreds or thousands of taxa. These independent studies often include overlapping data, presenting a unique opportunity to build macrophylogenies (phylogenies sampling > 1,000 taxa) for entire classes across the Tree of Life. However, the inference of large trees remains constrained by logistical, computational, and methodological challenges. The Avian Tree of Life provides an ideal model for evaluating strategies to robustly infer macrophylogenies from intersecting datasets derived from smaller studies. In this study, we leveraged a comprehensive resource of sequence capture datasets to evaluate the phylogenetic accuracy and computational costs of four methodological approaches: (1) supermatrix approaches using concatenation, including the “fast” maximum likelihood (ML) methods, (2) filtering datasets to reduce heterogeneity, (3) supertree estimation based on published phylogenomic trees, and (4) a “divide-and-conquer” strategy, wherein smaller ML trees were estimated and subsequently combined using a supertree approach. Additionally, we examined the impact of these methods on divergence time estimation using a dataset that includes newly vetted fossil calibrations for the Avian Tree of Life. Our findings highlight that recently developed fast tree search approaches offer a reasonable compromise between computational efficiency and phylogenetic accuracy, facilitating inference of macrophylogenies.

Description of the data and file structure

This dataset contains the data and code used to estimate macrophylogenies with several approaches (supermatrix, filtered supermatrices, supertree, divide-and-conquer, and a coalescent species tree method), to compare their phylogenetic accuracy and computational requirements, and to perform divergence time estimation.

Files and variables

File: alignments_final_trimmed_clean.zip

Description: This file contains individual alignments for the full dataset.
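Because this archive is several gigabytes in size, it can help to inspect its contents before extracting anything. The following is a minimal sketch using only the Python standard library; the function name `list_archive` and the member names in the demo are hypothetical and do not reflect the actual layout of the archive.

```python
import io
import zipfile


def list_archive(zip_source):
    """Return (name, uncompressed_size_in_bytes) for each file in a zip archive.

    zip_source may be a file path or a binary file-like object.
    Only the archive index is read; no member is extracted.
    """
    with zipfile.ZipFile(zip_source) as archive:
        return [(info.filename, info.file_size)
                for info in archive.infolist()
                if not info.is_dir()]


if __name__ == "__main__":
    # Demo on a tiny in-memory archive; the member names are hypothetical.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as demo:
        demo.writestr("alignments/locus_0001.fasta", ">taxonA\nACGT\n")
        demo.writestr("alignments/locus_0002.fasta", ">taxonA\nAC\n")
    buffer.seek(0)
    for name, size in list_archive(buffer):
        print(f"{size:>8d}  {name}")
```

To apply it to the real download, pass the path to the archive on disk, for example `list_archive("alignments_final_trimmed_clean.zip")`.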

File: filtered_concatenated_files.zip

Description: This file contains 27 concatenated sequence data files for the filtered datasets.
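For readers unfamiliar with how a concatenated (supermatrix) file relates to the individual alignments, the sketch below shows the general idea: per-locus alignments are joined end to end, and taxa absent from a locus are padded with a missing-data symbol. This is an illustration under stated assumptions, not the procedure used to build the files in this archive; the function name `concatenate`, the `?` padding symbol, and the in-memory dictionary representation are all choices made for this example.

```python
def concatenate(alignments, missing="?"):
    """Concatenate per-locus alignments into one supermatrix.

    alignments: list of dicts mapping taxon name -> aligned sequence;
    all sequences within one locus must have the same length.
    Returns (supermatrix, partitions), where partitions holds 1-based,
    inclusive (start, end) column ranges, one per locus.
    """
    taxa = sorted({taxon for locus in alignments for taxon in locus})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    position = 0
    for locus in alignments:
        lengths = {len(seq) for seq in locus.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in a locus must have equal length")
        length = lengths.pop()
        for taxon in taxa:
            # Taxa absent from this locus are padded with the missing symbol.
            pieces[taxon].append(locus.get(taxon, missing * length))
        partitions.append((position + 1, position + length))
        position += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


if __name__ == "__main__":
    # Two toy loci with partially overlapping taxon sampling.
    loci = [{"A": "ACGT", "B": "AC-T"}, {"A": "GG", "C": "GA"}]
    matrix, parts = concatenate(loci)
    for taxon, sequence in matrix.items():
        print(taxon, sequence)
    print(parts)
```

The returned column ranges correspond to the kind of per-locus partition information that tree inference programs can take alongside a concatenated alignment.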

File: treefiles.zip

Description: This file contains four subfolders.

1. initial exploration, which includes tree files produced by different approaches (RAxML-NG, fasttrees, supertrees, divide-and-conquer supertrees, and ASTRAL species trees, as well as the gene trees used to summarize species trees) and a Robinson-Foulds tree distance matrix.

2. tests on filter1 and filter3, which includes tree files from tests on two filtered datasets.

3. modified methods, which includes new fasttrees from the full dataset and the 27 filtered datasets (26 distinct, because two filtered datasets are identical), hybrid supertrees, hybrid divide-and-conquer supertrees, and a Robinson-Foulds tree distance matrix.

4. divergence time trees, which includes time-calibrated trees using various trees from previous steps (note that for supertree and divide-and-conquer trees, branch lengths were optimized using GTR+G or GTR+R4 models, thus two trees are provided for each approach).
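The Robinson-Foulds distance matrices mentioned above summarize how many bipartitions (splits) differ between pairs of trees. The sketch below illustrates the unweighted, unrooted form of that distance for two Newick strings. It is a simplified stand-in rather than the code used in this study (the scripts in scripts.zip are the authoritative source); it assumes both trees contain exactly the same taxa, and its small parser does not handle quoted labels or bracketed comments. The function names are hypothetical.

```python
import re


def clades(newick):
    """Return (all_leaves, internal_clades) for a Newick string.

    Minimal parser: tolerates branch lengths and internal labels, but
    not quoted labels or bracketed comments.
    """
    tokens = re.findall(r"[(),;]|[^(),;]+", newick.strip())
    stack, found = [[]], []
    previous = None
    for token in tokens:
        if not token.strip():
            continue  # ignore stray whitespace between symbols
        if token == "(":
            stack.append([])
        elif token == ")":
            children = stack.pop()
            clade = frozenset().union(*children)
            found.append(clade)
            stack[-1].append(clade)
        elif token in ",;":
            pass
        elif previous != ")":
            # A label after "(" or "," is a leaf; drop its branch length.
            stack[-1].append(frozenset([token.split(":")[0].strip()]))
        # Labels directly after ")" are support values or branch lengths.
        previous = token
    return found[-1], found[:-1]


def splits(newick):
    """Return (leaves, set of non-trivial bipartitions) for a tree."""
    leaves, internal = clades(newick)
    result = set()
    for clade in internal:
        other = leaves - clade
        if len(clade) > 1 and len(other) > 1:
            result.add(frozenset([clade, other]))
    return leaves, result


def robinson_foulds(newick_a, newick_b):
    """Count the bipartitions found in exactly one of the two trees."""
    leaves_a, splits_a = splits(newick_a)
    leaves_b, splits_b = splits(newick_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must share the same taxon set")
    return len(splits_a ^ splits_b)


if __name__ == "__main__":
    tree_1 = "((A,B),(C,D),E);"
    tree_2 = "((A,C),(B,D),E);"
    print(robinson_foulds(tree_1, tree_2))
```

A pairwise distance matrix can be built by calling `robinson_foulds` on every pair of trees. Trees with different taxon sets would first need to be pruned to their shared taxa, which this sketch does not attempt.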

File: datafiles_large_tables.zip

Description: This file contains seven data files.

Data 1 - supermatrix taxon sampling.xlsx, which includes information on our taxon sampling (voucher, accession, probe set, etc., where available).
Data 2 - individual-based summary statistics for all taxa.csv, which includes individual-based summary statistics for all taxa sampled.
Data 3 - stats of datasets and new fasttree run summary.xlsx, which includes averaged locus-based summary statistics for each dataset and run information (site model, log likelihoods, computational time etc.) for each new fasttree analysis (four searches for each dataset).
Data 4 - DivideConquer subsets.xlsx, which includes the taxon sampling in each of the 50 subsets used in the divide-and-conquer approach.
Data 5 - tests on two filtered datasets.xlsx, which includes the run information (site model, log likelihoods, starting trees etc.) for test runs on filter1 and filter3 datasets, as well as the expected clade recovery of their resulting trees.
Data 6 - summary statistics per locus.xlsx, which includes locus-based summary statistics for individual alignments.
Data 7 - crown ages for all monophyletic groups (genera, families, orders, and high-level clades) in our seven time trees. Monotypic groups and groups represented by a single sample are excluded.

File: scripts.zip

Description: This file contains scripts and data files to process phylogenomic data, estimate trees, calculate tree distances, assess expected clade recovery, estimate divergence times, and generate various plots. It also includes a README file for the scripts. The scripts are also available at https://github.com/balaenazhao/BigUCE