Data from: Leveraging weighted quartet distributions for enhanced species tree inference from genome-wide data

Hasan, Navid Bin; Biswas, Avijit; Wahab, Zahin; Mahbub, Mahim; Reaz, Rezwana; Bayzid, Md Shamsuzzoha

Published Nov 11, 2024 on Dryad. https://doi.org/10.5061/dryad.wstqjq2wn

Data files

Nov 11, 2024 version files (3.05 GB total):

Data.zip: 3.05 GB
README.md: 7.87 KB

Abstract

Species tree estimation from genes sampled from throughout the whole genome is challenging because of gene tree discordance, often caused by incomplete lineage sorting (ILS). Quartet-based summary methods for estimating species trees from a collection of gene trees are becoming popular due to their high accuracy and theoretical guarantees of robustness to arbitrarily high amounts of ILS. ASTRAL, the most widely used quartet-based method, aims to infer species trees by maximizing the number of quartets in the gene trees consistent with the species tree. An alternative approach is inferring quartets for all subsets of four species and amalgamating them into a coherent species tree. While summary methods can be sensitive to gene tree estimation error, quartet amalgamation offers an advantage by potentially bypassing gene tree estimation. However, greatly understudied is the choice of weighted quartet inference method and downstream effects on species tree estimations under realistic model conditions. In this study, we investigated a wide array of methods for generating weighted quartets and critically assessed their impact on species tree inference. Our study provides evidence that the careful generation and amalgamation of weighted quartets, as implemented in methods like wQFM, can lead to significantly more accurate trees than popular methods like ASTRAL, especially in the face of gene tree estimation errors.

Description of the data and file structure

Overview:

This dataset contains all the gene sequence alignments and gene trees used in our study. It is also accompanied by helper scripts to modify or generate different model conditions of the data.

Details:

11-taxon: Contains the estimated gene trees, true gene trees, and true species trees. The estimated gene trees are accompanied by the gene sequence alignments they were obtained from (using RAxML). It contains both lower-ILS and higher-ILS model conditions.
15-taxon: Contains the same items as the 11-taxon dataset. It contains four model conditions: every combination of 100 or 1000 genes with sequence lengths of 100 or 1000 base pairs.
37-taxon: Contains the same items as the 11-taxon dataset. Contains several model conditions. New model conditions can be generated from the "1X-800-1000" model condition using the provided scripts.
Avian: Contains the alignments of all 14,446 genes, concatenated in the "Alignments/all/All.aln.reduced" file. The "All.part.reduced" file in the same folder contains the gene boundaries. The gene trees for all genes are also included.
Mammalian: Contains alignments and gene trees of 424 genes.

File: Data.zip

Description: Contains all the data related to this study.

Structure:

Each folder denotes a different dataset. There are two types of datasets: 1) Simulated (11-taxon, 15-taxon, and 37-taxon), and 2) Biological (Avian and Mammalian).

The simulated datasets have the following structure:

```
11-taxon (the main directory)
└───<model-condition> (lower-ILS and higher-ILS)
│   └───estimated-genetrees (contains all the estimated gene trees along with the gene sequences)
│   |   └───<replicate> (replicate number: R1-R20)
│   |   │   └───<gene-number>
│   |   │   |   |   truegene.fasta (the sequence alignment)
│   |   │   |   |   RAxML_bipartitions.final.f200 (the RAxML gene tree)
│   |   │   |   |   RAxML_bootstrap.all (all RAxML bootstrap gene trees)
│   └───true-genetrees (contains all the true gene trees)
│   |   └───<replicate> (replicate number: R1-R20)
│   |   │   └───<gene-number>
│   |   │   |   |   true.gt (the true gene tree)
│   └───true-speciestrees (contains all the true species trees of each replicate)
│   |   │   <replicate>true.tre (the true species tree of each replicate)
15-taxon (the main directory)
└───<model-condition> (one folder for each of the four model conditions)
│   └───estimated-genetrees (contains all the estimated gene trees along with the gene sequences)
│   |   └───<replicate> (replicate number: R1-R10)
│   |   │   └───<gene-number>
│   |   │   |   |   <gene-number>.fasta (the sequence alignment)
│   |   │   |   |   <gene-number>.final.f100 (the RAxML gene tree)
│   |   │   |   |   <gene-number>.all (all RAxML bootstrap gene trees)
└───true-genetrees (true gene trees)
│   └───<replicate> (replicate number: R1-R10)
│   |   └───<gene-number>
│   |   │   |   <gene-number>.tre (the true gene tree)
│   true-species.tre (the true species tree, common to all replicates)
│   true_tree_trimmed (the true species tree without any branch lengths, common to all replicates)
37-taxon (the main directory)
└───<model-condition> (one folder for each model condition)
│   |   mammalian-model-species.tre (the true species tree, common to all replicates)
│   └───<replicate> (replicate number: R1-R20)
│   │   └───<gene-number>
│   │   |   |   <gene-number>.fasta (the sequence alignment)
│   │   |   └───raxmlboot.gtrgamma (contains the RAxML gene tree and the bootstrapped gene trees)
│   │   |   |   |   RAxML_bipartitions.final.f200 (the RAxML gene tree)
│   │   |   |   |   RAxML_bootstrap.all (all RAxML bootstrap gene trees)
└───true-genetrees (the true gene trees)
|   └───mammalian-1X-truegt (true gene trees for the 1X ILS model conditions)
|   |   └───1X-<gene-count>-true (the true gene trees for replicates with a particular gene count (200,400,800))
|   |   |   └───<replicate> (replicate number: R1-R20)
│   │   │   │   └───<gene-number>
│   │   │   │   │   true.gt (the true gene tree)
|   create-sub-alignments-mam.sh (script to create sub-alignments for a new model condition from a model condition with longer sequence length)
|   create_gt.sh (script to estimate gene trees from alignments using RAxML, it also generates 200 bootstrap replicates for each gene tree)
|   create_gt_no_boot.sh (script to estimate gene trees from alignments using RAxML, no bootstrapping)
|   copy_gt.sh (script to copy gene trees from one model condition to a new one, the ILS and sequence length of both model conditions must be identical)
```
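Summary methods usually expect all gene trees of a replicate in a single file, one tree per line. The following is a minimal Python sketch of gathering them from the layout above. It assumes the archive has been extracted locally and that each `<gene-number>` folder holds one tree file; the function names `collect_gene_trees` and `write_gene_trees` are hypothetical and are not part of the provided scripts.

```python
from pathlib import Path


def collect_gene_trees(replicate_dir, tree_name="RAxML_bipartitions.final.f200"):
    """Return the tree strings found in <replicate_dir>/<gene-number>/<tree_name>.

    Assumption: the default tree_name matches the 11-taxon layout; pass a
    different name for datasets that use another file name.
    """
    trees = []
    # Sort the gene folders so the output order is reproducible.
    gene_dirs = sorted(p for p in Path(replicate_dir).iterdir() if p.is_dir())
    for gene_dir in gene_dirs:
        tree_file = gene_dir / tree_name
        if not tree_file.is_file():
            continue  # skip gene folders that have no tree file
        text = tree_file.read_text().strip()
        if text:
            trees.append(text)
    return trees


def write_gene_trees(trees, out_path):
    """Write one tree per line to out_path."""
    Path(out_path).write_text("\n".join(trees) + "\n")
```

For example, `write_gene_trees(collect_gene_trees("11-taxon/<model-condition>/estimated-genetrees/R1"), "R1.genetrees")` would produce one input file for replicate R1, with `<model-condition>` replaced by the actual folder name. Note that the folders are sorted as strings, so `10` comes before `2`.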

The biological datasets have the following structure:

```
Avian (the main directory)
|   avian-model-species.tre (the MP-EST reference tree for the Avian dataset)
└───Gene-Trees
│   └───allgenes (contains all gene trees)
│   |   └───<gene-number>
│   |   │   |   RAxML_bipartitions.final (the RAxML gene tree; for exons, the filename has the prefix 'C12-')
│   |   │   |   RAxML_bootstrap.all (all RAxML bootstrap gene trees; for exons, the filename has the prefix 'C12-')
└───Alignments (contains the gene sequence alignments)
│   └───all
│   |   │   All.aln.reduced (all gene sequences concatenated)
│   |   │   All.part.reduced (contains the gene boundaries)
│   |   │   Sequences-Concatenated-All.nex (concatenation of all genes in .nex format)
│   └───2516_introns (contains sequences for each of the 2516 introns)
│   │   └───2500orthologs
│   |   │   └───<gene-number> (contains the sequence file corresponding to the gene)
│   |   │   │   │   sate.removed.intron.noout.aligned-allgap.filtered (the alignment without outgroups)
│   |   │   │   │   sate.removed.intron.original.aligned-allgap.filtered (the alignment with outgroups)
Mammalian
└───song-mammalian-bio
│   │   mammalian-model-species.tre (the MP-EST reference tree for the Mammalian dataset)
│   └───424genes (contains all the estimated gene trees along with the gene sequences)
│   |   └───<gene-number>
│   |   │   |   <gene-number>.fasta (the sequence alignment)
│   |   │   |   └───raxmlboot.gtrgamma (contains the RAxML gene tree and the bootstrapped gene trees)
│   │   |   |   |   |   RAxML_bipartitions.final.f200 (the RAxML gene tree)
│   │   |   |   |   |   RAxML_bootstrap.all (all RAxML bootstrap gene trees)
```
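Because exon files in the Avian dataset carry a 'C12-' prefix, code that walks `Gene-Trees/allgenes` has to try both file names. A minimal Python sketch follows. It assumes the prefix is prepended to the whole file name (for example `C12-RAxML_bipartitions.final`), which should be checked against the extracted data; the function name `find_avian_gene_tree` is hypothetical.

```python
from pathlib import Path


def find_avian_gene_tree(gene_dir, base_name="RAxML_bipartitions.final"):
    """Return the path of the gene tree in gene_dir, or None if none is found.

    Assumption: exon files are named 'C12-' + base_name; other genes use
    base_name unchanged.
    """
    gene_dir = Path(gene_dir)
    for candidate in (base_name, "C12-" + base_name):
        path = gene_dir / candidate
        if path.is_file():
            return path
    return None
```

Passing `base_name="RAxML_bootstrap.all"` looks up the bootstrap trees in the same way.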

File types:

.tre, .gt, .final.f100, or .final.f200: gene tree or species tree
.all: all bootstrap replicates of gene trees
.fasta or .fa: gene sequence alignments
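
These file types can be read without specialized libraries. The sketch below is a minimal Python example, not the tooling used in the study. It assumes standard FASTA formatting ('>' header lines followed by sequence lines) and one tree per line ending in ';' in the bootstrap files; the function names `read_fasta` and `count_trees` are hypothetical.

```python
def read_fasta(path):
    """Return {header: sequence} for a FASTA alignment file."""
    parts = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts; the header is everything after '>'.
                name = line[1:].strip()
                parts[name] = []
            elif name is not None:
                # Sequences may be wrapped over several lines.
                parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def count_trees(path):
    """Count trees in a file, assuming one tree per line ending in ';'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip().endswith(";"))
```

In an alignment, all sequences returned by `read_fasta` should have the same length, which gives a quick sanity check on a downloaded file.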

Supplementary Material:

The supplementary material contains additional figures and tables for the study, mostly concerning the 11-taxon dataset.

Code/software

Any text editor can be used to view the data. Opening very large files in an editor may make it unresponsive, so inspecting them with shell commands is recommended. Notepad++ works well on Windows.
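
Where shell tools are not available, a very large file can also be previewed from Python without loading it into memory. The following is a minimal sketch; the function name `head` and its default limits are hypothetical choices. Lines are truncated because a concatenated alignment may contain extremely long lines.

```python
def head(path, n_lines=5, max_chars=200):
    """Return the first n_lines of a file, each cut to max_chars characters."""
    preview = []
    with open(path, errors="replace") as handle:
        for _ in range(n_lines):
            chunk = handle.readline(max_chars)
            if not chunk:
                break  # end of file
            preview.append(chunk.rstrip("\n"))
            # Discard the rest of an over-long line in bounded chunks,
            # so memory use stays small regardless of line length.
            while not chunk.endswith("\n"):
                chunk = handle.readline(65536)
                if not chunk:
                    break
    return preview
```

For example, `head("Avian/Alignments/all/All.part.reduced")` would show the first few gene boundary entries, assuming the archive was extracted into the current directory.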

Access information

Other publicly accessible locations of the data:

37-taxon dataset: https://www.ideals.illinois.edu/items/55768
The avian dataset: http://gigadb.org/dataset/101000
The mammalian dataset: https://www.ideals.illinois.edu/items/55772