CASTER: Direct species tree inference from whole-genome alignments
Data files
Sep 27, 2023 version files 117.36 GB
- Alignment_birch.tar.bz2 (81.52 MB)
- Alignment_butterfly.bz2 (610.61 MB)
- Alignment_SR201_0.1X_mutation_part1.tar.bz2 (7.36 GB)
- Alignment_SR201_0.1X_mutation_part2.tar.bz2 (7.35 GB)
- Alignment_SR201_10X_mutation.tar.bz2 (5.87 GB)
- Alignment_SR201_10X_population_part1.tar.bz2 (7.35 GB)
- Alignment_SR201_10X_population_part2.tar.bz2 (7.34 GB)
- Alignment_SR201_default_diploid_part1.tar.bz2 (5.88 GB)
- Alignment_SR201_default_diploid_part2.tar.bz2 (5.88 GB)
- Alignment_SR201_default_diploid_part3.tar.bz2 (5.88 GB)
- Alignment_SR201_default_diploid_part4.tar.bz2 (5.88 GB)
- Alignment_SR201_default_diploid_part5.tar.bz2 (5.88 GB)
- Alignment_SR21_part1.tar.bz2 (6.14 GB)
- Alignment_SR21_part2.tar.bz2 (6.14 GB)
- Alignment_SR21_part3.tar.bz2 (6.14 GB)
- Alignment_SR21_part4.tar.bz2 (6.14 GB)
- Alignment_SR21_part5.tar.bz2 (6.15 GB)
- bpp_mcmc.zip (323.67 MB)
- lizard.phylip.bz2 (29.54 MB)
- README.md (8.06 KB)
- snapper_mcmc.zip (389.30 KB)
- species_trees.zip (7.83 MB)
Aug 07, 2024 version files 122.68 GB
- Alignment_birch.tar.bz2 (81.52 MB)
- Alignment_butterfly.bz2 (610.61 MB)
- Alignment_SR201_0.1X_mutation_part1.tar.bz2 (7.36 GB)
- Alignment_SR201_0.1X_mutation_part2.tar.bz2 (7.35 GB)
- Alignment_SR201_10X_mutation.tar.bz2 (5.87 GB)
- Alignment_SR201_10X_population_part1.tar.bz2 (7.35 GB)
- Alignment_SR201_10X_population_part2.tar.bz2 (7.34 GB)
- Alignment_SR201_default_diploid_part1.tar.bz2 (5.88 GB)
- Alignment_SR201_default_diploid_part2.tar.bz2 (5.88 GB)
- Alignment_SR201_default_diploid_part3.tar.bz2 (5.88 GB)
- Alignment_SR201_default_diploid_part4.tar.bz2 (5.88 GB)
- Alignment_SR201_default_diploid_part5.tar.bz2 (5.88 GB)
- Alignment_SR21_part1.tar.bz2 (6.14 GB)
- Alignment_SR21_part2.tar.bz2 (6.14 GB)
- Alignment_SR21_part3.tar.bz2 (6.14 GB)
- Alignment_SR21_part4.tar.bz2 (6.14 GB)
- Alignment_SR21_part5.tar.bz2 (6.15 GB)
- lizard.phylip.bz2 (29.54 MB)
- mammal82.zip (10.05 MB)
- quartet.zip (5.63 GB)
- README.md (9.58 KB)
- score_true.zip (2.07 MB)
- species_trees.zip (7.83 MB)
- subsample.zip (4.16 MB)
Nov 13, 2024 version files 126.74 GB
- Alignment_birch.tar.bz2 (81.52 MB)
- Alignment_butterfly.bz2 (610.61 MB)
- Alignment_SR201_0.1X_mutation_part1.tar.bz2 (7.36 GB)
- Alignment_SR201_0.1X_mutation_part2.tar.bz2 (7.35 GB)
- Alignment_SR201_10X_mutation.tar.bz2 (5.87 GB)
- Alignment_SR201_10X_population_part1.tar.bz2 (7.35 GB)
- Alignment_SR201_10X_population_part2.tar.bz2 (7.34 GB)
- Alignment_SR201_default_diploid_part1.tar.bz2 (5.88 GB)
- Alignment_SR201_default_diploid_part2.tar.bz2 (5.88 GB)
- Alignment_SR201_default_diploid_part3.tar.bz2 (5.88 GB)
- Alignment_SR201_default_diploid_part4.tar.bz2 (5.88 GB)
- Alignment_SR201_default_diploid_part5.tar.bz2 (5.88 GB)
- Alignment_SR21_part1.tar.bz2 (6.14 GB)
- Alignment_SR21_part2.tar.bz2 (6.14 GB)
- Alignment_SR21_part3.tar.bz2 (6.14 GB)
- Alignment_SR21_part4.tar.bz2 (6.14 GB)
- Alignment_SR21_part5.tar.bz2 (6.15 GB)
- gene_trees.zip (4.06 GB)
- lizard.phylip.bz2 (29.54 MB)
- mammal82.zip (10.05 MB)
- quartet.zip (5.63 GB)
- README.md (8.99 KB)
- score_true.zip (2.07 MB)
- species_trees.zip (7.83 MB)
- subsample.zip (4.16 MB)
Abstract
Genomes contain mosaics of discordant evolutionary histories, challenging the accurate inference of the tree of life. While genome-wide data are routinely used for discordance-aware phylogenomic analyses, due to modeling and scalability limitations, the current practice leaves out large chunks of the genomes. As more high-quality genomes become available, we urgently need discordance-aware methods to infer the tree directly from a multiple genome alignment. Here, we introduce CASTER, a site-based method that eliminates the need to predefine recombination-free loci. CASTER is statistically consistent under incomplete lineage sorting and is scalable to hundreds of mammalian whole genomes. We show both in simulations and on real data that CASTER is scalable and accurate and that its per-site scores can reveal interesting patterns of evolution across the genome.
https://doi.org/10.5061/dryad.bg79cnph0
Data, Code, and Tables for “CASTER: Direct species tree inference from whole-genome alignments.”
To benchmark CASTER against alternative methods, we simulate aligned sequences under the Hudson model (species tree with recombination). Specifically, we simulate 200 ingroup species plus 1 outgroup species (the SR201 dataset) to benchmark CASTER under different mutation rates, population sizes, and ploidies (unphased). Next, we subsample 20 ingroup species plus 1 outgroup species (the SR21 dataset) and increase the number of individuals per species to benchmark CASTER with up to 20 individuals per species. The scripts for those simulations are also provided (simulate_SR201_10X_population.py, simulate_SR201_most_conditions.py, simulate_SR21.py).
We also simulate many more alignments under the SR201 condition for the scalability study (scale). These are not included here because:
- The total size exceeds the 300 GB limit.
- Anyone interested in this dataset can easily regenerate it using simulate_scale.py.
- For this experiment we are interested only in running time, not accuracy.
We also apply CASTER to biological datasets. We upload a copy of each alignment used here unless it is easily accessible elsewhere, in which case we provide a link.
Data
Alignments
Quartet dataset (FASTA format)
quartet.zip: all conditions (100 replicates)
SR201 simulated dataset (FASTA format)
- Alignment_SR201_default_diploid_part[1-5].tar.bz2: default + diploid condition (50 replicates)
- Alignment_SR201_0.1X_mutation_part[1-2].tar.bz2: 0.1X mutation rate condition (50 replicates)
- Alignment_SR201_10X_mutation.tar.bz2: 10X mutation rate condition (20 replicates)
- Alignment_SR201_10X_population_part[1-2].tar.bz2: 10X population size condition (50 replicates)
Note:
Alignment_SR201_default_diploid_part[1-5].tar.bz2 contains two phased sequences per species. The default (haploid) condition can be obtained by extracting every other sequence.
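The extraction described in this note can be sketched as follows. This is a minimal illustration, not the script used for the paper: it assumes that the two haplotypes of each species are adjacent in the FASTA file and that keeping the first of each pair is acceptable. The sequence names in the demo (A_1, A_2, ...) are hypothetical.

```python
import io
from itertools import islice


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def take_every_other(records, offset=0):
    """Keep records offset, offset+2, offset+4, ... (one haplotype per species).

    Assumption: the two haplotypes of a species are consecutive records.
    """
    return islice(records, offset, None, 2)


# Tiny in-memory demo with hypothetical sequence names.
example = io.StringIO(">A_1\nACGT\n>A_2\nACGA\n>B_1\nTCGT\n>B_2\nTCGA\n")
haploid = list(take_every_other(read_fasta(example)))
for name, seq in haploid:
    print(f">{name}\n{seq}")
```

Because both functions are lazy, the same pattern can stream through a multi-gigabyte alignment without holding it in memory; only the demo materializes a list.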
SR21 simulated dataset (FASTA format)
Alignment_SR21_part[1-5].tar.bz2: all conditions (50 replicates)
Note:
Alignment_SR21_part[1-5].tar.bz2 contains 20 sequences per species. Conditions with 1/2/5 individuals per species can be obtained by taking every 20th/10th/4th sequence.
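The relationship between the number of individuals and the subsampling stride is simply 20 divided by the number of individuals. The sketch below illustrates this on in-memory records. It assumes that the 20 sequences of each species are stored consecutively and that sampling starts at the first sequence; both are assumptions for illustration, and the record names are hypothetical.

```python
from itertools import islice

SEQS_PER_SPECIES = 20  # stated in the note above


def stride_for(individuals):
    """Return the stride that keeps `individuals` sequences per species."""
    if individuals <= 0 or SEQS_PER_SPECIES % individuals != 0:
        raise ValueError("individuals must be a positive divisor of 20")
    return SEQS_PER_SPECIES // individuals


def subsample(records, individuals):
    """Keep every k-th record, where k = 20 / individuals.

    Assumption: sequences of the same species are consecutive records.
    """
    return list(islice(records, 0, None, stride_for(individuals)))


# Hypothetical records: 2 species x 20 sequences each.
records = [(f"sp{s}_{i}", "ACGT") for s in range(2) for i in range(SEQS_PER_SPECIES)]
five_per_species = subsample(records, 5)
print([name for name, _ in five_per_species])
```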
S101 dataset
We reused the S101 dataset from ASTRAL-III (ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees). You can access the S101 dataset (S101.tar.gz) here.
Placental mammal dataset
See: https://cglgenomics.ucsc.edu/data/cactus/
Avian dataset
Also see: https://cglgenomics.ucsc.edu/data/cactus/
82 mammal dataset
mammal82.zip
See: https://datadryad.org/stash/dataset/doi:10.5061/dryad.bp462
Ratite dataset
See: https://datadryad.org/stash/dataset/doi:10.5061/dryad.v72d325
Ruminant dataset
See: https://datadryad.org/stash/dataset/doi:10.5061/dryad.52213gc
Oakleaf butterfly dataset
Alignment_butterfly.bz2
Birch dataset
Alignment_birch.tar.bz2
Lizard dataset
lizard.phylip.bz2
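Before decompressing a large alignment, it can be useful to check its dimensions. The sketch below reads only the header line of a bzip2-compressed PHYLIP file, assuming the file is a single bzip2 stream whose first line holds the number of taxa and the number of sites (the usual PHYLIP convention). The demo writes a small hypothetical file rather than touching the real data.

```python
import bz2
import os
import tempfile


def phylip_dimensions(path):
    """Return (n_taxa, n_sites) from the header of a .phylip.bz2 file."""
    # bz2.open in text mode decompresses lazily, so only the start is read.
    with bz2.open(path, "rt") as handle:
        fields = handle.readline().split()
    if len(fields) < 2:
        raise ValueError("first line does not look like a PHYLIP header")
    return int(fields[0]), int(fields[1])


# Demo on a tiny hypothetical alignment written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.phylip.bz2")
    with bz2.open(demo_path, "wt") as out:
        out.write("2 4\ntaxonA ACGT\ntaxonB ACGA\n")
    dims = phylip_dimensions(demo_path)
print(dims)
```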
Gene trees
gene_trees.zip: Gene trees for wASTRAL inferences (Newick format)
Folder structure
- SR201_default_condition: SR201 dataset default condition (50 replicates)
- SR201_0.1X_mutation: SR201 dataset 0.1X mutation rate condition (50 replicates)
- SR201_10X_mutation: SR201 dataset 10X mutation rate condition (20 replicates)
- SR201_10X_population: SR201 dataset 10X population size condition (50 replicates)
- SR21_1_ind: SR21 dataset 1 individual (50 replicates)
- SR21_5_ind: SR21 dataset 5 individuals (50 replicates)
Species trees
- species_trees.zip: Species tree files and logs
- subsample.zip: Species tree files and logs for subsample experiment
- score_true.zip: Comparison of CASTER scores between true and reconstructed phylogenies
Folder structure (species_trees.zip)
- biological: CASTER inferred trees on biological datasets
- truth: True species tree for SR201 and SR21 datasets
- estimated/SR201_default_condition: Estimated species trees for SR201 dataset default condition (50 replicates)
- estimated/SR201_unphased_diploid: Estimated species trees for SR201 dataset unphased diploid condition (50 replicates)
- estimated/SR201_0.1X_mutation: Estimated species trees for SR201 dataset 0.1X mutation rate condition (50 replicates)
- estimated/SR201_10X_mutation: Estimated species trees for SR201 dataset 10X mutation rate condition (20 replicates)
- estimated/SR201_10X_population: Estimated species trees for SR201 dataset 10X population size condition (50 replicates)
- estimated/SR21_1_ind: Estimated species trees for SR21 dataset 1 individual (50 replicates)
- estimated/SR21_2_ind: Estimated species trees for SR21 dataset 2 individuals (50 replicates)
- estimated/SR21_5_ind: Estimated species trees for SR21 dataset 5 individuals (50 replicates)
- estimated/SR21_20_ind: Estimated species trees for SR21 dataset 20 individuals (50 replicates)
- estimated/scale: Estimated species trees for scale study dataset
Simulated files (estimated/*)
- *.caster-site files: CASTER-site species trees
- *.caster-pair files: CASTER-pair species trees
- *.raxml.bestTree files: RAxML species trees
- *.svdquartets files: SVDQuartets species trees
- *.gtrees.wastral files: wASTRAL-hybrid species trees
- *.log: log files
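When iterating over the estimated/* folders, the suffix conventions above can be turned into a small lookup. The sketch below maps a file name to the method that produced it; the suffixes come from the list above, while the example paths (rep01, rep02) are hypothetical and only illustrate the pattern.

```python
from collections import Counter
from pathlib import PurePosixPath

# Suffix-to-method table taken from the list above.
SUFFIX_TO_METHOD = {
    ".caster-site": "CASTER-site",
    ".caster-pair": "CASTER-pair",
    ".raxml.bestTree": "RAxML",
    ".svdquartets": "SVDQuartets",
    ".gtrees.wastral": "wASTRAL-hybrid",
    ".log": "log",
}


def method_of(filename):
    """Return the method name for a file, or None if the suffix is unknown."""
    name = PurePosixPath(filename).name
    for suffix, method in SUFFIX_TO_METHOD.items():
        if name.endswith(suffix):
            return method
    return None


# Hypothetical file names used only for illustration.
paths = [
    "estimated/SR201_default_condition/rep01.caster-site",
    "estimated/SR201_default_condition/rep01.caster-pair",
    "estimated/SR201_default_condition/rep02.caster-site",
    "estimated/SR21_1_ind/rep01.raxml.bestTree",
    "estimated/SR21_1_ind/rep01.caster-site.log",
]
counts = Counter(method_of(p) for p in paths)
print(counts)
```

Matching on the end of the file name (rather than on `Path.suffix`) is a deliberate choice here, because several of the suffixes contain more than one dot.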
Biological files (biological/*)
- mammal/whole_genome.site: Mammalian whole-genome CASTER-site tree
- mammal/chr/chr*.site: Mammalian per-chromosome CASTER-site trees
- ruminant/Ruminant.WGA.phy.caster-site: Ruminant whole-genome CASTER-site tree
- ruminant/Ruminant.WGA.phy.caster-pair: Ruminant whole-genome CASTER-pair tree
- butterfly/38indiv_4Dec.recode_scaffold_1to32.phylip.caster-site: Oakleaf butterfly SNP tree
- birch/all.ploidy.phylip.caster-pair: CASTER-pair tree for all 47 birch species
- birch/20_diploid.phylip.caster-site: CASTER-site tree for 20 diploid birch species
- birch/20_diploid.phylip.caster-pair: CASTER-pair tree for 20 diploid birch species
- lizard/all.phylip.na.caster-{site,pair}: CASTER-site/pair trees for lizard nucleotide alignments with 3rd codon positions
- lizard/all.phylip.na12.caster-{site,pair}: CASTER-site/pair trees for lizard nucleotide alignments without 3rd codon positions
- lizard/all.phylip.aa.recode.caster-pair: CASTER-pair tree for lizard amino-acid alignments
- *.log: log files
Code
CASTER software
Users can find the up-to-date version of CASTER here.
For reproducibility of our results, we also upload CASTER v1.15.0.0 as the “ASTER.zip” file.
Sliding window code
This is the code we used in the sliding-window analysis. Compile it and run it without input parameters to see the help messages.
caster-site_sliding_window.cpp
Proof checks
CASTER-Proof.nb is a Mathematica script to verify the correctness of Proposition 2.
Simulation
Scripts to simulate alignments in SR201 and SR21 datasets:
- simulate_SR201_10X_population.py
- simulate_SR201_most_conditions.py
- simulate_scale.py
- simulate_SR21.py
Analyzing sliding window outputs
- stat_mammalian_sliding_window.py: script for mammalian sliding windows
- stat_SR201_sliding_window.py: script for SR201 sliding windows
GitHub repository for data and scripts used for plotting figures
See: https://github.com/chaoszhang/CASTER-data
- CASTER-data-main.zip: A permanent archive of the GitHub repository
Change log
October 2024
Uploading gene trees
- gene_trees.zip: Annotated gene tree files for wASTRAL and ASTRAL
Combining tables and scripts into a GitHub repository
- CASTER-data-main.zip: Combined tables and scripts for plotting figures
- Deleting individual table and script files
August 2024
Additional biological dataset
- mammal82.zip: Second mammal dataset, by Liu et al.
Additional simulation experiments
- quartet.zip: New quartet (4-species) dataset
- subsample.zip: Additional subsampling experiment complementary to SR201 dataset
- score_true.zip: Additional experiment comparing CASTER scores of reconstructed and true phylogenies complementary to SR201 dataset
Obsolete experiments removed
- bpp_mcmc.zip: BPP experiment on SR21 dataset was removed due to convergence issues
- snapper_mcmc.zip: SNAPPER experiment on SR21 dataset was removed due to convergence issues