Data from: Biased estimates of phylogenetic branch lengths resulting from the discretised Gamma model of site rate heterogeneity
Data files
May 27, 2026 version files 1.23 GB
Figs1-3-S1.tar.gz (679.46 MB)
FigsS4-S5-S6.tar.gz (548.90 MB)
PANGEA_IDS.zip (49.49 KB)
README.md (4.64 KB)
Abstract
Simulated data underlying the analyses in the paper 'Biased estimates of phylogenetic branch lengths resulting from the discretised Gamma model of site rate heterogeneity' and PANGEA-HIV IDs for real sequences used.
Dataset DOI: 10.5061/dryad.d51c5b0fh
Description of the data and file structure
Contents:
File: PANGEA_IDS.zip
Description: IDs for real HIV data
This folder contains CSV files listing all PANGEA sequence IDs used in the analysis of real HIV data (methods subsection 'HIV phylogenetic reconstruction'). Each file is a single replicate of the downsampling exercise. Replicate 1 underlies figures 2 and S3, while figure S2 compares the results of all four.
The CSV files have two columns. The first contains the sequence IDs. The second gives the size of the smallest alignment that includes that sequence. For example, if this value is 950, the sequence is present only in the 950-, 975- and 1000-sequence alignments, while if it is 50, the sequence is present in every alignment.
The sequences can be found at http://github.com/PANGEA-HIV/PANGEA-Sequences under these IDs.
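The minimum-size rule above can be applied with a short filter. The following is a minimal Python sketch, not part of the deposited data: the function name `ids_in_alignment`, the `has_header` switch, and the example rows are all assumptions made for illustration, and whether the real CSV files carry a header row should be checked before use.

```python
import csv
import io


def ids_in_alignment(csv_file, alignment_size, has_header=False):
    """Return the IDs of sequences present in the alignment of the given size.

    A sequence is included when its minimum alignment size (second column)
    is less than or equal to alignment_size.
    """
    reader = csv.reader(csv_file)
    if has_header:
        # Assumption: skip one header row if the file has one.
        next(reader, None)
    return [row[0] for row in reader if row and int(row[1]) <= alignment_size]


# Made-up rows for illustration only; these are not real PANGEA IDs.
example = io.StringIO("seqA,50\nseqB,950\nseqC,1000\n")
print(ids_in_alignment(example, 975))  # ['seqA', 'seqB']
```

For a real file, an open file handle can be passed in place of the `io.StringIO` object.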
File: Figs1-3-S1.tar.gz
Description: Simulated data for the main text, figure 9, and figure S1
This contains the first set of simulated data (methods subsection 'Simulations'). Sequences are simulated using AliSim under either the 4-category DGM (gm4) or a continuous Gamma rate heterogeneity model (gammacont).
Branch length effect contains the simulated trees, simulated alignments, and reconstructed trees used for figures 1a and 1b.
simulated_trees contains the simulated phylogenies in Nexus format.
simulation_details_bl.csv gives the mean branch lengths for each simulated phylogeny.
simulated_alignments_expansion contains the simulated sequences underlying figures 1 and 2. Files whose names contain 'disc' were simulated using a discrete rate heterogeneity model, and those containing 'cont' using a continuous one.
reconstructed_trees_expansion contains the reconstructed phylogenies ('.treefile') and IQTREE information files ('.iqtree') for the data underlying figures 1 and 2. Files whose names contain 'disc' were simulated using a discrete rate heterogeneity model, and those containing 'cont' using a continuous one. Files whose names contain 'gamma' were reconstructed using the DGM, and those containing 'freerate' using FreeRate.
simulated_alignments_contraction contains the simulated sequences underlying figure 9. All rate heterogeneity models were continuous.
reconstructed_trees_contraction contains the reconstructed phylogenies ('.treefile'), and IQTREE information files ('.iqtree'), for the data underlying figure 9. All were based on alignments simulated using a continuous model and reconstructed with the DGM.
Sample size effect contains the simulated trees, simulated alignments, and reconstructed trees used for figures 1c and S1.
simulated_trees contains the simulated phylogenies in Nexus format. The number of tips appears in the file name after 'ss'.
simulated_alignments contains the simulated sequences. The replicate number appears after 'rep' and the number of tips after 'ss'. All rate heterogeneity models were continuous.
reconstructed_trees_iqtree contains the reconstructed phylogenies ('.treefile') and IQTREE information files ('.iqtree') for the data underlying figures 3 and 6. The replicate number appears after 'rep' and the number of tips after 'ss'. Files whose names contain 'gamma' were reconstructed using the DGM, and those containing 'freerate' using FreeRate.
reconstructed_trees_phyml contains the reconstructed phylogenies for the data underlying figure S1 where the reconstruction package was PhyML. The replicate number appears after 'rep' and the number of tips after 'ss'. Files whose names contain 'gamma' were reconstructed using the DGM, and those containing 'freerate' using FreeRate.
reconstructed_trees_raxmlng contains the reconstructed phylogenies for the data underlying figure S1 where the reconstruction package was RAxML-NG. The replicate number appears after 'rep' and the number of tips after 'ss'. Files whose names contain 'gamma' were reconstructed using the DGM, and those containing 'freerate' using FreeRate.
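The naming conventions described above ('rep', 'ss', 'gamma', 'freerate') lend themselves to simple pattern matching. The Python sketch below is illustrative only: the exact file name layout is not given here, so the regular expressions, the function name `describe_tree_file`, and the example file name are assumptions to be checked against the real files.

```python
import re


def describe_tree_file(filename):
    """Extract replicate number, tip count and reconstruction model from a file name.

    Assumes the digits directly follow 'rep' and 'ss', optionally after an underscore.
    """
    rep = re.search(r"rep_?(\d+)", filename)
    ss = re.search(r"ss_?(\d+)", filename)
    # 'freerate' is tested first in case 'gamma' also appears in the name
    # for another reason (for example as part of a simulation model label).
    if "freerate" in filename:
        model = "FreeRate"
    elif "gamma" in filename:
        model = "DGM"
    else:
        model = None
    return {
        "replicate": int(rep.group(1)) if rep else None,
        "tips": int(ss.group(1)) if ss else None,
        "model": model,
    }


# Hypothetical file name, for illustration only.
print(describe_tree_file("tree_ss100_rep3_freerate.treefile"))
```

Fields that cannot be found are returned as `None`, so unexpected file names are easy to spot rather than silently misread.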
File: FigsS4-S5-S6.tar.gz
Description: Simulated data for the exercise in the appendix
This simulated data is used for the much larger exercise in the appendix (section 'Biased per-site rate estimates lead to biased branch lengths'). This is a large file. File names are of the format:
aliseqs_gammacont_X_Y_tryZ.fasta
Here X is the replicate of the simulation (between 1 and 50), Y is the sample size (between 100 and 500), and Z the replicate of the downsampling process (between 1 and 5).
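The file name format above can be parsed mechanically, for example to group alignments by sample size. The following Python sketch assumes the format is exactly `aliseqs_gammacont_X_Y_tryZ.fasta` with integer X, Y and Z; the names `AlignmentName` and `parse_alignment_name` and the example file name are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class AlignmentName(NamedTuple):
    simulation_replicate: int   # X in the file name
    sample_size: int            # Y in the file name
    downsampling_replicate: int  # Z in the file name


PATTERN = re.compile(r"^aliseqs_gammacont_(\d+)_(\d+)_try(\d+)\.fasta$")


def parse_alignment_name(filename: str) -> Optional[AlignmentName]:
    """Return the X, Y and Z fields of a file name, or None if it does not match."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    x, y, z = (int(group) for group in match.groups())
    return AlignmentName(x, y, z)


# Hypothetical file name, for illustration only.
print(parse_alignment_name("aliseqs_gammacont_12_300_try4.fasta"))
```

The pattern is anchored at both ends, so files with a different prefix or extension return `None` instead of a partial match.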
