Coping with ineffective overlap in multilocus phylogenetics
Data files
Jun 27, 2025 version files 1.52 GB
- Dobrin_etal_2018_CC0compliant.zip (640.45 MB)
- Gordon1986.zip (14.15 KB)
- GymnoTree_2022-03-15.zip (876.37 MB)
- README.md (6.61 KB)
Abstract
Missing data is a long-standing issue in phylogenetic inference, which often results in high levels of taxonomic instability, obscuring otherwise well-supported relationships. Multiple approaches have been developed to deal with the negative effects of ineffective overlap on tree resolution, often by identifying taxa for removal. Here we repurpose a heuristic method developed to identify unstable taxa in morphological data matrices, concatabominations, and combine it with a novel gene-tree jackknifing on matrix representation of trees to identify candidates for targeted sequencing. Using a multilocus caecilian dataset, we illustrate the method's capacity to identify candidate taxa and loci for additional sequencing, compare the results to those of the mathematics-based gene sampling sufficiency approach, and explore the terrace space associated with the multilocus dataset. We show that our approach yields tractable numbers of loci/taxa for targeted sequencing that successfully mitigate topological instability due to ineffective overlap, even when modest amounts of data are added.
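As a rough illustration of the sampling problem described above, the sketch below lists pairs of taxa that are never sampled together in any locus. This is a simplified proxy for ineffective overlap, not the concatabominations method or the gene sampling sufficiency approach used in the paper; the function name `non_overlapping_pairs` and the toy sampling table are hypothetical.

```python
from itertools import combinations


def non_overlapping_pairs(locus_taxa):
    """Return the sorted taxon pairs that never co-occur in any locus.

    locus_taxa maps a locus name to the set of taxa sampled for that locus.
    """
    all_taxa = sorted(set().union(*locus_taxa.values()))
    pairs = []
    for a, b in combinations(all_taxa, 2):
        # A pair overlaps if at least one locus contains both taxa.
        if not any(a in taxa and b in taxa for taxa in locus_taxa.values()):
            pairs.append((a, b))
    return pairs


# Hypothetical toy sampling, not taken from the caecilian dataset.
sampling = {
    "locus1": {"A", "B", "C", "D"},
    "locus2": {"A", "B", "E"},
}

print(non_overlapping_pairs(sampling))
```

In this toy case, taxon E shares no locus with C or D, so sequencing locus1 for E (or locus2 for C and D) would be the kind of targeted addition the abstract refers to.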
Supplementary data for:
Serra Silva, A. et al. (2025). Coping with Ineffective Overlap in Multilocus Phylogenetics. In press.
Detailed descriptions of analyses and parameters can be found in the main document and its supplemental information, but all folders below include the scripts necessary to repeat the analyses.
File extensions and naming conventions are shared between zipped folders and are explained below using the Dobrin_etal2018_CC0compliant.zip as an example.
If the geneTrees subfolder includes subfolders named after phylogenetic inference software and no additional listing of files is given, then ALL file extensions correspond to the software's default output.
Contents of each folder
- Dobrin_etal2018_CC0compliant.zip: contains the analyses based on the datasets used in Dobrin et al. (2018)
  - Each dataset subfolder has the following organisation:
    - Astral: has the input and outputs of the Astral analyses
      - _localPP - Astral output with branch local posterior probability
      - _quartet - Astral output with branch quartet scores (QS)
      - _noQMs - any branches with branch support QS = '?' edited to QS = '0'
      - _collapsed - branches with QS < 0.51 collapsed
    - concatabominations: has the scripts to generate the MRs and the input to the Concatabominations pipeline
      - Subfolders named allLoci include all loci alignments; subfolders starting with 'no' do not contain the named locus (e.g., if a subfolder is named no16S, then 16S is the removed/jackknifed alignment)
      - MRP_test.py - p4 script to generate matrix representation (MR) of splits
      - mrp.nex - output of MRP_test.py
      - mrpCleanUp.sh - script to convert mrp.nex into the Concatabominations-compatible mrp_concat.nex
      - mrp_concat.nex - Concatabominations pipeline input
      - mrp_concat.nex.* - files output by Concatabominations
        - mrp_concat.nex.html - Safe Taxonomic Reduction results
        - mrp_concat.nex.taxonomicEqiv.sim.txt - concatabominations network (required input for Cytoscape)
    - geneTrees: contains the scripts to run the RAxML searches and their outputs
      - *RAxML.sh - script to run the RAxML tree searches
      - output_RAxML.zip - includes ALL intermediate output files for all loci
      - RAxML_bestTree.* - inferred best likelihood tree for each locus
    - MRP: has the input, script, and output of the MRP searches
      - Paup_morphoblock.nex - script to run PAUP using mrp_concat.nex as input. All other files in this subfolder will be generated by running the PAUP script and are named therein.
    - seqs: contains the modified phylip data matrices; all alignments have a minimum of four taxa
      - For the Mammals and Primates subfolders, please email us for the modified alignment files. We cannot provide them here due to licensing constraints; any re-use of these files requires citing their original publications.
    - Terraphy: contains the output of the terraphy analyses
      - Astral: input, script, and output of terraphy run on the Astral tree
        - input - contains the rooted tree and data matrix
        - preprocessing - intermediate files generated by the terraphy analyses
        - run_all_analyses.sh - script to run the terraphy analyses
        - build.tre - BUILD consensus of the terrace
        - parentCount - text file with the number of trees in the terrace
        - strict.tre - strict consensus of the terrace
      - MRP: input, script, and output of terraphy run on the MRP tree
        - same subfolder and file organisation as the Terraphy/Astral folder
- Gordon1986.zip: contains the terraphy analyses on Gordon's (1986) trees
- GymnoTree_2022-03-15.zip: contains the data, scripts, and outputs for the caecilian trees
  - Alignments: contains the unprocessed and masked gene alignments
  - Astral_supertrees: input and output of Astral analyses with the following organisation (used throughout all subfolders)
    - MrBayes
      - 16SnoHmont: Hypogeophis montanus present in BDNF alignment but not 16S rRNA
      - 16SwHmont: Hypogeophis montanus present in BDNF and 16S rRNA alignments
      - noHmont: Hypogeophis montanus not present in any alignment
      - remaining folders follow the naming convention described in the Dobrin*.zip folder listing
    - RAxML
      - 16SnoHmont: Hypogeophis montanus present in BDNF alignment but not 16S rRNA
      - 16SwHmont: Hypogeophis montanus present in BDNF and 16S rRNA alignments
      - noHmont: Hypogeophis montanus not present in any alignment
  - concatabominations: contains the script to generate matrix representations (MR), the processed MRs, and the output of the Concatabominations pipeline, split into
    - MrBayes_geneTrees
    - RAxML_geneTrees
  - concatenatedMatrices: input, scripts, and output of analyses on concatenated matrices, split into
    - MrBayes_concat
    - RAxML_concat
  - geneTrees: input, scripts, and output of analyses on individual gene alignments, split into
    - MrBayes
    - RAxML
  - MRP_supertrees: input, scripts, and output of MRP searches
    - MrBayes_geneTrees
    - RAxML_geneTrees
  - pseudoData: concatabominations and terraphy results for the pseudoData analyses; organisation follows that of the main concatabominations and terraphy folders
  - terraphy: input, scripts, and output of terraphy analyses
    - Astral_supertrees
    - concatenatedMatrices
    - MRP_supertrees
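
The matrix representation and locus jackknifing behind the allLoci and no<locus> subfolders can be sketched in a few lines. This is a minimal illustration of standard Baum-Ragan coding under simplifying assumptions, not the p4 script MRP_test.py shipped in the archives: each gene tree is given directly as its taxon set plus a list of groups (one side of each split) instead of being parsed from a tree file, and the names `mrp_matrix`, `jackknife_loci`, and the toy trees are hypothetical.

```python
def mrp_matrix(gene_trees):
    """Code gene trees as a binary matrix (Baum-Ragan style).

    gene_trees maps a locus name to (taxa, groups): the set of taxa in
    that gene tree and a list of taxon sets, one per non-trivial split.
    Each group becomes one column: '1' for members, '0' for other taxa
    in the same tree, '?' for taxa missing from that tree.
    """
    all_taxa = sorted(set().union(*(taxa for taxa, _ in gene_trees.values())))
    rows = {taxon: [] for taxon in all_taxa}
    for locus in sorted(gene_trees):
        taxa, groups = gene_trees[locus]
        for group in groups:
            for taxon in all_taxa:
                if taxon not in taxa:
                    rows[taxon].append("?")
                elif taxon in group:
                    rows[taxon].append("1")
                else:
                    rows[taxon].append("0")
    return {taxon: "".join(states) for taxon, states in rows.items()}


def jackknife_loci(gene_trees):
    """Yield (label, subset) for the full set and each leave-one-locus-out set."""
    yield "allLoci", dict(gene_trees)
    for dropped in sorted(gene_trees):
        kept = {k: v for k, v in gene_trees.items() if k != dropped}
        yield "no" + dropped, kept


# Hypothetical toy gene trees, not taken from the archived datasets.
trees = {
    "16S": ({"A", "B", "C", "D"}, [{"A", "B"}]),
    "BDNF": ({"A", "B", "C", "E"}, [{"A", "B"}, {"A", "B", "C"}]),
}

for label, subset in jackknife_loci(trees):
    print(label)
    for taxon, states in mrp_matrix(subset).items():
        print(" ", taxon, states)
```

Note that removing a locus can also remove taxa from the matrix altogether: in the toy data, taxon D is sampled only for 16S, so it is absent from the no16S matrix.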
Citations
If you use the data matrices in the Dobrin_etal2018_CC0compliant.zip, we encourage you to cite the original publications. These are listed in Table 4 of the main text and in Table 1 of Dobrin et al. (2018).
- Silva, Ana Serra; Siu-Ting, Karen; Creevey, Christopher J et al. (2025). Coping with Ineffective Overlap in Multilocus Phylogenetics. Systematic Biology. https://doi.org/10.1093/sysbio/syaf044
