Data sets for phylogenomic analyses in: Ant backbone phylogeny resolved by modelling compositional heterogeneity among sites in genomic data
Data files
Jan 04, 2024 version files 69.49 MB
- Cai_2024_ant_phylogeny_data.zip (69.48 MB)
- README.md (7.72 KB)
Abstract
Ants are the most ubiquitous and ecologically dominant arthropods on Earth, and understanding their phylogeny is crucial for deciphering their character evolution, species diversification, and biogeography. Although recent genomic data have shown promise in clarifying intrafamilial relationships across the tree of ants, inconsistencies between molecular datasets have also emerged. Here I re-examine the most comprehensive published Sanger-sequencing and genome-scale datasets of ants using model comparison methods that model among-site compositional heterogeneity to understand the sources of conflict in phylogenetic studies. My results under the best-fitting model, selected on the basis of Bayesian cross-validation and posterior predictive model checking, identify contentious nodes in ant phylogeny whose resolution is modelling-dependent. I show that the Bayesian infinite mixture CAT model outperforms empirical finite mixture models (C20, C40 and C60) and that, under the best-fitting CAT-GTR+G4 model, the enigmatic Martialis heureka is sister to all ants except Leptanillinae, rejecting the more popular hypothesis, supported under worse-fitting models, that places it as sister to Leptanillinae. These analyses resolve a lasting controversy in ant phylogeny and highlight the significance of model comparison and adequate modelling of among-site compositional heterogeneity in reconstructing the deep phylogeny of insects.
This README summarizes the output files of my phylogenetic analyses, deposited in the Dryad repository, for the paper:
Cai, C., 2024. Ant backbone phylogeny resolved by modelling compositional heterogeneity among sites in genomic data. Communications Biology.
The results of my phylogenetic analyses are listed in two root folders, corresponding to the two previously published studies: Borowiec et al. (2019) and Romiguier et al. (2022).
1-Borowiec et al. 2019:
This folder contains four subfolders with results for four datasets of Borowiec et al. (2019), analysed under the site-heterogeneous CAT-GTR+G4 model in PhyloBayes.
1-Full_data_set_unconstrained-7451 NT sites: Full 11-gene matrix (123 taxa, 7,451 nucleotide [NT] sites).
- D1-bpcomp.bpdiff: bpdiff result using the bpcomp tool in PhyloBayes
- D1-bpcomp.con.tre: consensus tree of the PhyloBayes analysis
2-AT-rich_outgr_removed-7451 NT sites: Full matrix with the most AT-rich outgroups excluded (117 taxa, 7,451 NT sites).
- D2-bpcomp.bpdiff: bpdiff result using the bpcomp tool in PhyloBayes
- D2-bpcomp.con.tre: consensus tree of the PhyloBayes analysis
3-GC-rich_outgr_removed-7451 NT sites: Full matrix with the most GC-rich outgroups excluded (117 taxa, 7,451 NT sites).
- D3-bpcomp.bpdiff: bpdiff result using the bpcomp tool in PhyloBayes
- D3-bpcomp.con.tre: consensus tree of the PhyloBayes analysis
4-Homogeneous-3995 NT sites: Homogeneous matrix with heterogeneous partitions removed (123 taxa, 3,995 NT sites).
- D4-bpcomp.bpdiff: bpdiff result using the bpcomp tool in PhyloBayes
- D4-bpcomp.con.tre: consensus tree of the PhyloBayes analysis
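Each `*-bpcomp.bpdiff` file records how well the independent PhyloBayes chains agree. The sketch below is one possible way to pull the summary statistics out of such a file. It assumes the file contains lines of the form `maxdiff : <value>` and `meandiff : <value>`, which this README does not confirm, and the function names `read_bpdiff` and `describe_maxdiff` are hypothetical. The thresholds follow the rough guideline in the PhyloBayes manual (maxdiff below 0.1 for a good run, below 0.3 for an acceptable one).

```python
import re

# Assumed line shape: "maxdiff : 0.05" / "meandiff : 0.001" (not confirmed by this README).
_DIFF_LINE = re.compile(r"^\s*(maxdiff|meandiff)\s*:\s*([0-9.eE+-]+)\s*$", re.MULTILINE)


def read_bpdiff(text):
    """Return a dict with whichever of 'maxdiff' / 'meandiff' appear in the text."""
    return {key: float(value) for key, value in _DIFF_LINE.findall(text)}


def describe_maxdiff(maxdiff):
    """Map maxdiff onto the rough convergence guideline of the PhyloBayes manual."""
    if maxdiff < 0.1:
        return "good"
    if maxdiff < 0.3:
        return "acceptable"
    return "not converged"


# Made-up example text, not taken from the deposited files.
example = "maxdiff     : 0.0526316\nmeandiff    : 0.00078125\n"
stats = read_bpdiff(example)
print(stats["maxdiff"], describe_maxdiff(stats["maxdiff"]))
```

For a deposited file, the same function could be applied to the file contents, for example `read_bpdiff(open("D1-bpcomp.bpdiff").read())`.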
2-Romiguier et al. 2022:
This folder includes five folders, showing results based on four filtered supermatrices of Romiguier et al. (2022) under the site-heterogeneous CAT-GTR+G4 model in PhyloBayes, as well as the simpler LG4X+R, LG+C20, C40 and C60 models in IQ-TREE. Results of the model comparison and posterior predictive model checking analyses using PhyloBayes are also listed.
1-Matrix 1: phylogenomic results based on Matrix 1 (in .phy format) under three evolutionary models. The models used in the analyses are incorporated in the folder names.
- Matrix 1.phy: with 38 taxa and 647114 amino acid sites
1-iqtree-LG4X+R: resultant files under the LG4X+R model
- antc1.phy.contree: consensus tree generated by the IQ-TREE analysis
- antc1.phy.log: log file for the IQ-TREE analysis
2-iqtree-LG+C20: resultant files under the LG+C20 model
- c20-antc1.phy.contree: consensus tree generated by the IQ-TREE analysis
- c20-antc1.phy.log: log file for the IQ-TREE analysis
3-phylobayes-CAT+GTR: resultant files under the CAT-GTR+G model
- m1-bpcomp.bpdiff: bpdiff result using the bpcomp tool in PhyloBayes
- m1-bpcomp.con.tre: consensus tree of the PhyloBayes analysis
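Because the program and model are encoded in each result-folder name, the names can be split programmatically when looping over the archive. The following is a minimal sketch, not part of the deposited data: the pattern is inferred only from the folder names listed in this README (including the extra `ant-wasp` tag carried by some folders), and `parse_folder` is a hypothetical helper.

```python
import re

# Hypothetical pattern inferred from the folder names listed in this README:
# "<order>-<program>-[ant-wasp-]<model label>"
FOLDER = re.compile(r"^(\d+)-(iqtree|phylobayes)-(?:ant-wasp-)?(.+)$")


def parse_folder(name):
    """Split a result-folder name into (order, program, model label)."""
    match = FOLDER.match(name)
    if match is None:
        raise ValueError(f"unrecognised folder name: {name!r}")
    order, program, model = match.groups()
    return int(order), program, model


for name in ["1-iqtree-LG4X+R", "2-iqtree-ant-wasp-LG+C20", "5-phylobayes-CAT-GTR"]:
    print(parse_folder(name))
```

The model label is returned exactly as written in the folder name, so `CAT+GTR` and `CAT-GTR` are not normalised to a single spelling.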
2-Matrix 2: phylogenomic results based on Matrix 2 under three evolutionary models. The models used in the analyses are incorporated in the folder names.
- Matrix 2.phy: with 47 taxa and 623908 amino acid sites
1-iqtree-ant-wasp-LG4X+R: resultant files under the LG4X+R model
- antw6205.phy.contree: consensus tree generated by the IQ-TREE analysis
- antw6205.phy.log: log file for the IQ-TREE analysis
2-iqtree-ant-wasp-LG+C20: resultant files under the LG+C20 model
- antw6205.phy.contree: consensus tree generated by the IQ-TREE analysis
- antw6205.phy.log: log file for the IQ-TREE analysis
3-phylobayes-ant-wasp-CAT+GTR: resultant files under the CAT-GTR+G model
- m2-bpcomp.bpdiff: bpdiff result using the bpcomp tool in PhyloBayes
- m2-bpcomp.con.tre: consensus tree of the PhyloBayes analysis
3-Matrix 3: phylogenomic results based on Matrix 3 under five evolutionary models. The models used in the analyses are incorporated in the folder names.
- Matrix 3.phy: with 47 taxa and 983951 amino acid sites
1-iqtree-LG+F+G4: resultant files under the LG+F+G4 model
- antw15.phy.contree: consensus tree generated by the IQ-TREE analysis
- antw15.phy.log: log file for the IQ-TREE analysis
2-iqtree-LG+C20: resultant files under the LG+C20 model
- antw15.phy.contree: consensus tree generated by the IQ-TREE analysis
- antw15.phy.log: log file for the IQ-TREE analysis
3-iqtree-LG+C40: resultant files under the LG+C40 model
- antw15.phy.contree: consensus tree generated by the IQ-TREE analysis
- antw15.phy.log: log file for the IQ-TREE analysis
4-iqtree-LG+C60: resultant files under the LG+C60 model
- antw15.phy.contree: consensus tree generated by the IQ-TREE analysis
- antw15.phy.log: log file for the IQ-TREE analysis
5-phylobayes-CAT-GTR: resultant files under the CAT-GTR+G model
- m3-bpcomp.bpdiff: bpdiff result using the bpcomp tool in PhyloBayes
- m3-bpcomp.con.tre: consensus tree of the PhyloBayes analysis
4-Matrix 4: phylogenomic results based on Matrix 4 under three evolutionary models. The models used in the analyses are incorporated in the folder names.
- Matrix 4.phy: with 17 taxa and 1692050 amino acid sites
1-iqtree-LG4X+R: resultant files under the LG4X+R model
- ant2.phy.contree: consensus tree generated by the IQ-TREE analysis
- ant2.phy.log: log file for the IQ-TREE analysis
2-iqtree-LG+C20: resultant files under the LG+C20 model
- ant2.phy.contree: consensus tree generated by the IQ-TREE analysis
- ant2.phy.log: log file for the IQ-TREE analysis
3-phylobayes-CAT+GTR: resultant files under the CAT-GTR+G model
- m4-bpcomp.bpdiff: bpdiff result using the bpcomp tool in PhyloBayes
- m4-bpcomp.con.tre: consensus tree of the PhyloBayes analysis
5-Matrix 5: phylogenomic results based on Matrix 5 under four evolutionary models. The models used in the analyses are incorporated in the folder names.
- Matrix 5.phy: with 82 taxa and 21902 amino acid sites
1-iqtree-LG4X+R: resultant files under the LG4X+R model
- ant0203.phy.contree: consensus tree generated by the IQ-TREE analysis
- ant0203.phy.log: log file for the IQ-TREE analysis
2-iqtree-LG+C20: resultant files under the LG+C20 model
- ant0203.phy.contree: consensus tree generated by the IQ-TREE analysis
- ant0203.phy.log: log file for the IQ-TREE analysis
3-iqtree-GHOST: resultant files under the GHOST model
- ant0203.phy.contree: consensus tree generated by the IQ-TREE analysis
- ant0203.phy.log: log file for the IQ-TREE analysis
4-phylobayes-CAT+GTR: resultant files under the CAT-GTR+G model
- m5-bpcomp.bpdiff: bpdiff result using the bpcomp tool in PhyloBayes
- m5-bpcomp.con.tre: consensus tree of the PhyloBayes analysis
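After unpacking the archive, the matrix dimensions stated above can be checked against the first line of each `.phy` file, which in PHYLIP format gives the number of taxa followed by the number of sites. The sketch below is illustrative only: `EXPECTED` simply restates the numbers from this README, and `phylip_dimensions` and `check_matrix` are hypothetical helper names.

```python
from pathlib import Path

# Dimensions as stated in this README: matrix file -> (taxa, amino acid sites).
EXPECTED = {
    "Matrix 1.phy": (38, 647114),
    "Matrix 2.phy": (47, 623908),
    "Matrix 3.phy": (47, 983951),
    "Matrix 4.phy": (17, 1692050),
    "Matrix 5.phy": (82, 21902),
}


def phylip_dimensions(header_line):
    """Parse the first line of a PHYLIP file into (n_taxa, n_sites)."""
    n_taxa, n_sites = header_line.split()[:2]
    return int(n_taxa), int(n_sites)


def check_matrix(path):
    """Return True if the PHYLIP header of `path` matches the README dimensions."""
    path = Path(path)
    with path.open() as handle:
        found = phylip_dimensions(handle.readline())
    return found == EXPECTED[path.name]


# Header line written by hand for illustration, not read from the archive.
print(phylip_dimensions(" 38 647114\n"))
```

Only the header line is read, so the check stays fast even for the largest matrix.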
6-model comparison and posterior predictive model checking: this folder contains two subfolders, detailed as follows; all analyses were based on the comparatively small Matrix 3.
1-model test: two efficient and reliable approaches, leave-one-out cross-validation (LOO-CV) and the widely applicable information criterion (wAIC), were used. This folder contains the input and output files of the model test analyses:
- data.phy: the tested dataset, Matrix 3; this file was also used for the posterior predictive model checking analysis below
- tree.tre: the consensus tree file of the PhyloBayes analyses; this file was also used for the posterior predictive model checking analysis below
- result of model test.txt: the debiased scores under the five tested models: LG+F+G, LG+C20+F+G, LG+C40+F+G, LG+C60+F+G, and CAT-GTR+G
2-Posterior predictive model checking: this folder contains one file resulting from the analyses of Matrix 3:
- Posterior predictive analyses.xlsx: results for the tested models, including LG+F+G, LG+C20+F+G, LG+C40+F+G, LG+C60+F+G, and CAT-GTR+G
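Since `data.phy` and `tree.tre` are used together as inputs, it can be useful to confirm that the tip labels of the tree match the taxon names of the alignment before rerunning anything. The sketch below is a minimal illustration under stated assumptions: labels are unquoted, the alignment is in relaxed PHYLIP layout (name, whitespace, sequence, with all taxa named in the first block), and the helper names `newick_tips` and `phylip_taxa` are hypothetical. The toy tree and alignment are made up for the example.

```python
import re

# Tip labels follow "(" or ","; internal support labels follow ")" and are skipped.
_TIP = re.compile(r"[(,]\s*([^\s(),:;\[\]]+)")


def newick_tips(newick):
    """Return the set of tip labels in a Newick string (unquoted labels assumed)."""
    return set(_TIP.findall(newick))


def phylip_taxa(text):
    """Return taxon names from relaxed PHYLIP text (name, whitespace, sequence)."""
    lines = [line for line in text.splitlines() if line.strip()]
    n_taxa = int(lines[0].split()[0])
    # The first n_taxa non-blank lines after the header each start with a name.
    return {line.split()[0] for line in lines[1 : n_taxa + 1]}


# Toy inputs for illustration only.
tree = "((A:0.1,B:0.2)0.98:0.3,(C:0.1,D:0.1)1:0.2);"
alignment = "4 5\nA MKVLA\nB MKVLG\nC MRVLA\nD MKILA\n"
print(newick_tips(tree) == phylip_taxa(alignment))
```

A dedicated phylogenetics library would handle quoted labels and strict PHYLIP names more robustly; the regular expression here is only meant for a quick sanity check.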