Dryad

Major revisions in pancrustacean phylogeny and evidence of sensitivity to taxon sampling

Cite this dataset

Bernot, James P et al. (2023). Major revisions in pancrustacean phylogeny and evidence of sensitivity to taxon sampling [Dataset]. Dryad. https://doi.org/10.5061/dryad.dr7sqvb2h

Abstract

The clade Pancrustacea, comprising crustaceans and hexapods, is the most diverse group of animals on earth, containing over 80% of animal species and half of animal biomass. It has been the subject of several recent phylogenomic analyses, yet relationships within Pancrustacea show a notable lack of stability. Here, the phylogeny is estimated with expanded taxon sampling, particularly of malacostracans. We show small changes in taxon sampling have large impacts on phylogenetic estimation. By analyzing identical orthologs between two slightly different taxon sets, we show that the differences in the resulting topologies are due primarily to the effects of taxon sampling on the phylogenetic reconstruction method. We compare trees resulting from our phylogenomic analyses with those from the literature to explore the large tree space of pancrustacean phylogenetic hypotheses and find that statistical topology tests reject the previously published trees in favor of the maximum likelihood trees produced here. Our results reject several clades including Caridoida, Eucarida, Multicrustacea, Vericrustacea, and Syncarida. Notably, we find Copepoda nested within Allotriocarida with high support and recover a novel relationship between decapods, euphausiids, and syncarids that we refer to as the Syneucarida. With denser taxon sampling, we find Stomatopoda sister to this latter clade, which we collectively name Stomatocarida, dividing Malacostraca into three clades: Leptostraca, Peracarida, and Stomatocarida. A new Bayesian divergence time estimation is conducted using 13 vetted fossils. We review our results in the context of other pancrustacean phylogenetic hypotheses and highlight 15 key taxa to sample in future studies.

README: Title of Dataset: Data from Major revisions in pancrustacean phylogeny and evidence of sensitivity to taxon sampling


Matrices, treefiles, and alignments from: Major revisions in pancrustacean phylogeny and evidence of sensitivity to taxon sampling

Description of the data and file structure

zip files for:

  • divergence_time_files # chronograms from divergence time analyses
  • matrices # super matrices (full alignments) used in this study
  • orthologs # individual ortholog alignments and gene trees
  • treefiles # newick tree files (filenames refer to the figure numbers in the manuscript)
  • Dataset2_homologs # unaligned fastas before ortholog filters

Detailed Table of Contents

  • divergence_time_files
    • dating_MCMCTree_c1.chronogram # chronogram from chain 1 of MCMCTree divergence time analysis
    • dating_MCMCTree_c2.chronogram # chronogram from chain 2 of MCMCTree divergence time analysis
    • dating_MCMCTree_c3.chronogram # chronogram from chain 3 of MCMCTree divergence time analysis
    • dating_MCMCTree_prior.chronogram # chronogram under the prior of MCMCTree divergence time analysis
    • Phylobayes_LN.chronogram # chronogram of 3 chains using the LN model in Phylobayes
    • Phylobayes_LN_prior.chronogram # chronogram of 3 chains under the prior of the LN model in Phylobayes
    • Phylobayes_CIR.chronogram # chronogram of 3 chains using the CIR model in Phylobayes
    • Phylobayes_CIR_prior.chronogram # chronogram of 3 chains under the prior of the CIR model in Phylobayes
    • Phylobayes_UGAM.chronogram # chronogram of 3 chains using the UGAM model in Phylobayes
    • Phylobayes_UGAM_prior.chronogram # chronogram of 3 chains under the prior of the UGAM model in Phylobayes
  • matrices
    • Dataset1_matrix.phy # Dataset 1 is an earlier version of this study containing 98 taxa and 559 orthologs (details in manuscript text)
    • Dataset1_shared_orthologs_only_alignment.fasta # Matrix of the 98 taxa in Dataset 1 with only the 267 orthologs shared with Dataset 2
    • Dataset2_matrix.phy # Final matrix consisting of 105 taxa (details in manuscript text)
    • Dataset2_50gene_subset_Phylobayes_divergence.phy # Matrix of top 50 genes from Dataset 2 with highest Robinson-Foulds similarity to species tree from ML analysis of Dataset2_matrix.phy
    • Dataset2_matrix_dayhoff_recoded.phy # Dataset 2 matrix under Dayhoff6 recoding
    • Dataset2_minus_fastestgenes_matrix.phy # Dataset 2 with the 10% (i.e., 57) fastest evolving genes removed
    • Dataset2_shared_orthologs_only_alignment.fasta # Matrix of the 105 taxa in Dataset 2 with only the 267 orthologs shared with Dataset 1
    • Dataset2_matrix_trimmed_of_added_taxa.fasta # Taxon sampling experiment. Dataset 2 matrix after removing the additional taxa that were added relative to Dataset 1 (details in manuscript text)
  • orthologs
    • Dataset1_ortholog_alignments_gblocks.zip # individual alignments for the 559 orthologs in Dataset 1
    • Dataset1_ortholog_alignments.tar.gz # individual alignments for the 559 orthologs in Dataset 1 prior to trimming with gblocks
    • Dataset1_ortholog_trees.zip # individual gene trees for the 559 orthologs in Dataset 1
    • Dataset2_ortholog_alignments_gblocks.zip # individual alignments for the 576 orthologs in Dataset 2
    • Dataset2_ortholog_alignments.tar.gz # individual alignments for the 576 orthologs in Dataset 2 prior to trimming with gblocks
    • Dataset2_ortholog_trees.zip # individual gene trees for the 576 orthologs in Dataset 2
    • Datatset3_vs_Dataset1_shared_orthologs.txt # list of orthologs shared between Datasets 2 and 3
  • treefiles
    • Fig2A_Dataset2_C60LG.tre # Tree resulting from LG+C60+F+G analysis of Dataset 2 AA matrix
    • FigS1A_Dataset2_raxml.tre # Tree resulting from RAxML analysis of Dataset 2 AA matrix
    • FigS1B_Dataset2_Dayhoff6_raxml.tre # Tree resulting from RAxML analysis of Dataset 2 Dayhoff6 matrix
    • FigS1C_Dataset2_C60LG_minusfastgenes.tre # Tree resulting from IQ-TREE LG+C60+F+G analysis of Dataset 2 AA matrix with 10% fastest evolving genes removed
    • FigS2A_Dataset2_CATGTR_consensus.tre # Tree resulting from Phylobayes CAT-GTR analysis of Dataset 2 AA matrix - consensus of 3 chains
    • FigS2B_Dataset2_CATGTR_c1.tre # Tree resulting from Phylobayes CAT-GTR analysis of Dataset 2 AA matrix - chain 1
    • FigS2C_Dataset2_CATGTR_c2.tre # Tree resulting from Phylobayes CAT-GTR analysis of Dataset 2 AA matrix - chain 2
    • FigS2D_Dataset2_CATGTR_c3.tre # Tree resulting from Phylobayes CAT-GTR analysis of Dataset 2 AA matrix - chain 3
    • FigS3_Dataset2_Dayhoff6_CATGTR_consensus.tre # Tree resulting from Phylobayes CAT-GTR analysis of Dataset 2 Dayhoff6 matrix - consensus of 2 chains
    • FigS4A_Dataset2_ATRAL.tre # Tree resulting from ASTRAL analysis of Dataset 2 orthologs
    • FigS4B_Dataset2_ASTRAL_BS10.tre # Tree resulting from ASTRAL analysis of Dataset 2 orthologs, with nodes in gene trees with <10% BS support collapsed
    • FigS4C_Dataset2_ASTRAL_BS20.tre # Tree resulting from ASTRAL analysis of Dataset 2 orthologs, with nodes in gene trees with <20% BS support collapsed
    • FigS4D_Dataset2_ASTRAL_BS30.tre # Tree resulting from ASTRAL analysis of Dataset 2 orthologs, with nodes in gene trees with <30% BS support collapsed
    • FigS5_Dataset1_C60LG_speciestree.treefile # Tree resulting from LG+C60+F+G analysis of Dataset 1 AA matrix
    • FigS6_Dataset1_Dayhoff6_raxml.tre # Tree resulting from RAxML analysis of Dataset 1 Dayhoff6 matrix
    • FigS7_Dataset1_CATGTR_consensus.tre # Tree resulting from Phylobayes CAT-GTR analysis of Dataset 1 AA matrix - consensus of 4 chains
    • FigS8A_Dataset1_Dayhoff6_CATGTR_consensus.tre # Tree resulting from Phylobayes CAT-GTR analysis of Dataset 1 Dayhoff6 matrix - consensus of 2 chains
    • FigS8B_Dataset1_Dayhoff6_CATGTR_c1.tre # Tree resulting from Phylobayes CAT-GTR analysis of Dataset 1 Dayhoff6 matrix - chain 1
    • FigS8C_Dataset1_Dayhoff6_CATGTR_c2.tre # Tree resulting from Phylobayes CAT-GTR analysis of Dataset 1 Dayhoff6 matrix - chain 2
    • FigS9A_Dataset1_ASTRAL.tre # Tree resulting from ASTRAL analysis of Dataset 1 orthologs
    • FigS9B_Dataset1_ASTRAL_BS10.tre # Tree resulting from ASTRAL analysis of Dataset 1 orthologs, with nodes in gene trees with <10% BS support collapsed
    • FigS9C_Dataset1_ASTRAL_BS30.tre # Tree resulting from ASTRAL analysis of Dataset 1 orthologs, with nodes in gene trees with <30% BS support collapsed
    • FigS10A_Dataset1_shared_orthologs_C60LG.tre # Tree resulting from IQ-TREE LG+C60+F+G analysis using only shared orthologs - Dataset 1 taxa
    • FigS10B_Dataset2_shared_orthologs_C60LG.tre # Tree resulting from IQ-TREE LG+C60+F+G analysis using only shared orthologs - Dataset 2 taxa
    • FigS11_Dataset2_C60LG_trimmed_taxa.tre # Tree resulting from LG+C60+F+G analysis of the Dataset 2 orthologs trimmed of additional taxa relative to Dataset 1
  • Dataset2_homologs.tar.gz # unaligned fastas before ortholog filters. Results from all vs all BLAST and MCL, clusters with >40 taxa
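
For readers unfamiliar with the recoding behind Dataset2_matrix_dayhoff_recoded.phy, the following is a minimal sketch of Dayhoff6 recoding. It assumes the commonly used six-class grouping of amino acids (AGPST, DENQ, HKR, ILMV, FWY, C). The class symbols 0-5, the function name `dayhoff6_recode`, and the treatment of gaps and ambiguous residues are illustrative assumptions; this README does not state which symbols or tool the authors used.

```python
# Six Dayhoff classes (assumed grouping). The symbols "0"-"5" are arbitrary
# labels chosen for this sketch, not necessarily those in the deposited matrix.
DAYHOFF6_CLASSES = {
    "0": "AGPST",
    "1": "DENQ",
    "2": "HKR",
    "3": "ILMV",
    "4": "FWY",
    "5": "C",
}

# Invert the table: one-letter amino-acid code -> class symbol.
RESIDUE_TO_CLASS = {
    residue: symbol
    for symbol, group in DAYHOFF6_CLASSES.items()
    for residue in group
}


def dayhoff6_recode(sequence: str) -> str:
    """Recode an amino-acid sequence into six-state symbols.

    Gaps, 'X', and any other character outside the 20 standard residues
    are all written as '-' (an assumption made for simplicity).
    """
    return "".join(RESIDUE_TO_CLASS.get(aa, "-") for aa in sequence.upper())


if __name__ == "__main__":
    print(dayhoff6_recode("MKT-LX"))
```

Collapsing residues into a few classes discards some signal but is generally used to reduce the effect of compositional heterogeneity and substitution saturation, which is why recoded and unrecoded matrices are often analyzed side by side.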

Methods

Phylogenetic trees and alignments from analyses in this study.
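
Because Dataset 1 and Dataset 2 differ in taxon sampling (98 versus 105 taxa), it can be useful to list which tips differ between two of the deposited trees. The sketch below is one way to do this with the Python standard library only. It assumes the tree files are plain Newick strings without quoted labels or bracketed comments; the helper names `leaf_names` and `taxa_added` are hypothetical, and a dedicated phylogenetics library would be more robust for real work.

```python
import re

# A leaf (tip) label directly follows "(" or ",". Internal-node labels such
# as support values follow ")" and are therefore not matched.
_LEAF_PATTERN = re.compile(r"[(,]\s*([^():,;\[\]]+)")


def leaf_names(newick: str) -> set[str]:
    """Return the tip labels found in a plain Newick string."""
    labels = (match.group(1).strip() for match in _LEAF_PATTERN.finditer(newick))
    # Drop empty captures that arise from whitespace between delimiters.
    return {label for label in labels if label}


def taxa_added(newick_small: str, newick_large: str) -> set[str]:
    """Return tips present in the larger tree but absent from the smaller one."""
    return leaf_names(newick_large) - leaf_names(newick_small)


if __name__ == "__main__":
    # Toy trees standing in for a Dataset 1 tree and a Dataset 2 tree.
    tree_small = "((A:0.1,B:0.2)90:0.3,C:0.5);"
    tree_large = "((A:0.1,B:0.2)95:0.3,(C:0.1,D:0.4)85:0.2);"
    print(sorted(taxa_added(tree_small, tree_large)))
```

To apply this to the deposited files, read each tree with `pathlib.Path(...).read_text()` and pass the resulting strings to `taxa_added`, for example using FigS5_Dataset1_C60LG_speciestree.treefile and Fig2A_Dataset2_C60LG.tre from the treefiles archive.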

Funding

National Science Foundation, Award: 2010898

National Science Foundation of Sri Lanka, Award: 1856679

NSF Postdoctoral Research Fellowships in Biology, Award: #2010898

NSF DEB, Award: #1856679

NSF PRFB Program, Award: #2010898