Data from: Phylotranscriptomics: saturated third codon positions radically influence the estimation of trees based on next-gen data

Breinholt, Jesse W., Florida Museum of Natural History

Kawahara, Akito Y., Florida Museum of Natural History

Published Oct 24, 2014 on Dryad. https://doi.org/10.5061/dryad.r5cq0

Cite this dataset

Breinholt, Jesse W.; Kawahara, Akito Y. (2014). Data from: Phylotranscriptomics: saturated third codon positions radically influence the estimation of trees based on next-gen data [Dataset]. Dryad. https://doi.org/10.5061/dryad.r5cq0

Abstract

The recent advancement in molecular sequencing techniques has led to a surge in the number of studies that incorporate large amounts of genetic data in phylogenetic studies. We test the assumption that analyzing large amounts of genetic data will lead to improvements in tree resolution and branch support using moths in the superfamily Bombycoidea, a group in which some inter-familial relationships have been difficult to resolve. Specifically, we examine how codon position and saturation might influence resolution and node support among three key families using a next-gen dataset that included 19 taxa and 938 genes (~1.2M bp). Maximum likelihood, parsimony, and species tree analysis using gene-tree parsimony, on numerous different nucleotide and amino acid datasets, resulted in largely congruent topologies with high bootstrap support, compared to prior studies that included fewer loci. However, for a few shallow nodes, nucleotide and amino acid data provided high support for conflicting relationships. The third codon position was saturated, and phylogenetic analysis of this position alone supported a completely different, potentially misleading sister group relationship. We used the program RADICAL to assess the number of genes needed to fix some of these difficult nodes. One such node needed a total of 850 genes, but only 250 when synonymous signal was removed. While transcriptomics can provide the large amounts of data needed to resolve many difficult phylogenetic relationships, the importance of assessing the effect of synonymous substitutions and third codon positions in next-gen datasets still remains.

Usage notes

Breinholt_Kawahara_2013_nuc

Nexus file containing 938 genes for 19 taxa. This is a nucleotide nexus file with a CHARSET that defines each gene; see Taxon_list.txt for the name of each taxon. Gene names correspond to gene numbers in the Insecta HMMER v3-2 core ortholog database. For further information on these genes, see Supplementary Table 2 from Breinholt and Kawahara 2013.
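Because the abstract discusses analyzing third codon positions separately, the following minimal Python sketch shows how an aligned nucleotide sequence could be split by codon position. It assumes the sequence starts at codon position 1 and stays in frame, which this description does not confirm for the nexus file. The function names are hypothetical, and this is an illustration only, not the procedure used by the authors.

```python
def strip_third_positions(seq):
    """Return seq without every third base (keeps codon positions 1 and 2).

    Assumption: seq begins at codon position 1 and is in frame throughout.
    """
    return "".join(base for i, base in enumerate(seq) if i % 3 != 2)


def third_positions(seq):
    """Return only the bases at codon position 3 (same in-frame assumption)."""
    return seq[2::3]


# Hypothetical toy sequence, not taken from the dataset.
toy = "ATGGCCTTA"
print(strip_third_positions(toy))
print(third_positions(toy))
```

Gap characters are treated like any other character here, so the alignment columns keep their relative order.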

Breinholt_Kawahara_2013_aa.nex

Nexus file containing 938 genes for 19 taxa. This is an amino acid nexus file with a CHARSET that defines each gene; see Taxon_list.txt for the name of each taxon. Gene names correspond to gene numbers in the Insecta HMMER v3-2 core ortholog database. For further information on these genes, see Supplementary Table 2 from Breinholt and Kawahara 2013.
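Since each gene in the nexus files is delimited by a CHARSET, a short Python sketch like the one below could recover the per-gene coordinates. It assumes each CHARSET is written on one line as a single contiguous range (for example "CHARSET name = 1-300;"); the actual files may use a different layout. The function name and the example text are hypothetical.

```python
import re

# Matches lines such as "CHARSET gene_A = 1-300;" (case-insensitive).
# Assumption: one contiguous range per CHARSET, one CHARSET per line.
CHARSET_RE = re.compile(
    r"^\s*charset\s+(\S+)\s*=\s*(\d+)\s*-\s*(\d+)\s*;", re.IGNORECASE
)


def parse_charsets(nexus_text):
    """Return {gene_name: (start, end)} using 1-based inclusive coordinates."""
    charsets = {}
    for line in nexus_text.splitlines():
        match = CHARSET_RE.match(line)
        if match:
            name, start, end = match.groups()
            charsets[name] = (int(start), int(end))
    return charsets


# Hypothetical excerpt, not copied from the dataset.
example = """#NEXUS
BEGIN SETS;
    CHARSET gene_A = 1-300;
    CHARSET gene_B = 301-450;
END;
"""
print(parse_charsets(example))
```

The coordinates are 1-based and inclusive, so a gene's columns in a Python string would be `seq[start - 1:end]`.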

Taxon_list.txt

List of taxon codes, names, and data sources for the two nexus files above, in tab-delimited text.
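A tab-delimited list of this kind can be read with Python's standard csv module. The sketch below assumes three columns in the order code, name, source, and no header row; neither is stated in this description. The function name and the sample rows are hypothetical and were assembled from names mentioned on this page.

```python
import csv
import io


def read_taxon_list(handle):
    """Map taxon code -> (name, source) from a tab-delimited file handle.

    Assumption: columns are code, name, source, in that order, no header row.
    """
    taxa = {}
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        code, name, source = fields[0], fields[1], fields[2]
        taxa[code] = (name, source)
    return taxa


# Hypothetical rows; the real file layout and codes may differ.
sample = io.StringIO(
    "acti2\tActias luna\tSRR1002974\n"
    "attac\tAttacus atlas\tSRR1002994\n"
)
taxa = read_taxon_list(sample)
print(taxa["acti2"])
```

For the real file, the same function could be called on `open("Taxon_list.txt", newline="")` after checking the column order.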

acti2_assembly.fasta

Assembly of Actias luna from Genbank SRA accession #SRR1002974, using multiple kmers (13,23,33,43,63) with SOAPdenovo-Trans v1.01. Different Kmer assemblies were combined with cd-hit-est and processed with the fastx toolkit. See SOAP_assembly.qsub for the command used for this assembly.

attac_assembly.fasta

Assembly of Attacus atlas from Genbank SRA accession #SRR1002994, using multiple kmers (13,23,33,43,63) with SOAPdenovo-Trans v1.01. Different Kmer assemblies were combined with cd-hit-est and processed with the fastx toolkit. See SOAP_assembly.qsub for the command used for this assembly.

cundu3_assembly.fasta

Assembly of Ceratomia undulosa from Genbank SRA accession #SRR1002985, using multiple kmers (13,23,33,43,63) with SOAPdenovo-Trans v1.01. Different Kmer assemblies were combined with cd-hit-est and processed with the fastx toolkit. See SOAP_assembly.qsub for the command used for this assembly.

dara_assembly.fasta

Assembly of Darapsa myron from Genbank SRA accession #SRR1002986, using multiple kmers (13,23,33,43,63) with SOAPdenovo-Trans v1.01. Different Kmer assemblies were combined with cd-hit-est and processed with the fastx toolkit. See SOAP_assembly.qsub for the command used for this assembly.

elug1_assembly.fasta

Assembly of Enyo lugubris from Genbank SRA accession #SRR1002983, using multiple kmers (13,23,33,43,63) with SOAPdenovo-Trans v1.01. Different Kmer assemblies were combined with cd-hit-est and processed with the fastx toolkit. See SOAP_assembly.qsub for the command used for this assembly.

hema2_assembly

SOAP_assembly.qsub

This script was used for multiple kmer transcriptome assemblies. The script is specific to the University of Florida module system but can be easily edited for use on other HPC systems.

HaMStR.qsub

This script contains commands used for HaMStR ortholog prediction. It is specific to HPC systems with PBS schedulers and requires the set of files and directories detailed in the HaMStR manual.

README.txt

This file contains descriptions of all the files associated with this Dryad package.